This post is by Lizzie. I thought of calling this post also: Lynx, hares and the utility of many-analyst studies, but I thought that schmeared stuff more than I mean to. I also took the photo from Paris last fall. Why a photo of Paris? Why not.
At an evening discussion event recently I made a passing comment about how I wish ecology knew more assuredly what causes the lynx-hare population dynamics (including their magical Lotka-Volterra cycles). Almost immediately several folks swept in to ask how I could think this when we *do* know. And then they each proceeded to share a different mechanistic (if you will) model, including:
– Maternal effects such that scared bunnies (aka hares) keep hiding out from lynx well after lynx population numbers have plummeted.
– Something to do with willow (the trophic level below the hares).
– Disease.
– Someone mentioned the moon (but then someone emailed me later about sunspots, maybe the moon was the sunspots misremembered?).
– And (I think) more hypotheses that I cannot now recall.
I was not surprised by this. One reason is that I have tried it before and got a similar set of assured and wildly diverging answers. The other is that ecology seems to me an inchoate field where we’re still struggling for general theories and for ways to sort out what is going on. I like to think we’re making progress, but I am rather sure that it is currently slow.
This all brings me vaguely around to a recent paper that a reader pointed Andrew to, and that he then pointed me to: a new many-analyst study by Gould and colleagues (lots of colleagues!), ‘Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology.’
As the title suggests the paper gives the same data and research questions to a set of teams (who signed up to do this, and be co-authors for their work) and then sees how different the ‘answers’ are. I say ‘answers’ because obviously a pain-point in this sort of study is how to decide what to extract from any analysis and deem an answer. The authors tried to think about this in advance, as they pre-registered their study, but I was amazed at how painful I found both the presence of the pre-registration — or, to be more specific: the continual side bars on ‘deviations from pre-registration’ — and sifting out what the authors themselves were trying to tell me. Before I get to the latter though, here’s my favorite deviation:
Some analysts had difficulty implementing our instructions to derive the out-of-sample predictions, and in some cases (especially for the Eucalyptus data), they submitted predictions with implausibly extreme values. We believed these values were incorrect and thus made the conservative decision to exclude out-of-sample predictions where the estimates were > 3 standard deviations from the mean value from the full dataset provided to teams for analysis.
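To make that exclusion rule concrete, here is a minimal sketch of how I read it. The function name and all numbers are made up for illustration, and using the sample standard deviation of the full dataset is my assumption; the paper may have computed the cutoff differently.

```python
import statistics

def exclude_extreme_predictions(predictions, full_data, n_sd=3.0):
    """Split predictions into kept/dropped using a mean +/- n_sd * SD rule.

    Assumption: both the mean and the SD come from the full dataset
    that was handed to the analyst teams.
    """
    center = statistics.mean(full_data)
    spread = statistics.stdev(full_data)  # sample SD (an assumption)
    kept = [p for p in predictions if abs(p - center) <= n_sd * spread]
    dropped = [p for p in predictions if abs(p - center) > n_sd * spread]
    return kept, dropped

# Made-up numbers, purely for illustration
full_data = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 11.0]
predictions = [10.5, 11.8, 250.0, -40.0, 12.2]

kept, dropped = exclude_extreme_predictions(predictions, full_data)
print("kept:", kept)
print("dropped:", dropped)
```

Note that a rule like this only screens for implausible values; it says nothing about whether the remaining predictions are any good.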
I only skimmed the paper, but I think the abstract captures some of what confused me:
For both datasets, we found substantial variation in the variable selection and random effects structures among analyses, as well as in the ratings of the analytical methods by peer reviewers, but we found no strong relationship between any of these and deviation from the meta-analytic mean. In other words, analyses with results that were far from the mean were no more or less likely to have dissimilar variable sets, use random effects in their models, or receive poor peer reviews than those analyses that found results that were close to the mean.
This led me down a brief path of skimming some other many-analyst studies or viewpoints (and the religion one cited within). At the end of the path I realized that these studies are not simply trying to point out a potentially concerning level of variation in the answers obtained from different teams using the same data for the same question, but something more.
Some are suggesting this should be a new way to do science, as if doing this will give us more confidence in the answers, while others (including Gould et al.) were doing something else: trying to find out which types of analyses give ‘better’ answers. The authors don’t clearly say they’re doing this (at least not on a quick skim), but why else would so much of the paper include peer reviews of the analyses or dissections of whether the presence of ‘random effects’ tends to give more similar answers? (I felt there was someone behind this particular analysis who either thinks hierarchical models are ‘better’ or has heard that and thinks otherwise.)
Either one of these aims makes me want much more than any of these studies are currently offering. For the former (‘let’s add this to how we do science’) I wanted more information on how exactly the authors think science advances, effectively if you’re telling me this will improve science I think you first owe me a good model of how science works so I can better assess your claim. In the latter (‘which way is better?’) I obviously wanted simulated data, where we could find out which methods got closer to the truth, because we would actually know the truth.
This got me to musing about what my colleagues the other night would think of the challenge of simulating data for a many-analyst ecology study. I wonder if some would bristle that we don’t know enough to simulate such data, but if that’s true, I think we have a real problem. And the challenge to ask ecologists to simulate data to then give out for a many-analyst study seems to me perhaps a better place to focus our efforts if we want to improve ecology than to ask more people to do many-analyst studies on different datasets given different questions (further, I’d be more interested in a many-analyst study on simulated data).
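A toy version of what a many-analyst study on simulated data could look like, as a rough sketch only: the data-generating process, the effect sizes, and the two "analysts" are all invented here, and a real ecological simulation would need far more structure. The point is just that with simulated data each analysis can be scored against a known truth instead of against the mean of the other analyses.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(y, x):
    """Residuals of y after regressing it on x (with an intercept)."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

def simulate(n, true_effect, rng):
    """Hypothetical data: z drives both the predictor x and the response y."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [0.8 * zi + rng.gauss(0, 1) for zi in z]
    y = [true_effect * xi + 1.5 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, z, y

TRUE_EFFECT = 0.5          # known only because we simulated the data
rng = random.Random(1)
x, z, y = simulate(2000, TRUE_EFFECT, rng)

# Two stand-in "analysts" answering the same question with different models
estimates = {
    "x only": slope(x, y),
    # Adjusting for z by residualizing both variables on it first
    "x adjusted for z": slope(residuals(x, z), residuals(y, z)),
}
errors = {name: abs(est - TRUE_EFFECT) for name, est in estimates.items()}

for name in estimates:
    print(f"{name}: estimate={estimates[name]:.2f}, error={errors[name]:.2f}")
```

Here the analyst who ignores z lands well away from the true effect, which is something you can only say because the truth was set in advance.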
I think the value of many-analyst studies lies elsewhere. First, to show the variation (in which case, we do not need so many of them) and then, perhaps to get better models for specific applications. This is where it occurred to me the cherry blossom competition I run with Jonathan Auerbach and David Kepplinger is a many-analyst study of sorts, but in a very different spirit. We’re looking for more predictive models! We’re not disturbed by the variation in the way that I think I was supposed to be by some of the many-analyst studies I read (skimmed).
Indeed the most interesting part to me of Gould et al. was the discussion where issues were raised about whether the research questions given to the teams were too vague or whether readers should even be surprised by this variation. They write:
We recognize that some researchers have long maintained a healthy level of skepticism of individual studies as part of sound and practical scientific practice, and it is possible that those researchers will be neither surprised nor concerned by our results. However, we doubt that many researchers are sufficiently aware of the potential problems of analytical flexibility to be appropriately skeptical. We hope that our work leads to conversations in ecology, evolutionary biology, and other disciplines about how best to contend with heterogeneity in results that is attributable to analytical decisions.
I see their point, but then I wonder about how well they added to the conversation when I have no idea why they did so many tests or what their question(s) exactly was. I also think any such conversation should be framed with a solid grounding in both how science works (who knows, but there are theories and ideas, none of which were really mentioned) and how statistics work. Both of these areas should give all scientists a good dose of skepticism, so do we really need many-analyst studies for that? I would hope they’re offering something more.
Three side notes:
1) This work reminds me of the debate over whether the bird Parus major was declining due to mistiming with its caterpillar food resource with anthropogenic warming (the peak of caterpillar abundance was advancing faster than bird reproductive timing; these birds use caterpillars to feed nestlings). The Dutch team said it was happening in their birds, the British team said it was not happening in their woods, but when I worked on them together in a hierarchical model they looked the same to me (see interactions 221 for the Dutch and 180 for the UK in Fig 1B here). And then this paper came out after the Dutch team had more data.
2) Another big take-home to me from the discussion of Gould et al. was a search for a simple answer of how to fix this mess, as opposed to acknowledging there’s no one or easy thing that would fix this, as often mentioned on this blog.
3) I was also sort of disturbed how these papers seemed to think of model averaging and model comparison, but I will save that for another post.

Quote from above: “I wanted more information on how exactly the authors think science advances, effectively if you’re telling me this will improve science I think you first owe me a good model of how science works so I can better assess your claim.”
This reminded me of when I still read some things from certain people who seem to use words like “collaboration” a lot in discussions and the papers they write. It seemed to me that there was a very narrow view of “collaboration” present in which large groups of researchers work on the same experiment, or perhaps data set when analyzing, and that that was somehow better for science because of “the incentives” that made everyone non-collaborative or whatever.
To contrast this to make my point, I would say that when individual scientists write a paper and share this with other scientists so they can make use of it this can be considered to be very “collaborative”. And I would say that when many people “collaborate” together in such a way that it makes it unnecessarily harder for others that don’t participate in that particular “collaboration” to receive funding or get their papers published or get attention for their findings, this can be considered to not be very “collaborative”.
Furthermore, I also think it might be important to be mindful of how these types of projects may facilitate certain forms of manipulation or sub-optimal scientific processes. Take the same data set, many analysts type projects for example. That project received lots of money, and in turn can give some money to participants. This alone kind of nudges people to join such a project, which in itself may already be something worth considering from a “what even is science” and “how should science work”- perspective.
But more than that, certain topics might be given more attention than others. These people who analyze the data might go on and subsequently spend more time and attention on the specific topic or analyses that were presented to them in the large scale project they signed up for. Or, these types of projects might pave the way for not only analyzing the same data by many analysts, but gathering data by many researchers and then subsequently analyzing that “collaboratively” gathered data. Sure, that may sound good (lots of participants, even pre-registered, etc.) but from a “how should science work” perspective this might all be highly problematic (e.g. potentially way too much power for a small group of people, such large projects nearly impossible for others to replicate in a similar manner, etc.).
I am hereby pre-registering the hypothesis that there will be people from certain parts of social science that talk a lot about “transparency” and “collaboration” that will organize some sort of election of study-proposals that the “collaborative” group will then vote on which will then be “collaboratively” performed by this large group. Of course, this is all to use “the wisdom of the crowd” (or whatever that nonsense is called) and to be more “collaborative” and most of all to “change the incentives”. Because that’s just what needs to happen for some reason. Just “change the incentives”…
Quote from above: “Or, these types of projects might pave the way for not only analyzing the same data by many analysts, but gathering data by many researchers and then subsequently analyzing that “collaboratively” gathered data.”
Wait, wasn’t there some sort of this “wisdom of the crowd” stuff concerning prediction markets and estimating which studies will replicate? I am pretty sure there was something like that. That might be another example of why this type of large scale “collaborative” stuff might not be very good for science and/or might be done without seemingly much pondering concerning “how should science work” and “what even is science”.
If you are going to sort of nudge researchers to replicate certain findings and not others based on something like crowd estimation and prediction markets, you are possibly totally messing with certain processes in science. Perhaps it is much more in line with science to let scientists decide themselves whether they want to replicate some study, regardless of a possible associated prediction market score (or however that works) or some “replication formula” nonsense that may or may not still be considered worthy of being developed.
Perhaps just like hares and lynxes, science should be kind of left alone as much as possible and let natural processes take place…
Anon:
There is no pure process of science nor does it make sense in some abstract way to say “science should be kind of left alone.” Science is done for purposes and it is paid for. There is no vacuum in the procedures of science any more than there is a vacuum in political affairs or family life or any social setting. And, for that matter, Lizzie has as much right to propose certain research areas as you have to write your comment (and as I have to write my reply).
When I used “left alone” I was thinking about “not interacting with or bothering someone, or interfering with something”. I used it to try and further make my point about the possible dangers of interfering with certain processes in science, for instance by way of possibly nudging scientists to replicate certain findings and not others (via the whole prediction market stuff I mentioned).
I did not mean to imply that scientists can’t observe certain things or think about certain things or make models concerning certain things like hare-lynx population dynamics. I did mean that it might not be a good idea to more directly interfere with such things by for instance introducing more hares when the lynxes get hungry. Just like I think it might not be a good idea to interfere with science by way of prediction markets and large scale collaborative efforts in which a small group of people may have way too much influence, which is what I was largely trying to make clear.
The interesting thing about lynx hare dynamics to me is that you can get it from all kinds of agent based models with slightly different rules. In other words, it’s kind of an attractor type behavior. So it would make sense that multiple people have a multitude of different “rules” they believe are involved. You’re gonna get the same qualitative dynamics from a wide variety of rules so people are gonna come up with a wide variety of rules to explain the qualitative dynamics.
Think you are looking for this: https://en.wikipedia.org/wiki/Experimentum_crucis
The original metrics used to judge science were replicability and ability to perform impressive/surprising feats like non-trivial predictions about the future. Followed by plausibility of premises and logical soundness.
The new science is the “random effects” stuff. It is pretty much incompatible with the former, as they found out.
What does this sentence mean?
“bird Parus major was declining due to misting with its caterpillar food resource with anthropogenic warming.”
It sounds full of meaning. I tried AI but that did not help much. I suppose it means a bird needs a food source and a stable temperature/climate over the years.
Lizzie’s a phenologist. That’ll help with the Googling. She’s written about the link between the ecology of caterpillars, tree-budding (one of her specialities, hence the cherry blossom contest), and bird migration phenomena before.
Lizzie explains it just below this. However, poetically at least, “misting” sounds as if it has a technical meaning which I ought to have known. Is Google clever enough in general when there is a misspelling to offer a correct candidate? In this instance, there are two letters missing so I assume that makes things harder. Not being a bird watcher makes it all the more difficult.
Sorry for the typo — it should have been mistiming. And I added some extra explanation. But the main point here to me is that I am not sure there was ever great evidence that the Dutch birds were ever really mistimed.
Lizzie says, “I made a passing comment about how I wish ecology knew more assuredly what causes the lynx-hare population dynamics (including their magical Lotka-Volterra cycles).”
The Lotka-Volterra model’s very well understood. The cycles of populations resulting from the model’s assumptions are no more mysterious than the orbits of planets in Newton’s model of gravity. They’re both just the results of solving simple dynamics models (i.e., differential equations). Now things get chaotic when you try to scale up to larger numbers of species in the sense that the answer you get over time becomes ridiculously sensitive to inputs.
You can go back to Lotka and Volterra’s original papers for their conception of the model. Here’s a nice overview with a historical perspective from a classically named journal:
Mira-Cristiana Anisiu: Lotka, Volterra and their model. Didactica Mathematica.
Anisiu says,
There are only the four parameters and they all have natural interpretations. And they lead to exactly the dynamics you’re talking about. I wrote a Stan case study here: Predator-prey population dynamics: the Lotka-Volterra model in Stan.
I’m curious what question was posed to people where they came up with all those whacky explanations like sunspots or disease? None of that’s modeled in the basic system of predator-prey equations. You can add covariates to a model like this and try to model things like disease. You could also model food supply for prey—as is, the prey’s just assumed to grow exponentially if not preyed upon. A more realistic model would also have some limit to growth of prey if not preyed upon.
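For readers who want to see the dynamics rather than take them on faith, here is a minimal sketch in plain Python. The names alpha, beta, gamma, delta follow a common convention for the four parameters; the values, the starting populations, the optional carrying capacity K, and the hand-rolled Runge-Kutta integrator are all my own choices for illustration and are not taken from the case study.

```python
import math

def lv_rhs(u, v, alpha, beta, gamma, delta, K=None):
    """Right-hand side of the predator-prey equations (u = prey, v = predator).

      du/dt = alpha*u - beta*u*v      prey growth minus predation
      dv/dt = -gamma*v + delta*u*v    predator death plus growth from prey

    If K is given, prey growth is logistic instead: alpha*u*(1 - u/K).
    """
    growth = alpha * u if K is None else alpha * u * (1.0 - u / K)
    return growth - beta * u * v, -gamma * v + delta * u * v

def integrate(u, v, dt, steps, **params):
    """Fixed-step fourth-order Runge-Kutta; returns the list of (u, v) states."""
    path = [(u, v)]
    for _ in range(steps):
        k1u, k1v = lv_rhs(u, v, **params)
        k2u, k2v = lv_rhs(u + 0.5 * dt * k1u, v + 0.5 * dt * k1v, **params)
        k3u, k3v = lv_rhs(u + 0.5 * dt * k2u, v + 0.5 * dt * k2v, **params)
        k4u, k4v = lv_rhs(u + dt * k3u, v + dt * k3v, **params)
        u += dt * (k1u + 2 * k2u + 2 * k3u + k4u) / 6.0
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
        path.append((u, v))
    return path

def conserved(u, v, alpha, beta, gamma, delta):
    """Quantity that stays constant along orbits of the basic model (no K)."""
    return delta * u - gamma * math.log(u) + beta * v - alpha * math.log(v)

# Made-up parameter values and starting populations
params = dict(alpha=1.0, beta=0.1, gamma=1.5, delta=0.075)

cycle = integrate(10.0, 5.0, dt=0.01, steps=5000, **params)            # basic model
damped = integrate(10.0, 5.0, dt=0.01, steps=10000, K=100.0, **params)  # prey limit

print("basic model, final state:", cycle[-1])
print("with carrying capacity, final state:", damped[-1])
```

In the basic model the populations cycle forever around the point (gamma/delta, alpha/beta). Once prey growth is limited by K, the same starting point spirals in toward a steady state instead, which is one simple way the "oscillations should persist" prediction can break.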
I love that case study! And I agree the equations and their results are straightforward, but the general idea is often that these oscillations should persist, and often they don’t, or they change in ways not predicted by the model. There’s also a debate about what drives what. Here are a few quick examples with the caveat that I have not read these papers beyond the abstracts (or, if I have, not recently), but they capture what I have learned over time:
– Trapping patterns explain shifts in size of oscillations (Bifurcations and chaos in ecology: lynx returns revisited: https://onlinelibrary.wiley.com/doi/abs/10.1046/j.1461-0248.2000.00128.x)
– ‘Recent’ declines explained by climate change (Linking climate change to population cycles of hares and lynx: https://onlinelibrary.wiley.com/doi/full/10.1111/gcb.12321)
– Maternal effects explain this pattern as well (https://royalsocietypublishing.org/doi/full/10.1098/rstb.2008.0292)
– Patchy environments also required (https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/2937249 and see https://www.sciencedirect.com/science/article/pii/S0022519314004135)
So, basically, the simple question I asked invoked, I assume, a couple of different related debates for people.
Bob, it’s one thing to say that the LV equations have a certain behavior, it’s another thing to say that this behavior has some underlying mechanistic cause.
For example we can talk about the Navier Stokes equations for fluid flow, but to explain why they work we need to discuss things like intermolecular forces and diffusion and whatever.
Differential equations are almost always about aggregating the behavior of individual elements of a system and taking some averages or similar.
The hares lynx issue is that the Lotka-Volterra predator-prey equations do not fit the data (on annual numbers of hare and lynx pelts), at least not with lynx as predators and hares as prey.
Gilpin, M.E., 1973. Do hares eat lynx?. The American Naturalist, 107(957), pp.727-730.
Hare population peaks lag lynx peaks, while Lotka-Volterra dynamics have prey population cycles leading predator cycles.
We understand a great deal about Lotka-Volterra equations for predator-prey and for competition; they are great conceptual models. But they are linear, and they have single species dynamics of exponential growth (the limit to growth in your final sentence). While allowing nonlinearity helps L-V competition models fit competition data (Gilpin, M.E. and Justice, K.E., 1972. Reinterpretation of the invalidation of the principle of competitive exclusion. Nature, 236(5345), pp.273-274), it doesn’t work well for predator-prey dynamics.
So: the question that generated all those whacky explanations was what other L-V predators (disease? trappers? market prices for pelts?) or prey (“forbs”? abundance, or nutritional or defensive compounds) could explain the actual data.
What bugs me about these many-analyst studies is how they present what should absolutely not be news as some sort of shocking flaw with science. Of course different groups of scientists will reach different conclusions. There is no single best model independent of domain knowledge. But this doesn’t mean everything is inherently subjective and all the analyses must be equal. Some models will be more obviously misspecified than others. When you go in with an underspecified question, or look at conclusions drawn by researchers without a lot of domain knowledge in the area, then you should expect this kind of thing. Presenting it as big news implies it’s somehow reasonable to go around believing that data alone perfectly determines what we should believe.
I agree and think this applies to some of the complaints raised (as often stated by Anoneuoid) about interpreting regression coefficients. With domain knowledge, and a healthy dose of humility and testing different specifications, I think interpreting a regression coefficient can be informative. There are certainly myriad abuses of such interpretations, but equating these abuses with a statement that such interpretation is always an abuse, seems unwarranted to me.
+1
Yes, the level of shock these papers seem to want is concerning (and your comment, I assume, is similar to a comment from a peer review of the Gould et al. paper given the quote from the Gould et al. discussion I included). If scientists are shocked that vague question + data = different ‘answers’ then we’re not training people well.
From the discussion section of Gould et al.: “However, we doubt that many researchers are sufficiently aware of the potential problems of analytical flexibility to be appropriately skeptical.”
Perhaps vague questions + data = different “answers” is part of some larger (future) study which the authors have already (unconsciously) started. They could maybe write the following in the discussion section of the possible next paper concerning this all:
“However, we doubt that many researchers are sufficiently aware of the potential problems of flexibility in designing studies and how this relates to the subsequent conclusions to be appropriately skeptical.”
Sigh. From my current perspective, we’re not training ecologists well, especially when it comes to analyses for anything other than very simple manipulative experiments. Both higher leaders and monitoring protocol leads in my land management agency think that a vague question (“is there a temporal trend?”) and a dataset with minimal documentation can be outsourced to someone to produce reliable management-relevant information, despite complex survey designs and revisit designs, imperfect detection or censored observations, let alone shifts in the protocol for how the data are collected.
Because the almost uniform distributions of effect size estimates in the Gould et al. paper are a huge red flag for that approach, I dug down into their paper. Yes, different interpretations of the vague question are one component. But the overconfidence of some self-anointed ecological analysts also contributed: T tests for these data are an extreme of model misspecification. Worse: the only documentation provided was a dictionary defining each variable. Nothing on the sampling design or structure of dependence among observations, nothing on causes of missing values, or why each variable was measured (or generated from climate data). Nothing on the QA/QC preprocessing of the data flagging and removing some values. All of those produce patterns in the data, and need to be understood in order to produce valid estimates of the effect of interest.
My disappointment (not shock) is that so many ecologists (my field) think vague question + data is sufficient if given to a subject matter expert, that documentation of all of those other aspects isn’t essential.
The additional aspect of our program is that we continue to monitor, so we are collecting the data to mark to market the current trend estimates. Bad estimates not only misinform resource management decisions, they will erode the credibility of our program, and possibly of the idea of obtaining scientific information to guide management decisions.
There is an alarming number of “scientists” who believe that once something is published, that means (by the magic of peer review) it becomes “Science”. The view that “of course different groups of scientists will reach different conclusions” doesn’t really fit in this simplistic model of science. There really is a widespread belief that peer review serves the role of vetting the theory, model, and data and that we should just trust that process to ensure a “right” model was used and that anyone else would have reached the same conclusion. Yet many, many, many papers make it through peer review on the basis of significance chasing poorly specified questions with little appreciation for just how flexible that process is. This is, quite unfortunately, how science has been done for decades. One function these many-analyst studies are performing is making clear just how naive and misguided this worldview is. I agree this should absolutely not be news… and yet…