Benjamin Kircup writes:
I think you will be very interested to see this preprint that is making the rounds: Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology (ecoevorxiv.org)
I see several ties to social science, including the study of how data interpretation varies across scientists studying complex systems; but also the sociology of science. This is a pretty deep introspection for a field; and possibly damning. The garden of forking paths is wide. They cite you first, which is perhaps a good sign.
Ecologists frequently pride themselves on data analysis and interpretation skills. If there weren’t any variability, what skill would there be? It would all be mechanistic, rote, unimaginative, uninteresting. In general, actually, that’s the perception many have of typical biostatistics. It leaves insights on the table by being terribly rote and using the most conservative kinds of analytic tools (yet another t-test, etc). The price of this is that different people will reach different conclusions with the same data – and that’s not typically discussed, but raises questions about the literature as a whole.
One point: apparently the peer reviews didn’t systematically reward finding large effect sizes. That’s perhaps counterintuitive and suggests that the community isn’t rewarding bias, at least in that dimension. It would be interesting to see what you would do with the data.
The first thing I noticed is that the paper has about a thousand authors! This sort of collaborative paper kind of breaks the whole scientific-authorship system.
I have two more serious thoughts:
1. Kircup makes a really interesting point, that analysts “pride themselves on data analysis and interpretation skills. If there weren’t any variability, what skill would there be?”, but then it’s considered a bad or surprising thing that, if you give the same data to different analysts, they come to different conclusions. There really does seem to be a fundamental paradox here. On one hand, different analysts do different things (Pete Palmer and Bill James have different styles, and you wouldn’t expect them to come to the same conclusions); on the other hand, we expect strong results to appear no matter who is analyzing the data.
A partial resolution to this paradox is that much of the skill of data analysis and interpretation comes in what questions to ask. In these replication projects (I think Bob Carpenter calls them “bake-offs”), several different teams are given the same question and the same data and then each do their separate analysis. David Rothschild and I did one of these; it was called We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results, and we were the only analysts of that Florida poll from 2016 that estimated Trump to be in the lead. Usually, though, data and questions are not fixed, despite what it might look like when you read the published paper. Still, there’s something intriguing about what we might call the Analyst’s Paradox.
2. Regarding his final bit (“apparently the peer reviews didn’t systematically reward finding large effect sizes”), I think Kircup is missing the point. Peer reviews don’t systematically reward finding large effect sizes. What they systematically reward is finding “statistically significant” effects, i.e. those that are at least two standard errors from zero. But by restricting yourself to those, you automatically overestimate effect sizes, as I discussed at interminable length in papers such as Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors and The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. So they are rewarding bias, just indirectly.
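Here is a minimal simulation sketch of that indirect bias. The true effect (0.1) and standard error (0.5) are made-up numbers chosen only for illustration, and the function name is hypothetical. The code draws many unbiased estimates, keeps only those at least two standard errors from zero, and compares what survives the filter to the truth.

```python
import random
import statistics


def simulate_significance_filter(true_effect=0.1, se=0.5, n_sims=100_000, seed=0):
    """Draw unbiased estimates, then keep only those at least 2 SEs from zero.

    true_effect and se are illustrative assumptions, not values from any study.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > 2 * se]

    return {
        # Average over all estimates: should sit close to the true effect.
        "mean_of_all_estimates": statistics.mean(estimates),
        # Fraction of estimates that pass the significance filter.
        "share_significant": len(significant) / n_sims,
        # Type M: average magnitude of the survivors relative to the truth.
        "exaggeration_ratio": statistics.mean(abs(e) for e in significant) / abs(true_effect),
        # Type S: fraction of the survivors with the wrong sign.
        "wrong_sign_share": sum(e * true_effect < 0 for e in significant) / len(significant),
    }


result = simulate_significance_filter()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these assumed numbers, every estimate that passes the filter is larger than 1.0 in absolute value while the true effect is 0.1, so the exaggeration ratio is necessarily above 10, even though the full set of estimates is unbiased.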
Ya know, people just want to stay employed and be important without being bothered by the inconvenience of truth.
That’s not scientists, that’s just people. This is never going away.
This has got to be a typo:
“David Rothschild and I did one of these myself;”
Fixed; thanks!
Concerning the point about different conclusions (‘the analyst’s paradox’):
I believe that good data analysis should lead to roughly similar results when asking the same question. Usually the question or the quantity under investigation can be neatly written in mathematical notation (e.g. E(X|Y), q(X|Y) with q as the quantile function, or something else). However, using phrases instead of mathematical notation to ask a question leads initially to a different sub-question: What quantity do we need to estimate in order to answer the question? Am I looking for E(Y|X), or is it E(Y|X,Z), or even E(X|Y)? Answering this question lies at the beginning of the garden of forking paths, and if different researchers take different paths already then, they will often end up with wildly different results. I dislike reading research papers that do not address this question for that reason. Fortunately, the research question is often addressed in the form of equations for linear models [e.g. Y=X+X^2+Z+epsilon already suffices, since I know we estimate E(Y|X,Z)].
Once the research quantity has been defined, the estimates should not vary so much between researchers. But sometimes they do, especially if the research question is complicated. But when this happens, it is my wish and expectation that each of the research teams recognises the existence of the variability of their estimates and acknowledges the impact of forking paths on the result. However, unless I find a djinni in a lamp, I doubt that this wish will be heard. But at least it is a standard to which I can hold my own work.
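To make the first fork concrete, here is a small sketch in which two analysts use the same simulated data but target different quantities, one the slope in E(Y|X) and the other the slope on X in E(Y|X,Z). The data-generating model and all coefficients are assumptions invented for this illustration, and it uses a purely linear model rather than the X^2 example above.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


rng = random.Random(1)
n = 20_000

# Assumed generative model: Z drives both X and Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [0.8 * zi + rng.gauss(0, 1) for zi in z]
y = [1.0 * xi - 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# Analyst A targets E(Y|X): simple regression slope of Y on X.
slope_x_only = cov(x, y) / cov(x, x)

# Analyst B targets E(Y|X,Z): coefficient on X in the two-predictor
# least-squares fit, written out in terms of covariances.
denom = cov(x, x) * cov(z, z) - cov(x, z) ** 2
slope_x_given_z = (cov(z, z) * cov(x, y) - cov(x, z) * cov(z, y)) / denom

print(f"slope of X in E(Y|X):   {slope_x_only:.3f}")
print(f"slope of X in E(Y|X,Z): {slope_x_given_z:.3f}")
```

Under these assumptions the first slope should land near zero and the second near 1. Neither analyst has made a computational mistake; they have answered different questions.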
Jeez, people were studying the clutch size issue in the 1970s, when I was a grad student. If it is still a live question, probably it doesn’t have clear answers. Anyway, I think the authors got it right in their conclusion: “Overall, our results suggest to us that, where there is a diverse set of plausible analysis options, no single analysis should be considered a complete or reliable answer to a research question.”
Totally agree with you here, although I would add that, with 20-plus years of the evidence-based ecology movement under our belts, not to mention what has happened outside of ecology, you have to wonder why people would expect this to be true anyway, for the types of questions that we typically ask. It’s a useful paper for getting more people thinking more, but it’s also hard not to have a “no sh*t Sherlock” response to this exercise…
Gould et al. is a study of researcher degrees of freedom in ecology. They just picked a couple of unpublished datasets that happened to be suitable for their purposes, one of which happened to be about clutch size in birds. The goal of the study wasn’t to examine clutch size in birds (or regeneration of Eucalyptus, which was the other dataset). Nor was their goal to identify some novel unstudied ecological question and then have many different analysts answer it. They were studying ecologists, not ecology. Given their goals, their choices of dataset seem fine to me. I mean, obviously you might get different results if you chose different datasets. But that would be true no matter what criteria they used to choose their datasets (“Does this dataset address a novel unstudied question?” or whatever).
I don’t see any paradox. Because of the forking paths, we expect that giving different analysts the same data should result in a range of results. There is an underlying population of possible results (sampling distribution or prior distribution). The value of an exercise like this – if it involves enough teams – is to estimate the sampling variance. In some problems, the answers may be clustered near the average so the variance in methods, parameters, assumptions doesn’t matter while in other problems, we can learn the x% interval of answers.
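As a rough sketch of that summary step: given one effect estimate per team, compute the spread across teams and an empirical central interval. The team estimates below are simulated from an arbitrary normal distribution purely as a stand-in; they are not results from the preprint, and `central_interval` is a hypothetical helper.

```python
import random
import statistics

# Stand-in data: one effect estimate per analysis team, simulated here
# from an arbitrary distribution (NOT the estimates from the preprint).
rng = random.Random(42)
n_teams = 200
team_estimates = [rng.gauss(-0.3, 0.2) for _ in range(n_teams)]


def central_interval(values, coverage=0.9):
    """Empirical central interval read off the sorted estimates."""
    ordered = sorted(values)
    lo_idx = int((1 - coverage) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + coverage) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


mean_effect = statistics.mean(team_estimates)
between_team_sd = statistics.stdev(team_estimates)
low, high = central_interval(team_estimates, coverage=0.9)

print(f"mean effect across teams: {mean_effect:.3f}")
print(f"between-team SD:          {between_team_sd:.3f}")
print(f"90% interval of answers:  ({low:.3f}, {high:.3f})")
```

If the interval is narrow, the analytic choices do not matter much for that question; if it is wide or straddles zero, the choice of analysis is doing a lot of the work.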
The way they summarized the results of the two studies doesn’t raise any red flags for me:
“For the blue tit analyses, the average effect was convincingly negative, with less growth for nestlings living with more siblings but there was near continuous variation in effect size from large negative effects to effects near zero, and most effects ranged from weakly negative to weakly positive, with about a third of effects crossing the traditional threshold of significance in one direction or the other.”
and
“the average relationship between grass cover and Eucalyptus seedling number was only slightly negative and not convincingly different from zero, and most effects ranged from weakly negative to weakly positive, with about a third of effects crossing the traditional threshold of significance in one direction or the other.”
What’s wrong is the expectation that a single study should generate a single estimate that can be treated as gospel.
There is a whole lesson plan in the Bayes BATS https://www.stat.uci.edu/bayes-bats/ project that makes the point that you get opposite results depending on whether you do a frequentist or a Bayesian analysis. It encourages students to make arguments for the validity of each approach, but more importantly it teaches them this exact lesson. If you are dealing with the real world of ecological or social data, the whole point of there being many approaches is that you might get different results. That’s why your reviewers say things like you should use a negative binomial, or this should be accelerated failure, not proportional hazards. Or maybe you need to consider selection bias or your post-sampling stratification. Some results can handle this; some are not as stable. The issue is that people take results at face value and as definitive.
Andrew’s example that he participated in had something you could argue was the true behavior of the system to compare all the results to. I only skimmed the archive paper, but it doesn’t have that and it could have.
The study designers could have just as easily (or more easily) simulated the data they sent out: for example, build data on bird clutches from a generative model based on life history optimality theory, then add some plot/year/etc. effects on top of it (perhaps an ENSO event or such) and send that around. Then they could have compared the analysts’ outcomes to the true underlying behavior of the system. I thought getting at the underlying ‘true’ behaviors of ecological systems is what we’re aiming to do in ecology, but how we go about our research, including how this study was set up, makes me wonder.
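A toy version of that design might look like the sketch below. Everything in it is an assumption for illustration: the "true" slope of growth on brood size, the plot-quality effect, and the two analysis choices (pooling all nestlings versus comparing only within plots). It is not based on life history theory or on the actual blue tit data.

```python
import random

TRUE_SLOPE = -0.5  # assumed true effect of brood size on growth (made up)


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_nestlings(n_plots=100, n_per_plot=20, seed=7):
    """Generate (plot, brood_size, growth) rows from a known model.

    Better plots get both larger broods and faster growth, so plot
    quality confounds the pooled relationship.
    """
    rng = random.Random(seed)
    rows = []
    for plot in range(n_plots):
        quality = rng.gauss(0, 1)
        for _ in range(n_per_plot):
            brood = 8 + quality + rng.gauss(0, 1)
            growth = 10 + TRUE_SLOPE * brood + 1.5 * quality + rng.gauss(0, 1)
            rows.append((plot, brood, growth))
    return rows


def pooled_estimate(rows):
    """Analyst 1: ignore plots and regress growth on brood size."""
    return ols_slope([r[1] for r in rows], [r[2] for r in rows])


def within_plot_estimate(rows):
    """Analyst 2: subtract plot means first, then regress."""
    by_plot = {}
    for plot, brood, growth in rows:
        by_plot.setdefault(plot, []).append((brood, growth))
    xs, ys = [], []
    for obs in by_plot.values():
        mean_brood = sum(b for b, _ in obs) / len(obs)
        mean_growth = sum(g for _, g in obs) / len(obs)
        xs.extend(b - mean_brood for b, _ in obs)
        ys.extend(g - mean_growth for _, g in obs)
    return ols_slope(xs, ys)


rows = simulate_nestlings()
pooled = pooled_estimate(rows)
within = within_plot_estimate(rows)
print(f"true slope:         {TRUE_SLOPE:.3f}")
print(f"pooled analysis:    {pooled:.3f}")
print(f"within-plot analysis: {within:.3f}")
```

Because the truth is known by construction, each analysis can be scored against it directly. In this setup the within-plot analysis should recover the assumed slope, while the pooled analysis should be pulled well away from it by the plot-quality confounding.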
I think this might be even more of an issue in industry, where data science teams from a small set of the same tech companies promote some new (at least in industry) approach. Almost always these are more complex statistical approaches, but they are marketed as being ‘better’ than whatever current standard, less complex approach is in use, and as such should be the new default, regardless of context.
I have often wondered how confounded their analyses are with both where they are performed and by whom. It is hard to know how well these approaches will translate to companies that don’t have world-class teams, with all of the associated engineering infrastructure and resources.
Surely a new, advanced surgical method will not only have greater efficacy in a well-resourced hospital with a world-class surgeon, but might also lead to much worse outcomes than a simpler treatment or intervention in a less-resourced context.
In environments with fewer resources, different problem classes, no access to PhDs, etc., I would think that statistical approaches that are more robust to context, even if they have a lower upper bound on efficacy, might often be the better solution.