Three unblinded mice

Howard Wainer points us to a recent news article by Jennifer Couzin-Frankel, who writes about the selection bias arising from the routine use of outcome criteria to exclude animals in medical trials. In statistics and econometrics, this is drilled into us: Selection on x is OK, selection on y is not OK. But apparently in biomedical research this principle is not so well known (or, perhaps, it is all too well known).
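To make the "selection on x is OK, selection on y is not OK" rule concrete, here is a minimal simulation sketch. It is an illustration only, not something from the news article: the linear model, the true slope of 1, the unit noise, and the cutoffs at zero are all assumptions chosen for simplicity, and `ols_slope` and `selection_demo` are hypothetical helper names.

```python
import random


def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def selection_demo(n=20000, true_slope=1.0, seed=1):
    """Compare slope estimates under no selection, selection on x, and selection on y."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = true_slope * x + rng.gauss(0, 1)  # assumed model: y = slope * x + noise
        data.append((x, y))
    on_x = [(x, y) for x, y in data if x > 0]  # keep cases based on the predictor
    on_y = [(x, y) for x, y in data if y > 0]  # keep cases based on the outcome
    return {
        "all": ols_slope(data),
        "selected_on_x": ols_slope(on_x),
        "selected_on_y": ols_slope(on_y),
    }


if __name__ == "__main__":
    for label, slope in selection_demo().items():
        print(f"{label}: estimated slope = {slope:.2f}")
```

Under these assumptions, the estimate from the x-selected subset should stay close to the true slope, while the estimate from the y-selected subset should be pulled noticeably toward zero.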

Couzin-Frankel starts with an example of a drug trial in which 3 of the 10 mice in the treatment group were removed from the analysis because they had died from massive strokes. This sounds pretty bad, but it’s even worse than that: this was from a paper under review that “described how a new drug protected a rodent’s brain after a stroke.” Death isn’t a very good way to protect a rodent’s brain!
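A rough sketch of how much damage this kind of exclusion can do, under loudly stated assumptions: the drug has no true effect at all, each mouse gets a made-up brain-damage score (mean 50, standard deviation 10), and the three treated mice with the worst scores stand in for the ones that died and were dropped. None of these numbers come from the paper under review; `simulate_trial` and `average_over_trials` are hypothetical names.

```python
import random
import statistics


def simulate_trial(n_per_group=10, n_dropped=3, seed=0):
    """Simulate one trial of a drug with NO true effect.

    Returns (honest_diff, selected_diff): control mean minus treatment mean
    of a damage score, first using every animal, then after dropping the
    treated animals with the worst outcomes.
    """
    rng = random.Random(seed)
    control = [rng.gauss(50, 10) for _ in range(n_per_group)]
    treated = [rng.gauss(50, 10) for _ in range(n_per_group)]
    honest = statistics.mean(control) - statistics.mean(treated)
    # Selection on the outcome: discard the treated animals with the most damage.
    kept = sorted(treated)[: n_per_group - n_dropped]
    selected = statistics.mean(control) - statistics.mean(kept)
    return honest, selected


def average_over_trials(n_trials=2000):
    """Average the apparent benefit over many simulated trials."""
    results = [simulate_trial(seed=s) for s in range(n_trials)]
    honest_avg = statistics.mean(h for h, _ in results)
    selected_avg = statistics.mean(s for _, s in results)
    return honest_avg, selected_avg


if __name__ == "__main__":
    honest_avg, selected_avg = average_over_trials()
    print(f"apparent benefit, all animals kept:   {honest_avg:.2f}")
    print(f"apparent benefit, worst 3 discarded:  {selected_avg:.2f}")
```

With all animals kept, the average apparent benefit should hover near zero, as it must for a useless drug. After the exclusion, the same useless drug appears to reduce damage, purely as an artifact of selecting on the outcome.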

The news article continues:

“This isn’t fraud,” says Dirnagl [the outside reviewer who caught this particular problem], who often works with mice. Dropping animals from a research study for any number of reasons, he explains, is an entrenched, accepted part of the culture. “You look at your data, there are no rules. … People exclude animals at their whim, they just do it and they don’t report it.”

It’s not fraud because “fraud” is a state of mind, defined by the psychological state of the perpetrator rather than by the consequences of the actions.

Also this bit was amusing:

“I was trained as an animal researcher,” says Lisa Bero, now a health policy expert at the University of California, San Francisco. “Their idea of randomization is, you stick your hand in the cage and whichever one comes up to you, you grab. That is not a random way to select an animal.” Some animals might be fearful, or biters, or they might just be curled up in the corner, asleep. None will be chosen. And there, bias begins.

That happens in samples of humans too. Nobody wants to interview the biters. Or, more likely, those people just don’t respond. They’re too busy biting to go answer surveys.
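For contrast with the hand-in-the-cage method, here is a minimal sketch of assignment that does not depend on which animal walks up to you. It assumes the animals carry ID numbers and that you want two equal-sized groups; the function name `randomize` and the seed value are hypothetical, and a real protocol would also record the seed and blind the person doing the handling.

```python
import random


def randomize(animal_ids, seed):
    """Randomly split animal IDs into treatment and control groups of equal size."""
    rng = random.Random(seed)  # a recorded seed makes the assignment reproducible
    shuffled = list(animal_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


if __name__ == "__main__":
    groups = randomize(range(1, 21), seed=2013)
    print("treatment:", groups["treatment"])
    print("control:  ", groups["control"])
```

The point of the design is that the biters and the sleepers have the same chance of landing in either group as the friendly ones.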

Statisticians are just as bad! (Or maybe we’re worse, because we should know better)

Of course, we laugh and laugh about this sort of thing, but when it comes to evaluating our own teaching or our own research effectiveness, we not only don’t randomize, we don’t even define treatments, or take any reasonable pre-test or outcome measurements at all!

We use non-statistical, really pre-scientific tools to decide what, in our opinion, “works,” in our teaching and research. So maybe it should be no surprise that biomedical researchers often work with some pre-scientific intuitions too.

Accepting uncertainty and embracing variation

Ultimately I think many of these problems come from a fundamental, fundamental misunderstanding: lack of recognition of uncertainty and variability. My impression is that people think of a medical treatment as something that “works” or “doesn’t work.” And, the (implicit) idea is that if it works, it works for everyone. OK, not really everyone, but for the bad cases there are extenuating circumstances. From that perspective, it makes perfect sense to exclude treated mice who die early: these are just noise cases that interfere with the signal.

OK, sure, sure, everybody knows about statistics and p-values and all that, but my impression is that researchers see these methods as a way to prove that an effect is real. That is, statistics is seen, not as a way to model variation, but as a way to remove uncertainty. There is of course some truth to this attitude—the law of large numbers and all that—but it’s hard to use statistics well if you think you know the answer ahead of time.

[And, no, for the anti-Bayesians out there, using a prior distribution is not “thinking you know the answer ahead of time.” A prior distribution is a tool, just like a statistical model is a tool, for mapping the information coming from raw data, to inferences about parameters and predictions. A prior distribution expresses what you know before you include new data. Of course it does not imply that you know the answer ahead of time; indeed, the whole point of analyzing new data is that, before seeing such data, you remain uncertain about key aspects of the world.]

So, just to say this again, I think that researchers of all sorts (including statisticians, when we consider our own teaching methods) rely on two pre-scientific or pre-statistical ideas:

1. The idea that effects are “real” (and, implicitly, in the expected direction) or “not real.” By believing this (or acting as if you believe it), you are denying the existence of variation. And, of course, if there really were no variation, it would be no big deal to discard data that don’t fit your hypothesis.

2. The idea that a statistical analysis determines whether an effect is real or not. By believing this (or acting as if you believe it), you are denying the existence of uncertainty. And this will lead you to brush aside criticisms and think of issues such as selection bias as technicalities rather than serious concerns.

P.S. Commenter Rahul writes:

The sad part is stuff like excluding subjects on a whim will get you zero repercussions almost all the time. In most cases you get published and perhaps tenure or more funding. If you are terribly unlucky you get mentioned on a blog like this and a few of us go tsk tsk.

It’s worse than that! Even after this article we still don’t know who discarded the data from those three mice or the lab where it happened. If you read the news article, it was all done confidentially. So, for all we know, there might be dozens of papers published by that research group, all with results based on discarding dead animals from the treatment group, and we have no way of knowing about it. And even the offending paper, the one being discussed here, might well eventually be published.

I guess maybe someone can do a search on all published papers involving mouse trials of drugs for protecting the brain after stroke, just looking at those studies with 10 mice in the control group and 7 mice in the treatment group. There can’t be that many of these, right?

40 thoughts on “Three unblinded mice”

  1. The sad part is stuff like excluding subjects on a whim will get you zero repercussions almost all the time. In most cases you get published and perhaps tenure or more funding. If you are terribly unlucky you get mentioned on a blog like this and a few of us go tsk tsk. Life goes on.

    I think we have our priorities all wrong. cf. self-plagiarism. We police science like the drunk who looks for his lost keys under the streetlight.

  2. For argument’s sake, let’s assume we had an ideal enlightened researcher / statistician, well versed in Andrew’s edicts about not denying the existence of variation, acknowledging the existence of uncertainty etc.

    How would his enlightenment lead him to solve the “discarded rats that died from massive strokes” problem any differently? Would any methodological jugglery circumvent acknowledging the fact that mice that died of stroke aren’t noise in a stroke study?

    Aren’t we over-complicating the issue? The point is selection bias plain and simple. You do not discard (relevant) data on a whim. Do we really need an appreciation of Bayesianism & other more nuanced issues to understand this aspect?

    • I believe you always need to allow researchers to break rules – but they need to divulge that and give defensible reasons.

      And the first impact of a quality scoring system (of following rules) will be to inform (perhaps just a few) how to better appear not to have hidden things or give apparently defensive reasons to increase publish-ability.

      The thing that always intrigued me here (in clinical research at least) was that although there is always both methodological variation and biological variation (those different temperatures, ages, etc) they are highly confounded and therefore pleading biological variability defenses should not be allowed.

      Furthermore there is much stronger evidence for methodological variation (often even mathematical) and for effective cures for it. Also, it seems most for profit organisations (especially if subject to regulatory audit/review) have little reluctance in buying into this.

    • Rahul:

      To follow up on Keith’s point: I think the problem is often that researchers do not admit uncertainty or variation, they think they’ve already made their discovery, and they think of various data-collection and data-analysis rules as technicalities. These researchers don’t want technicalities to get in the way of science. This is an admirable attitude, in some sense, but it can create problems given that biology is full of uncertainty and variation.

      The enlightened researcher (of which you speak) would accept that the rules are there for a reason, they’re not just picky-picky rules of the game, they’re central to learning about the world.

      • Maybe even the enlightened researcher realises that both they need “to wash their hands to lessen chance of infection” but also that does not make them “immune”?

        • Say we had to decide whether steel valves corroded more in a pipeline application or aluminium. We have one year to decide so we put, say, 10 valves of each type on 20 randomly chosen pipelines.

          Now unfortunately in the last month there’s an unrelated reactor fire which extensively damages 5 steel valves & 1 aluminium valve, which makes any year-end corrosion measurement on these damaged valves meaningless.

          What’s the right way to approach this problem? Censored data techniques? The crux seems that at some margin you must trust the domain expert when he says it was an “unrelated” fire?

  3. Re the P.S. in the post:

    It gets even worse. Some other lab that does things right may not have found anything “interesting”. This means no publications, no grants, and eventually extinction.

    Selection bias in studies may lead to selection bias in the population of scientists, or selection on selection.

    • I asked just this question in stat consulting class yesterday. Everybody laughed the laugh of the doomed and then we tacitly agreed to change the subject.

  4. “Discard data on a whim”
    I work for the organization that licenses physicians and was at a meeting a few weeks ago when someone said, “Being a licensed physician gives you the right to over-rule protocol when your intuition suggests it is not appropriate.” I was aghast, since the very notion of evidence-based medicine suggests that overruling evidence should only be done when there is other, conflicting evidence. So I began asking around “how many physicians would agree with this notion?” My summary of this informal survey is “all of them.”


    • Howard, “I understand your pain,” as they say, and your pain is echoed in Dawes’ article on improper linear models.

      Still, especially as I was starting out, I worked with engineers with years (decades?) more experience than I had but who may not have had the math I would bring to a problem. I might explain the math of a situation, and they would tell me, nicely as a good mentor might, what really worked and why without being able to put it into equations or to offer statistical results as evidence.

      From observation (not recorded :-( ), I learned that it was usually unwise to disagree with them, for they (the good ones) would be right.

      I also learned which had been observant and thoughtful enough to produce such seemingly reliable tacit knowledge and which seemed to be repeating old bits of insight that had limited applicability.

      And I’m (still) learning how to blend observation with data and reasoning more effectively.

      Do you see such effects of tacit knowledge, too? If so, how do you work most effectively in a world that offers data and insight drawn effectively from data, useful tacit knowledge that its owner can’t yet express explicitly, and “superstition”? My approach tends to be to use the tacit knowledge as impetus to dig more deeply in a search for evidence, but I can see that time-challenged professionals presented with critical emerging situations might feel pressured into the reaction your physician group gave.

    • @Howard:

      Not sure how Aaaargh that is when I think about it. As a refinery engineer we had tons of codes (API, EPA, etc.), our protocols so to speak, but the agreed wisdom was that if your intuition or engineering sense said that following a code prescription would be dangerous, one did not follow it.

      I guess it’s a question of to-the-letter prescription versus discretion. Can a protocol writer always consider & compensate for all eventualities in advance?

    • It seems to me that James Reason in his /Human Error/ writes about patterning and puzzling (borrowing from Vygotsky, I suppose) as the ways we make decisions. Patterning is fast, accurate, and potentially somewhat limited in applicability. Puzzling is slow, less accurate, potentially much more broadly applicable, and (usually) much more fun for people because it’s what some of us think makes humans special. Patterning sounds a lot like Gary Klein’s recognition-primed decision making as well as the following of medical or other protocols.

      It’s been a while since I read it, but I think Reason’s message was that it was useful to transfer as much decision making to patterning as we reasonably could /and/ to recognize the importance of puzzling and to support its effective use when necessary. Perhaps the physicians were making on-the-fly decisions that they had reached the point where puzzling was needed.

      In Rahul’s example, I hear that the models / protocols / code were known to have potential limitations (in the “All models are wrong” sense), and the engineers and operators were using their experience to decide when (and how) to puzzle.

      In your physician sense, I wonder too if the protocols are, in some cases, based on relatively broad posteriors (i.e., “this” is a good decision on average, but the probabilities don’t vary much over certain other possible treatments, and so perhaps another decision is almost as good on average but perceived to be better in a particular case for reasons not incorporated in the models underlying the protocol).

      In Rahul’s example, perhaps better models could reduce but perhaps never eliminate the problem.

      In Howard’s example, could being more explicit about the uncertainties in the medical protocols and their application help, or is that already taken into account?

      Does that help think about this question?

    • maybe if evidence based medicine didn’t have a lot of embarrassing failures, MDs wouldn’t feel that way…
      sort of amusing that you defend evidence based medicine in a thread about lousy garbage quality science

  5. I’m reminded of a quote attributed to Michael Healy:
    “The difference between medical research and agricultural research is that medical research is done by doctors but agricultural research is not done by farmers”.

    I suppose you could say that biostatisticians should be doing more – teaching more innovatively, training people better, whatever – but the fundamental problem with medical research is that most of it is conceived, conducted, and published by people with little or no research training and this is no impediment to a successful research career. Why would they bother to get training? Where’s the incentive to get training?

    If I had a medical degree and a fellowship in some college of surgeons or physicians, I could do research without any other training apart from what I received when I studied for these qualifications. I certainly wouldn’t need a PhD.

    I would have no problem getting my research proposals past ethics committees because they rarely conduct rigorous methodological reviews, focussing more on the ethical aspects of the proposed study.

    I would have no problem getting my research proposals funded by medical research organizations because these bodies are run by people like me and the grant review committees are stacked with people like me.

    There will be biostatisticians on some grant review committees, but these will tend to be public health committees. The majority of grant applications will go to specialist medical or clinical review committees and it will be rare to find a biostatistician on these or indeed anyone with more than moderate numeracy skills.

    I would have no problem getting my work published. I can do an analysis and report anything with a small p-value. I can declare the association between whatever I was looking at to be “significant”.

    When I come to discussing my results, I never need to refer to any estimate of the strength of this association; I only need to restate that it was “significant”.

    I can compare my results to those of others who have looked at the same association, but I only need to dichotomize these into those that found a “significant” association and those that did not.

    I can then discuss at length the clinical implications of my results, once again without actually referring explicitly to any of my results. I can then write a short paragraph about potential biases in my study, but I don’t need to worry about this, I’ll just write some generic statements which I won’t take too seriously, and which I’ll end up dismissing altogether anyway. Something like “Of course our study was retrospective…yada, yada, yada” or “Of course our study was cross-sectional…blah, blah, blah.”

    Next I’ll send this off to a specialist medical journal where it will be reviewed by people like me.

    If the journal is interested, it’ll get sent out for review, if not, I’ll be rejected so I’ll send it to the next journal on my list. Eventually it will be reviewed and I’ll deal with the reviewers’ comments.

    Usually the reviewers will want more p-values, especially if I’ve used a regression model with lots of categorical variables; they like p-values for all the dummy variables in the model. No problem there.

    Sometimes a reviewer will want a post-hoc power calculation. I don’t really know how to interpret these or even what they mean, but I’ll do one because the reviewer asked for it.

    Sometimes a reviewer will say that I should have included some variable in my model, but these are easy to handle. If I didn’t collect data on that variable, there’s nothing I can do. If I did, and it’s not in the final model, then that means it wasn’t “significant” because I only included variables in the model that were “significant”.
    I’ll rarely get any questions about biases in my study and if I do they’ll be something like: “Did you exclude patients with [insert rare medical condition here]”. No problem dealing with these.

    Most of the reviewers’ comments will not be about my results or the study design; they’ll be clinical questions. These present no problem: I’ll just waffle on for a few paragraphs.

    Now all I need to do is send my response back to the Editor and in due course I’ve got a peer-reviewed publication!!!

    • Re the Michael Healy quote: To be fair you need ~8 years of post-K12 classroom education to be a doctor and none to be a farmer.

    • Great quote and you obviously have some experience in clinical research.

      Tom Louis once commented that I painted an overly bleak picture of clinical research (1997).

      I really wish he had gotten the direction correct.

      There are some exceptions, and I like to think I played a part in training a few of them (perhaps the most successful being C David Naylor), but it is a hard nut to crack. My guess is that many statisticians who get involved actually do more harm than good, as the real challenges are not just statistical problems.

      And unless you find a way to avoid ever needing medical care – it’s important to try to help people get less wrong there!

  6. Dear Andrew,

    I never thought someone would be able to write about statistics and make it fun.
    I even tweeted a phrase or two.
    Statistics has always been a chore to me (undergrad economics student). Thanks for making it enjoyable.


    • Very late to the party but that probably isn’t it. I mean the reviewer in the article concluded that in the full dataset, the therapy harmed the mice instead of protected them. So it’s unlikely the paper could have been published under that title, with that abstract. At least, if that’s the journal Dirnagl was reviewing for.

  9. It seems to me that this problem ties in to the foundational decision to look at Least Squares rather than Least Absolute Values largely for reasons of mathematical convenience and elegance. Squaring differences puts added weight on outliers, a problem people solve by cramming the most extreme outliers down the memory hole.

  10. I wonder what Andrew and the readers think about a recent study that reported that mice inherited memories of their fathers (fear for specific smells). It has been blogged by Virginia Hughes.
    Here is the link for the paper in Nature Neuroscience:

    It is very interesting if true, but I am skeptical on biological and methodological grounds. Biologically, I find it difficult to come up with the mechanism, at least not something very plausible. But nature has surprised us before. Let us focus on the methodology, as that is something Andrew and others probably have opinions on.

    The study used mice, so there could be the kind of issues that were raised in the article by Couzin-Frankel, depending on how careful the researchers were. On top of that, they had to rely on behavioral assays, which I imagine to be not very precise and inherently noisy. If you actually look at their data, you do see a lot of variation, the sample sizes are not large, and the p-values they report don’t look great. Also, with this kind of research design, the researchers only get to report their results if they see effects. So, it seems to me that there are a lot of places where the researchers could fool themselves.

    What do you think?

    • It should be easy enough to replicate with preregistered data-handling rules, no? So we should soon know whether the effect is real.

      It is interesting that the convention is to publish first, replicate later. That makes sense, I guess—if a finding is potentially important, let’s get it out there right away rather than waiting for the replication—but, if we are going that route, maybe there should be no requirement that the original published result attain statistical significance?

      • I would say the convention is “publish first, ignore replications as far as possible”.

        If the convention were as you say, then there would be no problem publishing replications. Much evidence suggests this is not the case.

        • Yah, good point. But this one seems clear enough that it would be worth doing a preregistered replication, I’d think. Unless people just take it as an anomaly and ignore it.

  14. I do deny “true” variability – but I think that there are lots of complications and interactions even within the strongest of effects. Things don’t fail just because of bad luck – they fail because of one of a trillion complications that obscure the causality that is driving the result.
