(back to basics:) How is statistics relevant to scientific discovery?

Someone pointed me to this remark by psychology researcher Daniel Gilbert:

Publication is not canonization. Journals are not gospels. They are the vehicles we use to tell each other what we saw (hence “Letters” & “proceedings”). The bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech.

Which led me to this, where Gilbert approvingly quotes a biologist who wrote, “Science is doing what it always has done — failing at a reasonable rate and being corrected. Replication should never be 100%.”

I’m really happy to see this. Gilbert has been a loud defender of psychology claims based on high-noise studies (for example, the ovulation-and-clothing paper) and not long ago was associated with the claim that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” This was in the context of an attack by Gilbert and others on a project in which replication studies were conducted on a large set of psychology experiments, and it was found that many of those previously published claims did not hold up under replication.

So I think this is a big step forward, that Gilbert and his colleagues are moving beyond denial, to a more realistic view that accepts that failure is a routine, indeed inevitable part of science, and that, just because a claim is published, even in a prestigious journal, that doesn’t mean it has to be correct.

Gilbert’s revised view—that the replication rate is not 100%, nor should it be—is also helpful in that, once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists whom Gilbert earlier referred to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is the discovery of truths that ultimately would be discovered by someone else sometime in the future, thus, the speeding along of a process that we’d hope would happen anyway, so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery.

If we are discouraged from criticizing published work—or if our criticism elicits pushback and attacks from the powerful, or if it’s too hard to publish criticisms and obtain data for replication—that’s bad for discovery, in three ways. First, criticizing errors allows new science to move forward in useful directions. We want science to be a sensible search, not a random walk. Second, learning what went wrong in the past can help us avoid errors in the future. That is, criticism can be methodological and can help advance research methods. Third, the potential for criticism should allow researchers to be more free in their speculation. If authors and editors felt that everything published in a top journal was gospel, there could well be too much caution in what to publish.

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. Shame comes not when people make mistakes, but rather when they dodge criticism, won’t share their data, refuse to admit problems, and attack their critics.

But, yeah, let’s be clear: Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start.

So let’s talk a bit about failed replications.

First off, as Gilbert and others have noted, an apparently failed replication might not be anything of the sort. It could be that the replication study found no statistical significance because it was too noisy; indeed, I’m not at all happy with the idea of using statistical significance, or any such binary rule, as a measure of success. Or it could be that the effect found in the original study occurs only in some situations and not others. The original and replication studies could differ in some important ways.
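Here’s a toy numerical illustration of that first point (all numbers are made up): an original study that clears the significance threshold and a replication that doesn’t can still be statistically consistent with each other, because the difference between the two estimates is itself noisy.

```python
# Toy illustration (made-up numbers): an original estimate and a noisier
# replication can both be consistent with the same underlying effect.
import math

orig_est, orig_se = 0.50, 0.20   # original study: "significant" (z = 2.5)
rep_est, rep_se = 0.20, 0.25     # replication: "not significant" (z = 0.8)

# The difference between the two estimates, and its standard error:
diff = orig_est - rep_est
diff_se = math.sqrt(orig_se**2 + rep_se**2)
print(diff, diff_se, diff / diff_se)   # z is about 0.9: the studies don't clearly disagree
```

The difference between “significant” and “not significant” is not itself statistically significant.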

One thing that the replication does give you, though, is a new perspective. A couple years ago I suggested the “time-reversal heuristic” as a partial solution to the “research incumbency rule” in which people privilege the first study on a topic—even when the first study is small and uncontrolled and the second study is a large and careful replication.

In theory, an apparently failed replication can itself be a distraction, but in practice we often seem to learn a lot from these replications, for three reasons. First, the very act of performing a replication study can make us realize some of the difficulties and choices involved in the original study. This happened with us when we performed a replication of one of our own papers! Second, the failed replication casts some doubt on the original claim, which can motivate a more critical look at the original paper, which can then reveal all sorts of problems that nobody noticed the first time. Third, lots of papers have such serious methodological problems that their conclusions are essentially little more than shufflings of random numbers—but not everyone understands methodological criticisms, so a replication study can be a convincer. Recall the recent paper with the replication prediction market: lots of these failed replications were no surprise to educated outsiders.

What, then, is—or should be—the role of statistics, and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making.

Let’s go through each of these:

Design and data collection. Statistics can help us evaluate measures and can also give us a sense of how much accuracy we will need from our data to make strong conclusions later on. It turns out that many statistical intuitions developed many decades ago in the context of estimating large effects with good data do not work so well when estimating small effects with noisy data; see this article for discussion of that point.
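As a rough illustration of the design point (the effect size and noise level below are made-up numbers), here is a minimal sketch of how the standard error shrinks with sample size, and how big a two-group study has to be before a small effect is estimable at all:

```python
# Rough design sketch (made-up numbers): how large does a two-group study need
# to be before its standard error is small relative to a plausibly small effect?
import math

true_effect = 0.1   # assumed small effect, in outcome units
sigma = 1.0         # assumed per-measurement standard deviation (noisy data)

for n_per_group in [20, 200, 2000, 20000]:
    se = sigma * math.sqrt(2 / n_per_group)   # standard error of the difference in means
    print(n_per_group, round(se, 3), round(true_effect / se, 2))
```

With n = 20 per group the standard error is about three times the assumed effect, so the only estimates that reach statistical significance are wild overestimates; that is exactly the setting where the old intuitions break down.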

Data analysis. As has been discussed many times, one of the sources of the recent replication crisis in science is the garden of forking paths: Researchers gather rich data but then report only a small subset of what they found; by selecting on statistical significance they are throwing away a lot of data and keeping a subset determined largely by noise. The solution is to report all your data, with no selection and no arbitrary dichotomization. At this point, though, analysis becomes more difficult: analyzing a whole grid of comparisons is harder than analyzing just one simple difference. Statistical methods can come to the rescue here, in the form of multilevel models.
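Here is a minimal sketch of what partial pooling buys you on a grid of comparisons. It uses simulated data and a crude empirical-Bayes shrinkage formula rather than a full multilevel model, so treat it as a cartoon of the idea, not a recipe:

```python
# Minimal sketch of partial pooling across a grid of comparisons (simulated data).
import numpy as np

rng = np.random.default_rng(1)
J = 20                            # number of comparisons in the grid
theta = rng.normal(0, 0.1, J)     # small true effects
se = 0.3                          # each raw estimate is noisy
y = theta + rng.normal(0, se, J)  # raw estimates, one per comparison

# Crude estimate of the between-comparison variance
tau2 = max(np.var(y, ddof=1) - se**2, 0)

# Partial pooling: shrink each raw estimate toward the grand mean
w = tau2 / (tau2 + se**2)
pooled = w * y + (1 - w) * y.mean()

# The partially pooled estimates are typically much closer to the truth
print(np.abs(y - theta).mean(), np.abs(pooled - theta).mean())
```

In a real analysis you would fit the multilevel model directly (in Stan, for example) rather than plugging in a moment estimate of the between-comparison variance, but the qualitative behavior is the same: the noisy individual estimates get pulled toward each other, and the whole grid can be reported at once.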

Decision making. One way to think about the crisis of replication is that if you make decisions based on selected statistically significant comparisons, you will overstate effect sizes. Then you have people going around making unrealistic claims, and it can take years of clean-up to dial back expectations.
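A quick simulation shows how mechanical this overstatement is. The numbers are made up and match the design sketch above: a true effect of 0.1 estimated with standard error 0.3.

```python
# Toy simulation of the "significance filter": conditioning on p < 0.05
# overstates the effect size when the study is noisy.
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.1, 0.3                      # small effect, noisy estimate
est = true_effect + rng.normal(0, se, 100_000)  # many hypothetical studies
signif = np.abs(est) > 1.96 * se                # the ones that reach p < 0.05

print(signif.mean())        # only a small fraction are "significant"...
print(est[signif].mean())   # ...and on average they overstate the true 0.1 several-fold
```

In this toy setup only about 6 percent of the studies reach p < 0.05, and the ones that do overstate the true effect by roughly a factor of four or five. Decisions keyed to the significant results inherit that exaggeration.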

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work.

To loop back to Daniel Gilbert’s observations quoted at the top of this post: We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

Reduce the costs of failed experimentation by being more clear when research-based claims are speculative.

React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

Avoid the two-tier system. Give respect to a student project or arXiv paper just as you would to a paper published in Science or Nature.

One last question

Finally, what should we think about research that, ultimately, has no value, where the measurements are so noisy that nothing useful can be learned about the topics under study?

For example, there’s that research on beauty and sex ratio which we’ve discussed so many times (see here for background).

What can we get out of that doomed study?

First, it’s been a great example, allowing us to develop statistical methods for assessing what can be learned from noisy studies of small effects. Second, on this particular example, we’ve learned the negative fact that this research was a dead end. Dead ends happen; this is implied by those Gilbert quotes above. One could say that the researcher who worked on those beauty-and-sex-ratio papers did the rest of us a service by exploring this dead end so that other people don’t have to. That’s a scientific contribution too, even if it wasn’t the original aim of the study.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

We’ve gone a long way in this direction, both statistically and sociologically. From “the replication rate . . . is statistically indistinguishable from 100%” to “Replication should never be 100%”: This is real progress that I’m happy to see, and it gives me more confidence that we can all work together. Not agreeing on every item, I’m sure, but with a common understanding of the fallibility of published work.

42 thoughts on “(back to basics:) How is statistics relevant to scientific discovery?”

  1. Hi Andrew,

    This post seems as good as any to just say “thank you” for putting your thoughts out there about this stuff. I have learned a lot by reading your blog the past couple of years. Thanks for spending so much time discussing these topics, and making your discussion available to the general public.

    Cheers,
    An Economist

    • Another “thank you for all the insights” and a question about replication efforts. So, I was going to attempt to write something clever along the lines of “The Courts Get Statistics (p less than .05)”, decided it was likely to be lame (p greater than .95) and so re-read Cohen’s “The Earth is Round (p less than .05)”; whereupon I again came across the idea of the “strong form” of NHST and thus the following question:

      Should replication efforts be of the form:

      (A) H0: Treatment has no effect, re-run experiment and see if H0 is rejected via magic p-number,

      or

      (B) H0: Treatment produces effect size reported in finding sought to be replicated, calculate minimum sample size likely to generate reported effect, test and see if H0 is rejected?

      Or is (B) already done or, as more likely, have I got replication all wrong?
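      One rough way to make (B) concrete (a sketch with a made-up reported effect size, and just one of several ways people operationalize it): choose the replication sample size so that the reported effect, if real, would be detected with very high power, and then test whether the replication estimate is significantly smaller than the reported value.

      ```python
      # Rough sketch of option (B), with a hypothetical reported effect size:
      # pick n so the reported effect, if real, would almost surely show up.
      import math

      d_reported = 0.5        # reported standardized effect size (Cohen's d), made up
      z_alpha = 1.96          # two-sided 5% test
      z_power = 1.645         # 95% power

      # Standard two-sample formula, assuming unit standard deviation (d is standardized)
      n_per_group = 2 * ((z_alpha + z_power) / d_reported) ** 2
      print(math.ceil(n_per_group))   # about 104 per group for this made-up example
      ```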

  2. > the discovery of truths that ultimately would be discovered by someone else sometime in the future, thus, the speeding along of a process that we’d hope would happen anyway, so good criticism should speed along the process of scientific correction
    Agree (and Peirce’s conceptualization of the economy of research).

    • Thanks, Andrew, for the great comments. In large part, I’m in agreement. You and others here are a rare breed: intellectually curious. What concerns me, though, out there in the biomedical research community, is the extent to which evidence-based medicine has been co-opted to serve as a marketing tool. Replicating the research that has co-opted it seems like such a waste of resources. But maybe I am missing something here. I have tended to roll my eyes when I have been informed of a hypothesis or a study. Some are simply so silly. Of course, that is a subjective opinion.

  3. It’s such an old story that has been resisted for thousands of years. Time and Chance happen to them all, as Ecclesiastes points out. I start my intro stats classes with that. Thank you for your continued defense of timeless wisdom.

  4. I’m afraid that Gilbert’s view may not have changed, and that he simply entered this thread to support his friend. It’s really hard to know what these people actually believe. For example they praised Sabeti’s article, which defended some of the most questionable work in psychology. I have a hard time believing they actually think the work mentioned in that article merits further investigation let alone journal space.

    • Jordan:

      I can’t say for sure. My interactions with Gilbert haven’t been pleasant but I’m inclined here to give him the benefit of the doubt.

      My best guess is that Gilbert, like everyone (myself included), holds various somewhat contradictory ideas in his head at the same time. The notorious quote, “the study actually shows that the replication rate in psychology is quite high – indeed, it is statistically indistinguishable from 100%,” did not come directly from Gilbert—it was in a Harvard press release promoting Gilbert’s work. As I wrote above, Gilbert is associated with that claim, and I assume he supported it at the time, maybe without being completely clear on what exactly it was implying. At the same time that this “100%” quote came out, Gilbert was of course aware that not every study replicates: if nothing else, he had to have been aware of Bem’s ESP study, which was published in a psychology journal. So, even then, the 100% claim was a bit of hyperbole. And even then, I expect he would’ve had no problem endorsing the statement, “Replication should never be 100%.”

      Anyway, here’s why I find this episode encouraging. There have been times when Gilbert has made strong claims about psychology having a high replication rate. But now he’s emphasizing that failed replications happen all the time. To me, this is great news that he’s changing his emphasis, moving away from untenable optimism about replication toward more realism.

      Even if Gilbert’s motivation here is to support a friend, that’s fine. I’m much much happier when prominent researchers accept replication failures than when they try to defend every published claim to the bitter end.

      If my goal were to “win” an argument with Gilbert, maybe I’d be frustrated that he’s now saying that lack of replication is common. If my goal were to win, I’d want him to stay at the most extreme, ridiculous position so I could point out how silly that is. But my goal is not to win an argument; my goal is to help researchers design, analyze, and interpret their studies better. And, for that goal, it’s good news when prominent people in the field move toward acceptance of failed replication attempts.

      My reason for pointing out the earlier “100%” claim from that old press release is not to slam Gilbert as a hypocrite but rather to express happiness at the shift in the discourse in the field of psychology. As you note in your comment, this shift is complicated and has not happened completely, not at all. I still think it’s a good sign. What people believe is important, but it’s hard for me to pin down what people believe. When people evolve in what they say and in what they emphasize, that can be worth noting even if we don’t have a clear sense of the underlying beliefs.

      • “If my goal were to “win” an argument with Gilbert, maybe I’d be frustrated that he’s now saying that lack of replication is common. If my goal were to win, I’d want him to stay at the most extreme, ridiculous position so I could point out how silly that is. But my goal is not to win an argument; my goal is to help researchers design, analyze, and interpret their studies better. And, for that goal, it’s good news when prominent people in the field move toward acceptance of failed replication attempts.”

        Well put. Wanting to “win” is one of the biggest obstructions to good science.

  5. “How is statistics relevant to scientific discovery?” is a very general question because there are many things meant by “scientific discovery”. I think many of the “discoveries” in many areas of science aren’t really discoveries in the sense of discoveries in physics, chemistry, and some biology. In other words, I wouldn’t fly in a plane built by the discoveries in these areas.

    Regardless, I have asked that exact question with regard to discoveries in cell and molecular biology, which underpins much of biomedical and agricultural advances. This is not the science of Ioannidis or Harrell, which is typically clinical trials or similar. It is also not the science of Hastie or Leek or whoever dealing with “discovery” from big data. This is good old bench biology where historically a researcher did something like run a gel with one or more unknowns (the treated conditions) and used the qualitative patterns as evidence of either existence (present/absent) or qualitative amount (obviously more of X than Y). Everything was run in triplicate simply to make sure that something didn’t go wrong with the procedure. If statistics were given, it was a mean and SD (of N=3!).

    Somewhere around Y2000, researchers started applying inferential statistics (t-tests and ANOVA) to their qualitative judgment of the results: Not only does this mean look bigger, the p-value for the t-test is between 0.001 and 0.01 (two asterisks, yay!). Nothing else about the work is different, except that maybe around 2010 researchers were told that n=3 wasn’t good enough for inferential statistics, so now n is usually 5-10.

    Nobody estimates effect sizes. Nobody estimates intervals (confidence or credible). There doesn’t seem to be any sense of what different effect sizes would even mean biologically. It’s all about presence/absence or bigger/smaller. And yet the science is advancing rapidly. Real stuff is discovered (and not highly conditional stuff that depends on the day of the week and if it is raining).

    I’m not a bench biologist so I am trying to figure out this exact question — what is the role of statistics in this rather large, rather important area of science (at least if measured by how many research dollars are going there relative to other areas)? My short answer is that the inferential statistics seem to be a marker of “science” (because, p-values) to editors and reviewers, but that the actual process of this science has not been improved at all by inferential statistics and in fact could continue as done prior to Y2000 without them. I’d love to have someone in the field comment on this.

    My attempt to address this is here – which is probably a really boring read, but it’s my attempt to jot down how the statistics are actually being used

    https://www.middleprofessor.com/post/comments-on-the-role-of-statistics-in-boström-a-pgc1-alpha-dependent-myokine-that-drives-brown-fat-like-development-of-white-fat-and-thermogenesis/

    • Jeff:

      You write, “Somewhere around Y2000, researchers started applying inferential statistics (t-tests and ANOVA) to their qualitative judgment of the results: Not only does this mean look bigger, the p-value for the t-test is between 0.001 and 0.01 (two asterisks, yay!)”

      People were doing this way before the year 2000. Right now, I happen to be writing a paper which includes reanalysis of work published in 1988 that did just what you’re saying, interpreting results not just on statistical significance but on p-value ranges.

      • I am talking about papers in bench biology — cell and molecular biology. Compare the papers of James Allison (the most recent Nobel laureate in Physiology) from the 80s and 90s (which is the work for his Nobel) to those from 2010-present. The former have no/few p-values. The latter are full of asterisks. Y2000 is just a pretty good demarcation point on a continuum that has no obvious demarcation.

    • Following up on the above. Science works a bit differently in bench biology than is often described here, and this is where forking paths could come in but also where they are corrected. The researcher starts with some prior knowledge of how something works but needs to identify the agents (signaling molecules, receptors, etc.), so starts with something like a big scan (expression levels on a bunch of proteins or similar) and looks at 1) proteins with the smallest p-values (blah) combined with 2) prior knowledge of the function of these proteins (which knowledge may come from mice or fish or yeast). This initial analysis is usually in a supplement (compare to the many GWAS or similar studies in which the association itself makes the cover of Science or Nature). Then the researcher designs a simple experiment with exquisite control of the cellular environment to test the candidates. Then a mouse model with the gene knocked out is generated, and the experiment with this knockout has a specific prediction on the direction of the effect. Then a mouse line engineered to overexpress the signal is used. Etc. Etc. Each experiment makes sense in light of the results of the previous experiments. So in this sense, the p-values from the previous experiments provide the “decision rule” for which experiment to do next. There is the potential for many forking paths in the sense of different experiments, but I think if one path of experiments were done based on a false positive, then this set of experiments would not show what would be predicted by the previous experiments, and this whole dead-end path of experiments (done by some grad student) would be thrown out. This is what I think is missing from the transparency of the process: all the dead-end paths that were pursued because of false positives. That said, the process as a whole seems to be self-correcting within the lab itself. Occasionally the false positives do get published, and these tend to be corrected because many other labs are following up on the experiments to gain further insight into the biological mechanism. It is this replication process (replicate someone else’s result to start your own path) that is missing from many other sciences (ecology and evolution for example, which is more my field).

      • This could be attributed to the relatively “hard” nature of the bench biology. The theoretical foundations are more solid than in social sciences. The experiment designs are highly specific and replicable. The links between cell work and animal work are hard logic. The measurements are probably a lot less noisy. When the signal to noise ratio is high, you can use a highly suboptimal signal processing protocol and still get the signal right most of the time.

      • This also sounds like the use of statistics in many industrial situations (e.g. designing and testing automobiles or computer chips). Experiments are typically small (partly because they are expensive), and the results are used to try to develop a better design for the next experiment.

    • And yet the science is advancing rapidly. Real stuff is discovered (and not highly conditional stuff that depends on the day of the week and if it is raining).

      What is your evidence for this? Breathless press releases are certainly being generated at a rapid pace but any kind of reproducibility, understanding, or synthesis? No, I don’t see it.

      My attempt to address this is here – which is probably a really boring read, but it’s my attempt to jot down how the statistics are actually being used

      The statistics are used to tell if something is “real”. If the p-value is below 0.05 there is a “real” difference, otherwise there might be or might not until the paper is published. If p remains greater than 0.05 when the paper gets published, then there was no “real” difference. Whether this happens depends on how strongly the researcher believes p should be less than 0.05 and how much money they have to keep taking new measurements.

      • Also, that paper doesn’t look like it was done blinded (I couldn’t find any mention), and their headline claim is “Here we show that PGC1α expression in muscle stimulates an increase in expression of Fndc5”. However, they never show us a plot of PGC1α expression vs Fndc5 expression. I don’t see any reason to take it seriously when they fail to do basic stuff like this.

      • My evidence that knowledge of fundamental cell & molecular biology is making rapid progress is a comparison of any intro bio textbook in 2019 to one from 10 years ago, 20 years ago and 30 years ago. I don’t think this information is simply a mountain of poo generated by cargo cult science (unlike my opinion on lots of other sciences). Again – I’m not talking about nutrition research, or the latest bit on diet and cardiovascular disease, or the “discovery” of “genes for … ” that make all the headlines. I’m talking about cell and molecular biology or what I’d call bench biology. Much of molecular biology was founded by physicists who thought everything in physics had been discovered.

        But my point was: I don’t think inferential statistics are doing much for the science – it’s the same process of prior knowledge + some logical thinking that makes predictions + elegant experiments to basically confirm the prediction + some more logical thinking that makes more predictions + more elegant experiments and so on. The p-values/asterisks give it a veneer of approval from reviewers/editors, but the science worked perfectly well before this and I think would work perfectly well without it. Because the experiments have exquisite control there is little noise and little need for complex models: t-tests, ANOVA, nonparametric Mann-Whitney U stuff. Again – there is no estimation of effects because there is (mostly) no theory on what an effect size should be — either the MAPK pathway is activated or it’s not, for example.

        When Ioannidis complains that most science is wrong, he’s referring to the “reading Gelman’s blog causes cancer” stuff and not bench biology, which largely flies under the radar of statisticians. My question is, would the field advance knowledge faster or more efficiently if researchers actually thought about effect sizes and had good theoretical models that generated quantitative predictions?

        • Can you find any evidence that all this new stuff in the textbooks is reliable? I’m pretty familiar with molecular bio and their practices, and do not have anywhere near the confidence you do. I am certain you won’t be able to find many replication studies if you look, and the ones you do find will have a low rate of success (at least below 50%, more often near 10%).

          It isn’t hard to come up with a whole “this thing binds to that thing, etc., etc.” story if you can throw away whatever data you don’t like (there is always a legitimate excuse), don’t have to attach numbers to anything, and no one is running direct replications.

          This guy is pretty close to my own impression:

          Biologists summarize their results with the help of all-too-well recognizable diagrams, in which a favorite protein is placed in the middle and connected to everything else with two-way arrows. Even if a diagram makes overall sense (Figure 3A), it is usually useless for a quantitative analysis, which limits its predictive or investigative value to a very narrow range. The language used by biologists for verbal communications is not better and is not unlike that used by stock market analysts. Both are vague (e.g., “a balance between pro- and antiapoptotic Bcl-2 proteins appears to control the cell viability, and seems to correlate in the long term with the ability to form tumors”) and avoid clear predictions.

          https://www.cell.com/cancer-cell/fulltext/S1535-6108(02)00133-2

          the experiments have exquisite control there is little noise

          I have no idea what gave you this impression but it is wrong.

        • I’ll throw the question back into your court: what evidence do you have that the textbook knowledge is largely unreliable? A paper describing the crazy way that biologists do science isn’t evidence. Can you point to any paper suggesting (much less showing) that intro bio textbook knowledge of cell and molecular biology is unreliable and unreplicatable?

          Again – this is a tangent to the theme of Andrew’s post which is what is the role of statistics in science discovery and my answer for cell and molecular biology is, “not much”.

        • RE:’Again – this is a tangent to the theme of Andrew’s post which is what is the role of statistics in science discovery and my answer for cell and molecular biology is, “not much”.
          ——-
          Exactly

          • 1) It should all be considered unreliable until someone has replicated it.

          2) The replications of that type of data that do exist indicate a very sad state of affairs:

          A) Out of 50 studies, 32 were dropped because it was too expensive to even figure out how to replicate them. As for the rest, only half were “mostly repeatable”. That was last summer; not sure what the update is on this.
          https://www.sciencemag.org/news/2018/07/plan-replicate-50-high-impact-cancer-papers-shrinks-just-18

          B) Amgen reports being unable to replicate 90% of preclinical cancer papers, then three more. https://www.nature.com/news/biotech-giant-publishes-failures-to-confirm-high-profile-science-1.19269

          C) Bayer reports only 25% of studies could be replicated: https://www.nature.com/articles/nrd3439-c1

          D) An NIH-run study reports only 1/12 studies regarding spinal cord injury fully replicated: https://www.sciencedirect.com/science/article/pii/S0014488611002391

          These studies are all biochem, molecular bio, cell bio, and rodents. The exact same way the “textbook knowledge” is arrived at.

        • I should add that getting a reproducible phenomenon is only the first step too. The next even harder step is to interpret it correctly… this is very difficult when so many different explanations are possible for a finding like “x is higher under condition A than in condition B”.

          But I do agree that the typical usage of stats has contributed less than nothing to those types of studies.

        • Other than A), none of these reproducibility-crisis news reports are relevant to what I’ve said, which is 1) cell and molecular biology is making rapid advances in knowledge, largely without statistics (or in spite of using p-values), and 2) the cell and molecular biology presented in *intro bio textbooks* is very well replicated. Since apoptosis is the whipping boy of both comments above, I’ll use this as an example of the kind of knowledge that would be integrated into an Intro BIO textbook.

          https://www.roswellpark.org/sites/default/files/hernandez_4-25-17_paper_5_nrc.2015.17.pdf

          Are we that skeptical that the 30 years of bench work by hundreds of labs and thousands of researchers to generate one picture in a textbook is probably wrong?

        • My argument is only about the elucidation of the apoptotic pathways that has been discovered by hundreds of labs and thousands of workers, not about translating this knowledge to develop cancer treatments.

        • Are we that skeptical that the 30 years of bench work by hundreds of labs and thousands of researchers to generate one picture in a textbook is probably wrong?

          Sorry, I must have missed this originally. The answer is yes.

          I am not impressed by those non-quantitative models at all. I would be totally unsurprised to find that if I look into it the model requires impossible amounts of energy, space, reaction rates, etc to actually work. Almost no one is doing this basic sanity checking.

          In my own research I found that the entire cell surface must be devoted to a single type of receptor for the usual interpretation of the results to work. There were over a thousand papers on it, and I was the first to think to check.

        • I have a weird job (lawyer) and a long time ago, while preparing to depose a hematologist I read the textbook “Blood” by Jandl from cover to cover. Recently, preparing for the same gig, I read the newest version of “Hematology” by Hoffman, et al cover to cover. In the intervening 30-odd years the standard textbook swelled by about 50%. Yet very little of what I found was what I’d call new knowledge. A large portion of the new stuff is dedicated to the oldest problem in hematology which is categorizing diseases. Over the years some things that used to be called AGL became AML then ANLL and thereafter nearly atomized. Things that were called acute lymphocytic leukemias wandered off to join the lymphomas and are now subdivided into a dizzying array of subtypes.

          Much of what has apparently driven the non-stop recategorizing has been the use of immunohistochemical staining which allows (as I understand it) those doing the testing to say approximately where along the differentiation path from stem cell to e.g. myeloid cell the stained cell you’re looking at can be placed. I have from time to time had to fight battles over diagnosis as the numbers for specificity and sensitivity for these tests suggest that there are lots of false positive diagnoses being made. I’ve yet to win mainly because judges cannot bring themselves to believe that a lawyer might be a better judge of the probability of an accurate diagnosis than a real life hematologist … but I digress.

          After the pathology stuff I’d say the next biggest chunk of new information is derived from the various shouts of Eureka! over the last three decades. Remember when the discovery of apoptosis meant that a pill to make bad stem cells commit seppuku was just around the corner? It’s still just around the corner. And what about the -omics revolution? Well there’s all kinds of -omics now but none of it looks anything like the Krebs cycle. Instead it’s invariably “when metabolizing glucose the marrow sees a proliferation of blah blah blah which could be mediated by nearby stromal cells, extra-marrow metabolism and exchange … etc”. All of which looks suspiciously like “when we turn the black box upside down it makes different noises which could be due to x number of things which, interacting amongst themselves, could be consistent with zillions of different pathways”.

          The treatments are largely different, though with few exceptions the outcomes are more or less the same. There are of course some wonderful developments, with Gleevec getting lots of attention and hairy cell leukemia reduced largely to a chronic condition. There are others as well.

          Overall the story was one of very modest progress, increasingly complex (perhaps hopelessly so) theories and no coherent theory of leukemogenesis. The etiology sections (with the exception of viruses and certain lymphomas) have hardly changed at all.

          Just my take and please recall “Jim, I’m not Bones! I’m the lawyer dammit!”

        • “All of which looks suspiciously like “when we turn the black box upside down it makes different noises which could be due to x number of things which, interacting amongst themselves, could be consistent with zillions of different pathways”.”

          Great metaphor!

        • the cell and molecular biology presented in *intro bio textbooks* is very well replicated.

          Please just link to one of these replications you refer to.* In my experience, experiments are basically not being repeated regarding anything. I exaggerate a bit since obviously I just linked to a few, but it is a big deal when it happens.

          https://www.roswellpark.org/sites/default/files/hernandez_4-25-17_paper_5_nrc.2015.17.pdf

          Are we that skeptical that the 30 years of bench work by hundreds of labs and thousands of researchers to generate one picture in a textbook is probably wrong?

          After what I have seen, this wouldn’t surprise me one bit. This is not a difficult thing to achieve.

          Would you be surprised if hundreds of monasteries and thousands of monks all prayed (really, really hard mind you) about how many angels could fit on a pin just to generate a number that is probably (not even) wrong? Because I have no more confidence in praying than in NHST as a means for gaining a correct understanding of things.

          Regarding that specific link:

          1) I look at figure 3 and see a unidirectional pathway. I know this is a “wrong picture” based on principle alone.

          2) Can you point to one of the claims in this paper you believe has been independently reproduced*?

          * I already know you can’t without digging through the literature, because if you actually knew of direct replications in the literature you wouldn’t be linking to this high-level stuff. So the next step is for you to claim everyone is running these replications “informally” but not publishing them. Is that right?

        • Looking closer, I found a mention of reproducibility:

          In the mammalian and even overall vertebrate context, although there are a few exceptions, in most cases orthologues of BCL-2 family members can be used interchangeably. This has enabled rapid progress owing to the reproducibility of findings in experiments using human and mouse cells.

          They don’t cite anything in particular but here is the immediately preceding sentence:

          “BAX and BAK promote cell death by causing mitochondrial outer membrane permeabilization (MOMP; see below), enabling the release of cytochrome c, which activates the CED-4 homologue apoptotic peptidase activating factor 1 (APAF1) in the cytosol to cause activation of caspase 9 and the caspase cascade 55.”

          Where reference 55 is:

          Green, D. R. & Kroemer, G. The pathophysiology of mitochondrial cell death. Science 305, 626–629 (2004). https://www.ncbi.nlm.nih.gov/pubmed/15286356

          That sentence contains 6 claims:
          1) BAX and BAK “promote” cell death
          2) BAX and BAK cause MOMP
          3) MOMP causes cell death
          4) MOMP “enables” the “release” of cytochrome c
          5) Cytochrome c “activates” APAF1 in the cytosol
          6) “Activated” APAF1 in the cytosol causes activation of caspase 9 and the caspase cascade

          Are there any of these specific claims you are particularly confident have been replicated? I see a problem that some like #1 are a bit vague so it isn’t clear what replication would even mean.

          What does “promote” mean? That sounds like sometimes BAX/BAK treatment leads to cell death and sometimes it doesn’t… so basically that needs to be split up into different sub-claims for each context, whatever they may be. Then each of those experiments should have been independently reproduced.

        • All models are wrong (eg. fig 3 in the linked paper) — this is trivially true — but you seem to imply that nothing is right (https://chem.tufts.edu/answersinscience/relativityofwrong.htm). If something about fig 3 is right, then we have gained knowledge about the apoptotic pathway, which was my very simple claim.

          “Because I have no more confidence in praying than in NHST as a means for gaining a correct understanding of things” — I have said throughout, since this was the topic of the post, that NHST doesn’t play a role in this knowledge discovery. If you read the papers, the researchers are doing what they’ve always done experiment-wise. They are simply adding p-values now where they didn’t before. This may give the impression that NHST is doing work, my claim is that it is not.

          on replication – maybe we have different concepts of replication. From the linked paper:

          “These experiments revealed that BCL-2 did not affect cell proliferation, but promoted cell survival by preventing the death of growth factor-dependent cells cultured without a cytokine. Korsmeyer and colleagues (22) extended these findings by showing that BCL2 transgenic mice accumulate excess B lymphocytes and that these cells are protected from spontaneous death in culture. Further studies from numerous groups showed that overexpression of BCL-2 was able to block apoptosis, triggered by diverse cellular stresses, in cell lines and in primary cells of many types (23–29)”

          These are replications, or more specifically “conceptual replications” (https://elifesciences.org/articles/23383). Maybe there are “direct replications” in there too.

        • This may give the impression that NHST is doing work, my claim is that it is not.

          The experiments are to measure things in two different groups and then see which is higher/lower, right? This is NHST except using “eyeballing” instead of math. No one has ever claimed there is a problem with the math of NHST, the problem is with the logic used when performing this procedure.

          Maybe there are “direct replications” in there too.

          Conceptual replications are a completely other thing. They are more than fine on their own but not when used as a substitute for direct replications. Only doing conceptual replications is a great way to generate elaborate chains of BS: https://statmodeling.stat.columbia.edu/2019/03/04/yes-design-analysis-no-power-no-sample-size-calculations/#comment-982976

        • Here’s the form of question most people want from an NHST:

          Does the outcome differ between Group A and Group B?

          I’d imagine everyone party to this comment thread realizes that a p-value of 0.00135 does not actually answer that question in a useful manner. So presumably we all want to know of an alternative manner of answering that sort of question.

          Is there a Bayesian (or Gelman-ian) way of stating that question so that it could be answered using a statistical analysis of some kind? I’d think we really ought to be able to come up with a valid answer to the question stated thusly:

          Based on this particular dataset, how confident are we that the outcome differs between Group A and Group B?

        • Stop the presses! Someone tell type I diabetics that their insulin treatment is unlikely to work because Banting, Best, et al. were probably wrong because they eyeballed their experimental results and there probably wasn’t an independent attempt at a direct replication using the same line of rabbit from the same source and the same source of pancreatic extract.

          https://www.physiology.org/doi/abs/10.1152/ajplegacy.1922.62.1.162?journalCode=ajplegacy

        • Jeff said,

          “Stop the presses! Someone tell type I diabetics that their insulin treatment is unlikely to work because Banting, Best, et al. were probably wrong because they eyeballed their experimental results and there probably wasn’t an independent attempt at a direct replication using the same line of rabbit from the same source and the same source of pancreatic extract.”

          Nicely put for dramatic effect, but it’s also leaving out all the subsequent “data” of type I diabetics using insulin treatment and a) having it relieve their diabetes symptoms and b) not suffering serious negative consequences. (But also, finding that it didn’t work for some type I diabetics, because they had an allergic reaction which destroyed the sheep insulin — and then the further experimentation with pig insulin, which works for at least some such people.)

        • >Here’s the form of question most people want from a NHST:

          >Does the outcome differ between Group A and Group B?

          The first step is to realize that this question is stupid from the get-go. The answer to this question is *always always always* “YES”. The outcomes couldn’t even match at the 22nd decimal place much less out to infinite decimal places.

          So the question is stupid as asked.

          What is a better question? How about “What is the expected value of the utility of the difference in outcomes between Group A and Group B?” That’s a good question for something like a drug or a medical treatment or a policy change or etc.

          But suppose we’re not talking about a “consumable”, ie. something that has a meaningful “utility”, like I don’t think there’s a utility directly associated with say finding out information about the difference in growth rates of trees before vs after an El Niño year… So how do we phrase that kind of question to be meaningful?

          “What is the probability that the difference in growth rate of oak trees in the year after vs the year before an El Nino event is within the range [r, r+dr] and under what assumptions?”

          That’s a meaningful question that actually depends on the data. “Is there a difference” can immediately be answered “yes” for essentially every question without looking at data.

  6. Yes, being educated to think like a statistician I too realize that the literal answer to such a question is “always”. In my experience, pointing out the “stupidity” of a question to the person asking is counter-productive.

    Hence my search for a more helpful restatement of the question that a) admits a non-degenerate answer while also b) remains accessible to those whose training is in some content area rather than statistics.

    I suppose my point is there are limits to how far most people can embrace uncertainty. While a binary decision rule is a bad idea for all the reasons this blog routinely points out, my feeling is the best you’re going to do in clinical settings is get researchers to embrace the idea of assigning probabilities to an outcome being beyond some clinically-relevant threshold. It is usually a step too far to try and wave them off of thresholds altogether.

    Like it or not, the clinical practitioner is strongly inclined to think in terms of thresholds, cutpoints, recommendations and the like. What I’m trying to glean from this discussion is a best practice for suggesting statistically defensible formulations that relate explicitly to some threshold.

    I think I like the concept of estimating the “probability that the growth rate is at least 10% greater” from a given dataset. Perhaps that should be the “posterior probability that…” and it should be estimated from a Bayesian model. Is there any Frequentist alternative that gets at nearly the same concept?
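    For what it’s worth, here is a minimal sketch of what that posterior-probability calculation could look like. The model is a simple normal approximation, and the estimate, standard error, prior, and 10% threshold are all made-up numbers, purely to show the mechanics:

    ```python
    # Minimal sketch (toy numbers, toy model): posterior probability that an effect
    # exceeds a clinically relevant threshold, using a normal-normal approximation.
    import numpy as np

    rng = np.random.default_rng(0)

    est, se = 0.14, 0.06              # hypothetical estimate of the growth-rate difference
    prior_mean, prior_sd = 0.0, 0.2   # assumed weakly informative prior

    # Conjugate normal-normal posterior
    post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + est / se**2)

    draws = rng.normal(post_mean, np.sqrt(post_var), 1_000_000)
    print((draws > 0.10).mean())   # Pr(difference is at least 10%), given this model
    ```

    The interval version of the question (probability that the difference falls in some range) is computed the same way: count the share of posterior draws that land inside the range.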

    • Brent said,

      “I suppose my point is there are limits to how far most people can embrace uncertainty. While a binary decision rule is a bad idea for all the reasons this blog routinely points out, my feeling is the best you’re going to do in clinical settings is get researchers to embrace the idea of assigning probabilities to an outcome being beyond some clinically-relevant threshold. It is usually a step too far to try and wave them off of thresholds altogether.

      Like it or not, the clinical practitioner is strongly inclined to think in terms of thresholds, cutpoints, recommendations and the like. ”

      I think that a better solution is to put more effort in helping people learn to embrace uncertainty. The reality of uncertainty (and the related concept of continuum rather than dichotomous thinking) needs to be taught from the early grades, in science, math, physics, etc. Yes, that means helping teachers learn to accept uncertainty, so that’s an important place to focus.

      (Ironically, it seems that many young people and non-scientists are ahead of the game in accepting the concept of “non-binary” gender.)

  7. Jeff walker wrote:

    Stop the presses! Someone tell type I diabetics that their insulin treatment is unlikely to work because Banting, Best, et al. were probably wrong because they eyeballed their experimental results and there probably wasn’t an independent attempt at a direct replication using the same line of rabbit from the same source and the same source of pancreatic extract.

    https://www.physiology.org/doi/abs/10.1152/ajplegacy.1922.62.1.162?journalCode=ajplegacy

    This is a great paper. Where do you see any NHST (checking if two groups are different)? They are concerned about describing in detail quantitative observations like this:

    For purposes of physiological assay of insulin we consider that the most satisfactory basis at present is the number of cubic centimeters which lowers the percentage of blood sugar in normal rabbits to 0.045 in from 2 to 4 hours.

    Sorry, but I don’t think you have understood my earlier posts. As for the role of independent replication here, I will take a look at the literature from that time.

  8. We are on the verge of submitting a paper to the top journal in my field where we will openly discuss the ambiguities arising even from a “high power” replication attempt, the largest sample study ever done on a particular topic (the topic is too obscure for this audience to unpack details; it relates to how linguistic theory and real-time processing come face to face). It took us four years and thousands of dollars to do this work. I expect the paper to be rejected because it ends without providing “closure”. However, I hope to be wrong and hope to be surprised. I may report back on what happens. I see this submission as a test of the system within at least my field (psycholinguistics), and more broadly, a test of Andrew’s ideas in the real world. My hope is that Andrew’s ideas are now known widely enough that one could in principle write such a paper in a top journal without making overblown claims that are just dust in the air.
