Could the so-called “fragility index” be useful as a conceptual tool even though it would be a bad idea to actually use it?

Erik Drysdale writes:

You’ve mentioned the fragility index (FI) on your blog once before – a technique that makes a post-hoc assessment of the number of patients who would need to change event status for a result to become nonsignificant. It’s quite popular in medical journals these days, and I have helped colleagues at my hospital use the technique for two papers. I haven’t seen a lot of analysis done on what exactly this quantity represents statistically (except for Potter 2019).

I’ve written a short blog post exploring the FI, and I show that its expected value (under some simplifying assumptions) is a function of the power of the test. While this formula can be used post hoc to estimate the power of a test, it leads to a very noisy estimate (as you’ve pointed out many times before). On the plus side, this post-hoc power estimate is conservative and does not suffer from the usual problem of inflated power estimates, because it explicitly conditions on statistical significance being achieved.

Overall, I agree closely with your original view that the FI is a neat idea, but rests on a flawed system. However, I am more positive towards its practical use because it seems to get doctors to think much more in terms of sample size rather than measured effect size.

From my previous post, the criticism of the so-called fragility index is that (a) it’s all about “statistical significance,” and (b) it’s noisy. So I wouldn’t really like people to be using it. I guess Drysdale’s point is that it could be useful as a conceptual tool even though it would be a bad idea to actually use it. Kind of like “statistical power,” which is based on “statistical significance” which is kinda horrible, but is still a super-useful concept in that it gets people thinking about the connections between design, data collection, measurement, inference, and decisions.

I guess the right way to go forward would be to create something with the best of both worlds: “Bayesian power analysis” or “Bayesian fragility index” or something like that. In our recent work we’ve used the general term “design analysis” to capture the general idea but without restricting to the classical concept of “power,” which is tied so closely to statistical significance. Similarly, “fragility” is related to “influence” of data points, so in a sense these methods are already out there, but there’s this particular connection to how inferences will be summarized.
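To make the fragility-index idea concrete, here’s a minimal sketch for a two-arm trial with a binary outcome. The function name, the use of Fisher’s exact test, and the rule of flipping non-events to events in the arm with fewer events are my own simplifying choices for illustration; published definitions differ in the details.

```python
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Rough fragility-index sketch for a two-arm trial with a binary outcome.

    Counts how many patients would need to flip from non-event to event
    (in the arm with fewer events) before Fisher's exact test loses
    significance at the given alpha. Returns None if the original result
    is not significant, since the index is only defined in that case.
    """
    _, p = fisher_exact([[events_a, n_a - events_a],
                         [events_b, n_b - events_b]])
    if p >= alpha:
        return None

    flip_a = events_a <= events_b   # modify the arm with fewer events
    ev_a, ev_b, flips = events_a, events_b, 0
    while p < alpha:
        if flip_a:
            ev_a += 1
        else:
            ev_b += 1
        flips += 1
        _, p = fisher_exact([[ev_a, n_a - ev_a],
                             [ev_b, n_b - ev_b]])
    return flips

# Made-up example: 10/100 events in one arm vs. 25/100 in the other.
print(fragility_index(10, 100, 25, 100))
```

Running a count like this after the fact is exactly the noisy, significance-anchored summary discussed above; the sketch is just to show what the number is.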

30 thoughts on “Could the so-called “fragility index” be useful as a conceptual tool even though it would be a bad idea to actually use it?”

  1. reading between the lines of this post and many others, can we summarize like this?

    “Conducting a sound statistical assessment for most data sets is way too hard for most people and often isn’t even possible because their experimental configuration is so poor. But the demand (I deem the word “need” inappropriate, because no one “needs” to analyze worthless data) to get some “usable” result is so immense that, rather than teaching people to use sound statistical methods and experimental configurations, the discussion boils down to finding the least-unsound, least-complex method that people can and will actually attempt, balancing usability, outcome accuracy, and conceptual suitability, so that they get some result they can publish and claim is credible”

    In a way allowing these methods is kind of like legalizing dope: no one thinks dope is good for people, but the demand is so overwhelming and the damage from using it is – at least as far as we know now – not too terrible, so we legalize.

    Yes? No? Modifications?

    • I find your position too extreme. It seems like you would only have a few people do data analysis at all, given that you consider virtually all that is done to be worthless. I agree with the many criticisms of poorly designed research, improper measurements, and most importantly, inappropriate conclusions. However, the idea that it should be all thrown out seems too extreme to me – it looks like your frustration has boiled over.

      If we could drop the silly games surrounding the need to publish, get citations, get promotions, get grants, etc. that give rise to so many poorly conducted and reported studies, we would still have the fact that doing good statistical work is hard. But if we can strip away those nefarious influences, I think we would find that even poorly conducted statistical studies provide some information of use. They are all evidence – of varying quality to be sure. But poor quality is not equal to zero. Some studies should be thrown out – yes – but not every study that fails to be well designed needs to be erased. And I would like to see more people trying to do statistical work, even if they don’t measure up to your standards. I’d just like to see them be more humble about their work and more appreciative of constructive criticism.

      • But poor quality is not equal to zero

        When you have ~20% replication rates, that means the value of most studies is actually less than zero, since they’ve generated misinformation. *

        * Ignoring any tangential information generated

        • Failure to replicate does not mean the hypotheses are wrong. Still, your point is well taken – many studies indeed have values less than zero. But that shouldn’t be a reason to throw out, for example, all studies using NHST. As much as I dislike the way it is used, some of those studies still have value.

        • But that shouldn’t be a reason to throw out, for example, all studies using NHST. As much as I dislike the way it is used, some of those studies still have value.

          True.

          The issue is the filtering and that the studies are designed around the idea a significant p-value means the result is “real”. If not for that belief, a different type of (far more informative) study would be done in the first place. Eg, collect some kind of longitudinal data or dose response and fit a theoretical curve.

          The damage of NHST is more what we don’t see, a good study peppered with p-values is a bit annoying but can still be fine.

        • “The issue is the filtering and that the studies are designed around the idea a significant p-value means the result is “real”. If not for that belief, a different type of (far more informative) study would be done in the first place. Eg, collect some kind of longitudinal data or dose response and fit a theoretical curve.”

          I worked on a project with access to good-quality longitudinal data (anywhere from around 10 to nearly 20 measurement occasions over a period of a few months) on a couple hundred people, yet the analysis that actually was written up and published was simply a comparison, with p-values, between treatment and control groups at a few specific time points within that time series. I think it was three specific time points; for each, we chose the measurement occasion closest to the desired date.

          Each of those single-time comparisons was treated as a separate standalone answer to “did the groups differ at that time?”. When I suggested that the entire series ought to be modeled and the trajectories compared, with specific comparisons at those special time points, I was told that would be so complicated it was unpublishable and that the only valid answer to the research question was the p-value from those ridiculously simplistic comparisons.

          So that’s what we did and what was published. Crazy, huh?

        • Name:

          Yeah, I’ve seen that sort of thing! First step is to do lots of comparisons, second step is to look at everything that’s statistically significant, third step is to use the pattern of statistical significance to tell a story. The irony is that significance testing is supposed to protect against being fooled by noise, but when it’s used in this way, it becomes an added source of noise, garbling whatever data are there.
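          Here’s a tiny made-up simulation of that added noise: every true effect is exactly zero, but conditioning on p < 0.05 hands you a handful of large, confidently-signed estimates to build a story around. All the numbers are invented for illustration.

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)
          n_comparisons, n_per_group = 100, 20

          selected = []
          for _ in range(n_comparisons):
              # True difference is exactly zero in every comparison.
              a = rng.normal(0, 1, n_per_group)
              b = rng.normal(0, 1, n_per_group)
              t, p = stats.ttest_ind(a, b)
              if p < 0.05:
                  selected.append(b.mean() - a.mean())

          print(f"{len(selected)} of {n_comparisons} comparisons came out 'significant'")
          print("selected effect estimates:", np.round(selected, 2))
          # Every surviving estimate is far from the true value of zero, and the
          # mix of signs and magnitudes is exactly the raw material for a story.
          ```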

        • name said: “I was told that would be so complicated it was unpublishable”

          anon said: “Conducting a sound statistical assessment for most data sets is way too hard for most people ”

          See? :)) I don’t think my critique is too harsh at all.

          “name’s” suggestion that the data actually be analyzed *scientifically* was rejected because Actual Science is too complicated for most of the audience for the paper! Which is absolutely hilarious because while I have virtually no formal education in statistics, I would understand the results of the type of study “name” is suggesting without too much difficulty, even if I didn’t understand the details of the modeling.

        • “…“name’s” suggestion that the data actually be analyzed *scientifically* was rejected because Actual Science is too complicated for most of the audience for the paper! Which is absolutely hilarious because while I have virtually no formal education in statistics, I would understand the results of the type of study “name” is suggesting without too much difficulty, even if I didn’t understand the details of the modeling.”

          The key sticking point was that, under my proposed analysis, we’d be fitting trajectories to all the data and comparing estimates from those fitted trajectories at each of the pre-specified time points. My employers felt it would be impossible to convince reviewers we weren’t trying to pull a fast one by comparing “abstract” curves rather than “real” individual measurements.

          Of course that meant we dumped each “real” individual time-point measurement into a regression model, added a bunch of covariates, and compared the estimated group means from those models. So either way it’s model-estimated quantities; they just wanted models that used only 1/10th of the available data to inform the estimate. (A toy version of this contrast is sketched below.)

          My own statistics ability is quite limited (compared for instance to the norms on this blog!) so I totally agree doing good statistics is hard. But in my experience, the limits are set in a lot of real-world academic research by what random reviewers are comfortable with. Not what the analyst and investigators can actually implement. I sometimes feel the statistical discourse involved in research publishing is at the level of a freshman “math for history majors” course rather than an applied statistics seminar.
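          To make the contrast concrete, here’s a toy version with simulated data (obviously not the real study): compare the groups at one prespecified time using only the nearest raw measurements, then compare fitted values from a simple per-subject straight-line trajectory at the same time. The sample sizes, noise levels, and linear-trend assumption are all arbitrary choices for illustration.

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)
          n_per_group, n_times = 50, 15
          times = np.linspace(0, 1, n_times)
          t_star = 0.5  # prespecified comparison time

          def simulate(mean_slope):
              # Each subject gets an individual intercept and slope, plus measurement noise.
              intercepts = rng.normal(0, 0.5, n_per_group)
              slopes = rng.normal(mean_slope, 0.3, n_per_group)
              noise = rng.normal(0, 1.0, (n_per_group, n_times))
              return intercepts[:, None] + slopes[:, None] * times + noise

          control, treated = simulate(0.0), simulate(0.8)

          # Option 1: t-test on the single measurement closest to t_star.
          j = np.argmin(np.abs(times - t_star))
          print("single time point: ", stats.ttest_ind(treated[:, j], control[:, j]))

          # Option 2: fit a straight line per subject, compare fitted values at t_star.
          def fitted_at(data, t):
              return np.array([np.polyval(np.polyfit(times, y, 1), t) for y in data])

          print("fitted trajectories:", stats.ttest_ind(fitted_at(treated, t_star),
                                                        fitted_at(control, t_star)))
          # The trajectory comparison uses all 15 measurements per subject, so its
          # estimate at t_star is typically much less noisy than the single raw value.
          ```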

      • “It seems like you would only have a few people do data analysis at all”

        I’m not taking a stance on that. Just summarizing the problem, which seems to be that a lot of people are using tools improperly on a regular basis because they don’t know the proper way to use the tools, and it seems like it’s just too hard for them to grasp the proper way.

        Are “data analysis” and “statistics” the same? I don’t know, but I don’t think so. You can do a lot of useful and effective data analysis using only rudimentary statistical tools or even just percentages and ratios. There’s “statistics” in the layman’s sense, like baseball and basketball stats, but I personally wouldn’t count averages as “statistics” in a mathematical sense. God forbid there’s even a BA in any science who can’t calculate an average.

        Even stuff like adjusting for various factors doesn’t strike me as necessarily “statistics”. It’s just basic math. I mean, suppose we had some US subgroup suffering “disproportionately” from COVID, but then we adjusted or compensated for the lack of English speakers in the group. Whatever the answer, there’s not a lot of stats involved in that calculation. I’m no statistician and I could do it myself and be satisfied with the answer. What would make this, in my mind, a “statistics” project is inserting an NHST / p-value comparison to confirm or reject the conclusion.

        What say you? Where’s the line, or is there one?

        • I think you are trying to draw an impossible distinction. From your reasoning, it would appear that running a multiple regression model is not statistics unless you look at the p-values. If you look at a statistics text (painful though it is), what you are calling data analysis is statistics. Graphs (visualization), aggregation, ratios, simple inference, complex inference, time series, …. it’s all in there.

          As for a lot of people misusing the tools, absolutely. I think that is true of any subject. People misuse economics all the time – some of them are economists, but that is a minority of the misusers (my disagreements with my colleagues are not so much that they misuse the subject but that they claim it does more than it really does). Similarly, I suspect most statisticians (I guess indicated by their degrees, a PhD in this case) don’t misuse their tools. I, on the other hand, having an economics degree may well misuse some tools. Does this mean I should not do anything that is considered “statistics?” I think we’ve had this discussion on this blog before – and I think most people don’t believe statistical work should be confined to statisticians only.

          What I think we all agree on is that people need more and better education in statistics. Until that happens, we are likely to see more examples of bad work. Since you can’t outlaw the bad work, it is better to attack the bad incentives where we reward people for doing bad work if it results in publications, news stories, TED talks, and the like.

        • My own personal definition of statistics is “Any time we make a logical argument about the way the world works using data”

          So by that definition it’s all statistics.

        • Daniel: “Any time we make a logical argument about the way the world works using data”
          I agree, but that requires one to define/clarify what is meant by “logical argument”.

          To me, that logical argument should use a probability model to represent, in an idealized sense, how the unknown parameter values were set and how the data could have been generated given those parameter values. Hence the use of a probability model is required to make it specifically statistics rather than science more generally.

        • Even just graphing the data in a way that clarifies a relationship is doing statistics. Probability models are not required.

        • Christian and Daniel.

          I agree my definition is limiting, but with descriptive and exploratory statistics some have argued that expectations inform both, and that could be seen as implicitly involving probabilities. And formally, expectations define probabilities.

          Maybe I am not so concerned about what is considered science versus statistics, and I often use the phrase scientific statistics.

        • Daniel Lakeland said:

          ‘My own personal definition of statistics is “Any time we make a logical argument about the way the world works using data”’

          Wow!!!! It would be news to many generations of scientists that just arguing from data is “statistics”!!! I personally went through my BS and most of my MS without any stats and filled many spreadsheets with calculations and plots – all very sound. The one course I eventually took in stats was the biggest joke of a course in my entire STEM curriculum across eight years of education. We entered data into spreadsheets and calculated simple t-tests and other ridiculous nonsense.

          Dale Said:
          “If you look at a statistics text…what you are calling data analysis is statistics. Graphs (visualization), aggregation, ratios, simple inference, complex inference, time series, …. it’s all in there.”

          ??? Other than aggregation and time series, this pretty much describes many lower-level math and science courses, doesn’t it? I thought I learned ratios in middle school. I also did my first plots in HS algebra, right? y = mx + b? Even this HS stuff seems beyond the reach of many people using statistics.

          Doesn’t all of science use inference? I’ve been able to make pretty strong inferences from just plotting X-Y data and not even bothering with a regression. Regression can quantify the relationship, but if you have a solid relationship you don’t need it to **find** the relationship – a qualitative assessment is often sufficient. Of course if you’re noise mining then I guess MLR is useful to come up with *something*.

          I did an analysis a while back looking for certain kinds of advantages in shipping, where I did multiple linear regression on several different models. No worthwhile advantages emerged from the data because there aren’t any – the market efficiently allocates products to the most efficient mode of transportation. But you could tweak the noise all day long coming up with goofy little stories.

          Ha, that’s funny, that brings us back full circle to “name’s” story above. Maybe if you did a “real” model on it you’d just be twiddling the noise – which is why NHST is used in the first place!!!

        • For me the distinction between science and statistics is that science is the process of creating mechanistic explanations for phenomena, while statistics is the process of arguing that a given mechanistic explanation, together with given quantitative values for the unknowns, forms a reasonably accurate explanation for the measurements.

          Statistics is a part of science but not the whole of it.

        • research hypothesis -> statistical hypothesis -> data

          The statistical hypothesis is derived from the research hypothesis but with additional assumptions about uncertainty in the data and parameters.

          NHST is not statistics and plays no role in science, since the statistical hypothesis is *not* derived from the research hypothesis.

        • @Keith: “I agree my definition is limiting but with descriptive and exploratory statistics some have argued expectations inform both and that could be seen as implicitly involve probabilities. And formally expectations define probabilities.” In my view formal probabilities are *models* for expectations; they are a human invention and do not automatically “exist”. You may have expectations, but as long as you’re not using a probability model explicitly… you’re not using a probability model.

    • Anon:

      That’s a bit too harsh. I’d rather say that statistics is hard, and researchers have goals that are not always well served by existing methods, so they come up with new methods to address these goals, and we can try to figure out what these goals are and come up with methods to address them.

      I don’t think the “legalizing dope” analogy works because statistical methods are already legal.

  2. I’ve found the comments by A. Althouse quite informative: https://discourse.datamethods.org/t/fragility-index/1920/8

    > For one, the FI is just another flipped-around version of a p-value.
    >
    > For two, rejection (or skepticism) of results based on the FI is at odds with the balance any RCT must strike: recruiting the minimal number of patients required to answer the study question (with whatever operating characteristics the trial is designed to have). I said this once on Twitter, but if people have a problem with “fragile” results based on the p<0.05 cutoff, they’re basically saying that results at p<0.05 aren’t good enough (which might be a fair discussion!) but that instead of calculating FI, they ought to be advocating for lower alpha levels (or Bayesian approaches with a very high threshold of certainty) rather than just pointing out “fragility” of published results.

    Which I think is an argument that also translates quite well to the hypothetical "Bayesian fragility index."

  3. Andrew-
    The abstract to Nassim Taleb’s book, Antifragile: Things That Gain From Disorder, reads: “Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”

    Would you care to speculate on how the FI for patients in clinical trials could be expanded or generalized to include Taleb’s concept of anti-fragile?

    But respond only if it’s an interesting question…

  4. Andrew, do you have the same criticisms of:

    An automatic finite-sample robustness metric (by Broderick, Giordano, and Meager)
    https://arxiv.org/abs/2011.14999
    https://www.chamberlainseminar.org/past-seminars/autumn-2021#h.2b54pwliqf8o

    They’re still focused on influence and statsig, but also on what fraction of data needs to be dropped to change the sign of the estimate.

    (I played around with this here: https://michaelwiebe.com/blog/2021/01/amip)
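    For intuition, here’s a brute-force version of the sign-flipping idea on made-up data (not the authors’ method, which uses an influence-function approximation to avoid refitting): rank observations by a leave-one-out influence measure for the coefficient of interest and greedily drop the most supportive ones until the OLS sign flips.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    x = rng.normal(size=n)
    y = 0.05 * x + rng.normal(size=n)  # small true effect, so the sign is easy to flip

    def ols_slope(x, y):
        X = np.column_stack([np.ones_like(x), x])
        return np.linalg.lstsq(X, y, rcond=None)[0][1]

    beta = ols_slope(x, y)

    # Leave-one-out change in the slope, as a crude influence measure.
    loo = np.array([beta - ols_slope(np.delete(x, i), np.delete(y, i)) for i in range(n)])
    order = np.argsort(-np.sign(beta) * loo)  # points most supportive of the current sign first

    keep = np.ones(n, dtype=bool)
    for k, i in enumerate(order, start=1):
        keep[i] = False
        if np.sign(ols_slope(x[keep], y[keep])) != np.sign(beta):
            print(f"sign flips after dropping {k} of {n} points ({100 * k / n:.1f}%)")
            break
    ```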

    • Michael:

      The authors of this article are my friends! I looked at it a while ago but not in full detail. I’m somewhere in the middle of a research project on influence and model understanding, and it’s my plan to carefully read this and other related papers, but I haven’t gotten to it yet. I hadn’t realized the paper looks at statistical significance, but I guess this is relevant for a lot of statistical practice. I’ve written some papers looking at the statistical properties of inferences conditional on statistical significance, and that’s a similar idea.

    • Wonder if this old trick of mine might be relevant here – focus on a given parameter of interest and then jointly estimate all the other parameters somehow (estimate, profile/maximize, or integrate/Bayes) to get a one-dimensional likelihood for the interest parameter for each individual observation. That should give an ordering of the individual observations by how much discarding them would move the combined-likelihood MLE in a given direction.

      Read at own risk https://statmodeling.stat.columbia.edu/wp-content/uploads/2011/03/plot111.pdf
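      A toy version of what I mean, for a normal linear model with the slope as the interest parameter and the intercept as the nuisance parameter held at its full-data estimate (everything here, including sigma = 1, is a simplification for illustration):

      ```python
      import numpy as np

      rng = np.random.default_rng(3)
      n = 100
      x = rng.normal(size=n)
      y = 1.0 + 0.3 * x + rng.normal(size=n)  # intercept is the nuisance parameter

      # Full-data least-squares fit (sigma taken as 1).
      X = np.column_stack([np.ones(n), x])
      a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

      # Each observation's score in b at the MLE, with the intercept held at a_hat:
      #   d/db log N(y_i | a_hat + b * x_i, 1), evaluated at b = b_hat.
      score_i = x * (y - a_hat - b_hat * x)

      # The total score is ~zero at the MLE, so dropping the observations with the
      # most negative contributions leaves a positive remaining score and the
      # refitted slope typically moves up (and vice versa).
      drop = np.argsort(score_i)[:5]            # five most negative contributions
      keep = np.setdiff1d(np.arange(n), drop)
      b_new = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0][1]
      print(f"b_hat = {b_hat:.3f}; after dropping the 5 lowest-score points: {b_new:.3f}")
      ```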

  5. Regarding the Bayesian power analysis, I worked with a statistician a few years back on modelling interval-censored plant abundance data and did something in this area: https://peerj.com/preprints/2532/
    It was in the context of biodiversity monitoring, so we looked at the proportion of the slope parameter posterior that was below zero across simulations. It spent 12 months in peer review, was rejected, and by that point I had other things to do and so it’s only a preprint.
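    For anyone curious, here is a stripped-down sketch of that kind of simulation (not our actual interval-censored abundance model; a plain linear decline with a flat prior and a normal-approximation posterior stands in for it):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n_sims, n_years, sites_per_year = 200, 10, 5
    true_slope, noise_sd = -0.02, 0.5
    years = np.repeat(np.arange(n_years), sites_per_year)

    prob_decline = []
    for _ in range(n_sims):
        y = true_slope * years + rng.normal(0, noise_sd, years.size)
        # With a flat prior, the posterior for the slope is approximately normal
        # around the least-squares estimate with the usual standard error.
        fit = stats.linregress(years, y)
        prob_decline.append(stats.norm.cdf(0, loc=fit.slope, scale=fit.stderr))

    prob_decline = np.array(prob_decline)
    print("average Pr(slope < 0):", prob_decline.mean().round(2))
    print("fraction of simulated datasets with Pr(slope < 0) > 0.9:",
          (prob_decline > 0.9).mean().round(2))
    ```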
