Power analysis and NIH-style statistical practice: What’s the implicit model?

So. Following up on our discussion of “the 80% power lie,” I was thinking about the implicit model underlying NIH’s 80% power rule.

Several commenters pointed out that, to have your study design approved by NIH, it’s not required that you demonstrate that you have 80% power for real; what’s needed is to show 80% power conditional on an effect size of interest, and also you must demonstrate that this particular effect size is plausible. On the other hand, in NIH-world the null hypothesis could be true: indeed, in some way the purpose of the study is to see, well, not if the null hypothesis is true, but if there’s enough evidence to reject the null.

So, given all this, what’s the implicit model? Let “theta” be the parameter of interest, and suppose the power analysis is performed assuming theta = 0.5, say, on some scale.

My guess, based on how power analysis is usually done and on how studies actually end up, is that in this sort of setting the true average effect size is more like 0.1, with a lot of variation: perhaps it’s -0.1 in some settings and +0.3 in others.

But forget about what I think. Let’s ask: what does the NIH think, or what distribution for theta is implied by NIH’s policies and actions?

To start with, if the effect is real, we’re supposed to think that theta = 0.5 is a conservative estimate. So maybe we can imagine some distribution of effect sizes like normal with mean 0.75, sd 0.25, so that the effect is probably larger than the minimal level specified in the power analysis.

Next, I think there’s some expectation that the effect is probably real, let’s say there’s at least a 50% chance of there being a large effect as hypothesized.

Finally, the NIH accepts that the researcher’s model could’ve been wrong, in which case theta is some low value. Not exactly zero, but maybe somewhere in a normal distribution with mean 0 and standard deviation 0.1, say.

Put this together and you get a bimodal distribution:
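
As a rough sketch of what that implied prior looks like, here’s a small simulation of the 50/50 mixture of normal(0.75, 0.25) and normal(0, 0.1) described above (the 50% mixing weight and both sets of parameters are just the guesses from this post):

```python
import numpy as np

rng = np.random.default_rng(1)

# Implied NIH prior on theta, as guessed above:
#   with prob. 0.5 the effect is "real":   theta ~ normal(0.75, 0.25)
#   with prob. 0.5 the model was wrong:    theta ~ normal(0, 0.1)
n = 100_000
effect_is_real = rng.random(n) < 0.5
theta = np.where(effect_is_real,
                 rng.normal(0.75, 0.25, n),
                 rng.normal(0.00, 0.10, n))

# Tabulate the implied distribution: two well-separated bumps, with
# relatively little mass in the moderate range in between.
for lo, hi in [(-0.3, 0.0), (0.0, 0.2), (0.2, 0.5), (0.5, 0.8), (0.8, 1.5)]:
    share = np.mean((theta > lo) & (theta <= hi))
    print(f"P({lo:+.1f} < theta <= {hi:+.1f}) ~ {share:.2f}")
```

A histogram of theta shows the two bumps, and the valley between them, directly.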

And this doesn’t typically make sense: that something would either have a near-zero, undetectable effect or a huge effect, with little possibility of anything in between. But that’s what the NIH is implicitly assuming, I think.

33 thoughts on “Power analysis and NIH-style statistical practice: What’s the implicit model?”

  1. Actually, I think that bimodal distribution makes a lot of sense. Remember that this isn’t the distribution of all effect sizes, but rather effect sizes that come up to the NIH for review. Some projects are grounded in deep understanding of the science, some are not.

    I’m not claiming to know the mixing probabilities but from my experience, bimodal is very reasonable.

    • *Not a life scientist, so this is all conjecture*

      I would guess the plausibility of a bimodal effect size distribution varies across areas. If you take something like an understudied infectious disease where your lab bench experiments were wildly successful and you now want to try animal trials, it’s not so implausible that either your treatment works similarly to the lab bench and cures the disease, or it is entirely ineffective because the mechanism fails to translate within the body, the immune system destroys it, the liver filters it out, it fails to cross the blood-brain barrier, etc. When you’re basically trying to kill an organism, there’s an actual binary outcome of whether you do so or not, and the various infectious organisms may be sufficiently similarly situated that it works across all of them or doesn’t.

      Now take type 2 diabetes for comparison, and a bimodal distribution seems much less likely, because it’s the result of a lot of different biological pathways that are deeply linked to other functions of the human body. Smaller effect sizes centered close to zero are probably a more plausible prior there.

      • In general, I agree; sometimes a theory pans out (i.e., mouse model close enough to the mechanisms in humans) and sometimes it does not.

        Not 100% sure if I fully understand what is meant about type 2 diabetes. I think you are saying that type 2 diabetes is really an umbrella for a large number of different issues, so the chance that one treatment could be effective on a large percentage of the population is low? If I understand your point correctly, I agree, and would lump that into “improper theory”. That is, if you told the NIH that you believed your experiment would have 80% power, but that was conditional on the idea that 75% of type 2 diabetes cases were caused by mechanism A, and it turns out only 5% are, then I would argue that your model for type 2 diabetes is flawed.

        But yes, that does blur the lines a little and may well lead to a distribution that’s just more spread out rather than bimodal.

        • I think the mix of causes is part of it. Essentially the idea would be that type 2 diabetes is 1) not mono-causal (even in any particular case) and 2) the causes are big long-term problems like obesity, blood pressure, cholesterol, diet, etc. that are more likely to be made better or worse rather than cured or not cured. So even if your theory does pan out, your diabetes drug may be more incremental than curative.

          So as a huge oversimplification, the severity of different diseases might follow an underlying causal structure like these:

          severity of rabies = 1 * rabies virus present
          severity of diabetes = 0.1 * BMI + 0.3 * cholesterol + 0.7 * BP + 0.2 * age

          So the diabetes medications might be attempting to reduce the size of one of those coefficients or improve the distribution of BMI, cholesterol, or blood pressure, but even if they are totally successful the effect size on diabetes will be moderate.
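
          To put rough numbers on that, here’s a tiny sketch of the diabetes toy model (the coefficients are the made-up ones above; the 0–1 risk-factor scores are my own invented placeholders):

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          n = 100_000

          # Hypothetical risk-factor scores in [0, 1] (purely illustrative).
          bmi, chol, bp, age = rng.random((4, n))

          # Toy diabetes severity model from above.
          severity = 0.1 * bmi + 0.3 * chol + 0.7 * bp + 0.2 * age
          # Even a blood-pressure drug that works perfectly only removes the 0.7 * BP term:
          severity_treated = 0.1 * bmi + 0.3 * chol + 0.2 * age
          print("mean severity, untreated:", round(severity.mean(), 2))          # ~0.65
          print("mean severity, treated:  ", round(severity_treated.mean(), 2))  # ~0.30
          # A "totally successful" drug still yields a moderate effect, not a cure.
          ```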

          A proposed rabies drug by contrast might well be aimed at killing the rabies virus and will either succeed or fail at that task. If I know that some large percentage of initially promising (theoretically and in earlier stage empirical tests) drugs aimed at that disease ultimately don’t kill the virus, then it still makes sense to have a big mass near zero. But that mass at zero should be tight because rabies is just bad unless you fix it, so it’s all or nothing.

          Again these are simplifications, but if your theoretical model of the disease looks more like rabies, a bimodal distribution looks plausible. If it looks more like diabetes a unimodal distribution seems more likely. Most diseases are probably somewhere in between.

    • To me, the bimodal distribution may be true across treatments, but it does not make sense for specific treatments. For example, some treatments are novel and the mechanisms behind them are not well understood. In that case, the mass around 0 makes the most sense. Other treatments, however, have long histories with many well-understood mechanisms, and studies may be designed to test more specific manipulations, populations, etc—in these cases, the distribution with mass mostly above 0 makes sense.

      Therefore, one may expect that a bimodal distribution is a good prior across all treatments being tested, but not necessarily for a given treatment. When calculating power (which is conditional on your expected effect size), you know whether or not the treatment you are testing is novel or well-understood, so there should really never be a case where the prior for your effect size is bimodal like in the post above.

      • I’ve never come across a new treatment whose proponents/inventors did not think that they had deep understanding of the biology pointing to the particular drug/its mechanism playing a key/central/whatever role in the particular disease. Usually when a drug advances to clinical trials, there have been quite a few positive experiments in animal models, which were conducted because there was some literature/mechanistic understanding/in-vitro experiments/something else that suggested it was worth trying. So, on the whole, the distribution of effects seen with other drugs in a similar situation would indeed make sense as a prior.

        Well, actually, the distribution of observed effects (which, from a couple of publications on this that I have seen, is not too far off a N(0,1) for log-rate/odds/hazard ratios – with perhaps a slight shift towards the desired effect direction) is probably not what you truly want, but rather an estimate of the distribution of true effects that would have produced this distribution of observed effects. Whether that’s a bimodal distribution is very, very hard to say. It depends on what distribution of effect sizes the discovery pipeline delivers. I’d not be surprised if it were bimodal, but I’d also not be surprised if it were unimodal.

      • Other treatments, however, have long histories with many well-understood mechanisms

        What medical treatment do you think has a “well-understood mechanism”?

        Using a loose definition of “treatment”, let’s allow “smoking causes lung cancer”, which is probably the most confident claim made by 20th and 21st century medicine, and it’s claimed that smoking is a trillion-dollar-per-year problem [1]. Do you think that process is “well-understood”?

        [1] http://fortune.com/2017/01/10/smoking-costs-who-cancer-institute-trillion/

        • Wouldn’t you consider at least the mechanism of action of insulin (in its different forms: animal-sourced, recombinant human insulin, insulin analogs) to be “well-understood”? What about enzyme replacement therapies? Coagulation factors?

        • Recently I read an article that had to do with the microbiome, regulation of insulin sensitivity and brain inflammation. Thinking it might demonstrate a point I just now went to pubmed to search for it; and immediately realized that even this seemingly new/unique angle on “how insulin works” was just a needle in a haystack of gut bacteria, brains and inflammasomal insulin effect modulation discoveries. It’s just like my junior high sci teacher said: “every answer births a litter of questions”.

        • If insulin was really well-understood would people still need to receive lifelong treatment?

          I don’t see why any understanding of the mechanism is needed at all. It’s just “this person is healthy and we see their blood levels of insulin do this, while this person is unhealthy and has no insulin, let’s try to get the unhealthy person to have the same blood insulin levels as the healthy person”.

          As to the details of insulin signalling, I’d have to take a deeper look. However, after a quick search using terms I thought might yield “contrarian results”, I see that (contrary to ~100 years of medical thought) there are some recent claims that:

          1) Insulin is not required for blood sugar regulation after all:
          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4469812/

          2) Type 1 diabetics are still capable of producing insulin:
          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3855529/

        • > If insulin was really well-understood would people still need to receive lifelong treatment?

          Why not? You may understand well how an engine works, but you still need to fill up the tank.

          > I don’t see why any understanding of the mechanism is needed at all.

          I don’t say understanding is “needed”. The question is to what extent you think the mechanism of action of insulin is understood. Do you think metabolic pathways are completely made up?

        • The question is to what extent you think the mechanism of action of insulin is understood.

          I don’t know enough about it, however based on areas where I do have deeper knowledge I would guess no. The literature is probably filled with incorrect and misleading claims but since direct replications are so rare you need to look really hard at the methods to figure out what went on.

          Do you think metabolic pathways are completely made up?

          No, not completely. A lot of the metabolic research got done before NHST infested biomedical research, so it is probably much more reliable. E.g., see how they come up with chemical formulas and check the precise quantitative predictions of their model against the data: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266984/

          Also, I wouldn’t say metabolic pathways have much to do with the mechanism of insulin per se; that’s more of a newer topic.

          And look at this early (and properly interpreted) significance test I found:

          The oxygen/pyruvate ratio of the avitaminous brain is significantly lower than that of the normal. The probable error (i.e. P.E. = 0.6745 √(σ1²/n1 + σ2²/n2), where n is the number of observations and σ is the standard deviation) of the hypoiodite results of the normal and the avitaminous brains is 13.2, whilst the difference between the two means is 105, i.e. eight times as big as the P.E. Similarly the P.E. for the bisulphite results is only 16.4 whilst the difference between the means is 125, or about 7.6 times as large.

          It is not yet possible to say what this difference signifies. There are various possible explanations which may be mentioned.

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1267123/
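
          For what it’s worth, here’s a quick check of the quoted arithmetic (the probable-error formula as I read it, written out as a function; the excerpt gives only the P.E.s and the mean differences, so only the ratios can be checked):

          ```python
          import math

          def probable_error_of_difference(sd1, n1, sd2, n2):
              # P.E. = 0.6745 * sqrt(sd1^2/n1 + sd2^2/n2), i.e. 0.6745 times the
              # standard error of the difference between two means.
              return 0.6745 * math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

          # Ratios of mean difference to P.E. quoted in the excerpt:
          print(105 / 13.2)   # ~8.0 ("eight times as big as the P.E.")
          print(125 / 16.4)   # ~7.6
          ```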

        • My guess is that there are a lot of individual differences in response to insulin. For example, some people have an allergy to sheep insulin (By allergy, I don’t just mean itching or something like that — it’s an immune response that destroys the insulin and may have other effects as well.)

        • https://en.m.wikipedia.org/wiki/Insulin_signal_transduction_pathway

          Maybe this is all incorrect and misleading. I really don’t know.

          I know it looks so thorough and complicated, with so many highly educated people involved, that you think “how could they come up with something like that if it was so wrong?”, but it would not surprise me at all. I mean, think about the millennia during which the most highly educated people in Europe spent their lives arguing about (now considered largely irrelevant) theology.

          Start with the first step:

          The glucose passively diffuses in the beta cell through a GLUT-2 vesicle [sic?].

          https://en.wikipedia.org/wiki/Insulin_signal_transduction_pathway#cite_note-2

          I would first look for the reference they use for this (it seems to be a textbook, so I’m not wasting my time on it) and see if there is any quantitative info about this process.

          1) How many glucose molecules diffuse through a single GLUT-2 protein per second for the range of known physiological (and pathological) concentrations?
          2) How many of these GLUT-2 proteins are there on the surface of a beta cell at any one time?
          3) What is the size of these GLUT-2 proteins?
          4) What is the net flux of radiolabeled glucose into the beta cells at various concentrations?
          5) How many GLUT-2 proteins are translated per mRNA in beta cells?
          6) How long does it take for a GLUT-2 protein to go from translation to the plasma membrane in beta cells?
          7) What is the lifetime of a GLUT-2 protein in beta cells?

          It is when I start asking quantitative questions like these that I’ve hit dead ends or seen the models fall apart. I haven’t checked for this case, so for all I know it checks out, but I would not bet my life on that.

          Then here is a paper I found indicating that the claim about GLUT-2 concerns rodent cells, but it seems things work totally differently in human cells:

          SLC2A2 encoding glucose transporter -2 (GLUT2) acts as the primary glucose transporter and sensor in rodent pancreatic islets and is widely assumed to play a similar role in humans. In healthy adults SLC2A2 variants are associated with elevated fasting plasma glucose (fpg) concentrations but physiological characterisation does not support a defect in pancreatic beta-cell function. Interspecies differences can create barriers for the follow up of disease association signals.

          https://www.ncbi.nlm.nih.gov/pubmed/21920790

          Is it really that things are different between rodents and humans, or that nobody ever replicated the original study in rodent cells? Perhaps after the first study “proving” it was GLUT-2 people threw away conflicting results assuming the experiment “didn’t work”? There are now 5 references for me to check for the claim:

          SLC2A2 encoding glucose transporter -2 (GLUT2) acts as the primary glucose transporter and sensor in rodent pancreatic islets

          And that is where the claims about the very first step of insulin signalling lead… it is all very time-consuming to check this stuff.

        • If your point is that nothing is well-understood, I can concede that. After all, in the best case a perfect understanding of something would take us down to fundamental physics. And our understanding of fundamental physics is far from perfect.

        • Nope, I’m saying I wouldn’t be surprised to discover the so called “Glucose Transporter” proteins do not transport glucose into the cell at all.

        • Ok, so if we agree that bio-medical science built upon NHST burned down, fell over, and sank into the swamp how do we ensure that the fourth castle stays up?

        • Carlos wrote:

          Ok. Let’s just say that “no biological process is well understood”.

          I really wouldn’t even put terms like “well-understood” or “deeply understood” in the same sentence as “biological process”. It makes it seem like that’s a plausible goal to aim for in the near future.

          I’d reserve terms like that for when there is a computational/mathematical model of the process, some previously unthought-of and otherwise surprising predictions about the behavior of the system are derived from the model, and then those predictions turn out to be accurate according to multiple teams collecting data.

          Thanatos wrote:

          Ok, so if we agree that bio-medical science built upon NHST burned down, fell over, and sank into the swamp how do we ensure that the fourth castle stays up?

          Assuming you are responding to Carlos, this seems to be some kind of strawman (not “well-understood” is not the same as sinking “into the swamp”). Also, what is the “fourth castle”?

        • “All science is either physics or stamp collecting.”

          That really isn’t my view. There is nothing special about physics in terms of modelling a process and making/testing quantitative predictions. People used to do this in bio all the time. I shared one example right here:

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266984/

          Go back and read the pre-1940 bio literature on some topic that interests you. They used to approach things like a hard science and began making progress like we saw in the hard sciences; then NHST killed the momentum. It is the approach, not the subject matter, that makes the difference.

    • This is depressing. I thought we were walking around in circles in the swamp of statistical shenanigans, but it seems we are actually wading in deeper. Interestingly, this article has been cited only once, by Ioannidis and Cristea in an article surveying the display of P-values in the journals Science, Nature, and PNAS, and here we find…

      “Use of Bayesian methods was scant (2.5%) and rarely (0.7%) articles relied exclusively on Bayesian statistics.”

      So there’s your problem.

      https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5953482/

      • Given current abilities and understanding of Bayesian approaches that could well be just going from the frying pan into the fire.

        As one of my colleagues once explained to me when I was miffed about being dissed for being too interested in the motivations and checking of priors in a keynote Bayesian presentation (it’s not kosher to check priors so carefully): “most Bayesians just want to pull the Bayesian crank, claim to have solved the problem, and quickly move on.”

        https://www.youtube.com/watch?v=iplpKwxFH2I

  2. I’ve observed rather substantial researcher degrees of freedom in power analyses, principally in the choice of effect size. Investigators try to maximize the funding while minimizing the chance of rejection, and to some extent this is reflected in the selection or rejection of variables for which they choose to report the power analyses, completely aside from arbitrarily varying the scale of the effect size. I’ve learned that, when investigators are pushing to report overly rosy power analyses, it is beneficial to frankly explain to them that they may be setting themselves up for failure to find anything worth reporting once they complete the study. At the opposite extreme, I see investigators skewing the effect size downward to increase the N. Often they are doing the power analysis to satisfy the NIH, but their real interest is in obtaining data on unrelated variables which lack preliminary data to support power analyses. A smart grant reviewer will understand that this is going on, think a bit more deeply, and factor it into ratings appropriately.
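
     To illustrate how much leverage the effect-size choice gives, here’s a rough sketch using the standard normal-approximation sample-size formula for a two-group comparison at two-sided alpha = 0.05 and 80% power (the effect sizes are just illustrative):

     ```python
     from scipy.stats import norm

     def n_per_group(effect_size, alpha=0.05, power=0.80):
         # Normal-approximation sample size for comparing two means with
         # standardized effect size d: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2
         z_alpha = norm.ppf(1 - alpha / 2)
         z_power = norm.ppf(power)
         return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

     for d in (0.2, 0.3, 0.5, 0.8):
         print(f"assumed d = {d}: about {n_per_group(d):.0f} per group")
     # d = 0.2 needs ~392 per group, d = 0.5 needs ~63, d = 0.8 only ~25,
     # so nudging the assumed effect size moves the required N (and budget) a lot.
     ```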

  3. This is taking the system more seriously than it’s intended to be taken. As you wrote, ‘it’s not required that you demonstrate that you have 80% power for real; what’s needed is to show 80% power conditional on an effect size of interest, and also you must demonstrate that this particular effect size is plausible.’ In other words, the point of the power calculation is a sanity check for grant writer and reviewer. It’s there to weed out extremely unrealistic assumptions. As soon as you’re in the realm of plausible but overly optimistic power expectations, other aspects like theoretical soundness, measurement validity, etc. are more relevant than power.

    • Markus:

      The point of my above analysis is not that researchers literally think the distribution of effect sizes is bimodal. Rather, my point is that, under a realistic distribution of effect sizes, the NIH system is seriously flawed, as it provides a series of incentives and requirements leading to noisy studies and future overestimation of effect sizes. Type M and Type S errors.
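
      As a concrete sketch of that last point: suppose a study is sized for 80% power at theta = 0.5 but the true effect is only 0.1, as in the post above. Simulating what gets through the significance filter shows the Type M and Type S problems (all the numbers here are the illustrative ones from the post):

      ```python
      import numpy as np

      rng = np.random.default_rng(2)

      true_theta = 0.1
      se = 0.5 / 2.8   # 80% power at theta = 0.5 implies se ~ 0.5 / (1.96 + 0.84)

      estimates = rng.normal(true_theta, se, 1_000_000)
      significant = np.abs(estimates) > 1.96 * se

      print("actual power at theta = 0.1:", significant.mean())   # far below 0.80
      print("Type S: share of significant results with the wrong sign:",
            np.mean(estimates[significant] < 0))
      print("Type M: mean |estimate| among significant results (vs. true 0.1):",
            np.mean(np.abs(estimates[significant])))
      ```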

    • Investigators use overly optimistic effect size assumptions to make the study look more impactful and more likely to succeed. This overly optimistic assumption is then used to justify a sample size that is often powered to just 80%, to make the budget look attractive too. The end result is an underpowered study.
