Yes, it makes sense to do design analysis (“power calculations”) after the data have been collected

This one has come up before but it’s worth a reminder.

Stephen Senn is a thoughtful statistician and I generally agree with his advice but I think he was kinda wrong on this one. Wrong in an interesting way.

Senn’s article is from 2002 and it is called “Power is indeed irrelevant in interpreting completed studies.” His point is that you perform power analysis (that is, figuring out the probability that a study will attain statistical significance at some predetermined level, conditional on some assumptions about effect sizes and measurement error) in order to design a study. Once the data have been collected, so the reasoning goes, the power calculation is irrelevant.

Senn summarizes:

The definition of a medical statistician is one who will not accept that Columbus discovered America because he said he was looking for India in the trial plan. Columbus made an error in his power calculation—he relied on an estimate of the size of the Earth that was too small, but he made one none the less, and it turned out to have very fruitful consequences.

The Columbus example works for Senn because, although Columbus’s theoretical foundation was wrong, he really did find a route to America. Low-power statistical studies are different, because there, “statistically significant” and thus publishable results can be spurious. And, the lower the power, the more likely the estimate is vastly inflated and possibly in the wrong direction.

In a low-power study, the seeming “win” of statistical significance can actually be a trap. Economists speak of a “winner’s curse” in which the highest bidder in an auction will, on average, be overpaying. Research studies—even randomized experiments—suffer from a similar winner’s curse, that by focusing on comparisons that are statistically significant, we (the scholarly community as well as individual researchers) get a systematically biased and over-optimistic picture of the world.
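
A small simulation can make this significance filter concrete. Everything in it is an illustrative assumption rather than a result from any real study: a hypothetical true effect of 5, a standard error of 20, normally distributed estimates, and a two-sided 5% test. The function name `significance_filter` is likewise made up for this sketch.

```python
import random
from statistics import NormalDist, mean

def significance_filter(true_effect, se, n_sims=100_000, alpha=0.05, seed=1):
    """Simulate many replications of one study; return all estimates and the 'significant' ones."""
    rng = random.Random(seed)
    # An estimate is declared significant when |estimate| exceeds this threshold.
    crit = NormalDist().inv_cdf(1 - alpha / 2) * se
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > crit]
    return estimates, significant

# Hypothetical numbers: true effect 5, standard error 20.
estimates, significant = significance_filter(true_effect=5.0, se=20.0)
print("mean of all estimates:          ", round(mean(estimates), 1))
print("mean |estimate| when significant:", round(mean(abs(e) for e in significant), 1))
print("share of replications significant:", round(len(significant) / len(estimates), 3))
```

Under these assumptions every estimate that clears the significance threshold is, by construction, larger in magnitude than about 39, so the published subset cannot look anything like a true effect of 5.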

John Carlin and I argue in our recent paper that post-data design analysis (I prefer not to speak of power analysis because power is all about declarations of statistical significance, and I’m more interested in type M and type S errors) can be a key step in avoiding this winner’s curse.

In short: design analysis can help inform a statistical data summary.

From a frequentist perspective, design analysis, whether before or after data are collected, gives you a distribution or set of distributions on which to condition in order to evaluate frequency properties of statistical procedures. (In frequentist analysis you work with distributions conditional on the true parameter value theta, and the key step of design analysis is to posit some reasonable range of possibilities for theta.)

From a Bayesian perspective, design analysis works because it’s a way to include prior information and indeed can be thought of as an approximation to a fully Bayesian analysis.

In any case, the hypothesized parameter values in the design analysis should come from external subject-matter knowledge, not from the noisy point estimate from the data at hand. For example, if you run a little experiment and you get an estimate of 42 with a standard error of 20, you should probably not base your design analysis on an assumed underlying effect size of 42 or even 22. So, sure, if that’s what you were going to do, I agree with Senn that this is a bad idea.
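
Here is a minimal sketch of what such a design analysis might look like for the little experiment above (standard error 20), using a normal approximation. The function name `design_analysis` and the candidate effect sizes 2, 5, and 10 are hypothetical stand-ins for whatever external subject-matter knowledge would suggest; they are assumptions for illustration only.

```python
from statistics import NormalDist

def design_analysis(true_effect, se, alpha=0.05):
    """Return (power, type S error rate, type M exaggeration ratio).

    Normal approximation; assumes true_effect > 0 (a hypothesized value
    chosen from outside information, not the observed estimate).
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2) * se      # |estimate| needed for significance
    hi = (crit - true_effect) / se
    lo = (-crit - true_effect) / se
    p_pos = 1 - z.cdf(hi)                     # significant, correct sign
    p_neg = z.cdf(lo)                         # significant, wrong sign
    power = p_pos + p_neg
    type_s = p_neg / power                    # P(wrong sign | significant)
    # E[|estimate|; significant], from truncated-normal means
    mean_abs = (true_effect * p_pos + se * z.pdf(hi)) \
             + (-true_effect * p_neg + se * z.pdf(lo))
    type_m = mean_abs / power / true_effect   # expected exaggeration factor
    return power, type_s, type_m

# Hypothetical effect sizes, all well below the noisy estimate of 42.
for effect in (2.0, 5.0, 10.0):
    power, type_s, type_m = design_analysis(effect, se=20.0)
    print(f"effect={effect:5.1f}  power={power:.3f}  "
          f"type S={type_s:.3f}  type M={type_m:.1f}")
```

The point of running it over a range of effect sizes, rather than a single value, is that the design analysis conditions on a set of plausible possibilities for theta.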

Anyway, I’m not trying to criticize Senn for writing a paper in 2002 that does not anticipate the argument from our paper published twelve years later. I just think it’s interesting that a recommendation which in its context made complete sense—on both theoretical and practical grounds—now needs updating given our new understanding of how to interpret published results.

55 thoughts on “Yes, it makes sense to do design analysis (“power calculations”) after the data have been collected”

  1. I think both sides here still have the logic of science reversed (which I would blame on residual NHST-thinking).

    Few studies should be focusing on finding differences (ie “effects”) anyway. Instead they should be looking for stable relationships between various phenomena (including time/dose) and coming up with models that explain them. In other words, looking for “universalities” rather than differences. Then they should check for deviation from the models they came up with. Of course if the study is “low power”, it is going to be difficult to distinguish between the various models.

    • From the perspective of science this is absolutely correct. But a lot of statistics is really just “measurement technology”. For example, in a project I’m interested in right now, I want to “measure” the income level that can serve as the scale bar for poverty in the US, defined so that at Income/Scale = 1 you can afford shelter, food, and transportation to and from the work that generates the income (and basically nothing else). To do that, you need to infer that level from a large amount of survey data whose focus was not the thing of interest. Simply getting a plausible range of values for this scale bar is the goal, not to infer causal structure around why that level of income is what it is…

      Now, on the other hand, later I’d like to use this estimate and some ideas about changes to public policy to discuss why it is that the population near to poverty may be increasing or decreasing in an area… and there I totally agree with you that we want to find universality and causality and if I’m unable using the blurry measurement techniques to show what’s going on, I still should focus on the causal models and hope to get better measurements later.

    • It would clarify matters for some of us if you could expand on this comment, particularly as it applies to Senn’s field of analysis, which is pharmacology?

      First, the drug designer has usually started not with a stylized empirical observation but with a causal model of physiology which provides the basis for hoping that the drug might have a particular effect. For example, one might hope that a drug will help to treat asthma, as measured by forced volume of expiration, because one knows that its neurological action should produce bronchodilation. But one does NOT have a quantitative model from which to estimate effect size, dose, or timing. In Bayesian terms, one’s priors on these are very weak; and that would seem to vitiate the practical difference between a Bayesian and NHST analysis?

      Second, there is almost always an existing drug which is believed to be effective (we already have lots of asthma drugs!) In this situation, one cannot withhold treatment from patients, so the “control” group must receive the baseline therapy. The question is whether the new drug is more effective than the old, and the “universality” one is trying to establish seems to be definitionally about differences?

      • Phil: as a former biostatistician, I agree strongly with this.

        I’ve seen lots of statisticians say testing hypotheses is near worthless. And in many cases they are right. But in my experience working with biologists, there are plenty of cases where you really don’t want much more from your data than to test a hypothesis. For full credit, I do note that Anoneuiod did say that “few studies should be focusing…”, not “no studies”.

        Example from my previous work: we are interested in whether a given gene accelerates the onset of AD. We use mice models because that’s what we can ethically do. To measure differences in cognitive ability in mice, we measure the mice’s ability to learn a maze (skipping over those details for now). We test for differences in learning rates several times late in life between mice with and without the gene of interest.

        Here we basically don’t care about the effect size at all. I have no interest in predicting how quickly mice learn how to finish mazes. But my PI would like some assurance that this gene does accelerate AD before they pour more money and resources into studying it. I don’t need to explain to this crowd that p-value less than 0.05 in mouse study does not mean we know this gene is associated with AD in humans (trying to jam in as many fallacies as possible), but all else being equal, I’d rather have a grad student look into the gene A that had p = 0.01 in this study than gene B that had p = 0.41. Now, there is something to be said for doing more work to get a good prior, but that’s a different discussion.

        Extremely technically, we do care about effect size in that if we ran 10,000 mice, we could test positive for a gene that has a positive near zero effect, which would then likely be a near zero effect in humans. But since most of these trials have something like 20-30 mice, the distribution of the estimate of a near-zero effect is nearly identical to the mythological Null Effect. Once we recognize that, the discussion of the effect size starts to be less interesting, other than really extreme cases (which, with small samples, should be taken with a grain of salt anyways).
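
        A rough sketch of that point, using a normal approximation to a two-sample test. The standardized effect of 0.05 is a hypothetical “near zero” value, and the group sizes (12 per group for a study of 20-30 mice, 5000 per group for 10,000 mice) are assumptions for illustration:

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for standardized effect d."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)   # expected z-statistic when the effect is d
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

# Hypothetical near-zero effect of 0.05 standard deviations.
for n in (12, 5000):
    print(f"n per group = {n:5d}: "
          f"P(significant | d = 0)    = {power_two_sample(0.0, n):.3f}, "
          f"P(significant | d = 0.05) = {power_two_sample(0.05, n):.3f}")
```

        With the small groups, the chance of a significant result is almost the same whether the effect is exactly zero or merely tiny, which is the sense in which the two cases are hard to tell apart.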

        In these cases, the researchers are of course interested in universality and stable relationships. But that’s an extremely expensive next step.

        Finally, I would say that this general framework happens almost constantly in the biological sciences.

        • This is an example of the correct use of p values: as screening procedures.

          Note that there’s a secret prior in your model: your prior belief that the frequency with which your test leads to an “uninteresting” (zero, or very small) effect is very close to 1. If in fact you had maybe 30% “real” effects, then your variability would include so many real effects that it would become wide and p values would be useless for distinguishing from zero.

          In other words your logic is “if essentially all of these are meaningless, then the X percentile of these results is an upper bound for the range of meaninglessness”

          • Of course. I noted that the discussion of posterior probability vs p-value was not what I was discussing here, and clearly was in favor of posterior probabilities when prior information is available. This was merely a discussion of “is testing hypotheses a good idea?” (regardless of whether to use Bayesian or frequentist tests).

            And I agree it makes sense to think of p-values (or posterior probabilities) as screening procedures. But I’d also argue that the scientific process can just be seen as a method for screening/filtering down lots of ideas until we’re left with reliable ideas.

            Or at least, when science is being done correctly.

        • we are interested in whether a given gene accelerates the onset of AD. We use mice models because that’s what we can ethically do. To measure differences in cognitive ability in mice, we measure the mice’s ability to learn a maze (skipping over those details for now). We test for differences in learning rates several times late in life between mice with and without the gene of interest.

          Here we basically don’t care about the effect size at all. I have no interest in predicting how quickly mice learn how to finish mazes. But my PI would like some assurance that this gene does accelerate AD…

          I think you are handwaving over the most important part. How do you know the knockout (or whatever treatment) isn’t affecting the hunger of the mice, or their anxiety? What about the sense of smell/vision/hearing? What if it makes them defecate/urinate more often, so people handle them differently, or push the stopwatch button a little slower? Maybe the treatment makes them fatter or weaker so they are slower (or vice versa). I could easily go on about the possible causes of some effect on running a maze having little to do with “cognitive ability”, which is also likely poorly defined.

          What are the properties of “cognitive ability”? Is it supposed to be some (relatively) constant value that the maze times plateau around? If so, what is causing the variance from trial to trial or day to day? Why are you looking at learning rate rather than the plateau, can’t mice learn a bad habit that makes them plateau earlier but navigate the maze slower? Or even that a treatment interpreted as “helpful” makes them all dumber, but they reach a worse plateau at a faster rate? What if the data for one mouse is 5.12456 s on the last 4 trials? Is that plausible, or is there a minimum amount of variance you would expect?

          Now to be fair, it is the scientist’s job to deal with these issues, not the statistician’s job. But they aren’t doing their job because they think you are giving them something with the p-value/significance that isn’t so. I don’t see where this assurance that the difference has something to do with Alzheimer’s is coming from.

          • I don’t think there IS an assurance that “the difference has something to do with Alzheimer’s”; I think it’s just “if there’s no difference we might as well not look to explain any difference”.

            The next step isn’t “assert gene X causes Alzheimer’s”; the next step would typically be something like “study what genes are in the regulatory pathway that the gene of interest participates in, and look for ideas about how it might influence memory or influence the production of brain plaques, or influence the permeability of the blood brain barrier, or etc etc”.

            The biologists aren’t stupid about this (at least some of them aren’t) but they have limited resources, so they need a filter to figure out which small set of things to look at. A p value *is* a filter.

            • “if there’s no difference we might as well not look to explain any difference”.

              This is something I’d agree with. The useful findings for these types of studies are precisely those that no one publishes. From such a study you can conclude either the effect is small, or that any substantial effect is getting canceled out by some other effect(s) (ie the experimental setup is too uncontrolled to figure out what is going on). Essentially we can learn that it isn’t worth anyone doing a direct replication.

              But how does the rest of your post follow from the first sentence:

              I don’t think there IS an assurance “the difference has something to do with Alzheimer’s”

              I see nothing close to “assurance”.

              • The rest of my post follows from my having discussed this with biologists who actually do this kind of stuff (my wife is a biologist). Rarely do the biologists I know filter stuff on p values and then assert a discovery, it’s usually a precursor to deciding what projects to pursue using other techniques. Something like “we selected the 35 compounds with smallest p values in the foo test and then tested each one using histological sectioning of the liver / fluorescence microscopy of the cellular vacuoles / a selective cre knockout / quantitative pcr of the nervous system tissue” … whatever.

                Of course, I personally think my wife and many of her colleagues are very good biologists… so the mileage may vary with other groups studying other topics, and/or with different economic incentives (pharma for example).

              • “The useful findings for these types of studies are precisely those that no one publishes”
                Yup, in fact drawing conclusions in a single study (e.g. “this is not the gene you are looking for – move on”) is either silly or desperate. Just describe what you did in a study, why and what happened.

            • Daniel: Yep, pretty much agree with all of that. I’ll only add that before diving into regulatory pathway analyses, etc., there’s plenty of other data that’s collected and analyzed as a proxy for cognitive impairment. The maze data is just one of several types of data collected.

              Anoneuiod: if you’re interested in the protocol, it’s called the Morris Water Maze. They do their best to address all the issues you discussed, although the early statistical methods suggested are laughably invalid (a recent paper I coauthored suggested what I believe to be a much better method). But you cannot fault biologists for not having a PhD in statistics!

              Finally, saying “…they aren’t doing their job..” is really not fair, especially given your evidence. I knew them first hand, they took their work EXTREMELY seriously (like 60+ hours/week seriously) and were very concerned about doing good science.

              • I’ve done the Morris Water Maze myself (but with rats). It was certainly strenuous. Eg, repeatedly leaning over to get them in/out the pool, breathing in damp rodent dander/etc all day (which gives many people allergies to the rats so your nose is running). It was actually awful.

                That doesn’t mean we learned anything useful about the disease/treatment from it. Some of the rats seemed to prefer swimming to going back to the cage; others slipped off the platform once and ever after had an aversion to it. Instead, those learned to swim around until they timed out (after a while, eg 60 sec, you need to pull them out). Others figured it out right away by “randomly” touching it and then could just swim right there until you moved the platform…

            • “using histological sectioning of the liver / fluorescence microscopy of the cellular vacuoles / a selective cre knockout / quantitative pcr of the nervous system tissue”

              Where they will chain another significance test on top, or just as bad, avoid quantification altogether.

              I find it difficult to believe this isn’t going on since I was trained to do it, everyone in the department was doing it, all journal club articles we covered were written by people who did it, guest speakers I talked to did it, and I’ve been to conferences and seen tens of thousands of posters of people describing doing it. This was a few years ago now, but when I follow up on a news article and read the paper I see they are still doing it.

              That said, I do think ranking p-values can be useful as a screening procedure. That just isn’t where it ends. The better way would be to look at epidemiological data, model it, then figure out what cell bio stuff should correspond to those parameters and investigate that.

      • I think Anonuoid’s comments apply more to earlier stages in the scientific process. Once you’ve identified a pathway you’d like to modify with a drug, and have figured out a delivery mechanism, etc., you come into the “measurement technology” portion of statistics. Your goal is not to “hypothesize that there’s no difference and reject this null in favor of there is a difference”; your goal ought to be “decide whether the new drug on the whole provides a benefit, and how big?”

        From the perspective of bayesian decision theory, if there is an established measurement of how good the benefit is, and an established measurement of how bad the side effects are, you should estimate Expected Goodness under a prior for effect sizes. This is purely a measurement issue “how good is this drug?” not “can I find evidence that foobar pathway induces relaxation of bronchospasm through transport of barbaz across the motor neuron membranes” or whatever.

        Now in these cases, the role of the prior should be to regulate your effect size estimate so that it’s less subject to accidental over or under estimation. Imagine you have some kind of idea of at least the order of magnitude of the effect sizes for main effect and side effect from past drugs… Encoding these orders of magnitude in a prior is pretty easy: (Actual Effect / Order of Mag) ~ gamma(2,2) or something like that. It needs to be broad enough that you accept the idea of 2 or 4x improvements, but narrow enough that you require a *lot* of data evidence to accept 20 or 100x improvements. (Note: I’m assuming a scale where effects are multiplicative and positive and a “bad” effect would be zero, not negative; your model/situation might vary.)
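
        To see how much prior mass such a gamma(2,2) puts on large improvements, here is a small sketch. “gamma(2,2)” does not say whether the second parameter is a rate or a scale, so both readings are shown; that choice, and the helper name `gamma2_tail`, are assumptions of this sketch.

```python
from math import exp

def gamma2_tail(x, rate):
    """P(X > x) for X ~ Gamma(shape=2, rate=rate), using the closed form for shape 2."""
    return exp(-rate * x) * (1 + rate * x)

# Two readings of "gamma(2,2)": second parameter as a rate, or as a scale (rate = 1/2).
readings = (("shape 2, rate 2 (prior mean 1)", 2.0),
            ("shape 2, scale 2 (prior mean 4)", 0.5))

for label, rate in readings:
    print(label)
    for ratio in (2, 4, 20, 100):
        print(f"  P(effect / order of magnitude > {ratio:3d}) = {gamma2_tail(ratio, rate):.3g}")
```

        Comparing the two readings is a quick way to check whether the prior is “broad enough” for 2x or 4x and “narrow enough” for 20x or 100x before committing to a parametrization.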

        The problem is, in practice, this is absolutely NOTHING like what regulatory rules for Pharma look like and there are no established “goodness” functions for outcomes.

        • Probably just a different use of words, but I would consider what you referred to as much later in the scientific process; my example refers to basic scientific research, whereas yours refers to drug development. There’s about a 10-15 year waiting period between our examples. But I think you mean “different stages within a given experiment?”

          I’m curious about your statement about there being “no established goodness functions”. In grad school, the professor I took 50% of my courses from emphasized heavily that the FDA cares a whole lot about clinically meaningful outcomes, and that establishing these was a huge part of clinical trials. For example, he said if a drug already exists to treat A, and you wanted to develop a new drug to treat A, you would have to show either that your new drug would be significantly cheaper OR test the hypothesis that your new drug will have at least an (x) percent better effect than the established treatment on an established outcome AND make the argument that an (x) percent better effect would have a real world effect on patients. Of course, there were exceptions (such as Ebola treatments) in which performing a standard clinical trial would be impossible but giving up on development is not a good option either. These were handled as special cases, as they should be.

          Now, in disclosure of biases, he’s on several FDA advisory committees. So he may have an overly optimistic view of the FDA, or he could have been telling us what he’s been advocating for. From listening to him, the FDA is doing a very well thought out job on an extremely tricky balancing act (“you get (a)(cheap) + (1-a)(safe and effective) drugs”). But again, I understand he has bias and I’m always curious to hear about other interactions with the FDA from the other side of things.

          • Oops, just realized I don’t think your answer was aimed at my answer. Hence the confusion about stages in the scientific process.

          • Yes, I was replying directly to Phil Koop.

            What I meant by an established scale for goodness is that the FDA does not have a function which is constant across all clinical trials and weights how good some standardized outcome is for the population. Furthermore, it doesn’t generally consider variability of outcomes across the population, it tends to consider the average outcome and how reliably it is estimated (standard errors). Let me give you an example of why that’s bad.

            Suppose the outcome is something like “average number of consecutive days between major ER related psychiatric events” for people taking daily doses of an anti-psychotic drug that treats schizophrenia. The study group is a group of people who have something like 15 days between major events when unmedicated…

            Now suppose you find that with established drug 1 very consistently this is between 60 and 80 days with a mean of 70 +- 0.2.

            Suppose with drug 2 this is highly variable: for some people it’s as low as 15 days, and for others it’s 300, with an average of 69 +- 1.

            Now, 300 days straight for a person who had previously been in and out of the ER every 15 days is a LOT of benefit. I have to guess a nonlinear relationship is appropriate here. You can get a lot more accomplished if you have long stretches free of problems.

            But the drug “doesn’t have an improvement” on the average.

            If you could establish some kind of nonlinear scale you’d probably find that “the expected value of goodness” is WAY higher for the second drug, but to do that you need to integrate goodness over the posterior probability of the outcomes. Even using a flat prior, you’d find that the nonlinearity in “goodness” dominates here. Finally, considering the idea that there might be consistent causal reasons why certain people respond well to this drug, having the second drug on hand to try is hugely important.
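
            A toy version of that calculation is sketched below. The outcome distributions and the goodness function are all hypothetical: drug 1 is spread evenly over 60 to 80 days, drug 2 is a two-point mixture of 15 and 300 days weighted to average 69, and “goodness” is taken to be days squared purely as an example of a convex scale. A real analysis would integrate over a posterior rather than over fixed distributions like these.

```python
def expected(dist, f=lambda days: days):
    """Expected value of f(outcome) over a list of (outcome, probability) pairs."""
    return sum(p * f(days) for days, p in dist)

# Drug 1 (hypothetical): consistent, evenly spread over 60..80 days, mean 70.
drug1 = [(d, 1 / 21) for d in range(60, 81)]

# Drug 2 (hypothetical): either 15 or 300 days, weighted so the mean is 69.
p_low = (300 - 69) / (300 - 15)
drug2 = [(15, p_low), (300, 1 - p_low)]

def goodness(days):
    """Hypothetical convex 'goodness': long problem-free stretches count disproportionately."""
    return days ** 2

for name, dist in (("drug 1", drug1), ("drug 2", drug2)):
    print(f"{name}: mean days = {expected(dist):.1f}, "
          f"expected goodness = {expected(dist, goodness):.0f}")
```

            The two drugs are nearly tied on average days, but not on expected goodness, which is the comparison a test on the averages never makes.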

            Hypothesis testing is NOT a good basis for decision making.

            I’m not saying the FDA would definitely do the wrong thing here. But I’m saying that if they did the right thing, it’d be on an Ad-Hoc basis, not because they have a consistent framework to do the right thing.

            • Well, so it sounds a little like you’re getting into the topic of personalized medicine.

              It’s certainly an extremely hot topic at the moment; if I can use more data than just your standard symptoms to find a treatment that’s especially good for you, that’s a good thing, right?

              The professor I mentioned earlier seemed to have a fairly pessimistic view of this. His concern seemed to be that the issue with personalized medicine is that you’ve just really upped the amount of drug testing you need to do to cover a population. He seemed to imply that personalized medicine would only be feasible through the use of p-hacking.

              I don’t know where I sit on that question. Well, I think I know what I don’t know that’s needed to answer the question. My view is that it really depends on how the world works; if it’s true that treatment effects vary quite a lot for a large number of treatments, then there’s utility in spending time for figuring out what treatment works best in which subpopulation. On the other hand, if the variability of treatment effect is pretty low, then you’ll spend a very, very large amount of money getting nowhere. I don’t know enough about the general treatment response landscape to know which hypothesis is more accurate.

              Your example of schizophrenia is actually a very good example of what’s probably the first case. Because schizophrenia is so inaccurately defined, it seems extremely likely that what works great on patient A does nothing for patient B; it’s very likely that there’s two completely different mechanisms for the two patients, both classified into a heterogeneous group of “patients diagnosed with schizophrenia”. I think we probably both agree that a more important first step is clearly defining the differences of schizophrenia in patient A and B.

              • Well, personalized medicine is a deeper topic. I’m just pointing out that a simple bayesian decision rule for a highly variable treatment where the “utility” of the outcome is not a linear function of the outcome measure will do the right thing… whereas a simple hypothesis test trying to reject the idea that the average treatment outcome is no better than an existing drug will not. That is true regardless of whether you personalize the treatment or you just have a one-size fits all treatment and a variable outcome with nonlinear utility.

                I agree with you about the schizophrenia issue, and that’s part of the reason I chose that example… because you might very well expect several mechanisms and so some drugs work very well on sub-populations.

                As an example of nonlinear utility where you don’t have maybe sub-populations consider the following outcomes coded as 0, 1, 2, 3, 4, 5:

                0 = no side effect
                1 = headache
                2 = headache nausea and vomiting
                3 = headache nausea and vomiting with seizures
                4 = headache nausea and vomiting with seizures and temporary organ failure, or permanent organ failure
                5 = stroke, heart attack, death

                The last thing you’d want to do would be to average the 0-5 scale numbers and then accept a drug if it had lower side effect score than an existing drug… I mean, suppose existing drug has consistent 2, headache nausea and vomiting, but it tends to save your life…. and the new drug has 90% of no side effect and tends to save your life in those cases, and 10% stroke, heart attack, and death. The new drug’s score is about 0.5, which is much less than 2, but I don’t know about you: I’d rather “consistently save my life but have to suffer a day of headache nausea and vomiting” than “90% chance your life will be saved with no side effect but 10% chance you’ll die a horrible 4 hour painful death.”

                This is just an example, and often people are kind of aware of it when there’s a constructed scale… but the same is absolutely true when the outcome is a simple continuous natural measurement (like days between hospitalizations or reduction in cholesterol concentration in the blood, or whatever)

              • In my example, suppose you map the scores to the following “badness” function:

                0 -> 0
                1 -> 1
                2 -> 4
                3 -> 500
                4 -> 25000
                5 -> 7000000

                The Bayesian expected badness score for the new drug is now 0.9 * 0 + 0.1 *7000000 = 700000
                The Bayesian expected badness score for the old drug is 1.0 * 4 = 4 (a consistent score of 2 maps to a badness of 4)
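
                The same arithmetic as a short sketch, taking the old drug as a consistent score of 2 and the new drug as 90% score 0 and 10% score 5, as described earlier in this thread. The dictionary and function names are made up for illustration.

```python
# Badness assigned to each side-effect score (the mapping listed above).
BADNESS = {0: 0, 1: 1, 2: 4, 3: 500, 4: 25_000, 5: 7_000_000}

# Outcome distributions as {score: probability}.
old_drug = {2: 1.0}            # always headache, nausea and vomiting
new_drug = {0: 0.9, 5: 0.1}    # usually nothing, sometimes catastrophic

def mean_score(dist):
    """Naive average of the raw 0-5 scores."""
    return sum(score * p for score, p in dist.items())

def expected_badness(dist):
    """Average of the nonlinear badness of each score."""
    return sum(BADNESS[score] * p for score, p in dist.items())

for name, dist in (("old drug", old_drug), ("new drug", new_drug)):
    print(f"{name}: mean score = {mean_score(dist):.1f}, "
          f"expected badness = {expected_badness(dist):,.0f}")
```

                The ranking of the two drugs flips depending on which of the two summaries you use.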

              • Ah, yes, now I see what you mean by non-linear utility. I breezed through it before without giving it proper thought. Yeah, I don’t know the details about how much these things are taken into consideration.

              • Cliff: I’ve had dentists try to sell me treatments to reduce the depth of periodontal pockets. With nice brochures…. FDA approved…. average reduction in pocket depth of 0.1 mm after $1500 of treatment or whatever…

                Look, 0.1 mm is about the resolving power of the eye (for an object held at closest focal distance), so having tested several thousand patients they could determine that yes, the pocket depth reduced on average, by an amount too small to see. It was approved (and it was “significant” thanks to “statistics”), but face it: it was a scam. It had no practical utility. I personally suspect that many cancer drugs are essentially the same.

      • For example, one might hope that a drug will help to treat asthma, as measured by forced volume of expiration, because one knows that its neurological action should produce bronchodilation. But one does NOT have a quantitative model from which to estimate effect size, dose, or timing.

        The point is just seeing a difference isn’t enough. You need some model of what is going on so you can tell what aspect is being affected. I don’t know much about asthma and the methods used to study it, but let’s take a quick look at what is being measured here:

        The maneuver is highly dependent on patient cooperation and effort, and is normally repeated at least three times to ensure reproducibility. Since results are dependent on patient cooperation, FVC can only be underestimated, never overestimated.

        Due to the patient cooperation required, spirometry can only be used on children old enough to comprehend and follow the instructions given (6 years old or more), and only on patients who are able to understand and follow instructions — thus, this test is not suitable for patients who are unconscious, heavily sedated, or have limitations that would interfere with vigorous respiratory efforts. Other types of lung function tests are available for infants and unconscious persons.

        Another major limitation is the fact that many intermittent or mild asthmatics have normal spirometry between acute exacerbation, limiting spirometry’s usefulness as a diagnostic. It is more useful as a monitoring tool: a sudden decrease in FEV1 or other spirometric measure in the same patient can signal worsening control, even if the raw value is still normal. Patients are encouraged to record their personal best measures.

        So we can see that motivation is a factor. It sounds like fatigue, depression, etc will also affect this measurement. I would bet giving someone “uppers” (eg adderall) would increase their respiratory capability, while having little to do with bronchodilation. I am also sure someone who has collected such data could think of many alternative explanations. Maybe the treatment gives them test anxiety so they begin exercising more for practice, etc.

        • I agree with you at the level of drug discovery, but at the point where you have a drug whose function is thought to be bronchodilatory based on something like observed effects in mice, and a known demonstrated chemical function on adrenal receptors or whatever…. and you’ve shown that it’s not dangerous to test, then you need to measure the size of this effect… so you give out A/B labeled inhaler medications where one is your drug and one is an existing drug, and you do before-after spirometry tests on the two groups, and you ask yourself “how big is the improvement, relative to the improvement with the existing drug”?

          At this stage, it’s a measurement issue, not a discovery issue.

          At earlier stages, absolutely it’d be a terrible way to “discover” a drug to simply try out all the compounds you can think of on mice and search for “statistically significant differences”. But that might not be a bad way to get drug *candidates* whose mechanism of action is then studied in detail using biochemical assays in mice and so forth.

          My general feeling is that the appropriate mechanism for discovery is Broad Cheap Screening -> Small number of careful biochemical / functional studies -> Choice of things to try for their side effects -> Clinical trial to establish level of effectiveness (Bayesian decision theory preferred).

          • where you have a drug whose function is thought to be bronchodilatory based on something like observed effects in mice, and a known demonstrated chemical function on adrenal receptors

            Sure, and it is all based on stringing together the results of NHST instances. That is bad enough, then you have institutionalized p-hacking on top. I don’t see how that method can lead to anything of value. So if it does, what is really going on to make it work?

            • A lot of the time it’s just the intraocular test (i.e., “look, this picture has way more fluorescence than that one”). So I don’t think that it all relies on p values. In many cases the intraocular test is probably better, because it tends to find things where the effect is large rather than p = 0.03.

  2. I think this part: “the hypothesized parameter values in the design analysis should come using external subject-matter knowledge, not from the noisy point estimate from the data at hand” is key.
    Many of the papers that told people not to do post-hoc power analysis (e.g., Hoenig and Heisey, 2001) were based on the premise that the observed effect size is used in the power calculations. This is sometimes also called “observed power”, and I think Mayo calls it “shpower”.
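
A minimal sketch of why “observed power” adds nothing to the analysis: if the observed effect is plugged into the power formula, the result is a fixed, decreasing function of the p-value alone. The sketch assumes a two-sided z-test with known standard error; the function name `observed_power` is hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def observed_power(p_value, alpha=0.05):
    """Power of a two-sided z-test evaluated at the observed effect.

    Assumption: the test statistic is normal with known standard error,
    so the observed |z| can be recovered from the two-sided p-value.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)
    upper = 1 - _STD_NORMAL.cdf(z_crit - z_obs)   # reject in the observed direction
    lower = _STD_NORMAL.cdf(-z_crit - z_obs)      # reject in the opposite direction
    return upper + lower


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

Nothing beyond the p-value enters the function, so under these assumptions a result sitting exactly at the significance threshold always gets an observed power of about one half, whatever the study design was.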

    • Felix:

      I think you have a good point here.

      When Allan Detsky presented his paper When Was a ‘Negative’ Clinical Trial Big Enough? at the University of Toronto the statisticians would dismissively quote “Sir David Cox says that power is quite irrelevant in the analysis of data” and refuse to discuss the topic any further.

      Perhaps partly because of this I started to work with Detsky about a year later (yup, the statisticians were dismissive of me as well). He always refused to discuss that paper afterward – but one of the major projects was to figure out how to do meta-analysis to set an empirically well informed prior of effect size to carry out cost/benefit analyses of clinical trials that assessed their likely impact on adoption in clinical practice over the full life cycle use.

      Now that paper was a variation on post-hoc power, taking noisy estimates of Pc and Pc as fixed and assessing two pre-specified effect sizes (25 and 50% risk reductions). Hmm – moving on to an empirically well informed prior of effect sizes was the right move.

    • @Andrew: Felix’s point is very relevant. I think your title would be clearer if you removed the parenthetical “power calculations”, since the “design analysis” you are talking about is very different from what has traditionally been known as “power analysis”, particularly the post hoc variety.

    • Yes, that is the key point in my opinion.

      I didn’t go back to Stephen’s 2002 paper but I think what he argues against is the following. You set up a clinical trial with 80% power to detect a certain effect size, say 0.4 for a new antidepressant against placebo. You run the trial and it is not statistically significant; the observed effect is much smaller, maybe 0.2. One sometimes reads in the medical literature the authors say that the trial was ‘underpowered’ and a larger trial is needed. That kind of logic drives Stephen (and me) crazy.
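
To put rough numbers on this example, here is a sketch of the design calculation, assuming the effect sizes 0.4 and 0.2 are standardized mean differences, two equal arms with unit standard deviation, and a normal approximation to a two-sided test at the 5% level. The function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_arm(effect, power=0.80, alpha=0.05):
    """Sample size per arm for a two-arm trial (normal approximation).

    Assumption: `effect` is a standardized mean difference (sd = 1 in each arm).
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect) ** 2)


def power_at(effect, n, alpha=0.05):
    """Power of the two-sided test when the true standardized effect is `effect`."""
    se = math.sqrt(2 / n)                      # standard error of the difference in means
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_true = effect / se
    return (1 - _STD_NORMAL.cdf(z_crit - z_true)) + _STD_NORMAL.cdf(-z_crit - z_true)


n = n_per_arm(0.4)                             # trial planned around an effect of 0.4
print("n per arm:", n)
print("power if the true effect is 0.4:", round(power_at(0.4, n), 2))
print("power if the true effect is 0.2:", round(power_at(0.2, n), 2))
```

The second power figure is only a meaningful design statement if 0.2 comes from outside the trial. Feeding the observed estimate back in and then calling the trial “underpowered” is the circular step being objected to here.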

    • You make an excellent point, Felix. It is unfortunate that Andrew failed to make clear the distinction between a) pre-data power analysis based on assumed effect and variation and b) post-data power analysis based on observed effect sizes and variation. His post will do nothing more than muddy the waters. (Trump might end this comment with “Sad”.)

      • Michael:

        That’s a bit rude of you to say that. I say this very clearly in my post:

        In any case, the hypothesized parameter values in the design analysis should come using external subject-matter knowledge, not from the noisy point estimate from the data at hand.

      • Michael:

        Felix actually prefaced his comment with that quote!??

        “I think this part: “the hypothesized parameter values in the design analysis should come using external subject-matter knowledge, not from the noisy point estimate from the data at hand” is key.”

  3. Isn’t a post-hoc power calculation just a secret plot by the Bayesian Illuminati to introduce a sleeper prior into the w̶h̶i̶t̶e̶h̶o̶u̶s̶e̶ hallowed halls of Frequentism?

    • IMO the idea that all prior information can or should take the form of a probability distribution holds back both Bayesian and non-Bayesian inference.

        • Daniel:

          It depends on the problem. If you’re studying large effects (which I think you should!) with simple models, then almost all the prior information is in the likelihood. On the other hand, if you’re studying small effects (for example, ovulation and voting, or beauty and sex ratio), then almost all the prior information is, or should be, in the prior distribution.

          • I’d add, if you’re studying complex topics with causal hypotheses about how outcomes occur, then almost all the information is in the likelihood… like for example all the ODE pharmacology stuff that your collaborators are doing in Stan, or many many other possibilities: contaminant transport in groundwater, predator prey dynamics in the Lynx dataset, thermal transport in the oceans, the effect of tax policy on married couples with children, how periodic fasting affects the population of immune system cells in the liver… whatever, if you have some kind of causal theory about how things work, encoding that theory into a description of what sorts of data you’d expect to see is where most of the prior information is.

            • indeed, that has to be true because *the choice of parameters* is basically part of the likelihood, and so you can only put prior distributions on the parameters after you’ve specified the likelihood. The fact that a parameter is even included is itself a key part of the prior information.

            • Daniel,

              Could you recommend some reading on what to do in a situation, where you need to test a causal model of a complex intervention, but the intervention theory is too complex for the sample size?

              (i.e., intervention -> autonomy support -> 4 mediators -> intention -> 2 self-regulatory skills -> increased physical activity. In addition, participants are nested in schools and classrooms.)

              • Hi Matti,

                I’d have to say start by building the model, and worry about computation at a second stage. Once you’ve got the model, try fitting it to sub-samples of your data to see if it works.

                I’ve used Stan for projects involving tens of thousands of data points and thousands of model parameters, so my inclination is to say don’t worry until you hit that wall. When I did that, it ran in 4 to 10 hours to produce initial estimates… when I had debugged the model I could easily imagine running it for 3 or 4 days to get a great sample. Not that bad compared to potentially weeks of modeling and months of doing the experiment.

                Highly informed models often need a lot less data anyway. If you happen to have a couple million or billion data points… and you really need them all, then your best bet is to hire someone who specializes in that level of high end computing… and tell them what your model is, and have them figure out how to fit it, or how to approximate it in a way that can be fit. In that case, you’ll need to have that model specified anyway, otherwise they’ll propose something they know how to fit instead of something that follows from the science.

              • And if I misread you and you are saying that you have a complex theory and only a few samples… no problem at all, just run the model, you won’t identify your parameters to within epsilon, but you will find out more than you started out with.

                The key in these situations is to use informative priors, not overly informative, but think carefully about the magnitudes of what you expect and encode them realistically. That will keep your model away from the nonsense regions of parameter space, and you’ll spend your time finding out what is reasonable instead of finding out that the sampler went off to “average time until you learn how to hit a racquetball reliably is 37,000 years” and so forth.

              • Many thanks Daniel!

                It’s a clinical trial so total n is 400-600 at best (after plenty of dropout).

                Any reading you’d suggest, and do you think such a model is impossible to build without knowledge about the technical intricacies (i.e. underlying math)?

      • ojm, I generally agree at the level of model formulation, interpretation, synthesis, theorizing, etc. But once you have distilled a specific model, or set of models perhaps, external/prior information NEEDS to be encoded in the form of a probability distribution in order to do probabilistic inference. Indeed the definition of Bayesian inference is representing uncertainty using probability, enabling the use of sum/product rules to get some powerful inference. Sure, you can do *approximate* inference playing around with point estimates, etc. But maybe you had something else in mind?

        • Sort of agree, Bayes Theorem only runs (only should be run) with probabilities but only a smidgen of the external/prior information can be properly handled that way.

          I also would like to hear what ojm has in mind.

        • Hi Chris,
          Nothing earth shattering. Quick comments below.

          I agree that once you have done all that and if you subsequently want a posterior then you need a prior.

          But that’s basically tautological.

          More importantly, the general point is that a whole lot of non probabilistic prior information has gone into all the steps you mention. Hence the seeming implication of the original comment that any time we want to incorporate prior information into an analysis it involves a probability distribution is just false. Daniel acknowledges this but doesn’t seem to worry about the implications.

          In terms of the case at hand, one can carry out Gelman’s suggested analysis without needing to think of it as an approximation to introducing a prior probability distribution. It could equally well be that introducing a prior probability distribution is an approximation to a different form of prior information. For example we could treat the hypothesised effect size as a parameter and study the dependence on this for any choice of its value.
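
One way to read “treat the hypothesised effect size as a parameter” is as a simple sweep: fix the standard error of the estimate and report, for each candidate true effect, the power, the chance that a significant estimate has the wrong sign, and the expected exaggeration of significant estimates. The sketch below assumes a normally distributed estimate with known standard error; the function name and the numbers in the loop are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def design_analysis(true_effect, se, alpha=0.05):
    """Return (power, wrong-sign rate, exaggeration ratio) for a positive true effect.

    Assumption: estimate ~ Normal(true_effect, se), two-sided test at level alpha.
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    cutoff = _STD_NORMAL.inv_cdf(1 - alpha / 2) * se   # |estimate| needed for significance
    a = (cutoff - true_effect) / se
    b = (-cutoff - true_effect) / se
    p_hi = 1 - _STD_NORMAL.cdf(a)                      # significant, correct sign
    p_lo = _STD_NORMAL.cdf(b)                          # significant, wrong sign
    power = p_hi + p_lo
    wrong_sign = p_lo / power
    # E[|estimate|] restricted to significant results (truncated-normal means)
    mean_abs = (true_effect * p_hi + se * _STD_NORMAL.pdf(a)) \
        - (true_effect * p_lo - se * _STD_NORMAL.pdf(b))
    exaggeration = mean_abs / power / true_effect
    return power, wrong_sign, exaggeration


se = 0.14  # hypothetical standard error of the estimate
for effect in (0.05, 0.10, 0.20, 0.40):
    power, wrong_sign, exaggeration = design_analysis(effect, se)
    print(f"effect {effect:.2f}: power {power:.2f}, "
          f"wrong sign {wrong_sign:.3f}, exaggeration {exaggeration:.1f}x")
```

No prior distribution appears anywhere in this calculation; the external information enters only through which range of effect sizes one is willing to take seriously.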

          • Perhaps I should clarify some things – I believe there are cases where you can acknowledge the need for prior information while criticising introducing this in the form of a probability distribution over parameters. My criticism of probability distributions in these cases is not that they are ‘subjective’ but that they may be the wrong form to express your prior information.

            I also have no issue expressing prior information in the form of probability distributions in many cases while wearing a Bayesian hat. I just don’t think it is good and/or necessary to conflate ‘prior information’ with ‘probability distribution’ (or approximation to) in general.

          • ojm, I think we’re largely in agreement then. I follow the “all models are wrong, but some are useful” approach pretty doggedly. As Andrew and Daniel were getting at, lots of prior/external information makes its way in via the likelihood (and, I would argue, basically every aspect of modeling)! Even after we’ve put on our Bayesian hats, I would also agree that it can be really tricky to express external information in the form of a prior probability distribution on a parameter, unless you’re already one of Jaynes’ robots…
