Oops, “If theta>0, we would not often claim theta>0” should be “If theta<=0,” of course. And the filter seems to have eaten part of my sentence: “a failure to reject indicates that theta<theta_1. As N→infty”. That will teach me to use less-than and greater-than signs in posts…

]]>“Power is a broken idea, the Bayesian version makes more sense: how much data do you need to make the high probability posterior interval of your parameter estimate smaller than some desirable precision / width?”

It seems, rather, that it’s the folk-statistical idea of power that is broken; that’s what causes people to do silly things with it. Suppose you were interested in precision. What is precision for? It isn’t a fundamental statistical idea; it is a heuristic (which credible interval is the precision? central 95%? standard deviation? particular HPD? or is it the curvature of the posterior at the mode?). It is a fine heuristic, but we have to understand that that’s all it is, and we need more basic formal principles to define it. We can’t *replace* those ideas with precision.

But let’s examine the heuristic from a classical perspective to see if power is “broken” and whether we should consider precision instead (as, say, CI advocates suggest). The basic idea behind precision, stripped down, is that we want to be able to differentiate true values in one range from true values in another, and the bounds of these ranges are separated by some amount X. When X is smaller, we have more precision.

Suppose a classical test has a max Type I error rate of alpha and this is acceptably low to us. If theta<=0, we would not often claim theta>0. Now we construct a test with high power (1-alpha): if theta>=theta_1, we would very often claim that theta>0 (and hence, not theta<0). The test works in the other direction, too; a failure to reject indicates that theta<theta_1. As N→infty, we can make theta_1 closer and closer to 0 and get the same high power (1-alpha). In what sense is this not precision? Talking about precision as the steepness of the power curve seems totally reasonable.
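
To make the N→infty point concrete, here is a sketch for a one-sided z-test with known sigma (the helper name and the numbers are mine, purely illustrative):

```python
from math import sqrt
from scipy.stats import norm

def theta1_for_power(alpha, power, sigma, n):
    """Smallest true effect theta_1 at which a one-sided level-alpha
    z-test of H0: theta <= 0 achieves the requested power."""
    return (norm.ppf(1 - alpha) + norm.ppf(power)) * sigma / sqrt(n)

# With alpha = 0.05 and power = 0.95, theta_1 shrinks toward 0 as N grows,
# i.e., the power curve gets steeper around 0:
for n in (10, 100, 1000, 10000):
    print(n, round(theta1_for_power(0.05, 0.95, 1.0, n), 4))
```

The theta_1 that the test can resolve from 0 at fixed error rates scales as 1/sqrt(N), which is exactly the precision-as-steepness reading above.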

Granted, precision-as-power is precision *targeted at a particular region of the parameter space*, but that’s important, because not all models will give you the same precision at all points in the parameter space. And it is pretty typical for a particular region of the parameter space to be of interest (help/harm, loss/gain, etc.), so this makes sense. We care more about regions of the parameter space where the qualitative interpretation of the parameter changes (and if you wanted to examine a different part of the parameter space, you could anyway).

So I don’t get why power is “broken”, unless one thinks you “estimate” it from data or some weirdness like that. Of course strange folk-statistical ideas about power are going to seem broken.

]]>“I’ve read her blog and her first book several times, and frankly, I still don’t grok any distinction.” +++1

]]>Anon:

It depends. Sometimes you can get a stable estimate of variance using formulas such as 0.5/sqrt(n); other times, sure, it can make sense to perform a range of design analyses using different values for the variance.
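
For context, I take 0.5/sqrt(n) here to be the worst-case standard error of a sample proportion, since sqrt(p(1-p)) is maximized at p = 0.5. A quick check (the function name and n are illustrative):

```python
from math import sqrt

def se_proportion(p, n):
    """Standard error of a sample proportion."""
    return sqrt(p * (1 - p) / n)

# The conservative bound 0.5/sqrt(n) dominates the true standard error
# at every p, with equality at p = 0.5:
n = 400
bound = 0.5 / sqrt(n)  # 0.025
assert all(se_proportion(p, n) <= bound + 1e-12
           for p in (0.1, 0.3, 0.5, 0.7, 0.9))
```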

]]>Clear enough that we should not use the observed effect for post-hoc power analysis, but is it any more justified to use the observed variance to perform post-hoc design analysis, since good a priori information about variability seems to be even more challenging to come by?

]]>Test

]]>If you come in after the fact it’s one thing, but when doing power analysis to choose design parameters prior to the study, I think it’s fine to try to convince people to use Bayesian decision theory instead of power analysis. Doctors don’t want to die on the hill of defending p-values; they want to convince people their study is a good one. A bunch of estimates showing a cost-benefit type analysis in favor of the chosen experimental design choices isn’t going to hurt if well explained.

Even post hoc, if you’re trying to argue for follow-up… Again it’s a good idea to compare the preliminary study to the proposed follow up using Bayesian decision theory.

]]>The hypothesis “is the pivotal quantity equal to zero” is mathematically equivalent to the hypothesis “is the quantity we are estimating equal to the estimated value”; the first arises as an invertible mathematical transformation of the second (usually shifting and rescaling).

The most important point though is that the p values you mention are not for “is the *effect* equal to zero” which I think is the source of some confusion about the p=0.01 stuff

]]>FWIW I take it back. Currently time-warped back to yesterday on one device and two days ago on another.

]]>Thanks for the heads up about the mixed content! Will look into this.

]]>It worked for me. After Andrew’s suggestion that it was a caching problem on our end I scrubbed browsers on desktop and phone yet was still getting time shifted into the past; and only on this blog. I just refreshed the homepage and for the first time in a good while it appears to be current (if the “Postdoc position …” post is indeed current).

]]>I am having similar problems. It may be several issues, but one is that this site uses javascript to handle various aspects of comments. Some of the scripts seem to be hosted on a different site and are downloaded as needed. A conflict occurs because this site is served over secure HTTP (https:) while the call to the off-site scripts is made by a non-secure (http:) call. Some web browsers do not allow such calls (because it breaks security).

This may be only part of the problem because I am not sure why I only sometimes have trouble viewing comments.

]]>It appears our web host has implemented some aggressive caching. When one visitor looks at a page, the host caches it for ~10–20 minutes. For example, the homepage comment count is rarely accurate, since new comments can roll in on posts within those 20 minutes. But then if you click on a post, you may be the first to view it within that 20-minute span, so the page is rebuilt in the cache. This leads to discrepancies in comments on the post, comment counts, sidebar comments, and comments not appearing immediately.

I’ve attempted to turn off caching on both the host and on the CDN, to see if it improves. I will continue monitoring this thread in case there is additional information you or other commenters can provide. Hopefully the situation improves.

]]>Same here. Using Chrome.

]]>For around 2 weeks I have had recurring issues both posting and reading this blog. Sometimes my posted comments don’t show up for a long time. Sometimes, when I click to read comments on a post, there are no comments there (but there really are comments, I just can’t see them). Sometimes I see the comments only to return a few minutes later and they are gone – but they then reappear. Sometimes the new posts don’t appear for a long time, sometimes they appear earlier in the day (when I am used to seeing them). In short, the blog posting/reading has become quite erratic in the past 2 weeks or so. Something clearly seems to work differently, but I have no clue what it is.

]]>Hi Shravan,

Can you help me understand your issues on this blog? What browsers are you using when you can’t post? What error message do you get?

]]>First paper was accepted w/o complaints: https://www.sciencedirect.com/science/article/pii/S0749596X18300640?via%3Dihub

PS I made this paper open access for 1700 euros. Elsevier’s sole contribution was to add at least one typo that wasn’t there in the original submitted latex source. Someone on Elsevier’s production staff looked at the paper and said, no, we cannot allow a paper to be published without mistakes in it, let me edit the source .tex file and add a mistake.

]]>‘Corey is saying the limits are 0.025 and 0.975 he’s referring to a hypothesis “the true value is equal to the observed mean” in the usual case.’

No I’m not — in the usual case the confidence limits are based on pivotal quantities which have distributions invariant to the true value of the mean.

]]>Patrick,

Crikey, thanks for the references. I am on holiday so don’t have access to the full texts. I have read Royall’s monograph (Statistical Evidence: A Likelihood Paradigm) several times but I don’t remember that bit. I know Held and Ott define the minimum Bayes factor several different ways – I think only Goodman’s definition is the same as the likelihood ratio. The Wagenmakers approach requires a number of assumptions, including a uniform prior, and is a rehash of the “p-value overstates the evidence against the null” argument. No room to argue all this here – I’ll put something up on arXiv soon.

Finally, never underestimate Fisher.

]]>Hey Nick, obviously the fact that Fisher believed the so-called alpha-postulate (or p postulate; the idea that p-values present equivalent amounts of evidence against the null regardless of sample size) should not persuade us to do so as well. However, I’m surprised you believe it based on a likelihood justification. Royall (1986) himself demonstrates that precise p-values in fact provide greater evidence against a point null when yielded by *smaller* sample sizes. The same is demonstrated by Wagenmakers (2007) and generalized to evidential p-value bounds by Held & Ott (2016).

See:

Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. Am Stat, 40(4), 313-315.

Held, L., Ott, M. (2016). How the maximal evidence of p-values against point null hypotheses depends on sample size. Am Stat, 70(4), 335-341.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychon Bull Rev, 14(5), 779-804.

Kyle:

I don’t think it’s rude to say that you find Mayo’s writing to be confusing or that you find her ideas to be useless. I disagree—I find her ideas useful (not directly useful in my data analysis, but indirectly in helping me think about my own philosophy of statistics)—but I don’t think it’s rude to express your views. The thing I thought was rude was when people were flinging personal insults. If you just want to say that someone’s work is confusing, wrong, useless, whatever, that’s fine: go for it, and explain your reasons.

]]>So I am not alone in my reaction to Mayo’s work! I know our host considers it “rude” to keep saying this here, and I am genuinely sorry about that. But Mayo is a public intellectual whose quirky, allusive, academically philosophical prose makes it impossible for many educated people to grasp her points.

]]>I had to pause here to think a little harder. When we say “the p value” we need to also state “to reject the hypothesis H” and which H do we mean? Specifically the usual way a confidence interval is calculated is for rejecting the hypothesis that the true parameter value is different from the estimated value (often the sample mean). That’s a different hypothesis than the “null” where usually the null is “the parameter value is zero”

so when you say p = 0.01 you are probably referring to the p value to reject “a null hypothesis” and when Corey is saying the limits are 0.025 and 0.975 he’s referring to a hypothesis “the true value is equal to the observed mean” in the usual case.

]]>Point well taken! That said, in that particular context (i.e., consulting for medical doctors who have done a null hypothesis test in a frequentist framework), telling them that the entire enterprise of NHST is wrong would be an impossible sell. Broader systemic things need to change before that particular audience would be at all sympathetic to that line of thinking.

Like, if Andrew instead just wrote a letter that said “The whole NHST enterprise is wrong-headed, and since the idea of statistical significance is wrong-headed, power analyses are not useful.” it would probably fall on deaf ears, even though it’s probably closer to the mark on his actual thoughts about it, based on my reading of the blog over a few years. It’s important to meet some people where they’re at, and work through changes gradually, IMO.

]]>I don’t think so — the paragraph opens by describing its point as “slightly tangential” to the previous discussion of power and doesn’t mention power at all.

]]>I think you’re confused. p-values do not have confidence limits; a parameter has confidence limits.

And the way the confidence limits are usually defined, they are the parameter value at which the p-value for that parameter is 0.025 and the parameter value at which it is 0.975.

]]>J. Norway.

People often take “illogic” labels too literally/at face value rather than trying to clearly discern what the logic should be (which I think Cohen didn’t do).

For instance, this from my past http://statmodeling.stat.columbia.edu/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/#comment-434462

]]>J

I don’t think “power analysis” is so useful because “power” is all about statistical significance, which I think is a generally useless idea (see for example here: http://www.stat.columbia.edu/~gelman/research/published/abandon_final.pdf). I do, however, think that post-hoc design analysis can be very useful, as long as it is based on reasonable assumptions about effect sizes rather than being computed by plugging in noisy estimates from the data. I write about post-hoc design analysis in this paper: http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf and this one: http://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf

]]>That being said, D. Mayo’s “severity” concept has shown up on this site a few times, and I’ve never understood how that (severity) is not just a form of post-hoc power analysis. Am I right, or am I missing something deeper there? I’ve read her blog and her first book several times, and frankly, I still don’t grok any distinction. Can anyone elucidate? Thanks.

]]>No. A p-value of 0.01 for instance cannot have these limits.

]]>I think he meant to calculate the power to reject given your N and observed variation if the real value of the parameter was the lower and the upper limit of the current CI. It’s not clear to me that this is so trivial. The frequency in question is not the frequency with which you can reject each confidence interval you’ll get from repeated runs, but rather the frequency with which you’ll reject the constant values Lower_of_my_current_CI and Upper_of_my_current_CI in repeated trials.
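
If I’ve understood that computation right, it amounts to something like this sketch (a two-sided z-test with known sigma; the function name and all numbers are invented):

```python
from math import sqrt
from scipy.stats import norm

def power_at(theta, n, sigma=1.0, alpha=0.05):
    """Probability that a two-sided level-alpha z-test of H0: theta = 0
    rejects in a repeated trial of size n, if the true value were theta."""
    se = sigma / sqrt(n)
    z = norm.ppf(1 - alpha / 2)
    return (1 - norm.cdf(z - theta / se)) + norm.cdf(-z - theta / se)

# Power to reject zero if the truth sat at the current CI's endpoints
# (made-up numbers: n = 50, sigma = 1, observed mean = 0.4):
n, sigma, mean = 50, 1.0, 0.4
se = sigma / sqrt(n)
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(power_at(lower, n), power_at(upper, n))
```

Note that lower and upper are held fixed as constants across the hypothetical repeated trials, which is the distinction being drawn above.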

It still misses the point, which is that we really should care about what the parameter values are, not whether we can accept or reject null hypotheses…

]]>Put another way, the goal of research is to provide useful true information about the world at reasonable cost, not to frequently be able to reject null hypotheses.

]]>“The two end points of the confidence interval could be treated as point estimates and have their own p-values calculated.”

I think I know why this isn’t common practice. Here are the one-sided p-values for the endpoints of the usual 95% central confidence interval:

lower limit: 0.025

upper limit: 0.975

These values are independent of the data by the definition of the confidence interval.
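
Concretely, a quick simulation (a sketch assuming a known-sigma z-interval; the helper name and data are illustrative):

```python
from math import sqrt
import numpy as np
from scipy.stats import norm

def endpoint_pvalues(sample, sigma=1.0):
    """One-sided p-values for testing each endpoint of the usual 95%
    central confidence interval for a normal mean (known sigma)."""
    n = len(sample)
    se = sigma / sqrt(n)
    m = sample.mean()
    lower, upper = m - 1.96 * se, m + 1.96 * se
    # P(estimate >= observed mean | mu = endpoint):
    p_lower = 1 - norm.cdf((m - lower) / se)  # pinned at ~0.025
    p_upper = 1 - norm.cdf((m - upper) / se)  # pinned at ~0.975
    return p_lower, p_upper

# No matter what data we draw, the endpoint p-values never move:
rng = np.random.default_rng(0)
for _ in range(3):
    print(endpoint_pvalues(rng.normal(loc=2.0, size=50)))
```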

Nick, if I understand you, I think it’s not right to say that p-values quantify the degree of “evidence against the null” as opposed to just whether we reject the null. If I decide a priori that I’m willing to reject the null while accepting that I’ll be wrong 5% of the time, then I am implicitly declaring that a p-value of .051 and a p-value of .51 have the same interpretation with respect to the null.

This is widely misunderstood, I think, largely due to the never-ending stream of articles that say things like “this result was significant at p = .001” when there’s no way they would have rejected at p = .01, or “this result was nearly significant (p = .051)” as though their *choice* of significance level were a random variable. Because reject/fail-to-reject is a binary choice, it’s not particularly meaningful to interpret the distance from the critical value as a magnitude of “evidence against the null.”

Of course, the significance level used is arbitrary and the researcher might have chosen .001 or .051 in an alternate universe, but that’s one of the reasons NHSTs are criticized. Besides, we have other, better tools for assessing the quality of evidence, one of them being sample size. I believe part of the point of above comments was that a small sample size is much more likely to grossly underestimate standard error than a larger study, so the quality of the p-value itself is more questionable in a small study.

]]>The biggest issue is that power is all about rejecting the null, which is a questionable thing in the first place. Power or something like it should be used to help you pick an N. You don’t need to know what the real effect size is at all, just what effect size would be considered practically useful or of interest. If a 10% reduction in the incidence of *bad stuff* would be enough to recommend your surgery technique, use the 10% reduction. Later, if it turns out that the estimate is a 30% reduction, then great.

The better way to choose N is Bayesian decision theory anyway. Choose N that minimizes expected societal cost considering cost of the research and reduction in societal cost from whatever benefit the research gives, averaged over the real informative prior for effect size…

We can drop power in the wastebasket if we use decision theory.
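
A toy version of that calculation, as a sketch; the prior, the dollar figures, and the adopt-if-positive decision rule are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_net_benefit(n, cost_per_subject=100.0, benefit_scale=1e6,
                         sims=20_000):
    """Crude expected-utility sample-size calculation: expected societal
    benefit of adopting the treatment when the study estimate comes out
    positive, minus the cost of running a study of size n. The informative
    prior on the effect size and all dollar figures are made up."""
    effect = rng.normal(0.1, 0.2, size=sims)                      # prior draws for the effect
    estimate = effect + rng.normal(0, 1 / np.sqrt(n), size=sims)  # noisy study result
    adopt = estimate > 0                                          # decision rule after the study
    societal = np.where(adopt, benefit_scale * effect, 0.0)       # adopting a harmful effect counts negatively
    return societal.mean() - cost_per_subject * n

# Scan candidate sample sizes and keep the one with the best expected payoff:
candidates = [25, 50, 100, 200, 400, 800]
best = max(candidates, key=expected_net_benefit)
print(best, expected_net_benefit(best))
```

The point of the sketch is that N falls out of balancing study cost against the value of the information, with no reference to significance thresholds at all.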

]]>Don’t Calculate Post-hoc Power Using Observed Estimate of Effect Size

Gelman, Andrew, PhD

Annals of Surgery: July 9, 2018 – Volume Publish Ahead of Print

doi: 10.1097/SLA.0000000000002908

Letter to the Editor: PDF Only

https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

Definitely both.

The reference argues for both but assesses the “strength” of the posterior e.g. using the posterior probability of parameters with probability increases larger than a favored parameter value. (Meeting with the author later this week if anyone has questions).

]]>https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

In truth, the population effect size is mostly unknowable for many studies, and massively inflated when based on prior work, so this makes more sense to me.

]]>Can see it did get published – for some reason couldn’t see the comments earlier.

]]>This episode doesn’t surprise me much – there is still lots of statistical illiteracy in medical journals, even after decades of efforts by people like Doug Altman. As an example that I’ve been concerned with lately, the New England Journal of Medicine, no less, likes to insist on having significance tests for baseline characteristics in randomised clinical trials – where you know that any differences are just chance.

]]>But researchers often want to know whether their study was a real no-hoper in terms of type 2 errors and that seems a reasonable question. Whaddya think of this approach: http://www.robertgrantstats.co.uk/papers/false_nonsig_rate.pdf

]]>I am referring to things like this:

“To show that our finding of no interaction between the RC type and the definite/indefinite conditions was not due to a lack of statistical power, we conducted a power analysis using the results of the first critical word reading time. Using the error terms of the main effects and interaction of the ANOVA on this word, we found that we had power above .8 to detect an interaction of the size found in experiment 3 of Gordon et al. (2001) (this interaction was found when names replaced the embedded NP of the RC). Thus, we concluded that our lack of detection of an interaction between the two factors of our ANOVA was not due to a lack of statistical power.”

p 103 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.2923&rep=rep1&type=pdf

]]>https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

Jackson,

I don’t revere p-values, I’m just pointing out that logically from a likelihood perspective they do a superb job doing exactly what Fisher designed them to do – represent strength of evidence against the null. If the small study and the large study have the same p-value then the large study must have a much smaller observed effect size that is closer to the null. This smaller effect size is exactly offset by the lower level of uncertainty associated with the larger sample size. So yes, the 2 studies represent equivalent evidence against the null.

Andrew,

A low p-value implies a high observed power. It doesn’t necessarily imply ‘high power’ in the sense that the effect size estimate has low uncertainty attached to it, which is the sense I think you are using.

On a slightly tangential point, I wonder why we don’t use confidence intervals for P-values. Results are often given as a point estimate, its p-value and then the 95% confidence interval of the point estimate. The two end points of the confidence interval could be treated as point estimates and have their own p-values calculated. This might bring home the point that the p-value is a random variable and that an exact replication study is unlikely to result in the same p-value as the original study.

]]>