It depends. Sometimes you can get a stable estimate of the sampling variability using formulas such as 0.5/sqrt(n), the upper bound on the standard error of a proportion; other times, sure, it can make sense to perform a range of design analyses using different values for the variance.
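As a quick sketch of why 0.5/sqrt(n) works as a conservative plug-in for a binary outcome (the sample size n = 400 here is just an illustrative number): the standard error of a sample proportion is maximized at p = 0.5, where it equals exactly 0.5/sqrt(n).

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

n = 400
worst_case = 0.5 / math.sqrt(n)
print(worst_case)                # 0.025
print(se_proportion(0.5, n))     # 0.025: the se is maximized at p = 0.5
print(se_proportion(0.1, n))     # 0.015: smaller for any other p
```

So even before seeing any data, 0.5/sqrt(n) bounds how noisy a proportion estimate can be.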

Even post hoc, if you’re trying to argue for a follow-up… Again, it’s a good idea to compare the preliminary study to the proposed follow-up using Bayesian decision theory.

The most important point, though, is that the p-values you mention are not for “is the *effect* equal to zero,” which I think is the source of some of the confusion about the p=0.01 stuff.

This may be only part of the problem, because I’m not sure why I only sometimes have trouble viewing comments.

I’ve attempted to turn off caching on both the host and the CDN, to see if that helps. I’ll continue monitoring this thread in case you or other commenters can provide additional information. Hopefully the situation improves.

Can you help me understand your issues on this blog? What browsers are you using when you can’t post? What error message do you get?

PS I made this paper open access for 1700 euros. Elsevier’s sole contribution was to add at least one typo that wasn’t in the original submitted LaTeX source. Someone on Elsevier’s production staff looked at the paper and said, no, we cannot allow a paper to be published without mistakes in it; let me edit the source .tex file and add a mistake.

No, I’m not: in the usual case the confidence limits are based on pivotal quantities whose distributions are invariant to the true value of the mean.

Crikey, thanks for the references. I’m on holiday so I don’t have access to the full texts. I’ve read Royall’s monograph (Statistical Evidence: A Likelihood Paradigm) several times, but I don’t remember that bit. I know Held and Ott define the minimum Bayes factor several different ways – I think only Goodman’s definition is the same as the likelihood ratio. The Wagenmakers approach requires a number of assumptions, including a uniform prior, and is a rehash of the “p-value overstates the evidence against the null” argument. No room to argue all this here – I’ll put something up on arXiv soon.

Finally, never underestimate Fisher.

See:

Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. Am Stat, 40(4), 313-315.

Held, L., Ott, M. (2016). How the maximal evidence of p-values against point null hypotheses depends on sample size. Am Stat, 70(4), 335-341.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychon Bull Rev, 14(5), 779-804.

I don’t think it’s rude to say that you find Mayo’s writing to be confusing or that you find her ideas to be useless. I disagree—I find her ideas useful (not directly useful in my data analysis, but indirectly in helping me think about my own philosophy of statistics)—but I don’t think it’s rude to express your views. The thing I thought was rude was when people started flinging personal insults. If you just want to say that someone’s work is confusing, wrong, useless, whatever, that’s fine: go for it, and explain your reasons.

So when you say p = 0.01, you are probably referring to the p-value for rejecting “a null hypothesis,” and when Corey says the limits are 0.025 and 0.975, he’s referring, in the usual case, to the hypothesis “the true value is equal to the observed mean.”

Like, if Andrew instead just wrote a letter that said, “The whole NHST enterprise is wrong-headed, and since the idea of statistical significance is wrong-headed, power analyses are not useful,” it would probably fall on deaf ears, even though it’s probably closer to the mark on his actual thoughts, based on my reading of the blog over a few years. It’s important to meet people where they are and work through changes gradually, IMO.

And the way a confidence limit is usually defined is that the lower limit is the parameter value at which the p-value for that parameter is 0.025, and the upper limit is the parameter value at which the p-value is 0.975.

People often take “illogic” labels too literally, at face value, rather than trying to clearly discern what the logic should be (which I think Cohen didn’t do).

For instance, this comment from my past: http://statmodeling.stat.columbia.edu/2017/03/03/yes-makes-sense-design-analysis-power-calculations-data-collected/#comment-434462

I don’t think “power analysis” is so useful, because “power” is all about statistical significance, which I think is a generally useless idea (see for example here: http://www.stat.columbia.edu/~gelman/research/published/abandon_final.pdf). I do, however, think that post-hoc design analysis can be very useful, as long as it is based on reasonable assumptions about effect sizes rather than on noisy estimates plugged in from the data. I write about post-hoc design analysis in this paper: http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf and this one: http://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf

That being said, D. Mayo’s “severity” concept has shown up on this site a few times, and I’ve never understood how severity is not just a form of post-hoc power analysis. Am I right, or am I missing something deeper there? I’ve read her blog and her first book several times, and frankly, I still don’t grok the distinction. Can anyone elucidate? Thanks.

It still misses the point, which is that we really should care about what the parameter values are, not whether we can accept or reject null hypotheses…

I think I know why this isn’t common practice. Here are the one-sided p-values for the endpoints of the usual 95% central confidence interval:

lower limit: 0.025

upper limit: 0.975

These values are independent of the data by the definition of the confidence interval.
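A quick sketch of why, assuming the textbook case of a normal mean with known standard error (the numbers for xbar and se below are arbitrary placeholders): whatever the data, the one-sided p-value at the lower 95% limit comes out 0.025 and at the upper limit 0.975, by construction.

```python
from statistics import NormalDist

def one_sided_p(xbar, se, mu0):
    """One-sided p-value: P(estimate >= xbar) if the true mean were mu0."""
    z = (xbar - mu0) / se
    return 1 - NormalDist().cdf(z)

xbar, se = 3.7, 1.2                   # any estimate and standard error at all
z975 = NormalDist().inv_cdf(0.975)    # ~1.96
lower = xbar - z975 * se
upper = xbar + z975 * se

print(one_sided_p(xbar, se, lower))   # 0.025, regardless of xbar and se
print(one_sided_p(xbar, se, upper))   # 0.975, regardless of xbar and se
```

Changing xbar and se moves the limits around but never changes these two p-values, which is exactly why nobody bothers to report them.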

The better way to choose N is Bayesian decision theory anyway. Choose the N that minimizes expected total societal cost: the cost of doing the research, plus the societal cost remaining after whatever benefit the research gives, averaged over a genuinely informative prior for the effect size…

We can drop power in the wastebasket if we use decision theory.
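A toy sketch of that idea. Every number here is made up for illustration (per-subject cost, loss scale, prior), and the loss is taken, crudely, to be proportional to the posterior uncertainty that remains after the study; with a normal prior and normal likelihood that uncertainty depends only on N, so the trade-off is easy to scan.

```python
import math

# Hypothetical inputs: all of these numbers are invented for illustration.
COST_PER_SUBJECT = 50.0     # cost of recruiting and running one subject
SIGMA = 10.0                # outcome standard deviation
PRIOR_SD = 0.5              # informative prior sd on the effect size
LOSS_SCALE = 1e5            # societal cost per unit of remaining uncertainty

def expected_total_cost(n):
    """Research cost plus loss proportional to posterior sd of the effect."""
    se = SIGMA / math.sqrt(n)
    # Conjugate normal update: posterior sd is data-independent.
    post_sd = 1 / math.sqrt(1 / PRIOR_SD**2 + 1 / se**2)
    return COST_PER_SUBJECT * n + LOSS_SCALE * post_sd

best_n = min(range(10, 2000, 10), key=expected_total_cost)
print(best_n)   # the cost-minimizing sample size under these assumptions
```

Notice there is no power calculation anywhere: the design question becomes "where do marginal data stop paying for themselves," which is exactly the comparison the comment is pointing at.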

Don’t Calculate Post-hoc Power Using Observed Estimate of Effect Size

Gelman, Andrew. Letter to the Editor, Annals of Surgery, published ahead of print, July 9, 2018. doi: 10.1097/SLA.0000000000002908

https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

The reference argues for both, but assesses the “strength” of the posterior, e.g., using the posterior probability of parameter values whose probability increases are larger than that of a favored parameter value. (I’m meeting with the author later this week if anyone has questions.)

https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

In truth, the population effect size is mostly unknowable for many studies, and estimates based on prior work are massively inflated, so this makes more sense to me.

This episode doesn’t surprise me much – there is still lots of statistical illiteracy in medical journals, even after decades of efforts by people like Doug Altman. As an example that I’ve been concerned with lately, the New England Journal of Medicine, no less, likes to insist on having significance tests for baseline characteristics in randomised clinical trials – where you know that any differences are just chance.

But researchers often want to know whether their study was a real no-hoper in terms of Type II error, and that seems a reasonable question. Whaddya think of this approach: http://www.robertgrantstats.co.uk/papers/false_nonsig_rate.pdf

“To show that our finding of no interaction between the RC type and the definite/indefinite conditions was not due to a lack of statistical power, we conducted a power analysis using the results of the first critical word reading time. Using the error terms of the main effects and interaction of the ANOVA on this word, we found that we had power above .8 to detect an interaction of the size found in experiment 3 of Gordon et al. (2001) (this interaction was found when names replaced the embedded NP of the RC). Thus, we concluded that our lack of detection of an interaction between the two factors of our ANOVA was not due to a lack of statistical power.”

p. 103 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.2923&rep=rep1&type=pdf

https://journals.lww.com/annalsofsurgery/Citation/publishahead/Don_t_Calculate_Post_hoc_Power_Using_Observed.95527.aspx

I don’t revere p-values; I’m just pointing out that, logically, from a likelihood perspective they do a superb job of doing exactly what Fisher designed them to do – represent strength of evidence against the null. If a small study and a large study have the same p-value, the large study must have a much smaller observed effect size, one closer to the null. That smaller effect size is exactly offset by the lower uncertainty that comes with the larger sample size. So yes, the two studies represent equivalent evidence against the null.
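To make the arithmetic concrete (a sketch for a one-sided, one-sample z-test with unit standard deviation; the sample sizes are arbitrary): holding p = 0.01 fixed, the observed effect that produces it shrinks like 1/sqrt(n).

```python
from statistics import NormalDist
import math

p = 0.01
z = NormalDist().inv_cdf(1 - p)    # ~2.326: the z-score that yields p = 0.01
for n in (25, 100, 400):
    # The observed effect (in sd units) that gives exactly this p-value:
    effect = z / math.sqrt(n)
    print(n, round(effect, 3))     # 25 -> 0.465, 100 -> 0.233, 400 -> 0.116
```

Same p-value, a quarter of the effect for every sixteen-fold increase in n: whether that is "equivalent evidence" is precisely what the thread is arguing about.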

Andrew,

A low p-value implies a high observed power. It doesn’t necessarily imply “high power” in the sense that the effect-size estimate has low uncertainty attached to it, which is the sense in which I think you are using the term.

On a slightly tangential point, I wonder why we don’t use confidence intervals for p-values. Results are often reported as a point estimate, its p-value, and the 95% confidence interval of the point estimate. The two endpoints of the confidence interval could be treated as point estimates and have their own p-values calculated. This might bring home the point that the p-value is a random variable, and that an exact replication study is unlikely to produce the same p-value as the original.
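A quick simulation of that last point (a sketch with invented numbers: a one-sample z-test, true effect 0.3, sd 1, n = 100, so the "typical" study lands near p = 0.003): exact replications of the same design scatter their p-values across orders of magnitude.

```python
import math
import random
from statistics import NormalDist

random.seed(1)
norm = NormalDist()
true_effect, n = 0.3, 100
se = 1 / math.sqrt(n)

def one_replication():
    """Run one exact replication and return its two-sided p-value."""
    xbar = random.gauss(true_effect, se)
    z = xbar / se
    return 2 * (1 - norm.cdf(abs(z)))

ps = sorted(one_replication() for _ in range(10_000))
# Quartiles of the replication p-value span roughly two orders of magnitude:
print(ps[2500], ps[5000], ps[7500])
```

Even with identical design and true effect, the "same study" can land anywhere from strongly significant to unremarkable, which is the sense in which a p-value is itself a noisy statistic.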

You write, “the best estimate of the true effect size is the observed effect size.” First, strictly speaking, there is no “observed effect size”; all you can get is an estimate. Second, the usual point estimate may be “best” to you, but it’s not “best” to me! Using this estimate results in systematic overestimates of effect sizes, replicability rates, etc., leading to the famous replication crisis in which people have been stunned and surprised by failed replications. They maybe wouldn’t have been so stunned and surprised had they been aware of Type M error. And, no, a small p-value does not imply high power: this is a key lesson we’ve learned over the past ten years or so. Some discussion is here, here, and here.
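A small simulation of the Type M (exaggeration) point, in the spirit of design analysis but with made-up numbers (true effect 2, standard error 8, i.e. a badly underpowered design): among the estimates that happen to reach statistical significance, the average magnitude overstates the true effect many times over.

```python
import random
from statistics import NormalDist

random.seed(2)
z975 = NormalDist().inv_cdf(0.975)
true_effect, se = 2.0, 8.0        # hypothetical, deliberately underpowered

sig = []
for _ in range(100_000):
    est = random.gauss(true_effect, se)
    if abs(est / se) > z975:      # "statistically significant" at the 5% level
        sig.append(abs(est))

# Type M error: mean significant |estimate| relative to the true effect.
exaggeration = sum(sig) / len(sig) / true_effect
print(round(exaggeration, 1))     # far above 1: significant results are inflated
```

This is exactly why plugging a significant published estimate back into a power calculation is circular: the estimate you condition on was selected for being large.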
