Brian Bucher (who describes himself as “just an engineer, not a statistician”) writes:

I’ve read your paper with John Carlin, Beyond Power Calculations. Would you happen to know of instances in the published or unpublished literature that implement this type of design analysis, especially using your retrodesign() function [here’s an updated version from Andy Timm], so I could see more examples of it in action? Would you be up for creating a blog post on the topic, sort of a “The use of this tool in the wild” type thing?

I [Bucher] found this from Clay Ford and this from Shravan Vasishth and plan on working my way through them, but it would be great to have even more examples.

I promised to write such a post asking for more examples—and here it is! So feel free to send some in. I have a couple examples in section 2 of this paper.

After I told Bucher the post is coming, he threw in another question:

I’d also be curious whether you would apply this methodology in cases where there was technically no statistical significance. I’m thinking primarily of these two cases:

(a) There was no alpha value chosen before the study, and the authors weren’t testing a p-value against an alpha, but just reporting a p-value (such as 0.06) and deciding that it was sufficiently small to conclude that there was likely an effect and worth further experimentation/investigation. (Fisher-ian?)

(b) There was an alpha value chosen (0.05), and the t-test didn’t reject the null because the p-value was 0.08. However, in addition to the frequentist analysis, the authors generated a Bayes factor of 2.0 and claimed this showed that a difference between the two groups was twice as likely as having no difference between groups, and, therefore, conclude a difference in groups.

Letter (a) is a decent description of the type of analyses that I often do (mostly DOEs), since I don’t use alpha-thresholds unless required by a third party.

Letter (b) is (basically) something from a paper that I’m analyzing, and it would be great if I could estimate the Type-S/M errors without violating any statistical laws.

I have my fingers crossed, because in your Beyond Power Calculations paper you do say,

If the result is not statistically significant, the chance of the estimate having the wrong sign is 49% (not shown in the Appendix; this is the probability of a Type S error conditional on nonsignificance)—so that the direction of the estimate gives almost no information on the sign of the true effect.

…so I do have hope that the methods are generally applicable to nonsignificant results as well.

Full disclosure, I [Bucher] posted a version of this question to stackexchange but have not (yet) received any comments.

My reply:

We were thinking of type M and type S errors as frequency properties. The idea is that you define a statistical procedure and then work out its average properties over repeated use. So far, we’ve mostly thought about the procedure that is “do an analysis and report it if it’s ‘statistically significant.’” In my original paper with Tuerlinckx on type M and type S errors (full text here), we talked about the frequency properties of “claims with confidence.”
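To make the "frequency properties" idea concrete, here is a minimal Python sketch. It is not the retrodesign() function from the paper or the R package. It assumes the estimate is normally distributed around a positive true effect with a known standard error, and that a result is reported only when it clears the usual two-sided threshold. The function names and the example inputs are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal


def design_analysis(true_effect, se, alpha=0.05):
    """Power, type S rate, and type M (exaggeration) ratio.

    Assumes estimate ~ Normal(true_effect, se) with true_effect > 0, and that
    a result counts as "significant" when |estimate| > z * se.
    """
    z = STD.inv_cdf(1 - alpha / 2)
    hi = z - true_effect / se     # standardized upper cutoff
    lo = -z - true_effect / se    # standardized lower cutoff
    p_hi = 1 - STD.cdf(hi)        # significant with the correct sign
    p_lo = STD.cdf(lo)            # significant with the wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|estimate| ; significant], from the two truncated-normal tails
    mean_abs = true_effect * (p_hi - p_lo) + se * (STD.pdf(hi) + STD.pdf(lo))
    type_m = mean_abs / power / true_effect
    return power, type_s, type_m


def type_s_given_nonsignificant(true_effect, se, alpha=0.05):
    """P(estimate has the wrong sign | estimate is not significant)."""
    z = STD.inv_cdf(1 - alpha / 2)
    center = true_effect / se
    p_nonsig = STD.cdf(z - center) - STD.cdf(-z - center)
    p_wrong = STD.cdf(-center) - STD.cdf(-z - center)  # -z*se < estimate < 0
    return p_wrong / p_nonsig


# Hypothetical inputs: a small true effect measured with a large standard error.
print(design_analysis(0.1, 3.3))
print(type_s_given_nonsignificant(0.1, 3.3))
```

With a true effect that is tiny relative to the standard error, the power sits barely above alpha, the type S rate approaches one half, and the exaggeration ratio becomes very large. These are averages over repeated use of the procedure, not statements about any one estimate.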

In your case it seems that you want inference about a particular effect size given available information, and I think you’d be best off just attacking the problem Bayesianly. Write down a reasonable prior distribution for your effect size and then go from there. Sure, there’s a challenge here in having to specify a prior, but that’s the price you have to pay: Without a prior, you can’t do much in the way of inference when your data are noisy.

Here are two examples:

https://surveyinsights.org/?p=8708

https://www.jmir.org/2017/11/e397/

I read through the second example given by StanFloyd and I wonder if the authors misapplied the procedure. They report Type M and S errors twice, once for the literature-based expectation (delta = .37, Type M = 1.56) and for the ES observed in their study (d = .26, Type M = 2.13). Is this appropriate? It seems like the wrong way to use Type M error. Instead, they should have said something like “If our a priori delta = .37 is correct, we would expect to observe an effect of about .37/1.56 = .24, which is very close to what we actually observed.” Without a statement like that, and with them having computed two values for Type M, it won’t be clear which Type M value to use when interpreting their results. Or maybe the authors are implying that the “true” Type M rate is between 1.5 and 2, but they never say that. Actually, the only rationale they state for conducting the procedure is “Gelman and Carlin suggest” it. I mean, that’s good enough reason for me… :)

As far as I can tell they don’t calculate errors for the observed interaction effect size, which is 0.7994. d = 0.26 is the lower bound of the confidence interval for the effect from the literature, and they also use the high-end value of d = 0.48. So they’ve used several possible effect sizes given previous literature, which seems OK. I’m guessing that their conclusion is based on all three cases of d * (type M) being smaller than their observed effect size, and the type S errors being small enough to make this simple scaling reasonable. I’m not sure that that’s how Gelman and Carlin intended the errors to be used, but it doesn’t look like an awful approach, if all you’re checking is whether your findings agree with previous literature re: significance, rather than the effect size itself.

Say the population is modeled as normally distributed with sd = 1, with all means equally likely a priori.

We call the model with mean = 0 H0, and all other possible models H1. Then to get the Bayes factor we calculate the likelihood of the data under H0, and the sum of the likelihoods under every possible mean besides 0 for H1. If we integrated over all possible means besides zero, shouldn’t the answer be infinity?

In R:

Results approach infinity as smaller intervals are used:

You would need to take the mean of the H1 likelihoods, not the sum: the choice of delta affects the prior’s normalising constant.

Thanks, but I don’t see why. If I want a composite hypothesis that mu is either 9 or 10, I would use this denominator: P(D|H10) + P(D|H9). If we are using a flat prior this constant should be in the numerator p(D|H0), and cancel out, right?

Also, this is what I think of when you say “normalizing constant”: https://en.m.wikipedia.org/wiki/Normalizing_constant

But I would never call that the prior’s normalizing constant. I have never heard of a prior itself having one. Do you mean something else?

Basically I am just looking for the derivation of the correct way to handle this. When I looked it up I came across some pretty odd stuff about needing to use Cauchy priors for some reason, which looked like a fudge to me. And intuitively I would think that the probability that the mean is exactly zero should approach zero as we include more and more very similar alternative possibilities like 1e-6, -1.2e-5, etc.

The Bayes factor in this case is essentially the posterior p(H0|D) with all priors cancelled out. The denominator is only slightly smaller because it is missing p(D|H0), but if we sum enough different terms of similar magnitude the loss of one should be negligible.

The prior has a normalising constant, because it’s a probability distribution. In any case, I don’t mean the hypothesis prior. I mean the prior distribution for mu, conditional on H1. Since you’ve set this to be uniform, there’s a normalising constant proportional to delta. More details below.

You’re comparing the null likelihood L(x | H0) to the likelihood for H1, L(x | H1) = E(L(x | H1, mu)), where the expectation is over values of mu. This is equal to the sum of L(x | H1, mu) over values of mu, times the probability of that mu, given H1, i.e. sum_{mu} L(x | H1, mu) p(mu | H1). There are 20/delta possible values of mu, and the distribution is uniform, so p(mu | H1) = delta/20 for each possible mu. So the H1 likelihood is

L(x | H1) = E(L(x | H1, mu)) = sum_{mu: p(mu | H1) > 0} L(x | H1, mu) * delta/20.

There’s now an extra multiplicative factor proportional to delta, which should stop the likelihood going to infinity.
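A minimal Python sketch of the two calculations being argued over, under stated assumptions: the observations are hypothetical, the population sd is 1, and H1 is a grid of means from -10 to 10 in steps of delta with 0 left out. The function names are illustrative and this is not the R code from the comment above.

```python
import math


def log_lik(data, mu, sd=1.0):
    """Log-likelihood of i.i.d. Normal(mu, sd) observations."""
    return sum(-0.5 * math.log(2 * math.pi * sd**2)
               - (x - mu) ** 2 / (2 * sd**2) for x in data)


def bayes_factors(data, delta, half_width=10.0):
    """Return (sum-based, mean-based) Bayes factors for H1 over H0: mu = 0.

    H1 is the grid of means k * delta in [-half_width, half_width], k != 0.
    """
    n_side = round(half_width / delta)
    ll0 = log_lik(data, 0.0)
    ratios = [math.exp(log_lik(data, k * delta) - ll0)
              for k in range(-n_side, n_side + 1) if k != 0]
    bf_sum = sum(ratios)
    # Dividing by the number of grid points is the same as weighting each
    # term by delta/20 when half_width is 10.
    bf_mean = bf_sum / len(ratios)
    return bf_sum, bf_mean


data = [0.3, -0.1, 0.8, 0.5, 0.2]  # hypothetical observations
for delta in (1.0, 0.1, 0.01):
    bf_sum, bf_mean = bayes_factors(data, delta)
    print(f"delta={delta}: sum-based BF={bf_sum:.2f}, mean-based BF={bf_mean:.4f}")
```

The sum-based value grows roughly like 1/delta as the grid is refined, while the mean-based value settles toward the integral of the likelihood against the uniform density on the interval.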

Doesn’t that delta/20 also need to be in the numerator, or else the prior is not uniform?

In this case the value of mu is the hypothesis. This is what I calculated:

Starting with Bayes rule for p(H_0|data):

Then we use a uniform prior, so:

So the priors all cancel:

Then call this the composite hypothesis:

p(data | H_c) = p(data | H_1) + … + p(data | H_n)

So we can write:

Then the posteriors for the two hypotheses (of “no difference” and “some difference”) are:

The “normalizing constant” in the denominator of each cancels if we take the ratio to get the Bayes factor (actually the reciprocal of what was used originally):

Substituting back in for p(data|H_c):

Here what I call H_c was called H1 in the code, but I think it is clearer this way…

OK, I think we disagree here:

> Then call this the composite hypothesis:

> > p(data | H_c) = p(data | H_1) + … + p(data | H_n)

I think this should be a mean, not a sum. For example, if you had a situation where p(data | H_k) = some constant b for all k, the above would give p(data | H_c) > 1 for n > 1/b, instead of p(data | H_c) = b.

Taking the mean here would mean multiplying each value of bf in your original results by delta/20, which still gives you an increasing series, just not one that increases so dramatically.

But the sum was already there from the very first step. All I did is aggregate it into a single term for the step you pointed out.

The dividing by n you want to do is already incorporated into the uniform prior, which equals 1/n for all “hypotheses” (including H_0) and so cancels out:

If I understand this example correctly… then “p(data | H_k) = some constant b for all k” is impossible. How is the likelihood going to be exactly the same for two normal distributions with different means?

But I think that isn’t so important. It is the cancelled priors that are causing the confusion. The denominator is:

When all the priors are equal to 1/n:

Then subtract the term for H_0 from both sides to get the composite hypothesis:

I could be wrong, but this gives the answer I find intuitively correct as well (see the discussion below with Daniel Lakeland).

The prior probability for H_c is (n-1)/n, right, so

p(data) = p(data | H_0) * 1/n + p(data | H_c) * (n-1)/n,

or, to match your version,

p(data | H_c) = p(data) * n/(n-1) – p(data | H_0) * 1/(n-1).

We also know that

p(data) = p(data | H_0) * 1/n + p(data | H_1) * 1/n + … + p(data | H_n) * 1/n,

and therefore

p(data | H_c) = p(data | H_1) * 1/(n-1) + … + p(data | H_n) * 1/(n-1).

The example with all the probabilities being b isn’t meant to be an example for the normal distribution case; it’s an example because normality is irrelevant to the composition of likelihoods.

Also, Lakeland’s not saying anything that relates to Bayes factors, he’s talking about the probability of any one parameter value being zero in a continuous distribution. That applies to the prior, before we have any data. The largest Bayes factor would occur if your alternative hypothesis is that the mean is equal to xbar, but the factor would still be finite.

Trying a simpler argument. Suppose you have no data. Then p(data | H_k) = 1, for any mean parameter k. Your algorithm would then claim that the Bayes factor against any single parameter value – not just zero – would tend to infinity as you approach a continuous prior. In other words, it would conclude that the empty dataset gives overwhelming evidence against any parameter value, in addition to the information in the prior. Does this seem reasonable?

How so? Eg, prod(dnorm(x, 1, 1)) is different than prod(dnorm(x, 1.1, 1)) right? So how can p(data|H1) = p(data|H1.1) = a constant?

This is a totally different situation. It isn’t a point hypothesis vs a composite hypothesis of everything else.

Yes, this is exactly what my intuition says should be the case.

You think an empty dataset can give overwhelming evidence against a hypothesis, instead of no evidence at all?

Yes if the hypothesis is a priori false given the assumptions being used. Like hypothesizing 1=2. If 1=2 bayes theorem would look different or not exist.

But a Bayes factor doesn’t count the information from the prior, it’s the ratio between the prior and posterior odds i.e. the effect of the data only. If there’s no data, it should be equal to one.

To derive the bayes factor you need to make certain assumptions, I figure somewhere in there it must imply this result.

If the tested value is impossible, then sure, the Bayes factor is undefined. But mu = 0 isn’t impossible a priori, it just has a probability of zero, as do all the other possible values.

I’m not clear on the distinction you are trying to make between a set of assumptions leading to an outcome having zero probability vs “impossible”.

> I’m not clear on the distinction you are trying to make between a set of assumptions leading to an outcome having zero probability vs “impossible”.

If there’s a set of appreciable size containing the given value whose total probability is zero, then this is a stronger notion than just “you won’t predict this one particular value”.

For example: p(x) = {0 if x < 1; proportional to normal(0,1) if x >= 1}

the entire infinite interval for x < 1 has zero probability so not only does say 0 have zero probability but so does any region around 0 +- size less than 1.

OK, forget the case where delta goes to zero. Just take delta as fixed, and say we have no data. Then mu = 0 has positive probability, and p(data | H_0) is defined as 1. The Bayes factor is then p(data | H_c) / p(data | H_0). But for the other point hypotheses, H_k for k != 0, we also have p(data | H_k) = 1. What should p(data | H_c) be?

I’m not following this at all unfortunately. It sounds like you are saying to consider the case where mu must be an integer (for example), why would p(data|H_0) = 1 if we have no data?

We were able to consider no data in the continuous case because p(data|H_0) necessarily became negligibly small relative to the sum of many hypotheses with very similar likelihoods. There is nothing like that going on here, so p(data|H_0) would be undefined if we had no data.

p(data | anything) is a function of the data value prior to observing data. It’s not a number. You can say however that:

integrate(p(data|anything) ddata) = 1 for the integral over all possible data values.

This is posterior to observing the data, but the data’s length is zero.

Exactly, after observing no data, the probability distribution over the parameters is the prior, and the probability over any new data point is the prior predictive, which as a density is a function of a free variable, namely whatever value for the data you want to plug in.

Yes. My somewhat roundabout point is that zero-length data here means the Bayes factor would be one. That requires p(data | H_c) = 1, which it won’t be if you set it to be the sum of the likelihoods for all the sub-hypotheses H_k, since those are also all equal to one.
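The zero-length-data check can be run directly. This is a small Python sketch, assuming a hypothetical grid of eight alternative means and Normal(mu, 1) likelihoods; the names are illustrative.

```python
import math
from statistics import NormalDist


def likelihood(data, mu):
    """Likelihood of i.i.d. Normal(mu, 1) data; the empty product is 1."""
    return math.prod(NormalDist(mu, 1).pdf(x) for x in data)


means = [k * 0.5 for k in range(-4, 5) if k != 0]  # hypothetical H_k grid
empty = []  # no observations

l0 = likelihood(empty, 0.0)
total = sum(likelihood(empty, m) for m in means)

sum_rule = total / l0                 # p(data | H_c) taken as a sum
mean_rule = total / len(means) / l0   # p(data | H_c) taken as a mean
print(sum_rule, mean_rule)
```

With no data, the sum rule returns the number of alternatives (8 here), so it reports evidence against mu = 0 that comes only from how finely the grid was drawn. The mean rule returns 1.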

There’s no such thing as a probability distribution that’s uniform over the whole real line.

And yes, if you have a continuum of possibilities for mu, then the probability that any one of them will be the correct one exactly goes to zero, just as if you have x ~ Normal(0,1) then the probability that x = 0 exactly is zero, as is true for any other exact value of x. In order to have a well defined probability you have to integrate over some interval, so the probability that x is in [-0.00001, 0.0001] is a nonzero number for example, but it’s related to the width of the interval.

Yes, this is my intuition. That is why I believe the correct calculation of a Bayes factor in this case should result in infinity regardless of the data.

Picking an interval that “may as well be zero for all practical purposes” could work.* I don’t use Bayes factors… just thought the original quote seemed off:

* I’ll take your word for it for now, but it seems like you would be able to inflate your Bayes factor by using a less precise interval…

Yes, if you have some discrete hypotheses, like “the value is 0” or “the value is 1” or “the value is -1” then you can generate a bayes factor for “the value is not 0”

but if you have a continuous probability density then “the value is not zero” has probability 1 a-priori regardless of any data…

Can you shed some light on the difference between that and what they do here: https://statswithr.github.io/book/hypothesis-testing-with-normal-populations.html

Yes, they have two discrete models they’re checking, the model where mu is exactly m0, compared to the model where mu is unknown and has prior normal(m0,sigma^2/n0)
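A minimal Python sketch of that two-model comparison, assuming a known sigma and working with the sample mean only. The default values for sigma and n0 are assumptions for illustration, not taken from the linked book.

```python
from math import sqrt
from statistics import NormalDist


def bf01_point_vs_normal(xbar, n, m0=0.0, sigma=1.0, n0=1.0):
    """Bayes factor for H0: mu = m0 against H1: mu ~ Normal(m0, sigma^2 / n0).

    Uses xbar ~ Normal(mu, sigma^2 / n), so the marginal distribution of xbar
    under H1 is Normal(m0, sigma^2 / n + sigma^2 / n0).
    """
    like_h0 = NormalDist(m0, sigma / sqrt(n)).pdf(xbar)
    like_h1 = NormalDist(m0, sigma * sqrt(1 / n + 1 / n0)).pdf(xbar)
    return like_h0 / like_h1


# Hold the z-score at 1.96 (p close to 0.05) and let the sample size grow.
for n in (10, 1_000, 100_000):
    xbar = 1.96 / sqrt(n)
    print(n, round(bf01_point_vs_normal(xbar, n), 2))
```

The factor is finite for any data because the alternative is averaged over its own prior instead of summed. At a fixed z-score it moves toward the point null as n grows, which is the behavior known as Lindley's paradox.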

I think the calculation you’re trying to do is comparing two possible subsets of mu under a *single* prior/model, but one subset is infinitesimally small. In other words, using nonstandard analysis where dmu is actually an infinitesimal number:

p(data | mu=0) p(mu=0) dmu / sum(p(data | mu = m) p(mu=m) dmu, for m values from -M to +M step by dmu, with m not equal to 0)

You’ll notice that in the numerator of this quantity is a limited number p(data|mu=0) p(mu=0) multiplied by an infinitesimal number dmu so the numerator is infinitesimal.

On the other hand, in the denominator is an integral which is infinitesimally close to 1.

The result will have to be infinitesimal, and the closest standard number to an infinitesimal number is the number 0.

Thinking along these nonstandard analysis lines, you can compare the calculation you linked to… in that calculation you can define two priors p1 and p2

p1 = 1/dmu for values between -dmu/2 and dmu/2, that is an infinitely high spike of width dmu around 0….

p2 = the normal distribution function normal(0,sigma^2/n0)

then the numerator for their calculation is:

p(data | mu=0) p1(mu=0) dmu = p(data | mu=0) * 1/dmu * dmu = p(data | mu=0)

He’s right about type S errors being Fisherian.

A P-value against a null hypothesis of theta less than zero is an indirect measure (via modus tollens) of the risk of a type S error.

I believe more credit needs to be given to S-type errors than is actually given. A clinical researcher is able to state a claim such as: “applying treatment A reduces the effect of B”. He is concerned about being wrong in the sense that, in fact, A increases the effect of B. To consider this framework of presenting claims he can state his claims using meaning equivalence alternatives and also, what he is not claiming, with surface similarity alternatives. The S-type error controls for meaning equivalence alternative statements that are wrong. The nice thing is that it involves the whole study design in that it “bootstraps” the S-type error. The even nicer thing is that clinical researchers can properly interpret such errors. In fact, and as stated by Gelman, it is about making “claims with confidence”. For an example see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070.

No, it doesn’t… All “effects” will be significant if you keep increasing the sample size and don’t keep lowering the significance threshold to compensate.

Wow. Looking into the thread above I came across this:

https://en.wikipedia.org/wiki/Lindley%27s_paradox

I’ve gotten an email or two from people who use the package in production with feature requests or questions, I can see if they’d want to share. I think one was in pharma and the other in manufacturing.

Both were situations where slightly exaggerated treatment effects could be a real problem. One mentioned a sort of downstream problem where a small exaggeration at step A snowballed into a much more significant problem down the road.

Continuing from above:

I’m having trouble following this thread now. Can you clarify what you are responding to?

I think I laid out my reasoning which is based on a few principles of probability and basic algebra pretty clearly… it isn’t clear to me where (if anywhere) Daniel Lakeland and Mark Webster think I made an error.

I see that Mark Webster thinks I should be normalizing by the number of possible values, but this seems to be based more on his not liking the implications of not doing that. I do not see how this can be justifiably incorporated into my calculations.

I think that when “the authors generated a Bayes factor of 2.0 and claimed this showed that a difference between the two groups was twice as likely as having no difference between groups” they were probably putting a mass of probability at mu=0. See for example https://www.ncbi.nlm.nih.gov/pubmed/29441460

I don’t think it really matters what prior you use, there will still be infinitely many very similar likelihoods to the one with mu = 0. In the continuous case the Bayes factor (“some difference” over “exactly zero difference”) should be infinite. So whatever they are calculating must be something else (perhaps they use an interval around zero).

There are at least 6 different ways to calculate a Bayes factor (see Held and Ott “On P-values and Bayes factors”).

I presume they are using BF= -e p log p as this is about 2.0 when the p value is 0.08.
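The arithmetic is easy to check. Here is a small Python sketch of the -e p log p expression, which is a lower bound on the Bayes factor in favor of the null and applies for p below 1/e. Whether the paper's authors actually used this formula is only a presumption, and the function name is hypothetical.

```python
import math


def min_bayes_factor(p):
    """The -e * p * log(p) lower bound on the Bayes factor for H0 (p < 1/e)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


bound = min_bayes_factor(0.08)
print(round(bound, 3), round(1 / bound, 2))  # prints 0.549 1.82
```

So at p = 0.08 the bound itself is about 0.55, and its reciprocal, the most the data can favor the alternative under this bound, is about 1.8, which is roughly the reported 2.0.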

I followed that to Edwards 1963, where it says:

Edwards W, Lindman H, Savage LJ. 1963. Bayesian statistical inference for psychological research. Psychol. Rev. 70:193–242

So it looks like I am correct. It is based on that “spike and slab” concept which is a fudge someone came up with because they disagreed with what Bayes rule was telling them: it is a waste of time to compare one exact prediction vs “anything else” (ie, NHST).

Instead, they should be comparing the precise predictions derived from multiple explanations people have come up with, along with the associated measurement and theoretical uncertainties. If you do that, you won’t have these types of problems.

> In the continuous case the Bayes factor (“some difference” over “exactly zero difference”) should be infinite.

If by “continuous case” you mean that the probability of mu being exactly zero (arbitrarily close to zero) is zero (infinitesimal) that’s precisely what I suggested that is NOT being assumed in the calculation of that Bayes factor.

By the way, I didn’t find the paper referenced by Nick Adams but I found this one from the same authors: https://www.zora.uzh.ch/id/eprint/135381/1/final.pdf

Yes, it sounds like they are making some sort of contradictory assumption. Ie, that Mu is a continuous variable and also not a continuous variable at the same time. Or that mu is continuous everywhere except at exactly zero. Is there an example of something like this existing in nature?

This sounds like a mathematical fantasy/fudge someone came up with to justify doing something that Bayes rule was telling them they shouldn’t be doing (checking for exactly zero difference between groups).

> some sort of contradictory assumption. Ie, that Mu is a continuous variable and also not a continuous variable at the same time

That’s called a mixture. Look it up.

For an example of something like this existing in nature, you may appreciate this one from Haldane:

“An illustration from genetics will make the point clear. The plant Primula sinensis possesses twelve pairs of chromosomes of approximately equal size. A pair of genes selected at random will lie on different chromosomes in 11/12 of all cases, giving a proportion x = .5 of “cross-overs.” In 1/12 of all cases, they lie on the same chromosome, the values of the cross-over ratio x ranging from 0 to .5 without any very marked preference for any part of this range, except perhaps for a tendency to avoid values very close to .5.”

https://projecteuclid.org/download/pdfview_1/euclid.ss/1494489818

This assumes Mendel’s second law always holds, which does not appear to be the case, eg:

https://www.ncbi.nlm.nih.gov/pubmed/11331939

More generally, it’s suspected non-random segregation is very important in maintaining tissue stem cells. One daughter gets the older DNA and remains a stem cell, the other gets the newer DNA and goes on to differentiate into whatever functional tissue cell is needed: https://royalsocietypublishing.org/doi/10.1098/rstb.2010.0279

But assuming the segregation of two chromosomes could be completely independent of each other, that is comparing two different hypotheses regarding the data generating processes. This is different from “the data was sampled from a normal distribution with unknown mean of either exactly zero vs something else.” That would be the same process with a different parameter.

Thanks, Anoneuoid, for the additional information on meiotic and mitotic asymmetries. Very interesting.