Chris Mead points us to a new document by John Deke, Mariel Finucane, and Daniel Thal, prepared for the Department of Education’s Institute of Education Sciences. It’s called The BASIE (BAyeSian Interpretation of Estimates) Framework for Interpreting Findings from Impact Evaluations: A Practical Guide for Education Researchers, and here’s the summary:

BASIE is a framework for interpreting impact estimates from evaluations. It is an alternative to null hypothesis significance testing. This guide walks researchers through the key steps of applying BASIE, including selecting prior evidence, reporting impact estimates, interpreting impact estimates, and conducting sensitivity analyses. The guide also provides conceptual and technical details for evaluation methodologists.

I looove this, not just all the Bayesian stuff but also the respect it shows for the traditional goals of null hypothesis significance testing. They’re offering a replacement, not just an alternative.

Also they do good with the details. For example:

Probability is the key tool we need to assess uncertainty. By looking across multiple events, we can calculate what fraction of events had different types of outcomes and use that information to make better decisions. This fraction is an estimate of probability called a relative frequency. . . .

The prior distribution. In general, the prior distribution represents all previously available information regarding a parameter of interest. . . .

I really like that they express this in terms of “evidence” and “information” rather than “belief.”

They also discuss graphical displays and communication that is both clear and accurate; for example recommending summaries such as, “We estimate a 75 percent probability that the intervention increased reading test scores by at least 0.15 standard deviations, given our estimates and prior evidence on the impacts of reading programs for elementary school students.”
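As a sketch of where such a summary comes from (the numbers below are hypothetical, not from the report): if the prior evidence and the study estimate are each summarized as normal distributions, the conjugate normal-normal update gives a normal posterior, and the reported probability is just a posterior tail area.

```python
from scipy.stats import norm

# Hypothetical numbers (not from the BASIE report): prior evidence from
# earlier reading studies, plus the current study's estimate, in SD units.
prior_mean, prior_sd = 0.10, 0.10   # prior on the true impact
est, se = 0.20, 0.08                # impact estimate and standard error

# Conjugate normal-normal update: precision-weighted average.
prior_prec, data_prec = 1 / prior_sd**2, 1 / se**2
post_var = 1 / (prior_prec + data_prec)
post_mean = post_var * (prior_prec * prior_mean + data_prec * est)

# Probability that the true impact is at least 0.15 SD.
p_at_least = 1 - norm.cdf(0.15, loc=post_mean, scale=post_var**0.5)
print(f"P(impact >= 0.15 SD) = {p_at_least:.2f}")
```

With these made-up inputs the posterior mean lands around 0.16 SD, shrunk from the raw estimate of 0.20 toward the prior, which is exactly the kind of evidence-tempered summary the report recommends.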

And it all runs in Stan, which is great partly because Stan is transparent and open-source and has good convergence diagnostics and a big user base and is all-around reliable for Bayesian inference, and also because Stan models are extendable: you can start with a simple hierarchical regression and then add measurement error, mixture components, and whatever else you want.

And this:

Local Stop: Why we do not recommend the flat prior

A prior that used to be very popular in Bayesian analysis is called the flat prior (also known as the improper uniform distribution). The flat prior has infinite variance (instead of a bell curve, a flat line). It was seen as objective because it assigns equal prior probability to all possible values of the impact; for example, impacts on test scores of 0, 0.1, 1, 10, and 100 percentile points are all treated as equally plausible.

When probability is defined in terms of belief rather than evidence, the flat prior might seem reasonable—one might imagine that the flat prior reflects the most impartial belief possible (Gelman et al., 2013, Section 2.8). As such, this prior was de rigueur for decades.

But when probability is based on evidence, the implausibility of the flat prior becomes apparent. For example, what evidence exists to support the notion that impacts on test scores of 0, 0.1, 1, 10, and 100 percentile points are all equally probable? No such evidence exists; in fact, quite a bit of evidence is completely inconsistent with this prior (for example, the distribution of impact estimates in the WWC [What Works Clearinghouse]). The practical implication is that the flat prior overestimates the probability of large effects. Following Gelman and Weakliem (2009), we reject the flat prior because it has no basis in evidence.

The implausibility of the flat prior also has an interesting connection to the misinterpretation of p-values. It turns out that the Bayesian posterior probability derived under a flat prior is identical (for simple models, at least) to a one-sided p-value. Therefore, if researchers switch to Bayesian methods but use a flat prior, they will likely continue to exaggerate the probability of large program effects (which is a common result when misinterpreting p-values). . . .
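That equivalence is easy to verify in the simplest case (a sketch, not from the report: assume a normal-mean model with known standard error and made-up numbers). Under a flat prior the posterior is Normal(estimate, se), so the posterior probability of a non-positive effect equals the one-sided p-value for testing a zero effect.

```python
from scipy.stats import norm

est, se = 0.30, 0.20   # hypothetical impact estimate and standard error

# One-sided p-value for H0: effect = 0, against effect > 0.
p_value = 1 - norm.cdf(est / se)

# Flat-prior posterior is Normal(est, se); probability the effect is <= 0.
post_prob_le_zero = norm.cdf(0, loc=est, scale=se)

print(p_value, post_prob_le_zero)  # agree to floating-point precision
```

So a researcher who "goes Bayesian" with a flat prior will report posterior probabilities that are numerically the same as the misread p-values they were trying to escape.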

Yes, I’m happy they cite me, but the real point here is that they’re thinking in terms of modeling and evidence, also that they’re connecting to important principles in non-Bayesian inference. As the saying goes, there’s nothing so practical as a good theory.

What makes me particularly happy is the way in which Stan is enabling applied modeling.

This is not to say that all our problems are solved. Once we do cleaner inference, we realize the limitations of experimental data: with between-person studies, sample sizes are never large enough to get stable estimates of interactions of interest (recall the factor of 16), which implies the need for . . . more modeling, as well as open recognition of uncertainty in decision making. So lots more to think about going forward.
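A back-of-the-envelope sketch of that sample-size problem (illustrative numbers; this is the standard argument that interactions need roughly 16 times the sample size of main effects): with equal-sized subgroups, the interaction estimate is a difference of two treatment-effect estimates, so its standard error is double that of the main effect; if the interaction is also only half as large, matching the main effect's power takes (2 × 2)² = 16 times the data.

```python
import math

n = 1000      # people per cell (treatment x subgroup); illustrative only
sigma = 1.0   # outcome standard deviation

# Main effect: treated mean (2n people) minus control mean (2n people).
se_main = sigma * math.sqrt(1 / (2 * n) + 1 / (2 * n))

# Interaction: (treatment effect in subgroup A) minus (in subgroup B),
# each itself a difference of two cell means of n people.
se_effect_in_subgroup = sigma * math.sqrt(1 / n + 1 / n)
se_inter = math.sqrt(2) * se_effect_in_subgroup

print(se_inter / se_main)  # 2 (up to floating point): double the se
# With half the effect size and double the se, the z-ratio shrinks by 4,
# and since se scales as 1/sqrt(n), restoring it takes 4**2 = 16 times n.
```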

*Full disclosure:* My research is funded by the Institute of Education Sciences, and I know the authors of the above-linked report.

This is the opposite. You are introducing a systematic difference with the treatment. So in fact, the presence of a difference is guaranteed.

And, of course, reliably detecting the presence of a difference is great. But it does not tell you *why* that difference is happening.

Usually there is some theory motivating the intervention in the first place, but unless you derived an otherwise surprising prediction beforehand you can’t claim support for that theory. *

So this BASIE framework still suffers from the same disconnect between scientific theory and statistical model as NHST.

* My favorite example is chemotherapy causing nausea and poor appetite/absorption, which induces caloric restriction. This is then expected to “starve” the cancer (see, e.g., https://en.m.wikipedia.org/wiki/Warburg_effect_(oncology)). I have never seen an RCT account for this; they all assume the primary effect is on killing cancer cells.

Anon:

When they say “characteristics” in that sentence, they’re talking about pre-treatment characteristics.

Sure, but the way to think about these “compare group A to group B” studies is that the intervention is simply one possible source of systematic difference/error out of many.

You hope/believe it will be the largest one, but this is often not the case in observational data. The purpose of randomization and blinding is to try to ensure that it is.

Regarding flat priors, there is another way of looking at the issue.

Take Bayes’ rule for hypotheses 0 thru n:

p(H_0|D) = p(H_0) * p(D|H_0) / [ p(H_0) * p(D|H_0) + … + p(H_n) * p(D|H_n) ]

First of all, in the denominator we can drop all negligible terms where p(H_0) * p(D|H_0) >> p(H_i) * p(D|H_i).

Second, if the remaining prior probabilities are approximately equal p(H_0) ~ p(H_i) ~ p(H_n), then they can all cancel out with little loss of accuracy.

So a flat prior is simply a computationally efficient approximation of the “full equation”.

The posterior under a flat prior is:

p(H_0|D) = p(D|H_0) / [ p(D|H_0) + … + p(D|H_n) ]
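That cancellation argument can be checked numerically (a toy sketch with made-up likelihood values): when the prior probabilities are equal, they drop out of Bayes’ rule exactly, leaving normalized likelihoods.

```python
# Toy check: with equal priors, Bayes' rule reduces to normalized likelihoods.
likelihoods = [0.40, 0.25, 0.10, 0.05]   # p(D | H_i), hypothetical values
priors_flat = [0.25, 0.25, 0.25, 0.25]   # equal prior probabilities

# Full Bayes' rule: normalize prior-times-likelihood.
joint = [pr * lk for pr, lk in zip(priors_flat, likelihoods)]
posterior_full = [j / sum(joint) for j in joint]

# "Flat prior" shortcut: priors cancel, leaving normalized likelihoods.
posterior_flat = [lk / sum(likelihoods) for lk in likelihoods]

for a, b in zip(posterior_full, posterior_flat):
    assert abs(a - b) < 1e-12
print(posterior_flat[0])  # p(H_0 | D) = 0.40 / 0.80 = 0.5
```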

How exactly do you get from that to the one-sided p-value? I’d guess you have to limit H[0:n] to monotonic likelihoods with one varying parameter. But I don’t see how to get from there to a tail probability that includes hypothetical “more extreme” data.

It’s a well-known result.

Thanks. I believe this is similar to the paper shared by unanon below. In that paper they call it “equivalence”, in this one they call it “reconciled”.

This seems (afaict) to mean you can deduce the same final equation for a posterior as you can for a p-value under certain circumstances.

Yes, anyone can prove this to themselves with a few simulations these days.

What I am interested in is whether you can derive the assumptions behind the calculation for the p-values in these cases from Bayes’ rule. Or vice versa, derive Bayes’ rule from the assumptions used to derive these p-values.

Does that make sense?

To be clear, I expect one is a special case of the other so will require some additional assumptions either way. Ie, in the same way that Newtonian physics is a special case of relativity.

A more concrete example may be that for something that happens at least once with probability p per time unit t, then: p*t ~ 1 – (1-p)^t

As long as p*t << 1 this works; otherwise it doesn't. I think something like that is going on here, where the computationally efficient p-value makes sense under some specific constraints but otherwise leads to the wrong result.
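A quick numerical check of that approximation (a sketch; the exact probability of at least one occurrence in t independent unit-time trials is 1 − (1−p)^t):

```python
def exact(p, t):
    # Probability of at least one occurrence in t independent trials.
    return 1 - (1 - p) ** t

def approx(p, t):
    # Linear approximation, valid when p*t is small.
    return p * t

# Small p*t: the linear approximation is close.
print(exact(0.001, 10), approx(0.001, 10))   # ~0.00996 vs 0.01
# Larger p*t: the approximation breaks down (it can even exceed 1).
print(exact(0.2, 10), approx(0.2, 10))       # ~0.89 vs 2.0
```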

Anyway, this is a problem I think you math people need to resolve before NHST goes away.

I don’t understand. Bayes’ rule is true by definition of conditional probability. If you believe your unknown parameter can be modeled probabilistically, you accept Bayes’ rule. If you don’t believe it, then the statement is invalid.

In principle there could be some other set of assumptions that also lead to Bayes’ rule.

But I find it much more likely Bayes’ rule is the general case. The numerical correspondence between some p-values and posteriors would then be because those p-values are a special case of Bayes’ rule.

https://web.hku.hk/~gyin/materials/2020ShiYinAmeriStat.pdf

This is related, but afaict it only shows the equivalence (point it out if I missed it). I 100% agree there is an equivalence under certain circumstances. Same as for credible and confidence intervals.

Shouldn’t there be some way to derive one from the other with additional assumptions/constraints though? Ie, if we assume A, B, C, then we can derive the p-value as a special case of the posterior probability.

It reminds me of this paper where Michael Lew (who used to comment here but I haven’t seen him for a while) interprets a p-value (along with the sample size) as indexing a likelihood:

https://arxiv.org/abs/1311.0081

In particular figure 8. But in that figure, it is the hypothetical effect size that is varying rather than looking at the likelihood of seeing zero for different hypothesized parameter values. It seems to me somehow this must work out to the same thing.

Notably, there is not a single mention of the Bayes factor in this document. What should we gather from this omission?

Harlan

Harlan:

I do not think the Bayes factor is useful for the sorts of problems discussed in that report, which is the problem of estimating the effects of educational interventions.

Andrew: I completely agree.

But others might disagree. For instance the BASIE report writes:

“One practical application of credible intervals is to assess the likelihood that there is no meaningful difference between two groups (in other words, the probability that the groups are practically equivalent).”

When it comes to establishing that groups are (or are not) practically equivalent, some would prefer to do testing with a Bayes factor instead of simply looking at the credible interval. To me, the “BF for testing” is a rather odd approach, and I don’t see why a Bayesian wouldn’t instead define prior model odds and use the posterior model odds for testing, or instead simply look to the credible interval (as suggested by BASIE). Perhaps of interest on this: Linde et al. (2022) (https://tinyurl.com/ym7mvxry), and our response https://arxiv.org/pdf/2104.07834.pdf.

Bayesian analysis frameworks are very interesting. However, some of their statements are overstated, particularly the claim that their approach deals with “true” values instead of estimates. Both frequentist and Bayesian inference deal with inference, estimates, and probabilities. From the very beginning:

“the prior distribution describes how common it is for education interventions to have true (not estimated) effects of varying sizes”

What is the justification for this statement?

Besides, they justify the use of a Bayesian framework on the grounds that previous researchers may have misinterpreted significance test results. But Bayesian inference does not provide perfect interpretations either. First, results are highly sensitive to the choice of the prior, a choice that is even harder for small samples, which seem to be common in their field. Second, just as p-values must be interpreted with respect to the sample, Bayesian probabilities must be interpreted with respect to the prior.

Andrew: “I really like that they express this in terms of ‘evidence’ and ‘information’ rather than ‘belief.’”

In my opinion this is really the problem with this work: thinking that their approach is THE approach to use while it is still statistics. Evidence is not appropriate in this field, and belief is the core of Bayesian inference.

Corentin:

Regarding your paragraph:

1. Nowhere did I see in the report the claim that their approach is “THE approach to use.” They’re presenting one statistical approach, and if you don’t like it you can try something else.

2. I strongly disagree with your statement, “Evidence is not appropriate in this field.” If you’re interested in doing evidence-free inference about education policy, there are lots of places for that, but it’s not what the Institute of Education Sciences is interested in doing.

3. Regarding belief and Bayesian inference, I recommend you start with my papers with Shalizi and with Hennig. My short answer is that I prefer to think of these models in terms of evidence. Belief or betting is one framework for probability, not the only one.

I really like this tool as well. I do wish, however, that the authors would make explicit that this method is effectively what some have called “empirical Bayes.” Grounding it in the literature would help folks identify and discuss shortcomings, like ignoring uncertainty in the prior distribution parameters.

It also seems perhaps too focused on WWC evidence for informing prior distributions, when many interventions (such as reductions in class size) would not be eligible for review by the WWC. It’s not clear to me that prior evidence from WWC-reviewed studies is appropriate for analyses of any and all education interventions.

I’m grateful to hear your thoughts on this Andrew.