“Beyond ‘Treatment Versus Control’: How Bayesian Analysis Makes Factorial Experiments Feasible in Education Research”

Daniel Kassler, Ira Nichols-Barrer, and Mariel Finucane write:

Researchers often wish to test a large set of related interventions or approaches to implementation. A factorial experiment accomplishes this by examining not only basic treatment–control comparisons but also the effects of multiple implementation “factors” such as different dosages or implementation strategies and the interactions between these factor levels. However, traditional methods of statistical inference may require prohibitively large sample sizes to perform complex factorial experiments.

We present a Bayesian approach to factorial design. Through the use of hierarchical priors and partial pooling, we show how Bayesian analysis substantially increases the precision of estimates in complex experiments with many factors and factor levels, while controlling the risk of false positives from multiple comparisons.

Using an experiment we performed for the U.S. Department of Education as a motivating example, we perform power calculations for both classical and Bayesian methods. We repeatedly simulate factorial experiments with a variety of sample sizes and numbers of treatment arms to estimate the minimum detectable effect (MDE) for each combination.

The Bayesian approach yields substantially lower MDEs when compared with classical methods for complex factorial experiments. For example, to test 72 treatment arms (five factors with two or three levels each), a classical experiment requires nearly twice the sample size as a Bayesian experiment to obtain a given MDE.

They conclude:

Bayesian methods are a valuable tool for researchers interested in studying complex interventions. They make factorial experiments with many treatment arms vastly more feasible.

I love it. This is stuff that I’ve been talking about for a long time but have never actually done. These people really did it. Progress!
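
To make the partial-pooling idea concrete, here is a minimal sketch of the kind of hierarchical factorial model the abstract describes, written in Python with PyMC. It is not the authors' code: the factors, sample size, prior scales, and simulated data are all made-up placeholders. Each factor's level effects share a prior whose scale is estimated from the data, and the interaction cells get the same treatment; that is where the partial pooling comes from.

```python
# Minimal sketch of a hierarchical factorial model with partial pooling,
# in the spirit of the abstract above (not the authors' code). The factors,
# sample size, prior scales, and simulated data are illustrative assumptions.
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)

# Fake data: 600 students randomized into a 3 x 2 factorial (6 arms).
n = 600
dose = rng.integers(0, 3, size=n)      # factor 1: three levels
fmt = rng.integers(0, 2, size=n)       # factor 2: two levels
y = rng.normal(0.10 * dose + 0.05 * fmt, 1.0)   # outcome in effect-size units

with pm.Model():
    # Hierarchical scales: how much a factor's levels differ is itself
    # estimated from the data, which is what produces the partial pooling.
    sigma_dose = pm.HalfNormal("sigma_dose", sigma=0.5)
    sigma_fmt = pm.HalfNormal("sigma_fmt", sigma=0.5)
    sigma_int = pm.HalfNormal("sigma_int", sigma=0.25)

    alpha = pm.Normal("alpha", mu=0.0, sigma=1.0)                      # grand mean
    b_dose = pm.Normal("b_dose", mu=0.0, sigma=sigma_dose, shape=3)    # main effects
    b_fmt = pm.Normal("b_fmt", mu=0.0, sigma=sigma_fmt, shape=2)
    b_int = pm.Normal("b_int", mu=0.0, sigma=sigma_int, shape=(3, 2))  # interactions

    mu = alpha + b_dose[dose] + b_fmt[fmt] + b_int[dose, fmt]
    sigma_y = pm.HalfNormal("sigma_y", sigma=1.0)
    pm.Normal("y_obs", mu=mu, sigma=sigma_y, observed=y)

    idata = pm.sample(1000, tune=1000, chains=4, random_seed=1)
```

Because the scales sigma_dose, sigma_fmt, and sigma_int are learned rather than fixed, noisy level and interaction estimates get pulled toward zero, which is the mechanism behind the precision gains and the multiple-comparisons control claimed in the abstract.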

7 thoughts on ““Beyond ‘Treatment Versus Control’: How Bayesian Analysis Makes Factorial Experiments Feasible in Education Research””

  1. Andrew said:

    I love it. This is stuff that I’ve been talking about for a long time but have never actually done. These people really did it. Progress!

    Are you sure? It looks like NHST to me:

    Researchers often wish to test a large set of related interventions or approaches to implementation.
    […]
    We repeatedly simulate factorial experiments with a variety of sample sizes and numbers of treatment arms to estimate the minimum detectable effect (MDE) for each combination.
    […]
    we consider the MDE of this experiment to be the smallest difference in effect size with at least an 80% chance of being found significant in the correct direction by the Bayesian model.
    […]
    In our experiment, we set the threshold for significance at .975 to correspond to a two-sided p value at the 95% confidence level, but experimenters may wish to explore other threshold values or even consider dispensing with the intermediate significance calculation altogether.

    They give some lip service to “there is no theoretical reason why the posterior probability cannot be used directly as the outcome” and “dispensing with the intermediate significance calculation”, but do not explain to the reader what exactly they would do with this info. (A toy version of the MDE calculation they describe is sketched at the end of this comment.)

    The goal of the study was apparently:

    The experiment started with a basic website template that remained the same across all treatment arms and consisted of a map showing school locations at the top followed by a list of schools. Four categories of information were shown for each school: distance to school, academic performance, safety, and school resources. Based on the results of our power calculations, we were able to test a total of five factors in a (3 x 3 x 2 x 2 x 2) configuration, for a total of 72 distinct treatment arms. In this study, the five factors were not examined independently: rather, the experiment sought to identify which of the 72 possible combinations of factor levels represented the best possible design of an information display, after accounting for the interaction effects between factors. In other words, the study sought to identify which of the 72 treatment arms represented the best possible display for each outcome.

    Do you even need stats to do that? This reminds me of the “riddle”:

    You collected data from two groups and are interested in the difference using alpha = 0.05. Group A had a mean of 10, while Group B had a mean of 8. The p-value = 0.1. What is the average difference between group A and group B?
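
    Anyway, here is a toy version of what the MDE-by-simulation procedure they describe seems to amount to (a sketch, not their code): call an effect “significant” when its posterior probability of being positive exceeds .975, and call the MDE the smallest effect that clears that bar in at least 80% of simulated experiments. A conjugate normal-normal model stands in for the full hierarchical model, and the prior scale, per-arm sample size, and number of simulations are made-up placeholders.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n_per_arm, sims, prior_sd = 200, 2000, 0.5   # illustrative placeholders

def power(effect):
    """Share of simulated experiments where Pr(effect > 0 | data) > .975."""
    hits = 0
    for _ in range(sims):
        treat = rng.normal(effect, 1.0, n_per_arm)
        ctrl = rng.normal(0.0, 1.0, n_per_arm)
        diff = treat.mean() - ctrl.mean()
        se = np.sqrt(2.0 / n_per_arm)          # known unit variance, for simplicity
        # Conjugate normal-normal posterior for the treatment effect.
        post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
        post_mean = post_var * diff / se**2
        hits += norm.cdf(post_mean / np.sqrt(post_var)) > 0.975
    return hits / sims

# Scan candidate effect sizes; the MDE is the smallest one with >= 80% power.
for effect in np.arange(0.05, 0.51, 0.05):
    if power(effect) >= 0.80:
        print(f"approximate MDE: {effect:.2f} standard deviations")
        break
```

    Nothing stops you from reporting the posterior probabilities (or the whole posterior for the arm-level effects) directly instead of thresholding them, which is presumably what “dispensing with the intermediate significance calculation” would look like.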

  2. “Bayesian methods are a valuable tool for researchers” that I fear would require a year or more of coursework for a classically-trained statistician like myself to become proficient at this level. Which I would actually love to do, if anyone has a fellowship that covers my current salary and benefits, and the tuition costs to boot. :)

    • Michael,

      I gave serious consideration to it last fall and into the winter. Decided that with most likely only 5-6 years left in my career it didn’t merit spending maybe 500-1,000 hours of my own time (in addition to work) to try and become somewhat proficient in techniques that would require major advocacy to even use in any meaningful way.

      If I were 10 years younger it might be a battle worth waging, but then again 10 years ago it would have been even harder to attempt without the modern tools for Bayesian workflow that have become readily available of late.

        • If you know R or Python, it is not difficult at all to do something like this. If you don’t, then you will gain a very useful skill independent of anything having to do with Bayesian stats.
