Berk Özler writes:

Background: You receive a fictional proposal from a major foundation to review. The proposal wants to look at the impact of 5 minute “patience” training on all kinds of behaviors. This is a poor country, so there are no admin data. They make the following points:

A. If successful, this is really zero cost to roll out—it’s just pushed through smart phones. Therefore, the cost of the program can be modelled as approximately zero. It falls in the “letters to people” kind of stuff.

B. However, they want to show the impact of this on a whole bunch of things. They can check take-up because they know who clicks through and goes through everything, and they expect it to be low.

C. Given take-up and expected impacts, they argue that a fairly small impact could have quite a large effect.

D. But here is the rub: For the experiment they will need $2 million for data collection. They need to survey XY,000 households and do very long surveys on everything you can think of 4 times. Having carefully read all the stuff on p-values, this is the power they need to detect a 1% increase in savings. The combination of a “letter to people” type experiment with (a) lack of admin data and (b) desire to show effects on all kinds of things essentially blows the budget.

I am quite worried about this p-values and power stuff, since (a) on the one hand it’s good that we don’t give too much credence to small sample studies with large effects because of publication bias and (b) on the other hand, if this is going to be interpreted as “powering up” what are essentially stupid interventions, that’s not a great direction to go either. The problem is that without a theoretical framework, it’s harder to be ex ante sanguine about what is “stupid”—agnostic may be a better phrase, since it *might* turn up something.

My instinct here, motivated by the Bayesian optimal sampling literature, would be to say that they should first try this with 100 households for $500 and see what the effects are AND publish the effects. There should be an optimal sequence of experiments that leads to scale-up as positive results arise. In short, publication bias implies a preference for small sample size experiments with big effects, which are probably false. But this should cause us to solve the publication bias problem, NOT create a further distortion by powering up stupidity. Ramsey second best is not going to work here. Unfortunately, the math even in the simple case with single sequential samples is a complete nightmare, but I’m wondering if there is a simpler way to explain this.

I’ll respond to these questions in reverse order:

– To address the last sentence in the above quote: no, the math is not a nightmare here at all. There’s no “math” at all to worry about here: just gather the data, and then, in your analysis, include as predictors all variables that are used in the data collection. With a sequential design, just include time as a predictor in your model. This general issue actually came up in a recent discussion.

To do this and get reasonable results, you’ll want to do a reasonable analysis: don’t aim for or look at p-values, Bayes factors, or “statistical significance”: just fit a serious regression model, using relevant prior information to partially pool effect size estimates toward zero. And of course commit to publishing your results no matter what shows up.
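To make "partially pool toward zero" concrete, here is a minimal sketch of the simplest possible case: one effect estimate, a normal likelihood, and a normal prior centered at zero. This is a stand-in for the full regression model, not a substitute for it. The function name `shrink_toward_zero` and all the numbers are hypothetical, chosen only for illustration.

```python
import math

def shrink_toward_zero(estimate, std_error, prior_sd):
    """Posterior mean and sd of an effect, assuming a Normal(0, prior_sd^2)
    prior and a normal likelihood centred on `estimate` with sd `std_error`."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    # The prior mean is zero, so only the data term contributes to the mean.
    post_mean = post_var * data_precision * estimate
    return post_mean, math.sqrt(post_var)

# Hypothetical pilot: estimated gain of $8 with a standard error of $10,
# and a prior saying effects of this kind are rarely larger than a few dollars.
post_mean, post_sd = shrink_toward_zero(estimate=8.0, std_error=10.0, prior_sd=2.0)
print(round(post_mean, 2), round(post_sd, 2))
```

With these made-up inputs the noisy +$8 estimate is pulled down to roughly $0.31, because the pilot carries far less information than the prior.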

– I’m not quite sure what is meant by “a 1% increase in savings”? This is a poor country; are these cash savings? Who’s doing the saving? Are you saying that the savers will save 1% more? Or that 1% more people will save? These questions are relevant to who you target the intervention to.

– I don’t have a good sense of where this $2 million cost is coming from. I guess $2 million is a bargain if the intervention really works. I don’t have a good sense of whether you think it will really work.

– One way to get a handle on the “how effective is the intervention?” question is to consider a large set of possible interventions. You have this “5 minute patience training,” whatever that is. But there must be a lot of other ideas out there that are similarly zero-cost and somehow vaguely backed by previous theory and experiment. Would it make sense to spend $2 million on each of these? This is not a rhetorical question: I’m really asking. If there are 10 possible interventions, would you do a set of studies costing $20 million? Or is the idea that any of these 10 interventions would work, but their effects would not be additive, so you just want to find one good one and it doesn’t matter which one it is?

– A related point is that interventions are compared to alternative courses of action. What are people currently doing? Maybe whatever they are currently doing is actually more effective than this 5 minute patience training?

Anyway, the good news is that you don’t need to worry about “the math.” I think all the difficulty here comes in thinking about the many possible interventions you might do.

I think the math he’s talking about is something about sequential decision theory… Do a small study, get a posterior distribution, then decide whether to follow up with a larger study; lather, rinse, repeat, and choose an optimal stopping point. It’s not terrible math, but it may be unfamiliar and require a real-world utility function that could be contentious.

Either drop the project or charge them 10-100x to be the patsy for pre-specified conclusions. You are in $500/hr lawyer land now.

Thanks! What would be the optimal sampling (or experimental, for that matter) approach, though, in cases where the intervention is cheap, so that even a pretty small effect would be cost-effective, but detecting such a small effect substantially increases the sample size required (especially in a cluster-RCT setting)?

How about selection effects? What’s the design? Intention to treat?

To clarify on this, the question can be separated into two parts:

1) What is the optimal sequential experimentation method, with stopping rules, so that the funder doesn’t necessarily spend $2 million on something that might not work?

2) How should the experiment be analyzed after this sequential decision-making?

To make this concrete, suppose the population has mean income of $100, standard deviation of $50, and the proposed intervention might increase income by 1% (so by $1), and if delivered at scale, might only cost 10 cents per person to deliver. So small effects for any one person, but a cost-benefit ratio that could potentially be very high.

A standard power calculation would say we need 52,538 treatment and 52,538 control to have 90% power to detect this effect in an experiment on this population. But running an experiment at that size will be very expensive (hence the $2 million price tag).
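For readers who want to check the arithmetic, the sketch below reproduces the 52,538 figure. It assumes the usual normal-approximation formula for comparing two means with equal arms and a two-sided test at the 5% level; the comment above does not state the significance level, so that part is an assumption. The function name `n_per_arm` is made up for this example.

```python
import math
from statistics import NormalDist

def n_per_arm(effect, sd, alpha=0.05, power=0.90):
    """Sample size per arm for a two-sided, two-sample z-test:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / effect)^2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # about 1.28 for 90% power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Standard deviation $50, effect $1 (1% of a $100 mean income).
print(n_per_arm(effect=1.0, sd=50.0))  # 52538
```

Because the required sample size scales with the inverse square of the effect, halving the detectable effect to $0.50 would quadruple the number of households.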

The question is then asking whether the funder can instead recommend a sequence of conditional funding, where a tranche is paid to do this on a smaller sample, the results are looked at, and then a decision is made as to whether to stop the experiment (because the treatment does not seem to be working), or to fund another tranche, and so on. I know this type of sequential work is done in medical trials – but it is not something we have seen done in these types of economic experiments, and so the goal is to reach out to other fields and see ideas for how to handle such a problem.

As Andrew notes, the funder may also be getting presented continually with many such ideas for 1% improvements, so this could be generalized to them having 20 proposals that all suggest something which could raise income by 1%, but that all would need $2 million to run a fully powered study – and they don’t want to spend $40 million.
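The tranche-by-tranche funding idea can be sketched as a small simulation. This is only one possible version, under loud assumptions: each tranche yields a normally distributed difference in means, the effect gets a normal prior centered at zero, and funding stops when the posterior probability of a positive effect drops below a threshold. The function `run_tranches`, the 10% threshold, and the example numbers are all hypothetical; a real rule would come from the funder's costs and benefits.

```python
import math
import random
from statistics import NormalDist

def run_tranches(true_effect, sd, n_per_arm, max_tranches,
                 prior_sd, stop_below=0.10, seed=0):
    """Simulate tranche-by-tranche funding with a normal-normal update.

    Returns (decision, tranches_used, prob_effect_positive).
    """
    rng = random.Random(seed)
    post_mean, post_var = 0.0, prior_sd ** 2
    se = sd * math.sqrt(2.0 / n_per_arm)   # se of a difference in means
    prob_positive = 0.5                     # prior is symmetric around zero
    for tranche in range(1, max_tranches + 1):
        diff = rng.gauss(true_effect, se)   # simulated estimate from this tranche
        precision = 1.0 / post_var + 1.0 / se ** 2
        post_mean = (post_mean / post_var + diff / se ** 2) / precision
        post_var = 1.0 / precision
        posterior = NormalDist(post_mean, math.sqrt(post_var))
        prob_positive = 1.0 - posterior.cdf(0.0)
        if prob_positive < stop_below:      # looks futile: stop funding
            return "stop", tranche, prob_positive
    return "completed", max_tranches, prob_positive

# Hypothetical run: true effect $1, sd $50, 5,000 households per arm per tranche.
decision, used, prob = run_tranches(true_effect=1.0, sd=50.0, n_per_arm=5000,
                                    max_tranches=10, prior_sd=2.0)
print(decision, used, round(prob, 3))
```

Running the function over many seeds and several plausible true effects gives a rough picture of how often a given rule would stop early and how much it would save.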

I agree with you about this being the way to go: sequential application of decision theory solves this problem, and if you’re not talking with the FDA then you don’t have the further problem that the law doesn’t allow you to do the right thing. FDA trials should do this, but as I understand it they have to deal with statistical policy written into the law that isn’t necessarily friendly to this kind of analysis.

I’ve been participating on this blog for over a decade, and I don’t often blow my own consulting horn here, but seriously, if someone actually has this kind of problem, contact me, because this is the kind of thing I specifically set up my consulting company to help people with:

http://www.lakelandappliedsciences.com/

Just to be clear, when I sent this email to Andrew, I was forwarding an email from a colleague, meaning that “asking for a friend” was not sarcastic this time:

“Hi Andrew,

Berk Ozler here from the World Bank’s research department and the Development Impact blog. You have, on occasion, commented on a thing or two that I wrote.

I have a question from a colleague and we’re trying to crowd source a few answers from academics as well as policymaker types to see if we cannot put some view points on this together. We’d love your thoughts if you have any – doesn’t matter if you send them to us or post on your own blog…”

So, JIC this goes viral, please note that I am not trying to spend (or allow or prevent someone else to spend) $2 million on any kind of 5-minute training…

Berk.

If it costs nothing to roll it out, save the $2 million and just roll it out. Or are they afraid it could do as much harm as good if implemented?

I see this as analogous to an adaptive design problem. Adaptive trial designs are used to search high-dimensional treatment spaces more efficiently than a one-shot trial would allow. The analysis of these designs draws on bandit theory and Thompson sampling. These are popular in, e.g., online advertising, where the design space admits lots of possible permutations. What’s implicit here is that an experiment with the treatment being discussed is one among many things that one could study with the $2 million, so there are trade-offs. To me, that’s analogous to the “high-dimensional treatment space” that adaptive designs want to explore. Thus, the proposal to do something smaller and then update is precisely the kind of thing that you see with adaptive trials that start with a tilt toward “exploration” but then eventually tilt more toward “exploitation.”
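As a rough illustration of the exploration-to-exploitation tilt, here is a minimal Thompson sampling sketch for several candidate treatments. It assumes each funded batch returns a noisy, normally distributed estimate of that treatment's effect, with a normal prior centered at zero on every effect. The function `thompson_allocate` and the example effects are hypothetical, and this is a toy version of the idea rather than a design anyone should field as is.

```python
import math
import random

def thompson_allocate(true_effects, sd, batch_n, rounds, prior_sd, seed=0):
    """Each round: draw one value from every arm's posterior, fund a batch
    for the arm with the largest draw, then update that arm's posterior."""
    rng = random.Random(seed)
    m = len(true_effects)
    means = [0.0] * m
    variances = [prior_sd ** 2] * m
    counts = [0] * m
    se = sd / math.sqrt(batch_n)           # noise of one batch's estimate
    for _ in range(rounds):
        draws = [rng.gauss(means[i], math.sqrt(variances[i])) for i in range(m)]
        arm = max(range(m), key=lambda i: draws[i])
        obs = rng.gauss(true_effects[arm], se)   # simulated batch result
        precision = 1.0 / variances[arm] + 1.0 / se ** 2
        means[arm] = (means[arm] / variances[arm] + obs / se ** 2) / precision
        variances[arm] = 1.0 / precision
        counts[arm] += 1
    return counts, means

# Hypothetical: three zero-cost nudges with true effects of $0, $0.50 and $1.
counts, means = thompson_allocate([0.0, 0.5, 1.0], sd=50.0, batch_n=1000,
                                  rounds=50, prior_sd=2.0)
print(counts)
```

Early rounds spread batches across arms because the posteriors are wide; as evidence accumulates, batches concentrate on whichever arm looks best.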

You might try this book on group-sequential trials: https://www.crcpress.com/Group-Sequential-Methods-with-Applications-to-Clinical-Trials/Jennison-Turnbull/p/book/9780849303166

for points 1) and 2) in David’s comment. Seems like the right setup for this question.

Exactly as David points out, I think that funders are increasingly being asked to evaluate and fund proposals that have (a) little prior information and (b) high data collection costs due to large sample sizes from power calculations. These are not just examples like 5 minute patience training, but also proposals that evaluate the impact of (say) a health intervention on (say) schooling. We know the health intervention improves health, but we may not have a strong prior on the impact of the intervention on schooling.

There is little guidance on this; in fact, asking for detailed power calculations is now part of the proposal writing, and this has become increasingly salient with the emphasis on publication bias and false positives in small samples. I don’t know this literature very well, but the math for bandit problems and optimal experimentation used to be quite hard.

The specific problem for the funder could perhaps be described as:

There are M Treatments, and you have infinite resources (!). You would like to find the most effective treatment, where effective is defined using some metric, say present discounted value. You have little prior information on any of these M treatments but you don’t think that they will harm people.

The cost structure for running an experiment for any of the M treatments is K + c(N), where K is the fixed cost, N is the sample size in the experiment and c(N) is the variable cost, a function of sample size.

What is the sequence of sample sizes for each of the M treatments so that you converge to the best treatment at minimum cost?
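To see how the fixed cost K enters, here is a small sketch that totals K + c(N) over a sequence of experiments. It assumes a linear variable cost, c(N) = cost_per_unit * N, which the setup above does not specify. The dollar figures are hypothetical and chosen only so that a one-shot study lands near $2 million.

```python
def experiment_cost(n, fixed_cost, cost_per_unit):
    """Cost of one experiment: K + c(N), with c(N) assumed linear in N."""
    return fixed_cost + cost_per_unit * n

def plan_cost(sample_sizes, fixed_cost, cost_per_unit):
    """Total cost of running the listed experiments one after another."""
    return sum(experiment_cost(n, fixed_cost, cost_per_unit) for n in sample_sizes)

K = 10_000          # hypothetical fixed cost per experiment, in dollars
PER_UNIT = 19       # hypothetical survey cost per household, in dollars

one_shot = plan_cost([105_076], K, PER_UNIT)                # single full-size study
staged = plan_cost([1_000, 10_000, 94_076], K, PER_UNIT)    # same households, three stages
stopped_early = plan_cost([1_000, 10_000], K, PER_UNIT)     # abandoned after stage two
print(one_shot, staged, stopped_early)
```

Under these assumptions, staging costs two extra fixed costs if every stage runs, and saves most of the budget if the sequence stops early. That trade-off is what the choice of sample sizes has to balance.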

What characterizes the small part of the population that bothers with a 5 minute app on patience from above? For one thing, it tends to exclude the most impatient people.

I don’t understand how a sequence of growing trials would help here. From the definition of the problem, it’s clear that any effect is expected to be very small, so you only really expect “positive” results on the last full-scale trial. Before that you’re in “power = 0.06 territory”.
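The low-power point can be checked with David's illustrative numbers (standard deviation $50, true effect $1) and the 100-household pilot from the original post. This sketch assumes an even split of 50 households per arm and a two-sided z-test at the 5% level; the function name `power_two_arm` is made up for the example.

```python
import math
from statistics import NormalDist

def power_two_arm(effect, sd, n_per_arm, alpha=0.05):
    """Power of a two-sided, two-sample z-test (normal approximation)."""
    z = NormalDist()
    se = sd * math.sqrt(2.0 / n_per_arm)     # se of the difference in means
    crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    # Probability the test statistic lands beyond either critical value.
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

pilot_power = power_two_arm(effect=1.0, sd=50.0, n_per_arm=50)
full_power = power_two_arm(effect=1.0, sd=50.0, n_per_arm=52_538)
print(round(pilot_power, 3), round(full_power, 3))
```

With these inputs the pilot's power is about 0.05, barely above the false-positive rate, while the full-size study comes out at about 0.90.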

I can understand pilot studies for the purposes of working out the kinks before doing the real thing, but that doesn’t sound like what’s meant here by “scale-up as positive results arise” (from the OP).

The only reasons I can see for publishing the _results_ of smaller trials are (a) if the smaller study shows a huge, clear positive effect and you’re confident that biases are under control at that level, or (b) if the smaller study shows a reasonably clear _negative_ effect and the intervention seems to be harmful.

So my question is: what is the basis for the decision to scale up? (If it was articulated in the original post or the comments, I’ve missed it so far.)

What the basis should be is maximizing a real world utility.

“In short, publication bias implies a preference for small sample size experiments with big effects, which are probably false. But this should cause us to solve the publication bias problem, NOT create a further distortion by powering up stupidity.”

It *could* be publication bias. Maybe. Or it could be small study effects. It’s not particularly easy to tell and the only way you would make these assumptions is if you go with the sample size/effect size model, which doesn’t always model the real world.

Perhaps the mathematical difficulty is a byproduct of thinking of this as a sequential Monte Carlo design, with continuously updated posteriors? I remember Betancourt mentioning the difficulty of performing these analyses in a talk he gave about HMC a number of years ago in Japan (now on YouTube, https://www.youtube.com/watch?v=uSjsJg8fcwY). A question was raised about updating posteriors as priors when new data arrives and Betancourt, as I remember, essentially said that it wasn’t feasible with regular HMC modeling and required sequential HMC or some other complex methodology.
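For contrast, in the simplest conjugate case (normal prior, normal data with known variance), using yesterday's posterior as today's prior is easy and gives exactly the same answer as analyzing all the data at once. The sketch below shows only that special case; it says nothing about models fit by HMC, where the posterior is a set of draws rather than a closed-form distribution. All names and numbers are hypothetical.

```python
def update(prior_mean, prior_var, obs, obs_var):
    """One conjugate normal-normal update with known observation variance."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    return (prior_mean / prior_var + obs / obs_var) / precision, 1.0 / precision

def sequential_posterior(prior_mean, prior_var, data):
    """Feed the (estimate, variance) pairs in one at a time."""
    mean, var = prior_mean, prior_var
    for obs, obs_var in data:
        mean, var = update(mean, var, obs, obs_var)
    return mean, var

def batch_posterior(prior_mean, prior_var, data):
    """Combine the prior with all (estimate, variance) pairs in one step."""
    precision = 1.0 / prior_var + sum(1.0 / v for _, v in data)
    weighted = prior_mean / prior_var + sum(o / v for o, v in data)
    return weighted / precision, 1.0 / precision

# Hypothetical estimates from three successive small studies.
studies = [(3.0, 25.0), (-1.0, 16.0), (2.0, 9.0)]
seq = sequential_posterior(0.0, 4.0, studies)
bat = batch_posterior(0.0, 4.0, studies)
print(seq, bat)
```

The two results agree up to floating-point rounding.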