“Test & Roll: Profit-Maximizing A/B Tests” by Feit and Berman

Elea McDonnell Feit and Ron Berman write:

Marketers often use A/B testing as a tool to compare marketing treatments in a test stage and then deploy the better-performing treatment to the remainder of the consumer population. While these tests have traditionally been analyzed using hypothesis testing, we re-frame them as an explicit trade-off between the opportunity cost of the test (where some customers receive a sub-optimal treatment) and the potential losses associated with deploying a sub-optimal treatment to the remainder of the population.

We derive a closed-form expression for the profit-maximizing test size and show that it is substantially smaller than typically recommended for a hypothesis test, particularly when the response is noisy or when the total population is small. The common practice of using small holdout groups can be rationalized by asymmetric priors. The proposed test design achieves nearly the same expected regret as the flexible, yet harder-to-implement multi-armed bandit under a wide range of conditions.

We [Feit and Berman] demonstrate the benefits of the method in three different marketing contexts—website design, display advertising and catalog tests—in which we estimate priors from past data. In all three cases, the optimal sample sizes are substantially smaller than for a traditional hypothesis test, resulting in higher profit.

I’ve not read the paper in detail, but the basic idea makes a lot of sense to me.
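
To make that basic idea concrete, here is a toy sketch in Python, with made-up numbers and brute-force simulation rather than the paper's closed-form expression: each treatment's true mean response is drawn from a normal prior, n customers per arm get tested, the apparent winner is rolled out to everyone else, and we look for the n that maximizes expected total response.

    import numpy as np

    # Toy test & roll simulation. All numbers (N, mu, sigma, s) are made up
    # for illustration; this is not the paper's code or its formula.
    rng = np.random.default_rng(0)
    N = 100_000             # total customer population
    mu, sigma = 0.10, 0.02  # prior mean and sd of each arm's true mean response
    s = 0.30                # noise in an individual customer's response

    def expected_profit(n, sims=100_000):
        """Monte Carlo estimate of expected total response when testing n per arm."""
        m1 = rng.normal(mu, sigma, sims)        # arm 1's true mean, drawn from the prior
        m2 = rng.normal(mu, sigma, sims)        # arm 2's true mean
        ybar1 = rng.normal(m1, s / np.sqrt(n))  # sample mean observed in the test
        ybar2 = rng.normal(m2, s / np.sqrt(n))
        rolled = np.where(ybar1 > ybar2, m1, m2)  # deploy the apparent winner
        return np.mean(n * m1 + n * m2 + (N - 2 * n) * rolled)

    test_sizes = np.arange(100, 10_001, 100)
    profits = [expected_profit(n) for n in test_sizes]
    print("profit-maximizing test size per arm (simulated):",
          test_sizes[int(np.argmax(profits))])

In toy runs like this the expected-profit curve tends to be quite flat near its maximum, and the optimal test comes out as a small fraction of N, which is consistent with the abstract's claim that profit-maximizing tests are substantially smaller than a standard power calculation would recommend.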

I’m not an expert on this literature. I heard about this particular article from a blog comment today. You readers will perhaps have more to say about the topic.

11 thoughts on ““Test & Roll: Profit-Maximizing A/B Tests” by Feit and Berman”

  1. I was surprised the authors write “these tests have traditionally been analyzed using hypothesis testing” and then refer to the Bayesian estimation-based Thompson sampling as “popular”. The main conclusion seems to be, “Although sub-optimal relative to a multi-armed bandit, the profit-maximizing test & roll provides a transparent decision point and reduced operational complexity without significant loss of profit.” I think this is taking “multi-armed bandit” (a problem) to mean “Thompson sampling” (a strategy for solving the problem). By “transparent decision point”, do the authors mean collapsing to a single choice rather than maintaining a stochastic approach like Thompson sampling that only converges to a deterministic choice?

    The authors describe Thompson sampling as “hard to implement”. I would say it’s easy to implement inefficiently (e.g., a few lines of code in R wrapping a Stan model), but hard to implement efficiently (because you need to update Bayesian estimates observation by observation). This is the kind of problem where people seem to like sequential Monte Carlo (SMC) approaches (though from a quick Google search I saw only a few citations for this combination).
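
    For concreteness, the conjugate two-arm case takes only a handful of lines; the sketch below uses made-up conversion rates and is not anything from the paper. With binary outcomes the Beta-Bernoulli update is a one-liner, so the observation-by-observation refit is trivial; with a richer response model you'd be re-fitting something like a Stan model after every observation, which is where the inefficiency comes in.

      import numpy as np

      # Toy Thompson sampling for two arms with binary outcomes;
      # the conversion rates are invented for illustration.
      rng = np.random.default_rng(1)
      true_rates = np.array([0.10, 0.12])  # unknown in practice
      alpha = np.ones(2)                   # Beta posterior: successes + 1
      beta = np.ones(2)                    # Beta posterior: failures + 1

      for _ in range(100_000):
          draws = rng.beta(alpha, beta)    # one posterior draw of each arm's rate
          arm = int(np.argmax(draws))      # show the arm that looks best in this draw
          reward = rng.random() < true_rates[arm]
          alpha[arm] += reward             # conjugate update, one observation at a time
          beta[arm] += 1 - reward

      print("posterior means:", alpha / (alpha + beta))
      print("plays per arm:", (alpha + beta - 2).astype(int))

    Even in this easy conjugate case the allocation stays stochastic for as long as the test runs, which seems relevant to the “transparent decision point” question above.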

    • By “hard to implement”, we don’t mean it is hard to estimate the posteriors you need for Thompson sampling. We’re referring to the need to have stochastic treatments deployed on your website or app, which causes all sorts of headaches downstream: speed and reliability, testing of multiple processes, customer service doesn’t know what the customer saw, etc. I can’t tell you how many companies I’ve talked to that say they have tried Thompson sampling (or some other MAB heuristics) and quit because having stochastic treatments in production caused all sorts of other headaches. By “transparent decision point”, we mean there is a clear point in time where the treatment becomes totally deterministic, reducing complexity in the system.

      • Thanks for the clarification. I hadn’t thought about the difficulty of deployment, and it’s an important practical consideration. I can only imagine it’s even harder to maintain probabilistic selection in medical trials than in web interfaces.

        I think the decision to make a deterministic choice at some point is important. It’s not optimal asymptotically, but it’s practically required for efficiency. The efficiency bottleneck isn’t so much making the random choice, but fetching data from memory, or even worse, from disk.

    • > By “transparent decision point”, do the authors mean collapsing to a single choice rather than maintaining a stochastic approach like Thompson sampling that only converges to a deterministic choice?

      > The authors describe Thompson sampling as “hard to implement”. I would say it’s easy to implement inefficiently (e.g., a few lines of code in R wrapping a Stan model), but hard to implement efficiently (because you need to update Bayesian estimates observation by observation).

      I also took transparent decision point to mean a final choice rather than a stochastic approach. And I think there are lots of settings where any kind of stochastic approach is impossible or prohibitively difficult, and not just in the computational sense. Sometimes you’re required to have a legible and consistent policy, sometimes a “treatment” requires a lengthy training period, sometimes persistently tracking long-term identity to prevent contamination across different treatments is impossible, etc. You can’t, for example, stochastically administer different TV ads to different viewers–you get to send one ad to the Super Bowl.

      • > You can’t, for example, stochastically administer different TV ads to different viewers–you get to send one ad to the Super Bowl.

        There is some local advertising during the Super Bowl. I wouldn’t be surprised if there was a startup or ten aimed at hyperlocal or demographically segmented broadcast and cable ads.

        • I’m digging into the paper now but wanted to comment on the Super Bowl angle. This is just one old-school practitioner’s perspective, but I can think of better testing environments. There’s a lot of noise. A lot of waste (i.e., viewers not in your target market). Increased costs on a CPM basis. And the mechanics of aligning a local spot buy frighten me. At a minimum you have to accept that the spot run immediately before yours will be dissimilar unless you can negotiate placement in the pod. I’m shuddering just thinking about it.

  2. Oh hey! It’s this paper again! Saw this a couple years ago and really liked it. It’s always been my intuition that null-hypothesis significance testing is even more inappropriate in the tech industry than in academia. The traditional framework considers an asymmetric standard of evidence between the null and alternative hypotheses, but oftentimes in industry settings there’s no natural choice for a “null hypothesis” at all. There’s no reason NOT to chase noise when all alternatives are costless or equivalently costly. But it’s surprisingly hard to convince people of this even at big-name “independent thinker” firms. They end up tying themselves in knots trying to justify some arbitrary choice as a “null hypothesis” just to reshape the problem into something traditional methodology can handle–presumably because so long as they’re going with the state-of-the-art in science at the universities, nobody can blame them for how they’ve done things.

    • Somebody:

      Yes. I get frustrated because some of the most sophisticated academic statistical work is still done in the null hypothesis significance testing framework. All these intellectual contortions to take these cool and useful statistical ideas and cram them into the framework of coverage probabilities, type 1 and 2 error rates, etc. It makes me want to scream, especially because people seem to think this is so rigorous.

  3. I like this article a lot. The really stupid thing about significance testing in business is that if the cost of switching from treatment A to B is zero (as it plausibly is in many A/B testing applications) then type 1 errors have zero cost. Significance testing’s primary focus is to minimize errors that cost you nothing. In real life the switching cost isn’t actually zero, but it is often very, very small. I welcome this article’s focus on the actual reason businesses engage in testing.
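
    To put the reasoning in symbols: if the winner is rolled out to the remaining N_roll customers and the true difference in mean response between the better and worse treatment is Δ, then deploying the wrong one costs roughly N_roll × Δ in expectation, plus any switching cost. A type 1 error is precisely the case Δ = 0, so the only thing it can cost you is the switching cost, which online is often close to zero.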

  4. Hey, thanks for sending the article. Still trying to understand the intuition, though: where does the difference come from?

    One assumption I thought was a little weird was that they only deploy to the non-test remainder of the population (N − n1 − n2). In reality, shouldn’t you deploy to everyone, i.e., all N?
