Priors on effect size in A/B testing

I just saw this interesting applied-focused post by Kaiser Fung on non-significance in A/B testing. Kaiser was responding to a post by Ron Kohavi. I can’t find Kohavi’s note anywhere, but you can read Kaiser’s post to get the picture.

Here I want to pick out a few sentences from Kaiser’s post:

Kohavi correctly points out that for a variety of reasons, people push for “shipping flat”, i.e. adopting the Test treatment even though it did not outperform Control in the A/B test. His note carefully lays out these reasons and debunks most of them.

The first section deals with situations in which Kohavi would accept “shipping flat”. He calls these “non-inferiority” scenarios. My response to those scenarios were posted last week. I’d prefer to call several of these quantification scenarios, in which the expected effect is negative, and the purpose of A/B testing is to estimate the magnitude. . . .

The “ship flat or not” decision should be based on a cost-benefit analysis. . . .

This all rang a bell because I’ve been thinking about priors for effect sizes in A/B tests.

– If you think the new treatment will probably work, why test it? Why not just do it? It’s because by gathering data you can be more likely to make the right decision.

– But if you have partial information (characterized in the above discussion by a non-significant p-value) and you have to decide, then you should use the decision analysis.

– Often it makes sense to consider these negative-expectation scenarios. Most proposed innovations are bad ideas, right?
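To make the decision-analysis bullets concrete, here is a minimal sketch with all numbers invented for illustration: a normal prior on the effect whose mean is slightly negative (most innovations are bad ideas), a noisy A/B estimate, a conjugate update, and a ship decision based on the posterior expected effect net of a hypothetical switching cost.

```python
# Toy decision analysis for "ship or not" after an A/B test.
# Assumptions (all invented for illustration):
#   - effect = lift in the metric we care about, in percentage points
#   - prior: most proposed changes are slightly harmful on average
#   - likelihood: the A/B test gives an unbiased estimate with known s.e.
#   - switching_cost: fixed cost of shipping, in the same units as the effect

prior_mean, prior_sd = -0.1, 0.5       # prior on the true effect
estimate, se = 0.2, 0.3                # observed lift and its standard error
switching_cost = 0.05                  # hypothetical cost of shipping

# Conjugate normal-normal update for the posterior on the true effect
post_prec = 1 / prior_sd**2 + 1 / se**2
post_var = 1 / post_prec
post_mean = post_var * (prior_mean / prior_sd**2 + estimate / se**2)

# Expected gain from shipping vs. staying with control
expected_gain = post_mean - switching_cost
print(f"posterior mean effect: {post_mean:.3f} (sd {post_var**0.5:.3f})")
print(f"expected gain from shipping: {expected_gain:.3f}")
print("decision:", "ship" if expected_gain > 0 else "don't ship")
```

Under this kind of setup, a statistically non-significant but positive estimate can still justify shipping, or not, depending on the prior and the switching cost; the p-value never enters the decision directly.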

Also this bit from Kohavi:

The problem is exacerbated when the first iteration shows stat-sig negative, tweaks are made, and the treatment is iterated a few times until we have a non-stat sig iteration. In such cases, a replication run should be made with high power to at least confirm that the org is not p-hacking.

– This is related to the idea that A/B tests typically don’t occur in a vacuum; we have a series of innovations, experiments, and decisions.
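On the high-powered replication run that Kohavi mentions, the standard two-sample sample-size arithmetic gives a feel for the cost. This is a generic power calculation with placeholder numbers, not anything from Kohavi's note: n per arm ≈ 2(z_{1−α/2} + z_{1−β})² σ²/δ².

```python
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha=0.05, power=0.9):
    """Approximate sample size per arm for a two-sample z-test
    to detect a difference in means of `delta` with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# Hypothetical numbers: detect a 0.5-point lift on a metric with sd 10,
# at 5% two-sided alpha and 90% power.
print(round(n_per_arm(delta=0.5, sigma=10, alpha=0.05, power=0.9)))
```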

I want to think more about all this.

21 thoughts on “Priors on effect size in A/B testing”

  1. Defaulting to “shipping flat” sounds like the opposite of the heuristic that “if it ain’t broke, don’t fix it”, which is based on the second law of thermodynamics: basically, “it’s easier to break something than to put it together or fix it”.

    Data-driven decision making should not ignore lessons learned from many centuries of experience on diverse projects just because it doesn’t fit nicely into a column of the spreadsheet.

    Also, a lot of these A/B tests measure something like user engagement, e.g., time spent on the site or number of clicks. So more confusing sites with lots of slow javascript popping up can end up “winning” if the test criterion isn’t chosen properly. The internet in general is becoming unusable because of this; another success for NHST.

  2. I think that there is a systematic bias in favour of the control in the descriptions by Kaiser Fung. “Shipping flat” is not a very helpful phrase in that context.

    1: In the article linked by Andrew he erroneously writes that “a p-value above the threshold (commonly 0.05) implies that there is no treatment effect.”* Of course, we all would say, if pressed, that the p>threshold would not support a rejection of the null in a hypothesis test, and that it would represent a very weak level of evidence against the null in a significance test, but neither of those statistical inferences corresponds to an implication of no treatment effect. (Yes, I know that the first phrase in that sentence is wishful thinking!)

    2: In Kaiser Fung’s prior post (https://junkcharts.typepad.com/numbersruleyourworld/2020/01/response-to-kohavis-note-on-non-inferiority-in-ab-testing.html) he writes about his bias explicitly: “I believe in most of the situations he outlined, non-inferiority is wishful thinking. It is better to acknowledge that those changes are expected to impact performance metrics negatively – the purpose of the A/B test in those scenarios is to quantify the extent of the damage.”

    I don’t know enough about the area of research being discussed to prescribe sensible behaviour, but if an intervention carries no cost over the control then it might make sense to adopt it any time the A/B testing shows it to give a higher return than control, no matter how small the difference, and no matter how small or large the p-value might be.

    * In a note that he introduces with “If you learn nothing else from the note, here is the part that you should pay attention to:” Ha!

    • >> In the article linked by Andrew he erroneously writes that “a p-value above the threshold (commonly 0.05) implies that there is no treatment effect

      Yes, this seems very problematic.

      • I believe Mr. Lew should retract this comment. In the linked post, I said no such thing. I cited Kohavi as stating the exact opposite and agreed with him. Specifically, I said “If you learn nothing else from the note, here is the part that you should pay attention to:” followed by Kohavi’s statement that “We cannot accept that a p-value above the threshold (commonly 0.05) implies that there is no treatment effect.”

        • I apologise for attributing Kohavi’s statement to the blog’s author, Kaiser.

          I did not understand Kaiser’s intention in his writing, as the structure of the statements makes it seem that Kaiser is in agreement with Kohavi’s erroneous statement regarding the inferences associated with a p-value greater than the intended threshold. Where Kaiser wrote “I am in general agreement with most of these points.” and then immediately in the next paragraph “His very first point is the most important. If you learn nothing else from the note, here is the part that you should pay attention to:” followed by the erroneous statement, I took the meaning to be that Kaiser agreed with the erroneous statement. It seems I was mistaken and so I apologise.

        • Michael: No worries. The spacing of that post with the image placement makes the indentation easy to miss. But you are still misreading Kohavi. The quote is “we cannot agree that…” Kohavi objects to treating p>0.05 as “no treatment effect” so we are all in alignment on this point. In practice, people (typically not statisticians) have invented a set of “reasons” to switch to a new treatment even though a randomized controlled experiment failed to show statistically significant benefit. We’re offering counterarguments to some of these “reasons”.

  3. I have a draft paper about how Rubin’s Causal Model may break down when we’re talking about a series of studies, because some of the causes of the effects of the ultimate experiment include pilot results. Anyway, this relates because Kohavi invokes forking paths, but unlike forking paths in the analysis, there’s an element of both p-hacking and of legitimate causal manipulation when forking paths occur prior to the experiment. Sure, the more times you run a study, the more likely you are to get p<.05, which is an incentive not to report the prior results. That's bad. But the procedures resulting in the series of tweaks, and the tweaks themselves, do occur before the reported results are produced–they may even be a priori in the sense that you could plausibly map out how you'll choose a path at all the forks–what changes you'll make depending on pilot results–before the first pilot. If that process is fully reported, then it's just an ordinary decision tree, and we can include the tree as part of the treatment (a toy sketch of such a pre-specified tree follows this comment). In other words, while the effects of the treatment-as-designed may not replicate in a different context, following that decision tree in a different context (different population, treatment providers, etc.) may result in a treatment that is slightly different from the original but that replicates the original's effects. (This is probably more likely to work when developing behavioral interventions than medical treatments.)

    In any event, it's wise to keep in mind that effects are actually "effects of causes" and it's very rare authors describe all the causes.
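As a toy illustration of the “decision tree as part of the treatment” idea in the comment above (my own sketch, not from the draft paper): the adaptation rule is written down before the first pilot, so the sequence of tweaks is a deterministic function of pilot results rather than an unreported forking path. The thresholds and tweaks below are hypothetical placeholders.

```python
# Toy pre-registered adaptation rule: which tweak to apply next,
# as a function of the observed pilot lift (in percentage points).
# Thresholds and tweaks are hypothetical placeholders.
PRE_REGISTERED_TREE = [
    (lambda lift: lift < -1.0, "abandon the treatment"),
    (lambda lift: lift < 0.0,  "simplify the onboarding step, rerun pilot"),
    (lambda lift: lift < 1.0,  "keep design, increase sample size, rerun"),
    (lambda lift: True,        "freeze design, run confirmatory test"),
]

def next_step(pilot_lift):
    """Walk the pre-specified tree: the first matching rule decides."""
    for condition, action in PRE_REGISTERED_TREE:
        if condition(pilot_lift):
            return action

for lift in (-2.3, -0.4, 0.6, 1.8):
    print(f"pilot lift {lift:+.1f} -> {next_step(lift)}")
```

Reporting the whole tree, plus the path actually taken, is what would let a replication in a new context follow the same procedure even if it ends up at a different leaf.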

  4. Shipping flat shouldn’t really be seen from the viewpoint of stats but from the viewpoint of “choosing battles”.

    Generally an a/b test means that a lot of people have worked on a new product for a considerable amount of time and want to record it as a ‘win’.

    An analyst would have to fight pretty hard not to ‘ship it’ at this point. Knowing they’d have to fight against some truly negative products in the future, why would they? It expends political capital with no obvious benefit; after all, it’s probably not worse than the existing standard.

    Best to save energy for when the a/b test really comes back negative.

    • Sam: this is exactly the scenario that happens a lot in the tech/business setting, and precisely why Kohavi wrote the note. For those not familiar with the applied setting, Sam’s last line is key. He’s implying that even when the result is clearly negative, there are frequently attempts to topple it… and that’s been my experience as well.

  5. “If you think the new treatment will probably work, why test it? Why not just do it? It’s because by gathering data you can be more likely to make the right decision.” – this is COMPLETELY wrong. It’s because you also don’t know whether the feature will have unintended negative effects in other parts of your ecosystem. Hell, even if the feature *works* it can have unintended negative effects, e.g., it boosts user engagement but tanks revenue. At an absolute minimum, you want the A/B test to show you that your proposed booster of user engagement won’t tank revenue – in which case, non-inferiority is a perfectly legitimate argument to make!
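The guardrail point above lends itself to a small sketch. This is a generic one-sided non-inferiority check with made-up numbers, not anyone's production setup: ship the engagement booster only if the lower confidence bound on the revenue difference clears a pre-chosen margin.

```python
from statistics import NormalDist

def non_inferior(mean_diff, se_diff, margin, alpha=0.05):
    """One-sided non-inferiority check: is the treatment-minus-control
    difference credibly better than -margin?"""
    lower_bound = mean_diff - NormalDist().inv_cdf(1 - alpha) * se_diff
    return lower_bound > -margin

# Hypothetical numbers: revenue per user drops by $0.02 with s.e. $0.03,
# and we are willing to tolerate at most a $0.10 drop.
print(non_inferior(mean_diff=-0.02, se_diff=0.03, margin=0.10))  # True
```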

  6. I think you are on to something. In the best scenario, the test design team gather together and “pre-register” the experiment. We ask the “sponsor” of the test and other affected teams to state their expectation of the benefits and costs – effectively to state their priors. Statisticians of course use this information to size the test. But also to gain consensus on the decision framework: if the test comes back positive, negative or neutral, what would we do?

    I’ve also used the prior to convert the significance level and power into positive/negative predictive values, because those are the proper metrics for management to understand the errors. A 90% confidence result may only mean we are right a third of the time when calling an effect significant! (A worked example along these lines follows this comment.)

    In a related post, I gave a scenario in which all sides agree the effect will be negative, and yet we should still run an experiment to learn the magnitude of the negative effect! It’s easier to guess the direction than the magnitude of an effect. (Of course, by running the test, we also allow for the possibility that all sides are wrong about the negativity, although in my experience this point will not win the argument about running the test.)
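To put numbers on the predictive-value point above (my inputs, chosen to roughly reproduce the “right a third of the time” figure, not the commenter's actual ones): with a 90% confidence threshold (alpha = 0.10), 80% power, and a prior that only about 5% of tested changes have a real effect, the positive predictive value comes out near one third.

```python
def predictive_values(prior, alpha, power):
    """Convert significance level and power into positive/negative
    predictive values, given a prior probability of a real effect."""
    ppv = power * prior / (power * prior + alpha * (1 - prior))
    npv = (1 - alpha) * (1 - prior) / ((1 - alpha) * (1 - prior) + (1 - power) * prior)
    return ppv, npv

# Hypothetical inputs: 5% of tested changes truly work, alpha = 0.10
# (a "90% confidence" threshold), 80% power.
ppv, npv = predictive_values(prior=0.05, alpha=0.10, power=0.80)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # PPV comes out near 0.30
```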

  7. I’m happy to “ship flat”, because a lot of the time we ship algorithms for operational benefits: they are simpler, more theoretically grounded, less expensive to maintain, or they move us off deprecated technologies. Where I am, the majority of tested improvements may be of this kind.

    • In other words, the A/B test is a kind of integration test with the whole system. When you have hundreds of important (meaning “corresponding to a lot of dollars”) metrics, there is effectively no other way to test that your software works, even if it does not target a specific visible metric.

      • I’m very familiar with such scenarios. The question is: why did the A/B test run in the first place? If no one is expecting a KPI benefit and the operational benefits have no uncertainty or risk, then the answer should be ship, not “ship flat”.
        I believe your scenario falls into the A/B testing for quantification category. The decision to ship is pre-ordained. There is uncertainty related to the magnitude of the impact, rather than its direction. The value of the test is to nail down the magnitudes. I addressed this scenario as legitimate in the precursor to the post Andrew linked to.

  8. Perhaps unrelated, but this makes me think of governments running their proposed projects through a cost-benefit analysis. Notionally you can test the proposals put forward and pick the best ones for funding (though that process has its own considerable errors and biases), but spare a thought for which projects have been put forward to be tested in the first place. In government there’s no counter-proposal mechanism at the cost-benefit analysis stage that suggests that perhaps instead of X you should do Y, or only a subset of X.

    In reality to select those projects internally it’s already gone through a lot of internal stakeholder engagement, strategy, concept design, costing, and perhaps some external input as well. These mechanisms can be accurate, or they can go off the rails. To put this politely: in some circumstances there may be stakeholders who are incentivised by things other than the outcome of the cost benefit analysis, for better or worse.

    • See this video lecture I did for HBR. I think I addressed some of these practical considerations there. In the business world, indeed, what gets tested involves a lot of negotiation. The situation is similar to journalism. We can’t judge journalistic bias by bias within the published materials because there is also selection bias.

  9. I have thought about this! In a paper with Ron Berman (https://arxiv.org/abs/1811.00457), we lay out the decision problem for a typical A/B test on marketing communications and derive an optimal sample size formula. Of course, this requires appropriate informative priors, and we use a meta-analysis of collections of A/B tests to estimate an appropriate prior for the treatment effect from a new A/B test independently drawn from the same population. I presented the paper at StanCon 2019.

    Ron’s comment does make me think about how this could be extended to situations where we are iterating on related treatments. If we had a data set showing the sequence of A/B tests run by some companies, we might be able to build a model of the expected dynamics. I’d love to know whether the designers of the treatments tend to improve them over iterations; I imagine this varies. (A rough simulation sketch of the prior-informed sample-size idea follows below.)
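To give a flavor of the prior-informed sample-size idea in the comment above, here is a rough Monte Carlo sketch of a generic “test, then roll out the winner” setup. The population size, prior, and noise level are all made up, and this is a numerical scan rather than the closed-form formula derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000                        # total audience to allocate (hypothetical)
prior_mean, prior_sd = 0.0, 0.02   # prior on each arm's mean response (made up)
noise_sd = 1.0                     # per-user response noise (made up)
sims = 2_000

def expected_total(n_per_arm):
    """Monte Carlo estimate of the expected total response: the test phase
    uses n per arm, then the arm with the higher observed test mean is
    rolled out to the remaining users."""
    totals = []
    for _ in range(sims):
        mu = rng.normal(prior_mean, prior_sd, size=2)          # true arm means
        xbar = rng.normal(mu, noise_sd / np.sqrt(n_per_arm))   # test estimates
        winner = int(xbar[1] > xbar[0])
        remaining = N - 2 * n_per_arm
        totals.append(n_per_arm * (mu[0] + mu[1]) + remaining * mu[winner])
    return np.mean(totals)

grid = [100, 300, 1000, 3000, 10000, 30000]
best = max(grid, key=expected_total)
print("approximately best n per arm on this grid:", best)
```

Scanning n numerically like this only illustrates the trade-off (test longer to pick the better arm more reliably, or test less and roll the decision out sooner); the paper works the normal-normal case out analytically and estimates the prior from a meta-analysis of past experiments rather than assuming one.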
