A question about the piranha problem as it applies to A/B testing

Wicaksono Wijono writes:

While listening to your seminar about the piranha problem a couple of weeks back, I kept thinking about a similar work situation but in the opposite direction. I’d be extremely grateful if you could share your thoughts.

So the piranha problem is stated as “There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.” The task, then, is to find out which large effects are real and which are spurious.

At work, people sometimes bring up the opposite argument. When experiments (A/B tests) are pre-registered, the results often are not statistically significant. And a few months down the line, people ask whether we can re-run the experiment, because the app or website has changed, so the treatment might interact differently with the current version. So instead of arguing that large effects can be explained by interactions of previously established large effects, some people argue that large effects are hidden by as-yet-unknown interaction effects.

My gut reaction is a resounding no, because otherwise people would re-test things every time they don’t get the results they want, and the number of false positives would go up like crazy. But it feels like there is a ring of truth to the concerns they raise.

For instance, if the old website had a green layout and we changed the button to green, that might have a bad impact. However, if the current layout is red, making the button green might make it stand out more, and the treatment will have a positive effect. In that regard, it will be difficult to see consistent treatment effects over time when the website itself keeps evolving and the interaction terms keep changing. Even for previously established significant effects, how do we know that the effect size estimated a year ago still holds with the current version?

What do you think? Is there a good framework to evaluate just when we need to re-run an experiment, if that is even a good idea? I can’t find a satisfying resolution to this.

My reply:

I suspect that large effects are out there, but, as you say, the effects can be strongly dependent on context. So, even if an intervention works in a test, it might not work in the future because in the future the conditions will change in some way. Given all that, I think the right way to study this is to explicitly model effects as varying. For example, instead of doing a single A/B test of an intervention, you could try testing it in many different settings, and then analyze the results with a hierarchical model so that you’re estimating varying effects. Then when it comes to decision-making, you can keep that variation in mind.
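
As a rough sketch of what that could look like, suppose each test setting j (site version, country, season, etc.) produced an estimated lift y_hat[j] with standard error se[j]; a partial-pooling model could then be fit along these lines (the numbers are made up, and PyMC is just one convenient choice):

```python
# A minimal sketch of partial pooling across settings, assuming each setting j
# yields an estimated lift y_hat[j] with standard error se[j] (made-up data).
import numpy as np
import pymc as pm

y_hat = np.array([0.8, -0.3, 1.2, 0.1, 0.5])  # estimated lift in each setting
se = np.array([0.6, 0.5, 0.7, 0.4, 0.5])      # standard error of each estimate

with pm.Model() as varying_effects:
    mu = pm.Normal("mu", 0, 1)                # average effect across settings
    tau = pm.HalfNormal("tau", 1)             # how much the effect varies by setting
    theta = pm.Normal("theta", mu, tau, shape=len(y_hat))  # setting-level effects
    pm.Normal("y", theta, sigma=se, observed=y_hat)
    idata = pm.sample(1000, tune=1000, random_seed=1)
```

Here mu is the overall effect, tau measures how context dependent the effect is, and each theta[j] is pulled partway toward mu; a large posterior for tau is the model telling you not to expect last year’s estimate to carry over unchanged.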

29 thoughts on “A question about the piranha problem as it applies to A/B testing”

  1. I don’t think this is really practicable in a web-development context. There are limited resources, and it’s genuinely costly to build out many different UI options that won’t eventually be used.

    I’m inclined to say that retesting is the right answer and you just need a protocol to enforce discipline (which people desperately need in this space anyway).

    • What would be a good protocol? It seems like any protocol would be highly subjective. The state of the website is temporally dependent, i.e. the website might look about the same in 1 month, but vastly different in 2 years. When does it make sense to retest? 6 months? 1 year?

  2. “For example, instead of doing a single A/B test of an intervention, you could try testing it in many different settings, and then analyze the results with a hierarchical model so that you’re estimating varying effects.”

    Also, expand the notion of different settings to include estimating variation in response among study participants. This requires more, perhaps many more, observation time points on each subject, but it helps move toward Senn’s vision of “personalized medicine”, and toward multi-level models.

  3. It seems to me that the issue here is with the interpretation of the A/B test. If you wanted to determine the effect of changing a button from red to green *in general*, you would indeed need to measure that effect under a variety of scenarios as AG describes. But often, the A/B test is meant to determine the effect of the change only in the specific context in which it occurs (e.g., changing red to green in the context of a green background). So then repeating the test as the context changes makes perfect sense because it is measuring a different (potentially correlated) effect each time. You could then use an appropriately-built hierarchical model to take these multiple A/B tests and obtain an estimate of the general effect.

    • +1. Often the context for A/B testing is short-lived by definition, such as when it’s used to compare two promotions at the beginning of a given campaign. People will hazard guesses about “what works” as an input to next quarter’s campaign, but that’s about as far as that goes.

      If you’re Google, you can test to measure which of 41 shades of blue is best. The search results page evolves slowly, so in that stable context if you’ve picked the best blue you can probably feel good about that choice for some time to come. Also, if you’re Google, you’ve built yourself a way to do A/B testing really efficiently, and you have enough traffic that even small improvements are worthwhile.

      If you’re not Google, it gets sketchy fast, not only because context is fleeting but because effects of simple changes are typically much smaller than people think they will be. Some clients read things like https://articles.uie.com/three_hund_million_button/ and want to believe that they’re just a few simple usability fixes away from solving all their conversion problems.

  4. A/B testing offers a unique setup in that you have data on the activity before, during, and after the test. You can choose the size of the before window, the length of the test, and the so-called washout period. The data usually have two dimensions, a result-oriented KPI and an activity-oriented KPI. The test should show you the achieved balance, since these represent short-term and long-term considerations.

    Now, the main issue is that such tests reflect on associations and rarely on causal effects.

    Multifactor A/B tests, as suggested by others above, are not easy to implement; however, stratifying the data into blocks is good practice. Remember: block what you can, randomize what you cannot…

    In designing a test, one should specify its goals and objectives carefully. The Type S and Type M errors proposed by Gelman and Carlin provide a nice way to evaluate the study design.

    • Can you elaborate on why the tests reflect on associations and not causal effects? Because A/B tests are RCTs at large scale, I was under the impression that they’re the gold standard for establishing causality.

      And can you explain the part about blocking what we can and randomizing what we cannot? In my experience, blocking is used whenever full randomization is not feasible, not the other way around. I’m curious about alternative ways to use blocking.

  5. “people would re-test things every time they don’t get the results they want, and the number of false positives would go up like crazy.”

    These are true positives. The statistical machinery is working precisely as designed. Humans are just misusing it to test models of the data generating process they know are false to begin with and misinterpreting the result.

    I.e., your model is missing the “re-testing” aspect, and you need to re-derive or simulate a new predicted distribution of outcomes for it.

      • How are they true positives? I default to NHST because I don’t know how to set priors for each experiment, but suppose the null is true and we test at alpha = 0.05. I doubt people would ask for a redo if we reject, so with at most one redo the only paths to a rejection are: reject at the first experiment, or fail to reject at the first experiment, redo it, and reject the second time. The FPR is then 0.05 + 0.95 * 0.05 = 0.0975, almost twice our intended false positive rate. But what if we do another run if we fail to reject that second experiment? As we keep repeating the experiment, the FPR is going to go up like crazy.
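
        To put numbers on “up like crazy”: if a true-null experiment can be re-run up to k independent times, the chance of at least one rejection is 1 - (1 - alpha)^k. A quick, purely illustrative check:

```python
# Cumulative false positive rate when a true-null test may be re-run up to k
# times at alpha = 0.05, each run treated as independent.
alpha = 0.05
for k in range(1, 6):
    print(k, round(1 - (1 - alpha) ** k, 4))
# 1 -> 0.05, 2 -> 0.0975, 3 -> 0.1426, 4 -> 0.1855, 5 -> 0.2262
```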

      • What does it mean to you to have a false positive? To me it means that we declare that there is a difference from the null when in fact there is exactly 0 difference from the null.

        Next, it’s trivial that in the real world there is almost never zero difference between any two things… so all detected differences are true!

        • Suppose we prefer to stick with what seems to work so far, and make changes only when customers clearly prefer them over the status quo. Then an example of a false positive is when we conclude there is a significant positive effect when the actual effect is negative, i.e., a Type S error.

      • I don’t know how to be any clearer. They are true positives because the null model is wrong; it has nothing to do with priors…

        This case is particularly egregious because people are deriving predictions from a model of running a single test, but then running multiple tests. And when their predictions inevitably don’t match the observation, they call it “false positives”!

        If only people could see how dumb this whole thing is, but they can’t. It is just too stupid to believe that almost all modern research amounts to people chasing their own tails by looking for “significance” like this.

        • I don’t understand why we should ever conclude it is a true positive when we need to redo an experiment multiple times to get “significance”—a bad practice to begin with. That is precisely what “chasing after significance” is.

          The point is that we don’t know whether the result is a true positive or a false positive, but redoing experiments is a surefire way to boost the FPR across the entire pool of experiments. I am not calling any specific result a false positive, but bad practices like this lead to the “replication crisis”. Our experiments had enough power to reliably detect the effect sizes we cared about.

        • If you repeat an experiment multiple times but derive the null model based on the idea you only repeated it once, you will eventually get significance. This is correct, because at least one of the assumptions used to derive the null model is wrong (that you would only repeat the experiment once).

        • If we know we are testing n hypotheses, or will evaluate the experiment n times, then we can use the Holm correction (see the sketch at the end of this comment). But if the null hypothesis is true, n becomes a random variable following a geometric distribution, and I don’t see how we can incorporate that other than making a dynamic threshold like https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7 . But this method seems off to me somehow.

          A/B tests are typically light on assumptions. We rely on the CLT, and we can use historical data to estimate the variance of the population to do the power analysis. If we have 95% power and get p = 0.8 from the t-test, I’m going to lean towards there being no meaningful effect.
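
          (For completeness, if the number of looks really were fixed in advance, the Holm correction itself is a one-liner with statsmodels; the p-values below are made up, and this does nothing about the problem that n is itself random.)

```python
# Holm correction for a known, fixed number of tests (p-values are made up).
from statsmodels.stats.multitest import multipletests

pvals = [0.04, 0.20, 0.01, 0.65]
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(reject)      # which nulls are rejected after the correction
print(p_adjusted)  # Holm-adjusted p-values
```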

        • > If we have 95% power and get p = 0.8 from the t-test, I’m going to lean towards there being no meaningful effect.

          Let’s unpack this a little… the power analysis is 95% power *to reject the null when the effect size is about a certain size that is meaningful to you* and then you get p = 0.8 and so there’s some reason to believe that the effect size is not meaningful to you… but not that it’s exactly zero.

          This is one of the major issues that often goes unexamined in significance testing: people care about effects of a certain size or bigger, so they collect *just enough data to somewhat reliably detect that size*, and then they don’t understand why someone would say “all nulls are wrong,” since whenever they reject the null the effect-size estimate seems bigger than the minimal quantity that matters to them. They’ve bought into the idea that “anything else is zero,” which is exactly the problem recently discussed: non-significance does not mean equal to zero.

          If you collect a trillion data points you will always reject the point null of exactly zero effect. Only by intentionally leaving a lot of noise in your estimate can you make rejecting the null behave roughly like “only rejecting when the effect size is big enough to care about.” But what you actually get is “only rejecting when the effect-size estimate is big enough, and also biased high,” which isn’t what you want.

          Intertwining the effect size and the precision to detect it into a single threshold decision rule is one of the big problems leading to Type M and Type S errors.
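
          A small simulation in the spirit of Gelman and Carlin’s retrodesign calculations makes the exaggeration point concrete (the true effect and standard error below are arbitrary assumptions, not from any real test):

```python
# Illustrative Type M / Type S check: a true lift of 0.2 standard errors,
# with estimates "reported" only when significant at the two-sided 5% level.
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.2, 1.0                    # effect is small relative to its SE
est = rng.normal(true_effect, se, 1_000_000)  # many hypothetical replications
sig = np.abs(est) > 1.96 * se                 # the "statistically significant" ones

print("power ~", sig.mean())
print("exaggeration ratio ~", np.abs(est[sig]).mean() / true_effect)  # Type M
print("wrong-sign share among significant ~", (est[sig] < 0).mean())  # Type S
```

          With an underpowered design like this, the significant estimates overstate the true effect many times over, and a nontrivial fraction of them point in the wrong direction.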

        • “If we know we are testing n hypotheses, or will evaluate the experiment n times, then we can use the Holm correction”

          I wouldn’t use any sort of “correction”. I would run custom Monte Carlo simulations of the processes I thought may have generated the data and compare the data to the results of those. If the data-generating process involved multiple tests, then there would be multiple tests in the simulation. Once I got a simulation that matched the data (was not “significantly” different), I would think I had a reasonable approximation of what was going on and would infer future consequences based on that.

          I only skimmed the Airbnb document, but it didn’t look promising, since it is focused on finding “significant differences”. Instead they should determine the stopping rule based on the cost of the study, the estimated cost of implementing the change, and the estimated benefits of the change. Statistical significance has no legitimate place in such decisions.

          No doubt stuff like that is why almost all software/websites I am exposed to have been getting less reliable and usable.

  6. The example of red and green buttons renews my confusion over how (and whether) statisticians think about constructs. Since at least Campbell, it would be pretty standard for a social science experimentalist to demand a theory that describes cause and effect as more than simply ‘we put a green button on a red background’. Instead, they would assert the existence of something called ‘contrast’, which could be operationalized as a red button on a green background or as a green button on a red background, and hypothesize that high-contrast buttons are more likely to be pressed. This is far more efficient (if the constructs are valid and the causal theory is right, of course), because we can test just one operationalization and generalize to the other.

    I don’t read stats journals often (unless this site counts as one), but econometricians almost never acknowledge constructs. For example, someone will show that after the passage of Regulation Fair Disclosure there was a reduction in bid-ask spreads, and write claims in titles and abstracts like ‘leveling the playing field improves market liquidity.’ Level playing fields and market liquidity are constructs, not measures or operationalizations. But one of the new standard texts in econometrics (Angrist and Pischke) doesn’t use the word ‘construct’ even once.

    So again, my question is how statisticians think about constructs, or whether they even do at all.

    • Believe me, the issue is not with the statisticians. Ideally we would test the ‘contrast’ in your example, but that would be more work than testing ‘red vs. green button’, and resources such as man-hours are limited.

  7. Hi Prof. Gelman. Honestly, I’m still kinda not satisfied with that answer. Resources are limited, and we do use hierarchical models to save resources, e.g. if an experiment is too costly or impractical to do everywhere, we might sample some geographic locations and generalize the results. However, if the hierarchical model means we have to do tons of extra work, I don’t think people will be as receptive. Also, if we keep an experiment running for too long to try a lot of different things, the stakeholders might become impatient as they have to make decisions in a timely manner.

    • Wicaksono:

      The modeling should be no extra work at all. Regarding impatience: there’s always a tradeoff, in that you can make quicker decisions if you’re more ok with those decisions being wrong. I recommend that you work with the stakeholders to estimate the costs and benefits of various decision options, along with the costs of gathering the experimental data in the first place. If you’re in a setting where there’s not a big cost to making the wrong decision, then there’s no need for stakeholders to be impatient: they can make decisions based on a small amount of data and then just change their decisions later as necessary as more data come in.

      • Right, the modeling itself isn’t much extra work, and I’d be happy to do it. I often spend a lot of time making the experimental design as efficient as possible to save other resources. The resource I’m concerned about is other people’s man-hours: cobbling together different versions of a webpage is a lot of extra work.

        The second part hints at setting a prior for low-stakes decisions and opening up the possibility of changing the decision as more data pour in and we update the posterior. What is a good way to select a prior in this case? An uninformative prior is obviously bad, as the posterior will fluctuate a lot. But setting a moderately strong prior at zero leans toward inaction, while setting the prior at a positive effect might mean it takes a long time to realize the effect is actually negative.

        • Wicaksono:

          I actually suspect that the more important thing is not the prior but rather the assumptions about costs and benefits. But, to the extent the prior is a concern, I’d imagine that some zero-centered distribution makes sense. There’s no reason this would “lean toward inaction”: that would depend on costs and benefits. If the potential benefits of action are high, it can make sense to act even if there is a high probability that you’re acting in the wrong direction. This also points to the sequential nature of the problem: these are not permanent decisions being made, and it can make sense to decide and then re-evaluate as more information comes in.
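
          As a toy illustration of letting costs and benefits drive the call rather than a significance threshold (every number below is hypothetical):

```python
# Toy expected-value decision, assuming we have posterior draws of the lift
# (revenue per visitor) and a rough cost of shipping the change.
import numpy as np

rng = np.random.default_rng(1)
lift_draws = rng.normal(0.02, 0.05, 10_000)  # stand-in for real posterior draws
visitors = 100_000                           # traffic the decision applies to
cost_of_change = 1_000.0                     # engineering / rollout cost

expected_net_gain = (lift_draws * visitors).mean() - cost_of_change
prob_wrong_direction = (lift_draws < 0).mean()

print("expected net gain:", round(expected_net_gain))
print("P(lift < 0):", round(prob_wrong_direction, 3))
```

          In this made-up example the probability that the lift is actually negative is around a third, yet the expected net gain is positive, so shipping (and re-evaluating later) can still be the right call.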

        • Suppose we want to make a decision as quickly as possible, so we do a 50/50 split of control vs. treatment. Once we make a decision, say that users prefer the change, we want to reap the benefits and go to 0/100. But to keep data flowing in, we would have to do something less drastic, say 10/90. Once a decision is made, it will take much longer to collect the data to correct it if it turns out to be the wrong decision. On the other hand, if the treatment really is beneficial, we don’t want to “permanently” lose 10% of that gain, so eventually we will have to stop collecting additional data.

          If we set a prior centered at 0, the stronger it is, the longer it will take for us to be more certain in a decision. And we might not want to make the decision so hastily, because once we make the decision it affects our future ability to correct it. So it exhibits “semi-permanence”, if you will. In this way, a strong 0 prior leans towards inaction.

          It sounds like Bayesian bandits can solve the problem, but in reality it can be difficult to implement.
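
          For reference, the usual “Bayesian bandit” implementation is Thompson sampling, which sets the explore/exploit split automatically instead of freezing it at 10/90. A minimal Beta-Bernoulli sketch with made-up conversion rates:

```python
# Minimal Thompson sampling for two variants with Bernoulli conversions.
# The "true" rates are invented; in production the updates come from live traffic.
import numpy as np

rng = np.random.default_rng(2)
true_rates = [0.10, 0.11]      # control, treatment (unknown in practice)
successes = np.ones(2)         # Beta(1, 1) prior for each variant
failures = np.ones(2)

for _ in range(10_000):                      # each iteration = one visitor
    draws = rng.beta(successes, failures)    # sample a plausible rate per variant
    arm = int(np.argmax(draws))              # show the variant that looks best now
    converted = rng.random() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted

print("share of traffic per variant:", (successes + failures - 2) / 10_000)
print("posterior mean conversion rates:", successes / (successes + failures))
```

          Note that this simple version assumes the conversion rates are fixed over time, which is exactly the stationarity worry raised in the next reply.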

        • Wicaksono:

          Thanks for following up. Yes, you’re talking about the exploration/exploitation tradeoff. One way to address this is to do it in stages: first conduct the experiment and gather preliminary inference, then gather more data in a production-style setting, then switch to the full decision but occasionally run experiments to check alternatives. I agree it can be difficult to implement. One problem I have with the so-called bandit literature is that the problem is typically set up assuming stationary, unchanging effects and with the goal of making a permanent decision. It would make more sense to me to recognize time variation and tradeoffs, and both these aspects of the problem point toward returning to the decision problem later on: this makes sense given that effects are context dependent and can change, and it also takes some of the pressure off the earlier decision process, which should allow people to make decisions sooner without requiring an attitude of approximate certainty.

        • Hey I commented about exploration/exploitation tradeoff yesterday (see below about EGO)! I don’t know what a Bayesian bandit is (are?) though…

          In my opinion, the time dimension of the data is a bit of a red herring. Nobody instantaneously collects all the data and finalizes a design. However, I assumed the website design would be essentially static once design decisions are made. Later you can redesign if you want. The design is changing over time and therefore the response is changing over time, but I am not sure the response to a fixed design is changing over time…

        • Nat:

          The point of the time dimension is that the treatment effect is not fixed in stone; it depends on who’s looking at the website, what else they’ve been looking at, etc. Effects will change over time for all sorts of reasons. Effects depend on context. Just speaking generally, there’s no point in trying to get extreme precision or certainty about the effect.

        • Andrew:

          My point was that we might be able to obtain a reasonable solution to this design problem by making the simplifying assumption that the website design is fixed and the response to a fixed design is stationary.

          I did not mean that we should assume that the response is fixed in stone or that we want extreme precision. Obviously, there is variation in the response. However, I don’t see the benefit to partitioning the variation according to observation time in this case. I would lump any variation over time into a stationary error term.

  8. Is it helpful to approach this problem from an optimization perspective? For example, in engineering design the Efficient Global Optimization (EGO) method is popular for design optimization when function evaluations (i.e., evaluating the performance of different website designs) are severely limited by time or cost.
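
    If one wanted to try that here, a rough off-the-shelf analogue of EGO is Gaussian-process Bayesian optimization with an expected-improvement acquisition function, e.g. via scikit-optimize; the objective below is a synthetic stand-in for “negative conversion rate of a design with one tunable parameter”:

```python
# Rough EGO-style sketch with scikit-optimize: a Gaussian-process surrogate plus
# expected improvement, for when each evaluation (an A/B test of a candidate
# design) is expensive. The objective here is a synthetic placeholder.
from skopt import gp_minimize

def negative_conversion(x):
    # Pretend conversion peaks when the design parameter is near 0.3.
    return (x[0] - 0.3) ** 2

result = gp_minimize(
    negative_conversion,
    dimensions=[(0.0, 1.0)],   # one continuous design knob
    acq_func="EI",             # expected improvement, as in EGO
    n_calls=15,                # total "experiments" we can afford
    random_state=0,
)
print("best design parameter:", result.x, "objective:", result.fun)
```

    In practice A/B outcomes are noisy, so one would reach for the noise-aware variants of this idea, but the evaluation-budget framing matches the situation described above.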
