Skip to content

Andrew vs. the Multi-Armed Bandit

Andrew and I were talking about coding up some sequential designs for A/B testing in Stan the other week. I volunteered to do the legwork and implement some examples. The literature is very accessible these days—it can be found under the subject heading “multi-armed bandits.” There’s even a Wikipedia page on multi-armed bandits that lays out the basic problem formulation.

It didn’t take long to implement some policies and a simulation framework in R (case study coming after I do some more plots and simulate some contextual bandit probability matching policies). When I responded with positive results on the Bernoulli bandit case with probability matching (aka “Thompson sampling”), Andrew replied with a concern about the nomenclature, which he said I could summarize publicly.

First, Each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Indeed. The Wikipedia says it’s sometimes called the K-bandit problem instead, but I haven’t seen any evidence of that.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.

Yes. Nevertheless, people play slot machines. I suppose an economist would say the financial payout has low expected value, but high variance, so play may be rational for a risk-happy player who enjoys to play.

Third, I find the “bandit” terminology obscure and overly cute. It’s an analogy removed at two levels from reality: the optimization problem is not really like playing slot machines, and slot machines are not actually bandits. It’s basically a math joke, and I’m not a big fan of math jokes.

Seriously? The first comments I understand, but the third is perplexing when considering the source is the same statistician and blogger who brought us “Red State, Blue State, Rich State, Poor State”, “Mister P”, “Cantor’s Corner”, and a seemingly endless series of cat pictures, click-bait titles and posts named after novels.

I might not be so sensitive about this had I not pleaded in vain not to name our software “Stan” because it was too cute, too obscure, and the worst choice for SEO since “BUGS”.

P.S. Here’s the conjugate model for Bernoulli bandits in Stan followed by the probability matching policy implemented in RStan.

data {
  int K;                // num arms
  int N;                // num trials
  int z[N];  // arm on trial n
  int y[N];  // reward on trial n
transformed data {
  int successes[K] = rep_array(0, K);
  int trials[K] = rep_array(0, K);
  for (n in 1:N) {
    trials[z[n]] += 1;
    successes[z[n]] += y[n];
generated quantities {
  simplex[K] is_best;
  vector[K] theta;
  for (k in 1:K)
    theta[k] = beta_rng(1 + successes[k], 1 + trials[k] - successes[k]);
    real best_prob = max(theta);
    for (k in 1:K)
      is_best[k] = (theta[k] >= best_prob);
    is_best /= sum(is_best);  // uniform for ties

The generated quantities compute the indicator function needed to compute the event probability that a given bandit is the best. That’s then used as the distribution to sample which bandit’s arm to pull next in the policy.

model_conjugate <- stan_model("bernoulli-bandits-conjugate.stan")
fit_bernoulli_bandit <- function(y, z, K) {
  data <- list(K = K, N = length(y), y = y, z = z)
  sampling(model_conjugate, algorithm = "Fixed_param", data = data, 
           warmup = 0, chains = 1, iter = 1000, refresh = 0)

expectation <- function(fit, param) {
  posterior_summary <- summary(fit, pars = param, probs = c())
  posterior_summary$summary[ , "mean"]

thompson_sampling_policy <- function(y, z, K) {
  posterior <- fit_bernoulli_bandit(y, z, K)
  p_best <- expectation(posterior, "is_best")
  sample(K, 1, replace = TRUE, p_best)

1000 itertions is overkill here, but given everything's conjugate it's very fast. I can get away with a single chain because it's pure Monte Carlo (not MCMC)---that's the beauty of conjugate models. It's often possible to exploit conjugacy or collect sufficient statistics to produce more efficient implementations of a given posterior.


  1. Luke says:

    The first undergrad thesis I supervised at Harvard was on applying SMC to multi-armed bandits. Turns out the bandit community had no interest though…

    • ojm says:

      Hi Luke,
      Off topic: I’ve been doing a little work on player tracking in basketball games lately and see you’ve done a lot in this area.

      Do you happen have any favourite references? In particular we’ve been struggling to robustly deal with player-player intersections etc. Thanks!

  2. gwern says:

    If we’re going to discuss misleading jokes and puns in MABs, one thing that always annoys me about presentations of Thompson sampling is how everyone’s binomial ‘hello world’ example puns on probabilities and 0/1 rewards, rather than make the use of a loss function explicit. (The expectation of ‘p’ need not have anything to do with what bandit is actually optimal.) The point of posterior sampling in RL is to sample a model and then optimize the loss function under that hypothetical model; if you don’t get this, you can’t apply it to anything else like general MDPs! In this example, there is no way to pass in a different loss function because you’ve hardwired the call to ‘expectation’ inside ‘thompson_sampling_policy’; what if I *don’t* have a 0/1 loss and can’t simply pun over ‘p’ = EV of 0/1 binary rewards? (I usually don’t.)

    • @gwern: I’ve also been worried about this for the case study. I just introduced the general case of bandits as having a some distribution over returns. I think that’s what you’re getting at with general loss functions.

      The Bernoulli is very simple in that there’s only two possible outcomes, a reward of 1 with probability theta and a reward of 0 with probability (1 – theta). The pun is in the Bernoulli parameter theta being the expected return of Bernoulli(theta). You get the same pun with the Poisson or the location parameter of the normal.

      The Thompson sampling approach I coded up in Stan requires a way to compare bandit parameters and determine which is best. For the binomial case, that’s just the max function. In general, it could be a nested Monte Carlo calculation if simulating from the bandits given a value for the parameters is possible. Determining which bandit is best for a given set of parameters lets us code up the indicator variable is_best, which is all we need to compute the posterior event probability of each bandit being the best—the Thompson sampling part of it works exactly the same way.

      Being able to draw from the bandits given the parameters would be enough. Then you could build a nested Monte Carlo calculation of the best bandit right into the generated quantities. In the Bernoulli case, that’d involve simulating a bunch of Bernoulli draws from theta and comparing the totals. That’s not necessary, as theta is the expectation for the Bernoulli—that’s the appealing simplicity that can cause confusion because it’s conflating the parameter and the expectation.

      • Andrew says:


        Let me just emphasize that I hate the “bandit” terminology, as the whole point of the actual “bandit” is that it takes your money, whereas in this exploration-exploitation design-and-optimization problem, the goal is to choose an option that has positive expected value. I really do think it misses the point. Jokes are ok but not when the joke goes in the opposite direction of the intended meaning!

        • I’m not so hung up on the sign of the expected payoff and more hung up on complying with the terminology in the literature so as not to confuse readers.

          How about just taking this as meaning drift, the way “board” came to mean something a game was played on and a group of executives, not just a plank of wood?

          • Andrew says:


            If the terminology was as widespread as “board,” I’d say, sure, no problem. But I don’t think that’s the case. “Bandit” is jargon: it’s understood within a narrow community, but outside of that community I think that sort of jargon can exclude rather than include people. I don’t mind a jargon term that has some internal consistency—for example, “regression to the mean”—but I don’t like a jargon term where, to understand where it’s coming from, you first have to hear the story of slot machines and then realize all the ways the analogy doesn’t work.

            And, sure, I have similar problems with the term “Bayesian statistics.” In that case I felt (as of 1995) that it was better to go with the accepted term. But even there I kinda regret not switching to a more descriptive title such as “Data analysis using probability models.”

            • “Multi-armed bandit problem” is widespread enough that it’s what the Wikipedia calls it. What do statisticians call this form of sequential decision problem?

              Do you think I should not put “bandit” in the title and instead use something like “sequential design” or “sequential decision process”? I could instead use the MDP jargon, but we don’t have the Markovian structure you get in a lot of control applications.

              Are you OK with “reward” and the rest of the reinforcement learning terminology, or is there some term used in statistics you feel is less jargony?

              • Andrew says:


                I’m not quite sure what name I’d prefer for this problem. I wouldn’t mind calling it “Bayesian A/B testing” or “Sequential design for the A/B testing problem.” That’s not quite right because it could be A/B/C/… testing. So we’ll have to think more about this one.

                I agree that “multi-armed bandit” is standard jargon within some fields. I still find it unnecessarily obscure and non-descriptive. A term such as “reward” doesn’t bother me because the common English meaning seems close to the technical meaning.

              • Ricardo Silva says:

                Multi-armed bandit is the standard jargon pretty much everywhere, be it in operations research, economics, statistics, computer science etc. DeGroot’s 50-year old textbook on sequential design, pretty much the bible on the topic back then, uses it. If that’s obscure, then all jargon is meaningless.

              • Andrew says:


                I’m not saying the term “bandit” is useless. I just think it points in the wrong direction. Sometimes jargon is delightfully appropriate, other times jargon adds no insight and I think can be a barrier to outsiders. I would prefer a meaningless term such as “Jax” to “bandit,” as at least it wouldn’t have certain misleading associations. Even better, in my opinion, would be a more descriptive term.

              • Ricardo Silva says:

                Fair enough. I just disagree that the term is “understood within a narrow community”.

                For what it’s worth, I sympathize with the feeling of being bothered by non descriptive terms. I myself think “A/B testing” is one of the most detestable and content-free terms I’ve ever come across. It’s literally “arbitrary symbol slash arbitrary symbol”. We might as well call it Jabberwocky/Supercalifragilisticexpialidocious testing (and dear Lord, may nobody follow up on this suggestion!)

              • Andrew says:


                I like “arbitrary symbol slash arbitrary symbol,” as it conveys what A/B testing is: the comparison of two arbitrary alternatives. But, sure, I really like a descriptive term such as “exploration/exploitation tradeoff.” Somewhere in another dimension are terms such as “bias” which are usefully descriptive but also confusing in that their precise meaning does not always line up with what one might expect.

  3. Andrew says:


    Hey! I wrote a post based on this email exchange too. My post is scheduled to appear in Aug. Readers will be able to compare our two writing styles.

  4. a reader says:

    ‘…not to name our software “Stan” because it was too cute, too obscure, and the worst choice for SEO since “BUGS”.’

    My first thought was “at least you didn’t name it R”…but then I recalled that the first Google search result for R is the CRAN website, so maybe that one was not so terrible.

    I was then curious how the software package Stan would show up on a Google search for “Stan”. It does show up (not in the main links, but reference is made in the top right corner)…but I also learned that according to the Cambridge Dictionary (but not Websters, not sure about OED), stan means “an overzealous or obsessive fan of a particular celebrity” (reference to the Eminem song by the same name).

    You learn something everyday!

  5. Dan Simpson says:

    Bob: At least you didn’t have to give a talk about INLA (also the Irish National Liberation Army, a socialist paramilitary group that was active during The Troubles) in Dublin…. There are worse names than Stan

  6. J H Holland referred to the genetic algorithm as a k-armed bandit.

    There’s a fantastic sci-fi painting of a gigantic robot walking through a shallow sea towing a couple of shipd with its two lower arms, and trying to catch a spaceship flying between its two upper arms. A perfect illustration of a k-armed bandit encountering one of Holland’s GA hyperspace planes! Can’t remember the artist :-( .

Leave a Reply