Andrew vs. the Multi-Armed Bandit

Andrew and I were talking about coding up some sequential designs for A/B testing in Stan the other week. I volunteered to do the legwork and implement some examples. The literature is very accessible these days—it can be found under the subject heading “multi-armed bandits.” There’s even a Wikipedia page on multi-armed bandits that lays out the basic problem formulation.

It didn’t take long to implement some policies and a simulation framework in R (case study coming after I do some more plots and simulate some contextual bandit probability matching policies). When I responded with positive results on the Bernoulli bandit case with probability matching (aka “Thompson sampling”), Andrew replied with a concern about the nomenclature, which he said I could summarize publicly.

First, Each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Indeed. The Wikipedia says it’s sometimes called the K-bandit problem instead, but I haven’t seen any evidence of that.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.

Yes. Nevertheless, people play slot machines. I suppose an economist would say the financial payout has low expected value, but high variance, so play may be rational for a risk-happy player who enjoys to play.

Third, I find the “bandit” terminology obscure and overly cute. It’s an analogy removed at two levels from reality: the optimization problem is not really like playing slot machines, and slot machines are not actually bandits. It’s basically a math joke, and I’m not a big fan of math jokes.

Seriously? The first comments I understand, but the third is perplexing when considering the source is the same statistician and blogger who brought us “Red State, Blue State, Rich State, Poor State”, “Mister P”, “Cantor’s Corner”, and a seemingly endless series of cat pictures, click-bait titles and posts named after novels.

I might not be so sensitive about this had I not pleaded in vain not to name our software “Stan” because it was too cute, too obscure, and the worst choice for SEO since “BUGS”.

P.S. Here’s the conjugate model for Bernoulli bandits in Stan followed by the probability matching policy implemented in RStan.

data {
  int K;                // num arms
  int N;                // num trials
  int z[N];  // arm on trial n
  int y[N];  // reward on trial n
}
transformed data {
  int successes[K] = rep_array(0, K);
  int trials[K] = rep_array(0, K);
  for (n in 1:N) {
    trials[z[n]] += 1;
    successes[z[n]] += y[n];
  }
}
generated quantities {
  simplex[K] is_best;
  vector[K] theta;
  for (k in 1:K)
    theta[k] = beta_rng(1 + successes[k], 1 + trials[k] - successes[k]);
  {
    real best_prob = max(theta);
    for (k in 1:K)
      is_best[k] = (theta[k] >= best_prob);
    is_best /= sum(is_best);  // uniform for ties
  }
}

The generated quantities compute the indicator function needed to compute the event probability that a given bandit is the best. That’s then used as the distribution to sample which bandit’s arm to pull next in the policy.

model_conjugate <- stan_model("bernoulli-bandits-conjugate.stan")
fit_bernoulli_bandit <- function(y, z, K) {
  data <- list(K = K, N = length(y), y = y, z = z)
  sampling(model_conjugate, algorithm = "Fixed_param", data = data, 
           warmup = 0, chains = 1, iter = 1000, refresh = 0)
}

expectation <- function(fit, param) {
  posterior_summary <- summary(fit, pars = param, probs = c())
  posterior_summary$summary[ , "mean"]
}

thompson_sampling_policy <- function(y, z, K) {
  posterior <- fit_bernoulli_bandit(y, z, K)
  p_best <- expectation(posterior, "is_best")
  sample(K, 1, replace = TRUE, p_best)
}

1000 itertions is overkill here, but given everything's conjugate it's very fast. I can get away with a single chain because it's pure Monte Carlo (not MCMC)---that's the beauty of conjugate models. It's often possible to exploit conjugacy or collect sufficient statistics to produce more efficient implementations of a given posterior.