Stan Puzzle #1: Inferring Ability from Streaks

Inspired by X’s blog’s Le Monde puzzle entries, I have a little Stan coding puzzle for everyone (though you can solve the probability part of the problem without actually knowing Stan). This almost (heavy emphasis on “almost” there) makes me wish I were writing exams.

Puzzle #1: Inferring Ability from Streaks

Suppose a player is shooting free throws, but rather than recording for each attempt whether it was successful, she or he instead reports the length of her or his streaks of consecutive successes. For the sake of this puzzle, assume the player makes a sequence of free throw attempts, $latex z = (z_1, z_2, \ldots)$, assumed to be i.i.d. Bernoulli trials with chance of success $latex \theta$, until $latex N$ streaks are recorded. The data recorded are only the lengths of the streaks, $latex y = (y_1, \ldots, y_N)$.

Puzzle:   Write a Stan program to estimate $latex p(\theta \, | \, y)$.

Example:   Suppose a player sets out to record 4 streaks and makes shots

$latex z = (0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0)$.

This produces the observed data

$latex N = 4$

$latex y = (3, 1, 2, 5)$.

Any number of initial misses ($latex 0$ values) in $latex z$ would lead to the same $latex y$. Also, the sequence $latex z$ always ends with the first failure after the $latex N$-th streak.
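
For concreteness, the data-generating process above can be simulated in a few lines of R (a minimal sketch for illustration only; the function name simulate_streaks is made up here):

simulate_streaks <- function(N, theta) {
  y <- integer(0)
  current <- 0
  while (length(y) < N) {
    shot <- rbinom(1, 1, theta)      # one Bernoulli(theta) free throw
    if (shot == 1) {
      current <- current + 1         # extend the current streak
    } else if (current > 0) {
      y <- c(y, current)             # a miss closes an open streak
      current <- 0
    }                                # misses outside a streak are simply dropped
  }
  y                                  # stops right after the miss that ends the N-th streak
}

set.seed(1)
simulate_streaks(N = 4, theta = 0.6)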

Hint:   Feel free to assume a uniform prior $latex p(\theta)$. The trick is working out the likelihood $latex p(y \, | \, \theta,N)$, after which it is trivial to use Stan to compute $latex p(\theta \, | \, y)$ via sampling.
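
For readers who have not used Stan before, a bare skeleton of the kind of program being asked for might look like the sketch below (written in the older Stan syntax used in the comments). The likelihood is deliberately left as a comment so as not to give the puzzle away; as written, the program just returns the uniform prior.

data {
  int<lower=1> N;                  // number of streaks
  int<lower=1> y[N];               // streak lengths
}
parameters {
  real<lower=0, upper=1> theta;    // chance of success
}
model {
  // uniform prior on theta is implicit given the bounds
  // TODO: add the log likelihood log p(y | theta, N) here
}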

Another Hint:   Marginalize the unobserved failures (0 values) out of $latex z$. This is non-trivial because we don’t know the length of $latex z$.

Extra Credit:   Is anything lost in observing $latex y$ rather than $latex z$?

Answer: Solution to Stan Puzzle 1.

33 thoughts on “Stan Puzzle #1: Inferring Ability from Streaks”

  1. Well, the misses are unobserved so the parameter is unidentified… no wait a minute, the lengths of the streaks are informative about the parameter, so the likelihood will be curved enough that we can marginalize over the missing observations — hey, this is a pretty neat puzzle!

    • Break each streak of 1’s into its leading 1 and a “tail-streak” consisting of all the remaining 1’s. Then:

      * The length of each of the N tail-streaks follows a geom(theta) distribution, and all such observations are independent (quick simulation check below).

      * By construction each leading ‘1’ has to be a success, so it doesn’t tell us anything about theta.

      * The length of each of the N streaks of 0’s follows a geom(1-theta) distribution, and these are also independent, but they are never observed… so they tell us nothing about theta.
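
      A quick simulation check of the tail-streak claim (a minimal sketch; note the parameterization: R’s dgeom(k, prob) gives the probability of k made shots before the first miss when prob = 1 - theta, which is exactly the tail-streak length):

      theta <- 0.7
      shots <- rbinom(1e5, 1, theta)
      runs  <- rle(shots)
      lens  <- runs$lengths[runs$values == 1]
      if (tail(runs$values, 1) == 1) lens <- head(lens, -1)   # drop a possibly unfinished final streak
      rbind(empirical   = prop.table(table(lens - 1))[as.character(0:4)],
            theoretical = dgeom(0:4, prob = 1 - theta))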

    • Right, the probability of a streak of size N taken alone is p^N (here I’m using p instead of theta), and of a streak of zeros of length L it is (1-p)^L

      so this is N samples, each with probability p^k for k the data points, as well as N+1 samples whose probabilities are (1-p)^n, where the n are unobserved but

      n_0 >= 0,
      n_1… n_(N-1) > 0,
      n_N >=0

      and the probabilities for the n are themselves again dependent on (1-p)

      that’s a lot bigger hint I guess, but since I am probably not going to be able to do the whole thing… have at it.

  2. Well, these are just four observations from a geometric r.v., conditional on the number of rounds being at least one. Each likelihood term would be of the form:

    L(y_n|theta) = P(y = y_n | y > 0, theta) = P(y = y_n | theta) / (1 - P(y = 0 | theta))

    Now to do this in Stan I would use the increment_log_prob command in tandem with the geometric pmf as follows:

    for (n in 1:N) {
      y[n] ~ neg_binomial(1, theta);                                   // geometric as neg-binomial with shape 1
      increment_log_prob(-log1m_exp(neg_binomial_log(0, 1, theta)));   // divide by 1 - P(y = 0)
    }

  3. I think the trick is that it’s equivalent to four players with equal probability of a miss, or the same equally performing player on different days. It took playerA 4 trials, playerB 2 trials, etc before a miss. How many trials would we expect to wait before the first miss for a given probability of success?

    So then the cmf for the first miss is the complement of getting all successes: cmf=1-theta^t, where theta is probability of success, and t is number of trials. Then you get the pmf=cmf[t+1]-cmf[t]. The data is y+1, or t_obs=(4,2,3,6), each with a count of one. Then the likelihood is from a Poisson distribution with lambda=Nobs*pmf, where Nobs=4 (# of observations/streaks).

    tl;dr: (y+1) ~ Poisson( Nobs*[theta^t-theta^(t+1)] )

    Maybe I shouldn’t be adding that one to the streak length though, since recording always begins on a success… Or maybe what I wrote doesn’t make sense at all now that I’m rereading it. That’s my first guess though.

    • I realized at least this is wrong for sure, because it should be fitting to the counts: “(y+1) ~ Poisson( Nobs*[theta^t-theta^(t+1)] )”

      I do think I’m onto a good approach, I’ll just have to code it up because I’m confusing myself at this point.

      • Here is my guess, it comes out to ~.72:

        require(rstan)
        x=c(3,1,2,5)
        y=as.integer(table(x))
        dat=list(x=x, y=y, Ndata=length(x))
        ####
        model <- '
        data {
        int Ndata;
        int y[Ndata];
        int x[Ndata];
        }
        parameters {
        real<lower=0, upper=1> theta;
        }
        model {
        for(i in 1:Ndata){
        y[i] ~ poisson(Ndata*(theta^x[i] - theta^(x[i]+1)));
        }
        }
        '

        ####
        fit <- stan(model_code = model, data = dat, iter = 10000, chains = 6, warmup=500)
        fit2 <- extract(fit)
        > mean(fit2$theta)
        [1] 0.7193806
        > median(fit2$theta)
        [1] 0.7318201

        • It got garbled, so here: http://pastebin.com/epbPpCde

          I think it makes sense. If we just take the proportion of successes we would guess theta ~ 0.58, but this ignores any streakiness. Assuming I didn’t make an error somewhere, this model is asking “How likely is it to get one streak of length 5 out of N=4 sets, etc.?” Apparently that would be relatively rare for the theta ~ 0.58 case, but more likely for theta ~ 0.72.

  4. For N=1, the probability of the streak being of length y equals f(y|theta) = theta^(y-1) * (1-theta).

    For N>1, the streaks are independent and their order doesn’t matter, so it’s like taking N draws from a discrete distribution with density proportional to f(y|theta). I say proportional because there is a normalising constant, but that doesn’t matter because we’re going to do MCMC so any constant in the likelihood is irrelevant.

    In other words, the likelihood contribution (dropping the constant) for each streak y_i is f(y_i|theta). Taking the logarithm gives log f(y|theta) = (y-1)*theta + (1-theta). Summing over the observations (they are independent) and doing some rearranging we get the log-likelihood (up to constant):

    loglik = (sum(y) - 2N) * theta

    In Stan we just do increment_log_prob(loglik). This gives me theta~0.72.

      • Oh, right, thanks. I did feel like something was off. My algebra is particularly sloppy though — I was always too arrogant to do my homework.

        So instead (if I’m not messing up again) we use

        loglik <- (sum(y) - N) * log(theta) + N*log1m(theta);

        That does seem a lot more sensible, and the posterior looks okay now, with a mean of 0.62.

        Interesting that sum(y) is still a sufficient statistic.
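
        For concreteness, a minimal end-to-end sketch of how that corrected log-likelihood could be dropped into a Stan program via increment_log_prob (illustrative code, written in the older Stan syntax used elsewhere in this thread, with the example data from the post):

        library(rstan)

        model <- "
        data {
          int<lower=1> N;
          int<lower=1> y[N];        // observed streak lengths
        }
        parameters {
          real<lower=0, upper=1> theta;
        }
        model {
          // uniform prior on theta is implicit
          increment_log_prob((sum(y) - N) * log(theta) + N * log1m(theta));
        }
        "

        y <- c(3, 1, 2, 5)
        fit <- stan(model_code = model, data = list(N = length(y), y = y))
        print(fit)   # the posterior mean of theta should come out near 0.62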

  5. The lazy among us will just do ABC (en.wikipedia.org/wiki/Approximate_Bayesian_computation), conditioning first on what is reported and then on what the complete data was, to see the loss of information.

    Not fun math, but it’s likely the last good weather weekend for those of us in the north.

  6. OK, I’m not good with math so I had to brute force this solution (and I notice now that I’m a little late anyway).
    First, it makes it easier if we think about the sequence as a Bernoulli process. If we knew the exact sequence, it would be very easy to compute its probability:

    Pr(z=[0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0] | theta) = theta^{sum(z)}(1-theta)^{sum(z==0)}

    But we have limited information: N, the number of streaks, and y, the total points in each streak. How can we make best use of it? Well, sum(y) gives us the total number of points, which is constant regardless of the original sequence z. N (= length(y)) gives us the minimum number of misses. Now we have to marginalize over all possible sequences that could have generated the data.

    For example, the data y = [3, 2, 1] could arise from:

    z = [1, 1, 1, 0, 1, 1, 0, 1, 0], which has Pr(z) = theta^6 (1-theta)^3

    or

    z = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]
    z = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
    z = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]

    Each with the same Pr(z) = theta^6 (1-theta)^4, so 3*Pr(z), and so on…

    In the end, the probability of any vector y is given by the infinite summation over the (weak) composition of those sequences (took me a while to find what I needed!).

    Pr(y | theta) = sum_{i=0}^{inf} choose(N + i - 1, N - 1) theta^sum(y) (1 - theta)^{N + i}

    We can pull the constant terms out of the summation, of course.

    Pr(y | theta) = theta^sum(y) (1 - theta)^N sum_{i=0}^{inf} choose(N + i - 1, N - 1) (1 - theta)^i

    But we don’t have to evaluate the summation, fortunately, because it’s just a constant for a given theta. So, with N = length(y):

    Pr(y | theta) prop theta^sum(y) (1-theta)^N

    (Or should we? In the likelihood function, this ‘constant’ changes for each value of the parameter. Maybe we should really work it out; but I am lost here).

    Supposing everything is OK, we just take the log

    log(Pr(y | theta)) prop sum(y) * log(theta) + N * log(1 – theta)

    and now it’s easy to implement in Stan with a uniform Beta prior (the Beta is its conjugate, so we don’t really need any fancy new function, but let’s use Stan features!)

    library(rstan)
    y <- c(3, 1, 2, 7)
    data <- list(y=sum(y), N=length(y))
    model <- "
    functions{
    real streak_log(real y, real theta, int N) {
    return y * log(theta) + N * log1m(theta);
    }
    }
    data {
    int y;
    int N;
    }
    parameters {
    real<lower=0, upper=1> theta;
    }
    model{
    theta ~ beta(1, 1);
    y ~ streak(theta, N);
    }
    "

    fit <- stan(model_code=model, data=data)

    For the example data, the mean is 0.74, with a 95% CI of [0.53, 0.91]. If we compute analytically with a Beta distribution, we obtain a mean of 0.737 and basically the same 95% CI.

    Given the Beta conjugacy, we can see that our estimate of theta will be biased toward higher values if we use a uniform prior: since the total number of points can never be lower than the number of streaks, the posterior will be centered at 0.5 or higher. So we are really losing a lot of information by not knowing those zeroes.

    • That is one way to do it, but yes, you do need to work out the term that you call ‘a constant for a given theta’ — we can only drop terms that do not depend on the model parameters.

      You can work out the infinite sum as a special case of the binomial series, and then you’ll get the right answer.

      Kind of taking three right turns instead of one left, though.
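
      To spell out the binomial-series hint numerically: the sum in question is a special case of sum_{i=0}^{inf} choose(N + i - 1, N - 1) x^i = (1 - x)^(-N), so with x = 1 - theta it works out to theta^(-N). A truncated sum in R (a minimal check, for illustration) confirms it:

      theta <- 0.62
      N <- 4
      i <- 0:200                                      # truncate the infinite sum
      sum(choose(N + i - 1, N - 1) * (1 - theta)^i)   # approx 6.77
      theta^(-N)                                      # approx 6.77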

      • Thanks for the tip, Mike.
        My skills in Math are terrible, but I tried to work it out with just a little help from Wikipedia.

        I’m checking your answer but I do not understand why there’s a -1 in
        f(y|theta) = theta^(y-1) (1-theta)
        As I understood it, a datum y = (3) means we know there was just one streak with 3 points, so theta^3 * (1-theta) + that crazy summation for any combination of zeros before the streak. What am I missing here?

    • OK, thanks to Martin’s tip I finally got to the right likelihood function. What was I thinking? The summation depends on theta, so it shouldn’t be dropped! Working out the summation, we obtain

      Pr(y | theta) = theta^sum(y) (1-theta)^N theta^(-N)
      = theta^(sum(y) - N) (1-theta)^N

      Taking the log: (sum(y) - N) * log(theta) + N * log(1-theta); which is exactly the same as if we consider each streak length as a random draw from a geometric r.v., as already commented (three turns right, indeed!). Now we can change the previous function accordingly (and make it accept y as a vector of integers):

      real streak_log(int[] y, real theta, int N) {
      return (sum(y) - N) * log(theta) + N * log1m(theta);
      }

      Under a uniform prior, the estimate now is 0.62, with a 95% CI of [0.35, 0.85].

        • I did a Monte Carlo experiment with 10^6 samples of four streaks [1] and made a plot of the streak lengths along with my fit for theta=0.72 and 0.62 (warning: it is a slow approach because I was trying to match the methodology described in the OP as much as possible).

        It seems the model I described above is the correct one for the distribution of streak lengths, after a small modification (subtract 1 to account for the fact they always start on a success). [3] These are shown by the lines overlaying the histogram. [2] However, when I counted the number of matches to the set of “observed” lengths (i.e. 3,1,2,5; Nmatch in the title of the plots), theta=0.62 is clearly better: ~1.7% of those matched the data, while only 1.5% matched when theta=0.72.

        So where did I mess up?

        [1] http://pastebin.com/peEd7KVX
        [2] http://s2.postimg.org/3xtit4lo7/theta.png
        [3] http://pastebin.com/MqStXyfd


  7. Cool puzzle.

    Caveats: I don’t know Stan, I’m not a statistician, and I’d never done a Bayesian calculation before (and I probably still haven’t done one correctly!).

    I agree with many of the commenters above:

    1) I get that, given a probability p of success in a single trial, the probability of observing a streak of length n (that is, a sequence of m failures, followed by n successes, followed by a failure) is (1-p)^m p^n (1-p) = (1-p)^(m+1) p^n

    2) summing over all possible values of the length of the initial string of failures (m = 0 to \infty) gives the probability of observing a streak of length n as p^(n-1) (1-p)

    3) updating a uniform prior after having observed a sequence of N streaks whose lengths sum to S_N gives a posterior pdf for p of K(S_N, N) p^(S_N - N) (1-p)^N, where the normalization constant is K(S_N, N) = (S_N + 1)!/((S_N - N)! N!). The mean of this distribution is (S_N - N + 1)/(S_N + 2).

    For the example sequence of observed streaks, the mean of this distribution is about 0.62, as commented previously.

    I haven’t seen anyone else address the extra credit: how does this compare to observing the full sequence of trials? To examine this, I updated a uniform prior on p by the likelihood of observing m successes in M trials and got that the posterior pdf is K(m, M) p^m (1-p)^(M-m), where the normalization constant is K(m, M) = (M + 1)!/(m! (M-m)!). I imagine this is probably a common textbook example? The mean of this distribution is (m + 1)/(M + 2).

    First, intuition says that you must lose something from observing only the streaks instead of the full sequence of trials – in looking at only the streaks you are throwing away information about the number of failures in between the streaks.

    Of the many ways to compare the results from observing the streaks to observing the full sequence, I chose to focus on using the means of the posterior pdfs as estimators for p:

    a) choose a value of p
    b) generate a realization of a sequence of M trials
    c) extract the streaks
    d) estimate p with via the means of the two posterior pdfs
    e) replicate this a large number of times (10000) and examine the means and 2.5% to 97.5% percentile ranges of the two estimates
    f) repeat for a range of values of p (a rough code sketch of steps a) through e) appears below)

    Note there is at least one subtlety I’m just ignoring for now: what if the last trial in a sequence is a success? Then I haven’t really observed the end of the last streak, but I’m just using the streak formula as if I had. You would expect this to bias the streak estimate of p slightly downwards, especially for higher values of p when the sequence is more likely to end on a success.
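
    A rough R sketch of steps a) through e) (illustrative code; the name compare_estimates is made up), keeping the caveat above in mind: an unfinished final streak is simply treated as if it had been observed to end.

    compare_estimates <- function(p, M) {
      z    <- rbinom(M, 1, p)                       # b) one realization of M trials
      runs <- rle(z)
      y    <- runs$lengths[runs$values == 1]        # c) streak lengths (last one may be unfinished)
      S_N  <- sum(y)
      N    <- length(y)
      c(full   = (sum(z) + 1) / (M + 2),            # d) posterior mean from the full sequence
        streak = (S_N - N + 1) / (S_N + 2))         #    posterior mean from the streaks alone
    }

    set.seed(123)
    reps <- t(replicate(10000, compare_estimates(p = 0.5, M = 100)))   # e)
    colMeans(reps)
    apply(reps, 2, quantile, probs = c(0.025, 0.975))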

    I made two plots, one for sequence lengths of M = 100 and one for sequence lengths of M = 1000. Not sure of the best way to make them available; I’ve put pngs of them on picasaweb at:

    https://goo.gl/SsVxGA
    https://goo.gl/Gp3MPq

    The dashed lines are the 2.5% and 97.5% percentiles.

    These plots show that you do indeed lose something by only observing the streaks. Your estimates of p are always less precise (i.e. estimates from repeated trials show a wider range of variation), getting worse at lower values of p. For very low values of p, the streak estimate is horribly biased high. For the shorter sequence lengths (M = 100), the streak estimates appear to be biased slightly low: perhaps an effect of ignoring when the sequence ends on a success that I mentioned above?

    -Keith
