Skip to content

Solution to Stan Puzzle 1: Inferring Ability from Streaks

If you missed it the first time around, here’s a link to:

First, a hat-tip to Mike, who posted the correct answer as a comment. So as not to spoil the surprise for everyone else, Michael Betancourt (different Mike), emailed me the answer right away (as he always does for math problems—Michael’s literally amazing).

Although I formulated it to myself as “How do I code this in Stan?”, it turns out there’s an analytic solution. Here’s how I worked through it (after about as many false starts as the others who posted on the list).

Michael Betancourt also analyzed the process qua process; maybe he’ll elaborate by editing this post below or in comments.


If the observed data are streaks y = (y_1, \ldots, y_N), with streak lengths y_n \geq 1, the underlying sequence of successes and failures must match the following regular expression

z = 0^{*} \, 1^{y_1} \, 0^{+} \, 1^{y_2}  \cdots 0^{+} \, 1^{y_N} \, 0,

where a^n is n repetitions of a, a^* is zero or more repetitions of a, and a^+ is one or more repetitions of a. Because sequence concatenation is associative, 0^+ = 0 \, 0^*, and y_n \geq 1, the above can be regrouped as

z = (0^* \, 1) \, 1^{y_1 - 1} \, 0 \, (0^* \, 1) \, 1^{y_2 - 1} \cdots (0^* \, 1) \, 1^{y_N - 1} \, 0

Given the way the streak data is generated, the probability of generating the subsequent misses after the first and then the first made shot, namely (0^* \, 1), is equivalent to the the sum of the probabilities of generating 1 or generating 0\,1 or 0\,0\,1 or …, which conventiently reduces to 1, because it covers all the possibilities for observing zero or more failures, then a single success. Many people were getting at that intuition in the comments.

Thus the probability of z, marginalizing over the unobserved 0^* sequences, reduces to the probability of generating the 1^{y_n - 1} terms and the required inter-streak 0 terms. With a \theta chance of success, the likelihood reduces to

\displaystyle p(y \, | \, \theta) \propto \prod_{n=1}^N \left( \theta^{y_n - 1} \times (1 - \theta)\right)

\displaystyle \mbox{ } \ \ \ = \theta^{\sum_{n=1}^N y_n - 1} \times (1 - \theta)^N

\displaystyle \mbox{ } \ \ \ = \theta^{\mbox{\footnotesize sum}(y) - N} \times (1 - \theta)^N

\displaystyle \mbox{ } \ \ \ \propto \mbox{Binomial}(\mbox{sum}(y) - N \, | \, N, \theta)

With a uniform prior on \theta, which is equivalent to a \mbox{Beta}(\theta \, | \, 1,1) prior, the beta-binomial conjugacy provides the following analytic solution for the posterior.

\displaystyle p(\theta \, | \, y) = \mbox{Beta}(\theta \, | \, \mbox{sum}(y) - N + 1, N + 1)

Loss of Information

How much information is “leaking” when we reduce the underlying sequence z with streaks y? When formulated as streaks, the Beta posterior is based on a total of N “observations,” whereas the length of z is greater than N and would lead to a \mbox{Beta}(\alpha,\beta) posterior with \alpha = \mbox{sum}(y) + 1 and \beta \geq N + 1.

Stan Code

So as not to disappoint those who wanted to see a Stan solution, here’s the MCMC version which lays it out as a model with parameters to estimate.

data {
  int N;
  int y[N];
parameters {
  real theta;
model {
  sum(y) - N ~ binomial(N, theta);

This model can be integrated into larger models based on theta, but is not much use in and of itself.

In this case, the analytic solution lets you generate draws directly from the posterior in the generated quantities block using Monte Carlo (withouth the Markov chain bit), which is much more efficient than MCMC.

data {
  int N;
  int y[N];
generated quantities {
  real theta;
  theta <- beta_rng(sum(y) - N + 1, N + 1);

But there's unlikely to be a need to do even straight-up Monte Carlo when you have an analytic posterior.


    • I didn’t understand the reasoning about four players or how you got to the Poisson.

      • Anoneuoid says:

        Thanks for taking a look at this. Forget the four players, etc I think that is just confusing things. Also, I believe should have used binomial for the likelihood rather than Poisson, but that does not explain the difference (I get theta~.69 using binomial).

        1) A streak is a sequence of trials that ends when the first miss occurs.
        2) The probability of getting n shots in a row, each with constant probability p of success would be: p^n
        3) The probability at least one miss occurs during a set of n shots would then be: 1-p^n
        4) In the case of n-1 shots: 1-p^(n-1)
        5) The difference between these two values must be the probability the first miss occurs on trial n: (1-p^n) – [1-p^(n-1)]
        6) Rearranging: p^(n-1) – p^n
        7) The number of first misses we expect to observe on trial n depends on how many streaks we observed M, giving: M*[p^(n-1) – p^n]
        8) We observed M=4 streaks of length s=(3,1,2,5)
        9) We need to convert streak length s to trial of first miss n. The first miss was on trial s+1, so at first I thought one needed to be added. However, every streak must start with a success so the first miss should be counted from the second trial of the streak. Therefore we should subtract one and those two cancel: n=s+1-1=s
        10) So we expect to see M*[p^(n-1) – p^n] streaks of length n out of M streaks for probability of success p. We know M=4, and n=(3,1,2,5). We observed one streak of length 3, one of length 1, etc…
        11) Then for each streak length we can calculate the likelihood of observing it a given number of times from binomial(M, p^(n-1) – p^n), or get the approximate likelihood using Poisson(M*[p^(n-1) – p^n]).

        • Carlos Ungil says:

          I’m with you until (6), which can be rewritten as p^(n-1)*(1-p). As you say in (9), this is the probability of getting a streak of length n.

          At that point, you could multiply the M likelihoods because the streaks are independent. You take the route of counting how many streaks of length 1,2,3,.. are observed and I’m not sure I follow your reasoning from there.

          From the probability of getting a streak of length n you can calculate the probability of seeing a particular number of streaks of length n, but you cannot simple multiply those probabilities because they are not independent (the total count has to be M, the number of streaks observed ).

          You should consider a multinomial distribution with an infinite number of categories corresponding to streaks of length=1, 2, 3, …. [inf]. In the example, the outcome is (1, 1, 1, 0, 1, 0, 0, 0, ….).

          • Anoneuoid says:

            > “From the probability of getting a streak of length n you can calculate the probability of seeing a particular number of streaks of length n, but you cannot simple multiply those probabilities because they are not independent (the total count has to be M, the number of streaks observed ).”

            Thanks, I agree my problem must be here. It appears I am correctly calculating the expected distribution of streak lengths (as shown by the monte carlo results), but incorrectly calculating the expected variance.

  1. My approach just noted that the streaks are negative binomial processes once you remove the first success, which immediately gives p(y_n – 1) = NegBin(y_n – 1 | 1, theta). We have no information about about the gaps between streaks so marginalizing them out yields no additional information, and we’re left with p(y) = \prod_{n=1}^{N} p(y_n – 1) = \prod_{n=1}^{N} NegBin(y_n -1 | 1, theta) which reduces to Bob’s answer up to normalization.

    And I’m figuratively amazing at best. :-p

  2. Llew says:

    See Schilling’s paper for an explanation of this relating to longest runs of successes/heads.

Leave a Reply