Maybe one of the remaining ’72 Dolphins will read this one.

Under the subject line “Thought you might find this interesting. (And curious what your intuition is, if you are not too sophisticated to still have one),” Shane Frederick sent me this question:

Suppose a normal distribution has a height of "1" at its peak (the mean).

What is the height of the curve at two Standard Deviations from the mean?

I replied:

exp(-0.5*2^2) = exp(-2), or about 1/7. Sorry, the math was so available that my intuition didn’t come into play. So I was the wrong person to ask this one!

Shane responded:

The vast majority of (smart) people underestimate this, and, more interestingly, override their intuition by adjusting further in the wrong direction – e.g. by correcting 0.06 to 0.03, by remembering that “the tails are long.”

Dan Goldstein asked me this recently at a conference. My guess (0.14) was excellent, but was based purely on a perceptual estimate from my memory of what the curve looks like; akin to asking me for an estimate of the diameter of the grapefruit I ate for breakfast yesterday. I have no idea how the math works. I think there is a formula or something, possibly involving “e.” I had no intuition per se, besides the photo in my “mind’s eye.” IF I had invoked math, I could imagine my “S1” would yield 0.025. And that my “S2” might have adjusted that downward, since that is ALL the mass of 2 SDs and higher. But then my other S1 would have overruled that S2 correction, since, as I visualize it, the curve has not come close to asymptoting at “only” 2 SDs.

He followed up with this one:

What's the most common four game start to an NFL season?

W W W W
W W L L
L L W W
W L W L
L W L W
L L L L

I replied:

Logic and math suggest that it's either the first one or the last one. I think that extremely shitty teams are more prevalent than extremely good teams, so I'm guessing the last option.

Shane responded:

Correct and correct. I also find it obvious (that it is WWWW or LLLL), but I swear I asked 60 people (including some statisticians) and only 3 have responded correctly.

The popularity of options and their actual incidence is almost perfectly negatively correlated.

The point of these examples is not that they're super-challenging math or probability problems; rather, it can be interesting to see where people's intuitions can go wrong.

38 thoughts on “Maybe one of the remaining ’72 Dolphins will read this one.”

  1. What’s the most common *number of wins* out of the first 4 games though?

    Although bad teams are more common than excellent ones, you can’t have most teams losing every game, because for every loser there’s a winner. I would guess that 2 wins is the most common; it’s just that there are many ways to order that: WLLW, LWLW, WWLL, etc.

    • I thought it might be a trick question, that the most common four-game start might be LLLW or something else not on the list.

      And yeah, obviously the combinatorics are a big part of the question. I’m sure 2W / 2L is much more common than 4L, but the former can occur in six ways and the latter in only one. I know you know this, we all know this.

    • I think that applying some scrutiny to these types of questions, just as you’ve done, is important. I’m not accusing anyone in this conversation of this, but a bugbear of mine is when people who like to think of themselves as mathematically literate get their jollies by setting up questions that purport to show the superiority of logic over human intuition.

      When that takes place, I think that often what is actually happening is that the logician is insisting on some ‘correct’ answer to their particular question, but the question itself is kind of dumb or unhelpful. Meanwhile, the lay audience is actually providing a useful answer to what would be the sensible question. To try and illustrate, I tend to use this classic stats class scene:

      Questioner – Which number is more likely to come up in a lottery, 123456 or 169287?
      Respondent – 169287
      Questioner – Ha, it’s actually neither; your intuitions are bad!

      I think we can turn this on its head, and ask, which is the more useful question:

      Which of those two numbers is more likely to come up in a lottery?
      Or
      Are unordered or ordered number sequences more common?

      The answer, I think, is clearly the latter. It is the latter question that allows humans to detect communication or trends from among noise. So our respondent above can be thought of as giving the more useful answer that numbers along the lines of 123456 are uncommon, and warrant special attention when they occur.

      Why I care is because the supposed irrationality of humans is often one of the planks used to disparage the value of cultural knowledge, normative behaviours, and social identification with communities. It’s a whole thing in organisational psychology.

      I consider myself to be squarely in the Gerd Gigerenzer camp on this front. I’d be happy to hear if people think I’m off base.

    • Exactly. When I was first working this kind of stuff out, I wrote a case study that uses this as an example, Typical sets and the curse of dimensionality.

      Suppose you have i.i.d. draws y[n] ~ bernoulli(theta) for n in {1, …, N}, with theta > 0.5. Then the most probable outcome is a sequence of N 1s. This is easy to see because 1 is the most probable outcome for each y[n]. If they are not i.i.d., and y[n] ~ bernoulli(theta[n]), then a sequence of 1s is the most probable outcome if each theta[n] > 0.5.

      The key thing to understand is that the outcome of all 1s is not “typical,” in the sense that its number of 1s is far from the expected number. Let’s say that we have theta = 0.6 and N = 100. Then we expect 60 wins and 40 losses, but the most probable single sequence is a sequence of 100 wins. To bring that back to random variables, if we let z = sum(y), creating a transformed random variable, then the most probable value of z is 60.
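The contrast between the most probable sequence and the most probable count can be checked numerically. The following is a minimal sketch using theta = 0.6 and N = 100 from the comment; the function names are made up for illustration.

```python
import math

theta, N = 0.6, 100

def log_prob_sequence(k):
    """Log-probability of one specific sequence that contains k ones."""
    return k * math.log(theta) + (N - k) * math.log(1 - theta)

def log_prob_count(k):
    """Log-probability that z = sum(y) equals k (binomial pmf)."""
    return math.log(math.comb(N, k)) + log_prob_sequence(k)

# Number of ones in the single most probable sequence.
best_sequence_k = max(range(N + 1), key=log_prob_sequence)
# Most probable value of the count z.
best_count = max(range(N + 1), key=log_prob_count)

print(best_sequence_k, best_count)
print(math.exp(log_prob_sequence(N)))  # probability of the all-ones sequence
print(math.exp(log_prob_count(60)))    # probability that z == 60
```

The all-ones sequence beats every other individual sequence, yet its probability is tiny compared with the total probability of the many sequences that contain 60 ones.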

  2. NFL teams that start with four consecutive losses, for the season:

    2025: 4
    2024: 1
    2023: 2
    2022: 0
    2021: 2
    2020: 2

    An interesting question (interesting to me): What is the largest number of teams that can, combinatorially, start with four losses? To simplify the answer, according to a Google search, the average number of wins per season for an NFL team over the past ten years is 8.5. There are 17 games in a season.

  3. Speaking of sports, I had been meaning to ask you about the sabermetrics post you made. Was that off the cuff, or had you had that prepared a-priori? I talked to Tim Johnson, and he said he inherited a bunch of RNG’s and re-cycles code from years ago. Now, we use Stan or PyMC, and then if we’re implementing a Gibbs sampler or whatever, you can use base R. But putting it out in a day was impressive. I’ve honestly done less base R classical linear modeling than Bayesian, so I have to use references. Stan, not really. Depends. There are RNG libraries in C++ you can import. But that kind of knowledge. Extremely impressive. Although I’ve only done a few formal classical linear regression classes, mostly computational statistics, Bayesian and then math, and the rest, textbooks. But is there a sabermetrics reference? I haven’t thought too much about sports betting, to be honest. Just poker, and I haven’t done simulations, just study strategy, books, online and the casino. Not claiming to be an expert or anything.

  4. I looked at the last 5 NFL seasons. There are 32 teams and so 160 4-game season-starting sequences. The frequency of each sequence is as follows:

    0000 9
    0001 11
    0010 11
    0011 10
    0100 7
    0101 9
    0110 8
    0111 13
    1000 12
    1001 10
    1010 9
    1011 7
    1100 9
    1101 10
    1110 13
    1111 8

    (Due to ties, the 4 missing sequences are 1 each of 010T, 110T, T010, and T000)

    Logic and math suggested to Andrew that 0000 would be the most common followed by 1111. Shane labeled these ‘correct’ and also obvious. But I’m not seeing it. They’re not the most common in the last 5 years, and I don’t think more years is going to change things too much.

    Thoughts?

    • Love that you brought the data in. It also wasn’t obvious to me that there would be any kind of stand out for LLLL. Apparently according to the data there isn’t.

    • Intuitive or counterintuitive results are generally preferred over evidence.

      Seriously, though, if NFL games were coin flips, each possible sequence would be equally likely.

      If we assume that some teams are better than others and that teams are matched up randomly, then WWWW or LLLL should be more probable than outcomes with 2W and 2L.

      However, the NFL intentionally devises schedules so that better teams are more frequently matched against each other. It seems to me that that should reduce the effect of team differences but not eliminate it.

      Maybe we need more complicated possibilities, such as: If a team does well consistently, it will become obvious what their “secret” is, and the next team will be more likely to neutralize it. Conversely, if a team does poorly consistently, it will be obvious they need to change something.

      Or, the evidence is not strong enough to distinguish among small effects.

      Presumably people who bet big money on games have thought about all of this and more and have modeled same.

    • Bjs:

      What you have is a small sample. If the expected number in each category is 10, then the standard deviation for the number in each category is sqrt(10), or about 3. Your data are roughly consistent with the probabilities being equal in each category. Which they’re not; it’s just that 5 seasons isn’t enough to estimate this.

      You say, “I don’t think more years is going to change things too much.” If we could gather all the data from the history of the NFL, we’d have about 10 times as much data. That’s still not a huge sample size, but an extra 9x of data could change things a lot.
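The small-sample point can be made concrete with a rough check against equal probabilities. This sketch uses the 16 tie-free counts from the 5-season table earlier in the thread; the Pearson chi-square statistic is my choice of summary here, not something the comment specifies.

```python
import math

# Counts for sequences 0000 through 1111, copied from the 5-season table.
counts = [9, 11, 11, 10, 7, 9, 8, 13, 12, 10, 9, 7, 9, 10, 13, 8]

total = sum(counts)
expected = total / len(counts)    # expected count per sequence if all are equally likely
approx_sd = math.sqrt(expected)   # Poisson-style approximation to the sd of each count

# Pearson chi-square statistic against the equal-probability hypothesis.
chi_square = sum((c - expected) ** 2 / expected for c in counts)
degrees_of_freedom = len(counts) - 1

# Largest deviation from the expected count, in units of the approximate sd.
largest_z = max(abs(c - expected) / approx_sd for c in counts)

print(total, expected, round(approx_sd, 2))
print(round(chi_square, 2), degrees_of_freedom, round(largest_z, 2))
```

The statistic comes out well below its degrees of freedom, so these counts cannot distinguish equal probabilities from modestly unequal ones. That is a statement about the sample size, not evidence that the probabilities are equal.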

      • Andrew:

        Okay, fine. Mathematically, briefly explain how four consecutive losses would be more likely than any other combination. The binomial probability for four consecutive losses: (.5)^4 = 6.25%. For two wins and two losses: 6*[(.5)^4] = 37.5%, and so on, where four consecutive losses (or wins) is the least likely result. Then, your point is that as the sample size increases and the standard deviation decreases, the distribution of the combinations of wins and losses should converge to the expected results of a binomial probability distribution (given some strong assumptions, say, that a team has a fair chance of winning every game it plays). Right?

        • Sam:

          The probability is not 1/2. Some teams are better than others. So, no, it shouldn’t look like a binomial distribution; that’s the point.
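To illustrate why unequal team strength changes the picture, here is a minimal sketch that averages sequence probabilities over teams with different win probabilities. The two-type league (win probabilities 0.3 and 0.7) is a hypothetical choice, and the sketch treats each team's games as independent, ignoring the fact that one team's win is another team's loss.

```python
from itertools import product

def sequence_probs(win_probs, n_games=4):
    """Average probability of each W/L sequence over teams with the given win probabilities."""
    probs = {}
    for seq in product("LW", repeat=n_games):
        total = 0.0
        for p in win_probs:
            q = 1.0
            for result in seq:
                q *= p if result == "W" else 1 - p
            total += q
        probs["".join(seq)] = total / len(win_probs)
    return probs

fair = sequence_probs([0.5])        # every game is a coin flip
mixed = sequence_probs([0.3, 0.7])  # hypothetical league: half weak teams, half strong teams

# Probability of exactly two wins when every game is a coin flip.
two_wins_fair = sum(p for seq, p in fair.items() if seq.count("W") == 2)

print(fair["LLLL"], fair["WLWL"], two_wins_fair)
print(mixed["LLLL"], mixed["WWWW"], mixed["WLWL"])
```

With coin-flip games every specific sequence has probability 1/16, and two wins in some order has probability 6/16. Once teams differ, the all-loss and all-win sequences become the most probable individual sequences.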

          The “extremely shitty teams are more prevalent than extremely good teams” reasoning can be illustrated with this extreme example of four teams, one (A) that loses every match and three (B,C,D) that are equally likely to win when facing each other. We can just look at the results for the pairings A-B,C-D / A-C,B-D / A-D,B-C in the first three match days (permutations won’t change the end result). There is one single competitive match each time and the eight equiprobable three-game start outcomes for each team will be the following:

          A: LLL (8 times)
          B: WLL, WLW, WWL, WWW (2 times each)
          C: LWL, LWW, WWL, WWW (2 times each)
          D: LLW, LWW, WLW, WWW (2 times each)

          The count for each three-games start is then:

          LLL 8
          LLW 2
          LWL 2
          WLL 2
          LWW 4
          WLW 4
          WWL 4
          WWW 6
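The counts in this four-team example can be reproduced by brute-force enumeration. A minimal sketch, with the schedule encoded exactly as described in the comment (the data layout is my own):

```python
from collections import Counter
from itertools import product

# One entry per match day: (team that plays A and therefore wins,
#                           the two teams in the competitive game).
match_days = [("B", ("C", "D")), ("C", ("B", "D")), ("D", ("B", "C"))]

starts = Counter()
# Enumerate the 8 equiprobable outcomes of the three competitive games;
# w is the index (0 or 1) of the winner within each competitive pair.
for winners in product([0, 1], repeat=3):
    results = {team: "" for team in "ABCD"}
    for (beats_a, pair), w in zip(match_days, winners):
        results["A"] += "L"
        results[beats_a] += "W"
        results[pair[w]] += "W"
        results[pair[1 - w]] += "L"
    starts.update(results.values())

for seq in sorted(starts):
    print(seq, starts[seq])
```

Each of the three match days contributes equally many wins and losses, yet LLL occurs 8 times against 6 for WWW.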

          Andrew, if you write a sequence of wins and losses for each team for the season, and you append these sequences across all teams, exactly half of the results will be wins. Not approximately half, exactly (I’m assuming we throw away ties if they occur). Because, for every win, some other team had a loss.

          I agree that for individual teams their probability of W or L is not 1/2 but the average over all teams is exactly 1/2

          This exact halfness for any length sequences induces a dependency between subsequences. If one set of teams has more WWWW then the remaining other teams must have extra sequences with 3 or 4 L. This dependency between the team results seems to sort of balance out the probability of different results. I haven’t run simulations but I guess either my intuition is less good than yours or I’m overthinking it, or maybe the kinds of results seen in the data are more like what you’d get even in large samples. Also the fact that in reality the games are not randomly selected, so that good teams play each other more often, and bad teams play each other more often … makes me think we should expect fairly uniform distribution over sequences.

        • I’m also not sure about the reasoning that LLLL should be more common than WWWW, although I accept that these two are likely the most common.

          Say that we just looked at the first game that teams play. Obviously (as also noted by Daniel), exactly half of them must correspond to an L and half to a W if ties are not possible. And the same must be true for the second game that each team plays, and so on for the third game and fourth game.

          If every team’s win was random, then each combination would be equally likely (so LLLL and WWWW are equally common). However, say now that there is dependence (as there should be) so that teams that had an L in the first game are more likely to also have an L for the second game. However, that must also mean that teams that had a W in the first game must have an equally larger probability of having a W in the second game (because when L1 plays against L1, there is a 0.5 probability of W2 for an L1, as it is when W1 plays against W1, so the extra probability must come from when W1 plays against L1). Therefore, LL** and WW** are equally likely (as was L*** and W***)

          Consequently, it seems that we would eventually arrive at the conclusion that LLLL and WWWW would be equally likely?

          This is just rough reasoning though. We would also have to consider the probabilities for those with LL and WW to lose or win the third game, respectively, and so on, which also requires considering the other combinations of wins and losses. However, it at least seems reasonable that they would be the same due to this dependence.

        • Sam:

          It’s the maximum likelihood estimate if you consider the sequence. It’s not the maximum likelihood estimate of the number of wins considered as a binomial. But see Daniel’s point.

          Andrew:

          You mean just the balanced one, binomial(0.5, N)? Because binomial(0.6, N) is also a binomial.

          Daniel:

          You make a good point. You want to compute the expectation over the whole win/loss matrix simulating seasons, which is not the same as simulating each team marginally. I’ve only ever done this kind of finite-sample adjustment by simulation. Here, I think it’d be easiest to simulate with something like a Bradley-Terry model that gives win-loss probability in each matchup.

          I think Andrew’s point here is that you can have more under 50% teams than over. For example, consider teams that each play 10 games with records of 3-7, 4-6, 4-6, 9-1.
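A simulation along the lines suggested here might look like the following sketch. Everything about the setup is an assumption made for illustration: 32 teams, abilities drawn from a normal distribution on the log-odds scale, uniformly random pairings each week, and no ties. It is not a model of the real NFL schedule.

```python
import math
import random
from collections import Counter

def simulate_starts(n_teams=32, n_games=4, n_seasons=2000, ability_sd=1.0, seed=0):
    """Count n-game starting sequences under a Bradley-Terry model with random pairings."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_seasons):
        # Hypothetical abilities on the log-odds scale, redrawn each season.
        ability = [rng.gauss(0.0, ability_sd) for _ in range(n_teams)]
        records = [""] * n_teams
        for _ in range(n_games):
            order = list(range(n_teams))
            rng.shuffle(order)
            for i, j in zip(order[0::2], order[1::2]):
                # Bradley-Terry: P(i beats j) = logistic(ability_i - ability_j).
                p_i = 1.0 / (1.0 + math.exp(ability[j] - ability[i]))
                if rng.random() < p_i:
                    records[i] += "W"
                    records[j] += "L"
                else:
                    records[i] += "L"
                    records[j] += "W"
        counts.update(records)
    return counts

counts = simulate_starts()
for seq, n in counts.most_common():
    print(seq, n)
```

Because every game is simulated as a matchup, the win/loss parity that Daniel describes holds exactly. With a symmetric ability distribution like this one, LLLL and WWWW are equally likely in expectation; a skewed ability distribution would be needed to make one more common than the other.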

        • I should clarify that when I too quickly wrote “If every team’s win was random” I meant, “If the probability of a win was 0.5 for each team whatever the combination of teams in a match”.

        • Bob,

          I wrote, “Some teams are better than others. So, no, it shouldn’t look like a binomial distribution.” You only get a binomial model if the win probability is the same for all games. If all games had the favorite with 60% chance of winning and the underdog 40% chance of winning (e.g., in a hypothetical league with perfect balance between teams but a large home-field advantage), then, yes, you could see binomial(n, 0.6). But in a league in which teams vary a lot in ability, the win probability won’t be constant so it won’t be binomial. Also, every win is a loss for a different team so you can’t get binomial(n, 0.6) either, but that’s another story.

        • > it seems that we would eventually arrive at the conclusion that LLLL and WWWW would be equally likely?

          No, Lxxx and Wxxx are equally likely (and so are xLxx and xWxx, xxLx and xxWx or xxxL and xxxW) but that doesn’t necessarily make LLLL equally likely to WWWW. See the example I wrote above where four teams play against each other and LLL ends up being more likely than WWW even though the Lxx and Wxx are balanced, etc.

        • Carlos,

          Yeah, I was working out step 3 in my setup, and LLL is more probable whenever WWW if P(loss given 2 losses and you play against someone with 1 win) > P(win given 2 wins and you play against someone with 1 win). This is true in your example. I’m not sure whether this is a reasonable assumption (I’d think really bad teams would be downgraded), but it shows how it ultimately depends on the assumption Andrew brings up.

      • Could be. What model do you suppose the data should follow? (In the end, if we could get enough data)

        In adding more years, I got more rigorous with my data checking. The above table was close, but a little off (bye weeks as early as week 4?! You displease me, NFL). The following is updated for 10 seasons, hopefully all correct now (xx=00, 01, 10, 11):

        00xx: 23, 20, 20, 19
        01xx: 17, 17, 14, 23
        10xx: 19, 18, 19, 16
        11xx: 17, 21, 23, 22

        (now up to 10 ties of various sorts)

      • “whenever” should be “than”

        Also, note the equal probability of LL** and WW** in your example. This supports the correctness of this step, so it is when we go to the third step that differences can start to materialize. I suspected it could possibly be that way in my first post when saying that this was just “rough reasoning” so it was nice to work out how it can develop with more steps.

        • In the “two games” case there are not enough “degrees of freedom” to get an “interesting” result. For example with the data above if we put in an array the 1st game L/W and 2nd game L/W we get

          41 38
          38 41

          The sum of each row and column is fixed (79) and necessarily the elements in each diagonal are equal, so LL=WW (41 in this case).

          However, with an additional game we get a three dimensional array

          20 21
          17 21
          -----
          22 16
          20 21

          and the sum of the four values in the six sides is fixed (79) but the values in the diagonals of the cube don’t need to be equal (here LLL=20 and WWW=21).
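The fixed side sums can be verified directly. This sketch assumes the array is laid out with the first game selecting the layer, the second game the row, and the third game the column, which matches the three-game counts given elsewhere in the thread (0 = loss, 1 = win).

```python
# Three-game start counts from the 5-season data (0 = loss, 1 = win).
counts = {
    "000": 20, "001": 21, "010": 17, "011": 21,
    "100": 22, "101": 16, "110": 20, "111": 21,
}

# Each "side" of the cube fixes one game to L or W and sums over the other two games.
sides = {
    (game, outcome): sum(n for seq, n in counts.items() if seq[game] == outcome)
    for game in range(3)
    for outcome in "01"
}

for key in sorted(sides):
    print(key, sides[key])
print(counts["000"], counts["111"])
```

All six sides sum to 79, while the two corner cells LLL and WWW remain free to differ.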

        • Carlos,

          Yes, I am aware. What I am saying is that before working out step 3 it was not obvious to me that the dependence between one W and one L would lead to only a one “degree” restriction (compared to when each team’s W-L is independently binomially distributed, so each team could get a W, for example), and not cause further restrictions down the line. Compare to if the setup had been that teams that lost play against others that lost, and teams that won play against others that won (an equal number of times). Then the W-L restriction would cause further degrees of restriction of results, compared to the independent binomial case.

          In retrospect, it is clear from your example that there is only one degree of restriction. When I first read it quickly though, I got the impression that the setup was different so that’s why I didn’t consider it in my original post.

        • I was expanding on your observation that it gets interesting in the third step. I agree that it’s not immediately obvious but it ends up being quite simple. The frequency of n-dimensional sequences of loss/win results is a set of 2^n numbers between 0 and 1. They have to sum to 1 so there are only 2^n-1 independent numbers. The win/loss parity introduces an additional constraint for each dimension: the marginal probabilities conditional on a win (or a loss) at that position are 0.5. The number of constraints grows linearly with n but the number of degrees of freedom grows exponentially.

          For n=1 there is one degree of freedom and an additional constraint, the only solution is L: 0.5 W:0.5.

          For n=2 there are three degrees of freedom and two additional constraints. We can fix for example any value from 0 to 0.5 for LL but the value for WW will be the same.

          For n=3 there are seven degrees of freedom and three additional constraints. The largest difference possible between the values for LLL and WWW (or vice versa) is 0.25 (1/4). One could have 0.25 for LLL, LWW, WLW and WWL.

          For n=4 there are fifteen degrees of freedom and four additional constraints. The largest difference would be 1/3 (2/6), one could have 1/3 for LLLL and 1/6 for each of LWWW, WLWW, WWLW and WWWL.

          In general, the maximum difference possible between LL…LL and WW…WW is (n-2) / (2(n-1)).
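The constructions for n=3 and n=4 generalize to putting mass on the all-loss sequence and on the n sequences with exactly one loss. The sketch below checks, with exact fractions, that this distribution satisfies the parity constraints and attains the stated difference. The function names are hypothetical, and the check shows feasibility only, not that the value is the maximum.

```python
from fractions import Fraction

def extremal_distribution(n):
    """Mass on L...L plus equal mass on each sequence with exactly one L (for n >= 2)."""
    one_loss_mass = Fraction(1, 2 * (n - 1))
    dist = {"L" * n: 1 - n * one_loss_mass}
    for i in range(n):
        dist["W" * i + "L" + "W" * (n - i - 1)] = one_loss_mass
    return dist

def loss_marginals(dist, n):
    """Probability of a loss at each game position."""
    return [sum(p for seq, p in dist.items() if seq[i] == "L") for i in range(n)]

for n in (3, 4, 5):
    dist = extremal_distribution(n)
    difference = dist["L" * n] - dist.get("W" * n, Fraction(0))
    print(n, dist["L" * n], loss_marginals(dist, n), difference)
```

For n=3 this gives 1/4 on each of LLL, LWW, WLW and WWL, and for n=4 it gives 1/3 on LLLL and 1/6 on each one-loss sequence, matching the values in the comment.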

    • Thanks for collecting the data.

      One way to have more data without actually having more data is to change slightly the question and look at the similar one for the most common three games start, where one could in principle apply a similar reasoning. Then you can aggregate the xxx0 and xxx1 data into one (I also counted the fourth-game ties) getting the following table (second number is for the extended dataset you posted later, some fourth-game ties may be missing):

      000 20 (43)
      001 21 (39)
      010 17 (35)
      011 21 (37)
      100 22 (37)
      101 16 (35)
      110 20 (39)
      111 21 (45)

      Or for the most common two games start (the missing fourth-game ties in the second number may explain why the number of first game wins is not the same as the number of first game losses, same for second game wins and losses, an issue that wasn’t as obvious in the three-games count above):
      00 41 (82)
      01 38 (72)
      10 38 (72)
      11 41 (84)

  5. > The point of these examples is not that they’re super-challenging math or probability problems; rather, it can be interesting to see where people’s intuitions can go wrong.

    > IF I had invoked math, I could imagine my “S1” would yield 0.025. And that my “S2” might have adjusted that downward, since that is ALL the mass of 2 SDs and higher.

    My intuition says that what’s happening when people give wrong answers is that they’re confusing the height of the curve at some point with the area under the curve in some region related to that point.
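The two quantities that may be getting confused are easy to compare. A minimal sketch for a normal curve scaled to height 1 at the mean, using only the standard library:

```python
import math

def relative_height(z):
    """Height of the normal curve z SDs from the mean, relative to the peak."""
    return math.exp(-0.5 * z * z)

def upper_tail_area(z):
    """Area under the standard normal density beyond z SDs above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))

height = relative_height(2.0)
one_tail = upper_tail_area(2.0)

print(round(height, 4))        # relative height at 2 SDs
print(round(one_tail, 4))      # area above 2 SDs
print(round(2 * one_tail, 4))  # area beyond 2 SDs in both directions
```

The height at 2 SDs is roughly six times the one-sided tail area, so answering with a tail probability underestimates the height by a wide margin.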

  6. > Correct and correct. I also find it obvious (that it is WWWW or LLLL), but I swear I asked 60 people (including some statisticians) and only 3 have responded correctly.

    > The popularity of options and their actual incidence is almost perfectly negatively correlated.

    Seems suspect that this didn’t include the actual numbers (and basic methods, like years analyzed, data source, etc.). Maybe it was in the email but left out of your post. Is this based on a modelling assumption or on data?

  7. Let’s turn the crank and then look at the intuitions that Andrew’s developed to let him jump right away to the formula he wrote down. Before I started working on stats, these posts and calculations felt like magic tricks. Let’s turn on the slow-motion camera and follow the hands.

    First, you’ve got to understand how the word problem relates to something simpler. Formally, I take the word problem to reduce to

    [I] answer = normal(mu + 2 * sigma | mu, sigma) / normal(mu | mu, sigma), when

    [II] normal(mu | mu, sigma) = 1.

    We can do this the hard way, or we can do it the easy way.

    First the hard way. Go to Wikipedia and expand out the definition of normal(y | mu, sigma). Then solve for (mu, sigma) in equation [II]. There are infinitely many solutions, so we’re going to have to verify that the problem even makes sense—what if these solutions give different answers? Well, it turns out that’s the key to the easy way.

    Andrew once mentioned that he thought probability densities were taught poorly as a big formula dump because they’re easier to understand starting from the kernel. You can work out the integral for the normalizing constant as an exercise, but it turns out that in most practical calculations we do, the normalizing constants drop out (e.g., MCMC sampling or likelihood ratios and word problems sent in by blog correspondents).

    For the normal, the kernel is exp(-1/2 ((x - mu) / sigma)**2). Now let’s look at where the intuitive leaps come in:

    1. The problem is translation invariant. We’re going to get the same answer no matter what mu is. This needs a proof. But the intuition is simple—the location parameter mu just says where the exact same curve is located along the x axis. It’s a good exercise. So let’s just suppose mu is 0. Now our problem is just

    [I’] answer = normal(2 * sigma | 0, sigma) / normal(0 | 0, sigma).

    2. The problem is scale invariant. We’re going to get the same answer no matter what sigma is. This also needs a proof. But the intuition is also simple. The factor sigma just scales up or down the whole density, so the ratio’s not going to change. So let’s suppose sigma is 1. Why? Because our experience tells us it’s the easiest thing to multiply or divide by, being the multiplicative unit. Now our problem is just

    [I”] answer = normal(2 | 0, 1) / normal(0 | 0, 1).

    3. We can drop the normalizing constants in the evaluation of the normals and just use the kernel. That’s easy to see once you know how to break densities down into kernels and normalizing constants. The parameters (0, 1) on which the normalizing constant is based do not change in the numerator and denominator. So we can do a quick substitution and get

    [I”’] answer = exp(-1/2 * 2**2) / exp(-1/2 * 0**2) = exp(-2) / exp(0) = exp(-2) ≈ 0.135.

    Et voila. Once you start scrutinizing the math in all of this, which for me was when I coded up a bunch of Stan’s log densities with Daniel Lee, each of these three steps is so intuitive as to not require much thought. And if you’re Andrew-level good at this, you just write down what he did above straight away because steps 1/2/3 arise again and again in exactly this form and if you do regression, the normal density kernel is precompiled as it were.
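The translation and scale invariance claimed in steps 1 and 2 can be spot-checked numerically by evaluating the full density, normalizing constant included, at a few arbitrary parameter values. This is a sanity check rather than a proof, and the (mu, sigma) pairs are arbitrary choices.

```python
import math

def normal_pdf(y, mu, sigma):
    """Full normal density, including the normalizing constant."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def height_ratio(mu, sigma):
    """Density two SDs above the mean, divided by the density at the mean."""
    return normal_pdf(mu + 2 * sigma, mu, sigma) / normal_pdf(mu, mu, sigma)

# Arbitrary (mu, sigma) pairs; the ratio should not depend on them.
ratios = [height_ratio(mu, sigma) for mu, sigma in [(0, 1), (5, 1), (0, 3), (-2.5, 0.1)]]

print(ratios)
print(math.exp(-2))
```

Every ratio agrees with exp(-2) up to floating-point error, which is what lets the derivation fix mu = 0 and sigma = 1 and work with the kernel alone.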

    • > look at the intuitions that Andrew’s developed to let him jump right away to the formula he wrote down

      He wrote that “the math was so available that my intuition didn’t come into play” so I assume that he just remembers the formula for the normal f(x)=C*exp(-1/2 x^2) where in the standard form C=1/sqrt(2 pi) is a normalization constant but here we’re told that it’s not normalized and f(0)=C=1. Then f(2)=exp(-2).

        • Andrew’s point was that you don’t need any intuition when you know the formula for the normal distribution which is, up to a normalization constant, the exponential of minus half x squared. The math can’t get any easier.

        • Thinking a bit more about it it seems that I find the application of the formula defining the normal function to a simple question regarding the normal function straightforward – and don’t really see how it wouldn’t be – just like you find that a great deal of intuition is required.

          An essentially similar problem is to consider a triangular function with peak value equal to one and find the value at the midway point between the peak and the point where the function becomes zero. I wouldn’t say that to reduce this problem to one where the math is easy one has to consider translation invariance or scale invariance, let alone kernels or whatever.
