My answer: Write a little program to simulate it

Brendon Greeff writes:

I was searching for an online math blog and found your email address. I have a question relating to the draw for a sports tournament.

If there are 20 teams in a tournament divided into 4 groups, and those teams are selected based on four “bands” (Band: 1-5 ranked teams, 6-10, 11-15, 16-20), i.e. the top 5 ranked teams are first drawn randomly into the 4 groups, then teams ranked 6-10 are drawn randomly, and so on and so forth.

If all the teams are the same for 2 consecutive tournaments, what are the chances/ odds/ probability that 3 teams from Bands 1, 2 and 3 would end up in the same group for 2 consecutive tournaments?

My response is given in the title above. You can do the simulation in R, Python, whatever. The point is that, if you want to do probability and statistics, it’s as important to know some programming as it is to know some math. (And that’s relevant to our discussion from the other day.)

20 thoughts on “My answer: Write a little program to simulate it”

  1. This would be pretty easy to calculate (rather than simulate) if there were four teams per band instead of 5. As it is, the problem isn’t even well defined: you presumably don’t really assign all 5 teams in each band to 4 groups at random, independent of the previous draws, because you could end up with different numbers of teams in each group. (The ‘extra’ team from Bands 1, 2, and 3 could all be assigned to the same group, for example). I’m guessing that if a group gets two teams from Band 1 they don’t get any from Band 2, but other rules are possible.

    I thought of taking a few minutes and doing the simulation, but decided it would take me more like half an hour than 5 minutes, and hey, I’m a busy man. If anyone has an elegant program I’d be interested in seeing it.

    I will predict that in the same sense that the chance of two people in a group of 30 having the same birthday is “surprisingly high”, the chance here is going to be surprisingly high too.
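    For the simplified case mentioned above, with four teams per band so that every group gets exactly one team from each band, the calculation can be sketched and checked by simulation. The reading of the question used here is an assumption: "some trio made up of one team each from Bands 1, 2, and 3 shares a group in both tournaments." Relative to the first draw, the second draw amounts to two independent random permutations of the four groups (Bands 2 and 3 relative to Band 1), and a trio repeats when both permutations fix the same group, so inclusion-exclusion gives 41/192, about 0.21. The function names below are illustrative.

```r
# Simplified case (assumption): 4 teams per band, one team per band per group.
set.seed(1)

# One draw: row = group, column = band (1 to 3); entries index teams within a band.
# Each group's trio is collapsed into a key such as "2-4-1".
draw_trios <- function() {
  m <- replicate(3, sample(4))
  apply(m, 1, paste, collapse = "-")
}

# Does any Band 1/2/3 trio share a group in two independent draws?
trio_repeats <- function() any(draw_trios() %in% draw_trios())

n_sims <- 20000
estimate <- mean(replicate(n_sims, trio_repeats()))

# Exact value by inclusion-exclusion over the groups whose trio is unchanged:
# a given set of k groups keeps its trios with probability ((4 - k)! / 4!)^2.
k <- 1:4
exact <- sum((-1)^(k + 1) * choose(4, k) * (factorial(4 - k) / factorial(4))^2)

print(c(simulated = estimate, exact = exact))
```

    The five-teams-per-band version breaks the one-team-per-band symmetry, which is why the same shortcut does not carry over directly.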

  2. Obviously? I was pointing to how far one might get with simulation here http://statmodeling.stat.columbia.edu/2014/06/14/hes-great-math-wants-statistics-machine-learning/#comment-172143 and, admittedly somewhat tongue in cheek, here http://statmodeling.stat.columbia.edu/2010/06/18/course_proposal/ in my course proposal for zombies.

    Robert: There are reasons why it is not used more. Simulation itself can be seen by many as another black box they can’t or don’t need to understand, simulation won’t facilitate learning from most current statistical texts (without a lot of work), and it’s not quite acceptable as professional communication in the statistical profession. And at some point the simulation needs to go to MCMC to do fully realistic statistical analyses. Hopefully we might learn more of the whys here.

    But when you are stuck on the math, it seems a no brainer to do some simulation rather than remain stuck or give up!

    • One problem with using simulations is that they are not always easy to write correctly. For example, it can matter in which order you traverse your array, or whether you calculate a function based on the current state or on a frozen state of the previous iteration and then do all the swaps simultaneously.

      I’ve often wondered if using a normal imperative language (e.g. C, Python, Matlab, etc.) makes this harder? Maybe if there was a specific declarative or agent-based language to do this sort of modelling? Is there? I’m not sure.
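      A minimal sketch of the frozen-versus-current-state issue, using a made-up toy rule (each cell copies its left neighbour) rather than any particular model: one pass over the same starting vector gives different answers depending on whether cells read a frozen copy or the partly updated array.

```r
state <- c(1, 0, 0, 0, 0)

# Synchronous update: every cell reads the frozen previous state.
sync_step <- function(x) {
  old <- x
  for (i in 2:length(x)) x[i] <- old[i - 1]
  x
}

# In-place update: each cell reads whatever its neighbour holds right now,
# so the result depends on the traversal order.
inplace_step <- function(x) {
  for (i in 2:length(x)) x[i] <- x[i - 1]
  x
}

print(sync_step(state))     # 1 1 0 0 0
print(inplace_step(state))  # 1 1 1 1 1
```

      Neither version is wrong in itself; the point is that the code has to match the process being modelled, and the two are easy to confuse.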

      • It’s also not always easy to do the analytical math correctly, either. The point is that you need to know both how to do the work and how to check it in both cases.

        Which language is easiest for a person usually depends on which they’re most familiar with. I think imperative languages like R or Python are nice for this sort of thing because you can just tell the whole generative story in simple code, assigning to each variable in turn. But then I don’t understand any of the R code posted later in the comments, because I’m not an R guru and they’re using lots of the “fancy” stuff. That can make it easier to understand and less error prone if you’re familiar with the higher-level idioms, or more inscrutable if you’re not.

    • Bill:

      I exchanged emails and talked briefly to Peter Bruce about this – he seems to disagree but I am not sure why.

      On the other hand, in Julian Simon’s writing there was an example of using simulation to _understand_ Bayesian statistics, from memory rudimentary rejection sampling or what is now called ABC. So I believe Julian would agree with you.

      Another clue that something more than simulation is required, but simulation should not be overlooked when it can help.
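      For readers who have not seen it, rudimentary rejection sampling of the kind described can be sketched in a few lines. The data (7 successes in 10 trials) and the uniform prior are made up for illustration and are not taken from Simon's example: draw a parameter from the prior, simulate data from it, and keep the draw only when the simulated data match what was observed.

```r
set.seed(42)

n_trials <- 10   # hypothetical experiment size
observed <- 7    # hypothetical number of successes

# 1. Draw candidate success probabilities from a uniform prior.
theta <- runif(200000)
# 2. Simulate one data set per candidate.
simulated <- rbinom(length(theta), size = n_trials, prob = theta)
# 3. Keep only the candidates whose simulated data equal the observed data.
posterior_draws <- theta[simulated == observed]

# With a uniform prior the exact posterior is Beta(observed + 1, n_trials - observed + 1),
# whose mean is (observed + 1) / (n_trials + 2).
print(c(simulated_mean = mean(posterior_draws),
        exact_mean = (observed + 1) / (n_trials + 2)))
```

      The kept draws approximate the posterior without any integrals, which is what makes the approach useful for building intuition.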

  3. My favorite intro Stats textbook, A Modern Introduction to Probability and Statistics by Dekking et al, puts simulation and Monte Carlo techniques (like bootstrapping) right into the first semester of statistics. It’s a pretty powerful problem-solving tool to put in the hands of beginning students, especially if they don’t have the mathematical sophistication to deal with complicated integrals yet.

  4. Hi,
    I like R a lot, so I decided to give it a try. It took me 26 minutes indeed, Phil…
    Maybe I made a stupid mistake, but I get 9, 11, 13 or 15 teams from the first 15 teams (=first 3 bands) that are in the same group twice. So there are _always_ 3 teams in the same group twice.

    Here’s my R code, feel free to change as needed or make it more elegant (1000 runs take 2 seconds).

    makeTeams <- function(dummy)
    {
    # Drawing from bands 4 times
    Teams <- data.frame(A=sort(sample( 1: 5, 4)), # group A
    B=sort(sample( 6:10, 4)), # group B etc
    C=sort(sample(11:15, 4)),
    D=sort(sample(16:20, 4)) )
    # The last team:
    Teams$E <- (1:20)[ ! 1:20 %in% unlist(Teams) ]
    Teams
    } # End Function makeTeams

    makeTeams() # Sorting makes it easier to compare different rounds by humans

    simulation <- function(dummy)
    {
    Round1 <- makeTeams()
    Round2 <- makeTeams()
    # Number of Teams that are in the same group twice:
    nteams <- 0
    for(team in 1:15)
    {
    gr1 <- (which(Round1==team)+3) %/% 4 # column number = Group
    gr2 <- (which(Round2==team)+3) %/% 4 # column number = Group
    if(gr1==gr2) nteams <- nteams+1
    }
    nteams
    } # end function simulation

    hd <- hist(sapply(1:1e3, simulation), breaks=50, col=4)
    hd
    # nteams  counts  counts2ndtime
    #      9     512            505
    #     11     400            387
    #     13      85             98
    #     15       3             10

    • Berry: You’ve interpreted the problem differently than I did. Your function seems to be always assigning four teams from Band 1 to A, then four teams from Band 2 to B, etc. Since there are only 5 teams in each band, there will necessarily be at least 3 from each band that get assigned to the same group both times. As Phil points out above, it’s not clear from the problem statement how the drawing works, but I would assume that the point of the banding is to try to even out the groups, so assigning four of the five best teams to the same group seems strange.

      I also interpreted differently what it means to have three teams in the “same group” both times. I think your function is checking to see which teams are in Group A both times, or Group B both times. But I think Brendon is asking about the probability of three teams being in the same group *as each other* both times.

      Additionally, although it is present in the problem statement, I strongly suspect that the qualification that the teams in question be from Bands 1, 2, and 3 is not really part of what Brendon would really like to know. My guess is that what happened is something like “Hey! The three of us are in the same group again! What are the chances of that?” with the original banding of the teams not actually relevant to the coincidence being remarked upon.

      • My bad. Your interpretations of my code are correct.
        I missed that there are only 4 groups. But then there are 5 teams per group, right? (That’s impossible ;-). Too much soccer recently.)
        Anyway, perfect example of spending too much time on coding vs. careful reading of the question.
        Below’s the code for the case without banding (probably can be rewritten to run faster), and the plot is here:
        https://dl.dropboxusercontent.com/u/4836866/Sonstiges/TeamsDistribution.png

        25% of simulations have a maximum of 3 or more teams twice within the same group. Remember there is no stratification here.
        Now I really need to continue my work…

        simulation <- function(dummy) # simulation without banding
        { # assign teams randomly to groups a,b,c,d
        GroupsRound1 <- sample( rep(as.factor(letters[1:4]), length=20) )
        GroupsRound2 <- sample( rep(as.factor(letters[1:4]), length=20) )
        # Number of Teams that are in the same group twice:
        reps_a <- sum(which(GroupsRound1=="a") %in% which(GroupsRound2=="a"))
        reps_b <- sum(which(GroupsRound1=="b") %in% which(GroupsRound2=="b"))
        reps_c <- sum(which(GroupsRound1=="c") %in% which(GroupsRound2=="c"))
        reps_d <- sum(which(GroupsRound1=="d") %in% which(GroupsRound2=="d"))
        max(reps_a, reps_b, reps_c, reps_d) # is max really the question?
        } # end function simulation

        system.time( hd <- sapply(1:5e4, simulation) ) # ca 27 seconds on my computer
        table(hd)
        #   0    1     2     3   4  5
        #  36 2025  5466  2289 182  2 # 1e4 runs
        # 196 9847 27632 11376 930 19 # 5e4 runs
        # 180 9907 27703 11263 935 12 # ditto

        plot(table(hd)/sum(table(hd)), las=1, lwd=5, lend=1, col="blue", ylab="proportion",
        main="Max number of teams twice in the same group.\nDistribution of 50'000 simulations")

  5. All right, I went ahead and gave it a try. I made the following assumptions:

    – Each group has at least one team from each band. This implies that exactly one group will have an extra Band 1 team, one will have an extra Band 2 team, etc.
    – The problem is asking, “What is the probability that at least three teams will be in the same group as each other both times?” Which group it is (A, B, C, or D) doesn’t matter. I also ignored the part about the teams being from bands 1, 2, and 3, because as I noted above I suspect this isn’t really part of the question.

    My code is not very fast, but I’m getting at least 3 in the same group both times ~77% of the time.

    makeGroups <- function() {
    draw <- c(sample(1:5),sample(6:10),sample(11:15),sample(16:20)) #randomly order within each band
    groups <- matrix(draw,nrow = 4,dimnames = list(c('A','B','C','D'))) #assign 1,5,9,13,17 to A. 1 and 5 are from the first band. Etc.
    groups <- as.data.frame(t(groups)) #makes comparing easier
    }

    same3 <- function(x,y) {
    xlong <- rep(x,each = 4) #duplicate so we can compare each team to each
    overlaps <- mapply(is.element,xlong,y) #check overlaps between each pair of teams. There's got to be a better way to do this.
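    numOverlaps <- colSums(overlaps) #shared teams for each pair of groups (this line was garbled in posting; reconstructed from context)
    any(numOverlaps >= 3) #were there at least 3?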
    }

    simulation2 <- function(dummy) {
    Round1 <- makeGroups()
    Round2 <- makeGroups()
    return(same3(Round1,Round2))
    }

    sum(sapply(1:10000,simulation2))/10000

    • apologies for double post. A line from one of my functions got borked during the copy/paste:

      same3 <- function(x,y) {
      xlong <- rep(x,each = 4) #duplicate so we can compare each team to each
      overlaps <- mapply(is.element,xlong,y) #check overlaps between each pair of teams
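      numOverlaps <- colSums(overlaps) #shared teams for each pair of groups (this line was garbled in posting; reconstructed from context)
      any(numOverlaps >= 3) #were there at least 3?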
      }
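      The band restriction in the original question can be put back with a small change. The sketch below assumes the same draw rule as makeGroups (shuffle each band, then deal the teams out in order, so every group gets at least one team from each band) and reads the question as "some trio with one team each from Bands 1, 2, and 3 shares a group in both tournaments." Both are assumptions, and the helper names are illustrative. No result is quoted because it depends on which draw rule is assumed.

```r
set.seed(2014)

# Assumed draw rule (one of several possible): shuffle within each band, then
# deal the 20 teams to groups 1, 2, 3, 4, 1, 2, ... in that order.
draw_groups <- function() {
  draw <- c(sample(1:5), sample(6:10), sample(11:15), sample(16:20))
  group_of <- integer(20)
  group_of[draw] <- rep(1:4, length.out = 20)  # group_of[team] = group number
  group_of
}

# All 125 candidate trios with one team from each of Bands 1, 2, and 3.
trios <- expand.grid(b1 = 1:5, b2 = 6:10, b3 = 11:15)

# For a given draw, flag the trios whose three teams share a group.
together <- function(g) g[trios$b1] == g[trios$b2] & g[trios$b2] == g[trios$b3]

# Is any trio together in both of two independent draws?
repeat_trio <- function() any(together(draw_groups()) & together(draw_groups()))

estimate <- mean(replicate(10000, repeat_trio()))
print(estimate)
```

      Swapping in a different rule for where the extra team from each band goes only requires changing draw_groups.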

  6. The fundamental question, it seems to me, is whether you want the answer to specific questions (in which case simulation will always work) or whether you want either the answer to generic questions or a general understanding (in which case simulation won’t really work). This comes up in economics all the time, since generally applicable theorems require mathematical assumptions which are almost always unverifiable and implausible in the limit (upper hemi-continuity, for example). The tendency is then just to argue that economics is a set of applied math problems for which simulating the solutions to differential equations trumps making assumptions which give those equations a closed-form solution. (Black-Scholes, anyone?)

    A proto-statistician needs to know how to simulate to get the answer to a question which either has no closed form solution or whose closed-form solution is too tedious to calculate. Whether that forms the basis for statistical understanding, however, is a difficult question… but it’s one to which statisticians who picked up simulation after having learned statistics without a grounding in simulation will universally answer in the negative… it’s just not clear whether they’re right or not.

    • > answer to generic questions or a general understanding

      Certainly agree with not getting an answer in general (e.g. it is always true regardless of granularity) but not sure what general understanding is/means.

      On the other hand, the need to remain finite in simulations does avoid a lot of confusion caused by the use of non-finite sets in statistical theory being misapprehended as implying anything in practice.

      > statisticians who picked up simulation after having learned statistics without a grounding in simulation
      Interesting point.

  7. I wonder if there’s a generic way to filter this sort of spam (see “running”’s comment above, which is camouflage for the link in the e-mail address). Presumably people are doing this to enhance their Google rank … ? Does WordPress have an easy way to set up captcha/confirmation filters for first-time commenters … ?

    • I don’t know about setting these things up on WordPress, but I thought I’d at least delete the comment’s link.

