This post is a rerun. I was listening to This American Life on my bike today and heard Ira say:
There’s this study done by the Pew Research Center and Smithsonian Magazine . . . they called up one thousand and one Americans. I do not understand why it is a thousand and one rather than just a thousand. Maybe a thousand and one just seemed sexier or something. . . .
And my first thought was, Hey, I know why they surveyed 1001 people and not exactly 1000! And my second thought was, Hey, I think this came up on the blog the first time that episode aired. And indeed, here it is:
The survey may well have aimed for 1000 people, but you can’t know ahead of time exactly how many people will respond. They call people, leave messages, call back, call back again, etc. The exact number of people who end up in the survey is a random variable.
If Ira can do repeats, so can we!
Maybe one of you works for Chicago Public Radio and can let them know why the survey didn’t have exactly 1000 respondents? Ira has given me so much information and entertainment over the years; it would be good to give back just a little.
You can contact them at
[email protected]
Now riddle me this: why do people take 999 bootstrap samples and not 1000 (or 1001, for that matter)? I have a half-remembered, half-made-up memory of the answer that goes something like “…and that’s why you use 999 randomizations and not 1000”, but I can’t seem to remember the first half.
I found this dead end from Wilcox in a comment on stackexchange:
http://books.google.com/books?id=uUNGzhdxk0kC&pg=PA155&lpg=PA155&dq=%22599+is+recommended+for+general+use%22&source=bl&ots=ZSpVygnQkK&sig=Z47Do1Xk_2_FtnBxIYm2buSQnkw&hl=en&sa=X&ei=IZf5UsDWDsifqQG3hYCYBQ&ved=0CDcQ6AEwAg#v=onepage&q=%22599%20is%20recommended%20for%20general%20use%22&f=false
From what I can tell, it’s for when you estimate confidence intervals. If for a scalar parameter you order your bootstrap estimates from 1 to B, then conservative estimators for the 5th and 95th percentiles are given by the (B+1)*0.05th and the (B+1)*0.95th estimates. Since you want those to be integers, and you want them to err on the side of giving a wider CI, you should set e.g. B=999 instead of B=1000.
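To make the arithmetic concrete, here is a minimal Python sketch of that rule, assuming a percentile-style interval taken straight from the ordered bootstrap estimates. The function names (`order_stat_ranks`, `percentile_endpoints`) and the toy data are made up for illustration; they are not from Wilcox or from Davison & Hinkley. Exact fractions are used so that the "is the rank an integer?" check is not muddied by floating-point rounding.

```python
from fractions import Fraction
import random


def order_stat_ranks(b, lower=Fraction(5, 100), upper=Fraction(95, 100)):
    """1-based ranks (B+1)*lower and (B+1)*upper among B ordered estimates."""
    return (b + 1) * lower, (b + 1) * upper


def percentile_endpoints(estimates, lower=Fraction(5, 100), upper=Fraction(95, 100)):
    """Pick the ordered estimates at ranks (B+1)*lower and (B+1)*upper.

    Raises ValueError when either rank is not a whole number, which is
    exactly the situation that choosing B = 999 avoids.
    """
    lo_rank, hi_rank = order_stat_ranks(len(estimates), lower, upper)
    if lo_rank.denominator != 1 or hi_rank.denominator != 1:
        raise ValueError("(B+1)*p is not an integer; pick a different B")
    ordered = sorted(estimates)
    # Ranks are 1-based, Python lists are 0-based.
    return ordered[int(lo_rank) - 1], ordered[int(hi_rank) - 1]


# With B = 999 the ranks are whole numbers: the 50th and 950th estimates.
print(order_stat_ranks(999))
# With B = 1000 they are 50.05 and 950.95, so some interpolation or
# rounding convention would be needed.
print(order_stat_ranks(1000))

# Toy illustration (hypothetical data): bootstrap the mean of 50 normal draws.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(50)]
boot_means = []
for _ in range(999):
    resample = random.choices(data, k=len(data))  # sample with replacement
    boot_means.append(sum(resample) / len(resample))
print(percentile_endpoints(boot_means))
```

With B = 999 the endpoints are simply the 50th and 950th ordered estimates, with no interpolation needed; with B = 1000 the function refuses, because the ranks land between two estimates.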
They discuss this a little in Davison & Hinkley, Bootstrap Methods and their Application, chapter 5.
Maybe something to do with ties? Or some funkiness with even vs odd number of samples?
Gray’s explanation makes sense. It’s silly, though, because when you get right down to it there’s no real reason to want an exact 95% interval rather than 94% or 96%. I mean, sure, pick whatever N you want, but in this case I’d say they’re overthinking the matter, which I don’t recommend, as it can distract students from more important points.
Hearing this blog get a shout-out on This American Life would be the most niche little joy I can imagine.
There could be other explanations. Maybe it is like this story:
The height of Mount Everest was calculated to be exactly 29,000 ft (8,839.2 m) high, but was publicly declared to be 29,002 ft (8,839.8 m) in order to avoid the impression that an exact height of 29,000 feet (8,839.2 m) was nothing more than a rounded estimate.
https://en.wikipedia.org/wiki/Andrew_Scott_Waugh
Roger:
I didn’t know that Everest story. It’s great. For the surveys, though, I’m pretty sure my story is correct. Survey interviewing is going on in parallel, so they can’t really pick the exact sample size ahead of time.
“Ira has given me so much information and entertainment over the years; ”
I’ll never forget the episode on the Fraternal Order of Real Bearded Santas. Hilarious.