Frank Hansen writes:

I [Hansen] signed up for my first marathon race. Everyone asks me my predicted time. The online predictors seem to be geared toward, or based on, elite runners. And anyway they seem a bit limited.

So I decided to do some analysis of my own.

I was going to put together a web page where people could get their race time predictions, and maybe sell some ads for sports GPS watches, but it might also be publishable.

I have 2 requests which obviously I don’t want you to spend more than a few seconds on.

1. I was wondering if you knew of any sports performance researchers working on performance of not just elite athletes, but the full range of runners.

2. Can you suggest a way to do multilevel modeling of this? There are several natural subsets for the data but it’s not obvious what makes sense. I describe the data below.

3. Phil (the runner/co-blogger who posted about weight loss) might be interested.

I collected race results for the Chicago Marathon and 3 shorter races: the Chicago Half Marathon, the Soldier Field 10 Miler, and the Ravenswood 5k. I collected data from 2003 through 2009. Within each year I matched results for finishers between each shorter race and that year’s marathon based on full name and age. I used Python to scrape web pages for the results.
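The within-year matching step could be sketched roughly as follows. This is an illustration only, not Frank's actual script: the record layout (`name`, `age`, `year`, `seconds`), the function name `match_results`, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def match_results(short_rows, marathon_rows):
    """Pair each short-race finisher with the same year's marathon result.

    Rows are dicts with hypothetical keys: name, age, year, seconds.
    The match key is (year, normalized full name, age), as described in the text.
    """
    def key(row):
        # Lower-case the name and collapse repeated whitespace.
        return (row["year"], " ".join(row["name"].lower().split()), row["age"])

    marathon_by_key = defaultdict(list)
    for row in marathon_rows:
        marathon_by_key[key(row)].append(row)

    matched = []
    for row in short_rows:
        candidates = marathon_by_key.get(key(row), [])
        # Keep only unambiguous matches: two runners with the same name and
        # age in the same marathon cannot be told apart by this key.
        if len(candidates) == 1:
            matched.append((row, candidates[0]))
    return matched


# Made-up rows, for illustration only.
short = [
    {"name": "Pat Example", "age": 41, "year": 2008, "seconds": 6300},
    {"name": "Sam Sample", "age": 29, "year": 2008, "seconds": 5400},
]
marathon = [
    {"name": "pat  example", "age": 41, "year": 2008, "seconds": 13800},
    {"name": "Sam Sample", "age": 29, "year": 2007, "seconds": 11700},
]
pairs = match_results(short, marathon)
print(len(pairs))
```

One caveat with an exact-age key: a runner who has a birthday between the short race and the marathon will not match, so a real version might also accept an age difference of one year.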

Of course in a particular year a given marathoner may have run more than one of the shorter races. At this point I am ignoring that, treating them as independent records even though they have the same marathon finish data.

I would think that knowing several shorter races to predict a marathon time would help, but demanding several matches really cuts down the data.

I also collected weather data, so I know the temperature, humidity, wind speed near 8 am for each race (in Chicago).

I end up with around 13,000 records. A record contains a marathon time, a short race time, the type of short race, the temperature, humidity and wind speed difference between the short race and the marathon. I also know the age and sex of the marathon finisher.

Taking logs helps the R-squared, but the untransformed model below is easier to interpret.

int.form <- "mar.pace ~ short.pace + short.race.type + age + sex + temp.dif + humid.dif + wind.dif - 1"

Call:
lm(formula = int.form, data = full.dat)

Residuals:
     Min       1Q   Median       3Q      Max
-510.061  -36.867   -5.632   34.116  510.552

Coefficients:
                      Estimate  Std. Error  t value  Pr(>|t|)
short.pace            0.999389    0.006703  149.087   < 2e-16 ***
short.race.typehalf  82.630974    4.242505   19.477   < 2e-16 ***
short.race.typerw   106.133301    4.347218   24.414   < 2e-16 ***
short.race.typesf10  89.458519    4.209498   21.252   < 2e-16 ***
age                   0.321860    0.064960    4.955  7.33e-07 ***
sexM                  8.444752    1.286381    6.565  5.41e-11 ***
temp.dif              1.516766    0.051981   29.179   < 2e-16 ***
humid.dif             0.128886    0.041519    3.104   0.00191 **
wind.dif             -1.534700    0.150816  -10.176   < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 65.79 on 13004 degrees of freedom
Multiple R-squared: 0.9895, Adjusted R-squared: 0.9895
F-statistic: 1.368e+05 on 9 and 13004 DF, p-value: < 2.2e-16

In the regression results, the marathon and short race “pace” variables are in seconds per mile, so short.race.typehalf equal to 82 means roughly: add 82 seconds to your half marathon mile pace to get your marathon mile pace, and so on for the independent variables. Temperature is in Fahrenheit, humidity in %, wind speed in mph. Marathon day in 2009 was really cold; predicting pace for 2009 based on a fit of the other years has larger errors than predicting 2008 using a fit for the non-2008 data.
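To make the interpretation concrete, here is a small sketch that plugs a runner into the fitted equation. The coefficients are copied from the output above; everything else is an assumption made for illustration: the function name, the reading of `rw` as the Ravenswood 5k and `sf10` as the Soldier Field 10 Miler, and in particular the sign convention of the `.dif` variables, which the letter does not state (here they are taken as marathon-day value minus short-race-day value).

```python
# Coefficients copied from the regression output above (units: sec/mile).
COEF = {
    "short.pace": 0.999389,
    "half": 82.630974,   # Chicago Half Marathon
    "rw": 106.133301,    # assumed to be the Ravenswood 5k
    "sf10": 89.458519,   # assumed to be the Soldier Field 10 Miler
    "age": 0.321860,
    "sexM": 8.444752,
    "temp.dif": 1.516766,
    "humid.dif": 0.128886,
    "wind.dif": -1.534700,
}
MARATHON_MILES = 26.21875  # 26 miles 385 yards


def predict_marathon_pace(short_pace, race, age, male,
                          temp_dif=0.0, humid_dif=0.0, wind_dif=0.0):
    """Predicted marathon pace in seconds per mile.

    race is "half", "rw" or "sf10". The *_dif arguments are ASSUMED to be
    marathon-day minus short-race-day values; the letter does not say.
    """
    return (COEF["short.pace"] * short_pace
            + COEF[race]
            + COEF["age"] * age
            + COEF["sexM"] * (1 if male else 0)
            + COEF["temp.dif"] * temp_dif
            + COEF["humid.dif"] * humid_dif
            + COEF["wind.dif"] * wind_dif)


# Hypothetical runner: 40-year-old man, 8:00/mile half marathon, same weather.
pace = predict_marathon_pace(480.0, "half", age=40, male=True)
finish_hours = pace * MARATHON_MILES / 3600
print(f"{pace:.1f} sec/mile, about {finish_hours:.2f} hours")
```

Because the model has no intercept, the age and sex terms also act as offsets: for this hypothetical runner they add roughly 21 seconds per mile on top of the 83-second half marathon term.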

My main piece of advice is to never ever ever ever ever use “summary” to display regression outputs in R. Only use “display” or “coefplot”. Unless, that is, you care that your standard error is “4.242505” or that your p-value is “< 2e-16” or that your F-statistic is “1.368e+05”. I don’t. But, then again, I’m a Bayesian.

I actually had this as a pet project for a long time but could never get started. I collected nearly a year's complete training data, plus my second marathon, the 2008 Marine Corps Marathon (2:58), on my Garmin 305. That Garmin is long gone, but I should have the data hidden on one of my retired laptops. I do have to point out, though, that for competitive (not elite) athletes who have done their long runs without cheating, track work is a good predictor. In any case, long story short: if you are really interested, I can find a way to get you my personal data without any obligations. I am sure there are many other places where higher-quality data are available in larger quantities.

The challenge in predicting marathon times for non-elite runners is that there are multiple types: in addition to Noakes's description of Elite, Competitive, near-Competitive and Citizen runners, there is the whole approach to the marathon: the banker, the negative splitter, the giver-upper, etc. Genes have a lot to do with it as well. I broke 3 hours with a peak of perhaps 55 miles a week, whereas for the majority of people this would be rather tough to achieve on such meager mileage. Let me know if you would like a co-author :)

Again no obligations on sharing my personal data.

Best,

I am weighing in in a fashion that is largely beside the point (I am a non-statistician but enthusiastic marathoner!)…

Yes, the marathon pace calculators (McMillan is the basis for much of this stuff out there – http://www.mcmillanrunning.com/mcmillanrunningcal…) will almost certainly be far off for novices and non-elites – they're based on higher mileage than most of us will do, and they are increasingly off at longer distances. I would add at least 10 or 15 minutes to a prediction off even a half-marathon time, and obviously a half-marathon is likely to be a better predictor than a 5 or 10K. (I would think I am fairly representative in saying that my fastest half-marathon is 1:54 but my fastest marathon is 4:16 – it wasn't the same year, fitness may have been different, but McMillan would predict a four-hour marathon off that half-marathon time, whereas my personal observation of friends with a similar running background is that the just-sub-four-hour marathoners are more like 1:47-48 half-marathoners. YMMV, particularly if you have a very strong aerobic background from another sport (cycling, cross-country skiing, etc.).)

A couple other thoughts on data. You would get a better set of info off the NYRR database – results are separate for the marathon and other races, but there are a lot of years of data up there, and in particular many marathoners do the Grete's Great Gallop half-marathon at the beginning of October, about 6 weeks before the marathon, so while you can't be sure they're all 'racing' it, it would give you same-year data on some useful number of runners. The November NYC marathon is also less prone to weather conditions that produce wildly variable times.

(One other thing – heat and humidity are crucial, but I would be surprised if wind speed made much calculable difference to non-elite marathoners' times.)

NB I also note that a lot of marathoners I know will do a short race (NYRR requires nine races for a marathon slot for the next year, so there's an incentive to race a lot) at the end of a long run – i.e. on a day when they're due for 15 miles, run ten first easy and then 'race' the 5-miler in the park – the times you'd get on this would obviously be pretty different than what you'd get if you raced it fresh, as per McMillan.

My fellow runners who do run marathons usually add around nine minutes to 2xhalf-marathon time. E.g., my friend Olivier who ran the Paris marathon in 2:29:nn last Spring did the Berlin half-marathon in about 1:12, so this is a bit conservative. But another friend who runs my pace of 1:20 for the half-marathon does about 2:49 in the marathon. I thus presume there is a dependence on the pace of the runner and this 82 second per mile seems too static for my taste…

Might be interesting to include as a covariate the time (in days?) between the short race and the marathon.

I find it interesting that the median residual is so far from zero.

I'm also intrigued by the residual standard error of over a full minute (per mile!). Yikes!
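To put that per-mile figure on the scale of a finish time, it can be multiplied by the marathon distance, since pace here is an average over the whole race. A quick check, using only the residual standard error from the output above:

```python
MARATHON_MILES = 26.21875   # 26 miles 385 yards
RESIDUAL_SE = 65.79         # sec/mile, from the regression output above

# One residual standard error, expressed as minutes of marathon finish time.
spread_minutes = RESIDUAL_SE * MARATHON_MILES / 60
print(round(spread_minutes, 1))
```

That works out to a little under half an hour of finish time for a single standard error.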

Ray Fair at Yale has some stuff on predicting running times of non-elite runners.

Thoughts:

1) I would expect logarithms to work better; the amount that one slows down at longer distances should be proportional to one's pace. Alternatively, include interaction terms.

2) Ray Fair, the econometrician at Yale, is an aging marathoner and has played a little bit with how times change with age. I don't know whether he's played with distance.

3) I would expect some very fat tails. If a person runs 7 half marathons under similar conditions and with similar conditioning, I would expect five or six of them to be clustered near that person's ability and one or two to be much worse, when the runner went out too fast, got sick during the race, or something else. In this case, that can show up as either a much-worse-than-median marathon time or a much-worse-than-median shorter-race time. I'd be tempted to go through after the first pass and throw out points that are more than 2 standard errors or so from the regression line, probably doing this recursively (reincluding previous points that fall within the new parameters), and hoping for convergence, at least if I'm trying to use OLS routines.
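The recursive trimming idea in point 3 could look something like the sketch below. It uses a one-predictor least-squares line to stay short, measures distance in residual standard deviations of the currently kept points, and lets previously dropped points back in if they fall inside the new band. The function names, the cutoff of 2, and the data are all made up for illustration.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def trimmed_fit(xs, ys, k=2.0, max_iter=50):
    """Refit repeatedly, keeping only points within k residual SDs of the line.

    Every point is re-tested on each pass, so a point dropped earlier can be
    re-included. Returns (a, b, keep) once the kept set stops changing.
    """
    keep = [True] * len(xs)
    for _ in range(max_iter):
        kx = [x for x, m in zip(xs, keep) if m]
        ky = [y for y, m in zip(ys, keep) if m]
        if len(kx) < 3:
            raise ValueError("too few points left to fit a line")
        a, b = fit_line(kx, ky)
        rss = sum((y - (a + b * x)) ** 2 for x, y in zip(kx, ky))
        s = (rss / (len(kx) - 2)) ** 0.5  # residual SD of the kept points
        new_keep = [abs(y - (a + b * x)) <= k * s for x, y in zip(xs, ys)]
        if new_keep == keep:
            return a, b, keep
        keep = new_keep
    raise RuntimeError("trimming did not converge")


# Made-up paces (sec/mile): marathon = short + 85 + small noise,
# except one runner who blew up badly in the marathon.
xs = [420, 450, 480, 510, 540, 570, 600, 630, 660, 690]
noise = [3, -2, 1, -3, 2, -1, 3, -2, 1, -2]
ys = [x + 85 + e for x, e in zip(xs, noise)]
ys[4] += 200
a, b, keep = trimmed_fit(xs, ys)
print(keep.count(False), round(b, 3))
```

Note that a fixed cutoff of 2 will tend to discard some perfectly ordinary points in a large data set, so convergence is, as the comment says, something to hope for rather than count on.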

One of the rules of thumb I've seen is that pace goes as distance to the .06 power, though based on much more slipshod data collection than your correspondent has done I've started to use .07 instead. Another suggestion I've seen is that, if you're in marathon shape and can run 10 half-mile intervals at a given time, then you can run a marathon in 60 times that; I take this one with more salt, but mention it because it was explicitly asserted that this has worked for a range of abilities (runners from 3 to 6 hours).
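Both rules of thumb are easy to write down. If pace goes as distance to the power k, then time goes as distance to the power 1 + k, which is what the first function below uses. The function names are hypothetical and the example inputs are made up.

```python
def predict_time(known_seconds, known_distance, target_distance, exponent=0.06):
    """Pace ~ distance**exponent, so time ~ distance**(1 + exponent)."""
    return known_seconds * (target_distance / known_distance) ** (1.0 + exponent)


def interval_rule(half_mile_interval_seconds):
    """Marathon time as 60 times the time for each of 10 half-mile intervals."""
    return 60 * half_mile_interval_seconds


def hms(seconds):
    """Format a duration in seconds as h:mm:ss."""
    total = round(seconds)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


HALF, FULL = 13.109375, 26.21875  # miles

# Example: a 1:20:00 half marathon, with both exponents mentioned above.
for k in (0.06, 0.07):
    print(k, hms(predict_time(80 * 60, HALF, FULL, exponent=k)))

# Example: half-mile intervals in 3:30 each.
print(hms(interval_rule(3 * 60 + 30)))
```

Under the power rule the slowdown is proportional to the runner's pace, unlike the fixed number of seconds per mile in the regression above.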

I was also working off the idea that 1 degree Fahrenheit slows me down by one minute for the marathon; your fit suggests 40 seconds. My "one minute" is based on even less careful study than the other numbers I've cited, but is based on my own guess about how I run, and I tend to think that heat affects me more than other people.

I'm running my second marathon in November, if all goes according to plan.

Jack Daniels (the coach) developed a pretty extensive methodology for doing these sorts of predictions that he published in his book Daniels' Running Formula. Greg McMillan has an on-line running calculator that does a related calculation.

Your data set looks really valuable because it could help check these approaches. It looks like one difference in your approach is that Daniels fit a multiplicative model, so that the increase in pace in seconds depends upon the runner's speed. It might be interesting to fit a log-log version of your model and see how close your results are to Daniels' or McMillan's.
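A bare-bones version of such a log-log fit, with a single predictor, might look like this. It is a sketch of the general idea only, not a reproduction of Daniels' or McMillan's formulas; the function name is hypothetical and the data are synthetic, generated from known parameters so the fit can be checked.

```python
import math
from statistics import fmean


def fit_loglog(short_paces, marathon_paces):
    """Least-squares fit of log(marathon) = log(c) + b * log(short).

    Returns (c, b) for the multiplicative model marathon = c * short**b.
    """
    lx = [math.log(x) for x in short_paces]
    ly = [math.log(y) for y in marathon_paces]
    mx, my = fmean(lx), fmean(ly)
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    return math.exp(my - b * mx), b


# Synthetic paces (sec/mile) generated from c = 1.15, b = 1.02; not real data.
short = [360.0, 420.0, 480.0, 540.0, 600.0, 660.0]
marathon = [1.15 * x ** 1.02 for x in short]
c, b = fit_loglog(short, marathon)
print(round(c, 3), round(b, 3))
```

With b equal to 1, the model says marathon pace is a fixed percentage slower than short-race pace whatever the runner's speed, rather than a fixed number of seconds slower.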

There is a multilevel modeling question in Daniels' approach. He allowed for different intercepts per individual but assumed a constant slope on a graph of running cost versus duration of race. It would be interesting to re-analyze his data to allow for separate slopes. Separate slopes would imply that some runners are more suited to different race lengths.

Slightly off topic, perhaps, but this reminded me of a post I'd been meaning to publish plotting the running pace of the world record holders at various distances. It isn't "non-elite" runners, but it's interesting, and I wonder if similar patterns hold for non elite runners.

http://www.statisticalskier.com/2010/08/running-w…

Jme: Your graphs are cool but confusingly labeled. Speed is not minutes-per-mile, it's miles-per-minute (or, more conventionally, miles-per-hour).

Andrew: Picky, picky! ;) Lazy of me, I suppose, calling it speed, rather than pace. Runners will often mean "pace", i.e. minutes/mile when they talk about how "fast" they're going.

Andrew, most runners think in terms of pace rather than speed, especially when comparing performance at different distances. In fact, I have a GPS watch/computer/heart rate gizmo that came with default displays for different sports; the default for cycling shows miles/hour or km/hour, but the default for running shows minutes/mile or minutes/km. So I like Jme's y-axes, and if he showed speed I'd probably suggest that he show pace instead. Of course one could use dual axes to show both on the same plot, and perhaps it would make sense to reverse the y-axis so slower pace is towards the bottom of the plot.
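For anyone switching between the two conventions, the conversion is just a reciprocal. A tiny sketch, with hypothetical function names and pace in seconds per mile to match the regression above:

```python
def pace_to_mph(sec_per_mile):
    """Convert pace (seconds per mile) to speed (miles per hour)."""
    return 3600.0 / sec_per_mile


def mph_to_pace(mph):
    """Convert speed (miles per hour) to pace (seconds per mile)."""
    return 3600.0 / mph


# A 7:00 mile and a 10:00 mile, expressed as speeds.
print(round(pace_to_mph(420), 2), round(pace_to_mph(600), 2))
```

Because the relationship is a reciprocal, equal steps in pace are not equal steps in speed, so a dual-axis plot would need unevenly spaced tick marks on one of the two axes.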

Frank, I am indeed very interested in this subject, but I don't have much to add. Many of the commenters obviously know a lot more about this than I do. I think your approach is very good, and I applaud you for having just gone ahead and tried it before asking for advice and having lots of people tell you why such-and-such an approach is unlikely to work.

Although I don't have much to add, I do have a few things to add:

1. It would be nice to see a plot of actual versus predicted pace (or time), perhaps for a random subset of the data if there are too many points to show on a single plot.

2. The way you include temperature and humidity makes sense to me, but other things would also make sense. What you are trying to capture is whether the conditions for each of the races were good for running, and that might be pretty nonlinear with humidity and temperature. You might not be able to get that with your approach. Maybe you could classify each race's conditions as poor-good-great (based on a combination of humidity and temperature that you make up from conventional wisdom, or based on average pace or winning pace that year), and include this in the model using indicator variables.

3. As another commenter pointed out, in spite of a high R-squared you have a dauntingly large residual of about a minute per mile, which is probably not very useful if you're trying to decide whether to aim for 3:30 versus 4:00 or whatever. It would be nice to know if there is even a hope of doing better than that with your available information (through a different functional form for the model, for instance) or whether there is just a lot of interpersonal variation in endurance that your explanatory variables can't capture. With 13,000 records you might have enough data to shed some light on that. For instance, you could look at the subset of people who competed in both the 10-mile and the marathon in consecutive years, and ran roughly the same pace in the 10-mile in both years. Did they also run roughly the same pace in the marathon in both years?

4. It's conventional wisdom that the longer the race, the more experience helps. Like you, other first-timers don't know how to pace themselves, or how to adjust during the race based on how they're feeling. One might expect experienced runners to run closer to their optimal time than newbies, in which case an experienced runner will run the marathon faster than a newbie with the same 10-mile time. You don't have the data to let you perfectly separate newbies from experienced marathoners, but at least you can tell who has run the _Chicago_ marathon before. If someone has run it more than once, you know that for all but the first time they were definitely experienced…you don't know about the first time, one way or the other, though. At any rate, you might try running separate models for people who are and are not known to have run a marathon before. Of course, even if the results are different, it's hard to know what to do with them: should you pace yourself like the typical experienced marathoner who runs the same 10-mile time as you, on the theory that that should be close to the optimal pace? Or should you assume that those people probably run more total mileage than you and will thus have better endurance? I'm not sure…but at least I wouldn't recommend trying to run _faster_ than them, so you'd learn something, maybe.
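The poor-good-great idea in item 2 amounts to replacing the continuous weather terms with a small set of indicator variables. A sketch of what that coding could look like is below. The temperature and humidity cutoffs are invented purely for illustration and are not taken from the post or from any running guideline, and the function and variable names are hypothetical.

```python
LEVELS = ["poor", "good", "great"]


def classify_conditions(temp_f, humidity_pct):
    """Label race-day conditions. The cutoffs are HYPOTHETICAL placeholders."""
    if temp_f <= 55 and humidity_pct <= 70:
        return "great"
    if temp_f <= 68:
        return "good"
    return "poor"


def indicators(label, baseline="good"):
    """0/1 indicator variables for a condition label, omitting the baseline."""
    return {f"cond.{lvl}": int(label == lvl) for lvl in LEVELS if lvl != baseline}


# Example: a hot marathon day and a cool one.
for temp_f, humidity_pct in [(80, 60), (45, 50)]:
    label = classify_conditions(temp_f, humidity_pct)
    print(label, indicators(label))
```

One level is left out as the baseline so that the indicators are not redundant with each other; in a model without an intercept, like the one in the post, the race-type terms already play the intercept's role.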

I hope you'll keep us informed about both the success (or not) of your modeling, and how well you turn out to be able to predict your performance.

Pace is fine. I just found the graph confusing. It showed average speed increasing with distance, but then distance was on a log scale, the line was curving, so . . . it was just hard to follow. I'd suggest playing around with some transformations to try to get the curve closer to linearity.

Andrew: I understood your confusion as soon as you pointed it out. Pretty classic case of making a graph w/out thinking enough about how people who aren't around running all the time would understand it.

I'd be curious, though, what sort of transformations you think would straighten those curves. I played around with a few and the linear with a "hinge" was the most sensible result I saw.

My (limited) understanding of physiology suggests that the linear relationship with a "hinge" roughly fits what would be expected and I actually might consider simply modeling the two pieces separately, since I suspect the human physiology at play is just different in the two regions.

But this is getting a bit far afield from Frank's question…

I'll just reiterate that the time gap between the short race and the marathon might be relevant as a covariate. Also, we should remember that one assumption we're making when we try to model athletic performances like these is that people are always putting out a max effort every time they race. That may or may not be the case. As Phil pointed out, less experienced runners may not know how to pace themselves as well as they might. Also, sometimes people do shorter races leading up to a marathon as "training" and might not be taking them as seriously as they might.

One of the rare times I disagree with Andrew about graphical approaches. I'd much rather see a nonlinear curve on an untransformed scale or a log scale than a linear curve on a scale that has a more complicated transformation. I can get used to more complicated transformations if I see them enough — an x-axis of quantiles of the standard normal, for instance — but the idea of searching for an unfamiliar transformed axis that makes the curve linear, ugh, no way would I be able to interpret that without a huge amount of effort.

I do think the plots could be improved in some ways — almost any plot can — but I'm happy with the axis scales. It might also be worth looking at a linear x-axis scaling…for all the times I've looked at log scales, I can still sometimes be surprised or get the wrong idea about the relationship.

For some reason, the observation that the pace starts to slow dramatically for efforts over 100 minutes reminds me of an interview I saw several years ago with a former elite marathoner — I want to say Frank Shorter, but perhaps that's just because he's one of the few I can name. Anyway, the guy used to run world-class times, maybe even set U.S. or world records a time or two, and then retired and didn't run a marathon for a decade. But a friend decided to run one, and talked Frank (or whoever) into running it with him. So Frank did a modicum of training and ran the marathon in…maybe it was right around 3 hours, I don't remember exactly. Anyway, in the interview he said it was a lot harder to run a marathon in 3 hours than it had been to run 2:20, because now he had to suffer just as much per minute, but for an extra 40 minutes.

I am actually working on something similar for my Stats MBA class. I have been running for a while and am also training for an upcoming marathon. I ran regressions of pace vs. several variables such as distance, temperature, humidity, heat index, training over time, and days of rest. I excluded runs where I ran purposely at a certain pace. I found that the only variable which had any significance was time. Even with training over the summer and running in warmer temps, my pace still improved. In other words, what I found was that the more I ran, the faster I got, the better I could predict and control pace, and the less of an impact external factors (outside of injury) had.

Thanks everyone. Just got back on the grid after a week off, so will digest this in the next few days.

Just a quick comment —

@ Jenny: each record's long and short race were in the same year, and all the short races (Ravenswood 5k, Soldier Field 10, and the Chicago 1/2) take place before the marathon. The other big Chicago race is the Shamrock Shuffle (8k), but it is really early and a quick look at it suggested there is a "not trained over the winter" effect in the Shamrock.

re time between races: should be pretty much covered by the race type variable since the various races are at about the same time each year.

Here is a plot of marathon pace vs. half marathon pace. Newbies are blue, experienced marathoners are in red. The line is the prediction based on McMillan.

Here is how McMillan describes his estimate

The plot does show less scatter for the faster runners, say under 7 min/mile (420 sec/mile) pace. For the rest of us, attaining times per McMillan looks like it will be a challenge.

I think wind, even a light breeze, helps a lot to cool you off. So if heat and humidity are crucial, wind could be too.