Looking for data on speed and traffic accidents—and other examples of data that can be fit by nonlinear models

Posted on November 2, 2017 9:17 AM by Andrew

For the chapter in Regression and Other Stories that includes nonlinear regression, I’d like a couple homework problems where the kids have to construct and fit models to real data. So I need some examples. We already have the success of golf putts as a function of distance from the hole, and I’d like some others. One thing that came to mind today, because I happened to see a safety warning poster on the bus reminding people not to drive too fast, is data on speed and traffic accidents.

But I’m interested in other examples too. Just about anything interesting with data on x and y where there’s no simple linear relation, and where log and logit transformations don’t do the trick either. The outcome can be discrete or continuous, either way.

There’s gotta be lots and lots of good examples but for some reason I’m drawing a blank. So I could use your help. I need not just the examples but also the data.

Thanks! Anyone who comes up with a good suggestion will be rewarded with a free Stan sticker.

28 thoughts on “Looking for data on speed and traffic accidents—and other examples of data that can be fit by nonlinear models”

Dave C. on November 2, 2017 9:34 AM at 9:34 am said:

https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars

Some good statistics reported via NHTSA

Reply ↓
- Jonathan on November 2, 2017 10:47 AM at 10:47 am said:
  
  More detailed NHTSA accident data can also be found at https://www.nhtsa.gov/national-automotive-sampling-system-nass/crashworthiness-data-system.
  
  NASS data is also useful because some of the data is stratified and various accidents have different sampling weights. It thus provides good examples of how failing appreciate the importance of sample weights can have disastrous results, a lesson learned the hard way by a formerly “leading” paper in the child restraint research community that has now been retracted, https://injuryprevention.bmj.com/content/23/4/e1, as a result of an inability to replicate the results by Padmanaban et al.
  
  Reply ↓
- Carol F. on November 2, 2017 11:19 AM at 11:19 am said:
  
  FARS is a good dataset, but tricky to analyze because it is a case-only dataset. For traffic crashes, you might try NASS-CDS or NASS-GES, which have speed limit info (both) and speed-at-impact (CDS) and crash severity represented as speed (CDS).
  
  https://www.nhtsa.gov/research-data/national-automotive-sampling-system-nass
  
  Reply ↓
Shosh on November 2, 2017 9:44 AM at 9:44 am said:

A few things that might be fun:

The taxi/limousine commission publishes trip data that economics love analyzing in papers; now conveniently accessible in an R package: https://cran.rstudio.com/web/packages/nyctaxi/vignettes/nyc_taxi.html

Traffic stops (by police) data from the Stanford project on this: https://openpolicing.stanford.edu/data/

Data from toxicity studies by the National Oceanic & Atmospheric Administration: https://www.diver.orr.noaa.gov/deepwater-horizon-nrda-data#

Cardiovascular mortality rates by year/sex/county: https://ghdx.healthdata.org/record/united-states-cardiovascular-disease-mortality-rates-county-1980-2014

Instances of Dengue fever by time/location in Sri Lanka: https://cran.r-project.org/web/packages/mozzie/index.html

Reply ↓
Jonathan Sweeney on November 2, 2017 9:50 AM at 9:50 am said:

Jonathan Auerbach and Rob Trangucci contributed a very nice talk on speed limits and pedestrian safety to Stan Con 2017. Link with data, model, and Rmd below.

Long live reproducible research!

https://github.com/stan-dev/stancon_talks/tree/master/2017/Contributed-Talks/01_auerbach

Reply ↓
Erin Jonaitis on November 2, 2017 9:58 AM at 9:58 am said:

Maybe something involving seasonality, e.g. hours of daylight, or blood levels of Vitamin D?
(For a small amount of extra screwiness you could do sunrise/set times. You could be some junior programmer’s first exposure to the annoyance that is daylight saving time!)

Reply ↓
- Erin Jonaitis on November 2, 2017 10:03 AM at 10:03 am said:
  
  P.S. re “kids” — what is the target audience for this textbook? Sorry if you’ve already mentioned this elsewhere, but I couldn’t find it immediately.
  
  Reply ↓
  - Andrew on November 2, 2017 10:40 AM at 10:40 am said:
    
    Erin:
    
    The book is self-contained, but I expect it will mostly be used as a textbook for courses in regression (traditionally a second course in statistics for undergraduates or a first course for graduate students).
    
    Reply ↓
Jonathan (another one) on November 2, 2017 10:54 AM at 10:54 am said:

The teams, score and date of every NCAA Division 1 and 3 hockey game since 2002. Suitable for fitting Bradley-Terry models, Poisson regression models for goals scored and all manner of home/road, strength and other effects. https://github.com/octonion/hockey/tree/master/ncaa/csv

Reply ↓
T on November 2, 2017 10:54 AM at 10:54 am said:

Varian’s titanic survivor example here: https://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
This example lends itself nicely to trees, but maybe you don’t want to go there. Could use it to demonstrate importance of interactions though.

Reply ↓
Hernan Bruno on November 2, 2017 11:00 AM at 11:00 am said:
During the Rio Olympics, I was watching the Air Rifle 10m. It is obvious that in this and other disciplines, the winner is declared by the highest score, but this is not statistically significant if we were to do a test. Check the wikipedia page

https://en.wikipedia.org/wiki/Shooting_at_the_2016_Summer_Olympics_%E2%80%93_Women%27s_10_metre_air_rifle#final

The csv file you can download here, courtesy of yours truly
https://www.dropbox.com/s/2cy484b0thx2xtu/airrifle2016.csv?dl=0

I decided to check it. But with repeated observations, I needed to run a stan model, as each shooter has her own mean and probably her own standard deviation. Notice that there are two stan codes below. One assumes that the score (0-11) is normally distributed. The second assumes that the distance from the center of the target is distributed Rayleigh, which is the correct distribution for a distance from a center if symmetry in all directions is assumed. Stan supports the Rayleigh distribution, so you can run it. I think it is an interesting example of the Rayleigh distribution. And by the way, these shooters shot 14-20 times (notice that they don’t all shoot the same number of times because of the elimination system). We would need them to shoot 2000 times to start seeing some non-overlapping intervals here.

What do you think? Don’t be harsh with the crappy coding, I was watching sports on my couch when I coded this. Also notice some old-fashioned Stan notation.
```
library(tidyr)
data_airrifle <- read.csv("", sep=";", header=TRUE)

# Data capture a bit strange, change name of winner.
colnames(data_airrifle)[1] <- "Thrasher"

# Make it "long" and delete the NA
data_airrifle <- gather(data_airrifle)
data_airrifle <- data_airrifle[!is.na(data_airrifle[, 2]), ]

#list of competitors
competitor_id = as.factor(data_airrifle$key)

levels(competitor_id) <- unique(data_airrifle$key)

# create id from lastname
id = as.numeric(competitor_id)
score = data_airrifle$value
score2 =11- data_airrifle$value
K= length(competitors)
N = length(id)

da.ar = list(score = score, score2 = score2, id=id, N=N, K=K)

stan.code = "
data {
	int N;
	int K;
	int id[N];
	real score[N];
	}
parameters {
	vector[K] gamma;
	vector[K] sigma;
	real mu_gamma;
  real sigma_gamma;
}
model{
	gamma ~ normal(mu_gamma, sigma_gamma);
	for(i in 1:N){
		score[i] ~ normal(gamma[id[i]], sigma[id[i]]);
	}
}
"
library(rstan)
out = stan(model_name = "Air Rifle Olympics", model_code = stan.code, data=da.ar , iter=1000, chains=2, verbose=FALSE)

# Now using the Rayleigh distribution

stan.code = "
data {
	int N;
	int K;
	int id[N];
	real score2[N];
	}
parameters {
	vector[K] gamma;
	real mu_gamma;
       real sigma_gamma;
}
model{
	gamma ~ normal(mu_gamma, sigma_gamma);
	for(i in 1:N){
		score2[i] ~ rayleigh(gamma[id[i]]);
	}
}
"
library(rstan)
out2 = stan(model_name = "Air Rifle Olympics", model_code = stan.code, data=da.ar , iter=1000, chains=2, verbose=FALSE)
print(out2)
plot(out, pars="gamma")
```
Reply ↓
- Hernan Bruno on November 2, 2017 11:03 AM at 11:03 am said:
  
  I just see this comment posted and I notice the indentation did not come out.
  
  Reply ↓
Terry on November 2, 2017 11:05 AM at 11:05 am said:

Income distribution is an inverted U shape with a mass point at zero. There is also a curious high-frequency oscillation suggesting that income is often rounded off to a multiple of $5,000 (bins are $2,500 wide).

This is a good illustration of the exhortation to actually look at your data rather than blindly fit a model.

Data are at: https://www2.census.gov/programs-surveys/cps/tables/finc-07/2017/finc07.xls. Data is for frequency of household income for incomes from zero to $100,000. Relevant data is in cells B11-B50. Be careful to use only cells B11-B50 because the bin size is different for cells B51-B54. (This change in bin size could be another example of something to watch out for.)

Website with information about data is at: https://www.census.gov/data/tables/time-series/demo/income-poverty/cps-finc/finc-07.html

Reply ↓
- Terry on November 2, 2017 11:15 AM at 11:15 am said:
  
  This data is probably not what you want, though, because it is basically a pdf of income rather than an x-y relationship amenable to regression techniques.
  
  Reply ↓
Julien on November 2, 2017 11:21 AM at 11:21 am said:

What about trying to measure the gravitational acceleration constant (9.81 m/s^2)? Data would consist of measures of the time for an object to drop from different heights, and the model would have to look like Newton’s law of gravitation. A fun thing would be to make the students build the dataset themselves with a clock and a tape!

Reply ↓
Ed Hagen on November 2, 2017 11:57 AM at 11:57 am said:

I had students collect data with their smartphones, which are packed full of sensors, for exactly this sort of problem. There are free apps for both iPhones and Androids that make it easy to collect the raw data, which student can then email to themselves in csv format. Specifically, I had them swing their phones on the charging cable like a pendulum, and then fit the acceleration data using a damped harmonic oscillator model. But there are many other sensors too that would easily generate data with non-linear relationships:

https://en.wikipedia.org/wiki/IPhone#Sensors

Reply ↓
awinter on November 2, 2017 12:18 PM at 12:18 pm said:

New Mexico pueblo populations!

For a small pystan project I have New Mexico pueblo populations from ~1790 – 1950. I added latitude, longitude, and elevation data. I have mean rainfall data which will be added later today.

The csv or ods sheet are here:
https://github.com/bioinfonm/pystan_musings/tree/master/beginners_excercise/new_mexico_pueblo_populations

Reply ↓
Rodney Sparapani on November 2, 2017 12:30 PM at 12:30 pm said:

How about the mcycle data set in the MASS R package?

Description:

A data frame giving a series of measurements of head acceleration
in a simulated motorcycle accident, used to test crash helmets.

Usage:

mcycle

Format:

‘times’ in milliseconds after impact.

‘accel’ in g.

Source:

Silverman, B. W. (1985) Some aspects of the spline smoothing
approach to non-parametric curve fitting. _Journal of the Royal
Statistical Society series B_ *47*, 1-52.

References:

Venables, W. N. and Ripley, B. D. (1999) _Modern Applied
Statistics with S-PLUS._ Third Edition. Springer.

Reply ↓
bjs12 on November 2, 2017 12:35 PM at 12:35 pm said:

For a single Y vs. a single X, how about child pedestrian fatalities vs. day of the year (reminiscent of births vs. day of year, as previously discussed on this blog)?

Data from FARS. Presumably additional X variables available there too.

Seasonality plus a major spike on Halloween :(

Reply ↓
Andrew [not Gelman] on November 2, 2017 2:58 PM at 2:58 pm said:

Temperature’s effect on electricity consumption is an easy non-linear example. It can be approximated with a quadratic, but more advanced methods are more appropriate.

https://www.researchgate.net/figure/226527334_fig2_Fig-3-Non-linear-relationship-between-the-filtered-electricity-load-of-a-distribution

Some data at the link below. There are interactions with day of week and hour, which makes it a bit more interesting.

https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info

Reply ↓
Greg Ashton on November 2, 2017 4:57 PM at 4:57 pm said:

Data on driving speeds in different countries obtained using Google-maps estimated journey time between random zip codes:

https://github.com/ga7g08/GoogleMapsAPI_experiment

This was a project I started as a grad student for fun and never got around to actually analysing – a summary of what I did manage to do can be found here

https://github.com/ga7g08/GoogleMapsAPI_experiment/blob/master/Results.md

Reply ↓
ojm on November 2, 2017 5:27 PM at 5:27 pm said:

Here’s code to generate some (synthetic) nonlinear time series data via Shalizi:

https://bactra.org/reviews/fan-yao/examples.R

A classic real-world data set for nonlinear time series is Nicholson’s ‘sheep blowfly’ data. Googling gave me this link:

https://www.stat.berkeley.edu/~brill/blowfly97I.html

Wood analysed these here: https://www.nature.com/nature/journal/v466/n7310/abs/nature09319.html

I’d be interested in seeing alternative analyses.

Reply ↓
- ojm on November 2, 2017 7:33 PM at 7:33 pm said:
  
  Something that is perhaps more along the lines you’re specifically looking for is:
  
  https://epubs.siam.org/doi/pdf/10.1137/130929230
  
  They drop an object, measure its trajectory and fit the solution of Newton’s equations of motion with constant gravity and an empirical drag model. This solution is nonlinear.
  
  They also use prior and posterior predictive checks etc. Good example of the potential issues of model misspecification too.
  
  Reply ↓
Brian Gawalt on November 4, 2017 10:57 PM at 10:57 pm said:

I’m late to the party but — traffic flows! They’re a great case of “tipping point” nonlinearities.

I found a good set linked to from an intro-to-R Berkeley stats assignment I googled up:

“Data was collected by loop detectors at one particular location of eastbound Interstate
80 in Sacramento, California. There are six columns and 1740 rows in the data set. The
rows correspond to successive five minute intervals from March 14 to 20, 2003, where the
data values in a row report the flow (number of cars) and occupancy (the proportion of
time there was a car over the loop) in each of three lanes on the freeway. Lane 1 is the
leftmost lane, lane 2 is in the center, and lane 3 is the rightmost. The data set can be
found at https://www.stat.berkeley.edu/classes/s133/data/flow-occ.csv The original data are
from the Freeway Performance Measurement System (PEMS) website (Freeway Performance
Measurement System (PEMS) website)”

That link is: https://www.stat.berkeley.edu/classes/s133/data/flow-occ.csv

Inspired by: https://jliszka.github.io/2013/10/01/how-traffic-actually-works.html

Reply ↓
- Dave C. on November 7, 2017 10:10 AM at 10:10 am said:
  
  This one just for fun:
  https://phys.org/news/2017-02-mathematical-line-traffic.html
  
  Reply ↓
Belay Birlie on November 6, 2017 11:51 AM at 11:51 am said:

O2K {nlstools}: Oxygen kinetics during 6-minute walk test data set is a nice free dataset

Reply ↓
ZC on November 6, 2017 6:21 PM at 6:21 pm said:

Health spending (using MEPS, or the RAND insurance experiment data, which I think is public)–people love to use log(y+1) as a kludge, but really should be fitting something that explicitly accounts for masses at zero…

Reply ↓
- Andrew on November 6, 2017 6:36 PM at 6:36 pm said:
  
  Zc:
  
  log(y+1) is horrible. What’s “1” supposed to mean? $1? $1000? etc.
  
  Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Looking for data on speed and traffic accidents—and other examples of data that can be fit by nonlinear models

28 thoughts on “Looking for data on speed and traffic accidents—and other examples of data that can be fit by nonlinear models”

Leave a Reply Cancel reply