For the chapter in Regression and Other Stories that includes nonlinear regression, I’d like a couple homework problems where the kids have to construct and fit models to real data. So I need some examples. We already have the success of golf putts as a function of distance from the hole, and I’d like some others. One thing that came to mind today, because I happened to see a safety warning poster on the bus reminding people not to drive too fast, is data on speed and traffic accidents.
But I’m interested in other examples too. Just about anything interesting with data on x and y where there’s no simple linear relation, and where log and logit transformations don’t do the trick either. The outcome can be discrete or continuous, either way.
There’s gotta be lots and lots of good examples but for some reason I’m drawing a blank. So I could use your help. I need not just the examples but also the data.
Thanks! Anyone who comes up with a good suggestion will be rewarded with a free Stan sticker.
https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars
Some good statistics reported via NHTSA
More detailed NHTSA accident data can also be found at https://www.nhtsa.gov/national-automotive-sampling-system-nass/crashworthiness-data-system.
NASS data is also useful because some of the data is stratified and various accidents have different sampling weights. It thus provides good examples of how failing appreciate the importance of sample weights can have disastrous results, a lesson learned the hard way by a formerly “leading” paper in the child restraint research community that has now been retracted, http://injuryprevention.bmj.com/content/23/4/e1, as a result of an inability to replicate the results by Padmanaban et al.
FARS is a good dataset, but tricky to analyze because it is a case-only dataset. For traffic crashes, you might try NASS-CDS or NASS-GES, which have speed limit info (both) and speed-at-impact (CDS) and crash severity represented as speed (CDS).
https://www.nhtsa.gov/research-data/national-automotive-sampling-system-nass
A few things that might be fun:
The taxi/limousine commission publishes trip data that economics love analyzing in papers; now conveniently accessible in an R package: https://cran.rstudio.com/web/packages/nyctaxi/vignettes/nyc_taxi.html
Traffic stops (by police) data from the Stanford project on this: https://openpolicing.stanford.edu/data/
Data from toxicity studies by the National Oceanic & Atmospheric Administration: https://www.diver.orr.noaa.gov/deepwater-horizon-nrda-data#
Cardiovascular mortality rates by year/sex/county: http://ghdx.healthdata.org/record/united-states-cardiovascular-disease-mortality-rates-county-1980-2014
Instances of Dengue fever by time/location in Sri Lanka: https://cran.r-project.org/web/packages/mozzie/index.html
Jonathan Auerbach and Rob Trangucci contributed a very nice talk on speed limits and pedestrian safety to Stan Con 2017. Link with data, model, and Rmd below.
Long live reproducible research!
https://github.com/stan-dev/stancon_talks/tree/master/2017/Contributed-Talks/01_auerbach
Maybe something involving seasonality, e.g. hours of daylight, or blood levels of Vitamin D?
(For a small amount of extra screwiness you could do sunrise/set times. You could be some junior programmer’s first exposure to the annoyance that is daylight saving time!)
P.S. re “kids” — what is the target audience for this textbook? Sorry if you’ve already mentioned this elsewhere, but I couldn’t find it immediately.
Erin:
The book is self-contained, but I expect it will mostly be used as a textbook for courses in regression (traditionally a second course in statistics for undergraduates or a first course for graduate students).
The teams, score and date of every NCAA Division 1 and 3 hockey game since 2002. Suitable for fitting Bradley-Terry models, Poisson regression models for goals scored and all manner of home/road, strength and other effects. https://github.com/octonion/hockey/tree/master/ncaa/csv
Varian’s titanic survivor example here: http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
This example lends itself nicely to trees, but maybe you don’t want to go there. Could use it to demonstrate importance of interactions though.
During the Rio Olympics, I was watching the Air Rifle 10m. It is obvious that in this and other disciplines, the winner is declared by the highest score, but this is not statistically significant if we were to do a test. Check the wikipedia page
https://en.wikipedia.org/wiki/Shooting_at_the_2016_Summer_Olympics_%E2%80%93_Women%27s_10_metre_air_rifle#final
The csv file you can download here, courtesy of yours truly
https://www.dropbox.com/s/2cy484b0thx2xtu/airrifle2016.csv?dl=0
I decided to check it. But with repeated observations, I needed to run a stan model, as each shooter has her own mean and probably her own standard deviation. Notice that there are two stan codes below. One assumes that the score (0-11) is normally distributed. The second assumes that the distance from the center of the target is distributed Rayleigh, which is the correct distribution for a distance from a center if symmetry in all directions is assumed. Stan supports the Rayleigh distribution, so you can run it. I think it is an interesting example of the Rayleigh distribution. And by the way, these shooters shot 14-20 times (notice that they don’t all shoot the same number of times because of the elimination system). We would need them to shoot 2000 times to start seeing some non-overlapping intervals here.
What do you think? Don’t be harsh with the crappy coding, I was watching sports on my couch when I coded this. Also notice some old-fashioned Stan notation.
I just see this comment posted and I notice the indentation did not come out.
Income distribution is an inverted U shape with a mass point at zero. There is also a curious high-frequency oscillation suggesting that income is often rounded off to a multiple of $5,000 (bins are $2,500 wide).
This is a good illustration of the exhortation to actually look at your data rather than blindly fit a model.
Data are at: https://www2.census.gov/programs-surveys/cps/tables/finc-07/2017/finc07.xls. Data is for frequency of household income for incomes from zero to $100,000. Relevant data is in cells B11-B50. Be careful to use only cells B11-B50 because the bin size is different for cells B51-B54. (This change in bin size could be another example of something to watch out for.)
Website with information about data is at: https://www.census.gov/data/tables/time-series/demo/income-poverty/cps-finc/finc-07.html
This data is probably not what you want, though, because it is basically a pdf of income rather than an x-y relationship amenable to regression techniques.
What about trying to measure the gravitational acceleration constant (9.81 m/s^2)? Data would consist of measures of the time for an object to drop from different heights, and the model would have to look like Newton’s law of gravitation. A fun thing would be to make the students build the dataset themselves with a clock and a tape!
I had students collect data with their smartphones, which are packed full of sensors, for exactly this sort of problem. There are free apps for both iPhones and Androids that make it easy to collect the raw data, which student can then email to themselves in csv format. Specifically, I had them swing their phones on the charging cable like a pendulum, and then fit the acceleration data using a damped harmonic oscillator model. But there are many other sensors too that would easily generate data with non-linear relationships:
https://en.wikipedia.org/wiki/IPhone#Sensors
New Mexico pueblo populations!
For a small pystan project I have New Mexico pueblo populations from ~1790 – 1950. I added latitude, longitude, and elevation data. I have mean rainfall data which will be added later today.
The csv or ods sheet are here:
https://github.com/bioinfonm/pystan_musings/tree/master/beginners_excercise/new_mexico_pueblo_populations
How about the mcycle data set in the MASS R package?
Description:
A data frame giving a series of measurements of head acceleration
in a simulated motorcycle accident, used to test crash helmets.
Usage:
mcycle
Format:
‘times’ in milliseconds after impact.
‘accel’ in g.
Source:
Silverman, B. W. (1985) Some aspects of the spline smoothing
approach to non-parametric curve fitting. _Journal of the Royal
Statistical Society series B_ *47*, 1-52.
References:
Venables, W. N. and Ripley, B. D. (1999) _Modern Applied
Statistics with S-PLUS._ Third Edition. Springer.
For a single Y vs. a single X, how about child pedestrian fatalities vs. day of the year (reminiscent of births vs. day of year, as previously discussed on this blog)?
Data from FARS. Presumably additional X variables available there too.
Seasonality plus a major spike on Halloween :(
Temperature’s effect on electricity consumption is an easy non-linear example. It can be approximated with a quadratic, but more advanced methods are more appropriate.
https://www.researchgate.net/figure/226527334_fig2_Fig-3-Non-linear-relationship-between-the-filtered-electricity-load-of-a-distribution
Some data at the link below. There are interactions with day of week and hour, which makes it a bit more interesting.
https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info
Data on driving speeds in different countries obtained using Google-maps estimated journey time between random zip codes:
https://github.com/ga7g08/GoogleMapsAPI_experiment
This was a project I started as a grad student for fun and never got around to actually analysing – a summary of what I did manage to do can be found here
https://github.com/ga7g08/GoogleMapsAPI_experiment/blob/master/Results.md
Here’s code to generate some (synthetic) nonlinear time series data via Shalizi:
http://bactra.org/reviews/fan-yao/examples.R
A classic real-world data set for nonlinear time series is Nicholson’s ‘sheep blowfly’ data. Googling gave me this link:
https://www.stat.berkeley.edu/~brill/blowfly97I.html
Wood analysed these here: https://www.nature.com/nature/journal/v466/n7310/abs/nature09319.html
I’d be interested in seeing alternative analyses.
Something that is perhaps more along the lines you’re specifically looking for is:
http://epubs.siam.org/doi/pdf/10.1137/130929230
They drop an object, measure its trajectory and fit the solution of Newton’s equations of motion with constant gravity and an empirical drag model. This solution is nonlinear.
They also use prior and posterior predictive checks etc. Good example of the potential issues of model misspecification too.
I’m late to the party but — traffic flows! They’re a great case of “tipping point” nonlinearities.
I found a good set linked to from an intro-to-R Berkeley stats assignment I googled up:
“Data was collected by loop detectors at one particular location of eastbound Interstate
80 in Sacramento, California. There are six columns and 1740 rows in the data set. The
rows correspond to successive five minute intervals from March 14 to 20, 2003, where the
data values in a row report the flow (number of cars) and occupancy (the proportion of
time there was a car over the loop) in each of three lanes on the freeway. Lane 1 is the
leftmost lane, lane 2 is in the center, and lane 3 is the rightmost. The data set can be
found at http://www.stat.berkeley.edu/classes/s133/data/flow-occ.csv The original data are
from the Freeway Performance Measurement System (PEMS) website (Freeway Performance
Measurement System (PEMS) website)”
That link is: http://www.stat.berkeley.edu/classes/s133/data/flow-occ.csv
Inspired by: http://jliszka.github.io/2013/10/01/how-traffic-actually-works.html
This one just for fun:
https://phys.org/news/2017-02-mathematical-line-traffic.html
O2K {nlstools}: Oxygen kinetics during 6-minute walk test data set is a nice free dataset
Health spending (using MEPS, or the RAND insurance experiment data, which I think is public)–people love to use log(y+1) as a kludge, but really should be fitting something that explicitly accounts for masses at zero…
Zc:
log(y+1) is horrible. What’s “1” supposed to mean? $1? $1000? etc.