## Wanted: Statistics-related research projects for high school students

So. I sometimes get contacted by high school students who want to work on research projects involving statistics or social science. I’ve supervised several such students, and what works best is when they have their own idea, and I can read what they’ve written and give comments. I’m more of a sounding board than anything else.

But sometimes we do have good ideas, quantitative research projects that a high school student could do that would have some interesting statistical content or would shed light on some political or social issue.

If you have any good ideas—projects that would be fun for a high school student, or something quantitative a student could do that could make the world a better place—place them in the comments, and then maybe we could put together a list.

I’m not looking for classroom activities—Deb and I have a whole book about that—I’m looking for ideas for research projects that high school students could do on their own.

1. Greg Snow says:

This website: https://www-fars.nhtsa.dot.gov//QueryTool/QuerySection/SelectYear.aspx is the starting point for getting data on traffic fatalities in the United States (a bulk download of some of that data is available here: https://github.com/wgetsnaps/ftp.nhtsa.dot.gov--fars). Each year of data is in its own set of files, and exactly what is included changes between years, so some data wrangling is needed to combine data across years (a good experience for serious students, in my opinion).

A couple of ideas for analysis projects:

If you look at the data by day of year for the last few years, August 2nd shows up as an unusually dangerous day. Is there something really dangerous about that day, or is it a chance combination? Most likely it is a combination of several things: it is in the summer (summer is more dangerous than the other seasons); it is near the beginning of the month (early in the month is more dangerous, possibly due to people who are paid on the last day of the month doing more drinking/driving/other soon after); weekends are more dangerous, and over the last few years Aug. 2nd has fallen on a weekend more often than surrounding dates; and random chance probably plays a big part. A project could look at how dangerous a day like Aug. 2nd really is after adjusting for meaningful trends, to see whether it really is a dangerous day or whether chance can easily explain what is going on.
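One piece of this can be checked without any crash data: how often a given date falls on a weekend. Here is a minimal sketch in Python. The window of years is an assumption (the comment does not say which years were examined), and `weekend_count` is a hypothetical helper, not part of any FARS tooling.

```python
from datetime import date


def weekend_count(month, day, years):
    """Count how many of the given years have month/day on a Saturday or Sunday."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return sum(date(year, month, day).weekday() >= 5 for year in years)


# Assumed window of years; replace with the years you are actually analyzing.
years = range(2014, 2020)

# Compare Aug. 2nd with the dates around it.
for day in (1, 2, 3, 4):
    print(f"Aug. {day}: {weekend_count(8, day, years)} weekend occurrences")
```

A count like this could then be used as one of the adjustment variables, alongside season and day of month.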

Choose a change in law and look at traffic deaths before and after to see how much, and in what direction, traffic deaths changed. This could focus on a single state and a change of law in that state, or on multiple states where some had similar law changes (and others that did not change that law can act as controls).
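The comparison with control states is often done as a difference-in-differences calculation: the change in the states that changed the law, minus the change in the states that did not. A rough sketch follows, assuming you have already reduced the data to yearly fatality rates per group. All numbers below are made up for illustration, and `diff_in_diff` is a hypothetical helper name.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group's mean minus change in the control group's mean."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Made-up yearly fatality rates per 100,000 residents (NOT real FARS data).
treated_before = [12.0, 11.5, 12.5]   # state that changed the law, years before
treated_after = [10.0, 10.5, 9.5]     # same state, years after
control_before = [11.0, 11.0, 11.0]   # comparison states, years before
control_after = [10.5, 10.5, 10.5]    # comparison states, years after

estimate = diff_in_diff(treated_before, treated_after, control_before, control_after)
print(f"Difference-in-differences estimate: {estimate:+.2f} deaths per 100,000")
```

This only makes sense if the two groups would have followed similar trends without the law change, which is itself something a student could check by plotting the pre-change years.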

2. Mark Samuel Tuttle says:

Local police statistics are increasingly available on the Web. So some simple crime prediction exercises could try to associate these with other observables – address, day of week, time of month, time of year, weather, etc.

There are now businesses that do this, and the Los Angeles Police Department does this, controversially …

Anyway, it could start out very simply, but it’s open-ended, obviously …

3. This is more about gathering and organizing data than statistics, but I’ve often thought that it would be great to have a “US Version” of David MacKay’s excellent Sustainable Energy – without the hot air. Its numbers and examples of things related to energy usage are UK-based, and especially when I point to this book when teaching, it would be good to have parallel data conveniently tabulated for the US. Bonus projects: do this for other countries, too!

4. gdanning says:

Having taught advanced high school students for many years, I would say that there is not a whole lot that high school students can’t do, at least in theory. In practice, the challenges as I see them are 1) they might not have access to published articles that are behind paywalls; and 2) they will not have time to do much in the way of data cleaning (what with their numerous AP classes, clubs, etc). So, I would suggest looking at studies that have publicly available replication data, and seeing what happens if the analysis is done in slightly different ways (adding control variables, logistic regression instead of OLS, etc, etc). The results would be valuable lessons for budding data analysts.

One slightly more specific suggestion might be to look at the predictive value of variables found to be statistically significant in whatever study interests students, as Ward, et al., did with studies of civil war onset here: https://journals.sagepub.com/doi/pdf/10.1177/0022343309356491

5. Nathan L. says:

Something I did in undergrad which I enjoyed and think can be tailored to high schoolers:

Download a dataset off Kaggle, and after learning what the data is about and what it’s measuring, come up with 2-3 questions that the data may provide an answer for, then investigate those questions. Data on Kaggle is relatively clean, especially the learning datasets, and often contains interesting relationships beyond the scope of the competition. There are also the public datasets that people upload themselves.

A motivated high schooler will be able to find a dataset of interest and apply rudimentary statistics (linear regression, descriptive statistics) to discover some relationships in the data. If the students are properly instructed, it can also serve as a tool to teach them good habits about how to ask and explore scientific questions, how to explore data, ways to understand what is causal or spurious, etc.

To that second point, it might be fun to ask students to intentionally find a clearly spurious relationship between two datasets to emphasize how important the probability-part of statistics is.
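To see how easy it is to find a spurious relationship, a student could start with data known to be unrelated. The sketch below is an illustration only: it simulates 20 independent random walks (no real datasets involved) and then searches all pairs for the strongest correlation. The number of series, their length, and the seed are arbitrary choices.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def random_walk(n_steps, rng):
    """Cumulative sum of independent standard normal steps."""
    position, path = 0.0, []
    for _ in range(n_steps):
        position += rng.gauss(0, 1)
        path.append(position)
    return path


rng = random.Random(0)  # fixed seed so the run is repeatable
walks = [random_walk(50, rng) for _ in range(20)]

# Search every pair of independent series for the largest absolute correlation.
best_pair = max(
    combinations(range(len(walks)), 2),
    key=lambda pair: abs(pearson(walks[pair[0]], walks[pair[1]])),
)
best_r = pearson(walks[best_pair[0]], walks[best_pair[1]])
print(f"Strongest correlation among unrelated series: r = {best_r:.2f} for pair {best_pair}")
```

Because the search covers 190 pairs of trending series, the winning correlation tends to look impressive even though every series was generated independently.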

The best thing about this, I think, is that it’s easy to find interesting stuff on Kaggle, and really that will be the hardest part of any statistics project: getting the data.

Also, perhaps designing and analyzing their own surveys – but honestly this may be too nuanced.

6. Another one:

As I wrote about here, there’s a lot of great data out there on what students major in in college, and many interesting trends in recent years. (Note, however, that interesting doesn’t necessarily mean important.) I could imagine a lot of projects based on this, e.g.

– more robustly defining trends (e.g. fitting multi-year data, not just two timepoints).
– looking at what subjects cluster with other subjects — the results will likely be unsurprising, but who knows…
– looking at what schools cluster with other schools, based on trends in majors’ enrollment.
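For the first item, the difference between a two-timepoint comparison and a multi-year fit can be sketched as below. The yearly shares are made-up numbers for illustration, not real enrollment data, and both helper functions are hypothetical names.

```python
def endpoint_slope(xs, ys):
    """Trend estimated from the first and last observations only."""
    return (ys[-1] - ys[0]) / (xs[-1] - xs[0])


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs, using every observation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up share of degrees (percent) in one major, by year.
years = [2014, 2015, 2016, 2017, 2018]
share = [5.0, 5.2, 5.1, 5.3, 4.6]

print(f"Two-timepoint trend: {endpoint_slope(years, share):+.3f} points per year")
print(f"Multi-year fit:      {ols_slope(years, share):+.3f} points per year")
```

The two-timepoint estimate depends entirely on the first and last years, so one unusual year at either end changes it a lot; the fitted slope spreads that influence across all years.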

7. Tony Wuersch says:

In Providence RI, where I live, there are 179k people in 15 wards. A project to go to city planning or a registry of deeds, find out what data there is on all real estate in a ward, and summarize it, would be interesting. Also, to compare that with Zillow, to see how much data is from public sources.

In Switzerland, many towns have books describing history, geography, fauna, demography, politics and so on. The tables of contents are illustrative. An example is here:
https://www.riehen.ch/gemeinde-riehen/portrait/geschichte/gemeindekunde-riehen-michael-raith-1988

The above is an insane book by US standards: a 250-page comprehensive book about a suburb just outside Basel that, at its recent population peak, topped 22k people.

8. Thomas Wittlinger says:

Just a small note: in Germany you can get a lot of data straight from “destatis”. Students are additionally supported by “Statistik Campus”…;-)
You may have a look…
https://www.destatis.de/DE/PresseService/StatistikCampus/StatistikCampus.html;jsessionid=D885458B3AB1E975D8EF4EE0C50B554B.InternetLive2

9. Martin Modrák says:

I believe there is a vast amount of information on e-sports matches, e.g. DOTA 2: https://www.opendota.com/heroes These lend themselves to the same types of analyses as traditional sports, but might be more attractive to students and also have some additional interesting features. In DOTA 2, for example, there is a complicated meta-game just around choosing heroes to play with. The rules are regularly updated, resulting in a phase of exploration of new playstyles followed by a new equilibrium. Further, the free datasets seem to have a very fine level of detail (down to individual player movement and actions), which is, AFAIK, lacking from available sports datasets.