Benjamin Jarvis writes:

I’m (re-)developing a course about discrete choice analysis, and I would like to build on data examples you use in your book with Jennifer Hill. In particular, I would like to include lessons about the conditional logistic regression model, aka McFadden’s multinomial logit. I was hoping students could extend the Bangladesh well-switching example used in your logistic regression chapter to the conditional logit case.

Students would estimate models of well switching, where the outcome variable isn’t just whether a family switches wells, but what well, in particular, the family switches to. This would probably require the use of two linkable data sets: one about the wells, their arsenic levels, and their geographic positions, and one about family well use over time. I understand that these data are potentially sensitive because of the geographic information, but perhaps if the data are stripped of most other identifying information about families, and if the geographic information is coarsened, they could be safely distributed. Do such data exist and can they be (safely) distributed for teaching or research purposes? Any help you can provide is much appreciated.

My reply: Unfortunately we do not have these follow-up data, so my suggestion for your class is that you have the students simulate some fake data and fit models to it. You could even have one group of students do the simulation and another group do the fitting and see what they discover.
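Here is a minimal sketch of what such a simulate-then-fit exercise could look like in Python, using only the standard library. Everything specific in it is an assumption made for illustration, not something from the Bangladesh data: the two well attributes (distance and arsenic level, in made-up units), the coefficient values, the number of candidate wells per household, and the function names simulate_choices and fit_conditional_logit. Choices are drawn from the conditional logit probabilities, and the coefficients are then recovered by Newton's method on the log-likelihood.

```python
import math
import random


def simulate_choices(n_households, n_wells, beta, seed=0):
    """Simulate which well each household picks (hypothetical setup).

    Each well has two made-up attributes: (distance, arsenic).
    The chance of picking well j is exp(u_j) / sum_k exp(u_k),
    with u_j = beta[0] * distance_j + beta[1] * arsenic_j.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_households):
        wells = [(rng.uniform(0, 3), rng.uniform(0, 3)) for _ in range(n_wells)]
        weights = [math.exp(beta[0] * d + beta[1] * a) for d, a in wells]
        choice = rng.choices(range(n_wells), weights=weights, k=1)[0]
        data.append((wells, choice))
    return data


def fit_conditional_logit(data, n_iter=10):
    """Maximum likelihood for a two-coefficient conditional logit via Newton steps."""
    b = [0.0, 0.0]
    for _ in range(n_iter):
        grad = [0.0, 0.0]
        info = [[0.0, 0.0], [0.0, 0.0]]  # negative Hessian of the log-likelihood
        for wells, choice in data:
            utils = [b[0] * x[0] + b[1] * x[1] for x in wells]
            top = max(utils)  # subtract the max for numerical stability
            w = [math.exp(u - top) for u in utils]
            total = sum(w)
            p = [wi / total for wi in w]
            # Probability-weighted mean of each attribute within the choice set
            mean = [sum(pj * x[k] for pj, x in zip(p, wells)) for k in range(2)]
            for k in range(2):
                grad[k] += wells[choice][k] - mean[k]
                for m in range(2):
                    info[k][m] += sum(
                        pj * (x[k] - mean[k]) * (x[m] - mean[m])
                        for pj, x in zip(p, wells)
                    )
        # Solve the 2x2 system info * step = grad by hand
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step0 = (info[1][1] * grad[0] - info[0][1] * grad[1]) / det
        step1 = (info[0][0] * grad[1] - info[1][0] * grad[0]) / det
        b = [b[0] + step0, b[1] + step1]
    return b


# Assumed "true" coefficients: households dislike distance and arsenic.
true_beta = (-1.0, -0.8)
fake_data = simulate_choices(2000, 5, true_beta, seed=1)
print("true:", true_beta, "estimated:", fit_conditional_logit(fake_data))
```

For the two-group version of the exercise, the simulating group would keep true_beta to itself and hand over only fake_data.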

This could be a fun project for others, so I’m blogging it.

My friend has been running a series of these every month this year; all are welcome:

https://www.lesswrong.com/posts/kmZjtuo4Gzv5WhTTX/a-d-and-d-sci-may-2021-interdimensional-monster-carcass

I have indeed, though the one you linked isn’t the most new-player-friendly or indicative one. The best place to start is probably https://www.lesswrong.com/posts/Y9FcNzWqczbfqcPQ3/d-and-d-sci-ii-the-sorceror-s-personal-shopper

I’m planning on having my MS-level stats class do a p-hacking assignment. Everyone gets their own matrix of n simulated features from independent normal distributions with arbitrary labels assigned (“wearing_red”, “born_on_wednesday”, “likes_bean_burritos”). Then they have to write me a report with a significant finding and an explanation of why this is a major step forward for science.
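A rough sketch of how the data for such an assignment might be generated and mined, in plain Python. The sample size, the number of features, the generic feature names, and the use of a Fisher z approximation for the correlation test are all assumptions made for illustration; the actual assignment may be set up differently.

```python
import math
import random


def simulate_null_study(n_subjects=200, n_features=40, seed=0):
    """Outcome and features are all independent standard normals (pure noise)."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    features = {
        f"feature_{j}": [rng.gauss(0, 1) for _ in range(n_subjects)]
        for j in range(n_features)
    }
    return outcome, features


def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z approximation (reasonable for large n)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def p_hack(outcome, features, alpha=0.05):
    """Test every feature against the outcome and keep only the 'discoveries'."""
    n = len(outcome)
    p_values = {
        name: approx_p_value(correlation(col, outcome), n)
        for name, col in features.items()
    }
    return {name: p for name, p in p_values.items() if p < alpha}


outcome, features = simulate_null_study(seed=2021)
print(p_hack(outcome, features))
```

With 40 independent noise features and a 0.05 threshold, about two spurious "findings" per student are expected on average, which is the point of the exercise.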

You might like this website that makes it easy to do p-hacking.

I’ve used it for a couple of classes. Students find it interesting.

https://shinyapps.org/apps/p-hacker/

I have a similar idea for my “python programming for ecologists” exercise: write an automatic signal detector, and then feed it random noise so students will see with their own eyes how easy it is to trick automatic inference.
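As an illustration of that idea, here is a deliberately naive detector, sketched under my own assumptions: it flags any sample more than three standard deviations from the mean, and the function name detect_events and the threshold are hypothetical choices. Fed pure Gaussian noise, it still reports "events".

```python
import random
import statistics


def detect_events(signal, threshold_sd=3.0):
    """Return indices of samples further than threshold_sd SDs from the mean."""
    mean = statistics.fmean(signal)
    sd = statistics.stdev(signal)
    return [i for i, v in enumerate(signal) if abs(v - mean) > threshold_sd * sd]


# Pure white noise: there is no signal here to find.
rng = random.Random(42)
noise = [rng.gauss(0, 1) for _ in range(10_000)]
events = detect_events(noise)
print(f"{len(events)} 'events' detected in pure noise")
```

With 10,000 samples and a three-sigma rule, a few dozen detections are expected by chance alone (roughly 0.27% of samples), so the detector always finds something.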

A colleague and I did this in a class on gene expression statistics: we simulated the data, and then the students did the gene expression analysis, as did my colleague.

Even though they knew the data was simulated, some of the students could not help trying to give biological reasons why certain locations were jointly under/over expressed.