A fun activity for your statistics class: One group of students comes up with a stochastic model for a decision process and simulates fake data from this model; another group of students takes this simulated dataset and tries to learn about the underlying process.

Benjamin Jarvis writes:

I’m (re-)developing a course about discrete choice analysis, and I would like to build on data examples you use in your book with Jennifer Hill. In particular, I would like to include lessons about the conditional logistic regression model, aka McFadden’s multinomial logit. I was hoping students could extend the Bangladesh well-switching example used in your logistic regression chapter to the conditional logit case.

Students would estimate models of well switching, where the outcome variable isn’t just whether a family switches wells, but what well, in particular, the family switches to. This would probably require the use of two linkable data sets: one about the wells, their arsenic levels, and their geographic positions, and one about family well use over time. I understand that these data are potentially sensitive because of the geographic information, but perhaps if the data are stripped of most other identifying information about families, and if the geographic information is coarsened, they could be safely distributed. Do such data exist and can they be (safely) distributed for teaching or research purposes? Any help you can provide is much appreciated.

My reply: Unfortunately we do not have this follow-up data, so my suggestion for your class is that you have the students simulate some fake data and fit models to it. You could even have one group of students do the simulation and another group do the fitting and see what they discover.
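To make the two-group exercise concrete, here is a minimal sketch in Python of what it might look like for the conditional logit case. Everything specific in it is an assumption made up for illustration: the function names `simulate_choices` and `fit_conditional_logit`, the two well-level predictors (arsenic level and distance), their distributions, and the coefficient values. None of it comes from the actual Bangladesh data.

```python
import numpy as np

def simulate_choices(n_families=500, n_wells=5, beta=(-1.0, -2.0), seed=0):
    """Group A: simulate which well each family chooses (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Made-up well-level predictors, one value per (family, well) pair.
    arsenic = rng.lognormal(mean=0.0, sigma=0.5, size=(n_families, n_wells))
    distance = rng.uniform(0.0, 1.0, size=(n_families, n_wells))
    X = np.stack([arsenic, distance], axis=2)   # shape (families, wells, 2)
    utility = X @ np.asarray(beta)              # shape (families, wells)
    # Gumbel noise plus argmax gives choices that follow a conditional logit.
    choice = np.argmax(utility + rng.gumbel(size=utility.shape), axis=1)
    return X, choice

def fit_conditional_logit(X, choice, n_iter=25):
    """Group B: maximum likelihood estimate of beta by Newton's method."""
    n, _, k = X.shape
    beta = np.zeros(k)
    chosen = X[np.arange(n), choice]            # predictors of the chosen well
    for _ in range(n_iter):
        u = X @ beta
        u -= u.max(axis=1, keepdims=True)       # stabilize the softmax
        p = np.exp(u)
        p /= p.sum(axis=1, keepdims=True)       # choice probabilities
        mean_x = np.einsum("nj,njk->nk", p, X)  # expected predictors per family
        grad = (chosen - mean_x).sum(axis=0)    # gradient of the log-likelihood
        second = np.einsum("nj,njk,njl->kl", p, X, X)
        hess = -(second - mean_x.T @ mean_x)    # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(hess, grad)
    return beta

if __name__ == "__main__":
    X, choice = simulate_choices()
    print("estimated coefficients:", fit_conditional_logit(X, choice))
```

In a class, the first group would keep `beta` (and any complications they choose to build in) secret and hand over only `X` and `choice`; the second group would then see how much of the process they can recover.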

This could be a fun project for others, so I’m blogging it.


  1. Elle says:

    I’m planning on having my MS-level stats class do a p-hacking assignment. Everyone gets their own matrix of n simulated features from independent normal distributions with arbitrary labels assigned (“wearing_red”, “born_on_wednesday”, “likes_bean_burritos”). Then they have to write me a report with a significant finding and explain why it is a major step forward for science.

  2. A colleague and I did this in a class on gene expression statistics: we simulated the data, and then the students did the gene expression analysis, as did my colleague.

    Even though they knew the data was simulated, some of the students could not help trying to give biological reasons why certain locations were jointly under/over expressed.
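
The p-hacking assignment in the first comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_spurious_findings`, the sample size, the number of features, and the use of pairwise correlations with an approximate 5% cutoff are all choices made here, not details from the comment.

```python
import numpy as np

def count_spurious_findings(n_rows=200, n_features=30, seed=0):
    """Count 'significant' pairwise correlations among pure-noise features."""
    rng = np.random.default_rng(seed)
    # Independent standard normal features: no real relationships exist.
    X = rng.normal(size=(n_rows, n_features))
    r = np.corrcoef(X, rowvar=False)
    i, j = np.triu_indices(n_features, k=1)   # each pair of features once
    # Approximate two-sided 5% cutoff for a correlation under the null.
    cutoff = 1.96 / np.sqrt(n_rows)
    hits = np.abs(r[i, j]) > cutoff
    return int(hits.sum()), len(i)

if __name__ == "__main__":
    hits, total = count_spurious_findings()
    print(f"{hits} of {total} pairs look 'significant' in pure noise")
```

With enough pairs to search through, roughly one comparison in twenty should clear the cutoff by chance, which is all a student needs to write the report.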
