Lizzie writes:
Thinking on what to do for my now-online stats course next term, I stumbled on your re-post from 2016.
And wanted to ask if you or anyone did this (that you know of)? Or if any of your in-development books do lots and lots and lots of simulation? The more I teach fake data, the more critical it seems (including for our manuscript! But more on that tomorrow) so I’d love a book or course to follow focused a lot on this ….
My response: We’re putting more and more simulation in our books, but never enough, I fear. I still like the idea of a course based entirely on simulation, as suggested in the linked post.
Your definition of “computer simulation” seems broad, so I am going to suggest this paper:
https://arxiv.org/abs/1810.10525
It has a few simple algorithms and methodologies of how to take a data set, say, of a ball bouncing on a basketball court and derive the equations, using unsupervised learning, of gravity and projectile motion.
Why “entirely” based on simulation? I suppose there is room for anything, given enough students and objectives, but I would think a mixture of simulation and real data is a better idea. For example, the recent post about first games of baseball seasons in relation to total season wins: simulating data under varying conditions and exploring the regression model that results is very instructive. Then, looking at the actual data becomes even more instructive. I think, in general, the mixture of simulated data and real data is best. Totally using simulated data seems very much a niche approach.
Yes!
Agreed. I love simulation, but as a tool for helping us understand processes in the real world. So juxtaposing data produced from simulations with data produced from the real world is necessary to make that connection.
Might be wrong but I don’t think anyone is suggesting not having real data, but rather always/usually starting with simulation (fake data) and then moving on to real data.
However, that simulation has to be fully understood as an attempt to represent elements and relations among them in the represented world [the real world that produced the data] and map them onto elements and relations in the representing world [the probability world that creates the fake data]. This requires transporting what repeatedly happens given a model to what reality happened to produce this one time. So real data is always the real target of the exercise.
The most concisely I can put that is below:
Science is thinking, observing and then making sense of thinking and observing for some purpose. Making sense is formalized in assessments of what would repeatedly happen in some reality or possible world. If it would repeatedly happen (a habit either of an organism, community or physical object/process), it’s real.
Statistics can be defined as formalizing ways to learn from observations using math (i.e. using models). Unfortunately what is taken as abstract statistical work is too often misunderstood mathematics where its representational role has been suppressed or overlooked. Models or assumptions are often just taken as the price one has to pay to obtain statistical outputs like p-values, confidence intervals and predictions. Unavoidable, just like taxes. However, they are representations of some reality beyond our direct access, even if only implicitly.
It is always the case that given we have no direct access to reality, reality must be represented abstractly in our heads. Given that we must think about reality using abstractions, we can only notice aspects of those abstractions. These must not be confused with the reality they merely attempt to represent, in some meaningful way for some purpose. The veracity of any statistical method depends on an implicit abstract fake world being not too wrong, that is, not too different from the “real” world in some way that matters for the given purpose.
Mathematics is required to discern exactly what an abstract object or construction implies. Taken as being completely true, what follows? However, mathematics as it is usually taught and written about can be a formidable barrier for most. Even those with an undergraduate degree in mathematics may have only learned to “get used to it” rather than actually understand it. That is, they can do the calculations correctly but do not know what to fully make of the results. But mathematics has many mediums and one in particular can perhaps be grasped most widely by researchers. This is because, as CS Peirce pointed out, the object of mathematics is some or all hypotheses concerning the forms of relations in the abstract construction. All mathematical knowledge thus has a hypothetical structure: if such and such entities and structures are supposed to exist, then this and that follows. Fortunately there are many ways to discern what follows.
CS Peirce further defined mathematics as the manipulation of diagrams or symbols taken beyond doubt to be true – experiments performed on abstract objects rather than chemicals – diagrammatical reasoning. Here diagrams more than symbols have been argued to be more perspicuous (an account or representation more clearly expressed and easily understood or lucid). Diagrams are arguably the medium of mathematics most can grasp. An abstract diagram is made, manipulated and observed to understand the diagram much more thoroughly. Recently they have been accepted in the mathematical community as being rigorous – Visual Reasoning with Diagrams https://link.springer.com/book/10.1007/978-3-0348-0600-8
By far the main mathematical constructions in statistics are probability models. These can easily be recast as diagrams which then can easily be automated and animated with computer simulation, given modern computation. Once probability models are recast as diagrams, the diagrams themselves can be used to generate the pseudo-random variables needed for simulation. This generating can be inefficient but is valid and much easier to grasp than other ways. This transforms the understanding of probability into experiments performed on diagrams just using simulation all the way down. Those abstract mathematical probability models are best understood in terms of what would be repeatedly drawn from them and simulation does the repeated drawing. I believe that simulation provides a profitable mechanical way of noticing aspects of probability and statistics where the learning about a model is clearly and fully distinguished from what to make of observations in hand. And it involves very little mathematical skill but rather just the ability to think abstractly. But there is no free lunch: it needs to be worked with, experienced and reflected on.
Models take elements and relations among them in the represented world [that produced the data] and map them onto elements and relations in the representing world [probability world]. This requires transporting what repeatedly happens given a model, to what reality happened to produce this time. In the second we can know exactly what we are learning about (the probability model), in the first (the world) we can only guess or profitably bet about it. Those guesses are informed by what repeatedly happens in the probability world. However, it is really just replacing the medium or form of mathematics (a means to understand an abstract representation, the aim of which is to infer necessary conclusions from hypothetical objects) with something more concretely experimental and hence potentially self correcting with persistence. Thereby better facilitating self verification more widely with less mathematical skill.
Yes. I was not suggesting no or extremely low real data in the course, but I have found some students really struggle with simulating data from a model, so it needs to be repeated more, I think (or, of course, I need to teach it better… that’s a constant battle). So I am looking for courses (or suggestions on how to run a course) where simulating data happens over and over again throughout the course. Suggestions on how to teach simulating data from a model are also welcome!
Students in my course often have LOTS of experience with real data and I agree porting back to real data is critical but I would like to teach: model -> simulated data -> model on simulated data -> model on real data -> posterior predictive checks and repeat … adjust model … I am open to persuasive arguments for orders.
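The first steps of that loop (model -> simulated data -> model on simulated data) can be sketched in a few lines. This is only an illustration under stated assumptions: the straight-line model, the parameter values, and the function names `simulate_data` and `fit_line` are all hypothetical, and least squares stands in for whatever fitting method the course would actually use, so the posterior predictive step is not shown.

```python
import random
import statistics

def simulate_data(n, intercept, slope, sigma, rng):
    """Draw fake data from the assumed model y = intercept + slope * x + Normal(0, sigma)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares estimates of (intercept, slope)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical "true" parameter values, chosen only for illustration.
rng = random.Random(1)
true_intercept, true_slope, sigma = 2.0, 0.5, 1.0

# Repeat the simulate-then-fit step many times to see what the fit recovers.
slopes = []
for _ in range(500):
    xs, ys = simulate_data(50, true_intercept, true_slope, sigma, rng)
    slopes.append(fit_line(xs, ys)[1])

mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)
print(f"true slope {true_slope}, mean estimate {mean_slope:.3f}, sd of estimates {sd_slope:.3f}")
```

If the fit cannot recover the known slope from fake data, there is little reason to trust it on real data, which is the point of doing this step before the real data come in.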
It’s an important clarification that you aren’t interested in “introducing” science so much as furthering it by adding a tool to budding scientists’ kit. I have no doubt you’re doing a great job of teaching the skills for that, so it must be frustrating that some students have disproportionate difficulty with simulation. I wonder if these students may have a mental barrier that makes them think of simulation as its own, isolated thing. If so, that might prevent them from perceiving lessons on simulation as just a transfer of the skills they already have as scientists/analysts to another context. For example, perhaps some students see simulation as an “extra” activity, supplemental to “real” science; or as a way of learning certain computer skills; or as a neat didactic illustration of substantive principles being studied.
If you think that might be the case, then you could start off by assigning readings of journal articles where simulation is used not to make a point about algorithms or to solve a distributional problem, but as the actual study. (Assuming you aren’t doing this already.) Surprisingly few scientists are aware this is a real option for doing, and publishing, science. You could also try more explicitly linking what we do in simulations to the scientific method more generally–you know, like, the scientific method is just an iterative process for developing and evaluating models, and data simulation is just inserting algorithms into one or more steps of the scientific method. (Again, if you aren’t already.)
I think there needs to be a lot of focus and practice on simulation. There are no free lunches.
Unfortunately students may see simulation as a cheap trick to get the answer when they want to focus on learning “big boy pants” mathematical ways to do that.
That is one of the reasons why I use probability diagrams, digits of Pi and rejection sampling to get random variables from any distribution. No black boxes at first; then you tell them to use black-box methods that are much more efficient.
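For readers who have not seen rejection sampling, here is a minimal sketch. It is not the commenter's teaching material: it swaps Python's `random` module in for the digits of Pi as the source of uniform numbers, and the Beta(2, 2) target and the function names are assumptions chosen for illustration. The idea is the diagram itself: draw the density curve inside a box, throw points uniformly into the box, and keep the horizontal position of every point that lands under the curve.

```python
import random
import statistics

def target_density(x):
    """Beta(2, 2) density on [0, 1]; its maximum is 1.5 at x = 0.5."""
    return 6.0 * x * (1.0 - x)

def rejection_sample(n, rng, bound=1.5):
    """Keep uniform points that fall under the density curve inside a [0, 1] x [0, bound] box."""
    draws = []
    while len(draws) < n:
        x = rng.random()             # horizontal position in the box
        u = rng.uniform(0.0, bound)  # vertical position in the box
        if u < target_density(x):    # accept only points under the curve
            draws.append(x)
    return draws

rng = random.Random(42)
draws = rejection_sample(20000, rng)
sample_mean = statistics.fmean(draws)
sample_var = statistics.pvariance(draws)
# Beta(2, 2) has mean 0.5 and variance 0.05.
print(f"mean {sample_mean:.3f}, variance {sample_var:.4f}")
```

About one third of the points are thrown away here (the area under the curve is 1 and the box has area 1.5), which is the inefficiency traded for transparency.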
A lot of people are using simulation, and Aki has a recent webinar where he does and talks a lot about simulation, which I found quite good.
There is also this course at Berkeley that uses simulation, and they expect to increase that in the future: Interleaving Computational and Inferential Thinking: Data Science for Undergraduates at Berkeley https://hdsr.mitpress.mit.edu/pub/e69066t4/release/2?readingCollection=3572f4eb
I think of simulation as computer-aided imagination. Other aids to imagination, at least for me, include movement/gestures, drawing, visualizing things in my head, and talking things out either with myself or with someone else.
Relative to other imagination aids, simulation makes my thought processes accessible to others, which is critical for scientific communication. Simulation also connects imagination with the formal techniques needed to derive testable predictions, again serving an important scientific function.
This is all to say that I think simulation is an extremely valuable scientific tool. I try to include it more and more in my teaching, since I agree that it is not included in most curricula these days but we have computational tools that make simulation far more accessible than it was in the past.
> Other aids to imagination
I would argue mathematics is the ideal form of imagination in statistics, but it has a very steep learning curve that few get beyond just being able to do the calculations correctly. For those with enough commitment/talent I would recommend the math be taken after some experience in simulation has provided some understanding.
Mathematics may be the ideal form but unfortunately there are problems in statistics not solved by math but by algorithms. Given that, simulation is actually the more generalizable method.
Good point.
I missed that earlier post, so I figure it’s worth replying here: Another big yes! I’m a huge fan of simulation: as a “multidisciplinary” engineer* I’ve been gradually learning more and more tools for it over time, trying to expand the boundaries of what I can design and test in the computer. It’s the future, period.
That said, I think grounding it in reality where possible/practical is helpful. One thing you might consider doing is using something like Tracker (a singularly ungoogleable tool, https://physlets.org/tracker/) with the students’ webcams to gather some data for more basic models. Some people are fine working with a pile of equations, but many people need some physical intuition to guide them through the equations. And in most systems, there is a meaningful physical intuition or meaning to things, and understanding it will let you put constraints on optimizers which would otherwise happily meander off to excellent fits in absolute nonsense corners of your parameter space. As the non-PhD on teams full of PhDs, building and applying that intuition was a fair portion of my mid-career.
In any case, yes! Do it! It’s brilliant, and it really is the future. Just see if there are places you can ground it in reality.
* A colleague recently referred to me as “our resident physicist” in the “he meddles in every subsystem using spherical cows, but is usually helpful” sense. Best underhanded compliment I’ve ever gotten. For context: I went into industry doing network programming right after HS, then after a few years did undergrad mathematics and linguistics degrees. I went back to industry, doing NLP and “data science” back before it was called that. Nowadays, I work on medical devices, with lots of firmware and circuit modeling feeding into physiological models.
CAST is a collection of interactive Statistics ebooks that makes extensive use of simulation and animated graphics.
Do you mean like reverse math where you try to generate the axiomatic assumptions? Only with simulation to data revealing those paths?
Also, I have to state I’m not a robot to post this. What if I am one?
Lie. (Can robots lie?)
Our Complex Systems program at UVM has a few simulation-only courses, for example “CS 302. Modeling Complex Systems” (http://catalogue.uvm.edu/search/?P=CS%20302)
Admittedly, the scope is broader than just “statistical models” but I’ve heard from students that all of the simulations really help thinking through the role of models in science in general.
I did a basic course that was nearly all simulation in statistics for psychology students. They had to do things like make simulations to see what happens when you vary size and probability in binomial distributions assessing variance and shape of the distributions. Or, report what happens to classical error rates in t-tests when you vary whether assumptions are met and N. Or, instead of calculating power have them look at “null” findings in papers and simulate what might happen if you ran that experiment again assuming the observed estimates are representative. There were about 8 assignments with simulation all together.
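The first of those assignments might look something like the following sketch. The sizes, probabilities, number of replications, and function names are assumptions made for illustration, not the actual course material. It simulates binomial draws by counting successes in repeated coin flips and reports the variance and skewness for each combination of size and probability.

```python
import random
import statistics

def binomial_draw(n, p, rng):
    """One binomial draw: the number of successes in n independent trials."""
    return sum(rng.random() < p for _ in range(n))

def summarize(n, p, reps, rng):
    """Simulated mean, variance, and skewness of a Binomial(n, p) distribution."""
    draws = [binomial_draw(n, p, rng) for _ in range(reps)]
    mean = statistics.fmean(draws)
    var = statistics.pvariance(draws, mean)
    sd = var ** 0.5
    skew = statistics.fmean([((d - mean) / sd) ** 3 for d in draws])
    return mean, var, skew

rng = random.Random(7)
results = {}
for n in (5, 50):
    for p in (0.1, 0.5):
        results[(n, p)] = summarize(n, p, 4000, rng)

for (n, p), (mean, var, skew) in results.items():
    # n * p * (1 - p) is the textbook variance, shown for comparison.
    print(f"n={n:2d} p={p}: mean {mean:6.2f}, var {var:6.2f} "
          f"(theory {n * p * (1 - p):5.2f}), skew {skew:5.2f}")
```

Small sizes with probabilities far from 0.5 come out visibly skewed, while larger sizes look close to symmetric, which students can see for themselves before meeting the formulas.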
I have a strong belief that the best way to introduce science is in terms of models. Models are intuitive and ubiquitous: building blueprints, fashion models, model planes, best practices, DIY guides, mentors, practice problems, stereotypes, mores, scientific theories, and mimes are all models with varying levels of usefulness, rigor, and logic (math). Everyone creates, revises, interprets, and over-interprets models every day, whether you’re 2 years old or 92 years old. Science is just a process for feeding our models into a formal, iterative method (model!) of development and evaluation, and data simulation is just inserting algorithms into one or more steps of the scientific method.
I’d suggest that a core science curriculum include, in addition to (and probably ahead of) simulations, exploration of various explanations for the usefulness of formal models, and thus of science. There’s a mathematical answer (“the unreasonable effectiveness of mathematics in the natural sciences”), for instance, and an evolutionary biological answer about how we perceive and interpret the world. There’s also a less abstract historical explanation: there was this great big fight between Rationalists and Empiricists around the 17th/18th centuries, which ended up in an intimate marriage between the two. The result is that good science requires a great deal of careful thought, informed by a great deal of information, with a huge dollop of the humility necessary to throw all your hard work away in the face of valid criticism or failed predictions.
Throw in something about the superiority of actuarial over clinical approaches, and Bob’s your uncle.
Michael: That’s what I am trying to do and gave a brief overview in my comment above – recast simulation as a scientific process. Which I think requires focusing on the role of simulation to represent and learn about possible worlds to better make sense of what happens in the actual world. Just the scientific process laid out in diagrammatic form and animated.
> evolutionary biological answer about how we perceive and interpret the world.
I use ticks, which have only three senses of the world that allow them to survive (formally, biosemiotics).
https://swirlstats.com/ may be an interesting use case. Swirl itself is entirely computational and has had a reasonably successful history. “swirl teaches you R programming and data science interactively, at your own pace, and right in the R console”
Sorry, I’m a bit late to the discussion. And what I can offer is not much.
Here I’ve put together some R examples for power simulations that I use for teaching and consulting: https://arxiv.org/abs/2110.09836
They help me to illustrate the workflow: First simulate, then collect data.
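A power simulation in the "first simulate, then collect data" spirit can be as short as the sketch below. It is written in Python rather than R and is not taken from the linked examples: the two-group design, the effect size of 0.5 standard deviations, the sample size, and the function names are assumptions, and a normal-approximation z test stands in for a t test to keep the code dependency-free.

```python
import random
import statistics

# Two-sided 5% critical value from the standard normal distribution.
Z_CRIT = statistics.NormalDist().inv_cdf(0.975)

def one_experiment(n, effect, rng):
    """Simulate one two-group study and report whether the test rejects."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(effect, 1.0) for _ in range(n)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = (statistics.variance(control) / n + statistics.variance(treated) / n) ** 0.5
    return abs(diff / se) > Z_CRIT

def rejection_rate(n, effect, reps, rng):
    """Share of simulated studies in which the test rejects."""
    return sum(one_experiment(n, effect, rng) for _ in range(reps)) / reps

rng = random.Random(2021)
power = rejection_rate(n=50, effect=0.5, reps=2000, rng=rng)
false_positive_rate = rejection_rate(n=50, effect=0.0, reps=2000, rng=rng)
print(f"estimated power {power:.2f}, estimated false positive rate {false_positive_rate:.3f}")
```

Running the same function with the effect set to zero is a cheap check that the test behaves as advertised before any power number is believed.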
Follow the students to get to the course
https://eng.auburn.edu/news/2021/09/industrial-and-systems-students-professor-receive-best-paper-award.html
I’ve taught system dynamics (SD) courses, and they were simulation courses as seem to be discussed here, including the tie to real data and to theories about the way the real world works. If all your models were largely nonlinear ODE models, you might argue you were doing SD, too. Often SDers focus more on the dynamics caused by the feedback in the ODE models than on the stochastic nature of the data, but both should be done.
One of the nice features of such a course is that it makes answers to student questions about “How do you do …?” or “But what would happen if ….?” pretty much the same: “I dunno; have you tried it?” or “I dunno; let’s try it.”
In teaching a one-quarter course, I tended to find that it took 2-4 weeks for most to figure out how to go about creating useful models. Once they had tried enough, they began to catch on, and modeling and simulation became the normal response to a question or a problem. Iteration was key–often not by refining a first model in ever-smaller steps towards the “perfect model” but the willingness to start small, build based on what the real data showed, and throw a model away and start over when the time was right.
Roger Myerson and I wrote a book to teach decision analysis to students with no coding background that is entirely based on simulations, from beginning to end. The simulation approach entirely lives up to its promise. The students love to learn this way, and simulation-based courses are a lot of fun for the professors to teach.
https://mitpress.mit.edu/books/probability-models-economic-decisions-second-edition
Eduardo:
How do you use simulation to evaluate the effective value of information in a multistage decision problem? I’d like to solve such problems using simulation but I don’t know how to do it, because the decision at each stage requires an expectation value or integral that depends on the values of data that won’t be observed until the next stage. Intuitively it seems to me that there should be some way to solve such problems by simulating the entire process, but I can’t figure out how to do it.
Perhaps there’s information you’d find useful on some of Diana Fisher’s site at https://ccmodelingsystems.com/. See especially the Student Projects, Research, and Subscriber Resources (I presume) sections.
If you have not looked into creating computer simulation models using system dynamics (SD), I highly recommend finding out more about this very versatile and approachable analytical method. Developed at MIT in the mid-1950s by Jay Forrester, it has become an approach used by MIT researchers to study global climate change, the CDC to study bioterrorism, and the Mayo Clinic to study end stage kidney disease (to name only 3 applications). It is central to the MIT Sloan School of Management.
That said, I have taught SD model-building to high school students for over 20 years. I had students regularly build SD models (about once a month) in second year algebra, pre-calculus, and calculus classes. I also taught a year-long SD model-building course for math and science students (for 2 decades) where students developed skill in building SD models using lessons I created (and that have been in print for over 20 years) culminating in a 10-week research project where students chose a topic of interest to them, researched it, built a working simulation SD model, wrote a technical paper, and presented their models to an audience. I also followed a similar pattern, but in a more compressed time frame (10 weeks) at an undergraduate university, teaching Environmental Math Modeling. SD modeling is based on the study of dynamic feedback processes in systems. It is visual and reasonably accessible to a broad audience. The student essentially designs the model as a 2-D blueprint type diagram showing stocks (integrals, represented as boxes), their rate of change (flow pipes into and out of the boxes – first derivatives), other variables and parameters (in circle compartments), and the connecting arrows that show the dependencies of one part of the model on other parts. This approach can be adapted to have students build specific models to highlight particular dynamics that are important to a core concept in science, then they can manipulate the simulation to answer “what if” questions. But it is in the BUILDING of the model that most of the learning happens.
There is a research and applications community centered around this SD model-building approach “The System Dynamics Society.” After having spent so many years teaching my students to build simulations to understand core concepts in mathematics and science I strongly believe we are doing our students a great disservice if we do not arm them with this powerful method of understanding how complex systems operate.
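To make the stock-and-flow description above concrete, here is a minimal sketch of one stock with an inflow and an outflow, stepped forward with Euler integration. SD models are normally built in visual tools, so the Python form, the logistic population example, the parameter values, and the function name are all assumptions chosen for illustration.

```python
def simulate_population(initial, growth_rate, capacity, dt, steps):
    """Euler-integrate one stock (a population) with a birth inflow and a crowding outflow."""
    stock = initial
    history = [stock]
    for _ in range(steps):
        births = growth_rate * stock                     # inflow per unit time
        deaths = growth_rate * stock * stock / capacity  # outflow grows with crowding
        stock += dt * (births - deaths)                  # the stock accumulates the net flow
        history.append(stock)
    return history

# Hypothetical parameter values: 160 steps of 0.25 time units cover 40 time units.
history = simulate_population(initial=10.0, growth_rate=0.5, capacity=1000.0, dt=0.25, steps=160)

for step in (0, 20, 40, 80, 160):
    print(f"t = {step * 0.25:5.1f}: population {history[step]:8.1f}")
```

The feedback loop is visible in the code: the stock drives both flows, and the flows change the stock. "What if" questions then amount to changing a parameter and rerunning.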