Model building is Lego, not Playmobil. (toward understanding statistical workflow)

John Seabrook writes:

Socrates . . . called writing “visible speech” . . . A more contemporary definition, developed by the linguist Linda Flower and the psychologist John Hayes, is “cognitive rhetoric”—thinking in words.

In 1981, Flower and Hayes devised a theoretical model for the brain as it is engaged in writing, which they called the cognitive-process theory. It has endured as the paradigm of literary composition for almost forty years. The previous, “stage model” theory had posited that there were three distinct stages involved in writing—planning, composing, and revising—and that a writer moved through each in order. To test that theory, the researchers asked people to speak aloud any stray thoughts that popped into their heads while they were in the composing phase, and recorded the hilariously chaotic results. They concluded that, far from being a stately progression through distinct stages, writing is a much messier situation, in which all three stages interact with one another simultaneously, loosely overseen by a mental entity that Flower and Hayes called “the monitor.” Insights derived from the work of composing continually undermine assumptions made in the planning part, requiring more research; the monitor is a kind of triage doctor in an emergency room.

This all makes sense to me. It reminds me of something I tell my students, which is that “writing is non-algorithmic,” which isn’t literally true—everything is algorithmic, if you define “algorithm” broadly enough—but which is intended to capture the idea that when writing, we go back and forth between structure and detail.

Writing is not simply three sequential steps of planning, composing, and revising, but I still think that it’s useful when writing to consider these steps, and to think of Planning/Composing/Revising as a template. You don’t have to literally start with a plan—your starting point could be composing (writing a few words, or a few sentences, or a few paragraphs) or revising (working off something written by someone else, or something written earlier by you)—but at some point near the beginning of the project, an outline can be helpful. Plan with composition in mind, and then, when it’s time to compose, compose being mindful of your plan and also of your future revision process. (To understand the past, we must first know the future.)

But what I really wanted to talk about today is statistical analysis, not writing. My colleagues and I have been thinking a lot about workflow. On the first page of BDA, we discuss these three steps:
1. Model building.
2. Model fitting.
3. Model checking.
And then you go back to step 1.

That’s all fine, it’s a starting point for workflow, but it’s not the whole story.

As we’ve discussed here and elsewhere, we don’t just fit a single model: workflow is about fitting multiple models. So there’s a lot more to workflow; it includes model building, model fitting, and model checking as dynamic processes where each model is aware of others.

Here are some ways this happens:

– We don’t just build one model, we build a sequence of models. This fits into the way that statistical modeling is a language with a generative grammar. To use toy terminology, model building is Lego, not Playmobil.

– When fitting a model, it can be helpful to use fits from other models as scaffolding. The simplest idea here is “warm start”: take the solution from a simple model as a starting point for new computation. More generally, we can use ideas such as importance sampling, probabilistic approximation, variational inference, expectation propagation, etc., to leverage solutions from simple models to help compute for more complicated models.

– Model checking is, again, relative to other models that interest us. Sometimes we talk about comparing model fit to raw data, but in many settings any “raw data” we see have already been mediated by some computation or model. So, more generally, we check models by comparing them to inferences from other, typically simpler, models.

Another key part of statistical workflow is model understanding, also called interpretable AI. Again, we can often best understand a fitted model by seeing its similarities and differences as compared to other models.

Putting this together, we can think of a sequence of models going from simple to complex—or maybe a network of models—and then the steps of model building, inference, and evaluation can be performed on this network.

This has come up before—here’s a post with some links, including one that goes back to 2011—so the challenge here is to actually do something already!

Our current plan is to work through workflow in some specific examples and some narrow classes of models and then use that as a springboard toward more general workflow ideas.

P.S. Thanks to Zad Chow for the adorable picture of workflow shown above.

12 thoughts on “Model building is Lego, not Playmobil. (toward understanding statistical workflow)

  1. Does anyone have any suggested reading on using simple models as priors or starting guess values for more complex models? I suggested this a while back to a collaborator but didn’t have concrete suggestions and dropped it.

  2. The Bayes workflow book draft already has examples based on course material from Jonah and Andrew. Also, Andrew’s putting example case study for Stan is an example of model expansion based on model evaluation.

    Andrew originally hired me and Matt Hoffman not to work on Stan, but to work on workflow for hierarchical regression-like models. We were busy trying to come up with a language that was more expressive than lme4’s (which can’t express many of the models in Gelman and Hill’s regression book) and would allow this kind of model exploration. After struggling a couple moneths, we gave up. The problem you immediately run into is that (a) regression notation is very limiting [hence elaborate extensions like SEM], and (b) the combinatorics of automatic model exploration are absurd. If you have 10 predictors, as many of Andrew’s poli sci models do, that’s a whole lot of possible interaction terms—45 binary, 120 ternary, 210 quaternary, and so on. But the real problem is that you’re selecting a subset. We can’t ever consider all 2^45 possible subsets of binary predictors. We also didn’t have any special computation in mind for hierarchical models, though Matt was very interested in variational approaches, especially stochastic ones (he also wrote the stochastic gradient variational inference paper while a postdoc with Andrew). So we leaned into the inventor’s paradox and decided to tackle harder problems—a more general language and more general sampler. We started with autodiff implementations of HMC in C++, then I quickly realized we could build a BUGS-like language on top of that and Matt realizedthe problem wasn’t that GIbbs implementations were too slow, but that we needed something like HMC if only we cold automatically tune it.

    I do believe the neural network models offer advantages in that they learn both non-linearities and interactions. It’s not as automatic as some of the more gee-whiz sales pitches, but it’s a lot more automatic than typing formulas into lme4.

    @James and Andrew: warm starting on the parameters doesn’t help much. Stan converges pretty quickly to the typical set (75 iterations is the default, but it often only takes 10 or 20 iterations). That’s counterintutive for those coming from BUGS, but HMC is very good at directed search when it starts in the tails. It’s the exploration in the typical set for estimating adaptation parameters in Stan that takes a long time during warmup. So any warm start needs to also do someting about adaptation. We were thinking about this at the point we were thinking about discrete parameters, because as discrete parameters change during updates (say with block Gibbs or Metropolis), then the adaptation for HMC also has to update. I was really hoping the PyMC folks could evaluate this since they have the toolkit.

    This post had a lot of triggers! My mom’s dissertation was based on Linda Flower’s work and extended it to heterogeneous populations with different approaches to writing (mainly around whether they worked with or without a global plan and outline). A very Andrew idea! Flower was also down the hall from me when I was a professor at Carnegie Mellon. As I recall, the English department was so split between people who read and wrote and those who did continental philosophy that they couldn’t even have a joint department meeting!

    The fact that model building is Lego and we all have the same components is why all of our models wind up looking the same. The first thing I did with Andrew (before moving to Columbia) was fit some noisy measurement models for crowdsourcing. Turns out Phil Dawid had already come up with that model in the 1970s because he was using the same Lego bricks.

    • In the problems I’ve had where Stan couldn’t really do a great job… it was often that I couldn’t even get past warmup… But that model probably had on the order of 10-20k parameters (a handful for each public use microdata area in the census, and there are several thousand of those, plus several hierarchical parameters that connected all the pumas together).

      That was ~ 2 years ago, perhaps Stan is much better today.

      One thing I think is a good thing is to use optimizing to find a starting point. Sure the optimal location isn’t in the typical set, but it *is* in the high probability density set, and moving away from the optimum moves you towards the typical set pretty much no matter what direction you go… A big problem with my complex models was that Stan would start in la-la-land and decide that in la-la-land it needed to take MICROSCOPIC timesteps, and it would never get anywhere.

      • Daniel,

        Bob has a really cool idea for warmup that combines optimization and Nuts to rapidly get into the typical set. I think he’s done some experiments on it but he (and others) keep getting distracted with other Stan things, so it’s still in the speculative stage. I have high hopes that it could help solve the problems you’re talking about here. More generally, we’ve been thinking about better integration of computation into workflow, for example the idea that if your model isn’t working, it’s better to learn this right away rather than running your algorithms for several hours.

      • Stan’s a bit better, but not a lot better at adapation than it was two years ago.

        That’s a good point—we can run into problems where the tails are stiff and the body of the distribution is relatively easy to sample. The only thing we’ve been able to do in those cases is (a) find better initializations, (b) reparameterize so all parameters are on unit scale, and (c) reparameterize so parameters are less correlated.

        In the specific kinds of models you’re talking about, where there are multilevel intercepts, the problem is usually only very weak identifiability. It’s what Andrew calls the Folk Theorem. The model just isn’t well determined by the data. In those cases, we’re often uncertain about hierarchical parameters and wind up with high funnel-like curvature when they’re low variance. This is another case where zero-avoid priors and things like that can help.

        I don’t know if that’s also a siilar problem of sampling near the mode. It’s in an energy well which can be pretty deep. I know PyMC does that. The real problem is that most multilevel models we use don’t have posterior modes.

        Often if conditioning is bad for sampling, it’ll be bad for optimization too. On the other hand, it’s scalable enough to use quasi-Newton methods in sampling because we don’t need to maintain detailed balance. So there can be a win here. The real win will be if we can figure out how to condition for local geometry without some horrendous cubic penalty in number of parameters.

        • In my problem, when I did optimization and/or ADVI to get initial points, the sampling usually worked well in about 3 out of 4 chains, but would get stuck in the 4th… I’m sure it was a tricky model to sample, but whichever chains did work… they tended to work mostly fine. The problem was consistently getting all chains to work fine, and convincing myself that it wasn’t just a fluke that the appeared to work fine…

          I think a bunch of techniques have been created to improve the diagnostics etc, like your rank plot ideas and soforth. Probably I should return to that model and see if I can make it work.

        • If one chain gets stuck, it may be beause it’s the only one that found the high curvature part of the posterior. In those cases, we need to increase target acceptance rate to lower step size or usually, reparameterize.

          Getting stuck can also be from adaptation going astray. We’re introducing a parallel adaptation mode that’ll combine information cross-chain to try to get around this kind of bad adaptation in one chain.

        • When one or two chains are stuck, we’ve also been talking about using stacking to average over chains. Lowering the step size etc. can work, but if you’ve already run Stan and you find that one chain has not mixed, you can use stacking to mix it in with the others and get an approximate inference.

          We’ve also been talking about being able to run Stan for an (approximately) fixed amount of time per chain rather than a fixed number of iterations. That would help with this sort of problem, but it requires some programming effort so it has to wait on line behind various other Stan improvements. Stacking you can do right away with existing technology.

  3. Interesting post. Three comments:
    1. The data analysis pipeline is wider in scope than what is implied here. A nice platform for documenting and sharing a data analysis pipeline is openml.org, see: Vanschoren J, van Rijn JN, Bischl B, and Torgo , OpenML: networked science in machine learning. SIGKDD Explorations 15(2), pp 49-60, 2013.
    2. The data analysis pipeline is also feeding the current replicability crisis. See:Botvinik-Nezer R, Holzmeister F et al, Variability in the analysis of a single neuroimaging dataset by many teams, bioRxiv, 2019, https://www.biorxiv.org/content/10.1101/843193v1
    3. Starting in 2010, I worked for 8 years on a “Note on the theory of Applied Statistics” (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2171179). The “note” reached version 18 with inputs from many statisticians. At some point I let it rest. I am glad to see that Andrew adopted a similar terminology in https://statmodeling.stat.columbia.edu/2017/05/26/theoretical-statistics-theory-applied-statistics-think/. It is indeed time for a strong theory of applied statistics. MRP is surely part of it. Yesterday’s and today’s conference triggered by a suggestion from Andrea was excellent, thank you Lauren and org team!.

  4. I forgot to mention that the cat picture’s the wrong way up. It needs to be rotated left (-pi/2 for the math geeks and a bit further for the photo geeks who want to see that platform parallel to the ground).

    Given it seems so obvious, I assume Andrew put it that way on purpose as a joke.

Leave a Reply to Bob Carpenter Cancel reply

Your email address will not be published. Required fields are marked *