Forking paths vs. six quick regression tips

Bill Harris writes:

I know you’re on a blog delay, but I’d like to vote to raise the odds that my question in a comment to http://statmodeling.stat.columbia.edu/2015/09/15/even-though-its-published-in-a-top-psychology-journal-she-still-doesnt-believe-it/ gets discussed, in case it’s not in your queue.

It’s likely just my simple misunderstanding, but I’ve sensed two bits of contradictory advice in your writing: fit one complete model all at once, and fit models incrementally, starting with the overly small.

For those of us who are working in industry and trying to stay abreast of good, current practice and thinking, this is important.

I realize it may also not be a simple question.  Maybe both positions are correct, and we don’t yet have a unifying concept to bring them together.

I am open to a sound compromise.  For example, I could imagine the need to start with EDA and small models but hold out a test set for one comprehensive model.  I recall you once wrote to me that you don’t worry much about holding out data for testing, since your field produces new data with regularity.  Others of us aren’t quite so lucky, either because data is produced parsimoniously or the data we need to use is produced parsimoniously.

Still, building the one big model, even after the discussions on sparsity and on horseshoe priors, can sound a bit like http://statmodeling.stat.columbia.edu/2014/06/02/hate-stepwise-regression/, http://statmodeling.stat.columbia.edu/2012/10/16/bayesian-analogue-to-stepwise-regression/, and http://statmodeling.stat.columbia.edu/2013/02/11/toward-a-framework-for-automatic-model-building/, although I recognize that regularization can make a big difference.

Thoughts?

My reply:

I have so many things I really really must do, but am too lazy to do.  Things to figure out, data to study, books to write.  Every once in a while I do some work and it feels soooo good.  Like programming the first version of the GMO algorithm, or doing that simulation the other day that made it clear how the simple Markov model massively underestimates the magnitude of the hot hand (sorry, GVT!), or even buckling down and preparing R and Stan code for my classes.  But most of the time I avoid working, and during those times, blogging keeps me sane.  It’s now May in blog time, and I’m 1/4 of the way toward being Jones.

So, sure, Bill, I’ll take next Monday’s scheduled post (“Happy talk, meet the Edlin factor”) and bump it to 11 May, to make space for this one.

And now, to get to the topic at hand: Yes, it does seem that I give two sorts of advice, but I hope they are complementary, not contradictory.

On one hand, let’s aim for hierarchical models where we study many patterns at once.  My model here is Aki’s birthday model (the one with the graphs on the cover of BDA3), where, instead of analyzing just Valentine’s Day and Halloween, we looked at all 366 days at once, also adjusting for day of week in a way that allows that adjustment to change over time.
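For concreteness, here is a minimal sketch of the “many patterns at once” idea in R with lme4. It is not Aki’s actual model (that one decomposes the series with Gaussian processes and lets the day-of-week adjustment drift over time), and the data frame births and its columns are hypothetical names:

    library(lme4)

    # Crude stand-in for the birthday model: partially pooled intercepts for
    # every day of the year and for day of week, estimated jointly rather than
    # testing Valentine's Day or Halloween in isolation.
    fit <- lmer(n_births ~ 1 + (1 | day_of_year) + (1 | day_of_week),
                data = births)  # hypothetical data frame and column names

    ranef(fit)$day_of_year  # a shrunken estimate for each of the 366 days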

On the other hand, we can never quite get to where we want to be, so let’s start simple and build our models up.  This happens both within a project—start simple, build up, keep going until you don’t see any benefit from complexifying your model further—and across projects, where we (statistical researchers and practitioners) gradually get comfortable with methods and can go further.

This is related to the general idea we discussed a few years ago (wow—it was only a year ago, blog time flies!), that statistical analysis recapitulates the development of statistical methods.

In the old days, many decades ago, one might start by computing correlation measures and then move to regression, adding predictors one at a time.  Now we might start with a (multiple) regression, then allow intercepts to vary, then move to varying slopes.  In a few years, we may internalize multilevel models (both in our understanding and in our computation) so that they can be our starting point, and once we’ve chunked that, we can walk in what briefly will feel like seven-league boots.
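To make that progression concrete, here is a minimal R sketch, assuming a hypothetical data frame d with outcome y, predictors x1 and x2, and a grouping variable group:

    library(lme4)

    # Step 1: ordinary multiple regression, complete pooling across groups.
    m1 <- lm(y ~ x1 + x2, data = d)

    # Step 2: let the intercept vary by group (partial pooling).
    m2 <- lmer(y ~ x1 + x2 + (1 | group), data = d)

    # Step 3: let the slope of x1 vary by group as well.
    m3 <- lmer(y ~ x1 + x2 + (1 + x1 | group), data = d)

    # Keep complexifying only while it buys you something.
    anova(m2, m3)

In a few years the natural starting point might be m3, or its full-Bayes analogue in Stan, rather than m1.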

Does that help?

13 thoughts on “Forking paths vs. six quick regression tips”

  1. As I understand it, the contradiction disappears once we understand that the Garden of Forking Paths criticism and Build Many Models advice are related to different inference procedures.

    GoFP is all about interpreting p-values and hypothesis-testing inferences. P-values can be rendered uninterpretable if the analysis is contingent on the data at hand. The reason is the heart of the GoFP: even if we run only one analysis, if that analysis was chosen because the data had some characteristic that led us to use this test and not another, then the p-value is conditional on the obtained data. But inference from p-values treats the data at hand only as a (concrete) replication from many (hypothetical) experiments. So if we change our analysis each time we replicate, the p-value we compute from the data has no relation to the distribution of the test statistic, because the procedure changes with each iteration. How can we make a binary decision (reject/don’t reject) based on a procedure that assumes replications of the same procedure, when that procedure might change at each iteration? (The toy simulation at the end of this comment illustrates the point.)

    BMM advice is about combining all available information to compute posterior distributions. We start simple, building prior distributions from the information we have at hand and a likelihood function that makes sense for the problem, and then inferring the posterior distribution of parameters or other quantities of interest. This way, we build some understanding of the problem using the available data, and we can add to the model to take more subtle relations into account. As I understand it, BMM advice has its problems, too: if we don’t know the true model (but when do we really know it?), we should take our uncertainty in the model definition into consideration; otherwise our final model has narrower posterior distributions than it should. But we do not make binary decisions with them based on statistical significance – we describe the posterior distribution to give a sense of parameter values and how uncertain we are about them.
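    Here is a toy simulation of the GoFP point (a hypothetical sketch in R, not from the original discussion): the null is true everywhere and only one test is run per dataset, but which outcome gets tested depends on what the data look like, so rejections pile up well beyond the nominal 5%.

        set.seed(123)
        p_chosen <- replicate(10000, {
          x  <- rnorm(20)  # "treatment" group; the null is true
          y1 <- rnorm(20)  # control, outcome 1
          y2 <- rnorm(20)  # control, outcome 2
          # Data-dependent forking path: test whichever outcome looks more different.
          if (abs(mean(x) - mean(y1)) > abs(mean(x) - mean(y2))) {
            t.test(x, y1)$p.value
          } else {
            t.test(x, y2)$p.value
          }
        })
        mean(p_chosen < 0.05)  # noticeably larger than 0.05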

  2. Thanks, Andrew. In some ways, this at first felt underwhelming–instead of a big, grand theory, it was a simple “They really do work together.” After a bit of reflection, that seemed not only okay but quite good; both things you have been saying still seem true, they do hang together, and it simply takes a bit of thought to blend them.

    Erikson sees your advice as focusing on different inference procedures; I rather see it as looking at different purposes. The garden warning focuses on inference (design), while the build-many-models approach focuses on construction. That difference points, I think, to how you can blend them safely. With a really bad analogy: one wants to build complete airplanes (symmetric wing structure, fuselage, propulsion system, and the like), but one assembles airplanes in an order that makes manufacturing sense and that lets you verify that each step was completed successfully: answer "are the wings and the fuselage tightly fastened?" before "will this entire airplane fly?" If you don’t do any testing before trying to fly, you may not get many willing test pilots, and you may have a really tough job detecting whether things are tied together properly, because many of the early steps may have been hidden by later steps.

    But the garden lingers: how much of that airplane, that model, does one need to design, document, and review with others before starting to build? Does EDA now become finding (most) /all/ the variables that might contribute, sparsity be damned (at least at the design level; that will come out at the implementation level), instead of finding the key variables that provide insight? Is it perhaps that EDA could be seen as the divergent phase of analysis, where one is trying to look more broadly, while CDA is the convergent phase? I guess I’ve often looked at EDA as a convergent phase that takes messy, real-world data and begins to make sense of it by simplifying, which I think is also true.

    I think I hear you writing the old IBM one-word motto in more words.

    I’m waiting for Thursday’s contribution, too; it sounds related.

  3. It’s also worth mentioning, as BDA3 does: we don’t necessarily start with a simple model. We start with what we know, and we build up _or_ down following our assessments of its fit to data. Oftentimes, starting with an already “complete” hierarchical model can teach us to simplify, not complexify.

    • I think Dustin’s point is a really good one. Most researchers that I know start with the model that’s built from theory or the existing literature (this is Dustin’s “what we know” model), and then adjust from there depending on how well the model fits the data.

      • While this is certainly a valid approach, it is important to remember that it requires large amounts of data rather than the small samples typically seen in, for example, psychology.

    • Andrew has said multiple times that p-values may be a useful summary statistic. He does not say “no p-values”. AFAICT, his current thought is that dichotomous hypothesis testing and significance testing (or, most commonly, the hybrid of the two) are a misuse of p-values. Perhaps he would also agree that the usefulness of p-values has been overstated to the point of mass insanity, but not “no p-values”.

      Also, I am curious as to why you think “The Gelman View” is only about “Social Science” or “Social Statistics”.

      • Well, I said it both ways (i.e., as quantitative social science and as social statistics) in the post. I guess my intention was to say, “what can social scientists who aren’t statistical experts but are still engaged in quantitative work learn from Andrew Gelman?”

        • But why do you think it is only applicable to *social* scientists?

          >”The purpose of social statistics is to describe and understand variation in the world. The world is a complicated place, and we shouldn’t expect things to be simple.”

          Do you think that non-social statistics have a different purpose? If so, what is it?

          >”Social scientists should look to build upon a broad shared body of knowledge, not to “own” a particular intervention, theoretic framework, or technique. Such ownership creates incentive problems when the intervention, framework, or technique fail and the scientist is left trying to support a flawed structure.”

          Do you think that non-social scientists should look to do something different? If so, what is it?

        • I think you have it backwards.

          The problem is that social scientists (and others) misunderstood the meaning of a p-value and statistical significance, and so decided to try substituting statistical methods for the scientific method. To wit: careful classification of phenomena, development of the experimental conditions required to get consistent observations, independent replication, quantitative theory development, and comparison of precise predictions with observation. I’d recommend reading papers in your area of interest from before ~1940. You will probably be very surprised at how good they are, especially given the technological constraints researchers were under at the time.

          Medicine/biology are most definitely prone to the same problems as sociology/psych (although it is true that more often you see a precise prediction like “if theory T is true, we should get observation O of DNA sequence GTACAAA…CTCG”). Also, this claim is dubious: “we’ve made a lot of progress in genetics thanks to genetically-identical mice”. First, most genetics is done at the cell level or with flies. Second, inbred mice are not identical (although they may be more similar to one another than to other mice). Third, I doubt even the claim of “genetically identical” would stand up to scrutiny, e.g.: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC60876/

        • Yes, I made a similar point about medicine. It’s not just that people are all different, but any given disease may be more or less of a single thing that can be systematically studied and against which progress can be made. Yersinia pestis, maybe yes, cancer, maybe no.

          I’m not sure if I disagree with your other points, or if they are necessarily in conflict with the lessons I said I took from Andrew’s blog.

          >”One large and related advantage of natural science over social science in advancing our understanding of the world is not just the relative durability of natural science theories but the relative constancy of those theories’ parameters, the constancy of the constants.”

          My main point is that the methods currently used by “social scientists” are different from those shown to be successful in the “natural sciences”, and that these alternative methods have no hope of ever figuring out any constants.

          Perhaps the results have been disappointing because researchers in that area have decided to deviate from the scientific method. It remains to be seen whether there is really something about the nature of the subject that requires a different approach; the approach with a proven track record hasn’t been tried yet.
