What can be our goals, and what is too much to hope for, regarding robust statistical procedures?

Gael Varoquaux writes:

Even for science and medical applications, I am becoming weary of fine statistical modeling efforts, and believe that we should standardize on a handful of powerful and robust methods.

First, analytic variability is a killer, e.g. in “standard” analysis for brain mapping, for machine learning in brain imaging, or more generally in “hypothesis driven” statistical testing.

We need weakly-parametric models that can fit data as raw as possible, without relying on non-testable assumptions.

Machine learning provides these, and tree-based models need little data transformation.

We need non-parametric model selection and testing, that do not break if the model is wrong.

Cross-validation and permutation importance provide these, once we have chosen input (exogenous) and output (endogenous) variables.

If there are fewer than a thousand data points, all but the simplest statistical questions can and will be gamed (sometimes unconsciously), partly for lack of model selection. Here’s an example in neuroimaging.

I [Varoquaux] no longer trust such endeavors, including mine.

For thousands of data points and moderate dimensionality (99% of cases), gradient-boosted trees provide the necessary regression model.

They are robust to data distribution and support missing values (even outside MAR settings).

For thousands of data points and large dimensionality, linear models (ridge) are needed.

But applying them without thousands of data points (as I tried for many years) is hazardous. Get more data, or change the question (e.g., analyze across cohorts).

Most questions are not about “prediction”. But machine learning is about estimating functions that approximate conditional expectations / probability. We need to get better at integrating it in our scientific inference pipelines.

My reply:

There are problems where automatic methods will work well, and problems where they don’t work so well. For example, logistic regression is great, but you wouldn’t want to use logistic regression to model Pr(correct answer) given ability, for a multiple choice test question where you have a 1/4 chance of getting the correct answer just by guessing. Here it would make more sense to use a model such as Pr(y=1) = 0.25 + 0.75*invlogit(a + bx). Of course you could generalize and then say, perhaps correctly, that nobody should ever do logistic regression; we should always fit the model Pr(y=1) = delta_1 + (1 – delta_1 – delta_2)*invlogit(a + bx). The trouble is that we don’t usually fit such models!
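The guessing-floor model above can be written out in a few lines. This is only a minimal sketch, not code from the post: the function names `invlogit` and `pr_correct`, the argument names, and the example values of a and b are illustrative assumptions.

```python
import math

def invlogit(z):
    """Inverse logit, computed in a form that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def pr_correct(x, a, b, delta_1=0.25, delta_2=0.0):
    """Pr(y=1) = delta_1 + (1 - delta_1 - delta_2) * invlogit(a + b*x).

    delta_1 is the floor (chance of a lucky guess) and delta_2 is the gap
    below 1 at the top. Setting both to 0 gives plain logistic regression;
    the defaults give the 1/4-guessing model from the text.
    """
    if delta_1 < 0 or delta_2 < 0 or delta_1 + delta_2 >= 1:
        raise ValueError("need delta_1 >= 0, delta_2 >= 0, delta_1 + delta_2 < 1")
    return delta_1 + (1.0 - delta_1 - delta_2) * invlogit(a + b * x)

# Hypothetical item with a = 0 and b = 1 (illustrative values only).
for ability in (-10.0, 0.0, 10.0):
    print(ability, pr_correct(ability, a=0.0, b=1.0))
```

With these defaults the curve flattens out near 0.25 for very low ability instead of near 0, which is the behavior a plain logistic regression cannot represent.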

So I guess the point is that we should keep pushing to make our models more general. What this often means in practice is that we should be regularizing our fits. One big reason we don’t always fit general models is that it’s hard to estimate a lot of parameters using least squares or maximum likelihood or whatever.
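As one concrete picture of what regularizing a fit does, here is a hedged sketch of a ridge-penalized slope. It is not a method from the post: it assumes a single predictor with no intercept so that the closed form fits on one line, and the name `ridge_slope` and the penalty `lam` are hypothetical.

```python
def ridge_slope(xs, ys, lam=0.0):
    """Slope b minimizing sum((y - b*x)**2) + lam * b**2, with no intercept.

    lam = 0 gives ordinary least squares; larger lam shrinks b toward zero.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if sxx + lam == 0:
        raise ValueError("slope is not identified: all x are zero and lam is 0")
    return sxy / (sxx + lam)

# Toy data (made up for illustration): y = 2x exactly.
xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]

print(ridge_slope(xs, ys))           # least-squares slope
print(ridge_slope(xs, ys, lam=2.0))  # shrunk toward zero by the penalty
```

The penalty trades a little bias for stability, which matters most when there are many parameters relative to the amount of data.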

I agree with your statement that “we should standardize on a handful of powerful and robust methods.” Defaults are not only useful; they are also in practice necessary. This also suggests that we need default methods for assessing the performance of these methods (fit to existing data and predictive power on new data). If users are given only a handful of defaults, then these users—if they are serious about doing their science, engineering, policy analysis, etc.—will need to do lots of checking and evaluation.

I disagree with your statement that we can avoid “relying on non-testable assumptions.” It’s turtles all the way down, dude. Cross-validation is fine for what it is, but we’re almost always using models to extrapolate, not to just keep on replicating our corpus.
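The gap between cross-validation and extrapolation can be seen in a toy example. The sketch below is an illustration under stated assumptions, not an analysis from the post: the "true" curve y = x^2, the noise-free data on [0, 1], the straight-line model, and the extrapolation point x = 3 are all made up for the purpose.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loo_cv_rmse(xs, ys):
    """Leave-one-out cross-validated root mean squared error of the line fit."""
    sq_errs = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        sq_errs.append((ys[i] - (a + b * xs[i])) ** 2)
    return (sum(sq_errs) / len(sq_errs)) ** 0.5

# Hypothetical data: a curved truth observed only on [0, 1].
xs = [i / 10 for i in range(11)]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
cv_error = loo_cv_rmse(xs, ys)
x_new = 3.0  # well outside the observed range
extrapolation_error = abs(x_new ** 2 - (a + b * x_new))

print("leave-one-out RMSE inside [0, 1]:", cv_error)
print("absolute error at x = 3:", extrapolation_error)
```

Cross-validation reports a small error because every held-out point lies inside the range of the training data; it says nothing about how the straight line behaves at x = 3, where the assumption of linearity is doing all the work.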

Finally, it’s great to have thousands, or millions, or zillions of data points. But in the meantime we need to learn and make decisions from the information that we have.


  1. Shane says:

    In practice, do people often fit simple logistic regression models to estimate theta, when guessing, item difficulty, and item discrimination are potential issues? I’m likely biased due to my field of research (Education), but most of the psychometric latent trait models I’ve seen that aim to estimate theta take an Item Response Theory approach which accounts for those factors. I’ve even seen it done using Stan!

    • Jk says:

      I do. My issue with the example of the mc question is that if people truly guess, your test is unsuitable for the population or the question is very badly written. People don’t typically guess when they know something about the subject.

    • klint says:

      Depending on the data in hand, even if a 3PL were to be the real DGP, estimating that many item parameters can be a total mess.

  2. Anonymous says:

    I’ve always thought this comes down to whether the analyst has good model priors and whether they are good at updating those priors through their workflow. If so, then the analyst can incorporate outside knowledge—like in Andrew’s example—and get a model closer to the true data generating process than an algorithm would find. Untestable model assumptions can be very helpful.

    Although, I bet an algorithm could reliably beat the typical analysis plagued with unrealistic effect sizes and forked paths.

  3. Matt Skaggs says:

    “we should standardize on a handful of powerful and robust methods”

    I have made the same point numerous times on this blog, with nothing but pushback from other commenters. It is not being argued that specialized approaches are not needed at all, but rather that most of the questions being posed can be answered by established methods, or should be shoehorned into a form that can be.

    Endless bickering over whether the statistical approach is appropriate is not good for science.

    • Zhou Fang says:

      It’s all well and good saying “we should standardize on a handful of powerful and robust methods”, but do we have agreement about which methods that would be? Varoquaux proposes tree based methods, but in my experience, those frequently perform terribly…

      • Matt Skaggs says:

        I am very curious as to how your tree-based methods blow up or why you feel they are inadequate. Care to share?

        • Zhou Fang says:

          They just typically seem really slow, and perform poorly in cross validation. Possibly I could work harder to tune the method, but the slowness discourages that…

          • Olav says:

            Try extreme gradient boosting. If you don’t want to do any tuning, just use method “xgbTree” in caret with the default tuning parameters. It performs really well in my experience. That said, there are definitely situations where regression works better than tree-based methods, no matter how much tuning you do.

    • Michael Nelson says:

      Counterargument: If using a standard set of powerful, robust methods for modeling were the best way to get answers from data, then the model for predicting returns would be identical at every investment firm, because otherwise they’d guarantee a net loss relative to those who did use that model. (Setting aside insider trading or very niche funds.) Since we know this isn’t the case, it would be easy for any right-minded statistician to make a killing on Wall Street!

      Limiting most analyses to a standard set does more to improve the efficiency of science than it does to maximize the probability of making the most accurate descriptions/predictions possible with a dataset. As scientists, we could form two groups: those who prioritize efficiency and those who do not. Of course, the former would still have to be able to explain the successes of the latter, forcing them to deviate from their standard set, so…

      You’re right that bickering endlessly is not good–we need to argue toward an end. As methodological engineers, our arguments have a clear end: to establish that one approach is (or is not) the best approach in a particular case, either mathematically or empirically. Once that is done, just stop bickering.

  4. Michael Nelson says:

    Your correspondent’s proposal suggests a paradox. Suppose your analytical software conducts the analysis that maximizes the principles of good science advocated by Gael. Not all of these priorities can be achieved equally without cost to others, but let’s suppose there’s a default for prioritizing the priorities, too. Here’s the paradox: the software does not put any weight on the priority of getting the most correct conclusion possible from our data (or “most useful conclusion,” or “conclusion that uses the most information from the data…”), either in the present case or in the long run. It would be nice to think that fulfilling these procedural requirements would naturally lead to the best answer, but it is provably not so. I, for one, would happily make a trade-off of a little robustness, or using less orthodox methods, in return for a much higher probability of approximating truth.

    We do not need to sacrifice flexibility to ensure quality, especially when doing so may put a ceiling on how much we can learn from our data. There is another way: open peer review, post-publication or otherwise. Informed readers can spot bad methods or fishy claims and then communicate those critiques in a way, and through a venue, that facilitates revisions, either by the authors for the study at hand or by others in the ordinary, iterative scientific method.

    • Andrew et al,

      Apologies for introducing a topic that may not be related to this specific topic. But what are your thoughts on the ‘universal testing’ of COVID-19? I ask because we are trying to ascertain the potential of asymptomatic & pre-symptomatic spread. I would be grateful if any of you weighed in.
      Thank you!

  5. david rothman says:

    lol. i forwarded this post around to some former wall street pals with the Subj: “us 20+ years ago” wrt building cointegration based stat arb models with robust, well, everything. we did nothing standard because every time we tried, things broke down (or were contradictory) with the simplest changes in rolling window time choices, in parameter values, in VECM lag value choices (what a minefield that was). ultimately it came down to choices to satisfice the problem and a massive out of sample testing regime to hope to gain a significant tradeable. was fun while it lasted. what has made me chuckle a lot over the years is that the things we were knee deep in then (PCA, Nelder Mead, orthog LS, genetic algs, nearest neighbor etc etc etc) are veritable rock stars of the past 10 years to the ML guys (so to speak :-).
