Interpreting apparently sparse fitted models

Hannes Margraf writes:

I would like your opinion on an emerging practice in machine learning for materials science.

The idea is to find empirical relationships between a complex material property (say, the critical temperature of a superconductor) and simple descriptors of the constituent elements (say, atomic radii and electronegativities). Typically, a longish list of these ‘primary’ descriptors is collected. Subsequently, a much larger number of ‘derived’ features is generated by combining the primary ones with some non-linear functions. So primary descriptors A and B can be combined to yield exp(A)/B and many other combinations. Finally, lasso or similar techniques are used to find a compact linear regression model for the target property (a*exp(A)/B + b*sin(C) or whatnot).

The main application of this approach is to quite small datasets (e.g. 10-50 datapoints). I’m kind of unsure what to think of this. I would personally just use some type of regularized non-linear regression with the primary features here (e.g. GPR). Supposedly, the lasso approach is more interpretable though because you can see what features get selected (i.e. how the non-linearity is introduced). But it also feels very garden-of-forking-paths-like to me.

I know that you’ve talked positively about lasso before, so I wonder what your take on this is.
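
To fix ideas, here’s roughly what the pipeline Margraf describes looks like in code. This is only a sketch: the descriptor values are simulated, the feature names are placeholders, and scikit-learn’s LassoCV stands in for whatever sparse-regression tool these papers actually use.

```python
# Sketch of the "derived features + lasso" pipeline described above.
# Descriptor names and values are made up; only the structure matters.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 30                                      # typical dataset size in this setting
primary = {                                 # 'primary' elemental descriptors
    "A": rng.uniform(0.5, 2.0, n),          # e.g. atomic radius
    "B": rng.uniform(0.8, 4.0, n),          # e.g. electronegativity
    "C": rng.uniform(0.5, 3.0, n),
}
y = rng.normal(size=n)                      # stand-in for the target property

# Blow up the feature space with non-linear transforms and ratios.
derived = {}
for name, x in primary.items():
    derived[name] = x
    derived[f"exp({name})"] = np.exp(x)
    derived[f"sin({name})"] = np.sin(x)
    derived[f"log({name})"] = np.log(x)
for (na, xa), (nb, xb) in combinations(primary.items(), 2):
    derived[f"{na}/{nb}"] = xa / xb
    derived[f"exp({na})/{nb}"] = np.exp(xa) / xb

names = list(derived)
X = StandardScaler().fit_transform(np.column_stack([derived[k] for k in names]))

# Cross-validated lasso then picks a "compact" linear model in the derived features.
fit = LassoCV(cv=5).fit(X, y)
print([(k, round(c, 3)) for k, c in zip(names, fit.coef_) if abs(c) > 1e-8])
```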

It’s hard for me to answer this with any sophistication, as it’s been a long long time since I’ve worked in materials science—my most recent publication in that area appeared 34 years ago—so I’ll stick to generalities. First, lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model. So I’d prefer to think of lasso, horseshoe, etc. as regularizers, not as devices for selecting or finding or discovering a sparse model. To put it another way, I don’t take the shrink-all-the-way-to-zero thing seriously. Rather, I interpret the fitted model as an approximation to a fit with more continuous partial pooling.
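
To illustrate that last point about noise in the chosen model: refit the lasso on bootstrap resamples of the same small dataset and the selected feature set keeps changing, even when the predictions barely move. A quick sketch with simulated data and scikit-learn’s LassoCV:

```python
# Sketch: the lasso's selected feature set jumps around under resampling.
# Simulated data; this just illustrates the "noise in the chosen model" point.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 30, 20
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=n)       # two strongly correlated features
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

selected_sets = []
for _ in range(20):                                 # bootstrap resamples
    idx = rng.integers(0, n, size=n)
    fit = LassoCV(cv=5).fit(X[idx], y[idx])
    selected_sets.append(frozenset(np.flatnonzero(np.abs(fit.coef_) > 1e-8)))

# Typically you see many distinct "sparse models" across the 20 resamples.
print(len(set(selected_sets)), "distinct selected feature sets in 20 resamples")
```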

7 thoughts on “Interpreting apparently sparse fitted models”

  1. The described machine learning approach does sound pretty wild to me.
    IMHO, regularizing priors should be used when you actually believe the predictors can have close to 0 effect on the outcome, not to induce some sparsity.
    But in the example above it is reasonable to assume that all the predictors, e.g. a, exp(a), sin(a), are associated with the outcome; there are just too many of them.

    A possibly useful alternative would be to use something like Projective Inference (https://arxiv.org/abs/1810.02406), which uses a reference model with all predictors to find a subset of predictors that works nearly as well as the large model (a rough sketch of the projection step is below).

    In this way, you avoid using priors that are in conflict with subject matter knowledge.
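
    To make the projection idea concrete, here is a minimal hand-rolled sketch (this is not the projpred package; the RidgeCV reference model and the simulated data are just stand-ins):

    ```python
    # Hand-rolled sketch of the projection predictive idea (not the projpred
    # package): fit a regularized reference model on all predictors, then
    # greedily find a small subset whose least-squares projection of the
    # reference fit (not of the raw data) reproduces that fit almost exactly.
    import numpy as np
    from sklearn.linear_model import RidgeCV, LinearRegression

    rng = np.random.default_rng(2)
    n, p = 40, 15
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 0.7 * X[:, 3] + rng.normal(scale=0.5, size=n)

    ref = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)   # reference model, all predictors
    mu_ref = ref.predict(X)                                   # reference fit to mimic

    def proj_mse(cols):
        """How closely a plain least-squares fit on this subset reproduces the reference fit."""
        sub = LinearRegression().fit(X[:, cols], mu_ref)
        return np.mean((mu_ref - sub.predict(X[:, cols])) ** 2)

    selected, remaining = [], list(range(p))
    for _ in range(3):                                        # grow the subset greedily
        best = min(remaining, key=lambda j: proj_mse(selected + [j]))
        selected.append(best)
        remaining.remove(best)
        print("subset:", selected, "projection MSE:", round(proj_mse(selected), 4))
    ```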

  2. > Interpretability is bought at the price of noise—not in the prediction, but in the chosen model.
    Unfortunately, “interpretable” is interpreted to mean many different things, but don’t you mean coarsening? You are coarsening a continuous partial-pooling model to get a no-pooling model on a subset of predictors, and that adds noise, just as coarsening data does.

    In this brief blog post (https://www.statcan.gc.ca/en/data-science/network/decision-making), the degree of interpretability is taken simply as how easily the user can grasp the connection between the input data and what the ML model would predict. Erasmus et al. (2020) provide a more general and philosophical view. Rudin et al. (2021) avoid trying to provide an exhaustive definition by instead providing general guiding principles to help readers avoid common, but problematic, ways of thinking about interpretability.

  3. Why would anyone turn over a modest-size dataset to an arbitrary black-box algorithm? Why not engage in some actual data analysis? Plot your data; examine it for outliers, clusters, nonlinearity, heteroskedasticity, etc. Fix the features that can affect your model: re-express, set aside or correct outliers. Then fit a simpler model (a bare-bones sketch of this workflow follows the lists below). Along the way you may find that a stray point or unexpected grouping is meaningful and thereby learn more than any blind fit could tell you.
    Downsides to this approach:
    1) You must THINK
    2) You must take responsibility for what you do–you can’t “blame the computer”
    3) You might conclude that there is no plausible model “good enough” to meet your needs
    4) You shouldn’t believe related hypothesis tests–especially if they are just barely significant– because you have been messing in the data. But you shouldn’t have believed them anyway since you couldn’t form a null hypothesis when you had no model to begin with–and data analysis is not (or should not be) about testing anyway.
    Upsides to this approach:
    1) You are much more likely to understand your data and any phenomena behind them
    2) You are much more likely to find a simple model (and Occam’s razor says that’s more likely to be scientifically useful)
    3) You may discover the unexpected in your data (and, as Tukey says, you should expect that)
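
    Here is the bare-bones sketch promised above, with made-up data: plot, notice curvature, re-express, flag a stray point, then fit something simple (statsmodels and matplotlib assumed).

    ```python
    # Bare-bones "look at the data first" workflow with made-up numbers:
    # plot, notice curvature, re-express, flag a stray point, fit a simple model.
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.uniform(1, 10, 25)
    y = 2.0 * np.log(x) + rng.normal(scale=0.2, size=25)
    y[7] += 2.5                                   # one stray point on purpose

    plt.scatter(x, y)
    plt.xlabel("descriptor")
    plt.ylabel("property")
    plt.show()

    # The plot shows curvature, so re-express x and fit a simple model on log(x).
    fit = sm.OLS(y, sm.add_constant(np.log(x))).fit()
    outliers = np.flatnonzero(np.abs(fit.get_influence().resid_studentized_external) > 3)
    print("possible outliers:", outliers)
    print("intercept and slope:", fit.params)
    ```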

    • Sure, most papers in this direction absolutely do the type of analysis you describe. The idea of fitting the sparse models is that they may give you something extra that you weren’t going to find in your manual analysis, because they cover such a broad range of features.

      But at the end of the day, the true success stories in this respect are rather rare.

      • Coming from a computational chemistry background, I find that most of the work you describe typically has a large number of researcher degrees of freedom.

        They tune the knobs, choose the basis sets or the intermediate functions appropriately to get good predictive value.

        That gets a publication, but typically poor out-of-sample prediction. Those things just don’t scale to the larger context.

        Of course, everyone is doing it and the publication machine chugs along merrily. :)

  4. As Keith already said, a key issue here is what is meant by interpretability. Any interpretation implying that x1, with an estimated nonzero coefficient, “has an effect” (thought of as causal?) whereas x2, with an estimated zero coefficient, doesn’t, is not warranted (this is an issue with any regression, though). And of course with 10-50 observations, and all the correlation between predictors that obviously comes in when deriving more predictors from those you already have, results will be hugely unstable.
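
    For instance, a handful of derived features built from one simulated descriptor are already nearly collinear:

    ```python
    # Even a few derived features are nearly collinear with their parent
    # descriptor, which is part of why the selected subset is so unstable
    # at n = 10-50. Simulated descriptor values.
    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.uniform(0.5, 2.0, 30)                 # stand-in primary descriptor
    feats = np.column_stack([A, np.exp(A), A**2, np.sqrt(A)])
    print(np.round(np.corrcoef(feats, rowvar=False), 2))
    # The off-diagonal correlations come out well above 0.9 on this range.
    ```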
