“Do Simpler Machine Learning Models Exist and How Can We Find Them?”

The above is the title of a talk by computer scientist Cynthia Rudin. Here’s the abstract:

While the trend in machine learning has tended towards building more complicated (black box) models, such models are not as useful for high stakes decisions – black box models have led to mistakes in bail and parole decisions in criminal justice, flawed models in healthcare, and inexplicable loan decisions in finance. Simpler, interpretable models would be better. Thus, we consider questions that diametrically oppose the trend in the field: for which types of datasets would we expect to get simpler models at the same level of accuracy as black box models? If such simpler-yet-accurate models exist, how can we use optimization to find these simpler models? In this talk, I present an easy calculation to check for the possibility of a simpler (yet accurate) model before computing one. This calculation indicates that simpler-but-accurate models do exist in practice more often than you might think. Also, some types of these simple models are (surprisingly) small enough that they can be memorized or printed on an index card.

This is joint work with many wonderful students including Lesia Semenova, Chudi Zhong, Zhi Chen, Rui Xin, Jiachang Liu, Hayden McTavish, Jay Wang, Reto Achermann, Ilias Karimalis, Jacques Chen as well as senior collaborators Margo Seltzer, Ron Parr, Brandon Westover, Aaron Struck, Berk Ustun, and Takuya Takagi.

This led to a question on my part. Rudin writes, “Simpler, interpretable models would be better.” I wonder whether it would make sense to separate the concepts of “simpler” and “interpretable.” I say this because in applications such as social and environmental science, a more complicated model can be more interpretable because it adjusts for more factors. I guess it depends on the application. If I’m doing adjustment for survey nonresponse or imbalance in an experiment, I find a more complicated model to be more interpretable and explainable: if I use a simpler model, it can be harder to explain. For example, in political polling it is more interpretable if we are adjusting for education of respondents than if we’re not. Also, in those settings it would not really be an advantage to be able to write the model on an index card!
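To make the adjustment example concrete, here is a minimal sketch, with made-up numbers and an assumed census education split (not any particular poll’s method), of the kind of reweighting involved:

```python
# Toy sketch of adjusting a poll for respondents' education.
# The survey counts and the census shares below are invented for illustration.
import pandas as pd

survey = pd.DataFrame({
    "education": ["no_college"] * 300 + ["college"] * 700,
    "support":   [1] * 120 + [0] * 180 + [1] * 420 + [0] * 280,
})

population_share = {"no_college": 0.60, "college": 0.40}  # assumed census targets

sample_share = survey["education"].value_counts(normalize=True)
survey["weight"] = survey["education"].map(lambda e: population_share[e] / sample_share[e])

raw = survey["support"].mean()
adjusted = (survey["support"] * survey["weight"]).sum() / survey["weight"].sum()
print(f"raw support: {raw:.2f}, education-adjusted: {adjusted:.2f}")
```

Here the college-educated are overrepresented in the sample, so the adjusted estimate differs from the raw mean; once more demographics are crossed, the adjustment model quickly outgrows an index card, yet each added factor makes the estimate easier, not harder, to explain to a polling audience.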

I sent this question to Rudin, who replied:

My definition of interpretability is totally domain dependent, so yes, of course I agree with you. Not every problem has the same type of model as the best solution, otherwise half of us would be out of a job and we’d all be working on that problem! I actually have a pretty broad scope for what I consider interpretable. For instance, I advocate for interpretable neural networks for computer vision. Our review paper goes through a pretty wide variety of problems that I think are interpretable, and we admitted in the survey that we didn’t even cover a small fraction of it. The talk I’m giving here just happens to be about sparse models (trees, additive models) because I find them useful for medical and criminal justice problems, as well as energy reliability.

23 thoughts on ““Do Simpler Machine Learning Models Exist and How Can We Find Them?””

  1. This is great to see. As an actuary, I work with models that often need to be filed and approved by regulators, so interpretability is key (it helps for management approval as well). Sometimes this can make us feel a bit inferior to our data science peers in other industries as we fiddle around with GLM/GAM-style regression models while they continue to innovate with the latest flashy black box. But interpretability really is a virtue, at least for certain domains.

  2. People regularly misinterpret the coefficients of simple models. Interpreting the coefficients is no different than doing so for the weights in a neural network. They do not represent “effects” unless the model is correctly specified.

    Otherwise they are arbitrary values, conditional on which model specification is chosen out of hundreds of millions (or more) of equally plausible ones. And those are constrained by the data available. (A small simulation sketch at the end of this thread illustrates the point.)

    So I would like to see how this “interpretability” works on a real life problem. Do they only mean understanding how the model works, rather than drawing conclusions about reality?

    • I would love some discussion of this point. I don’t have a good interpretation for “interpretable,” or at least I have trouble distinguishing it from “likely to be misinterpreted through a causal interpretation.”

    • Yes, interpretability in ML is usually synonymous with providing some understandable mechanism to explain a model’s predictions, which is often presumed to be an accurate reflection of the actual decision-making process (but may not be). I think the mismatch between what is modeled and reality comes up mostly around whether the explanation does in fact seem to reflect how the model is learning or making decisions. See, e.g., work criticizing saliency maps on the grounds that they can fail to reflect important model differences: https://proceedings.neurips.cc/paper/2018/file/294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf https://arxiv.org/pdf/2206.13498.pdf.

      Questions along the lines of ‘does this coefficient actually capture the real world phenomena I think it does’ aren’t so common, because ML isn’t about explaining some data generating process. Though, addressing some of the “failures” of deep neural nets that make the news, related to distribution shift, shortcut learning, and other unexpected biases, can seem to require moving more in the direction of causal modeling (we argue something to this effect here https://arxiv.org/pdf/2203.06498.pdf, following others who have advocated for integrative modeling https://www.nature.com/articles/s41586-021-03659-0).
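    To make the specification-dependence point above concrete, here is a small simulation sketch (entirely synthetic data): the “coefficient on x1” changes substantially depending on whether a correlated variable happens to be included.

    ```python
    # Synthetic illustration: the coefficient on x1 depends on which other
    # covariates the analyst includes, even with clean, abundant data.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    x2 = rng.normal(size=n)
    x1 = 0.8 * x2 + rng.normal(size=n)            # x1 and x2 are correlated
    y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

    def ols_coefs(X, y):
        return np.linalg.lstsq(X, y, rcond=None)[0]

    X_full  = np.column_stack([np.ones(n), x1, x2])
    X_short = np.column_stack([np.ones(n), x1])
    print("coef on x1, adjusting for x2:", ols_coefs(X_full, y)[1])   # close to 1.0
    print("coef on x1, x2 omitted:      ", ols_coefs(X_short, y)[1])  # close to 2.0
    ```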

  3. In hindsight, the one time when I advocated for a simpler model – because it did indeed work better – turned out to not be a good career choice. Our rival organization, after spending a large amount of the sponsor’s money on developing the bigger model, held a meeting with those sponsors – I was not invited – and gave a long powerpoint presentation insinuating that I preferred the simpler one because I clearly did not understand the more “sophisticated” one. I could have rebutted by showing the superior accuracy of the simpler model, but alas, never got the chance.

    Now it is hard for me to imagine achieving success in arguing for the better fidelity of a simpler model. Seems to violate various innate aspects of human nature.

    • Maybe you need to go about it in a different way. Advocate for an even more complicated model that stacks the simple and complicated models. The averaging process would probably end up passing through the outputs of the simple model most of the time if it’s significantly better than the complicated one, but it’s technically even more complex and expensive to implement, so therefore superior.

      • Or maybe just a cooler name for it. Random forest, neural nets, bagging, and boosting sound so cool. You know, if your sponsors are staring across the horizon of big data and you offer something like random forest, they may just say, “yes…yes…that’s the thing.” Mr P is sorta cool, but it sounds kinda like a hip hop artist. But if you say, well, we ran this Bayesian linear regression with a couple of variables, how are you gonna compete against the feedforward artificial neural network multilayer perceptron model?? It’s like boring old stats vs the Transformers :-P

    • Your client knows that more bling gets the ladies. Even if bling strikes out in the first round, you can build on it. Your Ford might look nice in the right light but in the end Uptown Girls prefer Lambos. That’s what Billy Joel found out anyway.

    • > Seems to violate various innate aspects of human nature.

      I’d say you are wrong about that. Communication is mostly non-verbal; it sounds like you just didn’t make the case. The most competent people I’ve worked with don’t shy away from difficult decisions, and competence is power.

      Give it some relaxed thought and you’ll realize one of our greatest strengths as a species is in reducing complicated problems and data sets down to first principles, and then building from those for an intended effect.

      That being said, it’s my opinion that data science will be a short-lived career.

      The reason behind that is simple: the basis of most complex societies is the division of labor and the cost of energy, and if you can create models that make that labor effectively free, you create significant issues, because you have built a fundamental bias toward machines over people into the business math. Economies only become strong when more productive efforts are rewarded than unproductive ones, especially in the realm of production and truck/trade. As automation and models replace people, everything will slowly become unmoored, squeezed, or concentrated until the motor seizes.

      When you are working in state-of-the-art fields, it is often more important to ask yourself ‘should we do this,’ with a decent amount of reflection and a background in moral philosophy, than to ask the naturally naive and simpler-minded question of ‘can we do this.’

      You may find The Wealth of Nations (Adam Smith) and its offshoot/follow-up, The Wealth and Poverty of Nations (Landes), an interesting but slow read, entirely worth the time spent, as they describe well-vetted facts from an age that valued facts far more than worthless opinions, and have largely stood the test of time.

  4. Keep in mind that a neural net is a statistical approximation function for complex data. One can remove nodes, i.e., damage a network, and it can still nearly perform its function. In some networks, though, this results in larger error risks than in others. One might try to combine Q-learning to create decision state trees/matrices. But neural nets, like people, will always be a bit fuzzy and often not perfect, though damn good.
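    A rough sketch of the “damage it and it still mostly works” claim, using a small off-the-shelf network on synthetic data (the dataset, architecture, and damage fraction are arbitrary, purely for illustration):

    ```python
    # Train a small MLP, zero out a fraction of its weights ("damage"), and
    # compare accuracy before and after. Not anyone's actual experiment.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
    net.fit(X_tr, y_tr)
    print("intact accuracy: ", net.score(X_te, y_te))

    rng = np.random.default_rng(0)
    for W in net.coefs_:                     # weight matrices between layers
        W[rng.random(W.shape) < 0.30] = 0.0  # zero out ~30% of the connections
    print("damaged accuracy:", net.score(X_te, y_te))
    ```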

  5. Can someone explain to me the preference for interpretable models? Perhaps I am not understanding properly what is meant by an interpretable model, but it seems that people often consider a linear regression model to be interpretable, and I don’t really get it. The coefficients in a linear regression don’t really tell you much, imo. The textbook answer is that a one-unit change in the covariate is associated with a [coefficient] change in the outcome, holding all other covariates constant. But what this association means, or how to use it, is not clear. The exception, of course, is when you’re doing a causal analysis; then you can interpret the causal parameter as an effect, but that doesn’t appear to be what ML people are talking about, at least explicitly.

    Trees are easier to understand, but I still don’t see why there’s a preference for them, all else equal. Would practitioners not just follow the output of the tree the same way they would follow the output of a black box?

    To be honest, most of the time I see people asking for interpretability, they really just want causal inference. But this may just reflect my lack of understanding of ML and its goals.

    • To provide a practical example, let’s say eBay is trying to detect storefronts that may be fronts for the sale of illicit goods. The concern is that people may have listings for products (say houseplants or baseball cards) when they are actually selling drugs. They have a bunch of data from manually investigating reports and from randomly sampling storefronts to investigate.

      They compute a bunch of reasonable metrics like price of goods relative to similar products, number of bids per sale, unique buyers per 100 sales, product type, etc. They then fit two models to the data, an interpretable decision tree and an XGBoost model, and want to choose which model to use to identify storefronts for investigation.

      When evaluating the quality of the two models, they can compute all sorts of numeric criteria, but with the interpretable decision tree they can see the logic of it and ask the investigators whether the logic seems reasonable. This allows them to use the investigators’ knowledge of what fraudulent storefronts look like to confirm that the model is picking up on reasonable factors and using them in a reasonable way. This helps to identify overfitting, model biases, and data errors. Additionally, suppose that two years later a government agency audits eBay after a series of high-profile cases involving fronts using online retailers surfaces. Explaining and justifying a simple decision tree to regulators (most of whom are not statisticians/data scientists) will be much easier than explaining an XGBoost model’s output. You can say, “We investigate all businesses with a known address in Humboldt County, CA that sell houseplants with an average price above $95, to identify potential weed sellers.” An XGBoost model that flags that kind of transaction will not be easily interpretable, and you are left saying “the math says it’s similar to prior bad storefronts.” The first is more likely to be positively received by regulators, as it clearly shows what you think is suspicious and why, and can be evaluated internally and externally for accuracy and conceptual soundness. (A toy sketch of that rule inspection appears at the end of this comment.)

      This example is fictional, but it is similar to how a lot of banking and financial models are evaluated in practice. As a regulator, we want to understand how the systems work and why you are making these decisions so that we can confirm you are following the laws and regulations in a reasonable way and not just saying “a computer told me”. As a business, you should want to know why a model makes a particular decision so that you can evaluate if it is sound conceptually and not just chasing noise/bias.
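      As a toy version of the rule inspection described above (synthetic data and invented feature names, not eBay’s actual setup), a shallow tree’s logic can be printed and handed directly to investigators or regulators:

      ```python
      # Fit a shallow decision tree on synthetic data and print its rules so
      # domain experts can sanity-check the logic. Feature names are invented.
      from sklearn.datasets import make_classification
      from sklearn.tree import DecisionTreeClassifier, export_text

      X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                                 n_redundant=0, random_state=0)
      names = ["relative_price", "bids_per_sale",
               "unique_buyers_per_100_sales", "listing_age_days"]

      tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
      print(export_text(tree, feature_names=names))  # human-readable if/else rules
      ```

      An XGBoost model fit to the same data could be probed with feature importances or post-hoc explanations, but there is no comparably short, complete description of its decision rule to hand over.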

  6. This reminds me of two things. I don’t really have a conclusion to draw from these; they just seem relevant to what Rudin is saying.

    First: Andrew, I vaguely recall a conversation you and I had in about 1990, about ‘expert systems.’ https://en.wikipedia.org/wiki/Expert_system Here’s one thing that Wikipedia page says: “The goal of knowledge-based systems is to make the critical information required for the system to work explicit rather than implicit. In a traditional computer program the logic is embedded in code that can typically only be reviewed by an IT specialist. With an expert system the goal was to specify the rules in a format that was intuitive and easily understood, reviewed, and even edited by domain experts rather than IT experts. The benefits of this explicit knowledge representation were rapid development and ease of maintenance.”

    I think one of the things that prompted that conversation was an article about… I want to say it was an expert system for making the decisions that a “master brewer” makes at a commercial brewery. One of the things that a lot of the people making expert systems discovered is that they could often replicate the decisions of the human with a surprisingly compact set of rules. If the hops are too bitter, do X. If the grain is too moist, do Y. Decisions that seemed complicated at first — “wow, you must have years of experience to know that you should do that!” — could often be replicated with just a slightly more complicated decision tree or other set of rules: “If the alcohol content midway through is too high, and the bitterness is too low, then do X if the temperature is low or X’ if the temperature is high,” that sort of thing. I do think a lot of models can be written on a 3×5 card.

    Second: Last year I was working for a client who needed a model that performed as well as reasonably possible while also being understandable to people with only limited technical understanding. In some circumstances, maybe many, those might have been mutually exclusive, but in fact I tried lots of models — logistic regression, random forests, Bayesian hierarchical models, a few others — and they all performed roughly the same. So the client said “great, we’ll go with logistic regression because the audience already understands that, or at least they accept it.” I’m not sure what would have happened if a more complicated model had worked a lot better; how much would they have been willing to trade off performance vs. explicability? But it didn’t come up.

    • Our software is used by clients running online experiments and contextual bandits. When we say that the value functions of the bandits are interpretable, we mean it in the way Phil uses it: that the bandit is both machine- and human-parseable. One cannot overstate how important it is, in many situations, for certain types of clients to be able to mentally audit and process the model before being willing to implement its use. This is in addition to the bandit being routed as an arm in an A/B test to evaluate the in-market efficacy of contextual/conditional selections vs. a uniform random and/or pure selection policy (as one would in a simple A/B test).

      I am looking forward to reviewing the paper. Separately, I have wondered whether there is anything of use in looking at the relationship between the marginal accuracy and the marginal Kolmogorov complexity (in bits) between the ‘explainable’ model and the black-box model, to get some sort of shadow price of complexity with respect to accuracy: how many extra bits do we need to spend to get the incremental accuracy? Or at least get some measure of the number of bits that are, in some sense, not explainable/interpretable.

    • Warren:

      Machine learning is statistical data analysis. As Bill James said, the alternative to “good statistics” is not “no statistics,” it’s “bad statistics.” Criminal justice is important. People will study criminal justice problems using statistical analyses. It makes sense to understand these analyses and do as well as we can. I’ve published some papers myself on statistics in criminal justice. I don’t think it makes sense to not try to understand criminal justice using statistical principles, given that the alternative is that people will do crappy analyses and use that to jump to strong conclusions (for example check this out if you really want to be horrified).

      • Ah. An excuse to segue into today’s morning reading.

        The 16 December 2022 Science has a (ravingly positive) review of “Escape From Model Land” by Erica Thompson. Did I say positive? Yep. They liked it.

        I’ll not quote the review, or comment on it (since I’d mess it up), but it sounds seriously up your-all’s alley.

        Also, by the way, thank you for saying “Machine learning is statistical data analysis.” I’ve thought about screaming that in the context of my rants on current AI (here and elsewhere), so it’s helpful to have it in print, so to speak. I really ought to read this book, since it sounds like it deals with the limits of such statistical data analysis, and might make my ranting less off the wall. But there’s a stack of back issues of Science, and I didn’t succeed in my Japanese fiction reading goals this year (oops)…

        Whatever, thanks for this blog and putting up with me, and have a good 2023.

        • “Machine learning is statistical data analysis.”

          In a broad way, as is intelligence.

          Saying AI or machine learning is just X doesn’t disprove it is a form of intelligence unless you can prove intelligence doesn’t involve X.

          Have a good 2023.

  7. A good point I think. Makes me think of Paul Rosenbaum’s award lecture at JSM a few years back where he was reflecting on an empirical paper he was involved in. There were replies and all the worries were about unobserved confounders. He saw this as encouraging: everyone was satisfied by the quite transparent matching they had done on observables. (But of course the implied model for such matching has lots of parameters, so isn’t “simple”.)

  8. Years ago, a friend gave me a volume entitled simply “Simplicity”. Not sure where it is right now, but there are many concise ways to define what makes a model simpler.

    Simulated annealing gained some popularity within the AI community (surprisingly amongst KR types) some years ago as a way to quickly find acceptable (but not always optimal) solutions to problems for which exact solutions could only be found by searching an exponentially large space. It appears to be an area-unto-itself these days. (The same might be said for genetic algorithms, and I’ve noticed a bit of work on random-restart methods.)

    To teach it, I used the metaphor of two dynamically weighted dice guiding the search for a solution. The first die told you where to make a change. The second die told you whether or not to actually implement the change if it resulted in a poorer solution (which might take you out of a local maximum or minimum). The second die becomes progressively less capricious as you approach a good-quality answer. (A minimal sketch appears at the end of this comment.)

    If solutions are easy to visualize, it’s a good way to find structure in your data, and from there you can craft domain-based models. So I see it as useful for the exploratory phase of finding faster methods.

    Consider matrix multiplication. The usual elementary method requires about n-cubed operations. In 1969, Volker Strassen devised an algorithm that did the job in n to-the-power-2.8074 operations. Matrix multiplication is not usually considered a high-stakes operation, but I’ve read that in the early days of computer graphics this made it possible for laser printers using PostScript to do certain graphics calculations just well enough to make better graphics commercially feasible.

    The current record is n to-the-power-2.37188 (as of October 2022), and, as far as I can see, the results are explainable. At about the same time, New Scientist reported that DeepMind discovered a fast algorithm for 4×4 matrices (among others) (a big factor in computer graphics), but one that “is far from intuitive for humans”. Also reported in Nature. This may result in a speedup of up to 20% for computer games and other rendering problems.

    Haven’t done an appropriate survey, but I’d confidently claim the average person doesn’t care much how their matrices are multiplied, so long as it’s correct. (Ever seen the inverse-square-root video? It’s a masterpiece of early computing, likely obsoleted by current floating point hardware.)

    On the other hand, I think it would be questionable to impose self-driving cars on a population based on a black box method.

    If I had more time I could have written a shorter note.
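    For what it’s worth, here is a minimal sketch of the two-dice scheme mentioned above, applied to a made-up objective (minimizing the number of 1-bits in a bit string); the objective and cooling schedule are stand-ins:

    ```python
    # Simulated annealing sketch: a "first die" picks where to change the
    # candidate solution, a temperature-dependent "second die" decides whether
    # to keep a worse move, and it grows less capricious as the temperature falls.
    import math
    import random

    def energy(x):               # made-up objective: minimize the count of 1-bits
        return sum(x)

    def anneal(n_bits=50, steps=5000, t0=2.0):
        x = [random.randint(0, 1) for _ in range(n_bits)]
        for step in range(steps):
            temp = max(t0 * (1 - step / steps), 1e-9)  # cooling schedule
            i = random.randrange(n_bits)               # first die: where to change
            candidate = x[:]
            candidate[i] ^= 1
            delta = energy(candidate) - energy(x)
            # second die: always accept improvements; sometimes accept worse moves
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                x = candidate
        return x

    print(energy(anneal()))      # typically at or near 0
    ```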

  9. “I wonder whether it would make sense to separate the concepts of ‘simpler’ and ‘interpretable’.” … “My definition of interpretability is totally domain dependent, so yes, of course I agree with you.”

    In the context of high-dimensional sequence data (e.g., language or amino acids), another related aspect we consider when winnowing the space of possible models is robustness of the model to distribution shifts. Other things being equal, we seek the model with which we can most effectively construct data partitions over which we can reliably calculate uncertainty.

    With large language models (i.e., deep parametric networks, such as Transformers), we gain such robustness by partitioning the data via the strong signals for reliability from dense representation matching (i.e., matching labels and predictions into the training/support set and considering the distance to the first match) and binning the output space. (The ‘dense representations’ are constructed from the hidden layers of the neural network.) We can then model (and analyze) the partitions separately to allow the proportions of the partitions to shift over new data. (A rough sketch of the matching step appears at the end of this comment.)

    The point here is that, at least restricting ourselves to neural networks, we may observe similar effectiveness on our metrics (e.g., the losses both fall below a given threshold) between a small neural network and a much larger (i.e., more complicated, higher parameter) neural network on our in-distribution data. However, the larger neural network may be more robust to distribution shifts, and thus preferable, unless we have some other means of constraining the input.

    Robustness can (sometimes) become a competing factor against simplicity, at least with high-dimensional data.
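    As a very rough sketch of the matching step described above (the embeddings, threshold, and data here are random placeholders, not the actual system), the partition can be as simple as distance to the nearest label-matched training embedding:

    ```python
    # Hypothetical sketch: split held-out points into "near" and "far" partitions
    # by distance to the closest training embedding whose label matches the
    # model's prediction, then analyze each partition separately.
    import numpy as np

    def partition_by_support_distance(train_emb, train_labels, test_emb, test_pred, threshold):
        groups = []
        for emb, pred in zip(test_emb, test_pred):
            support = train_emb[train_labels == pred]  # label-matched "support" set
            dist = np.linalg.norm(support - emb, axis=1).min() if len(support) else np.inf
            groups.append("near" if dist <= threshold else "far")
        return np.array(groups)

    # Example with random stand-in embeddings and predictions:
    rng = np.random.default_rng(0)
    train_emb, train_labels = rng.normal(size=(500, 16)), rng.integers(0, 2, 500)
    test_emb, test_pred = rng.normal(size=(100, 16)), rng.integers(0, 2, 100)
    groups = partition_by_support_distance(train_emb, train_labels, test_emb, test_pred, 4.0)
    print(np.unique(groups, return_counts=True))
    ```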
