Comparing bias and overfitting in learning from data across social psych and machine learning

This is Jessica. Not too long ago, I wrote about research on when claims in machine learning (ML)-oriented research are not reproducible, where I found it useful to try to contrast the nature of claims, and threats to their validity, in artificial intelligence and machine learning versus in a social science like psych where there’s been plenty of public discussion of ways authors can overclaim. 

Writing 20 years ago on the “two cultures” of statistical modeling—trying to interpret and draw generalizations from parameters of a fitted function, as social psychologists do, versus maximizing predictive accuracy of some fitted model without requiring interpretability, as ML research does—Breiman wrote that “there are only rare published critiques of the uncritical use of data models.” But these days critical discussions of how learning from data can go wrong in both “cultures” seem commonplace, even in the media. Similar to social psychologists reckoning with the lack of replication, ML research communities also seem to be undergoing a period of rapid learning in which researchers are having to reflect on and acknowledge threats to the validity of their models, and of claims based on them, that hadn’t been widely acknowledged or discussed prior to the last 5 or so years. So it got me wondering: how distinct are the most visible concerns today about how claims derived from statistical models are subject to bias and overfitting at these two extremes of fitting functions to data, social psych versus ML?

There are a few major classes of differences to acknowledge first. In terms of classic bias-variance trade-off stuff, in ML we lose the assumption that we need to match the formal representation we use to learn from data to the nature of the true function. The new focus on prediction accuracy, over finding a single model that points to a small number of strong predictors, lets us tolerate more bias from bad estimators, bad assumptions, etc., so long as they help us learn efficiently from data. The added flexibility in model specification space makes overfitting to the training data a primary concern.
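
To make the overfitting worry concrete, here’s a minimal sketch (the synthetic data and the particular models are my choices, purely for illustration): a maximally flexible model can fit the training data almost perfectly and still predict held-out data worse than a simple model that matches the true function.

```python
# A minimal sketch of overfitting: a flexible model can fit the training data
# almost perfectly while doing worse than a simple one on held-out data.
# The synthetic data and model choices are illustrative, not from the post.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=200)  # true function is linear plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [LinearRegression(), DecisionTreeRegressor(random_state=0)]:  # unpruned tree
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train MSE:", round(mean_squared_error(y_train, model.predict(X_train)), 2),
          "test MSE:", round(mean_squared_error(y_test, model.predict(X_test)), 2))
```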

However, there are also massive differences between how ML and social psych interact with the world, perhaps the most obvious being that ML as an enterprise overall has demonstrated surprising levels of predictive accuracy put to work in millions of everyday applications that generate utility for the organizations that use them, fueling society’s obsession with big data and deep learning. On the other hand, conventional social psychology has produced, well, it’s hard to say how much value to society in an applied sense, because that’s not really the point; it’s about knowledge accumulation for its own sake, which is why the questioning of how much knowledge has actually been produced (a.k.a. the replication crisis) is particularly brutal.

Many of the more visible concerns about bias in ML (e.g. from lit on algorithmic bias) seem sharply distinguishable from concerns about bias in social psych studies, in part because they’re not about the claims that ML researchers are making for other researchers (e.g., “my model/system achieves 94% accuracy at task X on such-and-such benchmark dataset”), but instead about implied claims from applying the predictive model in the world. For example, a model predicting recidivism might estimate a person has a 42% chance of committing another crime once released, from which statements attributing effects to certain predictors can be derived (“these people who differ only in their race are predicted to have a higher chance of committing another crime than these other people”). Some people refer to algorithmic bias as being about “social bias” to emphasize that the concern is deployed models that lead to systematically worse outcomes for some groups of people as a result of some preexisting inequity, bias, or simply lack of representation in the training data. Whereas the kind of bias that figures most prominently in debates about how to reform social psych is the tendency for the empirically based claims researchers make to represent false positives, because the researcher censors evidence (file drawer effect), has preferences about what results should look like, and exploits flexibility in the overall learning process, leading to claims about human behavior that don’t hold.

Another way to put it is that concerns about bias in ML are largely questioning the value-free ideal of science, the idea that social, political, moral, and personal values should play a restricted role in science and not influence how results are justified. The common argument is that as ML becomes more and more pervasive for decision making in the world, technologists need to drop their assumption that their creations are value-neutral, because however optimally they may learn from past data, they can worsen or uphold unfair patterns. By contrast, the value-free ideal is the epistemological basis from which mainstream methodological critiques of social psych are leveled; we’re trying to get people to be less motivated by their personal goals and biases when they do empirical research.

This doesn’t mean ML is immune to concerns analogous to those in social psych, though. Some have discussed different types of leakage and researcher tweaking of parameters to get the best accuracy out of some model, practices that more closely resemble how we talk about researcher degrees of freedom in fields like psych. Worse may be how researchers can delude themselves into thinking that they’ve made progress on some hard task, when in reality the performance can be achieved with dumber heuristics that don’t necessarily use what should, by the definition of the task, be critical information. For example, a common visual question answering benchmark dataset enabled getting decent accuracy on many questions using priors alone, i.e., guessing based on the distribution of objects/concepts in the dataset rather than on the image.
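
As a toy version of that kind of shortcut (the data below are invented; this is not the real VQA benchmark), here’s roughly what a “prior-only” baseline looks like: it never looks at the image and just returns the most common training-set answer for each question. When a baseline like this scores respectably, the benchmark isn’t testing what we think it is.

```python
# Hypothetical sketch of a "prior-only" baseline: it ignores the image entirely
# and answers each question from the training-set answer distribution.
# The toy data below are invented for illustration; this is not the real VQA dataset.
from collections import Counter, defaultdict

train = [  # (question, answer) pairs; images deliberately unused
    ("what color is the banana", "yellow"),
    ("what color is the banana", "yellow"),
    ("what color is the banana", "green"),
    ("how many dogs", "2"),
    ("how many dogs", "2"),
    ("how many dogs", "1"),
]

# Most common answer per question, estimated from training data alone
by_q = defaultdict(list)
for q, a in train:
    by_q[q].append(a)
prior = {q: Counter(answers).most_common(1)[0][0] for q, answers in by_q.items()}

test = [("what color is the banana", "yellow"), ("how many dogs", "2"), ("how many dogs", "3")]
correct = sum(prior.get(q) == a for q, a in test)
print("prior-only accuracy:", correct / len(test))  # decent accuracy, no image needed
```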

But there are a couple of reasons that make me think the ML irreproducibility movement will never really gain the kind of traction or visibility that replicability or reproducibility has in social psych. One is the baseline difference in how well ML research achieves its goals (accurate prediction) versus how well social psychology achieves its goals (producing reliable, valid causal descriptions of human behavior). At the end of the day, it’s hard to challenge evidence of ML’s impressive predictive performance, whether or not we fix our bad intuitions about how learning is happening. The other is the relative ease of judging whether or not some accuracy level has been achieved by a specific model and architecture compared to judging whether or not human behavior can actually be explained by simple, psychologically plausible models.

Related to this (though maybe an overgeneralization?), it seems there’s a difference in how generic the claims made by social psychologists versus ML researchers are implied to be. The heavy emphasis on predictive accuracy in ML implies to me a different type of reasoning in many cases, where the claims are more possibilistic than probabilistic (e.g., “we can achieve this level of accuracy given this training data/model/architecture/parameterization etc. for this task/test set” instead of “taking up more personal space makes many people feel more powerful”). More often the types of severe overconfidence in ML model results seem to stem not from things like readers overlooking the dependence of stated model accuracy on human-mediated decisions (how hyperparameters were tuned, which run of the algorithm was reported on, etc.), but from latent assumptions on the reader’s part, like that the learner has picked up on the structure of the feature space that a human would. E.g., around 2014 results came out showing that noising an image, imperceptibly to a human, could consistently trick a number of classifiers into applying the wrong label (and now there are lots more examples in ML research), which is surprising given a latent assumption that achieving human-like accuracy meant seeing the problem in a human-like way, e.g., one in which there’s a type of continuous perceptual distance underlying label assignment. Or, as in the visual question answering example above, the assumption that how we operationalize tests of how well algorithms can perform certain human-like tasks (e.g., reading comprehension) actually tests what we think it does (what if it’s possible to do well at the task without looking at either the questions or the passages?).
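
For anyone who hasn’t seen how those adversarial images are constructed, here’s a rough sketch of the gradient-sign intuition on a toy linear classifier (all of the numbers and the model are invented for illustration; the point is just that many imperceptibly small per-pixel changes, aligned with the weights, add up).

```python
# Rough sketch of why small, aligned perturbations fool linear classifiers
# (the "fast gradient sign" idea); all numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
d = 1000                                     # think of these as pixels
w = rng.choice([-1.0, 1.0], size=d) * 0.05   # weights of a toy linear classifier
x = 0.02 * np.sign(w)                        # a "clean image" the model scores as positive

def score(v):
    return float(w @ v)                      # predicted label = sign(score)

eps = 0.1                                    # tiny per-pixel change, imperceptible in image terms
x_adv = x - eps * np.sign(w)                 # nudge every pixel against the current label

print("clean score:", score(x))              # 0.02 * 50 = 1.0  -> positive label
print("adversarial score:", score(x_adv))    # (0.02 - 0.1) * 50 = -4.0 -> label flips
# No single pixel changed much, but 1000 tiny changes aligned with the weights add up:
# the score shifts by eps * sum(|w|) = 0.1 * 50 = 5. This is the "linearity suffices" point.
```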

Is it meaningful that in one case metaphors like the garden of forking paths, which seem to imply some decider/agent at the helm, come up, while in the other it’s terms like alchemy, which evoke blind experimentation? There’s a continuum of complexity as you move from analyses via ANOVAs and t-tests at one end to overparameterized deep neural nets at the other, and I wonder if we’re more likely to let ML researchers off the hook for certain threats to reproducibility or replicability because of the field’s own challenges keeping up with the sheer complexity of the learning process. The leakage and “design freedoms” can seem less human-mediated in light of the vastness of the “design space” of something like a deep convolutional neural net.

In judging the validity of social psych claims and the potential social harms of ML research at the time of publication, there’s epistemological uncertainty: we can’t know for sure whether or not the claims are good. In the case of an empirical social psych paper, lots of reform literature implies that replicability is the best standard for judging the validity of an experiment’s results. In the case of ML research, the evidence that would reduce the uncertainty around bias seems further in the future still: it depends on how the contribution (fitted model, learning approach, etc.) makes it into real-world deployment. E.g., some argue that finding no evidence that an ML system is unfair in the technical sense does not mean we can conclude that the system really is unbiased, because that can only be defined with respect to how it’s deployed and used. In both cases, there’s a need within the field to build up knowledge about how you can predict, upon publication, how likely the findings are to be bad/biased.

Here the methodological reformers in psych seem ahead of those in machine learning, in that there’s a significant amount of work making the case for one or another “feature” of an experiment or analysis (low sample size, high measurement noise, lack of declaration of methods in advance, dichotomization, etc.) tending to correlate with a lack of replicability. I can’t tell, from the outside looking in, whether the ML community yet has consensus on how to judge the potential for future harm or bias of a contributed algorithm or system. The epistemological uncertainty at the time of publication in ML seems larger and more contingent on our ability to predict the future in a very literal way; e.g., who might come along and put the contribution to use in what real-world situation.

At the highest level, maybe a point of parallel besides just “people having revelations about ways learning functions from data can go wrong” is that in both cases, it’s been clear we need to update our beliefs about the bias-variance trade-off. Methodological reformers addressing social psych have tried to show how seemingly harmless decisions about data inclusion, transformation, etc., combined with a coarse objective function (significance) and a researcher who is implicitly optimizing for it, can change analysis results. Some have noted how the desire for unbiased procedures has ironically motivated psychologists to prefer between-subjects experiments, and to rely heavily on frequentist maximum likelihood, without recognizing that with small samples and noise from measurement error, sampling variability, etc., these approaches do not achieve their large-sample guarantees. (See, e.g., Andrew’s many previous posts on the bias/variance tradeoff.)
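
Here’s a small simulation sketch of that forking-paths point (the particular degrees of freedom are invented by me, just to illustrate the mechanism): there is no true effect anywhere, but an analyst who can pick among a few defensible analyses after seeing the data, and report whichever one “works,” ends up rejecting the null far more than 5% of the time.

```python
# Simulation sketch of researcher degrees of freedom: there is no true effect,
# but if the analyst may choose among a few defensible analyses after seeing
# the data (two outcome measures, optional outlier exclusion) and reports the
# best one, the false positive rate climbs well past the nominal 5%.
# All simulation choices here are mine, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 5000, 20
strict, flexible = 0, 0

for _ in range(n_sims):
    # two groups, two outcome measures, no true group difference anywhere
    g1 = rng.normal(size=(n_per_group, 2))
    g2 = rng.normal(size=(n_per_group, 2))

    pvals = []
    for outcome in (0, 1):
        a, b = g1[:, outcome], g2[:, outcome]
        pvals.append(stats.ttest_ind(a, b).pvalue)              # analysis as planned
        keep_a, keep_b = a[np.abs(a) < 2], b[np.abs(b) < 2]      # "exclude outliers"
        pvals.append(stats.ttest_ind(keep_a, keep_b).pvalue)

    strict += pvals[0] < 0.05        # one pre-chosen analysis
    flexible += min(pvals) < 0.05    # best of four post hoc options

print("false positive rate, fixed analysis:   ", strict / n_sims)    # ~0.05
print("false positive rate, flexible analysis:", flexible / n_sims)  # noticeably higher
```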

Similarly, there’s been theorizing and simulation in AI/ML to explain how a deep learning model can be so vulnerable to certain forms of adversarial manipulation, and how biased predictions can result from training data properties and model choices. Beliefs and intuitions have required rapid updating as new results come out; as one example, the tendency to think that the brittleness of deep nets is due to non-linearity, where some of the a-ha moments involve pointing out that linear models suffice to explain it. Related to this, there have been some fascinating recent revelations around phenomena like double descent that demonstrate how overparameterized models without explicit regularization directly challenge our notions of how bias and variance trade off. Intuitions have also been proven wrong in some of the algorithmic bias work, e.g., when researchers pointed out, after the initial popularization of the idea of unfair algorithms (like through ProPublica’s famous article), that defining fairness is not as simple as some critiques implied.
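
For anyone curious what double descent looks like in miniature, here’s a rough simulation sketch (random ReLU features and a minimum-norm least squares fit; the data, feature map, and widths are made up, and the exact curve depends on these choices): test error typically spikes near the interpolation threshold, where the number of features matches the number of training points, and then falls again as the model gets even bigger, with no explicit regularization anywhere.

```python
# Rough sketch of double descent with random ReLU features and a minimum-norm
# least squares fit (no explicit regularization). Data, feature map, and widths
# are invented for illustration; with these choices, test error typically spikes
# near width ~ n_train (the interpolation threshold) and then falls again.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_reps = 40, 500, 20

def make_data(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + 0.3 * rng.normal(size=n)
    return x, y

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

def relu_features(x, W, b):
    return np.maximum(x @ W + b, 0.0)   # random ReLU feature map

for width in [5, 10, 20, 40, 80, 200, 1000]:
    mses = []
    for _ in range(n_reps):
        W, b = rng.normal(size=(1, width)), rng.normal(size=width)
        Phi_tr, Phi_te = relu_features(x_tr, W, b), relu_features(x_te, W, b)
        coef, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)   # minimum-norm solution
        mses.append(np.mean((Phi_te @ coef - y_te) ** 2))
    print(f"width {width:5d}   median test MSE {np.median(mses):10.3f}")
```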

In the flurry of activity aimed at providing causal explanations for the various threats and seemingly anomalous results, there’s a risk of being misled by our intuitions about what the problem is (I think of Devezer et al.’s 2020 paper as a good statement of how this applies in the methods reform debate in psych, by pointing out the lack of acknowledgment that replicability is an imperfect estimator of good science). Is it somehow easier to get back on track in the logic- and proof-based paradigm of ML than in a probabilistic realm like social science?

Part of me thinks there’s some latent comparison here, between the enormous flexibility of reasoning and wide breadth but limited depth of knowledge of human intelligence, and the narrow, deep, but inflexible character of machine intelligence. My perspective on all this is still forming though, so I’m curious to hear readers’ thoughts on any of this.

7 thoughts on “Comparing bias and overfitting in learning from data across social psych and machine learning”

  1. I’m a philosopher of science who works in the area we call “science, values, and policy,” so it was great to see a reference to Douglas’ book here. I did want to address a common confusion about “non-epistemic values.” On the most common definition, epistemic values are any truth-promoting features of some body of scientific research. Political, moral, and personal values can be epistemic values, depending on the context. For example, if the existing research in an area has produced false claims due to the influence of sexist and racist assumptions, then bringing in feminist and antiracist values can be truth-promoting, i.e., these “non-epistemic values” would be epistemic values.

    • I see, thanks for the correction! Now that you bring that up, I recall from back when I was reading Douglas’s work that the boundary between what was epistemic and non-epistemic was nuanced. I edited so hopefully it is less wrong now. I found her work very eye-opening by the way, especially for thinking about algorithmic bias.

  2. This might be a bit on a tangent but it really irks me the way everyone acts like ML bias is ohh so dangerous and bad without bothering to compare it to the alternatives.

    I mean, is it a bad thing to use a face recognition system that has a higher rate of false positives for minorities than for whites? Or a recidivism predictor that tends to predict higher recidivism for blacks? Yah, that troubles me. But so does all the research showing that police trying to recognize people on CCTV are racially biased (or at least less good at cross-racial identification), and that judges making these bail decisions are racially biased.

    Indeed, it seems to me that the real complaint here is that ML models are too *easy* to analyze and test. Unlike the judge making individual calls with too few cases to really work out the effect of, say, race on their decisions, the ML model can be applied to thousands of otherwise similar cases and its behavior quantified.

    But that’s a *feature*. It means that, unlike when we have humans make the call, we can look at where our ML algorithms are biased and work on improving them. When companies like Amazon refuse to provide face recognition tech to law enforcement, we don’t get less bias. We get the existing human bias plus shitty face recognition tech from companies who don’t care about reducing that bias and making it better.

    Of course, most people would much rather assure themselves they can’t be associated with the bad things than to act to make things slightly less bad. Not to mention people *really* don’t like having to face the theorems showing that *no* decision about groups with differing base rates can be fair in all the ways we want it to be.
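
    For concreteness, here’s a tiny numeric check of that impossibility point (numbers made up). Chouldechova’s identity, FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR), ties the false positive rate to the base rate p, so if two groups get the same PPV and the same FNR but have different base rates, their FPRs have to differ (unless the classifier is perfect).

```python
# Quick numeric check of the impossibility point (all numbers made up).
# Chouldechova's identity: FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR),
# where p is the group's base rate.
def fpr(base_rate, ppv, fnr):
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

ppv, fnr = 0.7, 0.2            # identical calibration/error profile for both groups
print("group A (base rate 0.3):", round(fpr(0.3, ppv, fnr), 3))
print("group B (base rate 0.5):", round(fpr(0.5, ppv, fnr), 3))
# Different base rates force different FPRs, so being "fair" on one criterion
# means being "unfair" on another; you have to pick which disparities matter.
```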

    • Peter:

      I agree with your general point here. I just disagree with your claim that “everyone acts like ML bias is ohh so dangerous and bad without bothering to compare it to the alternatives.” When machine learning bias is discussed, I’ve often seen people make the point that one should also consider the biases in the current system or in other preferred alternatives.

      My impression is that when we see one-sided presentations (discussions of biases of a new system without discussions of bias in the existing system), it’s not that “people would much rather assure themselves they can’t be associated with the bad things than to act to make things slightly less bad,” but rather that people are reacting to what they see as a naivety or hard sell in which the new system has been sold to people as a panacea.

      In any case, I agree with your general point that, when it comes to policy, the benefits and flaws of any new system should be compared to the benefits and flaws of the existing system and other proposed alternatives.

      • Computer systems are often more entrenched than humans. Humans will change with society and hopefully become less biased; a computer system rarely does, and changing it will be expensive.
        Plus, people may trust it more.

    • I believe there are two issues:

      1) Who should be held responsible for cases where bias leads to harm? If it is a human or group of humans making the decision, we can hold them responsible. But if a decision is outsourced to a trained ML model and it makes a mistake, who should be blamed for the mistake? What recourse would someone harmed by such a system have?
      2) Ironically, I think the widespread awareness of human bias makes people more willing to believe that machines are less biased than they are. For many decision makers and members of the public, machines are thought of as perfect rational actors. And people trying to sell these systems often make that claim. We need voices speaking against those false beliefs to prevent naive acceptance of ML from becoming the norm. (Same reason we harp on naive statistical practice on this blog.)

      • Blame is definitely one of the harder questions. One challenge is that in many of the “high stakes” cases brought up in the algorithmic bias lit, it’s not easy to say whether there was a mistake or not. If your model is learning from tons of past data, and the task is to predict the probability of defaulting on a loan, it again comes back to predicting the future. Even if ground truth is possible, it might require interrogating the model in a way that isn’t accessible to those either using the model or getting labeled by it.

        Sort of related, I recently came across work on ‘actionable algorithmic recourse’, which implies that it is wrong to make predictions that affect people’s lives/futures in major ways without giving them some actionable advice on how to change their features to get a better prediction in the future. And that if your model is based entirely on hard-to-change traits (age, sex, socioeconomic status, etc.) then maybe you shouldn’t be using it. https://arxiv.org/pdf/1809.06514.pdf
        According to the first author, the idea has been controversial in some circles.
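
        Since the basic idea is easiest to see with a linear model, here’s a hedged sketch (toy features and weights that I made up; this is not the method from the linked paper): hold the immutable traits fixed and compute the smallest change to the actionable features that moves the applicant’s score across the decision boundary.

```python
# Hedged sketch of the recourse idea for a linear scoring model (toy numbers,
# not the method in the linked paper): find a small change to *actionable*
# features that flips a negative prediction, holding immutable traits fixed.
import numpy as np

features = ["age", "income", "debt"]          # hypothetical loan features
actionable = np.array([False, True, True])    # can't change age
w = np.array([0.01, 0.8, -1.2])               # toy model weights
b = -0.5
x = np.array([30.0, 0.4, 0.9])                # applicant currently denied

score = float(w @ x + b)                       # negative -> denied
# Minimal-norm change over actionable features that pushes the score to 0:
w_act = w * actionable
delta = -score * w_act / (w_act @ w_act)       # closed form for a linear model

print("score before:", round(score, 3))
print("suggested change:", {f: round(float(d), 3) for f, d in zip(features, delta)})
print("score after:", round(float(w @ (x + delta) + b), 3))  # lands on the boundary
```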

        On your second point … that is important to remember. My view is so entrenched in the spheres where this stuff has been discussed for a couple of years that I forget these things are not obvious in many circles/applications where model predictions are used.
