The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning

This is Jessica. In a paper to appear at AIES 2022, Sayash Kapoor, Priyanka Nanayakkara, Arvind Narayanan, Andrew, and I write:

Recent arguments that machine learning (ML) is facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value. These concerns inspire analogies to the replication crisis affecting the social and medical sciences. They also inspire calls for greater integration of statistical approaches to causal inference and predictive modeling.

A deeper understanding of what reproducibility critiques in research in supervised ML have in common with the replication crisis in experimental science can put the new concerns in perspective, and help researchers avoid “the worst of both worlds,” where ML researchers begin borrowing methodologies from explanatory modeling without understanding their limitations and vice versa. We contribute a comparative analysis of concerns about inductive learning that arise in causal attribution as exemplified in psychology versus predictive modeling as exemplified in ML.

Our results highlight where problems discussed across the two domains stem from similar types of oversights, including overreliance on theory, underspecification of learning goals, non-credible beliefs about real-world data generating processes, overconfidence based in conventional faith in certain procedures (e.g., randomization, test-train splits), and tendencies to reason dichotomously about empirical results. In both fields, claims from learning are implied to generalize outside the specific environment studied (e.g., the input dataset or subject sample, modeling implementation, etc.) but are often difficult to refute due to underspecification of the learning pipeline. We note how many of the errors recently discussed in ML expose the cracks in long-held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to consider a true data generating process or formally represent uncertainty in performance claims. At the same time, the goals of ML are inherently oriented toward addressing learning failures, suggesting that concerns about irreproducibility could be resolved through further methodological innovation in a way that seems unlikely in social psychology. This assumes, however, that ML researchers take concerns seriously and avoid overconfidence in attempts to reform. We conclude by discussing risks that arise when sources of errors are misdiagnosed and the need to acknowledge the role that human inductive biases play in learning and reform.

As someone who has followed the replication crisis in social science for years and now sits in a computer science department where it’s virtually impossible to avoid engaging with the huge crushing bulldozer that is modern ML, I often find myself trying to make sense of ML methods and their limitations by comparison to estimation and explanatory modeling. At some point I started trying to organize these thoughts, then enlisted Sayash and Arvind, who had done some work on ML reproducibility, Priyanka, who follows work on ML ethics and related topics, and Andrew as an authority on empirical research failures. It was a good coming together of perspectives, and an excuse to read a lot of interesting critiques and foundational stuff on inference and prediction (we cite over 200 papers!). As a ten-page conference-style paper this was obviously ambitious, but the hope is that it will be helpful to others who have found themselves trying to understand how, if at all, these two sets of critiques relate. On some level I wrote it with computer science grad students in mind; I teach a course to first-year PhDs where I talk a little about reproducibility problems in CS research and what’s unique compared to reproducibility issues in other fields, and they seem to find it helpful.

The term learning in the title is overloaded. By “errors in learning” here we are talking about not just problems with whatever the fitted models have inferred; we mean the combination of the model implications and the human interpretation of what we can learn from them, i.e., the scientific claims being made by researchers. We break down the comparison based on whether the problems are framed as stemming from data problems, model representation bias, model inference and evaluation problems, or bad communication.

[Table comparing concerns in ML versus psychology]

The types of data issues that get discussed are pretty different – small samples with high measurement error versus datasets that are too big to understand or document. The underrepresentation of subsets of the population to which the results are meant to generalize comes up in both fields, but with a lot more emphasis on implications for fairness in decision pipelines in ML, given its applied status. ML critics also talk about unique data issues like “harms of representation,” where model predictions reinforce some historical bias, like when you train a model to make admissions decisions based on past decisions that were biased against some group. The idea that there is no value-neutral approach to creating technology, so we need to consider normative ethical stances, is much less prevalent in mainstream psych reform, where most of the problems imply ways that modeling diverges from its ideal value-neutral status. There are some clearer analogies though if you look at concerns about overlooking sampling error and power issues in assessing the performance of an ML model.
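As a concrete illustration of the sampling error point, here is a minimal sketch (my own, in Python; the Wilson interval and the example counts are assumptions for illustration, not an analysis from the paper) of how wide the uncertainty around a reported test accuracy can be relative to a typical claimed improvement:

```python
# A back-of-the-envelope Wilson interval for test-set accuracy (illustrative
# numbers only): two "different" accuracies on a 2,000-example test set can
# sit well inside each other's sampling uncertainty.
import math

def wilson_ci(correct, n, z=1.96):
    p_hat = correct / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

print(wilson_ci(correct=1820, n=2000))  # reported 91.0% -> roughly (0.897, 0.922)
print(wilson_ci(correct=1810, n=2000))  # reported 90.5% -> roughly (0.891, 0.917)
```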

Choosing representations and doing inference are also obviously different on the surface in ML versus psych, but here the parallels in critiques that reformers are making are kind of interesting. In ML there’s colloquially no need to think about the psychological plausibility of the solutions that a learner might produce; it’s more about finding the representation where the inductive bias, i.e., properties of the solutions that it finds, is desirable for the learning conditions. But if you consider all the work in recent years aimed at improving the robustness of models to adversarial manipulations of input data, which basically grew out of acknowledgment that perturbations of input data can throw a classifier off completely, it’s often implicit that successful learning means the model learns a function that seems plausible to a human. E.g., some of the original results motivating the need for adversarial robustness were surprising because they showed that manipulations that a human doesn’t perceive as important (like slight noising of images or masking of parts that don’t seem crucial) can cause prediction failures.

Simplicity bias in stochastic gradient descent can be cast as a bad thing when it causes a model to over-rely on a small set of features (in the worst case, features that correlate with the correct labels as a result of biases in the input distribution, like background color or camera angle being strongly correlated with what object is in the picture). Some recent work explicitly argues that this kind of “shortcut learning” is bad because it defies expectations of a human who is likely to consider multiple attributes to do the same task (e.g., the size, color, and shape of the object). Another recent explanation is underspecification, which is related but more about how you can have many functions that achieve roughly the same performance given a standard train-validate-test approach but where the accuracy degrades at very different rates when you probe them along some dimension that a human thinks is important, like fairness. So we can’t really escape caring about how features of the solutions that are learned by a model compare to what we as humans consider valid ways to learn how to do the task.
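To make the shortcut learning point concrete, here is a small synthetic sketch (my own construction with NumPy and scikit-learn, not an example from the paper): a classifier trained where a spurious feature tracks the label looks great in training and then degrades once that correlation is broken at test time.

```python
# Toy illustration of "shortcut learning": a spurious feature correlates with
# the label in the training data but not under a shifted test distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shortcut_corr):
    y = rng.integers(0, 2, n)
    core = y + rng.normal(0, 1.0, n)  # weakly predictive "real" feature
    # shortcut agrees with the label with probability shortcut_corr
    shortcut = np.where(rng.random(n) < shortcut_corr, y, 1 - y) + rng.normal(0, 0.1, n)
    return np.column_stack([core, shortcut]), y

X_train, y_train = make_data(5000, shortcut_corr=0.95)  # shortcut is reliable in training
X_test, y_test = make_data(5000, shortcut_corr=0.50)    # ...but uninformative at test time

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # high: model leans on the shortcut
print("test accuracy: ", clf.score(X_test, y_test))    # drops sharply under the shift
```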

We also compare model-based inference and evaluation across social psych and ML. In both fields, implicit optimization – for statistical significance in psych and better-than-SOTA performance in ML – is suggested to be a big issue. However, in contrast to using analytical solutions like MLE in psych, optimization in ML is typically non-convex, such that the hyperparameters, initial conditions, and computational budget you use in training the model can matter a lot. One problem critics point to is that researchers don’t always account for this in reporting. How you define the baselines you test against is another source of variance, and potentially bias if chosen in a way that improves your chances of beating SOTA.
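A minimal sketch of what accounting for this variation could look like (my own, using scikit-learn on synthetic data; the model, dataset, and number of seeds are arbitrary choices for illustration): train the same pipeline under several random seeds and report the spread rather than the single best run.

```python
# Same architecture, same data, different initializations: report the
# distribution of scores instead of cherry-picking the best one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in range(10):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

# Reporting only max(scores) overstates what the method reliably delivers.
print(f"best: {max(scores):.3f}, mean: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

The same logic extends to varying hyperparameters, data splits, or compute budgets rather than just seeds.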

In terms of high-level takeaways, we point out ways that claims are irrefutable by convention across the two fields. In ML research one could say there’s confusion about what’s a scientific claim and what’s an engineering artifact. When a paper claims to have achieved X% accuracy on YZ benchmark with some particular learning pipeline, this might be useful for other researchers to know when attempting progress on the same problem, but the results are more possibilistic than probabilistic, especially when based on only one possible configuration of hyperparameters, etc., and with an implicit goal of showing one’s method worked. The problem is that the claims are often stated more broadly, suggesting that certain innovations (a new training trick, a model type) led to better performance on a loosely defined learning task like ‘reading comprehension,’ ‘object recognition’, etc. In a field like social psych, on the other hand, you have a sort of inversion of NHST as it was intended, where a significant p-value leads to acceptance of loosely defined alternative hypotheses, and subject samples are often chosen by convenience and underdescribed while the claims imply learning something about people in general.

There’s also some interesting stuff related to how the two fields fail in different ways based on unrealistic expectations about reality. Meehl’s crud factor implies that using noisy measurements, small samples, and misspecified models to argue that classes of interventions have large, predictable effects on some well-studied class of outcomes (e.g., political behavior) is out of touch with common sense about how we would expect multiple large effects to interact. In ML, the idea that we can leverage many weak predictors to make good predictions is accepted, but assumptions that distributions are stationary and that good predictive accuracy can stand alone as a measure of successful learning imply a similarly naive view of the world.

So… what can ML learn from the replication crisis in psych about fixing its problems? This is where our paper (intentionally) disappoints! Some researchers are proposing solutions to ML problems, ranging from fairly obvious steps like releasing all code and data, to templates for reporting on limitations of datasets and behavior of models, to suggestions of registered reports or pre-registration. Especially in an engineering community there’s a strong desire to propose fixes when a problem becomes apparent, and we had several reviewers who seemed to think the work was only really valuable if we made specific recommendations about what psych reform methods can be ported to ML. But instead the lesson we point out from the replication crisis is that if we ignore the various sources of uncertainty we face about how to reform a field—in how we identify problematic claims, how we define the core reasons for the problems, and how we know that a particular reform will be more successful than others—it’s questionable whether we’re making real progress in reform. Wrapping up a pretty nuanced comparison with a few broad suggestions based on our instincts just didn’t feel right.

Ultimately this is the kind of paper that I’ll never feel is done to satisfaction, since there’s always some new way to look at it, or type of problem we didn’t include. There are also various parts where I think a more technical treatment would have been nice to relate the differences. But as I think Andrew has said on the blog, sometimes you have to accept you’ve done as much as you’re going to and move on from a project.

25 thoughts on “The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning”

  1. Fascinating review paper, with a thorough framing. Much appreciated and relevant for thinking about the theme of ‘uncertainty in medical predictions’. A relevant review for the medical field may be https://pubmed.ncbi.nlm.nih.gov/30763612/ .
    Some other comparative works, all supporting the idea that ML should not be overhyped: https://pubmed.ncbi.nlm.nih.gov/33848231/
    https://pubmed.ncbi.nlm.nih.gov/32201256/
    https://pubmed.ncbi.nlm.nih.gov/25532820/

    • Thanks, here is one from 1998 https://pubmed.ncbi.nlm.nih.gov/9819841/

      But the delight in complexity plays on and on.

      I think Jessica’s post could be somewhat summarized as a lack of understanding of/adherence to good scientific process, invigorated by careerist communities that reward not thinking critically and deeply.

      Only remedy being wide documentation of repeated failures along with clarification of why and ways to mitigate (as in Jessica’s paper)? That takes many years.

    • Thanks. It expresses concerns that I had as well, based on a clear scientific approach. We should care more when applying machine learning to sensitive aspects of human behaviours.

  2. Thanks for all the references – I find these very interesting. Perhaps the idea that we should be able to declare a winner between classical statistical methods and machine learning methods should be squelched. My own suspicions (thus far confirmed by my limited experience) are:
    The differences between methods are fairly small, for data sets with a relatively small number of potential factors.
    Whatever difference between methods exists is likely to be small compared with the value of getting better quality data (e.g., better measurements).
    The ML methods are likely to perform better relative to classical methods under two conditions: (1) when the number of potential factors becomes extremely large (examples would be genetic data, image and text data) and/or (2) when the event of interest becomes extremely rare (i.e. the needle in the haystack – examples such as fraud detection, rare disease prediction, etc.).

    Any references or thoughts about these speculations would be appreciated.

    • Dale, a few references-

      Andrew Ng is talking a lot about data-centric ML these days. As in, optimize data instead of model. https://www.youtube.com/watch?v=06-AZXmwHjo

      For rare event detection, I am actually very pessimistic about ML here. Many “rare” events are predicated on a stupefying number of factors. That might mean that each one is actually quite unique, and there isn’t a nearby “neighbor” in any dataset to meaningfully learn from. In the case of, say, suicide prediction, there are general predictors that predispose someone to be high*er* risk, but nothing that comes even close to a practically useful degree of predictability. And orders of magnitude more data might not move the needle very much (see the Fragile Families study below):

      Some references on this:
      – Precision medicine and the cursed dimensionality: https://www.nature.com/articles/s41746-019-0081-5
      – The Fragile Families data challenge and issues with social outcome predictability: https://www.pnas.org/doi/10.1073/pnas.1915006117

    • Hi Dale, agreed, it’s silly to try to declare a winner. I like some of the recent work proposing integrated modeling approaches that combine aspects of explanation and prediction … e.g., you can look at something like R^2 for your psych model, but then you could use the best achievable performance of an ML model to help contextualize how close your behavioral model comes to explaining all the possible-to-explain variance (e.g., https://www.journals.uchicago.edu/doi/10.1086/718371)

      Regarding 1, you might find some of the recent work analyzing overparameterized regression to try to understand phenomena like ‘double descent’, ‘benign overfitting’, etc. in deep models interesting.

  3. Very excited to read this! I especially like this point –

    “The problem is that the claims are often stated more broadly, suggesting that certain innovations (a new training trick, a model type) led to better performance on a loosely defined learning task like ‘reading comprehension,’ ‘object recognition’, etc.”

    I wonder how much the overloading of terms from cognitive psychology makes fertile ground for unwarranted, speculative claims about the mechanisms by which inscrutably large models work. This definitely happens in public press about AI, but I’m curious how much leading ML researchers contribute too.

      • Some notable examples of interpretable models include sparse logical models (such as decision trees, decision lists, and decision sets) and scoring systems, which are linear classification models that require users to add, subtract, and multiply only a few small numbers to make a prediction. These models can be much easier to understand than multiple regression and logistic regression, which can be difficult to interpret. Now, the intuitive simplification of these regression models, by restricting the number of predictors and rounding the coefficients, does not provide optimal accuracy. This is just a post hoc adjustment. It is better to build in interpretability from the very start.

        It may make it easier for the user to predict what the model will predict, but the coefficients don’t necessarily correspond to anything in reality. They correspond to a reality where the model is true.

        There is a huge industry that revolves around misinterpreting parameters of simple linear models. Pretty much every statistical “adjustment” falls into that category.

        • Anoneuoid:

          Thanks, we attempted to prevent the interpretation that “coefficients … correspond to anything in reality” with these two sentences.

          “Emphatically, it is the abstract model that is understood not necessarily the reality it attempts to represent.”

          “Again, it is the prediction model that is understandable, not necessarily the prediction task itself.”

          Suggestions for how else to avoid that risk of misinterpretation?

        • Perhaps an example of a seemingly innocuous change to the model specification (eg, adding a new term) that changes the sign of an original coefficient the user may be tempted to interpret.

  4. I tried to approach the paper with an open mind but it’s hard to seriously consider a comparison of an essentially stalled research program to one of the most productive fields of inquiry in existence today that just breezes past the detail that ML actually works.

    • We certainly aren’t trying to imply that the fields are equally successful in terms of impact on real-world problems. Though I could see how putting these two fields side by side could make someone think that. That ML / deep learning especially have gotten so much attention because they solve the problems they take on so well was sort of a premise in writing it, but maybe we can make that clearer.

      But ML methods working shockingly well in many ways (even when why they work isn’t fully understood) does not preclude large numbers of papers being published with overhyped, uncertainty-suppressing claims. I don’t envy those who publish at the big conferences. From what I hear from colleagues it sounds like it’s hard to publish serious work and expect people to pay attention without playing hype games.

      We’re also not trying to knock ML as an entire field. In the discussion we talk about how despite the various ways critics are showing ML models can fail, many of the issues are solvable by incorporating them into the learning problem or training pipeline, kind of like adversarial training approaches. So the question is more, which of these critiques can’t be resolved by changing how you define the task? I suspect some of the communication problems and implicit optimization for good results won’t necessarily go away with better pipelines, but we aren’t asserting any final verdicts in this paper.

    • >> breezes past the detail that ML actually works.

      That is not the picture that I get of, for example, the attempt by businesses to apply ML to their own big data. Perhaps the fact isn’t as widely publicized as the very visible successes of ML in, e.g., image categorization, but the anecdotal evidence I see is that ML is proving difficult for businesses to apply in practice, and the advice they seem to fall back on is to use simpler predictive techniques whose “training” is better understood and less expensive computationally. Looking at the attempts of businesses to train ML models might be interesting to the authors, because in that setting, there is much less incentive to exaggerate the capability of the trained model. Because a failure in practice turns into something incorrect, embarrassing, unprofitable, wasteful etc. for the business itself.

      • I had the impression that ML was largely successful in many financial applications – fraud detection, for example. I had started writing a paper at one point – a takeoff on Robert Solow’s famous quip “You can see the computer age everywhere but in the productivity statistics.” My version was along the lines of “you can see big data everywhere but in the productivity statistics.” I had even put together some suggestive evidence that this was true – but I abandoned the attempt because there were many areas in which big data (and associated ML methods) were being used successfully (at least I thought). The evidence consisted of showing things like default rates not declining despite all the modeling effort. But there were too many explanations why default rates would not decline even if the ML models were largely successful.

        I’d be happy to go back to this idea and revive my writing if people really believe that business applications of ML have largely been failures. Sure they have been over-hyped, but I thought most financial institutions were heavily invested in using ML (I’m not thinking of investment return modeling here – I think that area remains elusive). If anyone wants to pursue this further, I’d be happy to share what I’ve already done.

        • ML has been successful at tasks that humans already know how to do relatively well. It can often do these faster and cheaper with a minor loss of accuracy.

          But not so much at improving the performance on tasks humans would struggle with.

      • I suspect that’s right, that a lot of the AI hype in industry is not really evidence of many of the ‘hottest’ methods in ML research being put to use; it’s probably more likely logistic regression and random forests. But the improvements of models in some research areas (e.g., large language models like BERT in NLP with the advent of transformers, or CNNs doing way better on benchmarks in computer vision) are hard to deny.

        • Two major ML advances in biomolecular science were reported in the last year. The first is AlphaFold, an ML predictor of protein structure from sequence that is a clear improvement on previous protein structure prediction approaches and is touted as a game-changer. One can be negative about this for several reasons – it’s appeared just around the time that it’s possible anyway to make useful structural models of most proteins of unknown structure by homology modelling, since it seems that effectively every protein domain fold in nature is now represented in the data bank of known protein structures (in fact this seems to be why AlphaFold works so well).

          Additionally, there is the complaint that even if an ML approach can predict the structure we haven’t really learned anything; the “protein folding problem” can only be considered solved if we can predict the structure of a protein from first principles using physicochemical knowledge.

          On the other hand, AlphaFold is certainly useful since it’s already facilitated the structure determination of a bunch of “difficult” proteins that crystallography/electron microscopy groups have struggled with, in some cases for years. A recent paper describes some examples, and it’s rare to see the following sort of exclamation in a scientific paper:

          “We are shocked…stunned…by the quality of the model. You would not believe how much effort we have put into getting this structure. Years of work…Both cryo-EM and crystallography…I mean, this is really shocking.”

          https://doi.org/10.1002/prot.26223

          So it depends somewhat on your goals. If you’re trying to understand the precise physical and chemical principles underlying protein folding then you might be a little underwhelmed by AlphaFold. If it’s important that you have a good structural model of your protein as a basis for guiding further experiments it seems pretty good. Perhaps that applies generally to ML approaches to practical problems.

  5. Human intelligence is based on moral justice, metascience, and finding economic goals. Politics is oriented towards these facts with the common welfare of society. Ecology and protection of the environment are fascinating tools. Keeping these facts in mind, AI & ML should propagate.

  6. Always do the formal qualitative reasoning.
    Data structures are not always “well behaved”.
    The existence of phase transitions in ERGMs and the No Free Lunch theorems are evidence enough to infer that.

  7. @Jessica and co-authors, a hearty thank you from a former Operations Researcher, now ML practitioner working on ML fairness and general ML hygiene at a Fortune 10 company. Corporate practices need to catch up with these guidelines.
