What did ML researchers talk about in their broader impacts statements?

This is Jessica. A few months back I became fascinated with the NeurIPS broader impact statement “experiment,” in which NeurIPS organizers asked all authors to address, in some way, the broader societal implications of their work. It’s an interesting exercise in requiring researchers to make predictions under uncertainty about societal factors they might not be used to thinking about, and to do so in spite of their incentives as researchers to frame their work in a positive light. You can read some of them here if you’re curious.

Recently I collaborated on an analysis of a sample of these statements with Priyanka Nanayakkara, who led the work and will present it next week at a conference on AI ethics, and Nick Diakopoulos. The analysis reports on the themes and sources of variation that were apparent in a random sample of broader impacts statements from 2020 NeurIPS papers. This was a qualitative analysis, so there’s some room for interpretation, but the goal was to learn something about how ML researchers grappled with a vague prompt asking them to address the broader implications of their work, including what sorts of concerns were given priority.
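
To make the reported proportions below concrete, here is a minimal sketch, assuming each sampled statement has already been hand-coded with theme labels (the themes, data, and code here are hypothetical illustrations, not the study’s actual coding scheme or pipeline), of how the share of statements mentioning each theme could be tallied:

```python
from collections import Counter

# Hypothetical hand-coded sample: each entry is the set of theme labels
# a coder assigned to one broader impact statement.
coded_sample = [
    {"efficiency", "privacy"},
    {"bias/fairness", "impacted groups"},
    {"robustness/reliability", "environment"},
    {"efficiency", "bias/fairness"},
]

# Tally how many sampled statements mention each theme and report proportions.
counts = Counter(theme for themes in coded_sample for theme in themes)
total = len(coded_sample)
for theme, n in counts.most_common():
    print(f"{theme}: {n}/{total} statements ({n / total:.0%})")
```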

Some observations I find interesting: 

  • There was a split in the statements between implications that were mostly technically oriented (discussing criteria computer scientists commonly optimize for, like robustness to perturbations in the input data or parameterization of a model) and those that tended to be more society-facing. Non-trivial proportions of the sampled statements mentioned common desiderata for learning algorithms like efficiency (~30%), generalizability (~15%), and robustness and reliability (~20%). Some of these statements mapped general properties to impacts on society (e.g., efficiency leads to saved time and money, a more robust model might positively impact safety, etc.). It’s not really clear to me how much information is added to the paper in such cases, since presumably these properties become clear if you understand the contribution. 
  • Authors of theoretical work varied in whether they considered the exercise relevant to them. Roughly equal proportions of the statements we looked at implied either that, due to the theoretical nature of the work, there are no foreseeable consequences, or that, despite the theoretical nature of the work, there may be societal consequences. 
  • Some authors were up front about sources of uncertainty they perceived in trying to identify downstream consequences of their work, noting that any societal impacts depend on the specific application of the model or method, stressing that their results are contingent on assumptions they made, noting that human error or malicious intent could throw a wrench into outcomes, or directly stating that their statements about impacts are speculation. Others, however, spoke more confidently and didn’t hedge their speculation. 
  • Since the prompt was very open-ended, authors were left to choose the time frame in which the impacts they discussed were expected to occur. Only about 10% of the sampled statements mentioned a time frame at all, and those did so in broad terms (e.g., either “long term” or “immediate”). So it was sometimes ambiguous whether authors were thinking about impacts immediately after deployment, or impacts after, for instance, users of the predictions have had time to adjust their behavior. 
  • More than half of the statements described who would be impacted, though sometimes in very broad terms, like “the healthcare domain,” “creative industries” (all those generative models!), or other ML researchers. 
  • Recommendations were implicit in many of the statements, and more than half of the statements implied who was responsible for future actions. It was often implied to be other ML researchers doing follow-up work, but regulation and policy were sometimes mentioned. 
  • When it came to themes in the types of socially relevant impacts discussed, common concerns included impacts on bias/fairness (~24%), privacy (~20%), and the environment (~10%). In a lot of ways the statements echoed common points of discussion in conversations about possible ML harms. 

Overall, my impression from reading some of the statements is that there was some “ethical deliberation” occurring, and that most authors took the exercise in good faith. I was pleasantly surprised in reading some of them. For example, some of the statements seemed quite useful from the standpoint of providing the reader of a very technical paper with some context on what types of problems might occur and, more generally, on how the work relates to real-world applications. I don’t think this latter part was necessarily the goal, but it makes me think high-level statements are preferable to asking authors to be specific, since a high-level statement can provide more of a “big picture” view than the reader might get from the paper alone, where space is often precious, without the statements reading like conjecture presented as knowledge. In terms of making ML papers more accessible to non-ML researchers, the statements may have some value. 

I had found the instructions for the statements unsatisfying because it wasn’t clear how the organizers viewed the goal, which made evaluating the success of the exercise seem impossible. Though now that I’ve read some, if their goal was simply to encourage reflection on the implications of tech, it seems to have been successful. Similarly, if the organizers wanted to amplify the types of concerns that are being discussed in AI/ML these days, like bias/fairness, environmental impacts of training large models, etc., it also seems they were successful.  

If there were or are loftier goals going forward with exercises like this, like helping readers of the paper recognize what to watch out for if they implement a technique, or helping the researchers avoid putting a lot of effort into some “dangerous” technology, then authors need more guidance on how to translate technical implications into societal ones (at least in non-obvious ways). They also need more information on how detailed the statements are intended to be in expressing what they see as implications. For example, the statements tended to be fairly short, a few paragraphs at most, and so when there wasn’t a clear set of issues to talk about related to the domain, authors sometimes mentioned more general types of outcomes like bias without it being completely clear what they meant. 

And of course there’s the question of whether ML researchers are in the right position to offer up useful information about longer term societal implications of technology, both in terms of their training and incentives. Not surprisingly, some of the statements read a bit like the “Limitations” sections that sometimes appear in papers, where often for every limitation mentioned there’s some rationalization that it’s actually not that big a problem. I expect the organizers were aware of these doubts and questions, so it seems like the exercise was either meant to signal that they want the community to take ethical considerations more seriously, or it was a kind of open-ended experiment. Probably some of both.   

In retrospect, maybe a good way for the organizers to introduce this exercise would have been to write a broader impact statement when they provided instructions! Now that I’ve read some, here’s my attempt: 

This call asks authors to write a broader impacts statement about their ML research. The outcomes of asking ML researchers to write broader impacts statements are not well known, but we believe that this process is likely to prompt authors to think more deeply about the societal consequences of their contributions than they would otherwise, and that this might lead them to wield deep learning more responsibly, so that we are all less likely to encounter reputation-compromising deep fakes of ourselves in the future. We also expect this exercise to further amplify the societal concerns that already feature prominently in discussions of machine learning ethics. Perhaps writing broader impacts statements will also help novice readers understand how dense, tedious technical papers about small improvements to transformer architectures relate to the real world, helping democratize machine learning research.

However, there are also risks to requesting such statements. For example, there is some risk that authors will rationalize away any consequences that they perceive, making them feel like they have seriously considered ethical concerns and there are none worth worrying about, even when they only spent a few minutes writing something right before the deadline, when they were clearly mentally compromised by the preceding all-nighters. There is also a risk that researchers will list potential impacts on society that, if a careful analysis were done by ethicists or other social scientists, would be deemed not very likely, but readers will not be able to ascertain this, and will worry about, and go on to write future broader impacts statements about, impacts that are actually irrelevant. Finally, there is some risk that authors will feel like they are being evaluated based on an exercise for which there is no rubric, and for which it is hard to imagine what one would look like, which, as we know from our teaching experience, makes many people uncomfortable and leads to bad reviews. Future organizers of NeurIPS should consider how to minimize these risks.

8 thoughts on “What did ML researchers talk about in their broader impacts statements?”

  1. Jessica:

    Just don’t ask for a broader impact statement for each blog post. Or for each blog comment. It’s impact statements all the way down. . . .

    More seriously, your discussion reminds me of a related idea, which is how good it is to require a Limitations section in any paper. I think this is a great idea for three reasons:

    1. I like to put limitations in a paper, but journal reviewers often react in a hostile way to any sign of weakness. If the limitations section is required, then I have to do it!

    2. It’s good for readers to see some limitations.

    3. Being forced to write about limitations forces authors to think about limitations, and that’s a good thing.

      • My impression is that a paragraph on limitations is standard for articles in epidemiological journals, at least when observational data is used. (But Epidemiology is a more modest and circumspect field than, say, Economics or Computer Science.)

        • anon –

          > My impression is that a paragraph on limitations is standard for articles in epidemiological journals, at least when observational data is used.

          Yes, I really like seeing cautions as to inferring causality in epidemiological research based on observational or non-longitudinal data, and discussion of the limitations in the representativeness of the sampling.

          That should be more standard in other fields.

    • > 2. It’s good for readers to see some limitations

      +1 for limitations sections.

      Should be standard, IMO.

      I always go there to read, straight after I read the thesis and what the authors concluded about their thesis.

      I find it the most interesting part of many papers and, as a non-scientist, a good way to cut through technical or arcane material to get a feel for the quality of the research.

  2. Jessica:

    Did you mean for the link attached to the phrase “read some of them here” to be gated? I don’t mind logging into my gmail and requesting access but wanted to make sure that was the intention.

    When you originally posted about this I was skeptical because I was thinking about specialization and self- versus external regulation. First, specialization generally makes things more efficient, so my thought was that it would be better for ML researchers to just do their thing and build useful tools, and then have ML/AI ethicists and social scientists do the criticism and connecting to society. The drawback of course is over-specialization, and you get the Weberian iron cage thing where there’s abdication of responsibility and missing the forest for the trees.

    Second, I don’t have too much of a favorable prior on self-regulation. I just feel like the incentives are not adequately aligned, a point you mention when you talk about limitations sections in which the problems identified get rationalized away as not big deals. And I also think that sometimes these self-regulation things can get in the way of external regulation by being a distraction or excuse. People often respond to these types of views with “why not do both?” but I think there are opportunity costs and politics that prevent that from happening.

    None of these are strong beliefs (especially because I’m not an expert in this field and am still learning), so your descriptions of these sections do move my prior and make me a bit more optimistic, though I’m still skeptical. This is definitely an interesting issue, so thanks for following up on that previous post! Your and Andrew’s connection to limitations sections also illustrates how this issue is relevant beyond the specific area of ML research.

    • Oops meant the link to not require Google sign-in. Just updated.

      I was skeptical too, and I still am, though as I say in the post I was at least pleasantly surprised that some of them seemed helpful for adding context to papers. I agree with both of your reasons for concern – requiring authors to write them might be the easiest way to do this at scale, at least as a starting point, but if there are research areas/methods related to topics like futures studies and speculative design, which there are, it implies to me there’s some method/training that makes some speculation more reliable. Presumably very few ML researchers have been exposed to that stuff; my sense is that it tends to be much more humanistic/interpretivist, pretty far outside of mainstream CS. I haven’t seen any good arguments for why CS researchers are in a good position to do it. It seems like a goal rather than something organizers might claim is true now.

      And self-regulation – I’m skeptical. I like limitations sections in papers, but at least in my field it’s a section that reviewers are often halfway writing themselves, by requiring authors to mention what they perceive as major limitations before they sign off on accepting. Some authors might do a good job being honest without that external pressure, but certainly not all. So, I still think broader impacts as it was carried out in 2020 is a flawed idea unless the goals were very simple, like exposing researchers to thinking about ethics, whether they are well prepared to or not. But I’m open to the idea that the net effect might have been positive.
