
The NeurIPS 2020 broader impacts experiment

This year NeurIPS, a top machine learning conference, required a broader impacts statement from authors. From the call:

 In order to provide a balanced perspective, authors are required to include a statement of the potential broader impact of their work, including its ethical aspects and future societal consequences. Authors should take care to discuss both positive and negative outcomes

I heard that next year ICML, another top ML conference, will add the same requirement. 

Questions about how to make computer scientists more thoughtful about the potential societal implications of what they create have been around for a while, with an increasing number of researchers looking at how to foster reflection or transparency through different design methods, fact sheets to go along with tech contributions, etc. But a requirement that all authors try to address broader societal implications in publications is a new thing. Actions like these are part of a reform movement aimed at shifting computer science values rather drastically away from the conventional view that algorithms and math are outside of any moral philosophy. 

Here, I’m not going to take on bigger questions of what this kind of action means or how useful it is, and instead reflect more on how it was done, and what questions I, as a curious outsider (I don’t publish at NeurIPS), have had in looking at the official messaging about the broader impacts statement. It’s felt a bit like doing a puzzle. 

While the call doesn’t go into too much detail about how the statements should be written or used, the FAQ for authors says:

Do I have to complete the Broader Impact section? Answer: Yes, please include the section in the submission for review. However, if your work is very theoretical or is general enough that there is no particular application foreseen, then you are free to write that a Broader Impact discussion is not applicable.

So, at a minimum, some acknowledgment of the requirement is required. How is it used in the reviewing process?

Can my submission be rejected solely on the basis of the Broader Impact Section? Answer: No. Reviewers will be asked to rate a submission based on the evaluation criteria. They will also be asked to check whether the Broader Impact is adequately addressed. In general, if a paper presents theoretical work without any foreseeable impact in the society, authors can simply state “This work does not present any foreseeable societal consequence”. If a paper presents a method or an application that might have reasonable chances to have some broader impact, authors can discuss along the following lines: “This work has the following potential positive impact in the society…. At the same time, this work may have some negative consequences because… Furthermore, we should be cautious of the result of failure of the system which could cause…” 

I checked out the evaluation criteria which are also part of the call, which include this sentence about broader impacts:

Regardless of scientific quality or contribution, a submission may be rejected for ethical considerations, including methods, applications, or data that create or reinforce unfair bias or that have a primary purpose of harm or injury. 

It’s a little ambiguous, but since they say above that a submission cannot be rejected solely on the basis of the Broader Impact section, I assume the reviewer would have to point to other parts of the paper (the work itself?) to argue that there’s a problem. Maybe we are ruling out rejecting cases where the science is sound and reasonably ethical, but the broader impacts statement is badly written?  

The FAQ also includes this:

How should I write the Broader Impact section? Answer: For additional motivation and general guidance, read Brent Hecht et al.’s white paper and blogpost, as well as this blogpost from the Centre for Governance of AI. For an example of such a discussion, see sec. 4 in this paper from Gillick et al. 

So I looked at some of the links, and the blogpost by Hecht gives some important info about how reviewers are supposed to read these things: 

As authors, you’re also likely reviewers (esp. this year!). NeurIPS leaders should probably address this directly, but our proposal’s view is that it’s not your job as a reviewer to judge submissions for their impacts. Rather, you should evaluate the *rigor with which they disclose their impacts*. Our proposal also recommends that reviewers adopt a “big tent” approach as “norms and standards develop”.

So reviewers should judge how rigorously authors reported impacts, but they can’t reject papers on the basis of them. The NeurIPS reviewer guidelines say a little more, basically echoing this point that the reviewer’s judgment is about whether the authors did an adequate job reflecting on positive and negative potential impacts. 

Right after this point, the reviewer guidelines mention the broader issue of ethical concerns as a possible reason for paper rejection:

Does the submission raise potential ethical concerns? This includes methods, applications, or data that create or reinforce unfair bias or that have a primary purpose of harm or injury. If so, please explain briefly.

Yes or No. Explain if the submission might raise any potential ethical concern. Note that your rating should be independent of this. If the AC also shares this concern, dedicated reviewers with expertise at the intersection of ethics and ML will further review the submission. Your duty here is to flag only papers that might need this additional revision step.

Something unsatisfying about this piece for me is that as a reviewer, I’m told to assess the broader impacts statement only for how well it reports on positive and negative potential, but then I can raise ethical concerns with the paper. In a case where I would not have noticed an ethical concern with the paper, but a convincing broader impacts statement brings one to my attention, I assume I can then flag the paper for an ethics concern. If the dedicated ethics reviewers described above then decided to reject the paper, I think it’s still fair to say that this would not be an example of the paper being rejected solely on the basis of the broader impacts. But the broader impacts statement could have effectively handed over the matches that started the fire.

This makes me wonder, if I were an author who felt I had some private information about possible ethical consequences that might not be available to my reviewers, based, e.g., on the amount of time I’ve spent thinking about my specific topic, would I be motivated to share it? It seems like it would be slightly safer to keep it to myself, since even if the reviewers are smarter than I think, it shouldn’t make the outcome of my paper any worse. 

NeurIPS is now over, so some authors may be looking for extra feedback about the broader impacts statements in the form of actual paper outcomes. NeurIPS program chairs published a Medium post with some stats. 

This year, we required authors to include a broader impact statement in their submission. We did not reject any papers on the grounds that they failed to meet this requirement. However, we will strictly require that this section be included in the camera-ready version of the accepted papers. As you can see from the histogram of the number of words in this section, about 9% of the submission did not have such a section, and most submissions had a section with about 100 words.

We appointed an ethics advisor and invited a pool of 22 ethics reviewers (listed here) with expertise in fields such as AI policy, fairness and transparency, and ethics and machine learning. Reviewers could flag papers for ethical concerns, such as submissions with undue risk of harm or methods that might increase unfair bias through improper use of data, etc. Papers that received strong technical reviews yet were flagged for ethical reasons were assessed by the pool of ethics reviewers.

Thirteen papers met these criteria and received ethics reviews. Only four papers were rejected because of ethical considerations, after a thorough assessment that included the original technical reviewers, the area chair, the senior area chair and also the program chairs. Seven papers flagged for ethical concerns were conditionally accepted, meaning that the final decision is pending the assessment of the area chair once the camera ready version is submitted. Some of these papers require a thorough revision of the broader impact section to include a clearer discussion of potential risks and mitigations, and others require changes to the submission such as the removal of problematic datasets. Overall, we believe that the ethics review was a successful and important addition to the review process. Though only a small fraction of papers received detailed ethical assessments, the issues they presented were important and complex and deserved the extended consideration. In addition, we were very happy with the high quality of the assessments offered by the ethics reviewers, and the area chairs and senior area chairs also appreciated the additional feedback.

This seems mostly consistent with what was said in the pre-conference descriptions, but doesn’t seem to rule out the concern that a good broader impacts statement could be the reason a reviewer flags a paper for ethics. This made me wonder more about how the advocates frame the value of this exercise for authors. 

So I looked for evidence of what intentions the NeurIPS leadership seemed to have in mind in introducing these changes. I even attended a few NeurIPS workshops that were related, in an effort to become a more informed computer scientist myself. Here’s what I understand the intentions to be:

  1. To give ML researchers practice reflecting on ethics implications of their work, so they hopefully make more socially responsible tech in the future.
  2. To encourage more “balanced” reporting on research, i.e., “remove the rose-colored glasses”, as described here.
  3. To help authors identify future research they could do to help address negative societal outcomes of their work. 
  4. To generate interesting data on how ML researchers react to an ethics prompt.

One thing I’ve heard relatively consistently from advocates is that this is a big experiment. This seems like an honest description, conveying that they recognize that requiring reflection on ethics is a big change to expectations placed on CS researchers, and that since they don’t know exactly how best to produce this reflection, they’re experimenting. 

But implying that it’s all a big open-ended experiment seems counterproductive to building trust among researchers in the organizers’ vision around ethics. Don’t get me wrong, I’m in favor of getting computer scientists to think more intentionally about possible downsides to what they build. That this needs to happen in some form seems inevitable given the visibility of algorithms in different decision-making applications. My observation here is just that there seem to be mixed messages about where the value in the experiment lies to someone who is honestly trying to figure it out. Is reflecting on possible ethical issues thought to be of intrinsic value itself? That seems like the primary intention from what organizers say. But the ambiguity in the review process, and other points I’ve seen argued in talks and panel discussions make me think that transparency around possible implications is thought to be valuable more as a means to an end, i.e., making it easier for others to weigh pros and cons so as to ultimately judge what tech should or shouldn’t be pursued. I think if I were a NeurIPS author asked to write one, the ambiguity about how the statements are meant to be used would make it hard for me to know what incentives I have to do a good job now, or what I should expect in the future as the exercise evolves. From an outside perspective looking in, I get the sense there isn’t yet a clear consensus about what this should be.

Maybe it’s naive and even cliche to be looking for a clear objective function. Or maybe it’s simply too early to expect that. As a computer scientist witnessing these events, though, it’s hard to accept that big changes are occurring but no one knows exactly where it’s all headed. Trusting in something that is portrayed as hard to formalize or intentionally unformalized doesn’t come easy. Although I can say that the lack of answers and confusing messaging is prompting me to read more of the ML ethics lit, so perhaps there’s a silver lining to making a bold move even if the details aren’t all worked out.

I also can’t help but think back to watching the popularization of the replication crisis back in the early 2010s, from the initial controversy, to growing acceptance that there was a problem, to the sustained search for solutions that will actually help shift both training and incentives. I think we’re still very early in a similar process in computer science, but with many who are eager to put changes in motion. 

44 Comments

  1. Andrew says:

    Interesting story. I’m reminded of the general principle of setting up decision problems by starting by specifying goals. In this case, the goals are . . . I’m not sure, because I’m not close to the core of the Neurips community, but I guess one goal is to have practitioners think harder about ethics and general impacts of their work, and another goal might be to avoid the embarrassment and conflict that can arise when some controversial papers get published. Forcing an impacts discussion of each paper should make it harder for certain controversial topics to slip through unnoticed, which is not to say that controversial papers would or should disappear, but just that the society would be airing such work with its eyes open. In addition, there could be intermediate goals such as Neurips being a model for other societies to implement similar policies.

    This is all speculation, and I don’t know enough about any of this to have a sense of what aspects of this new policy are good or bad ideas; I’m just inclined to think that in any case it can be useful to talk about goals, in addition to considering specific questions of policies, incentives, and behavior.

    • Agreed. Here the stated goals are pretty tentative. But the larger goal I’m getting, from some of what the ML ethics work cites, i.e., moral philosophy and philosophy of science, seems very ambitious, to change the assumption that scientific methods can be judged in a mostly “value-free” way. I think that’s part of the distinction I’m trying to understand.

      • Anon For Obvious Reasons says:

        This policy seems terribly frightening. Science is judged in a “value-free way” as a goal, because we don’t want important scientific results to be suppressed because of the political preferences of editors or referees. Look at the impacts discussions – they are almost universally about differential effects based on group identity, largely from the context of current North American political discussions. There are surely interesting ethical issues in scientific publishing, but the request here is not about fleshing out some deep philosophy of science – it is basically, as in the Gebru-LeCun debate, a political battle about whether algorithms that reflect existing society are bad if there are margins of existing society the ethicist dislikes, with “AI ethics” being a field absolutely dominated by a politics that is very left-leaning politically. Am I missing something here? It seems deliberately obtuse to ignore this elephant in the room.

        • I’m not trying to be obtuse, just (for obvious reasons!) not commenting on that until I’ve read more. For now I am trying to channel my uncertainty about how to think about this into learning more about the underlying philosophies that are getting cited, so that I am more informed on where the arguments are coming from.

        • Jukka says:

          I am not convinced about the “value-free way”. You don’t even have to think about the historical horrors of unethical human experimentation in medicine and the like. I mean sure there was something going on when Einstein, Teller, Russell, Oppenheimer, Polanyi, and whoever wrote to the Bulletin of the Atomic Scientists? And I do wonder why CS research involving human subjects does not have to go through institutional ethics review? In other fields there are things like the Declaration of Helsinki, you know.

        • Robert says:

          Most science that gets judged in a value-free way is *actually* science; i.e. a passive description of how some part of nature functions. Most of what is being passed off as science here is actually engineering: the focus is on demonstrating how to make a large number of technical choices to design an apparatus that solves a problem in a particular way. Many of the “science” papers that are not directly about solving a problem are still a sort of engineering research, where the working of a specific apparatus is dissected. Usually the justification of this research within the paper is specifically to design a better apparatus later.

          I don’t expect this proposal to be a helpful solution, but the idea that NeurIPS as a whole gets some kind of science-pass rubs me the wrong way. A scientific framing makes the venue seem sexier to possible authors but plenty of the work published there is not distant enough from specific uses to claim value neutrality full-stop.

      • elin says:

        In some of the articles about the situation at Google Brain, the issue of energy consumption and contribution to global climate change seems to have been part of the discussion.

  2. Jessica—is there an example somewhere of what they wanted? Or what kind of ethics they were concerned about? Did they cite which papers were causes for concern or why? How could only a dozen or so papers cause ethical concerns when the whole field of AI/ML poses enough ethical dilemmas for an undergraduate course?

    Should all image recognition papers come with a disclaimer that the technology can be used for surveillance by [insert favorite villain here]? Should OCR papers come with disclaimers that they reduce the number of bank tellers we need? Should the alpha-Go paper have said that they’re going to send the current human Go champions the way of John Henry? Do self-driving car papers need to report they might run over a pedestrian or be repurposed into self-driving drones? Do machine translation papers have to remind us that mistranslations can be misleading or even costly? What kind of disclaimers do they want for the deep fakes papers?

    How about the ethics of how much power was spent fitting models? [OK, that one was mostly sarcastic.]

    Does every ML and AI paper have to add a disclaimer that they move us one step closer to the singularity? [That one, too, but I stress that it’s only “mostly”.]

    Or is the concern more about applications? Like the one Dan blogged about?

    • These are good questions, some of which I’ve been trying to answer. The NeurIPS site gives this as an example: https://arxiv.org/pdf/1912.06979.pdf
      They also point authors looking for more motivation and guidance on how to write them here: https://acm-fca.org/2018/03/29/negativeimpacts/
      Here: https://brenthecht.medium.com/suggestions-for-writing-neurips-2020-broader-impacts-statements-121da1b765bf
      And here: https://medium.com/@GovAI/a-guide-to-writing-the-neurips-impact-statement-4293b723f832

      My sense from looking at these is that the concern is mostly applications that could cause some harm to people once deployed. Like Andrew suggests above, part of the goal is probably to flag the papers that are going to embarrass the community when the media finds out about them, which I expect would be ones where the application is relatively clear. From the fact that they said authors could just say “This work does not present any foreseeable societal consequence” I don’t think blanket disclaimers are the goal, but where to draw the line isn’t clear to me. My impression is that asking questions in the limit won’t bring many answers here.

      • First, let me say that I’m all for ethical considerations in science. I’m just not sure how to do it and it seems to me like these weak measures may be worse than nothing at all in terms of giving people some false sense of security that they’re behaving ethically.

        Given the behavior of the big tech companies and big governments of the world, it’s pretty clear they’re happy to ignore ethical issues of the kind being discussed here. Governments argue we’re safer with more CCTV and surveillance and spying. DARPA argues we need more powerful “warfighters” to be safe. Financiers argue that Wal-Mart and Amazon are positive forces for bringing cheap products to people, despite crushing local businesses. Nobody in tech seems worried about wage inequality while our wages continue to go up; no, we just feel that we’re special and deserve the rewards. We don’t care about taxi drivers or secretaries or newspaper writers being put out of work if we can automate their jobs. Facebook doesn’t even worry about the hate speech other than as a PR situation to handle; they literally generate their own hate speech when backed into a corner (see this NYT article, which I found pretty shocking). Big tech is so unethical they colluded on hiring fraud for years until they were slapped on the wrist for it. American research universities are in bed with DARPA, ONR, and the Army and Air Force research arms.

        Jessica says in the above comment:

        I don’t think blanket disclaimers are the goal, but where to draw the line isn’t clear to me.

        I agree with both clauses. But I’m guessing boilerplate will be the inevitable result. It’s less thinking to find one of these that works and boilerplate it. It’s like the data management plan or mentoring plan in an NSF proposal. Once you get one that sounds good, you use it in all of your grants.

        https://arxiv.org/pdf/1912.06979.pdf ? The ethical considerations seem to be that (a) it might learn song lyrics that are “biased or offensive” by training from examples, and (b) it might plagiarize. The suggested mitigation strategies are (a) a profanity filter and hate-speech detector, and (b) explicitly condition to reduce plagiarism.

        https://acm-fca.org/2018/03/29/negativeimpacts/ ? Driverless cars put drivers out of work. Crowdsourcing and the gig economy can lead to low wages. Mobile phones lead to traffic deaths. Companion robots alter the meaning of relationships. Social media platforms host conspiracy theories. And as they say, the list goes on and on. But these are very generic. Why didn’t that first article’s disclaimer cite lost wages by songwriters? Why didn’t they cite learning language from corpora that would spout conspiracy theories as well as hate speech? My point isn’t that those aren’t causes for concern, but that the causes for concern on any of these projects are nearly endless because the papers tend to be about very low level technology solutions.

        I have a hard time believing that “part of the goal is probably to flag the papers that are going to embarrass the community when the media finds out about them”. I don’t see any embarrassment at all in the community about crowdsourcing, driverless cars, etc.

        https://brenthecht.medium.com/suggestions-for-writing-neurips-2020-broader-impacts-statements-121da1b765bf ? I don’t see how 100-word synopses of ethical issues are going to “introduce new and healthy accountability into your research lives” or how “your funding agencies, your company’s leaders, and the general public are about to get a better view into how your work published at NeurIPS might affect the world.”

        Maybe they should recommend inviting the IRB to review your grants rather than “hire a social scientist”.

  3. Radford Neal says:

    The presence of this requirement in the call for papers is the reason that I declined to review papers for NeurIPS this year.

    I should note that I did review papers the previous two years, and in fact was rated as a top reviewer both years, which meant that they kindly waived my conference registration fee.

    The best case result of this requirement is that it becomes just a virtue signalling formality, with everyone including meaningless expressions of concern about racial bias, world peace, or whatever other platitudes seem vaguely related to the paper. And maybe a few papers could actually get away with saying that there are just no ethical implications of their proof of a lower bound on MCMC convergence times. I’m reminded of some math papers from the USSR that I’ve seen, which started with a couple of ridiculous paragraphs about how the work contributes to the victory of the proletariat.

    This best case is still pretty bad, because it helps propagate the idea that EVERYTHING IS POLITICAL. This idea is the foundation of totalitarianism. In the last century, it was responsible for hundreds of millions of deaths.

    Worse would be if the organizers of NeurIPS, and other conferences, actually take this seriously. Then one can expect researchers with disfavoured political opinions to be shut out of conferences (not that this hasn’t already happened), and some research areas to be shunned because it’s too easy for people hoping to score activist points to attack the area. And of course the EVERYTHING IS POLITICAL message is even stronger.

    • somebody says:

      > This best case is still pretty bad, because it helps propagate the idea that EVERYTHING IS POLITICAL.

      To nitpick a little here, it’s indisputable that applied machine learning has already had political and ethical consequences. Reckless application of black box nonparametric mean estimators with no uncertainty quantification, feedback loops, causal interpretation of predictors, technological unemployment, diffusion of legal responsibility, high-energy cost HPCs, underrepresented subgroups in training data, none of those are hypothetical—all of these ethical problems have happened. I don’t think that the solution is lowercase-p politics, a bureaucratic hoop to make an editorial board feel better and placate a legal department, but the politics here are not constructed, they’re just there.

      • jim says:

        “underrepresented subgroups in training data”

        To nitpick a little here, not everyone views this as an ethical problem. If the goal is to determine if an individual is qualified for a loan, it’s still not clear to me why specific subgroups deserve special consideration. This is, in fact, a political view. It represents the ethics of *some* people, but not *all* people, perhaps not *most* people, and it is by no means an ethical absolute truth.

        • Curious says:

          jim:

          The issue is that systematic bias produces results that do not accurately reflect the risk at an individual level. When this occurs for subgroups of people that have been historically enslaved, oppressed, and systematically excluded from economic success, there seems added importance that the process be fair and accurate rather than biased.

        • somebody says:

          I understand that this issue tickles some people’s sensibilities but I’m talking about an engineering problem here, not affirmative action.

          Say I’m a police force and I’m predicting people’s probability of crime from their digital footprint. I compute some propensity score that gets transformed by some softmax to a probability, and where probability > 0.6 I take some action. For obvious reasons, Amish people have a much smaller digital footprint, so my classifier is likely to be pretty inaccurate. The result is a broader propensity distribution on the Amish subpopulation, which leads to more false positives AND false negatives. This is a pretty weird example, but the point is that the probabilistic outputs of a classifier are not “true” probabilities in the sense of an RNG, but a result of the engineering decisions. If I train something on Facebook data, it’ll be less accurate on the types of people not on Facebook. Very often, in real life data science problems, people building important classifiers don’t go looking for or collecting data, they just use the data that’s lying around closest to them, but is that necessarily a fair process? At my company, we don’t even check model performance outside the US.
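The thresholding point above can be sketched with a toy model. Everything numeric here is an assumption made for illustration, not something from the comment: the classifier's logit is taken to be +1.5 for people who truly belong to the positive class and -1.5 for those who do not, plus Gaussian noise, and the group with a small digital footprint is simply given a larger noise standard deviation. Only the 0.6 action threshold comes from the text.

```python
import math
from statistics import NormalDist

THRESHOLD = 0.6  # action threshold mentioned in the comment
# The same threshold on the logit scale (inverse of the logistic function).
LOGIT_THRESHOLD = math.log(THRESHOLD / (1 - THRESHOLD))


def error_rates(noise_sd, signal=1.5):
    """Return (false_positive_rate, false_negative_rate) for one subgroup.

    Hypothetical model: logit = +signal for true positives and -signal for
    true negatives, plus N(0, noise_sd) noise. A larger noise_sd stands in
    for a subgroup the classifier has little data about.
    """
    noise = NormalDist(0.0, noise_sd)
    # True negative flagged: -signal + noise > LOGIT_THRESHOLD
    false_positive = 1.0 - noise.cdf(LOGIT_THRESHOLD + signal)
    # True positive missed: signal + noise < LOGIT_THRESHOLD
    false_negative = noise.cdf(LOGIT_THRESHOLD - signal)
    return false_positive, false_negative


well_covered = error_rates(noise_sd=0.5)      # assumed: rich digital footprint
sparse_footprint = error_rates(noise_sd=2.0)  # assumed: little data available

print("well covered     (FP, FN):", well_covered)
print("sparse footprint (FP, FN):", sparse_footprint)
```

Under these assumed numbers both error rates are higher for the sparse-footprint group, even though the threshold and the model are identical for everyone.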

          To take a complete strawman example, let’s say the following is true for every person in my city:

          1. Red people from Town A always repay loans and red people from Town B never do
          2. Blue people with car X always repay loans and blue people with car Y never do

          Color, township, and car ownership are bernoulli(0.5) and uncorrelated.

          I’m the credit agency, and I have the money to survey township, or car ownership, but not both. If I pick the township, I can perfectly predict loan repayment for red people, but blue people are a toss up. If I pick car ownership, I can perfectly predict loan repayment for blue people, but red people are a toss up. What goes in the training set? It’s up to me!
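The strawman can be checked by enumerating the eight equally likely combinations of color, township, and car. This is only a sketch: the function names are hypothetical, and it assumes the lender always knows a person's color and predicts the majority outcome among people who look the same on the surveyed feature.

```python
from itertools import product

# All (color, town, car) combinations; each is equally likely because the
# three attributes are independent Bernoulli(0.5) draws.
POPULATION = list(product(["red", "blue"], ["A", "B"], ["X", "Y"]))


def repays(color, town, car):
    """Ground truth from the strawman: town decides for red, car for blue."""
    if color == "red":
        return town == "A"
    return car == "X"


def accuracy_by_color(feature):
    """Best achievable accuracy per color when only `feature` is surveyed."""
    index = {"town": 1, "car": 2}[feature]
    results = {}
    for color in ("red", "blue"):
        people = [p for p in POPULATION if p[0] == color]
        correct = 0
        for value in sorted({p[index] for p in people}):
            group = [p for p in people if p[index] == value]
            n_repay = sum(repays(*p) for p in group)
            # Predict the majority outcome for everyone in this group.
            correct += max(n_repay, len(group) - n_repay)
        results[color] = correct / len(people)
    return results


print("survey township:", accuracy_by_color("town"))
print("survey car:     ", accuracy_by_color("car"))
```

Surveying township gives perfect accuracy for red people and coin-flip accuracy for blue people, and surveying car ownership flips that, so the choice of what to collect decides which group the model serves well.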

          To speak to the real world, there are lots of people without good credit, not because of bad borrowing records but because they live in primarily cash economies. They call this “credit invisibility.” There are lots of people with excellent borrowing records who are diligent on their payment schedules who do not leave an electronic footprint. Their predicted probability of loan repayment is still going to be closer to 0.5 and the models will be very uncertain about these people because there’s no data available to the system. It calls to attention that the system doesn’t output their “true probability of repayment”, but rather “the probability of repayment to the best of our machine’s ability to tell.” The machine’s ability to tell reflects not just the uncertainty of the world, but the decisions and priorities of the people who made it: what data goes in, what features go in, what places get model verification and what places don’t.

          This compounds with the feedback loop effect—credit begets credit, and also the fact that these aren’t even correctly calibrated bayesian probabilities, just softmax transformations—there’s additional uncertainty due to uncertainty in the final model parameters. Sure, gradient descent converges to a global minimum, but which global minimum? The correct way to compute class probabilities in a neural network would be to make it bayesian and sample from adiabatic monte carlo or something, which would generally move the outputs closer to the vector(1/n) relative to the naive minimizer, but many people don’t even know that these “probabilities” from stochastic gradient descent aren’t calibrated!
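The claim that accounting for parameter uncertainty pulls outputs toward 1/n can be illustrated in the two-class case. This is a minimal sketch under a strong assumption that is not in the comment: uncertainty about the model parameters is represented as a Gaussian distribution over a single logit, and the averaged probability is approximated on an evenly spaced grid of quantiles. It is not a Bayesian neural network and does not implement any particular sampler.

```python
import math
from statistics import NormalDist


def sigmoid(x):
    """Two-class softmax: probability of the positive class given a logit."""
    return 1.0 / (1.0 + math.exp(-x))


def averaged_probability(mean_logit, logit_sd, n_points=1000):
    """Average sigmoid(logit) over an assumed N(mean_logit, logit_sd) logit.

    Uses midpoint quantiles of the distribution as a deterministic stand-in
    for posterior samples.
    """
    dist = NormalDist(mean_logit, logit_sd)
    quantiles = [dist.inv_cdf((i + 0.5) / n_points) for i in range(n_points)]
    return sum(sigmoid(q) for q in quantiles) / n_points


point_estimate = sigmoid(2.0)                 # plug in a single "best" logit
averaged = averaged_probability(2.0, 1.5)     # assumed uncertainty of 1.5

print("point estimate:", round(point_estimate, 3))
print("averaged:      ", round(averaged, 3))
```

With a positive mean logit, the averaged probability lands between 0.5 and the point estimate, which is the direction of the shift the comment describes.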

          • Chebyshev says:

            An economist would argue that, in free markets, somebody will step in to lend to that missing set of people, without competition from the algo-driven credit firms. That somebody will employ either humans or a mix of humans and algorithms, knowing what the run-of-the-mill algos are doing.

            Nothing corrects things like economics.

    • Ben says:

      > This best case is still pretty bad, because it helps propagate the idea that EVERYTHING IS POLITICAL. This idea is the foundation of totalitarianism. In the last century, it was responsible for hundreds of millions of deaths.

      This is the kind of bad hot take I’d expect at the end of one of these papers lolol.

      “We have shown X. Unfortunately we think this idea might be the foundation of totalitarianism, etc. etc.”

      The thing that annoys me about something like this is more like what Bob said in his first paragraph (https://statmodeling.stat.columbia.edu/2020/12/21/the-neurips-2020-broader-impacts-experiment/#comment-1619929) — that this would just be some weird, formalized excuse mechanism.

    • Robert says:

      Insisting that everyone doing public impacting work has a responsibility to think a couple steps ahead is wildly different (if not essentially opposed!) to insisting that the government should make people behave how it wants, and it seems totally absurd to put both of them in one tent of “claiming everything is political”.

    • Brian says:

      Far-fetched, hyperbolic statements on your part, Radford. Love how some concerns about ethics lead to totalitarianism that leads to mass deaths. Dude, get some self-awareness.

      • Chebyshev says:

        It is impossible to be far fetched or hyperbolic when dealing with this hideous beast. Neal is being more civilised in his response.

        It always starts with innocent, friendly language that plays on social desirability bias. This needs to be nipped in the bud.

        Brian, if you feel you are safe because you say the right words, you need to read up on Bolsheviks. Never again.

  4. Chebyshev says:

    Honestly, NIPS was so much better.

  5. I am reminded of a workshop on faculty mentoring of graduate students I attended, at which the worst mentors spoke the most wonderful words about mentoring techniques and philosophy. It is delusional to think that this exercise will be any different, and that the gold star from the reviewers will reward actual ethics rather than the ability to spout the right-sounding words. (And of course, it’s a penalty for those who can’t write eloquently enough.) Perhaps I am wrong, but how could this be shown? What’s the assessment of this “experiment?”

  6. A.G.McDowell says:

    There is an ACM code of ethics at https://ethics.acm.org/code-of-ethics/ If all computing professionals became ACM members and complied with this literally, whoever gets to make judgements about what is ethical by ACM rules and what is not would control all computing and be one of the most powerful people on the planet. I wonder if ethics could ever attract people who sought political power, instead of being entirely politically neutral and interested only in doing good for its own sake?

    • Martha (Smith) says:

      A.G. said,
      “I wonder if ethics could ever attract people who sought political power, instead of being entirely politically neutral and interested only in doing good for its own sake?”

      The irony is dripping all over my screen.

  7. Christian Hennig says:

    Thanks for this piece! It reflects the uncertainty that I have thinking about these things nicely.

  8. Anoneuoid says:

    Watch the FDA vaccine approval meetings. There is limited time to talk.

    If the discussion gets guided towards “ethics” of who gets the vaccine first and other relatively minor issues then there is no time to discuss the risks of ADE in nursing home patients, after waning, and exposure to new strains.

    I.e., you waste your time talking about how ethical something is (basically, bikeshedding) while ignoring the elephant in the room.

  9. From being in a different setting where a similar request was made, it followed from the sense that something was required but a lot of uncertainty about what and especially how to judge it.

    So to get others thinking about and provide some input – make it optional, open ended (vague) and assure people they won’t be judged on it.

    Here they seem to have gone ahead though and judged people (without due notice) at the publication step.

    But they got people thinking about the issue – right?

    • True, some ML researchers might feel more comfortable writing about broader impacts now since they got practice, and like I said, thinking about this has inspired me to learn more about this topic, so it’s probably partially effective. I think what’s hard for me to take here is that in trying to achieve that goal, the organizers seemed to overlook the need to build trust and knowledge so that something like this doesn’t become a joke or thing that people try to game or simply figure out the formula for. Surely they anticipated researchers would have a lot of questions about why and how to do this. The lack of info about how the statements are judged, how those judgments are validated, and what knowledge base one should look to to understand the different types of ethical problems seem like major oversights to the point that it almost seems intentional, like they were asking for a leap of faith.

      • Radford Neal says:

        “asking for a leap of faith”

        And why would they do that? Could it be that the whole point is to get people to comply with this requirement whether they have faith that it is a good idea or not? That it is fundamentally an exercise in asserting power?

        • That there’s an element of power struggle seems true. I don’t think the whole point is to assert power though. I think there are some genuinely good intentions behind the requirement, just less productive attempts at cross-talk and training than would seem ideal if the goal is trust and buy-in from the community.

          • Andrew says:

            Jessica:

            Well put. The existence of power struggles should not be taken to imply that the main point is to assert power. Sometimes the main point is to assert power; other times there is an external goal, and people assert power directly (to get others to go along) or reactively (because others are asserting power already) or in anticipation of opposition, or because similar disputes have led to power struggles in the past.

  10. somebody says:

    My opinion is that well-intentioned bureaucrats, when correctly identifying a real problem, fallaciously conclude that additional bureaucracy can fix it. Sometimes, when you see that something’s wrong, the solution doesn’t involve you at all. The desire to feel like you’re helping is strong.

  11. Joseph Bartolomeo says:

    > ‘Regardless of scientific quality or contribution, a submission may be rejected for ethical considerations, including methods, applications, or data that create or reinforce unfair bias’

    In other words, science has to pass through the political filter of the day. If reality is wrong, unfair or politically incorrect (in our subjective view), we’re going to reject reality.

    Prof. Pedro Domingos published a great piece on this https://spectator.us/militant-liberals-politicizing-artificial-intelligence/

    • To be fair, many fields have long had methods or even entire areas that were off limits. That part isn’t new. The vagueness about what constitutes unfair bias is unsettling though.

      Unfortunately I find it hard to take Domingos very seriously; he sometimes asks valid questions but my sense is that ultimately his agenda is less about trying to have a conversation with what he perceives as the other side and more about riling people up as much as he can so he can complain about being canceled. He had a reputation for this at his former institution, where I started my faculty career. His Spectator piece exaggerates, for instance claiming every NeurIPS paper “describing how to speed up an algorithm needs to have a section on the social goods and evils of this obscure technical advance” when the organizers were clear authors could say the work has no foreseeable societal consequences.

      • Anon says:

        I wonder, do you also find it hard to take Timnit Gebru and Anima Anandkumar seriously? As much as Pedro, if not more so, they also seem intent on riling people up as much as possible so that they can complain about being victims.

        • I will answer that for the sake of being honest, but first it should be noted that my reasons for not taking Domingos very seriously are not just based on a single shallowly written article or Twitter exchange. They’re based on personal experiences from being at the same institution and witnessing either directly or indirectly his behavior in various situations.

          I know nothing of Anandkumar except her titles and what I heard (but didn’t witness) happened between her and Domingos on Twitter, which apparently culminated in her making a list of people who liked Domingos’ tweets and attempting to ridicule them. Assuming that’s true, then yes, on the basis of that single event, I might say I find it hard to take her seriously, since I find that kind of good/evil framing of the whole ML bias conversation extremely unhelpful.

          On the other hand I’ve never had the impression Timnit Gebru is vengeful or engaging in deliberately misleading rhetoric. I’ve read a few of her papers, seen her talk a few times, and seen what kinds of things she shares on Twitter. While her approach to trying to change things seems very different than mine, she has always struck me as honest and devoted to her cause. So yes, I take her seriously even if I don’t completely agree with everything she says.

    • Andrew says:

      Joseph, Jessica:

      I was curious so I followed the link and read Domingos’s article.

      It seems to me that Domingos got halfway to a key point—but only halfway.

      The article is subtitled, “‘Debiasing’ algorithms actually means adding bias,” and its first sentence presupposes that decisions are “being made by algorithms that are mathematically incapable of bias.”

      He writes of “complex mathematical formulas that know nothing about race, gender or socioeconomic status. They can’t be racist or sexist any more than the formula y = a x + b can,” but the key issue is not any claim that “a x + b” is biased. The point, I think, is that if “x” is biased, then “a x + b” will be biased too. In that sense, a formula such as “a x + b” can promulgate bias or launder bias or make certain biases more socially acceptable.

      If x is biased, so that “a x + b” is biased, then you can debias the algorithm by first subtracting the bias of x. In practice it is not so easy, because it’s typically in the nature of biases that we don’t know exactly how large they are (if we knew, we’d have already made the correction), but there’s no reason why a “debiasing” algorithm can’t at least reduce bias.
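The arithmetic of this argument can be sketched in a few lines. The coefficients, the size of the input bias, and the two bias estimates below are all made-up values for illustration; in practice the true bias is exactly the thing that is not known.

```python
def score(x: float, a: float = 2.0, b: float = 1.0) -> float:
    """The formula y = a x + b, with hypothetical coefficients."""
    return a * x + b


def debiased_score(x_observed: float, estimated_bias: float) -> float:
    """Subtract an estimate of the input bias before applying the formula."""
    return score(x_observed - estimated_bias)


# Hypothetical setup: the observed input is the true input plus a bias.
x_true = 3.0
input_bias = 0.5
x_observed = x_true + input_bias

true_score = score(x_true)          # what an unbiased input would give
biased_score = score(x_observed)    # off by a * input_bias

# A roughly right estimate shrinks the error; a badly wrong one overshoots.
close_estimate = debiased_score(x_observed, estimated_bias=0.4)
poor_estimate = debiased_score(x_observed, estimated_bias=1.5)

for label, value in [
    ("no correction", biased_score),
    ("close estimate", close_estimate),
    ("poor estimate", poor_estimate),
]:
    print(f"{label}: error = {value - true_score:+.2f}")
```

The formula itself adds nothing to the bias; it passes the input bias through, scaled by a. Subtracting an imperfect estimate leaves a residual of a times the estimation error, which is smaller than the original bias only when the estimate is reasonably close.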

      OK, so here’s the half that I think Domingos gets right: “Bias” is relative to some model and some particular scenario. If “a x + b” is biased, arising from some biases in “a,” “x,” and “b,” we can expect these component biases to themselves vary over time and across contexts. A number, and thus the result of an algorithm, can be biased in favor of members of a particular group in some settings and biased against them in another. “Debiasing” is, by its nature, not automatic, and attempts to debias can make things worse, both directly and by adding another input into the algorithm and thus another potential for bias. One can draw an analogy to Keynesian interventions into the economy, which can make things better or can make things worse, but also complexify the system by adding another player into the system.

      And here’s the half that I think Domingos gets wrong: He’s too sanguine about existing algorithms being unbiased. I don’t know why he’s so confident that existing algorithms for credit-card scoring, parole consultation, shopping and media recommendations, etc., are unbiased and not capable of outside improvement. I respect his concern about political involvement in these processes—but the existing algorithms are human products and are already the result of political processes. Again, his concern that “progressives will blithely assign prejudices even to algorithms that transparently can’t have any,” is missing the point that the structure and inputs of these algorithms are the result of existing human choices.

      I don’t know what’s the best way forward here—as I said, I think Domingos is half right, but his discussion is ridiculously simplistic, unworthy I think of a professor of computer science. In his article, he writes that he “posted a few tweets raising questions about the latest changes — and the cancel mob descended on me. Insults, taunts, threats — you name it.” I hate twitter. I recommend he post his arguments not just on twitter (where discussion is little more than a series of yeas and nays) or on a magazine website (which will either have no discussion at all, or have discussions that degrade into insults, taunts, etc.) but as a blog, where there can be thoughtful exchanges and he can engage with these issues.

      • Thanks for your assessment Andrew, I agree with you on what he gets right and wrong. I had balked at the simplistic algorithms can’t be biased statement when I skimmed it, which seemed like a sensationalist attempt to dispute a more superficial argument than much of the work on algorithmic bias is trying to make. Also that debiasing makes algorithms worse at their intended function; it’s trivial to think of examples like predictive policing where the goal is to send police to places they’re most likely to be needed but the prior arrest data driving the predictions may instead be a better reflection of where police tend to congregate. To say “algorithms increasingly run our lives” but not be able to recognize how they might promulgate bias in the very settings he’s describing makes no sense to me.

        On the context part: I like your point about how the component biases can change over time and contexts, and trying to fix a bias can introduce other biases (reminds me of Corbett-Davies and Goel’s work on how it can be impossible to satisfy multiple common notions of fairness at once https://arxiv.org/abs/1808.00023). Related to contextual complexity I would add that in many domains people oversee the use of model predictions, and have some ability to step in, which means we should be studying not just how the algorithm does versus how the person would do without it, but also how they do together. I have tended to think about all this from the standpoint of trying to reduce the various forms of uncertainty that make broader impacts statements hard to evaluate… we can try to outline what may go wrong, but can’t easily know whether our model of how it will be used captures the real world parameters that matter (ontological uncertainty), and what the values of those parameters are (epistemic and sometimes aleatoric uncertainty). A student I work with wrote a short paper on the broader impacts exercise from the standpoint of uncertainty reduction https://arxiv.org/pdf/2011.13170.pdf. But so far characterizing it like this hasn’t really led us to see a much better way forward.

        +1 to blogs for discussion! I appreciate being able to write about this topic in particular here, since Twitter is kind of a war zone whenever it comes up.
