Artificial intelligence and aesthetic judgment

This is Jessica. In a new essay reflecting on how we get tempted to aestheticize generative AI, Ari Holtzman, Andrew, and I write: 

Generative AIs produce creative outputs in the style of human expression. We argue that encounters with the outputs of modern generative AI models are mediated by the same kinds of aesthetic judgments that organize our interactions with artwork. The interpretation procedure we use on art we find in museums is not an innate human faculty, but one developed over history by disciplines such as art history and art criticism to fulfill certain social functions. This gives us pause when considering our reactions to generative AI, how we should approach this new medium, and why generative AI seems to incite so much fear about the future. We naturally inherit a conundrum of causal inference from the history of art: a work can be read as a symptom of the cultural conditions that influenced its creation while simultaneously being framed as a timeless, seemingly acausal distillation of an eternal human condition. In this essay, we focus on an unresolved tension when we bring this dilemma to bear in the context of generative AI: are we looking for proof that generated media reflects something about the conditions that created it or some eternal human essence? Are current modes of interpretation sufficient for this task? Historically, new forms of art have changed how art is interpreted, with such influence used as evidence that a work of art has touched some essential human truth. As generative AI influences contemporary aesthetic judgment we outline some of the pitfalls and traps in attempting to scrutinize what AI generated media “means.”

I’ve worked on a lot of articles in the past year or so, but this one is probably the most out-of-character. We are not exactly humanities scholars. And yet, I think there is some truth to the analogies we are making. Everywhere we seem to be witnessing the same sort of beauty contest, where some interaction with ChatGPT or another generative model is held up for scrutiny, and the conclusion drawn that it lacks a certain emergent “je ne sais quoi” that human creative expressions like great works of art achieve. We approach our interactions as though they have the same kind of heightened status as going to a museum, where it’s up to us to peer into the work to cultivate the right perspective on the significance of what we are seeing, and try to anticipate the future trajectory of the universal principle behind it.

At the same time, we postulate all sorts of causal relationships where conditions under which the model is created are thought to leave traces in the outputs – from technical details about the training process to the values of the organizations that give us the latest models – just like we analyze the hell out of what a work of art says about the culture that created it. And so we end up in a position where we can only recognize what we’re looking for when we see it, but what we are looking for can only be identified by what is lacking. Meanwhile, the artifacts that we judge can be read as a signal of anything and everything at once.

If this sounds counterproductive (because it is), it’s worth considering why these kinds of contradictory modes of reading objects have arisen in the past over the history of art: to keep fears at bay. By making our judgments as spectators seem essential to understanding the current moment, we gain a feeling of control.  

And so, despite these contradictions, we see our appraisals of model outputs in the current moment as correct and arising from some innate ability we have to recognize human intelligence. But aesthetic judgments have never been fixed – they have always evolved along with innovations in our ability to represent the world, whether through painting or photography or contemporary art. And so we should expect the same of our judgments of generative AI. We conclude by considering how the idea of taste and aesthetic judgment might continue to shape our interactions with generative model outputs, from “wireheading” to generative AI as a kind of art historical tool we can turn toward taste itself.

In the real world people have goals and beliefs. In a controlled experiment, you have to endow them

This is Jessica. A couple weeks ago I posted on the lack of standardization in how people design experiments to study judgment and decision making, especially in applied areas of research like visualization, human-centered AI, privacy and security, NLP, etc. My recommendation was that researchers should be able to define the decision problems they are studying in terms of the uncertain state on which the decision or belief report in each trial is based, the action space defining the range of allowable responses, the scoring rule used to incentivize and/or evaluate the reports, and the process that generates the signals (i.e., stimuli) that inform on the state. And that not being able to define these things points to limitations in our ability to interpret the results we get.

I am still thinking about this topic, and why I feel strongly that when the participant isn’t given a clear goal to aim for in responding, i.e., one that is aligned with the reward they get on the task, it is hard to interpret the results. 

It’s fair to say that when we interpret the results of experiments involving human behavior, we tend to be optimistic about how what we observe in the experiment relates to people’s behavior in the “real world.” The default assumption is that the experiment results can help us understand how people behave in some realistic setting that the experimental task is meant to proxy for. There sometimes seems to be a divide among researchers, between a) those who believe that judgment and decision tasks studied in controlled experiments can be loosely based on real world tasks without worrying about things being well-defined in the context of the experiment and b) those who think that the experiment should provide (and communicate to participants) some unambiguously defined way to distinguish “correct” or at least “better” responses, even if we can’t necessarily show that this understanding matches some standard we expect to operate in the real world.

From what I see, there are more researchers running controlled studies in applied fields that are in the former camp, whereas the latter perspective is more standard in behavioral economics. Those in applied fields appear to think it’s ok to put people in a situation where they are presented with some choice or asked to report their beliefs about something but without spelling out to them exactly how what they report will be evaluated or how their payment for doing the experiment will be affected. And I will admit I too have run studies that use under-defined tasks in the past. 

Here are some reasons I’ve heard for not using a well-defined task in a study:

People won’t behave differently if I do that. People will sometimes cite evidence that behavior in experiments doesn’t seem very responsive to incentive schemes, extrapolating from this that giving people clear instructions on how they should think about their goals in responding (i.e., what constitutes good versus bad judgments or decisions) will not make a difference. So it’s perceived as valid to just present some stuff (treatments) and pose some questions and compare how people respond.

The real world version of this task is not well-defined. Imagine studying how people use dashboards giving information about a public health crisis, or election forecasts. Someone might argue that there is no single common decision or outcome to be predicted in the real world when people use such information, and even if we choose some decision like ‘should I wear a mask’ there is no clear single utility function, so it’s ok not to tell participants how their responses will be evaluated in the experiment. 

Having to understand a scoring rule will confuse people. Relatedly, people worry that constructing a task where there is some best response will require explaining complicated incentives to study participants. They might get confused, which will interfere with their “natural” judgment processes in this kind of situation. 

I do not find these reasons very satisfying. The problem is how to interpret the elicited responses. Sure, it may be true that in some situations, participants in experiments will act more or less the same when you put some display of information on X in front of them and say “make this decision based on what you know about X” and when you display the same information and ask the same thing but you also explain exactly how you will judge the quality of their decision. But – I don’t think it matters if they act the same. There is still a difference: in the latter case where you’ve defined what a good versus bad judgment or decision is, you know that the participants know (or at least that you’ve attempted to tell them) what their goal is when responding. And ideally you’ve given them a reason to try to achieve that goal (incentives). So you can interpret their responses as their attempt at fulfilling that goal given the information they had at hand. In terms of the loss you observe in responses relative to the best possible performance, you still can’t disambiguate the effect of their not understanding the instructions from their inability to perform well on the task despite understanding it. But you can safely consider the loss you observe as reflecting an inability to do that task (in the context of the experiment) properly. (Of course, if your scoring rule isn’t proper then you shouldn’t expect them to be truthful under perfect understanding of the task. But the point is that we can be fairly specific about the unknowns).
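To make the scoring-rule piece concrete, here is a minimal sketch of what telling participants how their probability reports will be judged (and paid) could look like, using a quadratic (Brier) score. The task, payment scheme, and numbers are hypothetical, just to illustrate the idea of aligning incentives with report quality:

```python
# Hypothetical setup: the participant reports a probability that some binary
# state (say, "the candidate wins") turns out to be 1. The quadratic (Brier)
# score is a proper scoring rule, so a participant who understands the rule
# maximizes their expected payment by reporting their true belief.

def brier_score(reported_prob, outcome):
    """Lower is better; 0 means a confident, correct report."""
    return (reported_prob - outcome) ** 2

def payment(reported_prob, outcome, base=1.00, scale=1.00):
    """Map the score to a bonus so incentives align with report quality."""
    return base - scale * brier_score(reported_prob, outcome)

# A participant who reports 0.7 when the state is realized as 1:
print(round(brier_score(0.7, 1), 2))   # 0.09
print(round(payment(0.7, 1), 2))       # 0.91
```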

When you ask for some judgment or decision but don’t say anything about how that’s evaluated, you are building variation in how the participants interpret the task directly into your experiment design. You can’t say what their responses mean in any sort of normative sense, because you don’t know what scoring rule they had in mind. You can’t evaluate anything. 

Again this seems rather obvious, if you’re used to formulating statistical decision problems. But I encounter examples all around me that appear at odds with this perspective. I get the impression that it’s seen as a “subjective” decision for the researcher to make in fields like visualization or human-centered AI. I’ve heard studies that define tasks in a decision theoretic sense accused of “overcomplicating things.” But then when it’s time to interpret the results, the distinction is not acknowledged, and so researchers will engage in quasi-normative interpretation of responses to tasks that were never well defined to begin with.

This problem seems to stem from a failure to acknowledge the differences between behavior in the experimental world versus in the real world: We do experiments (almost always) to learn about human behavior in settings that we think are somehow related to real world settings. And in the real world, people have goals and prior beliefs. We might not be able to perceive what utility function each individual person is using, but we can assume that behavior is goal-directed in some way or another. Savage’s axioms and the derivation of expected utility theory tell us that for behavior to be “rationalizable”, a person’s choices should be consistent with their beliefs about the state and the payoffs they expect under different outcomes.
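For concreteness, one standard way to write that rationalizability condition (with Theta the state space, A the action space, p the person’s beliefs over states, and u their utility) is:

```latex
a^{*} \in \arg\max_{a \in A} \; \mathbb{E}_{\theta \sim p}\!\left[u(a, \theta)\right]
      = \arg\max_{a \in A} \; \sum_{\theta \in \Theta} p(\theta)\, u(a, \theta)
```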

When people are in an experiment, the analogous real world goals and beliefs for that kind of task will not generally apply. For example, people might take actions in the real world for intrinsic value – e.g., I vote because I feel like I’m not a good citizen if I don’t vote. I consult the public health stats because I want to be perceived by others as informed. But it’s hard to motivate people to take actions based on intrinsic value in an experiment, unless the experiment is designed specifically to look at social behaviors like development of norms or to study how intrinsically motivated people appear to be to engage with certain content. So your experiment needs to give them a clear goal. Otherwise, they will make up a goal, and different people may do this in different ways. And so you should expect the data you get back to be a hot mess of heterogeneity. 

To be fair, the data you collect may well be a hot mess of heterogeneity anyway, because it’s hard to get people to interpret your instructions correctly. We have to be cautious interpreting the results of human-subjects experiments because there will usually be ambiguity about the participants’ understanding of the task. But at least with a well-defined task, we can point to a single source of uncertainty about our results. We can narrow down reasons for bad performance to either real challenges people face in doing that task or lack of understanding the instructions. When the task is not well-defined, the space of possible explanations of the results is huge. 

Another way of saying this is that we can only really learn things about behavior in the artificial world of the experiment. As much as we might want to equate it with some real world setting, extrapolating from the world of the controlled experiment to the real world will always be a leap of faith. So we better understand our experimental world. 

A challenge when you operate under this understanding is how to explain to people who have a more relaxed attitude about experiments why you don’t think that their results will be informative. One possible strategy is to tell people to try to see the task in their experiment from the perspective of an agent who is purely transactional or “rational”:

Imagine your experiment through the eyes of a purely transactional agent, whose every action is motivated by what external reward they perceive to be in it for them. (There are many such people in the world actually!) When a transactional agent does an experiment, they approach each question they are asked with their own question: How do I maximize my reward in answering this? When the task is well-defined and explained, they have no trouble figuring out what to do, and proceed with doing the experiment. 

However, when the transactional human reaches a question that they can’t determine how to maximize their reward on, because they haven’t been given enough information, they shut down. This is because they are (quite reasonably) unwilling to take a guess at what they should do when it hasn’t been made clear to them. 

But imagine that our experiment requires them to keep answering questions. How should we think about the responses they provide? 

We can imagine many strategies they might use to make up a response. Maybe they try to guess what you, as the experimenter, think is the right answer. Maybe they attempt to randomize. Maybe they can’t be bothered to think at all and they call in the nearest cat or three year old to act on their behalf. 

We could probably make this exercise more precise, but the point is that if you would not be comfortable interpreting the data you get under the above conditions, then you shouldn’t be comfortable interpreting the data you get from an experiment that uses an under-defined task.

Deja vu on researching whether people combined with LLMs can do things people can do

This is Jessica. There has been a lot of attention lately on how we judge whether a generative model like an LLM has achieved human-like intelligence, and what not to do when making claims about this. But I’ve also been watching the programs of some of the conferences I follow fill up with a slightly different rush to document LLMs: papers applying models like GPT-4 to tasks that we once expected humans to do, to see how well they do. For example, can we use ChatGPT to generate user responses to interactive media? Can they simulate demographic backstories we might get if we queried real populations? Can they convince people to be more mindful? Can they generate examples of AI harms? And so on.

Most of this work is understandably very exploratory. And if LLMs are going to reshape how we program or get medical treatment or write papers, then of course there’s some pragmatic value to starting to map out where they excel versus fail on these tasks, and how far we can rely on them to go. 

But do we get anything beyond pragmatic details that apply to the current state of LLMs? In many cases, it seems doubtful.

One problem with papers that “take stock” of how well an LLM can do on some human task is that the technology keeps changing, and even between the big model releases (e.g., moving from GPT-3 to GPT-4) we can’t easily separate out which behaviors are more foundational, resulting from the pre-training, versus which are arising as a result of interactive fine-tuning as the models get used. This presents a challenge to researchers who want something about their results to be applicable for more than a year or two. There needs to be something we learn that is more general than this particular model version applied to this task. But in this kind of exploratory work, that’s hard to guarantee. 

To be fair, some of these papers can contribute intermediate-level representations that help characterize a domain-specific problem or solution independent of the LLM. For instance, this paper developed a taxonomy of different types of cognitive reframing that work for negative thoughts in the course of applying LLMs to the problem. But many don’t.

I’m reminded of the early 2010s when crowdsourcing was really starting to take off. It was going to magically speed up machine learning by enabling annotation at scale, and let behavioral researchers do high throughput experiments, transforming social science. And it did in many ways, and it was exciting to have a new tool. But if you looked at a lot of the specific research coming out to demonstrate the power of crowdsourcing, the high level research question could be summarized as “Can humans do this task that we know humans can do?” There was little emphasis on the more practical concerns about whether, in some particular workflow, it makes sense to invest effort in crowdsourcing, how much money or effort it took the researchers to get good results from crowds of humans, or what would happen if the primary platform at the time (Amazon Mechanical Turk) stopped being supported. 

And now here we are again. LLMs are not people, of course, so the research question is more like “By performing high dimensional curve fitting on massive amounts of human-generated content, can we generate human-like content?” Instead of being about performance on some benchmark, this more applied version becomes about whether the AI-generated content is passable in domain X. But since definitions of passable tend to be idiosyncratic and developed specific to each paper, it’s hard to imagine someone synthesizing all this in any kind of concrete way later. 

Part of my distaste for this type of research is that we still seem to lack an intermediate layer of understanding of what more abstract behaviors we can expect from different types of models and interactions with models. We understand the low-level stuff about how the models work, we can see how well they do on these tasks humans usually do, but we’re missing tools or theories that can relate the two. This is the message of a recent paper by Holtzman, West, and Zettlemoyer, which argues that researchers should invest more in developing a vocabulary of behaviors, or “meta-models” that predict aspects of an LLM’s output, to replace questions like What is the LLM doing? with Why is the LLM doing that?

I guess one could argue that this kind of practical research is a more worthwhile use of federal funding than the run-of-the-mill behavioral study, which might set out to produce some broadly generalizable result but shoot itself in the foot by using small samples, noisy measurements, an underdefined population, etc. But at least in studies of human behavior there is usually an attempt at identifying some deeper characterization of what’s going on, so the research question might be interesting, even if the evidence doesn’t deliver. 

Defining decisions in studies of visualization and human-centered AI

This is Jessica. A few years ago, Dimara and Stasko published a paper pointing to the lack of decision tasks in evaluation of visualization research, where it’s common to talk about decision-making, but then to ask simpler perceptual style questions in the study you run. A few years earlier I had pointed to the same irony when taking stock of empirical research on visualizing uncertainty, where despite frequent mention of “better decision-making” as the objective for visualizing uncertainty, few well-defined decision tasks are studied, and instead most studies evaluate how well people can read data from a chart and how confident they report feeling about their answers to the task. Colloquially, I’ve heard a decision task described as a choice between alternatives, or a choice where the stakes are high, but neither of these isolates a clear set of assumptions. 

Then there is all the research being produced on “AI-advised decisions” – how people make decisions with the aid of AI and ML models – which has become much more popular in the last five years. Some of this research is invested in isolating decision tasks to build general understanding about how people use model predictions. E.g., according to one recent survey of empirical human-subjects studies on using AI to augment human decisions, these studies are “necessary to evaluate the effectiveness of AI technologies in assisting decision making, but also to form a foundational understanding of how people interact with AI to make decisions.” The body of empirical work on AI-advised decisions as a whole is thought to allow us to “develop a rigorous science of human-AI decision-making”; or “to assess that trust exists between human-users and AI-embedded systems in decision making”; etc. Reading these kinds of statements, it would seem there must be some consensus on what a decision is and how decisions are a distinct form of human behavior compared to other tasks that are not considered decisions.

But looking at the range of tasks that get filed under “studying decision making” in both of these areas, it’s not very clear what the common task structure is that makes these studies about decision-making. For some, the point seems to be to compare human decisions to a definition of rational behavior, which is sometimes spelled out by the researchers and other times not. Sometimes the point is to study “subjective decisions,” like helping a friend decide if the list price of a house matches its valuation, where the participants are intended to use their own judgment about what to prioritize.

If we are going to isolate decision-making as an important class of behavior when we study interfaces, I think we should be able to give a definition of what that means. I get that human decision-making might sometimes seem too hard to formalize (because if it weren’t, why haven’t we figured out how to automate it?) And decision theory isn’t necessarily familiar if you’re coming from a computer science background. But it seems hard to make progress or learn from a body of empirical work on decisions if we can’t say exactly what classifies a task as a decision. The survey papers I’ve seen on decision making in visualization or in human-centered AI conclude that we need a more coherent definition of decision, but no one appears to be suggesting anything concrete.

So here’s a proposal for one way to understand what constitutes a decision problem: a decision task involves the person choosing a response from some set of possible responses, where we know that the quality of the response depends on the realization of a state of the world which is uncertain at the time of the decision. Additionally, for the fields I’m talking about above, we generally want to assume that there is some information (or signal) available to the decision maker when they make their decision which is correlated with the uncertain state. 

We can summarize this by saying when we talk about people making a decision we should be able to point to: 

  1. An uncertain state of the world. E.g., if we are taking recidivism prediction as the decision task, the uncertain state is whether the person will commit another crime after being released, which can take the value 0 (no) or 1 (yes).  
  2. An action space from which the decision maker chooses a response. Above I called the action space a choice of response, because I’ve seen a colloquial understanding of decision that assumes that a task has to involve choosing between some actions that correspond to something that feels like a real-world choice between a small number of options. It doesn’t. The response to the decision problem could be a probability forecast or some other numeric response. What matters is that we have a way to evaluate the quality of that response that accounts for the realization of the state.
  3. A (scoring) rule that assigns some quality score to each chosen action so we can evaluate the decision. I think sometimes people hear ‘scoring rule’ and they think it has to be a proper scoring rule, or at least a continuous function, or something like that. It can refer to any way in which we assign value to different responses to the task. However, we should acknowledge that we can use scoring rules for different purposes even in a single study, and think about why we do this. E.g., sometimes we might have one scoring rule that we use to incentivize participants in our study (which may just be a flat reward scheme regardless of the quality of your responses, which is used a lot at least in visualization research). Then we have some different scoring rule that we use to evaluate their responses, like evaluating the accuracy of the responses. If we want to conclude that we have learned about the quality of human decisions, it’s the latter we care about more, but in many situations we should be thinking carefully about the former as well, so that participants understand the decision problem the same way that we do when we analyze it. So we should be able to identify both when running a decision study.

 And, optionally for studies comparing different interfaces to aid a decision maker:

  1. Some signal (or set of signals) that inform about the uncertain state that the decision-maker has access to in making their decision. This could be a visualization of some relevant data, or the prediction made by a model on some instance.   

This may seem obvious to some, as I am essentially just describing components of statistical decision theory. But I think making these aspects of a decision problem explicit when talking about decisions in visualization and HCAI research would already be a step forward. It would at least give us a list of things to make sure we can identify if we are trying to understand a decision task. And it could help us realize the limitations on how much we can really say about decision-making when we can’t specify all these components. For example, if there’s no uncertain state on which the quality of someone’s response to the task depends, what’s the point of trying to evaluate different decision strategies? Or if we can’t describe how the signals our interface is providing compare in terms of conveying information about the uncertain state, then how can we evaluate different approaches to presenting them?
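As a sketch of what making these components explicit might look like, here is a minimal, hypothetical specification of the recidivism example from above. The payoffs and the noisy risk-score signal are made up for illustration, not taken from any actual study:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence
import random

@dataclass
class DecisionProblem:
    states: Sequence                    # possible realizations of the uncertain state
    actions: Sequence                   # the action space the participant chooses from
    score: Callable                     # score(action, state) -> quality of the response
    signal: Optional[Callable] = None   # signal(state) -> stimulus shown to the participant

# Recidivism example: the state is whether the person re-offends (0/1), the
# action is a binary release decision, the scoring rule encodes made-up costs
# for each (action, state) pair, and the signal is a noisy risk score whose
# distribution shifts with the true state.
recidivism = DecisionProblem(
    states=[0, 1],
    actions=["release", "detain"],
    score=lambda a, s: {("release", 0): 1, ("release", 1): -5,
                        ("detain", 0): -1, ("detain", 1): 0}[(a, s)],
    signal=lambda s: min(1.0, max(0.0, random.gauss(0.3 + 0.4 * s, 0.15))),
)
```

Writing the problem down this way also forces the distinction between the rule used to incentivize participants and the rule used to evaluate their responses to be stated rather than left implicit.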

Something else I’ve noticed is papers that refer to a decision task but then bury the information about the nature of the scoring rule that is used, especially that used to incentivize the participants, as if it doesn’t matter for anything. But if we give no thought to what to tell study participants about how to make a decision well, then we should be careful about using the results of the study to talk about making better or worse decisions – they might have been trying different things, doing whatever seemed quickest, etc. 

Also, some studies resist defining decision quality even in assessing responses, as if this would take away from the realism of the task. I think there’s a temptation in studying these topics to assume there’s some inherent blackbox nature to how people make decisions or use visualizations that absolves us as researchers of having to try to formalize anything. It isn’t necessarily wrong to study tasks where we can’t say what exactly would constitute a better decision in some real world decision pipeline, but if we want to work toward a general understanding of human decision making with AIs or with visualizations through controlled empirical experiments, we should study scenarios that we can fully understand. Otherwise we can make observations perhaps, but not value judgments.  

Related to this, I think adhering to this definition of decision would make it easier for researchers to tell the difference between normative decision studies and descriptive or exploratory ones. If the goal is simply to understand how people approach some kind of decision task, or what they need, like this example of how child welfare workers screen cases, then it doesn’t necessarily matter if we can’t say what the scoring rule(s) are – maybe that’s part of what we’re trying to learn. But I would argue that whenever we want to conclude something about decision quality, we should be able to describe and motivate the scoring rule(s) we’re using. I’ve come across papers in both of the areas I mentioned that seem to confuse the two. I also have mixed feelings about labeling some decisions as “subjective,” though I understand the motivation behind trying to distinguish the more formally defined tasks from those that seem underspecified. There’s a risk of “subjective” making it sound like it’s possible to have a decision for which there really is no scoring rule, implicit or not, but I don’t think that makes sense.

Of course there is lots more that could be said about all this. For example, if you’re going to study some decision task with the goal of producing new knowledge about human decision making in general, I think you should be able to go further than just specifying your decision problem: you should also understand its properties and motivate why they are important. I find that often there is an iterative process of specifying the decision task for an experiment – you take a stab at specifying something you think might work, then you attempt to understand how well it “works” for the purposes of your experiment. This process can be very opaque if you don’t have a good sense of what properties matter for the kind of claim you hope to make from your results. I have some recent work where we lay out an approach to evaluating the decision experiments themselves, but will leave that for a follow-up post.

P.S. These thoughts are preliminary, and I welcome feedback from those studying decisions in the areas I’ve mentioned, or other domains.

Jurassic AI extinction

Coming to theaters near you, some summer, sometime, maybe soon

This is Jessica. I mostly try to ignore the more hype-y AI posts all over my social media. But when “a historic coalition of experts” comes together to back a statement implying the human race is doomed, it’s worth a comment. In case you missed it, a high profile group of signatories came together to back the statement: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

Intellectuals sometimes like to make statements, and some even try to monetize them, but how did this group arrive at this particularly dramatic sentence? There’s been plenty of speculation, including among experts, about scenarios where AIs like LLMs do things like trick people, or perpetuate stereotypes, or violate data privacy, and how these things could get worse when they are deployed with more agency in the world. But how did we get from here to extinction all of a sudden? Why couldn’t we just emphasize safety or something? We already have plenty of centers and initiatives for AI safety. Are we now going to see rebranding around extinction? 

It makes one wonder if this coalition is privy to some sort of special knowledge about how we go from ChatGPT not apologizing for its errors to populations dying off. Because if we can’t foresee how that’s going to happen, why make extinction the headline? What about just making the world less objectionable/unfair/deceptive/anything else we actually have evidence AI can contribute to? What exactly does upping the ante contribute to the already overhyped AI landscape? 

Maybe we need to put ourselves in the shoes of the AI researcher who has spent their career being celebrated for their advancements to the technology, but who hasn’t really engaged in much criticism involving AI’s negative potential. It seems reasonable to imagine that they might feel some pressure to sound the alarm in some way, given how much louder the critics are now that the hype machine is finally recognizing that it’s worth hearing about the darker side. I think many of us sense that it’s gotten harder to be a techno-optimist without acknowledging other people’s worries, even if you haven’t really paid attention until now.

So maybe extinction appears as an easy way to signal being concerned about AI. It puts you in the headlines for your altruistic urges, but under a concern that’s indefinite enough that an equally vague gesture toward solutions, like “we need to focus more on responsible AI or AI safety” feels satisfactory. At least for me, extinction is not very emotionally evocative. It’s a nice clean way to signal terror without having to get too specific. We might as well be making vague references to something like “closing the loop.” 

But it’s confusing… when did it become cool again to be the dramatic AI guy? I can remember a faculty candidate once unironically mentioning the singularity in their job talk when I was a grad student, and suddenly everyone sat up a bit, like Did he actually just say that? It was sort of an unwritten rule that you couldn’t get all sci-fi extremist and expect to be taken seriously. But now existential risks seem to be experiencing a resurgence. This would not necessarily be a problem if there were some logical argument or model to warrant the extreme projection. But where’s the evidence that it’s extinction that we need to guard against, rather than the more mundane (and much more probable) human bias gets amplified, self-driving car causes wreck, family man gets put in jail due to automated face recognition etc. kind of world?

This seemingly rapid leap–from how language models that predict the next word can generate human-seeming text to experts warning about the extinction of the human race–makes me think of Plato expressing his fear of the poet in The Republic, who he thought was so dangerous that he should be banished from the city. My understanding is that Plato’s fear didn’t seem so extreme at the time, because there was no sharp distinction perceived between what we would now call works of art versus other types of objects that seemed to possess their own creative principle, like nature. So the idea that a poem could be in the world like a natural object carried weight with people. But even Plato wasn’t talking about human extinction from the words of the poet. His concerns with art were more with its potential for moral corruption. At any rate, Plato’s fear starts to seem pretty down to earth relative to the words of these AI experts.

I can understand how AI researchers who have been raising concerns about AI safety for years would find this slightly annoying (many of whom are women – Margaret Mitchell, Abeba Birhane, and many others – whose work tends to go unmentioned when a Hinton or LeCun speaks up). Someone responsible for contributing to the underlying technology becomes concerned after being pretty quiet for years and it’s a massively newsworthy event that paints that person as the new spokesperson for safe AI. I’m glad there’s some recognition when the people responsible for some of the key technical innovations say they’re not convinced it’s all good – being up front about the limitations of the methods you develop is healthy. But when a bunch of people jump to align themselves with an extreme sentence that seems to come out of nowhere, it must be frustrating to those who’ve spent years trying to get the community to the point of recognizing any consequences at all.

New open access journal on visualization and interaction

This is Jessica. I am on the advisory board of an open access visualization research journal called the Journal of Visualization and Interaction (JoVI), recently launched by Lonni Besançon, Florian Echtler, Matt Kay, and Chat Wacharamanotham. From their website:

The Journal of Visualization and Interaction (JoVI) is a venue for publishing scholarly work related to the fields of visualization and human-computer interaction. Contributions to the journal include research in:

  • how people understand and interact with information and technology,
  • innovations in interaction techniques, interactive systems, or tools,
  • systematic literature reviews,
  • replication studies or reinterpretations of existing work,
  • and commentary on existing publications.

One component of their mission is to require materials to be open by default, including exposing all data and reasoning for scrutiny, and making all code reproducible “within a reasonable effort.” Other goals are to emphasize knowledge and discourage rejection based on novelty concerns (a topic that comes up often in computer science research, see e.g., my thoughts here). They welcome registered reports, and say they will not impose top down constraints on how many papers can be published that can lead to arbitrary-seeming decisions on papers that hinge on easily fixable mistakes. This last part makes me think they are trying to avoid the kind of constrained decision processes of conference proceedings publications, which are still the most common publication mode in computer science. There are existing journals like Transactions on Visualization and Computer Graphics that give authors more chances to go back and forth with reviewers, and my experience as associate editor there is that papers don’t really get rejected for easily fixable flaws. Part of JoVI’s mission seems to be about changing the kind of attitude that reviewers might bring, away from one of looking for reasons to reject and toward trying to work with the authors to make the paper as good as possible. If they can do this while also avoiding some of the other CS review system problems like lack of attention or sufficient background knowledge of reviewers, perhaps the papers will end up being better than what we currently see in visualization venues.

This part of JoVI’s mission distinguishes it from other visualization journals:

Open review, comments, and continued conversation

All submitted work, reviews, and discussions will by default be publicly available for other researchers to use. To encourage accountability, editors’ names are listed on the articles they accept, and reviewers may choose to be named or anonymous. All submissions and their accompanying reviews and discussions remain accessible whether or not an article is accepted. To foster discussions that go beyond the initial reviewer/author exchanges, we welcome post-publication commentaries on articles.

Open review is so helpful for adding context to how papers were received at the time of submission, so I hope it catches on here. Plus I really dislike the attitude that it is somehow unfair to bring up problems with published work, at least outside of the accepted max 5 minutes of public Q&A that happens after the work is presented at a conference. People talk amongst themselves about what they perceive the quality or significance of new contributions to be, but many of the criticisms remain in private circles. It will be interesting to see if JoVI gets some commentaries or discussion on published articles, and what they are like.

This part is also interesting: “On an alternate, optional submission track, we will continually experiment with new article formats (including modern, interactive formats), new review processes, and articles as living documents. This experimentation will be motivated by re-conceptualizing peer review as a humane, constructive process aimed at improving work rather than gatekeeping.” 

distill.pub is no longer publishing new stuff but some of their interactive ML articles were very memorable and probably had more impact than more conventionally published papers on the topic. Even more so I like the idea of trying to support articles as living documents that can continue to be updated. The current publication practices in visualization seem a long way from encouraging a process where it’s normal to first release working papers. Instead, people spend six months building their interactive system or doing their small study to get a paper-size unit of work, and then they move on. I associate the areas where working papers seem to thrive (e.g., theoretical or behavioral econ) with theorizing or trying to conceptualize something fundamental to behavior, rather than just describing or implementing something. The idea that we should be trying to write visualization papers that really make us think hard over longer periods, and that may not come in easily bite-size chunks, seems kind of foreign to how the research is conceptualized. But any steps toward thinking about papers as incomplete or imperfect, and building more feedback and iteration into the process, are welcome.

Two talks about robust objectives for visualization design and evaluation

This is Jessica. I’ll be giving a talk twice this week, on the topic of how to make data visualizations more robust for inference and decision making under uncertainty. Today I’m speaking at the computer science seminar at the University of Illinois Urbana-Champaign, and Wednesday I’ll be giving a distinguished data science lecture at Cornell. In the talk I consider what’s a good objective to use as a target in designing and evaluating visualization displays, one that is “robust” in the sense that it leads us to better designs even if people don’t use the visualizations as intended. My talk will walk through what I learned from using effect size judgments and decisions as a design target, and how aiming for visualizations that facilitate implicit model checks can be a better target for designing visual analysis tools. At the end I jump up a level to talk about what our objectives should be when we design empirical visualization experiments. I’ll talk about a framework we’re developing that uses the idea of a rational agent with full knowledge of a visualization experiment design to create benchmarks that can be used to determine when an experiment design is good (by asking, e.g., is the visualization important to do well on the decision problem under the scoring rule used?) and which can help us figure out what causes losses in observed performance by participants in our experiment.

Using predictions from arbitrary models to get tighter confidence intervals

This is Jessica. I previously blogged about conformal prediction, an approach to getting prediction sets that are guaranteed on average to achieve at least some user-defined coverage level (e.g., 95%). If it’s a classification problem, the prediction sets consist of a discrete set of labels, and if the outcome is continuous (regression) they are intervals. The basic idea can be described as using a labeled hold-out data set (the calibration set) to adjust the (often wrong) heuristic notion of uncertainty you get from a predictive model, like the softmax value, in order to get valid prediction sets.
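As a rough sketch of what the basic split-conformal recipe looks like for classification (the nonconformity score here is one common choice, and the array names are mine, not from any particular library):

```python
import numpy as np

# Minimal sketch of split conformal prediction for classification, assuming
# you already have softmax outputs from some trained classifier. probs_cal
# (n x K), labels_cal (n,), and probs_test (m x K) are illustrative names.

def conformal_sets(probs_cal, labels_cal, probs_test, alpha=0.05):
    n = len(labels_cal)
    # Nonconformity score: 1 minus the softmax value of the true label.
    scores = 1.0 - probs_cal[np.arange(n), labels_cal]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, q_level, method="higher")
    # Include every label whose score falls below the threshold, which gives
    # marginal coverage of at least 1 - alpha on average (under exchangeability).
    return [np.where(1.0 - p <= qhat)[0] for p in probs_test]
```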

Lately I have been thinking a bit about how useful it is in practice, like when predictions are available to someone making a decision. E.g., if the decision maker is presented with a prediction set rather than just the single maximum likelihood label, in what ways might this change their decision process? It’s also interesting to think about how you get people to understand the differences between a model-agnostic versus a model-dependent prediction set or uncertainty interval, and how use of them should change.

But beyond the human facing aspect, there are some more direct applications of conformal prediction to improve inference tasks. One uses what is essentially conformal prediction to estimate the transfer performance of an ML model trained on one domain when you apply it to a new domain. It’s a useful idea if you’re ok with assuming that the domains have been drawn i.i.d. from some unknown meta-distribution, which seems hard in practice. 

Another recent idea coming from Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic (the first two of whom have created a bunch of useful materials explaining conformal prediction) is in the same spirit as conformal, in that the goal is to use labeled data to “fix” predictions from a model in order to improve upon some classical estimate of uncertainty in an inference. 

What they call prediction-powered inference is a variation on semi-supervised learning that starts by assuming that you want to estimate some parameter value theta*, and you have some labeled data of size n, a much larger set of unlabeled data of size N >> n, and access to a predictive model that you can apply to the unlabeled data. The predictive model is arbitrary in that it might be fit to some other data than the labeled and unlabeled data you want to use to do inference. The idea is then to first construct an estimate of the error in the predictions of theta* from the model on the unlabeled data. This is called a rectifier since it rectifies the predicted parameter value you would get if you were to treat the model predictions on the unlabeled data as the true/gold standard values in order to recover theta*. Then, you use the labeled data to construct a confidence set estimating your uncertainty about the rectifier. Finally, you use that confidence set to create a provably valid confidence set for theta* which adjusts for the prediction error.

You can compare this kind of approach to the case where you just construct your confidence set using only the labeled observations, resulting in a wide interval, or where you do inference on the combination of labeled and unlabeled data by assuming the model-predicted labels for the unlabeled data are correct, which gets you tighter uncertainty intervals but which may not contain the true parameter value. To give intuition for how prediction-powered inference differs, the authors start with an example of mean estimation, where your prediction-powered estimate decomposes to your average prediction for the unlabeled data, minus the average error in predictions on the labeled data. If the model is accurate, the second term is 0, so you end up with an estimate on the unlabeled data which has much lower variance than your classical estimate (since N >> n). Relative to existing work on estimation with a combination of labeled and unlabeled data, prediction-powered inference assumes that most of the data is unlabeled, and considers cases where the model is trained on separate data, which allows for generalizing the approach to any estimator which is minimizing some convex objective and avoids making assumptions about the model.
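A back-of-the-envelope version of that mean-estimation case, under my own variable names and ignoring the more general convex-estimator machinery in the paper, might look like:

```python
import numpy as np
from scipy import stats

# Sketch of prediction-powered mean estimation. f_unlabeled holds model
# predictions on the N unlabeled points; f_labeled and y_labeled hold
# predictions and gold-standard labels on the n labeled points.

def ppi_mean_ci(f_unlabeled, f_labeled, y_labeled, alpha=0.05):
    N, n = len(f_unlabeled), len(y_labeled)
    rectifier = f_labeled - y_labeled              # prediction error on labeled data
    theta_pp = f_unlabeled.mean() - rectifier.mean()
    # The two averages use disjoint data, so their variances add.
    se = np.sqrt(f_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return theta_pp, (theta_pp - z * se, theta_pp + z * se)
```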

Here’s a figure illustrating this process (which is rather beautiful I think, at least by computer science standards):

[Figure: diagram of the prediction-powered inference process]

They apply the approach to a number of examples to create confidence intervals for e.g., the proportion of people voting for each of two candidates in a San Francisco election (using a computer vision model trained on images of ballots), predicting intrinsically disordered regions of protein structures (using AlphaFold), estimating the effects of age and sex on income from census data, etc.

They also provide an extension to cases where there is distribution shift, in the form of the proportion of classes in the labeled data being different from that in the unlabeled data. I appreciate this, as one of my pet peeves with much of the ML uncertainty estimation work happening these days is how comfortable people seem to be using the term “distribution-free,” rather than something like non-parametric, even though the default assumption is that the (unknown) distribution doesn’t change. Of course the distribution matters; using labels that imply we don’t care at all about it feels kind of like implying that there is in fact the possibility of a free lunch.

Research is everywhere, even, on rare occasions, in boxes labeled research

This is Jessica. I remember once hearing one of my colleagues who is also a professor talking about the express train that runs through much of Chicago up to Northwestern campus. He said, “The purple line is fantastic. I get on in the morning, always get a seat and I can get research done. Then I get to campus, and all research ceases for 8 hours. But I get back on the train and I’m right back to doing research!”

It is no joke that the more senior you get in academia, the less time you get to do the things that made you choose that career in the first place. But the topic of this post is a different sort of irony. Right now it’s deadline time for my lab, when many of the PhD students are preparing papers for the big conference in our field. It’s a very “researchy” time. What is surprising is how easy it is to be surrounded by people doing research and not feel like there is much actual new knowledge or understanding happening.

There is a David Blackwell quote that I have come to really like:

 I’m not interested in doing research and I never have been, I’m interested in understanding, which is quite a different thing.

Andrew has previously commented on this quote, implying that this may have been true at Blackwell’s time, but things have since shifted and understanding is now recognized as a valuable part of research. But I tend to think that Blackwell’s sentiment is still very much relevant. 

For example, when I think about what most people would call “my research,” I think of papers I’ve published that propose or evaluate visualization techniques or other interactive tools we create. But I don’t necessarily associate most of this work with “understanding.” On some level we find things out, but it’s very easy to present some stuff you learned in a paper without it ever actually challenging anything we already know. It’s framed as brand new information but usually it’s actually 99% old information in the form of premises and assumptions with a tiny new bit of something. It might not actually answer any of the questions that get you out of bed in the morning. I think most researchers would relate to feeling like this at least sometimes.

Pursuing understanding is why I like my job. I think of it as tied to the questions that I am chewing on but can’t yet fully answer, because the answer is going to be complicated, connecting to many other things I’ve thought about in the past but without the derivation chain being totally clear. Maybe it even contradicts things I’ve thought or said in the past. On some level I think of understanding as dynamic, about a shift in perspective. All of this makes it hard to draw clean linguistic boundaries around. I find it’s more natural to express understanding in questions versus answers.

The problem is that questions don’t make for a good paper unless they can be answered with some satisfaction. As soon as you plan the thing that will fit nicely into the 10-15 page article, with a concise introduction, related work section, and a description of the methods and results, you probably have left behind the understanding. You are instead in the realm of “Making Statements Whose Assumptions and Implications More or Less Follow from One Another and are the Right Scope for a Research Article.” Your task becomes connecting the dots, e.g., making clear there’s motivating logic running from the data collection to the definitions or estimators to the inferences you draw in the end. This is of course usually already established by the time you write the paper, but it can still take a long time to write it all out, and hopefully you don’t discover an error in your logic, because then it’s even harder to make the pieces fit and you have to figure out how to talk about that.

But it’s the understanding that is the source of actual new information, in contrast to the veneer of new knowledge we usually get with a paper. I used to think that even though it was hard to really explore a problem in a single paper, the real learning or understanding would manifest through bodies of work. Like if you look at my papers over the last ten years, you can see what I’ve come to understand. But I don’t think that’s quite accurate. Certainly there is some knowledge accrual and some influence of what I’ve said in past papers on how I see the world now. But I would say the knowledge I’m most interested in, or most proud of having gained, is not well represented in the papers. It’s more about what intuitions I’ve developed over time, about things like what’s hard about studying behavior under uncertainty, what’s actually an important problem or an unanswered question when it comes to learning from data in different scenarios, what’s misleading or wrong in the way things get portrayed in the literature in my field, etc.

The conflict arises because understanding doesn’t care about connecting the dots. It happens in a realm where it’s well understood that the dots have only a tenuous relationship to the truth status of whatever claims you want to make. But it’s hard to write papers in that world. Strong assertions seem out of place.

Maybe this is why Blackwell’s papers tended to be short. 

It’s worth asking whether one can reach understanding without going through the motions of doing the research. I’m not sure. I think there’s value in attempting to take things seriously and make moderately simple statements about them of the type that can be put in a research paper. But then again something like blogging can have the same effect. 

On the bright side, if you can find a way to write a paper that you really believe in, then once you put the paper out there, you might get some critical feedback. And maybe then understanding enters the equation, because the critique jars your thinking enough to help you see beyond your old premises. But at least for me this is not the norm. I like getting critical feedback, but even when the paper is about something I’m still in the midst of trying to understand, often by the time things have been published and presented at some conference and the right people see it and weigh in, I’ve already reached some conclusions about the limitations of those ideas and moved on. For this reason it has always driven me crazy when people associate my current interests with things I’ve published a couple years ago. 

In terms of shifting the balance toward more understanding, being intentional about publishing fewer papers and being pickier about what problems you take on should help. So would other possibilities I’ve posted about in the past, like trying to normalize scientists admitting, in talks and in the papers themselves, what they don’t know or when they have doubts about their own work. More pointing out of assertions and claims to generalization that aren’t warranted would also help, even if the work is already published and it makes the authors uncomfortable, because it enforces the idea that we are doing research because we actually care about getting the understanding right, not just because we like clever ideas.

P.S. Probably the title should have been, Understanding is everywhere, even, on rare occasions, in boxes labeled research. But I like the recursion!

Predicting LLM havoc

This is Jessica. Jacob Steinhardt recently posted an interesting blog post on predicting emergent behaviors in modern ML systems like large language models. The premise is that we can get qualitatively different behaviors from a deep learning model with enough scale–e.g., AlphaZero hitting a point in training where suddenly it has acquired a number of chess concepts. Broadly we can think of this happening as a result of how acquiring new capabilities can help a model lower its training loss, and how, as scale increases, you can get points where some (usually more complex) heuristic comes to overtake another (simpler) one. The potential for emergent behaviors might seem like a counterpoint to the argument that ML researchers should write broader impacts statements to prospectively name the potential harms their work poses to society… non-linear dynamics can result in surprises, right? But Steinhardt’s argument is that some types of emergent behavior are predictable.

The whole post is worth reading so I won’t try to summarize it all. What most captured my attention though is his argument about predictable deception, where a model fools or manipulates the (human) supervisor rather than doing the desired tasks, because doing so gets it better or equal reward. Things like ChatGPT saying that “When I said that tequila has ‘relatively high sugar content,’ I was not suggesting that tequila contains sugar” or an LLM claiming there is “no single right answer to this question” when there is, sort of like a journalist insisting on writing a balanced article about some issue where one side is clearly ignoring evidence. 

The creepy part is that the post argues that there is reason to believe that certain factors we should expect to see in the future–like models being trained on more data, having longer dialogues with humans, and being more embedded in the world (with a potential to act)–are likely to increase deception. One reason is that models can use the extra info they are acquiring to build better theories-of-mind and use them to better convince their human judges of things. And when they can understand what humans respond to and act in the world, they can influence human beliefs through generating observables. For example, we might get situations like the following:

suppose that a model gets higher reward when it agrees with the annotator’s beliefs, and also when it provides evidence from an external source. If the annotator’s beliefs are wrong, the highest-reward action might be to e.g. create sockpuppet accounts to answer a question on a web forum or question-answering site, then link to that answer. A pure language model can’t do this, but a more general model could.

This reminds me of a similar example used by Gary Marcus of how we might start with some untrue proposition or fake news (e.g., Mayim Bialik is selling CBD gummies) and suddenly have a whole bunch of websites on this topic. Though he seemed to be talking about humans employing LLMs to generate bullshit web copy. Steinhardt also argues that we might expect deception to emerge very quickly (think phase transition), as suddenly a model achieves high enough performance by deceiving all the time that those heuristics dominate over the more truthful strategies. 

The second part of the post on emergent optimization argues that as systems increase in optimization power—i.e., as they consider a larger and more diverse space of possible policies to achieve some goal—they become more likely to hack their reward functions. E.g., a model might realize your long-term goals (say, lots of money and lots of contentment) are hard to achieve, and so it resorts instead to trying to change how you appraise one of those things over time. The fact that planning capabilities can emerge in deep models even when they are given a short-term objective (like predicting the next token in some string of text), and that we should expect planning to drive down training loss (because humans do a lot of planning and human-like behavior is the goal), means we should be prepared for reward hacking to emerge.

From a personal perspective, the more time I spend trying out these models, and the more I talk to people working on them, the more I think being in NLP right now is sort of a double-edged sword. The world is marveling at how much these models can do, and the momentum is incredible, but it also seems that on a nearly daily basis we have new non-human-like (or perhaps worse, human-like but non-desirable) behaviors getting classified and becoming targets for research. So you can jump into the big whack-a-mole game, and it will probably keep you busy for a while, but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches. Though I guess anyone who is watching curiously what’s going on in NLP is in the same boat. It really is kind of uncomfortable.

This is not to say though that there aren’t plenty of NLP researchers thinking about LLMs with a relatively clear sense of direction and vision – there certainly are. But I’ve also met researchers who seem all in but without being able to talk very convincingly about where they see it all going. Anyway, I’m not informed enough about LLMs to evaluate Steinhardt’s predictions, but I like that some people are making thoughtful arguments about what we might expect to see.

 

P.S. I wrote “but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches” but it occurs to me now that it’s not really clear to me what I’m waiting for to determine “how far we can go.” Do deep models really need to perfectly emulate humans in every way we can conceive of for these approaches to be considered successful? It’s interesting to me that despite all the impressive things LLMs can do right now, there is this tendency (at least for me) to talk about them as if we need to withhold judgment for now. 

Multiverse R package

This is Jessica. Abhraneel Sarma, Alex Kale, Michael Moon, Nathan Taback, Fanny Chevalier, Matt Kay, and I write,

There are myriad ways to analyse a dataset. But which one to trust? In the face of such uncertainty, analysts may adopt multiverse analysis: running all reasonable analyses on the dataset. Yet this is cognitively and technically difficult with existing tools—how does one specify and execute all combinations of reasonable analyses of a dataset?—and often requires discarding existing workflows. We present multiverse, a tool for implementing multiverse analyses in R with expressive syntax supporting existing computational notebook workflows. multiverse supports building up a multiverse through local changes to a single analysis and optimises execution by pruning redundant computations. We evaluate how multiverse supports programming multiverse analyses using (a) principles of cognitive ergonomics to compare with two existing multiverse tools; and (b) case studies based on semi-structured interviews with researchers who have successfully implemented an end-to-end analysis using multiverse. We identify design tradeoffs (e.g. increased flexibility versus learnability), and suggest future directions for multiverse tool design.

Here it is on CRAN. And here’s the github repo.

A challenge in conducting multiverse analysis is that you have to write your code to branch over any decision points where there is uncertainty about the right choice. This means identifying and specifying dependencies between paths, such as cases where running one particular model specification requires one particular definition of a variable. Relying on standard imperative programming solutions like for loops leads to messy, error-prone code which is hard to debug, run, and interpret later. Additionally, depending on how the code is executed, an analyst might have to wait until the entire multiverse has executed before they can discover errors with some paths. This makes debugging slower.

There are a few existing tools for specifying a multiverse, but this package lets the author build things up from a single analysis (which seemed more realistic to us than expecting them to start from the omniscient view of the entire multiverse), and it interfaces with the sort of iterative workflow one might expect in computational notebooks. Execution is optimized relative to the naive approach of computing every single path separately, by sharing results among related subpaths. Immediate feedback is provided on a default analysis which the author can control.
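In case it helps to see what “branching over decision points” amounts to, here’s a rough conceptual sketch in Python (not the package’s R syntax, and not its implementation): declare each decision once, enumerate the valid combinations, and run the same analysis over every path. The decision names and the dependency rule are hypothetical.

```python
# Conceptual sketch of a multiverse: decision points declared once,
# all valid combinations enumerated, one analysis run per path.
from itertools import product

decisions = {
    "outlier_rule": ["none", "iqr", "sd3"],   # how to exclude outliers
    "outcome_def":  ["raw", "log"],           # how to define the outcome
    "model":        ["ols", "robust"],        # which model to fit
}

def is_valid(path):
    # Example dependency: suppose the robust model only makes sense
    # with the raw outcome definition.
    return not (path["model"] == "robust" and path["outcome_def"] == "log")

def run_analysis(data, path):
    # Placeholder for the per-path analysis; a real multiverse would
    # preprocess the data and fit the chosen model here.
    return {"path": path, "estimate": None}

def run_multiverse(data):
    names = list(decisions)
    paths = (dict(zip(names, combo)) for combo in product(*decisions.values()))
    return [run_analysis(data, p) for p in paths if is_valid(p)]

results = run_multiverse(data=None)
print(len(results), "analysis paths executed")
```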

As the abstract describes, the package has seen a little use so far (including for a virtual-reality related multiverse with millions of paths), so we think this kind of design pattern has some promise. 

PS: See my collaborators’ other work on interactive papers for communicating multiverse results and visualization approaches. And stay tuned for more work led by Abhraneel on interactive visualization to probe results of a multiverse analysis.

Software to sow doubts as you meta-analyze

This is Jessica. Alex Kale, Sarah Lee, TJ Goan, Beth Tipton, and I write,

Scientists often use meta-analysis to characterize the impact of an intervention on some outcome of interest across a body of literature. However, threats to the utility and validity of meta-analytic estimates arise when scientists average over potentially important variations in context like different research designs. Uncertainty about quality and commensurability of evidence casts doubt on results from meta-analysis, yet existing software tools for meta-analysis do not necessarily emphasize addressing these concerns in their workflows. We present MetaExplorer, a prototype system for meta-analysis that we developed using iterative design with meta-analysis experts to provide a guided process for eliciting assessments of uncertainty and reasoning about how to incorporate them during statistical inference. Our qualitative evaluation of MetaExplorer with experienced meta-analysts shows that imposing a structured workflow both elevates the perceived importance of epistemic concerns and presents opportunities for tools to engage users in dialogue around goals and standards for evidence aggregation.

One way to think about good interface design is that we want to reduce sources of “friction,” like the cognitive effort users have to exert when they go to do some task; in other words, minimize the so-called gulf of execution. But then there are tasks like meta-analysis where being on auto-pilot can result in misleading results. We don’t necessarily want to create tools that encourage certain mindsets, like when users get overzealous about suppressing sources of heterogeneity across studies in order to get some average that they can interpret as the ‘true’ fixed effect. So what do you do instead? One option is to create a tool that undermines the analyst’s attempts to combine disparate sources of evidence every chance it gets.

This is essentially the philosophy behind MetaExplorer. This project started when I was approached by an AI firm pursuing a contract with the Navy, where systematic review and meta-analysis are used to make recommendations to higher-ups about training protocols or other interventions that could be adopted. Five years later, a project that I had naively figured would take a year (this was my first time collaborating with a government agency) culminated in a tool that differs from other software out there primarily in its heavy emphasis on sources of heterogeneity and uncertainty. It guides the user through making their goals explicit, like what the target context they care about is; extracting effect estimates and supporting information from a set of studies; identifying characteristics of the studied populations and analysis approaches; and noting concerns about asymmetries, flaws in analysis, or mismatch between the studied and target context. These sources of epistemic uncertainty get propagated to a forest plot view where the analyst can see how an estimate varies as studies are regrouped or omitted. It’s limited to small meta-analyses of controlled experiments, and we have various ideas based on our interviews of meta-analysts that could improve its value for training and collaboration. But maybe some of the ideas will be useful either to those doing meta-analysis or building software. Codebase is here.
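To give a flavor of the kind of sensitivity the forest plot view is meant to surface, here’s a minimal sketch (not MetaExplorer’s code) of a fixed-effect, inverse-variance pooled estimate recomputed with each study left out. The effect sizes and standard errors are made up.

```python
# Leave-one-out sensitivity for a fixed-effect (inverse-variance) meta-analysis.
import math

studies = [  # (label, effect estimate, standard error) -- made-up numbers
    ("A", 0.30, 0.10),
    ("B", 0.10, 0.08),
    ("C", 0.45, 0.20),
    ("D", 0.05, 0.12),
]

def pooled(rows):
    # Weight each study by 1/se^2, pool, and compute the pooled standard error.
    weights = [1 / se ** 2 for _, _, se in rows]
    estimate = sum(w * y for w, (_, y, _) in zip(weights, rows)) / sum(weights)
    std_err = math.sqrt(1 / sum(weights))
    return estimate, std_err

est, se = pooled(studies)
print(f"all studies: estimate={est:.3f}, se={se:.3f}")
for i, (label, _, _) in enumerate(studies):
    est, se = pooled(studies[:i] + studies[i + 1:])
    print(f"omit {label}:      estimate={est:.3f}, se={se:.3f}")
```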

What is the obligation of the theorist to the “motivating application?”

This is Jessica. I’m back in Chicago now, but I spent the fall visiting an institute for theoretical computer science at UC Berkeley. I’ve been gravitating toward theory at the CS/stat/econ intersection in the interest of being better able to characterize or predict the value of a more informative data summary for a decision problem. The topic of the program I attended was sequential decision making, which covered more classic CS/operations topics like reinforcement learning, multi-armed bandits, competitive analysis of algorithms, etc., as well as some econ-focused approaches like information design. The goal in the CS subset is often optimal or approximately optimal algorithms given some objective, whereas in econ it tends to be simplified characterizations of a solution that nonetheless provide some new insight into what’s going on.

What I want to write about is a certain tension that becomes palpable for me when I try to take theory more seriously, between the spirit of “doing theory” and the need to have some practical example to motivate the theory. 

When it comes to motivating applications, some theory papers get away with saying little to establish their relevance, because they start with some well established problem (the sunflower problem, the cake-cutting problem, etc.). Others point to some “killer application” where at least some subset of what the theorists produce is already being put to use directly in the world, like matching markets or auction design. But then there are the many theory papers that fall somewhere in the middle of the spectrum, where the theory is motivated by referring to some application or class of applications, without it being so clear that any such theory is already put to use for that application, and without the theorist attempting to produce results specific to that domain. From the outside looking in, it can be hard to judge how seriously readers are meant to take these connections, or how seriously the authors take them. 

For example, in learning theory, applications are often mentioned to motivate new characterizations of bounds or optimal algorithms or solution concepts, typically at the beginning of a paper or talk, where examples like optimal treatment policies for healthcare or ad auctions or adaptive experimentation might be mentioned before the formal problem definition is given. Then maybe again at the end we hear about how slightly different classes of application motivate changing up some assumptions in future work. So the applications are like bookends, but not necessarily engaged with in any detail. Or sometimes, the intro and even related work sections of a paper seem to promise application-specificity (“This paper considers mechanism design for healthcare”) but then the application appears to get dropped once you get to the theory, with no looking back.

As a more applied person who can’t help but wonder about these loose ends, I’m often left feeling like I’m not fully appreciating the theory the way I’m supposed to. In watching theory talks, or reading theory papers (again usually on topics in the CS/econ/stats intersection), it’s not uncommon for me to reach a point about halfway through where I realize I no longer care much about the solution, because the pursuit of optimality or a complete characterization has taken over such that I can no longer relate the results to a real world problem. Or, maybe I still can with some effort, but then it seems impractical to think that anyone would want to try to apply the results in practice because using the theoretical framework adds so much complexity. 

Related to this blog, I’m reminded of how a paper presenting a data hold-out mechanism for adaptive data analysis, patterned after differential privacy, once came up, but apparently didn’t work out so well when implemented. One of the authors of that work then confirmed that the work was a proof of concept, not really meant for practical application at the time of publication. Was it naive of the non-theorists to assume that a theoretical contribution motivated by a real world problem should be applicable to real examples of that problem at the time of publication? I don’t think so. It also seems fine for theorists to present theoretical solutions for practical problems even if they aren’t easy to apply in practice, as long as they are up front about that. I would hope that the average theorist, despite working in ‘theoryland’, wants more applied folks to take their contributions seriously and provide feedback. But the style of theory talks and papers often leads one to wonder if there is someone somewhere doing the follow-up work to see how well the thing can be applied in practice, what’s not trivial about it, what considerations might have been missed, etc.

In a panel on doing theory at the program I was at, someone mentioned how links between certain theory questions and real world applications can be taken for granted over time, even if the story is no longer very accurate. Would it be better not to mention the application at all if one isn’t sure of the applicability? Or is it unreasonable to expect authors working in a “practical” field like CS or econ to muster the level of confidence to stop mentioning the real world altogether? 

I was recently reading Philip Stark’s deeply skeptical take on modelling for policy decisions, which is on some level the same spirit of questioning one poses at the more applied end of the spectrum. We take for granted in theory and modeling that certain sacrifices must be made in trying to make things work out given the tools we have. But which real-world constraints we sever connections to can’t be taken lightly. So we need a lot of interchange between the domain experts working on the applications and the theorists, or someone whose focus is going between the two.

None of this is to say that theorists in the areas I mention aren’t aware or actively thinking about these questions. I heard several conversations at the Simons Institute that seemed to be about returning to or questioning the role of the motivating application to figure out how to proceed. My sense is that many theorists are quite aware when they are adding assumptions for tractability in finding a solution versus when assumptions or parts of a formulation are core to the problem itself and independent of the need to close the loop. My concern is more the ambiguity around how high priority the application is when looking in from the outside, e.g., reading theoretical papers that imply there is ultimately to be some bridging between theoryland and the real world but never get around to saying more. That seems like both the hardest and the most interesting part, but not necessarily well incentivized on either side.

Explanation and reproducibility in data-driven science (new course)

This is Jessica. Today I start teaching a new seminar course I created for CS grad students at Northwestern, called Explanation and Reproducibility in Data-Driven Science.

Here’s the description:

In this seminar course, we will consider what it means to provide reproducible explanations in data-driven science. As the complexity and size of available data and state-of-the-art models increase, intuitive explanations of what has been learned from data are in high demand. However, events such as the so-called replication crisis in social science and medicine suggest that conventional approaches to modeling can be widely misapplied even at the highest levels of science. What does it mean for an explanation to be accurate and reproducible, and how do threats to validity of data-driven inferences differ depending on the goals of statistical modeling? The readings of the course will be drawn from recent and classic literature pertaining to reproducibility, replication, and explanation in data-driven inference published in computer science, psychology, statistics, and related fields. We will examine recent evidence of problems of reproducibility, replicability and robustness in data-driven science; theories and evidence related to causes of these problems; and solutions and open questions. Topics include: ML reproducibility, interpretability, the social science replication crisis, adaptive data analysis, causal inference, generalizability, and uncertainty communication.

The high level goal is to expose more CS PhD students to results and theories related to blind spots in conventional use of statistical inference in research. My hope is that reading a bunch of papers related to this (ambitious) topic but from different angles will naturally encourage thinking beyond the specific results to make observations about how overinterpreting results and overtrusting certain procedures (randomized experiments, test-train splits, etc) can become conventional in a field. 

Putting together the reading list (below) was fun, but I’m open to any suggestions of what I missed or better alternatives for some of the topics. The biggest challenge I suspect will be having these discussions without being able to assume a certain series of stats courses (prerequisites call for exposure to both explanatory and predictive modeling, but I left it kind of loose). I am doing a few lectures early on to review key assumptions and methods but there’s no way I can do it all justice.

In developing it, I consulted syllabi from a few related courses: Duncan Watts’ Explaining Explanation course at Wharton (which my course overlaps with the most), Matt Salganik and Arvind Narayanan’s Limits of Prediction at Princeton, and Jake Hofman’s Modeling Social Data and Data Science Summer School courses.

Schedule of readings

1. Course introduction

Optional:

2. Review: Statistical Modeling in Social Science

Note: These references are for your benefit, and can be consulted as needed to fill gaps in your prior exposure.

3. Review: Statistical Modeling in Machine Learning

Note: These references are for your benefit, and can be consulted as needed to fill gaps in your prior exposure.

PROBLEMS AND DEFINITIONS

4. What does it mean to explain?

5. What does it mean to reproduce? 

Optional:

6. Evidence of reproducibility in social science and ML

Optional:

PROPOSED CAUSES

7. Adaptive overfitting: social science

Optional: 

8. Adaptive overfitting: ML

Optional:

9. Generalizability

Optional:

10. Causal inference

Optional:

11. Misspecification & multiplicity

Optional:

12. Interpretability: ML

Optional:

13. Interpretability: Social Science

Optional:

 

SOLUTIONS AND OPEN QUESTIONS

14. Limiting degrees of freedom

Optional:

15. Integrative methods

Optional:

16. Better theory

Optional: 

17. Better communication of uncertainty 

Optional:

Show me the noisy numbers! (or not)

This is Jessica. I haven’t blogged about privacy preservation at the Census in a while, but my prior posts noted that one of the unsatisfying (at least to computer scientists) aspects of the bureau’s revision of the Disclosure Avoidance System for 2020 to adopt differential privacy was that the noisy counts file that gets generated was not released along with the post-processed Census 2020 estimates. This is the intermediate file that is produced when calibrated noise is added to the non-private estimates to achieve differential privacy guarantees, but before post-processing operations are done to massage the counts into realistic looking numbers (including preventing negative counts and ensuring proper summation of smaller geography populations to larger, e.g. state level). In this case the Census used zero-concentrated differential privacy as the definition and added calibrated Gaussian noise to all estimates except predetermined “invariants”: the total population for each state, the count of housing units in each block, and the group quarters’ counts and types in each block.   

Why is the non-release of the noisy measurements file problematic? Recall that privacy experts warn against approaches that require “security through obscurity,” i.e., where parameters of the approach used to noise data have to be kept secret in order to avoid leaking information. This applied to the kinds of techniques the bureau previously used to protect Census data, like swapping of households in blocks where they were too unique. Under differential privacy it’s fine to release the budget parameter epsilon, along with other parameters if using an alternative parameterization like the concentrated differential privacy definition used by the Census, which also involves a parameter rho to control the allocation of budget across queries and a parameter delta to capture how likely it is that actual privacy loss will exceed the bound set by epsilon. Anyway, the point is that using differential privacy as the definition renders security threats from parameters getting leaked obsolete. Of more interest to data users, it also opens up the possibility that one can account for the added noise in doing inference with Census data. See the appendix of this recent PNAS paper by Hotz et al. for a discussion of conditions under which inference is possible on data to which noise has been added to achieve differential privacy versus where identification issues arise.
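As a toy illustration of the two stages described above, and emphatically not the Census Disclosure Avoidance System itself, here’s a sketch of adding calibrated Gaussian noise to some hypothetical block counts and then “post-processing” by rounding and clipping negatives, which is the step that can introduce bias and is why analysts want the pre-post-processing file. The counts, sensitivity, and rho below are made up; the noise scale follows the usual Gaussian-mechanism calibration for rho-zCDP.

```python
# Stage 1: noisy measurements (what the withheld file would contain).
# Stage 2: crude post-processing into "realistic looking" published counts.
import numpy as np

rng = np.random.default_rng(0)
true_counts = np.array([3, 0, 41, 7, 1250])   # hypothetical block counts
sensitivity = 1.0                             # one person changes a count by 1
rho = 0.1                                     # hypothetical privacy-loss budget

# Gaussian mechanism calibrated for rho-zCDP: sigma = sensitivity / sqrt(2*rho)
sigma = sensitivity / np.sqrt(2 * rho)
noisy = true_counts + rng.normal(0, sigma, size=true_counts.shape)

# Stand-in for post-processing: force counts to be non-negative integers.
# Small counts get pushed up on average, which is the bias analysts worry about.
published = np.clip(np.round(noisy), 0, None).astype(int)

print("noisy measurements:", np.round(noisy, 2))
print("post-processed    :", published)
```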

But these inference benefits are conditional on the bureau actually releasing that file. Cynthia Dwork, Gary King, and others sent a letter calling for the release of the noisy measurements file a while back. More recently, Ruth Greenwood of Harvard’s Election Law clinic and others filed a Freedom of Information Act (FOIA) request for 1) the noisy measurements file for Census 2010 demonstration data (provided by the bureau to demonstrate what the new disclosure avoidance system under differential privacy produces, for comparison with published 2010 estimates that used swapping), and 2) the noisy measurements file for Census 2020. The reasoning is that users of Census data need this data, particularly for redistricting, in order to better assess the extent to which the new system adds bias through post-processing. Presumably once the file is released it could become the default for reapportionment to sidestep any identified biases.

The Census responded to the request for the noisy measurements file for the 2010 Demonstration data by saying that “After conducting a reasonable search, we have determined that we have no records responsive to item 1 of your request.” They refer to the storage overhead of roughly 700 files of about 950 gigabytes each as the reason for their deletion.

Their response to the request for the 2020 noisy measurements file is essentially that releasing the file would compromise the privacy of individuals represented in the 2020 Census estimates. They say that “FOIA Exemption 3 exempts from disclosure records or portions of records that are made confidential by statute, and Title 13 strictly prohibits publication whereby the data furnished by any particular establishment or individual can be identified.” They refer to “Fair Lines American Foundation Inc. v. U.S. Department of Commerce and U.S. Census Bureau, Memorandum Opinion at No. 21-cv-1361 (D.D.C. August 02, 2022) (holding that 13 U.S.C. § 9(a)(2) permits some level of attenuation in the chain of causation, and thus supports the withholding of information that could plausibly allow data furnished by a particular establishment or individual to be more easily reconstructed).” They encourage the plaintiff to request approved access to the files for their specific research project, since this kind of authorized use is still possible. 

I find the claim that somehow releasing the 2020 noisy measurements file would compromise individual privacy interesting and unexpected. I don’t really have reason to believe that the Bureau would be lying when they claim that leakage would result from releasing the files, but how exactly is the noisy measurements file going to aid reconstruction attacks? My first thought was maybe post-processing steps were parameterized partially based on observing the realized error between the original estimates and noised estimates, but this would contradict the goals of post-processing as they’ve been described, which are removing artifacts that make the data seem fake (namely negative counts) and making things add up. A more skeptical view is that they just don’t want to have two contradicting files of 2020 estimates out there based on the confusion and complications it could cause legally, for instance, if redistricting cases that relied on the post-processed estimates are now challenged by the existence of more informative data. Aloni Cohen and Christian Cianfarini, who have followed the legal arguments being made in Alabama’s lawsuit against the Department of Commerce and Census over the switch to differential privacy, tell me that there is some historical precedent for redistricting maps being revisited after the discovery of data errors, including examples where rulings have gone both for and against the need to redraw the maps.

If the reasoning is primarily to avoid contradictory numbers, then it’s yet another example of the same fears about losing the (false) air of precision in Census estimates that has been called “incredible certitude” and “the statistical imaginary” and goes hand in hand with bizarre (at least to me) restrictions on the use of statistical methods by Title 13, which prevents using any “statistical procedure … to add or subtract counts to or from the enumeration of the population as a result of statistical inference.” (This came up in the Alabama case but was dismissed because noise addition under differential privacy is not a method of inference). 

Finally, in other Census data privacy news, Priyanka Nanayakkara informs me that the bureau recently announced that the ACS files will not be subject to a formal disclosure avoidance approach by 2025 as hoped, the reason being that “the science does not yet exist to comprehensively implement a formally private solution for the ACS.” It sounds like fully synthetic data is more likely than differential privacy, which could be good for inference (see for instance the same Hotz et al. article appendix above, which contrasts inference under synthetic data generation and differential privacy). We need more computer scientists doing research on it.

The easygoing relationship between computer scientists and null hypothesis significance testing

This is Jessica. As you might expect, as a professor in a computer science department I spend a lot of time around computer scientists. As someone who is probably more outward looking than the average faculty member, there are things I like about CS, like the emphasis on getting the abstractions right and on creating new things rather than just studying the old, but also some dimensions where I know the things I like to think about or find interesting are “out of distribution.” One thing that continues to surprise me is the unquestioning way in which many faculty and students assume significance testing is the standard for scientific data analysis. 

Some examples, by area: 

ML, systems: We rely too heavily on comparing point estimates to assess performance across different [models/methods/systems]. Let’s fix this with significance testing! 

Privacy: Let’s noise up this data, but better make sure they can still do t-tests!

Big data/databases: Let’s do zillions of t-tests simultaneously! 

Theory: Let’s design a mechanism to allow for optimal data-driven science, by which we mean NHST! 

Visualization: Let’s turn graphs into devices for NHST!

HCI: Let’s make a GUI so people can do NHST without any prior exposure to statistics! 

On some level, this is not that surprising. CS majors often take a probability class, but when it comes to stats for data analysis, many don’t go beyond a basic intro stats course. And early non-major stats courses often devote a lot of time to statistical testing. Estimation, exploratory analysis, and anything else that might precede NHST are treated as mostly instrumental. So classical stats becomes synonymous with NHST for many. Of course in CS, prediction gets a lot of attention, but it’s sort of its own beast, treated like an engineering tool that powers everything everywhere.

I expect the average computer scientist sees little reason to care, for example, about what happened when a bunch of psychologists doing small-N studies overrelied on NHST. There’s a fallback attitude that issues caused by humans will never be very relevant objects of study because the primary artifacts are code, and that that kind of squishy social science stuff doesn’t belong in CS (though as the joke beloved by people who do deal with the human-computer interface goes, “The three hardest problems in computer science are people and off-by-one errors.”)

And so, I seem to find myself somewhat regularly in a position where I am more or less performing some angst over the problems with NHST in an effort to get students or faculty colleagues to reconsider their unquestioning assumption that significance testing is how scientists analyze data. I can think of many situations where I’ve tried to explain why NHST, as practiced and philosophically, is not so rational. I bring up Andrew-isms, like how it doesn’t make sense to treat effects as present or absent because there’s always some effect (the question is how big), or what to do about the fact that the difference between “significant” and “not significant” is not itself significant. Sometimes I can tell I capture their attention for a moment, but rarely do I feel like I’ve really convinced someone there’s a problem that might affect their research. For instance, I get responses that start with phrases like ‘If this is true …’ and I’m pretty sure it isn’t just me getting blown off for being female, because I’ve seen similar reactions when like-minded colleagues point out the issues.
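To make the second Andrew-ism concrete, here’s a made-up numerical example: two estimates with the same standard error, one clearing the conventional z = 1.96 threshold and one not, whose difference is nowhere near statistically significant.

```python
# "The difference between significant and not significant is not itself
# significant," with made-up numbers and an assumption of independence.
import math

est_a, se_a = 0.25, 0.10   # z = 2.5, "significant"
est_b, se_b = 0.10, 0.10   # z = 1.0, "not significant"

diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)   # standard error of the difference

print("z for A:", round(est_a / se_a, 2))          # 2.5
print("z for B:", round(est_b / se_b, 2))          # 1.0
print("z for difference:", round(diff / se_diff, 2))  # about 1.06, not significant
```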

Repeatedly encountering all this resistance can almost make one feel a little bit guilty, like here your colleagues are obviously having a fulfilling relationship with their chosen interpretation of statistics and yet you’re insisting for some reason on dredging up weird anomalies with seemingly weak links to what they do, like some sort of witch determined to sow doubts in the healthy partnership between computer science and stats. But of course I don’t actually feel guilty because I think they need to hear it, even if I derail a few conversations.  

I guess one question is how a computer scientist’s orientation to NHST is qualitatively different from that of someone in another field that uses stats. For example, how does a psychology researcher’s perspective on NHST differ from that of a computer science researcher? Where I would expect computer scientists to be worse than psychologists is in anticipating misuse, again because understanding human behavior has never been perceived as being critical to doing great CS research. I think there’s a genuine belief that NHST is the answer, based on believing that if it’s used properly (which can’t be that hard, right? just don’t fake the data and make sure there’s enough of it), it provides the most direct answer to the question people care about: is this thing real? On the surface, it can seem like a concise solution to a large class of problems, which doesn’t deserve to be conflated with the flaws of some humans who used it for very different-seeming purposes.

I also think there’s a genuine confusion about what the alternative would be if one doesn’t use NHST. Sometimes researchers make it explicit that they can’t imagine alternatives (e.g., here), in which case at least the value that someone like me can provide is clearer (giving them examples of alternative ways of expressing findings from an analysis). But, for that to work, I first have to convince them there’s a problem. Maybe the resistance is also partly a function of discrete thinking being built into CS. Advocating against NHST to some computer scientists can certainly feel like trying to convince them that we should replace binary.

On a more positive note, when I realized that much of the stat/science reform discussion hasn’t reached many computer scientists, I started including some background on it in a CS research class I teach to first-year PhDs. I’ve taught it a few times and they seem interested when I present some of the core issues and draw connections to CS research (like we do here). I’m also teaching a graduate seminar course next quarter on explanation and reproducibility in data-driven science, where we’ll discuss papers from stats, social science, and ML related to what it means for an explanation of model behavior to be valid and reproducible. Maybe all this will help me figure out how to better target my anti-NHST spiel to CS assumptions.

The more I thought about them, the less they seemed to be negative things, but appeared in the scenes as something completely new and productive

This is Jessica. My sabbatical year, which most recently had me in Berkeley, CA, is coming to an end. For the second time since August I was passing through Iowa. Here it is on the way out to California from Chicago and on the way back.

[Photos: a park in Iowa in August; a park in Iowa in November]

If you squint (like, really really squint), you can see a bald eagle overhead in the second picture.

One association that Iowa always brings to mind for me is that Arthur Russell, the musician, grew up there. I have been a fan of Russell’s music for years, but somehow had missed Iowa Dream, released in 2019 (Russell died of AIDS in 1992, and most of his music has been released posthumously). So I listened to it while we were driving last week. 

Much of Iowa Dream is Russell doing acoustic and lofi music, which can be surprising if you’ve only heard his more heavily produced disco or minimalist pop. One song, called Barefoot in New York, is sort of an oddball track even amidst the genre blending that is typical of Russell. It’s probably not for everyone, but as soon as I heard it I wanted to experience it again. 

NPR called it “newfound city chaos” because Russell wrote it shortly after moving to New York, but there’s also something about the rhythm and minutiae of the lyrics that kind of reminds me of research. The lyrics are tedious, but things keep moving like you’re headed towards something. The speech impediment evokes getting stuck at times and having to explore one’s way around the obstruction. Sometimes things get clear and the speaker concludes something. Then back to the details that may or may not add up to something important. There’s an audience of backup voices who are taking the speaker seriously and repeating bits of it, regardless of how inconsequential. There’s a sense of bumbling yet at the same time iterating repeatedly on something that may have started rough but becomes more refined.

Then there’s this part:

I really wanted to show somehow how things deteriorate

Or how one bad thing leads to another

At first, there were plenty of things to point to

Lots of people, places, things, ideas

Turning to shit everywhere

I could describe these instances

But the more I thought about them

The less they seemed to be negative things

But appeared in the scenes as something completely new and productive

And I couldn’t talk about them in the same way

But I knew it was true that there really are

Dangerous crises

Occurring in many different places

But I was blind to them then

Once it was easy to find something to deplore

But now it’s even worse than before

I really like these lyrics, in part because they make me uncomfortable. On the one hand, the idea of wanting to criticize something, but losing the momentum as things become harder to dismiss closer up, seems opposite of how many realizations happen in research, where a few people start to notice problems with some conventional approach and then it becomes hard to let them go. The replication crisis is an obvious example, but this sort of thing happens all the time. In my own research, I’ve been in a phase where I’m finding it hard to unsee certain aspects of how problems are underspecified in my field, so some part of me can’t relate to everything seeming new and productive. 

But at the same time the idea of being won over by what is truly novel feels familiar when I think about the role of novelty in defining good research. I imagine this is true in all fields to some extent, but especially in computer science, there’s a constant tension around how important novelty is in determining what is worthy of attention. 

Sometimes novelty coincides with fundamentally new capabilities in a way that’s hard to ignore. The reference to potentially “dangerous crises” brings to mind the current cultural moment we’re having with massive deep learning models for images and text. For anyone coming from a more classical stats background, it can seem easy to want to dismiss throwing huge amounts of unlabeled data at too-massive-and-ensembled-to-analyze models as a serious endeavor… how does one hand off a model for deployment if they can’t explain what it’s doing? How do we ensure it’s not learning spurious cues, or generating mostly racist or sexist garbage? But the performance improvements of deep neural nets on some tasks in the last 5 to 10 years are hard to ignore, and phenomena like how deep nets can perfectly interpolate the training data but still not overfit, or learn intermediate representations that align with ground truth even when fed bad labels, make it hard to imagine dismissing them as a waste of our collective time. Other areas, like visualization, or databases, start to seem quaint and traditional. And then there’s quantum computing, where the consensus in CS departments seems to be that we’re going all in regardless of how many years it may still be until it’s broadly usable. Because who doesn’t like trying to get their head around entanglement? It’s all so exotic and different.

I think many people gravitate to computer science precisely because of the emphasis on newness and creating things, which can be refreshing compared to fields where the modal contribution is to analyze rather than invent. We aren’t chained to the past the way many other fields seem to be. It can also be easier to do research in such an environment, because there’s less worry about treading on ground that’s already been covered.

But there’s been pushback about requiring reviewers to explicitly factor novelty into their judgments about research importance or quality, like by including a separate ranking for “originality” in a review form like we do in some visualization venues. It does seem obvious that including statements like “We are first to …” in the introduction of our papers as if this entitles us to publication doesn’t really make the work better. In fact, often the statements are wrong, at least in some areas of CS research where there’s a myopic tendency to forget about all but the classic papers and what you saw get presented in the last couple years. And I always cringe a little when I see simplistic motivations in research papers like, no one has ever looked at this exact combination (of visualization, form of analysis, etc.) yet. As if we are absolved of having to consider the importance of a problem in the world when we decide what to work on.

The question would seem to be how being oriented toward appreciating certain kinds of novelty, like an ability to do something we couldn’t do before, affects the kinds of questions we ask, and how deep we go in any given direction over the longer term. Novelty can come from looking at old things in new ways, for example developing models or abstractions that relate previous approaches or results. But these examples don’t always evoke novelty in the same way that examples of striking out in brand new directions do, like asking about augmented reality, or multiple devices, or fairness, or accessibility, in an area where previously we didn’t think about those concerns much.

If a problem is suddenly realized to be important, and the general consensus is that ignoring it before was a major oversight, then it’s hard to argue we should not set out to study the new thing. But a challenge is that if we are always pursuing some new direction, we get islands of topics that are hard to relate to one another. It’s useful for building careers, I guess, to be able to relatively easily invent a new problem or topic and study it in a few papers then move on. And I think it’s easy to feel like progress is being made when you look around at all the new things being explored. There’s a temptation I think to assume that it will all “work itself out” if we explore all the shiny new things that catch our eye, because those that are actually important will in the end get the most attention.

But beyond not being able to easily relate topics to one another, a problem with expanding, at all times, in all directions at once, would seem to be that no particular endeavor is likely to be robust, because there’s always an excitement about moving to the next new thing rather than refining the old one. Maybe all the trendy new things distract from foundational problems, like a lack of theory to motivate advances in many areas, or sloppy use of statistics. The perception of originality and creativity certainly seems better at inspiring people than obsessing over being correct.

Barefoot in NY ends with a line about how, after having asked whether it was in “our best interest” to present this particular type of music, the narrator went ahead and did it, “and now, it’s even worse than before.” It’s not clear what’s worse than before, but it captures the sort of commitment to rapid exploration, even if we’re not yet sure how important the new things are, that causes this tension.

Only positive reinforcement for researchers in some fields

This is Jessica. I was talking to another professor in my field recently about a talk one of us was preparing. At one point, the idea came up of mentioning, in a critical light, some well-known recent work in the field, since this work had neglected an important aspect of evaluation, which would help make one of the points in the talk. I thought it seemed reasonable to make the comment, but my friend (who is more senior than me) said, ‘We can’t do that anymore. We used to be able to do that’. I immediately knew what they meant: that you can’t publicly express criticism of work done by other people these days, at least not in HCI or visualization.

What I really mean by “you can’t publicly express criticism” is not that you physically can’t or even that some people won’t appreciate it. Instead it’s more that if you do express criticism or skepticism about a published piece in a public forum outside of certain established channels, you will be subject to scrutiny and moral judgment, for being non-inclusive or “gate-keeping” or demonstrating “survivor bias.” The overall sentiment is that expressing skepticism about the quality of some piece of research outside the “proper” channels of reviewing and post-conference-presentation Q&A makes you somehow threatening to the field. It’s like people assume that critique cannot be helpful unless it’s somehow balanced with positives or provided in the context of some anonymous format or at a time when authors have prepared themselves to hear comments and will therefore not be surprised if someone says something critical. Andrew has of course commented numerous times on similar things in prior posts.

I write these views as someone who dislikes conflict and publicly bringing up issues in other people’s work. If I’m critiquing something, my style tends to involve going into detail to make it seem more nuanced and shading the critique with acknowledgement of good things to make it seem less harsh. Or if there are common issues I might write a critical paper pointing to the problems in the context of making a bigger argument so that it feels less directed at any particular authors. But I don’t think all this hedging should be so necessary. Criticism in science should be acceptable regardless of how it comes up, and you can’t imply it should go away without seeming to contradict the whole point of doing research. This has always seemed like a matter of principle to me, even back when I was getting critiqued myself as a PhD student and not liking it. So I still get surprised sometimes when I realize that my attitude is unusual, at least in the areas I work in.

One thing I really dislike is the idea that it’s not possible to be both an inclusive field and a field that embraces criticism. Like the only way to have the former is to suppress the latter. It’s unfortunate I guess that some fields that embrace criticism are not very diverse (say, finance or parts of econ), and that other fields that prioritize novelty and diversity in methods over critiquing what exists tend to be better on diversity, like HCI or visualization, which do pretty well in terms of attracting women and other groups.

In a different conversation with the same friend above, they mentioned how once, in giving an invited seminar talk at another university, another professor we know at that university made some critical comments and my friend got into a back and forth with them about the research. My friend didn’t think much of it, but as their visit went on, got the impression that some of the PhD students and other junior scholars who had attended saw the critique and exchange between my friend and the other faculty member as embarrassing (to my friend) and inappropriate. This was surprising to my friend, who felt it was totally normal and fine that the audience member had given blunt remarks after the talk. I had a similar experience during an online workshop a few months back, where a senior, well-known faculty member in the audience had multiple critical comments and questions for the keynote speaker, which I thought was a great discussion. But others seemed to view it as an extreme event that bordered on inappropriate.

Related to all this, I sometimes get the sense that many people see it as predetermined that open criticism will have more negative consequences than positive, because it will a) undermine the apparent success of the field and/or b) discourage junior scholars, especially those that bring diversity. On the latter, I’m not sure how much evidence people opposed to criticism have in mind, versus simply being able to imagine a situation where some junior person gets discouraged. But a different way to think about it is that it’s the responsibility of the broader field, not just the critic, if we have junior researchers fleeing in light of harsh critique. I.e., where are the support structures if all it takes is one scathing blog post? There’s sort of an “every man for himself” attitude that overlooks how much power mentors can have in supporting students who get critiqued. Similarly there’s a tendency to downplay how one person’s research getting critiqued is often less about that particular person being incompetent than it is about various ways in which methods get used or claims are made in a field that are conventional but flawed. If we viewed critique more from the standpoint of ‘we’re all in this together’ maybe it would be less threatening.

A few months ago I wrote a post on my other blog that tries to imagine what it would look like to be more open-minded about critique, e.g., by taking for granted that we are all capable of making mistakes and updating our beliefs. I would like to think it is possible to have healthy open critique. But sometimes when I sense how uncomfortable people are with even talking about critique, I wonder if I’m being naive. For all the progress I’ve seen in my field in some respects (including more diversity in background/demographics, and better application of statistical methods) I haven’t really seen attitudes on critique budge.

Concreteness vs faithfulness in visual summaries

This is Jessica. I recently had a discussion with collaborators that got me thinking about trade-offs we often encounter in summarizing data or predictions. Specifically, how do we weigh the value of deviating from a faithful or accurate representation of how some data was produced in order to make it more interpretable to people? This often comes up as sort of an implicit concern in visualization, when we decide things like whether we should represent probability as frequency to make it more concrete or usable for some inference task. It comes up more explicitly in some other areas like AI/ML interpretability, where people debate the validity of using post-hoc interpretability methods. Thinking it through with a visualization example has made me realize that at least in visualization research, we still don’t really have a principled foundation for resolving these questions.

My collaborators and I were talking about designing a display for an analysis workflow involving model predictions. We needed to visualize some distributions, so I proposed using discrete representations of the distributions, based on how these have been found to lead to more accurate probability judgments and decisions among non-experts in multiple experiments. By “discrete representations” here I mean things like discretizing a probability density function by taking some predetermined number of draws proportional to the inverse cumulative distribution function and showing them in a static plot (quantile dotplot), or animating draws from the distribution we want to show over time (hypothetical outcome plots), or possibly some hybrid of static and animated. However, one of my collaborators questioned whether it really makes sense to use, for example, a ball swarm style chart if you aren’t using a sampling-based approach to quantify uncertainty.
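For concreteness, here’s a minimal sketch of the quantile dotplot construction described above (not the display we were actually designing): take a small number of evenly spaced cumulative probabilities and push them through the inverse CDF, so each dot can be read as an equally likely outcome. The normal distribution and the choice of 20 dots are arbitrary.

```python
# Discretize a continuous distribution into 20 equally likely "outcomes"
# via the inverse CDF, the basic construction behind a quantile dotplot.
from statistics import NormalDist

mu, sd, n_dots = 50.0, 10.0, 20
dist = NormalDist(mu, sd)

# Midpoints of 20 equal-probability bins, pushed through the inverse CDF
probs = [(i + 0.5) / n_dots for i in range(n_dots)]
dots = [round(dist.inv_cdf(p), 1) for p in probs]

print(dots)
# Each dot reads as "a 1-in-20 chance"; stacking the dots into bins gives a
# quantile dotplot, while animating random draws instead gives a
# hypothetical outcome plot.
```

My collaborator’s question, in effect, was whether this kind of discrete reading is warranted when the distribution wasn’t produced by sampling in the first place.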

This made me realize how common it is in visualization research to try to separate the visual encoding aspect from the rest of the workflow. We tend to see the question of how to visualize a distribution as mostly independent from how to generate the distribution. So even if we used some analytical method to infer a sampling distribution, the conclusions of visualization research as typically presented would suggest that we should still prefer to visualize it as a set of outcomes sampled from the distribution. We rarely discuss how much the effectiveness of some technique might vary when the underlying uncertainty quantification process is different.

On some level this seems like an obvious blind spot, to separate the visual representation from the underlying process. But I can think of a few reasons why researchers might default to trying to separate encodings from generating processes and not necessarily question doing this. For one, having worked in visualization for years, at least in the case of uncertainty visualization I’ve seen various instances where users of charts seem to be more sensitive to changes to visual cues than they are to changes to descriptions of how some uncertainty quantification was arrived at. This implies that aiming for perfect faithfulness in our descriptions is not necessarily where we want to spend our effort. E.g., change an axis scaling and the effect size judgments you get in response will be different, but modifying the way you describe the uncertainty quantification process alone probably won’t result in much of a change to judgments without some additional change in representation. So the focus naturally goes to trying to “hack” the visual side to get the more accurate or better calibrated responses.

I could also see this way of thinking becoming ingrained in part because people who care about interfaces have always had to convince others of the value of what they do through evidence that the representation alone matters. Showing the dependence of good decisions on visualization alone is perceived as sort of a fundamental way to argue that visualization should be taken seriously as a distinct area.

At the same time though, disconnecting visual from process could be criticized for suggesting a certain sloppiness in how we view the function of visualization. Not minding the specific ways that we break the tie between the representation and the process might imply we don’t have a good understanding of the constraints on what we are trying to achieve. Treating the data generating process as a black box is certainly much easier than trying to align the representations to it, so it’s not necessarily surprising that the research community seems to have settled with the former.

Under this view, it becomes research-worthy to point out issues that only really arise because we default to thinking that representation and generation are separate. For example, there’s a well known psych study suggesting we don’t want to visualize continuous data with bar charts because people will think they are seeing discrete groups (and vice versa). It’s kind of weird that we can have these one-off results be taken very seriously, but then not worry so much about mismatch in other contexts, like acknowledging that making some assumptions to compute a confidence interval and then sampling some hypothetical outcomes from that is different from using sampling directly to infer a distribution. 

I suspect that for this particular uncertainty visualization example, the consequences of the visual metaphor not faithfully capturing the underlying distribution-generation process are minor relative to the potential benefits of getting people thinking more concretely about the implications of error in the estimate. There’s also a notion of frequency inherent in the conventional construction of confidence intervals, which maybe makes a frequency representation seem less egregiously wrong. Still, there’s the potential for the discrete representation to be read as mechanistic, i.e., as signifying a bootstrap-style construction process even when no resampling actually happened, which I think is what my collaborator was getting at.

But on the other hand, any data visualization is a concretization of something nebulous, i.e., an abstraction encoded in the visual-spatial realm used to represent our knowledge of some real-world thing approximated by a measurement process. So one could also point out that it doesn’t really make sense to act as though there will ever be situations where we are free from representational “distortion.”

Anyway, I do think this example supports a valid criticism: research hasn’t really attempted to address these trade-offs directly. Despite all the time we spend emphasizing the importance of the right representation in interactive visualization, I expect most of us would be hard-pressed to explain the value of a more concrete representation over a more accurate one for a given problem without falling back on intuition. Should we be able to get precise about this, or even quantify it? I like the idea of trying, but in an applied field like infovis I would expect the majority to judge it not worth the effort (if only because theory over intuition is a tough argument to make when funding exists without it).

Like I said above, a similar trade-off seems to come up in areas like AI/ML interpretability and explainability, but I’m not sure whether there have been attempts to theorize it yet. It could maybe be described as the value of human model alignment, meaning the value of matching the representation of some information to metaphors or priors or levels of resolution that people find easier to mentally compute with, versus generating model alignment, where we constrain the representation to be mechanistically accurate. It would be cool to see examples attempting to quantify this trade-off or otherwise formalize it in a way that could provide design principles.

David Blackwell stories

This is Jessica. Recently I was asked on a podcast what famous scientific figure, alive or dead, real or fictional, I would have dinner with. For the sake of choosing someone who was both brilliant and sounds like an inspiring person to be around, I said David Blackwell.

I didn’t really know Blackwell’s work until maybe a year or a year and a half ago, when a colleague introduced me to Blackwell ordering. It’s a way of quantifying the instrumental value of an information-generating process (aka channel, experiment, or information structure) before the state of the world is realized. Imagine we have two different forecasts for predicting who will win an election, based on some set of signals like poll results, and we can choose which one to consult to make some decision, like how much of our time or money to allocate to helping our preferred candidate before the election. How do we decide, given that the value we get out of the forecast (assigned by some real-valued payoff function) will depend on the decision we make and the state that ultimately realizes? If we know the probability distribution over the possible signals for each possible state of the world, then Blackwell’s theorem tells us that under certain conditions we can choose which forecast to use even without knowing the specific utility function or the prior distribution over states we are dealing with. The condition that needs to be satisfied is that one information structure is a “garbling” of the other (or, in Blackwell’s terms, that the first is sufficient for the second), meaning we can represent the second as what we get when we apply some state-independent post-processing operation to the first (where both are represented in matrix form).

Once we establish that one forecast is sufficient for the other, we know the second can’t give us more information about the state: Blackwell showed that the expected utility of deciding using the optimal strategy for the original (non-garbled) forecast is always at least as large as that of using the optimal strategy with the garbled forecast. This holds not just for one particular instantiation of a decision problem, but for any prior and utility function. So the ordering gives a concise summary of the statistical conditions under which one experiment is more informative than another. I’m not sure to what extent work like Blackwell’s, which he did in the early 1950s, directly influenced later learning theories that are similarly agnostic to the input distribution, like PAC learning, but this way of thinking seems ahead of its time. When I first learned about it, I found Blackwell ordering exciting because I was frustrated with how dependent our knowledge of what makes a better visual representation of uncertainty tends to be on the specific decision setups that researchers study. So the idea of a theory for unconditional analysis of information structures reflecting uncertainty about some outcome felt like a missing piece I had been looking for. Since then I’ve been paying more attention to other work on the value of information, such as in economics.
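As a sanity check on my own understanding, here’s a small numerical sketch (toy numbers I made up, not anything from Blackwell): it constructs a second forecast as a garbling of the first and spot-checks that the garbled forecast never has higher value under the optimal decision rule, across many random priors and payoff matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: two states (candidate wins / loses) and three possible signals
# (e.g., coarse poll readings). Rows index states, columns index signals,
# and each row sums to 1.
P1 = np.array([[0.7, 0.2, 0.1],    # P(signal | win) under forecast 1
               [0.1, 0.3, 0.6]])   # P(signal | lose)

# A garbling: post-process forecast 1's signals with a row-stochastic matrix,
# i.e., randomly relabel signals independently of the state.
M = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])
P2 = P1 @ M                        # forecast 2 is a garbling of forecast 1

def value(P, prior, U):
    """Expected payoff of the Bayes-optimal decision rule for structure P.
    prior: distribution over states; U: states x actions payoff matrix."""
    joint = prior[:, None] * P     # P(state, signal)
    # For each signal, pick the action maximizing expected payoff, then sum.
    return sum((joint[:, s] @ U).max() for s in range(P.shape[1]))

# Blackwell's result: the original forecast is worth at least as much as its
# garbling for every prior and payoff matrix. Spot-check with random problems.
for _ in range(1000):
    prior = rng.dirichlet(np.ones(2))
    U = rng.normal(size=(2, 4))    # 4 arbitrary actions
    assert value(P1, prior, U) >= value(P2, prior, U) - 1e-9
print("the garbled forecast never beat the original in 1000 random problems")
```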

More recently I learned about some of Blackwell’s other major contributions, like approachability. The original formulation asked under what conditions the row player in a repeated game with vector payoffs can force his average payoff to approach some target set, making it a generalization of the minimax theorem to vector payoffs. But there are connections to online learning and forecasting; it turns out Blackwell’s result implies no-regret learning and calibration. There’s also the Rao-Blackwell theorem, which says that you can always do at least as well as a given estimator of a parameter by taking its conditional expectation given a sufficient statistic for that parameter.
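As a quick illustration of the Rao-Blackwell part (a standard textbook example, not something specific to Blackwell’s own papers): suppose we estimate P(X = 0) = exp(-lambda) from Poisson draws, starting with the crude unbiased estimator “is the first observation zero?” and then conditioning on the sufficient statistic, the sample sum.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 2.0, 10, 20000   # made-up parameter, sample size, replications

naive, rb = [], []
for _ in range(reps):
    x = rng.poisson(lam, size=n)
    # Crude unbiased estimator of P(X=0): indicator that the first draw is 0.
    naive.append(float(x[0] == 0))
    # Rao-Blackwellized version: E[1(x[0]=0) | sum(x) = t] = ((n-1)/n)**t
    # for i.i.d. Poisson samples, since x | sum is multinomial.
    rb.append(((n - 1) / n) ** x.sum())

print("true value            ", np.exp(-lam))
print("naive mean, sd        ", np.mean(naive), np.std(naive))
print("Rao-Blackwell mean, sd", np.mean(rb), np.std(rb))
```

Both estimators are unbiased, but the conditional-expectation version has a noticeably smaller standard deviation, which is the point of the theorem.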

This fall I attended a workshop on data-driven decision processes, where there was a day devoted to Blackwell. Some of the talks were by people who had spent time around him and shared anecdotes about what he was like to work with. He sounds like he was a remarkable person to be around, someone who undoubtedly experienced a lot more pushback along the way than most of his faculty peers but continued to make major contributions and positively affect the people around him in all sorts of ways. Someone mentioned, for instance, how he held high standards for communicating ideas in the most direct, comprehensible way, like when teaching, which I respect. The talks listed under the last day here cover his work and influence, for those who are interested.

However, there is one anecdote that I keep coming across as I’ve been finding out more about Blackwell (not about Blackwell himself, but about his experience as a Black statistician) that I have to complain about. It involves Blackwell’s first interactions with the UC Berkeley math department, which at the time contained statistics and where he later spent years as a professor and chair. Apparently, he was considered as a hire in 1942 but was not extended an offer, because of racism. Many accounts describe how the offer was blocked by the spouse of a faculty member, e.g., “Blackwell would later in his career find out the then head of the math department’s wife protested Blackwell’s hiring. It was customary to host faculty members in their home and the wife objected to hosting a black man in her house,” or “Blackwell was blocked by one of the ‘faculty wives.’” Or, put differently: “In 1954, after an initial attempt in 1942, which failed due to the racial prejudice of some faculty families, Blackwell was appointed Professor.”

To be clear, I do not doubt that someone’s wife (or more specifically, the chair’s wife, as some records describe) objected to Blackwell getting hired. Racism is not hard to find even now, so in the pre-civil-rights era it does not seem surprising that some professor’s wife would have made blatantly racist comments. And I’m glad that the bias Blackwell had to deal with during his life is brought to light in recounting his career.

But… isn’t it a little odd, so long after the fact, to be talking about someone’s wife as the reason Blackwell wasn’t offered a job in what was presumably at the time an all- or mostly male department? I respect that speakers and authors who retell this want to acknowledge the racism Blackwell experienced in a mostly white academic institution, and I can understand why some of the original faculty involved might have been frustrated by the influence of someone’s wife. But eighty years later, it seems kind of weird to hear this retold, as if we’re going out of our way to put the blame on some woman who wasn’t even part of the department. The optics distract from the more interesting story of who Blackwell was.

I doubt the reasons for bringing up this particular anecdote are to intentionally redirect blame; more likely people have heard it and echo it because it makes for a more memorable story. From looking around online, it seems to have originated in Constance Reid’s biography of Neyman.