This is Jessica. A couple weeks ago I posted on the lack of standardization in how people design experiments to study judgment and decision making, especially in applied areas of research like visualization, human-centered AI, privacy and security, NLP, etc. My recommendation was that researchers should be able to define the decision problems they are studying in terms of the uncertain state on which the decision or belief report in each trial is based, the action space defining the range of allowable responses, the scoring rule used to incentivize and/or evaluate the reports, and the process that generates the signals (i.e., stimuli) that inform on the state. And that not being able to define these things points to limitations in our ability to interpret the results we get.
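If it helps to see those four pieces in one place, here is a minimal sketch in code. Everything in it (the rain framing, the numbers, the function names) is an illustrative assumption of mine, not a prescription:

```python
import random

# Toy decision problem with the four components named above.

# 1. Uncertain state: will it rain today? theta in {0, 1}.
P_RAIN = 0.3  # assumed base rate

def draw_state(rng):
    return 1 if rng.random() < P_RAIN else 0

# 2. Signal-generating process: the stimulus shown to the participant
#    is a forecast that matches the state with some assumed accuracy.
def draw_signal(state, rng, accuracy=0.8):
    return state if rng.random() < accuracy else 1 - state

# 3. Action space: report a probability of rain, any value in [0, 1].

# 4. Scoring rule: quadratic (Brier) score, which is proper, so the
#    participant's expected score is maximized by reporting their belief.
def score(report, state):
    return 1 - (report - state) ** 2

rng = random.Random(0)
state = draw_state(rng)
signal = draw_signal(state, rng)
print(f"state={state}, signal shown={signal}, score if report 0.7: {score(0.7, state)}")
```

The point is not this particular task; it's that each of the four pieces has to be written down somewhere before the responses can carry a normative interpretation.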
I am still thinking about this topic, and about why I feel strongly that the results are hard to interpret when the participant isn't given a clear goal to aim for in responding, i.e., one that is aligned with the reward they get on the task.
It’s fair to say that when we interpret the results of experiments involving human behavior, we tend to be optimistic about how what we observe in the experiment relates to people’s behavior in the “real world.” The default assumption is that the experiment results can help us understand how people behave in some realistic setting that the experimental task is meant to proxy for. There sometimes seems to be a divide among researchers, between a) those who believe that judgment and decision tasks studied in controlled experiments can be loosely based on real world tasks without worrying about things being well-defined in the context of the experiment and b) those who think that the experiment should provide (and communicate to participants) some unambiguously defined way to distinguish “correct” or at least “better” responses, even if we can’t necessarily show that this understanding matches some standard we expect to operate in the real world.
From what I see, more of the researchers running controlled studies in applied fields are in the former camp, whereas the latter perspective is more standard in behavioral economics. Those in applied fields appear to think it’s ok to put people in a situation where they are presented with some choice or asked to report their beliefs about something, without spelling out to them exactly how what they report will be evaluated or how their payment for doing the experiment will be affected. And I will admit I too have run studies that use under-defined tasks in the past.
Here are some reasons I’ve heard for not using a well-defined task in a study:
—People won’t behave differently if I do that. People will sometimes cite evidence that behavior in experiments doesn’t seem very responsive to incentive schemes, extrapolating from this that giving people clear instructions on how they should think about their goals in responding (i.e., what constitutes good versus bad judgments or decisions) will not make a difference. So it’s perceived as valid to just present some stuff (treatments) and pose some questions and compare how people respond.
—The real world version of this task is not well-defined. Imagine studying how people use dashboards giving information about a public health crisis, or election forecasts. Someone might argue that there is no single common decision or outcome to be predicted in the real world when people use such information, and even if we choose some decision like ‘should I wear a mask’ there is no clear single utility function, so it’s ok not to tell participants how their responses will be evaluated in the experiment.
—Having to understand a scoring rule will confuse people. Relatedly, people worry that constructing a task where there is some best response will require explaining complicated incentives to study participants. They might get confused, which will interfere with their “natural” judgment processes in this kind of situation.
I do not find these reasons very satisfying. The problem is how to interpret the elicited responses. Sure, it may be true that in some situations, participants in experiments will act more or less the same whether you put some display of information on X in front of them and say “make this decision based on what you know about X,” or you display the same information and ask the same thing but also explain exactly how you will judge the quality of their decision. But I don’t think it matters if they act the same. There is still a difference: in the latter case, where you’ve defined what a good versus bad judgment or decision is, you know that the participants know (or at least that you’ve attempted to tell them) what their goal is when responding. And ideally you’ve given them a reason to try to achieve that goal (incentives). So you can interpret their responses as their attempt at fulfilling that goal given the information they had at hand. In terms of the loss you observe in responses relative to the best possible performance, you still can’t disambiguate the effect of their not understanding the instructions from their inability to perform well on the task despite understanding it. But you can safely consider the loss you observe as reflecting an inability to do that task (in the context of the experiment) properly. (Of course, if your scoring rule isn’t proper then you shouldn’t expect them to be truthful even under perfect understanding of the task. But the point is that we can be fairly specific about the unknowns.)
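To make that aside about properness concrete, here’s a quick numerical check (a toy of my own, with an assumed belief of 0.7): under a proper rule, reporting your belief maximizes your expected score; under an improper one, the optimal report distorts it.

```python
# A participant believes P(rain) = 0.7 and picks the report q that
# maximizes their expected score, for two different scoring rules.

def expected(score_fn, belief, q):
    # Expected score of reporting q when the true belief is `belief`.
    return belief * score_fn(q, 1) + (1 - belief) * score_fn(q, 0)

brier = lambda q, state: 1 - (q - state) ** 2       # proper
linear = lambda q, state: q if state else (1 - q)    # improper

belief = 0.7
grid = [i / 100 for i in range(101)]
best_brier = max(grid, key=lambda q: expected(brier, belief, q))
best_linear = max(grid, key=lambda q: expected(linear, belief, q))
print(best_brier)   # 0.7 -> truthful report is optimal
print(best_linear)  # 1.0 -> optimal report distorts the belief
```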
When you ask for some judgment or decision but don’t say anything about how that’s evaluated, you are building variation in how the participants interpret the task directly into your experiment design. You can’t say what their responses mean in any sort of normative sense, because you don’t know what scoring rule they had in mind. You can’t evaluate anything.
Again this seems rather obvious if you’re used to formulating statistical decision problems. But I encounter examples all around me that appear at odds with this perspective. I get the impression that in fields like visualization or human-centered AI, how precisely to define the task is seen as a “subjective” decision for the researcher to make. I’ve heard studies that define tasks in a decision-theoretic sense accused of “overcomplicating things.” But then when it’s time to interpret the results, the distinction is not acknowledged, and so researchers will engage in quasi-normative interpretation of responses to tasks that were never well defined to begin with.
This problem seems to stem from a failure to acknowledge the differences between behavior in the experimental world versus in the real world: We do experiments (almost always) to learn about human behavior in settings that we think are somehow related to real world settings. And in the real world, people have goals and prior beliefs. We might not be able to perceive what utility function each individual person is using, but we can assume that behavior is goal-directed in some way or another. Savage’s axioms and the derivation of expected utility theory tell us that for behavior to be “rationalizable”, a person’s choices should be consistent with their beliefs about the state and the payoffs they expect under different outcomes.
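In symbols (my rendering of the textbook form, not a quote from Savage): a choice $a^*$ from action space $A$ is rationalizable when it maximizes expected utility under the person’s subjective beliefs $p(\theta)$ over states,

$$a^* \in \arg\max_{a \in A} \sum_{\theta} p(\theta)\, u(a, \theta).$$

The design point is that the experimenter controls $u$ only by defining and communicating the scoring rule; otherwise $u$ is whatever the participant privately supplies.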
When people are in an experiment, the analogous real world goals and beliefs for that kind of task will not generally apply. For example, people might take actions in the real world for intrinsic value – e.g., I vote because I feel like I’m not a good citizen if I don’t vote. I consult the public health stats because I want to be perceived by others as informed. But it’s hard to motivate people to take actions based on intrinsic value in an experiment, unless the experiment is designed specifically to look at social behaviors like development of norms or to study how intrinsically motivated people appear to be to engage with certain content. So your experiment needs to give them a clear goal. Otherwise, they will make up a goal, and different people may do this in different ways. And so you should expect the data you get back to be a hot mess of heterogeneity.
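Here’s a toy simulation of that (the goals and numbers are invented by me for illustration): participants who all hold the same belief but make up different goals produce responses whose spread has nothing to do with their beliefs.

```python
import random

# Everyone believes P(rain) = 0.7, but with no stated scoring rule,
# each participant supplies their own goal.
rng = random.Random(7)
BELIEF = 0.7

def respond(goal):
    if goal == "report_belief":      # assumes something like a proper rule
        return BELIEF
    if goal == "maximize_accuracy":  # assumes all-or-nothing scoring
        return 1.0 if BELIEF > 0.5 else 0.0
    if goal == "hedge":              # wants to avoid looking overconfident
        return 0.5
    return round(rng.random(), 2)    # "guess what the experimenter wants"

goals = ["report_belief", "maximize_accuracy", "hedge", "guess"]
responses = [respond(rng.choice(goals)) for _ in range(12)]
print(responses)
# Identical beliefs, heterogeneous reports: the spread reflects invented
# goals, not differences in judgment.
```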
To be fair, the data you collect may well be a hot mess of heterogeneity anyway, because it’s hard to get people to interpret your instructions correctly. We have to be cautious interpreting the results of human-subjects experiments because there will usually be ambiguity about the participants’ understanding of the task. But at least with a well-defined task, we can point to a single source of uncertainty about our results. We can narrow down the reasons for bad performance to either real challenges people face in doing the task or a failure to understand the instructions. When the task is not well-defined, the space of possible explanations of the results is huge.
Another way of saying this is that we can only really learn things about behavior in the artificial world of the experiment. As much as we might want to equate it with some real world setting, extrapolating from the world of the controlled experiment to the real world will always be a leap of faith. So we better understand our experimental world.
A challenge when you operate under this understanding is how to explain to people who have a more relaxed attitude about experiments why you don’t think that their results will be informative. One possible strategy is to tell people to try to see the task in their experiment from the perspective of an agent who is purely transactional or “rational”:
Imagine your experiment through the eyes of a purely transactional agent, whose every action is motivated by what external reward they perceive to be in it for them. (There are many such people in the world actually!) When a transactional agent does an experiment, they approach each question they are asked with their own question: How do I maximize my reward in answering this? When the task is well-defined and explained, they have no trouble figuring out what to do, and proceed with doing the experiment.
However, when the transactional agent reaches a question where they can’t determine how to maximize their reward, because they haven’t been given enough information, they shut down. This is because they are (quite reasonably) unwilling to take a guess at what they should do when it hasn’t been made clear to them.
But imagine that our experiment requires them to keep answering questions. How should we think about the responses they provide?
We can imagine many strategies they might use to make up a response. Maybe they try to guess what you, as the experimenter, think is the right answer. Maybe they attempt to randomize. Maybe they can’t be bothered to think at all and they call in the nearest cat or three-year-old to act on their behalf.
We could probably make this exercise more precise, but the point is that if you would not be comfortable interpreting the data you get under the above conditions, then you shouldn’t be comfortable interpreting the data you get from an experiment that uses an under-defined task.