This is Jessica. I recently heard the term elicitation being used in a machine learning context, where the loss function was referred to as “eliciting” information from the algorithm during optimization. This seemed unusual, though not necessarily incorrect, given a dictionary definition like “the process of drawing out some response or information.” It got me thinking about how the meaning of “elicitation” is interpreted in different academic circles and how they vary in strictness of definition.
For example, when it’s used in fields like visualization and human-computer interaction, elicitation tends to refer to asking people for some beliefs, whether or not the beliefs are about some state of the world for which ground truth can be obtained, and regardless of how they are rewarded for providing the beliefs or whether the beliefs are tested for coherence in any way. A paper might use the term elicitation to refer to something like asking users of some interface tool to rate their level of trust on a scale from 1 to 10.
As a research topic, though, elicitation has been scoped more specifically to asking people for information about the uncertainty of some state or event. For example, elicitation in judgment and decision making often refers to prior elicitation from domain experts. The classic literature by Brier and others assumes the goal is to elicit a probabilistic forecast in the form of a distribution. Closely related is using elicitation to refer to getting information about properties of random variables rather than entire distributions, like an expectation, median, or variance.
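To make the idea of eliciting a property concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not taken from any of the work mentioned above: the three-outcome distribution is made up, and `expected_loss` and `best_report` are hypothetical helper names. It searches a grid of candidate reports to show that someone minimizing expected squared error ends up reporting the mean of their belief distribution, while someone minimizing expected absolute error ends up reporting the median.

```python
def expected_loss(report, outcomes, probs, loss):
    """Expected loss of a point report under the agent's belief distribution."""
    return sum(p * loss(report, y) for y, p in zip(outcomes, probs))


def best_report(outcomes, probs, loss, grid):
    """Grid search for the report that minimizes expected loss."""
    return min(grid, key=lambda r: expected_loss(r, outcomes, probs, loss))


def squared_loss(report, y):
    return (report - y) ** 2


def absolute_loss(report, y):
    return abs(report - y)


# Made-up belief: a skewed distribution over three outcomes.
outcomes = [0.0, 1.0, 10.0]
probs = [0.2, 0.5, 0.3]

# Candidate reports from 0.00 to 10.00 in steps of 0.01.
grid = [i / 100 for i in range(0, 1001)]

belief_mean = sum(p * y for y, p in zip(outcomes, probs))
report_under_squared = best_report(outcomes, probs, squared_loss, grid)
report_under_absolute = best_report(outcomes, probs, absolute_loss, grid)

print("mean of belief:", belief_mean)
print("best report under squared loss:", report_under_squared)
print("best report under absolute loss:", report_under_absolute)
```

With this skewed belief the two losses pull out different numbers from the same person (the mean sits well above the median), which is the sense in which the scoring function determines which property you are eliciting.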
However, even when different camps agree that elicitation is getting information about state uncertainty, there are different views on what is required to “draw out” information from some agent or process (rather than just requesting it). Some see the incentive scheme as a necessary component, implying elicitation is about making someone want to give you certain information. This view is common in work on elicitation in theoretical CS and econ, for instance, which assumes that proper scoring rules are available to evaluate the elicited information against ground truth: rules such that if the person is aware of how the rule works, then they will report honestly because doing so maximizes their expected payoff (and in the case of strictly proper scoring rules, the payoff-maximizing report is unique). Consequently, we get definitions of elicitable properties as those for which we can construct a strictly proper scoring rule. In this view, if we aren’t using proper scoring rules, we might be better off using a term like “solicit” to convey that we are attempting to obtain some information but are not able to provide any sort of guarantees. Or we should use terms that have been developed to refer specifically to eliciting non-verifiable information under different conditions (e.g., peer prediction, where scoring is based on comparing a response elicited from one agent to other agents’ responses, or mechanism design, which is about eliciting how much people value something, e.g., utility functions). Some of these approaches are mathematically very similar to what the theorists call elicitation, but are still referred to under different terms.
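Here is a small Python sketch of what “proper” buys you, for the simplest case of a single binary event. The setup is an assumption for illustration: the agent believes the event has probability 0.7 and picks the report that maximizes expected payoff, and the function names are hypothetical. The quadratic (Brier-style) and logarithmic scores are standard strictly proper rules; the linear score is included as an example of a rule that is not proper.

```python
import math


def quadratic_score(q, y):
    """Brier-style payoff for reporting probability q when the outcome is y (0 or 1)."""
    return 1.0 - (y - q) ** 2


def log_score(q, y):
    """Logarithmic payoff: log of the probability assigned to what happened."""
    return math.log(q if y == 1 else 1.0 - q)


def linear_score(q, y):
    """Payoff linear in the reported probability; NOT a proper scoring rule."""
    return 1.0 - abs(y - q)


def expected_score(score, q, p):
    """Expected payoff of reporting q for an agent whose true belief is p."""
    return p * score(q, 1) + (1.0 - p) * score(q, 0)


def payoff_maximizing_report(score, p, grid):
    """Grid search for the report with the highest expected payoff."""
    return max(grid, key=lambda q: expected_score(score, q, p))


belief = 0.7
# Candidate reports 0.01, ..., 0.99 (endpoints left out so the log score stays finite).
grid = [i / 100 for i in range(1, 100)]

report_quadratic = payoff_maximizing_report(quadratic_score, belief, grid)
report_log = payoff_maximizing_report(log_score, belief, grid)
report_linear = payoff_maximizing_report(linear_score, belief, grid)

print("quadratic:", report_quadratic)
print("log:", report_log)
print("linear:", report_linear)
```

Under the quadratic and log scores the best report is the belief itself, 0.7. Under the linear score the expected payoff keeps increasing in the report whenever the belief is above 0.5, so the agent does best by exaggerating to the largest report allowed (0.99 on this grid).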
Much of the work focused on prior elicitation from domain experts for statistical analysis, however, seems to have moved away from relying on proper scoring rules. Instead, the goal seems to be figuring out structured ways to ask for distributions or properties in ways that minimize the addition of noise to the modeling process. As one canonical survey claims, “We see that elicitation as simply part of the process of statistical modeling. Indeed, in a hierarchical model at which point the likelihood ends and the prior begins is ambiguous.” Acknowledging the murkiness of using elicited priors makes a lot of sense to me, having spent a few years working in this space on graphical elicitation of prior distributions. It’s difficult to evaluate whether you got the “right” information from someone, i.e., beliefs they held before you asked, given that asking people for beliefs can trigger a constructive process. It’s also difficult to evaluate elicitation methods against one another by comparing the distributions they imply, since if they ask for different information sets (e.g., different properties of a distribution), then you might be using different methods with different assumptions to map the inputs to distributions. One unfortunate thing I also discovered along the way is that the prior elicitation literature is not well integrated across different fields, and so you have people working on similar problems in different areas but not really citing each other, which is frustrating.
Related to the differences in how much weight researchers put on incentive compatibility, there’s a debate that I’ve seen play out a few times at talks and in conversations. Faced with the results of some human user study or experiment that elicited beliefs from people, the theorist asks: How can you even interpret the results from this study if you didn’t incentivize their reports? And the counterargument from the behavioral researcher is often along the lines of: Well, we don’t tend to see any big effects on results when we change incentives, so I doubt it would make a difference. To which the economist or theorist might launch into a lecture on the importance of incentive compatibility and proper scoring rules and so on. And the conversation continues with neither side convinced that the other side knows what they are talking about.
One challenge with seeing elicitation as equivalent to use of proper scoring rules is that a proper scoring rule might not work as intended due to people not understanding it. In an ML context where you don’t have to care about interpretability, you just set up the loss function. But with people, the rule has to be sufficiently interpretable that they recognize that responding truthfully will maximize their payoffs. In game theory, this has led to the definition of “obviously strategy-proof mechanisms,” mechanisms that have an equilibrium not just in dominant strategies (those that are best regardless of what others do) but in obviously dominant strategies, defined as strategies where, for any deviating strategy, at the earliest information set where the obviously dominant strategy and the alternative diverge, the best possible outcome you can get from the alternative is no better than the worst outcome you can get from the obviously dominant strategy. It’s like designing for the worst-case scenario where no one is capable of contingent reasoning. Maybe there are other variants of baking interpretability into elicitation; I would be interested in pointers.
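The best-versus-worst comparison in that definition can be sketched on a toy example in Python. This is a heavily simplified illustration under my own modeling assumptions: a single bidder with value 5, integer prices from 0 to 10, one rival, and hypothetical function names. It contrasts a sealed-bid second-price auction, where bidding your value is dominant but not obviously dominant, with an ascending clock auction, where staying in until the price reaches your value passes the check at every information set.

```python
def second_price_payoff(value, bid, rival_bid):
    """Sealed-bid second-price auction: the winner pays the rival's bid (ties lose)."""
    return value - rival_bid if bid > rival_bid else 0


def sealed_bid_truthful_is_obviously_dominant(value, bids):
    """There is a single information set, so every deviation diverges right away."""
    worst_truthful = min(second_price_payoff(value, value, r) for r in bids)
    best_deviation = max(
        second_price_payoff(value, b, r) for b in bids if b != value for r in bids
    )
    return best_deviation <= worst_truthful


def clock_truthful_is_obviously_dominant(value, prices):
    """Ascending clock: at each price, with the rival still active, stay or quit.

    Truthful strategy (assumed here): stay while price < value, quit otherwise.
    """
    for p in prices:
        # Prices at which the rival might still drop out, handing us the item at that price.
        later = [d for d in prices if d >= p]
        if p < value:
            # Truthful stays; it only ever wins at prices below value, else quits for 0.
            worst_truthful = min([value - d for d in later if d < value] + [0])
            best_deviation = 0  # the deviation quits now and gets 0 for sure
        else:
            worst_truthful = 0  # truthful quits now and gets 0 for sure
            # The deviation stays: it may win at some price >= p, or quit later for 0.
            best_deviation = max([value - d for d in later] + [0])
        if best_deviation > worst_truthful:
            return False
    return True


value = 5
prices = list(range(0, 11))

sealed_ok = sealed_bid_truthful_is_obviously_dominant(value, prices)
clock_ok = clock_truthful_is_obviously_dominant(value, prices)
print("second-price sealed bid, truthful obviously dominant:", sealed_ok)
print("ascending clock, truthful obviously dominant:", clock_ok)
```

In the sealed-bid format, a bidder who cannot reason contingently sees that overbidding could win the item for nothing while bidding truthfully could leave them with zero, so truthful bidding fails the check even though it is in fact dominant. In the clock format the comparison at each price is between a sure zero and something that can never beat zero, so no contingent reasoning is needed.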
It also seems plausible that there may be situations where, through enough experience, people could internalize the scoring rule even if they don’t understand descriptions of it. Some people might need to see it in action, used to determine their payoffs, to realize that it’s in their best interest to respond truthfully, and that learning process may take time. This implies that to be sure that your proper scoring rule worked, you need to evaluate the responses you obtained. It becomes more of a process than a definition.
P.S. Dan Goldstein sends a link to this study on the binarized scoring rule, which finds that giving people transparent information on quantitative incentives reduces truth-telling compared to not telling participants anything about the rule. Doh!
P.P.S. See the recent survey by Aki and others for a deeper dive into prior elicitation challenges and practice.
This all sounds like a measurement error problem of sorts. Priors are data in the sense of more information. In a Stan model, they are just pieces of information added to the log-posterior. ‘Data’ are observations and can have measurement error. I guess prior elicitation is just a way of gathering a different kind of data, which is also subject to measurement error.
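A rough way to see the “priors are data” point is a plain Python sketch. It is not Stan code; it assumes a normal model with known standard deviation, and the function names and numbers are made up for illustration. A normal prior on the parameter adds exactly the same term to the log-posterior as one extra pseudo-observation located at the prior mean, with measurement noise equal to the prior standard deviation.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def log_posterior_with_prior(theta, ys, sigma, prior_mean, prior_sd):
    """Unnormalized log-posterior: log-likelihood plus a normal log-prior on theta."""
    log_lik = sum(normal_logpdf(y, theta, sigma) for y in ys)
    log_prior = normal_logpdf(theta, prior_mean, prior_sd)
    return log_lik + log_prior


def log_lik_with_pseudo_observation(theta, ys, sigma, pseudo_y, pseudo_sd):
    """Log-likelihood where the elicited belief enters as one more noisy measurement of theta."""
    log_lik = sum(normal_logpdf(y, theta, sigma) for y in ys)
    return log_lik + normal_logpdf(pseudo_y, theta, pseudo_sd)


ys = [1.2, 0.7, 1.9]          # made-up observations
sigma = 1.0                   # assumed known observation noise
elicited_mean = 0.5           # made-up elicited belief about theta
elicited_sd = 2.0             # made-up uncertainty attached to that belief

for theta in [-1.0, 0.0, 0.5, 2.0]:
    a = log_posterior_with_prior(theta, ys, sigma, elicited_mean, elicited_sd)
    b = log_lik_with_pseudo_observation(theta, ys, sigma, elicited_mean, elicited_sd)
    print(theta, a, b)
```

The two columns agree because the normal density is symmetric in its argument and its mean. On this reading, error in the elicited mean or standard deviation plays the same role as measurement error in an observation.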