Not-so-obviously heuristic-proof reforms to statistical communication

This is Jessica. I’ve subscribed to aspects of the “estimation” movement–the move toward emphasizing the magnitude and uncertainty of effects and testing multiple hypotheses rather than NHST–for a while, having read this blog for years and switched over to using Bayesian stats when I first became faculty. I try to write results sections of papers that focus on the size of effects and their uncertainty over dichotomous statements (which, by the way, can be very hard to do when you’re working under strict page limits, as in many computer science venues, and even harder to train students to do). I would seem to be a natural proponent of estimation given that some of my research has been about more expressive visualizations of uncertainty, e.g., arguing that rather than using error bars or even static depictions of marginal distributions, which invite heuristics, we should find ways to present uncertainty that make it concrete and hard to ignore (sets of samples across time or space).

But something that has irked me for a while now is what seems to be a pervasive assumption in arguments for emphasizing effect magnitude and uncertainty: that doing so will make the resulting expressions of results more robust to misinterpretation. I don’t think it’s that simple.

Why is it so easy to think it is? Maybe because shifting focus to magnitude and uncertainty of effects implies an ordering of results expressions in terms of how much information they provide about the underlying distributions of effects. NHST p-values are less expressive than point estimates of parameters with confidence or credible intervals. Along the same lines, giving someone information on the raw measurements (e.g., predictive intervals) along with point estimates plus confidence intervals should make them even better off, since you can’t uniquely identify a sample distribution from a 95% CI. If we are talking about describing and discussing many hypotheses, that too would seem more expressive of the data than discussing only comparisons to a null hypothesis of no effect. 
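To make that ladder concrete, here is a minimal sketch (my own toy numbers, not from any particular study) of why a 95% CI for a mean doesn’t pin down the sample distribution: two simulated studies can report nearly identical point estimates and CIs while their raw measurements have very different spreads.

```python
# Minimal sketch (toy numbers): the same point estimate and nearly the same
# 95% CI for the mean can come from samples with very different spreads,
# so a CI alone cannot tell you where the raw measurements live.
import numpy as np

rng = np.random.default_rng(1)

# Study 1: small spread, small n; Study 2: large spread, large n,
# tuned so the standard error of the mean is about the same.
study1 = rng.normal(loc=0.5, scale=1.0, size=100)      # sd = 1,  n = 100
study2 = rng.normal(loc=0.5, scale=10.0, size=10_000)  # sd = 10, n = 10,000

for name, x in [("study1", study1), ("study2", study2)]:
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))
    ci = (mean - 1.96 * se, mean + 1.96 * se)                 # 95% CI for the mean
    spread = (np.quantile(x, 0.025), np.quantile(x, 0.975))   # where the raw data live
    print(f"{name}: mean={mean:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f}), "
          f"central 95% of raw data=({spread[0]:.1f}, {spread[1]:.1f})")
```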

But is more information always better? In some of these cases (e.g., showing the raw data points plus the means and CIs) I would expect the more expressive representation to be better, since I’ve seen in experiments (e.g., here) that people tend to overestimate effect sizes when given information about standard error rather than standard deviation. But as behavioral agents, I think it’s possible that being served some representation of effects higher on the information ladder will sometimes make us worse off. This is because people have cognitive processing limitations. Lots of research shows that when faced with distributional information, people often satisfice by applying heuristics, or shortcut decision strategies, that rely on some proxy for what they really should consider to make a judgment under uncertainty.

I am still thinking through what the best examples of this are, but for now I’ll just give a few anecdotes that seem related to inappropriately assuming that more information must help. First, related to my own research, we once tested how well people could make effect size judgments like estimating the probability of superiority (i.e., the probability that a draw from a random variable B is greater than one from a random variable A) from different representations of two normal distributions with homogeneous variance, including density plots, quantile dotplots, intervals, and animated hypothetical outcome plots, which showed random draws from the joint distribution of A and B in each frame. Unless we expect people to mentally calculate the probability of superiority from their estimates of the properties of each pdf, the animated plots should’ve offered the most useful information for the task, because all you really needed to do was estimate how frequently the draws from A and B changed order as you watched the animation. However, we didn’t see a performance advantage from using them – results were noisy and in fact people did a bit worse with them. It turns out only a minority (16%) reported directly using the frequency information they were given to estimate effect size, while the rest reported using some form of heuristic, such as first watching the animation to estimate the mean of each distribution, then mapping that difference to a probability. This was a kind of just-shoot-me moment for me as a researcher, given that the whole point of the animated visualization was to prevent people from defaulting to judging the visual distance between means and mapping that to a probability scale more or less independently of the variance.
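For readers who want the underlying math, here is a small sketch (my own illustration, not the study’s stimuli or analysis) of why the mean-distance heuristic can’t work: for two normal distributions with a common standard deviation, the probability of superiority depends on that standard deviation, not just on the distance between the means.

```python
# Toy illustration: probability of superiority P(B > A) for two independent
# normals depends on the spread, not just the distance between the means,
# so mapping "visual distance between means" to a probability must fail
# when the variance changes.
import numpy as np
from scipy.stats import norm

def prob_superiority(mu_a, mu_b, sd):
    """Exact P(B > A) for independent normals with a common sd."""
    return norm.cdf((mu_b - mu_a) / (np.sqrt(2) * sd))

rng = np.random.default_rng(7)

for sd in [0.5, 1.0, 2.0]:
    mu_a, mu_b = 0.0, 1.0            # same distance between means...
    exact = prob_superiority(mu_a, mu_b, sd)
    # ...checked against the frequency a viewer could in principle read off an
    # animated plot: how often a draw from B exceeds a draw from A.
    a = rng.normal(mu_a, sd, 100_000)
    b = rng.normal(mu_b, sd, 100_000)
    mc = np.mean(b > a)
    print(f"sd={sd}: exact P(B>A)={exact:.2f}, simulated={mc:.2f}")
# The same mean difference yields P(B>A) of roughly 0.92, 0.76, and 0.64 as the
# sd grows, so a judgment based on mean distance alone cannot track the answer.
```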

Another example that comes to mind is a little more theoretical, but perhaps analogous to some of what happens under human heuristics. It’s based on a result about how the ordering of channels in an information theoretic sense can be counterintuitive. Imagine we have a decision problem for which we define a utility function, which takes the state of the world and an action that the decision maker selects, and outputs a real-valued utility. For each possible state of the world there is a probability distribution over the set of values the measurement can take. The measurement process (or “noisy channel” or “experiment”) can be represented as a matrix of conditional probabilities of each possible output given each possible input, where the inputs are drawn from some input distribution S.

Now imagine we are comparing two different channels, k2 and k1, and we discover that k1 can be represented as the result of multiplying our matrix k2 by a matrix representing a post-processing operation on its outputs. We then call k1 a garbling of k2, capturing how if you take a measurement and then do some potentially noisy post-processing, the result can’t give you more information about the original state. If we know that k1 is a garbling of k2, then according to Blackwell’s theorem, when an agent chooses k2 and uses the optimal decision rule for k2, her expected utility is always (i.e., for any input distribution or utility function) at least as big as what she gets when she chooses k1 and uses the optimal decision rule for k1. This implies other forms of superiority as well, like that for any given input distribution S the mutual information between the channel output of k2 and S is at least as high as that between the channel output of k1 and S. All this seems to align with our intuition that more information can’t make us worse off.
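As a sanity check on that intuition, here is a toy numerical example (my own construction, with made-up channel matrices and a simple matching utility, not anything from the paper) of the Blackwell/data-processing direction: garbling a channel’s output cannot raise mutual information or the expected utility of an agent who best-responds to what she observes.

```python
# Toy numerical check of the post-processing (garbling) direction:
# k1 = k2 @ M, where M is a stochastic post-processing matrix, can only do
# as well as or worse than k2 in both mutual information and expected utility.
import numpy as np

def mutual_information(p_s, K):
    """I(S; Y) in bits for input distribution p_s and channel K[s, y] = P(y|s)."""
    joint = p_s[:, None] * K
    p_y = joint.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (p_s[:, None] * p_y[None, :]))
    return np.nansum(terms)

def optimal_expected_utility(p_s, K, U):
    """Expected utility when the agent best-responds to each channel output.
    U[s, a] is the utility of action a in state s."""
    joint = p_s[:, None] * K                       # P(s, y)
    return sum(max(joint[:, y] @ U[:, a] for a in range(U.shape[1]))
               for y in range(K.shape[1]))

p_s = np.array([0.5, 0.5])                         # prior over two states
k2 = np.array([[0.9, 0.1],                         # a fairly informative channel
               [0.2, 0.8]])
M = np.array([[0.7, 0.3],                          # noisy post-processing of the output
              [0.3, 0.7]])
k1 = k2 @ M                                        # k1 is a garbling of k2
U = np.array([[1.0, 0.0],                          # utility 1 for matching the state
              [0.0, 1.0]])

print("I(S;Y):", mutual_information(p_s, k2), ">=", mutual_information(p_s, k1))
print("EU:    ", optimal_expected_utility(p_s, k2, U), ">=",
      optimal_expected_utility(p_s, k1, U))
```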

But – when we consider pre-processing operations rather than post-processing (i.e., we do a transformation on the common input S and then pass it through a channel), things get less predictable. For example, the result in the paper linked above shows that applying a deterministic function as a pre-processing step to an input distribution S can give us counterintuitive cases, like one where the mutual information between the output of one channel and S is higher than the mutual information between the output of another channel and S for any given distribution, but the first channel is not Blackwell superior to the second. This implies that under pre-garbling a channel can lead to higher utility in a decision scenario without necessarily being more informative in the sense of representing some less noisy version of the other. I’m still thinking through how best to translate this to people applying heuristics to results expressions in papers, but one analogy might be this: if you consider a heuristic to be a type of noisy channel, and a choice of how to represent effect distributions to be a type of pre-processing, then it’s possible to have scenarios where people are better off, in the sense of making decisions that are more aligned with the input distributions, given a representation that isn’t strictly more informative to a rational agent (a toy version of this framing is sketched below). If we don’t consider the heuristics, the input distributions, and the utility functions along with the representations of effects, we might create results presentations that seem nice in theory but mislead readers.
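Here is that toy version (again my own construction, not the paper’s counterexample): treat the representation as a channel from the true state to a display, and the reader’s heuristic as a second channel from the display to a judgment. A representation that serves a careful reader better can serve a heuristic reader worse.

```python
# Toy construction: a detailed display helps a reader who best-responds to
# what they see, but hurts a reader whose heuristic suppresses the
# "ambiguous" symbol by reading it as evidence for A.
import numpy as np

p_s = np.array([0.5, 0.5])          # prior over states: "A better", "B better"

# Detailed representation: displays "clearly A", "ambiguous", or "clearly B".
R_detailed = np.array([[0.8, 0.2, 0.0],
                       [0.0, 0.5, 0.5]])
# Coarse representation: always makes a dichotomous call, never "ambiguous".
R_coarse = np.array([[0.85, 0.0, 0.15],
                     [0.15, 0.0, 0.85]])
# Uncertainty-suppressing heuristic: "ambiguous" is read as a judgment for A.
H = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])

def rational_accuracy(p_s, R):
    """Accuracy of a reader who best-responds to each display symbol."""
    joint = p_s[:, None] * R
    return joint.max(axis=0).sum()

def heuristic_accuracy(p_s, R, H):
    """Accuracy when judgments come from the heuristic channel applied to the display."""
    end_to_end = R @ H                   # P(judgment | state)
    return (p_s[:, None] * end_to_end).trace()

for name, R in [("detailed", R_detailed), ("coarse", R_coarse)]:
    print(name, "rational:", rational_accuracy(p_s, R),
          "through heuristic:", heuristic_accuracy(p_s, R, H))
# Rational readers do better with the detailed display (0.90 vs 0.85), but
# through this heuristic the coarse display wins (0.85 vs 0.75).
```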

So instead of relying on our instincts about what we should express when presenting experiment results, my view is that we need to adopt more intentional approaches to “designing” statistical communication reforms. We should be seriously considering what types of heuristics people are likely to use, and using them to inform how we choose between ways of representing results. For example, when dichotomous statements are withheld, do people become more sensitive to somewhat arbitrary characteristics of how the effects are presented, like judging how reliable an effect is by how big it looks in the plots? Is it possible that with more information, some readers get less information because they don’t feel confident enough to trust that the estimated effect is important? On some level, the goal of emphasizing magnitude and variation would seem to be that we do expect these kinds of presentations to make people less confident in what they see in a results section, but we think, in light of the tendency authors have to overestimate effects, that diminishing confidence is a necessary thing. But if that’s the case we should be clear about that communication goal, rather than implying that expressing more detail about effect distributions, and suppressing more high level statements about what effects we see versus don’t see in results, must lead to less biased perceptions.

Another interesting example is going from testing a single hypothesis, or presenting a single analysis path, to presenting a series of (non-null) hypotheses we tested, or a multiverse made of plausible analysis paths we might have taken. These examples contribute more information about uncertainty in effects, but if people naturally apply heuristics like comparing positive versus negative results over the set of hypothesis tests or the set of analysis paths to help distill the abundance of information, we’ve missed the point (see the sketch below). I’m not arguing against more expressive uncertainty communication, just pointing out that it’s not implausible that things might backfire in various ways.
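For instance, here is a hedged sketch (toy numbers of my own) of that vote-counting worry: a multiverse of analysis paths that all estimate the same modest effect can look “mostly null” to a reader who tallies significant versus non-significant paths, even though the distribution of estimates sits clearly above zero.

```python
# Toy sketch of vote counting over a multiverse: with a modest true effect and
# modest power, most paths are non-significant even though the estimates
# themselves are consistently positive.
import numpy as np

rng = np.random.default_rng(3)
true_effect, se, n_paths = 0.2, 0.12, 50       # modest effect, modest power

estimates = true_effect + rng.normal(0, se, n_paths)   # one estimate per path
significant = np.abs(estimates / se) > 1.96

print("paths significant:", significant.sum(), "of", n_paths)   # typically well under half
print("median estimate:", round(np.median(estimates), 2),
      " central 90% of estimates:",
      np.round(np.quantile(estimates, [0.05, 0.95]), 2))
# Vote counting suggests weak, mixed evidence; the estimates are consistently
# positive and centered near the true effect.
```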

It also seems like we have to consider at some point how people interpret the authors’ text-based claims in a paper in tandem with any estimates/visualizations of the effects, since even with estimation-style reporting of effects through graphics or tables, authors still might include confident-sounding generalizations in the text. Do the text statements in the end override the visuals or tables of coefficients? If so, maybe we should be teaching people to write with more acknowledgment of uncertainty. 

At the end of the day though, I don’t think a purely empirical or user-centered approach is enough. One-off human subjects experiments on representations of uncertainty can be fraught when it comes to pointing out the most important limitations of some new approach – we often only learn what we were anticipating in advance. So when I say more intentional design, I’m thinking also about how we might formalize design problems so we can make inferences beyond what we learn from empirical experiments. Game theory might be useful here, and information theory even more so: it’s an obvious tool for reasoning about the conditions (including assumptions about different heuristics, which might be informed by behavioral research) under which we can and cannot expect superiority of certain representations. And computer scientists might be helpful too, since they naturally think about the types of computation that different representations support and the complexity (and worst-case properties) of different procedures.

PS. I see Greenland and Rafi’s suggestion to re-express p-values as information theoretic surprisals, or S-values, which behave better than p-values and can be understood via simple analogies like coin flips, as an exception to what I’m saying. Their work seems to take seriously the importance of understanding how people reason about semantics, and their cognitive limits, when searching for better representations.
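For reference, the S-value is just the base-2 surprisal of the p-value, s = −log2(p), read as “about as surprising as s heads in a row from a fair coin.” A quick sketch:

```python
# S-values (surprisals): s = -log2(p), with the coin-flip reading that
# Greenland and Rafi suggest.
import numpy as np

for p in [0.5, 0.25, 0.05, 0.005]:
    s = -np.log2(p)
    print(f"p = {p:>5}: S = {s:.1f} bits "
          f"(roughly as surprising as {int(round(s))} heads in a row)")
# p = 0.05 maps to about 4.3 bits, i.e., a bit more surprising than
# four heads in a row (p = 1/16).
```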

20 thoughts on “Not-so-obviously heuristic-proof reforms to statistical communication”

  1. This whole line of thought bothers me. I know enough to ignore what advertisers show – they have a good understanding, like magicians, of how to use deception to convey a message they wish, even (especially) when it is not true. If research into visual representations of uncertainty results in some sort of “optimal” presentation, then I will have to start understanding how it is optimal and for whom. What audience was the information designed for, and what heuristics was it assumed they would use when exposed to that information? Then, we have a sort of game theory model where I may change the way I view the information in response to what I believe they are doing, and they modify their presentation, and…..

    Of course, this is unavoidable. Perceptions are not perfect and neither are intentions. And people are not homogeneous in how they respond to information presentations or the heuristics they use (the fact that a large number of people might use particular heuristics does not alleviate my concerns – it accentuates them, as I now need to understand whether my perception habits are like that large group or not).

    Presumably, (aside from marketers and magicians), we are interested in portraying the truth. The fact that recipients may have difficulty processing the truth is what you are suggesting we need to understand better. Sure – but to what end? Isn’t this part of the CDC issue – that they feel the need to misrepresent facts because the public will not perceive the truth accurately? The result is that the CDC has a major credibility issue. So I’m afraid that this line of investigation may serve to undermine the credibility of whatever research information is presented, leaving readers distrusting whether it has been altered in anticipation of how it will be perceived.

    I guess this tirade is not really an argument that such research isn’t necessary or useful. But it makes me uncomfortable. About the only thing I can conclude is that it makes me more insistent than ever that all research that is published or used for any public purpose must release the data. That is the only protection I have against purposely distorted information – even if it is for benevolent purposes.

    • I had similar thoughts reading this… thinking at the same time that it is very interesting and relevant nonetheless.
      One thought I had is that I believe (maybe wrongly) that this kind of research on perception of information will not capture that some people may go through a process – they may misinterpret something at first, and later for whatever reason realise that they got it wrong. My intuition is that this is more likely when they get more informative presentations, even though at first impression research may imply they get more out of less information.

      Also the heterogeneity of people is very important here, I think. If, say, only a minority of people really get more information out of an “objectively” more informative presentation – may this not just be the minority I’m writing for? The most interested minority, who engage most with the material and are most likely to do something constructive with the results?
      (I don’t mean to imply any answer, just asking questions.)

      • I think the process people go through is very important – it makes sense to think that with no restrictions on the time and effort readers spend, we would expect results closer to what rational agents do with the most complete presentations, but probably not for all readers, because, like you say, there will be different strategies. This is all part of why I like the idea of moving toward more formal modeling of communication reforms in addition to the empirical stuff – it makes it easier to test different assumptions about the extent of uncertainty-suppressing heuristics in the population, to see how much better or worse off different groups that process things in different ways are when you start adding more information.

        To Dale’s comment – I’m definitely not arguing that there’s one optimal presentation, nor that authors shouldn’t be making all their data and analysis open. I’m just pointing out that the assumption that presenting more detailed information on effects observed in empirical studies will lead to the most accurate perceptions or beliefs is naive, and that a better way to think about the problem we need to solve in reforming statistical communication is as identifying the distillations that best preserve the information once passed through different heuristics for suppressing variation (which I’ve seen far too much of in my work to ignore anymore!).

    • It’s an interesting question, and it has a number of parts. (1) Creating knowledge is almost useless if you can’t communicate it; (2) effective communication can convince people of things that aren’t true; (3) ineffective communication can fail to convince people of things that are true; and (4) people’s incentives to communicate, in general, are not necessarily aligned toward truth.

    • “Presumably, (aside from marketers and magicians), we are interested in portraying the truth. ”

      I puzzled over this for a minute, but then realized that politicians are a subset of marketers.

    • I can see how this would feel akin to nudging, which I find distastefully manipulative. But, I feel there’s a fundamental difference between exploiting biases and working under their existence. Take the classic example of organ donation opt-in vs. opt-out vs. an active choice. The uninformed designer chooses one at random, the nudgelord chooses to maximize his own utility, and an informed designer hopefully aims for having people choose yes/no, because he knows providing a default introduces a large bias.

      Many presentations are truthful, but knowing how people tend to interpret results can let you choose the truthful presentation that will be interpreted most accurately. I see designing around human biases as being like adding priors: they can be abused by bad actors, and they can be used to get more reliable results.

      Also, all articles and results are colored by us authors. We decide on emphasis and what details make the cut. Even the data is filtered by what we choose to measure. So sure, we’ll present so the ideas we feel are most important are transmitted with the least loss, but there are always tradeoffs that have to be made.

      Bad scientists will produce bad science. The trick is to get good scientists to not produce bad science.

  2. As a fellow proponent of estimation I didn’t have any notions that misunderstandings would be avoided or even reduced. In fact, I thought they might increase.

    but…

    I did believe that they might change in kind. Where misunderstandings under NHST are often of the complete-nonsense variety, misunderstandings under estimation are of the magnitude variety. I’d be much happier living with the latter.

    Having not read many of your studies, Jessica, it seems you’re asking your participants for quantitative information. Are you looking at magnitudes of errors and their distribution? (Can’t imagine you’re not.) It seems to me that with estimation there might be a more graceful degradation.

  3. One needs to distinguish between idealized information processing and human information processing.

    In a Bayesian analysis the idealized information processing is solely through the likelihood _assuming_ the fixed data generating model (which defines the likelihood) and the data are true/accurately recorded (infinite number of zeros after any digits not indicated). Here any new data can only lead to more concentration (likelihoods can only be flat or curved downwards) and coarsening of the data can only lose information. On the other hand, changing the model such as to a multilevel model can widen credible intervals and coarsening the data may increase robustness to misspecification.

    Often people use other methods, and there is a nice paper by XL Meng that, after pointing out the unique property of likelihood above, shows how all bets are off – more data can be worse.

    Now there were empirical studies way back that showed human information processing was very different. Not sure why that faded away. When Andrew and I were writing the Convincing Evidence paper I suggested a reference I had read as a grad student, but Andrew found another reference that we used –

    Driver, M. J., and Streufert, S. (1969). Integrative complexity: an approach to individuals and groups as information-processing systems. Administrative Science Quarterly 14, 272–285.

    I do think likelihood tends to be neglected in the Bayesian literature.

  4. > But is more information always better?

    I think the answer is self-evidently no. Otherwise papers would just be long pages of raw data tables. All good analyses seek to compress information to surface useful and reliable knowledge, and thus have to select what information to show. People, even very smart people, cannot digest an infinite amount of input simultaneously.

  5. > estimating the probability of superiority (i.e., the probability that a
    > draw from a random variable B is greater than one from random variable A)

    If the quantity of interest is this probability of superiority, why not just calculate that and give it to the reader? Why expect them to be able to mentally guess it from other information?

    > the animated plots should’ve offered the most useful information for the
    > task, because all you really needed to do was estimate how frequently the
    > draws from A and B changed order as they watched the animation.

    That doesn’t sound easy to me to do mentally. If I had to do that, I’d keep track by making marks on a piece of paper.

    • >If the quantity of interest is this probability of superiority, why not just calculate that and give it to the reader? Why expect them to be able to mentally guess it from other information?

      Probability of superiority (aka Common Language Effect Size) has been proposed as a more intuitive way to present effect size than Cohen’s d, but papers very rarely give it directly. So if authors don’t give readers significance information to imply some rundown of which effects are “real” vs. “fake” (not a distinction I’m endorsing!), then how people judge effect size from the author’s description of the observed effects in the text and from the typical graphical presentations (often bars with error bars) is worth knowing.

      It could be interesting though to consider whether giving probability of superiority and graphs leads to more accurate perceptions than one or the other. I suspect graphs and maybe what the author says directly will have more weight. But this is all speculation without some specific setting in mind. A lot of the challenge here would seem to be that we can suggest new ways for authors to convey results, but we have to expect heterogeneity across authors and across readers in terms of how well they know what to do with different expressions of uncertainty they get. I suspect there’s some positive relationship between the number of different ways you present your results in a paper and the heterogeneity you get in interpretations.
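      To make the comparison with Cohen’s d concrete, here is a quick sketch (my own, not tied to any particular study) of the standard conversion for two independent normals with equal variance, PS = Φ(d/√2):

      ```python
      # Probability of superiority implied by Cohen's d for two independent
      # normals with equal variance.
      import numpy as np
      from scipy.stats import norm

      for d in [0.2, 0.5, 0.8]:        # Cohen's conventional small/medium/large
          ps = norm.cdf(d / np.sqrt(2))
          print(f"d = {d}: probability of superiority = {ps:.2f}")
      # d = 0.2 -> 0.56, d = 0.5 -> 0.64, d = 0.8 -> 0.71: the "common language"
      # version arguably says more directly how often B beats A.
      ```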

      • > Common Language Effect Size

        I find that a curious name for something more like a correlation. In a regression setting, for example, there would be two complementary concepts: size/beta and correlation/R-squared. I think that “strength” would have been more appropriate than “size”. I guess that the practice of identifying “effect size” with “correlation” may be common in some fields though.

      • > that doing so will make the resulting expressions of results more robust to misinterpretation

        I think this is a false assumption. The reason to do Bayesian statistics is that it accurately presents the results of the analysis (unlike P-values, NHST, etc.). That doesn’t imply that people won’t misinterpret the results, depending on what question they try to use the results to answer. The information is there. It is up to the reader to use it to answer the question that they are interested in. Of course, the author can anticipate some questions and make the answers to those questions clearer.

        > some specific setting in mind

        Yes, I think that is necessary.

        • “The reason to do Bayesian statistics is that it accurately presents the results of the analysis (unlike P-values, NHST, etc.).” That is a very strange sentence. I’m not sure in the first place whether “presentation” is an appropriate word for the job of Bayesian (or frequentist) statistics, but if so, surely Bayesian statistics presents the results of a Bayesian analysis just as accurately or inaccurately as the p-value presents the results of its calculation. For sure “accuracy” is not the correct word for this distinction.

        • You can do Bayesian NHST with Bayes factors.

          The problem with NHST is not fixed by swapping a p-value for something else.

          It is fixed by testing your hypothesis instead of a strawman null hypothesis.

  6. > It turns out only a minority (16%) reported using the frequency information they were given directly to estimate effect size, while the rest reported using some form of heuristic such as first watching the animation to estimate the mean of each distribution, then mapping that difference to probability. This was a kind of just-shoot-me moment

    That’s interesting. I’ve dealt with some graphs lately where they’re confusing and kinda hard to pick up all the details of what’s going on but the more I read them and the more I get used to the problems the easier they are to digest.

    There’s definitely way too much information in these plots (interactive Tableaus, where part of the info you get by scrolling over lines and reading values), but I like to think that I’m getting what I want from them.

    In this situation the graphs are things we look at regularly, so the target is different, I think, than what you’re discussing here. Here it sounds like the plots are targeted for sort of one-off consumption of new things. So it’s like one-off visualizations into the intrinsically noisy process of research. For one thing, I don’t really trust the complex graphs I’m talking about either – too much risk of a coding problem, or of me misinterpreting things.

    I guess I’m saying that producing multiple graphs over many days and having time to think about them is more valuable than seeing one graph once, which seems like a really unfair advantage, but it might be practical to do the multiple-days thing, and then the comparison is reasonable in the sense that it’s a choice. Maybe a bigger danger for automation is garbage in, garbage out, and for the discussion here it seems like we’re assuming there is an underlying truth to the data being plotted (it was generated in so-and-so way, and whatnot).

  7. I love that you’re doing usability testing on your presentations.

    Communication is typically a dialogue, with the recipient an active participant. The recipient will put the message into their “cultural” context, which may mean in statistics e.g. that data is normally distributed. “Consider your audience” means in this context that the message should not expend a lot of effort on the distribution, but make it very clear when that expectation is being broken. It also means there’s no abstract “best way” you can teach that is independent of the intended audience.

    If you’re trying to present “what is special about my research” to yourself or your peers, that can be taught: being clear & concise, etc.
    But teaching “what is special about my research” to different audiences has to involve a dialogue of sorts, or it’s a shot in the dark.
