Can Visualization Alleviate Dichotomous Thinking? Some experimental evidence.

Jouni Helske, Satu Helske, Matthew Cooper, Anders Ynnerman, and Lonni Besançon write:

Can Visualization Alleviate Dichotomous Thinking? Effects of Visual Representations on the Cliff Effect

Common reporting styles for statistical results in scientific articles, such as p-values and confidence intervals have been reported to be prone to dichotomous interpretations, especially with respect to the null hypothesis significance testing framework. . . . This type of reasoning has been shown to be potentially harmful to science. Techniques relying on the visual estimation of the strength of evidence have been recommended to reduce such dichotomous interpretations but their effectiveness has also been challenged. We ran two experiments on researchers with expertise in statistical analysis to compare several alternative representations of confidence intervals and used Bayesian multilevel models to estimate the effects of the representation styles on differences in researchers’ subjective confidence in the results. We also asked the respondents’ opinions and preferences in representation styles. Our results suggest that adding visual information to classic CI representation can decrease the tendency towards dichotomous interpretations—measured as the `cliff effect’: the sudden drop in confidence around p-value 0.05—compared with classic CI visualization and textual representation of the CI with p-values. All data and analyses are publicly available at https://github.com/helske/statvis.

This sounds cool. I’ll let co-blogger Jessica judge the relevance and quality of the research, as this is in her area of expertise.

23 thoughts on “Can Visualization Alleviate Dichotomous Thinking? Some experimental evidence.”

  1. Hi Andrew and everyone and thanks for sharing our paper and findings.

    The paper idea stems from the many times I have seen claims that representing statistical results as plotted confidence intervals (CIs) would help avoid dichotomization. We also had some evidence of this in our field in a paper with Pierre Dragicevic available here: https://inria.hal.science/hal-01980268v3/document (Jessica provided a short review of the paper appended at the end of our paper), but many of the papers used to back up these claims were quite outdated and very difficult to reproduce/replicate. We therefore set out to verify the claim and test other visual representations of CIs as well. We made sure to test the visualizations with researchers (and not students, as is often done in the literature) since we were interested in reducing dichotomous interpretations in published papers.

    The novel visual representations that we proposed seem to help reduce dichotomous interpretations (as measured, by proxy, through a reduced cliff effect — the drop of confidence around p=0.05). Surprisingly, we found that classical CI representations did not seem to reduce the cliff effect/dichotomous interpretations.

    We’re all more than happy to discuss this further with people here.

    – Our CRAN Package is available here: https://cran.r-project.org/web/packages/ggstudent/index.html
    – Matthew Kay’s CRAN package also provides support for these visualizations: https://cran.r-project.org/web/packages/ggdist/index.html

    • Just curious, how do you reconcile the conclusion at the end of the abstract of the dichotomization paper with the goal of trying to reduce dichotomous interpretations in published papers? “We wanted to see whether they have had any influence on CHI.
      Our analysis of CHI proceedings from the past nine years suggests that they have not.”

      • Not sure I got your question right Jessica, so feel free to correct me if I did not.

        The dichotomisation paper aimed at checking whether a decade of recommendations and guidelines, plus the replication crisis, had an impact on the dichotomisation of evidence in the HCI community. We found that they did not seem to have an impact. It therefore seems that recommendations are not enough, and this is how the idea came about of a visual representation that would reduce the cliff effect and therefore reduce “dichotomous inferences”.

        • I’m just commenting on how concluding ‘it had no impact’ seems like dichotomous reasoning based on taking seriously the point null hypothesis of 0 effect. I think this gets at how hard it can be to avoid dichotomous reasoning, at least in the high level summaries of research. We have such an in-built tendency to talk about things as if black and white.

        • @Jessica for some reason I cannot answer your comment directly (the reply button does not show).

          I agree with your statement there, we have a tendency to be binary about almost everything. That being said, I think that we have tried to hedge our conclusion in this paper. We first say “Our analysis”, which brings the attention back to the evidence we have obtained alone and not a general conclusion IMO, and we then use the hedge “suggests” to tone down the statement.

          We have actually submitted the following paper to CHI twice to try and explain exactly this: how difficult it can be to reason about evidence. After two rejections from CHI I left it as a preprint (for now at least): https://hal.inria.fr/hal-03342756/file/Besancon__Definitely_Maybe__preprint%20%281%29.pdf

          From this preprint:
          “We see several potentially (hedge) important (booster) outcomes and implications from this work. First, adding the terms ‘hedging’ and ‘boosting’ to HCI researchers’ lexicon is in itself a form of contribution. The Sapir-Whorf hypothesis, or Whorfianism [21], proposes that our language influences the ways in which we think about things, and certainly having succinct terms to capture these concepts improves our ability to communicate around them. If authors and reviewers are sensibly alerted to the use of hedges and boosters, results dissemination in our field could improve.”

          and


          “While we abstain from giving firm guidelines, we relay previous recommendations from linguists and statisticians. First, statistical significance relies on a somewhat arbitrary and malleable α level, thus we recommend avoiding the use of “statistically significant” when possible. Instead authors should simply report the associated data, including a measure of central tendency (e.g., mean), and confidence intervals or outcomes of statistical tests (e.g., T- or F- statistic, degrees of freedom, and p value) [5, 6, 29, 45, 46].
          Second, embrace uncertainty when communicating results [109]. Be wary of using statistics to convey certainty, and instead contemplate and communicate the importance of factors influencing uncertainty, including sample size, noise in the data, effect size, or whether the sample is representative. Consequently, scientists should adapt their use of modal terms in order to reflect as closely as possible the strength of evidence that their experimental setup and results provide, and the uncertainty that they should convey [62].
          Third, authors should also be careful to hedge and boost appropriately when citing prior work in order to accurately represent the original work and avoid misrepresenting the (un)certainty around prior findings (see e.g., [65, 69]).”

          TL;DR:
          It’s really a complicated matter and I hope to be able to bring more conversations about the “language used” to the field. However, our community is very biased towards papers providing clear guidelines, and our submission has been rejected twice because we did not provide any. I am in complete agreement with you that it is a complicated matter and I am definitely guilty too of dichotomising sometimes.

        • I guess I would have expected the concluding statement to be something like “We find little evidence of an influence on CHI proceedings” to avoid the ‘effects are either present or absent’ mindset.

          This paper on hedging also sounds familiar! But I don’t think I ever reviewed it, probably just came across it in my scholar feed. It’s important work. I kind of think that if researchers in fields like HCI were to truly embrace the uncertainty in their work, it would mean no more controlled studies in HCI.

          I have grappled with the ‘give us recommendations’ mindset in several recent papers I’ve submitted – one to VIS, another to AIES last year, both pointing to different kinds of overclaiming in research areas. It’s very frustrating. I usually bring up Devezer et al. https://royalsocietypublishing.org/doi/10.1098/rsos.200805 but rarely does it help.

        • “I guess I would have expected the concluding statement to be something like ‘We find little evidence of an influence on CHI proceedings’ to avoid the ‘effects are either present or absent’ mindset.”

          This is quite likely a better phrasing and I love when reviewers point out problematic binary thinking. Unfortunately, it rarely happens.

          Happy to read that I’m not the only one struggling although it is quite sad that these struggles are here.

          Thanks for the recommendation about Devezer et al., will definitely use this, although I suspect it is unlikely to help in all cases, as you yourself point out.

  2. During the experiment we displayed each trial to each participant (one at a time), and asked the following question: “A random sample of 200 adults from Sweden were prescribed a new medication for one week. Based on the information on the screen, how confident are you that the medication has a positive effect on body weight (increase in body weight)?”.

    Let’s denote this “confidence” (different from CI “confidence”) with C. Then from fig 3, we see:

    p = 0.001 -> C ~ 0.90
    p = 0.04 -> C ~ 0.65
    p = 0.05 -> C ~ 0.55
    p = 0.06 -> C ~ 0.45
    p = 0.80 -> C ~ 0.05

    Interesting that people think p-values around 0.05 indicate slightly better than even odds there really is an effect in the observed direction.

    Since there is always some effect, it is 50/50 to be positive/negative if we know nothing at all.* So significance per se actually has little perceived informational content. Much higher or lower p-values are perceived to tell us something.

    That is the collective picture; it’d be good to look at the individual curves, since that is the level where cognition is actually occurring. E.g., based on the medical replication projects we know only ~20% of replications are significant in the same direction. And non-significance means the data are too noisy to say much. So I’d put something like:

    p = 0.001 -> C ~ 0.7
    p = 0.04 -> C ~ 0.3
    p = 0.05 -> C ~ 0.3
    p = 0.06 -> C ~ 0.5
    p = 0.80 -> C ~ 0.5

    *Actually you can check that by eliciting a value before showing any results. Perhaps people think a study on 200 adults from Sweden would only be funded if it was somewhat promising.
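
As a rough illustration of the quantities being compared in this comment, the sketch below tabulates the two sets of confidence values listed above and computes the drop across p = 0.05. The values are the approximate numbers quoted in the comment, not data from the study, and defining the drop as C(0.04) minus C(0.06) is an assumption made here for illustration; the paper's own measure of the cliff effect may differ. The helper name is hypothetical.

```python
# Approximate confidence values quoted in the comment (not study data).
# Keys are p-values, values are subjective confidence C on a 0-1 scale.
read_from_fig3 = {0.001: 0.90, 0.04: 0.65, 0.05: 0.55, 0.06: 0.45, 0.80: 0.05}
commenter_proposal = {0.001: 0.7, 0.04: 0.3, 0.05: 0.3, 0.06: 0.5, 0.80: 0.5}


def drop_across_threshold(profile, below=0.04, above=0.06):
    """Confidence just below the threshold minus confidence just above it.

    Hypothetical helper: the (0.04, 0.06) pair is an illustrative choice,
    not the definition used in the paper.
    """
    return profile[below] - profile[above]


for name, profile in [("read from fig 3", read_from_fig3),
                      ("commenter's proposal", commenter_proposal)]:
    drop = drop_across_threshold(profile)
    print(f"{name}: drop across p = 0.05 is {drop:+.2f}")
```

A positive value means confidence falls as p crosses 0.05; a negative value means it rises, as in the commenter's proposed numbers.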

    • That p = 0.8 -> C ~ 0.05 result really makes no sense. A high p-value means you are uncertain about the direction.

      I can see people not accounting for p-hacking, publication bias, etc that happens just under p = 0.05, but don’t understand what thought process is going on for high p-values. Whatever it is, it is apparently a widespread misconception.

      • My guess is that the research subjects implicitly added the qualifier “meaningful” to the question of whether or not there was a positive change.

        • I considered that, but they were shown intervals that went up to 0.5 kg for p = 0.8 vs 1 kg for p = 0.001.

          So one kg is perceived as meaningful while half a kg is not? Really neither are meaningful amounts of weight gain.

      • “don’t understand what thought process is going on for high p-values. Whatever it is, it is apparently a widespread misconception.”

        I am afraid that we cannot provide more details here on what participants had in mind when inputting their values. This is, however, very interesting in itself I’d argue.

        Aren’t there *many* widespread misconceptions about statistics though? I’m sure I’ve fallen prey to some of them (e.g., over-simplifications)

        • Yes, I find that curve more interesting than the different visualizations. It gives me a headache trying to put myself in the shoes of someone answering the question that way, but it is very important we figure out what thought process is being used. It has led to industrialized production of misinformation.

          Changing the details would also be interesting. E.g., also ask them before showing any interval, and sometimes make it China instead of Sweden. Instead of medication and weight, do vaccine or vitamin vs infections. I bet those differences will be bigger than for various visualizations.
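
One simple way to make the point about high p-values in this thread concrete, offered as a sketch and not as anyone's actual reasoning here: under a flat prior and a normal approximation (both assumptions), a two-sided p-value p with a positive point estimate corresponds to a posterior probability of 1 - p/2 that the effect is positive. The function name below is hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def prob_positive_flat_prior(p_two_sided):
    """Posterior P(effect > 0) for a positive estimate with two-sided p-value.

    Assumes a flat prior and a normal sampling distribution, so the
    posterior is normal, centred on the estimate, with sd equal to the
    standard error. Valid for 0 < p_two_sided <= 1.
    """
    # z-score of the estimate implied by the two-sided p-value
    z = STANDARD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)
    # posterior mass above zero equals Phi(z), i.e. 1 - p/2
    return STANDARD_NORMAL.cdf(z)


for p in (0.001, 0.04, 0.05, 0.06, 0.80):
    print(f"p = {p:<5} -> P(effect > 0) = {prob_positive_flat_prior(p):.4f}")
```

Under these assumptions p = 0.8 maps to a probability of about 0.6 that the effect is positive, close to the 50/50 baseline and nowhere near 0.05. The calculation ignores p-hacking, publication bias, and any informative prior, all of which would change the numbers.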

  3. This paper sounds familiar, I think I reviewed it in the past. Seems rigorously done, though I’ve never been a fan of using subjective confidence as a measure, even if just for relative comparisons, because it’s not clear what it means.

    • I agree with you that subjective measures are not ideal (although when used in relative comparisons this is less of a problem), but we could not think of another proxy/measure than this one which is also a standard when it comes to the cliff effect.

      • The task is about drawing an inference from some sample dataset, so it would be pretty easy to cast it as a decision problem – specify a prior distribution on the treatment effect, show the participants the sample, ask them to make a decision (e.g., choose whether to take a gamble on the effect being positive). This way there is at least a well-defined standard for their judgment, even if people don’t achieve it. You can then also get further info from your experiment by calibrating the behavioral responses, described here: https://arxiv.org/pdf/2304.03432.pdf

        • I would argue that casting it as a decision problem would provide a perhaps better-defined proxy but also, in the case you present (whether they take a gamble) some form of dichotomisation which would not allow us to then see all the nuances and interesting results that one can find from our current results.

          Now I imagine that we could design it so that participants can input some value they’d bet on the effect being positive. But, at least in my mind, this would still be subjective in the end, wouldn’t it?

          Thanks for pointers to the paper. I’ll give it a thorough read. So obviously disclaimer I haven’t read it yet :).

        • No, that is not correct. Check out the link I sent – even if you are eliciting what a vis researcher would call a “judgment” it can still be a decision problem in the decision theoretic sense.
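
The decision-problem framing suggested in this thread can be sketched as follows, assuming a normal prior on the treatment effect, a normal sampling model for the estimate, and a symmetric gamble. All numbers and function names are hypothetical and are not taken from the paper or the linked preprint; the point is only that a stated prior plus a payoff rule gives a well-defined benchmark against which responses could be compared.

```python
from statistics import NormalDist


def posterior_normal(prior_mean, prior_sd, estimate, std_error):
    """Conjugate normal-normal update; returns (posterior mean, posterior sd)."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, post_var ** 0.5


def prob_effect_positive(post_mean, post_sd):
    """Posterior probability that the treatment effect is above zero."""
    return 1.0 - NormalDist(post_mean, post_sd).cdf(0.0)


def take_gamble(p_positive, win=1.0, loss=1.0):
    """Accept the bet on a positive effect when its expected payoff is positive."""
    return p_positive * win - (1.0 - p_positive) * loss > 0


# Hypothetical trial: prior N(0, 1) on the effect in kg,
# observed estimate 0.5 kg with standard error 0.25 kg.
post_mean, post_sd = posterior_normal(0.0, 1.0, 0.5, 0.25)
p_pos = prob_effect_positive(post_mean, post_sd)
print(f"posterior mean {post_mean:.3f}, sd {post_sd:.3f}, P(effect > 0) = {p_pos:.3f}")
print("take the gamble:", take_gamble(p_pos))
```

Asking participants for a bet size or a stated probability instead of a yes/no choice would keep the response graded while still leaving a normative answer to compare against.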

  4. You are almost there: the next jump is to communicate findings verbally.

    This would involve alternative representations (verbal descriptions) with meaning equivalence and alternatives with surface similarity. These lists are distinguished by a boundary of meaning (BOM). These lists can be tested with S-type errors. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070 and https://link.springer.com/article/10.1007/s11192-021-03914-1.

    See also https://arxiv.org/abs/2301.01653v1

    We now apply this with interesting preliminary results to zero shot transfer learning.
