Andy Stein writes:

On one of my projects, I had a plot like the one above of drug concentration vs response, where we divided the patients into 4 groups. I look at the data below and think “wow, these are some wide confidence intervals and random looking data, let’s not spend too much time more with this” but many others do not agree and we spend lots of time on plots like this. Next time, I’ll do what I finally did today and simulate random noise and show how trends pop up. But I was just wondering if you had any advice on how to teach scientists about how easy it is to find trends in randomness? I’m thinking of writing a little shiny app that lets you tune some of the parameters in the R code below to create different versions of this plot. But I know I can’t be the first person to have this challenge and I thought maybe you or others have given some thought on how to teach this concept – that one can easily find trends in randomly generated data if the sample size is small enough and if you create enough plots.

This all makes sense to me. An ideal outcome would be for Stein and others to write an article for a journal such as Technometrics or American Statistician with a title such as “Using Simulations to Convince People of the Importance of Random Variation When Interpreting Statistics.”

Stein replied:

My ideal outcome would be for there to be an article with luminaries in the field in a journal like NEJM or Lancet where everyone will have to pay attention. Unfortunately, I don’t think my colleagues give enough weight to what’s in the statistics and pharmacometrics journals. If they did, then I wouldn’t be in this predicament! Having such a paper out in the world in a high-impact journal would be enough to make me really happy.

My latest idea is to make a little Shiny App where you upload your dataset and it creates 10 sets of plots – one of the real data and nine where the Y variable has been randomly permuted across all the groups. Then people get to see if they can pick out the real data from the mix. If I get a public version implemented in time, I’ll try and send that along, too.

Here’s Stein’s R code:

set.seed(1) n_subjects = 15 n_trials = 4 mean_response = -50 sd_response = 15 mean_conc = 100 sd_conc = 20 rand_values = data.frame(response = rnorm(n = n_subjects*n_trials, mean = mean_response, sd = sd_response), conc = rnorm(n = n_subjects*n_trials, mean = mean_conc, sd = sd_conc), id = rep(1:n_trials,times = n_subjects)) g = ggplot(data = rand_values, aes(x=conc,y=response)) g = g + geom_point() g = g + geom_smooth(method = "lm") g = g + facet_wrap(~id) print(g)

An article in a high-impact medical journal and the Shiny App are a great idea! Reference + demonstration tool.

I have to convince people of this frequently in our research group as well.

Also, I would agree with the author that putting the article in something like Lancet or NEJM would be ideal. I have used the “Scientists rise up against statistical significance” commentary in Nature by Amrhein, Greenland, and McShane as a reference when responding to peer review of manuscripts where I did not use statistical significance and the reviewers asked for it. I think I have done that twice now, and it fortunately worked both times. Although many people have said the same thing before in other journals, it feels like a bigger stick when it is in Nature, or Lancet, or NEJM, as silly as that may be.

I did a little simulation that is similar in spirit a little while ago. Situations with no effect can often “look” like there are effects if we aren’t careful. http://randomanalyses.blogspot.com/2011/10/error-bar-judgment.html

The idea of comparing a plot of the original data to similarly structured random sets is one of the ideas in this paper: Buja, A., Cook, D. Hofmann, H., Lawrence, M. Lee, E.-K., Swayne, D.F and Wickham, H. (2009) Statistical Inference for exploratory data analysis and model diagnostics Phil. Trans. R. Soc. A 2009 367, 4361-4383 doi: 10.1098/rsta.2009.0120

The `vis.test` function in the TeachingDemos package for R implements the idea of presenting multiple graphs and then having the user click on the one that they think is the real data. It does not use shiny and uses base graphics instead of ggplot2 graphics. But it might be a place to start if you want to just show the idea to colleagues or build it into a shiny app.

Dianne Cook presented this at RStudio::conf2018, from ~15 min here: https://resources.rstudio.com/rstudio-conf-2018/to-the-tidyverse-and-beyond-dianne-cook

The nullabor package provides functions for this.

Well, in effect this is a visual demonstration of what a p-value is doing, isn’t it?

I think the authors indeed make an explicit connection with p-values. Also check out this follow-up paper:

http://had.co.nz/stat645/graphical-inference.pdf

On the general idea of using simulations to convince people of the importance of random variation, I also find hypothetical outcome plots to be very educative:

https://medium.com/hci-design-at-uw/hypothetical-outcomes-plots-experiencing-the-uncertain-b9ea60d7c740

They can be generated from hypothetical models (e.g., dance of p-values https://www.youtube.com/watch?v=5OL1RqHrZQ8 and of other statistics https://www.youtube.com/watch?v=UKX9iN0p5_A) or from the data itself (e.g., using resampling https://explorablemultiverse.github.io/examples/dance/ or the posterior predictive distribution).

Sort of… One big problem is that the statistical software usually “decides for you” which p value is relevant. But there’s no real reason to think that its decision is necessarily the right one.

For example if you run a linear regression in R

lm(y ~ x, data=foo)

It will set up a model where y = a + b * x + normal_error(m,s)

and it will calculate the p value for say b under the hypothesis that the model is true and that we would resample normal_error for fixed x.

Suppose I ask you about your real scientific process, and you say “whoever comes into our clinic we take their height and weight and we think weight y is a function of height x”.

so notice that

1) if you run the experiment again, you will get different x values

2) The distribution of x values may well change dramatically (suppose for example suddenly a disease circulates in the infant population and all of a sudden for a week most of the patients are 2ft “high”)

3) The weight distribution is *not* normally distributed around any curve, much less a line. In fact, the quantity is constrained to be positive and more than about 4lbs, positive and more than about 60lbs for adults, and it has a longish tail to the right, probably more like a gamma distribution.

4) Sex is important and there are two classes, so the distribution is actually a mixture model of the two.

…etc

So whatever the p value is that plops out of your stats software, it probably has *next to nothing* to do with a decent description of an *actual data generating process*. So if you do this kind of simulation, even if you don’t include priors, you wind up with a much better idea of variability than if you just look at a p value from a default calculation.

Good stuff. Teaching statistical intuition is an important goal.

Andy might consider teaching a similar intuition, that when there are many observations, a regression with a p-value < .05 often explains almost nothing.

Generate a series of large datasets assuming no correlation between X and Y. Identify datasets with p < .05. Plot the data. Ask the user to "see" the "significant" relationship between X and Y. Often the relationship is invisible to the naked eye.

Then plot the regression line to show the user how tenuous the claimed relation is.

Many times I have looked at a dataset with p < .05 and been unable to see the relationship in the raw data. Often, the data is just a big ball of dots. Often, the "significant relation" looks like it depends on a few data points, usually outliers.

(This is kind of the opposite of what Andy is demonstrating, that with a small number of observations, regression coefficients are highly variable.)

Seems worthwhile to me, but OTOH, maybe this exercise just illustrates what a low R^2 looks like.

Yes, I think your exercise mostly just illustrates a low R^2. Notwithstanding the myriad issues associated with “significance” of a relationship, in a multivariate world, any large data set (other than a time series) is likely to show “just a big ball of dots.” I think seeing what the regression line looks like and its associated p-value is instructive somewhat. It can reveal a potential signal amidst the noise, and it also makes it very clear that there is more going on than just X and Y: this is the invitation to multivariate thinking. And, the worse the scatter, the more someone should be thinking multivariately. I realize that I am succumbing to looking at p values and significance (which I admit often leads to bad practice), but I do think the graphical exploration of bivariate plots and regression lines is a useful step towards developing statistical intuition – as long as it is a starting point and not the end point.

” it also makes it very clear that there is more going on than just X and Y: this is the invitation to multivariate thinking. “

Good point.

More specifically, this is an example of what I call “a teachable moment” — and one that can be planned by the teacher.

One key lesson is: always just plot the data first without a regression line. If there were no lines present in those graphs, most people would easily conclude there was not a strong or meaningful relationship.

Regarding plotting the data first without regression lines – I agree!

I think in this case that might be part of the lesson, though, for this tool. Sometimes people really want to see lines (even and especially in their mind’s eye). I have had to push back on putting lines on scatter plots for this very reason. Once the line is on there, that really is all people seem to look at. Generating a bunch of noise and putting lines on it shows that the lines they draw via lm or in their mind may be meaningless.

Hmmm – I recall doing this when I was a graduate student, something like using simulation to clarify issues of unconditional versus conditional inference (1985?). Andy has often used simulation in his blog posts and I have commented here probably over a 100 times about it being useful in make things clear. A 2010 abstract of mine was entitled Perceptive simulation (smart Monte Carlo) as the basis for understanding, implementing and evaluating Bayesian analysis in a XXXX setting. Blah, blah blah.

There are some challenges.

First to many, statistical simulation is just as much a black box as measure theory. Second, simulations don’t for most part gain the same authority as the Rhubarb Chair of Biostatistic’s recommendation/publication. Third, without some real grasp of what statistics can be and should do to enable better grasp of observations, how do you make sense of a reported simulation?

I am working on using digits of Pi to overcome the first, a reminder of the take no one’s word for it in science, arguing simulation can displace the need to take the Rhubarb Chair of Biostatistic’s word for it. For the third, trying to first share a sense of scientific inquiry and how that requires thinking in abstract representations (or models) and then a sense of the achievements in statistical inquiry in terms of repeated use of abstract representations (or possible fake worlds). I suspect it will fail for those who actually can’t or won’t do a number of simulations on their own.

But science require perseverance and happens faster as more join in – good luck.

Such a paper would certainly be useful. Here is a similar type of paper published in a high impact journal that could serve as an inspiration.

Menge, D.N.L., MacPherson, A.C., Bytnerowicz, T.A. et al. 2018. Logarithmic scales in ecological data presentation may cause misinterpretation. Nat Ecol Evol 2, 1393–1402

https://doi.org/10.1038/s41559-018-0610-7

The use of = rather than <- in that R code is triggering me, but I'd like to add some suggestions:

What if you simulated something known to have a lot of random variation (systolic blood pressure measurements, etc)?

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0005673

This paper by Morris, White, & Crowther on simulations may be useful, although it's on simulations for evaluating statistical methods, but there are still a lot of good tidbits in there

https://doi.org/10.1002/sim.8086

Data journalism beats them all:

https://twitter.com/studersc/status/1229961869898604544

https://www.economist.com/graphic-detail/2020/02/18/diseases-like-covid-19-are-deadlier-in-non-democracies

What is going on with all the paywalls on MSM sites recently?

Also, all those points should have uncertainty bars. Same with the ones in the OP chart.

I don’t think telling ‘scientists’ that they need to go back and retake stats 101 because they didn’t understand it the first time, will have any effect.

Nobody else cares, look at that Economist graph, it’s genius level galaxy brain stuff. Their ‘brilliant data team’ found a ‘clear trend’. It doesn’t matter how many people respond telling them how stupid this is, it’s ‘data’ and ‘research’ and will become received wisdom. You want to tell these people they have to work for a living?

This reminds me of a keynote presentation given by Dianne Cook at Rstudio::conf2018. In her talk, she presented methods for visual inference by producing “null-plots” and hide the real data in between such plots. The R-package nullabor provides the necessary functions.

Her talk (topic discussed from ~15 min): https://resources.rstudio.com/rstudio-conf-2018/to-the-tidyverse-and-beyond-dianne-cook

Nullabor package vignette: https://cran.r-project.org/web/packages/nullabor/vignettes/nullabor.html

Just my usual warning to deploy symmetry considerations when confronted by correct but directionally incomplete statements like “one can easily find trends in randomly generated data if the sample size is small enough and if you create enough plots”: It’s essential to add that one can easily miss important trends in randomly generated data if the sample size is small enough and if you create enough plots. That situation should be simulated too.

Failure to add the opposing caution is a common example of the nullistic bias that pervades use and defense of NHST and its analogs including Bayesian ones (e.g., see “The need for cognitive science in methodology,” Am J Epidemiol 186:639-645, open access at https://doi.org/10.1093/aje/kwx259). More generally, we risk creating and absorbing serious cognitive bias when making and reading true but unbalanced observations about the space of possibilities generating our observations.

Valentin Amrhein rightly asked “how can one “miss important trends in randomly generated data” if by definition any trend would be random?”

My bad, I should have said something like:

“one can easily miss important trends in data randomly generated from populations following such trends if the sample size is small enough and if you create enough plots.”

+1

Bland and Altman wrote their paper on measurement of agreement for Statistics in The Statistician (https://doi.org/10.2307/2987937), but realized medical doctors didn’t read it, so their write another one, for The Lancet (https://doi.org/10.1016/S0140-6736(86)90837-8).

I also drafted a small app that permutes the data and allows to simulate the plots https://permutationplotter.herokuapp.com/.

While I agree wholeheartedly about the importance of random variation, the data do not look all that random to me. Suppose we draw the Chebyshev lines that minimize the maximum error. As it turns out, in all four plots we draw a line through two points at the bottom of the data and a parallel line through one point at the top of the data such that all that data points lie between the two lines. The trend line is the line halfway between those two lines. The lines through two points at the top and one point at the bottom that bound the data are further apart, so that the maximum error is greater.

The slopes of the Chebyshev lines are not so different from those of the plotted lines. Furthermore, if we eliminate one of the data points used to draw the Chebyshev lines, and draw new lines, we get similar lines, except if we eliminate the point around a response of 80 in group 4. I am not ready to call the data random, at least for the first three groups.

Chebyshev lines are usually easy to eyeball before doing any calculations, and are superior as a rough and ready technique to drawing a line between the first and last data points. You may laugh, but I have seen that done in a newspaper article by a renowned economist. Also, I think that minimizing the maximum error may be of practical importance, for instance, in responses to drugs.