They did a graphical permutation test to see if students could reliably distinguish the observed data from permuted replications. Now the question is, how do we interpret the results?

1. Background: Comparing a graph of data to hypothetical replications under permutation

Last year we had a post, “I’m skeptical of that claim that ‘Cash Aid to Poor Mothers Increases Brain Activity in Babies,’” discussing a recently published paper reporting “estimates of the causal impact of a poverty reduction intervention on brain activity in the first year of life.”

Here was the key figure in the published article:

As I wrote at the time, the preregistered plan was to look at both absolute and relative measures of alpha, gamma, and theta (beta was only included later; it was not in the preregistration). All the differences go in the right direction; on the other hand, when you look at the six preregistered comparisons, the best p-value was 0.04 . . . after adjustment it becomes 0.12 . . . Anyway, my point here is not to say that there’s no finding just because there’s no statistical significance; it’s just that there’s a lot of uncertainty. The above image looks convincing, but part of that is coming from the fact that the responses at neighboring frequencies are highly correlated.

To get a sense of uncertainty and variation, I re-did the above graph, randomly permuting the treatment assignments for the 435 babies in the study. Here are 9 random instances:
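For anyone who wants to play along at home, here is a rough sketch of the kind of R code that produces these permutation replications. This is not the code behind the figures above; the data frame eeg and its columns id, treatment, freq, and power are placeholders standing in for the study data.

library(dplyr)
library(ggplot2)

# `eeg`: hypothetical data frame with one row per baby per frequency bin,
# with columns id, treatment ("high cash" / "low cash"), freq (Hz), and power.

one_replicate <- function(dat, label) {
  babies <- distinct(dat, id, treatment)         # one row per baby
  babies$treatment <- sample(babies$treatment)   # permute treatment at the baby level
  dat %>%
    select(-treatment) %>%
    left_join(babies, by = "id") %>%
    group_by(treatment, freq) %>%
    summarise(mean_power = mean(power), .groups = "drop") %>%
    mutate(replicate = label)
}

set.seed(435)
reps <- bind_rows(lapply(1:9, function(i) one_replicate(eeg, paste("Permutation", i))))

ggplot(reps, aes(freq, mean_power, linetype = treatment)) +
  geom_line() +
  facet_wrap(~ replicate, ncol = 3) +
  labs(x = "Frequency (Hz)", y = "Mean EEG power")

Permuting at the baby level, rather than shuffling individual rows, keeps each baby’s power spectrum intact, which is the comparison the permutation null is supposed to represent.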

2. Planning an experiment

Greg Duncan, one of the authors of the article in question, followed up:

We almost asked students in our classes to guess which of ~15 EEG patterns best conformed to our general hypothesis of negative impacts for lower frequency bands and positive impacts for higher-frequency bands. One of the graphs would be the real one and the others would be generated randomly in the same manner as in your blog post about our article. I had suggested that we wait until we could generate age and baseline-covariate-adjusted versions of those graphs . . . I am still very interested in this novel way of “testing” data fit with hypotheses — even with the unadjusted data — so if you can send some version of the ~15 graphs then I will go ahead with trying it out on students here at UCI.

I sent Duncan some R code and some graphs, and he replied that he’d try it out. But first he wrote:

Suppose we generate 14 random + 1 actual graphs; recruit, say, 200 undergraduates and graduate students; describe the hypothesis (“less low-frequency power and more high-frequency power in the treatment group relative to the control group”); and ask them to identify their top and second choices for the graphs that appear to conform most closely with the hypothesis. I would also have them write a few sentences justifying their responses in order to coax them to take the exercise seriously.

The question: how would you judge whether the responses convincingly favored the actual data? More than x% first-place votes; more than y% first or second place votes? Most votes? It would be good to pre-specify some criteria like that.

I replied that I’m not sure the results would be definitive, but I guess it would be interesting to see what happens.

Duncan responded:

I agree that the results are merely useful but not definitive.

I agree, and Drew Bailey, who was also involved in the discussion, added:

The earlier blog post used these graphs to show that the data, if manipulated with randomly-generated treatment dummies, produced an uncomfortable number of false positives. This new exercise would inform that intuition, even if we want to rely on formal statistics for the most systematic assessment of how confident we should be with the results.

3. Experimental conditions

Duncan was then ready to go. He wrote:

I am finally ready to test randomly generated graphs out on a large classroom of undergraduate students.

Paul Yoo used Stata to generate 15 random graphs plus the real one (see attached). The position of the PNAS graph among the 16 (10th) was determined by a random number draw. (We could randomize its position, but that increases the scoring task considerably.) Below the graphs we put an edited version of the hypothesis that was preregistered and spelled out in our original NICHD R01 proposal. My plan is to ask class members to select their first and second choices for the graph that conforms most closely to the hypothesis.

Bailey responded:

Yes, with the same caveat as before (namely, that the paths have already forked: we aren’t looking at a plot of frequency distributions for one of the many other preregistered outcomes in part because these impacts didn’t wind up on Andrew’s blog).
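Here, for what it’s worth, is a rough sketch of how such a 16-panel lineup could be assembled in R with the nullabor package (the actual graphs were made in Stata). As before, eeg and its columns are placeholders, and permute_babies is a hand-rolled null generator that permutes treatment labels at the baby level.

library(dplyr)
library(ggplot2)
library(nullabor)

# Null generator: permute treatment labels at the baby level.
permute_babies <- function(dat) {
  babies <- distinct(dat, id, treatment)
  babies$treatment <- sample(babies$treatment)
  left_join(select(dat, -treatment), babies, by = "id")
}

set.seed(10)
# Hide the real data in one of 16 panels; lineup() prints a decrypt() call
# that reveals which panel holds the real data.
d <- lineup(permute_babies, true = eeg, n = 16)

ggplot(d, aes(freq, power, linetype = treatment)) +
  stat_summary(fun = mean, geom = "line") +
  facet_wrap(~ .sample, ncol = 4) +
  labs(x = "Frequency (Hz)", y = "Mean EEG power")

In the classroom version described below, everyone saw the same image with the real plot fixed at position 10; with code like this it would be straightforward to draw a fresh lineup, with a fresh random position, for each participant, a design one commenter advocates in the discussion below.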

4. Results

Duncan reported:

97 students examined the 16 graphs shown in the 4th slide of the attached PowerPoint file. The earlier slides set up the exercise and the hypothesis.

Almost two-thirds chose the right figure (#10) on their first guess, and 78% did so on their first or second guesses. Most of the other guesses were for figures that show more treatment-group power in the beta and gamma ranges but not in alpha.

5. Discussion

I’m not quite sure what to make of this. It’s interesting and I think useful to run such experiments to help stimulate our thinking.

This is all related to the 2009 paper, “Statistical inference for exploratory data analysis and model diagnostics,” by Andreas Buja, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah Swayne, and Hadley Wickham.

As with hypothesis tests in general, I think the value of this sort of test comes when it does not reject the null hypothesis: non-rejection is a sort of negative signal, telling us that we don’t have enough data to learn more on the topic.

The thing is, I’m not clear what to make of the result that almost two-thirds chose the right figure (#10) on their first guess and 78% did so on their first or second guesses. On one hand, this is a lot better than the 1/16 and 1/8 we would expect by pure chance. On the other hand, some of the alternatives were similar to the real data . . . this is all getting me confused! I wonder what Buja, Cook, and their collaborators would say about this example.

6. Expert comments

Dianne Cook responded in detail in the comments. Her response is directly relevant to our discussion, so I’m copying her comment here:

The interpretation depends on the construction of the null sets. Here you have randomised the group. There is no control of the temporal dependence or any temporal trend, so where the lines cross, or how volatile the lines are, is possibly distracting.

You have also asked a very specific one-sided question – it took me some time to digest what your question is asking. Effectively it is: in which plot is the solid line much higher than the dashed line in only three of the zones? When you are randomising groups, the group labels have no relevance, so it would be a good idea to set the higher-valued one to be the solid line in all null sets. Otherwise, some plots would be automatically irrelevant. People don’t need to know the context of a problem to be an observer for you, and it is almost always better if the context is removed. Asking a different question, e.g. in which plot are the lines getting further apart at higher Hz, or in which plot are the two lines the most different, would likely yield different responses. The question you ask matters. We typically try to keep it generic: “which plot is different?” or “which plot shows the most difference between groups?” Being too specific can create the same problem as creating the hypothesis post hoc after you have seen the data, e.g. you spot clusters and then do a MANOVA test. You preregistered your hypothesis, so this shouldn’t be a problem. Thus your null hypothesis is “There is NO difference in the high-frequency power between the two groups.”

When you see as much variability in the null sets as you have here, it would be recommended to make more null sets. With more variability, you need more comparisons. Unlike a conventional test where we see the full curve of the sampling distribution and can check if the observed test statistic has a value in the tails, with randomisation tests we have a finite number of draws from the sampling distribution on which to make a comparison. Numerically we could generate tons of draws but for visual testing, it’s not feasible to look at too many. However, you still might need more than your current 15 nulls to be able to gauge the extent of the variability.

For your results, it looks like 64 of the 97 students picked plot 10 as their first pick. Assuming that this was done independently and that they weren’t having side conversations in the room, you could use nullabor to calculate the p-value:

> library(nullabor)
> pvisual(64, 97, 16)
      x simulated binom
[1,] 64         0     0

which means that the probability that this many people would pick plot 10, if it really were a null sample, is 0. Thus we would reject the null hypothesis and, with strong evidence, conclude that there is more high-frequency power in the high-cash group. You can include the second votes by weighting the p-value calculation by two picks out of 16 instead of one, but the p-value is still going to be 0.

To understand whether observers are choosing the data plot for reasons related to the hypothesis, you have to ask them why they made their choice. Again, the answers should be very specific here because you’ve asked a very specific question, things like “the lines are consistently further apart on the right side of the plot.” For people who chose null plots instead of 10, it would be interesting to know what they were looking at. In this set of nulls, there are so many other types of differences! Plot 3 has differences everywhere. We know there are no actual group differences there, so an observed difference this big is consistent with there being no true difference. It is ruled out as a contender only because the question asks whether there is a difference in three of the four zones. We see crossings of lines in many plots, so this is something we are very likely to see if the null is true. The big scissor pattern in 8 is interesting, but we know it has arisen by chance.

Well, this has taken some time to write. Congratulations on an interesting experiment and an interesting post. Care needs to be taken in designing the data plots, constructing the null-generating mechanisms, and wording the questions appropriately when you apply the lineup protocol in practice.

This particular work was born of curiosity about a published data plot. It reminds me of our work in Roy Chowdhury et al (2015) (https://link.springer.com/article/10.1007/s00180-014-0534-x), which was inspired by a plot in a published paper where the authors reported clustering. Our lineup study showed that this was an incorrect conclusion: the apparent clustering was due to the high dimensionality. I think your conclusion now would be that the published plot does show the high-frequency difference reported.

She also lists a bunch of relevant references at the end of her comment; they’re reproduced in the comment thread below.
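If you want to check the arithmetic behind that pvisual() call: as I understand it, the “binom” column is a binomial tail probability. A back-of-the-envelope version, assuming each of the 97 students picks one of the 16 panels independently and uniformly at random when there is nothing to see:

# P(64 or more of 97 students land on one pre-specified panel of 16 by chance)
pbinom(63, size = 97, prob = 1/16, lower.tail = FALSE)
# roughly 1e-52, i.e. 0 at the precision pvisual() prints

# A crude version of Cook's suggestion for including second choices: count a
# "hit" when the real panel is among a student's two picks (78% of 97 is about 76).
pbinom(75, size = 97, prob = 2/16, lower.tail = FALSE)
# also vanishingly small

Either way, the point of Cook’s comment stands: under the null that the 16 panels are exchangeable, this many correct identifications would essentially never happen by chance.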

21 thoughts on “They did a graphical permutation test to see if students could reliably distinguish the observed data from permuted replications. Now the question is, how do we interpret the results?”

  1. I find this confusing. Why is it interesting to see if students can recognize the pattern most consistent with the hypothesis? That really tests their ability to reason about why a pattern would match a given hypothesis, but it doesn’t tell us anything about their ability to distinguish between the real data and a random permutation of the real data. I would be more interested in their ability to distinguish between signal and noise, not their ability to match a pattern to a hypothesis (though the latter might have some value in terms of educational outcomes). Even the question of whether students can distinguish between signal and noise (which is more interesting to me) doesn’t seem to really get at the heart of the matter. I haven’t read the original publication or post, but from your description it sounds like the original issue was how convincing the evidence was. It seems as though this experiment asks a different question. As I said, I am confused.

    • This is not about student ability, I think. Rather it is about the data, and whether they are significantly more in line with the research hypothesis than the randomly permuted data, meaning that they show a pattern that the students (on average) can tell apart from the randomly permuted data. (You may understand this anyway, but from reading your comment I’m not sure.)

      • That’s what I don’t understand. Seeing whether students can tell them apart is a different question than whether the evidence suggests a relationship between cash aid and brain activity. It sounds like the paper is focused on the latter question, but the experiment that was run involves the former. Isn’t it possible that the students could pick out a relationship that matches the hypothesis, but that the evidence for the hypothesis is not convincing? Or that there is convincing evidence but the students cannot pick it out? It seems to me that the question of whether students can recognize a pattern matching the hypothesis is a very different question (albeit of potential interest – along the lines of what types of information displays might be most effective, as Jessica suggests below) than the original discussion about the paper seemed to involve.

        • See Dianne Cook’s posting! The fact that the students can tell them apart runs counter to the permutation null hypothesis, regardless of any special “ability” of the students. The only thing that can be learnt from this about the data is that data generated under the permutation null hypothesis look systematically different from the real data. Arguably one can add that the difference is in the direction of the research hypothesis, as the students were told in advance that this is what they should look for. This is, however, somewhat problematic; see Di Cook’s posting. You may argue that this kind of “rejection” of the permutation null hypothesis isn’t particularly strong, which of course leads into the standard discussion about how much it means to reject a point null in a statistical test, but as long as you buy into that logic, you do learn something potentially interesting about the data in this way, or more precisely, about the relation between the data and the permutation null hypothesis.

  2. I think these tests are interesting, but one problem with trying to interpret the proportion who picked the right graph on their first or second guess is that it will depend on how well designed the graph is. Even something as simple as adding color back to these plots might affect how easily people can do the task. I believe some of the original lineup work plays up the analogy to statistical tests without acknowledging this and other confounding factors, but later work by Hofmann et al. models the proportion of people who can choose the real data from the lineup as a measure of graphical power.

  3. The group difference is about 1/10 of the inter-individual variation shown in the prior post. And what if we measure in the morning vs. at night, or before vs. after eating? I suspect intra-individual variation would be of the same order of magnitude. The numbers were technically higher in the one group, but I doubt that translates into anything meaningful.

    This can’t really be the best way to assess the benefits of giving low income mothers extra money. There has to be a more obvious outcome to look at. If we have to try this hard to see one, it seems like something is wrong.

    However, I also noticed that $2,500/yr is called a “small departure in baseline balance” while $3,756/yr is “high-cash” and “20% boost”. An extra $10/day isn’t a huge amount, except in key circumstances like needing a bit extra to afford the bus to a job interview.

  4. This is a fun and perhaps sometimes useful idea, but I think the actual “fixed” experiment they ran (always the same 15 graphs, with the real one always in the same position) is much worse in some ways than a version where you draw 15 graphs at random for each participant and randomize the positions. With this “fixed” experiment, adding participants quickly provides only minimal additional information.

    For some perhaps relevant theory on the design of such experiments, check out Chierichetti, Kumar & Tompkins https://www.pnas.org/doi/10.1073/pnas.2202116119.

  5. I am not sure I am following the nuances of the authors’ work here, but I just wanted to make the observation that there are randomization tests available for this kind of analysis pipeline. The Matlab function ‘clusterrandanalysis’ in Fieldtrip was (or maybe still is) available to do permutation testing (over subjects). Not sure whether that is still available in EEGLAB (the toolbox that the authors used) — there used to be a lot of collaboration between Fieldtrip and EEGLAB developers.

    • I’d think what was done here is some kind of permutation testing as well. The difference is that standard testing works on a formally defined test statistic, whereas this kind of visual testing works on the visual appearance of the graph. The idea is that visual displays often give a more nuanced and “complete” view of what goes on, and sometimes they may show some surprising details. On the other hand, normally there is no way to check formally whether what we see is special (or as some would put it, “real”), or whether data from a model with “nothing going on” could show similar patterns. The visual testing is meant to find this out. I don’t know whether this is better than using a formal test statistic in this particular case, but I think that there are certainly situations where this is very worthwhile.

  6. I think in one aspect this is somewhat different from the spirit of the visual inference in the cited Buja et al. paper. I suspect that the research hypothesis of “negative impacts for lower frequency bands and positive impacts for higher-frequency bands” can be translated into a formal test statistic relatively convincingly, so that testing the significance of the visual display may not add much information to running a permutation test with that statistic (even though it says that the real data *are* more special in this respect than the permutation data).

    I’d think that the original spirit was that if we’re looking at a graph in exploratory data analysis and see something special, we didn’t know in advance what we were looking for, so we couldn’t have specified a test statistic that captures our research hypothesis. In such a case one can use graphical testing to see whether what we found in the real data was more special than *anything* that somebody could have found in “random” data. There is no way to test this with a standard test based on a statistic.

    Of course, as Bailey wrote, forking paths may play a role anyway – if we look at lots of graphs and we pick the one that looks most special out of these, this isn’t the same as if somebody else gets shown *only* the special graph we found among a bunch of graphs of the same kind showing random data.

    • Christian:

      Yes, good point. The motivation for the exercise with the students was to assess the claim made in my blog post, that the graph in the original paper wasn’t so convincing because you can easily get very similar graphs just by chance.

  7. Dianne Cook’s comment is quoted in full in section 6 above.

    If anyone is interested in more reading on this topic here’s a good list to work from:

    M. Majumder, H. Hofmann, and D. Cook. Validation of visual statistical inference, applied to linear models. Journal of the American Statistical Association, 108(503):942–956, 2013.

    H. Wickham, D. Cook, H. Hofmann, and A. Buja. Graphical inference for infovis. IEEE Transactions on Visualization and Computer Graphics (Proc. InfoVis ’10), 16(6):973–979, 2010. doi: 10.1109/TVCG.2010.161.

    H. Hofmann, L. Follett, M. Majumder, and D. Cook. Graphical tests for power comparison of competing designs. IEEE Transactions on Visualization and Computer Graphics, 18(12):2441–2448, 2012.

    A. Loy, H. Hofmann, and D. Cook. Model choice and diagnostics for linear mixed-effects models using statistics on street corners. Journal of Computational and Graphical Statistics, 26(3):478–492, 2017. doi: 10.1080/10618600.2017.1330207.

    N. Roy Chowdhury, D. Cook, H. Hofmann, and M. Majumder. Measuring lineup difficulty by matching distance metrics with subject choices in crowd-sourced data. Journal of Computational and Graphical Statistics, 27(1):132–145, 2018.

    S. Vanderplas, C. Rottger, D. Cook, and H. Hofmann. Statistical significance calculations for scenarios in visual inference. Stat, e337, 2020. doi: 10.1002/sta4.337.

    S. Vanderplas and H. Hofmann. Clusters beat trend!? Testing feature hierarchy in statistical graphics. Journal of Computational and Graphical Statistics, 26(2):231–242, 2017. doi: 10.1080/10618600.2016.1209116.

    R. Beecham, J. Dykes, W. Meulemans, A. Slingsby, C. Turkay, and J. Wood. Map lineups: Effects of spatial structure on graphical inference. IEEE Transactions on Visualization and Computer Graphics, 23(1):391–400, 2017. doi: 10.1109/TVCG.2016.2598862.

    • Great posting! This is just to tell you that I was a fan of that work from the beginning (I saw this presented in 2008 if I’m not mistaken, and I was able to pick the real data in a live experiment when this was presented;-).

    • Thank you for the explanation – I must be particularly dense about this, but despite your clear explanation, I am missing something. I guess I don’t really understand the “permutation null hypothesis.” As I see it, the original paper had a hypothesis (concerning the impact of cash transfers on EEG activity in different frequency bands) for which the standard statistical tests were inconclusive. The experiment being run here asks whether students could recognize the actual data from a random permutation of the 2 groups’ data. Isn’t that a totally different question? Perhaps it is interesting in its own right (though I would think there are better experiments to run if you are interested in discovering how people perceive random vs. nonrandom data), but it seems pretty far removed from the original research question. Then, the final conclusion that there is too much noise to conclude anything confidently seems to mirror what the standard statistical results were to begin with – how does this follow-up experiment add anything to that original question?

      I’m still missing something here, feeling like I’m lost in the weeds. Looking at the titles of the references you cite, this area of work seems focused on issues of visual inference. That certainly interests me, but this example seems like a fairly unproductive way to investigate that. Instead of asking whether people can pick out patterns consistent with the research question from among the actual data and random permutations of it, wouldn’t it be more direct to ask people to rank various visual displays in terms of their consistency with the research question? In other words, I’m confused between whether the purpose is to say something about the observed results from the study or about the effectiveness and accuracy of various visual displays.

      • Hi Dale,

        I think the missing piece for you is to consider the plot to be a test statistic. When you apply the plot design to the data provided you get the observed test statistic. When you apply the same plot design to null data sets you get draws from the sampling distribution assuming the null hypothesis of there being no structure/relationship.

        It is actually testing the hypothesis in the original paper (ideally, if the nulls are generated properly).

        The benefit is that the plot is a more complex statistic than can ever be described numerically.

        The complication is that you need humans to determine how extreme is the data plot. You need to recruit enough human observers, and you need to provide enough comparison plots (nulls) to be able to say that the data plot is different (or not).

        I really like to explain the testing procedure from the plot at https://www.dicook.org/files/macquarie_2022/slides#/section-2

        Does this help you? If not, try again to explain where you are confused.

        • Thanks, that is very helpful. I think I see the difference (that slide really helps). It still seems like a complex way to proceed, as it mixes several things at once: the observed data’s randomness/signal and the human perception of graphical evidence. Both are of interest, but by mixing the two it isn’t clear to me what to conclude from the “permutation null hypothesis” test. Personally, it seems more straightforward to investigate the graphical perception issues separately from trying to say anything about this particular study.

  8. Hi Dale, in response to: “Thanks, that is very helpful. I think I see the difference (that slide really helps). It still seems like a complex way to proceed, as it mixes several things at once: the observed data’s randomness/signal and the human perception of graphical evidence.” It’s a valid test, as supported by the work of Majumder et al (2013). As you say, it can be complex. Generally, statistical hypothesis testing is a hard process to wrap your head around. Visual inference tests are also expensive, given that you need human observers. You would primarily use them in a situation where there is no existing statistical test, like the scenario in this blog post. The published plot was being used to make a claim, and this claim was challenged. The lineup test (probably the only way to test it objectively) strongly suggests that there is validity to the claim.
