“Check out table 4.”

A colleague sent along this article and writes:

Check out table 4. This is ERC-funded research (the very best of European science gets this money).

OK, now I was curious, so I scrolled through to table 4. Here it is:

Yup, it’s horrible. I don’t know that I’d call it cargo cult science at its absolute worst, but it’s pretty bad.

(The full story, for those who are not familiar, is that p-values are super-noisy; see, for example, the first full paragraph on page 5 of this article. Hence, summarizing a study by its statistically significant p-values is a criminally wasteful presentation of results.)
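To make the noisiness concrete, here is a minimal simulation sketch. It is not from the linked article; the true effect (0.2), sample size (100), unit noise, and z-test are all illustrative assumptions. It reruns the same hypothetical study many times and looks at how much the p-value moves around from replication to replication.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(true_effect=0.2, n=100, reps=1000, seed=0):
    """Replicate one hypothetical study `reps` times and return the p-values.

    Assumptions (illustrative only): the estimate is a mean of n unit-variance
    observations, so its standard error is 1 / sqrt(n) and is treated as known.
    """
    rng = random.Random(seed)
    se = 1 / math.sqrt(n)
    p_values = []
    for _ in range(reps):
        estimate = rng.gauss(true_effect, se)  # sampling noise around the true effect
        p_values.append(two_sided_p(estimate / se))
    return p_values


p_values = simulate_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share with p < 0.05: {share_significant:.2f}")
print(f"smallest p: {min(p_values):.2g}, largest p: {max(p_values):.2g}")
```

Under these assumptions the underlying effect never changes, yet identical replications land on both sides of the 0.05 line, so a table of stars records only which side each test happened to fall on.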

6 thoughts on “Check out table 4.”

  1. I have produced (though not published) such graphs and find them useful for quick identification of patterns. Table 4 shows the results of 5*5*2 = 50 significance tests and it’s difficult to convey the pattern of these results in a manner that is both (a) quick and (b) open enough to allow the reader to form his/her own conclusions.
    Seeing that there are also a table 5 and a table 6 in a similar manner, the authors apparently felt that this worked for them and was useful to readers. Could this be improved? Certainly! But, taking a glass-half-full perspective, is it an improvement over a wall of text with 50 p-values and a revelation from on high about what patterns there are and are not? Also certainly yes!

  2. I reread John Ioannidis’ 2005 paper recently and found it surprising how it was thoroughly steeped in the mindset of type 1/type 2 errors, talking endlessly about “real” relationships and “no” relationships, “true” findings and “false” ones. I guess it’s not that surprising given the paper’s title.

  3. Would it help at all if there were a global yes/no test? It seems there are probably three different criticisms implicit here.

    One is against yes/no questions in general, which you’ve articulated many times (and which I disagree with, although I agree that it makes a lot of sense in the context of the kind of data you work with).

    The second is that they have no precise hypothesis about which of the ten measures (five measures times two regions) they are looking for differences in, yet they do separate tests on each one, with no way of formally aggregating them (except maybe for some Bonferroni-type correction, but it’s an exaggeration to say that this turns a set of inferences into one global “at-least-one-is-different” inference, if for no other reason than because no one ever comes out and explicitly states that as being the test they’re doing: they just plug and play the numbers).

    The third is that, although I’m not sure what the rows are doing here, if they’re trying to draw comparisons between conditions, comparing rows in this binary table is not the way to do it. Some actual test is needed.

    I agree with 2 and 3 but not 1, and I wonder (putting aside 3) if the heart of the matter isn’t really 2: selecting measures in high(er than one) dimensional time series data in cognitive science is still completely throw-everything-at-the-wall-and-see-what-sticks. As unfortunate as that is, there’s still a right way and a wrong way to ask binary model comparison questions if that’s what you’re doing. This is the wrong way, and if you did it the right way (fitting an actual time series model, fitting a second model that is a “nil model” that prohibits the types of interactions these stars are indicating, across the board, comparing these two models), then this wouldn’t be a table you would ever be tempted to make.
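The "full model versus nil model" comparison described in the comment above can be sketched in a heavily simplified form. This is not the time series model the commenter has in mind, and nothing here comes from the paper: it assumes independent Gaussian observations with known unit variance, two conditions, and an even number of measures (so the chi-square tail has a closed form). The function names and data layout are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; closed form for even df only."""
    assert df > 0 and df % 2 == 0
    half = x / 2
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term          # adds (x/2)^i / i!
        term *= half / (i + 1)
    return math.exp(-half) * total


def mean(values):
    return sum(values) / len(values)


def sum_sq(values, center):
    return sum((v - center) ** 2 for v in values)


def global_lr_test(group_a, group_b):
    """One global test across all measures instead of one test per measure.

    group_a, group_b: lists with one list of observations per measure.
    Nil model: both conditions share a mean on every measure.
    Full model: each condition has its own mean on every measure.
    With unit variance (an assumption), the likelihood-ratio statistic is the
    drop in residual sum of squares, with one degree of freedom per measure.
    """
    stat = 0.0
    for a, b in zip(group_a, group_b):
        pooled = a + b
        rss_nil = sum_sq(pooled, mean(pooled))
        rss_full = sum_sq(a, mean(a)) + sum_sq(b, mean(b))
        stat += rss_nil - rss_full  # equals 2 * (loglik_full - loglik_nil)
    return stat, chi2_sf_even_df(stat, len(group_a))


# Tiny hand-made example: two measures, two observations per condition.
stat, p_value = global_lr_test([[0.0, 0.0], [0.0, 0.0]], [[3.0, 3.0], [3.0, 3.0]])
print(f"LR statistic: {stat:.1f}, global p-value: {p_value:.2g}")
```

The point of the sketch is only the shape of the analysis: a single comparison between two fitted models yields one answer to the "is anything different?" question, rather than a grid of per-measure stars.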
