The backpack fallacy rears its ugly head once again

Shravan points to this bit that he saw in footnote 11 of some paper:

“However, the fact that we get significant differences in spite of the relatively small samples provides further support for our results.”

My response: Oh yes, this sort of thing happens all the time. Just google “Despite limited statistical power”.

This is a big problem, a major fallacy that even leading researchers fall for. Which is why Eric Loken and I wrote this article a few years ago, “Measurement error and the replication crisis,” subtitled, “The assumption that measurement error always reduces effect sizes is false.”

Anyway, we’ll just keep saying this over and over again. Maybe new generations of researchers will get the point.
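
To make the fallacy concrete, here is a minimal R sketch (mine, not from the post or from the Loken and Gelman article; the sample size, noise level, and true effect are arbitrary choices). The point: when measurements are noisy and samples are small, the estimates that clear p < 0.05 are badly exaggerated and sometimes have the wrong sign, so "significant despite the small sample" is not reassuring.

```r
# A small true effect, measured with a noisy instrument, in small samples.
# All numbers here are illustrative, not taken from the paper in question.
set.seed(2024)
n_sims <- 10000
n      <- 20      # observations per group
true_d <- 0.1     # small true difference in group means
sigma  <- 1       # between-person sd
tau    <- 2       # measurement-error sd (a noisy instrument)

res <- replicate(n_sims, {
  x <- rnorm(n, true_d, sigma) + rnorm(n, 0, tau)  # "treatment" measurements
  y <- rnorm(n, 0,      sigma) + rnorm(n, 0, tau)  # "control" measurements
  tt <- t.test(x, y)
  c(est = mean(x) - mean(y), p = tt$p.value)
})
res <- as.data.frame(t(res))
sig <- subset(res, p < 0.05)

mean(res$p < 0.05)           # power is low, yet "significant" results still turn up
mean(abs(sig$est)) / true_d  # exaggeration ratio among significant results (type M): far above 1
mean(sig$est < 0)            # a nontrivial share of significant results get the sign wrong (type S)
```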

15 thoughts on “The backpack fallacy rears its ugly head once again”

  1. If the same authors had been testing their own hypothesis, they would have argued that the small-sample deviation was random, or a coincidence.

    At some point we need to face the fact that there is no principled reasoning behind these logical fallacies. Most people simply have no idea why they believe some things but not others. Very similar to ChatGPT.

    E.g., does anyone here believe ivermectin is effective for river blindness but not covid? Because there is far, far more evidence for the latter. The only “reasoning” going on is the argument-from-authority heuristic/fallacy. Nothing wrong with that, as long as you recognize that this is the case and adjust your confidence accordingly.

  2. The points regarding type M/S errors in point estimation, significance filtering as noise amplification, regression to the null in replications of pre-filtered results, etc., are clear. But is the corollary that when measurements are *not* noisy, significance at some alpha with a low sample size *does* indicate the presence of a large, non-zero effect? If I’m trying to evaluate whether elephants or amoebas have greater mass, and I know a priori that my measurement device is both precise and accurate, then the p < 1E-100 or whatever that I get at n=8 *does* reflect a large standardized effect (relative to the differences within groups, i.e., in terms of Hedges’ g or Glass’s delta, etc.)?

    Or, to strain the analogy: since p-values reflect both signal and noise, I would accept that my friend Clark has superpowers (very implausible, a priori!) when his mile time is a handful of microseconds and his backpack is filled with degenerate neutronium. If someone tells me that their n=8 t-test produced a p-value of 1E-100 and no other information, I’d be very inclined to trust the point estimate of the difference between groups on a scale that includes them both, and in that context would say that the n=8 does suggest a true, large effect, whereas a p = 1E-100 at n=1E9 would suggest a small one.
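
    As a quick check of the scale involved (assuming “n=8” means two groups of 8 and hence 14 degrees of freedom; that setup is my guess, not something stated above):

    ```r
    # What |t| does a two-sided p of 1e-100 require with 14 df?
    p <- 1e-100
    t_crit  <- qt(p / 2, df = 14, lower.tail = FALSE)
    g_rough <- t_crit * sqrt(2 / 8)   # rough standardized mean difference for 8 per group
    c(t = t_crit, g = g_rough)        # both come out astronomically large: the neutronium backpack, in numbers
    ```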

    • I was curious how this would look, so here’s a quick experimental result (I’m sure it could all be worked out mathematically, but this was way easier): https://i.imgur.com/7sbYiSh.png

      The procedure was to sample mu ~ cauchy(0,1) population means, then sample x ~ normal(mu, 1) observations at different sample sizes (2, 16, 128, 1024, 8192), and run t-tests on pairs of equal-sized samples (just like IRL).

      (or, well, I generated sample means and variances directly from their sampling dists and used dt() in R instead of t.test(), but same thing)

      Then I compared pairs of tests with the “same” p-value (i.e., adjacent p-values in the sorted list) and checked whether the true effect size was larger or smaller in the smaller-n test. Finally, I dragged a sliding window across the log p-values to see how often the smaller-sample test had a larger true effect (which I think broadly captures the intuition the OP opposes: that low p-values at small samples suggest stronger effects than at large samples, independent of social phenomena like file drawers and forking paths). A rough code sketch of this procedure appears at the end of this thread.

      As such, I don’t think the intuition’s too terrible! At p = 0.01 (-log10(0.01) = 2), for example, over 95% of n = 16 x 2 t-tests have larger *true* effect sizes than n = 1024 x 2 studies at the same p-value. Estimates of the former will be enriched for type M errors, sure, but a low p-value at low n is still decent evidence for a large true effect!

        • Oh, weird! Can maybe try https://imgur.com/7sbYiSh; if not, here’s a reup to some random image hosts: https://i.ibb.co/gMPYC87/image.png or https://i.postimg.cc/tpymDMg4/image.png :]

          I was also thinking a bit further on this last night, about similar analogies to the backpack-laden mile run. Maybe statistical tests can be likened to other types of tests, like school exams. In exams, the examiner collects observations (responses to test questions) and attempts to estimate some unobserved variable (the student’s aptitude), maybe assessing their performance as implausible under some null model (e.g., blind guessing, or, if they grade on a curve, by thresholding grades according to tail probabilities in the class-wide distribution of total scores… and also, all the other students have to be identical for this to hold, since the usual assumption is a point null on the focal parameters).

          A student’s total score (the “test statistic”) might fall in the far right tail of the “blind guessing” null distribution, so they receive an A for their work… or, if our decision threshold is binary, a Pass. When we hear that a student passed the test with an A+++, we expect them to have high underlying ability. And if we hear that they received that score on a test that had only a small number of questions, they’d have had to have *really* dazzled the grader with their responses, since even blind guessing / word salad would occasionally produce decent answers.

          Hmm, or maybe an even better scenario would be the Voight-Kampff test: students are divided into two groups, Replicants and Humans, and we want to separate the latter from the former. If we learn that a Blade Runner needed a bajillion supplemental questions and biometric measurement devices to ID Human A, we’d be like, whoa, this person is pretty robotic. If we learned that they needed only a single question, the desert tortoise one or whatever, before leaping from their chair and declaring the person DEFINITELY 100% GRADE-A HUMAN, we’d rightly assume the response was really full of human-y passion (or something, I’ve never actually seen Blade Runner).

          Where the student’s peers are not identical, I think we’re closer to a multilevel / regularized context, or maybe in the ROPE world, or something 🤔 Or, like, generating a null distribution of test statistics from a prior predictive at some n.
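
    For reference, here is a rough R sketch of the simulation described earlier in this thread. The replicate count, the two-groups-each-with-their-own-Cauchy-mean setup, the nearest-p matching of studies, and the sliding-window width are my assumptions rather than the commenter’s actual code.

    ```r
    # Simulate two-group studies with Cauchy(0, 1) population means and Normal(mu, 1) data,
    # at several per-group sample sizes, recording the true effect and the t-test p-value.
    set.seed(123)
    n_rep <- 2000
    sample_sizes <- c(2, 16, 128, 1024, 8192)

    one_study <- function(n) {
      mu <- rcauchy(2, 0, 1)                # each group gets its own population mean
      x  <- rnorm(n, mu[1], 1)
      y  <- rnorm(n, mu[2], 1)
      c(n = n,
        true_effect = abs(mu[1] - mu[2]),
        p = t.test(x, y, var.equal = TRUE)$p.value)
    }

    sims <- as.data.frame(do.call(rbind, lapply(sample_sizes, function(n)
      t(replicate(n_rep, one_study(n))))))

    # Compare small-n and large-n studies at (roughly) the same p-value.
    small <- subset(sims, n == 16)
    large <- subset(sims, n == 1024)
    lp_s  <- -log10(pmax(small$p, 1e-300))  # floor p to avoid log10(0)
    lp_l  <- -log10(pmax(large$p, 1e-300))
    idx   <- sapply(lp_s, function(v) which.min(abs(lp_l - v)))  # nearest-p large-n study
    small_wins <- small$true_effect > large$true_effect[idx]

    # Sliding window over -log10(p): how often does the smaller-n study have the larger true effect?
    centers  <- seq(0, 10, by = 0.5)
    prop_win <- sapply(centers, function(cc) {
      in_win <- abs(lp_s - cc) < 0.5
      if (sum(in_win) < 20) NA else mean(small_wins[in_win])
    })
    plot(centers, prop_win, type = "b", xlab = "-log10(p)",
         ylab = "Pr(n = 16 study has larger true effect than matched n = 1024 study)")
    ```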

  3. What if you had a small effect size that is statistically significant at, say, 0.01? I would never make an argument like the one in the quote, but I think I would be more confident in that result than in one consisting of a large effect with a wide CI, especially given a small sample.
