Looking at all possible comparisons at once: It’s not “overfitting” if you put it in a multilevel model

Rémi Gau writes:

The Human Brain Mapping conference is on these days and I heard via Twitter about this Overfitting Toolbox for fMRI studies that helps explore the multiplicity of analytical pipelines in a more systematic fashion.

Reminded me a bit of your multiverse analysis: thought you might like the idea.

The link is to a conference poster by Joram Soch, Carsten Allefeld, and John-Dylan Haynes that begins:

A common problem in experimental science is if the analysis of a data set yields no significant result even though there is a strong prior belief that the effect exists. In this case, overfitting can help . . . We present The Overfitting Toolbox (TOT), a set of computational tools that allow to systematically exploit multiple model estimation, parallel statistical testing, varying statistical thresholds and other techniques that allow to increase the number of positive inferences.

I’m pretty sure it’s a parody: for one thing, the poster includes a skull-and-crossbones image; for another, it ends as follows:

Widespread use of The Overfitting Toolbox (TOT) will allow researchers to uncover literally unthinkable sorts of effects and lead to more spectacular findings and news coverage for the entire fMRI community.

That said, I actually think it can be a good idea to fit a model looking at all possible comparisons! My recommended next step, though, is not to look at p-values or multiple comparisons or false discovery rates or whatever, but rather to fit a hierarchical model for the distribution of all these possible effects. The point is that everything’s there, but most effects are small. In the areas where I’ve worked, this makes more sense than trying to pick out and focus on just a few comparisons.
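
A minimal sketch of that idea, with simulated data: estimate every comparison, then let a hierarchical normal model shrink the noisy estimates toward each other instead of thresholding them one at a time. The crude empirical-Bayes step below (method of moments for the group-level variance) is only a stand-in for a full Bayesian multilevel fit, and every number in it is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

J = 200                                # number of possible comparisons
tau_true = 0.1                         # most true effects are small
theta = rng.normal(0.0, tau_true, J)   # true effects (unknown in practice)
se = rng.uniform(0.1, 0.3, J)          # standard error of each comparison
y = theta + rng.normal(0.0, se)        # raw estimates, one per comparison

# Method-of-moments estimate of the between-comparison sd (tau), assuming
# the population of effects is centered at zero:
# Var(y_j) = tau^2 + se_j^2, so tau^2 ~= mean(y^2) - mean(se^2), clipped at 0.
tau2_hat = max(np.mean(y**2) - np.mean(se**2), 0.0)

# Posterior mean under the normal-normal model:
# E[theta_j | y_j] = y_j * tau^2 / (tau^2 + se_j^2)
# so noisy estimates get shrunk hard toward zero instead of being "selected".
theta_hat = y * tau2_hat / (tau2_hat + se**2)

print(f"estimated tau: {np.sqrt(tau2_hat):.3f} (true value {tau_true})")
print(f"largest |raw estimate|:              {np.abs(y).max():.3f}")
print(f"largest |partially pooled estimate|: {np.abs(theta_hat).max():.3f}")
```

The largest raw estimate is always more extreme than the largest partially pooled estimate: the multiple-comparisons problem is handled by shrinkage rather than by selection and thresholding.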

26 thoughts on “Looking at all possible comparisons at once: It’s not ‘overfitting’ if you put it in a multilevel model”

  1. > the poster includes a skull-and-crossbones image

    Note as well that “tot” means “dead” in German and “muerta” (in the Pirates of the Caribbean quote) means “dead” in Spanish.

  2. It seems to be another way of saying there are thousands/millions/billions/infinity arbitrary models of “chance” you can choose to reject, so it is meaningless to “discover” one unless someone has previously claimed it is actually correct. Also, obviously only one of these myriad models can actually be true (more likely it is none)…

    Too bad the authors still don’t understand what they are seeing (emphasis mine):

    As a next step, it would be desirable to reanalyze the entire amount of previous fMRI studies to harvest the false-positive effects that might have been missed using conventional statistical techniques

    These models they are rejecting are false, they in no way describe the data generation and analysis process (and who claimed they did?). Rejecting such a false model is a “true positive” result.

    The solution is to stop testing these default null hypotheses and instead test the research hypothesis. This requires getting a quantitative prediction out of the research hypothesis, if that is too hard then you shouldn’t be testing anything at all. Instead you should be getting precise and accurate estimations of whatever parameters/variables are of interest (what Andrew said).

    • I think the authors precisely understand what they see. The Overfitting Toolbox (TOT) does not reject models as being false. It aims to reject null hypotheses regarding experimental effects, given each model. It accepts those models where this rejection is successful. This increases the number of positive inferences.

      • Sorry, I barely remember this and it is only procrastination that made me click your comment just now to see it was directed towards me. But my point is that all these models are false, so checking if they are false is pointless. Additional layers of accept/reject (“accepting successful rejections” or whatever) will not change this.

        So, there is no such thing as a “false positive” as mentioned in the paper; you can only correctly reject these models (the “experimental effect” is typically one parameter of the model that equals zero).* Either you are correctly rejecting the model or incorrectly accepting it.

        And indeed, this is precisely what the statistical machinery does for you provided you feed it enough sufficiently precise measurements. To be clear, I see two outcomes:

        1) Correct rejection of the statistical model, which we learn nothing from since that rejection is always correct
        2) Incorrect acceptance of the statistical model due to insufficient sample size

        *I think a lot of people forget they are checking whether the entire model is correct when comparing to the observations; the “effect size” is just one part of it.

  3. Beautiful prank! I hope the banner hits the muddleheaded researchers (as Paul Meehl called those psychologists who discovered interesting relationships everywhere) hard in the head. It also seems to denounce the poor state of theory and methodological procedure in behavioral brain research (and in Psychology, in general). On one hand, the poster shows we can obtain evidence for almost any effect if we tweak the statistical model enough; on the other, the sheer flexibility of modeling choices means the theory isn’t strong enough to constrain the model space, which implies an enormous chasm between substantive theory and statistical models.

    • The general response in psychology to the ability to find anything by tweaking the model is “commit to your model and don’t treat exploration like it’s a test,” which is why pre-registration is now pretty well-liked in the replication movement. However, the fact that many people aren’t aware that it exists seems problematic.

      Even so, the ~40% replication rate of RP:P (with another 30% showing results that were ambiguous), combined with the pathetic underpowered-ness of psychology studies, does seem to suggest that the majority of the results in the field weren’t false positives. This is encouraging, since there was no reason to expect that in the light of decades of researcher degrees of freedom.

        • Given 40% of positive results (p<0.05, I assume; I don’t understand what RP:P means), lower power does indicate larger true effect size, right?

        • If you mean you need a larger point estimate to reject the null given higher variance-to-sample-size ratios, then yes. But since “power” is only defined relative to an assumed treatment effect size, I’m not sure this logic works… it just seems like arguing that, holding constant the variance in the outcome, finding a non-p-hacked result that is statistically significant happens more often when the real effect in the world is big.

          But supposing that all studies that are published were p-hacked, and also that there is some non-zero-on-average effect of most treatments in papers published in the top journals in Psychology (of some direction, not necessarily the one estimated), I’d expect that “well-powered” studies would find an effect in the same direction about half the time. Right? I’m not saying that’s all that is shown in the paper, I’m just not sure I see the empirical content of the claim you’re making with regard to evidence of the quality of the literature.

          That all said, when I look at all the results in the Nosek et al. replication project (I think that’s what Austin was referring to), I do agree that there is some evidence that not everything published in the very top Psych journals is total BS. But that is a pretty low bar, right?

        • I could definitely be figuring this wrong (say, by not having much of a prior distribution in mind), but I think the tentative conclusion from the math in this case would be that *most* findings in top psychology journals aren’t nonsense, provided the sample of studies drawn here is representative. Also that a lot still probably are, leaving readers with the rather frustrating problem of finding out which is which. I do think “false positive rate under fifty percent” is not a particularly exciting endorsement of reliability, but it does mean “not over fifty percent.”

        • Or to elaborate a bit further on my position: I believe the results from this project support an estimate of >50% real effects, meaning that if you see an effect in psychology it’s incorrect to say it’s probably nonsense. But personally I entered into science wanting to have a high degree of confidence in the correctness of a finding, maybe in the 90% range or something. Also, it should be kept in mind that whether p < .05 supports a theory really depends on the study design, but I’m comfortable dealing with that. P-hacked results, I am considerably less comfortable with.

        • I’d agree, so long as we both mean “average effects of treatment” and not individual-level treatment effects, and we agree that the top 3 journals in Psych are not representative of most Psych papers.

          Now, as to whether they are USEFUL estimates or just RIGHT ones, that’s a much different story… because effect sizes matter, and if tiny, unnoticeable stimuli have large effects (as estimated in many papers), then none of it matters, because they all apparently cancel each other out in some (¿magical?) way.

        • I was thinking about the replication studies, assuming that the original studies are exploratory and not very informative (probably the Overfitting Toolbox has been used to harvest the false-positive effects that might have been missed using conventional statistical techniques), but that the replication studies are confirmatory and properly executed and reported, so the p-values behave as expected.

          If the null is effect size = 0 and the power is calculated for some prespecified alternative effect size = 1, we can for example look at the two scenarios “high power = 80%” and “low power = 20%”.

          If power = 80% and 40% of the studies have p-value < 0.05, the average true effect size is very likely to be smaller than 1. Maybe in (slightly less than) half the studies the true effect is actually 1, and in the rest is zero. Or maybe the true effect is always of the same size, quite below 1. Etc.

          If power = 20% and 40% of the studies have p-value < 0.05, the average true effect size is very likely to be larger than 1. Maybe in (slightly less than) 40% of the studies the true effect is very large and in the rest is zero. Or maybe the true effect is always of the same size, quite above 1. Etc.
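
          A quick numeric check of those two scenarios (assuming alpha = 0.05 and that the studies without the prespecified effect have exactly zero effect):

          ```python
          # If a fraction f of studies has the prespecified effect (found with the
          # stated power) and the rest have zero effect (significant alpha = 5% of
          # the time), the observed rate of p < 0.05 is  power*f + alpha*(1 - f).
          alpha = 0.05
          observed_rate = 0.40

          for power in (0.80, 0.20):
              f = (observed_rate - alpha) / (power - alpha)   # solve for f
              print(f"power {power:.0%}: f = {f:.2f}")

          # power 80%: f = 0.47, i.e. "slightly less than half" of the studies
          #            truly have the prespecified effect
          # power 20%: f = 2.33, which is impossible, so the true effects must be
          #            larger than the effect size the power calculation assumed
          ```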

        • RP:P stands for the Reproducibility Project: Psychology. It was this big replication effort a couple years back, spanning 100 studies.

          In regards to the effect size thing, effect size is one of several factors that’s used to calculate power to begin with (along with alpha and sample size – I might be missing a variable or two). A larger effect size should give it more power. But with low power studies (I think for a t-test that would be anything powered under 50%), the in-sample effect size has to be larger than the population effect size to achieve statistical significance.

          Generally, as jrc alludes to, significant results from a study with a small sample size (or too much noise) are less likely to be real (that is, more likely to have an effect size of zero) than significant results from a study with a large sample size (or not much noise). This is because fewer real results will achieve significance while the number of false positives remains unchanged (being preset at whatever alpha is), causing false positives to rise relative to true positives (the small calculation below spells this out).

          Sorry if I’m telling you something you already know or anything :)
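
          To put a number on that last point, here is the calculation with a made-up prior: suppose half of the hypotheses studied are real (that 50% is purely illustrative, not something from the thread).

          ```python
          # Among the studies reaching p < 0.05, what share are false positives
          # as power drops? (alpha fixed at 0.05, prior share of real effects 0.5)
          alpha = 0.05
          prior_true = 0.50   # assumed, illustrative

          for power in (0.80, 0.50, 0.20):
              true_pos = prior_true * power
              false_pos = (1 - prior_true) * alpha
              fp_share = false_pos / (true_pos + false_pos)
              print(f"power {power:.0%}: {fp_share:.1%} of significant results are false")

          # 80% power -> ~6% false; 20% power -> 20% false. Alpha never moved;
          # only the number of true positives shrank.
          ```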

        • Ah, yes, I guess it does make sense to distinguish the two. In this case, I was thinking about the raw effect size and how it affects power in conjunction with variability in the study. But then the effect size statistics with which I’m familiar, e.g. Cohen’s d, would seem to be standardized. Thanks for the comment!

        • jrc:

          No, what Austin is saying is quite a bit different. He’s talking about replications, not initial results that may have come from p-hacking, or even just file-drawer bias. You wouldn’t expect a 40% replication rate if 90% of reported results were pure noise chasing. The “low-power” point is that this doesn’t mean the 60% of results that didn’t replicate were all just noise chasing; rather, at least a few of those failures were likely due to under-powered replications. After all, even with moderate effects, the p-value is not the most stable statistic.

        • “You wouldn’t expect a 40% replication rate if 90% of reported results were pure noise chasing”

          Well… given very high-powered studies on noisy outcomes with heterogeneous, small (but real) treatment effects, I think I WOULD expect 40% to replicate. And up to 40% to go the other direction. And about 20% to be non-significant.

          But I think your broader point – that Nosek et al do find some evidence that Psych researchers do better than coin-flips – is probably right. Seems like a pretty low bar to call a Science “replicable” based on that…but that’s a different argument for a different thread.

        • Ah, I don’t think that’s quite it; I’m familiar with PPV. If I see a significant low power study and a significant high power study, I trust the significant high power study more. But we’re not looking at a sample of significant studies. We’re looking at 100 studies that are reported whether they attain significance or not.

          These studies are all pre-registered, I’m pretty sure, so with no p-hacking going on, alpha should truly be around .05. But beta is probably considerably higher than .05; I seem to recall seeing someone estimate the power of an average study at about 33%, though RP:P was probably better powered than that. The point is that the number of studies we’d expect to fail to achieve significance by chance is larger than the number we’d expect to achieve significance by chance, so what we’re seeing is probably a bit biased towards non-significance. But then again the influence of each error rate should depend on the percentage of hypotheses that were true or false to begin with… hmm.

          Aside from that, these calculations could be problematic because the replicators used different metrics for successful replication, such as whether the confidence interval contained the effect size of the original, but I think that didn’t change things up too much. At any rate I don’t think I’m committing that fallacy, but perhaps I’m miscalculating somehow.

        • I see what you are saying… though I would alter your second sentence to read “If I see a significant low power study and [any estimate from] high power study, I trust the [inference from] high power study more.”

          I’d also point out, even given all the badmouthing Psych Sci gets here, that I’d hope the best journals in the field publish research that tends to be less likely to be pure-noise-chasing BS than what the lesser journals in the field publish (not always true, but on average, in most fields). So there is a big problem with inference from sample to population-of-psych-papers here.

          And then you add in the weirdness that “power” is only ever understood relative to some “true” or “hypothesized” or “hoped for” effect size… well the whole exercise gets a little hard to reason through.

          But I do think you were probably not committing the exact “what does not kill my significance makes it stronger” fallacy… my bad on that.

        • Dear Austin,

          Yes, RPP was higher powered than 33%. I worked in one of the replication teams and I remember that we were told to aim for a power of 80-90%. I am really not sure anymore if it was 80, 90, or even 95, but it is mentioned in the paper and/or the supplemental materials, if the exact number is very important to you. In any case, the goal was far more than 33%.
          But all power calculations were based on the reported, and thus likely inflated, effect sizes. So the fact that only ~30% of results went in the same direction COULD be partially due to overestimated effect sizes (Type M errors) in the original studies. Of course this means that the actual power for many studies was not 90% or 80%, or probably not even 50%, so sure, some will have had a power of 33%.
          Unfortunately we do not have the TRUE effect size (distribution) to calculate the true power for the studies…

        • I worked in one of the replication teams and I remember that we were told to aim for a power of 80-90%

          If replication = “statistical significance in same direction”, all the null models are false in the first place, and samples are sized to 80% power to detect this, what would we expect to see?

          Wouldn’t it be ~40% significance in the same direction (“replicated”), ~40% in the other direction (“non-replicated”), and ~20% non-significant (“ambiguous”)?

          I feel like this analysis must be missing something because people are treating those results as encouraging.
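
          For what it’s worth, here is the arithmetic behind that 40/40/20 guess, under the assumptions stated above (every null is false, and the replication has 80% power against the actual deviation) plus one reading of the comment (the sign of the deviation is a coin flip relative to the originally claimed direction); it ignores the tiny chance of significance opposite to the actual deviation:

          ```python
          power = 0.80        # chance the replication is significant at all
          p_same_sign = 0.5   # assumed: deviation direction unrelated to the theory

          replicated   = power * p_same_sign        # significant, same direction
          contradicted = power * (1 - p_same_sign)  # significant, other direction
          ambiguous    = 1 - power                  # not significant

          print(f"{replicated:.0%} / {contradicted:.0%} / {ambiguous:.0%}")  # 40% / 40% / 20%
          ```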

        • I think it is interesting to look at this through the type S and type M errors framework (simplifying all intricacies a lot for the sake of exposition).

          Based on the RP:P paper, the average original-study effect size (standardized as a correlation r) is around r=0.4. To obtain 80% power to detect such an effect, the required sample size is around 47, using the Fisher transformation of r to make things ‘normal’.

          But the replication effect size (which should be a much better estimate of the real effect size because it supposedly wasn’t subject to the garden of forking paths) is around r=0.2 (mean; the median is lower). It’s not an order of magnitude lower, but it is only about half. Under this scenario, the power falls to 26%, which is not far from the observed replication rate using p-values!

          Under 26% power, the probability of a type S error is relatively small, around 0.18%, but the exaggeration ratio is about 1.9 (both numbers are reproduced in the sketch after this comment).

          So, returning to the point Austin made above that Psychology is not chasing pure noise: I do believe we should not focus so much on the type 1, type 2 errors framework to criticize research in psychology. Theorized effects will likely never be equal to zero, so, with enough power, we will detect something. The point is: are we able to estimate those effects reliably? Are they stable enough under different conditions to really contribute to a scientific theory?

          If the studies are not sufficiently powered to estimate the effects with precision and we continue to use statistical significance to decide about the existence of an effect, studies will consistently overestimate the true effect. In my opinion, the message of the RP:P paper is not so much the (cup half-empty) ‘more than half of research results are false’ or the (cup half-full) ‘almost half of research results are true’, but rather: psychologists are systematically overestimating the effects of interest through low power, deciding based on p-values, and hacking those p-values in obvious or subtle ways…
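
          A short sketch reproducing the numbers in this comment, assuming a two-sided alpha of 0.05 and treating the Fisher-transformed correlation as normal with standard error 1/sqrt(n-3); the small gap between the 27% here and the 26% quoted above is just rounding:

          ```python
          import numpy as np
          from scipy.stats import norm

          n, r_true, alpha = 47, 0.2, 0.05
          se = 1 / np.sqrt(n - 3)            # standard error of Fisher's z
          z_true = np.arctanh(r_true)        # Fisher transform of the replication r
          z_crit = norm.ppf(1 - alpha / 2)   # ~1.96

          # Retrodesign-style simulation: draw many replication estimates around
          # the "true" effect and look only at the significant ones.
          rng = np.random.default_rng(0)
          est = rng.normal(z_true, se, 1_000_000)
          sig = np.abs(est) > z_crit * se

          power = sig.mean()                               # ~0.27
          type_s = (est[sig] < 0).mean()                   # ~0.002, i.e. ~0.2%
          exaggeration = np.abs(est[sig]).mean() / z_true  # ~1.9

          print(f"power {power:.2f}, type S {type_s:.4f}, exaggeration {exaggeration:.2f}")
          ```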

      • Even so, the ~40% replication rate of RP:P (with another 30% showing results that were ambiguous), combined with the pathetic underpowered-ness of psychology studies, does seem to suggest that the majority of the results in the field weren’t false positives. This is encouraging, since there was no reason to expect that in the light of decades of researcher degrees of freedom.

        There was a very good reason to expect similar results to what you describe. If the null models were always false we would naively expect ~50% replication rate (naively: assuming the theory they attempted to test had little to do with the deviation from the model).

    • Very nice commentary Erikson.

      Whether theory can constrain model space is a question that has puzzled me. Asking folks why they hold to a particular proposition/hypothesis makes me lean toward the view that most do not stray too far from what they learn in grad school. Of course, there will be subsets that do. But they have a hard row to hoe because they are cast as disrupters. I believe too that if individuals who have few if any conflicts of interest were evaluating research, we would be further along than we are.

      Theories too are shaped by some who have conflicts of interest. They can yield quite substantive rationales. I see this in several fields.
