The real problem of that nudge meta-analysis is not that it includes 12 papers by noted fraudsters; it’s the GIGO of it all

A few days ago we discussed a meta-analysis that was published on nudge interventions. The most obvious problem of that analysis was that it included 11 papers by Brian Wansink and 1 paper by Dan Ariely, and for good reasons we don’t trust papers by these guys. This all got a lot of attention (for example here on Retraction Watch), but I’m concerned that the focus on the fraudulent papers will distract people from what I consider to be the real problem of that analysis.

The real problem

I’m concerned about selection bias within each of the other 200 or so papers cited in that meta-analysis. This is a literature with selection bias to publish “statistically significant” results, and it’s a literature full of noisy studies. If you have a big standard error and you’re publishing comparisons you find that are statistically significant, then by necessity you’ll be estimating large effects. This point is well known in the science reform literature (for example see the example on pages 17-18 here).
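To make the "by necessity" part concrete, here is a minimal simulation sketch. The true effect (0.1) and standard error (0.25) are made-up numbers chosen for illustration, not values from any study in the meta-analysis, and the function name is hypothetical. It draws many noisy estimates, keeps only the ones that clear the usual significance threshold, and compares their typical size to the truth.

```python
import numpy as np

def exaggeration_ratio(true_effect, se, n_sims=200_000, seed=0):
    """Simulate noisy estimates of one effect and keep the 'significant' ones.

    Returns (ratio, share_significant), where ratio is the mean absolute
    significant estimate divided by the true effect.
    All inputs are illustrative assumptions, not data from the post.
    """
    rng = np.random.default_rng(seed)
    # Each simulated study reports estimate ~ Normal(true_effect, se).
    estimates = rng.normal(true_effect, se, size=n_sims)
    # Two-sided test at the conventional 5% level.
    significant = np.abs(estimates) > 1.96 * se
    ratio = np.mean(np.abs(estimates[significant])) / true_effect
    return ratio, significant.mean()

ratio, share = exaggeration_ratio(true_effect=0.1, se=0.25)
print(f"share significant: {share:.2f}, exaggeration ratio: {ratio:.1f}")
```

With these inputs, any estimate that reaches significance has to be larger than 1.96 × 0.25 = 0.49 in absolute value, so the published estimates overstate the assumed true effect by a factor of about five or more, whatever the random seed.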

Do a meta-analysis of 200 studies, many of which are subject to this sort of selection bias, and you’ll end up with a wildly biased and overconfident effect size estimate. It’s just what happens! Garbage in, garbage out. Excluding papers by known fraudsters . . . well, yeah, you should do that, for sure, but it does not at all solve the problem of bias in the published literature. Indeed, one reason those fraudulent papers could live so comfortably in the literature was that they were surrounded by all these papers with ridiculous overestimates of effect sizes. Kinda like E.T. hiding among the stuffed animals in the closet.

Again, the first problem I noticed with that meta-analysis was an estimated average effect size of 0.45 standard deviations. That’s an absolutely huge effect, and, yes, there could be some nudges that have such a large effect, but there’s no way the average of hundreds would be that large. It’s easy, though, to get such a large estimate by just averaging hundreds of estimates that are subject to massive selection bias. So it’s no surprise that they got an estimate of 0.45, but we shouldn’t take this as an estimate of treatment effects.
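Here is a rough sketch of how averaging can produce a large and confident-looking number out of very little. Everything in it is an assumption for illustration: a common true effect of 0.05, standard errors spread between 0.1 and 0.4, and a crude publication rule that only positive, statistically significant results appear. The pooling is simple inverse-variance (fixed-effect) weighting, which is not the random-effects model used in the published meta-analysis; the function name is hypothetical.

```python
import numpy as np

def pooled_estimate_after_selection(true_effect=0.05, n_published=200,
                                    n_attempted=50_000, seed=1):
    """Pool 'published' studies after a significance filter.

    Assumed setup (not from the post): every attempted study estimates the
    same small true effect, and only positive significant results are published.
    Returns (pooled_estimate, pooled_standard_error).
    """
    rng = np.random.default_rng(seed)
    # Study-level standard errors: a mix of moderately and very noisy studies.
    se = rng.uniform(0.1, 0.4, size=n_attempted)
    est = rng.normal(true_effect, se)
    # Crude publication rule: positive and significant at the 5% level.
    published = est > 1.96 * se
    est_pub = est[published][:n_published]
    se_pub = se[published][:n_published]
    if est_pub.size < n_published:
        raise ValueError("not enough published studies; raise n_attempted")
    # Inverse-variance (fixed-effect) pooling of the published estimates.
    weights = 1.0 / se_pub**2
    pooled = np.sum(weights * est_pub) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    return pooled, pooled_se

pooled, pooled_se = pooled_estimate_after_selection()
print(f"assumed true effect: 0.05, pooled estimate: {pooled:.2f} (se {pooled_se:.3f})")
```

Under these assumptions every published estimate exceeds 1.96 × 0.1 = 0.196, so the pooled estimate must as well, and with 200 studies its standard error is small enough that the interval sits far from the assumed truth. More studies of the same kind make the answer more precise, not more correct.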

The experts speak

The above issue—GIGO—is the main point, and I discussed it in detail in my earlier post, but, again, I wanted to emphasize it here because I’m afraid it got lost amid the Wansink/Ariely brouhaha. As I wrote the other day, I would not believe the results of this meta-analysis even if it did not include any of those 12 papers, as I don’t see any good reason to trust the individual studies that went into it. (No, the fact that those individual studies were published in reputable journals and had statistically significant p-values is not enough, for reasons discussed in the classic papers by Simmons et al., Francis, etc.)

But there are a few other things I’d like to share that were pointed out to me by some colleagues who are experts in the statistical analysis of interventions in psychology.

Beth Tipton:

In addition to your concerns in your post, I’d add that this speaks to my general concern with reporting of meta-analyses. The effect size that makes it into the abstract and conclusion is (1) unadjusted and (2) is the average (there may be a lot of variation).

In contrast, imagine an observational study that used advanced methods to adjust for confounding (e.g., iv, propensity scores, whatever) – what if they reported the *unadjusted* effect size in the abstract and conclusion? This would never fly there and shouldn’t fly in meta-analysis either.

See below for their own text on the effect of publication bias:
“Assuming a moderate one-tailed publication bias in the literature attenuated the overall effect size of choice architecture interventions by 26.79% from Cohen’s d = 0.42, 95% CI [0.37, 0.46], and τ2=0.20 (SE=0.02) to d=0.31 and τ2=0.23. Assuming a severe one-tailed publication bias attenuated the overall effect size even further to d=0.03 and τ2=0.34; however, this assumption was only partially supported by the funnel plot.”

I just want to be clear that my comment is not so much about them as about a larger problem with meta analysis reporting.

Dan Goldstein:

Those effect sizes seem implausibly large. I went into the appendix of the “Estimating the Reproducibility of Psychological Science” paper by the Open Science Collaboration and, looking at the replications (not original studies), the median Cohen’s d was around .25. So if .41 is small to medium, the question is compared to what. They say in the same section .68 is “slightly larger” when it’s substantially larger.

That said, I would expect results of choice architecture studies to be bigger than typical psych science lab studies on unconscious influence or whatever. Some choice architecture effects are monstrously large (think about ranking alternatives …. on search engine results people rarely look past the first handful of results or ads. They very rarely go to page 2 or 3. That’s a big effect).

Default effects are some of the biggest you will find. But this paper is about much more than defaults, hence my skepticism.

David Yeager:

This paper is an unfortunate example of the kind of over-claiming and heterogeneity-naive meta-analysis that Beth Tipton, Chris Bryan, and I wrote about in our paper. Really frustrating to see, but quite common in the literature. Hopefully it will spark some good dialogue about better methods.

I think the p-hacked studies are only one part of why the effect sizes are so inflated though.

Another main reason, I suspect, is that there are very different kinds of nudges that have effects that are orders of magnitude different. They’re doing this meta-analysis in a heterogeneity-naive way. Some nudges are one-time decisions, set it and forget it, and they get huge effects. “Save more tomorrow” is one. Defaults or framing effects are others. People are presented with a choice that makes it very hard to deviate from a default, and they tend to get effects of d = 1 or so. But in those studies they’re not trying to find out if you make other choices similarly weeks or months later, where the researchers aren’t doing anything to the choice architecture.

Other nudges, which are many of the major policy applications, are more subtle and happen over many choices over time. Examples include Opower’s use of norms on an energy bill, where you get a framing device monthly, but then you have to apply it every night when you turn out the lights or consider which refrigerator to buy. The most optimistic studies get d = .02, and a 2-3% difference in energy use. Another example is an implementation intentions writing exercise, which is supposed to influence whether you go to the gym or study for the SAT over several months. Those kinds of trials get d = .1, or .15, in multi-condition studies where you focus on only the most effective condition. The same is true for studies like FAFSA simplification. Further, many of the latter types of nudges have declining effect sizes over time as they are replicated in more heterogeneous samples, as Beth Tipton, Chris Bryan, and I point out. Crucially, these repeated-decision Nudge studies are from large samples and using rigorous methods–often including independent evaluations, so they are not likely to be false positives. None of them get an effect close to even a fourth of .45 SD—that’s absurdly massive and not at all what the literature would suggest.

The much smaller true effect of real-world nudges is not a problem with the field, in my opinion. It’s pretty well-known that the farther you get from the lab or lab-like study with a single choice, and the closer you get to looking at real-world, repeated-choice outcomes over time, then your effect sizes will be much smaller (see here). But if readers are anchored on the absurd result in this meta-analysis, which doesn’t take well-established heterogeneity seriously, then it will lead to unrealistic expectations from both scientists and policymakers–and possibly even underutilization of high-quality evidence when future, legitimate studies fail to live up to the unrealistic standard.

There’s a bit of irony in this result, because the recent mega-study from Katie Milkman, Angela Duckworth, and BCFG showed that nudgers routinely over-estimate the effects of nudges, by an order of magnitude. What this PNAS meta-analysis suggests is that maybe the experts get it so wrong because they’re looking at the literature like these meta-analysts did.

Last, this is especially tragic because, as you note, these are junior folks who are following what is unfortunately common practice in meta-analysis (e.g. emphasizing the average effect and not the heterogeneity in effects; not reporting the huge prediction interval in the abstract). It’s an interesting case of the collateral damage from senior people over-stating their results that then trickles down consequences for junior folks who make the mistake of taking established scholars’ work seriously. We talk a lot about how we waste junior scholars’ time trying to replicate a literature that is full of false positives; now it’s clear that we also waste their time when we have a misleading effect-size literature that ends up in a meta-analysis.
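To give a sense of how wide that prediction interval is, here is a back-of-the-envelope sketch using the numbers quoted above from the paper (d = 0.42, τ² = 0.20). It assumes the true effects are normally distributed across studies and ignores the uncertainty in both the mean and τ², so it is a rough approximation and not the interval the authors computed. The function name is hypothetical.

```python
import math

def approx_prediction_interval(mean_effect, tau_squared, z=1.96):
    """Approximate 95% range for the true effect in a new study.

    Assumes true effects ~ Normal(mean_effect, tau_squared) and treats both
    parameters as known, which understates the real width.
    """
    tau = math.sqrt(tau_squared)  # between-study standard deviation
    return mean_effect - z * tau, mean_effect + z * tau

low, high = approx_prediction_interval(mean_effect=0.42, tau_squared=0.20)
print(f"approximate 95% prediction interval: {low:.2f} to {high:.2f}")
# prints: approximate 95% prediction interval: -0.46 to 1.30
```

Taken at face value, the model says the next nudge could plausibly have anything from a moderate negative effect to a huge positive one, which is a very different message from a headline average of 0.4-something.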

Not a moral thing

This is not a morality play. The authors of the meta-analysis worked hard and played by the rules, as the saying goes. But, as I so often say (but I hate to have to keep saying it), honesty and transparency are not enuf. If you average a bunch of biased estimates, you’ll get a biased estimate, and all the pure intentions in the world won’t solve this problem. I so so much would like the researchers who do this sort of thing to use their talents more constructively.

The people who I get mad at here are not the young authors of this paper, who are doing their best and have been admirably open with their data and methods. No, I get mad at the statistics establishment (including me!) for writing textbooks that focus on methods and say almost nothing about data quality, and I get mad at the science establishment—the National Academy of Sciences—for promoting this sort of thing (along with himmicanes, air rage, ages ending in 9, etc. etc.), not to mention the people in nudgeworld who are cool with people thinking that the average effect is so large. Don’t forget, the leaders of the field of Nudge are people who described Wansink’s papers as “masterpieces,” which makes me think they’re real suckers for people who tell them what they want to hear. It’s horrible that they’re sucking young researchers into this vortex. It’s Gigo and Gresham all the way down.

20 thoughts on “The real problem of that nudge meta-analysis is not that it includes 12 papers by noted fraudsters; it’s the GIGO of it all”

  1. Related to the selection bias, I believe one of the reasons to be skeptical towards meta-analyses/systematic reviews of nudging studies is the definition of nudging. From Thaler and Sunstein’s Nudge book:

    “A nudge, as we will use the term, is any aspect of the choice architecture that alters people’s behavior in a predictable way without forbidding any options or significantly changing their economic incentives.”

    A nudge will, by definition, have a non-zero effect (i.e., alter people’s behavior). This is a problem if studies can (must?) decide whether an intervention is a nudge conditional upon the effect size. This will of course bias the field towards large effects (where d = .25 is a small effect), independent of how good the meta-analysis is.

    I made this point in a post last year in relation to another review of the literature on nudging, but I believe the issue is still relevant here when the meta-analysis samples studies explicitly about nudges (and not only about publication bias).

    • Erik:

      Good point. If you define a nudge as something that works, then, yeah, nudges work! In all seriousness, I guess this is a problem in their thinking, that they just assume that these things all work, so much so that it’s not clear what to call it when it doesn’t work. “Attempted nudge,” perhaps?

    • A related problem with the definition of “nudge” is including defaults. Defaults are a different animal than other types of nudges. As pointed out in the post above, defaults have big effects, and including them in the same category makes the whole theory of nudging more plausible. But defaults work because we are not trying to change people’s behavior at all. If I become a part of a class action unless I fill out a form and send it to a judge opting out of a class, of course that default rule will affect the number of class members. But the default rule didn’t change my behavior at all. It took advantage of my current behaviour or non-behaviour. Sunstein has used this example as an example of a “nudge”. If defaults are nudges, then the category of “nudge” is just too broad to be helpful. Defaults may be useful for public policy, while non-default “nudges” are hopeless.

  2. Helpful post! Meta-analysis seems like a type of analysis that can easily do more harm than good, as the motivation for the random effects approach assumes a very specific statistical problem (aggregating parameter estimates from multiple studies with independent random samples of the same population) but in reality we have GIGO, non-random samples, etc. I like what Tipton says as it seems to capture this question I’ve been having about what the average effect people are reporting is actually good for.

    A few years ago I got asked to do work on a Navy grant designing an interface to help their scientists do systematic review and meta-analysis, which Alex Kale has been leading and Tipton helping advise on some of the stats for. The design philosophy we settled on is very much ‘how do we get people doing meta-analysis to doubt what they are doing?’ Lots of attempts in the questions the tool asks about studies and interactive plots to prompt reflection on different sources of uncertainty.

    • Jessica:

      This is related to Xiao-Li’s point that we discussed last month, that large surveys or, more generally, “big data,” can mislead people by giving small standard errors resulting in overconfidence.

      There’s also a problem with incentives. Leaders in “nudge” and related subfields have an incentive to promote work that claims their methods have large effects. Consider this line of mine from five years ago:

      If you’d been deeply invested in the old system, it must be pretty upsetting to think about change. Fiske [the editor of this new PNAS paper] is in the position of someone who owns stock in a failing enterprise, so no wonder she wants to talk it up. The analogy’s not perfect, though, because there’s no one for her to sell her shares to. What Fiske should really do is cut her losses, admit that she and her colleagues were making a lot of mistakes, and move on. She’s got tenure and she’s got the keys to PPNAS, so she could do it. Short term, though, I guess it’s a lot more comfortable for her to rant about replication terrorists and all that.

      I guess it was a mistake for me when writing this to focus on that one person. Fiske has some authority and power, as she has the demonstrated ability to get bad papers published in a high-profile journal, but she’s just one person.

      To put it another way, when you ask, “How do we get people doing meta-analysis to doubt what they are doing?”, this is related to two society-of-science questions: (1) How could senior researchers move beyond their short-term reputational incentives to question their scientifically dead paradigms, (2) How can young researchers avoid getting trapped in the intellectual webs spun by some prominent people in their fields?

      • A number of recent posts have a fair amount of overlap on this topic.

        e.g. https://statmodeling.stat.columbia.edu/2022/01/07/institute-for-replication-and-the-usual-concerns/#comment-2042165

        So I think you are right in your suggested (1) and (2). Somehow, despite many people clarifying that the first and critical step in any meta-analysis is to assess replication, they can’t avoid combining and foolishly hoping they accomplished something. I think I wrote once: no replication, no combining; just summarize why one cannot safely learn from these studies. To paraphrase Tukey, a collection of studies and an aching desire for an answer do not justify combining.

        For instance, in this wiki entry https://en.wikipedia.org/wiki/Meta-analysis I originally put this sentence first: “contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light with multiple studies”. Then I mentioned the possibility of combining.

        Then it was removed, and then someone put it back in as the second sentence and prefaced it with something that makes it a fallacy: “Not only can meta-analyses provide an estimate of the unknown effect size”. No, not unless replication is first assessed, as otherwise the unknown effect size is meaningless/harmful.

    • If the goal of a meta-analysis is to build a coherent synthesis of a field for the purpose of understanding a pattern of phenomena and making actionable predictions, then it seems to me that the ideal meta-analysis is a theory, not a meta-analysis. As Shravan said in the other thread, the best a meta-analysis can do is summarize the work that has been published in an area. But as Jessica says, the statistical summary typically assumes a model in which the measured effect size in each study is a (representative) sample from a (simple) population distribution of effect sizes. Such a model provides no basis for understanding the mechanisms by which the manipulations performed in each study manifest in the observations on which the “effect size” is computed. Without that, there is no foundation for attempting to generalize beyond the published findings, including generalization to applied domains.

      A model based on a theory of the underlying phenomena, on the other hand, would explicitly describe the mapping between (hypothesized) mechanisms and observables in a way that would provide a basis for meaningful generalization. It would also provide a basis for understanding residual uncertainty, because this would be propagated through the model. Of course, if the meta-analysis is based only on published results, the model might also need to include mechanisms describing the publishing process, not just the hypothesized underlying mechanisms.

      • Gec:

        I guess we could take the math literally and say that the implicit model of the meta-analysis is that there’s some class of interventions called “nudges” and that the effect of a “nudge” depends on (a) what exactly is being done, what’s being measured, etc.; (b) who’s being nudged; and (c) when and where is it happening. There’s some distribution of (a,b,c) in the wild, which is being approximated by the studies in those 200 published papers.

        I agree that it would be good to have a model of the underlying phenomenon, but I think that no general model can be formed, for the reason that the essence of nudges is that they work without trying to work. Nudges are like the sorts of images that you can never look at because, by construction, they only appear in your peripheral vision.

        In any case, I think the idea of this particular meta-analysis is that, theory aside, they’d like to learn something empirically about what works and by how much. I like the idea of a purely empirical study as a summary of what’s out there. I just don’t think it works here because of the huge selection biases within the component studies.

        • I strongly agree that a “theory of nudging” would not be possible. Or rather, I suspect to the extent it could be done, it would end up being a constellation of theories touching on various aspects of judgment and decision making and culture and social learning etc. etc. but nothing that would provide any insight about any kind of general phenomenon. It would be like having a theory of your cat, you could probably construct an explanation for what they were doing at the moment, but it wouldn’t help you predict what they would do in the future, not even necessarily in a nearly-identical scenario.

          Still, I think that if the GIGO meta-analyzers had at least tried to construct a theory of nudging, they would have quickly realized how hopeless the task is. As you say, the construct of “nudging” doesn’t really pick out a coherent set of phenomena that could be attributed to a common set of mechanisms.

          But I also agree that there is value in a purely empirical summary as well (with the appropriate caveats that the empirical studies are unlikely to be representative). More broadly, I think researchers would do well to understand (to themselves) and make clear (to others) when their object of study is an effect and not a cause.

  3. I like David Yeager’s response and also Jessica’s. Meta-analysis is clearly the wrong tool for characterizing an array of heterogeneous effects. This seems so obvious one has to wonder why it even needs to be said, yet everywhere (certainly in my field of economics) you see meta-analysis, either formally or implicitly, applied where it shouldn’t.

    Is it because researchers are beguiled by overly simple models in which an effect really does come down to a single parameter whose value is context-independent? Or is it the revenge of decades of lazy statistics where researchers looked for and “found” average effects that were “statistically significant”, and this framework guided theory, since a good theory, they say, is one that predicts empirical results?

    I’m asking these questions since I just finished three days of virtual attendance at the ASSA (economics) meetings, where paper after paper was structured around hyper-simple effect models supported by average effect estimates with little constellations of stars after them.

    • Peter:

      I don’t really know but my guess is that in this case the researchers thought something like this: “Nudge is a big deal, there’ve been lots of studies of different nudges, so let’s put them all together and see if we can learn where it works better and where it works worse.” They wouldn’t’ve thought of GIGO as being a problem because they don’t think the original studies are garbage (except possibly those 12 suspect papers, but even there they were hedging their bets by doing the analysis both ways).

      How to explain to these people that many or most of the original studies were garbage? One way to do it is to go through studies one at a time; the trouble is that this takes lots of work and then there are still 200+ studies remaining that didn’t happen to be written by Wansink or Ariely so haven’t been thoroughly looked at. Another way is to do what Ulrich Schimmack, Greg Francis, Uri Simonsohn and others have done and point out problems in the distributions of published p-values. I guess another approach would be to look at a random sample of the studies in that meta-analysis and see how many of the studies in the random sample are fatally flawed. The trouble there is that there’s usually a fuzzy zone where there are various possible analyses of the data, so you can rarely definitively show that a study’s published estimates are “garbage” (whatever that means). Yet another approach is to do the meta-analysis allowing for bias; the trouble is that then you can end up with an estimate of zero, and that’s not what people want to hear.

      • While I agree with you, my qualm is about the earlier step of even thinking a meta-analysis is appropriate for this set of studies. It suggests the authors had an ex ante belief that there was some sort of elemental nudginess out there, where it made sense to ask what its average effect was.

        Once you go the meta-analysis route, you commit (or are supposed to commit) to an algorithmic filter for what studies to include, and it’s simply binary. What algorithm would pick up the indicators of garbage-osity that you point out?

        Not a fan of meta-analysis in the social sciences.

        • Well, weighting/modeling, as opposed to filtering, was worked through here, based on Don Rubin’s response surface modeling proposal: “On the bias produced by quality scores in meta-analysis, and a hierarchical view of proposed solutions,” S Greenland, K O’Rourke https://pubmed.ncbi.nlm.nih.gov/12933636/

          The challenge is to implement it credibly for a given collection of studies.

          (Also not a fan of meta-analysis in the social sciences, my work was in clinical research.)

        • Include all studies that are relevant, estimate publication bias, then do a multiverse meta-analysis? One might have to work to interpret results, but it’d provide a ceiling to work from.

          Keith’s comment below is also relevant.

  4. I had been under the impression that the whole point of doing a meta-analysis was to avoid the problem of studies with small sample sizes and noisy estimates giving you an implausible effect. I first heard of the concept when people were using them to show the existence of a publication bias, with the larger studies consistently showing something to have a smaller/insignificant effect size.
