A debate about effect-size variation in psychology: Simmons and Simonsohn; McShane, Böckenholt, and Hansen; Judd and Kenny; and Stanley and Doucouliagos

Posted on April 30, 2019 9:20 AM by Andrew

A couple weeks ago, Uri Simonsohn and Joe Simmons sent me and others a note that they were writing a blog post citing some of our work and asking for us to point out anything that we find “inaccurate, unfair, snarky, misleading, or in want of a change for any reason.”

I took a quick look and decided that my part in this was small enough that I didn’t really have anything to say. But some of my colleagues did have reactions which they shared with the blog authors. Unfortunately, Simonsohn and Simmons did not want to post these replies on their blog or to link to them, so my colleagues asked me to post something here. So that’s what I’m doing.

1. Post by Joe Simmons and Uri Simonsohn

This is the post that started it all, and it begins:

A number of authors have recently proposed that (i) psychological research is highly unpredictable, with identical studies obtaining surprisingly different results, (ii) the presence of heterogeneity decreases the replicability of psychological findings. In this post we provide evidence that contradicts both propositions.

Consider these quotes:

“heterogeneity persists, and to a reasonable degree, even in […] Many Labs projects […] where rigid, vetted protocols with identical study materials are followed […] heterogeneity […] cannot be avoided in psychological research—even if every effort is taken to eliminate it.”
McShane, Tackett, Böckenholt, and Gelman (American Statistician 2019 .pdf)

“Heterogeneity […] makes it unlikely that the typical psychological study can be closely replicated”
Stanley, Carter, and Doucouliagos (Psychological Bulletin 2018 .pdf)

“Repeated investigations of the same phenomenon [get] effect sizes that vary more than one would expect […] even in exact replication studies. […] In the presence of heterogeneity, […] even large N studies may find a result in the opposite direction from the original study. This makes us question the wisdom of placing a great deal of faith in a single replication study”
Judd and Kenny (Psychological Methods 2019 .pdf)

This post is not an evaluation of the totality of these three papers, but rather a specific evaluation of the claims in the quoted text. . .

2. Response by Blakeley McShane, Ulf Böckenholt, and Karsten Hansen

I wish Simmons and Simonsohn had just linked to this, but since they didn’t, here it is. And here’s the summary that McShane, Böckenholt, and Hansen wrote for me to post here:

We thank Joe and Uri for featuring our papers in their blogpost and Andrew for hosting a discussion of it. We keep our remarks brief here but note that (i) the longer comments that we sent Joe and Uri before their post went live are available here (they denied our request to link to this from their blogpost) and (ii) our “Large-Scale Replication” paper that discusses many of these issues in greater depth (especially on page 101) is available here.

A long tradition has argued that heterogeneity is unavoidable in psychological research. Joe and Uri seem to accept this reality when study stimuli are varied. However, they seem to categorically deny it when study stimuli are held constant but study contexts (e.g., labs in Many Labs, waves in their Maluma example) are varied. Their view seems both dogmatic and obviously false (e.g., should studies with stimuli featuring Michigan students yield the same results when conducted on Michigan versus Ohio State students? Should studies with English-language stimuli yield the same results when conducted on English speakers versus non-English speakers?). And, even in their own tightly-controlled Maluma example, the average difference across waves is ≈15% of the overall average effect size.

Further, the analyses Joe and Uri put forth in favor of their dogma are woefully unconvincing to all but true believers. Specifically, their analyses amount to (i) assuming or forcing homogeneity across contexts, (ii) employing techniques with weak ability to detect heterogeneity, and (iii) concluding in favor of homogeneity when the handicapped techniques fail to detect heterogeneity. This is not particularly persuasive, especially given that these detection issues are greatly exacerbated by the paucity of waves/labs in the Maluma, Many Labs, M-Turk, and RRR data and the sparsity in the Maluma data which result in low power to detect and imprecise estimates of heterogeneity across contexts.

Joe and Uri also seem to misattribute to us the view that psychological research is in general “highly unpredictable” and that this makes replication hopeless or unlikely. To be clear, we along with many others believe exact replication is not possible in psychological research and therefore (by definition) some degree of heterogeneity is inevitable. Yet, we are entirely open to the idea that certain paradigms may evince low heterogeneity across stimuli, contexts, or both—perhaps even so low that one may ignore it without introducing much error (at least for some purposes if not all). However, it seems clearly fanatical to impose the view that heterogeneity is zero or negligible a priori. It cannot be blithely assumed away, and thus we have argued it is one of the many things that must be accounted for in study design and statistical analysis whether for replication or more broadly.

But, we would go further: heterogeneity is not a nuisance but something to embrace! We can learn much more about the world by using methods that assess and allow/account for heterogeneity. And, heterogeneity provides an opportunity to enrich theory because it can suggest the existence of unknown or unaccounted-for moderators.

It is obvious and uncontroversial that heterogeneity impacts replicability. The question is not whether but to what degree, and this will depend on how heterogeneity is measured, its extent, and how replicability is operationalized in terms of study design, statistical findings, etc. A serious and scholarly attempt to investigate this is both welcome and necessary!

3. Response by Charles Judd and David Kenny

Judd and Kenny’s response goes as follows:

Joe Simmons and Uri Simonsohn attribute to us (Kenny & Judd, 2019) a version of effect size heterogeneity that we are not sure we recognize. This is largely because the empirical results that they show seem to us perfectly consistent with the model of heterogeneity that we thought we had proposed. In the following we try to clearly say what our heterogeneity model really is and how Joe and Uri’s data seem to us consistent with that model.

Our model posits that an effect size from any given study, 𝑑_i, estimates some true effect size, 𝛿_i, and that these true effect sizes have some variation, 𝜎_𝛿, around their mean, 𝜇_𝛿. What might be responsible for this variation (i.e., the heterogeneity of true effect sizes)? There are many potential factors, but certainly among such factors are procedural variations of the sort that Joe and Uri include in the studies they report.

In the series of studies Joe and Uri conducted, participants are shown two shapes, one more rounded and one more jagged. Participants are then given two names, one male and one female, and asked which name is more likely to go with which shape. Across studies, different pairs of male and female names are used, but always with the same two shapes.

What Joe and Uri report is that across all studies there is an average effect (with the female name of the pair being seen as more likely for the rounded shape), but that the effect sizes in the individual studies vary considerably depending on which name pair is used in any particular study. For instance, when the name pair consists of Sophia and Jack, the effect is substantially larger than when the name pair consists of Liz and Luca.

Joe and Uri then replicate these studies a second time and show that the variation in the effect sizes across the different name-pairs is quite replicable, yielding a very substantial correlation of the effect sizes between the two replications, computed across the different name-pairs.

We believe that our model of heterogeneity can fully account for these results. The individual name-pairs each have a true effect size associated with them, 𝛿_i, and these vary around their grand mean 𝜇_𝛿. Different name-pairs produce heterogeneity of effect sizes. Name-pairs constitute a random factor that moderates the effect sizes obtained. It most properly ought to be incorporated into a single analysis of all the obtained data, across all the studies they report, treating it and participants as factors that induce random variation in the effect of interest (Judd, Kenny, & Westfall, 2012; 2017). . . .

The point is that there are potentially a very large number of random factors that may moderate effect sizes and that may vary from replication attempt to replication attempt. In Joe and Uri’s work, these other random factors didn’t vary, but that’s usually not the case when one decides to replicate someone else’s effect. Sample selection methods vary, stimuli vary in subtle ways, lighting varies, external conditions and participant motivation vary, experimenters vary, etc. The full list of potential moderators is long and perhaps ultimately unknowable. And heterogeneity is likely to ensue. . . .
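
To make the Judd and Kenny model concrete, here is a minimal simulation sketch in Python. It is not their code or Joe and Uri's analysis. It assumes normally distributed true effects and a common standard error for every study, and the numbers (50 name-pairs, a standard error of 0.05) are hypothetical choices for illustration. Each name-pair gets a true effect 𝛿_i drawn around 𝜇_𝛿 with spread 𝜎_𝛿, and two waves then observe 𝛿_i plus independent sampling noise.

```python
import random


def simulate_replications(mu_delta, sigma_delta, se, n_pairs, seed=0):
    """Simulate two waves of studies that share the same true effects.

    mu_delta and sigma_delta follow the notation in the text. se (the
    sampling standard error of each study) and n_pairs are illustrative
    assumptions, not values taken from any of the papers.
    """
    rng = random.Random(seed)
    # One true effect per name-pair, varying around the grand mean.
    true_effects = [rng.gauss(mu_delta, sigma_delta) for _ in range(n_pairs)]
    # Each wave observes the same true effects plus independent noise.
    wave1 = [delta + rng.gauss(0.0, se) for delta in true_effects]
    wave2 = [delta + rng.gauss(0.0, se) for delta in true_effects]
    return wave1, wave2


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def expected_correlation(sigma_delta, se):
    """Population correlation between waves under this simple model."""
    return sigma_delta ** 2 / (sigma_delta ** 2 + se ** 2)


if __name__ == "__main__":
    w1, w2 = simulate_replications(0.5, 0.3, 0.05, n_pairs=50, seed=1)
    print("simulated correlation:", round(pearson(w1, w2), 3))
    print("expected correlation:", round(expected_correlation(0.3, 0.05), 3))
```

Under these assumptions, large heterogeneity across name-pairs and a high correlation between the two waves are two faces of the same variance component, which is the point Judd and Kenny are making.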

4. Response by T. D. Stanley and Chris Doucouliagos

And here’s what Stanley and Doucouliagos write:

Last Fall, MAER-Net (Meta-Analysis of Economics Research-Network) had a productive discussion about the replication ‘crisis,’ and how it could be turned into a credibility revolution. We examined the high heterogeneity revealed by our survey of over 12,000 psychological studies and how it implies that close replication is unlikely (Stanley et al., 2018). Marcel van Assen pointed out that the then recently-released, large-scale, multi-lab replication project, Many Labs 2 (Klein et al., 2018), “hardly show heterogeneity,” and Marcel claimed “it is a myth (and mystery) why researchers believe heterogeneity is omnipresent in psychology.”

Supporting Marcel’s view is the recent post by Joe Simmons and Uri Simonsohn about a series of experiments that are directly replicated a second time using the same research protocols. They find high heterogeneity across versions of the experiment (I^2 = 79%), but little heterogeneity across replications of the exact same experiment.

We accept that carefully-conducted, exact replications of psychological experiments can produce reliable findings with little heterogeneity (MAER-Net). However, contrary to Joe and Uri’s blog, such modest heterogeneity from exactly replicated experiments is fully consistent with the high heterogeneity that our survey of 200 psychology meta-analyses finds and its implication that “it (remains) unlikely that the typical psychological study can be closely replicated” . . .

Because Joe and Uri’s blog was not pre-registered and concerns only one idiosyncratic experiment at one lab, we focus instead on ML2’s pre-registered, large-scale replication of 28 experiments across 125 sites, addressing the same issue and producing the same general result. . . . ML2 focuses on measuring the “variation in effect magnitudes across samples and settings” (Klein et al., 2018, p. 446). Each ML2 experiment is repeated at many labs using the same methods and protocols established in consultation with the original authors. After such careful and exact replication, ML2 finds only a small amount of heterogeneity remains across labs and settings. It seems that psychological phenomena and the methods used to study them are sufficiently reliable to produce stable and reproducible findings. Great news for psychology! But this fact does not conflict with our survey of 200 meta-analyses nor its implications about replications (Stanley et al., 2018).

In fact, ML2’s findings corroborate both the high heterogeneity our survey finds and its implication that typical studies are unlikely to be closely replicated by others. Both high and little heterogeneity at the same time? What explains this heterogeneity in heterogeneity?

First, our survey finds that typical heterogeneity in an area of research is 3 times larger than sampling error (I^2 = 74%; std dev = .35 SMD). Stanley et al. (2018) shows that this high heterogeneity makes it unlikely that the typical study will be closely replicated (p. 1339), and ML2 confirms our prediction!

Yes, ML2 discovers little heterogeneity among different labs all running the exact same replication, but ML2 also finds huge differences between the original and replicated effect sizes . . . If we take the experiments that ML2 selected to replicate as ‘typical,’ then it is unlikely that this ‘typical’ experiment can be closely replicated. . . .

Heterogeneity may not be omnipresent, but it is frequently seen among published research results, identified in meta-analyses, and confirmed by large-scale replications. As Blakeley, Ulf and Karsten remind us, heterogeneity has important theoretical implications, and it can also be identified and explained by meta-regression analysis.
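
The I^2 figures quoted above (74% and 79%) can be related to the "3 times larger than sampling error" statement with a little arithmetic. The sketch below uses the standard definition of I^2 as the share of total variance attributable to between-study heterogeneity, and it assumes that the "3 times" comparison is a ratio of variances; that reading is my assumption, not something stated in the responses. The function names are hypothetical.

```python
def i_squared_from_variances(tau2, sampling_var):
    """I^2 as heterogeneity variance over total variance."""
    return tau2 / (tau2 + sampling_var)


def heterogeneity_ratio(i2):
    """Heterogeneity variance relative to sampling variance implied by I^2."""
    return i2 / (1.0 - i2)


def i_squared_from_q(q, df):
    """I^2 from Cochran's Q and its degrees of freedom, floored at zero."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)


if __name__ == "__main__":
    # I^2 = 74% implies heterogeneity variance of roughly 2.8 times
    # the sampling variance.
    print(round(heterogeneity_ratio(0.74), 2))
    # A variance ratio of exactly 3 corresponds to I^2 = 75%.
    print(i_squared_from_variances(3.0, 1.0))
```

On this reading, I^2 = 74% and "about 3 times sampling error" are the same statement in two forms, and the I^2 = 79% reported across versions of the Simmons and Simonsohn experiment corresponds to a ratio of a bit under 4.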

5. Making sense of it all

I’d like to thank all parties involved—Simmons and Simonsohn; McShane, Böckenholt, and Hansen; Judd and Kenny; and Stanley and Doucouliagos—for their contributions to the discussion.

On the substance of the matter, I agree with Judd and Kenny and Stanley and Doucouliagos that effects do vary (indeed, just about every psychology experiment you’ll ever see is a study of a two-way or three-way interaction, hence varying effects are baked into the discussion), and it would be a mistake to consider variation as zero just because it’s hard to detect variation in a particular dataset (echoes of the hot-hand fallacy fallacy!).

I’ve sounded the horn earlier on the statistical difficulties of estimating treatment effect variation, so I also see where Simmons and Simonsohn are coming from, pointing out that apparent variation can be much larger than underlying variation. Indeed, this relates to the justly celebrated work of Simmons, Nelson, and Simonsohn on researcher degrees of freedom and “false positive psychology”: The various overestimated effects in the Psychological Science / PNAS canon can be viewed as a big social experiment in which noisy data and noisy statistics were, until recently, taken as evidence that we live in a capricious social world, and that we’re buffeted by all sorts of large effects.

Simmons and Simonsohn’s efforts to downplay overestimation of effect-size variability are, therefore, consistent with their earlier work on downplaying overestimates of effect sizes. Remember: just about every effect being studied in psychology is an interaction. So if an effect size (e.g., ovulation and voting) was advertised as 20% but is really, say, 0.2%, then that’s really an effect-size heterogeneity that’s being scaled down.

I also like McShane, Böckenholt, and Hansen’s remark that “heterogeneity is not a nuisance but something to embrace.”

6. Summary

On the technical matter, I agree with the discussants that it’s a mistake to think of effect-size variation as negligible. Indeed, effect-size variation—also called “interactions”—is central to psychology research. At the same time, I respect Simmons and Simonsohn’s position that effect-size variation is not as large as it’s been claimed to be—that’s related to the uncontroversial statement that Psychological Science and PNAS have published a lot of bad papers—and that overrating of the importance of effect-size variation has led to lots of problems.

It’s too bad that Uri and Joe’s blog doesn’t have a comment section; fortunately we can have the discussion here. All are welcome to participate.

P.S. Anton Olsson-Collentine, Jelte Wicherts, and Marcel van Assen send along this related paper, Heterogeneity in direct replications in psychology and its association with effect size, where they conclude:

Our findings thus show little evidence of widespread heterogeneity in direct replication studies in psychology, implying that citing heterogeneity as a reason for non-replication of an effect is unwarranted unless predicted a priori.

31 thoughts on “A debate about effect-size variation in psychology: Simmons and Simonsohn; McShane, Böckenholt, and Hansen; Judd and Kenny; and Stanley and Doucouliagos”

Anon on April 30, 2019 9:53 AM at 9:53 am said:

The fact that McShane and colleagues sent longer comments that Simmons and Simonsohn refused to link to in their blog post seems super sketchy. They should change their practices. Or else they should make it clear to readers that they only provide opportunities for sound-bite reactions from the people they blog about.

- Andrew on April 30, 2019 9:56 AM at 9:56 am said:
  
  Anon:
  
  I wouldn’t call what Simmons and Simonsohn did “super sketchy.” If they want to run a blog without comments, that’s their call. I do think it’s too bad, though, as both they and their readers could learn a lot from comments. I know I have.
  
  - sentinel chicken on April 30, 2019 10:43 AM at 10:43 am said:
    
    I’ve always found it troubling that they don’t allow comments and have set up DataColada to blunt any public discussion about their work. I can appreciate that it would be time consuming and tedious to try to keep up with comments, deal with all the trolling and find time to occasionally reply, but you have made it clear that this is possible and can enhance the blog. Sending ‘pre-prints’ to relevant researchers and authors is a total CYA move. It leaves one with the impression that they can dish it out but they can’t take it.
    
    - Anon on April 30, 2019 2:28 PM at 2:28 pm said:
      
      +1
    - Zad Chow on May 1, 2019 12:20 AM at 12:20 am said:
      
      Yes, this is seriously concerning. I’m not sure what good reason they have for prohibiting comments but it doesn’t look very good.
    - Austin Fournier on May 1, 2019 2:45 PM at 2:45 pm said:
      
      The comments thing has annoyed me too from time to time, but it should be noted that they’ve linked to longer author responses in the past. Just not this time, I guess?
Anoneuoid on April 30, 2019 10:03 AM at 10:03 am said:

“Many Labs projects […] where rigid, vetted protocols with identical study materials are followed […] heterogeneity […] cannot be avoided in psychological research—even if every effort is taken to eliminate it.”

I seem to remember the methods being “improved” (altered) for at least a couple of those “many labs project” replications.

Ulrich Schimmack on April 30, 2019 11:02 AM at 11:02 am said:

Let’s understand the motivated reasoning why they care so much about low or no heterogeneity.

Their statistical tool, p-curve, assumes homogeneity and produces biased estimates when heterogeneity is moderate or large. A better statistical tool, z-curve, does not make assumptions about homogeneity and can produce good estimates with large heterogeneity (Brunner & Schimmack, 2019).

Too bad, when somebody needs to make obviously false dogmatic assumptions to justify a model that makes assumptions that are often violated.

https://replicationindex.com/2018/10/19/an-introduction-to-z-curve/

- Anoneuoid on April 30, 2019 11:21 AM at 11:21 am said:
  
  if a set of studies with mean power of 60% were replicated exactly (including sample sizes), we would expect that 60% of the replication studies produce a significant result again.
  
  The replication study should have a much larger sample size, large enough to ensure none of these “low power” arguments apply. If that isn’t worth doing that then skip the replication because it will just be counterproductive.
  
  - Anonymous on April 30, 2019 11:40 AM at 11:40 am said:
    
    It is an unreasonable burden to demand much larger sample sizes from replication studies.
    
    We don’t know the real power. And if the original research had good power, a study with same sample size should be able to replicate result most of the time.
    
    - Anoneuoid on April 30, 2019 12:25 PM at 12:25 pm said:
      
      It really isn’t. You can see the results when you do not have overwhelming sample size: more bickering and excuses.
      
      If it is worth doing, it is worth doing right. That means every (non-pilot) study should be replicated, and those replications should have sample size large enough to avoid any complaints about imprecision of the result. If that means only funding 1/4 the current number of studies, great! It will cut down on the amount of BS being generated.
    - Anoneuoid on April 30, 2019 12:32 PM at 12:32 pm said:
      
      And if the original research had good power, a study with same sample size should be able to replicate result most of the time.
      
      If the original study turned out to have high enough power to avoid complaints/excuses about low power then fine.
    - Carlos Ungil on April 30, 2019 1:50 PM at 1:50 pm said:
      
      > We don’t know the real power.
      
      What is the real power?
      
      > And if the original research had good power, a study with same sample size should be able to replicate result most of the time.
      
      It depends on how likely the existence of a “true” effect was in the first place. If there is no effect to be found, the probability of “replicating” a statistically significant result is 5%.
      
      The statement in that blog (“For example, if a set of studies with mean power of 60% were replicated exactly (including sample sizes), we would expect that 60% of the replication studies produce a significant result again.”) requires some not-so-mild assumptions. Let’s assume that the true effect is zero or the one corresponding to a 60% power. If both are equally likely the probability of getting p<0.05 again is 56%. But if the distribution of true effect sizes is 90% zero / 10% non-zero then the probability of getting p<0.05 again is only 36%.
    - Anoneuoid on April 30, 2019 2:20 PM at 2:20 pm said:
      
      Carlos, I assume there will be a significant effect 100% of the time with sufficient power. So, in the worst case scenario 50% of studies will replicate in the same direction (assuming the original was always “significant”).
      
      It is pretty much either that or the replication “effect” will be constrained to be so small it will be considered negligible for the types of studies we are considering here.
    - Carlos Ungil on April 30, 2019 3:14 PM at 3:14 pm said:
      
      Note that I was replying to a comment about what to expect from replications with the same power / sample size.
      
      But I agree that if the sample size is infinite the power is 100% for any effect size epsilon > 0 and (if I remember correctly) we will almost surely get a statistically significant result at some point even if the true effect size is zero. And anyways all the models are wrong, etc.
    - Anoneuoid on April 30, 2019 10:34 PM at 10:34 pm said:
      
      Doesn’t require “infinite power”, and in fact “power” is the wrong way to look at it altogether (not that this is being done). Start with looking at the width of whatever interval you want, then compare that to other possible intervals based on other explanations for the same observation.
- Thanatos Savehn on April 30, 2019 11:36 PM at 11:36 pm said:
  
  This. We all tend to fall into the holes we dig ourselves.
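
The replication arithmetic in Carlos Ungil's comment above can be reproduced with Bayes' rule. This is a minimal sketch, not his calculation as written: it assumes, as his comment does, that the true effect is either exactly zero or exactly the size that gives the stated power, that the original study was statistically significant, and that "replicating" means getting p<0.05 again at the same sample size. The function name and the `prior_effect` parameter are hypothetical.

```python
def replication_probability(prior_effect, power, alpha=0.05):
    """Chance that an exact replication of a significant study is significant.

    prior_effect is the assumed prior probability that the effect is real;
    with probability 1 - prior_effect the true effect is zero.
    """
    # Probability that the original study came out significant.
    p_significant = prior_effect * power + (1.0 - prior_effect) * alpha
    # Posterior probability that the effect is real, given significance.
    posterior_effect = prior_effect * power / p_significant
    # A real effect replicates with probability `power`, a null with `alpha`.
    return posterior_effect * power + (1.0 - posterior_effect) * alpha


if __name__ == "__main__":
    for prior in (0.5, 0.1, 1.0):
        print(prior, round(replication_probability(prior, power=0.6), 2))
```

With a 50/50 prior this gives about 56%, with a 90/10 prior about 36%, and the 60% figure from the z-curve post appears only when every studied effect is real, which is the "not-so-mild" assumption the comment points to.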
  
Gary McClelland on April 30, 2019 11:32 AM at 11:32 am said:

It is difficult for me to reconcile the most recent Data Colada on heterogeneity of effect size with Data Colada #33 (http://datacolada.org/33), which argues “‘The’ Effect Size Does not Exist.”

- Andrew on April 30, 2019 11:52 AM at 11:52 am said:
  
  Gary:
  
  It’s not just them. I think there’s been a fundamental incoherence for a long time in how we think about effect size. The first incoherence is that there’s a huge focus on the “average causal effects,” but by even talking about averages we’re admitting the effects can vary, and then why is the average so important? The second incoherence is that we tend to speak of effects of treatments, and we expect these to replicate, but just about all the treatment effects we study are actually interactions, and once you accept that there can be large, important, and persistent two-way interactions, it’s not clear why you can’t also have large three-way interactions (i.e., varying effect sizes).
  
  - Martha (Smith) on May 1, 2019 1:53 AM at 1:53 am said:
    
    +1
    
- Austin Fournier on May 1, 2019 2:42 PM at 2:42 pm said:
  
  Wasn’t Data Colada #33 referring to different experiments testing the same theory, whereas #76 is talking about replications of the same experiment?
  
Corey on April 30, 2019 12:16 PM at 12:16 pm said:

This makes a pleasant change from the p-value discussion in that it’s about the way things actually are and not just about the proper use and interpretation of a particular tool. There is at least the potential for a resolution that appeals to facts about the world.

Z on April 30, 2019 12:46 PM at 12:46 pm said:

Andrew, I don’t think your reduction of the discussion to the existence and size of interactions is totally appropriate. It’s obvious that there is tons of subject-level effect heterogeneity (how often do you see an RCT where everyone is cured?), and this subject-level heterogeneity is certainly explained by interactions. But even if interactions are large and prevalent, this would not imply that there is lots of *study-level* heterogeneity, which is what Simonsohn and Simmons are talking about. Study-level heterogeneity might be roughly broken up into two sources–heterogeneity of participant pools, and heterogeneity of implementation. Heterogeneity of participant pools occurs if underlying factors that contribute to the treatment effect (i.e. interaction variables) are distributed differently across participant pools in different replications of the same (vaguely specified) study. Heterogeneity of implementation occurs when version of treatment or setting differs across replications, and when there is an interaction between these variables that differ across implementations and the treatment effect. One can think that existence of large interaction variables is common, but they do not tend to differ across replications’ participant pools or include design features that differ across replications. This seems reasonable to me a priori, since: the variables that differ across participant pools and implementations are a small subset of all variables, I see no reason for interaction variables to be especially likely to vary across studies, and there shouldn’t be too many big interaction variables by the piranha argument. But it is still an empirical question how much study level heterogeneity there is, and of course the answer will depend on the study and the set of replications. 
I think it makes sense to look at lots of examples where there is no glaring source of heterogeneity across the replication set and see how often and by how much variation in results across studies exceeds what could be explained by sampling variability. This can generate a sort of prior distribution on the degree of study level heterogeneity. So far Simonsohn and Simmons have only looked at a few examples of replication sets I believe, but I like where they’re headed with this.

- Andrew on April 30, 2019 1:31 PM at 1:31 pm said:
  
  Z:
  
  That’s all fine, but ultimately I’m interested in treatment effects. I see the analysis of study-level heterogeneity as a means to the end of understanding variation in treatment effects. For the reasons discussed by Judd and Kenny, it makes sense to me that studies that are more similar to each other will show less heterogeneity than studies that are more different from each other. To understand how effects vary by person and situation, I think it will make more sense to try to study that variation directly, rather than via meta-analysis of published results.
  
  - Z on April 30, 2019 3:58 PM at 3:58 pm said:
    
    “That’s all fine, but ultimately I’m interested in treatment effects. I see the analysis of study-level heterogeneity as a means to the end of understanding variation in treatment effects.”
    
    I’m also more interested in treatment effects and (subject-level) variation in treatment effects than study-level heterogeneity per se. But I think Simonsohn and Simmons are interested in study-level heterogeneity because of its implications for thinking about replications of studies targeting average effects, not as a way to learn about subject-level heterogeneity. Which doesn’t mean you shouldn’t be more interested in the implications of their work for subject-level heterogeneity, but for reasons laid out above I think it would be tricky to try to use study-level heterogeneity as a window into subject-level heterogeneity.
    
  - Tom M on April 30, 2019 6:38 PM at 6:38 pm said:
    
    Andrew,
    I agree that we should generally be interested in individual-level treatment heterogeneity, and that meta-analysis doesn’t address this. But Z is making an important distinction that seems to be muddled in some of the discussion. The importance of study heterogeneity depends a lot on the field. I work in mental health research at a medical center. With both behavioral and biological outcomes, between-study heterogeneity is a big problem. If an RCT yields very different results in different replications, that’s bad. Sure, we don’t expect the same effects from different studies – the populations are different, the environments are different, the clinicians administering the treatment are different (especially important if the treatment is behavioral rather than pharmacological), the labs doing biological assays are different, etc. But if the results vary more than you can reasonably expect from what you know about the different conditions, you’re stuck. You can’t recommend, or even really study, a treatment that is that sensitive to those conditions, and this is not unheard of in this field (in fact it’s more common when the outcomes are biological rather than behavioral). On the other hand, if the distributions of treatment effects are reasonably similar across studies, say 2-3 times what you would expect from random sampling variance alone, then you’re in good shape: you can now look for patient-level moderators of treatment effects, which is ultimately what you want. By the way, I know many psychologists and psychiatrists, and here’s the ironic thing: Those who are primarily clinicians typically not only embrace treatment heterogeneity, it’s practically what they live for — it’s what makes things interesting. But those who are primarily researchers often seem to lose sight of this, perhaps succumbing to “physical science envy”, but I think also because the research and statistical methods they know are focused on mean effects.
    
    - Martha (Smith) on May 1, 2019 1:57 AM at 1:57 am said:
      
      Interesting. Thanks.
    - Solomon Kurz on May 1, 2019 10:14 PM at 10:14 pm said:
      
      +1
Keith O'Rourke on April 30, 2019 2:48 PM at 2:48 pm said:

> I see the analysis of study-level heterogeneity as a means to the end of understanding variation in treatment effects.

My early wording in the wiki entry for meta-analysis tried to make that the primary focus of meta-analysis but it got more and more watered down to “In addition to providing an estimate of the unknown common truth, meta-analysis has the capacity to contrast results from different studies”

Perhaps all the variations in what drives apparent and _real_ heterogeneity as well as how it can be addressed makes communication very challenging (e.g. https://statmodeling.stat.columbia.edu/2017/11/01/missed-fixed-effects-plural/ ) along with the avoidance of informative priors even when faced with terribly noisy sample based heterogeneity estimates.

Thanatos Savehn on May 1, 2019 at 1:48 am said:

Now do the synthesis Greenland, Mayo, Rothman, Colquhoun, Althouse, Chow, Harrell, et al. are trying, via argument, to create on Twitter.

Anonymous on May 2, 2019 at 5:13 am said:

Quote from above: “It’s too bad that Uri and Joe’s blog doesn’t have a comment section; fortunately we can have the discussion here. All are welcome to participate.”

Yes! I have been annoyed with that for some time now, not only concerning the blog you mention but also concerning other blogs from other people or institutions, who are sometimes even supposedly all about being “open”, and “inclusive”, and all for “discussions”, and all that good stuff. (And a comment section alone is not enough, there should also be no filter except for valid reasons. I have found too many times that my, often critical, comments on blogs are not published).

I have recently contacted one of the “Data Colada” people concerning another one of their blogs with, what i feel were, possibly useful comments and thoughts pertaining to their blog post, and possibly related paper. I received a nice reply, but i felt the reply could possibly not be correct/valid. I replied to make that clear, but that seemed to be the end of the exchange. I am not a “famous” professor (i am not even an academic), and i may have written stupid things, or have written them in a stupid manner, but i still think that it would be scientifically useful if the blog would have a comment section. I could have simply posted my replies there, and others could join in, possibly point out any mistakes in my reasoning, others could think about the comment, etc. Just like i am doing in this “Statmodeling” blog, which to me is a shining example, and a light in the darkness, concerning all this stuff.

What i have noticed is that some of these blogs with no comment sections get talked about or re-tweeted or “liked” via twitter and facebook a lot, where i fear only a subset of people (get to) participate. I have likened this to the “old boys/girls club” and editorial power and unscientific processes that (sometimes?) go on at “top” journals: just like some “special” folks only get published, and get asked to review, in the “top” journals, so do some “special folks” mainly get all the attention concerning their blog posts.

And just like the “reply to X” exchanges that then happen at “top” journals, on twitter and facebook you often then get the “2nd wave” of “special folks” who feel the need to, and receive, all the attention when they reply to the blog post. And on twitter there also sometimes are the “3rd” wave of comments from (what i think might be) the “i-want-to-be-special-folks, but-i-am-not-part-of-that-group-yet” who (to me at least) always seem to over-enthusiastically confirm and/or praise the original blog post folks, and the “2nd wave” comments/folks. I feel it’s quite funny, and sad at the same time, to (at least think to) see this process happen.

From a slightly bigger picture perspective, it recently occurred to me that (at least for me) a lot of recent (proposed) “improvements” might very well simply be some different forms/versions of the exact possibly problematic issues that may have contributed to the current possible mess in academia.

I have thought about trying to see how many of these i could possibly find, and if i could come up with a substantial list i could write and post a short sarcastic pre-print paper pointing them all out. I could borrow, and slightly change, Bargh’s title of one of his (in-)famous blogposts “Priming effects replicate just fine, thanks” (also possibly see https://statmodeling.stat.columbia.edu/2016/02/12/priming-effects-replicate-just-fine-thanks/), and call the pre-print “Psychology replicates just fine, thanks”.

I could then list all the possibly problematic processes, entities, and issues that (i feel) might currently be being (at least “conceptually” :P) replicated via several (proposed) “improvements”, and current processes and entities. But i don’t want to (try and) publish anything anymore, spend too much time on all that stuff, and i wonder if it would have any use.


Statistical Modeling, Causal Inference, and Social Science
