Exploring some questions about meta-analysis (using ivermectin as an example), with R and Stan code

We had a long discussion in comments the other day about some debates regarding studies of ivermectin for treating covid. Some of the discussion touched on general principles of meta-analysis, including challenges with Bayesian meta-analysis, so I thought I’d expand on it all here.

tl;dr. When doing meta-analysis, look at the estimated distribution of possible effects, not just at the estimate of the average across all studies.

The new study from Malaysia

It all started with our discussion of a recently published study from Malaysia that reported:

52 of 241 patients (21.6%) in the ivermectin group and 43 of 249 patients (17.3%) in the control group progressed to severe disease (relative risk [RR], 1.25; 95% CI, 0.87-1.80 . . . . In this randomized clinical trial of high-risk patients with mild to moderate COVID-19, ivermectin treatment during early illness did not prevent progression to severe disease. The study findings do not support the use of ivermectin for patients with COVID-19.

When posting on this study, I wrote that the conclusion, “The study findings do not support the use of ivermectin for patients with COVID-19,” is literally true but is misleading if “do not support the use” is taken to imply “provide strong evidence against the use.” To put it another way, it would be a mistake to see this study and conclude: Negative finding, published in JAMA, therefore ivermectin doesn’t work. Better to say this is one data point that is consistent with what is already believed, which is that this treatment is neither devastating nor devastatingly effective.

To put it another way, a 95% interval of 0.87 to 1.80 rules out huge effects but it doesn’t rule out a moderate effect such as a 10% benefit. Perhaps more to the point, I’d expect that, if the drug is effective, its effectiveness would vary a lot by context. Indeed, some proponents of ivermectin for covid have argued that the conditions of this particular experiment were not set up to allow a large effect. Without taking a stance on that particular question, let me just say that there are lots of reasons we’d expect the effect of ivermectin on covid to vary across studies: different outcome measurements, different treatments, different control comparisons, and different patients being treated at different stages of infection.
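The reported interval can be roughly reproduced from the counts in the abstract. Here is a minimal sketch in base R, assuming the usual large-sample standard error for a log risk ratio (the paper's exact method may differ); the variable names are just for illustration.

```r
# Counts reported in the study abstract
events_trt <- 52; n_trt <- 241   # ivermectin group
events_ctl <- 43; n_ctl <- 249   # control group

# Risk ratio of progression to severe disease
rr <- (events_trt / n_trt) / (events_ctl / n_ctl)

# Large-sample standard error of log(RR) (an assumed approximation)
se_log_rr <- sqrt(1/events_trt - 1/n_trt + 1/events_ctl - 1/n_ctl)

# 95% interval computed on the log scale, then exponentiated
ci <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)
print(round(c(rr = rr, lower = ci[1], upper = ci[2]), 2))

# A 10% benefit (RR = 0.9) sits inside the interval
benefit_10pct_inside <- ci[1] < 0.9 && 0.9 < ci[2]
print(benefit_10pct_inside)
```

The standard error on the log scale comes out near 0.19, which is the number to keep in mind when asking what size of effect this one study could detect.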

All of this is to say that it makes sense to think in a meta-analytic framework of a distribution of effects rather than “the effect.”

The effect size

As noted above, the study rules out huge effect sizes (for the conditions of the experiment) but is consistent with the treatment having small benefits or fairly large negative effects. If you think any effects are likely to be small (which is my usual prior), then the study doesn’t provide much information.

But then in comments Greg Kellogg wrote:

They did power analysis with an estimated 17.5% progression in control group, and powered the study to observe if ivermectin dropped it by half. That seems like a large effect size on the surface, but it’s exactly what ivermectin advocates have been saying Ivermectin would do. . . .

I see the point made by Andrew and others that the study itself is underpowered if they were looking for 20% reduction in progression to severe disease. But (and perhaps this is a little cynical) the only reason these studies keep being done on Ivermectin for Covid-19 is that Ivermectin advocates have been describing it as a miracle treatment. So is it so unreasonable to set a target at 50% reduction?
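To get a feel for the numbers in that comment, here is a rough power calculation using `power.prop.test` from base R's stats package. It is only a sketch: the 245 patients per arm is my assumption (close to the 241 and 249 actually analyzed), not the sample size from the trial protocol.

```r
n_per_arm <- 245      # assumption: roughly the 241 and 249 patients analyzed
p_control <- 0.175    # progression rate assumed in the power analysis

# Power of a two-sided 5% test to detect a given relative reduction
power_for_reduction <- function(reduction) {
  power.prop.test(n = n_per_arm, p1 = p_control,
                  p2 = p_control * (1 - reduction))$power
}

power_half  <- power_for_reduction(0.50)   # ivermectin halves progression
power_fifth <- power_for_reduction(0.20)   # a more modest 20% reduction
print(round(c(halving = power_half, reduction_20pct = power_fifth), 2))
```

Under these assumptions the design has good power for a halving of risk and poor power for a 20% reduction, which is the tension the comment describes.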

The meta-meta-analysis

In comments, Kejo pointed to a meta-meta-analysis of ivermectin on covid, performed by Alexandros Marinos. I call it a meta-meta-analysis because Marinos shows multiple meta-analyses. He starts with a meta-analysis based on 29 studies from a website called ivmmeta.com, and the result looks pretty impressive: the estimated average effect is a risk ratio of 0.33 with a narrow uncertainty interval. So, yeah, Greg was right that there are people claiming huge positive effects from this treatment.

There are problems with that meta-analysis, though. As psychiatrist blogger Scott Alexander discusses, a bunch of those studies are really bad. We’ve discussed the GIGO problems with meta-analysis before; in short, you just can’t do much when you include data points with huge unknown biases. The first step is to throw them out.

Marinos continues with some other meta-analyses, finally presenting this:

It’s a meta-analysis including a bunch of studies that Scott Alexander excluded in his writeup along with a couple others that some ivermectin advocates wanted to include. Again, the result looks pretty impressive: an estimated risk ratio of 0.38 (that’s a huge average benefit) with 95% interval (0.19, 0.97), which ranges from a really really huge benefit to no effect (but no harm). From this perspective, yeah, the evidence in favor of this treatment is strong!

I was kinda suspicious, though, first because it seemed doubtful that the medical establishment would be so suspicious of this treatment if it were really so effective as all that, and second because given the data summaries above, I was surprised that the meta-analysis inference was so strong (the estimate at the bottom of the above graph).

So I typed in the numbers and did my own meta-analysis, using the standard Bayesian template as in chapter 5 of our book Bayesian Data Analysis. Here goes:

The data, stored in the file meta_analysis_data.txt:

study       est  se
Ahmed       -1.897 1.505
Bukhari     -1.715 0.489
Buonfrate    1.955 1.505
Chaccour    -3.219 1.681
Krolewiecki  0.924 1.634
Lopez       -1.109 1.668
Mahmud      -1.966 1.551
Mohan       -0.968 0.795
Ravikirti   -2.207 1.523
Together    -0.198 0.321
Vallejos     0.285 0.760

The Stan program, stored in the file meta_analysis.stan:

data {
  int J;             // number of studies
  vector[J] est;     // estimated effects (log scale)
  vector[J] se;      // standard errors of the estimates
}
parameters {
  real mu;
  real<lower=0> tau;
  vector<offset=mu, multiplier=tau>[J] theta;
}
model {
  est ~ normal(theta, se);
  theta ~ normal(mu, tau);
}
generated quantities {
  real theta_new = normal_rng(mu, tau);
}

The R code:

library("cmdstanr")  # provides cmdstan_model()
data <- read.table("meta_analysis_data.txt", header=TRUE)
stan_data <- list(est=data$est, se=data$se, J=nrow(data))
model <- cmdstan_model("meta_analysis.stan", pedantic=TRUE)
fit <- model$sample(data=stan_data, parallel_chains=4, refresh=0)
fit$print(max_rows=20)  # show all 14 rows of the summary

And the output:

  variable   mean median   sd  mad     q5   q95 rhat ess_bulk ess_tail
 lp__      -11.05 -10.77 3.34 3.29 -17.08 -6.00 1.00      959     1541
 mu         -0.81  -0.80 0.43 0.39  -1.53 -0.13 1.00     1490     1829
 tau         0.90   0.81 0.54 0.47   0.19  1.85 1.00     1122     1039
 theta[1]   -1.09  -1.00 0.89 0.77  -2.71  0.24 1.00     3504     2793
 theta[2]   -1.40  -1.39 0.49 0.50  -2.21 -0.61 1.00     2541     2621
 theta[3]   -0.07  -0.24 0.96 0.82  -1.34  1.72 1.00     2577     2346
 theta[4]   -1.33  -1.17 0.96 0.85  -3.09  0.02 1.00     3191     3143
 theta[5]   -0.41  -0.47 0.87 0.74  -1.69  1.15 1.00     3340     2922
 theta[6]   -0.88  -0.83 0.88 0.73  -2.41  0.51 1.00     4462     3107
 theta[7]   -1.08  -1.00 0.84 0.73  -2.56  0.15 1.00     4143     3332
 theta[8]   -0.87  -0.84 0.61 0.57  -1.93  0.11 1.00     5155     3330
 theta[9]   -1.14  -1.05 0.87 0.79  -2.72  0.10 1.00     3534     3384
 theta[10]  -0.31  -0.32 0.31 0.32  -0.82  0.21 1.00     3389     2908
 theta[11]  -0.23  -0.28 0.61 0.61  -1.13  0.85 1.00     3447     3284
 theta_new  -0.81  -0.79 1.14 0.84  -2.67  1.00 1.00     3015     3088

All this is assuming that the numbers in that table are correct summaries of the studies. Everything's on the logit (log-odds) scale, so I can exponentiate to get risk ratios. The result:

- Inference for mu, the average effect in the hypothetical superpopulation of studies: the estimate is -0.81 with a 90% posterior interval of [-1.53, -0.13]. Exponentiating gives an estimate of 0.44 and a 90% interval of [0.22, 0.87]. These are ratios relative to 1 (no effect), so 0.44 corresponds to multiplying the odds by 0.44, etc.

- Inference for theta_new, the effect in a new study sampled from the hypothetical superpopulation: the estimate is -0.81 with a 90% interval of [-2.67, 1.00]; exponentiating gives an estimate of 0.44 and a 90% interval of [0.07, 2.71], i.e. a new study could have a true effect on the odds of a bad outcome of somewhere between a factor of 0.07 and a factor of 2.7. Neither of these extremes sounds even remotely plausible; the wide uncertainty arises from the flat prior on tau.
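The exponentiation step is simple enough to check by hand. Here is a small sketch that copies the rounded summaries from the output table and transforms them; because it starts from rounded numbers, the last digit can differ slightly from what you would get from the raw posterior draws.

```r
# Posterior summaries copied (rounded) from the fit above, on the log scale
log_summaries <- rbind(
  mu        = c(estimate = -0.81, q5 = -1.53, q95 = -0.13),
  theta_new = c(estimate = -0.81, q5 = -2.67, q95 =  1.00)
)

# Quantiles carry over directly under a monotone transformation such as exp()
ratio_summaries <- exp(log_summaries)
print(round(ratio_summaries, 2))
```

Note that exponentiating a posterior mean is not the same as taking the posterior mean of the exponentiated draws; the quantiles are the safer thing to transform.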

I'm not claiming that my simple Bayesian meta-analysis is definitive; I'm just using it to explore what these particular numbers can tell us, and where additional assumptions can come in.
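As a rough cross-check that doesn't need Stan, the same normal hierarchical model with flat priors can be approximated on a grid of tau values, using the marginal-posterior approach for this model described in chapter 5 of Bayesian Data Analysis. This is a sketch, not the analysis above: the grid range and spacing are my assumptions, and the function name `summarize_tau` is made up for the example.

```r
est <- c(-1.897, -1.715, 1.955, -3.219, 0.924, -1.109,
         -1.966, -0.968, -2.207, -0.198, 0.285)
se  <- c(1.505, 0.489, 1.505, 1.681, 1.634, 1.668,
         1.551, 0.795, 1.523, 0.321, 0.760)

tau_grid <- seq(0.005, 8, by = 0.01)  # assumption: negligible mass above 8

# For one value of tau: conditional posterior mean and variance of mu, and the
# log marginal posterior density of tau (flat priors on both mu and tau)
summarize_tau <- function(tau) {
  v <- se^2 + tau^2
  v_mu <- 1 / sum(1 / v)
  mu_hat <- v_mu * sum(est / v)
  log_post <- 0.5 * log(v_mu) - 0.5 * sum(log(v)) -
    0.5 * sum((est - mu_hat)^2 / v)
  c(mu_hat = mu_hat, v_mu = v_mu, log_post = log_post)
}

grid <- t(sapply(tau_grid, summarize_tau))

# Normalized posterior weights over the tau grid
weight <- exp(grid[, "log_post"] - max(grid[, "log_post"]))
weight <- weight / sum(weight)

# Average over tau to get marginal summaries of mu and tau
post_mean_tau <- sum(weight * tau_grid)
post_mean_mu  <- sum(weight * grid[, "mu_hat"])
post_sd_mu <- sqrt(sum(weight * (grid[, "v_mu"] + grid[, "mu_hat"]^2)) -
                   post_mean_mu^2)
print(round(c(mu_mean = post_mean_mu, mu_sd = post_sd_mu,
              tau_mean = post_mean_tau), 2))
```

If the grid summaries land near the Stan summaries for mu and tau, that's some reassurance that nothing went wrong in the fitting; it says nothing about whether the model or the data are any good.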

Anyway, the point is that Marinos's strong conclusion is coming essentially from this one study looking at the PCR results of 86 patients after a 14-day trial:

What about that Malaysia study? It reports results on progression to severe disease on 490 patients. As I wrote above, I didn't find the result to be particularly newsworthy on its own, as I don't typically expect to see controversial treatments showing factor-of-2 effects in either direction---but this study seems stronger than the Bukhari study that is driving the meta-analysis.

A key question seems to be if ivermectin can have huge effects in some settings. I'm generally suspicious of claims of huge effects, but, sure, I get it that some treatments really can have large effects. That's why this discussion has convinced me that the Malaysia study is more valuable than I'd thought at first. In a scientific context, it's just one data point. But in the context of these debates, this one data point can make a real difference.

Another way of putting things is that, in the above meta-analysis, most of the information must be coming from the Bukhari paper, as the confidence interval from that study is so much narrower than all the others. Throwing this in with the new study from Malaysia, we can either conclude that one or both of these experiments is severely flawed, or we can conclude that the true effect of ivermectin varies a lot between the two studies.
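One way to see how much a single study is driving things is to drop each study in turn and recompute a pooled estimate. The sketch below uses simple inverse-variance (fixed-effect, tau = 0) pooling as a quick stand-in; it is not the hierarchical model fit above, and the helper `pool` is just a name I made up.

```r
study <- c("Ahmed", "Bukhari", "Buonfrate", "Chaccour", "Krolewiecki", "Lopez",
           "Mahmud", "Mohan", "Ravikirti", "Together", "Vallejos")
est <- c(-1.897, -1.715, 1.955, -3.219, 0.924, -1.109,
         -1.966, -0.968, -2.207, -0.198, 0.285)
se  <- c(1.505, 0.489, 1.505, 1.681, 1.634, 1.668,
         1.551, 0.795, 1.523, 0.321, 0.760)

# Inverse-variance pooled estimate and its standard error for a subset of studies
pool <- function(keep) {
  w <- 1 / se[keep]^2
  c(est = sum(w * est[keep]) / sum(w), se = 1 / sqrt(sum(w)))
}

all_studies <- pool(rep(TRUE, length(est)))

# Pooled estimate with each study left out in turn
leave_one_out <- t(sapply(seq_along(est), function(j) pool(seq_along(est) != j)))
rownames(leave_one_out) <- study

# How far the pooled estimate moves when each study is dropped
shift <- leave_one_out[, "est"] - all_studies["est"]
print(round(all_studies, 2))
print(round(cbind(leave_one_out, shift = shift), 2))
```

With these inputs, dropping Bukhari pulls the pooled estimate toward zero more than dropping any other single study, and the remaining estimate is within two standard errors of zero.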

You can look at the comment thread with Sander Greenland for more on this meta-meta-analysis. The Bukhari study and the Malaysia study are comparable in that they are both relatively large (compared to some of the studies under consideration).

Another meta-analysis

A commenter also pointed us to a post by Gideon Meyerowitz-Katz, "The Jury is Still Out on Ivermectin: Why the new ivermectin study doesn’t tell us much about whether the drug is effective for Covid-19," which sounds about right. Or, at least, it's pretty much what I said in my original post: this new study is one more data point. Katz writes: "taking this study in isolation would traditionally mean that we don’t really know if ivermectin works or not, and leave it at that. While it was well-conducted and reasonably large, it wasn’t huge nor definitive enough to make any recommendations from."

He also does a meta-analysis, but it's slightly different from the one performed by Marinos and shown above, because he's only looking at the survival outcome: all-cause mortality. The Bukhari study doesn't come in at all (as it seems that nobody died in either treatment or control group in that experiment), and the Malaysia study provides very little information (3 deaths out of 241 in the treatment group and 10 out of 249 in the control). Anyway, here it is:

As with Marinos's meta-analysis shown earlier, the conclusion looks suspiciously precise to me: You're putting in all this super-noisy data and you come out with this very narrow overall interval. Katz's take-home point is that the result is not statistically significantly different from zero, but the whole thing just bothers me.

So, again, I'll do my own meta-analysis.

The data, stored in the file meta_analysis_data_2.txt:

study       est   c2.5  c97.5
Lopez       0.24  0.01   5.87
Mahmud      0.14  0.01   2.70
Ravikirti   0.12  0.01   2.09
Rezal       2.92  0.12  69.20
Fonseca     1.06  0.58   1.94
Gonzalez    0.86  0.29   2.56
Hashim      0.15  0.01   2.40
Vallejos    1.34  0.30   5.92
Together    0.82  0.44   1.51
I-tech      0.31  0.09   1.13

(I-tech is the Malaysia study.)

The same Stan program as before.

The R code:

data_2 <- read.table("meta_analysis_data_2.txt", header=TRUE)
est_log <- log(data_2$est)
se_log <- log(data_2$c97.5 / data_2$est) / 2
stan_data_2 <- list(est=est_log, se=se_log, J=nrow(data_2))
fit_2 <- model$sample(data=stan_data_2, parallel_chains=4, refresh=0)

Here I reverse-engineered the standard errors by assuming the distance between the estimate and the upper end of the 95% interval (on the log scale) for each study was two s.e.'s.
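To see how well that shortcut works, here is a check on one row, the I-tech (Malaysia) study, where the death counts were quoted earlier. It is a sketch assuming the usual large-sample standard error for a log risk ratio; the alternative formulas are only there for comparison.

```r
# I-tech (Malaysia) mortality row from the table above
est <- 0.31; lower <- 0.09; upper <- 1.13

se_from_upper <- log(upper / est) / 2    # the rule used above
se_from_lower <- log(est / lower) / 2    # same idea, lower half of the interval
se_from_width <- log(upper / lower) / (2 * qnorm(0.975))  # full width, 1.96 not 2

# Cross-check from the death counts: 3 of 241 treated, 10 of 249 control
rr <- (3 / 241) / (10 / 249)
se_from_counts <- sqrt(1/3 - 1/241 + 1/10 - 1/249)

print(round(c(rr = rr, upper = se_from_upper, lower = se_from_lower,
              width = se_from_width, counts = se_from_counts), 2))
```

The versions agree to within a few hundredths here, so the two-standard-errors shortcut seems fine for this purpose; intervals whose lower ends are rounded to 0.01 would be less trustworthy if you worked from the lower end instead.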

And then the results:

  variable   mean median   sd  mad     q5   q95 rhat ess_bulk ess_tail
 lp__      -10.55 -10.29 3.03 2.87 -15.91 -6.00 1.00      859     1585
 mu         -0.35  -0.31 0.34 0.26  -0.94  0.10 1.02      334      102
 tau         0.45   0.34 0.40 0.30   0.03  1.26 1.02      338      109
 theta[1]   -0.48  -0.34 0.71 0.40  -1.72  0.31 1.02      373      114
 theta[2]   -0.53  -0.40 0.66 0.44  -1.70  0.26 1.01      692      484
 theta[3]   -0.53  -0.41 0.62 0.43  -1.71  0.20 1.01      652      174
 theta[4]   -0.21  -0.24 0.56 0.39  -1.10  0.77 1.00     1192      618
 theta[5]   -0.12  -0.13 0.26 0.25  -0.52  0.31 1.00     2357     1120
 theta[6]   -0.25  -0.25 0.35 0.31  -0.83  0.34 1.00     3577     3009
 theta[7]   -0.51  -0.39 0.59 0.41  -1.69  0.23 1.00     1369      686
 theta[8]   -0.17  -0.21 0.44 0.34  -0.82  0.62 1.00     1543      706
 theta[9]   -0.25  -0.25 0.25 0.24  -0.66  0.16 1.00     2303      574
 theta[10]  -0.56  -0.46 0.47 0.40  -1.52  0.05 1.00      842      284
 theta_new  -0.37  -0.30 0.70 0.40  -1.54  0.50 1.01     1338      383

Again, mu is the average effect over the hypothetical superpopulation of trials, and theta_new is the effect in a hypothetical new trial sampled from this population distribution. We can exponentiate the numbers above to put them on the relative risk scale, so that mu gets an estimate of 0.70 with a 90% interval of [0.39, 1.11], which is pretty close to what Katz got---our posterior shows a bit more uncertainty (recall that we are giving 90% intervals whereas he was giving 95%), but that makes sense given that we're doing a full Bayesian analysis including uncertainty in tau, not just plugging in a point estimate or the equivalent. The point estimate for theta_new is about the same---it has to be, given the way the model was constructed---but the 90% interval is much wider: after exponentiation, it's [0.2, 1.6]. This makes sense; this wider interval accounts for variation between conditions (different versions of the treatment, different control conditions, different patients at different levels of disease progression).

A general point about these meta-analyses

The point is that the interval for mu . . . it's not the right thing to look at. Or, maybe it's the right thing to look at if there truly might be no effect, but it's not the right thing to look at if the treatment does have an effect. Because if it does have an effect, that effect will vary.


That said, I don't really believe the estimates from my meta-analyses. Well, yeah of course I don't believe them: a meta-analysis is only as good as the data that go into it, and I have no idea how much to trust these data.

But, no, it's not just that. Even if I thought these data were perfect, completely clean randomized experiments with no missing data and no selection issues, I'd still think my meta-analysis is wrong, because it has flat priors on the hyperparameters, mu and tau:
- The flat prior on mu does no partial pooling toward zero, which will cause us to take extreme estimates too seriously;
- The flat prior on tau leaves us with posterior probability attached to very high values of tau, which will cause us to be too open to the idea of huge positive or negative effects in individual studies.

So, what to do? I don't know, exactly! There's no Platonically correct prior for (mu, tau). All I can say for sure is that the prior we are currently using is too wide.

What, then, to do?

We can learn by experimentation.

Let's try independent priors: mu ~ normal(0, 1) and tau ~ half-normal(0, 1) constrained to be positive.

This can't be right either---it stands to reason that the larger mu is, in absolute value, the larger we would expect tau to be---but let's go with it for now. Also, normal(0, 1), that sounds so arbitrary! But recall that this is all on the logit scale, which we can interpret, and I don't think it's soooo ridiculous to suppose a priori that the population average effect will be less than 1 on the log-odds scale, and that effects under different conditions shouldn't vary by much more than 1 on that scale. In any case, it's a choice, and, as with all models, we recognize that whatever choice we use will in some way bound the effectiveness of our method.
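It can help to see what these priors imply on the ratio scale before fitting anything. Here is a small prior-only sketch in base R; the seed and the number of draws are arbitrary choices of mine.

```r
# 95% prior interval for the average effect exp(mu) under mu ~ normal(0, 1)
mu_ratio_interval <- exp(qnorm(c(0.025, 0.975), mean = 0, sd = 1))

# Quantiles of tau ~ half-normal(0, 1), using P(tau <= t) = 2 * pnorm(t) - 1
tau_quantile <- function(p) qnorm((1 + p) / 2)
tau_median <- tau_quantile(0.5)
tau_q95 <- tau_quantile(0.95)

# Prior predictive draws of the effect in a new study
set.seed(1)   # arbitrary seed, for reproducibility only
n_draws <- 1e5
mu_draws <- rnorm(n_draws, 0, 1)
tau_draws <- abs(rnorm(n_draws, 0, 1))
theta_new_draws <- rnorm(n_draws, mu_draws, tau_draws)
theta_new_ratio_interval <- exp(quantile(theta_new_draws, c(0.025, 0.975)))

print(round(mu_ratio_interval, 2))
print(round(c(tau_median = tau_median, tau_q95 = tau_q95), 2))
print(round(theta_new_ratio_interval, 2))
```

So this prior still allows the average effect to be anywhere from roughly a sevenfold reduction to a sevenfold increase in odds, and allows even more for an individual study. It is weakly informative, not tight.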

To continue, here's the new Stan model:

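data {
  int J;             // number of studies
  vector[J] est;     // estimated effects (log scale)
  vector[J] se;      // standard errors of the estimates
}
parameters {
  real mu;
  real<lower=0> tau;
  vector<offset=mu, multiplier=tau>[J] theta;
}
model {
  est ~ normal(theta, se);
  theta ~ normal(mu, tau);
  mu ~ normal(0, 1);
  tau ~ normal(0, 1);  // half-normal, since tau is constrained to be positive
}
generated quantities {
  real theta_new = normal_rng(mu, tau);
}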

And here's the result of fitting it to that Katz meta-analysis data:

  variable   mean median   sd  mad     q5   q95 rhat ess_bulk ess_tail
 lp__      -11.19 -10.95 2.92 2.89 -16.43 -6.84 1.01     1069     2048
 mu         -0.30  -0.29 0.26 0.23  -0.75  0.09 1.00     2013     1161
 tau         0.34   0.28 0.28 0.25   0.02  0.88 1.00     1153     1423
 theta[1]   -0.37  -0.31 0.49 0.34  -1.27  0.30 1.00     3092     2114
 theta[2]   -0.40  -0.32 0.49 0.36  -1.33  0.23 1.00     3318     2763
 theta[3]   -0.42  -0.34 0.50 0.36  -1.38  0.21 1.00     3157     2425
 theta[4]   -0.21  -0.23 0.43 0.32  -0.87  0.48 1.00     3938     3155
 theta[5]   -0.13  -0.13 0.25 0.24  -0.52  0.30 1.00     4403     3448
 theta[6]   -0.24  -0.24 0.33 0.29  -0.79  0.26 1.00     4900     3547
 theta[7]   -0.42  -0.33 0.50 0.34  -1.37  0.21 1.00     2675     1775
 theta[8]   -0.17  -0.20 0.38 0.32  -0.75  0.50 1.00     4080     2850
 theta[9]   -0.24  -0.24 0.24 0.23  -0.63  0.16 1.00     5644     3168
 theta[10]  -0.49  -0.40 0.42 0.36  -1.29  0.08 1.00     2198     2542
 theta_new  -0.30  -0.27 0.51 0.34  -1.15  0.44 1.00     3221     2823

The 90% interval for mu was [-0.94, 0.10] under the flat-prior model and is [-0.75, 0.09] under this new model. The 90% interval for theta_new was [-1.54, 0.50] under the flat-prior model and is [-1.15, 0.44] under the new one. So the effect of the prior is to rule out some of those extreme values, which makes sense to me.

A response to the (hypothetical) anti-Bayesian

At this point you might reply that you don't care about my priors etc.---but I'd just respond that to do a meta-analysis at all is to do modeling and to make assumptions, and there's nothing magic about the meta-analysis procedures used by Marinos and Katz. Ultimately, it's about using all the tools we have to estimate the distribution of effects, and there are no assumption-free answers here.

P.S. See here for more on priors for this model.

30 thoughts on “Exploring some questions about meta-analysis (using ivermectin as an example), with R and Stan code”

  1. Thank you for your calm, temperate approach. There is so much maximalist yelling and screaming in our discourse today. I’m one of those who is turned away when a position is articulated by a red face with distended neck veins and spittle flecked lips. Do people think that anger is an argument winner? Of course, 35 years of clinical oncology has made skepticism about treatments my default.

  2. I was kinda suspicious, though, first because it seemed doubtful that the medical establishment would be so suspicious of this treatment if it were really so effective as all that

    You have this intuition backwards. Positive results despite a bias against the treatment should be more convincing.

    There is a very strong bias to get whatever results you want/expect and any number of sources of systematic error to get there.

    • Anon:

      I dunno about that. There are a lot of doctors out there with a lot of sick patients. I think that if a much-discussed miracle cure were out there, people would start using it, and the evidence for its amazing efficacy would be clear.

      In any case, that was just my reasoning and it’s not required in order to interpret the meta-analyses.

      • You underestimate the degree of mass confusion. Remember the WHO tweeting out there is “no evidence” that antibodies confer immunity? That is the thought process being used.

        There are some individual doctors claiming ivermectin works great. I have no idea either way and doubt there will ever be a clear answer.

    • > You have this intuition backwards. Positive results despite a bias against the treatment should be more convincing.

      There have been a number of studies conducted by people who were expecting modest results and who found none.

      It’s interesting how many people simply dismiss study findings because of their own assumptions about the bias of the researchers. We get a recursive dynamic of bias about bias. In the end, the point is that people have a tendency to filter whatever results on the basis of their “motivations” (or I guess one might say “priors.”)

      Usually when someone conducts a study to find an effect, you’d think that if they have a bias it would be to find that effect – yet as we see with the Malaysian study there’s a built-in assumption among many that actually they conducted the study to prove that IVM doesn’t work. From that basis, deconstruct the analysis to conclude that actually the study shows that IVM has a miraculous benefit.

      “Designed to fail” becomes an off the shelf theory that can be applied to any study that shows no benefit. Unfortunately, it’s hard to determine to what extent a critique has value – as in “Well, they didn’t follow the right protocol.” Or to what extent it’s just standard contrarianism and an endless loop of unfalsifiable claims.

      • It’s interesting how many people simply dismiss study findings because of their own assumptions about the bias of the researchers. We get a recursive dynamic of bias about bias. In the end, the point is that people have a tendency to filter whatever results on the basis of their “motivations” (or I guess one might say “priors.”)

        Yes, this wealth/connection-weighted collective prior is all that NHST measures. That is what I have been saying.

        To avoid this you need to do more than compare group A to group B. There is always a legitimate reason to throw out any given datapoint or study. You need to make an otherwise surprising theoretical prediction and observe that. Then it needs to be independently replicated.

  3. Nice discussion of meta analysis as routinely done. The strange aspect of meta analysis is the underlying assumption of concurrency. The reported studies are sequential in time and not analyzed as such. Did the sequential aspect affect the study design? Presumably researchers know about previous studies. They however do not see future studies. This is not accounted for in meta analysis. If the initial expectation is a 50% reduction, do you design small studies in the initial phase because of this assumption?

    Another point worth considering: Why are these studies not evaluated for S-type error. Their low power seems to make this compulsory. How do you do that in meta analysis? Do you use the sign test??

    • “The reported studies are sequential in time and not analyzed as such”

      What time-dependent process would affect the results? Study designs could evolve I suppose, but at the end of the day all are designed to assess the efficacy of a treatment, which normally wouldn’t be time dependent from one study to the next.

      • Ordinary standard of care. Over time, we learn better treatments, so the effect of adding another effective treatment to the already-good treatments gets smaller and smaller.

        And variants.

  4. I think I recently read a suggestion that instead of looking at questionable quality studies individually, or looking at the conclusions of small or questionable studies in aggregate as might be done in a meta-analysis, it would make more sense to pool the data from the individual studies and then perform an analysis.

    So I looked again to see if I could reread that in more detail but I can’t find it again. Did I just dream that up?

    Anyways – is that just a hair-brained idea? Is the idea of pooling data collected under a wide network of contexts just nuts because there’d be no reasonable way to standardize or calibrate across the data?

    • Joshua:

      Pooling data collected under a wide network of contexts . . . that’s what meta-analysis is. That’s what the above meta-analyses are doing. Whether it’s nuts is another question. There are lots of reasons to think that, if the effect of ivermectin is nonzero, the effect will vary a lot across conditions, which is why I think Marinos, Katz, and others are making a mistake by focusing on this average estimate.

      • Andrew –

        Yikes. So I guess I’m conflating systematic review and meta-analysis – wrongly thinking that meta-analyses combine data post (individual study) analysis rather than pooling raw data and then conducting statistical analysis.

        • Joshua:

          There are different ways to do meta-analysis. I think it is better to do meta-analysis using the raw data from all the studies, but often the raw data aren’t accessible or easy to put together, hence we’ll see meta-analysis using published summaries.

  5. The issue of ivermectin as a treatment for SARS-CoV-2 / COVID-19 is a subset of questions about a large number of treatments that have been controversial (and controversially promoted)–zinc, vitamin D, vitamin A, hydroxychloroquine, probiotics…..etc. etc. etc.

    Regarding the website that was mentioned as the source of one meta-analysis of ivermectin done by “advocates”: the meta-analysis of ivermectin is just one of the “real-time” meta-analyses of 36 different treatments for SARS-CoV-2 / COVID-19 that are being done by whoever is doing them. https://c19early.com/

    It was mentioned in the “chain of replies” in the earlier discussion of ivermectin that started with the JAMA paper. It is worth checking out.

    Of interest, this website cites 79 studies of ivermectin examining 8 different outcomes studied in at least one study (mortality, hospitalization, recovery, cases, viral clearance, ICU, ventilation, progression). Not all are RCTs. But the website, as pointed out in the earlier replies, seems not to “think” that being an RCT matters much.

    • Funny, I discovered this site in 2020 and I didn’t know what to think about it. It first looked like heavy evidence in favor of a number of these “alternative treatments”. I thought it would be hard to dismiss the results without carefully weighing each piece of evidence (each study). Then I realized this initiative of fast alternative meta-analyses is finally self-defeating: because I could accept that one or two of these drugs may be beneficial for Covid treatment, but not almost all of them. They should have stuck to a short list. But this is hard when you’re fighting the system.

  6. The statistical picture from these studies is complicated by the fact that each has multiple endpoints that are time sequenced and causally interrelated, so all these analyses are far removed from the kind of multivariate analysis that would better represent the available information. And then there is no way the biases in the study are randomly distributed. Thus the use of a simple random-effects model with only an overdispersion parameter (tau) is a pretty crude way of dealing with variation that is so heavily systematic in its dependence on measured variables. Not that a contextually more accurate analysis is likely to settle anything, and the universal reality stated in the closing sentence will still apply (there are no assumption-free answers).

    • Sander:

      The meta-analysis presented by Katz appears to be just based on studies of the mortality endpoint, but I’d still think that the treatments and patients and their conditions would differ enough that it does not make sense to summarize by the inference for the population average parameter (mu, in my notation).

      What’s bothering me is now I’m seeing all these meta-analyses and they all have this form with the individual confidence intervals on the top and the diamond showing inference for the average at the bottom. And those “weights”! This seems like a real problem, even more so as it seems to have become some sort of standard. This is an example of software making things worse, as there’s now an easy-to-use way to do something misleading. On the plus side, I kinda like those funnel plots. They can be overinterpreted, but at least they acknowledge variation.

      • > This is an example of software making things worse, as there’s now an easy-to-use way to do something misleading.
        That does describe the Cochrane Collaboration at least prior to say 2012. One editor of their handbook threatened to remove any section that a sample of clinicians found confusing. Not rewrite or expand but remove.

        I think they backed off on that but that was the prevailing attitude – has to be something clinicians can do without the assistance of a statistician.

      • Andrew: Yes I very much agree the standard “summarize with RE weights” is dubious for the reasons I mentioned – the variation is not random in any sense informed by reading the studies and seeing how they differ in ways that could and likely would affect what they report. Another problem is that the RE null test of “effect” isn’t a test for effect existence, but is usually misinterpreted that way; yet it’s only a test of whether the average effect across studies is null, and very inefficient for testing the existence of an effect. These problems have long been lamented, e.g., https://academic.oup.com/aje/article-abstract/140/3/290/99737

        Using study information in a meta-regression (as advised in the link and in Greenland and O’Rourke, Meta-Analysis, Ch. 33 of Modern Epidemiology 3rd ed. 2008) is an option to address them when there are enough studies or large enough studies to do that, and as explained in that chapter can be done with ordinary software.

  7. Hi Andrew, thank you for looking at my work — I’m quite shocked it got this far.

    A few clarifications:
    1. I started from the 29 early treatment studies from ivmmeta because that’s where Scott Alexander started from. My piece is a response to him, and especially the first part is an attempt to clean up and elucidate his argument. I didn’t choose the studies, the endpoints, etc.

    2. In my article I presented 5 meta-analyses. The first, with 29 studies, as you note, has many of poor quality in it, as it was the starting point. The last two (analyses 4 and 5 as numbered), as you also correctly note, lose much of their power if you remove Bukhari. These last two are the ones containing not just Alexander’s but also GidMK’s exclusions.

    However, analyses 2 and 3, the ones in between, are the ones that contain Alexander’s exclusions but not Gideon MK’s exclusions. Those ones also do not lose their oomph nearly as much when Bukhari is removed. I did the analysis of those without Bukhari and the results are very similar. (I am intentionally staying vague about the statistical language here because I am keenly aware of the vast gulf that separates us in understanding of the underlying tools I’m handling. Best to confirm my results yourself).

    This is, in fact, what I originally expected to see: analyses 2 and 3 being “significant” while 4 and 5 slip over the dreaded “1-line”.

    3. While I understand that these arbitrary thresholds are far less important than is commonly accepted, my intention in the piece was not to litigate the underlying statistics – that is an argument I am not qualified for, and one that would fly over the head of the readers even if I were. My intent was to use the most middle-of-the-road commonly-accepted tools (which is why I used Cochrane’s RevMan with default settings — nothing fancy) and language to firm up Scott’s argument, and demonstrate that how one perceives Scott’s article should hinge on the unstated assumption of how much one trusts that GidMK’s exclusions were done fairly and not with a conclusion in mind.

    Since neither GidMK’s nor Scott’s meta-analyses have been pre-declared (which is something I, at least, did do) we can’t really know if the conclusions themselves have affected the inputs. So it comes down to trust. That was the point the first part of my article was trying to make.

    4. Since then, some water has gone under the bridge. For one, I have published an article noting that Scott’s original meta-analysis did not even use commonly-accepted meta-analytic tools. He simply did a t-test. Say what you will about random effects, what Scott did falls under the “not even wrong” category.

    I’ve also been going over his literature review, and, sadly, he seems to have misunderstood/misclassified/misconstrued about a dozen of the studies that he comments on. This will eventually be cleaned up and go on my substack, but for now lives in a Twitter thread as a scratchpad:

    5. As you mention, many more studies have been published, and many more will. What happens if we get a doubling of the evidence base, with studies that say roughly the same things as this one did?

    Overall, I’ve been fascinated with this topic because it represents an incredible vantage point into the sensemaking crisis we’re facing on multiple topics right now, where different camps are unable to come to any consensus whatsoever. Sander Greenland seems to have understood what I am trying to do with an impressive degree of fidelity. My big difference from your state of mind is that I am not convinced by the reasoning you stated elsewhere in this thread:

    “There are a lot of doctors out there with a lot of sick patients. I think that if a much-discussed miracle cure were out there, people would start using it, and the evidence for its amazing efficacy would be clear.”

    I have seen way too much in terms of the ability of humanity in general, and medical doctors in particular, to deceive themselves for very long periods of time. One look into the story of Ignaz Semmelweis, or the hundreds of medical reversals, should give you a baseline about the number of patients doctors will let die rather than give up on a cherished prior.

    Sadly, we can’t rely on the hive mind for this one. We really have to focus on the data in order to find the answer, and to the degree that the perceived consensus has been affecting the data we see we have to try and untangle it from the evidence, not ignore it. And if this sounds extreme or conspiratorial, you can look at, for instance, the behavior of certain journals that delay or refuse to publish studies based on the results, as seems to have happened many times with ivermectin studies – most recently with the Biber study taking about a year and a half to get published. Ivermectin seems to be the one topic where publication bias may actually go in the opposite direction.

    Thank you for reading this (if you do) and spending some time engaging with my article. I much appreciate it.

    • Alexandros:

      I think the problem with your analysis is fundamentally statistical in that it overrates the evidence in your meta-analysis. In your graph, it looks like you have a bunch of studies, but the ones with those really wide intervals such as (0.01, 5.87) are providing essentially zero information, and that’s even before considering issues of selection bias. As discussed in my post above, it all depends on one or two of the larger studies that are included, and it is an artifact of the approach that you and Meyerowitz-Katz use that they come up with overly precise results.
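
To make the "essentially zero information" point concrete, here is a rough sketch in R. It is an illustration only, not part of the analysis in the post. It backs out the standard error of a log relative risk from a reported 95% interval, assuming the interval is symmetric on the log scale, and compares the inverse-variance weights that two studies would receive. The interval (0.01, 5.87) is the one quoted above and [0.22, 0.62] is the one mentioned later in this thread; the helper name `se_from_ci` is made up for this example.

```r
# Standard error of log(RR) implied by a 95% interval (lo, hi),
# assuming the interval is symmetric on the log scale.
se_from_ci <- function(lo, hi) (log(hi) - log(lo)) / (2 * qnorm(0.975))

se_wide   <- se_from_ci(0.01, 5.87)  # very wide interval
se_narrow <- se_from_ci(0.22, 0.62)  # comparatively narrow interval

# Inverse-variance weights, as used in standard pooled estimates
w_wide   <- 1 / se_wide^2
w_narrow <- 1 / se_narrow^2

cat("SE (wide):", round(se_wide, 2), " SE (narrow):", round(se_narrow, 2), "\n")
cat("Weight ratio, narrow / wide:", round(w_narrow / w_wide, 1), "\n")
```

Because the weight is proportional to one over the squared standard error, the study with the wide interval counts for dozens of times less than the narrow one in a pooled estimate, which is why a long list of such studies adds little.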

      • I don’t mean to naysay, but out of curiosity, I took analysis 2 (all ivmmeta studies minus Scott Alexander’s exclusions) and further removed the following:

        1. Bukhari (which you suspected may be the one driving the results)
        2. Any study with CI upper bound higher than 5
        3. Any study with CI lower bound lower than 0.05

        (you’re going to have to trust me that I didn’t do multiple testing here — [0.05, 5] were the first numbers that came to mind, probably because of your example having a 5+ upper bound. Maybe these particular numbers are lopsided in some way? I have no idea)

        You can see the result here: https://imgur.com/a/g15wjlt
        The punchline: point estimate and interval don’t seem to have moved much, though my filter is strict enough to have only retained 5 studies of the original 29. Please correct me if I am misunderstanding the results.
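
For anyone who wants to try a similar filter on their own table of study results, the three exclusion rules above can be sketched in R roughly as follows. The data frame is a made-up placeholder: the labels and intervals are invented for illustration and are not the ivmmeta numbers. The column names `study`, `ci_lo`, `ci_hi` and the variable `drop_by_name` are assumptions, with "S1" standing in for the study excluded by name.

```r
# Placeholder data: invented labels and intervals, NOT real study results
studies <- data.frame(
  study = c("S1", "S2", "S3", "S4", "S5"),
  ci_lo = c(0.20, 0.01, 0.30, 0.04, 0.50),
  ci_hi = c(0.60, 5.90, 1.20, 2.00, 6.00),
  stringsAsFactors = FALSE
)

# Rule 1: drop a study by name (hypothetical stand-in label)
drop_by_name <- "S1"

# Rules 2 and 3: drop studies whose interval's upper bound is above 5
# or whose lower bound is below 0.05
kept <- subset(
  studies,
  !(study %in% drop_by_name) & ci_hi <= 5 & ci_lo >= 0.05
)

print(kept$study)
```

Note that a filter like this selects on interval width and location rather than on study quality, so it is a sensitivity check and not a substitute for reading the studies.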

        This seems to agree with what I wrote above: the fragility of the resulting analysis only appears **after** one excludes specifically the studies GidMK uniquely wanted to exclude. Scott’s exclusions, or the exclusions I reverse-engineered from your comment above, don’t really move the needle much.

        On a separate note, and forgive me if you’ve seen this before, but there is a Bayesian meta-analysis by Neil and Fenton which seems to be much closer to the style of statistics you (and I) prefer: https://www.researchgate.net/publication/353195913_Bayesian_Meta_Analysis_of_Ivermectin_Effectiveness_in_Treating_Covid-19_Disease

        No obligation to read or respond, but if you do read it and have thoughts on it, I’d sure love to hear them.

        • Alexandros:

          Yes, I think that, for reasons discussed in the above post, this approach to meta-analysis does not make sense and leads to overconfidence. Your particular example is driven by that one study with a 95% interval of [0.22, 0.62] which implies a huge beneficial effect of the drug. If you were to really take that study seriously then you’d also have to accept that the effect varies by a huge amount across scenarios and populations and outcome measures, in which case I don’t think there’s any real interest in a purported average effect such as is estimated using the meta-analyses performed by you and Meyerowitz-Katz.
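
A minimal sketch of the distinction between the average effect and the distribution of effects, in R, using made-up numbers. Here `mu`, `se_mu`, and `tau` are hypothetical values for the mean log relative risk, its standard error, and the between-study standard deviation; they are not estimates from any of the analyses discussed in this thread. Under a simple normal model for the true effects, the code contrasts the interval for the average with the implied spread of effects across studies.

```r
# Hypothetical values on the log relative-risk scale (assumptions, not estimates)
mu    <- log(0.8)  # average effect across studies
se_mu <- 0.10      # uncertainty about that average
tau   <- 0.50      # between-study sd of true effects

z <- qnorm(0.975)

# 95% interval for the average effect
ci_mean <- exp(mu + c(-1, 1) * z * se_mu)

# Approximate 95% range for the true effect in a new study:
# combines uncertainty in mu with between-study variation
range_new <- exp(mu + c(-1, 1) * z * sqrt(se_mu^2 + tau^2))

# Share of studies whose true effect is harmful (RR > 1),
# ignoring uncertainty in mu and tau
p_harm <- 1 - pnorm(0, mean = mu, sd = tau)

print(round(ci_mean, 2))
print(round(range_new, 2))
print(round(p_harm, 2))
```

With these made-up inputs the interval for the average excludes 1, yet the range of plausible true effects in a new setting runs from a large benefit to a substantial harm. When the between-study variation is that large, the average by itself says little about what to expect in any particular setting.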

      • Andrew –

        So then what is the useful function of meta-analyses?

        What about if the meta-analyses are accompanied by sensitivity or cluster analyses to break down patterns in the studies in association with various parameters, such as sample size or other methodological considerations?

      • Hmmm. Previous attempt to post went into the ethernet. I’ll try again.

        Andrew –

        > The conclusion is that it’s a mistake when performing meta-analysis to focus on an estimate of average effect across studies.

        So then what is the useful function of meta-analyses?

        What about when a meta-analysis includes a sensitivity and/or cluster analysis to identify patterns in association with parameters like sample size or other methodological attributes?

        Is there any way to average across studies that you think has value? Perhaps not just a focus on effect in isolation from other factors, as if they can meaningfully be disaggregated?
