What am I missing and what will this paper likely lead researchers to think and do?

This post is by Keith O’Rourke and as with all posts and comments on this blog, is just a deliberation on dealing with uncertainties in scientific inquiry and should not be attributed to any entity other than the author. As with any critically-thinking inquirer, the views behind these deliberations are always subject to rethinking and revision at any time.

In a previous post Ken Rice brought our attention to a recent paper he had published with Julian Higgins and  Thomas Lumley (RHL). After I obtained access and read the paper, I made some critical comments regarding RHL which ended with “Or maybe I missed something.”

This post will try to discern what I might have missed by recasting some of the arguments I discerned in the paper. I do still think, “It is the avoidance of informative priors [for effect variation] that drives the desperate holy grail quest to make sense of varying effects as fixed”. However, given for argument’s sake that one must for some vague reason avoid informative priors for effect variation at all cost, I will try to discern if RHL’s paper outlined a scientifically profitable approach.

However, I should point out that their implied priors seem to be a point prior of zero for there being any effect variation due to varying study quality and a point prior of one that the default fixed effect estimate can be reasonably generalized to a population of real scientific interest. In addition, as I think the statistical discipline needs to take more responsibility for the habits of inference it instils in others, I am very concerned about what various research groups will most likely think and do given an accurate reading of RHL.

Succinctly (as it’s a long post), what I mostly don’t like about RHL’s paper is that they seem to suggest their specific weighted averaging to a population estimand – which annihilates the between study variation – will be of scientific relevance and something from which one can sensibly generalize to a target population of interest. Furthermore, it is suggested as being widely applicable and as often involving only the use of default inverse variance weights. Appropriate situations will exist but I think they will be very rare. Perhaps most importantly, I believe RHL need to set out how this will be credibly assessed to be the case in application. RHL does mention limitations, but I believe these are of a rather vague “don’t use these methods when they are not appropriate” sort.

There is seemingly little or no advice on when (or how to check whether) one should use the publication-interesting narrow intervals or the publication-uninteresting wide intervals.

First, I will review work by Don Rubin that I came across when I was trying to figure out how to deal with the varying quality of RCTs that we were trying to meta-analyse in the 1980s. It helps clarify what meta-analyses should ideally be aiming at. He conceptualized meta-analysis as building and extrapolating response surfaces in an attempt to estimate “true effects.” These true effects were defined as the effects that would be obtained in perfect hypothetical studies. I referred to this work in my very first talk on meta-analysis and RHL also referred to this paper on it – Meta-analysis: Literature synthesis or effect-size surface estimation? DB Rubin – Journal of Educational Statistics, 1992, though in a very limited way. I think I prefer Rubin’s earlier paper “A New Perspective” in this book. I will then apply this perspective of what meta-analyses should ideally be aiming at to critically assess where RHL’s proposed approach would be most promising.

Now, Rubin was building and extrapolating response surfaces out of a concern that “we really do not care scientifically about summarizing this finite population” (of published studies we have become aware of) but rather “the underlying process that is generating these outcomes that we happen to see – that we, as fallible researchers, are trying to glimpse through the opaque window of imperfect empirical studies”. He argues that to better do this we should model two kinds of factors – scientific factors (I often refer to this as biological variation) and scientifically uninteresting design factors (I often refer to this as study quality variation). Furthermore, as we want to get at the underlying process, we need to extrapolate to the highest quality studies as these more directly reflect the underlying process.

Using the notation X for biological factors and Z for study quality factors he is “proposing, answers are conditional on those factors that describe the science [biology], X, and an ideal [quality] study Z = Z0. That is where we are making inferences.” If there is a lot of extrapolation uncertainty – then that’s the answer. Not much to learn from these studies, they are not the ones you are looking for, so move on.

Now his work has been primarily conceptual as far as I am aware. Sander Greenland and I tried to see how far we could take modelling and extrapolation in a paper entitled “On the bias produced by quality scores in meta-analysis, and a hierarchical view of proposed solutions.” Unfortunately no one seemed willing to share a suitable data set for us to actually try it out. Ideally, such a data set would have a reasonable number of studies that have been adequately assessed for their quality (which we defined as whatever leads to more valid results, which likely is of fairly high dimension with the quality dimensions being highly application-specific and hard to measure from published information). Given such requirements are quite demanding, I don’t think they will be met in published clinical research where there is primarily only access to the publications. They might be met, though, in a co-operating group of researchers studying a common disease and treatment and prospectively conducting studies, where they should have access to all the study protocols, revisions and raw data. Or perhaps in a prospective inter-laboratory reliability study. In summary, there is always a need to adequately model an X, Z surface and extrapolate to Z0, with this being ignorable only when all studies are close enough to Z0.

So we are moving on to just considering RHL’s method in consistently high quality studies – they are not the methods to use when there is varying quality. This is a rare situation but it can happen. I believe it would exclude most meta-analyses done within, for instance, the Cochrane Collaboration. Not sure if RHL would agree. Even within this restricted context of uniformly high quality studies, I think it will be helpful to give my sense of the science or reality of multiple high quality studies as even that can be tricky.

Each study likely will recruit an idiosyncratic sample of patients from the general population – it will be highly selective (non-random) and not that well characterized. This can be seen in the variation of patient characteristics and, for instance, the varying control group outcomes in the trials. There will be restrictions on eligibility criteria. However, within that there can still be much variation. So each study will have an idiosyncratic sample of patients recruited from the general population which provides a relative percentage in that study of the total studied population. RA Fisher used the term indefinite to refer to this type of situation where we have no means of reproducing such differing sub-populations at will. The same investigators at a later time or in a different city would be unlikely to recruit the same mix of patients (i.e. the recruitment into clinical trials is pretty haphazard in my experience). Because of this I don’t see how the sub-population could even be adequately described nor how the relative percentage of this sub-population in a population of interest could ever be determined.

Now, the relative percentage of the total studied population of these study sub-populations will likely differ from the percentage of these same idiosyncratic sub-populations in the general population or some targeted population of interest. Sample sizes in conducted trials are largely determined by funds available, recruitment abilities of the trial staff, other trials competing to enrol subsets of targeted patients at the time, etc. Because of this I don’t see why it would be expected that the relative percentages of the various study recruited patients (of the total studied population) would approximately equal the relative percentages of the various patients in a general or target interest population.

Now in a very convenient case where the only patient characteristic that drives treatment effect variation is say gender – we will know the relative percentages in the study population and likely any targeted population (with negligible error) and post-stratifying (weighting to match to a target population gender proportions) will likely be straightforward. RHL provides an appendix which actually carries this out for such an example. But in realistic study settings, I have no idea how the needed weights could be obtained. That is, we will usually just have idiosyncratic samples of patients in studies without much if any knowledge of their make up or their relative percentages in targeted populations or what drives the biological variation (given there is variation).
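To make the convenient gender-only case concrete, here is a minimal sketch of post-stratification. All numbers are hypothetical and chosen only for illustration; they are not taken from RHL’s appendix. It assumes gender is the only driver of effect variation and that both sets of proportions are known without error, which is exactly what is unavailable in realistic settings.

```python
# Hypothetical stratum-specific treatment effects (illustrative values only).
effect_by_gender = {"female": 0.30, "male": 0.10}

# Hypothetical share of each gender in the total studied population
# versus in the target population of interest.
studied_share = {"female": 0.25, "male": 0.75}
target_share = {"female": 0.50, "male": 0.50}


def post_stratify(effect_by_stratum, share_by_stratum):
    """Average stratum effects using the given population shares as weights."""
    total_share = sum(share_by_stratum.values())
    return sum(
        effect_by_stratum[s] * share_by_stratum[s] for s in effect_by_stratum
    ) / total_share


# Average effect in the population that was actually studied.
studied_average = post_stratify(effect_by_gender, studied_share)

# Average effect after re-weighting to the target population's gender mix.
target_average = post_stratify(effect_by_gender, target_share)

print(f"studied-population average: {studied_average:.3f}")
print(f"target-population average:  {target_average:.3f}")
```

The two averages differ only because the gender mix differs, so the re-weighting is only as good as the knowledge of what drives the variation and of the target proportions.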

Now it is clear that we do have the study population and so it should be fairly direct to assess the average treatment effect in the studied population. Here between study scientific variation (i.e. study population variation) can be taken as fixed and ignored. However, I would argue this question is not one of scientific relevance (that RHL is primarily interested in) but rather a practical economy of research question – is it worth continued study of this intervention to get a better sense of what that effect would be in a targeted population and what studies should we do to get a better sense. This perspective goes way back to Fisher with his  careful discernment of what variation should and should not be ignored for questions of scientific relevance.

As an aside, I have thought a fair amount about these issues as I presented a similar weighting argument to get an uninteresting average magnitude effect estimate and an interesting average sign effect estimate – the latter just being positive or negative. I presented the argument to a SAMSI Meta-analysis working group in 2008. I recall Ken and Julian being there but I am not really sure and they likely do not remember my presentation either. Jim Berger criticized the population that was defined by the inverse variance weighting as being non-existent  or imaginary. Now I had thought the assumption of the treatment effect being monotonic would make that criticism moot, but I was not sure at the time. Richard Peto often insisted he was justified in making such an assumption and I was taking that as a clue. If the effect is only positive or negative, then an average effect in any population real or counterfactual – no matter how uninteresting – would enable the sign to be pinned down for any other population. I later discussed this with David Cox via email and he argued that monotonicity was a very questionable assumption. Furthermore, if I actually wanted to make such an assumption, why not assume a treatment variation distribution on the positive line? So I abandoned the idea.

Now some specific excerpts from RHL on which I may benefit from some clarification:

RHL > “we discuss in detail what the inverse-variance-weighted average represents, and how it should be interpreted under these different models” and “[the] summary that is produced should be interpretable, relevant to the scientific question at hand and statistically well calibrated … controversial issue that we aim to clarify is whether ˆβ in equation (1) [inverse-variance-weighted average] estimates a parameter of scientific relevance.”

I definitely like these promises but I don’t see them being explicitly or adequately met in the paper.

RHL > “we restrict ourselves to the situation of a collection of studies with similar aims and designs, free of important flaws in their implementation or analysis. (See Section 6 for further discussion.)” and then in Section 6 “when studies do not provide valid analyses, either because of limitations in the design and conduct of the study, or because, after data collection, post hoc changes are made to the analysis, but reported analyses do not take these steps into account… If in practice these procedures cannot be avoided, accounting for the biases that they induce is known to be difficult…”

The challenge here is that it’s not clear what is meant by important flaws, and when they are present RHL seem to be suggesting not much can be done. For instance, what percentage of the meta-analyses in the Cochrane collection of meta-analyses would have such flaws – 10% or 90%? Would one or more entries in the risk of bias tool help sort this out?

RHL>”[inverse-variance-weighted average] estimates a population parameter, for the population formed by amalgamating the study populations at hand. … in the overall population that amalgamates all k individual study populations, define ηi as the proportion of the population that comes from study population i.”

I am not sure whether RHL mean to define ηi as the proportion of the total studied population that comes from study i or the proportion of a general population or population of interest that matches study i’s sub-population. The following two claims suggest it is the first.

RHL>”we see that ni/Σni [the proportion of total study population in study i] is consistent for ηi” and “proportions ηi are known with negligible error”

The population of scientific relevance is the second, so how do we get to that – just assume the proportion of the total studied population roughly equals the proportion in a population of interest? That surely needs some justification?

RHL>”It remains to discuss the scientific relevance of β; the use of this specific weighted average is described … general results given in Section 3.3.”

I very much agree that it does remain and RHL claim it will be in Section 3.3.

But Section 3.3 is just an inverse-variance-weighted average view or recasting of general regression into sub-pieces – which I cannot see as addressing scientific relevance. As if general regression was the definition of scientific relevance!?

Rather, I would (and have) argue(d) it’s just an analogue of various ways to factorize the joint likelihood of studies with various choices of (in RHL terms) identical versus independent parameter assumptions. For example, in the simplest case of regression on a single x through the origin, one can specify an identical slope parameter for all studies but independent within-study variance parameters (that is, each study gets its own variance parameter). The usual regression involving all the studies’ raw data in one regression, expressed as Normal(beta,sd1,sd2,…,sdn,x,y), is then rewritten as Normal(beta,sd1,x1,y1) * Normal(beta,sd2,x2,y2) * … * Normal(beta,sdn,xn,yn). As these individual study likelihoods are exactly quadratic (given RHL’s assumptions) they can be replaced with inverse variance weighted individual study beta estimates. So what?
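The factorization point can be checked numerically. The following is a minimal sketch with made-up data, assuming each within-study standard deviation is treated as known (the assumption under which the study likelihoods are exactly quadratic in beta). It compares the slope from one pooled weighted regression through the origin with the inverse-variance-weighted average of the per-study slopes.

```python
# Hypothetical raw data for three studies; each sd is treated as KNOWN here.
studies = [
    {"sd": 1.0, "x": [1.0, 2.0, 3.0], "y": [1.1, 1.9, 3.2]},
    {"sd": 2.0, "x": [0.5, 1.5, 2.5, 3.5], "y": [0.2, 2.0, 2.1, 4.0]},
    {"sd": 0.5, "x": [1.0, 4.0], "y": [0.8, 4.3]},
]


def sum_xy(study):
    return sum(x * y for x, y in zip(study["x"], study["y"]))


def sum_xx(study):
    return sum(x * x for x in study["x"])


# (a) One regression on all raw data: common slope, study-specific variances.
#     Maximizing the joint Normal likelihood gives this weighted ratio.
pooled_beta = sum(sum_xy(s) / s["sd"] ** 2 for s in studies) / sum(
    sum_xx(s) / s["sd"] ** 2 for s in studies
)

# (b) Per-study slope estimates and their variances, then the
#     inverse-variance-weighted average of those estimates.
study_betas = [sum_xy(s) / sum_xx(s) for s in studies]
study_vars = [s["sd"] ** 2 / sum_xx(s) for s in studies]
ivw_beta = sum(b / v for b, v in zip(study_betas, study_vars)) / sum(
    1.0 / v for v in study_vars
)

print(f"pooled regression slope:   {pooled_beta:.6f}")
print(f"inverse-variance average:  {ivw_beta:.6f}")
```

The two numbers agree up to floating point error, which is the sense in which the weighted average is only an algebraic recasting of the pooled regression and says nothing by itself about scientific relevance.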

RHL>”the fixed effects meta-analysis estimates a well-defined population parameter, of general relevance in realistic settings. Consequently, assessing the appropriateness of fixed effects analyses by checking homogeneity is without foundation— … Both in theory and in practice, the argument is not tenable and should not be made.”

I think this is a very strong claim given what is and isn’t in the paper. Furthermore – checking homogeneity is just checking some of the assumptions of the data generating model. If one is not making a common parameter assumption but rather an independent parameter assumption, does that make model checking impossible? That would be convenient. The most minimal of data generating model assumptions for meta-analysis is that there are not apples and oranges. Something is being taken as common in these multiple studies – is that in error? How would one ever know?

RHL> “For example, if the subjects in the studies contributing to the meta-analysis are representative of an overall population of interest, the fixed effects estimate is directly relevant to that overall population … If, however, the sample sizes across studies vary so greatly that the combined population is unrepresentative of any plausible overall population, then the fixed effects parameter will not be as useful.”

Maybe this explains what I am missing – RHL is only suggesting the method be used when you know the mix of idiosyncratic study populations actually is representative of an overall population of interest. That is, if and only if the proportion of total study population in study i is consistent for ηi. How would one know that? How would one check that?

But RHL also claimed it was of general relevance in realistic settings, so are they assuming that in most realistic settings the proportion of total study population in study i is consistent for ηi? So it seems, given this statement.

RHL>”Fixed effects meta-analysis can and often should be used in situations where effects differ between studies”

Now for some things I do fully agree with.

RHL> “However, if the random-effects assumption is motivated through exchangeability alone”

Yup – it is surely one of those very untrue models (a really really wrong model as the Spice Girls would put it) but one which is useful in bringing in some of the real uncertainty – though admittedly seldom the right amount. That is why I once claimed it was not the least wrong model.

RHL> “Measures of heterogeneity should not be used to determine whether fixed effects analysis is appropriate, but users should instead make this decision by deciding whether fixed effects analysis—or some variant of it—answers a question that is relevant to the scientific situation at hand.”

I fully agree – but again it’s the banning of informative priors altogether – forcing there to be a discrete decision to either completely ignore or fully incorporate very noisy between study variation. Nothing in between! And with seemingly little or no advice to be had on when (or how to check whether) one should use the publication-interesting narrow intervals or the publication-uninteresting wide intervals. This is the real problem.
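For readers who want to see the size of the gap between the two discrete choices, here is a minimal sketch on made-up study estimates. It compares the default inverse-variance fixed effect interval with a random-effects interval using the DerSimonian-Laird moment estimate of between study variance. DerSimonian-Laird is swapped in here only as a familiar default; it is an assumption of this sketch, not something taken from RHL or from the discussion above, and neither interval uses an informative prior.

```python
import math


def fixed_effect(estimates, variances):
    """Inverse-variance-weighted average and its standard error."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, estimates)) / total
    return estimate, math.sqrt(1.0 / total)


def dersimonian_laird(estimates, variances):
    """Random-effects average using the DerSimonian-Laird tau^2 estimate."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fe_estimate, _ = fixed_effect(estimates, variances)
    # Cochran's Q: weighted squared deviations from the fixed effect estimate.
    q = sum(w * (y - fe_estimate) ** 2 for w, y in zip(weights, estimates))
    c = total - sum(w * w for w in weights) / total
    # Moment estimate of between study variance, truncated at zero.
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_total = sum(re_weights)
    estimate = sum(w * y for w, y in zip(re_weights, estimates)) / re_total
    return estimate, math.sqrt(1.0 / re_total), tau2


# Hypothetical study estimates and within-study variances.
estimates = [0.10, 0.50, -0.20, 0.60]
variances = [0.010, 0.020, 0.015, 0.010]

fe_estimate, fe_se = fixed_effect(estimates, variances)
re_estimate, re_se, tau2 = dersimonian_laird(estimates, variances)

for label, est, se in [("fixed", fe_estimate, fe_se), ("random", re_estimate, re_se)]:
    print(f"{label:>6}: {est:.3f} +/- {1.96 * se:.3f}")
print(f"estimated between study variance: {tau2:.3f}")
```

With only four studies the between study variance estimate is itself very noisy, which is part of why an all-or-nothing choice between these two intervals is uncomfortable.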




  1. Ken Rice says:

    Hi Keith – I had not seen your earlier “critical comments”, so I’m surprised to see this lengthy discussion of our paper here. I will try to find time to respond, with the other authors, but given the length this can’t be done immediately. Just looking at your first few paragraphs, some of your comments are also unclear to me; there is no “implied prior” in the paper – we actually spend a lot of time showing how certain estimates and intervals can be consistent with more than one set of assumptions, so I don’t see where that idea comes from.

    For everyone else, the paper and a supporting set of slides are available here – the slides are under “Supporting Information”. The main goal of the paper is to explain what fixed-effects (plural) meta-analysis does, and to help people think about whether this is useful. It was motivated by our problems of repeatedly running into misconceptions of fixed-effects meta-analysis. We did not set out to tell people how they must do their meta-analyses, just to help them understand what different meta-analytic methods provide.