“Epidemiology and Biostatistics: competitive or complementary?”

Mohammad Mansournia writes:

I am giving a 20-minute lecture on “Epidemiology and Biostatistics: competitive or complementary?” at Tehran University of Medical Sciences next month. I need to discuss the difference between an epidemiologist and a biostatistician and their competitive or complementary roles in public health. I am wondering if you have any thoughts on this subject.

P.S.
“Knowledge of relevant aspects of statistics … is a prerequisite for proper study of epidemiological research …, just as, for example broad and deep command of mathematics is a prerequisite to a career in physics.”
– Olli Miettinen

“Epidemiologist: one who thinks that an odds ratio is an approximation to a relative risk as opposed to a statistician who knows the opposite.”
– Stephen Senn

“There are two types of statisticians: those who do causal inference and those who lie about it”
– Larry Wasserman

My reply: Perhaps the following two papers will be helpful: “Causality and statistical learning” and “Experimental reasoning in social science.”
They’re not specifically on epidemiology but they do address different perspectives in causal inference. By the way, I disagree with the above quote from Larry: as a statistician who works on surveys, I recognize that there is a long and important tradition in statistics of descriptive inference.

To elaborate on this, I do think that essentially all statistical problems are about comparisons. And forward causal inference is a form of comparison (what would happen under intervention 1, compared to what would happen under intervention 2). But I often find myself in the business of making comparisons that are not causal (for example, comparing changes in public opinion among two or more groups of people). Such comparisons can have causal implications and they can suggest reverse causal questions as discussed in this paper with Guido, but I wouldn’t quite call them “causal inference” in the usual sense of the term.

To return to epidemiology vs. biostatistics: it’s my impression that there’s a lot of forward causal inference and a lot of reverse causal inference in both fields. That is, researchers spend a lot of time trying to estimate particular causal effects (“forward causal inference”) and a lot of time trying to uncover the causes of phenomena (“reverse causal questioning”).

And, from my perspective (as elaborated in that paper with Guido), these two tasks are fundamentally different and are approached differently: forward causal inference is done via estimation within a model, whereas reverse causal questioning is an elaboration of model checking, exploring aspects of data that are not explained by existing theories.
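
To make that distinction concrete, here is a minimal simulated sketch (my own toy illustration with invented numbers, not code from the paper with Guido). The forward task is estimating an effect within an assumed model; the reverse task appears as model checking, where residual structure hints at a cause the model left out.

```python
# Forward causal inference vs. reverse causal questioning: a toy sketch.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
treat = rng.integers(0, 2, n)   # hypothetical binary intervention
group = rng.integers(0, 2, n)   # subgroup that the fitted model ignores
# True world: the treatment works only in group 1 (an unmodeled interaction).
y = 1.0 + 2.0 * treat * group + rng.normal(0, 1, n)

# Forward inference: estimate one average effect within a simple linear model.
X = np.column_stack([np.ones(n), treat])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"estimated average effect: {beta[1]:.2f}")  # ~1.0, averaged over groups

# Reverse questioning as model checking: residuals reveal structure the model
# does not explain, pointing toward the group interaction as a candidate cause.
resid = y - X @ beta
for g in (0, 1):
    mask = (treat == 1) & (group == g)
    print(f"mean residual among treated, group {g}: {resid[mask].mean():+.2f}")
```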

31 thoughts on “Epidemiology and Biostatistics: competitive or complementary?”

  1. In the graphical approach to Causal Inference, the difference between Epidemiology (understood as applied Causal Inference) and Biostatistics is crisp:

    There is a partly unknown data generating mechanism that creates a joint distribution of observed variables (step 1). Then you sample from that joint distribution to obtain data (step 2).

    Biostatistics is the theory of how you can use your sample to learn about the joint distribution, i.e., reverse step 2. The fundamental problem is that you only have a finite sample, so your main challenge will be sampling variability. To make progress on this problem, what you need is a theory of statistical inference, so that you can construct better and more efficient estimators of (features of) the joint distribution.

    Epidemiology is the theory of how you can use the joint distribution to learn about the data generating mechanism, i.e., reverse step 1. The problem is that the relationship between possible data generating mechanisms and the joint distribution is many-to-one, so your main challenge will be identification, above all confounding. To make progress on this problem, what you want is a language for reasoning about which quantities in the joint distribution correspond to the research question. This is what you get from causal graphs.

    Biostatistics is a question about how you handle the fact that your dataset has too few rows. Epidemiology is a question about whether you have the right columns, and about how you put them together.
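
    A minimal simulation (my own sketch, with made-up parameters) of this two-step picture: with a huge sample, step 2 is essentially solved and sampling variability vanishes, yet the crude association still differs from the causal effect, because confounding lives in the joint distribution itself and can only be undone with assumptions about the data generating mechanism.

    ```python
    # Step 1: a data generating mechanism with a confounder u creates the joint
    # distribution of (x, y). Step 2: we draw a very large sample from it.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000                                    # large n: step 2 nearly "solved"
    u = rng.normal(0, 1, n)                          # common cause of x and y
    x = (u + rng.normal(0, 1, n) > 0).astype(float)  # exposure, driven by u
    y = 0.5 * x + 1.0 * u + rng.normal(0, 1, n)      # true effect of x on y is 0.5

    # The "biostatistics" problem is gone (the sample pins down the joint
    # distribution), but the "epidemiology" problem remains:
    crude = y[x == 1].mean() - y[x == 0].mean()
    print(f"crude difference: {crude:.2f}  (biased: confounded by u)")

    # If u is measured, backdoor adjustment (here, regression on x and u)
    # recovers the causal effect.
    X = np.column_stack([np.ones(n), x, u])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"adjusted effect:  {beta[1]:.2f}  (close to the true 0.5)")
    ```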

  2. Biostatistician here. We see a mix of randomized experiments (e.g., clinical trials) and observational studies (e.g., health policy analysis) in our group. When it comes to the former, we usually don’t need to think too hard about causal inference since we take for granted that randomization turns causal questions into associational questions. As for the latter, we usually don’t bother with anything too sophisticated since any causal inference relies on assumptions about which physicians tend to be pretty skeptical (rightly or wrongly). We sometimes do simple things like propensity score matching, etc., to give an analysis more of a causal feel, but I think my colleagues and I mostly view this as a different kind of data reduction. I feel like I’ve done my job if I’ve summarized the study and the data well enough that the reader knows what we did and what we found and can come to their own causal conclusions (maybe by drawing up their own DAG). I’d like to do better, but more often than not it seems like trying to work out the “correct” causal analysis of observational data doesn’t convince anyone who wasn’t convinced in the first place, or at least not any more than the purely statistical analysis allowed.
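
    To illustrate the “different kind of data reduction” view, here is a hedged propensity-score sketch (my own, with simulated data and invented variable names, not code from Ram’s group): the covariates are collapsed into one estimated score, and treated and untreated patients are compared within strata of that score. The result is causal only under assumptions such as no unmeasured confounding.

    ```python
    # Propensity-score stratification as data reduction: a minimal sketch.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    n = 20_000
    age = rng.normal(50, 10, n)
    sick = rng.binomial(1, 0.3, n)
    # Older and sicker patients are more likely to be treated (confounding).
    p_treat = 1 / (1 + np.exp(-(0.05 * (age - 50) + 1.0 * sick)))
    treat = rng.binomial(1, p_treat)
    y = 0.3 * treat + 0.02 * age + 0.5 * sick + rng.normal(0, 1, n)

    crude = y[treat == 1].mean() - y[treat == 0].mean()
    print(f"crude difference:    {crude:.2f}  (biased upward)")

    # Reduce (age, sick) to a single number: the estimated propensity score.
    X = np.column_stack([age, sick])
    ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

    # Stratify on the score and average the within-stratum differences.
    strata = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))
    diffs, sizes = [], []
    for s in range(5):
        m = strata == s
        if treat[m].sum() and (1 - treat[m]).sum():
            diffs.append(y[m & (treat == 1)].mean() - y[m & (treat == 0)].mean())
            sizes.append(m.sum())
    print(f"stratified estimate: {np.average(diffs, weights=sizes):.2f}  (true effect 0.3)")
    ```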

  3. Anders, Ram:

    Thanks for your perspectives. One challenging thing about any sort of “What is done in field X?” question is that different people in field X work on different problems and do different things. So, depending on how you slice it, there can be many different ways of finding similarities and differences between epidemiology and biostatistics.

    • Right, I don’t want to give the impression that everyone in biostatistics handles causal inference the way I do.

      In general, though, if a conclusion hangs on a particular set of (controversial) assumptions, and the study itself does not have anything to say about those assumptions, is there value in making those assumptions when reporting on the study? Isn’t it sufficient to draw uncontroversial conclusions, letting the reader make further inferences based on their own assumptions? To be sure, the PI may be able to make an argument for their assumptions, but it seems possible in most cases to separate the uncontroversial inferences from the controversial ones, and to prioritize reporting of the former. The latter seems more like an afterthought, and may even be better placed in a separate paper interpreting the results of the study from particular perspectives.

    • I interpreted the question to ask “What is the conceptual distinction between Epidemiology and Biostatistics”. In my view, these fields of inquiry are complementary: Epidemiology without Biostatistics is impossible, whereas Biostatistics without Epidemiology is pointless. The exception is the special case of randomized controlled trials, where Biostatistics alone is sufficient (because the Epidemiologic part of the analysis is trivial).

      If this was instead meant as a sociological question about “What is the distinction between the cluster of people who call themselves Epidemiologists and the cluster of people who call themselves Biostatisticians”, the answer is that in most cases Epidemiologists are just suboptimally trained Biostatisticians.

      • Anders_H:

        > case of randomized controlled trials, where Biostatistics alone is sufficient (because the Epidemiologic part of the analysis is trivial).

        I think that Larry’s “and those who lie about it” comment refers to folks pretending the above is the case when they are (explicitly or implicitly) aiming at causal inference of some sort even though the studies aren’t randomised. The _lying_ may be implicit or perhaps unwitting (e.g., someone thinks p-values under the null will be uniformly distributed in non-randomised studies), but it is very common (80% of the non-randomised studies I have evaluated).

        Here is a recent example http://statmodeling.stat.columbia.edu/2015/01/03/continue-teach-use-hypothesis-testing/#comment-206470
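
        A quick simulation of that point (my own sketch, not Keith’s): with a confounder driving both treatment and outcome, and no causal effect at all, naive two-sample p-values are nowhere near uniform under the null.

        ```python
        # Null p-values in a confounded, non-randomised setting: a sketch.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        pvals = []
        for _ in range(2000):
            n = 200
            u = rng.normal(0, 1, n)                        # confounder
            t = (u + rng.normal(0, 1, n) > 0).astype(int)  # treatment depends on u
            y = u + rng.normal(0, 1, n)                    # outcome depends only on u,
                                                           # so the causal effect of t is zero
            pvals.append(stats.ttest_ind(y[t == 1], y[t == 0]).pvalue)

        print(f"fraction of p < 0.05: {np.mean(np.array(pvals) < 0.05):.2f}")
        # Far above the nominal 0.05; had t been randomised, this would be ~0.05.
        ```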

        > the answer is that in most cases Epidemiologists are just suboptimally trained Biostatisticians.
        I used to think most just lacked a good grasp of calculus and linear algebra (which would limit their statistical training), but after giving a recent webinar to an Epi society (where I avoided all calculus and linear algebra) I now think it’s a lack of ability or comfort to think abstractly – to take models as fallible representations of the world which help us get less wrong about it.

        On the other hand, many of the epidemiologists I have worked with are very astute administratively and politically and make good research managers.

        • “I now think it’s a lack of ability or comfort to think abstractly – to take models as fallible representations of the world which help us get less wrong about it.”

          Well said!

        • I have a different interpretation of Larry’s comment:

          Any comparison of groups is only interesting insofar as it describes a causal relationship. Moreover, humans who read a study comparing groups will implicitly believe that a causal claim is being made. This is true even if the investigators avoid using causal language. Because really, what would be the point in doing the study if you weren’t looking for a causal relationship?

          You therefore have two types of epidemiologists: those who handle the causality problem head-on, who are explicit about what they are trying to do and about what the world has to look like in order for their estimates to be meaningful, and those who claim they are not doing causal inference (even when that means that their studies are pointless).

          The above is an oversimplification; there are obviously exceptions. For example, prediction models for diagnostic purposes are models that explicitly try to achieve a non-causal objective, i.e., figure out the probability that a person has a disease. (However, “diagnosis” is only of interest because we believe we have causal information about the effect of treatment, and that this effect differs between those who have the disease and those who don’t.)

          The quotation is hyperbole, but it contains an important truth: when applied statisticians and epidemiologists claim they are not doing causal inference, they are usually wrong. Words such as “confounding” and “selection bias” can only be defined if you are trying to reach causal conclusions. Moreover, for most papers published in epidemiologic journals, there would be no point in doing the study unless you were looking for causal explanations. This is true even when authors disingenuously try to avoid talking about causality.

          Sometimes authors are so scared of making causal claims that they call their models “prediction models” even when they make no attempt to predict anything, which is just weird.

        • Anders, I disagree with some of what you say. First, I agree that “confounding” and “selection bias” are defined in terms of causal effects, but I disagree that they’re only defined if you’re trying to reach causal conclusions. In fact, it’s the vast potential for confounding and selection bias that might prevent one from trying to make causal conclusions from observational epidemiologic studies in the first place. That is, I understand exactly what confounding and selection bias are in the causal context, which is why I’m extremely skeptical about ever being able to account for them using models.

          Second, in saying the above, I’m not discounting use of observational epidemiologic studies at all. I agree that *if* you’re going to make causal inferences from observational data, then by all means lay all of your assumptions out in the open using either Bayesian methods or causal diagrams. Readers are then free to buy your model results or remain skeptical. But, frankly, it’s ludicrous to imply that the only reason to do observational work is with a goal of drawing causal conclusions. Descriptive studies, with the goal of hypothesis generation, are far from pointless. Sometimes hard to publish, yes, but pointless, no. Personally, I would much rather see many more purely descriptive papers from observational epidemiology, with no p-values or confidence intervals; unfortunately, for whatever reason, such studies are not often published.

        • OK, I will accept that hypothesis generation is another legitimate non-causal research objective (in addition to prediction). Note though that the hypotheses you are trying to generate are usually causal in nature.

          I apologize for the hyperbole, but I still think the main point stands: Epidemiologists are usually being disingenuous when they try to avoid using causal language in their papers. Just look at any epidemiology journal and try to tell me that those guys are not trying to estimate causal effects.

        • Yes, I agree. In fact, on thinking about this more, I agree with the point that I think you were actually originally making regarding “confounding”. Confounding is a causal issue, and almost every epidemiology paper talks about controlling for confounding. Why are they doing this if they aren’t interested in making causal claims? It seems contradictory to say something like “of course correlation isn’t causation” in the Discussion, while controlling for a mess of confounders in the Results.

          Further, most of these papers talk about controlling for confounding as if that’s necessarily a good thing, as if it necessarily leads to better estimates. But they completely miss the point that this is only the case if they happen to measure and control for all important confounders (and don’t control for things like colliders, which would open up other non-causal paths). This is an assumption that would presumably be specified in a well-thought-out model.
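
          A small simulation of the collider point (my own illustrative sketch): X and Y are independent causes of C, and “controlling for” C, here by restricting to C = 1, manufactures an association out of nothing.

          ```python
          # Collider bias: conditioning on a common effect induces association.
          import numpy as np

          rng = np.random.default_rng(4)
          n = 100_000
          x = rng.normal(0, 1, n)
          y = rng.normal(0, 1, n)               # truly independent of x
          c = x + y + rng.normal(0, 1, n) > 0   # collider: caused by both x and y

          print(f"corr(x, y), everyone: {np.corrcoef(x, y)[0, 1]:+.2f}")        # ~0.00
          print(f"corr(x, y), given c:  {np.corrcoef(x[c], y[c])[0, 1]:+.2f}")  # clearly negative
          ```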

        • “Confounding is a causal issue, and almost every epidemiology paper talks about controlling for confounding. Why are they doing this if they aren’t interested in making causal claims? It seems contradictory to say something like “of course correlation isn’t causation” in the Discussion, while controlling for a mess of confounders in the Results.”

          Yes, this is exactly my point. I probably could have been clearer in the original comment. When you “control for confounding” and claim you are not estimating causal effects, that is a very strong indication that you don’t have a coherent understanding of what you are trying to achieve. I think a good description for this kind of research is “cargo cult science”.

          I think this is what Professor Wasserman was getting at in the quotation, but of course there is a risk that I’m conflating my views with his.

        • Just for clarification, the word “you” was directed at those epidemiologists who publish this kind of research. It was not directed at Mark. Like I said before in this thread, I wish I had the ability to edit. I will try to proofread my comments better in the future.

        • Anders_H: That is pretty much what I meant, but I don’t see much difference between Biostatisticians and Epidemiologists in this sort of disingenuousness. The other biostatistician at one of the research institutes I was at for a few years did this all the time and taught it to clinicians and epidemiologists in their course (and I really suspect they did know better).

          But also agreeing with Mark (and Andrew) that there are many non-causal research findings that should be published.

          An example is “A prospective study of peri-diagnostic and surgical wait times for patients” (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2634695/). Even though it was about the Canadian health care system, we could not get it published in a Canadian journal because, as one editor said, “there is no comparison group so it can’t be _research_”.

        • Anders:

          I disagree with the statement, “Any comparison of groups is only interesting insofar as it describes a causal relationship.” Or, maybe I should say, descriptive, non-causal comparisons might not be interesting to you, but they’re interesting to me and many others. Just take a look at most of my applied research papers: seats-votes curves, red-state-blue-state, radon, etc etc etc.

          It’s a big world out there. I respect that you are interested exclusively in causal inference, but there are lots of us out there who are interested in other things.

        • OK, I will have to backtrack from some of my hyperbole again:

          What I meant to say is that the kinds of research questions that are published in epidemiologic journals (e.g., “the relationship between nuts and longevity”, “the relationship between milk and cancer”, “the relationship between fats and cardiovascular disease”) are only meaningful if you are asking about causal effects. Associations can be important, but only to the extent that they give hints about causality. I have a preference for researchers who handle the causality question head-on; in fact, I think this is the only plausible use of epidemiology as a field. I have a low tolerance for people who are not clear about what question their paper is intended to answer.

          There are obviously interesting questions other than causality, and they are probably more common in statistical areas outside of epidemiology and medicine. In particular, prediction can be interesting.
          That said, if there are objectives other than prediction that people find interesting, I would be very interested in hearing a precise definition of what exactly the studies are trying to find out, and how the answer is going to be useful.

        • Anders, be very careful… Walter Willett might be looking over your shoulder! (sorry, noticed from your link above that you were at Harvard, so couldn’t resist) … he’s apparently the king of publishing clearly hyper-confounded results about nutrition and disease and making strong, unsupported claims (and making big headlines, to boot!).

        • Anders:

          If you include prediction, I think you’re safe, because as far as I can see, all statistical problems (including causal inference) can be formulated formally as prediction of unknown observables.

          In practical terms, though, sometimes—I’d say often, in fact—I’m not really interested in prediction of the future, I just want to know what happened. How did different groups of people vote, for example. Yes, such questions can have policy implications and they can be useful for addressing causal or predictive questions, but in the meantime I’m often looking for straight description. Which can be hard enough.

          Again, I refer you to most of my applied papers. There are often open and hotly debated questions that are descriptive, and it can be helpful to use statistical tools to get to the bottom of such debates. In social science jargon, we are establishing (or refuting) “stylized facts.”

        • Not sure if Andrew or Phil will comment, but I think their work (as you said) mostly took causality as a given and focused on predicting and measuring the amount present in various locations.

      • To apply your question to a different, but recurring theme, are data science and statistics competitive or complementary? Of course, they are complementary as pointed out in myriad ways by a large number of people in many different contexts. However, they may be competitive as a practical matter. Many applications of data analysis (under the names, “data mining,” “machine learning,” etc.) are being done by people with little or no statistical training. We can lament that fact and point out that this will lead to serious mistakes being made – but the reality is that the costs of doing it “right” may outweigh the benefits.

        I realize this is a diversion from the initial topic, but I have read too many of these justifications for why statistics is necessary for doing good analysis. While I would like to believe that, I do think statisticians are missing something – to the point of burying their heads in the sand. “Big data” is certainly over-hyped, but in many ways is being done by people with little statistical training – and being done with some success. I think those successes are being too quickly dismissed and that, in a very real sense, statistics and data science are increasingly competitive (when they should be complementary).

  4. Interesting discussion. I think that a lot of this discussion is coming from those who call themselves statisticians. In my world, the words epidemiologist and biostatistician are becoming blurry, unless the hard definition is simply based on which discipline the last degree was completed in. Many trainees in my generation are being trained in both disciplines, as well as other complementary ones (e.g., computer science, informatics, applied math). There is a split between the epidemiologists with very little mathematical foundation or interest (usually clinicians) and the more technical variety who are more like biostatisticians than their clinically trained counterparts. Though the word statistician may not exactly apply to them, they certainly bring applied technical and research skills far beyond administrative/managerial ones, and the typical “sub-optimally trained biostatistician” title would sell them very short. Also, I would like to echo Andrew’s comments that, like poli sci, epi/population health/public health in general are not focused entirely on these “milk and cancer” type studies, though they get lots of press. Epi, in its application to both clinical medicine and public health, is, like many other disciplines, a broad field, with lots of causal inference and lots of other things too. And of course, there is crappy research done in every field.

  5. I have no idea what you guys mean when you use the word causation.

    I have been an epidemiologist/biostatistician for 35 years, and I have also worked in preventive medicine all that time.

    I have seen too many biostatisticians without any formal training in biology or medicine.

    In the past, epidemiologists didn’t learn much statistics, but today epidemiologists must approach the statistical knowledge of the biostatistician: how do you design a study if you do not know how to analyze it?

    And in the medical field, causation is the exception to what we find.

    We deal with risk factors.

    I am confused by your usage; perhaps social science is quite different.

  6. Jay, the problem is that the concept “risk factor” is just too vague. Start by asking yourself why you are interested in learning about “risk factors”:

    Is it because you want to be able to avoid those risk factors, in the hope that this will reduce your risk? Then you are doing causal inference.

    Is it because you want to understand disease etiology? Then you are also doing causal inference.

    Alternatively, is it because you are interested in more accurate diagnosis? I.e., you want to know about risk factors because knowing whether a patient has a risk factor will give you information about whether he has the disease? If so, you are doing a type of prediction modeling. This is fine, but people who do this should stop using causal language (like “confounding”) and report their studies explicitly as prediction models. If this is what you are interested in, you should also think about whether a Bayesian approach would be better than the endless Cox proportional hazards models.

    Or are you interested in something else entirely?
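
    To sharpen the prediction-versus-causation contrast, here is a toy sketch of my own (invented numbers throughout): a downstream marker of disease can be an excellent “risk factor” for diagnosis while being useless as an intervention target.

    ```python
    # A risk factor that predicts well but is not worth intervening on.
    import numpy as np

    rng = np.random.default_rng(5)
    n = 50_000
    cause = rng.binomial(1, 0.2, n)                # true causal risk factor
    disease = rng.binomial(1, 0.05 + 0.3 * cause)  # disease depends on the cause
    marker = disease + rng.normal(0, 0.3, n)       # marker is an EFFECT of disease

    # For diagnosis/prediction, the marker is excellent:
    print(f"corr(marker, disease): {np.corrcoef(marker, disease)[0, 1]:.2f}")

    # For intervention, it is irrelevant: by construction, setting the marker
    # to zero would not change disease risk, whereas removing the cause lowers
    # risk from about 0.11 to 0.05.
    print(f"P(disease):            {disease.mean():.2f}")
    print(f"P(disease | no cause): {disease[cause == 0].mean():.2f}")
    ```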

  7. The main issue is not Epidemiology versus Biostatistics, because they are clearly complementary disciplines, and only schools of public health draw a distinction between the technique and the science of public health. The issue is that not everything (and I personally believe little) can be explained via statistical models. Statistics has its own limits, and physics (with physically based models) should be brought into the understanding of the underlying, coupled social, biological, and environmental dynamics that create epidemiological dynamics.

  8. Epidemiologists have a very pragmatic approach to research. They focus on what causes a disease, they have a deep understanding of research, and they have what is needed in terms of statistical knowledge to do good research. Most statisticians think that good research is the kind in which the most advanced techniques are used for the data analysis.
    THAT IS NOT TRUE. Good research is research in which we define a good research question (one of value to the community, or one that advances knowledge) and use epidemiological and statistical techniques appropriately to answer that question for the benefit of communities. You can do groundbreaking research using even the most basic statistical techniques. Good research is not defined by the level of sophistication of the data analysis. John Snow did groundbreaking research in the London cholera epidemic; how sophisticated were the methods he used? I recall that calculus and some advanced mathematical techniques were available at that time.
    In conclusion, what matters is good research, and every researcher should focus on doing research that is good and generates high value as parsimoniously as possible. It doesn’t matter if you call yourself an Epidemiologist or a Statistician.

  9. Of course, there will always be something to learn; and it doesn’t mean that there aren’t people whose understanding of research is deep enough to produce high-quality work (otherwise we wouldn’t be enjoying those groundbreaking publications we read all the time). Putting my comment in the context of this discussion comparing Epidemiologists to Statisticians, the least we can say is that Epidemiologists understand research better than Statisticians (who did not take a few Epidemiology courses), and Statisticians understand statistics better than Epidemiologists. That’s why it is not uncommon to see that in some universities, graduate students who are serious about doing health-related research are required (by their supervisor, usually a Professor of Statistics) to take a few courses in the Epidemiology department.
