You’ll have to figure this one out for yourselves.

So. The other day the following email comes in, subject line “Grabbing headlines using poor statistical methods,” from Clifford Anderson-Bergman:

Here’s another to file under “How to get mainstream publication by butchering your statistics”.

The paper: Comparison of Hospital Mortality and Readmission Rates for Medicare Patients Treated by Male vs Female Physicians

Journal: JAMA

Featured in: NPR, Fox News, Washington Post, Business Insider (I’m sure more, these are just the first few that show up in my Google News feed)

Estimated differences:
Adjusted mortality: 11.07% vs 11.49%; adjusted risk difference, –0.43%; 95% CI, –0.57% to –0.28%; P < .001; number needed to treat to prevent 1 death, 233
Adjusted readmissions: 15.02% vs 15.57%; adjusted risk difference, –0.55%; 95% CI, –0.71% to –0.39%; P < .001; number needed to treat to prevent 1 readmission, 182

Statistical Folly: "We used a multivariable linear probability model (ie, fitting ordinary least-squares to binary outcomes) as our primary model for computational efficiency and because there were problems with complete or quasi-complete separation in logistic regression models."

Regarding the number of regression parameters: Not explicitly listed, but by the following paragraph, I would suspect there are at least hundreds of regression parameters (such as an indicator of medical school attended).

"We accounted for patient characteristics, physician characteristics, and hospital fixed effects. Patient characteristics included patient age in 5-year increments (the oldest group was categorized as ≥95 years), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and other), primary diagnosis (Medicare Severity Diagnosis Related Group), 27 coexisting conditions (determined using the Elixhauser comorbidity index), median annual household income estimated from residential zip codes (in deciles), an indicator variable for Medicaid coverage, and indicator variables for year. Physician characteristics included physician age in 5-year increments (the oldest group was categorized as ≥70 years), indicator variables for the medical schools from which the physicians graduated, and type of medical training (ie, allopathic vs osteopathic training)."

Also: setting aside the question of whether all of these effects have a strictly additive effect on probability (answer: no), it's not even clear that we want to condition on many of these unlisted physician characteristics, medical training, etc., if we want to talk about the causal difference of being treated by a male rather than female doctor. I don't have any idea about whether treatment from male or female doctors is better. But I do know that this paper gets us exactly 0 steps closer to answering that question.
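
As a quick check on the figures quoted above: the number needed to treat is the reciprocal of the absolute risk difference. A minimal sketch of that arithmetic, using only the adjusted risk differences from the email (the function name is mine, not the paper's):

```python
def number_needed_to_treat(risk_difference):
    """NNT is the reciprocal of the absolute risk difference (as a proportion)."""
    return 1.0 / abs(risk_difference)

# Adjusted risk differences quoted above, converted from percent to proportions.
mortality_nnt = number_needed_to_treat(-0.0043)
readmission_nnt = number_needed_to_treat(-0.0055)

print(round(mortality_nnt), round(readmission_nnt))
```

Rounded to whole patients, these reproduce the 233 and 182 reported in the abstract, so at least that part of the summary is internally consistent.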

And this from Scott Wong an hour later:

I’m an avid reader of your blog and it has changed the way I look at statistical analyses. I’ve started using Stan (python) multi-level models in my own work because of their ability to control/balance many factors at once.

A recent article caught my eye that uses null hypothesis significance testing to make drastic claims: “Evidence of the Superiority of Female Doctors: New research estimates that if all physicians were female, 32,000 fewer Americans would die every year.” The key result in the research study is that analysis of ~1.6MM hospitalizations revealed 30-day mortality of patients treated by female physicians was 11.07% vs 11.49% for male physicians. Similar patterns were found within smaller cohorts (reducing likelihood of Simpson’s paradox rearing its head).

The magnitude and confidence of their results is quite surprising. I didn’t identify any glaring errors in the research study, so I’m wondering if this is a garden of forking paths result or if the researchers are really on to something?

Other readers of your blog might be interested as well, so I was hoping you could discuss the paper there.

And then, 29 minutes later, this from Dean Eckles:

I thought you might find this example useful. It has some of your “favorite” things, and in an important setting.

This paper analyzes a lot of Medicare records with physician gender as the treatment, based on the idea from prior work that female physicians are “more likely to adhere to clinical guidelines and evidence-based practice”. The main analysis uses linear regression adjusting for a number of patient and physician characteristics and hospital fixed effects.

The cautious description of the result in the paper is “These findings suggest that the differences in practice patterns between male and female physicians, as suggested in previous studies, may have important clinical implications for patient outcomes.” The editorial comment in JAMA Internal Medicine titled “Equal Rights for Better Work?” more boldly says “These findings that female internists provide higher quality care for hospitalized patients yet are promoted, supported, and paid less than male peers in the academic setting should push us to create systems that promote equity in start-up packages, career advancement, and remuneration for all physicians.” And the press release from Harvard says:
“The difference in mortality rates surprised us,” said lead author Yusuke Tsugawa, research associate in the Department of Health Policy and Management. “The gender of the physician appears to be particularly significant for the sickest patients. These findings indicate that potential differences in practice patterns between male and female physicians may have important clinical implications.”

By the time we are with the press, we have:
– “Don’t want to die before your time? Get a female doctor” — USA TODAY
– “Evidence of the Superiority of Female Doctors: New research estimates that if all physicians were female, 32,000 fewer Americans would die every year” — The Atlantic
– “I’m assuming the difference is because of the way that women, in general, communicate. It’s about being better listeners, more nurturing and having emotional intelligence.” — NPR All Things Considered

As I commented on Twitter, this kind of analysis isn’t usually going to be very credible, but there are a few things that stick out. The authors only adjust for patient and physician age in 5-year buckets, so any within-bucket selection of doctors’ gender remains a confounder. This is especially noteworthy because (a) being Medicare data, we are only talking about patients 65 and over, so each bucket is a pretty large fraction of the data and (b) there are substantial differences in the ages of male and female physicians. There is an alternative analysis in the supplement that uses adjustment for age as a continuous variable, though of course the way that is implemented here is that it just replaces coarsening with a restriction to linearity.

Since this has:
– Observational causal inference
– Authors reporting being surprised by their results
– Adjustment for coarsened age, rather than non-parametric adjustment for age as observed
– Breathless coverage by the press
I thought you and your blog readers might find it interesting.
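
The point about coarsened age can be illustrated with a toy calculation. Everything in it is hypothetical: the female share by physician age, the mortality curve, and the assumption of equal numbers of physicians at each age. Gender has no effect on mortality in this setup, yet within a single 5-year bucket the female-minus-male gap comes out negative, simply because the women in the bucket are younger.

```python
# Hypothetical single 5-year physician-age bucket (ages 40-44).
ages = [40, 41, 42, 43, 44]

# Assumed share of female physicians at each age (declining with age).
share_female = {40: 0.50, 41: 0.45, 42: 0.40, 43: 0.35, 44: 0.30}
share_male = {a: 1.0 - share_female[a] for a in ages}

def mortality(age):
    """Assumed patient mortality: depends on physician age only, not gender."""
    return 0.110 + 0.001 * (age - 40)

def weighted_mean_mortality(weights):
    """Average mortality over the bucket, weighting each age by group share."""
    total = sum(weights[a] for a in ages)
    return sum(weights[a] * mortality(a) for a in ages) / total

female_rate = weighted_mean_mortality(share_female)
male_rate = weighted_mean_mortality(share_male)
within_bucket_gap = female_rate - male_rate

print(f"female {female_rate:.5f}, male {male_rate:.5f}, gap {within_bucket_gap:.5f}")
```

An indicator for the 40-44 bucket cannot absorb this gap, because both groups sit in the same bucket. Whether the real data behave anything like this is exactly what the coarse adjustment leaves open.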

Eckles adds:

The authors do consider a subset of physicians who are hospitalists and call this a quasi-experiment, arguing that assignment to physician is then mainly determined by quasi-random shifts. But this also lacks some of the usual trappings of a credible analysis in that there isn’t much done to provide evidence for this assumption. And the effect size for this subpopulation also gets smaller with more controls.

And then the next day from Brandon Butcher:

I came across the following headline from Scientific American:

This seemed like something you might find interesting. Here’s the link to the original JAMA article:

A few quotes from the methods section I [Butcher] found concerning:

· Model 2 adjusted for all variables in model 1 plus hospital fixed effects (ie, hospital indicators), effectively comparing male and female physicians within the same hospital.
· We used a multivariable linear probability model (ie, fitting ordinary least-squares to binary outcomes) as our primary model for computational efficiency and because there were problems with complete or quasi-complete separation in logistic regression models.
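
To make “fitting ordinary least-squares to binary outcomes” concrete, here is a minimal one-predictor sketch. The data are made up (a hypothetical risk score from 0 to 9, with deaths only at the two highest scores) and the helper name is mine. It is not the paper's model, only an illustration of how a linear probability model can return fitted “probabilities” outside the 0-1 range.

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: risk score 0..9, binary outcome (1 = died).
risk_score = list(range(10))
died = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

intercept, slope = ols_fit(risk_score, died)
fitted = [intercept + slope * x for x in risk_score]

print(f"slope {slope:.4f}, fitted at score 0: {fitted[0]:.4f}")
```

The fitted value at the lowest risk score is negative, which is harmless if all you want is an average difference between groups and awkward if you want to read the fitted values as probabilities.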

Kinda funny that my first correspondent wrote about “butchering” the statistics, and my last correspondent’s name is Butcher. What’s that all about, huh?

In all seriousness, I don’t have the time or energy to look at this one at all. But it does seem like an important topic so I thought I’d share it with you. You can read the paper yourself and make of it what you will.

P.S. A reader pointed me to a recent response to critics from Ashish Jha, one of the authors of the paper discussed above.

P.P.S. My correspondent writes:

Might the JAMA study fall into the category of The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time?

“Statistical significance is a lot less meaningful than traditionally assumed, for two reasons. First, abundant researcher degrees of freedom (Simmons, Nelson, and Simonsohn, 2011) and forking paths (Gelman and Loken, 2014) assure researchers a high probability of finding impressive p-values, even if all effects were zero and data were pure noise. Second, as discussed by Gelman and Carlin (2014), statistically significant comparisons systematically overestimate effect sizes (type M errors) and can have the wrong sign (type S errors).”

Ref: The statistical crisis in science: How is it relevant to clinical neuropsychology

My reply: Yes, this is possible. In some sense, perhaps “forking paths” should be our default assumption when judging such work. Rather than seeking evidence to support a claim of forking paths, perhaps we consider all p-values to be the result of forking paths (except in rare cases of preregistration) and we look for our inferences elsewhere. But I’m not quite ready to take that step in the context of a highly-publicized paper that I’ve been too busy to even read!

P.P.P.S. Also this highly critical assessment from William Briggs:

The NBC News story “Female Doctors Outperform Male Doctors, According to Study” makes these bold claims.

Patients treated by women are less likely to die of what ails them and less likely to have to come back to the hospital for more treatment, researchers reported Monday.

If all doctors performed as well as the female physicians in the study, it would save 32,000 lives every year, the team at the Harvard School of Public Health estimated.

Yet women doctors are paid less than men, on average, and less likely to be promoted.

“The data out there says that women physicians tend to be a little bit better at sticking to the evidence and doing the things that we know work better,” [Harvard’s Dr. Ashish Jha, who oversaw the study] told NBC News.

The ordinary reader would assume female doctors are always much better than male doctors, and the reason is (partly) because male doctors practice medicine regardless of what the evidence dictates. Worse, they receive greater rewards for their foolish and dangerous behavior.

The NBC story drew from the paper “Comparison of Hospital Mortality and Readmission Rates for Medicare Patients Treated by Male vs Female Physicians” in the journal JAMA Internal Medicine by Tsugawa, Jena, and Figueroa. Its main claim is this:

Using a national sample of hospitalized Medicare beneficiaries, we found that patients who receive care from female general internists have lower 30-day mortality and readmission rates than do patients cared for by male internists. These findings suggest that the differences in practice patterns between male and female physicians, as suggested in previous studies, may have important clinical implications for patient outcomes.

Now those “suggests” in the second sentence should set alarm bells ringing. And, indeed, Tsugawa and his co-authors did not measure how doctors practiced, and so even if it were true that male and female physicians had different 30-day mortality and readmission rates, the researchers would have no way of knowing why the differences existed. And neither would NBC.

Let’s Examine the Numbers

What happened was this. The authors collected a sample of about a million-and-a-half “Medicare fee-for-service beneficiaries 65 years or older who were hospitalized in acute care hospitals.” Mean age of patients was about 80. The NBC summary misleads by saying just “patients,” which implies the research applies to everybody and not just elderly Medicare patients. . . .

Are there other possible explanations to account for the small differences noted by the models? Yes. Female docs were about 5 years younger on average, and female docs also treated many fewer patients on average than men. This implies women docs had more time per patient.

Even more intriguing, we also know “female physicians treated slightly higher proportions of female patients than male physicians did.” And since women live longer than men, particularly at those advanced ages, maybe — just maybe! — any slight change in mortality or readmission rates between male and female docs could be explained by women doctors treating more longer-lived patients.

That explanation is surely as or more plausible than results from an unnecessarily complicated statistical model. It also eliminates the unwarranted theorizing about how women physicians are “better at sticking to the evidence” and are thus “underpaid.”


  1. NatashaRostova says:

    The frustrating part is not just the bad stats, but the meta-scientific bias in favor of publishing “women are better than men,” which is in congruence with modern political movements focused on promoting women in the workplace etc.

    If you assume a uniform distribution of bad statisticians, but who focus far more on [fashionable topic], it creates a second layer of bias.

    • Thomas says:

      I like this notion of “meta-scientific bias in favor of publishing”. The problem here is that a study has been done that can be summarized as Hamblin did in the Atlantic: “Female physicians actually tend to provide higher-quality medical care than males, according to research released today. If male physicians were as adept as females, some 32,000 fewer Americans would die every year—among Medicare patients alone.”

      Now, suppose the study had shown the opposite. Suppose Hamblin would have to report that firing all female physicians would save 32,000 annually “among Medicare patients alone”. I think it’s safe to assume that the reaction would be to call into question the motivations behind even doing the study. Knowing the answer, people would say the mistake was even to ask the question.

      I lean towards thinking that that is actually a reason not to do these kinds of study, even with very good statistics. Even though there’s probably an objective truth about whether men or women are “generally” better doctors, it’s not a truth that does either doctors or patients any good to know.

      • John Goodwin says:


        A thesis is weakly falsifiable, in the sense of Popper, if it can possibly be proven false.

        A thesis is *strongly* falsifiable, if, were it to turn out to be false, the results would still be publishable.

        • +1

          This is a sort of rhetorical simulation of the philosophy of science. It’s a good idea to apply traditional criteria of “scientificity” (i.e., “demarcation criteria”) not just to claims and theories but to the conversations that are possible in practice. Quine once defined the “ideology” of a theory as that which determines what ideas are expressible within it, as distinct from its ontology, i.e., what sorts of things it talks about. He was explicitly trying to extricate the term from its political baggage. But I think we’re seeing the relevance of ideology in the ordinary, political sense here. I like the idea of thinking about the publishability of a negative or null result as part of the criterion of falsifiability. Thanks, John.

  2. Louis says:

    I got this paper in my inbox as well. It seems an important issue and while I do not think that it is implausible that one gender does on average better than another in a particular job, I do have some concerns with the study, which does not mean that the conclusions are wrong…

    It struck me that this might be an excellent choice for another crowdstorming project…. I remember seeing an advertisement for such a project on your blog once. Maybe the organizers can try to tackle this topic??
    Though some of the issues I have are with sample selection and measurement and it might be hard to “fix” this before organizing the crowdstorm.

  3. Alex says:

    “Kinda funny that my first correspondent wrote about “butchering” the statistics, and my last correspondent’s name is Butcher. What’s that all about, huh?”

    Dentists named Dennis!

  4. Dale Lehman says:

    Can someone explain “We used a multivariable linear probability model (i.e. fitting ordinary least squares to binary outcomes) as our primary model for computational efficiency and because there were problems with complete or quasi-complete separation in logistic regression models?” I don’t understand the last part – what is meant by problems with separation or quasi-separation?

    • Jonathan (another one) says:

      If there’s separation then the logistic coefficient is unbounded on the separating variables. Quasi-separation is not quite separated, but so close that some variables have near-infinite coefficients. Stata (for example) will then drop the thus-separated observations (with an announcement that says, “three successes are perfectly determined”) and run the regression on the smaller set of observations.

      • Dale Lehman says:

        It seems odd to me that separation should be an issue here. Does that mean that virtually all the patients died (or were readmitted) once some combination of the X variables is encountered? Given the large sample size and the nature of the X variables (age, illness severity, physician characteristics, etc.), this seems unlikely to me. I’ve run plenty of logistic regression models without encountering this problem and I don’t see why it would occur here.

        • Jonathan (another one) says:

          Take a hospital where no one died. There are hospital fixed effects, so all of those patients are eliminated in a logistic regression. Or a hospital where all the patients in some age bucket died. By running OLS on the probabilities instead, those patients can stay in the sample. (It’s not what I’d do.) I have encountered several well-trained economics PhDs who tell me they were told in graduate school to run OLS on binary data because heteroskedasticity doesn’t bias the coefficients and only messes with the standard errors. I’ve never found the logic of this very compelling, particularly when estimating the probability of something at -12.6%.
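
The zero-death hospital case can be sketched in a few lines. This is a deliberately stripped-down setup, not the paper's model: one hypothetical hospital with 50 patients and no deaths, and a logit model whose only parameter is that hospital's fixed effect. The log-likelihood keeps improving as the fixed effect heads toward minus infinity, so there is no finite maximum-likelihood estimate, while the linear probability model simply reports the observed rate.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def hospital_log_likelihood(fixed_effect, n_patients, n_deaths):
    """Bernoulli log-likelihood when the logit of death equals the fixed effect."""
    survivors = n_patients - n_deaths
    return n_deaths * log_sigmoid(fixed_effect) + survivors * log_sigmoid(-fixed_effect)

# Hypothetical hospital: 50 patients, nobody died.
n_patients, n_deaths = 50, 0

candidates = [-2.0, -5.0, -10.0, -20.0]
log_liks = [hospital_log_likelihood(a, n_patients, n_deaths) for a in candidates]

# The linear probability model's hospital effect is just the observed death rate.
lpm_estimate = n_deaths / n_patients

for a, ll in zip(candidates, log_liks):
    print(f"fixed effect {a:6.1f}: log-likelihood {ll:.6f}")
```

Each more extreme candidate fits better than the last, which is the sense in which the coefficient is “unbounded”; software that detects this typically drops those observations rather than report an infinite estimate.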

          • Andrew says:


            -12.6% . . . wasn’t that what Sam Wang said was the probability that Trump would win the election?

          • jrc says:

            I think OLS for binary variables makes sense when you have a) a large sample; b) randomization of treatment into groups. And I think it mostly makes sense just for interpretive ease: the coefficients are easy to interpret as the difference in probability between the two groups (when you regress on a constant and a treatment indicator variable). So in the sense of estimating “treatment effects” I’m often fine with OLS. This is less true for very rare (or very common) events, where the empirical proportion of people with condition Y is around 1% or 99%. In any case, I often want to see a Probit or Logit or Negative Binomial or Poisson or whatever to back it up, even if it can’t handle the fixed-effects or SE estimates as well… which brings me to point 2:

            When we use OLS then we can use all of our “arbitrary variance/cov matrix” (aka “cluster robust”) estimators for standard errors. People throw around “cluster robust” a lot, but often that means making substantive claims about the distributions of the error term(s), such as in various random effects models. The OLS cluster robust SE estimates don’t make such assumptions (within clusters…of course, they assume no correlation of error terms across clusters). I think this is a real and under-appreciated benefit of using least-squares models here. Computational time should become irrelevant fairly soon, even for 1M observations (100B observations, maybe not, but Facebook and Google can work on that). But the SE estimators we have for OLS are, for the moment I think, generally more conservative/flexible/robust than those we have for non-linear models.

            • Jonathan (another one) says:

              With a large enough sample, almost anything works. And of course checking the OLS model with a nonlinear model ought to be required anyway. But when you don’t have separation, or very weak separation, the data sets are now different. So what now? I don’t know the models you’re dealing with, but I’m not talking about 1%/99%… I can often get -12.6% probabilities out of groups that the logit forecast at around 5%! It’s the inflexibility of the leverage from the mean of the data that creates the problem. As to fancier standard errors, I’ll give you that one… OLS wins one comparison. But the nice thing about logistic errors is that really large standard errors don’t kill you when transformed into probability space, so I really don’t care very much (usually.)
              Finally, the advances in computing cut both ways. Logit probability calculations are simple enough now in complex models that the human translation from logit coefficients to marginal effects in probabilities over the whole data set is one line of code.
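
For what that “one line” might look like, here is a sketch of an average marginal effect for a binary treatment in a logit model. The coefficients and covariate values are hypothetical, chosen only for illustration; in practice they would come from a fitted model and the actual data.

```python
import math

def sigmoid(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted logit coefficients (not estimates from the paper).
intercept, treat_coef, x_coef = -2.0, -0.05, 0.3

# Hypothetical covariate values, one per observation.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]

# Average marginal effect: mean change in predicted probability when the
# treatment indicator flips from 0 to 1, with each x held at its observed value.
ame = sum(sigmoid(intercept + treat_coef + x_coef * x) - sigmoid(intercept + x_coef * x) for x in xs) / len(xs)

print(f"average marginal effect: {ame:.5f}")
```

The result is on the probability scale, so it can be set side by side with a linear probability model coefficient.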

              • jrc says:

                Yeah – I agree with you on this.

                Regarding the changing sample due to dropping groups with no within-FE variation (or the problem when you don’t have “separation” of the kind I think you are discussing, like lack of variation in “treatment” within groups): I know that some people are doing some decomposition work on precisely this. The context is something like this – You have an OLS estimate without group dummies, and an FE estimate with them. Two things are changing your point estimates – a) you are using different identifying variation (within- as opposed to between-); b) you are implicitly “weighting” each observation differently once you add FE. The new techniques try to break down the difference in the two estimates to that due to the identifying variation itself (do I compare across- or within-) and that due to the changing “implicit weights” that you put on each observation after your within-group differencing (because “implicit weights” depend on within-group variance after de-meaning). I’ve seen one paper like this presented, but I can’t find it online, and I might be messing up the thinking a bit. But I think this is an interesting new research domain, and I’d like to see more work on it (and more work on comparing estimates from different estimators in general).

            • george says:

              Cluster robust std error estimates are available well beyond linear regression. For many common forms of regression GEE provides the natural robust version.

          • george says:


            “The authors only adjust for patient and physician age in 5-year buckets, so any within-bucket selection of doctors’ gender remains a confounder.”

            See the online appendices with age included as a continuous covariate

          • Dale Lehman says:

            The study is based on an entire year of Medicare data for the population aged 65+. If there is a hospital where nobody died, please tell me where it is. That’s much more valuable information to me than the 0.42% lower mortality rate by having a female rather than a male physician.

  5. Wait a second here–am I misinterpreting something, or did the Atlantic article (by James Hamblin) totally mess up the percentages?

    According to the article, “People treated by a female had a 4 percent lower relative risk of dying and 5 percent lower relative risk of being admitted to the hospital again in the following month.”

    Isn’t the difference 0.43% and 0.55%, not 4% and 5%? In other words, didn’t Hamblin multiply the effect by 10?

    Here’s the Atlantic article:
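
One way to reconcile the two sets of numbers is to separate absolute from relative differences. A quick calculation from the adjusted rates quoted in the post (variable names are mine) suggests the article's 4 percent is a relative reduction rather than the absolute difference multiplied by ten:

```python
# Adjusted figures quoted in the post, in percentage points.
male_mortality, mortality_risk_diff = 11.49, 0.43
male_readmission, readmission_risk_diff = 15.57, 0.55

# Relative reduction = absolute difference divided by the comparison-group rate.
relative_mortality_pct = 100 * mortality_risk_diff / male_mortality
relative_readmission_pct = 100 * readmission_risk_diff / male_readmission

print(f"mortality: {relative_mortality_pct:.2f}% relative reduction")
print(f"readmission: {relative_readmission_pct:.2f}% relative reduction")
```

The mortality figure comes to roughly 3.7 percent, which rounds to the article's 4 percent. The readmission figure comes to roughly 3.5 percent from these rounded inputs, so the quoted 5 percent presumably rests on a different calculation that I have not tried to reproduce.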

  6. A few other things I noticed:

    There are far more male than female physicians in this study. The numbers are given on p. E4:

    F: 18,751
    M: 39,593.

    In addition, male physicians had far more annual hospitalizations (on average) than female physicians:

    F: 131.9
    M: 180.5

    Moreover, male physicians had far more patients in total (and on average) than female physicians:

    F: 415,559 (total)
    M: 1,200,296 (total)

    Given that the sample of male physicians is much larger than that of female physicians, and given that male physicians had, on average, more patients *and* more hospitalizations than female physicians, isn’t there a possibility that the male mortality rates are affected in part by (a) a larger patient load; (b) a larger proportion of patients in critical condition; and (c) a larger sample? In other words, given these disparities, are the mortality rates fully comparable?

    • Jonathan says:

      I would think it would be more that male doctors have more paperwork filed under their names. That is, I’ve never seen a patient treated by “a physician” in a hospital setting. Can’t imagine that happening. But I do see teams and there’s going to be some physician in charge and that is typically done by shift, with regulations about having some form of attending physician on duty, etc. I mean that glancing through the discussion of their data set, I didn’t see anything that would indicate actual female or male physicians were actually responsible for these but not those patients. All medical work is done by teams and the actual physician in charge typically plays a small role, maybe even a fleeting role. Actual care is handled mostly by nurses, as anyone who has ever spent time in a hospital can tell you, and most patient contact is with junior residents, interns and even medical students. I have no idea how that can be reduced to treatment by a female or male physician. As I said, I only skimmed the discussion but I didn’t see any mention of this, just that they looked at billing and paperwork names. If we assume that is somewhat accurate, how accurately would that reflect actual female or male physician treatment, whatever the heck that is? It can’t be anything close to 4 or 5% and must be higher. I’d guess more like over 20% inaccurate to be conservative. One simple reason is that billing is an incredibly bad way to track actual physician engagement in treatment and paperwork is collaborative.

      • I second the +1. Also, Jonathan, I thoroughly enjoy your comments. (I enjoy others’ comments too, of course! I come to this blog partly for the quality of the comment threads.) If you have a blog or book or something and don’t mind linking to it, I believe I would enjoy reading it.

      • shravan says:

        I’m actually sitting in hospital right now recovering from dialysis shunt surgery, in a completely normal German hospital here in Berlin. My ninth or tenth op. I have also experienced five years of so-called health care in Columbus OH, and five years of Japanese hospitals (plus of course many unpleasant years of Indian hospitals). So I am something of an authority on getting medical care, or at least it sure feels like I am. The assertions you make about who sees the patients are true *only* in the US, where the primary goal is to control cost, patient treatment is only a side-effect of cost control. Countries in Europe, and Japan, have an actually functioning medical system. Right now it’s Christmas day and the chief surgeon of the shunt center is about to come and see me. I didn’t see a single med student except one who came to draw blood.

        It’s a good practice IMHO to always add the caveat *in the United States* lest people forget just how pathetic the US system is. It is the only reason I left the US the day after I got my degree.

    • Björn says:

      I’m sure one can come up with some conceivable theories on how the factors you mention might affect the results. E.g. what about this one: “That females are underrepresented among physicians compared to the overall population may reflect hurdles to entry into the profession. These hurdles result in a female physician on average having had to be more qualified to enter the profession than a male one. As a result the average female physician is slightly more qualified/suited to the job/better at the job than the average male one. If you had equal proportions any differences might disappear.” I guess that’s a conceivable theory, but I have no idea whether it’s actually true and whether there is data out there to verify or falsify it (perhaps if we had some kind of standardized test results on all physicians at the time they start university). I guess it is also hard to know whether the authors could fully account for differences in e.g. medical specialization, shifts worked or severity of cases assigned (your point b) etc., which you could speculate might differ between any two groups of physicians due to any number of reasons (ranging from personal preferences of physicians to those of more senior people that assign work and/or training opportunities). I could not really guess how that would even affect the results.

      The claimed effect size is not huge (in relative terms) so that I feel like it could be explained by some confounding factors (despite what the authors say). On the other hand the effect size is also of a plausible order of magnitude – i.e. substantially smaller than what you usually see with extremely effective medical interventions so that it could be what you get by applying them a bit better/more consistently.

  7. Rob says:

    Is it just me or is an absolute difference of 0.4% not that much? I get that it’s “statistically significant” given the large counts but (1) isn’t the overall model error likely bigger than that and (2) 0.4% just feels small (ie: not practically significant).

    • george says:

      Rob: 0.4% of the adult US population is a lot of people dying just because of the sex of their doctor – roughly 32,000 a year, by the authors’ estimate. If that were true (which I suspect it’s not) it would and should be a big story.

      A decent bias analysis could address this difficulty; if reporting biases are present by sex (see comment by Jonathan above) then the interpretation of the study likely goes from “small effect but one which matters” to “small effect if any and we have basically no idea about its direction”. Allowing for biases like this reminds me of the difference between 538’s recent predictions and pretty much everyone else’s.

      It’s disappointing that the authors included some woof about unmeasured confounding (unsubstantiated claim it would have to be “substantial magnitude to explain the differences we found”) but nothing about plausible biases in the data that was recorded.

  8. Eric Loken says:

    So, post hoc storytelling: here are a couple of thoughts. In educational data we routinely see a small GPA advantage for female students (maybe a 1/6 to 1/4 point or so), which I generally attribute to some version of conscientiousness and reduced probability of being in the very low tail. Does that translate into the workforce? Just a bit more low end male variance could skew in the observed direction for health care. There’s probably only so much good the most skilled can do to preserve life, but there is plenty of damage the least skilled can do.

  9. Cliff AB says:

    Eric: If female students have a higher average GPA, I could definitely see that being linked to lower mortality among their patients.

    But that effect should have been washed out from the analysis! They adjusted for hundreds (maybe thousands? They did not specify how some of the categorical variables were coded) of variables, one of them being med school attended by the physician.

    So if you believe all the results, this is the difference in mortality rates between male and female doctors who attended the same med school. The GPA effect would likely be lumped into the med school attended.

    That being said, I really don’t put too much stock in an extremely small estimated effect from an OLS model with hundreds to thousands of covariates, definitely nonlinear effects and 0-1 responses. But I’ll refrain from cutting into that onion for the moment.

    The conclusion of the paper is certainly believable to me. But after reading that paper, my posterior is equal to my prior.

  10. Eric says:


    I hope you can do this one instead of the next silly one when it comes up. I think it’s important to see these things in meaningful papers rather than just in silly papers. I really appreciated when you did Chinese air pollution and when you did gender lifespan.


  11. Did anybody notice that, in the tables, confidence intervals overlapped while significant differences were nonetheless claimed? And that some confidence intervals that came close to overlapping had sizable significance levels attached to them?

    • Corey says:

      That could be kosher. The confidence intervals are marginal (as in “marginal probability distribution”); you have to know how correlated the estimates are (i.e., what’s going on with the joint probability) to be able to say something about estimates of differences.
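
      A minimal sketch of this point in R. The two estimates, their standard errors, and the correlation are made-up numbers chosen for illustration, not values from the paper. It shows two 95% intervals that overlap even though the difference between the estimates is clearly significant:

      ```r
      # Hypothetical numbers, chosen only to illustrate the point.
      est <- c(a = 10, b = 13)   # two point estimates
      se  <- c(a = 1,  b = 1)    # their (marginal) standard errors
      rho <- 0.5                 # assumed correlation between the two estimates

      # Marginal 95% confidence intervals
      z  <- qnorm(0.975)
      ci <- cbind(lower = est - z * se, upper = est + z * se)
      intervals_overlap <- ci["a", "upper"] > ci["b", "lower"]

      # The test of the difference uses the joint distribution:
      # Var(b - a) = se_a^2 + se_b^2 - 2 * rho * se_a * se_b
      se_diff <- sqrt(sum(se^2) - 2 * rho * prod(se))
      z_diff  <- unname((est["b"] - est["a"]) / se_diff)
      p_diff  <- 2 * pnorm(-abs(z_diff))

      print(round(ci, 2))
      cat("Intervals overlap:", intervals_overlap, "\n")
      cat("p-value for the difference:", signif(p_diff, 2), "\n")
      ```

      Even with rho set to 0 the difference here stays significant at the 5% level, because the standard error of a difference is smaller than the sum of the two standard errors.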

  12. Tom Passin says:

    When you “correct” or “adjust” for some predictor/variable/parameter (use whatever term you like best), there is always going to be some uncertainty in its value. So there will be some uncertainty in what is left behind – i.e., the adjusted values – in addition to the original variation in the data. I’ve noticed that you almost never get told the size of these uncertainties.

    If you “correct” for dozens or hundreds of variables, how often does it happen that what remains has any reasonably low uncertainty left? I don’t know, but it can’t be very often. Should we take seriously a paper that claims to have made many such adjustments but doesn’t say how the adjustments have affected the uncertainty in the adjusted values?

  13. Devin says:

    I don’t really buy their quasi-randomization assumption. Why didn’t they use an actual causal inference model instead of straight-up OLS regression? It would be a simple matter to throw some propensity score matching at their treatment assignment, right?

    Also, there’s no mention of whether they looked at any covariate interactions. For all we know, a covariate like (shift*physician gender) is highly significant and makes the very small marginal effects they noted disappear or even flip sign.

  14. Nerissa says:

    Male doctors in the study saw over a third more patients/doctor than female doctors. Of course female doctors got slightly better results. They had more time/patient.

  15. Jack says:

    With the exception of the lack of effort to better match the sample via propensity scoring or IV, this is a solid paper for examining health outcomes using observational data. There is a solid risk adjustment model for the patient risk of death using the patient-level data in the Medicare 5% sample. There is some effort to control for other aspects of practice, although the controls, as noted in prior comments, may not be perfect and a propensity weighted or stratified model might be more informative.

    The authors did not go down a garden of forking paths. They didn’t keep subsetting the patients to find an effect. They tested for the robustness of their results in their sensitivity analysis, with multiple specifications to deal with the concerns about differences in the practice setting of female and male physicians. Thus, they examined the results involving only hospitalists and, anticipating the concern about team medicine, also tested the effect of allocating patients to the primary physician using a different allocation method.

    0.4% on a death rate of 11%, a 4-5% reduction in the death rate, is a big deal; this is typical of the effect size observed in many quality improvement projects in healthcare.

    The potential weaknesses in the comparison of male and female physicians – differences in the number of admissions per physician (not number of patients seen, since the female physicians may have a different mix of Medicare and non-Medicare patients and see the same number of patients/day), age – could have been more aggressively controlled for in the analysis, but this paper is not in the same class as the ESP or himmicanes papers.

    There is a significant econometrics literature that argues that linear probability models are unbiased and robust, even if the SEs are not completely accurate. The authors look like they were trained in programs that make linear probability models a good second choice.
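
    A rough sketch of the benign case, on simulated data only. Every number, and the variable names `x` and `female`, are assumptions for illustration and have nothing to do with the paper's data. When the "treatment" is independent of the covariates by construction, OLS on the 0/1 outcome and a logistic regression summarized as an average risk difference should land very close together, even though the true model is logistic:

    ```r
    # Simulated data only; every number here is an assumption for illustration.
    set.seed(1)
    n <- 1e5
    x <- rnorm(n)                        # one patient-level covariate
    female <- rbinom(n, 1, 0.5)          # "treatment", independent of x by construction
    p <- plogis(-2 + x - 0.05 * female)  # true model is logistic, not linear
    y <- rbinom(n, 1, p)                 # binary outcome (e.g. death)

    # Linear probability model: OLS on the 0/1 outcome
    lpm_rd <- unname(coef(lm(y ~ x + female))["female"])

    # Logistic regression, summarized as an average risk difference
    fit <- glm(y ~ x + female, family = binomial)
    p1 <- predict(fit, newdata = data.frame(x = x, female = 1), type = "response")
    p0 <- predict(fit, newdata = data.frame(x = x, female = 0), type = "response")
    logit_rd <- mean(p1 - p0)

    cat("LPM risk difference:     ", round(lpm_rd, 4), "\n")
    cat("Logistic risk difference:", round(logit_rd, 4), "\n")
    ```

    This agreement leans on the independence built into the simulation; it says nothing about what happens when the variable of interest is correlated with covariates whose effects are mis-modeled.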

    The lack of separation is interesting, and I suspect it has to do with a number of the smaller hospitals in the sample not having female doctors, and perhaps some not having male doctors.

    • E Thomas says:

      The results are also in line with broader research on gender differences in performance. Women (esp. those with experience) tend to outperform men when using objective, mechanical performance indicators: e.g., fewer ethical violations (doctors, dentists, lawyers, pharmacists), fatal accidents (pilots), medical mistakes (doctors, dentists, pharmacists).

    • Jose says:

      I agree with you. I think complaining about ages being in buckets is pretty pedantic/petty.

    • Cliff AB says:


      One problem I had with this study is the number of covariates. They did not specify, but it could not have been fewer than several hundred, and could easily have been upwards of several thousand, depending on how the categorical variables were coded. The fact that they had millions of records and still had perfect separation seems to be good evidence that the number of covariates was likely greater than 1,000 (at least).

      Presumably, all these covariates are not perfectly balanced between the genders. If they are not balanced, then if the functional form (i.e. linear effects, multiplicative effects, etc.) is not correct, mis-modeling the effect of covariates can lead to biased estimation of a main effect of interest. Here’s an example in R:

      > n x1 x2 0 # variable that will independent of outcome CONDITIONAL on x1
      > y summary(lm(y ~ x1 + x2)) # Note the apparent bias in x2 conditional on x1!

      Call:
      lm(formula = y ~ x1 + x2)

      Residuals:
         Min      1Q  Median      3Q     Max
      -4.874  -1.134  -0.255   0.782 117.306

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)  2.21957    0.01196  185.60   <2e-16 ***
      x1           2.13635    0.01022  209.05   <2e-16 ***
      x2TRUE      -1.14221    0.02049  -55.75   <2e-16 ***

      Linear effects on probability are, generally speaking, wrong (unless you’re looking at strictly disjoint effects for some reason). With thousands of covariates that are important, an effect of .005 on a probability should be taken with a little bit of salt.

      If gender were independent of all the other covariates, we wouldn’t have to worry about the biased estimates. But they never make that claim. In fact, I believe the opposite is implied by including them in the model.

      As stated earlier, their conclusion that female doctors have lower patient mortality rates is completely believable to me. But my posterior is still equal to my prior.

      • Cliff AB says:

        Arrg, I looked up how to put code into wordpress, and apparently was misled! Here’s the code:

        n = 10^5 # sample size
        x1 = rnorm(n) # predictive covariate
        x2 = x1 > 0 # variable that will be independent of the outcome CONDITIONAL on x1
        y = rpois(n, lambda = exp(x1)) # log-link instead of identity link
        summary(lm(y ~ x1 + x2)) # Note the apparent bias in x2 conditional on x1!
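
        As a companion sketch (not part of the original comment), the same simulated data can be fit with the link that actually generated it. The seed is arbitrary and only there for reproducibility. Under the correct log link the x2 coefficient should sit near zero, while the identity-link OLS fit shows a large spurious negative effect:

        ```r
        set.seed(1)                       # arbitrary seed, only for reproducibility
        n  <- 10^5                        # sample size
        x1 <- rnorm(n)                    # predictive covariate
        x2 <- x1 > 0                      # no effect on y once x1 is known
        y  <- rpois(n, lambda = exp(x1))  # log link generates the data

        # Identity link (OLS): functional form is wrong, so x2 soaks up the curvature
        ols <- unname(coef(lm(y ~ x1 + x2))["x2TRUE"])

        # Log link (Poisson GLM): functional form matches the data-generating process
        pois <- unname(coef(glm(y ~ x1 + x2, family = poisson))["x2TRUE"])

        print(round(c(OLS = ols, Poisson = pois), 3))
        ```

        The bias in the first fit comes entirely from the misspecified functional form combined with the dependence between x1 and x2; nothing else differs between the two models.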

    • Cliff AB says:


      I’m also not sure what you are referring to with regard to “lack of separation”. The problem they mention in the paper is that of perfect separation, which is closely related to the “Hauck-Donner effect”.

      • Jack says:

        Cliff AB:
        Sloppy writing on my part. With fixed effects for the hospital, and relatively small samples in some hospitals, it is likely that some combination of the categorical variables at the hospital level would lead to perfect separation. I’ve also had the experience of failure to converge when a categorical variable almost, but not perfectly, predicted an outcome.

        With respect to the number of covariates in the model, without doing a direct count of all the covariates, it looks like:
        about 30 patient demographic variables
        27 Elixhauser comorbidities
        700-800 DRGs, depending on the grouper. Many won’t have any mortality and should be dropped as perfectly predictive. Final number not reported. I would guess based on past work, about 300 DRG categorical variables used in the analysis
        10-20 physician covariates.
        3000+ hospital fixed effects.

        I appreciate the concern about the potential bias due to unbalanced covariates, but it is not clear how many of these would be associated with the choice of physician and thus appropriate for a propensity adjustment. It is one of the challenges of working with observational data, which for many issues in health policy is all we will have. No one is randomizing to male or female doctors, and the quasi-randomization by which hospitalists are assigned an ICU admission was an attempt to approximate this. The sensitivity analysis and the presentation of models with different mixes of included covariates were an effort to examine these effects and be transparent about the sensitivity of the results to the modeling strategy.

  16. Jb says:

    Another big issue in my mind, which may be mediated by the lower number of patients and thus more time per patient: all models were adjusted by comorbidity scores, which I assume would be correlated with how many problems the physician recorded in their documentation, which, as a physician, I would “know” to be the case. The busier we are, the fewer problems are documented in notes and thus the less sick patients look, so busier physicians (or maybe worse male note writers) would have lower comorbidity scores and thus appear to perform worse. No analysis without adjusting for comorbidity score was done; maybe they could have found appropriate settings to assume no difference and checked that out? I feel this could be an endogeneity issue with any physician workload study that adjusts for comorbidity scores.

    Always worry that the attention grabbing headlines will prevent people from critically assessing their analysis …

    • Martha (Smith) says:

      jb said,
      “Always worry that the attention grabbing headlines will prevent people from critically assessing their analysis …”

      Yes, definitely something to worry about.
