Brad Greenwood, Seth Carnahan, and Laura Huang write:
A large body of medical research suggests that women are less likely than men to survive traumatic health episodes like acute myocardial infarctions. In this work, we posit that these difficulties may be partially explained, or exacerbated, by the gender match between the patient and the physician. Findings suggest that gender concordance increases a patient’s probability of survival and that the effect is driven by increased mortality when male physicians treat female patients. . . .
I heard about this study from David Kane, who also pointed to this news report and offered his own critical comments.
I replied that I didn’t think the paper was so bad but I agreed with Kane’s concerns about the data being observational.
Kane responded:
The problem is their claim that the assignment mechanism of patients to physicians is “quasirandom” when their own data demonstrates so clearly that it is not. More details:
https://www.davidkane.info/post/evidence-against-greenwood-et-al-s-claims-of-randomness/
I don’t have strong feelings on this one. I agree with Kane that the claims are speculative, and I agree with him that it would be better if the researchers would make their data public. It’s kind of frustrating when there’s a document with tons of supplementary analyses but no raw data. There’s a lot going on in this study—you should be able to learn a lot from N = 600,000 cases.
My summary
The big contributions of the researchers here are: (a) getting the dataset together, and (b) asking the question, comparing male and female doctors with male and female patients.
At this point, there are a lot of directions to go in the analysis, so I think the right thing to do is publish some summaries and preliminary estimates (which is what they did) and let the data be open to all. I don’t have any strong reason right now to disbelieve the claims in the published paper, but there’s really no need to stop here.
Thanks to Andrew for publishing this. I go into details in two posts: first and second. Comments:
1) Best line: As Andrew Gelman might say, “Forget it, David. It’s PNAS.”
2) Greenwood sent me the code but refused to allow me to post it or share it with others. I never trust an article which provides so little transparency.
3) The authors, at least Huang, have a clear ideological preference for the result which, in fact, they did find. Female physicians are better than male physicians and, if we do have to have any male physicians, they perform best (although still not as well as female physicians) in departments with more women.
4) Reader comments sought! I think that I demonstrate, easily, that the article is fundamentally flawed because their alleged “quasi-random” assignment is nowhere near random. Or am I the one spewing nonsense?
I don’t find your conjecture compelling, but neither do I find the paper satisfying. First, about your conjecture: you correctly point out that the assignment looks nonrandom – too many female patients are seen by female doctors for this to have been a quasi-random assignment. But, as far as I am aware, few patients in the ER get to choose their doctor. So, if there is a nonrandom selection, it would be done by the ER nurse or whoever performs triage when someone comes into the ER. The most likely bias, if there is one, would be to assign patients based on the severity of their symptoms. I don’t know whether this would send the sicker patients to the male or the female physicians, nor do I know which sex patients are sicker when they present at the ER with heart attacks. But I don’t find your story about conscious/unconscious patients choosing the gender of their doctors very convincing.
Looking at the study further raises a number of other – more worrisome, in my mind – issues. First, the data is for all Florida hospitals over a 20-year period. It does not appear that the analysis looks at the years separately. I have to believe (I haven’t checked for evidence yet) that there are many more female doctors in 2010 than in 1991. I also believe that treatment of heart attack patients has probably improved over that time period. Without seeing how the data breaks down by year, I’m not convinced that the selection is truly nonrandom – the apparent imbalance might be driven by time trends instead.
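If anyone with the data wanted to check this, the tabulation is simple. Here is a minimal sketch on synthetic stand-in data (the Florida discharge data is not public, so the column names and numbers are invented purely to illustrate the time-trend confound being described):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000  # toy stand-in; the real data are not public

# Synthetic data in which female-doctor share and mortality both trend over time.
year = rng.integers(1991, 2011, n)
physician_female = rng.random(n) < 0.02 + 0.01 * (year - 1991)  # more female doctors later
died = rng.random(n) < 0.15 - 0.003 * (year - 1991)             # outcomes improve later

df = pd.DataFrame({"year": year, "physician_female": physician_female, "died": died})

by_year = df.groupby("year").agg(
    share_female_md=("physician_female", "mean"),
    mortality=("died", "mean"),
    n=("died", "size"),
)
print(by_year)

# Pooled over all years, female doctors look "better" in this toy data simply
# because they are over-represented in later, lower-mortality years.
print(df.groupby("physician_female")["died"].mean())
```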
The other major issue that bothers me is that the two outcome measures the authors look at are death and length of stay. But the matching of patient and doctor gender appears to be based on the gender of the admitting doctor in the ER. Most of these patients will end up being treated by physicians other than the ER doctor. And, I think, their mortality measure covers the whole hospital stay, not the initial ER screening. So, what exactly is the connection between the outcome measures and the gender of the ER physician? I find the linkages too fuzzy to make much sense of.
Perhaps I am not understanding the study – I haven’t spent as much time with it as it requires (at least not yet). But I am disturbed (as always) by the lack of publicly available data and by the many unanswered questions. I’m not sure that the alleged “quasi-random” assignment is high on my list of concerns, however.
One addition/correction: the diagnosis code is not assigned in the ER, so the attending physicians are the doctors who see these patients not in the ER but after they are admitted. I’m not sure that patients have much choice at that point either, but there might be somewhat more discretion than in who they see in the ER.
By the way, I see nothing in the article or supplementary appendix concerning patients who die in the ER – were they excluded or are they somehow included?
Some of the models included terms for hospital * quarter, so time is included, though the interaction does not appear to be. That could be due to there not being an appreciable interaction, which is plausible.
I am concerned about group differences in the quasi-random assignment; however, I do feel that D. Kane is being overly dismissive. It is called quasi-random precisely because the authors don’t have control of assignment, and there could very well be biases compared to an actual randomized study. I agree with him that the methods the authors used in the appendix to address this issue were not sufficient. D. Kane has clearly shown that there is non-random assignment; what remains to be seen is whether this non-random assignment is already adjusted for by the model. What I’d like to see is a regression with doctor gender as the outcome and patient gender as a predictor, including all of the other covariate terms in the model. If there is an appreciable patient-gender term, then I would be worried about these results being spurious.
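For concreteness, here is a minimal sketch of that balance check on made-up data with hypothetical column names (the study data is not public, so this is only the shape of the analysis, not the authors’ model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20_000

# Toy data standing in for the (non-public) study data.
df = pd.DataFrame({
    "patient_female": rng.integers(0, 2, n),
    "patient_age": rng.normal(68, 12, n),
    "hospital": rng.integers(0, 20, n),
})
# Built-in imbalance: female patients slightly more likely to see female doctors.
p = 0.10 + 0.02 * df["patient_female"]
df["physician_female"] = (rng.random(n) < p).astype(int)

# The balance check: regress doctor gender on patient gender plus the other covariates.
balance = smf.logit(
    "physician_female ~ patient_female + patient_age + C(hospital)",
    data=df,
).fit(disp=0)
print(balance.summary().tables[1])
# An appreciable patient_female coefficient after conditioning on the covariates
# would undercut the "quasi-random assignment" claim; a coefficient near zero
# would suggest the imbalance is already absorbed by the model.
```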
A big issue here is that the effect is small, with mean differences on the order of 87% vs. 88% survival. However, the sample is so large that the sampling variation is going to be very small, and thus assumption violations, if they exist, are going to swamp the sampling variation and show up as “significant” differences.
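To put a rough number on that, using the overall sample size and a survival rate near the figures quoted above (subgroup sample sizes are of course smaller, so this overstates the precision for the concordance comparison):

```python
import math

n = 600_000
p = 0.875  # roughly the pooled survival rate discussed above
se = math.sqrt(p * (1 - p) / n)
print(f"SE of the pooled proportion: {se:.5f}")        # about 0.0004
print(f"A 1-point gap is about {0.01 / se:.0f} SEs")   # on the order of 20+ standard errors
```

So even modest violations of the quasi-random-assignment assumption can dwarf the sampling error and still come out looking highly “significant.”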
As of now, I see at least two interpretations of the data:
(1) Women, by virtue of being better listeners or something or other, are better doctors for this, having lower mortality rates across the board. Further they have _even lower_ mortality rates compared to male doctors when the patient is female.
(2) Women have historically been viewed as less capable than their male counterparts. This is perhaps especially true in this dataset, which reaches back to 1991. Due to pervasive and sometimes unconscious sexism, intake nurses (or whoever is doing assignment) are more likely to choose to give difficult cases to a male doctor even when a female doctor of equal experience is available.
I don’t think that the authors have done enough to persuade the reader that scenario (1) is true and scenario (2) is unlikely.
Since D. Kane asked for feedback: It really rubs me the wrong way when someone leans heavily on “the researcher is biased by their ideology.” Almost all researchers have a model of the world and are trying to validate it through experimentation (see the Large Hadron Collider). When I see someone make this claim too early and without a huge mound of evidence of bad faith, it makes me think that they themselves may have an ideological bent that blinds them from engaging with the material in an honest manner. That is not what D. Kane did, but it is something that I’d recommend downplaying.
Perhaps you can explain what they did – because I can’t decipher it at all. I see a few models with Hospital-Qtr as a fixed effect, though they don’t report anything about that variable’s coefficients. From the discussion I can’t tell whether it is a continuous measure of time or has something to do with the prior quarter. Interestingly, those are almost the only regressions with R-squared > 0.01 (with time series data I would expect to see larger values when time is included – although I’d also suspect autocorrelation). Also, they say that Hospital-Qtr is indicated as “Cluster” in all their regression models. What does that even mean? I am familiar with clustering, but the terminology here seems inaccessible to me.
I concur with your statement that D. Kane’s comment may reflect some ideological feelings, but I am appalled by the lack of clear description of what they actually did and how things were measured (yes, measurement is an issue again). While the authors did a lot of work – and some appears to be done somewhat carefully – how can anyone expect to figure out what exactly they did? And – as has been pointed out – the data is not publicly available so I guess we just have to trust that it was done right. Or throw it out and assume it was just a fishing expedition. Either way, I don’t see how this advances science.
I took it to mean that (for the fixed effect) there was one parameter for each hospital and each quarter within hospital from 1991 to present. Could be wrong though
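If that reading is right, then “Cluster” most likely just means cluster-robust standard errors at the hospital-quarter level: the point estimates come from a regression with hospital-quarter dummies, and the standard errors allow correlation among patients within each hospital-quarter cell. A minimal sketch of that kind of specification on made-up data with hypothetical variable names (not the authors’ actual model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 10_000
df = pd.DataFrame({
    "died": rng.integers(0, 2, n),
    "physician_female": rng.integers(0, 2, n),
    "patient_female": rng.integers(0, 2, n),
    "hospital_qtr": rng.integers(0, 200, n),  # one cell per hospital-quarter
})
df["concordant"] = (df["physician_female"] == df["patient_female"]).astype(int)

# Fixed effect: a dummy for every hospital-quarter cell.
# "Cluster": standard errors allow arbitrary correlation within each cell.
m = smf.ols("died ~ concordant + patient_female + C(hospital_qtr)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["hospital_qtr"]}
)
print(m.params[["concordant", "patient_female"]])
print(m.bse[["concordant", "patient_female"]])
```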
One last thing. When the authors did investigate their quasi-randomization they found that female doctors had patients with _worse_ mortality than male doctors (see appendix). They then inferred that randomization bias would have hurt female doctors, not helped them.
The authors are unfortunately mistaken. Female patients have higher mortality, and are disproportionately assigned to female doctors (see D. Kane’s analysis). What famous statistical phenomenon does this sound like? Answer below.
Answer (backwards): xodarap snospmis
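To spell out the phenomenon with a toy calculation (the counts are invented for illustration, not taken from the paper):

```python
import pandas as pd

# Invented counts: female patients have higher mortality and are
# disproportionately assigned to female doctors.
rows = [
    # doctor,   patient,    n,       deaths
    ("male",    "male",     100_000, 10_000),
    ("male",    "female",    40_000,  6_000),
    ("female",  "male",       5_000,    500),
    ("female",  "female",    15_000,  2_250),
]
df = pd.DataFrame(rows, columns=["doctor", "patient", "n", "deaths"])

# Within each patient-gender stratum the mortality rates are identical
# (10% for male patients, 15% for female patients, for either doctor gender)...
print(df.assign(rate=df["deaths"] / df["n"]))

# ...yet pooled over patients, female doctors look worse, simply because they
# see a higher share of the higher-mortality group.
pooled = df.groupby("doctor")[["n", "deaths"]].sum()
print(pooled["deaths"] / pooled["n"])
```

So the appendix comparison of raw mortality by doctor gender does not tell you which way the nonrandom assignment biases the concordance estimate.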
> I don’t find your conjecture compelling
Fair enough! My goal was just to provide an example of an assignment mechanism which might bias the results.
> It really rubs me the wrong way when when someone leans heavily on the “the researcher is biased by their ideology”.
Feedback is always welcome.
I wrestle with this issue. When explaining/understanding why work X is flawed, is it helpful to study/understand/speculate about the motivations of the researchers behind X? I understand the point of view that it is not — just show the flaws in the work! No need for ad hominem.
I take the other side. If you want to understand science as a human system, you need to understand the motivations of the people involved. This has come across most clearly (to me) in teaching. Many students don’t know how much of a difference a significant versus an insignificant result makes to a junior researcher’s career. Once they do understand that, understanding p-hacking comes much more easily.
And, to be sure, the need for publications is a much more common causative agent than ideology. But, in this case, come on! Look at the Atlantic coverage!
Should I not even have mentioned my belief that ideology probably played a role here?
1) I am not accusing them of “bad faith.” I am accusing them of having a strong ideological prior — 50% of physicians should be female — which carries forward into other priors/beliefs, mainly that women doctors are better. They really believe these things! No bad faith needed.
2) How much evidence of bad faith do you need? I am honestly looking for some guidelines. Personally, I start with zero faith — as I would hope that most readers of this blog would! Do you really believe that a randomly chosen PNAS article using observational data is likely to be good science? How about a random PNAS article that got a big write-up in the Atlantic? I think zero faith is actually pretty generous. Then a refusal to allow the code to be distributed moves you firmly into negative territory.
3) Isn’t a failure to notice the huge difference in the gender ratio — the central topic under study! — a sign of either incompetence or ideological bias? Honest question! It is not as if I am complaining about some obscure assumption as to the distribution of the error term.
The ideology from which a study is approached is generally only relevant if it causes the researchers to engage in bad behavior in order to torture out the result that they want. Personally I’d follow Hanlon’s razor on this (Never attribute to malice that which is adequately explained by stupidity) and only look to author motivation if “stupidity” can be ruled out.
p-hacking is an interesting example, because I think this is an area where it is very harmful to attribute it to malice. If you say that the main problem is people trying to fake their results in order to further their careers, then that is not something your students will see themselves as doing, and they may then find themselves doing it unconsciously. After all, they are not trying to fake results, they are trying to discover truth!
It can be very hard for even a sophisticated honest researcher to ensure that they are _not_ engaging in p-hacking. Every decision made with the data, every mean computed, every potential covariate adjustment is an opportunity for the garden of forking paths to bite you. How sure are you that with different data you would have ended up with the same analysis, or if the analysis is data dependent, that the analysis chosen is independent of the comparisons of interest?
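One way to make that concrete is to simulate pure noise and let the “analysis” pick whichever of a few plausible specifications looks strongest; the false-positive rate climbs well above the nominal 5% even though nobody set out to cheat. A minimal sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def forked_pvalue(n=200):
    """Pure-noise data, but the analyst reports whichever of four
    reasonable-looking comparisons gives the smallest p-value."""
    y = rng.normal(size=n)
    group = rng.integers(0, 2, n)
    covariates = rng.integers(0, 2, (n, 3))  # three optional subgroup restrictions
    pvals = [stats.ttest_ind(y[group == 0], y[group == 1]).pvalue]
    for j in range(3):
        mask = covariates[:, j] == 1  # "adjust" by restricting to a subgroup
        pvals.append(stats.ttest_ind(y[mask & (group == 0)],
                                     y[mask & (group == 1)]).pvalue)
    return min(pvals)

sims = np.array([forked_pvalue() for _ in range(2000)])
print("False-positive rate at p < 0.05:", np.mean(sims < 0.05))
# The nominal rate is 5%; taking the best of four (correlated) analyses
# inflates it well above that, with no intent to cheat anywhere.
```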
Also, if you don’t like PNAS, you’ll probably hate JAMA (https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2593255).
Thank you for posting that link. While I still want the data, it is a much better written article. For example, they clearly explain how they assigned doctors to patients – they used the doctor that accounted for the largest component of the Medicare charges submitted. This is clear recognition that treatment is done by a team and it explains how they incorporated that fact into their analysis. The PNAS paper, in contrast, does not indicate it is an issue at all and does not explain how they linked patients with doctors. This is why I say it isn’t really that difficult to write more clearly (below).
This is what happens when papers are written badly and the data is not provided. Research parasites like myself start digging and can’t understand what they see. So:
From the paper, over a 20 year period, male doctors saw around 519,000 patients and female doctors saw approximately 62,000 patients. The patients we are referring to are presumably people seen in the ER, admitted for heart attacks, and not previously being cared for by these physicians.
One of the issues that is unclear to me is whether the physician in question is the admitting doctor, the doctor who subsequently cared for the patient, or someone else. Since hospital care is a team effort, it seems to me that it should matter exactly which doctor they are analyzing when they look at the gender matches and mismatches. But I don’t have their data.
What I do have is the Medicare physician data for 2015. When I look at cardiologists seeing Medicare patients in hospitals, I find 1301 female cardiologists and 15,742 male cardiologists (the Medicare data does include a gender code).
If I instead look at emergency room visits (serious or life threatening) by Medicare patients, 1007 of the ER doctors were female and 2856 were male.
In either case, the M/F proportion of doctors is not even close to the proportion reported in the paper. Of course, the data sources are quite different and these are apples-to-oranges comparisons. But my research-parasitic mind keeps coming back to wondering what the authors mean by the “treating physician.” I don’t have a clue.
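For anyone who wants to poke at the same public file, a tabulation along these lines should get the cardiology counts; the file name, column names, and place-of-service code are from memory and should be checked against the CMS data dictionary:

```python
import pandas as pd

# 2015 Medicare Physician and Other Supplier PUF (large tab-delimited file).
# File name and column names below are assumptions -- verify against the
# CMS documentation for the release you download.
cols = ["npi", "nppes_provider_gender", "provider_type", "place_of_service"]
puf = pd.read_csv("Medicare_Provider_Util_Payment_PUF_CY2015.txt",
                  sep="\t", usecols=cols, dtype=str)

# Cardiologists billing in a facility (hospital) setting, counted once per provider.
cardio = puf[(puf["provider_type"] == "Cardiology") &
             (puf["place_of_service"] == "F")]
print(cardio.drop_duplicates("npi")["nppes_provider_gender"].value_counts())
```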
Dale:
You write, “This is what happens when papers are written badly and the data is not provided.” Just to emphasize: clear writing is difficult, researchers aren’t trained to write clearly or to share data, and the incentives don’t always favor clear writing or data sharing. After all, this paper got published in a top journal and received uncritical coverage in a respected magazine. I’m not being cynical here, just descriptive. I’d guess that the vast majority of researchers would like to write up their results clearly, they just don’t always know how to do it, and lack of clarity is not always pointed out in the editing and publicity stages of the work. It only comes up in post-publication review, as here, but then it can be too late, as the researchers can have already moved on to their next projects.
Andrew:
While I agree with you in general, you are letting these authors off the hook too easily. Clearer writing is not that difficult (clearer rather than clear). This particular paper leaves so many questions about exactly what was being measured that it should not have been difficult to have written it better. Really, how hard is it to just ask someone else to read the paper and provide feedback?
I would propose – as a minimal first step towards improving things – that anybody that works with data that is not publicly available should meet a higher standard of writing. If others cannot see the data, then the burden should be on authors to go to extra lengths to ensure that they describe carefully what data they used and how they used it. For example, they should make it clear whether they had data broken down by time, and if so, how they looked at the time dimension. That was not done in this case. It shouldn’t be so hard for editors to require that published articles adopt a standard that says that when data is not made available, extra care must be taken to describe the data.
That is a much lower bar than I would really like to see, but I don’t have great hope that we will even achieve that level of scrutiny.
Dale said,
“I would propose – as a minimal first step towards improving things – that anybody that works with data that is not publicly available should meet a higher standard of writing. If others cannot see the data, then the burden should be on authors to go to extra lengths to ensure that they describe carefully what data they used and how they used it. For example, they should make it clear whether they had data broken down by time, and if so, how they looked at the time dimension. … It shouldn’t be so hard for editors to require that published articles adopt a standard that says that when data is not made available, extra care must be taken to describe the data.”
Agreed. Now the question is, how to get that implemented? Do any journals now have that standard? If so, how do they promote and implement it? (e.g., Do they provide guidelines for editors? For peer reviewers? For authors?)
Does anyone here know of journals that have such standards?
They also are not trained in mathematics, programming, or history/philosophy of science. What exactly are these researchers trained to do that qualifies them for this job?
My major issue with this study is the baseline assumption that the emergency medicine attending physician who bills the ICD code for the patient is the major determinant of AMI outcomes. AMIs are treated by interventional cardiologists and not by emergency room physicians. This is not to disparage emergency medicine physicians; definitively treating AMI is just not their job. Finally, there are a whole slew of other factors – the triage nurse will determine how quickly a patient gets hooked up to an EKG machine. For ACS/AMI I imagine this would be far more important than the attending physician. I question whether the authors are familiar with how medicine works.
People shouldn’t be researching heart attack anymore. It’s a solved disease. See “Resolving the Coronary Artery Disease Epidemic Through Plant-Based Nutrition” https://www.ncbi.nlm.nih.gov/pubmed/11832674