Cancer patients be criming? Some discussion and meta-discussion of statistical modeling, causal inference, and social science:

1. Meta-story

Someone pointed me to a news report of a statistics-based research claim and asked me what I thought of it. I read through the press summary and the underlying research paper.

At this point, it’s natural to anticipate one of two endpoints: I like the paper, or I don’t. The results seem reasonable and well founded, or the data and analysis seem flawed.

One tricky point here is that there’s an asymmetry. A bad paper can be obviously bad in ways that a good paper can’t be obviously good.

Here are some quick reasons to distrust a paper (with examples):

– The published numbers just don’t add up (pizzagate)
– The claim would seem to violate some physical laws (ESP)
– The claimed effect is implausibly large (beauty and sex ratio, ovulation and voting)
– The results look too good compared to the noise level (various papers criticized by Gregory Francis)
– The paper makes claims that are not addressed by the data at hand (power pose)
– The fitted model makes no sense (air pollution in China)
– Garbled data (gremlins)
– Misleading citation of the literature (sleep dude)
And lots more.

Some of these problems are easier to notice than others, and multiple problems can go together. And sometimes the problems are obvious in retrospect but it takes a while to find them. The point is, it's often possible to see that a paper has fatal flaws. (And at this point let me add the standard evidence vs. truth clarification: just because a work of research has fatal flaws, it doesn't mean that its underlying substantive claim is false; it just means the claim is not strongly supported by the data at hand.)

OK, so we can often be suspicious of a paper right away and then, soon after, be clear on its fatal flaws. Other times we're given a perpetual-motion-machine-in-a-box: a paper whose claims are ridiculous, but we don't feel like going through the trouble of unpacking it and finding the battery that's driving it.

But here’s the asymmetry: we can read a paper and say that it’s reasonable, or even read it and say that the flaws we notice don’t seem fatal—but it’s harder to be sure. It’s a lot easier to get a clear sense that something’s broken than to get a clear sense that it works.

2. The story

Sanjeev Sripathi writes:

I’d like to understand your thoughts on this analysis that establishes a causative link between cancer and crime in Denmark: It was part of today’s WSJ Economics newsletter. Do you agree with the conclusions?

I followed the link to the research article, Breaking Bad: How Health Shocks Prompt Crime, by Steffen Andersen, Gianpaolo Parise, Kim Peijnenburg, which begins:

Exploiting variations in the timing of cancer diagnoses, we find that health shocks elicit an increase in the probability of committing crime by 13%. This response is economically significant at both the extensive (first-time criminals) and intensive margin (reoffenders). We uncover evidence for two channels explaining our findings. First, diagnosed individuals seek illegal revenues to compensate for the loss of earnings on the legal labor market. Second, cancer patients face lower expected cost of punishment through a lower survival probability. We do not find evidence that changes in preferences explain our findings. The documented pattern is stronger for individuals who lack insurance through preexisting wealth, home equity, or marriage. Welfare programs that alleviate the economic repercussions of health shocks are effective at mitigating the ensuing negative externality on society.

The health shocks they study are cancer diagnoses, and they fit a model to data from all adults aged between 18 and 62 in Denmark diagnosed with cancer at some time between 1980 and 2018. The main result, shown in the graph at the top of this post, comes from a linear regression predicting a binary outcome at the person-year level (was person i convicted of a crime committed in year t; something that happened in 0.7% of the person-years in their data) with indicators for people and years, predictors for number of years since cancer diagnosis, and some background variables including age and indicators for whether the person was in prison or in the hospital in that year.

3. Something I don’t understand

The only thing about the above graph that seems odd to me is how smooth the trend of the point estimates is. Given the sizes of the standard errors, you'd expect those estimates to jump around more. So I suspect there's a mistake in their analysis somewhere. Of course I could be wrong on this; I'm just guessing.
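To give a sense of the intuition, here's a little simulation (all numbers invented; this is not their model): a sequence of independent estimates, each with standard error s, should typically jump by about 1.13 s from one year to the next, so point estimates tracing a smooth curve relative to their error bars would be surprising.

```python
import math
import random

random.seed(1)

# Hypothetical setup: a flat true effect measured at 10 event-time points,
# each estimate having standard error 0.03 (made-up numbers for illustration).
se = 0.03
estimates = [random.gauss(0.0, se) for _ in range(10)]

# Typical gap between adjacent independent estimates:
# X - Y ~ N(0, 2*se^2), so E|X - Y| = se * sqrt(2) * sqrt(2/pi) ~ 1.13 * se
jumps = [abs(b - a) for a, b in zip(estimates, estimates[1:])]
mean_jump = sum(jumps) / len(jumps)
expected_jump = se * math.sqrt(2) * math.sqrt(2 / math.pi)

print(f"mean adjacent jump in simulation: {mean_jump:.3f}")
print(f"theoretical mean jump:            {expected_jump:.3f}")
```

If the published estimates move much less than this between adjacent years, that suggests the year-by-year coefficients are not independent draws, which is worth understanding.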

4. Where are the data and code?

I don’t see a link to the data. From the article:

We combine data from several different administrative registers made available to us through Statistics Denmark. We obtain data on criminal offenses from the Danish Central Crime Registry maintained by the Danish National Police. . . . Health data are from the National Patient Registry and from the Cause of Death Registry. The National Patient Registry records every time a person interacts with the Danish hospital system . . .

Perhaps there are confidentiality constraints? It wasn't clear what restrictions were placed on this information, so maybe the data can't be shared. But the authors of the paper should still share their code; that could help a lot.

It’s possible that the data or code are available; I just didn’t notice them at the site. The paper mentioned an online appendix which I found by googling; it’s here.

5. The model

I looked at their model, and there are some things I’d do differently. First off, it’s a binary outcome, so I’d use logistic regression rather than linear. I understand that you can fit a linear regression to binary data, but I don’t really see the point. Especially in a case like this, where the probabilities are so close to zero.
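To illustrate the point with made-up numbers: on the raw probability scale, an additive effect can push a rare-event probability below zero, while on the logit scale any shift stays inside (0, 1).

```python
import math

def inv_logit(x):
    """Map a logit-scale value back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Toy comparison (numbers invented): a baseline probability of 0.7%,
# roughly the crime rate per person-year mentioned in the post.
base_p = 0.007

# Linear scale: subtracting a modest 0.01 yields an impossible probability.
linear_pred = base_p - 0.01
print(linear_pred)  # negative: not a valid probability

# Logit scale: the same kind of shift stays strictly inside (0, 1).
base_logit = math.log(base_p / (1 - base_p))
logit_pred = inv_logit(base_logit - 1.5)
print(logit_pred)   # small, but still a valid probability
```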

Second, I’d model men and women separately. Men commit most of the crimes, and it just seems like female crime and male crime are different enough stories that it would make sense to model them separately. Alternatively, they could be included in the model with an interaction.

This brings us to the third issue, that all sorts of things could be interacted with the treatment. At this point you might say that I’m trying to Christmas-tree the problem . . . so, sure, let’s not worry about interactions right now. Let’s go on.

My next concern has to do with the multiple measurements on each individual. I like the idea of using each person as his or her own control, but then what’s the comparison, exactly? Suppose I get diagnosed with cancer at the age of 50. Then you’re comparing my criming from ages 51-60 to my criming from ages 44-49 . . . but then again I’m older in my 50s so you’d expect me to be doing less criming then anyway . . . but then the model adjusts for age (I think they are using linear age, but I’m not entirely sure), but maybe the age adjustment overcorrects in some way . . . I’m not quite sure. It’s tricky.

And this brings me to my next concern or request, which is that I’d like to see some plots of raw data. I guess the starting point would be to plot crime rates over time (that is, the proportion of respondents who were convicted of a crime committed in year t, as a function of t), lining things up so that t=0 is the year of cancer diagnosis, with separate lines for people diagnosed with cancer ages 18-25, 26-30, 31-35, etc. And separate lines for men and women. OK, that might be too noisy, so maybe more pooling is necessary. And then there are cohort effects . . . ultimately I’m pretty sure this will end up being a logistic regression (or, ok, use a linear regression if you really want; whatever), and it might look a lot like what was eventually fit in the paper under discussion—but I don’t really think I’d understand it without building the model step by step from the data. I need to see the breadcrumbs.
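Here's a sketch, on fully simulated data with invented rates, of the kind of raw-data summary described above: the raw crime rate by sex and by year relative to diagnosis, ready to plot with t=0 at the diagnosis year.

```python
import random
from collections import defaultdict

random.seed(0)

# Simulated records: (person_id, sex, age_at_diagnosis, event_time, convicted),
# where event_time is years relative to diagnosis. All rates are made up.
records = []
for pid in range(5000):
    sex = random.choice(["M", "F"])
    age_dx = random.randint(18, 62)
    for t in range(-6, 7):
        base = 0.010 if sex == "M" else 0.003
        p = base * (1.13 if t >= 0 else 1.0)  # a made-up 13% post-diagnosis bump
        records.append((pid, sex, age_dx, t, random.random() < p))

# Raw conviction rate by (sex, event time): convictions / person-years.
counts = defaultdict(lambda: [0, 0])
for pid, sex, age_dx, t, convicted in records:
    counts[(sex, t)][0] += convicted
    counts[(sex, t)][1] += 1

for sex in ["M", "F"]:
    rates = [counts[(sex, t)][0] / counts[(sex, t)][1] for t in range(-6, 7)]
    print(sex, [f"{r:.4f}" for r in rates])
```

The same aggregation could be repeated within age-at-diagnosis bins (18-25, 26-30, etc.), pooling more coarsely if the lines get too noisy.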

That’s most of it. I had a few more issues with the model that I can’t remember now. Oh yeah, here’s one issue: who are these people committing crimes? How many of them are past criminals and how many of them are Walter Whites, venturing beyond the law (or, at least, getting caught) for the first time? Above I talked about doing 2 analyses, one for men and one for women, and that’s fine, but now I’m thinking we want to do separate analyses for people with past criminal records and people without. It seems to me there are two stories: one story is past criminals continuing to crime at rates higher than one might expect given their age profile; the other story is people newly criming (again, at a higher rate than the expected rate of new crimes for people who didn’t get cancer).

Oh, yeah, one more issue is selection bias because dead people commit no crimes.

By listing all of these, I'm not saying the published model was bad. There are always lots of ways of attacking any problem with data. My main concern is understanding the estimates and seeing the trail leading from raw data to final inferences. Also, if we carefully take things apart, we might understand why the above graph is so suspiciously smooth. (When I say "suspiciously," I'm not talking about foul play, just something I'm not understanding.)

6. Conclusion

I don’t have a conclusion. The results look possible. It’s not implausible that people commit 13% more crime than expected during the years after a cancer diagnosis. I mean, it seems kinda high to me and I also would’ve believed it if the effect went in the other direction, but, sure, I guess it’s possible? I’m actually surprised that the drop isn’t greater during the year of the diagnosis itself and the year after. This sort of thing is one reason I want to see more plots of the raw data. I notice lots of things in the analysis that are different than what I would’ve done, but it’s not obvious that any of these choices would cause huge biases. Best will be for the data and code to be out there so others can do their own analyses. Until then, this can serve as an example of some of the challenges of data analysis and interpretation.

P.S. It’s not so usual to see a research project inspired by a TV show, is it?

OK, let me check on google scholar:

Gilligan’s Island

Star Trek

And, of course, Jeopardy

Even Rocky and Bullwinkle.

OK, so that aspect of the paper is not such a big deal. Andersen et al. are hardly the first to write a TV-inspired quantitative research paper.

That said, the connection to the TV show is pretty much perfect here in how the story of the data lines up with the plot of Breaking Bad. The main difference is that in Breaking Bad he's a killer, whereas Andersen et al. are mostly studying property crimes. Even there, though, it's not such a bad match, given that Walter's motivations are primarily economic.


  1. Josh says:

    “First off, it’s a binary outcome, so I’d use linear regression rather than logistic. I understand that you can fit a linear regression to binary data, but I don’t really see the point. Especially in a case like this, where the probabilities are so close to zero.” Is there a typo here?

  2. Alex says:

    “First off, it’s a binary outcome, so I’d use linear regression rather than logistic”

    This is backward, right?

  3. Jonathan (another one) says:

Is the 13 percent increase absolute or, as it appears, relative? From the graph, the absolute percentage increases in crime are really small: less than 0.1% after three years. If there were an underlying propensity to crime of 1% (how felonious are the cancer-age-adjusted Danes?) that would be a 10 percent increase, but I sure as heck wouldn't trust it in a linear regression (I probably wouldn't trust it in a logistic regression either). Without looking at the paper, doesn't it look like one of those studies in which the confidence intervals from a 4 million observation study are misleading? And if the crime rate really is that rare, having the logistic curve fit it would be just luck… probit? I'm calling OLS probability artifact until somebody sends me the paper….

    • Jonathan (another one) says:

      OK. I found the article at the link to Appendix 1 that Andrew gave. First, they do present separate analyses for men and women, and women are close to zero while men are somewhat higher with larger standard errors. (60 percent of the data points are women).

      Second, the underlying propensity for crime is 0.68%. So out of 357,043 people sampled, the underlying crime rate suggests that about 2,400 are criminals. A 13% boost in criminality contingent on cancer is about 315 crimes… out of 4.9 million people-years.

      Finally, the smoothness of the increase in the coefficients is determined by their functional form which they call a semi-dynamic estimation of average treatment effects, I think. But I need to look at it more.

  4. Joshua says:

    They should do a follow up to see if (1) people commit fewer crimes after they get a positive health checkup (or even find out a disease they had has been cured) and (2) they find that there’s a “dose” relationship meaning that, for example, there’s a positive association between the amount of increase in criminality and the severity of the negative health diagnosis. I’d say also they should look longitudinally to see if the criminality increases over time as the health condition worsens – but I could see where that one would get complicated as maybe someone would commit less crime as a result of being more disabled.

    • jim says:


      I’m curious if a thirty minute educational video on how positivity can cure cancer would lead to 33.37% lower crime rate over ten years among diagnosed individuals.

      • Joshua says:

        Jim –

        My point was that if the cause and effect is real you should be able to predict other related phenomena and see a “dose” effect.

        I’m not clear on what your point is.

        • jim says:

          Apologies, what I initially wrote was mockery playing off your comment. As is often the case, though, I changed it so really it doesn’t have anything to do with your comment.

          • Joshua says:

            jim –

            Yeah, I thought that’s what it was but I wasn’t quite sure if you were mocking me or the causality suggested by the study (which certainly seems dubious, in the least, to me).

  5. Thomas says:

    About the units.
    The figure caption says it’s percentage points. The ordinate is labeled .1, .2 etc. in %. So these are tenths of percents, right?

    Btw the lack of a leading 0, coupled with hardly noticeable decimal points, causes repeated drug dosage errors in hospitals (or it used to, before electronic prescribing).

    • Andrew says:


      Really the model should be on the log scale (or logit, which is essentially equivalent to log, considering how low the probabilities are). The graph on the hard-to-interpret raw scale is a product of a modeling choice that I would not recommend.

      • Jonathan (another one) says:

        Agreed. This was part of my confusion on first seeing the chart. It isn’t helped when the abstract calls the result a 13% increase, which leads you to interpret the .1 as 10% (despite the labeling of the y-axis) rather than .1%. To be fair, reading the article clears it up. But log or logit scales would be easily interpretable without having read the text.

        Also, it really doesn’t make any sense to me that a cancer diagnosis would have the same *absolute* percentage change across people with large relative differences in underlying crime propensity (particularly by age and gender). That’s the obvious reason to go to a logit or probit specification, beyond the heteroskedasticity of the linear probability model (and the problem of potential predictions below 0 in this case).

    • Clyde Schechter says:

      Electronic prescribing has not solved this problem. Wrong dosage errors remain a frequent problem with electronic prescribing: even if you get the decimal points right, it is still easy to select the wrong unit (e.g. g when mg are called for, or vice versa).

  6. Rasmus says:

    “Perhaps there are confidentiality constraints? It wasn’t clear what the restrictions were placed on this information, so maybe the data can’t be shared”
    While I don't have huge experience working with Danish administrative data, it is generally not possible to share data from those sources (and European data protection law, GDPR, hasn't made it easier). Usually, the best you can do is to encourage others to apply for the data via Statistics Denmark. They do allow sharing of code, however.

  7. AllanC says:

    Re: probability of committing a crime

    The types of crime being committed ought to matter here. For example, if the reported crime has increased but it is largely "kid crimes," such as TPing their boss's house, their posited explanations make no sense. I didn't see it mentioned anywhere. Did I miss it?

    Re: age adjustment and other adjustments

    I don’t think anyone can reliably make adjustments for age or other covariates when the measurement of interest is within an individual but the reference group for adjustments is between individuals. I suspect this is why Andrew points out that the adjustment may over or under correct.
    Moreover, as some of the other comments point out, the absolute difference in total number of crimes committed is minuscule relative to total life years. So, any predicted effect size from the model is as likely to reflect this over/under correction in these covariates as a true difference in an underlying propensity to commit crimes. Or as Jonathan (another one) framed it: OLS probability artifact.

    • Jonathan (another one) says:

      Table G1 gives crime type. The largest category (9.5% of the total) is shoplifting. Holding drugs (shades of Walter White!) is second at 6.7%, followed by “simple violence” at 3.5%. Vandalism is towards the top at 2.4% There are 60 or so total types of crime.

  8. Erik Ruzek says:

    The thing that makes me suspicious is the hypothesis that I assume underlies this work. Specifically, people who get cancer are likely to increase their criminal activity. Who would make that prediction prior to seeing this paper? This feels like the kind of thing that emerges from looking at data and finding weird correlations. Perhaps these folks tried to explain it away with their covariates and couldn’t. But if so, let’s lead with that instead of some type of causal claim.

    • Andrew says:


      I can see the argument going either way. But in any case, yeah, I want to see the pattern in the raw data and then go from there.

    • Michael J says:

      > Who would make that prediction prior to seeing this paper?

      I like to think they were watching Breaking Bad one night and asked what if…

    • elin says:

      I could see rational choice people saying this. Basically if you think you are going to die soon anyway, what’s the deterrent effect of saying you will be punished in the future. This is not something I would agree with because I don’t do rational choice, but it seems like the kind of argument you see them making.

  9. mwiener says:

    Even if the effect turns out to be real, there could be other reasons for it. For example, what if no-one’s propensity to commit crime changes, but people with cancer are less healthy, therefore less likely to get away and more likely to be caught? Or anxiety about the cancer distracts them, and they leave more clues, and are more likely to be caught?

  10. DSJ says:

    “I like the idea of using each person as his or her own control, but then what’s the comparison, exactly? Suppose I get diagnosed with cancer at the age of 50. Then you’re comparing my criming from ages 51-60 to my criming from ages 44-49 . . . but then again I’m older in my 50s so you’d expect me to be doing less criming then anyway . . . but then the model adjusts for age (I think they are using linear age, but I’m not entirely sure), but maybe the age adjustment overcorrects in some way”

    Well, the adjustment for age does indeed “overcorrect” in some way: When you adjust for person and year, also adjusting for age results in perfect multicollinearity. Person fixed effects absorb all constant person-characteristics, including the birth year. Then with the year fixed effects, you can perfectly predict everyone’s age.

    Since the model cannot be estimated in the case of perfect multicollinearity, I guess Stata (or whatever software was used) silently removed one of the year dummies.
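    The collinearity is easy to verify on a toy panel (simulated people and years, just for illustration): the age column is an exact linear combination of the person and year dummy columns.

```python
# Toy panel: 3 people observed over 3 years (all values invented).
persons = [1, 2, 3]
birth_year = {1: 1950, 2: 1961, 3: 1978}  # absorbed by person fixed effects
years = [1990, 1991, 1992]                # absorbed by year fixed effects

rows = [(p, y) for p in persons for y in years]

# Build the dummy columns explicitly.
person_cols = {p: [1 if rp == p else 0 for rp, ry in rows] for p in persons}
year_cols = {y: [1 if ry == y else 0 for rp, ry in rows] for y in years}
age_col = [ry - birth_year[rp] for rp, ry in rows]

# age = sum_y y * D_year[y] - sum_p birth_year[p] * D_person[p],
# i.e., an exact linear combination of the dummies: perfect collinearity.
recon = [
    sum(y * year_cols[y][i] for y in years)
    - sum(birth_year[p] * person_cols[p][i] for p in persons)
    for i in range(len(rows))
]
assert recon == age_col
print("age column = exact linear combination of person and year dummies")
```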

    Btw.: I like that they use the linear model for a binary outcome. Just my preference.

    • Andrew says:


      The biggest problem with the linear model is that it is additive. My impression is that it is a small subset of the people who commit most of the crimes. So a model that says that something causes criming to increase by X on the additive scale doesn’t really seem to make sense to me; I’d prefer a model that says that the effect would be proportional to the existing rate.
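      To see the difference with toy numbers (invented): the same additive bump doubles a low-rate group's risk while barely moving a high-rate group's, whereas a proportional effect scales each group by the same factor.

```python
# Invented rates: a group that almost never offends vs. a frequent-offender group.
low_rate, high_rate = 0.001, 0.050

# Additive model: the same absolute bump for everyone.
bump = 0.001
add_low, add_high = low_rate + bump, high_rate + bump
print(add_low, add_high)  # the low-rate group's risk doubles; the high-rate group barely moves

# Proportional model: the same 13% relative increase for everyone.
mult_low, mult_high = low_rate * 1.13, high_rate * 1.13
print(mult_low, mult_high)
```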

      Some of this problem would be resolved if the authors were to follow one of my other suggestions, which is to estimate the effect separately for people with and without past criminal convictions.

  11. JohnnyA says:

    I haven’t read the paper, so this may have been covered. Did the authors guard against p-hacking? That is always a risk, and especially so for a paper with a result that would be completely uninteresting if no relationship were found.

    Andrew wants to see the raw data but even this may not be enough if the authors have decided to measure criminality in particular ways or to exclude particular individuals because something looked awry.
