People used to send me ugly graphs, now I get these things

Antonio Rinaldi points me to this journal article which reports:

We found a sinusoidal pattern in CMM [cutaneous malignant melanoma] risk by season of birth (P = 0.006). . . . Adjusted odds ratios for CMM by season of birth were 1.21 [95% confidence interval (CI), 1.05–1.39; P = 0.008] for spring, 1.07 (95% CI, 0.92–1.24; P = 0.40) for summer and 1.12 (95% CI, 0.96–1.29; P = 0.14) for winter, relative to fall. . . . In this large cohort study, persons born in spring had increased risk of CMM in childhood through young adulthood, suggesting that the first few months of life may be a critical period of UVR susceptibility.

Rinaldi expresses concern about multiple comparisons, along with skepticism about the hypothesis that in Sweden 2-3-month-old babies get any sunshine completely naked.

P.S. Some of the comments below are fascinating, far more so than the original paper! Maybe we should call this the “stone soup” or “Bem” phenomenon, when a work that is fairly empty of inherent interest (and likely does not represent any real, persistent pattern) gets a lot of people thinking furiously about a topic.

78 thoughts on “People used to send me ugly graphs, now I get these things”

  1. What’s an “adjusted odds ratio”? Adjusted how?

    Also, if the 95% CIs of each of the three classes are wide enough to include the point estimates of the odds ratios of the other classes, isn’t that another warning flag that the effect might be an illusion? Or am I mistaken?

    • Quickly scanning the paper, they adjusted for birth year, gender, parental country of birth, paternal education, and family history of CMM.

      Agreed that what you’re observing is a warning flag; they set up these analyses as comparisons to being born in the fall, but then they switch to talking about “spring as a risk factor” and don’t mention that fall is the reference point in the Discussion. I think they are doing this because of the sinusoidal logistic regression, which they say identifies April 13 as the peak risk point and October 12 as the minimum risk point, but I find it all a little odd.

    • Might be, but not necessarily. For example you might have effects A, B, C where A overlaps B and B overlaps C but C doesn’t overlap A. Overlaps are not a great way to evaluate things.

      I don’t have access to the article, but do they put their data up? This is actually a topic I have a little interest in, and I’d like to see the data and/or re-analyze it.
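
      To put a number on why overlap is a weak test, here is a quick R sketch (hypothetical values, not from the paper): two estimates with equal standard errors can have overlapping 95% CIs while still differing significantly.

      se <- 0.08
      d  <- 3.2 * se                 # hypothetical difference between two estimates
      d < 2 * 1.96 * se              # TRUE: the two 95% CIs overlap
      2 * pnorm(-d / (sqrt(2) * se)) # ~0.02: yet the difference is significant at 5%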

      • Daniel,

        It looks like they had more “person-years of follow up per birth” for those born Jan-May. These are the same months they observed higher rates of CMM diagnosis. I’m not positive what that means. I suspect it is either that the subjects born those months were older during the follow up, had more recent follow-ups, or some combination of those. In that case it seems they have detected that CMM diagnoses increase with age at time of diagnosis. Hopefully I have misinterpreted something…

        Here is the data (dput format from R):
        structure(list(`Birth Month` = c("January", "February", "March",
        "April", "May", "June", "July", "August", "September", "October",
        "November", "December"), Births = c(297241L, 291749L, 334787L,
        331520L, 324466L, 305267L, 310043L, 300578L, 291196L, 278533L,
        252598L, 253596L), `Person-years (Millions)` = c(5.47, 5.36,
        6.2, 6.09, 5.88, 5.45, 5.44, 5.26, 5.13, 4.84, 4.41, 4.42),
        `CMM cases` = c(144L, 139L, 162L, 170L, 166L, 127L, 131L, 126L,
        116L, 115L, 99L, 100L), `Cases per 100 000 person-years` = c(2.63,
        2.59, 2.61, 2.79, 2.82, 2.33, 2.41, 2.4, 2.26, 2.37, 2.25, 2.26)),
        .Names = c("Birth Month", "Births", "Person-years (Millions)",
        "CMM cases", "Cases per 100 000 person-years"), row.names = c(NA, 12L),
        class = "data.frame")

        • RE: seasonal fertility: it would be common for a country to have regular, seasonal peaks/valleys in birth rates. I noticed it too, but I am not sure that it is a problem.

          I do worry about the fact that January people have had 12 months more “exposure” than December people, but January isn’t the peak month. There is a pretty big jump from December to January (and also a weird one from May to June). That might be me chasing noise, but I wonder if there isn’t some age problem that isn’t being taken care of with cohort controls. If you use year of birth in whole numbers as your covariate, you will not in any way control for the difference between being 22 years old and 1 month, and being 22 years old and 11 months, and that is the difference between being born in January or December of some year.

          This all assumes I understand what this data actually look like, because the “person-years” metric was weird to me too. I think this is just a cross-section with 3.5M or so people in it.

        • jrc,

          I believe what I called “person-years of follow up per birth” is actually the average age at follow up for each month. They say:
          “We identified 3 595 055 individuals in the Swedish Birth Registry who were born from 1973 through 2008…The study cohort was followed for CMM incidence from birth through 31 December 2009 (maximum attained age was 37 years).”

          First, 2009-1973=36, so the maximum age disagrees with the start/end years given. Anyway, say they performed the follow-ups throughout 2009 and an equal number of subjects were born each year. Then the average age of subjects assessed *before* their birthday would be mean(0:35)=17.5, while *after* it would be mean(1:36)=18.5. In that case we would expect the average age to decrease gradually by birth month from 18.5 to 17.5. This is precisely what we see (divide 1000000*column 3 by column 2 in the above data):
          http://s14.postimg.org/vhtf54y69/CMM1.png

          So one problem is that they adjusted for birth year when they were supposed to adjust for “age at time of follow up”. The subjects with birthdays earlier in the year were more likely to be followed up when they were up to a year older. At least part of what they have detected here is due to simply cancer incidence increasing with age.
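
          In R, that arithmetic is just (a sketch assuming equal cohorts born each year 1973-2008 and follow-up during 2009):

          birth_years <- 1973:2008
          mean(2009 - birth_years - 1) # assessed before the birthday: 17.5
          mean(2009 - birth_years)     # assessed after the birthday:  18.5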

        • So someone born on 1 January 1973 was followed until 31 December 2009. By your reckoning that person had (2009-1973 =) 36 years of exposure, but by their reckoning they actually reached age 37 at midnight. I think their statement that they had people who reached age 37 is reasonable, and certainly not ridiculous.

      • Interesting – what do you see as the solution? It seems that the appropriate technique would involve unsupervised clustering or something like that (if you really are formulating your comparisons after the data have been observed).

      • If your hypothesis is that somehow early exposure (first few months) makes a big difference (and that seems more likely to be a result of fishing than something they would be likely to have pre-registered) then I’d tend to see a model in which you have two “sources of exposure”

        1) the first 3 months
        2) the rest of your life

        I’d want to approximate an integrated UV exposure for both aspects of exposure, and see how much (1) is predictive of melanoma risk. If it’s way more important than (2), then I’d expect it to have a coefficient much bigger than (1/4)/60 or so, as compared to (2) (which is likely to be, say, 60 years of exposure and therefore 4*60 times as much exposure as condition (1)).

        “controlling for birth year” seems to be the wrong approach
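
        A minimal sketch of that two-exposure regression in R (all names hypothetical; uv_first3mo and uv_rest stand for the two integrated exposures, cohort for person-level data):

        fit <- glm(cmm ~ uv_first3mo + uv_rest, family = binomial, data = cohort)
        coef(fit) # (1) being way more important means uv_first3mo's coefficient dwarfs uv_rest's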

        • Daniel,

          Maybe you could use within-season/across-cohort variation in sun exposure using weather data (was it cloudy in your birth state and birth month that year compared to other parts of the country that year?). Or for that matter, use a kind of “difference-in-difference” that uses variation across-space and across-time, all while controlling for the regular month of birth exposure itself (so dummy variables for birth year, birth month, and region of birth, with a RHS variable aggregated to the region-birth month level containing information on sunny-ness). That would take care of the problem of confounding season of birth and exposure to sunlight.

        • Something like that would be reasonable. The basic idea being to approximate integrate(UVR(t),t,3mos,melanoma onset time) as well as integrate(UVR(t),t,0,3mos) and to use those two predictors in some kind of reasonable causal model.

          The most sane causal model is that the 3mos exposure makes you “more susceptible” and therefore that its value somehow multiplies the later exposure level, possibly raised to some power.

          Using latitude of birth and weather data would be a good idea, and I don’t think it’d even be that hard.

        • I’m pretty sure cloudy wouldn’t matter for this; it doesn’t impact UV, though it might impact behavior. If you’ve been in Sweden in the summer, you know it’s really easy to get a sunburn because you lose track of how many hours you’ve been outside, since it’s so light. So if what I’ve heard before is true, that one bad sunburn during childhood can be a risk factor, it wouldn’t surprise me if there is a mechanism like time of year in first three months -> risk of sunburn in first three months -> long-term risk. Anyway, my pediatrician always told us it was important to keep infants from getting burned; presumably that’s a lot easier when you have only 4-5 hours of light (not to mention that it’s freezing out). Of course MDs are notorious for not interpreting the data correctly, and it would be far better to actually track behavior.

      • One thing I can think of is hormones regulated by day length. One question (I don’t have access to the article): are they looking at melanoma in young children (got to be very unusual, I’d think), or are they looking at adults and then asking them what their birthday is, under some kind of assumption that the first few months of life are critical for the next 60 years?

  2. I think Rahul is right. Based on the CIs, not much variation is explained between seasons; most of the variation is within seasons. A couple of other weird things in the paper:

    They don’t graph the data (which seems important when you’re arguing for a specific functional form!) but they do plot a sine curve, for some reason.

    Instead of using pi in their model, they use arccos(-1), just to be extra obscure.

    Also, modeling 1600 positive cases out of 64 million observations is going to make it hard for MLE methods. Surprised rare events considerations weren’t discussed.

  3. But it was published in an international epidemiological journal with an impact factor of nearly 10 in 2013. How can it happen that they accept such naive interpretations as those written in the abstract of that article?

  4. 1) A year or so ago, didn’t you have a post, or series of posts, on a paper that contradicted Ioannidis – the paper said that most p-values in the medical literature were correct?
    2) Below is a quote from the abstract.
    I would submit that this sort of writing is hopeless; no one can parse it. And I’m not alone – I’m pretty sure that from Tukey forward, people have been saying that writing things like “for spring, 1.07 (95% CI, 0.92–1.24; P = 0.40)” is bad writing.

    Results: There were 1595 CMM cases in 63.9 million person-years of follow-up. We found a sinusoidal pattern in CMM risk by season of birth (P = 0.006), with peak risk corresponding to birthdates in spring (March–May). Adjusted odds ratios for CMM by season of birth were 1.21 [95% confidence interval (CI), 1.05–1.39; P = 0.008] for spring, 1.07 (95% CI, 0.92–1.24; P = 0.40) for summer and 1.12 (95% CI, 0.96–1.29; P = 0.14) for winter, relative to fall. Spring birth was associated with superficial spreading subtype of CMM (P = 0.02), whereas there was no seasonal association with nodular subtype (P = 0.26). Other CMM risk factors included family history of CMM in a sibling (>6-fold) or parent (>3-fold), female gender, high fetal growth and high paternal education level.

  5. Andrew,

    You referred to this as “a work that… likely does not represent any real, persistent pattern”. I do not think that is the case. Plot their “millions of person-years” vs “# cases” and we can see a very strong relationship:
    http://s30.postimg.org/wjzvjdvz5/CMM2.png

    The problem is (ok…appears to be, someone explain where I’ve gone wrong) that they are misinterpreting the data, not that this is a spurious relationship. This is the much more insidious failure mode of strawman NHST. I would predict people can replicate this methodology repeatedly and observe the “seasonal effect”. To be fair, in their discussion they write:

    “Other seasonally varying environmental exposures such as infectious agents, however, also cannot be excluded as possible contributors to the seasonal patterns we observed.”

    The “other seasonally varying exposures” appear to be the dates of birth and of the follow-up doctor visits.

    • Was this data collected each March or something? I still don’t get the data set up enough to be able to figure out exactly what caused that age-birthdate relationship, but it looks pretty mechanical.

      Here are some potential culprits for mechanisms (all similar, but describe different empirical environments):

      Age-at-Onset: http://psycnet.apa.org/journals/szb/15/1/59/

      Age-at-Measure: http://economics.ucr.edu/repec/ucr/wpaper/201417.pdf

      Fixed-Cohort-Gestational-Age: http://www.biomedcentral.com/1471-2288/11/49

      … warning: engaging with this stuff often causes a severe and persistent headache. I’m not sure any of these options is exactly right, but I’d bet the thinking is pertinent.

      For the record, though – I’d call that a spurious relationship. It might be mechanical and determined by the combination of the data collection mechanism and underlying data generating process (ageing), but it is still a spurious relationship between the month of birth and the outcome of interest.

      • “Was this data collected each March or something? I still don’t get the data set up enough to be able to figure out exactly what caused that age-birthdate relationship, but it looks pretty mechanical. ”

        I agree that this data looks like it resulted from a procedural artifact. They do not describe their methods very well, but figuring this out is like a homework problem. Take a look at this plot I posted earlier:
        http://s14.postimg.org/vhtf54y69/CMM1.png

        What distribution of average ages (rounded down to the year) would we expect if the following are true:
        1) Birthdays are uniformly distributed throughout the year.
        2) Follow up dates were uniformly distributed throughout the year.
        3) Birthyears are uniformly distributed (or we can instead model this as every subject has the same birthyear).

        It would be a line with slope of (approximately, due to different month lengths) 1/12=.0833, and the intercept would be the average age on Jan 1st (~18.5 here). From their data we see the average age is roughly constant from Jan-April and July-Dec, but is decreasing almost linearly from April-July with slope ~ -0.27. This slope is a little over 3x the expected slope (.0833*3=0.25), and April to July account for 1/3 of the months. So most likely they collected the data mostly from April-July. Let’s look at the average age per month (rounded to 30 days each) given
        1) Uniformly distributed birthdays from 1:360
        2) Uniformly distributed “follow up days” from 120:210.
        3) Initial average age of 18.4

        http://s29.postimg.org/nkq3oqicn/CMM3.png

        It is almost a perfect fit. It tails off a bit slower probably due to some percentage of the subjects rescheduling for the fall.
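
        Here is R code for that simulation (my reconstruction of the assumptions listed above, not the authors’ procedure):

        set.seed(1)
        bday <- sample(1:360, 100000, replace = TRUE)   # 1) uniform birthdays, 30-day months
        fday <- sample(120:210, 100000, replace = TRUE) # 2) follow-up days from 120 to 210
        age <- ifelse(fday >= bday, 18.4, 17.4)         # 3) whole-year age; one year less if
                                                        #    the follow-up precedes the birthday
        month <- ceiling(bday / 30)
        round(tapply(age, month, mean), 2)              # average recorded age by birth month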

        • So, wait, you’re saying that they probably only looked at people whose melanoma was diagnosed between April and July, and that this mechanically induces a bias in the ages at onset, due to them controlling for *year of birth* rather than *age*??

          I’m banging my forehead on the desk (metaphorically) right now.

        • This paper is a great demonstration of the failure, or even misleading nature, of strawman NHST. Compare to the merits of exploring the data, thinking about it, finding some simplifying assumptions, then deriving a rational (not just “curve fitting”) mathematical model.

          In their table 2 they report number of cases by age. If we use the middle bracket values we get the following upper chart for cases/yrs old. If we assume the same number of total subjects for each age bracket (3,595,055 people divided by 7 brackets gives 513579.3) we get the prevalence estimated in the lower chart:
          http://s4.postimg.org/ypou2507x/CMM4.png

          There were 181 cases in the 15-19 (mean=17) yrs old group and 395 in the 20-24 yrs old group. So the rate of increase around this age is approximately (395-181)/5 = 42.8 cases per year of age, or 3.57 per month. So at 17.5 years old we would expect 181 + 6*3.57 = 202.4 cases per 513579.3 people, or 0.000394 cases per person. How many cases would we expect from their December data (with average age ~17.5)? They reported 253,596 births; multiply this by .000394 cases/person to get ~99.9 cases. They reported 100 cases for this month.
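
          The same arithmetic in R (numbers from their Table 2; equal-sized age brackets assumed):

          slope <- (395 - 181) / 5 / 12           # ~3.57 extra cases per month of age
          expected_cases <- 181 + 6 * slope       # ~202.4 cases per bracket at mean age 17.5
          rate <- expected_cases / (3595055 / 7)  # ~0.000394 cases per person
          253596 * rate                           # ~99.9 expected December cases (100 observed)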

          If we do the same for each month using the ages predicted by the model developed in my previous post, we get the following prediction for cases/birth month. Blue line is the model and black is the data:
          http://s28.postimg.org/buyynobml/CMM5.png

          To recap the assumptions:
          1) Uniformly distributed birthdays from 1:360 and days per month rounded to 30 (using the actual distribution they provide does not materially affect the estimate).
          2) Uniformly distributed “follow up days” from 120:210.
          3) Birth years are uniformly distributed.
          4) Equal number of people per age bracket.
          5) Linear rate of prevalence increase from 17-22 years old.

          The model does still underestimate the number of cases in April and May (by ~12 and ~16, respectively), but the rest of the “seasonal effect” appears completely accounted for.

          Further, they write: “The distribution of season of birth varied comparing earlier with later birth cohorts, hence birth year was considered as a potential confounder.” So our assumption three is apparently wrong enough for them to have noticed.

          The season of birth vs birth year data is not provided, but they do report unadjusted odds ratios and after adjusting for birth year. After adjusting, the OR of summer (June-August) has increased at the expense of spring (March-May). Note that the above model overestimates June cases and underestimates April/May cases.

        • Let me see if I understand this. Your first graph shows prevalence by year from their data. You use that graph to get a slope in the vicinity of 17 years old (in cases per person per month). You then apply that linear predictor and their actual births data to extrapolate the variation by month, and you get the bottom graph, which is for all intents and purposes the entirety of their “seasonal effect”.

          So, if they’d just used people’s actual *age* (to the nearest month) instead of rounding down to the nearest integer year, they’d find that sure enough, risk (cases per person per month) is basically a constant (at least in the vicinity of 17-22 years).

          I think what we need here is a bounty. The way it will work is if you can find some simple explanation for an “effect” published in a peer reviewed journal, which shows that it is an artifact of doing a bad data analysis neglecting very basic issues in the data (there would have to be a judicial body I guess), the researchers would be required to refund their grant money to the person who found the problem ;-)

        • You are correct. Also, this has already been cited, with the evidence for the favored theory exaggerated:

          “Furthermore, a new study published in 2014 of over three million people in Sweden showed that accumulation of UV damage begins as early as in the neonate, with melanoma incidence increased in those born in the spring and summer versus those born in the fall or winter [40].”
          http://www.ncbi.nlm.nih.gov/pubmed/24838074

          Really, at first glance there seems to be little room for many of these other risk factors to play an important role either. I guess other than whatever is associated with age, but it seems like lifetime exposure to nearly everything would increase with age. This could be a field that has only been studying their own opinions for many years.

          Also, I’d take bitcoin donations if anyone here does that… I was tempted to post an address after your comment, but I won’t bother unless there is interest. I haven’t seen anyone mention it.

        • question: you use an anonymous pseudonym so you may wish to retain that anonymity, however it seems worthwhile (in an abstract sense) to try to put your results out there and maybe even get a retraction from the journal. This kind of thing drives me up the wall. What is your opinion on other people presenting your findings as something like “communication from a party who wishes to remain anonymous”?

          I wonder if Andrew would be on board? As a tenured professor, and a person who has pointed out stats problems in journal articles in the past (ugly vs beautiful parents etc) maybe he’d be willing to sponsor such things.

        • Daniel,

          I have no desire to interact with the journal. Others should be free to do so, even presenting it as their own findings if that is the most convenient way.

          Also, like Fernando, I suspect that this kind of thing is par for the course, so it is not like there is any integrity of the literature to maintain.

        • I don’t think it is that they only collected data where the diagnosis was in April; I think they collected data in April for everyone in their cohorts, so they basically have censored data for diagnoses that occurred after April. So everyone born January-April has exposure during the final calendar year, while those born later in the year do not.

        • Births were apparently distributed as Births = c(297241L, 291749L, 334787L,
          331520L, 324466L, 305267L, 310043L, 300578L, 291196L, 278533L,
          252598L, 253596L). I don’t want to find a pattern by eyeballing, but that seems to show a lot more births March-September, so I wouldn’t run too far with the assumption of a uniform distribution of births over the year.

        • Elin,

          Simply replace this line:
          bday <- sample(1:360, 100000, replace = TRUE)
          with
          bday <- sample(1:360, 100000, replace = TRUE, prob = rep(dat[,2]/sum(dat[,2]), each = 30))

          It makes the estimate slightly better but has no real effect. I preferred to use the simplified model. Also some of my earlier posts are still waiting for moderation…

  6. Elin,

    On looking closer at their methods they write:
    “We identified 3 595 055 individuals in the Swedish Birth Registry who were born from 1973 through 2008. Month and day of birth information was complete for the entire cohort…The study cohort was followed for CMM incidence from birth through 31 December 2009 (maximum attained age was 37 years). All incident CMM cases were identified using the International Classification of Diseases, 7th revision, code 190, in the Swedish Cancer Registry. This registry includes all primary incident cancers in Sweden since 1958, with compulsory reporting nationwide.”

    So it appears there was pre-existing data on birth month distribution for normal and CMM-diagnosed people. They accessed these databases and then needed to compare the two. Now look again at this chart:
    http://s29.postimg.org/nkq3oqicn/CMM3.png

    If roughly the same number of people were born each year from 1973-2008, what would be the average age on Jan 1st, 2009? mean(0:35)=17.5. This is close to the lower value on that chart. Likewise, on Dec 31st the average would be 18.5. If they had just collected all the data in April, then all people with birthdays before that would have age=18.5, and all after would have age=17.5. That is not what we see; instead the average age decreases gradually.

    Here is what they must have done. First, from the paper it does not look like anyone good with databases was involved in this project; it has what looks like an Excel chart that plots a sine curve, and they do not even plot their data. It was probably slow going, and they were not going to go back and update so many records once recorded. So they started going through the data in April (probably of 2009). They needed to go through 3 million records: formatting them, excluding people, comparing the two databases, etc.

    They went through the data in some order unrelated to birthday, probably alphabetical. Whenever they added a person to their Excel spreadsheet, it calculated that person’s age from the current date to their birthdate. Since they started in April, all the ages before this point were recorded as ~18.5. They kept going until the end of July, after which all ages are recorded as about 17.5. From April-July it was essentially random whether the day they processed a record was before or after the birthday; thus the gradual decrease looks just like what you get if you compare two samples from uniform distributions and see which is higher:

    R code:
    birthdayday <- sample(1:360, 100000, replace = TRUE)  # uniform birthdays
    followupday <- sample(90:210, 100000, replace = TRUE) # days the records were processed

    Due to this, people with later birthdays had less chance to get cancer before being added to their spreadsheet. Because the sample was so large this had a substantial effect on their results.

        • Question,

          Did they really need to enter all the data by hand over those few months, or could it just be that those are the months when people get checked out for skin cancer? You know, because summer is when you would notice a funny mole on your lower back.

          In that case, the pre-summer birth months would have an extra year of summers and check-ups to catch the melanoma compared to the post-summer birth months. Someone born in January 2000 would have had 10 summers before the end of data collection on Dec. 2009, with the last one coming at age 9 years and 6 months, and someone born in November 2000 would have only had 9, with the last one coming at age 8 years and 8 months. Repeat for each birth year cohort.

        • One thing that is not clear to me so far is whether they were really using monthly censoring/accounting for month of diagnosis or if they were using annual censoring.

        • Also, they say they censored on 31 December 2009, which should mean that any age calculation is based on that date. However, my sense is that the whole issue of why they cared about birth year and not age is that they had a birth year * birth month table showing the incidence of diagnosis in each cell. The whole point of doing hazard models of various types is to adjust for the fact that people are dropping out due to either failure (in this case diagnosis) or mortality from other causes (or migration, but let’s set that aside), and also to handle right censoring on a fixed date combined with ragged left start dates. That’s why the thing puzzling me most is that the model as people are describing it doesn’t seem to be the kind of model you would use for censored data.

        • Elin, my hope was that the raw microdata would be available, but it isn’t. “question’s” analysis is based on the aggregated data table by year; he then spreads the yearly counts out evenly by birth month and assumes the data are observed at some time around April.

          So, let’s talk about what kind of model they SHOULD have used. I’ll start a new comment thread below to avoid the reply nesting problem…

        • jrc,

          That would not explain why there are more “person years of follow up”, i.e. average years alive × number of people, for the controls (who may have never gone to the doctor at all). If they collected birth data up to Dec 2008 and then “followed up” everyone until Dec 2009, the average age should not vary throughout the year.

        • “If they collected birth data up to Dec 2008 then “followed up” everyone until Dec 2009 the average age should not vary throughout the year.”

          I think it should. Consider people born in January. In December of 2009 they would be:

          11 months old
          1 year 11 months old
          2 years 11 months old….
          etc.

          Now Consider people born in June. In December of 2009 they would be:

          6 months old
          1 year 6 months old
          2 years 6 months old…
          etc.

          So then when you look at people on December 31, 2009, the average age of people born in December is 11 months less than the average age of people born in January.
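
          The same point in R (a sketch with equal cohorts born 1973-2008 and exact ages, ignoring day of month):

          ages <- outer(2009 - (1973:2008), (11:0) / 12, `+`) # rows: birth years; cols: Jan-Dec
          round(colMeans(ages), 2) # mean exact age declines by 1/12 year per birth month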

          As for my hypothesis about visiting the doctor in the summer – that just says that the date when recording stopped (Dec. 31) is not necessarily the date you were “measured”. The “age at measurement” is the last time you went to a doctor. That could explain why you can mimic their result when you “sample” measurements over the summer (“followupday”). It isn’t because that is when they were entering data, its because that is the last time people were “measured”.

          In my mind I was just re-interpreting your line “followupday<-sample(90:210, 100000, replace=T)" not to mean date of data entry but to mean date of last doctor visit. I could be wrong – like I said, this stuff gives me a headache.

        • jrc,

          There was not necessarily any going to the doctor. They compared birth records with cancer records. I originally interpreted it the same way as you, but that does not agree with what they describe in the methods. My interpretation of the “follow up day” is that it is the day they stopped checking the cancer records for that person. Also, it seems unlikely they could organize 3 million people for actual follow-ups. Another possibility is that everyone in Sweden goes to the doctor at least once a year, so they had the dates of those visits. In that case it would be strange for this to occur so much more often from April-June.

          It also seems that when calculating their “person-years” they must have used age in the colloquial sense (“How old are you?” “I am 18 years old”), not in days or years-plus-months alive. If they did it the way you describe, the average age would decrease linearly; instead it pretty much only decreases from April-July.

        • question,

          I interpreted their data collection as: they downloaded two official Swedish governmental datasets, used all the people who were in them and of the right age, and used all information entered up until Dec. 31, 2009. The first dataset is the Swedish version of the Census/IRS combined, and the second is a registry of all cancer diagnoses in Sweden (see below). Sweden is a very organized place data-wise, so they merged the two datasets using something like an SSN (ID#), and then they did this analysis on the merged data set. You can do that with Swedish data, if you can get access to it.

          Here is probably their cancer dataset: http://www.socialstyrelsen.se/register/halsodataregister/cancerregistret/inenglish

          So I don’t think there was any time at all for them to be entering/checking people on things that could be updated. I think it must have to do with the various survey timings and (like you said) how they calculate age. Maybe the admin or cancer data is only updated once a year, like in March/April?

          Your simulation with a moving survey (follow-up) window during the summer is compelling. I just wonder if there can’t be a reason there are more summer follow-ups that is not due to taking months to track through the data. I just don’t think they did that. On the other hand – it could be that the cancer data gets updated slowly over time or once a year or something – apparently there are 6 regional cancer registries that have to get combined…

          I think you have the basic mechanism down, I just don’t think that there was a grad student looking at newly updated data every day and plunking it into excel. I think this is either a behavioral effect about people going to the doctor at certain times in the year, and/or a timing/timing-of-reporting effect in the merged household and cancer dataset.

        • Question,

          I really think that if someone was alive for 365 days, that counts as a year of exposure, or maybe 33 and 11/12s years. If they are calculating “person years” I assume the point was to adjust for the differential number of months of exposure for those born in different months, as well as differing cohort size by month. However, it’s a bit ridiculous, because to know we would need the month*year births, not just the month births. Roughly, I think what they probably did to get the person years is

          dat$extramonths <- seq(12, 1)  # Jan-born get 12 extra months of exposure, Dec-born get 1
          dat$personYearRecreated <- 17*dat$Births + dat$Births*dat$extramonths/12

          which gives numbers that are a little low, but not hugely low, for each month’s estimate of person-years:

          [1] 5350338 5227170 5970368 5884480 5732233 5367611 5425753 5235067 5047397
          [10] 4804694 4336266 4332265

          But they could have used actual births per year rather than assuming uniform births per year the way I did. If there are declining births the numbers would be higher. Or they could have done something fancier than the fraction I used.

        • Yes, I did plot it earlier, and no matter what, the March numbers in particular are always out of whack. What I’ve plotted is more or less what you might expect if things were smooth and they followed a procedure like that (based on months, not years). The thing that is particularly odd is that March is already the top month for births, and even bringing it down to 18.37 would need a lot more births in March, which just seems very strange, though I suppose they could have transposed a number there (337487 births rather than 334787).

      • So your theory is they did not actually try to match individual birth registry records with individual cancer records? How did they include the parent variables in that case? (Matching is not a big deal if the birth registry ID is there for the diagnosis — no need to go by hand).

        • Elin,

          I’m not sure what you mean. They did match the records. It is just that whatever procedure they used must have been slow enough that they could not do it all in one go; it took them a few months to get through it.

        • I’m not understanding that; they tracked people through a fixed date, December 31, 2009. They got the cancer registry data as of that date and marked them as diagnosed or not. By definition the people born earlier in the year had more months of exposure than the people born later in the year. I still don’t know how they are measuring this person-years of exposure variable, meaning I don’t know if they have removed people with other forms of mortality and whether they have somehow adjusted for the issue of month.

          For example, say person A was born in June 1976 and was still alive and without a diagnosis on December 31, 2009. How many person years of exposure does that person contribute to the June person-years measurement? What if a person was diagnosed in June 2007: how many person years of exposure is that? It’s not how I would do it, but if you were going to adjust for the months of exposure caused by being born in different months, that person-year statistic is where you would do it.

        • “if you were going to adjust for the months of exposure caused by being born in different months, that person year statistic is where you would do it.”

          I think you would just include a flexible (but not too flexible) specification for age*, where age is measured in months or days. Then you let the seasonal pattern (month or quarter of birth dummies) compete with a smooth polynomial in age**. A real seasonal pattern would show up as a repeating, cyclical bend/wave pulling at the underlying, smooth age profile of cancer appearance***.

          *age, of course, is a measure of person-years (for a specific person)

          **and ditch the year of birth dummies they currently use

          ***unless skin cancers tend to appear during the summer due to immediate UV exposure, at which point the question is whether 6 months or so of age is enough to influence whether someone’s cancer shows up this summer or the next. If a 25y8m-old will get it this summer, and a 25y2m-old will get it next summer, then there will appear to be a seasonal pattern associated with month of birth that is actually an age-profile/seasonal-sun effect, not a birth cohort effect. The difference is that a real birth cohort effect wouldn’t change direction if you moved from Sweden to Antarctica.
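
          Something like this, as a sketch (hypothetical names; cmm is the case indicator, age_months the exact age in months):

          fit <- glm(cmm ~ factor(birth_month) + poly(age_months, 4),
                     family = binomial, data = cohort)
          summary(fit) # a real seasonal effect should survive the smooth age profile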

        • I think there is a basic issue, which is that some people are proposing that time-specific risk is related to age, and others are saying age is just another name for length of the at-risk period (or for cumulative UV exposure), since both started at birth. As soon as the effects start to be time-varying you need to specify a more complex model, and I wouldn’t want to naively guess about that; it’s a good example where we might actually have useful prior information (I don’t know, just saying it may be true)… I wish they had used person-months or person-weeks, because with person-years at this point I don’t know if they are rounding to integers or using decimals prior to calculating the person-years.

        • “For example say person A was born in June 1976 and was still alive and without a diagnosis on December 31 2009. How many person years of exposure does that person contributed to the June person years measurement?”

          Answer: 2009-1976=33

          “What if a person was diagnosed in June 2007, how many person years of exposure is that?”

          Answer: 2009-2007=2

          “What if a person was diagnosed in June 2007, and was still alive and without a diagnosis on May 31 2009 how many person years of exposure is that?”

          Answer: 2008-2007=1 (2008 was the year of the last birthday)

          It seems they must have rounded down to the nearest year (using age in the colloquial sense) in order to produce average ages like the red line in the below chart. If they used actual days alive or years plus months and followed up everyone until Dec 31st 2009 then the average age per month would decrease linearly by birth month like the blue line:
          http://s29.postimg.org/cxtjb169z/CMM6.png

          code: http://pastebin.com/qddkeM5i

        • I misread the above questions and made a typo.

          Born June 1976, Alive/undiagnosed Dec 31st 2009
          Answer: 2009-1976=33

          Born June 1976, Diagnosed June 2007
          Answer: It is not clear and also would depend on the exact birthday. Their behavior here would not have a substantial effect on the average age since diagnosis was so rare.
          A. If they continued counting person years after diagnosis: 2009-1976=33
          B. If they stopped counting person years after diagnosis:
          Bi. Diagnosis after birthday: 2007-1976=31
          Bii. Diagnosis before birthday: 2006-1976=30

          Born June 2007, Alive/undiagnosed May 31st 2009
          Answer: 2008-2007=1 (2008 was the year of the last birthday)

    • I don’t have access to the paper and if you say they didn’t have someone with access to the real databases I’ll take your word for it. But it seems to me that the whole thing should really just be a fairly straightforward life table analysis. Most importantly how were people who had mortality due to other causes handled?

  7. Pingback: International Journal of Epidemiology versus Hivemind and the Datagoround - Statistical Modeling, Causal Inference, and Social Science

  8. @question: “This could be a field that has only been studying their own opinions for many years.”

    This in my experience summarizes much of the research in health. Even researchers that use methods more sophisticated than Excel simply beat the data into confessing their preconceived hypothesis.

    It reminds me of Oscar Wilde’s quip that a cynic is a person that knows the price of everything but the value of nothing. Similarly I think many researchers know a lot of stats and econometrics but nothing about research practice. Yet most of the textbooks they read implicitly assume flawless research practice, which is needed to establish the validity of the methods they teach.

    • Agree, with the exception of some groups that are obsessed with checking and double-checking details (even when the world is waiting to see the first analysis of clinical experiences treating SARS patients, and the clinical research fellow has to do the double data entry themselves but is told they have to before looking at the data).

      The _disasters_ usually could have been avoided by _common sense_ or an appointed devil’s advocate to try to find what might have gone wrong.

      Discovered errors will get _fixed_ in order of being able to: 1. blame it on someone outside the group, 2. blame it on someone junior in the group, 3. rework it as a “yes, we made this mistake but we can now show you how to avoid it” journal submission. (The third one is maybe not so bad.)

  9. So, per Elin’s questions and comments above, how SHOULD they have analyzed this data (from a Bayesian perspective)?

    At the individual level, the thing to do would be to look at the registry of people with melanoma diagnosis and calculate (Diagnosis Date – Birth Date) to the nearest day (or if only month is available, to the nearest month). Express it in units of years maybe.

    For those who are not diagnosed, calculate an age at last visit/check for melanoma, or an age at the end of the study, if you assume (fairly accurately) that essentially all of them don’t have melanoma at the end of the study.

    A good first model to look at in my opinion would be to create a parameter which represented the actual time to onset (which would have to be somewhat smaller than the duration to diagnosis). Let’s assume a prior over the delay until diagnosis of say exponential(1/30) on the theory that it takes on the order of several months for most people to notice the melanoma from the time the first cell starts to grow.

    We then assume a constant rate of melanoma risk in time, call it r.

    Put a prior on r to make it on the order of 1/1000 people have melanoma by age 30 or so.

    r ~ exponential(1/(1/30000)); /* per year*/

    Then, for each diagnosis (in stan pseudocode)

    for(i in 1:Ndiagnosed){
    1 ~ poisson( (ageatdiagnosis[i] - delaytodiagnosis[i])*r);
    }

    for(i in 1:Nundiagnosed){
    0 ~ poisson(ageatstudytermination[i]*r);
    }

    —————————-

    Now, the above model is a baseline model saying essentially that risk is a constant in time. You run the model, and you get an estimate for r with uncertainty.

    Now, suppose you want to see if there’s an effect caused by birth season? The real hypothesis is that the rate of melanoma generation is higher in those exposed to sunlight in the first few months. So we break it down by season of birth, and look at different r values…

    /* set up 4 rates, one for each season*/
    r[1] ~ exponential(1.0/(1.0/30000));
    r[2] ~ exponential(1.0/(1.0/30000));
    ..etc etc.

    now you break it down by birth season, doing the above for each season

    for(i in 1:Ndiagnosed) {
    1 ~ poisson((ageatdiagnosis[i]-delaytodiagnosis[i])*r[season[i]]);
    }
    … and the same for the undiagnosed ones as above except using the different rates.

    now get the sample estimates for the rates for each season, and graph them to see if there are any differences.

    —-

    at least, that’s the first pass that I would come up with on a thursday afternoon in about a half hour of thinking about the problem of modeling this stuff.

    note, that the Bayesian model is very straightforward and makes good sense, but it doesn’t look like the kind of thing you’d see in a typical intro textbook to stats for epidemiology.

    • There are a few different ways I see to approach this.
      Well, this is time-to-failure with censored data. There are people who would get a diagnosis in the future if we followed them longer. There are also people who died before the cutoff date from other causes, who would or would not have gotten one.
      What you need to do is restructure the data to represent that. So for each person you need to know whichever is shortest: birth to diagnosis, birth to death, or birth to Dec 31, 2009, plus an indicator for whether the censoring was due to diagnosis or not. I’d do that in days.
      Of course you can run a proportional hazards model or an accelerated failure time model at that point, but I’d think you’d want to deal with some of the other issues first (like the question of whether there is differential delay until diagnosis based on season of birth).
      I actually would really want to construct a life table for each month or week of birth and plot the age specific risks … since from what people have posted they are making very specific claims about the date of peak and minimum risk, the more specific/narrow the time bin the better.
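
      In R’s survival package that setup would look roughly like this (a sketch; column names hypothetical):

      library(survival)
      # time:  days from birth to the earliest of diagnosis, death, or 31 Dec 2009
      # event: 1 if the interval ended in a CMM diagnosis, 0 if censored
      fit <- coxph(Surv(time, event) ~ factor(birth_month), data = cohort)
      summary(fit)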

      • Fortunately, with a population of people under 37 years of age (and average age about 17 or whatever) you probably don’t have to deal with death as an important censoring factor here. The CDC US life tables say 97% of people in the US live to age 35 (I’d bet it’s the same or higher in Sweden). Almost all of those deaths are going to be something like accidents, and so unrelated to melanoma risk. Not that you couldn’t follow up by checking on the effect of that, but I suspect it’s a second-order thing.

    • Epidemiologists will have been shown Poisson regression and survival analysis techniques at some point in their education.
      (Most, I believe, are just not very good with quantitative reasoning, based on my biased personal experience.)

      My guess here is the greatest uncertainties lie in understanding how the cancer registry obtains and especially releases the data (e.g. is the most recent release understood to be of poor quality but made available for pragmatic reasons) and whether there are strong patterns of seasonally related selective self diagnosis of the need to seek medical attention.

      • Understanding how the cancer registry works: I agree this is an issue. This is one reason I put a delay to diagnosis parameter in my model pseudo-code above.

        the first thing you’d want to do is just histogram diagnosis dates. if they mostly fall within a month or two, you can probably assume that there’s some kind of annual data release or annual checkup or annual process in general (like JRC pointed out, maybe just that people tend to notice melanoma in the spring when they take off their heavy winter clothes). In that case, you could in my above model just put a longer prior on delay to diagnosis. Say exponential with 1/4 year or 1/2 year average.

        the greater the uncertainty in delay to diagnosis, the more the model will have leeway to let the rate be constant and the delay to diagnosis vary between patients. As a next step, you might want to model the delay to diagnosis in terms of factors that might be seasonal (like clothing choices). And as a next further step, you might want to model the risk as a time-varying thing, for example using the physics of UV radiation and the seasons to impose a periodically varying risk per unit time, and maybe multiply by exp(G(t)) where G is a gaussian process with mean 0 and which includes annual periodic component, and components with multi-year time scales to deal with variability in risk as a function of age.

        I am pretty sure you’d find that a constant risk in time would fit well enough given just the uncertainty in delay to diagnosis of several months, and the extra modeling would probably not be worth the effort unless you were trying to debunk this theory by showing how time varying risk especially with heavy risk in the early months of life is wrong.

        • Assuming a constant risk plus a periodically varying component (with just the first Fourier component of the variation), you can build a simple closed-form likelihood in Maxima as follows. This calculates the density by setting up an ODE for the CDF of time to onset of the disease, and then differentiating it to get a density of time-to-onset.

          (%i9) rate:r0*(1+a*sin(2*%pi*b*t));

          (%o9) r0*(a*sin(2*%pi*b*t)+1)
          (%i10) odeprob:'diff(p,t) =rate*(1-p);

          (%o10) 'diff(p,t,1) = (1-p)*r0*(a*sin(2*%pi*b*t)+1)
          (%i11) cdf:ode2(odeprob,p,t);

          (%o11) p = (%e^(r0*t-a*r0*cos(2*%pi*b*t)/(2*%pi*b))+%c)
          *%e^(a*r0*cos(2*%pi*b*t)/(2*%pi*b)-r0*t)
          (%i12) cdf:ic1(cdf,t=0,p=0);

          (%o12) p = -%e^(-r0*t-a*r0/(2*%pi*b))*(%e^(a*r0*cos(2*%pi*b*t)/(2*%pi*b))
          -%e^(r0*t+a*r0/(2*%pi*b)))
          (%i13) cdf:ratexpand(cdf);

          (%o13) p = 1-%e^(a*r0*cos(2*%pi*b*t)/(2*%pi*b)-r0*t-a*r0/(2*%pi*b))
          (%i14) density:factor(ratexpand(diff(rhs(cdf),t)));

          (%o14) r0*%e^(a*r0*cos(2*%pi*b*t)/(2*%pi*b)-r0*t-a*r0/(2*%pi*b))
          *(a*sin(2*%pi*b*t)+1)
          (%i15)

          for people who make it to the end of the monitoring period, you can use one minus the CDF to get the likelihood of going longer than that duration without melanoma.
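
          Translated to R, the density and CDF above give a likelihood covering both groups (a, r0 and b as in the Maxima output):

          cdf <- function(t, r0, a, b) 1 - exp(a*r0*cos(2*pi*b*t)/(2*pi*b) - r0*t - a*r0/(2*pi*b))
          dens <- function(t, r0, a, b) r0 * (1 + a*sin(2*pi*b*t)) *
            exp(a*r0*cos(2*pi*b*t)/(2*pi*b) - r0*t - a*r0/(2*pi*b))
          # log-likelihood: sum(log(dens(t_diagnosed, ...))) + sum(log(1 - cdf(t_censored, ...)))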

  10. Tl;dr

    Having only briefly read through the comments this is my attempt at using a DAG to model the situation.

    My understanding is that “day of birth” (e.g. which of 365 days you are born on) determines exposure to sun in early childhood, which in turn can increase risk of melanoma independent of age. That is, a 1-year-old born on June 1st has a higher risk than one born on Sept. 1st, assuming summer is the season with the highest UV dosage. The hypothesis, as I understand it, is that it matters when, in the first few months of life, you are exposed to summer. That is, there is an interaction between day of birth and age (measured in days).

    I assume panel attrition (e.g. death by other causes, etc.) is just a function of age. I also assume censoring is on the basis of how long ago people were born relative to the end date of data collection (today) as measured by the date of birth. In both cases controlling for age d-separates them from the outcome, so they are uninformative for the outcome. Effectively this means that the distribution of outcomes for attrited units is the same as that of non-attrited units conditional on age. From here on it is easy to calculate the quantities of interest.

    If you don’t believe the assumptions upload a DAG with your assumptions and we can see what tests, if any, can distinguish between them.

    Where does Bayes come in? In estimating the interaction a hierarchical model will stabilize estimates, as many cells will be sparse. We can also coarsen by using weeks instead of days.
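
    As a sketch, that hierarchical version might look like this in R (hypothetical names; lme4’s glmer, with birth weeks partially pooled to stabilize the sparse cells):

    library(lme4)
    fit <- glmer(cmm ~ (1 | birth_week) + splines::ns(age_days, df = 4),
                 family = binomial, data = cohort)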

      • but that’s the point, the “theory” is that experiencing summer in the first few months of life when you have “sensitive skin” makes you more susceptible. The alternative theory espoused here is that they mechanically created some data artifact via the analysis procedure.

        • No, it’s not that you have sensitive skin; to some extent it is that you can’t sweat or discharge heat. The AAP recommendations talk about why they think infant exposure is a problem, and in Sweden and other northern countries the winter-born infants don’t get that exposure at under 6 months.

        • I put “sensitive skin” in quotes because I sort of mean “there’s something that makes these children more susceptible” I didn’t mean to specify a particular mechanism. I’m reading your AAP reference now though.

        • The portion where I saw they mentioned sweating and heat regulation is relevant to baby health in general, but I don’t think they’re implying an interaction with melanoma formation, more that if you get hot, can’t sweat, and can’t move, it’s not going to be generally good for you (heat stroke etc).

          For reference I’m looking at page 330 of the AAP recommendations you link above:

          “Infants <6 months should be kept out of direct sunlight. Because they are not mobile and cannot remove themselves from uncomfortable light and heat, they should be moved under a tree, umbrella, or the stroller canopy, although on reflective surfaces an umbrella or canopy may reduce UVR exposure by only 50%. Many infants have impaired functional sweating. Exposure to the heat of the sun may increase the risk of heatstroke. Sunburn may occur readily because an infant’s skin has less melanin than at any other time in life.31,32”

          I don’t take this to mean that sweating etc affects melanoma risk. In fact, the last sentence gives reason to believe that maybe sun exposure at this time might be less risky since low melanin = low melanocytes = low chance of mutation, though on the other hand, if they do mutate, they will be proliferating rapidly… so there are mixed models there.

    • @Fernando:

      Interesting approach. I love it as a way to see how DAGs can help.

      So, just like @question’s approach clearly makes a quantitative case for what possibly went wrong here, can you do anything similar with your DAG-based approach?

  11. My point is that there is a very simple way to communicate this theory graphically without having to jump immediately into a parametric model. And it reveals a simple estimation strategy.

    Had they used this language they might have avoided the problem.
