Jon Minton writes:

You may be interested in a commentary piece I wrote early this year, which was published recently in the International Journal of Epidemiology, where I discuss your work on identifying an aggregation bias in one of the key figures in Case & Deaton’s (in)famous 2015 paper on rising morbidity and mortality in middle-aged White non-Hispanics in the US.

Colour versions of the figures are available in the ‘supplementary data’ link in the above. (The long delay between writing, submitting, and the publication of the piece in IJE in some ways supports the arguments I make in the commentary, that timeliness is key, and blogs – and arxiv – allow for a much faster pace of research and analysis.)

Below is an example of the more general approach I try to promote for looking at outcomes which vary by age and year: I used data from the Human Mortality Database to produce a 3D printed ‘data cube’ of log mortality by age and year, whose features I then discuss. [See here and here.]

Seeing the data arranged in this way also makes it possible to see, for example, when the data quality improves: the texture of the surface changes from smooth (imputed within 5/10 year intervals) to rough.

I agree with your willingness to explore data visually to establish ground truths which your statistical models then express and explore more formally. (For example, in your identification of cohort effects in US voting preferences.) To this end I continue to find heat maps and contour plots of outcomes arranged by year and age a simple but powerful approach to pattern-finding, which I am now using as a starting point for statistical model specification.

The arrangement of data by year and age conceptually involves thinking about a continuous ‘data surface’ much like a spatial surface.

Given this, what are your thoughts on using spatial models which account for spatial autocorrelation, such as in R’s CARBayes package, to model demographic data as well?

My reply:

I agree that visualization is important.

Regarding your question about a continuous surface: yes, this makes sense. But my instinct is that we’d want something tailored to the problem; I doubt that a CAR model makes sense in your example. Those models are rotationally symmetric, which doesn’t seem like a property you’d want here.

If you *do* want to fit Bayesian CAR models, I suggest you do it in Stan.

Minton responded:

I agree that additional structure and different assumptions to those made by CAR would be needed. I’m thinking more about the general principle of modeling continuous age-year-rate surfaces. In the case of fertility modeling, for example, I was able to follow enough of this paper (my background is as an engineer rather than statistician) to get a sense that it formalises the way I intuit the data.

In the case of fertility, I also agree with using cohort and age as the surface’s axes rather than year and age. I produced the figure in this poster, where I munged Human Fertility Database and (less quality assured but more comprehensive) Human Fertility Collection data together and re-arranged year-age fertility rates by cohort to produce slightly crude estimates of cumulative cohort fertility levels. The thick solid line shows at which age different cohorts ‘achieve’ replacement fertility levels (2.05), which for most countries veers off into infinity if not achieved by around the age of 43. The USA is unusual in regaining replacement fertility levels after losing them, which I assume is a secondary effect of high migration, with migrant cohorts bringing with them a different fertility schedule from that of non-migrants. The tiles are arranged from most to least fertile in the last recorded year, but the trends show these ranks will change over time, and the USA may move to top place.
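For readers who want to try the re-arrangement described above, here is a minimal sketch in Python. It is not Minton's code: the function names, the toy inputs, and the approximation cohort = year - age (which ignores the fact that a one-year age group in a given calendar year spans two birth years) are all assumptions made for illustration. It takes a year-by-age matrix of age-specific fertility rates, re-indexes it by birth cohort, accumulates over age, and reports the first age at which each cohort reaches a chosen level such as 2.05.

```python
import numpy as np

def cumulative_cohort_fertility(rates, years, ages):
    """Re-index period rates by birth cohort and accumulate over age.

    rates[i, j] is the age-specific fertility rate in calendar year
    years[i] at age ages[j]. Both years and ages are assumed to be
    consecutive integers. Cohort is approximated as year - age.
    """
    rates = np.asarray(rates, dtype=float)
    years = np.asarray(years)
    ages = np.asarray(ages)
    cohorts = np.arange(years.min() - ages.max(), years.max() - ages.min() + 1)
    by_cohort = np.full((cohorts.size, ages.size), np.nan)
    for i, year in enumerate(years):
        for j, age in enumerate(ages):
            by_cohort[year - age - cohorts[0], j] = rates[i, j]
    # cumsum propagates NaN, so a cohort that is not observed from the
    # youngest age onward has no cumulative estimate at later ages.
    return cohorts, np.cumsum(by_cohort, axis=1)

def age_reaching(cum, ages, level=2.05):
    """First age at which each cohort's cumulative fertility reaches level.

    Returns NaN for cohorts that never reach it in the observed data.
    """
    with np.errstate(invalid="ignore"):
        reached = cum >= level  # comparisons with NaN are False
    first = reached.argmax(axis=1)
    return np.where(reached.any(axis=1), np.asarray(ages)[first], np.nan)
```

The NaN handling is a deliberate choice: partially observed cohorts are left without an estimate rather than filled in, so any imputation or forecasting of incomplete cohorts would have to be added explicitly.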

In addition to CAR models in Stan, you can also add an ICAR component – the equivalent of the WinBUGS/GeoBUGS car.normal function. See our case study http://mc-stan.org/users/documentation/case-studies/icar_stan.html.
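To make the ICAR idea concrete for an age-by-year surface, here is a rough Python sketch of the pairwise-difference form of the ICAR prior, in which the unnormalised log density is minus one half of the sum of squared differences between neighbouring cells. The grid layout, the rook-style neighbourhood (each cell adjacent to the cells one age or one year away), and the function names are assumptions for illustration; this is not code from the case study, and a full model would also need a sum-to-zero constraint and a scale parameter.

```python
import numpy as np

def grid_edges(n_age, n_year):
    """List each pair of adjacent cells once, as two index arrays.

    Cells are numbered row by row on an n_age x n_year grid; two cells
    are neighbours if they differ by one step in age or in year.
    """
    idx = np.arange(n_age * n_year).reshape(n_age, n_year)
    age_pairs = np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()])
    year_pairs = np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()])
    node1, node2 = np.concatenate([age_pairs, year_pairs], axis=1)
    return node1, node2

def icar_log_density(phi, node1, node2):
    """Unnormalised ICAR log density on the given edge list."""
    phi = np.asarray(phi, dtype=float).ravel()
    return -0.5 * np.sum((phi[node1] - phi[node2]) ** 2)
```

With this neighbourhood, a surface that varies only across years receives exactly the same prior density as its transpose, which varies only across ages. That is one way to see the symmetry concern raised earlier in the exchange: the prior does not distinguish the age direction from the year direction, or either from a cohort diagonal, unless the neighbourhood structure or weights are tailored to do so.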

In case of interest, a paper I’ve written since, which touches on the issue of ‘thinking spatially’ about demographic data, without applying the necessary tools – like using models with neighbourhood matrices – is available as a pre-print here:

https://osf.io/ntz72/

The main point made by this paper is that – visually – the pattern for alcohol-related deaths looks like a ‘horizontal band then an ellipse’, and the pattern for drugs-related deaths looks like a ‘truncated triangle’. If modelling first, for example using Intrinsic Estimator models, the structure would be misspecified from the outset. After noting these patterns, simple binomial regressions are then fit to both sets of data, and as expected fit the data for which they’re designed better than the other data. It would be great to build and fit models along these lines using Stan, but that’s not a leap I’ve taken yet.

It sounds like you are saying “look at the shape of the data and then use a model that follows (or at least, allows) that shape.” This is appealing, but has the inherent problem that it can be unduly influenced by poor data.

I’m not saying that this approach is necessarily wrong, but that it always needs to be interpreted as “contingent on the quality/representativeness of the data”. For example, one always needs to ask if the shape that the data show might be influenced by the particular sample, or the sampling method, or inherent problems in obtaining data (e.g., the method of collection; or using sensitive questions that might lead to differential response or differential truthfulness in response).

I agree, and I think the inference drawn from noting any particular ‘shape’ in any particular dataset also has to depend on whether associated ‘shapes’ can be identified in other datasets which are consistent with an underlying aetiological theory. For example, the cohort effect associated with being born around 1918 has been noted in the mortality data for many different countries, mainly in Western European populations. If it were only noted in one population, it would be easier to assume it’s a data artefact. There has to be a certain triangulation between datasets and other sources of information in order to draw meaningful inferences from any one pattern. Some of this process can likely be improved by setting up competing statistical model frameworks, but it’s not the whole process.

A pre-print of a paper, based on the conference poster discussed above, is now available here: https://osf.io/fruhz