I’m remembering that our second daughter’s c-section was scheduled for June 30, which was the last day of the medical practice’s fiscal year. [They were reorganizing as of July 1 and would no longer have been covered by our insurance, which would have required us to switch practices at the last minute or pay for it all ourselves.] I’m not particularly superstitious, but I don’t think we’d have intentionally scheduled a c-section for Friday the 13th.

]]>https://github.com/fivethirtyeight/data/tree/master/births ]]>

If I crank the matrix size up to 10,000 x 10,000, it takes about 24.3 seconds to calculate the inverse (mean of 10 trials). And, if I go to 30,000 x 30,000, I seem to get memory thrashing and I have never waited for completion.

Interestingly, it takes Mathematica about 500 ms to invert a 300 x 300 matrix of integers—using integer arithmetic. In one case, it calculated that the (1,1) element of the inverse matrix is a fraction with a denominator having about 456 digits.
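To illustrate why exact inversion of an integer matrix produces such large fractions, here is a minimal sketch of Gauss-Jordan elimination over exact rationals, using only Python's standard `fractions` module. The function name `exact_inverse` is made up for this example, and this is not a claim about how Mathematica does it internally.

```python
from fractions import Fraction


def exact_inverse(matrix):
    """Invert a square matrix of integers exactly, using rational arithmetic."""
    n = len(matrix)
    # Build the augmented matrix [A | I] with exact Fraction entries.
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes exactly 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    # The right-hand half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]
```

Every entry stays an exact fraction, so numerators and denominators can grow with the matrix size, which is why exact inversion is far slower than floating-point inversion for matrices of the same dimension.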

]]>Thanks

]]>Yes, and in general I think the individual-date effects will interact with weekend effects: When April Fool’s or Feb 29th is Sunday, for example, I’d expect the number of births to be slightly lower than for the same date as a weekday, but not as low as would be implied by the additive model of special-date and day-of-week effects.

]]>There are ways to bring on contractions – I suspect that women will use those ways on days after Christmas but not before. Before and on Christmas they’ll be too busy with Christmas preparations and celebrations to want to deal with a baby as well.

]]>http://statmodeling.stat.columbia.edu/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/

The problem with this approach is that when the number of years is not divisible by 7, the weekday effect dominates and some special-day effects are difficult to see. Weekdays, leap days, and floating special days make simple approaches more difficult. ]]>
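The weekday imbalance mentioned above is easy to check directly. The sketch below counts how often a fixed calendar date falls on each weekday over a range of years; the helper name `weekday_counts` is hypothetical and it uses only the standard library.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_counts(month, day, first_year, last_year):
    """Count how often a calendar date falls on each weekday (inclusive years)."""
    counts = Counter()
    for year in range(first_year, last_year + 1):
        try:
            d = date(year, month, day)
        except ValueError:
            continue  # the date does not exist this year, e.g. Feb 29
        counts[WEEKDAYS[d.weekday()]] += 1
    return counts


if __name__ == "__main__":
    # Dec 25 over the 15 years 2000-2014: 15 is not divisible by 7,
    # so some weekdays must occur more often than others.
    print(weekday_counts(12, 25, 2000, 2014))
```

If a date lands on weekends more often than other dates do, a plain per-date average mixes the weekend dip into the apparent special-day effect.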

Bob

]]>The data you’re looking at are the old 1968-1988 data, where the Christmas dip is 20%. The above graph shows the 2000-2014 data, where the Christmas dip is 50%. The effect has increased over the years, as the proportion of scheduled births has increased. There’s no exaggeration here; it’s just a change over time.

]]>But the births.csv file reveals that the dip is more like 20% of the Dec. average.

Isn’t the adjustment exaggerating the effect?

]]>In many areas of research, you start with the cannon. Once the mouse is dead and you can look at it carefully from all angles, you can design an effective mousetrap. Red State Blue State went the same way: we found the big pattern only after fitting a multilevel model, but once we knew what we were looking for, it was possible to see it in the raw data.

]]>The raw data show the gross features of the data (Christmas, etc.) but it’s hard to pick out what’s going on with the smaller effects such as Valentine’s Day.

]]>http://research.cs.aalto.fi/pml/software/gpstuff/demo_births.shtml

they are:

http://chmullig.com/wp-content/uploads/2012/06/births.csv

and

http://www.mechanicalkern.com/static/birthdates-1968-1988.csv

The thing about doing sums over each day of the year is that you’re assuming EXACT periodicity of 1 year (which isn’t even a well defined unit of time thanks to leap years).

Births are basically a function of time in days, and f(t) doesn’t need to be exactly periodic with period 365 days, with all of its spectral components at integer multiples of the corresponding frequency. In fact, according to the graphs above, it isn’t.

However, the Gaussian process approach *is* very computationally challenging due to the massive covariance matrix. So it’s a good question whether you can get nearly-as-good inference by some other, less computationally heavy method.

One issue is that we’re using a lot of knowledge in setting up the covariance matrix. There are NxN elements where N is the number of days of data we have. But there are only 5 or 10 hyperparameters that determine the full NxN matrix.
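As a rough illustration of how a handful of hyperparameters pins down the whole NxN matrix, here is a sketch that builds a covariance matrix from five numbers. The kernel choices (a squared-exponential term for slow trends, a standard periodic term for yearly seasonality, and independent noise) and all parameter names are assumptions for this example, not the model actually used for the graphs.

```python
import numpy as np


def covariance_matrix(t, trend_var, trend_len, season_var, season_len,
                      noise_var, period=365.25):
    """Build an N x N covariance matrix over days t from five hyperparameters."""
    t = np.asarray(t, dtype=float)
    dt = t[:, None] - t[None, :]  # all pairwise time differences in days
    # Slow trend: squared-exponential kernel.
    slow = trend_var * np.exp(-0.5 * (dt / trend_len) ** 2)
    # Yearly seasonality: standard periodic kernel with the given period.
    periodic = season_var * np.exp(
        -2.0 * np.sin(np.pi * dt / period) ** 2 / season_len ** 2
    )
    # Independent day-to-day noise on the diagonal.
    noise = noise_var * np.eye(len(t))
    return slow + periodic + noise


if __name__ == "__main__":
    K = covariance_matrix(np.arange(1000), 1.0, 500.0, 0.5, 1.0, 0.1)
    print(K.shape)  # one million entries determined by five hyperparameters
```

The cost comes from the matrix itself, not the parameter count: storage grows as N squared and a dense factorization as N cubed.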

Can we specify some basis expansion using maybe a few more hyperparameters, say 10 or 20, that represents our knowledge well enough to get similarly high-quality inference? For example, can we use a low-order Chebyshev polynomial to represent the long-term trends, a radial basis function expansion for the “seasonal trends”, a periodic 7-day “weekday” basis function for the weekly trends, and some kind of discrete, exactly year-periodic “special day” functions, and wind up with, say, 20-50 parameters to estimate but a computation that is much faster than inverting a million-element matrix?
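A minimal sketch of what such a basis expansion might look like, fitted by ordinary least squares with NumPy. Everything here is an assumption made for illustration: the function names, the default sizes, the bump width, and the use of plain least squares rather than a Bayesian fit. The “special day” indicator columns are left out to keep it short.

```python
import numpy as np
from numpy.polynomial import chebyshev


def design_matrix(t, trend_degree=3, n_seasonal=6, year=365.25):
    """Columns: Chebyshev trend, periodic seasonal bumps, weekday indicators."""
    t = np.asarray(t, dtype=float)
    # Long-term trend: Chebyshev polynomials on time rescaled to [-1, 1].
    x = 2.0 * (t - t.min()) / (t.max() - t.min()) - 1.0
    trend = chebyshev.chebvander(x, trend_degree)  # includes the constant column
    # Seasonal part: Gaussian bumps on the position within the year,
    # using circular distance so that Dec 31 is close to Jan 1.
    phase = (t % year) / year
    centres = np.arange(n_seasonal) / n_seasonal
    d = np.abs(phase[:, None] - centres[None, :])
    d = np.minimum(d, 1.0 - d)
    width = 0.5 / n_seasonal
    seasonal = np.exp(-0.5 * (d / width) ** 2)
    # Weekday part: one indicator per day of week, dropping day 0 so the
    # indicators are not collinear with the constant column.
    dow = t.astype(int) % 7
    weekday = (dow[:, None] == np.arange(1, 7)[None, :]).astype(float)
    return np.hstack([trend, seasonal, weekday])


def fit_basis(t, y, **kwargs):
    """Least-squares fit; returns (coefficients, fitted values)."""
    X = design_matrix(t, **kwargs)
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef, X @ coef
```

With the defaults this is 16 coefficients, and the fit works on an N x 16 design matrix instead of factorizing an N x N covariance matrix. Whether the resulting inference is comparable in quality is exactly the open question.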

]]>Just to compare, I wonder how a raw, unadjusted graph would look, i.e., one where all you do is average each day of the year over all the years in the dataset.

Aki / Andrew: Is that graph posted anywhere?
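The unadjusted averaging described in this comment could be sketched as follows. The record layout `(year, month, day, births)` and the function name `raw_daily_means` are assumptions for the example; the actual column layout of the CSV files has not been checked here.

```python
from collections import defaultdict


def raw_daily_means(records):
    """Average births for each (month, day) over all years, with no adjustment.

    `records` is an iterable of (year, month, day, births) tuples
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(lambda: [0, 0])  # (month, day) -> [sum of births, count]
    for _year, month, day, births in records:
        entry = totals[(month, day)]
        entry[0] += births
        entry[1] += 1
    return {key: total / n for key, (total, n) in sorted(totals.items())}


if __name__ == "__main__":
    sample = [(2000, 1, 1, 10), (2001, 1, 1, 20), (2000, 2, 29, 5)]
    print(raw_daily_means(sample))
```

Feb 29 is averaged over leap years only, and nothing here corrects for which weekdays a given date happened to fall on.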

]]>The 538 graph estimates an average “the 13th” effect, averaging over 12 days of the year. This would be pretty noisy if you wanted to estimate each individual date. With enough data, any method would work, but given data limitations, there are benefits to fitting a model.

]]>One way to intuitively understand that is as a low-pass filter. If we fit the data while restricting our function to have only slow variations, on the scale of say 30 days or longer, then we’ll get “time and seasonality” (here think of “time” as something like “timescale of > 200 days” and “seasonality” as “timescale between say 30 and 200 days”). Of course, day of week introduces an exactly periodic component with period 7 days.

Other than averages that we can compute strictly with these “slow moving” or “7 day periodic” functions, what else is going on? That’s what’s meant.
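The low-pass intuition can be shown with a very crude filter. The sketch below uses a centred 7-day moving average, a stand-in chosen for this example rather than the smoother actually fitted, on a made-up series consisting of a slow linear trend plus a zero-mean weekly pattern.

```python
import numpy as np


def moving_average(y, window):
    """Moving average over `window` consecutive points (no edge padding)."""
    y = np.asarray(y, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits, so the
    # output has len(y) - window + 1 points.
    return np.convolve(y, kernel, mode="valid")


if __name__ == "__main__":
    t = np.arange(70)
    slow = 100.0 + 0.1 * t                             # slowly varying part
    weekly = np.array([3, 2, 1, 0, -1, -2, -3])[t % 7]  # zero-mean weekly part
    smooth = moving_average(slow + weekly, 7)
    # Every 7-day window contains each weekday once, so the weekly part
    # cancels and only the slow part remains.
    print(np.allclose(smooth, slow[3:-3]))
```

Whatever is left after subtracting the slow part and the 7-day periodic part is the candidate “special day” signal.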

]]>1978: https://www.youtube.com/watch?v=fGWR3uI3Qa0

(But the word order is backwards, and ‘needles and pins’ doesn’t show the post 1980 rise that ‘pins and needles’ does.)

]]>Completely unrelated. What happened to last week’s scheduled items?

Thurs: FDA approval of generic drugs: The untold story

Fri: Acupuncture paradox update

Bob (who is sitting on pins and needles waiting for the Friday item)

PS Google Books Ngram Viewer shows use of the phrase “sitting on pins and needles” peaking just before 1900—but, it started coming back in about 1980 and in about 2003 usage surpassed that of any time in the 20th century. (smoothing of 3 in Ngram viewer)

]]>