On the cover of BDA3 is a Bayesian decomposition of the time series of birthdays in the U.S. over a 20-year period. We modeled the data as a sum of Gaussian processes and fit it using GPstuff.

Occasionally we fit this model to new data; see for example this discussion of Friday the 13th and this regarding April 15th.

We still can’t just pop these models into Stan—the matrix calculations are too slow—but we’re working on it, and I am confident that we’ll be able to fit the models in Stan some day. In the meantime I have some thoughts about how to improve the model and we plan to continue working on this.
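To make the "sum of Gaussian processes" idea concrete, here is a minimal numerical sketch (in Python, with made-up kernels and length-scales; this is not the actual BDA3 model or the GPstuff/Stan code): a composite covariance built from a smooth trend plus a weekly periodic component, and the O(n^3) Cholesky factorization that makes naive fitting slow.

```python
import numpy as np

def sq_exp(t, s, ell=30.0, sigma=1.0):
    """Squared-exponential kernel: a slowly varying trend component."""
    return sigma**2 * np.exp(-0.5 * ((t - s) / ell) ** 2)

def periodic(t, s, period=7.0, ell=1.0, sigma=0.5):
    """Standard periodic kernel: e.g. a day-of-week effect."""
    return sigma**2 * np.exp(-2.0 * np.sin(np.pi * np.abs(t - s) / period) ** 2 / ell**2)

n = 400
t = np.arange(n, dtype=float)
T, S = np.meshgrid(t, t, indexing="ij")

# A sum of GPs has a covariance that is the sum of the component kernels.
K = sq_exp(T, S) + periodic(T, S) + 1e-6 * np.eye(n)  # jitter for numerical stability

# Naive inference needs an O(n^3) Cholesky factorization of K at every
# log-density/gradient evaluation; with n in the thousands this is the bottleneck.
L = np.linalg.cholesky(K)
```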

Anyway, all the above analyses are Bayesian. Laurie Davies sent along this non-Bayesian analysis he did that uses residuals and hypothesis testing. Here’s his summary report, and here’s all his code.

I prefer our Bayesian analysis for various reasons, but Davies does demonstrate the point that hypothesis testing, if used carefully, can be used to attack this sort of estimation problem.

**The data**

The birthday data used in BDA3 come from National Vital Statistics System natality data, as provided by Google BigQuery and exported to csv by Robert Kern.

More recent data exported by FiveThirtyEight are available here:

The file US_births_1994-2003_CDC_NCHS.csv contains U.S. births data for the years 1994 to 2003, as provided by the Centers for Disease Control and Prevention’s National Center for Health Statistics.

US_births_2000-2014_SSA.csv contains U.S. births data for the years 2000 to 2014, as provided by the Social Security Administration.

The NCHS and SSA numbers differ somewhat in the overlapping years, as we discussed here.

It seems worth pointing out that Laurie Davies doesn’t think of p-values in terms of hypothesis testing — he wants to avoid saying anything about the truth or falsity of any particular model. In his approach a statistical model is an approximation of the data; p-values are used to indicate whether a given model is an adequate approximation. I *think* the Davies interpretation of his p-value procedure is that he retains terms with small p-values because to do otherwise would be to use an inadequate model; conversely, removing terms with large p-values is okay because it simplifies the model without impacting the adequacy of the approximation it provides.

@Andrew: Didn’t Aki and Rob make some progress on this scaling? How bad is it? That is, if you just code the model up in Stan, how slow is it or does it just not move at all? Is that for MCMC or for variational inference or optimization? How did Aki fit the model with GPstuff (assuming that’s what he did) and was it full Bayes or a point approximation or something like variational inference? How long did that take in computational time? Just trying to get a sense of where Stan stands that’s a little more fine grained than “can’t fit the birthday problem”.

@Corey: “p-values are used to indicate whether a given model is an adequate approximation”. Sounds like hypothesis testing to me. And a traditional approach to regression. The one-at-a-time term addition can be problematic when there’s correlation among predictors; it’s trying to tackle a joint combinatorial problem with a greedy componentwise heuristic.

Bob:

Rob etc. tell me they’re making progress. I haven’t been following the details but I look forward to the day that I can fit the birthday model (or some version of it) in Stan on my laptop.

It is operationally exactly like standard frequentist stuff; it’s just that Davies has idiosyncratic and strongly held philosophical views on the justification for these procedures.

(I think in this case the one-at-a-time approach doesn’t run into a problem with correlation among predictors because Davies is using Fourier basis functions with their nice orthonormality property.)

@Bob

A while back there was a poll aggregation model put on the blog; it used a prior that was more or less a Gaussian random walk (day-by-day Gaussian diffusion). I thought this was a little more “broad” than it should be, so I tried to do a similar thing with a stationary Gaussian process prior and some kind of day-by-day Dirichlet distribution for the polling likelihood. It was around 400 days of polling, so the Gaussian process would have had a covariance matrix of around 400×400 elements. I killed the Stan process after a half hour or so, before even getting the first warmup sample.

“The one-at-a-time term addition can be problematic when there’s correlation among predictors; it’s trying to tackle a joint combinatorial problem with a greedy componentwise heuristic.”

I like this treatment of that problem: http://econ.tulane.edu/seminars/Gelbach_Covariates.pdf

Log-density and gradient evaluation in Stan with the built-in covariance function and the block Cholesky algorithm is as fast as in GPstuff. Stan is missing a built-in periodic covariance function and a way to build efficient composite covariance functions. The block Cholesky algorithm is still in a separate branch and is awaiting additional test code. So it’s the usual resource problem.

In general I do regard p-values as a measure of approximation. In this particular case they are not. The p-value for a covariate is the probability that replacing it by standard Gaussian noise will lead to a smaller sum of squared residuals in the least squares case, or a smaller sum of the values of the rho-function in the robust case. There is no assumption of the usual linear model + error (see http://arxiv.org/abs/1610.05131 for the details).

The code provided is an approximation to calculating the robust version. Because of the large number of covariates (14611) and their size (7305), the correct calculation takes about 17 hours but gives better results based on fewer covariates, namely 137. The remaining 14474 are no better than Gaussian noise, which is why they are not chosen. The covariates were trigonometric functions, but only half were periodic over the interval. For example, sin(pi*(1:n)/n) is not periodic over the interval. They were chosen because the data exhibit strong periodicities but the data are not periodic. Orthogonal covariates help for least squares regression but not, as far as I know, for robust regression.
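The construction Davies describes can be sketched as follows (an illustrative Python version of the trigonometric covariates, using n and j as in the text; note that sin(pi*j*t/n) completes j/2 cycles over t = 1,…,n, which is why only half the covariates are periodic over the interval):

```python
import numpy as np

n = 7305
t = np.arange(1, n + 1)

def trig_pair(j):
    """Sine/cosine covariate pair at frequency index j (j = 1, ..., 7304)."""
    return np.sin(np.pi * j * t / n), np.cos(np.pi * j * t / n)

# j = 1: the sine completes only half a cycle over t = 1..n, so it never
# repeats within the interval (not periodic over the interval) ...
s1, _ = trig_pair(1)
# ... while j = 2 completes exactly one full cycle (periodic over the interval).
s2, _ = trig_pair(2)
```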

As it stands the procedure makes no assumptions about the data. It asks only whether the covariates are better than random noise. However if you postulate a standard linear regression model with error term then you can prove theorems on consistency (Theorem 2.1 of http://arxiv.org/abs/1610.05131). They do impose restrictions on the degree of correlation amongst the covariates but these are considerably weaker than orthogonality. See the examples in Section 2.7 where the correlation between any two covariates is 0.8. In another simulation the sample size is 250 and the number of covariates is 30000 with a common correlation of 0.85.

In the ordinary linear regression model with Fourier basis functions, strict periodicity over the interval only really matters when the number of data points is small (according to page 6 of this selection of excerpts from Larry Bretthorst’s book on Bayesian frequency analysis).

Naive question:

In this sort of exercise do we have a specific goal? Is it to predict accurately the number of birthdays on any given day yet to come?

My goal was to demonstrate the capabilities of GPs. My model and many other models could be used to predict the number of future events. Such prediction models (not necessarily GPs) are commonly used to predict, for example, demand for consumer products and customer/patient flows. I guess some hospitals also predict the number of births to plan how many doctors and nurses need to be available, but then they can also partially control the process, as some births are artificially started or delayed.

Thanks.

Can Laurie Davies’s model also do prediction? Or is it only testing hypotheses?

It might be interesting to run your model & produce a prediction of the number of births for each remaining day of 2017. And then a year later we can go and evaluate what sort of errors we get. Assuming the 2017 data will become available at some point.

There is no testing of hypotheses. Which hypotheses am I supposed to be testing? The analysis is essentially one of covariate choice where the number of covariates may be much larger than the sample size. Typically this seems to be done by specifying a linear model y=xb+epsilon and then testing for b_j=0. Lasso, for example, is based on such a model. My paper is not based on this linear model, and so there are no b_j’s to be tested, so to speak.

The data end on the 31st December 1988. I would not make any prediction for 2017 on the basis of such data. Suppose, however, we transport ourselves back to 1989 and require predictions for this year. I would look at the residuals in Figure 6 and note that the number and size of the outliers is increasing with time. I would try to take this into account. I would take a specific day, say 1st April, and examine the 20 residuals for this day. Figure 8 shows a large drop in the mean. This may or may not be caused by large outliers. In any case, after examining these twenty values and depending on what they look like, if many days exhibited a common residual pattern I would borrow strength by combining them. Based on this I would make a prediction for the residual on each particular day. I would then add this back to the regression function and give a prediction for the number of births. Days such as the 8th of August and the 10th and 13th of October would have to be treated with care. They look different from the other days but with no known cause to explain this.

Is there a principled way of choosing the cut-off point?

@Laurie:

My bad. I should read your work. I went by what Andrew wrote above:

>>>Laurie Davies sent along this non-Bayesian analysis he did that uses residuals and hypothesis testing.<<<

Corey: No, this must be discussed with the owner of the data.

Consider data y with covariate x and minimize sum((y_i - a - b*x_i)^2) over a and b. Now replace x by i.i.d. N(0,1) random variables and do this 10000 times. In 4832 of the cases the i.i.d. random covariates result in a smaller sum of squared residuals than does x. I would now claim that x is no better than random noise and report this to the owner of the data. If the random covariates were better in only 10 cases I would report this. Based on this information the owner can decide whether to include x or not. Note that there is no assumption that y = a + b*x + epsilon. This is a big advantage as there is no need to model epsilon as i.i.d. N(0,sigma^2) or anything else.
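The procedure in this paragraph can be sketched numerically; the following is an illustrative Python translation (not Davies’s actual R code), with hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def rss(y, x):
    """Least-squares residual sum of squares for y ~ a + b*x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def noise_pvalue(y, x, nsim=10000):
    """Fraction of i.i.d. N(0,1) replacement covariates achieving a smaller
    RSS than the real covariate x (the p-value Davies describes)."""
    rss_x = rss(y, x)
    wins = sum(rss(y, rng.standard_normal(len(y))) < rss_x for _ in range(nsim))
    return wins / nsim

# A covariate that genuinely drives y should beat random noise essentially always:
n = 200
x = rng.standard_normal(n)
y = 2.0 * x + 0.1 * rng.standard_normal(n)
print(noise_pvalue(y, x, nsim=500))  # → 0.0
```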

In the case of the births data the choice p=0.01 results in 219 covariates. The remaining 14392 are in the sense made precise no better than random noise. Taking p=0.001 as a cut-off point results in 193 covariates being included. The results in terms of the regression function and residuals are very close. I point out that no use is made of orthogonality. If you give me 14610 other covariates of the same size, n=7305, I can use the same code. That is, I would be able to if I had sufficient memory available, which I haven’t.

The theory is extremely simple and explicitly takes into account the number of covariates. The R code is short and simple. It requires no simulations or cross validation. In the simulations and real data sets I have considered it has out-performed lasso every time but one (where lasso was slightly better), and at times by a considerable margin.

Finally I now have a two-stage method which effectively gives the correct solution in 28.5 minutes rather than 17.5 hours.

Laurie:

“Now replace x by i.i.d. N(0,1) random variables and do this 10000 times. In 4832 of the cases the i.i.d. random covariates result in a smaller sum of squared residuals than does x.”

This sounds a LOT like the ABC method in Bayesian analysis. In this context, N(0,1) plays the role of a prior over x and the sum of squares plays the role of a Gaussian likelihood.

Of course, you’re talking about replacing *measured* x (a covariate) with instead randomly generated x, instead of doing inference on purely unobserved parameters (a,b for example). But I imagine that there’s some interpretation in terms of measurement errors on the x which could leave you with mathematical equivalence between this method and a straightforward Bayesian computation for a specific kind of model.

For example, if you’re doing model comparison between models where x = 0 + measurement_error vs. x = x* + smaller_measurement_error, where x*, the “true” value of the x parameter, is assumed to be near the observed x.

ABC is “sample from the joint prior, discard samples for which the sampled data does not equal (is not close enough to) the actual observed data”; but the joint prior for the ABC sampling wouldn’t include the covariates because the errors are conditionally independent of them. (Well, you could include them, but it would convey no benefit and would drastically lower your sample-retention probability.)

Oh, that’s why you’re talking about errors-in-variables. Took me two reads to get it.

Yes, if you treat the covariate as “really just a vector of zeros with all noise” vs “really a vector of signal with a little noise” and you do an inference on these two possibilities via ABC you’d get something that seems similar… see below where I flesh that out.

I read the paper; I should have said so to save you the trouble of rehashing the method.

“In the case of the births data the choice p=0.01 results in 219 covariates. The remaining 14392 are in the sense made precise no better than random noise. Taking p=0.001 as a cut-off point results in 193 covariates being included.”

So there are 26 covariates for which the figure of merit for “optimizes the objective better than white noise” is in this gray zone between 10^-3 and 10^-2. What, if anything, can we say about these covariates (beyond the aforementioned plain fact that they’re in the gray zone)? …I think one could use cross-validation (or even just plain old validation if the data set is large enough) to pick a cut-off that leads to the best empirical out-of-sample prediction error (or minimum out-of-sample objective function or whatever). It would be a pain in the hindquarters, but at least it would automatically tune the cut-off in a principled way.
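A sketch of that validation idea (hypothetical helper; it assumes per-covariate p-values from the Davies procedure are already in hand, and uses plain held-out squared error):

```python
import numpy as np

def choose_cutoff(y_train, X_train, y_val, X_val, pvals,
                  cutoffs=(1e-4, 1e-3, 1e-2, 5e-2)):
    """Pick the p-value cut-off whose retained covariates give the smallest
    squared prediction error on a held-out set."""
    best_cutoff, best_err = None, np.inf
    for c in cutoffs:
        keep = pvals < c
        if not keep.any():
            continue  # no covariates survive this cut-off
        beta, *_ = np.linalg.lstsq(X_train[:, keep], y_train, rcond=None)
        err = np.sum((y_val - X_val[:, keep] @ beta) ** 2)
        if err < best_err:
            best_cutoff, best_err = c, err
    return best_cutoff
```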

Daniel:

You are in a better position to show the equivalence than I am should they indeed be equivalent. For the births data I assume that this would require placing a Gaussian prior over 14602 covariates each of length 7305. My feeling is that this would cause trouble. Furthermore I assume you would require a prior over the error term. The method is a stepwise method and very easy to implement. The least squares version takes about ten minutes, the approximate robustified version 17 minutes. The robustification is important here as dates of interest are recognized as outliers.

One interpretation of the method is to replace sparsity “b_j=0” by “x_j is no better than Gaussian noise”. For what it is worth there is a theorem which says that if b_j=0 then under appropriate assumptions the corresponding x_j will be no better than random noise. Is this relevant to your second remark?

Laurie: consider the following textual description of a Bayesian model.

I have some data D and a general purpose covariate X measured with some noise (there is always SOME noise, occasionally it is small enough to ignore), and a general purpose model f parameterized by several parameters A, such that D = f(A,X) + Measurement_NoiseD under my model. This general purpose equation includes linear regression, spectral decomposition, whatever since A and X can be a vector and f can be a very complicated formula involving summation over indices and nonlinear transformations of X and all that.

My knowledge is that if A and X take on their correct values, then some statistic S(D,f(A,X)) takes on a value within some specific region of the space. For example, if S is the sum of squares, then it should be not too small (if I feel that Measurement_NoiseD is nonzero) but also not too large (for example, if Measurement_NoiseD has some credible range then the total noise shouldn’t be much bigger than what is credible for the sum of the measurement noise values). Using this I could assign numbers representing the credibility of various S values given A and X, which I can call p(S(D,f(A,X))| A,X*)

I can now ask the question: suppose that I know that either X = 0 + noise1 or X = X* + noise2, so that my measured X are either “pure noise” or “a noisy measurement of some nonzero X*”. Which hypothesis has high posterior probability under a Bayesian model, that X* = 0 or that X* is nonzero?

We can break down the prior on X* into a point mass at X*=0 and a density for nonzero X* called p(X*) and then write our model as a mixture of the two cases.

p(S(D,f(A,X(0))) | A,X*=0)p(A,X*=0) + p(S(D,f(A,X(X*)))| A,X*)p(A,X*)

Now, we can proceed to calculate a posterior sample from this model and ask what fraction of the sample has X* not equal to zero. We can calculate this sample using ABC methods in which we choose which term we’ll be sampling from; when X* = 0 we generate X from whatever noise distribution you had mentioned and then accept or reject the sample based on p(S(D,f(A,X(0)))) using the randomly generated value of X(0), and when X* is nonzero we follow a similar procedure. We’ll assume infinite computational ability so we don’t have to deal with efficiency issues; it’s purely conceptual.

Now, supposing that the X* is “really” nonzero then we’ll wind up with a posterior sample having mostly nonzero X* values and we’ll reject the idea that we can treat the X data “as if” it were a purely random noise according to the Laurie Davies noise model whatever that was.
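A toy numerical version of this ABC mixture calculation (the priors, tolerance, and data here are all made up for illustration, not a fit to the births data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data where y really is driven by the covariate signal.
n = 50
x_signal = np.sin(np.linspace(0, 4 * np.pi, n))
y = 2.0 * x_signal + 0.2 * rng.standard_normal(n)

def abc_p_signal(y, x_signal, n_draws=20000, keep_frac=0.05):
    """Draw (indicator, coefficient) from the prior mixture; retain the draws
    with the smallest sum of squares; report the retained fraction that used
    the signal rather than white noise (approximates P(signal | data))."""
    draws = []
    for _ in range(n_draws):
        is_signal = rng.random() < 0.5            # prior P(signal) = 0.5
        x = x_signal if is_signal else rng.standard_normal(len(y))
        a = rng.normal(0.0, 3.0)                  # prior on the coefficient
        sse = np.sum((y - a * x) ** 2)
        draws.append((is_signal, sse))
    draws.sort(key=lambda d: d[1])                # keep the best-fitting draws
    top = draws[: max(1, int(n_draws * keep_frac))]
    return sum(s for s, _ in top) / len(top)

p_signal = abc_p_signal(y, x_signal)  # near 1: the data favor the signal
```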

As far as I can see, instead of this, what you are doing is generating the random X(0) based on some assumed noise model and then *optimizing* the A vector, and treating this optimum as if it were a summary of the conditional posterior distribution.

So, due to the optimization step it seems as if my Bayesian posterior via ABC is a more general procedure, but it seems possible to consider your procedure as a kind of Maximum A Posteriori computational approximation of the more general Bayesian calculation.

When you say something like “14392 are in some sense no better than random noise” I’m interpreting something close to “the probability of each of these 14392 covariate X values having a truly nonzero value X* is small”

Without getting deeply into detailed correspondences, there is at least a strong flavor of a Bayesian model in which the likelihood is specified over a measure of “goodness of fit” p(S(D,f(A,X)) | A,X) and the prior is specified in terms of the plausibility that a measured covariate is either “signal with noise” or “pure noise” p(X | X*)p(X*)

This is very clever! Once again we find that if there’s something worth doing, Bayesians can steal it and claim it was Bayesian all along. ;-)

Pretty sure Cox’s theorem says “if it makes logical sense it’s Bayesian” when you boil it down ;-)

Better still, you can see what’s not quite sensible in such a prior and data-model specification that sets out the same statistical manipulations, and likely get something that is less wrong.

Corey:

Not everybody seems to have read it, particularly not the arXiv paper, so regard it as general but maybe not pointless rehashing.

With p=0.001 the last covariates to be included, with a p-value just less than 0.001, are the pair sin(pi*3694/7305) and cos(pi*3694/7305) with coefficients 7.68 and -0.14 respectively. This corresponds to an oscillation with a period of four days and an amplitude of 7. I would classify this as being of statistical significance (p < 0.001) but of no practical significance. The penultimate covariates are the pair sin(pi*10/7305) and cos(pi*10/7305) with coefficients 41.6 and -55.8 respectively. This corresponds to an oscillation with a period of four years and an amplitude of 68. This looks to me to be weak, but I would prefer to talk to a demographer before deciding about its practical significance. In other words, I would prefer to talk to someone about whether to include covariates or not. The output does contain the p-values for all pairs, so I guess it is of advantage to choose a higher p-value for the cut-off and then discuss.

Daniel: I am not sure that I understand you. My covariates are without error. They are trigonometric polynomials of the form sin(pi*j*(1:7305)/7305) and cos(pi*j*(1:7305)/7305) for j=1,…,7304.

“Daniel: I am not sure that I understand you. My covariates are without error. They are trigonometric polynomials of the form sin(pi*j*(1:7305)/7305) and cos(pi*j*(1:7305)/7305) for j=1,…,7304.”

There is some ambiguity because in the model we can only “identify” the product a_i * x_i. Since you identify the coefficient “a” with the value of a at the optimum, we’re left with the question of whether x is the unit trigonometric function, or is 0. So that’s a kind of discrete error distribution where the two possibilities are involved. In a more general scenario where we don’t identify the “a” uniquely via an optimization step, we could talk about the error in the vector a_j * x_j as arising because for example the a_j have a particular value which is unknown and might be exactly 0, and in that context we don’t need to have any distribution over the x values.

In the end, what matters is that there is uncertainty regarding how to include information about x_j and this uncertainty is represented the same way in Bayesian probability whether the x_j are measured with noise or are calculated exactly but there is uncertainty regarding their coefficient.

Another way to put this is that in Bayesian calculations *modeling error* and *measurement error* are quantified using the same algebra of probability, so whether something is “measurement error” or “modeling error” doesn’t alter the mathematics.

Daniel:

I still do not understand. You write

There is some ambiguity because in the model we can only “identify” the product a_i * x_i.

but I do not have a model. There is no mention of an error term and consequently no modelling of such. My covariates are non-zero and clearly without error as I calculated them. I can perform the procedure on deterministic data, no error anywhere to be seen, and still make the claim that this particular covariate is no better than random noise.

Laurie, I’m trying to draw a logical or mathematical relationship between your interpretation and a Bayesian interpretation; that is, to show how a Bayesian who was building a certain model would arrive at the same or similar mathematical representation. So the “model” will come in when I try to think like a Bayesian who is building a model, and show that this Bayesian thinking leads to some kind of similar mathematics as your procedure.

Looking more carefully at your procedure, I see you consider functions x_i(t) which are sinusoidal functions vs sample paths where each x(t) comes from a normal distribution (white noise).

One interpretation then is that you provide a “prior” over functions that looks like:

p(x(t)) = p0 {if x(t) = sinusoid of a particular type} + p1 * product(normal(0,1),t,1,N)

That is, you put a “point mass” on the sinusoid and then demand that if it’s not a sinusoid it should be a white noise sample.

Suppose I do Bayesian inference on this model. Then if the sinusoidal function results in much better fit compared to any of the sample paths of the white noise in terms of much higher p(S(D,f)) where S is either sum of squares or some kind of “robust” sum of transformed errors or whatever, then the posterior probability that the function used in this part of the fit should be the sinusoid becomes very high, and the coefficient to use is chosen by your procedure to be the Maximum a Posteriori value for the coefficients.

If on the other hand it seems that the particular sinusoid of interest does not have very high posterior probability under this Bayesian interpretation of the calculation, then you decide to simply drop the term.

However, when you drop the term, you have a symmetry property of sorts that accumulates probability for the white noise. Imagine x1 … xN are functions which are either sinusoids of various frequencies etc or white noise samples, and a*WN(t) is a pure white noise sample of unknown magnitude.

The Bayesian considers the model

y = x1(t) + x2(t) + x3(t) + … + xN(t) + a*WN(t)

which implies

y – (x1(t) + x2(t) + x3(t) + … + xN(t)) = a*WN(t)

and induces a probability assignment (a sort of likelihood):

p((y - (x1(t) + x2(t) + x3(t) + … + xN(t)))/a) = exp(-SSE(t))

The sum of N white noises is also white noise… so when you test to see whether there is posterior probability that supports x_i being a particular sinusoid and then “drop” it if there isn’t, you are shifting the weight to the WN(t) process, altering the probability of a particular value of a. In the same way, if you regress y = a*(1+f(x)) + b*(1+g(x)) + c vs. y = a*f(x) + b*g(x) + d, you will find that a+b+c = d: the constant terms all accumulate together. So too do all the white noises accumulate together to produce some other white noise.
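That constant-accumulation point is easy to check numerically; a quick sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
f = rng.standard_normal(n)
g = rng.standard_normal(n)
y = 1.5 * (1 + f) + 0.7 * (1 + g) + 0.3 + 0.05 * rng.standard_normal(n)

# Parameterization 1: y = a*(1+f) + b*(1+g) + c
X1 = np.column_stack([1 + f, 1 + g, np.ones(n)])
a1, b1, c1 = np.linalg.lstsq(X1, y, rcond=None)[0]

# Parameterization 2: y = a*f + b*g + d
X2 = np.column_stack([f, g, np.ones(n)])
a2, b2, d2 = np.linalg.lstsq(X2, y, rcond=None)[0]

# Same column space, so same fit; the constants accumulate: a + b + c = d.
print(np.allclose([a1, b1, a1 + b1 + c1], [a2, b2, d2]))  # → True
```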

If you robustify it by transforming the sum of squared errors, it really doesn’t change the basic story.

I am late to this conversation, but I was trying to find the results of the fit of the kernel hyperparameters from the original birthdays demo here: http://research.cs.aalto.fi/pml/software/gpstuff/demo_births.shtml.

I could not find them posted anywhere, and they were not mentioned in the BDA3 book. Could somebody please point me to where I can find those results without running the GPstuff program?

If I were to run the GPstuff program, how long would I be expected to wait? It seems to run for a long time, as it is quite a big model with a large amount of data.

Thanks a lot!

Christian