Tony Hu writes:

Could you please help look at an example of my model fitting? I used a very flexible model—Bayesian multivariate adaptive regression spline. The result is as follows:

I fitted the corn yield data with multiple predictors for counties of the US (the figure shows results of Belmont County in Ohio). My advisor told me that there was something wrong with the model based on two problems exposed in the figure:

1. As marked by ellipses in the figure, the model behaves opposite to the data (e.g., the data points are going down while the model is going up, or vice versa). Such behavior happens not only in this county but also in some other counties. I understand that a fitted model does not have to go through every data point, but shouldn’t it follow the trend that the data points show?

2. As marked by arrows, the model sometimes can capture the very low yield, but sometimes it does not. Is that normal?

I can’t tell what was wrong with the model. Do you have any thoughts?

My reply:

I don’t know what’s wrong with the model! I’ve never fit a multivariate adaptive regression spline, myself. But I thought this would be a great excuse to talk more about workflow.

The above graph does look problematic. Something’s definitely wrong. The ellipse on the left side of the plot doesn’t look so bad—it just looks like the fitted curve is doing some smoothing—but the ellipse on the right looks wrong. And the cases where the fitted curve is more extreme than the data—that makes no sense at all. Sometimes I’ve seen time series models overshoot the data; for example, random walk models “want” to be quadratic curves, and this can sometimes cause an estimated curve to have extreme peaks—but nothing like that seems to be going on here. One possibility is that the hierarchical model is estimating these peaks from other counties and partially pooling.

So, what to do? Take the model apart. Fit the model to subsets of the data. Fit the model to simulated data where you know the true parameter values. Fit the model in Stan so you can play around with it. If you’re concerned about oversmoothing across counties, increase the county-level variance parameter. Basically, you’ll have to build some scaffolding.
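The fake-data step above is worth making concrete. A minimal sketch of the idea, with an ordinary least-squares trend standing in for the actual BMARS fit (the data, the “true” parameter values, and the model here are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate yields from a known linear trend, refit, and confirm the
# procedure recovers the true parameters. With a real BMARS fit you'd
# substitute the actual model for np.polyfit; this only sketches the
# scaffolding idea of fitting to data with known truth.
years = np.arange(1930, 2020)
true_slope, true_intercept = 1.2, 30.0                  # assumed "truth"
yields = (true_intercept + true_slope * (years - 1930)
          + rng.normal(0, 5, years.size))

slope, intercept = np.polyfit(years - 1930, yields, deg=1)
print(round(slope, 2), round(intercept, 2))             # should be near 1.2, 30.0
```

If the fitting routine can’t recover parameters it generated itself, no amount of staring at Belmont County will help.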

Maybe this is an ignorant guess, but is it possible that some episodic cause is simply missing from the model?

I’d be curious what hierarchy/geographic connection there is if the dataset is all counties in the US. Checking the Wikipedia list of US droughts, it looks like the model is catching the drops for some nationwide droughts (the Dust Bowl in the 1930s, the 1988–89 North American drought) but maybe missing more localized events (e.g., the 1983 Midwest drought). So if you had a hierarchy like nation -> region -> state -> county, I’d check the relative size of the variance components there: is the model only really picking up national trends?

Relatedly: adding lines to a chart like this for the average at each level of the hierarchy (national average, Midwestern average, Ohio average) would help. E.g., in the second ellipse, if you see that the national and statewide averages also drop at that point, then the county is probably just an outlier and the model is pooling toward the other data. But if the national average is low there while the state average is high, then maybe your model is relying too much on national data and not enough on state data (e.g., are the priors at the state level too tight?).
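Computing those hierarchy-level averages is a one-liner per level once the data are in long format. A sketch with pandas, where the table, column names, and values are all hypothetical:

```python
import pandas as pd

# Hypothetical long-format yield table; column names are assumptions.
df = pd.DataFrame({
    "year":   [1936, 1936, 1936, 1937, 1937, 1937],
    "state":  ["OH", "OH", "IA", "OH", "OH", "IA"],
    "county": ["Belmont", "Franklin", "Polk", "Belmont", "Franklin", "Polk"],
    "yield":  [18.0, 22.0, 25.0, 30.0, 33.0, 35.0],
})

national = df.groupby("year")["yield"].mean()             # national avg per year
state    = df.groupby(["state", "year"])["yield"].mean()  # state avg per year

# Overlay these on the county plot: a dip shared by all levels is a real
# national event; a dip only at the county level is a local outlier.
print(national.loc[1936], state.loc[("OH", 1936)])
```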

I’d also check the predictor variable values for the dates in question and see if any of the values look funky. Could just be data entry error.

“I’d also check the predictor variable values for the dates in question and see if any of the values look funky.”

Well something in those values has to be funky, doesn’t it? In this case I think a data entry error is less likely than a funky correlation that blows up with an unusually high or low input value.

To my eye, there seem to be several places showing some kind of numerical error, perhaps from ill-conditioned data: those two big dips, not in the areas marked by the ellipses but nearby. Big dips like that are often a strong clue that there are numerical errors in the underlying routine. I’ve seen that kind of thing elsewhere, and not just in regression splines.

Another observation is that a regression spline would normally fit the data with a much smoother curve. The fact that this one is so irregular, especially with all those smaller dips, suggests that something just isn’t right. If the routine has a smoothing or tension parameter, it’s probably set far from a good value.

If this is a home-grown spline smoother, the problem is almost certainly in the numerical techniques, such as an ill-conditioned matrix or insufficient precision. If the spline routine is from a well-known library, it’s probably a fitting-parameter issue.

If this were my data, I would first fit it with a LOWESS procedure, just to make sure what the basic data looks like and that there are no surprises. Then if I wanted to use a spline fit, I’d try to adjust the parameters to get it to look much like the LOWESS fit.

After that, if I thought that the spline might be able to show something that the LOWESS couldn’t, I’d start changing the fit parameters to allow less smoothing and more local variation. But I would never accept the result we are shown without a lot of convincing that its features showed anything “real”.
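A LOWESS fit is easy to try as that first sanity check. The `statsmodels` package provides one (`statsmodels.nonparametric.smoothers_lowess.lowess`); to show the idea without that dependency, here is a minimal hand-rolled sketch of the core algorithm (tricube-weighted local linear fits; the real thing adds robustness iterations):

```python
import numpy as np

def lowess_sketch(x, y, frac=0.5):
    """Toy LOWESS: a tricube-weighted local linear fit at each point.
    `frac` is the smoothing parameter -- the fraction of the data used
    in each local neighborhood."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))          # neighborhood size
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        h = np.sort(d)[k - 1] or 1.0            # bandwidth = k-th nearest distance
        w = np.clip(1 - (d / h) ** 3, 0, 1) ** 3  # tricube weights
        # np.polyfit weights the unsquared residuals, so pass sqrt(w).
        b, a = np.polyfit(x, y, 1, w=np.sqrt(w))
        fitted[i] = a + b * x[i]
    return fitted
```

Sweeping `frac` up and down is exactly the “less smoothing, more local variation” experiment suggested above; whatever smoothing parameter the spline routine exposes plays the same role.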

I’m not sure what’s “Bayesian” about this. We see a single trace; a posterior distribution should show **many** trace samples so we can see the range of variation. Looking at a single trace is not very meaningful.
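Concretely: from a matrix of posterior draws of the fitted curve (rows are draws, columns are years), you can plot a thinned subset of rows as faint “spaghetti” lines, or summarize them with pointwise bands. A sketch with simulated stand-in draws (a real fit would supply this matrix from the sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for posterior output: 500 draws of the fitted curve at 40 years.
# With a real Bayesian MARS fit these rows would come from the sampler.
draws = 50.0 + rng.normal(0, 2.0, size=(500, 40)).cumsum(axis=1) * 0.1

median = np.median(draws, axis=0)
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)  # pointwise 95% band

# Plot many rows of `draws` as faint lines plus the (lo, hi) band instead
# of one trace: a *narrow* band around a misfit is the real red flag.
print(median.shape, lo.shape)
```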

1) The droughts noted by @Dan seem relevant. Incorporating weather data from the station nearest to each county would accurately express these effects. Was this done or was a more global measure used? One source for historical zip code level weather data is here: https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table

2) Currently, there are about 3,000 counties in the US, many of which will show spatial autocorrelation. Does the adaptive spline model incorporate these effects? If not, the resulting parameters will be biased. One approach to incorporating spatial proximity is summarized in “Interpretation: the final spatial frontier,” https://www.cambridge.org/core/journals/political-science-research-and-methods/article/interpretation-the-final-spatial-frontier/02ECB1938AC4BABF9A2FBF37B7E6EDA7

3) Anomalies for a single county should not be a surprise or even a problem. What was the overall fit like?

Overfitting. Train the model on (for example) half the counties to choose the parameters/coefficients, then see how it performs on the other half. When you instead optimize for out-of-sample performance, the fitted curves will be much smoother.

In fact, your model seems only moderately flexible; otherwise it would fit the data exactly.

Another thing: you should use the last 20% or so of the years as a holdout. Do not use data from the future to predict the past, because you will overestimate the predictive skill of the model.
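The temporal holdout is simple to set up: split by year, never randomly, so the training set contains no future information. A sketch with an invented series and a linear trend standing in for the actual model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical yearly series; hold out the last 20% of years so the
# model never sees the future it is asked to predict.
years = np.arange(1950, 2020)
yields = 30 + 0.8 * (years - 1950) + rng.normal(0, 4, years.size)

cut = int(0.8 * years.size)                 # index of the holdout boundary
train_x, test_x = years[:cut], years[cut:]
train_y, test_y = yields[:cut], yields[cut:]

coef = np.polyfit(train_x, train_y, deg=1)  # any model in place of polyfit
pred = np.polyval(coef, test_x)
rmse = np.sqrt(np.mean((pred - test_y) ** 2))
print(cut, round(rmse, 2))
```

The county-split validation suggested above works the same way, except the split is over counties rather than years.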

I’ll add a comment as a Bayesian MARS user/developer (e.g., R package BASS). I don’t read this as having anything explicitly hierarchical about it. There seems to be a set of predictors for each county, and a time series response for each county. What modeling choice you make next depends on what purpose you have for your model.

For instance, if you want to predict the complete time series response for a new county, you might treat this as a multivariate response problem (perhaps including time as another predictor variable, since BMARS is pretty flexible).

If your purpose is to forecast future yield for any of your training counties, you probably want to make your BMARS into a time series model, perhaps by including lags as predictors (for instance, https://doi.org/10.1080/01621459.1991.10475126).
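Building those lagged predictors amounts to turning the series into a design matrix where each row holds the previous few years’ yields. A minimal sketch of that construction (the function name and interface are my own, for illustration):

```python
import numpy as np

def add_lags(y, n_lags):
    """Build a lag design matrix: row t holds y[t-1], ..., y[t-n_lags],
    and the matching target vector is y[n_lags:]. These columns would be
    appended to the county-level predictors before fitting."""
    y = np.asarray(y, float)
    X = np.column_stack([y[n_lags - k - 1 : len(y) - k - 1]
                         for k in range(n_lags)])
    return X, y[n_lags:]
```

For example, `add_lags([0, 1, 2, 3, 4], 2)` pairs the target `[2, 3, 4]` with rows `[1, 0]`, `[2, 1]`, `[3, 2]` (lag-1 then lag-2).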

In the multivariate response case, it would not be uncommon to see behavior like that shown in the plot—you might even say that that prediction is quite good, given the complexity of what you’re trying to predict. In the time series case, if the above plot were one step ahead predictions, I would say that this indicates that some of the temporal behavior is not easy to explain with these county-level predictors and lags, and is possibly just noise. A key step is to evaluate the uncertainty in your prediction, which should be easy to get for most Bayesian models. If the uncertainty around your ellipses is small, that would indicate something else is wrong.

So I agree, all of this is very much workflow, and workflow very much depends on the end goal.

This doesn’t strike me as a particularly bad model. Yes, there might be something wrong with it, but it might also be a perfectly good model.

Crop yields are subject to a lot of one-off, exogenous variables: insect infestations, badly timed frosts, or low rainfall at exactly the wrong time. These factors are pretty much i.i.d. from year to year. Almost all these factors are bad, so the fact that the outlier spikes are all downward seems right (harvests are rarely astonishingly good). Sometimes these catastrophes are correlated with other explanatory variables and sometimes they aren’t, so I’m not surprised that sometimes the model captures them and sometimes it doesn’t.

I think we are being tricked into thinking this is a bad model because it is presented as a time series with the points connected. But the catastrophes are largely i.i.d. from year to year and so don’t fit the time-series paradigm well.

If you de-emphasized the time-series aspect, you might get a better feeling about the model. De-trend the series, then plot it as unconnected dots. Yes, there is a time-series trend as agricultural efficiency has improved over time, but one-off events are driving the spikes, so it may be better to think of the data as mostly pooled across all the years, with perhaps some minor autocorrelation.
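The de-trend-and-scatter idea above is a few lines: fit a simple trend, subtract it, and plot the residuals without connecting lines. A sketch on an invented series with three one-off “catastrophe” years:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical series: rising trend plus occasional one-off bad years.
years = np.arange(1930, 2020).astype(float)
yields = 20 + 0.9 * (years - 1930) + rng.normal(0, 3, years.size)
yields[[6, 40, 58]] -= 25                        # three catastrophe years

slope, intercept = np.polyfit(years, yields, 1)  # simple linear de-trend
resid = yields - (intercept + slope * years)

# plt.plot(years, resid, "o")  -- unconnected dots, no line between years:
# the downward outliers then read as i.i.d. shocks, not curve features.
print(int(resid.argmin()))
```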

+1

This still doesn’t tell us what went “wrong” with the model in the place indicated by the ellipse on the plot, but I’m not sure there’s anything to explain there, either. Hu didn’t tell us much about the model, but isn’t it possible that sometimes there’s a good harvest or a bad harvest for reasons not included in the model? OK, sure, the model predicts that the harvest should have been good in 1936, but instead it was bad. If 1936 is indeed the bad year, one possibility is that the explanatory variables (whatever they are) weren’t sufficient to capture the effects of the 1936 North American heat wave.

Or perhaps there are no explanatory variables, and this is purely a curve-fitting exercise? If so, then for one thing I’m not sure what the point of it would be. If you’re just trying to smooth the data, you can do a rolling median or a rolling average or ‘supersmoother’ or many other options…or a spline is fine too, but for most purposes for which I would use a spline I would not want it to follow the data extremely closely. If I want something that follows the data extremely closely, I’ll just use the data! Nothing follows the data closer than that!

Good points

> Or perhaps there are no explanatory variables, and this is purely a curve-fitting exercise?

He says that he “fitted the corn yield data with multiple predictors for counties of the US” so it doesn’t seem to be just curve-fitting (and Andrew’s remark that “the cases where the fitted curve is more extreme than the data—that makes no sense at all” doesn’t seem applicable).

Thinking about this a little more, the i.i.d. nature of the natural catastrophes may be messing up the spline, which suggests the model is a bad model.

A late frost in one year tells us little or nothing about crop yields in the years before or after, but the spline will try to spread the drop in yield due to the frost into adjacent years. This is bad. In that case, treating the data as a time series is the wrong model and can introduce errors.

The model needs to allow one-time catastrophes that have no correlation with yields in adjacent years.

There aren’t enough details about the model to figure out what, if anything, is wrong. What exactly does “multiple predictors for counties” mean? Does it mean that you are fitting a spline in time with the county basically as a random effect? Or does it mean that you have a model based on covariates from the different counties? In the latter case I don’t think this is terribly problematic; there are just going to be effects on the yield that your model cannot capture from the specified covariates.