Oswaldo Melo writes:

I have learned many curve fitting models in the past, including their technical and mathematical details. Now I have been working on real-world problems and I face a great shortcoming: which method to use.

As an example, I have to predict the demand of a product. I have a time series collected over the last 8 years: a simple set of (x,y) data giving the demand for a product in a certain week. I have this for 9 products. And to continue the study, I must predict the demand of each product for the coming years.

Looks easy enough, right? Since I do not have the probability distribution of the data, just use a non-parametric curve fitting algorithm. But which one? Kernel smoothing? B-splines? Wavelets? Symbolic regression? What about Fourier analysis? Neural networks? Random forests?

There are dozens of methods that I could use. But which one has better performance remains a mystery. I tried to read many articles in which the authors make predictions based on a time series, and in most it looks like the choice was completely arbitrary. They would say: “now we will fit a curve to the data using multivariate adaptive regression splines.” But nowhere is it explained why he used such a method instead of, let’s say, kernel regression or Fourier analysis or a neural network.

I am aware of cross-validation. But am I supposed to try all the dozen methods out there, cross-validate all of them, and see which one performs better? Can cross-validation even be used for all methods – I am not sure. I have mostly seen cross-validation being used within a single method, never between a lot of methods.

I could not find anything in the literature that answers such a simple question: “Which curve fitting model should I use?”

These are good questions. Here are my responses, in no particular order:

1. What is most important about a statistical model is not what it does with the data but, rather, what data it uses. You want to use a model that can take advantage of all the data you have.

2. In your setting with structured time series data, I’d use a multilevel model with coefficients that vary by product and by time. You may well have other structure in your data that you haven’t even mentioned yet, for example demand as broken down by geography or demographic sectors of your consumers; also the time dimension has structure, with different things happening at different times of year. If you want a nonparametric curve fit, you could try a Gaussian process, which plays well with Bayesian multilevel models.

3. Cross-validation is fine but it’s just one more statistical method. To put it another way, if you estimate a parameter or pick a method using cross-validation, it’s still just an estimate. Just because something performs well in cross-validation, it doesn’t mean it’s the right answer. It doesn’t even mean it will predict well for new data.

4. There are lots of ways to solve a problem. The choice of method to use will depend on what information you want to include in your model, and also what sorts of extrapolations you’ll want to use it for.

This sounds like a classic case of algorithm vs data (is it better to have more data or a better algorithm?). And, the answer is both. If you have weekly data for 8 years and for 9 products, I’d begin with a simple graphical exploration of what the data looks like. Do all 9 products have similar shapes – i.e., do all exhibit seasonality, are the patterns similar, etc.? Before worrying about building a model (and whether it is a multilevel model, whether to break things down geographically, what further detail is in your data, etc.), I think it makes sense to see what the data looks like.

There is plenty of theoretical literature that can be used rather than “curve fitting.” If these are comparable to many consumer durable products, Bass diffusion models are a good place to begin. If they are new products and are still on the rapid diffusion part of the cycle, then a good fitting curve will soon perform very badly, since the slope is likely to flatten pretty soon. On the other hand, if there appears to be a strong seasonal pattern with some stable trend, then more traditional short-term forecasting models would seem appropriate (decompose the series into seasonality and trend and fit some model to the remaining random fluctuations). That brings up another important question – what time frame do you want the forecast to cover?
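For readers who haven’t met the Bass model, here is a minimal Python sketch of its standard closed form for the cumulative fraction of eventual adopters, F(t) = (1 - exp(-(p+q)t)) / (1 + (q/p) exp(-(p+q)t)). The values of p and q below are made up for illustration, not estimated from anyone’s data. The point is only that per-period adoption rises, peaks, and then falls, so a curve fitted on the way up will overshoot later.

```python
import math

def bass_cumulative(t, p, q):
    """Cumulative fraction of eventual adopters at time t under the Bass model."""
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)

def bass_new_adopters(t, p, q):
    """Fraction of eventual adopters who adopt during the period [t, t + 1]."""
    return bass_cumulative(t + 1, p, q) - bass_cumulative(t, p, q)

# Hypothetical coefficients of innovation (p) and imitation (q).
P_INNOVATION = 0.03
Q_IMITATION = 0.38

adoption_by_period = [bass_new_adopters(t, P_INNOVATION, Q_IMITATION) for t in range(30)]
peak_period = max(range(30), key=lambda t: adoption_by_period[t])

print("period with the most new adopters:", peak_period)
print("new adopters in first, peak, last period:",
      round(adoption_by_period[0], 4),
      round(adoption_by_period[peak_period], 4),
      round(adoption_by_period[-1], 4))
```

A straight line or low-order polynomial fitted to the periods before the peak has no way of knowing the peak is coming; the diffusion structure is what supplies that.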

Thinking about all this makes me believe the initial question was posed wrongly. There is no simple answer to the question “what method to use?” At some point the question will become technical, as in “what curve fits best and how do I measure that?” But long before you get to that point are a number of (in my mind) more important questions about the nature of what you are forecasting, how stable the underlying dynamics are (e.g., is a disruptive technology just over the horizon?), what the purpose of the forecast is, etc. etc. Without worrying about these important issues there can be no simple answer to the question of what curve to fit to the data.

“I am aware of cross-validation. But am I supposed to try all the dozen methods out there, cross-validate all of them, and see which one performs better? Can cross-validation even be used for all methods – I am not sure. I have mostly seen cross-validation being used within a single method, never between a lot of methods.”

Check out the SuperLearner literature. e.g. https://cran.r-project.org/web/packages/SuperLearner/index.html

(I’m not endorsing it–haven’t studied it enough to–but this is exactly what they are advocating)

It’s done all the time in machine learning. A real classic in natural language processing related to this is Banko and Brill’s Scaling to Very Very Large Corpora for Natural Language Disambiguation—they compare four techniques (not quite by cross validation, but by sets of test sets) with data size ranging between n=1e5 and n=1e9. What they found is that the best performer on their test wasn’t stable. What they didn’t do was report variation among the ten test sets. All that variation may very well be there just from sampling noise in the test sets.
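As a rough illustration of comparing methods with one shared validation scheme, here is a Python sketch that scores two very simple forecasters on the same rolling-origin folds and reports the spread across folds as well as the mean. Everything in it is an assumption made for the example: the synthetic weekly series, the two forecasters, the 13-week horizon, and the fold layout.

```python
import math
import random
import statistics

def naive_forecast(history, horizon):
    """Repeat the last observed value."""
    return [history[-1]] * horizon

def seasonal_naive_forecast(history, horizon, period=52):
    """Repeat the value observed one seasonal period earlier."""
    return [history[-period + (h % period)] for h in range(horizon)]

def rolling_origin_errors(series, forecaster, first_origin, horizon, step):
    """Mean absolute error of each fold; every fold trains only on its own past."""
    fold_errors = []
    for origin in range(first_origin, len(series) - horizon + 1, step):
        history = series[:origin]
        actual = series[origin:origin + horizon]
        predicted = forecaster(history, horizon)
        fold_errors.append(sum(abs(a - p) for a, p in zip(actual, predicted)) / horizon)
    return fold_errors

# Synthetic stand-in for 8 years of weekly demand (an assumption, not real data).
random.seed(0)
series = [100 + 0.05 * t + 20 * math.sin(2 * math.pi * t / 52) + random.gauss(0, 5)
          for t in range(8 * 52)]

summary = {}
for name, method in [("naive", naive_forecast), ("seasonal naive", seasonal_naive_forecast)]:
    errors = rolling_origin_errors(series, method, first_origin=4 * 52, horizon=13, step=13)
    summary[name] = (statistics.mean(errors), statistics.stdev(errors), len(errors))
    print(f"{name}: mean MAE {summary[name][0]:.2f}, sd across folds {summary[name][1]:.2f}")
```

Looking at the fold-to-fold standard deviation next to the mean is a cheap guard against declaring a winner on what is really test-set noise.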

Whatever method you use, extrapolation is always questionable unless you have some other reasons for thinking that some behavior will continue in the future. Take Fourier analysis. You can approximate any data set you want, as closely as you want, over whatever time interval your data covers. But beyond the end of your data interval, the waveform will repeat. This is not likely to actually happen with your sales data.
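A small Python sketch makes the repetition concrete. It fits a full discrete Fourier series to a made-up, steadily growing series (the numbers are an assumption for illustration), checks that the fit reproduces the data inside the sample, and then evaluates it past the end of the data.

```python
import cmath

def dft(values):
    """Discrete Fourier transform coefficients of a finite series."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
            for k in range(n)]

def fourier_fit(coeffs, t):
    """Evaluate the fitted Fourier series at integer time t (inside or outside the sample)."""
    n = len(coeffs)
    return sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(coeffs)).real / n

# Hypothetical sales that grow by 2 units per period.
sales = [10.0 + 2.0 * t for t in range(16)]
coeffs = dft(sales)

in_sample = [fourier_fit(coeffs, t) for t in range(16)]
beyond = [fourier_fit(coeffs, t) for t in range(16, 20)]

print("max in-sample error:", max(abs(f - s) for f, s in zip(in_sample, sales)))
print("Fourier 'forecast' for periods 16-19:", [round(v, 3) for v in beyond])
print("trend continuation for periods 16-19:", [10.0 + 2.0 * t for t in range(16, 20)])
```

The in-sample fit is essentially exact, yet the values for periods 16 to 19 are just periods 0 to 3 again, nowhere near where the trend would put them.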

OTOH, there may be good reasons to think that seasonal variations will be similar next year to this year. So making seasonal corrections would likely be helpful.

Demand for a product can be affected by things out of your control, like fashion or weather. If you can find some good correlates like that, you can track them as sentinels to warn that your extrapolations may start going off course.

Tom beat me to it: I was going to give about the same advice.

When you say “demand for a product,” you mean (1) people spend (2) money on this (3) stuff. That suggests you should be thinking about (1) demographics, (2) economics, and (3) whatever is special about your product(s). Any one of those has to be a better X than the number on the calendar.

For example, demographics: Do kids buy this? Do people with kids buy it? Single? Elderly? All of those are highly predictable, and you must know who is buying the stuff.

Economics: Do you buy more of these if you’re feeling optimistic about your future income, or fewer?

The product: If you have one, are you more or less likely to buy another one? Do you consume them, or do you keep them, and for how long? Does somebody advertise this stuff?

The moral here is the same as carpentry: Think first, pick up the tool afterwards.

I used to build similar models for my brother’s cafe (building forecasts of sales of various line items, to help reduce waste). Turned out that a simple random effects model as mentioned by Andrew outperformed my more complex efforts. These models can do a great job at capturing “curves” without being difficult to implement. Also, rstanarm does sensible inference about uncertainty, unlike some of the more machine learny methods you mention.

To benchmark your other attempts, perhaps start with this simple model using the rstanarm package and see how you go:

# week_of_year (or month_of_year) stands for a grouping column you would derive from the date
model_fit <- stan_lmer(sales ~ time_trend + (1 + time_trend | product) + (1 + time_trend | week_of_year), data = yourdata)

Where “time_trend” might be some simple transforms (say, a Box-Cox trend or a logistic curve–some visualisation of your data might help) of a linear trend, to capture basic non-linearities. As Kenneth mentioned, you might want to include any exogenous factors that you think affect sales. Or things like advertising spend.

The issue with this model is that your random effects mightn’t be exchangeable. Ask the question: would knowing the value of the time trend (or the exogenous factors, or ad spend) provide information about the week/month of the year? If so, you’ll need to make a correction a la Andrew’s paper with Bafumi.

Or if you want to get non-linear, something like

stan_gamm4(sales ~ s(time_trend, by = product), data = yourdata, random = ~(1 | product))

??!! Has this always been there?

NY resolution: read more documentation.

This is absolutely correct. I’ll just add that the model doesn’t just depend on the context of the data (as Kenneth says above) but also what you are trying to predict. Predicting next year’s sales will usually have a different model than predicting next month’s sales. Predicting sales by product will have a different model than predicting overall sales. And maybe the data in its current form doesn’t support the best model; you have to feature engineer what you need instead. Or perhaps there is a combination of models that will do better (e.g. rather than forecasting revenue, it might be better to have one model predict price and one model predict volume). There are so many possibilities for what is best, just based on the data. The desired end result matters in this decision just as much.

In addition to all of the great points above, I would add this: if multiple approaches appear to be able to solve the problem (which is often the case), go with the one that you understand the best and can best defend. This sounds obvious but often seems to get overshadowed by other concerns.

I’d start simple, and not get too far away from the data for a while.

1. Seasonality. Understand it. Get rid of it — in particular, it will make step 2 easier.

2. Trend: do you have any? Is it similar across products? Does it change a couple of times in the backdata?

2a. If there’s a trend, what seems to drive it? Is it population growth, number of diesel vehicles, or …? This isn’t so much a statistical search as a conceptual one, validated by statistics.

3. Start simple. See what you get with Holt’s model, with cross-validation. See Hyndman’s textbook for examples (and his R forecast package).

Any way you slice it, you are assuming some model of the past can be extended into the future. That’s why skepticism about a forecast is always warranted.
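For anyone who hasn’t seen it, Holt’s method from step 3 is short enough to write out. This is a bare-bones Python sketch of the standard level-and-trend recursions, with the first observation as the initial level and the first difference as the initial trend. The smoothing weights alpha and beta and the toy series are assumptions for illustration; in practice you would tune the weights (for instance by minimizing one-step-ahead errors) or let a forecasting package do it.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear-trend exponential smoothing; returns forecasts 1..horizon steps ahead."""
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead prediction.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Blend the newest change in level with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]

# Toy, already deseasonalized series with a constant slope of 3 per period.
toy_series = [5.0 + 3.0 * t for t in range(10)]
forecasts = holt_forecast(toy_series, alpha=0.5, beta=0.3, horizon=3)
print(forecasts)
```

On a perfectly linear series the method simply continues the line, which is a handy sanity check before pointing it at real data.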

Zbicyclist:

I like your suggestions but let me emphasize that, although you say “not get too far away from the data,” it would be more accurate to say “not get too far away from the data and your prior information.” The data are just a bunch of numbers. Issues such as seasonality, similarity across products, population growth, etc. . . . these all come from prior knowledge.

I’m not just being picky here; this is important because we spend lots of time trying to automate our procedures, either formally with algorithms or informally in textbooks. And when you look at formalizations of statistics, they tend to minimize prior information. Instead you’ll see models chosen entirely from features internal to the data such as sample size, data type (binary, counts, continuous), censoring, etc.

Yes, I think it’s always bad to throw away substantive data about your problem. I mean, think about it, some machine learning spline is going to do the same thing if you feed it some series of numbers whether that data is the temperature inside a gas turbine engine, the sales of manga in Tokyo, the concentration of antihistamine in the liver, the signal to noise ratio on a spaceship comm-link, or the consumption of calories by a professional athlete throughout the Tour-de-France.

If you can’t figure out some substantive information to use that would be different in those different circumstances, you should move on to another field other than mathematical modeling and statistics.

Yes, “not get too far away from the data and your prior information” is a more accurate statement of what I meant.

“I tried to read many articles in which the authors make predictions based on a time series, and in most it looks like the choice was completely arbitrary. They would say: “now we will fit a curve to the data using multivariate adaptive regression splines.” But nowhere is it explained why he used such a method instead of, let’s say, kernel regression or Fourier analysis or a neural network.”

Yes — explanation of why the method was used is so often missing; in many cases, probably because the author didn’t have a very good reason for the choice. This is a sad situation — authors need to think about why they are using a particular method (and several comments above have given advice on what needs to go into the choice) AND they need to explain why they made that choice. Without the reasoning and without the explanation of the reasoning, the result is scientifically just a house of cards.

PS to Andrew: The French Curve looks like a duck. So maybe you are switching to duck pictures instead of cat pictures? Or just enlarging the menagerie? ;~)

The answer is parsimony and structure. You choose the method, which can accommodate whatever underlying structure is present (from prior knowledge) in the most parsimonious manner.

If you have no clue about the underlying dynamics, and you just run a huge cross-validation race between parameterizations, you are screwed. Unless the data is very large (which here it isn’t) and you know for a fact (which you don’t) that the underlying DGP is stable and properly sampled by the available data.

Suppose f(x) is a function of x that is continuous on a finite interval [a,b]. Then there are a gazillion ways to approximate this function. It’s a theorem called the Weierstrass Approximation Theorem that polynomials can approximate this function uniformly (that is, it’s possible to get better than epsilon error everywhere on [a,b] by making the degree of the polynomial sufficiently large). But splines, and Fourier series, and radial basis functions, and lots of other things can approximate this function just fine in practice.

Now, suppose you have a real-world problem where you’ve measured f with errors in a region between [a,b]. The data tell you approximately where your function should be within the interval (i.e., some measure of the errors between f(x) and the data d(x) shouldn’t be too big).

However, what happens for x values outside [a,b] ??

The ONLY way to solve this problem is to have substantive knowledge about how f(x) should behave as it goes outside the interval [a,b]. So for example, if we’re talking about the concentration of a typical drug, f(x) should go to zero as x goes to infinity and your body excretes or metabolizes the drug. The range of rates is determined by substantive information about biochemistry and metabolism. If we’re talking about sales, we know at least they shouldn’t go to infinity in finite time, because there is no such thing as infinite sales. We may know things like historical rates of change, and be able to apply some information about how fast things could change… etc etc. It’s SUBSTANTIVE knowledge that tells you how your function should behave outside the measured range.

The truth is, in many cases, it just doesn’t matter what you use to interpolate through a data rich region, lots of things work well. But, it’s absolutely impossible for a pure function approximation technique to determine how best to predict outside the range of the data.
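Here is a small Python sketch of that contrast. It assumes, purely for illustration, that the true process is a logistic curve bounded between 0 and 1, samples it without noise at five points on [-2, 2], and passes the unique degree-4 polynomial through those points using Lagrange interpolation.

```python
import math

def lagrange_interpolate(xs, ys, x):
    """Value at x of the unique polynomial passing through the points (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def saturating_process(x):
    """Hypothetical 'true' process: a logistic curve that always stays in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]           # the data-rich interval [a, b] = [-2, 2]
ys = [saturating_process(x) for x in xs]

inside_fit = lagrange_interpolate(xs, ys, 0.5)
outside_fit = lagrange_interpolate(xs, ys, 6.0)

print("x = 0.5: fit", round(inside_fit, 4), "truth", round(saturating_process(0.5), 4))
print("x = 6.0: fit", round(outside_fit, 4), "truth", round(saturating_process(6.0), 4))
```

Inside the interval the polynomial is close to the truth; at x = 6 it returns a negative value for a quantity that can never go below zero. Nothing in the five data points could have told the polynomial about the bounds.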

Let’s say the sample maximum is at b and the closest data point is at x’ = b – 2 * epsilon.

We’d like to say something about f(b-epsilon) and f(b+epsilon) but don’t have a data point there. Intuitively, it seems approximation error bounds for the interpolation should be better because we can approach b-epsilon from both b and x’, but b+epsilon only from b. If we assume nothing about f we can’t say anything, but any smoothness condition probably buys us lower approx error for b-epsilon. Not sure whether that can be made precise.

Still, it seems extrapolation shouldn’t be much more error prone than interpolation within a distance equal to the typical distance between points over which you interpolate.

Even in these cases, it’s substantive assumptions about the behavior of the real world process that gives us what we need. For example, suppose we know that the thing we’re measuring can’t change faster than some unknown bounds (that is, it’s Lipschitz but we don’t know the bound on the derivative). Well, we can learn something about the bound from the region where we do have data. So f(b+eps) is bounded by f(b) +- k*eps where k is unknown but its credible range can be learned from the data in [a,b]
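A crude Python sketch of that idea, under loud assumptions: the data are treated as noise-free, the largest slope seen between neighboring points stands in for the unknown Lipschitz constant k, and a made-up safety factor inflates it because the observed maximum can only underestimate the true k. A proper treatment would put a distribution on k rather than a fudge factor.

```python
def largest_observed_slope(xs, ys):
    """Largest absolute slope between consecutive data points."""
    points = list(zip(xs, ys))
    return max(abs((y2 - y1) / (x2 - x1))
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def extrapolation_band(xs, ys, eps, safety_factor=1.5):
    """Interval f(b) +/- k * eps, with k guessed from slopes seen inside the data range."""
    k = safety_factor * largest_observed_slope(xs, ys)
    return ys[-1] - k * eps, ys[-1] + k * eps

# Hypothetical measurements on [a, b] = [0, 3].
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 3.0, 3.5]

lower, upper = extrapolation_band(xs, ys, eps=0.5)
print("f(b + 0.5) assumed to lie in", (lower, upper))
```

The band widens linearly with eps, which is the formal version of saying that a forecast for next week deserves more trust than one for next year.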

Polynomials can approximate the function in [a,b], but if we know that the process itself has an upper and lower bound, f(q) < H and f(q) > L, then polynomials are guaranteed to be an arbitrarily bad approximation outside [a,b], because they grow polynomially for large x.

Yes, given typical regularization assumptions, close to the interval [a,b] we can often do ok. And this applies for lots of cases (i.e., estimate total sales next week from daily data for the last 5 years), but the regularity assumptions are themselves SUBSTANTIVE. There are lots of problem areas where very smooth assumptions don’t apply. For example, a feedback control system for a gas turbine engine. How rapidly can the temperature increase? Well, if your feedback control system opens the throttle to max maybe it can explode in 7 milliseconds? I don’t know. Can we tell the difference between a function that oscillates wildly up and down, and a function that is near constant and is measured with a lot of measurement error? Only via substantive assumptions about the underlying process we’re modeling.