“You should always (always) have a Naive model. It’s the simplest, cleanest, most intuitive way to explain whether your system is at least treading water. And if it is (that’s a big IF), how much better than Naive is it.”

Jonathan Falk points us to this bit from baseball analyst Tangotiger, who writes:

Back around 2002 or so, I [Tango] was getting really (really) tired with all of the baseball forecasting systems coming out of the woodwork, each one proclaiming it was better than the next.

I set out not to be the best, but to be the worst. I needed to create a Naive model, so simple, that we can measure all the forecasting systems against it. And so transparent that anybody could recreate it. . . .

The model was straightforward:

1. limit the data to the last three years, giving more weight to the more recent seasons

2. include an aging component

3. apply a regression amount

That’s it. I basically modeled it the way a baseball fan might look at the back of a baseball card (sorry, yet another dated reference), and come up with a reasonable forecast. Very intuitive. And never, ever, would you get some outlandish or out of character forecast. Remember, I wasn’t trying to be the best. I was just trying to create a system that seemed plausible enough to keep its head above water. The replacement level of forecasting systems.

I don’t get exactly what he’s doing here, but the general principle makes sense to me. It’s related to what we call workflow, or the trail of breadcrumbs: If you have a complicated method, you can understand it by tracing a path back to some easy-to-understand baseline model. The point is not to “reject” the baseline model but to use it as a starting point for improvements.
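
To give a concrete flavor of that kind of baseline, here is a minimal sketch of a Tango-style projection in Python. The 5/4/3 season weights, the league-average ballast, and the crude age adjustment are illustrative assumptions, not Tango’s actual recipe:

```python
# Minimal sketch of a "naive" three-step projection for a rate stat.
# The weights, regression ballast, and age adjustment are illustrative
# assumptions, not Tango's published values.

def naive_projection(seasons, age, league_rate,
                     weights=(5, 4, 3), ballast_pa=1200, peak_age=29):
    """Project a rate stat (e.g., OBP) from up to three past seasons.

    seasons: list of (rate, plate_appearances), most recent season first.
    """
    # 1. Weight the last three seasons, recent seasons counting more.
    num = sum(w * rate * pa for w, (rate, pa) in zip(weights, seasons))
    den = sum(w * pa for w, (_, pa) in zip(weights, seasons))

    # 2. Regress toward the league average by adding "ballast" PA of it.
    num += ballast_pa * league_rate
    den += ballast_pa
    projected = num / den

    # 3. Crude aging adjustment: small boost before the assumed peak
    #    age, small penalty after it.
    return projected * (1 + 0.003 * (peak_age - age))

# Example: three seasons of OBP, most recent first, for a 31-year-old.
print(naive_projection([(0.360, 600), (0.340, 550), (0.320, 500)],
                       age=31, league_rate=0.320))
```

Nothing in a sketch like this is hidden or clever, which is the point: anyone could recreate it from the back of a baseball card.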

Tango continues with his evaluation of his simple baseline model:

Much to my [Tango’s] surprise, it was not the worst. Indeed, it was one of the best. In some years, it actually was the best.

This had the benefit of what I was after: knocking out all those so-called forecasting systems that were really below replacement level. They had no business calling themselves forecasting systems, and especially trying to sell their inferior product to unsuspecting, and hungry, baseball fans.

What was left were forecasting systems that actually were good.

He summarizes:

You should always (always) have a Naive model. It’s the simplest, cleanest, most intuitive way to explain whether your system is at least treading water. And if it is (that’s a big IF), how much better than Naive is it.

Well put. I’d prefer calling it a “baseline” rather than “naive” model, but I agree with his general point, and I also agree with his implicit point that we don’t make this general point often enough when explaining what we’re doing.

A couple of additional points

The only real place that I’d modify Tango’s advice is to say the following: In sports, as in other applications, there are many goals, and you have to be careful not to tie yourself to just one measure of success. For example, he seems to be talking about predicting performance of individual players or teams, but sometimes we have counterfactual questions, not just straight-up predictions. Also, in practice there can be a fuzzy line between a null/naive/baseline model and a fancy model. For example, Tango talks about using up to three years of data, but what if you have a player with just one year of data? Or a player who only had 50 plate appearances last year? What do you do with minor-league stats? Injuries? Etc. I’m not saying you can’t handle these things, just that decisions need to be made, and there’s no sharp distinction between data-processing decisions and what you might call modeling.

Again, this is not a disagreement with Tango’s point, just an exploration of how it can get complicated when real data and real decisions are involved.

19 thoughts on “You should always (always) have a Naive model. It’s the simplest, cleanest, most intuitive way to explain whether your system is at least treading water. And if it is (that’s a big IF), how much better than Naive is it.”

  1. > I’d prefer calling it a “baseline” rather than “naive” model

    Yeah, my impression reading the baseline model was, dang, that’s a pretty complicated model! Even if you do a 50/30/20 weighted average kinda split, it means you need to collect and maintain data for the last three years which could end up being kinda constraining — especially in 2002!

    > tired with all of the baseball forecasting systems coming out of the woodwork

    I’m not sure about the us vs. them thinking tho. I think I’m my worst enemy with complexity a lot of the time, and then worrying about what other people do is kinda a second-order thing (but the details of the situation would matter).

  2. It sounds like his simple model wasn’t overfitting while the more complicated ones were.

    It is really sad to read the applied ML literature; it is almost a complete waste of time due to the lack of proper hold-outs.

    Common approaches like leave-one-out cross validation aren’t even close to acceptable, because:

    1) The analyst ran many of them and tuned hyperparameters/etc based on the results, and

    2) Data collected later often contains info about the data collected earlier, so the holdout always needs to be the data generated after the training data.

    And no you can’t go back and process the training data in a new way once you see an outlier in the hold-out. At least not while expecting to get an accurate assessment of the predictive skill.
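
    To make the second point concrete, here is a minimal sketch of a time-ordered holdout in Python; the DataFrame and its columns are hypothetical:

    ```python
    import pandas as pd

    # Sketch of a time-ordered holdout. The DataFrame and its 'date',
    # feature, and outcome columns are hypothetical.
    def temporal_split(df, cutoff, date_col="date"):
        """Train on everything before the cutoff; hold out everything after.

        Unlike leave-one-out CV, later observations (which may carry
        information about earlier ones) never leak into training.
        """
        df = df.sort_values(date_col)
        train = df[df[date_col] < cutoff]
        holdout = df[df[date_col] >= cutoff]
        return train, holdout

    # Usage sketch: fit and tune only on `train`, score exactly once on
    # `holdout`, and don't go back to reprocess the training data after
    # seeing a holdout outlier.
    # train, holdout = temporal_split(df, cutoff=pd.Timestamp("2021-01-01"))
    ```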

    • ^ All of this.

      Sounds like every other model is overfitting to noise. A common risk for mediocre analysts who fall in love with complex methods. Not a great look for this guy’s peers.

      • I think most really don’t understand the context of what they are doing. Eg, if the question is “what is 21 + 21?”, then the answer is 42. Done.

        But for most problems you are given the “answer” (observations), not the “question”. So:

        21 + 21 = ???

        Is a totally different category of question than:

        ??? = 42

        The former has one solution, while the latter has infinitely many. And that is assuming no uncertainty in the “answer” (observations), which always exists.

        There are equations that would get you the answer 42 but that could not be solved in 1000 trillion years, even with the fastest and most efficient computing devices possible. Actually, there are infinitely many of those. And all are mathematically equivalent to the simple ones like 21 + 21 or 42 + 0, etc.

        That is why the goal is not to come up with an equation that fits the data you have. It is to fit the data you do not have.

  3. I sometimes trot out my pet motto, though usually in the context of programming –

    Start out as simple as possible because it’s only going to get more complicated as time goes on.

  4. Most data competitions (Kaggle, for example) contain a benchmark result, usually using a simple model (frequently it is a Naive Bayes model). Also, it is common practice (though not as common as it should be) to compare time series forecast models with naive forecasts (which are usually forecasting no change), and comparing the forecast model with the naive model over the holdout period. I am agreeing with Falk here, but noting that the practice is not totally uncommon.
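
    For example, a minimal sketch of that kind of comparison, scoring a (hypothetical) model forecast against a no-change naive forecast over the holdout period:

    ```python
    import numpy as np

    def mae(actual, forecast):
        return np.mean(np.abs(np.asarray(actual) - np.asarray(forecast)))

    def naive_vs_model(history, holdout, model_forecast):
        """Score a model against a no-change naive forecast on the holdout.

        history: observed series up to the forecast origin.
        holdout: actual values over the holdout period.
        model_forecast: the model's forecasts for that same period.
        """
        naive_forecast = np.repeat(history[-1], len(holdout))  # "no change"
        return {"naive_mae": mae(holdout, naive_forecast),
                "model_mae": mae(holdout, model_forecast)}

    # A model_mae / naive_mae ratio of 1 or more means the fancier model
    # is not even keeping its head above water.
    print(naive_vs_model(history=[100, 103, 101, 105],
                         holdout=[106, 104, 108],
                         model_forecast=[105, 106, 107]))
    ```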

  5. “Back around 2002 or so, …”

    2002! One would hope that over the past twenty years, people have become more aware of the dangers of overfitting and the virtues of simple models. (Maybe they haven’t, but I was struck by the date on this. I’d care more if it were a recent observation.)

  6. An online columnist covering (American) football used to have to make generic “cut-price” predictions of NFL games. Whichever team had the better season record coming into the week was predicted to be the winner. If the teams had the same record, the home team was predicted to be the winner.

    I think he’d typically pick more than 80% of the games correctly over the course of a season. Not many prognosticators did better than that on average.
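
    That rule fits in a few lines of code; a sketch, with made-up team records:

    ```python
    # Sketch of the columnist's rule, with made-up records:
    # records maps team -> (wins, losses) coming into the week.
    def pick_winner(home_team, away_team, records):
        def win_pct(team):
            wins, losses = records[team]
            return wins / max(wins + losses, 1)

        # Better record wins; a tie goes to the home team.
        return away_team if win_pct(away_team) > win_pct(home_team) else home_team

    print(pick_winner("Packers", "Bears",
                      {"Packers": (3, 1), "Bears": (2, 2)}))
    ```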

  7. “What do you do with minor-league stats? Injuries? Etc. I’m not saying you can’t handle these things…”

    Injuries are a huge problem in baseball forecasting. Bryce Harper was producing exactly as expected until he got hit in the thumb by a pitch, and now he will fall far short of his season projections. The importance of historical contingency makes predicting baseball like predicting the stock market.

    I don’t think tangotiger’s approach suggests overfitting, but rather that a lot of the data used in the fancy models to make the projections has no predictive power. If you add a bunch of parameters with no predictive power to your model, the “outcome space” (not really sure what the correct term is here, there must be one) grows proportionally at the margins, allowing bolder but no better projections. When tangotiger suggests that his model is better because it does not produce “outlandish” results I think this is what he means.

    I suspect that some baseball projection systems mitigate this effect with bodges. By that I mean that the system will not allow a final projection of 120 RBIs for a rookie even if the model shows that; it will just output 90 RBIs, a pure bodge, because the model result is just too darn big.
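
    One way to see the “bolder but no better” effect is with simulated data: predictors with no signal barely change holdout error, but they do spread out the individual predictions. A rough sketch:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, n_noise = 200, 30

    # One predictor with real signal, plus a pile of pure-noise predictors.
    x_real = rng.normal(size=n)
    x_noise = rng.normal(size=(n, n_noise))
    y = 2.0 * x_real + rng.normal(size=n)

    def fit_and_predict(X_train, y_train, X_test):
        beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        return X_test @ beta

    train, test = slice(0, 100), slice(100, None)
    X_small = np.column_stack([np.ones(n), x_real])
    X_big = np.column_stack([np.ones(n), x_real, x_noise])

    for name, X in [("signal only", X_small), ("signal + noise", X_big)]:
        pred = fit_and_predict(X[train], y[train], X[test])
        rmse = np.sqrt(np.mean((y[test] - pred) ** 2))
        print(f"{name}: holdout RMSE {rmse:.2f}, "
              f"spread of predictions {pred.std():.2f}")
    ```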

  8. I have a great example from yesterday and today! I spent all day yesterday coding up a model in Stan that fits what I actually think is going on with the data. I have time series data on electricity prices at a bunch of different locations, for the past six or seven years. There are seasonal patterns but they change from year to year, and there are trends that last a while and then go away. So I’ve got month-of-year effects but they can change, and I have trends but they can change, all tied together with appropriate distributions. It’s not super complicated but it took me several hours to code it and debug it. And this is just the first step in a larger model that will incorporate this one. For debugging I was using a reduced dataset with just a few locations; the full model is something like 6x bigger. So, I got it debugged and was getting results that seemed OK, so I expanded the model to use about half of my full dataset…and it was sampling so slowly it was going to take a whole day to finish! And I’m not even sure it would be fully converged.

    If I was sure it was what I wanted I guess it would be worth waiting a day for results, but as I mentioned this is only part of what I really need and it seems that the full model, once I spend more hours implementing and debugging, would take days to run. Criminy.

    And: I’ve got a meeting with the rest of my group at 1pm today and had really hoped to present some results. It was obvious late last night that I wasn’t going to be able to do that with the model I had. So I went to bed…and then this morning I got up and coded a much much simpler model: month-of-year effects that don’t vary as the years go past, and location effects that are also invariant. It took a few minutes to code and run and look at the results…and I think it’s fine for my purposes. I may not ever return to the Bayesian hierarchical model at all. Don’t get me wrong, the full model, properly implemented, would be better…but not very much better, I think. And the results from the naive model are good enough to tell me that the particular question I was trying to answer can’t be answered precisely with these data anyway. The full model that I want to run would give me better error bounds and slightly better central estimates of the parameters I’m interested in, but the uncertainties would still be very large. For my current purpose, a half-decent estimate of a highly uncertain parameter value is almost as good as the very slightly better estimate I would get from the full model.
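
    For what it’s worth, the stripped-down version is roughly an ordinary regression; a sketch in Python, with hypothetical column names (this is the simple model, not the Stan one):

    ```python
    import statsmodels.formula.api as smf

    # Sketch of the stripped-down version: fixed month-of-year and location
    # effects, no year-to-year variation. The DataFrame and its 'price',
    # 'month', and 'location' columns are hypothetical stand-ins.
    def fit_simple_model(df):
        return smf.ols("price ~ C(month) + C(location)", data=df).fit()

    # fit = fit_simple_model(prices_df)
    # print(fit.params)   # month and location effects
    ```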

    I only wish I had done the simple model first, I would have saved myself a frustrating day of coding and debugging!

    • Lolol this reflects my recent experience as well. Recently I started with what I already thought was a simple model and then ended up with a wildly simpler model.

      I don’t think the time is wasted or whatever. It’s hard to make sense of the counterfactual “had I known the right answer all along” because of the way it bakes in the results.

      Maybe this is more like dealing with mathematical approximations. Is the O(dt) thing good enough for dt == 1e-6? I dunno, we don’t know the real answer (unless we’re in simulated data land), but maybe we can see how different it is from dt == 1e-5!

  9. What he’s describing is essentially deviance testing in regression modeling… compare your model with the “naive” model (straight line, for example). The comparison with the “null” model, like ANOVA, tests whether there is any signal at all, not a very useful test! Well, his “naive” model is slightly more complex than “null”, but in spirit, that’s what he is doing. The fact that one can reject the null model doesn’t mean one has a good model!

    In my experience, the “naive” model can easily do more harm than good. A typical naive model is one that involves random actions: e.g., the software sends digital ads to random people instead of targeted people, and the results are compared to what happens when a targeting model is spun up to pick the ad recipients. Sure, there is some utility in this, but this is akin to comparing an MLB player with a “man on the street”. For most practical applications, the bar is simply too low. This then becomes harmful when the analyst’s only comparison is to this naive (random) model. The excuse for using such a low bar is usually that the situation is far too complicated to employ a realistic reference model.

    If there is a best-of-class model, that should always be the preferred comparison, rather than the naive model.

    If there isn’t a best of class, I much prefer to apply my model to “toy examples”. Create test cases that mimic the real world but simplified to the point that humans can specify the optimal solution (or at least a good enough solution). Run the model on those test cases and see if it gets the pre-meditated solution. In the case of ad targeting, I can create a population consisting of segments with specific behavior. By changing the mix of these segments, I have expectations of what a good model would do in terms of outputs. Does the model’s behavior conform to my expectations?

    I think testing on toy examples gives me more information than comparing my model against “random”. The bonus for doing this is that this process produces intermediate data that can be used to understand how the model failed to find the expected best solution.
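
    Here is a minimal sketch of that kind of toy test for ad targeting; the segments, response rates, and stand-in scoring rule are all made up for illustration:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Toy ad-targeting test: two segments with known response rates, so a
    # sensible targeting model should end up favoring segment "A".
    def make_population(n, share_a=0.3, rate_a=0.10, rate_b=0.02):
        segment = np.where(rng.random(n) < share_a, "A", "B")
        rate = np.where(segment == "A", rate_a, rate_b)
        responded = rng.random(n) < rate
        return segment, responded

    def share_of_top_in_segment_a(scores, segment, k=100):
        """Pre-specified expectation: the top-k scored people should come
        almost entirely from the high-response segment."""
        top = np.argsort(scores)[-k:]
        return np.mean(segment[top] == "A")

    segment, responded = make_population(10_000)
    # Stand-in "model": score each person by their segment's observed rate.
    scores = np.where(segment == "A",
                      responded[segment == "A"].mean(),
                      responded[segment == "B"].mean())
    print(share_of_top_in_segment_a(scores, segment))
    ```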

    • > Create test cases that mimic the real world but simplified to the point that humans can specify the optimal solution

      Good point. In the situation described (showing that some other models are promising more than they can deliver), simulated data seems like it could be a more straightforward way to go than making a baseline model.

      Not that making the baseline model was wrong or anything — sounds like it worked out, but simulating a model seems like it would be simpler than fitting one.

  10. This blog has often covered the benefits of fake-data simulation. My understanding is that that’s usually done with data generated by the model you’re trying to fit, or perhaps with some small tweaks. But do people also more systematically test how well a complex model does on fake-data generated by all of the intermediate models along the trail of breadcrumbs? (I think Section 4.3 of the Bayesian Workflow paper hints at something a little like that.)

    • Cross-fitting is sometimes done to assess the relative flexibility/generalizability of non-nested models (example: https://link.springer.com/article/10.3758/s13423-010-0022-4). The basic idea is to simulate data from one model (B) and then fit it with another (A). To the extent that model A can consistently fit data produced by model B and not vice versa, this indicates the extent to which model A is more flexible than model B.

      I’m not sure if this approach would add much for nested comparisons (except perhaps for very complex models that are otherwise hard to understand). When I first read your comment, I implicitly assumed that intermediate models would tend to be nested under a single encompassing model (e.g., the one including all possible predictors), but this isn’t necessarily true. Many times, one tries models of different types throughout the development process. So maybe we (I) need to be more systematic in tracking how model complexity evolves through development!
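
      For the mechanics, here is a toy cross-fit sketch with a linear and a cubic model (nested, unlike the comparisons in the linked paper, so the asymmetry is the expected one):

      ```python
      import numpy as np

      rng = np.random.default_rng(2)
      x = np.linspace(-1, 1, 200)

      def simulate(model):
          # Model "B" is linear; model "A" is cubic (so A nests B here).
          if model == "B":
              return 1.0 + 2.0 * x + rng.normal(0, 0.3, x.size)
          return 1.0 + 2.0 * x - 4.0 * x**3 + rng.normal(0, 0.3, x.size)

      def fit_rmse(y, degree):
          coefs = np.polyfit(x, y, degree)
          return np.sqrt(np.mean((y - np.polyval(coefs, x)) ** 2))

      # Cross-fit: simulate from each model, fit with each model. The cubic
      # fits data from either source; the linear fit only works on its own.
      for source in ["A", "B"]:
          y = simulate(source)
          print(f"data from {source}: cubic RMSE {fit_rmse(y, 3):.3f}, "
                f"linear RMSE {fit_rmse(y, 1):.3f}")
      ```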

      • I heard about this one time when someone had some ODE model and wanted to fit a really complicated (but physically motivated) extension to it.

        It woulda been a ton of work to actually build the complex thing. Andrew suggested just generating data with the complicated thing (not so hard to do in this case, I believe) and then fitting the simple thing and seeing how the approximation worked out.

        In that case stuff was explicitly nested. I don’t know how it worked out but given the complexity of the extension I think this was a good idea. It’s kinda the other direction from fitting complex thing to simple data tho.

  11. I’m so glad to see you post about Tango; you can read a lot of his stuff here: http://www.tangotiger.com/index.php. He’s like a cross between Andrew and Bill James; he does a lot of Bayesian stuff but he presents it mostly as intuition. And yes his “MARCELS the monkey” projection method showed that so many others were not much more than well-intentioned snake oil.
