Chang writes:

I am working on a fleet management system these days: basically, I am trying to predict the future usage ‘y’ of our fleet in a zip code. We have some factors ‘X’, such as the number of active users, the number of active merchants, etc.

If I can fix the time horizon, the problem will become relatively easy:

y = beta * X

However, there is no ‘golden’ time horizon. At any moment, I need to make a call about whether to send more (or fewer) cars to that zip code. Even worse, not all my factors are the same. Some factors are strong for short-term prediction, some are strong for long-term prediction. If I force them all into one single regression model, I am afraid that I will hurt the overall regression performance.

I was thinking about multivariate regression, but it does not really solve the problem. Multivariate regression might just give me an ‘average’ (not in a statistical sense) model that tries to predict across multiple time horizons, but given each factor’s distinct predictive power, it might not predict well at any single horizon.
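One way to make the horizon dependence explicit, sketched here on synthetic data (the factors and horizons are made up for illustration), is to fit a separate regression y_h = beta_h * X for each horizon h, so each factor's coefficient is free to vary with the horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical factors, e.g. active users and active merchants
X = rng.normal(size=(n, 2))
horizons = [1, 7, 30]  # days ahead (illustrative)

# Simulate usage where factor 0 is strong short-term, factor 1 long-term
y = {h: X @ np.array([1.0 / h, h / 30.0]) + rng.normal(scale=0.1, size=n)
     for h in horizons}

# One least-squares fit per horizon: beta is allowed to vary with h
betas = {h: np.linalg.lstsq(X, y[h], rcond=None)[0] for h in horizons}
for h in horizons:
    print(h, betas[h].round(3))
```

This is just the "set of models that differ by horizon" idea in its simplest form; a multilevel model would instead partially pool the per-horizon coefficients rather than estimating them independently.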

I’d recommend fitting a multilevel model in Stan—hey, I’d even do it myself if you paid me enough! Maybe commenters have some more specific ideas.

I wouldn’t kill myself trying to get too detailed here. Unless you have some good reason to expect some particular relationship, you will only be doing an extrapolation beyond the range of the data. Extrapolations can be wildly off, no matter how good the fit within the data range.

This suggests that you should concentrate on understanding what the relationships actually are. If that can’t be done, then concentrate on detecting when your extrapolation is starting to deviate from what’s happening, and how you can adjust quickly.

This sounds a little under-specified. Three things will help: the cost of moving cars at a given time horizon, the cost of being off by some number of cars (likely up and down separately), and how badly the prediction degrades as a function of horizon. I think it would be fine to have a set of models that differ by horizon and characterize which horizons they perform well at. We’ve been working on that approach, as have some other groups in infectious disease prediction. It’s nice to be able to switch models since some of them are great short-term but completely wacky at longer horizons. Choosing which model to use and when really depends on the costs, so I see a decision analysis in your future. Sounds fun!

This is an interesting problem. I’ve spent some time thinking about it in other settings (that admittedly are easier). My usual approach is to just include lags of the relevant variables and do an iterative forecast. So for instance, the usage in time t+1 might depend on the usage in time t. This implies autocorrelation.

In your case, there are probably time effects that you can break out, like splitting the day up by hour or half hour (I think this is what Andrew means, but he may also mean doing it by zip code or something else). So if you’re forecasting from 5am to 6am, you would account for the typical (low-level) activity at that time of day, but a forecast from 5am to 12pm would take into account the ramp-up in activity later in the day.
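The lag-and-iterate idea can be sketched in a few lines (the series and coefficients here are simulated, not real fleet data): fit usage at time t+1 on usage at time t, then roll the one-step forecast forward to reach longer horizons:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
usage = np.empty(T)
usage[0] = 10.0
for t in range(T - 1):  # simulate an AR(1)-style usage series
    usage[t + 1] = 2.0 + 0.8 * usage[t] + rng.normal(scale=0.5)

# Fit usage[t+1] ~ a + b * usage[t] by least squares
A = np.column_stack([np.ones(T - 1), usage[:-1]])
a, b = np.linalg.lstsq(A, usage[1:], rcond=None)[0]

# Iterate the one-step model forward to forecast h steps ahead
def forecast(last, h):
    y = last
    for _ in range(h):
        y = a + b * y
    return y

print(round(forecast(usage[-1], 5), 2))
```

Note that iterating a one-step model compounds its errors, which is one reason forecasts degrade with horizon; adding time-of-day effects would just mean extra columns in the design matrix.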

Have people started using multilevel models in prediction contests? Any idea? I can’t recall any Netflix/Kaggle-like competition winners or high-on-leaderboard teams using multilevel models.

Just wondering about the pros & cons versus more traditional machine learning approaches.

Somehow, all the blog posts on here about multilevel models never seem to focus much on predictive accuracy, which seems to be the big goal in this example.

A model would give interpretable estimates of uncertainty to go along with the predictions (which can be important for decision making), but would probably perform worse at most metrics evaluating accuracy of point predictions.

This is a subject that interests – and puzzles – me greatly. Have statisticians just given up on prediction, only interested in interpretation? I’d rather it not be an either/or choice. Aren’t both important? It appears to me that the machine learning approaches are generally more accurate predictors than modeling approaches (such as multi-level modeling). Their purposes are not the same – but they are not completely different either, are they? And, while many machine learning approaches do not yield as rich information regarding inferences, they do reveal a number of quantifiable measures of importance for different factors. Sometimes it seems like statisticians and “data scientists” inhabit completely parallel universes.

Dale:

Here are a few papers that my colleagues and I have written about prediction for Bayesian models:

http://www.stat.columbia.edu/~gelman/research/published/loo_stan.pdf

http://www.stat.columbia.edu/~gelman/research/published/waic_understand3.pdf

http://www.stat.columbia.edu/~gelman/research/published/final_sub.pdf

Z, Dale:

This whole discussion seems kinda weird to me. Statistical models are one approach to machine learning; conversely, machine learning methods are ways of computing statistical models. Statistics and machine learning are two words for the same thing. When you say, “machine learning approaches are generally more accurate predictors than modeling approaches (such as multi-level modeling),” I have no idea what are the “machine learning approaches” you’re talking about. They use multilevel modeling in machine learning all the time! It’s just given names such as “automatic relevance determination.” When we use Mister P on surveys, we make predictions. If we had a better predictive method, we’d use it.

Since Rahul is talking about Kaggle and whatnot he’s probably thinking of techniques like random forests and xgboost (both of which generate sets of decision trees). Curiously, the objective that xgboost optimizes is a regularized likelihood…

Indeed, I was thinking of Random Forests, Boosted Trees, Neural Networks, etc. They appear to outperform many other modeling approaches. On some theoretical level, I am willing to believe these are different words for the same thing – but I don’t think we can simply state that and move on. I am familiar with the Breiman paper (cited by Carlos below) and believe he was making a complementary point (but far more eloquently). I don’t pretend to have a handle on how to describe the differences but I do think there are differences.

@Corey

Exactly right. My question wasn’t an abstract philosophical one.

I was just making an anecdotal observation: in the machine learning prediction contests which I follow (Kaggle, Netflix, etc.), the top teams on the leaderboard are very often talking about forests, boosting, neural networks, etc.

But I rarely hear talk of a team that won using a multilevel / hierarchical model. On the other hand in the non-competitive (academic) world I see the hierarchical models mentioned a lot.

And hence I was wondering whether there is a reason for this (assuming my anecdotal observation has some truth).

I like how naming things is one of the unsolved problems in computer science. Often when you pick up a machine learning textbook, there are pages and pages of statistics using different terms, like “supervised learning.” You’d think it’d be easy for you academics to just agree on the names to call everything, but it’s obviously not that simple.

John:

They speak different languages in France and Spain. You’d think it would be easy for those Europeans to just agree on the names to call everything, but it’s obviously not that simple.

+1

At an abstract scientific level you’re probably right, I mean I’ll take your word for it since you know way more than me.

At an applied level, there are a bunch of computer scientists turned ML guys who, while bright, really don’t think about or know how to specify parameters, priors, or levels like you do. Nor do they really think about modeling in that way whatsoever. Instead they rely on ML libraries, like TensorFlow, to create classes of models with massive numbers of parameters, and optimize strictly using out-of-sample prediction.

Compared to guys like you, or econometricians, who choose their parameters thoughtfully and carefully.

In my job we often have to choose between using these ‘machine learning’ style techniques, which provide more accurate forecasts, vs. a more fully specified model.

Probably the right answer is to do something like the bsts (Steve Scott) R package, which seems to combine ML methods with time-series causal inference. That stuff is very complicated though, which for those of us in the private sector who aren’t cut from the cloth of stats professors, is something we need to consider :)

+1

This goes to the heart of what I was asking.

Is there a fundamental trade-off here between, as you put it, “providing more accurate forecasts, vs. a more well specified model”?

Ergo, if my goal is strictly predictive accuracy as opposed to some kind of “understanding” does the choice of method change?

And if so, is this why I don’t see “hierarchical modelling” mentioned much on the predictive-contest discussion boards?

Or is it just a semantics issue: they indeed do a lot of hierarchical modelling, but just under another name?

Being someone who’s been “raised” as a mathematician & statistician (through undergrad to PhD) as well as someone who works in the applied ML space, I’ve seen a lot go wrong when you don’t actually know what assumptions you’re making with a given algorithm. Kaggle competitions just aren’t realistic in comparison to the kind of problems I see in real life. Kaggle competitions, for example, have a specific goal to be optimized. Real life problems often don’t.

Ironically, the people who created tools like TensorFlow have a very real understanding of what, e.g., Andrew is talking about. There is a lot of theory that goes into that kind of thing.

Probably you know Leo Breiman’s “two cultures” paper, and saying that you “have no idea” is just hyperbole: https://projecteuclid.org/euclid.ss/1009213726

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

What if our goal is to use data to learn true things about reality?

(I’m an engineer by training and I’m happy to use non-parametric ML to solve problems where that is indeed the goal.)

Carlos:

Those are stirring words from Breiman. Meanwhile, Bayesian methods solve lots of real applied problems for me, and have been doing so for close to 30 years.

You guys should convert this into a fun challenge. Put in an entry for a Kaggle contest or some other prediction contest and win it using a strictly Bayesian hierarchical / multi-level model.

It’d have great pedagogic value to see what the workflow looks like as well. Kind of different from the typical academic problem.

I provided the reference just for context, because maybe some people reading this blog really don’t know what the “statistical models” vs “machine learning” discussion is about. I know that’s a provocative and relatively ancient paper and I understand that there is no clear separation between both fields and both can be useful. But it doesn’t mean that there are no differences at all. The same argument could be made for the frequentist/Bayesian distinction: both can be useful and there may be no clear division. By the way, you appear in today’s entry at Mayo’s blog under the heading “A Bayesian wants everyone else to be non-Bayesian”.

As Andrew said, stirring words by Breiman.

So… do you think that the general field of statistics has stayed the same since those words were written?

No, it hasn’t. Most of us who are statistically trained are more than happy to “use what works” provided – and this is a rather large thing – we know it’s suited to the problem. And we often understand the statistical underpinnings of most ML algorithms we use (something that I sometimes don’t talk about, because certain people really don’t understand that being statistically motivated or understandable is not a bad thing in an algorithm). We certainly recognize the value of prediction.

But we also realize that there are times when inference is needed. For some statisticians, inference is more important – especially when we’re looking at, e.g., causality.

That’s a great paper, by a great statistician, but the information in there is also very old. Things change.

“If we had a better predictive method, we’d use it.”

Have you actually tried fitting neural nets, random forests, etc. along with a multilevel model and then combining them all in an ensemble with weights determined by cross-validation? I bet that would give you better point predictions. But you’d lose interpretable measures of uncertainty surrounding those predictions (which is the reason I think it makes sense to use a multilevel model whenever doing so doesn’t sacrifice too much predictive accuracy).
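A minimal sketch of the weighting step described here, with two stand-in base models (polynomial fits are placeholders for, say, a multilevel model’s and a random forest’s cross-validated predictions): choose the convex combination that minimizes held-out squared error:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + rng.normal(scale=0.2, size=n)

train, hold = np.arange(0, 200), np.arange(200, n)

# Stand-in "models": a linear and a cubic fit, trained on the train split
def holdout_preds(deg):
    c = np.polyfit(x[train], y[train], deg)
    return np.polyval(c, x[hold])

preds = np.column_stack([holdout_preds(1), holdout_preds(3)])

# Grid-search a convex weight on the held-out set
ws = np.linspace(0, 1, 101)
errs = [np.mean((w * preds[:, 0] + (1 - w) * preds[:, 1] - y[hold]) ** 2)
        for w in ws]
w_best = ws[int(np.argmin(errs))]
print(w_best)
```

By construction the chosen ensemble is at least as good on the held-out set as either base model alone; whether that advantage survives on fresh data is exactly the selection question raised further down the thread.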

Z:

1. I don’t see how neural nets etc. would get good estimates for public opinion in small states from national surveys, the way we can do this using MRP. The problem is that partial pooling is needed. But sure, maybe, I don’t know, sure, maybe if you average a bunch of prediction methods in some way, you’ll get reasonable predictions. But it’s really hard for me to imagine that these predictions would be better than MRP. That’s fine—when we apply MRP, we’re “cheating” in that we’re using substantively meaningful predictors such as Republican vote share in the state in the previous election. That’s one of the reasons we use Bayesian models: they allow us to include prior information. I can see the virtue of an entirely off-the-shelf method that can perform pretty well without using this prior information. But, as I said, I think it’s a bit much to think its predictions would be better.

2. For the decisions we want to make, we often need to predict in new situations, so it’s not all about cross-validation. Consider this toxicology example. I’m not quite sure how you would use neural nets etc. to predict the data in this example, but I suppose you could do it. But we want to make predictions under other exposure conditions. This comes up a lot in pharmacology too, when considering dosing.

In response to (1), I would say 2 things. First, MRP can be in the ensemble and could wind up with the most weight assigned to it, but usually adding other methods to an ensemble improves predictive performance at least a little. Second, machine learning methods don’t need to be entirely ‘off-the-shelf’ either. Thought can go into crafting input features. Input features to these methods can even be generated from the output of MRP.

I think your (2) is a great point. Rahul phrased his question as prediction vs. understanding, but prediction in even slightly altered circumstances can require understanding. Modeling predictions can in some sense then be more robust to shifts in context (or at least give you more of a chance to predict how much shifts in context might harm your predictions). So I now see at least 2 core advantages to ‘modeling’ over ‘machine learning’ for prediction: (A) estimates of uncertainty; (B) robustness/adaptability to changes in context.

Sure, that can help (https://arxiv.org/abs/1703.10936, although the multi-level models haven’t made their way in there yet). When using ML techniques in places where multi-level models are more appropriate, you do need to do a round of regularization, and it’s clunky, but it can work.

The aspect of ML methods that should temper our enthusiasm about them is that if you give up understanding you can’t delineate when it’s going to fail. This ultimately was one of the things that contributed to the downfall of Google Flu Trends, for example. If you get the wrong answer and you can’t really figure out why, people aren’t going to trust the stuff. For Flu Trends there were some attempts to address this but they ended up being pretty opaque.

I was thinking along the same lines. I consider density forecasts more informative than point forecasts.

Kaggle-type contests are naturally prone to choosing over-complex models as the winners.

When you do model selection by sample-splitting (or cross-validation) on a large but finite dataset, this process has a known tendency to choose overfitting models. And the more models it has to choose from, the worse it gets, i.e. the more likely it is that an overfitting one will look better than the “best” model (i.e. the one that would actually predict best on an infinite test set from the same population as the training data).
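This winner’s-curse effect is easy to see in a toy simulation (the numbers are made up): give many submissions identical true accuracy, score them on a finite leaderboard test set, and the selected “best” score overstates the truth:

```python
import numpy as np

rng = np.random.default_rng(3)
n_test = 1000     # finite leaderboard test set
n_models = 500    # many near-identical submissions
p_true = 0.7      # every model's true accuracy is the same

# Each model's leaderboard score is a noisy estimate of p_true
scores = rng.binomial(n_test, p_true, size=n_models) / n_test
best = scores.max()
print(best, p_true)
```

The maximum of many noisy scores is biased upward even though no model is genuinely better, and the bias grows with the number of models compared.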

So even if a carefully designed, subject-matter-expert-approved, “traditionally stats” multilevel model were actually “best,” it’d lose the Kaggle contest to someone who made the maximum number of daily submissions using slightly different tuning parameters of a “traditionally ML” model like a random forest.

But winning the contest does not guarantee that the random forest would *actually* do better long-term, even on data from the same population. And, as other commenters have pointed out, the black box may be more fragile to subtle changes in the real world population when you try to implement it.