He’s attempting to overhaul his company’s model of retail demand. Which makes me think of multilevel models . . .

Ryan Socha writes:

I do machine learning, but I try to do it with as statistically rigorous an approach as I can.

Me too!

He continues:

I’m a student during the school year, but am working in industry this summer. I am currently attempting to overhaul my company’s model of retail demand. We advise suppliers to national retailers; our customers are the suppliers. Right now, for each of our customers, our demand model outputs a point estimate of how much of their product will be consumed at each of roughly a hundred locations. This allows our customers to decide how much to send to each location.

However, because we are issuing point estimates of mean demand, we are *not* modeling risk directly, and I want to change that, as understanding risk is critical to making good decisions about inventory management – the entire point of excess inventory is to provide a buffer against surprises.

Additionally, the model currently operates on a per-day basis, so that predictions for a month from now are obtained by chaining together thirty predictions about what day N+1 will look like. I want to change that too, because it seems to be causing a lot of problems with errors in the model propagating across time, to the point that predictions over even moderate time intervals are not reliable.

I already know how to do both of these in an abstract way.

I’m willing to bite the bullet of assuming that the underlying distribution should be multivariate Gaussian. From there, arriving at the parameters of that PDF just requires maximum likelihood estimation. For the other change, without going into a lot of tedious detail, Neural ODE models are flexible with respect to time, so that you can use the same model to predict the net demand accumulated over t=10 days as you would to predict the net demand accumulated over t=90 days, just by changing the time parameter that you query the model with.
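
To make the time-flexibility point concrete, here is a rough sketch of the kind of thing I mean, using the torchdiffeq package; the architecture, dimensions, and horizons are placeholders rather than our actual model.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # neural-ODE solver; pip install torchdiffeq

class DemandODE(nn.Module):
    """dy/dt for a latent demand state; purely illustrative."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(),
                                 nn.Linear(64, dim))

    def forward(self, t, y):
        # append the (scalar) time to the state so the dynamics can vary over the year
        t_col = t * torch.ones(y.shape[0], 1)
        return self.net(torch.cat([y, t_col], dim=-1))

func = DemandODE(dim=8)
y0 = torch.zeros(100, 8)             # e.g. one latent state per location (made up)
t = torch.tensor([0.0, 10.0, 90.0])  # same fitted model, queried at 10- and 90-day horizons
states = odeint(func, y0, t)         # shape: (3, 100, 8)
```

The same fitted dynamics answer both horizon queries; only the time grid passed to the solver changes.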

The problem is, although I know how to build a model that will do this, I want the estimated variance for each customer’s product to be individualized. Yet frustratingly, in a one-shot scenario, the maximum likelihood estimator of variance is zero. The only datapoint I’ll have to use to train the model to estimate the mean aggregate demand for, say, cowboy hats in Seattle at time t=T (hereafter (c,S,T)) is the actual demand for that instance, so the difference between the estimated mean and the actual outcome will be zero.
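
Concretely, the kind of head I would attach is the standard heteroscedastic Gaussian one sketched below (shown univariate for simplicity; names and dimensions are made up). With one realized outcome per cell and a mean flexible enough to hit it, the loss keeps improving as the predicted variance shrinks toward zero, which is exactly the collapse I am describing.

```python
import torch
import torch.nn as nn

class MeanVarHead(nn.Module):
    """Illustrative heteroscedastic head: predict a mean and a log-variance
    per (product, location, horizon) query."""
    def __init__(self, in_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, 1)
        self.log_var = nn.Linear(in_dim, 1)

    def forward(self, x):
        return self.mu(x), self.log_var(x)

def gaussian_nll(y, mu, log_var):
    # negative log N(y | mu, exp(log_var)), dropping the constant
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

# With one realized outcome per cell and a mean flexible enough to hit it,
# (y - mu) -> 0 and the loss keeps decreasing as log_var -> -inf:
# the one-shot variance collapse described above.
```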

It’s clear to me that if I want to arrive at a good target for variance or covariance in order to conduct risk assessment, I need to do some kind of aggregation over the outcomes, but most of the obvious options don’t seem appealing.

– If I obtain an estimate of variance by thinking about the difference between (c,S,T) and (c,Country,T), aggregating over space, I’m assuming that each location shares the same mean demand, which I know is false.

– If I obtain one by thinking about the difference between (c,S,T) and (c,S,tbar), aggregating over time, I am assuming there’s a stationary covariance matrix for how demand accumulates at that location over time, which I know is false. This will fail especially badly if issuing predictions across major seasonal events, such as holidays or large temperature changes.

– If I aggregate across customers by thinking about the difference between (c,S,T) and (cbar,S,T), I’ll be assuming that the demand for cowboy hats at S,T should follow patterns similar to the demand for other products, such as ice cream or underwear, which seems obviously false.
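
For concreteness, the three aggregations above amount to something like the following groupby computations on a long-format table (the column names and toy data are hypothetical, not our schema).

```python
import numpy as np
import pandas as pd

# Toy long-format data: one realized demand per product x location x period.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "product":  rng.choice(["cowboy_hats", "ice_cream", "underwear"], n),
    "location": rng.choice([f"loc_{i}" for i in range(10)], n),
    "period":   rng.integers(0, 20, n),
    "demand":   rng.poisson(50, n).astype(float),
})

# The three aggregation-based variance proxies listed above:
var_over_space    = df.groupby(["product", "period"])["demand"].var()    # pools locations
var_over_time     = df.groupby(["product", "location"])["demand"].var()  # pools periods
var_over_products = df.groupby(["location", "period"])["demand"].var()   # pools products
```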

I have thought of an alternative to these, but I don’t know if it’s even remotely sensible, because I’ve never seen anything like it done before. I would love your thoughts and criticisms on the possible approach. Alternatively, if I need to bite the bullet and go with one of the above aggregation strategies instead, it would benefit me a lot to have someone authoritative tell me so, so that I stop messing around with bad ideas.

My thought was that instead of asking the model to use the single input vector associated with t=0 to predict a single output vector at t=T, I could instead ask the model to make one prediction per input vector for many different input vectors from the neighborhood of time around t=0 in order to predict outcomes at a neighborhood of time around t=T. For example, I’d want one prediction for t=-5 to t=T, another prediction for t=-3 to t=T+4, and so on.

I would then judge the “true” target variance for the model relative to the difference between (c,S,T)’s predicted demand and the average of the model’s predicted demands for those nearby time slices. The hope is that this would reasonably correspond to the risks that customers should consider when optimizing their inventory management, by describing the sensitivity of the model to small changes in the input features and target dates it’s queried on. The model’s estimate of its own uncertainty wouldn’t do a good job of representing out-of-model error, of course, but the hope is that it’d at least give customers *something*.
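
As a sketch of what I mean, assuming a hypothetical predict(features, t_start, t_end) function that returns the model’s cumulative-demand forecast over a window (the function name, the offsets, and the particular spread measures are all placeholders):

```python
import numpy as np

def neighborhood_variance(predict, features, t_end,
                          start_offsets=(-5, -3, 0, 2, 5), end_offsets=(-4, 0, 4)):
    """Sketch of the idea above. `predict(features, t_start, t_end)` stands in
    for the model's cumulative-demand forecast over a window; the offsets are
    illustrative, not tuned."""
    preds = np.array([predict(features, s, t_end + e)
                      for s in start_offsets for e in end_offsets])
    center = predict(features, 0, t_end)
    # "Target" variance: squared gap between the central prediction and the
    # average over nearby windows, plus the raw spread for reference.
    return (center - preds.mean()) ** 2, preds.var()
```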

Does this make any sense at all as a possible approach, or am I fooling myself?

My reply:

When you write, “The only datapoint I’ll have to use to train the model to estimate the mean aggregate demand for, say, cowboy hats in Seattle at time t=T (hereafter (c,S,T)) is the actual demand for that instance,” it makes me think of additive models with interactions, for example theta_cst = alpha_c + beta_s + gamma_t + e_cst, or with some two-way interactions as well. That’s an additive model, and there are other ways to go (as discussed for example in Radford Neal’s Ph.D. thesis); the point is that there are natural comparison points beyond the simple y_cst.

Your three ideas above (aggregating over space, time and products) can be seen as special cases of this sort of additive model. So one way of handling this would be to fit a multilevel model, but I think you could also do it by retrofitting your machine learning approach, which I guess could be viewed as a kind of hack or computational approximation to embedding the fitting procedure in a multilevel model. It’s an interesting general question, how to get the benefits of a multilevel model by post-processing non-multilevel inferences, kind of related to our discussion of analysis of variance a few years ago.
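
Here is a minimal sketch of the unshrunk version of that additive decomposition, just to show where the comparison points and the residual reference class come from. It assumes a long-format table with one row per (product, location, period) and hypothetical column names; a real multilevel fit would partially pool each batch of effects toward zero rather than using raw group means.

```python
import numpy as np
import pandas as pd

def additive_fit(df, factors=("product", "location", "period"), n_iter=20):
    """Backfitting sketch for demand ~ mu + alpha_c + beta_s + gamma_t + e_cst."""
    y = df["demand"].astype(float)
    mu = y.mean()
    effects = {f: pd.Series(0.0, index=pd.Index(df[f].unique())) for f in factors}

    def fitted():
        out = pd.Series(mu, index=df.index)
        for f in factors:
            out = out + df[f].map(effects[f])
        return out

    for _ in range(n_iter):
        for f in factors:
            partial = y - (fitted() - df[f].map(effects[f]))  # strip the other terms
            effects[f] = partial.groupby(df[f]).mean()

    resid = y - fitted()
    return mu, effects, resid

# The residuals e_cst, rather than the raw y_cst, give the reference class:
# mu_hat, effects, resid = additive_fit(df)
# resid.groupby(df["product"]).std()   # e.g. a per-product residual sd
```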

P.S. No joke, I love that notation: c = cowboy hats, S = Seattle, T = time. Super-clear, and there’s something about the “cowboy hats” thing that adds just the right level of deadpan humor.

13 thoughts on “He’s attempting to overhaul his company’s model of retail demand. Which makes me think of multilevel models . . .”

  1. Why not model the time-series explicitly? Total cowboy hat sales as a function of time at location S is then C(S,t), which is constrained to be a non-decreasing function. Now your goal is to estimate the difference C(S,T) – C(S,t) where T is the future timepoint and t is now. Maybe the function is nonlinear but you could approximate it as having at most a certain error from some piecewise linear function you specify as part of the prior, for example. Then maybe nearby locations are also more similar, whereas farther away locations are less similar. So you could place “soft constraints” (priors) on differences between locations.
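
    Here is a rough sketch of that cumulative formulation (shapes and features are made up, and it is written in PyTorch rather than with priors in Stan): model nonnegative daily increments, so that C(S,t) is non-decreasing by construction and C(S,T) - C(S,t) comes out nonnegative for free.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CumulativeDemand(nn.Module):
        """Nonnegative daily increments, cumulatively summed, so C(S, t) is
        non-decreasing by construction. Shapes and features are made up."""
        def __init__(self, n_features, horizon):
            super().__init__()
            self.increments = nn.Linear(n_features, horizon)

        def forward(self, x):
            delta = F.softplus(self.increments(x))   # one increment >= 0 per day
            return torch.cumsum(delta, dim=-1)       # C(S, t)

    model = CumulativeDemand(n_features=16, horizon=90)
    C = model(torch.randn(100, 16))                  # (locations, days)
    demand_day10_to_90 = C[:, 89] - C[:, 9]          # C(S, T) - C(S, t), always >= 0
    ```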

    • Or, pulling from the mention of ODE and from sources like Jay Forrester’s /Industrial Dynamics/, what about modeling the time series as the output of a set of ODEs? Both the demand and the changes in inventory (supply chain issues) can be incorporated into your model. You could do this in Stan, in MCSim, in Vensim, ….

      You can find more by searching on “system dynamics.”
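
      As a toy illustration of the stock-and-flow idea (every rate and the seasonal demand curve below are invented), the same structure can be prototyped with an off-the-shelf Python ODE solver before committing to Stan, MCSim, or Vensim:

      ```python
      import numpy as np
      from scipy.integrate import solve_ivp

      def rates(t, state, restock_rate=55.0, smoothing_days=7.0):
          """Toy stock-and-flow system: on-hand inventory plus a smoothed
          demand estimate. Every number here is invented for illustration."""
          inventory, demand_est = state
          demand = 50.0 + 10.0 * np.sin(2 * np.pi * t / 365.0)   # seasonal sales rate
          d_inventory = restock_rate - demand                    # shipments in minus sales out
          d_demand_est = (demand - demand_est) / smoothing_days  # exponential smoothing
          return [d_inventory, d_demand_est]

      sol = solve_ivp(rates, t_span=(0.0, 180.0), y0=[500.0, 50.0],
                      t_eval=np.linspace(0.0, 180.0, 181))
      inventory_path, demand_est_path = sol.y
      ```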

    • Fundamentally, inventory management is about insuring against risk. To train a machine learning model to predict variance, you have to have an estimate of variance that you can use to tell the model whether or not it did a good job. Modeling the time-series explicitly was part of the proposed approach; however, doing so means that each tuple needs its own estimate of variance to target. The question is, how do we get a sensible target to aim the model’s estimates of variance at? It’s true that you can try to define it a priori, but that’s really hard.

      One of the goals of this approach was to let different products share information with each other. We’d like the model’s training on cowboy hat forecasts to be helpful when we’re predicting the behavior of baseball hats, or even the behavior of ice cream cones, because they probably are similarly impacted by seasonality, underlying changes in consumer demand, etc. The data for each product isn’t necessarily very reliable, but the data for all the different products together can do better. Throwing all available data into a blender like this is one of the big draws of machine learning for retail forecasting. Differences between products can be represented with a categorical variable corresponding to coarse membership categories.

      There probably are times when it’d be best to use completely separate models for different products, however, and I don’t have good intuitions on when those times would be. The question is whether and when the cost of throwing away data is worse than the cost of forcing your model to be expressive enough to represent all available data’s behavior.
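
      A minimal sketch of that sharing, with made-up dimensions: one network for every product, plus a small learned embedding of the coarse category that carries the between-product differences.

      ```python
      import torch
      import torch.nn as nn

      class SharedDemandModel(nn.Module):
          """One network for every product; a small learned embedding of a
          coarse category ("headwear", "frozen food", ...) carries the
          between-product differences. Dimensions are made up."""
          def __init__(self, n_categories, n_features, emb_dim=4):
              super().__init__()
              self.category = nn.Embedding(n_categories, emb_dim)
              self.body = nn.Sequential(nn.Linear(n_features + emb_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 1))

          def forward(self, x, category_id):
              z = torch.cat([x, self.category(category_id)], dim=-1)
              return self.body(z)

      model = SharedDemandModel(n_categories=12, n_features=16)
      pred = model(torch.randn(32, 16), torch.randint(0, 12, (32,)))
      ```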

    • Such approaches are essentially a form of maximum likelihood estimation, and, being point estimators, do not yield provable guarantees of calibration for uncertainty quantification. I personally wouldn’t use them in a risk management setting.

      Fundamentally, any point estimator has sampling variation – your variance has a variance. To see what I mean, consider an unregularized MLE linear model with more covariates than data points. Is the variance there correct? There are some approaches for dealing with this. You can use bootstrap standard errors, though that’s computationally expensive, or a hold-out set (https://en.wikipedia.org/wiki/Conformal_prediction), though that means you lose the held-out data for training. Both of these approaches only hold for i.i.d. draws from the same distribution of covariates, and neither handles heteroskedasticity.

      For rich, flexible, and rigorous uncertainty quantification, there’s nothing but approximate Bayesian inference, but even then you have to make lots of decisions about how to model heteroskedasticity.
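
      For reference, the hold-out-set version (split conformal) is only a few lines; the model below stands for any point predictor already fit on a separate training split (a sklearn-style predict method is assumed), and the intervals it yields have constant width:

      ```python
      import numpy as np

      def split_conformal_interval(model, X_calib, y_calib, X_new, alpha=0.1):
          """Split-conformal sketch: residuals on a held-out calibration set
          give marginal coverage under exchangeability, with constant-width
          intervals."""
          resid = np.abs(y_calib - model.predict(X_calib))
          n = len(resid)
          q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # conservative finite-sample quantile
          q = np.quantile(resid, q_level)
          pred = model.predict(X_new)
          return pred - q, pred + q
      ```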

        • Somewhat off-topic, but it reminds me of the small sample methods guaranteed to work on an infinite-sized sample. Or consistent estimators in general… one of the worst justifications I know of.

        • Fair enough. Let me try again:

          The sampling distribution of the maximum likelihood estimator is undefined in small samples, whatever that means, or where the standard regularity conditions do not hold (bounded third derivatives of the log likelihood and its ilk). Where it is defined, it must be accounted for in uncertainty quantification, on top of fitted variance parameters.

    • The goal is to generate one Gaussian distribution per input tuple. There is only one actually realized outcome for each (product, location, time interval) sample in the dataset, so strictly speaking the MLE of the variance is zero. This is a little bit frequentist, philosophically. Whatever happens, happens. Some kind of reference class is needed in order to think about variance. To model uncertainty for each tuple despite this, fancifully, we are attempting to peer into alternate universes and see ways that our universe could have gone, but didn’t. (And then there’s the question of how to sample such universes… ideally, probably in some way associated with our uncertainty or the empirical behavior of the model’s errors.)

      • This only happens if the model is overfitting and the prediction for each data instance is exactly the realized observation. Otherwise, if predicted(y) =/= y, then

        lim_{sigma -> 0} log likelihood = -infinity,

        since the per-observation Normal log likelihood is, up to a constant,

        -log(sigma) - (y - predicted(y))^2 / (2 sigma^2),

        and the squared-error term dominates as sigma shrinks (up to numerical error, of course).

        So to the question asker, if you’re getting zero variance MLEs, you’re probably overfitting. I don’t know what the regularization techniques are for neural ODEs though.
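
        A quick numeric check of that limit:

        ```python
        import numpy as np

        def normal_loglik(y, yhat, sigma):
            # per-observation Normal log likelihood, up to a constant
            return -np.log(sigma) - (y - yhat) ** 2 / (2 * sigma ** 2)

        for sigma in [1.0, 0.1, 0.01, 0.001]:
            # imperfect fit -> -inf; exact fit -> +inf (the zero-variance degeneracy)
            print(sigma, normal_loglik(10.0, 9.0, sigma), normal_loglik(10.0, 10.0, sigma))
        ```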

  2. So, it turns out that one of the best approaches to this is conducting Monte Carlo uncertainty quantification on the forward pass via Dropout.

    https://arxiv.org/abs/1506.02142

    I like that a lot better than these ideas, partly for simplicity and partly because these approaches make you a prisoner of your dataset in a way that Dropout doesn’t.

    My intuition is that using neighboring datapoints is risky because there can be instances when looking at neighboring time intervals or regions of space results in a much sharper-than-average change in how predictions should behave: for example, extending a short-term prediction to encompass an extra weekend or holiday might completely change the expected number of sales that the model should output; looking at a geographically close distribution center across state lines might completely change whether or not a product is outcompeted by a substitute good due to differences in tax policy; etc. Additionally, the prevalence of neighbors might be different, or mean different things, for different samples. Perhaps the average distance between locations is a lot greater in South Dakota than in California, say. This means that it’d be inappropriate to act like the neighbor of a prediction is just a different random draw from the same underlying location.

    MC Dropout lets the neural network address these questions in a way that’s a lot more flexible and bottom-up than other methods, and under the interpretation of neural networks as feature extractors, seems to correspond to counterfactual “what-ifs” for intermediate questions of interest in a way that’s appealing. “What if despite X and Y being true, Z fails because of some stochastic difficulty in this related supply chain?”, say. There’s some regularization lost due to that flexibility, but choosing a larger value for the Dropout frequency seems like a perfectly good way to compensate for it. The analogous complaint, I guess, would be that the optimal Dropout frequency should differ from parameter to parameter and sample to sample.

    I wonder if there’s some way to use information from learning rate optimizers to approach this in a very principled fashion. Also wondering if it’d be possible to incorporate prior knowledge of the covariance structure of predictions at different times or locations from the top-down while still using MC Dropout. Finally, wondering if information from learning rate optimizers could also be leveraged in order to be smarter about choosing good distributions to sample noise from for data augmentation methods.
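
    For anyone who wants to try it, the mechanics are just a few lines in PyTorch; the architecture and dropout rate below are placeholders:

    ```python
    import torch
    import torch.nn as nn

    # Placeholder architecture; the only requirement is dropout layers that stay
    # stochastic at prediction time (Gal & Ghahramani, arXiv:1506.02142).
    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.2),
                          nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
                          nn.Linear(64, 1))

    def mc_dropout_predict(model, x, n_samples=200):
        model.train()   # keeps dropout active; fine here since there is no batchnorm
        with torch.no_grad():
            draws = torch.stack([model(x) for _ in range(n_samples)])
        return draws.mean(dim=0), draws.std(dim=0)   # predictive mean and spread

    mean, spread = mc_dropout_predict(model, torch.randn(100, 16))
    ```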

    • Chaos,

      I haven’t worked with predicting variance, so feel free to take my advice with a grain of salt.

      When modeling, you shouldn’t be agonizing over picking one technique. Start with something simple, check how calibrated the confidence intervals are, see where it’s most wrong, and iterate. You’ll at least have benchmarks to compare against and can hand something in with known shortcomings should other methods fail. Putting all your eggs in one basket is a mistake.

      The main thing you seem to be struggling with is how to measure variance. I’d personally recommend looking at the “M5 Uncertainty Competition” and trying to dig up some of the submissions. Their problem seems essentially identical to yours.
