
Multilevel modeling in Stan improves goodness of fit — literally.

John McDonnell sends along this post he wrote with Patrick Foley on how they used item-response models in Stan to get better clothing fit for their customers:

There’s so much about traditional retail that has been difficult to replicate online. In some senses, perfect fit may be the final frontier for eCommerce. Since at Stitch Fix we’ve burned our boats and committed to deliver 100% of our merchandise with recommendations, solving this problem isn’t optional for us. Fortunately, thanks to all the data our clients have shared with us, we’re able to stand on the shoulders of giants and apply a 50-year-old recommendation algorithm to the problem of clothes sizing. Modern implementations of random effects models enable us to scale our system to millions of clients. Ultimately our goal is to go beyond a uni-dimensional concept of size to a multi-dimensional understanding of fit, recognizing that each body is unique.

Cool! Their post even includes Stan code.


  1. Contraire says:

    Shilling a categorical model seems uncharacteristic. I thought you would recommend they model proportions directly…

    How mature is Stan anyway? Still not seeing the value relative to JAGS, other than being en vogue.

    • Andrew says:


      If you can already solve your problems using Jags, or Stata, or Excel, or whatever, go for it. The point of any tool is to be able to solve problems that you can’t otherwise solve, or to solve them more easily. If you’re interested in using Stan to solve problems, I recommend you take a look at our manual, our example models, and our case studies, all of which you can find here:

    • The ratings of an item by a person are discrete: too small, just right, too big.

      Gelman and Hill cover IRT/ideal-point models and ordinal logistic models in their regression book. There are very detailed case studies of IRT models on the Stan web site in both the regular case studies section and in the Asilomar StanCon (2018).

      Each person’s size and each item’s size is modeled continuously. That’s their variables alpha and beta. The cutpoints between the rating categories are also continuous.
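      For concreteness, the ordered-logit structure described above can be sketched in Python. This is a toy illustration with made-up cutpoint values, not the Stitch Fix code; the function and parameter names are mine:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_probs(alpha, beta, cutpoints):
    """Ordered-logit probabilities for the three fit ratings.

    alpha: latent size of the client, beta: latent size of the item,
    cutpoints: (c1, c2) with c1 < c2 separating the three categories.
    Returns (P(too small), P(just right), P(too big)).
    """
    eta = beta - alpha  # how much bigger the item runs than the client
    c1, c2 = cutpoints
    p_small = inv_logit(c1 - eta)        # item runs small -> "too small"
    p_big = 1.0 - inv_logit(c2 - eta)    # item runs big   -> "too big"
    p_right = 1.0 - p_small - p_big
    return p_small, p_right, p_big

# When item and client sizes match (eta = 0) with symmetric cutpoints,
# "just right" is the most likely rating.
print(fit_probs(alpha=0.0, beta=0.0, cutpoints=(-1.5, 1.5)))
```

      Fitting alpha, beta, and the cutpoints jointly from ratings data is what the hierarchical Stan model in the linked post does; the sketch only shows the likelihood for one person-item pair.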

    • Stan is mature enough to be very usable, and it blows JAGS out of the water on relatively complex models and in speed (effective samples per second).

  2. John Hall says:

    They could also use this information to tailor marketing materials. Though I imagine they could do the same thing with just information on sizes purchased…

  3. I did this kind of stuff back in 2005 or so for a startup in Emeryville… I guess mathematical modeling wasn’t trendy enough back then, or I’d be a billionaire by now :-/

    • Keith O’Rourke says:


      All that really matters here is someone with funds thinking they can make money using the algorithm and finding out they are right. I don’t think that was happening in 2005.
      (By the way, in 1982 my MBA term project was to use Herbert Simon’s AI ideas to discern representations of fashion offerings that matched how women shopped at the time.)

      • I’m really down on the tech industry as a source of real innovation. The problem these days is that it’s very possible to make money without making VALUE for consumers. When the govt is printing money like there’s no tomorrow, a lot of worthless fake “work” will happen… and it has been, and still is.

  4. Erin Jonaitis says:

    Super cool stuff.

    It seems to me that there’s a wrinkle they haven’t addressed in this post – people’s sizes change over time. So it’s not just a problem about learning more over time about the “true” value – the true value is a moving target.

    I was also surprised by the apparent symmetry in the latent size vs purchase probability plot – I would have expected a steeper dropoff on the “too small” side, but I don’t see it. I wonder whether the shape of that plot varies by garment type (it might be easier to get away with an ill-fitting overcoat than to do the same with too-small pants).

    • Heidi Fox says:

      I often reject clothing for being simultaneously too small (in the hips) and too large (in the waist). I’m not sure which box I’d check in that case.

      • A major part of the stuff I was working on at that Emeryville startup was trying to help people like you ;-)

        The hardest part about all this wasn’t figuring out how to fit you, but figuring out how to communicate to the buyer which style to try… You wind up needing something like 8 different “styles” to fit a wide enough range of body types, and then you need to be able to “tell” the buyer which style they need… not an easy task.

  5. Mike Hunter says:

    This approach can’t be scalable beyond toy datasets.

    • Andrew says:


      We’ve fit item response models in Stan to datasets with many thousands of items. I haven’t tried out this particular model, but I have no reason to think that the approach described in the above post can’t be scaled to real problems.

      • Mike Hunter says:


        Ok, but how? RAM is the limiting factor here. Fitting a model with ‘thousands’ of items across gigs and gigs of data is no small feat on any machine using any software. It would help participants if you were to either elaborate just a tiny bit more on the method employed or, perhaps, point to literature where this is discussed.

        • Jake says:

          You can get terabytes of ram for not a lot of money these days, and one presumes they can go distributed when they get to a situation of needing to.

        • Andrew says:


          I don’t know why you put ‘thousands’ in quotes. When I said thousands, I meant thousands. Regarding details: I was thinking of a simulation that Bob Carpenter did a few years ago, which I can’t find anywhere now. You could take a look at this paper from a few years ago, but our Stan code is faster now, as we have hard-coded some models such as logistic regression.

          • Mike Hunter says:

            Putting thousands in quotes wasn’t a gee-whiz moment since thousands of features isn’t really that many, e.g., Google develops their predictive models using hundreds of millions of features.
            I think of both IRT models and models built using Stan as inferential even confirmatory approaches rooted in strong theory and based on a carefully specified model using a reduced set of predictors.
            What theory lends itself to inferential model building with thousands of features? What a priori hypotheses can be specified in that case?

            • How much better are the predictions from some kind of machine learning model involving, say, 1 million predictors compared to a model using, say, a random sample of 1000 of those predictors? I’d be shocked if 1e6 predictors did a lot better than 1e3.

              I think typically the reason people throw in 1M is because they have the data and they *don’t* have any kind of theory, not because 1M are needed to get good results.

              • A lot of the problems at which machine learning excels have ridiculously long tails. All the text in the world doesn’t build a very good simple language model to predict the next word given the previous half-dozen words. There are about a million forms of words in fairly common use, not counting names. So a half dozen previous words makes for a lot of predictors. Of course, most of them won’t be observed in a finite text corpus, but if you have a billion words (no longer even considered “big data”), you will likely have tens or hundreds of millions of predictors. Even pruning them heavily doesn’t give you something workable with only a thousand or ten thousand predictors.

                For speech recognition, the input is often broken down into fifty or so psychoacoustically filtered predictors every hundredth of a second. And those big language models over words are used to guide the speech recognizer in narrowing the search for the next word. Using many fewer predictors or much less training data starts dropping accuracy fast.

                Sure, they’re not very good models—basically just remembering what they saw, but they do let us build applications like Alexa or Siri.

              • It’s not clear to me how a half dozen words makes for more than a half dozen predictors.

                But clearly this is a terminology issue… The point about needing a lot of data in building the prediction model is very clear.

              • Ah, I see what you’re saying: it’s all possible phrases of 6 words that you mean… Combinatoric explosion.
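              To make that explosion concrete, here’s a toy Python sketch (mine, not from the thread). The number of *possible* n-word contexts grows as V**n in the vocabulary size V, while the number actually observed is bounded by the corpus length, so most predictors are never seen in any finite corpus:

```python
def distinct_ngrams(words, n):
    """Set of distinct n-word sequences occurring in the corpus."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

corpus = "the cat sat on the mat and the dog sat on the rug".split()
V = len(set(corpus))  # vocabulary size

for n in (1, 2, 3):
    possible = V ** n
    observed = len(distinct_ngrams(corpus, n))
    print(f"n={n}: possible predictors={possible}, observed={observed}")
```

              Even with an 8-word vocabulary, the possible trigram contexts (512) already dwarf the handful observed; with a million-word vocabulary and six-word contexts, the gap is astronomical.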

            • Andrew says:


              An example of item response modeling would be a test with hundreds of questions taken by thousands of students. Or an ideal-point model fit to hundreds of questions asked of thousands of survey respondents. Or thousands of movies being rated by millions of viewers. There are lots and lots of applications of such models. Another example is the clothing-fitting problem in the above post.

  6. Ellen Terry says:

    This thread took an interesting direction. I come from a business perspective, where 5–20k transactions/sec is fairly common, so scale and dynamics are critical. We can deal with the dynamics by running algorithms in parallel and persisting parameters to be shared by the algorithms in production. But scale is a concern I have with Stan. I suspect that, since it’s accessing a compiled C++ object, we already have a bottleneck, and it’s going to be difficult to scale/thread out in a cluster. Desktops are not an option. Thoughts?

    • Andrew says:


      Stan can be parallelized on clusters. We’re implementing and testing that right now.

      • Andrew:

        CmdStan 2.18 is out! It has MPI-based (multi-core or multi-machine) parallelism in evaluating the log densities.

        Ellen Terry:

        Stan’s not intended for high-throughput applications. If you have a business mindset, it might be helpful to think of it as like a spreadsheet for statisticians. You can code things and do calculations, and it’s very flexible and powerful for what it does, but nobody would seriously suggest building a high-throughput application around it. Scalability and parallelism probably wouldn’t be a problem—power consumption and latency would be.

        We recommend Stan for when you need to perform full Bayesian inference to calculate posterior expectations accurately conditioned on static data. Andrew’s thinking about streaming data, but we’re nowhere near having a coherent solution yet, nor do I believe we ever will be able to do anything useful that’s not approximate or slow.

        I don’t understand why you think compiled C++ would be a bottleneck. It’s what companies like Google that really need to scale their front ends use. I think they’re a bit north of 20K queries/second, but they don’t have to be transactional the same way a bank transfer is.
