In many problems, such as the extremely popular recommender-system problem, we are very concerned about the number of 0’s and 1’s we correctly label (assuming a binary recommendation). In such problems, we really don’t care if we estimate p = 0, yet observe y = 1, but rather just the correct number of labels. As with almost every machine learning problem, we always have the overfitting/underfitting issue. This really rears an ugly head with the recommender problem; it’s my experience you are either guaranteed to miss clear strong trends OR occasionally fit some p = 1 with y = 0 across a wide range of models.

If you use out-of-sample logistic loss, you’ve chosen to miss clear strong trends. I cannot emphasize enough how silly these results get; my experience is that you will find yourself choosing a model with 75% accuracy over a model with 99% accuracy just to avoid one or two p = 1 cases (of millions).

Of course, you can avoid the infinite out-of-sample loss issue with penalties/priors…but these penalized models typically end up giving the exact same suggestions as the unpenalized methods, and still don’t seem to favor the 99% accuracy model over the 75% accuracy model.
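To make the complaint concrete, here is a minimal sketch with made-up numbers (the two prediction vectors are hypothetical, not from any real recommender). One model labels 99 of 100 cases correctly but assigns p = 1 to a single y = 0 case; the other hedges everything and gets only 75 right. Log loss prefers the hedger.

```python
import math

def log_loss(y_true, p_pred):
    """Mean negative log-likelihood; infinite if any observed outcome was given probability 0."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        q = p if y == 1 else 1.0 - p  # probability assigned to what actually happened
        if q == 0.0:
            return math.inf
        total -= math.log(q)
    return total / len(y_true)

def accuracy(y_true, p_pred):
    """Fraction of labels matched when thresholding the probability at 0.5."""
    hits = sum((p >= 0.5) == (y == 1) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

# Hypothetical data: 50 positives followed by 50 negatives.
y = [1] * 50 + [0] * 50

# Confident model: right on 99 cases, but says p = 1 for one y = 0 case.
p_confident = [1.0] * 50 + [1.0] + [0.0] * 49

# Hedged model: only ever says 0.6 or 0.4, and is wrong on 25 cases.
p_hedged = [0.6] * 38 + [0.4] * 12 + [0.6] * 13 + [0.4] * 37

for name, p in [("confident", p_confident), ("hedged", p_hedged)]:
    print(name, "accuracy =", accuracy(y, p), "log loss =", log_loss(y, p))
```

Clipping the probabilities away from 0 and 1 makes the first loss finite, but a single badly missed case can still dominate the average.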

]]>See Schervish, M. (1989). A general method for comparing probability assessors. The Annals of Statistics 17, 1856–1879. (https://projecteuclid.org/euclid.aos/1176347398)

Ben Levinstein has a draft paper that aims to explain this result to philosophers, which may be more accessible: https://www.dropbox.com/s/544yr9374bsyvop/Schervish%20draft%201.0.pdf?dl=0

]]>Bob: are you also familiar with “target shuffling”? It’s a resampling technique where you take your data, shuffle the targets, then refit and score the model; do this many times and you get a distribution of scores and your score on the real (unshuffled-target) data had better be to the right, otherwise your model isn’t actually doing what you think it’s doing. It’s not for model comparison, so it’s tangential to this discussion, but it came to mind and I was wondering what you thought.
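For anyone who hasn’t seen target shuffling, a minimal sketch of the procedure described above follows. The model (a one-predictor least-squares line), the score (in-sample R²), and the simulated data are all assumptions chosen to keep the example self-contained; any fit-and-score routine could be swapped in.

```python
import random

def fit_and_score(x, y):
    """Fit the least-squares line y ~ a + b*x and return its in-sample R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy

def target_shuffle(x, y, n_shuffles=200, seed=0):
    """Score the real data, then refit and rescore on shuffled targets."""
    rng = random.Random(seed)
    real_score = fit_and_score(x, y)
    shuffled_scores = []
    for _ in range(n_shuffles):
        y_perm = list(y)
        rng.shuffle(y_perm)  # breaks any real link between x and y
        shuffled_scores.append(fit_and_score(x, y_perm))
    # Fraction of shuffled fits that score at least as well as the real fit.
    frac_as_good = sum(s >= real_score for s in shuffled_scores) / n_shuffles
    return real_score, shuffled_scores, frac_as_good

# Simulated data with a genuine signal: y = 2x + noise.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(100)]
y = [2 * xi + data_rng.gauss(0, 1) for xi in x]

real_score, shuffled_scores, frac_as_good = target_shuffle(x, y)
print("real R^2:", round(real_score, 3))
print("best shuffled R^2:", round(max(shuffled_scores), 3))
print("fraction of shuffles at least as good:", frac_as_good)
```

If the real score sits inside the bulk of the shuffled scores, the model is picking up structure that is available even when the targets are noise.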

]]>Tom:

I think the key is when you write, “the objective is rarely 0/1 loss.” Or even log loss or squared loss. But it’s standard to use these for evaluation. Sometimes I think this works well, other times not so much. Research is needed to better understand this, but often people take very broad principles such as “Bayes” or “cross-validation” and assume that these solve all their problems, as if the Bayesian solution or the minimum cross-validation solution is necessarily best for their application.

]]>Yes, but that’s still predictive accuracy; we are just arguing about what constitutes the right measure of predictive accuracy. Right?

But Andrew talks about the *“difficulty of selecting among multilevel models using predictive accuracy”*

Now that seems different & bigger: It seems to circumvent / sidestep predictive accuracy itself. And then we are not arguing just about the right metric.

]]>A closely related point is that in many engineering applications, the model is known to be very far from reality. It is often chosen for computational convenience. For example, Kalman filters and HMMs are applied everywhere with full knowledge that the underlying dynamics are not linear and not Markovian. In such cases, isn’t it meaningless to even talk about “getting the right answer”? The model is a “curve fitting sponge”, and we are adjusting parameters to get it to “behave well”. (Speaking of sponges, look at the success of the “deep neural networks” for signal interpretation tasks!)

The ML community knows very well that 0/1 loss is rarely the right loss. We have a whole subfield of “cost-sensitive learning” that looks at other families of losses (e.g., where a false negative is much more expensive than a false positive).
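As a small illustration of the cost-sensitive idea (a sketch with hypothetical costs, not any particular library’s API): given a calibrated probability p of the positive class, the decision that minimizes expected cost flags a case as positive once p exceeds cost_fp / (cost_fp + cost_fn), which is 0.5 only when the two errors cost the same.

```python
def expected_cost(p, decide_positive, cost_fp, cost_fn):
    """Expected cost of a decision when the positive class has probability p."""
    if decide_positive:
        return (1.0 - p) * cost_fp  # we pay only if the case is actually negative
    return p * cost_fn              # we pay only if the case is actually positive

def best_decision(p, cost_fp, cost_fn):
    """Return True (flag as positive) when that has the lower expected cost."""
    return expected_cost(p, True, cost_fp, cost_fn) <= expected_cost(p, False, cost_fp, cost_fn)

def threshold(cost_fp, cost_fn):
    """Probability above which flagging as positive is the cheaper decision."""
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical costs: a false negative is nine times as expensive as a false positive.
print(threshold(cost_fp=1.0, cost_fn=9.0))           # threshold drops to 0.1
print(best_decision(0.2, cost_fp=1.0, cost_fn=9.0))  # flag it, despite p < 0.5
print(best_decision(0.2, cost_fp=1.0, cost_fn=1.0))  # symmetric costs: do not flag
```

Under 0/1 loss the p = 0.2 case is always labeled negative; with the asymmetric costs it is not, which is why the loss has to be part of the problem statement.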

Returning to the world of probabilistic models, there has been interest in the ML community in studying the properties of proper scoring rules. As with log loss, these are guaranteed to converge to a properly calibrated distribution. I mention them only to remind folks that log loss is not the only sensible choice.

Does anyone know of work on Bayesian or robust approaches to decision problems when I have uncertainty about the true application loss function? For example, maybe I’m interested in precision@K, but I don’t know the value of K. That has been the (very shaky) justification for optimizing AUC in the machine learning community. For very big data problems, it might make sense to define a prior over the loss function, fit a complex model, and then optimize the decision rule against the uncertainty over the loss function.

Similarly, I wonder if it is the case that by using a carefully-constructed probabilistic model one could obtain additional guarantees of various kinds of robustness? For example, obviously a big advantage of a good Bayesian probabilistic model is that I can first fit (and validate) the model, and then I can pose a wide variety of queries (each with its own query-specific loss) against the model without having to re-fit the model. Perhaps I’m more likely to detect errors in the data and/or errors in my model family if I pursue the BDA methodology? Are there any formal results along these lines, or do people find that in practice they need to tweak the model for each query?

]]>You can follow essentially the same methodology with non-probabilistic methods. This is basically the only way to debug a machine learning algorithm. Generate data for which you know the right answer and compare the predictions of the fitted “model” to the right answer. One can look at a variety of side predictions as well, a bit like posterior predictive checks.

]]>Yes, you need a Bayesian model to generate from the prior. But that’s not the point I was trying to make. There are really two points floating around related to computation and model evaluation.

1. Assessing whether your inference algorithm is correct — that is, it samples from the posterior or recovers the correct MLE parameters. Using simulated data gives you a case where you know the right answer. It’s unit testing of a sort for an algorithm.

2. Evaluating a model’s inferences for data — that is, do they make sense and are they useful. This is something like a predictive evaluation and my point is just that 0/1 loss isn’t a very good one. Given that the goal is downstream decision making when we talk about predictions, you really want to concentrate on calibration and sharpness of the model’s predictions from data for quantities of interest. To the extent that one model’s better than another, it’s that it gives you more useful predictions or more useful insight into data that you already have. The latter is the point Wei and Andrew were trying to make — log loss is a blunt tool (response is relatively flat in most of the predictive range) — you need to look at the structure of the predictions.

]]>But this can only be done if one is using a Bayesian Model in the first place, right? Or do you mean one could use your approach to compare two generic models agnostic to the underlying modeling technique used?

Is the goal deciding whether one model is better than another or whether one Bayesian model is better than another Bayesian model?

Again, sorry if I’m asking stupid questions.

]]>Oh well, maybe you will clarify it sometime between now and May. Or maybe I will forget about it. This must be the only blog with such a long lag time.

]]>Roger:

It got bumped for some of that horrible power-pose stuff, is scheduled to appear in May. I hope it’ll be worth the wait!

]]>You can simulate parameters from the prior, then generate data from the sampling distribution and check that you recover the parameters from the data with whatever your fitting procedure is. That’s what we encourage everyone to do with Stan.
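A minimal sketch of that simulate-and-recover check, in plain Python rather than Stan: the Beta(2, 2) prior, the binomial sampling distribution, and the use of exact conjugate posterior draws as the “fitting procedure” are all assumptions made to keep it self-contained. With a real sampler you would replace the posterior draws with its output.

```python
import random

def simulate_and_check(n_reps=500, n_trials=50, n_draws=400, seed=0):
    """Fraction of simulations whose 90% posterior interval contains the true parameter."""
    rng = random.Random(seed)
    a, b = 2.0, 2.0  # assumed Beta(a, b) prior on the success probability
    covered = 0
    for _ in range(n_reps):
        theta = rng.betavariate(a, b)                                    # parameter from the prior
        successes = sum(rng.random() < theta for _ in range(n_trials))  # data given theta
        # "Fit": draws from the exact posterior Beta(a + successes, b + failures).
        draws = sorted(
            rng.betavariate(a + successes, b + n_trials - successes)
            for _ in range(n_draws)
        )
        lo = draws[int(0.05 * n_draws)]      # empirical 5% quantile
        hi = draws[int(0.95 * n_draws) - 1]  # empirical 95% quantile
        covered += lo <= theta <= hi
    return covered / n_reps

coverage = simulate_and_check()
print("coverage of nominal 90% intervals:", coverage)
```

If the fitting procedure is correct, the coverage should land near 0.90 up to simulation noise; a value far from that points to a bug in the algorithm or in the simulation.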

For many models, you can run something relatively inefficient like MCMC and assuming you did it right, get the correct answer to within some arithmetic precision. That’s what Andrew did in *BDA* to test variational inference, for example.

When things get too hairy, usually due to multimodality, you can’t really do either. For example, nothing that fits LDA is ever going to recover the parameters from a simulation. Same for deep belief networks. Everything’s just too underlyingly unidentified.

]]>I didn’t mean to imply that log loss is the ultimate solution — that’s why I cited the papers with more nuanced discussions of calibration, which is where the statistical (and I believe inferential) action is at. And sorry to overgeneralize. I know there’s a whole lot going on under the heading of “machine learning” and “statistics” and I’m in part classifying for my own convenience.

But let’s just compare and contrast for a second.

1. In Alp, Rajesh, Andrew and Dave’s arXiv paper (classify the authors as you like) on variational inference in Stan, the only evaluation is on predictive performance, i.e. log loss [section 3; empirical study].

2. In *BDA* (the first author of which is the third author of the above paper), the only evaluation for variational inference is in terms of the similarity of the marginal posteriors for the parameters in the true posterior (calculated via MCMC) and the mean-field variational approximation (by design, mean-field ignores the covariance structure, though there are steps that can be taken, as in Tamara Broderick’s papers).

The approach in Alp et al.’s paper is similar to what I see again and again in NIPS papers, is the official evaluation protocol for the DARPA PPAML project, and it’s how all the Kaggle competitions are run. The approach in Andrew et al.’s book is what I often see in stats journals. That’s all I’m trying to say — there’s a correlation in how things are evaluated among papers published in certain venues.

And I could’ve sworn that you guys said the BUGS problems just weren’t big enough. Maybe I misheard or it was just a flippant remark, but I thought you guys really meant it. For example, I say that sort of thing in comparisons of HMC to Gibbs or random-walk Metropolis on very simple conjugate models; we designed Stan to be relatively efficient on hard problems, and the overhead on simple problems can tank its relative performance (more on this in future posts!).

]]>Naive question: Other than what you call an end-to-end measure what other ways do we have to verify that *“the model you wrote down is being estimated properly”?*

I can get the part about reasoning out which is the right end to end measure (e.g. log loss vs 0/1 vs AUC etc. ). But the part that confuses me is not having a measure.

]]>That’s what the Gneiting et al. paper’s getting at with its notion of calibration (N% of the actual values are expected to be in the N% posterior intervals), and sharpness (assuming calibration, narrower intervals are better).
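Here is a rough sketch of how interval calibration and sharpness might be checked. The two forecasters and the simulated outcomes are hypothetical: outcomes are drawn from a normal with standard deviation 1 around each forecast mean, one forecaster reports that standard deviation honestly, and the other claims 0.5.

```python
import random
from statistics import NormalDist

def central_interval(mean, sd, level=0.9):
    """Central predictive interval of a normal forecast."""
    tail = (1.0 - level) / 2.0
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

def coverage_and_width(intervals, observed):
    """Calibration: fraction of outcomes inside their interval. Sharpness: mean interval width."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, observed))
    mean_width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return hits / len(observed), mean_width

rng = random.Random(0)
means = [rng.uniform(-5, 5) for _ in range(2000)]
observed = [rng.gauss(m, 1.0) for m in means]  # true predictive sd is 1

honest = [central_interval(m, 1.0) for m in means]         # reports the right sd
overconfident = [central_interval(m, 0.5) for m in means]  # reports half the right sd

cov_honest, width_honest = coverage_and_width(honest, observed)
cov_over, width_over = coverage_and_width(overconfident, observed)
print("honest:        coverage", cov_honest, "mean width", round(width_honest, 2))
print("overconfident: coverage", cov_over, "mean width", round(width_over, 2))
```

The overconfident forecaster is sharper but not calibrated, so its narrower intervals do not count in its favor; sharpness is only compared among forecasts that pass the calibration check.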

The reason you can’t do this in a machine learning bakeoff is that not everyone’s going to give you a Bayesian posterior CDF. Some people still use SVMs, or heuristics, or completely uncalibrated approaches like naive Bayes.

]]>I did notice that January is the new October…

]]>This whole topic seems to be yet another instance where people first try some intuitive ad-hoc procedure, discover after long trial and error it fails often, only to discover after an even longer theoretical investigation that they should have just stuck with the simple Bayesian result which can be written down in minutes.

If you consider P(Model|d1,…,dn) it naturally factors so the model is judged as a product of out of sample predictions. But it does so in a way that clears up a host of problems. Consider one example inspired by Gelman’s article.
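For reference, the factorization mentioned above is just Bayes’ rule followed by the chain rule (the notation, with M for the model, is mine):

```latex
P(M \mid d_1, \dots, d_n) \;\propto\; P(M)\, P(d_1, \dots, d_n \mid M)
  \;=\; P(M) \prod_{i=1}^{n} P(d_i \mid d_1, \dots, d_{i-1}, M)
```

Each factor in the product is the model’s predictive probability for d_i given only the data that came before it, which is the sense in which the model is judged on out-of-sample predictions.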

Suppose you’re trying to model the proportion of the vote total a candidate gets in an election. The true answer is .4. Model_1 estimates .42 while Model_2 estimates .44. The article claims the predictive log loss of these two is fairly similar even though Model_1 is often practically significantly more ‘accurate’ than Model_2. The solution evidently is to use some other comparison which magnifies the difference and makes Model_1 look much better.

But is Model_1 really better than Model_2? The point estimate is closer to the true value, but what if we had:

Model_1: .42 +/- .000001

Model_2: .44 +/- .05

Now which model is better? Imagine, for example, that the other candidate gets a .41 fraction of the vote. Model_1 would convince you the first candidate is the winner when actually they lose. Model_2 would correctly warn you there’s too much uncertainty involved to make a definite prediction. The more “accurate” model is the one which fools you into making a mistake!
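A quick numerical version of this comparison, under the assumption (mine, not stated above) that each “+/-” is the standard deviation of a normal predictive distribution. The log predictive density at the true value .4 then looks like this:

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of Normal(mean, sd) at x, computed in log space to avoid underflow."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

truth = 0.40
logp_model_1 = normal_logpdf(truth, mean=0.42, sd=0.000001)  # closer point estimate, tiny uncertainty
logp_model_2 = normal_logpdf(truth, mean=0.44, sd=0.05)      # worse point estimate, honest uncertainty

print("Model_1 log density at truth:", logp_model_1)
print("Model_2 log density at truth:", logp_model_2)
```

Model_1 puts the truth 20,000 standard deviations from its mean, so its log density is on the order of minus two hundred million, while Model_2 comes out slightly positive. Scoring the whole predictive distribution rather than the point estimate ranks Model_2 far ahead.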

What’s happening here is that the simple equations of probability theory naturally “calibrate” the models from their predictive performance, but they do so in a very slick way. They don’t simply calibrate the point estimates or any variation of them the way Frequentists would want, they in essence *calibrate a combination of point estimates and the uncertainties*. The only question remaining is how long everyone plans on fiddle farting around with log loss or whatever until this sinks in and they just do it right.

Also to slightly play contrarian, I dislike AUCs as well, but log loss is not an end-all solution. There’s no single proposal that statisticians can hand out and which everyone can understand without putting much effort and research into studying the topic. Assessing model fit to particular components of the data using PPCs is ideal, but it does not promise as much generality as a single comprehensive quantity (I personally would argue against that to begin with but that’s another story). If one really wanted to go that way, it still seems that we statisticians are stuck in debates on this matter: LOO-CV following, say, Aki, Andrew, and Jonah’s paper makes progress in this direction, but this is a 2015 paper, not some 1981 paper we can point to and which has been tested through time. Moreover, these principles significantly hinge on importance sampling, and it’s unknown how practical this is for many of the problems machine learners are concerned about.

Early stopping, ad hoc regularization rules, and so on are a different story. These are done because many do not mind coupling model+inference. The two are tied together in many people’s philosophy, so long as the fitted model at the end “performs well”. I think this can in principle be wrong too, because it’s difficult to know what made the fitted model succeed (the approximate inference or the posited model?), and thus it is difficult to iterate over the data analysis procedure. However, in practice, it can be too restrictive to assume this separation when the fitted model is all that matters at the end, and post hoc procedures are done to improve its success. I think this is what often drives much of the practical success of machine learning on “complex models” and “massive data”, with which many applied statistics papers do not concern themselves.

> But Alp and Dustin (or their advisor, Dave Blei) didn’t seem surprised and weren’t even particularly concerned. They said (and I paraphrase), “That problem’s too easy.”

I can’t speak exactly for Alp or Dave, but I know they haven’t said that… The problem is very nuanced and I don’t know why you’re bringing it up here: if a linear model is initialized with parameters at one million and the true set of parameters is at 10, then there’s absolutely no way any stochastic optimization algorithm would practically converge to the true set of parameters. The numerical precision of the step-size decay would cause the algorithm to eventually “converge” to the middle. We’ve been repeating this. Of course there’s no way we could “tune” ADVI to get that to work without additional innovations on convergence diagnostics (multiple chains of stochastic optimization require something analogous to, but not the same as, R-hat, and principled ways of checking inference still need to be developed).

To summarize, I think ideologically we can come up with as many proposed (and clearly not end-all) solutions as we want; however, there really is a difference between what we can state in principle and what we can state and do in practice. Some things should certainly be done immediately, but the lack of knowledge about such practices prevents it; we should make a concerted effort to change these. Other things are more debatable, and we don’t have a clear idea how to separate these; telling everyone to clearly separate their models from inference just is not practical.

]]>Likewise. Although thinking in terms of generative probabilistic models seems completely natural (and correct) to those who do it, it’s very learned. And it’s especially not the sort of way you’d think about the world if you spend your time learning more classical machine learning (tree/nnet/SVM)-based techniques.

]]>**2 + 2 = 5?**

What surprised me most of all about my discussions at the PPAML PI meeting was the lack of concern for getting the right answer. By that I mean that if I write down a Bayesian model (joint density) and give you some data, I want the right inferences for the posterior. When you’re focused on an end-to-end measure, even something like root-mean-square-error or log loss, in some sense it doesn’t matter if the model you wrote down is being estimated properly.

You see this a lot in machine learning (or at least used to) with things like “early stopping” rules (there’s even a Wikipedia page for early stopping!). You’d build a model, write in the paper that you were doing an MLE, then run a hill climbing algorithm for only a few iterations. The result is a kind of ad-hoc regularization, which can help with inference. But rather than saying, “hey, the model’s wrong, I need a prior or regularizer”, the approach of just using cross-validation to pick a number of iterations to run an iterative method was often used.
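A toy illustration of early stopping acting as a regularizer, with hypothetical data and a one-parameter regression through the origin (both chosen only for the sketch): gradient descent started at zero and stopped after a few steps returns a slope shrunk toward the starting point, much as a penalty toward zero would, even though the stated objective is the unpenalized MLE.

```python
def gradient_descent(x, y, n_steps, lr=0.05):
    """Minimize mean squared error of y ~ w*x by gradient descent, starting from w = 0."""
    w = 0.0
    n = len(x)
    for _ in range(n_steps):
        grad = -sum(xi * (yi - w * xi) for xi, yi in zip(x, y)) / n
        w -= lr * grad
    return w

# Hypothetical data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

w_mle = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)  # closed-form least squares
w_early = gradient_descent(x, y, n_steps=3)    # "early stopped"
w_long = gradient_descent(x, y, n_steps=200)   # run to convergence

print("closed-form MLE:", w_mle)
print("after 3 steps:  ", w_early)
print("after 200 steps:", w_long)
```

The early-stopped estimate sits between the initial value and the MLE, so the number of iterations is behaving like an undeclared regularization parameter.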

We ran into the same issue with our own variational inference tool. It recovered means on fairly complex models with a lot of data (the ones used in the NIPS paper eval), but when Ben tried to run it on models from the Gelman and Hill book or the BUGS examples, it got the wrong answer. That’s why it didn’t show up in RStan right away and is still labeled **experimental**. Ben’s solved some of this problem by using a QR decomposition of predictors under the hood in RStanARM.

But Alp and Dustin (or their advisor, Dave Blei) didn’t seem surprised and weren’t even particularly concerned. They said (and I paraphrase), “That problem’s too easy.” Hence Andrew’s comment, “We put in 2+2 and it spit out 5.” (Though it was more like we put in 2+2 and it spit out 0.05 or 50; we were OK with putting in 2000 + 2000 and getting 4100, since we expected that order of bias.)

**Log loss is really flat; decisions are where it’s at**

I’ve done more thinking on log loss and it’s very flat through most of the response, which is what Wei and Andrew were getting at in the linked paper above. It heavily penalizes tail mistakes (estimating a 0.001 chance of something that happened), but the difference between a 0.1 and a 0.15 prediction is negligible. So indeed this all needs to be put into some kind of more realistic decision problem, as Alex D says above and as I was getting at in the applications.
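To put numbers on the flatness, here is the log loss penalty, in nats, for an event that actually happened, at the three predicted probabilities mentioned above:

```python
import math

def penalty(p):
    """Log loss (in nats) for an event that happened and was assigned probability p."""
    return -math.log(p)

for p in (0.001, 0.1, 0.15):
    print(f"p = {p}: penalty = {penalty(p):.3f}")

tail_gap = penalty(0.001) - penalty(0.1)  # cost of the tail mistake relative to 0.1
mid_gap = penalty(0.1) - penalty(0.15)    # gain from moving 0.1 up to 0.15
print(f"tail gap = {tail_gap:.3f}, mid-range gap = {mid_gap:.3f}")
```

The mid-range gap is log(1.5), about 0.41 nats per case, against about 4.6 nats for the tail mistake, so averaged over many cases a handful of tail errors can swamp a consistent mid-range improvement.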

**Sharpness and Calibration**

I’m writing up a tutorial for Stan on hierarchical modeling with a focus on predictive posterior inference and model checking. I’m using a binomial model for repeated binary trial data. Now the prediction is a number, and in that situation, it’s hard to imagine doing 0/1 loss. You could do something like root-mean-square-error, but it seems more natural to set it up probabilistically as an interval prediction problem that can be calibrated.

Michael Betancourt sent me references to the following when I asked what the concept of “sharpness” was called:

Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007) Probabilistic forecasts, calibration and sharpness. *Journal of the Royal Statistical Society: Series B* (Statistical Methodology), 69(2), 243–268.

You also can’t go wrong reading Michael’s own words on the subject,

Michael Betancourt (2015) A Unified Treatment of Predictive Model Comparison. *arXiv*:1506.02273.

There, I wound up writing a whole blog post as a comment.

]]>Bob is spot-on in talking about precision: I’ve done a lot of work with investigators (fraud, etc) and all of them ultimately care about precision.

]]>Can we get some context for this? You want your loss function to correspond to the evaluation metric so that is what your algorithm optimizes. Otherwise it will optimize something else you may or may not care about. I would think this is the first thing people figure out.

]]>As Bob said, model calibration (and log-loss as a measure for assessing this) is the “right” principle to follow if you have no knowledge of what downstream decisions are going to be made with the output. However, if you do have knowledge of the decision procedure, then calibrating the overall decision rule is more important than calibrating the model. In some of the industrial examples Bob gave where precision or recall had high priority, the end user would likely be completely comfortable with a model that misclassified certain rare cases with probability 100%, resulting in infinite log-loss, if this meant that misclassifications in the _other_ direction were minimized. If the model is really meant to make a single decision, then wasting degrees of freedom calibrating parts of the model that aren’t relevant to the decision-maker’s implied loss function is counterproductive. Of course, if the decision-maker wants to make many distinct decisions on the basis of the model, we approach the “generic decision-making” context that Bob mentioned, and log-loss becomes a more attractive option. And if we zoom all the way out to scientific users who want to add their model to the corpus of human knowledge, and thus think it should be used in _all_ downstream decisions going forward, then log-loss is clearly the right metric.

There’s a problem here that sort of mimics the structure of cross-validation. Ideally, we would think about methodological inquiries in academic settings as being scale-models of how the methods would be applied and evaluated in the real world, in the same way we’d like to have our held-out samples serve as proxies for truly out-of-sample data. But it seems that the evaluation mechanisms we’ve set up for Statistics and ML methods, where a few scalars in a table quantify “performance”, are in most cases fundamentally incongruent with real-world applications. It might be too much to ask for researchers to go out and implement a real-world case study where their new method helped somebody make better decisions (although this is absolutely the ideal), but a better incorporation of (perhaps hypothetical) use-cases in model evaluation might be nice.

]]>I come across this mindset numerous times. I think it stems from not thinking probabilistically (which is ironic), but rather in terms of yes/no classification (a single machine either fails or doesn’t).

]]>I’d love to see people go beyond log loss and use machine learning algorithms and evaluation metrics that involve the correlation structure. If we think of log likelihood as log(P(observed outcomes | predicted probabilities, assuming independence)), then it’d be great to ask algorithms to include the correlation structure in their predictions, and remove the independence assumption from the evaluation metric. Anyone know if something like that is happening already?

]]>