This was something we used a few years ago in one of our research projects and in the paper, Difficulty of selecting among multilevel models using predictive accuracy, with Wei Wang, but didn’t follow up on. I think it’s such a great idea I want to share it with all of you.

We were applying a statistical method to survey data, and we had a survey to work with. So far, so usual: it’s a real-data application, but just one case. Our trick was that we evaluated our method separately on 71 different survey responses, taking each in turn as the outcome.

So now we have 71 cases, not just 1. But it takes very little extra work because it’s the same survey and the same poststratification variables each time.

In contrast, applying our method to 71 different surveys would be a lot of work, as it would require wrangling each dataset, dealing with different question wordings and codings, etc.

The corpus formed by these 71 questions is not *quite* the same as a corpus of 71 different surveys. For one thing, the respondents are the same, so if the particular sample happens to overrepresent Democrats, or Republicans, or whatever, then this will be the case for all 71 analyses. But this problem is somewhat mitigated if the 71 responses are on different topics, so that nonrepresentativeness in any particular dimension won’t be relevant for all the questions.

Just an aside on the interesting paper you linked to: it seems to me the primary problem lies in the use of the log-loss, not in evaluating models in terms of their predictive accuracy. Would you agree? The problem (which you illustrate in Section 1.2 and mention again in the Discussion) is that the log-loss doesn’t really care about small differences between moderate probabilities (e.g. .38 and .41) even though — as you point out — those differences can sometimes be practically significant. However, there are scoring rules that are sensitive to those differences, e.g. the Brier scoring rule. So comparing the predictive accuracy of models through the use of the Brier rule might be better in this case.

Here’s a plot of log loss and Brier loss (aka squared loss). Just to be clear,

$latex \mbox{logLoss}(y, \hat{y}) = -\log \mbox{bernoulli}(y \mid \hat{y}) = -\log \mbox{ifelse}(y, \hat{y}, 1 - \hat{y}).$

$latex \mbox{brierLoss}(y, \hat{y}) = (y - \hat{y})^2 \approx -\log \mbox{normal}(y \mid \hat{y}, 1).$
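For concreteness, here's a minimal Python sketch of the two losses for a binary outcome $latex y \in \{0, 1\}$ (the function names are mine, not from any particular library):

```python
import math

def log_loss(y, y_hat):
    """Negative log Bernoulli likelihood: -log(y_hat) if y == 1, else -log(1 - y_hat)."""
    return -math.log(y_hat if y == 1 else 1 - y_hat)

def brier_loss(y, y_hat):
    """Squared error between the outcome and the predicted probability."""
    return (y - y_hat) ** 2

# For y = 1, a prediction of 0.5 costs log 2 under log loss and 0.25 under Brier loss.
```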

Plotting these out for the case where $latex y = 1$, we get

Squared error certainly has a lower dynamic range, but it still flattens small differences by squaring them.

Hey, who turned off the MathJax?

The scale of these metrics doesn’t matter, so to visualize which is more sensitive to small differences, it helps to put them on a common scale. I arbitrarily chose to give them the same loss for predicting 0.5 all the time. That then looks like this:

Scaling them this way, log loss penalizes both near misses and extreme misses more. But either way, near misses are dominated by larger misses when the total loss is just averaged over items, as it is for log loss and squared loss. For example, with squared loss, an error of 0.25 contributes about 0.06 squared error, whereas an error of 0.5 contributes 0.25, four times as much.
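To put numbers on that, here's a quick sketch of the rescaling and the near-miss arithmetic (the constant makes both losses equal 0.25 for a prediction of 0.5, as in the plot; names are mine):

```python
import math

# Rescale log loss so both losses agree at y_hat = 0.5, where raw log loss
# is log 2 and squared loss is 0.25.
c = 0.25 / math.log(2)

def scaled_log_loss(y, y_hat):
    """Log loss rescaled to match squared loss at y_hat = 0.5."""
    return -c * math.log(y_hat if y == 1 else 1 - y_hat)

def brier_loss(y, y_hat):
    """Squared error between the outcome and the predicted probability."""
    return (y - y_hat) ** 2

# Under squared loss, an absolute error of 0.25 costs 0.0625 and an error
# of 0.5 costs 0.25, four times as much, so big misses dominate the average.
print(brier_loss(1, 0.75), brier_loss(1, 0.5))  # 0.0625 0.25
```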

Thank you for doing that, Bob. You are definitely right that the squared loss also flattens out small misses, but relatively speaking it’s less extreme than the log loss, as is evident from the overall shapes of the two curves. Indeed, the log loss is unbounded whereas the quadratic loss is bounded (between 0 and 1, let’s say), and often the range of the quadratic-loss scale that’s realistically utilized will be even smaller. For example, in Andrew’s example in his paper (with true probabilities of 0.4 and 0.6 for two mutually exclusive events), the optimal prediction has a quadratic loss of 0.24, and the worst of the predictions considered by Andrew (which assigns probabilities of 0.44 and 0.56) has a quadratic loss of 0.2416. 0.2416 maybe doesn’t sound very different from 0.24. However, a random guess (arguably the baseline worst prediction) has a quadratic loss of just 0.25, so relative to the range of the scale that’s realistically used, 0.2416 is 16% worse than the optimal 0.24. The log loss uses a larger part of its scale, however, so small misses can more easily be swamped when the losses are tallied up. Anyway, just some ideas. I don’t know what difference switching the loss function would make in the examples considered by Andrew. Maybe no difference.
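Olav's numbers can be checked directly. The expected quadratic loss of predicting probability p for an event with true probability q is q(1 − p)² + (1 − q)p²; this is my own restatement of his arithmetic, not code from the paper:

```python
def expected_brier(q, p):
    """Expected squared loss of predicting probability p for an event
    that actually occurs with probability q."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

optimal = expected_brier(0.4, 0.4)    # 0.24
bad     = expected_brier(0.4, 0.44)   # 0.2416
coin    = expected_brier(0.4, 0.5)    # 0.25

# The bad prediction covers 16% of the gap between the optimal prediction
# and a coin flip: (0.2416 - 0.24) / (0.25 - 0.24) = 0.16.
```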

Olav:

Interesting discussion. I’ll have to think about this.

The rescaled plot shows that absolute errors of 0.5 or less are a bit more flattened out by squared loss than log loss.

From 0.5 to 0.9 absolute error, there’s slightly more flattening from log loss.

Beyond 0.9 absolute error, the log loss grows sharply. So if there are lots of big misses (e.g. predictions of less than 10% for events that occur) then the log loss will be dominated by those.

So which does more flattening is going to boil down to what the absolute error distribution looks like.

Maximum likelihood estimates (penalized) try to minimize log loss on the training set (plus a negative log penalty term for the penalized version). For Bayes, we sample from the posterior density; sampling doesn’t find an optimal set of parameters for log loss, but that’s still the objective of the density from which we’re sampling. So we can calculate expected loss (for any of the loss measures we might want to use).
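That last step can be sketched as a Monte Carlo average: given posterior draws of the event probability, the posterior-expected log loss is just the mean loss over the draws (a toy example with made-up draws, not code from any of the papers discussed here):

```python
import math

def expected_log_loss(y, theta_draws):
    """Posterior-expected log loss: average of -log p(y | theta) over draws."""
    losses = [-math.log(t if y == 1 else 1 - t) for t in theta_draws]
    return sum(losses) / len(losses)

# Toy posterior draws for the probability that y = 1.
draws = [0.55, 0.6, 0.65]
print(expected_log_loss(1, draws))
```

The same template works for any loss: swap the per-draw term for squared loss to get posterior-expected Brier loss.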

I don’t quite agree with your analysis. I’d say that the loss function that’s more flattening is the one that assigns more similar scores to probabilities that are close to each other. E.g., the difference between the (rescaled) log scores assigned to 0.31 and 0.3 is 0.0118. The difference in quadratic scores, on the other hand, is 0.0139. So the log score is more flattening in this case. More generally (for nicely behaved functions), function F will be more flattening than function F’ wherever F has a derivative of smaller magnitude than F’. If we compare the absolute value of the derivative of the (rescaled) log loss to the absolute value of the derivative of the quadratic loss, we see that the log score is more flattening for moderate probabilities, roughly between 0.24 and 0.76. That also makes sense given their graphs, I think.
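The 0.0118 vs. 0.0139 comparison is easy to reproduce; the rescaling constant below is the same one that equalizes the two losses at a prediction of 0.5 (function names are mine):

```python
import math

c = 0.25 / math.log(2)  # rescale log loss to match squared loss at y_hat = 0.5

def scaled_log_loss(p):
    """Rescaled log loss of predicting probability p when y = 1."""
    return -c * math.log(p)

def brier_loss(p):
    """Squared loss of predicting probability p when y = 1."""
    return (1 - p) ** 2

log_diff   = scaled_log_loss(0.3) - scaled_log_loss(0.31)
brier_diff = brier_loss(0.3) - brier_loss(0.31)
print(round(log_diff, 4), round(brier_diff, 4))  # 0.0118 0.0139
```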

In any case, the real difference between the quadratic and log rule is in how they treat low probability errors, as you point out. If there aren’t any of those, the (rescaled) log score and quadratic score will behave pretty similarly. The main reason the quadratic score is (in many cases) more sensitive to differences between moderate probabilities than the log score is really that it’s not hypersensitive to differences between low probabilities.

(There are two senses in which a loss function can be said to “flatten” errors, but it’s the relative kind from my last comment that I think is more important. Note that in my example, with probabilities of 0.31 and 0.3, neither the log nor the quadratic loss actually flattens the difference in error between 0.31 and 0.3; they both exaggerate it. However, for probabilities larger than 0.5 both the quadratic and the log loss flatten differences in errors.)