This well-known paradox of R-squared is still buggin me. Can you help me out?

There’s this well-known—ok, maybe not well-enough known—example where you have a strong linear predictor but R-squared is only 1%.

The example goes like this. Consider two states of equal size, one of which is a “blue” state where the Democrats consistently win 55% of the two-party vote and the other which is a “red” state where Republicans win 55-45. The two states are much different in their politics! Now suppose you want to predict people’s votes based on what states they live in. Code the binary outcome as 1 for Republicans and 0 for the Democrats: This is a random variable with standard deviation 0.5. Given state, the predicted value is either 0.45 or 0.55, hence a random variable with standard deviation 0.05. The R-squared is, then, 0.05^2/0.5^2 = 0.01, or 1%.

There’s no trick here; the R-squared here really is 1%. We’ve brought up this example before, and commenters pointed to this article by Rosenthal and Rubin from 1979 giving a similar example and this article by Abelson from 1985 exploring the issue further.

I don’t have any great intuition for this one, except to say that usually we’re not trying to predict one particular voter; we’re interested in aggregates. So the denominator of R-squared, the total variance, which is 0.5^2 in this particular case, is not of much interest.

I’m not thrilled with that resolution, though, because suppose we compare two states, one in which the Democrats win 70-30 and one in which the Republicans win 70-30. The predicted probability is either 0.7 or 0.3, hence a standard deviation of 0.2, so the R-squared is 0.2^2/0.5^2 = 0.16. Still a very low-seeming value, even in this case you’re getting a pretty good individual prediction (the likelihood ratio is (0.7/0.3)/(0.3/0.7) = 0.49/0.09 = 5.4).

I guess the right way of thinking about this sort of example is consider some large number of individual predictions . . . I dunno. It’s still buggin me.

P.S. Mathias explains in a comment.

105 thoughts on “This well-known paradox of R-squared is still buggin me. Can you help me out?

  1. I think perhaps the discomfort it induces is the main value of this example. It reminds us that even if our model perfectly captures group-level trends, it may be very bad at predicting any individual outcome.

    Keeping this in mind might help us avoid the kind of errors that come up often on this blog, where a researcher convinces themselves that they have found a large consistent effect based on estimating a group average that is, in fact, swamped by individual variability.

    • Now suppose you want to predict people’s votes based on what states they live in… Given state, the predicted value is either 0.45 or 0.55

      Maybe look at it like using a test with false positive rate of 45%.

    • Guess I love this example to illustrate the non-intuitive nature of squares. In all these cases, r^2 is weird but the square root is deeply intuitive, r is the two party margin and that makes a great deal of sense to me.

      • I also find it interesting. Originally the Pythagorean theorem was literally summing the area of squares. Ie, the area of a square made from the longer side of a right triangle is the sum of the areas of squares made from the two shorter sides. Squares were considered *more* intuitive than lengths.

        If you look closely, r^2 is a dimensionless ratio of two areas. It isn’t intuitive at all why that should have anything to do with the “error” of a model fit though. Say you expect a 5 unit long piece of pipe, but measure 7 units. You wasted 2 extra units of pipe, but the squared error is then 4. Or try to make a 5 x 5 unit table but accidentally make it 5 x 7. Your table now has area of 35 units^2 vs the 25 expected. You erroneously used 24 sq units too much material, but the measurement error was 2 units.

        That got me looking for the history here. Apparently this is the original source of the r^2 concept:
        Wright, Sewall (January 1921). “Correlation and causation”. Journal of Agricultural Research. 20: 557–585.

        From searching it looks unavailable online for some reason, but I found this published a year later:

        The present paper is a critical discussion of the latest solution offered, the method of “path coefficients” as proposed by WRIGHT(1921 a).

        […]

        “Causation” has been popularly used to express the condition of association, when applied to natural phenomena. There is no philosophical basis for giving it a wider meaning than partial or absolute association. In no case has it been proved that there is an inherent necessity in the laws of nature. Causation is correlation.

        In his “Grammar of Science,” PEARSON (1900), who developed the product-sum correlation coefficient now used, says in regard to scientific law and causation:

        […]

        “No phenomenon or stage in a sequence has only one cause, allantecedent stages are successive causes, and, as science has no reason to infer a first cause, the succession of causes can be carried back to the limit of existing knowledge and beyond that ad infinitum in the field of conceivable knowledge. When we scientifically state causes we are really describing the successive stages of a routine of experience. ‘Causation’ says JOHN STUART MILL ‘is uniform antecedence’ and this definition is perfectly in accord with the scientific concept.”

        […]

        “The causes of any individual thing thus widen out into the unmanageable history of the universe. The causes of any finite portions of the universe lead us irresistibly to the history of the universe as a whole.”

        […]

        This method is claimed by WRIGHT(1921 a, b) to provide a measure of the influence of each cause upon the effect. Not only does it enable one to determine the effects of different systems of breeding, but provides a solution to the important problem of the relative influence of heredity and environment. To find flaws in a method that would be of such great value to science if only it were valid is certainly disappointing.

        The basic fallacy of the method appears to be the assumption that it is possible to
        set up a priori a comparatively simple graphic system which will truly represent the lines of action of several variables upon each other, and upon a common result.

        https://pubmed.ncbi.nlm.nih.gov/17245982/

  2. Hi,

    I have been wrestling with the properties of R2 some myself, which has resulted in an article now accepted for publication in Psychological Methods, where I argue that it is often more informative to look at relative standard deviations rather than variances. I also present what I think is a new measure for this purpose. The final version of the article is a bit focused on that measure, however, I think all three standard deviation measures in the article have their own benefits. I was actually in the process to send it to you to hear if you were willing to share any thoughts, so this felt like a good opportunity. I’d love to hear what people think.

    The accepted manuscript version is available here over ResearchGate: https://www.researchgate.net/publication/380668267_Coefficients_of_determination_measured_on_the_same_scale_as_the_outcome_Alternatives_to_R2_that_use_standard_deviations_instead_of_explained_variance
    and here over PsyArxiv:
    https://osf.io/preprints/psyarxiv/svuf8

  3. I genuinely don’t understand what the paradox is supposed to be? Individual level behavior can, and generally is, much noisier than aggregate behavior. In many cases it would be hard for this not to be true. In cognitive psychology, the Stroop interference effect is incredibly robust and if you average 20 trials from any subject in any typical paradigm you will reliably find it, but the R^2 at the level of a single trial will be pretty small.

    Confusion sometimes arises because researchers often lose track of what exactly they are explaining: is it the average response in an experimental condition, or is it a single response to an individual trial? For us, the answer is almost always the former, but people may talk (or think) like it is the latter, and thus vastly overestimate their predictive power.

    • Seth:

      At some level, yeah, sure, that’s gotta be it. But an R-squared of 1% . . . it seems so low!

      I guess the problem is that it’s hard to have good intuition about individuals when they are defined based on aggregate characteristics. Suppose, for example, I want to predict how someone is going to vote, knowing only that this person lives in New Jersey or Texas. This doesn’t tell us much, but when we’re only given that one piece of information, it’s natural for us to overweight it and not think about the variability.

      • Just like it may be “natural” to think that a librarian is a woman and a police officer is a man. But is it really surprising that many Texans vote Democrat and many police officers are women? It’s a stretch to call that low correlation a paradox – let alone a well-known one.

        • Carlos:

          The Democrats got 47% of the two-party vote in Texas in the past presidential election. This is no secret. The Democrats’ share of the two-party in Texas has got to be much closer than 50% than the proportion of police officers who are women or the proportion of librarians who are men.

          The R-squared thing is a paradox in the sense that we generally think of an R-squared of 1% as representing a very weak predictor. As various commenters have pointed out, the state you live in can be a strong predictor of political outcomes in the state while being a weak predictor of individual voting. As with just about all paradoxes, this makes sense when looked at from one direction but still can be difficult to internalize.

        • I got the librarian and policeperson examples looking for splits around 60/40 in a random table of occupations by sex (I noticed later it was for the UK).

          I didn’t think that a R² around 5% instead of 1% would make a lot of difference and both would be paradox-worthy. I imagined that you could have written a slightly different example with 60-40 and 40-60 states, an even “stronger” linear predictor with a R² of only 4%.

      • Hi Andrew,

        I think I did not manage to post my reply before (if its just awaiting approval or something, then feel free to remove this repetition).

        Does 10% still sound too little? If not, I do not think it is really about aggregation, but because R^2 squares the contributions from model an error before comparing them.

        Let’s take another example: Suppose you take repeated measures of an object and examine the error of the apparatus used for the measurement, and suppose that its error is independent between measurements, and equals the sum of two independent error-components, one with a standard normal distribution, and one with a distribution 9 times as large (i.e. a normal distribution with mean 0 and standard deviation 9). Perhaps they come from two different parts of the apparatus, for example, where one part is less robust.

        As R^2 squares these contributions, if we could model the first error component to predict the total error of the apparatus, the population R^2 of that model would equal 1^2 / (1^2 + 9^2) = 1 / 82, or about 1.2%. Similarly, for a model of the second error component, R^2 equals 81 / 82. The second component thus appears to have 81 times the determination of the first component, even though it only has 9 times the contribution, and even though the error that therefore remains after controlling for the other parts is 9 times as large, and so on. This does not appear to be due to aggregation: we are capturing one (continuous) part of the contributions, with one-ninth the contribution of what remains uncaptured. But because R^2 is about variances, it squares these contributions, the difference will seem larger, and so the smaller component will seem even smaller in comparison when interpreted through R^2.

        In the voting example, it is a little more complicated, as the distribution of the error will not just be a scaled version of the distribution of the model (which is why |R| might be a more direct comparison in this example, as others have noted, as it compares model to outcome, and as the outcome is a scaled version of the model here). However, if we are happy with summarizing the spread of the distribution of the error to that of the model through their standard deviations, we can see that the distribution of the model is about 0.05 / 0.497 that of the model, or just above one-tenth as large. If these values are not squared and compared by 0.05 / (0.05 + 0.497), you get a value of about 9.1% here, closer to the 10% of |R| rather than the 1% of R^2.

        If one thinks about “determination” as the contributions of these parts on their original (rather than squared) scale, then R^2 can be far off (which is why I present the different measures in my article linked above).

        If 10% still sounds too little, then perhaps it is something else, as noted by others. I am not a political expert, but I suppose what matters to how different the states are in their politics is more how many years one party is continuously in power than another. If it tends to win by 55-45 or 70-30 matters less than how long it can stay in power and enact its politics in that state.

  4. If you know the outcome is binary, why not use logistic regression? I don’t think R-squared works for logistic regression, so maybe this is just some artifact of using a continuous outcome model for binary outcome data?

    • I’m not sure who is teaching those things “R2 does not work for logistic regression”. R2, in the sense Andrew is using, is simply a measure of variance reduction. This is defined independently of any parametric model.

      • Sure, logistic regression has various measures of variance reduction (I’ve only seem them referred to as “psuedo-R2”), but the calculations that Andrew walked through are for the coefficient of determination used in OLS. The calculations for the psuedo-R2 measures used in logistic regression are different, and I don’t know if they will yield the same “paradox”. I guess someone could simulate it and find out…

        • That’s incorrect, Andrew computed the variance of a Bernoulli random variable, this has nothing to do with a linear model.

  5. Doesn’t R-hat just measure how much the covariates allow you to explain the outcomes with a linear model? I’d say the problem here is the pair of subjective labels, “strong linear predictor” and “the two states are much different”? Binary data is very weak and binary predictors are very weak. So it’s not surprising that a state indicator doesn’t reduce the prediction variance much on an individual-by-individual basis, which is what is being measured here.

    • pretty sure you meant R^2? R-hat being a statistic that measures convergence of MCMC (and a statistic you use regularly so makes sense you’d default to typing that)

      • Oops. I indeed meant R2, which I’ve never used, though I have seen decorating linear regression plots. (By the way, that’s me acting like a language model, because R-hat is much more prominent in my world than R2.) Before responding, I should have read the Wikipedia page on R2, which is titled “Coefficient of determination.” It very clearly states up front that the value is

        the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

        This definition is clearly general, but every single plot on the page is for a linear regression. Usually I find the Wikipedia too general and abstract and measure-theoretic when it comes to discussing stats, but here it’s just the opposite—it quickly dives into standard practice rather than continuing in full generality.

        The definition is for a statistic, which I’m pretty sure means it’s about the fit in a specific sample, which is how it’s used in the software R (no hat or square!). That is, it would’ve made more sense relating this to practice if Andrew had gathered a sample of data for New Mexico (+10.8% for Biden) and South Carolina (+11.7% for Trump), then collected a sample of 1000 votes from each state and then calculated the result.

        But the bottom line is still that there is a huge amount of residual variance even if you have the true probability estimates (.45 and .55) due to the nature of binary variables. Maybe that’s why all the advice online seems to be to avoid R2 in a logistic regression context.

        Andrew: This is the kind of post I found super confusing when I was starting to try to understand stats. You don’t explicitly write the model down, which I assume you take to be y[n] ~ bernoulli(p[state[n]]), where state[n] in { 1, 2 }, and p[k] in [0, 1]. You assume a perfect fit, whereas it looks from the definitions that it’s a sample statistic. Then you pull an approximate variance calculation out of thin air (I get it, but I found these shortcuts super confusing when first starting out when they disagreed with the exact answer). This kind of thing is also why I found BDA so puzzling and couldn’t understand it well enough with my weak background in math stats (e.g., I found the moment mathcing and implicit change of variables very confusing in the very first hierarchical regression discussion in chapter 5).

        Being me, I of course asked GPT to write code where I could simulate in R.

        BC: How can I use the R language to simulate data from a logistic regression where there is a single binary covariate x in { 0, 1 } indicating state of residence and p[Y = 1 | x = 0] = 0.45 and Pr[Y = 1 | x = 1] = 0.55 to indicate probability of a Republican vote (Y = 1) vs. a Democrat vote (Y = 0). I want to generate a sample of 2000 with 1000 having x = 0 and 1000 having x = 1 to represent a poll. Then I want to use R’s built-in logistic regression to fit the model. It would be nice if it reported R^2.

        Here’s the R code it produced:

        # 1. Data Simulation
        set.seed(123)  # for reproducibility
        n <- 1000      # number of samples per group
        
        # Generate binary covariate x
        x <- c(rep(0, n), rep(1, n))
        
        # Generate binary response y based on probabilities for x
        p <- ifelse(x == 0, 0.45, 0.55)  # probabilities depending on state
        y <- rbinom(2 * n, size = 1, prob = p)
        
        # Create data frame
        data <- data.frame(x, y)
        
        # 2. Model Fitting
        model <- glm(y ~ x, family = binomial(link = "logit"), data = data)
        
        # Summary of the model
        summary(model)
        
        # 3. Reporting R^2 - McFadden's pseudo-R^2
        null_deviance <- glm(y ~ 1, family = binomial(link = "logit"), data = data)$deviance
        residual_deviance <- model$deviance
        mcfadden_r2 <- 1 - (residual_deviance / null_deviance)
        mcfadden_r2
        

        It calls this pseudo-R^2, but it's just following the general definition you find on the above-linked Wikipedia page. Heres' the result:

        > # 1. Data Simulation
        > set.seed(123)  # for reproducibility
        > n <- 1000      # number of samples per group
        > 
        > # Generate binary covariate x
        > x <- c(rep(0, n), rep(1, n))
        > 
        > # Generate binary response y based on probabilities for x
        > p <- ifelse(x == 0, 0.45, 0.55)  # probabilities depending on state
        > y <- rbinom(2 * n, size = 1, prob = p)
        > 
        > # Create data frame
        > data <- data.frame(x, y)
        > 
        > # 2. Model Fitting
        > model <- glm(y ~ x, family = binomial(link = "logit"), data = data)
        > 
        > # Summary of the model
        > summary(model)
        
        Call:
        glm(formula = y ~ x, family = binomial(link = "logit"), data = data)
        
        Deviance Residuals: 
           Min      1Q  Median      3Q     Max  
        -1.264  -1.082  -1.082   1.093   1.276  
        
        Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
        (Intercept) -0.22900    0.06366  -3.597 0.000322 ***
        x            0.42967    0.08996   4.776 1.79e-06 ***
        ---
        Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
        
        (Dispersion parameter for binomial family taken to be 1)
        
            Null deviance: 2772.5  on 1999  degrees of freedom
        Residual deviance: 2749.5  on 1998  degrees of freedom
        AIC: 2753.5
        
        Number of Fisher Scoring iterations: 3
        
        > 
        > # 3. Reporting R^2 - McFadden's pseudo-R^2
        > null_deviance <- glm(y ~ 1, family = binomial(link = "logit"), data = data)$deviance
        > residual_deviance <- model$deviance
        > mcfadden_r2 <- 1 - (residual_deviance / null_deviance)
        > mcfadden_r2
        [1] 0.008275241
        

        Inspecting the code (always double-check GPT's work!), I thought it introduced a bug by writing y ~ x rather than y ~ x + 1, but I changed to the latter and got exactly the same result. From looking at StackOverflow (also something you need to double check!), I saw that R implicitly includes an intercept. So the code GPT wrote is correct.

        The result lines up with Andrew's back-of-the-envelope 0.01 estimate. This looks like the code matches the Wikipedia definition for the general case, using the terminology "deviance". Very cool how you can calculate that for the "null" using an intercept-only model.

        Whew. Back to prepping my ISBA poster!

        • ” Approximate variance calculation”

          I’m not sure what is approximate here. The variance Andrew computed is exact, it’s (1/2)(0.55 – 0.5)^2 + (1/2)(0.45 – 0.5)^2 = 0.05^2. This is the population variance of the predictor.

        • > The result lines up with Andrew’s back-of-the-envelope 0.01 estimate.

          The back-of-the-envelope estimate of what?

          “There’s no trick here; the R-squared here really is 1%.”

        • Andrew said, “Given state, the predicted value is either 0.45 or 0.55, hence a random variable with standard deviation 0.05.” Maybe I’m misunderstanding what he intended, but the standard deviation of a bernoulli(0.45) variable is the same as that of a bernoulli(0.55) variable, which is sqrt(0.45 * 0.55) = 0.4974937… So I’m not sure where the 0.05 comes from. I think he meant 0.5 and was rounding because 0.45 is close to 0.5 and back of the envelope, that’s a 0.5 standard deviation. I see him do this kind of thing in person all the time when I can follow all the details.

          He then later says, “So the denominator of R-squared, the total variance, which is 0.5^2”, but I’m not sure where that comes from. If the model is a 50-50 mixture of a bernoulli(0.45) and a bernoulli(0.55), then that works out to being equivalent to drawing from a bernoulli(0.5), which does have exactly 0.25 variance. If that’s what he intended, he didn’t show the work to get there, but I’m guessing he’s thinking about the kinds of polls where you sample equal numbers of people in each state. If the mixture isn’t exactly 0.5, then this is off. This is what you get from the deviance calculation above, derived from the y ~ x model logistic regression model, which has an intercept of logit(0.45) and a slope of logit(0.55) – logit(0.45).

          As I said, I could be messing up these calculations. I don’t do a lot of this kind of stats and I find all the definitions and assumptions being made here very hard to follow, as I said above.

        • > Code the binary outcome as 1 for Republicans and 0 for the Democrats: This is a random variable with standard deviation 0.5.

          There is an implicit assumption that there is a similar amount of people in both states and overall the outcome is just as likely to be 0 or 1.

          https://en.wikipedia.org/w/index.php?title=Standard_deviation#Discrete_random_variable

          Here there are N=2 possible outcomes x1=0 and x2=1 and the mean is μ=0.5. The standard deviation is the square root of 1/2 ( (0-0.5)^2 + (1-0.5)^2 ) which is equal to the square root of 0.5^2 which is equal to 0.5.

          > Given state, the predicted value is either 0.45 or 0.55, hence a random variable with standard deviation 0.05.

          Now the prediction – which depends on the state – is just as likely to be 0.45 or 0.55 and we can calculate the standard deviation of that random variable as before.

          There are again N=2 possible outcomes x1=0.45 and x2=0.55 and the mean is again μ=0.5. The standard deviation is the square root of 1/2 ( (0.45-0.5)^2 + (0.55-0.5)^2 ) which is equal to the square root of 0.05^2 which is equal to 0.05.

          > The R-squared is, then, 0.05^2/0.5^2 = 0.01, or 1%.

        • Hi, Carlos and Andrew:

          I would like to see (a) a definition of the data being used, (b) the exact model being proposed and its fit to the data, (c) a precise definition of R2, and (d) the calculation with all the intermediate steps to derive its value.

          I’m guessing that we’re talking about something like the following data situation:

          State 1: 450 Democrat votes, 550 Republican votes
          State 2: 550 Democrat votes, 450 Republican votes

          I think it would help dispel a lot of the confusion around this post, including my own. I do know basic probability theory, so I know how to calculate the variance of a random variable when someone defines one.

          P.S. Where I lost Carlos’s description is at “Now the prediction – which depends on the state – is just as likely to be 0.45 or 0.55 and we can calculate the standard deviation of that random variable as before.” I’m not sure what random variable we’re talking about. I also can’t line up why we’re subtracting the mean (which will be 0.5 over the sample in both states) from 0.45 and 0.55 (which I can see as the fits of a binomial model to the two states).

          P.P.S. My best guess isn’t lining up with what people are showing, which is this code:

          n = 1000
          x1 = rep(0, n)
          x2 = rep(1, n)
          y1 = c(rep(0, 550), rep(1, 450))
          y2 = c(rep(0, 450), rep(1, 550)) 
          x = c(x1, x2)
          y = c(y1, y2)
          data = data.frame(x, y)
          fit = glm(y ~ x + 1, family = binomial(link = "logit"), data = data)
          fit_null = glm(y ~ 1, family = binomial(link = "logit"), data = data)
          null_deviance = fit_null$deviance
          residual_deviance = fit$deviance
          R_squared = 1 - (residual_deviance / null_deviance)
          
          print("null deviance:")
          print(null_deviance)
          print("residual deviance:")
          print(residual_deviance)
          print("R squared:")
          print(R_squared)
          

          which outputs

          [1] "null deviance:"
          [1] 2772.589
          
          [1] "residual deviance:"
          [1] 2752.555
          
          [1] "R squared:"
          [1] 0.007225546
          
        • Hi Bob,

          I will make an attempt to explain how I see what Andrew and Carlos are saying (and as I am discussing similar things in my article). I am pretty sure Carlos and Andrew are considering the population R^2, defined through variances, of a linear model which predicts with the average of each state (so .55 for the state with 55% republicans, if they were coded 1, and .45 for the state with 45% republicans). As the discussion is around the population R^2, the data do not really enter into it. That population R^2 is further understood as 1 minus [the variance of the error (vote minus model prediction) divided by the variance of the “vote outcome” (1 if voted republican, 0 if voted democrat)]. This, as the common way to measure R^2 in a sample is to take 1 minus [the sum of squares of the errors, divided by the total sums of squares], which can be understood as an estimate of the former quantity, as it will converge to it as the sample size increases (provided the usual assumptions that estimates have finite variances, are random draws from the population, and so on hold).

          So:
          X = {.55 if state with more republicans, .45 if state with fewer republicans}
          Y = {1 if vote republican, 0 if vote democrat}
          And then X is used as the predicted value of Y. (Additionally, it is assumed that states have the same number of voters, and that we are considering all voters in both states as the population.)

        • Oh I forgot one specification: that convergence of the estimate of course further assumes that the expected value of the error is zero, although that seems to be fulfilled here.

        • > P.S. Where I lost Carlos’s description is at “Now the prediction – which depends on the state – is just as likely to be 0.45 or 0.55 and we can calculate the standard deviation of that random variable as before.” I’m not sure what random variable we’re talking about.

          Andrew wrote “Given state, the predicted value is either 0.45 or 0.55, hence a random variable […]”. That’s the random variable he was talking about. That “predicted value” which will be either 0.45 or 0.55 – with the implicit assumption that the probability is the same because the vote we want to predict is coming from either state with the same probability (I guess that’s what the “states of equal size” mention is for).

          If that helps you can think of the random variable “state” which is either A or B with equal probability and the random variable “predicted value” which is a (deterministic) function of the random variable “state”.

          > I also can’t line up why we’re subtracting the mean (which will be 0.5 over the sample in both states) from 0.45 and 0.55

          Because he wanted to calculate the standard deviation of the random variable “predicted value”.

          “[…] hence a random variable with standard deviation 0.05.”

        • Thanks for the effort, Mathias. That cleared things up for me, but wouldn’t have sufficed as a standalone definition as I don’t see you defining R2! Let me try to put everything together into an explanation I would’ve understood had someone written it down.

          First, let’s suppose the populations are:

          Region A:  55 votes of 1, 45 votes of 0
          Region B:  45 votes of 1, 55 votes of 0
          Population:  100 votes of 1, 100 votes of 0
          

          I suppose this matters because with whole populations we have the actual numbers, not an estimate, so we divide by N rather than (N – 1) in our variance calculations, which I’ll define using R syntax by

          mean(x) = sum(x) / length(x)
          
          pop_var(x) = sum((x - mean(x))^2) / length(x)
          

          Now I can unpack the variance of the outcomes across the entire population of Region A and Region B put together. Using Mathias’s terminology,

          variance-of-outcome
           = pop_var(c(rep(1, 100), rep(0, 100)))
           = (100 * (1 - .5)^2 + 100 * (0 - .5)^2) / 200
           = 0.25
          

          because mean(c(rep(1, 100), rep(0, 100)) = 0.5.

          Now let’s turn to the variance of the errors. I will assume we will use maximum likelihood to fit a logistic regression with a slope and an intercept so that our prediction for region A is 0.55 and for region B is 0.45. Technically, the intercept will be logit(0.45) and the slope with an indicator of 1 for Region A is logit(0.55) – logit(0.45). Then the prediction for state A will be

          inv_logit(logit(0.45) + 1 * (logit(0.55) - logit(0.45))
           = 0.55
          

          and the prediction for State B will be

          inv_logit(logit(0.45) + 0 * (logit(0.55) - logit(0.45))
          = 0.45
          

          This is enough to define the error vector for the data in each state, as well as the total error vector as

          err_state_a
           = c(rep(1, 55), rep(0, 45)) - rep(0.55, 100)
           = c(rep(.45, 55), rep(.55, 45))
          
          err_state_b
          = c(rep(1, 45), rep(0, 55)) - rep(0.45, 100))
          = c(rep(.55, 45), rep(.45, 55))
          
          err = c(err_state_a, err_state_b)
          = c(rep(.45, 55), rep(.55, 45), rep(.55, 45), rep(.45, 55))
          

          This will give us the variance of the errors through

          variance-of-errors
          = pop_var(err) 
          = (110 * .45^2 + 90 * .55^2) /200
          = 0.2475
          

          Now if we plug that into the formula for R2, we get:

          Rsquared
          = 1 - variance-of-errors / variance-of-population
          = 1 - (0.2475 / 0.25) 
          = 0.01
          

          Whew, are my fingers tired.

      • R^2 was invented essentially in concert with ANOVA and is all about the ratios of sums of squares (which is also the origin of most of the ANOVA concepts). Those ideas are most relevant when the logarithm of the error probability density is the square of a difference from a mean) (ie. when the errors are normally distributed)

        You can define sums of squares in other contexts, but their relevance is much more questionable. There’s a direct mapping between the likelihood of some set of observations (exp(-k * SSE)) and the sum of squared errors (log of that, so -k*SSE)

        There’s no such obvious reason to care about sum of squared errors when the data is say Gamma distributed or exponentially distributed or Binomially distributed.

        But there’s still a reason to care about how much having the model “helps you”. So then you can start talking about stuff like entropy of the errors or “typical sizes” of errors or etc.

        Short version: R^2 is a shortcut and doesn’t necessarily enhance understanding outside of normal distributions.

        • Also note, the entropy of a discretized normal distribution would be something like:

          k/N * sum(exp(-((x_i-m)/sigma)^2/2) * ((x-m)/sigma)^2/2)

          or the average of the squares (k is a constant).

          But you can calculate that entropy for non-normal distributions and it won’t be an average of squares, but it will still track the same underlying concept of uncertainty.

        • sorry, replace k/N with k*dx where dx is the size of the discretization grid. You can also assume truncation of the sum once the probability density drops low enough.

        • I am sorry but this is completely incorrect, the fact that the distribution is Gamma changes nothing. R2 is about relative decrease in mean squared error of a conditional expectation function. The distribution of Y has nothing to do with it.

        • You keep saying this but it doesn’t make it more true.

          Let’s put it in a more Bayesian way perhaps. Not all regression models are about conditional expectations. I can write a model in which I am interested in the conditional mode. Then I can write the error as a gamma with the mode at the regression formula, in which case a mean square error exists but is not minimized by the regression formula estimate. I can even write the model to predict the mode in a cauchy distribution, and there the mean squared error doesn’t even exist! But none of this is a problem for a Bayesian regression.

          However, when I use a Normal error, then in fact the regression estimate and the conditional expectations are the same (for sufficiently diffuse priors) and this is precisely because of the normal error and the fact that the log likelihood IS the sum of the squares.

          Even when the error is normal though a strong prior can pull the estimate well away from the values which would minimize the squared error. I may know that the particular apparatus has some bias in it or something.

          The relevance of R^2 in Bayesian models where the regression formula is not even trying to estimate conditional expectation is Highly questionable. And even if we are interested in conditional expectation but have strong priors and knowledge of some imperfection in the model, we can be well away from minimizing observed mean square.

          For positive outcomes with a right skew I have often used conditional gamma mode. I’ve also used conditional t distribution mode, and conditional skew normal errors in sports outcomes etc.

          R^2 is a concept whose underlying assumptions are prior free models with multivariate normal errors for the purpose of estimating means. This type of model is ubiquitous outside of Bayesian modeling but far from the only thing in Bayesian models.

        • “Then I can write the error as a gamma with the mode at the regression formula, in which case a mean square error exists but is not minimized by the regression formula estimate”

          This is a mathematical impossibility, the conditional expectation function minimizes the mean squared error, for all distributions (where the moments obviously exist).

        • The conditional expectation does but the regression in that case is not a conditional expectation. To understand this in the least confusing way consider the regression with cauchy errors. The expectation doesn’t exist, but the regression gives some kind of answer, so it must not be a conditional expectation.

  6. As with Seth, I don’t see the paradox either. Back in the day, I was taught the difference between SSPE and SSLoF as part of the limitations of R^2. “Pure Error”, due to within-unit/plot variability, is not addressed by simple between-unit/plot models.

    Draper expands on this in
    Draper, N. (1984) The Box-Wetz Criterion Versus R2. JRSS-A, 147,1, 100-103
    and in
    Draper, N. (1985) Corrections/ The Box-Wetz Criterion Versus R2 JRSS-A, 148, 4, 357 .

  7. I have never once looked at an R^2 statistic from a model I’ve fit.

    Like, I’ve fit a lot of models, but it’s never occurred to me to care about R^2.

    First off, variance is something that’s convenient mathematically but is almost always the wrong question in applied statistics. Start from a dimensional analysis perspective. Suppose you’re predicting say the length of a thing, then the standard deviation of this prediction error is a length, but the variance of this prediction error is a non-existent area. Similarly what are “dollars squared” or if we’re predicting lubricant oil properties why would we care about “viscosity squared?” etc etc. Usually you care about things in the dimensions they’re involved in your project in, not the square of that.

    Second of all, the ratio of residual variance to total variance is dimensionless, so that’s good, but it’s not necessarily measuring anything you care about. Suppose we have height of women and height of men, by splitting out by sex we can predict using the group averages and reduce the variance of the error. Why do we care about R^2 and not, maybe, the difference between the two groups as a fraction of the average of the within-group standard deviations?

    Yes, sometimes there’s some mathematical relationship between the two (I’d have to do the algebra I don’t know what it is off the top of my head) but it’s usually better to think directly in terms of the questions you have and to invent a measure of interest that addresses the question. I feel like R^2 is just a thing it’s easy to can into a regression program and pump out, and it’s convenient to the mathematics of multivariate normal distributions that underly the assumptions of 1950’s ANOVA calculations. It’s not anything fundamentally of interest in any given context. Particularly any context where errors aren’t normally distributed (like binary errors in vote share).

    • Hi Daniel,

      It sound like we have similar thoughts on R2. In my previous post, I link to an article of mine that examines some coefficients of the kind you discuss (although I agree that if one has another question, one should use whatever most closely corresponds to that questions). Namely:

      |R|: How many standard deviations does the outcome change on average as a result of one standard deviation’s change in the model?
      CFE: How much smaller (in terms of standard deviations) is the distribution of the errors after predicting with the model.
      R_pi: How much larger/smaller are the contributions of the model compared to those of the error (on average).

      • I think you both got to the heart of the “paradox”, and it can be explained by focusing on |R| (as Mathias suggested) instead of R-squared.

        In Andrew’s examples, |R| equals the magnitude of the correlation between the Republican-voting indicator and the state indicator. Both indicators have standard deviation 0.5, so |R| = 0.1 in the first example and |R| = 0.4 in the second. The squaring in R-squared makes these correlations “very low-seeming”.

    • One alternative measure I have found to be helpful is to discretize your measurement and then calculate the difference in entropy of prediction error. You can think of this as “information gain” directly and it isn’t “normal distribution” specific the way that sum of squared errors is.

  8. Isn’t this just a case of a very small effect being sufficient if aggregated over a large enough sample? Predicting the vote of an individual based on the state is clearly hard to do in your example — a fact shown by the low R^2. But predicting the sum of the votes is simple because individual variation is averaged out and only the mean of the residuals remains in the aggregate (the residuals are biased high for one state and low for the other).

  9. I have a similar feeling about the entropy of Bernoulli distributions. Looking at the plot of the entropy function (there is a plot [here][1]), one needs to get all the way to $p\approx0.9$ to see the uncertainty in the outcome of a Bernoulli($p$) drop by half (from 1 bit at $p=0.5$). But… don’t we feel more than halfway more certain of the outcome at $p=0.9$ compared to $p=0.5$?

    [1]: https://en.wikipedia.org/wiki/Binary_entropy_function

  10. I don’t have much to add to what others have said about your sense of paradox: individual outcomes are just much noisier than aggregate outcomes, so that a strong average effect might have a trivial reduction in individual variances.

    I will add that in expert witness testimony on, say, a model of individual stock prices with some very strong explanatory factor is not at all unusual to have R^2 on the order of 2 or 3 percent. We’re so used to opposing experts using this to cast shade at the model that we have written out canned responses to the objection. The basic point is that the major factors underlying the day-to-day changes in any stock price are largely things for which we have no data. But that certainly doesn’t mean that on a day the stock market is up 1 percent that that doesn’t translate directly into the stock price of every individual stock — it’s just that stock prices have much higher than one percent daily variance, so it’s something you can only see in aggregate.

  11. I agree with G. From a semantic view, a binary variable is a troublesome entity. Voting D or R may seem intuitive, but consider a binary variable ‘B(ird)’. What is the meaning of ‘Not-Bird’? And what do you include, and how, when counting the Bird population (e.g. nestlings, dead birds, toy birds, etc). Depends on your universe.

    A significant subset of the AI KR (knowledge representation) community wrestled with this for much of the 1980s, during the era when many was often said that most humans survived quite well with this sort of knowledge without any understanding of probability or stats and searched for solutions that used ‘pure logic’

    That era also found you can’t just add numbers. While ‘Bird
    Implies Fly’ entails ‘non-Bird implies not-Fly), prob(Fly|Bird) hardly constrains p(not-Bird|not-Fly).

    My guess (not being a trained statistician) is that the answer to the paradox lies in concisely understanding all that is implied by modelling an event as having a binary variable.

  12. I’m not sure I see a paradox either. But I suspect the discomfort with your intuition is driven mainly by R^2 being a variance function and not a standard deviation function.

    For example, in the 55/45 case, if someone told you the R^2 was 0.1 and not 0.01 (without you doing or seeing any computations) you’d probably believe it and be comfortable. It’s the squaring that makes them seem quantitatively off. It is for this reason I usually report the (signed) square root, as folks intuit the magnitude better.

    Other than in moment-of-inertia contexts, I personally have always found variances and things downstream from them very hard to intuit. YMMV

  13. “‘Bird Implies Fly’ entails ‘non-Bird implies not-Fly)…”

    No, that’s not right. ‘Bird implies Fly’ entails ‘not-Fly implies non-Bird’, the contrapositive. But it does not entail the inverse. Getting away from abstract logic, just think of bats and many species of insects, which are non-birds but do fly.

    • Oops, embarrassing. I’ve corrected that with a reply to my own post. New students of logic often don’t initially ‘get it’ that a contrapositive “not Y => not X” follows from “X =>Y”. But if one gives quasi-logical meaning to X=>Y, (that is, then believing “Typically (X=>Y), may mean believing (i.e., completely contains the truth value of) “typically (not Y => not X)”.

      Given the vast number of insects in the world, it might well be that living creatures typically fly.

      More generally, it is difficult to come up with crystal clear definitions for many propositions of interest once we get away from balls and urns.

  14. I essentially agree with Daniel and Bob. Binary data is weak and I’m confused as to why you’re considering the R^2 value at all, when it’s such a terrible measure (it can be high when the model is bad, low when the model is good, as you gave an example of, etc). I don’t think you use it in your research either, which makes me even more confused. I would just ignore the R^2 value and use the mean squared error, which tells you about the fit of your model in a much better way (that’s what you care about, right?).

    This is just another reason to use score voting–then political scientists would want to explain the score for each party and not just “Voter X voted for Party Y.”

    Shalizi’s notes are also pretty good in explaining why R^2 is useless and providing more examples:
    https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf

      • That is not true if you use the standard definition of R^2. State the definition you are using or read the notes I linked (page 15). Neither of the variances in the numerator and denominator of the standard R^2 are the same as the mean squared error between the model and the data, and it is stretching the concept too far to call them mean squared errors. Try reading other people’s responses to you as well.

        • The are five equivalent definitions of R² in those notes.

          Take a look at the fifth one (9): R²=(s²-σ²)/s².

          σ² is the “sample variance of the residuals” which is also called the mean squared error.

          s² is the “sample variance of Y” which is equal to the mean squared error if the regression includes no covariates and the prediction is the sample mean – that’s not stretching the concept at all.

        • What is not true?

          The are five equivalent definitions of R² in those notes.

          Take a look at the fifth one (9): R²=(s²-σ²)/s²

          σ² is the “sample variance of the residuals” which is also called the mean squared error.

          s² is the “sample variance of Y” which is equal to the mean squared error if the regression includes no covariates and the prediction is the sample mean – that’s not stretching the concept at all.

        • Carlos:
          This is just terminology at this point. It is not true that the *mean squared error between the model and the data* is the same as any of the quantities comprising R^2, under any standard definition. You can call the sample variance the mean squared error between the data and the mean of the data if you want. It doesn’t change the fact that R^2 is useless.

        • Anon, page 16 of the document you sent actually shows the R2 is equal to the relative reduction of mean squared error…

        • My earlier comment is still in moderation for some reason. You’re calling “variance” “mean squared error”. I am not. I am saying the mean squared error between the model and the data is the right measure. This has nothing to do with either of the parts of R^2, which, if you like, are mean squared errors between the data points and the mean of the data points, suitably scaled.

        • > You’re calling “variance” “mean squared error”. I am not.

          Why not? The variance of the errors is one and the same with the mean squared error (when the mean of the errors is zero as here).

          The document you cited says: “The one situation where R2 can be compared is when different models are fit to the same data set with the same, untransformed response variable. Then increasing R2 is the same as decreasing in-sample MSE (by Eq. 9).”

          > I am saying the mean squared error between the model and the data is the right measure. This has nothing to do with either of the parts of R^2

          This is Eq. 9 again: R²=(s²-σ²)/s²

          σ² has everything to do with “the mean squared error between the model and the data” because that’s precisely what it is.

          s² is the mean squared error between the baseline model (without covariates – only the intercept) and the data.

          R² is therefore the relative reduction in mean squared error (compare to the baseline model).

        • This is just pedantry at this point. I will accept that in some tiny number of cases, it is the relative reduction in mean squared error. My point is that even if I accept this, this does not measure the same thing as the mean squared error. It is useless for that purpose, and you are better served by using the mean squared error itself.
          Page 18:
          “R^2 says nothing about prediction error. Go back to Eq. 13, the ideal case: even with σ^2 exactly the same, and no change in the coefficients, R^2 can be anywhere between 0 and 1 just by changing the range of X.
          Mean squared error is a much better measure of how good predictions are…”

        • My comment is held in moderation again. “Mean squared error” usually means “mean squared error between X and Y,” where X is usually a model and Y is usually data. Anything else, you have to specify it. This is why.
          Most of this conversation is just pedantry, but even if I accept that in some cases R^2 can be interpreted as a ratio of mean squared errors of different things, it doesn’t mean that it’s measuring the same thing as the mean squared error. It is terrible for prediction.
          Page 18:
          “R^2 says nothing about prediction error. Go back to Eq. 13, the ideal case: even with σ^2 exactly the same, and no change in the coefficients, R^2 can be anywhere between 0 and 1 just by changing the range of X.
          Mean squared error is a much better measure of how good predictions
          are…”

        • > I will accept that in some tiny number of cases, it is the relative reduction in mean squared error.

          R²=(s²-σ²)/s² is a general definition in the document you refer to – not something that holds only in some tiny number of cases.

          > My point is that even if I accept this, this does not measure the same thing as the mean squared error.

          I was trying just to find what “that” was when you wrote “That is not true”.

          The message you replied to had two sentences: “The mean squared error and the R2 measure the same thing. The R2 is simply quantifying the relative reduction in mean squared error.”

          The first one may not be true but the second is fine – even though you seem to be ready to accept it in some tiny number of cases only for some reason.

      • This is just pedantry about the words “mean squared error” at this point. Even if I accept that in some cases the R^2 does involve mean squared errors, it’s still a useless measure. It does not measure the same thing as mean squared error. This is the point I was trying to make.
        Page 18:
        “R^2 says nothing about prediction error. Go back to Eq. 13, the ideal case: even with σ^2 exactly the same, and no change in the coefficients, R^2 can be anywhere between 0 and 1 just by changing the range of X. Mean squared error is a much better measure of how good predictions are…”

  15. Aside from the problematical issue of using linear regression with a binary dependent variable (how would one characterize the error distribution?), have none of you contributing to this topic heard of the “standard error the the estimate”? Not one mention so far!

        • Yes I am certain. It seems you and others here have learned certain concepts such as R2 in the context of parametric models such as a,Gaussian likelihood with linear predictor or a Bernoulli likelihood with logit link. Thus no wonder you can’t understand its meaning or definition outside the context of a parametric model.

      • I don’t think anything you are saying has anything at all do do with statistics. You toss around terms like “error” indiscriminately without actually defining what you mean by error. In particular, if one has a binary dependent variable, say Y, how do you define what you mean by error.when a linear model for Y is employed? Y can take on only 2 values, so how do you characterize “e” in the linear regression estimation method applied to the model Y=bx+e. What possible distributions parametric or non-parametric do you think might charaterize your definition? To have any meaning at all for your approach, you still need a well-defined assumed model for estimating anything.

        • It’s not clear to me if you mean that what Andrew wrote doesn’t have anything to do with statistics.

          > Code the binary outcome as 1 for Republicans and 0 for the Democrats: This is a random variable with standard deviation 0.5. Given state, the predicted value is either 0.45 or 0.55, hence a random variable with standard deviation 0.05. The R-squared is, then, 0.05^2/0.5^2 = 0.01, or 1%.

          There are outcomes (0 and 1) and there are predictions (0.45 and 0.55, depending on the state). That’s all he needed to calculate the coefficient of determination in this case.

          There are also errors (-0.45 with probability 0.55 and 0.55 with probability 0.45 in one case, -0.55 with probability 0.45 and 0.45 with probability 0.55 in the other case) and he could have done the calculation as follows (less straightforward but equivalent):

          > Code the binary outcome as 1 for Republicans and 0 for the Democrats: This is a random variable with standard deviation 0.5. Given state, the predicted value is either 0.45 or 0.55 and the error (-0.45 with probability 0.55 and 0.55 with probability 0.45 in one case and -0.55 with probability 0.45 and 0.45 with probability 0.55 in the other) is hence a random variable with variance 0.2475. The R-squared is, then, (0.5^2 – 0.2475)/0.5^2 = 0.01, or 1%.

        • Hi Carlos,

          I agree with what you write, but I think it is worth noting that the equivalence of these two ways to calculate R^2 only works because (1) the expected value of the error is zero and (2) because the covariance between the model (i.e. the predictions) and the error is zero, both of which are fulfilled by this model.

        • Right. In other comments I may have done the impression that the equivalence is valid more in general than it is. Errors are often unbiased and uncorrelated anyway – and it’s the case in Andrew’s example.

          His calculation doesn’t involve errors directly but it’s also valid only in the specific case that the model makes least-squares predictions – something that may not be clear to all readers. “Given state, the predicted value is either 0.45 or 0.55” because that’s the conditional expected value. There are no assumption about the distribution of errors – something that may not be clear to all readers either.

          (If a “model” makes the “prediction” 0 or 1 with 50% probability for any individual blindly applying his formula would produce a meaningless “R-squared” of 1.)

        • Carlos, sorry if I was unclear — I was directing my post to Jon’s comments, not to what Andrew wrote.

  16. You write, “usually we’re not trying to predict one particular voter; we’re interested in aggregates”. This is in contrast to, say, clinical medicine, where we are trying to “predict” one particular patient. (The comment from Anoneuoid about false positive rate is relevant.)

    One’s intuition might be improved by plotting R-squared as a function of the probability P (not just choosing the value P=0.55). R-squared grows quadratically from 0 to 1 as P increases from 0.5 to 1. R-squared gets a “slow start”, if you will.

    Thinking about this in terms of the Brier score metric (https://en.wikipedia.org/wiki/Brier_score) for probabilistic prediction performance has been helpful in building my intuition.

  17. Like other commenters I don’t understand where’s the “paradox” – or what do you mean by “strong linear predictor”. I would expect “strong” to indicate better accuracy than “pretty good” as far as predictors go.

    The chart on the left doesn’t look like a strong linear predictor to me: https://imgur.com/a/70u0yRZ

    Whether a R-squared of 0.16 corresponds to a “pretty good” predictor is more a matter of opinion. It’s usually said that 0.8 is disappointing is some fields while in others 0.2 is fantastic.

    • Carlos:

      in some fields when you build an apparatus or deploy a new method you are expected to test it against a method that’s known to work and demonstrate that it actually produces reliable results. If the method is important, you would probaly write a paper outlining the multiple approaches you’ve used to test this new method / apparatus. And even if you do this very successfully, others will continue to test it in as many different applications as possible to determine the extent of its validity and to check your results. These are the “0.8 is disappointing” fields.

      In other fields, you can claim to have built a highly improbably apparatus – for example a soup bowl that is served to a customer at a restaurant and perpetually refills while they are eating from it – and no one even asks to see what the unlikely apparatus even looks like much less poses a question on HTF it actually works, and when you present your “data” from your “experiments”, they all agree you’ve made a major discovery and proceed to build on it with their own “research” using their own similarly reliable methods. These are the “0.2 is amazing and frequenly leads to major discoveries” fields.

      In some fields researchers seek reliable methods and pursue these methods over the long run to generate major discoveries.

      In other fields researchers make major discoveries and eschew reliable methods due to their propensity to undermine major discoveries.

      • BTW, the soup bowl supposedly refilled *without the customer knowing it was refilling*. A truly amazing piece of research hardware.

        You’d think some other researcher might wanted to try the same with a perpetual Twinkie – a Twinkie that no matter how many bites you take, it’s never gone: does the Twinkie eater just continue blissfully snacking away, not even noticing they have taken 57 bites from 3-inch long Twinkie?

        That’s how research would be tested in other disciplines: one perpetual soup bowl simply wouldn’t satisfy the entire discipline. They would want to see the same concept tested with perpetual Twinkies, a never-emptying plate of #5 spicy Tai food, or maybe even a perpetual bong hit! How long can someone toke on bong hit before they notice the bowl is never burned? I guess you’d also have to have a perpetual Bic lighter…

  18. R² doesn’t measure how good your classification is. The predicted values of your voters should be individually close to 1 or 0, where 55% are close to 0 (“Democrats”) and 45% close to 1. That’s why a predicted value for an individual voter (!) which is either 0.45 or 0.55 is a rather bad classifier. You should use another evaluation metric or scoring rule like the Brier Score or Log Loss (Cross-Entropy Loss).

    • > The predicted values of your voters should be individually close to 1 or 0, […] That’s why a predicted value for an individual voter (!) which is either 0.45 or 0.55 is a rather bad classifier.

      In this problem a classifier cannot do better than producing the probabilistic “prediction” p(1)=0.45 – or 0.55, depending on the covariate.

      The Brier score will be the mean squared error of that prediction, i.e. 0.2475.

      To measure how good your classification is you may want to compare it with the Brier score for the p(1)=0.5 prediction that ignores the covariate, i.e. 0.25. The relative improvement in the score happens to be 1% and it is not a coincidence.

      (If your predicted values are individually close to 1 or 0 your Brier score will be much worse than 0.25. If you really want to predict values with certainty the best that you can do is to predict always p(1)=0 or always p(1)=1 depending on the covariate – to get the Brier score down to 0.45.)

    • > The predicted values of your voters should be individually close to 1 or 0, where 55% are close to 0 (“Democrats”) and 45% close to 1. That’s why a predicted value for an individual voter (!) which is either 0.45 or 0.55 is a rather bad classifier.

      In this case a classifier predicting values that are individually close to 1 or 0 will be quite bad.

      The best than a classifier can do here is to produce a probabilistic prediction p(1)=0.45. The Brier score of such a prediction is 0.2475. The Brier score of a prediction p(1)=0.5 that doesn’t take into account which of the two states are being considered is slightly higher at 0.25. The relative improvement in the score (lower is better) is 1% – and it’s not a coincidence.

      If the predicted value for an individual voter has to be either 1 or 0 the best (lower) Brier score corresponds to predicting always 0 – and failing 45% of the time – for a score of 0.45. If you predict 0 with certainty 55% of the time and 1 with certainty 45% of the time the Brier score will be worse (higher) than that.

      • Carlos Ungil’s answer is the right explanation of why this feels paradoxical (perhaps only to some) and why it’s not a “good classifier”. The probabilistic predictions of .45 and .55 are the best you can do knowing only state. If we can only predict 0.45 or 0.55 probability for an individual, our classification per individual has a 45% chance of error! It doesn’t matter whether you measure with R2 or a proper scoring rule like predictive log probability (aka “log loss”) or squared error (aka “Brier score”).

        Even if we bring in sex, age, income, ethnicity, religion and/or religiosity along with state of residence, we still can’t build a good classifier on a voter-by-voter basis. I think Andrew once told me it’s about 70% accurate as a classfier with all that. Of course, some state are so skewed you can get to roughly 70% by guessing the majority party—I don’t know how much the other covariates would help in that situation.

        If we bring in party affiliation, that will do a lot better for affiliated voters. So even if you go full-Gelman and build a hierarchical model and then poststratify, don’t expect it to be very accurate at the state level. We’re not even polling real voters, just people who respond. Furthermore, all of the states are heavily correlated, so that looking at marginal errors is very misleading (Andrew’s posted about this latter point before).

        So it feels to me like election polling and forecasting are a mug’s game (to use a British term that Andrew threw into the blog recently).

  19. Lots of controversy in this post suggestive of the problems with R^2, something which it, itself, cannot resolve.

    Frank Harrell, for one, has taken the strong position that it’s one of the best measures of model fit.

    Imo, Shalizi’s conclusion is virtually correct. I add ‘virtually’ because R^2 is not alone in being flawed. In comparison with other metrics of fit used in approximating and iterative data simulation methods such as divide and conquer algorithms, its wider instability is distinctly sobering and revelatory.

    In the 80s, sociologist Harrison White made a useful point about expectations wrt R^2 as a function of the types of information: for survey data, R^2s of 10% to 20% are the norm. For business and financial data, R^2s of 40% to 50% are reasonable. R^2s much above that are prolly violating regression assumptions.

    About the same time Don Morrison (UCLA Marketing Science) was publishing papers on R^2 which made the point that in industries such as direct marketing, R^2s of 1% (and less) can be useful and even profitable. His example for this was predictive models for soliciting magazine subscriptions. The typical inputs to such models were generic and vague like Experian lifestyle factors or direct marketing lists and binary targets of known subscribers and nonsubscribers. Based on the resulting predictions, deciles could be formed and ranked from high to low. Mailing subscription solicitations to the top decile resulted in a sufficient number of acceptances that the model ‘paid for itself’, so to speak.

    Bottom line: there is no ‘one size fits all’ wrt R^2.

  20. There seems to be an error here. You want to know how much variability is explained in candidate selection and you’re using the variance of the votes. This is the wrong value. You need to estimate the variance of the candidate selection, red or blue. That’s what you’re trying to explain.

    For the 0.5 case that stays 0.5^2. However, for the 0.45 case, because of a large population consistently voting in one direction you’re going to get a very low variance in candidate outcome. The explained variance is going to be almost ~1 with a large population.

  21. Not in a thinking mode but, 10% is not a “huge” difference”? 90% of voters pair up with their opposites. Just 10% swing the other way. These are not radically different places. There was a bigger gap between Iowa and Illinois in the 2020 vote.

    I have no idea if it matters but, precision does matter. Always.

    • Meh:

      I’m not quite sure where you’re getting the claim of “huge” from. You have the word in quotes but it appears nowhere in my post. What I wrote is that “the two states are much different in their politics,” which seems right to me. You give the example of Iowa and Illinois. In the context of U.S. national elections, these two states are indeed much different in their politics.

      • Andrew: A lot of this feedback is because of the vagueness of your statement “the two states are much different in their politics”. The only way I can make sense of it is that you think one state is almost certain to go red and the other to go blue. But then you did an analysis of individual-level predictive power, and if we take your statement to mean I can tell which state you’e from by how you vote, or I can tell how you vote by which state you’re from, that’s clearly not true. So I’m assuming you meant the former, and then I’m confused why you’re surprised at the result of an individual-level analysis.

        Also, I really expected all that work you did with Yair on trying to understand why log loss is so close with all your careful modeling to just doing something simple. I thought your conclusion there was that log loss is way too flat to be useful.

        • I wish Andrew had buried the answer (1%) so I could think about the question before knowing what it is. It makes intuitive sense to me that R-squared is very low in this case, although I’m not sure that if you gave me only 5 seconds to think about it — which is not enough time for me to work it out — that I would have gone _that_ low with my guess. Pretty low, though.

          Bob, I agree, I wonder if the reason the answer seems paradoxical or at least odd to some people is that the outcome for the -states- has no uncertainty at all — if you take a sample that is anywhere near the size of a state’s voter turnout, one state is guaranteed to go D and the other is guaranteed to go R — but that this happens even though knowing the state gives you only a little bit of information if you’re trying to guess the party preference of an individual voter.

          I recently saw a clip of former tennis player Roger Federer giving a commencement address. Federer was a dominant player for a lot of years and was among the three best players in the world for even longer. He won about 80% of his matches — even more in his peak years — but, as he pointed out in his speech, he won only 56% of the points he played. And that includes his matches against crummy players! I don’t know what fraction of points he won against top-10 players…perhaps 53% or something like that? At his prime Roger Federer was much better than, say, Andy Murray, but select a random point from a Federer vs Murray match and…well, the smart bet is that Federer won the point, but, given no additional information (such as who was serving), if you want to offer odds on a randomly selected point between these two, you’d better offer something like 25:24.

          I hope the analogy is clear. Federer was “much better” than Murray, in the sense that Federer was very likely to win a match between the two, but any individual point was close to a toss-up. (Well, not really; any individual point was very likely to be won by whoever was serving. What I really mean is that Federer had only a very small edge in any individual point. But it takes a lot of points to win a match). The two states Andrew is talking about are “very different” in politics, but if you select a random voter from one of the states you can’t be at all sure which way they vote.

          This basic point has been made twenty times on this thread, I realize. I just wanted to bring in the Federer thing, which I find pretty striking.

  22. In a simple simulation, the outcome variable Y is generated by 10 independent variables, such that: Y = x1 + x2 + … + x10 + e

    Assuming all the x variables are independent and follow a standard normal distribution, this represents the true data-generating process (DGP) for Y.

    However, in the analysis, we only consider a model with a single predictor, x1: Y = b0 + b1*x1 + u

    Despite x1 being just one of the 10 variables that determine Y, we find that the coefficient b1 is statistically significant and approximates a value of 1. This indicates that manipulating x1 has a significant and important treatment effect on the outcome Y.

    Importantly, though the treatment effect of x1 is significant, this variable only accounts for 10% of the total variance in Y, as evidenced by the low R-squared value of 0.1.

    In this scenario, there is no paradox. The significant treatment effect on x1 and the low R-squared are not contradictory findings. The R-squared value reflects the overall model fit and the proportion of variance explained, while the coefficient b1 represents the specific effect of the manipulated variable x1.

    It is possible to have a significant and important effect of a variable, even if that variable is a minor factor in determining the overall outcome. The key is to recognize that the statistical significance of the coefficient and the R-squared value are capturing different aspects of the model’s performance.

  23. In a simple simulation, the outcome variable Y is generated by 10 independent variables, such that: Y = x1 + x2 + … + x10 + e

    Assuming all the x variables are independent and follow a standard normal distribution, this represents the true data-generating process (DGP) for Y.

    However, in the analysis, we only consider a model with a single predictor, x1: Y = b0 + b1*x1 + u

    Despite x1 being just one of the 10 variables that determine Y, we find that the coefficient b1 is statistically significant and approximates a value of 1. This indicates that manipulating x1 has a significant and important treatment effect on the outcome Y.

    Importantly, though the treatment effect of x1 is significant, this variable only accounts for 10% of the total variance in Y, as evidenced by the low R-squared value of 0.1.

    In this scenario, there is no paradox. The significant treatment effect on x1 and the low R-squared are not contradictory findings. The R-squared value reflects the overall model fit and the proportion of variance explained, while the coefficient b1 represents the specific effect of the manipulated variable x1.

    It is possible to have a significant and important effect of a variable, even if that variable is a minor factor in determining the overall outcome. The key is to recognize that the statistical significance of the coefficient and the R-squared value are capturing different aspects of the model’s performance.

  24. I generally find that R/sqrt(1-R^2) aka the signal-to-noise ratio scales pretty nicely. It doesn’t solve your original problem where you care about aggregates, but it solves your second problem in that it gives a much larger value of around 0.4.

    (Signal to noise ratio also satisfies other desirable properties, e.g. when combining independent pieces of evidence, it scales with sqrt(n) where n is the number of pieces of evidence.)

    I tend to think of R^2 as a measure of how many of the independent causes one has captured. Obviously it works even with non-causal prediction, but the value in this mindset is that then one can say what it’s not trying to do, namely in a sense it’s not trying to capture the predictive accuracy (because it’s got a quadratic component that makes it inaccurate for that). R and sqrt(1-R^2) more directly characterize the predictive accuracy from two different perspectives (R gives you the strength of a selection effect and sqrt(1-R^2) gives you a direct measure of the prediction error), and the signal-to-noise ratio interpolates nicely between them.

  25. [edit: escaped code]

    I initially agreed with Andrew that this is paradoxical, but after simulating some data, I’m not so sure.

    Here’s a simple latent variable model that maps onto Andrew’s toy example, where maybe we think of the latent `y` as representing someone’s ideology, and `y_discrete` is how they vote.

    library(tidyverse)

    # Simulate voters
    df %
    group_by(i) %>%
    summarize(
    y = sum(x),
    x = first(x)
    ) %>%
    mutate(
    y_discrete = y >= 0,
    x_discrete = x >= 0
    )

    When we regress `y ~ x`, `x` has an $latex R^2$ of 2%. Once we discretize and throw away information, that should go down. We can check it and see that it’s indeed 1%.

    # Calculate R^2
    lm(y_discrete ~ x_discrete, data = df) %>%
    summary()

    So, from this perspective, an $latex R^2$ of 1% seems about right.

    However, if we look at the voter breakdown in each state, we see that we end up with more or less exactly 45% / 55%, like in the post.

“`
# Look at vote totals in each “state”

    # TRUE: 55%
    # FALSE: 45%
    df %>%
    group_by(x_discrete) %>%
    summarize(avg = mean(y_discrete))

    

So, to be a little more precise, I actually still agree with Andrew that there’s a “paradox” here, in that it’s very surprising that such a small effect leads to such big differences in voting behavior. Obviously, this is not a paradox in the mathematical sense, and the resolution is that because so many people have “ideologies” that are very close to the threshold (i.e., 0), it doesn’t take much to switch their vote, even though their actual underlying political ideology hasn’t changed very much.

    Weirdly, the upshot of this is that if you think the latent variable model is reasonable, $latex R^2$ actually seems like it’s correctly detecting the fact that the latent variable only shifts your actual “ideology” by a little bit.

    (In fairness, I’m generally pretty suspicious of $latex R^2$, and so I’m curious what it takes to “break” this example so that $latex R^2$ is way off as a measure of the explanatory power of both the discretized and latent variables. I imagine it’s not too hard.)

    • Whoops—my attempts to format everything seem to have gone awry. Here’s the code, hopefully in a slightly easier to read format.

      library(tidyverse)

      # Create a data frame
      df %
      group_by(i) %>%
      summarize(
      y = sum(x),
      x = first(x)
      ) %>%
      mutate(
      y_discrete = y >= 0,
      x_discrete = x >= 0
      )

      # Calculate R^2
      lm(y_discrete ~ x_discrete, data = df) %>%
      summary()

      # Look at vote totals in each “state”
      # TRUE: 55%
      # FALSE: 45%
      df %>%
      group_by(x_discrete) %>%
      summarize(avg = mean(y_discrete))

    • I’m not sure why it’s so hard to keep the code from getting garbled… Feel free to delete these if the code still comes out unreadable.

      # Simulate voters
      df %
      group_by(i) %>%
      summarize(
      y = sum(x),
      x = first(x)
      ) %>%
      mutate(
      y_discrete = y >= 0,
      x_discrete = x >= 0
      )

      # Calculate R^2
      lm(y_discrete ~ x_discrete, data = df) %>%
      summary()

      # Look at vote totals in each “state”
      # TRUE: 55%
      # FALSE: 45%
      df %>%
      group_by(x_discrete) %>%
      summarize(avg = mean(y_discrete))

      • It’s not your fault—it’s because this blog’s really antagonistic toward both authors and respondents by (a) not supporting markdown, and (b) not allowing users to edit or even delete their posts.

        Nothing in the interface tells you that you have to use raw HTML or that even if you do that, it’s going to get ignored because WordPress runs some hideous pre-process on top of it.

        I went in and added the appropriate <pre> tags (which I had to write as “&lt;pre&gt;” here), and then WordPress somehow ignored them in the comments. LaTeX escapes used to work on the blog, then broke, and even when they were working, they didn’t work in comments. I can never figure out what WordPress is going to do.

  26. I think this is at least partially due to the fact that we like to work with variances because that has mathematically convenient properties.

    But for intuition standard deviation is more useful. Conveniently sqrt of R^2 is to standard deviations as R^2 is to variances.

    If you look at the square root of R-squared you get 0.4 in the second case, which at least to me feels intuitively right – my chance of making a prediction error gets reduced by 20% from 50% to 30%. 20%/50% is also 0.4.

  27. As far as I can tell this entire discussion hinges on the problem that, in the natural world, the state of residence has zero actual determinative or causal effect on how people vote. If an Idahoan steps into Washington, it doesn’t change their vote, period. So the variable “state of residence” does not control the outcome: vote democrat or vote republican and thus it has a very low correlation to the outcome.

    When Bob says “Even if we bring in sex, age, income, ethnicity, religion and/or religiosity along with state of residence, we still can’t build a good classifier on a voter-by-voter basis”, he’s just pointing out that these factors also do not “cause” people to vote one way or another, so holding them constant doesn’t help much, although it helps some because these variables do exert *some* direct control over how people vote, while “state of residence” exerts no control. E.g., if you add a variable that has some real causal effect to one that has no causal effect, the correlation has to improve! Religion, for example, reflects people’s beliefs and is thus more likely to determine how people vote than “state of residence” which has no fundamental relationship to people’s beliefs or how they vote.

    While in one sense this is not a paradox, in another sense it’s a major problem in statistics and social sciences. Andrew is using “state of residence” as a variable because it’s convenient and easy to measure, but it’s a variable that has no relationship to the problem at hand. It seems to me that it’s common to attempt to use “convenient” data like census data to determine cause and effect when in fact most census data contains mostly information that has at best indirect causal links (aside from it’s other problems) to the questions at hand. IOW, the data is a poor tool to understand the problem, and because of that there is large variation that obscures or even misrepresents the actual relationship.

    So I guess the upshot is that it’s not surprising that when you use a property or variable that is irrelevant to a particualr behavior to try to predict that behavior, the predictive power or correlation is very low. IOW, a bowling ball rolls faster than a whiffle ball because the bowling ball is heavy and the whiffle ball is light, not because the bowling ball is black and the whiffle ball is white; not because the bowling ball is has three holes and the whiffle ball has twenty six holes; not because the bowling ball says “Made in America” and the whiffle ball says “China”. It’s easy to see the color, the number of holes, and the maker’s mark but just because they are easy to see and measure doesn’t imply they determine how fast the ball rolls.

    • Reapplying the reasoning in Andrew’s post to your example:

      Code the binary outcome as 1 for heads and 0 for tails: This is a random variable with standard deviation 0.5. The predicted value is always 0.5, hence a random variable with standard deviation 0. The R-squared is, then, 0^2/0.5^2 = 0, or 0%.

  28. “… The two states are much different in their politics! …”

    They are much different because of the discontinuous way our system determines outcomes from votes. If one state was 55% Coke and 45% Pepsi and the other was 45% Coke and 55% Pepsi would you consider their taste in soft drinks to be much different?

Leave a Reply

Your email address will not be published. Required fields are marked *