## Linear or logistic regression with binary outcomes

Gio Circo writes:

There is a paper currently floating around which suggests that when estimating causal effects, OLS is better than any kind of generalized linear model (e.g., binomial). The author draws a sharp distinction between causal inference and prediction. Having gotten most of my statistical learning using Bayesian methods, I find this distinction difficult to understand. As part of my analysis I am always evaluating model fit, posterior predictive checks, etc. In what cases are we estimating a causal effect and not interested in what could happen later?

I am wondering if you have any insight into this. In fact, it seems like economists have a much different view of statistics than other fields I am more familiar with.

The above link is to a preprint, by Robin Gomila, “Logistic or linear? Estimating causal effects of treatments on binary outcomes using regression analysis,” which begins:

When the outcome is binary, psychologists often use nonlinear modeling strategies such as logit or probit. These strategies are often neither optimal nor justified when the objective is to estimate causal effects of experimental treatments. . . . I [Gomila] draw on econometric theory and established statistical findings to demonstrate that linear regression is generally the best strategy to estimate causal effects of treatments on binary outcomes. . . . I recommend that psychologists use linear regression to estimate treatment effects on binary outcomes.

I don’t agree with this recommendation, but I can see where it’s coming from. So for researchers who are themselves uncomfortable with logistic regression, or who work with colleagues who get confused by the logistic transformation, I could modify the above advice, as follows:

1. Forget about the data being binary. Just run a linear regression and interpret the coefficients directly.

2. Also fit a logistic regression, if for no other reason than many reviewers will demand it!

3. From the logistic regression, compute average predictive comparisons. We discuss the full theory here, but there are also simpler versions available automatically in Stata and other regression packages.

4. Check that the estimates and standard errors from the linear regression in step 1 are similar to the average predictive comparisons and corresponding standard errors in step 3. If they differ appreciably, then take a look at your data more carefully—OK, you already should’ve taken a look at your data!—because your results might well be sensitive to various reasonable modeling choices.
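The four steps above can be sketched in a few lines. This is a hypothetical simulation, not from the linked paper: the data-generating numbers, the Newton-Raphson fit, and the variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.integers(0, 2, n)                      # binary treatment indicator
x = rng.normal(size=n)                         # pre-treatment covariate
p = 1 / (1 + np.exp(-(-0.5 + 0.8 * z + 0.6 * x)))
y = rng.binomial(1, p)                         # binary outcome

X = np.column_stack([np.ones(n), z, x])

# Step 1: linear regression; read the treatment coefficient directly
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Step 2: logistic regression, fit here by plain Newton-Raphson
b = np.zeros(3)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ b))
    w = mu * (1 - mu)
    b += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))

# Step 3: average predictive comparison for the treatment:
# average difference in predicted probability with z set to 1 vs. 0
X1, X0 = X.copy(), X.copy()
X1[:, 1], X0[:, 1] = 1, 0
apc = np.mean(1 / (1 + np.exp(-X1 @ b)) - 1 / (1 + np.exp(-X0 @ b)))

# Step 4: for well-behaved data like these, the two should be close
print(beta_ols[1], apc)
```

For data like these, the linear-regression coefficient and the average predictive comparison typically agree to a couple of decimal places; the interesting case is when they don't.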

Don’t get me wrong—when working with binary data, there are reasons for preferring logistic regression to linear. Logistic should give more accurate estimates and make better use of the data, especially when data are sparse. But in many cases, it won’t make much of a difference.

To put it another way: in my work, I’ll typically just do steps 3 and 4 above. But, arguably, if you’re only willing to do one step, then step 1 could be preferable to step 2, because the coefficients in step 1 are more directly interpretable.

Another advantage of linear regression, compared to logistic, is that linear regression doesn’t require binary data. Believe it or not, I’ve seen people discretize perfectly good data, throwing away tons of information, just because that’s what they needed to do to run a chi-squared test or logistic regression.

So, from that standpoint, the net effect of logistic regression on the world might well be negative, in that there’s a “moral hazard” by which the very existence of logistic regression encourages people to turn their outcomes into binary variables. I have the impression this happens all the time in biomedical research.

A few other things

I’ll use this opportunity to remind you of a few related things. My focus here is not on the particular paper linked above but rather on some of these general questions on regression modeling.

First, if the goal of regression is estimating an average treatment effect, and the data are well behaved, then linear regression might well behave just fine, if a bit inefficiently. The time when it’s important to get the distribution right is when you’re making individual predictions. Again, even if you only care about averages, I’d still generally recommend logistic rather than linear for binary data, but it might not be such a big deal.

Second, any of these methods can be a disaster if the model is far off. Both linear and logistic regression assume a monotonic relation between E(y) and x. If E(y) is a U-shaped function of x, then linear and logistic could both fail (unless you include x^2 as a predictor or something like that, and then this could introduce new problems at the extremes of the data). In addition, logistic assumes the probabilities are 0 and 1 at the extremes, and if the probabilities asymptote out at intermediate values, you’ll want to include that in your model too, which is no problem in Stan but can be more difficult with default procedures and canned routines.
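The asymptote point is easy to see in code: a curve that flattens out at, say, 0.10 and 0.90 rather than 0 and 1 is just a rescaled inverse logit. This is a sketch; the function name and the particular bounds are mine, not from any package.

```python
import numpy as np

def bounded_logistic(x, a, b, lo, hi):
    """Logistic curve whose asymptotes are lo and hi rather than 0 and 1."""
    return lo + (hi - lo) / (1 + np.exp(-(a + b * x)))

x = np.linspace(-10, 10, 5)
p = bounded_logistic(x, 0.0, 1.0, 0.10, 0.90)
print(p)  # flattens out near 0.10 and 0.90, never reaching 0 or 1
```

A default logistic regression cannot produce this shape; the lo and hi parameters have to be estimated (or fixed) explicitly, which is straightforward in Stan but not in most canned routines.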

Third, don’t forget the assumptions of linear regression, ranked in decreasing order of importance. The assumptions that first come to mind are, for many purposes, the least important assumptions of the model. (See here for more on assumptions.)

Finally, the causal inference thing mentioned in the linked paper is a complete red herring. Regression models make predictions, regression coefficients correspond to average predictions over the data, and you can use poststratification or other tools to use regression models to make predictions for other populations. Causal inference using regression is a particular sort of prediction having to do with potential outcomes. There’s no reason that linear modeling is better or worse for causal inference than for other applications.

1. Bob Carpenter says:

Believe it or not, I’ve seen people discretize perfectly good data, throwing away tons of information, just because that’s what they needed to do to run a chi-squared test or logistic regression.

Speaking of discretizing data, it’s quite common to do that to convert to a classification problem to feed into a neural network or other machine learning package.

Like the comment on cut the other day, this is a practice we do all the way down. Even that golf putting data you’ve been presenting is discretized to the foot, not given in infinite precision. Hey, we should do a simulation in that case study to see how much the discretization hurts. That is, simulate continuous data, then discretize it to the nearest foot, then run both data sets.

Demographic features like income and education and geolocation get binned all the time. We might know someone’s from Michigan, or maybe even from the 3rd congressional district or from a given zip code. But you don’t usually get income in dollars or the latitude and longitude of their residence.

Learning stats, I found it very useful in calibrating my intuitions to calculate just how little information there is in binary data on its own. Even with 1000 observations, point estimates still have standard errors in the 1.5%+ range. In baseball terms that gives us a 95% interval of something like (0.360, 0.420) for on-base percentage after roughly two whole seasons of play. That’s why baseball analytics these days is more focused on continuous outcomes like exit velocity (speed and launch angle) than on discrete outcomes like base hits or home runs.
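The arithmetic behind that interval can be checked directly. A quick sketch, taking p = .390 and n = 1000 as rough stand-ins for the numbers in the example:

```python
import math

p, n = 0.39, 1000                 # e.g., a .390 on-base percentage over ~1000 plate appearances
se = math.sqrt(p * (1 - p) / n)   # binomial standard error of a proportion
lo, hi = p - 1.96 * se, p + 1.96 * se
print(se, lo, hi)                 # se is about 0.015; interval about (0.36, 0.42)
```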

Linear in log odds is still relatively interpretable, though clearly not as easy as reasoning in pure probability.

• Andrew says:

Bob:

I took the golf putting data that were given to me. If the distances had been in inches and not feet, I would’ve used those numbers. But in this case it wouldn’t have made a difference: this could be confirmed with a simulation study, but in this case the answer is so clear that I don’t really need to do that study. If the data were rounded to the nearest 10 feet, though, that would make a difference. For the golf example, the real gain could come from additional information about the putts, for example whether they were too short, too long, to the left or right, the slope of the green, the weather conditions, etc. As you say, in real life we’re only using part of the available data.

That said, discretizing continuous data to binary is particularly wasteful of information. This can also be seen using analytic calculations or simulation studies. I wrote a paper about this with David Park a few years ago: it turns out it’s much better to divide the data into three categories and label them as -1, 0, 1 (or, in a simple comparison, compare high values to low values, throwing away the middle values). Trichotomizing still discards some information, but it can give the interpretability benefits of dichotomization while minimizing the information loss.
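One quick way to see the pattern is to compare the squared correlation between a standard normal variable and its discretized versions. This is my own rough sketch, not the Gelman and Park calculation: squared correlation is only a crude proxy for retained information, so the exact percentages depend on the criterion used.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200_000)

# Dichotomize at the median: coded -1 / 1
di = np.where(x > 0, 1, -1)

# Trichotomize into roughly equal thirds: coded -1 / 0 / 1
z = 0.4307  # approximate upper-third cutoff for a standard normal
tri = np.where(x > z, 1, np.where(x < -z, -1, 0))

# Squared correlation with the underlying continuous variable,
# a crude measure of retained (linear) information
r2_di = np.corrcoef(x, di)[0, 1] ** 2
r2_tri = np.corrcoef(x, tri)[0, 1] ** 2
print(r2_di, r2_tri)  # roughly 0.64 vs. 0.79
```

Trichotomizing retains noticeably more of the information than dichotomizing, while keeping the simple high/middle/low interpretation.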

• Jordan Anaya says:

Do you have any thoughts on discretizing the outcome variable? In the biomedical world we are often interested in a cutoff value, above which the patient will receive potentially expensive/dangerous treatment.

My current strategy is to regress to a continuous value, and then see if the predicted value is above the established cutoff for treatment. But I have been wondering if I should just turn the problem into a binary classification and label the training data based on whether the outcome metric was above or below a cutoff, and then the model will just output its prediction about whether the patient should be given the treatment. We lose a lot of information this way, but I have been thinking about trying it, because one issue with mean squared error (for a deep learning loss function) is that larger values get weighted more (and yes I’ve been log transforming the data https://statmodeling.stat.columbia.edu/2019/08/21/you-should-usually-log-transform-your-positive-data/).

• Andrew says:

Jordan:

I think the best approach is to model the continuous outcome, get your posterior distribution including a joint predictive distribution for whatever you might care about, and then map back to any cutoff decision after that. Doing a reduced-form approach as you suggest can work, but can also throw away lots of information. See this old post, Don’t model the probability of win, model the expected score differential. That particular post was on sports, but the same point holds for data from politics, biology, whatever.

• Ben says:

> it turns out it’s much better to divide the data into three categories and label them as -1, 0, 1 (or, in a simple comparison, compare high values to low values, throwing away the middle values)

1. Should we survey directly with the -1, 0, 1 scale?

This seems like it’d make the regression easier because then we may as well talk about it on the linear scale (instead of coding it categorically). It’d also make visualization super easy.

2. Do 5 point scales not matter?

Also the “Don’t model the probability…” link you posted is broken.

• Andrew says:

Ben:

There’s still some information in the continuous response, and it does throw away some (10-15%) of the information to just use three categories; also, this works best if the three categories are roughly equally populated. So just asking a question with three responses might not work. For example, the General Social Survey asks about happiness on a 0/1/2 scale, but the vast majority of responses are 1’s and 2’s. I think a scale with more points (5 or 6 or 7 or 10, whatever) would work better, and indeed that’s what other surveys do for that question.

Also, I fixed the link; thanks.

2. Terry says:

OK, the obvious question: what do you do when the linear regression model predicts probabilities > 1 or < 0, or when predictions get close to the > 1 or < 0 territory?

I've run into situations where this is a real problem. Probability of snowfall as a function of temperature drops to zero around 40 degrees or so and stays zero up to 120 degrees. A simple linear regression would predict very negative values around 120 degrees. (Ignore the U-shaped problem where probability of snowfall decreases below 20 degrees or so.)

• pophealth says:

That would be bad. I believe the author guards themselves from this criticism by saying that that is a prediction problem and that more often than not you just want to estimate the ATE, which is unlikely to get into that territory. If mostly what you do is prediction, then this is likely a weird paper (imo).

• Jonathan (another one) says:

+1

I had many colleagues who were taught to use OLS to estimate binary outcomes who tied themselves in knots with predictions over 1 and under 0. To me, the obvious appeal of logistic regression is the behavior when the independent variables point strongly in one way or another. In general, it ought to be a lot harder to change some probability from 95 percent to 100 percent than it is to change it from 55 to 60. If not in your case, sure — use OLS. But in estimating people’s choice behavior, for example, nonlinearity of the sort the logit or probit induces seems much more sensible in most cases I’ve seen.

• Anoneuoid says:

OK, the obvious question: what do you do when the linear regression model predicts probabilities > 1 or < 0, or when predictions get close to the > 1 or < 0 territory?

Negative Probabilities in Financial Modeling
https://onlinelibrary.wiley.com/doi/abs/10.1002/wilm.10093

Larger than One Probabilities in Mathematical and Practical Finance
https://econpapers.repec.org/article/bapjournl/120401.htm

• Daniel Lakeland says:

🙄

Great, just what we needed: some mathematician trying to redefine words… Probability already has a perfectly good meaning, and it’s by definition a number between 0 and 1… if you want to work with some other numbers, call it something else.

• Anoneuoid says:

Can anyone make sense of this?

In contrast to the wide-spread opinion, inflated, or larger than 1, probability has a natural interpretation. To show this, let us consider the following situations.

Example 1. In an experiment, three coins are tossed. The conventional question is: What is the probability of getting, at least, one head in one toss? To calculate this probability, we assume that all coins are without defects and all tosses are fair and independent. Thus, the probability of having heads (h) for one toss of one coin is p1(h) = 0.5. The same is true for tails (t). Thus, the probability of having no heads or what is the same, of having three tails, in this experiment is p(0h) = p(3t) = 0.5 x 0.5 x 0.5 = 0.125.

At the same time, we may ask the question: What is the probability of getting heads in one toss with 3 coins? To answer this question, let us suppose that probability reflects not only the average of getting of heads (which is 0.5) but also the average number of obtained heads in one toss (it may be 1.5, for example). In this case, the probability of having heads in tossing three coins once is p3(h) = 0.5 + 0.5 + 0.5 = 1.5.

The difference between these two situations is that in the first case, getting one head and getting two heads are different events (outcomes of the experiment). In contrast to this, in the second case, getting one head and getting two heads are different parts of the same event, or better to say, the same multi-event (parts of the same outcome of the experiment). It means that an outcome may have a weight, e.g., when two heads occur, the weight is 2, while when three heads occur, the weight is 3.

I get:

> In an experiment, three coins are tossed. The conventional question is: What is the probability of getting, at least, one head in one toss?

From the geometric distribution:
1 – (1- p_h)^3 = 1 – p_t^3 = 0.875

> At the same time, we may ask the question: What is the probability of getting heads in one toss with 3 coins?

From the binomial distribution:
choose(n, k)*p_h^k*(1 – p_h)^(n – k)
= choose(3, 1)*0.5^1*(1 – 0.5)^(3 – 1) = 0.375

• Anoneuoid says:

> mean(replicate(1e6, sum(sample(0:1, 3, replace = T)) > 0))
[1] 0.87579

> mean(replicate(1e6, sum(sample(0:1, 3, replace = T)) == 1))
[1] 0.37465

I assume they knew this, so what problem are they trying to solve?

• Daniel Lakeland says:

I think they are working with essentially unnormalized probabilities and finding that for some problems you don’t need to normalize to get the answer

• Anoneuoid says:

Well, you did say probabilities do not necessarily correspond to frequencies.

• Daniel Lakeland says:

Sure, the proof of Cox’s theorem that Kevin S. Van Horn gives assumes only that there is a Low value and a High value… From that you can always map the range into [0,1] without loss of information. So these guys just work with something like the prescaled values. There is nothing to see here, unless I’m misunderstanding.

• Anoneuoid says:

How does that explain those:

suppose that probability reflects not only the average of getting of heads (which is 0.5) but also the average number of obtained heads in one toss (it may be 1.5, for example). In this case, the probability of having heads in tossing three coins once is p3(h) = 0.5 + 0.5 + 0.5 = 1.5.

So for them it looks like the “probability” corresponds to the expected value, instead of the frequency. I don’t see how to connect that to your interpretation of unnormalized probabilities.

Maybe these papers are worthy of their own blog discussion, because I can’t find discussion of them anywhere. @Andrew?

• Daniel Lakeland says:

Probability is an axiomatic system which is well understood… everywhere that they use the word probability just insert the word loofah… now it becomes obvious that they are not talking about probability… at which point it is up to them to explain what they are talking about and why anyone should care.

At some point they were talking about “multiplicities,” and it became essentially obvious that they needed to divide by something to get probability, which is why I thought they were talking about unnormalized probability.

In any case I wouldn’t spend much time on it… it’s only the fact that they misuse the term probability that got them any special attention. If I talk about how loofahs could be any real number, you wouldn’t care, because you’d say “why should I care about this loofah thing?”…

• Jonathan (another one) says:

My favorite quote (from the second article) is:

In a criminal investigation, the detective gets some evidence that strongly shows that the main suspect is not guilty. To express his confidence, the detective says: “Oh, I’m 70% sure that this woman is not guilty.” According to the subjective interpretation, 70% mean that the probability of the woman’s innocence is 0.7.

After some time, the detective gets even more persuading evidence and when asked by his chief, the detective says: “Now I’m two times more confident that this woman is not guilty.” However, two times 70% gives us 140% and according to the conventional probability theory, this is impossible because 140% mean that the probability of the woman’s innocence is 1.4, while larger than 1 probabilities are not permitted. However, if we allow larger than 1 probabilities, the statement of the detective becomes clear.

If anyone understands the meaning of the last word in that quote, please let me know.

Like Daniel Lakeland, I presume the authors mean to do something like the renormalization of quantum theory in order to address pesky problems like how a lognormal process can generate a negative interest rate. The fact that I can’t figure out what they’re doing is my fault, I guess. At least I’m 118.3% sure it’s my problem.

3. pophealth says:

Very interesting article.

I read the average predictive comparisons paper but I am a bit confused on how to implement it. Does anybody have examples in Stan where they have done this?

• Daniel Lakeland says:

In Stan, you can predict one value for each data point using the generated quantities block… Then after sampling, just average those. Or, if that’s too much data, do the averaging in Stan and just generate the average prediction and sample that.

4. Rodney Sparapani says:

Hi Gang:

This kind of advice comes from the distribution-free asymptotic proofs. However, these are asymptotic, so it is not clear how relevant this is for small sample sizes. And, for what it’s worth, the causal inference theoretical work has moved on and there are new ways to handle dichotomous outcomes: see two-stage predictor substitution and two-stage residual inclusion.

Rodney

5. NYCer says:

I notice that the author cites Hellevik’s 2007 paper, where I first heard about these arguments.
See http://folk.uio.no/stvoh1/Q&Q%20Linear%20vs%20logisitic%20regression.pdf at p 73.

The idea gets picked up every now and then, tho I often wonder whether it isn’t more for convenience than anything else. See this https://www.nber.org/papers/w23830 at FN 7.

6. Dave says:

I seem to remember Mostly Harmless Econometrics advocating linear regression instead of logistic for this scenario too. I found that odd, but I guess it’s roughly the same if your context is dealing with small changes being fed to the logistic function, and the output isn’t too close to 0 or 1.

• Daniel Lakeland says:

See but in that case… they’re the same, and in the case where you are near the boundaries or have large changes… then logistic is right… so this is just wrong in every way.

• Ram says:

As long as each (X, y) pair is independently sampled from a process with unchanging covariances, OLS is a consistent estimator of the best linear approximation of E[y|X] (“best” in the MSE sense). This is true whether y is binary, ordinal, integral, real, … Intuitively, this means that there is a sample size beyond which this estimator will be close to this estimand with high probability, however you choose to define “close” and “high”. Thus, if what we’re after is this best linear approximation of E[y|X], and we have a reasonably large sample size, OLS will do just fine for our purposes.

As for inference, OLS is asymptotically normal about this best linear approximation for no other reason than the CLT, so provided our sample size is reasonably large, we just need consistent standard errors. These come in either the sandwich/Eicker-Huber-White variety, or the nonparametric bootstrap variety.

The question, then, is not whether a laundry list of model assumptions hold, but whether we have a sufficient sample size, and whether the object of interest is the best linear approximation of E[y|X]. It’s true that OLS has even nicer properties when some of these assumptions hold (linearity gives us unbiasedness, homoscedasticity gives us efficiency in the class of linear unbiased estimators, normality gives us asymptotic efficiency in the class of (linear and nonlinear) unbiased estimators). But we may still be interested in this object even if none of these assumptions hold. Linear approximations are easily interpreted and analytically/computationally convenient, and in general E[y|X] is going to be a messy object to try to work with directly.
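Ram's recipe, OLS on a binary outcome plus sandwich standard errors, can be sketched in a few lines of plain NumPy. This is a minimal hand-rolled illustration on simulated data, not the implementation from any particular package; in practice one would use a library's HC-robust option.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.5 * x))))  # binary outcome

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                  # OLS point estimates
resid = y - X @ beta

# Sandwich (HC0 / Eicker-Huber-White) covariance:
# (X'X)^-1 X' diag(e^2) X (X'X)^-1
meat = X.T @ (resid[:, None] ** 2 * X)
cov_hc0 = XtX_inv @ meat @ XtX_inv
se_hc0 = np.sqrt(np.diag(cov_hc0))

# Classical homoscedastic standard errors, for comparison
sigma2 = resid @ resid / (n - 2)
se_ols = np.sqrt(np.diag(sigma2 * XtX_inv))

print(beta, se_hc0, se_ols)
```

The slope here estimates the best linear approximation of E[y|x], and the sandwich standard errors stay valid even though a binary y is necessarily heteroscedastic.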

• Daniel Lakeland says:

The fact that we can get arbitrarily close to the best linear estimator is small comfort when the thing we’re estimating is highly nonlinear.

library(ggplot2)

set.seed(10)

x = runif(30,-1,1)
y = x^2

ggplot(data.frame(x=x,y=y),aes(x,y))+geom_point()+geom_smooth(method="lm")

• Ram says:

Yes, linear approximation is just that. If we have reason to think the linear approximation is not going to be of scientific value, then we shouldn’t aspire to estimate it. But E[y|X] actually being linear in X is not a necessary condition for this approximation to have such value, nor is it being very nearly linear in X.

• Ram says:

Also note that linear regression can incorporate nonlinear transformations of the predictors (and interactions, and compositions of nonlinearity and interaction). The point is goodness of fit, and approximate truth of model assumptions, is not the only consideration in choosing a specification. (Sometimes it doesn’t matter at all.) Also important is what information is worthwhile to extract from E[y|X].

• Anonymous says:

I mean the only reason that logistic is “right” in any sense is because you’re rescaling your outcome. You could easily do this in your code and get an MSE of zero!

library(ggplot2)

set.seed(10)

x = runif(30,-1,1)
y = x^2

ggplot(data.frame(x=x,y=sqrt(y)*sign(x)),aes(x,y))+geom_point()+geom_smooth(method="lm")

There’s another issue here, which is that there is no reason to suspect that logit probabilities should be linear in x as a general matter either; in particular if logit(pi) is a function of x^2 then you’re right back in the same boat

• Daniel Lakeland says:

I’m not assuming logit linear… f(x) can be any nonlinear function of x, so

inverse_logit(f(x)) is a general purpose specification.

The point of the inverse_logit is entirely that it takes the real line and maps it to 0,1 so it guarantees you a result in the range [0,1]

people describe logistic regression as a linear regression on the log-odds… that’s wrong, I mean, it’s the wrong way to think about things… all logistic regression is is a convenient means to compress the real line into [0,1]
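The squashing view is easy to demonstrate. A trivial sketch, where f is an arbitrary nonlinear function chosen purely for illustration:

```python
import numpy as np

def inv_logit(t):
    """Map the whole real line into (0, 1)."""
    return 1 / (1 + np.exp(-t))

# f can be any function of x at all; inv_logit just compresses its
# output into (0, 1), so the result is always a valid probability
f = lambda x: 2 * np.sin(x) + 0.5 * x
x = np.linspace(-5, 5, 1000)
p = inv_logit(f(x))
print(p.min(), p.max())  # strictly inside (0, 1), whatever f does
```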

• Daniel Lakeland says:

we know ahead of time that we’re trying to estimate a function which is constrained to be between 0 and 1 no matter what the inputs are…

estimating the best q such that

y = inverse_logit(f(x,q))

is relatively easy, as is plugging in x values to the fitted function and getting predictions… and therefore it’s also easy to estimate the “causal effect” of changing x (under the assumptions that you have a causal model)

what exactly is the point of encouraging people to do something else?

• Ram says:

You’re right that logistic regression makes use of the fact that E[y|X] is necessarily between 0 and 1, whereas linear regression doesn’t. If we’re counting points, that’s a point in favor of logistic regression. That said, in general E[y|X] is neither linear in X nor logit-linear in X. A priori, there’s no reason to expect one to be a worse overall approximation of E[y|X] than the other in any particular application. The advantage of the linear approximation is that it is directly interpretable, and analytically convenient. Logistic regression is only indirectly interpretable (no one really thinks about probability comparisons in terms of odds ratios, and average predictive comparisons require an extra set of computations), and is analytically inconvenient. If what I want is a generative model of the data, I get the case for logistic regression. But if what I want is the best linear approximation of E[y|X], why take the circuitous route you suggest?

• Daniel Lakeland says:

I really can’t imagine any situation where what we really want to know is the answer to the question “what is the best linear approximation of E[y|X]?”

What we want to know is “what can we actually expect for y if we know x” and when y is constrained to [0,1] and the best linear approximation can produce values outside that range… it isn’t going to be a very good approximation.

If you’re looking at something like x is always in some well defined range [x1,x2] over which range y varies from 43% to 47% or even say 30 to 50 %, then sure, it’s a potentially fine approximation to say y is a linear function of x. But such an approximation should be made intentionally, not just by default.

• Ram says:

Daniel,

You said “What we want to know is ‘what can we actually expect for y if we know x’”. Seems you’re saying what we want to know is E[y|X]. I agree! But in general, this is a pretty messy object to work with directly. If we just care about predictions, we can use nonparametric regression/supervised ML for this purpose. But if we care about the function itself, and extracting useful information from it, the best linear approximation (given a pertinent specification) is often a reasonable target for estimation and inference.

• Daniel Lakeland says:

> the best linear approximation (given a pertinent specification) is often a reasonable target for estimation and inference.

This seems to depend highly on context. Perhaps the problems you deal with are of the sort “almost everyone has at least 30% success at thing Y and it’s really hard to get better than 45% success,” so that whatever your problem is, it’s working in a nice range that doesn’t suffer from being near a saturation point or anything.

On the other hand, if you’re working with say children who get 30% of things right without any instruction and 97% right with a lot of high quality instruction, and in between there’s a big nonlinear range increasing up to about 75% and then a long tail to the right where learning becomes harder and harder when you want to get closer and closer to perfect, or anything even approximately like that… then no, linear models are going to suck. I think it’s very common to have problems like this, or problems with estimating percentage of failures after a certain time, or anything like that where the edges of the [0,1] y interval come into play.

• Ram says:

Daniel,

Yes, that’s what I meant by “often”. I would never advocate that people turn their brains off and use a canned procedure no matter the context. The point is just that often it will be reasonable to study the best first order approximation of the regression function, and when it is and when we have enough data, OLS plus robust inference will do just fine. When it’s not reasonable to do this, given background information and/or preliminary data, one shouldn’t do this. You’re right that different fields will differ in how frequently this is or isn’t reasonable, but different fields will also differ in how frequently the best logit-linear approximation is or isn’t reasonable.

• Daniel Lakeland says:

Yeah, I don’t advocate logit-linear either ;-)

E[y] = inv_logit(f(x;a)) is a general purpose specification, find parameters “a” such that this nonlinear function f applied to predictors x, when transformed through the inv_logit function produces a good approximation to E[y].

You never need probit, or any other model in the modern world, because the point isn’t “your function is close to a line when graphed on the log odds scale” or whatever people used to think back in 1959 when all calculations were done with a slide rule, a ruler, and a pencil; it’s “your function f can be literally any function you like, and inv_logit just bounds it to be in the range [0,1].”

• Ram says:

Daniel,

If I understand your proposal correctly, it is just nonparametric regression. By this I just mean that your model allows for E[y|X] to be anything at all, consistent with the [0, 1] range restriction. As I said that’s fine if your goal is prediction. If your goal is just to understand some specific aspect of E[y|X], one way to do this is to first build a completely specified representation of the DGP, and then extract this aspect from this representation. Another way is to estimate this aspect directly, and skip the intermediate representation step. When the aspect in question is the best linear approximation, OLS plus robust inference is an example of the latter approach. In general, if the goal is to get from A to B, why go from A to C to B when you can just go directly from A to B?

• Daniel Lakeland says:

I guess there are two levels here… One is mathematical… mathematically inv_logit(f(x)) is all you need, because it forms a complete basis for all functions from x to [0,1]

On the other hand, in applications, it’s much better if you restrict f(x) to some nonlinear family with a meaningful interpretation for your problem… and here I agree with you that restricting your problem to an interpretable subset is important in real-world applications.

If at the end, the f(x) you need looks linear over the range of application, then you can replace f(x) with a+b*x (or the vector equivalent). If even after applying inv_logit(f(x)) it *still* looks linear over the range, and you will never be inputting anything that would cause a prediction to result in values outside the [0,1] range, then you can replace inv_logit(f(x)) with a+b*x…

why go through the intermediate representation? In part because then you know whether the approximation is good.

but the bigger reason is that by thinking about your problem in a framework in which you can solve any problem at all… you have enough rope to know that you need to be careful not to hang yourself… When you look at a problem with a very simple view that everything is a single-shot pistol, then you are apt to shoot yourself in the foot… We see this time and again, such as:

https://statmodeling.stat.columbia.edu/2020/01/09/no-i-dont-think-that-this-study-offers-good-evidence-that-installing-air-filters-in-classrooms-has-surprisingly-large-educational-benefits/

where very clearly someone applied “the formula” and got “the answer.”

So I don’t disagree with you per se. I just think it’s important to treat your problem as a simplification of the more complex problem, which you nevertheless know how to solve and know is not required in this case… rather than write papers saying essentially “all that’s ever really needed is to fit y = a + b*x in most causal inference problems.”

• Ram says:

Daniel,

OK, I think we’re mostly in agreement. For the record, I was not defending the claim that one should always and everywhere use linear regression for causal inference problems. I haven’t read the paper Andrew cited, so I can’t comment on it. My goal was just to explain why linear regression with a binary outcome isn’t uniformly unreasonable, and is sometimes quite reasonable.

As a side note, I do think *trying* to go from A to C to B can leave you in a worse spot than trying to go straight from A to B. For example, suppose we use a tree model as our representation. These models, and their variants (e.g., random forests), have been used to great effect for fitting binary response data, for making out-of-sample predictions, etc. And yet, these models are locally constant. This means that if you fit one, then try to compute an average derivative, you’re going to get nonsense, even though you’ve constructed a pretty good representation of the DGP. On the other hand, there are methods for estimating average derivatives that skip the intermediate representation step which would do just fine (under the right conditions, OLS will do this). So in this case you’d actually be doing worse by first fitting a good model of the data.
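Ram’s tree example can be sketched in a few lines of numpy, using a binned (locally constant) fit as a crude stand-in for a shallow regression tree; the data-generating process is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10_000)
y = 2 * x + rng.normal(0, 0.1, x.size)   # true average derivative is 2

# A locally constant fit: bin x and predict each bin's mean outcome
# (a crude stand-in for a depth-limited regression tree)
bin_id = np.digitize(x, np.linspace(0, 1, 11)[1:-1])
bin_means = np.array([y[bin_id == k].mean() for k in range(10)])
step_fit = bin_means[bin_id]
# The fit is decent, yet its pointwise derivative is zero almost
# everywhere -- useless if the average derivative is the target.

# Estimating the average derivative directly via OLS recovers it
slope = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
```

The step function is a perfectly good predictive model here, but differentiating it gives zero almost everywhere, while the direct OLS estimate of the average slope is on target.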

The obvious response is “why would I fit a locally constant model if my goal is to estimate a derivative”? And that’s a great question. But what it shows is that when constructing said representation, goodness-of-fit and assumption-satisfaction are not the only relevant considerations. What information we care about extracting is also important. And if someone proposes a method that gets me what I’m after without first going through involved representation exercises, they may be saving us from arriving at bad answers generated by good models that weren’t designed to answer the right questions.

• Daniel Lakeland says:

Those are all good points. I suggest you take a skim over this paper. I haven’t read it in depth either, but from skimming it, I think it’s wrong-headed.

For example in the summary at the end:

“In the presence of binary outcomes, the predominance of nonlinear modeling analysis strategies such as logit and probit in the literature may have negative implications for the field. First, researchers sometimes only report logit or probit regression coefficients. These coefficients are not interpretable, which shifts the focus of interest towards statistical significance, and away from actual effect sizes. Second, researchers often interpret the results of logistic regression in terms of odds-ratios, which are undeniably difficult to interpret. This, once again, focuses the attention onto p-values, and denies the practical and theoretical importance of effect sizes. On the contrary, linear regression yields results that are immediately interpretable in terms of probability of change, which is the most desirable way to communicate effect sizes.”

This just seems like paternalistic “you don’t know what you’re doing so just stick to OLS”.

• Ram says:

Daniel,

I’ll try to read the paper when I get a chance. The snippet you quoted doesn’t seem unreasonable to me. I work in medical and health services research, and my physician collaborators do tend to focus exclusively on p-values in these settings, in part because they have no idea how to think about odds ratios. I’ve spent a lot of time trying to get them to understand them, but it really is a strange way of thinking about probability comparisons. Often rather than using linear regression, I’ll suggest that we use a log link model of some sort, so that we can quote effects in terms of risk ratios (relative risks). Even if we do stick with logistic regression, I’ll often do average predictive comparisons just so they can wrap their heads around the quantitative conclusions. But often I feel like a linear regression would be the simplest solution. And yet there is a misconception in my field that linear regression is never sensible with binary outcomes, so I usually avoid it in practice. Again, I haven’t read the whole paper, but if the argument of the paper is summarized by this quote I could see it doing *some* practical good.
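The oddity of odds ratios is easy to show with two made-up probabilities: when the outcome is common, the odds ratio drifts far from the risk ratio, which is what collaborators usually think they are being told.

```python
# Made-up event probabilities in control and treated groups
p_control, p_treated = 0.40, 0.60

def odds(p):
    return p / (1 - p)

rr = p_treated / p_control               # risk ratio: 0.60 / 0.40 = 1.5
or_ = odds(p_treated) / odds(p_control)  # odds ratio: 1.5 / 0.667 = 2.25
# "The odds more than double" sounds far more dramatic than
# "risk goes from 40% to 60%", yet both describe the same comparison.
```

For rare outcomes the two nearly coincide, which is why the confusion is worst in exactly the common-outcome settings Ram describes.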

• Daniel Lakeland says:

> I’ll often do average predictive comparisons just so they can wrap their head around the quantitative conclusions.

We should always do average predictive comparisons, relying entirely on direct interpretability of coefficients is a mistake in my opinion. But this paper basically reads like “we can never expect anyone to understand what they’re doing, so we should dumb it down enough that when they do things wrong they’re not too far wrong” and I don’t think that’s the right attitude at all.

Many of the problems they seem to be concerned about are created almost entirely by point-estimation procedures, which leave you with a lack of clarity on just how much uncertainty there is, particularly in the “interpreting coefficients of nonlinear models” case. Bayesian posteriors, average predictive comparisons, graphical model checks: these are what we need to teach as the norm.
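A minimal sketch of an average predictive comparison, with simulated data and a hand-rolled Newton-Raphson logistic fit (all coefficients and sample sizes here are invented): fit the model, then flip the treatment for everyone and average the change in predicted probability.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
z = rng.normal(size=n)                  # a covariate
t = rng.binomial(1, 0.5, n)             # randomized binary treatment
p = 1 / (1 + np.exp(-(-0.5 + 1.0 * t + 0.7 * z)))
y = rng.binomial(1, p)

# Fit logistic regression by Newton-Raphson (no external libraries)
X = np.column_stack([np.ones(n), t, z])
beta = np.zeros(3)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    W = mu * (1 - mu)
    # Newton step: solve (X' W X) delta = X' (y - mu)
    beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - mu))

# Average predictive comparison: set t = 1 for everyone, then t = 0,
# and average the difference in predicted probability
p1 = 1 / (1 + np.exp(-np.column_stack([np.ones(n), np.ones(n), z]) @ beta))
p0 = 1 / (1 + np.exp(-np.column_stack([np.ones(n), np.zeros(n), z]) @ beta))
apc = np.mean(p1 - p0)                  # effect on the probability scale
```

The coefficient beta[1] lives on the log-odds scale and is hard to communicate; `apc` is a plain change in probability, which is the quantity collaborators actually ask for.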

7. Psyoskeptic says:

This advice falls apart when there are both main effects and interactions. If you fit a linear model to binary data in the presence of a main effect, an interaction can be found where there is none, or even appear with the wrong sign when there is one.
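Psyoskeptic’s point can be demonstrated with a quick simulation (the coefficients are invented): generate outcomes that are purely additive on the log-odds scale, and the difference-in-differences of cell probabilities, which is what a linear model’s interaction term estimates, is still far from zero because of the ceiling effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
a = rng.binomial(1, 0.5, n)          # factor A (strong main effect)
b = rng.binomial(1, 0.5, n)          # factor B (main effect, NO interaction)
logit = -1.0 + 2.5 * a + 1.5 * b     # purely additive on the log-odds scale
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Cell means on the probability scale
cell = {(i, j): y[(a == i) & (b == j)].mean() for i in (0, 1) for j in (0, 1)}
# Linear-scale "interaction": difference in differences of probabilities
did = (cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0])
# Clearly nonzero despite zero interaction on the logit scale: the high
# cell is pushed against the ceiling, compressing B's apparent effect
```

With these coefficients the true cell probabilities are roughly 0.27, 0.62, 0.82, and 0.95, so the probability-scale interaction is around -0.22 even though none exists on the logit scale.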

• Andrew says:

Psyoskeptic:

8. Ron Kenett says:

Two remarks:
1. This is all an issue of generalisability. Discretisation has an obvious impact on it, and so does the choice between linear and logistic regression. Just to clarify, generalisation is what you can say, using your current data and analysis, in a wider context. This is one of the dimensions of our information quality framework. If you discretise you might either enhance or reduce information quality. The instructor’s site accompanying our book now has 48 presentations on information quality https://www.wiley.com//legacy/wileychi/kenett/presentation.html?type=SupplementaryMaterial
2. Regarding the above discussions, I suggest looking at the fine results presented in https://aisel.aisnet.org/jais/vol18/iss4/1/