Regression: What’s it all about? [Bayesian and otherwise]

Regression plays three different roles in applied statistics:

1. A specification of the conditional expectation of y given x;

2. A generative model of the world;

3. A method for adjusting data to generalize from sample to population, or to perform causal inferences.

We could also include prediction, but I prefer to see that as a statistical operation that is implied for all three of the goals above: conditional prediction as a generalization of conditional expectation, prediction as the application of a linear model to new cases, and prediction for unobserved cases in the population or for unobserved potential outcomes in a causal inference.
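As a minimal sketch of the first role, the simulation below compares the average of y at each value of x with a least-squares line fitted to the same data. The data are made up: the line 1 + 2x, the noise level, and every variable name are assumptions chosen for illustration, not anything taken from Wakefield's book.

```python
import random
from collections import defaultdict

random.seed(1)

# Simulated data (assumed for illustration): E[y | x] = 1 + 2x, plus unit noise.
xs = [random.randint(0, 4) for _ in range(5000)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 1.0) for x in xs]

# Role 1 taken literally: the average of y within each observed value of x.
groups = defaultdict(list)
for x, y in zip(xs, ys):
    groups[x].append(y)
local_means = {x: sum(v) / len(v) for x, v in sorted(groups.items())}

# Least-squares line, which summarizes those averages with two numbers.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Print the local average next to the fitted value at each x.
for x, m in local_means.items():
    print(x, round(m, 2), round(intercept + slope * x, 2))
```

When the conditional expectation really is linear, as assumed here, the two columns should agree up to sampling noise; the regression line is then a compact description of the local averages.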

I was thinking about the different faces of regression modeling after being asked to review the new book, Bayesian and Frequentist Regression Methods, by Jon Wakefield, a statistician who is known for his work on Bayesian modeling in pharmacology, genetics, and public health. . . .

Here is Wakefield’s summary of Bayesian and frequentist regression:

For small samples, the Bayesian approach with thoughtfully well-specified priors is often the only way to go because of the difficulty in obtaining well-calibrated frequentist intervals. . . . For medium to large samples, unless there is strong prior information that one wishes to incorporate, a robust frequentist approach . . . is very appealing since consistency is guaranteed under relatively mild conditions. For highly complex models . . . a Bayesian approach is often the most convenient way to formulate the model . . .

All this is reasonable, and I appreciate Wakefield’s effort to delineate the scenarios where different approaches are particularly effective. Ultimately, I think that any statistical problem that can be solved Bayesianly can be solved using a frequentist approach as well (if nothing else, you can just take the Bayesian inference and from it construct an “estimator” whose properties can then be studied and perhaps improved) and, conversely, effective non-Bayesian approaches can be mimicked and sometimes improved by considering them as approximations to posterior inferences. More generally, I think the most important aspect of a statistical method is not what it does with the data but rather what data it uses. That all said, in practice different methods are easier to apply in different problems.

A virtue—and a drawback—of Bayesian inference is that it is all-encompassing. On the plus side, once you have model and data, you can turn the crank, as the saying goes, to get your inference; and, even more importantly, the Bayesian framework allows the inclusion of external information, the “meta-data,” as it were, that come with your official dataset. The difficulty, though, is the requirement of setting up this large model. In addition, along with concerns about model misspecification, I think a vital part of Bayesian data analysis is checking fit to data—a particular concern when setting up complex models—and having systematic ways of improving models to address problems that arise.

I would just like to clarify the first sentence of the quote above, which is expressed in such a dry fashion that I fear it will mislead casual or uninformed readers. When Wakefield speaks of “the difficulty in obtaining well-calibrated frequentist intervals,” this is not just some technical concern, that nominal 95% intervals will only contain the true value 85% of the time, or whatever. The worry is that, when data are weak and there is strong prior information that is not being used, classical methods can give answers that are not just wrong (that’s no dealbreaker; it’s accepted in statistics that any method will occasionally give wrong answers) but clearly wrong, obviously wrong. Wrong not just conditional on the unknown parameter, but conditional on the data. Scientifically inappropriate conclusions. That’s the meaning of “poor calibration.” Even this, in some sense, should not be a problem (after all, if a method gives you a conclusion that you know is wrong, you can just set it aside, right?) but, unfortunately, many users of statistics take p < 0.05 or p < 0.01 comparisons as “statistically significant” and use these as motivation to accept their favored alternative hypotheses. This has led to such farces as recent claims in leading psychology journals that various small experiments have demonstrated the existence of extra-sensory perception, or huge correlations between menstrual cycle and voting, and so on. In delivering this brief rant, I am not trying to say that classical statistical methods should be abandoned or that Bayesian approaches are always better; I’m just expanding on Wakefield’s statement to make it clear that this problem of “calibration” is not merely a technical issue; it’s a real-life concern about the widespread exaggeration of the strength of evidence from small noisy datasets supporting scientifically implausible claims based on statistical significance.
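The exaggeration described above can be sketched with a small simulation. The true effect of 0.1 and the standard error of 0.5 are hypothetical values picked to stand in for a small noisy study; nothing here is taken from the ESP or voting papers themselves.

```python
import random

random.seed(2)

TRUE_EFFECT = 0.1    # assumed small true effect (hypothetical units)
STD_ERROR = 0.5      # assumed standard error of each small study
N_STUDIES = 100_000

# Each simulated study reports an unbiased but noisy estimate.
estimates = [random.gauss(TRUE_EFFECT, STD_ERROR) for _ in range(N_STUDIES)]

# Keep only the estimates that reach two-sided p < 0.05.
significant = [e for e in estimates if abs(e) > 1.96 * STD_ERROR]

# How often does a study reach significance?
share_significant = len(significant) / N_STUDIES
# Average size of a significant estimate relative to the true effect.
exaggeration = sum(abs(e) for e in significant) / len(significant) / TRUE_EFFECT
# Share of significant estimates that point in the wrong direction.
wrong_sign = sum(e < 0 for e in significant) / len(significant)

print(share_significant, exaggeration, wrong_sign)
```

Under these assumed values, any estimate that clears the significance threshold must exceed 0.98 in absolute value, which is at least 9.8 times the assumed true effect, and a noticeable share of the significant estimates have the wrong sign.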

Frequentist inference has the virtue and drawback of being multi-focal, of having no single overarching principle of inference. From the user’s point of view, having multiple principles (unbiasedness, asymptotic efficiency, coverage, etc.) gives more flexibility and, in some settings, more robustness, with the downside being that application of the frequentist approach requires the user to choose a method as well as a model. As with Bayesian methods, this flexibility puts some burden on the user to check model fit to decide where to go when building a regression.

Regression is important enough that it deserves a side-by-side treatment of Bayesian and frequentist approaches. The next step is to take the level of care and precision that is taken when considering inference and computation given the model, and apply this same degree of effort to the topics of building, checking, and understanding regressions. There are a number of books on applied regression, but connecting the applied principles to theory is a challenge. A related challenge in exposition is to unify the three goals noted at the beginning of this review. Wakefield’s book is an excellent start.

35 thoughts on “Regression: What’s it all about? [Bayesian and otherwise]”

1. I am not sure any of those three roles capture what I see as the purpose of regression in causal inference. Using a regression model as a generative model of the world seems particularly wrong: Any regression model is going to be consistent with multiple data generating mechanisms, which will each imply different causal effects.
For me, the primary purpose of a regression model is to allow the analysis to incorporate assumptions about the joint distribution; this is necessary in order to reduce the dimensionality of the joint strata of the covariates.

• Anders:

For causal inference, you’ll want #3: “A method for adjusting data to generalize from sample to population, or to perform causal inferences.”

• When people use regression models for causal inference, control for confounding is due to conditioning on the confounding covariates. In an ideal world, this would be done with a non-parametric, saturated stratified analysis. In the real world, this is not possible due to continuous covariates and the curse of dimensionality. Therefore, you use regression models: what this does is allow you to incorporate statistical (non-causal) assumptions about certain parameters being equal to zero.

The point I am trying to make is that conceptually, what a regression model adds is not a method for controlling for confounding. You already get that from stratification. Rather, what a regression adds is a way to reduce the dimensionality once you have decided to condition on a number of covariates.

In order to determine which covariates to control for, what you need is a model for the data generating mechanism, such as a DAG.
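The dimensionality argument can be made concrete with a little arithmetic. The sketch below counts the strata in a saturated analysis and the parameters in a main-effects-only regression; the ten binary confounders and both function names are hypothetical, chosen only to illustrate the comparison.

```python
def saturated_cells(levels_per_covariate):
    """Number of strata in a fully stratified (saturated) analysis."""
    cells = 1
    for levels in levels_per_covariate:
        cells *= levels
    return cells


def additive_parameters(levels_per_covariate):
    """Intercept plus main effects only; every interaction is assumed to be zero."""
    return 1 + sum(levels - 1 for levels in levels_per_covariate)


# Hypothetical example: ten binary confounders.
levels = [2] * 10
print(saturated_cells(levels))      # 1024
print(additive_parameters(levels))  # 11
```

The gap between the two counts is exactly the set of interaction parameters that the regression model assumes to be zero, which is the statistical (non-causal) assumption referred to above.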

• Anders:

That may be true for you. But for many people, regression is used as a method for adjusting data to generalize from sample to population, or to perform causal inferences.

• It seems to me that the way regressions are used in the applied world is as a tool for the sort of storytelling you talked about earlier. People take 2 and 3 and then tell stories about 1 following the narrative: “We know that given x, y = z. Therefore, every time x obtains, we know y will be z.” as opposed to “Our model shows a tendency of y towards z when parametrized by x. Therefore, when x, it’s worth looking at z as a possible value for y, but knowing that it’s almost never going to be exactly z and sometimes not anywhere close to z. And therefore we need to be mindful about the real meaning of y to us (our models) in contexts where much more than x is going on.”

My favourite example is research on learning styles which showed tendencies of y given x (y being learning outcomes under the x of experimental conditions given instructional method), but misinterpreted that to mean that y is all instruction in all possible contexts. And when that turned out to be (predictably) wrong, other people came in and started objecting to y = z, given x, as having any sort of reality at all. It seems like a completely straightforward case of overgeneralization but it was aided by regressions about which stories (and there are real stories involved) were being told.

• Rahul:

There’s a limit to what can be put in a book review. If you want real applications I recommend you look at my book with Jennifer Hill. It’s full of examples.

• I don’t have a copy of Gelman and Hill at hand currently and it’s been a few years since I read it, but if I recall correctly, I think the majority of the examples in there would fall under your category 3 (e.g., the radon example and the well-switching example come immediately to mind). Is that a fair assessment? Were there any examples of your category 1 and 2 in that book? If so, would you list one or two of each?

• Andrew:

IMHO, the distinction between 1, 2 & 3 will not be clear to a fair chunk of your book review’s readers. I really think including examples would enhance the clarity of your categories. My 2 cents.

• Rahul:

Yup, that’s why I write books and longer articles. The blog is helpful, in part from comments such as yours which give me a better sense of what needs further explaining in books and longer articles.

2. Nice review.

“clearly wrong, obviously wrong. Wrong not just conditional on the unknown parameter, but conditional on the data”

That’s not so bad. If something is clearly, obviously wrong possibly even the dimwitted will notice.

It’s the stuff that’s wrong, but harder to see, that’s most insidious. Suppose, for example, we have a common collinearity problem, and a predictor has the wrong sign. “That has the wrong sign,” we say, and that’s a trigger to action. But what if the coefficient is just as wrong, but in the other direction? Are we likely to notice it, or congratulate ourselves that we’ve found something?

• An edited summary of what you might have been thinking of from Mosteller and Tukey:

Meanings of regression:

1. Column (local) averages

2. Fitting a function

Purposes of regression:

1. To get a summary

2. To set aside the effect of a variable that might confuse the issue

3. Contributions to attempts at causal analysis are a popular use for regression

4. Sometimes, as a corollary to item 3, we want to measure the size of the effect through a regression coefficient, as we did in the age-at-contribution example […] this use is fraught with difficulties when there are multiple causes and when various noncausal variables are associated with other causal ones

5. An extreme instance of the causal approach occurs when we use it to try to discover a mathematical or empirical law

6. For prediction

• I have a lot of respect for Mosteller and Tukey individually, but I can’t stand their book. An example of what I hate about it is that phrase you have above, “To set aside the effect of a variable that might confuse the issue,” which is the kind of folksy language that makes me want to do some math.

Yes, I recognize that what they’re saying is similar to what I’m saying, so it’s not that their book is bad in some absolute sense. I just have found it more confusing than helpful, perhaps because the methods they use are so simple. At some point it helps to just start doing some more complicated things rather than going round in circles.

• “makes me want to do some math”–this after a review that has no math in it (well, some inequalities on p-values). As a general comment, you prefer to describe with words rather than with equations, and then refer to a paper/book which very few look up. IMHO, you would make your blog more useful if you provided a simple mathematical example (like Wasserman in his blog did, RIP). Don’t get offended as I don’t want you to quit blogging in a huff with a comment about ungrateful readers (we are, but that’s life).

• Numeric:

Something like 50,000 people have bought my books so it’s not true that very few look things up there. Beyond that, all I can say is that my time and effort are limited. You might well say that I should not waste time and effort replying to blog comments, and that’s probably true, but as noted above I value the comments in that they give me a sense of areas of confusion to be addressed in future articles and books.

• “50,000 have bought my books so it’s not true that very few look things up there” is a non sequitur. I bought your book but I rarely look up something there from a reference in your blog unless I’m particularly interested (lazy or optimizing on a limited resource, my time. You decide). Am I unique? You’re the statistician, propose an estimand for the problem and propose an estimator, and test.

• For what it’s worth I’ve found the combination of reading books/papers & reading the blog more valuable than either alone.

Took me a while to bother to do both but the blog style probably helped convince me to put in that effort. There’s a lot of big picture conversational stuff here that is very valuable and would be hard to get otherwise unless you were Gelman’s drinking buddy or something. Informal conceptual conversations are one of the hardest things to replace if you don’t happen to work with/know the person. If you want to move from ‘informed [statisti-] citizen’ to using in practice then technical detail is usually much more available.

That’s one reason I also like books as well as papers – they often have a more conversational tone (even if still plenty of math) and you can get a better feel for the author’s personal vision. Interviews too.

PS I also liked Wasserman’s blog but the math-centric style has its +/-s.

• I suppose the difference with Tukey is supposed to be the extent to which there is underlying math written up somewhere to support the conceptual/folksy language? But I can see how one person’s Tukey could be another’s Gelman.

• Hjk:

Tukey, of course, was an excellent mathematician!

• Andrew: How can you count people here? I think I’ve 6 books with your name on the cover. I don’t think that makes me 6 people.

• How about a compromise for the lazy or time-optimizing statistician. Create an e-version of your book (people have to pay for access–I’m assuming BDA3), and then when you reference something in it you provide a link. Only those who have paid for access would be able to see it (or they’d get a display describing what section/page number the reference was on, so they could look it up with their copy or the library’s copy). That would increase the utility of the blog–I do tend to click on papers that have links mentioned in this blog and usually (not always) I can figure out what the author is trying to say from the abstract or one or two sections.

• I agree that Tukey’s language can be frustrating (not just in that book), but the book has some wonderful parts that are still worth reading today, e.g., in the chapter “Hunting out the real uncertainty”.

3. Perhaps the number of questionable scientific claims being made could be reduced if instead of dwelling on the (arguably false) dichotomy between Bayes and “frequentist” procedures, we focus on the real dichotomy between Bayes and frequentist *properties* of procedures.

As Andrew touches upon in his post, every procedure, no matter how derived, has pre-experimental operating characteristics for each value of the parameter (“frequentist properties”). Similarly, given a prior, every procedure (including a t-interval) has Bayesian properties.

Instead of trying to get people to abandon the use of t-intervals for small noisy datasets, perhaps we could encourage instead the reporting of the t-interval and its Bayesian coverage under a prior where the effect is likely to be small or zero. If your t-interval (which may have had 95% pre-experimental coverage for each parameter value) has only 50% posterior coverage, you might think twice before making a press release.
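One way to compute the posterior coverage suggested above is sketched here, under loudly stated assumptions: the estimate is treated as normal with a known standard error (a stand-in for the t-interval), the prior on the effect is normal and centered at zero, and the function name and example numbers are hypothetical.

```python
from statistics import NormalDist


def posterior_coverage(estimate, std_error, prior_sd, level=0.95):
    """Posterior probability that the classical interval contains the effect.

    Assumes estimate ~ Normal(effect, std_error) and a Normal(0, prior_sd) prior.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = estimate - z * std_error
    upper = estimate + z * std_error
    # Conjugate normal update: the posterior mean is shrunk toward zero.
    shrink = prior_sd**2 / (prior_sd**2 + std_error**2)
    posterior = NormalDist(mu=shrink * estimate, sigma=shrink**0.5 * std_error)
    return posterior.cdf(upper) - posterior.cdf(lower)


# A just-significant estimate (2 standard errors from zero), evaluated under
# a prior that says effects are small relative to the noise.
print(posterior_coverage(estimate=2.0, std_error=1.0, prior_sd=0.5))
# The same interval under a nearly flat prior recovers the nominal level.
print(posterior_coverage(estimate=2.0, std_error=1.0, prior_sd=1e6))
```

With the tight prior, the posterior probability content of the nominal 95% interval falls well below 0.95, which is the kind of number that could be reported alongside the interval.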

• +1

• Peter: I am not sure exactly what you mean here

> pre-experimental operating characteristics (frequentist and Bayesian)

So Bayesian here is interval coverage with respect to the prior – simulate parameters from prior.a, obtain interval from posterior using prior.b (more realistic than prior.a which Greenland termed omnipotent) check coverage for parameter drawn from prior.a and average?

> has only 50% posterior coverage
So the posterior probability content of the t-interval calculated from usual frequentist formulas?

Both would seem worthwhile to calculate but not sure everyone or most would agree.
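The first of the two calculations described above (draw the parameter from prior.a, build the interval from a posterior that uses prior.b, and average the coverage) might be sketched as follows. The normal model, the zero-centered normal priors, the standard error of 1, and the function name are all assumptions made for illustration.

```python
import random
from statistics import NormalDist

random.seed(3)
Z = NormalDist().inv_cdf(0.975)


def average_coverage(sd_a, sd_b, std_error=1.0, n_sims=50_000):
    """Coverage of the 95% posterior interval under prior.b, averaged over prior.a."""
    shrink = sd_b**2 / (sd_b**2 + std_error**2)
    post_sd = shrink**0.5 * std_error
    hits = 0
    for _ in range(n_sims):
        theta = random.gauss(0.0, sd_a)    # parameter drawn from prior.a
        y = random.gauss(theta, std_error)  # one noisy estimate of theta
        center = shrink * y                 # posterior mean under prior.b
        hits += abs(theta - center) <= Z * post_sd
    return hits / n_sims


# Working prior matches the prior that generated the parameters.
print(average_coverage(sd_a=1.0, sd_b=1.0))
# Working prior is much tighter than the generating prior.
print(average_coverage(sd_a=2.0, sd_b=0.5))
```

When prior.a and prior.b coincide, the averaged coverage should sit near the nominal 95%; when the working prior is too tight relative to the generating one, it drops substantially.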

4. I feel like numbers 1 and 2 are thought processes or mental models rather than part of the computation. You could overlay those with causal inference (which I, for one, do a lot) and prediction (which also turns up often in biostats). But the distinction between causality and prediction is really important; there are so many papers out there merrily adjusting for everything and the kitchen sink. They clearly have picked up a half-truth in Stat101, that adjusting is universally beneficial. I don’t know how to combat it other than plugging away, correcting doctors, and rejecting papers. After all, it’s not like there aren’t plenty of clear stats textbooks and courses out there. I fear that they don’t WANT to know why they are doing regression.

5. What does a frequentist do for prediction? That wasn’t meant to be sarcasm, the start of a joke, or a riddle—I really don’t know. If I have a (penalized) maximum likelihood estimate (or the result of some other estimation procedure) for regression coefficients in a simple linear regression, do I plug them in to compute conditional expectations of y given x? Then what about uncertainty? A Bayesian approach would take the uncertainty of the estimation of the coefficients into account in reasoning about the uncertainty of the prediction.

• Wouldn’t a frequentist set up a calibrated Bayesian model à la Nate Silver, interpreting priors as base rates?

Alternatively they might treat any machine learning algorithm (including any Bayesian model) as a black box and select an algorithm based on the frequency properties of cross-validation or some test error estimator.

LW’s claim was that you can be a frequentist and use Bayes’ theorem.

Less-enlightened frequentists might avoid Bayes’ theorem because of the name, but those are probably the same people doing silly stuff like screening predictors based on p-values who aren’t going to get this right anyway.

• You can base prediction intervals on the distribution of y_{n+1}-fitted value at x_{n+1}, taking into account the distribution of the estimators involved in computing the fitted values. This leads to standard frequentist prediction intervals that you should find in most standard books on regression.

• I should probably add that conceptually these intervals treat the already observed data as random, as do confidence intervals, and one could object by saying, but the observed data are actually fixed at the time of computing the predictions. The percentage levels of these intervals are to be interpreted over independent generation of new full datasets.
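For reference, in the special case of simple linear regression with independent normal errors (an assumption; the comments above do not restrict themselves to this case), the standard interval built from the distribution of y_{n+1} minus its fitted value is:

```latex
\hat{y}_{n+1} \;\pm\; t_{n-2,\,1-\alpha/2}\;\hat{\sigma}
\sqrt{\,1 + \frac{1}{n} + \frac{(x_{n+1}-\bar{x})^{2}}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,},
\qquad
\hat{y}_{n+1} = \hat{a} + \hat{b}\,x_{n+1},
\qquad
\hat{\sigma}^{2} = \frac{1}{n-2}\sum_{i=1}^{n}\bigl(y_i - \hat{a} - \hat{b}\,x_i\bigr)^{2}.
```

The leading 1 under the square root is the noise in the new observation itself; the other two terms carry the uncertainty in the estimated intercept and slope, which is how the interval takes the distribution of the estimators into account.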

• Bob:

In frequentist statistics terminology, an “estimate” is something that will be evaluated conditional on the true value of the unknown quantity being estimated, whereas a “prediction” is something that will be evaluated unconditional on (that is, averaging over) the true value of the unknown quantity being predicted.

To put it another way, in classical statistics, “parameters” get estimated, whereas “missing data” or future values get predicted.

It’s confusing for people with Bayesian training because, in the Bayesian world, all unknowns are simply unknowns and there’s no logical distinction between parameters and missing or future data. But in the classical world, parameters and missing data are different.

We discuss this in a footnote in BDA which I think is also linked from the index (look under “prediction,” perhaps). The point is that unbiased prediction, in a classical sense, is not the same as unbiased estimation. In fact, classical unbiased prediction is the same as Bayesian calibration, except that it is conditional on point estimates (or assumed values) of the model parameters. Classical unbiased estimation is another story, with all its problems. That’s one reason I say that classical statistics is Bayesian statistics if you define all unknowns as prediction problems.

I don’t think this is well known but it is implicit in the classical literatures on forecasting and empirical Bayes.

• And for those with a curious delight in complexity

Models with unobserved random parameters (that need to be averaged over) require generalized likelihood.

A generalization by Bjornstad provided a likelihood for which Birnbaum’s theorem generalizes (but that theorem is now accepted to be wonky, true but based on unreasonable assumptions when fully clarified or just too vague to be of interest to anyone.)

Given the generalization by Bjornstad is still? considered the least wrong, its now all up in the error – blurby to say the least.

Bjornstad, J. F. On the generalization of the likelihood function and the likelihood principle.
Journal of the American Statistical Association 91 (1996)

p.s. this is one typo purposely left in this comment.