Bayesian methods and what they offer compared to classical econometrics

A well-known economist who wishes to remain anonymous writes:

Can you write about this agent? He’s getting exponentially big on Twitter.

The link is to an econometrician, Jeffrey Wooldridge, who writes:

Many useful procedures—shrinkage, for example—can be derived from a Bayesian perspective. But those estimators can be studied from a frequentist perspective, and no strong assumptions are needed.

My [Wooldridge’s] hesitation with Bayesian methods—when they differ from classical ones—is that they are not “robust” in the econometrics sense.

Suppose I have a fractional response and I have panel data. I’m interested in effects on the mean. I want to allow y(i,t) to have any pattern of serial correlation and any distribution. I want to allow heterogeneity correlated with covariates.

I know how I would approach this: pooled quasi-MLE with a Chamberlain device and using cluster-robust inference.

How does a Bayesian solve this problem under the same assumptions plus a prior? I think it’s possible, but are such methods out there and in use?

My reaction to this will be milder than you might expect. Compared to the remarks of some other anti-Bayesians (see for example here and here), Wooldridge is pretty modest in his claims. He’s not saying that Bayesian methods are bad, just that they give him some hesitation.

Wooldridge’s main point seems to be that he and his colleagues have had success with non-Bayesian methods and, on the occasions that they’ve looked around to see whether Bayesian ideas could help, they haven’t been clear on where to start.

This suggests a need for a short paper taking some of his classical models and expressing them in Bayesian terms with Stan code. Wooldridge appears to be a Stata user, so it could also be useful to include some Stata code using StataStan to call the Stan program.

Somebody other than me will have to do that, as I don’t know what is meant by a fractional response, or pooled quasi-MLE, or a Chamberlain device. It won’t be possible to exactly duplicate these models Bayesianly—he wants it to work for “any pattern of serial correlation and any distribution,” and a Bayesian model would need some parametric form. But these parametric forms can be very flexible (splines, Gaussian processes, etc.), and I don’t actually think you really need the procedure to work for any pattern of serial correlation and any distribution. There are some patterns you’re never gonna see, and some distributions with such long tails that no procedure would work to estimate effects on the mean. So I think his procedures must have some implicit constraints. In any case, I expect that it should be possible to set up a Bayesian model that pretty much does what Wooldridge wants, without taking too long to compute.

Regarding the claim that Bayesian methods are not robust in the econometrics sense . . . I dunno. I guess I’d have to see some simulation studies. I guess his claim must be true in the following sense: By construction, Bayesian inference maximizes statistical efficiency under the assumptions of the model. Efficiency is only one of many goals of inference; thus, if you’re maximizing efficiency you must be losing somewhere else. We could just as well flip it around and say that I have hesitation with any statistical procedure X because it will be flawed when its assumptions fail.

As we’ve discussed in the past, one common failure mode of purportedly conservative or robust-but-inefficient methods is that users want results. They don’t want confidence intervals that are robust but are a mile wide. The way to get reasonable-sized confidence intervals with a statistically inefficient procedure is to throw in more data. For example, when fitting a time-series cross-section model, you might pool data from 40 years rather than just 10 years, so that you can estimate the average treatment effect with a desired level of precision. The trouble is, then you’re estimating some average over 40 years, and this might not be what you’re interested in. People will take this average treatment effect and act as if it applies in new cases, even though it’s not clear at all what to do with this average. Or, to put it another way, this parameter is answering the questions you want to answer—as long as you’re willing to make some strong assumptions about stability of the treatment effect.

So, ultimately, you’re trading off one set of assumptions for another. I’d typically rather make strong assumptions about something minor like the covariance structure of an error term and then be flexible about the things I really care about, like treatment interactions. But I guess the best choice will depend on the particular problems you work on, along with what can be done with the tools you’re familiar with.

I respect Wooldridge’s decision to stick with the methods he’s familiar with. I do that too! It makes sense. There’s a learning curve with any new approach, and I can well believe that Wooldridge using classical econometrics techniques will do better data analysis than Wooldridge using Bayesian methods, especially given that the tutorial I’ve outlined above does not yet exist.

Also, I agree with him that Bayesian methods can be studied from a frequentist perspective. That’s a point that Rubin often made. Rubin described Bayesian inference as a way of coming up with estimators and decision rules, and frequentist statistics as a framework for evaluating them. And remember that Bayesians are frequentists.

I recommend that Wooldridge continue to use the methods he’s comfortable with. What would motivate him to try out Bayesian methods? If he’s working on a problem where strong prior information is available (as here) or where he has lots of data in scenario A and wants to generalize to similar-but-not-identical scenario B (as here) or where he wants to pipe his inferences into a decision analysis (as here) or where he’s interested in small-area estimation (as here) or various other settings. But until he ends up working on such problems, there’s no immediate need for him to switch away from what works for him. And we end up working on problems that our methods work on. Pluralism!

Being able to go into detail on this is a big reason I prefer blogs to twitter. I enjoy a good quip as much as the next person, but it’s also good to have space to explain myself and not just have to take a position.

32 thoughts on “Bayesian methods and what they offer compared to classical econometrics”

    • Michael:

      There’s no free lunch.

      Any solution to a high-dimensional problem requires some assumptions, whether they be set up as a dimensionality restriction, a continuous prior distribution, a solution to an optimization problem (in which case the assumptions might be implicit), or some other way.

      In the econometrics setting, the goal typically is not to estimate a high-dimensional parameter but rather to estimate a low-dimensional parameter in the presence of a high-dimensional nuisance parameter. In that case, the Bayesian approach is to set up a joint probability model, while the classical econometrics approach is to come up with a procedure that has some desirable properties across the nuisance parameter space. As I discussed in the above post, such procedures can be constructed using some mix of assumptions or restrictions on that space, or a loss of statistical efficiency which in practice will result in pooling somewhere else in the analysis, thus trading off assumptions on the nuisance parameters with assumptions on the parameters of interest.

      • Andrew: I’m talking about estimating a single-dimensional parameter. My understanding is that in this category of curse-of-dimensionality (COD) problem there is a nice frequentist estimator with 1/sqrt(n) convergence, but no such strict Bayesian estimator until the number of data points gets enormously large. Of course if your prior does enough smoothing to effectively remove the COD, then this doesn’t apply.
        It’s not just a contrived pathological limit; a friend who was doing some ecological modeling actually ran directly into the problem described here.

        • Michael:

          Yes, that corresponds to estimation of a unidimensional parameter of interest in the presence of a high-dimensional nuisance parameter, or equivalently as estimation of a high-dimensional parameter in which interest lies in a univariate summary. In any case, some assumptions are required, and there is not always a direct mapping between assumptions used in different modes of inference.

        • One thing I’ve noticed about Wasserman’s examples of where Bayesian methods don’t work is that they can work. But his argument devolves into, “a Bayesian wouldn’t do that,” which is ironic, because he’s implying that a frequentist procedure is allowed to use prior knowledge of the problem to construct its estimator, but the Bayesian procedure is not.

        • Dave- Yeah, that was my initial reaction too. But I think the point is that there’s no way to express the process as factorizing into a well-defined prior and a likelihood function.

    • This is an example of the type of problem he is talking about?

      Suppose a new HMO needs to estimate the fraction ψ of its patient population that will have an MI (Y) in the next year, so as to determine the number of cardiac unit beds needed. Each HMO member has had 300 potential risk factors X = (X_1, …, X_300) measured: age, weight, height, blood pressure, multiple tests of liver, renal, pulmonary, and cardiac function, good and bad cholesterol, packs per day smoked, years smoked, etc. (We will get to 100,000 once routine genomic testing becomes feasible.) A general epidemiologist had earlier studied risk factors for MI by following 5000 of the 50,000 HMO members for a year.

      You have a year’s worth of 5000 outcomes with 300 predictors each, and want to predict 50k outcomes using the same 300 predictors during the following year.

      Further, he wants to do this using no information other than those 5000 outcomes (and 300 associated predictors each). So the HMO basically considers the medical literature worthless?

      • A strong guess here: for MI in next year – heck in next several years – just knowing two things will vastly out-perform the “throw it all in the blender” approach: coronary calcium and VO2-max. Smoking status would help too I’m sure. The trouble is, collecting these first two isn’t standard. With some careful contextual knowledge and other metrics – I bet it could be improved somewhat, but only somewhat.

    • Paul:

      In that case it all makes sense, as I’ve never seen a Bayesian garage door opener. The closest I can think of are those Japanese rice cookers we heard about a few decades ago that used fuzzy logic.

      • I have one of those rice cookers – and it works better than any other AI device I have seen. Every time, perfect rice. I’ve not even found any bias against rice colors.

        • Dale:

          I’d respond by saying that I have a rice cooker that doesn’t use fuzzy logic, and it works perfectly every time too . . . but, then again, maybe my rice cooker does use fuzzy logic and I just don’t know it!

        • Notice that he didn’t explicitly say that it works better than other rice cookers. Only that it works better than other AI devices. Probably there are a lot of AI ethical issues involved, though, and he’s just blinded to them by his Japanese-fuzzy-logic-rice-cooker-owner privilege.

  1. Brief note: the Chamberlain (or Mundlak-Chamberlain) device, per Mundlak (1978) and Chamberlain (1984), is just what economists call putting in the group means of the covariates as predictors in a panel/longitudinal/multilevel model to account for correlation. This has some nice properties and is something you have advocated in your paper with Bafumi.
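
    In symbols, a minimal sketch of the idea, using a simple linear model for concreteness (the notation here is mine, not from those papers):

    y_{it} = \alpha + \beta x_{it} + \gamma \bar{x}_i + u_i + \varepsilon_{it}, \quad \bar{x}_i = \frac{1}{T_i} \sum_t x_{it}

    Including the group mean \bar{x}_i as a predictor is what lets the group effect u_i be correlated with the covariates; the coefficient of interest is the within effect \beta.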

  2. Perhaps important to note that Wooldridge’s perspective is that of a microeconomist. At the risk of oversimplifying, the microeconomics domain tends to emphasize research design (quasi-experimental methods) over complex statistical modeling.

    Macroeconomics, on the other hand, is an area where Bayesian methods have become quite common. Wooldridge isn’t talking about that at all, as far as I can tell.

    Also, if we take a cue from the economist’s “incentives matter” playbook, we should point out that Wooldridge’s frequentist textbook is considered the standard among introductory texts. A shift to Bayesian methods would diminish Wooldridge’s status in economics departments around the world.

  3. It’s not a full answer to “how does a Bayesian solve this”, but for linear regression with independent outcomes this 2010 paper shows one Bayesian way to get the robustness it seems Wooldridge is seeking. Much of the work involved is defining the parameter of interest in a way that holds up when few other assumptions are made; I expect this step would get a lot more complicated when extending the same basic idea to cope with panel data.

  4. It would help to give examples of nice Bayesian applications. By comparison, Gauss applied OLS to find the lost dwarf planet Ceres. We can calculate heteroskedasticity-consistent standard errors for OLS, which is handy and robust. How does a Bayesian do robust inference? We need a neat, simple application of Bayesian methods where a classical approach would be difficult or awkward. Bob Litterman’s Bayesian vector autoregression (BVAR) revolutionized macroeconomic prediction, but I am not sure whether that qualifies as Bayesian, since it involves ad hoc shrinkage.

    • KL:

      There are zillions of successful examples of Bayesian methods. Since I’m writing this comment, I’ll start by pointing you to my articles and books, but there’s tons of stuff that’s not by me too. As I wrote in my above post, if you have methods that work for you for your problems, then it’s not necessarily worth it for you to learn something new. But, yeah, we develop new methods for problems where the old methods don’t work so well.

    • “We need a neat, simple application of Bayesian methods where a classical approach would be difficult or awkward”

      How about analysis of experimental data, where we may want to take frequent looks to stop early or make adaptations? Applying classical frequentist approaches requires “patching up” the p-values, etc using awkward adjustments, since the classical approaches are all based around the concept of a sampling distribution for a fixed sample size. The Bayesian approach doesn’t require a fixed n at all, so there are no awkward adjustments to the inferential machinery.

      • +1

        The funny thing about this is that there is a substantial literature on, for example, allocation of aggregate Type I error across dynamic data collection protocols that are so complicated almost no one ever uses them, including the metastep of then designing the size of individual stages of the protocols in advance and then modifying them on the fly. Instead, authors either (a) suppress the fact that they did a preliminary analysis which looked promising; or (b) tell everyone to ignore the man gesticulating behind the curtain that Toto has pulled back.
        And in Bayesian analysis, it all just falls away.

    • “By comparison, Gauss applied OLS to find the lost dwarf planet Ceres.”

      The folks at LIGO use(d) Bayesian inference to detect gravitational waves.

      “How does a Bayesian do robust inference?”

      In a Bayesian approach you can model heteroskedasticity directly. Or you can do your regression with Student-t errors. Or both (a minimal sketch appears at the end of this comment). Or you can go for semi-/non-parametric approaches.

      “We need a neat, simple application of Bayesian methods where a classical approach would be difficult or awkward.”

      Anything with measurement error or partially available data. We all know that social science (economics) data is plagued with measurement error. Also A/B Testing in (online) marketing is really awkward in a classical framework (basically samuel’s point). These just come to mind, but there are a ton of other cases as Andrew pointed out. Also, you mentioned BVARs! Shrinkage is a form of prior: we a priori assume that coefficients are small, which incidentally (or not) helps with a lot of computational problems.
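
      To make the robust-regression point above concrete, here is a minimal Stan sketch (my own, not from any package): Student-t errors plus a log-linear model for the residual scale, so heavy tails and heteroskedasticity are modeled directly. All variable names are placeholders.

      data {
        int<lower=1> N;
        int<lower=1> K;
        matrix[N, K] x;
        vector[N] y;
      }
      parameters {
        real alpha;
        vector[K] beta;
        real a_sigma;
        vector[K] b_sigma;   // lets the residual scale depend on the covariates
        real<lower=1> nu;    // Student-t degrees of freedom
      }
      model {
        vector[N] sigma = exp(a_sigma + x * b_sigma);  // heteroskedastic scale
        y ~ student_t(nu, alpha + x * beta, sigma);    // heavy-tailed errors
        alpha ~ normal(0, 5);
        beta ~ normal(0, 5);
        a_sigma ~ normal(0, 2);
        b_sigma ~ normal(0, 2);
        nu ~ gamma(2, 0.1);  // a common weakly informative choice for nu
      }

      Whether this counts as “robust” in the econometrics sense is a fair question, but it illustrates that in the Bayesian setting heavy tails and unequal variances are things you model rather than things you adjust standard errors for.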

    • The Bayesian approach has clear advantages in state-space models (Kalman filters etc.) with probability distributions updated sequentially.

      • I used to work in Kalman filters. Most of the books derive the filter by minimizing the squared error. This makes it mysterious why you only need to know the previous mean and covariance to do each update. If you do the Bayesian derivation, then you see this is a general property of Bayesian estimators, i.e., you can update your prior all at once or one measurement at a time, and you get the same answer both ways (see the short note at the end of this comment).

        Bayesian approaches are better for sports (chess, table tennis, online games) ratings.
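
        To spell out the point in the first paragraph of this comment: assuming the measurements are conditionally independent given \theta,

        p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta) \, p(y_1 \mid \theta) \, p(\theta) \propto p(y_2 \mid \theta) \, p(\theta \mid y_1)

        so updating on y_1 and then on y_2 gives the same posterior as updating on both at once. In the linear-Gaussian case that update is conjugate and depends only on the current mean and covariance, which is why the Kalman measurement update needs nothing else.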

    • > We can calculate heteroskedasticity-consistent standard errors for OLS,

      Heteroskedasticity-consistent standard errors are an analytic formula for the asymptotic properties of a particular estimator, while analytic forms in Bayesian inference are intractable and replaced by general computational methods. Samples from the posterior distribution for estimated parameters paired with robust diagnostics serve entirely the same purpose. I’m not sure if there’s a fundamental reason why Bayesian methods should be computational while frequentist ones should be analytic, but that’s the way it is. I understand a lot of people are uncomfortable with using computational diagnostics and prefer to have a formula, but in my experience the assumptions required to have a formula are often difficult to investigate so I’d honestly say it’s a wash on robustness.

    • KL: you ask “How does a Bayesian do robust inference?”

      The paper linked in my other comment gives an answer, for linear regression. The short version is that one assumes a very flexible model, focuses inference on a parameter that defines a linear trend summarizing how mean outcome varies with covariates, and then follows the standard approach of summarizing the posterior with its mean and variance.

      The flexible model and the linear trend parameter can be used to justify heteroskedastic standard error estimates in a non-Bayesian way, so the result is perhaps not too surprising.

      • Ken:

        This reminds me of a point made by Rubin, that if your goal is to come up with an estimator with good frequentist statistical properties, one way to do this is to perform Bayesian inference under a reasonable model.

  5. I see a lot of Bayesian methods in current economics and econometrics research. I also see a lot of ipse dixit hypothesis testing and worryingly simple regression/time series models. I think there is some age/period/cohort effect here. Wooldridge is hardly likely to start using Bayes now after investing so much effort in variants on MLE, while if you want to be a hip young econometrician, you can’t stop after reading Wooldridge. I imagine I’ll still be squeezing everything into a latent variable model in Stan in 2041, and the cool kids will laugh at me.

    As for Stata, they added some Bayes capability to their out-of-the-package functionality in version 14*, which came out in 2015. Economics, finance, medical stats are their fields, and they do a lot of market research before deciding to recruit, resource, and commit to new features. Bayes (by simulation) is not something a software house can just bolt on to frequentist content or divert MLE programmers into.

    They continue to add to it, although they have an odd insistence on building everything from scratch themselves, and there’s still only random-walk Metropolis-Hastings (RWMH) and Gibbs. I am told (by StataCorp) that they find Bayesian capability is one of the most common reasons for new users investigating the software. So there must be something in it.

    And just while I’m here, economists and econometricians don’t have to use StataStan any more (from version 16) because they can flip between Stata (ado & mata language) and Python, and so use PyStan or CmdStanPy. In fact, that’s what I recommend anyone in Stata 16+ uses. StataStan (the stan command in Stata) passes between Stata and CmdStan via text files at start and end of execution, which slows things down a bit and sometimes users get that blocked because of sysadmin restrictions. Especially in Win***s.

    * – Of course, multiple imputation was in version 12 in 2011, and MI is [phenomenological] Bayes

  6. Now that I think about it, I’m 50% confident someone could do this model in off-the-peg Bayes in Stata.

    I think “fractional response” means values between 0 and 1, and “effects on the mean” means an identity link.

    It’s going to be something like

    xtset panelvariable timevariable
    bayes, prior(…): meglm depvar indepvars L.laggedindepvar || panelvariable: heterogeneitypredictor, family(binomial) link(identity)

    Not that I’ve actually tried this.

    And you can do it in Stata’s bayesmh; all you need is to write functions for your likelihood and prior. But then wouldn’t it be nicer to write it in a real probabilistic programming language? My point is, people like this don’t avoid Bayes because it’s impossible, they avoid it because it was impossible when they stopped learning new stuff, and because now they are Important, nobody ever corrects them.
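
    For what it’s worth, a rough sketch of what a Stan version might look like, under the guess above that “fractional response” means an outcome strictly between 0 and 1 (given a beta likelihood here) and with the unit-level means of the covariates included as the Chamberlain device. I’ve used a logit link to keep the mean inside (0,1) rather than the identity link guessed above; serial correlation beyond the unit effect is not modeled; exact 0s or 1s in the outcome would need a different likelihood; and all names are placeholders.

    data {
      int<lower=1> N;                  // panel observations (i,t)
      int<lower=1> J;                  // panel units
      int<lower=1> K;                  // covariates
      array[N] int<lower=1, upper=J> unit;
      matrix[N, K] x;
      matrix[J, K] xbar;               // unit-level means of x (Chamberlain device)
      vector<lower=0, upper=1>[N] y;   // fractional response
    }
    parameters {
      real alpha;
      vector[K] beta;                  // effects of interest
      vector[K] gamma;                 // coefficients on the unit means
      vector[J] u;                     // unit effects
      real<lower=0> sigma_u;
      real<lower=0> phi;               // beta-distribution precision
    }
    model {
      vector[N] mu = inv_logit(alpha + x * beta + xbar[unit] * gamma + u[unit]);
      u ~ normal(0, sigma_u);
      y ~ beta(mu * phi, (1 - mu) * phi);
      alpha ~ normal(0, 2.5);
      beta ~ normal(0, 2.5);
      gamma ~ normal(0, 2.5);
      sigma_u ~ normal(0, 2.5);
      phi ~ exponential(0.1);
    }

    You could fit this with PyStan or CmdStanPy from Stata 16+ as described in the comment above, or from R, and then compute average effects on the mean scale directly from the posterior draws (e.g., as average differences in predicted mu under different covariate values).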
