“Economic predictions with big data” using partial pooling

Tom Daula points us to this post, “Economic Predictions with Big Data: The Illusion of Sparsity,” by Domenico Giannone, Michele Lenza, and Giorgio Primiceri, and writes:

The paper wants to distinguish between variable selection (sparse models) and shrinkage/regularization (dense models) for forecasting with Big Data. “We then conduct Bayesian inference on these two crucial parameters—model size and the degree of shrinkage.” This is similar to your recent posts on the two-way interaction of machine learning and Bayesian inference, as well as to your multiple comparisons paper. The conclusion is that the data indicate variable selection is bad: a zero coefficient with zero variance is too strong. My intuition is that the results are not surprising, since data favoring exactly 0 are so unlikely, but I assume the paper fleshes out the nuance (or explains why that intuition is wrong).

Here is the abstract:

We compare sparse and dense representations of predictive models in macroeconomics, microeconomics, and finance. To deal with a large number of possible predictors, we specify a prior that allows for both variable selection and shrinkage. The posterior distribution does not typically concentrate on a single sparse or dense model, but on a wide set of models. A clearer pattern of sparsity can only emerge when models of very low dimension are strongly favored a priori.

I [Daula] haven’t read the paper yet, but noticed the priors while skimming.

(p.4) “The priors for the low dimensional parameters φ and σ^2 are rather standard, and designed to be uninformative.” The coefficient vector φ has a flat prior, which you’ve shown is not uninformative, and the prior density of σ^2 is inversely proportional to σ^2 (no idea where that comes from, but nothing like what you recommend in the Stan documentation).

The overall setup seems reasonable, but I’m curious how you would set it up if you had your druthers.

My quick response is that I’m sympathetic to the argument of Giannone et al., as it’s similar to something I wrote a few years ago, Whither the “bet on sparsity principle” in a nonsparse world? Regarding more specific questions of modeling: Yes, I think they should be able to do better than uniform or other purportedly noninformative priors. When it comes to methods for variable selection and partial pooling, I guess I’d recommend the regularized horseshoe from Juho Piironen and Aki Vehtari.
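For readers who haven’t seen it, here is a rough Python sketch of what a draw from the regularized horseshoe prior looks like, following the form in Piironen and Vehtari: each coefficient gets a half-Cauchy local scale, and a “slab” scale c caps how large any coefficient’s prior standard deviation can get. The function names and the numerical settings (`tau0`, `slab_scale`, `slab_df`) are illustrative assumptions, not values from Giannone et al. or a recommendation for any particular dataset.

```python
import math
import random


def half_cauchy(scale, rng):
    """Draw from a half-Cauchy(0, scale) by folding a Cauchy draw at zero."""
    return abs(scale * math.tan(math.pi * (rng.random() - 0.5)))


def draw_regularized_horseshoe(p, tau0, slab_scale, slab_df, rng):
    """One prior draw of p coefficients under a regularized horseshoe.

    tau0, slab_scale, and slab_df are hypothetical settings for illustration.
    Returns (betas, prior_sds, c), where c is the slab scale for this draw.
    """
    # Global scale: tau ~ half-Cauchy(0, tau0).
    tau = half_cauchy(tau0, rng)
    # Slab: c^2 ~ Inv-Gamma(slab_df / 2, slab_df * slab_scale^2 / 2),
    # drawn as the reciprocal of a gamma variate (second argument is a scale).
    c2 = 1.0 / rng.gammavariate(slab_df / 2.0, 2.0 / (slab_df * slab_scale ** 2))
    betas, prior_sds = [], []
    for _ in range(p):
        # Local scale: lambda_j ~ half-Cauchy(0, 1).
        lam = half_cauchy(1.0, rng)
        a = (tau * lam) ** 2
        # Regularized variance: tau^2 * lambda_j^2 * c^2 / (c^2 + tau^2 * lambda_j^2).
        sd = math.sqrt(c2 * (a / (c2 + a)))
        prior_sds.append(sd)
        betas.append(rng.gauss(0.0, sd))
    return betas, prior_sds, math.sqrt(c2)


if __name__ == "__main__":
    rng = random.Random(1)
    betas, prior_sds, c = draw_regularized_horseshoe(
        p=20, tau0=0.1, slab_scale=2.0, slab_df=4.0, rng=rng
    )
    print("slab scale c:", round(c, 3))
    print("largest prior sd:", round(max(prior_sds), 3))
```

Most draws are shrunk hard toward zero, a few escape, and none has a prior standard deviation larger than c. So the prior does partial pooling without ever committing to exact zeros.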

1. If $latex p_X(x) \propto 1/x$ and $latex Y = \log X$, then we apply a Jacobian correction to get

$latex p_Y(y) \propto p_X(\exp(y)) \left| \frac{d}{dy} \exp(y) \right| = \frac{1}{\exp(y)} \left| \exp(y) \right| = 1$.

Applying this here with $latex x = \sigma^2$ yields $latex p(\log \sigma^2) \propto 1$.

If a prior on a positively-constrained parameter is proportional to the inverse of the parameter, then the prior on the log of the parameter is uniform. Thus this yields a prior that is uniform on $latex \log \sigma^2$.
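A quick numerical check of the change of variables, as a sketch. Because 1/x can’t be normalized on the positive reals, the code truncates it to a finite interval [a, b]. That truncation and the function names are assumptions made only so there is a proper density to evaluate.

```python
import math


def p_x(x, a, b):
    """Density proportional to 1/x, truncated to [a, b] so that it normalizes."""
    if not (a <= x <= b):
        return 0.0
    return 1.0 / (x * math.log(b / a))


def p_y(y, a, b):
    """Density of Y = log X by change of variables: p_X(exp(y)) * |d/dy exp(y)|."""
    x = math.exp(y)
    return p_x(x, a, b) * abs(x)


if __name__ == "__main__":
    a, b = 1e-3, 1e3
    for y in (-6.0, -3.0, 0.0, 3.0, 6.0):
        # Every value should equal 1 / log(b / a): flat in y.
        print(y, p_y(y, a, b))
    print("1 / log(b / a) =", 1.0 / math.log(b / a))
```

The density in x varies by many orders of magnitude across the interval, while the density in y = log x is the same constant everywhere inside it.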

• Which is, of course, improper in both cases: you can’t normalize either 1/x on the positive reals or 1 on the reals.

• Let’s also think a little about what it could possibly mean to be uniform on log(x). This is saying that the order of magnitude of the parameter could be literally any number. In some sense, 10^100 is as likely as 10, as likely as 10^-100, as likely as 10^1000000000000000000000000.

Which is an obviously stupid thing to say about a real world scientific problem.
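To put rough numbers on it, here is a small sketch. The improper prior has no probabilities to report, so the code assumes log10(x) is uniform on [-100, 100]. Those bounds and the function name are purely illustrative.

```python
import math


def log_uniform_mass(lo, hi, lower=1e-100, upper=1e100):
    """P(lo < X < hi) when log X is uniform between log(lower) and log(upper)."""
    lo, hi = max(lo, lower), min(hi, upper)
    if hi <= lo:
        return 0.0
    total = math.log10(upper) - math.log10(lower)
    return (math.log10(hi) - math.log10(lo)) / total


if __name__ == "__main__":
    # Every decade gets the same mass, 1/200 under these bounds.
    print(log_uniform_mass(1.0, 10.0))
    print(log_uniform_mass(1e99, 1e100))
    # Mass above one million: 94 of the 200 decades.
    print(log_uniform_mass(1e6, 1e100))
```

Under these bounds the decade from 1 to 10 gets the same prior mass as the decade from 10^99 to 10^100, and nearly half the mass sits above a million. Widening the bounds only makes it worse.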

• Thanks—I should’ve mentioned that! While it seems like that’d lead to preferences for small numbers, the truth is as described in Daniel Lakeland’s second comment.

I feel bad for working statisticians trying to do the right thing in the face of changing recommendations. The game used to be to be as uninformative as possible in the prior as long as the posterior was proper. Even now, Andrew and his fellow authors suggest similar uniform-on-the-log-scale priors in BDA3 (e.g., in Chapter 5 for the beta count parameter in the rat clinical trial example). I don’t think those chapters got updated from the previous two versions and I don’t believe Andrew would recommend them now. But I feel bad for someone picking up BDA and trying to do the right thing. Our thinking about the importance of prior predictive distributions has changed over the last couple of years, with the more current thinking demonstrated in Gabry et al.’s paper, “Visualization in Bayesian workflow.” (It’s hard to publish papers on priors, so they have to get snuck in through the back door of other papers.)

• Keith O’Rourke says:

> I feel bad for someone picking up BDA and trying to do the right thing
Hey, pleading hardship in science is not allowed ;-)

> hard to publish papers on priors

I think a large part of the problem is a dearth of published views on what Bayesian analyses in practice are attempting to do – their purpose – and how achieving that is assessed (calibrated?). It is not about getting formal posterior probabilities from joint probability models that are largely unconnected to the world we are trying to understand.

Some stuff is starting to appear – for instance, “Inferential statistics as descriptive statistics: there is no replication crisis if we don’t expect replication,” by Amrhein, Trafimow, and Greenland. https://peerj.com/preprints/26857/

For instance, “we can think of a posterior probability interval, or Bayesian “credible interval,” as a compatibility interval showing effect sizes most compatible with the data, under the model and prior distribution used to compute the interval”

• Chris Wilson says:

I agree Keith. If you look at the workflow and case studies on the Stan website, or Michael Betancourt’s website (both excellent resources IMO), there is a wide chasm between that and the few snippets of Bayes and/or GLMM(s) that an average reviewer will be aware of in most scientific journals. Sometimes (often) it feels like inhabiting a parallel world with increasingly few connections to what most scientists and researchers recognize as “statistics”, either in form or purpose.

• Keith O’Rourke says:

Very impressed with some of the material I have seen on Michael Betancourt’s website.

However, the first sentence here https://betanalpha.github.io/writing/ is still, I believe, too strong, even if qualified by the second: “Given the posterior distribution derived from a probabilistic model and a particular observation, Bayesian inference is straightforward to implement: inferences, and any decisions based upon them, follow immediately in the form of posterior expectations. Building such a probabilistic model that is satisfactory in a given application, however, is a far more open-ended challenge.”

The phrases “straightforward to implement” and “follow immediately” presume that a satisfactory probabilistic model will be settled on. We almost never get that in an actual application.

• From my discussions with Michael, I’m pretty sure he agrees with Keith O’Rourke, as evidenced by his next sentence, “Building such a probabilistic model that is satisfactory in a given application, however, is a far more open-ended challenge.”

In general, we recommend posterior predictive checks to see if the model fits the data and, ideally, cross-validation and/or genuine held-out calibration.

• Keith, excellent points. You are so right: there is a dearth of published views on what Bayesian analyses in practice are trying to do. Perhaps greater cross-disciplinary collaborations will yield some better measurement solutions.