Survey Statistics: individualism doesn’t work

This blog series launched with the proverb it is the people“. So let’s talk about work by some people I’ve loved talking to over the years: Swen Kuh, Lauren Kennedy, Qixuan Chen, Andrew Gelman, and Aki Vehtari. They’ve been thinking about how to evaluate models for multilevel regression and poststratification (MRP).

The typical machine learning loss looks one individual i at a time: Loss(y_i, yhat_i).

Our post on poststratification focused on estimating a population mean E[Y] with yhat, minimizing Loss(E[Y], y_hat). This differs from Loss(y_i, yhat_i). Indeed, Kuh et al. 2023 caution that these losses may order models differently. In other words, individual-level loss may not be great for choosing models for MRP. Kennedy et al. 2024 propose a way to get at the population-level loss Loss(E[Y], y_hat).

This is one of my less-stellar tent pitches. But I have a neighbor ! This is a post about population-level errors (it is the people !), and our tents average to a good pitch.

Kennedy et al. 2024 point out that the first challenge is that we don’t know E[Y] or E[Y|X]:

However, in practice the population truth (cellwise or at the aggregate level) is not available. One option in this case is to approximate the population truth with the sample observation, .

I assume this means what we wrote as Ehat[Y | X = j, sample] in our post on poststratification, also with clumsy notation. In words, the average Y for folks in the sample with the jth possible combination of covariates X. Aggregating these sample averages across X gives the classical poststratification estimate, which we called yhat_PS. In contrast, the MRP estimate yhat_MRP uses a multilevel model for E[Y|X].

So if I understand, Kennedy et al. 2024 propose to evaluate the MRP estimate by comparing it to the classical poststratification estimate ?

The second challenge is that the same data is used for both fitting the MRP model and assessing its error. As is typical in machine learning, they propose to address this with cross-validation:

K-fold cross validation partitions the sample into K non-overlapping folds. The kth fold is removed to fit the model and score calculated using this fold.

So in their equation (9) above, we would need to compute the average Y for folks in the kth fold with X = j. For this to be nonempty, we need K folks in each cell j. With enough covariates X, this is a tall order. In the paper they propose alternatives.

But I’m still stuck on the first challenge, am I understanding correctly their proposal to get at a population-level error ?

Reanalysis of that Nobel prizewinning study of patents and innovation (with R and Stan code)

A few days ago we discussed a paper from 2005, Competition and Innovation: An Inverted-U Relationship, two of whose authors recently won the Nobel prize in economics.

I had some concerns about the analysis, which I can express with reference to the above figure from that paper:

1. The paper’s all about an inverted U relationship, but this is driven by fitting a quadratic curve rather than, say, a curve with diminishing returns.

2. The line does not seem to go through the data points. In particular, the curve seems to be too low at the rightmost part of the graph–an artifact of fitting a quadratic, perhaps?

3. The y-axis is some weighted count of patents but it’s being used as a measure of the more abstract concept of “innovation.”

4. The x-axis is an average of profit margins but it’s being used as labeled a measure of the more abstract concept of “competition.”

5. The model predicts patents from profit margin in the same year, but to the extent the model is appropriate I think you’d expect a lag.

6. They use Poisson regression, but the data are not counts, also if you don’t somehow correct for overdispersion your standard errors will be too low.

The plan

Here are the ways I’m gonna adjust for the above issues in my reanalysis:

1. I’ll fit a quadratic curve to replicate what they did in the paper, and I’ll also fit another family of curves (the “hinge”) that allows for nonlinearity but without enforcing non-monotonicity.

2. One problem with the above graph is that it excludes 20% of the data (see the figure caption). I’ll plot the fitted curves showing all the data. Another possible reason for the problematic fit is that in the paper they say they adjust for industry and year effects, and so I’ll make a plot showing the fitted curve and the data broken down by industry. I’ll do the adjustments in two ways: “fixed effects” as in the published paper (using various R packages), and multilevel modeling (using Stan).

3, 4. I’ll relabel the axes of the graph to more accurately capture the authors’ measures of competitiveness and innovation, but I’ll swallow the concern that the analysis only uses a subset of the data that were available at the time (from the paper, “Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”).

5. This lag thing is an issue, but I have enough concerns with items 3 and 4 that it’s hard for me to take the paper’s theoretical and causal arguments seriously. But, sure, you could replicate my analyses at different lags.

6. I’ll compare four sorts of models: Poisson, quasipoisson, negative binomial, and normal regression on log(y+1). Quasipoisson and negative binomial are two different ways to correct Poisson regression for overdispersion. Modeling on log(y+1) is a completely different way to model data that are mostly positive but have some zeroes, and I think it actually makes the most sense for this example, as the data are not actually counts. When fitting Poisson and negative binomial regressions, I’ll round the data to the nearest integer, which turns out to essentially not affect the results.

The above is not a preregistration plan; I wrote it in the middle of the project after doing some of the analyses already.

The data

Bradford in comments links to a post from 2014 by Leif Nelson and Uri Simonsohn that includes a file linking to the website of Nicholas Bloom, one of the authors of the paper. Navigating the site takes me to this page with a link to a zip file with the data, which I then downloaded.

Good job by Bloom to post the data from a paper published in 2005. I’m not so good with data availability myself.

The data file has 354 records, including data from 1973-1994 and from 17 industries, and for each of these records it has the weighted patent count (“patcw”), the measure of profit margin (“Lc”), and a bunch of variables that I didn’t try to figure out, because these are all I need to replicate the main analysis. (The file does not seem to include a code book, but I’m not complaining, as they’re already way ahead of me by making the data available at all.)

Initial data analysis: quadratic regression using Poisson, quasipoisson, and negative binomial models

I get going by plotting the data and fitting some quadratic models:

Compared to the graph at the top of this post, the above plot shows more data (I’m not trimming the upper and lower 10% of observations), and the quadratic curves are much flatter than were shown in that paper:

(1) green curve: poisson fit to rounded data (using glm() in R)
(2) red curve: quasipoisson fit to rounded data (using glm())
(3) blue curve: quasipoisson fit to raw data (using glm())
(4) pink curve: negative binomial fit to rounded data (using glm.nb())
(5) orange curve: negative binomial fit to rounded data (using stan_glm())

The first four curves are so similar that the lines overlap and you can’t see the green and red curves: They’re there, just overwritten. For each model I display the point estimate of the curve (best fit for models 1-4, posterior median for model 5). In the models above, quasipoisson uses the Poisson fit and then adjusts standard errors to account for overdispersion; negative binomial is a probability distribution that accounts for overdispersion in a different way.

We went from 1 to 2 to see if switching to quasipoisson made a difference (it did; the standard error was much bigger after we allowed for overdispersion); we went from 2 to 3 to see if rounding made a difference (it did’t for these data); we went from 3 to 4 to see if switching to negative binomial made a difference (it didn’t do much, but the standard error increased slightly); and we went from 4 to 5 to see if switching to full Bayes made a difference (it looked a lot different, actually).

I’ll put the code at the end of this post so as not to distract from the story. Here are the relevant pieces of console output:

(1) poisson fit to rounded data
            coef.est coef.se
(Intercept) -74.82    25.85 
Lc          164.76    54.90 
I(Lc^2)     -88.39    29.14

(2) quasipoisson fit to rounded data
            coef.est coef.se
(Intercept) -74.82    84.64 
Lc          164.76   179.78 
I(Lc^2)     -88.39    95.44

(3) quasipoisson fit to raw data
            coef.est coef.se
(Intercept) -75.01    84.01 
Lc          165.13   178.43 
I(Lc^2)     -88.55    94.72 

(4) negative binomial fit to rounded data
            Estimate Std. Error
(Intercept)   -80.80      89.49
Lc            177.20     190.10
I(Lc^2)       -94.85     100.92

(5) negative binomial fit to rounded data (Bayesian)
            Median MAD_SD
(Intercept)  -7.1   33.2 
Lc           21.1   69.5 
I(Lc^2)     -12.2   37.3

In a paper I’d prefer to display these uncertainties graphically; here I’m giving the console output to give a sense of how things might go in our usual workflow of fitting models and looking at them. Here you can see that the quadratic terms all have large standard errors–except for the very first model, the Poisson, but its standard error is too low as it does not account for overdispersion.

The only thing that really puzzles me here is what’s going on with model 5. Could it be that the Bayesian model uses the posterior median rather than the optimum? I’ll check by re-fitting it, running stan_glm on “optimizing” setting:

(5_opt) negative binomial fit to rounded data (Bayesian posterior mode)
            Median MAD_SD
(Intercept)  -5.3   36.8 
Lc           17.1   76.7 
I(Lc^2)     -10.9   39.4 

That’s not much different from the posterior median. So I guess the difference between the negative binomial regressions fit by glm.nb() and stan_glm() arise from differences in the fitting algorithms, or maybe something I missed in the coding. If I wanted to proceed further down this track I would have to investigate this a bit further, reading up on what glm.nb is actually doing in R, and in Stan programming the negative binomial model myself from scratch and also comparing to brms.

For now I’ll just move on. My guess is that the difference in fits has something to do with how seriously the model takes the small number of influential data points near the top of the graph, but I’ll set that aside for now because we’ll be fitting some more models.

Quadratic regression adjusting for industry and year effects

The next step is to adjust for industry and year effects, which I’ll do in a few ways. I’ll aid in the interpretation of these by plotting the data and fitted curve separately for each industry. In the data the industry codes took on 17 different values ranging from 22 to 49, which according to the paper correspond to “two-digit SIC codes.” So I googled “two-digit SIC codes,” which are listed in various places online. I could not find an official link, but various unofficial sources seemed to agree; here’s one such list.

For each plot, black dots show the data for that industry, with a large dot showing the data from the first year of the data and thin black lines showing the time sequence. The colored curves are the fitted quadratics for each industry in an average year, as explained below.

It’s kind of weird that Furniture has so many patents, and that the number of patents for machinery and computers started out high and then dropped, and that electric and gas services had no patents for the first few years and then suddenly had a lot . . . this is just what’s in the data. I checked a few things to make sure I didn’t garble the categories but it’s possible that I’m missing something.

Here are the models I fit:

(3a) blue curve: quasipoisson fit to rounded data, including factors for industry and year (using glm())
(4a) pink curve: negative binomial fit to rounded data, including factors for industry and year (using glm.nb())
(5a) orange curve: negative binomial fit to rounded data, including factors for industry and year (using stan_glm())
(6a) purple curve: multilevel negative binomial fit to rounded data, with varying intercepts for industry and year (using stan_glmer())

Models (3a), (4a), and (5a) include unmodeled coefficients for industry and year (“fixed effects,” in economics jargon); model (6a) considers these coefficients as latent variables and estimates their distributions (this is what economists call “random effects”).

From the above graph you can see that, after adjusting for industry and year effects, the quadratic curves are much stronger. Most of this comes from the industry effects; there’s not much evidence for unexplained variation at the year level. Here’s the relevant console output from model (6a), which conveniently estimates the scale of each batch of varying intercepts:

stan_glmer
 family:       neg_binomial_2 [log]
 formula:      patcw_rounded ~ Lc + I(Lc^2) + (1 | sic2) + (1 | year)
 observations: 354
------
            Median MAD_SD
(Intercept) -68.3   27.1 
Lc          146.6   57.5 
I(Lc^2)     -77.9   30.9 

Auxiliary parameter(s):
                      Median MAD_SD
reciprocal_dispersion 5.6    1.0   

Error terms:
 Groups Name        Std.Dev.
 year   (Intercept) 0.11    
 sic2   (Intercept) 2.11    
Num. levels: year 22, sic2 17

In this fitted model (6a), and also in (4a), and (5a), not shown here, the estimated coefficient for the quadratic term is a bit more than 2 standard errors away from zero, that is, statistically significant at the conventional level. The estimated quadratic term in (3a), the quasipoisson regression, is a bit more than 4 standard errors from zero; I guess this difference is attributable to the different ways that the quasipoisson and negative binomial effectively weight the extreme values in the data.

Quadratic regressions on log(y+1)

When modeling count data we usually start with the negative binomial model with log link. And here I wanted to connect to whatever version of Poisson regression was used in that published paper.

But, as noted above, these data aren’t counts–they’re not even integers, and I think it makes sense to just directly model them on the log scale. Some of the observations have zero values, though, and so I’ll follow the standard practice of modeling nonnegative data y by fitting regressions to log(y+1). My general recommendation along these lines is to model log(y+A), where A is some constant that corresponds to a baseline level of the data. In this case, the 354 data points include 46 zeros, then another 60 values between 0 and 1, then various values (none of them are integers!) ranging as high as 44.7. For the purpose of studying patent counts as measures of innovation, it seems reasonable to add 1 to these data, which blurs the lowest values (there is no big distinction between 0 and the lowest nonzero observation, 0.037) while preserving the distinction between the higher levels. This seems about right: to the extent these data will be supplying a signal on innovation, we’ll want to mostly be learning from data on the high end.

Here’s what happens when we fit the quadratic regression, not adjusting for industry and year effects, to the log(y+1) transformed data. Now we can just use normal errors and we don’t have to worry about Poisson or negative binomials, so there’s just one curve:

You can see the zeros at the very bottom of the graph. The model of normally-distributed errors:

(7) green curve: normal regression fit to log(y+1) (using lm())

is not perfect, but I think it’s close enough, and we no longer have to worry about exactly how we’re modeling extremely high values. On the log scale we still see a fitted quadratic curve. The fit is noisy–the coefficient for the quadratic term has a standard error that is much higher than the estimate itself–but, again, let’s move on to the regressions that adjusts for industry and year effects.

As before, we show the data and fit (this time, with both on the log(y+1) scale) broken out by industry:

The curves come from two fitted models:

(8) orange curve: normal regression fit to log(y+1), including factors for industry and year (using stan_glm())
(9) purple curve: multilevel regression fit to log(y+1), with varying intercepts for industry and year (using stan_glmer())

They’re pretty similar; I assume that the small differences between the two fits arise from the fact that the least-squares model (8) does more adjustment for industries. The estimated coefficients for the quadratic terms are 3 or 4 standard errors from zero. Here’s the output from model (9):

 family:       gaussian [identity]
 formula:      log_patcw ~ Lc + I(Lc^2) + (1 | sic2) + (1 | year)
 observations: 354
------
            Median MAD_SD
(Intercept) -86.8   24.6 
Lc          186.6   52.5 
I(Lc^2)     -98.6   27.9 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.5    0.0   

Error terms:
 Groups   Name        Std.Dev.
 year     (Intercept) 0.09    
 sic2     (Intercept) 1.04    
 Residual             0.47    
Num. levels: year 22, sic2 17

Hinge regression on log(y+1)

As we have discussed already, the big problem with quadratic regression is that it enforces non-monotonicity. I’d like to fit a model that allows nonlinearity but without a declining slope automatically turning into a negative slope.

My first thought was to fit a model with an asymptote, something like this: b_0 + b_1*(1 – exp(x/b_3)), as an alternative. But this curve has the opposite problem: it is restricted to be monotonic. What I really want is a model that can have diminishing returns or can asymptote or can have that inverted U shape, with the question answered by the data.

We have such a model–the hinge function:


This is a curve that smoothly connects two straight lines which can have arbitrary slopes. The parameters of the curve are:
x0: the value of x where the two lines would intersect
a: the value of y where the two lines would intersect
b0: the slope of the line on the left side of the hinge
b1: the slope of the line on the right side of the hinge
delta: the scale of the continuous curve connecting the two lines

For our purposes, we are most interested in b1: is there evidence that this slope is negative?

We program the hinge model in Stan–the code’s right there at the linked post–also including a multilevel model with varying intercepts for industry and year, as above. Here’s the whole program:

functions {
  vector logistic_hinge(vector x, real x0, real a, real b0, real b1, real delta) { 
    vector[size(x)] xdiff = x - x0;
    return a + b0 * xdiff + (b1 - b0) * delta * log1p_exp(xdiff / delta);
  }
}
data {
  real<lower=0> delta;
  int N;
  vector[N] x, y;
  int J1, J2;
  array[N] int<lower=1,upper=J1> group1;
  array[N] int<lower=1,upper=J2> group2;
}
parameters {
  real x0;
  real a, b0, b1;
  real<lower=0> sigma, sigma1, sigma2;
  vector<offset=0, multiplier=sigma1>[J1] a1;
  vector<offset=0, multiplier=sigma2>[J2] a2;
}
model {
  x0 ~ normal(1, 1);
  a ~ normal(0, 100);
  b0 ~ normal(0, 100);
  b1 ~ normal(0, 100);
  a1 ~ normal(0, sigma1);
  a2 ~ normal(0, sigma2);
  y ~ normal(logistic_hinge(x, x0, a, b0, b1, delta) + a1[group1] + a2[group2], sigma);
} 

When fitting the model, we specify the scale parameter delta, setting it to 0.05, which allows for a gentle curve within the range of the data (as you can see from the plots above, x ranges from about 0.85 to 1.0). There’s some arbitrariness here, but it’s just too hard to fit this parameter directly from the data. In effect this curvature is hard-coded into the quadratic regressions fit earlier.

We assign weak priors to the other parameters in the data, effectively excluding extreme values for the slopes of the curves. The only one of these priors that might be confusing is the prior for x0, the x-position of the hinge. We’re soft-bounding it at the high and low ends just so that the fit won’t get lost in extreme values: once x0 is far outside the range of the data, the curve effectively becomes a straight line and the location of the hinge becomes non-identified.

So, yeah, the hinge is a bit more work to fit compared to the quadratic, but that’s the price we have to pay to fit a more flexible model to this small dataset. And in this case I think the flexible model is absolutely necessary given the goal of seeing whether the data indicate that inverted U shape.

Here’s the result:

In each plot, the blue curve represents the posterior median of the parameters and the red curves correspond to 20 draws from the posterior distribution.

Here’s the summary of inferences:

 variable   mean median    sd   mad      q5    q95 rhat ess_bulk ess_tail
   lp__    71.69  71.85  6.65  6.87   60.81  82.19 1.00      838     1428
   x0       0.94   0.93  0.07  0.09    0.82   1.05 1.04      132     1204
   a        4.03   4.03  0.72  0.69    2.84   5.23 1.01     1178     1549
   b0      52.71  36.93 42.67 24.45   15.29 146.35 1.03      251     1288
   b1     -50.66 -31.22 46.50 22.85 -149.89 -11.44 1.03      146      883
   sigma    0.47   0.47  0.02  0.02    0.44   0.50 1.00     4279     2726
   sigma1   1.11   1.08  0.21  0.20    0.81   1.50 1.01      880     1217
   sigma2   0.08   0.08  0.04  0.04    0.02   0.16 1.00      999     1301
   a1[1]   -0.25  -0.25  0.30  0.30   -0.74   0.26 1.00      486      736
   a1[2]    0.23   0.22  0.30  0.30   -0.26   0.75 1.00      491     1036
...

The key parameter is b1, the slope on the right side of the curve. The posterior mean is -50.66 with standard error 46.50–so, apparently, not statistically significant–but if you look at the posterior quantiles, you’ll see that the distribution is very skewed. Indeed, it turns out that all 4000 of the posterior simulation draws of b1 are negative here.

So, after all this modeling, it seems that the data do clearly indicate an inverted U pattern!

That said, I haven’t explored the hinge model too carefully. For example, when I re-fit it with some other choices of the hinge scale parameter delta, it sometimes doesn’t mix well. I think the above graph with delta=0.5 fits these points about as well as is possible, so I’m ok with it as a data summary, but I’m not saying that my fitting procedure is fully computationally robust. If I were to be working more on the computation for this particular problem, I’d start by removing the year effects, as they can be adding stress to the computation without affecting the fit in any meaningful way.

Before getting to the interpretation of these results when it comes to competition and innovation, I want to tie up a couple of loose ends.

Quadratic regression on log(y+1), programmed in Stan

In fitting the above multilevel hinge regression, I moved from the pre-programmed Stan code of stan_glmer() to a custom Stan program. Just to check that nothing funny is going on here, I’ll go back and program the multilevel quadratic regression directly in Stan. It’s easy enough to write the program; I just replace the hinge by a quadratic function and get rid of the priors, and, indeed, we get essentially the same result as when fitting using stan_glmer(). I won’t bother showing the new graph here. No surprise, it’s just good to check.

The other thing we can do with the Bayesian fit is see whether the peak of the curve falls within the range of the data. The curve b0 + b1*x + b2*x^2 has its peak at x = -b1/(2*b2) (just take the derivative, set to 0, and solve for x), so we can compute the posterior distribution of this value from our simulations. It turns out that only 3 of the 4000 simulated curves has a peak outside the range of the data, so fair enough that, according to the fitted quadratic model (which I don’t like), the inverse U pattern is occurring within those bounds.

Adjusting for the group-level mean of the predictor

In general when fitting a multilevel model, it’s a good idea to adjust for the group-level mean of the predictor–in this case, the average value of x within each industry. Otherwise you have to worry about correlations between x and the varying intercepts. In this case, adding this group-level predictor doesn’t change much. I won’t share the result here just because I’ve already done a lot of work to write this up and there are enough other concerns with the analysis, but, yeah, it’s a good idea to include this predictor too.

Summary

OK, so what have we learned?

• The graph from the original paper (reproduced at the top of the above post) did not show a good fit because it didn’t display the adjustments for industry. We can see the pattern a lot clearer using separate plots for each industry.

• I think it makes more sense all around to model these data on the log(y+1) scale rather than using Poisson regression or any of its variants.

• The patterns within and between industries aren’t so clear to me. I don’t get why furniture has so many patents, why the number of patents for machinery and computers dropped, why the number of patents for electric and gas services shot up, and so forth. I’m not saying these numbers are wrong, and I might have made some mistake in coding; I’m just saying I’m baffled.

• I think the hinge model is much more appropriate than the quadratic for the propose of seeing the extent to which the data support an inverted U pattern. It wasn’t hard at all to program, debug, and fit the hinge model, and in this case it does support the inverted U. So in that sense the published paper was correct in its statistical conclusion, even if I think they got kinda lucky in finding the pattern with their quadratic curves.

• I still don’t buy the claim in the paper that they “find strong evidence of an inverted U relationship” between “product market competition and innovation.” Again, my main problem here is their measure of innovation using average weighted patent counts, an issue that further bothers me given the odd data patterns seen in the graphs for some of the industries, also with the selection in the data (“Our sample includes all firms with names beginning “A” to “L” plus all large R&D firms. After removing firms involved in large mergers or acquisitions and those with missing data . . .”).

• Also the model predicts y from x in the same year, and we’d expect a lag. I didn’t bother addressing this because of my other concerns about the data.

I guess that later work followed up with more comprehensive data sources, but I remain concern that any inverse U pattern, even if supported by this particular dataset, could depend strongly on various artifacts of the measures they are using. There aren’t really a lot of data points at high values of x here, so this downward slope is defined by how these patents are being counted in just a couple of industries.

All this might sound like picky criticism, but it’s the authors of the paper (and the Nobel prize committee), not me, who are making these broad claims about competition and innovation, and concerns about measurement seem really important here.

That said, it’s cool to see that a careful statistical analysis does find the inverted U. It does seem like a real pattern in the data, so the only question is the relevance of these data to the larger economics question.

tl;dr

In response to one of the comments below, I wrote that I don’t at all trust the idea of using time variation of patents within an industry to measure time variation of innovation. I get that the authors were doing their best using available data; still, I don’t think their data and analysis provide evidence for their substantive conclusions.

In the immortal words of John Tukey, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

Looking at it that way, all my statistical analysis with the transformations and the quadratic curves and the hinge function was all a waste of time, as I’m just doing a fancy analysis of bad (or, at least, irrelevant) data. But . . . summarizing the statistical information still has value in itself. My analysis just took a few hours, a small fraction of the time the authors must have spent on preparing, analyzing, and interpreting their data. I think it’s important, when doing statistical analysis, to do the best we can do, in this case accounting for all these sources of variation and uncertainty.

Or, to put it another way, yes, it would have been fine to dismiss the published results entirely given the problems with the data. But some of the problems with the data became apparent only after I made those plots showing the time series for each industry. At that point, all the fitting may well have been a waste of time–but I only thought of making the plots because I was trying to understand the puzzling fit.

All the links connecting theory, measurement, data, analysis, and conclusions are important.

P.S. All these data and measurement problems just leap out at us, but there’s a whole world of people out there who just accept these conclusions–even some of the graphs from that 2005 paper–as truth. For example I came across this post by political journalist Matthew Yglesias that just straight up accepts the iffy empirical claims from that paper. It’s tricky–journalists are busy, and it’s natural to think that a much-cited paper written by two Nobel prize winners and published in a top journal has to be correct. I don’t really know what to say, except that this is one of the useful functions of social media, to allow us to push back against default narratives. Not that the default is necessarily wrong, just that it can be wrong, and it’s hard for journalists to escape the bubble and recognize this.

P.P.S. There was some concern about the arbitrariness of the log(y+1) transformation so I also fit the model to sqrt(y), which is a standard variance-stabilizing transformation for count data. The results looked essentially the same:

So the inverted U really seems to be there–but you have to assume the curve has the same shape for all industries. If you fit a separate curve for each industry, there’s no way you’d find the U, and you can see that in the scatterplots. The trouble is that the data are too sparse and variable to try to estimate a separate curve for each industry. Just look at electric and gas services, for example. Or passenger transit.

Data and code
Continue reading

“All Our Default Models Are Wrong: Causal inference for varying treatment effects”: my talk this Saturday morning in Ottawa

It’s at this colloquium on meta-analysis in economic research, Sat 18 Oct 2025, 9:30am:

All Our Default Models Are Wrong: Causal inference for varying treatment effects

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Everybody knows that effects can vary, but the usual models we fit do not account for structure in the variation. This is relevant for generalization from sample to population, for anticipating changes over time, and when designing new studies and analyzing existing data. We discuss several directions for going beyond the usual additive model, along with the challenges of fitting such models and interpreting the results, which tend not to reach conventional “statistical significance.”

Perhaps with audience participation there will be a discussion of the particular relevance of this work to economics.

“On the poor statistical properties of the P-curve meta-analytic procedure”

Richard Morey writes:

My colleague Clint Davis-Stober and I have a new paper at JASA about Simonsohn et al’s “P curve” forensic meta-analytic tests, which are supposed to help identify “evidential value”, “lack of evidential value, and “left skew” in a set of test statistics. You’ve blogged about P-curve-related topics before so I thought you might be interested in our paper.

All the p curve tests are transformations of significant p values, summed to produce a known distribution if the p values are uniform. Their choice of transformations make for tests with hair-raising properties, including extreme sensitivity to values near the criterion, arbitrariness, inadmissibility, nonmonotonicity in the evidence, and general inconsistency of their “power” estimate.

Basically, they use transformations that approach infinity at the on the left at the significance criterion (like the probit) but on truncated one-tailed distributions (the chi-squared and F test statistics). This (and a few other choices) is disastrous.

We make the point that many of these techniques were never vetted by experts, and often are just “verified” by a few simulations. For tests, this is not good enough, but nevertheless these methods can get popular because (in my opinion) they tell people what they want to hear.

I asked Morey why their paper didn’t mention Hedges (1984), given that they were aware of that connection. Morey replied, “That was due to our focus away from the ES estimation techniques, and toward their tests, which are situated more in the nonparametric combination literature (or rather, should be). We figured the estimation angle had been explored by other authors, and since the tests hadn’t, we’d look into the tests (which are quite popular).”

Blake McShane summarizes the Morey and Davis-Stober article as follows:

Morey and Davis-Stober use fundamental mathematical statistics to show that the P-curve:

– Does not test what it claims to test (i.e., skewness or “evidential value,” which as they note “is not a well-defined statistical or scientific concept”)

– Has poor statistical properties including sensitivity, inadmissibility, and (horribile scriptu!) non-monotonicity.

In sum, this means that of the two different sets of methods (bewilderingly) both called the P-curve by the authors of the three papers listed at this website:
– The two sets of forensic meta-analytic tests introduced in JEP:G 2014 and JEP:G 2015 are poor tests as per Morey and Davis-Stober’s new paper.
– The meta-analytic effect size estimation method “introduced” in Perspectives 2014 is an alternative and inferior implementation of Larry Hedges famous 1984 paper/method as per prior work.

My take on this is slightly different. I agree with Morey, Davis-Stober, and McShane on statistical grounds, and I like the Hedges (1984) paper. In practice, though, I don’t see the different methods (again, I refer you to my post from 2018 for details) as being so different in practice, and my reason for saying this is that I see these methods more as rhetorical tools than anything else. I wouldn’t trust the estimates or hypothesis tests from any of these approaches!

To label a method as a “rhetorical tool” is not an insult. Rhetorical tools can be useful!

I offer a three well-known examples of statistical ideas arising in the field of science criticism, three methods whose main value is rhetorical:

1. The famous claim of John Ioannidis (2005) that “most published research findings are false.” As I wrote a few years ago , I don’t buy Ioannidis’s mathematical framing of the problem, in which published findings map to hypotheses that are “true” or “false,” but I’m still glad that Ioannidis wrote that paper, and I agree with his main point. I see his paper as rhetorical, as making the point in a simple way that most published research findings could be false, even though I would not want to use his method to actually estimate the rate of false findings, however that is defined. Again, this is not a slam on that influential 2005 paper, just my judgment on how it is useful.

2. The concepts of Type M and Type S errors, which I developed with Francis Tuerlinckx in 2000 and John Carlin in 2014. This has been an influential idea–ok, not as influential as Ioannidis’s paper!–and I like it a lot, but it doesn’t correspond to a method that I will typically use in practice. To me, the value of the concepts of Type M and Type S errors is they help us understand certain existing statistical procedures, such as selection on statistical significance, that have serious problems. There’s mathematical content here for sure, but I fundamentally think of these error calculations as having rhetorical value for the design of studies and interpretation of reported results.

3. Multiverse analysis (see this paper with Sara Steegen, Francis Tuerlinckx, and Wolf Vanpaemel) is another one of my popular ideas related to science reform. The method has caught on, and I’ve used it to help understand what can go wrong in science, but I don’t plan to use it in any of my own research! As with examples 1 and 2 above, I see multiverse analysis as more of a rhetorical tool, a way to see what can go wrong if you pick out just one analysis. It’s a kind of living version of researcher degrees of freedom and the garden of forking paths, just to mention two ideas which are more obviously rhetorical.

I don’t think all my methods related to science reform are primarily rhetorical; for example, I really do use multilevel modeling instead of multiple comparison corrections (here’s one of many examples), but there are times that a method can be helpful merely by serving a rhetorical purpose, as with the three items above.

And that’s how I see the p-curve method. No way I’d want to perform a formal hypothesis test or anything like that to detect the presence of questionable research practices or whatever. For reasons discussed by Morey and Davis-Stober, I think that would be a bad idea. But that’s never what I saw as the virtue of the p-curve approach. To me, the value of that method has always been in the concept that you could display an empirical distribution of p-values and compare to your general expectation of what would happen if everything were reported. Doing some math to make the formal comparison is fine in that it sharpens intuition, in the same way that the math in the Type M and S errors helps us understand the implications of these methods. But there are just too many assumptions floating around for me to want to do anything else here. For example, no way would I declare a problem just because a p-curve test yielded a p-value of less than 0.05, nor would I say everything’s ok just because such a test fails to reject.

I get that people do use the p-curve as a hypothesis test, so I respect the Morey and Davis-Stober paper: you have to meet people where they are.

Here’s what I wrote on the topic back in 2018:

I get concerned if people take these methods too literally. Consider the classic file-drawer-effect paper by Rosenthal, which I assume was written to demonstrate how serious this selection problem can be, but is sometimes twisted around to give the opposite meaning (by doing the calculation of how many papers would need to have been discarded to be consistent with a particular pattern of published results, and then claiming that since no such massive “file drawer” exists, the published claims should be accepted). I wouldn’t want researchers to take p-curve, or the Hedges approach, as evidence that a literature of uncontrolled p-values is approximately just fine.

As is often the case, I find myself more convinced by the demonstration of bias than by the attempted bias correction. In that sense, I see the Hedges procedure, or p-curve, or p-uniform, as being comparable to Type M and Type S errors (Gelman and Tuerlinckx, 2000) as a way of quantifying some effects of selection bias in statistical inference, but the desired solution is to go back to the original, unselected, data. All these methods can be useful in giving us a sense of the scale of bias arising in idealized situations.

I also recommend my 2013 paper with the late Keith O’Rourke, Difficulties in making inferences about scientific truth from distributions of published p-values.

Morey and Davis-Stober’s reactions

I sent the above post to Richard Morey and Clint Davis-Stober, who had the following comments:

Davis-Stober:

I’m not sure what it means for a person to “benefit from the method in practice” in this case. “Forensic” methods like p-curve are kind of strange conceptually. Usually, we’re building models to explain data–with the humility that our models are probably wrong but we’re trying to test out explanations. Analogous to a point made by Pek et al. (2025), forensic methods like p-curve don’t seem to work this way–the data themselves are suspect and it’s not clear what generating process created them. How do we know when p-curve works in practice? I think these methods are kind of a conceptual dead end.

Pek, J., Wu, H., Liu, Y., Dusenbery, K. J., & Wegener, D. T. (2025). What does a Z-curve analysis tell us? Manuscript submitted for review.

Morey:

I’d like to reiterate that these tests are really not well-connected to Hedges’ (1984) work–they’re nonparametric meta-analytic combinations of p values (e.g. Birnbaum 1954). This is relevant because there are really three things that are called the p curve, and this has caused a great deal of confusion. Obviously, the first “p curve” is just the distribution of p values under some hypothesis. The second is the p curve tests (tests based on sums of transformed p values, which we discuss) and the third is the meta-analytic methods for estimating effect sizes (it is these that are related to Hedges work). These three ideas are all only tenuously related. Some of the reaction to our work has been incredulous that we would not look at distributions of p values. Importantly, that’s not our argument. We don’t have a problem with looking at distributions of p values, but we would stress that one should be extremely humble in drawing concrete conclusions based on them.

I’m not against the rhetorical use of statistical ideas, and in fact I think all statistical methods are basically rhetorical: statistical methods are thought experiments, or kinds of games). If we accept that, then the question is what is the argument being made using the method? In the case of the p curve tests we discuss, the argument is clearly about specific bodies of work. The authors claimed to have “a key to the file drawer”, and these tests are supposed to be that key. The construction of any test can be regarded as the premises on which whatever rhetorical conclusion, and we argue that in this case the strength of any argument you’d construct on the basis of the tests specifically is weak, because they statistical properties are poor.

In the spirit of transparency and scholarship, I believe that the authors should have done a much more thorough job justifying their tests (that is, if you like, fleshing out precisely what the premises of the argument is that these methods allow you to make). If they had published in a statistical venue, they probably would have had to do much more work in fleshing out the properties of the test. As it is, they did not, and so we’ve had 10 years of people using this method without really understanding what it is. There have been 80 years of statistical literature on these types of methods (meta-analytic combinations of p values); I think they should have interfaced with that literature so that people (including them) could be properly informed. Instead, they cited Abelson’s introductory statistical pamphlet to justify their choice of transformation. I think this is poor scholarship.

Simonsohn’s reaction

I sent the above to Simonsohn, who replied:

Thanks Andrew for asking for my take on this.

I have written a Data Colada[129] blog post discussing the critique by Richard & Clintin. It’s titled “p-curve works well in practice, but, does it work well if you drop a piano on it?” I tried to make it short and fun, while engaging with the substance of the critique head on.

In a nutshell, none of the four criticisms raised are new. More importantly, they are unlikely to be consequential in real-life analyses (rather than in carefully chosen edge cases used to make rhetorical points).

Every statistical tool can perform poorly in edge cases. This is true even of the simple t-test (e.g., with outliers) and regression (under too many situations to list).

The question is not “is this tool perfect in all situations”. No tool is.

The question is “in realistic circumstances, are we better off with this tool than with the best alternative?”

The JASA paper doesn’t even try to answer that question.

Youall can read all the above and make your own decisions. I guess that much will depend on what sort of problems you’re working on and what you’re planning to do with these statistical summaries.

Condition numbers for HMC and the funnel

This post is by Bob.

Back to some technical statistical computing.

Condition numbers for random walks

The usual notion of condition number is the ratio of the largest to the smallest eigenvalue of the negative Hessian. Large eigenvalues correspond to high curvature and small eigenvalues to low curvature. Condition numbers matter because the step size needs to be small enough to deal with the regions of high curvature and thus will require many steps to traverse flatter regions of low curvature. Eigenvalues of the negative Hessian act like inverse variances (they are inverse variances in a multivariate normal with a diagonal covariance matrix), and are thus squared scales. If you set the step size to be consistent with the direction of highest curvature, you have to take a number of steps equal to the condition number to move in the direction of lowest curvature—this is the condition number. It bounds how many steps are going to be required to get roughly independent draws.

Neal’s funnel

Radford Neal introduced a funnel density in his slice sampling paper. I assume he was well aware of just how nasty this example is. The funnel is a centered parameterization of a hierarchical model with no data in N dimensions:

y ~ normal(0, 3)
x[1:N - 1] ~ normal(0, exp(y / 2))

Here’s a density plot of y versus x[1] from the Stan User’s Guide chapter on reparameterization.

As you move along the y axis between +6 and -6, the condition number goes from 1000 to roughly 1 at the origin back up to 1000. From conditioning, both the mouth and the neck of the funnel are tricky. And this is only +/- two standard deviations, which is only approximately 95% of the probability mass. One of the things that makes the funnel nasty is that during the move from -6 to 6, the eigenstructure changes with the principal eigenvector (the one with the largest eigenvalue), changes alignment from along the x axes to along the y axis.

It is very hard to estimate the uncertainty in the funnel using sampling, even independent sampling. The problem is that x[n]^2 has a mean of roughly 100, but x[n]^4 has a mean of 2 x 10^8 (!) and thus x[n]^2 itself has a standard deviation of 1.4 x 10^4 (I’m using the fact that var[X^2] = E[X^4] - E[X^2]^2). This has to be enormously skewed to the right because the values are bounded below by 0. Even with 10 billion independent draws from the funnel, the estimates of the expectation and variance of the x coordinates are all over the place.

Condition numbers for HMC

HMC is so effective precisely because it overcomes the random walk behavior of Metropolis. Where Metropolis takes O(N^2) amount of work to move a distance of N, HMC only requires O(N^5/4). But there’s still this nasty constant from conditioning lurking in that asymptotic complexity result.

I don’t know how I missed it before, but I only learned about this paper at the MCM conference in Chicago last month:

Langmore et al. introduce an appropriate notion of condition for HMC,

kappa = [ SUM_{n=1}^N (lambdaMax / lambda[n])^4 ]^(1/4)

where lambda[1:N] are the eigenvalues of the negative Hessian, and lambdaMax = max(lambda[1:N]). This tells us that it’s worse to have one big eigenvalue (one highly curved dimension) and many small eigenvalues (flat dimensions) than the other way around. Therefore, the funnel is actually more poorly conditioned for HMC in the mouth than in the neck. In the mouth, the largest eigenvalue corresponds to the relatively slow moving y axis and the x axes are all much lower curvature relatively speaking. The reason the neck is usually considered the source of the problem is that the leapfrog algorithm in HMC is only a first-order (i.e., gradient-based) approximation of the Hamiltonian trajectory, and it can diverge pretty quickly in regions of high curvature. It turns out that if you take HMC or NUTS and use a fixed step size, you cannot explore the tails of either the neck or the mouth of the funnel very well.

The Desperation of Causal Inference in Ecology

This post is by Lizzie. The Figure is take from Frank 2024.

I was in a meeting a little over a year ago in which I asked a student to define causal inference. The definition he gave me focused on complex approaches often used to try drag causality out of observational data. So I asked if causal inference involved experiments at all? “No,” came the reply. I double-checked. “No.” The student was certain. Someone else following up later did not change their mind.

Experiments cannot help with causal inference.

I knew we had a problem then, but how did it happen? I’ll tell you my version of what happened and some of what I can put together for how this happened, but I am open to other theories and ideas. And if perhaps the new ‘causal inference’ movement in ecology really has — finally — struck on a way for us to figure out ecology, then time will obviously prove me wrong, and you’re welcome to beat time to it in the comments section.

I could start back with Sewall Wright and the demes of cows I was once told he used to visit and Fisher and his fields of corn (or some agreeable consistent crop just waiting for its split plot design), but I will just start in the 1990s with path analysis in ecology. Path analysis (what I would call structural equation modeling with standardized coefficients) was hot in the 1990s in ecology. There was a chapter on it by Mitchell in the book ‘Design and Analysis of Ecological Experiments’ in 2001 (perhaps around its peak). It had this on the first page:

plant traits → visitation → pollination → reproduction

Isn’t that great? I could link plant traits to plant reproduction via those traits’ effects on (insect) visitation (to flowers) and how all that racey visiting led to pollination and then — reproduction (and then I might even make a run at … plant fitness!). I mean it is great. I like the idea. I liked the chapter. I did a path analysis. But I didn’t call it path analysis, I called it structural equation modeling because, by the time I was publishing, path analysis had hit some bumps.

Namely, everyone had done path analysis and many of those people had done it poorly in one way or another and suddenly all those little paths looked like a lot of made up stories with lots of little asterisks representing lots of significant p-values that didn’t really hold up to scrutiny. Shocker! (No, not shocker.) So, we all stopped doing path analysis and (within a few years it seems to me) we started doing structural equation modeling, sometimes with standardized coefficients. But we never called it path analysis again.

We couldn’t let go of path analysis because the dream was still alive. We wanted causality. We wanted to link things to explain how the world works. And manipulating plant traits is hard (have you ever tried to paint flowers different colors in a field? Or paste on tiny hairs (which we call trichomes)?), but measuring them is comparatively less hard. We wanted causality from observational data. That was the dream.

And, the dream is still alive. After all this time.

And the dreamers seem to have just discovered some of the basics of causal inference for observational data from the social sciences and econometrics literatures. With this, they have discovered that diversity (more species) in grasslands leads to lower productivity, not higher (Dee et al. 2023) and linked white nose syndrome in bats to increased infant mortality across the eastern US (Frank 2024). This latter paper is the one that rattled me because I attended a discussion group with colleagues and found out how many of my colleagues are excited by these ‘new techniques’ and how they have learned from them the amazing power of fixed effects for finding causality and the dangers of random effects to lead us astray.

Huh? Fixed effects to save the day and random effects of doom?

I tracked some of this down to me thinking of the common ecology definition of fixed versus random (I think some closer to definition #2, page 245 of Gelman and Hill: “2. Effects are fixed if they are interesting in themselves or random if there is interest in the underlying population. Searle, Casella, and McCulloch (1992, section 1.4) explore this distinction in depth.”) whereas the ‘new’ methods in ecology are using (I believe) definition # 5 (“5. Fixed effects are estimated using least squares (or, more generally, maximum likelihood) and random effects are estimated with shrinkage (“linear unbiased prediction” in the terminology of Robinson, 1991). This definition is standard in the multilevel modeling literature (see, for example, Snijders and Bosker, 1999, section 4.2) and in econometrics….”).

This explains some of the interesting lines I found in these papers, including:

Random effects account for clustering in data via the error structure of the model (Bolker et al. 2009; Gelman and Hill 2006), rather than estimating cluster means as part of the data generating process of a model (i.e., via fixed effect for each cluster’s mean, using the terminology of the mixed models literature). (Byrnes & Dee 2025)

The time-varying site attributes (μ_{st}) are also modeled in a fully flexible way that allows a year- specific effect for each site (in the estimation, an indicator for each year is interacted with an indicator for each site). (Dee et al. 2023)

I think the authors of this new Tower of Babel for ecology have also defined random and mixed effects to mean only ever linear models with lmer-style partial pooling on intercepts (never slopes I presume?) with fixed effects on slopes (back to definition #2). They even go so far as to refer to this as the “Common Design in Ecology” (they also capitalize Ecology and Ecologists in Byrnes & Dee 2025, which I find odd — is this high German? Personally, as an ecologist, I don’t think I need an capital letter) and explain:

Without more variable transformations, the multi-level modeling approach does not easily lend itself to controlling for as many unobservable sources of confounding as can be done in our linear, additive, fixed-effects panel data estimator. (Dee et al. 2023, in supp)

I thought about calling this post ‘The Tower of Babeling Causality’ or ‘The Problem with Statistical Terminology,’ but the real problem is not how lost in the weeds of words we get in with terminology. It’s partly how easily ecologists do want ‘new’ terms and approaches that will solve everything. The authors who have come armed with econometrics panel data approaches and instrument variable analysis (when most ecologists don’t know what is an instrument in their experiments) had the ground laid for them by all the ecologists who are enthralled by ‘random effects.’ I agree we have too many people trained to believe that chucking enough categorical covariates (site, plot, year …) on the intercept of a simple linear model will save the day. It’s a problem how much we sway from this being correct statistics to that being correct statistics. And how quickly think a new approach will change everything in ecology. We seem to quickly learn — and re-learn — that bad stats can easily lead you astray, but never take on that good statistics alone will not save you.

To be clear, I don’t have a giant problem with these methods. I have a problem with how they are presented as saviors (and somewhat how they are presented as new, but perhaps we need the ‘new’ and ‘savior’ angle to follow Grace) but I have a bigger problem in how rapidly they are being taken up. I fear the next 10 years I will live in a sea of piranhas where lots of ecological problems explain 5-10% of infant mortality and plant productivity.

And that’s the other problem — the bigger one: how much people want this causality. They want to believe that we have the data and methods to show that a disease that wipes out bats leads to an 8% increase in infant mortality. Of course we should want causality, we’re scientists, but the drive for causality seems to jettison a lot of the stuff we also need as scientists, especially estimates of uncertainty and the ability to leave room for uncertainty so that we search out better methods and better answers. I don’t know if bat decline has increased infant mortality 8% (though I highly doubt that number given the language of the author and how ‘outrageous’ he thinks it is that he is expected to share all his data for people to believe his claims). I just know we have managed to do science before and make progress and it wasn’t because we got better statistical methods or memorized a glossary of one particular set of people’s definitions of DAGs and fixed effects.

I am cited in one of these papers for old work I did where I compared shifts in the timing of flowering and leafout with warming over time (due to anthropogenic climate change and natural variation) and experimental warming (due to infrared heaters or teeny tiny plastic greenhouses — also, hello instruments in ecological experiments!). Estimates from experimental and observational data were different — the effect of warming in observational data was bigger. I did lots of different statistical analyses to figure this out, I even did something probably close to the ‘Common Design in Ecology’ (although the authors don’t seem to ding me for this) and the effect never went away. With Ailene Ettinger and other colleagues, I eventually got all new data and found the same thing using slightly different statistics. But that wasn’t why we got all the new data, we got it to test hypotheses about what drove the difference. And we found out that it appeared to be two things: warming experiments dry out soils which delays leafout and flowering and warming experiments over-report their warming (so their per degree estimates look smaller than they should).

I did all of this without ever invoking the term ‘causal inference.’ And that’s what really worries me for trainees today; that ‘causal inference’ will now mean a narrow branch of amazing ‘fully-flexible’ completely un-confounded statistics. We’re ecologists; we actually can manipulate some stuff. And somehow we’re going so gaga for econometrics statistics to give us causality through time-invariant fixed effects (or whatever) that we have students who don’t know how experiments could relate to causal inference.

What’s the solution? If you ask me, be less gaga over any statistical method (and I do love my own statistical methods so I could practice a little more of what I preach) and teach everyone basic mathematical notation and basic biological models. Teach them that generative modeling doesn’t belong to any one part of statistics or to only fixed or random effects. Teach them to be able to write out a simple biological model and simulate data from it and then fit their statistical model to it. ‘Only connect!’ Connect the models you learn for ecological theory with those you learn in stats. (And maybe teach them about the long debate in conservation biology about Cassandra’s curse, but that is a topic for another post.)

Weighting of evidence and conflict of interest at the FDA and elsewhere

JJ Levine writes:

I wanted to share a thought about the expected changes at the FDA and related agencies in the new administration. I am wondering if many of the controversies in health are really a problem of data aggregation.

Here is an article from the Washington Monthly that talks about the new proposed head of the FDA and some of his controversial opinions.

Aside from one erroneous line (“In statistics, ‘significance’ refers to the likelihood that the results are correct, not that there is a major effect.”), the article seems to be mainly about how to put together evidence from various different studies.

What I am wondering is what can be done to foster decision making at the highest levels that is not based on pull quotes from selected studies but is based on the weighing of evidence from different sources? How would a Bayesian framework deal with these issues?

The linked article, by Merrill Goozner, quotes Marty Makary, the potential incoming leader of the FDA, as writing in his book “Blind Spots” that “Home deliveries triple the risk of infant mortality.”

Goozner follows up on this triple-the-risk statement:

That shocking claim . . . sent me rushing for the footnotes. There were none. So, I searched the medical literature and found a 2010 meta-analysis (an analysis using pooled data from a group of studies on the same subject) in the American Journal of Obstetrics and Gynecology that made that claim. It reviewed data from over one-half million planned hospital or planned home deliveries. However, only 1 of the 12 studies used for the meta-analysis was a randomized controlled trial. The rest were observational studies, some with some without matched cohorts. The low quality of the evidence was blasted by letter writers to the Journal.

I [Goozner also found a subsequent review of that meta-analysis that warned, “the authors’ conclusions should be treated with some caution as they did not reflect all the evidence presented in the review.” . . . the meta-analysis showed no statistical difference in the two groups in perinatal infant mortality (deaths up to 7 days after childbirth).” The alleged “tripling” was only in neonatal infant mortality (deaths up to 28 days after childbirth), which logic suggests would include more deaths not associated with childbirth itself. The critique also pointed out that the data on neonatal infant deaths came from trials that included fewer than 50,000 births — less than a tenth of the meta-analysis’ total review population.

In a long book, anyone can make mistakes, and Goozner writes that many of Makary’s points are very reasonable. Goozner is concerned with Makary’s blind spots:

While Makary digs deep in the medical literature to find studies that justify his opinions, he avoids mention of the hundreds of medical journal studies published throughout his quarter-century career that document the pernicious effects of industry’s role in financing medical research, clinical practice guidelines, patient advocacy groups, continuing medical education, and physician marketing. He ignores how industry-funded research supported the extensive use of drugs and devices beyond the indications included on the FDA-approved label, often causing great harm in the process.

At this point, I should note my own conflicts of interest, that I’ve collaborated with colleagues in the pharmaceutical industry and I’ve received research support in those collaborations.

I agree with Levine that combining information is key here. I don’t see any easy answers. The conflicts of interest in health care are huge. That said, we don’t just have to talk about health care here. The conflicts of interest in military procurement, funding of police departments and prisons, etc., are even bigger, right? And, as Goozner notes, political pressures and conflicts of interest regarding the FDA are not new.

Hey! Here’s what to do when you have two or more surveys on the same population! (Combining survey data obtained using different modes of sampling)

This has come up before (in 2011 and in 2018) but someone just asked me again today so I thought I’d give this advice again, since I haven’t seen it written anywhere.

Here was the question:

I’m reaching out to seek your advice on how to integrate two probability samples for the new Poverty Tracker cohort.

In the Poverty Tracker 2024 cohort, we finally got the chance to include an Address Based Sampling (ABS) sample. Our sample design includes half of the sample coming from the Random Digit Dial frame and the other half recruited through the Address-Based Sampling. I’m trying to map out how we’d be able to integrate those samples from two frames but having a hard time to find resources.

And here’s my reply:

The right thing to do is to simply pool the data together from both samples into a single dataset. Also in the dataset include an indicator that says, for each data point, which sample it came from. Then we do all analysis (including construction of weights, if we want to do that) using the combined dataset. When fitting models, running regressions, we include the indicator for which sample, just in case it is predictive of anything.

You can do something similar if you want to combine more than two samples; just include an indicator for each sample. And the same idea applies when combining raw data from multiple surveys (although then you might need to do some work to line up relevant poststratification variables, for example if the two surveys use different categories or different question wordings when asking about education or ethnicity or party identification or whatever).

“Beyond Averages: Measuring Consistency and Volatility in NBA Player and Team Offense”

Meer Patel writes:

I’m a rising junior in high school and recently completed a study titled “Beyond Averages: Measuring Consistency and Volatility in NBA Player and Team Offense.” I used game-level data to build a new metric for offensive impact and analyzed how player performance fluctuates over time.

From the abstract of his paper:

While traditional player evaluation metrics focus almost exclusively on season-long averages, I propose a framework that incorporates both the magnitude and consistency of offensive impact normalized per minute of play using game-level data from the 2024-25 NBA season. . . . I introduce the Net Offensive Impact (NOI) statistic and use the coefficient of variation of NOI per minute over a fixed, randomly sampled set of games totaling approximately 400 minutes per player to quantify each player’s volatility using a standardized approach. . . . while the league’s top offensive performers tend to be both productive and stable, many players–especially role players and high-variance “wildcard” scorers–display far greater fluctuation. . . . Offensive consistency is closely linked to individual playing time and can also help predict team success to an extent, but falls short in the playoffs. . . . Offensive consistency is a valued but not singularly decisive attribute; both steady and volatile offensive contributors play important roles in shaping NBA outcomes depending on the situation.

I took a quick look at the first version of his paper and wrote that it’s hard for me to evaluate the basketball stuff because I don’t know so much about basketball. But I did have a statistical comment, which is that when your sample size is smaller you will see more variation in the average, and I think that’s what you’re seeing here. So I don’t know that you’re really finding evidence that some players are much more variable than others; you could just be noticing in a different way that some players play many more minutes than others. One way to look at this would be randomly subset your data so that you have the same number of data points for each player.

Patel responded with a revision, writing:

Now, I randomly select the same number of games (20) for each player when calculating volatility. I describe this approach in the updated Data and Methodology section. All results, tables, and figures are based on these random subsets. There were some changes, particularly in the top 5 most consistent and volatile player rankings, where the list changed drastically. That being said, I found that the main findings remained consistent when using equal-sized samples, which gives me even more confidence in the results.

I replied that I’m still concerned about sample size issues. Maybe you should sample equal number of minutes, rather than games, for the selected players? My intuition is still that when you measure volatility, you will find more volatility for players with smaller sample size per game. If you’re measuring game-to-game variability, then players who play fewer minutes will show more variability even with no differences between players, just because fewer minutes is a smaller sample size per game.

Patel replied:

After reading your last email, I revisited my methodology to make sure I’m not unintentionally overstating player-to-player differences in consistency, and updated my paper, which is attached to this email.

For each player, I randomly sample games (from their full season log) until I reach 400 total minutes, then compute NOI per minute for each sampled game, which normalizes all player contributions on the same standardized scale. All volatility statistics, such as the coefficient of variation (CV), are then calculated from this per-minute series, rather than from raw game-to-game NOI.

This adjustment removes the bias where players with fewer minutes per game could appear artificially more volatile, since all rate and variability statistics now reflect offensive production per minute played, regardless of a player’s role or average playing time.

In addition, to further check robustness, I repeated the random sampling with different seeds and confirmed that the volatility and consistency results remain stable across samples.

Enjoy. I have not read the paper in any sort of detail so have no comment on the findings there. (That’s not a negative statement on my part, it’s just the literal truth.) I’m posting here because the general idea sounds cool, issues of implementation aside. Studies of variation in sports are always challenging.

The last time I can recalling sharing someone’s NBA analysis was in this post from 2008. The guy who sent me that earlier item, Eli Witus, had written:

I don’t have any formal statistical training, so I am learning as I go. . . . I am very interested in multilevel modeling–I think it could be very useful in basketball since the game is much more interactive than baseball, and player statistics are heavily dependent on the context of the player’s teammates and coach. I think multilevel modeling could help answer questions about how a player’s statistics are likely to change if he changes teams.

I haven’t heard from Witus in a long time and his blog is no longer around, so I googled his name, and . . . it seems that he’s now the Executive Vice President of Basketball Operations & Assistant General Manager at the Houston Rockets. So all things are possible! In the meantime, if anyone near Edison, New Jersey, has interest in some basketball analytics, there’s nothing stopping you from contacting Meer Patel directly.

Measurement error in the strike zone

Under the subject line, “A brief note on one of your favorite topics: measurement,” Jonathan Falk points to this post from the evil sports gambling company Draft Kings:

Falk writes:

Best comment (among a few hundred idiotic ones):

Bad news. They later found that home plate, due to a manufacturing error, was 7 nanometers too wide, so the automated system, expecting the normal size for home plate, was actually incorrect.

(1) Fitting hierarchical models in genetics, (2) A Stan model that runs faster with 400,000 latent parameters, (3) Super-scalable penalized maximum likelihood inference for biome problems, (4) “In the end, I basically gave up working on biology because of the politics.”

Our recent post elicited this comment from Bob that was so detailed I thought it deserved its own post.

This one comment from Bob has enough material for 4 standalone posts:

1. Fitting hierarchical models in genetics: why the full Bayesian approach gives the right answer, whereas an existing shortcut method using approximations and likelihood ratio tests does not.

2. Stan’s HMC scales well in high dimension, but not so well in bad geometry, so for this problem it was better to use a Poisson/gamma model with 400,000 latent parameters than a much lower dimensional but computationally awkward negative binomial model. (Recall the dictum that mixture models all have computational problems and that all computational problems are essentially mixture models.)

3. “Super-scalable penalized maximum likelihood inference for biome problems with some collaborators in optimization. With 150K biomes averaging 3 megabases each and arbitrary amounts of data, it fits in about 20m on a couple decent servers (doesn’t need GPU).”

4. The struggles of interdisciplinary, or interlab, collaboration within certain subfields of biology.

Bob’s summary of that last point: “It’s trivial to rewrite a statistically sound, fully Bayesian version of DESeq2, but biologists are too conservative to use new tools, so I haven’t even tried to fix this. In the end, I basically gave up working on biology because of the politics.”

And here’s Bob’s comment in full:

There are already hierarchical models in wide use in genomics. For example, rMATS is a system for inferring splice variants. The statistical computation in that package is a mess, which led them to the ridiculous conclusion that fewer variants from a population with deeper sequencing is better for estimating the population values than more variants with less deep sequencing (basic stats will show you the latter is better if your goal is estimating the population). I rewrote it all as a Bayesian model. It gets the right answer statistically, namely that it’s better to have more samples at lower sequence depth, unlike rMATS. The problem with rMATS is their Laplace approximation for max marginal likelihood and then their likelihood ratio tests, which together give them the wrong results statistically, even for their model.

Negative binomials induce really bad geometry, as does just about anything that gives you multiple ways to explain the same data (i.e., either higher mean or higher dispersion can explain a large value). So I reparameterized in two ways. First, I coded the differential expression with a gamma-Poisson rather than negative binomial, which with 20 sequencing runs over 20K splice variants led to 400K latent parameters. Second, I use an overall mean plus difference parameterization on Gelman’s advice. Stan’s HMC scales well in high dimension, but not so well in bad geometry, so fitting is OK.

For political reasons, I haven’t been able to publish. As an example, my original collaborator’s Ph.D. supervisor literally won’t let her work on anything other than his project and the person whose lab she rotated through is only interested if it produces “breakthrough biology” we can publish in Nature, which is not something I can deliver. I haven’t been able to find anyone who can help me evaluate this and write it up in a way that the biologists would be OK with. I’d be happy to share the code. I even hired an intern specifically to work on this, but the intern showed up and refused to work on the problem!

More recently, I worked on super-scalable penalized maximum likelihood inference for biome problems with some collaborators in optimization. With 150K biomes averaging 3 megabases each and arbitrary amounts of data, it fits in about 20m on a couple decent servers (doesn’t need GPU). It’s like Rob Patro et al.’s salmon, but I figured out how to do all the bias adjustments for things like hexamer and positional bias (just push the uncertainty through Bayes style—not hard), which is why Rob told me that he abandoned the approach. I’ve had the same problem getting a biologist to help evaluate and write up. Even my collaborators on the autism project where I have a ton of gut biome data from Simons Foundation’s Autism Research Initiative (SFARI) and we did publish a paper, but they say they have zero time to do this. Everyone says they know someone, but when I contact them, they’re too busy. If I could come up with half a year funding for a postdoc, they’d loan me one (they really do assign their postdocs work—it’s not at all like CS and stats). It’s amazing to me how constrained biologists are in working with other people. The code we produced was really great, because Robert Gower (an optimization specialist) helped add a mirror descent optimization algorithm, which is about an order of magnitude faster than the L-BFGS default in Stan. And Robert Blackwell (it’s the 3 Roberts project), a high-performace compute specialist, added some high performance sparse matrix computation in C++ and figured out how to scale it out on a (CPU) cluster. What’s cool is that time doesn’t depend on the data size other than the really fast conversion to k-mer counts.

The autism gut biome paper for which I helped with the Stan modeling is Multi-level analysis of the gut–brain axis shows autism spectrum disorder-associated molecular and microbial profiles. This project was led by Jamie Morton, who was an amazing postdoc at Flatiron. Jamie wrote a really nice overview of how to think about compositional analyses like differential expression in Establishing microbial composition measurement standards with reference frames. But he has zero time to help build a state-of-the-art differential expression tool.

I would suggest looking at DESeq2 to see what a mess current differential expression inference is. It’s pretty easy to see why none of their p-values are well calibrated. It’s trivial to rewrite a statistically sound, fully Bayesian version of DESeq2, but biologists are too conservative to use new tools, so I haven’t even tried to fix this.

In the end, I basically gave up working on biology because of the politics.

Using hierarchical modeling to get more stable rankings of gene expression

Will Macnair writes:

I am a computational biologist (with a maths/stats background) working with genomic data. One task that comes up often is estimating changes in expression level between two conditions, for many thousands of genes, then ranking the genes in order of some measure of “confidence.”

My question is: In a Bayesian regression setting, what is a good way to rank the genes? If I calculate Bayes factors for “full model” vs “covariates-only model” and use these to rank the genes, is this ok? Are there better options?

To give a bit more detail, this task is typically referred to as “differential expression”, and we usually do this for all genes that pass some filter on very low / very boring expression. In the frequentist setting, the ranking is done with an adjusted p-value, and there are various popular methods for doing this (e.g. edgeR and DESeq2). For some recent analysis I have been using rstanarm, and I’ve been wondering what would be the best way to achieve a confidence-type ranking in this setting. The reason we do ranking is that we might do follow-up experiments for some genes, however the experiments are a lot of work. So we would like to prioritise genes where we have high confidence that something interesting is going on.

Some other thoughts:

– I searched previous blog entries and found this one. There is some nice discussion in the comments, in particular the thread started by Bob Carpenter where he talks about framing it as ranking. If I’ve missed something more recent, please point me to that!

– There are also some comments from Daniel Lakeland on how there has been a lot of work on the FDR approach, despite it being a bad framing of the problem (i.e. “effect yes/no” rather than “ranking under cost constraints”). The blog was posted in 2016, and the approach taken in the field has not really changed much since then, so your input would be valuable.

– In the blog post, and also in BDA3, you say that you don’t really like Bayes factors, as having a lump of probability at 0 doesn’t really make any sense. I agree with that, however in this case I don’t really want to make statements about the value of the parameter, just how confident we are that something is going on. Could Bayes factors here be ok?

– I wonder if an alternative approach would be to decide on a minimum interesting effect size, then ask which genes have the highest posterior probability of an effect size larger than this.

– I think elsewhere you have suggested addressing multiple testing-type problems by putting everything into one large hierarchical model. I have over 10k genes, and a few hundred samples, so this doesn’t seem practical (at least not for me!).

– In my analysis, I have calculated Bayes factors (for “full model” vs “only covariates”), but also tried ranking by mean(case_vs_control) / sd(case_vs_control). The Bayes factors seem to have some advantages, e.g. they are especially low for genes that are known to have strong sex effects, and sex is one of the covariates.

My reply: I know very little about genomics–ok, actually I know nothing at all about genomics, although I’ve had the occasional conversation of this sort where it seems that I’ve been able to offer helpful advice. So, rather than attempt any sort of solution, I will just share some scattered thoughts:

1. Why do you want to rank the genes? What are you going to do with the ranking?

My general thought is that, even if you’re doing ranking, there are lots of ways to do this, and it’s hard for me to think of a realistic example in which tail-area probabilities are the right way to do the ranking–see discussion here–except in some very simple symmetric problems where all analyses lead to the same result.

2. My problem with the Bayes factor is not so much with the lump of probability at zero, but rather with the dependence on the prior distribution for the unconstrained parameters in the model. Take K well-identified parameters on unit scale and change their weak prior from independent normal(0,100) to independent normal(0,1000), and you’ve decreased the model’s Bayes factor by a factor of 10^K without changing the inferences conditional on the model. We discuss this sort of thing further in Chapter 7 of BDA3 (chapter 6 of the earlier editions). When comparing models, you can establish conventions with weak priors so as to obtain stable Bayes factors, but that’s what these are–conventions–and there’s reason to expect the resulting Bayes factors to make much sense (that’s the point that I made in my 1995 article with Rubin on the topic), so you’ll just have to evaluate these as one more possible decision rule with no clearly-defined theoretical basis.

3. You’re right that I have suggested addressing multiple testing-type problems by putting everything into one large hierarchical model! Here’s my paper with Jennifer and Masanao explaining this, published in the Journal of Research on Educational Effectiveness in 2012.

4. You write that you over 10k genes, and a few hundred samples, so this doesn’t seem practical. But it is practical! We routinely fit multilevel regressions in Stan with hundreds of thousands of data points and thousands of predictors. Try it and see how it goes! If it’s slow, then my recommendation is to set the group-level variance parameters to reasonable values, then it’s just a simple regression model with a prior, and you can fit it fast using an optimizer, just a simple mode-finder or you can use ADVI or Pathfinder if you’d like.

5. I can see that ranking by mean(case_vs_control) / sd(case_vs_control) might not work so well because the sd can be noisy. You can probably do better by fitting a hierarchical model to the sd’s.

6. Again speaking generally, you should be able to use simulated-data experimentation to get a sense of how any of these methods is working. Simulated-data experimentation has two advantages here: first, by construction you know the true values of all the parameters so you can directly compare the performance of different methods; second, the very act of setting up the simulation forces you to figure out exactly what you are trying to learn, which brings us back to my item #1 above.

Survey Statistics: Longitudinal/panel data

Longitudinal” or “panel” data includes some repeated surveys of the same people over time. Sharon Lohr’s book describes examples such as the CPS:

Once a household is recruited to be in a panel survey such as the U.S. Current Population Survey, which measures characteristics related to employment, it stays in the survey for subsequent interviews.

The new-in-the-third-edition chapter on nonprobability samples also describes online web panels:

An online panel is a group of persons who have been recruited by a survey organization for the purpose of taking surveys. Then, when a topic comes up for which polling is desired, the organization asks a sample of persons from the panel who meet the survey criteria to take the poll.

Sampling Design and Analysis: Third Edition — Sharon Lohr

The panel data structure can be incorporated into models in various ways. For example, a hierarchical (i.e. multilevel) model can include an unobserved person-level effect. As Gelman et al. write in BDA p.21:

In general, we prefer to model complexity with a hierarchical structure using additional variables rather than with complicated marginal distributions, even when the additional variables are unobserved or even unobservable; this theme underlies mixture models…

But as Antonelli et al. (2016) and Fitzmaurice et al. p.39 write:

For linear mixed-effect models … deviations from the normality assumptions for the random effects have very little impact on the estimation of the fixed-effects parameters…For non-linear and generalized linear mixed models, misspecification of the random-effects distribution can lead to seriously biased estimates for the fixed-effects parameters.

Longitudinal Data Analysis : Fitzmaurice, Garrett, Davidian, Marie, Verbeke, Geert, Molenberghs, Geert: Amazon.nl: Boeken

In much of survey research, variables of interest are discrete, e.g. vote choice. So we would fit non-linear models, where the cautions above apply. How do most survey practitioners analyze longitudinal binary data ?

(I asked about this in stan discourse back in 2022 but could use more guidance.)

A bunch of readings and a new book on Bayesian meta-analysis

Robert Grant writes:

My Bayesian Meta-Analysis book is now available. There is a discount code “BMA25” for 25% off until 6 September.

The website also has code (of all software options), examples and discussion.

There’s a lot yet to be settled in this topic, which we only just touch on briefly in a closing chapter, for example with Bayesian MAs of Bayesian studies, or RCTs borrowing external control information, or Gaussian processes, etc etc. Even small-m and small-p tasks in this space can be highly correlated and benefit from Stan.

I haven’t seen this new book, but the topic is important!

Here are some things my collaborators and I have written on Bayesian meta-analysis:

– Section 5.6 of Bayesian Data Analysis

Meta-analysis with a single study

Water Treatment and Child Mortality: A Meta-analysis and Cost-effectiveness Analysis

Priors for hyperparameters in meta-analysis

Exploring some questions about meta-analysis (using ivermectin as an example), with R and Stan code

The real problem of that nudge meta-analysis is not that it includes 12 papers by noted fraudsters; it’s the GIGO of it all

No, I don’t like talk of false positive false negative etc but it can still be useful to warn people about systematic biases in meta-analysis

The p-curve, p-uniform, and Hedges (1984) methods for meta-analysis under selection bias: An exchange with Blake McShane, Uri Simonsohn, and Marcel van Assen

Constructing an informative prior using meta-analysis

Meta-analysis, game theory, and incentives to do replicable research

Using Bayesian meta-analysis to adjust for bias in experiments and observational studies

All these free readings, along with that 25%-off book, should give keep you busy for awhile!

The Dodgers are hiring

Richard Anderson writes:

I manage the data science team with the Los Angeles Dodgers and have a job post that may be of interest to students or readers of your blog. If you know anyone who may be interested, they are welcome to reach out to me directly.

From the job description:

We are especially interested in candidates with either (1) demonstrated strength in deep learning with applications to spatiotemporal data or (2) demonstrated strength in Bayesian hierarchical modeling & probabilistic forecasting.

And, in the list of areas of preferred expertise:

You have written probabilistic models in NumPyro, PyMC, or Stan with custom likelihoods and priors.

I can only assume that Leo Durocher would be spinning in his grave. “Quantitative Analysis,” indeed!

Survey Statistics: Sparsified MRP

This post has 3 questions for you below.

Multilevel Regression and Poststratification (MRP) aims to address nonresponse. Suppose we want to estimate E[Y], the population mean. But we only have Y for respondents. For example, suppose Y is voting Republican. And what if respondents are more or less Republican than the population ? If we have population data on X, e.g. a bunch of demographic variables, then we can estimate E[Y|X] and aggregate: E[Y] = E[E[Y|X]]. So if our sample has the wrong distribution of X, at least we fix that with some calibration.

With nonresponse worsening, we want to adjust for a lot of covariates X (including their interactions !). Estimates from such big models will be unstable without a lot of data and/or regularization.

As Andrew writes about MRP and RPP:

5. The multilevel part of MRP comes because you want to adjust for lots of cells j in your poststrat

But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any regularized prediction method that gives reasonable and stable estimates while including a potentially large number of predictors.

Hence, regularized prediction and poststratification. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.

As Andrew writes about the lasso:

Lasso (“least absolute shrinkage and selection operator”) is a regularization procedure that shrinks regression coefficients toward zero…Somehow it took the Stanford school of open-minded non-Bayesians to regularize in the way that Bayesians always could—but didn’t…I was fitting uncontrolled regressions with lots of predictors and not knowing what to do…

“Over the past few months I [Tibshirani] have been consumed with a new piece of work [with Richard Lockhart, Jonathan Taylor, and Ryan Tibshirani] that’s finally done. Here’s the paper, the slides (easier to read), and the R package.”

QUESTION 1: These links are broken. [They’re no longer broken! I found the correct links and went in and updated them. — AG]

As Andrew writes about the “bet on sparsity principle“:

sparse models can be faster to compute, easier to understand, and yield more stable inferences.

Though this “easier to understand” may be kind of fake, as Andrew says:

lasso (or alternatives such as horseshoe) are fine, but I don’t think they really give you a more interpretable model. Or, I should say, yes, they give you an interpretable model, but the interpretability is kinda fake, because had you seen slightly different data, you’d get a different model. Interpretability is bought at the price of noise—not in the prediction, but in the chosen model.

And:

In most settings I actually find it difficult to directly interpret more than one coefficient in a regression model.

Same.

QUESTION 2: Ok, so forgetting about the interpretability stuff, let’s say I am doing a giant Regularized Prediction and Poststratification (RPP). Do you know of papers doing RPP with sparsifying priors ?

Implementation time.

Starting with the lasso: Tibshirani’s selectiveInference R package gives a logistic example here (our Y is binary too). I extended their example using brms and rstanarm with Laplace priors with scale 1/lambda, see ESL p.72:

mod_rstanarm <- stan_glm(
formula = formula_all,
data = data,
family = binomial(),
prior = laplace(0, 1.25)
)

Here are my results comparing the point estimates:

QUESTION 3: why does Bayesian inference with Laplace priors (and the same likelihood and data) give posterior means more similar to selectiveInference than lasso ? is the above expected behavior or do you suspect my code has a bug ?

(I asked about this here, in Andrew’s post about a post-selection inference question from Richard Artner.)

Moving away from lasso to other sparsifying priors, Piironen and Vehtari (2017) conclude:

the (regularized) horseshoe consistently outperforms Lasso in terms of predictive accuracy…A clear advantage for Lasso, on the other hand, is that it is hugely faster

brms even stopped supporting lasso in 2023.

Michael Betancourt showed struggles with the horseshoe, concluding:

In the end while these modifications of the horseshoe population model can help they typically don’t achieve any better performance in practice than a Cauchy population model or the binary mixture normal population model, both of which are much easier to configure and accurately fit out of the box. Consequently it’s hard to argue for the practical utility of the horseshoe model.

So back to QUESTION 2: which sparsifying priors should we use for RPP ?

Survey Statistics: Poststratification ?

Lots is written on this blog about “Poststratification”. Andrew addresses it formally with a “Mister“. But when I learned it from Alan Zaslavsky’s course it was casually just “Poststratification”. At the time it sounded to me like damage control after we forgot to stratify.

  • Stratification” so that your sample has the same distribution for variables X as in the population: divide the population into strata (i.e. groups) based on X. Then take a sample in each stratum proportional to its size.
  • Post“: divide the population into strata only after the sample is already selected.

(As Raphael Nishimura and Andrew Gelman pointed out in the comments below, stratification has other uses !)

Fancy graphics from a DOL video I worked on:

How can Poststratification help ?

Suppose we want to estimate E[Y], the population mean. But we only have Y in the survey sample. For example, suppose Y is voting Republican. We can use the sample mean, ybar = Ehat[Y | sample] (I don’t know how to LaTeX on this blog).

But our sample mean is conditional on being sampled. And what if survey-takers are more or less Republican than the population ? As Joe Blitzstein teaches us: “Conditioning is the soul of statistics.” Conditioning on being sampled might bias our estimate. But maybe more conditioning can also somehow help us ?! Joe taught me to try conditioning whenever I get stuck.

If we have population data on X, e.g. racial group, then we can estimate Republican vote share conditional on racial group E[Y|X] and aggregate according to the known distribution of racial groups, invoking the law of total expectation (Joe’s favorite): E[Y] = E[E[Y|X]]. So if our sample has the wrong distribution of racial groups, at least we fix that with some calibration. Replacing “E” with estimates “Ehat”, poststratification estimates E[Y] with E[Ehat[Y | X, sample]].

When our estimate of E[Y|X] is the sample mean of Y for folks with that X, the aggregate estimate is classical poststratification, yhat_PS. When our estimate of E[Y|X] is based on a model that regularizes across X, the aggregate estimate is Multilevel Regression (“Mister“) and Poststratification, yhat_MRP. Gelman 2007 shows how yhat_MRP is a shrinkage of yhat_PS towards ybar.

Which estimate is best for estimating E[Y] ? ybar, yhat_PS, or yhat_MRP ?

To answer this, I’d want to compare Loss(E[Y], yhat), Loss(E[Y], yhat_PS), and Loss(E[Y], yhat_MRP) for some population-level loss. This differs from the typical machine learning individual-level losses Loss(y_i, ybar), Loss(y_i, yhat_PS_i), and Loss(y_i, yhat_MRP_i).

As Kuh et al 2023 write:

it is not individual predictions that need to be good, but rather the aggregations of these individual estimates.

Gelman 2007 ends with “Where to go next”:

A parallel approach is through simulation studies—for greater realism, these can often be constructed using subsamples of actual surveys—as well as theoretical studies of the bias and variance of poststratified estimates with moderate sample sizes.

I found 3 papers that have gone there, but I’d like help finding more.
Holt & Smith 1979 compare population-level Loss(E[Y], ybar) to Loss(E[Y], yhat_PS) in a simulation study. They do not include MRP in the simulation. They find that neither is uniformly best, but poststratification is usually much better.
Wang & Gelman 2014 compare individual-level Loss(y_i, yhat_PS_i) to Loss(y_i, yhat_MRP_i) using cross-validation holding out y_i. They show that MRP does best, but is nearly indistinguishable from complete pooling of interactions (something closer to ybar, complete pooling of everything):
Kuh et al 2023 compare population-level loss to individual-level loss in a simulation study. They caution that these losses may order models differently ! Choose your diagnostics carefully. They only consider MRP.
So I have not yet found the comparison I want: population-level loss for unweighted ybar, classical poststratification, and MRP. I think adding MRP to the Holt & Smith 1979 simulation would be interesting ? Can someone do this (my birthday is in October) ?
Do any MRP papers discuss standard errors theoretically ? Gelman 2007 only discusses standard errors for models with noninformative priors (see below). I also think the formulas here have typos ?

Unfair to Galton

In my post the other day about Monsters, I wrote about “scientists who held political views that you might now call odious, such as Francis Galton’s racism (which, like Laura Ingalls Wilder’s views, were close to the core of his statistical work) or J. B. S. Haldane’s communism (which seems more peripheral to his contributions to biology, although I expect that Haldane himself saw some connections there).”

A colleague who knows more about Galton than I do argued that the core of Galton’s statistical work had nothing to do with eugenics, even by Galton’s definition. It had been my impression that even when Galton was writing about heights of siblings or whatever, that eugenics was not far from his concerns, but according to my colleague, Galton’s work on correlation was originally motivated by trying to understand the paradoxes of evolution (not “eugenics,” particularly) and “when he had his breakthrough he saw that it was not heredity at all – it was a general statistical phenomenon.” My colleague continued, “Galton’s 1st book can be called eugenic – it said talent runs in families. But it had no effect. Why? Because everybody ‘knew’ that. He got no response and he moved on – he didn’t change his views but he moved on to other questions with no eugenical side.”

From a modern perspective, I can’t see how you can avoid labeling Galton as racist. For example, he charmingly wrote:

Visitors to Ireland after the potato famine generally remarked that the Irish type of face seemed to have become more prognathous–that is more like the Negro in the protrusion of the lower jaw. The interpretation of that which was that the men who survived the Starvation and other deadly accidents of that horrible time were generally of low and coarse organization.

Talk about adding insult to injury! Starve a country and then say that the survivors have been selected to be “low and coarse.”

Also this:

Average negroes possess too little intellect, self-reliance, and self-control to make it possible for them to sustain the burden of any respectable form of civilization without a large measure of external guidance and support.

And:

The Hindoo cannot fulfil the required conditions nearly as well as the Chinaman, for he is inferior to him in strength, industry, aptitude for saving, business habits, and prolific power. The Arab is little more than an eater up of other men’s produce; he is a destroyer rather than a creator, and he is unprolific.

Lots of people back in the 1800s wrote like that, and I don’t know enough about that period of history to assess the ways in which Galton was more or less racist that other educated Englishmen of his time. Perhaps one thing that disturbs me about Galton is that this was not just casual racism, just some guy in the pub making Irish jokes, but rather his carefully-thought-out views. But this is different from my colleague’s point about Galton’s statistical work, which is that it moved away from concerns about evolution and heredity and toward more general mathematical understanding of regression and correlation. It should be possible for me to be bothered by Galton’s racial views and to be bothered by their political implications, and to be interested in the connection between racist attitudes and the history of statistics, while also recognizing the development of ideas of correlation and regression on their own terms.

Also, to get back to the racism, the notorious Galton quotations above are from a lot of writing that he produced during his life. These remarks might well represent something close to his max level of racism rather than the mean or median. And such beliefs did not necessarily get in the way of his scientific investigations. For example, my colleague informed me, “When Galton first looked at fingerprints he also looked for a possible difference between African and English prints – were the first group less evolved (simpler) than the second? He could see no such difference. He once wrote (in 1863), ‘Exercising the right of occasional suppression and slight modification, it is truly absurd to see how plastic a limited number of observations become, in the hands of men with preconceived ideas.’ Evidently Galton had no such preconceived idea.” In some sense his rejection of a racial idea in this case was even more impressive, if he indeed came into the analysis expecting to see such a pattern.

Again, no need to single out Galton; he gets more of our attention because of the importance of his contributions to statistics. Casual racial thinking comes up all the time in the history of quantitative social and biological sciences. For example, in his 1957 book, Probability, Statistics, and Truth, Richard Von Mises attempted to explain an underdispersion in the monthly rates of girl and boy births as being caused by different sex ratios among different racial or socio-economic groups; see pages 84-85 of this article. Mises does not present a rigorous argument, and if you try to look at it carefully, the math breaks down; my point is just that, when coming up with an apparently unexpected pattern in social data, he reached for a racial explanation. The question is not whether he was a “racist” in whatever terms might be used–according to wikipedia, he left Germany after the Nazis took power–just that racial thinking was in the air.

As we’ve discussed in other contexts, the point of this discussion is not to characterize Galton as “good” or “bad” but rather to better understand his statistical and his racial views in context. History is important.

Stanford Human Trafficking Data Lab Hiring for a Full-time Postdoctoral Scholar

Ben Seiler sends along this opportunity:

The Stanford Human Trafficking Data Lab is accepting applications for a postdoctoral fellowship position to join a project investigating trafficking risks in charcoal supply chains in Brazil. The position is open to recent graduates of PhD programs in statistics, economics, computer science, operations research, or related data science fields. The position provides opportunities to participate in rigorous, quantitative research on human trafficking, including supply chain network analysis and geospatial modeling. The successful candidate will have strong data science skills, including experience working with large, complex data from varied sources, and machine learning methodologies. The underlying data are complex and will require sophisticated data management and integration skills. A candidate should have proficiency with GIS software and Python, strong written and interpersonal communication skills, and a demonstrated interest in addressing social justice issues through data-driven research. The postdoc will work in partnership with PI Grant Miller (Stanford University School of Medicine) and other research team members, and will contribute to study design, participate in field research, conduct data analysis, and disseminate findings through academic publications and presentations. The postdoctoral fellow will be expected to focus mainly on this project, but may spend up to 20% of their time on independent research. For more on the Lab’s other ongoing projects see https://htdatalab.stanford.edu/projects/. The postdoc will be based at Stanford University in the Department of Health Policy, located near to the Departments of Economics and Political Science and the Graduate School of Business.

TO APPLY: Please email a single PDF document named
“Lastname_Firstname_HTDL_Postdoc_2025” containing the following materials to Lydia Aletrais at [email protected]:

1. A cover letter describing your interest in this position, your relevant training and experience, and our earliest and preferred start dates
2. A current CV
3. A transcript (unofficial is fine)
4. Names, e-mail addresses, and phone numbers of 2-3 references.

Applications will be considered on a rolling basis. Short-listed applicants will be asked to complete a technical exercise and may be called for an interview.

Given that they’re doing network analysis and geospatial modeling, I expect that knowledge of Stan would be useful too.

Don’t Hold Out On Me: Some thoughts on out-of-sample prediction

Ryan Socha writes:

Typically in statistical modeling, we validate or test models by checking how well they perform on a holdout dataset. However, conceptually, it seems possible that we could do essentially the same job by instead keeping the training dataset fixed and using one or more “holdout evaluation functions” that differ from the training objective.

Finding such functions is probably more difficult than finding extra data in most cases, but I am curious if there are any works that come to mind when you think of “holdout evaluation functions” as a means of assessing model performance in ways that can’t be Goodharted.

Beyond this, there’s a more general question – to what extent are holdout datasets fungible with holdout evaluation functions? Are there cases where having access to additional data is inherently superior to any alternate way of evaluating a model’s performance on the given data? Or, are there cases where no amount of additional holdout data can detect there’s some flaw in what the model’s doing?

I am particularly curious whether there might be a procedure that can convert from holdout datasets to holdout evaluation functions, or the other way around. Some caution is probably required to make sure that any alternate evaluation functions constructed to replace a holdout dataset do not “secretly contain” copies of the holdout data – but maybe that kind of smuggling is required for any such procedure to be possible in the first place?

A natural case where this kind of thing might be useful: suppose we want a model to generalize to a certain distribution that’s out of distribution from the training data, but no actual instances of that distribution yet exist. In cases like this, it seems like our only option for validating an approach is to make sure that the way we’re evaluating it is well-suited for the high-level properties we expect the new distribution to have. Although this seems to lose some of the flavor of the evaluation functions needing to be holdout functions, so perhaps it is not the best example after all and this is only of theoretical interest.

This is far beyond my math chops, all thoughts or comments appreciated. Feel free to put this and any response you might have on your blog if you think it would be of interest. As always, even just telling me you think this is a bad line of thought would be a welcome reply.

My reply: Over the years we’ve thought a lot about cross validation and external validation. Two key papers are:

[2014] Understanding predictive information criteria for Bayesian models. {\em Statistics and Computing} {\bf 24}, 997–1016. (Andrew Gelman, Jessica Hwang, and Aki Vehtari)

[2017] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. {\em Statistics and Computing} {\bf 27}, 1413–1432. (Aki Vehtari, Andrew Gelman, and Jonah Gabry)

In writing these, we followed the general trend in machine learning, also expressed in the your question, moving away from looking at “information criteria” as ways to evaluate or compare models and toward a more direct interpretation of leave-one-out cross validation as an estimate of what would be obtained under external validation using new cases.

Some interesting issues arise:

1. As noted in your question, predictive performance on new data can depend on the scenario. The further the new data are in predictor space from the training data, the more uncertain and the less accurate you will expect the predictions to be. This suggests, first that any evaluation of a model’s predictive performance should require some specification of where the evaluation will be done, and second that cross validation can be studied by looking not just at average predictive performance but also on how the predictive accuracy depends on predictors in the model, as in this graph from our 2017 paper:

2. The choice of where to evaluate predictions, and how to average over these to get a summary measure of predictive accuracy, reminds me a lot of . . . poststratification! When doing this, it’s important to include predictors that capture the differences between the training and test data, and it can make sense to use multilevel modeling to facilitate predictions for new scenarios. Prediction goals can inform both design and analysis. For example, if you’re doing a study now with the goal of making inferences for future effects, then time could be a factor, and it would make sense to spread your study (your training data) across some period of time, which will then give you some leverage to estimate time trends when fitting your model. Realistically, though, inferences might still strongly depend on priors, for example if you have only one week to conduct your experiment but you want to estimate effects for a year going forward.

3. The other thing this reminds me of is the technique we’ve been using a lot lately, of poststratifying a survey on itself. That is, taking the data, fitting a model, and then creating an aggregate inference by averaging the model predictions over a hypothetical new population that’s exactly the same in its predictors as the data used to fit the model. This might seem to be a very circular thing to do, but it can be useful for two reasons:

The first reason that poststratifying a survey on itself is not an empty identity mapping is that the process of fitting the model can be thought of as a smoothing of the data, so MRP where poststrat is on the sample itself can be thought of as a three step procedure: (i) transform the data, (ii) apply a smoother on the transformed space (which is kinda what the multilevel model or Bayesian inferences is doing, where the transformed space is the space of parameters and thus step (i) of inference can be seen as an inversion of the assumed data-generating process), (iii) reverse-transform back to data space. Even if steps (i) and (iii) are inverses of each other, you get something from step (ii). Just like how you can flip two pairs of edges on a Rubik’s cube by first twisting to line up the edges in the right place, then applying the operator that does what you want, then reversing your original set of twists.

The second advantage of poststratifying a survey on itself is that you can use this as a diagnostic tool to understand what’s happening in an MRP situation. MRP does two things: it adjusts for different distributions of predictors in sample and population, and it does smoothing for small-area estimation. By poststratifying a survey on itself, you can isolate these two things and just see what the smoothing is doing, then you can poststratify on the population of interest and see the predictive effect of imbalance in predictor distributions between sample and population. We used this technique recently in an example of a survey with measurement error where we were getting results we didn’t understand. The mysterious patterns happened even when we were poststratifying the survey on itself, so we knew that it was not a problem with an unrepresentative sample; it was our Bayesian model that was to blame.

4. A few years ago we discussed the pervasive twoishness of statistics: the way that classical statistical inference, Bayesian statistics, and bootstrapping all have an incoherence in which two models coexist, with no requirement that the two models be part of a single consistent system–and how that’s actually a good thing. Cross validation, or external validation, has the same property: there’s the model used to fit the data, and the assumed population. Mathematically we can write these as p(y|theta,x) and p(x_new).