A different Bayesian World Cup model using Stan (opportunity for model checking and improvement)

Maurits Evers writes:

Inspired by your posts on using Stan for analysing football World Cup data here and here, as well as the follow-up here, I had some fun using your model in Stan to predict outcomes for this year’s football WC in Qatar. Here’s the summary on Netlify. Links to the code repo on Bitbucket are given on the website.

Your readers might be interested in comparing model/data/assumptions/results with those from Leonardo Egidi’s recent posts here and here.

Enjoy, soccerheads!

P.S. See comments below. Evers’s model makes some highly implausible predictions and on its face seems like it should not be taken seriously. From the statistical perspective, the challenge is to follow the trail of breadcrumbs and figure out where the problems in the model came from. Are they from bad data? A bug in the code? Or perhaps a flaw in the model so that the data were not used in the way that was intended? One of the great things about generative models is that they can be used to make lots and lots of predictions, and this can help us learn where we have gone wrong. I’ve added a parenthetical to the title of this post to emphasize this point. Also good to be reminded that just cos a method uses Bayesian inference, that doesn’t mean that its predictions make any sense! The output is only as good as its input and how that input is processed.

Update 2 – World Cup Qatar 2022 Predictions with footBayes/Stan

Time to update our World Cup 2022 model!

The DIBP (diagonal-inflated bivariate Poisson) model performed very well in the first match-day of the group stage in terms of predictive accuracy – consider that the ‘pseudo R-squared’, namely the geometric mean of the probabilities assigned by the model to the ‘true’ final match results, is about 0.4, whereas, on average, the main bookmakers got 0.36.
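To make the ‘pseudo R-squared’ concrete, here is a minimal Python sketch of the geometric mean described above. The function name and the three example probabilities are made up for illustration; they are not the actual model or bookmaker numbers.

```python
import math


def pseudo_r_squared(probs_of_observed):
    """Geometric mean of the probabilities assigned to the results that occurred."""
    if not probs_of_observed:
        raise ValueError("need at least one match")
    if any(p <= 0.0 or p > 1.0 for p in probs_of_observed):
        raise ValueError("probabilities must lie in (0, 1]")
    # Average the logs, then exponentiate: numerically safer than a long product.
    mean_log = sum(math.log(p) for p in probs_of_observed) / len(probs_of_observed)
    return math.exp(mean_log)


# Hypothetical probabilities given to the observed result of three matches.
example_probs = [0.5, 0.4, 0.32]
print(round(pseudo_r_squared(example_probs), 3))  # prints 0.4
```

A higher value means the forecaster put more probability on what actually happened; a single confident miss (a probability near zero) drags the whole score down.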

It’s now time to re-fit the model after the first 16 group stage games with the footBayes R package and obtain the probabilistic predictions for the second match-day. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage played from November 25th to November 28th, along with some ppd ‘chessboard plots’ for the exact outcomes in gray-scale color – ‘mlo’ in the table denotes the ‘most likely result’, whereas darker regions in the plots correspond to more likely results.

Plot/table updates: (see Andrew’s suggestions from the previous post; we’re still developing these plots to improve their appearance, see some more notes below). In the plots below, the first team listed in each sub-title is the ‘favorite’ (x-axis), whereas the second team is the ‘underdog’ (y-axis). The 2-way grid displays the 16 held-out matches in such a way that closer matches appear at the top-left of the grid, whereas more unbalanced matches (‘blowouts’) appear at the bottom-right. The matches are then ordered from top-left to bottom-right in terms of increasing winning probability for the favorite teams. The table instead lists the matches in chronological order.

The most unbalanced game seems to be Brazil-Switzerland, where Brazil is the favorite with an associated winning probability of about 71%. The closest game seems to be Iran-Wales – Iran just won by a two-goal margin, with both goals scored in the last ten minutes! – whereas France is given only a 44% probability of winning against Denmark. Argentina seems to be ahead against Mexico, whereas Spain seems to have a non-negligible advantage in the match against Germany.

Another predictive note: Regarding ‘most-likely-outcomes’ (mlo here above), the model ‘guessed’ 4 ‘mlo’ out of 16 in the previous match-day.

You can find the complete results, R code and analysis here.

Some more technical notes/suggestions about the table and the plots above:

  • We replaced ‘home’ and ‘away’ by ‘favorite’ and ‘underdog’.
  • I find it difficult to handle ‘xlab’ and ‘ylab’ in faceted plots with ggplot2! (A better solution could in fact be to put the team names directly on each of the axes of the sub-plots.)
  • The occurrence ‘4’ actually stands for ‘4+’, meaning that it captures the probability of scoring ‘4 or more goals’ (I did not like the tick label ‘4+’ in the plot, so we just set ‘4’; however, we could improve this).
  • We could consider adding some global ‘x’ and ‘y’-axes with probability margins between underdog and favorite. Thus, for Brazil-Switzerland, we should have a tick on the x-axis at approximately 62%, whereas for Iran-Wales at 5%.
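To illustrate the ‘4+’ bucket and the probability margin mentioned in the notes above, here is a small Python sketch. It assumes independent Poisson scores with made-up scoring rates, which is a simplification and not the DIBP model used for the plots; all function names are hypothetical.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson(rate) distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def capped_marginal(rate, cap=4):
    """Probabilities of 0, 1, ..., cap-1 goals, plus one bucket for 'cap or more'."""
    probs = [poisson_pmf(k, rate) for k in range(cap)]
    probs.append(1.0 - sum(probs))  # the '4+' bucket when cap=4
    return probs


def chessboard(rate_fav, rate_dog, cap=4):
    """Grid of score probabilities; row = favorite goals, column = underdog goals."""
    fav = capped_marginal(rate_fav, cap)
    dog = capped_marginal(rate_dog, cap)
    return [[pf * pd for pd in dog] for pf in fav]


def win_margin(rate_fav, rate_dog, max_goals=25):
    """P(favorite wins) - P(underdog wins), summed over an uncapped score grid."""
    p_fav = p_dog = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, rate_fav) * poisson_pmf(j, rate_dog)
            if i > j:
                p_fav += p
            elif j > i:
                p_dog += p
    return p_fav - p_dog


grid = chessboard(2.0, 0.8)  # hypothetical rates for a lopsided match
print(len(grid), len(grid[0]))  # prints 5 5
```

Note that the margin is computed from the uncapped scores: once both teams land in the ‘4+’ bucket, the capped grid can no longer tell who won.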

For other technical notes and model limitations check the previous post.

Next steps: we are going to update the predictions for the third match-day and even compute some World Cup winning probabilities through a forward simulation of the whole tournament.

Stay tuned!

Football World Cup 2022 Predictions with footBayes/Stan

It’s time for football (aka soccer) World Cup Qatar 2022 and statistical predictions!

This year my collaborator Vasilis Palaskas and I implemented a diagonal-inflated bivariate Poisson model for the scores through our `footBayes` R CRAN package (depending on the `rstan` package), using as a training set more than 3000 international matches played over the period 2018-2022. The model incorporates some dynamic-autoregressive priors for the team-specific attack and defense abilities, and the Coca-Cola/FIFA ranking difference as the only predictor. The model, first proposed by Karlis & Ntzoufras in 2003, extends the usual bivariate Poisson model by inflating the probability of draws. Weakly informative prior distributions for the remaining parameters are assumed, whereas sum-to-zero constraints for attack/defense abilities are considered to achieve model identifiability. Previous World Cup and Euro Cup models posted in this blog can be found here, here and here.

Here is the new model for the pair of scores (X, Y) of a soccer match. In brief:
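For readers who want something concrete, here is a minimal Python sketch of the general diagonal-inflated bivariate Poisson form from Karlis & Ntzoufras: with probability 1 - p the scores come from a bivariate Poisson, and with probability p the match is a draw whose score comes from a separate discrete distribution. This is an illustration only, not the footBayes implementation: the function names, the fixed rates, and the `diag_pmf` dictionary are assumptions, and the fitted model ties the rates to team abilities and the ranking difference.

```python
import math


def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """P(X=x, Y=y) where X = W1 + W3, Y = W2 + W3 and W_i ~ Poisson(lam_i), independent."""
    total = 0.0
    for k in range(min(x, y) + 1):  # k = value of the shared component W3
        total += (lam1 ** (x - k) * lam2 ** (y - k) * lam3 ** k) / (
            math.factorial(x - k) * math.factorial(y - k) * math.factorial(k)
        )
    return math.exp(-(lam1 + lam2 + lam3)) * total


def dibp_pmf(x, y, lam1, lam2, lam3, p, diag_pmf):
    """Diagonal-inflated version: extra mass p spread over draws according to diag_pmf."""
    prob = (1.0 - p) * bivariate_poisson_pmf(x, y, lam1, lam2, lam3)
    if x == y:
        prob += p * diag_pmf.get(x, 0.0)
    return prob


# Hypothetical parameter values, chosen only to show the effect of inflation.
diag = {0: 0.3, 1: 0.5, 2: 0.2}  # distribution of the score in an 'inflated' draw
draw_plain = sum(dibp_pmf(k, k, 1.4, 1.1, 0.1, 0.0, diag) for k in range(30))
draw_inflated = sum(dibp_pmf(k, k, 1.4, 1.1, 0.1, 0.1, diag) for k in range(30))
print(draw_inflated > draw_plain)  # prints True
```

Setting p = 0 recovers the ordinary bivariate Poisson, so the inflation parameter can be read as the extra share of matches that the plain model would fail to call as draws.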

We fitted the model using HMC sampling, with 4 Markov chains of 2000 HMC iterations each, checking for convergence and effective sample sizes. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage, played from November 20th to November 24th, along with some ppd ‘chessboard plots’ for the exact outcomes in gray-scale color (‘mlo’ in the table denotes the ‘most likely result’, whereas darker regions in the plots correspond to more likely results):

The model gives the better teams higher chances in these first group stage matches:

  • In Portugal-Ghana, Portugal has an estimated winning probability of about 81%, whereas in Argentina-Saudi Arabia, Argentina has an estimated winning probability of about 72%. The match between England and Iran seems instead more balanced, and a similar trend is observed for Germany-Japan. USA is estimated to be ahead in the match against Wales, with a winning probability of about 47%.

Some technical notes and model limitations:

  • Keep in mind that ‘home’ and ‘away’ do not mean anything in particular here – the only home team is Qatar! – but they just refer to the first and the second team in each match. ‘mlo’ denotes the most likely exact outcome.
  • The posterior predictive probabilities are reported to three decimal places, which could sound a bit ‘bogus’… However, we transparently reported the ppd probabilities as those returned from our package computations.
  • One could use these probabilities for betting purposes, for instance by betting on that particular result – among home win, draw, or away win – for which the model probability exceeds the bookmaker-induced probability. However, we are not responsible for your money loss!
  • Why a diagonal-inflated bivariate Poisson model, and not other models? We developed some sensitivity checks in terms of leave-one-out CV on the training set to choose the best model. Furthermore, we also checked our model in terms of calibration measures and posterior predictive checks.
  • The model incorporates the (rescaled) FIFA ranking as the only predictor. Thus, we do not have many relevant covariates here.
  • We did not distinguish between friendly matches, World Cup qualifiers, Euro Cup qualifiers, etc. in the training data; rather, we consider all the data as coming from the same ‘population’ of matches. This assumption could hurt predictive performance.
  • We do not incorporate any player-level information in the model, and this could also represent a major limitation.
  • We’ll compute some predictive scores – Brier score, pseudo R-squared – to check the predictive power of the model.
  • We’ll re-fit this model after each stage, adding the previous matches to the training set and predicting the next matches.
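The betting rule mentioned in the notes above (back a result when the model probability exceeds the bookmaker-induced probability) can be sketched in a few lines of Python. The odds and model probabilities below are invented, and turning decimal odds into probabilities by normalizing their inverses is just one common convention for removing the bookmaker’s margin, assumed here for illustration.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities, rescaled so that they sum to one."""
    raw = {outcome: 1.0 / odds for outcome, odds in decimal_odds.items()}
    total = sum(raw.values())  # greater than 1 when the bookmaker takes a margin
    return {outcome: value / total for outcome, value in raw.items()}


def value_outcomes(model_probs, decimal_odds):
    """Outcomes where the model probability exceeds the bookmaker-induced one."""
    implied = implied_probabilities(decimal_odds)
    return [o for o in model_probs if model_probs[o] > implied[o]]


# Hypothetical match: bookmaker odds and model probabilities are made up.
odds = {"home": 1.9, "draw": 3.8, "away": 3.8}
model = {"home": 0.45, "draw": 0.31, "away": 0.24}
print(value_outcomes(model, odds))  # prints ['draw']
```

Here the normalized bookmaker probabilities are roughly 0.50, 0.25 and 0.25, so only the draw clears the bar. As the note says, nobody is responsible for your losses.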

This model is just an approximation for a very complex football tournament. Anyway, we strongly support scientific replication, and for this reason the reports, data, R and RMarkdown code can all be found here, on my personal web page. Feel free to play with the data and fit your own model!

And stay tuned for the next predictions in the blog. We’ll add some plots, tables and further considerations. Hopefully, we’ll improve predictive performance as the tournament proceeds.

Circling back to an old Bayesian “counterexample”

Hi everyone! It’s Dan again. It’s been a moment. I’ve been having a lovely six-month-long holiday as I transition from academia to industry (translation = I don’t have a job yet, but I’ve started to look). It’s been very peaceful. But sometimes I get bored and when I get bored and the weather is rubbish I write a blog post. I’ve got my own blog now where it’s easier to type maths so most of the things I write about aren’t immediately appropriate for this place.

But this one might be.

It’s on an old example that long-time readers may have come across before. The setup is pretty simple:

We have a categorical covariate x with a large number of levels J. We draw a sample of N data points by first sampling a value of x from a discrete uniform distribution on [1,…,J]. Once we have that, we draw a corresponding y from a normal distribution with a mean that depends on which category of x we drew.

Because the number of categories is very large, for a reasonably sized sample of data we will still have a lot of categories where there are no observations. This makes it impossible to estimate the conditional means for each category. But we can still estimate the overall mean of y.

Robins and Ritov (and Wasserman) queer the pitch by adding to each sample a random coin flip with a known probability (that differs for each level of x) and only reporting the value of y if that coin shows a head. This is a type of randomization that is pretty familiar in survey sampling. And the standard solution is also pretty familiar–the Horvitz-Thompson estimator is an unbiased estimator of the population mean.
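Here is a small simulation sketch of that setup in Python, using only the standard library. The specific choices are assumptions made for illustration: 100,000 levels, 20,000 draws, unit variance, and category means set equal to the observation probabilities so that a naive average of the observed values is visibly biased.

```python
import random


def simulate(num_levels=100_000, num_draws=20_000, seed=1):
    """Compare the Horvitz-Thompson estimate with the naive mean of observed y."""
    rng = random.Random(seed)
    # Known observation probability for each level, evenly spread over [0.1, 0.9].
    pi = [0.1 + 0.8 * j / (num_levels - 1) for j in range(num_levels)]
    theta = list(pi)  # assumed: category mean equals its observation probability
    true_mean = sum(theta) / num_levels

    ht_sum = 0.0
    observed_sum = 0.0
    observed_count = 0
    for _ in range(num_draws):
        j = rng.randrange(num_levels)   # x ~ discrete uniform over the levels
        y = rng.gauss(theta[j], 1.0)    # y | x ~ normal(theta[x], 1)
        if rng.random() < pi[j]:        # the coin: y is reported only on heads
            ht_sum += y / pi[j]         # weight each reported y by 1 / pi[x]
            observed_sum += y
            observed_count += 1

    return {
        "true_mean": true_mean,
        "horvitz_thompson": ht_sum / num_draws,  # divide by N, not by the number observed
        "naive_observed_mean": observed_sum / observed_count,
    }


result = simulate()
print(sorted(result))
```

With these settings most levels never appear in the sample, yet the weighted estimate should land close to the true mean of 0.5, while the naive mean drifts upward because levels with high means are also the ones most likely to be reported.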

All well and good so far. The thing that Robins, Ritov and Wasserman point out is that the Bayesian estimator will, in finite samples, often be massively biased unless the sampling probabilities are used when setting the priors. Here is Wasserman talking about it. And here is Andrew saying some smart things in response (back in 2012!).

I read this whole discussion back in the day and it never felt very satisfying to me. I was torn between my instinctive dislike of appeals to purity and my feeling that none of the Bayesian resolutions were very satisfying.

So ten years later I got bored (read: I had covid) and I decided to sketch out my solution using, essentially, MRP. And I think it came out a little bit interesting. Not in a this is surprising sense. Or even as a refutation of anything anyone else has written on this topic. But more it is an example that crystallizes the importance of taking the posterior seriously when you’re doing Bayesian modelling.

The resolution essentially finds the posterior for all of the mean parameters and then uses that as our new information about how the sample was generated. From this we can take our new joint distribution for the covariate, the data, and the ancillary coin and use it to estimate the average of an infinite sample. And, shock and horror, when we do that we get something that looks an awful lot like a Horvitz-Thompson estimator. But really, it’s just MRP.

If you’re interested in the resolution, the full post isn’t too long and is here. (Warning: contains some fruity language). I hope you enjoy.

History, historians, and causality

Through an old-fashioned pattern of web surfing of blogrolls (from here to here to here), I came across this post by Bret Devereaux on non-historians’ perceptions of academic history. Devereaux is responding to some particular remarks from economics journalist Noah Smith, but he also points to some more general issues, so these points seem worth discussing.

Also, I’d not previously encountered Smith’s writing on the study of history, but he recently interviewed me on the subjects of statistics and social science and science reform and causal inference so that made me curious to see what was up.

Here’s how Devereaux puts it:

Rather than focusing on converting the historical research of another field into data, historians deal directly with primary sources . . . rather than engaging in very expansive (mile wide, inch deep) studies aimed at teasing out general laws of society, historians focus very narrowly in both chronological and topical scope. It is not rare to see entire careers dedicated to the study of a single social institution in a single country for a relatively short time because that is frequently the level of granularity demanded when you are working with the actual source evidence ‘in the raw.’

Nevertheless as a discipline historians have always held that understanding the past is useful for understanding the present. . . . The epistemic foundation of these kinds of arguments is actually fairly simple: it rests on the notion that because humans remain relatively constant situations in the past that are similar to situations today may thus produce similar outcomes. . . . At the same time it comes with a caveat: historians avoid claiming strict predictability because our small-scale, granular studies direct so much of our attention to how contingent historical events are. Humans remain constant, but conditions, technology, culture, and a thousand other things do not. . . .

He continues:

I think it would be fair to say that historians – and this is a serious contrast with many social scientists – generally consider strong predictions of that sort impossible when applied to human affairs. Which is why, to the frustration of some, we tend to refuse to engage counter-factuals or grand narrative predictions.

And he then quotes a journalist, Matthew Yglesias, who wrote, “it’s remarkable — and honestly confusing to visitors from other fields — the extent to which historians resist explicit reasoning about causation and counterfactual analysis even while constantly saying things that clearly implicate these ideas.” Devereaux responds:

We tend to refuse to engage in counterfactual analysis because we look at the evidence and conclude that it cannot support the level of confidence we’d need to have. . . . historians are taught when making present-tense arguments to adopt a very limited kind of argument: Phenomenon A1 occurred before and it resulted in Result B, therefore as Phenomenon A2 occurs now, result B may happen. . . . The result is not a prediction but rather an acknowledgement of possibility; the historian does not offer a precise estimate of probability (in the Bayesian way) because they don’t think accurately calculating even that is possible – the ‘unknown unknowns’ (that is to say, contingent factors) overwhelm any system of assessing probability statistically.

This all makes sense to me. I just want to do one thing, which is to separate two ideas that I think are being conflated here:

1. Statistical analysis: generalizing from observed data to a larger population, a step that can arise in various settings including sampling, causal inference, prediction, and modeling of measurements.

2. Causal inference: making counterfactual statements about what would have happened, or could have happened, had some past decision been made differently, or making predictions about potential outcomes under different choices in some future decision.

Statistical analysis and causal inference are related but are not the same thing.

For example, if historians gather data on public records from some earlier period and then make inference about the distributions of people working at that time in different professions, that’s a statistical analysis but that does not involve causal inference.

From the other direction, historians can think about causal inference and use causal reasoning without formal statistical analysis or probabilistic modeling of data. Back before he became a joke and a cautionary tale of the paradox of influence, historian Niall Ferguson edited a fascinating book, Virtual History: Alternatives and Counterfactuals, a book of essays by historians on possible alternative courses of history, about which I wrote:

There have been and continue to be other books of this sort . . . but what makes the Ferguson book different is that he (and most of the other authors in his book) are fairly rigorous in only considering possible actions that the relevant historical personalities were actually considering. In the words of Ferguson’s introduction: “We shall consider as plausible or probable only those alternatives which we can show on the basis of contemporary evidence that contemporaries actually considered.”

I like this idea because it is a potentially rigorous extension of the now-standard “Rubin model” of causal inference.

As Ferguson puts it,

Firstly, it is a logical necessity when asking questions about causality to pose ‘but for’ questions, and to try to imagine what would have happened if our supposed cause had been absent.

And the extension to historical reasoning is not trivial, because it requires examination of actual historical records in order to assess which alternatives are historically reasonable. . . . to the best of their abilities, Ferguson et al. are not just telling stories; they are going through the documents and considering the possible other courses of action that had been considered during the historical events being considered. In addition to being cool, this is a rediscovery and extension of statistical ideas of causal inference to a new field of inquiry.

See also here. The point is that it was possible for Ferguson et al. to do formal causal reasoning, or at least consider the possibility of doing it, without performing statistical analysis (thus avoiding the concern that Devereaux raises about weak evidence in comparative historical studies).

Now let’s get back to Devereaux, who writes:

This historian’s approach [to avoid probabilistic reasoning about causality] holds significant advantages. By treating individual examples in something closer to the full complexity (in as much as the format will allow) rather than flattening them into data, they can offer context both to the past event and the current one. What elements of the past event – including elements that are difficult or even impossible to quantify – are like the current one? Which are unlike? How did it make people then feel and so how might it make me feel now? These are valid and useful questions which the historian’s approach can speak to, if not answer, and serve as good examples of how the quantitative or ’empirical’ approaches that Smith insists on are not, in fact, the sum of knowledge or required to make a useful and intellectually rigorous contribution to public debate.

That’s a good point. I still think that statistical analysis can be valuable, even with very speculative sampling and data models, but I agree that purely qualitative analysis is also an important part of how we learn from data. Again, this is orthogonal to the question of when we choose to engage in causal reasoning. There’s no reason for bad data to stop us from thinking causally; rather, the limitations in our data merely restrict the strengths of any causal conclusions we might draw.

The small-N problem

One other thing. Devereaux refers to the challenges of statistical inference: “we look at the evidence and conclude that it cannot support the level of confidence we’d need to have. . . .” That’s not just a problem with the field of history! It also arises in political science and economics, where we don’t have a lot of national elections or civil wars or depressions, so generalizations necessarily rely on strong assumptions. Even if you can produce a large dataset with thousands of elections or hundreds of wars or dozens of business cycles, any modeling will implicitly rely on some assumption of stability of a process over time, an assumption that won’t necessarily make sense given changes in political and economic systems.

So it’s not really history versus social sciences. Rather, I think of history as one of the social sciences (as in my book with Jeronimo from a few years back), and they all have this problem.

The controversy

After writing all the above, I clicked through the link and read the post by Smith that Devereaux was arguing against.

And here’s the funny thing. I found Devereaux’s post to be very reasonable. Then I read Smith’s post, and I found that to be very reasonable too.

The two guys are arguing against each other furiously, but I agree with both of them!

What gives?

As discussed above, I think Devereaux in his post provides an excellent discussion of the limits of historical inquiry. On the other side, I take the main message of Smith’s post to be that, to the extent that historians want to use their expertise to make claims about the possible effects of recent or new policies, they should think seriously about statistical inference issues. Smith doesn’t just criticize historians here; he leads off by criticizing academic economists:

After having endured several years of education in that field, I [Smith] was exasperated with the way unrealistic theories became conventional wisdom and even won Nobel prizes while refusing to submit themselves to rigorous empirical testing. . . . Though I never studied history, when I saw the way that some professional historians applied their academic knowledge to public commentary, I started to recognize some of the same problems I had encountered in macroeconomics. . . . This is not a blanket criticism of the history profession . . . All I am saying is that we ought to think about historians’ theories with the same empirically grounded skepticism with which we ought to regard the mathematized models of macroeconomics.

By saying that I found both Devereaux and Smith to be reasonable, I’m not claiming they have no disagreements. I think their main differences come because they’re focusing on two different things. Smith’s post is ultimately about public communication and the things that academics say in the public discourse (things like newspaper op-eds and twitter posts) with relevance to current political disputes. And, for that, we need to consider the steps, implicit or explicit, that commentators take to go from their expertise to the policy claims they make. Devereaux is mostly writing about academic historians in their professional roles. With rare exceptions, academic history is about getting the details right, and even popular books of history typically focus on what happened, and our uncertainty about what happened, not on larger theories.

I guess I do disagree with this statement from Smith:

The theories [from academic history] are given even more credence than macroeconomics even though they’re even less empirically testable. I spent years getting mad at macroeconomics for spinning theories that were politically influential and basically un-testable, then I discovered that theories about history are even more politically influential and even less testable.

Regarding the “less testable” part, I guess it depends on the theories. But, sure, many theories about what has happened in the past can be essentially impossible to test, if conditions have changed enough. That’s unavoidable. As Devereaux replies, this is not a problem with the study of history; it’s just the way things are.

But I can’t see how Smith could claim with a straight face that theories from academic history are “given more credence” and are “more politically influential” than macroeconomics. The president has a council of economic advisers, there are economists at all levels of the government, or if you want to talk about the news media there are economists such as Krugman, Summers, Stiglitz, etc. . . . sure, they don’t always get what they want when it comes to policy, but they’re quoted endlessly and given lots of credence. This is also the case in narrower areas, for example James Heckman on education policy or Angus Deaton on deaths of despair: these economists get tons of credence in the news media. There are no academic historians with that sort of influence. This has come up before: I’d say that economics now is comparable to Freudian psychology in the 1950s in its influence on our culture:

My best analogy to economics exceptionalism is Freudianism in the 1950s: Back then, Freudian psychiatrists were on the top of the world. Not only were they well paid, well respected, and secure in their theoretical foundations, they were also at the center of many important conversations. Even those people who disagreed with them felt the need to explain why the Freudians were wrong. Freudian ideas were essential, leaders in that field were national authorities, and students of Freudian theory and methods could feel that they were initiates in a grand tradition, a priesthood if you will. Freudians felt that, unlike just about everybody else, they treated human beings scientifically and dispassionately. What’s more, Freudians prided themselves on their boldness, their willingness to go beyond taboos to get to the essential truths of human nature. Sound familiar?

When it comes to influence in policy or culture or media, academic history doesn’t even come close to Freudianism in the 1950s or economics in recent decades.

This is not to say we should let historians off the hook when they make causal claims or policy recommendations. We shouldn’t let anyone off the hook. In that spirit, I appreciate Smith’s reminder of the limits of historical theories, along with Devereaux’s clarification of what historians really do when they’re doing academic history (as opposed to when they’re slinging around on twitter).

Why write about this at all?

As a statistician and political scientist, I’m interested in issues of generalization from academic research to policy recommendations. Even in the absence of any connection with academic research, people will spin general theories, and one problem with academic research is that it can give researchers, journalists, and policymakers undue confidence in bad theories. Consider, for example, the junk science promoted over the years by the Freakonomics franchise. So I think these sorts of discussions are important.

Some concerns about the recent Chetty et al. study on social networks and economic inequality, and what to do next?

I happened to receive two different emails regarding a recently published research paper.

Dale Lehman writes:

Chetty et al. (and it is a long et al. list) have several publications about social and economic capital (see here for one such paper, and here for the website from which the data can also be accessed). In the paper above, the data is described as:

We focus on Facebook users with the following attributes: aged between 25 and 44 years who reside in the United States; active on the Facebook platform at least once in the previous 30 days; have at least 100 US-based Facebook friends; and have a non-missing residential ZIP code. We focus on the 25–44-year age range because its Facebook usage rate is greater than 80% (ref. 37). On the basis of comparisons to nationally representative surveys and other supplementary analyses, our Facebook analysis sample is reasonably representative of the national population.

They proceed to measure social and economic connectedness across counties, zip codes, and for graduates of colleges and high schools. The data is massive as is the effort to make sense out of it. In many respects it is an ambitious undertaking and one worthy of many kudos.

But I [Lehman] do have a question. Given their inclusion criteria, I wonder about selection bias when comparing counties, zip codes, colleges, or high schools. I would expect that the fraction of Facebook users – even in the targeted age group – that are included will vary across these segments. For example, one college may have many more of its graduates who have that number of Facebook friends and have used Facebook in the prior 30 days compared with a second college. Suppose the economic connectedness from the first college is greater than from the second college. But since the first college has a larger proportion of relatively inactive Facebook users, is it fair to describe college 1 as having greater connectedness?

It seems to me that the selection criteria make the comparisons potentially misleading. It might be accurate to say that the regular users of Facebook from college 1 are more connected than those from college 2, but this may not mean that the graduates from college 1 are more connected than the graduates from college 2. I haven’t been able to find anything in their documentation to address the possible selection bias and I haven’t found anything that mentions how the proportion of Facebook accounts that meet their criteria varies across these segments. Shouldn’t that be addressed?

That’s an interesting point. Perhaps one way to address it would be to preprocess the data by estimating a propensity to use Facebook and then using this propensity as a poststratification variable in the analysis. I’m not sure. Lehman makes a convincing case that this is a concern when comparing different groups; that said, it’s the kind of selection problem we have all the time, and typically ignore, with survey data.
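As a rough sketch of what that poststratification idea could look like, here is a toy Python example. Everything in it is hypothetical: it assumes a propensity has already been estimated for every person in the target population (the propensity model itself is not shown), bins people into equal-width propensity strata, and reweights the stratum means from the included sample by the population shares. It is not anything Chetty et al. did.

```python
def stratum_of(propensity, num_bins):
    """Map a propensity in [0, 1] to one of num_bins equal-width strata."""
    return min(int(propensity * num_bins), num_bins - 1)


def poststratified_mean(sample, population_propensities, num_bins=2):
    """sample: (propensity, outcome) pairs for included people only.
    population_propensities: estimated propensities for everyone in the target group."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for propensity, outcome in sample:
        s = stratum_of(propensity, num_bins)
        sums[s] += outcome
        counts[s] += 1

    pop_counts = [0] * num_bins
    for propensity in population_propensities:
        pop_counts[stratum_of(propensity, num_bins)] += 1

    total = sum(pop_counts)
    estimate = 0.0
    for s in range(num_bins):
        if pop_counts[s] == 0:
            continue  # nobody in the population falls in this stratum
        if counts[s] == 0:
            raise ValueError(f"no sampled units in stratum {s}; use coarser bins")
        # Stratum mean from the sample, weighted by the population share.
        estimate += (pop_counts[s] / total) * (sums[s] / counts[s])
    return estimate


# Toy data: low-propensity people are rare in the sample but half the population.
sample = [(0.2, 0.2), (0.8, 0.6), (0.7, 0.6), (0.9, 0.6)]
population = [0.2] * 50 + [0.8] * 50
print(round(poststratified_mean(sample, population), 3))  # prints 0.4
```

In this toy case the raw sample mean is 0.5, while the reweighted estimate is 0.4, because the under-included low-propensity group counts for half the population. With real data one would want more strata, or a multilevel model in place of raw stratum means.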

Richard Alba writes in with a completely different concern:

You may be aware of the recent research, published in Nature by the economist Raj Chetty and colleagues, purporting to show that social capital in the form of early-life ties to high-status friends provides a powerful pathway to upward mobility for low-status individuals. It has received a lot of attention, from The New York Times, Brookings, and no doubt other places I am not aware of.

In my view, they failed to show anything new. We have known since the 1950s that social capital has a role in mobility, but the evidence they develop about its great power is not convincing, in part because they fail to take into account how their measure of social capital, the predictor, is contaminated by the correlates and consequences of mobility, the outcome.

This research has been greeted in some media as a recipe for the secret sauce of mobility, and one of their articles in Nature (there are two published simultaneously) is concerned with how to increase social capital. In other words, the research is likely to give rise to policy proposals. I think it is important then to inform Americans about its unacknowledged limitations.

I sent my critique to Nature, and it was rejected because, in their view, it did not sufficiently challenge the articles’ conclusions. I find that ridiculous.

I have no idea how Nature decides what critiques to publish, and I have not read the Chetty et al. articles so I can’t comment on them either, but I can share Alba’s critique. Here it is:

While the pioneering big-data research of Raj Chetty and his colleagues is transforming the long-standing stream of research into social mobility, their findings should not be exempt from critique.

Consider in this light the recent pair of articles in Nature, in which they claim to have demonstrated a powerful causal connection between early-life social capital and upward income mobility for individuals growing up in low-income families. According to one paper’s abstract, “the share of high-SES friends among individuals with low-SES—which we term economic connectedness—is among the strongest predictors of upward income mobility identified to date.”

But there are good reasons to doubt that this causal connection is as powerful as the authors claim. At a minimum, the social capital-mobility statistical relationship is significantly overstated.

This is not to deny a role for social capital in determining adult socioeconomic position. That has been well established for decades. As early as the 1950s, the Wisconsin mobility studies focused in part on what the researchers called “interpersonal influence,” measured partly in terms of high-school friends, an operationalization close to the idea in the Chetty et al. article. More generally, social capital is indisputably connected to labor-market position for many individuals because of the role social networks play in disseminating job information.

But these insights are not the same as saying that economic connectedness, i.e., cross-class ties, is the secret sauce in lifting individuals out of low-income situations. To understand why the articles’ evidence fails to demonstrate this, it is important to pay close attention to how the data and analysis are constructed. Many casual readers, who glance at the statements like the one above or read the journalistic accounts of the research (such as the August 1 article in The New York Times), will take away the impression that the researchers have established an individual-level relationship—that they have proven that individuals from low-SES families who have early-life cross-class relationships are much more likely to experience upward mobility. But, in fact, they have not.

Because of limitations in their data, their analysis is based on the aggregated characteristics of areas—counties and zip codes in this case—not individuals. This is made necessary because they cannot directly link the individuals in their main two sources of data—contemporary Facebook friendships and previous estimates by the team of upward income mobility from census and income-tax data. Hence, the fundamental relationship they demonstrate is better stated as: the level of social mobility is much higher in places with many cross-class friendships. The correlation, the basis of their analysis, is quite strong, both at the county level (.65) and at the zip-code level (.69).

Inferring that this evidence demonstrates a powerful causal mechanism linking social capital to the upward mobility of individuals runs headlong into a major problem: the black box of causal mechanisms at the individual level that can lie behind such an ecological correlation, where moreover both variables are measured for roughly the same time point. The temptation may be to think that the correlation reflects mainly, or only, the individual-level relationship between social capital and mobility as stated above. However, the magnitude of an area-based correlation may be deceptive about the strength of the correlation at the individual level. Ever since a classic 1950 article by W. S. Robinson, it has been known that ecological correlations can exaggerate the strength of the individual-level relationship. Sometimes the difference between the two is very large, and in the case of the Chetty et al. analysis it appears impossible given the data they possess to estimate the bias involved with any precision, because Robinson’s mathematics indicates that the individual-level correlations within area units are necessary to the calculation. Chetty et al. cannot calculate them.

A second aspect of the inferential problem lies in the entanglement in the social-capital measure of variables that are consequences or correlates of social mobility itself, confounding cause and effect. This risk is heightened because the Facebook friendships are measured in the present, not prior to the mobility. Chetty et al. are aware of this as a potential issue. In considering threats to the validity of their conclusion, they refer to the possibility of “reverse causality.” What they have in mind derives from an important insight about mobility—mobile individuals are leaving one social context for another. Therefore, they are also leaving behind some individuals, such as some siblings, cousins, and childhood buddies. These less mobile peers, who remain in low-SES situations but have in their social networks others who are now in high-SES ones, become the basis for the paper’s Facebook estimate of economic connectedness (which is defined from the perspective of low-SES adults between the ages of 25 and 44). This sort of phenomenon will be frequent in high-mobility places, but it is a consequence of mobility, not a cause. Yet it almost certainly contributes to the key correlation—between economic connectedness and social mobility—in the way the paper measures it.

Chetty et al. try to answer this concern with correlations estimated from high-school friendships, arguing that the timing purges this measure of mobility’s impact on friendships. The Facebook-based version of this correlation is noticeably weaker than the correlations that the paper emphasizes. In any event, demonstrating a correlation between teen-age economic connectedness and high mobility does not remove the confounding influence of social mobility from the latter correlations, on which the paper’s argument depends. And in the case of high-school friendships, too, the black-box nature of the causality behind the correlation leaves open the possibility of mechanisms aside from social capital.

This can be seen if we consider the upward mobility of the children of immigrants, surely a prominent part today of the mobility picture in many high-mobility places. Recently, the economists Ran Abramitzky and Leah Boustan have reminded us in their book Streets of Gold that, today as in the past, the children of immigrants, the second generation, leap on average far above their parents in any income ranking. Many of these children are raised in ambitious families, where as Abramitzky and Boustan put it, immigrants typically are “under-placed” in income terms relative to their abilities. Many immigrant parents encourage their children to take advantage of opportunities for educational advancement, such as specialized high schools or advanced-placement high-school classes, likely to bring them into contact with peers from more advantaged families. This can create social capital that boosts the social mobility of the second generation, but a large part of any effect on mobility is surely attributable to family-instilled ambition and to educational attainment substantially higher than one would predict from parental status. The increased social capital is to a significant extent a correlate of on-going mobility.

In sum, there is without doubt a causal linkage between social capital and mobility. But the Chetty et al. analysis overstates its strength, possibly by a large margin. To twist the old saw about correlation and causation, correlation in this case isn’t only causation.

I [Alba] believe that a critique is especially important in this case because the findings in the Chetty et al. paper create an obvious temptation for the formulation of social policy. Indeed, in their second paper in Nature, the authors make suggestions in this direction. But before we commit ourselves to new anti-poverty policies based on these findings, we need a more certain gauge of the potential effectiveness of social capital than the current analysis can give us.

I get what Alba is saying about the critique not strongly challenging the article’s conclusions. He’s not saying that Chetty et al. are wrong; it’s more that he’s saying there are a lot of unanswered questions here—a position I’m sure Chetty et al. would themselves agree with!

A possible way forward?

To step back a moment—and recall that I have not tried to digest the Nature articles or the associated news coverage—I’d say that Alba is criticizing a common paradigm of social science research in which a big claim is made from a study and the study has some clear limitations, so the researchers attack the problem in some different ways in an attempt to triangulate toward a better understanding.

There are two immediate reactions I’d like to avoid. The first is to say that the data aren’t perfect, the study isn’t perfect, so we just have to give up and say we’ve learned nothing. In the other direction is the unpalatable response that all studies are flawed so we shouldn’t criticize this one in particular.

Fortunately, nobody is suggesting either of these reactions. From one direction, critics such as Lehman and Alba are pointing out concerns but they’re not saying the conclusions of the Chetty et al. study are all wrong or that the study is useless; from the other, news reports do present qualifiers and they’re not implying that these results are a sure thing.

What we’d like here is a middle way—not just a rhetorical middle way (“This research, like all social science, has weaknesses and threats to validity, hence the topic should continue to be studied by others”) but a procedural middle way, a way to address the concerns, in particular to get some estimates of the biases in the conclusions resulting from various problems with the data.

Our default response is to say the data should be analyzed better: do a propensity analysis to address Lehman’s concern about who’s on facebook, and do some sort of multilevel model integrating individual and zipcode-level data to address Alba’s concern about aggregation. And this would all be fine, but it takes a lot of work—and Chetty et al. already did a lot of work, triangulating toward their conclusion from different directions. There’s always more analysis that could be done.

Maybe the problem with the triangulation approach is not the triangulation itself but rather the way it can be set up with a central analysis making a conclusion, and then lots of little studies (“robustness checks,” etc.) designed to support the main conclusion. What if the other studies were set up to estimate biases, with the goal not of building confidence in the big number but rather of getting a better, more realistic, estimate?

With this in mind, I’m thinking that a logical next step would be to construct a simulation study to get a sense of the biases arising from the issues raised by Lehman and Alba. We can’t easily gather the data required to know what these biases are, but it does seem like it should be possible to simulate a world in which different sorts of people are more or less likely to be on facebook, and in which there are local patterns of connectedness that are not simply what you’d get by averaging within zipcodes.

I’m not saying this would be easy—the simulation would have to make all sorts of assumptions about how these factors vary, and the variation would need to depend on relevant socioeconomic variables—but right now it seems to me to be a natural next step in the research.
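Here is a minimal sketch in R of what the skeleton of such a simulation could look like. Everything in it is an assumption made up for illustration: the number of zipcodes, the effect sizes, and the rule by which more-connected people are more likely to be on facebook are invented, not estimated from the Chetty et al. data. The point is only to show how one could compare the individual-level relationship built into a fake world with the area-level correlation that an analyst restricted to facebook users would compute.

```r
# Toy world: individuals nested in zipcodes (all numbers are hypothetical)
set.seed(123)
n_areas <- 200
n_per_area <- 100
n <- n_areas * n_per_area
area <- rep(seq_len(n_areas), each = n_per_area)

# A shared area-level factor drives both connectedness and mobility
area_effect <- rnorm(n_areas)

# Individual connectedness and mobility; the individual-level effect
# of connectedness on mobility is deliberately small (0.1)
connected <- 0.5 * area_effect[area] + rnorm(n)
mobility <- 0.5 * area_effect[area] + 0.1 * connected + rnorm(n)

# Hypothetical selection: more-connected people are more likely to be on facebook
on_fb <- rbinom(n, 1, plogis(0.5 * connected)) == 1

# Individual-level correlation within areas (area means removed)
cor_within <- cor(connected - ave(connected, area),
                  mobility - ave(mobility, area))

# Area-level correlation as an analyst would see it:
# connectedness averaged over facebook users only, mobility over everyone
area_conn_fb <- tapply(connected[on_fb], area[on_fb], mean)
area_mob <- tapply(mobility, area, mean)
cor_area <- cor(area_conn_fb, area_mob)

print(c(within_area = cor_within, area_level = cor_area))
```

A real version would let the selection rule and the area-level factor depend on socioeconomic variables, and would vary these assumptions to see how much the area-level summary moves.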

One more thing

Above I stressed the importance and challenge of finding a middle ground between (1) saying the study’s flaws make it completely useless and (2) saying the study represents standard practice so we should believe it.

Sometimes, though, response #1 is appropriate. For example, the study of beauty and sex ratio or the study of ovulation and voting or the study claiming that losing an election for governor lops 5 to 10 years off your life: I think those really are useless (except as cautionary tales, lessons of research practices to avoid). How can I say this? Because those studies are just soooo noisy compared to any realistic effect size. There’s just no there there. Researchers can fool themselves because they think that if they have hundreds or thousands of data points, that they’re cool, and that if they have statistical significance, they’ve discovered something. We’ve talked about this attitude before, and I’ll talk about it again; I just wanted to emphasize here that it doesn’t always make sense to take the middle way. Or, to put it another way, sometimes the appropriate middle way is very close to one of the extreme positions.

Bayesian inference continues to completely solve the multiple comparisons problem

Erik van Zwet writes:

I saw you re-posted your Bayes-solves-multiple-testing demo. Thanks for linking to my paper in the PPS! I think it would help people’s understanding if you explicitly made the connection with your observation that Bayesians are frequentists:

What I mean is, the Bayesian prior distribution corresponds to the frequentist sample space: it’s the set of problems for which a particular statistical model or procedure will be applied.

Recently Yoav Benjamini criticized your post (the 2016 edition) in section 5.5 of his article/blog “Selective Inference: The Silent Killer of Replicability.”

Benjamini’s point is that your simulation results break down completely if the true prior is mixed ever so slightly with a much wider distribution. I think he has a valid point, but I also think it can be fixed. In my opinion, it’s really a matter of Bayesian robustness; the prior just needs a flatter tail. This is a much weaker requirement than needing to know the true prior. I’m attaching an example where I use the “wrong” tail but still get pretty good results.

In his document, Zwet writes:

This is a comment on an article by Yoav Benjamini entitled “Selective Inference: The Silent Killer of Replicability.”

I completely agree with the main point of the article that over-optimism due to selection (a.k.a. the winner’s curse) is a major problem. One important line of defense is to correct for multiple testing, and this is discussed in detail.

In my opinion, another important line of defense is shrinkage, and so I was surprised that the Bayesian approach is dismissed rather quickly. In particular, a blog post by Andrew Gelman is criticized. The post has the provocative title: “Bayesian inference completely solves the multiple comparisons problem.”

In his post, Gelman samples “effects” from the N(0,0.5) distribution and observes them with standard normal noise. He demonstrates that the posterior mean and 95% credible intervals continue to perform well under selection.

In section 5.5 of Benjamini’s paper the N(0,0.5) is slightly perturbed by mixing it with N(0,3) with probability 1/1000. As a result, the majority of the credibility intervals that do not cover zero come from the N(0,3) component. Under the N(0,0.5) prior, those intervals get shrunken so much that they miss the true parameter.

It should be noted, however, that those effects are so large that they are very unlikely under the N(0,0.5) prior. Such “data-prior conflict” can be resolved by having a prior with a flat tail. This is a matter of “Bayesian robustness” and goes back to a paper by Dawid which can be found here.

Importantly, this does not mean that we need to know the true prior. We can mix the N(0,0.5) with almost any wider normal distribution with almost any probability and then very large effects will hardly be shrunken. Here, I demonstrate this by using the mixture 0.99*N(0,0.5)+0.01*N(0,6) as prior. This is quite far from the truth, but nevertheless, the posterior inference is quite acceptable. We find that among one million simulations, there are 741 credible intervals that do not cover zero. Among those, the proportion that do not cover the parameter is 0.07 (CI: 0.05 to 0.09).

The point is that the procedure merely needs to recognize that a particular observation is unlikely to come from N(0,0.5), and then apply very little shrinkage.

My own [Zwet’s] views on shrinkage in the context of the winner’s curse are here. In particular, a form of Bayesian robustness is discussed in section 3.4 of a preprint of myself and Gelman here. . . .

He continues with some simulations that you can do yourself in R.
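For readers who want to try this, here is a minimal sketch of that kind of simulation. It is a reconstruction from the description above, not van Zwet's own code. It assumes that the second argument in N(0,0.5), N(0,3), and N(0,6) is a standard deviation, that the truth is Benjamini's perturbed prior, and that the intervals are equal-tailed. It uses one shortcut: with a mixture-of-normals prior and normal noise the posterior is again a mixture of normals, and an equal-tailed 95% interval covers a value exactly when the posterior CDF at that value lies between 0.025 and 0.975, so no quantile search is needed.

```r
set.seed(1)
n_sims <- 1e6

# Truth (assumed): 0.999 * N(0, 0.5) + 0.001 * N(0, 3), observed with N(0, 1) noise
wide <- rbinom(n_sims, 1, 1 / 1000) == 1
theta <- rnorm(n_sims, 0, ifelse(wide, 3, 0.5))
y <- theta + rnorm(n_sims)

# Posterior CDF at t given y, under the "wrong" prior 0.99 * N(0, 0.5) + 0.01 * N(0, 6)
post_cdf <- function(t, y, p = c(0.99, 0.01), tau = c(0.5, 6)) {
  # Marginal density of y under each prior component, times the prior weight
  m1 <- p[1] * dnorm(y, 0, sqrt(tau[1]^2 + 1))
  m2 <- p[2] * dnorm(y, 0, sqrt(tau[2]^2 + 1))
  w1 <- m1 / (m1 + m2)                  # posterior weight of the narrow component
  shrink <- tau^2 / (tau^2 + 1)         # shrinkage factor for each component
  w1 * pnorm(t, shrink[1] * y, sqrt(shrink[1])) +
    (1 - w1) * pnorm(t, shrink[2] * y, sqrt(shrink[2]))
}

# Select the cases whose 95% interval excludes zero
cdf_at_zero <- post_cdf(0, y)
selected <- cdf_at_zero < 0.025 | cdf_at_zero > 0.975

# Among the selected cases, how often does the interval miss the true theta?
cdf_at_theta <- post_cdf(theta[selected], y[selected])
miss_rate <- mean(cdf_at_theta < 0.025 | cdf_at_theta > 0.975)

print(c(n_selected = sum(selected), miss_rate = miss_rate))
```

The two printed numbers can be compared with the figures quoted above. Changing `p` and `tau` in `post_cdf` is a quick way to explore how much the tail of the prior matters.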

The punch line is that, yes, the model makes a difference, and when you use the wrong model you’ll get the wrong answer (i.e., you’ll always get the wrong answer). This provides ample scope for research on robustness: how wrong are your answers, depending on how wrong is your model? This arises with all statistical inferences, and there’s no need in my opinion to invoke any new principles involving multiple comparisons. I continue to think that (a) Bayesian inference completely solves the multiple comparisons problem, and (b) all inferences, Bayesian included, are imperfect.

“Published estimates of group differences in multisensory integration are inflated”

Mike Beauchamp sends in the above picture of Buster (“so-named by my son because we adopted him as a stray kitten run over by a car and ‘all busted up’”) and sends along this article (coauthored with John F. Magnotti) “examining how the usual suspects (small n, forking paths, etc.) had led our little sub-field of psychology/neuroscience, multisensory integration, astray.” The article begins:

A common measure of multisensory integration is the McGurk effect, an illusion in which incongruent auditory and visual speech are integrated to produce an entirely different percept. Published studies report that participants who differ in age, gender, culture, native language, or traits related to neurological or psychiatric disorders also differ in their susceptibility to the McGurk effect. These group-level differences are used as evidence for fundamental alterations in sensory processing between populations. Using empirical data and statistical simulations tested under a range of conditions, we show that published estimates of group differences in the McGurk effect are inflated when only statistically significant (p < 0.05) results are published [emphasis added]. With a sample size typical of published studies, a group difference of 10% would be reported as 31%. As a consequence of this inflation, follow-up studies often fail to replicate published reports of large between-group differences. Inaccurate estimates of effect sizes and replication failures are especially problematic in studies of clinical populations involving expensive and time-consuming interventions, such as training paradigms to improve sensory processing. Reducing effect size inflation and increasing replicability requires increasing the number of participants by an order of magnitude compared with current practice.

Type M error!
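To see the mechanism, here is a small R sketch of the significance filter at work. The true difference of 10 percentage points is taken from the abstract; the standard error of 12 is a hypothetical stand-in for "a sample size typical of published studies," so the output is not meant to reproduce the paper's 31% figure, only to show the inflation.

```r
set.seed(2023)
true_diff <- 10    # true group difference, in percentage points
se <- 12           # hypothetical standard error of the estimated difference
n_studies <- 1e5   # number of simulated studies

# Each study's estimate is the truth plus noise
est <- rnorm(n_studies, true_diff, se)

# Keep only the studies that reach p < 0.05 (two-sided)
significant <- abs(est / se) > qnorm(0.975)

power <- mean(significant)                                # chance of a significant result
type_m <- mean(abs(est[significant])) / true_diff         # exaggeration ratio
type_s <- mean(est[significant] < 0)                      # share with the wrong sign

print(c(power = power, type_m = type_m, type_s = type_s))
```

Because an estimate has to exceed about 1.96 standard errors to be published under this filter, the published estimates are guaranteed to be large whenever the standard error is large relative to the true effect.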

How much should we trust assessments in systematic reviews? Let’s look at variation among reviews.

Ozzy Tunalilar writes:

I increasingly notice these “risk of bias” assessment tools (e.g., Cochrane) popping up in “systematic reviews” and “meta-analysis” with the underlying promise that they will somehow guard against unwarranted conclusions depending on, perhaps, the degree of bias. However, I also noticed multiple published systematic reviews referencing, using, and evaluating the same paper (Robinson et al 2013; it could probably have been any other paper). Having noticed that, I compiled the risk of bias assessment by multiple papers on the same paper. My “results” are above – so much variation across studies that perhaps we need to model the assessment of risk of bias in review of systematic reviews. What do you think?

My reply: I don’t know! I guess some amount of variation is expected, but this reminds me of a general issue in meta-analysis that different studies will have different populations, different predictors, different measurement protocols, different outcomes, etc. This seems like even more of a problem, now that thoughtless meta-analysis has become such a commonly-used statistical tool, to the extent that there seem to be default settings and software that can even be used by both sides of a dispute.

Multilevel Regression and Poststratification Case Studies

Juan Lopez-Martin, Justin Phillips, and I write:

The following case studies intend to introduce users to Multilevel Modeling and Poststratification (MRP) and some of its extensions, providing reusable code and clear explanations. The first chapter presents MRP, a statistical technique that allows to estimate subnational estimates from national surveys while adjusting for nonrepresentativeness. The second chapter extends MRP to overcome the limitation of only using variables included in the census. The last chapter develops a new approach that combines MRP with an ideal point model, allowing to obtain subnational estimates of latent attitudes based on multiple survey questions and improving the subnational estimates for an individual survey item based on other related items.

These case studies do not display some non-essential code, such as the ones used to generate figures and tables. However, all the code and data is available on the corresponding GitHub repo.

The tutorials assume certain familiarity with R and Bayesian Statistics. A good reference to the required background is Gelman, Hill, and Vehtari (2020). Additionally, multilevel models are covered in Gelman and Hill (2006) (Part 2A) or McElreath (2020) (Chapters 12 and 13).

The case studies are still under development. Please send any feedback to [email protected]

This is the document I point people to when they ask how to do Mister P. Here are the sections:

Chapter 1: Introduction to Mister P

1.1 Data
1.2 First stage: Estimating the Individual-Response Model
1.3 Second Stage: Poststratification
1.4 Adjusting for Nonrepresentative Surveys
1.5 Practical Considerations
1.6 Appendix: Downloading and Processing Data

Chapter 2: MRP with Noncensus Variables
2.1 Model-based Extension of the Poststratification Table
2.2 Adjusting for Nonresponse Bias
2.3 Obtaining Estimates for Non-census Variable Subgroups

Chapter 3: Ideal Point MRP
3.1 Introduction and Literature
3.2 A Two-Parameter IRT Model with Latent Multilevel Regression
3.3 The Abortion Opposition Index for US States
3.4 Estimating Support for Individual Questions
3.5 Concluding Remarks
3.6 Appendix: Stan Code

This should be useful to a lot of people.

The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to Do About It

Here it is:

A standard mode of inference in social and behavioral science is to establish stylized facts using statistical significance in quantitative studies. However, in a world in which measurements are noisy and effects are small, this will not work: selection on statistical significance leads to effect sizes which are overestimated and often in the wrong direction. After a brief discussion of two examples, one in economics and one in social psychology, we consider the procedural solution of open postpublication review, the design solution of devoting more effort to accurate measurements and within-person comparisons, and the statistical analysis solution of multilevel modeling and reporting all results rather than selection on significance. We argue that the current replication crisis in science arises in part from the ill effects of null hypothesis significance testing being used to study small effects with noisy data. In such settings, apparent success comes easy but truly replicable results require a more serious connection between theory, measurement, and data.

The article was published in 2018 but it remains relevant, all these many years later.

Doing Mister P with multiple outcomes

Someone sends in a question:

I’ve been delving into your papers and blog posts regarding MRP. The resources are really great – especially the fully worked example you did in collaboration with Juan Lopez-Martin and Justin Phillips.

The approach is really nice for when you want to estimate just a single parameter of interest from a survey, such as support for a policy. However, I’m wondering whether you’ve also had to deal with situations where you want to compare support for one policy vs. another, that were asked in the same survey of the same respondents? Some media reporting that compares levels of support for different things seems like they are often just looking at numbers from separate models run on each question, but the data might have been collected from the same people in a single survey. Having run some MRP, I can see that if you wanted to also add in repeated measures of respondents then the computational complexity can really balloon (in fact, I recently melted part of my computer trying to do this with an ordinal model!). But I also assume it is not totally valid to take data from the same respondents, run one MRP model on ‘Support for X’, and another model on ‘Support for Y’, or different framings of the same issue, and compare the posterior distributions of the level of support from these separate models – because that would be treating them as independent responses. Life would be much easier if this were an acceptable approach but I am not sure that it is!

Is my sense above incorrect, or should one instead incorporate the repeated measurement into the regression equation? I thought it might be possible to do so by changing the type of equation you have in your MRP primer (fit) to something like fit2: I add ‘question’ as an effect that, like the intercept, can vary across demographic subgroups, and I add respondentID as another way in which the data is nested. Is this anything like how you would deal with this?

fit <- stan_glmer(abortion ~ (1 | state) + (1 | eth) + (1 | educ) + male + (1 | male:eth) + (1 | educ:age) + (1 | educ:eth) + repvote + factor(region), ...)

fit2 <- stan_glmer(response ~ question + (1 + question | state) + (1 + question | eth) + (1 + question | educ) + question:male + (1 + question | male:eth) + (1 + question | educ:age) + (1 + question | educ:eth) + question*repvote + question*factor(region) + (1 | respondentID), ...)

This seems like something that ought to be considered in opinion polling/survey research, and it is interesting to think about how best to address it. The main things I was wondering about were the levels at which different things should be nested, and at a practical level, if I generate expected responses with a function such as 'posterior_epred', can I just put new respondent IDs in the poststrat table and assume that randomly drawing different 'participants' across the different posterior draws will balance out the different possible respondent intercepts that might be drawn? As a note, I also see that sometimes one could instead do MRP on difference or change scores, but this is not really possible with certain response formats or with many different items from the same subject.

My reply:

If you have multiple, related questions on the same survey, then one approach is to fit an ideal-point model. I thought we had an ideal-point model in that case study, but now I don’t see it there—I wonder where it went? But the ideal-point model really only makes sense when the different questions are measuring the same sort of thing; if you’re interested in responses in two different issues, that’s another story. I agree that modeling the two responses completely separately is potentially wasteful of information.

Your proposed fit2 model is similar to an ideal-point model, but it has the problem that the responses to the two questions are assumed to be independent. You could think of it like two separate models where there’s some pooling of coefficients between the two models.

What to actually do? I’m not sure. I think I’d start by just fitting two separate models, then you could look at the correlation of the residuals, and if there’s nothing there, maybe it’s ok to go with the separate models. If the residuals do show some correlation, then more needs to be done. Another approach is that if the two questions have a logical order, then you can first model y1 given x, and then model y2 given y1 and x.
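Here is a rough sketch of both suggestions on fake data. It uses plain logistic regressions from base R as a stand-in for the multilevel models, and the data-generating process (a single predictor x plus an unmodeled respondent-level trait u shared by both answers) is made up for illustration.

```r
set.seed(42)
n <- 2000
x <- rnorm(n)   # a respondent-level predictor
u <- rnorm(n)   # hypothetical unmodeled trait that affects both answers

# Two binary survey responses from the same respondents
y1 <- rbinom(n, 1, plogis(-0.2 + 0.8 * x + u))
y2 <- rbinom(n, 1, plogis(0.3 + 0.5 * x + u))

# Step 1: fit the two questions separately
fit_y1 <- glm(y1 ~ x, family = binomial)
fit_y2 <- glm(y2 ~ x, family = binomial)

# Step 2: look at the correlation of the residuals across the two fits
resid_cor <- cor(residuals(fit_y1, type = "response"),
                 residuals(fit_y2, type = "response"))

# Alternative when the questions have a logical order:
# model y2 given y1 and x
fit_y2_given_y1 <- glm(y2 ~ y1 + x, family = binomial)

print(resid_cor)
print(coef(fit_y2_given_y1))
```

In this fake world the residuals are correlated by construction, which is the signal that the two separate models are leaving information on the table.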

P.S. I asked my correspondent if I could post the above question, and he replied, “Yes, but please don’t put any reference to my institution in there (not like it is secret, but just would have to clear it with people if it is included!).” So his identity will remain secret.

Probabilistic feature analysis of facial perception of emotions

With Michel Meulders, Paul De Boeck, and Iven Van Mechelen, from 2005 (but the research was done several years earlier):

According to the hypothesis of configural encoding, the spatial relationships between the parts of the face function as an additional source of information in the facial perception of emotions. The paper analyses experimental data on the perception of emotion to investigate whether there is evidence for configural encoding in the processing of facial expressions. It is argued that analysis with a probabilistic feature model has several advantages that are not implied by, for example, a generalized linear modelling approach. First, the probabilistic feature model allows us to extract empirically the facial features that are relevant in processing the face, rather than focusing on the features that were manipulated in the experiment. Second, the probabilistic feature model allows a direct test of the hypothesis of configural encoding as it explicitly formalizes a mechanism for the way in which information about separate facial features is combined in processing the face. Third, the model allows us to account for a complex data structure while still yielding parameters that have a straightforward interpretation.

We should not have listed the emotions in alphabetical order:

The accidental experiment that saved 700 lives

Paul Alper sends along this news article by Sarah Kliff, who writes:

Three years ago, 3.9 million Americans received a plain-looking envelope from the Internal Revenue Service. Inside was a letter stating that they had recently paid a fine for not carrying health insurance and suggesting possible ways to enroll in coverage. . . .

Three Treasury Department economists [Jacob Goldin, Ithai Lurie, and Janet McCubbin] have published a working paper finding that these notices increased health insurance sign-ups. Obtaining insurance, they say, reduced premature deaths by an amount that exceeded any of their expectations. Americans between 45 and 64 benefited the most: For every 1,648 who received a letter, one fewer death occurred than among those who hadn’t received a letter. . . .

The experiment, made possible by an accident of budgeting, is the first rigorous experiment to find that health coverage leads to fewer deaths, a claim that politicians and economists have fiercely debated in recent years as they assess the effects of the Affordable Care Act’s coverage expansion. The results also provide belated vindication for the much-despised individual mandate that was part of Obamacare until December 2017, when Congress did away with the fine for people who don’t carry health insurance.

“There has been a lot of skepticism, especially in economics, that health insurance has a mortality impact,” said Sarah Miller, an assistant professor at the University of Michigan who researches the topic and was not involved with the Treasury research. “It’s really important that this is a randomized controlled trial. It’s a really high standard of evidence that you can’t just dismiss.”

This graph shows how the treatment increased health care coverage during the months after it was applied:

And here’s the estimated effect on mortality:

They should really label the lines directly. Sometimes it seems that economists think that making a graph easier to read is a form of cheating!

I’d also like to see some multilevel modeling—as it is, they end up with lots of noisy estimates, lots of wide confidence intervals, and I think more could be done.

But that’s fine. It’s best that the authors did what they did, which was to present their results. Now that the data are out there, other researchers can go back in and do more sophisticated analysis. That’s how research should go. It would not make sense for such important results to be held under wraps, waiting for some ideal statistical analysis that might never happen.

Overall, this is an inspiring story of what can be learned from a natural experiment.

The news article also has this sad conclusion:

At the end of 2017, Congress passed legislation eliminating the health law’s fines for not carrying health insurance, a change that probably guarantees that the I.R.S. letters will remain a one-time experiment.

But now that they have evidence that the letters had a positive effect, maybe they’ll restart the program, no?

“Why the New Pollution Literature is Credible” . . . but I’m still guessing that the effects are being overestimated:

In a post entitled, “Why the New Pollution Literature is Credible,” Alex Tabarrok writes:

My recent post, Air Pollution Reduces Health and Wealth drew some pushback in the comments, some justified, some not, on whether the results of these studies are not subject to p-hacking, forking gardens and the replication crisis. Sure, of course, some of them are. . . . Nevertheless, I don’t think that skepticism about the general thrust of the results is justified. Why not?

First . . . my rule is trust literatures not papers and the new pollution literature is showing consistent and significant negative effects of pollution on health and wealth. . . . It’s not just that the literature is large, however, it’s that the literature is consistent in a way that many studies in say social psychology were not. In social psychology, for example, there were many tests of entirely different hypotheses—power posing, priming, stereotype threat—and most of these failed to replicate. But in the pollution literature we have many tests of the same hypotheses. We have, for example, studies showing that pollution reduces the quality of chess moves in high-stakes matches, that it reduces worker productivity in Chinese call-centers, and that it reduces test scores in American and in British schools. . . . from different researchers studying different times and places using different methods but they are all testing the same hypothesis, namely that pollution reduces cognitive ability. . . .

Another feature in favor of the air pollution literature is that the hypothesis that pollution can have negative effects on health and cognition wasn’t invented yesterday . . . The Romans, for example, noted the negative effect of air pollution on health. There’s a reason why people with lung disease move to the countryside and always have.

I also noted in Why Most Published Research Findings are False that multiple sources and types of evidence are desirable. The pollution literature satisfies this desideratum. Aside from multiple empirical studies, the pollution hypothesis is also consistent with plausible mechanisms . . .

Moreover, there is a clear dose-response effect–so much so that when it comes to “extreme” pollution few people doubt the hypothesis. Does anyone doubt, for example, that an infant born in Delhi, India–one of the most polluted cities in the world–is more likely to die young than if the same infant grew up (all else equal) in Wellington, New Zealand–one of the least polluted cities in the world? . . .

What is new about the new pollution literature is more credible methods and bigger data and what the literature shows is that the effects of pollution are larger than we thought at lower levels than we thought. But we should expect to find smaller effects with better methods and bigger data. . . . this isn’t guaranteed, there could be positive effects of pollution at lower levels, but it isn’t surprising that what we are seeing so far is negative effects at levels previously considered acceptable.

Thus, while I have no doubt that some of the papers in the new pollution literature are in error, I also think that the large number of high quality papers from different times and places which are broadly consistent with one another and also consistent with what we know about human physiology and particulate matter and also consistent with the literature on the effects of pollution on animals and plants and also consistent with a dose-response relationship suggest that we take this literature and its conclusion that air pollution has significant negative effects on health and wealth very seriously.

This all makes a lot of sense—enough so that I quoted large chunks of Tabarrok’s post.

Still, I think actual effects will be quite a bit lower than claimed in the literature. Yes, it’s appropriate to look at the literature, not just individual studies. But if each individual study is biased, that will bias the literature. You can think of Alex’s two posts on the effects of air pollution as a sort of informal meta-analysis, and it’s a meta-analysis that does not correct for selection bias within each published study. Again, his general points (both methodologically and with regard to air pollution in particular) make sense; I just think there’s a bias he’s not correcting for. When we talk about forking paths etc. it’s typically in settings where there’s essentially zero signal (more precisely, settings where any signal is overwhelmed by noise) and people are finding patterns out of nothing—but the same general errors can lead to real effects being greatly overestimated.
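To make that last point concrete, here is a minimal simulation sketch. It is not taken from Tabarrok's posts or from any of the pollution papers: the true effect, the standard error, and the selection rule (only estimates more than 1.96 standard errors from zero get reported) are all assumptions chosen for illustration. The effect is real and every individual estimate is unbiased, yet the estimates that survive selection overstate it.

```python
import random
import statistics


def significance_filter(true_effect, std_error, n_studies=100_000, seed=1):
    """Simulate unbiased noisy estimates and keep only the 'significant' ones."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    # Assumed selection rule: an estimate is reported only if |estimate| > 1.96 s.e.
    selected = [e for e in estimates if abs(e) > 1.96 * std_error]
    return {
        "mean_all": statistics.fmean(estimates),
        "share_selected": len(selected) / n_studies,
        # Average magnitude of the selected estimates, relative to the truth.
        "exaggeration": statistics.fmean([abs(e) for e in selected]) / true_effect,
    }


# Hypothetical setting: a real effect that is the same size as its standard error.
result = significance_filter(true_effect=1.0, std_error=1.0)
print(result)
```

In this made-up setting the average over all simulated studies sits close to the true effect, while the average over the selected studies is more than twice as large. A literature built only from the selected studies would be consistent, would point in the right direction, and would still be biased upward.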

P.S. The comments to Tabarrok’s post are pretty wack. Some reasonable points but lots and lots of people just overwhelmed by political ideology.

International Workshop on Statistical Modelling – IWSM 2022 in Trieste (Italy)

I am glad to announce that the next International Workshop on Statistical Modelling (IWSM), the major activity of the Statistical Modelling Society, will take place in Trieste, Italy, between July 18 and July 22 2022, organized by University of Trieste.

The conference will be preceded by the short course “Statistical Modelling of Football Data” by Ioannis Ntzoufras (AUEB) and Leonardo Egidi (Univ. of Trieste) on July 17th. The course is based on Stan and is intended for people with a minimal statistical/mathematical background.

Interested participants may register for one of the following options:

  • whole conference
  • conference + short course
  • short course

Information about registration and fees can be found here. The call-for-papers deadline for submitting a 4-page abstract is April 4th (likely to be extended). For more information, visit the IWSM 2022 website.

Stay tuned, and share this event with whoever may be interested in the conference.

Talks from our mini-conference, MRI Together: A global workshop on Open Science and Reproducible MR Research

The conference was called MRI Together: A global workshop on Open Science and Reproducible MR Research. Talks were by:

Megan Higgs

Geoff Cumming

Sabine Hoffmann

Valentin Amrhein


It was a privilege to be included with these thoughtful colleagues. Enjoy the videos.

An Easy Layup for Stan

This post is by Phil Price, not Andrew.

The tldr version of this is: I had a statistical problem that ended up calling for a Bayesian hierarchical model. I decided to implement it in Stan. Even though it’s a pretty simple model and I’ve done some Stan modeling before, I thought it would take at least several hours for me to get a model I was happy with, but that wasn’t the case. Right tool for the job. Thanks Stan team!

Longer version follows.

I have a friend who routinely plays me for a chump. Fool me five times, shame on me. The guy is in finance, and every few years he calls me up and says “Phil, I have a problem, I need your help. It’s really easy” — and then he explains it so it really does seem easy — “but I need an answer in just a few days and I don’t want to get way down in the weeds, just something quick and dirty. Can you give me this estimate in…let’s say under five hours of work, by next Monday?” Five hours? I can hardly do anything in five hours. But still, it really does seem like an easy problem. I say OK and quote a slight premium over my regular consulting rate. And then…as always (always!) it ends up being more complicated than it seemed. That’s not a phenomenon that is unique to him: just about every project I’ve ever worked on turns out to be more complicated than it seems. The world is complicated! And people do the easy stuff themselves, so if someone comes to me it’s because it’s not trivial. But I never seem to learn.

Anyway, what my friend does is “valuation”: how much should someone be willing to pay for this thing? The ‘thing’ in this case is a program for improving the treatment of patients being treated for severe kidney disease. Patients do dialysis, they take medications, they’re on special diets, they have health monitoring to do, they have doctor appointments to attend, but many of them fail to do everything. That’s especially true as they get sicker: it gets harder for them to keep track of what they’re supposed to do, and physically and mentally harder to actually do the stuff.

For several years someone ran a trial program to see what happens if these people get a lot more help: what if there’s someone at the dialysis center whose job is to follow up with people and make sure they’re taking their meds, showing up to their appointments, getting their blood tests, and so on? One would hope that the main metrics of interest would involve patient health and wellbeing, and maybe that’s true for somebody, but for my friend (or rather his client) the question is: how much money, if any, does this program save? That is, what happens to the cost per patient per year if you have this program compared to doing stuff the way it has been done in the past?

As is usually the case, the data suck. What you would want is a random selection of pilot clinics where they tried the program, and the ones where they didn’t, and you’d want the cost data from the past ten years or something for every clinic; you could do some sort of difference-in-differences approach, maybe matching cases and controls by relevant parameters like region of the country and urban/rural and whatever else seems important. Unfortunately my friend had none of that. The clinics were semi-haphazardly selected by a few health care providers, probably slightly biased towards the ones where the administrators were most willing to give the program a try. The only clinic-specific data are from the first year of the program onward; other than that all we have is the nationwide average for similar clinics.

The fact that no before/after comparison is possible seemed like a dealbreaker to me, and I said so, but my friend said the experts think the effect of the program wouldn’t likely show up in the form of a step change from before to after, but rather in a lower rate of inflation, at least for the first several years. Relative to business as usual you expect to see a slight decline in cost every year for a while. I don’t understand why but OK, if that’s what people expect then maybe we can look for that: we expect to see costs at the participating clinics increase more slowly than the nationwide average. I told my friend that’s _all_ we can look for, given the data constraints, and he said fine. I gave him all of the other caveats too and he said that’s all fine as well. He needs some kind of estimate, and, well, you go to war with the data you have, not the data you want.

The first thing I did was to divide out the nationwide inflation rate for similar clinics that lack the program, in order to standardize on current dollars. Then I fit a linear regression model to the whole dataset of (dollars per patient) as a function of year, giving each clinic its own intercept but giving them all a common slope. And sure enough, there’s a slight decline! The clinics with the program had a slightly lower rate of inflation than the other clinics, and it’s in line with what my friend said the experts consider a plausible rate. All those caveats I mentioned above still apply but so far things look OK.
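For readers who want to see the shape of that first regression, here is a minimal sketch on simulated data. It is not the code or the data from the project: the number of clinics, the cost levels, the noise, and the function name `common_slope` are all made up for illustration. With one intercept per clinic and a shared slope, the least-squares slope can be computed by centering year and cost within each clinic and pooling the results.

```python
import random


def common_slope(costs_by_clinic):
    """Least-squares slope shared by all clinics, with one intercept per clinic.

    costs_by_clinic maps a clinic id to a list of (year, adjusted_cost) pairs.
    Centering within each clinic removes that clinic's intercept, so the
    shared slope is a ratio of pooled within-clinic sums.
    """
    sxy = 0.0
    sxx = 0.0
    for rows in costs_by_clinic.values():
        mean_year = sum(year for year, _ in rows) / len(rows)
        mean_cost = sum(cost for _, cost in rows) / len(rows)
        for year, cost in rows:
            sxy += (year - mean_year) * (cost - mean_cost)
            sxx += (year - mean_year) ** 2
    return sxy / sxx


# Hypothetical data: 30 clinics, 4 years each, true slope of -1 per year.
rng = random.Random(0)
years = [0, 1, 2, 3]
fake_costs = {}
for clinic in range(30):
    intercept = rng.gauss(100.0, 10.0)  # clinic-specific cost level
    fake_costs[clinic] = [
        (t, intercept - 1.0 * t + rng.gauss(0.0, 3.0)) for t in years
    ]

print(round(common_slope(fake_costs), 2))
```

Because the costs are already divided by the benchmark inflation, a negative slope here corresponds to costs growing more slowly than the nationwide average.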

If that was all my friend needed then hey, job done and it took a lot less than five hours. But no: my friend doesn’t just need to know the average rate of decrease, he needs to know the approximate statistical distribution across clinics. If the average is, say, a 1% decline per year relative to the benchmark, are some clinics at 2% per year? What about 3%? And maybe some don’t decline at all, or maybe the program makes them cost more money instead? (After all, you have to pay someone to help all those patients, so if the help isn’t very successful you are going to lose out).

You’d like to just fit a different regression for each clinic and look at the statistical distribution of slopes, but that won’t work: there’s a lot of year-to-year ‘noise’ at any individual clinic. One reason is that you can get unlucky in suddenly having a disproportionate number of patients who need very expensive care, or lucky in not having that happen, but there are other reasons too. And you only have three or four years of data per clinic. Even if all of the clinics had programs of equal effectiveness, you’d get a wide variation in the observed slopes. It’s very much like the “eight schools problem”.

It’s really tailor-made for a Bayesian hierarchical model. Observed costs are distributed around “true costs” with a standard error we can estimate; true inflation-adjusted cost at a clinic declines linearly; slopes are distributed around some mean slope, with some standard deviation we are trying to estimate. We even have useful prior estimates of what slope might be plausible.

Even simple models usually take me a while to code and to check so I sort of dreaded going that route — it’s not something I would normally mind, but given the time constraints I thought it would be hard — but in fact it was super easy. I coded an initial version that was slightly simpler than I really wanted and it ran fine and generated reasonable parameter values. Then I modified it to turn it into the model I wanted and…well, it mostly worked. The results all looked good and some model-checking turned out fine, but I got an error that there were a lot of “divergent transitions.” I’ve run into that before and I knew the trick for eliminating them, which is described here. (It seems like that method should be described, or at least that page should be linked, in the “divergent transitions” section of the Stan Reference Manual but it isn’t. I suppose I ought to volunteer to help improve the documentation. Hey Stan team, if it’s OK to add that link, and if you give me edit privileges for the manual, I’ll add it.) I made the necessary modification to fix that problem and then everything was hunky dory.
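The reason separate per-clinic regressions mislead can be seen in a small simulation. This is a sketch under made-up assumptions, not the Stan model described above: every simulated clinic has exactly the same true slope, the noise level is treated as known, and a crude method-of-moments shrinkage stands in for the full Bayesian hierarchical fit.

```python
import math
import random
import statistics


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


rng = random.Random(2)
years = [0, 1, 2, 3]
true_slope, noise_sd, n_clinics = -1.0, 3.0, 200  # made-up values

# Every clinic has the SAME true slope; only the noise differs.
observed = []
for _ in range(n_clinics):
    intercept = rng.gauss(100.0, 10.0)
    costs = [intercept + true_slope * t + rng.gauss(0.0, noise_sd) for t in years]
    observed.append(ols_slope(years, costs))

# Standard error of each per-clinic slope, treating noise_sd as known.
mean_year = statistics.fmean(years)
slope_se = noise_sd / math.sqrt(sum((t - mean_year) ** 2 for t in years))

raw_sd = statistics.stdev(observed)
# Method-of-moments guess at the real between-clinic variance of slopes.
between_var = max(0.0, raw_sd ** 2 - slope_se ** 2)
weight = between_var / (between_var + slope_se ** 2)
grand_mean = statistics.fmean(observed)
# Pull each noisy slope toward the overall mean by the estimated weight.
pooled = [grand_mean + weight * (s - grand_mean) for s in observed]
pooled_sd = statistics.stdev(pooled)

print(f"sd of raw slopes: {raw_sd:.2f} (sampling s.e. {slope_se:.2f})")
print(f"sd of partially pooled slopes: {pooled_sd:.2f}")
```

The raw slopes spread out by roughly their sampling standard error even though there is no real variation between clinics, which is the eight-schools situation. The partially pooled slopes are pulled toward the common mean, so the estimated distribution across clinics reflects how much of the spread is plausibly real rather than noise.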

From starting the model to finishing with it — by which I mean I had done some model checks and looked at the outputs and generated the estimates I needed — was only about two hours. I still didn’t quite get the entire project done in the allotted time, but I was very close. Oh, and in spite of the slight overrun I billed for all of my hours, so my chump factor didn’t end up being all that high.

Thank you Stan team!


God is present in the sweeping gestures, but the devil is in the details

Can’t remember anything at all — Nick Cave

Girls, it’s been a while. But it’s midnight and I am AMPED after a Springtime concert (when Lauren asked me what genre of music it was, I said “loud”. They played in a very weird room that usually hosts orchestras and clearly had a rusted on group of silver-haired subscribers who didn’t want the intersection of swampy rock, free jazz, and noise. And I now need to see every single band possible in a room that’s half full of people who are SIMPLY NOT INTO IT. Glorious. It’s the vibe of the thing.)

Of course I’ve been gone, but fish gotta swim, birds gotta swim if sufficient external motivation is applied, and similarly I’ve just gotta blog. So I’ve been doing it in secret (not actually secret, I’ve been telling people). Not because I don’t love you but because sometimes a man needs to write eight thousand words on technical definitions of Gaussian processes, describe conjugate priors as “the crystal deodorant of Bayesian statistics”, and just generally make people say “look, I held on for as long as I could but that was too much”. I’ve also been re-posting a couple of my old posts from here that I like because the WordPress ate my equations (they’re all lightly edited because I’m anal, and this one probably got much better). Also, using distill to make blogs means I can go footnote mad and you know how much I love footnotes.

Anyway. What am I gonna talk about? I feel there’s some pressure here because one of my favourite of my posts (and, also, one of my most “is he ok? no” posts) was written in a similar post-concert coming down state.

But I’m not gonna be that fancy or that long winded tonight.

Instead, I just want to share some stolen thoughts on a single phenomenon that is interesting.

When you add a “random effect” to your regression-type model, the regression coefficients will change. Sometimes a lot.

Now I know that I risk the full wrath of the cancel culture (to use the youth’s term) by using the term random effect on this blog. Andrew has written at length about why he doesn’t like it (you can google). But it’s a useful umbrella term in this context: I mean iid effects like random slopes and intercepts, smoothing splines in (Bayesian) GAM models, Gaussian processes, and the whole undead army of other “extra randomness” bits that you can put in a regression equation. (The criticism that a term doesn’t mean anything specific or is ambiguous can be effectively blunted by using it to mean everything.)

As with many “change the cheerleader, change the world” statistical issues, this really always ends up being an issue of confounding. But oh what subtle confounding it can be! Jim Hodges has a legitimately wonderful book called Richly Parameterized Linear Models that I heartily recommend that goes into all of these things. Even just by looking at the sample chapters, you will find

  • models where the multilevel structure is confounded with a covariate
  • models where a more complex dependence structure (like an ICAR for spatial data or a spline for … non-spatial data) is confounded with a covariate.

(The book also covers some truly wild things that happen to the hyperparameters [aka the parameters that control the correlation structure] in these situations!)

The long and the short of it, though, is that the interpretation of regression coefficients will change when you add more complex things (like multilevel or spatial or temporal or non-linear) structure to your model.
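Here is a minimal simulated sketch of that phenomenon. It is an illustration under assumptions of my choosing, not an example from Hodges's book: the group sizes and coefficients are made up, and the "extra structure" is handled by centering within groups (one free intercept per group), which is the no-pooling limit of a random intercept rather than a proper multilevel fit.

```python
import random
import statistics


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (with an intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


rng = random.Random(3)
groups = []
for _ in range(50):
    u = rng.gauss(0.0, 1.0)  # group-level effect
    # The covariate is correlated with the group effect: this is the confounding.
    xs = [0.8 * u + rng.gauss(0.0, 1.0) for _ in range(20)]
    # Made-up truth: the coefficient on x is 1.0.
    ys = [1.0 * x + 2.0 * u + rng.gauss(0.0, 1.0) for x in xs]
    groups.append((xs, ys))

# Vanilla regression that ignores the grouping.
all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
pooled_slope = slope(all_x, all_y)

# Regression with a free intercept per group, via within-group centering.
group_means = [(statistics.fmean(xs), statistics.fmean(ys)) for xs, ys in groups]
centered_x = [x - mx for (xs, _), (mx, _) in zip(groups, group_means) for x in xs]
centered_y = [y - my for (_, ys), (_, my) in zip(groups, group_means) for y in ys]
within_slope = slope(centered_x, centered_y)

print(f"ignoring groups: {pooled_slope:.2f}; with group intercepts: {within_slope:.2f}")
```

In this setup the vanilla regression gives a coefficient near 2 and the version with group intercepts gives a coefficient near 1. Neither number is wrong as arithmetic. They answer different questions, and which one you want depends on what you think the group structure represents.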

As with all things, this becomes violently true when you start thinking of interpreting your regression coefficients causally.

It’s worth saying that this is potentially a feature and not a bug! Especially when you’ve added the extra structure to your regression model in order to try and separate structural effects (like repeated measurements in individuals or temporal or spatial autocorrelation) from the effect of your covariates!

So how do you deal with it? No earthly idea.

  • Hodges and Reich suggest (in some contexts) organizing your extra randomness so it’s orthogonal to your variables of interest (this is easy to do for Gaussians). This assumes that there are no unmeasured covariates moving in the same direction as your covariate of interest, and it will keep the regression coefficient point estimates the same between the vanilla regression and the regression with the extra stuff. (Hanks et al. suggest this can lead to unreasonably narrow posterior uncertainty intervals.)
  • Sigrunn, Janine, David, Håvard, and I think you should use priors to explicitly limit complexity of your extra structure, which puts bounds on how much of the change can be coming from unmeasured things correlated with your object of interest. (Jon Wakefield also does this when choosing priors for smoothing splines in his excellent book)
  • A whole cornucopia of literature that I don’t have the time or space to link to covers an infinity of aspects of formal causal identification and estimation in a whole variety of situations where we would add these types of structured uncertainties (longitudinal models, time series, spatial models, and variants thereof).
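The orthogonalization idea in the first bullet can be sketched in a stripped-down form. This is a simplified illustration, not the Hodges and Reich procedure itself: the "extra structure" here is a single made-up column `z` fitted by least squares rather than a random effect with a prior, and all coefficients are invented. It only shows the mechanical fact that making the extra term orthogonal to the covariate leaves the covariate's point estimate where the vanilla regression put it.

```python
import random
import statistics


def center(v):
    m = statistics.fmean(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def slope_on_x(x, y, z=None):
    """Coefficient on x from least squares of y on x (and z, if given), with intercept."""
    x, y = center(x), center(y)
    if z is None:
        return dot(x, y) / dot(x, x)
    z = center(z)
    # Closed-form solution of the 2x2 normal equations for centered x and z.
    det = dot(x, x) * dot(z, z) - dot(x, z) ** 2
    return (dot(z, z) * dot(x, y) - dot(x, z) * dot(z, y)) / det


rng = random.Random(4)
n = 500
z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for the extra structure
x = [0.7 * zi + rng.gauss(0.0, 1.0) for zi in z]  # covariate correlated with z
y = [1.0 * xi + 1.5 * zi + rng.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

# Remove from z the part that is linearly explained by x (and the intercept).
xc, zc = center(x), center(z)
proj = dot(xc, zc) / dot(xc, xc)
z_orth = [zi - proj * xi for xi, zi in zip(xc, zc)]

b_plain = slope_on_x(x, y)          # vanilla regression
b_full = slope_on_x(x, y, z)        # extra structure added as-is
b_orth = slope_on_x(x, y, z_orth)   # extra structure made orthogonal to x

print(f"plain {b_plain:.3f}, with z {b_full:.3f}, with orthogonalized z {b_orth:.3f}")
```

The orthogonalized fit reproduces the vanilla coefficient, while the unrestricted fit moves it. Whether reproducing the vanilla coefficient is desirable is exactly the assumption flagged in the bullet: it is only reasonable if nothing unmeasured is moving with the covariate.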

All of which is to say: don’t take your regression coefficients for granted when there are fancy things in your model. But also don’t avoid the fancy things because if you do that your regression coefficients won’t make any sense.

Did you think I’d leave you on a happy note?

Duality between multilevel models and multiple comparisons adjustments, and the relevance of this to some discussions of replication failures

Pascal Jordan writes:

I stumbled upon a paper which I think you might find interesting to read or discuss: “Ignored evident multiplicity harms replicability—adjusting for it offers a remedy” from Yoav Zeevi, Sofi Astashenko, Yoav Benjamini. It deals with the topic of the replication crisis. More specifically with a sort of reanalysis of the Reproducibility Project in Psychology.

There are parts which I [Jordan] think overlap with your position on forking paths: For example, the authors argue that there are numerous implicit comparisons in the original studies which are not accounted for in the reporting of the results of the studies. The paper also partly offers an explanation as to why social psychology is particularly troubled with low rates of replicability (according to the paper: the mean number of implicit comparisons is higher in social psychology than in cognitive psychology).

On the other hand, the authors adhere to the traditional hypothesis-testing paradigm (with corrections for implicit comparisons) which I know you are a critic of.

My reactions:

1. As we’ve discussed, there is a duality between multiple-comparisons corrections and hierarchical models: in both cases we are considering a distribution of possible comparisons. In practice that means we can get similar applied results using different philosophies, as long as we make use of relevant information. For example, when Zeevi et al. in the above-linked article write of “addressing multiplicity,” I would consider multilevel modeling (as here) one way of doing this.

2. I think a key cause of unreplicable results is not the number of comparisons (implicit or otherwise) but rather the size of underlying effects. Social and evolutionary psychology have been in this weird position where they design noisy studies to estimate underlying effects that are small (see here) and can be highly variable (the piranha problem). Put this together and you have a kangaroo situation. Multiple comparisons and forking paths make this all worse, as they give researchers who are clueless or unscrupulous (or both) a way of declaring statistical significance out of data that are essentially pure noise—but I think the underlying problem is that they’re using noisy experiments to study small and variable effects.

3. Another way to say it is that this is a problem of misplaced rigor. Social psychology research often uses randomized experiments and significance tests: two tools for rigor. But all the randomization in the world won’t get you external validity, and all the significance testing in the world won’t save you if the signal is low relative to the noise.

So, just to be clear, “adjusting for multiplicity” can be done using multilevel modeling without any reference to p-values; thus, the overall points of the above-linked paper are more general than any specific method they might propose (just as, conversely, the underlying ideas of any paper of mine on hierarchical modeling could be transmuted into a statement about multiple comparisons methods). And no method of analysis (whether it be p-values, hierarchical modeling, whatever) can get around the problem of studies that are too noisy for the effects they’re trying to estimate.
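A minimal simulated sketch of that duality follows. It is an illustration under assumptions chosen here, not an analysis from the Zeevi et al. paper: the number of comparisons, the spread of true effects, and the standard error are made up, the effects are assumed to be drawn from a normal distribution centered at zero, and the between-comparison variance is estimated by a simple method of moments rather than a full Bayesian fit.

```python
import math
import random
import statistics

rng = random.Random(5)
n_comparisons, tau, se = 200, 0.5, 1.0  # made-up values: small effects, noisy studies
true_effects = [rng.gauss(0.0, tau) for _ in range(n_comparisons)]
raw = [rng.gauss(theta, se) for theta in true_effects]

# Classical route: unadjusted and Bonferroni-adjusted two-sided z thresholds.
z_plain = statistics.NormalDist().inv_cdf(1 - 0.05 / 2)
z_bonf = statistics.NormalDist().inv_cdf(1 - 0.05 / (2 * n_comparisons))
claims_plain = sum(abs(r) > z_plain * se for r in raw)
claims_bonf = sum(abs(r) > z_bonf * se for r in raw)

# Multilevel route: estimate the variance of the true effects, then shrink.
tau2_hat = max(0.0, statistics.fmean([r * r for r in raw]) - se ** 2)
shrink = tau2_hat / (tau2_hat + se ** 2)  # weight on each raw estimate
post_sd = math.sqrt(shrink) * se  # posterior sd under the normal-normal model
# No adjusted threshold: the same 95% rule, applied to the shrunken estimates.
claims_multilevel = sum(abs(shrink * r) > z_plain * post_sd for r in raw)

print(claims_plain, claims_bonf, claims_multilevel)
```

The multilevel route never adjusts a threshold. It shrinks the estimates toward zero by more than it shrinks their uncertainty, so it ends up making no more claims than the unadjusted analysis, in the same spirit as a multiplicity correction. And neither route can rescue the design: when the effects are small relative to the noise, there is little left to claim.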