This post is by Leonardo Egidi.
This Bayesian fairy tale starts in July 2016 and will reveal some mysteries of the magical world of mixtures.
Opening: the airport idea
Once upon a time a young Italian statistician was dragging his big luggage through JFK airport towards the security gates. He suddenly started thinking about how to elicit a prior distribution that is flexible but at the same time contains historical information about past similar events. Thus, he wondered, “why not use a mixture of a noninformative and an informative prior in some applied regression problems?”
Magic: mixtures like three-headed Greek monsters
The guy fell in love with mixtures many years ago: the weights, the multimodality, the ‘multi-heads’ characteristic…like Fluffy, the gigantic, monstrous three-headed dog once cared for by Rubeus Hagrid in the first Harry Potter novel. Or Cerberus, the three-headed monster of Greek mythology guarding the entrance to the underworld over which the god Hades reigned. “Mixtures are so similar to Greek monsters and so full of poetic charm, aren’t they?!”
Of course his idea was not new at all: spike-and-slab priors are very popular in Bayesian variable selection and in clinical trials to avoid prior-data conflicts and get robust inferential conclusions.
He set this thought aside for some weeks, focusing on other statistical problems. However, some months later the American statistical wizard Andrew wrote an inspiring blog entry about prior choice recommendations:
What about the choice of prior distribution in a Bayesian model? The traditional approach leads to an awkward choice: either the fully informative prior (wildly unrealistic in most settings) or the noninformative prior which is supposed to give good answers for any possible parameter values (in general, feasible only in settings where data happen to be strongly informative about all parameters in your model).
We need something in between. In a world where Bayesian inference has become easier and easier for more and more complicated models (and where approximate Bayesian inference is useful in large and tangled models such as recently celebrated deep learning applications), we need prior distributions that can convey information, regularize, and suitably restrict parameter spaces (using soft rather than hard constraints, for both statistical and computational reasons).
This blog post gave him a lot of energy by reinforcing his old idea. So, he wondered, “what’s better than a mixture to represent a statistical compromise about a prior belief, combining a fully informative prior with a noninformative prior, weighted somehow in an effective way?”. As the ancient Romans used to say, in medio stat virtus. But he still needed to dig into the land of mixtures to discover some little treasures.
Obstacles and tasks: mixtures’ open issues
Despite their wide use in theoretical and applied frameworks, as far as he knew from the current literature no statistician had explored the following issues about mixture priors:
- how to compute a measure of global informativity yielded by the mixture prior (such as a measure of effective sample size, according to this definition);
- how to specify the mixture weights in a proper and automatic way (and not, say, only by fixing them upon historical experience, or by assigning them a vague hyperprior distribution) in some regression problems, such as clinical trials.
He wrestled with these questions during that cold winter. After some months he dug something out of the earth:
- the effective sample size (ESS) yielded by a mixture prior never exceeds the information carried by any individual component density of the mixture;
- Theorem 1 here quantifies the role played by the mixture weights in reducing the prior-data conflict we can expect when using the mixture prior rather than the informative prior alone. “So, yes, mixture priors are more robust also from a theoretical point of view! Until now we only knew this heuristically”.
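As a quick numerical sanity check of the first point (a toy illustration, not the formal ESS computation of the paper), we can compare the prior variance of the two-component normal mixture used later in this post with the variance of its informative component: the mixture is more dispersed, hence carries less information.

```r
# Mixture prior used later in the post: 0.8 * N(0, 2.5) + 0.2 * N(2, 0.8)
w     <- c(0.8, 0.2)   # mixture weights
mu    <- c(0, 2)       # component means
sigma <- c(2.5, 0.8)   # component standard deviations

# Mean and variance of a normal mixture (law of total variance)
mix_mean <- sum(w * mu)
mix_var  <- sum(w * (sigma^2 + mu^2)) - mix_mean^2

# The mixture is more dispersed than its informative component,
# i.e. it conveys less information:
mix_var > 0.8^2  # TRUE (mix_var is about 5.77, versus 0.64)
```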
Happy ending: a practical regression case
Similarly to the bioassay experiment analyzed by Gelman et al. (2008), he considered a small-sample example to highlight the role played by different prior distributions, including a mixture prior, in the posterior analysis. Here is the example.
Consider a dose-response model to assess immune-depressed patients’ survival according to an administered drug x, a factor with levels from 0 (placebo) to 4 (highest dose). The survival y, registered one month after the drug is administered, is coded as 1 if the patient survives, 0 otherwise. The experiment is first performed on the sample of patients y1 at time t1, where fewer than 50% of the patients survive, and then repeated on the sample y2 at time t2, where all but one patient dies; y1 and y2 are non-overlapping samples of patients.
The aim of the experimenter is to use the information from the first sample y1 to draw inferential conclusions about the second sample y2. However, the two samples are quite different from each other in terms of survival. From a clinical point of view we have two possible naive interpretations for the second sample:
- the drug is not effective, even if there was a positive effect for y1;
- regardless of the drug, the first group of patients had a much better health condition than the second one.
Both appear to be quite extreme clinical conclusions; moreover, our information is scarce, since we do not have any other influential clinical covariate, such as sex, age, or presence of comorbidities.
Consider the following data where the sample size for the two experiments is N=15:
library(rstanarm)
library(rstan)
library(bayesplot)
n <- 15
y_1 <- c(0,0,0,0,0,0,0,0,1,1,0,1,1,1,1)
y_2 <- c(1, rep(0, n-1))
Given pi ≡ Pr(yi = 1), we fit a logistic regression logit(pi) = α+βxi to the first sample, where the parameter β is associated with the administered dose of the drug, x. The five levels of the drug are randomly assigned to groups of three people each.
x <- c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))
fit <- stan_glm(y_1 ~ x, family = binomial)
print(fit)
## stan_glm
## family: binomial [logit]
## formula: y_1 ~ x
## observations: 15
## predictors: 2
## ------
##             Median MAD_SD
## (Intercept) -4.8    2.0
## x            2.0    0.8
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
Using weakly informative priors, the drug appears effective at t1: the parameter β is positive, with posterior median 2.0 and MAD SD 0.8, meaning that there is a positive effect of 2.0 on the log-odds of survival for each additional unit of the dose.
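The effect is perhaps easier to read on the odds scale (a quick check in R, plugging in the posterior median above):

```r
# A one-unit increase in the dose multiplies the survival odds by exp(beta);
# with posterior median beta = 2.0:
exp(2.0)  # roughly 7.4
```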
Now we fit the same model to the second sample, according to three different priors for β, reflecting three different ways to incorporate/use the historical information about y1:
- weakly informative prior β ∼ N(0, 2.5) ⇒ scarce historical information about y1;
- informative prior β ∼ N(2, 0.8) ⇒ relevant historical information about y1;
- mixture prior β ∼ 0.8×N(0, 2.5)+0.2×N(2, 0.8) ⇒ weighted historical information (0.2).
(We skip the details about the choice of the mixture weights for the third prior; see here for further details.)
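For concreteness, the mixture prior in the third bullet can be written down directly as a density (a minimal base-R sketch; the second and third arguments of `dnorm` are the mean and the standard deviation):

```r
# Mixture prior density for beta: 0.8 * N(0, 2.5) + 0.2 * N(2, 0.8)
mix_prior <- function(beta) {
  0.8 * dnorm(beta, mean = 0, sd = 2.5) + 0.2 * dnorm(beta, mean = 2, sd = 0.8)
}

# A convex combination of densities is still a proper density:
integrate(mix_prior, -Inf, Inf)$value  # approximately 1
```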
fit2weakly <- stan_glm(y_2 ~ x, family = binomial)
fit2inf <- stan_glm(y_2 ~ x, family = binomial,
                    prior = normal(fit$coefficients[2],
                                   fit$stan_summary[2,3]))
x_stand <- (x - mean(x))/(5*sd(x))  # standardize the dose covariate
p1 <- prior_summary(fit)
p2 <- prior_summary(fit2weakly)
stan_data <- list(N = n, y = y_2, x = x_stand,
                  mean_noninf = as.double(p2$prior$location),
                  sd_noninf = as.double(p2$prior$adjusted_scale),
                  mean_noninf_int = as.double(p2$prior_intercept$location),
                  sd_noninf_int = as.double(p2$prior_intercept$scale),
                  mean_inf = as.double(fit$coefficients[2]),
                  sd_inf = as.double(fit$stan_summary[2,3]))
fit2mix <- stan('mixture_model.stan', data = stan_data)
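The file `mixture_model.stan` is not shown in the post; a plausible sketch of what it might contain, assuming the 0.8/0.2 weights are hard-coded in the model block, is:

```stan
// Sketch of mixture_model.stan (an assumption: the actual file is not shown)
data {
  int<lower=0> N;
  int<lower=0, upper=1> y[N];
  vector[N] x;
  real mean_noninf;      real<lower=0> sd_noninf;
  real mean_noninf_int;  real<lower=0> sd_noninf_int;
  real mean_inf;         real<lower=0> sd_inf;
}
parameters {
  real alpha;
  real beta;
}
model {
  // Weakly informative prior on the intercept
  alpha ~ normal(mean_noninf_int, sd_noninf_int);
  // Mixture prior on the slope: 0.8 * noninformative + 0.2 * informative
  target += log_mix(0.2,
                    normal_lpdf(beta | mean_inf, sd_inf),
                    normal_lpdf(beta | mean_noninf, sd_noninf));
  // Logistic regression likelihood
  y ~ bernoulli_logit(alpha + beta * x);
}
```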
Let’s figure out now what the three posterior distributions suggest.
Remember that in the second sample all but one patient dies, but we do not actually know why: the informative and the weakly informative analyses suggest almost opposite conclusions about the drug’s efficacy, both of them quite unrealistic:
- the ‘informative posterior’ suggests a non-negligible positive effect of the drug ⇒ possible overestimation;
- the ‘weakly informative posterior’ suggests a strong negative effect ⇒ possible underestimation;
- the ‘mixture posterior’, which captures the prior-data conflict between the prior on β suggested by y1 and the sample y2 and lies in the middle, is more conservative and likely more reliable for the second sample in terms of clinical justification.
In this application a mixture prior combining the two extremes, the fully informative prior and the weakly informative prior, can realistically average over them and represent a sound compromise (similar examples are illustrated here) for obtaining robust inferences.
The moral lesson
The fairy tale is over. The statistician is now convinced: after digging in the land of mixture priors he found another relevant case of prior regularization.
And the moral lesson is that we should stay in between, paraphrasing the ancient in medio stat virtus, when we have small sample sizes and possible conflicts between past historical information and current data.