Webinar: Design of Statistical Modeling Software

This post is by Eric.

On Wednesday, Juho Timonen from Aalto University is stopping by to tell us about his work. You can register here.


Juho will present what he thinks is an ideal modular design for statistical modeling software, based on his experiences as a user, developer, and researcher. He will look at the structure of existing software, focusing on high-level user interfaces that use Stan as their inference engine. Juho will use the longitudinal Gaussian process modeling package lgpr and its future improvements to demonstrate different concepts and challenges related to software design and development.

About the speaker

Juho Timonen completed his Bachelor’s degree majoring in Mathematics and Systems Analysis at Aalto University, Finland. His first research project was in game theory, and in his Bachelor’s thesis, Juho studied mixed-strategy equilibria in repeated games. He went on to do his Master’s degree at Aalto, majoring in Applied Mathematics. Juho joined the Computational Systems Biology group in the CS department at Aalto, where he got involved in probabilistic modeling and did his Master’s thesis on ODE modeling of gene regulatory networks. Juho continued as a Doctoral student in the same group, where his research is focused on probabilistic modeling.

My (remote) talks at UNC biostat, 12-13 May

I was invited to give three talks at the biostatistics department of the University of North Carolina. I wasn’t sure what to talk about, so I gave them five options and asked them to choose three.

I’ll share the options with you, then you can guess which three they chose.

Here were the options:

1. All the ways that Bayes can go wrong

Probability theory is false. Weak priors give strong and implausible posteriors. If you could give me your subjective prior I wouldn’t need Bayesian inference. The best predictive model averaging is non-Bayesian. There will always be a need to improve our models. Nonetheless, we still find Bayesian inference to be useful. How can we make the best use of Bayesian methods in light of all their flaws?

2. Piranhas, kangaroos, and the failure of apparent open-mindedness: The connection between unreplicable research and the push-a-button, take-a-pill model of science

There is a replication crisis in much of science, and the resulting discussion has focused on issues of procedure (preregistration, publication incentives, and so forth) and statistical concepts such as p-values and statistical significance. But what about the scientific theories that were propped up by these unreplicable findings–what can we say about them? Many of these theories correspond to a simplistic view of the world which we argue is internally inconsistent (the piranha problem) involving quantities that cannot be accurately learned from data (the kangaroo problem). We discuss connections between these theoretical and statistical issues and argue that it has been a mistake to consider each of these studies and literatures on their own.

3. From sampling and causal inference to policy analysis: Interactions and the challenges of generalization

The three central challenges of statistics are generalizing from sample to population, generalizing from control to treated group, and generalizing from observed data to underlying constructs of interest. These are associated with separate problems of sampling, causal inference, and measurement, but in real decision problems all three issues arise. We discuss the way in which varying treatment effects (interactions) bring sampling concerns into causal inference, along with the real challenges of applying this insight to real problems. We consider applications in medical studies, A/B testing, social science research, and policy analysis.

4. Statistical workflow

Statistical modeling has three steps: model building, inference, and model checking, followed by possible improvements to the model and new data that allow the cycle to continue. But we have recently become aware of many other steps of statistical workflow, including simulated-data experimentation, model exploration and understanding, and visualizing models in relation to each other. Tools such as data graphics, sensitivity analysis, and predictive model evaluation can be used within the context of a topology of models, so that data analysis is a process akin to scientific exploration. We discuss these ideas of dynamic workflow along with the seemingly opposed idea that statistics is the science of defaults. We need to expand our idea of what data analysis is, in order to make the best use of all the new techniques being developed in statistical modeling and computation.

5. Putting it all together: Creating a statistics course combining modern topics with active student engagement

We envision creating a new introductory statistics course, combining several innovations: (a) a new textbook focusing on modeling, visualization, and computing, rather than estimation, testing, and mathematics; (b) exams and homework exercises following this perspective; (c) drills for in-class practice and learning; and (d) class-participation activities and discussion problems. We will discuss what’s been getting in the way of this happening already, along with our progress creating a collection of stories, class-participation activities, and computer demonstrations for a two-semester course on applied regression and causal inference.

OK, time to guess which three talks they picked. . . .
Continue reading

Probabilistic feature analysis of facial perception of emotions

With Michel Meulders, Paul De Boeck, and Iven Van Mechelen, from 2005 (but the research was done several years earlier):

According to the hypothesis of configural encoding, the spatial relationships between the parts of the face function as an additional source of information in the facial perception of emotions. The paper analyses experimental data on the perception of emotion to investigate whether there is evidence for configural encoding in the processing of facial expressions. It is argued that analysis with a probabilistic feature model has several advantages that are not implied by, for example, a generalized linear modelling approach. First, the probabilistic feature model allows us to extract empirically the facial features that are relevant in processing the face, rather than focusing on the features that were manipulated in the experiment. Second, the probabilistic feature model allows a direct test of the hypothesis of configural encoding as it explicitly formalizes a mechanism for the way in which information about separate facial features is combined in processing the face. Third, the model allows us to account for a complex data structure while still yielding parameters that have a straightforward interpretation.

We should not have listed the emotions in alphabetical order:

Hey, check this out, it’s really cool: A Bayesian framework for interpreting findings from impact evaluations.

Chris Mead points us to a new document by John Deke, Mariel Finucane, and Daniel Thal, prepared for the Department of Education’s Institute of Education Sciences. It’s called The BASIE (BAyeSian Interpretation of Estimates) Framework for Interpreting Findings from Impact Evaluations: A Practical Guide for Education Researchers, and here’s the summary:

BASIE is a framework for interpreting impact estimates from evaluations. It is an alternative to null hypothesis significance testing. This guide walks researchers through the key steps of applying BASIE, including selecting prior evidence, reporting impact estimates, interpreting impact estimates, and conducting sensitivity analyses. The guide also provides conceptual and technical details for evaluation methodologists.

I looove this, not just all the Bayesian stuff but also the respect it shows for the traditional goals of null hypothesis significance testing. They’re offering a replacement, not just an alternative.

Also they do good with the details. For example:

Probability is the key tool we need to assess uncertainty. By looking across multiple events, we can calculate what fraction of events had different types of outcomes and use that information to make better decisions. This fraction is an estimate of probability called a relative frequency. . . .

The prior distribution. In general, the prior distribution represents all previously available information regarding a parameter of interest. . . .

I really like that they express this in terms of “evidence” and “information” rather than “belief.”

They also discuss graphical displays and communication that is both clear and accurate; for example recommending summaries such as, “We estimate a 75 percent probability that the intervention increased reading test scores by at least 0.15 standard deviations, given our estimates and prior evidence on the impacts of reading programs for elementary school students.”
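A probability statement like the one they recommend can be read directly off a posterior distribution. Here is a minimal sketch, assuming (hypothetically) a normal posterior for the effect; the mean and standard deviation below are made up for illustration and are not taken from the report:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Hypothetical posterior for the intervention effect on test scores (in sd units).
post_mean, post_sd = 0.20, 0.07  # illustrative values only

# Posterior probability that the effect is at least 0.15 sd
p_at_least = 1.0 - normal_cdf(0.15, post_mean, post_sd)
print(f"P(effect >= 0.15 sd) = {p_at_least:.2f}")
```

With these made-up posterior values the statement comes out near 75 percent, which is the kind of summary the guide recommends reporting.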

And it all runs in Stan, which is great partly because Stan is transparent and open-source and has good convergence diagnostics and a big user base and is all-around reliable for Bayesian inference, and also because Stan models are extendable: you can start with a simple hierarchical regression and then add measurement error, mixture components, and whatever else you want.

And this:

Local Stop: Why we do not recommend the flat prior

A prior that used to be very popular in Bayesian analysis is called the flat prior (also known as the improper uniform distribution). The flat prior has infinite variance (instead of a bell curve, a flat line). It was seen as objective because it assigns equal prior probability to all possible values of the impact; for example, impacts on test scores of 0, 0.1, 1, 10, and 100 percentile points are all treated as equally plausible.

When probability is defined in terms of belief rather than evidence, the flat prior might seem reasonable—one might imagine that the flat prior reflects the most impartial belief possible (Gelman et al., 2013, Section 2.8). As such, this prior was de rigueur for decades.

But when probability is based on evidence, the implausibility of the flat prior becomes apparent. For example, what evidence exists to support the notion that impacts on test scores of 0, 0.1, 1, 10, and 100 percentile points are all equally probable? No such evidence exists; in fact, quite a bit of evidence is completely inconsistent with this prior (for example, the distribution of impact estimates in the WWC [What Works Clearinghouse]). The practical implication is that the flat prior overestimates the probability of large effects. Following Gelman and Weakliem (2009), we reject the flat prior because it has no basis in evidence.

The implausibility of the flat prior also has an interesting connection to the misinterpretation of p-values. It turns out that the Bayesian posterior probability derived under a flat prior is identical (for simple models, at least) to a one-sided p-value. Therefore, if researchers switch to Bayesian methods but use a flat prior, they will likely continue to exaggerate the probability of large program effects (which is a common result when misinterpreting p-values). . . .
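The equivalence mentioned in that last paragraph is easy to check numerically. Under a flat prior and a normal likelihood, the posterior for the effect is normal and centered at the estimate, so the posterior probability of a negative effect equals the one-sided p-value. A sketch with made-up numbers:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

est, se = 0.30, 0.20  # hypothetical impact estimate and standard error

# Classical one-sided p-value for H0: effect <= 0
p_one_sided = 1.0 - normal_cdf(est / se)

# Posterior P(effect <= 0) under a flat prior: the posterior is N(est, se)
p_negative = normal_cdf((0.0 - est) / se)

# The two numbers agree exactly, by the symmetry of the normal distribution.
print(p_one_sided, p_negative)
```

So a researcher who "goes Bayesian" but keeps the flat prior has just relabeled the one-sided p-value as a posterior probability.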

Yes, I’m happy they cite me, but the real point here is that they’re thinking in terms of modeling and evidence, also that they’re connecting to important principles in non-Bayesian inference. As the saying goes, there’s nothing so practical as a good theory.

What makes me particularly happy is the way in which Stan is enabling applied modeling.

This is not to say that all our problems are solved. Once we do cleaner inference, we realize the limitations of experimental data: with between-person studies, sample sizes are never large enough to get stable estimates of interactions of interest (recall the rule that you need 16 times the sample size to estimate an interaction as to estimate a main effect), which implies the need for . . . more modeling, as well as open recognition of uncertainty in decision making. So lots more to think about going forward.

Full disclosure: My research is funded by the Institute of Education Sciences, and I know the authors of the above-linked report.

Warner Music Group wants to hire Bayesians!

For reals!

They’re hiring an Operations Research Optimization Scientist and a Vice President Data Science, who will:

• Instantiate & Productionize a suite of enterprise level models and applications that service the global WMG business. In particular: data pipelines, production-ready models for forecasting, API interfaces for UI/UX
• Lead expertise in productionization of a Bayesian predicting suite and in Bayesian model workflow.
• Hands-on assist with business facing imperatives: ad hoc research requests, statistical models and knowledge

See here for further background. I can’t confirm that if you take this job you’ll get to meet the members of Nickelback, but I can’t confirm that this won’t happen either.

Jamaican me crazy: the return of the overestimated effect of early childhood intervention

A colleague sent me an email with the above title and the following content:

We were talking about Jamaica childhood intervention study. The Science paper on returns to the intervention 20 years later found a 25% increase but an earlier draft had reported a 42% increase. See here.

Well, it turns out the same authors are back in the stratosphere! In a Sept 2021 preprint, they report a 43% increase, but now 30 rather than 20 years after the intervention (see abstract). It seems to be the same dataset and they again appear to have a p-value right around the threshold (I think this is the 0.04 in the first row of Table 1 but I did not check super carefully).

Of course, no mention I could find of selection effects, the statistical significance filter, Type M errors, the winner’s curse or whatever term you want to use for it…

From the abstract of the new article:

We find large and statistically significant effects on income and schooling; the treatment group had 43% higher hourly wages and 37% higher earnings than the control group.

The usual focus of economists is earnings, not wages, so let’s go with that 37% number. It’s there in the second row of Table 1 of the new paper: the estimate is 0.37 with a standard error of . . . ummmm, it doesn’t give a standard error but it gives a t statistic of 1.68—


B-b-b-but in the abstract it says this difference is “statistically significant”! I always thought that to be statistically significant the estimate had to be at least 2 standard errors from zero . . .

They have some discussion of some complicated nonparametric tests that they do, but if your headline number is only 1.68 standard errors away from zero, asymptotic theory is the least of your problems, buddy. Going through page 9 of the paper, it’s kind of amazing how much high-tech statistics and econometrics they’re throwing at this simple comparison.

Anyway, their estimate is 0.37 with standard error 0.37/1.68 = 0.22, so the 95% confidence interval is [0.37 +/- 2*0.22] = [-0.07, 0.81]. But it’s “statistically significant” cos the 1-sided p-value is 0.05. Whatever. I don’t really care about statistical significance anyway. It’s just kinda funny that, after all that effort, they had to punt on the p-value like that.
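The arithmetic above is easy to reproduce from the two numbers the paper does report, the estimate 0.37 and the t statistic 1.68 (using a normal approximation for the tail probability):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

est, t_stat = 0.37, 1.68
se = est / t_stat                              # back out the standard error: about 0.22
ci_low, ci_high = est - 2 * se, est + 2 * se   # rough 95% interval

p_one_sided = 1.0 - normal_cdf(t_stat)         # about 0.05
print(f"SE = {se:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}], "
      f"one-sided p = {p_one_sided:.3f}")
```

The interval straddles zero, and the one-sided p-value sits right at the 0.05 boundary, which is the whole story here.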

Going back to the 2014 paper, I came across this bit:

I guess that any p-value less than 0.10 is statistically significant. That’s fine; they should just get on the horn with the ORBITA stents people, because their study, when analyzed appropriately, ended up with p = 0.09, and that wasn’t considered statistically significant at all; it was considered evidence of no effect.

I guess the rule is that, if you’re lucky enough to get a result between 0.05 and 0.10, you get to pick the conclusion based on what you want to say: if you want to emphasize it, call it statistically significant; if not, call it non-significant. Or you can always fudge it by using a term like “suggestive.” In the above snippet they said the treatment “may have” improved skills and that treatment “is associated with” migration. I wonder if that phrasing was a concession to the fat p-value of 0.09. If the statistic had been at more conventionally attractive size 0.05 or below, maybe they would’ve felt free to break out the causal language.

But . . . it’s kind of funny for me to be riffing on p-values and statistical significance, given that I don’t even like p-values and statistical significance. I’m on record as saying that everything should be published and there should be no significance threshold. And I would not want to “threshold” any of this work either. Publish it all!

There are two places where I would diverge from these authors. The first is in their air of certainty. Rather than saying a “large and statistically significant effect” of 37%, I’d say an estimate of 37% with a standard error of 22%, or just give the 95% interval like they do in public health studies. JAMA would never let you get away with just giving the point estimate like that! Seeing this uncertainty tells you a few things: (a) the data are compatible (to use Sander Greenland’s term) with a null effect, (b) if the effect is positive, it could be all over the place, so it’s misleading as fiddlesticks to call it “a substantial increase” over a previous estimate of 25%, and (c) it’s empty to call this a “large” effect: with this big of a standard error, it would have to be “large” or it would not be “statistically significant.” To put it another way, instead of the impressive-sounding adjective “large” (which is clearly not necessarily the case, given that the confidence interval includes zero), it would be more accurate to use the less-impressive-sounding adjective “noisy.” Similarly, their statement, “Our results confirm large economic returns . . .”, seems a bit irresponsible given that their data are consistent with small or zero economic returns.

The second place I’d diverge from the authors is in the point estimate. They use a data summary of 37%. This is fine as a descriptive data summary, but if we’re talking policy, I’d like some estimate of treatment effect, which means I’d like to do some partial pooling with respect to some prior, and just about any reasonable prior will partially pool this estimate toward 0.
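To see what partial pooling would do here, consider the standard normal-normal shrinkage calculation. The prior scale below is a hypothetical choice for illustration, not one the authors or I have committed to:

```python
# Normal-normal partial pooling: data estimate y with standard error se,
# and a N(0, tau^2) prior on the treatment effect.
y, se = 0.37, 0.22   # estimate and standard error from the paper
tau = 0.10           # hypothetical prior sd for plausible program effects

# Precision-weighted posterior mean and standard deviation
post_prec = 1 / se**2 + 1 / tau**2
post_mean = (y / se**2) / post_prec
post_sd = post_prec ** -0.5

print(f"pooled estimate = {post_mean:.2f}, sd = {post_sd:.2f}")
```

Because the data are so noisy relative to this prior, the pooled estimate lands much closer to zero than to 0.37. A wider prior would shrink less; the qualitative point is that just about any reasonable prior pulls the estimate down.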

Ok, look. Lots of people don’t like Bayesian inference, and if you don’t want to use a prior, I can’t make you do it. But then you have to recognize that reporting the simple comparison, conditional on statistical significance (however you define it) will give you a biased estimate, as discussed on pages 17-18 of this article. Unfortunately, that article appeared in a psychology journal so you can’t expect a bunch of economists to have heard about it, but, hey, I’ve been blogging about this for years, nearly a decade, actually (see more here). Other people have written about this winner’s curse thing too. And I’ve sent a couple emails to the first author of the paper pointing out this bias issue. Anyway, my preference would be to give a Bayesian or regularized treatment effect estimator, but if you don’t want to do that, then at least report some estimate of the bias of the estimator that you are using. The good news is, the looser your significance threshold, the lower your bias!
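The bias from reporting estimates conditional on significance is easy to see in simulation. A sketch, assuming for the sake of argument a true effect of 0.10 and this study’s standard error of 0.22 (the 1.64 threshold roughly matches the loose one-sided rule in play here):

```python
import random

random.seed(1)
true_effect, se = 0.10, 0.22
threshold = 1.64  # z threshold, roughly a one-sided p < 0.05 rule

# Simulate many replications of the study and keep the "significant" ones
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]
significant = [e for e in estimates if abs(e) / se > threshold]

mean_sig = sum(significant) / len(significant)
print(f"mean estimate overall: {sum(estimates) / len(estimates):.3f}")
print(f"mean estimate among 'significant' results: {mean_sig:.3f}")
```

The average of the significant estimates comes out several times the true effect: that multiple is the Type M error, or winner’s curse, in action.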

But . . . it’s early childhood intervention! Don’t you care about the children???? you may ask. My response: I do care about the children, and early childhood intervention could be a great idea. It could be great even if it doesn’t raise adult earnings at all, or if it raises adult earnings by an amount that’s undetectable by this noisy study.

Think about it this way. Suppose the intervention has a true effect that it raises earnings by an average of 10%. That’s a big deal, maybe not so much for an individual, but an average effect of 10% is a lot. Consider that some people won’t be helped at all—that’s just how things go—so an average of 10% implies that some people would be helped a whole lot. Anyway, this is a study where the standard deviation of the estimated effect is 0.22, that is, 22%. If the average effect is 10% and the standard error is 22%, then the study has very low power, and it’s unlikely that a preregistered analysis would result in statistical significance, even at the 0.1 or 0.2 level or whatever it is that these folks are using. But, in this hypothetical world, the treatment would be awesome.
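The low power claimed in the paragraph above can be computed directly. A sketch, assuming a true average effect of 0.10, a standard error of 0.22, and a conventional two-sided 0.05 test:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

true_effect, se = 0.10, 0.22
z_crit = 1.96  # two-sided alpha = 0.05

# Power: probability the estimate lands beyond +/- 1.96 standard errors
shift = true_effect / se
power = (1.0 - normal_cdf(z_crit - shift)) + normal_cdf(-z_crit - shift)
print(f"power = {power:.2f}")
```

The power comes out well under 10 percent, so even an awesome treatment would usually fail to reach significance in a study this noisy.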

My point is, there’s no shame in admitting uncertainty! The point estimate is positive; that’s great. There’s a lot of uncertainty, and the data are consistent with a small, tiny, zero, or even negative effect. That’s just the way things go when you have noisy data. As quantitative social scientists, we can (a) care about the kids, (b) recognize that this evaluation leaves us with lots of uncertainty, and (c) give this information to policymakers and let them take it from there. I feel no moral obligation to overstate the evidence, overestimate the effect size, and understate my uncertainty.

It’s so frustrating, how many prominent academics just can’t handle criticism. I guess they feel that they’re in the right and that all this stat stuff is just a bunch of paperwork. And in this case they’re doing the Lord’s work, saving the children, so anything goes. It’s the Armstrong principle over and over again.

And in this particular case, as my colleague points out, it is not just that they are not acknowledging or dealing with criticism in the prior paper, but here they are actively repeating the very same error with the very same study/data when they have been made aware of it on more than one occasion in this new paper and not acknowledging the issue at all. Makes me want to scream.

P.S. When asked whether I could share my colleague’s name, my colleague replied:

Regarding quoting me, do recall that I live in the midwest and have to walk across parking lots from time to time. So please do so anonymously.

Fair enough. I don’t want to get anybody hurt.

What should he read to pivot into research with a Bayesian focus?

Someone who wishes to remain anonymous writes:

I am currently finishing my PhD in statistics/biostatistics. I have recently accepted a postdoc to work on Bayesian clinical trials.

I have done all of my dissertation work in frequentist statistics. I learned about the basics of Bayesian methods in a number of my courses, but haven’t done formal research using many of these methods. I was wondering if you had any advice/resources for someone with a strong statistics background, but wants to prepare to pivot into research with a Bayesian focus. Particularly a good resource for somebody wishing to learn how to code their own samplers etc.

My reply:

I recommend our Bayesian Data Analysis book and also McElreath’s book Statistical Rethinking. You should not need to be coding your own samplers. You can do that in Stan. You can take a look at the Stan user’s guide and the Stan case studies, also our Bayesian workflow paper.

Confidence intervals, compatibility intervals, uncertainty intervals

“Communicating uncertainty is not just about recognizing its existence; it is also about placing that uncertainty within a larger web of conditional probability statements. . . . No model can include all such factors, thus all forecasts are conditional.” – Gelman, Hullman, Wlezien, and Morris (2020).

A couple years ago Sander Greenland and I published a discussion about renaming confidence intervals.

Confidence intervals

Neither of us likes the classical term, “confidence intervals,” for two reasons. First, the classical definition (a procedure that produces an interval which, under the stated assumptions, includes the true value at least 95% of the time in the long run) is not typically what is of interest when performing statistical inference. Second, the assumptions are wrong: as I put it, “Confidence intervals excluding the true value can result from failures in model assumptions (as we’ve found when assessing U.S. election polls) or from analysts seeking out statistically significant comparisons to report, thus inducing selection bias.”

Uncertainty intervals

I recommended the term “uncertainty intervals,” on the grounds that the way confidence intervals are used in practice is to express uncertainty about an inference. The wider the interval, the more uncertainty.

But Sander doesn’t like the label “uncertainty interval”; as he puts it, “the word ‘uncertainty’ gives the illusion that the interval properly accounts for all important uncertainties . . . misrepresenting uncertainty as if it were a known quantity.”

Compatibility intervals

Sander instead recommends the term “compatibility interval,” following the reasoning that the points outside the interval are outside because they are incompatible with the data and model (in a stochastic sense) and the points inside are compatible with our data and assumptions. What Sander says makes sense.

The missing point in both my article and Sander’s is how the different concepts fit together. As with many areas in mathematics, I think what’s going on is that a single object serves multiple functions, and it can be helpful to disentangle these different roles. Regarding interval estimation, this is something that I’ve been mulling over for many years, but it did not become clear to me until I started thinking hard about my discussion with Sander.

Purposes of interval estimation

Here’s the key point. Statistical intervals (whether they be confidence intervals or posterior intervals or bootstrap intervals or whatever) serve multiple purposes. One purpose they serve is to express uncertainty in a point estimate; another purpose is to (probabilistically) rule out values outside the interval; yet another purpose is to tell us that values inside the interval are compatible with the data. The first of these goals corresponds to the uncertainty interval; the second and third correspond to the compatibility interval.

In a simple case such as linear regression or a well-behaved asymptotic estimate, all three goals are served by the same interval. In more complicated cases, no interval will serve all these purposes.

I’ll illustrate with a scenario that arose in a problem I worked on a bit over 30 years ago, and discussed here:

Sometimes you can get a reasonable confidence interval by inverting a hypothesis test. For example, the z or t test or, more generally, inference for a location parameter. But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above. Once you hit rejection, you suddenly go from a very tiny precise confidence interval to no interval at all. To put it another way, as your fit gets gradually worse, the inference from your confidence interval becomes more and more precise and then suddenly, discontinuously has no precision at all. (With an empty interval, you’d say that the model rejects and thus you can say nothing based on the model. You wouldn’t just say your interval is, say, [3.184, 3.184] so that your parameter is known exactly.)

For our discussion here, the relevant point is that, if you believe your error model, this is a fine procedure for creating a compatibility interval—as your data becomes harder and harder to explain from the model, the compatibility interval becomes smaller and smaller, until it eventually becomes empty. That’s just fine; it makes sense; it’s how compatibility intervals should be.

But as an uncertainty interval, it’s terrible. Your model fits worse and worse, your uncertainty gets smaller and smaller, and then suddenly the interval becomes empty and you have no uncertainty statement at all—you just reject the model.
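The discontinuity described above can be sketched with a toy model: n observations assumed to share a common mean theta, with known measurement sd sigma. Inverting the chi-square goodness-of-fit test, the interval is the set of theta values with sum((y_i - theta)^2)/sigma^2 below the critical value. Its width shrinks as the data spread out, then it suddenly becomes empty:

```python
from math import sqrt

def inverted_test_interval(y, sigma, crit):
    """Interval of theta values not rejected by the chi-square fit test:
    {theta : sum((y_i - theta)^2) / sigma^2 <= crit}."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((yi - ybar) ** 2 for yi in y)  # lack-of-fit term, independent of theta
    slack = crit * sigma ** 2 - ss
    if slack < 0:
        return None  # model rejected: empty interval
    half = sqrt(slack / n)
    return (ybar - half, ybar + half)

crit = 11.07  # 95% point of the chi-square distribution with 5 df (n = 5)
sigma = 1.0
for spread in (0.5, 1.0, 1.4, 2.0):
    y = [3.0 + spread * d for d in (-2, -1, 0, 1, 2)]  # widen the data
    print(spread, inverted_test_interval(y, sigma, crit))
```

As the spread grows, the interval tightens, then vanishes entirely: fine behavior for a compatibility interval, perverse behavior for an uncertainty interval.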

At this point Sander might stand up and say, Hey! That’s the point! You can’t get an uncertainty interval here so you should just be happy with the compatibility interval. To which I’d reply: Sure, but often the uncertainty interval isn’t what people want. To which Sander might reply: Yeah, but as statisticians we shouldn’t be giving people what we want, we should be giving people what we can legitimately give them. To which I’d reply: in decision problems, I want uncertainty. I know my uncertainty statements aren’t perfect, I know they’re based on assumptions, but that just pushes me to check my assumptions, etc. Ok, this argument could go on forever, so let me just return to my point that uncertainty and compatibility are two different (although connected) issues.

All intervals are conditional on assumptions

There’s one thing I disagree with in Sander’s article, though, and that’s his statement that “compatibility” is a more modest term than “confidence” or “uncertainty.” My take on this is that all these terms are mathematically valid within their assumptions, and none are in general valid when the assumptions are false. When the assumptions of model and sampling and reporting are false, there’s no reason to expect 95% intervals to contain the true value 95% of the time (hence, no confidence property), there’s no reason to think they will fully capture our uncertainty (hence, “uncertainty interval” is not correct), and no reason to think that the points inside the interval are compatible with the data and that the points outside are not compatible (hence, “compatibility interval” is also wrong).

All of these intervals represent mathematical statements and are conditional on assumptions, no matter how you translate them into words.

And that brings us to the quote from Jessica, Chris, Elliott, and me at the top of this post, from a paper on information, incentives, and goals in election forecasts, an example in which the most important uncertainties arise from nonsampling error.

All intervals are conditional on assumptions (which are sometimes called “guarantees”). Calling your interval an uncertainty interval or a compatibility interval doesn’t make that go away, any more than calling your probabilities “subjective” or “objective” absolves you from concerns about calibration.

“Why the New Pollution Literature is Credible” . . . but I’m still guessing that the effects are being overestimated

In a post entitled, “Why the New Pollution Literature is Credible,” Alex Tabarrok writes:

My recent post, Air Pollution Reduces Health and Wealth drew some pushback in the comments, some justified, some not, on whether the results of these studies are not subject to p-hacking, forking gardens and the replication crisis. Sure, of course, some of them are. . . . Nevertheless, I don’t think that skepticism about the general thrust of the results is justified. Why not?

First . . . my rule is trust literatures not papers and the new pollution literature is showing consistent and significant negative effects of pollution on health and wealth. . . . It’s not just that the literature is large, however, it’s that the literature is consistent in a way that many studies in say social psychology were not. In social psychology, for example, there were many tests of entirely different hypotheses—power posing, priming, stereotype threat—and most of these failed to replicate. But in the pollution literature we have many tests of the same hypotheses. We have, for example, studies showing that pollution reduces the quality of chess moves in high-stakes matches, that it reduces worker productivity in Chinese call-centers, and that it reduces test scores in American and in British schools. . . . from different researchers studying different times and places using different methods but they are all testing the same hypothesis, namely that pollution reduces cognitive ability. . . .

Another feature in favor of the air pollution literature is that the hypothesis that pollution can have negative effects on health and cognition wasn’t invented yesterday . . . The Romans, for example, noted the negative effect of air pollution on health. There’s a reason why people with lung disease move to the countryside and always have.

I also noted in Why Most Published Research Findings are False that multiple sources and types of evidence are desirable. The pollution literature satisfies this desideratum. Aside from multiple empirical studies, the pollution hypothesis is also consistent with plausible mechanisms . . .

Moreover, there is a clear dose-response effect, so much so that when it comes to “extreme” pollution few people doubt the hypothesis. Does anyone doubt, for example, that an infant born in Delhi, India (one of the most polluted cities in the world) is more likely to die young than if the same infant grew up (all else equal) in Wellington, New Zealand (one of the least polluted cities in the world)? . . .

What is new about the new pollution literature is more credible methods and bigger data and what the literature shows is that the effects of pollution are larger than we thought at lower levels than we thought. But we should expect to find smaller effects with better methods and bigger data. . . . this isn’t guaranteed, there could be positive effects of pollution at lower levels, but it isn’t surprising that what we are seeing so far is negative effects at levels previously considered acceptable.

Thus, while I have no doubt that some of the papers in the new pollution literature are in error, I also think that the large number of high-quality papers from different times and places, broadly consistent with one another, consistent with what we know about human physiology and particulate matter, consistent with the literature on the effects of pollution on animals and plants, and consistent with a dose-response relationship, suggests that we take this literature and its conclusion that air pollution has significant negative effects on health and wealth very seriously.

This all makes a lot of sense—enough so that I quoted large chunks of Tabarrok’s post.

Still, I think actual effects will be quite a bit lower than claimed in the literature. Yes, it’s appropriate to look at the literature, not just individual studies. But if each individual study is biased, that will bias the literature. You can think of Alex’s two posts on the effects of air pollution as a sort of informal meta-analysis, and it’s a meta-analysis that does not correct for selection bias within each published study. Again, his general points (both methodologically and with regard to air pollution in particular) make sense; I just think there’s a bias he’s not correcting for. When we talk about forking paths etc. it’s typically in settings where there’s essentially zero signal (more precisely, settings where any signal is overwhelmed by noise) and people are finding patterns out of nothing—but the same general errors can lead to real effects being greatly overestimated.

P.S. The comments to Tabarrok’s post are pretty wack. Some reasonable points but lots and lots of people just overwhelmed by political ideology.

International Workshop on Statistical Modelling – IWSM 2022 in Trieste (Italy)

I am glad to announce that the next International Workshop on Statistical Modelling (IWSM), the major activity of the Statistical Modelling Society, will take place in Trieste, Italy, from July 18 to July 22, 2022, organized by the University of Trieste.

The conference will be preceded by the short course “Statistical Modelling of Football Data” by Ioannis Ntzoufras (AUEB) and Leonardo Egidi (Univ. of Trieste) on July 17th. The course is based on Stan and is aimed at people with a minimal statistical/mathematical background.

Interested participants may register, choosing among the following options:

  • whole conference
  • conference + short course
  • short course

Any information about registration and fees can be found here. The call-for-papers deadline for submitting a 4-page abstract is April 4th (likely to be extended). For more information, visit the IWSM 2022 website.

Stay tuned, and share this event with anyone who may be interested in the conference.

Postdoc and research software engineer positions at Aalto University, Finland

This job ad is by Aki

I (Aki) have 1-2 postdoc positions open in my group for developing Bayesian inference methods, diagnostics, computation, workflow, probabilistic programming tools, teaching Bayesian data analysis, etc. If you’ve been following this blog, you probably know what kind of things Andrew and I work on. You can also check some of my talks and my recent publications. The specific tasks will be agreed upon based on your background, interests, and future goals. Aalto University and the nearby University of Helsinki have a big and active Bayesian and machine learning community, Finland has been ranked several times as the happiest country in the world, and Helsinki is among the most liveable cities. We collaborate actively with Stan, ArviZ, and PyMC developers, so the methods we develop will have wide impact. There is no application deadline and no official call page for these positions (we also have those twice per year). You can find my contact information on my web page. Knowledge of Bayesian methods is a must, and experience with free software is a big plus. The contract can be for 1-2 years, with an option for extension.

Aalto University also has an open permanent position for a research software engineer. Aalto Scientific Computing is an elite “special forces” unit of Research IT, providing high-performance computing hardware, management, research support, teaching, and training. The team consists of a core of PhD-level staff working with top researchers throughout the university. All the work is open-source by default, and they take an active part in worldwide projects. They are looking for people 1) with a PhD degree and research experience in some computational field, or 2) software developers or computational scientists with a strong software/open-source/Linux background, scientific computing experience, some research experience, and a Master’s degree or equivalent. This particular call emphasizes the ability to work in machine learning and AI environments. The ideal candidate will work closely with machine learning researchers, so a background in machine learning is highly desirable. They count Bayesian probabilistic programming as part of machine learning, so you could also end up helping my group. See more information and how to apply.

PhD student position in Stuttgart . . . to work on Stan! Latent variable models in psychology and the social sciences. With Paul “brms” Buerkner!

Paul Buerkner writes:

A lot of concepts in psychology and the social sciences can be formulated in terms of latent variables measured indirectly by observable data. However, existing statistical approaches remain limited in how they can express latent variables and relate them to each other. The goal of this project is to build advanced statistical models that better respect the probabilistic structures of latent variables and thus allow one to obtain improved insights and predictions based on such variables. The primary goal of the proposed research is to develop a framework for Bayesian distributional latent variable models (BD-LVMs) that combines the principles of item response theory (IRT) and structural equation modeling (SEM) with the flexibility of distributional regression, powered by modern Bayesian estimation methods. Throughout the project, we will make extensive use of Stan and will later integrate the developed methods into brms as well.

For more details about the position, please see https://www.stellenwerk.de/stuttgart/jobboerse/phd-student-position-mfd-part-time-75prozent-payment-according-to-e13-tv-l-temporary-for-the-duration-of-3-years-220304-82653/

Looks like fun!

Yuling’s talk, “From mixture model to continuous model extension to operator-based probabilistic inference”

Unfortunately, Yuling‘s talk has already happened, but the abstract is so intriguing I had to share it with you:

The standard paradigm of Bayesian statistics starts with “Let there be a joint distribution.” But was there a joint distribution? The premise of a joint distribution over data and parameters is at best artificial: latent variables don’t always exist; the model is almost always wrong, hence the assumption of a true fixed parameter is not meaningful; and leaving a particular observation unmeasured is not always the same as treating it as uncertain in a joint distribution. A safer view is to recognize only the likelihood, as a sequence of explanations of the data. To this end, I propose a general probabilistic inference paradigm formulated from a larger encompassing model that includes individual sampling models as special cases, where the familiar Bayes is recovered when the operator is chosen to be a mixture. Mixtures play a key role in Bayesian inference: the posterior predictive distribution is always a mixture of sampling distributions. I will discuss alternative operators, such as convolution, the geometric bridge, and quantum superposition. To better understand the additive model, we shall first understand what the addition operator is.
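To make the operator idea concrete, here is a toy numerical sketch (in Python, with made-up components; this is not from Yuling's talk) contrasting the mixture of two sampling distributions with their convolution:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# two hypothetical "explanations" of the data: N(0, 1) and N(4, 1)
a = rng.normal(0.0, 1.0, n)
b = rng.normal(4.0, 1.0, n)

# mixture operator (the familiar Bayesian case): pick one component per draw
pick = rng.random(n) < 0.5
mixture = np.where(pick, a, b)

# convolution operator: add independent draws from the two components
convolution = a + b

# both have mean 2, but the shapes differ:
# the mixture is bimodal with variance 1 + 4 = 5,
# the convolution is a unimodal normal with variance 1 + 1 = 2
print(round(mixture.var(), 1), round(convolution.var(), 1))
```

Same two components, very different combined distributions, which is the point: the choice of operator is itself a modeling decision.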

“Should a dog’s sniff be enough to convict a person of murder?”

Alon Honig points to this news article by Peter Smith and writes:

Beyond the fact that this seems crazy (why would anyone convict based on the actions of a dog?), I am wondering how you would think about this from a Bayesian perspective:

P(murderer | dog visited) = P(dog visited | murderer) × P(murderer) / P(dog visited)

What values for these parameters would the jurors have to hold for this to make sense?

I have no idea! But I’m posting this, first because we do so many cat pictures here that it’s only fair for a dog to be the star of the story for once, and second because it’s refreshing to read an article mentioning the University of California law school that doesn’t bring up that notorious torture apologist and political hack. It’s good to know they have other faculty there.
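For what it’s worth, here’s a toy version of that calculation (every number is invented for illustration; nothing here is from the actual case):

```python
# toy Bayes-rule calculation for P(murderer | dog alerted);
# all numbers below are invented for illustration only

def posterior(sensitivity, false_positive_rate, prior):
    """P(murderer | alert), by Bayes' rule."""
    p_alert = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_alert

# even a seemingly accurate dog (90% sensitivity, 5% false positives)
# combined with a 1% prior gives a posterior of only about 0.15,
# nowhere near "beyond a reasonable doubt"
print(round(posterior(0.90, 0.05, 0.01), 3))  # -> 0.154
```

The posterior is driven almost entirely by the dog’s false-positive rate and the prior, neither of which a jury is in any position to know.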

An Easy Layup for Stan

This post is by Phil Price, not Andrew.

The tl;dr version of this is: I had a statistical problem that ended up calling for a Bayesian hierarchical model. I decided to implement it in Stan. Even though it’s a pretty simple model and I’ve done some Stan modeling before, I thought it would take at least several hours for me to get a model I was happy with, but that wasn’t the case. Right tool for the job. Thanks, Stan team!

Longer version follows.

I have a friend who routinely plays me for a chump. Fool me five times, shame on me. The guy is in finance, and every few years he calls me up and says “Phil, I have a problem, I need your help. It’s really easy” — and then he explains it so it really does seem easy — “but I need an answer in just a few days and I don’t want to get way down in the weeds, just something quick and dirty. Can you give me this estimate in…let’s say under five hours of work, by next Monday?” Five hours? I can hardly do anything in five hours. But still, it really does seem like an easy problem. I say OK and quote a slight premium over my regular consulting rate. And then…as always (always!) it ends up being more complicated than it seemed. That’s not a phenomenon that is unique to him: just about every project I’ve ever worked on turns out to be more complicated than it seems. The world is complicated! And people do the easy stuff themselves, so if someone comes to me it’s because it’s not trivial. But I never seem to learn.

Anyway, what my friend does is “valuation”: how much should someone be willing to pay for this thing? The ‘thing’ in this case is a program for improving the treatment of patients being treated for severe kidney disease. Patients do dialysis, they take medications, they’re on special diets, they have health monitoring to do, they have doctor appointments to attend, but many of them fail to do everything. That’s especially true as they get sicker: it gets harder for them to keep track of what they’re supposed to do, and physically and mentally harder to actually do the stuff.

For several years someone ran a trial program to see what happens if these people get a lot more help: what if there’s someone at the dialysis center whose job is to follow up with people and make sure they’re taking their meds, showing up to their appointments, getting their blood tests, and so on? One would hope that the main metrics of interest would involve patient health and wellbeing, and maybe that’s true for somebody, but for my friend (or rather his client) the question is: how much money, if any, does this program save? That is, what happens to the cost per patient per year if you have this program compared to doing stuff the way it has been done in the past?

As is usually the case, the data suck. What you would want is a random selection of pilot clinics where they tried the program, and the ones where they didn’t, and you’d want the cost data from the past ten years or something for every clinic; you could do some sort of difference-in-differences approach, maybe matching cases and controls by relevant parameters like region of the country and urban/rural and whatever else seems important. Unfortunately my friend had none of that. The clinics were semi-haphazardly selected by a few health care providers, probably slightly biased towards the ones where the administrators were most willing to give the program a try. The only clinic-specific data are from the first year of the program onward; other than that all we have is the nationwide average for similar clinics.

The fact that no before/after comparison is possible seemed like a dealbreaker to me, and I said so, but my friend said the experts think the effect of the program wouldn’t likely show up in the form of a step change from before to after, but rather in a lower rate of inflation, at least for the first several years. Relative to business as usual you expect to see a slight decline in cost every year for a while. I don’t understand why but OK, if that’s what people expect then maybe we can look for that: we expect to see costs at the participating clinics increase more slowly than the nationwide average. I told my friend that’s _all_ we can look for, given the data constraints, and he said fine. I gave him all of the other caveats too and he said that’s all fine as well. He needs some kind of estimate, and, well, you go to war with the data you have, not the data you want.

First thing I did is to divide out the nationwide inflation rate for similar clinics that lack the program, in order to standardize on current dollars. Then I fit a linear regression model to the whole dataset of (dollars per patient) as a function of year, giving each clinic its own intercept but giving them all a common slope. And sure enough, there’s a slight decline! The clinics with the program had a slightly lower rate of inflation than the other clinics, and it’s in line with what my friend said the experts consider a plausible rate. All those caveats I mentioned above still apply but so far things look OK.

If that was all my friend needed then hey, job done and it took a lot less than five hours. But no: my friend doesn’t just need to know the average rate of decrease, he needs to know the approximate statistical distribution across clinics. If the average is, say, a 1% decline per year relative to the benchmark, are some clinics at 2% per year? What about 3%? And maybe some don’t decline at all, or maybe the program makes them cost more money instead? (After all, you have to pay someone to help all those patients, so if the help isn’t very successful you are going to lose out.)

You’d like to just fit a different regression for each clinic and look at the statistical distribution of slopes, but that won’t work: there’s a lot of year-to-year ‘noise’ at any individual clinic. One reason is that you can get unlucky in suddenly having a disproportionate number of patients who need very expensive care, or lucky in not having that happen, but there are other reasons too. And you only have three or four years of data per clinic. Even if all of the clinics had programs of equal effectiveness, you’d get a wide variation in the observed slopes.

It’s very much like the “eight schools problem”. It’s really tailor-made for a Bayesian hierarchical model. Observed costs are distributed around “true costs” with a standard error we can estimate; true inflation-adjusted cost at a clinic declines linearly; slopes are distributed around some mean slope, with some standard deviation we are trying to estimate. We even have useful prior estimates of what slope might be plausible. Even simple models usually take me a while to code and to check, so I sort of dreaded going that route — it’s not something I would normally mind, but given the time constraints I thought it would be hard — but in fact it was super easy. I coded an initial version that was slightly simpler than I really wanted and it ran fine and generated reasonable parameter values.
Then I modified it to turn it into the model I wanted and…well, it mostly worked. The results all looked good and some model-checking turned out fine, but I got an error that there were a lot of “divergent transitions.” I’ve run into that before and I knew the trick for eliminating them, which is described here. (It seems like that method should be described, or at least that page should be linked, in the “divergent transitions” section of the Stan Reference Manual but it isn’t. I suppose I ought to volunteer to help improve the documentation. Hey Stan team, if it’s OK to add that link, and if you give me edit privileges for the manual, I’ll add it.) I made the necessary modification to fix that problem and then everything was hunky dory.
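For readers who haven’t seen it, the trick in question is the non-centered parameterization: instead of sampling each clinic’s slope directly from normal(mu, tau), you sample a standardized offset and rescale it. A minimal sketch of the idea (in Python rather than Stan, with made-up numbers; this is not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = -0.01, 0.005  # hypothetical mean slope and between-clinic sd
n = 100_000

# centered parameterization: sample the slopes directly
slopes_centered = rng.normal(mu, tau, n)

# non-centered parameterization: sample standard-normal offsets, then rescale;
# the implied distribution is identical, but HMC explores the standardized
# space much more easily when tau is small, which is what eliminates the
# divergent transitions
offsets = rng.normal(0.0, 1.0, n)
slopes_noncentered = mu + tau * offsets

print(round(slopes_centered.std(), 3), round(slopes_noncentered.std(), 3))
```

In Stan this amounts to declaring the offsets as parameters and constructing the slopes in the transformed parameters block.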

From starting the model to finishing with it — by which I mean I had done some model checks and looked at the outputs and generated the estimates I needed — was only about two hours. I still didn’t quite get the entire project done in the allotted time, but I was very close. Oh, and in spite of the slight overrun I billed for all of my hours, so my chump factor didn’t end up being all that high.

Thank you Stan team!


Duality between multilevel models and multiple comparisons adjustments, and the relevance of this to some discussions of replication failures

Pascal Jordan writes:

I stumbled upon a paper which I think you might find interesting to read or discuss: “Ignored evident multiplicity harms replicability—adjusting for it offers a remedy,” by Yoav Zeevi, Sofi Astashenko, and Yoav Benjamini. It deals with the replication crisis, more specifically with a sort of reanalysis of the Reproducibility Project: Psychology.

There are parts which I [Jordan] think overlap with your position on forking paths: For example, the authors argue that there are numerous implicit comparisons in the original studies which are not accounted for in the reporting of the results of the studies. The paper also partly offers an explanation as to why social psychology is particularly troubled with low rates of replicability (according to the paper: the mean number of implicit comparisons is higher in social psychology than in cognitive psychology).

On the other hand, the authors adhere to the traditional hypothesis-testing paradigm (with corrections for implicit comparisons) which I know you are a critic of.

My reactions:

1. As we’ve discussed, there is a duality between multiple-comparisons corrections and hierarchical models: in both cases we are considering a distribution of possible comparisons. In practice that means we can get similar applied results using different philosophies, as long as we make use of relevant information. For example, when Zeevi et al. in the above-linked article write of “addressing multiplicity,” I would consider multilevel modeling (as here) one way of doing this.

2. I think a key cause of unreplicable results is not the number of comparisons (implicit or otherwise) but rather the size of underlying effects. Social and evolutionary psychology have been in this weird position where they design noisy studies to estimate underlying effects that are small (see here) and can be highly variable (the piranha problem). Put this together and you have a kangaroo situation. Multiple comparisons and forking paths make this all worse, as they give researchers who are clueless or unscrupulous (or both) a way of declaring statistical significance out of data that are essentially pure noise—but I think the underlying problem is that they’re using noisy experiments to study small and variable effects.

3. Another way to say it is that this is a problem of misplaced rigor. Social psychology research often uses randomized experiments and significance tests: two tools for rigor. But all the randomization in the world won’t get you external validity, and all the significance testing in the world won’t save you if the signal is low relative to the noise.

So, just to be clear, “adjusting for multiplicity” can be done using multilevel modeling without any reference to p-values; thus, the overall points of the above-linked paper are more general than any specific method they might propose (just as, conversely, the underlying ideas of any paper of mine on hierarchical modeling could be transmuted into a statement about multiple comparisons methods). And no method of analysis (whether it be p-values, hierarchical modeling, whatever) can get around the problem of studies that are too noisy for the effects they’re trying to estimate.

Priors for hyperparameters in meta-analysis

In response to my remark in this post, “So, what to do? I don’t know, exactly! There’s no Platonically correct prior for (mu, tau). All I can say for sure is that the prior we are currently using is too wide,” Erik van Zwet writes:

The distribution of tau across all the meta-analyses in Cochrane with a binary outcome has been estimated by Turner et al.

They estimated the distribution of log(tau^2) as normal with mean -2.56 and standard deviation 1.74.

I’ve estimated the distribution of mu across Cochrane as a generalized t-distribution with mean=0, scale=0.52 and 3.4 degrees of freedom.

These estimated priors usually don’t make a very big difference compared to flat priors. That’s just because the signal-to-noise ratio of most meta-analyses is reasonably good. For most meta-analyses, finding an honest set of reliable studies seems to be a much bigger problem than sampling error.
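To get a feel for what these estimated priors imply, here’s a quick simulation sketch (the distribution parameters are the ones quoted above; everything else is just numpy):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Turner et al.'s estimate: log(tau^2) ~ normal(-2.56, 1.74)
log_tau2 = rng.normal(-2.56, 1.74, n)
tau = np.exp(log_tau2 / 2)

# van Zwet's estimate for mu: t with 3.4 degrees of freedom, scale 0.52
mu = 0.52 * rng.standard_t(3.4, n)

# the implied median between-study sd is exp(-2.56/2), about 0.28 on the
# log-odds scale: modest heterogeneity for most meta-analyses, but the
# long right tail allows substantial heterogeneity in a minority of them
print(round(np.median(tau), 2), round(np.quantile(tau, 0.95), 2))
```

So the estimated priors are informative but gentle, consistent with the observation that they usually don’t move the results much relative to flat priors.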

Exploring some questions about meta-analysis (using ivermectin as an example), with R and Stan code

We had a long discussion in comments the other day about some debates regarding studies of ivermectin for treating covid. Some of the discussion touched on general principles of meta-analysis, including challenges with Bayesian meta-analysis, so I thought I’d expand on it all here.

tl;dr. When doing meta-analysis, look at the estimated distribution of possible effects, not just at the estimate of the average across all studies.


footBayes: an R package for football (soccer) modeling using Stan

footBayes 0.1.0 is on CRAN! The goal of the package is to propose a complete workflow to:

– fit the most well-known football (soccer) models (double Poisson, bivariate Poisson, Skellam, Student’s t) through the maximum likelihood approach and HMC Bayesian methods using Stan;

– visualize the teams’ abilities, the model’s posterior predictive (pp) checks, and the rank-league reconstruction;

– predict out-of-sample matches via the posterior predictive distribution.

Here’s a super quick use of the package for the Italian Serie A. For more details, check out the vignette, and enjoy!

P.S. The vignette has been compiled without plot rendering to save time during the CRAN submission.


# packages used below (the italy dataset is assumed to come from engsoccerdata)
library(footBayes)
library(dplyr)
library(engsoccerdata)

# dataset for the Italian Serie A
italy <- as_tibble(italy)
italy_2000_2002 <- italy %>%
   dplyr::select(Season, home, visitor, hgoal, vgoal) %>%
   dplyr::filter(Season == "2000" | Season == "2001" | Season == "2002")

fit1 <- stan_foot(data = italy_2000_2002,
                  predict = 36)         # double Poisson fit (predict last 4 match-days)
foot_abilities(fit1, italy_2000_2002)   # plot teams' abilities
pp_foot(italy_2000_2002, fit1)          # posterior predictive checks
foot_rank(italy_2000_2002, fit1)        # rank-league reconstruction
foot_prob(fit1, italy_2000_2002)        # out-of-sample posterior predictive probabilities


Fake Data Pretend

I sit at my screen and wage war on my model
It seems like it’s all, it’s all for nothing
I know the barricades
And I know the mortar in the wall breaks
I recognize the weapons, I used them well.

This is my mistake
Let me make it good
I raised the model and I will be the one to knock it down.

I’ve a rich understanding of my finest defenses
I proclaim that claims are left unstated
I demand a rematch
I decree a stalemate
I divine my deeper motives
I recognize the diagnostics
I’ve practiced them well, I fitted them myself.

It’s amazing what devices you can sympathize (empathize)
This is my mistake
Let me make it good
I raised the model and I will be the one to knock it down.

Reach out for me and hold me tight
Hold that memory
Let my machine talk to me, let my machine talk to me.

This is my world and I am fake data pretend
This is my simulation
And this is my time
I have been given the freedom
To do as I see fit
It’s high time I’ve razed the models that I’ve constructed.

It’s amazing what devices you can sympathize (empathize)
This is my mistake
Let me make it good
I raised the model and I will be the one to knock it down.

You fill in the data
You fill in the theory
You fill in the transformation
I raised the model
And I’m the only one
I will be the one to knock it down.

Apologies to you know who. It’s amazing how few words needed to be changed.