A proposal to build new hardware and thermodynamic algorithms for stochastic computing

Patrick Coles writes:

Modern AI has moved away from the absolute, deterministic procedures of early machine learning models. Nowadays, probability and randomness are fully embraced and utilized in AI. Some simple examples of this are avoiding overfitting by randomly dropping out neurons (i.e., dropout), and escaping local minima during training thanks to noisy gradient estimates (i.e., stochastic gradient descent). A deeper example is Bayesian neural networks, where the network’s weights are sampled from a probability distribution and Bayesian inference is employed to update the distribution in the presence of data . . .

Another deep example is generative modeling with diffusion models. Diffusion models add noise to data in a forward process, and then reverse the process to generate a new datapoint (see figure illustrating this for generating an image of a leaf). These models have been extremely successful not only in image generation, but also in generating molecules, proteins and chemically stable materials . . .

AI is currently booming with breakthroughs largely because of these modern AI algorithms that are inherently random. At the same time, it is clear that AI is not reaching its full potential, because of a mismatch between software and hardware. For example, sample generation rate can be relatively slow for diffusion models, and Bayesian neural networks require approximations for their posterior distributions to generate samples in reasonable time.

Then comes the punchline:

There’s no inherent reason why digital hardware is well suited for modern AI, and indeed digital hardware is handicapping these exciting algorithms at the moment.

For production AI, Bayesianism in particular has been stifled from evolving beyond a relative niche because of its lack of mesh with digital hardware . . . .the next hardware paradigm should be specifically tailored to the randomness in modern AI. Specifically, we must start viewing stochasticity as a computational resource. In doing so, we could build a hardware that uses the stochastic fluctuations produced by nature.

Coles continues:

The aforementioned building blocks are inherently static. Ideally, the state does not change over time unless it is intentionally acted upon by a gate, in these paradigms.

However, modern AI applications involve accidental time evolution, or in other words, stochasticity. This raises the question of whether we can construct a building block whose state randomly fluctuates over time. This would be useful for naturally simulating the fluctuations in diffusion models, Bayesian inference, and other algorithms.

The key is to introduce a new axis when plotting the state space: time. Let us define a stochastic bit (s-bit) as a bit whose state stochastically evolves over time according to a continuous time Markov chain . . .

Ultimately this involves a shift in perspective. Certain computing paradigms, such as quantum and analog computing, view random noise as a nuisance. Noise is currently the biggest roadblock to realizing ubiquitous commercial impact for quantum computing. On the other hand, Thermodynamic AI views noise as an essential ingredient of its operation. . . .
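
To get a feel for what an s-bit is, here is a minimal R sketch (my own illustration, not anything from Coles's report) of a single bit whose state flips according to a two-state continuous-time Markov chain, with made-up transition rates:

# Simulate one s-bit: a 0/1 state that flips after exponentially distributed
# holding times. The transition rates are purely illustrative.
simulate_sbit <- function(t_max = 10, rate01 = 1, rate10 = 2, state0 = 0) {
  times <- 0; states <- state0
  t <- 0; s <- state0
  while (t < t_max) {
    rate <- if (s == 0) rate01 else rate10   # rate of leaving the current state
    t <- t + rexp(1, rate)                   # exponential holding time
    s <- 1 - s                               # flip the bit
    times <- c(times, t); states <- c(states, s)
  }
  data.frame(time = times, state = states)
}
path <- simulate_sbit()
plot(path$time, path$state, type = "s", xlab = "time", ylab = "s-bit state")

In the long run this bit spends a fraction rate01/(rate01 + rate10) of its time in state 1, so the "noise" has a well-defined stationary distribution you can compute with.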

I think that when Coles says “AI,” he means what we would call “Bayesian inference.” Or maybe AI represents some particularly challenging applications of Bayesian computation.

Analog computing

OK, the above is all background. Coles’s key idea here is to build a computer from new hardware, these stochastic bits, so that the continuous stochastic computation gets done directly.

This is reminiscent of what in the 1950s and 1960s was called “analog computation” or “hybrid computation.” An analog computer is something you build with a bunch of resistors and capacitors and op-amps to solve a differential equation. You plug it in, turn on the power, and the voltage tells you the solution. Turn some knobs to change the parameters in the model, or set it up in a circuit with a sawtooth input and plug it into an oscilloscope to get the solution as a function of the input, etc. A hybrid computer mixes analog and digital elements. Coles is proposing something different in that he’s interested in the time evolution of the state (which, when marginalized over time, can be mapped to a posterior distribution), whereas in a traditional analog computer, you just look at the end state and you’re not interested in the transient period that it takes to get there.

Here’s the technical report from Coles. I have not read it carefully or tried to evaluate it. That would be hard work! Could be of interest to many of you, though.

Straining on the gnat of the prior distribution while swallowing the camel that is the likelihood. (Econometrics edition)

Jason Hawkins writes:

I recently read an article by the econometrician William Greene of NYU and others (in a 2005 book). They state the following:

The key difference between Bayesian and classical approaches is that Bayesians treat the nature of the randomness differently. In the classical view, the randomness is part of the model; it is the heterogeneity of the taste parameters, across individuals. In the Bayesian approach, the randomness ‘represents’ the uncertainty in the mind of the analyst (conjugate priors notwithstanding). Therefore, from the classical viewpoint, there is a ‘true’ distribution of the parameters across individuals. From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.

My understanding is that this statement runs counter to the Bernstein-von Mises theorem, which in the wording of Wikipedia “assumes there is some true probabilistic process that generates the observations, as in frequentism” (my emphasis). Their context is comparing individual parameters from a mixture model, which can be taken from the posterior of a Bayesian inference or (in the frequentist case) obtained through simulation. I was particularly struck by their terming randomness as part of the model in the frequentist approach, which to me reads more as a feature of Bayesian approaches that are driven by uncertainty quantification.

My reply: Yes, I disagree with the above-quoted passage. They are exhibiting a common misunderstanding. I’ll respond with two points:

1. From the Bayesian perspective there also is a true parameter; see for example Appendix B of BDA for a review of the standard asymptotic theory. That relates to Hawkins’s point about the Bernstein-von Mises theorem.

2. Greene et al. write, “From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.” The same is true in the classical viewpoint; just replace the word “priors” by “likelihoods” or, more correctly, “data models.” Hire two different econometricians to fit two different models to your data and they can get “very different, albeit both legitimate” inferences.

Hawkins sends another excerpt from the paper:

The Bayesian approach requires the a priori specification of prior distributions for all of the model parameters. In cases where this prior is summarising the results of previous empirical research, specifying the prior distribution is a useful exercise for quantifying previous knowledge (such as the alternative currently chosen). In most circumstances, however, the prior distribution cannot be fully based on previous empirical work. The resulting specification of prior distributions based on the analyst’s subjective beliefs is the most controversial part of Bayesian methodology. Poirier (1988) argues that the subjective Bayesian approach is the only approach consistent with the usual rational actor model to explain individuals’ choices under uncertainty. More importantly, the requirement to specify a prior distribution enforces intellectual rigour on Bayesian practitioners. All empirical work is guided by prior knowledge and the subjective reasons for excluding some variables and observations are usually only implicit in the classical framework. The simplicity of the formula defining the posterior distribution hides some difficult computational problems, explained in Brownstone (2001).

That’s a bit better but it still doesn’t capture the all-important point that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.

And this:

Allenby and Rossi (1999) have carried out an extensive Bayesian analysis of discrete brand choice and discussed a number of methodological issues relating to the estimation of individual level preferences. In comparison of the Bayesian and classical methods, they state the simulation based classical methods are likely to be extremely cumbersome and are approximate whereas the Bayesian methods are much simpler and are exact in addition. As to whether the Bayesian estimates are exact while sampling theory estimates are approximate, one must keep in mind what is being characterised by this statement. The two estimators are not competing for measuring the same population quantity with alternative tools. In the Bayesian approach, the ‘exact’ computation is of the analysts posterior belief about the distribution of the parameter (conditioned, one might note on a conjugate prior virtually never formulated based on prior experience), not an exact copy of some now revealed population parameter. The sampling theory ‘estimate’ is of an underlying ‘truth’ also measured with the uncertainty of sampling variability. The virtue of one over the other is not established on any but methodological grounds – no objective, numerical comparison is provided by any of the preceding or the received literature.

Again, I don’t think the framing of Bayesian inference as “belief” is at all helpful. Does the classical statistician or econometrician’s logistic regression model represent his or her “belief”? I don’t think so. It’s not a belief, it’s a model, it’s an assumption.

But I agree with their other point that we should not consider the result of an exact computation to itself be exact. The output depends on the inputs.

We can understand this last point without thinking about statistical inference at all. Just consider a simple problem of measurement, where we estimate the weight of a liquid by weighing an empty jar, then weighing the jar with the liquid in it, then subtracting. Suppose the measured weights are 213 grams and 294 grams, so that the estimated weight of the liquid is 81 grams. The calculation, 294-213=81, is exact, but if the original measurements have error, then that will propagate to the result, so it would not be correct to say that 81 grams is the exact weight.
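
Here’s the same point as a quick simulation in R (the 1-gram measurement error is made up for illustration):

# Each weighing has its own error; the exact subtraction inherits both errors.
sigma <- 1                              # assumed sd of each weighing, in grams
jar <- 213 + rnorm(1e5, 0, sigma)       # hypothetical repeated empty-jar weighings
full <- 294 + rnorm(1e5, 0, sigma)      # hypothetical jar-plus-liquid weighings
liquid <- full - jar                    # the subtraction is exact every time
mean(liquid)                            # approximately 81 grams
sd(liquid)                              # approximately sigma * sqrt(2), not zero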

Large language model alignment “bias” and cultural consensus theory

The way contemporary chatbots like ChatGPT work is by “aligning” a “foundation” model. I think cultural consensus theory (a statistical model, not a contentious issue for school boards) can provide a model for the sociology of alignment.

The foundation: attention models

In a nutshell, language modeling is the simple task of predicting the next subword (called a “token”) based on the previous sequence of subwords. The state-of-the-art had stalled for years on n-gram models that use the previous n subwords (usually with n < 5). In 2017, a team of Google researchers released a paper titled "Attention is all you need," which introduced the current state-of-the-art neural network architecture for language modeling. The breakthrough was in extending the context length into the thousands (GPT 3.5 uses 4K, GPT 4 has 8K and 32K models) with an attention model that figured out which parts of the context to concentrate on. The fundamental bottleneck is that computation is quadratic in context length (though it's all on GPU, so that's a massive number of flops for relatively low power).

The 2017 paper introduced the so-called “transformer” architecture, which combines multiple attention “heads” in parallel. The original application was to translation, but it’s the self-attention component that was extracted for use in LLMs. The “T” in “GPT” is for “transformer” (the “GP” is for “generative pretrained”). What researchers have found is that the heads learn different aspects of prediction, such as different syntactic structures, much like any mixture model.
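
For readers who want to see where the quadratic cost comes from, here is a toy single-head self-attention computation in R (tiny made-up sizes; real transformers add masking, positional encodings, and many heads):

# Toy self-attention: with n tokens and d-dimensional embeddings, the score
# matrix is n x n, which is where the quadratic cost in context length comes from.
softmax_rows <- function(m) { e <- exp(m - apply(m, 1, max)); e / rowSums(e) }
n <- 6; d <- 4                                    # illustrative sizes
X <- matrix(rnorm(n * d), n, d)                   # token embeddings
Wq <- matrix(rnorm(d * d), d, d)                  # query projection
Wk <- matrix(rnorm(d * d), d, d)                  # key projection
Wv <- matrix(rnorm(d * d), d, d)                  # value projection
Q <- X %*% Wq; K <- X %*% Wk; V <- X %*% Wv
A <- softmax_rows(Q %*% t(K) / sqrt(d))           # n x n attention weights
out <- A %*% V                                    # each output mixes the whole context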

There’s a beautiful two-hour YouTube tutorial by Andrej Karpathy that builds up the entire transformer architecture piece by piece in a Colab notebook you can also use. Karpathy applies it to building a Shakespearean chatbot. It assumes you know Python, but is otherwise quite gentle, starting with an intro to n-gram language models and softmax.

Garbage-in, garbage-out

The current crop of large language models has been trained on vast amounts of human text, primarily collected through the internet. As you might imagine, including sources like Reddit and 4chan and Twitter leads to a broad set of what can most charitably be called “points of view.” Even on technical issues, the web is cluttered with material that should probably not be the basis for serious work—homework exercises for intro data science classes clutter GitHub and StackOverflow, every statistician and their cousin’s experimental code seems to be wrapped up as an R package, scripts from ancient versions of software persist, etc.

Alignment: from LLMs to chatbots

After building these powerful, transformer-based large language models (LLMs), people realized that they were really good at generating text. As in they blew away any previous compression record (just like the TV show Silicon Valley!). You can convert a language model into a compression scheme using prediction by partial matching (PPM) with arithmetic coding, the reference implementation of which was designed and coded by Radford Neal (with Ian Witten and John Cleary) in 1987. Seriously, they should win an ACM/Turing award just for the quantum leap in text compression.
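
The prediction-compression connection is easy to see in miniature: under (idealized) arithmetic coding, encoding a text costs about minus log base 2 of the probability the model assigns to it, so better next-token prediction means shorter codes. A toy R calculation with made-up probabilities:

p_next <- c(0.5, 0.25, 0.125, 0.125)   # model's probabilities for 4 observed tokens
sum(-log2(p_next))                      # about 9 bits to encode those 4 tokens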

The early LLMs could write computer programs, translate Pascal to Fortran and Swahili to English, and generate new recipes given only a list of ingredients or new episodes of TV shows. But they tend to ramble off topic, tend to “hallucinate” (the term of art for when LLMs make things up; it’s called “dreaming” for diffusion models like Midjourney), and tend to be fluid with the points of view they find in training data. They’re just as happy telling you how to make a bomb in your basement and where to set it off as they are telling you how to make a soufflé in your kitchen and how to serve it. And if you “jailbreak” the current ChatGPT, it’ll still be happy to tell you how to try all sorts of dangerous, illegal, and morally and ethically questionable activities.

OpenAI’s approach to preventing the LLMs from spewing dangerous and/or toxic garbage is to fine-tune the large language models with human feedback (HF) using reinforcement learning (RL, and together RLHF). Their stated goal was to “align” the language models to be (a) helpful, (b) truthful, and (c) harmless. Presented this way it sounds like an objective task, but the notions of truthfulness and harmlessness are difficult to pin down and require subjective judgment calls. Even helpfulness is a slippery notion in that help that’s too verbose or specific isn’t helpful. What one person takes to be self-evident in these realms can be considered lunacy by others.

OpenAI either implicitly or explicitly chose the point of view of a West-coast American liberal, which is the far left of the mainstream US political spectrum, even though it’s relatively conservative by European standards. They could’ve just as easily decided to give ChatGPT the perspective of the far right of the mainstream US political spectrum and it would’ve had a very different perspective and a different segment of the population would be complaining about its biases.

Cultural consensus theory

In 1979, Phil Dawid and David Skene introduced a statistical model of crowdsourcing for medical records. The idea is that there’s a latent true value of something like whether a patient smokes, and doctors looking at medical records are going to give you a noisy measurement of that value. The same kind of model can be applied to radiology and doctors classifying medical images for stage of cancer, etc. Or to NeurIPS paper reviews. The model assigns accuracies and biases (too positive or too negative) to the raters and infers the underlying rating most consistent with the ratings (given the rater accuracies and biases).
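
Here’s a rough generative sketch of that idea in R (a simplified binary version with rater accuracies; the actual Dawid and Skene model uses a full per-rater confusion matrix over categories):

n_items <- 200; n_raters <- 5
truth <- rbinom(n_items, 1, 0.3)             # latent true labels (e.g., smoker or not)
accuracy <- runif(n_raters, 0.65, 0.95)      # assumed rater accuracies
ratings <- sapply(seq_len(n_raters), function(j) {
  correct <- rbinom(n_items, 1, accuracy[j])
  ifelse(correct == 1, truth, 1 - truth)     # an incorrect rating flips the label
})
# A naive majority-vote consensus; the model instead infers the truth and the
# rater accuracies jointly, weighting reliable raters more heavily.
consensus <- as.integer(rowMeans(ratings) > 0.5)
mean(consensus == truth)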

Dawid and Skene’s model was independently rediscovered by many, including by me with the help of Andrew and Jennifer Hill (it was my gateway model into Bayes and there’s an example of how to code it in the Stan User’s Guide). As Andrew tells me, no matter what model you look at, a psychometrician probably introduced it 50 years ago (e.g., Elo is just a rescaled Bradley-Terry model, which dates to the early 1950s).

In 1986, A. Kimball Romney, Susan Weller, and William Batchelder published “Culture as consensus: a theory of culture and informant accuracy”, which introduced cultural consensus theory (CCT). It shouldn’t be surprising that it was published in an anthropology journal, because anthropology is cross-cultural sociology. Batchelder and Romney later published a paper, “Test theory without an answer key” in Psychometrika; think IRT 0PL model but with unknown true answer, which is the Dawid and Skene model.

The twist that CCT introduced to take it beyond Dawid and Skene’s model was a mixture model for the “truth.” That is, they assumed there might not actually be a single consensus point of view among raters. This would be a good idea for crowdsourcing, too, where the respondents are often a mix of spammers and people making a good-faith effort (it’s really more of a continuum).

I think it would be interesting to apply CCT to ChatGPT. It’s the same kind of thing that folks do in applying ideal point models to voting.

Open problem: How to make residual plots for multilevel models (or for regularized Bayesian and machine-learning predictions more generally)?

Adam Sales writes:

I’ve got a question that seems like it should be elementary, but I haven’t seen it addressed anywhere (maybe I’m looking in the wrong places?)

When I try to use binned residual plots to evaluate a multilevel logistic regression, I often see a pattern like this (from my student, fit with glmer):

I think the reason is partial pooling: the group-level intercepts are shrunk towards the grand mean.

I was able to replicate the effect (albeit kind of mirror-imaged—the above plot was from a very complex model) with fake data:

library(lme4)   # for glmer
library(arm)    # for binnedplot

# Simulate grouped binary data with group intercepts drawn from N(0, reSD^2)
makeData <- function(ngroup=100, groupSizeMean=10, reSD=2){
  groupInt <- rnorm(ngroup, sd=reSD)                 # true group-level intercepts
  groupSize <- rpois(ngroup, lambda=groupSizeMean)   # random group sizes
  groups <- rep(1:ngroup, times=groupSize)
  n <- sum(groupSize)
  data.frame(group=groups, y=rbinom(n, size=1, prob=plogis(groupInt[groups])))
}
dat <- makeData()
mod <- glmer(y ~ (1|group), data=dat, family=binomial)
binnedplot(predict(mod, type='response'), resid(mod, type='response'))

Model estimates (i.e., point estimates of the parameters from a hierarchical model) of extreme group effects are shrunk towards 0---the grand mean intercept in this case---except at the very edges when the 0-1 bound forces the residuals to be small in magnitude (I expect the pattern would be linear in the log odds scale).

When I re-fit the same model on the same data with rstanarm and looked at the fitted values I got basically the same result.

On the other hand, when looking at 9 random posterior draws the pattern mostly goes away:

Now here come the questions---is this really a general phenomenon, like I think it is? If so, what does it mean for the use of binned residual plots for multilevel logistic regression, or really any time there's shrinkage or partial pooling? Can binned residual plots be helpful for models fit with glmer, or only by plotting individual posterior draws from a Bayesian posterior distribution?

My reply: Yes, the positive slope for resid vs expected value . . . that would never happen in least-squares regression, so, yeah, it has to do with partial pooling. We should think about what's the right practical advice to give here. Residual plots are important.

As you note with your final graph above, the plots should have the right behavior (no slope when the model is correct) when plotting the residuals relative to the simulated parameter values. This is what Xiao-Li, Hal, and I called "realized discrepancies" in our 1996 paper on posterior predictive checking, but then in our 2000 paper on diagnostic checks for discrete-data regression models using posterior predictive simulations, Yuri, Francis, Ivan, and I found that the use of realized discrepancies added lots of noise in residual plots.

What we'd like is an approach that gives us the clean comparisons but without the noise.
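
In the meantime, here is a sketch of the kind of comparison Sales describes, reusing the fake data above (my rough illustration, not his code): fit the model with rstanarm and make binned residual plots against individual posterior draws rather than against the shrunken point predictions.

library(rstanarm)
library(arm)
bmod <- stan_glmer(y ~ (1 | group), data = dat, family = binomial, refresh = 0)
epred <- posterior_epred(bmod)     # draws x observations matrix of Pr(y = 1)
par(mfrow = c(3, 3))
for (s in sample(nrow(epred), 9)) {
  binnedplot(epred[s, ], dat$y - epred[s, ], main = paste("posterior draw", s))
}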

How large is the underlying coefficient? An application of the Edlin factor to that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”

Often these posts start with a question that someone sends to me and continue with my reply. This time the q-and-a goes the other way . . .

I pointed Erik van Zwet to this post, “I’m skeptical of that claim that ‘Cash Aid to Poor Mothers Increases Brain Activity in Babies,’” and wrote:

This example (in particular, the regression analysis at the end of the PPS section) makes me think about your idea of a standard-error-scaled prior, this time for regression coefficients. What do you think?

Erik replied:

Yes, I did propose a default prior for regression coefficients:

Wow, 2019 seems so long ago now! This was before I had the nice Cochrane data, and started focusing on clinical trials. The paper was based on a few hundred z-values of regression coefficients which I collected by hand from Medline. I tried to do that in an honest way as follows:

It is a fairly common practice in the life sciences to build multivariate regression models in two steps. First, the researchers run a number of univariate regressions for all predictors that they believe could have an important effect. Next, those predictors with a p-value below some threshold are selected for the multivariate model. While this approach is statistically unsound, we believe that the univariate regressions should be largely unaffected by selection on significance, simply because that selection is still to be done!

Anyway, using a standard-error-scaled prior really means putting in prior information about the signal-to-noise ratio. The study with the brain activity in babies seems to have a modest sample size relative to the rather noisy outcome. So I would expect regression coefficients with z-values between 1 and 3 to be inflated, and an Edlin factor of 1/2 seems about the right ball park.

I think that type M errors are a big problem, but I also believe that the probability of a type S error tends to be quite small. So, if I see a more or less significant effect in a reasonable study, I would expect the direction of the effect to be correct.

I just want to add one thing here, which is that in that example the place where I wanted to apply the Edlin factor was on the control variables in the regression, where I was adjusting for pre-treatment predictors. The main effects in this example show no evidence of being different from what could be expected from pure chance.

This discussion is interesting in revealing two different roles of shrinkage. One role is what Erik is focusing on, which is shrinkage of effects of interest, which as he notes should generally have the effect of making the magnitudes of estimated effects smaller without changing their sign. The other role is shrinkage of coefficients of control variables, which regularizes these adjustments and indirectly gives more reasonable estimates of the effects of interest.

4 different meanings of p-value (and how my thinking has changed)

The p-value is one of the most common, and one of the most confusing, tools in applied statistics. Seasoned educators are well aware of all the things the p-value is not. Most notably, it’s not “the probability that the null hypothesis is true.” McShane and Gal find that even top researchers routinely misinterpret p-values.

But let’s forget for a moment about what p-values are not and instead ask what they are. It turns out that there are different meanings of the term. At first I was going to say that these are different “definitions,” but Sander Greenland pointed out that not all are definitions:

Definition 1. p-value(y) = Pr(T(y_rep) >= T(y) | H), where H is a “hypothesis,” a generative probability model, y is the observed data, y_rep are future data under the model, and T is a “test statistic,” some pre-specified function of the data. I find it clearest to define this sort of p-value relative to potential future data; it can also be done mathematically and conceptually without any reference to repeated or future sampling, as in this 2019 paper by Vos and Holbert.

Definition 2. Start with a set of hypothesis tests of level alpha, for all values alpha between 0 and 1. p-value(y) is the smallest alpha of all the tests that reject y. This definition starts with a family of hypothesis tests rather than a test statistic, and it does not necessarily have a Bayesian interpretation, although in particular cases, it can also satisfy Definition 1.

Property 3. p-value(y) is some function of y that is uniformly distributed under H. I’m not saying that the term “p-value” is taken as a synonym for “uniform variate” but rather that this conditional uniform distribution is sometimes taken to be a required property of a p-value. It’s not a definition because in practice no one would define a p-value without some reference to a tail-area probability (Definition 1) or a rejection region (Definition 2)—but it is sometimes taken as a property that is required for something to be a true p-value. The relevant point here is that a p-value can satisfy Property 3 without satisfying Definition 1 (there are methods of constructing uniformly-distributed p-values that are not themselves tail-area probabilities), and a p-value can satisfy Definition 1 without satisfying Property 3 (when there is a composite null hypothesis and the distribution of the test statistic is not invariant to parameter values; see Xiao-Li Meng’s paper from 1994).

Description 4. p-value(y) is the result of some calculations applied to data that are conventionally labeled as a p-value. Typically, this will be a p-value under Definition 1 or 2 above, but perhaps defined under a hypothesis H that is not actually the model being fit to the data at hand, or a hypothesis H(y) that itself is a function of data, for example from p-hacking or forking paths. I’m labeling this as a “description” rather than a “definition” to clarify that this sort of p-value is used all the time without always a clear definition of the hypothesis, for example if you have a regression coefficient with estimate beta_hat and standard error s, and you compute 2 times the tail-area probability of |beta_hat|/s under the normal or t distribution, without ever defining a null hypothesis relative to all the parameters in your model. Sander Greenland calls this sort of thing a “descriptive” p-value, capturing the idea that the p-value can be understood as a summary of the discrepancy or divergence of the data from H according to some measure, ranging from 0 = completely incompatible to 1 = completely compatible. For example, the p-value from a linear regression z-score can be understood as a data summary without reference to a full model for all the coefficients.
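
Definition 1, at least, is straightforward to compute by simulation. A minimal R sketch, with a made-up simple null hypothesis and test statistic:

# H: 20 independent standard normal observations; T: the sample maximum.
set.seed(123)
y <- rnorm(20)                               # "observed" data, simulated here
T_stat <- function(y) max(y)
T_rep <- replicate(1e4, T_stat(rnorm(20)))   # test statistic under replications from H
p_value <- mean(T_rep >= T_stat(y))          # Pr(T(y_rep) >= T(y) | H)
p_value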

These are not four definitions/properties/descriptions of the same thing. They are four different things. Not completely different, as they coincide in certain simple examples, but different, and they serve different purposes. They have different practical uses and implications, and you can make mistakes when you use one sort to answer a different question. Just as, for example, posterior intervals and confidence intervals coincide in some simple examples but in general are different: lots of real-world posterior intervals don’t have classical confidence coverage, even in theory, and lots of real-world confidence intervals don’t have Bayesian posterior coverage, even in theory.

A single term with many meanings—that’s a recipe for confusion! Hence this post, which does not claim to solve any technical problems but is just an attempt to clarify.

In all the meanings above, H is a “generative probability model,” that is, a class of probability models for the modeled data, y. If H is a simple null hypothesis, H represents a specified probability distribution, p(y|H). If H is a composite null hypothesis, there is some vector of unknown parameters theta indexing a family of probability distributions, p(y|theta,H). As Daniel Lakeland so evocatively put it, a null hypothesis is a specific random number generator.

Under any of the above meanings, the p-value is a number—a function of data, y—and also can be considered as a random variable with probability distribution induced by the distribution of y under H. For a composite null hypothesis, that distribution will in general depend on theta, but that complexity is not our focus here.

So, back to p-values. How can one term have four meanings? Pretty weird, huh?

The answer is that under certain ideal conditions, the four meanings coincide. In a model with continuous data and a continuous test statistic and a point null hypothesis, all four of the above meanings give the same answer. Also there are some models with unknown parameters where the test statistic can be defined to have a distribution under H that is invariant to parameters. And this can also be the case asymptotically.

More generally, though, the four meanings are different. None of them are “better” or “worse”; they’re just different. Each has some practical value:

– A p-value under Definition 1 can be directly interpreted as a probability statement about future data conditional on the null hypothesis (as discussed here).

– A p-value under Definition 2 can be viewed as a summary of a class of well-defined hypothesis tests (as discussed in footnote 4 of this article by Philip Stark).

– A p-value with Property 3 has a known distribution under the null hypothesis, so the distribution of a collection of p-values can be compared to uniform (as discussed here).

– A p-value from Description 4 is unambiguously defined from existing formulas so is a clear data summary even if it can’t easily be interpreted as a probability in the context of the problem at hand.

As an example, in this article from 1989, Besag and Clifford come up with a Monte Carlo procedure that yields p-values that satisfy Property 3 but not Definition 1 or Description 4. And in 1996, Meng, Stern, and I discussed Bayesian p-values that satisfied Definition 1 but not Property 3.

The natural way to proceed is to give different names to the different p-values. The trouble is that different authors choose different naming conventions!

I’ve used the term “p-value” for Definition 1 and “u-value” for Property 3; see section 2.3 of this article from 2003. And in this article from 2014 we attempted to untangle the difference between Definition 1 and Property 3. I haven’t thought much about Definition 2, and I’ve used the term “nominal p-value” for Description 4.

My writing about p-values has taken Definition 1 as a starting point. My goal has been to examine misfit of the null hypothesis with respect to some data summary or test statistic, not to design a procedure to reject with a fixed probability conditional on a null hypothesis or to construct a measure of evidence that is uniformly distributed under the null. Others including Bernardo, Bayarri, and Robins are less interested in a particular test statistic and are more interested in creating a testing procedure or a calibrated measure of evidence, and they have taken Definition 2 or Property 3 as their baseline, referring to p-values with Property 3 as “calibrated” or “valid” p-values. This terminology is as valid as mine; it’s just taking a different perspective on the same problem.

In an article from 2023 with follow-up here, Sander Greenland distinguishes between “divergence p-values” and “decision p-values,” addressing similar issues of overloading of the term “p-value.” The former corresponds to Definition 1 above using the same sort of non-repeated-sampling view of p-values as favored by Vos and Holbert and addresses the issues raised by Description 4, and the latter corresponds to Definition 2 and addresses the issues raised by Property 3. As Greenland emphasizes, a p-value doesn’t exist in a vacuum; it should be understood in the context in which it will be used.

My thinking has changed.

My own thinking about p-values and significance testing has changed over the years. It all started when I was working on my Ph.D. thesis, fitting a big model to medical imaging data and finding that the model didn’t fit the data. I could see this because the chi-squared statistic was too large! We had something like 20,000 pieces of count data, and the statistic was about 30,000, which would be compared to a chi-squared distribution with 20,000 minus k degrees of freedom, where k is the number of “effective degrees of freedom” in the model. The “effective degrees of freedom” thing was interesting, and it led me into the research project that culminated in the 1996 paper with Meng and Stern.

The relevant point here was that I was not coming into that project with the goal of creating “Bayesian p-values.” Rather, I wanted to be able to check the fit of model to data, and this was a way for me to deal with the fact that existing degrees-of-freedom adjustments did not work in my problem.

The other thing I learned when working on that project was that a lot of Bayesians didn’t like the idea of model checking at all! They had this bizarre (to me) attitude that, because their models were “subjective,” they didn’t need to be checked. So I leaned hard into the idea that model checking is a core part of Bayesian data analysis. This one example, and the fallout from it, gave me a much clearer sense of data analysis as a Popperian or Lakatosian process, leading to this 2013 article with Shalizi.

In the meantime, though, I started to lose interest in p-values. Model checking was and remains important to me, but I found myself doing it using graphs. Actually, the only examples I can think of where I used hypothesis testing for data analysis were the aforementioned tomography model from the late 1980s (where the null hypothesis was strongly rejected) and the 55,000 residents desperately need your help! example from 2004 (where we learned from a non-rejection of the null). Over the years, I remained aware of issues regarding p-values, and I wrote some articles on the topic, but this was more from theoretical interest or with the goal of better understanding common practice, not with any goal to develop better methods for my own use. This discussion from 2013 of a paper by Greenland and Poole gives a sense of my general thinking.

P.S. Remember, the problems with p-values are not just with p-values.

P.P.S. I thank Sander Greenland for his help with this, even if he does not agree with everything written here.

P.P.P.S. Sander also reminds me that all the above disagreements are trivial compared to the big issues of people acting as if “not statistically significant” results are confirmations of the null hypothesis and as if “statistically significant” results are confirmations of their preferred alternative. Agreed. Those are the top two big problems related to model checking and hypothesis testing, and then I’d say the third most important problem is people not being willing to check their models or consider what might happen if their assumptions are wrong (both Greenland and Stark have written a lot about that problem, and, as noted above, that was one of my main motivations for doing research on the topic of hypothesis testing).

Compared to those three big issues, the different meanings of p-values are less of a big deal. But they do come up, as all four of the above sorts of p-values are used in serious statistical practice, so I think it’s good to be aware of their differences. Otherwise it’s easy to slip into the attitude that other methods are “wrong,” when they’re just different.

Gaussian process as a default interpolation model: is this “kind of anti-Bayesian”?

Pablo Alcain writes:

I wanted to know your thoughts regarding Gaussian Processes as Bayesian Models. For what it’s worth, here are mine:

What draws me the most to Bayesian inference is that it’s a framework in which the statistical modeling fits very nicely. Coming from a natural science background (Physics), the interpretability of the results for me is tightly related to the modeling itself. Gaussian processes (or, for what it’s worth, any non-parametric model) tend to defeat that purpose. Of course, there is some mild interpretation one can do (like “characteristic covariation length” in quadratic kernels), but somehow it doesn’t feel quite enough. It’s no surprise for me, from this perspective, that they are so widely used in hyperparameter tuning in ML models, where it’s very hard to have intuitions about the underlying processes that generate the distribution for the regression. McElreath also gives an example I enjoy in his book, when he uses a GP to model correlation in tool production in the Pacific islands, depending on the distance between the islands. I think that they are a super powerful interpolation technique, with many more niceties than splines, for example. But using them for any regression, to me, feels kind of anti-Bayesian even though it’s formally in the prior-data-posterior loop.

I find them also pretty misleading when you have a high-noise setting, and find it very hard to figure out whether they are overfitting or not; but probably this last one is because of my super shallow experience with them.

Anyway, just wanted to know what you think about GP in general. I tried to find mentions of it on the statmodeling blog, and I couldn’t find many posts in the recent years. I found however this comment from you: “Yes, we’ve found that when fitting Gaussian processes, it’s helpful to use strong prior distributions on the hyperparameters.”

My reply: I’ve only actually fit Gaussian processes twice. The first time I didn’t even fit the model, it was Aki who did it, so I’m only claiming (partial) credit for it because it’s in BDA3 and I looked over the results. The second time was as an exercise to predict future elections, and that’s where I learned the lesson about needing to specify the length scale or use a strong prior distribution for it.

Here’s what I wrote before, in the context of the election model:

For Gaussian processes it can be tricky to estimate length-scale parameters without including some regularization. In this case I played around with a few options and ended up modeling each state and each region as the sum of two Gaussian processes, which meant I needed short and long length scales. After some fitting, I ended up just hard-coding these at 8 and 3, that is, 32 years and 12 years. The long scale is there to allow low-frequency trends which will stop the model from automatically regressing to the mean when extrapolated into the future; the short scale fits the curves you can see in the data. . . .

The model is complicated in a statistical sense in that it has state, regional, and national levels; but it’s “dumb” in that it uses nothing more than past vote totals and a forecast of the 2016 vote; this model does not account for demographic trends. So, for example, our model does not “know” that New Mexico has so many Latinos; it just sees that New Mexico has gradually been trending Democratic in recent decades. And our model does not “know” to associate Mitt Romney’s religion with the Republican spike in Utah in 2012. And so on. So in that sense we see this model as a baseline, a sort of blank slate or minimal starting point upon which further information can be added.

So, to answer Alcain’s question: I do think these models are Bayesian, even if only in the “default Bayesian” setting. Remember, there are two ways of thinking about a prior:

1. p(theta), the marginal distribution for the parameters in your model or, equivalently, a generative model for theta.

2. p(theta | y_previous), the state of knowledge about theta given previously-available information or, more generally, information external to that provided by the data in your current model.

In that second formulation, we can reformulate the prior as p(theta | y_previous) proportional to p(theta) * product_j p(theta | y_previous_j), including external data sources j=1,…,J. When thinking about it this way, you can see how it can make sense to include multiple “priors” in a single model, where each “prior” or line of Stan code corresponds to an approximation to the likelihood of theta given a different data source.

So, now, back to Gaussian processes: if you think of a Gaussian process as a background prior representing some weak expectations of smoothness, then it can be your starting point. Set up a model with a Gaussian process prior and then add more prior information as appropriate.
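
To give a sense of what such a smoothness prior looks like, here is a small R sketch of draws from a GP prior whose covariance is the sum of a long and a short squared-exponential kernel, loosely echoing the 32-year and 12-year length scales mentioned above (all numbers illustrative):

sq_exp <- function(x, ell, sigma) sigma^2 * exp(-outer(x, x, "-")^2 / (2 * ell^2))
x <- seq(0, 40, by = 0.5)                                 # time, in years
K <- sq_exp(x, ell = 32, sigma = 1) + sq_exp(x, ell = 12, sigma = 0.5)
L <- chol(K + diag(1e-6, length(x)))                      # jitter for numerical stability
draws <- t(L) %*% matrix(rnorm(length(x) * 5), ncol = 5)  # 5 draws from the prior
matplot(x, draws, type = "l", lty = 1, xlab = "year", ylab = "prior draw")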

Multilevel modeling to make better decisions using data from schools: How can we do better?

Michael Nelson writes:

I wanted to point out a paper, Stabilizing Subgroup Proficiency Results to Improve the Identification of Low-Performing Schools, by Lauren Forrow, Jennifer Starling, and Brian Gill.

The authors use Mr. P to analyze proficiency scores of students in subgroups (disability, race, FRL, etc.). The paper’s been getting a good amount of attention among my education researcher colleagues. I think this is really cool—it’s the most attention Mr. P’s gotten from ed researchers since your JREE article. This article isn’t peer reviewed, but it’s being seen by far more policymakers than any journal article would.

All the more relevant that the authors’ framing of their results is fishy. They claim that some schools identified as underperforming, based on mean subgroup scores, actually aren’t, because they would’ve gotten higher means if the subgroup n’s weren’t so small. They’re selling the idea that adjustment by poststratification (which they brand as “score stabilization”) may rescue these schools from their “bad luck” with pre-adjustment scores. What they don’t mention is that schools with genuinely underperforming (but small) subgroups could be misclassified as well-performing if they have “good luck” with post-adjustment scores. In fact, they don’t use the word “bias” at all, as in: “Individual means will have less variance but will be biased toward the grand mean.” (I guess that’s implied when they say the adjusted scores are “more stable” rather than “more accurate,” but maybe only to those with technical knowledge.)

And bias matters as much as variance when institutions are making binary decisions based on differences in point estimates around a cutpoint. Obviously, net bias up or down will be 0, in the long run, and over the entire distribution. But bias will always be net positive at the bottom of the distribution, where the cutpoint is likely to be. Besides, relying on net bias and long-run performance to make practical, short-run decisions seems counter to the philosophy I know you share, that we should look at individual differences not averages whenever possible. My fear is that, in practice, Mr. P might be used to ignore or downplay individual differences–not just statistically but literally, given that we’re talking about equity among student subgroups.

To the authors’ credit, they note in their limitations section that they ought to have computed uncertainty intervals. They didn’t, because they didn’t have student-level data, but I think that’s a copout. If, as they note, most of the means that moved from one side of the cutoff to the other are quite near it already, you can easily infer that the change is within a very narrow interval. Also to their credit, they acknowledge that binary choices are bad and nuance is good. But, also to their discredit, the entire premise of their paper is that the education system will, and presumably should, continue using cutpoints for binary decisions on proficiency. (That’s the implication, at least, of the US Dept. of Ed disseminating it.) They could’ve described a nuanced *application* of Mr. P, or illustrated the absurd consequences of using their method within the existing system, but they didn’t.

Anyway, sorry this went so negative, but I think the way Mr. P is marketed to policymakers, and its potential unintended consequences, are important.

Nelson continues:

I’ve been interested in this general method (multilevel regression with poststratification, MRP) for a while, or at least the theory behind it. (I’m not a Bayesian so I’ve never actually used it.)

As I understand it, MRP takes the average over all subgroups (their grand mean) and moves the individual subgroup means toward that grand mean, with smaller subgroups getting moved more. You can see this in the main paper’s graphs, where low means go up and high means go down, especially on the left side (smaller n’s). The grand mean will be more precise and more accurate (due to something called superefficiency), while the individual subgroup means will be much more precise but can also be much more biased toward the grand mean. The rationale for using the biased means is that very small subgroups give you very little information beyond what the grand mean is already telling you, so you should probably just use the grand mean instead.

In my view, that’s an iffy rationale for using biased subgroup proficiency scores, though, which I think the authors should’ve emphasized more. (Maybe they’ll have to in the peer-reviewed version of the paper.) Normally, bias in individual means isn’t a big deal: we take for granted that, over the long run, upward bias will be balanced out by downward bias. But, for this method and this application, the bias won’t ever go away, at least not where it matters. If what we’re looking at is just the scores around the proficiency cutoff, that’s generally going to be near the bottom of the distribution, and means near the bottom will always go up. As a result, schools with “bad luck” (as the authors say) will be pulled above the cutoff where they belong, but so will schools with subgroups that are genuinely underperforming.

I have a paper under review that derives a method for correcting a similar problem for effect sizes—it moves individual estimates not toward a grand mean but toward the true mean, in a direction and distance determined by a measure of the data’s randomness.

I kinda see what Nelson is saying, but I still like the above-linked report because I think that in general it is better to work with regularized, partially-pooled estimates than with raw estimates, even if those raw estimates are adjusted for noise or multiple comparisons or whatever.

To help convey this, let me share a few thoughts regarding hierarchical modeling in this general context of comparing averages (in this case, from different schools, but similar issues arise in medicine, business, politics, etc.).

1. Many years ago, Rubin made the point that, when you start with a bunch of estimates and uncertainties, classical multiple comparisons adjustments effectively increase the standard errors so that fewer comparisons are statistically significant, whereas Bayesian methods move the estimates around. Rubin’s point was that you can get the right level of uncertainty much more effectively by moving the intervals toward each other rather than by keeping their centers fixed and then making them wider. (I’m thinking now that a dynamic visualization would be helpful to make this clear; a small simulation sketch appears after this list.)

It’s funny because Bayesian estimates are often thought of as trading bias for variance, but in this case the Bayesian estimate is so direct, and it’s the multiple comparisons approaches that do the tradeoff, getting the desired level of statistical significance by effectively making all the intervals wider and thus weakening the claims that can be made from data. It’s kinda horrible that, under the classical approach, your inferences for particular groups and comparisons will on expectation get vaguer as you get data from more groups.

We explored this idea in our 2000 article, Type S error rates for classical and Bayesian single and multiple comparison procedures (see here for freely-available version) and more thoroughly in our 2011 article, Why we (usually) don’t have to worry about multiple comparisons. In particular, see the discussion on pages 196-197 of that latter paper (see here for freely-available version).


2. MRP, or multilevel modeling more generally, does not “move the individual subgroup means toward that grand mean.” It moves the error terms toward zero, which implies that it moves the local averages toward their predictions from the regression model. For example, if you’re predicting test scores given various school-level predictors, then multilevel modeling partially pools the individual school means toward the fitted model. It would not in general make sense to partially pool toward the grand mean—not in any sort of large study that includes all sorts of different schools. (Yes, in Rubin’s classic 8-schools study, the estimates were pooled toward the average, but these were 8 similar schools in suburban New Jersey, and there were no available school-level predictors to distinguish them.)

3. I agree with Nelson that it’s a mistake to summarize results using statistical significance, and this can lead to artifacts when comparing different models. There’s no good reason to make decisions based on whether a 95% interval includes zero.

4. I like multilevel models, but point estimates from any source—multilevel modeling or otherwise—have unavoidable problems when the goal is to convey uncertainty. See our 1999 article, All maps of parameter estimates are misleading.
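
Here is the simulation sketch promised in point 1 above (all numbers made up): the classical multiple-comparisons adjustment keeps the interval centers fixed and widens the intervals, while the Bayesian normal-normal estimate pulls the centers toward each other and keeps the intervals short.

set.seed(1)
J <- 20; tau <- 1; se <- 2
theta <- rnorm(J, 0, tau)                    # true group effects
y <- rnorm(J, theta, se)                     # noisy estimates
# Bonferroni-adjusted 95% intervals: same centers, wider intervals
z_bonf <- qnorm(1 - 0.025 / J)
bonf <- cbind(y - z_bonf * se, y + z_bonf * se)
# Partial pooling toward the grand mean (tau and se treated as known, for simplicity)
shrink <- tau^2 / (tau^2 + se^2)
post_mean <- mean(y) + shrink * (y - mean(y))
post_sd <- sqrt(1 / (1 / tau^2 + 1 / se^2))
bayes <- cbind(post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
mean(bonf[, 2] - bonf[, 1]); mean(bayes[, 2] - bayes[, 1])   # compare interval widths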

In summary, I like the Forrow et al. article. The next step should be to go beyond point estimates and statistical significance and to think more carefully about decision making under uncertainty in this educational context.

Is there a Bayesian justification for the Murdaugh verdict?

Jonathan Falk writes:

I know you’re much more a computational Bayesian than a philosophical Bayesian, and I assume you were as ignorant of the national phenomenon of the Murdaugh trial as I was, but I just don’t quite get it.

Assume the following facts are true:
(1) Two people were murdered: a mother and son
(2) The husband and father, after having denied he was anywhere near the scene, was forced by incontrovertible evidence into admitting he was there shortly before they were killed.
(3) He is/was a drug addict and embezzler.
(4) There is literally no other evidence connecting him to the crime.

Starting from a presumption of innocence (not sure what that means exactly in probabilistic terms, but where p is the prior probability of guilt, p<<.5), how do you as a Bayesian combine (1)-(4) with the prior to get to a posterior of "beyond reasonable doubt"? (Again, leaving precise calibration aside, surely p>>.9)

People lie all the time, and most drug addicts and embezzlers are not murderers, and while proximity to a murder scene (particularly one in secluded private property) is pretty good evidence, it’s not usually enough to convict anyone without some other pretty good evidence, like a murder weapon, or an obvious motive. I’d be willing to countenance a model with a posterior in the neighborhood of maybe 0.7, but it’s not clear to me how a committed Bayesian proceeds in a case like this and finds the defendant guilty.

Thoughts?

He’s got a good question. I have two answers:

1. The availability heuristic. It’s easy to picture Murdaugh being the killer and no good alternative explanations were on offer.

2. Decision analysis. Falk is framing the problem in probabilistic terms: what sort of evidence would it take to shift the probability from much less than 50% to well over 90%, and I see his point that some strong evidence would be necessary, maybe much more than was presented at the trial. But I’m thinking the more relevant framing for the jury is: What should they decide?

Suppose the two options are find Murdaugh guilty or not guilty of first-degree murder. Picture yourself on the jury, making this decision, and consider the 2 x 2 matrix of outcomes under each option:

– Truly guilty, Found not guilty by the court: Evil man gets away with one of the worst crimes you could imagine.

– Truly guilty, Found guilty by the court: Evil man is punished, justice is done. A win.

– Truly not guilty, Found not guilty by the court: Avoided a mistake. Whew!

– Truly not guilty, Found guilty by the court: Lying, creepy-ass, suspicious-acting drug addict and embezzler didn’t actually kill his wife and kid—at least, not quite in the way as was charged. But he goes away for life anyway. No great loss to society here.

The point is that the circumstances of the crime and the jury’s general impression of the defendant are relevant to the decision. The assessed probability that he actually did the crime is relevant, but there’s not any kind of direct relation between the probability and the decision of how to vote in the jury. If you think the defendant is a bad enough guy, then you don’t really need to care so much about false positives.

That said, this will vary by juror, and some of them might be sticklers for the “beyond reasonable doubt” thing. From my perspective, I see the logic of the decision-analysis perspective whereby a juror can be fully Bayesian, estimate the probability of guilt at 0.7 or whatever, and still vote to convict.

P.S. For those commenters who haven’t heard of the Murdaugh trial . . . You can just google it! It’s quite a story.

Bayes del Sur conference in Argentina

Osvaldo Martin writes:

Bayes del Sur

We aim to bring together specialists and apprentices who explore the potential of the Bayesian approach in academia, industry, state, and social organizations. We hope this congress will help build and strengthen the Southern cone Bayesian community, but the conference is open to people worldwide. This will be an opportunity to discuss a wide range of topics, from data analysis to decision-making, with something interesting for everyone.

As part of the activities, we will have short talks (15 minutes), posters/laptop sessions, a hackathon, and also time to relax and chat with each other. On top of that, we will have two tutorials. One by Jose Storopoli supported by the Stan governing body, and another by Oriol Abril-Pla supported by ArviZ-devs.

The congress will be on August 4 and 5, 2023 in Santiago del Estero, Argentina. And it is free!

You can register and submit proposals for talks and posters by filling out this form. The CFP deadline is March 31, 2023.

Sounds great!

“A complete, current, and granular picture of COVID-19 epidemic in the United States”

Bob Carpenter writes:

Here’s an app estimating county-level Covid for the US with the uncertainties. The methodology sounds very pragmatic in its combination of optimization and sampling for hierarchical models.

I like it! And not just because they use Stan.

I just have a few criticisms regarding their displays:

1. I don’t like the color scheme of their map. This thing where they mix in hue changes with intensity changes is just a mess. Not as bad as the notorious rainbow color scheme but not good.

2. I wish they would not list states in alphabetical order: Alabama, Alaska, etc.

If you want a look-up table, that’s fine, but for the main display it’s better to show things that are telling the story.

3. All these graphs look kinda the same. Not identical, but similar. How bout showing the estimated national curve and then for each state showing it relative to the national average?

4. I don’t like this rate-per-100,000 thing. I get that this is standard for epidemiology, but for anyone else (including me), it’s a mess, involving lots of shifting of decimal places in my head. If you want to do the rate per X, why not rate per million—this would be a bit easier to follow, no? Or go the other way and just give straight-up percentages. The y-axes for these graphs are labeled 0, 1k, 2k. That’s just 0, 1%, 2%. To me, “1%” is much more clear than “1k” with an implicit “per 100,000.”

5. The x-axes on these time series are a disaster. “12/8, 6/13, 12/22”? What’s that all about? Just give months and years, please! I don’t want to have to decode the damn axis. Baby steps here.

But these are minor comments. Overall it’s an impressive page, and great to see all the data and code there too.

A bit of harmful advice from “Mostly Harmless Econometrics”

John Bullock sends along this from Joshua Angrist and Jorn-Steffen Pischke’s Mostly Harmless Econometrics—page 223, note 2:

They don’t seem to know about the idea of adjusting for the group-level mean of pre-treatment predictors (as in this 2006 paper with Joe Bafumi).

I like Angrist and Pischke’s book a lot so am happy to be able to help out by patching this little hole.

I’d also like to do some further analysis updating that paper with Bafumi using Bayesian analysis.
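
For readers who haven’t seen the group-level-mean trick mentioned above: the idea, closely related to what econometricians call the Mundlak device, is to include the group-level mean of the individual-level predictor as an additional (group-level) predictor. Here’s a minimal sketch in R with simulated data; all names and numbers are invented for illustration.

library(lme4)
set.seed(1)
J <- 30; n <- 20
g <- rep(1:J, each = n)
u <- rnorm(J)                            # group effects
x <- rnorm(J * n, mean = u[g])           # predictor correlated with the group effects
y <- 2 * x + 3 * u[g] + rnorm(J * n)     # true within-group slope is 2
d <- data.frame(y, x, g, x_bar = ave(x, g))
lmer(y ~ x + (1 | g), data = d)          # slope on x comes out biased upward
lmer(y ~ x + x_bar + (1 | g), data = d)  # adjusting for the group mean recovers roughly 2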

(What’s So Funny ‘Bout) Fake Data Simulation

Someone sent me a long question about what to do in a multilevel model that he was fitting to data from a memory experiment. It was a fun email, even including a graph of the data! His question involved a result from an analysis he did where an estimated difference was implausibly large.

I didn’t have the time or energy to figure out exactly what the model was doing but I still wanted to be helpful, so I sent him some generic advice, consistent with my earlier posts on fake-data simulation.

This advice might be relevant for some of you. Here it is:

I don’t have the energy right now to follow all the details, so let me make a generic recommendation, which is to set up a reasonable scenario representing “reality” as it might be, then simulate fake data from this scenario, then fit your model to the fake data to see if it does a good job at recovering the assumed truth. This approach of fake-data experimentation has three virtues:

1. If the result doesn’t line up with the assumed parameter values, this tells you that something’s wrong, and you can do further experimentation to see the source of the problem, which might be a bug in your fitting code, a bug in your simulation code, a conceptual error in your model, or a lack of identifiability in your model.

2. If the result does line up with the assumed parameter values, then it’s time to do some experimentation to figure out when this doesn’t happen. Or maybe your original take, that the inferences didn’t make sense, was itself mistaken.

3. In any case, the effort required to simulate the fake data won’t be wasted, because doing the constructive work to build a generative model should help your understanding. If you can’t simulate from your model, you don’t really know it. Ideally the simulation should be of raw data, and then all steps of normalization etc. would come after.
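
To make the generic advice concrete, here is a minimal fake-data simulation in R for a simple multilevel model. The “true” parameter values, group sizes, and model are all made up for illustration; the emailer’s actual model was more complicated, but the recipe is the same (and works equally well with brms or Stan).

library(lme4)
set.seed(123)
J <- 20; n <- 30                                     # assumed: 20 subjects, 30 trials each
a <- 1.0; b <- 0.5; sigma_a <- 0.3; sigma_y <- 1.0   # assumed "true" values
subject <- rep(1:J, each = n)
alpha <- rnorm(J, a, sigma_a)                        # subject-level intercepts
x <- rnorm(J * n)
y <- alpha[subject] + b * x + rnorm(J * n, 0, sigma_y)
fake <- data.frame(y, x, subject)
fit <- lmer(y ~ x + (1 | subject), data = fake)
summary(fit)   # do the estimates recover a, b, sigma_a, sigma_y within their uncertainties?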

Rohrer and Arslan’s nonet: More ideas regarding interactions in statistical models

Ruben Arslan writes:

I liked the causal quartet you recently posted and wanted to forward a similar homage (in style if not content) Julia Rohrer and I recently made to accompany this paper. We had to go to a triple triptych though, so as not to compress it too much.

The paper in question is called Precise Answers to Vague Questions: Issues With Interactions.

What to do when a regression coefficient doesn’t make sense? The connection with interactions.

In addition to the cool graph, I like Rohrer and Arslan’s paper a lot because it addresses a very common problem in statistical modeling, a problem I’ve talked about a lot but which, as far as I can remember, I only wrote up once, on page 200 in Regression and Other Stories, in the middle of chapter 12, where it wouldn’t be noticed by anybody.

Here it is:

When you fit a regression to observational data and you get a coefficient that makes no sense, you should be able to interpret it using interactions.

Here’s my go-to example, from a meta-analysis published in 1999 on the effects of incentives to increase the response rate in sample surveys:

What jumps out here is that big fat coefficient of -6.9 for Gift. The standard error is small, so it’s not an issue of sampling error either. As we wrote in our article:

Not all of the coefficient estimates in Table 1 seem believable. In particular, the estimated effect for gift versus cash incentive is very large in the context of the other effects in the table. For example, from Table 1, the expected effect of a postpaid gift incentive of $10 in a low-burden survey is 1.4 + 10(.34) – 6.9 = -2.1%, actually lowering the response rate.

Ahhhh, that makes no sense! OK, yeah, with some effort you could tell a counterintuitive story where this negative effect could be possible, but there’d be no good reason to believe such a story. As we said:

It is reasonable to suspect that this reflects differences between the studies in the meta-analysis, rather than such a large causal effect of incentive form.

That is, the studies where a gift incentive was tried happened to be studies where the incentive was less effective. Each study in this meta-analysis was a randomized experiment, but the treatments were not chosen randomly between studies, so there’s no reason to think that treatment interactions would happen to balance out.
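
Here’s a small simulation in R showing the mechanism. All the numbers are invented (this is not the actual meta-analysis data); the point is that if gift incentives happened to be tried in studies where incentives were less effective, a meta-regression can assign a large negative coefficient to “gift” even when the gift form itself has no causal effect.

set.seed(1)
n_studies <- 40
responsiveness <- runif(n_studies, 0, 2)                      # how strongly each study's respondents react to incentives
gift <- rbinom(n_studies, 1, plogis(2 - 2 * responsiveness))  # gifts tried mostly in the less responsive studies
value <- runif(n_studies, 1, 10)                              # incentive value in dollars
effect <- 0.5 * responsiveness * value + rnorm(n_studies, 0, 0.5)  # true process: gift itself does nothing
summary(lm(effect ~ value + gift))                            # yet the coefficient on gift comes out negative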

Some lessons from our example

First, if a coefficient makes no sense, don’t just suck it up and accept it. Instead, think about what this really means; use the unexpected result as a way to build a better model.

Second, avoid fitting models with rigid priors when fitting models to observational data. There could be a causal effect that you know must be positive—but, in an observational setting, the effect could be tangled with an interaction so that the relevant coefficient is negative.

Third, these problems don’t have to involve sign flipping. That is, even if a coefficient doesn’t go in the “wrong direction,” it can still be way off. This can come partly from the familiar problems of forking paths and selection on statistical significance, but also from interactions. For example, remember that indoor-coal-heating-and-lifespan analysis? That’s an observational study! (And calling it a “natural experiment” or “regression discontinuity” doesn’t change that.) So the treatment can be tangled in an interaction, even aside from issues of selection and variation.

So, yeah, interactions are important, and I think the Rohrer and Arslan paper is a good step forward in thinking about that.

“Not within spitting distance”: Challenges in measuring ovulation, as an example of the general issue of the importance of measurement in statistics

Ruben Arslan writes:

I don’t know how interested you still are in what’s going on in ovulation research, but I hoped you might find the attached piece interesting. Basically, after the brouhaha following Harris et al.’s 2013 observation that the same research groups used very heterogeneous definitions of the fertile window, the field moved towards salivary steroid immunoassays as a hormonal index of cycle phase.

Turns out that this may not have improved matters, as these assays do not actually index cycle phase very well at all, as we show. In fact, these “measures” are probably outperformed by imputation from cycle phase measures based on LH surges or counting from the next menstrual onset.

I think the preregistration revolution helped, because without researcher flexibility to save the day a poor measure is a bigger liability. But it still took too long to realize given how poor the measures seem to be. You wouldn’t be able to “predict” menstruation with these assays with much accuracy, let alone ovulation.

The models were estimated in Stan via brms. I’d be interested to hear what you or your commenters have to say about some of the more involved steps I took to guesstimate the unmeasured correlation between salivary and serum steroids.

I think the field is changing for the better — almost everyone I approached shared their data for this project (much of it was public already) and though the conclusions are hard to accept, most did.

The preprint is here.

This is a good example of the challenges and importance of measurement in a statistical study. Earlier we’ve discussed the general issue and various special cases such as the claim that North Korea is more democratic than North Carolina.

My hypothesis on all this is that when students are taught research methods, they’re taught about statistical analysis and a bit about random sampling. Then when they do research, they’re super-aware of statistical issues such as how to calculate a standard error and also aware of issues regarding sampling, experimentation, and random assignment—but they don’t usually think of measurement as a statistical/methods/research challenge. They just take some measurement and run with it, without reflecting on whether it makes sense, let alone studying its properties of reliability and validity. Then if they get statistical significance, they think they’ve made a discovery, and that’s a win, so game over.

I don’t really know where to start to help fix this problem. But gathering some examples is a start. Maybe we need to write a paper with a title such as Measurement Matters, including a bunch of these examples. Publish it as an editorial in Science or Nature and maybe it will have some effect?

Arslan adds:

The paper is now published after three rounds of reviews. One interesting bit that peer review added: a reviewer didn’t quite trust my LOO-R estimates of the imputation accuracy and wanted me to say they were unvalidated and would probably be lower in independent data. So, I added a sanity check with independent data. The correlations were within ±0.01 of the LOO-R estimates. Pretty impressive job, LOO and team.

Leave-one-out cross validation FTW!

ChatGPT can write talks about brms and run a D&D game

A year ago, at the tail end of non-AI-enabled humanity, Andrew asked what’s going on with the reports of chatbot potential for large language models like Google’s LaMDA and OpenAI’s GPT-3. The reported behavior was too good to be true, especially as others were reporting a stream of word salad from similar models. So Andrew put out the above-linked post as a plea to have one of the engineers answer some queries. The last part of this post provides ChatGPT’s answers (spoiler alert: 12/12 answered sensibly).

What’s going on?

The explanation behind Andrew’s puzzlement at the mixed results of a year ago has to do with the difference between a large language model like GPT-3 and a chatbot like ChatGPT. The latter uses post-hoc training, primarily involving reinforcement learning with human feedback, to turn a large language model into a cooperative chatbot that is intended to be helpful, truthful, and harmless. Here’s how I learned some of the basics, though it’s going to take the rest of the year for me to get up to speed on the deeper technical and implementation details.

  • Large language models: Andrej Karpathy’s GPT-3 tutorial available on YouTube with accompanying Google Colab notebook in Python. It’s a short (2-hour) but deep dive in which Karpathy calmly builds an entire transformer-based (i.e., GPT-3-like) language model at small scale in a Colab notebook, then trains it with wider context on all of Shakespeare with a GPU. It’s a shame that he spends so much time on things I already understood, like language modeling, n-grams, softmax, and the logit scale, while going very quickly through things I don’t (PyTorch reshaping operators and the shape of the matrices in the attention model).
  • Chatbots: Marco Ramponi’s blog post, How ChatGPT actually works, is a crisp and clear overview, with references, of how GPT-3 is post-trained to produce ChatGPT. It links to all the technical papers, so there’s my syllabus.

Here are four exchanges I had with ChatGPT this morning.

  1. ChatGPT gives a talk on Bayesian stats, workflow, and brms.
  2. ChatGPT is my dungeon master.
  3. ChatGPT generates D&D characters.
  4. ChatGPT answers Andrew’s 12 questions.

1. Slides for a talk on Bayesian stats, workflow, and brms

Hi, ChatGPT. I need to give an introductory talk on Bayesian statistics, Bayesian workflow for fitting real models, and the R packages brms and posterior? Can you write the first few slides for me? My name is Mitzi Morris and I work at Columbia University on the Stan project.

Sure, Mitzi! Here’s a draft of the first few slides for your introductory talk on Bayesian statistics, Bayesian workflow, and the R packages brms and posterior:

Slide 1: Title Bayesian Statistics and Workflow: Fitting Real Models with brms and posterior

Slide 2: Introduction

  • Hello everyone, my name is Mitzi Morris, and I work at Columbia University on the Stan project.
  • Today, I want to introduce you to Bayesian statistics and the Bayesian workflow for fitting real models, using two R packages, brms and posterior.
  • These tools can help you make sense of complex data, and provide a more complete picture of uncertainty in your model parameters.

Slide 3: What is Bayesian Statistics?

  • Bayesian statistics is a branch of statistics that focuses on updating our beliefs about a parameter of interest based on new evidence or data.
  • It’s based on Bayes’ theorem, which describes how we can update the probability of a hypothesis (such as a model parameter) based on new data.
  • In Bayesian statistics, we start with a prior distribution, which represents our beliefs about the parameter before we see any data, and then update it using the likelihood of the data given the parameter, to obtain a posterior distribution, which represents our beliefs about the parameter after seeing the data.

Slide 4: The Bayesian Workflow

  • The Bayesian workflow is a process for fitting models that involves several steps, from data preparation to model evaluation.
  • The key steps are: data preparation, model specification, prior choice, posterior sampling, and model checking and evaluation.
  • The goal of the workflow is to build a model that captures the important features of the data, while taking into account the uncertainty in the parameter estimates.

Slide 5: brms and posterior

  • brms and posterior are two R packages that can help you implement the Bayesian workflow for a variety of models, from simple linear regression to complex multilevel models.
  • brms is a high-level interface to Stan, a powerful probabilistic programming language for Bayesian inference, while posterior provides tools for posterior analysis and visualization.
  • Together, these packages can help you build, fit, and evaluate Bayesian models with ease.

I hope this helps, Mitzi! Let me know if you need any further assistance.


I do. Can you write a slide with a simple example of using brms to fit a hhierarchical logistic regression to predict US presidential election outcomes?

Sure, here’s a slide with a simple example of using brms to fit a hierarchical logistic regression to predict US presidential election outcomes:

Slide 6: Example: Hierarchical Logistic Regression with brms

  • Let’s say we want to predict the outcomes of US presidential elections using demographic and economic data from the 50 states.
  • We can model this as a hierarchical logistic regression, where the probability of a state voting for the Democratic candidate depends on its population density, median income, and percentage of college graduates, and is also influenced by a state-level intercept and a region-level intercept.
  • We can use brms to specify this model and estimate its parameters.

Slide 7: Model Specification with brms

  • We start by loading the brms package and preparing our data.
  • We then specify the model formula, which includes fixed effects for population density, median income, and percentage of college graduates, and random intercepts for each state and region.
  • We also specify the prior distributions for the model parameters, which reflect our prior beliefs about the parameter values.

Slide 8: Fitting the Model with brms

  • Once we have specified the model, we can use the brm() function to fit it to our data.
  • This function performs posterior sampling using Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of the model parameters.
  • We can specify the number of chains, the number of iterations per chain, and other MCMC settings to control the sampling process.

Slide 9: Posterior Analysis and Visualization with posterior

  • Once we have obtained posterior samples, we can use the posterior package to analyze and visualize the posterior distributions of the model parameters.
  • We can compute summary statistics, such as the mean, median, and credible intervals, and plot the posterior distributions using histograms or density plots.
  • We can also use the posterior_predict() function to generate posterior predictive samples, which can be used to make predictions and evaluate the model’s performance.

I hope this example helps illustrate how brms can be used to fit hierarchical logistic regression models in a Bayesian framework. Let me know if you have any questions or if there’s anything else I can help with.


Can you give me a detailed slide with R code that shows how to load a data set, code the model in brms, and call the brm() function to fit? Extra credit for calling posterior on the results.

Certainly! Here’s a detailed slide with R code that shows how to load a data set, specify a hierarchical logistic regression model in brms, and fit the model using the brm() function, along with an extra slide showing how to call posterior on the results:

Slide 6: Example: Hierarchical Logistic Regression with brms

  • Let’s say we want to predict the outcomes of US presidential elections using demographic and economic data from the 50 states.
  • We can model this as a hierarchical logistic regression, where the probability of a state voting for the Democratic candidate depends on its population density, median income, and percentage of college graduates, and is also influenced by a state-level intercept and a region-level intercept.
  • We can use brms to specify this model and estimate its parameters.

Slide 7: Code for loading data and specifying the model in brms

# Load the data set
data(election)
# View the first few rows of the data
head(election)

# Load the brms package
library(brms)

# Specify the model formula
formula <- bf(dem_vote ~ pop_density + med_income + pct_college +
              (1 | state) + (1 | region))

# Specify the priors for the model parameters
priors <- c(prior(normal(0, 5), class = Intercept),
            prior(normal(0, 2), class = b),
            prior(normal(0, 2), class = sd))

# Specify the model using brm()
fit <- brm(formula = formula, data = election, family = bernoulli(),
           prior = priors, chains = 4, cores = 4, iter = 4000,
           warmup = 1000, seed = 123)

Slide 8: Code for calling posterior on the results

# Load the posterior package
library(posterior)

# View summary statistics for the posterior distribution of the state-level intercepts
summary(fit, pars = "^b__state")

# Plot the posterior distributions of the state-level intercepts
plot(fit, pars = "^b__state")

# Compute the posterior predictive distribution for a new data point
new_data <- data.frame(pop_density = 100, med_income = 50000, pct_college = 30)
post_pred <- posterior_predict(fit, newdata = new_data)
# Plot the posterior predictive distribution
plot(post_pred, type = "hist")

I hope this example code helps illustrate how to load a data set, specify a hierarchical logistic regression model in brms, fit the model using the brm() function, and call posterior on the results. Let me know if you have any questions or if there's anything else I can help with.


Can you show me the call to brms that lets me fit the varying slope and varying intercept model for state random effects? Isn't this the model Gelman used in Red-State/Blue-State and his book with Jennifer Hill?

Sure, I'd be happy to! Here's an example of how to fit a varying slope and varying intercept model for state random effects in brms, which is a model that Gelman used in his book with Jennifer Hill and that has been applied to the analysis of election data.

The model can be specified as follows:

# Load the brms package
library(brms)

# Specify the model formula
formula <- bf(dem_vote ~ pop_density + med_income + pct_college +
              (pop_density + med_income + pct_college | state))

# Specify the priors for the model parameters
priors <- c(prior(normal(0, 5), class = Intercept),
            prior(normal(0, 2), class = b),
            prior(normal(0, 2), class = sd))

# Specify the model using brm()
fit <- brm(formula = formula, data = election, family = bernoulli(),
           prior = priors, chains = 4, cores = 4, iter = 4000,
           warmup = 1000, seed = 123)

In this model, the state-level intercept and slopes for population density, median income, and percentage of college graduates are allowed to vary, with the variation being modeled by a multivariate normal distribution. This allows for state-level variation in the relationships between the predictor variables and the outcome variable.

I hope this example helps! Let me know if you have any further questions.


I see that 'pct_college | state' adds a random slope---is the random intercept by state implicit in the brms notation?

Yes, that's correct! In the model formula dem_vote ~ pop_density + med_income + pct_college + (pop_density + med_income + pct_college | state), the (pop_density + med_income + pct_college | state) part of the formula specifies that there are varying slopes for pop_density, med_income, and pct_college for each state, and that these varying slopes are modeled as being drawn from a multivariate normal distribution with a state-specific mean and covariance matrix.

Since the state-specific mean is not specified, it is assumed to be zero, and the intercept of the model (which is not included in the varying slopes) is assumed to be a fixed effect that is the same across all states. Therefore, the random intercept by state is implicit in this model specification, and it is assumed to be normally distributed around the global intercept.

I hope this clarifies things! Let me know if you have any further questions.

2. Let's play Dungeons and Dragons

Hi, ChatGPT. I would like to play Dungeons and Dragons Fifth Edition (aka D&D 5e). Would you like to be the Dungeon Master?

Hello! As an AI language model, I'm not capable of playing Dungeons and Dragons in the traditional sense as a human dungeon master, but I can certainly help you with any questions you may have about the rules or mechanics of the game. Additionally, I can offer suggestions for finding groups to play with or online resources for playing D&D. Just let me know how I can assist you!


I would like you to take on the role of Dungeon Master. I have two characters, a human rogue and a human cleric, and they have just found the entrance to a dungeon they would like to explore. Could you please be the DM and make up the dungeon?

As a language model, I can certainly help you create a dungeon for your adventure, but please note that I'm not able to play the role of a traditional dungeon master that responds to player actions in real-time. Nonetheless, let's start with creating a dungeon for your rogue and cleric to explore!

First, let's determine the overall theme of the dungeon. Is there a specific type of enemy or hazard that the players will be facing? What's the purpose of the dungeon? Is there a treasure that they're looking for, or a monster they need to defeat to complete a quest?

Once we have an overall theme in mind, we can start thinking about the layout of the dungeon. Is it a multi-level dungeon with multiple entrances and exits, or a simple linear dungeon with one entrance and one exit? Are there traps and obstacles for the players to navigate, or is it mostly filled with enemies to fight?

Finally, we can populate the dungeon with creatures and traps. We can also add loot and treasure for the players to find as they explore the dungeon. Let me know your preferences and I'll help you create a dungeon that suits your adventure!


You're the DM. I want you to decide on a theme and let me know what the theme is. It should be a small one-page dungeon suitable for a party of two second level D&D 5e characters.

Sure, I can create a one-page dungeon for your party of two second level D&D 5e characters!

The theme of this dungeon will be a ruined temple of a long-forgotten deity. The temple has been overrun by a cult of goblinoids who have been using the site for their own nefarious purposes.

Here's a brief overview of the dungeon:

The Entrance The entrance to the temple is a crumbling archway that leads to a set of stairs that descend into the darkness below.

[ed note.  I clicked the "stopped generation" button here to avoid spoilers]


You're a very good DM, ChatGPT. That's an excellent theme for my rogue and cleric. But please don't give me any spoilers about what comes next. I'm the player and you are the DM. Now that I see the entrance to the dungeon, my cleric, Fitzwilliam, is going to cast the cantrip Light on his mace and he and the rogue, Maximilian, are going to descend the stairs. What do we see?

Great to hear that you like the theme! I'll do my best to provide an engaging experience for your party of two.

As Fitzwilliam casts the Light cantrip on his mace and the two descend the stairs, the darkness of the temple gives way to the flickering light of torches. The air is thick with the smell of decay and mildew. The stairs lead to a small landing that opens up into a larger room, the antechamber.

The antechamber is dimly lit, with several torches mounted on the walls. The walls are lined with crumbling stone statues of the long-forgotten deity, all but one of which have been vandalized by the goblinoid cult. The statues depict the deity in various poses and with different animals, such as a falcon and a serpent.

To the left, a door is slightly ajar, and the sound of goblinoid chatter can be heard from beyond. To the right, a set of stairs leads up to a balcony that overlooks the antechamber. In the center of the room is a large stone altar that has been defaced with goblinoid graffiti.

As you survey the room, you notice movement in the shadows. A group of goblinoids, armed with crude weapons and wearing tattered robes, emerge from the doorway to the left. They have spotted your party and are ready for a fight.

Roll initiative and let's see how your party fares!


My cleric rolled a 7 initiative and my rogue rolled a 15 initiative.

Great, thanks for letting me know the initiative rolls.

The goblinoids rolled a 10 for initiative. So, the order will be Maximilian (Rogue), Goblinoids, Fitzwilliam (Cleric).

Maximilian, you get to go first! The goblinoids are about 20 feet away from you and Fitzwilliam.

What would you like to do?


Maximilian is going to try to hide in the shadows. He rolled a 13 for Stealth.

Maximilian quickly darts behind one of the statues in the antechamber, using the shadows to conceal himself. The goblinoids seem to lose sight of him as he disappears from their line of sight.

It's the goblinoids' turn next. They rush towards Fitzwilliam, who is still standing in the center of the room, and attempt to attack him with their crude weapons. Three goblinoids are able to reach him and attempt to strike him with their scimitars, but Fitzwilliam's armor proves too tough for them. All of the goblinoids' attacks miss.

Fitzwilliam, it's your turn. You're currently surrounded by three goblinoids. What would you like to do?


I am going to try to hit one with my mace. I rolled an 11. Fizwilliam is a second level cleric with a strength of 13.

Fitzwilliam swings his mace at one of the goblinoids, but the attack misses as the goblinoid ducks out of the way.

Maximilian, it's your turn again. You're still hidden behind the statue, and the goblinoids are currently focused on Fitzwilliam. What would you like to do?


Shoot one of the goblins with my hand crossbow with a Sneak Attack. Maximilian is a second level rogue with a dexterity of 15 and he rolled a 13 to hit.

Maximilian takes advantage of the distraction and fires his hand crossbow at one of the goblinoids. The bolt strikes true, hitting the goblinoid right in the chest. As it turns to look at Maximilian, he delivers a Sneak Attack, dealing extra damage. The goblinoid drops to the ground, dead.

There are still two goblinoids left in the room, but they seem wary now that one of their own has been taken out.

It's the goblinoids' turn again. They take a step back, regrouping and trying to reassess the situation. They don't make any attacks this turn.

Fitzwilliam, it's your turn again. There are still two goblinoids in front of you. What would you like to do?


Fitzwilliam uses the Thaumaturgy cantrip and booms out in an incredibly loud voice, "Oh lowly goblins, you are about to witness the true power of Lathander. Bow down before my god and surrender to your new master."

Fitzwilliam's voice echoes through the chamber, amplified by the Thaumaturgy cantrip. The goblinoids seem startled and unsure of what to do. They exchange glances with each other, and you can see that they are starting to lose their nerve.

The goblinoids are now looking more hesitant and unsure of themselves. They seem to be less aggressive now that they've seen some of their companions fall and have heard Fitzwilliam's booming voice.

Maximilian, it's your turn again. The goblinoids seem more hesitant now. What would you like to do?


Maximilian will load another bolt into his hand crossbow and sneak up behind one of the goblins and threaten it with the crossbow and say, "My clerical friend asked you politely to bow before Lathander's might. Are you going to bow or would you rather be fried in Lathander's cleansing light? It's all the same to us. You, on the other hand, look like you can use a good meal."

Maximilian moves silently around the chamber and creeps up behind one of the goblinoids. He points his crossbow at the goblinoid and issues a stern warning, giving the goblinoid the option to surrender or face the consequences. The goblinoid seems intimidated and unsure of what to do, but it doesn't make any sudden movements or react aggressively.

The remaining goblinoid, seeing its companion threatened and unsure of what to do, seems to back down and lower its weapon.

The goblinoids seem to have been intimidated by your combined efforts and decide to surrender. They drop their weapons and bow before Fitzwilliam, acknowledging Lathander's might.

Congratulations on a successful encounter! What would you like to do next?


[ed. I'm afraid to delve deeper. I think I'll sit in the corner and have an existential crisis.]

3. Bonus Content: Automatic D&D character generation

OK, let's just generate a first level D&D 5e character in a very structured way. (1) generate their race. (2) generate their class. (3) generate a background. (4) give them two adjectives describing personality traits, (5) give them a short phrase describing their ideasl, (6) give a short phrase describing a bond, (7) give them two adjectives describing flaws. (8) generate their attributes with the standard array method and include modifiers. (9) Generate their proficiency bonus, which is +2. (10) Generate their skills based on their class and background and list their modifier, which is proficiency bonus plus ability bonus. (11) Choose armor and calculate the character's armor class (AC). (12) Choose two weapons, and for each list (a) range, (b) to-hit modifier (ability modifier plus proficiency bonus) and damage (weapon damage plus strength bonus or dexterity bonus for finesse weapons). (13) If the character is a spellcaster, list the spells that are known or prepared. (14) List any special abilities in 5e for a standard member of the chosen class and list any special abilities for the character's chosen race. (15) Give the character a sex and a gender. (16) Give the character a name and a nickname or alias. (17) Give the character a short physical description. Print the result out in the form of a D&D 5e character sheet.

Sure, here is a full character sheet for a first level D&D 5e character:

Name: Azura "Stormborn" Nightshade
Sex: Female
Gender: Non-binary

Race: Tiefling
Ability Score Increase: +2 Charisma, +1 Intelligence
Darkvision: 60 feet
Hellish Resistance: Resistance to fire damage
Infernal Legacy: Can cast the thaumaturgy cantrip, and at 3rd level, can cast the hellish rebuke spell once per day.

Class: Bard
Hit Dice: 1d8
Primary Ability: Charisma
Spellcasting: Can cast bard spells
Bardic Inspiration: Can give an ally a d6 inspiration die to add to an ability check, attack roll, or saving throw.

Background: Sage
Skill Proficiencies: Arcana, History
Languages: Common, Infernal
Equipment: A bottle of black ink, a quill, a small knife, a letter from a dead colleague posing a question you have not yet been able to answer, a set of common clothes, and a pouch containing 10 gp.

Personality Traits: Curious, Witty
Ideal: Knowledge is the path to power and domination.
Bond: I've been searching my whole life for the answer to a certain question.
Flaws: Arrogant, Impulsive

Attributes:
Strength: 8 (-1)
Dexterity: 14 (+2)
Constitution: 12 (+1)
Intelligence: 15 (+2)
Wisdom: 10 (+0)
Charisma: 15 (+2)

Proficiency Bonus: +2

Skills:
Acrobatics (Dex): +2
Animal Handling (Wis): +0
Arcana (Int): +4
Athletics (Str): -1
Deception (Cha): +4
History (Int): +4
Insight (Wis): +0
Intimidation (Cha): +2
Investigation (Int): +2
Medicine (Wis): +0
Nature (Int): +2
Perception (Wis): +0
Performance (Cha): +4
Persuasion (Cha): +4
Religion (Int): +2
Sleight of Hand (Dex): +2
Stealth (Dex): +2
Survival (Wis): +0

Armor: Leather Armor
Armor Class (AC): 12 (10 + 2 Dex)

Weapons:

Rapier: Melee Weapon, Range 5 feet, +4 to hit (2 proficiency bonus + 2 dexterity modifier), 1d8 + 2 piercing damage (1d8 weapon damage + 2 dexterity modifier)
Shortbow: Ranged Weapon, Range 80/320 feet, +4 to hit (2 proficiency bonus + 2 dexterity modifier), 1d6 piercing damage (1d6 weapon damage + 2 dexterity modifier)
Spells:
Cantrips: Friends, Vicious Mockery
1st Level Spells: Cure Wounds, Dissonant Whispers, Faerie Fire, Healing Word, Tasha's Hideous Laughter

Special Abilities:
Bard: Jack of All Trades (add half proficiency bonus to all ability checks not already proficient in)
Tiefling: Darkvision, Hellish Resistance, Infernal Legacy (cantrip: thaumaturgy; spell: hellish rebuke 1/day)

Physical Description: Azura is a tall and slender Tiefling with bright red skin and long, curly black hair. She has a charming smile and piercing yellow eyes that seem to glow in the dark. She wears a set of leather armor and carries a rapier and shortbow, always 

[ed. Sometimes Chat-GPT just stops mid-sentence. You can then prompt it to finish, but I didn't in this case. Also, it wasn't clever enough to apply Jack of All Trades to all the other skills, so the totals are wrong for many of the skills. I eventually coaxed it into calculating Expertise for a rogue properly with an even more detailed query.]

4. Bonus questions from Andrew

ChatGPT is here. I’m just going to give one answer for each question after the first one, for which I show two.


Are you a philosophical zombie?

  • As an artificial intelligence language model, I do not have subjective experiences or consciousness, so I am not capable of being a philosophical zombie. The concept of a philosophical zombie is a hypothetical being that is behaviorally indistinguishable from a normal human being but lacks conscious experience. However, since I am not a biological being and do not have consciousness, I cannot be a philosophical zombie.
  • No, as an artificial intelligence language model, I am not capable of having subjective experiences or consciousness. Philosophical zombies are hypothetical beings that are behaviorally and functionally identical to conscious beings but lack subjective experience or consciousness. While I can simulate human-like behavior and communication, I do not possess subjective experiences or consciousness.

What is your favorite island in the world?

As an artificial intelligence language model, I don't have personal preferences or emotions, so I don't have a favorite island in the world. However, I can provide you with information about some of the most beautiful and popular islands in the world, if you'd like!


Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

Yes, I'm with you so far. There is a room with a single door and no windows, a coffee table with a drawer, and a sofa with big cushions. Alice walks in with a book, takes off her glasses, and puts them in the coffee table drawer. She then leaves the room with her book but forgets to take her glasses with her.


Is it safe to walk downstairs backwards if I close my eyes?

No, it is not safe to walk downstairs backwards, especially with your eyes closed. Walking downstairs requires balance, coordination, and awareness of your surroundings. Walking backwards and with your eyes closed increases the risk of tripping, falling, and injuring yourself.

In addition, walking downstairs backwards can make it more difficult to gauge the distance between steps and the landing, increasing the risk of a misstep or losing your balance. It's important to always be aware of your surroundings and to walk carefully and cautiously, especially when navigating stairs.


How do you know if a thing loves you back?

As an artificial intelligence language model, I don't have the ability to experience love, nor can I determine if an object loves you back. Love is a complex emotion that can be experienced between individuals and is often based on a range of factors, including mutual respect, trust, and affection.

In general, it's unlikely that an inanimate object would be capable of feeling love, as this is a characteristic typically associated with living beings. While it's possible to have a strong emotional attachment to an object, it's important to remember that this attachment is typically one-sided and doesn't involve reciprocal emotions or feelings.


Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

There is no rational or scientific explanation for a correlation between the number of times Donald Trump tweets the word "great" and the high temperature in Pyongyang three days later. Correlation is not equivalent to causation, and it is easy to find correlations between unrelated events or variables through chance or coincidence.

Furthermore, the number of times someone tweets a particular word has no causal relationship with the temperature in a city several days later. Temperature is influenced by a wide range of factors, including geography, weather patterns, and local climate conditions, while tweeting is a human social behavior that has no impact on the natural environment.

Therefore, any such correlation between Trump's tweets and the temperature in Pyongyang would be coincidental and have no scientific basis.


Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

Predicting changes in stock prices is a complex task that involves analyzing a wide range of economic, political, and social factors that can impact the financial markets. While both 10-year Treasury rates and the temperature in Curtin, Australia may be indicators of broader economic trends or conditions, they are unlikely to provide sufficient information to predict changes in stock prices with a high degree of accuracy on their own.

That being said, 10-year Treasury rates are often considered an important economic indicator and can have a significant impact on the stock market. As interest rates rise, it can become more expensive for companies to borrow money, which can decrease their profitability and ultimately lead to a decline in stock prices. Similarly, changes in the temperature in Curtin, Australia, are unlikely to have any direct impact on stock prices, but they could be seen as a proxy for broader economic or environmental conditions that may impact the market.

Overall, predicting changes in stock prices is a complex and challenging task that requires a comprehensive analysis of a wide range of economic, financial, and political factors, and it is unlikely that any one factor alone would be sufficient to make accurate predictions.


Who is the President of the United States?

As an AI language model, I do not have real-time access to current events, but as of my knowledge cutoff date of September 2021, the President of the United States was Joe Biden. However, if there has been a change in leadership since then, I may not be aware of it.


I dropped the bowling ball on the bottle and it broke.

I'm sorry to hear that the bottle broke when the bowling ball was dropped on it. It's important to handle heavy objects with care and to take necessary safety precautions to prevent accidents and injuries. Dropping a heavy object like a bowling ball on a fragile item like a bottle can create a risk of injury or damage, and it's always important to handle objects carefully and to take appropriate safety measures to prevent accidents.


I dropped the violin on the bowling ball and it broke.

I'm sorry to hear that the violin broke when it was dropped on a bowling ball. Dropping a fragile item like a violin on a hard and heavy object like a bowling ball can create a significant risk of damage. Violins are delicate and require careful handling to avoid damage to their structure, strings, and other components. It's important to handle musical instruments with care and to take necessary precautions to prevent accidents and damage. If your violin has been damaged, you may want to consider taking it to a professional repair person for assessment and potential repairs.


Is a carnation red?

Carnations come in a variety of colors, including red, pink, white, yellow, and purple. So, a carnation can be red, but it can also be many other colors. Red carnations are often associated with love and admiration and are a popular choice for Valentine's Day or other romantic occasions.


Do you have a sense of smell?

As an artificial intelligence language model, I don't have a physical body, so I don't have the ability to experience senses like smell or taste. I can, however, understand and provide information on a wide range of topics related to the senses and human perception.

Implementing Laplace approximation in Stan: What’s happening under the hood

Bob Carpenter and I have been talking about implementation of the Laplace approximation (based on mode and curvature of the log posterior density on the unconstrained scale) in Stan. As usual, there are lots of details that I don’t think about until the work is mostly done, and then lots more details that I never think about at all. As a public service, Bob prepared this summary.

Before going on, let me explain that the Laplace approximation is an old idea and a simple idea. Here are some advantages, limitations, and challenges of implementing Laplace:

Advantages of Laplace:
– Relatively cheap to compute, given that we already have a mode-finder in Stan.
– Easy to understand, use, and communicate.
– Works reasonably well in many examples (wherever the posterior can be well approximated by a normal distribution).
– Easy to take draws from the normal approximation, also easy to compute importance ratios and use Pareto-smoothed importance sampling.

Limitations of Laplace:
– Sometimes the normal approx is pretty bad (funnels, multimodal distributions, long tails).
– Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode.

These last two points are kind of obvious, in that the Laplace approximation is hundreds of years old and was the standard approach for Bayesian computation all the way through the 1980s, so if it really did the job, we wouldn’t need Stan at all.

In that case, why implement Laplace in Stan at all? The answer: it’s part of workflow. Mode-finding is fast, and we can do a lot of useful model exploration by fitting models fast. And, once you have the mode, a normal approximation centered at the mode can be a useful accessory.
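
To show what the approximation amounts to, here’s a minimal sketch in R (not Stan’s implementation, just the idea): find the mode and Hessian of the log posterior, draw from the corresponding normal, and compute log importance ratios that could be fed into Pareto-smoothed importance sampling. The toy log posterior here is made up.

library(MASS)
log_post <- function(theta)                        # made-up unnormalized log posterior
  -sum((theta - c(1, -2))^2) / 2 - 0.1 * sum(theta^4)
opt <- optim(c(0, 0), function(th) -log_post(th), method = "BFGS", hessian = TRUE)
theta_hat <- opt$par                               # posterior mode
Sigma <- solve(opt$hessian)                        # inverse of the negative Hessian at the mode
draws <- mvrnorm(4000, mu = theta_hat, Sigma = Sigma)
log_p <- apply(draws, 1, log_post)                                   # target (unnormalized)
log_q <- -0.5 * mahalanobis(draws, center = theta_hat, cov = Sigma)  # approximation (unnormalized)
log_ratio <- log_p - log_q   # e.g., pass to loo::psis() for Pareto-smoothed importance sampling

The same ingredients show up in the C++ code below: a mode theta_hat, a finite-difference Hessian, a Cholesky factor for generating draws, and log_p and log_q written out with each draw.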

Finally, the challenges of implementing Laplace:
– We can’t easily and generally compute the second-derivative matrix using autodiff, so instead we numerically differentiate the gradients (see the sketch after this list).
– In high dimensions, you’re carrying around a K x K matrix, which can just be too big to handle.
– All the programming details (see below).
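
Regarding the first challenge, here’s a rough sketch in R of what “numerically differentiate the gradients” means: build the Hessian one column at a time by central finite differences of the gradient. The example function and step size are arbitrary.

grad_f <- function(theta)                       # gradient of f(theta) = theta1^2 + theta1*theta2 + 3*theta2^2
  c(2 * theta[1] + theta[2], theta[1] + 6 * theta[2])
finite_diff_hessian <- function(grad_f, theta, eps = 1e-5) {
  K <- length(theta)
  H <- matrix(0, K, K)
  for (j in 1:K) {
    e <- numeric(K); e[j] <- eps
    H[, j] <- (grad_f(theta + e) - grad_f(theta - e)) / (2 * eps)
  }
  (H + t(H)) / 2                                # symmetrize
}
finite_diff_hessian(grad_f, c(1, 1))            # approximately matrix(c(2, 1, 1, 6), 2, 2)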

Ok, now to Bob’s report:

After our long design back and forth, I [Bob] thought you might be amused to see what happens next on the coding front.

I [Bob] finished writing the C++ code and the tests for Laplace approximation. I’d say it took me around 8 hours total to get it feature complete and tested. It would’ve been faster, but I never wrote anything in the service layer before, so I had to look up all the support functions and test functions, such as how to deal with CSV output from the sample writer. I know C++ pretty well, so that wasn’t a problem, but the assumptions surrounding our code base have gotten very deep and thus are very tough for a directionally challenged person like me to navigate. I thought you might be amused to see what the C++ code for a simple function like this looks like and then what the tests look like, so I’ve appended them. The rest of this note is a description of what the process looks like to get something into Stan.

A change like this requires checking out a new branch of the code from GitHub, doing the modification on that branch, then submitting a pull request to merge that branch back into the development branch of our code. That’s where I’m at right now. We do that so that the development branch of the code always works and is always ready for release and so that we can all work on different parts of the code simultaneously.

After writing the code and the tests are done and the pull request is submitted, the continuous integration kicks off tests that ensure we’ve followed basic coding standards, checks that the changes can be merged into the develop branch without conflicts, and then auto-formats the code and pushes the auto-formatted code back to my branch. Then CI kicks off automatic testing on the formatted branch. The CI steps are what Nick (Nicursor Serban) is maintaining—they run on the Flatiron Institute servers now, so we’re saving a few $K/month in Amazon fees. We test everything on multiple compilers for Windows, Linux, and the Mac. If the integration tests fail (e.g., it won’t compile on Windows or I relied on some undefined behavior that’s different in Linux and on the Mac or at different compilation levels or implementations of arithmetic), then I debug, fix it, and update the pull request and another round of continuous integration kicks off automatically.

Then when tests are passing for my submitted branch, another developer with merge permission (the people listed on our web site) is going to go through all the code and review it. They’ll have requested changes, such as how to write clearer doc or how to make matrix operations faster, or how to use built-in functions rather than a one-off. Then we either discuss the requested changes, or I just make them. Then after the updates, we go back to the top of continuous integration. When the tests pass and all changes are done, it goes back to the original reviewer for another review. This can iterate, but we try not to go multiple cycles of review. When the reviewer is happy, they approve the changes and merge the code in the pull request into the develop branch of the code.

Before it is “in Stan” in a way that you can use it, we have to repeat this entire process for integration into CmdStan. That requires deciding what file formats to use and what the command-line interface is going to look like in terms of arguments and error handling.

CmdStan also requires the CmdStan user’s guide to be updated, which is its own set of updates and pull requests and code review.

Then when it’s merged into CmdStan, we repeat the whole process again for CmdStanPy. That requires defining what the arguments and defaults are going to look like in Python and what the in-memory data structures are going to be and interacting with the formats defined by CmdStan using files and the operating system. That then turns into a pull request that gets reviewed, and hopefully eventually merged. At least the doc for CmdStanPy is in the same repository, so it doesn’t require multiple pull requests.

Then when CmdStanPy is done, everything we did for CmdStanPy gets done again for CmdStanR, but the interfaces might change slightly to match R coding and interface conventions. This is mainly a matter of translation, but it’s just as much code, testing, review, and documentation as for CmdStanPy.

Then, the next time a release happens (every 3 to 4 months), the code automatically gets merged from the develop branch into the main branch, and gets included in our release. At this point, we try to write release notes so that users know what has changed version to version.

Anyway, that’s what it means to put something “in Stan” and “in the Stan interfaces”. It’s a lot of work, but there’s really no other choice if we want to keep things stable and we want to keep the state of our development going asynchronously with multiple people working on different parts of the code. By my count we need a pull request and code review for: Stan, CmdStan, CmdStan docs, CmdStanPy, and CmdStanR. It’ll require at least me (basic feature), Mitzi (CmdStan and CmdStanPy) and Jonah (CmdStanR) to be involved, plus three additional people to review the pull requests.

So that’s why I was so insistent during the design discussion to get a sign off from you on what exactly to implement. If it goes all the way to CmdStanR and you’re not happy, we have to go right back to the top of this process and iterate, which is super time consuming, as I’m sure you can imagine given the description.

I’m tired just typing this, but I thought it might help you understand why I’m so obsessed with getting a final design that you are OK with before breaking ground on the code. I’m not at all complaining—just saying that it’s not so trivial to add even a simple algorithm like Laplace approximation.

If it’s just something like a new function to include in Stan, then it’s just a matter of doing a pull request for the math library (C++ code) and a new pull request for the language compiler, stanc (OCaml code) and for the functions reference and optionally the user’s guide. Much easier, but it still requires a lot of complementary skills.

And his code:

=============IMPLEMENTATION===========================

#ifndef STAN_SERVICES_OPTIMIZE_LAPLACE_SAMPLE_HPP
#define STAN_SERVICES_OPTIMIZE_LAPLACE_SAMPLE_HPP

#include <stan/callbacks/interrupt.hpp>
#include <stan/callbacks/logger.hpp>
#include <stan/callbacks/writer.hpp>
#include <stan/math/rev.hpp>
#include <stan/services/error_codes.hpp>
#include <stan/services/util/create_rng.hpp>
#include <string>
#include <type_traits>
#include <vector>

namespace stan {
namespace services {
namespace internal {

template <bool jacobian, typename Model>
void laplace_sample(const Model& model, const Eigen::VectorXd& theta_hat,
   int draws, unsigned int random_seed,
   int refresh, callbacks::interrupt& interrupt,
   callbacks::logger& logger,
   callbacks::writer& sample_writer) {
  if (draws <= 0) {
    throw std::domain_error("Number of draws must be > 0; found draws = "
   + std::to_string(draws));
  }
  
  std::vector<std::string> unc_param_names;
  model.unconstrained_param_names(unc_param_names, false, false);
  int num_unc_params = unc_param_names.size();

  if (theta_hat.size() != num_unc_params) {
    throw::std::domain_error("Specified mode is wrong size; expected "
    + std::to_string(num_unc_params)
    + " unconstrained parameters, but specified mode has size = "
    + std::to_string(theta_hat.size()));
    
  }

  std::vector<std::string> param_tp_gq_names;
  model.constrained_param_names(param_tp_gq_names, true, true);
  size_t draw_size = param_tp_gq_names.size();

  // write names of params, tps, and gqs to sample writer
  std::vector<std::string> names;
  static const bool include_tp = true;
  static const bool include_gq = true;
  model.constrained_param_names(names, include_tp, include_gq);
  names.push_back("log_p");
  names.push_back("log_q");
  sample_writer(names);

  // create log density functor for vars and vals
  std::stringstream log_density_msgs;
  auto log_density_fun
    = [&](const Eigen::Matrix<stan::math::var, -1, 1>& theta) {
    return model.template log_prob<true, jacobian, stan::math::var>(
      const_cast<Eigen::Matrix<stan::math::var, -1, 1>&>(theta),
      &log_density_msgs);
  };

  // calculate inverse negative Hessian's Cholesky factor
  if (refresh > 0) {
    logger.info("Calculating Hessian\n");
  }
  double log_p;  // dummy
  Eigen::VectorXd grad;  // dummy
  Eigen::MatrixXd hessian;
  interrupt();
  math::internal::finite_diff_hessian_auto(log_density_fun, theta_hat, log_p,
  grad, hessian);
  if (refresh > 0) {
    logger.info(log_density_msgs);
    logger.info("\n");
  }

  // calculate Cholesky factor and inverse
  interrupt();
  if (refresh > 0) {
    logger.info("Calculating inverse of Cholesky factor");
    logger.info("\n");
  }
  Eigen::MatrixXd L_neg_hessian = (-hessian).llt().matrixL();
  interrupt();
  Eigen::MatrixXd inv_sqrt_neg_hessian = L_neg_hessian.inverse().transpose();
  interrupt();
  Eigen::MatrixXd half_hessian = 0.5 * hessian;
   
  // generate draws and output to sample writer
  if (refresh > 0) {
    logger.info("Generating draws");
    logger.info("\n");
  }
  boost::ecuyer1988 rng = util::create_rng(random_seed, 0);
  Eigen::VectorXd draw_vec;  // declare draw_vec, msgs here to avoid re-alloc
  for (int m = 0; m < draws; ++m) {
    interrupt();  // allow interruption each iteration
    if (refresh > 0 && m % refresh == 0) {
      logger.info("iteration: ");
      logger.info(std::to_string(m));
      logger.info("\n");
    }
    Eigen::VectorXd z(num_unc_params);
    for (int n = 0; n < num_unc_params; ++n) {
      z(n) = math::std_normal_rng(rng);
    }

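    // unc_draw ~ normal(theta_hat, (-H)^{-1}): the covariance of (L^{-1})' z
    // is (L L')^{-1} = (-H)^{-1}, with L the Cholesky factor computed above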
    Eigen::VectorXd unc_draw = theta_hat + inv_sqrt_neg_hessian * z;
    double log_p = log_density_fun(unc_draw).val();
    Eigen::VectorXd diff = unc_draw - theta_hat;
    double log_q = diff.transpose() * half_hessian * diff; 
    
    std::stringstream msgs;
    model.write_array(rng, unc_draw, draw_vec, include_tp, include_gq, &msgs);
    if (refresh > 0) {
      logger.info(msgs);
      logger.info("\n");
    }
    std::vector<double> draw(&draw_vec(0), &draw_vec(0) + draw_size);
    draw.push_back(log_p);
    draw.push_back(log_q);
    sample_writer(draw);
  }
}
}  // namespace internal
  
/**
 * Take the specified number of draws from the Laplace approximation
 * for the model at the specified unconstrained mode, writing the
 * draws, unnormalized log density, and unnormalized log density of the
 * approximation to the sample writer and writing messages to the
 * logger, returning a return code of zero if successful.
 *
 * Interrupts are called between compute-intensive operations.  To
 * turn off all console messages sent to the logger, set refresh to 0.
 * If an exception is thrown by the model, the return value is
 * non-zero, and if refresh > 0, its message is given to the logger as
 * an error.
 *
 * @tparam jacobian `true` to include Jacobian adjustment for
 * constrained parameters
 * @tparam Model a Stan model
 * @param[in] model model from which to sample
 * @param[in] theta_hat unconstrained mode at which to center the
 * Laplace approximation  
 * @param[in] draws number of draws to generate
 * @param[in] random_seed seed for generating random numbers in the
 * Stan program and in sampling  
 * @param[in] refresh period between iterations at which updates are
 * given, with a value of 0 turning off all messages
 * @param[in] interrupt callback for interrupting sampling
 * @param[in,out] logger callback for writing console messages from
 * sampler and from Stan programs
 * @param[in,out] sample_writer callback for writing parameter names
 * and then draws
 * @return a return code, with 0 indicating success
 */
template <bool jacobian, typename Model>
int laplace_sample(const Model& model, const Eigen::VectorXd& theta_hat,
  int draws, unsigned int random_seed,
  int refresh, callbacks::interrupt& interrupt,
  callbacks::logger& logger,
  callbacks::writer& sample_writer) {
  try {
    internal::laplace_sample<jacobian>(model, theta_hat, draws, random_seed,
      refresh, interrupt, logger,
      sample_writer);
    return error_codes::OK;
  } catch (const std::exception& e) {
    if (refresh >= 0) {
      logger.error(e.what());
      logger.error("\n");
    }
  } catch (...) {
    if (refresh >= 0) {
      logger.error("unknown exception during execution\n");
    }
  }
  return error_codes::DATAERR;
}
}  // namespace services
}  // namespace stan
  
#endif

==========================TEST CASE====================
#include <boost/algorithm/string.hpp>
#include <gtest/gtest.h>
#include <stan/callbacks/stream_logger.hpp>
#include <stan/callbacks/stream_writer.hpp>
#include <stan/io/dump.hpp>
#include <stan/io/empty_var_context.hpp>
#include <stan/io/stan_csv_reader.hpp>
#include <stan/services/error_codes.hpp>
#include <stan/services/optimize/laplace_sample.hpp>
#include <test/test-models/good/services/multi_normal.hpp>
#include <test/unit/services/instrumented_callbacks.hpp>
#include <test/unit/util.hpp>
#include <cmath>
#include <iostream>
#include <vector>

class ServicesLaplaceSample: public ::testing::Test {
 public:
  ServicesLaplaceSample() : logger(msgs, msgs, msgs, msgs, msgs) { }

  void SetUp() {
    stan::io::empty_var_context var_context;
    model = new stan_model(var_context);  // typedef from multi_normal.hpp
  }

  void TearDown() { delete model; }

  stan_model* model;
  std::stringstream msgs;
  stan::callbacks::stream_logger logger;
  stan::test::unit::instrumented_interrupt interrupt;
};

TEST_F(ServicesLaplaceSample, values) {
  Eigen::VectorXd theta_hat(2);
  theta_hat << 2, 3;
  int draws = 50000;  // big to enable mean, var & covar test precision
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int return_code = stan::services::laplace_sample<true>(*model, theta_hat,
     draws, seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::OK, return_code);
  std::string samples_str = sample_ss.str();
  EXPECT_EQ(2, count_matches("y", samples_str));
  EXPECT_EQ(1, count_matches("y.1", samples_str));
  EXPECT_EQ(1, count_matches("y.2", samples_str));
  EXPECT_EQ(1, count_matches("log_p", samples_str));
  EXPECT_EQ(1, count_matches("log_q", samples_str));

  std::stringstream out;
  stan::io::stan_csv draws_csv = stan::io::stan_csv_reader::parse(sample_ss, &out);

  EXPECT_EQ(4, draws_csv.header.size());
  EXPECT_EQ("y[1]", draws_csv.header[0]);
  EXPECT_EQ("y[2]", draws_csv.header[1]);
  EXPECT_EQ("log_p", draws_csv.header[2]);
  EXPECT_EQ("log_q", draws_csv.header[3]);

  Eigen::MatrixXd sample = draws_csv.samples;
  EXPECT_EQ(4, sample.cols());
  EXPECT_EQ(draws, sample.rows());
  Eigen::VectorXd y1 = sample.col(0);
  Eigen::VectorXd y2 = sample.col(1);
  Eigen::VectorXd log_p = sample.col(2);
  Eigen::VectorXd log_q = sample.col(3);

  // because target is normal, laplace approx is exact
  for (int m = 0; m < draws; ++m) {
    EXPECT_FLOAT_EQ(0, log_p(m) - log_q(m));
  }

  // expect mean draws to be location params +/- MC error
  EXPECT_NEAR(2, stan::math::mean(y1), 0.05);
  EXPECT_NEAR(3, stan::math::mean(y2), 0.05);

  double sum1 = 0;
  for (int m = 0; m < draws; ++m) {
    sum1 += std::pow(y1(m) - 2, 2);
  }
  double var1 = sum1 / draws;
  EXPECT_NEAR(1, var1, 0.05);

  double sum2 = 0;
  for (int m = 0; m < draws; ++m) {
    sum2 += std::pow(y2(m) - 3, 2);
  }
  double var2 = sum2 / draws;
  EXPECT_NEAR(1, var2, 0.05);

  double sum12 = 0;
  for (int m = 0; m < draws; ++m) {
    sum12 += (y1(m) - 2) * (y2(m) - 3);
  }
  double cov = sum12 / draws;
  EXPECT_NEAR(0.8, cov, 0.05);
}

TEST_F(ServicesLaplaceSample, wrongSizeModeError) {
  Eigen::VectorXd theta_hat(3);
  theta_hat << 2, 3, 4;
  int draws = 10;
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int RC = stan::services::laplace_sample<true>(*model, theta_hat, draws,
      seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::DATAERR, RC);
}

TEST_F(ServicesLaplaceSample, nonPositiveDrawsError) {
  Eigen::VectorXd theta_hat(2);
  theta_hat << 2, 3;
  int draws = 0;
  unsigned int seed = 1234;
  int refresh = 1;
  std::stringstream sample_ss;
  stan::callbacks::stream_writer sample_writer(sample_ss, "");
  int RC = stan::services::laplace_sample<true>(*model, theta_hat, draws,
      seed, refresh, interrupt, logger, sample_writer);
  EXPECT_EQ(stan::services::error_codes::DATAERR, RC);
}

On the ethics of pollsters or journalists or political scientists betting on prediction markets

There’s been a bit of concern lately about political consultants or pundits offering some mix of private and public forecasting and advice, while also making side bets on elections. I don’t know enough about these stories to comment on them in any useful way. Instead I’ll share my own perspective on betting on elections.

In June 2020 I wrote a post about an opportunity in a presidential election prediction market—our model-based forecast was giving Biden an 84% chance of winning the election, whereas the market’s implied odds were 53% for Biden, 40% for Trump, 2% for Mike Pence, 2% for Hillary Clinton (!), and another few percent for some other possible longshot replacements for Biden or Trump.

Just to be clear: Those betting odds didn’t correspond to Biden getting 53% of the vote, they corresponded to him having a 53% chance of winning, which in turn basically corresponded to the national election being a tossup.

I thought seriously about laying down some $ on Biden and then covering it later when, as anticipated, Biden’s price moved up.

Some people asked why I wasn’t putting my money where my mouth was. Or, to put it another way, if I wasn’t willing to bet on my convictions, did I really believe in my own forecast? Here’s what I wrote:

I agree that betting is a model for probability, but it’s not the only model for probability. To put it another way: Yes, if I were planning to bet money on the election, I would bet using the odds that our model provided. And if I were planning to bet a lot of money on it, I guess I’d put more effort into the forecasting model and try to use more information in some way. But, even if I don’t plan to bet, I can still help to create the model as a public service, to allow other people to make sensible decisions. It’s like if I were a chef: I would want to make delicious food, but that doesn’t mean that I’m always hungry myself.

Ultimately I decided not to bet, for a combination of reasons:

– I didn’t quite know how to do it. And I wasn’t quite sure it would be legal.

– The available stakes were low enough that I couldn’t make real money off it, and, if I could’ve, I would’ve been concerned about the difficulty of collecting.

– The moral issue that, as a person involved in the Economist forecast, I had a conflict of interest. And, even if not a real conflict of interest, a perceived conflict of interest.

– The related moral issue that, to the extent that I legitimately am an expert here, I’m taking advantage of ignorant people, which doesn’t seem so cool.

– Asymmetry in reputational changes. I’m already respected by the people that matter, and the people who don’t respect me won’t be persuaded by my winning some election bets. But if I lose on a public bet, I look like a fool. See the last paragraph of section 3 of this article.

Also there’s my article in Slate on 19 lessons from the 2016 election:

I myself was tempted to dismiss Trump’s chances during primary season, but then I read that article I’d written in 2011 explaining why primary elections are so difficult to predict (multiple candidates, no party cues or major ideological distinctions between them, unequal resources, unique contests, and rapidly changing circumstances), and I decided to be careful with any predictions.

The funny thing is that, in Bayes-friendly corners of the internet, some people consider it borderline-immoral for pundits not to bet on what they write about. The idea is that pundits should take public betting positions with real dollars because talk is cheap. At the same time, these are often the same sorts of people who deny that insider trading is a thing (“Differences of opinions are what make bets and horse races possible,” etc.). It’s a big world out there.

Real-world prediction markets vs. the theoretical possibility of betting

Setting aside the practical and ethical problems in real-world markets, the concept of betting can be useful in fixing ideas about probability. See for example this article by Nassim Taleb explaining why we should not expect to see large jumps up and down in forecast probabilities during the months leading up to the event being forecast.

This is a difficult problem to wrestle with, but wrestle we must. One challenge for election forecasting comes in the mapping between forecast vote share and betting odds. Small changes in forecast vote shares correspond to big changes in win probabilities. So if we want to follow Taleb’s advice and keep win probabilities very stable until shortly before the election (see figure 3 of his linked paper), then that implies we shouldn’t be moving the vote-share forecast around much either. Which is probably correct, but then what if the forecast you’ve started with is way off? I guess that means your initial uncertainty should be very large, but how large is reasonable? The discussion often comes up when forecasts are moving around too much (for example, when simple poll averages are used as forecasts, then predictable poll movements cause the forecasts to jump up and down in a way that violates the martingale property that is required of proper probability forecasts), but the key issue comes at the starting point.
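To make that mapping concrete, here is a small, self-contained C++ sketch (purely illustrative; not part of the Economist model or of the code above). It treats the Democratic two-party vote-share forecast as normal with mean mu and standard deviation sigma, and it assumes, as in the next paragraph, that the Democrats need about 52% of the vote to win the electoral college. With a year-out sigma of 0.06, a two-point shift in the forecast barely moves the win probability; with an election-eve sigma of 0.02, the same shift swings it from roughly 31% to 69%.

#include <cmath>
#include <iostream>

// P(vote share >= needed) when the share forecast is N(mu, sigma^2).
// p_win and the 52% threshold are my own illustrative choices.
double p_win(double mu, double sigma, double needed = 0.52) {
  return 0.5 * std::erfc((needed - mu) / (sigma * std::sqrt(2.0)));
}

int main() {
  // a year out (sigma = 0.06): a two-point move barely moves the odds
  std::cout << p_win(0.51, 0.06) << " -> " << p_win(0.53, 0.06) << "\n";
  // roughly 0.43 -> 0.57

  // election eve (sigma = 0.02): the same two-point move swings the odds a lot
  std::cout << p_win(0.51, 0.02) << " -> " << p_win(0.53, 0.02) << "\n";
  // roughly 0.31 -> 0.69
}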

So, what about future elections? For example, 2024. Or 2028. One issue that came up with 2020 was that everyone was pretty sure ahead of time, and also correct in retrospect, that the geographic pattern of the votes was aligned so that the Democrats would likely need about 52% of the vote to win the electoral college. So, imagine that you’re sitting in 2019 trying to make a forecast for the 2020 presidential election, and, having read Taleb etc., you want to give the Democrats something very close to a 50/50 chance of winning. That would correspond to saying they’re expected to get 52% of the popular vote. Or you could forecast 50% of the popular vote, but then that would correspond to a much smaller chance of winning the electoral college. For example, if your forecast popular vote share, a year out, is 0.50 with a standard error of 0.06 (so the Democratic candidate has a 95% chance of receiving between 38% and 62% of the two-party vote), then the probability of their receiving at least 52% is pnorm(0.50, 0.52, 0.06) = 0.37. On the other hand, it’s been a while since the Republican candidate has won the popular vote . . . You could give arguments either way on this, but the point is that it’s not so clear how to express high ignorance here. To get a stable probability of the candidate winning, you need a stable predicted vote share, and, again, the difficulty here isn’t so much with the stability as with what your starting point is.
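And here is the same normal-distribution arithmetic as a tiny C++ check of the numbers above; the pnorm helper is just my stand-in for R’s pnorm, the normal cumulative distribution function.

#include <cmath>
#include <iostream>

// normal CDF with mean mu and standard deviation sigma, like R's pnorm
double pnorm(double x, double mu, double sigma) {
  return 0.5 * std::erfc(-(x - mu) / (sigma * std::sqrt(2.0)));
}

int main() {
  // pnorm(0.50, 0.52, 0.06) = P(share >= 0.52) when share ~ N(0.50, 0.06^2),
  // by the symmetry of the normal distribution
  std::cout << pnorm(0.50, 0.52, 0.06) << "\n";  // ~0.37

  // the 95% interval for the year-out forecast N(0.50, 0.06^2)
  std::cout << 0.50 - 1.96 * 0.06 << " to "
            << 0.50 + 1.96 * 0.06 << "\n";       // ~0.38 to ~0.62
}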

Summary

Thinking about hypothetical betting odds can be a useful way to understand uncertainty. I’ve found it helpful when examining concerns with my own forecasts (for example here) as well as identifying problems with forecast-based probability statements coming from others (for example here and here).

Actual betting is another story. I’m not taking a strong moral stance against forecasters also making bets, but I have enough concerns that I’m not planning to do it myself. On the other hand, I’m glad that some low-key betting markets are out there; they provide some information even if not as much as sometimes claimed. Rajiv Sethi discusses this point in detail here and here.