The connection between junk science and sloppy data handling: Why do they go together?

Nick Brown pointed me to a new paper, “The Impact of Incidental Environmental Factors on Vote Choice: Wind Speed is Related to More Prevention-Focused Voting,” to which his reaction was, “It makes himmicanes look plausible.” Indeed, one of the authors of this article had come up earlier on this blog as a coauthor of a paper with a fatally flawed statistical analysis. So, between the general theme of this new article (“How might irrelevant events infiltrate voting decisions?”), the specific claim that wind speed has large effects, and the track record of one of the authors, I came into this in a skeptical frame of mind.

That’s fine. Scientific papers are for everyone, not just the true believers. Skeptics are part of the audience too.

Anyway, I took a look at the article and replied to Nick:

The paper is a good “exercise for the reader” sort of thing: figure out how they managed to get all those pleasantly low p-values. It’s not as blatantly obvious as, say, the work of Daryl Bem. The funny thing is, back in 2011, lots of people thought Bem’s statistical analysis was state-of-the-art. It’s only in retrospect that his p-hacking looks about as crude as the fake photographs that fooled Arthur Conan Doyle. Figure 2 of this new paper looks so impressive! I don’t really feel like putting in the effort to figure out exactly how the trick was done in this case . . . Do you have any ideas?

Nick responded:

There are some hilarious errors in the paper. For example:
– On p. 7 of the PDF, they claim that "For Brexit, the 'No' option advanced by the Stronger In campaign was seen as clearly prevention-oriented (Mean (M) = 4.5, Standard Error (SE) = 0.17, t(101) = 6.05, p < 0.001) whereas the 'Yes' option put forward by the Vote Leave campaign was viewed as promotion-focused (M = 3.05, SE = 0.16, t(101) = 2.87, p = 0.003)." But the question was not "Do you want Brexit, Yes/No." It was "Should the UK Remain in the EU or Leave the EU?" Hence why the pro-Brexit campaign was called "Vote Leave," geddit? Both sides agreed before the referendum that this was fairer and clearer than Yes/No. Is "Remain" more prevention-focused than "Leave"?

– On p. 12 of the PDF, they say, "In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU." This is again completely false. The Conservative government, including Prime Minister David Cameron, backed Remain. It's true that a number of Conservative politicians backed Leave, and after the referendum lots of Conservatives who had backed Remain pretended that they either really meant Leave or were now fine with it. But if you put that statement, "In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU," in front of 100 UK political scientists, not one will agree with it.

If the authors are able to get this sort of thing wrong, then I certainly don't think any of their other analyses can be relied upon without extensive external verification.

If you run the attached code on the data (mutatis mutandis for the directories in which the files live), you will get Figure 2 of the Mo et al. paper. Have a look at the data (the CSV file is an export of the DTA file, if you don't use Stata) and you will see that they collected a ton of other variables. To be fair, they mention these in the paper ("Additionally, we collected data on other Election Day weather indicators (i.e., cloud cover, dew point, precipitation, pressure, and temperature), as well as historical wind speeds per council area. The inclusion of other Election Day weather indicators increases our confidence that we are detecting an association between wind speed and election outcomes, and not the effect of other weather indicators that may be correlated with wind speed."). My guess is that they went fishing and found that wind speed, as opposed to the other weather indicators that they mentioned, gave them a good story.

Looking only at the Swiss data, I note that they also collected "Income", "Unemployment", "Age", "Race" (actually the percentage of foreign-born people; I doubt if Switzerland collects "Race" data; Supplement, Table S3, page 42), "Education", and "Rural", and threw those into their model as well. They also collected latitude and longitude (of the centroid?) for each canton, although those didn't make it into the analyses. They also include "Turnout", but for any given Swiss referendum it seems that they only had the national turnout, because this number is always the same for every "State" (canton) for any given "Election" (referendum). And the income data looks sketchy (people in Schwyz canton do not make 2.5 times what people in Zürich canton do). I think this whole process shows a degree of naivety about what "kitchen-sink" regression analyses (and more sophisticated versions thereof) can and can't do, especially with noisy measures (such as "Precipitation" coded as 0/1).

Voter turnout is positively correlated with precipitation but negatively with cloud cover, whatever that means. Another glaring omission is any sort of weighting by population. The most populous canton in Switzerland has a population almost 100 times the least populous, yet every canton counts equally. There is no "population" variable in the dataset, although this would have been very easy to obtain. I guess this means they avoid the ecological fallacy, up to the point where they talk about individual voting behaviour (i.e., pretty much everywhere in the article).

Nick then came back with more:

I found another problem, and it’s huge:

For “Election 50”, the Humidity and Dew Point data are completely borked (“relative humidity” values around 1000 instead of 0.6 etc; dew point 0.4–0.6 instead of a Fahrenheit temperature slightly below the measured temperature in the 50–60 range). When I remove that referendum from the results, I get the attached version of Figure 2. I can’t run their Stata models, but by my interpretation of the model coefficients from the R model that went into making Figure 2, the value for the windspeed * condition interaction goes from 0.545 (SE=0.120, p=0.000006) to 0.266 (SE=0.114, p=0.02).

So it seems to me that a very big part of the effect, for the Swiss results anyway, is being driven by this data error in the covariates.

And then he posted a blog with further details, along with a link to some other criticisms from Erik Gahner Larsen.

The big question

Why do junk science and sloppy data handling so often go together? We’ve seen this a lot: for example, the ovulation-and-voting and ovulation-and-clothing papers that used the wrong dates for peak fertility, the Excel error paper in economics, the gremlins paper in environmental economics, the analysis of air pollution in China, the collected work of Brian Wansink, . . . .

What’s going on? My hypothesis is as follows. There are lots of dead ends in science, including some bad ideas and some good ideas that just don’t work out. What makes something junk science is not just that it’s studying an effect that’s too small to be detected with noisy data; it’s that the studies appear to succeed. It’s the misleading apparent success that turns a scientific dead end into junk science.

As we’ve been aware since the classic Simmons et al. paper from 2011, researchers can and do use researcher degrees of freedom to obtain apparent strong effects from data that could well be pure noise. This effort can be done on purpose (“p-hacking”) or without the researchers realizing it (“forking paths”), or through some mixture of the two.

The point is that, in this sort of junk science, it’s possible to get very impressive-looking results (such as Figure 2 in the above-linked article) from just about any data at all! What that means is that data quality doesn’t really matter.
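To see how easy it is to manufacture apparent success from noise, here is a quick simulation in R (my own toy example, not an analysis of the paper in question): with just ten candidate predictors to choose from, pure noise already gives a roughly 40% chance of finding at least one “statistically significant” result.

# Pure-noise simulation: the outcome and 10 candidate predictors are all independent noise.
set.seed(2023)
n_sims <- 1000
hit <- replicate(n_sims, {
  y <- rnorm(100)                        # outcome: noise
  x <- matrix(rnorm(100 * 10), 100, 10)  # 10 candidate predictors: also noise
  pvals <- apply(x, 2, function(xj) summary(lm(y ~ xj))$coefficients[2, 4])
  any(pvals < 0.05)                      # "success" if any predictor looks significant
})
mean(hit)   # roughly 0.40, i.e., 1 - 0.95^10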

If you’re studying a real effect, then you want to be really careful with your data: any noise you introduce, whether in measurement or through coding error, can be expected to attenuate your effect, making it harder to discover. When you’re doing real science you have a strong motivation to take accurate measurements and keep your data clean. Errors can still creep in, sometimes destroying a study, so I’m not saying it can’t happen. I’m just saying that the motivation is to get your data right.

In contrast, if you’re doing junk science, the data are not so relevant. You’ll get strong results one way or another. Indeed, there’s an advantage to not looking too closely at your data at first; that way if you don’t find the result you want, you can go through and clean things up until you reach success. I’m not saying the authors of the above-linked paper did any of that sort of thing on purpose; rather, what I’m saying is that they have no particular incentive to check their data, so from that standpoint maybe we shouldn’t be so surprised to see gross errors.

Fully Bayesian computing: Don’t collapse the wavefunction until it’s absolutely necessary.

Kevin Gray writes:

In marketing research, it’s common practice to use averages of MCMC draws in Bayesian hierarchical models as estimates of individual consumer preferences.

For example, we might conduct choice modeling among 1,500 consumers and analyze the data with an HB multinomial logit model. The means or medians of the (say) 15,000 draws for each respondent are then used as parameter estimates for each respondent. In other words, by averaging the draws for each respondent we obtain an individual-level equation for each respondent and individual-level utilities.

Recently, there has been criticism of this practice by some marketing science people. For example, we can compare predictions of individuals or groups of individuals (e.g., men versus women), but not the parameters of these individuals or groups to identify differences in their preferences.

This is highly relevant because, since the late 90s, it has been common practice in marketing research to use these individual-level “utilities” to compare preferences (i.e., relative importance of attributes) of pre-defined groups or to cluster on the utilities with K-means (for example).

I’m not an authority on Bayes of course, but have not heard of this practice outside of marketing research, and have long been concerned. Marketing research is not terribly rigorous…

This all seems very standard to me and is implied by basic simulation summaries, as described for example in chapter 1 of Bayesian Data Analysis. Regarding people’s concerns: yeah, you shouldn’t first summarize simulations over people and then compare people. What you should do is compute any quantity of interest—for example, a comparison of groups of people—separately for each simulation draw, and then only at the end should you average over the simulations.
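Here’s a minimal sketch in R of the difference; the data are simulated and the objects (draws, male) are hypothetical, just to illustrate the order of operations.

# Hypothetical setup: S posterior draws of an individual-level utility for each of N respondents.
set.seed(1)
S <- 1000; N <- 200
male  <- rep(c(TRUE, FALSE), each = N / 2)
draws <- matrix(rnorm(S * N, mean = rep(ifelse(male, 0.3, 0.1), each = S)), S, N)

# Wrong order: collapse the draws to point estimates first, then compare groups.
# All posterior uncertainty about the comparison is thrown away.
u_hat      <- colMeans(draws)
wrong_diff <- mean(u_hat[male]) - mean(u_hat[!male])

# Right order: compute the comparison within each simulation draw, then summarize over draws.
diff_draws <- apply(draws, 1, function(u) mean(u[male]) - mean(u[!male]))
mean(diff_draws)                        # posterior mean of the group difference
quantile(diff_draws, c(0.025, 0.975))   # and its posterior uncertainty interval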

Sometimes we say: Don’t prematurely collapse the wave function.

This is also related to the idea of probabilistic programming or, as Jouni and I called it, fully Bayesian computing. Here’s our article from 2004.

Slides on large language models for statisticians

I was invited by David Banks to give an introductory talk on large language models to the regional American Statistical Association meeting on large language models. Here are the slides:

Most usefully, the slides include complete pseudocode up to but not including multi-head attention, plus an annotated bibliography of the main papers if you want to catch up. After the talk, I added a couple of slides on scaling laws and the annotated bibliography, which I didn’t have time to get to before the talk, and also a slide describing multi-head attention, but without pseudocode.

P.S. The meeting was yesterday at Columbia and I hadn’t been to the stats department since the pandemic started, so it felt very strange.

P.P.S. GPT-4 helped me generate the LaTeX TikZ code to the point where I did zero searching through docs or the web. It also generates all of my pandas and plotnine code (Python clones of R’s data frames and ggplot2) and a ton of my NumPy, SciPy, and general Python code. It can explain the techniques it uses, so I’m learning a lot, too. I almost never use StackOverflow any more!

Workflow for robust and efficient projection predictive inference

Yann McLatchie, Sölvi Rögnvaldsson, Frank Weber, and I (Aki) write in a new preprint, “Robust and efficient projection predictive inference”:

The concepts of Bayesian prediction, model comparison, and model selection have developed significantly over the last decade. As a result, the Bayesian community has witnessed a rapid growth in theoretical and applied contributions to building and selecting predictive models. Projection predictive inference in particular has shown promise to this end, finding application across a broad range of fields. It is less prone to over-fitting than naïve selection based purely on cross-validation or information criteria performance metrics, and has been known to out-perform other methods in terms of predictive performance. We survey the core concept and contemporary contributions to projection predictive inference, and present a safe, efficient, and modular workflow for prediction-oriented model selection therein. We also provide an interpretation of the projected posteriors achieved by projection predictive inference in terms of their limitations in causal settings.

The main purpose of the paper is to present a workflow for projection predictive variable selection so that users can obtain reliable results in the least time-consuming way (sometimes there are safe shortcuts that can save an enormous amount of wall-clock and computing time). But it also discusses the use of the projected posterior in causal settings and gives some more general background. All of this has been implemented in the projpred R package (the most recent workflow-supporting features were added by Frank, who has been doing an awesome job improving projpred in recent years). While writing the introduction to the paper, we were happy to notice that projpred is currently the most downloaded R package for Bayesian variable selection!
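If you want to try this out, here is a rough sketch of the basic projpred loop in R (simplified and using a toy dataset; see the paper and the package documentation for the full recommended workflow and the safe shortcuts mentioned above).

library(rstanarm)
library(projpred)

# Reference model: in practice this would be your carefully built full model.
fit <- stan_glm(mpg ~ ., data = mtcars, refresh = 0)

vs   <- cv_varsel(fit)               # cross-validated predictor search
plot(vs, stats = "elpd")             # predictive performance vs. submodel size
size <- suggest_size(vs)             # heuristic suggestion for the submodel size
prj  <- project(vs, nterms = size)   # projected posterior for the selected submodel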

Summer School on Advanced Bayesian Methods in Belgium

(this post is by Charles)

This September, the Interuniversity Institute for Biostatistics and statistical Bioinformatics is holding its 5th Summer School on Advanced Bayesian Methods. The event is set to take place in Leuven, Belgium. From their webpage:

As before, the focus is on novel Bayesian methods relevant to the applied statistician. In the fifth edition of the summer school, the following two courses will be organized in Leuven from 11 to 15 September 2023:

The target audience of the summer school is statisticians and/or epidemiologists with a sound background in statistics, as well as some background in Bayesian methodology. In both courses, practical sessions are organized, so participants are asked to bring along their laptops with the appropriate software (to be announced) pre-installed.

I’m happy to do a three-day workshop on Stan: we’ll have ample time to dig into a lot of interesting topics and students will have a chance to do plenty of coding.

I’m also looking forward to the course on spatial modeling. I’ve worked quite a bit on the integrated Laplace approximation (notably its implementation in autodiff systems such as Stan), but I’ve never used the INLA package itself (or one of its wrappers), nor am I very familiar with applications in ecology. I expect this will be a very enriching experience.

The registration deadline is July 31st.

HIIT Research Fellow positions in Finland (up to 5 year contracts)

This job post is by Aki

The Helsinki Institute for Information Technology has some funding for Research Fellows, and the research topics can include Bayes, probabilistic programming, ML, AI, etc.

HIIT Research Fellow positions support the career development of excellent advanced researchers who already have some postdoctoral research experience. While HIIT Research Fellows have a designated supervisor at University of Helsinki or Aalto, they are expected to develop their own research agenda and to gain the skills necessary to lead their own research group in the future. HIIT Research Fellows should strengthen Helsinki’s ICT research community either through collaboration or by linking ICT research with another scientific discipline. In either case, excellence and potential for impact are the primary criteria for HIIT Research Fellow funding.

The contract period is up to five years in length.

I (Aki) am one of the potential supervisors, so you could benefit from my help (the other professors are great, too), but as the text says, you would be an independent researcher. This is an awesome opportunity to advance your career in a lovely and lively environment spanning Aalto University and the University of Helsinki. I can provide further information about the research environment and working in Finland.

The deadline is August 13, 2023.

See more at the HIIT webpage.

Scientific software research faculty award

The Simons Foundation (the parent institution of the Flatiron Institute, where I work) has just announced grants to support professors working on scientific software, with a plan to support six new fellows per year. From the call for proposals:

A Scientific Software Fellowship provides five years of 50 percent salary support of the awardee’s academic-year salary and fringe benefits, whether normally paid over 9 or 12 months, along with a yearly $50,000 research allowance for the awardee…

Letters of intent are due 8 December 2023.

It’s clear these are not career transition awards and will instead go to people already involved in scientific software. While this call is aimed at physics, astrophysics, and mathematics professors, stats might qualify (you should check if you’re interested). I’m not involved in reviewing the grants—that’s the folks across the street.

R and OOP anti-patterns

Thomas Lumley just dropped a blog post, Blank cheque inheritance and statistical objects, which begins as follows.

One of the problems with object-oriented programming for statistical methods is that inheritance is backwards. Everything is fine for data structures, and Bioconductor has many examples of broad (often abstract) base classes for biological information or annotation that are specialised by inheritance to more detailed and useful classes. Statistical methods go the other way.

In base R, glm for generalised linear models is a generalisation of lm for linear models, but the glm class inherits from the lm class, …

This isn’t a problem with inheritance, it’s a problem with how R uses it.

Fundamentals of OOP and inheritance

The fundamental rule for inheritance in object-oriented programming (OOP) is that a class X should inherit from class Y only if every X is a Y. That means you should be able to use an X wherever the code calls for a Y. For instance, a Poodle class might inherit from the Dog class, because every poodle is a dog: any function you can apply to dogs, you can apply to poodles, and every function that is defined to return a poodle will also return a dog. The behavior of arguments and return types is governed by the concepts of covariance and contravariance in programming language theory, and inheritance must respect these relations for a coherent OOP design.

A classic blunder is to define a class for real numbers and one for complex numbers and have the complex numbers inherit from the real numbers. Every real number is a complex number, but not every complex number is a real number, so doing this will break standard OOP designs. The reason beginners in OOP make this mistake is that it’s natural to think of the implementation of a complex number as taking a real number and adding an imaginary component. If you want to start with real numbers, a better way to define a complex number is to use composition: the complex number contains two real components rather than inheriting from the real numbers to get its real component. This is exactly how std::complex is defined in C++, with a constructor that takes the real and imaginary components as two objects of type T, where T might be double for double-precision complex numbers or it might be an autodiff type like stan::math::var.
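To make the composition point concrete, here is a minimal S4 sketch in R (my illustration, not from Lumley’s post or the C++ standard library): the complex-number class contains two real components rather than inheriting from a real-number class.

# Composition: a Complex2 object *contains* two Real objects.
setClass("Real",     slots = c(value = "numeric"))
setClass("Complex2", slots = c(re = "Real", im = "Real"))

z <- new("Complex2", re = new("Real", value = 1.5), im = new("Real", value = -2))
z@re@value   # 1.5: the real component is reached through composition, not inheritance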

The God object anti-pattern

I’m also not fond of how lm returns a god object. God objects are a widely cited anti-pattern in OOP, largely because they’re so frequently seen in the wild. The inclination that leads to them is to have something like a “blackboard” into which all kinds of state can be written so the user can get it all in one place. A common example is including all the input in a function’s output. There’s no need to do that, because the user already had all of that information; they couldn’t have called the function otherwise. God objects are usually a terrible idea, as it’s nearly impossible to ensure consistency of such an object without defeating its purpose, which is to behave like a global variable repository. R doesn’t even try—you can take an lm fit object, change various aspects of it, and leave it in an inconsistent state without warning, e.g.,

> fit <- lm(dist ~ speed, data = cars)

> fit$coefficients
(Intercept)       speed 
 -17.579095    3.932409 

> fit$coefficients = c(1, 2, 3, 4)

> fit$coefficients
[1] 1 2 3 4

The Stan interfaces in Python and R also return god objects. I lost the design argument to the other developers, who argued, “That’s the way it’s done in R and Python.”

R’s argument chaining vs. OOP method chaining

Speaking of OOP, chaining with pipes in R follows the object-oriented pattern of method chaining, but instead of the return value being an object of the class that defines the next method in the chain, R just passes the return value along as the first argument of the next chained function. It’s no longer object-oriented, though it doesn’t break any OO patterns per se. It can be awkward if you need to pack enough into a return value to feed the next function. In OOP, developers often break long method chains into groups of coherent calls with named returns when the returns are not all instances of the same class. The reason to break up long chains is, ironically given how they’re motivated in R, to help with readability and self-documentation. Code readability is the single best thing you can do to make code maintainable, because code will be read much more often than it gets written. You can bridge the gap between what R does with chaining and the standard way to do method chaining in OOP by looking at how Python classes are defined with an explicit self argument (like the this pointer to the class instance in C++, though C++ doesn’t require it as an explicit argument on methods).
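Here’s a small illustration of the difference, using the built-in cars data (my example, not from Lumley’s post).

# R-style chaining with the base pipe: the return value of each call is simply
# passed as the first argument of the next function; no class needs to define
# the next "method" in the chain.
cars |>
  subset(speed > 10) |>
  transform(dist_m = dist * 0.3048) |>
  head(3)

# An OOP method chain would instead look something like
# df$filter(speed > 10)$mutate(dist_m = dist * 0.3048)$head(3),
# with each call returning an object whose class defines the next method.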

P.S. I tried commenting on Lumley’s blog but was defeated by Disqus. I thought it might be of general interest, so am including it here.

blme: Bayesian Linear Mixed-Effects Models

The problem:

When fitting multilevel models, the group-level variance parameters can be difficult to estimate. Posterior distributions are wide, and point estimates are noisy. The maximum marginal likelihood estimate of a variance parameter can often be exactly zero, which is a problem for packages such as lme4 that are based on this marginal mode. For models with multiple varying coefficients (varying-intercept, varying-slope models), the bigger the group-level covariance matrix, the more likely it is that its max marginal likelihood estimate will be degenerate. This leads to computational problems as well as problems with the estimated coefficients, as they get not just partially pooled but completely pooled toward the fitted model.

The solution:

Priors. Zero-avoiding or boundary-avoiding priors to avoid zero or degenerate estimates of group-level variances, along with informative priors to get more reasonable estimates when the number of groups is small.
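Here’s a rough sketch in R of the problem and the fix, using simulated data (the prior specification is just blme’s boundary-avoiding gamma option; check the package documentation and the Chung et al. papers below for the details).

library(lme4)
library(blme)

# Simulate data where the true group-level sd is small relative to the noise,
# so the max marginal likelihood estimate often collapses to exactly zero.
set.seed(123)
J <- 10; n <- 5
group <- factor(rep(1:J, each = n))
y <- rnorm(J * n, mean = rep(rnorm(J, 0, 0.1), each = n), sd = 1)
d <- data.frame(y, group)

fit_ml <- lmer(y ~ 1 + (1 | group), data = d)    # max marginal likelihood
VarCorr(fit_ml)                                  # group sd may be estimated as 0

fit_b <- blmer(y ~ 1 + (1 | group), data = d,    # boundary-avoiding prior on the
               cov.prior = gamma)                # group-level covariance
VarCorr(fit_b)                                   # nondegenerate (nonzero) estimate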

The research papers:

[2013] A nondegenerate estimator for hierarchical variance parameters via penalized likelihood estimation. Psychometrika 78, 685–709. (Yeojin Chung, Sophia Rabe-Hesketh, Andrew Gelman, Jingchen Liu, and Vincent Dorie)

[2014] Weakly informative prior for point estimation of covariance matrices in hierarchical models. Journal of Educational and Behavioral Statistics 40, 136–157. (Yeojin Chung, Andrew Gelman, Sophia Rabe-Hesketh, Jingchen Liu, and Vincent Dorie)

The R package:

blme: Bayesian Linear Mixed-Effects Models, by Vince Dorie, Doug Bates, Martin Maechler, Ben Bolker, and Steven Walker

Going forward:

blme is great but we’d also like to have full Bayes. Stan does full Bayes but can be slow if you have a lot of data and a lot of groups. Just for example, suppose you have longitudinal data with 5 observations on each of 100,000 people. Then a hierarchical model will have hundreds of thousands of parameters—that’s a lot! On the other hand, the Bayesian central limit theorem should be working in your favor (see appendix B of BDA, for example). So some combination of approximate and full Bayesian inference should work.

Also, lme4, and even blme, can have trouble when you have lots of variance parameters running around. And lme4 has its own issues regarding computation with empty groups and the like, which blme unfortunately inherits; these should not really be a problem for Bayesian inference with informative priors.

Right now, though, we don’t have this best-of-both-worlds Bayesian solution that does full Bayes when computationally feasible and uses appropriate approximations otherwise. So blme is part of our toolbox. Thanks to Vince!

Computational linguist Bob Carpenter says LLMs are intelligent. Here’s why:

More specifically, he says:

ChatGPT can do deep, involved reasoning. It has the context capacity to do that.

I [Bob] think that human language is what is known as “AI complete”. To be good at language, you have to be intelligent, because language is about the world and context. You can’t do what ChatGPT does ignorant of the world or be unable to plan. . . .

Humans also generally produce output one word at a time in spoken language. In writing we can plan and go back and revise. We can do a little planning on the fly, but not nearly as much. To me, this was the biggest open problem in computational linguistics—it’s what my job talk was about in 1989 and now it’s basically a solved problem from the engineering if not scientific perspective.

I [Bob] am not saying there are no limitations to using the LLM architecture—it doesn’t have any long- or really medium-term memory. I’m just saying it can’t do what it does now without some kind of “intelligence”. If you try to define intelligence more tightly, you either rule out humans or you somehow say that only human meat can be intelligent.

I told Bob that his take on this might be controversial, even among computer scientists, and he replied:

Of course. Everything’s controversial among academics . . .

My [Bob’s] position is hardly novel. It’s the take of everyone I know who understands the tech (of course, that’s a no-true-Scotsman argument), including this paper from Microsoft Research. I do think if you have studied cognitive science, philosophy of language, and philosophy of mind, studied language modeling, studied psycholinguistics, have some inkling of natural language compositional semantics and lexical semantics, and you understand crowdsourcing with human feedback, then you’re much more likely to come to the same conclusion as me. If you’re just shooting from the hip without having thought deeply about meaning and how to frame it or how humans process language a subword component at a time, then of course the behavior seems “impossible”. Everyone seems to have confused it with cutting-and-pasting search results, which is not at all what it’s doing.

I’m not saying it’s equivalent to a human, just that whatever it’s doing is a form of general intelligence. What it’s truly lacking is longer term memory. That means there are things humans can do that it really is incapable of doing in its present form. But that’s not because it’s a “dumb machine”. We’re just “dumb meat” viewed from that perspective (unless you want to get all spiritual and say we have a soul of some kind that matters).

Bob also recommends this paper from Google and this one from OpenAI, and he continues:

There’s a ton of work on scaling laws now and what people are seeing is emergent behavior at certain model sizes. As in like 1% performance for 3B parameters, then 95% performance for 6B parameters kind of thing. But nobody knows why this is happening or where.

The capacity of these models is quite high, including the representation of words, representation of positions, etc. It’s generating one word at a time, but the structure is an incredibly rich time series with literally billions of parameters.

The background here is that I’ve been reading what Thomas Basbøll has been writing on chatbots and the teaching of writing (a topic of interest to me, because I teach writing as part of my Communicating Data and Statistics course), and he recommended a long article by Giulio Alessandrini, Brad Klee, and Stephen Wolfram entitled “What Is ChatGPT Doing . . . and Why Does It Work?”

I really liked Alessandrini et al.’s article. It was at the right level for me, stepping through the following topics:

It’s Just Adding One Word at a Time
Where Do the Probabilities Come From?
What Is a Model?
Models for Human-Like Tasks
Neural Nets
Machine Learning, and the Training of Neural Nets
The Practice and Lore of Neural Net Training
“Surely a Network That’s Big Enough Can Do Anything!”
The Concept of Embeddings
Inside ChatGPT
The Training of ChatGPT
Beyond Basic Training
What Really Lets ChatGPT Work?
Semantic Grammar and the Power of Computational Language
So . . . What Is ChatGPT Doing, and Why Does It Work?

Alessandrini et al.’s article has lots of examples, graphs, and code, and I get the impression that they’re actively trying to figure out what’s going on. They get into some interesting general issues; for example,

One might have thought that for every particular kind of task one would need a different architecture of neural net. But what’s been found is that the same architecture often seems to work even for apparently quite different tasks. At some level this reminds one of the idea of universal computation . . . but I think it’s more a reflection of the fact that the tasks we’re typically trying to get neural nets to do are “human-like” ones—and neural nets can capture quite general “human-like processes”.

In earlier days of neural nets, there tended to be the idea that one should “make the neural net do as little as possible”. For example, in converting speech to text it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that—at least for “human-like tasks”—it’s usually better just to try to train the neural net on the “end-to-end problem”, letting it “discover” the necessary intermediate features, encodings, etc. for itself. . . .

That’s not to say that there are no “structuring ideas” that are relevant for neural nets. Thus, for example, having 2D arrays of neurons with local connections seems at least very useful in the early stages of processing images. And having patterns of connectivity that concentrate on “looking back in sequences” seems useful . . . in dealing with things like human language, for example in ChatGPT.

They also talk about the choices involved in tuning the algorithms—always an important topic in statistics and machine learning—so, all in all, I think a good starting point before getting into the technical articles that Bob pointed us to above. I pointed Bob to the Alessandrini et al. tutorial and his reaction was that it “seriously under-emphasizes the attention model in the transformer and the alignment post-training. It’s the latter that took GPT-3 to ChatGPT, and it’s a huge difference.”

That’s the problem with sending a pop science article to an expert: the expert will latch on to some imperfection. The same thing happens to me when people send me popular articles on Bayesian statistics or American politics or whatever: I can’t help focusing on the flaws. Anyway, I still like the Alessandrini et al. article, I guess more so when supplemented with Bob’s comments.

P.S. Also I told Bob I still don’t get how generating one word at a time can tell the program to create a sonnet in the style of whoever. I just don’t get how “Please give me a sonnet . . .” will lead to a completion that has sonnet form. Bob replied:

Turn this around. Do you know how you write sentences without planning them all out word by word ahead of time? Language is hard, but we do all kinds of planning in this same way. Think about how you navigate from home to work. You don’t plan out a route step by step and then execute it; you make a very general plan (‘ride my bike’ or ‘take the subway’), take a step toward that goal, then repeat until you get to work. Each part of the task (unlock the bike, carry the bike outside, put the kickstand up, get on the bike, etc.) is easily cued by what you did last, so it barely requires any thought at all. ChatGPT does the same thing with language. ChatGPT does a ton of computation on your query before starting to generate answers. It absolutely does a kind of “planning” in advance and, as the MS paper shows, you can coach it to do better planning by asking it to share its plans. It does this all with its attention model. And it maintains several rich, parallel representations of how language gets generated.

Do you know how you understand language one subword component at a time? Human brains have *very slow* clock cycles, but very *high bandwidth* associative reasoning. We are very good at guessing what’s going to come next (though not nearly as good as GPT—its ability at this task is far beyond human ability) and very good at piecing together meaning from hints (too good in many ways, as we jump to a lot of false associations and bad conclusions). We are terrible at logic and planning compared to “that looks similar to something I’ve seen before”.

I think everyone who’s thought deeply about language realizes it has evolved to make these tasks tractable. People can rap and write epic poems on the fly because there’s a form that we can follow and one step follows the next when you have a simple bigger goal. So the people who know the underlying architectures, but say “oh language is easy, I’m not impressed by ChatGPT,” are focusing on this aspect of language. Where ChatGPT falls down is with long chains of logical reasoning. You have to coax it to do that by telling it to. Then it can do it in a limited way with guidance, but its basic architecture doesn’t support good long-term planning for language. If you want GPT to write a book, you can’t prompt it with “write a book”. Instead, you can say “please outline a book for me”, then you can go over the outline and have it instantiate each part as you go. At least that’s how people are currently using GPT to generate novels.

I asked Aki Vehtari about this, and Aki pointed out that there are a few zillion sonnets on the internet already.

Regarding the general question, “How does the chatbot do X?”, where X is anything other than “put together a long string of words that looks like something that could’ve been written by a human” (so, the question could be, “How does the chatbot write a sonnet” or “How does ChatGPT go from ‘just guessing next word’ to solving computational problems, like calculating weekly menu constrained by number of calories?”), Bob replied:

This is unknown. We’ve basically created human-level or better language ability (though not human or better ability to connect language to the world) and we know the entire architecture down to the bit level and still don’t know exactly why it works. My take and the take of many others is that it has a huge capacity in its representation of words and its representation of context and the behavior is emergent from that. It’s learned to model the world and how it works because it needs that information to be as good as it is at language.

Technically, it’s a huge mixture model of 16 different “attention heads”, each of which is itself a huge neural network and each of which pays attention to a different form of being coherent. Each of these is a contextual model with access to the previous 5K or so words (8K subword tokens).

Part of the story is that the relevant information is in the training set (lots of sonnets, lots of diet plans, etc.); the mysterious other part is how it knows from your query what piece of relevant information to use. I still don’t understand how it can know what to do here, but I guess that for now I’ll just have to accept that the program works but I don’t understand how. Millions of people drive cars without understanding at any level how cars work, right? I basically understand how cars work, but there’d be no way I could build one from scratch.

PhD or PostDoc position on simulation-based inference with Paul “brms” Bürkner

Hi all, this is Paul. Andrew was so kind to allow me to post a job ad here on his blog.

At the Technical University of Dortmund, Germany, I am currently looking for a PhD Student or PostDoc to work with me on simulation-based Bayesian inference research in the context of our BayesFlow framework.

BayesFlow is a Python library for efficient simulation-based Bayesian Inference. It enables users to create specialized neural networks for amortized Bayesian inference, which repays users with rapid statistical inference after a potentially longer simulation-based training phase. A cornerstone idea of amortized Bayesian inference is to employ generative neural networks for parameter estimation, model comparison, and model validation when working with intractable simulators whose behavior as a whole is too complex to be described analytically.

Both the BayesFlow library itself and its community are quickly growing. Our goal is to make it the gold-standard simulation-based inference library within the next couple of years.

For more details about the position, please see Paul Bürkner – Open Positions

I am looking forward to your applications!

Paul

Analog computing and hybrid computing: The view from 1962.

From the proceedings of the December 4-6, 1962, fall joint computer conference, two researchers from General Electric Company’s Missile and Space Division write:

In general, there are two distinct modes of simulation: mathematical and physical. Mathematical simulation utilizes a mathematical model of the physical system under study. . . .

Physical simulation requires the excitation of the system under conditions which are representative of those encountered in actual system operation. This testing can involve anything from an inclined plane to large multi-million dollar ventures like the Space Environmental Simulator located at General Electric’s Valley Forge, Penna., Space Technology Center. These two types of simulation can be combined by mating physical hardware with a mathematical model. The general purpose computers available today are primarily designed for mathematical simulation. . . .

An electronic analog computer is an array of computational building blocks, or modules, each being able to perform a particular mathematical operation on an input voltage signal and provide a specific output response. These building blocks normally provide the functions of summation, integration with respect to time, multiplication by a constant, multiplication and division of variables, function generation, generation of trigonometric functions, and representation of system discontinuities. All quantities are represented on the analog by continuously varying voltages, restricted on almost all analog computers to the range between -100 and +100 volts. . . .

Data are fed into the analog computer in the form of parameter settings, which are usually associated with the coefficients that exist in the mathematical equations. Data are extracted from the computer in the form of voltages, either as steady-state values which can be read out on a voltmeter, or as varying values which can be recorded on a strip chart recorder or a plotting table. Some of the analog characteristics pertinent to our discussion are:

1. The analog is a parallel machine. All the variables are computed simultaneously and continuously. Thus, the speed with which the calculations are made is completely independent of the size or complexity of the problem.

2. The bigger a problem is, the more equipment is needed, as each piece of equipment works on one part of the problem.

3. Numbers on the analog are fixed point. Every variable must be scaled. The scaling will greatly affect the accuracy of the results.

4. The analog is best suited for solving systems of ordinary linear differential equations, although it can handle many other types of problem in a very satisfactory way.

5. There is no such thing as a computational cycle with the analog, because of characteristic No. 1. The analog can be set to calculate at any rate desired, but in practice there is an optimum time base associated with any particular problem, and attempts to run the problem much faster or slower will severely degrade the accuracy. The analog, generally speaking, is faster than the digital.

6. Analog outputs are almost always accurate to within 1%, but seldom better than 0.1%.

7. It is very easy, with most problems, to introduce extensive changes in the simulation in a matter of minutes.

Although the analog computer was designed primarily for the solution of problems in the aircraft field, its area of application has broadened considerably over the years. . . .

Many of these concerns still arise today, albeit in different form: scalability of computation (items 1 and 2), scalability of workflow (item 7), putting parameters on a natural scale (item 3), precision (item 6), and the idea that the method runs at some natural speed (item 5), which comes up with HMC and, before that, efficient Metropolis jumping rules.

They then move on to a discussion of digital computing:

The digital computer works by a counting technique and obeys logic rules exactly. The solutions are at discrete points dependent on the size of the time increment used. The smaller the mesh size, the more we approach the continuous solution. In contrast to the analog computer, which uses continuous variables in the form of voltages, the digital computer uses discrete variables, and operates with numbers as opposed to voltages. The digital computer is essentially a very fast calculating machine. . . .

There are a number of digital computer characteristics that are of particular interest in connection with hybrid simulation. These are:

1. It will deal only with numbers. Any problem must be reduced to a series of numerical operations before it can be handled by the computer. This is not to say that every step must actually be written each time. All sorts of aids to compiling programs are available. A program is nothing more than the entire sequence of instructions given to the computer to solve a problem. In actual practice, the machine itself will write most of its own instructions.

2. It will do exactly what it is told. All changes involve writing new instructions. The easier it is to make a change, the more complicated the original instructions have to be to include the option.

3. The results are exactly repeatable, but their accuracy is dependent on the numerical methods used to solve the problem.

4. The computer will perform only one operation at a time. That is, if the instruction reads, “Move number N from location A to location B,” the machine will, for a given period of time, be doing nothing but that.

5. The computer works with increments. None of the variables are calculated continuously. Generally speaking, the larger the calculation increment of the digital computer, the faster and the less accurate is the computation. There is absolutely no drift with a digital computer.

6. Compared with an analog, the digital is very much better equipped to make decisions. These can be made on the basis of comparison, time, reaching a point in the program, or almost any other criterion chosen by the programmer.

7. The digital can store very much more information than the analog. It can store tables, functions of several variables, whole programs, and many other things.

It is almost impossible to list the areas of application of the computer because of the diversity involved. We can say, however, that the digital computer lays sole claim to those problems which store a lot of information, use much logic, or require extreme accuracy. It will calculate trajectories, solve problems in astronomy, simulate mental processes such as learning and memory, analyze games, do translations, help design new computers, and do untold numbers of other tasks. The major effort to discover new computer applications is devoted to the digital area, with the analog a poor second, and the hybrid far behind.

They were right about that! Digital computers really did take over. Again, I find it interesting how much of the discussion turns on workflow, which we can roughly define as a science-like process of exploration carried out by fitting multiple models.

They continue with some thoughts on the precision of computation which remain relevant over sixty years later:

The subject of accuracy is so complicated, and dependent on so many factors, that it just didn’t seem possible to summarize it by a mark in a box. While this is to some extent true of all the other characteristics listed, we believe considerations of accuracy fall into a special case.

On an analog computer, the result is usually within 0.1% and 1% of the value inherent in the equations. Whether this is excellent or poor depends on the nature of the problem. In many engineering investigations, this is much more precise than the data upon which the problem is based. The use to which the answer will be put also affects the accuracy required. Determination of the region of stability of a control system to within a millionth of the control range would be valueless, as the nature of the input could affect it much more than that. On a digital computer, the ultimate limit of accuracy is the number of bits in a word. This accuracy is seldom attained by the output variables of a problem, due to the approximations involved in almost any mathematical model, the idiosyncrasies of programming, and the practical necessity of taking reasonably large computing steps. The question concerning accuracy is more often, “How much cost and effort is needed to obtain the required accuracy?”, than “What accuracy is obtainable?” The answer has to be determined separately for each individual problem.

Next they move on to “hybrid” setups that combine analog and digital computing, sharing their own experiences:

The advantages of a hybrid that we felt to be of most value to the work of the department were in the area of increasing the size and variety of the problems we could solve. The things a hybrid can do to help in that endeavor are:

1. Assign different sections of a problem to each computer. For instance, in simulating a missile, the trajectory calculations can be assigned to the digital, because of the available precision, and the control simulation put on the analog because of its flexibility.

2. Assign different functions to each computer. For instance, all integrations might be assigned to the analog computer, in order to save time and get a continuous output. Or, all function generation might be assigned to the digital computer (where it is known as table look-up).

3. Provide analog plots of digital variables. This is particularly useful in observing the behavior of selected variables while the simulation is in progress. In one case, a stop was put on a 7090 after the first 15 seconds of what would otherwise have been a 10 minute run because it was easy to tell from the behavior of a continuous analog output that a key variable was not behaving quite as desired.

4. Let the digital provide logic for the analog. Things such as switching, scale changing, ending the program, choosing tables to examine, can be readily programmed into the digital and can greatly simplify and possibly even speed up an analog simulation.

5. Allow real hardware to be part of a simulation. Most hardware can readily be connected into the analog, and hybrid operation would allow it to connect to the digital just as easily. Similarly, digital devices can be included in analog operation the same way. Real hardware could also be considered to include people, as part of a control loop.

6. Provide accurate digital printouts of analog variables. Normally, the accuracy with which the analog variables are plotted is less than the accuracy that actually exists in the equipment. Hybrid operation enables selected variables to be converted to digital form and printed out from a digital tape.

The details of this sort of hybrid computing don’t really matter anymore, but the general idea of looking at leaks in the modeling pipeline is still important.

I was also struck by the larger framework of simulation. Of course this makes sense: a missile test is expensive so you want to understand as much as you can using simulation before going out and launching something. In addition to being cost- and time-effective, simulation also makes the live test more effective. The real-world launch gives real-world data which you can compare to your expectations. The better your simulations, the better will be your expectations, and the more you will learn from discrepancies in the live data.

I’ve thought about these issues for awhile in the context of model checking and exploratory data analysis (see BDA, starting from the first edition in 1995, and my 2003 article, A Bayesian formulation of exploratory data analysis and goodness-of-fit testing), but it was only just now that I realized the connection to workflow and simulated-data experimentation.

If only someone had given me this article to read 40 years ago, back when I was first doing simulations of physical systems. I blame the author of that 1962 article, who easily could have shared it with me at the time. The trouble was that he was too self-effacing.

P.S. The diagram at the top of this post comes from this 1963 article, “Corrected inputs: A method for improved hybrid simulation,” which begins:

Makes sense to me, to use some feedback to reduce transmission errors.

They were doing cool stuff back then, 60 years ago. Just regular guys, no Ph.D. or anything. Kinda like Steven Spielberg’s dad. Maybe that’s one reason I liked that movie so much.

A proposal to build new hardware and thermodynamic algorithms for stochastic computing

Patrick Coles writes:

Modern AI has moved away from the absolute, deterministic procedures of early machine learning models. Nowadays, probability and randomness are fully embraced and utilized in AI. Some simple examples of this are avoiding overfitting by randomly dropping out neurons (i.e., dropout), and escaping local minima during training thanks to noisy gradient estimates (i.e., stochastic gradient descent). A deeper example is Bayesian neural networks, where the network’s weights are sampled from a probability distribution and Bayesian inference is employed to update the distribution in the presence of data . . .

Another deep example is generative modeling with diffusion models. Diffusion models add noise to data in a forward process, and then reverse the process to generate a new datapoint (see figure illustrating this for generating an image of a leaf). These models have been extremely successful not only in image generation, but also in generating molecules, proteins and chemically stable materials . . .

AI is currently booming with breakthroughs largely because of these modern AI algorithms that are inherently random. At the same time, it is clear that AI is not reaching its full potential, because of a mismatch between software and hardware. For example, sample generation rate can be relatively slow for diffusion models, and Bayesian neural networks require approximations for their posterior distributions to generate samples in reasonable time.

Then comes the punchline:

There’s no inherent reason why digital hardware is well suited for modern AI, and indeed digital hardware is handicapping these exciting algorithms at the moment.

For production AI, Bayesianism in particular has been stifled from evolving beyond a relative niche because of its lack of mesh with digital hardware. . . . The next hardware paradigm should be specifically tailored to the randomness in modern AI. Specifically, we must start viewing stochasticity as a computational resource. In doing so, we could build hardware that uses the stochastic fluctuations produced by nature.

Coles continues:

The aforementioned building blocks are inherently static. Ideally, the state does not change over time unless it is intentionally acted upon by a gate, in these paradigms.

However, modern AI applications involve accidental time evolution, or in other words, stochasticity. This raises the question of whether we can construct a building block whose state randomly fluctuates over time. This would be useful for naturally simulating the fluctuations in diffusion models, Bayesian inference, and other algorithms.

The key is to introduce a new axis when plotting the state space: time. Let us define a stochastic bit (s-bit) as a bit whose state stochastically evolves over time according to a continuous time Markov chain . . .

Ultimately this involves a shift in perspective. Certain computing paradigms, such as quantum and analog computing, view random noise as a nuisance. Noise is currently the biggest roadblock to realizing ubiquitous commercial impact for quantum computing. On the other hand, Thermodynamic AI views noise as an essential ingredient of its operation. . . .
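To make the s-bit idea concrete, here is a tiny simulation sketch in R (my own illustration, not taken from Coles’s report): a bit whose state flips after exponentially distributed holding times, i.e., a two-state continuous-time Markov chain.

# Simulate an "s-bit" with flip rate lambda.
simulate_sbit <- function(t_end, lambda = 1, state0 = 0L) {
  t <- 0; state <- state0
  times <- 0; states <- state0
  while (t < t_end) {
    t <- t + rexp(1, rate = lambda)   # exponential holding time in the current state
    state <- 1L - state               # flip the bit
    times <- c(times, t); states <- c(states, state)
  }
  data.frame(time = times, state = states)
}

set.seed(42)
path <- simulate_sbit(t_end = 100, lambda = 2)
# Over time, the fraction of time spent in state 1 approaches the stationary value 1/2.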

I think that when Coles says “AI,” he means what we would call “Bayesian inference.” Or maybe AI represents some particularly challenging applications of Bayesian computation.

Analog computing

OK, the above is all background. Coles’s key idea here is to build a computer using new hardware, to build these stochastic bits so that continuous computation gets done directly.

This is reminiscent of what in the 1950s and 1960s was called “analog computation” or “hybrid computation.” An analog computer is something you build with a bunch of resistors and capacitors and op-amps to solve a differential equation. You plug it in, turn on the power, and the voltage tells you the solution. Turn some knobs to change the parameters in the model, or set it up in a circuit with a sawtooth input and plug it into an oscilloscope to get the solution as a function of the input, etc. A hybrid computer mixes analog and digital elements. Coles is proposing something different in that he’s interested in the time evolution of the state (which, when marginalized over time, can be mapped to a posterior distribution), whereas with a traditional analog computer you just look at the end state and you’re not interested in the transient period it takes to get there.

Here’s the technical report from Coles. I have not read it carefully or tried to evaluate it. That would be hard work! It could be of interest to many of you, though.

Large language model alignment “bias” and cultural consensus theory

The way contemporary chatbots like ChatGPT work is by “aligning” a “foundation” model. I think cultural consensus theory (a statistical model, not a contentious issue for school boards) can provide a model for the sociology of alignment.

The foundation: attention models

In a nutshell, language modeling is the simple task of predicting the next subword (called a “token”) based on the previous sequence of subwords. The state of the art had stalled for years on n-gram models that use the previous n subwords (usually with n < 5). In 2017, a team of Google researchers released a paper titled "Attention is all you need," which introduced the current state-of-the-art neural network architecture for language modeling. The breakthrough was in extending the context length into the thousands (GPT-3.5 uses 4K; GPT-4 has 8K and 32K models) with an attention model that figures out which parts of the context to concentrate on. The fundamental bottleneck is that computation is quadratic in context length (though it's all on GPU, so that's a massive number of flops for relatively low power).

The 2017 paper introduced the so-called “transformer” architecture, which combines multiple attention “heads” in parallel. The original application was to translation, but it’s the self-attention component that was extracted for use in LLMs. The “T” in “GPT” is for “transformers” (the “GP” is for “generative pretrained”). What researchers have found is that the heads learn different aspects of prediction, such as different syntactic structures, much like any mixture model.
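For the curious, here is a bare-bones numpy sketch of a single head of scaled dot-product attention with a causal mask. It's not the actual GPT code, and the dimensions are made up, but it shows the operation whose cost grows quadratically with context length.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Single attention head: each position mixes the values V of the
    # positions whose keys K best match its query Q.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n): quadratic in context length
    # Causal mask so a position only attends to itself and earlier positions.
    n = scores.shape[0]
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the context
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 4                                  # 6 tokens, 4-dimensional embeddings
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)   # (6, 4)

A transformer stacks many such heads in parallel, mixes their outputs with a linear layer, and interleaves them with feed-forward layers.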

There’s a beautiful two-hour YouTube tutorial by Andrej Karpathy that builds up the entire transformer architecture piece by piece in a Colab notebook you can also use. Karpathy applies it to building a Shakespearean chatbot. It assumes you know Python, but is otherwise quite gentle, starting with an intro to n-gram language models and softmax.

Garbage-in, garbage-out

The current crop of large language models has been trained on vast amounts of human text, primarily collected through the internet. As you might imagine, including sources like Reddit and 4chan and Twitter leads to a broad set of what can most charitably be called “points of view.” Even on technical issues, the web is cluttered with material that should probably not be the basis for serious work—homework exercises for intro data science classes clutter GitHub and StackOverflow, every statistician and their cousin’s experimental code seems to be wrapped up as an R package, scripts from ancient versions of software persist, etc.

Alignment: from LLMs to chatbots

After building these powerful, transformer-based large language models (LLMs), people realized that they were really good at generating text. As in they blew away any previous compression record (just like the TV show Silicon Valley!). You can convert a language model into a compression scheme by driving an arithmetic coder with its next-symbol predictions, which is the same idea behind prediction by partial matching (PPM); the reference arithmetic-coding implementation was designed and coded by Radford Neal (with Ian Witten and John Cleary) in 1987. Seriously, they should win an ACM Turing Award just for the quantum leap in text compression.
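To see the language-model-as-compressor connection without any real LLM or arithmetic coder, here is a toy calculation: the ideal code length of a text under a predictive model is the sum of -log2 of the probability the model assigns to each successive symbol, and an arithmetic coder gets within a couple of bits of that. The character bigram model below is a stand-in I made up for illustration.

import math
from collections import Counter, defaultdict

text = "the cat sat on the mat and the cat sat still"

# Toy character bigram model with add-one smoothing (a stand-in for an LLM).
alphabet = sorted(set(text))
counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def prob(nxt, prev):
    c = counts[prev]
    return (c[nxt] + 1) / (sum(c.values()) + len(alphabet))

# Ideal code length: -log2 of each next-character probability, summed.
bits = sum(-math.log2(prob(nxt, prev)) for prev, nxt in zip(text, text[1:]))
print(f"{bits:.1f} bits for {len(text) - 1} predicted characters "
      f"({bits / (len(text) - 1):.2f} bits/char vs. 8 for raw ASCII)")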

The early LLMs could write computer programs, translate Pascal to Fortran and Swahili to English, and generate new recipes given only a list of ingredients or new episodes of TV shows. But they tend to ramble off topic, tend to “hallucinate” (the term of art for when LLMs make things up; it’s called “dreaming” for diffusion models like Midjourney), and tend to be fluid with the points of view they find in training data. They’re just as happy telling you how to make a bomb in your basement and where to set it off as they are telling you how to make a soufflé in your kitchen and how to serve it. And if you “jailbreak” the current ChatGPT, it’ll still be happy to tell you how to try all sorts of dangerous, illegal, and morally and ethically questionable activities.

OpenAI’s approach to preventing the LLMs from spewing dangerous and/or toxic garbage is to fine-tune the large language models with human feedback (HF) using reinforcement learning (RL, and together RLHF). Their stated goal was to “align” the language models to be (a) helpful, (b) truthful, and (c) harmless. Presented this way, it sounds like an objective task, but the notions of truthfulness and harmlessness are difficult to pin down and require subjective judgment calls. Even helpfulness is a slippery notion, in that help that’s too verbose or too specific isn’t helpful. What one person takes to be self-evident in these realms can be considered lunacy by others.
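At its core, the reward-model step of RLHF typically boils down to a pairwise preference loss of the following form. This is a generic sketch of that recipe, not OpenAI's code, and the "rewards" here are simulated numbers.

import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    # -log sigmoid(r_chosen - r_rejected), averaged over comparison pairs:
    # the reward of the human-preferred response should exceed the reward
    # of the rejected one.
    return np.mean(np.log1p(np.exp(-(reward_chosen - reward_rejected))))

rng = np.random.default_rng(0)
r_chosen, r_rejected = rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 100)
print(preference_loss(r_chosen, r_rejected))  # lower when pairs are ranked correctly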

OpenAI either implicitly or explicitly chose the point of view of a West-coast American liberal, which is the far left of the mainstream US political spectrum, even though it’s relatively conservative by European standards. They could’ve just as easily decided to give ChatGPT the perspective of the far right of the mainstream US political spectrum and it would’ve had a very different perspective and a different segment of the population would be complaining about its biases.

Cultural consensus theory

In 1979, Phil Dawid and Allan Skene introduced a statistical model of crowdsourcing for medical records. The idea is that there’s a latent true value of something like whether a patient smokes, and doctors looking at medical records are going to give you a noisy measurement of that value. The same kind of model can be applied to radiology, with doctors classifying medical images for stage of cancer, etc. Or to NeurIPS paper reviews. The model assigns accuracies and biases (too positive or too negative) to the raters and infers the underlying rating most consistent with the ratings (given the rater accuracies and biases).

Dawid and Skene’s model was independently rediscovered by many, including by me with the help of Andrew and Jennifer Hill (it was my gateway model into Bayes and there’s an example of how to code it in the Stan User’s Guide). As Andrew tells me, no matter what model you look at, a psychometrician probably introduced it 50 years ago (e.g., Elo is just a rescaled Bradley-Terry model, which dates to 1952).
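Here is a compact numpy sketch of the Dawid and Skene idea, using EM on simulated binary ratings rather than the Bayesian Stan version in the User's Guide. The rater accuracies and data sizes are made up; the point is just that rater confusion matrices and the latent "true" labels get estimated jointly.

import numpy as np

rng = np.random.default_rng(0)
I, J, K = 500, 5, 2                      # items, raters, label classes
true_z = rng.integers(K, size=I)         # latent true labels
# Each rater's confusion matrix: P(rating = l | truth = k). Made-up accuracies.
acc = np.array([0.95, 0.9, 0.8, 0.7, 0.55])
theta_true = np.stack([np.full((K, K), (1 - a) / (K - 1))
                       + np.eye(K) * (a - (1 - a) / (K - 1)) for a in acc])
y = np.array([[rng.choice(K, p=theta_true[j, true_z[i]]) for j in range(J)]
              for i in range(I)])        # observed ratings, shape (I, J)

# EM for the Dawid-Skene model, initialized from per-item vote shares.
resp = np.stack([(y == k).mean(axis=1) for k in range(K)], axis=1)
for _ in range(50):
    # M-step: prevalence and rater confusion matrices from the soft labels.
    pi = resp.mean(axis=0)
    theta = np.stack([[resp[:, k] @ (y[:, j, None] == np.arange(K)).astype(float)
                       for k in range(K)]
                      for j in range(J)]) + 1e-6   # small smoothing avoids log(0)
    theta /= theta.sum(axis=2, keepdims=True)
    # E-step: posterior over each item's true label.
    log_p = np.log(pi) + sum(np.log(theta[j][:, y[:, j]]).T for j in range(J))
    resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

print("accuracy of inferred labels:", (resp.argmax(axis=1) == true_z).mean())

The Stan version in the User's Guide does the same thing with priors and full posterior uncertainty instead of EM point estimates.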

In 1986, A. Kimball Romney, Susan Weller, and William Batchelder published “Culture as consensus: a theory of culture and informant accuracy”, which introduced cultural consensus theory (CCT). It shouldn’t be surprising that it was published in an anthropology journal, because anthropology is cross-cultural sociology. Batchelder and Romney later published a paper, “Test theory without an answer key,” in Psychometrika; think of an IRT 0PL model but with an unknown true answer, which is the Dawid and Skene model.

The twist that CCT introduced to take it beyond Dawid and Skene’s model was a mixture model for the “truth.” That is, they assumed there might not actually be a single consensus point of view among raters. This would be a good idea for crowdsourcing, too, where the respondents are often a mix of spammers and people making a good-faith effort (it’s really more of a continuum).

I think it would be interesting to apply CCT to ChatGPT. It’s the same kind of thing that folks do in applying ideal point models to voting.

Postdoctoral position at MIT: privacy, synthetic data, fairness & causal inference

I have appreciated Jessica’s recent coverage of differential privacy and related topics on this blog — especially as I’ve also started working in this general area.

So I thought I’d share this new postdoc position that Manish Raghavan and I have here at MIT, where this area is an important focus. Here’s some of the description of the broad project area, which this researcher would help shape:

This research program is working to understand and advance techniques for sharing and using data while limiting what is revealed about any individual or organization. We are particularly interested in how privacy-preserving technologies interface with recent developments in high-dimensional statistical machine learning (including foundation models), questions about fairness of downstream decisions, and with causal inference. Applications include some in government and public policy (e.g., related to US Census Bureau data products) and increasing use in multiple industries (e.g., tech companies, finance).

While many people with relevant expertise might be coming from CS, we’re also very happy to get interest from statisticians — who have a lot to add here!

This post is by Dean Eckles.

ChatGPT4 writes Stan code so I don’t have to.

Several months ago I (Phil Price) wrote a Stan model to do some time series forecasting. It took me almost a full day to get it running and debugged. Today I decided to test ChatGPT4, by seeing if it could write a Stan model to give me the same functionality. I have been using ChatGPT to help me with coding, but in the past I’ve used it for little stuff, such as “I have a data frame with column names that all start with mm, and some of the names have ( and ) in them; write a function to remove all of that”. I have just started using it to write entire programs for me, and it looks like it is going to be enormously useful. Below, I present a verbatim transcript of the start of a ChatGPT4 session. At the end I got tired of cutting and pasting — and you would have gotten tired of reading it — but I tell you what happened. And at the very end, I’ll summarize my thoughts on the ChatGPT4 programming experience thus far.


Phil: I have monthly time series data. I want to fit a model to the data (typically from the past five to eight years) and use the model to forecast future values. Let’s call the data y[t]. Here are some things I want to include:
1. Seasonality. If the value in March is high in one year it is probably also high in the following year, and so on for other months. So I want ‘month effects’.
2. Autocorrelation in month effects. The month effect, for a given month of the year, is not iid random. If the March effect is high for several years and then drops a lot, it is more likely to stay low than to jump back up.
3. Local trend. If the value has increased for the past several months it will probably keep increasing, although it might not.
4. Soft constraints. The value of y[t] is very unlikely to fall below a value y_low, or to go above y_high. I don’t want to impose the constraints after the fact. Instead, I want to repel the forecasts from these limits. I want a fairly short-range repulsion, so the forecasts aren’t affected unless they get close to the limits.

Please write a Stan program to do this, and include comments that show where each of those four requirements is implemented.

ChatGPT4: Here’s a Stan program that incorporates seasonality, autocorrelation in month effects, local trend, and soft constraints:

Continue reading

Parallelization for Markov chain Monte Carlo with heterogeneous runtimes: a case-study on ODE-based models

(this post is by Charles)

Last week, BayesComp 2023 took place in Levi, Finland. The conference covered a broad range of topics in Bayesian computation, with many high quality sessions, talks, and posters. Here’s a link to the talk abstracts. I presented two posters at the event. The first poster was on assessing the convergence of MCMC in the many-short-chains regime. I already blogged about this research (link): here’s the poster and the corresponding preprint.

The second poster was also on the topic of running many chains in parallel but in the context of models based on ordinary differential equations (ODEs). This was the outcome of a project led by Stanislas du Ché during his summer internship at Columbia University. We examined several pharmacometrics models, with likelihoods parameterized by the solution to an ODE. Having to solve an ODE inside a Bayesian model is challenging because the behavior of the ODE can change as the Markov chains journey across the parameter space. An ODE that is easy to solve at one point in parameter space can be incredibly difficult somewhere else. In the past, we analyzed this issue in the illustrative planetary motion example (Gelman et al (2020), Section 11). This is the type of problem where we need to be careful about how we initialize our Markov chains and we should not rely on Stan’s defaults. Indeed, these defaults can start you in regions where your ODE is nearly impossible to solve and completely kill your computation! A popular heuristic is to draw the initial point from the prior distribution. On a related note, we need to construct priors carefully to exclude patently absurd parameter values and (hopefully) parameter values prone to frustrate our ODE solvers.

Even then—and especially if our priors are weakly informative—our Markov chains will likely journey through challenging regions. A common manifestation of this problem is that some chains lag behind because their random trajectories take them through areas that frustrate the ODE solver. Stanislas observed that this problem becomes more acute when we run many chains. Indeed, as we increase the number of chains, the probability that at least some of the chains get “stuck” increases. Then, even when running chains in parallel, the efficiency of MCMC as measured by effective sample size per second (ESS/s) eventually goes down as we add more chains because we are waiting for the slowest chain to finish!
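To see the wall-clock effect concretely, here is a crude simulation with made-up runtime distributions (nothing to do with our actual pharmacometrics models): each chain contributes its share of a fixed ESS target, most chains warm up quickly, but a small fraction wander into a hard region and take much longer, and we wait for the slowest one.

import numpy as np

rng = np.random.default_rng(0)
target_ess = 1000.0       # total ESS we want, split across chains
sampling_speed = 10.0     # made-up: ESS per second per chain once warmed up
n_sims = 2000

for n_chains in (1, 4, 16, 64, 256, 1024):
    # Made-up runtime model: warmup takes about 1 unit, but 2% of chains
    # wander into a region where the ODE solver struggles and take ~20x longer.
    warmup = rng.lognormal(mean=0.0, sigma=0.3, size=(n_sims, n_chains))
    stuck = rng.random((n_sims, n_chains)) < 0.02
    warmup = np.where(stuck, 20.0 * warmup, warmup)
    sampling = (target_ess / n_chains) / sampling_speed   # shorter with more chains
    wall_clock = (warmup + sampling).max(axis=1)          # wait for the slowest chain
    print(f"{n_chains:5d} chains: median ESS/s = "
          f"{np.median(target_ess / wall_clock):6.1f}")

With few chains the sampling time dominates and adding chains helps; past some point the slowest warmup dominates and ESS per second goes back down.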

OK. Well, we don’t want to be punished for throwing more computation at our problem. What if we instead waited for the fastest chains to finish? This is what Stanislas studied by proposing a strategy where we stop the analysis after a certain ESS is achieved, even if some chains are still warming up. An important question is what bias dropping chains introduces. One concern is that the fastest chains are biased because they fail to explore a region of the parameter space which contains a non-negligible amount of probability mass and where the ODE happens to be more difficult to solve. Stanislas tried to address this problem using stacking (Yao et al 2018), a strategy designed to correct for biased Markov chains. But stacking still assumes all the chains somehow “cover” the region where the probability mass concentrates and, when properly weighted, produce unbiased Monte Carlo estimators.

We may also wonder about the behavior of the slow chains. If the slow chains are close to stationarity, then by excluding them we are throwing away samples that would reduce the variance of our Monte Carlo estimators; however, it’s not worth waiting for these chains to finish if we’ve already achieved the desired precision. What is more, as Andrew Gelman pointed out to me, slow chains can often be biased, for example if they get stuck in a pathological region during the warmup and never escape this region, as was the case in the planetary motion example. But we can’t expect this to always be the case.

In summary, I like the idea of waiting only for the fastest chains and I think understanding how to do this in a robust manner remains an open question. This work posed the problem and took steps in the right direction. There was a lot of traffic at the poster and I was pleased to see many people at the conference working on ODE-based models.

“A complete, current, and granular picture of COVID-19 epidemic in the United States”

Bob Carpenter writes:

Here’s an app estimating county-level Covid for the US with the uncertainties. The methodology sounds very pragmatic in its combination of optimization and sampling for hierarchical models.

I like it! And not just because they use Stan.

I just have a few criticisms regarding their displays:

1. I don’t like the color scheme of their map. This thing where they mix in hue changes with intensity changes is just a mess. Not as bad as the notorious rainbow color scheme but not good.

2. I wish they would not list states in alphabetical order: Alabama, Alaska, etc.

If you want a look-up table, that’s fine, but for the main display it’s better to show things that are telling the story.

3. All these graphs look kinda the same. Not identical, but similar. How bout showing the estimated national curve and then for each state showing it relative to the national average?

4. I don’t like this rate-per-100,000 thing. I get that this is standard for epidemiology, but for anyone else (including me), it’s a mess, involving lots of shifting of decimal places in my head. If you want to do the rate per X, why not rate per million? That would be a bit easier to follow, no? Or go the other way and just give straight-up percentages. The y-axes for these graphs are labeled 0, 1k, 2k. That’s just 0, 1%, 2%. To me, “1%” is much clearer than “1k” with an implicit “per 100,000.”

5. The x-axes on these time series are a disaster. “12/8, 6/13, 12/22”? What’s that all about? Just give months and years, please! I don’t want to have to decode the damn axis. Baby steps here.

But these are minor comments. Overall it’s an impressive page, and great to see all the data and code there too.

How to think about proofs of correctness of computer programs?

Mark Tuttle points to this post by Bill Gasarch about a panel discussion on the following topic:

Have the points made in Social Processes and Proofs of Theorems and Programs by DeMillo, Lipton, and Perlis survived the test of time? (Spoiler Alert: Yes.) . . .

1) The main topic was the 1979 DeMillo-Lipton-Perlis paper (see here) that gave arguments why Proofs of Program correctness could not work.

An all-too-brief summary of the DLP paper: Some researchers are trying to set up frameworks for doing proofs that programs are correct, analogous to the certainty we get with a proof of a Theorem in Mathematics. But proofs in Mathematics are, in reality, NOT that rigorous. Often details are left out or left to the reader. This is fine for mathematics (more on that later) but unacceptable for programs, which need rather precise and rigorous proofs.

How do theorems in mathematics really get verified? By having enough people look at them and make sure they match intuitions, what DLP call A Social Process. (NOTE FROM BILL: Papers that are not important do not get looked at so there may well be errors.)

2) The notion of proving-programs-correct was very seductive; however, the people who were trying to do this had a blind spot about how the analogy of proving-programs-correct and proving-theorem-correct differ. In particular, a program is rather complicated and even stating carefully what you want to prove is difficult. By contrast, for most math statements, what you want to prove is clear. Note also that a program has lots of code (far more now than when DLP was written) and so much can happen that you cannot account for. . . .

6) So how can we be more certain that programs are correct?

a) Testing.
b) Modularize and test. Fix errors. Modularize and test. Fix errors…
c) Try to isolate side effects.
d) More testing.

Some point to Model Checking, which could be considered very sophisticated testing, but that’s used to verify circuits and perhaps low-level code, not programs. . . .

7) Can computers themselves help with proofs of correctness? That is the only hope; however, there are scaling problems.

8) When DLP was written a program with 100,000 lines of code was considered large. Now we have programs with millions of lines of code. And now we have more concurrency. So the lessons of the DLP paper are probably more relevant now than they were then.

9) Since Program Verification does not seem to be used, how come we don’t have a Software crisis?

a) We do! The Q+A mechanism at the meeting was terrible.
b) We do! FILL IN YOUR OWN FAVORITE STORY OF BAD SOFTWARE.
c) See the answer to question 6. . . .

The comments to that post are interesting too.

Tuttle then writes to me:

The point, from your several relevant blog discussions, is the criticality of “social process” – in this case social process in programming and software engineering, and, yes, even in mathematics. I’m not sure why there seem to be so many failures of social process in the kind of things you review – results that should be retracted, but, aren’t, for instance. Mathematicians seem to be able to fix things, given enough time, and, of course, the market fixes software engineering shortfalls. So, per your observations, where is the market for the kind of problematic conclusions you discuss?

Has there not been enough time? Or, is it that the results are not important enough for the market to care?

Disclosure: I’ve never proved a program correct. However, once I studied program verification it affected how I thought about programming and documentation. For example, if I’m implementing a function that is really important (typically short, but central), I may add a comment at the beginning describing a “pre-condition” – what is true about the inputs. Then, I try to name the function in a way that represents the post-condition – what is true about the output. The point, of course, is that the innards of the function represent the transformation of the inputs into the outputs, without the mathematics required for real program verification. In this way, the learned papers about program verification did affect how I did, and do, software engineering.
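Here's a toy version of the style Tuttle describes (my sketch, not his code): the pre-condition is stated, and optionally checked, at the top of a short central function, and the function's name carries the post-condition.

from bisect import bisect_left

def index_of_in_sorted(xs: list[float], target: float) -> int:
    # Post-condition (carried by the name): returns i such that xs[i] == target.
    # Pre-condition: xs is sorted in non-decreasing order and contains target.
    assert all(a <= b for a, b in zip(xs, xs[1:])), "pre-condition: xs must be sorted"
    i = bisect_left(xs, target)
    assert i < len(xs) and xs[i] == target, "pre-condition: target must be in xs"
    return i

print(index_of_in_sorted([1.0, 2.5, 2.5, 7.0], 2.5))   # -> 1

Nothing here is a proof, of course; it's just the documentation-plus-runtime-check habit that reading the verification literature leaves you with.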

I’m reminded of Bob Carpenter’s remark that machine learning research has huge replicability problems, first because methods are tuned to the particular examples in the published papers, so advertised methods will typically work much better in those examples than in the wild, and second because machine learning methods will depend on lots of tuning parameters and settings that are not specified anywhere in the papers that describe them. The field of statistics is even worse, in that we (statisticians) often don’t supply any code at all. Talk about non-replicable!

And all of this is getting worse with the proliferation of short-form articles such as 6-page PNAS articles and 8-page conference papers, where the focus is on the importance of the method, not on correctness, and where there’s just no space to describe exactly what you’ve done.

Meanwhile the theoretical papers are all yammering on about “guarantees,” which to me is just a fancy way of saying “assumptions.”