
The EpiBayes research group at the University of Michigan has a postdoc opening!

Jon Zelner writes:

The EpiBayes research group, led by Dr. Jon Zelner in the Dept. of Epidemiology and Center for Social Epidemiology and Population Health (CSEPH) at the University of Michigan School of Public Health, seeks a postdoctoral fellow to work with us on a variety of projects relating to the transmission of SARS-CoV-2 and influenza in Michigan and at the national level.

We are a multidisciplinary, highly collaborative group, with close connections to other research groups performing primary data collection at the University of Michigan and other institutions. As a group, we are broadly interested in understanding and addressing the key sources of spatial and sociodemographic variation in infectious disease infection and transmission risks. Our work covers a wide array of pathogens including respiratory viruses (COVID-19, influenza, RSV), bacterial infections (tuberculosis, Meningitis B, MRSA) and pathogens causing diarrheal disease (noroviruses, E. coli, all-cause diarrhea). In this work, we integrate epidemiological data with environmental, spatial, and molecular data to reconstruct patterns of infection risk. The overarching theme uniting these projects is a focus on the use of cutting-edge statistical and simulation methods to understand and positively impact socioeconomic disparities in global and domestic infection risk.

The ideal candidate will have strong quantitative and computational skills and familiarity with some combination of the following: spatial analysis, Bayesian statistics, and transmission modeling, as well as a background in infectious disease epidemiology. Candidates must hold a PhD, but field of study is flexible (e.g. epidemiology, statistics, ecology, social science) with suitability assessed primarily as a function of skills and experience. We have access to a wide array of detailed observational and prospective cohort datasets, and the ideal candidate will be comfortable developing an independent research agenda that builds on these sources of data and complements the research/policy priorities identified by Dr. Zelner and other members of the group.
This position can be filled immediately, but the start date is flexible. The initial term of this position will be for two years, with the possibility of extension. This position will be all-remote until Spring 2021, with location thereafter to be determined in collaboration with the candidate and as a function of public health conditions. Interested candidates should send a brief inquiry and CV to Jon Zelner at

Jon Zelner is great (and one of my collaborators on this article). You should apply for this job.

Their covid site is

More on absolute error vs. relative error in Monte Carlo methods

This came up again in a discussion from someone asking if we can use Stan to evaluate arbitrary integrals. The integral I was given was the following:

\displaystyle \alpha = \int_{y \in R\textrm{-ball}} \textrm{multinormal}(y \mid 0, \sigma^2 \textrm{I}) \, \textrm{d}y

where the R-ball is assumed to be in D dimensions so that y \in \mathbb{R}^D.

(MC)MC approach

The textbook Monte Carlo approach (Markov chain or plain old) to evaluating such an integral is to evaluate

\displaystyle \frac{1}{M} \sum_{m=1}^M \textrm{normal}(y^{(m)} \mid 0, \sigma),

where the marginal distribution of each y^{(m)} is

\displaystyle y^{(m)} \sim \textrm{uniform}(R\textrm{-ball}).

I provide a Stan program for doing this below.

If you want to do this without Stan, you can rejection sample points in the ball very easily in low dimensions. In higher dimensions, you can generate a random angle and radius (the random radius needs to be drawn non-uniformly in dimensions higher than 1 to account for increasing area/volume as the radius increases). Stan could’ve also used an angle/radius parameterization, but I’m too lazy to work out the trig.
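Here is a minimal sketch of both non-Stan approaches, using only the Python standard library. The function names are hypothetical. For the direction I normalize a vector of standard normal draws rather than working out explicit angles; that is a swapped-in technique, not the angle parameterization mentioned above, but it also yields a uniformly distributed direction. The radius is drawn as R times U^(1/D), which is the non-uniform radius needed to account for volume growing with radius.

```python
import math
import random


def sample_ball_rejection(dim, radius, rng):
    """Draw one point uniformly from the ball by rejection from the enclosing cube."""
    while True:
        point = [rng.uniform(-radius, radius) for _ in range(dim)]
        if sum(x * x for x in point) <= radius * radius:
            return point


def sample_ball_direction_radius(dim, radius, rng):
    """Draw one point uniformly from the ball using a random direction and radius."""
    # A normalized vector of standard normal draws is uniform on the unit sphere.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    # Volume grows like r^dim, so the radius is radius * U^(1/dim), not radius * U.
    r = radius * rng.random() ** (1.0 / dim)
    return [r * x / norm for x in direction]


rng = random.Random(1234)
print(sample_ball_rejection(2, 3.0, rng))
print(sample_ball_direction_radius(10, 3.0, rng))
```

Rejection from the cube is fine in two or three dimensions, but its acceptance rate is the ratio of ball volume to cube volume, which collapses quickly as the dimension grows. The direction-and-radius version never rejects.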

Most draws can contribute zero

If the volume of the R-ball is large compared to the volume containing the draws you get from \textrm{normal}(0, \sigma), then most of the draws from the R-ball will have roughly zero density in \textrm{normal}(0, \sigma). The effect of dimensionality is exponential on the difference in volume.
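To put rough numbers on the exponential effect, here is a small sketch. It assumes, purely for illustration, that the region where the normal density is non-negligible can be stood in for by a smaller concentric ball of radius r. The function name and the radii are hypothetical. The fraction of uniform draws from the R-ball that land in the smaller ball is the ratio of the two volumes, (r / R)^D.

```python
def fraction_within(inner_radius, ball_radius, dim):
    """Fraction of uniform draws from the ball that land within inner_radius of the origin."""
    ratio = min(1.0, inner_radius / ball_radius)
    return ratio ** dim


# Hypothetical radii: the inner ball has half the radius of the R-ball.
for dim in (1, 2, 5, 10, 20):
    print(dim, fraction_within(1.0, 2.0, dim))
```

With the radius ratio held at one half, each added dimension halves the fraction of draws that contribute anything noticeable.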

Absolute error

Even if the volume of the R-ball is large compared to the volume containing the draws you get from \textrm{normal}(0, \sigma), the (MCMC) central limit theorem still controls error. It’s just that the error it controls is squared error. Thus if we measure absolute error, we’re fine. We’ll get an estimate near zero which is right to several decimal places.

Relative error

If the value of the integral is 0.01 or something like that, then to get 10% relative error, we need our estimate to fall within 0.001 of the true answer. 0 doesn’t do that.

What we have to do with a naive application of Monte Carlo methods is make a whole lot of draws to get an estimate between 0.009 and 0.011.
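As a rough illustration of how absolute and relative error behave as the number of draws grows, here is a sketch in two dimensions, where the integral has the closed form 1 - exp(-R^2 / (2 sigma^2)) to check against. All function names and the values of R and sigma are hypothetical. One assumption to flag: the sketch multiplies the average density by the area of the disc, because the plain average over uniform draws estimates the mean density over the disc rather than the integral itself.

```python
import math
import random


def sample_disc(radius, rng):
    """Draw one point uniformly from the disc by rejection from the enclosing square."""
    while True:
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            return x, y


def normal_density_2d(x, y, sigma):
    """Density of a 2-D normal with mean zero and covariance sigma^2 * I."""
    return math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def mc_integral_2d(radius, sigma, num_draws, rng):
    """Monte Carlo estimate: disc area times the average density at uniform draws."""
    total = 0.0
    for _ in range(num_draws):
        x, y = sample_disc(radius, rng)
        total += normal_density_2d(x, y, sigma)
    area = math.pi * radius ** 2
    return area * total / num_draws


def exact_integral_2d(radius, sigma):
    """Closed form for the normal mass inside a disc centered at the origin."""
    return 1.0 - math.exp(-radius ** 2 / (2 * sigma ** 2))


rng = random.Random(2020)
radius, sigma = 3.0, 1.0
exact = exact_integral_2d(radius, sigma)
for num_draws in (100, 10_000, 200_000):
    estimate = mc_integral_2d(radius, sigma, num_draws, rng)
    abs_err = abs(estimate - exact)
    print(num_draws, estimate, abs_err, abs_err / exact)
```

The absolute error shrinks at the usual one-over-root-M rate whatever the true value is. The relative error is that same quantity divided by the true value, so it is the small-integral cases that demand the most draws.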

Relation to discrete sampling

This is related to the discrete sampling problem we ran into when trying to estimate the probability of winning the lottery by buying tickets and computing a Monte Carlo estimate. See my previous post, Monte Carlo and winning the lottery. The setup was very hard for readers to swallow when I first posted about it (my experience is that statisticians don’t like thought experiments or simplified hello world examples).
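For the discrete case, a back-of-the-envelope sketch shows how the number of draws scales. It assumes each draw is an independent win-or-lose trial with win probability p, so the Monte Carlo standard error after M draws is sqrt(p (1 - p) / M). Asking for that standard error to equal a relative tolerance times p gives M = (1 - p) / (p * tolerance^2). The function name and the probabilities below are hypothetical, and matching one standard error to the tolerance is only one possible criterion.

```python
import math


def draws_for_relative_error(p, rel_err):
    """Draws M at which the standard error sqrt(p * (1 - p) / M) equals rel_err * p."""
    return math.ceil((1.0 - p) / (p * rel_err ** 2))


# Hypothetical win probabilities, each with a 10% relative error target.
for p in (0.01, 1e-4, 1e-8):
    print(p, draws_for_relative_error(p, 0.10))
```

The required number of draws grows roughly like 1 / p, which is why a handful of lottery tickets, nearly all of them losers, tells you almost nothing about the odds in relative terms.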

Stan program

The complexity comes in from the constraining transform to the R-ball and corresponding Jacobian adjustment to make the distribution in the ball uniform.

For the transform, I used a stick-breaking procedure very similar to how Stan transforms simplex variables and the rows of a correlation matrix Cholesky factor. The transform requires a Jacobian, which is calculated on the fly. An approach that could be done with uniform draws purely through a transform would be to draw an angle (uniformly) and radius (non-uniformly based on distance from the origin); this would probably be more efficient.

/**
 * Sampling for this program computes
 *   INTEGRAL_{theta in R ball in N dimensions} normal(theta | 0, sigma) d.theta
 */
data {
  int N;
  real R;
  real sigma;
}
parameters {
  // alpha must lie in [-1, 1]^N so that sqrt(1 - sum_sq) below stays real
  vector<lower=-1, upper=1>[N] alpha;
}
transformed parameters {
  // transform alpha to the unit ball
  vector[N] theta;

  // log_abs_det_J accumulates the change-of-variables correction
  real log_abs_det_J = 0;
  {
    real sum_sq = 0;
    for (n in 1:N) {
      real mult = sqrt(1 - sum_sq);
      theta[n] = alpha[n] * mult;
      sum_sq += theta[n]^2;
      log_abs_det_J += log(mult);
    }
  }
  // scale from unit ball to R-ball;  no Jacobian because R is const.
  theta *= R;
}
model {
  // the Jacobian's triangular, so the log determinant is the sum of
  // the log of the absolute derivatives of the diagonal of J, i.e.,
  // sum(log(d.theta[n] / d.alpha[n])), as computed above

  target += log_abs_det_J;

  // theta is now uniform on R-ball
}
generated quantities {
  // posterior mean of E will be:
  //    INTEGRAL_{y in R-ball} normal(y | 0,  sigma) d.y
  //    =approx= 1/M SUM_{m in 1:M}  normal(theta(m) | 0, sigma)
  //             where theta(m) drawn uniformly from R-ball
  real E = exp(normal_lpdf(theta | 0, sigma));
}

As noted in the program inline documentation, the posterior mean of E is the result of the integral of interest.

Heckman Curve Update Update

tl;dr: “The policy conclusion we draw from our analysis is that age is not a short cut for identifying where governments should, or should not, invest. There are many well‐studied interventions for children that are worthy candidates for public funding based on efficiency considerations. However, the same is also true of many interventions targeting youth and older people. . . .”

Here’s the background. David Rea writes:

Just letting you know what eventually happened with our exchange with James Heckman over our ‘Heckman Curve’ paper in the Journal of Economic Surveys.

When we last emailed, James Heckman had written a response to our paper, and this had been accepted for publication in the JES.

The journal then offered us the opportunity to write a reply, which we did, and this was published as an early view.

After this James Heckman decided to withdraw his paper. We don’t know why he changed his mind about publication, but it did seem a little unusual.

If you are interested, here is a bit more background on the debate that didn’t eventuate. This is the Heckman Curve as published in Science in 2006:

Our paper attempted to see if more recent data showed a similar pattern. We used a dataset of benefit cost ratio estimates for 314 well-studied interventions from the Washington State Institute for Public Policy database.

We couldn’t find any evidence of a Heckman Curve in the data.

To be clear, the dataset has many early childhood interventions that generate large estimated benefits relative to the program cost. There are many early intervention programs for children that are worthwhile investments from an efficiency point of view. However, the key point is that there are many well-studied youth and adult interventions that are also very cost effective.

Heckman’s unpublished response to our paper contained several criticisms. The central argument was that we were misinterpreting his work. He stated that the Heckman Curve does not describe how the return on investment of human capital interventions differs by age. It is instead a theoretical proposition or thought experiment about an optimal portfolio of best-practice investments in human capital.

We actually thought that this was a very useful clarification. Many people, particularly in the policy community, understand the Heckman Curve as describing how average rates of return of human capital interventions differ by age. Clarifying this point has important real-world implications for social policy. It’s a funny thing to say given that he is critical of our work, but we think it would be useful for James Heckman to publish his paper.

We discussed David Rea and Tony Burton’s paper, “New evidence on the Heckman Curve,” last year.

Apparently the response by Heckman will never be published.

But, in any case, here’s the reply by Rea and Burton to the response that Heckman withdrew:

Clarifying the Nature of the Heckman Curve

David Rea and Tony Burton

Abstract: In response to our paper, James Heckman states that the Heckman Curve does not describe how the average return on investment of programs differs by the age of recipients. This clarification is useful as many people in the policy community have understood the Heckman Curve in this manner.

In our paper, we interpreted the Heckman Curve as a proposition that (a) social policy interventions targeted at early childhood would generate benefit cost ratios that were higher than other age groups, and (b) interventions targeted at older age groups would often have benefits smaller than their costs (Rea and Burton, 2020).

Our characterisation of the Heckman Curve was based on the paper by James Heckman entitled ‘Skill formation and the economics of investing in disadvantaged children’. This was published in Science in 2006 and stated that ‘early interventions targeted toward disadvantaged children have much higher returns than later interventions such as reduced pupil teacher ratios, public job training, convict rehabilitation programs, tuition subsidies, or expenditure on police. At current levels of resources, society overinvests in remedial skill investments at later ages and underinvests in the early years’ (Heckman, 2006, p.1902).

The natural interpretation of the quoted statement is that Professor Heckman was discussing interventions, investments in human capital and was advising decision‐makers inside and outside government to target investments in a particular way.

The empirical component of our paper used a dataset of benefit cost ratios estimated by the Washington State Institute for Public Policy. Analysis of the dataset showed no readily apparent relationship between the estimated benefit cost ratios of interventions and the age of recipients.

The policy conclusion we draw from our analysis is that age is not a short cut for identifying where governments should, or should not, invest. There are many well‐studied interventions for children that are worthy candidates for public funding based on efficiency considerations. However, the same is also true of many interventions targeting youth and older people. A number of early intervention programs have been shown to be cost effective, as have a range of ‘remedial’ or ‘second chance’ programs targeting older individuals. Good public policy requires a case‐by‐case assessment of the evidence and benefit cost analysis for each intervention being considered.

In his comment on our paper, Heckman states that we have misinterpreted the Heckman Curve (Heckman, 2020). He states that it is not a proposition about how the average return on investment of programs differs by the age of recipients. We welcome this clarification and believe it is a useful outcome of the exchange between us. Many people including ourselves have previously understood the Heckman Curve to be advice on public investment in human capital programmes. It is helpful to understand that his work in this area is not meant to be relevant to public investment decisions in the manner we describe.


Heckman, J.J. (2006) Skill formation and the economics of investing in disadvantaged children. Science 312 (5782): 1900–1902.

Heckman, J.J. (2020) Comment on “new evidence on the Heckman Curve” by David Rea and Tony Burton. Journal of Economic Surveys.

Rea, D. and Burton, T. (2020) New evidence on the Heckman Curve. Journal of Economic Surveys 31 (2).

So, it all seems clear. Heckman published a pretty graph in 2006 that was misinterpreted by many people as implying that “early interventions targeted toward disadvantaged children have much higher returns than later interventions such as reduced pupil teacher ratios, public job training, convict rehabilitation programs, tuition subsidies, or expenditure on police. At current levels of resources, society overinvests in remedial skill investments at later ages and underinvests in the early years.” But that was a mistake. In fact, the data don’t seem to support the claim that human capital investments are most effective when targeted at younger ages, and Heckman appears to agree with this, to the extent of wanting to emphasize that his curve is not a proposition about how the average return on investment of programs differs by the age of recipients.

On the other hand, Heckman still has this up at his webpage:

The Heckman Curve

This graphic shows that the highest rate of economic returns comes from the earliest investments in children, providing an eye-opening understanding that society invests too much money on later development when it is often too late to provide great value. It shows the economic benefits of investing early and building skill upon skill to provide greater success to more children and greater productivity and reduce social spending for society.

This graphic is formatted for use on social media and insertion into presentations, handouts and press releases.


I’m not sure what to make of Heckman’s advice. On one hand, on his website he is clear that his eponymous curve shows that “The earlier the investment, the higher the return.”

But in his (now unpublished) response to Rea and Burton, he wrote:

None of this says that the benefit-cost ratios or internal rates of return are necessarily higher for all younger-age interventions. . . . The Curve is a technological frontier across programs (best practice) and not an average across all programs, however poorly executed . . . Policy makers need advice on best practice, not on average practice.

So we should modify the advice on his website to “In best practice, the earlier the investment, the higher the return.”

I’m concerned, though, because I think that the statistical methods used by Heckman and his collaborators cause them to overestimate effect sizes. Early-childhood intervention supposedly increasing adult earnings by 42%. Noisy studies with huge standard errors, thus the statistical significance filter yields high effect sizes. The upshot is that I have no idea what are the effects of these supposed best-practice interventions—and I think Heckman has no idea either. The main difference between Heckman and me here is that he’s expressing a lot more confidence than I am in those noisy estimates.

I have two problems with the statement, “Policy makers need advice on best practice, not on average practice.” First, we really don’t know what best practice is. These interventions that Heckman is so sure are best practice, maybe aren’t. Second, even if these interventions were best practice, to be scaled they’d have to be applied in the real world, i.e., average practice. It would seem naive to think that whatever particular interventions had been tried in some experiment many years ago could just be applied directly in new settings.

In any case, I agree with Rea that these are all good things to be discussing in the open. He and Heckman and Charles Murray and Aaron Edlin and Anna Dreber and Rachael Meager and anyone else are welcome to reply in comments.

This is your chance to comment on the U.S. government’s review of evidence on the effectiveness of home visiting. Comments are due by 1 Sept.

Emily Sama-Miller writes:

The federally sponsored Home Visiting Evidence of Effectiveness (HomVEE) systematic evidence review is seeking public comment on proposed updates to its standards and procedures. HomVEE reviews research literature on home visiting for families with pregnant women and children from birth to kindergarten entry, and its results are used to inform federal funding decisions. HomVEE is sponsored by the Office of Planning, Research, and Evaluation for the Administration for Children and Families (ACF) in the U.S. Department of Health and Human Services. HomVEE has released a draft Version 2 Handbook that describes updated procedures and standards for the systematic review. ACF is seeking public comment in response to two Federal Register notices that summarize key proposed updates, one about clarifying and updating procedures standards for rating research quality and one about defining and reviewing different versions of home visiting models. The full draft handbook has been published here. Comments are due by September 1, 2020, and may be submitted via email to

Good to see the government actively seeking public comment on this.

Himmicanes again

Gary Smith gives a clear non-technical explanation of why not to take that himmicanes study seriously. Further background here.

That “not a real doctor” thing . . . It’s kind of silly for people to think that going to medical school for a few years will give you the skills necessary to be able to evaluate research claims in medicine or anything else.

Paul Alper points us to this news article by Abby Phillip, “How a fake doctor made millions from ‘the Dr. Oz Effect’ and a bogus weight-loss supplement,” which begins:

When Lindsey Duncan appeared on “The Dr. Oz Show” in 2012, he was introduced as a “naturopathic doctor” and a certified nutritionist. . . . But Duncan wouldn’t necessarily know anything about the chatter in the medical community, because he is no doctor. In announcing a $9 million settlement with Duncan this week, the Federal Trade Commission more accurately labeled him a “marketer” . . .

The entire story is fascinating and horrifying (especially in light of recent news regarding who’s in charge of our public health system), but what I want to focus on is a particular aspect of this story, which is the idea that this guy is not a real doctor.

What this guy was doing on TV was promoting a bogus health supplement, “green coffee bean extract.” What he was nominally doing was providing health advice, sharing information about the effectiveness of this supplement for weight loss.

Here’s my point. Set aside all the fraud for a moment and suppose this person was presenting legitimate medical research. In what way would it be relevant that he was a doctor, a physician, a legitimate M.D., etc.? Doctors are trained to treat patients. They’re not trained to evaluate research claims. Sure, in medical school I guess they get some lectures on statistics or whatever, but an M.D. or similar degree is pretty much irrelevant to evaluation of claims of effectiveness of a health supplement. Really, a marketer would be just as good.

So, yeah, I’m glad that this guy has to cease and desist, but I feel like the whole not-a-real-doctor thing is entirely irrelevant. It’s relevant that he said he was a doctor, because that was a lie about his professional credentials, which is relevant to evaluating any other claims he might make in this capacity. But, to me, the bigger problems are that: (a) it seems that you need to be labeled as a doctor to have credibility when making medical claims, and (b) it seems that, once someone’s labeled as a doctor, his claims are taken seriously by default. (This is similar to the problem of the credulity rut that we discussed recently, with economists and journalists taking crappy regression discontinuity studies seriously by default.)

Dr. Oz

Abby Phillip’s article continues by slamming Dr. Oz, the surgeon and TV host:

Their moneymaking scheme, however, was only possible with the help of Mehmet Oz’s increasingly maligned self-help show. . . . During the taping, Dr. Oz nodded along with Duncan’s pseudoscience gibberish, according to a transcript that was included in court documents. . . . There was then, and there is now, no scientific evidence that green coffee bean extract promotes weight loss. . . . Between the taping and the time the show aired, Oz’s producers e-mailed Duncan to ask whether he had a preferred green coffee extract supplier. . . . Meanwhile, “The Dr. Oz Show” is still on the air, despite his association with numerous debunked weight-loss products. A recent study found that half of the medical advice he dispenses is baseless or wrong. . . .

Dr. Oz is, of course, a real doctor. But so what?

It’s kind of silly for people to think that going to medical school for a few years will give you the skills necessary to be able to evaluate research claims in medicine or anything else.

Lots of doctors can do this sort of thing—lots of them are indeed excellent at evaluating research claims—but if so, I don’t think it’s the medical training that’s doing it. The medical training, and their practice of medicine, gives them lots of relevant subject-matter knowledge—I’m not saying this is irrelevant—but subject-matter knowledge isn’t enough, and I think it’s a big mistake when media organizations act as if an M.D. is a necessary or a sufficient condition for evaluating research claims.

P.S. Dr. Oz works at Columbia University, just like me! And Columbia’s proud of “the straight-talking guy in the blue scrubs.” As the alumni magazine put it in their celebratory article, “That’s Healthfotainment!”

P.P.S. Just to clarify: I’m not saying that statisticians are better than doctors. Above I wrote that I don’t like the “not a real doctor” thing. I also wouldn’t like a “not a real statistician” thing. When evaluating medical research, I’d definitely like people with medical training to be involved. Here’s an example: I would never have been able to have done this project on my own.

So please don’t take the above post as anti-doctor. Not at all. An M.D. is neither a necessary nor a sufficient condition for evaluating research claims. The same goes for a Ph.D. in statistics or anything else.

Know your data, recode missing data codes

We had a class assignment where students had to graph some data of interest. A pair of students made the above graph, as a reminder that some data cleaning is often necessary. The students came up with the excellent title as well!

The U.S. high school math olympiad champions of the 1970s and 1980s: Where were they then?

George Berzsenyi writes:

Here is the last issue of the USAMO [math olympiad] Newsletter that was edited by Nura Turner and Tsz-Mei Ko along with a couple of additional summaries about the IMO participants. Concerning the Newsletter, I [Berzsenyi] just learned from Tsz-Mei Ko that it was the last issue.

At the time, I really wanted to do well in these competitions. In retrospect, I think it worked out best that I did ok but not great.

Probabilistic forecasts cause general misunderstanding. What to do about this?

The above image, taken from a site at the University of Virginia, illustrates a problem with political punditry: There’s a demand for predictions, and there’s no shortage of outlets promising a “crystal ball” or some other sort of certainty.

Along these lines, Elliott Morris points us to this very reasonable post, “Poll-Based Election Forecasts Will Always Struggle With Uncertainty,” by Natalie Jackson, who writes:

Humans generally do not like uncertainty. We like to think we can predict the future. That is why it is tempting to boil elections down to a simple set of numbers: The probability that Donald Trump or Joe Biden will win the election. Polls are a readily available, plentiful data source, and because we know that poll numbers correlate strongly with election outcomes as the election nears, it is enticing to use polls to create a model that estimates those probabilities.

Jackson concludes that “marketing probabilistic poll-based forecasts to the general public is at best a disservice to the audience, and at worst could impact voter turnout and outcomes.”

This is a concern to Elliott, Merlin, and me, given that we have a probabilistic poll-based forecast of the election! We’ve been concerned about election forecast uncertainty, but that hasn’t led us to take our forecast down.

Jackson continues:

[W]e do not really know how to measure all the different sources of uncertainty in any given poll. That’s particularly true of election polls that are trying to survey a population — the voters in a future election — that does not yet exist. Moreover, the sources of uncertainty shift with changes in polling methods. . . . In short, polling error is generally larger than the reported margin of error. . . . Perhaps the biggest source of unmeasurable error in election polls is identifying “likely voters,” the process by which pollsters try to figure out who will vote. The population of voters in the future election simply does not yet exist to be sampled, which means any approximation will come with unknown, unmeasurable (until after the election) errors.

We do account for nonsampling error in our model, so I’m not so worried about us understating polling uncertainty in general terms. But I do agree that ultimately we’re relying on pollsters’ decisions.

Jackson also discusses one of my favorite topics, the challenge of communicating uncertainty:

Most people don’t have a solid understanding of how probability works, and the models are thoroughly inaccessible for those not trained in statistics, no matter how hard writers try to explain it. . . . It is little wonder that research shows people are more likely to overestimate the certainty of an election outcome when given a probability than when shown poll results.

That last link is to a paper by Westwood, Messing, and Lelkes that got some pushback a few months ago when it appeared on the internet, with one well-known pundit saying (mistakenly, in my opinion) “none of the evidence in the paper supports their claims. It shouldn’t have been published.”

I looked into all this in March and wrote a long post on the paper and the criticisms of it. Westwood et al. made two claims:

1. If you give people a probabilistic forecast of the election, they will, on average, forecast a vote margin that is much more extreme than is reasonable.

2. Reporting probabilistic forecasts can depress voter turnout.

The evidence for point 1 seemed very strong. The evidence for point 2 was not so clear. But point 1 is important enough on its own.

Here’s what I wrote in March:

Consider a hypothetical forecast of 52% +/- 2%, which is the way they were reporting the polls back when I was young. This would’ve been reported as 52% with a margin of error of 4 percentage points (the margin of error is 2 standard errors), thus a “statistical dead heat” or something like that. But convert this to a normal distribution, you’ll get an 84% probability of a (popular vote) win.

You see the issue? It’s simple mathematics. A forecast that’s 1 standard error away from a tie, thus not “statistically distinguishable” under usual rules, corresponds to a very high 84% probability. I think the problem is not merely one of perception; it’s more fundamental than that. Even someone with a perfect understanding of probability has to wrestle with this uncertainty.

As is often the case, communication problems are real problems; they’re not just cosmetic.
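The 84% figure can be checked in a few lines. This is a minimal sketch that assumes the forecast vote share is normally distributed with mean 52% and standard error 2 percentage points, and that a win means a share above 50%. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def win_probability(forecast_share, standard_error, threshold=0.5):
    """Probability that a normally distributed vote share exceeds the threshold."""
    z = (forecast_share - threshold) / standard_error
    return normal_cdf(z)


print(round(win_probability(0.52, 0.02), 2))
```

A forecast one standard error above a tie sits at z = 1, and the standard normal puts about 84% of its mass below that point.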

Even given all this, Elliott, Merlin, and I are keeping our forecast up. Why? Simplest answer is that news orgs are going to be making probabilistic forecasts anyway, so we want to do a good job by accounting for all those sources of polling error that Jackson discusses.

Just one thing

Jackson’s article is on a site with the url, “”. The site is called “Sabato’s Crystal Ball.”

Don’t get me wrong: I think Jackson’s article is excellent. We publish our pieces where we can, and to publish an article does not imply an endorsement of the outlet where it appears. I just think it’s funny that a site called “Crystal Ball” decided to publish an article all about the problems with pundits overstating their certainty. In all seriousness, I suggest they take Jackson’s points to heart and rename their site.

Regression and Other Stories translated into Python!

Ravin Kumar writes in with some great news:

As readers of this blog likely know, Andrew Gelman, Jennifer Hill, and Aki Vehtari have recently published a new book, Regression and Other Stories. What readers likely don’t know is that there is an active effort to translate the code examples written in R and the rstanarm library to Python and the bambi library. The core ideas presented in the book transcend any specific programming language; more than code, they are insights into how to think about statistical problems, frame them in a structured way, and use mathematical concepts to solve them. Regardless of R or Python, the fundamentals of Bayesian modeling are the same, and if you’re fluent in Python that shouldn’t be a barrier to the extensive knowledge that is present in this book.

Reading this book you’ll also learn that the rstanarm library is quite good! It’s so good actually that there are many features we’d like to port to bambi. So if you’re fluent in Python we invite you to first read the book, and then come help out, both with translating the book and with adding features to bambi! Contributions are very welcome.

The library and the Python port are in the repositories below:

Coding and drawing

Some people like coding and they like drawing too. What do they have in common?

I like to code—I don’t looove it, but I like it ok and I do it a lot—but I find drawing to be very difficult. I can keep tinkering with my code to get it to look like whatever I want, but I feel like with drawing I have very little control. I have an idea in my mind of what I want the drawing to look like, but my pencil does not follow my orders.

Coding is digital and drawing is analog. Is it just that? I don’t think so. I think that if I were doing digital drawing, pixel by pixel or whatever, I’d still struggle.

On the other hand, I like drawing a lot and find it super useful if I’m drawing graphs of data or math patterns.

Here’s what I wrote a few years ago:

I was trying to draw Bert and Ernie the other day, and it was really difficult. I had pictures of them right next to me, but my drawings were just incredibly crude, more “linguistic” than “visual” in the sense that I was portraying key aspect of Bert and Ernie but in pictures that didn’t look anything like them. I knew that drawing was difficult—every once in awhile, I sit for an hour to draw a scene, and it’s always a lot of work to get it to look anything like what I’m seeing—but I didn’t realize it would be so hard to draw cartoon characters!

This got me to thinking about the students in my statistics classes. When I ask them to sketch a scatterplot of data, or to plot some function, they can never draw a realistic-looking picture. Their density functions don’t go to zero in the tails, the scatter in their scatterplots does not match their standard deviations, E(y|x) does not equal their regression line, and so forth. For example, when asked to draw a potential scatterplot of earnings vs. height, they have difficulty with the x-axis (most people are between 60 and 75 inches in height) and with keeping the data consistent with the regression line while having all earnings be nonnegative. (Yes, it’s better to model on the log scale or whatever, but that’s not the point of this exercise.)

Anyway, the students just can’t make these graphs look right, which has always frustrated me. But my Bert and Ernie experience suggests that I’m thinking of it the wrong way. Maybe they need lots and lots of practice before they can draw realistic functions and scatterplots.

When I do statistics-style drawings, I can make things look right, and accuracy really matters to me. But ask me to draw a cat or a dog or a house? Forget it. What’s going on here? More introspection and experimentation is needed. The point is, coding kind of is like drawing. There’s something going on here.

Getting all negative about so-called average power

Blake McShane writes:

The idea of retrospectively estimating the average power of a set of studies via meta-analysis has recently been gaining a ton of traction in psychology and medicine. This seems really bad for two reasons:

1. Proponents claim average power is a “replicability estimate” and that it estimates the rate of replicability “if the same studies were run again”. Estimation issues aside, average power obviously says nothing about replicability in any real sense that is meaningful for actual prospective replication studies. It perhaps only says something about replicability if we were able to replicate in the purely hypothetical repeated sampling sense and if we defined success in terms of statistical significance.

2. For the reason you point out in your Bababekov et al. Annals of Surgery letter, the power of a single study is not estimated well:
taking a noisy estimate and plugging it into a formula does not give us “the power”; it gives us a very noisy estimate of the power
Having more than one study in the average power case helps, but not much. For example, in the ideal case of k studies all with the same power, the mean of the k z-statistics is ~ N(z.true, 1/k), and mapping this estimate to power results in a very noisy distribution except for k large (roughly 60 in this ideal case). If you also try to adjust for publication bias as average power proponents do, the resulting distribution is much noisier and requires hundreds of studies for a precise estimate.

In sum, people are left with a noisy estimate that doesn’t mean what they think it means and that they do not realize is noisy!

With all this talk of negativity and bearing in mind the bullshit asymmetry principle, I wonder whether you would consider posting something on this or having a blog discussion or guest post or something along those lines. As Sander and Zad have discussed, it would be good to stop this one in its tracks fairly early on before it becomes more entrenched so as to avoid the dreaded bullshit asymmetry.

He also links to an article, “Average Power: A Cautionary Note,” with Ulf Böckenholt and Karsten Hansen, where they find that “point estimates of average power are too variable and inaccurate for use in application” and that “the width of interval estimates of average power depends on the corresponding point estimates; consequently, the width of an interval estimate of average power cannot serve as an independent measure of the precision of the point estimate.”
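
To get a feel for the noise McShane describes in the ideal case, here is a minimal simulation sketch. It assumes a two-sided z-test at the 0.05 level, a common true z of 1.96 (so the true power is about 0.5), and no publication bias. The function names and all of these settings are illustrative assumptions, not anything taken from the article.

```python
import math
import numpy as np

Z_CRIT = 1.959964  # two-sided 5% critical value (assumed test)

_erf = np.vectorize(math.erf)


def norm_cdf(x):
    """Standard normal CDF, vectorized over numpy arrays."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))


def power_from_z(z):
    """Power of a two-sided z-test when the expected z-statistic is z."""
    return norm_cdf(z - Z_CRIT) + norm_cdf(-z - Z_CRIT)


def power_estimate_interval(z_true, k, n_sims=20000, seed=0):
    """Central 95% range of the plug-in power estimate based on k studies."""
    rng = np.random.default_rng(seed)
    # Ideal case from the text: the mean z of k studies is N(z_true, 1/k).
    z_bar = rng.normal(z_true, 1.0 / math.sqrt(k), size=n_sims)
    low, high = np.quantile(power_from_z(z_bar), [0.025, 0.975])
    return float(low), float(high)


z_true = 1.96  # hypothetical common true z, chosen so power is about 0.5
true_power = float(power_from_z(z_true))
intervals = {k: power_estimate_interval(z_true, k) for k in (1, 5, 20, 60)}

print(f"true power: {true_power:.2f}")
for k, (low, high) in intervals.items():
    print(f"k={k:3d}: 95% of power estimates fall in ({low:.2f}, {high:.2f})")
```

Under these assumptions a single study gives a plug-in power estimate that can land almost anywhere between 0.05 and 1, and even at k = 60 the estimates still range over roughly 0.4 to 0.6 around a true value of 0.5. Adjusting for publication bias, which this sketch does not attempt, would only widen things.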

StanCon 2020 is on Thursday!

To all who registered for the conference, THANK YOU! We, the organizers, are truly moved by how global and inclusive the community has become.

We are currently at 230 registrants from 33 countries. And 25 scholarships were provided to people in 12 countries.

Please join us. Registration is $50. We have scholarships still available (more info on the registration page).


  • Videos for contributed talks and developer talks are online! Register now and you’ll be sent a password.
  • Our plenary speakers have all been confirmed (these will happen live at StanCon):
    • Seth Flaxman; Imperial College, London; “Hierarchical Models for Covid – identifying effects of lockdown and an R package”
    • Moriba Jah; Oden Institute for Computational Engineering & Sciences; “Multi-Source Information Modeling, Curation, and Fusion Enabling Transdisciplinary Decision-Making: A Case for Space!”
    • David Shor; “STAN and US Politics”
  • Thank you to our sponsors, Metrum Research Group and Jumping Rivers!
  • We’ve increased the number of scholarships and they are still available!

If you’re on the fence about whether to attend, we’ve done our best to bring out what makes StanCon special: the Stan Community. For a few hours, you get to spend time with Stan users and developers from around the world, sitting at tables discussing things that you’d discuss at StanCon.


Don’t say your data “reveal quantum nature of human judgments.” Be precise and say your data are “consistent with a quantum-inspired model of survey responses.” Yes, then your paper might not appear in PNAS, but you’ll feel better about yourself in the morning.

This one came up in a blog comment by Carlos; it’s an article from PNAS (yeah, I know) called “Context effects produced by question orders reveal quantum nature of human judgments.” From the abstract:

In recent years, quantum probability theory has been used to explain a range of seemingly irrational human decision-making behaviors. The quantum models generally outperform traditional models in fitting human data, but both modeling approaches require optimizing parameter values. However, quantum theory makes a universal, nonparametric prediction for differing outcomes when two successive questions (e.g., attitude judgments) are asked in different orders. Quite remarkably, this prediction was strongly upheld in 70 national surveys carried out over the last decade (and in two laboratory experiments) and is not one derivable by any known cognitive constraints.

This set off a bunch of alarm bells:

1. “Universal, nonparametric prediction”: I’m always suspicious of claims of universality in psychology.

2. “Quite remarkably, this prediction was strongly upheld in 70 national surveys”: Quite remarkably, indeed. This just seems a bit too good to be true.

3. And the big thing . . . how can quantum theory make a prediction about survey responses? Quantum theory is about little particles and, indirectly, about big things made from little particles. For example, quantum theory explains, in some sense, the existence of rigid bodies such as tables, chairs, and billiard balls.

From reading the paper, it’s my impression that they’re not talking about quantum theory, as it’s usually understood in physics, at all. Rather, they’re talking about a statistical model for survey responses, a model which is inspired by analogy to certain rules of quantum mechanics. That’s fine—I’m on record as offering tentative support to this general line of research—I just want to be clear on what we’re talking about. I think it might be clearer to call these “quantum-inspired statistical models” rather than “quantum probability theory.”

As for the model itself: I took a quick look and it seems like it could make sense. It’s a latent-variable multidimensional model of attitudes, with the twist that whatever question was asked before could affect the salience of the different dimensions. The model makes a particular prediction which they call the QQ equality and which they claim is supported in their 70 surveys. I did not look at that evidential claim in detail. One thing that confuses me is why they are treating this QQ equality as evidence for their particular quantum-inspired model. Wouldn’t it be evidence for any model, quantum-inspired or otherwise, that makes this particular prediction?

It’s not clear to me that the quantum-inspired nature of the model is what is relevant here, so I think the title of the paper is misleading.

Here it is again:

Context effects produced by question orders reveal quantum nature of human judgments

I think a more accurate title would be:

Context effects produced by question orders are consistent with a quantum-inspired model of survey responses

Here are the explanations for my corrections:

1. Changed “reveal” to “are consistent with” because the data are, at best, consistent with a particular model. This is not the same as revealing some aspect of nature.

2. Changed “quantum nature” to “quantum-inspired model” because, as discussed above, it’s not a quantum model, it’s only quantum-inspired; also, it’s just a particular model, it’s not a property of nature. If I were to fit a logistic regression to some test questions—that’s standard practice in psychometrics, it’s called the Rasch model—and the model were to fit the data well, it would not be correct for me to say that I’ve revealed the logistic nature of test taking.

3. Changed “human judgments” to “survey responses” because there’s nothing in the data about judgments; it’s all survey responses. It would be ok with me if they wanted to say “attitudes” instead. But “judgments” doesn’t seem quite right.

Anyway, there might be something there. Too bad about all the hype. I guess the justification for the hype is that, without the hype, the paper probably wouldn’t’ve been published in a tabloid; and without the tabloid credentials, maybe our blog readers would never have heard about this work, and then we wouldn’t’ve heard about it either.

Varimax: Sure, it’s always worked but now there’s maths!

Some day you will be loved – Death Cab for Cutie

Here is a paper that I just read that I really liked: Karl Rohe and Muzhe Zeng’s Vintage Factor Analysis with Varimax Performs Statistical Inference.

(Does anyone else get a Simon Smith and His Amazing Dancing Bear vibe off that title? No? Just me? Fine.)

This paper is in the fine tradition of papers that examine a method that’s been around for ages and everyone (maybe in a particular discipline) knows to work well, but no one has ever cracked why. Almost universally, it turns out that these methods do actually work¹. But it usually takes someone to hold them up to the light in just the right way to see why.

And that’s what this paper does.

The basic idea is that you can often rotate principal components in order to make the loading vectors sparse. This typically leads to more interpretable principal components and, therefore, better exploratory data analysis.

The black fly in this particular glass of Chardonnay² is that in the standard working model for PCA (namely that everything is independent and Gaussian) we end up with an awkward rotational invariance that doesn’t exactly scream “Hey! Rotate me! I’ll get better! I promise!”.

But nonetheless the Varimax rotations do actually seem to help most of the time and so they’ve been comfortably used for the last 50 or so years. It’s even in the base stats package in R!

So what’s the deal, daddy-o? It turns out that the whole method works as long as the entries in the principal component matrix are sufficiently non-Gaussian³. Figure 1 from their paper (shown below) shows this clearly. It turns out that when the principal components have heavy tails you see “radial streaks” and then there is a good rotation—the one that lines the axes up with these streaks!

Checking for these streaks was part of the workflow for using Varimax rotations proposed by Thurstone back in 1947 and this paper does a good job of using it as a jumping off point for building a theory of why these rotations work. (Side note: I really love how rooted in history this paper is while being extremely modern. It’s always nice to see a paper big-up the contributions of people who came before them while also doing amazing work.)

Anyway. You should all read the paper. It’s super sweet. Basically it shows that if your PCA is done on an appropriately centred and scaled matrix, then under some assumptions PCA + Varimax (which they call Vintage Factor Analysis) fits the model

\mathbb{E}(X\mid Y,Z)=ZBY^T,

where X is the n\times p matrix of covariates and the PCA+Varimax returns the matrix Z. (The B and Y are just in there for fun. Don’t stress about them. I promise the Z is the one we want.)

This basically means that while PCA will give you some n\times k matrix U such that X^TX=U^TU, Vintage Factor Analysis can actually recover (under conditions!) the original matrices Z and Y used to generate the data.

The key condition is that the entries in the true Z come from a heavy-tailed distribution (each column is iid from the same distribution, different columns can come from different distributions as long as they’re independent). This causes the radial streaks seen in Figure 1 and gives a preferred orientation that the Varimax rotation can find.

Now just because this method corresponds to a model doesn’t mean we believe that model. Actually we probably don’t. Unlike some other great homes of working models (like linear regression), this type of factor analysis is typically exploratory. So we don’t actually need to believe the assumptions for it to be useful.

This paper justifies the diagnostic checks of Thurstone (1947) and gives a plausible reason why this PCA+Varimax method can actually produce good, interpretable results.
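
As a rough illustration of the claim, here is a small simulation sketch. Everything specific in it is an assumption made for the example: Laplace-distributed entries for Z (heavy-tailed, independent columns), a Gaussian Y, a diagonal B, the sizes, and the noise level. The `varimax` helper is a plain numpy version of the usual iterative SVD algorithm, not the paper’s code and not R’s `varimax`.

```python
import numpy as np


def varimax(A, max_iter=500, tol=1e-10):
    """Rotate the columns of A to maximize the Varimax criterion."""
    n, k = A.shape
    R = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        L = A @ R
        # Gradient of the criterion; the SVD step maps it back to a rotation.
        G = A.T @ (L**3 - L * (np.sum(L**2, axis=0) / n))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt
        new_objective = s.sum()
        if objective > 0 and new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return A @ R


def recovery(Z_true, Z_est):
    """Worst-case best |correlation| between true and estimated columns."""
    k = Z_true.shape[1]
    C = np.abs(np.corrcoef(Z_true.T, Z_est.T)[:k, k:])
    return float(C.max(axis=1).min())


# Fake data from E(X | Y, Z) = Z B Y^T with heavy-tailed Z (all assumed).
rng = np.random.default_rng(0)
n, p, k = 5000, 30, 3
Z = rng.laplace(size=(n, k))
Y = rng.normal(size=(p, k))
B = np.diag([3.0, 2.0, 1.0])
X = Z @ B @ Y.T + 0.5 * rng.normal(size=(n, p))

# PCA via the SVD of the centred matrix, then rotate the top-k scores.
Xc = X - X.mean(axis=0)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = np.sqrt(n) * U[:, :k]
Z_hat = varimax(scores)

pca_recovery = recovery(Z, scores)
varimax_recovery = recovery(Z, Z_hat)
print(f"PCA only:      {pca_recovery:.3f}")
print(f"PCA + Varimax: {varimax_recovery:.3f}")
```

The `recovery` score ignores column order and sign, since the rotation can only pin Z down up to those. If you swap the Laplace draws for Gaussian ones, the streaks disappear and there is no preferred rotation left for Varimax to find.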

On a personal note, my very first stats paper involved fitting a Bayesian factor model (Chris did all of the hard stats stuff because I was terrified and knew nothing. I provided an algorithm.). And I’ve got to say I’d much rather just do a PCA.

1 Yes. I know. Just because a method is persistent doesn’t make it valid. This blog is full of examples. But this is a slightly different thing. This is not a method that is mathematically well understood (like using p-values to do inference) where either the mathematical framework doesn’t match with reality (like using p-values to do inference) or they aren’t being used correctly (like using p-values to do inference). This is instead a method that works but does not have a well-developed mathematical theory. Those types of things typically do end up working provably, even though it might take a while for theory to catch up to the method. Fun fact: one of my very early papers did this for a much more obscure algorithm.


3 This is also the thinking behind Independent Component Analysis. The paper discusses the links.

Updates of bad forecasts: Let’s follow them up and see what happened!

People make bad forecasts, then they move on. Do the forecasts ever get fixed? Do experts learn from their mistakes?

Let’s look at three examples.

1. The economist who kept thinking that the Soviets were catching up

Paul Samuelson:


Yes, the above graph was from 1961, but “in subsequent editions Samuelson presented the same analysis again and again except the overtaking time was always pushed further into the future so by 1980 the dates were 2002 to 2012. In subsequent editions, Samuelson provided no acknowledgment of his past failure to predict and little commentary beyond remarks about ‘bad weather’ in the Soviet Union.” As late as 1989, the celebrated economist wrote, “The Soviet economy is proof that, contrary to what many skeptics had earlier believed, a socialist command economy can function and even thrive.”

This is not a left/right thing. Those of you who followed politics in the 1970s and the 1980s know that the supposed economic might of the Soviet Union was pushed by those on the left (who wanted us to imitate that system) and by those on the right (who feared we’d be overcome by it).

Anyway, for the purpose of today’s discussion, the point is that for twenty-eight years Samuelson neither corrected his error nor learned from it.

2. The contrarians who like to talk about global cooling

Steven Levitt and Stephen Dubner:

A Headline That Will Make Global-Warming Activists Apoplectic . . . It is curious that the global-warming arena is so rife with shrillness and ridicule. Where does this shrillness come from?

Yes, it was back in 2009 that the celebrated freakonomists promoted a claim that “The PDO cool mode has replaced the warm mode in the Pacific Ocean, virtually assuring us of about 30 years of global cooling”—but I don’t know that they’ve ever said a Whoops on that one, despite repeated news items such as the graph shown above.

Did Levitt and Dubner learn from this not to reflexively trust contrarian takes on important scientific issues? I don’t know. I did some searching and found this interview from 2015 where Levitt said,

I tell you what we were guilty of . . . We made fun of the environmentalists for getting upset about some other problem that turned out not to be true. But we didn’t do it with enough reverence, or enough shame and guilt. And I think we pointed out that it’s completely totally and actually much more religion than science. I mean what are you going to do about that? I think that’s just a fact.

So, as of 2015, they still seemed to have missed the point that you can learn from your mistakes.

3. Repeatedly biased forecasts from the U.S. Department of Transportation


The above graph was made in 2014.

So here’s my question: Has the Department of Transportation cleaned up their act? How are those projections going? I think the key problem is not the bad forecast, it’s the continuing to make the bad forecast even after it’s been destroyed by data.

We all make mistakes. The only way to avoid making mistakes is to do nothing. But we have the responsibility to learn from our mistakes. Science may be self-correcting, but it doesn’t self-correct on its own.

Some things do not seem to spread easily – the role of simulation in statistical practice and perhaps theory.

Unlike Covid-19, some things don’t seem to spread easily, and the role of simulation in statistical practice (and perhaps theory) may well be one of those.

In a recent comment, Andrew provided a link to an interview about the new book Regression and Other Stories by Aki Vehtari, Andrew Gelman, and Jennifer Hill. The interview covered many aspects of the book, but the comments on the role of fake data simulation caught my interest the most.

In fact, I was surprised by the comments, in that the recommended role of simulation seemed much more substantial than I would have expected from participating on this blog. For at least the last 10 years I have been promoting the use of simulation in teaching and statistical practice, with seemingly little uptake from other statisticians. For instance, see my intro to Bayes seminar and some recent material here (downloadable HTML from Google Drive).

My sense was that those who eat, drink and dream in mathematics [edit] see simulation as awkward and tedious. Maybe that’s just me, but statisticians have published comments very similar to this [edit]. But Aki, Andrew and Jennifer seem to increasingly disagree.

For instance, at 29:30 in the interview there are about 3 minutes from Andrew arguing that all of statistical theory is a kind of shortcut to fake data simulation, and that you don’t need to know any statistical theory as long as you are willing to do fake data simulation on everything. However, it is hard work to do fake data simulation well [building a credible fake world and how that is sampled from]. Soon after, Aki commented that it is only with fake data simulation that you have access to the truth in addition to data estimates. That to me is the most important aspect – you know the truth.

Also, at 49:25 Jennifer disclosed that she recently changed her teaching to be based largely on fake data simulation, and is finding that having the students construct the fake world and understand how the analysis works there provides a better educational experience.

Now, in a short email exchange Andrew did let me know that the role of simulation increased as they worked on the book, and Jennifer let me know that there are simulation exercises in the causal inference topics.

I think the vocabulary they and others have developed (fake data, fake world, Bayesian reference set generated by sampling from the prior, etc.) will help more people see why statistical theory is a kind of shortcut to simulation. I especially like this vocabulary and recently switched from fake universe to fake world in my own work.

However, when I initially tried using simulation in webinars and seminars, many did not seem to get it at all.

p.s. When I did this post I wanted to keep it short and mainly call attention to Aki, Andrew and Jennifer’s (to me increasingly important) views on simulation. The topic is complicated, more so than I believe most people appreciate, and I wanted to avoid a long complicated post. I anticipated doing many posts over the coming months and comments seem to support that.

However, Phil points out that I did not define what I meant by “fake data simulation” and I admittedly had assumed readers would be familiar with what Aki, Andrew and Jennifer meant by it (as well as myself). To _me_ it is simply drawing pseudo-random numbers from a probability model. The “fake data” label emphasizes that it’s an abstraction, used abstractly to represent haphazardly varying observations and unknowns. This does not exclude any Monte-Carlo simulation but just emphasizes one way it could be profitably used.

For instance, in the simple bootstrap, real data is used but the re-sampling draws are fake (abstract) and are being used to represent possible future samples. So here the probability model is discrete, with support only on the observations in hand and with probabilities implicitly defined by the re-sampling rules. So there is a probability model and simulation is being done. (However, I would call it a degenerate probability model given the loss of flexibility in choices.)
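
To make both ideas concrete, here is a minimal sketch. The fake world is a simple linear regression with made-up true values, and the names `simulate_fake_data`, `fit_slope` and `bootstrap_slopes` are hypothetical helpers written for this illustration. The first part checks an analysis against a truth we chose ourselves; the second part treats one dataset as the whole probability model, with probability 1/n on each observation.

```python
import numpy as np


def simulate_fake_data(n, a, b, sigma, rng):
    """Draw one dataset from the fake world y = a + b*x + noise."""
    x = rng.uniform(0, 10, size=n)
    y = a + b * x + rng.normal(0, sigma, size=n)
    return x, y


def fit_slope(x, y):
    """Least-squares slope and its classical standard error."""
    X = np.column_stack([np.ones_like(x), x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma_hat = np.sqrt(resid @ resid / (len(y) - 2))
    se = sigma_hat * np.sqrt(np.linalg.inv(X.T @ X)[1, 1])
    return coef[1], se


def bootstrap_slopes(x, y, n_boot, rng):
    """Simple bootstrap: resample (x, y) pairs with replacement."""
    n = len(y)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # probability 1/n per observation
        slopes[i] = fit_slope(x[idx], y[idx])[0]
    return slopes


rng = np.random.default_rng(1)
TRUE_A, TRUE_B, TRUE_SIGMA = 1.0, 0.5, 2.0  # the truth, known by construction

# Part 1: fake data simulation. Does the +/- 2 se interval cover the truth?
n_sims = 1000
covered = 0
for _ in range(n_sims):
    x, y = simulate_fake_data(50, TRUE_A, TRUE_B, TRUE_SIGMA, rng)
    b_hat, se = fit_slope(x, y)
    covered += int(abs(b_hat - TRUE_B) < 2 * se)
coverage = covered / n_sims
print(f"coverage of +/- 2 se intervals: {coverage:.3f}")

# Part 2: the bootstrap, with one fake dataset standing in for real data.
x_obs, y_obs = simulate_fake_data(50, TRUE_A, TRUE_B, TRUE_SIGMA, rng)
b_obs, se_obs = fit_slope(x_obs, y_obs)
boot_se = float(bootstrap_slopes(x_obs, y_obs, 1000, rng).std(ddof=1))
print(f"classical se: {se_obs:.3f}, bootstrap se: {boot_se:.3f}")
```

In part 1 the coverage can be checked only because the true slope is known, which is exactly Aki’s point. In part 2 there is no such truth to compare against; the resampling draws stand in for possible future samples from a model supported only on the data in hand.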


Kafka comes to the visa office

Paul Alper points us to this news article by Catherine Rampell about “a Kafkaesque new processing policy for select categories of visas”:

If any fields on a form are left blank, it will automatically be rejected. Even if it makes no sense for the applicant to fill out that field. For example, if “Apt. Number” is left blank because the immigrant lives in a house: rejected. Or if the field for a middle name is left blank because no middle name exists: rejected, too. . . .

It’s hard not to see this as a preposterous new layer of red tape designed to deny visas to legally eligible applicants . . .

The policy change, at first affecting just asylum applicants, was announced without fanfare on the USCIS website sometime in the fall. “We will not accept your [application] if you leave any fields blank,” reads a note you wouldn’t know existed unless someone told you where to find it. “You must provide a response to all questions on the form, even if the response is ‘none,’ ‘unknown’ or ‘n/a.’ ”

Then, days before the New Year, USCIS added a similar notice for U-visa applications. In both cases the processing changes were effective immediately — even if documents had been mailed in before the policy was announced.

That’s the truly Kafkaesque touch.

Rampell continues with the story of the rejected visa applicant:

To be clear, the absence of a son’s middle name wasn’t the only blank on her application. As many attorneys told me has always been common practice, she also left other fields unfilled if they didn’t apply.

For example, she checked the boxes saying each of her sons is “single.” A subsequent section says: “If your family member was previously married, list the names of your family member’s prior spouses and the dates his or her marriages were terminated.” Because no “prior spouses” exist, she didn’t enter anything; USCIS cited this, too, among the reasons for rejection. . . .

The American Immigration Lawyers Association has collected 140 other examples of allegedly “incomplete” forms: an 8-year-old child who listed “none” for employment history but left the dates of employment field blank. An applicant who entered names of three siblings, but the form has spaces for four. . . .

My dad had no middle initial. He said that in the army everyone had to have a middle initial, so he was Robert N.M.I. Gelman on all the official forms. They didn’t deport him, though; fortunately his parents came to the United States many years before the restrictive immigration law.


It’s a busy day for Bayesians.

John Haman writes:

The Institute for Defense Analyses – Operational Evaluation Division (OED) is looking for a Bayesian statistician to join its Test Science team.  Test Science is a group of statisticians, data scientists, and psychologists that provides expertise on experimentation to the DoD.

In particular, we are looking for a Bayesian statistician to help our naval warfare group use the results from past test events to inform the design and analysis of future test events. Candidates will also need to have a background in experimental design, strong public speaking and writing skills, and be able to work well on group projects.

US citizenship required.

The job ad is here:

More info about IDA and Test Science:

Contact Heather Wojton with any questions.

And Macartan Humphreys writes:

Possibly of interest for statisticians / social scientists looking for a post doc we have an opportunity now at WZB Berlin that should be interesting. We are working with 7 teams implementing network scale up, RDS and other methods side by side in six countries to assess relative performance for measuring the prevalence of hidden populations (human trafficking).

We are focused on the meta-analytic part of this.

If you know good people it’s a good place to be and a really interesting topic.

“Statistical Models of Election Outcomes”: My talk this evening at the University of Michigan

At the Inter-university Consortium for Political and Social Research this evening:

Statistical Models of Election Outcomes

We will discuss various political and statistical aspects of election forecasts:
– How accurately can elections be forecast?
– What information is useful in forecasting elections?
– What sorts of elections are less predictable?
– To the extent that elections are predictable, what does this tell us about the effectiveness of campaigning?
– What are the effects of X on elections, where X is ballot-order effects, Fox News, home-state advantage, or other things?
– How do we account for systematic errors in polling?
– What happened in 2016?
– How does our forecasting model for The Economist work?