
“The Multiverse of Methods: Extending the Multiverse Analysis to Address Data-Collection Decisions”

Jenna Harder writes:

When analyzing data, researchers may have multiple reasonable options for the many decisions they must make about the data—for example, how to code a variable or which participants to exclude. Therefore, there exists a multiverse of possible data sets. A classic multiverse analysis involves performing a given analysis on every potential data set in this multiverse to examine how each data decision affects the results. However, a limitation of the multiverse analysis is that it addresses only data cleaning and analytic decisions, yet researcher decisions that affect results also happen at the data-collection stage. I propose an adaptation of the multiverse method in which the multiverse of data sets is composed of real data sets from studies varying in data-collection methods of interest. I walk through an example analysis applying the approach to 19 studies on shooting decisions to demonstrate the usefulness of this approach and conclude with a further discussion of the limitations and applications of this method.

I like this because of the term “classic multiverse analysis.” It’s fun to be a classic!

In all seriousness, I like the multiverse idea, and ideally it should be thought of as a step toward a multilevel model, in the same way that the secret weapon is both a visualization tool and an implicit multilevel model.
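For readers who haven't seen one, a classic multiverse analysis can be sketched in a few lines of code: enumerate every combination of data decisions and rerun the same analysis on each resulting data set. Everything below (the data, the exclusion rules, the coding choices) is invented for illustration:

```python
# Toy sketch of a "classic" multiverse analysis: run the same analysis
# over every combination of data-cleaning decisions. All decisions and
# data here are hypothetical.
from itertools import product
from statistics import mean

data = [{"y": y, "age": a} for y, a in
        zip([2.1, 3.5, 1.9, 4.2, 2.8, 3.1, 5.0, 2.2],
            [19, 23, 34, 45, 52, 61, 17, 70])]

# Two decision points, each with two reasonable options.
exclusion_rules = {
    "keep_all": lambda d: True,
    "adults_under_65": lambda d: 18 <= d["age"] < 65,
}
codings = {
    "raw": lambda d: d["y"],
    "dichotomized": lambda d: 1.0 if d["y"] > 3 else 0.0,
}

multiverse = {}
for (ex_name, ex), (code_name, code) in product(exclusion_rules.items(),
                                                codings.items()):
    subset = [code(d) for d in data if ex(d)]
    multiverse[(ex_name, code_name)] = mean(subset)  # the "analysis"

for decisions, estimate in sorted(multiverse.items()):
    print(decisions, round(estimate, 3))
```

The multilevel-model connection is that instead of reporting these four estimates separately, one could model them as draws from a common distribution and partially pool them.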

I’m also glad that Perspectives on Psychological Science has decided to publish research again, and not just publish lies about people. That’s cool. I guess an apology will never be coming, but at least they’ve moved on.

Is it really true that “the U.S. death rate in 2020 was the highest above normal since the early 1900s—even surpassing the calamity of the 1918 flu pandemic”?

tl;dr. No, it’s not true. The death rate increased by 15% from 2019 to 2020, but it jumped by 40% from 1917 to 1918.

But, if so, why would anyone claim differently? Therein lies a tale.

A commenter pointed to a news article with the above graphs and the following claim:

The U.S. death rate in 2020 was the highest above normal since the early 1900s — even surpassing the calamity of the 1918 flu pandemic. . . .

In the first half of the 20th century, deaths were mainly dominated by infectious diseases. As medical advancements increased life expectancy, death rates also started to smooth out in the 1950s, and the mortality rate in recent decades — driven largely by chronic diseases — had continued to decline.

In 2020, however, the United States saw the largest single-year surge in the death rate since federal statistics became available. The rate increased 16 percent from 2019, even more than the 12 percent jump during the 1918 flu pandemic.

Our commenter wrote:

If one takes the “normal” death rate to be that of the year prior to a pandemic and one assumes that the total population doesn’t change all that much from one year to the next, then this sub-headline seems to be seriously incorrect. If one eyeballs the “Total deaths in the U.S. over time” chart in the article and then compares the jumps due to the 1918 pandemic and the 2020 pandemic, it seems pretty clear that the percentage increase in the number of deaths (and thus the death rate, assuming a roughly constant population) from 1917 to 1918 is much greater than the percentage increase from 2019 to 2020. The jump from 1917 to 1918 looks to be around 40% while the jump from 2019 to 2020 looks to be around 15% (based on measurements of a screenshot of the graph using Photoshop’s ruler tool).

I was curious so I took a look. The above graphs include a time series of deaths per 100,000 and a time series of total deaths. (As an aside, I don’t know why they give deaths per 100,000, which is a scale that we have little intuition on. It seems to me that a death rate of 2.6% is more interpretable than a death rate of 2600 per 100,000.) Here’s what they have for 1917 and 1918 (I’m reading roughly off the graphs here):

1917: 2300 deaths per 100,000 and a total of 1 million deaths
1918: 2600 deaths per 100,000 and a total of 1.4 million deaths.

This is an increase of 13% in the rate but an increase of 40% in the total. But I looked up the U.S. population and it seems to have been roughly constant between 1917 and 1918, so the numbers above can’t all be correct!
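A quick back-of-the-envelope check of those graph-derived numbers (the values below are the rough readings from this post, not official statistics):

```python
# Checking the numbers read off the article's graphs (rounded values,
# as eyeballed above). With a roughly constant population, the
# percentage change in total deaths and in the death rate should match.
rate_1917, rate_1918 = 2300, 2600          # deaths per 100,000
total_1917, total_1918 = 1.0e6, 1.4e6      # total deaths

rate_increase = (rate_1918 - rate_1917) / rate_1917      # ~13%
total_increase = (total_1918 - total_1917) / total_1917  # 40%
print(f"rate: +{rate_increase:.0%}, total: +{total_increase:.0%}")

# Implied crude death rate from the totals, using the Wikipedia population:
population = 103e6
crude_rate_1917 = total_1917 / population  # ~1%, not the 2.3% on the chart
print(f"crude 1917 rate: {crude_rate_1917:.1%}")
```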

According to wikipedia, the U.S. population was 103 million in 1917 and 1918. 1 million deaths divided by 103 million people is 1%, not 2.3%. So I’m not quite sure what is meant by “death rate” in that article.

The problem also arises in other years. For example, the article says that 3.4 million Americans died in 2020. Our population is 330 million, so, again, that’s a death rate of about 1%. But the 2020 death rate in their “Death rate in the U.S. over time” chart is less than 1%.

I’m guessing that their death rate graph is some sort of age-adjusted death rate . . . ummmm, yeah, ok, I see it at the bottom of the page:

Death rates are age-adjusted by the C.D.C. using the 2000 standard population.

Compared to 1918, the 2000 population has a lot of old people. So the age-adjusted death rate overweights the olds (compared to 1918) and slightly underweights the olds (compared to 2020). The big picture here is that it makes 1918 look not so bad because the 1918 flu was killing lots of young people.
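Here’s a minimal numeric sketch of how age adjustment can do this. All rates and population shares below are invented; the point is just the mechanics of weighting age-specific rates by a standard population’s age shares:

```python
# Toy illustration of age adjustment (all numbers hypothetical). A flu
# that kills mostly the young looks milder when rates are standardized
# to an old-skewed population, because the young get little weight.

# Hypothetical age-specific death rates (per person) in a 1918-like flu:
flu_rates = {"young": 0.020, "old": 0.010}

# Population age shares: young-skewed (like 1918) vs. old-skewed
# (like the CDC's 2000 standard population).
shares_1918 = {"young": 0.80, "old": 0.20}
shares_2000 = {"young": 0.60, "old": 0.40}

def standardized_rate(rates, shares):
    """Weighted average of age-specific rates, weights = standard shares."""
    return sum(rates[g] * shares[g] for g in rates)

crude = standardized_rate(flu_rates, shares_1918)     # what 1918 experienced
adjusted = standardized_rate(flu_rates, shares_2000)  # what the chart shows
print(f"crude: {crude:.4f}, age-adjusted to 2000: {adjusted:.4f}")
```

Under these made-up numbers the age-adjusted rate comes out below the crude rate, which is the direction of the distortion described above.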

Also, one other thing. The note at the bottom of the article says, “Expected rates for each year are calculated using a simple linear regression based on rates from the previous five years.” One reason why 1918 is not more “above normal” than it is, is that there happens to be an existing upward trend during the five years preceding 1918, so the implicit model would predict a further increase even in the absence of the flu. I’m not quite sure how to think about that.
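The “expected rate” construction is easy to sketch. Using made-up pre-1918 rates with an upward trend, a least-squares line fit to the previous five years extrapolates to a 1918 expectation above the 1917 value, which shrinks the measured excess:

```python
# Sketch of the article's "expected rate" rule: fit a simple linear
# regression to the previous five years and extrapolate one year ahead.
# The rates here are made up; the point is that an upward pre-1918
# trend raises the 1918 "expected" value, shrinking the excess.

years = [1913, 1914, 1915, 1916, 1917]
rates = [2150, 2200, 2250, 2250, 2300]  # hypothetical deaths per 100,000

# Closed-form least squares for rate = intercept + slope * year.
n = len(years)
xbar, ybar = sum(years) / n, sum(rates) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(years, rates))
         / sum((x - xbar) ** 2 for x in years))
intercept = ybar - slope * xbar

expected_1918 = intercept + slope * 1918
print(f"expected 1918 rate: {expected_1918:.0f}")  # above the 1917 value
```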

Anyway, the answer to the question in the title of this post is No.

Age adjustment can be tricky!

P.S. This is just a statistical mistake, but I wonder if there’s a political component too. There seems to be a debate about whether coronavirus is a big deal or not, epidemiologically speaking. I think coronavirus is a big deal: an increase of 15% in the death rate is a lot! But for some people, that’s not enough; it has to be the biggest deal of all time, or at least bigger than the 1918 flu. Hence you get this sort of headline. I have no reason to think this is deliberate political manipulation; rather, it’s just that when people make a mistake that yields a result that aligns with their preconceptions, they don’t always notice.

Or maybe I just made a mistake and I’m misunderstanding everything here. Could be; it’s happened to me before.

Open data and quality: two orthogonal factors of a study

It’s good for a study to have open data, and it’s good for the study to be high quality.

If for simplicity we dichotomize these variables, we can find lots of examples in all four quadrants:

– Unavailable data, low quality: The notorious ESP paper from 2011 and tons of papers published during that era in Psychological Science.

– Open data, low quality: Junk science based on public data, for example the beauty-and-sex-ratio paper that used data from the Adolescent Health survey.

– Unavailable data, high quality: It happens. For reasons of confidentiality or trade secrets, raw data can’t be shared. An example from our own work was our study of the NYPD’s stop and frisk policy.

– Open data, high quality: We see this sometimes! Ideally, open data provide an incentive for a study to be higher quality and also can enable high-quality analysis by outsiders.

I was thinking about this after reading this blog comment:

There is plenty to criticize about that study, but at least they put their analytic results in a table to make it easy on the reader.

Open data and better communication are a good thing. Also, honesty and transparency are not enough. Now I’m thinking the best way to conceptualize this is to consider openness and the quality of a study as two orthogonal factors.

This implies two things:

1. Just because a study is open, it doesn’t mean that it’s of high quality. A study can be open and still be crap.

2. Open data and good communication are a plus, no matter what. Open data make a good study better, and open data make a bad study potentially salvageable, or at least can make it more clear that the bad study is bad. And that’s good.

“Analysis challenges slew of studies claiming ocean acidification alters fish behavior”

Lizzie Wolkovich writes:

Here’s an interesting new paper in climate change ecology that states, “Using data simulations, we additionally show that the large effect sizes and small within-group variances that have been reported in several previous studies are highly improbable.”

I [Lizzie] wish I were more surprised, but mostly I was impressed they did the expensive work to try to replicate these other studies and somehow managed to get it published, given that “[t]he research community in the field of ocean acidification and coral reef fish behaviour has remained small.”

And here’s a news report with more background.

I’ve heard from some very important Harvard professors that the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%. I guess it’s a bit lower in other fields. We should all aspire to the greatness that is Ivy League psychology.

“Prediction Markets in a Polarized Society”

Rajiv Sethi writes about some weird things in election prediction markets, such as Donald Trump being given a one-in-eight chance of being the election winner . . . weeks after he’d lost the election.

Sethi writes:

There’s a position size limit of $850 per contract in this market, which also happens to have hit a limit on the total number of participants. But dozens of other contracts are available that reference essentially the same outcome, and offer about the same prices. . . .

What prediction markets are revealing to us today is the extent of the chasm in beliefs held by Americans. . . . People actively seek information that largely confirms their existing beliefs, and social media platforms accommodate and intensify this demand. Prediction markets play an interesting and unusual role in this environment. They encourage people with opposing worldviews to interact with each other in anonymous, credible, and non-violent ways. In a sense, they are the opposite of echo chambers. A market with homogeneous beliefs would have no trading volume, or would attract those with different opinions who are drawn by what they perceive to be mispriced contracts.

While most online platforms facilitate and deepen ideological segregation, prediction markets do exactly the opposite. They provide monetary reinforcement to those who get it right, and force others to question their assumptions and predispositions. While best known as mechanisms for generating forecasts through the wisdom of crowds, they also bring opposing worldviews into direct and consequential contact with each other. This is a useful function in an increasingly segregated digital ecosystem.

I dunno, seems kind of optimistic to me! We discuss prediction markets in section 2.6 of this article but come to no firm conclusions.

EU proposing to regulate the use of Bayesian estimation

The European Commission just released their Proposal for a Regulation on a European approach for Artificial Intelligence. They finally get around to a definition of “AI” on page 60 of the report (link above):

‘artificial intelligence system’ (AI system) means software that is developed with one or more of the techniques and approaches listed in Annex I and can, for a given set of human-defined objectives, generate outputs such as content, predictions, recommendations, or decisions influencing the environments they interact with

We don’t even have to wonder if they mean us. They do. Here’s the full text of Annex I from page 1 of “Laying down harmonized rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts” (same link, different doc).

referred to in Article 3, point 1

(a) Machine learning approaches, including supervised, unsupervised and reinforcement learning, using a wide variety of methods including deep learning;

(b) Logic- and knowledge-based approaches, including knowledge representation, inductive (logic) programming, knowledge bases, inference and deductive engines, (symbolic) reasoning and expert systems;

(c) Statistical approaches, Bayesian estimation, search and optimization methods.

This feels hopelessly vague with phrasing like “wide variety of methods” and the inclusion of “statistical approaches”. Also, I don’t see what’s added by “Bayesian estimation” given that it’s an instance of a statistical approach. At least it’s nice to be noticed.

Annex I looks like my CV rearranged. In the ’80s and ’90s, I worked on logic programming, linguistics, and knowledge representation (b). I’m surprised that’s still a going concern. I spent the ’00s working on ML-based natural language processing, speech recognition, and search (a). And ever since the ’10s, I’ve been working on Bayesian stats (c).

cheater-detection bot pisses someone off

Justin Horton writes:

Of course they are a private company. They have the right, within the law, to have who they want on their site and to ban who they want from their site.

What they don’t have the right to do is to call somebody a cheat without backing it up.

But that is what they have done.

That would annoy me too.

Cancer patients be criming? Some discussion and meta-discussion of statistical modeling, causal inference, and social science:

1. Meta-story

Someone pointed me to a news report of a statistics-based research claim and asked me what I thought of it. I read through the press summary and the underlying research paper.

At this point, it’s natural to anticipate one of two endpoints: I like the paper, or I don’t. The results seem reasonable and well founded, or the data and analysis seem flawed.

One tricky point here is that there’s an asymmetry. A bad paper can be obviously bad in ways that a good paper can’t be obviously good.

Here are some quick reasons to distrust a paper (with examples):

– The published numbers just don’t add up (pizzagate)
– The claim would seem to violate some physical laws (ESP)
– The claimed effect is implausibly large (beauty and sex ratio, ovulation and voting)
– The results look too good compared to the noise level (various papers criticized by Gregory Francis)
– The paper makes claims that are not addressed by the data at hand (power pose)
– The fitted model makes no sense (air pollution in China)
– Garbled data (gremlins)
– Misleading citation of the literature (sleep dude)
And lots more.

Some of these problems are easier to notice than others, and multiple problems can go together. And sometimes the problems are obvious in retrospect but it takes a while to find them. The point is, it can often be possible to see that a paper has fatal flaws. (And at this point let me add the standard evidence vs. truth clarification: just because a work of research has fatal flaws, it doesn’t mean that its underlying substantive claim is false; it just means the claim is not strongly supported by the data at hand.)

OK, so we can often be suspicious of a paper right away and then, soon after, be clear on its fatal flaws. Other times we can be given a perpetual-motion-machine-in-a-box: a paper whose claims are ridiculous, but we don’t feel like going through the trouble of unpacking it and finding the battery that’s driving it.

But here’s the asymmetry: we can read a paper and say that it’s reasonable, or even read it and say that the flaws we notice don’t seem fatal—but it’s harder to be sure. It’s a lot easier to get a clear sense that something’s broken than to get a clear sense that it works.

2. The story

Sanjeev Sripathi writes:

I’d like to understand your thoughts on this analysis that establishes a causative link between cancer and crime in Denmark: It was part of today’s WSJ Economics newsletter. Do you agree with the conclusions?

I followed the link to the research article, Breaking Bad: How Health Shocks Prompt Crime, by Steffen Andersen, Gianpaolo Parise, Kim Peijnenburg, which begins:

Exploiting variations in the timing of cancer diagnoses, we find that health shocks elicit an increase in the probability of committing crime by 13%. This response is economically significant at both the extensive (first-time criminals) and intensive margin (reoffenders). We uncover evidence for two channels explaining our findings. First, diagnosed individuals seek illegal revenues to compensate for the loss of earnings on the legal labor market. Second, cancer patients face lower expected cost of punishment through a lower survival probability. We do not find evidence that changes in preferences explain our findings. The documented pattern is stronger for individuals who lack insurance through preexisting wealth, home equity, or marriage. Welfare programs that alleviate the economic repercussions of health shocks are effective at mitigating the ensuing negative externality on society.

The health shocks they study are cancer diagnoses, and they fit a model to data from all adults aged between 18 and 62 in Denmark diagnosed with cancer at some time between 1980 and 2018. The main result, shown in the graph at the top of this post, comes from a linear regression predicting a binary outcome at the person-year level (was person i convicted of a crime committed in year t; something that happened in 0.7% of the person-years in their data) with indicators for people and years, predictors for number of years since cancer diagnosis, and some background variables including age and indicators for whether the person was in prison or in the hospital in that year.

3. Something I don’t understand

The only thing about the above graph that seems odd to me is how smooth the trend is of the point estimates. Given the sizes of the standard errors, I think you’d expect to see those estimates to jump around more. So I suspect there’s a mistake in their analysis somewhere. Of course I could be wrong on this, I’m just guessing.
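One way to check the “too smooth” intuition is to simulate: if the yearly point estimates were independent draws around a smooth curve with the reported standard errors, adjacent-year estimates should typically differ by about 1.1 standard errors. Here’s a sketch with a hypothetical SE (the actual SEs would come from the paper’s figure):

```python
# If yearly point estimates are a smooth curve plus independent noise
# with standard error se, adjacent-year differences average about
# 2*se/sqrt(pi) ~ 1.13*se. A much smoother observed sequence suggests
# correlated estimates or a mistake. The se here is hypothetical.
import random

random.seed(1)
se = 0.05       # hypothetical standard error of each yearly estimate
n_years = 10

sims = []
for _ in range(2000):
    # Independent noise around a flat "true" curve of zeros.
    est = [random.gauss(0, se) for _ in range(n_years)]
    jumps = [abs(b - a) for a, b in zip(est, est[1:])]
    sims.append(sum(jumps) / len(jumps))

mean_jump = sum(sims) / len(sims)
print(f"typical adjacent-year jump: {mean_jump:.3f}")  # roughly 1.13 * se
```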

4. Where are the data and code?

I don’t see a link to the data. From the article:

We combine data from several different administrative registers made available to us through Statistics Denmark. We obtain data on criminal offenses from the Danish Central Crime Registry maintained by the Danish National Police. . . . Health data are from the National Patient Registry and from the Cause of Death Registry. The National Patient Registry records every time a person interacts with the Danish hospital system . . .

Perhaps there are confidentiality constraints? It wasn’t clear what the restrictions were placed on this information, so maybe the data can’t be shared. But the authors of the paper should still share their code; that could help a lot.

It’s possible that the data or code are available; I just didn’t notice them at the site. The paper mentioned an online appendix which I found by googling; it’s here.

5. The model

I looked at their model, and there are some things I’d do differently. First off, it’s a binary outcome, so I’d use logistic regression rather than linear. I understand that you can fit a linear regression to binary data, but I don’t really see the point. Especially in a case like this, where the probabilities are so close to zero.
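For what it’s worth, here’s a tiny made-up example of why the linear probability model is awkward when outcome probabilities are near zero: the fitted line can dip below zero, whereas a logistic curve is bounded between 0 and 1:

```python
# Why a linear model is awkward for binary outcomes with probabilities
# near zero: the fitted line can go negative, while a logistic curve
# cannot. Data below are made up for illustration.
import math

# Hypothetical (x, y) pairs: y is a rare binary outcome.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# Ordinary least squares for y = a + b*x (the "linear probability model").
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
a = ybar - b * xbar

lpm_at_0 = a + b * 0  # can be negative: not a valid probability
logistic = lambda z: 1 / (1 + math.exp(-z))
print(f"linear model at x=0: {lpm_at_0:.3f}")
print(f"logistic curve stays in (0, 1): {logistic(-10):.5f} to {logistic(10):.5f}")
```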

Second, I’d model men and women separately. Men commit most of the crimes, and it just seems like female crime and male crime are different enough stories that it would make sense to model them separately. Alternatively, they could be included in the model with an interaction.

This brings us to the third issue, that all sorts of things could be interacted with the treatment. At this point you might say that I’m trying to Christmas-tree the problem . . . so, sure, let’s not worry about interactions right now. Let’s go on.

My next concern has to do with the multiple measurements on each individual. I like the idea of using each person as his or her own control, but then what’s the comparison, exactly? Suppose I get diagnosed with cancer at the age of 50. Then you’re comparing my criming from ages 51-60 to my criming from ages 44-49 . . . but then again I’m older in my 50s so you’d expect me to be doing less criming then anyway . . . but then the model adjusts for age (I think they are using linear age, but I’m not entirely sure), but maybe the age adjustment overcorrects in some way . . . I’m not quite sure. It’s tricky.

And this brings me to my next concern or request, which is that I’d like to see some plots of raw data. I guess the starting point would be to plot crime rates over time (that is, the proportion of respondents who were convicted of a crime committed in year t, as a function of t), lining things up so that t=0 is the year of cancer diagnosis, with separate lines for people diagnosed with cancer ages 18-25, 26-30, 31-35, etc. And separate lines for men and women. OK, that might be too noisy, so maybe more pooling is necessary. And then there are cohort effects . . . ultimately I’m pretty sure this will end up being a logistic regression (or, ok, use a linear regression if you really want; whatever), and it might look a lot like what was eventually fit in the paper under discussion—but I don’t really think I’d understand it without building the model step by step from the data. I need to see the breadcrumbs.

That’s most of it. I had a few more issues with the model that I can’t remember now. Oh yeah, here’s one issue: who are these people committing crimes? How many of them are past criminals and how many of them are Walter Whites, venturing beyond the law (or, at least, getting caught) for the first time? Above I talked about doing 2 analyses, one for men and one for women, and that’s fine, but now I’m thinking we want to do separate analyses for people with past criminal records and people without. It seems to me there are two stories: one story is past criminals continuing to crime at rates higher than one might expect given their age profile; the other story is people newly criming (again, at a higher rate than the expected rate of new crimes for people who didn’t get cancer).

Oh, yeah, one more issue is selection bias because dead people commit no crimes.

By listing all of these, I’m not saying the published model was bad. There are always lots of ways of attacking any problem with data. My main concern is understanding the estimates and seeing the trail leading from raw data to final inferences. Also, if we carefully take things apart, we might understand why the above graph is so suspiciously smooth. (When I say “suspiciously,” I’m not talking about foul play, just something I’m not understanding.)

6. Conclusion

I don’t have a conclusion. The results look possible. It’s not implausible that people commit 13% more crime than expected during the years after a cancer diagnosis. I mean, it seems kinda high to me and I also would’ve believed it if the effect went in the other direction, but, sure, I guess it’s possible? I’m actually surprised that the drop isn’t greater during the year of the diagnosis itself and the year after. This sort of thing is one reason I want to see more plots of the raw data. I notice lots of things in the analysis that are different than what I would’ve done, but it’s not obvious that any of these choices would cause huge biases. Best will be for the data and code to be out there so others can do their own analyses. Until then, this can serve as an example of some of the challenges of data analysis and interpretation.

P.S. It’s not so usual to see a research project inspired by a TV show, is it?

OK, let me check on google scholar:

Gilligan’s Island

Star Trek

And, of course, Jeopardy

Even Rocky and Bullwinkle.

OK, so that aspect of the paper is not such a big deal. Andersen et al. are hardly the first to write a TV-inspired quantitative research paper.

That said, the connection to the TV show is pretty much perfect here in how the story of the data lines up with the plot of Breaking Bad. The main difference is that in Breaking Bad he’s a killer, whereas Andersen et al. are mostly studying property crimes. Even there, though, it’s not such a bad match, given that Walter’s motivations are primarily economic.

Can you trust international surveys? A follow-up:

Michael Robbins writes:

A few years ago you covered a significant controversy in the survey methods literature about data fabrication in international survey research. Noble Kuriakose and I put out a proposed test for data quality.

At the time there were many questions raised about the validity of this test. As such, I thought you might find a new article [How to Get Better Survey Data More Efficiently, by Mollie Cohen and Zach Warner] in Political Analysis of significant interest. It provides pretty strong external validation of our proposal but also provides a very helpful guide of the effectiveness of different strategies for detecting fabrication and low quality data in international survey research.

I’ve not read this new paper—it’s hard to find the time, what with writing 400 blog posts a year etc!—but I like the general idea of developing statistical methods to check data quality. Data collection and measurement are not covered enough in our textbooks or our research articles—think about all those papers that never tell you the wording of their survey questions! And remember that notorious Lancet Iraq study, or that North-Korea-is-a-democracy study.

We’re hiring (in Melbourne)

Andrew, Qixuan and I (Lauren) are hiring a postdoctoral research fellow to explore research topics around the use of multi-level regression and poststratification with non-probability surveys. This work is funded by the National Institutes of Health, and is collaborative work with Prof Andrew Gelman (Statistics and Political Science, Columbia University) and Assoc/Prof Qixuan Chen (Biostatistics, Columbia University).

My hope with this work is that we will develop functional validation techniques when using MRP with non-probability (or convenience) samples. Work will focus on considering the theoretical framing of MRP, as well as various validation tools like cross-validation with non-representative samples. Interested applicants need not have a background in MRP, but experience with multilevel models and/or survey weighting methods would be desirable. The team works primarily within an R language framework.
Interested applicants can apply at or contact the Monash chief investigator Dr Lauren Kennedy for further information. I should add that the successful applicant must have relevant work rights in Australia (that one’s from HR, and to do with the whole covid/travelling situation).

Hierarchical modeling of excess mortality time series

Elliott writes:

My boss asks me:

For our model to predict excess mortality around the world, we want to calculate a confidence interval around our mean estimate for total global excess deaths. We have real excess deaths for like 60 countries, and are predicting on another 130 or so. We can easily calculate intervals for any particular country. However, if we simulate each country independently, then the confidence interval surrounding the global total will be tiny, and incorrect, because you’ll never get simulations where like 160 countries are all off in the same direction. We need some way to estimate how errors are likely to be correlated between countries. What would Andrew Gelman recommend?

I [Elliott] shot back:

I’ll ask. He is going to recommend a hierarchical model, where you model excess deaths as a function of a global time trend, country-level intercept, country-level time trend and country-level covariates. Something like:

deaths ~ day + day^2 + (1 + day + day^2 | country) + observed_cases + observed_deaths + (observed_deaths | country)

Oh, residual excess deaths are definitely not a quadratic function, now that I think about it. Probably cubic or ^4. But you get the idea.

In brms, you could also do splines:

deaths ~ s(day) + s(day, by="country") + (1 | country) + ....

Then, you take the 95% CI [actually, posterior interval, not confidence interval — AG] from the posterior draws.

Otherwise, you can derive the country-by-country covariance matrix of day-level predicted excess deaths and simulate from a multivariate normal distribution (are excess deaths normally distributed? Maybe lognormal), then grab the CI off of that.

Yes, this all sounds vaguely reasonable. But definitely do the spline, not the polynomial. You should pretty much never be doing the polynomial.
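On the boss’s original worry about independent simulations: a minimal sketch (all numbers invented) shows how adding a shared error component, which is what the global terms in a hierarchical model capture, widens the interval for the global total:

```python
# Sketch of the correlated-errors point: with 130 countries simulated
# independently, errors cancel and the interval for the global total is
# narrow; a shared (correlated) error component keeps it wide. All
# numbers are made up.
import random

random.seed(0)
n_countries, n_sims = 130, 2000
country_mean, country_sd = 1000.0, 200.0  # hypothetical excess deaths

def interval_width(shared_sd):
    """Width of the simulated 95% interval for total excess deaths."""
    totals = []
    for _ in range(n_sims):
        shared = random.gauss(0, shared_sd)  # global shock, common to all
        total = sum(country_mean + shared + random.gauss(0, country_sd)
                    for _ in range(n_countries))
        totals.append(total)
    totals.sort()
    return totals[int(0.975 * n_sims)] - totals[int(0.025 * n_sims)]

independent_width = interval_width(shared_sd=0.0)
correlated_width = interval_width(shared_sd=100.0)
print(f"independent: {independent_width:.0f}, correlated: {correlated_width:.0f}")
```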

I’d also recommend taking a look at the work of Leontine Alkema on Bayesian modeling of vital statistics time series.

The Tall T*

Saw The Tall T on video last night. It was ok, but when it comes to movies whose titles begin with that particular character string, I think The Tall Target was much better. I’d say The Tall Target is the best train movie ever, but maybe that title should go to Intolerable Cruelty, which was much better than Out of Sight, another movie featuring George Clooney playing the George Clooney character. As far as Elmore Leonard adaptations go, I think The Tall T was much better.

One reason why that estimated effect of Fox News could’ve been so implausibly high.

Ethan Kaplan writes:

I just happened upon a post of yours on the potential impact of Fox News on the 2016 election [“No, I don’t buy that claim that Fox news is shifting the vote by 6 percentage points“]. I am one of the authors of the first Fox News study from 2007 (DellaVigna and Kaplan). My guess is that the reason the Martin and Yurukoglu estimates are so large is that their instrument is identified off a group of people who watch Fox because it is higher in the channel order. My guess is that such potential voters are decently more likely than the average voter to be influenced by cable news. Moreover, my guess is that there are not a huge number of such people who actually do spend tons of time watching Fox News. Moreover, there may be fewer of such people than in the year 2000 when U.S. politics was less polarized than it is today and when Fox News did not yet have as much of a well-known reputation for being a conservative news outlet.

Interesting example of an interaction; also interesting example of a bias that can arise from a natural experiment.

Questions about our old analysis of police stops

I received this anonymous email:

I read your seminal work on racial bias in stops with Professors Fagan and Kiss and just had a few questions.

1. Your paper analyzed stops at the precinct level. A critique I have heard regarding aggregating data at that level is that: “To say that the threshold test can feasibly discern whether racial bias is present in a given aggregate dataset would be to ignore its concerning limitations which make it unusable in its ability to perform this task. The Simpson’s paradox is a phenomenon in probability and statistics which refers to a pattern exhibited by aggregated data either disappearing or reversing once the data is disaggregated. When the police behave differently across the strata of some variable, but a researcher’s analysis uses data that ignores and aggregates across this distribution, Simpson’s paradox threatens to give statistics that are inconsistent with reality. The variable being place, police are treating different strata of place differently, and the races are distributed unequally across strata. The researchers who designed the threshold test do not properly control for place, as modeling for something as large as a county or precinct (which is what they do) does not properly account for place if the police structure their behavior along the lines of smaller hot spots.”

2. How would your paper account for changes in police deployment patterns?

3. What are your thoughts on this article? It addresses a paper from one of your colleagues, but if the critiques were valid, would they also apply to your paper from 2007?

My reply:

1. I’m not sure about most of this because it seems to be referring to some other work, not ours. It refers to a threshold test, which is not what we’re doing. As to the question of why we used precincts: this was to address concern that the citywide patterns could be explained by differences between neighborhoods; we discuss this at the beginning of section 4 of our paper. Ultimately, though, the data are what they are, and we’re not making any claims beyond what we found in the data.

2. The data show that the police stopped more blacks and Hispanics than whites in comparable neighborhoods, in comparison to their rate of arrests in the previous year. All these stops could be legitimate police decisions based on local information. We really can’t say; all we can do is give these aggregates.

3. I read the linked article, and it seems that a key point there is that most of the stops in question are legal, and that “Those lawful stops should have been excluded from his regression analysis, since they cannot form the basis for concluding that the officers making the stop substituted race for reasonable suspicion.” I don’t agree with this criticism. The point of our analysis is to show statistical patterns in total stops; the legality of the individual stops is a separate question. Another comment made in the linked article is that the analysis was “evaluating whether the police were making stops based on skin color rather than behavior.” This is not an issue with our analysis, because we were not trying to make any such evaluation; we were just showing statistical patterns. There was also a criticism regarding the use of data from one month to predict the next month. I can’t say for sure, but I don’t think that shifting things by a month would change our analysis. Another criticism was that the model predicted 120 stops in a census tract in a month when that tract averaged only 19. I don’t know the details here, but all regression models have errors; it all depends on what predictors are in the model. Finally, there is the statement that “It is a lot more comfortable to talk about the allegedly racist police than about black-on-black crime.” I don’t think it has to be one or the other. Our paper was about patterns in police stops; there’s other research on patterns of crime.
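For readers unfamiliar with the Simpson’s-paradox concern raised in question 1, here is a toy illustration. All the numbers are invented for the sake of the example; they have nothing to do with the actual stop data:

```python
# Toy Simpson's paradox: within each of two areas, group A is stopped at a
# higher rate than group B, yet the aggregated rates point the other way.
# (stops, population) pairs; all numbers invented for illustration.
data = {
    "area1": {"A": (90, 100), "B": (800, 1000)},
    "area2": {"A": (30, 1000), "B": (2, 100)},
}

# Within-area rates: A exceeds B in both areas.
for area, groups in data.items():
    for g, (stops, pop) in groups.items():
        print(f"{area} {g}: {stops / pop:.1%}")

# Aggregate rates: the ordering reverses.
for g in ("A", "B"):
    stops = sum(data[a][g][0] for a in data)
    pop = sum(data[a][g][1] for a in data)
    print(f"overall {g}: {stops / pop:.1%}")
```

Here group A is stopped at a higher rate than group B within each area (90% vs. 80%, and 3% vs. 2%), but because the groups are distributed unevenly across areas, the aggregate rates reverse. Whether this sort of reversal is actually a problem for any given precinct-level analysis is, of course, an empirical question.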

A recommender system for scientific papers

Jeff Leek writes:

We created a web app that lets people very quickly sort papers on two axes: how interesting it is and how plausible they think it is. We started with covid papers but have plans to expand it to other fields as well.

Seems like an interesting idea: a Yelp-style recommender system, but with two dimensions.

A fill-in-the-blanks contest: Attributing the persistence of the $7.25 minimum wage to “the median voter theorem” is as silly as _______________________

My best shots are “attributing Napoleon’s loss at Waterloo to the second law of thermodynamics” or “attributing Michael Jordan’s 6 rings to the infield fly rule.” But these aren’t right at all. I know youall can do better.

Background here.

For some relevant data, see here, here, here, and here.

P.S. I get it that Cowen doesn’t like the minimum wage. I have no informed opinion on the topic myself. But public opinion on the topic is clear enough. Also, I understand that I might be falling for a parody. Poe’s law and all that.

P.P.S. Yes, this is all silly stuff. The serious part is that Cowen and his correspondent are basically saying (or joking) that what happens, should happen. I agree that there’s a lot of rationality in politics, but you have to watch out for circular reasoning.

In making minimal corrections and not acknowledging that he made these errors, Rajan is dealing with the symptoms but not the underlying problem, which is that he’s processing recent history via conventional wisdom.

Raghuram Rajan is an academic and policy star: University of Chicago professor, former chief economist for the International Monetary Fund, former chief economic advisor to the government of India, and a frequent presence on NPR and in other prestige media.

He also appears to be in the habit of telling purportedly data-backed stories that aren’t really backed by the data.

Story #1: The trend that wasn’t

Gaurav Sood writes:

In late 2019 . . . while discussing the trends in growth in the Indian economy . . . Mr. Rajan notes:

We were growing really fast before the great recession, and then 2009 was a year of very poor growth. We started climbing a little bit after it, but since then, since about 2012, we have had a steady upward movement in growth going back to the pre-2000, pre-financial crisis growth rates. And then since about mid-2016 (GS: a couple of years after Mr. Modi became the PM), we have seen a steady deceleration.

The statement is supported by the red lines that connect the deepest valleys with the highest peak, eagerly eliding over the enormous variation in between (see below).

Not to be left behind, Mr. Rajan’s interlocutor Mr. Subramanian shares the following slide about investment collapse. Note the title of the slide and then look at the actual slide. The title says that the investment (tallied by the black line) collapses in 2010 (before Mr. Modi became PM).

Story #2: Following conventional wisdom

Before Gaurav pointed me to his post, the only other time I’d heard of Rajan was when I’d received his book to review a couple years ago, at which time I sent the following note to the publisher:

I took a look at Rajan’s book, “The Third Pillar: How Markets and the State Leave the Community Behind,” and found what seems to be a mistake right on the first page. Maybe you can forward this to him and there will be a chance for him to correct it before the book comes out.

On the first page of the book, Rajan writes: “Half a million more middle-aged non-Hispanic white American males died between 1999 and 2013 than if their death rates had followed the trend of other ethnic groups.” There are some mistakes here. First, the calculation is wrong because it does not account for changes in the age distribution of this group. Second, it was actually women, not men, whose death rates increased. See here for more on both points.

There is a larger problem here: received wisdom holds that white men are having problems, so people attribute a general trend to men, even though in this case the trend is actually much stronger for women.

I noticed another error. On page 216, Rajan writes, “In the United States, the Affordable Care Act, or Obamacare, was the spark that led to the organizing of the Tea Party movement…” This is incorrect. The Tea Party movement started with a speech on TV in February, 2009, in opposition to Obama’s mortgage relief plan. From Wikipedia: “The movement began following Barack Obama’s first presidential inauguration (in January 2009) when his administration announced plans to give financial aid to bankrupt homeowners. A major force behind it was Americans for Prosperity (AFP), a conservative political advocacy group founded by businessmen and political activist David H. Koch.” The Affordable Care Act came later, with discussion in Congress later in 2009 and the bill passing in 2010. The Tea Party opposed the Affordable Care Act, but the Affordable Care Act was not the spark that led to the organizing of the Tea Party movement. This is relevant to Rajan’s book because it calls into question his arguments about populism.

The person to whom I sent this email said she notified the author so I was hoping he fixed these small factual problems and also that he correspondingly adjusted his arguments about populism. Arguments are ultimately based on facts; shift the facts and the arguments should change to some extent.

In the meantime, Rajan came out with a second edition of his book, and so I was able to check on Amazon to see if he had fixed the errors.

The result was disappointing. It seems that he corrected both errors, but in a minimal way: changing “American males” to “Americans” and changing “the spark that led to the organizing of the Tea Party movement” to “an important catalyst in the organizing of the Tea Party Movement.” It’s good that he made the changes (though not so cool that he didn’t cite me), but I’m bothered by how minimal the changes were. These were not typos; they reflected real misunderstanding, and it’s better to wrestle with one’s misunderstanding than just to make superficial corrections.

At this point you might say I’m being picky. He fixed the errors; isn’t that enough? But, no, I don’t think that’s enough. As I wrote two years ago, arguments are ultimately based on facts; shift the facts and the arguments should change to some extent. If the facts change but the argument stays the same, that represents a problem.

In making minimal corrections and not acknowledging that he made these errors, Rajan is dealing with the symptoms but not the underlying problem, which is that he’s processing recent history via conventional wisdom.

This should not be taken as some sort of blanket condemnation of Rajan, who might be an excellent banker and college professor. Lots of successful people operate using conventional wisdom. We just have to interpret his book not as an economic analysis or a synthesis of the literature but as an expression of conventional wisdom by a person with many interesting life experiences.

The trouble is, if all you’re doing is processing conventional wisdom, you’re not adding anything to the discourse.

“Do you come from Liverpool?”

Paul Alper writes:

Because I used to live in Trondheim, I have a special interest in this NYT article about exercise results in Trondheim, Norway.

Obviously, even without reading the article in any detail, the headline claim that

The Secret to Longevity? 4-Minute Bursts of Intense Exercise May Help

can be misleading and is subject to many caveats.

The essential claims:

Such studies [of exercise and mortality], however, are dauntingly complicated and expensive, one reason they are rarely done. They may also be limited, since over the course of a typical experiment [of short duration], few adults may die. This is providential for those who enroll in the study but problematic for the scientists hoping to study mortality; with scant deaths, they cannot tell if exercise is having a meaningful impact on life spans.

However, exercise scientists at the Norwegian University of Science and Technology in Trondheim, Norway, almost 10 years ago, began planning the study that would be published in October in The BMJ.

More than 1,500 of the Norwegian men and women accepted. These volunteers were, in general, healthier than most 70-year-olds. Some had heart disease, cancer or other conditions, but most regularly walked or otherwise remained active. Few were obese. All agreed to start and continue to exercise more regularly during the upcoming five years.

Via random assignment, they were put into three groups: a control group that “agreed to follow standard activity guidelines and walk or otherwise remain in motion for half an hour most days”; a moderate group that exercised “moderately for longer sessions of 50 minutes twice a week”; and a third group that “started a program of twice-weekly high-intensity interval training, or H.I.I.T., during which they cycled or jogged at a strenuous pace for four minutes, followed by four minutes of rest, with that sequence repeated four times.”
Note that those in the control group were allowed to indulge in interval training if they felt like it.

Almost everyone kept up their assigned exercise routines for five years [!!], an eternity in science, returning periodically to the lab for check-ins, tests and supervised group workouts.

The results:

The men and women in the high-intensity-intervals group were about 2 percent less likely to have died than those in the control group, and 3 percent less likely to die than anyone in the longer, moderate-exercise group. People in the moderate group were, in fact, more likely to have passed away than people in the control group [!!].

In essence, says Dorthe Stensvold, a researcher at the Norwegian University of Science and Technology who led the new study, intense training — which was part of the routines of both the interval and control groups — provided slightly better protection against premature death than moderate workouts alone.

Here can be found the BMJ article itself. A closer look at the BMJ article is puzzling because of the term “non-significant,” which appears in the BMJ article but not in the NYT.


This study suggests that combined MICT and HIIT has no effect on all cause mortality compared with recommended physical activity levels. However, we observed a lower all cause mortality trend after HIIT compared with controls and MICT.


The Generation 100 study is a long and large randomised controlled trial of exercise in a general population of older adults (70-77 years). This study found no differences in all cause mortality between a combined exercise group (MICT and HIIT) and a group that followed Norwegian guidelines for physical activity (control group). We observed a non-significant 1.7% absolute risk reduction in all cause mortality in the HIIT group compared with control group, and a non-significant 2.9% absolute risk reduction in all cause mortality in the HIIT group compared with MICT group. Furthermore, physical activity levels in the control group were stable throughout the study, with control participants performing more activities as HIIT compared with MICT participants, suggesting a physical activity level in control participants between that of MICT and HIIT.
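A rough calculation shows why absolute differences of this size can easily be non-significant in a study of this scale. The numbers below are purely illustrative; the excerpt doesn’t give the arm sizes or death counts, so I’m assuming roughly 400 participants per arm, with hypothetical death counts chosen only to match the reported 1.7-percentage-point difference:

```python
import math

def two_prop_z(deaths1, n1, deaths2, n2):
    """Two-sided z-test for a difference in two proportions, pooled SE."""
    p1, p2 = deaths1 / n1, deaths2 / n2
    pooled = (deaths1 + deaths2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal: 2*(1 - Phi(|z|)) = erfc(|z|/sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, z, p_value

# Hypothetical counts, not the actual Generation 100 numbers:
# 11/400 deaths in one arm vs. 18/400 in the other (a 1.75-point difference).
diff, z, p = two_prop_z(11, 400, 18, 400)
print(f"difference = {diff:+.4f}, z = {z:.2f}, two-sided p = {p:.2f}")
```

Under these assumptions the p-value comes out well above 0.05, consistent with the study’s “non-significant” framing: with a few hundred people per arm and mortality in the low single digits, a one-to-three-point absolute difference is simply within sampling noise.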

As it happens, I [Alper] lived in Trondheim back before North Sea oil transformed the country. The Norwegian University of Science and Technology in Trondheim, Norway did not exist but was called the NTH, incorrectly translated as the Norwegian Technical High School. Back then and as it is today, exercise was the nation’s religion and the motto of the country was

It doesn’t matter whether you win or lose. The important thing is to beat Sweden.

To give you a taste of what the country was like in the 1960s, while I was on a walk, a little kid stopped me and said, “Do you come from Liverpool?”

Dude should have his own blog.

Conference on digital twins

Ron Kenett writes:

This conference and the special issue that follows might be of interest to (some) of your blog readers.

Here’s what it says there:

The concept of digital twins is based on a combination of physical models that describe the machine’s behavior and its deterioration processes over time with analytics capabilities that enable lessons to be learned, decision making and model improvement. The physical models can include the control model, the load model, the erosion model, the crack development model and more, while the analytics model is based on experimental data and operational data from the field.

I don’t fully follow this, but it sounds related to all the engineering workflow things that we like to talk about.

Which sorts of posts get more blog comments?

Paul Alper writes:

Some of your blog postings elicit many responses and some, rather few. Have you ever thought of displaying some sort of statistical graph illustrating the years of data? For example, sports vs. politics, or responses for one year vs. another (time series), winter vs. summer, highly technical vs. breezy.

I’ve not done any graph or statistical analysis. Informally I’ve noticed a gradual increase in the rate of comments. It’s not always clear which posts will get lots of comments and which will get few, except that more technical material typically gets less reaction. Not because people don’t care, I think, but because it’s harder to say much in response to a technical post. I think we also get fewer comments for posts on offbeat topics such as art and literature. And of course we get fewer comments on posts that are simply announcing job opportunities, future talks, and future posts. And all posts get zillions of spam comments, but I’m not counting them.

As of this writing, we have published 10,157 posts and 146,661 comments during the 16 years since the birth of this blog. The rate of comments has definitely been increasing, as I remember not so long ago that the ratio was 10-to-1. Unfortunately, there aren’t so many blogs anymore, so I’m pretty sure that the total rate of blog commenting has been in steady decline for years.
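For what it’s worth, the counts quoted above pin down the current average directly; a two-line check:

```python
# Average comments per post, from the counts quoted above.
posts, comments = 10_157, 146_661
ratio = comments / posts
print(f"about {ratio:.1f} comments per post, up from roughly 10-to-1")
```

That works out to a bit over 14 comments per post, which squares with the remembered 10-to-1 ratio having drifted upward.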
