Career advice for a future statistician

Gary Ruiz writes:

I am a first-year math major at the Los Angeles City College in California, and my long-term educational plans involve acquiring at least one graduate degree in applied math or statistics.

I’m writing to ask whether you would offer any career advice to someone interested in future professional work in statistics.

I would mainly like to know:

– What sort of skills does this subject demand and reward, more specifically than the requisite/general mathematical abilities?

– What are some challenges someone is likely to face that are unique to studying statistics? Any quirks to the profession at a higher (mainly at the research) level?

– How does statistics contrast with related majors like Applied Mathematics in terms of the requisite training or later subjects of study?

– Are there any big (or at least common) misconceptions regarding what statistical research work involves?

– What are some of the other non-academic considerations I might want to keep in mind? For example, what are other statisticians usually like (if there’s a “general type”)? How does being a statistician affect your day-to-day life (in terms of the time investment, etc.), if at all?

– If you could give your younger self any career-related advice, what would it be? (I hope this question isn’t too cliché, but I figured it was worth asking).

– Finally, what are the most important factors that any potential statistician should consider before committing to the field?

My replies:

– Programming is as important as math. Beyond that, you could get a sense of what skills could be useful by looking at our forthcoming book, Regression and Other Stories, or by working through the Stan case studies.

– I don’t know that there are any challenges that are unique to studying statistics. Compared to other academic professions, I think statistics is less competitive, maybe because there are so many alternatives to academia involving work in government and industry.

– I don’t know enough about undergraduate programs to compare statistics to applied math. My general impression is that the two fields are similar.

– I don’t know of any major misconceptions regarding statistical research work. The only thing I can think of offhand is that in our PhD students we sometimes get pure math students who want to go into finance, I think in part because they think this will be a way for them to keep doing math. But then when they get jobs in finance, they find themselves running logistic regressions all day. So it might’ve been more useful for them to have studied applied statistics rather than learning proofs of the Strong Law of Large Numbers. But this won’t arise at the undergraduate level. I’m pretty sure that any math you learn as an undergrad will come in handy later.

– Regarding non-academic considerations: how your day-to-day life goes depends on the job. I’ve found lawyers and journalists to be on irregular schedules: either they’re in an immense hurry and are bugging me at all hours, or they’re on another assignment and they don’t bother responding to inquiries. Statistics is a form of engineering, and I think the job is more time-averaged. Even when there’s urgency (for example, when responding to a lawyer or journalist), everything takes a few hours. It’s typically impossible to do a rush job—and, even if you could, you’re better off checking your answer a few times to make sure you know what you’re doing. You’ll be making lots of mistakes in your career anyway, so it’s best to avoid putting yourself in a situation where you’re almost sure to mess up.

– Career advice to my younger self? I don’t know that this is so relevant, given how much times have changed in the past 40 years. My advice: when choosing what to do, look at older people who are similar to you in some way and have made different choices. One reason I decided to go into research, many years ago, was that the older people I observed who were doing research seemed happy in their jobs—even the ones who were doing boring research seemed to like it—while the ones doing other sorts of jobs, even those that might sound fun or glamorous, seemed more likely to have burned out. Looking back over the years, I’ve had some pretty good ideas that might’ve made me a ton of money, but I’ve been fortunate enough to be paid enough to have no qualms about giving these ideas away for free.

– What factors should be considered by a potential statistician? I dunno, maybe think hard about what applications you’d like to work on. Typically you’ll have one or maybe two applications you’re an expert on. So choose something that seems interesting or important to you.

Interesting y-axis

Merlin sent along this one:

P.S. To be fair, when it comes to innumeracy, whoever designed the above graph has nothing on these people.

As Clarissa Jan-Lim put it:

Math is hard and everyone needs to relax! (Also, Mr. Bloomberg, sir, I think we will all still take $1.53 if you’re offering).

Model building is Lego, not Playmobil. (toward understanding statistical workflow)

John Seabrook writes:

Socrates . . . called writing “visible speech” . . . A more contemporary definition, developed by the linguist Linda Flower and the psychologist John Hayes, is “cognitive rhetoric”—thinking in words.

In 1981, Flower and Hayes devised a theoretical model for the brain as it is engaged in writing, which they called the cognitive-process theory. It has endured as the paradigm of literary composition for almost forty years. The previous, “stage model” theory had posited that there were three distinct stages involved in writing—planning, composing, and revising—and that a writer moved through each in order. To test that theory, the researchers asked people to speak aloud any stray thoughts that popped into their heads while they were in the composing phase, and recorded the hilariously chaotic results. They concluded that, far from being a stately progression through distinct stages, writing is a much messier situation, in which all three stages interact with one another simultaneously, loosely overseen by a mental entity that Flower and Hayes called “the monitor.” Insights derived from the work of composing continually undermine assumptions made in the planning part, requiring more research; the monitor is a kind of triage doctor in an emergency room.

This all makes sense to me. It reminds me of something I tell my students, which is that “writing is non-algorithmic,” which isn’t literally true—everything is algorithmic, if you define “algorithm” broadly enough—but which is intended to capture the idea that when writing, we go back and forth between structure and detail.

Writing is not simply three sequential steps of planning, composing, and revising, but I still think that it’s useful when writing to consider these steps, and to think of Planning/Composing/Revising as a template. You don’t have to literally start with a plan—your starting point could be composing (writing a few words, or a few sentences, or a few paragraphs) or revising (working off something written by someone else, or something written earlier by you)—but at some point near the beginning of the project, an outline can be helpful. Plan with composition in mind, and then, when it’s time to compose, compose being mindful of your plan and also of your future revision process. (To understand the past, we must first know the future.)

But what I really wanted to talk about today is statistical analysis, not writing. My colleagues and I have been thinking a lot about workflow. On the first page of BDA, we discuss these three steps:
1. Model building.
2. Model fitting.
3. Model checking.
And then you go back to step 1.

That’s all fine, it’s a starting point for workflow, but it’s not the whole story.

As we’ve discussed here and elsewhere, we don’t just fit a single model: workflow is about fitting multiple models. So there’s a lot more to workflow; it includes model building, model fitting, and model checking as dynamic processes where each model is aware of others.

Here are some ways this happens:

– We don’t just build one model, we build a sequence of models. This fits into the way that statistical modeling is a language with a generative grammar. To use toy terminology, model building is Lego, not Playmobil.

– When fitting a model, it can be helpful to use fits from other models as scaffolding. The simplest idea here is “warm start”: take the solution from a simple model as a starting point for new computation. More generally, we can use ideas such as importance sampling, probabilistic approximation, variational inference, expectation propagation, etc., to leverage solutions from simple models to help compute for more complicated models.

– Model checking is, again, relative to other models that interest us. Sometimes we talk about comparing model fit to raw data, but in many settings any “raw data” we see have already been mediated by some computation or model. So, more generally, we check models by comparing them to inferences from other, typically simpler, models.
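The scaffolding idea in the second point above can be sketched in a few lines (Python rather than Stan, with a made-up one-dimensional example): draw from a cheap approximation fitted to a simple model, then importance-weight those draws toward the more complicated model's posterior.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Simple model": a cheap normal approximation to the posterior of theta.
simple_mean, simple_sd = 0.0, 1.2

# "Complicated model": an unnormalized log posterior we actually want
# (here secretly N(0.5, 1), so we can check the answer).
def log_target(theta):
    return -0.5 * theta**2 + 0.5 * theta

# Warm start: sample the cheap approximation, then reweight toward the target.
draws = rng.normal(simple_mean, simple_sd, size=100_000)
log_q = -0.5 * ((draws - simple_mean) / simple_sd) ** 2 - np.log(simple_sd)
log_w = log_target(draws) - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()

post_mean = float(np.sum(w * draws))   # should be close to 0.5
```

The reweighting recovers the complicated model's posterior mean from draws generated under the simple one; the same logic underlies fancier schemes such as Pareto-smoothed importance sampling.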

Another key part of statistical workflow is model understanding, also called interpretable AI. Again, we can often best understand a fitted model by seeing its similarities and differences as compared to other models.

Putting this together, we can think of a sequence of models going from simple to complex—or maybe a network of models—and then the steps of model building, inference, and evaluation can be performed on this network.

This has come up before—here’s a post with some links, including one that goes back to 2011—so the challenge here is to actually do something already!

Our current plan is to work through workflow in some specific examples and some narrow classes of models and then use that as a springboard toward more general workflow ideas.

P.S. Thanks to Zad Chow for the adorable picture of workflow shown above.

Update: OHDSI COVID-19 study-a-thon.

I thought a summary in the “continue reading” section below might be helpful, as the main page might be a lot to digest.

The OHDSI COVID-19 group re-convenes at 6:00 (EST, I think) on Monday for updates.

For those who want to do modelling: you cannot get the data yourself but must write analysis scripts that the data holders will run on their own computers, returning the results. My guess is that this might be most doable through here, where custom R scripts can be implemented that data holders might be able to run. Maybe some RStan experts can try to work this through.

Continue reading ‘Update: OHDSI COVID-19 study-a-thon.’ »

Noise-mining as standard practice in social science

The following example is interesting, not because it is particularly noteworthy but rather because it represents business as usual in much of social science: researchers trying their best, but hopelessly foiled by their use of crude psychological theories and cruder statistics, along with patterns of publication and publicity that motivate the selection and interpretation of patterns in noise.

Elio Campitelli writes:

The silliest study this week?

I realise that it’s a hard competition, but this has to be the silliest study I’ve read this week. Each group of participants read the same exact text with only one word changed and the researchers are “startled” to see that such a minuscule change did not alter the readers’ understanding of the story. From the Guardian article (the paper is yet to be published as I’m sending you this email):

Two years ago, Washington and Lee University professors Chris Gavaler and Dan Johnson published a paper in which they revealed that when readers were given a sci-fi story peopled by aliens and androids and set on a space ship, as opposed to a similar one set in reality, “the science fiction setting triggered poorer overall reading” and appeared to “predispose readers to a less effortful and comprehending mode of reading – or what we might term non-literary reading”.

But after critics suggested that merely changing elements of a mainstream story into sci-fi tropes did not make for a quality story, Gavaler and Johnson decided to revisit the research. This time, 204 participants were given one of two stories to read: both were called “Ada” and were identical apart from one word, to provide the strictest possible control. The “literary” version begins: “My daughter is standing behind the bar, polishing a wine glass against a white cloth.” The science-fiction variant begins: “My robot is standing behind the bar, polishing a wine glass against a white cloth.”

In what Gavaler and Johnson call “a significant departure” from their previous study, readers of both texts scored the same in comprehension, “both accumulatively and when divided into the comprehension subcategories of mind, world, and plot”.

The presence of the word “robot” did not reduce merit evaluation, effort reporting, or objective comprehension scores, they write; in their previous study, these had been reduced by the sci-fi setting. “This difference between studies is presumably a result of differences between our two science-fiction texts,” they say.

Gavaler said he was “pretty startled” by the result.

I mean, I wouldn’t dismiss out of hand the possibility of a one-word change having dramatic consequences (change “republican” to “democrat” in a paragraph describing a proposed policy, for example). But in this case it seems to me that the authors surfed the noise generated by the previous study into expecting a big change from just changing “daughter” to “robot” and nothing else.

I agree. Two things seem to be going on:

1. The researchers seem to have completely internalized the biases arising from the statistical significance filter that lead to estimates being too high (as discussed in section 2.1 of this article), thus they came into this new experiment expecting to see a huge and statistically significant effect (recall the 80% power lie).

2. Then they do the experiment and are gobsmacked to find nothing (like the 50 shades of gray story, but without the self-awareness).

The funny thing is that items 1 and 2 kinda cancel, and the researchers still end up with positive press!
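The significance-filter bias in item 1 is easy to simulate (a sketch with made-up numbers, in the spirit of the type M error argument in that article): condition a small, noisily estimated effect on reaching statistical significance, and the surviving estimates are exaggerated several-fold even though the study has low power.

```python
import numpy as np

rng = np.random.default_rng(3)

true_effect, se = 0.1, 0.2        # a small true effect, noisily estimated
n_studies = 100_000
est = rng.normal(true_effect, se, size=n_studies)

# The significance filter: only estimates with |est| > 1.96*se get reported.
signif = est[np.abs(est) > 1.96 * se]

power = len(signif) / n_studies                       # well under 80%
exaggeration = signif[signif > 0].mean() / true_effect  # several-fold too big
```

With these numbers the "published" estimates overstate the true effect by a factor of roughly four to five, which is exactly the setup for being gobsmacked when a replication finds nothing.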

P.S. I looked up Chris Gavaler and he has a lot of interesting thoughts. Check out his blog! I feel bad that he got trapped in the vortex of bad statistics, and I don’t want this discussion of statistical fallacies to reflect negatively on his qualitative work.

Conference on Mister P online tomorrow and Saturday, 3-4 Apr 2020

We have a conference on multilevel regression and poststratification (MRP) this Friday and Saturday, organized by Lauren Kennedy, Yajuan Si, and me. The conference was originally scheduled to be at Columbia but now it is online. Here is the information.

If you want to join the conference, you must register for it ahead of time; just click on the link.

Here are the scheduled talks for tomorrow (Fri):

Elizabeth Tipton RCT Designs for Causal Generalization

Benjamin Skinner Why did you go? Using multilevel regression with poststratification to understand why community colleges students exit early

Jon Zelner From person-to-person transmission events to population-level risks: MRP as a tool for maximizing the public health benefit of infectious disease data

Katherine Li Multilevel Regression and Poststratification with Unknown Population Distributions of Poststratifiers

Qixuan Chen Use of administrative records to improve survey inference: a response propensity prediction approach

Lauren Kennedy and Andrew Gelman 10 things to love and hate about MRP

And here’s the schedule for Saturday:

Shiro Kuriwaki and Soichiro Yamauchi

Roberto Cerina Election projections using available data, machine learning, and poststratification

Douglas Rivers Modeling elections with multiple candidates

Yajuan Si Statistical Data Integration and Inference with Multilevel Regression and Poststratification

Yutao Liu Model-based prediction using auxiliary information

Samantha Sekar

Chris Hanretty Hierarchical related regression for individual and aggregate electoral data

Lucas Leemann Improved Multilevel Regression with Post-Stratification Through Machine Learning (autoMrP)

Leontine Alkema Got data? Quantifying the contribution of population-period-specific information to model-based estimates in demography and global health

Jonathan Gellar Are SMS (text message) surveys a viable form of data collection in Africa and Asia?

Charles Margossian Laplace approximation for speeding computation of multilevel models

More coronavirus research: Using Stan to fit differential equation models in epidemiology

Seth Flaxman and others at Imperial College London are using Stan to model coronavirus progression; see here (and I’ve heard they plan to fix the horrible graphs!) and this Github page.

They also pointed us to this article from December 2019, Contemporary statistical inference for infectious disease models using Stan, by Anastasia Chatzilena et al. I guess this particular paper will be useful for people getting started in this area, or for epidemiologists who’ve been hearing about Stan and would like to know how to use it for differential equation models in epidemiology. I have not read the article in detail.

We’re also doing some research on how to do inference for differential equations more efficiently in Stan. Nothing ready to report here, but new things will come soon, I hope. One idea is to run the differential equation solver on a coarser time scale during the NUTS updating, use importance sampling to correct the errors, and then run the solver on the finer time scale in the generated quantities block.
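As a cartoon of that coarse-then-correct idea (Python with a forward-Euler solver standing in for Stan's ODE integrator; all numbers are made up): weight posterior draws using the cheap coarse solve, then correct the weights by the fine/coarse likelihood ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ODE dy/dt = -k*y, y(0) = 1, solved by forward Euler.
def solve(k, t_end=1.0, n_steps=10):
    y, dt = 1.0, t_end / n_steps
    for _ in range(n_steps):
        y = y + dt * (-k * y)
    return y

y_obs = np.exp(-1.0) + 0.01     # one noisy observation of y(1); true k = 1
sigma = 0.05

def log_lik(k, n_steps):
    return -0.5 * ((y_obs - solve(k, n_steps=n_steps)) / sigma) ** 2

# Stage 1: cheap weights from the coarse solver (10 steps), using the
# prior N(1, 0.3) as the proposal, so the weights are just the likelihood.
draws = rng.normal(1.0, 0.3, size=20_000)
log_w = log_lik(draws, n_steps=10)

# Stage 2: importance-sampling correction by the fine/coarse likelihood
# ratio, so the weights now target the posterior under the accurate solver.
log_w += log_lik(draws, n_steps=1000) - log_lik(draws, n_steps=10)

w = np.exp(log_w - log_w.max())
w /= w.sum()
post_mean = float(np.sum(w * draws))   # posterior mean of k, fine solver
```

The expensive fine solve happens only once per retained draw, which is the point: the sampler pays the coarse price, and importance weights pay off the discretization error afterward.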

What can we learn from super-wide uncertainty intervals?

This question comes up a lot, in one form or another. Here’s a topical version, from Luigi Leone:

I am writing after three weeks of lockdown.

I would like to bring to your attention this Imperial College report (issued on Monday, I believe).

The report estimates that 9.8% of the Italian population (thus, 6 million people) and 15% of the Spanish population (thus, about 7 million people) are already infected. Their estimates are based on Bayesian models, of which I do not know a thing, while you know a lot. Hence, I cannot judge. But on a practical note, I was impressed by the credibility intervals: for Italy, between 1.9 million and 15.2 million, and for Spain, between 1.7 million and 19 million! What could a normal person make of estimates that imply opposite conclusions (for instance for the mortality rate, which could oscillate between the Spanish flu at one end of the interval and the regular flu at the other)? It also seems strange to me that the wider credibility intervals are found for the countries with more data (tests, positives, deaths), not for those with less data.

My reply: When you get this sort of wide interval, the appropriate response is to call for more data. The wide intervals are helpful in telling you that more information will be needed if you want to make an informed decision.

As noted above, this comes up all the time. When we say to accept uncertainty and embrace variation, the point is not that uncertainty (or certainty) is a good in itself but rather that it should guide our actions. Certainty, or the approximation of certainty, can help in our understanding. Uncertainty can inform our decision making.

“Partially Identified Stan Model of COVID-19 Spread”

Robert Kubinec writes:

I am working with a team collecting government responses to the coronavirus epidemic. As part of that, I’ve designed a Stan time-varying latent variable model of COVID-19 spread that only uses observed tests and cases. I show that, while it is impossible to know the true number of infected cases, we can rank/sign identify the effects of government policies on the spread of the virus. I do some preliminary analysis with the dates of emergency declarations of US states to show that states which declared earlier seem to have lower total infection rates (though they have not yet flattened the infection curve).

Furthermore, by incorporating informative priors from SEIR/SIR models, it is possible to identify the scale of the latent variable and provide more informative estimates of total infected. These estimates (conditional on a lower bound based on SIR/SEIR models) report that approximately 700,000 Americans have been infected as of yesterday, or roughly 6-7 times the observed case count, as many SEIR/SIR models have predicted.

I’m emailing you as I would love feedback on the model as well as to share it with others who may be engaged in similar modeling tasks.

Paper link

Github with Data & Stan code

Moving blog to twitter

My co-bloggers and I have decided that the best discussions are on twitter so we’re shutting down this blog, as of today. Old posts will remain, and you can continue to comment, but we won’t be adding any new material.

We’re doing this for two reasons:

1. Our various attempts to raise funds by advertising on the blog or by running sponsored posts have not been effective. (Did you know that approximately one in ten posts on this blog has been sponsored? Probably not, as we’ve been pretty careful to keep a consistent “house style” in our writing. Now you can go back and try to figure out which posts were which.) Not enough of you have been clicking the links, so all this advertising and sponsoring has barely made enough money to pay the web hosting fees.

2. The blog is too damn wordy. We recognize that just about nobody ever reads to the end of these posts (even this one). Remember what Robert Frost said about playing tennis without a net? Twitter has that 140-character limit, which will keep us focused. And on the rare occasions when we have more to say than can be fit in 140 characters, we’ll just post a series of tweets. That should be easy enough to read—and the broken-into-140-character-bits will be a great way to instill a readable structure.

Every once in a while we’ll have more to say than can be conveniently expressed in a tweet, or series of tweets. In these cases we’ll just publish our opinion pieces in Perspectives on Psychological Science or PNAS. I don’t know if you’ve heard, but we’ve got great connections at those places! I have a friend who’s a psychology professor at Princeton who will publish anything I send in.

And if we have any ideas that are too conceptually advanced to fit on twitter or in a PNAS paper, we’ll deliver them as Ted talks. We have some great Ted talk ideas but we’ll need some help with the stunts and the special effects.

This blog has been going for over 15 years. We’ve had a good run, and thanks for reading and commenting. Over and out.

Stasi’s back in town. (My last post on Cass Sunstein and Richard Epstein.)

OK, I promise, this will be the last Stasi post ever.

tl;dr: This post is too long. Don’t read it.
Continue reading ‘Stasi’s back in town. (My last post on Cass Sunstein and Richard Epstein.)’ »

And the band played on: Low-quality studies being published on COVID-19 prediction.

According to Laure Wynants et al., Systematic review and critical appraisal of prediction models for diagnosis and prognosis of COVID-19 infection, most of the recently published studies on prediction of COVID-19 are of rather low quality.

Information is desperately needed, but not misleading information :-(

Conclusion: COVID-19 related prediction models for diagnosis and prognosis are quickly entering the academic literature through publications and preprint reports, aiming to support medical decision making in a time where this is needed urgently. Many models were poorly reported and all appraised as high risk of bias. We call for immediate sharing of the individual participant data from COVID-19 studies worldwide to support collaborative efforts in building more rigorously developed and validated COVID-19 related prediction models. The predictors identified in current studies should be considered for potential inclusion in new models. We also stress the need to adhere to methodological standards when developing and evaluating COVID-19 related prediction models, as unreliable predictions may cause more harm than benefit when used to guide clinical decisions about COVID-19 in the current pandemic.

“How to be Curious Instead of Contrarian About COVID-19: Eight Data Science Lessons From Coronavirus Perspective”

Rex Douglass writes:

I direct the Machine Learning for Social Science Lab at the Center for Peace and Security Studies, UCSD. I’ve been struggling with how non-epidemiologists should contribute to COVID-19 questions right now, and I wrote a short piece that summarizes my thoughts.

8 data science suggestions

For people who want to use theories or models to make inferences or predictions in social science, Douglass offers the following eight suggestions:

1: Actually Care About the Answer to a Question

2: Pose a Question and Propose a Research Design that Can Answer It

3: Use Failures of Your Predictions to Revise your Model

4: Form Meaningful Prior Beliefs with a Thorough Literature Review

5: Don’t Form Strong Prior Beliefs Based on Cherry Picked Data

6: Be Specific and Concrete About Your Theory

7: Choose Enough Cases to Actually Test Your Theory

8: Convey Uncertainty with Specificity not Doublespeak

2 more suggestions from me

I’d like to augment Douglass’s list with two more items:

9: Recognize that social science models depend on context. Be clear on the assumptions of your models, and consider where and when they will fail.

10: Acknowledge internal anomalies (aspects of your theories that are internally incoherent) and external anomalies (examples when your data makes incorrect real-world predictions).

Both these new points are about recognizing and working with the limitations of your model. Some of this is captured in Douglass’s point 3 above (“Use Failures of Your Predictions to Revise your Model”). I’m going further, in point 9 urging people to consider the limitations of their models right away, without waiting for the failures; and in point 10 urging people to publicly report problems when they are found. Don’t just revise your model; also explore publicly what went wrong.


Douglass frames his general advice as a series of critiques of a couple of op-eds by a loud and ignorant contrarian, a law professor named Richard Epstein.

Law professors get lots of attention in this country, which I attribute to some combination of their good media connections, their ability to write clearly and persuasively and on deadline, and their habit and training of advocacy, of presenting one side of a case very strongly and with minimal qualifications.

Epstein’s op-eds are pretty silly and they hardly seem worth taking seriously, except as indicating flaws in our elite discourse. He publishes at the Hoover Institution, and I’m guessing the people in charge of the Hoover Institution feel that enough crappy left-wing stuff is being published by the news media every day that they can’t see much harm in countering it with crappy right-wing stuff of their own. Or maybe it’s just no big deal. Stanford University publishing a poorly-sourced opinion piece is, from a scholarly perspective, a much milder offense than what their Berkeley neighbor is doing with a professor who engages in omitting data or results such that the research is not accurately represented in the research record. If you’re well connected, elite institutions will let you get away with a lot.

When responding to criticism, Epstein seems like a more rude version of the cargo-cult scientists we deal with all the time on this blog, people who lash out at you when you point out their mistakes. In this case, Epstein’s venue is not email or twitter or even Perspectives on Psychological Science; it’s an interview in the New Yorker, where he issues the immortal words:

But, you want to come at me hard, I am going to come back harder at you. And then if I can’t jam my fingers down your throat, then I am not worth it. . . . But a little bit of respect.

Dude’s a street fighter. Those profs and journalists who prattle on about methodological terrorists, second-string replication police, Stasi, Carmelo, etc., they got nothing on this Richard Epstein guy.

In this case, though, we can thank Epstein for motivating Douglass’s thoughtful article.

P.S. I’d been saving the above image for the next time I wrote about Cass “Stasi” Sunstein. But a friend told me that people take umbrage at “sustained, constant criticism,” so maybe best not to post more about Sunstein for awhile. My friend was telling me to stop posting about Nate Silver, actually. It’s ok, there are 8 billion other people we can write about for awhile.

Fit nonlinear regressions in R using stan_nlmer

This comment from Ben reminded me that lots of people are running nonlinear regressions using least squares and other unstable methods of point estimation.

You can do better, people!

Try stan_nlmer, which fits nonlinear models and also allows parameters to vary by groups.

I think people have the sense that maximum likelihood or least squares is this rigorous, well-defined thing, and that Bayesian inference is flaky. The (mistaken) idea is that when using Bayesian inference you’re making extra assumptions and you’re trading robustness for efficiency.

Actually, though, Bayesian inference can be more robust than classical point estimation. They call it “regularization” for a reason!
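To see why "regularization" buys robustness, here's a toy sketch (plain NumPy, and a linear model with nearly collinear predictors standing in for the general nonlinear case; all numbers are invented): the least-squares solution is unstable, while the MAP estimate under a normal(0, 1) prior, i.e. ridge regression, is not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two nearly collinear predictors: least squares is unstable here.
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)   # true coefficients (1, 1)

# Ordinary least squares: the individual coefficients are poorly determined
# and typically explode in opposite directions (only their sum is stable).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# MAP under a normal(0, 1) prior on each coefficient, i.e. ridge with
# lambda = noise variance / prior variance: stable, near (1, 1).
lam = 0.5**2 / 1.0**2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The prior is doing the same job here that priors do in stan_nlmer: it rules out the wild, data-equivalent solutions that a pure point estimator is free to wander into.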

P.S. I’m not kidding that this can make a difference. Consider this bit from an article cited in the above-linked post:

The point here is not that there’s anything wrong with the above steps, just that they represent a lot of effort to get something that’s kinda clunky and unstable. Just a lack of awareness of existing software.

P.S. Earlier post title was wrong: it’s stan_nlmer you want for this purpose, not stan_lmer. Damn confusing function names!

Structural equation modeling and Stan

Eric Brown asks:

How do Stan and its Bayesian modeling relate to structural equation modeling? Do you know of a resource that attempts to explain the concepts behind SEM in terms of Stan nomenclature and concepts?

Some research that I’ve looked into uses SEM to evaluate latent factors underlying multiple measurements with associated errors, or uses SEM to relate different measurements of the same physical property. I have a hard time wrapping my head around such analyses and would prefer to use what I know (Stan) to investigate the same issues.

Any suggestions?

My reply:

There are two aspects to a structural equation model: the statistical model and the causal interpretation.

The statistical model is a big multivariate distribution, and there should be no problem fitting it in Stan. I haven’t fit such models myself, but my guess is that if you put a query on the Stan Discourse list, asking if anyone’s fit a structural equation model in Stan, you’ll get some responses.
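To make the "big multivariate distribution" point concrete, here is a minimal sketch (NumPy rather than Stan; the loadings and residual variances are invented) of a one-factor measurement model, whose implied covariance lam lam' + diag(psi) is exactly the kind of structure you would encode as a multivariate likelihood in a Stan model block.

```python
import numpy as np

rng = np.random.default_rng(7)

# One latent factor, three observed indicators:
#   y_ij = lam_j * eta_i + eps_ij,  eta ~ N(0, 1),  eps_j ~ N(0, psi_j)
lam = np.array([1.0, 0.8, 0.6])    # factor loadings
psi = np.array([0.3, 0.4, 0.5])    # residual (measurement-error) variances

n = 200_000
eta = rng.normal(size=n)
y = eta[:, None] * lam + rng.normal(size=(n, 3)) * np.sqrt(psi)

# Model-implied covariance of y: lam lam' + diag(psi).
implied = np.outer(lam, lam) + np.diag(psi)
empirical = np.cov(y, rowvar=False)
```

Fitting this in Stan would mean putting priors on lam and psi and giving y a multivariate normal likelihood with this covariance (or, equivalently, declaring eta as a latent parameter).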

The causal interpretation is a separate issue from the fitted model. I think the usual causal interpretations of structural equation models are typically over-ambitious: without making lots of assumptions, there’s a limit to how much causal knowledge you can get from observational data, and traditional structural equation modeling does not make a lot of formal assumptions. Fitting a structural equation model in Stan won’t solve this problem, because even if you put strong priors on the parameters in the model, this doesn’t give you priors on the causal inferences. From a statistical perspective, causal inference corresponds to predictions about potential outcomes, and structural equation models, as traditionally written, just model the data; they don’t model potential outcomes. Some of these concerns are discussed in the causal inference chapters of my book with Jennifer Hill. We don’t talk about structural equation models, but our general discussions of causal inference should be relevant to understanding these issues.

tl;dr: I think Stan’s an excellent way to fit a structural equation model, considering it as a probability model, a math problem to fit a model to data. To go causal (which is the usual purpose of structural equation modeling), you might not want to fit a structural equation model at all!

There’s one other thing, which is that these models can be so big that people often try to simplify them, or to estimate some underlying structure, by using rules such as statistical significance or Bayes factors to remove links from the model. I generally don’t like this practice of trying to estimate causal structure from data. I discuss this a bit in my 2011 paper, Causality and Statistical Learning.

OHDSI COVID-19 study-a-thon.

The OHDSI COVID-19 study-a-thon started early on Thursday morning – 3 am for me.

The wrap-up session – of the START of the Odyssey that needs to continue – will be available at 7 pm Eastern time (EDT).

This will give anyone who might be able to contribute to a worldwide collaboration – one aimed at enabling better decision making and research on COVID-19 – a sense of what has happened so far.

I’ll add the link when I get it; if anyone in the comments gets it first, please share it here.

Sorry for the delay – this is the link.

Slides are now available. Other groups will be re-running the analyses on other data providers’ data and pointing out what they learned.


Introduction – Daniel Prieto-Alhambra and Patrick Ryan (Slides)
Literature Review – Jennifer Lane (22:00 • Slides)
Data Network In Action – Kristin Kostka (26:10 • Slides)
Phenotype Development – Anna Ostropolets (31:38 • Slides)
Clinical Characterization of COVID-19 – Ed Burn (42:10 • Slides)
The Journey Through Patient-Level Prediction – Peter Rijnbeek (50:12 • Slides)
Prediction #1: Amongst Patients Presenting with COVID-19, Influenza, or Associated Symptoms, Who Are Most Likely to be Admitted to the Hospital in the Next 30 Days? – Jenna Reps (56:55 • Slides)
Prediction #2: Amongst Patients at GP Presenting with Virus or Associated Symptoms with/without Pneumonia Who Are Sent Home, Who Are Most Likely to Require Hospitalization in the Next 30 Days? – Ross Williams (1:08:42 • Slides)
Prediction #3: Amongst Patients Hospitalized with Pneumonia, Who Are Most Likely To Require Intensive Services or Die? – Aniek Markus (1:15:25 • Slides)
Estimation #1: Hydroxychloroquine – Daniel Prieto-Alhambra (1:23:32 • Slides)
Estimation #2: Safety of HIV/HepC Protease Inhibitors – Albert Prats (1:31:24 • Slides)
Estimation #3: Association of Angiotensin Converting Enzyme (ACE) Inhibitors and Angiotensin II Receptor Blockers (ARB) on COVID Incidence and Complications – Daniel Morales (1:36:58 • Slides)
#OpenData4COVID19 – Seng Chan You (1:45:32 • Slides)
The Journey Ahead – Patrick Ryan (1:50:28 • Slides)
Questions & Answers – Daniel Prieto-Alhambra, Peter Rijnbeek and Patrick Ryan (2:08:15)

The second derivative of the time trend on the log scale (also see P.S.)

Peter Dorman writes:

Have you seen this set of projections? It appears to have gotten around a bit, with citations to match, and IHME Director Christopher Murray is a superstar. (WHO Global Burden of Disease) Anyway, I live in Oregon, and when you compare our forecast to New York State it gets weird: a resource use peak of April 24 for us and already April 8 for NY. This makes zero sense, IMO.

I looked briefly at the methodological appendix. This is a top-down, curve-fitting exercise, not a bottom-up epi model. They fit three parameters on a sigmoid curve, with the apparent result that NY, with its explosion of cases, simply appears to be further up its curve. Or, which amounts to the same thing, the estimate for the asymptotic limit is waaaay underinformed. These aren’t the sort of models I have worked with in the past, so I’m interested in how experienced hands would view it.

I have a few thoughts on this model. First, yeah, it’s curve-fitting, no more and no less. Second, if they’re gonna fit a model like this, I’d recommend they just fit it in Stan: the methodological appendix has all sorts of fragile nonlinear-least-squares stuff that we don’t really need any more. Third, I guess there’s nothing wrong with doing this sort of analysis, as long as it’s clear what the assumptions are. What the method is really doing is using the second derivative of the time trend on the log scale to estimate where we are on the curve. Once that second derivative goes negative, so the exponential growth is slowing, the model takes this as evidence that the rate of growth on the log scale will rapidly continue to go toward zero and then go negative. Fourth, yeah, what Dorman says: you can’t take the model for the asymptotic limit seriously. For example, in that methodological appendix, they say that they use the probit (“ERF”) rather than the logit curve because the probit fits the data better. That’s fine, but there’s no reason to think that the functional form at the beginning of the spread of a disease will match the functional form (or, for that matter, the parameters of the curve) at later stages. It really is the tail wagging the dog.

In summary: what’s relevant here is not the curve-fitting model but rather the data that show a negative second derivative on the log scale—that is, a decreasing rate of increase of deaths. That’s the graph that you want to focus on.
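The second-derivative point can be made concrete in a few lines of Python (a sketch with made-up numbers, not the IHME data): take logs of the cumulative counts, difference once to get the growth rate, and difference again to see whether that rate is rising or falling.

```python
import numpy as np

# Hypothetical cumulative death counts (made-up numbers, not real data):
# early doubling, then a slowdown.
deaths = np.array([10, 20, 40, 80, 150, 260, 400, 560, 720, 860])

log_d = np.log(deaths)
growth = np.diff(log_d)   # first derivative: growth rate on the log scale
accel = np.diff(growth)   # second derivative: change in that growth rate

# While accel is zero the counts are doubling at a constant rate; once it
# goes negative, exponential growth is slowing -- the signal that the
# curve-fitting model keys on to decide where we are on the curve.
print(np.round(growth, 3))
print(np.round(accel, 3))
```

Note that this is exactly the quantity the model extrapolates from, which is why the raw second-difference plot carries the relevant information without any sigmoid assumptions.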

Relatedly, Mark Tuttle points to this news article by Joe Mozingo that reports:

Michael Levitt, a Nobel laureate and Stanford biophysicist, began analyzing the number of COVID-19 cases worldwide in January and correctly calculated that China would get through the worst of its coronavirus outbreak long before many health experts had predicted. Now he foresees a similar outcome in the United States and the rest of the world. While many epidemiologists are warning of months, or even years, of massive social disruption and millions of deaths, Levitt says the data simply don’t support such a dire scenario — especially in areas where reasonable social distancing measures are in place. . . .

Here’s what Levitt noticed in China: On Jan. 31, the country had 46 new deaths due to the novel coronavirus, compared with 42 new deaths the day before. Although the number of daily deaths had increased, the rate of that increase had begun to ease off. In his view, the fact that new cases were being identified at a slower rate was more telling than the number of new cases itself. It was an early sign that the trajectory of the outbreak had shifted. . . .

Three weeks later, Levitt told the China Daily News that the virus’ rate of growth had peaked. He predicted that the total number of confirmed COVID-19 cases in China would end up around 80,000, with about 3,250 deaths. This forecast turned out to be remarkably accurate: As of March 16, China had counted a total of 80,298 cases and 3,245 deaths . . .

[Not really; see P.S. below. — AG]

Now Levitt, who received the 2013 Nobel Prize in chemistry for developing complex models of chemical systems, is seeing similar turning points in other nations, even those that did not instill the draconian isolation measures that China did.

He analyzed data from 78 countries that reported more than 50 new cases of COVID-19 every day and sees “signs of recovery” in many of them. He’s not focusing on the total number of cases in a country, but on the number of new cases identified every day — and, especially, on the change in that number from one day to the next. . . .

The news article emphasizes that trends depend on behavior, so they’re not suggesting that people stop with the preventive measures; rather, the argument is that if we continue on the current path, we’ll be ok.

Tuttle writes:

An important but subtle claim here is that the noise in the different sources of data cancels out. To be exact, here’s the relevant paragraph from the article:

Levitt acknowledges that his figures are messy and that the official case counts in many areas are too low because testing is spotty. But even with incomplete data, “a consistent decline means there’s some factor at work that is not just noise in the numbers,” he said. In other words, as long as the reasons for the inaccurate case counts remain the same, it’s still useful to compare them from one day to the next.

OK, a few thoughts from me now:

1. I think Mozingo’s news article and Levitt’s analysis are much more clear than that official-looking report with the fancy trend curves. [Not really; see P.S. below. — AG] That said, sometimes official-looking reports and made-up curves get the attention, so I guess we need both approaches.

2. The news article overstates the success of Levitt’s method. It says that Levitt predicted 80,000 cases and 3,250 deaths, and what actually happened was 80,298 cases and 3,245 deaths. That’s too close. What I’m saying is, even if Levitt’s model is wonderful [Not really; see P.S. below. — AG], he got lucky. Sports Illustrated predicted the Astros would go 93-69 this year. Forgetting questions about the shortened season etc., if the Astros actually went 97-65 or 89-73, we’d say that the SI prediction was pretty much on the mark. If the Astros actually went 93-69, we wouldn’t say that the SI team had some amazing model; we’d say they had a good model and they also got a bit lucky.

3. What to do next? More measurement, at the very least, and also organization for what’s coming next.

P.S. Commenter Zhou Fang points us to this document which appears to collect Michael Levitt’s reports from 2 Feb through 2 Mar. Levitt’s forecasts change over time:

2 Feb [305 deaths reported so far]: “This suggests by linear extrapolation that the number of new deaths will decrease very rapidly over the next week.”

5 Feb [492 deaths reported so far]: “Linear extrapolation, which is not necessarily applicable, suggests the number of new deaths will stop growing and start to decrease over the next week.”

7 Feb [634 deaths reported so far]: “This suggests that the rate of increase in the number of deaths will continue to slow down over the next week. An extrapolation based on the sigmoid function . . . suggests that the number of deaths will not exceed 1000 and that it will exceed 95% of this limiting value on 14-Feb-2020.”

9 Feb [813 deaths reported so far]: “An extrapolation based on the sigmoid function . . . suggests that the number of deaths may not exceed 2000 . . .”

12 Feb [1111 deaths reported so far]: “An extrapolation based on the sigmoid function . . . suggests that the number of deaths should not exceed 2000 . . .”

13 Feb [1368 deaths reported so far]: “This together with the data on Number of Cases in (D) suggests that the rate of increase in the number of deaths and cases will continue to slow down over the next week . . .”

17 Feb [1666 deaths reported so far]: “This suggests that the Total Number of Hubei Deaths could reach 3,300 . . . Note that this analysis is based only on Laboratory Confirmed Cases and does not include the 17,000 Clinically Diagnosed Cases.”

21 Feb [2129 deaths reported so far]: “Better analysis in Fig. 4 gives asymptotic values of 64,000 and 3,000 for Number of Cases and Deaths, respectively.”

23 Feb [2359 deaths reported so far]: “final estimate” of 3,030 deaths

2 Mar [2977 deaths reported so far]: “asymptotic values” of 3,150 deaths in Hubei and 190 for the rest of the country.

Some of the above numbers are Hubei, others are for all of China, but in any case it seems that Zhou Fang is right that the above-linked news article is misleading, and Levitt’s predictions were nothing special, just “overfitting and publication bias.”

I’m still struggling to understand hypothesis testing . . . leading to a more general discussion of the role of assumptions in statistics

I’m sitting at this talk where Thomas Richardson is talking about testing a hypothesis regarding the joint distribution of three variables, X1, X2, X3. The hypothesis being tested is that X1 and X2 are conditionally independent given X3. I don’t have a copy of Richardson’s slides, but here’s a paper that I think is related, just to give you a general sense of his theoretical framework.

The thing that’s bugging me is that I can’t see why anyone would want to do this, test the hypothesis that X1 and X2 are conditionally independent given X3. My problem is that in any situation where these two variables could be conditionally dependent, I think they will be conditionally dependent. It’s the no-true-zeroes thing; see the discussion starting on page 960 here. I’m not really interested in testing a hypothesis that I know is false, that I know would be rejected if I could just gather enough data.

That said, Thomas Richardson is a very reasonable person, so even though his talk is full of things that I think make no sense—he even brought up type 1 and type 2 errors!—I expect there’s something reasonable in all this research, I just have to figure out what.

I can think of a couple of possibilities.

“For the cost of running 96 wells you can test 960 people and accurately assess the prevalence in the population to within about 1%. Do this at 100 locations around the country and you’d have a spatial map of the extent of this epidemic today. . . and have this data by Monday.”

Daniel Lakeland writes:

COVID-19 is tested for using real-time reverse-transcriptase PCR (rt-rt-PCR). This is basically just a fancy way of saying they are detecting the presence of the RNA by converting it to DNA and amplifying it. It has already been shown by people in Israel that you can combine material from at least 64 swabs and still reliably detect the presence of the RNA.

No one has the slightest clue how widespread SARS-Cov-19 infections really are in the population, we’re wasting all our tests testing sick people where the bayesian prior is basically that they have it already, and the outcome of the test mostly doesn’t change the treatment anyway. It’s stupid.

To make decisions about how much physical isolation and shutdown and things we need, we NEED real-time monitoring of the prevalence in the population.

Here’s my proposal:

Mobilize military medical personnel around the country to 100 locations chosen randomly proportional to the population. (the military is getting salaries already, marginal cost is basically zero).

In each location set up outside a grocery store.

Swab 960 people as they enter the grocery store. Sort the swab vials in random order.

From each vial, extract RNA into a tube, and combine the first 10 tubes into well 1, second 10 tubes into well 2 etc… for a 96 well PCR plate (this is a standard sized PCR tray used in every bio lab in the country).

Run the machines and get back a count of positive wells for each tray…

Use a beta(2,95) prior for the frequency of SARS-Cov-19 infection, this has high probability density region extending from 0 to about 10% prevalence, with the highest density region between around 0.5 and 5%, an appropriate prior for this application.

let f be the frequency in the population, then let ff = 1-dbinom(0,10,f), then ff is the frequency with which a randomly selected well with 10 samples will have *one or more* swab positive. The likelihood for N wells to come positive is then dbinom(N,96,ff)

Doing a couple lines of simulation, for the cost of running 96 wells you can test 960 people and accurately assess the prevalence in the population to within about 1%. Do this at 100 locations around the country and you’d have a spatial map of the extent of this epidemic today.

There is NO reason you couldn’t mobilize military resources later today to do this swabbing, and have this data by Monday.

This kind of pooled sampling is a well known design, so I assume the planners at the CDC have already thought of this. On the other hand, if they were really on top of things, they’d have had a testing plan back in January, so really I have no idea.

The innovation of Lakeland’s plan is that you can use a statistical model to estimate prevalence from this pooled data. When I’ve seen the pooled-testing design in textbooks, it’s been framed as a problem of identifying the people who have the disease, not of estimating prevalence rates.
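Lakeland’s calculation can be sketched in a few lines of Python (the post states it in R’s `dbinom` notation; the Beta(2, 95) prior and the 96-well, 10-swab design come from the post, while the example of 18 positive wells is a made-up illustration):

```python
import numpy as np
from math import comb

def prevalence_posterior(n_pos, wells=96, pool=10):
    """Grid posterior for prevalence f, given n_pos positive pooled wells."""
    f = np.linspace(1e-4, 0.2, 2000)            # grid over prevalence
    ff = 1 - (1 - f) ** pool                    # P(a 10-swab well is positive | f)
    prior = f ** (2 - 1) * (1 - f) ** (95 - 1)  # unnormalized Beta(2, 95) density
    lik = comb(wells, n_pos) * ff ** n_pos * (1 - ff) ** (wells - n_pos)
    post = prior * lik
    post /= post.sum() * (f[1] - f[0])          # normalize on the grid
    return f, post

# Hypothetical example: 18 of 96 wells come back positive.
f, dens = prevalence_posterior(18)
step = f[1] - f[0]
mean = (f * dens).sum() * step
sd = np.sqrt(((f - mean) ** 2 * dens).sum() * step)
print(f"posterior mean {mean:.3f}, sd {sd:.3f}")
```

With 18 positive wells the posterior concentrates around a prevalence of roughly 2%, with a posterior standard deviation of about half a percent, consistent with the “to within about 1%” claim in the quote.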

Let’s do preregistered replication studies of the cognitive effects of air pollution—not because we think existing studies are bad, but because we think the topic is important and we want to understand it better.

In the replication crisis in science, replications have often been performed of controversial studies on silly topics such as embodied cognition, extra-sensory perception, and power pose.

We’ve been talking recently about replication being something we do for high-quality studies on important topics. That is, the point of replication is not the hopeless endeavor of convincing ESP scholars etc. that they’re barking up the wrong tree, but rather to learn about something we really care about.

With that in mind, I suggest that researchers perform some careful preregistered replications of some studies of air pollution and cognitive function.

I thought about this after receiving this in email from Russ Roberts:

You may have seen this. It is getting a huge amount of play with people amazed and scared by how big the effects are on cognition.

THIRTY subjects. Most results not significant.

If you feel inspired to write on it, please do…

What I found surprising was how many smart people I know have been taken by how large the effects are. Mostly economists. I have now become sensitized to be highly skeptical of these kinds of findings. (Perhaps too skeptical, but put that to the side…)

Patrick Collison, not an economist, but CEO of Stripe and a very smart person, posted this, which has been picked up and spread by The Browser which has a wide and thoughtful audience and by Collison on twitter. Collison’s piece is a brief list of other studies that “confirm” the cognitive losses due to air pollution.

My general reaction (using one of the studies) is that if umpires do a dramatically worse job on high pollution days because their brains are muddled by pollution, there must have been a massive (and noticeable) improvement in accuracy over the last 40 years as particulate matter has fallen in the US. Same with chess players—another study—there should be many more grandmasters and the quality of chess play overall in the US should be dramatically improved.

The big picture

There are some places where I agree with Roberts and some places where I disagree. I’ll go through all this in a moment, but first I want to set out the larger challenges that we face in this sort of problem.

I agree on the general point that we should be skeptical of large claims. A difficulty here is that the claims come in so fast: There’s a large industry of academic research producing millions of scientific papers a year, and on the other side there are about 5 of us who might occasionally look at a paper critically. A complicating factor here is that some of these papers are good, some of the bad papers can have useful data, and even the really bad papers have some evidence pointing in the direction of their hypotheses. So the practice of reading the cited papers is just not scalable.

Even in the above little example, Collison links to 9 articles, and it’s not like I have time to read all of them. I skimmed through the first one (The Impact of Indoor Climate on Human Cognition: Evidence from Chess Tournaments, by Steffen Künn, Juan Palacios, and Nico Pestel) and it seemed reasonable to me.

Speaking generally, another challenge is that if we see serious problems with a paper (as with the first article discussed above in this post), we can set it aside. The underlying effect might be real, but that particular study provides no evidence. But when a paper seems reasonable (as with the article on chess performance), it could just be that we haven’t noticed the problems yet. Recall that the editors of JPSP didn’t see the huge (in retrospect) problems with Bem’s ESP study, and recall that Arthur Conan Doyle didn’t realize that these photos were faked.

To get back to Roberts’s concerns: I have no idea what are the effects of air pollution on cognitive function. I really just don’t know what to think. I guess the way that researchers are moving forward on this is to look at various intermediate outcomes such as blood flow to the brain.

To step back: on one hand, the theory here seems plausible; on the other hand, I know about all the social and statistical reasons why we should expect effect size estimates to be biased upward. There’s a naive view that associates large type S and type M errors with crappy science of the Wansink variety, but even carefully reviewed studies published in top journals by respected researchers have these problems.

Preregistered replication to the rescue

So we’re at an impasse. Plausible theories, some solid research articles with clear conclusions, but this is all happening in a system with systematic biases.

This is where careful preregistered replication studies can come in. The point of such studies is not to say that the originally published findings “replicated” or “didn’t replicate,” but rather to provide new estimates that we can use, following the time-reversal heuristic.

Again, the choice to perform the replication should be considered as a sign of respect for the original studies: that they are high enough quality, and on an important enough topic, to motivate the cost and effort of a replication.

Getting into the details

1. I agree with Roberts that the first study he links to has serious problems. I’ll discuss these below the fold, but the short story is that I see no reason to believe any of it. I mean, sure, the substantive claims might be true, but if the estimates in the article are correct, it’s really just by accident. I can’t see the empirical analysis adding anything to our understanding. It’s not as bad as that beauty-and-sex-ratio study which, for reasons of statistical power, was doomed from the start—but given what’s reported in the published paper, the data are too noisy to be useful.

2. As noted above, I looked quickly at the first paper on Collison’s list and I saw no obvious problems. Sure, the evidence is only statistical—but we sometimes can learn from statistical evidence. For reasons of scalability (see above discussion), I did not read the other articles on the list.

3. I’d like to push against a couple of Roberts’s arguments. Roberts writes:

If umpires do a dramatically worse job on high pollution days because their brains are muddled by pollution, there must have been a massive (and noticeable) improvement in accuracy over the last 40 years as particulate matter has fallen in the US.

Actually, I expect that baseball umpires have been getting much more accurate over the past 40 years, indeed over the past century. In this case, though, I’d think that economics (baseball decisions are worth more money), sociology (the increasing professionalization of all aspects of sports), and technology (umpires’ mistakes are clear on TV) would all push in that direction. I’d guess that air pollution is minor compared to these large social effects. In addition, the findings of these studies are relative, comparing people on days with more or less pollution. A rise or decline in the overall level of pollution, that’s different: it’s perfectly plausible that umps do worse on polluted days than on clear days because their bodies are reacting to an unexpected level of strain, and the same effect would not arise from higher pollution levels every day.

Roberts continues:

Same with chess players . . . there should be many more grandmasters and the quality of chess play overall in the US should be dramatically improved.

Again, I think it’s pretty clear that the quality of chess play overall has improved, at least at the top level. But, again, any effects of pollution would seem to be minor compared to social and technological changes.

So I feel that Roberts is throwing around a bit too much free-floating skepticism.