
How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis

I don’t have much to say about this one, as Shravan wrote pretty much all of it. It’s a study of how to apply our general advice to “accept uncertainty,” in a specific area of research in linguistics:

Deep learning workflow

Ido Rosen points us to this interesting and detailed post by Andrej Karpathy, “A Recipe for Training Neural Networks.” It reminds me a lot of various things that Bob Carpenter has said regarding the way that some fitting algorithms are often oversold because the presenters don’t explain the tuning that was required to get good answers. Also I like how Karpathy presents things; it reminds me of Bayesian workflow.

The only thing I’d add is fake-data simulation.

I’m also interested in the ways that deep-learning workflow differs, or should differ, from our Bayesian workflow when fitting traditional models. I don’t know enough about deep learning to know what to say about this, but maybe some of you have some ideas.
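Fake-data simulation deserves a quick illustration. Here's a minimal sketch in Python (a toy linear model of my own, not anything from Karpathy's post): choose known parameter values, simulate data from the model, fit, and check that the fit recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: choose "true" parameter values for a simple linear model.
true_intercept, true_slope, true_sigma = 1.0, 2.5, 0.5

# Step 2: simulate fake data from the assumed model.
n = 1000
x = rng.uniform(0, 1, n)
y = true_intercept + true_slope * x + rng.normal(0, true_sigma, n)

# Step 3: fit the model to the fake data (here, ordinary least squares).
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = resid.std(ddof=2)

# Step 4: check that the fit recovers the known truth.
print(beta_hat)   # should be close to (1.0, 2.5)
print(sigma_hat)  # should be close to 0.5
```

If the fit can't recover parameters from data the model itself generated, there's no reason to trust it on real data; that's the whole point of the exercise, and it applies to neural nets as much as to regressions.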

Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC

With Aki, Dan, Bob, and Paul:

Markov chain Monte Carlo is a key computational tool in Bayesian statistics, but it can be challenging to monitor the convergence of an iterative stochastic algorithm. In this paper we show that the convergence diagnostic R-hat of Gelman and Rubin (1992) has serious flaws. R-hat will fail to correctly diagnose convergence failures when the chain has a heavy tail or when the variance varies across the chains. In this paper we propose an alternative rank-based diagnostic that fixes these problems. We also introduce a collection of quantile-based local efficiency measures, along with a practical approach for computing Monte Carlo error estimates for quantiles. We suggest that common trace plots should be replaced with rank plots from multiple chains. Finally, we give recommendations for how these methods should be used in practice.

This article is the culmination of several years of discussion and examples, starting in 2014 or so when Kenny Shirley came across an example where multiple chains had countervailing trends, which motivated the development of split R-hat. It’s fun to be able to shoot down and then improve my own method!

I expect that we and others will do more work on this and related areas as we continue to improve aspects of Stan (and statistical computing more generally) involving adaptation and convergence. For a start, see this vignette by Jonah Gabry and Martin Modrak, “Visual MCMC diagnostics using the bayesplot package.”
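For reference, here's a minimal sketch of the classic split-R-hat computation that the paper improves on (this is the pre-rank-normalization version, and the chains-by-draws array layout is my own assumption):

```python
import numpy as np

def split_rhat(chains):
    """Classic split-R-hat: split each chain in half, then compare
    between-chain and within-chain variance.
    chains: array of shape (n_chains, n_draws)."""
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    # Splitting each chain in half catches within-chain trends
    # that whole-chain comparisons miss.
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]])
    m, n = splits.shape
    chain_means = splits.mean(axis=1)
    chain_vars = splits.var(axis=1, ddof=1)
    between = n * chain_means.var(ddof=1)   # B: variance between chains
    within = chain_vars.mean()              # W: average variance within chains
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(1)
good = rng.normal(size=(4, 1000))       # stationary, well-mixed chains
bad = good + np.arange(4)[:, None]      # chains stuck at different levels

print(split_rhat(good))  # close to 1
print(split_rhat(bad))   # well above 1
```

The paper's rank-normalized version replaces the draws with normal scores of their ranks before this computation, which is what fixes the heavy-tail and unequal-variance failure modes described in the abstract.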

Vaping statistics controversy update: A retraction and some dispute

A few months ago we reported on two controversies regarding articles in the medical literature on the risks of e-cigarettes (vaping).

One of the controversial papers was “Electronic cigarette use and myocardial infarction among adults in the US Population Assessment of Tobacco and Health [PATH],” by Dharma N. Bhatta and Stanton A. Glantz, published in the Journal of the American Heart Association. That paper had a data issue: in one of their analyses, they were assessing the effects of e-cigarette use on heart attacks, but, as pointed out by critic Brad Rodu, “11 of the 38 current e-cigarette users in their study had a heart attack years before they first started using e-cigarettes.” At the time, I suggested that the authors of the article perform a corrected analysis removing the data from the people who had heart problems before they started vaping.

Since our post, some things have happened.

First, a letter dated 20 Jan 2020 was sent by 16 public health researchers to the editors of the Journal of the American Heart Association, citing Rodu’s comments and my blog post and urging the journal to do something to resolve the issue.

Second, on 23 Jan 2020, the journal responded with a letter stating, “The AHA continues to follow the guidelines as outlined by the International Committee of Medical Journal Editors (ICJME) and the Committee on Publication Ethics (COPE), which include protecting the integrity of the scientific publishing process. No additional information is available at this time.”

Third, on 18 Feb 2020, the journal issued a retraction! Here it is:

Retraction to: Electronic Cigarette Use and Myocardial Infarction Among Adults in the US Population Assessment of Tobacco and Health

After becoming aware that the study in the above‐referenced article did not fully account for certain information in the Population Assessment of Tobacco and Health [PATH] Wave 1 survey, the editors of Journal of the American Heart Association reviewed the peer review process.

During peer review, the reviewers identified the important question of whether the myocardial infarctions occurred before or after the respondents initiated e‐cigarette use, and requested that the authors use additional data in the PATH codebook (age of first MI and age of first e‐cigarettes use) to address this concern. While the authors did provide some additional analysis, the reviewers and editors did not confirm that the authors had both understood and complied with the request prior to acceptance of the article for publication.

Post publication, the editors requested Dr. Bhatta et al conduct the analysis based on when specific respondents started using e‐cigarettes, which required ongoing access to the restricted use dataset from the PATH Wave 1 survey. The authors agreed to comply with the editors’ request. The deadline set by the editors for completion of the revised analysis was not met because the authors are currently unable to access the PATH database. Given these issues, the editors are concerned that the study conclusion is unreliable.

The editors hereby retract the article from publication in Journal of the American Heart Association. [original article URL:]

Fourth, also on 18 Feb, Stanton Glantz, the second author of the now-retracted article, posted a response with the combative title, “Journal of American Heart Association caves to pressure from e-cig interests.” In his post, Glantz first reviews the dispute and then continues:

Rodu and a colleague argued that the analysis described above was inadequate because the PATH restricted use dataset had the date of first heart attack and date at which people started using e-cigarettes and that we should have used these two dates to exclude cases rather than the approach we took.

Indeed, one of the peer reviewers had suggested the same analysis. As I [Glantz] detailed in a letter to JAHA, while there was some misunderstanding of the specific supplemental analysis requested by the reviewer, the analysis that we presented during the peer review process substantially addressed the question raised by the reviewer. As I wrote the editor, Dr. London:

In any event, it is important to keep in mind that this discussion is about a supplementary analysis, not the main analysis in the paper. As the paper states, restricting the data as we did substantially dropped the number of MIs and the supplemental analysis was not statistically significant. Reviewer 2 understood and accepted our supplementary analysis and, after we responded to the original comment, recommended publishing the paper as it is with primary analysis (which is based on the whole dataset) despite the issues discussed in this letter.

In addition, doing the additional alternative analysis will not change the main analysis in the paper, which the reviewers and editors accepted.

The normal protocol for raising a technical criticism of a paper would be to write a letter to the journal criticizing the paper. If the editors find the criticism worth airing, they would invite the authors (in this case, Dr. Bhatta and me) to respond, then publish both letters and allow the scientific community to consider the issue.

Indeed, Rodu has published several letters and other publications criticizing our work, most recently about a paper I and other colleagues published in Pediatrics about the gateway effect of e-cigarette use on subsequent youth smoking. . . .

Rather than following this protocol, I first learned of Rodu’s criticism when USA Today called me for a response to his criticism. I was subsequently contacted by the Journal of the American Heart Association regarding Rodu’s criticism. I responded by suggesting the editors invite Rodu to publish his criticism in enough detail for Dr. Bhatta and I to respond, as well as accurately disclose his links to the tobacco industry.

Instead, the editors of the Journal of the American Heart Association demanded that Dr. Bhatta do additional analysis that deleted heart attacks before people may have used e-cigarettes as Rodu wanted rather than as how we did in the subsidiary analysis in the paper.

Dr. Bhatta and I have no issue with doing such additional analysis. Indeed, we prepared the statistical code to do so last November. (I doubt that the results will be materially different from what is in the paper, but one cannot be sure until actually running the analysis.)

The problem is that, during the process of revising the paper in response to the reviewers, we reported some sample size numbers without securing advance approval from the University of Michigan, who curates the PATH restricted use dataset. This was a blunder on our part. As a result, the University of Michigan has terminated access to the PATH restricted use dataset, not only for Dr. Bhatta and me, but for everyone at UCSF.

As part of our effort to remedy our mistake, we have published a revised version of the table in question (Table S6 in the paper) deleting the sample size numbers that had not been properly cleared with the University of Michigan. (Doing so did not materially change the paper.) . . .

Now, under continuing pressure from e-cigarette advocates (link 1, link 2), the editors of the Journal of the American Heart Association have retracted the paper because, without access to the PATH restricted use dataset, we have not been able to do the additional analysis.

The editors also gave Dr. Bhatta and me the option of retracting the paper ourselves. We have not retracted the paper because, despite the fact that we have not been able to do the additional analysis Rodu is demanding, we still stand behind the paper. . . .

I read this response with an open mind. Glantz makes three points:

1. Procedurally, he’s not happy that Rodu contacted the news media with his criticism and that a group of researchers contacted the journal. He would’ve been happier had Rodu submitted a letter to the journal, and then the journal could publish Rodu’s letter and a response by Bhatta and Glantz. I disagree with Glantz on this one. He and Bhatta made an error in their paper! When you make an error, you should fix it. To have a sequence of letters back and forth, that just muddies the waters. Especially considering that the previous time that Rodu found an error in a paper by Glantz and coauthors (see “Episode 1” here), they just brushed the criticism aside.

If someone points out an error in your work, you should correct the error and thank the person. Not attack and try to salvage your position with procedural arguments.

2. Glantz says that they wanted to do the recommended analysis, but now they can’t, because they don’t have access to the data anymore. That’s too bad, but then you gotta retract the article. If they ever get their hands on the data again, then they can redo the analysis. But, until then, no. You don’t get credit for an analysis you never did. They had access to the data, they messed up. Maybe next time they should be more careful. But the Journal of the American Heart Association should not keep a wrong analysis in print, just because the people who did that wrong analysis are no longer situated to do it right.

In his blog, Glantz wrote, “I doubt that the results will be materially different from what is in the paper, but one cannot be sure until actually running the analysis.” Exactly! One cannot be sure. A scientific journal can’t just go around publishing claims because the authors doubt that the correct analysis will be materially different. That kind of thing can be published in the Journal of Stanton Glantz’s Beliefs, or on his blog—hey, I understand, I have a blog too, and it’s a great place to publish speculations—but not in the Journal of the American Heart Association.

3. Finally, Glantz argues that none of this really matters because “this discussion is about a supplementary analysis, not the main analysis in the paper.” The argument is that the main analysis is the cross-sectional correlational analysis; this supplementary analysis, which addresses causation, doesn’t really matter. This may be true—maybe the cross-sectional analysis was enough, on its own, to be published. But the paper as published did not just include the cross-sectional analysis; indeed, it seems that this analysis over time was required by the journal reviewers.

So, to recap:
– On the original submission, the reviewers said the cross-sectional analysis was not enough. They “requested that the authors use additional data in the PATH codebook (age of first MI and age of first e‐cigarettes use) to address this concern.”
– Bhatta and Glantz did the requested analysis. But they did it wrong.
– When they were asked to do the analysis correctly, Bhatta and Glantz were not able to do so because they did not have access to the data.
The sequence seems clear. Whether or not Glantz now thinks this “supplementary analysis” was key to the paper, the journal reviewers demanded it, and at the time of submission, Bhatta and Glantz did not argue that this supplementary analysis was unnecessary or irrelevant; rather, they did it. Or, to be more precise, they appeared to do it, but they screwed up. So I don’t find Glantz’s third point convincing either.

Glantz says that he and Bhatta “still stand behind the paper.” That’s fine. They’re allowed to stand by it. But I’m glad that the Journal of the American Heart Association retracted it. I guess one option could be that the authors could resubmit the paper to a new journal, labeling it more clearly as an opinion piece. That should make everyone happy, right?

Glantz concludes his post with this statement:

The results in the paper are accurately analyzed and reported. That is why we refused to retract the paper.

As I said earlier, we are still hoping to regain access to PATH so that we can do the additional analysis and put this issue behind us.

This to me seems like a horribly unscientific, indeed anti-scientific attitude, for two reasons. First, the results in that paper were not “accurately analyzed and reported.” They screwed up! They’re not being asked to do an “additional analysis”; they’re being asked to do a correct analysis. Second, the goal of science is not to “put this issue behind us.” The goal is to learn about reality.

It might be that all the substantive claims in that retracted paper are correct. That’s fine. If so, maybe someone can do the research to make the case. My problem is not with the authors of that paper believing their claims are true; my problem is with them misrepresenting the evidence, even if the misrepresentation was purely an accident or honest mistake.

Conflicts of interest

There’s some further background.

I looked up David Abrams, the first author of the letter sent to the journal. He’s a professor of public health at New York University, his academic training is in clinical psychology, and he’s an expert on cigarette use. For example, one of his recent papers is, “How do we determine the impact of e-cigarettes on cigarette smoking cessation or reduction? Review and recommendations for answering the research question with scientific rigor,” and another is “Managing nicotine without smoke to save lives now: Evidence for harm minimization.” A web search brought me to this article, “Don’t Block Smokers From Becoming Smoke-Free by Banning Flavored Vapes,” on a website called Filter, which states, “Our mission is to advocate through journalism for rational and compassionate approaches to drug use, drug policy and human rights.” Filter is owned and operated by The Influence Foundation, which has received support from several organizations, including Juul Labs, Philip Morris International, and Reynolds American, Inc.

Brad Rodu, my original correspondent on this matter, is the first holder of the Endowed Chair in Tobacco Harm Reduction Research at the University of Louisville’s James Graham Brown Cancer Center. He is trained as a dentist and is also a senior fellow of The Heartland Institute, which does not make public the names of its donors but which has been funded by Altria, owner of Philip Morris USA. Rodu also has been directly funded by the tobacco industry.

On the other side, Stanton Glantz is a professor of tobacco control at the University of California, and his academic training is in engineering. He’s been an anti-smoking activist for many years and has made controversial claims about the risks of secondhand smoke. He’s been funded by the U.S. Food and Drug Administration and the National Institutes of Health.

So, several of the people involved in this controversy have conflicts. In their letter to the journal, Abrams et al. write, “The signatories write in a personal capacity and declare no competing interests with respect to tobacco or e-cigarette industries.” I assume this implies that Abrams is not directly funded by Juul Labs, Philip Morris International, etc.; he just writes for an organization that has this funding, so it’s not a direct competing interest. But in any case these researchers all have strong pre-existing pro- or anti-vaping commitments.

That’s ok. As I’ve written in other contexts, I’m not at all opposed to ideologically committed research. I personally have no strong take on vapes, but it makes sense to me that the people who study the topic most intently have strong views on the topic and have accepted money from interested parties. To point out the above funding sources and cigarette company links is not to dismiss the work being done in that area.

Also, just to speak more generally, I’ve taken $ from lots of places. Go to this list and you might well find an organization that you find noxious. So I’m not going around slamming anyone for working for cigarette companies or anyone else.

I will say one thing, though. On the webpage of the Heartland Institute is the following statement: “Heartland’s long-standing position on tobacco is that smoking is a risk factor for many diseases; we have never denied that smoking kills.” Now consider the following quotations from leaders of the cigarette industry:

Philip Morris Vice President George Weissman in March 1954 announced that his company would “stop business tomorrow” if “we had any thought or knowledge that in any way we were selling a product harmful to consumers.”

James C. Bowling, the public relations guru and Philip Morris VP, in a 1972 interview asserted, “If our product is harmful . . . we’ll stop making it.”

Then again in 1997 the same company’s CEO and chairman, Geoffrey Bible, was asked (under oath) what he would do with his company if cigarettes were ever established as a cause of cancer. Bible gave this answer: “I’d probably . . . shut it down instantly to get a better hold on things.”

Lorillard’s president, Curtis Judge, is quoted in company documents: “if it were proven that cigarette smoking causes cancer, cigarettes should not be marketed.”

R. J. Reynolds president, Gerald H. Long, in a 1986 interview asserted that if he ever “saw or thought there were any evidence whatsoever that conclusively proved that, in some way, tobacco was harmful to people, and I believed it in my heart and my soul, then I would get out of the business.”

Given that the Heartland Institute takes the position that smoking kills, and given that they’re in contact with Philip Morris, RJR, etc., maybe they could remind the executives of these companies of the position that their predecessors took—“If our product is harmful . . . we’ll stop making it”—and ask what they should be doing next.

Holes in Bayesian Statistics

With Yuling:

Every philosophy has holes, and it is the responsibility of proponents of a philosophy to point out these problems. Here are a few holes in Bayesian data analysis: (1) the usual rules of conditional probability fail in the quantum realm, (2) flat or weak priors lead to terrible inferences about things we care about, (3) subjective priors are incoherent, (4) Bayes factors fail in the presence of flat or weak priors, (5) for Cantorian reasons we need to check our models, but this destroys the coherence of Bayesian inference.

Some of the problems of Bayesian statistics arise from people trying to do things they shouldn’t be trying to do, but other holes are not so easily patched. In particular, it may be a good idea to avoid flat, weak, or conventional priors, but such advice, if followed, would go against the vast majority of Bayesian practice and requires us to confront the fundamental incoherence of Bayesian inference.

This does not mean that we think Bayesian inference is a bad idea, but it does mean that there is a tension between Bayesian logic and Bayesian workflow which we believe can only be resolved by considering Bayesian logic as a tool, a way of revealing inevitable misfits and incoherences in our model assumptions, rather than as an end in itself.

This paper came from a talk I gave a few months ago at a physics conference. For more on Bayesian inference and the two-slit experiment, see this post by Yuling and this blog discussion from several years ago. But quantum probability is just a small part of this paper. Our main concern is to wrestle with the larger issues of incoherence in Bayesian data analysis. I think there’s more to be said on the topic, but it was helpful to write down what we could now. Also I want to make clear that these are real holes. This is different from my article, “Objections to Bayesian statistics,” which discusses some issues that non-Bayesians or anti-Bayesians have had, which I do not think are serious problems with Bayesian inference. In contrast, the “holes” discussed in this new article are real concerns to me.
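Hole (2) is easy to illustrate with a toy example (mine, not from the paper): a single noisy measurement under a flat prior yields a confident-sounding posterior probability, which an informative prior appropriately deflates.

```python
from scipy.stats import norm

# A noisy study estimates an effect of 1 unit with standard error 1.
y, se = 1.0, 1.0

# Flat prior: the posterior is just the likelihood, Normal(y, se^2),
# so we would report an 84% chance that the effect is positive.
flat_prob_positive = 1 - norm.cdf(0, loc=y, scale=se)

# An informative prior, say Normal(0, 0.5^2) for a field where true
# effects tend to be small, shrinks the estimate: the posterior is
# normal with precision-weighted mean and variance.
prior_sd = 0.5
post_var = 1 / (1 / se**2 + 1 / prior_sd**2)
post_mean = post_var * (y / se**2)
informative_prob_positive = 1 - norm.cdf(0, loc=post_mean, scale=post_var**0.5)

print(flat_prob_positive)         # about 0.84
print(informative_prob_positive)  # about 0.67, much closer to 50/50
```

The flat-prior answer looks decisive but is driven almost entirely by a single noisy data point; that is the sense in which weak priors "lead to terrible inferences about things we care about."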

Recent unpublished papers

You perhaps notice our published papers when they appear in journals:

But we also have lots of unpublished papers. I’ll blog them one at a time so that you can see each one and have a chance to make comments before publication.

What up with red state blue state?

Jordan Ellenberg writes:

I learned from your book that Democrats doing better in richer counties and Republicans doing better in poorer counties did not imply that richer people were more likely to vote for Democrats and that in fact, the opposite is true. I do wonder, though, to what extent that’s changing with the current realignments, for example see here, which shows that the “rich districts vote Democratic” effect has certainly gotten stronger since your book came out. I wonder if, as Dems consolidate strength in both cities and suburbs while the GOP asserts dominance over rural America, we will actually start to see the income/voting relationship switch signs at the individual level? Or are we still just seeing the familiar scenario of high district-level incomes driven by inequality while Dem voting patterns are driven by poorer residents? Do Democratic voters now tend to be rich, or do they just tend to live near the rich…?

My reply: I’m not sure, but see this article about red state blue state in 2012, and section 16 of this article about 2016. Also relevant for considering the long view is this article about the twentieth-century reversal.

MRP Conference registration now open!

Registration for our MRP mini conference/meeting is now open. Please go to the conference website to register. Places are limited, so make sure you register early and don’t miss out!

Abstract submissions will be open until the end of this month.

In addition to the great talks that have already been submitted, I’m super excited because this conference inspired us to make a hex MRP sticker! Created by the wonderful Mitzi Morris, this sticker will be available at the conference.

This conference wouldn’t be possible without the proud support of the Departments of Statistics and Political Science and Institute for Social and Economic Research and Policy at Columbia University.

What’s the American Statistical Association gonna say in their Task Force on Statistical Significance and Replicability?

Blake McShane and Valentin Amrhein point us to an announcement (see page 7 of this newsletter) from Karen Kafadar, president of the American Statistical Association, which states:

Task Force on Statistical Significance and Replicability Created

At the November 2019 ASA Board meeting, members of the board approved the following motion:

An ASA Task Force on Statistical Significance and Reproducibility will be created, with a charge to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors. The task force will be appointed by the ASA President with advice and participation from the ASA BOD. The task force will report to the ASA BOD by November 2020. . . .

Based on the initial meeting, these members decided “replicability” was more in line with the critical issues than “reproducibility” (cf. the National Academy of Sciences report), hence the title of the task force is ASA Task Force on Statistical Significance and Replicability. . . .

Blake and Valentin and I are a little bit concerned that (a) this might become an official “ASA statement on Statistical Significance and Replicability” and could thus have an enormous influence, and (b) the listed committee seems like a bunch of reasonable people, no bomb-throwers like us or Nicole Lazar or John Carlin or Sander Greenland or various others to represent the voice of radical reform. We’re all reasonable people too, but we’re reasonable people who start from the perspective that, whatever its successes in engineering and industrial applications, null hypothesis significance testing has been a disaster in areas like social, psychological, environmental, and medical research—not the perspective that it’s basically a good idea that just needs a little bit of tinkering to apply to such inexact sciences.

I respect the perspectives of the status-quo people, the “centrists,” as it were—they represent a large group of the statistics community and should be part of any position taken by the American Statistical Association—but I think our perspective is important too.

I also don’t think that concerns about null hypothesis significance testing should be placed into a Bayesian/frequentist debate, with a framing that the Bayesians are foolish idealists and the frequentists are the practical people . . . that might have been the case 50 years ago, but it’s not the case now. As we have repeatedly written, the problem with thresholds is that they are used to finesse real uncertainty, and that’s an issue whether the threshold is based on p-values or posterior probabilities or Bayes factors or whatever. Again, we recognize and respect opposing views on this; our concern here is that the ASA discussion represents our perspective too, a perspective we believe is well supported on theoretical grounds and is also highly relevant to the recent replication crises in many areas of science.

This post is to stimulate some publicly visible discussion before the task force reports to the ASA board and in particular before the ASA board comes to a decision. The above-linked statement informs us that the leaders of this effort welcome input and are working on a mechanism for receiving comments from the community.

So go for it! As usual, feel free in the comments to disagree with me.

Calling all cats

Those of you familiar with this blog will have noticed that it regularly features cats. For example, the majestic cat featured last week, this lover of Bayesian data analysis, and even my own cat, Jazz, featured here.

Sometimes there’s not quite the right cat picture out there – Andrew has even resorted to requesting cat pictures to bring a little sass to the statistical discussions. A few months ago I got the idea of creating a repo of cats who like statistics. Or cats whose owners like statistics. If you know such a cat and you want them to be statistics-blog famous, then submit their photo through this Google form so we can feature them!

An article in a statistics or medical journal, “Using Simulations to Convince People of the Importance of Random Variation When Interpreting Statistics.”

Andy Stein writes:

On one of my projects, I had a plot like the one above of drug concentration vs response, where we divided the patients into 4 groups. I look at the data below and think “wow, these are some wide confidence intervals and random looking data, let’s not spend much more time with this” but many others do not agree and we spend lots of time on plots like this. Next time, I’ll do what I finally did today and simulate random noise and show how trends pop up. But I was just wondering if you had any advice on how to teach scientists about how easy it is to find trends in randomness? I’m thinking of writing a little shiny app that lets you tune some of the parameters in the R code below to create different versions of this plot. But I know I can’t be the first person to have this challenge and I thought maybe you or others have given some thought on how to teach this concept – that one can easily find trends in randomly generated data if the sample size is small enough and if you create enough plots.

This all makes sense to me. An ideal outcome would be for Stein and others to write an article for a journal such as Technometrics or American Statistician with a title such as “Using Simulations to Convince People of the Importance of Random Variation When Interpreting Statistics.”

Stein replied:

My ideal outcome would be for there to be an article with luminaries in the field in a journal like NEJM or Lancet where everyone will have to pay attention. Unfortunately, I don’t think my colleagues give enough weight to what’s in the statistics and pharmacometrics journals. If they did, then I wouldn’t be in this predicament! Having such a paper out in the world in a high-impact journal would be enough to make me really happy.

My latest idea is to make a little Shiny App where you upload your dataset and it creates 10 sets of plots – one of the real data and nine where the Y variable has been randomly permuted across all the groups. Then people get to see if they can pick out the real data from the mix. If I get a public version implemented in time, I’ll try and send that along, too.
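Stein's lineup idea can be sketched in a few lines (hypothetical data, and a crude range-of-group-means statistic standing in for "trend"): shuffle the response across groups and see how often pure permutation noise produces a group pattern as impressive as the observed one.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: 40 patients in 4 exposure groups; the response
# here is pure noise, so any apparent trend is an artifact.
groups = np.repeat([0, 1, 2, 3], 10)
response = rng.normal(size=40)

def group_spread(y, g):
    """Range of group means: a crude 'trend' statistic."""
    means = [y[g == k].mean() for k in np.unique(g)]
    return max(means) - min(means)

observed = group_spread(response, groups)

# Lineup: compare the observed spread against many permutations
# of the response across groups.
perm_spreads = [group_spread(rng.permutation(response), groups)
                for _ in range(999)]
p_value = np.mean([s >= observed for s in perm_spreads])

print(observed)  # group means in pure noise can differ noticeably
print(p_value)   # the permutation test puts that spread in context
```

The visual version is the same idea: place the real-data panel among nine permuted panels, and if reviewers can't pick it out, the "trend" is doing no work.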


A normalizing flow by any other name

Another week, another nice survey paper from Google. This time:

What’s a normalizing flow?

A normalizing flow is a change of variables. Just like you learned way back in calculus and linear algebra.

Normalizing flows for sampling

Suppose you have a random variable \Theta with a gnarly posterior density p_{\Theta}(\theta) that makes it challenging to sample. It can sometimes be easier to sample a simpler variable \Phi and come up with a smooth function f such that \Theta = f(\Phi). The implied distribution on \Theta can be derived from the density of \Phi and the appropriate Jacobian adjustment for change in volume,

\displaystyle p_{\Theta}(\theta) = p_{\Phi}(f^{-1}(\theta)) \cdot \left|\, \textrm{det} \textrm{J}_{f^{-1}}(\theta) \,\right|,

where \textrm{J}_{f^{-1}}(\theta) is the Jacobian of the inverse transform evaluated at the parameter value. This is always possible in theory: a uniform distribution on the unit hypercube can serve as the base distribution for any multivariate target, with f taken to be the inverse cumulative distribution function.

Of course, we don’t know the inverse CDFs for our posteriors, or we wouldn’t need to do sampling in the first place. The hope is that we can estimate an approximate but tractable normalizing flow, which, when combined with a standard Metropolis accept/reject step, will be better than working in the original geometry.
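As a concrete one-dimensional sketch, here is the change-of-variables formula checked numerically; the uniform base and exponential target are illustrative choices, not from the post:

```python
import math

# Change of variables: Phi ~ Uniform(0,1), Theta = f(Phi) with
# f(u) = -log(1 - u) / lam, so Theta ~ Exponential(lam).
# (lam and the exponential target are illustrative, made-up choices.)

LAM = 2.0

def f(u):
    return -math.log(1.0 - u) / LAM

def f_inv(theta):
    return 1.0 - math.exp(-LAM * theta)

def density_via_flow(theta):
    # p_Theta(theta) = p_Phi(f_inv(theta)) * |det J_{f_inv}(theta)|
    # Here p_Phi is Uniform(0,1), so its density is 1 on [0,1], and the
    # 1-D "Jacobian determinant" is just d/dtheta of f_inv(theta).
    jac = LAM * math.exp(-LAM * theta)
    return 1.0 * jac

# The flow-based density matches the exponential pdf lam * exp(-lam * theta):
for theta in (0.1, 0.5, 2.0):
    direct = LAM * math.exp(-LAM * theta)
    assert abs(density_via_flow(theta) - direct) < 1e-12
```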

Normalizing flows in Stan

Stan uses changes of variables, aka normalizing flows, in many ways.

First, Stan’s Hamiltonian Monte Carlo algorithm learns (aka estimates) a metric during warmup that is used to provide an affine transform, either just to scale (mean field metric, aka diagonal) or to scale and rotate (dense metric). If Ben Bales’s work pans out, we’ll also have low rank metric estimation soon.
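A toy sketch of what the diagonal (mean field) metric’s affine transform accomplishes; the scale values below are made up, and a dense metric would additionally rotate:

```python
# Sketch of the affine rescaling implied by a diagonal (mean field) metric:
# each coordinate is divided by its estimated posterior standard deviation,
# so the sampler sees coordinates on comparable scales. (Toy numbers.)

posterior_sd = [0.1, 10.0]  # hypothetical per-coordinate scale estimates

def to_sampler_scale(theta):
    # z = theta / sd, coordinate-wise
    return [t / s for t, s in zip(theta, posterior_sd)]

def from_sampler_scale(z):
    # theta = z * sd, the deterministic inverse transform
    return [zi * s for zi, s in zip(z, posterior_sd)]

theta = [0.05, 25.0]
z = to_sampler_scale(theta)
# After rescaling, both coordinates are O(1) rather than spanning
# several orders of magnitude.
assert abs(z[0] - 0.5) < 1e-12 and abs(z[1] - 2.5) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(from_sampler_scale(z), theta))
```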

Second, Stan’s constrained variables are implemented via changes of variables with efficient, differentiable Jacobians. Thank Ben Goodrich for all the hard ones: covariance matrices, correlation matrices, Cholesky factors of these, and unit vectors. TensorFlow Probability calls these transforms “bijectors.” These constrained-variable transforms allow Stan’s algorithms to work on unconstrained spaces. In the case of variational inference, Stan fits a multivariate normal approximation to the posterior, then samples from the multivariate normal and transforms the draws back to the constrained space to get an approximate sample from the model.

Third, we widely recommend reparameterizations, such as the non-centered parameterization of hierarchical models. We used to call that specific transform the “Matt trick” until we realized it already had a name. The point of a reparameterization is to apply the appropriate normalizing flow to make the posterior closer to isotropic Gaussian. Then there’s a deterministic transform back to the variables we really care about.
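A minimal sketch of the non-centered reparameterization for a single scalar; mu, tau, and the Normal model are illustrative stand-ins (in a real hierarchical model these would themselves be parameters):

```python
import random

# Non-centered parameterization sketch (toy values):
#   centered:     theta ~ Normal(mu, tau)
#   non-centered: theta_raw ~ Normal(0, 1); theta = mu + tau * theta_raw
# The sampler works on theta_raw, whose geometry is closer to an isotropic
# Gaussian; theta is then recovered by a deterministic transform.

def non_centered(mu, tau, theta_raw):
    return mu + tau * theta_raw

random.seed(42)
mu, tau = 3.0, 0.5
draws_raw = [random.gauss(0.0, 1.0) for _ in range(10000)]
draws = [non_centered(mu, tau, z) for z in draws_raw]

# The transformed draws are distributed as Normal(mu, tau):
mean = sum(draws) / len(draws)
assert abs(mean - mu) < 0.05
```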

What’s next?

The real trick is automating the construction of these transforms. People hold out a lot of hope for neural networks or other non-parametric function fitters there. It remains to be seen whether anything practical will come out of this that we can use for Stan. I talked to Matt and crew at Google about their work on normalizing flows for HMC (which they call “neural transport”), but they told me it was too finicky to work as a black box in Stan.

Another related idea is Riemannian Hamiltonian Monte Carlo (RHMC), which uses a second-order Hessian-based approximation to normalize the posterior geometry. It’s just very expensive on a per-iteration basis because it requires a differentiable positive-definite conditioning phase involving an eigendecomposition.

This study could be just fine, or not. Maybe I’ll believe it if there’s an independent preregistered replication.

David Allison sent along this article, Sexually arousing ads induce sex-specific financial decisions in hungry individuals, by Tobias Otterbring and Yael Sela, and asked whether I buy it.

I replied that maybe I’ll believe it if there’s an independent preregistered replication. I’ve just seen too many of these sort of things to ever believe them at first sight.

Allison responded:

My intuition agrees with yours. The thing that initially caught my eye was a study with high human interest appeal of an evolutionary psychology finding of the type that some have described as “just so stories.” I too have published such ‘just so stories’ and many people (including me) are drawn to them, but lately questions about their robustness and replicability have been raised. For a recent example, see here.

The second thing that caught my eye was that the key findings involve a higher-order interaction. Higher-order interactions can of course be real, and even prespecified, but when a finding is, by definition, so “conditionally dependent,” it raises the question of whether this is perhaps a chance finding due in part to the multiple testing often inherent in subgroup analyses and higher-order interactions. In addition, interaction tests often have low power, which tends to increase the false positive rate under reasonable assumptions.

In a further look at the paper, I see no statement that the study was preregistered. I also note that there seem to be post hoc data-analytic decisions which, as you have described in your paper on the “garden of forking paths,” may also lead to non-replicable findings. There is no statement that the assignment of subjects to conditions was random. An ad hoc measure of hunger was used when standard pre-existing measures are available (e.g., here), and better still, it would have been relatively easy to randomize subjects to simply skip breakfast and lunch for a day or not, which would have been a more valid approach for assessing the causal effects of hunger.

When all of these factors are put together, it raises skepticism. None of these factors mean that the finding is wrong or that the study is not cool, interesting, and well executed and honestly reported, but it does support the intuition that the result has a low subjective probability of replication.

Is it really true that candidates who are perceived as ideologically extreme do even worse if “they actually pose as more radical than they really are”?

Most of Kruggy’s column today is about macroeconomics, a topic I’m pretty much ignorant of.

But I noticed one political science claim:

It’s easy to make the political case that Democrats should nominate a centrist, rather than someone from the party’s left wing. Candidates who are perceived as ideologically extreme usually pay an electoral penalty; this is especially true if, like Bernie Sanders, they actually pose as more radical than they really are.

The research I’ve seen shows a small average electoral benefit for moderation (see here, for example), so I’m with the Krugmeister on that one. But where did he come up with the claim that extremists pay more of an electoral penalty if they actually pose as more radical than they really are? It’s hard for me to imagine there’s enough data to estimate that; I’m also not quite clear what the claim even is, or how you’d go back and measure candidates’ posed and real ideologies. All of which makes me skeptical.

As always, I’d be happy to be corrected if there’s something I’m missing here.

How many patients do doctors kill by accident?

Paul Kedrosky writes:

There is a longstanding debate in the medical community about how many patients they kill by accident. There are many estimates, all fairly harrowing, but little overall agreement. It’s coming to a boil again, and I’m wondering if you’ve ever looked at the underlying claims and statistical data here.

The most recent paper, Strengthening the Medical Error “Meme Pool”, by Benjamin Mazer and Chadi Nabhan, seems to have a somewhat bizarre argument, that absence of evidence should be evidence of absence, and that “extrapolating” from small samples shouldn’t be allowed given how bad doctors are at determining the actual cause of death.

My reply:

I’m not sure what to think. I’m somewhat sympathetic to the argument presented in that article, although I think the whole “meme” thing adds zero to the value of their discussion. I think the real issue here is that we’ll need some clear definition of “preventable medical error” before talking about their rates.

I do remember a few years ago writing about a ridiculous claim that was made by a data scientist on a similar topic. The data scientist claimed that approximately 75 people a year were dying “in a certain smallish town because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic.” I don’t know if the guy was making this up, or what, but the “certain smallish town” thing really irritated me because it implied some specific knowledge, but the numbers didn’t make sense. The “smallish town” played the same role in this fake-statistics story as the “friend of a friend” plays in traditional urban folklore.

There are certain things you can say that are automatic crowd-pleasers. One such thing is anything against the U.S. health care system. There’s a lot of things to hate about the U.S. health care system but that doesn’t mean we should believe numbers that people just make up.

Researcher offers ridiculous reasons for refusing to reassess work in light of serious criticism

Jordan Anaya writes:

This response from Barbara Fredrickson to one of Nick’s articles got posted the other day.

Alex Holcombe has a screenshot of the article on Twitter.

The issue that I have with the response is that she says she stands by the peer review process that led to her article getting published. But Nick’s critique also underwent peer review! (I assume letters to the journal undergo peer review but see Nick’s tweet here.)

So what’s different between the peer review process that makes her article infallible and Nick’s peer review process that makes his article so bad it’s not worth responding to?

I dunno, but Fredrickson’s reasons for not responding to the criticism (which, by the way, is full of specifics) are absolutely bizarre. She writes:

Readers should be made aware that the current criticisms continue a long line of misleading commentaries and reanalyses by this set of authors that (a) repeatedly target me and my collaborators, (b) dates back to 2013, and (c) spans multiple topic areas. I [Fredrickson] take this history to undermine the professional credibility of these authors’ opinions and approaches.

OK, let’s go through this carefully:

– “a long line …”: These authors have found many problems in the published work by Fredrickson and collaborators.
– “misleading”: That’s cheap talk given that Fredrickson provides zero examples of anything misleading that was written.
– “repeatedly”: Again, the authors found many problems, and the articles in question are still in the published record. Given that this work continues to be cited, if it has errors it should continue to be criticized.
– “target”: Is it “targeting” to point out errors in published papers? If so, why is this a bad thing?
– “dates back to 2013”: OK, so some errors by Fredrickson et al. have been pointed out for several years, so they’re old news. But the articles in question are still in the published record. Given that this work continues to be cited, if it has errors it should continue to be criticized.
– “spans multiple topic areas”: Huh? Is there a rule that people are only supposed to write about one topic area?

This is weird, weird stuff, and as always it makes me sad, more than anything else. Fredrickson is a prominent figure in her field and has a secure job. She could admit her errors and move on. But no, it’s never back down, never admit error. It’s so so sad, to think that someone is in a position to learn from error but refuses to do so. I’ve seen this enough that it hardly surprises me. But I still find it upsetting.

Again, even setting all substance aside, it’s bizarre that Fredrickson proposes that the critics be discounted because they’ve been making these criticisms for several years. The published work is part of a stream of work that is many years old, hence it’s no surprise that the criticism is many years old too.

P.S. Above I wrote that Fredrickson didn’t “respond to” the criticism. Let me clarify that in this case I think a reasonable response would be for her to:

1. Retract the published claims, and
2. Thank the critics for tracking down the problems in those published papers.

That would put her in position to do:

3. Correct the public record (not just in journal articles but also in books, lectures, etc.), explaining what went wrong so that people don’t mistakenly believe those erroneous published claims, and
4. Do some new research trying to figure out what’s really going on, without getting fooled by statistical noise or fake math.

Step 4 is what it’s all about. The ultimate goal is to help people, right?

Making differential equation models in Stan more computationally efficient via some analytic integration

We were having a conversation about differential equation models in pharmacometrics, in particular how to do efficient computation when fitting models for dosing, and Sebastian Weber pointed to this Stancon presentation that included a single-dose model.

Sebastian wrote:

Multiple doses lead to a quick explosion of the Stan code, so things get a bit involved on the coding side.

– In practical applications I [Sebastian] would by now try to integrate over the dosing events. This allows you to avoid making the initial values be vars, so that you gain a lot of speed. The trick is to make the ODE integrator “see” the dosing events and prevent it from stepping over them. The cheap hack to do that is to insert a few observation points around the dosing event. This will make the integrator stop at the dosing time and make it realize that there are sudden changes in one of the compartments. This is not really clean, but CVODES handles it well, and SUNDIALS even documents this technique as one way to handle it.

– I think we should one more time try the adjoint stuff. As we are about to get closures in Stan, it should be straightforward to give the integrator in Stan a log-prob function along with the ODE stuff. Once we have that, we should be able to create something that is a lot faster than what we have now.

– Last time you tried the adjoint stuff, your profiling showed that std::vector stood out. This is because our AD system is darn slow to get the Jacobian of the ODE RHS. We should instead use an analytic derivative here – not sure how to automatically generate it ATM, but you get 2-3x speedups on the existing ODE system when you do that (and I have code around which does the automatic generation of the symbolic Jacobian… but this code is a well-working prototype, not more)… hopefully the OCaml parser can churn these algebraic derivatives out.

– The other thing to explore is sparsity in the ODE RHS. Many ODE systems have a banded structure. One way to get this structural information is by specifying a chemical reaction network. Have a look at the dMod package on CRAN. The cool thing about knowing the sparsity is that we don’t waste effort on derivatives that aren’t needed anyway (they are constant), and the ODE integrator can work more efficiently during the Newton iterations.

He continued:

Whenever we have long observation times of patients, we usually follow a very regular administration pattern. As the dosing compartment often follows a first-order elimination scheme, you can drastically simplify things. These first-order elimination compartments have analytic solutions, and for regular dosing patterns you can use the geometric series to sum up an arbitrary number of dosings very quickly. As a result, you would just use the solution of the dosing compartment as a forcing function to the rest of the ODE system. This way you integrate out one state and you speed up all the calculations a lot.

Moreover, in many applications where we have these long observation times, we actually don’t care so much about getting everything right. We often only have data at steady state, and as such you can drastically simplify the models with respect to the absorption process.

What I am saying is that with the use of our brains we can drastically make our life easier here… now, many people still just dump huge dosing histories into these programs and then complain about long running times…

As I have been burned by performance from this I would say that I found lots of good ways of how to avoid the performance killer imposed by ODE stuff + dosing.
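Sebastian’s geometric-series shortcut can be sketched as follows; the one-compartment setup and parameter values are made-up illustrations:

```python
import math

# Sketch of the geometric-series trick for regular dosing (toy numbers).
# With first-order elimination at rate ke and a dose D given every tau
# hours, the amount in the dosing compartment just after the n-th dose is
#   A_n = D * sum_{k=0}^{n-1} exp(-ke * tau * k)
#       = D * (1 - exp(-ke * tau * n)) / (1 - exp(-ke * tau)),
# so an arbitrary number of doses collapses to one closed-form expression
# instead of a loop over the full dosing history.

def amount_after_n_doses_loop(D, ke, tau, n):
    # brute force: sum the decayed contribution of each past dose
    return sum(D * math.exp(-ke * tau * k) for k in range(n))

def amount_after_n_doses_closed_form(D, ke, tau, n):
    # geometric series with ratio r = exp(-ke * tau)
    r = math.exp(-ke * tau)
    return D * (1.0 - r ** n) / (1.0 - r)

D, ke, tau = 100.0, 0.1, 12.0  # hypothetical dose, rate, and interval
for n in (1, 5, 50):
    assert abs(amount_after_n_doses_loop(D, ke, tau, n)
               - amount_after_n_doses_closed_form(D, ke, tau, n)) < 1e-9
```

In a model, the closed-form expression would then serve as the forcing function for the remaining ODE states, as described above.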

This reminds me of something we did many years ago, at a much simpler level, explaining how analytical and simulation methods could be combined to get inferences for small probabilities.

“Sometimes research just has to start somewhere, and subject itself to criticism and potential improvement.”

Pointing to this careful news article by Monica Beyer, “Controversial study links pollution with bipolar, depression,” Mark Tuttle writes:

Sometimes potentially important things are hard, or even very hard.

Sometimes research just has to start somewhere, and subject itself to criticism and potential improvement.

I think this kind of thing supports our desire for high levels of transparency, difficult though it may be in these circumstances.

I agree.

Coronavirus “hits all the hot buttons” for promoting the scientist-as-hero narrative (cognitive psychology edition)

The New York Times continues to push the cognitive-illusion angle on coronavirus fear. Earlier this week we discussed an op-ed by social psychologist David DeSteno; today there’s a news article by that dude from Rushmore:

There remains deep uncertainty about the new coronavirus’ mortality rate, with the high-end estimate that it is up to 20 times that of the flu, but some estimates go as low as 0.16 percent for those affected outside of China’s overwhelmed Hubei province. About on par with the flu.

Wasn’t there something strange . . . about the extreme disparity in public reactions?

While the metrics of public health might put the flu alongside or even ahead of the new coronavirus for sheer deadliness . . . And the new coronavirus disease, named COVID-19, hits nearly every cognitive trigger we have.

That explains the global wave of anxiety.

Wait a second! The article just said that the high-end estimate is that coronavirus could have a mortality rate 20 times that of the flu, and a low-end estimate that is about on par with the flu. Is it really “so strange” to have a wave of anxiety given this level of uncertainty?

Don’t get me wrong. I’m not saying that people are rational uncertainty-calculators. In particular, maybe the real lesson here is not that people shouldn’t be scared about coronavirus but that they should be more scared of the flu. But, as in our discussion the other day, I’m concerned about experts who seem so eager to leap up and call people irrational, when it seems to me that it can be quite rational to react strongly to an unknown risk. Even if, in retrospect, coronavirus doesn’t end up being as bad as some of the worst-case scenarios, that doesn’t mean it is a bad idea to be prepared. We don’t want to be picking up pennies in front of the proverbial steamroller.

The Times article continues:

But there is a lesson, psychologists and public health experts say, in the near-terror that the virus induces, even as serious threats like the flu receive little more than a shrug. It illustrates the unconscious biases in how human beings think about risk, as well as the impulses that often guide our responses — sometimes with serious consequences.

Experts used to believe that people gauged risk like actuaries, parsing out cost-benefit analyses every time a merging car came too close or local crime rates spiked. But a wave of psychological experiments in the 1980s upended this thinking.




I am so damn sick of the scientist-as-hero narrative. It’s not enough to say that psychologists have learned a lot in the past 50 years about how we think about and make decisions under uncertainty. No, you also have to say that, before then, we were in the dark ages.

Is it really true that “Experts used to believe that people gauged risk like actuaries, parsing out cost-benefit analyses every time a merging car came too close or local crime rates spiked”? Maybe. I guess I’d like to see some quotes before I believe it. My impression is that experts used to believe, and in many cases still do, that parsing out cost-benefit analyses is a decision-making ideal that can be used as a comparison to better understand real decision processes.

But it’s obvious that people don’t “gauge risk like actuaries.” After all, if people really gauged risk like actuaries, we wouldn’t need actuaries! And, last I heard, they get paid a lot.

As with our discussion of that op-ed the other day, I have no problem with this news article regarding the public health details. Indeed, I don’t know anything about coronavirus, and it’s from articles like this that I get my news. The author writes, “Of course, it is far from irrational to feel some fear about the coronavirus outbreak tearing through China and beyond. . . . Assessing the danger posed by the coronavirus is extraordinarily difficult; even scientists are unsure. . . .”, so it’s not like he’s telling us not to worry. And I agree with the message that people should take their damn flu shots. I just don’t like how this interesting, important, and newsworthy story about uncertainty is being used as an excuse for an oversimplified model of decision science. As we’ve discussed earlier, it can be rational to react strongly to an uncertain threat. That scaredy-cat in the above image might be behaving in a smart way.

“Repeating the experiment” as general advice on data collection

Izzy Kates points to the above excerpt from Introductory Statistics, by Neil Weiss, 9th edition, and points out:

Nowhere is repeating the experiment mentioned. This isn’t the only time this mistake is made.

Good point! We don’t mention replication as a statistical method in our books either! Even when we talk about the replication crisis, and the concern that certain inferences won’t replicate on new data, we don’t really present replication as a data-collection strategy. Part of this is that in social sciences such as economics and political science, it’s rarely possible to do a direct replication—the closest example would be when we have a time series of polls, but in that case we’re typically interested in changes over time, so these polls are replications of methods but with possible changes in the underlying thing being measured.

I agree with Kates that if you’re going to give advice in a statistics book about data collection, random sampling, random assignment of treatments, etc., you should also talk about repeating the entire experiment. The problem is not specific to Weiss’s book. I don’t know that I’ve ever seen a statistics textbook recommend repeating the experiment as a general method, in the same way they recommend random sampling, random assignment, etc.

Remember the 50 Shades of Gray story, in which a team of researchers had a seemingly strong experimental finding but then decided to perform a replication, which gave a null result, making them realize how much they were able to fake themselves out with forking paths.

Or, for a cautionary tale, the 64 Shades of Gray story, in which a different research team didn’t check their experimental work with a replication, thus resulting in the publication of some pretty ridiculous claims.

So, my advice to researchers is: If you can replicate your study, do so. Better to find the mistakes yourself than to waste everybody else’s time.

P.S. We’re still making final edits on Regression and Other Stories. So I guess I should add something there.