Transformative treatments

Kieran Healy and Laurie Paul wrote a new article, “Transformative Treatments” (see also here), which reminds me a bit of my article with Guido, “Why ask why? Forward causal inference and reverse causal questions.” Healy and Paul’s article begins:

Contemporary social-scientific research seeks to identify specific causal mechanisms for outcomes of theoretical interest. Experiments that randomize populations to treatment and control conditions are the “gold standard” for causal inference. We identify, describe, and analyze the problem posed by transformative treatments. Such treatments radically change treated individuals in a way that creates a mismatch in populations, but this mismatch is not empirically detectable at the level of counterfactual dependence. In such cases, the identification of causal pathways is underdetermined in a previously unrecognized way. Moreover, if the treatment is indeed transformative it breaks the inferential structure of the experimental design. . . .

I’m not sure exactly where my paper with Guido fits in here, except that the idea of the “treatment” is so central to much of causal inference that researchers sometimes seem to act as if randomization (or, more generally, “identification”) automatically gives validity to a study, as if randomization plus statistical significance equals scientific discovery. The notion of a transformative treatment is interesting because it points to a fundamental contradiction in how we typically think about causality: on one hand, “the treatment” is supposed to be transformative and have some clearly defined “effect,” while on the other hand “treatment” and “control” are typically considered symmetrically in statistical models. I pick at this a bit in this 2004 article on general models for varying treatment effects.

P.S. Hey, I just remembered—I discussed this a couple of other times on this blog:

– 2013: Yes, the decision to try (or not) to have a child can be made rationally

– 2015: Transformative experiences: a discussion with L. A. Paul and Paul Bloom

“Kevin Lewis and Paul Alper send me so much material, I think they need their own blogs.”

In my previous post, I wrote:

Kevin Lewis and Paul Alper send me so much material, I think they need their own blogs.

It turns out that Lewis does have his own blog. His latest entry contains a bunch of links, starting with this one:

Populism and the Return of the “Paranoid Style”: Some Evidence and a Simple Model of Demand for Incompetence as Insurance against Elite Betrayal

Rafael Di Tella & Julio Rotemberg

NBER Working Paper, December 2016

Abstract:
We present a simple model of populism as the rejection of “disloyal” leaders. We show that adding the assumption that people are worse off when they experience low income as a result of leader betrayal (than when it is the result of bad luck) to a simple voter choice model yields a preference for incompetent leaders. These deliver worse material outcomes in general, but they reduce the feelings of betrayal during bad times. We find some evidence consistent with our model in a survey carried out on the eve of the recent U.S. presidential election. Priming survey participants with questions about the importance of competence in policymaking usually reduced their support for the candidate who was perceived as less competent; this effect was reversed for rural, and less educated white, survey participants.

I clicked through, and, ugh! What a forking-paths disaster! It already looks iffy from the abstract, but when you get into the details . . . ummm, let’s just say that these guys could teach Daryl Bem a thing or two.

Not Kevin Lewis’s fault; he’s just linking . . .

On the plus side, he also links to this:

Turnout and weather disruptions: Survey evidence from the 2012 presidential elections in the aftermath of Hurricane Sandy

Narayani Lasala-Blanco, Robert Shapiro & Viviana Rivera-Burgos

Electoral Studies, forthcoming

Abstract:
This paper examines the rational choice reasoning that is used to explain the correlation between low voter turnout and the disruptions caused by weather related phenomena in the United States. Using in-person as well as phone survey data collected in New York City where the damage and disruption caused by Hurricane Sandy varied by district and even by city blocks, we explore, more directly than one can with aggregate data, whether individuals who were more affected by the disruptions caused by Hurricane Sandy were more or less likely to vote in the 2012 Presidential Election that took place while voters still struggled with the devastation of the hurricane and unusually low temperatures. Contrary to the findings of other scholars who use aggregate data to examine similar questions, we find that there is no difference in the likelihood to vote between citizens who experienced greater discomfort and those who experienced no discomfort even in non-competitive districts. We theorize that this is in part due to the resilience to costs and higher levels of political engagement that vulnerable groups develop under certain institutional conditions.

I like this paper, but then again I know Narayani and Bob personally, so you can make of this what you will.

P.S. Although I think the “Populism and the Return of the Paranoid Style” paper is really bad, I recognize the importance of the topic, and I assume the researchers on this project were doing their best. It is worth another post or article explaining how better to address such questions and analyze this sort of data. My quick suggestion is that each causal question deserves its own study, and I don’t think it’s going to work so well to sift through a pile of data pulling out statistically significant comparisons, dismissing results that don’t fit your story, and labeling results that you like as “significant at the 7% level.” It’s not that there’s anything magic about a 5% significance level, it’s that you want to look at all of your comparisons, and you’re asking for trouble if you keep coming up with reasons to count or discard patterns.

Two unrelated topics in one post: (1) Teaching useful algebra classes, and (2) doing more careful psychological measurements

Kevin Lewis and Paul Alper send me so much material, I think they need their own blogs. In the meantime, I keep posting the stuff they send me, as part of my desperate effort to empty my inbox.

1. From Lewis:

“Should Students Assessed as Needing Remedial Mathematics Take College-Level Quantitative Courses Instead? A Randomized Controlled Trial,” by A. W. Logue, Mari Watanabe-Rose, and Daniel Douglas, which begins:

Many college students never take, or do not pass, required remedial mathematics courses theorized to increase college-level performance. Some colleges and states are therefore instituting policies allowing students to take college-level courses without first taking remedial courses. However, no experiments have compared the effectiveness of these approaches, and other data are mixed. We randomly assigned 907 students to (a) remedial elementary algebra, (b) that course with workshops, or (c) college-level statistics with workshops (corequisite remediation). Students assigned to statistics passed at a rate 16 percentage points higher than those assigned to algebra (p < .001), and subsequently accumulated more credits. A majority of enrolled statistics students passed. Policies allowing students to take college-level instead of remedial quantitative courses can increase student success.

I like the idea of teaching statistics instead of boring algebra. That said, I think if algebra were taught well, it would be as useful as statistics. I think the most important parts of statistics are not the probabilistic parts so much as the quantitative reasoning. You can use algebra to solve lots of problems. For example, this age adjustment story is just a bunch of algebra. Algebra + data. But there’s no reason algebra has to be data-free, right?
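
To illustrate what I mean by “algebra + data,” here’s a toy sketch of direct age standardization, which is the kind of algebra behind that age-adjustment story. Everything below is invented for the example; it’s not the actual analysis from the linked post.

# Toy illustration of "algebra + data": direct age standardization.
# All rates and population shares below are invented for this example.

# Death rates per 100,000, by age group, in two hypothetical populations.
rates_A = {"45-49": 300, "50-54": 400}
rates_B = {"45-49": 310, "50-54": 420}   # B is worse in every age group

# Age composition of each population (shares sum to 1).
shares_A = {"45-49": 0.3, "50-54": 0.7}  # A skews older
shares_B = {"45-49": 0.7, "50-54": 0.3}

# Reference age distribution used for the standardized comparison.
ref = {"45-49": 0.5, "50-54": 0.5}

def weighted_rate(rates, weights):
    # Overall rate = sum over age groups of (weight * age-specific rate).
    return sum(weights[a] * rates[a] for a in rates)

# Crude rates: A looks worse (370 vs. 343) only because A is older.
print(weighted_rate(rates_A, shares_A), weighted_rate(rates_B, shares_B))

# Age-standardized rates: with a common age mix, B is worse (350 vs. 365).
print(weighted_rate(rates_A, ref), weighted_rate(rates_B, ref))

Nothing here beyond weighted averages, but you need both the algebra and the data to see that the crude comparison is misleading.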

Meanwhile, intro stat can be all about p-values, and then I hate it.

So what I’d really like to see is good intro quantitative classes. Call it algebra or call it real-world math or call it statistics or call it data science, I don’t really care.

2. Also from Lewis:

“Less Is More: Psychologists Can Learn More by Studying Fewer People,” by Matthew Normand, who writes:

Psychology has been embroiled in a professional crisis as of late. . . . one problem has received little or no attention: the reliance on between-subjects research designs. The reliance on group comparisons is arguably the most fundamental problem at hand . . .

But there is an alternative. Single-case designs involve the intensive study of individual subjects using repeated measures of performance, with each subject exposed to the independent variable(s) and each subject serving as their own control. . . .

Normand talks about “single-case designs,” which we also call “within-subject designs.” (Here we’re using experimental jargon in which the people participating in a study are called “subjects.”) Whatever terminology is being used, I agree with Normand. This is something Eric Loken and I have talked about a lot, that many of the horrible Psychological Science-style papers we’ve discussed use between-subject designs to study within-subject phenomena.

A notorious example was that study of ovulation and clothing, which posited hormonally-correlated sartorial changes within each woman during the month, but estimated this using a purely between-person design, with only a single observation for each woman in their survey.

Why use between-subject designs for studying within-subject phenomena? I see a bunch of reasons. In no particular order:

1. The between-subject design is easier, both for the experimenter and for any participant in the study. You just perform one measurement per person. No need to ask people a question twice, or follow them up, or ask them to keep a diary.

2. Analysis is simpler for the between-subject design. No need to worry about longitudinal data analysis or within-subject correlation or anything like that.

3. Concerns about poisoning the well. Ask the same question twice and you might be concerned that people are remembering their earlier responses. This can be an issue, and it’s worth testing for such possibilities and doing your measurements in a way to limit these concerns. But it should not be the deciding factor. Better a within-subject study with some measurement issues than a between-subject study that’s basically pure noise.

4. The confirmation fallacy. Lots of researchers think that if they’ve rejected a null hypothesis at a 5% level with some data, that they’ve proved the truth of their preferred alternative hypothesis. Statistically significant, so case closed, is the thinking. Then all concerns about measurements get swept aside: After all, who cares if the measurements are noisy, if you got significance? Such reasoning is wrong wrong wrong but lots of people don’t understand.
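
To put some numbers on points 3 and 4, here’s a toy simulation (invented numbers, not any particular study) of a small within-person effect measured under the two designs:

# Toy simulation: a small within-person effect (0.2) with large between-person
# variation (sd 2) and modest measurement noise (sd 0.5). All numbers invented.
import numpy as np

rng = np.random.default_rng(1)
n_sims, n, effect = 2000, 100, 0.2
sd_person, sd_noise = 2.0, 0.5

between_est, within_est = [], []
for _ in range(n_sims):
    baseline = rng.normal(0, sd_person, size=n)   # each person's stable level

    # Between-subject design: each person measured once, in one condition.
    y_ctrl = baseline[: n // 2] + rng.normal(0, sd_noise, n // 2)
    y_trt = baseline[n // 2 :] + effect + rng.normal(0, sd_noise, n // 2)
    between_est.append(y_trt.mean() - y_ctrl.mean())

    # Within-subject design: each person measured in both conditions.
    y0 = baseline + rng.normal(0, sd_noise, n)
    y1 = baseline + effect + rng.normal(0, sd_noise, n)
    within_est.append((y1 - y0).mean())

print("between-subject design: sd of estimated effect =", np.std(between_est))  # roughly 0.4
print("within-subject design:  sd of estimated effect =", np.std(within_est))   # roughly 0.07

The between-subject estimate is swamped by person-to-person variation, while the within-subject estimate recovers the effect cleanly; that’s the sense in which the between-subject version of a within-subject question is basically pure noise.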

Also relevant to this reduce-N-and-instead-learn-more-from-each-individual-person’s-trajectory perspective is this conversation I had with Seth about ten years ago.

“The Pitfall of Experimenting on the Web: How Unattended Selective Attrition Leads to Surprising (Yet False) Research Conclusions”

Kevin Lewis points us to this paper by Haotian Zhou and Ayelet Fishbach, which begins:

The authors find that experimental studies using online samples (e.g., MTurk) often violate the assumption of random assignment, because participant attrition—quitting a study before completing it and getting paid—is not only prevalent, but also varies systematically across experimental conditions. Using standard social psychology paradigms (e.g., ego-depletion, construal level), they observed attrition rates ranging from 30% to 50% (Study 1). The authors show that failing to attend to attrition rates in online panels has grave consequences. By introducing experimental confounds, unattended attrition misled them to draw mind-boggling yet false conclusions: that recalling a few happy events is considerably more effortful than recalling many happy events, and that imagining applying eyeliner leads to weight loss (Study 2). In addition, attrition rate misled them to draw a logical yet false conclusion: that explaining one’s view on gun rights decreases progun sentiment (Study 3). The authors offer a partial remedy (Study 4) and call for minimizing and reporting experimental attrition in studies conducted on the Web.

I started to read this but my attention wandered before I got to the end; I was on the internet at the time and got distracted by a bunch of cat pictures, lol.
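
Still, the mechanism described in the abstract is simple enough to see in a toy simulation. In the sketch below (invented numbers, nothing from the paper), there is no true treatment effect at all, but dropout depends on a trait that also predicts the outcome and is heavier in one condition, so an “effect” appears among the completers:

# Toy simulation of differential attrition manufacturing a treatment effect.
# All numbers invented; this is not the paper's data or design.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
grit = rng.normal(size=n)              # trait that predicts both outcome and finishing
outcome = grit + rng.normal(size=n)    # same in both conditions: no true effect
condition = rng.integers(0, 2, size=n) # random assignment; 1 = the tedious condition

# Low-grit participants are much more likely to quit the tedious condition.
p_finish = np.where(condition == 1, 1 / (1 + np.exp(-2 * grit)), 0.9)
finished = rng.random(n) < p_finish

# Naive comparison among completers shows a sizable spurious "effect."
spurious = (outcome[(condition == 1) & finished].mean()
            - outcome[(condition == 0) & finished].mean())
print("attrition in condition 1:", 1 - finished[condition == 1].mean())  # around 50%
print("spurious effect among completers:", spurious)                     # well above zero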

“I thought it would be most unfortunate if a lab . . . wasted time and effort trying to replicate our results.”

[cat picture]

Mark Palko points us to this news article by George Dvorsky:

A Harvard research team led by biologist Douglas Melton has retracted a promising research paper following multiple failed attempts to reproduce the original findings. . . .

In June 2016, the authors published an article in the open access journal PLOS One stating that the original study had deficiencies. Yet this peer-reviewed admission was not accompanied by a retraction. Until now.

Melton told Retraction Watch that he finally decided to issue the retraction to ensure zero confusion about the status of the paper, saying, “I thought it would be most unfortunate if a lab missed the PLOS ONE paper, then wasted time and effort trying to replicate our results.”

He said the experience was a valuable one, telling Retraction Watch, “It’s an example of how scientists can work together when they disagree, and come together to move the field forward . . . The history of science shows it is not a linear path.”

True enough. Each experiment, successful or not, takes us a step closer to an actual cure.

Are you listening, John Bargh? Roy Baumeister?? Andy Yap??? Editors of the Lancet???? Ted talk people????? NPR??????

I guess the above could never happen in a field like psychology, where the experts assure us that the replication rate is “statistically indistinguishable from 100%.”

In all seriousness, I’m glad that Melton and his colleagues recognize that there’s a cost to presenting shaky work as solid and thus sending other research teams down blind alleys for years or even decades. I don’t recall any apologies on those grounds ever coming from the usual never-admit-error crowd.

Sorry, but no, you can’t learn causality by looking at the third moment of regression residuals

Under the subject line “Legit?”, Kevin Lewis pointed me to this press release, “New statistical approach will help researchers better determine cause-effect.” I responded, “No link to any of the research papers, so cannot evaluate.”

In writing this post I thought I’d go further. The press release mentions 6 published articles so I googled the first one, from the British Journal of Mathematical and Statistical Psychology (hey, I’ve published there!) and found this paper, “Significance tests to determine the direction of effects in linear regression models.”

Uh oh, significance tests. It’s almost like they’re trying to piss me off!

I’m traveling so I can’t get access to the full article. From the abstract:

Previous studies have discussed asymmetric interpretations of the Pearson correlation coefficient and have shown that higher moments can be used to decide on the direction of dependence in the bivariate linear regression setting. The current study extends this approach by illustrating that the third moment of regression residuals may also be used to derive conclusions concerning the direction of effects. Assuming non-normally distributed variables, it is shown that the distribution of residuals of the correctly specified regression model (e.g., Y is regressed on X) is more symmetric than the distribution of residuals of the competing model (i.e., X is regressed on Y). Based on this result, 4 one-sample tests are discussed which can be used to decide which variable is more likely to be the response and which one is more likely to be the explanatory variable. A fifth significance test is proposed based on the differences of skewness estimates, which leads to a more direct test of a hypothesis that is compatible with direction of dependence. . . .

The third moment of regression residuals??? This is nuts!

OK, I can see the basic idea. You have a model in which x causes y; the model looks like y = x + error. The central limit theorem tells you, roughly, that y should be more normal-looking than x, hence all those statistical tests.
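
Here’s a quick simulation of that intuition (made-up data, not anything from the paper): when x is skewed and the error is normal, the residuals from the correct regression come out roughly symmetric, while the residuals from the reversed regression inherit some of x’s skewness.

# Simulated check of the residual-skewness idea; data are made up.
import numpy as np
from scipy.stats import skew, linregress

rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)             # skewed "cause"
y = 2.0 + 1.5 * x + rng.normal(size=x.size)   # linear effect plus normal error

fwd = linregress(x, y)   # correct direction: y regressed on x
rev = linregress(y, x)   # reversed direction: x regressed on y

resid_fwd = y - (fwd.intercept + fwd.slope * x)
resid_rev = x - (rev.intercept + rev.slope * y)

print("skewness of residuals, y on x:", skew(resid_fwd))  # near 0
print("skewness of residuals, x on y:", skew(resid_rev))  # clearly positive here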

Really, though, this is going to depend so much on how things are measured. I can’t imagine it will be much help in understanding causation. Actually, I think it will hurt in that if anyone takes it seriously, they’ll just muddy the waters with various poorly-supported claims. Nothing wrong with doing some research in this area, but all that hype . . . jeez!

Ethics and statistics

For a few years now, I’ve been writing a column in Chance. Below are the articles so far. This is by no means an exhaustive list of my writings on ethics and statistics but at least I thought it could help to collect these columns in one place.

Ethics and statistics: Open data and open methods

Statisticians: When we teach, we don’t practice what we preach (with Eric Loken)

Ethics in medical trials: Where does statistics fit in?

Statistics for sellers of cigarettes

Ethics and the statistical use of prior information

The war on data (with Mark Palko)

They’d rather be rigorous than right

It’s too hard to publish criticisms and obtain data for replication

Is it possible to be an ethicist without being mean to people?

The AAA tranche of subprime science (with Eric Loken)

The Commissar for Traffic presents the latest Five-Year Plan (with Phil Price)

Disagreements about the strength of evidence

How is ethics like logistic regression? Ethics decisions, like statistical inferences, are informative only if they’re not too easy or too hard (with David Madigan)

Objects of the class “George Orwell”


George Orwell is an exemplar in so many ways: a famed truth-teller who made things up, a left-winger who mocked left-wingers, an author of a much-misunderstood novel (see “Objects of the class ‘Sherlock Holmes’”), and probably a few dozen more.

But here I’m talking about Orwell’s name being used as an adjective: “Orwellian,” referring to the sort of doublespeak that Orwell deplored. When someone says something is Orwellian, they mean it’s something that Orwell would’ve hated.

Another example: Kafkaesque. A Kafkaesque world is not something Kafka would’ve wanted.

Just to be clear: I’m not saying there’s anything wrong with referring to doublespeak as Orwellian—the man did write a lot about it! It’s just interesting to think of things named after people who hated them.

Emails I never bothered to answer

So, this came in the email one day:

Dear Professor Gelman,

I would like to shortly introduce myself: I am editor in the ** Department at the publishing house ** (based in ** and **).

As you may know, ** has taken over all journals of ** Press. We are currently restructuring some of the journals and are therefore looking for new editors for the journal **.

You have published in the journal, you work in the field . . . your name was recommended by Prof. ** as a potential editor for the journal. . . . We think you would be an excellent choice and I would like to ask you kindly whether you are interested to become an editor of the journal. In case you are interested (and even if you are not), we would be glad if you could maybe recommend us some additional potential candidates who could be interested to get involved with **. We are looking for a several editors who will cover the different areas of the field.

If you have any questions, I will gladly provide you with more information.

I look forward to hearing from you,

with best regards

**

Ummm, don’t take this the wrong way, but . . . why is it exactly that you think I would want to work for free on a project, just to make money for you?

Christmas special: Survey research, network sampling, and Charles Dickens’ coincidences

[map of Dickens’s London and his characters’ coincidental meetings, via David Perdue]

It’s Christmas so what better time to write about Charles Dickens . . .

Here’s the story:

In traditional survey research we have been spoiled. If you work with atomistic data structures, a small sample looks like a little bit of the population. But a small sample of a network doesn’t look like the whole. For example, if you take a network and randomly sample some nodes, and then look at the network of all the edges connecting these nodes, you’ll get something much more sparse than the original. Suppose Alice knows Bob, who knows Cassie, who knows Damien, but Alice does not happen to know Damien directly. If only Alice and Damien are selected, they will appear to be disconnected because the missing links are not in the sample.
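
Here’s a small numerical version of that point, using a toy random graph rather than real survey data: take a graph of 2,000 people with an average of about 10 acquaintances each, sample 100 of them, and the sampled people look almost entirely disconnected from one another.

# Toy random graph: what a node sample looks like compared to the full network.
import numpy as np

rng = np.random.default_rng(0)
N, n, avg_degree = 2000, 100, 10
p = avg_degree / (N - 1)                 # chance any two people know each other

upper = np.triu(rng.random((N, N)) < p, k=1)
adj = upper | upper.T                    # symmetric adjacency matrix

sample = rng.choice(N, size=n, replace=False)
sub = adj[np.ix_(sample, sample)]        # network among the sampled people only

print("mean degree in the full graph:  ", adj.sum(axis=1).mean())  # about 10
print("mean degree within the sample:  ", sub.sum(axis=1).mean())  # about 0.5
print("sampled people with no sampled acquaintances:",
      (sub.sum(axis=1) == 0).mean())                               # the majority

The chance that any particular sampled pair is connected is unchanged, but the sampled people’s degrees, and the paths that connect them, collapse; that’s the sense in which the sample doesn’t look like the whole.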

This brings us to a paradox of literature. Charles Dickens, like Tom Wolfe more recently, was celebrated for his novels that reconstructed an entire society, from high to low, in miniature. But Dickens is also notorious for his coincidences: his characters all seem very real but they’re always running into each other on the street (as illustrated in the map above, which comes from David Perdue) or interacting with each other in strange ways, or it turns out that somebody is somebody else’s uncle. How could this be, that Dickens’s world was so lifelike in some ways but filled with these unnatural coincidences?

My contention is that Dickens was coming up with his best solution to an unsolvable problem, which is to reproduce a network given a small sample. What is a representative sample of a network? If London has a million people and I take a sample of 100, what will their network look like? It will look diffuse and atomized because of all those missing connections. The network of this sample of 100 doesn’t look anything like the larger network of Londoners, any more than a disconnected set of human cells would look like a little person.

So to construct something with realistic network properties, Dickens had to artificially fill in the network, to create the structure that would represent the interactions in society. You can’t make a flat map of the world that captures the shape of a globe; any projection makes compromises. Similarly you can’t take a sample of people and capture all its network properties, even in expectation: if we want the network density to be correct, we need to add in links, “coincidences” as it were. The problem is, we’re not used to thinking this way because with atomized analysis, we really can create samples that are basically representative of the population. With networks you can’t.

This may be the first, and last, bit of literary criticism to appear in the Journal of Survey Statistics and Methodology.

How to include formulas (LaTeX) and code blocks in WordPress posts and replies

It’s possible to include LaTeX formulas. For example, I entered $latex \int e^x \, \mathrm{d}x$ and WordPress rendered it as the typeset integral.

You can also generate code blocks like this:

for (n in 1:N) 
  y[n] ~ normal(0, 1);

The way to format them is to use <pre> to open the code block and </pre> to close it.

You can create links using the anchor (a) tag.

You can also quote someone else, like our friend lorem ipsum,

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

You open with <blockquote> and close with </blockquote>.

You can add bold (tag inside angle brackets is b), italics (tag is i), and typewriter text (tag is tt), but our idiotic style makes typewriter text smaller, so you need to wrap it in a big tag for it to render the same size as the surrounding text.

The full set of tags allowed is:

address, a, abbr, acronym, area, article, aside, b, big,
blockquote, br, caption, cite, class, code, col, del,
details, dd, div, dl, dt, em, figure, figcaption, footer,
font, h1, h2, h3, h4, h5, h6, header, hgroup, hr, i,
img, ins, kbd, li, map, ol, p, pre, q, s, section, small,
span, strike, strong, sub, summary, sup, table, tbody,
td, tfoot, th, thead, tr, tt, u, ul, var

For more details, see: https://en.support.wordpress.com/code/

Too bad there’s no way for users without admin privileges to edit their work. It’s fiddly getting LaTeX or HTML right on the first try.

After some heavy escaping, you deserve some comic relief; it’ll give you some hint at what I had to do to show what I entered to you without it rendering.

p=.03, it’s gotta be true!

Howie Lempel writes:

Showing a white person a photo of Obama w/ artificially dark skin instead of artificially lightened skin before asking whether they support the Tea Party raises their probability of saying “yes” from 12% to 22%. 255 person Amazon Turk and Craigs List sample, p=.03.

Nothing too unusual about this one. But it’s particularly grating when hyper educated liberal elites use shoddy research to decide that their political opponents only disagree with them because they’re racist.

https://www.washingtonpost.com/news/wonk/wp/2016/05/13/how-psychologists-used-these-doctored-obama-photos-to-get-white-people-to-support-conservative-politics/

https://news.stanford.edu/2016/05/09/perceived-threats-racial-status-drive-white-americans-support-tea-party-stanford-scholar-says/

Hey, they could have a whole series of this sort of experiment:

– Altering the orange hue of Donald Trump’s skin and seeing if it affects how much people trust the guy . . .

– Making Hillary Clinton fatter and seeing if that somehow makes her more likable . . .

– Putting glasses on Rick Perry to see if that affects perceptions of his intelligence . . .

– Altering the shape of Elizabeth Warren’s face to make her look even more like a Native American . . .

The possibilities are endless. And, given the low low cost of Mechanical Turk and Craigslist, surprisingly affordable. The pages of Psychological Science, PPNAS, and Frontiers in Psychology are wide open to you. As the man says, Never say no!

P.S. Just to be clear: I’m not saying that the above-linked conclusions are wrong or that such studies are inherently ridiculous. I just think you have to be careful about how seriously you take claims from reported p-values.

“Dirty Money: The Role of Moral History in Economic Judgments”

[cat picture]

Recently in the sister blog . . . Arber Tasimi and his coauthor write:

Although traditional economic models posit that money is fungible, psychological research abounds with examples that deviate from this assumption. Across eight experiments, we provide evidence that people construe physical currency as carrying traces of its moral history. In Experiments 1 and 2, people report being less likely to want money with negative moral history (i.e., stolen money). Experiments 3–5 provide evidence against an alternative account that people’s judgments merely reflect beliefs about the consequences of accepting stolen money rather than moral sensitivity. Experiment 6 examines whether an aversion to stolen money may reflect contamination concerns, and Experiment 7 indicates that people report they would donate stolen money, thereby counteracting its negative history with a positive act. Finally, Experiment 8 demonstrates that, even in their recall of actual events, people report a reduced tendency to accept tainted money. Altogether, these findings suggest a robust tendency to evaluate money based on its moral history, even though it is designed to participate in exchanges that effectively erase its origins.

I’m not a big fan of the graphs in this paper (and don’t get me started on the tables!), but the experiments are great. I love this stuff.

You Won’t BELIEVE How Trump Broke Up This Celebrity Couple!

[cat picture]

A few months ago I asked if it was splitsville for tech zillionaire Peter Thiel and chess champion Garry Kasparov, after seeing this quote from Kasparov in April:

Trump sells the myth of American success instead of the real thing. . . . It’s tempting to rally behind him-but we should resist. Because the New York values Trump represents are the very worst kind. . . . He may have business experience, but unless the United States plans on going bankrupt, it’s experience we don’t need.

and this news item from May:

Thiel, co-founder of PayPal and Palantir and a director at Facebook, is now a Trump delegate in San Francisco, according to a Monday filing.

Based on this recent interview, I suspect the bromance is fully over.

I guess we can forget about Kasparov and Thiel ever finishing that book.

P.S. Commenter Ajg at above-linked post gets credit for the title of this post.

This is not news.


Anne Pier Salverda writes:

I’m not sure if you’re keeping track of published failures to replicate the power posing effect, but this article came out earlier this month: “Embodied power, testosterone, and overconfidence as a causal pathway to risk-taking.”

From the abstract:

We were unable to replicate the findings of the original study and subsequently found no evidence for our extended hypotheses.

Gotta love that last sentence of the abstract:

As our replication attempt was conducted in the Netherlands, we discuss the possibility that cultural differences may play a moderating role in determining the physiological and psychological effects of power posing.

Let’s just hope that was a joke. Jokes are ok in academic papers, right?

Michael found the bug in Stan’s new sampler

Gotcha!

Michael found the bug!

That was a lot of effort, during which time he produced ten pages of dense LaTeX to help Daniel and me understand the algorithm enough to help debug (we’re trying to write a bunch of these algorithmic details up for a more general audience, so stay tuned).

So what was the issue?

In Michael’s own words:

There were actually two bugs. The first is that the right subtree needs its own rho in order to compute the correct termination criterion. The second is that in order to compute the termination criterion you need the points on the left and right of each subtree (the orientation of left and right relative to forwards and backwards depends on in which direction you’re trying to extend the trajectory). That means you have to do one leapfrog step and take that point as left, then do the rest of the leapfrog steps and take the final point as right. But right now I’m taking the initial point as left, which is one off. A small difference (especially as the step size is decreased!) but enough to bias the samples.

I redacted the saltier language (sorry if that destroyed the flavor of the message, Michael [pun intended; this whole bug hunt has left me a bit punchy]).

I responded:

That is a small difference—amazing it has that much effect on sampling. These things are obviously balanced on a knife edge.

Michael then replied:

Well the effect is pretty small and is significant only when you need extreme precision, so it’s not entirely surprising [that our tests didn’t catch it] in hindsight. The source of the problem also explains why the bias went down as the step size was decreased. It also gives a lot of confidence in the general validity of previous results.

I’m just glad all that math was correct!

Whew. Me, too. Especially since the new approach seems both more efficient and more robust.

What do you mean by “new approach”?

Michael replaced the original NUTS algorithm’s slice sampler with a discrete sampler, which trickles through a bunch of the algorithmic steps, such as whether to jump to the latest subtree being built. We (by which I mean Michael) have also been making incremental changes to the adaptation. These started early on when we broke adaptation down into a step size and a regularized mass matrix estimate and then allowed dense mass matrices.
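
To give a sense of what “discrete sampler” means here, the following is only a schematic of the core operation, not Stan’s actual implementation (which builds the trajectory recursively and selects among subtrees): given the log densities of the points along a trajectory, draw one point with probability proportional to its density.

# Schematic only: categorical draw over trajectory points, proportional to density.
# This is not Stan's code; it just illustrates the operation being described.
import numpy as np

def sample_trajectory_point(log_densities, rng):
    # weight[i] is proportional to exp(log_densities[i]); subtract the max
    # before exponentiating for numerical stability (log-sum-exp trick).
    lw = np.asarray(log_densities, dtype=float)
    w = np.exp(lw - lw.max())
    return rng.choice(len(w), p=w / w.sum())

rng = np.random.default_rng(0)
print(sample_trajectory_point([-3.2, -1.1, -0.7, -2.5], rng))  # index of the chosen point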

When will Stan be fixed?

It’ll take a few days for us to organize the new code and then a few more days to push it through the interfaces. Definitely in time for StanCon (100+ registrants and counting, with plenty of submitted case studies).

Stan 2.10 through Stan 2.13 produce biased samples

[Update: bug found! See the follow-up post, Michael found the bug in Stan’s new sampler]

[Update: rolled in info from comments.]

After all of our nagging of people to use samplers that produce unbiased samples, we are mortified to have to announce that Stan versions 2.10 through 2.13 produce biased samples.

The issue

Thanks to Matthew R. Becker for noticing this with a simple bivariate example and for filing the issue with a reproducible example:

The change to Stan

Stan 2.10 changed the NUTS algorithm from using slice sampling along a Hamiltonian trajectory to a new algorithm that uses categorical sampling of points along the trajectory proportional to the density (plus a bias toward the second half of the trajectory, which is a subtle aspect of the original NUTS algorithm). The new approach is described here:

From Michael Betancourt on Stan’s users group:

Let me temper the panic by saying that the bias is relatively small and affects only variances but not means, which is why it snuck through all our testing and application analyses. Ultimately posterior intervals are smaller than they should be, but not so much that the inferences are misleading, and the shrinkage will be noticeable only if you have more than thousands of effective samples, which is much more than we typically recommend.

What we’re doing to fix it

Michael and I are poring over the proofs and the code, but it’s unfortunate timing with the holidays here as everyone’s traveling. We’ll announce a fix and make a new release as soon as we can. Let’s just say this is our only priority at the moment.

If all else fails, we’ll roll back the sampler to the 2.09 version in a couple days and do a new release with all the other language updates and bug fixes since then.

What you can do until then

While some people seem to think the error is of small enough magnitude not to be worrisome (see comments), we’d rather see you all getting the right answers. Until we get this fixed, the only thing I can recommend is using straight up static HMC (which is not broken in the Stan releases) or rolling back to Stan 2.09 (easy with CmdStan, not sure how to do that with other interfaces).

Even diagnosing problems like these is hard

Matthew Becker, the original poster, diagnosed the problem with fake data simulations, but it required a lot of effort.

The bug Matthew Becker reported was for this model:

parameters {
  vector[2] z;
}

model {
  matrix[2,2] sigma;
  vector[2] mu;

  mu[1] <- 0.0;
  mu[2] <- 3.0;
  sigma[1][1] <- 1.0 * 1.0;
  sigma[1][2] <- 0.5 * 1.0 * 2.0;
  sigma[2][1] <- 0.5 * 1.0 * 2.0;
  sigma[2][2] <- 2.0 * 2.0;

  z ~ multi_normal(mu, sigma);
}

So it's just a simple multivariate normal with 50% correlation and reasonably small locations and scales. Running it in Stan 2.13 with four chains of 1M iterations each led to this result:

Inference for Stan model: TWOD_Gaussian_c1141a5e1a103986068b426ecd9ef5d2.
4 chains, each with iter=1000000; warmup=100000; thin=1;
post-warmup draws per chain=900000, total post-warmup draws=3600000.

and led to this posterior summary:

         mean  se_mean    sd   2.5%    25%     50%    75%  97.5%   n_eff  Rhat
z[0]  -5.1e-4   1.9e-3  0.95  -1.89  -0.61  7.7e-4   0.61   1.88  235552   1.0
z[1]      3.0   3.9e-3   1.9  -0.77   1.77     3.0   4.23   6.77  234959   1.0
lp__    -1.48   2.3e-3  0.96  -4.09  -1.84   -1.18   -0.8  -0.57  179274   1.0

rather than the correct values (known analytically and verified by our static HMC implementation):

         mean  se_mean    sd   2.5%    25%      50%    75%  97.5%  n_eff  Rhat
z[0]   6.7e-5   5.2e-4  0.99  -1.95  -0.67  -1.8e-4   0.67   1.95  3.6e6   1.0
z[1]      3.0   1.1e-3  1.99  -0.92   1.66      3.0   4.34   6.91  3.6e6   1.0
lp__    -1.54   6.8e-3   1.0  -4.25  -1.92    -1.23  -0.83  -0.57  21903   1.0

In particular, you can see that the posterior sd is too low for NUTS (not by much, 0.95 vs. 1.0), and the posterior 95% intervals are (-1.89, 1.88) rather than (-1.95, 1.95) for z[1] (here, for some reason, listed as "z[0]").
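
For what it's worth, here is a sketch of the kind of check that can flag this sort of bias when the true posterior sd is known analytically. This is an illustration, not Becker's actual script; z_draws stands in for a hypothetical array of the z[1] draws.

# Sketch of a bias check against a known posterior sd; illustration only.
import numpy as np

def sd_bias_zscore(draws, true_sd, n_eff):
    # For roughly normal margins, the Monte Carlo error of the sample sd is
    # about true_sd / sqrt(2 * n_eff); compare the observed discrepancy to that.
    s = draws.std(ddof=1)
    mc_error = true_sd / np.sqrt(2 * n_eff)
    return (s - true_sd) / mc_error

# z_draws = ...  # hypothetical: the z[1] column extracted from the fitted chains
# print(sd_bias_zscore(z_draws, true_sd=1.0, n_eff=235552))
# With a sample sd around 0.95 and hundreds of thousands of effective draws, the
# z-score lands far in the negative tail: much more than Monte Carlo error can explain.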

We're really sorry about this

Again, our sincere apologies here for messing up so badly. I hope everyone can forgive us. It is going to cause us to focus considerable energy on functional tests that'll diagnose these issues; it's a challenging problem to balance the sensitivity and specificity of such tests.

Steve Fienberg

I did not know Steve Fienberg well, but I met him several times and encountered his work on various occasions, which makes sense considering his research area was statistical modeling as applied to social science.

Fienberg’s most influential work must have been his books on the analysis of categorical data, work that was ahead of its time in being focused on the connection between models rather than hypothesis tests. He also wrote, with William Mason, the definitive paper on identification in age-period-cohort models, and he worked on lots of applied problems including census adjustment, disclosure limitation, and statistics in legal settings. The common theme in all this work is the combination of information from multiple sources, and the challenges involved in taking statistical inferences using these to make decisions in new settings. These ideas of integration and partial pooling are central to Bayesian data analysis, and so it makes sense that Fienberg made use of Bayesian methods throughout his career, and that he was a strong presence in the Carnegie Mellon statistics department, which has been one of the important foci of Bayesian research and education during the past few decades.

Fienberg’s CMU obituary quotes statistician and former Census Bureau director Bob Groves as saying,

Steve Fienberg’s career has no analogue in my [Groves’s] lifetime. . . . He contributed to advancements in theoretical statistics while at the same time nurturing the application of statistics in fields as diverse as forensic science, cognitive psychology, and the law. He was uniquely effective in his career because he reached out to others, respected them for their expertise, and perceptively saw connections among knowledge domains when others couldn’t see them. He thus contributed both to the field of statistics and to the broader human understanding of the world.

I’d say it slightly differently. I disagree that Fienberg’s career is unique in the way that Groves states. Others of Fienberg’s generation such as Don Rubin and Nan Laird have similarly made important theoretical or methodological contributions while also actively working on a broad variety of live applications. One can also point to researchers such as James Heckman and Lewis Sheiner who have come from outside to make important contributions to statistics while also doing important work in their own fields. And, to go to the next generation, I can for example point to my collaborators John Carlin and David Dunson, both of whom have had deep statistical insights while also contributing to the reform and development of their fields of application.

But please don’t take my qualification of Groves’s statement to be a criticism of Fienberg. Rather consider it as a plus. Fienberg is a model of an important way to be a statistician: to be someone deeply engaged with a variety of applied projects while at the same time making fundamental contributions to the core of statistics. Or, to put it another way, to work on statistical theory and methodology in the context of a deep engagement with a wide range of applications.

Lionel Trilling famously wrote this about George Orwell:

Orwell, by reason of the quality that permits us to say of him that he was a virtuous man, is a figure in our lives. He was not a genius, and this is one of the remarkable things about him. His not being a genius is an element of the quality that makes him what I am calling a figure. . . . if we ask what it is he stands for, what he is the figure of, the answer is: the virtue of not being a genius, of fronting the world with nothing more than one’s simple, direct, undeceived intelligence, and a respect for the powers one does have, and the work one undertakes to do. . . . what a relief! What an encouragement. For he communicates to us the sense that what he has done any one of us could do.

Or could do if we but made up our mind to do it, if we but surrendered a little of the cant that comforts us, if for a few weeks we paid no attention to the little group with which we habitually exchange opinions, if we took our chance of being wrong or inadequate, if we looked at things simply and directly, having only in mind our intention of finding out what they really are . . . He tells us that we can understand our political and social life merely by looking around us, he frees us from the need for the inside dope.

George Orwell is one of my heroes. I am not saying that Steve Fienberg is the George Orwell of statistics, whatever that would mean. What I do think is that the above traits identified by Trilling are related to what I admire most about Fienberg, and this is why I think it’s a fine accomplishment indeed for Fienberg to have not been a unique example of a statistician contributing both to theory and applications but an exemplar of this type. Laplace, Galton, and Fisher also fall in this category but none of us today can hope to match the scale of their contributions. Fienberg through his efforts changed the world in some small bit, as we all should hope to do.