“I Can’t Believe It’s Not Better”

Check out this session Saturday at NeurIPS. It’s a great idea to ask people to speak on methods that didn’t work. I have a lot of experience with that!

Here are the talks:

Max Welling: The LIAR (Learning with Interval Arithmetic Regularization) is Dead

Danielle Belgrave: Machine Learning for Personalised Healthcare: Why is it not better?

Michael C. Hughes: The Case for Prediction Constrained Training

Andrew Gelman: It Doesn’t Work, But The Alternative Is Even Worse: Living With Approximate Computation

Roger Grosse: Why Isn’t Everyone Using Second-Order Optimization?

Weiwei Pan: What are Useful Uncertainties for Deep Learning and How Do We Get Them?

Charline Le Lan, Laurent Dinh: Perfect density models cannot guarantee anomaly detection

Fan Bao, Kun Xu, Chongxuan Li, Lanqing Hong, Jun Zhu, Bo Zhang. Variational (Gradient) Estimate of the Score Function in Energy-based Latent Variable Models

Emilio Jorge, Hannes Eriksson, Christos Dimitrakakis, Debabrota Basu, Divya Grover. Inferential Induction: A Novel Framework for Bayesian Reinforcement Learning

Tin D. Nguyen, Jonathan H. Huggins, Lorenzo Masoero, Lester Mackey, Tamara Broderick. Independent versus truncated finite approximations for Bayesian nonparametric inference

Ricky T. Q. Chen, Dami Choi, Lukas Balles, David Duvenaud, Philipp Hennig. Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering

Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, Geoff Pleiss, John Patrick Cunningham. Uses and Abuses of the Cross-Entropy Loss: Case Studies in Modern Deep Learning

P.S. The name of the session is a parody of a slogan from a TV commercial from my childhood. When I was asked to speak in this workshop, I was surprised that they would use such an old-fashioned reference. Most NeurIPS participants are much younger than me, right? I asked around and was told that the slogan has been revived recently on social media.

IEEE’s Refusal to Issue Corrections

This is Jessica. The following was written by a colleague, Steve Haroz, about his attempt to correct a paper of his published by IEEE (which, according to Wikipedia, publishes “over 30% of the world’s literature in the electrical and electronics engineering and computer science fields”).

One of the basic Mertonian norms of science is that it is self-correcting. And one of the basic norms of being an adult is acknowledging when you make a mistake. As an author, I would like to abide by those norms. Sadly, IEEE conference proceedings do not abide by the standards of science… or of being an adult.

Two years ago Robert Kosara and I published a position paper titled, “Skipping the Replication Crisis in Visualization: Threats to Study Validity and How to Address Them”, in the proceedings of “Evaluation and Beyond – Methodological Approaches for Visualization”, which goes by “BELIV”. It describes a collection of problems with studies, how they may arise, and measures to mitigate them. It broke down threats to validity from data collection, analysis mistakes, poorly formed research questions, and a lack of replication publication opportunities. There was another validity threat that we clearly missed… a publisher that doesn’t make corrections.

Requesting to fix a mistake

A few months after the paper was published, a colleague, Pierre Dragicevic, noticed a couple problems. We immediately corrected and annotated them on the OSF postprint, added an acknowledgment to Pierre, and then sent an email to the paper chairs summarizing the issues and asking for a correction to be issued.

Dear organizers of Evaluation and Beyond – Methodological Approaches for Visualization (BELIV),

This past year, we published a paper titled “Skipping the Replication Crisis in Visualization: Threats to Study Validity and How to Address Them”. Since then, we have been made aware of two mistakes in the paper:

  1. The implications of a false positive rate

In section 3.1, we wrote:

…a 5% false positive rate means that one out of every 20 studies in visualization (potentially several each year!) reports on an effect that does not exist.

But a more accurate statement would be:

…a 5% false positive rate means that one out of every 20 non-existent effects studied in visualization (potentially several each year!) is incorrectly reported as being a likely effect.

  2. The magnitude of p-values

In section 3.2, we wrote:

…p-values between 0.1 and 0.5 are actually much less likely than ones below 0.1 when the effect is in fact present…

But the intended statement was:

…p-values between 0.01 and 0.05 are actually much less likely than ones below 0.01 when the effect is in fact present…

As the main topic of the paper is the validity of research publications, we feel that it is important to correct these mistakes, even if seemingly minor. We have uploaded a new version to OSF with interactive comments highlighting the original errors (https://osf.io/f8qey/). We would also like to update the IEEE DL with the version attached. Please let us know how we can help accomplish that.

Thank you,

Steve Haroz and Robert Kosara

Summary of what we wanted to fix

  1. We should have noted that the false positive rate applies to non-existent effects. (A sloppy intro-to-stats level mistake.)
  2. We put some decimals in the wrong place. (It probably happened when hurriedly moving from a Google Doc to LaTeX right before the deadline.)

We knew better than this, but we made a couple mistakes. They’re minor mistakes that don’t impact conclusions, but mistakes nonetheless. Especially in a paper that is about the validity of scientific publications, we should correct them. And for a scientific publication, the process for making corrections should be in place.
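Both corrections are easy to see in a quick simulation. This is a hypothetical sketch (two-sample t-tests with made-up sample sizes and effect sizes, not anything from the paper): among truly null effects, about 5% come out “significant” at α = 0.05, and when an effect really is present, p-values below 0.01 are far more common than p-values between 0.01 and 0.05.

```python
# Simulation illustrating both corrected statements, using
# hypothetical two-sample t-tests with alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, trials = 50, 0.05, 5000

# 1) The 5% false positive rate applies only to NON-EXISTENT effects:
# both groups are drawn from the same distribution here.
null_p = np.array([stats.ttest_ind(rng.normal(0, 1, n),
                                   rng.normal(0, 1, n)).pvalue
                   for _ in range(trials)])
false_positive_rate = np.mean(null_p < alpha)  # should be near 0.05

# 2) When an effect IS present (a shift of 0.8 SD, chosen arbitrarily),
# p < 0.01 is much more common than 0.01 < p < 0.05.
alt_p = np.array([stats.ttest_ind(rng.normal(0.8, 1, n),
                                  rng.normal(0.0, 1, n)).pvalue
                  for _ in range(trials)])

print(f"false positive rate among true nulls: {false_positive_rate:.3f}")
print(f"P(p < 0.01 | effect present):         {np.mean(alt_p < 0.01):.2f}")
print(f"P(0.01 < p < 0.05 | effect present):  "
      f"{np.mean((alt_p > 0.01) & (alt_p < 0.05)):.2f}")
```

The point of the first block is exactly the corrected sentence: the 5% applies to the population of non-existent effects studied, not to all studies.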

Redirected to IEEE

The paper chairs acknowledged receiving the email but took some time to get back to us. Besides arriving during everyone’s summer vacation, there was apparently no precedent for requesting a corrigendum (corrections for mistakes made by the authors) at this publication venue, so they needed a couple months to figure out how to go about it. Here was what IEEE eventually told them:

Generally updates to the final PDF files are not allowed once they are posted in Xplore. However, the author may be able to add an addendum to address the issue. They should contact ieee-mce@ieee.org to make the request. 

So we contacted that email address and after a month and a half got the following reply:

We have received your request to correct an error in your work published in the IEEE Xplore digital library. IEEE does not allow for corrections within the full-text publication document (e.g., PDF) within IEEE Xplore, and the IEEE Xplore metadata must match the PDF exactly.  Unfortunately, we are unable to change the information on your paper at this time.  We do apologize for any inconveniences this may cause.

This response is absurd. For any publisher of scientific research, there is always some mechanism for corrigenda. But IEEE has a policy against it.

Trying a different approach

I emailed IEEE again asking how this complies with the IEEE code of ethics:

I am surprised by this response, as it does not appear consistent with the IEEE code of ethics (https://www.ieee.org/about/corporate/governance/p7-8.html), which states that IEEE members agree:

“7 … to acknowledge and correct errors…”

I would appreciate advice on how we can comply with an ethical code that requires correcting errors when IEEE does not allow for it. 

And one of the BELIV organizers, to their credit, backed us up by replying as well:

As the organizer of the scientific event for which the error is meant to be reported, […] I am concerned about the IEEE support response that there are NO mechanisms in place to correct errors in published articles. I have put the IEEE ethics board in the cc to this response and hope for an answer on how to acknowledge and correct errors as an author of an IEEE published paper.

The IEEE ethics board was CCed, but we never heard from them. However, we did hear from someone involved in “Board Governance & Intellectual Property Operations”:

IEEE conference papers are published as received. The papers are submitted by the conference organizers after the event has been held, and are not edited by IEEE. Each author assumes complete responsibility for the accuracy of the paper at the time of publication. Each conference is considered a stand-alone publication and thus there is no mechanism for publishing corrections (e.g., in a later issue of a journal). The conference proceedings serves as a ‘snapshot’ of what was distributed at the conference at the time of presentation and must remain as is. IEEE will make metadata corrections (misspelled author name, affiliation, etc) in our database, but per IEEE Publications policy, we do not edit a published PDF unless the PDF is unreadable. 

That said, any conference author who identifies an error in their work is free to build upon and correct a previously published work by submitting to a subsequent conference or journal. We apologize for any inconvenience this may cause.

The problem with IEEE’s suggestion

Rather than follow the norm of scientific publishing and even its own ethics policies, IEEE suggests that we submit an updated version of the paper to another conference or journal. This approach is unworkable for multiple reasons:

1) It doesn’t solve the problem that the incorrect statements are available and citable.

Keeping the paper available potentially spreads misinformation. In our paper, these issues are minor and can be checked via other sources. But what if they substantially impacted the conclusions? This year, IEEE published a number of papers about COVID-19 and pandemics. Are they saying that one of these papers should not be corrected even if the authors and paper chairs acknowledge they include a mistake? 

2) A new version would be rejected for being too similar to the old version.

According to IEEE’s policies, if you update a paper and submit a new version, it must include “substantial additional technical material with respect to the … articles of which they represent an evolution” (see IEEE PSPB 8.1.7 F(2)). Informally, this policy is often described as meaning that papers need 30% new content to be publishable. But some authors have added entire additional experiments to their papers and gotten negative reviews about the lack of major improvements over previous publications. In other words, minor updates would get rejected. And I don’t see any need to artificially inflate the paper with 30% more content just for the heck of it.

It could even be rejected for self-plagiarism unless we specifically cite the original paper somehow. What a great way to bump up your h-index! “And in conclusion, as we already said in last year’s paper…”

3) An obnoxious amount of work for everyone involved.

The new version would need to be handled by a paper chair (conference) or editor (journal), assigned to a program committee member (conference) or action editor (journal), have reviewers recruited, be reviewed, have a meta-review compiled, and be discussed by the paper chairs or editors. What a blatant disregard for other people’s time.

The sledgehammer option

I keep cringing every time I get a Google Scholar alert for the paper. That’s not a good place to be. I looked into options for retracting it, but IEEE doesn’t seem very interested in retracting papers that make demonstrably incorrect statements or that incorrectly convey the authors’ intent:

Under an extraordinary situation, it may be desirable to remove access to the content in IEEE Xplore for a specific article, standard, or press book. Removal of access shall only be considered in rare instances, and examples include, but are not limited to, a fraudulent article, a duplicate copy of the same article, a draft version conference article, a direct threat of legal action, and an article published without copyright transfers. Requests for removal may be submitted to the Director, IEEE Publications. Such requests shall identify the publication and provide a detailed justification for removing access.  -IEEE PSPB 8.1.11-A

So attempting to retract is unlikely to succeed. Also, there’s no guarantee that we would not get accused of self-plagiarism if we retracted it and then submitted the updated version. And really, it’d be such a stupid way to fix a minor problem. I don’t have a better word to describe this situation. Just stupid.

Next steps

  1. Robert and I ask any authors who would cite our paper to cite the updated OSF version. Please do not cite the IEEE version. You can find multiple reference formats on the bottom right of the OSF page.
  2. This policy degrades the trustworthiness and citability of papers in IEEE conference proceedings. And any authors who have published with IEEE would be understandably disturbed by IEEE denigrating the reliability of their work. What if a paper contained substantial errors? And what if it misinformed and endangered the public? It is difficult to see these proceedings as any more trustworthy than a preprint. At least preprints have a chance of authors updating them. So use caution when reading or citing IEEE conference proceedings, as the authors may be aware of errors but unable to correct them.
  3. IEEE needs to make up its mind. It could decide to label conference proceedings as in-progress work and allow them to be republished elsewhere. However, if updated versions of conference papers cannot be resubmitted due to lack of novelty or “self-plagiarism”, IEEE needs to treat these conference papers the way that scientific journals treat their articles. In other words, if IEEE is to be a credible publisher of scientific content, it needs to abide by the basic Mertonian norm of enabling correction and the basic adult norm of acknowledging and correcting mistakes.

What about this idea of rapid antigen testing?

So, there’s this idea going around that seems to make sense, but then again if it makes so much sense I wonder why they’re not doing it already.

Here’s the background. A blog commenter pointed me to this op-ed from mid-November by Michael Mina, an epidemiologist and immunologist who wrote:

Widespread and frequent rapid antigen testing (public health screening to suppress outbreaks) is the best possible tool we have at our disposal today—and we are not using it.

It would significantly reduce the spread of the virus without having to shut down the country again—and if we act today, could allow us to see our loved ones, go back to school and work, and travel—all before Christmas.

Antigen tests are “contagiousness” tests. They are extremely effective (>98% sensitive compared to the typically used PCR test) in detecting COVID-19 when individuals are most contagious. Paper-strip antigen tests are inexpensive, simple to manufacture, give results within minutes, and can be used within the privacy of our own home . . .

If only 50% of the population tested themselves in this way every 4 days, we can achieve vaccine-like “herd effects” . . . Unlike vaccines, which stop onward transmission through immunity, testing can do this by giving people the tools to know, in real-time, that they are contagious and thus stop themselves from unknowingly spreading to others.

Mina continues:

The U.S. government can produce and pay for a full nation-wide rapid antigen testing program at a minute fraction (0.05% – 0.2%) of the cost that this virus is wreaking on our economy.

The return on investment would be massive, in lives saved, health preserved, and of course, in dollars. The cost is so low ($5 billion) that not trying should not even be an option for a program that could turn the tables on the virus in weeks, as we are now seeing in Slovakia—where massive screening has, in two weeks, completely turned the epidemic around.

The government would ship the tests to participating households and make them available in schools or workplaces. . . . Even if half of the community disregards their results or chooses to not participate altogether, outbreaks would still be turned around in weeks. . . .

The sensitivity and specificity of these tests has been a central debate – but that debate is settled. . . . These tests are incredibly sensitive in catching nearly all who are currently transmitting virus. . . .

But wait—if this is such a great idea, why isn’t it already happening here? Mina writes:

The antigen test technology exists and some companies overseas have already produced exactly what would work for this program. However, in the U.S., the FDA hasn’t figured out a way to authorize the at-home rapid antigen tests . . . We need to create a new authorization pathway within the FDA (or the CDC) that can review and approve the use of at-home antigen testing . . . Unlike vaccines, these tests exist today—the U.S. government simply needs to allocate the funding and manufacture them. We need an upfront investment of $5 billion to build the manufacturing capacity and an additional $10 billion to achieve production of 10-20 million tests per day for a full year. This is a drop in the bucket compared to the money spent already and lives lost due to COVID-19. . . .

I read all this and wasn’t sure what to think. On one hand, it sounds so persuasive. On the other hand, lots of tests are being done around here and I haven’t heard of these rapid paper tests. Mina talks about at-home use, but I haven’t heard about these tests being given at schools either. Also, Mina talks about the low false-positive rate of these tests, but I’d think the big concern would be false negatives. Also, it’s hard to believe that there’s this great solution and it’s only being done by two countries in the world (Britain and Slovakia). You can’t blame the FDA bureaucracy for things not happening in other countries, right?
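For what it’s worth, the arithmetic behind the “test half the population every 4 days” claim can be sanity-checked with a toy calculation. Everything below is a hypothetical illustration with numbers I made up (infectious period, compliance, baseline R), not figures from the op-ed: the idea is just that removing a chunk of infectious person-days scales transmission down, and with plausible inputs that scaling can push R below 1.

```python
# Back-of-envelope sketch of how frequent testing might cut transmission.
# All parameter values are hypothetical illustrations.
def transmission_cut(participation, sensitivity, test_interval_days,
                     infectious_days=7.0, isolation_compliance=0.8):
    """Approximate fraction of infectious person-days removed, assuming a
    participant's infection is caught, on average, test_interval/2 days
    into the infectious window, with immediate isolation if positive."""
    days_caught = max(0.0, infectious_days - test_interval_days / 2)
    frac_days_removed = days_caught / infectious_days
    return participation * sensitivity * isolation_compliance * frac_days_removed

R0 = 1.3  # hypothetical reproduction number under existing measures
cut = transmission_cut(participation=0.5, sensitivity=0.9,
                       test_interval_days=4)
print(f"transmission reduced by ~{100*cut:.0f}%, "
      f"R goes from {R0:.2f} to {R0*(1-cut):.2f}")
```

With these made-up inputs, 50% participation removes roughly a quarter of infectious person-days, enough to tip a marginally supercritical epidemic into decline. Whether the real inputs look anything like this is, of course, the whole debate.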

Anyway, I wasn’t sure what to think so I contacted my epidemiologist colleague Julien Riou, who wrote:

I think the idea does make sense from a purely epi side, even though the author appears extremely confident in something that has basically never been done (but maybe that’s what you need to do to get published in Time magazine). In principle, rapid antigen testing every 4 days (followed by isolation of all positive cases) would probably reduce transmissibility enough if people are relatively compliant and if the sensitivity is high. The author is quick to dismiss the issue of sensitivity, saying:

People have said these tests aren’t sensitive enough compared to PCR. This simply is not true. It is a misunderstanding. These tests are incredibly sensitive in catching nearly all who are currently transmitting virus. People have said these tests aren’t specific enough and there will be too many false positives. However, in most recent Abbott BinaxNOW rapid test studies, the false positive rate has been ~1/200.

Looking at the paper the author himself links (link), the sensitivity of the Abbott BinaxNOW is “93.3% (14/15), 95% CI: 68.1-99.8%”. I find it a bit dishonest not to present the actual number (he even writes “>98%” somewhere else, without a source so I couldn’t check) and deflect on specificity which is not the issue here (especially if there is a confirmation with RT-PCR). The authors of the linked paper even conclude that “this inherent lower sensitivity may be offset by faster turn-around, the ability to test more frequently, and overall lower cost, relative to traditional RT-PCR methods”. Fair enough, but far from “these tests are incredibly sensitive” in the Time piece.

Two more points on the sensitivity of rapid antigen tests. First, it is measured with the RT-PCR as the reference, and we know that the sensitivity of RT-PCR itself is not excellent. There are a lot of papers on that; I randomly picked this one, where the sensitivity is measured at 82.2% (95% CI 79.0-85.1%) for RT-PCR in hospitalised people. This should be combined with that of rapid antigen testing if you assume both tests are independent. Of course there is a lot more to say about this: sensitivity probably depends on who is tested, when, and whether there are symptoms, and both tests are probably not independent. Still, I think it’s worth mentioning, and again far from “these tests are incredibly sensitive”. Second, the sensitivity is measured in lab conditions, and while I don’t have a lot of experience with this, I doubt that you can expect everyone to use the test perfectly. And on top of that, people might not comply with isolation (especially if they have to work), and logistics problems are likely to occur.

Even with all these caveats, I think that this mass testing strategy might be sufficient to curb cases if we can pull it off. Combined with contact tracing, social distancing, masks and all the other control measures in place in most of the world, being able to identify and isolate even a small proportion of infectious cases that you wouldn’t see otherwise can be very helpful. We’ll soon be able to observe the impact empirically in Slovakia and Liverpool.

So, again, I’m not sure what to think. I’d think that even a crappy test if applied widely enough would be better than the current setting in which people use more accurate tests but then have to wait many days for the results. Especially if the alternative is some mix of lots of people not going to work and to school and other people, who do have to go to work, being at risk. On the other hand, some of the specifics in that above-linked article seem fishy. But maybe Riou is right that this is just how things go in the mass media.
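Riou’s sensitivity-chaining point is a one-line calculation, using the two numbers from the sources he cites and his stated (and acknowledged as shaky) independence assumption:

```python
# Chaining the two sensitivities Riou cites, under his independence
# assumption: antigen-vs-PCR (BinaxNOW study) times PCR-vs-truth
# (hospitalized-patient study).
antigen_vs_pcr = 0.933  # BinaxNOW sensitivity, measured against RT-PCR
pcr_vs_truth = 0.822    # RT-PCR sensitivity in hospitalised people
combined = antigen_vs_pcr * pcr_vs_truth
print(f"implied antigen sensitivity vs. true infection: {combined:.1%}")
```

That works out to roughly 77%, which is Riou’s point: a long way from “incredibly sensitive,” even before accounting for at-home use conditions.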

“It’s turtles for quite a way down, but at some point it’s solid bedrock.”

Just once, I’d like to hear the above expression.

It can’t always be turtles all the way down, right? Cos if it was, we wouldn’t need the expression. Kind of like if everything was red, we wouldn’t need any words for colors.

What are the most important statistical ideas of the past 50 years?

Aki and I wrote this article, doing our best to present a broad perspective.

We argue that the most important statistical ideas of the past half century are: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. These eight ideas represent a categorization based on our experiences and reading of the literature and are not listed in chronological order or in order of importance. They are separate concepts capturing different useful and general developments in statistics. We discuss common features of these ideas, how they relate to modern computing and big data, and how they might be developed and extended in future decades.

An earlier version of this paper appeared on arXiv, but then we and others noticed some things to fix, so we updated it.

Here are the sections of the paper:

1. The most important statistical ideas of the past 50 years

1.1. Counterfactual causal inference
1.2. Bootstrapping and simulation-based inference
1.3. Overparameterized models and regularization
1.4. Multilevel models
1.5. Generic computation algorithms
1.6. Adaptive decision analysis
1.7. Robust inference
1.8. Exploratory data analysis

2. What these ideas have in common and how they differ

2.1. Ideas lead to methods and workflows
2.2. Advances in computing
2.3. Big data
2.4. Connections and interactions among these ideas
2.5. Theory motivating application and vice versa
2.6. Links to other new and useful developments in statistics

3. What will be the important statistical ideas of the next few decades?

3.1. Looking backward
3.2. Looking forward

The article was fun to write and to revise, and I hope it will motivate others to share their views.

The p-value is ~~4.76×10^−264~~ 1 in a quadrillion

Ethan Steinberg writes:

It might be useful for you to cover the hilariously bad use of statistics used in the latest Texas election lawsuit.

Here is the raw source, with the statistics starting on page 22 under the heading “Z-Scores For Georgia”. . . .

The main thing about this analysis that’s so funny is that the question itself is so pointless. Of course Hillary’s vote count is different from Joe’s vote count! They were different candidates! Testing the null hypothesis is really pointless and it’s expected that you would get such extreme z-scores. I think this provides a good example of how statistics can be misused and it’s funny to see this level of bad analysis in a high level legal filing.

Here’s the key bit:

There are a few delightful—by which I mean, horrible—items here:

First off, did you notice how he says “In 2016, Trump won Georgia” . . . but he can’t bring himself to say that Biden won in 2020? Instead, he refers to “The Biden and Trump percentages of the tabulations.” So tacky. Tacky tacky tacky. If you want to maintain uncertainty, fine, but then refer to “the Clinton and Trump percentages of the tabulations” in 2016.

Second, the binomial distribution makes no sense here. This corresponds to a model in which voters are independently flipping coins (approximately; not quite coin flips because the probability isn’t quite 50%) to decide how to vote. That’s not how voting works. Actually, most voters know well ahead of time who they will be voting for. So even if you wanted to test the null hypothesis of no change (which, as my correspondent noted above, you don’t), this would be the wrong model to use.

Third . . . don’t you love that footnote 3? Good to be educating the court on the names of big powers of ten. Next step, the killion, which, as every mathematician knows, is a number so big it can kill you.

Footnote 3 is just adorable.

What next, a p-value of 4.76×10^−264?
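The binomial point above is easy to make concrete. The vote totals and shares below are hypothetical round numbers, not the report’s figures, but they show the mechanism: with n in the millions, the binomial standard error is tiny, so even an ordinary election-to-election shift of a couple of percentage points sits a hundred standard errors from the null, which tells you the null model is nonsense, not that anything suspicious happened.

```python
# Why a binomial "same share as last election" test on millions of
# votes always gives absurd z-scores. Numbers are illustrative.
import math

n = 5_000_000        # roughly statewide turnout (hypothetical)
p_2016 = 0.473       # candidate's share in election 1 (hypothetical)
p_2020 = 0.495       # candidate's share in election 2 (hypothetical)

se = math.sqrt(p_2016 * (1 - p_2016) / n)  # binomial standard error
z = (p_2020 - p_2016) / se
print(f"z = {z:.0f}")  # enormous, because the null model is wrong
```

A 2.2-point swing, completely unremarkable between elections with different candidates, comes out near z = 100 under this model, i.e. a “p-value” astronomically close to zero. Garbage null in, garbage certainty out.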

The author of the expert report is a Charles J. Cicchetti, a Ph.D. economist who has had many positions during his long career, including “Deputy Directory of the Energy and Environment Policy Center at the John F. Kennedy School of Government at Harvard University.”

The moral of the story is: just because someone was the director of a program at Harvard University, or a professor of mathematics at Williams College, don’t assume they know anything at all about statistics.

The lawsuit was filed by the State of Texas. That’s right, Texas tax dollars were spent hiring this guy. Or maybe he was working for free. If his consulting fee was $0/hour, that would still be too high.

Given that the purpose of this lawsuit is to subvert the express will of the voters, I’m glad they hired such an incompetent consultant, but I feel bad for the residents of Texas that they had to pay for it. But, jeez, this is really sad. Even sadder is that these sorts of statistical tests continue to be performed, 55 years after this guy graduated from college.

P.S. The lawsuit has now been supported by 17 other states. There’s no way they can believe these claims. This is serious Dreyfus-level action. And I’m not talking about Amity Beach.

Postdoc at the Polarization and Social Change Lab

Robb Willer informs us that the Polarization and Social Change Lab has an open postdoctoral position:

The Postdoctoral Associate will be responsible for co-designing and leading research projects in one or more of the following areas: political polarization; framing, messaging, and persuasion; political dimensions of inequality; social movement mobilization; and online political behavior.

This looks super-interesting!

Also, the lab is at Stanford, so maybe they could do some local anthropology and study what’s going on at the Hoover Institution.

“A better way to roll out Covid-19 vaccines: Vaccinate everyone in several hot zones”?

Peter Dorman writes:

This [by Daniel Teres and Martin Strossberg] is an interesting proposal, no? Since vaccines are being rushed out the door with limited testing, there’s a stronger than usual case for adaptive management: implementing in a way that maximizes learning. I [Dorman] suspect there would also be large economies in distribution if localities were the units of sequencing rather than individuals. It would be useful to hear from your readers what they think a good distribution-cum-research-design plan would look like.

In the article, Teres and Strossberg write:

Vaccines are on the brink of crossing the finish line of approval, but the confusion surrounding the presidential transition has brought great uncertainty to the distribution plan.

The National Academies of Sciences, Engineering, and Medicine developed an ethical framework for equitable distribution of Covid-19 vaccines, as have others. But national plans based on these frameworks are problematic. They recommend giving the vaccine first to Phase 1a front line high-risk health workers and first responders. That stretches the supply chain to include workers in every hospital, nursing home, long-term care facility, as well as all ambulance, fire rescue, and police first responders. . . .

We propose a different approach: target several hot zones with high numbers of Covid-19 cases, especially those zones with rising Covid-19 hospitalization rates. . . . Vaccination within each hot zone would begin with Phase 1a individuals and then move on quickly through Phases 1b, 2, 3, and 4. . . .

We believe that our approach offers the best way to break the chain of transmission. The first 60 days will be key in showing the results. Our plan has many advantages over the “phase” plan proposed by the National Academies.

This seems reasonable to me. On the other hand, I don’t know anything about this, and I’m easily persuaded. What do youall think?

Covid crowdsourcing

Macartan Humphreys writes:

We put together a platform that lets researchers contribute predictive models of cross national (and within country) Covid mortality, focusing on political and social accounts.

The plan then is to aggregate using a stacking approach.

Go take a look.

Unlike MIT, Scientific American does the right thing and flags an inaccurate and irresponsible article that they mistakenly published

Here’s the story:

Scene 1

A few months ago I wrote about a really bad article that appeared in Undark, MIT’s science magazine. The article was so bad it lowered my opinion of MIT, my alma mater, in that it showed such poor judgment by the administration to sponsor this kind of irresponsible anti-scientific crap. MIT’s done worse things, of course, but I knew enough about this case to be particularly disappointed in the institution. (I’m not naming the authors of the article here, not because it’s any kind of secret—just follow the links—but because my problem is not with the people, it’s with the institutions. Bad actors have power because they’re empowered by others, and well-meaning people can become bad actors when their mistakes are not corrected.)

The problem in this story was not so much the errors in the published article—nobody’s perfect, even the best writers make mistakes, and as a journalist it’s hard not to be swayed by your primary contacts. No, the big, big problem was that the magazine refused, absolutely refused, to correct their article even after the errors were pointed out to them. I’ve come to expect this sort of unscholarly behavior from the Association for Psychological Science—after all, they have their bigshots to protect. But I was disturbed to see this coming from MIT.

As I wrote at the time, beyond its misrepresentations, that Undark article was horrible because it offered a science-empty take on science. They took a science dispute and tried to turn it into a political dispute. Part of the job for Politico magazine, maybe; not part of the job for MIT. The authors tried their best to disparage the work of real science reporter Stephanie Lee, but all they did is make themselves look bad—and muddy the waters for casual readers who didn’t know the whole story.

Scene 2

That was too bad. I sent some emails to MIT people who didn’t respond, and I was preparing yet another but didn’t bother to send it. Life is too short.

Then a couple weeks ago the authors of the bad article rehashed their story in Scientific American magazine.

It was the same crap as before: taking a scientific dispute and making it personal (referring to a paper where a particular scientist was 16th author as “his science”), attempting to discredit Lee’s reporting by labeling Buzzfeed as “left-leaning” (I’m guessing that Buzzfeed is actually about as left-leaning as Scientific American!), using loaded language (critics are “accusing,” “public shaming,” “complaining,” “below-the-belt,” “attacking”), dismissing statisticians’ analyses on the grounds of “absurdity,” and saying Lee’s “charges” were “wrong on all accounts” without refuting anything that she said. The bit where the guy donated $5000 to support the study but that doesn’t count because the 16th author said he received “zero dollars” . . . huh? I guess they call it “below-the-belt” because it’s true.

I wasn’t sure what could be done about this. There’s little to be gained by getting into a fight with journalists who refuse to act in good faith and who only declare their blatant conflicts of interest when forced to do so. It’s frustrating, though: they demonstrate a willingness to mislead, and that lets them win. If they were reasonable people, we could publicly disagree with them. But since they don’t care about getting things right, they get to set the agenda.

Really sad to see venerable institutions such as MIT and then Scientific American getting conned by these people. Martin Gardner would be spinning in his grave.

Scene 3

Stephanie Lee responded to the bad article on Twitter! I hate Twitter—it just seems like a terrible format for telling stories, laying out an argument, or having a discussion—but, given the constraints of that form, Lee makes her points well. I was glad to see that she went to the trouble of doing this.

Scene 4

Scientific American flagged the article! Here’s what the editors added:

Editor’s note: This article was originally published on November 30, 2020 with a number of errors and misleading claims. First, it should have been labeled “Opinion,” but was not. Second, the authors’ bylines were omitted. Third, the authors failed to note that they have collaborated in the past with both John Ioannidis and Vinay Prasad, who are discussed in this essay, and also in this accompanying story. This, we now understand, was also the case with a similar opinion piece by the same authors in Undark magazine in June. Fourth, the authors did not disclose that there were other problematic issues raised about the design of a study co-authored by John Ioannidis, most notably how the study authors recruited study participants and how independent faculty at Stanford said that they were unable to verify the accuracy of their test.
Other specific errors or omissions are noted with asterisks in the text below. Scientific American sincerely regrets all of these errors.

Wow. Good job, Scientific American. It’s such standard practice for institutions to circle the wagons and defend wrongdoers, often using bureaucratic language (see this recent example), and it was such a breath of fresh air to see Scientific American do the right thing. The authors weren’t willing to correct their misrepresentations, so the magazine did it for them.

P.S. One annoying thing about this whole episode is that the authors of this article aren’t even serving their own cause by misrepresenting the science and ethics issues. It would be so easy for them to have made their case directly, something like this:

From the Archives of Psychological Science

Jay Livingston pointed me to PostSecret, which I’d never heard of before, and the above image, which apparently first appeared in 2011.

P.S. The image and the title of this post do not quite align. My concern with the journal Psychological Science is about incompetent work rather than made-up data.

Are we constantly chasing after these population-level effects of these non-pharmaceutical interventions that are hard to isolate when there are many good reasons to believe in their efficacy in the first instance?

A couple days ago we discussed issues of communicating uncertainty in a coronavirus mask experiment. That study itself is not so important, but I remain interested in the larger issues of inference and communication.

I sent the discussion to epidemiologist Jon Zelner, who wrote:

The struggle is real! I think this is a nice example of a time where some proof-of-principle lab studies can be more informative than population-based ones, like this one from Nancy Leung et al. I worry that we’re constantly chasing after these population-level effects of these non-pharmaceutical interventions that are hard to isolate when there are many good reasons to believe in their efficacy in the first instance.

Shameless plug on this end; we put up this preprint recently using a simple model that shows how easy it can be to get type M errors in intervention studies when you preferentially use data from large outbreaks.

Not quite the same topic, but I think it highlights the challenges of inferring these effects, and to me it sort of opens up the question of (a) whether there’s a world in which it is feasible to do so and (b) whether we should be putting our efforts in a different direction anyway.

I’m not sure what to think? I see Jon’s point; on the other hand, we’re also interested in population-level effects, so it makes sense to try to estimate them too, as long as we can be open about our uncertainties.

P.S. Here’s the title and abstract of the second paper that Zelner links to:

Preferential observation of large infectious disease outbreaks leads to consistent overestimation of intervention efficacy

Data from infectious disease outbreaks in congregate settings are often used to elicit clues about which types of interventions may be useful in other facilities. This is commonly done using before-and-after comparisons in which the infectiousness of pre-intervention cases is compared to that of post-intervention cases and the difference is attributed to intervention impact. In this manuscript, we show how a tendency to preferentially observe large outbreaks can lead to consistent overconfidence in how effective these interventions actually are. We show, in particular, that these inferences are highly susceptible to bias when the pathogen under consideration exhibits moderate-to-high amounts of heterogeneity in infectiousness. This includes important pathogens such as SARS-CoV-2, influenza, Noroviruses, HIV, Tuberculosis, and many others.

Seems like an important point.
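The bias that Zelner and his coauthors describe can be illustrated with a toy simulation (my own sketch, not the paper's model; the dispersion parameter k = 0.2 and the 5-case observation threshold are illustrative assumptions): draw each index case's number of secondary infections from a heavy-tailed negative binomial, then condition on seeing only the large outbreaks.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each index case infects a negative-binomial number of others;
# k = 0.2 gives the heavy-tailed "superspreading" pattern often
# described for SARS-CoV-2 (k and the threshold below are illustrative).
R, k = 1.0, 0.2
secondary = rng.negative_binomial(k, k / (R + k), size=100_000)

# Using all outbreaks, the naive estimate of R is about right
print(secondary.mean())

# But if we preferentially observe large outbreaks, pre-intervention
# infectiousness looks inflated, so any post-intervention drop is
# partly just regression to the mean.
big = secondary >= 5
print(secondary[big].mean())
```

The before-and-after comparisons the abstract criticizes implicitly condition on the first number being large, which is exactly the selection effect shown here.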

Quine’s be Quining

Ron Bloom sends along the above and writes, “The rest of the article is just as crackling as is this paragraph.”

OK, so I went and read the article (Two Dogmas of Empiricism, from 1951), and I don’t really get it. I like the above-quoted paragraph but I couldn’t get much out of the rest of it. Maybe these ideas have just been absorbed in our thinking so they don’t seem special any more?

P.S. After I wrote this post but before it appeared, Bob’s been recommending Quine pretty strongly. Bob claims that Quine made Lakatos obsolete. I still don’t get it—but, hey, there’s a lot of things I don’t get that are still important!

Discussion of uncertainties in the coronavirus mask study leads us to think about some issues . . .

1. Communicating uncertainty

A member of the C19 Discussion List, which is a group of frontline doctors fighting Covid-19, asked me what I thought of this opinion article, “Covid-19: controversial trial may actually show that masks protect the wearer,” published last month by James Brophy in the British Medical Journal.

Brophy writes:

Paradoxically, the publication last week of the first randomized trial evaluating masks during the current covid-19 pandemic and a meta-analysis of older trials seems to have heightened rather than reduced the uncertainty regarding their effectiveness. . . .

The DANMASK-19 trial was performed in Denmark between April and May 2020, a period when public health measures were in effect, but community mask wearing was uncommon and not officially recommended. All participants were encouraged to follow social distancing measures. Those in the intervention arm were additionally encouraged to wear a mask when in public and were provided with a supply of 50 surgical masks and instructions for proper use. Crucially, the outcome measure was rates of infection among those encouraged to wear masks and not in the community as a whole, so the study could not evaluate the most likely benefit of masks, that of preventing spread to other people. The study was designed to find a 50% reduction in infection rates among mask wearers.

Here’s what happened in the study:

Among the 4862 participants who completed the trial, infection with SARS-CoV-2 occurred in 42 of 2392 (1.8%) in the intervention arm and 53 of 2470 (2.1%) in the control group. The between-group difference was −0.3% point (95% CI, −1.2 to 0.4%; P = 0.38) (odds ratio, 0.82 [CI, 0.54 to 1.23]; P = 0.33).

And here’s how it got summarized:

This led to the published conclusion: “The recommendation to wear surgical masks to supplement other public health measures did not reduce the SARS-CoV-2 infection rate among wearers by more than 50% in a community with modest infection rates, some degree of social distancing, and uncommon general mask use. The data were compatible with lesser degrees of self-protection.”

As Brophy writes, this is an unusual way to summarize a study with a non-statistically-significant result (in this case, an estimated reduction of 20% in infection rates with a standard error of 20%). Usually such a result would be summarized in a sloppy way as reflecting “no effect,” or summarized more carefully as being “consistent with no effect.” But it is a mistake to report a non-statistically-significant result as representing a “dubious treatment” or no effect. And that’s what Brophy is struggling with.
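Where does "a reduction of 20% with a standard error of 20%" come from? It can be backed out of the published confidence interval by a standard calculation on the log-odds scale (the numbers are from the DANMASK-19 summary quoted above):

```python
import math

# DANMASK-19 reported an odds ratio of 0.82 (95% CI 0.54 to 1.23).
# Back out the log odds ratio and its standard error from the CI.
or_hat, lo, hi = 0.82, 0.54, 1.23
log_or = math.log(or_hat)                        # about -0.20
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # about 0.21

# Roughly a 20% reduction with a 20% standard error:
# compatible with no effect, and also with a 40%+ reduction.
print(round(log_or, 2), round(se, 2))
```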

Brophy explains:

Incorrect interpretation of “negative” trials abounds, even among thoughtful academics, regardless of their personal beliefs about the studied intervention. . . . while the editorial accompanying the trial concludes that masks may work, it does so while also implying that the trial itself was negative, stating “. . . despite the reported results of this study, (masks) probably protect the wearer.”

I agree with Brophy’s criticism here. There should be no “despite” in that sentence, as the results of that study do not at all contradict the hypothesis that masks protect the wearer.

Brophy continues:

The results of DANMASK-19 do not argue against the benefit of masks to those wearing them but actually support their protective effect.

I guess I agree here too, but if you’re gonna say this, I think you should emphasize that the data are also consistent with no effect. Otherwise you can be misleading people in the other direction.

But ultimately there are no easy answers here. It’s similar to the struggles we have had when communicating probabilistic election forecasts. There’s also some interesting discussion in the comments section of Brophy’s article.

2. Experimental design and experimental reality

But there’s something else about this experiment that I hadn’t noticed at first which I think is also relevant to our discussion.

Recall the data summary:

Among the 4862 participants who completed the trial, infection with SARS-CoV-2 occurred in 42 of 2392 (1.8%) in the intervention arm and 53 of 2470 (2.1%) in the control group. The between-group difference was −0.3% point (95% CI, −1.2 to 0.4%; P = 0.38) (odds ratio, 0.82 [CI, 0.54 to 1.23]; P = 0.33).

A rate of 2% . . . that’s pretty low. The study began in April, 2020, a time when there was a lot of well-justified panic about coronavirus. Before all the lockdowns and social distancing, we were concerned about rapid spread of the disease.

I have two points here.

First, studies like this are usually designed to be large enough to have statistically significant results. I don’t know the details of the design of this study, but if they were anticipating, say, a 10% rate of infection rather than 2%, then this would correspond to a much larger number of cases and much more precision about the relative effect size.

Second, as Brophy notes, the main motivation for general mask use is not to protect the mask-wearer from infection but rather to protect others from being infected by the mask-wearer. In either case, you’d expect the overall effectiveness of masks to be higher in settings where there is more infection. Just as with “R0” and the “infection fatality rate” and other sorts of numbers that we’re hearing about, “the effectiveness of masks” is not a constant—it’s not a thing in itself—it depends on contexts. You’d expect masks to be much more effective if you’re in a busy city with lots of social interactions and regularly encountering infected people than if you’re a ghost, living in a ghost town.

These two points get lost in the usual way that this sort of study gets reported. The first point is missed because there is an unfortunate tendency not to think about the design once the data have been collected. The second point is missed because we’re trained to think about treatment effects and not their variation.
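The first point can be put in numbers with a back-of-the-envelope calculation (assuming roughly 2400 participants per arm and a hypothetical 20% relative reduction), using the usual approximation that the variance of the log odds ratio is the sum of the reciprocals of the four expected cell counts:

```python
import math

def se_log_or(p_ctrl, rel_reduction, n_per_arm):
    """Approximate SE of the log odds ratio: sqrt of the sum of
    reciprocal expected cell counts in the 2x2 table."""
    p_trt = p_ctrl * (1 - rel_reduction)
    a = n_per_arm * p_trt          # events, intervention arm
    b = n_per_arm * (1 - p_trt)    # non-events, intervention arm
    c = n_per_arm * p_ctrl         # events, control arm
    d = n_per_arm * (1 - p_ctrl)   # non-events, control arm
    return math.sqrt(1/a + 1/b + 1/c + 1/d)

# Same trial size, hypothetical 20% relative reduction:
print(se_log_or(0.02, 0.2, 2400))   # baseline rate 2%
print(se_log_or(0.10, 0.2, 2400))   # baseline rate 10%
```

With a 10% infection rate instead of 2%, the same trial would have estimated the relative effect about twice as precisely.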

What’s Google’s definition of retractable?

Timnit Gebru, a computer scientist known best for her work on ethics and algorithmic bias in AI/ML applications like face recognition, was fired yesterday from co-leading Google’s Ethical Artificial Intelligence Team. Apparently this was triggered by an email she sent to members of her team. Social media is exploding over this, and I don’t have all the information so I won’t speculate on exactly what happened. However, one aspect of the media storm caught my attention, since it relates to the question of what makes for a good reason to retract research. 

From publication of emails that Gebru purportedly sent her co-workers, and that Jeff Dean, director of Google AI, purportedly sent to Google AI employees to explain what happened, we get some info about Google’s internal process for granting its employees permission to submit papers for external publication. From Dean’s email, it seems all research papers co-authored with a Google employee and intended for external publication require such review. Dean’s email describes the outcome of review of a paper Gebru was an author on, which Google requested she withdraw from submission:  

A cross functional team then reviewed the paper as part of our regular process and the authors were informed that it didn’t meet our bar for publication and were given feedback about why. It ignored too much relevant research — for example, it talked about the environmental impact of large models, but disregarded subsequent research showing much greater efficiencies.  Similarly, it raised concerns about bias in language models, but didn’t take into account recent research to mitigate these issues. 

Wait a second, these issues remind me of the paper on gender and mentoring that we discussed a couple weeks ago. That paper incited responses from many because it jumped from associations between researchers’ later-career citations and the gender of the senior researchers they published with early on to conclusions about mentorship, without acknowledging prior work that might have helped explain things. There were no major statistical errors in the paper, though (actually there may be some errors, but it’s not clear that they threatened any of the main results). The backlash was mostly a function of how the interpretations seemed to accept that female mentorship led to lower-quality research, and the paper’s failure to acknowledge a body of work on gender and citations that could have provided alternative explanations for the results, explanations that might not have required quite so many inferential leaps.

That paper wasn’t retractably bad, at least not in my opinion. One problem with arguing to censor research for potentially harmful speculation or missed literature is that what catches someone’s eye as a problematic omission or speculation is going to be highly subject to the reader’s values. To argue that such issues are fatal flaws, on the grounds of possible harm to beliefs or future behavior, requires a lot of speculation and an assumption that readers can’t recognize for themselves what is more or less plausible, a problematic insinuation that leads to the conclusion that all research must be perfectly accurate. But who gets to define that? (Not to mention that if we were to retract all papers that did these things, we might not have a lot of research left!)

So if Dean’s email really is describing the major problems with the paper Google asked Gebru to retract, it would suggest their internal review process allows for these judgment calls, albeit perhaps for different reasons. I expect corporations care a lot about the reputation of their brand, so I wouldn’t be surprised if their process allows for calls like this under the guise of protecting business interests. But it’s a definition of censorship-worthy that leaves a lot of room open for bias to creep in. It makes me wonder how often these types of issues are used to censor papers by employees, and how Google researchers view the intended role of the review. Is it to enforce a shared definition of quality, the way a professor might for the PhD students in their lab? One would think the Google researchers doing the research would be the experts in the company on it. Or is the review supposed to be mostly about preventing leakage of sensitive information to protect privacy or IP? I guess we’ll have to wait for an answer, since I don’t expect Google to release the history of their internal evaluations anytime soon.

Update (Dec 4 2020): Jeff Dean posted a file with his email and more information on Google’s review process.

MIT Technology Review reports on what the paper was about.

How to think about correlation? It’s the slope of the regression when x and y have been standardized.

Dave Balan writes:

I am an economist at the Federal Trade Commission with a very basic statistics question, one that I have put to several fairly high-powered econometricians, and to which no one has had a satisfying answer.

The question is this. Why are correlations meaningful? We know that they are ubiquitous, they get reported all the time in work across many disciplines. But for the life of me I cannot understand what the question is to which a correlation is the answer. I get that it’s sometimes useful to know whether or not the correlation is close to 0; if it is close to 0 then you know that it’s not too far from the truth to say that no (linear) relationship exists, and that might be all you need to know. By the same token, a correlation of, say, 0.9 tells you that it’s nowhere close to being true that no linear relationship exists, so you need to go further and investigate what that relationship is. What I can’t understand is why people interpret that 0.9 as a meaningful standalone number in its own right. A correlation of 0.9 means that the data lines up pretty nicely along some line with a positive slope, but that slope can be anywhere from just above 0 to just below infinity. What good does it do to know that a strong linear relationship exists when you have no idea what that relationship is?

To take the example of your recent (very interesting) election work, a finding that the correlation in the polling errors between State A and State B is 0 would clearly be important and relevant. And so a finding that the correlation is far from 0 is clearly important insofar as it tells you that it’s definitely not OK to assume that it’s zero. But what is its importance beyond that? What good does it do to know that the polling errors between State A and State B are highly correlated if you don’t know whether a 1 percentage point error in state A is associated with an error of 1 percentage point, or 0.1 points, or 2 points in State B?

I know that correlations have the advantage of being unit-free. And that’s nice, but it doesn’t seem to solve the problem.

Am I missing something fundamental here? If so, I hope you will share what it is. If not, is it a serious problem? Is there some other unit-free number that could be used instead? Maybe something like the elasticities that economists use?

I replied that the way I think about the correlation is that it’s the slope of the regression of y on x if the two variables have been standardized to have the same sd. And I pointed him to section 12.3 of Regression and Other Stories, which discusses this point.

Balan followed up:

Below is my [Balan’s] attempt at some intuition:

A. Since the correlation is the common slope of the y-on-x regression line and the x-on-y regression line, the dots must be configured in such a way that they look pretty much the same if you flip the axes.

B. The only way that that can be true is if the dots lie around some line with a slope of 1.

C. Note that this does NOT mean that the regression line through those dots is 1; rather, it has to be <= 1 (per your book).

D. Since the dots line up along a line with a slope of 1, they will still line up along a line with a slope of 1 when you flip the axes. The intercept might change, but the slope won’t.

E. And since the orientation of the dots does not change much (and in the limit doesn’t change at all), the regression line through them does not change either.

The part that I had a hard time understanding was why it is impossible for the dots to line up perfectly along a line with a slope other than 1, or to line up imperfectly along a line with a slope equal to 1. I think this is where the assumption of equal sd matters. If two variables have the same sd, then having a correlation of 1 means that they are basically the same variable (possibly shifted due to different means), which means that the only line that they can line up perfectly along has a slope of 1. Similarly, if they do not have a correlation of 1, then the regression to the mean described in your book kicks in, so that the regression line must be less than 1 and the randomness means that the dots will not line up perfectly along that line.

To which I responded: Yes, corr is like a rescaled regression coefficient. Sometimes this makes sense, other times it does not. For example if you are computing elasticity, which is roughly speaking the regression of log(output) on log(input), then standardization would make no sense at all. But if x and y are two different standardized tests, it could make sense to renorm each to have mean 0 and sd 1.
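The identity at the center of this exchange, that the correlation equals the regression slope once both variables are standardized to sd 1, is easy to check numerically (a minimal sketch with simulated data; the slope of 2 and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)   # true slope 2, so slope != correlation

# Raw regression slope of y on x
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Standardize both variables; now the slope IS the correlation
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
slope_std = np.cov(zx, zy)[0, 1] / np.var(zx, ddof=1)

corr = np.corrcoef(x, y)[0, 1]
print(slope, slope_std, corr)   # slope near 2; slope_std equals corr
```

With equal sds, the y-on-x and x-on-y regressions share this slope, which is the symmetry Balan's point A describes.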

The Shrinkage Trilogy: How to be Bayesian when analyzing simple experiments

There are lots of examples of Bayesian inference for hierarchical models or in other complicated situations with lots of parameters or with clear prior information.

But what about the very common situation of simple experiments, where you have an estimate and standard error but no clear prior distribution? That comes up a lot! In such settings, we usually just go with a non-Bayesian approach, or we might assign priors to varying coefficients or latent parameters but not to the parameters of primary interest. But that’s not right: in many of these problems, uncertainties are large, and prior information makes a difference.

With that in mind, Erik van Zwet has done some research. He writes:

Our paper is now on arXiv where it forms a “shrinkage trilogy” with two other preprints. It would be really wonderful if you would advertise them on your blog – preferably without the 6 months delay! The three papers are:

1. The Significance Filter, the Winner’s Curse and the Need to Shrink at http://arxiv.org/abs/2009.09440 (Erik van Zwet and Eric Cator)

2. A Proposal for Informative Default Priors Scaled by the Standard Error of Estimates at http://arxiv.org/abs/2011.15037 (Erik van Zwet and Andrew Gelman)

3. The Statistical Properties of RCTs and a Proposal for Shrinkage at http://arxiv.org/abs/2011.15004 (Erik van Zwet, Simon Schwab and Stephen Senn)

He summarizes:

Shrinkage is often viewed as a way to reduce the variance by increasing the bias. In the first paper, Eric Cator and I argue that shrinkage is important to reduce bias. We show that noisy estimates tend to be too large, and therefore they must be shrunk. The question remains: how much?

From a Bayesian perspective, the amount of shrinkage is determined by the prior. In the second paper, you and I propose a method to construct a default prior from a large collection of studies that are similar to the study of interest.

In the third paper, Simon Schwab, Stephen Senn and I apply these ideas on a large scale. We use the results of more than 20,000 RCTs from the Cochrane database to quantify the bias in the magnitude of effect estimates, and construct a shrinkage estimator to correct it.

It’s all about the Edlin factor.
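The significance-filter point from the first paper can be demonstrated by simulation (a toy sketch, not the paper's analysis; the unit-normal prior and SE are illustrative assumptions): draw true effects from a prior, add noise, keep only the "significant" estimates, and compare their magnitudes to the truth.

```python
import numpy as np

rng = np.random.default_rng(1)

# True effects drawn from a prior; each study reports a noisy estimate
# whose standard error is comparable to the sd of the true effects.
theta = rng.normal(0, 1, size=100_000)            # true effects
est = theta + rng.normal(0, 1, size=theta.size)   # estimates, SE = 1

# The significance filter: keep only |estimate / SE| > 1.96
sig = np.abs(est) > 1.96

# Among the "significant" results, estimates exaggerate the truth
exaggeration = np.mean(np.abs(est[sig])) / np.mean(np.abs(theta[sig]))
print(exaggeration)   # well above 1: selected estimates are too large
```

Selecting on significance induces bias even though each individual estimate is unbiased, which is why shrinkage here reduces bias rather than trading it for variance.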

Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond

Charles Margossian, Aki Vehtari, Daniel Simpson, Raj Agrawal write:

Gaussian latent variable models are a key class of Bayesian hierarchical models with applications in many fields. Performing Bayesian inference on such models can be challenging as Markov chain Monte Carlo algorithms struggle with the geometry of the resulting posterior distribution and can be prohibitively slow. An alternative is to use a Laplace approximation to marginalize out the latent Gaussian variables and then integrate out the remaining hyperparameters using dynamic Hamiltonian Monte Carlo, a gradient-based Markov chain Monte Carlo sampler. To implement this scheme efficiently, we derive a novel adjoint method that propagates the minimal information needed to construct the gradient of the approximate marginal likelihood. This strategy yields a scalable differentiation method that is orders of magnitude faster than state of the art differentiation techniques when the hyperparameters are high dimensional. We prototype the method in the probabilistic programming framework Stan and test the utility of the embedded Laplace approximation on several models, including one where the dimension of the hyperparameter is ∼6,000. Depending on the cases, the benefits can include an alleviation of the geometric pathologies that frustrate Hamiltonian Monte Carlo and a dramatic speed-up.

“Orders of magnitude faster” . . . That’s pretty good!
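For readers unfamiliar with the basic ingredient, here is a one-dimensional sketch of a Laplace approximation (just the textbook construction, not the adjoint-differentiated method of the paper; the target function is a made-up toy):

```python
import numpy as np

# One-dimensional Laplace approximation:
#   int exp(f(x)) dx  ~  exp(f(x0)) * sqrt(2*pi / -f''(x0)),
# expanding f to second order around its mode x0.
def f(x):
    return -0.5 * x**2 - 0.05 * x**4   # toy log-density (illustrative)

x0 = 0.0      # mode of f
curv = -1.0   # f''(0) = -1 for this f
laplace = np.exp(f(x0)) * np.sqrt(2 * np.pi / -curv)

# Brute-force quadrature for comparison
grid = np.linspace(-10, 10, 200_001)
quad = np.sum(np.exp(f(grid))) * (grid[1] - grid[0])

print(laplace, quad)   # close, and the approximation is far cheaper
```

In the paper this marginalizes out the latent Gaussian variables; the hard part, which the adjoint method addresses, is differentiating the approximate marginal likelihood with respect to many hyperparameters.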

Understanding Janet Yellen

I don’t know anything about Janet Yellen, the likely nominee for Secretary of the Treasury. For the purpose of this post, my ignorance is OK, even desirable, in that my goal is to try to understand mixed messages that I’m receiving.

Two contrasting views on the prospective Treasury Secretary

First, here’s Joseph Delaney:

So, I [Delaney] know that inflation is a potential menace and ignoring debt has gotten many an advanced nation into trouble. These are all reasonable things to be concerned about. But, via Yasha Levine, I want to bring your attention to the views of the frontrunner for incoming treasury secretary:

In a 2018 interview at the Charles Schwab Impact conference in Washington, Ms. Yellen said the United States’ debt path was “unsustainable” and offered a remedy: “If I had a magic wand, I would raise taxes and cut retirement spending.”

Last year, Ms. Yellen touched on the third rail of Democratic politics when she suggested more directly that cuts to Medicare, Medicaid and Social Security could be in order.

“I think it will not be solved without some additional revenues on the table, but I also find it hard to believe that it won’t be solved without some changes to those programs,” Ms. Yellen said at the National Investment Center for Seniors Housing & Care Fall Conference.

So, there are several issues all bundled together here. First, can we stop putting Medicare into the same bucket as the (less generous) Medicaid and the (quite sustainable) Social Security. The problem with Medicare, insofar as there is one, is an issue of medical cost inflation and that’s an independent policy problem that has little to do with the budget (except as a motivation to solve it). . . .

I am not saying these programs should never be considered for cuts, but that we should be very careful about not framing this as a choice to have lower revenues which require cuts. . . .

This reminded me that I’d noticed a Paul Krugman column on Yellen . . . ok, here it is:

In Praise of Janet Yellen the Economist

She never forgot that economics is about people.

It’s hard to overstate the enthusiasm among economists over Joe Biden’s selection of Janet Yellen as the next secretary of the Treasury. . . . But the good news about Yellen goes beyond her ridiculously distinguished career in public service. Before she held office, she was a serious researcher. And she was, in particular, one of the leading figures in an intellectual movement that helped save macroeconomics as a useful discipline when that usefulness was under both external and internal assault. . . .

Krugman also argues that Yellen “got it right” in 2009 by fighting against the “inflation hawks” to expand the economy.

What’s the conflict?

It seems to me that, despite both coming from the left, or the center-left, Delaney and Krugman are painting very different pictures of Yellen. I say this because Delaney’s point—that it’s a mistake to use artificial budgetary constraints as a rationale for cutting benefits to the poor and middle class—is the kind of argument that I associate with Krugman, at least in his post-2000 incarnation. Krugman’s always saying we can afford Social Security, and he’s been pretty consistently criticizing those political figures who want to cut or otherwise restrict this retirement program. For example:

Social Security does not face a financial crisis; its long-term funding shortfall could easily be closed with modest increases in revenue.

Krugman goes on to offer a reason that some Republican politicians favor cutting Social Security: “it’s all about the big money.”

So here’s the conflict.

A. Yellen wants to cut Social Security (as Delaney notes, she puts it in the “Medicare, Medicaid and Social Security” category, but that’s a separate issue we won’t get into here). Her rationale is that the debt is unsustainable and we can’t raise taxes.

B. Krugman hates, absolutely hates, people who want to cut Social Security, and he’s dismissive of the argument that the retirement program is unaffordable.

C. Krugman looooves Yellen, both as an academic economist and a policy figure.

I’m finding it difficult to hold A, B, and C in my head at the same time. Lewis Carroll might resolve the problem by just adding a fourth statement:

D. A and B are consistent with C.

Of course then I’d wonder why I should believe D, but then Carroll could posit:

E. D is true.

I guess this would pretty much cover it!

Possible explanations

OK, here are some possible resolutions to the above puzzle:

1. Maybe Yellen was misquoted and she doesn’t really want to cut Social Security?

2. Maybe Krugman wasn’t aware of Yellen’s stance on Social Security when he wrote his column the other day?

3. Maybe Krugman knows about Yellen’s stance on Social Security and doesn’t like it, but in his column he was evaluating all her positions on the economy: perhaps she agreed with him on 9 out of 10 issues and so in his column he’s focusing on the places where they agree?

4. Maybe Krugman has changed his views and now he thinks Social Security really is unsustainable? Maybe Social Security was sustainable in 2015 but not in 2020?

I’m guessing it’s #3. But I’m still baffled by how Krugman is so enthusiastic for Yellen given that they seem to disagree on such a core political and economic issue.

Summary

I’m not sure what to think here. My point is not to drag Krugman for being inconsistent. Rather, my point is how difficult it is for an outsider to evaluate policy positions.

There are lots of examples where a policymaker is on the left, and he or she is criticized from the right, or vice versa. And examples where a centrist is criticized from both sides, for example by supporting enough environmental regulations to annoy the right, but not enough to satisfy the left. I’m not saying that the centrist position is correct here; I’m just saying I understand the debate, or at least I think I do.

There are also examples of controversy arising from multidimensionality in policy positions. For example, you might agree with a policymaker’s position on China but disagree with their stance on India. Or you could agree on gun rights but not on abortion rights. I get that.

The Yellen example is interesting to me because it’s not either of the above things. Delaney and Krugman have different tones (the polite statistician and the aggressive economist) but I think their political positions are pretty similar. Delaney’s on the outside and has some distrust of Ivy League economics professors, and Krugman’s an insider, so that somewhat explains their different views about a credentialed academic economist—but I don’t see that Delaney and Krugman are disagreeing on the relevant policy question.

And that brings us to the other point, which is that this does not seem to be a multidimensional issue. Delaney is suspicious of Yellen regarding Social Security—but Krugman cares about Social Security too!

When two people with the same views on the same issue have opposite takes on a policymaker, I’m not sure what to think. Which is why I’m saying that, for the purpose of this discussion, it’s good that I came into this knowing nothing about Yellen. I don’t really have any strong views about Social Security either, but that’s another story.

P.S. In comments, Jim offers another possibility:

Yellen’s previous comments on social security, medicare and medicaid were just thinking out loud and don’t reflect a policy position. Krugman and Yellen have probably had many conversations, so Krugman knows much more about her thinking than a few simple quotes can reflect.

That makes sense. If Krugman thinks that Yellen’s previous statements on Social Security or entitlement reform don’t reflect her current positions, then it would make sense for him not to get into those issues in his column.

Basbøll’s Audenesque paragraph on science writing, followed by a resurrection of a 10-year-old debate on Gladwell

I pointed Thomas Basbøll to my recent post, “Science is science writing; science writing is science,” and he in turn pointed me to his post from a few years ago, “Scientific Writing and ‘Science Writing,'” which stirringly begins:

For me, 2015 will be the year that I [Basbøll] finally lost all respect for “science writing”.

He continues: “especially since the invention of the TED talk (a “dark art”), it gave me the feeling of knowing without actually providing me with knowledge. Popular presentations of science tell us stories about what is known without giving us the critical foundations we need to engage with it, i.e., to question those stories.”

And leads to this stunning conclusion:

Knowledge was once something you acquired through years of study, guided by books, but framed by a classroom (other people), an observatory (other vistas), a laboratory (other experiences), a library (other books). If you did not have access to these “academic” conditions you did not presume to understand the topic. Scientists wrote about their discoveries for people who had the knowledge, intelligence, time and apparatus to test them. These days, “science” is becoming something that is produced in a lab and consumed in a book you buy at the airport.

I’m a sucker for nostalgia. But I still can’t bring myself to take the position that the old days were better. After all, the vast majority of people didn’t, and don’t, have the opportunity for those years of study, and even those who did would have had them in only one narrow field. So I still like the idea of science writing, if we can get beyond the obsolete “science as hero” framework.

One thing I like about the above-quoted paragraph is its Audenesque rhythm. (“Yesterday all the past…”). Then again, Orwell roasted Auden for that particular poem, and years later Auden renounced it. Something can sound good and even make a certain kind of logical sense but still be factually or morally wrong. Orwell knew this all along, it took Auden a while to realize it, and there are lots of people who still don’t get the point.

Speaking of Malcolm Gladwell . . . In his 2015 post, Basbøll links to this blog discussion from 2010 which is kind of amazing in that Gladwell responds to Basbøll in the comments. And it wasn’t even Basbøll’s post! Blogs really used to matter, enough so that a big name like Malcolm Gladwell would engage with critic A in the comments section of a post by blogger B. And they went back and forth!

I’m not the world’s biggest Gladwell fan, but I admire that he engaged seriously with criticism in that way. Here’s an example, from late in the thread:

What strikes me [Gladwell] most—reading all the comments—is how unwilling many of the commenters (most of whom, I’m guessing, are academics) are to deal with the trade-off presented in the original post. Academics have the luxury, appropriately, of dealing with ideas and arguments and social science in its full complexity. Those of us who have chosen to swim in the lay pool do not. We have to make compromises. My book Blink, for example, was a compromise: an attempt to nudge people away from the reflexive position that intuition and instinct are invariably reliable or useful. A complete summary of the academic understanding of those questions would have been read by a fraction of the audience. Figuring out where to draw that line is difficult, and I don’t pretend that I always do it properly. But I do think that the effort to expose as wide an audience as possible to the wonders and mysteries of social science ought to be met with more than condescension—especially from a group of people who teach for a living.

I don’t think this response from Gladwell is perfect. For example, he does not address that in his books he wasn’t just making compromises and trade-offs; he was also actively promoting junk science such as John Gottman’s divorce predictions (see here—wow, that was from back in 2010 also! Such a long time has gone by). And I don’t know that he (Gladwell) has ever retracted his endorsement of Gottman’s claims.

So, yeah, I think Gladwell misses the point in his replies, in that his paragraph sounds reasonable in isolation but it doesn’t address his devastating combination of credulity and unwillingness to admit specific errors. But I still very much appreciate that he at least made the effort: he showed the critics some respect, which is more than you can say of David Brooks, Susan Fiske, Cass Sunstein, etc.

The other stunning thing in that thread from 2010 is when Brayden King, who wrote the blog post that started it all, added this in comments:

Lots of completely legitimate academic articles are liberally sprinkled with “premature conclusions or misleading anecdotes.” I don’t see them as harmful as you do in either case. The point of much empirical work is to push theoretical boundaries and to get people to think. Gladwell is doing the same thing, the main difference being the intended audience.

Dayum.

I mean, yeah, sure, lots of academics make mistakes and don’t ever issue corrections. ESP, ages ending in 9, pizzagate, the disgraced primatologist, that dude from Ohio State with the voodoo dolls, air rage, himmicanes, beauty and sex ratios, that sleep researcher, etc etc. But that’s a bad thing, right?? No matter what the intended audience.

I do think there are some solid defenses of Gladwell. One possible defense is that the man has a workflow, and if he were to fact-check his writing too carefully, it would destroy the spontaneity that makes it all hang together. The second possible defense is that to correct the errors would destroy the willing suspension of disbelief that makes traditional science writing so effective.

In either case, the argument is: (a) the pluses of Gladwell’s writing (the sharing of true facts, the reporting and publicizing of good research, the engagement of the reader in the process of social science) outweigh the minuses (the sharing of false claims, the reporting and publicizing of bad research, the misrepresentation of social science), and (b) that removal or correction of the errors would be impossible as it would in some way destroy the ability of Gladwell to produce this work.

I think this argument is plausible. But, to make it work, you need both (a) and (b). Either alone is not enough.

P.S. Thanks to Zad Chow for the above picture of Polynomial Cats. Happy new year, Zad!