“Replicability & Generalisability”: Applying a discount factor to cost-effectiveness estimates.

This one’s important.

Matt Lerner points us to this report by Rosie Bettle, Replicability & Generalisability: A Guide to CEA discounts.

“CEA” is cost-effectiveness analysis, and by “discounts” they mean what we’ve called the Edlin factor. “Discount” is a better name than “factor,” because it’s a number that should be between 0 and 1: it’s what you multiply a point estimate by to adjust for the inevitable upward biases in reported effect-size estimates, issues discussed here and here, for example.

It’s pleasant to see some of my ideas being used for a practical purpose. I would just add that type M and type S errors should be lower for Bayesian inferences than for raw inferences that have not been partially pooled toward a reasonable prior model.
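To make this concrete, here is a minimal simulation sketch (all the numbers are made up: a true effect of 0.2, a standard error of 0.25, and a flat 0.5 discount standing in for partial pooling toward a prior centered near zero):

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.2   # hypothetical true effect (made-up number)
se = 0.25           # standard error of the study (made-up number)
discount = 0.5      # Edlin-style discount applied to reported estimates

# Simulate many replications of the same noisy study
est = rng.normal(true_effect, se, size=1_000_000)

# The significance filter: only "statistically significant" results get reported
sig = est[np.abs(est) > 1.96 * se]

# Type M (exaggeration) and type S (wrong sign) errors among the significant results
exaggeration = np.mean(np.abs(sig)) / true_effect
wrong_sign = np.mean(np.sign(sig) != np.sign(true_effect))

# Multiplying by the discount pulls the published estimates back toward zero
exaggeration_discounted = np.mean(discount * np.abs(sig)) / true_effect

print(f"Exaggeration ratio of significant estimates: {exaggeration:.1f}")
print(f"Proportion of significant estimates with the wrong sign: {wrong_sign:.3f}")
print(f"Exaggeration ratio after applying the discount: {exaggeration_discounted:.1f}")
```

The point of the sketch is just that estimates which survive the significance filter are exaggerated, and multiplying by a discount (or, better, doing the partial pooling directly) pulls them back toward something more plausible.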

Also, regarding empirical estimation of adjustment factors, I recommend looking at the work of Erik van Zwet et al; here are some links:
What’s a good default prior for regression coefficients? A default Edlin factor of 1/2?
How large is the underlying coefficient? An application of the Edlin factor to that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”
The Shrinkage Trilogy: How to be Bayesian when analyzing simple experiments
Erik van Zwet explains the Shrinkage Trilogy
The significance filter, the winner’s curse and the need to shrink
Bayesians moving from defense to offense: “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”
Explaining that line, “Bayesians moving from defense to offense”

I’m excited about the application of these ideas to policy analysis.

It’s bezzle time: The Dean of Engineering at the University of Nevada gets paid $372,127 a year and wrote a paper that’s so bad, you can’t believe it.

“As we look to sleep and neuroscience for answers we can study flies specifically the Drosophila melanogaster we highlight in our research.”

1. The story

Someone writes:

I recently read a paper of yours in the Chronicle about how academic fraudsters get away with it. I came across a strange case that I thought you would at least have some interest in: a faculty member owns an open-access journal that charges to publish and then publishes a large number of papers in the journal. The most recent issue is all from the same authors (family affair).

It is from an administrator at the University of Nevada, Reno. This concern is related to publications within a journal that may not be reputable. The Dean of Engineering has a number of publications in the International Supply Chain Technology Journal that are in question (see Google Scholar). Normally, I would contact the editor, or publisher, but in this case, there are complexities.

This may not be an issue, but many of the articles are short, being 1 or 2 pages. In addition, some have a peer review process of 3 days or less. Another concern is that many of the papers do not even discuss what is in the title. Take the following paper: it presents nothing about the title. Many of the papers read as if AI was used.

While the quality of these papers may not be of concern, the representation of these as publications could be. The person publishing them should have ethical standards that exceed those of the people under his leadership. He is also the highest-ranking official of the college of engineering and is expected to lead by example and be a good model to those under him.

If that is not enough, looking into the journal in more detail raises more ethical questions. The journal is published by PWD Group out of Texas. A lookup of PWD Group out of Texas shows that Erick Jones is the Director and President. Erick Jones was also the Editor of the journal. In addition to the journal articles, even books authored by Erick Jones are published by PWD.

Looking further into the journal’s publications, you will see that there are a large number with Erick Jones Sr. and Erick Jones Jr. There are also a large number with Felicia Jefferson. Felicia is also a faculty member at UNR and the spouse of Dean Jones. A few of the papers raise concerns related to deer supply chains. The following has a very fast peer review process of a few days, and the animal captioned as a white-tailed deer is a reindeer. Another paper is even shorter, with a very fast peer review, and captions yet a different deer, which is still not a white-tail. It is unlikely these papers went through a robust peer review.

While these papers’ affiliations predate coming to UNR, the incoherence, conflict of interest, and incorrect data do not look good for UNR, and they were published either when Dr. Jefferson was applying to UNR or early upon her arrival. There are similar issues with the timing of this article. Also, in the print version of the journal, Dr. Jefferson handles submissions (p. 3).

Maybe this information is nothing to be concerned about.  At the very least, it sheds a poor light on the scientific process, especially when a Dean is the potential abuser.  It is not clear how he can encourage high quality manuscripts from other faculty when he has been able to climb the ladder using his own publishing house. I’ll leave you with a paper with a relevant title on minimizing train accidents through minimizing sleep deprivation. It seems like a really important study.  The short read should convince you otherwise and make you question the understanding of the scientific process by these authors.

Of specific concern is whether these publications led to him, or his spouse, being hired at UNR. If these are considered legitimate papers, the entire hiring and tenure process at UNR is compromised. Similar arguments exist if these papers are used in the annual evaluation process. It also raises a conflict of interest if he pays to publish and then receives proceeds on the back end.

I have no comment on the hiring, tenure, and evaluation process at UNR, or on any conflicts of interest. I know nothing about what is going on at UNR. It’s a horrifying story, though.

2. The published paper

OK, here it is, in its entirety (except for references). You absolutely have to see it to believe it:

Compared to this, the Why We Sleep guy is a goddamn titan of science.

3. The Dean of Engineering

From the webpage of the Dean of Engineering at the University of Nevada, Reno:

Dr. Erick C. Jones is a former senior science advisor in the Office of the Chief Economist at the U.S. State Department. He is a former professor and Associate Dean for Graduate Studies at the College of Engineering at The University of Texas at Arlington.

From the press release announcing his appointment, dated July 01, 2022:

Jones is an internationally recognized researcher in industrial manufacturing and systems engineering. . . . “In Erick Jones, our University has a dynamic leader who understands how to seize moments of opportunity in order to further an agenda of excellence,” University President Brian Sandoval said. . . . Jones was on a three-year rotating detail at National Science Foundation where he was a Program Director in the Engineering Directorate for Engineering Research Centers Program. . . .

Jones is internationally recognized for his pioneering work with Radio Frequency Identification (RFID) technologies, Lean Six Sigma Quality Management (the understanding of whether a process is well controlled), and autonomous inventory control. He has published more than 243 manuscripts . . .

According to this source, his salary in 2022 was $372,127.

According to wikipedia, UNR is the state’s flagship public university.

I was curious to see what else Jones had published so I searched him on Google scholar and took a look at his three most-cited publications. The second of these appeared to be a textbook, and the third was basically 8 straight pages of empty jargon—ironic that a journal called Total Quality Management would publish something that has no positive qualities! The most-cited paper on the list was pretty bad too, an empty bit of make-work, the scientific equivalent of the reports that white-collar workers need to fill out and give to their bosses who can then pass these along to their bosses to demonstrate how productive they are. In short, this guy seems to be a well-connected time server in the Ed Wegman mode, minus the plagiarism.

He was a Program Director at the National Science Foundation! Your tax dollars at work.

Can you imagine what it would feel like to be a student in the engineering school at the flagship university of the state of Nevada, and it turns out the school is being run by the author of this:

Our recent study has the premise that both humans and flies sleep during the night and are awake during the day, and both species require a significant amount of sleep each day when their neural systems are developing in specific activities. This trait is shared by both species. An investigation was segmented into three subfields, which were titled “Life span,” “Time-to-death,” and “Chronological age.” In D. melanogaster, there was a positive correlation between life span, the intensity of young male medflies, and the persistence of movement. Time-to-death analysis revealed that the male flies passed away two weeks after exhibiting the supine behavior. Chronological age, activity in D. melanogaster was adversely correlated with age; however, there was no correlation between chronological age and time-to-death. It is probable that the incorporation the findings of age-related health factors and increased sleep may lead toless train accidents. of these age factors when considering these options supply chain procedure for maintaining will be beneficial.

I can’t even.

P.S. The thing I still can’t figure out is, why did Jones publish this paper at all? He’d already landed the juicy Dean of Engineering job, months before submitting it to his own journal. To then put his name on something so ludicrously bad . . . it can’t help his career at all, could only hurt. And obviously it’s not going to do anything to reduce train accidents. What was he possibly thinking?

P.P.S. I guess this happens all the time; it’s what Galbraith called the “bezzle.” We’re just more likely to hear about it when it happens at some big-name place like Stanford, Harvard, Ohio State, or Cornell. It still makes me mad, though. I’m sure there are lots of engineers who are doing good work and could be wonderful teachers, and instead UNR spends $372,127 on this guy.

I’ll leave the last word to another UNR employee, from the above-linked press release:

“What is exciting about having Jones as our new dean for the College of Engineering is how he clearly understands the current landscape for what it means to be a Carnegie R1 ‘Very High Research’ institution,” Provost Jeff Thompson said. “He very clearly understands how we can amplify every aspect of our College of Engineering, so that we can continue to build transcendent programs for engineering education and research.”

They’re transcending something, that’s for sure.

My challenge for Jeff Thompson: Show up at an engineering class at your institution, read aloud the entire contents (i.e., the two paragraphs) of “Using Science to Minimize Sleep Deprivation that may reduce Train Accidents,” then engage the students in a discussion of what this says about “the current landscape for what it means to be a Carnegie R1 ‘Very High Research’ institution.”

Should be fun, no? Just remember, the best way to keep the students’ attention is to remind them that, yes, this will be covered on the final exam.

P.P.P.S. More here from Retraction Watch.

P.P.P.P.S. Still more here.

P.P.P.P.P.S. Retraction Watch found more plagiarism, this time on a report for the National Science Foundation.

Using the term “visualization” for non-visual representation of data

The other day we linked to a study whose purpose was to “investigate challenges faced by curators of data visualizations for blind and low-vision individuals.”

JooYoung Seo, the organizer of that project, provides further background:

With the exception of a few IRB-related constraints, below is my brief response to your feedback.

1. Unclear terminology. You recommended that we use the verb “create” instead of “curate” in our survey, and we completely agree with your confusion about this terminology. We chose to use the verb curate because our research was funded by the Institute of Museum and Library Services (IMLS) to develop a tool that would make it easier for data curators to create accessible visualizations. We also began our research proposal as a community partnership with the Data Curation Network (DCN), so we were using terminology that was tailored to a specific professional group. In an effort to strike a better balance between the confusion of the terminology and the directionality of our research goals, we will add some explanations to make it easier to understand.

2. The inappropriateness of the term “visualization”. You raised the issue of the inappropriateness of using the term “visualization” in our survey to refer to accessible data “sensification”. This is very insightful.

I can assure you that our team is in no way trying to subscribe to or perpetuate the term visualization. As the PI, I am a lifelong blind person and my student who is co-leading this research is also a lifelong low-vision person, so we have given a lot of thought to the term “visualize”.

In our research, visualization is one way of encoding/decoding data representation. We believe that accessible data representation can only be achieved through multimodal data representation (sonification, tactile representation, text description, AI conversation, etc.) that comprehensively conveys various modalities along with visualization. Our initial research, Multimodal Access and Interactive Data Representation (MAIDR: https://github.com/uiuc-ischool-accessible-computing-lab/maidr), which we will present in May at the CHI2024 conference, reflects this belief, and this survey is an extension of our MAIDR research.

Despite the bias that the term visualization can introduce, we chose to use it in this survey for two reasons: first, we wanted to follow the convention of using terminology that is more easily understood by survey participants, assuming that they are data curators with varying levels of experience with accessibility, ranging from no experience to expert level experience. We could of course use the terms “data sensification” or “data representation” for further explanation, but since this initial study is focused on observing and understanding the status quo rather than “education,” we wanted to reduce potentially confusing new concepts as much as possible.

In parallel to our survey of data curators, we are also conducting a separate survey with blind people asking them about the accessibility issues with data visualization that they encounter in their daily lives. In that survey, we want to understand how blind people are approaching visualization.

Second, the reason we use the term visualization in our research involving blind and low-vision people is to challenge the misconception that being visually impaired excludes people from visualization altogether. For example, there are many blind and low-vision people who use their residual vision to approach visualization. Depending on when you were blind, there are some people who use the visualizations that remain in their brains as they learn. As someone who became blind as a teenager, I still use visual cues like color and brightness to help me learn and retain information.

If you cannot use the term visualize just because you can’t see, “See you tomorrow,” “Let’s see,” and “let me take a look” would also be unusable for blind people. Blind people are just as capable of using visual encoding/decoding.

I get his point on the term “visualization.” Indeed, I can visualize scenes with my eyes closed. In our paper, we used “sensification” in part to emphasize that we are interested in engaging senses other than vision, especially hearing and the muscular resistance sense.

A new argument for estimating the probability that your vote will be decisive

Toby Ord writes:

I think you will like this short proof that puts a lower bound on the probability that one’s vote is decisive.

It requires just one assumption (that the probability distribution over vote share is unimodal) and takes two inputs (the number of voters & the probability the underdog wins). It shows that in (single level) elections that aren’t forgone conclusions, the chance your vote is decisive can’t be much lower than 1 in the number of voters (and I show where some models that say otherwise go wrong).

Among other things, this makes it quite plausible that the moral value of voting is positive in expectation, since the aggregate value scales with n, while the probability scales with 1/n. Voting would produce net-value roughly when the value of your preferred candidate to the average citizen exceeds the cost to you of voting.
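As a rough numerical illustration of that 1/n scaling (just a sketch, not Ord’s actual derivation; the Beta distribution over the two-party vote share and all the numbers are made up):

```python
import numpy as np
from scipy import stats

n = 1_000_000                          # number of other voters (illustrative)
vote_share = stats.beta(a=520, b=480)  # unimodal uncertainty about the two-party
                                       # vote share, centered near 52% (made up)

# Probability the underdog wins: the vote share falls below 50%
p_underdog = vote_share.cdf(0.5)

# Your vote is decisive (roughly) when the others split evenly, i.e. the
# vote share lands in an interval of width 1/n around 1/2
p_decisive = vote_share.pdf(0.5) / n

print(f"P(underdog wins) ~ {p_underdog:.2f}")
print(f"P(your vote is decisive) ~ {p_decisive:.1e}")
print(f"1/n = {1/n:.1e}")
```

The particular numbers aren’t the point; the point is that as long as the election isn’t a foregone conclusion, the probability of a decisive vote comes out on the order of 1 in the number of voters, not astronomically smaller than that.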

This relates to my paper with Edlin and Kaplan, “Voting as a rational choice: why and how people vote to improve the well-being of others.”

I was happy to see that Ord’s article mentioned the point made in the appendix of our 2004 paper, as it addresses a question that often arises: the claim that a vote can never be decisive because, when an election is close, there can be a recount.

Also, some background: it’s my impression that the p = 10^-90 crowd (that is, the people who assign ridiculously small probabilities of a single vote being decisive) are typically not big fans of the idea of democracy, so it is convenient for them to suppose that voting doesn’t matter.

I’m not saying the p = 10^-90 people are cynical, as they may sincerely believe that democracy is overrated, and then this is compounded by innumeracy. Probability is difficult!

And then there’s just the general issue that people seem to have the expectation that, when there’s any sort of debate, all the arguments must necessarily go in their favor, so they’ll twist and turn a million ways to avoid grappling with contrary arguments; see for example this discussion thread where I tried to clarify a point but it didn’t work.

Regarding this last point, Ord writes:

I hope it is useful to have a simple formula for a completely safe lower bound for the chance a vote is decisive. Not the same as your empirically grounded versions, but nice to show people who don’t trust the data or the more complex statistical analysis.

“Don’t feed the trolls” and the troll semi-bluff

I was dealing with some trolling in blog comments a while ago and someone sent me an email of support, appreciating my patience in my responses to the troll. The standard advice is “Don’t feed the trolls,” but usually here it seems to have worked well to give polite but firm and focused responses. One reason for this is that this is a blog, not twitter, so a troll can’t get hundreds or thousands of “likes” here for just expressing an opinion; instead, there’s room in comments for all sides to make their arguments. So often we get the best possible outcome: the would-be troll gets to make his point at whatever length he wants, others can respond, and the discussion is out there. Polite but firm responses take some of the thrill out of trolling; the people who really want to up the emotional level can go to twitter or 4chan.

Occasionally, though, a troll keeps coming back and making the same point over and over again, with a combination of rudeness, repetition, and lack of content that degrades the discussion, and I have to ask the troll to stop, or, with very rare necessity, to block him from commenting.

The semi-bluff

In poker, a “semi-bluff” is when you bluff, but your hand has some potential for improvement, so if you get called on it, you still have a possible out. I get the impression that the occasional trolling commenters here are engaged in a sort of semi-trolling. On one hand, they want to provoke strong reactions, and I get the impression they see themselves as charming, if not completely house-trained, gadflies or imps. So they can keep looping around the same refuted points over and over again on the ha-ha-troll theory that the rest of us have no sense of perspective. At the same time, they seem to sincerely believe in the deep truth of whatever they happen to be saying. (I’m guessing that even flat-out data fabricators, liars, and hacks believe that, in some deep sense, they’re serving a good cause.) So they’re kinda trolling but also are being sincere at some level.

The disturbing thing is that frequent blog targets such as plagiarists, nudgelords, pizzagate proponents, sloppy pundits, confused popularizers, etc., probably see me as this kind of troll! I keep banging on and on and never give up. That horse thing. That Javert thing.

As is often the case, I’m stuck here in the pluralist’s dilemma. There’s no logical way to criticize trolls in general without entering the vortex and calling into question my own work that is getting trolled. Cantor or Russell would understand.

A question about Lindley’s supra Bayesian method for expert probability assessment

Andy Solow writes:

I wonder if you can help me with a question that has been bugging me for a while? I have been thinking about Lindley’s supra Bayesian method for expert probability assessment. Briefly, the model is that, conditional on the event of interest A, the log odds ratio for an expert is normal with mean mu > 0 and variance v and, symmetrically, conditional on not-A, it’s normal with mean -mu and the same variance. When the prior probability of A is 1/2, the posterior log odds ratio given the expert’s log odds ratio q is:

(2*mu/v) * q

Lindley took v = 2*mu so that the expert’s log odds ratio is simply adopted. Now, I would have thought that mu can be viewed as a measure of expertise: the more expert the expert, the greater mu. If that’s the case, then I also would have thought that the distribution of a more-expert expert should stochastically dominate that of a less-expert expert. But this is not true under Lindley’s assumption. Stochastic dominance requires that v is a non-increasing function of mu – the simplest case being a constant v. But for mu > v/2 the posterior log odds ratio is greater than q, which doesn’t seem right either. I wonder if I am thinking about this incorrectly? Any thoughts you might have would be greatly appreciated.

My reply: I don’t know, I’ve never thought about this one! The whole thing looks kind of arbitrary to me, and I’ve never been a fan of models of “expert opinion” that don’t connect to the data used by the expert or to whatever the expert is actually predicting. But Lindley was a smart guy, so I’m guessing that the idea is more general than it looks to me at first.
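In the meantime, here’s a quick numerical check of the formula in Solow’s question (a minimal sketch; the values of mu, v, and q are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, v, q = 1.0, 0.7, 1.3  # arbitrary illustrative values

# Model: conditional on A, the expert's log odds ratio q ~ N(mu, v);
# conditional on not-A, q ~ N(-mu, v); prior P(A) = 1/2.
lik_A = norm.pdf(q, loc=mu, scale=np.sqrt(v))
lik_not_A = norm.pdf(q, loc=-mu, scale=np.sqrt(v))

# With a 50/50 prior, the posterior log odds ratio is the log likelihood ratio,
# which should match (2*mu/v)*q
print(np.log(lik_A / lik_not_A), (2 * mu / v) * q)

# Lindley's choice v = 2*mu makes the multiplier equal to 1, so the
# expert's log odds ratio is simply adopted
v = 2 * mu
lik_A = norm.pdf(q, loc=mu, scale=np.sqrt(v))
lik_not_A = norm.pdf(q, loc=-mu, scale=np.sqrt(v))
print(np.log(lik_A / lik_not_A), q)
```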

Can someone explain in comments?

“Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]."

Jonathan Falk came across this article and writes:

Is there any possible weaker conclusion than “providing caloric information may help some adults with food decisions”?

Is there any possible dataset which would contradict that conclusion?

On one hand, gotta give the authors credit for not hyping or overclaiming. On the other hand, yeah, the statement, “providing caloric information may help some adults with food decisions,” is so weak as to be essentially empty. I wonder whether part of the problem here is the convention that the abstract is supposed to conclude with some general statement, something more than just, “That’s what we found in our data.”

Still and all, this doesn’t reach the level of the classic “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]."

Click here to help this researcher gather different takes on making data visualizations for blind people

Here’s the survey, and here’s what it says:

The purpose of this study is to investigate challenges faced by curators of data visualizations for blind and low-vision individuals. This includes, but is not limited to, graphs, charts, plots, diagrams, and data tables.

We invite individuals who meet the following criteria to participate in our survey:
– Aged 18 or older
– Data curators or professionals who have experience creating a visualization to depict data and are interested in the accessibility of data visualizations
– Data professionals with or without experience in assisting BLV individuals with data visualizations
– Currently located in the United States

The survey uses the term “curating,” which doesn’t seem quite right to me. I create lots of visualizations; I don’t spend much time “curating” them. So that’s confusing. I think they should replace “curate” by “create,” or maybe “create or use,” throughout.

Also, they keep saying “visualizations,” which doesn’t sound quite right given that vision will not be involved. I’d prefer a term such as “sensification” or “vivification,” as we discuss in our article on the topic.

Also it’s funny how the survey starts with screenfuls of paperwork. The whole IRB thing really is out of control. It’s a mindless bureaucracy. I don’t blame the researcher on this study—it’s not his fault, he’s at a university and he has to play by the rules. You just end up with ridiculous things like this:

When they write the story of the decline and fall of western civilization, they’ll have to devote a chapter to Institutional Review Boards. Academic bigshots lie, cheat, and steal, and in the meantime people who do innocuous surveys have to put in these stupid warnings. Again, not the fault of the researcher! It’s the system.

Anyway, sensification for blind and low-vision people is a topic worth studying, and if you fill out the survey maybe it will be of help, at least a starting point for further exploration of this issue.

P.S. More here.

Lancet-bashing!

Retraction Watch points to this fun article by Ashley Rindsberg, “The Lancet was made for political activism,” subtitled, For 200 years, it has thrived on melodrama and scandal.

And they didn’t even mention Surgisphere (for more detail, see here) or this story (the PACE study) or this one about gun control.

All journals publish bad papers; we notice Lancet’s more because they get more publicity.

Opposition

Following the recommendation of Elin in comments, I checked out the podcast, If Books Could Kill. It seemed like the kinda thing I might like: 2 guys going back and forth taking apart Gladwell, Freakonomics, David Brooks, Nudge, and other subjects of this blog. It would be really hilarious if they were to take on the copy-paste jobs of chess sages Ray Keene and Chrissy Hesse, but I guess that would be too niche for them . . .

I have mixed feelings about If Books Could Kill. My take is mostly positive—I listen to podcasts while biking so what I’m looking for is a kind of background music, a continuing flow of interesting things that flow smoothly. I’m impressed that they can talk so engagingly, unrehearsed (I assume) for an hour straight. No ums or uhs, they stay on topic . . . they’re pros. Not every podcast I’ve listened to goes so well. Sometimes they go too slow and I get bored.

There’s one thing that they’re missing, though, and that’s opposition. The most recent episode I’ve been listening to provides a good example. They discussed Men are from Mars, Women are from Venus, a book from a few years back that of course I’d heard of, but I’d never read, actually never even opened. I’m not saying that out of pride, it’s just not something that ever crossed my desk.

The podcast went as usual: they lead off with a summary of the book’s theme, then talk about how the book has some reasonable ideas and they want to give it a fair shake, then they get into the problematic passages and get into some questionable aspects of the author’s career.

This all works, but then at some point I kind of resist. It’s not that I think they’re being unfair to the book, exactly. It’s more like . . . they need some opposition. It goes like this: the book says X, they point out problems with X and get to mocking, which is fine—but then I want to push back. I don’t buy everything they’re saying. I think the podcast would be better if they could add a third person, someone who’s also good at the podcast thing and who generally agrees with them, but can push back against their stronger statements. I’m not asking for a debate every week, just someone who can, every once in a while, say, “Whoa, you’re going too far this time. Yes, you have good points, but the book you’re discussing is not so horrible as you say, at least not right here.”

And yes, the hosts of the podcast—Michael Hobbes and Peter Shamshiri—do point out positive features of the books they’re criticizing; they really do try to do their best to give the authors a fair shake. It’s just that in every episode they’ll get into this rhythm of reinforcing each other to the extent that they’ll miss the point. They need someone to keep them honest, keep them closer to their best selves.

The comment thread does some of that job on this blog.

And, as we’ve discussed over the years, I do find lots of positive things in Gladwell, Freakonomics, David Brooks, Nudge, etc. I think I’d appreciate Men are from Mars etc., but I guess at this point I won’t ever get around to reading it. As to the podcast: I think I’ll continue listening to it, because it’s entertaining. But it could be better.

P.S. More here on the value of opposition.

P.P.S. It appears that Hobbes has a track record of making hasty judgments and not considering alternative perspectives. That’s too bad cos otherwise I really enjoy this podcast.

Hey, I got tagged by RetractoBot!

A message came in my inbox from “The RetractoBot Team, University of Oxford,” with subject line, “RetractoBot: You cited a retracted paper”:

That’s funny! When we cited that paper by LaCour and Green, we already knew it was no good. Indeed, that’s why we cited it. Here’s the relevant paragraph from our article:

In political science, the term “replication” has traditionally been applied to the simple act of reproducing a published result using the identical data and code as used in the original analysis. Anyone who works with real data will realize that this exercise is valuable and can catch problems with sloppy data analysis (e.g., the Excel error of Reinhart and Rogoff 2010, or the “gremlins” article of Tol 2009, which required nearly as many corrections as the number of points in its dataset; see Gelman 2014). Reexamination of raw data can also expose mistakes, such as the survey data of LaCour and Green (2014); see Gelman (2015).

We also cited two other notorious papers, Reinhart and Rogoff (2010) and Tol (2009), both of which should have been retracted but are still out there in the literature. According to Google scholar, Reinhart and Rogoff (2010) has been cited more than 5000 times! I guess that many of these citations are from articles such as mine, using it as an example of poor workflow, but still. Meanwhile, Tol (2009) has been cited over 1500 times. It does have a “correction and update” from 2014, but that hardly covers its many errors and inconsistencies.

Anyway, I can’t blame RetractoBot for not noticing the sense of my citation; it’s just funny how they sent that message.

“When will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?”

Alexey Guzey asks:

How much have you thought about AI and when will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?

My first reply: I guess that AI can already do better science than Matthew “Sleeplord” Walker, Brian “Pizzagate” Wansink, Marc “Schoolmarm” Hauser, or Satoshi “Freakonomics” Kanazawa. So some humans are already obsolete, when it comes to producing science.

OK, let me think a bit more. I guess it depends on what kind of scientific research we’re talking about. Lots of research can be automated, and I could easily imagine an AI that can do routine analysis of A/B tests better than a human could. Indeed, thinking of how the AI could do this is a good way to improve how humans currently do things.

For bigger-picture research, I don’t see AI doing much. But a big problem now with human research is that human researchers want to take routine research and promote it as big-picture (see Walker, Wansink, Kanazawa, etc.). I guess that an AI could be programmed to do hype and create Ted talk scripts.

Guzey’s response:

What’s “routine research”? Would someone without a college degree be able to do it? Is routine research simply defined as such that can be done by a computer now?

My reply: I guess the computer couldn’t really do the research, as that would require filling test tubes or whatever. I’m thinking that the computer could set up the parameters of an experiment, evaluate measurements, choose sample size, write up the analysis, etc. It would have to be some computer program that someone writes. If you just fed the scientific literature into a chatbot, I guess you’d just get millions more crap papers, basically reproducing much of what is bad about the literature now, which is the creation of articles that give the appearance of originality and relevance while actually being empty in content.

But, now that I’m writing this, I think Guzey is asking something slightly different: he wants to know when a general purpose “scientist” computer could be written, kind of like a Roomba or a self-driving car, but instead of driving around, it would read the literature, perform some sort of sophisticated meta-analyses, and come up with research ideas, like “Run an experiment on 500 people testing manipulations A and B, measure pre-treatment variables U and V, and look at outcomes X and Y.” I guess the first step would be to try to build such a system in a narrow environment such as testing certain compounds that are intended to kill bacteria or whatever.

I don’t know. On one hand, even the narrow version of this problem sounds really hard; on the other hand, our standards for publishable research are so low that it doesn’t seem like it would be so difficult to write a computer program that can fake it.

Maybe the most promising area of computer-designed research would be in designing new algorithms, because there the computer could actually perform the experiment; no laboratory or test tubes required, so the experiments can be run automatically and the computer could try millions of different things.

Sympathy for the Nudgelords: Vermeule endorsing stupid and dangerous election-fraud claims and Levitt promoting climate change denial are like cool dudes in the 60s wearing Che T-shirts and thinking Chairman Mao was cool—we think they’re playing with fire, they think they’re cute contrarians pointing out contradictions in the system. For a certain kind of person, it’s fun to be a rogue.

A few months ago I wrote about some disturbing stuff I’d been hearing about from Harvard Law School professors Cass Sunstein and Adrian Vermeule. The two of them wrote an article back in 2005 saying, “a refusal to impose [the death] penalty condemns numerous innocent people to death. . . . a serious commitment to the sanctity of human life may well compel, rather than forbid that form of punishment. . . .”

My own view is that the death penalty makes sense in some settings and not others. To say that “a serious commitment to the sanctity of human life may well compel” the death penalty . . . jeez, I dunno, that’s some real Inquisition-level thinking going on. Not just supporting capital punishment, they’re compelling it. That’s a real edgelord attitude, kinda like the thought-provoking professor in your freshman ethics class who argues that companies have not just the right but the moral responsibility to pollute the maximum amount possible under the law because otherwise they’re ducking their fiduciary responsibility to the shareholders. Indeed, it’s arguably immoral to not pollute beyond the limits of the law if the expected gain from polluting is greater than the expected loss from getting caught and fined.

Sunstein and Vermeule also recommended that the government should fight conspiracy theories by engaging in “cognitive infiltration of extremist groups,” which seemed pretty rich, considering that Vermeule spent his online leisure hours after the 2020 election promoting election conspiracy theories. Talk about the fox guarding the henhouse. This is one guy I would not trust to be in charge of government efforts to cognitively infiltrate extremist groups!

Meanwhile, these guys go on NPR, they’ve held appointive positions with the U.S. government, they’re buddies with elite legal academics . . . it bothers me! I’m not saying their free speech should be suppressed—we got some Marxists running around in this country too—I just don’t want them anywhere near the levers of power.

Anyway, I heard by email from someone who knows Sunstein and Vermeule. It seems that both of them are nice guys, and when they stick to legal work and stay away from social science or politics they’re excellent scholars. My correspondent also wrote:

And on that 2019 Stasi tweet. Yes, it was totally out of line. You and others were right to denounce it. But I think it’s worth pointing out that he deleted the tweet the very same day (less than nine hours later), apologized for it as an ill-conceived attempt at humor, and noted with regret that the tweet came across as “unkind and harsh to good people doing good and important work.” I might gently and respectfully suggest that continuing to bring up this tweet four years later, after such a prompt retraction—which was coupled with an acknowledgement of the value of the work that you and others are doing in focusing on the need for scrutiny and replication of eye-catching findings—might be perceived as just a tad ungracious, even by those who believe that Cass was entirely in the wrong and you were entirely in the right as regards the original tweet. To paraphrase one of the great capital defense lawyers (who obviously said this in a much more serious context), all of us are better than our worst moment.

I replied:

– Regarding the disjunction between Vermeule’s scholarly competence and nice-guyness, on one hand, and his extreme political views: I can offer a statistical or population perspective. Think of a Venn diagram where the two circles are “reasonable person” and “extreme political views and actions.” (I’m adding “actions” here to recognize that the issue is not just that Vermeule thinks that a fascist takeover would be cool, but that he’s willing to sell out his intellectual integrity for it, in the sense of endorsing ridiculous claims.)

From an ethical point of view, there’s an argument in favor of selling out one’s intellectual integrity for political goals. One can make this argument for Vermeule or also for, say, Ted Cruz. The argument is that the larger goal (a fascist government in the U.S., or more power for Ted Cruz) is important enough that it’s worth making such a sacrifice. Or, to take slightly lesser examples, the argument would be that when Hillary Clinton lied about her plane being shot at, or when Donald Trump lied about . . . ok, just about everything, they were thinking about larger goals. Indeed, one could argue that for Cruz and the other politicians, it’s not such a big deal—nobody expects politicians to believe half of what they’re saying anyway—but for Vermeule to trash his reputation in this way, that shows real commitment!

Actually, I’m guessing that Vermeule was just spending too much time online in a political bubble, and he didn’t really think that endorsing these stupid voter-fraud claims meant anything. To put it another way, you and I think that endorsing unsubstantiated claims of voting fraud is bad for three reasons: (1) intellectually it’s dishonest to claim evidence for X when you have no evidence for X, (2) this sort of thing is dangerous in the short term by supplying support to traitors, and (3) it’s dangerous in the long term by degrading the democratic process. But, for Vermeule, #2 and #3 might well be a plus, not a minus, and, as for #1, I think it’s not uncommon for people to make a division between their professional and non-professional statements, and to have a higher standard for the former than the latter. Vermeule might well think, “Hey, that’s just twitter, it’s not real.” Similarly, the economist Steven Levitt and his colleagues wrote all sorts of stupid things (along with many smart things) under the Freakonomics banner, things which I guess (or, should I say, hope) he’d never have done in his capacity as an academic. Just to be clear, I’m not saying that everyone does this, indeed I don’t think I do it—I stand by what I blog, just as I stand by my articles and books—but I don’t think everyone does. Another example that’s kinda famous is biologists who don’t believe in evolution. They can just separate the different parts of their belief systems.

Anyway, back to the Venn diagram. The point is that something like 30% of Americans believe this election fraud crap. 30% of Americans won’t translate into 30% of competent and nice-guy law professors, but it won’t be zero, either. Even if it’s only 10% or less in that Venn overlap, it won’t be zero. And the people inside that overlap will get attention. And some of them like the attention! So at that point you can get people going further and further off the deep end.

If it would help, you could think of this as a 2-dimensional scatterplot rather than a Venn diagram, and in this case you can picture the points drifting off to the extreme over time.

To look at this another way, consider various well-respected people in the U.S. and Britain who were communists in the 1930s through 1950s. Some of these people were scientists! And they said lots of stupid things. From a political perspective, that’s all understandable: even if they didn’t personally want to tear up families, murder political opponents, start wars, etc., they could make the case that Stalin’s USSR was a counterweight to fascism elsewhere. But from an intellectual perspective, they wouldn’t always make that sort of minimalist case. Some of them were real Soviet cheerleaders. Again, who knows what moral calculations they were making in their heads.

I’m not gonna go all Sunstein-level contrarian and argue that selling out one’s intellectual integrity is the ultimate moral sacrifice—I’m picturing a cartoon where Vermeule is Abraham, his reputation is Isaac, and the Lord is thundering above, booming down at him to just do it already—but I guess the case could be made, indeed maybe will be the subject of one of the 8 books that Sunstein comes out with next year and is respectfully reviewed on NPR etc.

– Regarding the capital punishment article: I have three problems here. The first is their uncritical acceptance of a pretty dramatic claim. In Sunstein and Vermeule’s defense, though, back in 2005 it was standard in social science for people to think that statistical significance + identification strategy + SSRN or NBER = discovery. Indeed, I’d guess that most academic economists still think that way! So to chide them on their innumeracy here would be a bit . . . anachronistic, I guess. The second problem is that I’m guessing the reason they were so eager to accept this finding is that it allowed them to make this cool point that they wanted to make. If they’d said, “Here’s a claim, maybe it’s iffy but if it’s true, it has some interesting ethical implications…”, that would be one thing. But that’s not what I read their paper as saying. By saying “Recent evidence suggests that capital punishment may have a significant deterrent effect” and not considering the opposite, they’re making the fallacy of the one-way bet. My third problem is that I think their argument is crap, even setting aside the statistical study. I discussed this a bit in my post. There are two big issues they’re ignoring. The first is that if each execution saves 18 lives, then maybe we should start executing innocent people! Or, hey, we can find some guilty people to execute, maybe some second-degree murderers, armed robbers, arsonists, tax evaders, speeders, jaywalkers, . . . . shouldn’t be too hard to find some more targets–after all, they used to have the death penalty for forgery. Just execute a few hundred of them and consider how many lives will be saved. That may sound silly to you, but it’s Sunstein and Vermeule, not me, who wrote that bit about “a serious commitment to the sanctity of human life.” I discussed the challenges here in more detail in a 2006 post; see the section, “The death penalty as a decision-analysis problem?” My point is not that they have to agree with me, just that it’s not a good sign that their long-ass law article with its thundering about “the sanctity of human life” is more shallow than two paragraphs of a blog post.

In summary regarding the death-penalty article, I’m not slamming them for falling for crappy research (that’s what social scientists and journalists did back in 2005, and lots of them still do to this day) and I’m not slamming them for supporting the death penalty (I’ve supported it too, at various times in my life; more generally I think it depends on the situation and that the death penalty can be a good idea in some circumstances, even if the current version in the U.S. doesn’t work so well). I’m slamming them for taking half-assed reasoning and presenting it as sophisticated. I’d say they don’t know better, they’re just kinda dumb—but you assure me that Vermeule is actually smart. So my take on it is that they’re really good at playing the academic game. For me to criticize their too-clever-by-half “law and economics” article as not being well thought through, that would be like criticizing LeBron James for not being a golf champion. They do what’s demanded of them in their job.

– Regarding Sunstein’s ability to learn from error: Yes, I mention in my post that Sunstein was persuaded by the article by Wolfers and Donohue. I do think it was good that Sunstein retracted his earlier stance. That’s one reason I was particularly disappointed by what he and his collaborator did in the second edition of Nudge, which was to memory-hole the Wansink episode. It was such a great opportunity in the revision for them to have said that the nudge idea is so compelling that they (and many others) were fooled, and to consider the implications: in a world where people are rewarded for discovering apparently successful nudges, the Wansinks of the world will prosper, at least in the short term. Indeed, Sunstein and Thaler could’ve even put a positive spin on it by talking about the self-correcting nature of science, sunlight is the best disinfectant, etc. But, no, instead they remove it entirely, and then Sunstein returns to his previous credulous self by posting something on what he called the “coolest behavioral finding of 2019.” Earlier they’d referred to Wansink as having had multiple masterpieces. Kind of makes you question their judgment, no? My take on this is . . . for them, everyone’s a friend, so why rock the boat? As I wrote, it looks to me like an alliance of celebrities. I’m guessing that they are genuinely baffled by people like Uri Simonsohn or me who criticize this stuff: Don’t we have anything better to do? It’s natural to think of the behavior of Simonsohn, me, and other “data thugs” as being kinda pathological: we are jealous, or haters, or glory-seekers, or we just have some compulsion to be mean (the kind of people who, in another life, would be Stasi).

– Regarding the Stasi quote: Yes, I agree it’s a good thing Sunstein retracted it. I was not thrilled that in the retraction he said he’d thought it had “a grain of truth,” but, yeah, as retractions go, it was much better than average! Much better than the person who called people “terrorists,” never retracted or apologized, then later published an article lying about a couple of us (a very annoying episode to me, which I have to kind of keep quiet about cos nobody likes a complainer, but grrrr it burns me up, that people can just lie in public like that and get away with it). So, yes, for sure, next time I write about this I will emphasize that he retracted the Stasi line.

– Libertarian paternalism: There’s too much on this for one email, but for my basic take, see this post, in particular the section “Several problems with science reporting, all in one place.” This captures it: Sunstein is all too willing to think that ordinary people are wrong, while trusting the testimony of Wansink, who appears to have been a serial fabricator. It’s part of a world in which normies are stupidly going about their lives doing stupid things, and thank goodness (or, maybe I should say in deference to Vermeule, thank God) there are leaders like Sunstein, Vermeule, and Wansink around to save us from ourselves, and also in the meantime go on NPR, pat each other on the back on Twitter, and enlist the U.S. government in their worthy schemes.

– People are complicated: Vermeule and Sunstein are not “good guys” or “bad guys”; they’re just people. People are complicated. What makes me sad about Sunstein is that, as you said, he does care about evidence, he can learn from error. But then he chooses not to. He chooses to stay in his celebrity comfort zone, making stupid arguments evaluating the president’s job performance based on the stock market, cheerleading biased studies about nudges as if they represent reality. See the last three paragraphs here. Another bad thing Sunstein did recently was to coauthor that Noise book. Another alliance of celebrities! (As a side note, I’m sad to see the collection of academic all-star endorsements that this book received.) Regarding Sunstein himself, see the section “A new continent?” of that post. As I wrote at the time, if you’re going to explore a new continent, it can help to have a local guide who can show you the territory.

Vermeule I know less about; my take is that he’s playing the politics game. He thinks that on balance the Republicans are better than the Democrats, and I’m guessing that when he promotes election fraud misinformation, that he just thinks he’s being mischievous and cute. After all, the Democrats promoted misinformation about police shootings or whatever, so why can’t he have his fun? And, in any case, election security is important, right? Etc etc etc. Anyone with a bit of debate-team experience can justify lots worse than Vermeule’s post-election tweets. I guess they’re not extreme enough for Sunstein to want to stop working with him.

– Other work by Vermeule and Sunstein: They’re well-respected academics, also you and others say how smart they are, so I can well believe they’ve also done high-quality work. It might be that their success in some subfields led them into a false belief that they know what they’re doing in other areas (such as psychology research, statistics, and election administration) where they have no expertise. As the saying goes, sometimes it’s important to know what you don’t know.

My larger concern, perhaps, is that these people get such deference in academia and the news media, that they start to believe their own hype and they think they’re experts in everything.

– Conspiracy theories: Sunstein and Vermeule wrote, “Many millions of people hold conspiracy theories; they believe that powerful people have worked together in order to withhold the truth about some important practice or some terrible event. A recent example is the belief, widespread in some parts of the world, that the attacks of 9/11 were carried out not by Al Qaeda, but by Israel or the United States.” My point here is that there are two conspiracy theories here: a false conspiracy theory that the attacks were carried out by Israel or the United States, and a true conspiracy theory that the attacks were carried out by Al Qaeda. In the meantime, Vermeule has lent his support to unsupported conspiracy theories regarding the 2020 election. So Vermeule is incoherent. On one hand, he’s saying that conspiracy theories are a bad thing. On the other hand, in one place he’s not recognizing the existence of true conspiracies; in another place he’s supporting ridiculous and dangerous conspiracy theories, I assume on the basis that they are in support of his political allies. I don’t think it’s a cheap shot to point out this incoherence.

And what does it mean that Sunstein thinks that “Because those who hold conspiracy theories typically suffer from a ‘crippled epistemology,’ in accordance with which it is rational to hold such theories, the best response consists in cognitive infiltration of extremist groups.”—but he continues to work with Vermeule? Who would want to collaborate with someone who suffers from a crippled epistemology (whatever that means)? The whole thing is hard for me to interpret except as an elitist position where some people such as Sunstein and Vermeule are allowed to believe whatever they want, and hold government positions, while other people get “cognitively infiltrated.”

– The proposed government program: I see your point that when the government is infiltrating dangerous extremist groups, it could make sense for them to try to talk some of these people out of their extremism. After all, for reasons of public safety the FBI and local police are already doing lots of infiltration anyway—they hardly needed Sunstein and Vermeule’s encouragement. Overall I suspect it’s a good thing that the cops are gathering intelligence this way rather than just letting these groups make plans in secret, set off bombs, etc., and once the agents are on the inside, I’d rather have them counsel moderation than do that entrapment thing where they try to talk people into planning crimes so as to be able to get more arrests.

I think what bothers me about the Sunstein and Vermeule article—beyond that they’re worried about conspiracy theories while themselves promoting various con artists and manipulators—is their assumption that the government is on the side of the good. Perhaps this is related to Sunstein being pals with Kissinger. I labeled Sunstein and Vermeule as libertarian paternalists, but maybe Vermeule is better described as an authoritarian; in any case they seem to have the presumption that the government is on their side, whether it’s for nudging people to do good things (not to do bad things) or for defusing conspiracy theories (not to support conspiracy theories).

But governments can’t always be trusted. When I wrote, “They don’t even seem to consider a third option, which is the government actively promoting conspiracy theories,” it’s not that I was saying that this third option was a good thing! Rather, I was saying that the third option is something that’s actually done, and I gave examples of the U.S. executive branch and much of Congress in the period Nov 2020 – Jan 2021, and the Russian government in their invasion of Ukraine. And it seems that Vermeule may well be cool with both these things! So my reaction to Vermeule saying the government should be engaging in information warfare is similar to my reaction when the government proposed to start a terrorism-futures program and have it be run by an actual terrorist: it might be a good idea in theory and even in practice, but (a) these are not the guys I would want in charge of such a program, and (b) their enthusiasm for it makes me suspicious.

– Unrelated to all the above: You say of Vermeule, “after his conversion to Catholicism, he adopted the Church’s line on moral opposition to capital punishment.” That’s funny because I thought the Catholic church was cool with the death penalty—they did the inquisition, right?? Don’t tell me they’ve flip-flopped! Once they start giving in to the liberals on the death-penalty issue, all hell will break loose.

OK, why did I write all that?

1. The mix of social science, statistical evidence, and politics is interesting and important.

2. As an academic, I’m always interested in academics behaving badly, especially when it involves statistics or social science in some way. In particular, the idea that these guys are supposed to be so smart and so nice in regular life, and then they go with these not-so-smart, not-so-nice theories, that’s interesting. When mean, dumb people promote mean, dumb ideas, that’s not so interesting. But when nice, smart people do it . . .

3. It’s been unfair to Sunstein for me to keep bringing up that Stasi thing.

Regarding item 2, one analogy I can see with Vermeule endorsing stupid and dangerous election-fraud claims is dudes in the 60s wearing Che T-shirts and thinking Chairman Mao was cool. From one perspective, Che was one screwed-up dude and Mao was one of history’s greatest monsters . . . but both of them were bad-ass dudes and it was cool to give the finger to the Man. Similarly, Vermeule could well think of Trump as badass, and he probably thinks it’s hilarious to endorse B.S. claims that support his politics. Kinda like how Steven Levitt probably thinks he’s a charming mischievous imp by supporting climate denialists. Levitt would not personally want his (hypothetical) beach house on Fiji to be flooded, but, for a certain kind of person, it’s fun to be a rogue.

Here’s what I wrote when the topic came up before:

There’s no evidence that Vermeule was trying to overthrow the election. He was merely supportive of these efforts, not doing it himself, in the same way that an academic Marxist might root for the general strike and the soviet takeover of government but not be doing anything active on the revolution’s behalf.

The paradox of replication studies: A good analyst has special data analysis and interpretation skills. But it’s considered a bad or surprising thing if you give the same data to different analysts and they come to different conclusions.

Benjamin Kircup writes:

I think you will be very interested to see this preprint that is making the rounds: Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology (ecoevorxiv.org)

I see several ties to social science, including the study of how data interpretation varies across scientists studying complex systems; but also the sociology of science. This is a pretty deep introspection for a field; and possibly damning. The garden of forking paths is wide. They cite you first, which is perhaps a good sign.

Ecologists frequently pride themselves on data analysis and interpretation skills. If there weren’t any variability, what skill would there be? It would all be mechanistic, rote, unimaginative, uninteresting. In general, actually, that’s the perception many have of typical biostatistics. It leaves insights on the table by being terribly rote and using the most conservative kinds of analytic tools (yet another t-test, etc). The price of this is that different people will reach different conclusions with the same data – and that’s not typically discussed, but raises questions about the literature as a whole.

One point: apparently the peer reviews didn’t systematically reward finding large effect sizes. That’s perhaps counterintuitive and suggests that the community isn’t rewarding bias, at least in that dimension. It would be interesting to see what you would do with the data.

The first thing I noticed is that the paper has about a thousand authors! This sort of collaborative paper kind of breaks the whole scientific-authorship system.

I have two more serious thoughts:

1. Kircup makes a really interesting point, that analysts “pride themselves on data analysis and interpretation skills. If there weren’t any variability, what skill would there be?”, but then it’s considered a bad or surprising thing if you give the same data to different analysts and they come to different conclusions. There really does seem to be a fundamental paradox here. On one hand, different analysts do different things; Pete Palmer and Bill James have different styles, and you wouldn’t expect them to come to the same conclusions. On the other hand, we expect strong results to appear no matter who is analyzing the data.

A partial resolution to this paradox is that much of the skill of data analysis and interpretation comes in what questions to ask. In these replication projects (I think Bob Carpenter calls them “bake-offs”), several different teams are given the same question and the same data and then each do their separate analysis. David Rothschild and I did one of these; it was called We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results, and we were the only analysts of that Florida poll from 2016 that estimated Trump to be in the lead. Usually, though, data and questions are not fixed, despite what it might look like when you read the published paper. Still, there’s something intriguing about what we might call the Analyst’s Paradox.

2. Regarding his final bit (“apparently the peer reviews didn’t systematically reward finding large effect sizes”), I think Kircup is missing the point. Peer reviews don’t systematically reward finding large effect sizes. What they systematically reward is finding “statistically significant” effects, i.e. those that are at least two standard errors from zero. But by restricting yourself to those, you automatically overestimate effect sizes, as I discussed at interminable length in papers such as Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors and The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. So they are rewarding bias, just indirectly.
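Here’s a quick simulation of that selection effect, with invented numbers: each individual estimate is unbiased, but the subset that clears the significance filter exaggerates the effect, and a few even get the sign wrong.

```r
# Significance filter in miniature: unbiased estimates go in, exaggerated
# "significant" estimates come out. The true effect and SE are made up.
set.seed(123)
true_effect <- 0.1
se <- 0.1
n_sims <- 1e5

est <- rnorm(n_sims, mean = true_effect, sd = se)  # estimates from many hypothetical studies
signif <- abs(est / se) > 1.96                     # the usual two-sided 5% filter

mean(est)               # about 0.10: unconditionally unbiased
mean(est[signif])       # much larger: the Type M (magnitude) error
mean(est[signif] < 0)   # a few "significant" results even have the wrong sign (Type S)
```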

The importance of measurement, and how you can draw ridiculous conclusions from your statistical analyses if you don’t think carefully about measurement . . . Leamer (1983) got it.

[Graph from the paper discussed below: note the region with a purported life expectancy of 91 and the bizarre fitted regression curve.]

Jacob Klerman writes:

I have noted your recent emphasis on the importance of measurement (e.g., “Here are some ways to make your study replicable…”). For reasons not relevant here, I was rereading Leamer (1983), Let’s Take the Con Out of Econometrics—now 40 years old. It’s a fun, if slightly dated, paper that you seem to be aware of.

Leamer also makes the measurement point (emphasis added):

When the sampling uncertainty S gets small compared to the misspecification uncertainty M, it is time to look for other forms of evidence, experiments or nonexperiments. Suppose I am interested in measuring the width of a coin, and I provide rulers to a room of volunteers. After each volunteer has reported a measurement, I compute the mean and standard deviation, and I conclude that the coin has width 1.325 millimeters with a standard error of .013. Since this amount of uncertainty is not to my liking, I propose to find three other rooms full of volunteers, thereby multiplying the sample size by four, and dividing the standard error in half. That is a silly way to get a more accurate measurement, because I have already reached the point where the sampling uncertainty S is very small compared with the misspecification uncertainty M. If I want to increase the true accuracy of my estimate, it is time for me to consider using a micrometer. So too in the case of diet and heart disease. Medical researchers had more or less exhausted the vein of nonexperimental evidence, and it became time to switch to the more expensive but richer vein of experimental evidence.

Interesting. Good to see examples where ideas we talk about today were already discussed in the classic literature. I indeed think measurement is important and is under-discussed in statistics. Economists are very familiar with the importance of measurement, both in theory (textbooks routinely discuss the big challenges in defining, let alone measuring, key microeconomic quantities such as “the money supply”) and in practice (data gathering can often be a big deal, involving archival research, data quality checking, etc., even if unfortunately this is not always done), but then once the data are in, data quality and issues of bias and variance of measurement often seem to be forgotten. Consider, for example, this notorious paper where nobody at any stage in the research, writing, reviewing, revising, or editing process seemed to be concerned about that region with a purported life expectancy of 91 (see the above graph)—and that doesn’t even get into the bizarre fitted regression curve. But, hey, p less than 0.05. Publishing and promoting such a result based on the p-value represents some sort of apogee of trusting implausible theory over realistic measurement.

Also, if you want a good story about why it’s a mistake to think that your uncertainty should just go like 1/sqrt(n), check out this story which is also included in our forthcoming book, Active Statistics.
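In the meantime, here’s a toy version of Leamer’s coin example, with made-up numbers: the standard error keeps shrinking like 1/sqrt(n), but the actual error of the estimate gets stuck at the shared bias of the rulers.

```r
# Leamer's coin, in miniature: every volunteer's ruler shares a small
# systematic bias, so quadrupling the number of measurements keeps halving
# the standard error while the actual error stalls at the bias.
set.seed(2024)
true_width <- 1.30   # true coin width in mm (invented)
bias <- 0.03         # misspecification: a shared bias in the rulers
for (n in c(25, 100, 400, 1600)) {
  x <- true_width + bias + rnorm(n, sd = 0.10)
  cat(sprintf("n = %4d   se = %.4f   actual error of the mean = %+.3f\n",
              n, sd(x) / sqrt(n), mean(x) - true_width))
}
```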

Mister P and Stan go to Bangladesh . . .

Prabhat Barnwal, Yuling Yao, Yiqian Wang, Nishat Akter Juy, Shabib Raihan, Mohammad Ashraful Haque, and Alexander van Geen ask,

Is the low COVID-19–related mortality reported in Bangladesh for 2020 associated with massive undercounting?

Here’s what they did:

This repeated survey study is based on an in-person census followed by 2 rounds of telephone calls. Data were collected from a sample of 135 villages within a densely populated 350-km2 rural area of Bangladesh. Household data were obtained first in person and subsequently over the telephone. For the analysis, mortality data were stratified by month, age, sex, and household education. Mortality rates were modeled by bayesian multilevel regression, and the strata were aggregated to the population by poststratification. Data analysis was performed from February to April 2021. . . .

Mortality rates were compared for 2019 and 2020, both without adjustment and after adjustment for nonresponse and differences in demographic variables between surveys. Income and food availability reported for January, May, and November 2020 were also compared.

And here’s what they found:

All-cause mortality in the surveyed area was lower in 2020 compared with 2019, but measures to control the COVID-19 pandemic were associated with a reduction in rural income and food availability. These findings suggest that government restrictions designed to curb the spread of COVID-19 may have been effective in 2020 but needed to be accompanied by expanded welfare support.

More specifically:

Enumerators collected data from an initial 16 054 households in January 2020 . . . for a total of 58 806 individuals . . . A total of 276 deaths were reported between February and the end of October 2020 for the subset of the population that could be contacted twice over the telephone, slightly below the 289 deaths reported for the same population over the same period in 2019. After adjustment for survey nonresponse and poststratification, 2020 mortality changed by −8% (95% CI, −21% to 7%) compared with an annualized mortality of 6.1 deaths per 1000 individuals in 2019. However, in May 2020, salaried primary income earners reported a 40% decrease in monthly income (from 17 485 to 10 835 Bangladeshi Taka), and self-employed earners reported a 60% decrease in monthly income (23 083 to 8521 Bangladeshi Taka), with only a small recovery observed by November 2020.

I’ve worked with Lex and Yuling for a long time, and they both know what they’re doing.

Beyond the direct relevance of this work, the above-linked article is a great example of applied statistical analysis with multilevel regression and poststratification using Stan.
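For readers who want to see the general shape of such an analysis, here is a minimal MRP sketch in R using rstanarm. To be clear, this is not the authors’ code: the data frames `survey` and `poststrat` and the variable names (deaths, exposure, sex, age_group, education, N) are invented placeholders, and the real analysis is in the linked article.

```r
# A minimal MRP sketch: multilevel Poisson regression for stratum-level death
# counts with person-months of exposure as an offset, then poststratification
# with census counts. All names are hypothetical placeholders.
library(rstanarm)

fit <- stan_glmer(
  deaths ~ sex + (1 | age_group) + (1 | education) + offset(log(exposure)),
  family = poisson(link = "log"), data = survey
)

# Predict a mortality rate per person-month for every cell of the
# poststratification table (exposure = 1 makes the offset zero), then
# aggregate the cell rates to the population using census counts N.
poststrat$exposure <- 1
rate_draws <- posterior_epred(fit, newdata = poststrat)           # draws x cells
mrp_rate <- as.vector(rate_draws %*% poststrat$N) / sum(poststrat$N)
quantile(mrp_rate, c(0.025, 0.5, 0.975))                          # population rate with uncertainty
```

The steps mirror what the article describes: a multilevel model for the strata, then aggregation over the census counts; everything else is bookkeeping.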

Resources for teaching and learning survey sampling, from Scott Keeter at Pew Research

Art Owen informed me that he’ll be teaching sampling again at Stanford, and he was wondering about ideas for students gathering their own data.

I replied that I like the idea of sampling from databases, biological sampling, etc. You can point out to students that a “blood sample” is indeed a sample!

Art replied:

Your blood example reminds me that there is a whole field (now very old) on bulk sampling. People sample from production runs, from cotton samples, from coal samples and so on. Widgets might get sampled from the beginning, middle and end of the run. David Cox wrote some papers on sampling to find the quality of cotton as measured by fiber length. The process is to draw a blue line across the sample and see the length of fibers that intersect the line. This gives you a length-biased sample that you can nicely de-bias. There’s also an interesting example out there about tree sampling, literally on a tree, where branches get sampled at random and fruit is counted. I’m not sure if it’s practical.

Last time I found an interesting example where people would sample ocean tracts to see if there was a whale. If they saw one, they would then sample more intensely in the neighboring tracts. Then the trick was to correct for the bias that brings. It’s in the Sampling book by S. K. Thompson. There are also good mark-recapture examples for wildlife.

I hesitate to put a lot of regression in a sampling class; it is all too easy for every class to start looking like a regression/prediction/machine learning class. We need room for the ideas about where and how data arises, and it’s too easy to crowd those out by dwelling on the modeling ideas.

I’ll probably toss in some space-filling sampling plans and other ways to downsize data sets as well.

The old Cochran style was: get an estimator, show it is unbiased, find an expression for its variance, find an estimate of that variance, show this estimate is unbiased and maybe even find and compare variances of several competing variance estimates. I get why he did it but it can get dry. I include some of that but I don’t let it dominate the course. Choices you can make and their costs are more interesting.
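The cotton-fiber example would make a nice classroom demonstration. Here’s a bare-bones simulation of length-biased sampling and the inverse-length correction, using a made-up population of fiber lengths rather than real cotton data:

```r
# Length-biased sampling: a fiber crosses the line with probability
# proportional to its length, so the naive sample mean is too long.
# Weighting by 1/length (a harmonic mean) undoes the bias.
set.seed(1)
pop <- rgamma(1e5, shape = 2, rate = 2)                   # true mean length = 1
samp <- sample(pop, 2000, replace = TRUE, prob = pop)     # inclusion prob. proportional to length

mean(pop)           # about 1.0: the population mean
mean(samp)          # about 1.5: the length-biased sample overestimates it
1 / mean(1 / samp)  # about 1.0: the inverse-length-weighted estimate recovers it
```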

I connected Art to Scott Keeter at Pew Research, who wrote:

Fortunately, we are pretty diligent about keeping track of what we do and writing it up. The examples below have lengthy methodology sections and often there is companion material (such as blog posts or videos) about the methodological issues.

We do not have a single overview methodological piece about this kind of work but the next best thing is a great lecture that Courtney Kennedy gave at the University of Michigan last year, walking through several of our studies and the considerations that went into each one:

Here are some links to good examples, with links to the methods sections or extra features:

Our recent study of Jewish Americans, the second one we’ve done. We switched modes for this study (thus different sampling strategy), and the report materials include an analysis of mode differences https://www.pewresearch.org/religion/2021/05/11/jewish-americans-in-2020/

Appendix A: Survey methodology

Jewish Americans in 2020: Answers to frequently asked questions

Our most recent survey of the US Muslim population:

U.S. Muslims Concerned About Their Place in Society, but Continue to Believe in the American Dream


A video on the methods:
https://www.pewresearch.org/fact-tank/2017/08/16/muslim-americans-methods/

This is one of the most ambitious international studies we’ve done:

Religion in India: Tolerance and Segregation


Here’s a short video on the sampling and methodology:
https://www.youtube.com/watch?v=wz_RJXA7RZM

We then had a quick email exchange:

Me: Thanks. Post should appear in Aug.

Scott: Thanks. We’ll probably be using sampling by spaceship and data collection with telepathy by then.

Me: And I’ll be charging the expenses to my NFT.

In a more serious vein, Art looked into Scott’s suggestions and followed up:

I [Art] looked at a few things at the Pew web-site. The quality of presentation is amazingly good. I like the discussions of how you identify who to reach out to. Also the discussion of how to pose the gender identity question is something that I think would interest students. I saw some of the forms and some of the data on response rates. I also found Courtney Kennedy’s video on non-probability polls. I might avoid religious questions for in-depth followup in class. Or at least, I would have to be careful in doing it, so nobody feels singled out.

Where could I find some technical documents about the American Trends Panel? I would be interested to teach about sample reweighting, e.g., raking and related methods, as it is done for real.

I’m wondering about getting survey data for a class. I might not be able to require them to get a Pew account and then agree to terms and conditions. Would it be reasonable to share a downsampled version of a Pew data set with a class? Something about attitudes to science would be interesting for students.

To which Scott replied:

Here is an overview I wrote about how the American Trends Panel operates and how it has changed over time in response to various challenges:

Growing and Improving Pew Research Center’s American Trends Panel

This relatively short piece provides some good detail about how the panel works:
https://www.pewresearch.org/fact-tank/2021/09/07/how-do-people-in-the-u-s-take-pew-research-center-surveys-anyway/

We use the panel to conduct lots of surveys, but most of them are one-off efforts. We do make an effort to track trends over time, but that’s usually the way we used to do it when we conducted independent sample phone surveys. However, we sometimes use the panel as a panel – tracking individual-level change over time. This piece explains one application of that approach:
https://www.pewresearch.org/fact-tank/2021/01/20/how-we-know-the-drop-in-trumps-approval-rating-in-january-reflected-a-real-shift-in-public-opinion/

When we moved from mostly phone surveys to mostly online surveys, we wanted to assess the impact of the change in mode of interview on many of our standard public opinion measures. This study was a randomized controlled experiment to try to isolate the impact of mode of interview:

From Telephone to the Web: The Challenge of Mode of Interview Effects in Public Opinion Polls

Survey panels have some real benefits but they come with a risk – that panelists change as a result of their participation in the panel and no longer fully resemble the naïve population. We tried to assess whether that is happening to our panelists:

Measuring the Risks of Panel Conditioning in Survey Research

We know that all survey samples have biases, so we weight to try to correct those biases. This particular methodology statement is more detailed than is typical and gives you some extra insight into how our weighting operates. Unfortunately, we do not have a public document that breaks down every step in the weighting process:

Methodology

Most of our weighting parameters come from U.S. government surveys such as the American Community Survey and the Current Population Survey. But some parameters are not available on government surveys (e.g., religious affiliation) so we created our own higher quality survey to collect some of these for weighting:

How Pew Research Center Uses Its National Public Opinion Reference Survey (NPORS)

This one is not easy to find on our website but it’s a good place to find wonky methodological content, not just about surveys but about our big data projects as well:

Home


We used to publish these through Medium but decided to move them in-house.

By the way, my colleagues in the survey methods group have developed an R package for the weighting and analysis of survey data. This link is to the explainer for weighting data but that piece includes links to explainers about the basic analysis package:
https://www.pewresearch.org/decoded/2020/03/26/weighting-survey-data-with-the-pewmethods-r-package/

Lots here to look at!
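On the raking that Art wants to teach: the pewmethods package linked above (or the survey package’s rake function) does this for real, but the underlying iterative proportional fitting is simple enough to show from scratch. Here’s a toy sketch; the function rake_weights and the margins are my own inventions for illustration.

```r
# Bare-bones raking (iterative proportional fitting): cycle over the weighting
# variables, each time scaling the weights so the weighted margin matches its
# population target, until the weights stop changing.
rake_weights <- function(data, targets, tol = 1e-8, max_iter = 100) {
  w <- rep(1, nrow(data))
  for (iter in 1:max_iter) {
    w_old <- w
    for (v in names(targets)) {
      current <- tapply(w, data[[v]], sum)             # weighted margin for variable v
      adj <- targets[[v]][names(current)] / current    # scaling factor per category
      w <- w * adj[as.character(data[[v]])]
    }
    if (max(abs(w - w_old)) < tol) break
  }
  unname(w)
}

# Toy example: rake a lopsided sample of 5 respondents to 50/50 sex and
# 30/70 education margins (targets expressed as counts summing to 5).
samp <- data.frame(sex  = c("f", "f", "f", "m", "m"),
                   educ = c("hs", "coll", "coll", "coll", "hs"))
targets <- list(sex = c(f = 2.5, m = 2.5), educ = c(hs = 1.5, coll = 3.5))
rake_weights(samp, targets)
```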

It’s been a while since I’ve taught a course on survey sampling. I used to teach such a course—it was called Design and Analysis of Sample Surveys—and I enjoyed it. But . . . in the class I’d always have to spend some time discussing basic statistics and regression modeling, and this always was the part of the class that students found the most interesting! So I eventually just started teaching statistics and regression modeling, which led to my Regression and Other Stories book. The course I’m now teaching out of that book is called Applied Regression and Causal Inference. I still think survey sampling is important; it was just hard to find an audience for the course.

Here’s how to subscribe to our new weekly newsletter:

Just a reminder: we have a new weekly newsletter. We posted on it a couple weeks ago; I’m just giving a reminder here because the goal of the newsletter is to reach people who wouldn’t otherwise go online to read the blog.

Subscribing is free, and then in your inbox each Monday morning you’ll get a list of our scheduled posts for the forthcoming week, along with links to the past week’s posts. Enjoy.

P.S. To subscribe, click on the link and follow the instructions from there.

Learning from mistakes (my online talk for the American Statistical Association, 2:30pm Tues 30 Jan 2024)

Here’s the link:

Learning from mistakes

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We learn so much from mistakes! How can we structure our workflow so that we can learn from mistakes more effectively? I will discuss a bunch of examples where I have learned from mistakes, including data problems, coding mishaps, errors in mathematics, and conceptual errors in theory and applications. I will also discuss situations where researchers have avoided good learning opportunities. We can then try to use all these cases to develop some general understanding of how and when we learn from errors in the context of the fractal nature of scientific revolutions.

The video is here.

It’s sooooo frustrating when people get things wrong, the mistake is explained to them, and they still don’t make the correction or take the opportunity to learn from their mistakes.

To put it another way . . . when you find out you made a mistake, you learn three things:

1. Now: Your original statement was wrong.

2. Implications for the future: Beliefs and actions that flow from that original statement may be wrong. You should investigate your reasoning going forward and adjust to account for your error.

3. Implications for the past: Something in your existing workflow led to your error. You should trace your workflow, see how that happened, and alter your workflow accordingly.

In poker, they say to evaluate the strategy, not the play. In quality control, they say to evaluate the process, not the individual outcome. Similarly with workflow.

As we’ve discussed many many times in this space (for example, here), it makes me want to screeeeeeeeeeam when people forego this opportunity to learn. Why do people, sometimes very accomplished people, give up this opportunity? I’m speaking here of people who are trying their best, not hacks and self-promoters.

The simple answer for why even honest people will avoid admitting clear mistakes is that it’s embarrassing for them to admit error, they don’t want to lose face.

The longer answer, I’m afraid, is that at some level they recognize issues 1, 2, and 3 above, and they go to some effort to avoid confronting item 1 because they really really don’t want to face item 2 (their beliefs and actions might be affected, and they don’t want to hear that!) and item 3 (they might be going about everything all wrong, and they don’t want to hear that either!).

So, paradoxically, the very benefits of learning from error are scary enough to some people that they’ll deny or bury their own mistakes. Again, I’m speaking here of otherwise-sincere people, not of people who are willing to lie to protect their investment or make some political point or whatever.

In my talk, I’ll focus on my own mistakes, not those of others. My goal is for you in the audience to learn how to improve your own workflow so you can catch errors faster and learn more from them, in all three senses listed above.

P.S. Planning a talk can be good for my research workflow. I’ll get invited to speak somewhere, then I’ll write a title and abstract that seems like it should work for that audience, then the existence of this structure gives me a chance to think about what to say. For example, I’d never quite thought of the three ways of learning from error until writing this post, which in turn was motivated by the talk coming up. I like this framework. I’m not claiming it’s new—I guess it’s in Pólya somewhere—just that it will help my workflow. Here’s another recent example of how the act of preparing an abstract helped me think about a topic of continuing interest to me.

“My view is that if I can show that a result was cooked and that doing it correctly does not yield the answer the authors claimed, then the result is discredited. . . . What I hear, instead, is the following . . .”

Economic historian Tim Guinnane writes:

I have a general question that I have not seen addressed on your blog. Often this question turns into a narrow question about retracting papers, but I think that short-circuits an important discussion.

Like many in economic history, I am increasingly worried that much research in recent years reflects p-hacking, misrepresentation of the history, useless data, and other issues. I realize that the technical/statistical issues differ from paper to paper.

What I see is something like the following. You can use this paper as a concrete example, but the problems are much more widespread. We document a series of bad research practices. The authors played games with controls to get the “right” answer for the variable of interest. (See Table 1 of the paper.) In the text they misrepresent the definitions of variables used in regressions; we show that if you use the stated definition, their results disappear. They use the wrong degrees of freedom to compute error bounds (in this case, they had to program the bounds by hand, since Stata automatically uses the right df). There are other, and to our minds more serious, problems involved in selectively dropping data, claiming sources do not exist, etc.

Step back from any particular problem. How should the profession think about claims such as ours? My view is that if I can show that a result was cooked and that doing it correctly does not yield the answer the authors claimed, then the result is discredited. The journals may not want to retract such work, but there should be support for publishing articles that point out such problems.

What I hear, instead, is the following. A paper estimates beta as .05 with a given SE. Even if we show that this is cooked—that is, that beta is a lot smaller or the SE a lot larger if you do not throw in extraneous regressors, or play games with variable definitions—then ours is not really a result. It is instead, I am told, incumbent on the critic to start with beta=.05 as the null, and show that doing things correctly rejects that null in favor of something less than .05 (it is characteristic of most of this work that there really is no economic theory, so the null is always “X does not matter” which boils down to “this beta is zero.” And very few even tell us whether the correct test is one- or two-sided).

This pushback strikes me as weaponizing the idea of frequentist hypothesis testing. To my mind, if I can show that beta=.05 comes from a cooked regression, then we need to start over. That estimate can be ignored; it is just one of many incorrect estimates one can generate by doing things inappropriately. It actually gives the unscrupulous an incentive to concoct more outlandish betas which are then harder to reject. More generally, it puts a strange burden of proof on critics. I have discussed this issue with some folks in natural sciences who find the pushback extremely difficult to understand. They note what I think is the truth: it encourages bad research behavior by suppressing papers that demonstrate that bad behavior.

It might be opportune to have a general discussion of these sorts of issues on your website. The Gino case raises something much simpler, I think. I fear that it will in some ways lower the bar: so long as someone is not actively making up their data (which I realize has not been proven, in case this email gets subpoenaed!) then we do not need to worry about cooking results.

My reply: You raise several issues that we’ve discussed on occasion (for some links, see here):

1. The “Research Incumbency Rule”: Once an article is published in some approved venue, it is taken as truth. Criticisms which would absolutely derail a submission in pre-publication review can be brushed aside if they are presented after publication. This is what you call “the burden of proof on critics.”

2. Garden of forking paths.

3. Honesty and transparency are not enough. Work can be non-fraudulent but still be crap.

4. “Passive corruption” when people know there’s bad work but they don’t do anything about it.

5. A disturbingly casual attitude toward measurement; see here for an example: https://statmodeling.stat.columbia.edu/2023/10/05/no-this-paper-on-strip-clubs-and-sex-crimes-was-never-gonna-get-retracted-also-a-reminder-of-the-importance-of-data-quality-and-a-reflection-on-why-researchers-often-think-its-just-fine-to-publ/ Many economists and others seem to have been brainwashed into thinking that it’s ok to have bad measurement because attenuation bla bla . . . They’re wrong.
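To spell out why the attenuation excuse doesn’t get you off the hook, here’s a quick simulation with invented numbers: combine a noisily measured predictor with a significance filter, and the estimates that survive tend to come out larger than the true effect, not conservatively smaller.

```r
# Noisy measurement plus a significance filter: the "published" estimates of a
# small true slope come out inflated, not attenuated. All numbers are made up.
set.seed(99)
n <- 50; true_slope <- 0.15; sims <- 5000
published <- replicate(sims, {
  x <- rnorm(n)
  y <- true_slope * x + rnorm(n)
  x_obs <- x + rnorm(n, sd = 1)   # predictor measured with a lot of error
  coefs <- summary(lm(y ~ x_obs))$coefficients
  if (coefs["x_obs", "Pr(>|t|)"] < 0.05) coefs["x_obs", "Estimate"] else NA
})
mean(published, na.rm = TRUE)  # noticeably larger than the true 0.15, and far
                               # larger than the attenuated value of about 0.075
mean(!is.na(published))        # and only a minority of studies clear the filter
```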

He responded: If you want an example of economists using stunningly bad data and making noises about attenuation, see here.

The paper in question has the straightforward title, “We Do Not Know the Population of Every Country in the World for the Past Two Thousand Years.”