Stupid legal arguments: a moral hazard?

I’ve published two theorems; one was true and one turned out to be false. We would say the total number of theorems I’ve proved is 1, not 2 or 0. The false theorem doesn’t count as a theorem, nor does it knock out the true theorem.

This also seems to be the way that aggregation works in legal reasoning: if a lawyer gives 10 arguments and 9 are wrong, that’s ok; only the valid argument counts.

I was thinking about this after seeing two recent examples:

1. Law professor Larry Lessig released a series of extreme arguments defending three discredited celebrities: Supreme Court judge Clarence Thomas, financier Jeffrey Epstein, and economist Francesca Gino. When looking into this, I was kinda surprised that Lessig, who is such a prominent law professor, was offering such weak arguments—but maybe I wasn’t accounting for the asymmetrical way in which legal arguments are received: you spray out lots of arguments, and the misses don’t count; all that matters is how many times you get a hit.

I remember this from being a volunteer judge at middle-school debate: you get a point for any argument you land that the opposing side doesn’t bother to refute. This creates an incentive to emit a flow of arguments, as memorably dramatized by Ben Lerner in one of his books. Anyway, the point is that from Lessig’s perspective, maybe it’s ok that he spewed out some weak arguments; that’s just the rules of the game.

2. A group suing the U.S. Military Academy to abandon affirmative action claimed in its suit that “For most of its history, West Point has evaluated cadets based on merit and achievement,” a ludicrous claim, considering that the military academy graduated only three African-American cadets during its first 133 years.

If I were the judge, I’d be inclined to toss out the entire lawsuit based on this one statement, as it indicates a fatal lack of seriousness on the part of the plaintiffs.

On the other hand, I get it: all that matters is that the suit has at least one valid argument. The invalid arguments shouldn’t matter. This reasoning can be seen more clearly, perhaps, if we consider a person unjustly sentenced to prison for a crime he didn’t commit. If, in his defense, he offers ten arguments, of which nine are false, but the tenth unambiguously exonerates him, then he should get off. The fact that he, in his desperation, offered some specious arguments does not make him guilty of the crime.

The thing that bugs me about this West Point lawsuit and, to a lesser extent, Lessig’s posts, is that this freedom to make bad arguments without consequences creates what economists call a “moral hazard,” by which there’s an incentive to spew out low-quality arguments as a way to “flood the zone” and overwhelm the system.

I was talking with a friend about this and he said that the incentives here are not so simple, as people pay a reputational cost when they promote bad arguments. It’s true that whatever respect I had for Lessig or the affirmative-action-lawsuit people has diminished, in the same way that Slate magazine has lost some of its hard-earned reputation for skepticism after running a credulous piece on UFOs. But . . . Lessig and the affirmative-action crew don’t care about what people like me think about them, right? They’re playing the legal game. I’m not sure what, if anything, should be done about all this; it just bothers me that there seem to be such strong incentives for lawyers (and others) to present bad arguments.

I’m sure that legal scholars have written a lot about this one, so I’m not claiming any originality here.

P.S. However these sorts of lawsuits are treated in the legal system, I think that it would be appropriate for their stupidity to be pointed out when they get media coverage. Instead, there seems to be a tendency to take ridiculous claims at face value, as long as they are mentioned in a lawsuit. For example, here’s NPR on the West Point lawsuit: “In its lawsuit filed Tuesday, it asserts that in recent decades West Point has abandoned its tradition of merit-based admissions”—with no mention of how completely stupid it is to claim that they had a “tradition of merit-based admissions” in their first 133 years with only 3 black graduates. Or the New York Times, which again quotes the stupid claim without pointing out its earth-is-flat nature. AP and Reuters did a little better in that they didn’t quote the ridiculous claim; on the other hand, that serves to make the lawsuit seem more reasonable than it is.

(again) why I don’t like to talk so much about “p-hacking.” But sometimes the term is appropriate!

Part 1

Jonathan Falk points us to this parody article that has suggestions on how to p-hack.

I replied that I continue to be bothered by the term “p-hacking.” Sometimes it applies very clearly (as in the work of Brian Wansink, although it’s a mystery why he felt the need to p-hack given that it seems that his data could never have existed as reported), but other times there is no “hacking” going on. So I prefer the term forking paths.

Two things going on here:

1. Saying “p-hacking” when it’s forking paths is uncharitable, as it implies active “hacking” when it can well be that researchers are just following the data in what seems like a reasonable way.

2. Bad researchers looove to conflate the professional and the personal. Say they’re p-hacking and they’ll get in a huff: “Who are you to accuse me of misconduct??”, etc. Say they have forking paths and you remove, or at least, reduce, that argument. OK, in real life, yeah, people will say, “Who are you to accuse me of forking paths?”, but forking paths is just a thing that happens, an inevitable result of data processing and analysis plans that were not decided ahead of time.

So, yeah, humor aside, I don’t like the p-hacking talk, for similar reasons to my not liking the “file drawer” thing: in both cases, the focus on a specific mechanism can serve to minimize the real problem, to conflate scientific mistakes with intentional misconduct, and to provide an easy out for many practitioners of bad science who don’t seem to realize that honesty and transparency are not enuf.

Falk responds:

I agree completely with that.

But honestly, I feel like both the garden of forking paths and p-hacking are just versions of Bitcoin’s Proof of Work method. You get rewards for showing how much effort you had to go to get SIGNIFICANCE. If you have a study with a p-value on your first try of 1e-8, people will say “But that result was obvious! Why do they even bother with a test?” If you garden-of-forking-paths or p-hack your way to 0.047, you will be credited for your perspicacity.

Part 2

Ethan Steinberg writes:

I just came across an article that will probably be interesting to you and your readers. Back in 2022, the Florida Surgeon General released a report that the COVID vaccine appeared to be statistically significantly correlated with cardiac arrest: “In the 28 days following vaccination, a statistically significant increase in cardiac-related deaths was detected for the entire study population (RI = 1.07, 95% CI = 1.03 – 1.12).” Here is the full report.

This was then used to recommend against COVID vaccines for young men in particular.

A local Florida paper just obtained and released the original versions of the reports:

Here are the drafts, from first to last.

The TLDR is that the original analysis did not find significant increases in cardiac related deaths. They had to go through a lot of analysis variants / drafts to get the result they were looking for.

I guess the real question here is how this could be avoided in the future. Maybe we should expect public health officials to register their analysis in advance?

I don’t think we should ask public health officials to register their analysis in advance, as that just seems like more of a mess. But in any case the above seems like an example where there really was p-hacking.

P.S. Just to clarify: As always, the problem is not with the “hacking”—looking at data in many different ways—but rather in only reporting some small subset of the analyses. It’s fine to go through a lot of analyses of the data; then, you should publish all of it, or publish a single analysis that incorporates all of what you’ve done using multilevel modeling.
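To make the selective-reporting point concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that each analysis variant is an independent z-test on pure noise (real analysis variants are correlated, so the numbers would differ); the function names and settings are hypothetical. It compares how often the first analysis reaches p < 0.05 with how often the best of ten does.

```python
import numpy as np
from math import erf, sqrt


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 1.0 - erf(abs(z) / sqrt(2.0))


def simulate(n_sims=2000, n_analyses=10, n_obs=50, seed=0):
    """Return (rate for the first analysis, rate for the best analysis) at p < 0.05."""
    rng = np.random.default_rng(seed)
    first_hits = 0
    best_hits = 0
    for _ in range(n_sims):
        # Assumption: every variant is an independent null data set with known sd = 1.
        data = rng.normal(0.0, 1.0, size=(n_analyses, n_obs))
        z = data.mean(axis=1) * np.sqrt(n_obs)
        p = np.array([two_sided_p(v) for v in z])
        first_hits += int(p[0] < 0.05)     # report the pre-chosen analysis
        best_hits += int(p.min() < 0.05)   # report only the most favorable one
    return first_hits / n_sims, best_hits / n_sims


if __name__ == "__main__":
    first, best = simulate()
    print(f"first analysis significant: {first:.3f}")
    print(f"best of 10 significant:     {best:.3f}")
```

Under these assumptions the first rate should sit near the nominal 5%, while the best-of-ten rate should be near 1 - 0.95^10, roughly 40%. Running all ten analyses is not the problem; reporting only the winner is.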

West Point, like major league baseball, was purely based on merit and achievement before Jackie Robinson and that Tuskegee Airman guy came along and messed everything up.

This news article, “Anti-Affirmative Action Group Sues West Point Over Admissions Policy,” contained this amazing quote:

“For most of its history, West Point has evaluated cadets based on merit and achievement,” the group said in its complaint, filed on Tuesday in the Southern District of New York. But that changed, the group argued, over the last few decades.

Whaaa . . .?

How many black cadets did they have in the 1840s, anyway? Guess we gotta check the internet . . . googling *black cadets at west point* points us to this article from the National Museum of African American History and Culture:

In its first 133 years of existence (1802–1935), over 10,000 white cadets graduated from the United States Military Academy at West Point. In stark contrast, only three African American cadets could claim this achievement . . . Benjamin O. Davis Jr. became the fourth African American cadet to graduate in 1936. Perhaps best known as commander of the famous Tuskegee Airmen in World War II, Davis had a long and distinguished career in the Air Force before retiring in 1970 at the rank of Lieutenant General. . . .

They’ve got this juicy quote from Major General John M. Schofield, Superintendent of West Point, in 1880:

“To send to West Point for four years competition a young man who was born in slavery is to assume that half a generation is sufficient to raise a colored man to the social, moral, and intellectual level which the average white man has reached in several hundred years. As well might the common farm horse be entered in a four-mile race against the best blood inherited from a long line of English racers.”

The article continues:

Between 1870 and 1899, only 12 African American cadets were admitted to West Point. Each endured physical and emotional abuse and racist treatment from their white peers and professors throughout their time at the Academy. They were ostracized, barred from social activities with other cadets, and spoken to only when officially necessary, a practice known as silencing. While white cadets were hazed by their fellow cadets as punishment for serious misconduct, Black cadets were hazed for being Black and for being at West Point.

OK, so here’s the score:

Years        # black cadets    # black graduates
1802-1869    0                 0
1870-1935    12                3

Given the data, it’s absolutely ridiculous of them to say, “For most of its history, West Point has evaluated cadets based on merit and achievement.”

Is that just how lawyers write things in official complaints? Is the idea to make some ludicrous claims just to distract the other side? I don’t get it. If I were a judge, that sort of thing would just annoy me. Then again, I’m not a judge.

It’s just like major league baseball, which was purely based on merit and achievement until those pesky affirmative action bureaucrats came along in 1947 to mess everything up.

In all seriousness, it seems in retrospect to have been a terrible decision to restrict the military academies to whites for the first 100+ years. Imagine if Robert Lee and Stonewall Jackson had had black classmates at West Point. Maybe then they wouldn’t have been so gung-ho to lead troops in defense of slavery. The Dred Scott decision would have a different meaning if it was their friends who were at risk of being kidnapped and enslaved. It seems fair enough to draw a direct line from an all-white West Point to the tragedy of the Civil War. And then for some twit in 2023 to say, “For most of its history, West Point has evaluated cadets based on merit and achievement” . . . !

P.S. Annoyingly, the news article does not link to the actual complaint. But after some googling, I found it here. The whole thing is kinda nuts. In the same paragraph where they make the obviously false claim, “For most of its history, West Point has evaluated cadets based on merit and achievement,” they also point out that the U.S. military wasn’t desegregated until 1948!

Later on, they refer to “the brief period of racial unrest [from 1969 to 1972] that West Point retells over and over.” I guess they’re cool about the first 150 years or so when blacks were entirely or nearly-entirely excluded and evaluation was “based on merit and achievement”: keep the place essentially all-white and intimidate the few black cadets who are there, and you have no racial unrest, huh?

Affirmative action is a complicated issue. I don’t think this particular group is helping anyone by trying to push a distorted version of history.

P.P.S. Relevant context from Dred Scott v. Sandford:

The question is simply this: Can a negro whose ancestors were imported into this country, and sold as slaves, become a member of the political community formed and brought into existence by the Constitution of the United States, and as such become entitled to all the rights and privileges and immunities guaranteed to the citizen? . . .

It will be observed, that the plea applies to that class of persons only whose ancestors were negroes of the African race, and imported into this country, and sold and held as slaves. The only matter in issue before the court, therefore, is, whether the descendants of such slaves, when they shall be emancipated, or who are born of parents who had become free before their birth, are citizens of a State, in the sense in which the word citizen is used in the Constitution of the United States. . . .

The situation of this population was altogether unlike that of the Indian race. The latter, it is true, formed no part of the colonial communities, and never amalgamated with them in social connections or in government. But although they were uncivilized, they were yet a free and independent people, associated together in nations or tribes, and governed by their own laws. . . .

The words “people of the United States” and “citizens” are synonymous terms, and mean the same thing. They both describe the political body who, according to our republican institutions, form the sovereignty, and who hold the power and conduct the Government through their representatives. . . . The question before us is, whether the class of persons described in the plea in abatement compose a portion of this people, and are constituent members of this sovereignty? We think they are not, and that they are not included, and were not intended to be included, under the word “citizens” in the Constitution, and can therefore claim none of the rights and privileges which that instrument provides for and secures to citizens of the United States. On the contrary, they were at that time considered as a subordinate and inferior class of beings . . .

They had for more than a century before been regarded as beings of an inferior order, and altogether unfit to associate with the white race, either in social or political relations; and so far inferior, that they had no rights which the white man was bound to respect; and that the negro might justly and lawfully be reduced to slavery for his benefit. He was bought and sold, and treated as an ordinary article of merchandise and traffic, whenever a profit could be made by it. This opinion was at that time fixed and universal in the civilized portion of the white race. It was regarded as an axiom in morals as well as in politics, which no one thought of disputing . . .

That was in the good old days, back when West Point has evaluated cadets based on merit and achievement and there was no racial unrest. Somewhere between 1857 and today, something seems to have gone terribly wrong, according to this new lawsuit. Too bad for them that Roger Taney is no longer on the court.

My SciML Webinar next week (28 Sep): Multiscale generalized Hamiltonian Monte Carlo with delayed rejection

I’m on the hook to do a SciML webinar next week:

These are organized by Keith Phuthi (who is at CMU) through University of Michigan’s Institute for Computational Discovery and Engineering.

Sam Livingstone is moderating. This talk presents joint work with Alex Barnett, Chirag Modi, Edward Roualdes, and Gilad Turok.

I’m very excited about this project as it combines a number of threads I’ve been working on with collaborators. When I did my job talk here, Leslie Greengard, our center director, asked me why we didn’t use variable step size integrators when doing Hamiltonian Monte Carlo. I told him we’d love to do it, but didn’t know how to do it in such a way as to preserve the stationary target distribution.

Delayed rejection HMC

Then we found Antonietta Mira’s work on delayed rejection. It lets you retry a second Metropolis proposal if the first one is rejected. The key here is that we can use a smaller step size for the second proposal, thus recovering from proposals that are rejected because the Hamiltonian diverged (i.e., the first-order gradient based algorithm can’t handle regions of high curvature in the target density). There’s a bit of bookkeeping (which is frustratingly hard to write down) for the Hastings condition to ensure detailed balance. Chirag Modi, Alex Barnett and I worked out the details, and Chirag figured out a novel twist on delayed rejection that only retries if the original acceptance probability was low. You can read about it in our paper:
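Here is a minimal sketch of the delayed rejection idea in its simplest setting: a two-stage random-walk Metropolis sampler, not the HMC version from our paper and without Chirag's twist. The standard-normal target, the two proposal scales, and all function names are assumptions made for illustration. The second-stage acceptance is the standard two-stage Hastings correction, which accounts for the reverse path that starts at the second proposal and rejects the first one.

```python
import numpy as np


def log_target(x):
    """Log density of a standard normal, up to a constant (stand-in target)."""
    return -0.5 * x * x


def log_q1(a, b, scale):
    """Log density, up to a constant, of the first-stage Gaussian proposal a -> b."""
    return -0.5 * ((b - a) / scale) ** 2


def accept_prob(lp_from, lp_to):
    """Metropolis acceptance probability for a symmetric proposal."""
    return float(np.exp(min(0.0, lp_to - lp_from)))


def dr_step(x, rng, big=4.0, small=0.5):
    """One delayed-rejection step: bold proposal first, timid retry if it is rejected."""
    lp_x = log_target(x)
    y1 = x + big * rng.normal()
    lp_y1 = log_target(y1)
    a1 = accept_prob(lp_x, lp_y1)
    if rng.uniform() < a1:
        return y1
    # First proposal rejected (so a1 < 1): retry from x with a smaller scale.
    y2 = x + small * rng.normal()
    lp_y2 = log_target(y2)
    # Reverse path: start at y2, propose y1 in stage one, and reject it.
    a1_rev = accept_prob(lp_y2, lp_y1)
    if a1_rev >= 1.0:
        return x  # the reverse path could never reject y1, so the ratio is zero
    # The symmetric second-stage proposal density cancels in the ratio.
    log_num = lp_y2 + log_q1(y2, y1, big) + np.log1p(-a1_rev)
    log_den = lp_x + log_q1(x, y1, big) + np.log1p(-a1)
    if np.log(rng.uniform()) < log_num - log_den:
        return y2
    return x


def run_chain(n=20_000, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    draws = np.empty(n)
    for i in range(n):
        x = dr_step(x, rng)
        draws[i] = x
    return draws


if __name__ == "__main__":
    draws = run_chain()
    print("mean:", draws.mean(), "variance:", draws.var())
```

If the bookkeeping is right, the draws should have mean near 0 and variance near 1 even though the first-stage scale is deliberately too large. The frustrating part in the HMC setting is that the proposals are deterministic given the momentum, so the analogous terms take more care to write down.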

This works really well and is enough that we can get proper draws from Neal’s funnel (vanilla HMC fails on this example in either the mouth or neck of the funnel, depending on the step size). But it’s inefficient in that it retries an entire Hamiltonian trajectory, which means that if we cut the step size in half, we double the number of steps to keep the integration time constant.

Radford Neal to the rescue

As we were doing this, the irrepressible Radford Neal published a breakthrough algorithm:

What he managed to do was use generalized Hamiltonian Monte Carlo (G-HMC) to build an algorithm that takes one step of HMC (like Metropolis-adjusted Langevin, but over the coupled position/momentum variables) and manages to maintain directed progress. Instead of fully resampling momentum each iteration, G-HMC resamples a new momentum value then performs a weighted average with the existing momentum with most of the weight on the existing momentum. Neal shows that with a series of accepted one-step HMC iterations, we can make directed progress just like HMC with longer trajectories. The trick is getting sequences of acceptances together. Usually this doesn’t work because we have to flip momentum each iteration. We can re-flip it when regenerating, to keep going in the same direction on acceptances, but with rejections we reverse momentum (this isn’t an issue with HMC because it fully regenerates each time). So to get directed movement, we need steps that are too small. What Radford figured out is that we can solve this problem by replacing the way we generate uniform(0, 1)-distributed probabilities for the Metropolis accept/reject step (we compare the variate generated to the ratio of the density at the proposal to the density at the previous point and accept if it’s lower). Radford realized that if we instead generate them in a sawtooth pattern (with micro-jitter for ergodicity), then when we’re at the bottom of the sawtooth generating a sequence of values near zero, the acceptances will cluster together.
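Two pieces of this are easy to play with in isolation, so here is a toy sketch. It is not Neal's algorithm: it assumes the usual partial-refresh form p' = alpha * p + sqrt(1 - alpha^2) * noise, holds the Metropolis ratio fixed at 0.5, and represents the sawtooth as |s| for a value s that drifts around [-1, 1) with a little jitter. That representation, the constants, and all names are my own assumptions for illustration.

```python
import numpy as np


def partial_refresh(p, alpha, rng):
    """Blend old momentum with fresh N(0, 1) noise; this keeps the momentum N(0, 1)."""
    return alpha * p + np.sqrt(1.0 - alpha**2) * rng.normal(size=p.shape)


def sawtooth_uniforms(n, delta, jitter, rng):
    """Marginally uniform(0, 1) values that drift slowly instead of being independent."""
    s = rng.uniform(-1.0, 1.0)
    u = np.empty(n)
    for t in range(n):
        u[t] = abs(s)
        s = s + delta + jitter * rng.normal()
        s = (s + 1.0) % 2.0 - 1.0  # wrap back into [-1, 1)
    return u


def mean_run_length(accepts):
    """Average length of maximal runs of consecutive acceptances."""
    runs, current = [], 0
    for a in accepts:
        if a:
            current += 1
        elif current > 0:
            runs.append(current)
            current = 0
    if current > 0:
        runs.append(current)
    return float(np.mean(runs)) if runs else 0.0


def demo(seed=0, n=20_000, ratio=0.5):
    rng = np.random.default_rng(seed)
    p_old = rng.normal(size=100_000)
    p_new = partial_refresh(p_old, alpha=0.95, rng=rng)
    u_iid = rng.uniform(size=n)
    u_saw = sawtooth_uniforms(n, delta=0.05, jitter=0.01, rng=rng)
    return {
        "momentum_var": float(p_new.var()),
        "momentum_corr": float(np.corrcoef(p_old, p_new)[0, 1]),
        "accept_rate_iid": float((u_iid < ratio).mean()),
        "accept_rate_saw": float((u_saw < ratio).mean()),
        "run_iid": mean_run_length(u_iid < ratio),
        "run_saw": mean_run_length(u_saw < ratio),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.3f}")
```

In this toy setup both schemes accept about half the time, but independent uniforms give acceptance runs of about two steps while the drifting version gives runs of about twenty. That clustering is what lets a sequence of one-step updates keep moving in the same direction.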

Replacing Neal’s trick with delayed rejection

Enter Chirag’s and my intern, Gilad Turok (who came to us as an undergrad in applied math at Columbia). Over the summer, working with me and Chirag and Edward Roualdes (who was here as a visitor), he built and evaluated a system that replaces Neal’s trick (sawtooth pattern of acceptance probability) with Mira’s trick (delayed rejection). It indeed solves the multiscale problem. It exceeded our expectations in terms of efficiency—it’s about twice as fast as our delayed rejection HMC. Going one HMC step at a time, it is able to adjust its step size within what would be a single Hamiltonian trajectory. That is, we finally have something that works roughly like a typical ODE integrator in applied math.

Matt Hoffman to the rescue

But wait, that’s not all. There’s room for another one of the great MCMC researchers to weigh in. Matt Hoffman, along with Pavel Sountsov, figured out how to take Radford’s algorithm and provide automatic adaptation for it.

What Hoffman and Sountsov manage to do is run a whole lot of parallel chains, then use information in the other chains to set tuning parameters for a given chain. In that way it’s like the Goodman and Weare affine-invariant sampler that’s used in the Python package emcee. This involves estimating the metric (posterior covariance or just variance in the diagonal case) and also estimating step size, which they do through a heuristic largest-eigenvalue estimate. Among the pleasant properties of their approach is that the entire setup produces a Markov chain from the very first iteration. That means we only have to do what people call “burn in” (sorry Andrew, but notice how I say other people call it that, not that they should), not set aside some number of iterations for adaptation.

Edward Roualdes has coded up Hoffman and Sountsov’s adaptation and it appears to work with delayed rejection replacing Neal’s trick.

Next for Stan?

I’m pretty optimistic that this will wind up being more efficient than NUTS and also make things like parallel adaptation and automatic stopping a whole lot easier. It should be more efficient because it doesn’t waste work—NUTS goes forward and backward in time and then subsamples along the final doubling (usually—it’s stochastic with a strong bias toward doing that). This means we “waste” the work going the wrong way in time and beyond where we finally sample. But we still have a lot of eval to do before we can replace Stan’s longstanding sampler or even provide an alternative.

My talk

The plan’s basically to expand this blog post with details and show you some results. Hope to see you there!

In which we answer some questions about regression discontinuity designs

A researcher who wishes to remain anonymous writes:

I am writing with a question about your article with Imbens, Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs. In it, you discourage the use of high-order polynomials of the forcing variable when fitting models. I have a few questions about this:

(1) What are your thoughts about the use of restricted cubic splines (RCS) that are linear in both tails?

(2) What are your thoughts on the use of a generalized additive model with local regression (rather than with splines)?

(3) What are your thoughts on the use of loess to fit the regression models?

I wonder if the use of restricted cubic splines would be less susceptible to the difficulties that you describe given that it is linear in the tails.

My quick reply is that I wouldn’t really trust any estimate that jumps around a lot. I’ve seen too many regression discontinuity analyses that give implausible answers because the jump at the discontinuity cancels a sharp jump in the other direction in the fitted curve. When you look at the regression discontinuity analyses that work (in the sense of giving answers that make sense), the fitted curve is smooth.

The first question above is addressing the tail-wagging-the-dog issue, and that’s a concern as well. I guess I’d like to see models where the underlying curve is smooth, and if that doesn’t fit the data, then I think the solution is to restrict the range of the data where the model is fit, not to try to solve the problem by fitting a curve that gets all jiggy.

My other general advice, really more important than what I just wrote above, is to think of regression discontinuity as a special case of an observational study. You have a treatment or exposure z, an outcome y, and pre-treatment variables x. In a discontinuity design, one of the x’s is a “forcing variable,” for which z_i = 1 for cases where x_i exceeds some threshold, and z_i = 0 for cases where x_i is lower than the threshold. This is a design with known treatment assignment and zero overlap, and, yeah, you’ll definitely want to adjust for imbalance in that x-variable. My inclination would be to fit a linear model for this adjustment, but sometimes a nonlinear model will make sense, as long as you keep it smooth.
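As a minimal sketch of that setup, here is a simulation with a forcing variable x, a treatment z determined entirely by the threshold, and a linear adjustment for x. The data-generating values (threshold at 0, slope 0.8, effect 0.5) and the function names are made up for illustration.

```python
import numpy as np


def simulate_rd(n=5000, effect=0.5, seed=1):
    """Fake data: x is the forcing variable, z is 1 exactly when x exceeds 0."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    z = (x > 0.0).astype(float)  # known assignment, zero overlap in x
    y = 1.0 + 0.8 * x + effect * z + rng.normal(0.0, 1.0, size=n)
    return x, z, y


def naive_difference(z, y):
    """Raw treated-minus-control difference, with no adjustment for x."""
    return float(y[z == 1].mean() - y[z == 0].mean())


def linear_adjustment(x, z, y):
    """Least-squares fit of y on z and x; returns the coefficient on z."""
    X = np.column_stack([np.ones_like(x), z, x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[1])


if __name__ == "__main__":
    x, z, y = simulate_rd()
    print("naive difference: ", naive_difference(z, y))
    print("adjusted estimate:", linear_adjustment(x, z, y))
```

With these made-up numbers the raw difference lands near 1.3, because the treated cases all have higher x, while the adjusted estimate should be close to the assumed effect of 0.5. Other pre-treatment variables would enter as additional columns of X.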

But . . . the forcing variable is, in general, just one of your pre-treatment variables. What you have is an observational study! And you can have imbalance on other pre-treatment variables also. So my main recommendation is to adjust for other important pre-treatment variables as well.

For an example, see here, where I discuss a regression discontinuity analysis where the outcome variable was length of life remaining, and the published analysis did not include age as a predictor. You gotta adjust for age! The message is: a discontinuity analysis is an observational study. The forcing variable is important, but it’s not the only thing in town. The big mistakes seem to come from: (a) unregularized regression on the forcing variable, which can randomly give you wild jumpy curves that pollute the estimate of the discontinuity, (b) not adjusting for other important pre-treatment predictors, and (c) taking statistically significant estimates and treating them as meaningful, without looking at the model that’s been fit.
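Mistake (a) is easy to see in a fake-data simulation. The sketch below assumes a smooth linear truth with no jump at all, then compares the sampling variability of the estimated discontinuity when each side of the threshold gets a linear fit versus an unregularized fifth-degree polynomial. Sample size, noise level, and function names are illustrative assumptions.

```python
import numpy as np


def rd_estimate(x, y, degree):
    """Estimated jump at x = 0, fitting a separate polynomial on each side."""
    z = (x > 0.0).astype(float)
    powers = [x**k for k in range(1, degree + 1)]
    # Every polynomial term vanishes at x = 0, so the coefficient on z is the jump.
    cols = [np.ones_like(x), z] + powers + [z * p for p in powers]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[1])


def sampling_sd(degree, n_sims=300, n=200, seed=2):
    """Standard deviation of the estimated jump across simulated data sets."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_sims):
        x = rng.uniform(-1.0, 1.0, size=n)
        y = 0.5 * x + rng.normal(0.0, 1.0, size=n)  # smooth truth, true jump is 0
        estimates.append(rd_estimate(x, y, degree))
    return float(np.std(estimates))


if __name__ == "__main__":
    print("sd of jump estimate, linear:  ", sampling_sd(1))
    print("sd of jump estimate, degree 5:", sampling_sd(5))
```

The true jump is zero in every simulated data set, so all the spread is noise, and the high-degree fit should show considerably more of it. That extra spread is where the implausibly large "significant" discontinuities come from.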

We discuss some of this in Section 21.3 of Regression and Other Stories.

A message to Parkinson’s Disease researchers: Design a study to distinguish between these two competing explanations of the fact that the incidence of Parkinson’s is lower among smokers

After reading our recent post, “How to quit smoking, and a challenge to currently-standard individualistic theories in social science,” Gur Huberman writes:

You may be aware that the incidence of Parkinson’s (PD) is lower in the smoking population than in the general population, and that negative relation is stronger for the heavier & longer duration smokers.

The reason for that is unknown. Some neurologists conjecture that there’s something in smoked tobacco which causes some immunity from PD. Others conjecture that whatever causes PD also helps people quit or avoid smoking. For instance, a neurologist told me that Dopamine (the material whose deficit causes PD) is associated with addiction not only to smoking but also to coffee drinking.

Your blog post made me think of a study that will try to distinguish between the two explanations for the negative relation between smoking and PD. Such a study will exploit variations (e.g., in geography & time) between the incidence of smoking and that of PD.

It will take a good deal of leg work to get the relevant data, and a good deal of brain work to set up a convincing statistical design. It will also be very satisfying to see convincing results one way or the other. More than satisfying, such a study could help develop medications to treat or prevent PD.

If this project makes sense perhaps you can bring it to the attention of relevant scholars.

OK, here it is. We’ll see if anyone wants to pick this one up.

I have some skepticism about Gur’s second hypothesis, that “whatever causes PD also helps people quit or avoid smoking.” I say this only because, from my perspective, and as discussed in the above-linked post, the decision to smoke seems like much more of a social attribute than an individual decision. But, sure, I could see how there could be correlations.

In any case, it’s an interesting statistical question as well as an important issue in medicine and public health, so worth thinking about.

How to quit smoking, and a challenge to currently-standard individualistic theories in social science

Paul Campos writes:

Probably the biggest public health success in America over the past half century has been the remarkably effective long-term campaign to reduce cigarette smoking. The percentage of adults who smoke tobacco has declined from 42% in 1965 (the first year the CDC measured this), to 12.5% in 2020.

It’s difficult to disentangle the effect of various factors that have led to this stunning decline of what was once a ubiquitous habit — note that if we exclude people who report having no more than one or two drinks per year, the current percentage of alcohol drinkers in the USA is about the same as the percentage of smokers 60 years ago — but the most commonly cited include:

Anti-smoking educational campaigns

Making it difficult to smoke in public and many private spaces

Increasing prices

Improved smoking cessation treatments, and laws requiring the cost of these to be covered by medical insurance

I would add another factor, which is more broadly cultural than narrowly legal or economic: smoking has become declasse.

This is evident if you look at the relationship between smoking rates and education and income: While 32% of people with a GED smoke, the percentages for holders of four year college degrees and graduate degrees are 5.6% and 3.5% respectively. And while 20.2% of people with household incomes under $35,000 smoke, 6.2% of people with household incomes over $100,000 do.

All worth noting. Anti-smoking efforts are a big success story, such a big story that it’s almost easy to forget.

The sharp decline in smoking is a big “stylized fact,” as we say in social science, comparable to other biggies such as the change in acceptance of gay people in the past few decades, and the also-surprising lack of change in attitudes toward abortion.

When we have a big stylized fact like this, we should milk it for as much understanding as we can.

With that in mind, I have a few things to add on the topic:

1. Speaking of stunning, check out these Gallup poll results on rates of drinking alcohol:

At least in the U.S., rich people are much more likely than poor people to drink. That’s the opposite of the pattern with smoking.

2. Speaking of “at least in the U.S.”, it’s my impression that smoking rates have rapidly declined in many other countries too, so in that sense it’s more of a global public health success.

3. Back to the point that we should recognize how stunning this all is: 20 years ago, they banned smoking in bars and restaurants in New York. All at once, everything changed, and you could go to a club and not come home with your clothes smelling like smoke, pregnant women could go places without worrying about breathing it all in, etc. When this policy was proposed and then when it was clear it was really gonna happen, lots of lobbyists and professional contrarians and Debby Downers and free-market fanatics popped up and shouted that the smoking ban would never work, it would be an economic disaster, the worst of the nanny state, bla bla bla. Actually it worked just fine.

4. It’s said that quitting smoking is really hard. Smoking-cessation programs have notoriously low success rates. But some of that is selection bias, no? Some people can quit smoking without much problem, and those people don’t need to try smoking-cessation programs. So the people who do try those programs are a subset that overrepresents people who can’t so easily break the habit.
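Here is a toy simulation of that selection story. Everything in it is an assumption for illustration: each smoker gets a per-attempt quit probability drawn from a Beta(1, 3) distribution, everyone tries three times on their own, only those who fail enroll in a program, and the program by construction neither helps nor hurts.

```python
import numpy as np


def simulate_cessation(n=100_000, solo_attempts=3, seed=3):
    """Return (average quit probability in the population, program success rate)."""
    rng = np.random.default_rng(seed)
    # Assumption: per-attempt quit probability varies by person, mean 0.25.
    q = rng.beta(1.0, 3.0, size=n)
    # Each person first makes a few attempts alone.
    solo = rng.uniform(size=(n, solo_attempts)) < q[:, None]
    quit_alone = solo.any(axis=1)
    # Only people who failed on their own enroll in the program.
    enrolled = ~quit_alone
    # The program attempt succeeds with the same probability q as any other attempt.
    program_success = rng.uniform(size=n) < q
    return float(q.mean()), float(program_success[enrolled].mean())


if __name__ == "__main__":
    population_rate, program_rate = simulate_cessation()
    print(f"average per-attempt quit probability: {population_rate:.3f}")
    print(f"observed program success rate:        {program_rate:.3f}")
```

Under these assumptions the program's observed success rate comes out well below the population's average per-attempt rate, purely because the easy quitters never show up. How big the gap is in real data depends on how much quit probabilities actually vary across people, which this sketch does not tell us.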

5. We’re used to hearing the argument that, yeah, everybody knows cigarette smoking causes cancer, but people might want to do it anyway. There’s gotta be some truth to that: smoking relaxes people, or something like that. But also recall what the cigarette executives said, as recounted by historian Robert Proctor:

Philip Morris Vice President George Weissman in March 1954 announced that his company would “stop business tomorrow” if “we had any thought or knowledge that in any way we were selling a product harmful to consumers.” James C. Bowling . . . . Philip Morris VP, in a 1972 interview asserted, “If our product is harmful . . . we’ll stop making it.” Then again in 1997 the same company’s CEO and chairman, Geoffrey Bible, was asked (under oath) what he would do with his company if cigarettes were ever established as a cause of cancer. Bible gave this answer: “I’d probably . . . shut it down instantly to get a better hold on things.” . . . Lorillard’s president, Curtis Judge, is quoted in company documents: “if it were proven that cigarette smoking caused cancer, cigarettes should not be marketed” . . . R. J. Reynolds president, Gerald H. Long, in a 1986 interview asserted that if he ever “saw or thought there were any evidence whatsoever that conclusively proved that, in some way, tobacco was harmful to people, and I believed it in my heart and my soul, then I would get out of the business.”

6. A few years ago we discussed a study of the effects of smoking bans. My thought at the time was: Yes, at the individual level it’s hard to quit smoking, which might give skepticism about the effects of measures designed to reduce smoking—but, at the same time, smoking rates vary a lot by country and by state. This was similar to our argument about the hot hand: given that basketball shooting success rates vary a lot over time and across game conditions, it should not be surprising that previous shots might have an effect. As I wrote a while ago, “if ‘p’ varies among players, and ‘p’ varies over the time scale of years or months for individual players, why shouldn’t ‘p’ vary over shorter time scales too? In what sense is ‘constant probability’ a sensible null model at all?” Similarly, given how much smoking rates vary, maybe we shouldn’t be surprised that something could be done about it.

7. To me, though, the most interesting thing about the stylized facts on smoking is how there is this behavior that is so hard to change at the individual level but can be changed so much at the national level. This runs counter to currently-standard individualistic theories in social science in which everything is about isolated decisions. It’s more of a synthesis: change came from policy and from culture (whatever that means), but this still had to work its way through individual decisions. This idea of behavior being changed by policy almost sounds like “embodied cognition” or “nudge,” but it feels different to me in being more brute force. Embodied cognition is things like giving people subliminal signals; nudge is things like subtly changing the framing of a message. Here we’re talking about direct education, taxes, bans, big fat warning labels: nothing subtle or clever that the nudgelords would refer to as a “masterpiece.”

Anyway, this idea of changes that can happen more easily at the group or population level than at the individual level, that’s interesting to me. I guess things like this happen all over—“social trends”—and I don’t feel our usual social-science models handle them well. I don’t mean that no models work here, and I’m sure that lots of social scientists have done serious work in this area; it just doesn’t seem to quite line up with the usual way we talk about decision making.

P.S. Separate from all the above, I just wanted to remind you that there’s lots of really bad work on smoking and its effects; see here, for example. I’m not saying that all the work is bad, just that I’ve seen some really bad stuff, maybe no surprise what with all the shills on one side and all the activists on the other.

The Freaky Friday that never happened

Speaking of teaching . . . I wanted to share this story of something that happened today.

I was all fired up with energy, having just taught my Communicating Data and Statistics class, taking notes off the blackboard to remember what we’d been talking about so I could write about it later, and students were walking in for the next class. I asked them what it was, and they said Shakespeare. How wonderful to take a class on Shakespeare at Columbia University, I said. The students agreed. They love their teacher—he’s great.

This gave me an idea . . . maybe this instructor and I could switch classes some day, a sort of academic Freaky Friday. He could show up at 8:30 and teach my statistics students about Shakespeare’s modes of communication (with his contemporaries and with later generations including us, and also how Shakespeare made use of earlier materials), and I could come at 10am to teach his students how we communicate using numbers and graphs. Lots of fun all around, no? I’d love to hear the Shakespeare dude talk to a new audience, and I think my interactions with his group would be interesting too.

I waited in the classroom for a while so I could ask the instructor when he came into the room, during the shuffling-around period before class officially starts at 10:10. Then 10:10 came and I stood outside to wait as the students continued to trickle in. A couple minutes later I saw a guy approaching, about my age. I ask if he teaches the Shakespeare class. Yes, he does. I introduce myself: I teach the class right before, on communicating data and statistics, maybe we could do a switch one day, could be fun? He says no, I don’t think so, and goes into the classroom.

That’s all fine, he has no obligation to do such a thing, also I came at him unexpectedly at a time when he was already in a hurry, coming to class late (I came to class late this morning too. Mondays!). His No response was completely reasonable.

Still . . . it was a lost opportunity! I’ll have to brainstorm with people about other ways to get this sort of interdisciplinary opportunity on campus. We could just have an interdisciplinary lecture series (Communication of Shakespeare, Communication in Statistics, Communication in Computer Science, Communication in Medicine, Communication in Visual Art, etc.), but it would be a bit of work to set up such a thing, also I’m guessing it wouldn’t reach so many people. I like the idea of doing it using existing classes, because (a) then the audience is already there, and (b) it would take close to zero additional effort: you’re teaching your class somewhere else, but then someone else is teaching your class so you get a break that day. And all the students are exposed to something new. Win-win.

The closest thing I can think of here is an interdisciplinary course I organized many years ago on quantitative social science, for our QMSS graduate program. The course had 3 weeks each of history, political science, economics, sociology, and psychology. It was not a statistics course or a methods course; rather, each segment discussed some set of quantitative ideas in the field. The course was wonderful, and Jeronimo Cortina and I turned it into a book, A Quantitative Tour of the Social Sciences, which I really like. I think the course went well, but I don’t think QMSS offers it anymore; I’m guessing it was just too difficult to organize a course with instructors from five different departments.

P.S. I read Freaky Friday and its sequel, A Billion for Boris, when I was a kid. Just noticed them on the library shelves. The library wasn’t so big; I must have read half the books in the children’s section at the time. Lots of fond memories.

“Creating Community in a Data Science Classroom”

David Kane sends along this article he just wrote with the above title and the following abstract:

A community is a collection of people who know and care about each other. The vast majority of college courses are not communities. This is especially true of statistics and data science courses, both because our classes are larger and because we are more likely to lecture. However, it is possible to create a community in your classroom. This article offers an idiosyncratic set of practices for creating community. I have used these techniques successfully in first and second semester statistics courses with enrollments ranging from 40 to 120. The key steps are knowing names, cold calling, classroom seating, a shallow learning curve, Study Halls, Recitations and rotating-one-on-one final project presentations.

There’s some overlap with the ideas in chapters 1 and 2 of my forthcoming book with Aki, “Active Statistics: Stories, Games, Problems, and Hands-on Demonstrations for Applied Regression and Causal Inference,” but Kane has a bunch of ideas that are different from (but complementary to) ours. I recommend our book (along with Rohan Alexander’s Telling Stories with Data, Elena Llaudet and Kosuke Imai’s Data Analysis for Social Science, and our own Regression and Other Stories); I also recommend Kane’s article. Kane, like Aki and me in our new book, focuses on ways to get students actively involved in class. I hadn’t thought about this in terms of “community,” but that seems like a good way to frame it.

I’m sure there’s lots more out there; the above list of resources is focused on modern introductions to applied statistics and active ways to learn and teach this material.

“Are there clear examples of the opposite idea, where four visually similar visualizations can have vastly different numerical stats?”

Geoffrey Yip writes:

You’ve written before on how numerical stats can mislead people. There are great visuals for this idea through Causal Quartets or Anscombe’s Quartet. Are there clear examples of the opposite idea, where four visually similar visualizations can have vastly different numerical stats?

My reply: Sure, graphs can be horrible and convey no information or even actively mislead. Tables can be hard to read, but I guess that, without actually faking the numbers, it would be hard to make a table that’s as misleading as some very bad graphs. But I think that a good graph should always be better than the corresponding table; see this article from 2002 for exploration of this point.
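For readers who haven't seen the quartet that Yip mentions, here's a minimal sketch of the original direction of the idea (same numerical summaries, very different pictures). The data are the standard published Anscombe values as I recall them; the helper function `summary` is just something written for this illustration, not taken from any package.

```python
from math import sqrt

# Anscombe's quartet: four (x, y) datasets with near-identical summaries.
X_COMMON = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X_COMMON,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_COMMON,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_COMMON,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summary(x, y):
    """Means, Pearson correlation, and least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "corr": sxy / sqrt(sxx * syy),
        "slope": sxy / sxx,
    }


if __name__ == "__main__":
    for name, (x, y) in ANSCOMBE.items():
        rounded = {k: round(v, 2) for k, v in summary(x, y).items()}
        print(name, rounded)
```

The printed rows agree to about two decimal places, and you'd only see how different the four datasets are by plotting them. Yip's question runs the other way, and this sketch doesn't answer it; it just shows the baseline he's asking to invert.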

Harvard law prof sez: “I believe that if [universities] are going to accept blood money . . . they should only ever accept that money anonymously.”

1. The action

Recently there has been some controversy about Supreme Court judge Clarence Thomas, who appears to have broken the law by not reporting a series of gifts given to him and his family by a rich donor. A judge breaking the law—that’s not cool!

2. The defense

Oddly enough—or, at least oddly at first glance—one of Thomas’s defenders in this affair is law professor and free-software campaigner Larry Lessig.

Lessig’s involvement seems odd at first because in 2011 he published a book entitled, “Republic, Lost: How Money Corrupts Congress — and a Plan to Stop It.” You’d think that if Lessig believes that money corrupts Congress, he’d think that it would corrupt the court system.

Actually, though, no! Check out this recent post [updated link here] from Lessig. On one hand, I think it’s kinda cool that a bigshot like Lessig is spending his time blogging—he’s just like me! No regular NYT or Fox News gig or whatever so he puts his thoughts out there in the ether, for all to read!—; on the other hand, I’m not so happy with what Lessig has to say:

Yet most of the attacks on Justice Thomas go beyond the failure to report some of his gifts. Most are attacking (1) that he took these or any gifts at all, or (2) that he took them from someone on the Right, or (3) that he took them from someone who is wealthy.

Point (3) seems odd . . . Point (2) is odd as well . . . Which leaves point (1), the point about this reporting that is most odd to me: One might well believe that Justices should not take gifts like this at all — that they should never visit others, or stay with others, that their vacations should be on their own dime. . . .

Whaaaa? So the two alternatives for a federal judge are: (a) take zillions of dollars in undeclared gifts from a political donor, or (b) “never visit others, or stay with others”? That’s just weird. I don’t think anything in the law prohibits judges from visiting people!

I guess that’s something they teach in law school, to introduce ridiculous slippery-slope arguments?

Lessig continues:

But if that’s your view, then fair reporting would ask whether Thomas’ behavior is unique or exclusive to him.

This is absolutely nuts! First off, when he says, “if that’s your view,” “that” refers to the position that Supreme Court judges “should never visit others, or stay with others, that their vacations should be on their own dime. . . .” There’s nobody who has the view that judges should never visit or stay with others, so he’s making a conditional argument (if A then B) for which A is the empty set.

The second weird part of his argument is, if you break the law, it’s not much of a defense to say that you’re not the only person to have broken it. Typically they write laws for offenses that happen more than once!

The third weird part is that he speculates that other judges have done the same thing, but offers no evidence of that happening.

Lessig continues:

Washington is filled with techniques for the modestly paid to live life as if they were rich. . . . Justices on the Supreme Court are paid at the level of mid-tier lawyers at big New York law firms. And when you’re paid less than you think you’re worth, you find ways to justify what most would think corrupt.

I don’t really get why a judge on the Supreme Court should think he’s worth more, financially, than a mid-tier lawyer at a big New York law firm. Mid-tier lawyers have a lot of responsibility, no? They can make or lose a lot of money for their firms, and they get paid a lot. If Thomas or any other federal judge thought he could get more money in this way, he’s free to quit his job and join a law firm, no?

He concludes with a call for change:

If we’re to address these corruptions for real . . . we need to stop pretending that the problem is individual. It is not. It is institutional. The corruptions that are destroying our government are woven into the systems of our government. It is these systems that must change if we’re to have institutions we can trust.

I kinda see where he’s coming from, but the first step toward reform would be to enforce existing laws, no? If your goal is reform, it seems like a step backward to minimize existing violations.

The other odd thing here is that Lessig leads his post with, “Justice Thomas’s interpretation of the reporting requirements applicable to him is wrong. The reading offered by Dahlia Lithwick and Mark Joseph Stern is correct.” The title of Lithwick and Stern’s article is, “Clarence Thomas Broke the Law and It Isn’t Even Close.” But Lessig didn’t say that “Thomas broke the law,” he just said “Thomas’s interpretation of the reporting requirements applicable to him is wrong.” By framing it this way, Lessig took a clear statement of lawbreaking and turned it into a fuzzy-sounding legalism. I guess that’s what bigtime law professors do.

3. The reaction

Paul Campos, a law professor at the University of Colorado, was not happy with Lessig’s post. Campos shared some statistics:

Clarence Thomas has a salary of $285K per year, which puts him in the 98th percentile of individual income in the USA. This seriously understates his true economic position, however, because

(a) He’s guaranteed this salary, COLA adjusted, for life, even if he were to quit tomorrow; and

(b) He can make as much money as he wants publishing books — or “books” — via whatever wingnut welfare publishing outfit wants to help a brother out; and

(c) His wife gets oodles of money from those same sources for her work, or “work”; and

(d) Larry Lessig makes about $600K per year as a senior HLS professor for teaching two classes a year.

I kinda wonder how Campos knows Lessig’s salary and teaching load. I guess the world of law school professors is small, and everyone knows these things, more or less?

Anyway, googling *lawrence lesssig salary* led me to this 2018 post, “Poker the Bear: The Sad Unraveling of Lawrence Lessig.” I have no idea whassup with all that but who knows.

Also this interview with Lessig on academic corruption:

Lessig: They receive money from interested parties to participate in the policymaking process. . . . in fields like economics and law, basically the soft sciences, the temptation to bend or to shade is always there. And if the return from bending or shaving is high, then obviously we have to worry that there’s that kind of a distortion going on.

Interviewer: Is there evidence that a payment’s changed the testimony or research focus or of people who are being paid?

Lessig: Well we know first that it changes the perception of the integrity . . . the perception alone is enough. . . . there’s this incentive to focus on the question in a way that keeps the supplier of the data happy, which again is the kind of dynamic that we would worry about if we’re worried about the actual research being bent in a way that’s not reproductive of the truth. . . .

Lessig seems so passionate about fighting corruption. It seems almost unbelievable that he’d go out and defend a federal judge who’s taken all these freebies.

4. The resolution of the paradox

Again, the paradox is that (a) Lessig hates corruption, he even hates the perception of corruption, especially in “fields like economics and law,” but (b) when a federal judge has been taking tons of money and free trips from a rich political donor, Lessig doesn’t think it’s such a big deal.

I was legitimately puzzled here. In his above-linked post, Campos suggests that Lessig is motivated by some sort of generalized honor-among-plutocrats principle, and that could be: Lessig lives in a rarefied world in which being “paid at the level of mid-tier lawyers at big New York law firms” is considered to be a mark of poverty rather than of comfortable wealth. But still.

Then I came across this news article from 2020, “Lawrence Lessig sues New York Times over MIT and Jeffrey Epstein interview.”

This story is relevant because Lessig was arguing that if you’re going to take money from crooks, you should do it in secret. Here are two quotes from Lessig, reported in that article:

Joi Ito was the academic entrepreneur and friend of Lessig who’s famous as the link between Jeffrey Epstein and MIT. (Incidentally, as an MIT grad, let me just say that the Institute has gone downmarket when it comes to scandals. Back when I was a student, the big debate was whether to accept military funding: at the time, we were all upset about Central America and southern Africa, but in retrospect maybe we should’ve been more bothered by what our military was doing in Afghanistan and Iraq. Now the controversy is about professors hanging out on a private island with sleazy millionaires.)

Anyway, Lessig’s statements leapt out at me: he doesn’t seem to have much of a problem with corruption at all! His problem is with open corruption. Just take your “blood money” (his term) but don’t let anyone know about it. He literally said, “Were I king, I would ban non-anonymous gifts of type 3 [“money from people convicted of a crime”] or type 4 [“blood money”].”

What the hell? If he were king, he presumably would have the ability to ban all blood money, right? But, no, he wouldn’t go that far! He’d only ban “non-anonymous” blood money. All right, then.

I don’t really understand Lessig’s position, but I guess he’s consistent. Payoffs are ok but only if they’re secret. This also is consistent with him being so bothered by perceptions. If the payoff is secret, there’s no perception. No perception, no problem.

The other thing is that I noticed a certain paradox-loving style of writing, the sort of thing that your English teacher in high school or political science teacher in college will really appreciate. I’m thinking of the bit where he says:

Everyone seems to treat it . . . I see it as exactly the opposite . . . rather than repeating unreflective paeans . . .

Geddit? He’s a subtle thinker, not like everybody else.

And now I see how he got this reputation for brilliance. “Paeans”! I don’t even know how to pronounce that word. Dude must have got a really high SAT score, also really good grades in law school. One of my pet peeves of this sort of ethical analysis is people making these clever good-is-bad-and-bad-is-good arguments. Any schlub can say that it’s unethical for a federal judge to take big gifts from political donors, or that a university should avoid taking money from criminals. It takes a very special legal thinker to say that the real problem here is that the donations are not going on in secret.

In all seriousness, sometimes it seems that the main qualification for becoming an elite law professor is to be the kind of person who could get perfect SAT scores and write snappy high school essays and then just stay in that position for the rest of your life. I can only assume that those “mid-tier lawyers at big New York law firms” have more on the ball than this.

5. What does this have to do with statistical modeling, causal inference, and social science?

The connection is political science, power and influence, and the role of institutions in society in supporting or opposing various forms of political influence. Lessig’s just one guy and this post does not represent any sort of attempt at a systematic study. Indeed, it’s safe to say that Lessig is not representative of left-wing law professors at elite universities, most of whom I expect would not defend a federal judge for accepting large undeclared donations.

In our paper on stories as evidence, Basbøll and I talked about the way that good stories are anomalous and immutable, and the anomalousness would seem to contradict usual statistical principles. In short: a good story is “man bites dog,” but good statistical evidence comes from representative data of the “dog bites man” variety.

We concluded that stories are useful for learning because they represent model checks: a good story is something that really happened and whose facts can be checked (“immutable”) and that is surprising (“anomalous”); that is, it represents a stubborn fact that does not align with our usual view of the world.

The Lessig story is immutable (lots of evidence that he is much more bothered by political influence being public than by the donations themselves, to the extent that he defends the judge in part, it seems, because Thomas did not disclose all that he received from that donor) and it’s anomalous (a political liberal who is opposed to money corrupting politics, but then is not so upset about a prominent instance of money that was potentially corrupting political decisions).

Following our principles, the anomaly is interesting: the surprise is relative to our implicit model that an anti-corruption campaigner would be disturbed by a public case of undisclosed political donations. Indeed, my expectation, without having any clear background on Lessig, would’ve been that he’d jump all over Thomas and use this as another example of why money in politics is a problem. But Lessig didn’t do that; hence the puzzle, which makes it a good story and also makes us realize that more must be going on.

The question is, why did Lessig take this paradoxical position of defending the gift-taking politician? As Imbens and I explained, we can’t in general answer Why questions, but they can motivate further study.

Regarding Lessig, we have a few theories, which I don’t see as mutually exclusive:

1. Celebrities stick together. Perhaps Lessig has sympathy for Thomas because they’re both wealthy celebrity lawyers. That perhaps is more important than their political differences.

2. Law professors have a loyalty to the system. There’s this whole Supreme Court mystique, starting with the idea that the judges on that particular court get to be called “justices” (or even “Justices”). Lessig was once a clerk on that court. The institution just means too much to him.

3. General rich-guy cluelessness combined with Harvard elitism. Perhaps Lessig genuinely thinks of $285,000/year, guaranteed COLA-adjusted for life, as a low salary. And perhaps he thinks it’s beneath someone’s dignity to be paid as much as “mid-tier lawyers at big New York law firms.”

4. Fundraising. Lessig ran for president and he worked with some political foundations. It seems he got some big donations from some rich people on these projects. He was also friends with Joi Ito who got money from at least one famous rich person and I guess others too. The point is that, from Lessig’s perspective, money from rich people is really important, but he realizes that it bothers lots of people, which is why he’d prefer the money to be anonymous. It would be hard for Lessig to take a hard stance against taking money from rich people, given that he’s done so much of this himself. This leads him into tricky spots in his role as a campaigner against money in politics, and it doesn’t help that he can come up with various ridiculous too-clever-by-half ethical arguments.

Of all these reasons, the bit about the fundraising is the most interesting. I say this because I get a good salary (not in the Lessig range, apparently, but enough to live comfortably and focus my time on research, teaching, and service), and the sources are indirect. You could say that Columbia launders its money for me, in the sense that I don’t really have to worry about where the $ originally come from. Similarly, when I get federal grants and corporate support, I don’t think too hard about whatever bad things the government and the companies might be doing. So, yeah, it’s easy enough for me to get indignant about unreported political donations—I’m not raising money from individuals.

To put it another way: I’ve eaten a lot of meat in my life but I’ve never been to the abattoir. So who am I to criticize Lessig for being cool with anonymously-donated “blood money”? I don’t look at the ultimate sources of my own funds.

I can’t say I agree with Lessig—I still hold the commonsensical (to me) view that transparency is better, that Thomas should’ve followed the law and disclosed the donations, that secrecy is not a “saving grace” to taking money from evil people, etc.—but I can kind of almost see where he’s coming from.

Again, the social-science interest here is partly the direct importance of the influence of money in politics and partly understanding how an anti-corruption campaigner can, paradoxically, come out against transparency and in favor of secret donations.

P.S. One of the benefits of blogging is you get comments! Lessig’s post has 9 comments, almost all of which strongly disagree with him. Maybe this will motivate him to reconsider his views.

P.P.S. In the time between when I wrote this post and when it is appearing online, I came across this post from Lessig defending embattled dishonesty researcher Francesca Gino.

Jeffrey Epstein, Clarence Thomas, Francesca Gino . . . quite the collection he’s got there! Paradox-boy is outta control.

In all seriousness, I wonder if one of the problems here is that, in elite law academia, you win by coming up with creative arguments that no one has come up with before. The bad news is that most of the reasonable arguments have already been taken, so to be truly original you need to color outside the lines. Any idiot of a journalist can make the obvious arguments that Clarence Thomas broke the law and that Harvard and MIT shouldn’t be taking anonymous money from the likes of Jeffrey Epstein. It takes a bigshot Harvard Law professor to take the opposite side of these cases. I fear that the system incentivizes the Lessigs of the world to make ridiculous claims. Of course, lots of Harvard law professors aren’t going around defending the recipients of “blood money”—but those professors aren’t getting publicity either. I’m not writing about them, right? If you want the fame along with that $600K salary, it helps to be ridiculous.

“Evidence-based medicine”: does it lead to people turning off their brains?

Joshua Brooks points us to this post by David Gorski, “The Cochrane mask fiasco: Does EBM predispose to COVID contrarianism?” EBM stands for “evidence-based medicine,” and here’s what Gorski writes:

A week and a half ago, the New York Times published an Opinion piece by Zeynep Tufekci entitled Here’s Why the Science Is Clear That Masks Work. Written in response to a recent Cochrane review, Physical interventions to interrupt or reduce the spread of respiratory viruses, that had over the last month been widely promoted by antimask and antivaccine sources, the article discusses the problems with the review and its lead author Tom Jefferson, as well as why it is not nearly as straightforward as one might assume to measure mask efficacy in the middle of a pandemic due to a novel respiratory virus. Over the month since the review’s publication, its many problems and deficiencies (as well as how it has been unrelentingly misinterpreted) have been discussed widely by a number of writers, academics, and bloggers . . .

My [Gorski’s] purpose in writing about this kerfuffle is not to rehash (much) why the Cochrane review was so problematic. Rather, it’s more to look at what this whole kerfuffle tells us about the Cochrane Collaborative and the evidence-based medicine (EBM) paradigm it champions. . . . I want to ask: What is it about Cochrane and EBM fundamentalists who promote the EBM paradigm as the be-all and end-all of medical evidence, even for questions for which it is ill-suited, that can produce misleading results? . . .

Back in the day, we used to call EBM’s failure to consider the low to nonexistent prior probability as assessed by basic science that magic like homeopathy could work its “blind spot.” Jefferson’s review, coupled with the behavior of EBM gurus like John Ioannidis during the pandemic, made me wonder if there’s another blind spot of EBM that we at SBM have neglected, one that leads to Cochrane reviews like Jefferson’s and leads EBM gurus like Ioannidis to make their heel turns so soon after the pandemic hit . . .

[Regarding the mask report,] perusing the triumphant gloating on social media from ideological sources opposed to COVID-19 interventions, including masks and vaccines, I was struck by how often they used the exact phrase “gold standard” to portray Cochrane as an indisputable source, all to bolster their misrepresentation. . . .

Gorski continues:

I’ve noticed over the last three years a tendency for scientists who were known primarily before the pandemic as strong advocates of evidence-based medicine (EBM), devolving into promoters of COVID-19 denial, antimask, anti-public health, and even antivaccine pseudoscience. Think Dr. John Ioannidis, whom I used to lionize before 2020. Think Dr. Vinay Prasad, of whose work on medical reversals and calls for more rigorous randomized clinical trials of chemotherapy and targeted therapy agents before FDA approval we generally wrote approvingly.

Basically, what Jefferson exhibited in his almost off-the-cuff claim that massive RCTs of masks should have been done while a deadly respiratory virus was flooding UK hospitals was something we like to call “methodolatry,” or the obscene worship of the RCT as the only method of clinical investigation. . . .

But it’s not so simple:

Human trials are messy. It is impossible to make them rigorous in ways that are comparable to laboratory experiments. Compared to laboratory investigations, clinical trials are necessarily less powered and more prone to numerous other sources of error: biases, whether conscious or not, causing or resulting from non-comparable experimental and control groups, cuing of subjects, post-hoc analyses, multiple testing artifacts, unrecognized confounding of data due to subjects’ own motivations, non-publication of results, inappropriate statistical analyses, conclusions that don’t follow from the data, inappropriate pooling of non-significant data from several, small studies to produce an aggregate that appears statistically significant, fraud, and more.

Evidence-based medicine eats itself

For some background on the controversies surrounding “evidence-based medicine,” see this news article from Aaron Carroll from 2017.

Here’s how I summarized things back in 2020, my post entitled “Evidence-based medicine eats itself”:

There are three commonly stated principles of evidence-based research:

1. Reliance when possible on statistically significant results from randomized trials;

2. Balancing of costs, benefits, and uncertainties in decision making;

3. Treatments targeted to individuals or subsets of the population.

Unfortunately and paradoxically, the use of statistics for hypothesis testing can get in the way of the movement toward an evidence-based framework for policy analysis. This claim may come as a surprise, given that one of the meanings of evidence-based analysis is hypothesis testing based on randomized trials. The problem is that principle (1) above is in some conflict with principles (2) and (3).

The conflict with (2) is that statistical significance or non-significance is typically used at all levels to replace uncertainty with certainty—indeed, researchers are encouraged to do this and it is standard practice.

The conflict with (3) is that estimating effects for individuals or population subsets is difficult. A quick calculation finds that it takes 16 times the sample size to estimate an interaction as it does to estimate a main effect, and given that we are lucky if our studies are powered well enough to estimate main effects of interest, it will typically be hopeless to try to obtain near-certainty regarding interactions. That is fine if we remember principle (2), but not so fine if our experiences with classical statistics have trained us to demand statistical significance as a prerequisite for publication and decision making.
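One way to reconstruct the factor of 16, as a rough sketch: in a balanced design, the interaction (a difference of differences, each computed on half the data) has twice the standard error of the main effect, which costs a factor of 4 in sample size. If we further assume, purely for illustration, that the interaction is half the size of the main effect, then matching the signal-to-noise ratio costs another factor of 4. The function names and the effect-size ratio below are hypothetical choices, not part of the original calculation.

```python
import math

def se_main(n, sigma=1.0):
    """Standard error of a difference between two groups of n/2 each."""
    return sigma * math.sqrt(1 / (n / 2) + 1 / (n / 2))

def se_interaction(n, sigma=1.0):
    """Standard error of a difference of differences over four cells of n/4 each."""
    return sigma * math.sqrt(4 / (n / 4))

n = 1000
# The interaction estimate is twice as noisy as the main-effect estimate.
se_ratio = se_interaction(n) / se_main(n)
# ASSUMPTION: the main effect is twice as large as the interaction.
effect_ratio = 2.0
# Sample size scales with the square of the required reduction in standard error.
sample_size_factor = (se_ratio * effect_ratio) ** 2
print(se_ratio, sample_size_factor)
```

Under a different assumed effect-size ratio the factor changes accordingly; with an interaction as large as the main effect, it would be 4 rather than 16.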

Bridges needed

The above-linked Gorski post was interesting to me because it presents a completely different criticism of the evidence-based-medicine paradigm.

It’s not that controlled trials are bad; rather, the deeper problems seem to be: (a) inferential summaries and decision strategies that don’t respect uncertainty (that was my concern) and (b) research agendas that don’t engage with scientific understanding (that was Gorski’s concern).

Regarding that latter point: a problem with standard “evidence-based medicine” or what I’ve called the “take a pill, push a button model of science” is not that it ignores scientific theories, but rather that it features a gap between theory and evidence. On one side there are theory-stories of varying levels of plausibility; on the other side there are statistical summaries from (necessarily) imperfect studies.

What we need are bridges between theory and evidence. This includes sharper theories that make quantitative predictions that can be experimentally studied, and empirical studies measuring intermediate outcomes, and lab experiments to go along with the field studies.

Omid Malekan on why crypto is not a scam

Omid Malekan, who teaches crypto finance at Columbia business school, writes:

I came across your blog post and obviously disagree, I work professionally in crypto, have been following it for close to a decade, written a few books, advise various Columbia student groups, etc.

Instead of trying to refute your claims about crypto being a scam, I thought it would be more fun to demonstrate its value. So I took the text of your blog post and embedded it inside the Bitcoin blockchain. You can see it here. I included a hash of the text as well for easier verification.

What does this mean?

It means this text now sits inside an append-only ledger, copies of which are kept by thousands of distributed computers all over the world. Unlike your original blog post, this data cannot be changed or deleted, because unlike the server that hosts your blog, it now lives inside one of the most secure networks on earth. It would cost someone tens of millions of dollars to even try to change it, and they’d likely fail. This ability to secure and preserve the integrity of information–be it blog posts or transaction data in a payment system–is a unique value prop of Bitcoin. No other digital service provides anything close to it.

But wait, that’s not a financial application!

True, but this “proof of existence” feature would not be possible without Bitcoin also having a native currency. Before the publication of the original Bitcoin paper in 2008, it was believed that securing an open and permissionless distributed system (where participants are free to come and go) was impossible, despite decades of work in the sophisticated field of distributed systems. Nakamoto Consensus was a novel solution that introduced financial incentives to help secure such a system, incentives that need to be paid in the platform’s own currency. I had to pay the miner that processed this inscription around $8 for them to do it.

The ability for a distributed group of participants all over the world to reach consensus on anything (like the existence of your blog post) in an open, transparent and permissionless way is a pretty big deal, one that has major implications for the academy. I happen to run my own Bitcoin node (using about $300 worth of hardware). That means I have my own copy of all the data inside the ledger and get the latest updates as they happen. We are in the process of setting up Bitcoin and Ethereum nodes at the B school to aid professors and students interested in research. Unlike traditional finance, where all the data is hoarded by private entities that charge an arm and a leg for access (at best) or just hide the important details (at worst), in crypto everyone gets to see everything.

And of course, as you know all too well, payment systems are nothing more than a series of records (of debits and credits). And while people paying each other in various cryptocurrencies remains the primary use, the open, decentralized and censorship-resistant architecture has since been expanded to countless other activities, from the movement of dollars (recently embraced by PayPal) to the issuance of consumer rewards (recently embraced by Lufthansa) to digital art (recently embraced by both Christie’s and Sotheby’s).
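The hash-based verification mentioned in the note can be sketched roughly as follows. This is only an illustration: the note does not say which hash function was used, so SHA-256 is an assumption here, and the function names are hypothetical. The code checks that a piece of text matches a published digest; it does not check that the digest actually appears on any blockchain.

```python
import hashlib

def text_digest(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoding of the text (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def matches_published_digest(text: str, published_digest: str) -> bool:
    """True if the text hashes to the digest someone published alongside it."""
    return text_digest(text) == published_digest.strip().lower()

post = "Crypto scam social science thoughts"   # stand-in for the full post text
digest = text_digest(post)

print(matches_published_digest(post, digest))          # unchanged text
print(matches_published_digest(post + " ", digest))    # any edit, even one space, changes the hash
```

The point of the design is that anyone holding the text can recompute the digest independently, so only the short digest needs to be stored in the ledger for the text to be verifiable later.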

I’m skeptical about the digital art thing; then again, I’m skeptical about bogus social science on beauty and sex ratios, ovulation and voting, himmicanes, ages ending in 9, etc.—and lots of people who do these things have stable jobs and I guess will continue doing so. It’s a big world out there.

I don’t have any other comments on the specifics of Malekan’s note; I just wanted to share it with you, as he’s offering a different perspective than mine. Paying $8 to save one of my blog posts doesn’t sound quite scalable to me, given how much is out there on the internet; on the other hand, the foolish claim has been made by a celebrity scientist that each scientific citation is worth $100,000. I don’t know what all this means, except that valuations of intangibles can be controversial. In any case, I appreciate Malekan giving us the above background.

Regarding the “scam” thing, I received the following email from Gur Huberman, also at Columbia business school. Gur asked:

1. Do you need a scammer to initiate or support a scam? Would you call Satoshi a scammer? Ditto for Vitalik Buterin (Ethereum)?
2. Is gold a scam?
3. What exactly assigns value to money as we know it?

I replied that a scam builds off some existing institution. Ponzi’s scam was based on certain postage products. The stamps were real. Theranos was based on a real institution of blood testing. Etc. “Crypto” can be a scam even while based on real computer programs. For that matter, gold is not a scam—I have some in my teeth!—but the people who try to sell you gold on TV are scammers. In my post on Dan Davies’s book, I characterized frauds as being linear or exponential.

The point of my earlier post

My post that started this particular discussion was called “Crypto scam social science thoughts: The role of the elite news media and academia.” What was and is most interesting to me, and the point of the post, was the role of media elites in keeping bubbles afloat. So if you disagree with me regarding some benefits of crypto, you can just think about the media fluffing given to FTX etc. in recent years.

How big problem it is that cross-validation is biased?

Some weeks ago, I posted on Mastodon (you can follow me there) a thread about “How big problem it is that cross-validation is biased?”. I have also added that text to the CV-FAQ. Today I extended that thread, as we have a new paper out on estimating and correcting selection-induced bias in cross-validation model selection.

I’m posting the whole thread here for the convenience of those who are not (yet?) following me on Mastodon:

Unbiasedness has a special role in statistics, and too often there are dichotomous comments that something is not valid or is inferior because it’s not unbiased. However, often the non-zero bias is negligible, and often by modifying the estimator we may even increase bias but reduce the variance a lot, providing an overall improved performance.

In CV the goal is to estimate the predictive performance for unobserved data given the observed data of size n. CV has pessimistic bias due to using fewer than n observations to fit the models. In the case of LOO-CV this bias is usually small and negligible. In the case of K-fold-CV with a small K, the bias can be non-negligible, but if the effective number of parameters of the model is much less than n, then with K>10 the bias is also usually negligible compared to the variance.
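A small simulation can illustrate the pessimistic bias. This is a toy sketch under stated assumptions, not anything from the thread itself: the "model" is just the sample mean of standard normal data, and performance is measured by squared prediction error instead of log score. With a training set of size m, the expected squared error is 1 + 1/m, so 2-fold CV (which fits on n/2 points) should look noticeably worse than LOO-CV (which fits on n-1 points), which in turn is very close to the truth for n points.

```python
import random
import statistics

def kfold_mse(y, k):
    """K-fold CV estimate of squared prediction error when predicting with the training mean."""
    n = len(y)
    errors = []
    for fold in range(k):
        test_idx = range(fold, n, k)              # interleaved folds
        held_out = set(test_idx)
        train = [y[i] for i in range(n) if i not in held_out]
        prediction = statistics.fmean(train)
        errors.extend((y[i] - prediction) ** 2 for i in test_idx)
    return statistics.fmean(errors)

def simulate(n=10, n_sims=5000, seed=1):
    """Average the 2-fold and LOO estimates over many simulated data sets."""
    rng = random.Random(seed)
    two_fold, loo = [], []
    for _ in range(n_sims):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        two_fold.append(kfold_mse(y, 2))
        loo.append(kfold_mse(y, n))
    truth = 1.0 + 1.0 / n                          # expected error when fitting on all n points
    return statistics.fmean(two_fold), statistics.fmean(loo), truth

two_fold_avg, loo_avg, truth = simulate()
print(two_fold_avg, loo_avg, truth)
```

In this setup the theoretical values are 1.2 for 2-fold, about 1.11 for LOO, and 1.1 for the truth, so the LOO bias is tiny while the 2-fold bias is not.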

There is a bias correction approach by Burman (1989) (see also Fushiki (2011)) that reduces CV bias, but even in the cases with non-negligible bias reduction, the variance tends to increase so much that there is no real benefit (see, e.g. Vehtari and Lampinen (2002)).

For time series, when the task is to predict the future (there are other possibilities, like missing data imputation), there are specific CV methods such as leave-future-out (LFO) that have lower bias than LOO-CV or K-fold-CV (Bürkner, Gabry and Vehtari, 2020). There are sometimes comments that LOO-CV and K-fold-CV would be invalid for time series. Although they tend to have a bigger bias than LFO, they are still valid and can be useful, especially in model comparison where bias can cancel out.

Cooper et al. (2023) demonstrate that in time series model comparison variance is likely to dominate, so it is more important to reduce the variance than the bias, and that leave-few-observations combined with the joint log score is better than LFO. The problem with LFO is that the data sets used for fitting models are smaller, increasing the variance.

Bengio and Grandvalet (2004) proved that there is no unbiased estimate for the variance of CV in general, which has later been used as an argument that there is no hope. Instead of dichotomizing into unbiased or biased, Sivula, Magnusson and Vehtari (2020) consider whether the variance estimates are useful and how to diagnose when the bias is likely to be non-negligible (Sivula, Magnusson and Vehtari (2023) also prove a special case where an unbiased variance estimate actually exists).

CV tends to have high variance, as the sample reuse is not making any modeling assumptions (this holds also for information criteria such as WAIC). Not making modeling assumptions is good when we don’t trust our models, but if we do trust them we can get reduced variance in model comparison by, for example, examining the posterior directly or using reference models to filter out noise in the data (see, e.g., Piironen, Paasiniemi and Vehtari (2018) and Pavone et al. (2020)).

When using CV (or information criteria such as WAIC) for model selection, the performance estimate for the selected model has additional selection-induced bias. In the case of a small number of models this bias is usually negligible, that is, smaller than the standard deviation of the estimate or smaller than what is practically relevant. When the bias is negligible, we may choose a suboptimal model, but the difference from the performance of the oracle model is small.

In the case of a large number of models the selection-induced bias can be non-negligible, but this bias can be estimated using, for example, nested CV or bootstrap. The concepts of selection-induced bias and the related, potentially harmful overfitting are not new, but there hasn’t been enough discussion of when they are negligible or non-negligible.
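To see why the number of candidate models matters, here is a toy simulation. It is an illustrative sketch only and not the estimator from the paper: it assumes every candidate model has the same true performance (zero), and that each CV estimate is the truth plus independent Gaussian noise. Picking the model with the best estimate then reports the maximum of the noise draws, whose expectation grows with the number of models.

```python
import random
import statistics

def selection_bias(n_models, noise_sd=1.0, n_sims=2000, seed=1):
    """Average reported performance of the 'best' model when all true performances are 0."""
    rng = random.Random(seed)
    best_estimates = []
    for _ in range(n_sims):
        # ASSUMPTION: independent, equally noisy estimates for equally good models.
        estimates = [rng.gauss(0.0, noise_sd) for _ in range(n_models)]
        best_estimates.append(max(estimates))
    # True performance is 0 for every model, so this average is pure selection bias.
    return statistics.fmean(best_estimates)

for m in (1, 10, 100):
    print(m, round(selection_bias(m), 2))
```

With a single model the bias is essentially zero; with 10 models it is about 1.5 noise standard deviations, and with 100 models about 2.5. In real searches the estimates are correlated and the models differ in quality, so the numbers will differ, but the mechanism is the same.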

In our new paper with Yann McLatchie, “Efficient estimation and correction of selection-induced bias with order statistics,” we review the concepts of selection-induced bias and overfitting, propose a fast-to-compute estimate for the bias, and demonstrate how this can be used to avoid selection-induced overfitting even when selecting among 10^30 models.

The figure here shows simulation results with p=100 covariates, with different data sizes n, and varying block correlation among the covariates. The red lines show the LOO-CV estimate for the best model chosen so far in forward search. The grey lines show the performance on independent, much bigger test data, which we usually don’t have available. The black lines show our corrected estimate, taking into account the selection-induced bias. Stopping the searches at the peak of the black curves avoids overfitting.

Although we can estimate and correct the selection-induced bias, we primarily recommend using more sensible priors and not doing model selection. See more in Efficient estimation and correction of selection-induced bias with order statistics and Bayesian Workflow.

Evaluating Visualizations for Inference and Decision-Making (Jessica Hullman’s talk in the Columbia statistics seminar next Monday)

Social Work Bldg room 903, at 4pm on Mon 18 Sep 2023:

Evaluating Visualizations for Inference and Decision-Making

Research and development in computer science and statistics have produced increasingly sophisticated software interfaces for interactive visual data analysis. Data visualizations have also become ubiquitous for communication in the news and scientific publishing. Despite these successes, our understanding of how to design effective visualizations for data-driven decision-making remains limited. Design philosophies that emphasize data exploration and hypothesis generation can encourage pattern-finding at the expense of quantifying uncertainty. Designing visualizations to maximize perceptual accuracy and self-reported satisfaction can lead people to adopt visualizations that promote overconfident interpretations. I will motivate a few alternative objectives for measuring the effectiveness of visualization, and show how a rational agent framework based in statistical decision theory can help us understand the value of a visualization in the abstract and in light of empirical study results.

This is a super-important topic, also interesting because in many cases people think evaluation is a big deal without thinking too hard about what the goals of the graph are. It’s hard to design a good evaluation without having some goals in mind. For example, see this discussion from a few years ago of a study that was described as finding that “chartjunk is more useful than plain graphs” and this paper with Antony Unwin on different goals of infoviz and statistical graphics.

Another thing I like about the above abstract from Jessica is how she’s talking about two different goals of statistical graphics: (1) clarifying a point that you want to convey, and (2) providing opportunity for discovery. Both are important!

Better Than Difference in Differences (my talk for the Online Causal Inference Seminar Tues 19 Sept)

Tues 19 Sep 2023, 8:30am Pacific time:

Better Than Difference in Differences

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

It is not always clear how to adjust for control data in causal inference, balancing the goals of reducing bias and variance. We show how, in a setting with repeated experiments, Bayesian hierarchical modeling yields an adaptive procedure that uses the data to determine how much adjustment to perform. The result is a novel analysis with increased statistical efficiency compared with the default analysis based on difference estimates. The increased efficiency can have real-world consequences in terms of the conclusions that can be drawn from the experiments. An open question is how to apply these ideas in the context of a single experiment or observational study, in which case the optimal adjustment cannot be estimated from the data; still, the principle holds that difference-in-differences can be extremely wasteful of data.

The talk follows up on Andrew Gelman and Matthijs Vákár (2021), Slamming the sham: A Bayesian model for adaptive adjustment with noisy control data, Statistics in Medicine 40, 3403-3424.
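A back-of-the-envelope sketch of why plain differencing can waste data, under simplifying assumptions that are mine and not taken from the paper: pre-treatment and post-treatment measurements have equal variance and correlation rho, and the estimate is the post-treatment comparison minus b times the pre-treatment comparison. Then the variance of the estimate is proportional to 1 + b^2 - 2*b*rho. Setting b=1 gives difference-in-differences, b=0 ignores the pre-treatment data, and b=rho is the variance-minimizing choice. This is not the hierarchical model from the talk, just the simplest version of the trade-off; the function name is hypothetical.

```python
def relative_variance(b, rho):
    """Variance of (post - b * pre), relative to using post alone, for unit-variance measurements."""
    return 1.0 + b * b - 2.0 * b * rho

for rho in (0.2, 0.5, 0.8):
    print(
        rho,
        relative_variance(1.0, rho),   # difference-in-differences (b = 1)
        relative_variance(0.0, rho),   # ignore the pre-treatment data (b = 0)
        relative_variance(rho, rho),   # variance-minimizing adjustment (b = rho)
    )
```

When rho is below 0.5, full differencing is noisier than ignoring the pre-treatment measurement altogether, while the partial adjustment never does worse than either.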

Here’s the talk I gave in this seminar a few years ago:

100 Stories of Causal Inference

In social science we learn from stories. The best stories are anomalous and immutable. We shall briefly discuss the theory of stories, the paradoxical nature of how we learn from them, and how this relates to forward and reverse causal inference. Then we will go through some stories of applied causal inference and see what lessons we can draw from them. We hope this talk will be useful as a model for how you can better learn from your own experiences as participants and consumers of causal inference.

No overlap, I think.

Crypto scam social science thoughts: The role of the elite news media and academia

Campos quotes from one of the many stories floating around regarding ridiculous cryptocurrency scams.

I’m not saying it should’ve been obvious in retrospect that crypto was a scam, just that (a) it always seemed that it could be a scam, and (b) for a while there have been many prominent people saying it was a scam. Again, prominent people can be in error; what I’m getting at is that the potential scamminess was out there.

The usual way we think about scams is in terms of the scammers and the suckers, and also about the regulatory framework that lets people get away with it.

Here, though, I want to talk about something different, which is the role of outsiders in the information flow. For crypto, we’re talking about trusted journalistic intermediaries such as Michael Lewis or Tyler Cowen who were promoting or covering for crypto.

There were lots of reasons for respected journalists or financial figures to promote crypto, including political ideology, historical analogies, financial interest, FOMO, bandwagon-following, contrarianism, and plain old differences of opinion . . . pretty much the same set of reasons for respected journalists or financial figures to have been crypto-skeptical!

My point here is not that I knew better than the crypto promoters (yes, I was crypto-skeptical, but not out of any special knowledge); rather, it’s that the infrastructure of elite journalism was, I think, crucial to keeping the bubble afloat. Sure, crypto had lots of potential just from rich guys selling to each other and throwing venture capital at it, and suckers watching Alex Jones or whatever investing their life savings, but elite media promotion took it to the next level.

It’s not like I have any answers to this one. There were skeptical media all along, and I can’t really fault the media for spotting a trend that was popular among richies and covering it.

I’m just interested in these sorts of conceptual bubbles, whether they be financial scams or bad science (ovulation and voting, beauty and sex ratio, ESP, himmicanes, nudges, UFOs, etc etc etc), and how they can stay afloat in Wile E. Coyote fashion long after they’ve been exposed.

Crypto is different from Theranos or embodied cognition, I guess, in that it has no inherent value and thus can retain value purely as part of a Keynesian beauty contest, whereas frauds or errors that make actual scientific or technological claims can ultimately be refuted. Paradoxically, crypto’s lack of value (actually, its negative value, given its high energy costs) can make it a more plausible investment than businesses or ideas that could potentially do something useful if their claims were in fact true.

P.S. More here from David Morris on the role of the elite news media in this story.

Analyst positions available at the Consumer Financial Protection Bureau!

Jennifer Zhang, who took my applied statistics class a few years ago, writes:

I am now working at the Consumer Financial Protection Bureau in DC. I’m writing to share an exciting job opportunity that I hope some of your students would be interested in.

The Consumer Financial Protection Bureau (CFPB), a 21st century government agency that implements and enforces Federal consumer financial law and ensures that markets for consumer financial products are fair, transparent, and competitive, is recruiting this fall for the Director’s Financial Analyst (DFA) position to start in June 2024, and we want to encourage graduating seniors/recent graduates to apply.


Using forecasts to estimate individual variances

Someone who would like to remain anonymous writes:

I’m a student during the school year, but am working in industry this summer. I am currently attempting to overhaul my company’s model of retail demand. We advise suppliers to national retailers; our customers are the suppliers. Right now, for each of our customers, our demand model outputs a point estimate of how much of their product will be consumed at one of roughly a hundred locations. This allows our customers to decide how much to send to each location.

However, because we are issuing point estimates of mean demand, we are *not* modeling risk directly, and I want to change that, as understanding risk is critical to making good decisions about inventory management – the entire point of excess inventory is to provide a buffer against surprises.

Additionally, the model currently operates on a per-day basis, so that predictions for a month from now are obtained by chaining together thirty predictions about what day N+1 will look like. I want to change that too, because it seems to be causing a lot of problems with errors in the model propagating across time, to the point that predictions over even moderate time intervals are not reliable.

I already know how to do both of these in an abstract way.

I’m willing to bite the bullet of assuming that the underlying distribution of the PDF should be multivariate Gaussian. From there, arriving at the parameters of that PDF just requires max likelihood estimation. For the other change, without going into a lot of tedious detail, Neural ODE models are flexible with respect to time such that you can use the same model to predict the net demand accumulated over t=10 days as you would to predict the net demand accumulated over t=90 days, just by changing the time parameter that you query the model with.

The problem is, although I know how to build a model that will do this, I want the estimated variance for each customer’s product to be individualized. Yet frustratingly, in a one-shot scenario, the maximum likelihood estimator of variance is zero. The only datapoint I’ll have to use to train the model to estimate the mean aggregate demand for, say, cowboy hats in Seattle at time t=T (hereafter (c,S,T)) is the actual demand for that instance, so the difference between the mean outcome and the actual outcome will be zero.
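The degenerate one-shot case can be seen numerically. In this minimal sketch (the demand figure and function name are made up for illustration), the fitted mean equals the single observation, and the Gaussian negative log-likelihood keeps decreasing as the standard deviation shrinks, so maximum likelihood pushes the variance toward zero.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of one observation y under Normal(mu, sigma)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

y_observed = 120.0   # hypothetical demand for one (product, location, time) instance

# With the mean fit exactly to the single datapoint, a smaller sigma is always "better".
for sigma in (10.0, 1.0, 0.1, 0.01):
    print(sigma, gaussian_nll(y_observed, mu=y_observed, sigma=sigma))
```

Any useful variance estimate therefore has to borrow information from somewhere: other locations, other times, other products, or a prior.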

It’s clear to me that if I want to arrive at a good target for variance or covariance in order to conduct risk assessment, I need to do some kind of aggregation over the outcomes, but most of the obvious options don’t seem appealing.

– If I obtain an estimate of variance by thinking about the difference between (c,S,T) and (c,Country,T), aggregating over space, I’m assuming that each location shares the same mean demand, which I know is false.

– If I obtain one by thinking about the difference between (c,S,T) and (c,S,tbar), aggregating over time, I am assuming there’s a stationary covariance matrix for how demand accumulates at that location over time, which I know is false. This will fail especially badly if issuing predictions across major seasonal events, such as holidays or large temperature changes.

– If I aggregate across customers by thinking about the difference between (c,S,T) and (cbar,S,T), I’ll be assuming that the demand for cowboy hats at S,T should obey similar patterns as the demand for other products, such as ice cream or underwear sales, which seems obviously false.

I have thought of an alternative to these, but I don’t know if it’s even remotely sensible, because I’ve never seen anything like it done before. I would love your thoughts and criticisms on the possible approach. Alternatively, if I need to bite the bullet and go with one of the above aggregation strategies instead, it would benefit me a lot to have someone authoritative tell me so, so that I stop messing around with bad ideas.

My thought was that instead of asking the model to use the single input vector associated with t=0 to predict a single output vector at t=T, I could instead ask the model to make one prediction per input vector for many different input vectors from the neighborhood of time around t=0 in order to predict outcomes at a neighborhood of time around t=T. For example, I’d want one prediction for t=-5 to t=T, another prediction for t=-3 to t=T+4, and so on.

I would then judge the “true” target variance for the model relative to the difference between (c,S,T)’s predicted demand and the average of the model’s predicted demands for those nearby time slices. The hope is that this would reasonably correspond to the risks that customers should consider when optimizing their inventory management, by describing the sensitivity of the model to small changes in the input features and target dates it’s queried on. The model’s estimate of its own uncertainty wouldn’t do a good job of representing out-of-model error, of course, but the hope is that it’d at least give customers *something*.
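As I read it, the proposal amounts to something like the following sketch. Everything here is hypothetical: `toy_predict` stands in for the real demand model, the offsets are arbitrary, and the squared deviation from the neighborhood average is one possible reading of the "target variance" described above.

```python
import statistics

def neighborhood_target(predict, t_start, t_end, offsets):
    """Compare the central prediction with predictions from nearby start/end times."""
    center = predict(t_start, t_end)
    neighbors = [predict(t_start + ds, t_end + de) for ds, de in offsets]
    neighbor_mean = statistics.fmean(neighbors)
    # Squared deviation of the central prediction from the neighborhood average.
    return center, neighbor_mean, (center - neighbor_mean) ** 2

def toy_predict(t_start, t_end):
    """Stand-in for the real model: 2 units of demand per day over the interval."""
    return 2.0 * (t_end - t_start)

# (shift in start day, shift in end day) for each nearby query; arbitrary choices.
offsets = [(-5, 0), (-3, 4), (0, 0), (2, -1)]
center, neighbor_mean, target_var = neighborhood_target(toy_predict, 0, 30, offsets)
print(center, neighbor_mean, target_var)
```

Note that with this toy model the spread comes entirely from the different interval lengths, which shows that the measure captures sensitivity of the model to its query, not noise in actual demand.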

Does this make any sense at all as a possible approach, or am I fooling myself?

My reply: I haven’t followed all the details, but my guess is that your general approach is sound. It should be possible to just fit a big Bayesian model in Stan, but maybe that would be too slow; I don’t really know how big the problem is. The sort of approach described above, where different models are fit and compared, can be thought of as a kind of computational approximation to a more structured hierarchical model, in the same way that cross-validation can be thought of as an approximation to an error model, or smoothing can be thought of as an approximation to a time-series model.