A time series so great, they plotted it twice. (And here’s a better way to do it:)

Someone who I don’t know writes:

If you decide to share this publicly, say in your blog, let me stay anonymous.

It’s funny how people want anonymity on these things!

Anyway, my correspondent continues:

I came across this 2016 PNAS article, “Seasonality in human cognitive brain responses.”

It has this interesting figure:

The same data are plotted twice, once in the left half of the figure, and again in the right half. The horizontal axis is repeated, so we are not looking at data fabrication. In the caption, the authors say “n=28”. (Two pairs of dots overlap, so you see only 26 dots in each half). They also describe this figure as a “double plot”. I did an internet search for “double plot” and, so far as I can tell, there is no such thing. The closest thing was a dual-axis plot, which is not what the authors have here. They’ve used “double plots” in other figures in the paper too.

Going by how the authors drew the x-axis and their disclosure that “n=28”, I assume that the authors did not mean to deceive the readers. But I still find it deceptive. I can hardly think of a situation where repeating a plot is a good idea. But if an author must do it, they should probably not just call it a “double plot” and leave it at that. They should describe what it is they have done and why.

Yeah, this is wack! The natural thing would be to just show one year and not duplicate any data—I guess then there’s a concern that you wouldn’t see the continuity between December and January. But, yeah, repeating the entire thing seems like a bit much.

Here’s what I’d recommend: Display one year, Winter/Spring/Summer/Fall, then append Fall on the left and Winter on the right (so now you’re displaying 18 months) but gray out the duplicate months, so then it’s clear that they’re not additional data, they’re just showing the continuity of the pattern.

Best of both worlds!
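Here’s a minimal sketch in R of what I have in mind, using made-up monthly data (not the data from the paper): one year plotted in black, with the duplicated fall and winter months grayed out on either side.

# Fake seasonal series, Jan-Dec
set.seed(1)
y <- 10 + 3*sin(2*pi*(1:12)/12) + rnorm(12, sd = 0.5)

# 18-month display: Oct-Dec repeated on the left, Jan-Mar repeated on the right
idx  <- c(10:12, 1:12, 1:3)
cols <- c(rep("gray70", 3), rep("black", 12), rep("gray70", 3))

plot(1:18, y[idx], type = "n", xaxt = "n", xlab = "", ylab = "response")
axis(1, at = 1:18, labels = month.abb[idx], cex.axis = 0.7)
lines(1:18, y[idx], col = "gray50")
points(1:18, y[idx], pch = 16, col = cols)

The grayed-out points make it clear at a glance that the shoulder months are repeats, not additional data.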

P.S. The duplicate graph reminds me of a hilarious lampshade I saw once that looked like a map of the world, but it actually was two maps: that is, it went around the world twice, so that from any horizontal angle you could see all 360 degrees. I tried to find an image online but no amount of googling took me to it.

Problem with the University of Wisconsin’s Area Deprivation Index. And, no, face validity is not “the weakest of all possible arguments.”

A correspondent writes:

I thought you might care to comment on a rebuttal in today’s HealthAffairs. I find it a poor non-defense that relies on “1000s of studies used our measure and found it valid”, as well as attacks on the critics of their work.

The issue began when the Centers for Medicare & Medicaid Services (CMS) decided to explore a health equity payment model called ACO-REACH. CMS chose a revenue-neutral scheme to remove some dollars from payments to providers serving the most-advantaged people and re-allocate those dollars to the most disadvantaged. Of course, CMS needs to choose a measure of poverty that is 100% available and easy to compute. These requirements limit the measure to a poverty index available from Census data.

CMS chose to use a common poverty index, University of Wisconsin’s Area Deprivation Index (ADI). Things got spicy earlier this year when some other researchers noticed that no areas in the Bronx or south-eastern DC are in the lowest deciles of the ADI measure. After digging into the ADI methods a bit deeper, it seems the issue is that the ADI does not scale the housing dollars appropriately before using that component in a principal components analysis to create the poverty index.

One thing I find perplexing about the rebuttal from UWisc is that it completely ignores the existence of every other validated poverty measure, and specifically the CDC’s Social Vulnerability Index. Their rebuttal pretends that there is no alternative solution available, and therefore the ADI measure must be used as is. Lastly, while ADI is publicly available, it is available under a non-commercial license so it’s a bit misleading for the authors to not disclose that they too have a financial interest in pushing the ADI measure while accusing their critics of financial incentives for their criticism.

The opinions expressed here are my own and do not reflect those of my employer or anyone else. I would prefer to remain anonymous if you decide to report this to your blog, as I wish to not tie these personal views to my employer.

Interesting. I’d never heard of any of this.

Here’s the background:

Living in a disadvantaged neighborhood has been linked to a number of healthcare outcomes, including higher rates of diabetes and cardiovascular disease, increased utilization of health services, and earlier death [1-5]. Health interventions and policies that don’t account for neighborhood disadvantage may be ineffective. . . .

The Area Deprivation Index (ADI) . . . allows for rankings of neighborhoods by socioeconomic disadvantage in a region of interest (e.g., at the state or national level). It includes factors for the theoretical domains of income, education, employment, and housing quality. It can be used to inform health delivery and policy, especially for the most disadvantaged neighborhood groups. “Neighborhood” is defined as a Census block group. . . .

The rebuttal

Clicking on the above links, I agree with my correspondent that there’s something weird about the rebuttal article, starting with its title, “The Area Deprivation Index Is The Most Scientifically Validated Social Exposome Tool Available For Policies Advancing Health Equity,” which elicits memories of Cold-War-era Pravda, or perhaps an Onion article parodying the idea of someone protesting too much.

The article continues with some fun buzzwords:

This year, the Center for Medicare and Medicaid Innovation (CMMI) took a ground-breaking step, creating policy aligning with multi-level equity science and targeting resources based on both individual-level and exposome (neighborhood-level) disadvantage in a cost-neutral way.

This sort of bureaucratic language should not in itself be taken to imply that there’s anything wrong with the Area Deprivation Index. A successful tool in this space will get used by all sorts of agencies, and bureaucracy will unavoidably spring up around it.

Let’s read further and see how they respond to the criticism. Here they go:

Hospitals located in high ADI neighborhoods tend to be hit hardest financially, suggesting health equity aligned policies may offer them a lifeline. Yet recently, CMS has been criticized for selecting ADI for use in its HEBA. According to behavioral economics theory, potential losers will always fight harder than potential winners, and in a budget-neutral innovation like ACO REACH there are some of both.

I’m not sure the behavioral economics framing makes sense here. Different measures of deprivation will correspond to different hospitals getting extra funds, so in that sense both sides in the debate represent potential winners and losers from different policies.

They continue:

CMS must be allowed time to evaluate the program to determine what refinements to its methodology, if any, are needed. CMS has signaled openness to fine-tune the HEBA if needed in the future. Ultimately, CMS is correct to act now with the tools of today to advance health equity.

Sure, but then you could use one of the other available indexes, such as the Social Deprivation Index or the Social Vulnerability Index, right? It seems there are three questions here: first, whether to institute this new policy to “incentivize medical groups to work with low-income populations”; second, whether there are any available measures of deprivation that make sense for this purpose; third, if more than one measure is available, which one to use.

So now on to their defense of the Area Deprivation Index:

The NIH-funded, publicly available ADI is an extensively validated neighborhood-level (exposome) measure that is tightly linked to health outcomes in nearly 1000 peer-reviewed, independent scientific publications; is the most commonly used social exposome measure within NIH-funded research today; and undergoes a rigorous, multidisciplinary evaluation process each year prior to its annual update release. Residing in high ADI neighborhoods is tied to biological processes such as accelerated epigenetic aging, increased disease prevalence and increased mortality, poor healthcare quality and outcomes, and many other health factors in research studies that span the full US.

OK, so ADI is nationally correlated with various bad outcomes. This doesn’t yet address the concern of the measure having problems locally.

But they do get into the details:

A recent peer-reviewed article argued that the monetary values in the ADI should be re-weighted and an accompanying editorial noted that, because these were “variables that were measured in dollars,” they made portions of New York State appear less disadvantaged than the authors argued they should be. Yet New York State in general is a very well-resourced state with one of the ten highest per capita incomes in the country, reflected in their Medicaid Federal Medical Assistance Percentage (FMAP). . . .

Some critics relying on face validity claim the ADI does not perform “well” in cities with high housing costs like New York, and also California and Washington, DC, and suggest that a re-weighted new version be created, again ignoring evidence demonstrating the strong link between the ADI and health in all kinds of cities including New York (also here), San Francisco, Houston, San Antonio, Chicago, Detroit, Atlanta, and many others. . . .

That first paragraph doesn’t really address the question, as the concerns about the South Bronx not having a high deprivation index are about one part of New York, not “New York State in general.” But the rebuttal article does offer two links about New York specifically, so let me take a look:

Associations between Amygdala-Prefrontal Functional Connectivity and Age Depend on Neighborhood Socioeconomic Status:

Given the bimodal distribution of ADI percentiles in the current sample, the variable was analyzed in three groups: low (90–100), middle (11–89), and high neighborhood SES.

To get a sense of things, I went to the online Neighborhood Atlas and grabbed the map of national percentiles for New York State:

So what they’re doing is comparing some rich areas of NYC and its suburbs; to some low- and middle-income parts of the city, suburbs, and upstate; to some low-income rural and inner-city areas upstate.

Association Between Residential Neighborhood Social Conditions and Health Care Utilization and Costs:

Retrospective cohort study. Medicare claims data from 2013 to 2014 linked with neighborhood social conditions at the US census block group level of 2013 for 93,429 Medicare fee-for-service and dually eligible patients. . . . Disadvantaged neighborhood conditions are associated with lower total annual Medicare costs but higher potentially preventable costs after controlling for demographic, medical, and other patient characteristics. . . . We restricted our sample to patients with 9-digit residential zip codes available in New York or New Jersey . . .

I don’t see the relevance of these correlations to the criticisms of the ADI.

To return to our main thread, the rebuttal summarizes:

The ADI is currently the most validated scientific tool for US neighborhood level disadvantage. This does not mean that other measures may not eventually also meet this high bar.

My problem here is with the term “most validated.” I’m not sure how to take this, given that all this validation didn’t seem to have caught that problem with the South Bronx etc. But, sure, I get their general point: When doing research, better to go with the devil you know, etc.

The rebuttal authors add:

CMS should continue to investigate all options, beware of conflicts of interest, and maintain the practice of vetting scientifically validated, evidence-based criteria when selecting a tool to be used in a federal program.

I think we can all agree on that.

Beyond general defenses of the ADI on the grounds that many people use it, the rebuttal authors make an interesting point about the use of neighborhood-level measures more generally:

Neighborhood-level socioeconomic disadvantage is just as (and is sometimes more) important than individual SES. . . . These factors do not always overlap, one may be high, the other low or vice versa. Both are critically important in equity-focused intervention and policy design. In their HEBA, as aligned with scientific practice, CMS has included one of each—the ADI captures neighborhood-level factors, and dual Medicare and Medicaid eligibility represents an individual-level factor. Yet groups have mistakenly conflated individual-level and neighborhood-level factors, wrongly suggesting that neighborhood-level factors are only used because additional individual factors are not readily available.

They link to a review article. I didn’t see the reference there to groups claiming that neighborhood-level factors are only used because additional individual factors are not readily available, but I only looked at that linked article quickly so I probably missed the relevant citation.

The above are all general points about the importance of using some neighborhood-level measure of disadvantage.

But what about the specific concerns raised with the ADI, such as labeling most of the South Bronx as low disadvantage (in the 10th to 30th percentile nationally)? Here’s what I could find in the rebuttal:

These assertions rely on what’s been described as “the weakest of all possible arguments”: face validity—defined as the appearance of whether or not something is a correct measurement. This is in contrast to empirically-driven tests for construct validity. Validation experts universally discredit face validity arguments, classifying them as not legitimate, and more aligned with “marketing to a constituency or the politics of assessment than with rigorous scientific validity evidence.” Face validity arguments on their own are simply not sufficient in any rigorous scientific argument and are fraught with potential for bias and conflict of interest. . . .

Re-weighting recommendations run the risk of undermining the strength and scientific rigor of the ADI, as any altered ADI version no longer aligns with the highly-validated original Neighborhood Atlas ADI methodology . . .

Some have suggested that neighborhood-level disadvantage metrics be adjusted to specific needs and areas. We consider this type of change—re-ranking ADI into smaller, custom geographies or adding local adjustments to the ADI itself—to be a type of gerrymandering. . . . A decision to customize the HEBA formula in certain geographies or parts of certain types of locations will benefit some areas and disservice others . . .

I disagree with the claim that face validity is “the weakest of all possible arguments.” For example, saying that a method is good because it’s been cited thousands of times, or saying that local estimates are fine because the national or state-level correlations look right, those are weaker arguments! And if validation experts universally discredit face validity arguments . . . ummmm, I’m not sure who are the validation experts out there, and in any case I’d like to see the evidence of this purportedly universal view. Do validation experts universally think that North Korea has moderate electoral integrity?

The criticism

Here’s what the critical article lists as limitations of the ADI:

Using national ADI benchmarks may mask disparities and may not effectively capture the need that exists in some of the higher cost-of-living geographic areas across the country. The ADI is a relative measure for which included variables are: median family income; percent below the federal poverty level (not adjusted geographically); median home value; median gross rent; and median monthly mortgage. In some geographies, the ADI serves as a reasonable proxy for identifying communities with poorer health outcomes. For example, many rural communities and lower-cost urban areas with low life expectancy are also identified as disadvantaged on the national ADI scale. However, for parts of the country that have high property values and high cost of living, using national ADI benchmarks may mask the inequities and poor health outcomes that exist in these communities. . . .

They recommend “adjusting the ADI for variations in cost of living,” “recalibrating the ADI to a more local level,” or “making use of an absolute measure such as life expectancy rather than a relative measure such as the ADI.”

There seem to be two different things going on here. The first is that ADI is a socioeconomic measure, and it could also make sense to include a measure of health outcomes. The second is that, as a socioeconomic measure, ADI seems to have difficulty in areas that are low income but with high housing costs.
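Incidentally, the scaling issue my correspondent raised at the top of this post (putting an unscaled dollar-valued variable into a principal components analysis) is easy to see in a toy example. Here’s a minimal sketch in R with made-up variables, not the ADI’s actual inputs or methodology:

# Fake data: when one variable is in raw dollars, its variance can swamp
# the first principal component unless the variables are standardized.
set.seed(1)
n <- 500
income      <- rnorm(n, 60, 15)                    # thousands of dollars
poverty_pct <- 25 - 0.2*income + rnorm(n, 0, 3)    # percent below poverty line
home_value  <- 4000*income + rnorm(n, 0, 100000)   # raw dollars, huge variance

X <- cbind(income, poverty_pct, home_value)

round(prcomp(X, scale. = FALSE)$rotation[, 1], 3)  # PC1 is essentially just home value
round(prcomp(X, scale. = TRUE)$rotation[, 1], 3)   # PC1 loads on all three variables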

My summary

1. I agree with my correspondent’s email that led off this post. The criticisms of the ADI seem legit—indeed, they remind me a bit of the Human Development Index, which had a similar problem of giving unreasonable summaries, attributable to someone constructing a reasonable-seeming index and then not looking into the details; see here for more. There was also the horrible, horrible Electoral Integrity Index, which had similar issues of face validity that could be traced back to fundamental issues of measurement.

2. I also agree with my correspondent that the rebuttal article is bad for several reasons. The rebuttal:
– does not ever address the substantive objections;
– doesn’t seem to recognize that, just because a measure gives reasonable national correlations, that doesn’t mean that it can’t have serious local problems;
– leans on an argument-from-the-literature that I don’t buy, in part out of general distrust of the literature and in part because none of the cited literature appears to address the concerns on the table;
– presents a ridiculous argument against the concept of face validity.

Face validity—what does that mean?

Let me elaborate upon that last point. When a method produces a result that seems “on its face” to be wrong, that does not necessarily tell us that the method is flawed. If something contradicts face validity, that tells us that it contradicts our expectations. It’s a surprise. One possibility is that our expectations were wrong! Another possibility is that there is a problem with the measure, in which case the contradiction with our expectations can help us understand what went wrong. That’s how things went with the political science survey that claimed that North Korea was a moderately democratic country, and that’s how things seem to be going with the Area Deprivation Index. Even if it has thousands of citations, it can still have flaws. And in this case, the critics seem to have gone in and found where some of the flaws are.

In this particular example, the authors of the rebuttal have a few options.

They could accept the criticisms of their method and try to do better.

Or they could make the affirmative case that all these parts of the South Bronx, southeast D.C., etc., are not actually socioeconomically deprived. Instead they kind of question that these areas are deprived (“New York State in general is a very well-resourced state”) but without quite making that claim. I think one reason they’re stuck in the middle is politics. Public health generally comes from the left side of the political spectrum, and, from the left, if an area is poor and has low life expectancy, you’d call it deprived. From the right, you might argue that these sorts of poor neighborhoods are not “deprived” but rather are already oversaturated with government support, and that all this welfare dependence just compounds the problem. But I don’t think we’d be seeing much of that argument in the health-disparities space.

Or they could make a content-low response without addressing the problem. Unfortunately, that’s the option they chose.

I have no reason to think they deliberately chose to respond poorly here. My guess is that they’re soooo comfortable with their measure, soooooo sure it’s right, that they just dismissed the criticism without ever thinking about it. Which is too bad. But now they have this post! Not too late for them to do better. Tomorrow’s another day, hey!

P.S. My correspondent adds:

The original article criticizing the ADI measure has some map graphic sins that any editor should have removed before publication. Here are some cleaner comparisons of the city data. The SDI measure in those plots is the Social Deprivation Index from Robert Graham Center.

Washington, D.C.:

New York City:

Boston:

San Francisco area:

The Ten Craziest Facts You Should Know About A Giraffe:

Palko points us to this story:

USC oncologist David Agus’ new book is rife with plagiarism

The publication of a new book by Dr. David Agus, the media-friendly USC oncologist who leads the Lawrence J. Ellison Institute for Transformative Medicine, was shaping up to be a high-profile event.

Agus promoted “The Book of Animal Secrets: Nature’s Lessons for a Long and Happy Life” with appearances on CBS News, where he serves as a medical contributor, and “The Howard Stern Show,” where he is a frequent guest. Entrepreneur Arianna Huffington hosted a dinner party at her home in his honor. The title hit No. 1 on Amazon’s list of top-selling books about animals a week before its March 7 publication.

However, a [Los Angeles] Times investigation found at least 95 separate passages in the book that resemble — sometimes word for word — text that originally appeared in other published sources available on the internet. The passages are not credited or acknowledged in the book or its endnotes. . . .

The passages in question range in length from a sentence or two to several continuous paragraphs. The sources borrowed from without attribution include publications such as the New York Times and National Geographic, scientific journals, Wikipedia and the websites of academic institutions.

The book also leans heavily on uncredited material from smaller and lesser-known outlets. A section in the book on queen ants appears to use several sentences from an Indiana newspaper column by a retired medical writer. Long sections of a chapter on the cardiac health of giraffes appear to have been lifted from a 2016 blog post on the website of a South African safari company titled, “The Ten Craziest Facts You Should Know About A Giraffe.”

Never trust a guy who wears a button down shirt and sweater and no tie.

The author had something to say:

“I was recently made aware that in writing The Book of Animal Secrets we relied upon passages from various sources without attribution, and that we used other authors’ words. I want to sincerely apologize to the scientists and writers whose work or words were used or not fully attributed,” Agus said in a statement. “I take any claims of plagiarism seriously.”

From the book:

“I’m not pitching a tent to watch chimpanzees in Tanzania or digging through ant colonies to find the long-lived queen, for example,” he writes. “I went out and spoke to the amazing scientists around the world who do these kinds of experiments, and what I uncovered was astonishing.”

All good, except that when he said, “I went out and spoke to the amazing scientists around the world,” he meant to say, “I went on Google and looked up websites of every South African safari company I could find.”

“The Ten Craziest Facts You Should Know About A Giraffe,” indeed.

And here are a few relevant screenshots:

I have no idea what that light bulb thingie is doing in that last image, but here’s some elaboration:

“Research misconduct,” huh? I guess if USC ever gives Dr. Agus a hard time about that, he could just move a few hundred miles to the north, where they don’t care so much about that sort of thing.

Deja vu on researching whether people combined with LLMs can do things people can do

This is Jessica. There has been a lot of attention lately on how we judge whether a generative model like an LLM has achieved human-like intelligence, and what not to do when making claims about this. But I’ve also been watching the programs of some of the conferences I follow fill up with a slightly different rush to document LLMs: papers applying models like GPT-4 to tasks that we once expected humans to do, to see how well they do. For example, can we use ChatGPT to generate user responses to interactive media? Can they simulate demographic backstories we might get if we queried real populations? Can they convince people to be more mindful? Can they generate examples of AI harms? And so on.

Most of this work is understandably very exploratory. And if LLMs are going to reshape how we program or get medical treatment or write papers, then of course there’s some pragmatic value to starting to map out where they excel versus fail on these tasks, and how far we can rely on them to go. 

But do we get anything beyond pragmatic details that apply to the current state of LLMs? In many cases, it seems doubtful.

One problem with papers that “take stock” of how well an LLM can do on some human task is that the technology keeps changing, and even between the big model releases (e.g., moving from GPT-3 to GPT-4) we can’t easily separate out which behaviors are more foundational, resulting from the pre-training, versus which are arising as a result of interactive fine-tuning as the models get used. This presents a challenge to researchers who want something about their results to be applicable for more than a year or two. There needs to be something we learn that is more general than this particular model version applied to this task. But in this kind of exploratory work, that’s hard to guarantee. 

To be fair, some of these papers can contribute intermediate-level representations that help characterize a domain-specific problem or solution independent of the LLM. For instance, this paper, in applying LLMs to the problem of reframing negative thoughts, developed a taxonomy of the types of cognitive reframing that work. But many don’t.

I’m reminded of the early 2010s when crowdsourcing was really starting to take off. It was going to magically speed up machine learning by enabling annotation at scale, and let behavioral researchers do high throughput experiments, transforming social science. And it did in many ways, and it was exciting to have a new tool. But if you looked at a lot of the specific research coming out to demonstrate the power of crowdsourcing, the high level research question could be summarized as “Can humans do this task that we know humans can do?” There was little emphasis on the more practical concerns about whether, in some particular workflow, it makes sense to invest effort in crowdsourcing, how much money or effort it took the researchers to get good results from crowds of humans, or what would happen if the primary platform at the time (Amazon Mechanical Turk) stopped being supported. 

And now here we are again. LLMs are not people, of course, so the research question is more like “By performing high dimensional curve fitting on massive amounts of human-generated content, can we generate human-like content?” Instead of being about performance on some benchmark, this more applied version becomes about whether the AI-generated content is passable in domain X. But since definitions of passable tend to be idiosyncratic and developed specific to each paper, it’s hard to imagine someone synthesizing all this in any kind of concrete way later. 

Part of my distaste for this type of research is that we still seem to lack an intermediate layer of understanding of what more abstract behaviors we can expect from different types of models and interactions with models. We understand the low-level stuff about how the models work, we can see how well they do on these tasks humans usually do, but we’re missing tools or theories that can relate the two. This is the message of a recent paper by Holtzman, West, and Zettlemoyer, which argues that researchers should invest more in developing a vocabulary of behaviors, or “meta-models” that predict aspects of an LLM’s output, to replace questions like “What is the LLM doing?” with “Why is the LLM doing that?”

I guess one could argue that this kind of practical research is a more worthwhile use of federal funding than the run-of-the-mill behavioral study, which might set out to produce some broadly generalizable result but shoot itself in the foot by using small samples, noisy measurements, an underdefined population, etc. But at least in studies of human behavior there is usually an attempt at identifying some deeper characterization of what’s going on, so the research question might be interesting, even if the evidence doesn’t deliver. 

Devereaux on ChatGPT in the classroom

Palko points to this post by historian Bret Devereaux:

Generally when people want an essay, they don’t actually want the essay; the essay they are reading is instead a container for what they actually want which is the analysis and evidence. An essay in this sense is a word-box that we put thoughts in so that we can give those thoughts to someone else. . . .

In a very real sense then, ChatGPT cannot write an essay. It can imitate an essay, but because it is incapable of the tasks which give an essay its actual use value (original thought and analysis), it can only produce inferior copies of other writing. . . .

That leaves the role of ChatGPT in the classroom. And here some of the previous objections do indeed break down. A classroom essay, after all, isn’t meant to be original; the instructor is often assigning an entire class to write essays on the same topic, producing a kaleidoscope of quite similar essays using similar sources. Moreover classroom essays are far more likely to be about the kind of ‘Wikipedia-famous’ people and works which have enough of a presence in ChatGPT’s training materials for the program to be able to cobble together a workable response (by quietly taking a bunch of other such essays, putting them into the blender and handing out the result, a process which in the absence of citation we probably ought to understand as plagiarism). In short, many students are often asked to write an essay that many hundreds of students have already written before them.

And so there were quite a few pronouncements that ChatGPT had ‘killed’ the college essay. . . . This both misunderstands what the college essay is for as well as the role of disruption in the classroom. . . .

In practice there are three things that I am aiming for an essay assignment to accomplish in a classroom. The first and probably least important is to get students to think about a specific historical topic or idea, since they (in theory) must do this in order to write about it. . . . The second goal and middle in importance is training the student in how to write essays. . . . Thus the last and most important thing I am trying to train is not the form of the essay nor its content, but the basic skills of having a thought and putting it in a box that we outlined earlier. Even if your job or hobbies do not involve formal writing, chances are (especially if your job requires a college degree) you are still expected to observe something real, make conclusions about it and then present those conclusions to someone else (boss, subordinates, co-workers, customers, etc.) in a clear way, supported by convincing evidence if challenged. What we are practicing then is how to have good thoughts, put them in good boxes and then effectively hand that box to someone else. . . .

Crucially – and somehow this point seems to be missed by many of ChatGPT’s boosters I encountered on social media – at no point in this process do I actually want the essays. Yes, they have to be turned in to me and graded and commented because that feedback in turn is meant to both motivate students to improve but also to signal where they need to improve. But I did not assign the project because I wanted the essays. To indulge in an analogy, I am not asking my students to forge some nails because I want a whole bunch of nails – the nails they forge on early attempts will be quite bad anyway. I am asking them to forge nails so that they learn how to forge nails (which is why I inspect the nails and explain their defects each time) and by extension also learn how to forge other things that are akin to nails. . . .

What one can immediately see is that a student who simply uses ChatGPT to write their essay for them has simply cheated themselves out of the opportunity to learn (and also wasted my time in providing comments and grades).

It’s as if you’re coaching kids on the football team and you want them to build up their strength by lifting weights. It wouldn’t help the students to do it using a forklift.

Regarding chatbots in the classroom, I see a few issues:

1. It makes cheating a lot easier. Already you can google something like *high school essay on seven years war* and get lots of examples, but you have to pay for the most accessible ones, you get the essays only one at a time, and you can’t easily modify them. I’ve never actually used the GPT chatbot, but I’m guessing there you can just type in the topic and get the essay right away.

Even for students who don’t want to cheat, I could see them typing in the essay topic just to get started, and then taking the chatbot output as a starting point . . . and that’s cheating, or it can be. More to the point, it can be destructive of the learning process relative to the ultimate goal: As Devereaux discusses, this sort of chatbot result won’t be at all useful for future writing that’s actually intended to convey information.

2. It will change the sorts of assignments that teachers give to students. Instead of the take-home essay assignment, the in-class assignment.

3. Even for the in-class assignment, the chatbot can be useful for preparing. Here I can see pluses and minuses. The plus is that it can give the student a lot of practice. The minus is that it’s practicing an empty form of writing (that horrible “five-paragraph essay” thing, ending, “In conclusion . . .”) and also it can be used to cheat: a student who has a sense of what the question will be can prepare by having the chatbot write some plausible answers.

Again, the problem with cheating is that it’s a replacement for learning the relevant skills, in the same way that lifting weights with the forklift is a replacement for actually building your muscle strength.

4. On the plus side, it does seem that if used carefully the chatbot can create useful practice problems for studying. Whether the topic is writing or anything else, it can be helpful to have lots of practice.

Some questions on regression

Brett Cooper writes:

I read your book Regression and Other Stories. As a beginner and community college student, I wonder if I may be able to ask you a couple of simple and clarifying questions.

1. Multiple Regression and Collinearity

I created 2 linear regression models using variables from a simple dataset. The first model is a simple linear regression model with a regressor X1 and a specific response variable Y. The other model is a multiple linear regression model which includes X1, X2, X3 and Y. The statistical summary for the simple linear regression model assigns a positive and statistically significant coefficient beta1 to X1 indicating a positive linear association between X1 and Y. However, the statistical summary for the multiple regression model shows that the coefficient beta1 for X1 has changed sign, changed magnitude and is not statistically significant anymore.
Did that happen because of the undesirable but often unavoidable effect of collinearity between X1 and one or both of the other two predictors X2 and X3?

Based on my understanding, it is to be expected that, even with zero multicollinearity, the regression coefficient associated with certain predictor X change, in magnitude or even sign, when the predictors X are considered together in a multiple linear regression model instead of separately in relation to the response variable Y. Is that correct?
However, is it “normal” for the coefficients’ sign and for the p-value for the same predictor X to change when switching to a different model? A change of coefficient sign for X would indicate an opposite behavior between X and Y in going from a simple to a multiple regression model…

— As a rule of thumb, before creating a multiple regression model involving Y, X1, X2, X3, would it be recommended and useful to first create the simple regression models, i.e. Y = beta1*X1 + beta0, Y = beta2*X2 + beta0, Y = beta3*X3 + beta0, and compute their regression coefficients? Or should we jump straight to the multivariate model Y = beta1*X1 + beta2*X2 + beta3*X3 and evaluate the regression coefficients at that point?

2. Multicollinearity

— Multicollinearity affects the interpretability (impacts the accuracy of the regression coefficients) of our model but not its predictive power. Multicollinearity can have different sources: it can originate from the data itself but also from the structure of the model. For example, model Y = beta1*X1 + beta2* X1^2 + beta3 * (X1*X2) has interdependent terms. Surprisingly, multicollinearity would NOT be present between X1 and the term X1^2 even if they are quadratically dependent…Is that correct? What about the interaction term (X1*X3) and the term X1? Or would multicollinearity be present and only be reduced if we mean center the variables X1, X2, X3?

— Correlation means linear dependence. Is multicollinearity only caused by the presence of linear dependence or do other types of dependence (curvilinear, etc.) between predictors also cause collinearity in the model?

3. Variable Transformation (feature scaling)

— Certain statistical models require their input predictors to be scaled before they can be used to build the model itself, so the variables can all be on equal footing.
Some models, like decision trees, don’t require scaling at all. Scaling variables (linear or nonlinear scaling) is generally useful when the involved variables, the Xs and the Y, have very different ranges. My understanding is that predictor variables with large ranges would automatically receive large regression coefficients even if their relative importance is lower in comparison to other predictors. Is that correct and true for most models?

— In some cases, scaling seems optional and only improves the interpretability of the association between Y and the Xs (computed correlation may be tiny only and we can increase it by scaling the variables). That said, is the scaling of the predictors X and/or response variable Y necessary and critical for the creation of an accurate and correct simple or multiple linear regression model? Or does scaling only help with the interpretability of the regression coefficients?

My reply:

1a. When you add predictors to a model, you can expect the coefficients of the original predictors to change. Once they can change, yes, they can change sign: there’s nothing special about zero. As for statistical significance and p-values: sure, they can change too. One way to see this is to imagine N = 1 million: then even small changes in the coefficients will correspond to huge changes in p-values.
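Here’s a quick simulation in R, with made-up numbers, of a coefficient flipping sign when a correlated predictor is added:

# Toy simulation: the marginal coefficient on x1 is positive, but after
# adjusting for the correlated predictor x2 it flips to its true negative value.
set.seed(123)
n  <- 1000
x2 <- rnorm(n)
x1 <- 0.9*x2 + 0.3*rnorm(n)      # x1 is strongly correlated with x2
y  <- 2*x2 - 0.5*x1 + rnorm(n)   # true coefficient on x1 is negative

coef(lm(y ~ x1))        # positive slope in the simple regression
coef(lm(y ~ x1 + x2))   # negative slope once x2 is included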

1b. Yes, when you fit a big model, I recommend fitting a series of little models to build up to it, and you can look at how the predictions of interest change as the model builds up.

2. I’m not completely sure about your questions on multicollinearity. To put it another way: the answers to these questions are not obvious to me, and I recommend figuring these out by just simulating them in R.
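For example, here’s the kind of simulation I have in mind for the question about X1 and X1^2, again with made-up data:

# Toy simulation: x1 and x1^2 can be highly correlated when x1 is centered
# far from zero, and mean-centering removes most of that correlation.
set.seed(1)
x1 <- rnorm(1000, mean = 5, sd = 1)
x2 <- rnorm(1000, mean = 3, sd = 1)

cor(x1, x1^2)       # close to 1
cor(x1, x1*x2)      # also substantial when the inputs aren't centered

x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
cor(x1c, x1c^2)     # near zero after centering
cor(x1c, x1c*x2c)   # also near zero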

3. I think that scaling is important for the interpretation of parameters (see here) and if you’re going to use Bayesian priors (see here). There’s also nonlinear scaling (logs, etc.) or combining predictors, which will change your model entirely.
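To illustrate that point about interpretation, here’s a tiny R example (made-up data) showing that linearly rescaling a predictor changes the coefficient but not the fit:

# Toy example: standardizing a predictor changes the scale of its
# coefficient but leaves the fitted values (and predictions) unchanged.
set.seed(3)
x <- rnorm(100, 170, 10)      # say, height in centimeters
y <- 2 + 0.05*x + rnorm(100)
fit_raw <- lm(y ~ x)
fit_std <- lm(y ~ scale(x))   # standardized predictor

coef(fit_raw)[2]              # slope per centimeter
coef(fit_std)[2]              # slope per standard deviation of x
all.equal(fitted(fit_raw), fitted(fit_std))   # TRUE: same fit either way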

When your regression model has interactions, do you need to include all the corresponding main effects?

Jeff Gill writes:

For some reason the misinterpretations about interactions in regression models just won’t go away. I teach the point that mathematically and statistically one doesn’t have to include the main effects along with the multiplicative component, but if you leave them out it should be because you have a strong theory supporting this decision (i.e. GDP = Price * Quantity, in rough terms). Yet I got this email from a grad student yesterday:

As I was reading the book, “Introduction to Statistical Learning,” I came across the following passage. This book is used in some of our machine learning courses, so perhaps this is where the idea of leaving the main effects in the model originates. Maybe you can send these academics a heartfelt note of disagreement.

“The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. In other words, if the interaction between X1 and X2 seems important, then we should include both X1 and X2 in the model even if their coefficient estimates have large p-values. The rationale for this principle is that if X1 × X2 is related to the response, then whether or not the coefficients of X1 or X2 are exactly zero is of little interest. Also X1 × X2 is typically correlated with X1 and X2, and so leaving them out tends to alter the meaning of the interaction.”

(James, G., Witten, D., Hastie, T., and Tibshirani, R., 2013. An Introduction to Statistical Learning: with Applications in R. Springer, New York.)

There are actually two errors here. It turns out that the most cited article in the history of the journal Political Analysis was about interpreting interactions in regression models, and there are seemingly many other articles across various disciplines. I still routinely hear the “rule of thumb” in the quote above.

To put it another way, suppose you start with the model with all the main effects and interactions, and then you consider the model including the interactions but excluding one or more main effects. You can think of this smaller model in two ways:

1. You could consider it as the full model with certain coefficients set to zero, which in a Bayesian sense could be considered as very strong priors on these main effects, or in a frequentist sense could be considered as a way to lower variance and get more stable inferences by not trying to estimate certain parameters.

2. You could consider it as a different model of the world. This relates to Jeff’s reference to having a strong theory. A familiar example is a model of the form, y = a + b*t + error, with a randomly assigned treatment z that occurs right after time 0. A natural model is then, y = a + b*t + c*z*t + error. You’d not want to fit the model, y = a + b*t + c*z + d*z*t + error—except maybe as some sort of diagnostic test—because, by design, the treatment cannot affect y at time 0.
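Here’s a small simulated example of that design-based argument, with hypothetical numbers:

# Toy example: treatment z is assigned at time 0 and its effect grows with
# time, so the model with only the z*t interaction matches the design.
set.seed(2)
n <- 200
t <- runif(n, 0, 10)
z <- rbinom(n, 1, 0.5)
y <- 1 + 0.5*t + 0.3*z*t + rnorm(n)   # no effect of z at t = 0, by construction

coef(lm(y ~ t + I(z*t)))   # interaction only: matches how the data were generated
coef(lm(y ~ t*z))          # also estimates a main effect for z (near zero here)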

I have three problems with the above-quoted passage. The first is the “even if the p-values” bit. There’s no good reason, theoretically or practically, that p-values should determine what is in your model. So it seems weird to refer to them in this context. My second problem is where they say, “whether or not the coefficients of X1 or X2 are exactly zero is of little interest.” In all my decades of experience, whether or not certain coefficients are exactly zero is never of interest! I think the problem here is that they’re trying to turn an estimation problem (fitting a model with interactions) into a hypothesis testing problem, and I think this happened because they’re working within an old-fashioned-but-still-dominant framework in theoretical statistics in which null hypothesis significance testing is fundamental. Finally, calling it a “hierarchical principle” seems to be going too far. “Hierarchical heuristic,” perhaps?

That all said, usually I agree with the advice that, if you include an interaction in your model, you should include the corresponding main effects too. Hmmm . . . let’s see what we say in Regression and Other Stories . . . section 10.3 is called Interactions, and here’s what we’ve got . . .

We introduce the concept of interactions in the context of a linear model with a continuous predictor and a subgroup indicator:

Figure 10.3 suggests that the slopes differ substantially. A remedy for this is to include an interaction . . . that is, a new predictor defined as the product of these two variables. . . . Care must be taken in interpreting the coefficients in this model. We derive meaning from the fitted model by examining average or predicted test scores within and across specific subgroups. Some coefficients are interpretable only for certain subgroups. . . .

An equivalent way to understand the model is to look at the separate regression lines for [the two subgroups] . . .

Interactions can be important, and the first place we typically look for them is with predictors that have large coefficients when not interacted. For a familiar example, smoking is strongly associated with cancer. In epidemiological studies of other carcinogens, it is crucial to adjust for smoking both as an uninteracted predictor and as an interaction, because the strength of association between other risk factors and cancer can depend on whether the individual is a smoker. . . . Including interactions is a way to allow a model to be fit differently to different subsets of data. . . . Models with interactions can often be more easily interpreted if we preprocess the data by centering each input variable about its mean or some other convenient reference point.

We never actually get around to giving the advice that, if you include the interaction, you should usually be including the main effects, unless you have a good theoretical reason not to. I guess we don’t say that because we present interactions as flowing from the main effects, so it’s kind of implied that the main effects are already there. And we don’t have much in Regression and Other Stories about theoretically-motivated models. I guess that’s a weakness of our book!

What data to include in an analysis? Not always such an easy question. (Elliott Morris / Nate Silver / Rasmussen polls edition)

Someone pointed me to a recent post by Nate Silver, “Polling averages shouldn’t be political litmus tests, and they need consistent standards, not make-it-up-as-you-go,” where Nate wrote:

The new Editorial Director of Data Analytics at ABC News, G. Elliott Morris, who was brought in to work with the remaining FiveThirtyEight team, sent a letter to the polling firm Rasmussen Reports demanding that they answer a series of questions about their political views and polling methodology or be banned from FiveThirtyEight’s polling averages, election forecasts and news coverage. I found several things about the letter to be misguided. . . .

First, I strongly oppose subjecting pollsters to an ideological or political litmus test. . . . Why, unless you’re a dyed-in-the-wool left-leaning partisan, would having a “relationship with several right-leaning blogs and online media outlets” lead one to “doubt the ethical operation of the polling firm”? . . .

Rasmussen has indeed had strongly Republican-leaning results relative to the consensus for many years. Despite that strong Republican house effect, however, they’ve had roughly average accuracy overall because polls have considerably understated Republican performance in several recent elections (2014, 2016, 2020). . . . Is that a case of two wrongs making a right — Rasmussen has had a Republican bias, but other polls have had a Democratic bias, so they come out of the wash looking OK? Yeah, probably. Still, there are ways to adjust for that — statistical ways like a house effects adjustment . . .

Second, even if you’re going to remove Rasmussen from the averages going forward, it’s inappropriate to write them out of the past . . . It’s bad practice to revise data that’s already been published, based on decisions you made long after that data was published. For one thing, it makes your numbers less reliable as a historical record. For another, it can lead to overconfidence when using that data to train or build models. . . .

Third, I think it’s clear that the letter is an ad hoc exercise to exclude Rasmussen, not an effort to develop a consistent set of standards. . . . The thing about running a polling average is that you need a consistent and legible set of rules that can be applied to hundreds of pollsters you’ll encounter over the course of an election campaign. Going on a case-by-case basis is a) extremely time-consuming . . . and b) highly likely to result in introducing your own biases . . . Perhaps Morris’s questions were getting at some larger theme or more acute problem. But if so, he should have stated it more explicitly in his letter. . . .

Nate raises several interesting questions here:

1. Is there any good reason for a relationship with “right-leaning” outlets such as Fox News and Steve Bannon to cause one to “doubt the ethical operation of the polling firm”?

2. Does it ever make sense to remove a biased poll, rather than including it in your analysis with a statistical correction?

3. If you are changing your procedure going forward, is it a mistake to make those changes retroactively on past work?

4. Is it appropriate to send a letter to one polling organization without going through the equivalent process with all the other pollsters whose data you’re using?

Any followups?

I’ll go through the above questions one at a time, but first I was curious if Nate or Elliott had said anything more on the topic.

I found these two items on twitter:

This from Elliott: “asking pollsters detailed methodological questions is not (or shouldn’t be!) controversial. it’s standard practice in most media organizations, and aggregators should probably even be publishing responses for the public and using them as a way to gauge potential measurement error,” linking to a list of questions that CNN asks of all pollsters.

This from Nate, referring to Elliott’s letter to Rasmussen as a “Spanish Inquisition” and linking to this article from the Washington Examiner which, among other things, reported this from a Rasmussen poll:

Whaaaaa? As a check, I googled *abortion roe wade polling* and found some recent items:

Gallup: “As you may know, the Supreme Court overturned its 1973 Roe versus Wade decision concerning abortion, meaning there is no Constitutional protection for abortion rights and each state could set its own laws to allow, restrict or ban abortions. Do you think overturning Roe versus Wade was a good thing or a bad thing?”: 38% “good thing,” 61% “bad thing,” 1% no opinion.

CBS/YouGov: “Last year, the U.S. Supreme Court ended the constitutional right to abortion by overturning Roe v. Wade. Do you approve or disapprove of the Court overturning Roe v. Wade?”: 44% “approve,” 56% “disapprove.”

USA Today (details here): “It’s been a year since the Supreme Court overturned the Roe v. Wade decision, eliminating a constitutional right to an abortion at some stages of pregnancy. Do you support or oppose the court decision to overturn Roe v. Wade?”: 30% “support,” 58% “oppose,” 12% undecided.

There’s other polling out there, all pretty much consistent with the above. And then there’s Rasmussen, which stands out. Would I want to include Rasmussen’s “Majority Now Approve SCOTUS Abortion Ruling” in a polling average? I’m not sure.

Some of it could be their question wording: “Last year, the Supreme Court overturned the 1973 Roe v. Wade decision, so that each state can now determine its own laws regarding abortion. Do you approve or disapprove of the court overturning Roe v. Wade?” This isn’t far from the Gallup question, but it does remove the “Constitutional protection” phrase, and I guess that could make a difference. Also, they’re just counting “likely voters,” and much could depend on where those respondents come from.

Whether or not it makes sense to take the Rasmussen organization seriously (I remain concerned about their numbers that added up to 108%), I think it’s kinda journalistic malpractice for the Washington Examiner to report their claim of “Support for overturning Roe v. Wade is up since last year. 52% to 44%, US likely voters approve,” without even noting how much that disagrees with all other polling out there. My first thought was that, yeah, the Washington Examiner is a partisan outlet, but even partisans benefit from accurate news, right? I guess the point is that the role of an operation such as the Washington Examiner is not so much to inform readers as to circulate talking points and get them out into the general discussion—indeed, thanks to Nate and then me, it happened here!

1. Is there any good reason for a relationship with “right-leaning” outlets such as Fox News and Steve Bannon to cause one to “doubt the ethical operation of the polling firm”?

OK, now on to Nate’s questions. First, should we doubt the ethics of a pollster who hangs out with Fox News and Steve Bannon? My answer here is . . . it depends!

On one hand, . . . Should we discredit my statistical work because I teach at Columbia University, an institution whose most famous professor was Dr. Oz and which notoriously promulgated false statistics for its college rankings? Lots of people teach at Columbia; similarly, lots of people go on Fox News: there’s an appeal to reaching an audience of millions. Going on Fox might be a bad idea, but does it cast doubt on a pollster’s ethics?

As I said, it depends. If a pollster or quantitative social scientist is consistently using crap statistics to push election denial, then, yes, I do doubt their ethics. The relevant point here is not that Fox and Bannon are “right-leaning” but rather that they’ve been fueling election denial misinformation, and distorted election statistics are part of the process.

So, yeah, I agree with Nate that Elliott’s phrase, “several right-leaning blogs and online media outlets,” doesn’t tell the whole story—as Nate put it, “Perhaps Morris’s questions were getting at some larger theme or more acute problem.” There is a larger theme and more acute problem, and that’s refuted claims about the election that have been endorsed by major political and media figures. Given what Rasmussen’s been doing in this area, I think Nate’s been a bit too quick to take their side of the story on this, to refer to Elliott’s inquiries as an “inquisition,” etc. You don’t have to be a “dyed-in-the-wool left-leaning partisan” to doubt the ethical operation of a polling firm that is promoting lies about the election.

How close does a pollster need to be to election deniers so that I don’t trust it at all? I don’t know. I guess it depends on context, which is a good reason for Elliott to ask specific questions to Rasmussen about their polling methodology. If they’re open about what they’re doing, that’s a good sign; if they give no details, that’s gonna make it harder to trust them. Rasmussen has no duty to respond to those questions, Fivethirtyeight has no duty to include its polls in their analyses, etc etc all down the line.

2. Does it ever make sense to remove a biased poll, rather than including it in your analysis with a statistical correction?

Discarding a data point is equivalent to including it but giving it a weight of zero or, from a Bayesian point of view, allowing it to be biased with an infinite-variance prior on the bias. So we can transform Nate’s very reasonable implied question (why discard Rasmussen polls? Why not just include your skepticism in your model?) into the question: Why not just give the Rasmussen polls a very small weight or, from a Bayesian point of view, allow them to have a bias that has a very large uncertainty?

There are two answers here. The first is that if the weight is very small or the bias has a huge uncertainty, then it’s pretty much equivalent to not including the survey at all. Remember 13. The second answer is that if these surveys are really being manipulated, then there’s no reason to think the bias is consistent. To put it another way: if you don’t think the Rasmussen polls are providing useful information, then you might not want to include them for the same reason that you wouldn’t include a rotten onion in your stew. Sure, one bad onion won’t destroy the taste—it’ll be diluted amid all the other flavors (including those of all the non-rotten onions you’ve thrown in)—but what’s the point?
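To make that first answer concrete, here’s a little numerical sketch (invented numbers, not a real polling model) of how giving a poll a huge bias uncertainty is effectively the same as dropping it:

# Precision-weighted average where each poll gets weight
# 1 / (sampling variance + bias variance). All numbers are invented.
est     <- c(0.52, 0.51, 0.47)   # third poll is the one we're skeptical of
samp_sd <- c(0.02, 0.02, 0.02)

combine <- function(est, samp_sd, bias_sd) {
  w <- 1 / (samp_sd^2 + bias_sd^2)
  sum(w * est) / sum(w)
}

combine(est, samp_sd, bias_sd = c(0.01, 0.01, 0.01))      # all three polls count
combine(est, samp_sd, bias_sd = c(0.01, 0.01, 10))        # huge bias uncertainty: poll 3 barely matters
combine(est[1:2], samp_sd[1:2], bias_sd = c(0.01, 0.01))  # literally dropping poll 3: nearly the same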

This second answer is as much procedural as substantive: by excluding a pollster entirely, Fivethirtyeight is saying they don’t want to be using numbers that they can’t, on some level, trust. They’re making the procedural point that they have some rules for the polls they include, some red lines that cannot be crossed.

From the other direction, Nate’s plea for Fivethirtyeight to continue including Rasmussen’s polls in its analyses is also a procedural and perception-based argument: he’s making the procedural point that “you need a consistent and legible set of rules” and can’t be making case-by-case decisions.

The funny thing is . . . Nate and Elliott are kind of saying the same thing! Elliott’s saying they’ll be removing Rasmussen unless they follow the rules and Nate’s saying that too. I looked up Fivethirtyeight’s rules for pollsters from when Nate was running the organization and it says “Pollsters must also be able to answer basic questions about their methodology, including but not limited to the polling medium used (e.g., landline calls, text, etc.), the source of their voter files, their weighting criteria, and the source of the poll’s funding.” And they don’t include “‘Nonscientific’ polls that don’t attempt to survey a representative sample of the population or electorate.” So I guess a lot depends on the details; see item 4 below.

3. If you are changing your procedure going forward, is it a mistake to make those changes retroactively on past work?

I have a lot of sympathy for Nate's argument here. He created the Fivethirtyeight polling averages, then combined this with his interest in sports analytics, worked his butt off for over a decade . . . and now the new team is talking about changing things. It would be kind of like if CRC Press hired someone to create a fourth edition of Bayesian Data Analysis, and the new author decided to remove chapter 6 because it didn't match his philosophy. I'd be furious! OK, that's not a perfect analogy because my coauthors and I have copyright on BDA, but the point is that Nate was Fivethirtyeight for a while, so it's frustrating to think of the historical record being changed.

That said, it’s not clear to me that Elliott is planning to change the historical record. From his quoted letter: “If banned, Rasmussen Reports would also be removed from our historical averages of polls and from our pollster ratings. Your surveys would no longer appear in reporting and we would write an article explaining our reasons for the ban.” It could be that the polls would still be in the database, just flagged and not included in the averages. I think that would be OK.

To put it another way, I think it’s ok to go back and clean up old data, as long as you’re transparent about it.

From a slightly different angle, Nate writes, “There’s also an implicit conflict here about the degree to which journalists should gatekeep or shield the public from potential sources of ‘misinformation.'” I’m not exactly sure of Elliott’s motivations here, but my guess is that his goal is not so much to “shield the public” but rather to come up with more accurate forecasts. Nate argues that including a Republican-biased poll should lead to more accurate forecasts by balancing other polls with systematic polling errors favoring the Democrats. I guess that if Fivethirtyeight going forward is not going to include Rasmussen polls, they’ll have to adjust for possible systematic errors in some other way. That would make sense to me, actually. If you do want to adjust for the possibility of errors on the scale of 2016 or 2020 (polls that showed the Democrats getting approximately 2.5 percentage points more support than they actually received in the vote), then it would make sense to make that adjustment straight up, without relying on Rasmussen to do it for you.

4. Is it appropriate to send a letter to one polling organization without going through the equivalent process with all the other pollsters whose data you’re using?

I have no idea what’s been going on between Fivethirtyeight and Rasmussen and between Fivethirtyeight and other polling organizations. The quoted letter from Elliott to Rasmussen begins, “I am emailing you to send a final notice . . .”, so it seems safe to assume this is just one in a series of communications, and we haven’t seen the others that came before.

Nate writes, “I think it’s clear that the letter is an ad hoc exercise to exclude Rasmussen, not an effort to develop a consistent set of standards.” My guess is that it’s neither an ad hoc exercise to exclude Rasmussen, nor an effort to develop a consistent set of standards, but rather that it’s an effort to apply an imperfect set of standards. Rules such as “Pollsters must also be able to answer basic questions about their methodology, including but not limited to . . .” and “‘Nonscientific’ polls that don’t attempt to survey a representative sample” are imperfect—but that’s the nature of rules.

I guess what I’m saying is that it’s hard to compare Fivethirtyeight’s interactions with Rasmussen with their interactions with other pollsters, given that (a) we don’t know what their interactions with Rasmussen are, and (b) we don’t know what their interactions with other pollsters are.

Let me just say that this sort of thing is always challenging, as there’s no way to have completely consistent rules. For example, we have good reasons to doubt that Brian Wansink ever used his famous bottomless soup bowl in any actual experiment. Do we apply this level of scrutiny to the apparatus described in every peer-reviewed research article? No, first because this would require an immense amount of effort, and second because “this level of scrutiny” is not even defined. It’s judgment calls all the way down. Fivethirtyeight has a necessarily ambiguous policy on what polls they will include in their analyses—there’s no way for such a policy to not have some ambiguity—and Nate and Elliott are making different judgment calls on whether Rasmussen violates the policy.

Having this discussion

Unfortunately there hasn’t been much of a conversation on this poll-inclusion issue, which I guess is no surprise given that Nate (indirectly) called Elliott a bullshitter and explicitly writes, “I don’t intend this a back-and-forth.” Which is too bad, given that we’ve had good conversations on forecasting before.

It’s easier for me to have this discussion because I know both Nate and Elliott. I don’t know either of them well on a personal level, but I’ve collaborated with both of them (for example, here and here) and I think they both do great work. I’ve criticized Nate’s forecasting procedure; then again, I’ve also criticized Elliott’s, even though (or especially because) it was done in collaboration with me.

To say I like both of them is not an attempt to put myself above the fray or to characterize their disagreements as minor. People often get themselves into positions where they are legitimately angry at each other—it’s happened to me plenty of times! The main point of the present post is that the decisions Elliott is making regarding which polls to include in his analysis, and the questions that Nate is asking, are challenging, with no easy answers.

P.S. Here’s a brief summary of statistical concerns with the 2020 presidential election forecasts from the Economist and Fivethirtyeight. tl;dr: both had problems, in different ways.

A quote on data transparency—from 1662!

Michael Nelson writes:

Recently, I came across a quote in Irwin (1935), taken from 19th century sources on Graunt (1662). Todhunter said of Graunt, who apparently was the first (English?) person to get the idea of using data gathered on the plague to compute and publish life tables, that:

Graunt was careful to publish with his deductions the actual returns from which they were obtained, comparing himself, when so doing, to “a silly schoolboy coming to say his lesson to the world (that peevish and tetchie master) who brings a bundle of rods, wherewith to be whipped for every mistake he has committed.” Many subsequent writers have betrayed more fear of the punishment they might be liable to on making similar disclosures, and have kept entirely out of sight the sources of their conclusions. The immunity they have thus purchased from contradiction could not be obtained but at the expense of confidence in their results.

I have a new hero.

Those 1662 dudes, they knew what they were talking about.

Do Ultra-Processed Data Cause Excess Publication and Publicity Gain?

Ethan Ludwin-Peery writes:

I was reading this paper today, Ultra-Processed Diets Cause Excess Calorie Intake and Weight Gain (here, PDF attached), and the numbers they reported immediately struck me as very suspicious.

I went over it with a collaborator, and we noticed a number of things that we found concerning. In the weight gain group, people gained 0.9 ± 0.3 kg (p = 0.009), and in the weight loss group, people lost 0.9 ± 0.3 kg (p = 0.007). These numbers are identical, which is especially suspicious since the sample size is only 20, which is small enough that we should really expect more noise. What are the chances that there would be identical average weight loss in the two conditions and identical variance? We also think that 0.3 kg is a suspiciously low standard error for weight fluctuation.

They also report that weight changes were highly correlated with energy intake (r = 0.8, p < 0.0001). This correlation coefficient seems suspiciously high to us. For comparison, the BMI of identical twins is correlated at about r = 0.8, and about r = 0.9 for height. Their data is publicly available here, so we took a look and found more to be concerned about. They report participant weight to two decimal places in kilograms for every participant on every day. Kilograms to two decimal places should be pretty sensitive (an ounce of water is about 0.02 kg), but we noticed that there were many cases where the exact same weight appeared for a participant two or even three times in a row. For example participant 21 was listed as having a weight of exactly 59.32 kg on days 12, 13, and 14, participant 13 was listed as having a weight of exactly 96.43 kg on days 10, 11, and 12, and participant 6 was listed as having a weight of exactly 49.54 kg on days 23, 24, and 25.

In fact this last case is particularly egregious, as 49.54 kg is exactly one kilogram less, to two decimal places, than the baseline for this participant’s weight when they started, 50.54 kg. Participant 6 only ever seems to lose or gain weight in increments of 0.10 kilograms. Similar patterns can also be seen in the data of other participants.
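
For anyone who wants to poke at the posted spreadsheets themselves, here's a rough sketch of the kind of scan Ludwin-Peery describes. The column names (subject, day, weight_kg) are hypothetical placeholders, and the little table below is made-up data echoing the patterns described above, not the actual file:

```python
import pandas as pd

# In practice you'd load one of the posted spreadsheets, e.g.
#   df = pd.read_csv("daily_weights.csv")   # hypothetical file and column names
# Here is a tiny made-up example with the assumed columns: subject, day, weight_kg.
df = pd.DataFrame({
    "subject":   [6] * 5 + [21] * 5,
    "day":       list(range(21, 26)) + list(range(10, 15)),
    "weight_kg": [49.74, 49.64, 49.54, 49.54, 49.54,
                  59.42, 59.32, 59.32, 59.32, 59.22],
})
df = df.sort_values(["subject", "day"])

# Flag days where a participant's weight (to two decimals) exactly equals the
# previous day's value -- the repeated-weight pattern described above.
prev = df.groupby("subject")["weight_kg"].shift()
df["same_as_prev"] = df["weight_kg"].round(2).eq(prev.round(2))
print(df[df["same_as_prev"]].groupby("subject").size())

# Day-to-day changes: do they cluster on suspiciously round increments (e.g. 0.10 kg)?
df["delta"] = df.groupby("subject")["weight_kg"].diff().round(2)
print(df["delta"].value_counts())
```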

We haven’t looked any deeper yet because we think this is already cause for serious concern. It looks a lot like heavily altered or even fabricated data, and we suspect that as we look closer, we will find more red flags. Normally we wouldn’t bother but given that this is from the NIH, it seemed like it was worth looking into.

What do you think? Does this look equally suspicious to you?

He and his sister Sarah followed up with a post, also there are posts by Nick Brown (“Some apparent problems in a high-profile study of ultra-processed vs unprocessed diets”) and Ivan Oransky (“NIH researcher responds as sleuths scrutinize high-profile study of ultra-processed foods and weight gain”).

I don’t really have anything to add on this one. Statistics is hard, data analysis is hard, and when research is done on an important topic, it’s good to have outsiders look at it carefully. So good all around, whatever happens with this particular story.

“Nobody’s Fool,” by Daniel Simons and Christopher Chabris

This new book, written by two psychology researchers, is an excellent counterpart to Lying for Money by economist Dan Davies, a book that came out a few years ago but which we happened to have discussed recently here. Both books are about fraud.

Davies gives an economics perspective, asking what are the conditions under which large frauds will succeed, and he focuses on the motivations of the fraudsters: often they can’t get off the fraud treadmill once they’re on it. In contrast, Simons and Chabris focus on the people who get fooled by frauds; the authors explain how it is that otherwise sensible people can fall for pitches that are, in retrospect, ridiculous. The two books are complementary, one focusing on supply and one on demand.

My earlier post was titled “Cheating in science, sports, journalism, business, and art: How do they differ?” Nobody’s Fool had examples from all those fields, and when they told stories that I’d heard before, their telling was clear and reasonable. When a book touches on topics where the reader is an expert, it’s a good thing when it gets it right. I only wish that Simons and Chabris had spent some time discussing the similarities and differences of cheating in these different areas. As it is, they mix in stories from different domains, which makes sense from a psychology perspective of the mark (if you’re fooled, you’re fooled) but gives less of a sense of how the different frauds work.

For the rest of this review I’ll get into some different interesting issues that arose in the book.

Predictability. On p.48, Simons and Chabris write, “we need to ask ourselves a somewhat paradoxical question: ‘Did I predict this?’ If the answer is ‘Yes, this is exactly what I expected,’ that’s a good sign that you need to check more, not less.” I see what they’re saying here: if a claim is too good to be true, maybe it’s literally too good to be true.

On the other hand, think of all the junk science that sells itself on how paradoxical it is. There’s the whole Freakonomics contrarianism thing. The whole point of contrarianism is that you’re selling people on things that were not expected. If a claim is incredible, maybe it’s literally incredible. Unicorns are beautiful, but unicorns don’t exist.

Fixed mindsets. From p.61 and p.88, “editors and reviewers often treat the first published study on a topic as ‘correct’ and ascribe weaker or contradictory results in later studies to methodological flaws or incompetence. . . . Whether an article has been peer-reviewed is often treated as a bright line that divides the preliminary and dubious from the reliable and true.” Yup.

There’s also something else, which the authors bring up in the book: challenging an existing belief can be costly. It creates motivations for people to attack you directly; also it seems to me that the standards for criticism of published papers are often much higher than for getting the original work accepted for publication in the first place. Remember what happened to the people who squealed on Lance Armstrong? He attacked them. Or that Holocaust denier who sued his critic? The kind of person who is unethical enough to cheat could also be unethical enough to abuse the legal system.

This is a big deal. Yes, it’s easy to get fooled. And it’s even easier to get fooled when there are social and legal structures that can make it difficult for frauds to publicly be revealed.

Ask more questions. This is a good piece of advice, a really important point that I’d never thought about until reading this book. Here it is: “When something seems improbable, that should prompt you to investigate by asking more questions [emphasis in the original]. These can be literal questions . . . or they can be asked implicitly.”

Such a good point. Like so many statisticians, I obsess on the data in front of me and don’t spend enough time thinking about gathering new data. Even something as simple as a simulation experiment is new data.

Unfortunately, when it comes to potential scientific misconduct, I don’t usually like asking people direct questions—the interaction is just too socially awkward for me. I will ask open questions, or observe behavior, but that’s not quite the same thing. And asking direct questions would be even more difficult in a setting where I thought that actual fraud was involved. I’m just more comfortable on the outside, working with public information. This is not to disagree with the authors’ advice to ask questions, just a note that doing so can be difficult.

The fine print. On p.120, they write, “Complacent investors sometimes fail to check whether the fine print in an offering matches the much shorter executive summary.” This happens in science too! Remember the supposedly “long-term” study that actually lasted only three days? Or the paper whose abstract concluded, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” even though the study itself had no data whatsoever on people “becoming more powerful”? Often the title has things that aren’t in the abstract, and the abstract has things that aren’t in the paper. That’s a big deal considering: (a) presumably many many more people read the title than the abstract, and many many more people read the abstract than the paper, (b) often the paper is paywalled so that all you can easily access are the title and abstract.

The dog ate my data. From p.123: “Many of the frauds that we have studied involved a mysterious, untimely, or convenient disappearance of evidence.” Mary Rosh! I’m also reminded of Dan Davies’s famous quote, “Good ideas do not need lots of lies told about them in order to gain acceptance.”

The butterfly effect. I agree with Simons and Chabris to be wary of so-called butterfly effects: “According to the popular science cliché, a butterfly flapping its wings in Brazil can cause a tornado in Texas.” I just want to clarify one thing which we discuss further in our paper on the piranha problem. As John Cook wrote in 2018:

The butterfly effect is the semi-serious claim that a butterfly flapping its wings can cause a tornado half way around the world. It’s a poetic way of saying that some systems show sensitive dependence on initial conditions, that the slightest change now can make an enormous difference later. . . . The lesson that many people draw from their first exposure to complex systems is that there are high leverage points, if only you can find them and manipulate them. They want to insert a butterfly at just the right time and place to bring about a desired outcome.

But, Cook explains, that idea is wrong. Actually:

Instead, we should humbly evaluate to what extent it is possible to steer complex systems at all. . . . The most effective intervention may not come from tweaking the inputs but from changing the structure of the system.

To the extent the butterfly effect is a real thing, the point is that small interventions can very occasionally have large and unpredictable results. This is pretty much the opposite of junk social science of the “priming” or “nudge” variety—for example, the claim that flashing a subliminal smiley face on a computer screen will induce large changes in attitudes toward immigration—which posit reliable and consistent effects from such treatments. That is: if you really take the butterfly idea seriously, you should disbelieve studies that purport to demonstrate those sorts of bank-shot claims about the world.
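
As a toy illustration of that distinction (a sketch, not anything from the piranha paper): in the logistic map, a standard chaotic system, a tiny nudge to the initial condition eventually produces a large change in where the trajectory ends up, but the direction and size of that change are unpredictable rather than reliable:

```python
import numpy as np

def logistic_trajectory(x0, r=3.9, steps=60):
    """Iterate the logistic map x <- r*x*(1-x), a standard toy chaotic system."""
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

rng = np.random.default_rng(0)
effects = []
for _ in range(1000):
    x0 = rng.uniform(0.1, 0.9)
    nudge = 1e-6                     # a tiny "butterfly" intervention
    effects.append(logistic_trajectory(x0 + nudge) - logistic_trajectory(x0))

effects = np.array(effects)
print("mean effect of the nudge:  ", round(effects.mean(), 4))
print("sd of the effect:          ", round(effects.std(), 4))
print("share of positive effects: ", round((effects > 0).mean(), 3))
# Typical result: mean near zero, sd large, positives near 50% --
# big consequences sometimes, but nothing like a reliable, consistent effect.
```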

Clarke’s Law

One more thing.

In his book, Davies talks about fraud in business. There’s not a completely sharp line dividing fraud from generally acceptable sharp business practices; still, business cheating seems like a clear enough topic that it can make sense to write a book about “Lying for Money,” as Davies puts it.

As discussed above, Simons and Chabris talk about people being fooled by fraud in business but also in science, art, and other domains. In science in particular, it seems to me that being fooled by fraud is a minor issue compared to the much larger problem of people being fooled by bad science. Recall Clarke’s law: Any sufficiently crappy research is indistinguishable from fraud.

Here’s the point: Simons and Chabris focus on the people being fooled rather than the people running the con. That’s good. It’s my general impression that conmen are kind of boring as people. Their distinguishing feature is a lack of scruple. Kind of like when we talk about findings that are big if true. And once you’re focusing on people being fooled, there’s no reason to restrict yourself to fraud. You can be just as well fooled by research that is not fraudulent, just incompetent. Indeed, it can be easier to be fooled by junk science that isn’t fraudulent, because various checks for fraud won’t find the problem. That’s why I wrote that the real problem of that nudge meta-analysis is not that it includes 12 papers by noted fraudsters; it’s the GIGO of it all. You know that saying, The easiest person to fool is yourself?

In summary, “How do we get fooled and how can we avoid getting fooled in the future?”, is a worthy topic for a book, and Simons and Chabris did an excellent job. The next step is to recognize that “getting fooled” does not require a conman on the other side. To put it another way, not every mark corresponds to a con. In science, we should be worried about being fooled by honest but bad work, as well as looking out for envelope pushers, shady operators, and out-and-out cheats.

How does Bayesian inference work when estimating noisy interactions?

Alicia Arneson writes:

I am a PhD student at Virginia Tech studying quantitative ecology. This semester, I am taking Deborah Mayo’s Philosophy of Statistics course, so I can’t help but to think more critically about statistical methods in some of the papers I read. To admit my current statistical bias – I do work in a lab that is primarily Bayesian (though this is my first year so I am still somewhat new to it), but Dr. Mayo does have me questioning some aspects of Bayesian practice. One of those questions is the topic of this letter!

Recently, I read a paper that aimed to determine the effect of increased foraging costs on passerine immune function. The experiment seemed really well designed, but I was somewhat frustrated when I got to the statistical analysis section. The authors used Bayesian univariate response models that fit each immune outcome to upwards of 26 parameters that included up to four-way interactions. My initial feeling was that there is no good way to (a) interpret these or (b) to feel at all confident about the results.

In investigating those thoughts, I came across your blog post entitled “You need 16 times the sample size to estimate an interaction than to estimate a main effect.” I thought this was a very interesting read and, while it applies more to frequentist frameworks, I noticed in the comments that you suggested not that we shouldn’t try to estimate interactions, but rather that it would be better to estimate them using a Bayesian approach. I can somewhat understand this suggestion given the examples you used to demonstrate how standard errors can change so much, but what is less clear to me is how Bayes provides a better (or at least more clear) approach when estimating interaction effects.

Therein lies my questions. If you have some time, I am curious to know what you think about:

(a) how a Bayesian approach for estimating interactions is better than doing so under a frequentist methodology, and

(b) can researchers use Bayesian methods to “go too far,” so to speak, when trying to estimate interaction effects that their design would not have captured well (thinking along the lines of classical experimental design and higher order effects being masked when sample sizes are too small), i.e. should a relatively small experiment ever attempt to quantify complex interactions (like a 4-way interaction), regardless of the framework?

Lots to chew on! Here are my responses:

1. As discussed, estimates of interactions tend to be noisy. But interactions are important! Setting them to zero is not always a good solution. The Bayesian approach with zero-centered priors partially pools the interactions toward zero, which can make more sense. (See the small simulation sketch after this list.)

2. We need to be more willing to live with uncertainty. Partial pooling toward zero reduces the rate of “statistical significance”—estimates that are more than two posterior standard deviations from zero—as Francis Tuerlinckx and I discussed in our article from 2000 on Type M and Type S errors. The point is, if you do a Bayesian (or non-Bayesian) estimate, we don’t recommend acting as if non-statistically-significant parameters are zero.

3. I think the Bayesian method will “go too far,” in the sense of apparently finding big things that aren’t really there, if it uses weak priors. With strong priors, everything gets pulled toward zero, and the only things that remain far from zero are those where there is strong evidence.

4. Bayesian or otherwise, design matters! If you’re interested in certain interactions, design your study accordingly, with careful measurement and within-person (or, in your case, within-animal) measurements; see discussion here. There are problems with design and data collection that analysis can’t rescue.

5. To look at it another way, here’s an article from 2000 where we used frequentist analysis of a Bayesian procedure to recommend a less ambitious design, on the grounds that inferences from the more ambitious design would be too noisy to be useful.
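
Here's the sketch promised in point 1, with toy numbers throughout: in a balanced 2×2 between-person design the interaction (the difference in differences) has twice the standard error of a main effect estimated from the same total sample, which—combined with interactions typically being smaller than main effects—is where the factor of 16 comes from; and a zero-centered normal prior pulls a noisy interaction estimate toward zero:

```python
import numpy as np

sigma, N = 1.0, 400          # outcome sd and total sample size (made-up numbers)

# 2x2 between-person design, N/4 people per cell.
se_main        = sigma * np.sqrt(2 / (N / 2))   # difference of two half-sample means
se_interaction = sigma * np.sqrt(4 / (N / 4))   # difference in differences of four cell means
print("SE of main effect: ", round(se_main, 3))
print("SE of interaction: ", round(se_interaction, 3))   # exactly twice as large
# If the interaction is also assumed to be about half the size of the main effect,
# matching its signal-to-noise ratio requires 2^2 * 2^2 = 16 times the sample size.

def shrink(estimate, se, prior_sd):
    """Posterior mean and sd for a normal estimate combined with a N(0, prior_sd^2) prior."""
    post_var = 1 / (1 / se**2 + 1 / prior_sd**2)
    return post_var * estimate / se**2, np.sqrt(post_var)

raw = 0.5                                  # a noisy raw interaction estimate
for prior_sd in [10.0, 0.5, 0.2]:          # weak to strong zero-centered priors
    m, s = shrink(raw, se_interaction, prior_sd)
    print(f"prior sd {prior_sd:4.1f}: posterior mean {m:5.2f}, posterior sd {s:4.2f}")
```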

Some challenges with existing election forecasting methods

With the presidential election season coming up (not that it’s ever ended), here’s a quick summary of the problems/challenges with two poll-based forecasting methods from 2020.

How this post came about: I have a post scheduled about a dispute between election forecasters Elliott Morris and Nate Silver about whether the site Fivethirtyeight.com should be including polls from the Rasmussen organization in their analyses.

At the end of the post I had a statistical discussion about the weaknesses of existing election forecasting methods . . . and then I realized that this little appendix was the most interesting thing in my post!

Whether Fivethirtyeight includes Rasmussen polls is a very minor issue, first because Rasmussen is only one pollster and second because if you do include their polls, any reasonable approach would be to give them a very low weight or a very large adjustment for bias. So in practice for the forecast it doesn’t matter so much if you include those polls, although I can see that from a procedural standpoint it can be challenging to come up with a rule to include or exclude them.

Now for the more important and statistically interesting stuff.

Key issues with the Fivethirtyeight forecast from 2020

They start with a polling average and then add weights and adjustments; see here for some description. I think the big challenge here is that the approach of adding fudge factors makes it difficult to add uncertainty without creating weird artifacts in the joint distribution, as discussed here and here. Relatedly, they don’t have a good way to integrate information from state and national polls. The issue here is not that they made a particular technical error; rather, they’re using a method that starts in a simple and interpretable way but then just gets harder and harder to hold together.
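
Here's a toy simulation—emphatically not Fivethirtyeight's actual procedure—of the kind of artifact I mean: if you widen each state's margin of error by tacking independent fat-tailed noise onto each state, the marginals get appropriately wide, but the between-state correlation drops and you start seeing implausible joint outcomes, such as two similar states splitting their votes far too often:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims = 20000
mean_a, mean_b = 54.0, 53.0           # two similar, fairly safe states (made-up means)

# Version 1: uncertainty comes from a shared national swing plus small state noise.
national = rng.normal(0, 3, n_sims)
a1 = mean_a + national + rng.normal(0, 1, n_sims)
b1 = mean_b + national + rng.normal(0, 1, n_sims)

# Version 2: widen the margins by adding independent fat-tailed noise to each state.
a2 = mean_a + national + rng.standard_t(4, n_sims) * 3
b2 = mean_b + national + rng.standard_t(4, n_sims) * 3

for label, a, b in [("shared swing only", a1, b1), ("independent fudge factors", a2, b2)]:
    corr = np.corrcoef(a, b)[0, 1]
    split = np.mean((a > 50) != (b > 50))   # one state won, the other lost
    print(f"{label:26s}  corr = {corr:.2f}   P(states split) = {split:.3f}")
```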

Key issues with the Economist forecast from 2020

From the other direction, the weakness of the Economist forecast (which I was involved in) was a lack of robustness to modeling and conceptual errors. Consider that we had to overhaul our forecast during the campaign. Also our forecasts had some problems with uncertainties, weird things relating to some choices in how we modeled between-state correlation of polling errors and time trends. I don’t think there’s any reason that a Bayesian forecast should necessarily be overconfident and non-robust to conceptual errors in the model, but that’s what seems to have happened with us. In contrast, the Fivethirtyeight approach was more directly empirical, which as noted above had its own problems but didn’t have a bias toward overconfidence.

Key issues with both forecasts

Both of the 2020 presidential election forecasts had difficulty handling data other than horse-race polls. The challenging information included: economic and political “fundamentals,” which were included in the forecasts but with some awkwardness, in part arising from the fact that these variables themselves change over time during the campaign; known polling biases such as differential nonresponse; knowledge of systematic polling errors in previous elections; issues specific to the election at hand (street protests, covid, Clinton’s email server, Trump’s sexual assaults, etc.); issue attitudes in general, to the extent they were not absorbed into horse-race polling; estimates of turnout; vote suppression; and all sorts of other data sources such as new-voter registration numbers. All these came up as possible concerns with forecasts, and it’s not so easy to include them in a forecast. No easy answers here—at some level we just need to be transparent and people can take our forecasts as data summaries—but these concerns arise in every election.

Why is every action hero named Jack, John, James, or, occasionally, Jason, but never Bill, Bob, or David?

Demetria Glace writes:

I wasn’t the first to make the connection, but once I noticed it, it was everywhere. You walk past a poster for a new movie and think, Why is every action hero named Jack, John, James, or, occasionally, Jason?

I turned to my friends and colleagues, asking desperately if they had also noticed this trend, as I made my case by listing off well-known characters: John Wick, Jason Bourne, Jack Reacher, John McClane, James Bond, Jack Bauer, and double hitter John James Rambo. . . .

As a data researcher, I [Glace] had to get to the bottom of it. What followed was months of categorizing hundreds of action movies, consulting experts in the field of name studies, reviewing academic papers and name databases, and seeking interviews with authors and screenwriters as to the rationale behind their naming decisions. . . .

Good stuff. It’s fun to see a magazine article with the content of a solid blog post.

Don’t get me wrong, I enjoy reading magazines. But magazine articles, even good magazine articles, follow a formula: they start off with a character and maybe an anecdote, then they ease into the main topic, they follow through with a consistent story, ending it all with a pat summary. By contrast, a blog post can start anywhere, go wherever it wants, and, most importantly, does not need to come to a coherent conclusion. The above-linked article on hero names was like that, and I was happy to see it running in Slate.

The vicious circle of corroboration or pseudo-confirmation in science and engineering

Allan Cousins writes:

I have recently been thinking about the way in which professionals come to accumulate “knowledge” over their careers and how that process utilizes (read: abuses) the notion of corroboration. I believe this might be of interest to both of you and so I wanted to see if either of you might have any insights or comments.

In particular, I have been thinking about professional endeavours that have dichotomous outcomes where the range of possibilities is restricted to (or perhaps more accurately, viewed as) it either worked or it did not work. For the purposes of this discussion I will look at structural engineering but I believe the phenomenon I am about to describe is just as applicable to other similarly characterized disciplines. In structural engineering: the structure either stood up or it collapsed, the beam either carried the load or it did not, etc. In my experience there are nearly as many theories of how structures work as there are structural engineers. But this wide range of opinions among structural engineers is certainly not because the underlying concepts are not well understood. That may have been true in 1850 but not today. In fact, structural engineering is quite mature as a field and there are very few concepts (except at the edges of the field) where such a diverse range of thought could be justified.

This begs the question of how could this unsatisfactory state of affairs have come to pass? I have often pondered this but only recently have come to what I think to be a reasonable explanation. First, let us rule out the idea that structural engineering professionals are of below average intelligence (or rather below some required intelligence threshold for such endeavors only known to Omniscient Jones). Under such an assumption I believe that the likely answer to our question comes down to an interplay between industry dynamics, an abuse of the concept of corroboration, and the nature of the outcomes inherent to the field.

Even if engineers have never heard of the concept of Philosophy of Science (and most have not) they are apt to act in ways akin to the typical scientist. That is, they go about their enterprise (designing structures) by continuously evaluating their understanding of the underlying structural mechanics by looking at and seeking out corroborating evidence. However, unlike scientists structural engineers don’t usually have the ability to conduct risky tests (in the popperian sense) in their day to day designs. By definition the predicted outcome of a risky test is likely to be wrong in absence of the posited theory and if structural engineers were routinely conducting such field tests newspaper headlines would be replete with structural engineering failures. But today structural engineering failures are quite rare and when they happen they are usually small in magnitude (one of the greatest structural engineering failures in US history was the Hyatt Regency Walkway collapse and it only caused 114 deaths. For comparison that is about the same number of deaths caused by road accidents in a single DAY in the US). Indeed, building codes and governing standards are codified in such a way that the probability of failure of any given element in a system is quite a rare event (global failure even rarer still). What that means is that even if what a structural engineer believes to be true about the structural systems that they design actually has very little verisimilitude (read: is mostly wrong and to a severe degree) their designs will not fail in practice as long as they follow codified guidelines. It is only when structural engineers move away from the typical (where standard details are the norm and codes contain prescribed modes of analysis / design) where gaps in their understanding become apparent due to observed failures. What this means then is that while the successful outcome of each “test” (each new structural design) is likely to be taken by the designer as corroborating their understanding (in the same sense that it does for the scientist), it does not necessarily even provide the most meager of evidence that the designer has a good grasp of their discipline. In fact, it is possible (though admittedly not overly likely) that a designer has everything backwards and yet their designs don’t fail because of the prescribed nature of governing codes.

The above leaves us with an interesting predicament. It seems clear that structural engineers or others in similarly situated disciplines cannot rely on outcomes to substantiate their understanding. Though in practice that is what they largely do; they are human after all.

This lack of ability to conduct risky tests interplays with industry dynamics and in not a particularly promising way. Those who commission structural designs are unlikely to care about the design itself (except to the extent that it doesn’t fail and doesn’t mess with the intended aesthetic), and as a result, structural engineering tends to be treated like a commodity product where the governing force is price. What that means is that there is an overwhelming pressure to get designs out the door as quickly as possible lest a structural engineering firm lose money on its bid. This pressure all but guarantees that even if senior structural engineers have a good understanding of structural principles the demands for their time leave few hours in the day to be spent on mentorship and review of young engineers’ work product. As a result, young engineers are unlikely to be able to rely on senior engineers to correct their misunderstanding of structural principles. That pretty much leaves only one other avenue for the young engineer to gain true understanding and that is via self-teaching of the literature and the like. However, given the lack of ability to construct risky tests (see above) the self-learning route is apt to lead young structural engineers to think that they have a good understanding of certain concepts (because they see corroborating evidence in their “successful” designs) where that is not the case. Though to be fair to my brethren I am assuming that the average young engineer does not have the ability to discern true engineering principles from the literature on their own without aid. However, I believe this assumption to hold, on average.

This leads to a cycle where young engineers – who have a less than perfect understanding of structural systems that goes unchecked – become senior engineers who in turn are looked up to by a new crop of young engineers. The now senior engineers mentor the young engineers, to the extent time demands allow, and distill their misknowledge to them. Those young engineers eventually become senior. And in the extreme, the cycle repeats progressively until “knowledge” at the most senior levels of the field is almost devoid of any verisimilitude at all. Naturally there will be counterbalancing forces where some verisimilitude is maintained but I do think the cycle, as I have described it, is at least a decent caricature of how things unfold in practice. It’s worth remarking that many on the outside will never see this invisible cycle because it is shielded from them by the fact that structures tend to stand up!

It seems to me that this unfortunate dynamic is likely to play out in any discipline where outcomes are dichotomous in nature and where the unwanted outcome (such as structural failure) is a low probability event by construction (and is unconnected to true understanding of the underlying concepts). It is certainly interesting to think about, and when the above phenomenon is coupled with human tendency to ascribe good outcomes to skill, and poor outcomes to bad luck, the result in terms of knowledge accumulation / dissemination may be quite unsatisfactory.

I think what I have just argued is that professional activities that become commoditized are likely to be degenerative over time. This would certainly accord with my experience in structural engineering and other fields where I have some substantive knowledge. And I wanted to see if you would agree or not. Do you have any stark counter examples from your professional life that you can recall? Do you think I am being unduly pessimistic?

There are two things going on here:

1. Corroboration, and the expectation of corroboration, as a problem. This relates to what I’ve called the confirmationist paradigm of science, where the point of experimentation is to confirm theories. The motivations are then all in the wrong places, just in general. Quantitative analysis under uncertainty (i.e., statistics) adds another twist to the vicious cycle of confirmation, with the statistical significance filter and the 80% power lie, by which effects get overestimated, motivating future studies that overestimate effect sizes, etc., until entire subfields get infested with wild and unrealistic overestimates. (A quick simulation of that significance filter appears after this list.)

2. The sociological angle, with students following their advisors, advisors promoting former students, etc. I don’t have so much to say about this one, but I guess that it’s part of the story too.
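
Here's the quick simulation promised in point 1, with made-up numbers: when the true effect is small relative to the standard error, the estimates that survive the usual p < 0.05 filter are, on average, large overestimates and are sometimes of the wrong sign:

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, se = 0.2, 0.5        # a small true effect measured with a noisy design
n_studies = 100_000

estimates = rng.normal(true_effect, se, n_studies)
significant = np.abs(estimates) > 1.96 * se       # the usual p < 0.05 filter

print("share of studies reaching significance:", round(significant.mean(), 3))
print("mean estimate among significant studies:", round(estimates[significant].mean(), 2))
print("exaggeration factor (Type M):",
      round(np.abs(estimates[significant]).mean() / true_effect, 1))
print("share of significant results with the wrong sign (Type S):",
      round((estimates[significant] < 0).mean(), 3))
```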

Also relevant to this discussion is the recent book, False Feedback in Economics: The Case for Replication, by Andrin Spescha.

Here are the data from that cold showers study. So you haters can now do your own analyses!

The other day we discussed some hype around the article, “Impact of cold exposure on life satisfaction and physical composition of soldiers,” published in the journal BMJ Military Health. According to a Stanford professor, podcast participant, and supplement salesman, “deliberate cold exposure is great training for the mind.”

But some have expressed skepticism regarding that study, with, as our correspondent Matt Bogard put it, “n = 49 split into treatment and control groups for these outcomes (also making gender subgroup comparisons).” There are times when n=49, or even n=1, can be enough, but not when estimating the effects of subtle treatments on highly variable outcomes.

In comments, Shravan Vasishth points out that the data should be available from the journal website. And, indeed, here’s the link, https://militaryhealth.bmj.com/content/early/2023/01/03/military-2022-002237.long:

Scroll down and you’ll see this:

Take that, you haters! You can click on the link, aaaaand:

OK . . . so let’s check the Internet Archive. The page is https://web.archive.org/web/20230000000000*/https://www.vyzkumodolnosti.cz/en/datasets, and here’s what we see:

So, 7, 12, and 14 Mar 2023. Clicking on any of these yields the following:

The first four links work, giving spreadsheets that appear to be raw data! The last two links give nothing; they just point back to this page.

For reasons discussed in my earlier post, I don’t have much interest in these data myself, but, for anyone who’s interested, just follow those links at the Internet Archive.

Would you allow a gun to be fired at your head contingent on a mere 16 consecutive misfires, whatever the other inconclusive evidence?

Jonathan Falk writes:

I just watched the 1947 movie Boomerang!, an early directorial effort by Elia Kazan. It tells the (apparently true) story of Homer Cummings, a DA who took it upon himself to argue for the nonprosecution of a guy who everyone thought was guilty. In the big court scene at the end, he goes through a lot of circumstantial evidence of innocence, but readily admits that none of this evidence is dispositive. He then gets to the gun found on the would-be defendant. He asks the judge to load the gun with six bullets and then announces to the court: “From the coroner’s report, we know that when the gun was fired it was angled downward from a distance of six inches behind the victim’s head.” He then has his assistant hold the gun angled down in this fashion behind him and tells him to pull the trigger. The gun clicks but does not fire. He then says: “There is a flaw in the firing pin, and when held down at an angle like this it does not fire. We experimented with this 16 times before today.” He then exhales slightly and says: “Today was the 17th. I apologize for the cheap theatrics” and the court observers break into applause. (No, Alec Baldwin wasn’t born when the movie was made.)

Now a good Bayesian, of course, would combine all the circumstantial evidence with the firing pin evidence to get a posterior distribution on guilt. But a frequentist? Would you allow a gun to be fired at your head contingent on a mere 16 consecutive misfires, whatever the other inconclusive evidence? Let p be the probability of misfire given that the gun was the murder weapon. Given 16 consecutive misfires, we can, with 95% probability, bound p between 1 and 0.9968. And the marginal information of the 17th misfire is really, really small…
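
One hedged way to reconstruct Falk's arithmetic (he doesn't show the calculation, so this is a guess): 0.9968 is 0.95^(1/16), the misfire probability at which 16 straight misfires would still happen with 95% probability. And a simple Bayesian update makes his last point—that the 17th click adds almost nothing—concrete:

```python
# Reproducing the 0.9968 in the quote -- a guess at the calculation, since it
# isn't spelled out: the misfire probability p at which 16 straight misfires
# would still occur with 95% probability.
print(round(0.95 ** (1 / 16), 4))                     # -> 0.9968

# How much does the 17th click add?  With a uniform prior on p, the posterior
# after k straight misfires is Beta(k+1, 1), so P(p > x | k misfires) = 1 - x**(k+1).
for k in (16, 17):
    print(f"after {k} misfires: P(p > 0.9) = {1 - 0.9 ** (k + 1):.3f}")
# 0.833 vs. 0.850 -- the courtroom demonstration is theater, not much extra evidence.
```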

I guess the role of the cheap theatrics is not to provide more information but rather to convince the jury. I’ve heard that humans are not really Bayesian.

As for the 16 consecutive previous misfires:

1. I don’t see any reason to think the outcomes would be statistically independent. Maybe they all misfired for some other reason.

2. Also, no reason to trust him when he says they experimented 16 times before. People exaggerate their evidence all the time.

Before reading this post, take a cold shower: A Stanford professor says it’s “great training for the mind”!

Matt Bogard writes:

I don’t have full access to this article to know the full details and can’t seem to access the data link but with n = 49 split into treatment and control groups for these outcomes (also making gender subgroup comparisons) this seems to scream, That which does not kill my statistical significance only makes it stronger.

From the abstract:

Results: Theoretical and practical training in cold immersion in the winter did not induce anxiety. Regular cold exposure led to a significant (p=0.045) increase of 6.2% in self-perceived sexual satisfaction compared with the pre-exposure measurements. Furthermore, considerable increase (6.3% compared with the pre-exposure period) was observed in self-perceived health satisfaction; the change was borderline significant (p=0.052). In men, there was a reduction in waist circumference (1.3%, p=0.029) and abdominal fat (5.5%, p=0.042). Systematic exposure to cold significantly lowered perceived anxiety in the entire test group (p=0.032).

Conclusions: Cold water exposure can be recommended as an addition to routine military training regimens. Regular exposure positively impacts mental status and physical composition, which may contribute to the higher psychological resilience. Additionally, cold exposure as a part of military training is most likely to reduce anxiety among soldiers.

I’m not planning to pay 42 euros to read the whole article (see image above), but, yeah, based on the abstract it looks like any effects here are too variable to be discovered in this way. This one hits a few of our themes:

1. Lots of p-values around 0.05. Greg Francis has written about this.

2. Forking paths: lots and lots of different ways of slicing the data.

3. Small sample size. N = 49 isn’t a lot even before getting into the subgroups and interactions.

4. Implausibly large effect-size estimates. An average reduction of 5.5% of abdominal fat, that sounds like a lot, no? This problem comes for free when variability is high.

5. Noisy measurements that don’t quite align with questions of interest. I can’t be sure about this one, but I’m not quite sure that the life satisfaction and sexual satisfaction surveys are really measuring what’s important here.

6. Story time. Even setting aside the statistical problems, do you notice how they move from “sexual satisfaction,” “health satisfaction,” “waist circumference,” and “abdominal fat” in the Results, to “mental status and physical composition” in the conclusion? I guess “getting skinny and having good sex” wouldn’t sound so good.

7. Between-person comparisons. There’s no need for this study to be done in this way—some people get treatment, some get control. It should be easy enough to do both treatments on each person, but it seems that they didn’t do so. Why? I guess because between-person comparisons are standard practice. They’re easier to analyze and at first glance look cleaner than within-person comparisons. But that apparent cleanliness is an illusion.

8. Coherence with folk theories. Cold showers! Sounds paleo, huh? I’m not saying that cold showers can’t have benefits, just that this is a noisy study with the sort of conclusion that a lot of people will be happy to hear.

What’s going on?

I don’t know what’s going on. Here’s my guess: These researchers took some measurements that vary a lot from person to person, maybe some of these measurements vary a bit within person too. They applied a treatment which will have variable effects: maybe very close to zero in some cases, positive for some people, negative for others. Given this mix, we can expect the average effects to be small. Small average effects, indirect measurements, high variation . . . it’ll be hard to find any signal amid all this noise. Then this gets piped through forking paths and the statistical-significance filter and, boom!, results come out, ready to be published and publicized. I’m not saying the authors of the paper did anything dishonest, but that doesn’t stop them from pulling comparisons out of noise.
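
To see how easily pure noise can deliver a few publishable p-values in a study like this, here's a crude simulation with made-up numbers: no true effects anywhere, n = 49, a handful of outcomes, and male/female subgroup splits as extra forking paths:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_outcomes, n_sims = 49, 10, 1000    # made-up: 49 soldiers, 10 outcomes measured

count_sig = []
for _ in range(n_sims):
    treat = rng.permutation(np.r_[np.ones(25), np.zeros(24)]).astype(bool)
    male  = rng.random(n) < 0.7
    sig = 0
    for _ in range(n_outcomes):
        y = rng.normal(size=n)          # pure noise: no treatment effect at all
        # forking paths: whole sample, men only, women only
        for mask in (np.ones(n, bool), male, ~male):
            a, b = y[mask & treat], y[mask & ~treat]
            if len(a) > 2 and len(b) > 2 and stats.ttest_ind(a, b).pvalue < 0.05:
                sig += 1
    count_sig.append(sig)

count_sig = np.array(count_sig)
print("average number of 'significant' comparisons per null study:",
      round(count_sig.mean(), 1))
print("share of null studies with at least one p < 0.05:",
      round((count_sig > 0).mean(), 2))
```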

It’s the usual story of junk science, the push of thousands of journals seeking publications and millions of people doing research, combined with the pull of “the aching desire for an answer” (as Tukey put it) to unlimited numbers of research questions, mixed in with the horrible ability of statistical methods to convince people there’s strong evidence even when it isn’t there.

The article in question was from an obscure journal and I figured I’d never hear about it again.

Part 2

But then I checked my email, and two days earlier I’d received this message from Scott McCain:

Some friends and family are into the idea of cold showers. I’ve seen some work on it before. Recently, Stanford professor Andrew Huberman has covered this study supporting that cold exposure can have a whole host of benefits (including self-perceived sexual satisfaction and reduced waist circumference). I felt that this was interesting but a bit surprising.

I don’t have access to this study, so I downloaded the raw data—which is great that they published it! It seems like they’ve measured a whole bunch of things. I’ve tried communicating to friends and family that this study (however I haven’t analyzed their data myself, besides a cursory look) seems likely underpowered and maybe has been at risk of a garden of forking paths.

Indeed. Again, I’m not saying that cold or hot showering has no effect; I just don’t think this sort of push-button model of scientific inquiry will be useful in figuring it out. But, just to be clear, I’m not trying to talk your friends and family out of taking cold showers. They should go for it, why not?

Part 3

And then I received another email, this one from Joshua Brooks, pointing to a series of Twitter posts from Gideon Meyerowitz-Katz slamming the above-discussed study. It seems that the cold-shower paper became widely discussed on the internet after it was promoted by Andrew Huberman, a neurobiology professor at Stanford who has a podcast and a “once-a-month newsletter with science and science-based tools for everyday life.”

We’ll get back to Huberman in a moment, but first let me discuss the posts by Meyerowitz-Katz, who writes that the paper in question “shows precisely the opposite” of what it claims. I wouldn’t put it that way; rather I’d just say the paper provides no strong evidence of anything. It’s a noisy study. Noisy and not statistically significantly different from zero is not the same thing as saying that the effect is not there or even that the effect is not important; it’s just that the study is too weak to find anything useful. Also, Meyerowitz-Katz is annoyed that the paper focuses on before-after comparisons. But before-after comparisons can be fine! You learn a lot by comparing to “before” data. And in any case you can compare the before-after differences in the treatment and control groups. On the other hand, Meyerowitz-Katz comes to the same conclusion that I do, which is that the study appears to be consistent with null effects so its conclusions should not be taken seriously.
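
For concreteness, here's a minimal sketch of that last idea—comparing the before-after changes across the treatment and control groups, a difference in differences—using simulated data with a hypothetical one-unit treatment effect:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25                                   # people per group (made-up)

# Simulated data: each person has a stable baseline, so before and after are
# correlated within person; the treatment group gets a hypothetical +1 effect.
base_t, base_c = rng.normal(50, 5, n), rng.normal(50, 5, n)
before_t, after_t = base_t + rng.normal(0, 1, n), base_t + rng.normal(0, 1, n) + 1.0
before_c, after_c = base_c + rng.normal(0, 1, n), base_c + rng.normal(0, 1, n)

change_t, change_c = after_t - before_t, after_c - before_c
did = change_t.mean() - change_c.mean()                      # difference in differences
se  = np.sqrt(change_t.var(ddof=1) / n + change_c.var(ddof=1) / n)
print(f"difference in before-after changes: {did:.2f} (se {se:.2f})")
```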

OK, one more thing. Meyerowitz-Katz writes:

To sum up – this is a completely worthless study that has no value whatsoever scientifically. It is quite surprising that it got published in its current form, and even more surprising that anyone would try to use it as evidence.

I wouldn’t quite put it that way. First, who’s to say it’s “completely worthless”? It has some measurements and maybe they’ll be useful to someone. They posted their raw data! Second, I’m fine with saying that it’s too bad that the paper got published or that anyone would try to use it as evidence. But to call this quite surprising?? Bad or pointless research papers get published all the time, in all sorts of journals, and then they get taken as evidence by all sorts of people, renowned professors and otherwise. So I’m surprised Meyerowitz-Katz is surprised. His surprisal suggests to me that he puts too much faith in journal articles!

Anyway, after looking this all over, I responded to Brooks:

I guess “BMJ Military Health” is a pretty obscure journal . . . but, sure, lots of bad stuff gets published! I’ve never heard of this Huberman guy. I guess if this thing of hyping crap science works for Gladwell, NPR, and Ted, it makes sense that people with less elevated perches in the media will try it too.

I’m not trying to be cynical here—I don’t think that hyping crap science is a good thing—I’m just trying to be realistic. There are lots of journals out there, and if you fish around through enough of them, you can find superficially-plausible articles that will support just about any position.

Brooks followed up with further background on Huberman:

On the cold showers, he points to a meta-analysis.

I don’t know, actually, that he bases his view to any significant extent on that one paper.

The cold immersion claim is just one of the rather remarkable claims he makes on a whole range of effects…

They mostly come off as credible at first glance to me. For all, he claims an “evidence base” in the literature. Personally, I don’t necessarily dismiss everything he says per se, but when I string together the sheer number of absolutely certain claims he makes about such large effects, I have to conclude there’s a fundamental flaw.

Also, apparently he hawks supplements, plus, I’ve also heard him hawking such things as mattresses customized to fit individual consumers by responses to an online questionnaire, that result in improved sleep.

I’m currently listening to a podcast where he’s talking to a scientist about the genetics of “inherited experience.” Right now they’re describing experiments showing a differential effect to worms who are fed other worms who were exposed to an experimental condition (electric shock) then put into a blender.

It’s actually pretty interesting – and some of the research they’re talking about supposedly has been replicated.

But it all feels kinda like the ESP research.

It’s hard to think about these things because there could be real effects! As discussed above, to the extent that cold showers have meaningful effects on people, we should expect these effects to vary a lot from person to person.

I went to Huberman’s webpage on the cold showers to see the meta-analysis that Brooks mentions, but the only meta-analysis I found there was “Impact of Cold-Water Immersion Compared with Passive Recovery Following a Single Bout of Strenuous Exercise on Athletic Performance in Physically Active Participants: A Systematic Review with Meta-analysis and Meta-regression.” Cold-water immersion for athletic performance seems to have zero overlap with cold showers for mood and general health. Nothing wrong with talking about this study but it doesn’t really seem relevant for the discussion of cold showers.

Also Huberman has this:

Building Resilience & Grit

By forcing yourself to embrace the stress of cold exposure as a meaningful self-directed challenge (i.e., stressor), you exert what is called ‘top-down control’ over deeper brain centers that regulate reflexive states. This top-down control process involves your prefrontal cortex – an area of your brain involved in planning and suppressing impulsivity. That ‘top-down’ control is the basis of what people refer to when they talk about “resilience and grit.” Importantly, it is a skill that carries over to situations outside of the deliberate cold environment, allowing you to cope better and maintain a calm, clear mind when confronted with real-world stressors. In other words, deliberate cold exposure is great training for the mind. [Boldface in the original.]

“Grit,” huh? C’mon dude, get real.

P.S. Here’s the supplement he’s advertising:

Looks a little bit iffy, but, hey, what do I know? I’ve never studied human performance. I kinda wonder if Huberman takes these himself. I could imagine a few options:

1. Of course he takes them; he’s a true believer.

2. Of course he doesn’t take them; the sponsorship thing is all about the money.

3. He believes they work, but he doesn’t think he personally needs them, so he doesn’t take them.

4. He doubts they do anything, but he figures they won’t hurt, so why not, and he takes them.

Maybe there’s some other option I haven’t thought of.

Cheating in science, sports, journalism, business, and art: How do they differ?

I just read “Lying for Money: How Legendary Frauds Reveal the Workings of Our World,” by Dan Davies.

I think the author is the same Dan Davies who came up with the saying, “Good ideas do not need lots of lies told about them in order to gain public acceptance,” and also the “dsquared” who has occasionally commented on this blog, so it is appropriate that I heard about his book in a blog comment from historian Sean Manning.

As the title of this post indicates, I’m mostly going to be talking here about the differences between frauds in three notoriously fraud-infested but very different fields of human endeavor: science, sports, and business.

But first I wanted to say that this book by Davies is one of the best things about economics I’ve ever read. I was trying to think what made it work so well, and I realized that the problem with most books about economics is that they’re advertising the concept of economics, or they’re fighting against dominant economics paradigms . . . One way or another, those books are about economics. Davies’s book is different in that he’s not saying that economics is great, he’s not defensive about economics, and he’s not attacking it either. His book is not about economics; it’s about fraud, and he’s using economics as one of many tools to help understand fraud. And then when he gets to Chapter 7 (“The Economics of Fraud”), he’s well situated to give the cleanest description I’ve ever seen of economics, integrating micro to macro in just a few pages. I guess a lot of readers and reviewers will have missed that bit because it’s not as lively as the stories at the front of the book (also, who ever gets to Chapter 7, right?), and that’s kinda too bad. Maybe Davies could follow up with a short book, “Economics, what’s it all about?” Probably not, though, as there are already a zillion other books of this sort, and there’s only one “Lying for Money.” I’m sure there are lots of academic economists and economics journalists who understand the subject as well as or better than Davies; he just has a uniquely (as far as I’ve seen) clear perspective, neither defensive nor oppositional but focused on what’s happening in the world rather than on academic or political battles for the soul of the field. (See here and here for further discussion of this point.)

Cheating in business

Cheating in business is what “Lying for Money” is all about. Davies mixes stories of colorful fraudsters with careful explanations of how the frauds actually worked, along with some light systematizing of different categories of financial crime.

In his book, Davies does a good job of not blaming the victims. He does not push the simplistic line that “you can’t cheat an honest man.” As he points out, fraud is easier to commit in an environment of widespread trust, and trust is in general a good thing in life, both because it is more pleasant to think well of others and also because it reduces transaction costs of all sorts.

Linear frauds and exponential frauds

Beyond this, one of the key points of the book is that there are two sorts of frauds, which I will call linear and exponential.

In a linear fraud, the fraudster draws money out of the common reservoir at a roughly constant rate. Examples of linear frauds include overbilling of all sorts (medical fees, overtime payments, ghost jobs, double charging, etc.), along with the flip side of this, which is not paying for things (tax dodging, toxic waste dumping, etc.). A linear fraud can go on indefinitely, until you get caught.

In an exponential fraud, the fraudster needs to keep stealing more and more to stay solvent. Examples of exponential frauds include pyramid schemes (of course), mining fraud, stock market manipulations, and investment scams of all sorts. A familiar example is Bernie Madoff, who raised zillions from people by promising them unrealistic returns on their money, but as a result incurred many more zillions of financial obligations. The scam was inherently unsustainable. Similarly with Theranos: the more money they raised from their investors, the more trouble they were in, given that they didn’t actually ever have a product. With an exponential fraud you need to continue expanding your circle of suckers—once that stops, you’re done.

A linear fraud is more sustainable—I guess the most extreme example might be Mister 880, the counterfeiter of one-dollar bills who was featured in a New Yorker article many years ago—but exponential frauds can grow your money faster. Embezzling can go either way: in theory you can sustainably siphon off a little bit every month without creating noticeable problems, but in practice embezzlers often seem to take more money than is actually there, giving them unending future obligations to replace the missing funds.
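
To make the linear/exponential distinction concrete, here is a toy numerical sketch (mine, not anything from Davies’s book; the skim rate, promised return, and dollar figures are invented purely for illustration). It compares a constant skim with Ponzi-style obligations that compound:

```python
# Toy comparison: a "linear" fraud skims a constant amount each period,
# while an "exponential" (Ponzi-style) fraud promises returns that compound,
# so its outstanding obligations keep growing until new money stops arriving.
# All numbers are made up for illustration.

def linear_skim_total(skim_per_period, periods):
    """Cumulative amount stolen grows linearly with time."""
    return [skim_per_period * t for t in range(1, periods + 1)]

def ponzi_obligations(initial_raise, promised_return, periods):
    """Outstanding obligations to investors compound each period."""
    owed = initial_raise
    path = []
    for _ in range(periods):
        owed *= 1 + promised_return  # promised returns accrue on everything owed
        path.append(owed)
    return path

if __name__ == "__main__":
    years = 10
    skim = linear_skim_total(skim_per_period=10_000, periods=years)
    owed = ponzi_obligations(initial_raise=100_000, promised_return=0.5, periods=years)
    for t in range(years):
        print(f"year {t + 1:2d}: skim total = {skim[t]:>10,.0f}   ponzi owed = {owed[t]:>13,.0f}")
```

After ten periods the constant skim has taken a fixed, predictable amount, while the Ponzi obligations have grown by a factor of more than fifty; the only way to keep servicing them is to keep expanding the circle of suckers, which is the point above.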

With any exponential fraud, the challenge is to come up with an exit strategy. Back in the day, you could start a pyramid scheme or other such fraud, wait until the scam had gone on long enough that you had a good profit but before you reached the sucker event horizon, and then skip town. The only trick is to remember to jump off the horse before it collapses. For business frauds, though, there’s a paper trail, so it’s harder to leave without getting caught. The way Davies puts it is that in your life you have one chance to burn your reputation in this way.

Another way for a fraudster to escape, financially speaking, is to go legit. If you’re a crooked investor, you can take your paper fortune to the racetrack or the stock market and make some risky bets: if you win big, you can pay off your funders and retire. Unfortunately, if you win big, and you’re already the kind of person to conduct an exponential fraud in the first place, it seems likely you’ll just take this as a sign that you should push further. Sometimes, though, you can keep things going indefinitely by converting an exponential scheme into a linear one, as seems to have happened with some multilevel marketing operations. As Davies says, if you can get onto a stable financial footing, you have something that could be argued was never a fraud at all, just a successful business that makes its money by convincing people to pay more for its product than it’s worth.

The final exit strategy is recidivism, or perhaps rehabilitation. Davies shares many stories of fraudsters who got caught, went to prison, then popped out and committed similar crimes again. They kept doing what they were good at! Every once in a while you see a fraudster who managed to grease enough palms that after getting caught he could return to life as a rich person, for example Michael Milken.

One other thing. Yes, exponential frauds are especially unsustainable, but linear frauds can be tricky to maintain too. Even if you’re cheating people at a steady, constant rate, so you have no pressing need to raise funds to cover your past losses, you’re still leaving a trail of victims behind, and any one of them can decide to be the one to put in the effort to stop you. More victims = greater odds of being tracked down. There’s all sorts of mystique about “cooling off the mark,” but my impression is that the main way scammers get away with their frauds is by maintaining some physical distance from the people they’ve scammed, and by taking advantage of the legal system to make life difficult for any whistleblowers or victims who come after them. Again, see Theranos.

Cheating in science

Science fraud is a mix of linear and exponential. The linear nature of the fraud is that it’s typically a little bit in paper after paper, grant proposal after grant proposal, Ted talk after Ted talk, a lie here, an exaggeration there, some data manipulation, some p-hacking, each time doing whatever it takes to get the job done. The fraud is linear in that there’s no compounding; it’s not like each new research project requires an ever-larger supply of fake data to make up for what was taken last time.

On the other hand, there’s a potentially exponential problem: if you use fraud to produce an important “discovery,” others will want to replicate it for themselves, and when those replications fail, you’ll need to put in even more effort to prop up your original claims. In business, this propping-up can take different forms (new supplies of funds, public relations, threats, delays, etc.), and similarly there are different ways in science to prop up fake claims: you can ignore the failed replications and hope for the best, you can attack the replicators, you can use connections in the news media to promote your view and use connections in academia to publish purported replications of your own, you can jump sideways into a new line of research and cheat to produce success there . . . lots of options. The point is, fake scientific success is hydra-headed: it will spawn continuing waves of replication challenges. As with financial fraud, the challenge, after manufacturing a scientific success, is to draw a line under it, to get it accepted as canon, something they can never take away from you.

Cheating in sports

Lance Armstrong is an example of an exponential fraud. He doped to win bike races—apparently everybody was doping at the time. But Lance was really really good at doping. People started to talk, and then Lance had to do more and more to cover it up. He engaged in massive public relations, he threatened people, he tried to wait it out . . . nothing worked. Dude is permanently disgraced. It seems that he’s still rich, though: according to wikipedia, “Armstrong owns homes in Austin, Texas, and Aspen, Colorado, as well as a ranch in the Texas Hill Country.”

Other cases of sports cheating have more of a linear nature. Maradona didn’t have to keep punching balls into the net; once was enough, and he still got to keep his World Cup victory. If Brady Anderson doped, he just did it and that was that; no escalating behavior was necessary.

Cheating in journalism

Journalists cheat by making things up in the fashion of Mike Barnicle or Jonah Lehrer, or by reporting stories that originally appeared elsewhere without crediting the original source, which I’ve been told is standard practice at the New York Times and other media outlets. Reporting an already-told story without linking to the source is considered uncool in the blogging world but is so common in regular journalism that it’s not even considered cheating! Fabrication, though, remains a bridge too far.

Overall I’d say that cheating in journalism is like cheating in science and sports in largely being linear. Every instance of cheating leaves a hostage to fortune, so as you continue to cheat in your career, it seems likely you’ll eventually get found out for something or another, but there’s no need for an exponential increase in the amount of cheating in the way that business cheaters need to recoup larger and larger losses.

The other similarity of cheating in journalism to cheating in other fields is the continuing need for an exit strategy, with the general idea being to build up reputational credit during the fraud phase that you can then cash in during the discovery phase. That is, once enough people twig to your fraud, you are already considered too respectable or valuable to dispose of. Mike Barnicle is still on TV! Malcolm Gladwell is still in the New Yorker! (OK, Gladwell isn’t doing fraud, exactly: rather than knowingly publishing lies, he’s conveniently putting himself in the position where he can publish untrue and misleading statements while placing himself behind some sort of veil of ignorance so that he can’t be held personally to blame for these statements. He’s playing the role of a public relations officer who knows better than to check the veracity of the material he’s being asked to promote.)

Art fraud

I don’t have anything really to say about cheating in art, except that it’s a fascinating topic and much has been written about it. Art forgery involves some amusing theoretical questions, such as: if someone copies a painting or a style of a no-longer-living artist so effectively that nobody can tell the difference, is anyone harmed, other than the owners of existing work whose value is now diluted? From a business standpoint, though, art forgery seems similar to other forgery in being an essentially linear fraud, again leading to a linearly increasing set of potentially incriminating clues.

Closely related to art fraud is document fraud, for example the hilarious and horrifying (but more hilarious than horrifying) gospel of Jesus’s wife fraud, and this blurs into business fraud (the documents are being sold) and science fraud (in this case, bogus claims about history).

Similarities between cheating in business, science, sports, and journalism

Competition is a motivation for cheating. It’s hard to compete in business, science, sports, and journalism. Lots of people want to be successes and there aren’t enough slots for everyone. So if you don’t have the resources or talent or luck to succeed legitimately, cheating is an alternative path. Or if you are well situated for legitimate success, cheating can take you to the next level (I’m looking at you, Barry Bonds).

Cheating as a shortcut to success, that’s one common thread in all these fields of endeavor. There’s also cheating in politics, which I’m interested in as a political scientist, but right now I’m kinda sick of thinking about lying cheating political figures—this includes elected officials but also activists and funders (i.e., the bribers as well as the bribed)—so I won’t consider them here.

Another common thread is that you’re not supposed to cheat, so the cheater has to keep it hidden, and sometimes the coverup is, as they say, worse than the crime.

A final common thread is that business, science, sports, journalism, and art are . . . not cartels, necessarily, but somewhat cooperative enterprises whose participants have a stake in the clean reputation of the entire enterprise. This motivates them to look away when they see cheating. It’s unpleasant, and it’s bad all around for the news to spread, as this could lead to increased distrust of the entire enterprise. Better to stick to positivity.

Differences

The key difference I see between these different areas is that in business it’s kinda hard to cheat by accident. In science we have Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud. In business or sports we wouldn’t say that. OK, there might be some special cases, for example someone sells tons of acres of Florida swampland and is successful because he (the salesman) sincerely thinks it’s legitimate property, but in general I think of business frauds as requiring something special, some mix of inspiration, effort, and lack of scruple that most of us can’t easily assemble. A useful idiot might well be useful as part of a business fraud, but I wouldn’t think that ignorance would be a positive benefit.

In contrast, in research, a misunderstanding of scientific method can really help you out, if your goal is to produce publishable, Gladwell-able, Freakonomics-able, NPR-able, Ted-able work. The less you know and the less you think, the further you can go. Indeed, if you approach complete ignorance of a topic, you can declare that you’ve discovered an entire new continent, and a pliable news media will go with you on that. And if you’re clueless enough, it’s not cheating, it’s just ignorance!

In this dimension, sports and art seem more like business, and journalism seems more like science. Yes, you can cheat in sports without realizing it, but knowing more should allow you to be more effective at it. I can’t think of a sporting equivalent to those many scientists who produce successful lines of research by wandering down forking paths, declaring statistical significance, and not realizing what they’ve been doing.

With journalism, though, there’s a strong career path of interviewing powerful people and believing everything they say, never confronting them. To put it another way, there’s only one Isaac Chotiner, but there are lots and lots of journalists who deal in access, and I imagine that many of them are sincere, i.e. they’re misleading their readers by accident, not on purpose.

Other thoughts inspired by the book Lying for Money

I took notes while reading Davies’s book. Page references are to the Profile Books paperback edition.

p.14, “All this was known at the time.” This comes up again on p.71: “At this point, the story should have been close to its conclusion. Indeed, the main question people asked in 1982, when OPM finally gave up and went bankrupt, is why didn’t it happen three years earlier? Like a Looney Tunes character, nothing seemed to stop it. New investors were brought in as the old ones gave up in disgust.” This happens all the time; indeed, one of the things that struck me about the Theranos story was how the company thrived for nearly a decade after various people inside it had realized the emptiness of its efforts.

A fraud doesn’t stay afloat all by itself; it takes a lot of effort to keep it going. This effort can include further lies, the judicious application of money, and, as with Theranos, threats and retaliation. It’s a full-time job! Really there’s no time to make up the losses or get the fictional product to work, given all the energy being spent to keep the enterprise alive for years after the fact of the fraud is out in the open.

p.17, “Fraudsters don’t play on moral weaknesses, greed or fear; they play on weaknesses in the system of checks and balances.” I guess it’s a bit of both, no? One thing I do appreciate, though, is the effort Davies puts in to not present these people as charming rogues.

I want to again point to a key difference between fraud in business and fraud in science. Business fraud requires some actual talent, or at least an unusual lack of scruple or willingness to take risks, characteristics that set fraudsters apart from the herd. In contrast, scientific misconduct often just seems to require some level of stupidity, enough so that you can push buttons, get statistical results, and draw ridiculous conclusions without looking back. Sure, ambition and unscrupulousness can help, but in most cases just being stupid seems like enough, and also is helpful in the next stage of the process when it’s time to non-respond to criticism.

p.18, “Another thing which will come up again and again is that it is really quite rare to find a major commercial fraud which was the fraudster’s first attempt. An astonishingly high proportion of the villains of this book have been found out and even served prison time, then been placed in positions of trust once again.” I’m reminded of John Gribbin and John Poindexter.

Closer to home, there was this amazing—by which I mean amazingly horrible—story of a public school that was run jointly by the New York City Department of Education and Columbia University Teachers College. The principal of this school had some issues. From the news report:

In 2009 and 2010, while Ms. Worrell-Breeden was at P.S. 18, she was the subject of two investigations by the special commissioner of investigation. The first found that she had participated in exercise classes while she was collecting what is known as “per session” pay, or overtime, to supervise an after-school program. The inquiry also found that she had failed to offer the overtime opportunity to others in the school, as required, before claiming it for herself.

The second investigation found that she had inappropriately requested and obtained notarized statements from two employees at the school in which she asked them to lie and say that she had offered them the overtime opportunity.

After those findings, we learn, “She moved to P.S. 30, another school in the Bronx, where she was principal briefly before being chosen by Teachers College to run its new school.”

So, let’s get this straight: She was found to be a liar, a cheat, and a thief, and then, with all that known, she was hired for two jobs as school principal?? An associate vice president of Teachers College said, “We felt that on balance, her recommendations were so glowing from everyone we talked to in the D.O.E. that it was something that we just were able to live with.” In short: once you’re plugged in, you stay plugged in.

p.47: Davies talks about how online drug dealers eventually want to leave the stressful business of drug dealing, and at this point they can cash in their reputation by taking a lot of orders and then disappearing with customers’ money. An end-of-career academic researcher can do something similar if they want, using an existing reputation to promote bad ideas. Usually, though, you wouldn’t want to do that, as there’s no anonymity, so the negative outcome can reflect badly on everything that came before. The only example I can think of offhand is the Cornell psychology researcher Daryl Bem, who is now indelibly associated with some very bad papers he wrote on extra-sensory perception. I was also gonna include Orson Welles here, as back in the 1970s he did his very best to cash in his reputation on embarrassing TV ads. But, decades later, the ads are just an amusing curiosity and Orson’s classic movies are still around: his reputation survived just fine.

p.50: “When the same features of a system keep appearing without anyone designing them, you can usually be pretty sure that the cause is economic.” Well put!

p.57: Regarding Davies’s general point about fraud preying upon a general environment of trust, I want to say something about the weaponization of trust. An example is when a researcher is criticized for making scientific errors and then turns around, in a huff, and indignantly says he’s being accused of fraud. The gambit is to move the discussion from the technical to the personal, to move from the question of whether there really is salad oil in those tanks to the question of whether the salad oil businessman can be trusted.

p.62: Davies writes, “fraud is an unusual condition; it’s a ‘tail risk.'” All I can say is, fraud might be an unusual “tail risk” in business, but in science it’s usual. It happens all the time. Just in my own career, I had a colleague who plagiarized; another one who published a report deliberately leaving out data that contradicted the story he wanted to tell; another who lied, cheated, and stole (I can’t be sure about that one as I didn’t see it personally; the story was told to me by someone who I trust); another who smugly tried to break an agreement; and another who was conned by a coauthor who made up data. That’s a lot! It’s two cases that directly affected me and three that involved people I knew personally. There was also Columbia faking its U.S. News ranking data; I don’t know any of the people involved but, as a Columbia employee, I guess that I indirectly benefited from the fraud while it was happening.

I’d guess that dishonesty is widespread in business as well. So I think that when Davies wrote “fraud is an unusual condition,” he really meant that “large-scale fraud is an unusual condition”; indeed, that would fit the rest of his discussion on p.62, where he talks about “big systematic fraud” and “catastrophic fraud loss.”

This also reminds me of the problems with popular internet heuristics such as “Hanlon’s razor,” “steelmanning,” and “Godwin’s law,” all of which kind of fall apart in the presence of actual malice, actual bad ideas, and actual Nazis. The challenge is to hold the following two ideas in your head at once:

1. In science, bad work does not require cheating; in science, honesty and transparency are not enough; just cos I say you did bad work it doesn’t mean I’m accusing you of fraud; just cos you followed the rules as you were taught and didn’t cheat it doesn’t mean you made the discovery you thought you did.

2. There are a lot of bad guys and cheaters out there. It’s typically a bad idea to assume that someone is cheating, but it’s also often a mistake to assume that they’re not.

p.65: Davies refers to a “black hole of information.” I like that metaphor! It’s another way of saying “information laundering”: the information goes into the black hole, and when it comes out its source has been erased. Traditionally, scientific journals have functioned as such a black hole, although nowadays we are more aware that, even if a claim has been officially “published,” it should still be possible to understand it in the context of the data and reasoning that have been used to justify it.

As Davies puts it on p.71, “People don’t check up on things which they believe to have been ‘signed off.’ The threat is inside the perimeter.” I’ve used that analogy too! From 2016: “the current system of science publication and publicity is like someone who has a high fence around his property but then keeps the doors of his house unlocked. Any burglar who manages to get inside the estate then has free run of the house.”

p.76: “The government . . . has some unusual characteristics as a victim (it is large, and has problems turning customers away).” This reminds me of scientific frauds, where the scientific community (and, to the extent that the junk science has potential real-world impact, the public at large) is the victim. Scientific journals have the norm of taking every submission seriously; also, a paper that is rejected from one journal can be submitted elsewhere.

p.77: “If there is enough confusion around, simply denying everything and throwing counter-accusations at your creditors can be a surprisingly effective tactic.” This reminds me of the ladder of responses to criticism.

p.78: Davies describes the expression “cool out the mark” as having been “brought to prominence by Erving Goffman.” That’s not right! Cooling out the mark was already discussed in great detail in linguist David Maurer’s classic book from 1940, The Big Con. More generally, I find Goffman irritating for reasons discussed here, so I really don’t like to see him credited for something that Maurer already wrote about.

p.114: “Certain kinds of documents are only valid with an accountant’s seal of approval, and once they have gained this seal of validity, they are taken as ‘audited accounts’ which are much less likely to be subjected to additional verification or checking.” Davies continues: “these professions are considered to be circles of trust. The idea is partly that the long training and apprenticeship processes of the profession ought to develop values of trust and honesty, and weed out candidates who do not possess them. And it is partly that professional status is a valuable asset for the person who possesses it.”

This reminds me of . . . academic communities. Not all, but much of the time. This perspective helps answer a question that’s bugged me for a while: When researchers do bad work, why do others in their profession defend them? Just to step away from our usual subjects of economics and psychology for a moment, why were the American Statistical Association and the American Political Science Association not bothered by having given major awards to plagiarists (see here and here)? You’d think they’d be angry about getting rooked, or at least concerned that their associations are associated with frauds. But noooo, the powers that be in these organizations don’t give a damn. The Tour de France removed Lance Armstrong’s awards, but ASA and APSA can’t be bothered. Why? One answer is that they—we!—benefit from the respect given to people in our profession. To retract awards is to admit that this respect is not always earned. Better to just let everyone quietly go about their business.

On p.124, Davies shares an amusing story of the unraveling of a scam involving counterfeit Portuguese banknotes: “While confirming them to be genuine, the inspector happened to find two notes with the same serial numbers—a genuine one had been stacked next to its twin. Once he knew what to look for, it was not too difficult to find more pairs. . . .” The birthday problem in the wild!
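
As a footnote on the birthday-problem angle: once counterfeits start recycling serial numbers, matching pairs show up even in modest samples. Here is a quick back-of-the-envelope calculation (the sample size and the number of distinct serials are hypothetical, not figures from the actual Portuguese case):

```python
def prob_at_least_one_repeat(n_notes, n_distinct_serials):
    """Birthday-problem probability that a sample of n_notes contains at least
    one repeated serial number, assuming each note's serial behaves as if drawn
    uniformly from n_distinct_serials possible values (an illustrative assumption)."""
    p_all_distinct = 1.0
    for i in range(n_notes):
        p_all_distinct *= (n_distinct_serials - i) / n_distinct_serials
    return 1.0 - p_all_distinct

if __name__ == "__main__":
    # Hypothetical numbers: inspect 200 notes whose serials, because of the
    # duplication, are effectively drawn from only 10,000 distinct values.
    print(prob_at_least_one_repeat(200, 10_000))  # roughly 0.87
```

So even a fairly small stack is likely to contain a matching pair, which is why, once the inspector knew what to look for, it was not too difficult to find more.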

p.126: “mining is a sector of the economy in which standards of honesty are variable but requirements for capital are large, and you can keep raising money for a long time before you have to show results.” Kind of like some academic research and tech industries! Just give us a few more zillion dollars and eventually we’ll turn a profit . . .

p.130: “The key to any certification fraud is to exploit the weakest link in the chain.” Good point!

p.131: “It’s often a very good idea to make sure that one is absolutely clear about what a certification process is actually capable of certifying . . . Gaps like this—between the facts that a certification authority can actually make sure of, and those which it is generally assumed it can—are the making of counterfeit fraud.”

This reminds me of scientific error—not usually fraud, I think, but rather the run-of-the-mill sorts of mistakes that researchers, journals, and publicists make every day because they don’t think about the gap between what has been measured and what is being claimed. Two particularly ridiculous examples from psychology are the 3-day study that was called “long term” and the paper whose abstract concluded, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” even though the reported studies had no measures whatsoever of anyone “becoming more powerful,” let alone any actionable implications of such an unmeasured quantity. Again, I see no reason to think these researchers were cheating; they were just following standard practice of making strong claims that sound good but were not addressed by their data. Given that experimental scientists—people whose job is to connect measurement to a larger reality!—regularly make this sort of mistake, I guess it’s not a surprise that the same problem arises in business.

p.134: Davies writes that medical professionals “have a long training program, a strong ethical code and a lot to lose if caught in a dishonest act.” But . . . Surgisphere! Dr. Anil Potti! OK, there are bad apples in every barrel. Also, I’m sure there’s some way these dudes rationalize their deeds. Ultimately, they’re just trying to help patients, right? They’re just being slowed down by all those pesky regulations.

p.136: Davies writes, “The thing is, the certification system for pharmaceuticals is also a safety system.” I love that “The thing is.” It signals to me that Davies didn’t knock himself out writing this book. He wrote the book, it was good, he was done, it got published. When I write an article or book, I get obsessive on the details. Not that I don’t make typos, solecisms, etc., but I’m pretty careful to keep things trim. Overall I think this works, it makes my writing easier to read, but I do think Davies’s book benefits from this relaxed style, not overly worked over. No big deal, just something I noticed in different places in the book.

p.137: “Ranbaxy Laboratories . . . pleaded guilty in 2013 to seven criminal charges relating to the generic drugs it manufactured . . . it was in the habit of using substandard ingredients and manufacturing processes, and then faking test results by buying boxes of its competitors’ branded product to submit to the lab. Ranbaxy’s frauds were an extreme case (although apparently not so extreme as to throw it out of the circle of trust entirely; under new management it still exists and produces drugs today).” Whaaa???

p.145: Davies refers to “the vital element of time” in perpetuating a fraud. A key point here is that uncovering the fraud is never as high a priority to outsiders as perpetuating the fraud is for the fraudsters. Even when money is at stake, the amount of money lost by each individual investor will be less than what is at stake for the perpetuator of the fraud. What this means is that sometimes the fraudster can stay alive by just dragging things out until the people on the other side get tired. That’s a standard strategy of insurance companies, right? To delay, delay, delay until the policyholder just gives up, making the rational calculation that it’s better to just cut your losses.

I’ve seen this sort of thing before, that cheaters take advantage of other people’s rationality. They play a game of chicken, acting a bit (or a lot) crazier than anyone else. It’s the madman theory of diplomacy. We’ve seen some examples recently of researchers who’ve had to deal with the aftermath of cheating collaborators, and it can be tough! When you realize a collaborator is a cheater, you’re dancing with a tiger. Someone who’s willing to lie and cheat and make up data could be willing to do all sorts of things, for example they could be willing to lie about your collaboration. So all of a sudden you have to be very careful.

p.157: “In order to find a really bad guy at a Big Four accountancy firm, you have to be quite unlucky (or quite lucky if that’s what you were looking for). But as a crooked manager of a company, churning around your auditors until you find a bad ‘un is exactly what you do, and when you do find one, you hang on to them. This means that the bad auditors are gravitationally drawn into auditing the bad companies, while the majority of the profession has an unrepresentative view of how likely that could be.”

It’s like p-hacking! Again, a key difference is that you can do bad science on purpose, you can do bad science by accident, and there are a lot of steps in between. What does it mean if you use a bad statistical method, people keep pointing out the problem, and you keep doing it? At some point you’re sliding down the Clarke’s Law slope from incompetence to fraud. In any case, my point is that bad statistical methods and bad science go together. Sloppy regression discontinuity analysis doesn’t have to be a signal that the underlying study is misconceived, but it often is, in part because (a) regression discontinuity is a way to get statistical significance and apparent causal identification out of nothing, and (b) if you are doing a careful, well-formulated study, you might well be able to model your process more directly. Theory-free methods and theory-free science often go together, and not in a good way.

p.161: “The problem is that spotting frauds is difficult, and for the majority of investors not worth spending the effort on.” Spotting frauds is a hobby, not a career or even a job. And that’s not even getting into the Javert paradox.

p.173: “The key psychological element is the inability to accept that one has made a mistake.” We’ve seen that before!

p.200: “The easier something is to manage—the more possible it is for someone to take a comprehensive view of all that’s going on, and to check every transaction individually—the more difficult it is to defraud.” This reminds me of preregistration in science. It’s harder to cheat in an environment where you’re expected to lay out all the steps of your procedure. Cheating in that context is not impossible, but it’s harder.

p.204: Davies discusses “the circumstances under which firms would form, and how the economy would tend not to the frictionless ideal, but to be made up by islands of central planning linked by bridges of price signals.” Well put. I’ve long thought this but, without having a clear formulation in words, it wasn’t so clear to me. This is the bit that made me say the thing at the top of this post, about this being the best economics book I’ve ever read.

p.229: “as laissez-faire economics was just getting off the ground, the Victorian era saw the ideology of financial deregulation grow up at the same time as, and in many cases faster and more vigorously than, financial regulation itself.” That’s funny.

p.231: “The normal state of the political economy of fraud is one of constant pressure toward laxity and deregulation, and this tends only to be reversed when things have got so bad that the whole system is under imminent threat of losing its legitimacy.” Sounds like social psychology! Regarding the application to economics and finance, I think Davies should mention Galbraith’s classic book on the Great Crash, where this laxity and deregulation thing was discussed in detail.

p.243: Davies says that stock purchases by small investors are very valuable to the market because, as a stockbroker, you can “be reasonably sure that you’re not taking too big a risk that the person selling stock to you knows something about it that you don’t.” Interesting point, I’m sure not new to any trader but interesting to me.

p.251: “After paying fines and closing down the Pittston hole, Russ Mahler started a new oil company called Quanta Resources, and somehow convinced the New York authorities that despite having the same owner, employees, and assets, it was nothing to do with the serial polluter that they had banned in 1976.” This story got me wondering: were the authorities asleep at the switch, or were they bribed, or did they just have a policy of letting fraudsters try again?

As Davies writes on p.284, “comparatively few of the case studies we’ve looked at were first offenses. . . . there’s something about the modern economic system that keeps giving fraudsters second chances and putting people back in positions of responsibility when they’ve proved themselves dishonest.” I guess he should say “political and economic system.”

Davies continues: “This is ‘white-collar crime’ we’re talking about after all; one of its defining characteristics is that it’s carried out by people of the same social class as those responsible for making decisions about crime and punishment. We’re too easy on people who look and act like ourselves.” I guess so, but also it can go the other way, right? I think I’m the same social class as Cass Sunstein, but I don’t feel any desire to go easy on him; indeed, it seems to me that, with all the advantages he’s had, he has even less excuse to misrepresent research than someone who came in off the street. From the other direction, he might see me as a sort of class traitor.

p.254: “It’s a crime against the control system of the overall economy, the network of trust and agreement that makes an industrial economy livable.” That’s how I feel about Wolfram Research when they hire people to spam my inbox with flattering lies. If even the classy outfits are trying to con me, what does that say about our world?

p.254: “Unless they are controlled, fraudulent business units tend to outcompete honest ones and drive them out of business.” Gresham!

p.269: “Denial, when you are not part of it, is actually a terrifying thing. One watches one’s fellow humans doing things that will damage themselves, while being wholly unable to help.” I agree. This is how I felt when corresponding with the ovulation-and-clothing researchers and with the elections-and-lifespan researchers. The people on the other side of these discussions seemed perfectly sincere; they just couldn’t consider the possibility they might be on the wrong track. (You could say the same about me, except: (1) I did consider the possibility I could be wrong in these cases, and (2) there were statistical arguments on my side; these weren’t just matters of opinion.) Anyway, setting aside if I was right or wrong in these disputes, the denial (as I perceived it) just made me want to cry. I don’t think graduate students are well trained in handling mistakes, and then when they grow up and publish research, they remain stuck in this attitude. I can see how this could be even more upsetting if real money and livelihoods are on the line.

Finally

In the last sentence of the last page of his book, Davies writes, “we are all in debt to those who trust; they are the basis of anything approaching a prosperous and civilised society.”

To which I reply, who are the trusters to whom we are in debt? For example, I don’t think we are all in debt to those who trust scams such as Theranos or the Hyperloop, nor are we in debt to the Harvard professor who fell for the forged Jesus document and then tried to explain away its problems rather than just listening to the critics. Nor are we in debt to the administrations of Cornell University, Ohio State University, the University of California, etc., when they did their part to defuse criticism of bad work being done by their faculty who had been so successful at raising money and getting publicity for their institutions.

I get Davies’s point in the context of his book: if you fall for a Wolfram Research scam (for example), you’re not the bad guy. The bad guy is Wolfram Research, which is taking advantage of your state of relaxation, tapping into the difficult-to-replenish reservoir of trust. In other settings, though, the sucker seems more complicit, not the bad guy, exactly—ultimately the responsibility falls on the fraudsters, not the promoters of the fraud—but their state of trust isn’t doing the rest of us any favors, either. So I’m not really sure what to think about this last bit.

“The Role of Doubt in Conceiving Research.” The capacity to be upset, to recognize anomalies for what they are, and to track them down and figure out what in our understanding is lacking:

Stanley Presser sends along this article, “The Role of Doubt in Conceiving Research.” Presser has taught for many years at the University of Maryland, though not when I was a student there; also, he teaches sociology, and I’ve never taken a sociology class.

Presser’s article has lots of interesting discussion and quotes about learning from failure, the problem of researchers believing things that are false, the challenge of recognizing what is an interesting research question, along with some specific issues that arise with survey research.

I’m reminded of the principle that an important characteristic of a good scientist is the capacity to be upset, to recognize anomalies for what they are, and to track them down and figure out what in our understanding is lacking. This sort of unsettledness—an unwillingness to sweep concerns under the rug, a scrupulousness about acknowledging one’s uncertainty—is, I would argue, particularly important for a statistician.

P.S. That last link is from 2016, to a post that begins as follows:

I’ve given many remote talks but this is the first time I’ve spoken at an all-electronic conference. It will be a challenge. In a live talk, everyone’s just sitting in the room staring at you, but in an electronic conference everyone will be reading their email and surfing the web. . . . At the very least, I have to be more lively than my own writing, or people will just tune me out and start reading old blog entries.

Funny to see this, seven years later, now that electronic conferences are the standard. And I think they really are worse than the in-person variety. It’s hard for a speaker to be more interesting than whatever is in everybody’s inbox, not to mention the world that is accessible from google.