No, this paper on strip clubs and sex crimes was never gonna get retracted. Also, a reminder of the importance of data quality, and a reflection on why researchers often think it’s just fine to publish papers using bad data under the mistaken belief that these analyses are “conservative” or “attenuated” or something like that.

Brandon Del Pozo writes:

Born in Bensonhurst, Brooklyn in the 1970’s, I came to public health research by way of 23 years as a police officer, including 19 years in the NYPD and four as a chief of police in Vermont. Even more tortuously, my doctoral training was in philosophy at the CUNY Graduate Center.

I am writing at the advice of colleagues because I remain extraordinarily vexed by a paper that came out in 2021. It purports to measure the effects of opening strip clubs on sex crimes in NYC at the precinct level, and finds substantial reductions within a week of opening each club. The problem is the paper is implausible from the outset because it uses completely inappropriate data that anyone familiar with the phenomena would find preposterous. My colleagues and I, who were custodians of the data and participants in the processes under study when we were police officers, wrote a very detailed critique of the paper and called for its retraction. Beyond our own assertions, we contacted state agencies who went on the record about the problems with the data as well.

For their part, the authors and editors have been remarkably dismissive of our concerns. They said, principally, that we are making too big a deal out of the measures being imprecise and a little noisy. But we are saying something different: the study has no construct validity because it is impossible to measure the actual phenomena under study using its data.

Here is our critique, which will soon be out in Police Practice and Research. Here is the letter from the journal editors, and here is a link to some coverage in Retraction Watch. I guess my main problem is the extent to which this type of problem was missed or ignored in the peer review process, and why it is being so casually dismissed now. Is it a matter of economists circling their wagons?

My reply:

1. Your criticisms seem sensible to me. I also have further concerns with the data (or maybe you pointed these out in your article and I did not notice), in particular the distribution of data in Figure 1 of the original article. Most weeks there seem to be approximately 20 sex crime stops (which they misleadingly label as “sex crimes”), but then there’s one week with nearly 200? This makes me wonder what is going on with these data.

2. I see from the Retraction Watch article that one of the authors responded, “As far as I am concerned, a serious (scientifically sound) confutation of the original thesis has not been given yet.” This raises the interesting question of burden of proof. Before the article is accepted for publication, it is the authors’ job to convincingly justify their claim. After publication, the author is saying that the burden is on the critic (i.e., you). To put it another way: had your comment been in a pre-publication referee report, it should’ve been enough to make the editors reject the paper or at least require more from the authors. But post-publication is another story, at least according to current scientific conventions.

3. From a methodological standpoint, the authors follow the very standard approach of doing an analysis, finding something, then performing a bunch of auxiliary analyses–robustness checks–to rule out alternative explanations. I am skeptical of robustness checks; see also here. In some way, the situation is kind of hopeless, in that, as researchers, we are trained to respond to questions and criticism by trying our hardest to preserve our original conclusions.

4. One thing I’ve noticed in a lot of social science research is a casual attitude toward measurement. See here for the general point, and over the years we’ve discussed lots of examples, such as arm circumference being used as a proxy for upper-body strength (we call that the “fat arms” study) and a series of papers characterizing days 6-14 of the menstrual cycle as the days of peak fertility, even though the days of peak fertility vary a lot from woman to woman with a consensus summary being days 10-17. The short version of the problem here, especially in econometrics, is that there’s a general understanding that if you use bad measurements, it should attenuate (that is, pull toward zero) your estimated effect sizes; hence, if someone points out a measurement problem, a common reaction is to think that it’s no big deal because if the measurements are off, that just led to “conservative” estimates. Eric Loken and I wrote this article once to explain the point, but the message has mostly not been received.
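
To see how selection on statistical significance can defeat the “bad measurements are conservative” intuition, here’s a minimal simulation of my own (not from the Loken and Gelman article; the true effect, noise level, and sample size are all made-up numbers). Attenuation does pull the typical estimate toward zero, but among the estimates that happen to clear the significance threshold, the effect is exaggerated, not attenuated:

```python
import numpy as np

rng = np.random.default_rng(0)

true_beta = 0.1   # small true effect (made-up)
n = 50            # small sample
noise_sd = 1.0    # measurement error added to the predictor
n_sims = 10_000

significant_estimates = []
for _ in range(n_sims):
    x = rng.normal(size=n)                          # true predictor
    x_obs = x + rng.normal(scale=noise_sd, size=n)  # noisy measurement
    y = true_beta * x + rng.normal(size=n)          # outcome

    # least-squares slope of y on the noisy predictor, with a rough std. error
    b = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
    resid = y - y.mean() - b * (x_obs - x_obs.mean())
    se = np.sqrt(np.var(resid, ddof=2) / ((n - 1) * np.var(x_obs, ddof=1)))
    if abs(b / se) > 1.96:                          # "statistically significant"
        significant_estimates.append(b)

print("true effect:", true_beta)
print("mean |estimate|, significant results only:",
      round(float(np.mean(np.abs(significant_estimates))), 3))
```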

5. Given all the above, I can see how the authors of the original paper would be annoyed. They’re following standard practice, their paper got accepted, and now all of a sudden they’re appearing in Retraction Watch!

6. Separate from all the above, there’s no way that paper was ever going to be retracted. The problem is that journals and scholars treat retraction as a punishment of the authors, not as a correction of the scholarly literature. It’s pretty much impossible to get an involuntary retraction without there being some belief that there has been wrongdoing. See discussion here. In practice, a fatal error in a paper is not enough to force retraction.

7. In summary, no, I don’t think it’s “economists circling their wagons.” I think this is a mix of several factors: a high bar for post-publication review, a general unconcern with measurement validity and reliability, a trust in robustness checks, and the fact that retraction was never a serious option. Given that the authors of the original paper were not going to issue a correction on their own, the best outcome for you was to either publish a response in the original journal (which would’ve been accompanied by a rebuttal from the original authors) or to publish in a different journal, which is what happened. Beyond all this, the discussion quickly gets technical. I’ve done some work on stop-and-frisk data myself and I have decades of experience reading social science papers, but even I was getting confused with all the moving parts, and indeed I could well imagine being convinced by someone on the other side that your critiques were irrelevant. The point is that the journal editors are not going to feel comfortable making that judgment, any more than I would be.

Del Pozo responded by clarifying some points:

Regarding the data with outliers in my point 1 above, Del Pozo writes, “My guess is that this was a week when there was an intense search for a wanted pattern rape suspect. Many people were stopped by police above the average of 20 per week, and at least 179 of them were innocent. We discuss this in our reply; not only do these reports not record crimes in nearly all cases, but several reports may reflect police stops of innocent people in the search for one wanted suspect. It is impossible to measure crime with stop reports.”

Regarding the issue of pre-publication and post-publication review in my point 2 above, Del Pozo writes, “We asked the journal to release the anonymized peer reviews to see if anyone had at least taken up this problem during review. We offered to retract all of our own work and issue a written apology if someone had done basic due diligence on the matter of measurement during peer review. They never acknowledged or responded to our request. We also wrote that it is not good science when reviewers miss glaring problems and then other researchers have to upend their own research agenda to spend time correcting the scholarly record in the face of stubborn resistance that seems more about pride than science. None of this will get us a good publication, a grant, or tenure, after all. I promise we were much more tactful and diplomatic than that, but that was the gist. We are police researchers, not the research police.”

To paraphrase Thomas Basbøll, they are not the research police because there is no such thing as the research police.

Regarding my point 3 on the lure of robustness checks and their problems, Del Pozo writes, “The first author of the publication was defensive and dismissive when we were all on a Zoom together. It was nothing personal, but an Italian living in Spain was telling four US police officers, three of whom were in the NYPD, that he, not us, better understood the use and limits of NYPD and NYC administrative data and the process of gaining the approvals to open a strip club. The robustness checks all still used opening dates based on registration dates, which do not associate with actual opening in even a remotely plausible way to allow for a study of effects within a week of registration. Any analysis with integrity would have to exclude all of the data for the independent variable.”

Regarding my point 4 on researchers’ seemingly-strong statistical justifications for going with bad measurements, Del Pozo writes, “Yes, the authors literally said that their measurement errors at T=0 weren’t a problem because the possibility of attenuation made it more likely that their rejection of the null was actually based on a conservative estimate. But this is the point: the data cannot possibly measure what they need it to, in seeking to reject the null. It measures changes in encounters with innocent people after someone has let New York State know that they plan to open a business in a few months, and purports to say that this shows sex crimes go down the week after a person opens a sex club. I would feel fraudulent if I knew this about my research and allowed people to cite it as knowledge.”

Regarding my point 6 that just about nothing ever gets involuntarily retracted without a finding of research misconduct, Del Pozo points to an “exception that proves the rule: a retraction for the inadvertent pooling of heterogeneous results in a meta analysis that was missed during peer review, and nothing more.”

Regarding my conclusions in point 7 above, Del Pozo writes, “I was thinking of submitting a formal replication to the journal that began with examining the model, determining there were fatal measurement errors, then excluding all inappropriate data, i.e., all the data for the independent variable and 96% of the data for the dependent variable, thereby yielding no results, and preventing rejection of the null. Voila, a replication. I would be so curious to see a reviewer in the position of having to defend the inclusion of inappropriate data in a replication. The problem of course is replications are normatively structured to assume the measurements are sound, and if anything you keep them all and introduce a previously omitted variable or something. I would be transgressing norms with such a replication. I presume it would be desk rejected.”

Yup, I think such a replication would be rejected for two reasons. First, journals want to publish new stuff, not replications. Second, they’d see it as a criticism of a paper they’d published, and journals usually don’t like that either.

The connection between the psychological concept of “generic language” and the problem of overgeneralization from research studies

A couple years ago I suggested: A quick fix in science communication: Switch from the present to the past tense.

Here’s an example. A paper was published, “Māori and Pacific people in New Zealand have a higher risk of hospitalisation for COVID-19,” and I recommended they change “have” to “had” in that title. More generally, I wrote,

There’s a common pattern in science writing to use the present tense to imply that you’ve discovered a universal truth. For example, “Beautiful parents have more daughters” or “Women are more likely to wear red or pink at peak fertility.” OK, those particular papers had other problems, but my point here is that at best these represented findings about some point in time and some place in the past.

Using the past tense in the titles of scientific reports won’t solve all our problems or even most of our problems or even many of our problems, but maybe it will be a useful start, in reminding authors as well as readers of the scope of their findings.

Recently it was brought to my attention that research has been conducted on this topic.

The relevant paper is Generic language in scientific communication, published by Jasmine DeJesus et al. in 2019, who write:

Scientific communication poses a challenge: To clearly highlight key conclusions and implications while fully acknowledging the limitations of the evidence. Although these goals are in principle compatible, the goal of conveying complex and variable data may compete with reporting results in a digestible form . . . For example, generic language (e.g., “Introverts and extraverts require different learning environments”) may mislead by implying general, timeless conclusions while glossing over exceptions and variability. Using generic language is especially problematic if authors overgeneralize from small or unrepresentative samples . . . In an analysis of 1,149 psychology articles, 89% described results using generics . . . Online workers and undergraduate students judged findings expressed with generic language more important than findings expressed with nongeneric language.

It’s good to see this coming out in the psychology literature, given that just a few years ago a prominent psychology professor expressed annoyance when I raised concerns about representativeness in a published study.

Also relevant is our post from a few years ago, Correlation does not even imply correlation, which also addressed the challenges of drawing general conclusions from nonrepresentative samples in the presence of selection bias.

P.S. Also relevant is a post from 2010, “How hard is it to say what you mean?”

Beneath every application of causal inference to ML lies a ridiculously hard social science problem

This is Jessica. Zach Lipton gave a talk at an event on human-centered AI at the University of Chicago the other day that resonated with me, in which he commented on the adoption of causal inference to solve machine learning problems. The premise was that there’s been considerable reflection lately on methods in machine learning, as it has become painfully obvious that accuracy on held-out IID data is often not a good predictor of model performance in a real-world deployment. So, one Book of Why-reading computer scientist at a time, researchers are adapting causal inference methods to make progress on problems that arise in predictive modeling.

For example, Northwestern CS now regularly offers a causal machine learning course for undergrads. Estimating counterfactuals is common in approaches to fairness and algorithmic recourse (recommendations of the minimal intervention someone can take to change their predicted label), and in “explainable AI.” Work on feedback loops (e.g., performative prediction) is essentially about how to deal with causal effects of the predictions themselves on the outcomes. 

Jake Hofman et al. have used the term integrative modeling to refer to activities that attempt to predict as-yet unseen outcomes in terms of causal relationships. I have generally been a fan of research happening in this bucket, because I think there is value in making and attempting to test assertions about how we think data are generated. Often doing so lends some conceptual clarity, even if all you get is a better sense of what’s hard about the problem you’re trying to solve. However, it’s not necessarily easy to find great examples yet of integrative modeling. Lipton’s critique was that despite the conceptual elegance gained in bringing causal methods to bear on machine learning problems, their promise for actually solving the hard problems that come up in ML is somewhat illusory, because they inevitably require us to make assumptions that we can’t really back up in the kinds of high dimensional prediction problems on observational data that ML deals with. Hence the title of this post, that ultimately we’re often still left with some really hard social science problem. 

There is an example that this brings to mind which I’d meant to post on over a year ago, involving causal approaches to ML fairness. Counterfactuals are often used to estimate the causal effects of protected attributes like race in algorithmic auditing. However, some applications have been met with criticism for not reflecting common sense expectations about the effects of race on a person’s life. For example, consider the well known 2004 AER paper by Bertrand and Mullainathan, “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,” which attempts to measure race-based discrimination in callbacks on fake resumes by manipulating applicant names on the same resumes to imply different races. Lily Hu uses this example to critique approaches to algorithmic auditing based on direct effects estimation. Hu argues that assuming you can identify racial discrimination by imagining flipping race differently while holding all other qualifications or personal attributes of people constant is incoherent, because the idea that race can be switched on and off without impacting other covariates is incompatible with modern understanding of the effects of race. In this view, Pearl’s statement in Causality that “[t]he central question in any employment discrimination case is whether the employer would have taken the same action had the employee been of a different race… and everything else had been the same” exhibits a conceptual error, previously pointed out by Kohler-Hausmann, where race is treated as phenotype or skin type alone, misrepresenting the actual socially constructed nature of race. Similar ideas have been discussed before on the blog around detecting racial bias in police behavior, such as use of force, e.g., here.

Path-specific counterfactual fairness methods instead assume the causal graph is known, and hinge on identifying fair versus unfair pathways affecting the outcome of interest. For example, if you’re using matching to check for discrimination, you should be matching units only on path-specific effects of race that are considered fair. To judge if a decision to not call back a black junior in high school with a 3.7 GPA was fair, we need methods that allow us to ask whether he would have gotten the callback if he were his white counterpart. If both knowledge and race are expected to affect GPA, but only one of these is fair, we should adjust our matching procedure to eliminate what we expect the unfair effect of race on GPA to be, while leaving the fair pathway. If we do this we are likely to arrive at a white counterpart with a higher GPA than 3.7, assuming we think being black leads to a lower GPA due to obstacles not faced by the white counterpart, like boosts in grades due to preferential treatment.  
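
As a toy illustration of that matching idea (my own sketch, not code from Hu’s paper or any fairness library), here is what the “adjust only the unfair pathway” step looks like. The applicant profile and the size of the unfair-pathway adjustment are invented for illustration:

```python
from dataclasses import dataclass, replace

@dataclass
class Applicant:
    race: str
    school: str
    gpa: float

# Assumed (made-up) effect of race on GPA along the pathway we judge unfair,
# e.g. grade-depressing obstacles. Picking this number -- and deciding which
# pathways route through it -- is the normative step; the callback data alone
# do not supply it.
UNFAIR_GPA_EFFECT = -0.2

def white_counterpart(a: Applicant) -> Applicant:
    """Counterfactual used for matching: flip race and undo only the
    assumed unfair pathway's contribution to GPA, leaving the 'fair'
    knowledge pathway untouched."""
    if a.race != "black":
        return a
    return replace(a, race="white", gpa=round(a.gpa - UNFAIR_GPA_EFFECT, 2))

jamal = Applicant(race="black", school="Pomona High School", gpa=3.7)
print(white_counterpart(jamal))
# Applicant(race='white', school='Pomona High School', gpa=3.9)  -- "Greg1"
```

The entire fairness judgment lives in UNFAIR_GPA_EFFECT and in the choice of which covariates it touches, which is exactly the slipperiness Hu describes next.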

One of Hu’s conclusions is that while this all makes sense in theory, it becomes a very slippery thing to try to define in practice:

To determine whether an employment callback decision process was fair, causal approaches ask us to determine the white counterpart to Jamal, a Black male who is a junior with a 3.7 GPA at the predominantly Black Pomona High School. When we toggle Jamal’s race attribute from black to white and cascade the effect to all of his “downstream” attributes, he becomes white Greg. Who is this Greg? Is it Greg of the original audit study, a white male who is a junior at Pomona High School with a 3.7 GPA? Is it Greg1, a white male who is a junior at Pomona High School with a 3.9 GPA (adjusted for the average Black-White GPA gap at Pomona High School)? Or is it Greg2, a white male who is a junior at nearby Diamond Ranch High School—the predominantly white school in the area—with a 3.82 GPA (accounting for nationwide Black-White GPA gap)? Which counterfactual determines whether Jamal has been treated fairly? Will the real white Greg please stand up?

And so we’re left with the non-trivial task of getting experts to agree on the normative interpretation of which pathways are fair, and what the relevant populations are for estimating effects along the unfair pathways.

This reminds me a bit of the motivation behind writing this paper comparing concerns about ML reproducibility and generalizability to perceived causes of the replication crisis in social science, and of my grad course on explanation and reproducibility in data-driven science. It’s easy to think that one can take methods from explanatory modeling to solve problems related to distribution shift, and on some level you can make some progress, but you better be ready to embrace some unresolvable uncertainty due to not knowing if your model specification was a good approximation. At any rate, there’s something kind of reassuring about listening to ML talks and being reminded of the crud factor.

How did some of this goofy psychology research become so popular? I think it’s a form of transubstantiation.

OK, more on junk science. Sorry! But it’s been in the news lately, and people keep emailing me about it.

For those of you who want some more technical statistics content, here are some recent unpublished papers to keep you busy:

BISG: When inferring race or ethnicity, does it matter that people often live near their relatives?

Simulation-based calibration checking for Bayesian computation: The choice of test quantities shapes sensitivity

Nested R-hat: Assessing the convergence of Markov chain Monte Carlo when running many short chains

OK, are you satisfied??? Good. Now back to today’s topic: the mysterious popularity of goofy psychology research.

Here’s the deal. We’ve been hearing a lot about glamorous scientists who go on Ted and NPR, write airport bestsellers, get six-figure speaking gigs . . . and then it turns out, first that their work does not replicate, and next that their fame and fortune were based on scientific publications that were fatally flawed. Maybe the data were fabricated, maybe the experiments never happened, maybe the data were analyzed manipulatively or incompetently, often even if everything was on the up-and-up these studies were too noisy for anything useful to be learned.

At this point we usually ask, What happened? How did this bad work get done? Or, How did it not get caught, staying aloft, Wile E. Coyote-like, for years, with no means of support? Or, What caused the eventual fall?

But today I want to ask a different question: How did this work get all the adoring publicity in the first place?

Sometimes the answer seems clear to me. Brian Wansink, for example, the guy who claimed that if you move items around on the menu or use a smaller plate or whatever, you could get people to eat 28% less for lunch. That’s a big deal. Big if true, as the saying goes. If the work was legit, it deserved all the publicity it got.

Similarly with Freakonomics, which has some strong messages regarding incentives and what you can learn from observing people’s behaviors. Some of the research they promoted was mistaken, but they really were going for important topics much of the time. And the Gladwellverse. No, I don’t believe that therapist’s claim that he can predict with 93% accuracy who’s gonna get divorced—but if it were the case, it would be worth hearing about. Again, big if true.

Or, if a researcher cures cancers in mice and then gives a Ted talk saying how he’s gonna end cancer in humans, ok, sure, that’s an exaggeration, but the relevance is clear.

Other times, though, I didn’t get it.

“Stockholm’s Replication Games”

Dan Klein points us to this event announced by Abel Brodeur and Anna Dreber:

On October 26th 2023 we will organize the Stockholm’s Replication Game at the Stockholm School of Economics, Sweden.

Participants will be matched with other researchers working in the same field. Each team will work on replicating a recently published study in a leading journal in economics or political science. All participants will be given co-authorship to a meta paper aggregating the replications results of multiple replication games.

This is a hybrid event. In person participation is encouraged but it will also be possible to join virtually. The event will be at Stockholm School of Economics and coffee will be provided.

To participate please contact Abel Brodeur ([email protected]). Indicate your research interests and the statistical software you are comfortable working with. Also please note whether you want to participate remotely or in person (we have a limit of about 40 participants for the latter).

The project is led by Abel Brodeur (University of Ottawa), Anna Dreber (Stockholm School of Economics), Fernando Hoces (UC Berkeley) and Edward Miguel (UC Berkeley).

This event is part of games organized all over the world (Australia, Canada, France, US, etc.) by the Institute for Replication (https://i4replication.org/). A flyer for the event is attached. More details about the event can be found here: https://i4replication.org/description.html

Sounds like a great idea!

“Psychology needs to get tired of winning”

Olaf Zimmermann points to this article by Gerald Haeffel, which begins:

Nearly 100% of the published studies in psychology confirm the initial hypothesis. This is an amazing accomplishment given the complexity of the human mind and human behaviour. Somehow, as psychological scientists, we always find the expected result; we always win! Recently, however, the legitimacy of psychology’s winning streak has been called into question. Major replication projects show that only about half of psychological findings replicate. Further, there is evidence that psychology’s winning streak may be due to cheating. Similarly to how baseball’s homerun chase in the United States was fuelled by steroids (e.g. Bonds, McGuire, Sosa) and Lance Armstrong’s Tour de France streak was aided by doping, psychology’s winning streak may be the result of questionable research practices . . .

Haeffel summarizes:

This is a problem because science progresses from being wrong. For decades, there have been calls for better theories and the adoption of a strong inference approach to science. However, there is little reason to believe that psychological science is ready to change. Although recent developments like the open science movement have improved transparency and replicability, they have not addressed psychological science’s method-oriented (rather than problem-oriented) mindset. Psychological science still does not embrace the scientific method of developing theories, conducting critical tests of those theories, detecting contradictory results, and revising (or disposing of) the theories accordingly.

Well put. And it’s not just psychology. I’ve seen a lot of this in political science and economics as well. In medical research the problems are more complicated, as there’s a mix of declaring victory from noise and declaring null effects just cos a comparison is not statistically significant.

Regarding psychology in particular, I’ll point to my article with Simine Vazire, Why Did It Take So Many Decades for the Behavioral Sciences to Develop a Sense of Crisis Around Methodology and Replication?, which considers some possible reasons why, on or about December 2010, the behavioral sciences changed.

West Point, like major league baseball, was purely based on merit and achievement before Jackie Robinson and that Tuskegee Airman guy came along and messed everything up.

This news article, “Anti-Affirmative Action Group Sues West Point Over Admissions Policy,” contained this amazing quote:

“For most of its history, West Point has evaluated cadets based on merit and achievement,” the group said in its complaint, filed on Tuesday in the Southern District of New York. But that changed, the group argued, over the last few decades.

Whaaa . . .?

How many black cadets did they have in the 1840s, anyway? Guess we gotta check the internet . . . googling *black cadets at west point* points us to this article from the National Museum of African American History and Culture:

In its first 133 years of existence (1802–1935), over 10,000 white cadets graduated from the United States Military Academy at West Point. In stark contrast, only three African American cadets could claim this achievement . . . Benjamin O. Davis Jr. became the fourth African American cadet to graduate in 1936. Perhaps best known as commander of the famous Tuskegee Airmen in World War II, Davis had a long and distinguished career in the Air Force before retiring in 1970 at the rank of Lieutenant General. . . .

They’ve got this juicy quote from Major General John M. Schofield, Superintendent of West Point, in 1880:

“To send to West Point for four years competition a young man who was born in slavery is to assume that half a generation is sufficient to raise a colored man to the social, moral, and intellectual level which the average white man has reached in several hundred years. As well might the common farm horse be entered in a four-mile race against the best blood inherited from a long line of English racers.”

The article continues:

Between 1870 and 1899, only 12 African American cadets were admitted to West Point. Each endured physical and emotional abuse and racist treatment from their white peers and professors throughout their time at the Academy. They were ostracized, barred from social activities with other cadets, and spoken to only when officially necessary, a practice known as silencing. While white cadets were hazed by their fellow cadets as punishment for serious misconduct, Black cadets were hazed for being Black and for being at West Point.

OK, so here’s the score:

Years       # black cadets   # black graduates
1802-1869   0                0
1870-1935   12               3

Given the data, it’s absolutely ridiculous of them to say, “For most of its history, West Point has evaluated cadets based on merit and achievement.”

Is that just how lawyers write things in official complaints? Is the idea to make some ludicrous claims just to distract the other side? I don’t get it. If I were a judge, that sort of thing would just annoy me. Then again, I’m not a judge.

It’s just like major league baseball, which was purely based on merit and achievement until those pesky affirmative action bureaucrats came along in 1947 to mess everything up.

In all seriousness, it seems in retrospect to have been a terrible decision to restrict the military academies to whites for the first 100+ years. Imagine if Robert Lee and Stonewall Jackson had had black classmates at West Point. Maybe then they wouldn’t have been so gung-ho to lead troops in defense of slavery. The Dred Scott decision would have a different meaning if it was their friends who were at risk of being kidnapped and enslaved. It seems fair enough to draw a direct line from an all-white West Point to the tragedy of the Civil War. And then for some twit in 2023 to say, “For most of its history, West Point has evaluated cadets based on merit and achievement” . . . !

P.S. Annoyingly, the news article does not link to the actual complaint. But after some googling, I found it here. The whole thing is kinda nuts. In the same paragraph where they make the obviously false claim, “For most of its history, West Point has evaluated cadets based on merit and achievement,” they also point out that the U.S. military wasn’t desegregated until 1948!

Later on, they refer to “the brief period of racial unrest [from 1969 to 1972] that West Point retells over and over.” I guess they’re cool about the first 150 years or so when blacks were entirely or nearly-entirely excluded and evaluation was based on “based on merit and achievement”: keep the place essentially all-white and intimidate the few black cadets who are there, and you have no racial unrest, huh?

Affirmative action is a complicated issue. I don’t think this particular group is helping anyone by trying to push a distorted version of history.

P.P.S. Relevant context from Dred Scott v. Sandford:

The question is simply this: Can a negro whose ancestors were imported into this country, and sold as slaves, become a member of the political community formed and brought into existence by the Constitution of the United States, and as such become entitled to all the rights and privileges and immunities guaranteed to the citizen? . . .

It will be observed, that the plea applies to that class of persons only whose ancestors were negroes of the African race, and imported into this country, and sold and held as slaves. The only matter in issue before the court, therefore, is, whether the descendants of such slaves, when they shall be emancipated, or who are born of parents who had become free before their birth, are citizens of a State, in the sense in which the word citizen is used in the Constitution of the United States. . . .

The situation of this population was altogether unlike that of the Indian race. The latter, it is true, formed no part of the colonial communities, and never amalgamated with them in social connections or in government. But although they were uncivilized, they were yet a free and independent people, associated together in nations or tribes, and governed by their own laws. . . .

The words “people of the United States” and “citizens” are synonymous terms, and mean the same thing. They both describe the political body who, according to our republican institutions, form the sovereignty, and who hold the power and conduct the Government through their representatives. . . . The question before us is, whether the class of persons described in the plea in abatement compose a portion of this people, and are constituent members of this sovereignty? We think they are not, and that they are not included, and were not intended to be included, under the word “citizens” in the Constitution, and can therefore claim none of the rights and privileges which that instrument provides for and secures to citizens of the United States. On the contrary, they were at that time considered as a subordinate and inferior class of beings . . .

They had for more than a century before been regarded as beings of an inferior order, and altogether unfit to associate with the white race, either in social or political relations; and so far inferior, that they had no rights which the white man was bound to respect; and that the negro might justly and lawfully be reduced to slavery for his benefit. He was bought and sold, and treated as an ordinary article of merchandise and traffic, whenever a profit could be made by it. This opinion was at that time fixed and universal in the civilized portion of the white race. It was regarded as an axiom in morals as well as in politics, which no one thought of disputing . . .

That was in the good old days, back when West Point “evaluated cadets based on merit and achievement” and there was no racial unrest. Somewhere between 1857 and today, something seems to have gone terribly wrong, according to this new lawsuit. Too bad for them that Roger Taney is no longer on the court.

A message to Parkinson’s Disease researchers: Design a study to distinguish between these two competing explanations of the fact that the incidence of Parkinson’s is lower among smokers

After reading our recent post, “How to quit smoking, and a challenge to currently-standard individualistic theories in social science,” Gur Huberman writes:

You may be aware that the incidence of Parkinson (PD) is lower in the smoking population than in the general population, and that negative relation is stronger for the heavier & longer duration smokers.

The reason for that is unknown. Some neurologists conjecture that there’s something in smoked tobacco which causes some immunity from PD. Others conjecture that whatever causes PD also helps people quit or avoid smoking. For instance, a neurologist told me that Dopamine (the material whose deficit causes PD) is associated with addiction not only to smoking but also to coffee drinking.

Your blog post made me think of a study that will try to distinguish between the two explanations for the negative relation between smoking and PD. Such a study will exploit variations (e.g., in geography & time) between the incidence of smoking and that of PD.

It will take a good deal of leg work to get the relevant data, and a good deal of brain work to set up a convincing statistical design. It will also be very satisfying to see convincing results one way or the other. More than satisfying, such a study could help develop medications to treat or prevent PD.

If this project makes sense perhaps you can bring it to the attention of relevant scholars.

OK, here it is. We’ll see if anyone wants to pick this one up.

I have some skepticism about Gur’s second hypothesis, that “whatever causes PD also helps people quit or avoid smoking.” I say this only because, from my perspective, and as discussed in the above-linked post, the decision to smoke seems like much more of a social attribute than an individual decision. But, sure, I could see how there could be correlations.

In any case, it’s an interesting statistical question as well as an important issue in medicine and public health, so worth thinking about.
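
For concreteness, here’s one possible starting point, in Python with hypothetical data, for the kind of geography-and-time comparison Gur describes. This is only a sketch of the “brain work” he mentions, not a convincing design on its own; the file, column names, and 20-year lag are all placeholders:

```python
# A minimal sketch (mine, not a worked-out design): regress regional
# Parkinson's incidence on lagged regional smoking prevalence, with region
# and year fixed effects and region-clustered standard errors.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("pd_smoking_panel.csv")  # hypothetical: region, year,
                                             # pd_incidence, smoking_prev
panel = panel.sort_values(["region", "year"])
panel["smoking_lag20"] = panel.groupby("region")["smoking_prev"].shift(20)

df = panel.dropna(subset=["smoking_lag20"])
fit = smf.ols("pd_incidence ~ smoking_lag20 + C(region) + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["region"]}
)
print(fit.summary())
```

By itself a regression like this would not separate the two explanations; that takes more careful thought about what each hypothesis predicts for lagged regional variation.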

How to quit smoking, and a challenge to currently-standard individualistic theories in social science

Paul Campos writes:

Probably the biggest public health success in America over the past half century has been the remarkably effective long-term campaign to reduce cigarette smoking. The percentage of adults who smoke tobacco has declined from 42% in 1965 (the first year the CDC measured this), to 12.5% in 2020.

It’s difficult to disentangle the effect of various factors that have led to this stunning decline of what was once a ubiquitous habit — note that if we exclude people who report having no more than one or two drinks per year, the current percentage of alcohol drinkers in the USA is about the same as the percentage of smokers 60 years ago — but the most commonly cited include:

Anti-smoking educational campaigns

Making it difficult to smoke in public and many private spaces

Increasing prices

Improved smoking cessation treatments, and laws requiring the cost of these to be covered by medical insurance

I would add another factor, which is more broadly cultural than narrowly legal or economic: smoking has become declasse.

This is evident if you look at the relationship between smoking rates and education and income: While 32% of people with a GED smoke, the percentages for holders of four-year college degrees and graduate degrees are 5.6% and 3.5% respectively. And while 20.2% of people with household incomes under $35,000 smoke, 6.2% of people with household incomes over $100,000 do.

All worth noting. Anti-smoking efforts are a big success story, such a big story that it’s easy to forget.

The sharp decline in smoking is a big “stylized fact,” as we say in social science, comparable to other biggies such as the change in acceptance of gay people in the past few decades, and the also-surprising lack of change in attitudes toward abortion.

When we have a big stylized fact like this, we should milk it for as much understanding as we can.

With that in mind, I have a few things to add on the topic:

1. Speaking of stunning, check out these Gallup poll results on rates of drinking alcohol:

At least in the U.S., rich people are much more likely than poor people to drink. That’s the opposite of the pattern with smoking.

2. Speaking of “at least in the U.S.”, it’s my impression that smoking rates have rapidly declined in many other countries too, so in that sense it’s more of a global public health success.

3. Back to the point that we should recognize how stunning this all is: 20 years ago, they banned smoking in bars and restaurants in New York. All at once, everything changed, and you could go to a club and not come home with your clothes smelling like smoke, pregnant women could go places without worrying about breathing it all in, etc. When this policy was proposed and then when it was clear it was really gonna happen, lots of lobbyists and professional contrarians and Debby Downers and free-market fanatics popped up and shouted that the smoking ban would never work, it would be an economic disaster, the worst of the nanny state, bla bla bla. Actually it worked just fine.

4. It’s said that quitting smoking is really hard. Smoking-cessation programs have notoriously low success rates. But some of that is selection bias, no? Some people can quit smoking without much problem, and those people don’t need to try smoking-cessation programs. So the people who do try those programs are a subset that overrepresents people who can’t so easily break the habit.
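
Here’s a quick simulation of that selection story, with made-up numbers; to isolate the selection effect, the program is assumed to have no effect at all on whether someone quits:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Made-up mix: 40% of smokers could quit fairly easily, 60% find it very hard.
easy = rng.random(n) < 0.4
p_quit = np.where(easy, 0.8, 0.2)   # chance of quitting, same with or
quit = rng.random(n) < p_quit       # without the program in this simulation

# Easy quitters rarely bother enrolling in a cessation program.
p_enroll = np.where(easy, 0.05, 0.5)
enrolled = rng.random(n) < p_enroll

print("quit rate among program enrollees:", round(quit[enrolled].mean(), 2))
print("quit rate among all smokers:      ", round(quit.mean(), 2))
```

The program’s observed success rate looks low simply because it is measured on the subset of smokers for whom quitting is hardest.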

5. We’re used to hearing the argument that, yeah, everybody knows cigarette smoking causes cancer, but people might want to do it anyway. There’s gotta be some truth to that: smoking relaxes people, or something like that. But also recall what the cigarette executives said, as recounted by historian Robert Proctor:

Philip Morris Vice President George Weissman in March 1954 announced that his company would “stop business tomorrow” if “we had any thought or knowledge that in any way we were selling a product harmful to consumers.” James C. Bowling . . . . Philip Morris VP, in a 1972 interview asserted, “If our product is harmful . . . we’ll stop making it.” Then again in 1997 the same company’s CEO and chairman, Geoffrey Bible, was asked (under oath) what he would do with his company if cigarettes were ever established as a cause of cancer. Bible gave this answer: “I’d probably . . . shut it down instantly to get a better hold on things.” . . . Lorillard’s president, Curtis Judge, is quoted in company documents: “if it were proven that cigarette smoking caused cancer, cigarettes should not be marketed” . . . R. J. Reynolds president, Gerald H. Long, in a 1986 interview asserted that if he ever “saw or thought there were any evidence whatsoever that conclusively proved that, in some way, tobacco was harmful to people, and I believed it in my heart and my soul, then I would get out of the business.”

6. A few years ago we discussed a study of the effects of smoking bans. My thought at the time was: Yes, at the individual level it’s hard to quit smoking, which might make one skeptical about the effects of measures designed to reduce smoking—but, at the same time, smoking rates vary a lot by country and by state. This was similar to our argument about the hot hand: given that basketball shooting success rates vary a lot over time and across game conditions, it should not be surprising that previous shots might have an effect. As I wrote awhile ago, “if ‘p’ varies among players, and ‘p’ varies over the time scale of years or months for individual players, why shouldn’t ‘p’ vary over shorter time scales too? In what sense is ‘constant probability’ a sensible null model at all?” Similarly, given how much smoking rates vary, maybe we shouldn’t be surprised that something could be done about it.
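
Here’s a tiny simulation of the varying-p point, with my own toy numbers: a shooter whose underlying probability differs across stretches of games will show a “hot hand” pattern in the data even though the previous shot has no causal effect at all:

```python
import numpy as np

rng = np.random.default_rng(2)

# Underlying shooting probability varies across stretches of games
# (made-up values); the previous shot has no causal effect.
n_blocks, block_len = 5_000, 20
p = np.repeat(rng.choice([0.4, 0.6], size=n_blocks), block_len)
makes = rng.random(p.size) < p

prev, curr = makes[:-1], makes[1:]
print("P(make | previous make):", round(curr[prev].mean(), 3))
print("P(make | previous miss):", round(curr[~prev].mean(), 3))
```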

7. To me, though, the most interesting thing about the stylized facts on smoking is how there is this behavior that is so hard to change at the individual level but can be changed so much at the national level. This runs counter to currently-standard individualistic theories in social science in which everything is about isolated decisions. It’s more of a synthesis: change came from policy and from culture (whatever that means), but this still had to work its way through individual decisions. This idea of behavior being changed by policy almost sounds like “embodied cognition” or “nudge,” but it feels different to me in being more brute force. Embodied cognition is things like giving people subliminal signals; nudge is things like subtly changing the framing of a message. Here we’re talking about direct education, taxes, bans, big fat warning labels: nothing subtle or clever that the nudgelords would refer to as a “masterpiece.”

Anyway, this idea of changes that can happen more easily at the group or population level than at the individual level, that’s interesting to me. I guess things like this happen all over—“social trends”—and I don’t feel our usual social-science models handle them well. I don’t mean that no models work here, and I’m sure that lots of social scientists have done serious work in this area; it just doesn’t seem to quite line up with the usual way we talk about decision making.

P.S. Separate from all the above, I just wanted to remind you that there’s lots of really bad work on smoking and its effects; see here, for example. I’m not saying that all the work is bad, just that I’ve seen some really bad stuff, maybe no surprise what with all the shills on one side and all the activists on the other.

Crypto scam social science thoughts: The role of the elite news media and academia

Campos quotes from one of the many stories floating around regarding ridiculous cryptocurrency scams.

I’m not saying it should’ve been obvious in retrospect that crypto was a scam, just that (a) it always seemed that it could be a scam, and (b) for awhile there have been many prominent people saying it was a scam. Again, prominent people can be in error; what I’m getting at is that the potential scamminess was out there.

The usual way we think about scams is in terms of the scammers and the suckers, and also about the regulatory framework that lets people get away with it.

Here, though, I want to talk about something different, which is the role of outsiders in the information flow. For crypto, we’re talking about trusted journalistic intermediaries such as Michael Lewis or Tyler Cowen who were promoting or covering for crypto.

There were lots of reasons for respected journalists or financial figures to promote crypto, including political ideology, historical analogies, financial interest, FOMO, bandwagon-following, contrarianism, and plain old differences of opinion . . . pretty much the same set of reasons for respected journalists or financial figures to have been crypto-skeptical!

My point here is not that I knew better than the crypto promoters—yes, I was crypto-skeptical but not out of any special knowledge—; rather, it’s that the infrastructure of elite journalism was, I think, crucial to keeping the bubble afloat. Sure, crypto had lots of potential just from rich guys selling to each other and throwing venture capital at it, and suckers watching Alex Jones or whatever investing their life savings, but elite media promotion took it to the next level.

It’s not like I have any answers to this one. There were skeptical media all along, and I can’t really fault the media for spotting a trend that was popular among richies and covering it.

I’m just interested in these sorts of conceptual bubbles, whether they be financial scams or bad science (ovulation and voting, beauty and sex ratio, ESP, himmicanes, nudges, UFOs, etc etc etc), and how they can stay afloat in Wile E. Coyote fashion long after they’ve been exposed.

Crypto is different from Theranos or embodied cognition, I guess, in that it has no inherent value and thus can retain value purely as part of a Keynesian beauty contest, whereas frauds or errors that make actual scientific or technological claims can ultimately be refuted. Paradoxically, crypto’s lack of value—actually, its negative value, given its high energy costs—can make it a more plausible investment than businesses or ideas that could potentially do something useful if their claims were in fact true.

P.S. More here from David Morris on the role of the elite news media in this story.

The authors of research papers have no obligation to share their data and code, and I have no obligation to believe anything they write.

Michael Stutzer writes:

This study documents substantial variability in different researchers’ results when they use the same financial data set and are supposed to test the same hypotheses. More generally, I think the prospect for reproducibility in finance is worse than in some areas, because there is a publication bias in favor of a paper that uses a unique dataset provided by a firm. Because this is proprietary data, the firm often makes the researcher promise not to share the data with anybody, including the paper’s referees.

Read the leading journals’ statements carefully and you find that they don’t strictly require sharing.

Here is the statement for authors made by the Journal of Financial Econometrics: “Where ethically and legally feasible, JFEC strongly encourages authors to make all data and software code on which the conclusions of the paper rely available to readers. We suggest that data be presented in the main manuscript or additional supporting files, or deposited in a public repository whenever possible.”

In other words, an author wouldn’t have to share a so-called proprietary data set as defined above, even with the papers’ referees. What is worse, the leading journals not only accept these restrictions, but seem to favor such work over what is viewed as more garden-variety work that employs universally available datasets.

Interesting. I think it’s just as bad in medical or public health research, but there the concern is sharing confidential information. Even in settings where it’s hard to imagine that the confidentiality would matter.

As I’ve said in other such settings, the authors of research papers have no obligation to share their data and code, and I have no obligation to believe anything they write.

That is, my preferred solution is not to nag people for their data, it’s just to move on. That said, this strategy works fine for silly examples such as fat arms and voting, or the effects of unionization on stock prices, but you can’t really follow it for research that is directly relevant to policy.

When I said, “judge this post on its merits, not based on my qualifications,” was this anti-Bayesian? Also a story about lost urine.

Paul Alper writes, regarding my post criticizing an epidemiologist and a psychologist who were coming down from the ivory tower to lecture us on “concrete values like freedom and equality”:

In your P.P.S. you write,

Yes, I too am coming down from the ivory tower to lecture here. You’ll have to judge this post on its merits, not based on my qualifications. And if I go around using meaningless phrases such as “concrete values like freedom and equality,” please call me on it!

While this sounds reasonable, is it not sort of anti-Bayes? By that I mean your qualifications represent a prior and the merits the (new) evidence. I am not one to revere authority but deep down in my heart, I tend to pay more attention to a medical doctor at the Mayo Clinic than I do to Stella Immanuel. On the other hand, decades ago the Mayo Clinic misplaced (lost!) a half liter of my urine and double charged my insurance when getting a duplicate a few weeks later.

Alper continues:

Upon reflection—this was over 25 years ago—a half liter of urine does sound like an exaggeration, but not by much. The incident really did happen and Mayo tried to charge for it twice. I certainly have quoted it often enough so it must be true.

On the wider issue of qualifications and merit, surely people with authority (degrees from Harvard and Yale, employment at the Hoover Institution, Nobel Prizes) are given slack when outlandish statements are made. James Watson, however, is castigated exceptionally precisely because of his exceptional longevity.

I don’t have anything to say about the urine, but regarding the Bayesian point . . . be careful! I’m not saying to make inferences or make decisions solely based on local data, ignoring prior information coming from external data such as qualifications. What I’m saying is to judge this post on its merits. Then you can make inferences and decisions in some approximately Bayesian way, combining your judgment of this post with your priors based on your respect for my qualifications, my previous writings, etc.

This is related to the point that a Bayesian wants everybody else to be non-Bayesian. Judge my post on its merits, then combine with prior information. Don’t double-count the prior.
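
Here’s a tiny numeric version of the double-counting point, using a toy normal model with made-up numbers: if the report you’re reading has already folded in the same prior you hold, multiplying your prior in again pulls the answer too far toward the prior and makes it look more certain than it should be.

```python
# Toy normal-normal example of "don't double-count the prior"; all numbers
# are made up. Prior on an effect: N(0, 1). Evidence: estimate 2.0, s.e. 1.0.
def combine(m1, v1, m2, v2):
    """Precision-weighted combination of two independent normal sources."""
    v = 1 / (1 / v1 + 1 / v2)
    return v * (m1 / v1 + m2 / v2), v

prior = (0.0, 1.0)       # (mean, variance)
evidence = (2.0, 1.0)

posterior = combine(*prior, *evidence)        # prior used once: correct
double_counted = combine(*prior, *posterior)  # "report" already includes the
                                              # prior, and we fold it in again
for label, (m, v) in [("prior once", posterior), ("double-counted", double_counted)]:
    print(f"{label}: mean {m:.2f}, sd {v ** 0.5:.2f}")
```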

How many Americans drink alcohol? And who are they?

Following up on yesterday’s post, I was wondering how many Americans drink. A quick google led to this page from Gallup. The overall rate has been steady at a bit over 60% for decades:

And the graph at the top breaks things down by income. The richies are much more likely to drink than the poors. I guess there’s something to that stereotype of the country club.

Chris Chambers’s radical plan for Psychological Science

Someone pointed me to this Vision Statement by Chris Chambers, a psychology professor who would like to edit Psychological Science, a journal that just a few years ago was notorious for publishing really bad junk science. Not as bad as PNAS at its worst, perhaps, but pretty bad. Especially because they didn’t just publish junk, they actively promoted it. Indeed, as late as 2021 the Association for Psychological Science was promoting the ridiculous “lucky golf ball” paper they’d published back in the bad old days of 2010.

So it does seem that the Association for Psychological Science and its journals are ripe for a new vision.

See here for further background.

Chambers has a 12-point action plan. It’s full of details about Accountable Replications and Exploratory Reports and all sorts of other things that I don’t really know about, so if you’re interested I recommend you just follow the link and take a look for yourself.

My personal recommendation is that authors when responding to criticism not be allowed to claim that the discovery of errors “does not change the conclusion of the paper.” Or, if authors want to make that claim, they should be required to make it before publication, a kind of declaration of results independence. Something like this: “The authors attest that they believe their results so strongly that, no matter what errors are found in their data or analysis, they will not change their beliefs about the results.” Just get it out of the way already; this will save everyone lots of time that might otherwise be spent reading the paper.

Why does education research have all these problems?

A few people pointed me to a recent news article by Stephanie Lee regarding another scandal at Stanford.

In this case the problem was an unstable mix of policy advocacy and education research. We’ve seen this sort of thing before at the University of Chicago.

The general problem

Why is education research particularly problematic? I have some speculations:

1. We all have lots of experience of education and lots of memories of education not working well. As a student, it was often clear to me that things were being taught wrong, and as a teacher I’ve often been uncomfortably aware of how badly I’ve been doing the job. There’s lots of room for improvement, even if the way to get there isn’t always so obvious. So when authorities make loud claims of “50% improvement in test scores,” this doesn’t seem impossible, even if we should know better than to trust them.

2. Education interventions are difficult and expensive to test formally but easy and cheap to test informally. A formal study requires collaboration from schools and teachers, and if the intervention is at the classroom level it requires many classes and thus a large number of students. Informally, though, we can come up with lots of ideas and try them out in our classes. Put these together and you get a long backlog of ideas waiting for formal study.

3. No matter how much you systematize teaching—through standardized tests, prepared lesson plans, MOOCs, or whatever—the process of learning still occurs at the individual level, one student at a time. This suggests that effects of any interventions will depend strongly on context, which in turn implies that the average treatment effect, however defined, won’t be so relevant to real-world implementation.

4. Continuing on that last point, the big challenge of education is student motivation. Methods for teaching X can typically be framed as some mix of two things: methods for motivating students to want to learn X, and methods for keeping students motivated to practice X with awareness. These things are possible, but they’re challenging, in part because of the difficulty of pinning down “motivation.”

5. Education is an important topic, a lot of money is spent on it, and it’s enmeshed in the political process.

Put these together and you get a mess that is not well served by the traditional push-a-button, take-a-pill, look-for-statistical-significance model of quantitative social science. Education research is full of people who are convinced that their ideas are good, with lots of personal experience that seems to support their views, but with great difficulty in getting hard empirical evidence, for reasons explained in items 2 and 3 above. So you can see how policy advocates can get frustrated and overstate the evidence in favor of their positions.

The scandal at Stanford

As Kinsley famously put it, the scandal isn’t what’s illegal, the scandal is what’s legal. It’s legal to respond to critics with some mixture of defensiveness and aggression that dodges the substance of the criticism. But to me it’s scandalous that such practices are so common in elite academia. The recent scandal involved the California Math Framework, a controversial new curriculum plan that has been promoted by Stanford professor Jo Boaler, who, as I learned in a comment thread, wrote a book called Mathematical Mindsets that had some really bad stuff in it. As I wrote at the time, it was kind of horrible that this book by a Stanford education professor was making a false claim and backing it up with a bunch of word salad from some rando on the internet. If you can’t even be bothered to read the literature in your own field, what are you doing at Stanford in the first place?? Why not just jump over the bay to Berkeley and write uninformed op-eds and hang out on NPR and Fox News? Advocacy is fine, just own that you’re doing it and don’t pretend to be writing about research.

In pointing out Lee’s article, Jonathan Falk writes:

Plenty of scary stuff, but the two lines I found scariest were:

Boaler came to view this victory as a lesson in how to deal with naysayers of all sorts: dismiss and double down.

Boaler said that she had not examined the numbers — but “I do question whether people who are motivated to show something to be inaccurate are the right people to be looking at data.”

I [Falk] get a little sensitive about this since I’ve spent 40 years in the belief that people who are motivated to show something to be inaccurate are the perfect people to be looking at the data, but I’m even more disturbed by her asymmetry here: if she’s right, then it must also be true that people who are motivated to show something to be accurate are also the wrong people to be looking at the data. And of course people with no motivations at all will probably never look at the data ever.

We’ve discussed this general issue in many different contexts. There are lots of true believers out there. Not just political activists, also many pure researchers who believe in their ideas, and then you get some people such as discussed above who are true believers both on the research and activism fronts. For these people, I don’t think the problem is that they don’t look at the data; rather, they know what they’re looking for and so they find it. It’s the old “researcher degrees of freedom” problem. And it’s natural for researchers with this perspective to think that everyone operates this way, hence they don’t trust outsiders who might come to different conclusions. I agree with Falk that this is very frustrating, a Gresham process similar to the way that propaganda media are used not just to spread lies and bury truths but also to degrade trust in legitimate news media.

The specific research claims in dispute

Education researcher David Dockterman writes:

I know some of the players. Many educators certainly want to believe, just as many elementary teachers want to believe they don’t have to teach phonics.

Popularity with customers makes it tough for middle ground folks to issue even friendly challenges. They need the eggs. Things get pushed to extremes.

He also points to this post from 2019 by two education researchers, who point to a magazine article coauthored by Boaler and write:

The backbone of their piece includes three points:

1. Science has a new understanding of brain plasticity (the ability of the brain to change in response to experience), and this new understanding shows that the current teaching methods for struggling students are bad. These methods include identifying learning disabilities, providing accommodations, and working to students’ strengths.

2. These new findings imply that “learning disabilities are no longer a barrier to mathematical achievement” because we now understand that the brain can be changed, if we intervene in the right way.

3. The authors have evidence that students who thought they were “not math people” can be high math achievers, given the right environment.

There are a number of problems in this piece.

First, we know of no evidence that conceptions of brain plasticity or (in prior decades) lack of plasticity, had much (if any) influence on educators’ thinking about how to help struggling students. . . . Second, Boaler and Lamar mischaracterize “traditional” approaches to specific learning disability. Yes, most educators advocate for appropriate accommodations, but that does not mean educators don’t try intensive and inventive methods of practice for skills that students find difficult. . . .

Third, Boaler and Lamar advocate for diversity of practice for typically developing students that we think would be unremarkable to most math educators: “making conjectures, problem-solving, communicating, reasoning, drawing, modeling, making connections, and using multiple representations.” . . .

Fourth, we think it’s inaccurate to suggest that “A number of different studies have shown that when students are given the freedom to think in ways that make sense to them, learning disabilities are no longer a barrier to mathematical achievement. Yet many teachers have not been trained to teach in this way.” We have no desire to argue for student limitations and absolutely agree with Boaler and Lamar’s call for educators to applaud student achievement, to set high expectations, and to express (realistic) confidence that students can reach them. But it’s inaccurate to suggest that with the “right teaching” learning disabilities in math would greatly diminish or even vanish. . . .

Do some students struggle with math because of bad teaching? We’re sure some do, and we have no idea how frequently this occurs. To suggest, however, that it’s the principal reason students struggle ignores a vast literature on learning disability in mathematics. This formulation sets up teachers to shoulder the blame for “bad teaching” when students struggle.

They conclude:

As to the final point—that Boaler & Lamar have evidence from a mathematics camp showing that, given the right instruction, students who find math difficult can gain 2.7 years of achievement in the course of a summer—we’re excited! We look forward to seeing the peer-reviewed report detailing how it worked.

Indeed. Here’s the relevant paragraph from Boaler and Lamar:

We recently ran a summer mathematics camp for students at Stanford. Eighty-four students attended, and all shared with interviewers that they did not believe they were a “math person.” We worked to change those ideas and teach mathematics in an open way that recognizes and values all the ways of being mathematical: including making conjectures, problem-solving, communicating, reasoning, drawing, modeling, making connections, and using multiple representations. After eighteen lessons, the students improved their achievement on standardized tests by the equivalent of 2.7 years. When district leaders visited the camp and saw students identified as having learning disabilities solve complex problems and share their solutions with the whole class, they became teary. They said it was impossible to know who was in special education and who was not in the classes.

This sort of Ted-worthy anecdote can seem so persuasive! I kinda want to be persuaded too, but I’ve seen too many examples of studies that don’t replicate. There are just so many ways things go wrong.

P.S. Lee has reported on other science problems at Stanford and has afflicted the comfortable, enough that she was unfairly criticized for it.

“Whom to leave behind”

OK, this one is hilarious.

The story starts with a comment from Jordan Anaya, pointing to the story of Michael Eisen, a biologist who, as editor of the academic journal eLife, changed its policy to a public review system. It seems that this policy was controversial, much more so than I would’ve thought. I followed the link Anaya gave to Eisen’s twitter feed, scrolled through, and came across the above item.

There are so many bizarre things about this, it’s hard to know where to start:

– The complicated framing. Instead of just putting them in a lifeboat or something, there’s this complicated story about a spaceship with Earth doomed for destruction.

– It says that these people have been “selected” as passengers. All the worthy people in the world, and they’re picking these guys. “An Accountant with a substance abuse problem”? “A racist police officer”? Who’s doing the selection, exactly? This is the best that humanity has to offer??

– That “Eight (8)” thing. What’s the audience for this: people who don’t know what the word “Eight” means?

– All the partial information, which reminds me of those old-fashioned logic puzzles (“Mr. White lives next door to the teacher, who is not the person who likes to play soccer,” etc.). We hear about the athlete being gay and a vegetarian, but what about the novelist or the international student? Are they gay? What do they eat?

– I love that “militant African American medical student.” I could see some conflict here with the “60-year old Jewish university administrator.” Maybe best not to put both on the same spaceship . . .

– Finally, the funniest part is the dude on twitter who calls this “demonic.” Demonic’s a bad thing, right?

Anyway, I was curious where this all came from so I googled *Whom to Leave Behind* and found lots of fun things. A series of links led to this 2018 news article from News5 Cleveland:

In an assignment given out at Roberts Middle School in Cuyahoga Falls, students had to choose who they felt were “most deserving” to be saved from a doomed Earth from a list based on race, religion, sexual orientation and other qualifications. . . .

In a Facebook post, Councilman Adam Miller said he spoke with the teacher who gave the assignment. He said the teacher intended to promote diversity. Miller told News 5 the teacher apologized for the assignment that has caused such controversy.

The Facebook link takes us here:

Hey—they’re taking the wife but leaving the accountant behind. How horrible!

The comments to the post suggest this activity has been around for a while:

And one of the commenters points to this variant:

I’m not sure exactly why, but I find this one a lot less disturbing than the version with the militant student, the racist cop, and the Jew.

There are a bunch of online links to that 2018 story. Even though the general lifeboat problem is not new, it seems like it only hit the news that one time.

But . . . further googling turns up lots of variants. Particularly charming was this version with ridiculously elaborate descriptions:

I love how they give each person a complex story. For example, sure, Shane is curing cancer, “but he is in a wheelchair.” I assume they wouldn’t have to take the chair onto the boat, though!

Also they illustrate the story with this quote:

“It’s not hard to make decisions when you know what your values are” – Roy Disney

This is weird, given that the whole point of the exercise is that it is hard to make decisions.

Anyway, the original lifeboat activity seems innocuous enough, a reasonable icebreaker for class discussion. But, yeah, adding all the ethnic stuff just seems like you’re asking for trouble.

It’s more fun than the trolley problem, though, I’ll give it that.

NYT does some rewriting without attribution. I guess this is standard in journalism but it seems unethical to me.

Palko points to this post by journalist Lindsay Jones, who writes:

It’s flattering to see @nytimes rewrite my [Jones’s] feature on two Canadian men switched at birth. You can read the original, exclusively reported @globeandmail story I took months to research and write as a former freelancer here: https://www.theglobeandmail.com/canada/article-switched-at-birth-manitoba/

The original article, published 10 Feb 2023 in the Toronto newspaper The Globe and Mail, is called “A hospital’s mistake left two men estranged from their heritages. Now they fight for answers,” subtitled, “In 1955, a Manitoba hospital sent Richard Beauvais and Eddy Ambrose home with the wrong families. After DNA tests revealed the mix-up, both want an explanation and compensation.” It begins:

One winter evening in 2020, Richard Beauvais and his wife pored over the online results of a genealogical DNA kit.

“They screwed up,” Mr. Beauvais surmised, sitting at the kitchen island in his ranch style home near the coastal community of Sechelt, B.C.

According to the test, he was Ukrainian, Polish and Jewish. Mr. Beauvais was stupefied. Mr. Beauvais, whose mother was Cree, grew up in a Métis settlement on the shores of Lake Manitoba and was taken into foster care at age eight or nine. The kit was a gift from his eldest daughter to help Mr. Beauvais learn more about his roots, including his French father, who died when he was 3. But here in front of him was a list of names and nationalities that, he thought, couldn’t be his. . . .

The followup appeared in the New York Times on 2 Aug 2023 and is called “Switched at Birth, Two Canadians Discover Their Roots at 67,” with subtitle “Two Canadian men who were switched at birth to families of different ethnicities are now questioning who they really are and learning how racial heritage shapes identities.” It begins:

Richard Beauvais’s identity began unraveling two years ago, after one of his daughters became interested in his ancestry. She wanted to learn more about his Indigenous roots — she was even considering getting an Indigenous tattoo — and urged him to take an at-home DNA test. Mr. Beauvais, then 65, had spent a lifetime describing himself as “half French, half Indian,” or Métis, and he had grown up with his grandparents in a log house in a Métis settlement.

So when the test showed no Indigenous or French background but a mix of Ukrainian, Ashkenazi Jewish and Polish ancestry, he dismissed it as a mistake and went back to his life as a commercial fisherman and businessman in British Columbia.

It’s amusing to see where the two articles differ. The Globe and Mail is a Canadian newspaper so they don’t need to keep reminding us in their headlines that the men are “Canadian.” They can jump right to “Manitoba” and “B.C.,” and they can just use the word “Métis” without defining it. For the Times, on the other hand, the “Jewish” part of the roots wasn’t enough—they needed to clarify for their readers that it was “Ashkenazi Jewish.”

What happened?

Did the Times article rip off the Globe and Mail article? We may never know. There are lots of similarities, but ultimately the two articles are telling the same story, so it makes sense the articles will be similar too. For example, the Times article mentions a tattoo that the daughter was considering, and the original Globe and Mail article has a photo of the tattoo she finally chose.

So what happened? One possibility is that the NYT reporter, who covers Canada and is based in Montreal, read the Globe and Mail story when it came out and decided to follow it up further. If I were a news reporter covering Canada, I’d probably read many newspapers each day from beginning to end—including the Globe and Mail. Another possibility is that the NYT reporter heard about the story from someone who’d read the Globe and Mail story and decided to follow up . . . in either case, it seems plausible that it would take a few months for it to get written and published.

It’s extremely hard to believe that the NYT reporter was unaware of the Globe and Mail article. If you’re writing a news article, you’ll google its subjects to make sure there’s nothing major that you’re missing. Here’s what comes up—I restricted the search to end on 31 July 2023 to avoid anything that came after the Times article appeared:

The first link above is the Globe and Mail article in question. The second link comes from something called Global News. They don’t link to or mention the Globe and Mail article either, but I guess they, like the Times a few months later, did some reporting of their own because they include a quote that was not in the original article.

Given that no Google links to the two names appeared before 10 Feb, I’m guessing that the Globe and Mail article from that date was the first time the story appeared. I wonder how Lindsay Jones, the author of that original article, came up with the story? At the end of the article it says:

Last year, reporter Lindsay Jones unravelled the mystery of how two baby girls got switched at a Newfoundland hospital in 1969.

And, at the end of that article, it says:

Freelance journalist Lindsay Jones spoke with The Decibel about unravelling the mystery of how Arlene Lush and Caroline Weir-Greene were switched at birth.

Unfortunately, The Decibel only seems to be audio with no transcript, so I’ll leave it to any of you to listen through for the whole story.

Standard practice

I think it’s standard practice in journalism to avoid referring or linking to whoever reported on the story before you. I agree with Jones that this is bad practice. Even beyond the issue of giving credit to the reporter who broke the story and the newspaper that gave space to publish it, it can be helpful to the reader to know the source.

This is not as bad as the story of the math professor who wrote a general-interest book about chess in which he took stories from other sources without attribution and introduced errors in the process. As I wrote at the time, that’s soooo frustrating, when you copy without clear attribution but you bungle it. I think that the act of hiding the sourcing makes it that much tougher to find the problem. Fewer eyes, less transparency.

Nor is it as bad as when a statistics professor copied without attribution from Wikipedia, again introducing his own errors. Yes, some faculty do add value—it’s just negative value.

The articles from Global News and the New York Times seem better than those cases, in that the authors did their own reporting. Still, it does a disservice to readers, as well as the reporter of the original story, to hide the source. Even if it’s standard practice, I still think it’s tacky.

(from 2017 but still relevant): What Has the Internet Done to Media?

Aleks Jakulin writes:

The Internet emerged by connecting communities of researchers, but as Internet grew, antisocial behaviors were not adequately discouraged.

When I [Aleks] coauthored several internet standards (PNG, JPEG, MNG), I was guided by the vision of connecting humanity. . . .

The Internet was originally designed to connect a few academic institutions, namely universities and research labs. Academia is a community of academics, which has always been based on the openness of information. Perhaps the most important to the history of the Internet is the hacker community composed of computer scientists, administrators, and programmers, most of whom are not affiliated with academia directly but are employed by companies and institutions. Whenever there is a community, its members are much more likely to volunteer time and resources to it. It was these communities that created websites, wrote the software, and started providing internet services.

“Whenever there is a community, its members are much more likely to volunteer time and resources to it” . . . so true!

As I wrote a few years ago, Create your own community (if you need to).

But it’s not just about community; you also have to pay the bills.

Aleks continues:

The skills of the hacker community are highly sought after and compensated well, and hackers can afford to dedicate their spare time to the community. Society is funding universities and institutes who employ scholars. Within the academic community, the compensation is through citation, while plagiarism or falsification can destroy someone’s career. Institutions and communities have enforced these rules both formally and informally through members’ desire to maintain and grow their standing within the community.

Lots to chew on here. First, yeah, I have skills that allow me to be compensated well, and I can afford to dedicate my spare time to the community. This is not new: back in the early 1990s I wrote Bayesian Data Analysis in what was essentially my spare time, indeed my department chair advised me not to do it at all—master of short-term thinking that he was. As Aleks points out, that was a time when a large proportion of internet users had this external compensation.

The other interesting thing about the above quote is that academics and tech workers have traditionally had an incentive to tell the truth, at least on things that can be checked. Repeatedly getting things wrong would be bad for your reputation. Or, to put it another way, you could be a successful academic and repeatedly get things wrong, but then you’d be crossing the John Yoo line and becoming a partisan hack. (Just to be clear, I’m not saying that being partisan makes you a hack. There are lots of scholars who express strong partisan views but with intellectual integrity. The “hack” part comes from getting stuff wrong, trying to pass yourself off as an expert on topics you know nothing about, ultimately being willing to say just about anything if you think it will make the people on your side happy.)

Aleks continues:

The values of academic community can be sustained within universities, but are not adequate outside of it. When businesses and general public joined the internet, many of the internet technologies and services were overwhelmed with the newcomers who didn’t share their values and were not members of the community. . . . False information is distracting people with untrue or irrelevant conspiracy theories, ineffective medical treatments, while facilitating terrorist organization recruiting and propaganda.

I’ve not looked at data on all these things, but, yeah, from what I’ve read, all that does seem to be happening.

Aleks then moves on to internet media:

It was the volunteers, webmasters, who created the first websites. Websites made information easily accessible. The website was property and a brand, vouching for the reputation of the content and data there. Users bookmarked those websites they liked so that they could revisit them later. . . .

In those days, I kept current about the developments in the field by following newsgroups and regularly visiting key websites that curated the information on a particular topic. Google entered the picture by downloading all of Internet and indexing it. . . . the perceived credit for finding information went to Google and no longer to the creators of the websites.

He continues:

After a few years of maintaining my website, I was no longer receiving much appreciation for this work, so I have given up maintaining the pages on my website and curating links. This must have happened around 2005. An increasing number of Wikipedia editors are giving up their unpaid efforts to maintain quality in the fight with vandalism or content spam. . . . On the other hand, marketers continue to have an incentive to put information online that would lead to sales. As a result of depriving contributors to the open web with brand and credit, search results on Google tend to be of worse quality.

And then:

When Internet search was gradually taking over from websites, there was one area where a writer’s personal property and personal brand were still protected: blogging. . . . The community connected through the comments on blog posts. The bloggers were known and personally subscribed to.

That’s where I came in!

Aleks continues:

Alas, whenever there’s an unprotected resource online, some startup will move in and harvest it. Social media tools simplified link sharing. Thus, an “influencer” could easily post a link to an article written by someone else within their own social media feed. The conversation was removed from the blog post and instead developed in the influencer’s feed. As a result, carefully written articles have become a mere resource for influencers. As a result, the number of new blogs has been falling.
Social media companies like Twitter and Facebook reduced barriers to entry by making it so easy to refer to others’ content . . .

I hadn’t thought about this, but, yeah, good point.

As a producer of “content”—for example, what I’m typing right now—I don’t really care if people come to this blog from Google, Facebook, Twitter, an RSS feed, or a link on their browser. (There have been cases where someone’s stripped the material from here and put it on their own site without acknowledging the source, but that’s happened only rarely.) Any of those legitimate ways of reaching this content is fine with me: my goal is just to get it out there, to inform people and to influence discussion. I already have a well-paying job, so I don’t need to make money off the blogging. If it did make money, that would be fine—I could use it to support a postdoc—but I don’t really have a clear sense of how that would happen, so I haven’t ever looked into it seriously.

The thing I hadn’t thought about was that, even if it doesn’t matter to me where our readers are coming from, this does matter to the larger community. Back in the day, if someone wanted to link or react to something on a blog, they’d do it in their own blog or in a comment section. Now they can do it from Facebook or Twitter. The link itself is no problem, but there is a problem in that there’s less of an expectation of providing new content along with the link. Also, Facebook and Twitter are their own communities, which have their strengths but which are different from those of blogs. In particular, blogging facilitates a form of writing where you fill in all the details of your argument, where you can go on tangents if you’d like, and where you link to all relevant sources. Twitter has the advantage of immediacy, but often it seems more like community without the content, where people can go on and say what they love or hate but without the space for giving their reasons.

The connection between junk science and sloppy data handling: Why do they go together?

Nick Brown pointed me to a new paper, “The Impact of Incidental Environmental Factors on Vote Choice: Wind Speed is Related to More Prevention-Focused Voting,” to which his reaction was, “It makes himmicanes look plausible.” Indeed, one of the authors of this article had come up earlier on this blog as a coauthor of a paper with fatally flawed statistical analysis. So, between the general theme of this new article (“How might irrelevant events infiltrate voting decisions?”), the specific claim that wind speed has large effects, and the track record of one of the authors, I came into this in a skeptical frame of mind.

That’s fine. Scientific papers are for everyone, not just the true believers. Skeptics are part of the audience too.

Anyway, I took a look at the article and replied to Nick:

The paper is a good “exercise for the reader” sort of thing to figure out how they managed to get all those pleasantly low p-values. It’s not as blatantly obvious as, say, the work of Daryl Bem. The funny thing is, back in 2011, lots of people thought Bem’s statistical analysis was state-of-the-art. It’s only in retrospect that his p-hacking looks about as crude as the fake photographs that fooled Arthur Conan Doyle. Figure 2 of this new paper looks so impressive! I don’t really feel like putting in the effort to figure out exactly how the trick was done in this case . . . Do you have any ideas?

Nick responded:

There are some hilarious errors in the paper. For example:
– On p. 7 of the PDF, they claim that “For Brexit, the ‘No’ option advanced by the Stronger In campaign was seen as clearly prevention-oriented (Mean (M) = 4.5, Standard Error (SE) = 0.17, t(101) = 6.05, p < 0.001) whereas the ‘Yes’ option put forward by the Vote Leave campaign was viewed as promotion-focused (M = 3.05, SE = 0.16, t(101) = 2.87, p = 0.003).” But the question was not “Do you want Brexit, Yes/No.” It was “Should the UK Remain in the EU or Leave the EU.” Hence why the pro-Brexit campaign was called “Vote Leave,” geddit? Both sides agreed before the referendum that this was fairer and clearer than Yes/No. Is “Remain” more prevention-focused than “Leave”?

– On p. 12 of the PDF, they say “In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU.” This is again completely false. The Conservative government, including Prime Minister David Cameron, backed Remain. It’s true that a number of Conservative politicians backed Leave, and after the referendum lots of Conservatives who had backed Remain pretended that they either really meant Leave or were now fine with it, but if you put that statement, “In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU,” in front of 100 UK political scientists, not one will agree with it.

If the authors are able to get this sort of thing wrong then I certainly don’t think any of their other analyses can be relied upon without extensive external verification.

If you run the attached code on the data (mutatis mutandis for the directories in which the files live) you will get Figure 2 of the Mo et al. paper. Have a look at the data (the CSV file is an export of the DTA file, if you don’t use Stata) and you will see that they collected a ton of other variables. To be fair they mention these in the paper (“Additionally, we collected data on other Election Day weather indicators (i.e., cloud cover, dew point, precipitation, pressure, and temperature), as well as historical wind speeds per council area. The inclusion of other Election Day weather indicators increases our confidence that we are detecting an association between wind speed and election outcomes, and not the effect of other weather indicators that may be correlated with wind speed.”) My guess is that they went fishing and found that wind speed, as opposed to the other weather indicators that they mentioned, gave them a good story.

Looking only at the Swiss data, I note that they also collected “Income,” “Unemployment,” “Age,” “Race” (actually the percentage of foreign-born people; I doubt if Switzerland collects “Race” data; Supplement, Table S3, page 42), “Education,” and “Rural,” and threw those into their model as well. They also collected latitude and longitude (of the centroid?) for each canton, although those didn’t make it into the analyses. Also they include “Turnout,” but for any given Swiss referendum it seems that they only had the national turnout because this number is always the same for every “State” (canton) for any given “Election” (referendum). And the income data looks sketchy (people in Schwyz canton do not make 2.5 times what people in Zürich canton do). I think this whole process shows a degree of naivety about what “kitchen-sink” regression analyses (and more sophisticated versions thereof) can and can’t do, especially with noisy measures (such as “Precipitation” coded as 0/1).

Voter turnout is positively correlated with precipitation but negatively with cloud cover, whatever that means. Another glaring omission is any sort of weighting by population. The most populous canton in Switzerland has a population almost 100 times the least populous, yet every canton counts equally. There is no “population” variable in the dataset, although this would have been very easy to obtain. I guess this means they avoid the ecological fallacy, up to the point where they talk about individual voting behaviour (i.e., pretty much everywhere in the article).

Nick then came back with more:

I found another problem, and it’s huge:

For “Election 50”, the Humidity and Dew Point data are completely borked (“relative humidity” values around 1000 instead of 0.6 etc; dew point 0.4–0.6 instead of a Fahrenheit temperature slightly below the measured temperature in the 50–60 range). When I remove that referendum from the results, I get the attached version of Figure 2. I can’t run their Stata models, but by my interpretation of the model coefficients from the R model that went into making Figure 2, the value for the windspeed * condition interaction goes from 0.545 (SE=0.120, p=0.000006) to 0.266 (SE=0.114, p=0.02).

So it seems to me that a very big part of the effect, for the Swiss results anyway, is being driven by this data error in the covariates.

And then he posted a blog with further details, along with a link to some other criticisms from Erik Gahner Larsen.
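That humidity/dew-point problem is exactly the kind of thing a routine range check on the covariates would catch before any models get run. Here’s a minimal sketch of such a check in Python; the file name, column names, and plausible ranges are all made up for illustration (I haven’t looked at the replication files myself), but the idea carries over to whatever the actual variables are called:

import pandas as pd

# Hypothetical file and column names, just to illustrate the idea of a range check.
df = pd.read_csv("replication_data.csv")

# Rough plausible ranges for each covariate; anything outside gets flagged.
plausible = {
    "humidity": (0, 1),         # relative humidity as a proportion
    "dew_point": (-40, 90),     # degrees Fahrenheit
    "temperature": (-40, 120),  # degrees Fahrenheit
    "wind_speed": (0, 150),     # mph
}

for col, (lo, hi) in plausible.items():
    bad = df[(df[col] < lo) | (df[col] > hi)]
    if len(bad):
        print(f"{col}: {len(bad)} rows outside [{lo}, {hi}]")

A few lines of checking like this, run before any regressions, would have flagged relative humidities near 1000 and dew points between 0.4 and 0.6 immediately.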

The big question

Why do junk science and sloppy data handling so often seem to go together? We’ve seen this a lot, for example the ovulation-and-voting and ovulation-and-clothing papers that used the wrong dates for peak fertility, the Excel error paper in economics, the gremlins paper in environmental economics, the analysis of air pollution in China, the collected work of Brian Wansink, . . . .

What’s going on? My hypothesis is as follows. There are lots of dead ends in science, including some bad ideas and some good ideas that just don’t work out. What makes something junk science is not just that it’s studying an effect that’s too small to be detected with noisy data; it’s that the studies appear to succeed. It’s the misleading apparent success that turns a scientific dead end into junk science.

As we’ve been aware since the classic Simmons et al. paper from 2011, researchers can and do use researcher degrees of freedom to obtain apparent strong effects from data that could well be pure noise. This effort can be done on purpose (“p-hacking”) or without the researchers realizing it (“forking paths”), or through some mixture of the two.

The point is that, in this sort of junk science, it’s possible to get very impressive-looking results (such as Figure 2 in the above-linked article) from just about any data at all! What that means is that data quality doesn’t really matter.
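To see how little the data matter, here’s a toy simulation, not a reanalysis of anything in the paper: generate predictors and outcomes that are pure noise, then give yourself a modest number of analyst choices (which weather variable, which outcome) and look at the best p-value you can find.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_weather, n_outcomes = 500, 6, 4

# Pure noise: several "weather" predictors and several outcome measures,
# none of which are related to anything.
weather = rng.normal(size=(n, n_weather))
outcomes = rng.normal(size=(n, n_outcomes))

pvals = []
for i in range(n_weather):
    for j in range(n_outcomes):
        r, p = stats.pearsonr(weather[:, i], outcomes[:, j])
        pvals.append(p)

print(f"smallest p-value across {len(pvals)} predictor/outcome pairs: {min(pvals):.3f}")

With 24 shots at the target, you’ll see a p-value below 0.05 more often than not, even though every effect here is exactly zero, and that’s before adding the further degrees of freedom of covariate adjustment, subsetting, and interaction terms.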

If you’re studying a real effect, then you want to be really careful with your data: any noise you introduce, whether in measurement or through coding error, can be expected to attenuate your effect, making it harder to discover. When you’re doing real science you have a strong motivation to take accurate measurements and keep your data clean. Errors can still creep in, sometimes destroying a study, so I’m not saying it can’t happen. I’m just saying that the motivation is to get your data right.
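Here’s the flip side as a small simulation: a real effect, estimated with increasingly noisy measurements of the predictor. This is just the textbook attenuation story, with made-up numbers:

import numpy as np

rng = np.random.default_rng(1)
n, beta = 10_000, 0.5

x = rng.normal(size=n)             # true predictor
y = beta * x + rng.normal(size=n)  # real effect of x on y

for noise_sd in [0.0, 0.5, 1.0, 2.0]:
    x_obs = x + rng.normal(scale=noise_sd, size=n)  # mismeasured predictor
    slope = np.polyfit(x_obs, y, 1)[0]              # least-squares slope
    print(f"measurement noise sd = {noise_sd}: estimated slope = {slope:.2f}")

The true slope is 0.5; as the measurement error grows, the estimate shrinks toward zero, which is exactly why someone studying a real effect has a strong incentive to keep the measurements and the data file clean.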

In contrast, if you’re doing junk science, the data are not so relevant. You’ll get strong results one way or another. Indeed, there’s an advantage to not looking too closely at your data at first; that way if you don’t find the result you want, you can go through and clean things up until you reach success. I’m not saying the authors of the above-linked paper did any of that sort of thing on purpose; rather, what I’m saying is that they have no particular incentive to check their data, so from that standpoint maybe we shouldn’t be so surprised to see gross errors.