Krugman in the uncanny valley: A theory about east coast pundits and California

One of Palko’s pet peeves is East Coast media figures who don’t understand California. To be more specific, the problem is that they think they know California but they don’t, which puts them in a sort of uncanny San Fernando or Silicon Valley of the mind.

He quotes New York Times columnist Paul Krugman, who first writes about Silicon Valley and the Los Angeles entertainment complex and then continues:

California as a whole is suffering from gentrification. That is, it’s like a newly fashionable neighborhood where affluent newcomers are moving in and driving working-class families out. In a way, California is Brooklyn Heights writ large.

Yet it didn’t have to be this way. I sometimes run into Californians asserting that there’s no room for more housing — they point out that San Francisco is on a peninsula, Los Angeles ringed by mountains. But there’s plenty of scope for building up.

As Palko points out (but unfortunately nobody will see, because he has something like 100 readers, as compared to Krugman’s million), “SF is not part of Silicon Valley; it’s around fifty miles away” and “Neither the city nor the county of LA is ringed with mountains.” This is not to say that Krugman is wrong about making it possible for developers to build up—but this seems more of a generic issue of building apartments where people want to live, rather than restricting construction to available land that’s far away. As Palko’s co-blogger points out, rising housing prices are a problem even in a place like London, Ontario, “a mid-sized city with a mid-ranked university and a 9-10% unemployment rate” not ringed by mountains or anything like that. I’ve heard that rents in Paris are pretty high too, and that’s not ringed by mountains either.

Here’s my theory. When East Coast media figures write about Texas, say, or Minnesota or Oregon or even Pennsylvania, they know they’re starting from a position of relative ignorance and they’re careful to check with the local experts (which leads to the much-mocked trope of the interview with locals in an Ohio diner). And when they write about NYC or Washington D.C. or whatever suburb they grew up in . . . well, then they might have a biased view of the place but at least they know where everything is. But California is the worst of both worlds: they’re familiar enough with the place to write about it confidently, but not familiar enough to realize the limitations of their understanding.

The point here is not that Krugman is the worst or anything like that, even just restricting to the New York Times. I’ve complained before about pundits not correcting major errors in their columns. Krugman’s a more relevant example for the present post because his columns are often informed by data, so it’s interesting when he gets the data wrong.

As we’ve discussed before, to get data wrong, two things need to happen:
1. You need to be misinformed.
2. You need to not realize you’re misinformed.
That’s the uncanny valley—where you think you know something but you don’t—and it’s a topic that interests me a lot because it seems to be where so many problems in science arise.

P.S. It was funny that Krugman picked Brooklyn Heights, of all places. He’s a baby boomer . . . maybe he had The Patty Duke Show in mind. The family on that show was white collar, in a Father Knows Best kind of way, but my vague impression is that white collar was the default on TV back then. Not that all or even most shows had white collar protagonists—I guess that from our current perspective the most famous shows from back then are Westerns, The Honeymooners, and Lucy, none of which fit the “white collar” label, but I still think of white-collar families as representing the norm. In any case, I guess the fictional Brooklyn Heights of The Patty Duke Show was filled with salt-of-the-earth working-class types who would’ve been played for laughs by character actors. Now these roles have been gentrified and everyone on TV is pretty. Actually I have no idea what everyone on TV looks like; I know about as much about that as NYT columnists know about California geography.

P.P.S. Unrelatedly—it just happened to appear today—Palko’s co-blogger Delaney does a good job at dismantling a bad argument from Elon Musk. In this case, Musk’s point is made in the form of a joke, but it’s still worth exploring what’s wrong with what he said. Arguing against a joke is tricky, so I think Delaney gets credit for doing this well.

Consequences are often intended. And, yes, religion questions in surveys are subject to social desirability bias.

Someone recommended the book, “Hit Makers: The Science of Popularity in an Age of Distraction,” by Derek Thompson. It had a lot of good things, many of which might be familiar to readers of this blog, but with some unexpected stuff too.

Here though, I want to mention a couple of things in the book that I disagreed with.

On page 265, Thompson writes:

Seems like a fair thumbnail description. But why call it an “unintentional” manslaughter? Many of the online advertising ventures were directly competing with newspapers, no? They knew what they were doing. This is not to say they were evil—business is business—but “unintentional” doesn’t seem quite right. This struck me because it reminded me of the “unintended consequences” formulation, which I think is overused, often applied to consequences that were actually intended. The idea of unintended consequences is just so appealing that it can be applied indiscriminately.

The other is on page 261, where Thompson writes that religion is an area where “researchers found no evidence of social desirability bias.” This is in the context of a discussion of errors in opinion surveys.

But I’m pretty sure that Thompson’s completely wrong on this one. Religion is a famous example of social desirability bias in surveys: people say they attend church—this is something that’s socially desirable—at much higher rates than they actually do. And researchers have studied this! See here, for example. I could see how Thompson wouldn’t have necessarily heard of this research; what surprised me is that he made such a strong statement that there was no bias. I wonder what he was thinking?

Fun July 4th statistics story! Too many melons, not enough dogs.

Just in time before the holiday ends . . . A correspondent who wishes to remain anonymous points us to this:

Apparently this was written by our former ambassador to the United Nations. I googled and her bachelor’s degree was in accounting! They teach you how to take averages in accounting school, don’t they?? So I’m guessing this particular post was written by someone less well-educated, maybe a staffer with a political science degree or something like that.

But what really gets me is, who eats only 1 hot dog on July 4th? No burger, no chicken, just one hot dog?? Is this staffer on a diet, or what?? Also, get it going, dude! Throw in a burger and some chicken breasts and you can get that inflation rate over 100%, no?

Meanwhile this person’s eating an entire watermelon? The wiener/watermelon ratio at this BBQ is totally wack. I just hope these staffers are more careful with their fireworks tonight than they were with their shopping earlier today. What’re they gonna do with all those extra watermelons?

Here’s what the highest salary of any university president in America was in 1983 in 2022 dollars: $342,000.

The above quote comes from Paul Campos, who also reports that “The mean (not the median) compensation of university presidents [in 1983] was $184,000 (again 2022 dollars!).” And, perhaps even more amazingly, that the highest-paid college football coach in 1981 (Oklahoma’s Barry Switzer) “was making $150,000, including benefits, which is $457,000 in 2022 dollars.”

This got me curious. I took my first academic job in 1990 and my salary was $42,000. According to the inflation calculator, that’s worth $90,000 today. That must not be far from what we pay new faculty. But maybe there’s more spread on the high end than there used to be? I don’t know how much the senior faculty were paid back in 1990, or 1983, but I guess it was less than the university president.
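
In case you want to redo these conversions, the calculation is just a CPI ratio. Here’s a minimal sketch; the CPI values below are approximate annual averages that I’m plugging in for illustration, which is why the result lands near, but not exactly at, the $90,000 figure from the inflation calculator.

```python
# Approximate annual-average CPI-U values, used here only for illustration;
# an online inflation calculator may use slightly different index numbers.
CPI = {1990: 130.7, 2022: 292.7}

def to_2022_dollars(amount, year):
    """Scale a dollar amount from `year` into 2022 dollars via the CPI ratio."""
    return amount * CPI[2022] / CPI[year]

print(round(to_2022_dollars(42_000, 1990)))  # roughly 94,000, in the ballpark of the $90,000 above
```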

Regarding the university presidents and football coaches: in addition to their take-home pay and benefits, they also were given subordinates: secretaries and assistants and coaches and so forth whom they could boss around. This makes me think that part of the salary change was a delayed transition, stimulated by the tax cuts of the 1980s, toward paying in $ rather than perks?

In any case, lots of faculty nowadays (regular faculty like me, not law/biz/medicine or coaches) get paid more than all the college presidents were in 1983. Amazing. Again, I guess the presidents were also paid in kind in the sense that they were allowed to hire a zillion assistants. But still. Imagine someone offering the president of a major university a salary of $184,000 today. He’d be insulted!

Another way of saying this is that in 1983, the ratio of the highest to the lowest pay at a university was about 10 (approx $200,000 at the top compared to approx $20,000 at the bottom). But now the ratio is more like 100 (approx $3 million down to approx $30,000). I guess that to find the reasons why this is happening, we’d want to look at society and the economy more generally, as what’s happening at the university has been happening in many other institutions.

The science bezzle

Palko quotes Galbraith from the classic The Great Crash 1929:

Alone among the various forms of larceny [embezzlement] has a time parameter. Weeks, months or years may elapse between the commission of the crime and its discovery. (This is a period, incidentally, when the embezzler has his gain and the man who has been embezzled, oddly enough, feels no loss. There is a net increase in psychic wealth.) At any given time there exists an inventory of undiscovered embezzlement in—or more precisely not in—the country’s business and banks.

. . .

This inventory—it should perhaps be called the bezzle—amounts at any moment to many millions of dollars. It also varies in size with the business cycle. In good times, people are relaxed, trusting, and money is plentiful. But even though money is plentiful, there are always many people who need more. Under these circumstances, the rate of embezzlement grows, the rate of discovery falls off, and the bezzle increases rapidly. In depression, all this is reversed. Money is watched with a narrow, suspicious eye. The man who handles it is assumed to be dishonest until he proves himself otherwise. Audits are penetrating and meticulous. Commercial morality is enormously improved. The bezzle shrinks.

He also quotes John Kay:

From this perspective, the critic who exposes a fake Rembrandt does the world no favor: The owner of the picture suffers a loss, as perhaps do potential viewers, and the owners of genuine Rembrandts gain little. The finance sector did not look kindly on those who pointed out that the New Economy bubble of the late 1990s, or the credit expansion that preceded the 2008 global financial crisis, had created a large febezzle.

Palko continues:

In 2021, the bezzle grew to unimaginable proportions. Imagine a well-to-do family sitting down to calculate their net worth last December. Their house is worth three times what they paid for it. The portfolio’s doing great, particularly those innovation and disruption stocks they heard about on CNBC. And that investment they made in crypto just for fun has turned into some real money.

Now think about this in terms of stimulus. Crypto alone has pumped trillions of dollars of imaginary money into the economy. Analogously, ending the bezzle functions like a massive contractionary tax. . . .

And this reminds me of . . .

Very parochially, this all makes me think of the replication crisis in psychology, medicine, and elsewhere in the human sciences. As Simine Vazire and I have discussed, a large “bezzle” in these fields accumulated over several decades and then was deflated in the past decade or so.

As with financial bezzles, during this waking-up period there was a lot of anger from academic and media thought leaders—three examples are here, here and here. It seems they were reacting to the loss in value: “From this perspective, the critic who exposes a fake Rembrandt does the world no favor: The owner of the picture suffers a loss, as perhaps do potential viewers, and the owners of genuine Rembrandts gain little.”

This also relates to something else I’ve noticed, which is that many of these science leaders are stunningly unbothered by bad science. Consider some notorious “fake Rembrandts” of 2010 vintage such as the misreported monkey experiments, the noise-mining ESP experiments, the beauty-and-sex ratio paper, the pizzagate food studies, the missing shredder, etc etc. You’d think that institutions such as NPR, Gladwell, Freakonomics, Nudge, and the Association for Psychological Science would be angry at the fakers and incompetents who’d put junk science into the mix, but, to the extent they show emotion on this at all, it tends to be anger at the Javerts who point out the problem.

At some level, I understand. As I put it a few years ago, these people own stock in a failing enterprise, so no wonder they want to talk it up. Still, they’re the ones who got conned, so I’d think they might want to divert some of their anger to the incompetents and fraudsters who published and promoted the science bezzle—all this unreplicable research.

P.S. Related, from 2014: The AAA tranche of subprime science.

Propagation of responsibility

When studying statistical workflow, or just doing applied statistics, we think a lot about propagation of uncertainty. Today’s post is about something different: it’s about propagation of responsibility in a decision setting with many participants. I’ll briefly return to workflow at the end of this post.

The topic of propagation of responsibility came up in our discussion the other day of fake drug studies. The background was a news article by Johanna Ryan, reporting on medical research fraud:

In some cases the subjects never had the disease being studied or took the new drug to treat it. In others, those subjects didn’t exist at all.

I [Ryan] found out about this crime wave, not from the daily news, but from the law firm of King & Spaulding – attorneys for GlaxoSmithKline (GSK) and other major drug companies. K&S focused not so much on stopping the crime wave, as on advising its clients how to “position their companies as favorably as possible to prevent enforcement actions if the government comes knocking.” In other words, to make sure someone else, not GSK, takes the blame. . . .

So how do multi-national companies like GSK find local clinics like Zain Medical Center or Healing Touch C&C to do clinical trials? Most don’t do so directly. Instead, they rely on Contract Research Organizations (CROs): large commercial brokers that recruit and manage the hundreds of local sites and doctors in a “gig economy” of medical research. . . .

The doctors are independent contractors in these arrangements, much like a driver who transports passengers for Uber one day and pizza for DoorDash the next. If the pizza arrives cold or the ride is downright dangerous, both Uber and the pizza parlor will tell you they’re not to blame. The driver doesn’t work for them!

Likewise, when Dr. Bencosme was arrested, the system allowed GSK to position themselves as victims not suspects. . . .

Commenter Jeremiah concurred:

I want folks to be careful in giving the Sponsor (such as GSK) any pass and putting the blame on the CRO. The Good Clinical Practices (GCP) here are pretty strong that the responsibility lies with the Sponsor to do due diligence and have appropriate processes in place (for example, see ICH E6(r2) and ICH E8(r1)).

The FDA has been flagging problems with 21CFR 312.50 for years. In fy2021 they identified multiple observations that boil down to a failure to select qualified investigators. The Sponsor owns that and we should never give a pass because of the nature of contract organizations in our industry.

Agreed. Without making any comment on this particular case, which I know nothing about, I agree with your general point about responsibility going up and down the chain. Without such propagation of responsibility, there are huge and at times overwhelming incentives to cheat.

It does seem that when institutions are being set up and maintained, insufficient attention is paid to maintaining the smooth and reliable propagation of responsibility. Sometimes this is challenging—it’s not always so easy to internalize externalities (“tragedies of the commons”) through side payments—but in a highly regulated area such as medical research, it should be possible, no?

And this does seem to have some connections to ideas such as influence analysis and model validation that arise in statistical workflow. The trail of breadcrumbs.

Statistics and science reform: My conversation with economist Noah Smith

Here it is:

N.S.: In the past few years, you’ve become sort of the scourge of bad statistical papers. I’ve heard it said that the scariest seven words in academia are: “Andrew Gelman just blogged about your paper”. Do you feel that this sort of thing has made empirical researchers more careful about their methodologies?

A.G.: I don’t know if I want researchers to be more careful! It’s good for people to try all sorts of ideas with data collection and analysis without fear. Indeed, I suspect that common statistical mistakes–for example, reliance on statistical significance and refusal to use prior information–arise from researchers being too careful to follow naive notions of rigor. What’s important is not to try to avoid error but rather to be open to criticism and to learn from our mistakes.

N.S.: Ultimately, is the quality of statistical research more fundamentally a matter of opinion and judgment than, say, research in physics or biology?

What’s an example of a recent paper you felt was well-done, vs. a recent one you thought was poorly done?

A.G.: I can’t really answer your first question here because I’m not sure what you mean by “statistical research.” Are you referring to research within the field of statistics, such as a paper in statistical methods or theory? Or are you referring to applied research that uses statistical analysis?

In answer to your second question: Unfortunately, I end up reading many more bad papers than good papers, in part because people keep sending me bad stuff! Back in the early days of the blog, they would send me papers with bad graphs, but now it’s often papers with bad statistical analyses. I hate to name just one because then I feel like I’d be singling it out, so let me split the difference and point to two papers that I think were well done in many ways but still I don’t buy their conclusions. The first was a meta-analysis of nudge interventions which claimed to find positive effects but I think was made essentially useless by relying on studies that were themselves subject to selection bias; see discussion here: https://statmodeling.stat.columbia.edu/2022/01/10/the-real-problem-of-that-nudge-meta-analysis-is-not-that-it-include-12-papers-by-noted-fraudsters-its-the-gigo-of-it-all/ The second was a regression-discontinuity study reporting that winners of close elections for governors of U.S. states lived five to ten years longer than the losers of the elections. This study again was clearly written with open data, but I don’t believe the claims; I think they are artifacts of open-ended statistical analysis, the thing that happens all the time when studying small effects with highly noisy data; see here: https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/ I think both these papers have many good features, and I appreciate the effort the authors put into the work, but sometimes your data are just too biased or noisy to allow researchers, no matter how open and scrupulous, to find the small effects that they are looking for. Paradoxically, I think there’s often a mistaken attitude to think a paper is good because of its methods (in the aforementioned examples, a comprehensive meta-analysis and a clean regression discontinuity) without realizing that it is doomed because of low data quality and weak substantive theory. One of the problems of focusing on gaudy examples of researchers cheating is that we forget that honesty and transparency are not enough (http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics14.pdf).

OK, you asked me to also give an example of a recent paper I felt was well done. I should probably spend more time reading the good stuff, actually! I didn’t want to just respond by pointing you to a paper by one of my friends, so to answer your question I went over to the website of the American Political Science Review. The forthcoming articles on their home page look reasonable to me. That said, most of them are not written quite the way I would. Some of this is silly details such as reporting estimates and standard errors to a ridiculous number of decimal places; some of this is an over-reliance on models and estimates without displays of raw data. Scatterplots make me happy, and I feel that many social science research papers make the mistake of considering a table of regression coefficients to be a culmination of their project rather than just part of the story. But I guess that’s partly the point: for an empirical research paper to be good, it doesn’t have to be a tour de force, it should just add to our understanding of the world, and often that can be done with a realistic view of what the contribution can represent. For example, consider this abstract from “The Curse of Good Intentions: Why Anticorruption Messaging Can Encourage Bribery,” by Nic Cheeseman and Caryn Pfeiffer:

“Awareness-raising messages feature prominently in most anticorruption strategies. Yet, there has been limited systematic research into their efficacy. There is growing concern that anticorruption awareness-raising efforts may be backfiring; instead of encouraging citizens to resist corruption, they may be nudging them to “go with the corrupt grain.” This study offers a first test of the effect of anticorruption messaging on ordinary people’s behavior. A household-level field experiment, conducted with a representative sample in Lagos, Nigeria, is used to test whether exposure to five different messages about (anti)corruption influence the outcome of a “bribery game.” We find that exposure to anticorruption messages largely fails to discourage the decision to bribe, and in some cases it makes individuals more willing to pay a bribe. Importantly, we also find that the effect of anticorruption messaging is conditioned by an individual’s preexisting perceptions regarding the prevalence of corruption.”

I like this abstract: It argues for the relevance of the work without making implausible claims. Maybe part of this is that their message is essentially negative: in contrast to much of the work on early childhood intervention (for example, see discussion here: https://statmodeling.stat.columbia.edu/2013/11/05/how-much-do-we-trust-this-claim-that-early-childhood-stimulation-raised-earnings-by-42/), say, they’re not promoting a line of research, which makes it easier for them to report their findings dispassionately. I’m not saying that this particular article on anticorruption messaging, or the other recent APSR articles that I looked at, are perfect, just that they are examples of how we can learn from quantitative data. The common threads seem to be good data and plausible effect sizes.

N.S.: Sorry, I should have been more concrete. By “statistical research” I mean “either empirical research or theoretical research into statistical methods”. But anyway, I think you answered my question perfectly!

Zooming out a bit, I’m wondering, are there any new kinds of mistakes you see lots of researchers making in empirical work? As in, are there any recently popular techniques that people are misapplying or overapplying? As a possible example of what I mean, I’ve been seeing a number of regression discontinuity papers whose data plots look like totally uninformative clouds, but who manage to find large effects at the discontinuity only because they assume bizarre, atheoretical trends before and after the discontinuity. I think you’ve taken apart a couple of these papers.

A.G.: Oh yeah, there’s lots of bad regression discontinuity analysis out there; I discussed this in various posts, for example “Another Regression Discontinuity Disaster and what can we learn from it” (https://statmodeling.stat.columbia.edu/2019/06/25/another-regression-discontinuity-disaster-and-what-can-we-learn-from-it/) and “Regression discontinuity analysis is often a disaster. So what should you do instead? Here’s my recommendation:” (https://statmodeling.stat.columbia.edu/2021/03/11/regression-discontinuity-analysis-is-often-a-disaster-so-what-should-you-do-instead-do-we-just-give-up-on-the-whole-natural-experiment-idea-heres-my-recommendation/) and “How to get out of the credulity rut (regression discontinuity edition): Getting beyond whack-a-mole” (https://statmodeling.stat.columbia.edu/2020/01/13/how-to-get-out-of-the-credulity-rut-regression-discontinuity-edition-getting-beyond-whack-a-mole/) and “Just another day at the sausage factory . . . It’s just funny how regression discontinuity analyses routinely produce these ridiculous graphs and the authors and journals don’t even seem to notice.” (https://statmodeling.stat.columbia.edu/2021/11/21/just-another-day-at-the-sausage-factory-its-just-funny-how-regression-discontinuity-analyses-routinely-produce-these-ridiculous-graphs-and-the-authors-and-journals-dont-even-seen-to-notice/). I don’t actually think regression discontinuity is worse than other methods–we even have a section on the method, with an example, in Regression and Other Stories!–; rather, I think the problem is that a feeling of causal identification gives researchers a feeling of overconfidence, and then they forget that ultimately what they’re trying to do is learn from observational data, and that needs assumptions–not just mathematical “conditions,” but real-world assumptions. It’s similar to how all those psychologists fooled themselves: they were doing randomized experiments and that gave them causal identification, but they didn’t realize that this didn’t help if they weren’t estimating a stable quantity. They were trying to nail a jellyfish to the wall. I will say, though, that bad regression discontinuity analyses have their own special annoyance or amusement in that they are often presented in the published paper with a graph that reveals how ridiculous the fitted model is. It’s kind of amazing when an article contains its own implicit refutation, which the authors and editors never even noticed. They’re so convinced of the rightness of their method that they don’t see what’s right in front of them.
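
To see the kind of problem being described here, consider a toy simulation (made-up data, not a reanalysis of any of the papers linked above): the outcome is pure noise with no discontinuity at all, yet fitting separate high-degree polynomials on each side of the cutoff routinely produces sizable estimated jumps at the threshold, because the fitted curve at the boundary is driven by curvature chased far from the cutoff.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimated_jump(n=200, degree=4):
    """Pure noise, no true discontinuity at x = 0: fit separate polynomials
    on each side of the cutoff and return the estimated jump at x = 0."""
    x = rng.uniform(-1, 1, n)
    y = rng.normal(0, 1, n)            # outcome unrelated to the running variable
    left, right = x < 0, x >= 0
    fit_left = np.polyfit(x[left], y[left], degree)
    fit_right = np.polyfit(x[right], y[right], degree)
    return np.polyval(fit_right, 0.0) - np.polyval(fit_left, 0.0)

quartic = np.array([estimated_jump(degree=4) for _ in range(1000)])
linear = np.array([estimated_jump(degree=1) for _ in range(1000)])
print("sd of spurious jumps, quartic fits:", quartic.std().round(2))
print("sd of spurious jumps, linear fits: ", linear.std().round(2))
# The high-degree fits are far more variable at the boundary, which is how an
# "uninformative cloud" can yield a large estimated effect at the discontinuity.
```

This instability at the boundary is one reason high-order polynomial fits are discouraged in regression discontinuity designs.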

N.S.: Are there any other similar examples from recent years, of methods that have been applied in an overly “push-button” way?

A.G.: I’m sure lots of methods have been applied in an overly push-button way. Regression discontinuity is just particularly easy to notice because the graph showing the ridiculousness of the fitted model is often conveniently included in the published paper!

OK, so what do I think are the statistical methods that have been applied in an overly push-button way? Three come to mind:

1. Taking an estimated regression coefficient and using it as a parameter estimate going forward. This is so standard we don’t even think of it as a choice: the estimate is the estimate, right? That’s what we do in our textbooks; it’s what everybody does in their textbooks. It’s an “unbiased” estimate or something like it, right? Wrong. When you report the estimate that comes out of your fitted model, you’re not accounting for selection: the selection in what gets attention, the selection in what gets published, and, before that, the selection in what you decide to focus on, amid all your results. Even the simplest model of selection, saying that the only things that get published are estimates that are more than 2 standard errors away from zero, can correspond to a huge bias. On pages 17-18 of this paper (http://www.stat.columbia.edu/~gelman/research/published/failure_of.pdf), I looked at a much-publicized study of the effects of early childhood intervention on adult earnings, and under reasonable assumptions about possible effect sizes, the bias in the published estimate can be larger than the true effect size! And this has implications if you want to use these estimates to guide cost-benefit analyses and policy decisions (see here: https://statmodeling.stat.columbia.edu/2017/07/20/nobel-prize-winning-economist-become-victim-bog-standard-selection-bias/).

Anyway, that’s just one example but this particular statistical error must be happening a million times a year–just about every time a published paper reports a numerical estimate of some effect or comparison. Sometimes maybe it’s no big deal because the magnitude of the effect doesn’t matter, but (a) often the magnitude _does_ matter, and (b) overestimates of effect sizes percolate forward through the literature, as follows: Suppose you’re designing a new study. You want to power it to be large enough to detect realistic effects. The trouble is, you grab those “realistic effects” from old studies, which are subject to bias. Then you conduct a new study that’s hopelessly noisy–but you don’t know it, as your seemingly rigorous calculations led you to believe you have “80% power,” that is, an 80% chance that your target analysis will be statistically significant at a conventional level (https://statmodeling.stat.columbia.edu/2017/12/04/80-power-lie/). But you don’t really have 80% power–you really have something like 6% power (https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/)–so the comparison you were planning to look at probably won’t come out a winner, and this motivates you to go along the forking paths of data processing and analysis until you find something statistically significant. Which doesn’t feel like cheating, because, after all, your study had 80% power, so you were supposed to find something, right?? Then this new inflated estimate gets published, and so on. And eventually you have an entire literature filled with overestimates, which you can then throw into a meta-analysis to find apparently conclusive evidence of huge effects (as here: https://statmodeling.stat.columbia.edu/2022/01/10/the-real-problem-of-that-nudge-meta-analysis-is-not-that-it-include-12-papers-by-noted-fraudsters-its-the-gigo-of-it-all/). (A toy simulation of this selection process is sketched below, following point 3.)

2. Using a statistical significance threshold to summarize inferences. Here I’m not talking about selection of what to report, but rather about how inferences are interpreted. Unfortunately, it’s standard practice to make these divisions, what the epidemiologist Sander Greenland calls “dichotomania” (https://statmodeling.stat.columbia.edu/2019/09/13/deterministic-thinking-dichotomania/). Again, this error is so common as to be nearly invisible. At one level, this is a problem that people are very aware of, as is evidenced by the common use of terms such as “p-hacking,” but I think people often miss the point. To me, the problem is not with p-values or so-called type 1 error rates but with this dichotomization, what I sometimes call the premature collapsing of the wavefunction. I’d rather just accept the uncertainty that we have. Sometimes people say that this is impractical, but my colleagues and I disagree; see for example here (http://www.stat.columbia.edu/~gelman/research/published/abandon.pdf). As for approaches to do “statistical significance” better through multiple-comparisons adjustments or preregistration and so forth . . . I think that’s all missing the point. Examples of this problem come up every day: here’s one from the medical literature that we looked into a couple years ago, where a non-statistically significant comparison was reported as a null effect (http://www.stat.columbia.edu/~gelman/research/published/Stents_published.pdf). (A small numerical version of this kind of example is sketched below, following point 3.)

One perspective that might help in thinking about this problem–how to summarize a statistical result in a way that could be useful to decision makers–is to consider the problem of A/B testing in industry. The relevant question is not, “Is B better than A?”–a question which, indeed, may have no true answer, given that effects can and will change over time and an intervention can be effective in some scenarios and useless or even counterproductive in others–but, rather, “What can we say about what might happen if A or B is implemented?” Any realistic answer to such a question will have uncertainty–even if your past sample size is a zillion, you’ll have uncertainty when extrapolating to the future. I’m not saying that such A/B decisions are easy, just that it’s foolish to dichotomize based on the data. Summarizing results based on statistical significance is just a way of throwing away information.

3. Bayesian inference. Lots has been written about this too. The short story here is that the posterior probability is supposed to represent your uncertainty about your unknowns–but this is only as good as your model, and we often fill our models with conventional and unrealistic specifications. A simple example is, suppose you do a clean randomized experiment and you get an estimate of, ummm, 0.2 (on some meaningful scale) with a standard error of 0.2? If you used a flat or so-called noninformative prior, this would imply that your posterior is approximately normally distributed with mean 0.2 and standard deviation 0.2, which implies an 84% posterior probability that the underlying effect is positive. So: you get an estimate that’s 1 standard error from 0, consistent with pure noise, but it leads you to an 84% probability, which if you take it seriously implies you’d bet with 5-to-1 odds that the true effect is greater than 0. To offer 5-1 odds based on data that could easily be explainable by chance alone, that’s ridiculous. As Yuling and I discuss in section 3 of our article on Holes in Bayesian Statistics (http://www.stat.columbia.edu/~gelman/research/published/physics.pdf), the problem here is in the uniform prior, which upon reflection doesn’t make sense but which people use by default–hell, we use it by default in our Bayesian Data Analysis book!

In that case, how is it that Bayesians who read our book (or others) aren’t wandering the streets in poverty, having lost all their assets in foolish 5-to-1 bets on random noise? The answer is that they know not to trust all their inferences. Or, at least, they know not to trust _some_ of their inferences. The trouble is that this approach, of carefully walking through the posterior as if it were a minefield, avoiding the obviously stupid inferences that would blow you up, won’t necessarily help you avoid the less-obviously mistaken inferences that can still hurt you. The problem comes from the standard Bayesian ideology which states that you should be willing to bet on all your probabilities.

I think that Bayesian errors are less common than the other two errors listed above, only because Bayesian methods are used less frequently. But when we make probabilistic forecasts, we pretty much have to think Bayesianly, and in that case we have to wrestle with where we are relative to the boundaries of our knowledge. We discussed this in the context of election forecasts of the 2020 election; see here (https://statmodeling.stat.columbia.edu/2020/10/28/concerns-with-our-economist-election-forecast/), here (https://statmodeling.stat.columbia.edu/2020/07/31/thinking-about-election-forecast-uncertainty/), and here (https://statmodeling.stat.columbia.edu/2020/10/24/reverse-engineering-the-problematic-tail-behavior-of-the-fivethirtyeight-presidential-election-forecast/). We got some pushback on some of this, but the point is to start with the recognition that your model will be wrong and to perturb it to get a sense of how wrong it is. Rather than walking around the minefield or carefully tiptoeing through it, we grab a stick and start tapping wherever we can, trying to set off some explosions so we can see what’s going on.
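
Here is the toy simulation of point 1 mentioned above. The numbers are invented for illustration (they are not from the childhood-intervention study or any other example in this interview): a small true effect, a noisy estimate, and “publication” only when the estimate is about 2 standard errors or more from zero.

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.1    # hypothetical small true effect
se = 0.35            # hypothetical standard error of each study's estimate
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)   # unbiased before any selection
published = np.abs(estimates) > 1.96 * se         # keep only "statistically significant" results

print("true power:", published.mean().round(3))                          # about 0.06, not 0.80
print("mean published estimate:", estimates[published].mean().round(2))  # several times the true 0.1
print("share of published estimates with the wrong sign:",
      (estimates[published] < 0).mean().round(2))
```

With these made-up numbers, conditioning on significance turns an unbiased estimator into one that overestimates the effect several-fold, and that exaggeration then feeds the power calculations for the next study.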
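
For point 2, here is a tiny numerical version of the “non-significant, therefore null” fallacy, again with hypothetical numbers rather than the actual figures from the stents study linked above.

```python
from scipy import stats

# Hypothetical numbers: an estimated difference of 2 percentage points
# with a standard error of 1.5 percentage points.
est, se = 0.02, 0.015

z = est / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
ci = (est - 1.96 * se, est + 1.96 * se)

print(f"p = {p_value:.2f}")                      # about 0.18: "not statistically significant"
print(f"95% CI: [{ci[0]:+.3f}, {ci[1]:+.3f}]")   # about [-0.009, +0.049]
# The data are consistent with no effect but also with an effect of several
# percentage points; summarizing this as "no effect" throws away that information.
```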
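
And the 84% figure in point 3 is just a one-line normal-distribution calculation, spelled out here so you can check the arithmetic.

```python
from scipy import stats

estimate, se = 0.2, 0.2   # the example in point 3: an estimate one standard error from zero

# With a flat prior, the posterior is approximately normal(estimate, se),
# so the posterior probability that the effect is positive is:
p_positive = 1 - stats.norm.cdf(0, loc=estimate, scale=se)
print(round(p_positive, 2))                     # 0.84
print(round(p_positive / (1 - p_positive), 1))  # implied betting odds, about 5.3 to 1
```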

Taking all these examples together, along with the regression discontinuity thing I talked about earlier, I see a common feature, which is an apparent theoretical rigor leading to overconfidence and a lack of reflection. The theory says you have causal identification, or unbiased estimation, or a specified error rate, or coherent posterior probabilities, so then you just go with that, (a) without thinking about the ways in which the assumptions of the theory aren’t satisfied, and (b) without thinking about the larger goals of the research.

N.S.: About that third one…What you said reminds me of this old post by Stephen Senn, which argues that researchers can’t do true Bayesian inference, in the philosophical subjective sense, because they can’t quantify their own prior. In fact, I recall that you liked that post a lot. So basically, if researchers can’t write down what their own prior really is, then while the inference they’re doing may use Bayes’ Rule and a so-called “prior”, it’s not really Bayesian inference. So if that’s true, do any of the arguments that we typically see for Bayesian over frequentist inference — for example, the Likelihood Principle — really hold? And if so, is there any general reason researchers should keep using so-called “Bayesian” methods, when they’re cumbersome and unwieldy?

A.G.: Sure, in that case researchers can’t do true inference of any sort, as in practice it’s rare that the mathematical assumptions of our models are satisfied. We rarely have true probability sampling, we rarely have clean random assignment, and we rarely have direct measurements of what we ultimately care about. For example, an education experiment will be performed in schools that permit the study to be done, not a random sample of all schools; the treatment will be assigned differently in different places and can be altered by the teachers on the ground; and outcome measures such as test scores do not fully capture long-term learning. That’s fine; we do our best. I see no reason to single out Bayesian inference here. Many writers on statistics strain at the gnat of the prior distribution while swallowing the camel of the likelihood. All those logistic regressions, independence assumptions, and models with constant parameters: where did they all come from, exactly? In your sentences above, I request that you replace the word “prior” everywhere with the word “model.” “If researchers can’t write down what their own model really is,” etc. As someone with economics training, you will be aware that models are valuable, they can be complicated, and they are in many ways conventional, constructed from available building blocks such as additive utility models, normal distributions, and so forth. Different models have different assumptions, but it would be naive to think that the model you happen to use for a particular problem is, or could be, “what your own model really is.”

Regarding some of the specifics: You say if bla bla bla then “it’s not really Bayesian inference.” I disagree. Bayesian inference is the math; it’s the mapping from prior and data model to posterior, it’s the probabilistic combination of information. Bayesian inference can give you bad results–I talked about that in my answer to your previous question–but it’s still Bayesian inference. We could say the same thing about arithmetic. Suppose I think I have a pound of flour and I give you 6 ounces; then I should have 10 ounces left. If my original weighing was wrong and I only had 15 ounces to start, then my analysis is wrong, as I will only have 9 ounces left. But the problem is not with the math, it’s with my assumptions.

OK, at this point it might sound like I’m saying that Bayesian inference can’t fail; it can only be failed. But that’s not what I’m trying to say. I just said that Bayesian inference is the math, but it’s also what goes into it. This is a lot of what statistics research has been about: constructing families of models that work in more general situations. As Hal Stern says, the most important thing about a statistical model is not what it does with the data but what data it uses, and often what makes a statistical method useful is that it can make use of more data. An example from my own work is how we use Mister P (MRP, multilevel regression and poststratification) to make population inferences: we’re using information from the structure of the data in the survey or experiment at hand, and also including information about the population. Another example would be modern machine learning methods that use overparameterization and regularization, which allows more predictors to be flexibly included in their webs (see section 1.3 of this article: http://www.stat.columbia.edu/~gelman/research/published/stat50.pdf). The point is that statistical methods exist within a social context: it’s the method and also how it’s used.
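
For readers who haven’t seen MRP, here is a deliberately stripped-down sketch of the poststratification half only, with made-up cell shares and outcomes. In real MRP the cell estimates would come from a multilevel regression that partially pools sparse cells, not the raw cell means used here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up population cell shares (say, an age-by-education cross-classification
# from the census) and a survey that over-represents the first cells.
pop_share = np.array([0.10, 0.25, 0.40, 0.25])
sample_n = np.array([200, 100, 50, 50])
true_cell_mean = np.array([0.20, 0.35, 0.50, 0.65])   # hypothetical outcome by cell

# Simulate binary survey responses and estimate each cell mean from the sample.
cell_est = np.array([rng.binomial(n, p) / n for n, p in zip(sample_n, true_cell_mean)])

raw = np.average(cell_est, weights=sample_n)              # ignores the skewed sample
poststratified = np.average(cell_est, weights=pop_share)  # reweights to the population

print("population truth:       ", round(float(np.average(true_cell_mean, weights=pop_share)), 3))
print("raw sample estimate:    ", round(float(raw), 3))
print("poststratified estimate:", round(float(poststratified), 3))
# The poststratified estimate recovers the population quantity even though the
# sample is unrepresentative; the multilevel-regression part of MRP is about
# getting stable cell estimates when many cells have few observations.
```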

A couple more points. You ask about the likelihood principle. I think the likelihood principle is kinda bogus. We discuss this in our Bayesian Data Analysis book (http://www.stat.columbia.edu/~gelman/book/). In short, the likelihood comes from the data model–in Bayesian terms, the probability distribution of the data given the parameters. But we don’t know this model–it’s just an idealization that we construct, so we have to check its fit to data. The likelihood principle is only relevant conditional on the model, which we don’t actually know.

Finally, you ask whether there is any general reason researchers should keep using Bayesian methods, when they’re cumbersome and unwieldy. Ummm, sure, if you can get results that are just as good with less effort, go for it! I guess you’re thinking of problems where you have a small number of parameters, tons of data, and low uncertainty. The problems I’ve seen often have the opposite characteristics! Look at it the other way: if Bayesian methods are really so “cumbersome and unwieldy,” why do we use them at all? Are we just gluttons for punishment? Actually, for many problems a Bayesian approach is fast, direct, and simple, while comparable non-Bayesian methods are cumbersome, requiring awkward approximations that necessitate lots of concern. For example, recently my colleagues and I have been working on a Bayesian model for calibration in laboratory assays. Classical approaches get hung up on measurements that are purportedly above or below detection limits, and they end up throwing away information and giving estimates that don’t make sense. Another example is this project (http://www.stat.columbia.edu/~gelman/research/published/chickens.pdf) where we used Bayesian methods to partially adjust for experimental control data in a way that would be difficult using other approaches–indeed, we were motivated by an application where the standard approach using significance testing was horribly wasteful of information. This is not to say that Bayes is always best. I’ve given some Bayesian success stories; there are lots of success stories of other methods too. To understand why researchers use a method, it makes sense to look at where it has solved problems that cannot easily be solved in other ways.

N.S.: Gotcha. OK, so let me go one step further regarding methods here. In 2001, Leo Breiman wrote a very provocative essay entitled “Statistical Modeling: The Two Cultures”, in which he argued that in many instances, researchers should stop worrying so much about modeling the data in an explicable way, and focus more on prediction. He shows how some basic early machine-learning type approaches were already able to yield consistently better predictions than traditional data models. And then a decade or so later, deep learning sort of explodes on the scene, and starts accomplishing all these magical feats — beating human champions at Go, generating text that sounds as if a human wrote it, revolutionizing the world of protein folding, etc. Does this revolution mean that classical statistics needs to change its ways? Was Breiman basically vindicated? Should statisticians move more toward algorithmic-type approaches, or focus on problems where data is sparse enough that lots of theoretical assumptions are needed, and thus classical modeling approaches still work best?

A.G.: Hey–I wrote something about that Breiman paper (see here: http://www.stat.columbia.edu/~gelman/research/published/gelman_breiman.pdf). Short story is: dude had some blind spots. But we all have blind spots; what’s important is what we do, not what we can’t or won’t do. Anyway, yes, there’s been a revolution in overparametrized models and regularization, with computer go champions and all sorts of amazing feats. It’s good to have these sorts of tools available; at the same time, traditional concerns of statistical design and analysis remain important for lots of important problems. At one end of things, problems in areas ranging from pharmacology to psychometrics involve latent parameters (concentrations of a drug within bodily compartments, or individual abilities and traits), and I think that to do inference for such problems, you need to do some modeling: pure prediction has fundamental limitations when your goal is to learn about latent properties. At the other end, lots of applied social science has statistical challenges arising from high variability and weak theory (consider, for example, those studies of early childhood intervention that I discussed earlier): for these, the big problems involve adjusting for bias, combining information from multiple sources, and handling uncertainty, which are core problems of statistics.

You ask whether statisticians should move toward algorithmic approaches. I’d say that statistics has always been algorithmic. There’s a duality between models and algorithms: start with a model and you’ll need an algorithm to fit it; start with an algorithm and you’ll want to understand how it can fail; this is modeling. Lots of us who do applied statistics spend lots of time developing algorithms, not necessarily because that’s what we want to do, but because existing algorithms are designed for old problems and won’t always work on our new ones.

N.S.: I love that you’ve already written about practically every question I have! I just hope you don’t mind repeating yourself! Anyway, one other thing I wanted to get your thoughts on was the publication system and the quality of published research. The replication crisis and other skeptical reviews of empirical work have got lots of people thinking about ways to systematically improve the quality of what gets published in journals. Apart from things you’ve already mentioned, do you have any suggestions for doing that?

A.G.: I wrote about some potential solutions in pages 19-21 of this article: http://www.stat.columbia.edu/~gelman/research/published/failure_of.pdf from a few years ago. But it’s hard to give more than my personal impression. As statisticians or methodologists we rake people over the coals for jumping to causal conclusions based on uncontrolled data, but when it comes to science reform, we’re all too quick to say, Do this or Do that. Fair enough: policy exists already and we shouldn’t wait on definitive evidence before moving forward to reform science publication, any more than journals waited on such evidence before growing to become what they are today. But we should just be aware of the role of theory and assumptions in making such recommendations. Eric Loken and I made this point several years ago in the context of statistics teaching (http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics2.pdf), and Berna Devezer et al. published an article last year critically examining some of the assumptions that have at times been taken for granted in science reform (https://royalsocietypublishing.org/doi/10.1098/rsos.200805). When talking about reform, there are so many useful directions to go, I don’t know where to start. There’s post-publication review (which, among other things, should be much more efficient than the current system for reasons discussed here: https://statmodeling.stat.columbia.edu/2016/12/16/an-efficiency-argument-for-post-publication-review/), there are all sorts of things having to do with incentives and norms (for example, I’ve argued that one reason that scientists act so defensive when their work is criticized is because of how they’re trained to react to referee reports in the journal review process: https://statmodeling.stat.columbia.edu/2018/01/13/solution-puzzle-scientists-typically-respond-legitimate-scientific-criticism-angry-defensive-closed-non-scientific-way/), and various ideas adapted to specific fields. One idea I saw recently that I liked was from the psychology researcher Gerd Gigerenzer, who wrote that we should consider stimuli in an experiment as being a sample from a population rather than thinking of them as fixed rules (https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/we-need-to-think-more-about-how-we-conduct-research/DFAE681F3EEF581CEE80139BB63DFF6F), which is an interesting idea in part because of its connection to issues of external validity or out-of-sample generalization that are so important when trying to make statements about the outside world.

N.S.: OK, last questions. What are some of the interesting problems you’re working on now — can you give us a taste? And also, for young people getting started in your field, do you have any key pieces of advice?

A.G.: What am I working on now? Mostly teaching and textbooks! My colleagues and I have been trying to integrate modern ideas of statistics (involving modeling, measurement, and inference) with ideas of student-centered learning. The idea is that students spend their time in class working in pairs figuring things out, and I can walk around the room seeing what they’re doing and helping them when they get stuck. In creating these courses, we’re trying to put together all the pieces of the puzzle, including creating class-participation activities for every class period. And this has been making me think a lot about workflow and some fundamental questions of what we are doing when we do statistical data analysis. It looks a lot like science, in that we develop theories, make conjectures, and do experiments. Stepping back a bit to consider methods, my colleagues and I have been thinking a lot about MRP: poststratifying to estimate population quantities and causal effects, poststratifying on non-census variables, priors for models with deep interactions, and computation for all these models. We have also been leveraging the concentration property by which, as our problems become larger, distributions become closer to normal, allowing approximate computation to be more effective, which brings us to methods that we’re working on for validating approximate computation, along with methods for predictive model averaging and computational tools for statistical workflow (https://statmodeling.stat.columbia.edu/2021/11/19/drawing-maps-of-model-space-with-modular-stan/). For some of the general ideas, see our papers, Bayesian Workflow (http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf) and Toward a taxonomy of trust for probabilistic machine learning (http://www.stat.columbia.edu/~gelman/research/unpublished/taxonomy.pdf)–I’m lucky to have some great collaborators! And what’s it all for? Mostly it’s for other people–users of Stan and other probabilistic programming languages, readers of our textbook, pollsters, laboratory researchers, policy analysts, etc. It’s also motivated by the studies we are doing on political polarization and various projects related to survey research. I guess you can get some idea of what I’ve been working on by going to the published and unpublished articles on my home page, as they’re listed in reverse chronological order.

Finally, what advice do I have for young people getting started? I don’t know! I think that they can get better career advice from people closer to their own situation. I’m happy to offer statistical advice, though. From appendix B of Regression and Other Stories, here are our 10 quick tips to improve your regression modeling:

1. Think about variation and replication

2. Forget about statistical significance

3. Graph the relevant and not the irrelevant

4. Interpret regression coefficients as comparisons

5. Understand statistical methods using fake-data simulation

6. Fit many models

7. Set up a computational workflow

8. Use transformations

9. Do causal inference in a targeted way, not as a byproduct of a large regression

10. Learn methods through live examples.

“Would Democratic Socialism Be Better?”

In his new book, sociologist Lane Kenworthy writes:

Is there a compelling case for socialism? Should we aspire to shift, in the reasonably near future, from a basically capitalist economy to a socialist one?

Let’s stipulate that socialism refers to an economy in which two-thirds or more of employment and output (GDP) is in firms that are owned by the government, citizens, or workers. Two-thirds is an arbitrary cutoff, but it’s as sensible as any other. It connotes a subsidiary role for the private non-worker-owned sector. . . .

To fully and fairly assess democratic socialism’s desirability, we need to compare it to the best version of capitalism that humans have devised: social democratic capitalism, or what is often called the Nordic model. I try in this book to offer such an assessment. My conclusion is that capitalism, and particularly social democratic capitalism, is better than many democratic socialists seem to think.

We most recently encountered Kenworthy in this space when discussing his post, “100 Things to Know.” I still think Kenworthy could up his game on his graphs, but that’s a minor thing.

Most of Kenworthy’s discussions center on macroeconomics, and it’s hard for me to judge a lot of the arguments, so let me just say that I see this book as more of a resource than an argument. That’s not a bad thing: rather than trying to make a single point, Kenworthy is bringing in data from different directions and using these data to address various reasonable arguments. So I’d say you can read this book with the goal of getting a sense of the center-left perspective from a knowledgeable observer without an axe to grind.

Pizzagate and “Nudge”: An opportunity lost

We all make mistakes. What’s important is to engage with our mistakes and learn from them. When we don’t, we’re missing an opportunity to learn.

Here’s an example. A few years ago there was a Harvard study that notoriously claimed that North Carolina was less democratic than North Korea. When this came out, the directors of the study accepted that the North Korea estimate was problematic and they removed it from their dataset. But I don’t think they fully engaged with the error. They lost an opportunity to learn, even though they were admirably open about their process.

Here’s a more recent example. The authors of the influential policy book Nudge came out with a new edition. In the past, they, like many others, had been fooled by junk science on eating behavior, most notably that of Cornell business school professor Brian Wansink. I was curious whassup with that, so I searched this press release / news article and found some stuff:

They began the [first edition of their] book with the modest example of a school administrator rearranging the display of food at the cafeteria, increasing the likelihood that kids choose a healthier lunch. . . .

In addition to new things in the new version of the book, there are old things from the original version that are gone. That includes the research of former Cornell University professor Brian Wansink, a behavioral scientist who got caught producing shoddy research that fudged numbers and misled the public about his empirical findings. . . . Thaler is cheering on the social scientists probing academic literature to suss out what can be proved and what can’t. “That’s healthy science,” he says.

That’s cool. I like that the author has this attitude: instead of attacking critics as Stasi, he accepts the value of outside criticism.

Just one thing, though. Removing Wansink’s research from the book—that’s a start, but to really do it right you should engage with the error. I don’t have a copy of either edition of the book (and, hey, before you commenters start slamming me about writing a book I haven’t read: first, this is not a book review nor does it purport to be; second, a policy book is supposed to have influence among people who don’t read it. There’s no rule, nor should there be a rule, that I can’t write skeptical things about a book if I haven’t managed to get a copy of it into my hands), but I was able to go on Amazon and take a look at the index.

Here’s the last page of the index of the first edition:

And now the new edition:

Lots of interesting stuff here! But what I want to focus on are two things:

1. Wansink doesn’t play a large role even in the first edition. He’s only mentioned once, on page 43—that’s it! So let’s not overstate the importance of this story.

2. Wansink doesn’t appear at all in the second edition! That’s the lost opportunity, a chance for the authors to say, “Hey, nudge isn’t perfect; indeed the ideas of nudging have empowered sleazeballs like Wansink, and we got fooled too. Also, beyond this, the core idea of nudging—that small inputs can have large, predictable, and persistent effects—has some deep problems.” Even if they don’t have the space in their book to go into those problems, they could still discuss how they got conned. It’s a great story and fits in well with the larger themes of their book.

Not a gotcha

Connecting Nudge to pizzagate is not a “gotcha.” As I wrote last year after someone pointed out an error in one of my published articles: It’s not that mistakes are a risk of doing science; mistakes are a necessary part of the process.

P.S. The first edition of Nudge mentioned the now-discredited argument that there is no hot hand. Good news is that the Lords realized this was a problem and they excised all mention of the hot hand from their second edition. Bad news is that they did not mention this excision—they just memory-holed the sucker. Another opportunity for learning from mistakes was lost! On the plus side, you’ll probably be hearing these guys on NPR very soon for something or other.

Unsustainable research on corporate sustainability

In a paper to be published in the Journal of Financial Reporting, Luca Berchicci and Andy King shoot down an earlier article claiming that corporate sustainability reliably predicted stock returns. It turns out that this earlier research had lots of problems.

King writes to me:

Getting to the point of publication was an odyssey. At two other journals, we were told that we should not replicate and test previous work but instead fish for even better results and then theorize about those:

“I encourage the authors to consider using the estimates from figure 2 as the dependent variables analyzing which model choices help a researcher to more robustly understand the relation between CSR measures and stock returns. This will also allow the authors to build theory in the paper, which is currently completely absent…”

“In fact, there are some combinations of proxies/ model specifications that are to the left of Khan et al.’s estimate. I am curious as to what proxies/ combinations enhance the results?”

Also, the original authors seem to have attempted to confuse the issues we raise and salvage the standing of their paper (see attached: Understanding the Business Relevance of ESG Issues). We have written a rebuttal (also attached).

Here’s the relevant part of the response, by George Serafeim and Aaron Yoon:

Models estimated in Berchicci and King (2021) suggest that making different variable construction, sample period, and control variable choices can yield different results with regards to the relation between ESG scores and business performance. . . . However, not all models are created equal . . . For example, Khan, Serafeim and Yoon (2016) use a dichotomous instead of a continuous measure because of the weaknesses of ESG data and the crudeness of the KLD data, which is a series of binary variables. Creating a dichotomous variable (i.e., top quintile for example) could be well suited when trying to identify firms on a specific characteristic and the metric identifying that characteristic is likely to be noisy. A continuous measure assumes that for the whole sample researchers can be confident in the distance that each firm exhibits from each other. Therefore, the use of continuous measure is likely to lead to significantly weaker results, as in Berchicci and King (2021) . . .

Noooooooo! Dichotomizing your variable almost always has bad consequences for statistical efficiency. You might want to dichotomize to improve interpretability, but you then should be aware of the loss of efficiency of your estimates, and you should consider approaches to mitigate this loss.
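To see the efficiency point concretely, here’s a minimal simulation—toy data and numbers of my own choosing, nothing from the papers under discussion—that regresses a noisy outcome on a continuous score versus on a top-quintile dummy built from that same score, and compares how often each version detects the relationship:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2022)
n, true_slope, n_sims = 200, 0.2, 2000
hits_cont = hits_dichot = 0

for _ in range(n_sims):
    x = rng.normal(size=n)                           # continuous "score"
    y = true_slope * x + rng.normal(size=n)          # noisy outcome
    x_top = (x > np.quantile(x, 0.8)).astype(float)  # top-quintile dummy
    hits_cont += linregress(x, y).pvalue < 0.05
    hits_dichot += linregress(x_top, y).pvalue < 0.05

print(f"power with continuous predictor: {hits_cont / n_sims:.2f}")
print(f"power with top-quintile dummy:   {hits_dichot / n_sims:.2f}")
```

Same data, same underlying relationship; the dummy version just throws away information, which is exactly the efficiency loss at issue.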

Berchicci and King’s rebuttal is crisp:

The issue debated in Khan, Serafeim, and Yoon (2016) and Berchicci and King (2022) is whether guidance on materiality from the Sustainable Accounting Standards Board (SASB) can be used to select ESG measures that reliably predict stock returns. Khan, Serafeim, and Yoon (2016) (hereafter “KSY”) estimate that had investors possessed SASB materiality data, they could have selected stock portfolios that delivered vastly higher returns, an additional 300 to 600 basis points per year for a period of 20 years. Berchicci and King (2022) (hereafter “BK”) contend that there is no evidence that SASB guidance could have provided a reliable advantage and contend that KSY’s findings are a statistical artifact.

In their defense of KSY, Yoon and Serafeim (2022) ignore the evidence provided in Berchicci and King and leave its main points unrefuted. Rather than make their case directly, they try to buttress their claim with a selective review of research on materiality. Yet a closer look at this literature reveals that little of it is relevant to the debate. Of the 28 articles cited, only two evaluate the connection between SASB materiality guidance and stock price, and both are self-citations.

Berchicci and King continue:

Indeed, in other forums, Serafeim has made a contrasting argument, contending that KSY is a uniquely important study – a breakthrough that shifted decades of understanding (Porter, Serafeim, and Kramer, 2016). Surely, such an important study should be evaluated on its own merits.

That’s funny. It reminds me of the general point that in research we want our results simultaneously to be surprising and to make perfect sense. In this case, that puts Yoon and Serafeim in a bind.

And more:

In BK, we evaluate whether KSY’s results are a fair representation of the true link between material sustainability and stock return. We evaluate over 400 ways that the relationship could be analyzed and reveal that 98% of the models result in estimates smaller than the one reported by KSY and that the median estimate was close to zero. We then show that KSY’s estimate is not robust to simple changes in their model . . . Next, we evaluate the cause of KSY’s strong estimate and uncover evidence that it is a statistical artifact. . . . We then show that their measure also lacks face validity because it judges as materially sustainable firms that were (and continue to be) leading emitters of toxic pollution and greenhouse gasses. In some years, this included a large majority of the firms in extractive industries (e.g. oil, coal, cement, etc.). . . . KSY do not address any of these criticisms and instead rely on a belief that their measure and model are the only ones that should be considered. . . .
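For readers who haven’t seen this kind of exercise, here’s a toy version of the multiverse idea—entirely made-up data and choices, not Berchicci and King’s actual 400-plus specifications: rerun the “same” analysis under several defensible combinations of choices and look at the whole spread of estimates rather than a single published number.

```python
import itertools
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
n = 500
score = rng.normal(size=n)     # stand-in for an ESG-type rating
size = rng.normal(size=n)      # stand-in for a firm characteristic
ret = rng.normal(size=n)       # stand-in for returns; true effect is zero here

measures = {
    "continuous score": score,
    "top-half dummy": (score > np.quantile(score, 0.5)).astype(float),
    "top-quintile dummy": (score > np.quantile(score, 0.8)).astype(float),
}
samples = {"all firms": np.ones(n, dtype=bool), "drop small firms": size > -0.5}

estimates = []
for (m, x), (s, keep) in itertools.product(measures.items(), samples.items()):
    estimates.append(linregress(x[keep], ret[keep]).slope)

print("min / median / max estimate across specifications:",
      np.round([min(estimates), np.median(estimates), max(estimates)], 3))
```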

Where do they sit on the ladder?

It’s good to see this criticism out there, and as usual it’s frustrating to see such a stubborn response by the original authors. A few years ago we presented a ladder of responses to criticism, from the most responsible to the most destructive:

1. Look into the issue and, if you find there really was an error, fix it publicly and thank the person who told you about it.

2. Look into the issue and, if you find there really was an error, quietly fix it without acknowledging you’ve ever made a mistake.

3. Look into the issue and, if you find there really was an error, don’t ever acknowledge or fix it, but be careful to avoid this error in your future work.

4. Avoid looking into the question, ignore the possible error, act as if it had never happened, and keep making the same mistake over and over.

5. If forced to acknowledge the potential error, actively minimize its importance, perhaps throwing in an “everybody does it” defense.

6. Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim.

7. Attack the messenger: attempt to smear the people who pointed out the error in your work, lie about them, and enlist your friends in the attack.

In this case, the authors of the original article are stuck somewhere around rung 4. Not the worst possible reaction—they’ve avoided attacking the messenger, and they don’t seem to have introduced any new errors—but they haven’t reached the all-important step of recognizing their mistake. Not good for them going forward. How can you make serious research progress if you can’t learn from what you’ve done wrong in the past? You’re building a house on a foundation of sand.

P.S. According to Google, the original article, “Corporate Sustainability: First Evidence on Materiality,” has been cited 861 times. How is it that such a flawed paper has so many citations? Part of this might be the instant credibility conveyed by the Harvard affiliations of the authors, and part of this might be the doing-well-by-doing-good happy-talk finding that “investments in sustainability issues are shareholder-value enhancing.” Kinda like that fishy claim about unionization and stock prices or the claims of huge economic benefits from early childhood stimulation. Forking paths allow you to get the message you want from the data, and this is a message that many people want to hear.

The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to Do About It

Here it is:

A standard mode of inference in social and behavioral science is to establish stylized facts using statistical significance in quantitative studies. However, in a world in which measurements are noisy and effects are small, this will not work: selection on statistical significance leads to effect sizes which are overestimated and often in the wrong direction. After a brief discussion of two examples, one in economics and one in social psychology, we consider the procedural solution of open postpublication review, the design solution of devoting more effort to accurate measurements and within-person comparisons, and the statistical analysis solution of multilevel modeling and reporting all results rather than selection on significance. We argue that the current replication crisis in science arises in part from the ill effects of null hypothesis significance testing being used to study small effects with noisy data. In such settings, apparent success comes easy but truly replicable results require a more serious connection between theory, measurement, and data.
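To make the abstract’s “selection on statistical significance” point concrete, here’s a minimal simulation with made-up numbers—a small true effect measured with lots of noise—looking only at the estimates that happen to clear p < 0.05:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.5, 1.0                       # true effect is half a standard error
est = rng.normal(true_effect, se, size=100_000)  # one estimate per hypothetical "study"
sig = np.abs(est / se) > 1.96                    # two-sided p < 0.05

print("share of studies reaching significance:", round(sig.mean(), 3))
print("mean estimate among the significant ones:", round(est[sig].mean(), 2))
print("share of significant estimates with the wrong sign:",
      round((est[sig] < 0).mean(), 3))
```

The “publishable” (significant) estimates average several times the true effect, and a noticeable fraction point the wrong way—the Type M and Type S story in the abstract.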

The article was published in 2018 but it remains relevant, all these many years later.

“Stylized Facts in the Social Sciences”

Sociologist Daniel Hirschman writes:

Stylized facts are empirical regularities in search of theoretical, causal explanations. Stylized facts are both positive claims (about what is in the world) and normative claims (about what merits scholarly attention). Much of canonical social science research can be usefully characterized as the production or contestation of stylized facts. Beyond their value as grist for the theoretical mill of social scientists, stylized facts also travel directly into the political arena. Drawing on three recent examples, I show how stylized facts can interact with existing folk causal theories to reconstitute political debates and how tensions in the operationalization of folk concepts drive contention around stylized fact claims.

Interesting. I heard the term “stylized facts” many years ago in conversations with political scientists—but from Hirschman’s article, I learned that the expression is most commonly used in economics, and it was originally used in a 1961 article by macroeconomist Nicholas Kaldor, who wrote:

Since facts, as recorded by statisticians, are always subject to numerous snags and qualifications, and for that reason are incapable of being accurately summarized, the theorist, in my view, should be free to start off with a ‘stylized’ view of the facts—i.e. concentrate on broad tendencies, ignoring individual detail, and proceed on the ‘as if’ method, i.e. construct a hypothesis that could account for these ‘stylized’ facts, without necessarily committing himself on the historical accuracy, or sufficiency, of the facts or tendencies thus summarized.

Hirschman writes:

“Stylized fact” is a term in widespread use in economics and is increasingly used in other social sciences as well. Thus, in some important sense, this article is an attempt to theorize a “folk” concept, with the relevant folk being social scientists themselves. . . . I argue that stylized facts should be understood as simple empirical regularities in need of explanation.

To me, this seems close, but not quite right. I agree with everything about this paragraph except for the last four words. A stylized fact can get explained but I think it remains a stylized fact, even though it is no longer in need of explanation. I’d say that, in social science jargon, a stylized fact in need of explanation is called a “puzzle.” Once the puzzle is figured out, it’s still a stylized fact.

But that’s just my impression. As Hirschman says, a term is defined by its use, and maybe the mainstream use of “stylized fact” is actually restricted to what I would call a puzzle or an unexplained stylized fact.

Why ask, “Why ask why?”?

In any case, beyond being a careful treatment of an interesting topic, Hirschman’s discussion interests me because it connects to a concern that Guido Imbens and I raised a few years ago regarding the following problem that we characterize as being typical of a lot of scientific reasoning:

Some anomaly is observed and it needs to be explained. The resolution of the anomaly may be an entirely new paradigm (Kuhn, 1970) or a reformulation of the existing state of knowledge (Lakatos, 1978). . . . We argue that a question such as “Why are there so many cancers in this place?” can be viewed not directly as a question of causal inference, but rather indirectly as an identification of a problem with an existing statistical model, motivating the development of more sophisticated statistical models that can directly address causation in terms of counterfactuals and potential outcomes.

In short, we say that science often proceeds by identifying stylized facts, which, when they cause us to ask “Why?”, represent anomalies that motivate further study. But in our article, Guido and I didn’t mention the term “stylized fact.” We situated our ideas within statistics, econometrics, and the philosophy of science. Hirschman takes this all a step further by connecting it to the practice of social science.

Buying things vs. buying experiences (vs. buying nothing at all): Again, we see a stock-versus-flow confusion

Alex Tabarrok writes:

A nice, well-reasoned piece from Harold Lee pushing back on the idea that we should buy experiences not goods:

While I appreciate the Stoic-style appraisal of what really brings happiness, economically, this analysis seems precisely backward. It amounts to saying that in an age of industrialization and globalism, when material goods are cheaper than ever, we should avoid partaking of this abundance. Instead, we should consume services afflicted by Baumol’s cost disease, taking long vacations and getting expensive haircuts which are just as hard to produce as ever. . . .

. . . tools and possessions enable new experiences. A well-appointed kitchen allows you to cook healthy meals for yourself rather than ordering delivery night after night. A toolbox lets you fix things around the house and in the process learn to appreciate how our modern world was made. A spacious living room makes it easy for your friends to come over and catch up on one another’s lives. A hunting rifle can produce not only meat, but also camaraderie and a sense of connection with the natural world of our forefathers. . . .

The sectors of the economy that are becoming more expensive every year – which are preventing people from building durable wealth – include real estate and education, both items that are sold by the promise of irreplaceable “experiences.” Healthcare, too, is a modern experience that is best avoided. As a percent of GDP, these are the growing expenditures that are eating up people’s wallets, not durable goods. . . .

OK, first a few little things, then my main argument.

The little things

It’s fun to see someone pushing against the “buy experiences, not goods” thing, which has become a kind of counterintuitive orthodoxy. I wrote about this a few years ago, mocking descriptions of screensaver experiments and advice to go to bullfights. So, yeah, good to see this.

There are some weird bits in the quoted passage above. For one thing, that hunting rifle. What is it with happiness researchers and blood sports, anyway? Are they just all trying to show how rugged they are, or something? I eat meat, and I’m not offering any moral objection to hunting rifles—or bullfights, for that matter—but this seems like an odd example to use, given that you can get “camaraderie and a sense of connection with the natural world of our forefathers” by just taking a walk in the woods with your friends or family—no need to buy the expensive hunting rifle for that!

Also something’s off because in one place he’s using “a spacious living room” as an example of a material good that people should be spending on (it “makes it easy for your friends to come over and catch up on one another’s lives”), but then later he’s telling us to stop spending so much money on real estate. Huh? A spacious living room is real estate. Of course, real estate isn’t all about square footage, it’s also about location, location, and location—but, if your goal is to make it easy for your friends to come over, then it’s worth paying for location, no? Personally, I’d rather live around the corner from my friends and be able to walk over than to have a Lamborghini and have to shlep it through rush-hour traffic to get there. Anyway, my point is not that Lee should sell his Lambo and exchange it for a larger living room in a more convenient neighborhood; it just seems that his views are incoherent and indeed contradictory.

And then there are the slams against education and health care. I work in the education sector so I guess I have a conflict of interest in even discussing this one, but let me give Lee the benefit of the doubt and say that lots of education can be replaced by . . . books. And books are cheaper than ever! A lot of education is motivation, and maybe tricks of gamification can allow this to be done with less instructor labor. Still, once you’ve bought the computer, these are services (“experiences”), not durable goods. Indeed, if you’re reading your books online, then those are experiences too.

Talking about education gets people all riled up, so let’s try pushing the discussion sideways, to sports. Lee mentions “a functional kitchen and a home gym (or tennis rackets or cross-country skis).” You might want to pay someone to teach you how to use these things! I think we’re all familiar with the image of the yuppie who buys $100 sneakers and a $200 tennis racket and goes out to the court, doesn’t know what he’s doing, and throws out his back.

A lot of this seems like what Tyler Cowen calls “mood affiliation.” For example, Lee writes, “If you have a space for entertaining and are intentional about building up a web of friendships, you can be independent from the social pull of expensive cities. Build that network to the point of introducing people to jobs, and you can take the edge off, a little, of the pressure for credentialism.” I don’t get it. If you want a lifestyle that “makes it easy for your friends to come over and catch up on one another’s lives,” you might naturally want to buy a house with a large living room in a neighborhood where many of your friends live. Sure, this may be expensive, but who needs the fancy new car, the ski equipment you’ll never use, the home gym that deprives you of social connections, etc. But nooooo. Lee doesn’t want you to do that! He’s cool with the large living room (somehow that doesn’t count as “real estate”), but he’s offended that you might want to live in an expensive city. Learn some economics, dude! Expensive places are expensive because people want to live there! People want to live there for a reason. Yes, I know that’s a simplification, and there are lots of distortions of the market, but that’s still the basic idea. Similarly, wassup with this “pressure for credentialism”? I introduce people to jobs all the time. People often are hirable because they’ve learned useful skills: is that this horrible “credentialism” thing?

The big thing

The big thing, though, is that I agree with Lee and Tabarrok—goods are cheap, and it does seem wise to buy a lot of them (environmental considerations aside)—but I think they’re missing the point, for a few reasons.

First, basic economics. To the extent that goods are getting cheaper and services aren’t, it makes sense that the trend would be (a) to be consuming relatively more goods and relatively fewer services than before, but (b) to be spending a relatively greater percentage of your money on services. Just think about that one for a moment.
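Here’s a toy calculation—my numbers, chosen only to illustrate the point—showing how quantities can shift toward goods as they get cheaper while the share of the budget going to services rises anyway:

```python
# (label, goods quantity, goods price, services quantity, services price)
years = [("then", 10, 1.00, 10, 1.00),
         ("now",  20, 0.30, 12, 1.50)]  # more goods consumed, but goods much cheaper

for label, gq, gp, sq, sp in years:
    goods_spend, services_spend = gq * gp, sq * sp
    share = services_spend / (goods_spend + services_spend)
    print(f"{label}: services take {share:.0%} of total spending")
```

Relative consumption moves toward goods (20 vs. 12 units instead of 10 vs. 10), but the services share of spending rises from 50% to 75%.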

Second, we consume lots more material goods than in the past. Most obviously, we substitute fuel for manual labor, both as individuals and as a society, for example using machines instead of laborers to dig ditches.

Third is the stock vs. flow thing mentioned in the title to this post. As noted, I agree with Lee and Tabarrok that it makes sense in our modern society to consume tons and tons of goods—and we do! We buy gas for our cars, we buy vacuum cleaners and washing machines and dishwashers and computers and home stereo systems and smartphones and toys for the kids and a zillion other things. The “buy experiences not things” advice is not starting from zero: it’s advice starting from the baseline that we buy lots and lots of things. We already have closets and garages and attics full of “thoughtfully chosen material goods [that] can enable new activities [and] enrich your life, extend your capabilities, and deepen your understanding of the world” (to use Lee’s words).

To put it another way, we’re already in the Lee/Tabarrok world in which we’re surrounded by material possessions with more arriving in the post every week. But, as these goods become cheaper and cheaper, it makes sense that a greater proportion of our dollars will be spent on experiences. To try to make the flow of possessions come in even faster and more luxuriously, to the extent of abandoning your education, not going to the doctor, and not living in a desirable neighborhood—that just seems perverse, more of a sign of ideological commitment than anything else.

One more thing

In an important way, all of this discussion, including mine, is in a bubble. If you’re choosing between a fancy kitchen and home gym, a dream vacation complete with bullfight tickets, and a Columbia University education, you’re already doing well financially.

So far we’ve been talking about two ways to spend your money: on things or experiences. But there’s a third goal: security. People buy the house in the nice neighborhood not just for the big living room (that’s a material good that Lee approves of) or to have a shorter commute (an experience, so he’s not so thrilled about that one, I guess), but also to avoid crime and to allow their kids to go to good schools. These are security concerns. Similarly we want reliable health care not for material gain or because it’s a fun experience but because we want some measure of security (while recognizing that none of us will live forever). Similarly for education too: we want the experience of learning and the shiny things we can buy with our future salaries but also future job and career security. So it’s complicated, but I don’t know that either of the solutions on offer—buying more home gym equipment or buying more bullfight tickets—is the answer.

Jamaican me crazy one more time

Someone writes in:

After your recent Jamaican Me Crazy post, I dug into the new JECS paper a bit, and the problems are much deeper than what you mentioned. The main problems have to do with their block permutation approach to inference.

The article he’s referring to is “Effect of the Jamaica early childhood stimulation intervention on labor market outcomes at age 31”; it’s an NBER working paper and I blogged about it last month. I was surprised to hear that it had already been published in JECS.

I did some googling but couldn’t find the JECS version of the paper . . . maybe the title had been changed so I searched JECS and the author names, still couldn’t find it, then I realized I didn’t even know what JECS was: Journal of Economic . . . what, exactly? So I swallowed my pride and asked my correspondent what exactly was the new paper he was referring to, and he replied that it was the Jamaican Early Childhood Stimulation study, hence JECS. Of course! Jamaican me crazy, indeed.

Anyway, my correspondent followed up with specific concerns:

1. The first issue that I noticed is in their block 5. They lay out the blocks in Appendix A. The blocks as described in the body of the paper are stratified first by a mother’s education dummy and assignment to the nutritional supplement treatment arm (supposedly for being unbalanced at baseline), and then gender and an age dummy, which is how the study’s initial randomization was stratified. However, block 5 is only broken up by age, not gender. There’s no reason I can see for doing this – breaking it up by gender won’t create new blocks that are all treatment or control, nor will they be exceptionally small (current blocks 1, 3, and 6 are all smaller than what the resulting blocks would be). Regardless, this violates the exchangeability assumption of their permutation tests. Considering block 5 is 19% of their sample, splitting it could create a meaningful difference in their inferences.

2. Their block 1 is only mothers with higher education; it isn’t broken out by supplement, gender, or age. Again, this violates the exchangeability assumption, no reason is given as to why, and if you were only to read the body of the paper, you would have no idea that this is what they were doing. I’ve attached the actual design of the blocking here.

3. In the 2014 paper, the blocking uses mother’s education, mother’s employment status, a discretized weight-for-height variable, and then gender and age. No reason is given for why they dropped employment and weight-for-height and added supplement assignment – these are all baseline variables, if they were imbalanced in 2014, they’re still imbalanced now! Stranger still, the supplement assignment isn’t even imbalanced, since it was originally a treatment arm!

I [the anonymous correspondent] found the 2014 data on ICPSR, and ran a handful of analyses, looking at the p-values you get if you run their 2021 blocking as they ran it, 2021 with block 5 split properly, some asymptotic robust p’s, and their 2014 blocking as I think they did it. I say “I think” because their replication code is 125 MATLAB files with no documentation. If you do it as described in the 2014 paper, you have 105 kids divided into 48 blocks, so you end up with lots of empty or single-observation blocks; I’m sure that isn’t what they did, but it’s my best guess. I attached the table from running those here as well:

There were other issues, like their treatment of emigration, but this email is already long. You might also be interested in something on the academic side. I showed this to my advisor, and was basically told “great work, I don’t think you should pursue this.” . . . He recommended at most that I create a dummy email account, scrub my PDFs of any metadata, and send it anonymously to you and Uri Simonsohn. So at least for now, like your original correspondent, I live in the Midwest and have to walk across parking lots from time to time, so if you do blog, please keep me anonymous.

“Their replication code is 125 MATLAB files with no documentation”: Hey, that sounds a bit like my replication code sometimes! And I’ve been known to have some forking paths in my analyses. That’s one reason why I think the appropriate solution to multiplicity in statistical analysis is to fit multilevel models rather than to try to figure out multiple comparisons corrections. It should be about learning from the data, not about rejecting null hypotheses that we know ahead of time are false. So . . . I’m not particularly interested in the details of the permutation tests as discussed above—except to the extent that the results from those tests are presented as evidence in favor of the researchers’ preferred theories, in which case it’s useful to see the flaws in their reasoning.
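For readers who want to see the mechanics the correspondent is describing, here’s a minimal sketch with fake data (not the JECS study): in a blocked permutation test, treatment labels get shuffled only within the strata used in the original randomization, because that’s the only exchangeability the design justifies. Splitting or merging blocks differently from how the randomization was done changes which permutations are treated as equally likely.

```python
import numpy as np

rng = np.random.default_rng(7)
n_blocks, block_size = 8, 12
block = np.repeat(np.arange(n_blocks), block_size)
treat = np.concatenate([rng.permutation([0, 1] * (block_size // 2))
                        for _ in range(n_blocks)])   # randomized within each block
y = 0.2 * treat + rng.normal(size=block.size)        # fake outcome

def diff_in_means(t):
    return y[t == 1].mean() - y[t == 0].mean()

observed = diff_in_means(treat)
perm_stats = []
for _ in range(5000):
    t_perm = treat.copy()
    for b in range(n_blocks):
        idx = np.where(block == b)[0]
        t_perm[idx] = rng.permutation(t_perm[idx])   # shuffle within block only
    perm_stats.append(diff_in_means(t_perm))

p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print("blocked permutation p-value:", round(p_value, 3))
```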

Also, yeah, laffs and all that but for reals it’s absolutely horrible that people are afraid to express criticism of published work. What a horrible thing it says about our academic establishment that this sort of thing is happening. I don’t like being called Stasi or a terrorist or all the other crap that they throw at us, and I didn’t like it when my University of California colleagues flat-out lied about my research in order to do their friends a solid and stop my promotion, but, at least that was all in the past. The idea that it’s still going on . . . jeez. Remember this story, where an econ professor literally wrote, “I cam also play this hostage game,” threatening the careers of the students of a journal editor? Or the cynical but perhaps accurate remarks by Steven Levitt regarding scientific journals? I have no reason to think that economists behave worse than researchers in other fields; maybe they’re just more open about it, overtly talking about retaliation and using terms such as “hostage,” whereas people in other fields might just do these things and keep quiet about it.

Just to be on the safe side, though, look both ways before you cross that parking lot.

“How fake science is infiltrating scientific journals”

Chetan Chawla points us to this disturbing news article from Harriet Alexander:

In 2015, molecular oncologist Jennifer Byrne was surprised to discover during a scan of the academic literature that five papers had been written about a gene she had originally identified, but did not find particularly interesting.

“Looking at these papers, I thought they were really similar, they had some mistakes in them and they had some stuff that didn’t make sense at all,” she said. As she dug deeper, it dawned on her that the papers might have been produced by a third party working for profit. . . .

The more she investigated, the more clear it became that a cottage industry in academic fraud was infecting the literature. In 2017, she uncovered 48 similarly suspicious papers and brought them to the attention of the journals, resulting in several retractions, but the response from the publishing industry was varied, she said.

“A lot of journals don’t really want to know,” she said. . . .

More recently, she and a French collaborator developed a software tool that identified 712 papers from a total of more than 11,700 which contain wrongly identified sequences that suggest they were produced in a paper mill. . . .

Even if the research was published in low-impact journals, it still had the potential to derail legitimate cancer research, and anybody who tried to build on it would be wasting time and grant money . . . Publishers and researchers have reported an extraordinary proliferation in junk science over the last decade, which has infiltrated even the most esteemed journals. Many bear the hallmarks of having been produced in a paper mill: submitted by authors at Chinese hospitals with similar templates or structures. Paper mills operate several models, including selling data (which may be fake), supplying entire manuscripts or selling authorship slots on manuscripts that have been accepted for publication.

The Sydney Morning Herald has learned of suicides among graduate students in China when they heard that their research might be questioned by authorities. Many universities have made publication a condition of students earning their masters or doctorates, and it is an open secret that the students fudge the data. . . .

In 2017, responding to a fake peer review scandal that resulted in the retraction of 107 papers from a Springer Nature journal, the Chinese government cracked down and created penalties for research fraud. Universities stopped making research output a condition of graduation or the number of articles a condition of promotion. . . . But those familiar with the industry say the publication culture has prevailed because universities still compete for research funding and rankings. . . . The Chinese government’s investigation of the 107 papers found only 11 per cent were produced by paper mills, with the remainder produced in universities. . . .

As Chawla writes, what’s scary is the idea that this Greshaming isn’t just happening in the Freakonomics/Gladwell/NPR/Ted/Psychological Science axis of bogus social science storytelling; it’s also occurring in fields such as cancer research, which we tend to think of as being more serious. OK, not always, but usually, right??

I continue to think that the way forward is to put everything on preprint servers and turn journals into recommender systems. The system would still have to deal with paper mills, but perhaps the problem would be easier to handle through post-publication review.

Opportunity for political scientists and economists to participate in the Multi100 replication project!

Barnabás Szászi writes:

I’m contacting you now regarding a project that I’m co-leading with Aczel Balazs. Here, we aim to estimate how robust published results and conclusions in social and behavioral sciences are to analysts’ analytical choices. What we do is that 100 empirical studies published in different disciplines of social and behavioral sciences are being re-analyzed by independent researchers.

More than 500 re-analysts applied for the project and we have almost all the papers being already analyzed by some, but we did not get enough volunteers for papers from Economics, International Relations, and Political Science. Probably, our network is very psychological.

As you are not just a statistician but also a political scientist, we were wondering if you had any options to put on your blog or send around the following ad to some bigger crowd in any of these areas?

OK, here it is!

Jamaican me crazy: the return of the overestimated effect of early childhood intervention

A colleague sent me an email with the above title and the following content:

We were talking about Jamaica childhood intervention study. The Science paper on returns to the intervention 20 years later found a 25% increase but an earlier draft had reported a 42% increase. See here.

Well, it turns out the same authors are back in the stratosphere! In a Sept 2021 preprint, they report a 43% increase, but now 30 rather than 20 years after the intervention (see abstract). It seems to be the same dataset and they again appear to have a p-value right around the threshold (I think this is the 0.04 in the first row of Table 1 but I did not check super carefully).

Of course, no mention I could find of selection effects, the statistical significance filter, Type M errors, the winner’s curse or whatever term you want to use for it…

From the abstract of the new article:

We find large and statistically significant effects on income and schooling; the treatment group had 43% higher hourly wages and 37% higher earnings than the control group.

The usual focus of economists is earnings, not wages, so let’s go with that 37% number. It’s there in the second row of Table 1 of the new paper: the estimate is 0.37 with a standard error of . . . ummmm, it doesn’t give a standard error but it gives a t statistic of 1.68—

What????

B-b-b-but in the abstract it says this difference is “statistically significant”! I always thought that to be statistically significant the estimate had to be at least 2 standard errors from zero . . .

They have some discussion of some complicated nonparametric tests that they do, but if your headline number is only 1.68 standard errors away from zero, asymptotic theory is the least of your problems, buddy. Going through page 9 of the paper, it’s kind of amazing how much high-tech statistics and econometrics they’re throwing at this simple comparison.

Anyway, their estimate is 0.37 with standard error 0.37/1.68 = 0.22, so the 95% confidence interval is [0.37 +/- 2*0.22] = [-0.07, 0.81]. But it’s “statistically significant” cos the 1-sided p-value is 0.05. Whatever. I don’t really care about statistical significance anyway. It’s just kinda funny that, after all that effort, they had to punt on the p-value like that.
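For the record, here’s that back-of-the-envelope arithmetic, using the usual normal approximation and the +/- 2 standard errors convention from the text:

```python
from math import erf, sqrt

estimate, t_stat = 0.37, 1.68       # as reported in the paper's Table 1
se = estimate / t_stat              # roughly 0.22
ci = (estimate - 2 * se, estimate + 2 * se)
one_sided_p = 0.5 * (1 - erf(t_stat / sqrt(2)))   # normal upper-tail probability

print(f"se = {se:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), one-sided p = {one_sided_p:.3f}")
```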

Going back to the 2014 paper, I came across this bit:

I guess that any p-value less than 0.10 is statistically significant. That’s fine; they should just get on the horn with the ORBITA stents people, because their study, when analyzed appropriately, ended up with p = 0.09, and that wasn’t considered statistically significant at all; it was considered evidence of no effect.

I guess the rule is that, if you’re lucky enough to get a result between 0.05 and 0.10, you get to pick the conclusion based on what you want to say: if you want to emphasize it, call it statistically significant; if not, call it non-significant. Or you can always fudge it by using a term like “suggestive.” In the above snippet they said the treatment “may have” improved skills and that treatment “is associated with” migration. I wonder if that phrasing was a concession to the fat p-value of 0.09. If the p-value had come in at the more conventionally attractive level of 0.05 or below, maybe they would’ve felt free to break out the causal language.

But . . . it’s kind of funny for me to be riffing on p-values and statistical significance, given that I don’t even like p-values and statistical significance. I’m on record as saying that everything should be published and there should be no significance threshold. And I would not want to “threshold” any of this work either. Publish it all!

There are two places where I would diverge from these authors. The first is in their air of certainty. Rather than saying a “large and statistically significant effect” of 37%, I’d say an estimate of 37% with a standard error of 22%, or just give the 95% interval like they do in public health studies. JAMA would never let you get away with just giving the point estimate like that! Seeing this uncertainty tells you a few things: (a) the data are compatible (to use Sander Greenland’s term) with a null effect, (b) if the effect is positive, it could be all over the place, so it’s misleading as fiddlesticks to call it “a substantial increase” over a previous estimate of 25%, and (c) it’s empty to call this a “large” effect: with this big of a standard error, it would have to be “large” or it would not be “statistically significant.” To put it another way, instead of the impressive-sounding adjective “large” (which is clearly not necessarily the case, given that the confidence interval includes zero), it would be more accurate to use the less-impressive-sounding adjective “noisy.” Similarly, their statement, “Our results confirm large economic returns . . .”, seems a bit irresponsible given that their data are consistent with small or zero economic returns.

The second place I’d diverge from the authors is in the point estimate. They use a data summary of 37%. This is fine as a descriptive data summary, but if we’re talking policy, I’d like some estimate of treatment effect, which means I’d like to do some partial pooling with respect to some prior, and just about any reasonable prior will partially pool this estimate toward 0.
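Here’s what that partial pooling looks like in the simplest normal-normal setup. The prior scale is my own placeholder (a standard deviation of 0.10, i.e., effects mostly within plus or minus 20%), so treat the numbers as an illustration of the shrinkage mechanics, not as my recommended analysis:

```python
estimate, se = 0.37, 0.22         # reported estimate and implied standard error
prior_mean, prior_sd = 0.0, 0.10  # placeholder prior, centered at no effect

shrinkage = prior_sd**2 / (prior_sd**2 + se**2)   # weight placed on the data
post_mean = prior_mean + shrinkage * (estimate - prior_mean)
post_sd = (1 / (1 / se**2 + 1 / prior_sd**2)) ** 0.5

print(f"weight on the data: {shrinkage:.2f}")
print(f"partially pooled estimate: {post_mean:.2f} (sd {post_sd:.2f})")
```

Any reasonable prior of this general flavor pulls the 37% way down toward zero, which is the point.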

Ok, look. Lots of people don’t like Bayesian inference, and if you don’t want to use a prior, I can’t make you do it. But then you have to recognize that reporting the simple comparison, conditional on statistical significance (however you define it) will give you a biased estimate, as discussed on pages 17-18 of this article. Unfortunately, that article appeared in a psychology journal so you can’t expect a bunch of economists to have heard about it, but, hey, I’ve been blogging about this for years, nearly a decade, actually (see more here). Other people have written about this winner’s curse thing too. And I’ve sent a couple emails to the first author of the paper pointing out this bias issue. Anyway, my preference would be to give a Bayesian or regularized treatment effect estimator, but if you don’t want to do that, then at least report some estimate of the bias of the estimator that you are using. The good news is, the looser your significance threshold, the lower your bias!

But . . . it’s early childhood intervention! Don’t you care about the children???? you may ask. My response: I do care about the children, and early childhood intervention could be a great idea. It could be great even if it doesn’t raise adult earnings at all, or if it raises adult earnings by an amount that’s undetectable by this noisy study.

Think about it this way. Suppose the intervention has a true effect of raising earnings by an average of 10%. That’s a big deal, maybe not so much for an individual, but an average effect of 10% is a lot. Consider that some people won’t be helped at all—that’s just how things go—so an average of 10% implies that some people would be helped a whole lot. Anyway, this is a study where the standard deviation of the estimated effect is 0.22, that is, 22%. If the average effect is 10% and the standard error is 22%, then the study has very low power, and it’s unlikely that a preregistered analysis would result in statistical significance, even at the 0.1 or 0.2 level or whatever it is that these folks are using. But, in this hypothetical world, the treatment would be awesome.
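A quick calculation for that hypothetical (true effect 0.10, standard error 0.22, two-sided test at the 5% level, normal approximation):

```python
from scipy.stats import norm

true_effect, se = 0.10, 0.22
z = true_effect / se
power = norm.sf(1.96 - z) + norm.cdf(-1.96 - z)
print(f"power: {power:.2f}")   # about 0.07: almost no chance of reaching "significance"
```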

My point is, there’s no shame in admitting uncertainty! The point estimate is positive; that’s great. There’s a lot of uncertainty, and the data are consistent with a small, tiny, zero, or even negative effect. That’s just the way things go when you have noisy data. As quantitative social scientists, we can (a) care about the kids, (b) recognize that this evaluation leaves us with lots of uncertainty, and (c) give this information to policymakers and let them take it from there. I feel no moral obligation to overstate the evidence, overestimate the effect size, and understate my uncertainty.

It’s so frustrating, how many prominent academics just can’t handle criticism. I guess they feel that they’re in the right and that all this stat stuff is just a bunch of paperwork. And in this case they’re doing the Lord’s work, saving the children, so anything goes. It’s the Armstrong principle over and over again.

And in this particular case, as my colleague points out, it’s not just that they failed to acknowledge or deal with criticism of the prior paper; in this new paper they are actively repeating the very same error with the very same study and data, after having been made aware of the problem on more than one occasion, without acknowledging the issue at all. Makes me want to scream.

P.S. When asked whether I could share my colleague’s name, my colleague replied:

Regarding quoting me, do recall that I live in the midwest and have to walk across parking lots from time to time. So please do so anonymously.

Fair enough. I don’t want to get anybody hurt.

Again on the problems with technology that makes it more convenient to gamble away your money

This news article by Sheelah Kolhatkar on the stock-market gambling app Robinhood reminded me of the creepy story a decade ago about some sleazeballs who were using statistics to locate and manipulate gambling addicts so as to bust them out.

It’s complicated. On one hand, gambling is fun, when done in moderation it causes no harm, and these companies are providing a service by making it accessible to people, in the same way that supermarkets provide a service when they sell alcohol. On the other hand, there’s a clear motivation to cater to addicts and get them to spend more. No easy answers. I guess this is just a special case of a general problem of our society of abundance. As a statistician I’m particularly interested in gambling because of its connection to probability and uncertainty.

Gigerenzer: “On the Supposed Evidence for Libertarian Paternalism”

From 2015. The scourge of all things heuristics and biases writes:

Can the general public learn to deal with risk and uncertainty, or do authorities need to steer people’s choices in the right direction? Libertarian paternalists argue that results from psychological research show that our reasoning is systematically flawed and that we are hardly educable because our cognitive biases resemble stable visual illusions. For that reason, they maintain, authorities who know what is best for us need to step in and steer our behavior with the help of “nudges.” Nudges are nothing new, but justifying them on the basis of a latent irrationality is. In this article, I analyze the scientific evidence presented for such a justification. It suffers from narrow logical norms, that is, a misunderstanding of the nature of rational thinking, and from a confirmation bias, that is, selective reporting of research. These two flaws focus the blame on individuals’ minds rather than on external causes, such as industries that spend billions to nudge people into unhealthy behavior. I conclude that the claim that we are hardly educable lacks evidence and forecloses the true alternative to nudging: teaching people to become risk savvy.

Good stuff here on three levels: (1) social science theories and models; (2) statistical reasoning and scientific evidence; and (3) science and society.

Gigerenzer’s article is interesting in itself and also as a counterpart to the institutionalized hype of the Nudgelords.