The contrapositive of “Politics and the English Language.” One reason writing is hard:

Posted on March 25, 2024 9:50 AM by Andrew

In his classic essay, “Politics and the English Language,” the political journalist George Orwell drew a connection between cloudy writing and cloudy content.

The basic idea was: if you don’t know what you’re saying, or if you’re trying to say something you don’t really want to say, then one strategy is to write unclearly. Conversely, consistently cloudy writing can be an indication that the writer ultimately doesn’t want to be understood.

In Orwell’s words:

[The English language] becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts.

He continues:

In our time, political speech and writing are largely the defence of the indefensible. Things like the continuance of British rule in India, the Russian purges and deportations, the dropping of the atom bombs on Japan, can indeed be defended, but only by arguments which are too brutal for most people to face, and which do not square with the professed aims of the political parties. Thus political language has to consist largely of euphemism, question-begging and sheer cloudy vagueness.

A few years ago I posted on this topic, drawing an analogy to cloudy writing in science. To be sure, much of the bad writing in science comes from researchers who have never learned to write clearly. Writing is hard!

But it’s not just that. A key problem with a lot of the bad science that we see featured in PNAS, Ted, NPR, Gladwell, Freakonomics, etc., is that the authors are trying to use statistical analysis and storytelling to do something they can’t do with their science, which is to draw near-certain conclusions from noisy data that can’t support strong conclusions. This leads to tortured constructions such as this from a medical journal:

The pair‐wise results (using paired‐samples t‐test as well as in the mixed model regression adjusted for age, gender and baseline BMI‐SDS) showed significant decrease in BMI‐SDS in the parents–child group both after 3 and 24 months, which indicate that this group of children improved their BMI status (were less overweight/obese) and that this intervention was indeed effective.

However, as we wrote in the results and the discussion, the between group differences in the change in BMI‐SDS were not significant, indicating that there was no difference in change in our outcome in either of the interventions. We discussed, in length, the lack of between‐group difference in the discussion section. We assume that the main reason for the non‐significant difference in the change in BMI‐SDS between the intervention groups (parents–child and parents only) as compared to the control group can be explained by the fact that the control group had also a marginal positive effect on BMI‐SDS . . .

Obv not as bad as political journalists in the 1930s defending Stalin’s purges or whatever; the point is that the author is in the awkward position of trying to use the ambiguities of language to say something while not quite saying it. Which leads to unclear and barely readable writing, not just by accident.

The writing and the statistics have to be cloudy, because if they were clear, the emptiness of the conclusions would be apparent.

The problem

Orwell’s statement, when transposed to writing a technical paper, is that if you attempt to cover the gaps in your reasoning with words, this will typically yield bad writing. Indeed, if you’re covering the gaps in your reasoning with words, you’ll either have bad writing or dishonest writing, or both. In some important way, it’s a good thing that this sort of writing is so hard to follow; otherwise it could be really misleading.

Now let’s flip it around.

Often you will find yourself trying to write an article, and it will be very difficult to write it clearly. You’ll go around and around, and whatever you, your written output will feel like the worst of both worlds: a jargon-filled mess, while at the same time being sloppy and imprecise. Try to make it more readable and it becomes even sloppier and harder to follow at a technical level; try to make it accurate and precise, and it reads like a complicated, uninterpretable set of directions.

You’re stuck. You’re in a bad place. And any direction you take makes the writing worse in some important way.

What’s going on?

It could be this: You’re trying to write something you don’t fully understand, you’re trying to bridge a gap between what you want to say and what is actually justified by your data and analysis . . . and the result is “Orwellian,” in the sense that you’re desperately using words to try to paper over this yawning chasm in your reasoning.

The solution

One way out of this trap is to follow what we could call Orwell’s Contrapositive.

It goes like this: Step back. Pause in whatever writing you’re doing. Pull out a new sheet of paper (or an empty document on the computer) and write, as directly as you can, in two columns. Column 1 is what you want to be able to say (the method is effective, the treatment saves lives, whatever); Column 2 is what is supported by your evidence (the method works better than a particular alternative in a particular setting, fewer people died in the treatment than the control group after adjusting this and that, whatever).

At that point, do the work to pull Column 2 to Column 1, or make concessions to reality to shift Column 1 toward Column 2. Do what it takes to get them to line up.

At this point, you’ve left the bad zone in which you’re trying to say more than you can honestly say. And the writing should then go much smoother.

That’s the contrapositive: if bad writing is a sign of someone trying to say the indefensible, then you can make your writing better by not trying to say the defensible, either by expanding what is legitimately defensible or restricting what you’re trying to say.

Remember the folk theorem of statistical computing: When you have computational problems, often there’s a problem with your model. Orwell’s Contrapositive is a sort of literary analogy to that.

One reason writing is hard

To put it another way: One reason writing is hard is that we use writing to cover the gaps in our reasoning. This is not always a bad thing! On the way to the destination of covering these gaps is the important step of revealing these gaps. We write to understand. Writing has an internal logic that can protect us from (some) errors and gaps—if we let it, by reacting to the warning sign that the writing is unclear.

Hey! Here’s a study where all the preregistered analyses yielded null results but it was presented in PNAS as being wholly positive.

Posted on March 24, 2024 9:43 AM by Andrew

Ryan Briggs writes:

In case you haven’t seen this, PNAS (who else) has a new study out entitled “Unconditional cash transfers reduce homelessness.” This is the significant statement:

A core cause of homelessness is a lack of money, yet few services provide immediate cash assistance as a solution. We provided a one-time unconditional CAD$7,500 cash transfer to individuals experiencing homelessness, which reduced homelessness and generated net societal savings over 1 y. Two additional studies revealed public mistrust in homeless individuals’ ability to manage money and the benefit of counter-stereotypical or utilitarian messaging in garnering policy support for cash transfers. This research adds to growing global evidence on cash transfers’ benefits for marginalized populations and strategies to increase policy support. Although not a panacea, cash transfers may hasten housing stability with existing social supports. Together, this research offers a new tool to reduce homelessness to improve homelessness reduction policies.

Based on that, I was surprised to read the pre-registration documents and supplemental information and learn that literally none of the outcomes that the researchers pre-registered were significant. Even the variable that they chose to focus on (days homeless) was essentially the same in the 12 month follow up (0.18 vs 0.17) and, just eyeballing Table S3, it seems the differences were rarely large and not ever significant in any single follow up period.

This is now generating news coverage about how cash transfers work to reduce homelessness (e.g., here and here).

I guess in a sense pre-registration worked because we can see that they did not expect this and had to explore to find it, but what good does that do if the press just reports it all credulously?

I have mixed feelings on this one. On one hand, I don’t like the whole statistical-significance-thresholding thing: if the study found positive results, this could be worth reporting, even if the results are within the margin of error. This within-the-margin-of-error bit should just be mentioned in the news articles. On the other hand, if the researchers are rummaging around through their results looking for something big to report, then, yeah, these results will be massively biased upward.

So, from that perspective, maybe a good headline would not be, “Homeless people were given lump sums of cash. Their spending defied stereotypes” or “B.C. researchers studied how homeless people spent a $7,500 handout. Here’s what they found,” but rather something like, “Preliminary results from a small study suggest . . .”

But then we could step back and ask, How did this study get the press in the first place? I’m guessing PNAS is the reason. So let’s head to the PNAS paper. From the abstract:

Exploratory analyses showed that over 1 y, cash recipients spent fewer days homeless, increased savings and spending with no increase in temptation goods spending, and generated societal net savings of $777 per recipient via reduced time in shelters.

I guess that “exploratory analysis” is code for non-preregistered or non-statistically-significant. Either way, I think it’s irresponsible and statistically incorrect—although, regrettably, absolutely standard practice—to report this “$777” without any regularization or partial pooling toward zero. It’s a biased estimate, and the bias could be huge.

Figure 1 of the paper looks very impressive! This figure displays 35 outcomes, almost all of which go in a positive direction (fewer days homeless, more days in stable housing, higher value of savings . . ., all the way down to lower substance use severity, lower cost of all service use, and cost of shelter use. The very few negative outcomes were tiny compared to their uncertainty. If you look at Figure 1, the evidence looks overwhelming.

But Table 1 does not seem like such a great summary of the data displayed elsewhere in the paper. Looking at Table 3, the good stuff all seems to be happening in the 1-month and 3-month followups without much happening after 1 year.

Here’s what the authors wrote:

The preregistered analyses yielded null effects in cognitive and well-being outcomes, which could be due to the low statistical power from the small participant number in each condition or the possibility that any effect on cognition and well-being may take more than 1 mo to show up.

I agree that these null findings should be mentioned right up there in the abstract. They should also include the possibility that the treatment really has no consistent effect on these outcomes. It’s kinda lame to give all these alibis and never even consider that maybe there’s nothing going on.

What about the housing effects going away after a year? The authors write:

First, the cost of living is extremely high in Vancouver, and the majority of the cash was spent within the first 3 mo for most recipients. Second, while the cash provided immediate benefits, control participants even-tually “caught up” over time.

On the other hand, here’s what they said about a different result:

By combining the two cash and two noncash conditions to increase statistical power, exploratory analyses showed that cash recipients showed higher positive affect at 1 mo and higher executive function at 3 mo. Based on debriefing, participants expressed that while they were initially happy with the cash transfer, moving out of homelessness into stable housing took substantial efforts and hard work in the first few months, which could explain the delayed effect on cognitive function.

They’ve successfully convinced me that they have the ability to explain any possible result they might find.

The thing that bothers me most about the paper is that the authors don’t seem to have wrestled with the ways in which their results seem to refute their theoretical framework. Their choice of what to preregister suggests that they were expecting to find large effects on cognitive and subjective well-being outcomes and then maybe, if they were lucky, they’d find some positive results on financial and housing outcomes. I guess their theory was that the money would give people a better take on life, which could then lead to material benefits. Actually, though, they found no benefits on the cognitive and subjective outcomes—when I say “no benefits,” I mean, yeah, really nothing, not just nothing statistically significant—but the money did seem to help people pay the rent for the first few months. That’s fine—there are worse things than giving low-income people some money to pay the rent!—; it’s just a different story than what they’d started with. It’s less of a psychology story and more of an economics story. In any case, yeah, further study is required. I just think that they could get the most from their existing study if they thought more about what went wrong with their theory.

Hey—let’s collect all the stupid things that researchers say in order to deflect legitimate criticism

Posted on March 23, 2024 9:53 AM by Andrew

When rereading this post the other day, I noticed the post that came immediately before.

I followed the link and came across the delightful story of a researcher who, after one of his papers was criticized, replied, “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)” One of the critics responded with appropriate disdain, writing:

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become . . . a vast graveyard of undead theories?

I agree. To put it another way: Yes, ha ha ha, let’s spend our time on guitar practice rather than exhuming 11-year-old published articles. Fine—I’ll accept that, as long as you also accept that we should not be citing 11-year-old articles.

As is so often the case, the authors of published work are happy to get unthinking positive publicity and citations, but when anything negative comes in, they pull up the drawbridge.

From the perspective of the ladder of responses to criticism, the above behavior isn’t so bad: they’re not suing their critics or using surrogates to attack them critics or labeling anybody as suicide bombers or East German secret police, they’re just trying to laugh it off. From a scientific perspective, though, it’s still pretty bad to act as there’s something wrong with discussing the flaws of a paper that’s still being cited, just cos it’s a decade old.

Putting together a list

Anyway, this made me think of a fun project, which is to list all the different ways that researchers try to avoid addressing legitimate criticism of their published work.

Here are a few responses we’ve seen. I won’t bother finding the links right now, but if we put together a good list, I can go back and provide references for all of them.

1. The corrections do not affect the main results of the paper. (Always a popular claim, even if the corrections actually do affect the main results of the paper.)

2. The criticism should be dismissed because the critics are obsessive/Stasi/terrorists, etc. (Recall the Javert paradox.)

3. The critics are jealous losers sniping at their betters. Or, if that doesn’t work, the critics are picking on unfortunate young researchers. (I don’t think it does any favors to researchers of any age to exempt their work from criticism.)

4. The criticism is illegitimate if it does not go through the peer-review process. (A hard claim to swallow given how the peer-review process is rigged against criticism of published papers.)

5. Criticism should be a discreet exchange between author and critic, with no public criticism. (But the people who claim to hold that attitude seem to have no problem when their work is cited or praised in a public way.)

The most common response to criticism seems to be to just ignore it entirely and hope it goes away. Unfortunately, that strategy often seems to work very well!

Jonathan Bailey vs. Stephen Wolfram

Posted on March 22, 2024 9:41 AM by Andrew

Key quote:

While there are definitely environments where using a ghostwriter is acceptable, academic publishing typically isn’t one of them.

The reason is simple: Using a ghostwriter on an academic paper entails having an author do significant work on the paper without receiving credit or having their work disclosed. This is broadly seen as a breach of authorship and an act of research misconduct unto itself.

Why are all these school cheating scandals happening?

Posted on March 21, 2024 9:14 AM by Andrew

Paul Alper writes:

While the national scene is all about woke, book banning and the like, apparently Columbia University is still dealing with the long-standing conundrum, the best method to teach kids how to read.

He’s referring to this news article, “Amid Reading Wars, Columbia Will Close a Star Professor’s Shop,” which begins:

Lucy Calkins ran a beloved — and criticized — center at Teachers College for four decades. It is being dissolved. . . .

Her curriculum had teachers conduct “mini-lessons” on reading strategies, but also gave students plenty of time for silent reading and freedom to choose their own books. Supporters say those methods empower children, but critics say they waste precious classroom minutes, and allow students to wallow in texts that are too easy.

Some of the practices she once favored, such as prompting children to guess at words using the first letter and context clues, like illustrations, have been discredited.

Over the past three years, several prominent school districts — including New York City, the nation’s largest — dropped her program, though it remains in wide use. . . .

Critics of her ideas, including some cognitive scientists and instructional experts, said her curriculum bypassed decades of settled research, often referred to as the science of reading. That body of research suggests that direct, carefully sequenced instruction in phonics, vocabulary building and comprehension is more effective for young readers than Dr. Calkins’s looser approach.

Alper writes:

This article did not at all mention anything about language specifics. I bring this up because my granddaughters are in a Minneapolis Spanish immersion primary school. Because Spanish is almost 100 per cent phonetic, and English is terrible in this regard, they spell and read better in Spanish than they do in English. The mechanics of learning to read back in my day, was simple and devoid of theory or disagreement. You kept at it until you got it right. The it was English only because no accommodation was made for special needs, immigrants or for the outside world in general.

I know some people at Teachers College but I’ve never encountered Prof. Calkins, nor have I ever looked at the literature on language teaching. So I got nothin’ on this one.

But I did reply that the above story isn’t half as bad as this one from a few years back, which I titled, “What’s the stupidest thing the NYC Department of Education and Columbia University Teachers College did in the past decade?” It involved someone who was found to be a liar, a cheat, and a thief, and then, with that all known, was hired to two jobs as school principal! And then a Teachers College professor said, “We felt that on balance, her recommendations were so glowing from everyone we talked to in the D.O.E. that it was something that we just were able to live with.” This came out in the news after the principal in question was found do have “forged answers on students’ state English exams in April because the students had not finished the tests.” Quelle surprise, no? A liar/cheat/thief gets a new job doing the same thing and then does more lying and cheating (maybe no stealing that time, though).

Alper responded:

You wrote that in 2015 which is about the same time as this story which made Fani Willis RICO famous:

Her most prominent case was her prosecution of the Atlanta Public Schools cheating scandal. Willis, an assistant district attorney at the time, served as lead prosecutor in the 2014 to 2015 trial of twelve educators accused of correcting answers entered by students to inflate the scores of state administered standardized tests.

SAT and all the others did not exist in my 1950 NYC school days, but I believe we did have the so-called Regents Exams and they are still around. It never crossed my mind that the scoring of those exams was not on the up and up. Was I being naive? Was there more honesty and/or less messing around back then and it was just not financially worth it?

Here’s my response:

1. This particular form of cheating sounds no more easy or difficult now than in the past.

2. In the past (i.e., somewhere between 1950 and 2015), tests were important students but not so much for schools. So, yeah, students may have been motivated to cheat, but teachers and school administrators did not have any motivation, either to help students cheat, or to massively cheat on their own. Nowadays, tests can be high stakes for the school administrators, and so, for some of them, cheating is worth the risk.

“Whistleblowers always get punished”

Posted on March 20, 2024 9:45 AM by Andrew

In one of our comment threads about how scholars and journalists should be thanking, not smearing, people who ask for replications, Allan Stam writes:

The corollary to all this, and closely related to Javert’s paradox, is the social law: Whistleblowers always get punished.

The Javert paradox, as regular readers will recall, goes like this: Suppose you find a problem with published work. If you just point it out once or twice, the authors of the work are likely to do nothing. But if you really pursue the problem, then you look like a Javert, that is, like an obsessive, a “hater,” someone who needs to “get a life.” It’s complicated, because some critics really do obsess over unimportant details.

On the other hand, details that are unimportant in themselves can be important as indicating bigger problems. For example, the Nudgelords hyped some junk science. In one way, that’s no big deal: everybody makes mistakes. But their lack of interest in their mistakes and their willingness to memory-hole these errors suggests a deeper problem, in that their workflow is lacking that important feedback loop that can allow themselves to identify places where their model for the world has failed. A lack of interest in confronting the failure of one’s model: that’s something that bothered me with so many Bayesians back in the early 1990s, motivating much of my work on posterior predictive checking, and it bothers me today.

The point is, sometimes to find the problems you have to look at the details in detail, which takes the sort of extra effort that can make you look obsessive—heck, maybe it is obsessive. But, so what? And, sure, sometimes a critic will be obsessive and also just be mistaken, and that’s annoying, but there’s little we can do except to try our best to respond to those mistaken criticisms when they arise.

Now back to Stam’s point.

I pretty much agree with what he’s saying: whistleblowing just about always seems to be a bad career move. The clarification I’d like to make is that the “punishment” received by a whistleblower is not necessary anyone directly trying directly trying to punish anyone.

Here’s how it goes. Scholar A does something wrong—maybe it’s flat-out cheating, maybe it’s just bad work that the scholar doesn’t want anyone to re-examine, which, OK, that attitude is a form of cheating too (Clarke’s law!). Scholar B points out the problem.

At this point, no “whistleblowing” has happened. “Whistleblowing” occurs following two more steps: (1) Scholar A, instead of behaving properly by acknowledging and considering the criticism, instead evades it or flat-out lies about it; (2) Scholar B, instead of just letting this be the end, keeps on about it. I guess that even this is not necessarily whistleblowing. Also the whistleblower has to be on the inside.

OK, so at this point it’s a negative-sum game. Scholar A can get the reputation of someone who does bad work and refuses to learn from mistakes. Scholar B can get the reputation of not being a team player. The more this goes on, the more that both scholars are hurt. Even if final consensus if close to Scholar B’s position, so that Scholar B has “won” the intellectual and social argument, it’s still likely to be a net loss in that Scholar B gets some reputation as a difficult person. Conversely, even if Scholar A “wins” in the sense of there being a consensus judgment that the criticism was misguided, there can still be a vague cloud that hangs over Scholar A’s head.

Part of this whole net-loss thing arises because most academics get no negative coverage at all. In politics, any success brings some negative coverage, and getting into a fight can be worth it, by helping you stand out from the crowd. In academia, you want to be known for positive contributions. At least, in science academia. Humanities and some of social science seem different: there, I guess it’s more common for scholars to make their names through controversy.

Anyway, here’s my point. A scientific dispute involving claims of unethical behavior can easily end up hurting both sides. Even if nobody’s trying to punish a whistleblower, there are negative social consequences, and in that sense I think Stam is correct.

“I was left with an overwhelming feeling that the World Values Survey is simply a vehicle for telling stories about values . . .”

Posted on March 19, 2024 9:22 AM by Andrew

Dale Lehman writes:

My guess is that you are familiar with the World Values Survey – I was not until I saw it described in the Economist this week (August 12, 2023). It has probably been used in the careers of many academics and is a monumental effort to collect survey data about values from across the world over a long period of time (the latest wave of the survey includes around 130,000 respondents from at least 90 countries). With the caveat that I have no experience with this data and have not read anything about its methodological development, I am struck by what seems like a shoddy ill-conceived research effort. To begin with a minor thing that appeared in the Economist story, I’ve attached a screenshot of part of what appears in the print magazine (the online version interactively builds up this view so the print version is more complete but provides less context). I have an issue with the visualization – some might call it a quibble, I’d call it a major problem – and I can’t tell if the blame lies with The Economist or the WVS, but it is what first alerted me to this data. The change in values over time is shown by the line segments ending in a circle marker for the latest survey wave. Why didn’t they use arrows rather than a circle at one end? I think this is inexcusable – arrows invoke pre-attentive visual processing whereas the line segment/circles force me to constantly reassess the picture to understand how things are changing. In other words, the visual presented doesn’t work – arrows would be immensely better. I don’t believe that is just sloppiness – I think it reveals something more fundamental, and that is what really concerns me about the WVS.

Moving on in the graph, I am immediately stuck by the dimensions of the graph. The methodology is described in detail on the WVS website (https://www.worldvaluessurvey.org/WVSContents.jsp) and I haven’t reviewed it in detail. But I have a number of issues about these measurements. Among these:

– The survival-self expression dimensions strikes me as unintuitive. Since the questions involved (such as the importance of religion vs the importance of environmental protection) are linked to wealth, and much of the WVS research concerns changes in values as wealth changes, why not measure wealth directly? My preference would be for the more unambiguous (relatively speaking) measures like GDP than these derived measures that seem vague to me.

– I have similar issues with the other dimension: traditional vs secular-rational. Neither of these seem intuitive to me and the underlying questions don’t improve things. There are questions about “national pride” and self descriptions of whether or not someone feels “very happy.” I find it very difficult to see how these map cleanly into the dimension they are being used for.

– Since these surveys are done across many countries and over time, I think the meaning of the words may change. For example, asking whether “people are trustworthy” requires the idea of “trust” to mean the same thing in different places and different periods of time. I see no evidence of this and can imagine that there might be differences in how people interpret phrases like that. In general, it seems to me that the wording of these survey questions was not carefully thought out or tested (though perhaps I just am not familiar with their development).

– I am disturbed by the use of single points to represent entire countries. Indeed, there is considerable discussion of how heterogeneous countries are, but the graphs use average measures to represent entire countries. As with many things, the average may be less interesting than the variability. This concern is accentuated by the aggregation of these countries into groups such as “Protestant Europe” and “Orthodox Europe.” I don’t find these groups particularly intuitive either.

– I’m unconvinced that the two dimensional picture of values is the best way to analyze values. Are there two dimensions the most important? Why two? Perhaps the changes over time simply reveal how valid the dimensions are rather than any intrinsic changes in values people hold.

There is more, but I’ll stress again that I have no background with this data. I can say it was difficult for me to even read the Economist article since almost every statement struck me as troublesome regarding what was being measured and how it relates to the fundamental methodology of this two dimensional view of values. I also can’t tell how much of my concern lies with the Economist article or the WVS itself. But I was left with an overwhelming feeling that the WVS is simply a vehicle for telling stories about values and how they differ between countries or groups of people and how these change over time. Those stories are naturally interesting, but I don’t see that the methodology and data support any particular story over any other. It seems like a perfect mechanism for academic career development, but little else.

My reply: I’m not sure! I’ve never worked with the World Values Survey myself. Maybe some readers can share their thoughts?

Inspiring story from a chemistry classroom

Posted on March 18, 2024 9:07 AM by Andrew

From former chemistry teacher HildaRuth Beaumont:

I was reminded of my days as a newly qualified teacher at a Leicestershire comprehensive school in the 1970s, when I was given a group of reluctant pupils with the instruction to ‘keep them occupied’. After a couple of false starts we agreed that they might enjoy making simple glass ornaments. I knew a little about glass blowing so I was able to teach them how to combine coloured and transparent glass to make animal figures and Christmas tree decorations. Then one of them made a small bottle complete with stopper. Her classmate said she should buy some perfume, pour some of it into the bottle and give it to her mum as a Mother’s Day gift. ‘We could actually make the perfume too,’ I said. With some dried lavender, rose petals, and orange and lemon peel, we applied solvent extraction and steam distillation to good effect and everyone was able to produce small bottles of perfume for their mothers.

What a wonderful story. We didn’t do anything like this in our high school chemistry classes! Chemistry 1 was taught by an idiot who couldn’t understand the book he was teaching out of. Chemistry 2 was taught with a single-minded goal of teaching us how to solve the problems on the Advanced Placement exam. We did well on the exam and learned essentially zero chemistry. On the plus side, this allowed me to place out of the chemistry requirement in college. On the minus side . . . maybe it would’ve been good for me to learn some chemistry in college. I don’t remember doing any labs in Chemistry 2 at all!

Preregistration is a floor, not a ceiling.

Posted on March 17, 2024 4:37 PM by Andrew

This comes up from time to time, for example someone sent me an email expressing a concern that preregistration stifles innovation: if Fleming had preregistered his study, he never would’ve noticed the penicillin mold, etc.

My response is that preregistration is a floor, not a ceiling. Preregistration is a list of things you plan to do, that’s all. Preregistration does not stop you from doing more. If Fleming had followed a pre-analysis protocol, that would’ve been fine: there would have been nothing stopping him from continuing to look at his bacterial cultures.

As I wrote in comments to my 2022 post, “What’s the difference between Derek Jeter and preregistration?” (which I just added to the lexicon), you don’t preregister “the” exact model specification; you preregister “an” exact model specification, and you’re always free to fit other models once you’ve seen the data.

It can be really valuable to preregister, to formulate hypotheses and simulate fake data before gathering any real data. To do this requires assumptions—it takes work!—and I think it’s work that’s well spent. And then, when the data arrive, do everything you’d planned to do, along with whatever else you want to do.

Planning ahead should not get in the way of creativity. It should enhance creativity because you can focus your data-analytic efforts on new ideas rather than having to first figure out what defensible default thing you’re supposed to do.

Aaaand, pixels are free, so here’s that 2002 post in full:
Continue reading →

“On the uses and abuses of regression models: a call for reform of statistical practice and teaching”: We’d appreciate your comments . . .

Posted on March 17, 2024 9:51 AM by Andrew

John Carlin writes:

I wanted to draw your attention to a paper that I’ve just published as a preprint: On the uses and abuses of regression models: a call for reform of statistical practice and teaching (pending publication I hope in a biostat journal). You and I have discussed how to teach regression on a few occasions over the years, but I think with the help of my brilliant colleague Margarita Moreno-Betancur I have finally figured out where the main problems lie – and why a radical rethink is needed. Here is the abstract:

When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. More broadly, statistics is widely understood to provide a body of techniques for “modelling data”, underpinned by what we describe as the “true model myth”, according to which the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective leads to a range of problems in the application of regression methods, including misguided “adjustment” for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline an alternative approach to the teaching and application of regression methods, which begins by focussing on clear definition of the substantive research question within one of three distinct types: descriptive, predictive, or causal. The simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of “input” variables, but their conceptualisation and usage should follow from the purpose at hand.

The paper is aimed at the biostat community, but I think the same issues apply very broadly at least across the non-physical sciences.

Interesting. I think this advice is roughly consistent with what Aki, Jennifer, and I say and do in our books Regression and Other Stories and Active Statistics.

More specifically, my take on teaching regression is similar to what Carlin and Moreno say, with the main difference being that I find that students have a lot of difficulty understanding plain old mathematical models. I spend a lot of time teaching the meaning of y = a + bx, how to graph it, etc. I feel that most regression textbooks focus too much on the error term and not enough on the deterministic part of the model. Also, I like what we say on the first page of Regression and Other Stories, about the three tasks of statistics being generalizing from sample to population, generalizing from control to treatment group, and generalizing from observed data to underlying constructs of interest. I think models are necessary for all three of these steps, so I do think that understanding models is important, and I’m not happy with minimalist treatments of regression that describe it as a way of estimating conditional expectations.

The first of these tasks is sampling inference, the second is causal inference, and the third refers to measurement. Statistics books (including my own) spend lots of time on sampling and causal inference, not so much on measurement. But measurement is important! For an example, see here.

If any of you have reactions to Carlin and Moreno’s paper, or if you have reactions to my reactions, please share them in comments, as I’m sure they’d appreciate it.

How often is there a political candidate such as Vivek Ramaswamy who is so much stronger in online polls than telephone polls?

Posted on March 16, 2024 9:30 AM by Andrew

Palko points to this news article, “The mystery of Vivek Ramaswamy’s rapid rise in the polls,” which states:

Ramaswamy’s strength comes almost entirely from polls conducted over the internet, according to a POLITICO analysis. In internet surveys over the past month — the vast majority of which are conducted among panels of people who sign up ahead of time to complete polls, often for financial incentives — Ramaswamy earns an average of 7.8 percent, a clear third behind Trump and DeSantis.

In polls conducted mostly or partially over the telephone, in which people are contacted randomly, not only does Ramaswamy lag his average score — he’s way back in seventh place, at just 2.6 percent.

There’s no singular, obvious explanation for the disparity, but there are some leading theories for it, namely the demographic characteristics and internet literacy of Ramaswamy’s supporters, along with the complications of an overly white audience trying to pronounce the name of a son of immigrants from India over the phone.”

And then, in order for a respondent to choose Ramaswamy in a phone poll, he or she will have to repeat the name back to the interviewer. And the national Republican electorate is definitely older and whiter than the country as a whole: In a recent New York Times/Siena College poll, more than 80 percent of likely GOP primary voters were white, and 38 percent were 65 or older.

‘When your candidate is named Vivek Ramaswamy,’ said one Republican pollster, granted anonymity to discuss the polling dynamics candidate, ‘that’s like DEFCON 1 for confusion and mispronunciation.’

Palko writes:

Keeping in mind that the “surge” was never big (maxed out at 10% and has been flat since), we’re talking about fairly small numbers in absolute terms, here are some questions:

1. How much do we normally expect phone and online to agree?

2. Ramaswamy generally scores around 3 times higher online than with phone. Have we seen that magnitude before?

3. How about a difficult name bias. Have we seen that before? How about Buttigieg, for instance? Did a foreign-sounding name hurt Obama in early polls?

4. Is the difference in demographics great enough to explain the difference? Aren’t things like gender and age normally reweighted?

5. Are there other explanations we should consider?

I don’t have any answers here, just one thought which is that it’s early in the campaign (I guess I should call it the pre-campaign, given that the primary elections haven’t started yet), and so perhaps journalists are reasoning that, even if this candidate is not very popular among voters, his active internet presence makes him a reasonable dark-horse candidate looking forward. An elite taste now but could perhaps spread to the non-political-junkies in the future? Paradoxically, the fact that Ramaswamy has this strong online support despite his extreme political stances could be taken as a potential sign of strength? I don’t know.

Conformal prediction and people

Posted on March 15, 2024 12:57 PM by Jessica Hullman

This is Jessica. A couple weeks I wrote a post in response to Ben Recht’s critique of conformal prediction for quantifying uncertainty in a prediction. Compared to Ben, I am more open-minded about conformal prediction and associated generalizations like conformal risk control. Quantified uncertainty is inherently incomplete as an expression of the true limits of our knowledge, but I still often find value in trying to quantify it over stopping at a point estimate.

If expressions of uncertainty are generally wrong in some ways but still sometimes useful, then we should be interested in how people interact with different approaches to quantifying uncertainty. So I’m interested in seeing how people use conformal prediction sets relative to alternatives. This isn’t to say that I think conformal approaches can’t be useful without being human-facing (which is the direction of some recent work on conformal decision theory). I just don’t think I would have spent the last ten years thinking about how people interact and make decisions with data and models if I didn’t believe that they need to be involved in many decision processes.

So now I want to discuss what we know from the handful of controlled studies that have looked at human use of prediction sets, starting with the one I’m most familiar with since it’s from my lab.

In Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling, we study people making decisions with the assistance of a predictive model. Specifically, they label images with access to predictions from a pre-trained computer vision model. In keeping with the theme that real world conditions may deviate from expectations, we consider two scenarios: one where the model makes highly accurate predictions because the new images are from the same distribution as those that the model is trained on, and one where the new images are out of distribution.

We compared their accuracy and the distance between their responses and the true label (in the Wordnet hierarchy, which conveniently maps to ImageNet) across four display conditions. One was no assistance at all, so we could benchmark unaided human accuracy against model accuracy for our setting. People were generally worse than the model in this setting, though the human with AI assistance was able to do better than the model alone in a few cases.

The other three displays were variations on model assistance, including the model’s top prediction with the softmax probability, the top 10 model predictions with softmax probabilities, and a prediction set generated using split conformal prediction with 95% coverage.

We calibrated the prediction sets we presented offline, not dynamically. Because the human is making decisions conditional on the model predictions, we should expect the distribution to change. But often we aren’t going to be able to calibrate adaptively because we don’t immediately observe the ground truth. And even if we do, at any particular point in time we could still be said to hover on the boundary of having useful prior information and steering things off course. So when we introduce a new uncertainty quantification to any human decision setting, we should be concerned with how it works when the setting is as expected and when it’s not, i.e., the guarantees may be misleading.

Our study partially gets at this. Ideally we would have tested some cases where the stated coverage guarantee for the prediction sets was false. But for the out-of-distribution images we generated, we would have had to do a lot of cherry-picking of stimuli to break the conformal coverage guarantee as much as the top-1 coverage broke. The coverage degraded a little but stayed pretty high over the entire set of out-of-distribution instances for the types of perturbations we focused on (>80%, compared to 70% for top 1- and 43% for top 1). For the set of stimuli we actually tested, the coverage for all three was a bit higher, with top 1 coverage getting the biggest bump (70% compared to 83% top 10, 95% conformal). Below are some examples of the images people were classifying (where easy and hard is based on the cross-entropy loss given the model’s predicted probabilities, and smaller and larger refers to the size of the prediction sets).

We find that prediction sets don’t offer much value over top-1 or top-10 displays when the test instances are iid, and they can reduce accuracy on average for some types of instances. However, when the test instances are out of distribution, accuracy is slightly higher with access to prediction sets than with either top-k. This was the case even though the prediction sets for the OOD instances get very large (the average set size for “easy” OOD instances, as defined by the distribution of softmax values, was ~17, for “hard” OOD instances it was ~61, with people sometimes seeing sets with over 100 items). For the in-distribution cases, average set size was about 11 for the easy instances, and 30 for the hard ones.

Based on the differences in coverage across the conditions we studied, our results are more likely to be informative for situations where conformal prediction is used because we think it’s going to degrade more gracefully under unexpected shifts. I’m not sure it’s reasonable to assume we’d have a good hunch about that in practice though.

In designing this experiment in discussion with my co-authors, and thinking more about the value of conformal prediction to model-assisted human decisions, I’ve been thinking about when a “bad” (in the sense of coming with a misleading guarantee) interval might still be better than no uncertainty quantification. I was recently reading Paul Meehl’s clinical vs statistical prediction, where he contrasts clinical judgments doctors make based on intuitive reasoning to statistical judgments informed by randomized controlled experiments. He references a distinction between the “context of justification” for some internal sense of probability that leads to a decision like a diagnosis, and the “context of verification” where we collect the data we need to verify the quality of a prediction.

The clinician may be led, as in the present instance, to a guess which turns out to be correct because his brain is capable of that special “noticing the unusual” and “isolating the pattern” which is at present not characteristic of the traditional statistical techniques. Once he has been so led to a formulable sort of guess, we can check up on him actuarially.

Thinking about the ways prediction intervals can affect decisions makes me think that whenever we’re dealing with humans, there’s potentially going to be a difference between what an uncertainty expression says and can guarantee and the value of that expression for the decision-maker. Quantifications with bad guarantees can still be useful if they change the context of discovery in ways that promote broader thinking or taking the idea of uncertainty seriously. This is what I meant when in my last post I said “the meaning of an uncertainty quantification depends on its use.” But precisely articulating how they do this is hard. It’s much easier to identify ways calibration can break.

There a few other studies that look at human use of conformal prediction sets, but to avoid making this post even longer, I’ll summarize them in an upcoming post.

P.S. There have been a few other interesting posts on uncertainty quantification in the CS blogosphere recently, including David Stutz’s response to Ben’s remarks about conformal prediction, and on designing uncertainty quantification for decision making from Aaron Roth.

“Hot hand”: The controversy that shouldn’t be. And thinking more about what makes something into a controversy:

Posted on March 15, 2024 9:31 AM by Andrew

I was involved in a recent email discussion, leading to this summary:

There is no theoretical or empirical reason for the hot hand to be controversial. The only good reason for there being a controversy is that the mistaken paper by Gilovich et al. appeared first. At this point we should give Gilovich et al. credit for bringing up the hot hand as a subject of study and accept that they were wrong in their theory, empirics, and conclusions, and we can all move on. There is no shame in this for Gilovich et al. We all make mistakes, and what’s important is not the personalities but the research that leads to understanding, often through tortuous routes.

“No theoretical reason”: see discussion here, for example.

“No empirical reason”: see here and lots more in the recent literature.

“The only good reason . . . appeared first”: Beware the research incumbency rule.

More generally, what makes something a controversy? I’m not quite sure, but I think the news media play a big part. We talked about this recently in the context of the always-popular UFOs-as-space-aliens theory, which used to be considered a joke in polite company but now seems to have reached the level of controversy.

I don’t have anything systematic to say about all this right now, but the general topic seems very worthy of study.

“Here’s the Unsealed Report Showing How Harvard Concluded That a Dishonesty Expert Committed Misconduct”

Posted on March 14, 2024 7:34 PM by Andrew

Stephanie Lee has the story:

Harvard Business School’s investigative report into the behavioral scientist Francesca Gino was made public this week, revealing extensive details about how the institution came to conclude that the professor committed research misconduct in a series of papers.

The nearly 1,300-page document was unsealed after a Tuesday ruling from a Massachusetts judge, the latest development in a $25 million lawsuit that Gino filed last year against Harvard University, the dean of the Harvard Business School, and three business-school professors who first notified Harvard of red flags in four of her papers. All four have been retracted. . . .

According to the report, dated March 7, 2023, one of Gino’s main defenses to the committee was that the perpetrator could have been someone else — someone who had access to her computer, online data-storage account, and/or data files.

Gino named a professor as the most likely suspect. The person’s name was redacted in the released report, but she is identified as a female professor who was a co-author of Gino’s on a 2012 now-retracted paper about inducing honest behavior by prompting people to sign a form at the top rather than at the bottom. . . . Allegedly, she was “angry” at Gino for “not sufficiently defending” one of their collaborators “against perceived attacks by another co-author” concerning an experiment in the paper.

But the investigation committee did not see a “plausible motive” for the other professor to have committed misconduct by falsifying Gino’s data. “Gino presented no evidence of any data falsification actions by actors with malicious intentions,” the committee wrote. . . .

Gino’s other main defense, according to the report: Honest errors may have occurred when her research assistants were coding, checking, or cleaning the data. . . .

Again, however, the committee wrote that “she does not provide any evidence of [research assistant] error that we find persuasive in explaining the major anomalies and discrepancies.”

The full report is at the link.

Some background is here, also here, and some reanalyses of the data are linked here.

Now we just have to get to the bottom of the story about the shredder and the 80-pound rock and we’ll pretty much have settled all the open questions in this field.

We’ve already determined that the “burly coolie” story and the “smallish town” story never happened.

It’s good we have dishonesty experts. There’s a lot of dishonesty out there.

Abraham Lincoln and confidence intervals

Posted on March 14, 2024 9:04 AM by Andrew

This one from 2017 is good; I want to share it with all of you again:

Our recent discussion with mathematician Russ Lyons on confidence intervals reminded me of a famous logic paradox, in which equality is not as simple as it seems.

The classic example goes as follows: Abraham Lincoln is the 16th president of the United States, but this does not mean that one can substitute the two expressions “Abraham Lincoln” and “the 16th president of the United States” at will. For example, consider the statement, “If things had gone a bit differently in 1860, Stephen Douglas could have become the 16th president of the United States.” This becomes flat-out false if we do the substitution: “If things had gone a bit differently in 1860, Stephen Douglas could have become Abraham Lincoln.”

Now to confidence intervals. I agree with Rink Hoekstra, Richard Morey, Jeff Rouder, and Eric-Jan Wagenmakers that the following sort of statement, “We can be 95% confident that the true mean lies between 0.1 and 0.4,” is not in general a correct way to describe a classical confidence interval. Classical confidence intervals represent statements that are correct under repeated sampling based on some model; thus the correct statement (as we see it) is something like, “Under repeated sampling, the true mean will be inside the confidence interval 95% of the time” or even “Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” Russ Lyons, however, felt the statement “We can be 95% confident that the true mean lies between 0.1 and 0.4,” was just fine. In his view, “this is the very meaning of “confidence.'”

This is where Abraham Lincoln comes in. We can all agree on the following summary:

A. Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.

And we could even perhaps feel that the phrase “confidence interval” implies “averaging over repeated samples,” and thus the following statement is reasonable:

B. “We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.”

Now consider the other statement that caused so much trouble:

C. “We can be 95% confident that the true mean lies between 0.1 and 0.4.”

In a problem where the confidence interval is [0.1, 0.4], “the lower and upper endpoints of the confidence interval” is just “0.1 and 0.4.” So B and C are the same, no? No. Abraham Lincoln, meet the 16th president of the United States.

In statistical terms, once you supply numbers on the interval, you’re conditioning on it. You’re no longer implicitly averaging over repeated samples. Just as, once you supply a name to the president, you’re no longer implicitly averaging over possible elections.

So here’s what happened. We can all agree on statement A. Statement B is a briefer version of A, eliminating the explicit mention of replications because they are implicit in the reference to a confidence interval. Statement C does a seemingly innocuous switch but, as a result, implies conditioning on the interval, thus resulting in a much stronger statement that is not necessarily true (that is, in mathematical terms, is not in general true).

None of this is an argument over statistical practice. One might feel that classical confidence statements are a worthy goal for statistical procedures, or maybe not. But, like it or not, confidence statements are all about repeated sampling and are not in general true about any particular interval that you might see.

P.S. More here.

You probably don’t have a general algorithm for an MLE of Gaussian mixtures

Posted on March 13, 2024 3:00 PM by Bob Carpenter

Those of you who are familiar with Garey and Johnson’s 1979 classic, Computers and Intractability: a guide to the theory of NP-completeness, may notice I’m simply “porting” their introduction, including the dialogue, to the statistics world.

Imagine Andrew had tasked me and Matt Hoffman with fitting simple standard (aka isotropic, aka spherical) Gaussian mixtures rather than hierarchical models. Let’s say that Andrew didn’t like that K-means got a different answer every time he ran it, K-means++ wasn’t much better, and even using soft-clustering (i.e., fitting the stat model with EM) didn’t let him replicate simulated data. Would we have something like Stan for mixtures. Sadly, no. Matt and I may have tried and failed. We wouldn’t want to go back to Andrew and say,

“We can’t find an efficient algorithm. I guess we’re just too dumb.”

We’re computer scientists and we know about proving hardness. We’d like to say,

“We can’t find an efficient algorithm, because no such algorithm is possible.”

But that would’ve been beyond Matt’s and my grasp, because, in this particular case, it would require solving the biggest open problem in theoretical computer science. Instead, it’s almost certain we would have come back and said,

“We can’t find an efficient algorithm, but neither can all these famous people.”

That seems weak. Why would we say that? Because we could’ve proven that the problem is NP-hard. A problem is in the class P if it can be solved in polynomial time with a deterministic algorithm. A problem is in the class NP when there is a non-deterministic (i.e., infinitely parallel) algorithm to solve it in polynomial time. It’s NP-hard if it’s just as hard as any other NP algorithm (formally specified through reductions, a powerful CS proof technique that’s the basis of Gödel’s incompleteness theorem). An NP-hard algorithm often has a non-deterministic algorithm to solve it makes a complete set of (exponentially many) guesses in parallel and then spends polynomial time on each one verifying whether or not it is a solution. An algorithm is NP-complete if it is NP-hard and a member of NP. Some well known NP-complete problems are bin packing, satisfiability in propositional logic, and the traveling salesman problem—there’s a big list of NP-complete problems.

Nobody has found a tractable algorithm to solve an NP-hard problem. When we (computer scientists) say “tractable,” we mean solvable in polynomial time with a deterministic algorithm (i.e., the problem is in P). The only known algorithms for NP-hard problems are exponential. Researchers have been working for the last 50+ years trying to prove that the class of NP problems is disjoint from the class of P problems.

In other words, there’s a Turing Award waiting for you if you can actually turn response (3) into response (2).

In the case of mixtures of standard (spherical, isotropic) Gaussians there’s a short JMLR paper with a proof that maximum likelihood estimation is NP-hard.

Tosh and Dasgupta. 2016. Maximum Likelihood Estimation for Mixtures of Spherical Gaussians is NP-hard. JMLR

And yes, that’s the same Tosh as who was the first author of the “piranha” paper.

Ising models that are not restricted to be planar are also NP-hard.

Lucas. 2014. Ising formulations of many NP problems. Frontiers in Physics.

What both these problems have in common is that they are combinatorial and require inference over sets. I think (though am really not sure) that one of the appeals of quantum computing is potentially solving NP-hard problems.

P.S. How this story really would’ve went is that we would’ve told Andrew that some simple distributions over NP-hard problem instances lead to expected polynomial time algorithms and we’d be knee-deep in the kinds of heuristics used to pack container ships efficiently.

Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1.

Posted on March 13, 2024 9:08 AM by Andrew

Adam Zelizer writes:

I saw your post about the underpowered COVID survey experiment on the blog and wondered if you’ve seen this paper, “Counter-stereotypical Messaging and Partisan Cues: Moving the Needle on Vaccines in a Polarized U.S.” It is written by a strong team of economists and political scientists and finds large positive effects of Trump pro-vaccine messaging on vaccine uptake.

They find large positive effects of the messaging (administered through Youtube ads) on the number of vaccines administered at the county level—over 100 new vaccinations in treated counties—but only after changing their specification from the prespecified one in the PAP. The p-value from the main modified specification is only 0.097, from a one-tailed test, and the effect size from the modified specification is 10 times larger than what they get from the pre-specified model. The prespecified model finds that showing the Trump advertisement increased the number of vaccines administered in the average treated county by 10; the specification in the paper, and reported in the abstract, estimates 103 more vaccines. So moving from the specification in the PAP to the one in the paper doesn’t just improve precision, but it dramatically increases the estimated treatment effect. A good example of suppression effects.

They explain their logic for using the modified specification, but it smells like the garden of forking paths.

Here’s a snippet from the article:

I don’t have much to say about the forking paths except to give my usual advice to fit all reasonable specifications and use a hierarchical model, or at the very least do a multiverse analysis. No reason to think that the effect of this treatment should be zero, and if you really care about effect size you want to avoid obvious sources of bias such as model selection.

The above bit about one-tailed tests reflects a common misunderstanding in social science. As I’ll keep saying until my lips bleed, effects are never zero. They’re large in some settings, small in others, sometimes positive, sometimes negative. From the perspective of the researchers, the idea of the hypothesis test is to give convincing evidence that the treatment truly has a positive average effect. That’s fine, and it’s addressed directly through estimation: the uncertainty interval gives you a sense of what the data can tell you here.

When they say they’re doing a one-tailed test and they’re cool with a p-value of 0.1 (that would be 0.2 when following the standard approach) because they have “low signal-to-noise ratios” . . . that’s just wack. Low signal-to-noise ratio implies high uncertainty in your conclusions. High uncertainty is fine! You can still recommend this policy be done in the midst of this uncertainty. After all, policymakers have to do something. To me, this one-sided testing and p-value thresholding thing just seems to be missing the point, in that it’s trying to squeeze out an expression of near-certainty from data that don’t admit such an interpretation.

P.S. I do not write this sort of post out of any sort of animosity toward the authors or toward their topic of research. I write about these methods issues because I care. Policy is important. I don’t think it is good for policy for researchers to use statistical methods that lead to overconfidence and inappropriate impressions of certainty or near-certainty. The goal of a statistical analysis should not be to attain statistical significance or to otherwise reach some sort of success point. It should be to learn what we can from our data and model, and to also get a sense of what we don’t know..

Fully funded doctoral student positions in Finland

Posted on March 13, 2024 6:30 AM by Aki Vehtari

There is a new government funded Finnish Doctoral Program in AI. Research topics include Bayesian inference, modeling and workflows as part of fundamental AI. There is a big joint call, where you can choose the supervisor you want to work with. I (Aki) am also one of the supervisors. Come work with me or share the news! The first call deadline is April 2, and the second call deadline in fall 2024. See how to apply at https://fcai.fi/doctoral-program, and more about my research at my web page.

Zotero now features retraction notices

Posted on March 12, 2024 9:47 AM by Andrew

David Singerman writes:

Like a lot of other humanities and social sciences people I use Zotero to keep track of citations, create bibliographies, and even take & store notes. I also am not alone in using it in teaching, making it a required tool for undergraduates in my classes so they learn to think about organizing their information early on. And it has sharing features too, so classes can create group bibliographies that they can keep using after the semester ends.

Anyway my desktop client for Zotero updated itself today and when it relaunched I had a big red banner informing me that an article in my library had been retracted! I didn’t recognize it at first, but eventually realized that was because it was an article one of my students had added to their group library for a project.

The developers did a good job of making the alert unmissable (i.e. not like a corrections notice in a journal), the full item page contains lots of information and helpful links about the retraction, and there’s a big red X next to the listing in my library. See attached screenshots.

The way they implemented it will also help the teaching component, since a student will get this alert too.

Singerman adds this P.S.:

This has reminded me that some time ago you posted something about David Byrne, and whatever you said, it made me think of David Byrne’s wonderful appearance on the Colbert Report.

What was amazing to me when I saw it was that it’s kind of like a battle between Byrne’s inherent weirdness and sincerity, and Colbert’s satirical right-wing bloviator character. Usually Colbert’s character was strong enough to defeat all comers, but . . . decide for yourself.

Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment)

Posted on March 11, 2024 9:43 AM by Andrew

Tom Vladeck writes:

I thought you may be interested in some internal research my company did using a conjoint experiment, with analysis using Stan! The upshot is that we found that vaccine hesitant people would require a large payment to take the vaccine, and that there was a substantial difference between the prices required for J&J and Moderna & Pfizer (evidence that the pause was very damaging). You can see the model code here.

My reply: Cool! I recommend you remove the blank lines from your Stan code as that will make your program easier to read.

Vladeck responded:

I prefer a lot of vertical white space. But good to know that I’m likely in the minority there.

For me, it’s all about the real estate. White space can help code be more readable but it should be used sparingly. What I’d really like is a code editor that does half white spaces.