How to win the Sloan Sports hackathon

Stan developer Daniel Lee writes:

I walked in knowing a few things about the work needed to win hackathons:

– Define a problem.
If you can clearly define a problem, you’ll end up in the top third of the competition. It has to be clear why the problem matters and you have to communicate this effectively.

– Specify a solution.
If you’re able to specify a solution to the problem, you’ll end up in the top 10%. It has to be clear to the judges that this solution solves the problem.

– Implement the solution.
If you’ve gotten this far and you’re now able to actually implement the solution that you’ve outlined, you’ll end up in the top 3. It’s hard to get to this point. We’re talking about understanding the topic well enough to define a problem of interest, having explored enough of the solution space to specify a solution, then applying skills through focused effort to build the solution in a short amount of time. Do that and I’m sure you’ll be a finalist.

– Build interactivity.
If the judges can do something with the solution, specifically evaluate “what if” scenarios, then you’ve gone above and beyond the scope of a hackathon. That should get you a win.

Winning a hackathon takes work and focus. It’s mentally and physically draining to compete in a hackathon. You have to pace yourself well, adjust to different challenges as they come, and have enough time and energy at the end to switch context to present the work.

One additional note: the solution only needs to be a proof of concept and pass a smell test. It’s important to know when to move on.

Positive, negative, or neutral?

We’ve talked in the past about advice being positive, negative, or neutral.

Given that Daniel is giving advice on how to win a competition that has only one winner, you might think I’d call it zero-sum. Actually, though, I’d call it positive-sum, in that the goal of a hackathon is not just to pick a winner, it’s also to get people involved in the field of study. It’s good for a hackathon if its entries are good.

The story

Daniel writes:

I [Daniel] participated in the SSAC22 hackathon. I showed up, found a teammate [Fabrice Mulumba], and won. Here’s a writeup about our project, our strategy for winning, and how we did it.

The Data

All hackathon participants were provided data from the 2020 Stanley Cup Finals. This included:

– Tracking data for 40 players, the puck, and the referees. . . . x, y, z positions with estimates of velocity recorded at ~100 Hz. The data are from chips attached to jerseys and in the puck.

– Play-by-play data. Two separate streams of play-by-play data were included: hand-generated and system-generated. . . .

– Other meta data. Player information, rink information, game time, etc.

Data was provided for each of the 6 games in the series. For a sense of scale: one game has about 1.5M rows of tracking data with 1.5 GB of JSON files across the different types of data.

The Hackathon

There were two divisions for the Hackathon: Student and Open. The competition itself had very little structure. . . . Each team would present to the judges starting at 4 pm and the top teams would present in the finals. . . .

Daniel tells how it came together:

Fabrice and I [Daniel] made a pretty good team. But it almost didn’t happen.

Both Fabrice and I had competed in hackathons before. We first met around 8:30 am, half an hour before the hackathon started. As Fabrice was setting up, I saw that he had on an AfroTech sweatshirt and a Major League Hacking sticker on his laptop. I said hi, asked if he was competing alone, and if he was looking for a teammate. He told me he wanted to compete alone. I was hoping to find a teammate, but had been preparing to compete alone too. While it’s hard to do all the things above alone, it’s actually harder if you have the wrong teammate. We went our separate ways. A few minutes later, we decided to team up.

Something about the team felt right from the start. Maybe I was more comfortable teaming up with one of the few other POC in the room. Maybe there was a familiar cadence and vibe from having parents that immigrated to the US. Maybe it was knowing that the other had been through an intense working session in the past and was voluntarily going through it again. Whatever it was, it worked.

In the few days prior, I had spent a couple hours trying to gain some knowledge about hockey from friends that know the sport. The night before, I found a couple of people that worked for the LA Kings and asked questions about what they thought about and why. I came in thinking we should look at something related to goalie position. Fabrice came in wanting to work on a web app and focus on identifying a process within the game. These ideas melded together and formed the winning project.

For the most part, we worked on separate parts of the problem. We were able to split the work and trust that the other would get their part done. . . .

The Winning Project: Sloan Goalie Card

We focused on a simple question. Does goaltender depth matter?

Having access to x, y, z position of every player meant that we could analyze where the goalie was at the time when shots were taken. Speaking to some hockey people, we found out that this data wasn’t publicly available, so this would be one of the first attempts at this type of analysis.

In the allotted time, we pulled off a quick analysis of goalie depth and built the Sloan Goalie Card web app.

I don’t know anything about hockey so I can’t comment on the actual project. What I like is Daniel’s general advice.

P.S. I googled *how to win a hackathon*. It’s a popular topic, including posts going back to 2014. Some of the advice seems pretty ridiculous; for example one of the links promises “Five Easy Steps to Developer Victory”—which makes me wonder what would happen if two competitors tried this advice for the same hackathon. They couldn’t both win, right?

Using large language models to generate trolling at scale

This is Jessica. Large language models have been getting a lot of negative attention–for being brittle and in need of human curation, for generating socially undesirable outputs–sometimes even on this blog. So I figured I’d highlight a recent application that cleverly exploits their ability to generate toxic word vomit: using them to speculate about possible implications of new designs for social computing platforms.  

In a new paper on using LLMs to generate “social simulacra,” Joon Sung Park et al. write: 

Social computing prototypes probe the social behaviors that may arise in an envisioned system design. This prototyping practice is currently limited to recruiting small groups of people. Unfortunately, many challenges do not arise until a system is populated at a larger scale. Can a designer understand how a social system might behave when populated, and make adjustments to the design before the system falls prey to such challenges? We introduce social simulacra, a prototyping technique that generates a breadth of realistic social interactions that may emerge when a social computing system is populated. Social simulacra take as input the designer’s description of a community’s design — goal, rules, and member personas — and produce as output an instance of that design with simulated behavior, including posts, replies, and anti-social behaviors. We demonstrate that social simulacra shift the behaviors that they generate appropriately in response to design changes, and that they enable exploration of “what if?” scenarios where community members or moderators intervene. To power social simulacra, we contribute techniques for prompting a large language model to generate thousands of distinct community members and their social interactions with each other; these techniques are enabled by the observation that large language models’ training data already includes a wide variety of positive and negative behavior on social media platforms. In evaluations, we show that participants are often unable to distinguish social simulacra from actual community behavior and that social computing designers successfully refine their social computing designs when using social simulacra.

The idea is a clever solution to sampling issues that currently limit researchers’ ability to foresee how a new social platform might be used: it’s difficult to get the number of test users for a prototype that are needed for certain behaviors to manifest, and it’s hard to match the sample makeup to the target deployment population. Using the social simulacra approach, the designer still has to supply some notion of target users, in the form of “seed personas” passed as input to the LLM along with a short description of the system goal (“social commentary and politics”) and any hard or soft rules (“be kind,” “no posting of advertisements”), but then the prototype platform is populated with text associated with a large number of hypothetical users.       

I like how the alignment between the type of data LLMs ingest and the context for using the approach makes the often criticized potential for LLMs to push more racist, sexist slurs into the world a feature rather than a bug. If you want to see how your plan for a new platform could be completely derailed by trolls, who better to ask than an LLM?

It also provides a way to test interventions at scale: “social simulacra can surface a larger space of possible outcomes and enable the designer to explore how design changes might shift them. Likewise, social simulacra allow a designer to explore ‘what if?’ scenarios where they probe how a thread might react if they engaged in a moderation action or replied to one of the comments.” The researcher can intervene at the conversation level to see what happens, and can get a sense of uncertainty in the generating process by using a “multiverse” function to generate many instantiations of an outcome.

The idea that we can treat LLM generated social simulacra as predictive of human behavior makes it philosophically intriguing. When I first learned about this project late last year when visiting Stanford, my first question was, What do the ethicists think about doing that? The authors make clear that they aren’t necessarily claiming faithfulness to real world behavior: 

Social simulacra do not aim to predict what is absolutely going to happen in the future – like many early prototyping techniques, perfect resemblance to reality is not the goal. […] However, social simulacra offer designers a tool for testing their intuitions about the breadth of possible social behaviors that may populate their community, from model citizen behaviors to various edge cases that ultimately become the cracks that collapse a community. In so doing, social simulacra, such as those that we have explored here, expand the role of experience prototypes for social computing systems and the set of tools available for designing social interactions, which inspired the original conceptualization of wicked problems [64].

By sidestepping the question, they’re asking for something of a leap of faith, but they provide a few evaluations that suggest the generated output is useful. They show 50 people pairs of subreddit conversations, one real, one LLM generated, and find that on average people can identify the real example only slightly better than chance. They also find that potential designers of social computing systems find them helpful for iterating on designs and imagining possible futures. While checking whether humans can discriminate between real and generated social behavior is obviously relevant, it would be nice to see more attempts at characterizing statistical properties of the generated text relative to real world behavior. For example, I’d be curious to see, both for thread level and community level dynamics, where the observed behavior on real social systems falls in the larger space of generated possibilities for the same input. Is it more or less extreme in any ways? Naturally there will be many degrees of freedom in defining such comparisons, but maybe one could observe biases in generated text that a human might miss when reviewing a small set of examples. The paper does summarize some weaknesses observed in the generated text, and mentions using a qualitative analysis to compare real and generated subreddits, but more of this kind of comparison would be welcome. 

I haven’t tried it, but you can play with generating simulacra here:


This article by Albert Burneko doesn’t directly cite the developmental psychology literature on essentialism—indeed, it doesn’t cite any literature at all—but it’s consistent with a modern understanding of children’s thought. As Burneko says it:

Have you ever met a pre-K child? That is literally all they talk and think about. Sorting things into categories is their whole deal.

I kinda wish they’d stick to sports, though. Maybe some zillionaire could sue them out of existence?

Bets as forecasts, bets as probability assessment, difficulty of using bets in this way

John Williams writes:

Bets as forecasts come up on your blog from time to time, so I thought you might be interested in this post from RealClimate, which is the place to go for informed commentary on climate science.

The post, by Gavin Schmidt, is entitled, “Don’t climate bet against the house,” and tells the story of various public bets in the past few decades regarding climate outcomes.

The examples are interesting in their own right and also as a reminder that betting is complicated. In theory, betting has close links to uncertainty, and you should be able to go back and forth between them:

1. From one direction, if you think the consensus is wrong, you can bet against it and make money (in expectation). You should be able to transform your probability statements into bets.

2. From the other direction, if bets are out there, you can use these to assess people’s uncertainties, and from there you can make probabilistic predictions.

In real life, though, both the above steps can have problems, for several reasons. First is the vig (in a betting market) or the uncertainty that you’ll be paid off (in an unregulated setting). Second is that you need to find someone to make that bet with you. Third, and relatedly, that “someone” who will bet with you might have extra information you don’t have, indeed even their willingness to bet at given odds provides some information, in a Newtonian action-and-reaction sort of way. Fourth, we hear about some of the bets and we don’t hear about others. Fifth, people can be in it to make a point or for laffs or thrills or whatever, not just for the money, enough so that, when combined with the earlier items on this list, there won’t be enough “smart money” to take up the slack.
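To make the first of these obstacles concrete, here is a minimal sketch of going back and forth between odds and probabilities when there is a vig. The market is hypothetical (both sides offered at decimal odds of 1.91), the function names are made up for illustration, and rescaling the implied probabilities so they sum to 1 is only one simple convention among several.

```python
def implied_probabilities(decimal_odds):
    """Raw implied probability of each outcome: 1 / decimal odds."""
    return [1.0 / o for o in decimal_odds]


def remove_vig(decimal_odds):
    """Rescale implied probabilities to sum to 1.

    Returns the rescaled probabilities and the overround (how far the raw
    implied probabilities exceed 1, which is the bookmaker's margin).
    """
    raw = implied_probabilities(decimal_odds)
    total = sum(raw)
    return [r / total for r in raw], total - 1.0


def expected_profit(p_win, decimal_odds, stake=1.0):
    """Expected profit of a bet, given your own probability of winning."""
    return stake * (p_win * (decimal_odds - 1.0) - (1.0 - p_win))


# Hypothetical two-sided market: both outcomes offered at decimal odds 1.91.
fair, overround = remove_vig([1.91, 1.91])
print("rescaled probabilities:", fair)
print("overround:", overround)

# If you believe the event really is 50/50, the vig makes the bet a loser.
print("expected profit per unit staked:", expected_profit(0.5, 1.91))
```

Even in this toy version, a bettor needs a probability noticeably different from the market's before a bet has positive expected value, so a posted price pins down people's beliefs only up to the width of the vig.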

This is not to say that betting is a useless approach to information aggregation; I’m just saying that betting, like other social institutions, works under certain conditions and not in absolute generality.

And this reminds me of another story.

Economist Bryan Caplan reports that his track record on bets is 23 for 23. That’s amazing! How is it possible? Here’s Caplan’s list, which starts in 2007 and continues through 2021, with some of the bets still unresolved.

Caplan’s bets are an interesting mix. The first one is a bet where he offered 1-to-100 odds so it’s no big surprise that he won, but most of them are at even odds. A couple of them he got lucky on (for example, he bet in 2008 that no large country would leave the European Union before January 1, 2020, so he just survived by one month on that one), but, hey, it’s ok to be lucky, and in any case even if he only had won 21 out of 23 bets, that would still be impressive.

It seems to me that Caplan’s trick here is to show good judgment on what pitches to swing at. People come at him with some strong, unrealistic opinions, and he’s been good at crystallizing these into bets. In poker terms, he waits till he has the nuts, or nearly so. 23 out of 23 . . . that’s a great record.
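A quick back-of-the-envelope calculation shows why a perfect record points to bet selection rather than coin-flip luck. The sketch below assumes the bets are independent and share a single win probability, which is surely not true of the real list; it is only meant to give a sense of scale.

```python
def prob_perfect_record(p_win, n_bets):
    """Probability of winning all n_bets, assuming independent bets
    that each have the same win probability p_win."""
    return p_win ** n_bets


def required_win_probability(target, n_bets):
    """Per-bet win probability needed for a perfect record to occur
    with probability `target`, under the same assumptions."""
    return target ** (1.0 / n_bets)


n = 23
for p in (0.5, 0.8, 0.9, 0.97):
    print(f"per-bet win probability {p}: P(23 for 23) = {prob_perfect_record(p, n):.3g}")

print("per-bet probability needed for a 50% chance of 23 for 23:",
      round(required_win_probability(0.5, n), 3))
```

Under these assumptions, a bettor who is right half the time goes 23 for 23 about once in 8.4 million tries (2^23 = 8,388,608), so the per-bet win probability has to be very high for the record to be unsurprising.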

Bayesian inference continues to completely solve the multiple comparisons problem

Erik van Zwet writes:

I saw you re-posted your Bayes-solves-multiple-testing demo. Thanks for linking to my paper in the PPS! I think it would help people’s understanding if you explicitly made the connection with your observation that Bayesians are frequentists:

What I mean is, the Bayesian prior distribution corresponds to the frequentist sample space: it’s the set of problems for which a particular statistical model or procedure will be applied.

Recently Yoav Benjamini criticized your post (the 2016 edition) in section 5.5 of his article/blog “Selective Inference: The Silent Killer of Replicability.”

Benjamini’s point is that your simulation results break down completely if the true prior is mixed ever so slightly with a much wider distribution. I think he has a valid point, but I also think it can be fixed. In my opinion, it’s really a matter of Bayesian robustness; the prior just needs a flatter tail. This is a much weaker requirement than needing to know the true prior. I’m attaching an example where I use the “wrong” tail but still get pretty good results.

In his document, Zwet writes:

This is a comment on an article by Yoav Benjamini entitled “Selective Inference: The Silent Killer of Replicability.”

I completely agree with the main point of the article that over-optimism due to selection (a.k.a. the winner’s curse) is a major problem. One important line of defense is to correct for multiple testing, and this is discussed in detail.

In my opinion, another important line of defense is shrinkage, and so I was surprised that the Bayesian approach is dismissed rather quickly. In particular, a blog post by Andrew Gelman is criticized. The post has the provocative title: “Bayesian inference completely solves the multiple comparisons problem.”

In his post, Gelman samples “effects” from the N(0,0.5) distribution and observes them with standard normal noise. He demonstrates that the posterior mean and 95% credible intervals continue to perform well under selection.

In section 5.5 of Benjamini’s paper the N(0,0.5) is slightly perturbed by mixing it with N(0,3) with probability 1/1000. As a result, the majority of the credibility intervals that do not cover zero come from the N(0,3) component. Under the N(0,0.5) prior, those intervals get shrunken so much that they miss the true parameter.

It should be noted, however, that those effects are so large that they are very unlikely under the N(0,0.5) prior. Such “data-prior conflict” can be resolved by having a prior with a flat tail. This is a matter of “Bayesian robustness” and goes back to a paper by Dawid which can be found here.

Importantly, this does not mean that we need to know the true prior. We can mix the N(0,0.5) with almost any wider normal distribution with almost any probability and then very large effects will hardly be shrunken. Here, I demonstrate this by using the mixture 0.99*N(0,0.5)+0.01*N(0,6) as prior. This is quite far from the truth, but nevertheless, the posterior inference is quite acceptable. We find that among one million simulations, there are 741 credible intervals that do not cover zero. Among those, the proportion that do not cover the parameter is 0.07 (CI: 0.05 to 0.09).

The point is that the procedure merely needs to recognize that a particular observation is unlikely to come from N(0,0.5), and then apply very little shrinkage.
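The shrinkage behavior described above can be sketched without running a simulation, since the posterior mean under a mixture of zero-centered normals has a closed form. This is an illustration rather than the simulation code from the comment: it assumes the second argument of N(0,0.5) is a standard deviation (as in R's rnorm), takes the noise standard deviation to be 1, and uses made-up function names.

```python
import math


def normal_pdf(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_mean(y, prior, noise_sd=1.0):
    """Posterior mean of theta given one observation y ~ N(theta, noise_sd^2).

    `prior` is a list of (weight, sd) pairs describing a mixture of
    zero-centered normals; a one-element list is an ordinary normal prior.
    """
    post_weights = []
    component_means = []
    for weight, sd in prior:
        # Marginally, y ~ N(0, sd^2 + noise_sd^2) under this component.
        marginal_sd = math.sqrt(sd ** 2 + noise_sd ** 2)
        post_weights.append(weight * normal_pdf(y, marginal_sd))
        # Within a component, the posterior mean is y times a shrinkage factor.
        component_means.append(y * sd ** 2 / (sd ** 2 + noise_sd ** 2))
    total = sum(post_weights)
    return sum(w * m for w, m in zip(post_weights, component_means)) / total


narrow_prior = [(1.0, 0.5)]               # N(0, 0.5) alone
robust_prior = [(0.99, 0.5), (0.01, 6.0)]  # 0.99*N(0,0.5) + 0.01*N(0,6)

for y in (1.0, 3.0, 6.0):
    print(y, posterior_mean(y, narrow_prior), posterior_mean(y, robust_prior))
```

With the narrow prior, every observation is multiplied by 0.25/1.25 = 0.2, however extreme it is. With the mixture, a small observation is shrunk almost as much, while an observation of 6 puts nearly all the posterior weight on the wide component and is barely shrunk at all.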

My own [Zwet’s] views on shrinkage in the context of the winner’s curse are here. In particular, a form of Bayesian robustness is discussed in section 3.4 of a preprint of myself and Gelman here. . . .

He continues with some simulations that you can do yourself in R.

The punch line is that, yes, the model makes a difference, and when you use the wrong model you’ll get the wrong answer (i.e., you’ll always get the wrong answer). This provides ample scope for research on robustness: how wrong are your answers, depending on how wrong is your model? This arises with all statistical inferences, and there’s no need in my opinion to invoke any new principles involving multiple comparisons. I continue to think that (a) Bayesian inference completely solves the multiple comparisons problem, and (b) all inferences, Bayesian included, are imperfect.

Weak separation in mixture models and implications for principal stratification

Avi Feller, Evan Greif, Nhat Ho, Luke Miratrix, and Natesh Pillai write:

Principal stratification is a widely used framework for addressing post-randomization complications. After using principal stratification to define causal effects of interest, researchers are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, standard estimators of mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example, a two-component Gaussian mixture model in which only the component means and variances are unknown, and focus on the setting in which the components are weakly separated. . . . We provide diagnostics for all of these pathologies and apply these ideas to re-analyzing two randomized evaluations of job training programs, JOBS II and Job Corps.

The paper’s all about maximum likelihood estimates and I don’t care about that at all, but the general principles are relevant to understanding causal inference with intermediate outcomes and fitting such models in Stan or whatever.

Just show me the data, baseball edition

Andrew’s always enjoining people to include their raw data. Jim Albert, of course, does it right. Here’s a recent post from his always fascinating baseball blog, Exploring Baseball Data with R,

The post “just” plots the raw data and does a bit of exploratory data analysis, concluding that the apparent trends are puzzling. Albert’s blog has it all. The very next post fits a simple Bayesian predictive model to answer the question every baseball fan in NY is asking,

P.S. If you like Albert’s blog, check out his fantastic intro to baseball stats, which only assumes a bit of algebra, yet introduces most of statistics through simulation. It’s always the first book I recommend to anyone who wants a taste of modern statistical thinking and isn’t put off by the subject matter,

  • Jim Albert and Jay Bennett. 2001. Curve Ball. Copernicus.


At last! Incontrovertible evidence (p=0.0001) that people over 40 are older, on average, than people under 40.

Andreas Stang points us to the above delightful image, which comes from an otherwise obscure paper in the Journal of Circulating Biomarkers. That’s awesome that they got p=0.0001. And it was multiple-comparisons corrected! I think we finally have hard evidence that people over 40 are older, on average, than people under 40. It’s a good thing it wasn’t p=0.06 or something, or we might still be in the dark regarding the relationship between the “Age” and “AGE>40” variables.

I’m reminded of the immortal line, “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]." Research!

I know it might sound strange but I believe you’ll be coming back before too long

Last month we reported on some funky statistics coming out of the Maryland Department of Transportation—something about adding lanes to the Beltway.

Ben Ross sends an update:

Thank you so much for reporting in your blog on my letter about possible scientific fraud in the traffic model for the Maryland toll lane project. There are new developments that your readers may be interested in.

My [Ross’s] letter, sent to US Dept. of Transportation Deputy Secretary Polly Trottenberg, concerned the Final Environmental Impact Statement issued in June by the Federal Highway Administration. This document is the basis for a Record of Decision (ROD), the federal approval needed for the project to go forward.

On July 18, a lobbying group supporting the project wrote to Trottenberg asking her to “ignore” my letter. The signer of the letter was Doug Mayer, former Communications Director to Maryland Governor Larry Hogan. The Mayer letter is attached and a news report on it is here.

A few days ago, the Federal Highway Administration informed the Maryland Dept. of Transportation that USDOT was not ready to issue the ROD and asked them to respond to public comments they had received on the FEIS. This clearly includes my letter to Trottenberg; I don’t know the full extent of what MDOT was asked to respond to.

Yesterday morning, Governor Hogan wrote to President Biden and USDOT Secretary Pete Buttigieg demanding immediate issuance of the ROD without any response to comments on the FEIS. He issued a press release describing the delay as “purely political” and “irresponsible and incompetent federal overreach” and threatening legal action. Press coverage of this has appeared in Maryland Matters and the Washington Post.

In response, the Federal Highway Administration issued the following statement yesterday afternoon:

In his letter, the former communications director says a lot about professionalism: “The traffic engineering and environmental analyses were performed by professional engineers and other qualified subject matter experts from eight federal, state, and local agencies and 20 participating agencies . . . following approved, industry standard procedures . . . consistent with accepted industry standards . . . licensed professionals with advanced degrees in traffic engineering . . .”

Expertise can be important, that’s for sure. But I’m not sure what I’m supposed to think about work that is “consistent with accepted industry standards” in traffic engineering. This came up a few years ago in our article, The Commissar for Traffic Presents the Latest Five-Year Plan. For whatever reason, it seems like standard practice to make bad forecasts and then not update them appropriately with new information:


This sort of behavior might be ok if you’re an academic economist writing about the Soviet Union: [image: samuelson.png]

But government employees should be able to do better, no?

Here’s the point. When we see forecasts of bridge traffic, transit traffic, cost projections, etc., made by people with a political or financial interest in the project . . . OK, these forecasts could be good or they could be bad. You can’t just assume they’re correct, just cos they’re by traffic engineers with advanced degrees, consistent with accepted industry standards, etc. Industry standards aren’t always so great, and there are real conflicts of interest here. I’m not saying that these studies shouldn’t be done; I’m just saying that it could be a mistake to assume that the “eight federal, state, and local agencies and 20 participating agencies” experts are producing an unbiased report.

The other interesting thing from the former communications director’s letter is a report from an organization called Public Opinion Strategies. They share results from a poll of 500 registered voters in Maryland, but it’s kind of impossible for me to evaluate given that they don’t say how they sampled the voters or what the survey questions were. I have a horrible feeling the poll was done with the goal of getting positive responses on this Beltway expansion thing. The poll is irrelevant to concerns about the traffic report, but it’s an interesting example of possibly slanted news. Seeing poll results with no information of where the respondents came from or what the questions were . . . it’s like trying to piece together a conversation from hearing only one person’s words.


I absolutely love this bit:

No sense of where these respondents come from or what questions were asked, but, hey, the margin of error is 4.38%. The only thing I don’t get is why didn’t they say it more precisely: the margin of error is 4.382693%. What’s with the rounding, dude??
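For what it's worth, the 4.38% is what you get from the textbook margin-of-error formula for a proportion, 1.96*sqrt(p(1-p)/n), with p = 0.5 and n = 500. That the pollsters used this formula is my assumption; the writeup doesn't say. Here's a minimal sketch:

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the usual 95% interval for a proportion,
    assuming a simple random sample of size n."""
    return z * math.sqrt(p * (1.0 - p) / n)


moe = margin_of_error(500)
print(f"{100 * moe:.2f}%")  # two decimals, as reported
print(f"{100 * moe:.6f}%")  # the "more precise" version
```

The formula takes simple random sampling for granted, which is exactly the thing we have no information about, so the extra decimal places tell us nothing about how accurate the poll is.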

From chatbots and understanding to appliance repair and statistical practice

A couple months ago we talked about some extravagant claims made by Google engineer Blaise Agüera y Arcas, who pointed toward the impressive behavior of a chatbot and argued that its activities “do amount to understanding, in any falsifiable sense.” Arcas gets to the point of saying, “None of the above necessarily implies that we’re obligated to endow large language models with rights, legal or moral personhood, or even the basic level of care and empathy with which we’d treat a dog or cat,” a disclaimer that just reinforces his position, in that he’s even considering that it might make sense to “endow large language models with rights, legal or moral personhood”—after all, he’s only saying that we’re not “necessarily . . . obligated” to give these rights. It sounds like he’s thinking that giving such rights to a computer program is a live possibility.

Economist Gary Smith posted a skeptical response, first showing how bad a chatbot will perform if it’s not trained or tuned in some way, and more generally saying, “Using statistical patterns to create the illusion of human-like conversation is fundamentally different from understanding what is being said.”

I’ll get back to Smith’s point at the end of this post. First I want to talk about something else, which is how we use Google for problem solving.

The other day one of our electronic appliances wasn’t working. I went online and searched on the problem and I found several forums where the topic was brought up and a solution was offered. Lots of different solutions, but none of them worked for me. I next searched to find a pdf of the owner’s manual. I found it, but again it didn’t have the information to solve the problem. I then went to the manufacturer’s website which had a chat line—I guess it was a real person but it could’ve been a chatbot, because what it did was send me thru a list of attempted solutions and then when none worked the conclusion was that the appliance was busted.

What’s my point here? First, I don’t see any clear benefit here from having convincing human-like interaction. If it’s a chatbot, I don’t want it to pass the Turing test, I’d rather be aware it’s a chatbot as this will allow me to use it more effectively. Second, for many problems, the solution strategy that humans use is superficial, just trying to fix the problem without understanding it. With modern technology, computers become more like humans, and humans become more like computers in how they solve problems.

I don’t want to overstate that last point. For example, in drug development it’s my impression that the best research is very much based on understanding, not just throwing a zillion possibilities at a disease and seeing what works but directly engineering something that grabs onto the proteins or whatever. And, sure, if I really wanted to fix my appliance it would be best to understand exactly what’s going on. It’s just that in many cases it’s easier to solve the problem, or to just buy a replacement, than to figure out what’s happening internally.

How people do statistics

And then it struck me . . . this is how most people do statistics, right? You have a problem you want to solve; there’s a big mass of statistical methods out there, loosely sorted into various piles (“Bayesian,” “machine learning,” “econometrics,” “robust statistics,” “classification,” “Gibbs sampler,” “Anova,” “exact tests,” etc.); you search around in books or the internet or ask people what method might work for your problem; you look for an example similar to yours and see what methods they used there; you keep trying until you succeed, that is, finding a result that is “statistically significant” and makes sense.

This strategy won’t always work—sometimes the data don’t produce any useful answer, just as in my example above, sometimes the appliance is just busted—but I think this is a standard template for applied statistics. And if nothing comes out, then, sure, you do a new experiment or whatever. Anywhere other than the Cornell Food and Brand Lab, the computer of Michael Lacour, and the trunk of Diederik Stapel’s car, we understand that success is never guaranteed.

Trying things without fully understanding them, just caring about what works: this strategy makes a lot of sense. Sure, I might be a better user of my electronic appliance if I better understood how it worked, but really I just want to use it and not be bothered by it. Similarly, researchers want to make progress in medicine, or psychology, or economics, or whatever: statistics is a means to an end for them, as it generally should be.

Unfortunately, as we’ve discussed many times, the try-things-until-something-works strategy has issues. It can be successful for the immediate goal of getting a publishable result and building a scientific career, while failing in the larger goal of advancing science.
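To put a rough number on the try-things-until-something-works problem, here is a minimal sketch. It assumes, unrealistically, that each attempted analysis is an independent test of a true null effect at the 0.05 level; real analyses of the same data are correlated, so take this as an illustration and not a model of any actual study. The function name is made up for this example.

```python
def prob_false_success(num_tries, alpha=0.05):
    """Chance that at least one of num_tries independent tests of a true
    null comes out 'statistically significant' at level alpha.
    Independence is an assumption made for this sketch."""
    return 1 - (1 - alpha) ** num_tries

for k in (1, 5, 20, 100):
    print(k, round(prob_false_success(k), 2))
```

Under these assumptions, a researcher who tries 20 analyses on pure noise has well over a 50% chance of finding at least one "success," which is why the appliance-repair strategy has no clear win condition here.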

Why is it that I’m ok with the keep-trying-potential-solutions-without-trying-to-really-understand-the-problem method for appliance repair but not for data analysis? The difference, I think, is that appliance repair has a clear win condition but data analysis doesn’t. If the appliance works, it works, and we’re done. If the data analysis succeeds in the sense of giving a “statistically significant” and explainable result, this is not necessarily a success or “discovery” or replicable finding.

It’s a kind of principal-agent problem. In appliance repair, the principal and agent coincide; in scientific research, not so much.

Now to get back to the AI chatbot thing:

– For appliance repair, you don’t really need understanding. All you need is a search engine that will supply enough potential solutions that will either solve your problem or allow you to be ok with giving up.

– For data analysis, you do need understanding. Not a deep understanding, necessarily, but some sort of model of what’s going on. A “chatbot” won’t do the job.

But, can a dumb chatbot be helpful in data analysis? Sure. Indeed, I use google to look up R functions all the time, and sometimes I use google to look up Stan functions! The point is that some sort of model of the world is needed, and the purpose of the chatbot is to give us tools to attain that understanding.

At this point you might feel that I’m leaving a hostage to fortune. I’m saying that data analysis requires understanding and that existing software tools (including R and Stan) are just a way to aim for that. But what happens 5 or 10 or 15 years in the future when a computer program appears that can do an automated data analysis . . . then will I say it has true understanding? I don’t know, but I might say that the automated analysis is there to facilitate true understanding from the user.

More chatbot interactions

I played around with GPT-3 myself and I kept asking questions and getting reasonable, human-sounding responses. So I sent a message to Gary Smith:

As you know, GPT-3 seems to have been upgraded, and now it works well on those questions you gave it. Setting aside the question of whether the program has “understanding” (I’d say No to that), I’m just wondering, do you think it now will work well on new questions? It’s hard for me to come up with queries, but you seem to be good at that!

I’m asking because I’m writing a short post on chatbots and understanding, and I wanted to get a sense of how good these chatbots are now. I’m not particularly interested in the Turing-test thing, but it would be interesting to see if GPT-3 gives better responses now to new questions? And for some reason I have difficulty coming up with inputs that could test it well. Thanks in advance.

Smith replied:

I tried several questions and here are screenshots of every question and answer. I only asked each question once. I used davinci-002, which I believe is the most powerful version of GPT-3.

My [Smith’s] takeaways are:

1. Remarkably fluent, but has a lot of trouble with distinguishing between meaningless and meaningful correlations, which is the point I am going to push in my Turing test piece. Being “a fluent spouter of bullshit” [a term from Ernie Davis and Gary Marcus] doesn’t mean that we can trust blackbox algorithms to make decisions.

2. It handled two Winograd schema questions (axe/tree and trophy/suitcase) well. I don’t know if this is because these questions are part of the text they have absorbed or if they were hand coded.

3. They often punt (“There’s no clear connection between the two variables, so it’s tough to say.”) when the answer is obvious to humans.

4. They have trouble with unusual situations: Human: Who do you predict would win today if the Brooklyn Dodgers played a football game against Preston North End? AI: It’s tough to say, but if I had to guess, I’d say the Brooklyn Dodgers would be more likely to win.

The Brooklyn Dodgers example reminds me of those WW2 movies where they figure out who’s the German spy by tripping him up with baseball questions.

Smith followed up:

A few more popped into my head. Again a complete accounting. Some remarkably coherent answers. Some disappointing answers to unusual questions:

I like that last bit: “I’m not sure if you can improve your test scores by studying after taking the test, but it couldn’t hurt to try!” That’s the kind of answer that will get you tenure at Cornell.

Anyway, the point here is not to slam GPT-3 for not working miracles. Rather, it’s good to see where it fails, to understand its limitations and how to improve it and similar systems.

To return to the main theme of this post, the question of what the computer program can “understand” is different from the question of whether the program can fool us with a “Turing test” is different from the question of whether the program can be useful as a chatbot.

Stan downtown intern posters: scikit-stan & constraining transforms

It’s been a happening summer here at Stan’s downtown branch at the Flatiron Institute. Brian Ward and I advised a couple of great interns. Two weeks or so before the end of the internship, our interns present posters. Here are the ones from Brian’s intern Alexey and my intern Meenal.

Alexey Izmailov: scikit-stan

Alexey built a version of the scikit-learn API backed by Stan’s sampling, optimization, and variational inference. It’s plug and play with scikit-learn.

Meenal Jhajharia: unconstraining transforms

Meenal spent the summer exploring constraining transforms and how to evaluate them, with the goal of refining Stan’s transform performance and adding new data structures. This involved figuring out how to evaluate them (vs. target distributions w.r.t. convexity, condition if convex, and sampling behavior in the tail, body, and near the mode of target densities). Results are turning out to be more interesting than we suspected in that different transforms seem to work better under different conditions. We’re also working with Seth Axen (Tübingen) and Stan devs Adam Haber and Sean Pinkney.

They don’t make undergrads like they used to

Did I mention they were undergrads? Meenal’s heading back to University of Delhi to finish her senior year and Alexey heads back to Brown to start his junior year! The other interns at the Center for Computational Mathematics, many of whom were undergraduates, have also done some impressive work in everything from using normalizing flows to improve sampler proposals for molecular dynamics to building 2D surface PDE solvers at scale to HPC for large N-body problems. In this case, not making undergrads like they used to is a good thing!

Hiring for next summer

If you’re interested in working on statistical computing as an intern next summer, drop me a line at [email protected]. I’ll announce when applications are open here on the blog.


“The scandal isn’t what’s illegal, the scandal is what’s legal”: application of Kinsley’s rule to science

I was chatting with some people the other day and the ridiculous voodoo study came up, and that reminded me of an article, “The more you play, the more aggressive you become: A long-term experimental study of cumulative violent video game effects on hostile expectations and aggressive behavior,” published several years ago in the Journal of Experimental Social Psychology.

As we discussed a few years after the paper came out, this article had a huge, huge, HUGE problem, which was that it claimed it was a “long-term experimental study”—that’s right in the title!—but the actual study was not long-term in any way. As I wrote:

What was “long term,” you might wonder? 5 years? 10 years? 20 years? Were violent video games even a “thing” 20 years ago?

Nope. By “long-term” here, the authors mean . . . 3 days.

In addition, the treatment is re-applied each day. So we’re talking about immediate, short-term effects.

I’ve heard of short-term thinking, but this is ridiculous! Especially given that the lag between the experimental manipulation and the outcome measure is, what, 5 minutes? The time lag isn’t stated in the published paper, so we just have to guess.

3 days, 5 minutes, whatever. Either way it’s not in any way “long term.” Unless you’re an amoeba.

Ok, this is not news, indeed it wasn’t even news when I posted on it back in 2018. But it’s still buggin me. As Michael Kinsley said so many years ago, and he was just so so so right on this one, the scandal isn’t what’s illegal, the scandal is what’s legal.

So, the Journal of Experimental Social Psychology published a paper back in 2014 with a blatant error RIGHT IN THE TITLE, and do they retract it? Does anyone even care? No and no.

For your reference, here it is:

Yes, I looked up the erratum listed there, and, no, the erratum does not clarify that the title of the paper is at best extremely misleading and at worst the most horrible thing published in a psychology journal since the critical positivity ratio people dined alone.

But, no, of course nobody would consider doing something about this. It wasn’t noticed by four authors, three peer reviewers, an associate editor, and an editor. Kinda makes you wonder, huh?

I will keep screaming about this sort of thing forever.

Does having kids really protect you from serious COVID‑19 symptoms?

Aleks pointed me to this article, which reports:

Epidemiologic data consistently show strong protection for young children against severe COVID-19 illness. . . . We identified 3,126,427 adults (24% [N = 743,814] with children ≤18, and 8.8% [N = 274,316] with youngest child 0–5 years) to assess whether parents of young children—who have high exposure to non-SARS-CoV-2 coronaviruses—may also benefit from potential cross-immunity. In a large, real-world population, exposure to young children was strongly associated with less severe COVID-19 illness, after balancing known COVID-19 risk factors. . . .

My first thought was that parents are more careful than non-parents so they’re avoiding covid exposure entirely. But it’s not that: non-parents in the matched comparison had a lower rate of infections but a higher rate of severe cases; see Comparison 3 in Table 2 of the linked article.

One complicating factor is that they didn’t seem to have adjusted for whether the adults were vaccinated–that’s a big deal, right? But maybe not such an issue given that the study ended on 31 Jan 2021, and by then it seems that only 9% of Americans were vaccinated. It’s hard for me to know if this would be enough to explain the difference found in the article–for that it would be helpful to have the raw data, including the dates of these symptoms.

Are the data available? It says, “This article contains supporting information online at” but when I click on that link it just takes me to the main page of the article, so I don’t know whassup with that.

Here’s another thing. Given that the parents in the study were infected at a higher rate than the nonparents, it would seem that the results can’t simply be explained by parents being more careful. But could it be a measurement issue? Maybe parents were more likely to get themselves tested.

The article has a one-paragraph section on Limitations, but it does not consider any of the above issues.

I sent the above to Aleks, who added:

My thought is that the population of parents probably lives differently than non-parents: less urban, perhaps biologically healthier. They did match, but just doing matching doesn’t guarantee that the relevant confounders have truly been handled.

This paper is a big deal 1) because it’s used to support herd immunity 2) because it is used to argue against vaccination 3) because it doesn’t incorporate long Covid risks.

For #3, it might be possible to model out the impact, based on what we know about the likelihood of long-term issues, e.g.

Your point about the testing bias could be picked up by the number of symptomatic vs asymptomatic cases, which would reveal a potential bias.

My only response here is that if the study ends in Jan 2021, I can’t see how it can be taken as an argument against vaccination. Even taking the numbers in Table 2 at face value, we’re talking about a risk reduction for severe covid from having kids of a factor of 1.5. Vaccines are much more effective than that, no? So even if having Grandpa sleep on the couch and be exposed to the grandchildren’s colds is a solution that works for your family, it’s not nearly as effective as getting the shot–and it’s a lot less convenient.

Aleks responds:

Looking at the Israeli age-stratified hospitalization dashboard, the hospitalization rates for unvaccinated 30-39-olds are almost 5x greater than for vaccinated & boosted ones. However, the hospitalization rate for unvaccinated 80+ is only about 30% higher.

Still more on the Heckman Curve!

Carlos Parada writes:

Saw your blog post on the Heckman Curve. I went through Heckman’s response that you linked, and it seems to be logically sound but terribly explained, so I feel like I need to explain why Rea+Burton is great empirical work, but it doesn’t actually measure the Heckman curve.

The Heckman curve just says that, for any particular person, there exists a point where getting more education isn’t worth it anymore because the costs grow as you get older, or equivalently, the benefits get smaller. This is just trivially true. The most obvious example is that nobody should spend 100% of their life studying, since then they wouldn’t get any work done at all. Or, more tellingly, getting a PhD isn’t worth it for most people, because most people either don’t want to work in academia or aren’t smart enough to complete a PhD. (Judging by some of the submissions to PPNAS, I’m starting to suspect most of academia isn’t smart enough to work in academia.)

The work you linked finds that participant age doesn’t predict the success of educational programs. I have no reason to suspect these results are wrong, but the effect of age on benefit:cost ratios for government programs does not measure the Heckman curve.

To give a toy model, imagine everyone goes to school as long as the benefits of schooling are greater than the costs for them, then drops out as soon as they’re equal. So now, for high school dropouts, what is the benefit:cost ratio of an extra year of school? 1 — the costs roughly equal the benefits. For college dropouts, what’s the benefit:cost ratio? 1 — the costs roughly equal the benefits. And so on. By measuring the effects of government interventions on people who completed x years of school before dropping out, the paper is conditioning on a collider. This methodology would only work if when people dropped out of school was independent of the benefits/costs of an extra year of school.

(You don’t have to assume perfect rationality for this to work: If everyone goes to school until the benefit:cost ratio equals 1.1 or 0.9, you still won’t find a Heckman curve. Models that assume rational behavior tend to be robust to biases of this sort, although they can be very vulnerable in some other cases.)

Heckman seems to have made this mistake at some points too, though, so the authors are in good company. The quotes in the paper suggest he thought an individual Heckman curve would translate to a downwards-sloping curve for government programs’ benefits, when there’s no reason to believe they would. I’ve made very similar mistakes myself.


An econ undergrad who really should be getting back to his Real Analysis homework
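The toy model in Parada’s note can be sketched in a few lines. Everything specific here is an assumption made for illustration: the benefit:cost ratio of a person’s t-th year of school is taken to be ability / t, abilities are spread evenly between 1 and 16, and people stay in school exactly as long as the next year has a ratio of at least 1.

```python
from collections import defaultdict

def ratio(ability, year):
    """Hypothetical benefit:cost ratio of a person's year-th year of school.
    It falls with each extra year: a downward-sloping individual curve."""
    return ability / year

def years_completed(ability):
    """Stay in school while the next year is still worth it (ratio >= 1)."""
    year = 0
    while ratio(ability, year + 1) >= 1:
        year += 1
    return year

# Deterministic grid of hypothetical abilities from 1.00 to 15.99.
abilities = [1 + 0.01 * i for i in range(1500)]

# Group people by when they dropped out, and record the ratio of the
# extra year of school that each dropout chose not to take.
by_dropout = defaultdict(list)
for a in abilities:
    n = years_completed(a)
    by_dropout[n].append(ratio(a, n + 1))

for n in sorted(by_dropout):
    vals = by_dropout[n]
    print(n, round(sum(vals) / len(vals), 2))
```

In this sketch every individual has a declining curve, yet the average ratio for the marginal year sits a bit below 1 in every dropout group and does not fall with years of schooling. That is the conditioning problem described in the note: grouping by dropout time selects people whose next year was just barely not worth it.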

Interesting. This relates to the marginal-or-aggregate question that comes up a lot in economics. It’s a common problem that we care about marginal effects but the data more easily allow us to estimate average effects. (For the statisticians in the room, let me remind you that “margin” has opposite meanings in statistics and economics.)

But one problem that Parada doesn’t address with the Heckman curve is that the estimates of efficacy used by Heckman are biased, sometimes by a huge amount, because of selection on statistical significance; see section 2.1 of this article. All the economic theory in the world won’t fix that problem.

P.S. In an amusing example of blog overlap, Parada informs us that he also worked on the Minecraft speedrunning analysis. It’s good to see students keeping busy!

A two-week course focused on basic math, probability, and statistics skills

This post is by Eric.

On August 15, I will be teaching this two-week course offered by the PRIISM center at NYU. The initial plan was to offer it to the NYU students entering the A3SR MS Program, but we are opening it up to a wider audience. In case you don’t like clicking on things, here is a short blurb:

This course aims to prepare students for the Applied Statistics for Social Science Research program at NYU. We will cover basic programming using the R language, including data manipulation and graphical displays; some key ideas from Calculus, including differentiation, integration, and optimization; an introduction to Linear Algebra, including vector and matrix arithmetic, determinants, and eigenvalues and eigenvectors; some core concepts in Probability including random variables, discrete and continuous distributions, and expectations; and a few simple regression examples.

This is a paid class, but Jennifer Hill, who runs the program, tells me that department scholarships are available based on program and student needs.

If you would like to take the course, we ask that you fill out a short survey here. (If you need financial assistance, please indicate it under the “Is there anything else you’d like to share with us?” survey question.) You can register here. We are planning to offer it in-person at NYU and online via Zoom.

Warning: This is my first time teaching this class, so I am not sure how much material we will be able to cover. We will have to gauge that as we go.

If you have taught something like this before and have suggestions for me, please leave those in the comments.

Solution to that little problem to test your probability intuitions, and why I think it’s poorly stated

The other day I got this email from Ariel Rubinstein and Michele Piccione asking me to respond to this question which they sent to a bunch of survey respondents:

A very small proportion of the newborns in a certain country have a specific genetic trait.
Two screening tests, A and B, have been introduced for all newborns to identify this trait.
However, the tests are not precise.
A study has found that:
70% of the newborns who are found to be positive according to test A have the genetic trait (and conversely 30% do not).
20% of the newborns who are found to be positive according to test B have the genetic trait (and conversely 80% do not).
The study has also found that when a newborn has the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.
Likewise, when a newborn does not have the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.
Suppose that a newborn is found to be positive according to both tests.
What is your estimate of the likelihood (in %) that this newborn has the genetic trait?

Here was my response:

OK, let p = Pr(trait) in population, let a1 = Pr(positive test on A | trait), a2 = Pr(positive test on A | no trait), b1 = Pr(positive test on B | trait), b2 = Pr(positive test on B | no trait).
Your first statement is Pr(trait | positive on test A) = 0.7. That is, p*a1/(p*a1 + (1-p)*a2) = 0.7
Your second statement is Pr(trait | positive on test B) = 0.2. That is, p*b1/(p*b1 + (1-p)*b2) = 0.2

What you want is Pr(trait | positive on both tests) = p*a1*b1 / (p*a1*b1 + (1-p)*a2*b2)

It looks at first like there’s no unique solution to this one, as it’s a problem with 5 unknowns and just 2 data points!

But we can do that “likelihood ratio” trick . . .
Your first statement is equivalent to 1 / (1 + ((1-p)/p) * (a2/a1)) = 0.7; therefore (p/(1-p)) * (a1/a2) = 0.7 / 0.3
And your second statement is equivalent to (p/(1-p)) * (b1/b2) = 0.2 / 0.8
Finally, what you want is 1 / (1 + ((1-p)/p) * (a2/a1) * (b2/b1)). OK, this can be written as X / (1 + X), where X is (p/(1-p)) * (a1/a2) * (b1/b2).
Given the information above, X = (0.7 / 0.3) * (0.2 / 0.8) * (1-p)/p

Still not enough information, I think! We don’t know p.

OK, you give one more piece of information, that p is “very small.” I’ll suppose p = 0.001.

Then X = (0.7 / 0.3) * (0.2 / 0.8) * 999, which comes to 580, so the probability of having the trait given positive on both tests is 580 / 581 = 0.998.

OK, now let me check my math. According to the above calculations,
(1/999) * (a1/a2) = 0.7/0.3, thus a1/a2 = 2300, and
(1/999) * (b1/b2) = 0.2/0.8, thus b1/b2 = 250.
And then (p/(1-p))*(a1/a2)*(b1/b2) = (1/999)*2300*250 = 580.

So, yeah, I guess that checks out, unless I did something really stupid. The point is that if the trait is very rare, then the tests have to be very precise to give such good predictive power.

But . . . you also said “the tests are not precise.” This seems to contradict your earlier statement that only “a very small proportion” have the trait. So I feel like your puzzle has an embedded contradiction!

I’m just giving you my solution straight, no editing, so you can see how I thought it through.
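For anyone who wants to check the arithmetic in that response, here is a short sketch. It uses the same assumption as the response, p = 0.001, which is not given in the original problem; the function name is made up for this example.

```python
def posterior_both_positive(ppv_a, ppv_b, p):
    """Pr(trait | positive on both tests), given each test's
    Pr(trait | positive) and the base rate p, assuming the tests are
    conditionally independent given trait status."""
    prior_odds = p / (1 - p)
    # Posterior odds after one test = prior odds * likelihood ratio,
    # so each likelihood ratio is (posterior odds) / (prior odds).
    lr_a = (ppv_a / (1 - ppv_a)) / prior_odds  # a1/a2
    lr_b = (ppv_b / (1 - ppv_b)) / prior_odds  # b1/b2
    x = prior_odds * lr_a * lr_b  # posterior odds after both tests
    return lr_a, lr_b, x, x / (1 + x)

lr_a, lr_b, x, prob = posterior_both_positive(0.7, 0.2, 0.001)
print(f"a1/a2 = {lr_a:.0f}, b1/b2 = {lr_b:.2f}, X = {x:.2f}, Pr = {prob:.3f}")
```

Without rounding, the likelihood ratios are 2331 and 249.75 and the posterior odds are 582.75, which the response rounds to 2300, 250, and 580. The final probability is 0.998 either way.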

Rubinstein and Piccione confirmed that my solution, that the probability is very close to 1, is correct, and they pointed me to this research article where they share the answers that were given to this question when they posed it to a bunch of survey respondents.

I found the Rubinstein and Piccione article a bit frustrating because . . . they never just give the damn responses! The paper is very much in the “economics” style rather than the “statistics” style in that they’re very focused on the theory, whereas statisticians would start with the data. I’m not saying the economics perspective is wrong here—the experiment was motivated by theory, so it makes sense to compare results to theoretical predictions—I just found it difficult to read because there was never a simple plot of all the data.

My problem with their problem

But my main beef with their example is that I think it’s a trick question. On one hand, it says only “very small proportion” in the population have the trait; indeed, I needed that information to solve the problem. On the other hand, it says “the tests are not precise”—but I don’t think that’s right, at least not in the usual way we think about the precision of a test. With this problem description, they’re kinda giving people an Escher box and then asking what side is up!

To put it another way, if you start with “a very small proportion,” and then you take one test and it gets your probability all the way up to 70%, then, yeah, that’s a precise test! It takes a precise test to give you that much information, to take you from 0.001 to 0.7.

So here’s how I think the problem is misleading: The test is described as “not precise,” and then you see the numbers 0.7 and 0.2, so it’s natural to think that these tests do not provide much information. Actually, though, if you accept the other part of the problem (that only “a very small proportion” have the trait), the tests provide a lot of information. It seems strange to me to call a test which offers a likelihood ratio of 2300 as being “not precise.”

To put it another way: I think of the precision of a test as a function of the test’s properties alone, not of the base rate. If you have a precise test and then apply it to a population with a very low base rate, you can end up with a posterior probability of close to 50/50. That posterior probability depends on the test’s precision and also on the base rate.
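Here is a minimal sketch of that dependence on the base rate. The likelihood ratio of 1000 is a made-up value standing in for "a precise test," and the base rates are chosen only for illustration.

```python
def posterior(likelihood_ratio, base_rate):
    """Pr(trait | positive test) from the test's likelihood ratio,
    Pr(positive | trait) / Pr(positive | no trait), and the base rate."""
    prior_odds = base_rate / (1 - base_rate)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Same hypothetical test (likelihood ratio 1000), different populations.
for base_rate in (0.5, 0.1, 0.01, 0.001):
    print(base_rate, round(posterior(1000, base_rate), 3))
```

The test is the same in every row; only the base rate changes. At a base rate of 0.001 the posterior lands very close to 50/50.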

I guess they could try out this problem on a new set of respondents, where instead of describing the tests as “not precise,” they describe them as “very precise,” and see what happens.

One more thing

On page 11 of their article, Rubinstein and Piccione give an example where different referees have independent data in their private signals, when trying to determine if a defendant is guilty of a crime. This does not seem plausible in the context of deciding whether a defendant is guilty. I think it would make more sense to say that they have overlapping information. This does not change the math of the problem—you can think of their overlapping information, along with the base rate, as being a shared “prior” and the non-overlapping information corresponds to the two data points in your earlier formulation—but that would make it more realistic.

I understand that this model is just based on the literature. I just have political problems with oversimplified models of politics, juries, etc. I’d recommend that the authors either use a different “cover story” or else emphasize that this is just a mathematical story not applicable to real juries. In their paper, they talk about “the assumption that people are Bayesian,” but I’m bothered by the assumption that different referees have independent data in their private signals. That’s a really strong assumption! It’s funny which assumptions people will question and which assumptions they will just accept as representing neutral statements of a problem.

A connection to statistical inference and computing

This problem connects to some of our recent work on the computational challenges of combining posterior distributions. The quick idea is that if theta is your unknown parameter (in this case, the presence or absence of the trait) and you want to combine posteriors p_k(theta|y_k) from independent data sources y_k, k=1,…,K, then you can multiply these posteriors but then you need to divide by the factor p(theta)^(K-1). Dividing by the prior to a power in this way will in general induce computational instability. Here is a short paper on the problem and here is a long paper. We’re still working on this.
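For the two-test example, where theta is binary, the multiply-then-divide rule can be sketched directly. This is only a toy version of the idea, not the method in the papers; the function name is made up, and the prior of 0.001 is the same assumed base rate used in the solution above.

```python
def combine_posteriors(posteriors, prior):
    """Combine K separate posteriors Pr(theta=1 | y_k), each computed from
    the same prior Pr(theta=1) and conditionally independent data, into
    Pr(theta=1 | y_1, ..., y_K) for a binary theta."""
    K = len(posteriors)
    w1 = 1.0  # unnormalized weight for theta = 1
    w0 = 1.0  # unnormalized weight for theta = 0
    for q in posteriors:
        w1 *= q
        w0 *= 1 - q
    # The prior was counted K times, so divide out K - 1 copies of it.
    w1 /= prior ** (K - 1)
    w0 /= (1 - prior) ** (K - 1)
    return w1 / (w1 + w0)

print(round(combine_posteriors([0.7, 0.2], 0.001), 3))
```

With the two posteriors 0.7 and 0.2, this reproduces the 0.998 from the odds calculation. It also hints at where the instability comes from: with a small prior and many sources, prior ** (K - 1) becomes extremely small, and in continuous problems the division has to be done on densities that are themselves only approximated.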

Can some major media outlet please give David Weakliem a regular column, as soon as possible?

David Weakliem is a sociology professor who has a blog on public opinion. He produces an impressive stream of thoughtful, surprising, nonpartisan, non-“hot-take,” takes on public opinion. There’s nothing else like this out there. Dude should have a regular column at the Washington Post or Reuters or Bloomberg or the Economist or some legit journalistic outlet. He should be getting millions of readers a week, not just the few dozen or whatever who trickle over from our links page.

He has recent posts on national levels of trust; the ways that people view others at different education levels; politics and covid rates by state; social class and opinions on covid restrictions; abortion and abortion reporting; voter suppression; and lots more.

I can see how Weakliem’s blog doesn’t quite fit into the usual model of newspaper columns, because he gets into some methodological details sometimes. But I think he could make it work. He can keep the blog as backup, but for the weekly newspaper column he’d focus on the public debate and policy implications, giving just the key numbers or graph to make his point.

I seriously think this should happen. Not if the goal is raw clicks or social media presence, but if the goal is to inform readers and get some respect among people who care about public opinion. Some news organizations would want that.

Here’s a little problem to test your probability intuitions:

Ariel Rubinstein and Michele Piccione send along this little problem to test your probability intuitions:

A very small proportion of the newborns in a certain country have a specific genetic trait.
Two screening tests, A and B, have been introduced for all newborns to identify this trait.
However, the tests are not precise.
A study has found that:
70% of the newborns who are found to be positive according to test A have the genetic trait (and conversely 30% do not).
20% of the newborns who are found to be positive according to test B have the genetic trait (and conversely 80% do not).
The study has also found that when a newborn has the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.
Likewise, when a newborn does not have the genetic trait, a positive result in one test does not affect the likelihood of a positive result in the other.
Suppose that a newborn is found to be positive according to both tests.
What is your estimate of the likelihood (in %) that this newborn has the genetic trait?

Just to clarify for readers such as myself who are overly familiar with statistics jargon: when they say “likelihood” above, they’re talking about what we could call “conditional probability.”

Anyway, you can check your intuition on this one. Tomorrow I’ll post the solution and get into some interesting subtleties.

P.S. Solution and discussion here.

This journal is commissioning a sequel to one of my smash hits. How much will they pay me for it? You can share negotiation strategies in the comments section.

I know it was a mistake to respond to this spam but I couldn’t resist . . . For the rest of my days, I will pay the price of being on the sucker list.

The following came in the junk mail the other day:

Dear Dr. Andrew Gelman,

My name is **, the editorial assistant of **. ** is a peer-reviewed, open access journal published by **.

I have had an opportunity to read your paper, “Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs”, and can find that your expertise fits within the scope of our journal quite well.
Therefore, you are cordially invited to submit new, unpublished manuscripts to **. If you do not have any at the moment, it is appreciated if you could keep our journal in mind for your future research outputs.

You may see the journal’s profile at ** and submit online. You may also e-mail submissions to **.

We are recruiting reviewers for the journal. If you are interested in becoming a reviewer, we welcome you to join us. Please find the application form and details at ** and e-mail the completed application form to **.

** is included in:
· CrossRef; EBSCOhost; EconPapers
· Gale’s Academic Databases
· GetInfo; Google Scholar; IDEAS
· J-Gate; Journal Directory
· JournalTOCs; LOCKSS
· MediaFinder®-Standard Periodical Directory
· RePEc; Sherpa/Romeo
· Standard Periodical Directory
· Ulrich’s; WorldCat
Areas include but are not limited to:
· Accounting;
· Economics
· Finance & Investment;
· General Management;
· Management Information Systems;
· Business Law;
· Global Business;
· Marketing Theory and Applications;
· General Business Research;
· Business & Economics Education;
· Production/Operations Management;
· Organizational Behavior & Theory;
· Strategic Management Policy;
· Labor Relations & Human Resource Management;
· Technology & Innovation;
· Public Responsibility and Ethics;
· Public Administration and Small Business Entrepreneurship.

Please feel free to share this information with your colleagues and associates.

Thank you.

Best Regards,

Editorial Assistant
Tel: ** ext.**
Fax: **
E-mail 1: **
E-mail 2: **
URL: **

Usually I just delete these things, but just the other day we had this discussion of some dude who was paid $100,000 to be the second author on a paper. Which made me wonder how much I could make as a sole author!

And this reminded me of this other guy who claimed that scientific citations are worth $100,000 each. A hundred grand seems like the basic unit of currency here.

So I sent a quick response:

Hi–how much will you pay me to write an article for your journal?

I’m not expecting $100,000 as their first offer—they’ll probably lowball me at first—but, hey, I can negotiate. They say the most important asset in negotiation is the willingness to say No, and I’m definitely willing to say No to these people!

Just a few hours later I received a reply! Here it is:

Dear Dr. Andrew Gelman,

Thanks for your email. We charge the Article Processing Charge (Formatting and Hosting) of 100USD for per article.

Welcome to submit your manuscript to our journal. If you have any questions, please feel free to contact me.

Best Regards,

Editorial Assistant
Tel: ** ext.**
Fax: **
E-mail 1: **
E-mail 2: **
URL: **

I don’t get it. They’re offering me negative $100? That makes no sense? What next, they’ll offer to take my (fully functional) fridge off my hands for a mere hundred bucks?? In what world am I supposed to pay them for the fruits of my labor?

So I responded:

No, I would only provide an article for you if you pay me. It would not make sense for me to pay you for my work.

No answer yet. If they do respond at some point, I’ll let you know. We’ll see what happens. If they offer me $100, I can come back with a counter-offer of $100,000, justifying it by the two links above. Then maybe they’ll say they can’t afford it, they’ll offer, say, $1000 . . . maybe we can converge around $10K. I’m not going to share the lowest value I’d accept—that’s something the negotiation books tell you never ever to do—but I’ll tell you right now, it’s a hell of a lot more than a hundred bucks.

P.S. That paper on higher-order polynomials that they scraped, er, carefully vetted for suitability for their journal . . . according to Google Scholar it has 1501 citations, which implies a value of $150,100,000, according to the calculations referred to above. Now, sure, most of that value is probably due to Guido, my collaborator on that paper, but still . . . 150 million bucks! How hard could it be to squeeze out a few hundred thousand dollars for a sequel? It says online that Knives Out grossed $311.4 million, and Netflix paid $469 million for the rights for Knives Out 2 and 3. If this academic publisher doesn’t offer me a two-paper deal that’s at least in the mid four figures, my agent and I will be taking our talents to Netflix.

If the outcome is that rare, then nothing much can be learned from pure statistics.

Alain Fourmigue writes:

You may have heard of this recent controversial study on the efficacy of colchicine to reduce the number of hospitalisations/deaths due to covid.

It seems to be the opposite of the pattern usually reported on your blog.

Here, we have a researcher making a bold claim despite the lack of statistical significance,
and the scientific community expressing skepticism after the manuscript is released.

This study raises an interesting issue: how to analyse very rare outcomes (prevalence < 1%)? The sample is big (n>4400), but the outcome (death) is rare (y=14).
The SE of the log OR is ~ sqrt(1/5+1/9+1/2230+1/2244).
Because of the small number of deaths, there will inevitably be a lot of uncertainty.
Very frustrating…

Is there nothing we could do?
Is there nothing better than logistic regression / odds ratios for this situation?
I’m not sure the researcher could have afforded a (credible) informative prior.

I replied that, yes, if the outcome is that rare then nothing much can be learned from pure statistics. You’d need a model that connects more directly to the mechanism of the treatment.
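
To see how little the big sample helps, here is a rough sketch of the calculation behind the standard error quoted in the email above. It assumes the four terms of that formula are the cells of a 2x2 table: 5 deaths and 2230 survivors in one arm, 9 deaths and 2244 survivors in the other. The email does not say which arm is which, so the arms are labeled A and B here, and the variable and function names are made up for illustration.

```python
import math

# Counts read off the four terms of the SE formula in the email.
# Which arm received the treatment is not stated there, so the arms
# are labeled generically (an assumption of this sketch).
deaths_a, survivors_a = 5, 2230
deaths_b, survivors_b = 9, 2244


def log_odds_ratio(a, b, c, d):
    """Log odds ratio for a 2x2 table: events a, c and non-events b, d."""
    return math.log((a / b) / (c / d))


def log_odds_ratio_se(a, b, c, d):
    """Large-sample standard error: sqrt of the summed reciprocal counts."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


log_or = log_odds_ratio(deaths_a, survivors_a, deaths_b, survivors_b)
se = log_odds_ratio_se(deaths_a, survivors_a, deaths_b, survivors_b)

# Approximate 95% interval on the odds-ratio scale (normal approximation).
z = 1.96
ci_low = math.exp(log_or - z * se)
ci_high = math.exp(log_or + z * se)

# Fraction of the variance that comes from the two death counts alone.
death_share = (1 / deaths_a + 1 / deaths_b) / se**2

print(f"log OR = {log_or:.3f}, SE = {se:.3f}")
print(f"approx. 95% interval for the OR: ({ci_low:.2f}, {ci_high:.2f})")
print(f"share of variance from the death counts: {death_share:.1%}")
```

Under these assumptions the standard error comes out at about 0.56, and the approximate 95% interval for the odds ratio runs from well below 1 to well above 1. Almost all of the variance comes from the 1/5 and 1/9 terms; the thousands of survivors contribute next to nothing. That is the sense in which the sample is big but the information is small.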