Brutus (2) vs. Mo Willems; Cleary advances

Our pals at McKinsey tell us that “The business case for diversity, equity, and inclusion is stronger than ever,” and Jonathan makes a strong DEI case for the initialed contestant in yesterday’s competition:

Foyt was known as an Indy car driver, but he also won 7 NASCAR races, as well as Le Mans. That’s the sort of diversity you’re looking for in a speaker, right?

Also, since he’s still alive, he might live longer than Beverly Cleary; in other words, her status as a long-lived famous person only probabilistically exceeds Foyt’s chances in a right-censored distribution sense, so her comparison there needs to be discounted. And he’s one tough dude when it comes to longevity.

Also, as Kurtis Blow pointed out all those years ago, A. J. is cool. Don’t you love America, my favorite country?

But Ben then upset the applecart with an important revelation:

I was leaning Foyt for the racing stories. Racing seems like a thing of the past. There are so many cars around these days and I don’t wanna get run over by someone hotrodding! Listening to racing stories as some sort of nostalgia trip seems about right.

Then I learned Beverly Cleary authored The Mouse and the Motorcycle — a definitive autosports book! A. J. Foyt may be a real racer, but as a consumer I’m more interested in racing as a fantasy. Beverly Cleary has a better track record here.

Bev it is. It was A. J.’s misfortune to go up against one of the few other speaker candidates with motorsports experience.

Today’s matchup

The #2 traitor of all time—what can you say, dude’s an absolute legend!—up against the author of some modern classics of the kids-book genre, along with some real stinkers. (I’m looking at you, Elephant and Piggie!) It all comes down to two questions:
1. How do you feel about regicide?
2. Will you let the pigeon drive the bus?
Let the strongest and most amusing arguments win!

Again, here are the announcement and the rules.

When a conceptual tool is used as a practical tool (Venn diagrams edition)

Everyone’s seen Venn diagrams, so they’re a great entry point into various general issues in mathematics and its applications.

The other day we discussed the limitations of Venn diagrams with more than 3 circles as an example of our general failure of intuition in high dimensions.

The comment thread from that post featured this thoughtful reflection from Eric Neufeld:

It’s true that Venn diagrams are not widely applicable. But thinking about this for a few days suggests to me that Venn diagrams play a role similar to truth tables in propositional logic. We can quickly establish the truth of certain tautologies, mostly binary or ternary, with truth tables, and from there move to logical equivalences. And so on. But in a foundational sense, we use the truth tables to assert certain foundational elements and build from there.

Something identical happens with Venn diagrams. A set of basic identities can be asserted and subsequently generalized to more widely applicable identities.

Some find it remarkable that all of logic can be seen as resting on purely arbitrary definitions of two or three primitive truth tables (usually and, or and not). Ditto, the core primitives of sets agree with intuition using Venn diagrams. No intuition for gigantic truth tables or multidimensional Venn diagrams.

That’s an interesting point and it got me thinking. Venn diagrams are a great way to teach inclusion/exclusion in sets, and the fact that they can be cleanly drawn with one, two, or three binary factors underlines the point that inclusion/exclusion with interactions is a general idea. It’s great that Venn diagrams are taught in schools, and if you learn them and mistakenly generalize and imagine that you could draw complete Venn diagrams with 4 or 5 or more circles, that’s kind of ok: you’re getting it wrong with regard to these particular pictures—there’s no way to draw 5 circles that will divide the plane into 32 pieces—but you’re correct in the larger point that all these subsets can be mathematically defined and represent real groups of people (or whatever’s being collected in these sets).
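
That “32 pieces” claim is easy to check numerically: n circles in general position can cut the plane into at most n^2 - n + 2 regions (each new circle crosses each existing one in at most 2 points), while a complete Venn diagram on n sets needs 2^n regions. A minimal sketch:

```python
# Maximum number of plane regions achievable with n circles in general
# position: each new circle meets each old one in at most 2 points, so the
# k-th circle adds at most 2*(k-1) regions, giving n^2 - n + 2 in total.
def max_regions_circles(n):
    return n * n - n + 2

# Compare with the 2**n regions a complete Venn diagram on n sets needs:
for n in range(1, 7):
    print(n, max_regions_circles(n), 2 ** n)
```

The counts agree up to n = 3 (8 regions) and diverge at n = 4 (14 vs. 16), which is why the familiar three-circle picture stops working beyond that.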

Where the problem comes up is not in the use of Venn diagrams as a way to teach inclusions, unions, and intersections of sets. No, the bad stuff happens when they’re used as a tool for data display. Even in the three-circle version, there’s the difficulty that the size of the region doesn’t correspond to the number of people in the subset—and, yes, you can do a “cartogram” version but then you lose the clear “Venniness” of the three-circle image. The problem is that people have in their minds that Venn diagrams are the way to display interactions of sets, and so they try to go with that as a data display, come hell or high water.

This is a problem with statistical graphics: people have a few tools that they’ll use over and over. Or they try to make graphs beautiful without considering what comparisons are being facilitated. Here’s an example in R that I pulled off the internet.

Yes, it’s pretty—but to learn anything from this graph (beyond that there are high numbers in some of the upper cells of the image) would take a huge amount of work. Even as a look-up table, the Venn diagram is exhausting. I think an UpSet plot would be much better.

And then this got me thinking about a more general issue, which is when a wonderful conceptual tool is used as an awkward practical tool. A familiar example to tech people of a certain age would be the computer language BASIC, which was not a bad way for people to learn programming, back in the day, but was not a great language for writing programs for applications.

There must be many other examples of this sort of thing: ideas or techniques that are helpful for learning the concepts but then people get into trouble by trying to use them as practical tools? I guess we could call this, Objects of the class Venn diagrams—if we could just think of a good set of examples.

A different Bayesian World Cup model using Stan (opportunity for model checking and improvement)

Maurits Evers writes:

Inspired by your posts on using Stan for analysing football World Cup data here and here, as well as the follow-up here, I had some fun using your model in Stan to predict outcomes for this year’s football WC in Qatar. Here’s the summary on Netlify. Links to the code repo on Bitbucket are given on the website.

Your readers might be interested in comparing model/data/assumptions/results with those from Leonardo Egidi’s recent posts here and here.

Enjoy, soccerheads!

P.S. See comments below. Evers’s model makes some highly implausible predictions and on its face seems like it should not be taken seriously. From the statistical perspective, the challenge is to follow the trail of breadcrumbs and figure out where the problems in the model came from. Are they from bad data? A bug in the code? Or perhaps a flaw in the model, so that the data were not used in the way that was intended? One of the great things about generative models is that they can be used to make lots and lots of predictions, and this can help us learn where we have gone wrong. I’ve added a parenthetical to the title of this post to emphasize this point. Also good to be reminded that just cos a method uses Bayesian inference, that doesn’t mean that its predictions make any sense! The output is only as good as its input and how that input is processed.

Beverly Cleary (2) vs. A. J. Foyt; Chvátil advances

From yesterday, Jonathan succinctly summarizes the case for Fawkes.

Chvátil makes the rules, Fawkes breaks the rules. Making is harder, but breaking is the rock to its scissors.

+1 for the rosham reference.

Raghu makes the case based on audience interest. I’ll have to share his whole story here:

A few years ago, my [Raghu’s] younger son made a diorama of the houses of parliament with space below for Guy Fawkes’ gunpowder for an elementary school assignment in which they had to say something about a holiday. I took a photo of this. This is of interest, of course, to no one but me. However: I thought about this today and wanted to find the photo; I couldn’t remember when it was taken. I typed the word “cardboard” into Google Photos and, like magic, it came up (along with about 10 other photos I’ve taken of cardboard in its many manifestations). Squabbling about AI comes up a lot on this blog, hence this comment. I was stunned, even though I know how this works. The cleverness of machine learning algorithms and the sheer volume of training data is really amazing.

+1 for referring to recent blog discussions.

And Anonymous gives a strong argument against Chvátil:

If I asked a bunch of people who created Codenames, half of them would say: “Is it on Netflix?” and the other half would say no. There would be one guy who would say “Vladimir something” and a tiny percentage of people who would know. All in all, Guy Fawkes would be interesting while Vladimir is so overrated.

But . . . how can Vladimir be overrated if almost nobody has heard of him? You can’t have it both ways, Anon!

The deciding comment, though, comes from bbis:

Rules are made to be broken and rulers to be blown up. One suggestion would be to go with ‘da bomb’. While that might be a blast, it would probably end too quickly. Also, if you want to be a guy outstanding in your field and invited to give important seminars, you shouldn’t be tunneling under London. Chvatil may be able to discuss how to get a good balance between rigidity and flexibility in rules to get good outcomes.

I’d like to hear about that balance! Also, yeah, tunneling under London, not cool for a seminar.

Today’s matchup

The Sage of Klickitat Street vs. an Indy racing legend. Beverly Cleary was one of the longest-lived famous people ever; A. J. Foyt could drive really really fast. Neither of these things is particularly relevant for a seminar, but both of them would probably have some good stories to share.

Again, here are the announcement and the rules.

Update 2 – World Cup Qatar 2022 Predictions with footBayes/Stan

Time to update our World Cup 2022 model!

The DIBP (diagonal-inflated bivariate Poisson) model performed very well in the first match-day of the group stage in terms of predictive accuracy – consider that the ‘pseudo R-squared’, namely the geometric mean of the probabilities assigned by the model to the ‘true’ final match results, is about 0.4, whereas, on average, the main bookmakers got 0.36.
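
In case the ‘pseudo R-squared’ definition is unfamiliar: it is just the geometric mean of the probabilities the model assigned to the results that actually occurred. A minimal sketch, with invented per-match probabilities rather than the model’s actual output:

```python
import math

def pseudo_r2(probs):
    """Geometric mean of the probabilities assigned to the observed results,
    computed on the log scale for numerical stability."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

# Invented probabilities, one per held-out match (for illustration only):
print(pseudo_r2([0.5, 0.4, 0.3, 0.45]))  # roughly 0.41
```

A value of 0.4 means the model assigned, on average (geometrically), a 40% probability to the results that actually happened.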

It’s now time to re-fit the model after the first 16 group-stage games with the footBayes R package and obtain the probabilistic predictions for the second match-day. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage played from November 25th to November 28th, along with some ppd ‘chessboard plots’ for the exact outcomes in gray-scale color – ‘mlo’ in the table denotes the ‘most likely result’, whereas darker regions in the plots correspond to more likely results.

Plot/table updates (following Andrew’s suggestions from the previous post; we’re still developing these plots to improve their appearance, with some more notes below). In the plots below, the first team listed in each sub-title is the ‘favorite’ (x-axis), whereas the second team is the ‘underdog’ (y-axis). The 2-way grid displays the 16 held-out matches in such a way that closer matches appear at the top-left of the grid, whereas more unbalanced matches (‘blowouts’) appear at the bottom-right. The matches are then ordered from top-left to bottom-right in terms of increasing winning probability for the favorite teams. The table instead reports the matches in chronological order.

The most unbalanced game seems to be Brazil-Switzerland, where Brazil is the favorite with an associated winning probability of about 71%. The closest game seems to be Iran-Wales – Iran just won by a two-goal margin, scored in the last ten minutes! – whereas France is given only a 44% probability of winning against Denmark. Argentina seems to be ahead against Mexico, whereas Spain seems to have a non-negligible advantage in the match against Germany.

Another predictive note: Regarding ‘most-likely-outcomes’ (mlo here above), the model ‘guessed’ 4 ‘mlo’ out of 16 in the previous match-day.

You can find the complete results, R code, and analysis here.

Some more technical notes/suggestions about the table and the plots above:

  • We replaced ‘home’ and ‘away’ by ‘favorite’ and ‘underdog’.
  • I find it difficult to handle ‘xlab’ and ‘ylab’ in faceted plots with ggplot2! (A better solution could in fact be to put the team names directly on the axes of the sub-plots.)
  • The occurrence ‘4’ actually stands for ‘4+’, meaning that it captures the probability of scoring ‘4 or more goals’ (I did not like the tick ‘4+’ in the plot, so for now we just set ‘4’; we could improve this).
  • We could consider adding some global ‘x’ and ‘y’ axes with the probability margins between underdog and favorite. Thus, for Brazil-Switzerland, we would have a tick on the x-axis at approximately 62%, whereas for Iran-Wales at 5%.

For other technical notes and model limitations check the previous post.

Next steps: we are going to update the predictions for the third match-day and even compute some World Cup winning probabilities by simulating the whole tournament ahead.

Stay tuned!

Flood/cyclone risk and the all-coastal-cities-are-equal narrative

Palko writes:

A common, perhaps even the standard framing of rising sea levels is that it’s an existential threat for all coastal cities, and while I understand the desire not to downplay the crisis, this isn’t true. For cities with relatively high elevations like Los Angeles (a few low-lying neighborhoods, but most of it hundreds and some of it thousands of feet above sea level) or cities with at least moderate elevations and little danger from tropical cyclones (like almost all major cities on the West Coast), we are talking about a problem but not a catastrophe . . . the real tragedy of this framing is not that it overstates the threat to the West Coast, but that it dangerously understates the immediate and genuinely existential threat to many cities on the East and Gulf Coasts. While New York City is not in danger of total oblivion the way Miami or Jacksonville are, it is far from safe from the threats associated with rising sea levels. . . . This is one of the things that makes the following New York Times article from a while back so strange.

An estimated 600 million people live directly on the world’s coastlines, among the most hazardous places to be in the era of climate change. . . . Many people face the risks right now. Two sprawling metropolitan areas offer a glimpse of the future. One rich, one poor, they sit on opposite sides of the Pacific Ocean: the San Francisco Bay Area (population 7 million) and metropolitan Manila (almost 14 million).

Their history, their wealth, and the political and personal choices they make today will shape how they fare as the water inevitably comes to their doorsteps.

Palko continues:

The New York Times felt the need to go all the way to San Francisco to do the story despite the fact that New York City has more people, lower elevation, and faces a far, far greater risk from tropical cyclones. This is not quite as bad as the San Francisco Chronicle doing features on earthquakes and wildfire smoke and using NYC as one of the two examples, but it’s close. . . .

The different elevations of Manila and San Francisco and how they affect the impact of rising sea levels is largely undiscussed. There is exactly one mention of tropical storms, none whatsoever of tropical cyclones, and the fact that certain areas are more vulnerable than others is almost completely ignored. All coastal cities are treated as effectively interchangeable. . . .

The all-coastal-cities-are-equal narrative embraced by the New York Times is extraordinarily dangerous. It inevitably underplays the threat to cities from New York all the way to Houston along the coast, particularly in Florida . . .

I continue to think that Mark Palko and David Weakliem should have columns in the Times. Not that they’re perfect, but both of them seem to have the ability to regularly see through the fog of news media B.S. that surrounds us.

Where does the news media B.S. come from? Some of this is bias, some is corruption, but I think a lot is simple recycling of narratives that for one reason or another are appealing to our tale-spinners.

Vladimir Chvátil (3) vs. Guy Fawkes; Hagman advances

Bob laid out the sincere case for J. R.:

Well, having actually encountered Larry Hagman in a quasi-seminar context (he led some acting workshops in my dorm at Caltech back in the day), I’ll have to say he’d probably give a pretty good talk – plus be able to direct the attendees in various improvisation and related exercises. Could be a lot of fun!

From the other direction, Dan advances an intriguing argument from physics:

A quantum superposition of the captain and the rocker could make for an entertaining seminar.

Ultimately, though, the decision was made for me by Raghu, who wrote:

John Paul Jones I’s Wikipedia page notes that he’s the “Father of the American Navy” (a nickname he shares with John Barry and John Adams). Therefore John Paul Jones, John Paul Jones, John Adams, and John Adams all share the same name, which should count for something.

John Barry . . . where have I heard that name before? Yeah, that’s right—he’s the musician who is famous for falsely claiming to have written the James Bond theme. It’s already been established that we have too many British secret agents in this competition, so that rules out John Paul Jones.

Also, all those Johns! I’m reminded of that statistic comparing the number of female CEOs to the number who are named John. Or Bob Carpenter’s remark that when he was in grad school, it was considered a good party when there were more girls than Daves. If John Paul Jones brings along John Barry and any John Adams (or, perhaps, all four of them!), we’ll be overwhelmed with Johns.

Say what you want about Larry Hagman, there’s little risk that he’d show up with overrated economist Larry Summers, legendary fullback Larry Csonka (with the change in offenses in the NFL, there may never again be a star fullback), or crypto shill Larry David, nor do we expect that during the Q-and-A period he’ll be distracting himself by playing the classic video game Leisure Suit Larry.

So Hagman it is.

Today’s matchup

Guy Fawkes is so famous, but he’s not even seeded, which just shows you how strong the Traitors category is. Just wait till we get to Judas and Brutus, neither one of whom was a British secret agent. He’s competing against Vladimir Chvátil, the third seed in the “Creators of laws or rules” category. Codenames unfortunately has a “spy” theme, but that spy thing is really irrelevant to the game play, so I don’t think we have to hold this against him.

Again, here are the announcement and the rules.

Understanding exchangeability in statistical modeling: a Thanksgiving-themed post

Several years ago, on the day after Thanksgiving, I was on the phone with my sister telling her how I’d used turkey leftovers to make a delicious chili. The process started by pulling out the pieces that could be eaten on their own—the drumsticks and the big chunks of meat. The rest went into the chili. The next step was to strip off the small bits of meat, pull off the skin and connective tissue and cut them into little pieces (no matter how long you cook the skin, it won’t soften and separate on its own), throw in several onions, a bunch of red bell peppers, a couple tomatoes, a variety of different sorts of hot pepper, a couple squares of unsweetened chocolate, and a mushed banana, and then cook it all in beer for several hours over low heat. Add more hot peppers to taste, and more red peppers to make it sweeter.

After this long description, there was a pause on the other end of the phone, and then my sister said, “I was with you until the bit about the connective tissue.”

Similarly, many classically-trained statisticians are impressed by Bayesian methods and see their appeal, but they can’t find their way to accept the assumption of exchangeability, which stands behind the hierarchical models that are so central to modern Bayesian practice.

In the simplest hierarchical model, the variation among the underlying effects (relative to the uncertainties in their estimation) determines the degree to which the estimated effects are pooled toward the grand mean. This is commonplace to us, at least since Rubin’s 8-schools paper from 1981, but if you go back to the pre-Bayesian literature from decades ago, you will find discussions of the appropriateness of data pooling, which was considered to be a big step requiring a strong substantive motivation.

As late as 1972, in the discussion of the landmark paper by Lindley and Smith, the respected statistician Oscar Kempthorne wrote, “Is it ‘practically realistic’ to use an exchangeable prior? Information is available in the records to show that schools differ widely, students of different social and ethnic backgrounds perform differently on tests, and so on.” Kempthorne was thinking of exchangeability as a substantive concern, to be addressed by subject-matter knowledge. And subject-matter knowledge is indeed relevant, most notably in setting up the model for the varying parameters. But he was completely wrong to think that “exchangeable” meant “identical.”

An attractive feature of hierarchical Bayesian inference is that it translates questions of statistical method into questions about substantive modeling. For example, the amount of partial pooling in a hierarchical model is connected to the substantive parameter that represents group-level variation.

In contrast, Stein (1956) and James and Stein (1961) proved that shrinkage gives lower mean squared error, compared to unpooled estimates—even when estimating parameters with no relation to each other. This has mystified researchers to the extent that it has been called a paradox (Efron and Morris, 1977).

Our resolution to “Stein’s paradox” is that if the parameters θj being estimated are truly unrelated, then they will likely be very different from each other, in which case the estimated group-level variance will probably be large, and very little actual shrinkage will go on. Conversely, if we have a set of unrelated parameters that happen to be close to each other—perhaps because of measurement protocols under which each is expected to have a value near 1.0, say—then, yes, this is information that can improve inferences in each specific case.
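
This argument can be sketched numerically. In the normal-normal hierarchical model, the posterior mean for each group is a precision-weighted average of the group’s estimate and the grand mean, so the amount of shrinkage depends directly on the group-level scale tau. The numbers below are made up for illustration, not the 8-schools data:

```python
def partial_pool(y, sigma, mu, tau):
    """Posterior mean of theta_j when theta_j ~ N(mu, tau^2) and
    y_j ~ N(theta_j, sigma_j^2): a precision-weighted average."""
    w = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)  # weight on the data
    return w * y + (1 - w) * mu

# Group estimate 10 with standard error 5, grand mean 0:
print(partial_pool(10.0, 5.0, 0.0, 20.0))  # large tau: about 9.4, barely shrunk
print(partial_pool(10.0, 5.0, 0.0, 2.0))   # small tau: about 1.4, pooled hard
```

If the theta_j are truly unrelated and dispersed, the estimated tau will be large and the first case applies (almost no shrinkage); if they happen to sit close together, tau is small and pooling sharpens each estimate.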

Beyond this, nonexchangeability—in the sense of additional information distinguishing the groups in a study—can be included in a Bayesian hierarchical model, typically by including the extra information as group-level predictors. In Kempthorne’s example above, one can use previous school records as a covariate in a regression model. Or, in a more difficult problem involving several parameters, some of which are believed to be similar to each other and others much different, the researcher could fit a mixture model. The point is that aspects of “nonexchangeability” might be better viewed as structure that can be modeled.

Time Series Forecasting: futile but necessary. An example using electricity prices.

This post is by Phil Price, not Andrew.

I have a client company that owns refrigerated warehouses around the world. A refrigerated warehouse is a Costco-sized building that is kept very cold; 0 F is a common setting in the U.S. (I should ask what they do in Europe). As you might expect, they have an enormous electric bill — the company as a whole spends around a billion dollars per year on electricity — so they are very interested in the cost of electricity. One decision they have to make is: how much electricity, if any, should they purchase in advance? The alternative to purchasing in advance is paying the “real-time” electricity price. On average, if you buy in advance you pay a premium…but you greatly reduce the risk of something crazy happening. What do I mean by ‘crazy’? Take a look at the figure below. This is the monthly-average price per Megawatt-hour (MWh) for electricity purchased during the peak period (weekday afternoons and evenings) in the area around Houston, Texas. That big spike in February 2021 is an ice storm that froze a bunch of wind turbines and also froze gas pipelines — and brought down some transmission lines, I think — thus leading to extremely high electricity prices. And this plot understates things, in a way, by averaging over a month: there were a couple of weeks of fairly normal prices that month, and a few days when the price was over $6000/MWh.

Monthly-mean peak-period (weekday afternoon) electricity price, in dollars per megawatt-hour, in Texas.

If you buy a lot of electricity, a month when it costs 20x as much as you expected can cause havoc with your budget and your profits. One way to avoid that is to buy in advance: a year ahead of time, or even a month ahead of time, you could have bought your February 2021 electricity for only a bit more than electricity typically costs in Texas in February. But events that extreme are very rare — indeed I think this is the most extreme spike on record in the whole country in at least the past thirty years — so maybe it’s not worth paying the premium that would be involved if you buy in advance, month after month and year after year, for all of your facilities in the U.S. and Europe. To decide how much electricity to buy in advance (if any) you need at least a general understanding of quite a few issues: how much electricity do you expect to need next month, or in six months, or in a year; how much will it cost to buy in advance; how much is it likely to cost if you just wait and buy it at the time-of-use rate; what’s the chance that something crazy will happen, and, if it does, how crazy will the price be; and so on.
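
The buy-ahead tradeoff can be made concrete with a toy expected-cost calculation; every number below (forward premium, baseline price, spike size and probability) is invented for illustration, not the client’s actual data:

```python
# Expected real-time cost per MWh when a price spike occurs with some small
# probability; all figures here are hypothetical.
def expected_realtime_cost(base_price, spike_price, p_spike):
    return (1 - p_spike) * base_price + p_spike * spike_price

forward_price = 55.0  # $/MWh: assumed baseline plus a hedging premium
expected_rt = expected_realtime_cost(50.0, 6000.0, 0.001)

print(expected_rt)  # about 55.95, close to the forward price in expectation
```

Even when the expected costs are nearly equal, the forward purchase eliminates the tail: the hedged buyer never pays $6000/MWh. So the decision hinges on risk tolerance and budget variance, not just the mean.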

Continue reading

Larry Hagman (4) vs. John Paul Jones; Bowie advances

Yesterday’s commenters were unanimous so I don’t have much of a choice. Manuel’s contribution is not bad:

Are we going to let pass our opportunity to be heroes, even just for one seminar? Sorry, Blade, but I don’t mind at all that Bowie can’t jump.

And then this from Jonathan:

A lot of these contests are contests about clever ways to characterize the combatants, with a winner tag attached for some reason. David Bowie is one of the few who I actually think would have interesting things to say that aren’t just stories from his biography.

So, a sincere vote for Bowie.

Fair enough. We haven’t had many sincere votes, so let’s do that here.

Today’s matchup

Larry Hagman is fourth-seeded in the “People known by initials” category, and, as for John Paul Jones, I’m not sure if he’s the Zep alum who shares a name with the famous captain of the Bonhomme Richard, or if he’s the captain of the Bonhomme Richard who shares a name with the famous Zep alum.

So it’s Who Shot J. R. vs. either a British dude or someone who shot at British dudes. Ahoy!

Again, here are the announcement and the rules.

Bigshot chief scientist of major corporation can’t handle criticism of the work he hypes.

A correspondent who wishes to remain anonymous points us to this article in Technology Review, “Why Meta’s latest large language model survived only three days online. Galactica was supposed to help scientists. Instead, it mindlessly spat out biased and incorrect nonsense.” Here’s the story:

On November 15 Meta unveiled a new large language model called Galactica, designed to assist scientists. But instead of landing with the big bang Meta hoped for, Galactica has died with a whimper after three days of intense criticism. Yesterday the company took down the public demo that it had encouraged everyone to try out.

Meta’s misstep—and its hubris—show once again that Big Tech has a blind spot about the severe limitations of large language models. There is a large body of research that highlights the flaws of this technology, including its tendencies to reproduce prejudice and assert falsehoods as facts.

However, Meta and other companies working on large language models, including Google, have failed to take it seriously. . . .

There was some hype:

Meta promoted its model as a shortcut for researchers and students. In the company’s words, Galactica “can summarize academic papers, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.”

Actually, though:

Like all language models, Galactica is a mindless bot that cannot tell fact from fiction. Within hours, scientists were sharing its biased and incorrect results on social media. . . . A fundamental problem with Galactica is that it is not able to distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text. People found that it made up fake papers (sometimes attributing them to real authors), and generated wiki articles about the history of bears in space as readily as ones about protein complexes and the speed of light. It’s easy to spot fiction when it involves space bears, but harder with a subject users may not know much about.

I’d not heard about this Galactica thing at all, but the article connected to some things I had heard about:

For the last couple of years, Google has been promoting language models, such as LaMDA, as a way to look up information.

A few months ago we discussed that Google chatbot. I was disappointed that the Google engineer was willing to hype it but not to respond to reasoned criticisms of his argument.

The Technology Review article continues:

And it wasn’t just the fault of Meta’s marketing team. Yann LeCun, a Turing Award winner and Meta’s chief scientist, defended Galactica to the end. On the day the model was released, LeCun tweeted: “Type a text and Galactica will generate a paper with relevant references, formulas, and everything.” Three days later, he tweeted: “Galactica demo is off line for now. It’s no longer possible to have some fun by casually misusing it. Happy?”

I hate twitter. LeCun also approvingly links to someone else who writes, in response to AI critic Gary Marcus:

or maybe it [Galactica] was removed because people like you [Marcus] abused the model and misrepresented it. Thanks for getting a useful and interesting public demo removed, this is why we can’t have nice things.

Whaaa?

Let’s unpack this last bit for a moment. Private company Meta launched a demo, and then a few days later they decided to remove it. The demo was removed in response to public criticisms, and . . . that’s a problem? “We can’t have nice things” because . . . outsiders are allowed to criticize published material?

This attitude of LeCun is ridiculous on two levels. First, and most obviously, the decision to remove the demo was made by Meta, not by Marcus. Meta is one of the biggest companies in the world; they have some agency, no? Second, what’s the endgame here? What’s LeCun’s ideal? Presumably it’s not a world in which outsiders are not allowed to criticize products. So what is it? I guess the ideal would be that Marcus and others would voluntarily suppress their criticism out of a public-spirited desire not to have Meta take “nice things” away from people? So weird. Marcus doesn’t work for your company, dude.

The funny thing is that the official statement from Meta was much more reasonable! Here it is:

Thank you everyone for trying the Galactica model demo. We appreciate the feedback we have received so far from the community, and have paused the demo for now. Our models are available for researchers who want to learn more about the work and reproduce results in the paper.

I don’t quite understand what it means for the demo to have been paused if the models remain available to researchers, but in any case they’re taking responsibility for what they’re doing with their own code; they’re not blaming critics. This is a case where the corporate marketing team makes much more sense than the company’s chief scientist.

This all relates to Jessica’s recent post on academic fields where criticism is suppressed, where research critique is taken as personal attacks, and where there often seems to be a norm of never saying anything negative. LeCun seems to have that same attitude, not about research papers but about his employer’s products. Either way, it’s the blame-the-critic game, and my take is the same: If you don’t want your work criticized, don’t make it public. It’s disappointing, but all too common, to see scientists who are opposed to criticism, which is essential to the scientific process.

The big picture

Look. I’m not saying LeCun is a bad person. I don’t know the guy at all. Anybody can have a bad day! One of his company’s high-profile products got bad press, so he lashed out. Ultimately no big deal.

It’s just . . . that idea that outside criticism is “why we can’t have nice things” . . . at worst this seems like an authoritarian attitude and at best it seems to reflect an extreme naivety about how science works. I guess that without outside criticism we’d all be driving cars that run on cold fusion, cancer would already have been cured 100 times over, etc.

P.S. I sent the above post to some people, and we got involved in a discussion of whether LeCun in his online discussions is “attacking” Galactica’s critics. I said that, from my perspective, LeCun is disagreeing with the critics but not attacking them. To this, Thomas Basbøll remarked that, whether the critics are on the “attack” or not, LeCun is certainly on the defensive, reacting to the criticism as though it’s an attack. Kind of like calling it “methodological terrorism” or something.

That’s an interesting point regarding LeCun being on the defensive. Indeed, instead of being in the position of arguing how great this product is for humanity, he’s spending his time arguing how it’s not dangerous. I can see how this can feel frustrating from his end.

P.P.S. LeCun responds here in comments.

David Bowie (1) vs. Wesley Snipes; Rosenberg advances

Commenters on yesterday’s contest reaffirm that Miles Davis was, indeed, cool. From Alan:

I’ve probably listened to almost everything that Miles recorded and he is super-cool. My favorite single track has always been ‘Someday My Prince Will Come’ with Wynton Kelly (piano), Paul Chambers (bass), Jimmy Cobb (drums) and the estimable John Coltrane (sax), great solos from everyone. As a seminar speaker, Coltrane is a more obvious choice but he is not one of the contestants (maybe next time). I think Miles would probably turn his back on the audience but no mind, just to hear him talk about the conception of ‘Kind of Blue’ and why he moved on to fusion in the late 1960s would be worth the price of admission. Even if he didn’t speak but just played his trumpet it would be outstanding. Perhaps he can bring along Gil Evans and they could discuss ‘Sketches of Spain’, which is maybe my favorite Miles album.

On the other hand, Ben argues:

We’ve already had what, three British intelligence officers come through? I think Ethel’s our only shot at balancing this dang thing out! Also this’ll give the MI-whatever people something to do — we can put Ethel later than them in the batting order and it’d for sure drive them crazy.

The tiebreaker came from Ethan, who wrote:

Miles Davis might blow us the best colloquium ever. I’d rather hear him in a smoke filled club than in the large hall you’d have to find, though of course I’d join the crowd.

But I can hear Miles all over the internet – (search Miles Davis youtube). I’ve no idea what Ethel Rosenberg would say. I want to find out. Let’s move out of our comfort zone.

Good point. I, too, have no idea what Rosenberg would say.

Today’s matchup

The Thin White Duke, as you surely recall, was originally named David Jones but decided to use the name Bowie to avoid confusion with Davy Jones of the Monkees.

Meanwhile, Wesley Snipes is an alleged tax cheat—I guess it’s ok to call him an actual tax cheat, given that he served nearly 2 1/2 years in prison for tax evasion. The relevant section of his wikipedia entry is so bizarre I’ll have to just repeat it here:

On October 12, 2006, Snipes, Eddie Ray Kahn, and Douglas P. Rosile were charged with one count of conspiring to defraud the United States and one count of knowingly making or aiding and abetting the making of a false and fraudulent claim for payment against the United States. Snipes was also charged with six counts of willfully failing to file federal income tax returns by their filing dates. The conspiracy charge against Snipes alleged that he filed a false amended return, including a false tax refund claim of over $4 million for the year 1996, and a false amended return, including a false tax refund claim of over US$7.3 million for the year 1997. The government alleged that Snipes attempted to obtain fraudulent tax refunds using a tax protester theory called the “861 argument” (essentially, an argument that the domestic income of U.S. citizens and residents is not taxable). The government also charged that Snipes sent three worthless, fictitious “bills of exchange” for $14 million to the Internal Revenue Service (IRS).

The government also charged that Snipes failed to file tax returns for the years 1999 through 2004. Snipes responded to his indictment in a letter on December 4, 2006, declaring himself to be “a non-resident alien” of the United States; in reality, Snipes is a birthright U.S. citizen. Such tactics are common of the “Freemen”, “Sovereign Citizen”, or “OPCA” (Organized Pseudolegal Commercial Argument) category of litigation strategy.

So, some interesting parallels here. Bowie played an alien in The Man Who Fell to Earth, but Snipes claimed to be an alien in a failed attempt to avoid paying taxes. Snipes was a better actor than Bowie but was not persuasive in that particular role.

Again, here are the announcement and the rules.

4th down update and my own cognitive illusion

Following up on our recent discussion regarding going for it on 4th down, Paul Campos writes:

The specific suggestion here is that tactics that might make sense in much lower scoring eras cease to make sense when scoring becomes higher, but neither coaches nor fans adjust to the new reality, or adjust very slowly.

This explanation doesn’t really work for the NFL, since scoring in that league has been remarkably stable for the entire post-WWII era. When we look at NFL scoring averages, it’s obvious that the game’s rules makers are constantly tweaking the rules to maintain a balance between offense and defense that results in a scoring average of about 20-23 points per game per team, with significant changes being made whenever — such as in the late 1970s when pass blocking rules were liberalized — scoring begins to fall outside this very narrow range.

I had no idea! I remember when I was a kid there was a Super Bowl that was 16-6. Before that the Dolphins beat the Redskins 14-6, and then there was that Jets-Colts Super Bowl which was a few years before my time. Nowadays it seems like the games all end up with scores like 42-37. So it had been my general impression that average points per game had approximately doubled during the past few decades.

Actually, though, yeah, at least in the regular season the scoring has been very stable, going from an average of 20.5 points per team per game in 1980 to an average of 23.0 in 2021. OK, 23.0 is a bit higher than 20.5 (and I’m not cheating here by picking atypical years; you can follow the above link to see the numbers).

Also, I was a football fan in the mid-70s, which was a relatively low-scoring period, with about 19 points per team per game on average.

My cognitive illusion

So yes, there has been an increase in scoring during the past several decades, but not by nearly as much as I’d thought. I feel like there’s an illusion here, which has two steps:

1. A 12% increase (from 20.5 points per game to 23.0) might seem small, especially when spread out over decades, but it was actually noticeable to a casual observer.

2. I did notice the increase, but in noticing it I way overestimated it.

I wonder if my error is similar to the error that the economists Gertler et al. made when overestimating the effect of early childhood intervention. As you might recall, they reported a statistically significant effect of 42% on earnings. But to be “statistically significant,” the estimate had to be at least about 40%. If you follow the general procedure of reporting statistically significant results, your estimates will be biased upward in magnitude (“type M error”).
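To see the type M mechanism in action, here’s a quick simulation sketch. The numbers are invented for illustration and are not taken from Gertler et al.: imagine a true effect of 10 with a standard error of 20, and a convention of only reporting estimates that clear the 1.96-standard-error bar.

```python
import numpy as np

rng = np.random.default_rng(42)

true_effect = 10.0   # invented "true" effect
se = 20.0            # invented standard error of each study's estimate

# Simulate many replications of the same noisy study.
estimates = rng.normal(true_effect, se, size=100_000)

# Keep only the "statistically significant" positive results,
# i.e., estimates more than 1.96 standard errors above zero.
significant = estimates[estimates > 1.96 * se]

print(f"true effect:                {true_effect:.1f}")
print(f"mean of all estimates:      {estimates.mean():.1f}")
print(f"mean significant estimate:  {significant.mean():.1f}")
```

The significant estimates average several times the true effect: conditioning on clearing the significance threshold guarantees exaggeration whenever the true effect is small relative to the noise.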

Now consider my impressions of trends in football scoring. Whatever impression I had of these trends came from various individual games that I’d heard about: not a random sample but a small sample in any case. Given that average scores have increased in the past few decades, it makes sense that my recollections would also be of an increase—but my recollections represent a very noisy estimate. Had I remembered not much change, I wouldn’t think much about it. But the games that happened to come to mind were low-scoring games in the past and high-scoring recent games. Also, it could be that trends in Super Bowl scores are different than trends in regular-season averages. In any case, the point is that I’m more likely to notice big changes; thus, conditional on my noticing something, it makes sense that my estimate was an overestimate.

One thing that never seems to come up in these discussions is that the fans (or at least, some large subset of “the fans”) want less punting and more chances. As I wrote in my original post, as a kid, I always loved when teams would go for it on 4th down or try an onside kick or run trick plays like fake punts, double reverses, etc.

A different issue that some people brought up in comments was that the relative benefits of different offensive strategies will in general depend on what the defenses are doing. Still, I’m guessing it will pretty much always be a good idea to go for it with 4th-and-2 on the 50-yard line early in the game, and for many years this was more of an automatic punt situation.
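To make the expected-value logic concrete, here’s a toy calculation for 4th-and-2 at midfield. Every number in it is an invented round figure for illustration, not real NFL analytics:

```python
# Toy expected-points comparison for 4th-and-2 at the 50-yard line.
# All numbers below are made-up round figures, not estimates from data.

p_convert = 0.60        # assumed chance of gaining the 2 yards

ep_if_convert = 2.0     # our expected points with a fresh set of downs near midfield
ep_if_fail = -1.5       # opponent takes over at midfield (their EP, negated)
ep_if_punt = -0.5       # opponent takes over deep in their own territory

ev_go = p_convert * ep_if_convert + (1 - p_convert) * ep_if_fail
ev_punt = ep_if_punt

print(f"EV(go for it): {ev_go:+.2f} points")
print(f"EV(punt):      {ev_punt:+.2f} points")
```

Under these assumptions, going for it comes out ahead by about a point; the conclusion is of course only as good as the invented inputs, but it shows why the automatic punt is hard to defend in this situation.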

Only positive reinforcement for researchers in some fields

This is Jessica. I was talking recently to another professor in my field about a talk one of us was preparing. At one point the idea came up of mentioning, in a critical light, some well known recent work in the field, since that work had failed to consider an important aspect of evaluation that would help make one of the points in the talk. I thought it seemed reasonable to make the comment, but my friend (who is more senior than me) said, ‘We can’t do that anymore. We used to be able to do that.’ I immediately knew what they meant: that you can’t publicly express criticism of work done by other people these days, at least not in HCI or visualization.

What I really mean by “you can’t publicly express criticism” is not that you physically can’t or even that some people won’t appreciate it. Instead it’s more that if you do express criticism or skepticism about a published piece in a public forum outside of certain established channels, you will be subject to scrutiny and moral judgment, for being non-inclusive or “gate-keeping” or demonstrating “survivor bias.” The overall sentiment is that expressing skepticism of the quality of some piece of research outside the “proper” channels of reviewing and post-conference-presentation Q&A makes you somehow threatening to the field. It’s like people assume that critique cannot be helpful unless it’s somehow balanced with positives or provided in the context of some anonymous format or at a time when authors have prepared themselves to hear comments and will therefore not be surprised if someone says something critical. Andrew has of course commented numerous times on similar things in prior posts.

I write these views as someone who dislikes conflict and publicly bringing up issues in other people’s work. If I’m critiquing something, my style tends to involve going into detail to make it seem more nuanced and shading the critique with acknowledgement of good things to make it seem less harsh. Or if there are common issues I might write a critical paper pointing to the problems in the context of making a bigger argument so that it feels less directed at any particular authors. But I don’t think all this hedging should be so necessary. Criticism in science should be acceptable regardless of how it comes up, and you can’t imply it should go away without seeming to contradict the whole point of doing research. This has always seemed like a matter of principle to me, even back when I was getting critiqued myself as a PhD student and not liking it. So I still get surprised sometimes when I realize that my attitude is unusual, at least in the areas I work in.

One thing I really dislike is the idea that it’s not possible to be both an inclusive field and a field that embraces criticism, as if the only way to have the former is to suppress the latter. It’s unfortunate, I guess, that some fields that embrace criticism are not very diverse (say, finance or parts of econ), while other fields that prioritize novelty and diversity of methods over critiquing what exists, like HCI or visualization, do pretty well in terms of attracting women and other groups.

In a different conversation, the same friend mentioned how once, when they were giving an invited seminar talk at another university, another professor we know there made some critical comments and my friend got into a back-and-forth with them about the research. My friend didn’t think much of it, but as the visit went on, got the impression that some of the PhD students and other junior scholars who had attended saw the critique and exchange as embarrassing (to my friend) and inappropriate. This was surprising to my friend, who felt it was totally normal and fine that the audience member had given blunt remarks after the talk. I had a similar experience during an online workshop a few months back, where a senior, well-known faculty member in the audience had multiple critical comments and questions for the keynote speaker, which I thought made for a great discussion. But others seemed to view it as an extreme event that bordered on inappropriate.

Related to all this, I sometimes get the sense that many people see it as predetermined that open criticism will have more negative consequences than positive, because it will a) undermine the apparent success of the field and/or b) discourage junior scholars, especially those that bring diversity. On the latter, I’m not sure how much evidence people opposed to criticism have in mind, versus whether they can simply imagine a situation where some junior person gets discouraged. But a different way to think about it could be: it’s the responsibility of the broader field, not just the critic, if we have junior researchers fleeing in light of harsh critique. I.e., where are the support structures if all it takes is one scathing blog post? There’s sort of an “every man for himself” attitude that overlooks how much power mentors have in supporting students who get critiqued. Similarly, there’s a tendency to downplay how one person’s research getting critiqued is often less about that particular person being incompetent than about the various ways in which methods get used or claims get made in a field that are conventional but flawed. If we viewed critique more from the standpoint of ‘we’re all in this together,’ maybe it would be less threatening.

A few months ago I wrote a post on my other blog that tries to imagine what it would look like to be more open-minded about critique, e.g., by taking for granted that we are all capable of making mistakes and updating our beliefs. I would like to think it is possible to have healthy open critique. But sometimes when I sense how uncomfortable people are with even talking about critique, I wonder if I’m being naive. For all the progress I’ve seen in my field in some respects (including more diversity in background/demographics, and better application of statistical methods) I haven’t really seen attitudes on critique budge.

Miles Davis (1) vs. Ethel Rosenberg; Dahl advances

Yesterday’s competition pitted Roald Dahl, an author who was also an intelligence officer, against Jane Fonda, a traitor (according to Phil) who was also the author of a bestselling book.

John finds another connection between the two:

Jane Fonda and Roald Dahl’s careers butted against each other exactly once, in the April 1974 issue of Playboy. The mag featured an interview with Jane Fonda and Tom Hayden, followed immediately by Roald Dahl’s short story “The Switcheroo”. But the highlight of the issue had to be the photo spread featuring Zero Mostel.

Shame Zero Mostel didn’t qualify for the competition. You’ve got people known by initials, but not people known by numerals. It’s not fair!

It’s too late this year for 0, but this does make me wonder who else would be in this category. R2-D2 and C-3PO for sure. Barbara Feldon. Gilbert Arenas, as long as we have a metal detector at the door. Bo Derek. OK, that’s six people in this group; just need two more.

Bob presents the serious case for Dahl:

Dahl was an amazing individual. I have read (cannot find the reference right now) that Ian Fleming modeled James Bond’s ability to charm women after Dahl’s. Dahl wrote the screenplay for the Bond movie You Only Live Twice.

Dahl met and worked with an amazing range of well-known people. Eleanor Roosevelt invited Dahl to White House dinners and to stay with them at Hyde Park. At this time Dahl was working for British intelligence. He ended up spending many weekends at Hyde Park.

Apparently, spurred by medical problems of his son and equipment failures, he put together and coordinated a team that developed a valve for drainage of head injuries. It was used about 3000 times before superior technology was developed. The team refused royalties and the device was basically sold at cost.

Dahl would have lots of material for a talk.

This clincher came from Manuel:

Karma would seem to dictate Dahl is chosen, but I cannot help thinking that Jane Fonda is witty and brilliant enough to turn the tables and show that the joke is on us, if given the opportunity. I go with Fonda, but I would just ask her politely not to include workout videos in the slides.

But . . . we can’t do that! So Fonda might well include workout videos in her slides, and we wouldn’t want that!

Let me tell you a story. A few years ago I was invited to speak at an Ivy League university, in a department that was neither statistics nor political science. The seminar organizer said I could speak on whatever I wanted, as long as I did not bring up a certain topic, as it might bother one of the faculty in the department. I did as asked—and it didn’t even matter, because that professor didn’t even come to my talk! Ever since, I’ve been bothered by any request that a speaker avoid some sensitive subject.

Jonathan writes regarding Dahl that “you need to be *really* careful what you serve at the seminar buffet, lest you find yourself an accessory to something-or-other.” But that’s ok. I won’t stand for any constraints on speech, but food restrictions are fine.

The rules

One more comment came in. Person asks:

Who wins the rules?
Rule of Cool: Fonda was in “9 to 5”, a decidedly uncool movie, however funny. Dahl was an ace fighter pilot.
Rule of Kill: Fonda tried to save lives by ending the Vietnam War. Dahl, once again, was an ace fighter pilot, which means he must have killed at least five people.
Rule of Law: Fonda was obviously a bit transgressive, you know how those actor types are. But Dahl was out there committing legitimate crimes as a child.
Rule of Three: Fonda had three husbands and three children. Dahl had only two wives and five children.
Rule of the road: Fonda was American, so she drove on the right side of the road. Dahl was British, so he drove on the wrong side of the road.

This doesn’t really answer the question of whom to invite, but it seems worth sharing.

Today’s matchup

Miles Davis is so damn cool that he might well give the entire talk facing away from the audience. How cool is that?? On the other side is atomic spy Ethel Rosenberg, who might have some atomic secrets to share with us!

What do you think?

Again, here are the announcement and the rules.

A proposal for the Ingram Olkin Forum on Statistics Serving Society

This email came in from the National Institute of Statistical Sciences:

Ingram Olkin (S3) Forums: Call for Proposals

The Statistics Serving Society (S3) is a series of forums to honor the memory of Professor Ingram Olkin. Each forum focuses on a current societal issue that might benefit from new or renewed attention from the statistical community. The S3 Forums aim to bring the latest innovations in statistical methodology and data science into new research and public policy collaborations, working to accelerate the development of innovative approaches that impact societal problems. As the Forum will be the first time a particular group of experts will be gathered together to consider an issue, new energy and synergy is expected to produce a flurry of new ideas and approaches.

S3 Forums aim to develop an agenda of statistical action items that are needed to better inform public policy and to generate reliable evidence that can be used to mitigate the problem.

Upcoming Forum

Advancing Demographic Equity with Privacy Preserving Methodologies

Previous Forums Included

Police Use of Force
COVID and the Schools: Modeling Openings, Closings and Learning Loss
Algorithmic Fairness and Social Justice
Unplanned Clinical Trial Disruptions
Gun Violence – The Statistical Issues

Here’s my proposal. A forum on the use of academic researchers to confuse people about societal harms. The canonical example is the cigarette industry hiring statisticians and medical researchers to muddy the waters regarding the smoking-cancer link.

Doing this for the Ingram Olkin Forum is perfect, because . . . well, here’s the story from historian Robert Proctor about an episode from the 1970s:

Ingram Olkin, chairman of Stanford’s Department of Statistics, received $12,000 to do a similar job (SP-82) on the Framingham Heart Study . . . Lorillard’s chief of research okayed Olkin’s contract, commenting that he was to be funded using “considerations other than practical scientific merit.”

The National Institute of Statistical Sciences is located at Research Triangle Park in North Carolina, in the heart of cigarette country, so it would be a perfect venue for such a discussion, especially with this connection to Olkin.

I’m not much of a conference organizer myself, so I’m putting this out there so that maybe one of you can propose this for an Ingram Olkin Forum. Could be a joint Olkin/Fisher/Zellner/Rubin forum. Lots of statistical heavyweights were involved in this one.

A forum on the use of academic researchers to confuse people about societal harms. What better way for statistics to serve society, no?

Jane Fonda (4) vs. Roald Dahl; Winkler advances

Commenters were mostly favoring the alleged tax cheat over the cool person, with all the arguments turning on Capone’s career as a mobster:

Brian: “Big Al is still buried somewhere under Giants Stadium. Gotta go with Capone here.”

Chipmunk: “Does anyone have to ask if Capone would be more interesting than Winkler? Zed’s dead baby, but resurrection is a thing.”

Ben: “Given the discussion in the Xiao-Li post a couple days ago (do the extra words help reader or no?), and then the Google post (let’s just drop everything but a few keywords), I think Al is a slam dunk. The idea is that mobster talk is always trying to come at a point sideways without actually saying the thing, so we would maximize decoration and minimize content. So it’s sort of the opposite, and that’s fun. ‘What was the population of the United States in 1860?’ -> ‘Say Lincoln comes to me and says, “I want to buy a hat for everyone”, how many hats would that be?'”

Extra credit to me for reformatting the quotation marks to quote Ben’s comments, thus going to the elusive 3-quotes-deep level which we associate with Joseph Conrad and not anybody else.

Anyway, yeah, all these mobster things are interesting, but . . . (1) they’re basically variants of the “mobsters are cool” argument, but Winkler’s already in the “Cool people” category, and (2) Winkler may not himself be a mobster, but he hangs out with Chechen gangsters which is close enough. So I’ll have to go with the Fonz, despite Raghu’s report that he is no longer a culture hero to modern youth.

Today’s matchup

Jane Fonda is listed as the fourth seed in the Traitors category. Just to be clear: this is a joke. Jane Fonda is not really a traitor. She exercised her right as a resident of a free country to dissent from its government’s policies.

Roald Dahl is unseeded in the Children’s book authors category. Our local expert thought this was a scandal that Dahl was not seeded, but look at who got the top seeds in that category and you’ll see it was a tough call. Dahl was brilliant but no way would I rank him above Beverly Cleary, for example. But that’s as a writer. Here we’re talking seminar speakers.

Other relevant information: Dahl was a fighter pilot and intelligence officer, not a traitor at all. From the other side, Fonda wrote Jane Fonda’s Workout Book, unfortunately for adults, not kids.

Again, here are the announcement and the rules.

Going for it on 4th down: What’s striking is not so much that we were wrong, but that we had so little imagination that we didn’t even consider the possibility that we might be wrong.

In retrospect, it’s kind of amazing how narrow our sports thinking used to be. As a kid, I always loved when teams would go for it on 4th down or try an onside kick or run trick plays like fake punts, double reverses, etc., but I just assumed that the standard by-the-book approach was the best. The idea that going for it on 4th down was not just fun but also a smart move . . . I had no idea, and I don’t recall any sportswriters or TV commentators suggesting it.

That said, I know next to nothing about football analytics, and it’s possible that these unconventional plays had less of an expected-value payoff back in the 70s when field position was more important and points were harder to come by.

I guess part of the problem is, to use some psychology and statistics jargon, a cognitive bias induced by ecological correlation. There always were some teams that tried unconventional plays, but they tended to be less successful teams trying these tactics as a last resort. The Oklahomas, the Michigans, the Vikings and Steelers didn’t need this sort of thing. The only out-of-the-ordinary tactic I can remember being routinely used was Dallas’s two-minute offense with Roger Staubach in the shotgun, but that was a rare exception, as I recall.

Consider a sequence over the decades:

1. Tactics are developed during the play-in-the-mud, Army-beats-Navy-3-to-0 era.

2. Conservative coaches stick with these tactics for decades.

3. Spectators are so used to things being done that way that they don’t even question it.

4. Analytics revolution.

5. Even now, coaches shade toward the conservative choices, even when stakes are high.

We’re now in step 5. In his above-linked post, Campos expresses frustration about it. And I get his frustration, as this is similar to my frustrations about misconceptions in science, or clueless political reporting, or whatever. But what really intrigues me is step 3, the subject of this post, which is how we were so deep inside this particular framework of assumptions that we couldn’t even see out. Or, it’s not that we couldn’t see out, but that we didn’t even know we were inside all this time.

I wonder what Gerd Gigerenzer, Daniel Kahneman, Josh “hot hand” Miller, and other experts on cognitive illusions think about this one.

P.S. We discussed some of this back in 2006, but there we were focused on the question of why teams almost always punted on 4th down. Now that it’s become routine to go for it on 4th down, the question shifts to why it took so long and why the new approach hasn’t completely taken over.

Football World Cup 2022 Predictions with footBayes/Stan

It’s time for football (aka soccer) World Cup Qatar 2022 and statistical predictions!

This year my collaborator Vasilis Palaskas and I implemented a diagonal-inflated bivariate Poisson model for the scores through our `footBayes` R CRAN package (which depends on the `rstan` package), using as a training set more than 3000 international matches played from 2018 to 2022. The model incorporates dynamic autoregressive priors for the team-specific attack and defense abilities, with the FIFA/Coca-Cola ranking difference as the only predictor. The model, first proposed by Karlis & Ntzoufras in 2003, extends the usual bivariate Poisson model by inflating the probability of draws. Weakly informative prior distributions are assumed for the remaining parameters, and sum-to-zero constraints on the attack/defense abilities are imposed to achieve model identifiability. Previous World Cup and Euro Cup models posted on this blog can be found here, here and here.

Here is the new model for the joint distribution of the scores (X, Y) of a soccer match. In brief:

We fitted the model using HMC sampling, with 4 Markov chains of 2000 iterations each, checking convergence and effective sample sizes. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage, played from November 20th to November 24th, along with some posterior predictive ‘chessboard plots’ for the exact outcomes in gray-scale color (‘mlo’ in the table denotes the ‘most likely outcome’, whereas darker regions in the plots correspond to more likely results):

Better teams are, as expected, given higher chances in these first group-stage matches:

  • In Portugal-Ghana, Portugal has an estimated winning probability of about 81%, whereas in Argentina-Saudi Arabia, Argentina has an estimated winning probability of about 72%. The match between England and Iran looks more balanced, and a similar pattern holds for Germany-Japan. The USA is estimated to be ahead in the match against Wales, with a winning probability of about 47%.
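As a rough illustration of the score distribution the model assumes, here is a small simulation from a diagonal-inflated bivariate Poisson. This is a Python sketch with invented parameter values, not the actual footBayes/Stan implementation: scores are built as (W1 + W3, W2 + W3) with independent Poisson components (the shared W3 induces positive correlation), and with some probability the pair is replaced by a draw.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scores(lam1, lam2, lam3, p_infl, theta, n):
    """Sample n (home, away) scores from a diagonal-inflated
    bivariate Poisson (Karlis & Ntzoufras, 2003). Parameters here
    are illustrative, not fitted values."""
    w1 = rng.poisson(lam1, n)
    w2 = rng.poisson(lam2, n)
    w3 = rng.poisson(lam3, n)          # shared component -> correlated scores
    x, y = w1 + w3, w2 + w3
    # With probability p_infl, replace the pair by a drawn score d-d.
    inflate = rng.random(n) < p_infl
    d = rng.poisson(theta, n)
    x[inflate] = d[inflate]
    y[inflate] = d[inflate]
    return x, y

x, y = sample_scores(lam1=1.4, lam2=0.9, lam3=0.2,
                     p_infl=0.10, theta=1.0, n=200_000)
print(f"P(home win) = {(x > y).mean():.3f}")
print(f"P(draw)     = {(x == y).mean():.3f}")
print(f"P(away win) = {(x < y).mean():.3f}")
```

The diagonal inflation is what lets the model match the excess of draws seen in real soccer data relative to a plain bivariate Poisson.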

Some technical notes and model limitations:

  • Keep in mind that ‘home’ and ‘away’ do not mean anything in particular here – the only home team is Qatar! – but they just refer to the first and the second team of the single matches. ‘mlo’ denotes the most likely exact outcome.
  • The posterior predictive probabilities are reported to the third decimal digit, which may suggest more precision than is warranted; however, we transparently report the probabilities exactly as returned by our package computations.
  • One could use these probabilities for betting purposes, for instance by betting on that particular result – among home win, draw, or away win – for which the model probability exceeds the bookmaker-implied probability. However, we are not responsible for any money you lose!
  • Why a diagonal-inflated bivariate Poisson model, and not other models? We developed some sensitivity checks in terms of leave-one-out CV on the training set to choose the best model. Furthermore, we also checked our model in terms of calibration measures and posterior predictive checks.
  • The model incorporates the (rescaled) FIFA ranking as the only predictor. Thus, we do not have many relevant covariates here.
  • We did not distinguish between friendly matches, world cup qualifiers, euro cup qualifiers, etc. in the training data, rather we consider all the data as coming from the same ‘population’ of matches. This data assumption could be poor in terms of predictive performances.
  • We do not incorporate any individual players’-based information in the model, and this also could represent a major limitation.
  • We’ll compute some predictions’ scores – Brier score, pseudo R-squared – to check the predictive power of the model.
  • We’ll fit this model after each stage, by adding the previous matches in the training set and predicting the next matches.
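For intuition on the model class itself: a bivariate Poisson distribution builds the two scores from shared and team-specific Poisson components, so the shared rate induces a positive correlation between goals, and the diagonal inflation then mixes in extra probability mass on draws. A minimal sketch of the pmf, assuming the standard Karlis–Ntzoufras parameterization and made-up rates and mixing weights (not our fitted Stan model):

```python
from math import comb, exp, factorial

def bivpois_pmf(x, y, l1, l2, l3):
    """Bivariate Poisson pmf: X = X1 + X3, Y = X2 + X3, with the Xi
    independent Poisson(li). The shared component l3 controls the
    covariance between the two scores."""
    s = sum(comb(x, k) * comb(y, k) * factorial(k) * (l3 / (l1 * l2)) ** k
            for k in range(min(x, y) + 1))
    return exp(-(l1 + l2 + l3)) * l1 ** x / factorial(x) * l2 ** y / factorial(y) * s

def diag_inflated_pmf(x, y, l1, l2, l3, p_infl, diag_dist):
    """Diagonal-inflated version: mix the bivariate Poisson with a
    discrete distribution diag_dist over the diagonal (draw) cells,
    so x == y outcomes get extra probability mass."""
    base = (1 - p_infl) * bivpois_pmf(x, y, l1, l2, l3)
    if x == y:
        base += p_infl * diag_dist.get(x, 0.0)
    return base

# Illustrative parameters only: the inflated pmf puts more mass on 1-1
# than the plain bivariate Poisson does.
plain = bivpois_pmf(1, 1, 1.2, 0.9, 0.3)
inflated = diag_inflated_pmf(1, 1, 1.2, 0.9, 0.3, 0.1, {0: 0.5, 1: 0.3, 2: 0.2})
```

In the actual model the rates would be functions of the (rescaled) FIFA rankings of the two teams, with priors on the coefficients.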
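The Brier score mentioned above is straightforward for three-way match outcomes: it is the squared distance between the forecast probability vector and the indicator of the realized result, averaged over matches. A minimal sketch with made-up forecast numbers (lower is better):

```python
def brier_score(probs, observed):
    """Multicategory Brier score for a single match.

    probs: dict mapping each outcome ('home', 'draw', 'away') to its
    predicted probability; observed: the realized outcome.
    """
    return sum((p - (outcome == observed)) ** 2 for outcome, p in probs.items())

# A confident forecast scores well when right and badly when wrong.
forecast = {"home": 0.81, "draw": 0.13, "away": 0.06}
good = brier_score(forecast, "home")  # forecast was right
bad = brier_score(forecast, "away")   # forecast was wrong
```

Averaging this score over all predicted matches gives one summary of the model's predictive power.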

This model is just an approximation to a very complex football tournament. Anyway, we strongly support scientific replication, and for that reason the reports, data, and R and RMarkdown code can all be found here, on my personal web page. Feel free to play with the data and fit your own model!

And stay tuned for the next predictions on the blog. We’ll add some plots, tables, and further considerations. Hopefully we’ll improve predictive performance as the tournament proceeds.

Al Capone (3) vs. Henry Winkler; Naismith advances

Yesterday’s contest pitted the inventor of basketball against a composer of music. Sports vs. arts. Invention vs. namesake.

John offered this amusingly ambiguous comment:

I like both of their creations in general; though I’ve only actually played one myself, I’ve attended performances of the other a few times. If we’re going with the “smell my perfume” test, I think the composer would get the nod here. Plus I just couldn’t imagine an opera about “Nixon in China”, but wow it works. So I’ll vote for John Adams as the speaker that defies expectations.

And I still can’t quite get my head around Manuel’s contribution:

You can be a composer of dissonant, tuneless music and still be a good speaker, but if I had good speakers in my audio system I wouldn’t use them to play dissonant, tuneless music. James Naismith is a slam dunk in this round.

The decider comes from Raghu:

I’m thoroughly bored by the “hot hand” posts that come up from time to time, and this neverending topic is likely to come up with Naismith. Not reading those posts is good for my productivity, so I vote Naismith.

Whatever we can do to boost Raghu’s productivity is worth it. He wrote this cool book about biophysics, and who knows what heights of productivity he could achieve if we can continue to fill up this blog with “hot hand” material?

Today’s matchup

It’s Al Capone from the “Alleged tax cheats” category coming up against Henry Winkler from the “Cool people.” Whaddya wanna see, Robert De Niro beating someone to death with a baseball bat, or an acting teacher on waterskis? Ayyyyyyyy!

Again, here are the announcement and the rules.