Instability of win probability in election forecasts (with a little bit of R)

Recently we’ve been talking a lot about election forecasting:

Election prediction markets: What happens next?

Why are we making probabilistic election forecasts? (and why don’t we put so much effort into them?)

What’s gonna happen between now and November 5?

Polling averages and political forecasts and what do you really think is gonna happen in November?

The election is coming: What forecasts should we trust?

One thing that comes up in communicating election forecasts is that people confuse probability of winning with predicted vote share. Not always—when the win probability is 90%, nobody’s thinking a candidate will get 90% of the vote—but it’s an issue in settings like the current election, where both numbers are close to 50%. If Harris is predicted to have a 60% chance of winning the electoral college, this does not imply that she’s predicted to win 60% of the electoral vote or 60% of the popular vote.

There are different ways to think about this. You could draw an S curve showing Pr(win) as a function of expected vote share. Once your expected share of the two-party vote goes below 40% or above 60%, your probability of winning becomes essentially 0 or 1. Indeed, if you get 54% of the two-party vote, this will in practice guarantee you an electoral college victory; however, an expected 54% will not translate into 100% win probability, because there’s uncertainty about the election outcome: if that forecast is 54% with a standard deviation of 2%, then there’s a chance you could actually lose.

A few years ago we did some calculations based on the assumption that the national popular vote can be forecast to within a standard deviation of 1.5 percentage point with a normally-distributed uncertainty. So if Harris is currently predicted to get 52% of the two-party vote, let’s say the forecast is that there’s a two-thirds chance she’ll get between 50.5% and 53.5% of the vote and a 95% chance she’ll get between 49% and 55% of the vote. This isn’t quite right but you could change the numbers around and get the same general picture. This forecast gives her a 90% chance of winning the popular vote (in R, the calculation is 1 – pnorm(0.5, 0.52, 0.015) = 0.91) but something like a 60% chance of winning in the electoral college—it’s only 60% and not more because the configuration of the votes in the states is such that she’ll probably need slightly more than a majority of the national vote to gain a majority of the electoral votes. As a rough calculation, we can then say she needs something like 51.6% of the two-party vote to have a 50-50 chance of winning in the electoral college (in R, the calculation is qnorm(0.4, 0.52, 0.015) = 0.516).

Now what happens if the prediction shifts? Increase Harris’s expected vote share by 0.1% (from 52% to 52.1%) and her win probability goes up by 2.5 percentage points (in R, this is pnorm(0.516, 0.52, 0.015) – pnorm(0.516, 0.521, 0.015) = 0.025).

Increase (or decrease) Harris’s expected vote share by 0.4% and her win probability goes up (or down) by 10 percentage points. The other way you can change her win probability—bringing it toward 50%—is to increase the uncertainty in your forecast.

So one reason I don’t believe in reporting win probabilities to high precision is that these win probabilities in a close election are highly sensitive to small changes in the inputs. These small changes can be important—in a close election, a 0.4% vote swing could be decisive—but that’s kind of the point: it’s the very fact that it would be likely to be decisive which makes the win probability strongly dependent on it.

One thing I like about the Economist’s display (see image at the top of this post) is that they report the probability as “3 in 5.” This is good because it’s rounded—it’s 60%, not 58.3% or whatever. Also, I like that they say “3 in 5” rather than “60%,” because it seems less likely that this would be confused with a predicted vote share.

P.S. This is all relevant to Jessica’s recent post, partly because we coauthored a paper a few years ago (with Chris Wlezien and Elliott Morris) on information, incentives, and goals in election forecasts, and more specifically because binary predictions are hard to empirically evaluate (see here) so this is a real-world example of the common scientific problem of having to make a choice that can’t be evaluated on a purely empirical or statistical basis.

Getting a pass on evaluating ways to improve science

This is Jessica. I was thinking recently about how doing research on certain topics related to helping people improve their statistical practice (like data visualization, or open science) can seem to earn researchers a free pass where we might otherwise expect to see rigorous evaluation. For example, I’m sometimes surprised when I see researchers from outside the field getting excited about studies on visualization that I personally wouldn’t trust. It’s like there’s a rosy glow effect when they realize that there is actually research being done on such topics. Then there is open science research, which proposes interventions like preregistration or registered reports, but has been criticized for failing to rigorously motivate and evaluate its claims.

Some of it is undoubtedly selective attention, where we’re less inclined to get critical when the goals of the research align with something we want to believe. Maybe there’s also an implicit tendency to trust that if researchers are working on improving data analysis practices and eliminating sources of bias, they must understand data and statistics well enough themselves not to make dumb mistakes. (Turns out this is not true). 

But on the more extreme end, there’s a belief that the goal of these procedures, whether its “improving science” in the open science case or “improving learning and decision-making from data” in the visualization case, are too hard to evaluate in the usual ways. In visualization research for example, this sometimes manifests as pushback to anything perceived as too logical positivist. Some argue that to really understand the impacts of the visualization or data analysis tools we’re developing, we need to use ethnographic methods like embedding ourselves in the domain as participant observers. 

Arguments against controlled evaluation also pop up in meta-science discussions. For example, Daniel Lakens recently published a blog post that argues that science reforms like preregistration are beyond empirical evidence, because running the sort of long-term randomized controlled experiments to produce causal evidence of their effect is prohibitive. He references Paul Meehl’s idea of cliometric meta-theory, the long term study of how theories affect scientific progress. 

Lakens however is not suggesting a more ethnographic or interpretivist approach to understand the implications of reforms like preregistration. He argues instead that rather than seeking empirical evidence, we should recognize the distinction between empirical and logical justification: 

An empirical justification requires evidence. A logical justification requires agreement with a principle. If we want to justify preregistration empirically, we need to provide evidence that it improved science. If you want to disagree with the claim that preregistration is a good idea, you need to disagree with the evidence. If we want to justify preregistration logically, we need to people to agree with the principle that researchers should be able to transparently evaluate how coherently their peers are acting (e.g., they are not saying they are making an error controlled claim, when in actuality they did not control their error rate).

In other words, if we think it’s important to evaluate the severity of published claims, then needing to preregister is a logical conclusion.

Logic is obviously an important part of rigor, and I can certainly relate to being annoyed with the undervaluing of logic in fields where evidence is conventionally empirical (I am often frustrated with this aspect of research on interfaces!) But the “if we think it’s important” is critical here, as it points to some buried assumptions. It’s worth noting that the argument that preregistration enables evaluating whether researchers are making error controlled claims depends on a specific philosophy of science based in Mayo’s view of severe testing. While Lakens may have chosen a philosophy of science to embrace as complete, this is not necessarily a universally agreed upon approach for how best to do science (see, e.g., discussions on the blog). And so, the simple logical argument Lakens appears to be going for depends on a much larger scaffold of logic, inferential goals, assumptions, epistemic commitments, values, beliefs, etc. 

All this points to a problem with trying to make a logical argument for preregistration, which is that ultimately it’s not really all about “logic.” One might find it useful to adopt in one’s own practice for various reasons, but when it comes to establishing its value for science writ broadly, we end up firmly rooted in the realm of values. Beyond your philosophy of scientific progress, it comes down to the extent to which you think that scientists owe it to others to “prove” that they followed the method they said they did. It’s about how much transparency (versus trust) we feel we owe our fellow scientists, not to mention how committed we are to the idea that lying or bad behavior on the part of scientists are the big limiter of scientific progress. As someone who considers themselves to be highly logical, I don’t expect logic alone to get me very far on these questions.

Overall Lakens’ post leaves me with more questions than answers. I find his argument unsatisfying because it’s not quite clear what exactly he is proposing. It reads a bit as if it’s a defense of preregistration, delivered with an assurance that this logical argument could not possibly be paralleled by empirical evidence: “A little bit of logic is worth more than two centuries of cliometric metatheory.” He argues that all rational individuals who agree with the premise (i.e., share his philosophical commitments) should accept the logical view, whereas empirical evidence has to be “strong enough” to convince and may still be critiqued. And so while he seems to start out by admitting that we’ll never know if science would be better if preregistration was ubiquitous, he ends up concluding that if one shares his views on science, it’s logically necessary to preregister for science to improve. I’m not sure what to do with this. For example, is the implication that logical justification should be enough for journals to require preregistration to publish, or that lack of preregistration should be valid ground for rejecting a paper that makes claims requiring error control?

Elsewhere in his post, Lakens also suggests that empirical evidence is sometimes worth pursuing: 

At this time, I do not believe there will ever be sufficiently conclusive empirical evidence for causal claims that a change in scientific practice makes science better. You might argue that my bar for evidence is too high. That conclusive empirical evidence in science is rarely possible, but that we can provide evidence from observational studies – perhaps by attempting to control for the most important confounds, measuring decent proxies of ‘better science’ on a shorter time scale. I think this work can be valuable, and it might convince some people, and it might even lead to a sufficient evidence base to warrant policy change by some organizations. After all, policies need to be set anyway, and the evidence base for most of the policies in science are based on weak evidence, at best.

It strikes me as contradictory to say that it is a flaw that “Psychologists are empirically inclined creatures, and to their detriment, they often trust empirical data more than logical arguments” while at the same time saying it’s ok to produce weak empirical evidence to convince some people. 

Reading this, I can’t help but think of the recent NHB paper, ‘High replicability of newly discovered social-behavioural findings is achievable’, which as we previously discussed on the blog, had some flaws including a missing preregistration. I bring it up here because one could question whether the paper’s titular claim really required an empirical study (and previous reviewers like Tal Yarkoni did bring this up). If we do high powered replications of high powered original studies, then of course we should be able to find some effects that replicate. Unless we are taking the extreme position that there are no real effects being studied in psychology. This seems like an example of a logical justification that is less tied to a particular philosophy of science than Lakens’ preregistration argument (though it still requires some consensus, e.g., on what we mean by replicate).   

I’m reminded in particular of a social media discussion between Tal Yarkoni and Brian Nosek after the criticism of the NHB paper surfaced, on the question of when it’s ok to produce empirical evidence to justify reforms. Yarkoni argued that it’s wrong to use empirical evidence to try to convince someone who doesn’t understand statistics well that a higher n study is more likely to replicate, while Nosek seemed to be arguing that sometimes it’s appropriate because we should be meeting people where they are at. My personal view aligns with the former: why would you set out to show something that you personally don’t believe is necessary to show? What happens to the “scientific long game” when scientists operate out of a perceived need to persuade with data? Anyway, Lakens has defended the NHB paper on social media, so maybe his post is related to his views on that case.

“Take a pass”: New contronym just dropped.

Richard and I were working on the Bayesian Workflow book, I had an idea for an example to include and volunteered to write it up, and Richard suggested I “take a pass” on it. Then we realized that “take a pass” has two opposite meanings: “Do it” and “Skip it.” Which reminded me of something we discussed several years ago, that “sanction” is one of those funny words with two nearly completely opposite definitions. According to dictionary.com, the first definition of this verb is “to authorize, approve, or allow.” But “sanction” also is used in the opposite sense, to mean “punish.” Similarly, the noun can either mean “permission” or “penalty.” It can get confusing at times, to read that some behavior is being sanctioned.

Richard said that in German there are many such words with opposite meanings; they are called Janus [two-facing] words. In English they are called contronyms; here are wikipedia and Merriam-Webster, which includes a fascinating story of the derivation of the word “fast.” The wikipedia entry for Contronym is in 13 languages, including German but not French.

Election prediction markets: What happens next?

Rajiv Sethi discusses a recent ruling by a U.S. regulatory agency to halt trading on election prediction markets. Here’s Rajiv:

As evidence for this assertion the agency cited some of my [Rajiv’s] writing [that] describe attempts at market manipulation, one for financial gain and the other seemingly for maintaining optimism about the prospects of a candidate.

However, I [Rajiv] feel that the agency is drawing the wrong conclusions from this work, and a proper understanding of it undermines rather than bolsters the case for prohibition.

Some people believe that attempts at manipulating prediction markets are doomed to failure—that such attempts can have no more than a modest and short-lived effect on prices before other traders see a significant profit opportunity and pounce. I do not subscribe to this view. But when there exist prediction markets that lie outside the reach of our regulators, such as crypto-based Polymarket or the British exchange Betfair, the best defense against market manipulation is not prohibition but greater competition and transparency.

That is, I favor allowing Kalshi to proceed with the listing of contracts that reference election outcomes, not because I dismiss concerns about market manipulation or election integrity, but because I take them very seriously. . . .

Visibility is sharpened by the existence of multiple competing markets, especially if they have limited participant overlap. Kalshi is a regulated exchange restricted to verified domestic accounts funded with cash. Polymarket is crypto-based, does not accept cash deposits from US residents, and lies largely outside the purview of our regulatory apparatus.

Rajiv continues:

Prediction markets are characterized by an inescapable paradox. If they are taken seriously as unbiased aggregators of distributed information, beliefs will shift when prices change, and incentives for manipulation will be significant. But if they are seen as vulnerable to manipulation and frequently biased, price changes will be largely ignored, and they will not be worth manipulating. This logic places bounds on the extent of manipulation in markets—it cannot be absent altogether, but cannot be so large as to undermine their credibility.

Another way of saying this is that there are three sorts of manipulations to be concerned about:

1. People manipulating the market to make money using classic schemes such as spreading false information and using this to pump-and-dump, or taking advantage of insider information.

2. People manipulating the market to affect voting behavior or the behavior of potential donors or endorsors. I doubt this would have much effect for major-party candidates in a general election, but it could be an issue in primary elections, where there’s a clear benefit to be perceived as being in the top tier.

3. People trying to throwing the election to win a bet—or something more subtle such as point shaving.

In his post, Rajiv talks about #1 and #2 but not #3. Related to his “prediction market paradox” is that #2 can happen if the amount of money in the markets is low compared to the stakes of the election, so that the prices are affordable to manipulate given the potential gain, whereas #3 can happen if the amount of money in the markets is high compared to the stakes of the election, so that it’s worth taking the risk to throw it. Is there a sweet spot where the markets are big enough so they can’t easily be manipulated but small enough that there’s no motivation to throw the election? I guess that for U.S. presidential elections, the answer would be yes. For some other elections, maybe not.

Rajiv is not saying that prediction markets should be unregulated, indeed he writes that derivative contracts in election markets “serve no legitimate purpose and open up rather obvious strategies for manipulation. And even if attempts at manipulation fail in the end, they still arouse suspicion and sow confusion. . . . It would be a good thing if they were discontinued.” And, even setting aside derivative contracts, even with straight-up election bets, concerns about market manipulation, insider trading, and point shaving are real. But these are issues with all markets: they’re reasons to regulate, not to ban. Or, to put it another way, regulations are about tradeoffs. There are political costs to gambling on elections, but, as Rajiv argues, there are political benefits too.

Rajiv also asks:

Can the forecasting accuracy of markets exceed that of statistical models?

My quick answer is, Yes, I think that markets can be more accurate than statistical forecasts! Statistical election forecasts are public, so market players can make use of that information for free. Indeed, as discussed in this recent post, I think the presence of different public forecasts (including ours at The Economist) does its part in stabilizing market behavior, at least for the national electoral college and popular vote.

For side bets such as individual states and tail probabilities (what’s the chance that Harris wins South Dakota?) or various trifecta-style bets (what’s the chance Trump wins states X, Y, and Z?), not so much, and I say this for two reasons. First, even the forecasts that do well at the headline numbers have problems in the tails (as with the Fivethirtyeight forecast and the Economist’s as well), and prediction markets also do weird things when the probabilities are near 0 or 1 (see here). As for conditional probabilities, which seem so tantalizingly available by juxtaposing prices on multiple bets . . . I think there might be something there, but it would take a bit of statistical modeling; given the noise on each price, if you try to put them together in a naive way, you’ve got nothing but trouble.

This is not to say that prediction markets are a bad thing; just that we should be understanding them empirically—or, I should say, scientifically, combining empirics and theory, since neither alone will do the job—, not just blindly following pro- or anti-market ideology. As Rajiv says, the markets are out there already, and we can learn from them.

20 years of blogging . . . What have been your favorite posts?

Our first post was on 12 Oct 2004: A weblog for research in statistical modeling and applications, especially in social sciences; followed by The Electoral College favors voters in small states; Why it’s rational to vote; Bayes and Popper; and Overrepresentation of small states/provinces, and the USA Today effect.

Later that month we had our first post on the Red State/Blue State Paradox, a guest post on statistical issues in modeling social space, and a stab at partial pooling of interactions.

On 27 Oct came one of my favorite early posts, The blessing of dimensionality, and early the next month came Sam Cook’s post on her now-classic work on Bayesian software validation, which we now call SBC, for simulation-based calibration checking, an early post on cross-validation for Bayesian multilevel modeling—a topic on which we made lots of progress in the decades since.

Also in Nov 2004 came our first formulation of the important (to me) concept of institutional decision analysis, my discovery about correlations in before-after data, and a debunking of a claim of possible election fraud in Florida.

Those were the days!

Here’s my question for you!

What are your favorite posts? It would be fun to compile a list in commemoration of 20 years and over 12,000 posts. So post titles and links to your favorites in the comments below. You can include as many as you want.

Also if you have any favorite posts from the past 20 years on other blogs, share those too. Thanks!

The Mets are looking to hire a data scientist

Michale Jerman writes:

We’re looking to expand our group, and we’d obviously be interested in hiring the type of people who follow you.

The job is Senior Data Scientist, Baseball Analytics:

The New York Mets are seeking a Senior Data Scientist in Baseball Analytics. The Senior Data Scientist will build, test, and present statistical models that inform decision-making in all facets of Baseball Operations. This position requires strong background in complex statistics and data analytics, as well as the ability to communicate statistical model details and findings to both a technical and non-technical audience. Prior experience in or knowledge of baseball is a plus, but is not required.

Essential Duties & Responsibilities:

  • Build statistical models to answer a wide variety of baseball-related questions affecting the operations of the organization using advanced knowledge of statistics and data analytics and exercising appropriate discretion and judgment regarding development of statistical models
  • Interpret data and report conclusions drawn from their analyses
  • Present model outputs in an effective way, both for technical and non-technical audiences
  • Communicate well with both the Baseball Analytics team as well as other Baseball Operations personnel to understand the parameters of any particular research project
  • Provide advice on the desired outputs from the data engineering team, and guidance to the Baseball Systems team on how best to present model results
  • Assist with recruiting, hiring, and mentoring new analysts in the Baseball Analytics department
  • Evaluate potential new data sources and technologies to determine their validity and usefulness
  • Consistently analyze recent research in analytics that can help improve the modeling work done by the Baseball Analytics department

Qualifications:

  • Ph.D. in statistics or a related field, or equivalent professional experience
  • Strong background in a wide variety of statistical techniques
  • Strong proficiency in R, Python, or similar, as well as strong proficiency in SQL
  • Basic knowledge of data engineering and front-end development is a plus, for the purpose of communicating with those departments
  • Strong communication skills
  • Ability to work cooperatively with others, and to take control of large-scale projects with little or no daily oversight

The job description doesn’t literally mention Stan, but I can assure you that proficiency in Bayesian modeling and computing in Stan would be a plus.

Join the team before the Mets win that world championship and you can share some of the credit!

The Rider

I finally followed Phil’s advice and read The Rider, the 1978 cult classic by Tim Krabbé. It lived up to the hype. The book is the story of a minor-league bicycle race, as told by one of its participants, a journalist and amateur road racer. It’s pretty much a perfect book in form and content.

I want to say that The Rider belongs on a shelf of classic short nonfiction books, along with A Little Book About a Big Memory by A. R. Luria, How Animals Work by Knut Schmidt-Nielsen, The Origins of the Second World War by A. J. P. Taylor, Total Poker by David Spanier, and . . . hmmm, there aren’t really so many classic short nonfiction books really, are there?

In this interview, Krabbé characterizes The Rider as “a novel” but also as “90 to 95 percent real.” I wonder if he went back over the course while writing the book so as to jog his memory and help him get the details right.

I kinda wish Krabbé had gone to the trouble to make the book 100% real. It’s not clear to me what’s the missing 5 to 10%. Is he just saying his recollection is imperfect so there will be inevitable mistakes? Or did he change the names or combine some racers into composite characters? Did he reorder some events to make a better story? Did he just make up some stories entirely? The book is great, so I’m in no position to question Krabbé’s judgment in introducing fiction to his story. But if it’s 90 to 95% real, couldn’t he have written a short appendix telling us where the made-up stuff was? I feel that would increase my appreciation of the book. Krabbé has no obligation to do anything like that; I just think it would make this great book even better.

Shreddergate! A fascinating investigation into possible dishonesty in a psychology experiment

From A to Z

A couple years ago we discussed a post by Mark Zimbelman expressing skepticism regarding a claim by psychologist (also business school professor, Ted talk star, NPR favorite, Edge Foundation associate, retired Wall Street Journal columnist, Founding Partner of Irrational Capital, insurance agent, bestselling author, etc etc) Dan Ariely. Zimbelman wrote:

I have had some suspicions about some experiments that Professor Ariely ran using a shredder that was modified so it looked like it was shredding but really wasn’t . . . He claimed it was “quite simple” to convert a shredder by breaking the teeth in the middle with a screwdriver . . . We were unable to break any teeth on the shredders we purchased but ended up finding a way to remove some of the teeth in the center by taking the shredder apart. Unfortunately, when we did this the papers would no longer go through the shredder without getting turned to one side or another and they inevitably got stuck because the shredder no longer had enough teeth to pull them through. We concluded that it was impossible to modify any of the shredders we bought . . .

A couple weeks after his first post, Zimbelman followed up with further investigation:

I [Zimbelman] did an extensive literature search (involving several others who helped out) looking for the research that he claims was done with the modified shredder. The end result is that I can’t find any published paper that discusses using a modified shredder. I even called one of his co-authors and asked him if the experiment that they ran together used a modified shredder. He said the shredder in their study was not modified.

I did find a few papers that used a regular shredder but did not mention any modifications. I also found several statements (including this one and the one linked above) where he claims to use this mysterious modified shredder. Overall then, here’s where we are with the shredder:

1. Dr. Ariely has made numerous claims to use a modified shredder in his matrix experiments.

2. I am unable to find any published papers by Dr. Ariely that use a modified shredder.

3. Modifying a shredder to do what he has claimed appears to be very unlikely.

Zimbelman’s posts were from late 2021, and I reported on them in early 2022. That’s where things stood for me until an anonymous tipster pointed us to this video that Ariely posted to Youtube in September 2023, along with this note:

Over the years I ran many different versions of honesty/ dishonesty experiments. In one of them, I used a shredder. Here is a short piece from the Dishonesty movie in which this shredder is starring and can be seen in action.

Here are some screenshots:

The Dishonesty movie is from 2015. Here’s a press release from 2016 promoting the movie. The press release says that the experiment that used the shredder was performed in 2002.

I assume the scene in the movie showing the experiment was a reconstruction. It just seems doubtful that in 2002 they would’ve taken these videos of the participants in that way. There’s nothing dishonest about reconstructing a past scene—documentaries do that all the time!—I’m just trying to figure out exactly what happened here. Just to be sure, I watched the clip carefully for any clues about when it was shot . . . and I noticed this:

Let’s zoom in on that mobile phone:

I don’t know anything about mobile phones, so I asked people who did, and they assured me that the phones in 2002 didn’t look like that, which again suggests the scene in the documentary was a reconstruction. Which is still fine, no problems yet.

Home Depot . . . or Staples?

Zimbelman’s post links to a voicemail message from Ariely saying they bought the shredder from Home Depot. But in the video, the shredder is labeled Staples:

Wassup with that? In the label on the Youtube video, Ariely says that the shredder in the video is the same as the one they used. (“Over the years I ran many different versions of honesty/ dishonesty experiments. In one of them, I used a shredder. Here is a short piece from the Dishonesty movie in which this shredder is starring and can be seen in action.”) Which makes sense. It’s not like they’d have a whole bunch of modified shredders kicking around. But then where did they buy it? Back in 2002, was Home Depot carrying Staples brand products? I guess a real shredder-head could tell just by looking whether the shredder in the video is vintage 2002 or 2015. The shredder in the video doesn’t look 13 years old to me, but who am I to say? Unfortunately, the image of the label on the back of the machine isn’t readable:

You can make out the Staples brand name but that’s about it.

So, what happened?

There are several possibilities here:

1. The 2002 experiment happened as was described, they really modified that shredder as claimed, they kept the shredder around, it was still operating in 2015, they demonstrated it in the video, and it worked as advertised. Also, that shredder in 2002 had actually been bought at Staples, and Ariely just had a lapse in memory when he said they’d bought it at Home Depot.

2. The 2002 experiment happened as was described, they really bought that shredder at Home Depot and modified it as claimed, but then the shredder was lost, or discarded, or broke. When they made the movie, they went to Staples and bought a new shredder, modified it (using that approach which Ariely said is simple but which Zimbelman in the above-linked post said is actually difficult or impossible), and it worked just as planned. Under this scenario, Ariely was not telling the truth when he wrote that the shredder is the same as in the original experiment, but, hey, it’s just a movie, right? The Coen brothers had that title card saying that Fargo was based on a true story, and it wasn’t—that doesn’t mean they were “lying,” exactly!

3. The 2002 experiment happened as was described, they really bought that shredder at Home Depot and modified it as claimed, but then the shredder was lost, or discarded, or broke. When they made the movie, they went to Staples and bought a new shredder, but at this point they didn’t bother to try to modify it. It was just a dramatization, after all! Why ruin a perfectly good shredder from Staples? Instead they just filmed things just as they literally look in the video: they put paper in the shredder, then off screen they mangle the edges of some other sheets of paper, put them in the bin in the bottom of the shredder, and then turn on the video again and take the partially-mangled papers out. The way I wrote this, it seems kinda complicated, but if you think about it from the standpoint of someone making a video, this is much easier than getting some modified shredder to work! Indeed, if you look at the video, the sheets of paper that went into the shredder at the beginning are not the same as the ones that they took out at the end.

4. The 2002 experiment never happened as described. Ariely or one of his collaborators had that cool idea of modifying the shredder, but it wasn’t so easy to do so they gave up and just faked the study. Then when making the 2015 movie they just said they lost the shredder, and the moviemakers just did the reenactment as described in option 3 above.

5. I’m sure there are other possibilities I didn’t think of!

Is the video “fake”? I doubt it! I assume the video is either a clip from the movie or was filmed at the same time as the movie, and in either case it’s a reenactment of Ariely’s description of what happened in the experiment. In a reenactment if you show some sheets of paper fed into a shredder and then later you show some sheets of paper removed from the shredder, there’s no reason that they have to be the same sheets of paper. Similarly, if a movie reenacts a flight from New York to Chicago, and it shows a shot of a plane taking off from LaGuardia, followed by a shot of a plane landing in O’Hare, they don’t have to be the same plane. It’s just a reenactment! The fact that someone made a video reenacting a scene with a shredder does not imply that this shredding actually happened in the video—there’s really no reason to for them to have gone to the trouble to have done this shredding at all.

So, there are several possibilities consistent with the information that is currently available to us. You can use your own judgment to decide what you think might have happened in 2002 and 2015.

“Dishonesty can permeate through a system and show up not because of selfish interest but because of a desire to help.”

In the above-linked press release—the one that said, “we modified the shredder! We only shredded the sides of the page, whereas the body of the page remained intact”—Ariely also said:

There are pressures for funding. Imagine you run a big lab with 20 people and you’re about to run out of funding – what are the pressures of taking care of the people who work with you? And what kind of shortcuts would you be willing to take? Dishonesty can permeate through a system and show up not because of selfish interest but because of a desire to help.

With regards to research, I don’t think that most people think long-term and think to themselves that somebody would try to replicate their results and find that they don’t work. People often tend to “tweak” data and convince themselves that they are simply helping the data show its true nature. There are lots of things in academic publications that are manifestations of our abilities to rationalize. . . .

I think that the pressures of publication, funding, helping the group, and reputation are very much present in academic publications.

Which is interesting given the fraudulent projects he was involved in. Interesting if he was doing fraud and interesting if he was the unfortunate victim of fraud. He’s an admirably tolerant person. Fraud makes me angry; I would not be so quick to refer to it as “not because of selfish interest but because of a desire to help.”

Wassup, NPR? You raised concerns in 2010 but then promoted

One of the above sources links to this NPR article from 2010:

Should You Be Suspicious Of Your Dentist Or NPR’s Source? . . .

Last month, Dan Ariely, a behavioral economics professor, talked with All Things Considered host Robert Siegel, about how incredibly loyal, almost irrationally so, people are to their dentists – more so than with other medical professions. . . . when he appeared on NPR’s air, there was every reason to trust him.

Ariely offered information certain to unnerve listeners and anger dentists ¬– information based on a fact that he cannot back up.

If two dentists were asked to identify cavities from the same X-ray of the same tooth, Ariely said they would agree only 50 percent of the time.

Ariely cited Delta Dental insurance as his source. However, Delta spokesman Chris Pyle said there is no data that could lead to that conclusion. . . .

Here is what Ariely said:

Prof. ARIELY: And we asked both dentists to find cavities. And the question is, what would be the match? How many cavities will they find, both people would find in the same teeth? . . . It turns out what Delta Dental tells us is that the probability of this happening is about 50 percent. . . .

It’s really, really low. It’s amazingly low. Now, these are not cavities that the dentist finds by poking in and kind of actually measuring one. It’s from X-rays. Now, why is it so low? It’s not that one dentist find cavities and one doesn’t. They both find cavities, just find them in different teeth. . . .

“According to Dr. Ariely, he was basing his statement on a conversation he said he had with someone at Delta Dental,” said Pyle. “But he cannot cite Delta Dental in making that claim because we don’t collect any data like that which would come to such a conclusion.”

So what happened?

Ariely said he got that 50 percent figure from a Delta source who told him about “some internal analysis they have done and they told me the results. But they didn’t give me the raw data. It’s just something they told me.”

Ariely did not provide the name of the Delta medical officer, whom Ariely said was not interested in talking with me. . . .

Ariely told me he happened upon that figure when he was conducting research analyzing 20 years of raw data on Delta claims. . . . But Ariely did not see or analyze any data that would lead to a conclusion that dentists would agree only 50 percent of the time based on studying an X-ray.

Wow. The NPR report continues:

But what is NPR’s responsibility? . . . NPR can’t re-report and check out every thing that an on-air guest says. . . . In this case, the interview with Ariely was taped ahead of time and edited for air – but no one thought it necessary to challenge his undocumented statement.

ATC executive director Christopher Turpin said NPR had no reason to question Ariely, given his credentials as a tenured professor and an expert on how irrational human beings are. . . .

ATC has other pre-taped segments with Ariely, and those should be double-checked before they are aired. There’s no doubt that Ariely is both entertaining and informative about how irrational we humans are — but he also must be right.

It’s funny that, after all that, they write that there’s “no doubt” that Ariely is informative. I have some doubt on that one!

But here’s the interesting thing. The above warning was from 2010. Not 2022, not 2020. 2010. Fourteen years ago. But if you google *NPR Ariely*, you get a bunch of items since then:

2011: Is Marriage Rational? : Planet Money

2011: For Creative People, Cheating Comes More Easily

2012: TED Radio Hour: Dan Ariely: Why Do We Cheat?

2012: ‘The Honest Truth’ About Why We Lie, Cheat And Steal

2014: Dan Ariely: Where’s The Line Between Cheating A Little and Cheating A Lot?

2014: Rethinking Economic Theory: The Evolutionary Roots Of Irrationality : 13.7: Cosmos And Culture

2015: Dan Ariely: What Pushes Us To Work Hard — Even When We Don’t Have To?

2017: Dan Ariely: When Are Our Decisions Made For Us?

2018: Everybody Lies, And That’s Not Always A Bad Thing

2020: Why Some People Lie More Than Others

And then in the past year or so there have been some skeptical stories, such as this from 2023: Did an honesty researcher fabricate data?

But until recently NPR was running lots and lots of stories quoting Ariely completely without question—for years and years after they’d been explicitly warned back in 2010.

The scientist-as-hero narrative is just so strong that NPR kept going back to that well even after their own warning.

Why post on this?

God is in every leaf of every tree. It’s interesting how the more you look into this shredder story the more you can find. But there’s always some residual uncertainty, because there’s always some elaborate explanation that we haven’t thought of.

In the meantime, I’d recommend following the advice of that 2010 NPR report and asking people for their evidence when they make claims of scientific breakthroughs. There’s nothing you can do to stop people from flat-out lying, but if you can get purported experts to specify their sources, that should help.

If scientists know ahead of time that they’re expected to produce the shredder, as it were, maybe they’d be less likely to make things up in the first place.

Freakonomics asks, “Why is there so much fraud in academia,” but without addressing one big incentive for fraud, which is that, if you make grabby enough claims, you can get featured in . . . Freakonomics!

There was this Freakonomics podcast, “Why Is There So Much Fraud in Academia?” Several people emailed me about it, pointing out the irony that the Freakonomics franchise, which has promoted academic work of such varying quality (some excellent, some dubious, some that’s out-and-out horrible), had a feature on this topic without mentioning all the times that they’ve themselves fallen for bad science.

As Sean Manning puts it, “That sounds like an episode of the Suburban Housecat Podcast called ‘Why are bird populations declining?'”

And Nick Brown writes:

Consider the first study on the first page of the first chapter of the first Freakonomics book (Gneezy & Rustichini, 2000, “A Fine is a Price”, 10.1086/468061), in which, when daycare centres in Israel started “fining” parents for arriving late to pick up their children, the amount of lateness actually went up. I have difficulty in believing that this study took place exactly as described; for example, the number of children in each centre appears to remain exactly the same throughout the 20 weeks of the study, with no mention of any new arrivals, dropouts, or days off due to illness or any other reason. Since noticing this, I have discovered that an Israeli economist named Ariel Rubinstein had similar concerns (https://arielrubinstein.tau.ac.il/papers/76.pdf. pp. 249–251). He contacted the authors, who promised to put him in touch with the staff of the daycare centres, but then sadly lost the list of their names. The paper has over 3,200 citations on Google Scholar.

I replied: Indeed, the Freakonomics team has never backed down on many ridiculous causes they have promoted, including the innumerate claim that beautiful parents are 36% more likely to have girls and some climate change denial. But I’m not criticizing the researchers who participated in this latest Freakonomics show; we have to work with the news media we have, flawed as they are.

And, as I’ve said many times before, Freakonomics has so much good stuff. That’s why I’m disappointed, first when they lower their standards and second when they don’t acknowledge or wrestle with their past mistakes. It’s not too late! They could still do a few shows—or even write a book!—on various erroneous claims they’ve promoted over the years. It would be interesting, it would fit their brand, it could be educational and also lots of fun.

This is similar to something that occurs in the behavioral economics literature: there’s so much research on how people make mistakes, how we’re wired to get the wrong answer, etc., but then not so much about the systematic errors made in the behavioral economics literature itself. As they’d say in Freakonomics, behavior can be driven by incentives.

P.S. Some interesting discussion in comments regarding the Gneezy and Rustichini paper. I’ve not looked into this one in detail, and my concerns with Freakonomics don’t come from that example but from various other cases over the years where they’ve promoted obviously bad science; see above links.

Movements in the prediction markets, and going beyond a black-box view of markets and prediction models

My Columbia econ colleague Rajiv Sethi writes:

The first (and possibly last) debate between the two major party nominees for president of the United States is in the books. . . . movements in prediction markets give us a glimpse of what might be on the horizon.

The figure below shows prices for the Harris contract on PredictIt and Polymarket over a twenty-four hour period that encompasses the debate, adjusted to allow for their interpretation as probabilities and to facilitate comparison with statistical models.

The two markets responded in very similar fashion to the debate—they moved in the same direction to roughly the same degree. One hour into the debate, the likelihood of a Harris victory had risen from 50 to 54 on PredictIt and from 47 to 50 on Polymarket. Prices fluctuated around these higher levels thereafter.

Statistical models such as those published by FiveThirtyEight, Silver Bulletin, and the Economist cannot respond to such events instantaneously—it will take several days for the effect of the debate (if any) to make itself felt in horse-race polls, and the models will respond when the polls do.

This relates to something we’ve discussed before, which is how one might consider improving a forecast such as ours at the Economist magazine so as to make use of available information that’s not in the fundamentals-based model and also hasn’t yet made its way into the polls. Such information includes debate performance, political endorsements, and other recent news items as well as potential ticking time bombs such as unpopular positions that are held by a candidate but of which the public is not yet fully aware.

Pointing to the above graph that shows the different prices in the different markets, Sethi continues:

While the markets responded to the debate in similar fashion, the disagreement between them regarding the election outcome has not narrowed. This rasies the question of how such disagreement can be sustained in the face of financial incentives. Couldn’t traders bet against Trump on Polymarket and against Harris on PredictIt, locking in a certain gain of about four percent over two months, or more than twenty-six percent at an annualized rate? And wouldn’t the pursuit of such arbitrage opportunities bring prices across markets into alignment?

There are several obstacles to executing such a strategy. PredictIt is restricted to verified residents of the US who fund accounts with cash, while trading on Polymarket is crypto-based and the exchange does not accept cash deposits from US residents. This leads to market segmentation and limits cross-market arbitrage. In addition, PredictIt has a limit of $850 on position size in any given contract, as well as a punishing fee structure.

This is all super interesting. So much of the discussion I’ve seen of prediction markets is flavored by pro- or anti-market ideology, and it’s refreshing to see these thoughts from Sethi, an economist who studies prediction markets and sees both good and bad things about them without blindly promoting or opposing them in an ideological way.

Sethi also discusses public forecasts that use the fundamentals and the polls:

While arbitrage places limits on the extent to which markets can disagree, there is no such constraint on statistical models. Here the disagreement is substantially greater—the probability of a Trump victory ranges from 45 percent on FiveThirtyEight to 49 percent on the Economist and 62 percent on Silver Bulletin.

Why the striking difference across models that use basically the same ingredients? One reason is a questionable “convention bounce adjustment” in the Silver Bulletin model, without which its disagreement with FiveThirtyEight would be negligible.

But there also seem to be some deep differences in the underlying correlation structure in these models that I find extremely puzzling. For example, according to the Silver Bulletin model, Trump is more likely to win New Hampshire (30 percent) than Harris is to win Arizona (23 percent). The other two models rank these two states very differently, with a Harris victory in Arizona being significantly more likely than a a Trump victory in New Hampshire. Convention bounce adjustments aside, the correlation structure across states in the Silver Bulletin model just doesn’t seem plausible to me.

I have a few thoughts here:

1. A rule of thumb that I calculated a few years ago in my post, Is it meaningful to talk about a probability of “65.7%” that Obama will win the election?, is that a 10 percentage point share in win probability corresponds roughly to a four-tenths of a percentage point swing in expected vote share. So the 5 percentage point swings in those markets correspond to something like a two-tenths of a percentage point swing in opinion, which can crudely be thought of as being roughly equivalent to an implicit model where the ultimate effect of the debate is somewhere between zero and half a percentage point.

2. The rule of thumb gives us a way to roughly calibrate the difference in predictions of different forecasts. A difference between a Trump win probability of 50% in one forecast and 62% in another corresponds to a difference of half a percentage point in predicted national vote share. It doesn’t seem unreasonable for different forecasts to differ by half a percentage point in the vote, given all the judgment calls involved in what polls to include, how to adjust for different polling organizations, how to combine state and national polls, and how you set up the prior or fundamentals-based model.

3. Regarding correlations: I think that Nate Silver’s approach has both the strengths and weaknesses of a highly empirical, non-model-based approach. I’ve never seen a document that describes what he’s done (fair enough, we don’t have such a document for the Economist model either!); my impression based on what I’ve read is that he started with poll aggregation, then applies some sort of weighting, then has an uncertainty model based on uncertainty in state forecasts and uncertain demographic swings. I think that some of the counterintuitive behavior in the tails is coming from the demographically-driven uncertainties and also because, at least when he was working under the Fivethirtyeight banner, he wanted to have wide uncertainties in the national electoral college forecasts, and with the method he was using, the most direct way to do this was to give huge uncertainties for the individual states. The result was weird stuff like the prediction that, if Trump were to win New Jersey, that his probability of winning Alaska would go down. This makes no sense to anyone other than Nate because, if Trump were to have won in New Jersey, that would’ve represented a total collapse of the Democratic ticket, and it’s hard to see how that would’ve played out as a better chance for Biden in Alaska. The point here is not that Nate made a judgment call about New Jersey and Alaska; rather, a 50-state prediction model is a complicated thing. You build your model and fit it to available data, then you have to check its predictions every which way, and when you come across results that don’t make sense, you need to do some mix of calibrating your intuitions (maybe it is reasonable to suppose that Trump winning New Jersey would be paired with Biden winning Alaska?) and figuring out what went wrong with the model (I suspect some high-variance additive error terms that were not causing problems with the headline national forecast but had undesirable properties in the tail). You can figure some of this out by following up and looking at other aspects of the forecast, as I did in the linked post.

So, yeah, I wouldn’t take the correlations of Nate’s forecast that seriously. That said, I wouldn’t take the correlations of our Economist forecast too seriously either! We tried our best, but, again, many moving parts and lots of ways to go wrong. One thing I like about Rajiv’s post is that he’s willing to do the same critical work on the market-based forecasts, not just treating them as a black box.

It’s Stanford time, baby: 8-hour time-restricted press releases linked to a 91% higher risk of hype

Adam Pollack writes:

You and the blog readers might find this interesting: https://newsroom.heart.org/news/8-hour-time-restricted-eating-linked-to-a-91-higher-risk-of-cardiovascular-death.

Yesterday, my friend was very concerned for me after he found out I usually don’t eat breakfast. He told me it’s dangerous. I thought it was as simple as not being hungry for a few hours after I wake up.

He showed me the above press release from the American Heart Association newsroom. I have never seen the results of an abstract for a poster publicized like this. It even made it to CNN (https://www.cnn.com/2024/03/19/health/intermittent-fasting-pros-cons-wellness/index.html). Both the press release and the CNN article emphasize that the findings are preliminary. For example, press release says “As noted in all American Heart Association scientific meetings news releases, research abstracts are considered preliminary until published in a peer-reviewed scientific journal.”

This doesn’t make me feel better about the situation. Let’s pretend this analysis was conducted perfectly (whatever that means). How would the AHA newsroom & CNN report the results if this was peer-reviewed and published? From the newsroom quote above, I get the sense that if it’s in the peer-reviewed scientific journal the press release wouldn’t have any caveat. Maybe they’ll even recommend people change their lifestyles and diets?

I’m being a little disingenuous because the editor’s note from the date after the first press release tells readers they should always consult with their doctor before making changes to their health regimens. Wait, why is there an editor’s note the day after a press release that provides full poster presentation details?? I’m guessing this caused an uproar to some degree in the community. In general, there’s a lot to unpack from this about science communication and the role of science in informing decisions. I’d be most interested in a discussion on your blog about those points, though I’m sure that the poster could inspire some nice statistical discussion too (https://s3.amazonaws.com/cms.ipressroom.com/67/files/20242/8-h+TREmortality_EPI+poster_updated+032724.pdf). For example, the press release reports the authors “were surprised to find that people who followed an 8-hour, time-restricted eating schedule were more likely to die from cardiovascular disease” and it turns out that’s one of 4 effects (look like interaction effects) w/ p < .05 across all the comparisons they make.

The press release refers to “those who followed an 8-hour time-restricted eating schedule, a type of intermittent fasting,” but from the poster, we see that this “eating duration” variable is the average eating duration for the two dietary recall days in the survey. Of the 414 people in the study who reported less than 8 hours averaging those two days, 31 died of cardiovascular disease during the period of the study. In comparison, the reference group is 12-16 hours, which included 11,831 people, of whom 423 died of cardiovascular disease. (31/414)/(423/11831) = 2.09. The estimated risk ratio is 1.91, which they estimated from a a hazard regression adjusting for a bunch of variables including demographics, smoking, and drinking but also total energy intake, body mass index, and self-reported health condition status.

Looks like noise mining to me, but, hey, all things are possible.

Based on what I see in the paper, the statement, “people who followed an 8-hour, time-restricted eating schedule were more likely to die from cardiovascular disease,” does not seem like an accurate description of the data. How you ate in two days of a survey is hardly an “eating schedule.”

Also they say, “Our study’s findings encourage a more cautious, personalized approach to dietary recommendations, ensuring that they are aligned with an individual’s health status and the latest scientific evidence,” which sounds like gobbledygook. You don’t need a statistical analysis to know that, right?

The press release quotes someone else as saying, “Overall, this study suggests that time-restricted eating may have short-term benefits but long-term adverse effects.” B-b-but . . . if they only asked about how people ate for 2 days, in what sense is this telling us about long-term effects? He does follow up with, “it needs to be emphasized that categorization into the different windows of time-restricted eating was determined on the basis of just two days of dietary intake,” and I’m like, yeah, but then how do you get away with that first statement? OK, he is at Stanford Medical School.

B-school prof data sleuth lawsuit fails

Stephanie Lee tells the story: “She Sued the Sleuths Who Found Fraud in Her Data. A Judge Just Ruled Against Her.” Good touch in the headline to say “Found” rather than “Alleged.”

Further background here (“Ted-talking purveyors of fake data who write books about lying and rule-breaking . . . what’s up with that?”), and lots more links here.

P.S. More here from Gideon Lewis-Kraus.

Decisions of parties to run moderate or extreme candidates

Palko points to this article by political journalist Jonathan Chait, “The Democrats Remembered How Politics Works Again: An end to a decade of magical thinking.” Chait points out that pundits and political professionals on the left, right, and center anticipated a Democratic wipeout in 2022, all for different reasons: The left thought the Democrats were insufficiently progressive and so would not be able to motivate their core voters; the right thought that Biden and the Democratic congress had alienated voters with their liberal policies; and everybody expected that historical patterns of midterm elections would result in big wins for the out-party. A few months before the election, Chris Wlezien and I argued that these expectations should change given the controversial decisions by the Republican-controlled Supreme Court during the months before the election.

Chait puts it well:

And indeed, the election results, both in the aggregate and in many of the particulars, vindicate the belief that voters tend to punish rather than reward parties and candidates they associate with radical ideas. To be sure, a tendency is not a rule. The largest factor driving election results is external world events: economic prosperity (or its absence), rallying around the flag in the event of a foreign attack, or widespread disgust with a failed war or major scandal. Midterm elections generally have large swings against the president’s party. One reason 2022 defied the pattern is that the Dobbs decision made Republicans, not Democrats, the party carrying out radical change. Candidates and parties seen as safe and moderate have an advantage — one that may not always override other factors but which matters quite a bit.

This is a fairly uncontroversial finding among political scientists.

I agree that this is a fairly uncontroversial finding among political scientists. See, for example, this unpublished paper with Jonathan Katz from 2007, and there’s a lot more literature on the topic.

Chait continues:

Yet in recent years, many influential figures in the Democratic Party had come to disbelieve it. A series of mistakes followed from this belief that Democrats would pay no penalty or may even benefit from moving farther away from the center.

I guess that some Democrats have this attitude, as do some Republicans. It’s natural to believe that the positions that you deeply hold would be shared by a majority of the population, if they were only given an opportunity to choose them. But I wonder if part of this, on both sides, is the rational calculation that moderation, while beneficial, doesn’t help that much, and so it can make sense in an election where you have a big advantage to run a more extreme candidate in order to get a policy benefit. That would explain the Republicans’ decision to choose extremist candidates in high-profile close races in 2022. Chait’s article has some interesting background on debates within the Democratic party on similar decisions in past elections.

Awesome online graph guessing game. And scatterplot charades.

Julian Gerez points to this awesome time-series guessing game from Ari Jigarjian. The above image gives an example. Stare at the graph for awhile and figure out which is the correct option.

I don’t quite know how Jigarjian does this—where he gets the data and the different options in the multiple-choice set. Does he just start with a graph and then come up with a few alternative stories that could fit, or is there some more automatic procedure going on? In any case, it’s a fun game. A new one comes every day. Some are easy, some not so easy. I guess it depends primarily on how closely your background knowledge lines up with the day’s topic, but also, more interestingly, on how much you can work out the solution by thinking things through.

This graph guessing game reminds me of scatterplot charades, a game that we introduce in section 3.3 of Active Statistics:

Students do this activity in pairs. Each student should come to class with a scatterplot on some interesting topic printed on paper or visible on their computer or phone, and then reveal the plot to the other student in the pair, a bit at a time, starting with the dots only and then successively uncovering units, axes and titles. At each stage, the other student should try to guess what is being plotted, with the final graph being the reveal.

In the book we give four examples. Here are two of them:

The time-series guessing game is different than scatterplot charades in being less interactive, but fun in its own way. The interactivity of scatterplot charades makes for a good classroom demonstration; the non-interactivity of the time-series guessing game makes for a good online app.

Bayesian social science conference in Amsterdam! Next month!

E. J. Wagenmakers writes:

This year Maarten Marsman and I are the local coordinators of the workshop Bayesian Methods for the Social Sciences II. We hope to organize this conference every two years, alternating between Paris and Amsterdam.

We have another great line-up of speakers this year, and we’d like to spread the word to a larger audience.

The conference takes place 16-18 October 2024 in Amsterdam.

Here’s the list of scheduled talks:

Merlise Clyde (Duke): Estimating Posterior Model Probabilities via Bayesian Model Based Sampling
Marie Perrot-Dockès (Université Paris Cité): Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator
Joris Mulder (Tilburg): An empirical Bayes factor for testing random effects

Monica Alexander (Toronto): Estimating Childlessness by Age and Race in the United States using a Bayesian Growth Curve Model
Leontine Alkema (University of Massachussetts): A Bayesian approach to modeling demographic transitions with application to subnational estimation and forecasting of family planning and fertility indicators
Douglas Leasure (Oxford): Population nowcasting in a digital world to support humanitarian action and sustainable development

Radu Craiu (Toronto): Bayesian Copula-based Latent Variable Models
Daniel Heck (Marburg): Bayesian Modeling of Uncertainty in Stepwise Estimation Approaches
Riccardo Rastelli (University College Dublin): A latent space model for multivariate time series analysis

Daniele Durante (Bocconi): Bayesian modeling of criminal networks
Nial Friel (University College Dublin): Bayesian stochastic ordered block models
Maarten Marsman (Amsterdam): Bayesian Edge Selection for Psychometric Network (Graphical) Models

Marco Corneli (Université Côté d’Azur): A Bayesian approach for clustering and exact finite-sample model selection in longitudinal data mixtures
Irene Klugkist (Utrecht): Bayesian Evidence Synthesis in the context of informative hypotheses
Eric-Jan Wagenmakers (Amsterdam): Optional Stopping

Herbert Hoijtink (Utrecht): Bayesian evaluation of single case experimental designs
Adrian Raftery (University of Washington): Bayesian climate change assessment
Robin Ryder (Paris-Dauphine and Imperial College London): Can Bayesian methods reconstruct deep language history?

François Caron (Oxford): Sparse Spatial Network Models for the analysis of mobility data
Geoff Nicholls (Oxford): Partial order models for social hierarchies and rank-order data
Amandine Véber (Université Paris Cité): Modelling expanding biological networks

Lots of great stuff here! I could do without the Bayes factor, but everything else looks really cool, an interesting mix of theoretical and applied topics.

The mainstream press is failing America (UK edition)

This is Bob.

I’m in London (about to head to StanCon!), so I saw today’s opinion piece in The Guardian (a UK newspaper not immune to the following criticisms), which I think is a nice summary of the sorry state of the media (and courts) in the United States.

It has a classic strong open,

The first thing to say about the hate and scorn currently directed at the mainstream US media is that they worked hard to earn it. They’ve done so by failing, repeatedly, determinedly, spectacularly to do their job, which is to maintain their independence, inform the electorate, and speak truth to power.

If you haven’t been following Mark Palko’s blog, West Coast Stats Views, or this whole storyline elsewhere, this article does a good job of framing the issue.

In the article, Jeff Jarvis, a former editor and columnist, is quoted as saying (er, tweeting [or do they call it X-ing now?]),

What ‘press’? The broken and vindictive Times? The newly Murdochian Post? Hedge-fund newspaper husks? Rudderless CNN or NPR? Murdoch’s fascist media?

The article reprises what Mitzi Morris has been saying ever since she worked at the New York Times in the 1990s when it was first going digital. She was appalled by the entire organization’s highly misleading approach to their readership statistics and focus on click-bait. Her take agrees with the article’s,

In pursuit of clickbait content centered on conflicts and personalities, they follow each other into informational stampedes and confirmation bubbles.

As Palko has been pointing out on his blog, the mainstream media, in part led by the New York Times, no longer seems concerned with candidates’ mental health and age now that they can’t criticize Joe Biden. Instead, you get what is described by the article this way,

They pursue the appearance of fairness and balance by treating the true and the false, the normal and the outrageous, as equally valid and by normalizing Republicans, especially Donald Trump, whose gibberish gets translated into English and whose past crimes and present-day lies and threats get glossed over.

This whole trend goes back to at least the Clinton/Trump election. Their relentless focus on Clinton’s email while ignoring all of Trump’s malfeasance led me to stop reading the Times after that. Also, whenever I read anything I know about in the press, like epidemiology or computer science, the coverage is appallingly misleading.

There’s a lot more detail in the article. And a whole lot more on Palko’s blog if you want to do a deeper dive. On a related topic, I’d also recommend Palko’s coverage of Elon Musk’s shenanigans.

Modeling Weights to Generalize (my talk this Wed noon at the Columbia University statistics department)

In the student-organized seminar series, Wed 11 Sep 2024 noon in room 903 room 1025 Social Work Bldg:

Modeling Weights to Generalize

A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights came from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratifcation on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest. For background, see here: http://www.stat.columbia.edu/~gelman/research/unpublished/weight_regression.pdf

Research!

Here’s a useful response by Christakis to criticisms of the contagion-of-obesity claims

Yesterday I posted an update on citations of the influential paper from 2007 by sociologist Nicholas Christakis and political scientist James Fowler, “The Spread of Obesity in a Large Social Network over 32 Years,” which concluded, “Network phenomena appear to be relevant to the biologic and behavioral trait of obesity, and obesity appears to spread through social ties.”

As I wrote yesterday, several other researchers had criticized that paper on methodological grounds, and in my post I characterized it as being “many times debunked” and expressed distress that the original claim seems to be regularly cited without reference to the published criticisms by economists Jason Fletcher and Ethan Cohen-Cole, mathematician Russ Lyons, political scientists Hans Noel and Brendan Nyhan, and statisticians Cosma Shalizi and Andrew Thomas.

That said, I am not an expert in this field. I have read the articles linked in the above post but have not kept track of the later literature.

In comments, Christakis shares his perspective on all this:

I [Christakis] think this post is an incomplete summary of the very carefully expressed claims in the original 2007 paper, and also an inaccurate summary of the state of the literature. You also may want to look at our original exchanges on this blog from years ago, and to our published responses to prior critiques, including to some of the decades-old critiques (often inaccurate) that you mention.[1]

Many papers have replicated our findings of social contagion with respect to obesity (and the various other phenomena discussed in our original suite of papers), and many papers have evaluated the early methods we used (based on generalized estimating equations) and have supported that approach.

For instance, analyses by various scholars supported the GEE approach, e.g., by estimating how large the effect of unobserved factors would have to be to subvert confidence in the results.[2],[3],[4],[5] Other papers supported the findings in other ways.[6],[7],[8],[9],[10] This does not mean, of course, that this GEE approach does not require various assumptions or is perfectly able to capture causal effects. This is one reason the 2007 paper described exactly what models were implemented, was judicious in its claims, and also proposed certain innovations for causal identification, including the “edge-directionality test.” The strengths and limitations of the edge-directionality test for causal identification have subsequently been explored by computer scientists,[11] econometricians,[12] statisticians,[13] and sociologists.[14]

Work by other investigators with other datasets and approaches has generally confirmed the 2007 findings. Pertinent work regarding weight and related behaviors is quite diverse, including everything from observational studies to experiments.[15],[16],[17],[18],[19],[20],[21],[22],[23] Of course, as expected, work has also confirmed the existence of homophily with respect to weight. Still other studies have used experimental and observational methods to confirm that one mechanism of the interpersonal spread of obesity might indeed be a spread of norms, as speculated in the 2007 paper.[24],[25],[26],[27]

Of course, methods to estimate social contagion with observational data regarding complex networks continue to evolve, and continue to require various assumption, as ours did in 2007. As before, I would love to see someone offer a superior statistical method to observational data. And I should also clarify that the public-use version of the FHS-Net data (posted at dbGap) is not the same version as the one we based our 2007 analyses on (a constraint imposed by the FHS itself, in ways documented there); however, the original data is available via the FHS itself (or at least was). At times, this difference in datasets explains why other analyses have reached slightly different conclusions than us.

In our 2007 paper, we also documented an association of various ego and alter traits to a geodesic depth of three degrees of separation. We also did this with respect to public goods contributions in a network of hunter-gatherers in Tanzania[28] and smoking in the FHS-Net.[29] Other observational studies have also noted this empirical regularity with respect to information,[30],[31] concert attendance,[32] or even the risk of being murdered.[33] We summarized this topic in 2013.[34]

Moreover, we and others have observed actual contagion up to three degrees of separation in experiments which absolutely excludes homophily or context as an explanation for the clustering.[35],[36],[37] For instance, Moussaid et al documented hyper-dyadic contagion of risk perception experimentally.[38] Another experiment found that the reach of propagation in a subjective judgment task “rarely exceeded a social distance of three to four degrees of separation.”[39] A massive experiment with 61 million people on Facebook documented the spread of voting behavior to two degrees of separation.[40] A large field experiment with 24,702 villagers in Honduras showed that certain maternal and child health behaviors likewise spread to at least two degrees of separation.[41] And a 2023 study involved 2,491 women household heads in 50 poor urban residential units in Mumbai documented social contagion, too.[42]

In addition, my own lab, in the span from as early as 2010 to as recently as 2024 has published many demanding randomized controlled field trials and other experiments documenting social contagion, as noted above. For instance, my group published our first experiment with social contagion in 2010,[43] as well as many other experiments involving social contagion in economic games using online subjects[44],[45],[46] often stimulating still other work.[47],[48],[49]

Many other labs, in part stimulated by our work, have conducted many other experiments documenting social contagion. The idea of using a network-based approach to exploit social contagion to disseminate an intervention – so as to change knowledge, attitudes, or practices at both individual and population levels – has been evaluated in a range of settings.[50],[51],[52],[53],[54],[55],[56],[57],[58]

Finally, other rigorous observational and experimental studies involving large samples and mapped networks have explored diverse outcomes in recent years, beyond the examples reviewed so far. For instance, phenomena mediated by online interactions include phone viruses,[59] diverse kinds of information,[60],[61],[62] voting,[63] and emotions.[64] In face-to-face networks, phenomena as diverse as gun violence in Chicago,[65] microfinance uptake in India,[66] bullying in American schools,[67] chemotherapy use by physicians,[68] agricultural technology in Malawi,[69] and risk perception[70] have been shown to spread by social contagion.

The above, in my view, is a fairer and more complete summary of the impact, relevance, and accuracy of our claims about obesity in particular and social contagion in general. Work in the field of social contagion in complex networks, using observational and experimental studies has exploded since we published our 2007 paper.

The list of references is at the end of Christakis’s comment.

Back in 2010 I wrote that this area is ripe for statistical development and also ripe for development in experimental design and data collection. As of 2024, the area is not just “ripe for development” in experimental design, data collection, and statistical analysis; there have also been many developments in all these areas, by Christakis, his collaborators, and other research groups, and my earlier post was misleading: just because I was ignorant of that followup literature, that’s not an excuse for me to act as if it didn’t exist.

One question here is how to think about the original Christakis and Fowler (2007) paper. On one hand, I remain persuaded by the critics that it made strong claims that were not supported by the data at hand. On the other hand, it was studying an evidently real general phenomenon and it motivated tons of interesting and important research.

Whatever its methodological issues, Christakis and Fowler (2007) is not like the ESP paper or the himmicanes paper or the ovulation-and-voting papers, say, whose only useful contributions to science were to make people aware of the replication crisis and motivate some interesting methodological work. One way to say this is that the social contagion of behavior is both real and interesting. I don’t think that’s the most satisfying way to put this—the people who study ESP, social priming, evolutionary psychology, etc., would doubtless say that their subject areas are both real and interesting too!—so consider this paragraph as a placeholder for a fuller investigation of this point (ideally done by someone who can offer a clear perspective than I can here).

In summary:

1. I remain convinced by the critics that the original Christakis and Fowler paper did not have the evidence to back up its claims.

2. But . . . that doesn’t mean there’s nothing there! In their work, Christakis and Fowler (2007) were not just shooting in the dark. They were studying an interesting and important phenomenon, and the fact that their data were too sparse to answer the questions they were trying to answer, well, that’s what motivates future work.

3. This work does not seem to me to be like various notorious examples of p-hacked literature such as beauty-and-sex-ratio, ovulation-and-clothing, mind-body-healing, etc., and I think a key difference is that the scientific hypotheses involving contagion of behavior are more grounded in reality rather than being anything-goes theories that could be used to explain any pattern in the data.

4. I was wrong to refer to the claim of contagion of obesity as being debunked. That original paper had flaws, and I do think that when it is cited, the papers by its critics should be cited too. But that doesn’t mean the underlying claims are debunked. This one’s tricky—it relates to the distinction between evidence and truth—and that’s why followups such as Christakis’s comment (and the review article that it will be part of) are relevant.

I want to thank Christakis again for his thoughtful and informative response, and I apologize for the inappropriate word “debunked.” I’ve usually been so careful over the years to distinguish evidence and truth, but this time I was sloppy—perhaps in the interest of telling a better story. I’ll try to do better next time.

P.S. I think that one problem here is the common attitude that a single study should be definitive. Christakis and Fowler don’t have that attitude—they’ve done lots of work in this area, not just resting their conclusions on one study—and I don’t have that attitude either. I’m often saying this, that (a) one study is rarely convincing enough to believe on its own, and, conversely, (b) just because a particular study has fatal flaws in its data, that doesn’t mean that nothing is there. We usually criticize the single-study attitude when researchers or journalists take one provisional result and run with it. In this case, though, I fell into the single-study fallacy myself by inappropriately taking the well-documented flaws of that one paper as evidence that nothing was there.

That all said, I’m sure that different social scientists have different views on social contagion, and so I’m not trying to present Christakis’s review as the final word. Nor is he, I assume. It’s just appropriate for me to summarize his views on the matter based on all this followup research he discusses and not have the attitude that everything stopped in 2011.

Sports gambling addiction epidemic fueled by some combination of psychology, economics, and politics

We’ve written about this before, for example:

2012: There are four ways to get fired from Caesars: (1) theft, (2) sexual harassment, (3) running an experiment without a control group, and (4) keeping a gambling addict away from the casino

2022: Again on the problems with technology that makes it more convenient to gamble away your money

2023: There are five ways to get fired from Caesars: (1) theft, (2) sexual harassment, (3) running an experiment without a control group, (4) keeping a gambling addict away from the casino, (5) refusing to promote gambling to college students

Corbin Smith shares some new stories on the unfortunately topical subject of gambling addiction and how it relates to the financing and the sports media. In his article, Smith implicitly makes a strong case that to understand the problem you need to think about interactions between psychology, economics, and politics. The sports, news, and entertainment media are pushing gambling so hard. I guess that in the future we will look back on the present era and laugh/cringe in the same way that we laugh/cringe at the “Mad Men”-style drinking and smoking culture from the 1950s.