
Does this fallacy have a name?

Rafa Irizarry writes:

What do we call it when someone thinks cor(Y,X) = 0 because lim h -> 0 cor( X, Y | X \in (x-h, x+h) ) = 0


Steph, Kobe, and Jordan are average (or below average) height in the NBA so height does not predict being good at basketball.

GRE math scores don’t predict success in a Math Phd program so you don’t need to know GRE level Math to enter Math PhD program:

I can’t find a name for it.

My reply: I don’t know if there’s a name for it. It’s indeed a well known point—I guess that Gauss, Laplace, Galton, etc., knew about it. We make the point in the attached figure from my two books with Jennifer Hill. Here it is in Regression and Other Stories:

I’ll blog and see if anyone out there knows the name of the fallacy.
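The gap between the two correlations in Irizarry's question is easy to see in a simulation. The sketch below is only an illustration under assumed conditions: X is standard normal, Y is X plus independent standard normal noise, and the band center `x0` and half-width `h` are arbitrary choices. None of these come from the basketball or GRE examples.

```python
import math
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
n = 100_000

# Assumed toy model: Y depends strongly on X in the full population.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]

full_corr = corr(x, y)

# Condition on X falling in a narrow band (x0 - h, x0 + h).
x0, h = 2.0, 0.05
band = [(xi, yi) for xi, yi in zip(x, y) if abs(xi - x0) < h]
band_corr = corr(*zip(*band))

print(f"full sample: cor = {full_corr:.2f}")
print(f"band only (n = {len(band)}): cor = {band_corr:.2f}")
```

Under this toy model the full-sample correlation is near 1/sqrt(2), while the within-band correlation is close to zero, because conditioning on the band removes almost all of the variation in X. A near-zero correlation inside the band says nothing about the correlation in the population.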

Political polarization of professions

Seeing this newspaper article, “In an outraged Louisville, a Police Force in Crisis,” made me think of this discussion from a few years ago.

What happened was that a group of psychology researchers wrote an article, “Political Diversity Will Improve Social Psychological Science,” arguing that the field of social psychology would benefit from the inclusion of more non-liberal voices (here I’m using “liberal” in the sense of current U.S. politics). In our discussion, Neil Gross and I wrote that “when considering ideological balance, it is useful to place social psychology within a larger context of the prevailing ideologies of other influential groups within society, such as military officers, journalists, and business executives.”

Survey researchers have looked at this from various directions; for example here:

and here, using more general occupation categories:


Political leanings of educators, police officers, and business executives seem like more of a big deal than political leanings of doctors or roofers. Teachers and cops interact with students and the general public in ways where it seems that their political leanings could make a real difference, and businesspeople can directly influence government policy through lobbying, political donations, and directly running for office themselves.

I’m not sure what could or should be done about this, though. Would we want proportional representation in all professions? How would this be implemented? Would teachers and police officers have to take a political quiz before starting their job, and then only be hired if they fill the quota? Would businesses be required to submit reports on their leaders’ politics?

On the other hand, the lack of any feasible resolution of this issue—that professions differ in their politics from the general population—should not be taken to imply that it’s not a concern. I’m just not quite sure what to say about it.

If something is being advertised as “incredible,” it probably is.

This post was originally titled, “Asleep at the wheel: Junk science gets promoted by Bill Gates, NPR, Ted, University of California, Google, etc.,” but I decided the above quote had better literary value.

We’ve had a few posts now about discredited sleep scientist Matthew Walker; see here and here, and here, with the disappointing but unsurprising followup that his employer, the University of California, just doesn’t seem to care about his research misconduct. According to his website, Walker is employed by Google as well.

Heart attacks

Also disappointingly but unsurprisingly, further exploration of Walker’s work reveals further misrepresentations of research.

Markus Loecher shared this investigation of some attention-grabbing claims made in the celebrated Ted talk, “Sleep is your superpower.”

Here’s Walker:

I could tell you about sleep loss and your cardiovascular system, and that all it takes is one hour. Because there is a global experiment performed on 1.6 billion people across 70 countries twice a year, and it’s called daylight saving time. Now, in the spring, when we lose one hour of sleep, we see a subsequent 24-percent increase in heart attacks that following day. In the autumn, when we gain an hour of sleep, we see a 21-percent reduction in heart attacks. Isn’t that incredible? And you see exactly the same profile for car crashes, road traffic accidents, even suicide rates.

“Isn’t that incredible?”, indeed. This reminds me of the principle that, if something is being advertised as “incredible,” it probably is.

Loecher decided to look into the above claim:

I [Loecher] tend to be sensitive to gross exaggerations disguised as “scientific findings” and upon hearing of such a ridiculously large effect of a one-day-one-hour sleep disturbance, all of my alarm bells went up!

Initially I was super excited about the suggested sample size of 1.6 billion people and wanted to find out how exactly such an incredible data set could possibly have been gathered. Upon my inquiry, Matthew was kind enough to point me to the paper, which was the basis for the rather outrageous claims from above. Luckily, it is an open access article in the openheart Journal from 2014.

Imagine my grave disappointment to find out that the sample was limited to 3 years in the state of Michigan and had just 31 cases per day! On page 4 you find Table 1 which contains the quoted 24% increase and 21% decrease expressed as relative risk (multipliers 1.24 and 0.79, respectively):

Loecher notes the obvious multiple comparisons issues.
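To get a rough sense of how noisy day-level ratios are at this sample size, here is a minimal sketch. It is not a reanalysis of the paper. It assumes daily counts are Poisson with no real effect, pools one day per year over the 3 years Loecher mentions, compares the pooled count to a known expected value, and treats the table as 14 comparisons (7 weekdays times 2 clock changes). The Poisson assumption, the count of 14, and the helper `poisson` are mine, not the paper's.

```python
import math
import random

cases_per_day = 31   # from Loecher's summary of the paper
n_years = 3          # from Loecher's summary of the paper
n_comparisons = 14   # assumption: 7 weekdays x 2 clock changes

# Relative standard deviation of a Poisson count is 1 / sqrt(mean).
rel_sd_one_day = 1 / math.sqrt(cases_per_day)    # about 0.18
expected_pooled = cases_per_day * n_years
rel_sd_pooled = 1 / math.sqrt(expected_pooled)   # about 0.10

def poisson(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

# Under the null, how often does at least one of the 14 ratios
# land at 1.20 or above, or at 0.80 or below?
rng = random.Random(1)
n_sims = 1000
hits = 0
for _ in range(n_sims):
    ratios = [poisson(rng, expected_pooled) / expected_pooled
              for _ in range(n_comparisons)]
    if any(r >= 1.20 or r <= 0.80 for r in ratios):
        hits += 1
prob_some_big_ratio = hits / n_sims

print(f"relative sd, single day:  {rel_sd_one_day:.2f}")
print(f"relative sd, pooled days: {rel_sd_pooled:.2f}")
print(f"P(at least one ratio off by 20%+ under the null): {prob_some_big_ratio:.2f}")
```

With a relative standard deviation of roughly 10% on each pooled ratio, a swing of 20% or so somewhere in a table of 14 ratios is not a rare event under these assumptions, even if the clock change does nothing.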

But what I want to focus on is the manipulation or incompetence (recall Clarke’s Law).

To start with, here’s the summary from the above-linked article:

Now, we could argue about whether the data really show that daylight savings time “impacts the timing” of acute myocardial infarction—arguably, the data here are consistent with no effect on timing at all! But let’s set that aside and focus on the other point of their summary: daylight savings time “does not influence the overall incidence of this disease.”

This completely contradicts Walker’s theme of sleep deprivation being dangerous. It did not influence the overall incidence of the disease!

Presumably Walker realized this: even if he didn’t read the whole article, he must have read the abstract, at least to pull out those 24% and 21% numbers. (Or maybe Walker’s research assistant did it, but no matter. If Walker gets credit for the book and the Ted talk, he also gets blame for the errors and misrepresentations that he puts out under his name.)

So . . . he read a paper claiming that, at most, daylight time is associated with some time-shifting of heart attacks, and he misrepresents that as being associated with an increase.

Also, he says it is “a global experiment performed on 1.6 billion people,” but he’s reporting results on one U.S. state. He must have realized that too, no, that this was not an N = 1.6 billion study???

But wait, there’s more. We switch to daylight time at 2am on Sunday. So you might expect the largest effects to occur on Sunday, the day with the sleep deprivation. Or maybe Monday, the first day back at work. All sorts of things are possible. The point is that, by saying it as “that following day,” Walker is hiding the choice. If it’d been Sunday, it would’ve been “the very day of,” etc. And then when he talks about autumn, he doesn’t say “that following day,” just leaving the (false) impression that it’s the same pattern both seasons.

Tuesday, huh?

I also wonder about Walker’s other claim, that at the switch to daylight or standard time “you see exactly the same profile for car crashes, road traffic accidents, even suicide rates.”

Exactly the same, huh? I’ll believe it when I see the data, and not before.


Loecher also looks into Walker’s claim that “Men who sleep five hours a night have significantly smaller testicles than those who sleep seven hours or more.” There seems to be no good evidence for that one either.

Ted and Edge and all the rest

I hate the whole Ted talk, Edge foundation, Great Man model of science. It can destroy people. As I wrote last year:

Don’t ever think you’re too good for your data. . . .

Which reality do we care about? The scientific reality of measurement and data, or the social reality that a Harvard [or University of California] professor can get caught falsifying data and still be presented as an authority on science and philosophy. Ultimately, both realities matter. But let’s not confuse them. Let’s not confuse social power with scientific evidence. Remember Lysenko. . . .

OK, why am I picking on these guys? Marc Hauser and the Edge foundation: are these not the deadest of dead horses? But remember what they say about beating a dead horse. The larger issue—a smug pseudo-humanistic contempt for scientific measurement, along with an attitude that money plus fame = truth—that’s still out there.

An always-relevant quote

From Dan Davies: Good ideas do not need lots of lies told about them in order to gain public acceptance.

A message to Bill Gates, NPR, Ted, University of California, Google, etc.

It’s not your fault that you got scammed. I mean, sure, it’s kind of your fault for not checking, but that’s easy to say after the fact. Anyone can get scammed. I’ve been scammed! My political scientist colleague Don Green got scammed! Harvard got scammed by that disgraced primatologist. Cornell got scammed . . . George Shultz got scammed . . . maybe you’ve heard about that one. People get scammed. Scammers scam people, that’s what they do.

You got scammed. That’s in the past, now.

The question is: Are you gonna continue to let yourself get scammed?

I’ll break that down into 2 questions:

1. Do you now, at last, realize you’ve been scammed? (If not, what would it take? An actual manipulated graph?? No, I guess not; we already have one of those!)

2. If your answer to the first question is Yes, then are you gonna decide that it’s worth your while to continue to get scammed, because the cost in effort and bad publicity is worse than the cost of continuing to promote this stuff?

If the answer to question 1 is No, that’s just sad, that people could see all this evidence and still not get the point.

If the answer to question 1 is Yes and the answer to question 2 is No, that’s even sadder.

A quick google search appears to reveal six separate appearances by Walker on NPR, with the most recent being this June, several months after Guzey’s takedown of Why We Sleep. But I’m guessing that, once NPR had Walker on once or twice, he became a known quantity for them, so they just put him in the Expert category.

Ted? They’ve had iffy talks before, I guess it comes with the territory. They’re probably focusing on scheduling and promoting new talks, not on problems with talks they’ve already posted.

Bill Gates? He endorsed Walker’s book and now he’s moved on. Gates probably doesn’t care about an endorsement that’s sitting in the past.

The University of California? They know about the problems with Walker’s work but they’ve carefully looked away. I think they’re basically Yes on question 1 and No on question 2, except that they’ve tried really hard to avoid answering question 1. At some level, they must know they’ve been scammed, but as long as they avoid looking at Walker’s work carefully (even to the extent of carefully reading a few blog posts), they can maintain a facade of uncertainty.

Google? I have no idea. I don’t know what Walker does for Google. Maybe he’s doing great work for them. Yes, he has a problem with exaggerating research claims in publications and public talks, but maybe he does excellent work when the lights of publicity are not shining. In that case, he’s not scamming Google at all.

Also, Walker could well be scamming himself. I’m not trying to paint him as some cackling villain here. I could well imagine he’s a true believer in the healing power of sleep, and that when he misrepresents the evidence, in his view that’s just because he doesn’t have all the data at hand. Sure, they didn’t really have data on 1.6 billion people, but if they did, it would undoubtedly confirm his views. He has a direct line to the truth, and he’d be remiss if he didn’t shout it from the treetops.

The trouble is, people who think they have a direct line to the truth often don’t. Recall the above quote from Dan Davies. Tony Blair probably thought those WMDs were real too, at some point; or, if they weren’t, they could’ve been, right? And the war was a good idea anyway, right? Etc. To return to sleep studies: the data support the theory, and once you believe the theory, you don’t need the data anymore.

As always, I’d have no problem if this guy were to just straight-up give inspirational talks and write inspirational books. He could just say that he personally believes in the importance of sleep, and his broad reading and experience has led him to this conclusion. It’s when he misrepresents data, that’s where I have the problem (even though the University of California doesn’t seem to mind).

“Pictures represent facts, stories represent acts, and models represent concepts.”

I really like the above quote from noted aphorist Thomas Basbøll. He expands:

Simplifying somewhat, pictures represent facts, stories represent acts, and models represent concepts. . . . Pictures are simplified representations of facts and to use this to draw a hard and fast line between pictures and stories and models is itself a simplified picture, story or model of pictures, stories and models. Sometimes a picture tells a story. Sometimes a model represents a fact. The world is a complicated place and the mind is a complicated instrument for making sense of it. Still, simple distinctions can be useful . . .

When I say that a picture represents a fact I mean that it makes an arrangement of things present in your imagination. It’s true that we sometimes also try to imagine what is “going on” in, say, a painting, but we know that this is an extrapolation from the facts it represents. There’s also usually a whole atmosphere or “mood” in a picture, which is hard to reduce to a mere state of affairs. In David Hockney’s “A Bigger Splash,” for example, the fact is a splash of water in a pool with a diving board. We don’t know exactly what made the splash but we assume it is a person. There’s a feeling about the scene that I will leave it to you to experience for yourself, but we can imagine a photograph representing roughly the same facts. . . .

When I say that a story represents an act I mean that it gets us to imagine people doing things, or things happening to people. That’s a gross simplification, to be sure. It’s possible to tell a story about a pool freezing over or ducks landing in it. Things happening to things or animals happening to them. But I think we do actually always anthropomorphize these events a little bit when we tell stories, sometimes barely perceptibly. If we didn’t, I want to argue, we wouldn’t be able to tell a story. . . .

Models are simplifications in perhaps more obvious ways. They will always represent only selected aspects of the reality they are modelling. When I say that a model represents concepts, I mean that they get us to imagine what it is possible to think about a certain population of things or people. . . .

This is related to the idea we’ve discussed from time to time, of storytelling as the working out of logical possibilities.

Postdoc in Bayesian spatiotemporal modeling at Imperial College London!

Seth Flaxman writes:

We are hiring a postdoctoral research associate with a background in statistics or computer science to join a vibrant team at the cutting edge of the emerging field of spatiotemporal statistical machine learning (ST-SML). ST-SML draws in equal parts on Bayesian spatiotemporal statistics, scalable kernel methods and Gaussian processes, and recent deep learning advances in the field of computer vision. See to get an idea of our research on diverse topics including COVID-19 and criminology.

You will be based in the statistics section of the Department of Mathematics at Imperial College London. You will help lead a multiyear programme of methodological research, to tackle pressing public policy problems in collaboration with leading international organisations.

You will regularly collaborate with key external partners, including the Stan Development Team, the World Food Programme, UNAIDS, and NASA. You will regularly collaborate with key internal partners, including the Medical Research Council (MRC) Centre for Global Infectious Disease Analysis in Imperial’s world-renowned School of Public Health. In addition to research, you will help train practitioners in partner organisations. You will also have the opportunity (if desired) to participate in the supervision of postgraduate students and teaching at the doctoral level.

The position is fixed term for 36 months. The expected start date is 15 October 2020 or thereafter.

Contact Seth Flaxman to discuss. Closing date: 5 October 2020.

More information and how to apply:

Stan! Bayes! Criminology!

(1) The misplaced burden of proof, and (2) selection bias: Two reasons for the persistence of hype in tech and science reporting

Palko points to this post by Jeffrey Funk, “What’s Behind Technological Hype?”

I’ll quote extensively from Funk’s post, but first I want to make a more general point about the burden of proof in scientific discussions.

What happens is that a researcher or team of researchers makes a strong claim that is not well supported by evidence. But the claim gets published, perhaps in a prestigious journal such as the Journal of Theoretical Biology or PNAS or Lancet or the American Economic Review or Psychological Science. The authors may well be completely sincere in their belief that they’ve demonstrated something important, but their data don’t really back up their claim. Later on, the claims are criticized: outside researchers look carefully at the published material and point out that the evidence isn’t really there. Fine. The problem arises when the critics are then held to a higher standard: it’s not enough for them to point out that the original paper did not offer strong evidence for its striking claim; the critics are asked to (impossibly) prove that the claimed effect cannot possibly be true.

It’s a sort of Cheshire Cat phenomenon: Original researchers propose a striking and noteworthy (i.e., not completely obvious) idea, which is published and given major publicity based on purportedly strong statistical and experimental evidence. The strong evidence turns out not to be there, but—like the smile of the Cheshire cat—the claim remains even after the evidence has disappeared.

This is related to what we’ve called the “research incumbency advantage” (the widespread attitude that a published claim is considered true unless conclusively proved otherwise), and the “time-reversal heuristic” (my suggestion to suppose that the counter-argument or failed replication came first, with the splashy study following after).

Now to Funk’s post on technological hype:

Start-up losses are mounting and innovation is slowing. . . . The large losses are easily explained: extreme levels of hype about new technologies, and too many investors willing to believe it. . . . The media, with help from the financial sector, supports the hype, offering logical reasons for the [stock] price increases and creating a narrative that encourages still more increases. . . .

The [recent] narrative began with Ray Kurzweil’s 2005 book, The Singularity is Near, and has expanded with bestsellers such as Erik Brynjolfsson and Andrew McAfee’s Race Against the Machine (2012), Peter Diamandis and Steven Kotler’s Abundance (2012), and Martin Ford’s The Rise of the Robots (2015). Supported by soaring venture capitalist investments and a rising stock market, the world described in these books is one of rapid and disruptive technological change that will soon lead to great prosperity and perhaps massive unemployment. The media has amplified this message even as evidence of rising productivity or unemployment has yet to emerge.

Here I [Funk] discuss economic data showing that many highly touted new technologies are seriously over-hyped, a phenomenon driven by online news and the professional incentives of those involved in promoting innovation and entrepreneurship. This hype comes at a cost—not only in the form of record losses by start-ups, but in their inability to pursue alternative designs and find more productive and profitable opportunities . . .

These indicators are widely ignored, in part because we are distracted by information appearing to carry a more positive message. The number of patent applications and patent awards has increased about sixfold since 1984, and over the past 10 years the number of scientific papers has doubled. The stock market has tripled in value since 2008. Investments by US venture capitalists have risen about sixfold since 2001 . . . Such upward trends are often used to hype the economic potential of new technologies, but in fact rising patent activity, scientific publication, stock market value, and venture capital investment are all poor indicators of innovativeness.

One reason they are poor indicators is that they don’t consider the record-high losses for start-ups, the lack of innovations for large sectors of the economy such as housing, and the small range of technologies being successfully commercialized by either start-ups or existing firms. . . .

Funk then talks about the sources of hype:

For more recent technologies such as artificial intelligence, a major source of hype is the tendency of tech analysts to extrapolate from one or two highly valued yet unprofitable start-ups to total disruptions of entire sectors. For example, in its report Artificial Intelligence: The Next Digital Frontier? the McKinsey Global Institute extrapolated from the purported success of two early AI start-ups, DeepMind and Nest Labs, both subsidiaries of Alphabet (Google’s parent company), to a 10% reduction in total energy usage in the United Kingdom and other countries. However, other evidence for these purported energy reductions in data centers and homes are nowhere to be found, and the start-ups are currently a long way from profitability. Alphabet reported losses of approximately $580 million in 2017 for DeepMind and $569 million in 2018 for Nest Labs. . . .

Hype and its amplification come from many quarters: not only the financial community but also entrepreneurs, venture capitalists, consultants, scientists, engineers, and universities. . . .

Ya think??

Funk continues:

Online tech-hyping articles are now driven by the same dynamics as fake news. Journalists, bloggers, and websites prioritize page views and therefore say more positive things to attract viewers, while social media works as an amplifier. Journalists become “content marketers,” often hired by start-ups and universities to promote new technologies. Entrepreneurs, venture capitalists, university public relation offices, entrepreneurship programs, and professors who benefit from the promotion of new technologies all end up sharing an interest in increasing the level of hype. . . .

And this connects to the point I made at the beginning of this post. Once a hyped idea gets out there, it’s the default, and merely removing the evidence in favor is not enough. Mars One, Hyperloop, etc.: sure, eventually they fade, but in the meantime they suck up media attention and $$$, in part because they become the default, and the burden of proof is on the skeptics.

Selection bias in tech and science reporting

One other thing: the remark that journalists etc. “say more positive things to attract viewers” reminds me of what I’ve written about selection bias in science reporting (see also here). Lots of science reporters want to do the right thing, and, yes, they want clicks and they want to report positive stories—I too would be much more interested to read or write about a cure for cancer than about some bogus bit of noise mining—and these reporters will steer away from junk science. But here’s where the selection bias comes in: other, less savvy or selective or scrupulous reporters will jump in and hype the junk. So, with rare exceptions (some studies are so bad and so juicy that they just beg to be publicly debunked), the bad studies get promoted by the clueless journalists, and the negative reports don’t get written.

My point here is that selection bias can give us a sort of Gresham effect, even without any journalists knowingly hyping anything of low quality.

“this large reduction in response rats”

Spell check doesn’t catch all the typos.

Bill James is back

I checked Bill James Online the other day and it’s full of baseball articles! I guess now that he’s retired from the Red Sox, he’s free to share his baseball thoughts with all. Cool!

He has 8 posts in the past week or so, which is pretty impressive given that each post has some mixture of data, statistical analysis, and baseball thinking. It’s hard for me to imagine he can keep this up—sure, I do a post a day or so, but most of my posts don’t include original statistical analysis!—but he should go for it as long as he can. Keep the momentum going.

James’s most recent post (at the time of this writing) begins:

Double Plays and Stolen Base Prevention; these things keep the game under control. Our first task today is to estimate how many runs each team has prevented by turning the Double Play. . . .

The 1941 Yankees turned 196 Double Plays. Had they been just average at turning the double play we would have expected them to turn 151, which is an above-average average; the average over time is 139. (The team which would have been expected to turn the most double plays, for whatever this is worth, is the 1983 California Angels, who could have been expected to turn 202 Double Plays, since (a) the team gave up a huge number of hits, and (b) they had an extreme ground ball staff. The Angels actually turned 190 Double Plays, only six fewer than the 1941 Yankees, but 12 below expectation in their case.) . . .

I made a decision earlier that I would use three standard deviations below the norm as the zero-standard in an area in which higher numbers represented excellence, and four standard deviations below the norm as the zero-standard in an area in which higher numbers represented failure. . . .

This was a questionable decision, in the construction of the system, and we’ll revisit it at an appropriate point, but for now, I’m proceeding with 3 standard deviations below the norm as the zero-value standard for double plays. The standard deviation for the 1940s is 16.12—another questionable choice in there, by the way—so three standard deviations below the norm would be 52 double plays. . . .

I just looove this, not so much the baseball and the statistical analysis—that’s all fine—what I really love is the style. It’s just sooo Bill James. I’m reminded not so much of previous Bill James things I’ve read, but of Veronica Geng’s affectionate parody of the Bill James abstracts from back in the 1980s. Reading Geng’s story takes me back to what it felt like then, seeing the new Abstract appear every spring. The Bill James Abstract was pretty much the only statistics out there, period. There was no Freakonomics, there were no data journalists, etc. And that style! It’s hard to pick out exactly what James is doing here, but the style is unmistakably his. Good to see that some things never change.

Further reading

Also relevant:

A Statistician Rereads Bill James

Jim Albert’s blog on baseball statistics

Bill James does model checking

“Faith means belief in something concerning which doubt is theoretically possible.”

A collection of quotes from William James that all could’ve come from Bill James

P.S. I came across this post. Dude should learn about Bayes and partial pooling!

His data came out in the opposite direction of his hypothesis. How to report this in the publication?

Fabio Martinenghi writes:

I am a PhD candidate in Economics and I would love to have guidance from you on this issue of scientific communication. I did an empirical study on the effect of a policy. I had a hypothesis, which turned out to be wrong, in the sense that the signs of the effects were the opposite of what I expected (and robust to several different specifications and estimators). I looked at the impact of such policy from several angles (looking at different dependent variables), which implies that there is not an infinite number of hypotheses consistent with the results. Of course, once I stood contradicted by the data, I noticed another aspect of the issue which in turn made me come up with a new convincing explanation which, once tested, turned out to be consistent with the results.

Because I care deeply about good science and academic integrity, I wonder how can I write my paper without breaking the conventions in academic writing while avoiding to pretend that I held my final hypothesis as true since the very beginning.

I will learn Bayesian methods as soon as I can (now I need to graduate).

My reply:

Can I post your qu and my response on blog? It should appear in October. You can be anonymous.

Martinenghi’s response:

I will have handed my dissertation in by then. I understand, publication time is what it is these days! Unless you mean you are pre-dating it to last year’s October. I am happy not to be anonymous and in a sense make public that I care about these issues.

In all seriousness, I was just hoping to solve this issue. Covering up the process through which one arrives at his final hypothesis is like shooting a film about a scientific discovery in which you make up the story behind it. It really bothers me and most of the (Econ) academics will just suggest me to do that. I am indifferent relative to the way the answer is given, if any.

My perhaps not-so-satisfying reply:

My quick answer is in two parts. First, you do not need in your paper to go through all the steps of everything you tried that did not work. It’s just not possible, and it’s not so interesting to the readers. Second, you should present all relevant analyses and hypotheses. In this case, it seems that you just have one analysis or set of analyses, but you have multiple hypotheses. I recommend that in your paper you present both hypotheses, then state that the data are more consistent with hypothesis 2, but that if the experiment were replicated under other conditions, perhaps the data would be more consistent with hypothesis 1.

P.S. The original email came in May 2019 and I guess I must have postponed it a couple times, given that it’s only appearing now.

Derived quantities and generative models

Sandro Ambuehl, who sketched the above non-cat picture, writes:

I [Ambuehl] was wondering why we’re not seeing reports of measures of Covid19 mortality other than the Case Fatality Rate.

In particular, what would seem far more instructive to me than CFR is a comparison of the distributions of age at death, depending on whether the deceased was a carrier of Covid19 or not. That is, I’d really like to see a graph such as the one above. Does this exist anywhere? Or what would be the issues with it?

It seems such a statistic would address three of the major issues with CFR: 1. You can only determine CFR if you know the number of infected people, which, given the large number of asymptomatic cases and limited testing, is nearly impossible. 2. Covid19-deaths frequently coincide with comorbidities. It is close to impossible to determine whether a death is attributable to Covid19 or to the comorbidity. 3. CFR tells you nothing about the number of lost years of life.

My reply:

I continue to think that it’s a mistake to think of “the CFR”: at the very least you’d like to poststratify by age.

Regarding the question, what would be the issues with the above graph? My response is that it’s a fine graph to make, but ultimately I’d think of it as a product of some underlying model. There’s some generative model of disease progression and death, and when you integrate that over the population you’ll get a graph like the one drawn above. So I think of the above graph as a derived quantity.
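Here is a minimal sketch of what "derived quantity" means in this setting. Every number in it is a hypothetical placeholder, not an estimate: the age bands, population shares, infection rates, infection fatality rates, and baseline death rates are all made up for illustration. The point is only that the age-at-death distributions and the overall fatality rate all fall out of the same underlying inputs once you sum over the population.

```python
# All inputs are HYPOTHETICAL placeholders chosen for illustration only.
age_bands = ["0-39", "40-59", "60-79", "80+"]
pop_share = [0.50, 0.27, 0.18, 0.05]               # assumed population shares
infection_rate = [0.10, 0.10, 0.08, 0.08]          # assumed share infected
ifr = [0.0002, 0.002, 0.02, 0.10]                  # assumed fatality rate if infected
baseline_death_rate = [0.001, 0.004, 0.02, 0.10]   # assumed non-Covid death rate

def normalize(weights):
    """Rescale nonnegative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

# "Integrate over the population": here, a sum over age bands.
infected = [p * a for p, a in zip(pop_share, infection_rate)]
covid_deaths = [i * f for i, f in zip(infected, ifr)]
other_deaths = [p * b for p, b in zip(pop_share, baseline_death_rate)]

# Derived quantities: the two age-at-death distributions in the graph...
covid_age_dist = normalize(covid_deaths)
other_age_dist = normalize(other_deaths)

# ...and a single overall fatality rate, which depends on who gets infected.
overall_ifr = sum(covid_deaths) / sum(infected)

for band, c, o in zip(age_bands, covid_age_dist, other_age_dist):
    print(f"{band:>5}: Covid deaths {c:.2f}, other deaths {o:.2f}")
print(f"overall fatality rate among infected: {overall_ifr:.4f}")
```

Change the assumed infection rates by age and the overall rate moves even though no age-specific rate has changed, which is why a single number is less useful than age-specific rates poststratified to the population of interest.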

The challenge of fitting “good advice” into a coherent course on statistics

From an article I published in 2008:

Let’s also not forget the benefit of the occasional dumb but fun example. For example, I came across the following passage in a New York Times article: “By the early 2000s, Whitestone was again filling up with young families eager to make homes for themselves on its quiet, leafy streets. But prices had soared. In October 2005, the Sheas sold the house, for which they had paid $28,000 nearly 40 years ago, for more than $600,000.” They forgot to divide by the Consumer Price Index! Silly but, yes, these things happen, and it’s good to remind social science students that if they know about these simple operations, they’re already ahead of the game. The next step is to discuss more difficult problems such as adjusting the CPI for quality improvements. (For example, if the average home today is larger than the average home 40 years ago, should the CPI adjustment be per home or per square foot?) I also like to mention points such as, “The difference between ‘significant’ and ‘nonsignificant’ is not itself statistically significant.” But I haven’t surmounted the challenge of how to fit this sort of “good advice” into a coherent course so that students have a sense of how to apply these ideas in new problems.
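The house-price adjustment can be sketched in a few lines. The two prices come from the quoted passage (the sale price is stated only as "more than $600,000," so it is a lower bound); the price-level ratio is a hypothetical placeholder, not an actual CPI figure, and should be replaced with the ratio of the published index values for the two years.

```python
def to_later_dollars(old_price, price_level_ratio):
    """Express an old price in later-year dollars, given later CPI / earlier CPI."""
    return old_price * price_level_ratio


purchase_price = 28_000   # from the quoted article
sale_price = 600_000      # "more than $600,000" in the article, so a lower bound
price_level_ratio = 6.0   # HYPOTHETICAL placeholder, not a real CPI ratio

nominal_multiple = sale_price / purchase_price
purchase_in_sale_dollars = to_later_dollars(purchase_price, price_level_ratio)
real_multiple = sale_price / purchase_in_sale_dollars
```

Whatever the true ratio turns out to be, the real multiple is the nominal multiple divided by it, so the inflation-adjusted gain is much smaller than the raw comparison suggests.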

Maybe we’re not all the way there yet, but we’ve been working on it. In Regression and Other Stories we attempt to integrate good statistical advice into a coherent course on applied regression and causal inference.

We want certainty even when it’s not appropriate

Remember the stents example? An experiment was conducted comparing two medical procedures, the difference had a p-value of 0.20 (after a corrected analysis the p-value was 0.09) and so it was declared that the treatment had no effect.

In other cases, of course, “p less than 0.10” is enough for publication in PNAS and multiple awards. This is deterministic thinking for you: it’s no effect or a big scientific finding; no opportunity for the study to just be inconclusive.

This is a big, big problem: interpreting lack of statistical significance as no effect. The study was also difficult to interpret because of the indirect outcome measure. But that’s a standard problem with medical studies: you can’t measure long-term survival or quality of life, so you measure their treadmill times. No easy answers on this one.

Anyway, Doug Helmreich saw this press release on treatments for ischemia, and it reminded him of the stents example:

Here we go again?

I [Helmreich] only glanced through the results… the all-cause death rates are virtually the same. In other cases there’s some evidence for invasive procedures but because it did not meet the p-value threshold it is treated as “no difference”. I guess shades of gray make for much more difficult storytelling… An article titled “evidence for invasive procedures not overwhelming” is harder to write.

I don’t have the energy to follow the links and slides in detail (see P.S. below) but on quick glance I see where Helmreich is coming from. Here are the summary results:

It’s all deterministic; no uncertainty. I understand: as a heart patient myself, I just want to be told what to do. But, given that we’re making conclusions based on statistical patterns in data, this sort of deterministic reporting is a problem.

P.S. Before you haters jump on me for writing about a study I haven’t read, please recall that any press release is, in large part, intended for people who are not going to read the study. So, yes, press releases matter. And until labs stop releasing press releases, and until reporters stop relying on press releases, I’m going to insist that I have every right to record my reaction to a press release.

Also from the press release:

Many doctors routinely use an invasive approach in addition to medical therapy to treat IHD; however, it is not known if this approach is better than medical therapy alone as the initial treatment of patients with stable ischemic heart disease (SIHD), moderate to severe ischemia. ISCHEMIA is designed to find the answer.

“The answer,” huh?

Election Scenario Explorer using Economist Election Model

Ric Fernholz writes:

I wanted to tell you about a new website I built together with my brother Dan. The 2020 Election Scenario Explorer allows you to explore how electoral outcomes in individual states influence the national election outlook using data from your election model.

The map and tables on our site reveal some interesting observations about the election and your model. The site provides a measure of the influence of different states using the expected reduction in entropy or variance, following the data generated by your model. Several of the most influential states according to this measure differ from those states emphasized by more common “tipping point” analyses.

I appreciate you sharing your code and simulation output with the public, as this made our project possible.

Open data and code ftw!

P.S. They should round those numbers to the nearest percentage point (see section 2.1 of this article).
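For readers wondering what "expected reduction in entropy" could look like in practice, here is a rough sketch. It is my guess at the general idea, not the Fernholz brothers' actual code, and the simulated draws are a toy stand-in (a made-up national swing plus state noise), not output from the Economist model. Given paired draws of a state outcome and the national outcome, it computes how much knowing the state result is expected to reduce uncertainty about the national result.

```python
import math
import random


def binary_entropy(p):
    """Entropy in bits of a yes/no event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_entropy_reduction(state_wins, national_wins):
    """Prior entropy of the national outcome minus its expected entropy
    after conditioning on the state outcome, estimated from paired draws."""
    n = len(national_wins)
    prior = binary_entropy(sum(national_wins) / n)
    posterior = 0.0
    for outcome in (True, False):
        subset = [nat for st, nat in zip(state_wins, national_wins) if st == outcome]
        if subset:
            posterior += len(subset) / n * binary_entropy(sum(subset) / len(subset))
    return prior - posterior


# Toy draws: HYPOTHETICAL, for illustration only.
rng = random.Random(0)
national, swing_state, safe_state = [], [], []
for _ in range(4000):
    swing = rng.gauss(0, 1)                                # shared national swing
    national.append(swing > 0)                             # national winner
    swing_state.append(swing + rng.gauss(0, 0.5) > 0)      # closely contested state
    safe_state.append(swing + 3 + rng.gauss(0, 0.5) > 0)   # state leaning heavily one way

swing_influence = expected_entropy_reduction(swing_state, national)
safe_influence = expected_entropy_reduction(safe_state, national)
```

In this toy setup the contested state is far more informative than the safe one, since the safe state's result is almost never a surprise.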

Everything that can be said can be said clearly.

The title, as many may know, is a quote from Wittgenstein. It is one that has haunted me for many years. As a first year undergrad, I had mistakenly enrolled in a second year course that was almost entirely based on Wittgenstein’s Tractatus. Alarmingly, the drop date had passed before I grasped I was supposed to understand (at least some of) the Tractatus to pass. That forced me to re-read it numerous times. I did pass the course.

However, I now think the statement is mistaken, at least outside mathematics, in subjects where what is being said is an attempt to say something about the world – that reality that is beyond our direct access. Here some vagueness has its place or may even be necessary. What is being said is unlikely to be exactly right. Some vagueness may be helpful here in the same way that sheet metal needs to stretch to be an adequate material for a not quite fully rigid structure.

Now, what I am thinking about trying to say more clearly at present is how diagrammatic reasoning, experiments performed on diagrams as a choice of mathematical analysis of probability models utilizing simulation, will enable more people to grasp statistical reasoning better. OK, maybe the Wittgenstein quote was mostly click bait.

My current attempt at saying it with 4 paragraphs:

Continue reading ‘Everything that can be said can be said clearly.’ »

In case you’re wondering . . . this is why the U.S. health care system is the most expensive in the world

Read the above letter carefully, then remember this. (Greg Mankiw called comparisons of life expectancies schlocky, but maybe he’ll feel different about this once he reaches the age of 70 or 75 . . .)

P.S. This doesn’t help either.

Low rate of positive coronavirus tests

As happens sometimes, I receive two related emails on the same day.

Noah Harris writes:

I was wondering if you have any comment on the NY State Covid numbers. Day after day the positive percentage stays in a tight range of about 0.85-0.99%. How can the range be so narrow and stable? Do you think we are at the limits of the test and there may be a significant amount of false positives?

And here’s Tom Daula:

Relatively old article, but I think it is interesting considering your analysis of the Stanford study. Another wrinkle for the measurement problem; both of contagious individuals and viral load sufficient to be related to death. The article doesn’t mention international comparisons.

The Times article, which is not so old (it’s from 29 Aug), is entitled, “Your Coronavirus Test Is Positive. Maybe It Shouldn’t Be. The usual diagnostic tests may simply be too sensitive and too slow to contain the spread of the virus.”

I don’t really know what to think about all this, but I’ll share with you.

Taking the bus

Bert Gunter writes:

This article on bus ridership is right up your alley [it’s a news article with interactive graphics and lots of social science content].

The problem is that they’re graphing the wrong statistic. Raw ridership is of course sensitive to total population. So what they should have been graphing is rates per person, not raw rates. I grant you that maybe populations didn’t change that much over the time concerned, but I don’t know that! At least they should have said something about the necessity of that assumption and possible distortions it could cause in their “analysis.” Note that the ensuing discussion in the article speculated about explanations for a decline in the individual propensity to ride (financial, age,…) for which the per person frequency should be the basis, not the per city.

My reply: I agree that they should’ve divided by population. Actually, I think best would be to divide by population * days, so that what they’re plotting is average number of bus rides per person per day. That’s a number that is directly interpretable. For example, 0.1 bus rides per person per day corresponds to 10% of the people in the city riding the bus once that day, or 5% riding the bus twice. The number is still only approximate, as suburbanites and visitors ride the bus too, but it’s at least roughly interpretable.
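Here is the rescaling as a tiny sketch. The city size and ride count are hypothetical round numbers chosen to reproduce the 0.1 example, not figures from the article.

```python
def rides_per_person_per_day(total_rides, population, days):
    """Average number of bus rides per resident per day."""
    return total_rides / (population * days)


# Hypothetical city: 1 million residents, 36.5 million rides in a 365-day year.
rate = rides_per_person_per_day(36_500_000, 1_000_000, 365)

# Two ways a rate of 0.1 could arise on a given day:
ten_percent_once = 0.10 * 1    # 10% of people riding once
five_percent_twice = 0.05 * 2  # 5% of people riding twice
```

Both stories give the same rate, which is a reminder that the average alone doesn't tell you how rides are spread across people.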

This sort of thing comes up a lot, the value of rescaling statistics to be on the human scale. Such rescaling is not always so easy—for example, reporting suicide rates per 100,000 is pretty much uninterpretable, but it’s not clear how to put such a rare event on unit scale—but we should do this when we can.

It also looks like they screwed up on the above graph and there was some software setting that cut off the lines when they went below -13%.

My other comment is about the bus riding experience itself. The authors talk about buses vs. trains and bikes and cars, but I don’t see anything about what it feels like to actually ride the bus, except for a brief mention of bus lanes. I like riding the bus, but not when it stops every two blocks, when it has to swerve in and out of traffic after each bus stop, when it stops at just about every traffic light, when I have to wait 20 minutes for the bus to show up in the first place (particularly annoying because if you’re not staring eagle-eyed at the street, the bus might come by and not stop for you), etc etc. Lots of these problems are potentially fixable, for example by putting bus stops in the middle of the street and running smaller buses more frequently rather than huge buses further spaced in time.

Election forecasts: The math, the goals, and the incentives (my talk this Friday afternoon at Cornell University)

At the Colloquium for the Center for Applied Mathematics, Fri 18 Sep 3:30pm:

Election forecasts: The math, the goals, and the incentives

Election forecasting has increased in popularity and sophistication over the past few decades and has moved from being a hobby of some political scientists and economists to a major effort in the news media. This is an applied math seminar so we will first discuss several mathematical aspects of election forecasting: the information that goes into the forecasts, the models and assumptions used to combine this information into probabilistic forecasts, the algorithms used to compute these probabilities, and the ways that forecasts can be understood and evaluated. We discuss these in particular reference to the Bayesian forecast that we have prepared with colleagues at the Economist magazine. We then consider some issues of incentives for election forecasters to be over- or under-confident, different goals of election forecasting, and ways in which analysis of polls and votes can interfere with the political process.

I guess Cornell’s become less anti-Bayesian since Feller’s time . . .

P.S. Some of the material in the talk appears in this article with Jessica Hullman and Chris Wlezien. I should also thank Elliott Morris and Merlin Heidemanns, as the Economist model is our joint product.

Coronavirus disparities in Palestine and in Michigan

I wanted to share two articles that were sent to me recently, one focusing on data collection and one focusing on data analysis.

On the International Statistical Institute blog, Ola Awad writes:

The Palestinian economy is micro, with the majority of establishments employing less than 10 workers, and the informal sector making up about a third of the economy. It is primarily a service-based economy, and also a consumer-based economy with a consumption rate of about 116% of the gross domestic product. . . . Those most affected are the most vulnerable sectors of society such as Palestinians living in refugee camps and in Area C. This area is home to an estimated 180,000-300,000 Palestinians who are suffering from demolitions and forced evictions that deprive people of their homes and disrupt livelihoods, leading to entrenched poverty and increased dependence on aid. . . .

About 29% of those employed in the private sector receive less than the minimum wage (1,450 NIS = USD 426 per month), and to make matters worse — 57% of the employed are considered informal employees, meaning they do not receive formal basic work rights such as employment contracts, paid leave, sick leave or social retirement.

The fragile economy faced even more tragic conditions after the COVID-19 pandemic hit Palestine and the rest of the world. . . . During the lock-down, face-to-face data collection was no longer an option and we had to come up with creative ways like hand held devices, phones, and the use of registers when available. We also performed a rapid assessment to offer real time data capturing the effect of the pandemic. . . . It is vital for governments to have data on the most marginalized groups which are expected to fall deeper into vulnerability due to the pandemic. In Palestine, this includes women heading households, workers of the informal sector, and workers at Israeli settlements, refugees, and the population in Area C. . . .

The lockdown has widened the poverty gap. Families who were on the edge are falling into poverty, leading to the emergence of new groups of poor people, especially in refugee camps and Area C. Around 109,000 women working in the private sector have lost their jobs due to closure measures. . . .

Meanwhile, at the University of Michigan, Jon Zelner, Rob Trangucci, Ramya Naraharisetti, Alex Cao, Ryan Malosh, Kelly Broen, Nina Masters, Paul Delamater write:

Racial disparities in COVID-19 mortality are driven by unequal infection risks.

Geographic, racial-ethnic, age and socioeconomic disparities in exposure and mortality are key features of the first and second wave of the U.S. COVID-19 epidemic. We used individual-level COVID-19 incidence and mortality data from the U.S. state of Michigan to estimate age-specific incidence and mortality rates by race/ethnic group. Data were analyzed using hierarchical Bayesian regression models [using rstanarm], and model results were validated using posterior predictive checks. In crude and age-standardized analyses we found rates of incidence and mortality more than twice as high as those of Whites for all groups other than Native Americans. Of these, Blacks experienced the greatest burden . . . We also found that the bulk of the disparity in mortality between Blacks and Whites is driven by dramatically higher rates of COVID-19 infection across all age groups, particularly among older adults, rather than age-specific variation in case-fatality rates. Interpretation. This work suggests that well-documented racial disparities in COVID-19 mortality in hard-hit settings, such as the U.S. state of Michigan, are driven primarily by variation in household, community and workplace exposure rather than case-fatality rates.

P.S. In response to some comments (see below), Zelner writes:

I [Zelner] agree 100% about the challenges of knowing what is driven by testing vs infection. But to the naysayers I’d say good luck getting the negative test data you’d need to figure it out! This has been a huge challenge and basically every chance I have I harass the people who could give it to us, but it is actually a legal and logistical morass to get it… For me it’s sort of the perfect vs the good. Plus the effect sizes are so enormous it’s hard for me to believe it’s all testing. Re: pop density I’d reject that out of hand; Dense population in SE Michigan is like a suburb anywhere else.

“Figure 1 looks like random variation to me” . . . indeed, so it does. And Figure 2 as well! But statistical significance was found, so this bit of randomness was published in a top journal. Business as usual in the statistical-industrial complex. Still, I’d hope the BMJ could’ve done better.

Gregory Hunter writes:

The following article made it to the national news in Canada this week.

I [Hunter] read it and was fairly appalled by their statistical methods. It seems that they went looking for a particular result in Canadian birthrate data, and then arranged to find it. Figure 1 looks like random variation to me. I don’t know if it warrants mention in your blog, but it did get into the British Medical Journal.

That’s too bad about it being in the British Medical Journal. Lancet, sure, they’re notorious for publishing politically motivated clickbait. But I thought BMJ was more serious. I guess anyone can make mistakes.

Anyway, getting to the statistics . . . the article is called “Outcome of the 2016 United States presidential election and the subsequent sex ratio at birth in Canada: an ecological study,” and its results are a mess of forking paths:

We hypothesised that the unexpected outcome of the 2016 US presidential election may have been a societal stressor for liberal-leaning populations and thereby precipitated such an effect on the sex ratio in Canada. . . . In the 12 months following the election, the lowest sex ratio occurred in March 2017 (4 months post election). Compared with the preceding months, the sex ratio was lower in the 5 months from March to July 2017 (p=0.02) during which time it was rising (p=0.01), reflecting recovery from the nadir. Both effects were seen in liberal-leaning regions of Ontario (lower sex ratio (p=0.006) and recovery (p=0.002) in March–July 2017) but not in conservative-leaning areas (p=0.12 and p=0.49, respectively).

In addition to forking paths, we also see the statistical fallacy of comparing significant to non-significant.
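To see why comparing "significant" to "non-significant" is a fallacy, here is a rough back-of-the-envelope sketch using the p-values reported for the liberal-leaning and conservative-leaning regions. It leans on strong assumptions that are mine and not the paper's: the two estimates are independent, approximately normal, point in the same direction, and have equal standard errors. Under those assumptions the z-score for the difference is the difference of the two z-scores divided by sqrt(2).

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_two_sided_p(p):
    """Absolute z-score corresponding to a two-sided p-value."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def p_for_difference(p_a, p_b):
    """Two-sided p-value for the difference between two estimates.

    ASSUMES independent, normal, same-sign estimates with equal standard
    errors; that is an illustration device, not something the paper reports.
    """
    z_diff = (z_from_two_sided_p(p_a) - z_from_two_sided_p(p_b)) / math.sqrt(2)
    return 2 * (1 - STD_NORMAL.cdf(abs(z_diff)))


# Reported p-values: liberal-leaning vs conservative-leaning regions.
p_diff_lower_ratio = p_for_difference(0.006, 0.12)
p_diff_recovery = p_for_difference(0.002, 0.49)
```

Under these assumptions, neither difference reaches the conventional 0.05 threshold, even though in each pair one p-value is "significant" and the other is not.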

To their credit, the authors show the data:

As is often the case, if you look at the data without all those lines, you see something that looks like a bunch of numbers with no clear pattern.

The claims made in this article do not represent innumeracy on the level of saying that the probability of a tied election is 10^-90 (which is off by a factor of 10^83), and it’s not innumeracy on the level of that TV commenter and newspaper editor who said that Mike Bloomberg spent a million dollars on each voter (off by a factor of 10^6), but it’s still wrong.

Just to get a baseline here: There were 146,000 births in Ontario last year. 146,000/12 = 12,000 (approximately). So, just from pure chance, we’d expect the monthly proportion of girl births to vary with a standard deviation of 0.5/sqrt(12000) = 0.005. For example if the baseline rate is 48.5% girls, it could jump to 48.0% or 49.0% from month to month. The paper in question reports sex ratio, which is (1-p)/p, so 0.480, 0.485, 0.490 convert to sex ratios of 1.08, 1.06, and 1.04. Or, if you want to do +/-2 standard deviations, you’d expect to see sex ratios varying from roughly 1.10 to 1.02, which is indeed what we see in the top figure above. (The lower figures are each based on less data so of course they’re more variable.) Any real effects on sex ratio will be tiny compared to this variation in the data (see here for discussion of this general point).
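The baseline calculation is easy to check in code. This sketch uses only the numbers from the paragraph above (about 12,000 births per month and a baseline of 48.5% girls) and the binomial standard deviation approximation 0.5/sqrt(n); the variable names are mine.

```python
import math

births_per_month = 12_000   # 146,000 / 12, rounded as in the text
baseline_p_girl = 0.485     # baseline proportion of girl births

# Binomial standard deviation of a monthly proportion, using sqrt(p(1-p)) ~ 0.5.
sd_monthly = 0.5 / math.sqrt(births_per_month)


def sex_ratio(p_girl):
    """Boys per girl, given the proportion of girl births."""
    return (1 - p_girl) / p_girl


# Sex ratios at the baseline and half a percentage point either side.
ratios = [round(sex_ratio(p), 2) for p in (0.480, 0.485, 0.490)]

# Range of sex ratios expected from chance alone at +/- 2 standard deviations.
ratio_high = sex_ratio(baseline_p_girl - 2 * sd_monthly)
ratio_low = sex_ratio(baseline_p_girl + 2 * sd_monthly)
```

So month-to-month swings in the sex ratio from about 1.02 to 1.10 are what you'd expect from sampling variation alone, with no election effect needed.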

In short: this study was dead on arrival. But the authors fooled themselves, and the reviewers, with a blizzard of p-values. As low as 0.002!

So, let me repeat:

– Just cos you have a statistically significant comparison, that doesn’t necessarily mean you’ve discovered anything at all about the world.

– Just cos you have causal identification and a statistically significant comparison, that doesn’t necessarily mean you’ve discovered anything at all about the world.

– Just cos you have honesty, transparency, causal identification, and a statistically significant comparison, that doesn’t necessarily mean you’ve discovered anything at all about the world.

– Just cos you have honesty, transparency, causal identification, a statistically significant comparison, a clear moral purpose, and publication in a top journal, that doesn’t necessarily mean you’ve discovered anything at all about the world.

Sorry, but that’s the way it is. You’d think everyone would’ve learned this—it’s been nearly a decade since that ESP paper was published—but I guess not. The old ways of thinking are sticky. Sticky sticky sticky.

Again, no special criticism of the authors of this new paper. I assume they’re just doing what they were trained to do, and now what they’re rewarded for doing. Don’t hate the player etc.