
What happened to the hiccups?

Watching Sleepless in Seattle the other day, and at one point the cute kid in the movie gets into a conversation about hiccups, everybody has their own cure for the hiccups, etc.

And it got me thinking: What ever happened to the hiccups? When I was a kid, the hiccups occupied a big part of our mental state. We got the hiccups often enough, and people were always talking about how to cure it. Hold your breath, drink a glass of water slowly, whatever. Once my sister had the hiccups and my dad snuck up behind her and scared her. The hiccups went away but my sister made my dad promise never to scare her again like that. And he never did.

Anyway . . . nowadays we don’t hear so much about the hiccups. Nobody ever seems to get them. What’s up with that?

Here’s a partial list of things that used to occupy lots of kids’ brain real estate but don’t seem to anymore:

Bee stings

On the other hand, some things that were big when we were kids are still big. I’m thinking of superheroes.

I don’t know where the hiccups went. I guess video games are so huge now, that some other topics had to go away. After all, there are only 24 hours in the day. On the medical front, we now have things like nut allergy and autism which nobody ever thought about and now are huge. I remember as a kid watching a 60 Minutes segment, I think it was, on this mysterious condition called autism, and I was like, wow, what’s that? Nowadays autism is just part of the conversation. But hiccups aren’t.

Somebody should study this.

OK, here’s the Google N-gram:

So according to this source, hiccups are bigger than ever. But I don’t buy it. I think any recent increase is just authors remembering hiccups from their childhood.

Judith Rich Harris on the garden of forking paths

Ethan Ludwin-Peery writes:

I finally got around to reading The Nurture Assumption and I was surprised to find Judith Rich Harris quite lucidly describing the garden of forking paths / p-hacking on pages 17 and 18 of the book. The edition I have is from 2009, so it predates most of the discussion of these topics, and for all I know this section was in the first edition as well. I’ve never heard this mentioned about JRH before, and I thought you might be interested.

Here’s the passage from Harris’s book:

It is unusual for a socialization study to have as many as 374 subjects. On the other hand, most socialization studies gather a good deal more data from their subjects than we did in our IQ-and-books study: there are usually several measurements of the home environment and several measurements of each child. It’s a bit more work but well worth the trouble. If we collect, say, five different measurements of each home and five different measurements of the child’s intelligence, we can pair them up in twenty-five ways, yielding twenty-five possible correlations. Just by chance alone, it is likely that one or two of them will be statistically significant. What, none of them are? Never fear, all is not lost: we can split up the data and look again, just as we did in our broccoli study. Looking separately at girls and boys immediately doubles the number of correlations, giving us fifty possibilities for success instead of just twenty-five. Looking separately at fathers and mothers is also worth a try. “Divide and conquer” is my name for this method. It works like buying lottery tickets: buy twice as many and you have twice as many chances to win.

And that’s not even the whole story, as she hasn’t even brought up choices in data coding and exclusion, and choices in how to analyze the data.
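To put rough numbers on Harris's "one or two of them," here is a minimal sketch. It assumes the conventional 0.05 significance threshold and, unrealistically, that the correlations are independent tests (several measurements of the same homes and children would be correlated with each other). The function names are mine, not from the book.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Expected number of 'significant' results when every null is true."""
    return num_tests * alpha


def prob_at_least_one(num_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


# 25 correlations, then 50 after splitting into girls and boys.
for num_tests in (25, 50):
    print(
        num_tests,
        expected_false_positives(num_tests),
        round(prob_at_least_one(num_tests), 3),
    )
```

Under these assumptions, 25 correlations give an expected 1.25 false positives and roughly a 72% chance of at least one; doubling to 50 pushes that chance above 90%.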

I replied that we’ve been aware forever of the problem of multiple comparisons but we didn’t realize how huge a problem it was in practice, and Ludwin-Peery replied:

Indeed! The most surprising thing was that she seems to have been aware of how widespread it was (at least in socialization research).

Macbook Pro (16″ 2019) quick review

I just upgraded yesterday to one of the new 2019 Macbook Pro 16″ models:

  • Macbook Pro (16″, 2019), 3072 x 1920 pixel display, 2.4 GHz 8-core i9, 64GB 2667 MHz DDR4 memory, AMD Radeon Pro 5500M GPU with 4GB of GDDR6 memory, 1 TB solid-state drive

    US$4120 list including Apple Care (about US$3800 after the education discount)

The only other upgrade option is an additional 4GB GPU memory for US$100.

My computer for the last seven-plus years and my basis for comparison is a mid-2012 Macbook Pro:

  • Macbook Pro (15″ Retina, Mid 2012), 2880 x 1800 pixel display, 2.3 GHz 4-core i7, 16 GB 1600MHz DDR3 memory, 256 GB solid-state drive

I did 100% of my work on Stan during that time using this computer and overall, it’s been the best computer I’ve ever had. But my old computer was dying. The screen was so burned in I could read last week’s xkcd (I never got the replacement during the recall). The battery was so shot it’d go from 30% power to auto shutdown in the blink of an eye.

I have no idea what’s available in the Windows PC or Linux world for a similar price. It probably comes with 32 cores, 256GB memory, and also acts as a hoverboard. For comparison, while working at Bell Labs in the mid-1990s, I once spent US$7200 for a dual-boot Linux/Windows XP Thinkpad with a super high-res monitor for the time and enough memory and compute power to develop and run our speech recognition code. So while US$3800 may seem outrageously expensive, the bigger picture is that really powerful computers just keep getting more affordable over time.

Form factor

I bought the new computer sight unseen without paying too much attention to anything other than that it had 8 cores, 64GB memory, and an escape key. I was expecting something like a PC gamer deck. Compared to my previous machine, the cases are exactly the same size and the machines are the same weight at least insofar as I can tell physically without reading the specs. It’s even the old silver color, which I strongly prefer to the space grey.

I like that the apple on the lid doesn’t light up.

Ease of upgrade

Apple makes it super easy to move everything from an old machine. Once I entered enough passwords on my menagerie of Apple devices, it took less than 2 hours to transfer everything from the old machine to the new one via my home wireless network.

The only software I’ve had to upgrade to get back to working on Stan is Xcode (the C++ compiler). And I did that just from the command line using this one-liner:

> xcode-select --install

Hats off to Dirk Avery for his blog post on Catalina, Xcode, and Homebrew.

It really was that easy. The entire Stan developer toolchain just works. R, RStan, etc., all just ran once I ran the above command from the terminal.

The keyboard, touchpad, and touchbar

There’s an escape key. I’ve been using emacs for 30+ years and it’s a big deal to me and others like me.

Keyboards matter a lot to me. I’m a very fast typist—around 100 words/minute the last time I tested myself on transcription (two years of school, part time jobs as secretary and keypunch operator, followed by tens of thousands of adult hours at the keyboard).

Overall, I consider this keyboard a downgrade from my 2012 Macbook Pro. I had the same problem with ThinkPads between 1996 and 2010—the keyboards just kept getting worse with every new model. At least the new Macbook Pro keyboards are a lot better than the very-short-throw, low-feedback keyboards used in the time between my 2012 Mac and the new 2019 ones.

The touchpad is huge compared to the old machine. I was worried I’d be accidentally hitting it all the time because I set it to use touch rather than click, but that has thankfully not happened.

The touchbar’s fine for what it’s there for. Its default is to display the only controls I ever used on the old computer—volume and brightness.


Together, the keyboard and display are the most important parts of a computer to me. I’ve always prioritized displays over CPUs. I bought a first-generation Retina Macbook Pro as soon as they were available.

The monitor in the 16″ Macbook Pros is impressive. After using it for a day, the color on all my other devices (previous computer, iPhone, iPad) now looks off (specifically, blue-shifted). Sitting next to each other at max brightness, one might think the backlighting was broken in the old monitor it’s so dim.

Even though it’s not that much bigger, having spent 7 years on a slightly smaller one, this one feels a fair bit bigger. They squeezed it into the same form factor by reducing the bezel size. There are also a few more pixels.

Is it faster?

Yes, much. I haven’t done any formal measurements, but with twice as many cores, each of which is faster, and much faster memory, one would expect to see exactly what I’m seeing informally—the Stan C++ unit tests compile and run more than 50% faster.

Not much compared to the PC heyday when every 18 months saw a doubling of straight-line speed. But enough to be noticeable and well worth the upgrade if that was all I was getting.

I haven’t tried any GPU code yet. I wouldn’t expect too much from a notebook on that front.

64 GB?

It wasn’t that much more expensive to fully load the machine’s memory. This means we should be able to run 8 processes, each using nearly 8 GB of memory.

Ports and dongles

There’s a headphone jack on the right (instead of left as it was on my old computer) and two USB-C jacks on either side. I just plugged the power into one of the ones on the left and it worked.

Ports and dongles are the great weakness of Apple-knows-best design in my experience. I’m going to have to buy a USB-C to HDMI dongle. I really liked that the 2012 Macbook Pro had an HDMI port.

I’m also going to have to figure out how to charge my iPad and iPhone. I prefer to travel without the iPad-specific wall wart.

Apple seems to think they get points for being “minimal”, flying in the face of every review I’ve ever read of an Apple product. So here you go Apple, another negative review of your choice in the port department to ignore.

Am I an Apple fanboy?

I certainly don’t self-identify as an Apple fanboy. I use exclusively Apple products (Macbook, iPhone, iPad) primarily because I’m lazy and hate learning new interfaces and managing software. My decision’s driven almost entirely by the Macbooks because I want Unix on my notebook without the incompatibility of Cygwin or the administrative headache of Linux on a notebook.

It’s clear the Macbook isn’t getting the most love among Apple’s products. I also resent Apple’s we-know-best attitude, which I blame for their cavalier attitude toward backward compatibility at both the software and hardware levels. It’s no surprise Microsoft still dominates the corporate PC market and Linux the corporate server market.

Overall impression

I love it. For my use, the 8 cores, faster 64GB memory, and the high resolution and brightness 16″ monitor more than make up for the slightly poorer keyboard and reduced port selection.

I also ordered the same machine for Andrew and he’s been using his a day or two longer than me, so I’m curious what his impressions are.

“Inferential statistics as descriptive statistics”

Valentin Amrhein​, David Trafimow, and Sander Greenland write:

Statistical inference often fails to replicate. One reason is that many results may be selected for drawing inference because some threshold of a statistic like the P-value was crossed, leading to biased reported effect sizes. Nonetheless, considerable non-replication is to be expected even without selective reporting, and generalizations from single studies are rarely if ever warranted. Honestly reported results must vary from replication to replication because of varying assumption violations and random variation; excessive agreement itself would suggest deeper problems, such as failure to publish results in conflict with group expectations or desires. A general perception of a “replication crisis” may thus reflect failure to recognize that statistical tests not only test hypotheses, but countless assumptions and the entire environment in which research takes place. Because of all the uncertain and unknown assumptions that underpin statistical inferences, we should treat inferential statistics as highly unstable local descriptions of relations between assumptions and data, rather than as generalizable inferences about hypotheses or models. And that means we should treat statistical results as being much more incomplete and uncertain than is currently the norm. Acknowledging this uncertainty could help reduce the allure of selective reporting: Since a small P-value could be large in a replication study, and a large P-value could be small, there is simply no need to selectively report studies based on statistical results. Rather than focusing our study reports on uncertain conclusions, we should thus focus on describing accurately how the study was conducted, what problems occurred, what data were obtained, what analysis methods were used and why, and what output those methods produced.

I think the title of their article, “Inferential statistics as descriptive statistics: there is no replication crisis if we don’t expect replication,” is too clever by half: Ultimately, we do want to be able to replicate our scientific findings. Yes, the “replication crisis” could be called an “overconfidence crisis” in that the expectation of high replication rates was itself a mistake—but that’s part of the point, that if findings are that hard to replicate, this is a problem for the world of science, for journals such as PNAS which routinely publish papers that make general claims on the basis of much less evidence than is claimed.

Anyway, I agree with just about all of this linked article except for my concern about the title.

“Deep Origins” and spatial correlations

Morgan Kelly writes:

Back in 2013 you had a column in Chance magazine on the Ashraf-Galor “Out of Africa” paper which claims that genetic diversity determines modern income. That paper is part of a much larger literature in economics on Persistence or “Deep Origins” that shows how medieval pogroms prefigure Nazi support, adoption of the plough determines women’s rights etc.

However, most papers in that literature combine unusually high t statistics with extreme spatial autocorrelation of residuals and I wanted to see if these things were connected. The basic idea is in the picture below: regress spatial noise series on each other and you get results that look a lot like persistence:

I [Kelly] go on to examine 27 persistence papers in Top Four economics journals and find that, in most cases, the big persistence variable has lower explanatory power than spatial noise but can, at the same time, strongly predict spatial noise.
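The basic mechanism is easy to simulate. The sketch below is a simplified illustration, not Kelly's analysis: it uses noise along a one-dimensional transect (an AR(1) series) rather than two-dimensional spatial noise, and the autocorrelation of 0.95, the 100 locations, and the 500 simulations are all arbitrary choices of mine. It regresses one independent noise series on another and counts how often the naive OLS t statistic exceeds 1.96.

```python
import math
import random


def ar1_series(n, rho, rng):
    """Noise along a 1-D transect; rho=0 gives white noise."""
    scale = math.sqrt(1.0 - rho * rho)  # keeps the marginal variance at 1
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x


def ols_t_stat(x, y):
    """Naive slope t statistic for y = a + b*x, assuming i.i.d. errors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_error


def false_positive_rate(rho, n=100, sims=500, seed=1):
    """Share of simulations where two unrelated series look 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        x = ar1_series(n, rho, rng)
        y = ar1_series(n, rho, rng)  # independent of x by construction
        if abs(ols_t_stat(x, y)) > 1.96:
            hits += 1
    return hits / sims


rate_white = false_positive_rate(rho=0.0)
rate_spatial = false_positive_rate(rho=0.95)
print(f"white noise: {rate_white:.2f}, autocorrelated noise: {rate_spatial:.2f}")
```

With white noise the rejection rate should sit near the nominal 5%; with strongly autocorrelated noise it should be many times higher, even though the two series are unrelated by construction.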

For more discussion on that “Out of Africa” paper, see the comment threads here (from 2013) and here (from 2018).

Also some general discussion here of the statistical issue of correlated errors in this and similar examples.

How many Stan users are there?

This is an interesting sampling or measurement problem that came up in a Discourse thread started by Simon Maskell:

It seems we could look at a number of pre-existing data sources (eg discourse views and contributors, papers, StanCon attendance etc) to inform an inference of how many people use Stan (and/or use things that use Stan). We could also generate new data (eg via surveys etc). Do we know the answer and/or how best to work it out?

The cleanest way to do this would be to start with a list of the population of possible Stan users, then survey a random sample of them, ask if they use Stan, and extrapolate to the population. But we can’t do this because no such list exists. We could count Stan downloads, but that’s not Stan users, as we assume that lots of the downloads are automatic, and also people might download Stan and then only use it once, or not at all.
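For reference, the extrapolation in that idealized design would look something like the sketch below. All the numbers are invented for illustration (a hypothetical list of 100,000 people, 1,000 sampled, 30 reporting that they use Stan), and the standard error assumes simple random sampling with no nonresponse and no finite-population correction.

```python
import math


def estimate_users(population_size, sample_size, users_in_sample):
    """Scale the sample proportion up to the population, with a standard error."""
    p_hat = users_in_sample / sample_size
    estimate = population_size * p_hat
    # Binomial standard error of the proportion, scaled to the population.
    std_error = population_size * math.sqrt(p_hat * (1.0 - p_hat) / sample_size)
    return estimate, std_error


# Hypothetical inputs, not real data.
estimate, std_error = estimate_users(100_000, 1_000, 30)
print(f"estimated users: {estimate:.0f} (se {std_error:.0f})")
```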

Lauren Kennedy suggests doing a snowball or network sample using contributors to the Stan Forums as a starting point.

Snowball sampling could work. There could be other ideas too. Please offer your suggestions in comments.
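As a rough picture of what the recruitment would look like, here is a minimal sketch of snowball waves as a breadth-first walk over a referral network. The names and the referral dictionary are made up; in practice each respondent would be asked to name other Stan users they know. A snowball sample is not a probability sample, so the count it reaches is not by itself an estimate of the total.

```python
def snowball_sample(seeds, referrals, max_waves):
    """Return the people reached in each wave, starting from the seeds."""
    reached = set(seeds)
    wave = list(seeds)
    waves = [sorted(wave)]
    for _ in range(max_waves):
        next_wave = []
        for person in wave:
            for contact in referrals.get(person, []):
                if contact not in reached:  # skip anyone already recruited
                    reached.add(contact)
                    next_wave.append(contact)
        if not next_wave:
            break
        waves.append(sorted(next_wave))
        wave = next_wave
    return waves


# Hypothetical referral network: who names whom as a Stan user.
referrals = {
    "forum_user_a": ["colleague_1", "colleague_2"],
    "forum_user_b": ["colleague_2", "student_1"],
    "colleague_1": ["student_2"],
    "student_1": ["forum_user_a"],
}
waves = snowball_sample(["forum_user_a", "forum_user_b"], referrals, max_waves=2)
for i, people in enumerate(waves):
    print(f"wave {i}: {people}")
```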

Here are my thoughts:

1. A natural first step in any research project is to read the literature. There must be some estimates of the numbers of users of other programming languages such as Python, R, C++, Julia, Bugs, Stata, etc. I don’t know where these estimates come from, but looking at them would be a start.

2. If we’re gonna do a survey to estimate the number of Stan users, it perhaps makes sense to expand the project and simultaneously estimate the number of users of some other programming languages too, both for efficiency (with little more effort we can get information that will be of interest to others) and to get comparisons: comparing the uses of different languages in our survey and also comparing our estimates to estimates that have been obtained by others.

3. We should also think about how the survey could be done again in the future. If we have a good estimate of the number of users, we might want to repeat the procedure every year or two to get a sense of trends.

4. How many Stan users are there? What’s a “Stan user”? Does this include users of rstanarm and brms? What about people who only use Stan through Prophet? Does that count? Do we want to count ever-users or current users? How often must you use Stan to count as a user? What if you took a class that used Stan? Etc.

The point of this last set of questions is not that we need a precise definition of Stan user, but rather that we should ask a battery of questions to get at mode and frequency of use. Also, we should consider how we might want to summarize and interpret the results: we should think about this before we conduct the survey (rather than doing the usual thing of gathering a bunch of data and then deciding what to do with it all).

Field goal kicking—like putting in 3D with oblong balls


Andrew Gelman (the author of most posts on this blog, but not this one), recently published a Stan case study on golf putting [link fixed] that uses a bit of geometry to build a regression-type model based on angles and force.

Field-goal kicking

In American football, there’s also a play called a “field goal.” In the American football version, a kicker (often a player migrating from the sport everyone else in the world calls “football”) tries to kick an oblong-ish “ball” between 10 and 70 meters between a pair of vertical posts and above a post at a certain height. If you’re not from the U.S. or other metrically-challenged country still using (British) imperial measures, it’ll help to know that a meter is roughly 1.1 yards.

Sounds kind of like putting, only in 3D and with no penalty for kicking too hard or far and wind effects instead of terrain features. This modeling problem came to my attention from the following blog post:

Unlike Gelman’s golf-putting example, Long’s model combines a kick-by-kick accuracy model with a career-trajectory model for kickers, another popular contemporary sports statistics adjustment. Long used brms, a Bayesian non-linear multilevel modeling package built on top of Stan, to fit his model of field-goal-kicking accuracy. (For what it’s worth, more people use brms and rstanarm now than use Stan directly in R, at least judging from CRAN downloads through RStudio.)

Model expansion

The focus of Gelman’s case study is model expansion—start with a simple model, look at the residuals (errors), figure out what’s going wrong, then refine the model. Like Gelman, Long starts with a logistic regression model for distance; unlike Gelman, he expands the model with career trajectories and situational effects (like “icing” the kicker) rather than geometry. An interesting exercise would be to do what Gelman did and replace Long’s logistic model of distance with one based on geometry. I’m pretty sure this could be done with brms by transforming the data, but someone would need to verify that.
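To make the suggested exercise concrete, here is one way a geometry-based curve might look next to a logistic one. This is a sketch of my own, not Gelman's or Long's fitted model: it treats a kick as good if its direction error, assumed normal with standard deviation sigma, falls within the angle subtended by the uprights, and it ignores height, wind, and the defense. The 18.5-foot spacing between uprights is the NFL standard; sigma and the logistic coefficients are made-up values, not estimates.

```python
import math

GOALPOST_WIDTH_YARDS = 18.5 / 3.0  # NFL uprights are 18 ft 6 in apart


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_success_geometry(distance_yards, sigma_radians):
    """P(direction error is small enough to pass between the uprights)."""
    threshold = math.atan((GOALPOST_WIDTH_YARDS / 2.0) / distance_yards)
    return 2.0 * normal_cdf(threshold / sigma_radians) - 1.0


def p_success_logistic(distance_yards, intercept, slope):
    """Plain logistic regression curve in distance."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * distance_yards)))


SIGMA = 0.05  # hypothetical angular error, in radians
DISTANCES = (20, 30, 40, 50, 60)
geometry_curve = {d: p_success_geometry(d, SIGMA) for d in DISTANCES}
logistic_curve = {d: p_success_logistic(d, 6.0, -0.11) for d in DISTANCES}

for d in DISTANCES:
    print(d, round(geometry_curve[d], 3), round(logistic_curve[d], 3))
```

The geometry curve has a single free parameter with a physical interpretation, which is the appeal of that approach; whether it fits field-goal data better than the logistic curve is exactly what someone would need to check.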

Similarly, Gelman’s model still has plenty of room for expansion if anyone wants to deal with the condition of the greens (how they’re cut, moisture, etc.), topography, putter career trajectories, situational effects, etc. My father was a scratch golfer in his heyday on local public courses, but he said he’d never be able to sink a single putt if the greens were maintained the way they were for PGA tournaments. He likes to quote Lee Trevino, who said pro greens were like putting on the hood of a car; Trevino’s quotes are legendary. My dad’s own favorite golf quote is “drive for show, putt for dough”—he was obsessive about his short game—his own career was ended by knee and rotator cuff surgery—hockey wasn’t good to his body, either, despite playing in a “non-contact” league as an adult.

It would be fun to try to expand both Long’s and Gelman’s models further. This would also be a natural discussion for the Stan forums, which have a different readership than this blog. I like Gelman’s and Long’s posts because they’re of the hello-world variety and thus easy to understand. Of course, neither’s ready to go into production for bookmaking yet. It’d be great to see references to some state-of-the-art modeling of these things.

Other field goals

Field goals in basketball (shots into the basket from the floor as opposed to free throws) would be another good target for a model like Gelman’s or Long’s. Like the American football case and unlike golf, there’s a defense. Free throws wouldn’t be a good target as they’re all from the same distance (give or take a bit based on where they position themselves side to side).

Are there things like field goals in rugby or Australian-rules football? I love that the actual name of the sport has “rules” in the title—it’s the kind of pedantry near and dear to this semanticist’s heart.


I thought twice about writing about American football. I boycott contact sports like football and ice hockey due to their intentionally violent nature. I’ve not watched American football in over a decade.

For me, this is personal now. I have a good friend of my age (mid-50s) who’s a former hockey player who was recently diagnosed with CTE. He can no longer function independently and has been given 1–2 years to live. His condition resulted from multiple concussions that started in school and continued through college hockey into adult hockey. He had a full hockey scholarship and would’ve been a pro (the second best player after him on our state-champion high-school team in Michigan played for the NY Rangers). My friend’s pro hopes ended when an opponent broke both his knees with a stick during a fight in a college game. He continued playing semi-pro hockey as an adult and accumulating concussions. Hockey was the first sport I boycotted, well over 30 years ago when my friend and my father were still playing, because it was clear to me the players were trying to hurt each other.

I’m now worried about baseball. I saw too many catchers and umpires rocked by foul tips to the face mask this season. I feel less bad watching baseball because at least nobody’s trying to hurt the catchers or umpires as part of the sport. The intent is what originally drove me out of watching hockey and football before the prevalence of CTE among former athletes was widely known. I simply have no interest in watching people trying to hurt each other. Nevertheless, it’s disturbing to watch an umpire get led off the field who can no longer see straight or walk on his own or see a catcher don the gear again after multiple concussions. As we know, that doesn’t end well.

How to think about “medical reversals”?

Bill Harris points to this press release, “Almost 400 medical practices found ineffective in analysis of 3,000 studies,” and asks:

The intent seems good; does the process seem good, too? For one thing, there is patient variation, and RCTs seem focused on medians or means. Right tails can be significant.

This seems related to the last email I sent you (“What if your side wins?”).

From the abstract of the research article, by Diana Herrera-Perez, Alyson Haslam, Tyler Crain, Jennifer Gill, Catherine Livingston, Victoria Kaestner, Michael Hayes, Dan Morgan, and Adam Cifu:

Through an analysis of more than 3000 randomized controlled trials (RCTs) published in three leading medical journals (the Journal of the American Medical Association, the Lancet, and the New England Journal of Medicine), we have identified 396 medical reversals.

I’m not sure what to think about this! I’m sympathetic to the aims and conclusions of this article, but I can see there can be problems with the details.

In particular, what qualifies as a “medical reversal”? From the linked article:

Low-value medical practices are medical practices that are either ineffective or that cost more than other options but only offer similar effectiveness . . . Medical reversals are a subset of low-value medical practices and are defined as practices that have been found, through randomized controlled trials, to be no better than a prior or lesser standard of care. . . .

The challenge comes in when making this judgment from data. I fear that pulling out conclusions from the published literature will lead to the judgment being made based on statistical significance, and that doesn’t seem quite right. On the other hand, you have to start somewhere, and there’s a big medical literature to look through: we wouldn’t want to abandon all that and start from scratch. So I’m not quite sure what to think.

“There is this magic that our DNA enables”

I was just at a talk where a computer scientist was making dramatic claims for the value of human decision making. This is typically a crowd-pleasing position to take—after all, we in the audience are humans and we want to hear how great we are.

What really riled me was when the speaker said, “There is this magic that our DNA enables . . .” as a way of saying why humans are so important.

As I wrote a few years ago, it used to be that humans were defined as the rational animal, and now we’re defined as the irrational computer.

It used to be that what we, humans, had to offer in the world was our rationality; now it’s our irrationality that’s valued.

I think the crowd-pleasing position taken by today’s speaker was just B.S. Don’t get me wrong here, I think human decision making is important; I don’t think we can or should let computers make all our decisions for us.

But let’s not get mystical about it.

“There is this magic that our DNA enables . . .”: This seems to me to be a bizarre sort of techno-mysticism.

The funny thing is, when the speaker was talking about specific research and development, everything said was reasonable. My problem was only in the generalities.

The tone problem in psychology

Are you tone deaf? Find out here.

P.S. Link updated. I guess things can change in 6 months!

Measuring Fraud and Fairness (Sharad Goel’s two talks at Columbia next week)


One Person, One Vote

Abstract: About a quarter of Americans report believing that double voting is a relatively common occurrence, casting doubt on the integrity of elections. But, despite a dearth of documented instances of double voting, it’s hard to know how often such fraud really occurs (people might just be good at covering it up!). I’ll describe a simple statistical trick to directly estimate the rate of double voting — one that builds off the classic “birthday problem” — and show that such behavior is exceedingly rare. I’ll further argue that current efforts to prevent double voting can in fact disenfranchise many legitimate voters.
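The abstract doesn't spell out the estimator, so the following is only the classic birthday-problem calculation it alludes to, not the method from the talk. The idea it illustrates: among voters who share a name, some pairs will also share a birthday purely by coincidence, so matching records are not in themselves evidence of double voting. The sketch assumes birthdays are uniform over 365 days.

```python
from math import comb


def prob_shared_birthday(n, days=365):
    """Chance that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def expected_matching_pairs(n, days=365):
    """Expected number of pairs with the same birthday by coincidence."""
    return comb(n, 2) / days


print(round(prob_shared_birthday(23), 4))
print(round(expected_matching_pairs(23), 3))
```

With 23 people the chance of a shared birthday is already about one half, which is the familiar surprise of the birthday problem.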



The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning

Abstract: The nascent field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last several years, three formal definitions of fairness have gained prominence: (1) anti-classification, meaning that protected attributes — like race, gender, and their proxies — are not explicitly used to make decisions; (2) classification parity, meaning that common measures of predictive performance (e.g., false positive and false negative rates) are equal across groups defined by the protected attributes; and (3) calibration, meaning that conditional on risk estimates, outcomes are independent of protected attributes. In this talk, I’ll show that all three of these fairness definitions suffer from significant statistical limitations. Requiring anti-classification or classification parity can, perversely, harm the very groups they were designed to protect; and calibration, though generally desirable, provides little guarantee that decisions are equitable. In contrast to these formal fairness criteria, I’ll argue that it is often preferable to treat similarly risky people similarly, based on the most statistically accurate estimates of risk that one can produce. Such a strategy, while not universally applicable, often aligns well with policy objectives; notably, this strategy will typically violate both anti-classification and classification parity. In practice, it requires significant effort to construct suitable risk estimates. One must carefully define and measure the targets of prediction to avoid retrenching biases in the data. But, importantly, one cannot generally address these difficulties by requiring that algorithms satisfy popular mathematical formalizations of fairness. By highlighting these challenges in the foundation of fair machine learning, we hope to help researchers and practitioners productively advance the area.
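A toy calculation shows how treating similarly risky people similarly can violate classification parity. The risk scores below are invented, and the sketch assumes they are perfectly calibrated (each score is the true probability of the outcome), so that expected false positive rates can be computed directly. Everyone at or above the same threshold is flagged, regardless of group.

```python
def expected_false_positive_rate(risks, threshold):
    """Expected share of true negatives who are flagged, given calibrated risks."""
    flagged_negatives = sum(1.0 - r for r in risks if r >= threshold)
    all_negatives = sum(1.0 - r for r in risks)
    return flagged_negatives / all_negatives


# Hypothetical calibrated risk scores for two groups.
group_a = [0.1, 0.1, 0.1, 0.7]
group_b = [0.4, 0.4, 0.7, 0.7]
THRESHOLD = 0.5  # same decision rule for everyone

fpr_a = expected_false_positive_rate(group_a, THRESHOLD)
fpr_b = expected_false_positive_rate(group_b, THRESHOLD)
print(round(fpr_a, 3), round(fpr_b, 3))
```

The two groups end up with different false positive rates (0.1 versus about 0.33) even though the rule is identical for every individual, simply because their risk distributions differ.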


“Pfizer had clues its blockbuster drug could prevent Alzheimer’s. Why didn’t it tell the world?”

Jon Baron points to this news article by Christopher Rowland:

Pfizer had clues its blockbuster drug could prevent Alzheimer’s. Why didn’t it tell the world?

A team of researchers inside Pfizer made a startling find in 2015: The company’s blockbuster rheumatoid arthritis therapy Enbrel, a powerful anti-inflammatory drug, appeared to reduce the risk of Alzheimer’s disease by 64 percent.

The results were from an analysis of hundreds of thousands of insurance claims. Verifying that the drug would actually have that effect in people would require a costly clinical trial — and after several years of internal discussion, Pfizer opted against further investigation and chose not to make the data public, the company confirmed.

Researchers in the company’s division of inflammation and immunology urged Pfizer to conduct a clinical trial on thousands of patients, which they estimated would cost $80 million, to see if the signal contained in the data was real, according to an internal company document obtained by The Washington Post. . . .

The company told The Post that it decided during its three years of internal reviews that Enbrel did not show promise for Alzheimer’s prevention because the drug does not directly reach brain tissue. It deemed the likelihood of a successful clinical trial to be low. A synopsis of its statistical findings prepared for outside publication, it says, did not meet its “rigorous scientific standards.” . . .

Likewise, Pfizer said it opted against publication of its data because of its doubts about the results. It said publishing the information might have led outside scientists down an invalid pathway.

Rowland’s news article is amazing, with lots of detail:

Statisticians in 2015 analyzed real world data, hundreds of thousands of medical insurance claims involving people with rheumatoid arthritis and other inflammatory diseases, according to the Pfizer PowerPoint obtained by The Post.

They divided those anonymous patients into two equal groups of 127,000 each, one of patients with an Alzheimer’s diagnosis and one of patients without. Then they checked for Enbrel treatment. There were more people, 302, treated with Enbrel in the group without Alzheimer’s diagnosis. In the group with Alzheimer’s, 110 had been treated with Enbrel.

The numbers may seem small, but they were mirrored in the same proportion when the researchers checked insurance claims information from another database. The Pfizer team also produced closely similar numbers for Humira, a drug marketed by AbbVie that works like Enbrel. The positive results also showed up when checked for “memory loss” and “mild cognitive impairment,” indicating Enbrel may have benefit for treating the earliest stages of Alzheimer’s.

A clinical trial to prove the hypothesis would take four years and involve 3,000 to 4,000 patients, according to the Pfizer document that recommended a trial. . . .

One reason for caution: another class of anti-inflammatory therapies, called non-steroidal anti-inflammatory drugs (NSAIDS), showed no effect against mild-to-moderate Alzheimer’s in several clinical trials a decade ago. Still, a long-term follow-up of one of those trials indicated a benefit if NSAID use began when the brain was still normal, suggesting the timing of therapy could be key.
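The 64 percent figure at the top of the article can be roughly recovered from the counts quoted above. The sketch below computes a crude, unadjusted odds ratio from the two groups of 127,000, with a Wald interval on the log odds ratio. This is an assumption about how such a number could arise, not a description of Pfizer's actual analysis, which presumably adjusted for other factors; the variable names are illustrative.

```python
import math

# Counts as quoted in the article: two groups of 127,000 patients each.
n_per_group = 127_000
enbrel_with_ad = 110      # Enbrel-treated, Alzheimer's diagnosis
enbrel_without_ad = 302   # Enbrel-treated, no Alzheimer's diagnosis

# Odds of Enbrel exposure within each group.
odds_with = enbrel_with_ad / (n_per_group - enbrel_with_ad)
odds_without = enbrel_without_ad / (n_per_group - enbrel_without_ad)

# Crude odds ratio and the implied percent reduction.
odds_ratio = odds_with / odds_without
reduction_pct = 100 * (1 - odds_ratio)

# Approximate 95% Wald interval on the log odds ratio (no adjustment for confounding).
se_log_or = math.sqrt(
    1 / enbrel_with_ad
    + 1 / (n_per_group - enbrel_with_ad)
    + 1 / enbrel_without_ad
    + 1 / (n_per_group - enbrel_without_ad)
)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"odds ratio {odds_ratio:.2f}, reduction {reduction_pct:.0f}%")
print(f"approx. 95% interval for the odds ratio: {ci_low:.2f} to {ci_high:.2f}")
```

The interval only reflects sampling variation in the counts. It says nothing about confounding or selection in insurance-claims data, which is the concern a clinical trial would address.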

Baron writes:

I bet this revelation leads to a slew of off-label prescriptions, just as happened with estrogen a couple of decades ago. My physician friends told me then that you could not recruit subjects for a clinical trial because doctors were just prescribing estrogen for all menopausal women, to prevent Alzheimer’s. I’m still not convinced that the reversal of this practice was a mistake.

That said, off-label prescribing is often a matter of degree. It isn’t as if physicians prescribed hormone replacement for the sole purpose of preventing Alzheimer’s. Rather this was mentioned to patients as an additional selling point.

Here’s the bit that I didn’t understand:

“Likewise, Pfizer said it opted against publication of its data because of its doubts about the results. It said publishing the information might have led outside scientists down an invalid pathway.”

Huh? That makes no sense at all to me.

Baron also points to this blog post by Derek Lowe, “A Missed Alzheimer’s Opportunity? Not So Much,” which argues that the news article quoted above is misleading and that there are good reasons that this Alzheimer’s trial was not done.

Baron then adds:

I find this issue quite interesting. It is not just about statistics in the narrow sense but also about the kind of arguments that would go into forming “Bayesian priors”. In this sort of case, I think that the structure of arguments (about the blood-brain barrier, the possible mechanisms of the effect, the evidence from other trials) could be formalized, perhaps in Bayesian terms. I recall that a few attempts were made to do this for arguments in court cases, but this one is simpler. (And David Schum tried to avoid Bayesian arguments, as I recall.)

It does appear that the reported result was not simply the result of dredging data for anything “significant” (raising the problem of multiple tests). This complicates the story.

I also think that part of the problem is the high cost of clinical trials. In my book “Against bioethics” I argued that some of the problems were the result of “ethical” rules, such as those that regard high pay for subjects as “coercive”, thus slowing down recruitment. But I suspect that FDA statistical requirements may still be a problem. I have not kept up with that.

Conflict of interest statement: I’ve done some work with Novartis.

What’s wrong with null hypothesis significance testing

Following up on yesterday’s post, “What’s wrong with Bayes”:

My problem is not just with the methods—although I do have problems with the method—but also with the ideology.

My problem with the method

You’ve heard this a few zillion times before, and not just from me. Null hypothesis significance testing collapses the wavefunction too soon, leading to noisy decisions—bad decisions. My problem is not with “false positives” or “false negatives”—in my world, there are no true zeroes—but rather that a layer of noise is being added to whatever we might be able to learn from data and models.

Don’t get me wrong. There are times when null hypothesis significance testing can make sense. And, speaking more generally, if a tool is available, people can use it as well as they can. Null hypothesis significance testing is the standard approach in much of science, and, as such, it’s been very useful. But I also think it’s useful to understand the problems with the approach.

My problem with the ideology

My problem with null hypothesis significance testing is not just that some statisticians recommend it, but that they think of it as necessary or fundamental.

Again, the analogy to Bayes might be helpful.

Bayesian statisticians will not only recommend and use Bayesian inference, but also will try their best, when seeing any non-Bayesian method, to interpret it Bayesianly. This can be helpful in revealing statistical models that can be said to be implicitly underlying certain statistical procedures—but ultimately a non-Bayesian method has to be evaluated on its own terms. The fact that a given estimate can be interpreted as, say, a posterior mode under a given probability model, should not be taken to imply that that model needs to be true, or even close to being true, for the method to work.

Similarly, any statistical method, even one that was not developed under a null hypothesis significance testing framework, can be evaluated in terms of type 1 and type 2 errors, coverage of interval estimates, etc. These evaluations can be helpful in understanding the method under certain theoretical, if unrealistic, conditions; see for example here.

The mistake is seeing such theoretical evaluations as fundamental. It can be hard for people to shake off this habit. But, remember: type 1 and type 2 errors are theoretical constructs based on false models. Keep your eye on the ball and remember your larger goals. When it comes to statistical methods, the house is stronger than the foundations.

“Would Republicans pay a price if they vote to impeach the president? Here’s what we know from 1974.”

I better post this one now because it might not be so relevant in 6 months . . .

Bob Erikson answers the question, “Would Republicans pay a price if they vote to impeach the president? Here’s what we know from 1974.” The conclusion: “Nixon loyalists paid the price—not Republicans who voted to impeach.”

This is consistent with some of my research with Jonathan Katz from a while ago. See section 2.3 of this unfinished paper.

What’s wrong with Bayes

My problem is not just with the methods—although I do have problems with the method—but also with the ideology.

My problem with the method

It’s the usual story. Bayesian inference is model-based. Your model will never be perfect, and if you push hard you can find the weak points and magnify them until you get ridiculous inferences.

One example we’ve talked about a lot is the simple case of the estimate,
theta_hat ~ normal(theta, 1)
that’s one standard error away from zero:
theta_hat = 1.
Put a flat prior on theta and you end up with an 84% posterior probability that theta is greater than 0. Step back a bit, and it’s saying that you’ll offer 5-to-1 odds that theta>0 after seeing an observation that is statistically indistinguishable from noise. That can’t make sense. Go around offering 5:1 bets based on pure noise and you’ll go bankrupt real fast. See here for more discussion of this example.
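The 84% and the roughly 5-to-1 odds can be checked directly. With a flat prior, the posterior for theta is normal(theta_hat, 1), so the posterior probability that theta > 0 is the standard normal CDF evaluated at 1. A minimal sketch using only the Python standard library:

```python
from statistics import NormalDist

theta_hat = 1.0   # the estimate, one standard error from zero
se = 1.0          # standard error of the estimate

# Flat prior on theta => posterior is normal(theta_hat, se).
posterior = NormalDist(mu=theta_hat, sigma=se)

# Posterior probability that theta is greater than 0.
p_positive = 1.0 - posterior.cdf(0.0)

# The betting odds implied by taking that probability at face value.
odds = p_positive / (1.0 - p_positive)

print(f"Pr(theta > 0 | theta_hat) = {p_positive:.3f}")
print(f"implied odds = {odds:.1f} to 1")
```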

That was easy. More complicated examples will have more complicated problems, but the way probability works is that you can always find some chink in the model and exploit it to result in a clearly bad prediction.

What about non-Bayesian methods: they’re based on models too, so they’ll also have problems? For sure. But Bayesian inference can be worse because it is so open: you can get the posterior probability for anything.

Don’t get me wrong. I still think Bayesian methods are great, and I think the proclivity of Bayesian inferences to tend toward the ridiculous is just fine—as long as we’re willing to take such poor predictions as a reason to improve our models. But Bayesian inference can lead us astray, and we’re better statisticians if we realize that.

My problem with the ideology

As the saying goes, the problem with Bayes is the Bayesians. It’s the whole religion thing, the people who say that Bayesian reasoning is just rational thinking, or that rational thinking is necessarily Bayesian, the people who refuse to check their models because subjectivity, the people who try to talk you into using a “reference prior” because objectivity. Bayesian inference is a tool. It solves some problems but not all, and I’m exhausted by the ideology of the Bayes-evangelists.

Tomorrow: What’s wrong with null hypothesis significance testing.

Hey—the 2nd-best team in baseball is looking for a Bayesian!

Sarah Gelles writes:

We are currently looking to hire a Bayesian Statistician to join the Houston Astros’ Research & Development team. They would join a growing, cutting-edge R&D team that consists of analysts from a variety of backgrounds and which is involved in all key baseball decisions at the Astros.

Here’s a link to the job posting on Stack Overflow; if anyone in particular comes to mind, we’d appreciate your encouraging them to apply. They’re also welcome to reach out to me directly if they want to further discuss the role and/or working in baseball.

They just need one more left-handed Bayesian to put them over the top.

A Bayesian view of data augmentation.

After my lecture on Principled Bayesian Workflow for a group of machine learners back in August, a discussion arose about data augmentation. The comments were about how it made the data more informative. I questioned that, as there is only so much information in the data: given the model assumptions, just the likelihood. So simply modifying the data should not increase the information; it can only possibly decrease it (if the modification is non-invertible).

Later, when I actually saw an example of data augmentation and thought about this more carefully, I changed my mind. I now realise background knowledge is being brought to bear on how the data is being modified. So data augmentation is just a way of being Bayesian by incorporating prior probabilities. Right?

Then, thinking some more, it all became trivial, as the equations below show.

P(u|x) ~ P(u) * P(x|u)   [Bayes with just the data.]
~  P(u) * P(x|u) * P(ax|u)   [Add the augmented data.]
P(u|x,ax) ~ P(u) * P(x|u) * P(ax|u) [That’s just the posterior given ax.]
P(u|x,ax) ~ P(u) * P(ax|u) * P(x|u) [Change the order of x and ax.]

Now, augmented data is not real data and should not be conditioned on as real. Arguably it is just part of (re)making the prior specification from P(u) into P.axu(u) = P(u) * P(ax|u).

So change the notation to P(u|x) ~ P.axu(u) * P(x|u).

If you data augment (and you are using likelihood based ML, implicitly starting with P(u) = 1), you are being a Bayesian whether you like it or not.
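As a toy check of this identity, the sketch below works on a grid for a normal mean u with known sd of 1. It compares two routes: conditioning on real and augmented data together, versus folding the augmented data into the prior first and then conditioning on the real data alone. The data values, the grid, and the treatment of ax as independent draws given u are all illustrative assumptions, not taken from any real augmentation scheme.

```python
import math

def normal_lik(data, u, sd=1.0):
    """Likelihood of independent normal(u, sd) observations, up to a constant."""
    return math.exp(-sum((d - u) ** 2 for d in data) / (2 * sd ** 2))

def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 100 for i in range(-500, 501)]   # candidate values of u
x = [0.8, 1.3, 0.4]                          # hypothetical real data
ax = [0.9, 1.2, 0.5, 0.7]                    # hypothetical augmented data
flat_prior = [1.0] * len(grid)               # P(u) = 1

# Route 1: treat x and ax both as data, P(u) * P(x|u) * P(ax|u).
post_joint = normalize(
    [p * normal_lik(x, u) * normal_lik(ax, u) for p, u in zip(flat_prior, grid)]
)

# Route 2: fold ax into the prior, then update with the real data only.
prior_ax = normalize([p * normal_lik(ax, u) for p, u in zip(flat_prior, grid)])
post_prior = normalize([p * normal_lik(x, u) for p, u in zip(prior_ax, grid)])

# The two posteriors agree up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(post_joint, post_prior))
post_mean = sum(u * w for u, w in zip(grid, post_joint))
print(f"largest difference between routes: {max_gap:.2e}")
print(f"posterior mean of u: {post_mean:.3f}")
```

The agreement is just the reordering of factors in the equations above. The substantive question is whether P(ax|u) is a prior you would actually defend.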

So I googled a bit and asked a colleague in ML about the above. They said it made sense to them when they thought about it, but that it was not immediately obvious to them. They also said it was not common knowledge – so here it is.

Now better googling gets more stuff, such as “Augmentation is also a form of adding prior knowledge to a model; e.g. images are rotated, which you know does not change the class label,” and this paper, “A Kernel Theory of Modern Data Augmentation” by Dao et al., where in the introduction they state “Data augmentation can encode prior knowledge about data or task-specific invariances, act as regularizer to make the resulting model more robust, and provide resources to data-hungry deep learning models.” The connection to Bayes does not seem to be discussed in either, though.

Further scholarship likely would lead me to consider deleting this post, but what’s the fun in that?

P.S. In the comments, Anonymous argued “we should have that I(a,u) >= I(ax, u)” which I am now guessing was about putting the augmentation into the model instead of introducing it through fake data examples. So instead of modifying the data in ways that are irrelevant to the prediction (e.g. small translations, rotations, or deformations for handwritten digits), put it into the prior. That is, instead of obtaining P.axu(u) = P(u) * P(ax|u) based on n augmentations of the data, make it mathematically (sort of an infinite number of augmentations of the data).

Then Mark van der Wilk adds a comment about actually doing that for multiple possible P.axu(u)s, and then comparing these using the marginal likelihood, in a paper with colleagues.

Now, there could not be a better motivation for my post than this from their introduction: “This human input makes data augmentation undesirable from a machine learning perspective, akin to hand-crafting features. It is also unsatisfactory from a Bayesian perspective, according to which assumptions and expert knowledge should be explicitly encoded in the prior distribution only. By adding data that are not true observations, the posterior may become overconfident, and the marginal likelihood can no longer be used to compare to other models.”

Thanks Mark.






Unquestionable Research Practices

Hi! (This is Dan.) The glorious Josh Loftus from NYU just asked the following question.

Obviously he’s not heard of preregistration.

Seriously though, it’s always good to remember that a lot of ink being spilled over hypothesis testing and its statistical brethren doesn’t mean that if we fix that we’ll fix anything. It all comes to naught if

  1. the underlying model for reality (be it your Bayesian model or your null hypothesis model and test statistic) is rubbish OR
  2. the process of interest is poorly measured or the measurement error isn’t appropriately modelled OR
  3. the data under consideration can’t be generalised to a population of interest.

Control of things like Type 1, Type 2, Type S, and Type M is a bit like combing your hair. It’s great if you’ve got hair to comb, but otherwise it leaves you looking a bit silly.

What’s wrong with Bayes; What’s wrong with null hypothesis significance testing

This will be two posts:

tomorrow: What’s wrong with Bayes

day after tomorrow: What’s wrong with null hypothesis significance testing

My problem in each case is not just with the methods—although I do have problems with the methods—but also with the ideology.

A future post or article: Ideologies of Science: Their Advantages and Disadvantages.

Amazing coincidence! What are the odds?

This post is by Phil Price, not Andrew

Several days ago I wore my cheapo Belarussian one-hand watch. This watch only has an hour hand, but the hand stretches all the way out to the edge of the watch, like the minute hand of a normal watch. The dial is marked with five-minute hash marks, and it turns out it’s quite easy to read it within two or three minutes even without a minute hand. I glanced at it on the dresser at some point and noticed that the hand had stopped exactly at the 12. Amazing! What are the odds?!

I left my house later that morning — the same morning I noticed the watch had stopped at 12 — to meet a friend for lunch. I was wearing a different watch, one with a chronograph (basically a stopwatch) and I started it as I stepped out the door, curious about how well my estimated travel time would match reality. Unfortunately I forgot to stop the watch when I arrived, indeed forgot all about it until my friend and I were sitting down chatting. I reached down and stopped the chronograph without looking at it. When I finally did look at it, several minutes later, I was astonished — astonished, I tell you! — to see that the second hand had stopped exactly at 12.

I started to write out some musings about the various reasons this sort of thing is not actually surprising, but I’m sure most of us have already thought about this issue many times. So just take this as one more example of why we should expect to see ‘unlikely’ coincidences rather frequently.

(BTW, as you can see in the photo neither watch had stopped exactly at 12. The one-hand watch is about 45 seconds shy of 12, and the chronograph, which measures in 1/5-second intervals, is 1 tick too far).
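A back-of-the-envelope version of the argument, under loudly stated assumptions: count the chronograph as "stopped at 12" if it lands within one 1/5-second tick either side, and suppose (hypothetically) that a person has 100 independent chances to notice a coincidence of about that probability. Neither number comes from the post beyond the tick size.

```python
# Chronograph second hand: 5 ticks per second, 60 seconds per revolution.
ticks_per_revolution = 60 * 5

# Assumption: a "hit" means exactly 12 or one tick either side (3 positions).
p_single = 3 / ticks_per_revolution

def p_at_least_one(p, n):
    """Probability of at least one hit in n independent chances, each with probability p."""
    return 1 - (1 - p) ** n

n_chances = 100  # hypothetical number of opportunities to notice something similar
p_some_coincidence = p_at_least_one(p_single, n_chances)

print(f"chance of a single hit: {p_single:.3f}")
print(f"chance of at least one hit in {n_chances} tries: {p_some_coincidence:.2f}")
```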

This post is by Phil.