
Measuring Fraud and Fairness (Sharad Goel’s two talks at Columbia next week)


One Person, One Vote

Abstract: About a quarter of Americans report believing that double voting is a relatively common occurrence, casting doubt on the integrity of elections. But, despite a dearth of documented instances of double voting, it’s hard to know how often such fraud really occurs (people might just be good at covering it up!). I’ll describe a simple statistical trick to directly estimate the rate of double voting — one that builds off the classic “birthday problem” — and show that such behavior is exceedingly rare. I’ll further argue that current efforts to prevent double voting can in fact disenfranchise many legitimate voters.
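The abstract doesn't spell out the method, so the following is only the textbook birthday-problem calculation that such an estimate would presumably start from, not the speaker's actual procedure. It assumes birthdays are uniform over 365 days, and the counts are hypothetical: among voters who share a name and birth year, some pairs will share a birthday purely by chance, and only matches beyond that chance level hint at double voting.

```python
from math import comb

def expected_chance_matches(n_voters: int, days_in_year: int = 365) -> float:
    """Expected number of pairs sharing a birthday by chance alone,
    among n_voters people who already share a name and birth year.
    Assumes birthdays are uniform over the year (a simplification)."""
    return comb(n_voters, 2) / days_in_year

def excess_matches(observed_pairs: int, n_voters: int) -> float:
    """Observed matching pairs minus the number expected by chance."""
    return observed_pairs - expected_chance_matches(n_voters)

# Hypothetical numbers: 100 voter records with the same name and birth year,
# 15 pairs of which also share the same full date of birth.
n = 100
print(f"expected by chance: {expected_chance_matches(n):.2f}")
print(f"excess over chance: {excess_matches(15, n):.2f}")
```

Even with only 100 such records, more than a dozen exact date-of-birth matches are expected from distinct people, which is why raw match counts overstate double voting.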



The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning

Abstract: The nascent field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last several years, three formal definitions of fairness have gained prominence: (1) anti-classification, meaning that protected attributes — like race, gender, and their proxies — are not explicitly used to make decisions; (2) classification parity, meaning that common measures of predictive performance (e.g., false positive and false negative rates) are equal across groups defined by the protected attributes; and (3) calibration, meaning that conditional on risk estimates, outcomes are independent of protected attributes. In this talk, I’ll show that all three of these fairness definitions suffer from significant statistical limitations. Requiring anti-classification or classification parity can, perversely, harm the very groups they were designed to protect; and calibration, though generally desirable, provides little guarantee that decisions are equitable. In contrast to these formal fairness criteria, I’ll argue that it is often preferable to treat similarly risky people similarly, based on the most statistically accurate estimates of risk that one can produce. Such a strategy, while not universally applicable, often aligns well with policy objectives; notably, this strategy will typically violate both anti-classification and classification parity. In practice, it requires significant effort to construct suitable risk estimates. One must carefully define and measure the targets of prediction to avoid retrenching biases in the data. But, importantly, one cannot generally address these difficulties by requiring that algorithms satisfy popular mathematical formalizations of fairness. By highlighting these challenges in the foundation of fair machine learning, we hope to help researchers and practitioners productively advance the area.


“Pfizer had clues its blockbuster drug could prevent Alzheimer’s. Why didn’t it tell the world?”

Jon Baron points to this news article by Christopher Rowland:

Pfizer had clues its blockbuster drug could prevent Alzheimer’s. Why didn’t it tell the world?

A team of researchers inside Pfizer made a startling find in 2015: The company’s blockbuster rheumatoid arthritis therapy Enbrel, a powerful anti-inflammatory drug, appeared to reduce the risk of Alzheimer’s disease by 64 percent.

The results were from an analysis of hundreds of thousands of insurance claims. Verifying that the drug would actually have that effect in people would require a costly clinical trial — and after several years of internal discussion, Pfizer opted against further investigation and chose not to make the data public, the company confirmed.

Researchers in the company’s division of inflammation and immunology urged Pfizer to conduct a clinical trial on thousands of patients, which they estimated would cost $80 million, to see if the signal contained in the data was real, according to an internal company document obtained by The Washington Post. . . .

The company told The Post that it decided during its three years of internal reviews that Enbrel did not show promise for Alzheimer’s prevention because the drug does not directly reach brain tissue. It deemed the likelihood of a successful clinical trial to be low. A synopsis of its statistical findings prepared for outside publication, it says, did not meet its “rigorous scientific standards.” . . .

Likewise, Pfizer said it opted against publication of its data because of its doubts about the results. It said publishing the information might have led outside scientists down an invalid pathway.

Rowland’s news article is amazing, with lots of detail:

Statisticians in 2015 analyzed real world data, hundreds of thousands of medical insurance claims involving people with rheumatoid arthritis and other inflammatory diseases, according to the Pfizer PowerPoint obtained by The Post.

They divided those anonymous patients into two equal groups of 127,000 each, one of patients with an Alzheimer’s diagnosis and one of patients without. Then they checked for Enbrel treatment. There were more people, 302, treated with Enbrel in the group without Alzheimer’s diagnosis. In the group with Alzheimer’s, 110 had been treated with Enbrel.

The numbers may seem small, but they were mirrored in the same proportion when the researchers checked insurance claims information from another database. The Pfizer team also produced closely similar numbers for Humira, a drug marketed by AbbVie that works like Enbrel. The positive results also showed up when checked for “memory loss” and “mild cognitive impairment,” indicating Enbrel may have benefit for treating the earliest stages of Alzheimer’s.

A clinical trial to prove the hypothesis would take four years and involve 3,000 to 4,000 patients, according to the Pfizer document that recommended a trial. . . .

One reason for caution: another class of anti-inflammatory therapies, called non-steroidal anti-inflammatory drugs (NSAIDS), showed no effect against mild-to-moderate Alzheimer’s in several clinical trials a decade ago. Still, a long-term follow-up of one of those trials indicated a benefit if NSAID use began when the brain was still normal, suggesting the timing of therapy could be key.
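As a rough check on the numbers quoted above, here is a crude, unadjusted odds-ratio calculation from the two groups of 127,000. The article does not say how Pfizer arrived at its 64 percent figure, so treating it as one minus this odds ratio is an assumption; the function name is just for illustration.

```python
def odds_ratio(exposed_cases, n_cases, exposed_controls, n_controls):
    """Unadjusted odds ratio of drug exposure, cases versus controls."""
    odds_cases = exposed_cases / (n_cases - exposed_cases)
    odds_controls = exposed_controls / (n_controls - exposed_controls)
    return odds_cases / odds_controls

# Counts as quoted in the article: 110 Enbrel users among 127,000 patients
# with an Alzheimer's diagnosis, 302 among 127,000 patients without.
or_enbrel = odds_ratio(110, 127_000, 302, 127_000)
reduction = 1 - or_enbrel

print(f"odds ratio = {or_enbrel:.3f}, implied reduction = {reduction:.0%}")
```

The implied reduction comes out at about 64 percent, in line with the headline figure. This says nothing about confounding, which is what the arguments over the blood-brain barrier and a possible trial are really about.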

Baron writes:

I bet this revelation leads to a slew of off-label prescriptions, just as happened with estrogen a couple of decades ago. My physician friends told me then that you could not recruit subjects for a clinical trial because doctors were just prescribing estrogen for all menopausal women, to prevent Alzheimer’s. I’m still not convinced that the reversal of this practice was a mistake.

That said, off-label prescribing is often a matter of degree. It isn’t as if physicians prescribed hormone replacement for the sole purpose of preventing Alzheimer’s. Rather this was mentioned to patients as an additional selling point.

Here’s the bit that I didn’t understand:

“Likewise, Pfizer said it opted against publication of its data because of its doubts about the results. It said publishing the information might have led outside scientists down an invalid pathway.”

Huh? That makes no sense at all to me.

Baron also points to this blog by Derek Lowe, “A Missed Alzheimer’s Opportunity? Not So Much,” which argues that the news article quoted above is misleading and that there are good reasons that this Alzheimer’s trial was not done.

Baron then adds:

I find this issue quite interesting. It is not just about statistics in the narrow sense but also about the kind of arguments that would go into forming “Bayesian priors”. In this sort of case, I think that the structure of arguments (about the blood-brain barrier, the possible mechanisms of the effect, the evidence from other trials) could be formalized, perhaps in Bayesian terms. I recall that a few attempts were made to do this for arguments in court cases, but this one is simpler. (And David Schum tried to avoid Bayesian arguments, as I recall.)

It does appear that the reported result was not simply the result of dredging data for anything “significant” (raising the problem of multiple tests). This complicates the story.

I also think that part of the problem is the high cost of clinical trials. In my book “Against bioethics” I argued that some of the problems were the result of “ethical” rules, such as those that regard high pay for subjects as “coercive”, thus slowing down recruitment. But I suspect that FDA statistical requirements may still be a problem. I have not kept up with that.

Conflict of interest statement: I’ve done some work with Novartis.

What’s wrong with null hypothesis significance testing

Following up on yesterday’s post, “What’s wrong with Bayes”:

My problem is not just with the methods—although I do have problems with the method—but also with the ideology.

My problem with the method

You’ve heard this a few zillion times before, and not just from me. Null hypothesis significance testing collapses the wavefunction too soon, leading to noisy decisions: bad decisions. My problem is not with “false positives” or “false negatives” (in my world, there are no true zeroes) but rather that a layer of noise is being added to whatever we might be able to learn from data and models.

Don’t get me wrong. There are times when null hypothesis significance testing can make sense. And, speaking more generally, if a tool is available, people can use it as well as they can. Null hypothesis significance testing is the standard approach in much of science, and, as such, it’s been very useful. But I also think it’s useful to understand the problems with the approach.

My problem with the ideology

My problem with null hypothesis significance testing is not just that some statisticians recommend it, but that they think of it as necessary or fundamental.

Again, the analogy to Bayes might be helpful.

Bayesian statisticians will not only recommend and use Bayesian inference, but also will try their best, when seeing any non-Bayesian method, to interpret it Bayesianly. This can be helpful in revealing statistical models that can be said to be implicitly underlying certain statistical procedures, but ultimately a non-Bayesian method has to be evaluated on its own terms. The fact that a given estimate can be interpreted as, say, a posterior mode under a given probability model, should not be taken to imply that that model needs to be true, or even close to being true, for the method to work.

Similarly, any statistical method, even one that was not developed under a null hypothesis significance testing framework, can be evaluated in terms of type 1 and type 2 errors, coverage of interval estimates, etc. These evaluations can be helpful in understanding the method under certain theoretical, if unrealistic, conditions; see for example here.

The mistake is seeing such theoretical evaluations as fundamental. It can be hard for people to shake off this habit. But, remember: type 1 and type 2 errors are theoretical constructs based on false models. Keep your eye on the ball and remember your larger goals. When it comes to statistical methods, the house is stronger than the foundations.

“Would Republicans pay a price if they vote to impeach the president? Here’s what we know from 1974.”

I better post this one now because it might not be so relevant in 6 months . . .

Bob Erikson answers the question, “Would Republicans pay a price if they vote to impeach the president? Here’s what we know from 1974.” The conclusion: “Nixon loyalists paid the price—not Republicans who voted to impeach.”

This is consistent with some of my research with Jonathan Katz from a while ago. See section 2.3 of this unfinished paper.

What’s wrong with Bayes

My problem is not just with the methods—although I do have problems with the method—but also with the ideology.

My problem with the method

It’s the usual story. Bayesian inference is model-based. Your model will never be perfect, and if you push hard you can find the weak points and magnify them until you get ridiculous inferences.

One example we’ve talked about a lot is the simple case of the estimate,
theta_hat ~ normal(theta, 1)
that’s one standard error away from zero:
theta_hat = 1.
Put a flat prior on theta and you end up with an 84% posterior probability that theta is greater than 0. Step back a bit, and it’s saying that you’ll offer 5-to-1 odds that theta>0 after seeing an observation that is statistically indistinguishable from noise. That can’t make sense. Go around offering 5:1 bets based on pure noise and you’ll go bankrupt real fast. See here for more discussion of this example.
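The arithmetic behind the 84% and the 5-to-1 odds can be checked in a few lines. This is a minimal sketch using only the standard library; with a flat prior the posterior is theta ~ normal(theta_hat, 1), so the posterior probability that theta > 0 is the standard normal CDF evaluated at theta_hat divided by its standard error.

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

theta_hat = 1.0  # the estimate, one standard error from zero
se = 1.0         # standard error of the estimate

# Flat prior => posterior is normal(theta_hat, se).
p_positive = normal_cdf(theta_hat / se)
odds = p_positive / (1 - p_positive)

print(f"Pr(theta > 0 | data) = {p_positive:.3f}")
print(f"posterior odds       = {odds:.1f} to 1")
```

The probability is about 0.84 and the odds a bit over 5 to 1, matching the numbers in the text.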

That was easy. More complicated examples will have more complicated problems, but the way probability works is that you can always find some chink in the model and exploit it to result in a clearly bad prediction.

What about non-Bayesian methods: they’re based on models too, so they’ll also have problems? For sure. But Bayesian inference can be worse because it is so open: you can get the posterior probability for anything.

Don’t get me wrong. I still think Bayesian methods are great, and I think the proclivity of Bayesian inferences to tend toward the ridiculous is just fine—as long as we’re willing to take such poor predictions as a reason to improve our models. But Bayesian inference can lead us astray, and we’re better statisticians if we realize that.

My problem with the ideology

As the saying goes, the problem with Bayes is the Bayesians. It’s the whole religion thing, the people who say that Bayesian reasoning is just rational thinking, or that rational thinking is necessarily Bayesian, the people who refuse to check their models because subjectivity, the people who try to talk you into using a “reference prior” because objectivity. Bayesian inference is a tool. It solves some problems but not all, and I’m exhausted by the ideology of the Bayes-evangelists.

Tomorrow: What’s wrong with null hypothesis significance testing.

Hey—the 2nd-best team in baseball is looking for a Bayesian!

Sarah Gelles writes:

We are currently looking to hire a Bayesian Statistician to join the Houston Astros’ Research & Development team. They would join a growing, cutting-edge R&D team that consists of analysts from a variety of backgrounds and which is involved in all key baseball decisions at the Astros.

Here’s a link to the job posting on Stack Overflow; if anyone in particular comes to mind, we’d appreciate your encouraging them to apply. They’re also welcome to reach out to me directly if they want to further discuss the role and/or working in baseball.

They just need one more left-handed Bayesian to put them over the top.

A Bayesian view of data augmentation.

After my lecture on Principled Bayesian Workflow for a group of machine learners back in August, a discussion arose about data augmentation. The comments were about how it made the data more informative. I questioned that, as there is only so much information in the data: given the model assumptions, just the likelihood. So by simply modifying the data, information should not increase but only possibly decrease (if the modification is non-invertible).

Later, when I actually saw an example of data augmentation and thought about this more carefully, I changed my mind. I now realise background knowledge is being brought to bear on how the data is being modified. So data augmentation is just a way of being Bayesian by incorporating prior probabilities. Right?

Then thinking some more, it became all trivial as the equations below show.

P(u|x) ~ P(u) * P(x|u)   [Bayes with just the data.]
~  P(u) * P(x|u) * P(ax|u)   [Add the augmented data.]
P(u|x,ax) ~ P(u) * P(x|u) * P(ax|u) [That’s just the posterior given ax.]
P(u|x,ax) ~ P(u) * P(ax|u) * P(x|u) [Change the order of x and ax.]

Now, augmented data is not real data and should not be conditioned on as real. Arguably it is just part of (re)making the prior specification from P(u) into P.axu(u) = P(u) * P(ax|u).

So change the notation to P(u|x) ~ P.axu(u) * P(x|u).

If you data augment (and you are using likelihood based ML, implicitly starting with P(u) = 1), you are being a Bayesian whether you like it or not.
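The equivalence can be checked numerically. Below is a minimal sketch on a grid, assuming a normal likelihood with known sd 1, a flat starting prior, and made-up values for the data x and the augmented data ax; none of these choices come from the post. Treating ax as extra data and folding ax into the prior give the same posterior.

```python
from math import exp

def normal_lik(data, u, sd=1.0):
    """Likelihood of data under normal(u, sd), up to a constant factor."""
    return exp(-sum((d - u) ** 2 for d in data) / (2 * sd ** 2))

def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 100 for i in range(-300, 301)]  # candidate values of u
x = [0.8, 1.3, 0.4]    # hypothetical real data
ax = [0.9, 1.2, 0.5]   # hypothetical augmented (perturbed) copies

flat_prior = [1.0 for _ in grid]  # P(u) = 1, as in likelihood-based ML

# Route 1: condition on the augmented data as if it were real data.
post_as_data = normalize(
    [p * normal_lik(x, u) * normal_lik(ax, u) for p, u in zip(flat_prior, grid)]
)

# Route 2: fold the augmented data into the prior, then condition on x only.
aug_prior = normalize([p * normal_lik(ax, u) for p, u in zip(flat_prior, grid)])
post_as_prior = normalize([p * normal_lik(x, u) for p, u in zip(aug_prior, grid)])

max_diff = max(abs(a - b) for a, b in zip(post_as_data, post_as_prior))
print(f"largest difference between the two posteriors: {max_diff:.2e}")
```

The two routes differ only by floating-point rounding, which is all the reordering of terms in the equations above is saying.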

So I googled a bit and asked a colleague in ML about the above. They said it “makes sense to me when I think about it, but that was not immediately obvious to me.” They also said it was not common knowledge, so here it is.

Now better googling gets more stuff, such as “Augmentation is also a form of adding prior knowledge to a model; e.g. images are rotated, which you know does not change the class label,” and this paper, A Kernel Theory of Modern Data Augmentation by Dao et al., where in the introduction they state “Data augmentation can encode prior knowledge about data or task-specific invariances, act as regularizer to make the resulting model more robust, and provide resources to data-hungry deep learning models.” However, the connection to Bayes does not seem to be discussed in either.

Further scholarship likely would lead me to consider deleting this post, but what’s the fun in that?

P.S. In the comments, Anonymous argued “we should have that I(a,u) >= I(ax, u)” which I am now guessing was about putting the augmentation into the model instead of introducing it through fake data examples. So instead of modifying the data in ways that are irrelevant to the prediction (e.g. small translations, rotations, or deformations for handwritten digits), put it into the prior. That is, instead of obtaining P.axu(u) = P(u) * P(ax|u) from n augmentations of the data, construct the prior mathematically (sort of like using an infinite number of augmentations of the data).

Then Mark van der Wilk adds a comment about actually doing that for multiple possible such priors, and then comparing these using the marginal likelihood, in a paper with colleagues.

Now, there could not be a better motivation for my post than this from their introduction: “This human input makes data augmentation undesirable from a machine learning perspective, akin to hand-crafting features. It is also unsatisfactory from a Bayesian perspective, according to which assumptions and expert knowledge should be explicitly encoded in the prior distribution only. By adding data that are not true observations, the posterior may become overconfident, and the marginal likelihood can no longer be used to compare to other models.”

Thanks Mark.






Unquestionable Research Practices

Hi! (This is Dan.) The glorious Josh Loftus from NYU just asked the following question.

Obviously he’s not heard of preregistration.

Seriously though, it’s always good to remember that a lot of ink being spilled over hypothesis testing and its statistical brethren doesn’t mean that if we fix that we’ll fix anything. It all comes to naught if

  1. the underlying model for reality (be it your Bayesian model or your null hypothesis model and test statistic) is rubbish OR
  2. the process of interest is poorly measured or the measurement error isn’t appropriately modelled OR
  3. the data under consideration can’t be generalised to a population of interest.

Control of things like Type 1, Type 2, Type S, and Type M is a bit like combing your hair. It’s great if you’ve got hair to comb, but otherwise it leaves you looking a bit silly.

What’s wrong with Bayes; What’s wrong with null hypothesis significance testing

This will be two posts:

tomorrow: What’s wrong with Bayes

day after tomorrow: What’s wrong with null hypothesis significance testing

My problem in each case is not just with the methods—although I do have problems with the methods—but also with the ideology.

A future post or article: Ideologies of Science: Their Advantages and Disadvantages.

Amazing coincidence! What are the odds?

This post is by Phil Price, not Andrew

Several days ago I wore my cheapo Belarussian one-hand watch. This watch only has an hour hand, but the hand stretches all the way out to the edge of the watch, like the minute hand of a normal watch. The dial is marked with five-minute hash marks, and it turns out it’s quite easy to read it within two or three minutes even without a minute hand. I glanced at it on the dresser at some point and noticed that the hand had stopped exactly at the 12. Amazing! What are the odds?!

I left my house later that morning — the same morning I noticed the watch had stopped at 12 — to meet a friend for lunch. I was wearing a different watch, one with a chronograph (basically a stopwatch) and I started it as I stepped out the door, curious about how well my estimated travel time would match reality. Unfortunately I forgot to stop the watch when I arrived, indeed forgot all about it until my friend and I were sitting down chatting. I reached down and stopped the chronograph without looking at it. When I finally did look at it, several minutes later, I was astonished — astonished, I tell you! — to see that the second hand had stopped exactly at 12.

I started to write out some musings about the various reasons this sort of thing is not actually surprising, but I’m sure most of us have already thought about this issue many times. So just take this as one more example of why we should expect to see ‘unlikely’ coincidences rather frequently.

(BTW, as you can see in the photo neither watch had stopped exactly at 12. The one-hand watch is about 45 seconds shy of 12, and the chronograph, which measures in 1/5-second intervals, is 1 tick too far).

This post is by Phil.

“Some call it MRP, some Mister P, but the full name is . . .”

Jim Savage points us to this explainer, How do pollsters predict UK general election results?, by John Burn-Murdoch of the Financial Times.

It’s bittersweet seeing my method described by some person I’ve never met. Little baby MRP is all grown up!

Being explained by the Financial Times—that’s about as good as being in the Guardian and the Times. Not quite as good as being mentioned in Private Eye—that’s still the peak of the news media, as far as I’m concerned. But still pretty good.

It’s kind of amazing seeing the phrase “MRP election poll” in a headline, with the implication that’s just a standard phrase now, not even needing a reference. A lot can happen in 22 years.

Don’t believe people who say they can look at your face and tell that you’re lying.

Kevin Lewis points us to this article, Lessons From Pinocchio: Cues to Deception May Be Highly Exaggerated, by Timothy Luke, which begins:

Deception researchers widely acknowledge that cues to deception—observable behaviors that may differ between truthful and deceptive messages—tend to be weak. Nevertheless, several deception cues have been reported with unusually large effect sizes, and some researchers have advocated the use of such cues as tools for detecting deceit and assessing credibility in practical contexts. By examining data from empirical deception-cue research and using a series of Monte Carlo simulations, I demonstrate that many estimated effect sizes of deception cues may be greatly inflated by publication bias, small numbers of estimates, and low power. Indeed, simulations indicate the informational value of the present deception literature is quite low, such that it is not possible to determine whether any given effect is real or a false positive.

Indeed, I’ve always been suspicious of people who claim to be able to detect lies by looking at people’s faces. Valuable information can be obtained from facial expressions, that’s for sure, but detecting lies is tough; it can just be a way for people to exercise their prejudices.

That said, when I was a kid, whenever my sister and I had a dispute, she always told the truth and I was always lying, and my parents believed her every time. It was soooo unfair: they’d believe her, even when there was no direct evidence contradicting whatever story I happened to be spinning.
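To see how publication bias and low power inflate reported effects, as the abstract above describes, here is a small simulation sketch. It is not the simulation from the article: the true effect of 0.2, the 20 observations per group, and the rule that only estimates significant in the expected direction get published are all assumptions chosen for illustration, and each study's estimate is drawn directly from its approximate sampling distribution.

```python
import random
from math import sqrt
from statistics import mean

def simulate_published_effects(true_d=0.2, n_per_group=20, n_studies=5000, seed=1):
    """Simulate standardized effect estimates from many small studies and
    keep ("publish") only those significant in the expected direction."""
    rng = random.Random(seed)
    se = sqrt(2 / n_per_group)  # approximate standard error of the estimate
    all_estimates, published = [], []
    for _ in range(n_studies):
        d_hat = rng.gauss(true_d, se)  # one study's estimated effect
        all_estimates.append(d_hat)
        if d_hat / se > 1.96:          # the statistical-significance filter
            published.append(d_hat)
    return mean(all_estimates), mean(published), len(published) / n_studies

all_mean, pub_mean, share_published = simulate_published_effects()
print(f"mean of all estimates:       {all_mean:.2f}")
print(f"mean of published estimates: {pub_mean:.2f}")
print(f"share of studies published:  {share_published:.1%}")
```

Under these assumptions any published estimate has to exceed roughly 0.62 just to clear the significance threshold, so the published literature overstates a true effect of 0.2 by a factor of three or more.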

What comes after Vixra?

OK, so Arxiv publishes anything. But some things are so cranky that Arxiv won’t publish them, so they go on Vixra. Here’s my question: where do the people publish, who can’t publish on Vixra? The cranks’ cranks, as it were? It’s a Cantor’s corner kinda thing.

When speculating about causes of trends in mortality rates: (a) make sure that what you’re trying to explain has actually been happening, and (b) be clear where your data end and your speculations begin.

A reporter writes:

I’d be very interested in getting your take on this recent paper. I am immensely skeptical of it. That’s not to say many Trump supporters aren’t racist! But we’re now going to claim that this entire rise in all-cause mortality can be attributed to the false sense of lost status? So so so so so skeptical.

You’re cited, and the headline takeaway is about perceived racialized threat to social status. But threat to social status isn’t mentioned; % of GOP voteshare is taken as a straightforward proxy of this. But doesn’t voteshare % jump around for a million reasons, often in reaction to the most recent election?

I took a look. I don’t see how they can say “For these reasons (and for the sake of parsimony), like Case and Deaton (2017), our starting premise is to examine as a singular phenomenon; the rise in national mortality rates of working-age white men and women.” Just look at figure 2C here. They cite this paper but they don’t seem to get the point that the rate among middle-aged men was going down, not up, from 2005-2015. This is important because much of the decline-of-status discussion centers on men.


Also, see here (which links to an unpublished report with tons more graphs). Some lines go up and some lines go down. “For the sake of parsimony” just doesn’t cut it here. Later in the paper they write that the rise in white mortality “is more accentuated in women than in men.” But “more accentuated” seems wrong. According to the statistics, the mortality rate among 45-54-year-old non-Hispanic white men was declining from 2005-2015.

This is a big problem in social science: lots of effort expended to explain some phenomenon, without it being clear exactly what is being explained. So you have to be careful about statements such as, “A valid causal story must explain something that is occurring widely among whites and also explain why it is not occurring among blacks.” I don’t think that kind of monocausal thinking is helpful.

The comparisons by education group are tricky because average education levels have been increasing over time. That’s not to say the authors should not break things down by education group, just that it’s tricky.

Regarding their county-level analysis: it seems that what they find is that Republican vote share in 2016 is predictive of trends in white mortality rates. This is similar to other correlations that we’ve been seeing: in short, Trump did well (and Clinton poorly) among white voters in certain rural and low-income places in the country. I don’t see that this gives any direct evidence regarding status threat. Also I don’t think the following statement makes sense: “In the absence of an instrumental variable, or of a natural experiment, our study provides a conservative estimate of the effect of the Republican vote share by controlling for a host of economic and social factors.” First, “conservative” is a kind of weasel word that allows people to imply without evidence that true effects are higher than what they found; second, “effect of the Republican vote share” doesn’t make sense. A vote share doesn’t kill people. It doesn’t make sense to say that person X died because a greater percentage of people in person X’s county voted for Trump.

Finally, they put this in italics: “For perhaps the first time, we are suggesting that a major population health phenomenon – a widespread one – cannot be explained by actual social or economic status disadvantage but instead is driven by perceived threat to status.” But I don’t see the evidence for it. They don’t supply any data on “perceived threat to status.” At least, I didn’t see anything in the data. So, sure, they can suggest what they want, but I don’t find it convincing.

All that said, I have general positive feelings about the linked paper, in the sense that they’re studying something worth looking into. Social scientists including myself spend lots of time on fun topics like golf putting and sumo wrestling, and this can be a great way to develop and understand research methods; but it’s also good for people to take a shot at more important problems, even if the data aren’t really there to address the questions we’d like to ask.

There should be a way for researchers to study these issues without feeling the need to exaggerate what they’ve found (as in this press release, on “a striking reversal [in mortality rate trends] among working-age whites, which seems to be driven principally by anxiety among whites about losing social status to Blacks”, without mentioning that (a) the trends go in opposite directions for men and women and (b) their research offers no evidence that anything is being driven, principally or otherwise, by anxiety or social status).

P.S. I can understand my correspondent’s desire for anonymity here. A couple years ago I got blasted on twitter by a leading public health researcher for my response to Case and Deaton. He wrote that I had “scoffed at the Case/Deaton finding about U.S. life expectancy . . . Has he ever admitted he was wrong about that?” I sent him an email saying, “Whenever I am wrong in public, I always announce my error in public too. I’ve corrected four of my published papers and have corrected many errors or unclear points in my other writings. But I can only issue a correction if I know where I was wrong. Can you please explain where I was wrong regarding the work of Case and Deaton? I am not aware of any errors that I made in that regard. Thank you.” We did a few emails back and forth and at no time did he give any examples of where I’d “scoffed” or where I’d been wrong. He wrote that I spent most of my time “carping about compositional effects” and that my efforts “helped spread the idea that Case and Deaton were wrong, that there was nothing to see here, that it was all liberal whining about inequality, etc., etc.” When the facts get in the way of the story, shoot the messenger.

In short, adding more animals to your experiment is fine. The problem is in using statistical significance to make decisions about what to conclude from your data.

Denis Jabaudon writes:

I was thinking that perhaps you could help me with the following “paradox?” that I often find myself in when discussing with students (I am a basic neuroscientist and my unit of counting is usually cells or animals):

When performing a “pilot” study on say 5 animals, and finding an “almost significant” result, or a “trend”, why is it incorrect to add another 5 animals to that sample and to look at the P value then?

Notwithstanding inducing the bias towards false positives (we would not add 5 animals if there was no trend), which I understand, why would the correct procedure be to start again from scratch with 10 animals?

Why do these first 5 results (or hundreds of patients depending on context) need to be discarded?

If you have any information on this it would be greatly appreciated; this is such a common practice that I’d like to have good arguments to counter it.

This one comes up a lot, in one form or another. My quick answer is as follows:

1. Statistical significance doesn’t answer any relevant question. Forget statistical significance and p-values. The goal is not to reject a null hypothesis; the goal is to estimate the treatment effect or some other parameter of your model.

2. You can do Bayesian analysis. Adding more data is just fine, you’ll just account for it in your posterior distribution. Further discussion here.

3. If you go long enough, you’ll eventually reach statistical significance at any specified level—but that’s fine. True effects are not zero (or, even if they are, there’s always systematic measurement error of one sort or another).

In short, adding more animals to your experiment is fine. The problem is in using statistical significance to make decisions about what to conclude from your data.
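Point 2 above can be illustrated with a small sketch. It assumes the simplest possible setup, a normal mean with known measurement sd and a normal prior, and the animal measurements are made up; the point is only that updating on the pilot and then on the extra animals gives the same posterior as analyzing all ten at once.

```python
def update_normal(prior_mean, prior_sd, data, sigma):
    """Conjugate update for a normal mean with known data sd sigma.
    Returns the posterior mean and posterior sd."""
    prior_prec = 1 / prior_sd ** 2
    data_prec = len(data) / sigma ** 2
    post_prec = prior_prec + data_prec
    data_mean = sum(data) / len(data)
    post_mean = (prior_prec * prior_mean + data_prec * data_mean) / post_prec
    return post_mean, post_prec ** -0.5

pilot = [0.9, 1.4, 0.2, 1.1, 0.7]      # hypothetical first 5 animals
follow_up = [0.5, 1.0, 1.3, 0.1, 0.8]  # hypothetical next 5 animals
prior_mean, prior_sd, sigma = 0.0, 1.0, 1.0  # assumed prior and noise level

# Sequential: the posterior after the pilot becomes the prior for the follow-up.
m_pilot, s_pilot = update_normal(prior_mean, prior_sd, pilot, sigma)
m_seq, s_seq = update_normal(m_pilot, s_pilot, follow_up, sigma)

# Batch: analyze all 10 animals at once.
m_all, s_all = update_normal(prior_mean, prior_sd, pilot + follow_up, sigma)

print(f"sequential: mean = {m_seq:.4f}, sd = {s_seq:.4f}")
print(f"batch:      mean = {m_all:.4f}, sd = {s_all:.4f}")
```

Nothing from the pilot has to be thrown away: the first five animals simply enter through the updated prior.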

The default prior for logistic regression coefficients in Scikit-learn

Someone pointed me to this post by W. D., reporting that, in Python’s popular Scikit-learn package, the default prior for logistic regression coefficients is normal(0,1)—or, as W. D. puts it, L2 penalization with a lambda of 1.

In the post, W. D. makes three arguments. I agree with two of them.

1. I agree with W. D. that it makes sense to scale predictors before regularization. (There are various ways to do this scaling, but I think that scaling by 2*observed sd is a reasonable default for non-binary predictors.)

2. I agree with W. D. that default settings should be made as clear as possible at all times.

3. I disagree with the author that a default regularization prior is a bad idea. As a general point, I think it makes sense to regularize, and when it comes to this specific problem, I think that a normal(0,1) prior is a reasonable default option (assuming the predictors have been scaled). I think that rstanarm is currently using normal(0,2.5) as a default, but if I had to choose right now, I think I’d go with normal(0,1), actually.

Apparently some of the discussion of this default choice revolved around whether the routine should be considered “statistics” (where the primary goal is typically parameter estimation) or “machine learning” (where the primary goal is typically prediction). As far as I’m concerned, it doesn’t matter: I’d prefer a reasonably strong default prior such as normal(0,1) both for parameter estimation and for prediction.

Again, I’ll repeat points 1 and 2 above: You do want to standardize the predictors before using this default prior, and in any case the user should be made aware of the defaults, and how to override them.
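Here is a small sketch of the two ideas in play: scaling a predictor by two standard deviations, and the correspondence between an L2 penalty with lambda = 1 and a normal(0,1) prior. It uses only the Python standard library; the function names and toy data are made up. As I understand scikit-learn's documented objective, the L2 term is 0.5 times the squared norm of the coefficients and C multiplies the log-loss, so the default C = 1 corresponds to `lam = 1` below. Treat that mapping as an assumption to check against the documentation.

```python
import math
import statistics

def scale_by_two_sd(x):
    """Center a predictor and divide by twice its sample sd."""
    m = statistics.mean(x)
    s = statistics.stdev(x)
    return [(xi - m) / (2 * s) for xi in x]

def neg_log_lik(beta0, beta, X, y):
    """Negative Bernoulli log-likelihood for logistic regression, y in {0, 1}."""
    total = 0.0
    for row, yi in zip(X, y):
        eta = beta0 + sum(b * v for b, v in zip(beta, row))
        total += math.log1p(math.exp(eta)) - yi * eta
    return total

def l2_penalized_loss(beta0, beta, X, y, lam=1.0):
    """Log-loss plus 0.5 * lam * ||beta||^2 (intercept not penalized)."""
    return neg_log_lik(beta0, beta, X, y) + 0.5 * lam * sum(b * b for b in beta)

def neg_log_posterior(beta0, beta, X, y, prior_sd=1.0):
    """Negative log posterior with independent normal(0, prior_sd) priors on beta."""
    log_prior = sum(
        -0.5 * (b / prior_sd) ** 2 - math.log(prior_sd * math.sqrt(2 * math.pi))
        for b in beta
    )
    return neg_log_lik(beta0, beta, X, y) - log_prior

# Hypothetical data: one continuous predictor, binary outcome.
age = [23, 35, 47, 51, 62, 38]
y = [0, 0, 1, 1, 1, 0]
scaled = scale_by_two_sd(age)
X = [[a] for a in scaled]

# The two objectives differ only by a constant that does not depend on beta,
# so they share the same minimizer (the posterior mode).
gap_a = neg_log_posterior(0.1, [0.5], X, y) - l2_penalized_loss(0.1, [0.5], X, y)
gap_b = neg_log_posterior(-0.7, [2.0], X, y) - l2_penalized_loss(-0.7, [2.0], X, y)
print(gap_a, gap_b)
```

After scaling, the predictor has sd 0.5, which is what makes a unit-scale prior like normal(0,1) interpretable across predictors measured in different units.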

P.S. Sander Greenland and I had a discussion of this. Sander disagreed with me so I think it will be valuable to share both perspectives.

Controversies in vaping statistics, leading to a general discussion of dispute resolution in science

Episode 2

Brad Rodu writes:

The Journal of the American Heart Association on June 5, 2019, published a bogus research article, “Electronic cigarette use and myocardial infarction among adults in the US Population Assessment of Tobacco and Health [PATH],” by Dharma N. Bhatta and Stanton A. Glantz (here).

Drs. Bhatta and Glantz used PATH Wave 1 survey data to claim that e-cigarette use caused heart attacks. However, the public use data shows that 11 of the 38 current e-cigarette users in their study had a heart attack years before they first started using e-cigarettes.

The article misrepresents the research record; presents a demonstrably inaccurate analysis; and omits critical information with respect to (a) when survey participants were first told that they had a heart attack, and (b) when participants first started using e-cigarettes. The article represents a significant departure from accepted research practices.

For more background, see this news article by Jayne O’Donnell, “Study linking vaping to heart attacks muddied amid spat between two tobacco researchers,” which discusses the controversy and also gives some background on Rodu and Glantz.

I was curious, so I followed the instructions on Rodu’s blog to download the data and run the R script. I did not try to follow all the code; I just ran it. Here’s what pops up:

This indeed appears consistent with Rodu’s statement that “11 of the 38 current e-cigarette users were first told that they had a heart attack years before they started using e-cigarettes.” The above table only has 34 people, not 38; I asked Rodu about this and he said that the table doesn’t include the 4 participants who had missing info on age at first heart attack or age at first use of e-cigarettes.

How does this relate to the published paper by Bhatta and Glantz? I clicked on the link and took a look.

Here’s the relevant data discussion from Bhatta and Glantz:

As discussed above, we cannot infer temporality from the cross‐sectional finding that e‐cigarette use is associated with having had an MI and it is possible that first MIs occurred before e‐cigarette use. PATH Wave 1 was conducted in 2013 to 2014, only a few years after e‐cigarettes started gaining popularity on the US market around 2007. To address this problem we used the PATH questions “How old were you when you were first told you had a heart attack (also called a myocardial infarction) or needed bypass surgery?” and the age when respondents started using e‐cigarettes and cigarettes (1) for the very first time, (2) fairly regularly, and (3) every day. We used current age and age of first MI to select only those people who had their first MIs at or after 2007 (Table S6). While the point estimates for the e‐cigarette effects (as well as other variables) remained about the same as for the entire sample, these estimates were no longer statistically significant because of a small number of MIs among e‐cigarette users after 2007. . . .

And here’s the relevant table (from an earlier version of Bhatta and Glantz, sent to me by Rodu):

699 patients with MI’s, of whom 38 were vaping.

Table 1 of the paper shows the descriptive statistics at Wave 1 baseline; 643 (2.4%) adults reported that they had a myocardial infarction. Out of those 643 people, a weighted 10.2% were former e-cigarette users, 1.6% some-day e-cigarette users, and 1.5% every-day e-cigarette users. 1.6% + 1.5% = 3.1%, and 3.1% * 643 = 20, not 34 or 38. It seems that the discrepancy here arises from comparing weighted proportions with raw numbers, an issue that often arises with survey data and does not necessarily imply any problems with the published analysis.
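A toy sketch of how this kind of discrepancy can arise. The ten respondents and their survey weights below are invented for illustration and have nothing to do with the actual PATH weights; the only numbers taken from the discussion above are 3.1% and 643.

```python
def weighted_share(flags, weights):
    """Weighted proportion of respondents with flag == 1."""
    return sum(w for f, w in zip(flags, weights) if f) / sum(weights)

# Hypothetical respondents who reported an MI; 1 = current e-cigarette user.
is_user = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical survey weights: the users happen to carry small weights.
weights = [0.2, 0.3, 0.25, 1.5, 1.2, 1.0, 1.4, 1.1, 0.9, 1.3]

raw_share = sum(is_user) / len(is_user)
w_share = weighted_share(is_user, weights)

# The back-of-the-envelope count from the weighted percentages in the paper.
implied_count = round(0.031 * 643)

print(raw_share, w_share, implied_count)
```

If the group of interest is oversampled and so carries small weights, the weighted share can be well below the raw share, and multiplying a weighted percentage by a raw sample size will not recover the raw count.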

But Rodu’s criticism seems more serious. Bhatta and Glantz are making causal claims based on correlation between heart problems and e-cigarette use, so it does seem like it would be appropriate for them to exclude from their analysis the people who didn’t start e-cigarette use until after their heart attacks. Even had they done this, I could see concerns with any results—the confounding with cigarette smoking is the 800-pound gorilla in the room, and any attempt to adjust for this confounding will necessarily depend strongly on the model being used for this adjustment—but removing those 11 people from the analysis, that seems like a freebie.

Is it appropriate for Rodu to describe Bhatta and Glantz’s article as “bogus”? That seems a bit strong. It seems like a real article with a data issue that Rodu found, and the solution would seem to be to perform a corrected analysis removing the data from the people who had heart problems before they started vaping. This won’t make the resulting findings bulletproof but it will at least fix this one problem, and that’s something. One step at a time, right?

Episode 1

Rodu has had earlier clashes with this research group.

Last year, he sent me the following email:

An article recently published in the journal Pediatrics claimed that teen experimental smokers who were e-cigarette triers or past-30-day users at baseline were more likely to be regular smokers one year later than experimental smokers who hadn’t used e-cigs. The authors used regression analysis of a publicly available longitudinal FDA survey dataset (baseline ~2013, follow-up survey one year later). Although the authors used lifetime cigarette consumption to restrict their study to experimental smokers at baseline (LCC ranging from one puff but never a whole cigarette to 99 cigarettes), they ignored this baseline variable as a confounder in their analysis. When I reproduced their analysis and added the LCC variable, the positive results for e-cigarettes essentially disappeared, negating the authors’ core claim.

I [Rodu] called in my blog (here and here) for retraction of this study because the analysis was fatally flawed, and I published a comment on the journal’s website (here). The authors dismissed my criticism, responding with the strange explanation that LCC at baseline is a mediator rather than a confounder. The journal editors apparently believe that the authors’ response is adequate; I believe it is nonsensical.

I believe that this study uses faulty statistics to make unfounded causal claims that will be used to justify public health policies and regulatory actions.

Rodu added:

In my second blog post (here), I stated that “Chaffee et al. called our addition of the LCC information a ‘statistical trick.’” They used that term in a response appearing on the Pediatrics website from March 30 to April 23 (here, courtesy of Wayback Machine). Yesterday a completely new response appeared with the same March 30 date; “statistical trick” disappeared and “mediator” appeared (here).

I agree with Rodu that in this study you should be adjusting for lifetime cigarette consumption at baseline. How exactly to perform this adjustment is a statistical and substantive question, but I’m inclined to agree that not performing the adjustment is a mistake. So, yeah, this seems like a problem. Also, a pre-treatment exposure variable is not a mediator, and “statistical tricks” are OK by me!
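To see why leaving out a baseline confounder matters, here is a deliberately extreme toy example with made-up counts (not the FDA survey data, and not Rodu's reanalysis). Within each level of baseline lifetime cigarette consumption (LCC), e-cigarette triers and non-triers become regular smokers at the same rate; pooling over LCC manufactures a large difference, because e-cigarette triers are concentrated in the high-LCC stratum.

```python
# Hypothetical counts: (became regular smokers, group size) by baseline LCC stratum.
strata = {
    "low LCC":  {"ecig": (10, 100),  "no_ecig": (90, 900)},
    "high LCC": {"ecig": (160, 400), "no_ecig": (40, 100)},
}

def risk(events, n):
    return events / n

def risk_difference(cells):
    """Risk among e-cig triers minus risk among non-triers."""
    return risk(*cells["ecig"]) - risk(*cells["no_ecig"])

def pool(strata):
    """Collapse over strata, i.e. ignore the baseline variable."""
    pooled = {}
    for group in ("ecig", "no_ecig"):
        events = sum(s[group][0] for s in strata.values())
        n = sum(s[group][1] for s in strata.values())
        pooled[group] = (events, n)
    return pooled

within = {name: risk_difference(cells) for name, cells in strata.items()}
crude = risk_difference(pool(strata))

print(within)  # risk difference inside each LCC stratum
print(crude)   # risk difference when LCC is ignored
```

Real data will not be this clean, and a coarse or noisily measured LCC variable will only partially adjust, which is the measurement-error issue mentioned below.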

I was curious enough about this to want to dig in more—if nothing else, this seemed like a great example of measurement error in regression and the perils of partial adjustment for a confounder. It can be good to work on a live example where there is active controversy, rather than reanalyzing the Electric Company example and the LaLonde data one more time.

So I asked Rodu for the data, and shared it with some colleagues. Unfortunately we got tangled in the details—this often happens with real survey data! We contacted the authors of the paper in question to clear up some questions, and they, like Rodu, were very helpful. Everyone involved was direct and open. However, the data were still a mess and eventually we gave up trying to figure out exactly what was happening. As far as I’m concerned, this is still an open problem, and a student with some persistence should be able to get this all to work.

So, for now, I’d say that Rodu’s statistical point is valid and that the authors should redo the analysis as he suggests. Or maybe some third party can do so, if they’re willing to put in the effort.

Where there’s smoking, there’s fire

Tobacco research is a mess, and it’s been a mess forever.

On one side, you have industry-funded work. Notoriously, in past decades the cigarette industry was not just sponsoring biased studies (forking paths, file drawers, etc.); they were actively spreading disinformation, purposely polluting scientific and public discourse with the goal of delaying or reducing the impact of public awareness of the dangers of smoking, and delaying or reducing the impact of public regulation of cigarettes and smoking.

On the other side, the malign effects of smoking, and the addictive nature of nicotine, have been known for so long that anti-smoking studies are sometimes not subject to strict scrutiny. Anti-smoking researchers are the good guys, right?

There’s still a lot of debate about second-hand smoke, and I don’t really know what to think. Being trapped in a car with two heavy smokers is one thing; working in a large office space where one or two people are smoking is something much less.

There are similar controversies regarding studies of social behavior. When, a couple decades ago, cities started banning smoking in restaurants, bars, and other indoor places, there were lots of people who were saying this was a bad idea, Prohibition Doesn’t Work, etc.—but it seems that these indoor smoking bans worked fine. Lots of smokers wanted to quit and didn’t mind the inconvenience.

So, moving to these recent disputes: both sides are starting with strong positions and potential conflicts of interest. But these data questions are specific enough that they should be resolvable.

How to resolve scientific disputes?

But this gets us to the other problem with science, which is that it does not have clear mechanisms for dispute resolution. As we’ve discussed many times in this space, retraction is not scalable, twitter fights are a disaster, we can’t rely on funding agencies to save us—certainly not in this example!

I get lots of emails from people who see me as a sort of court of last resort, a trusted third party who will look at the evidence and report my conclusions without fear or favor, and that’s fine—but I’m just one person, and I make mistakes too!

One could imagine some sort of loose confederation of vetters: various people like me who’d look at the evidence in individual disputes. But is that scalable? And if it became more formal, I’d be concerned that it would be subject to the same distortions regarding the power structure. Can you imagine: a dispute-resolution committee in social psychology, under the supervision of Robert Sternberg, Susan Fiske, and the editorial board of Perspectives on Psychological Science? Fox in the goddamn chicken coop.

It may be that, right now, Pubpeer is the best thing going, and maybe it can be souped up in some way to be even more useful. I have some concern that Pubpeer can be gamed in the same way as Amazon reviews—but even a gamed Pubpeer could be better than nothing.

“Life Expectancy and Mortality Rates in the United States, 1959-2017”

A reporter pointed me to this article, Life Expectancy and Mortality Rates in the United States, 1959-2017, by Steven Woolf and Heidi Schoomaker, and asked:

Are the findings new? Can you subdivide data, like looking at small populations like middle aged people in Wyoming and have validity? Can you make valid inferences about causes and effects? And why aren’t children and older people suffering an increase in mortality?

My reply:

This link from a couple of years ago might help.

The short answers to your questions are:

1. Mortality trends vary a lot by age as well as geography, so it makes sense to look at different age groups separately.

2. Causes of deaths are much different for children, middle-aged people, and old people—so it makes sense to see different trends in different age categories.

3. Sample sizes are large enough that you can look at individual states (as you can see from the above link).

“Machine Learning Under a Modern Optimization Lens” Under a Bayesian Lens

I (Yuling) read this new book Machine Learning Under a Modern Optimization Lens (by Dimitris Bertsimas and Jack Dunn) after I grabbed it from Andrew’s desk. Apparently machine learning is now such a wide-ranging area that we have to access it through some sub-manifold so as to evade the curse of dimensionality, and for the same reason I would like to discuss this comprehensive and clearly structured book from a Bayesian perspective.

Regularization and robustness, and what it means for priors

The first part of the book is mostly focused on the interpretation of regularization and robustness (Bertsimas and Copenhaver, 2017). In a linear regression with data (X,y), we consider a small perturbation within the neighborhood \Delta \in \mathcal{U}(q,r)= \{\Delta\in \mathbb{R}^{n\times p}: \max_{\vert\vert \delta \vert\vert_{q} =1 } \vert\vert \Delta \delta \vert\vert_{r} \leq 1 \}; then the l_q regularized regression is precisely equivalent to the minimax robust problem:
\displaystyle \min_{\beta}\max_{\Delta\in \mathcal{U}(q,r)} \vert\vert y-(X+\Delta)\beta \vert\vert_{r} = \min_{\beta} \vert\vert y-X\beta \vert\vert_{r} + \vert\vert \beta \vert\vert_{q}
and such equivalence can be extended to other norms too.
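A quick numerical sanity check of this equivalence, as a sketch under assumptions: I take q = r = 2 and a perturbation set of radius 1 in the induced (spectral) norm, and all the numbers and helper names below are made up. For a fixed \beta, the triangle inequality bounds the perturbed residual norm by the residual norm plus the norm of \beta, and a rank-one perturbation aligned with the residual attains that bound.

```python
import math

def norm2(v):
    return math.sqrt(sum(a * a for a in v))

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def worst_case_perturbation(X, y, beta):
    """Rank-one Delta = -u v^T with u = r/||r|| and v = beta/||beta||.

    Its spectral norm is 1 because u and v are unit vectors.
    Assumes the residual r and beta are both nonzero.
    """
    r = [yi - fi for yi, fi in zip(y, matvec(X, beta))]
    u = [a / norm2(r) for a in r]
    v = [b / norm2(beta) for b in beta]
    return [[-ui * vj for vj in v] for ui in u]

# Hypothetical design matrix, response, and coefficient vector.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
y = [1.0, 0.5, 2.0]
beta = [0.3, -0.4]

Delta = worst_case_perturbation(X, y, beta)
X_pert = [[a + d for a, d in zip(rx, rd)] for rx, rd in zip(X, Delta)]

# Residual norm under the adversarial perturbation of the data ...
robust_loss = norm2([yi - fi for yi, fi in zip(y, matvec(X_pert, beta))])
# ... versus the regularized loss on the unperturbed data.
penalized_loss = norm2([yi - fi for yi, fi in zip(y, matvec(X, beta))]) + norm2(beta)

print(robust_loss, penalized_loss)
```

The two numbers agree for any \beta with a nonzero residual, so minimizing either side over \beta gives the same answer. This checks only the (2,2) case at one point; it is not a proof of the general statement.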

The discussion in this book is mostly useful for point estimation in a linear regression. As a Bayesian, it is natural to ask whether we could also encode the robustness constraints into the prior for a general model. For example, can we establish something like (I suppress the obvious dependence on X):
\displaystyle \min_{p^{post}} \max_{ p^*: D(p^* \vert\vert p^{sample})<\epsilon } - \int_{\tilde y} \log \int_{\theta} p(\tilde y \vert \theta ) p^{post}(\theta ) d\theta p^*(\tilde y\vert y ) d \tilde y= - \int_{\tilde y} \log \int_{\theta} p(\tilde y \vert \theta) p(\theta\vert y ) d\theta p^{sample}(\tilde y\vert y ) d \tilde y.
where p(\theta\vert y ) \propto p(\theta)p(y\vert \theta) is the Bayesian posterior distribution under a certain prior that is potentially induced by this equation, and y_{1,\dots,n} are iid samples from p^{sample}, which however differ from the future new samples \tilde y \sim p^*(\tilde y) by up to an \epsilon perturbation under some divergence D(\cdot\vert\vert\cdot).

In other words, we might wish to construct the prior in such a way that the model could still give minimax optimal prediction even if the sampling distribution is slightly corrupted.

To be clear, the equivalence between a minimax (point-)estimation and the least-favorable prior is well studied. However here the minimax is the minimax out-of-sample risk of the prediction of the data, rather than the classic risk of the parameter estimation, where the only bridge is through the likelihood.

It also reminds me of our PSIS leave-one-out (LOO) cross-validation. In particular, we know where the model behaves badly (e.g., points with large \hat k >\epsilon). It makes sense, for example, to encode these bad points into the prior adaptively
\displaystyle \log p^{prior}(\theta) \gets \log p^{prior}(\theta) + \gamma\log \sum_{i: k_{i}>\epsilon} p( y_i\vert\theta)
and retrain the model either explicitly or through importance sampling. It is of course nothing but adversarial training in deep learning.

Nevertheless, it is time to rethink what a prior is for:

  1. In terms of prior predictive checks, we implicitly ask for a large or even the largest (in empirical Bayes) marginal likelihood \int p(y\vert \theta) p(\theta) d\theta.
  2. But a prior is also just regularization, and regularization is just robustness by the previous argument. It is not because we or any other people have enough reasons to believe the regression coefficients perfectly follow a Laplace distribution that we use the Lasso; it is rather because we want our model to be more robust under some l_1 perturbation in the data. Adversarial training puts more weight on “bad” data points and therefore deliberately decreases the prior predictive power, in exchange for robustness.

Let me recall Andrew’s old blog post, What is the “true prior distribution”?:

(If) the prior for a single parameter in a model that is only being used once. For example, we’re doing an experiment to measure the speed of light in a vacuum, where the prior for the speed of light is the prior for the speed of light; there is no larger set of problems for which this is a single example. My short answer is: for a model that is only used once, there is no true prior.

Now from the robustness standpoint, we might have another, longer answer: it does not even matter whether the parameter \theta itself lives in any population. There is an optimal prior in terms of giving the appropriate amount of regularization such that prediction from the model is robust under small noise, which is precisely defined by the minimax problem (in case someone hates minimax, I wonder if the average risk in lieu of minimax is also valid).

To conclude, the robustness framework (in the data space) gives us a more interpretable way to explain how strong the regularization should be (in the parameter space), or equivalently how (weakly) informative the prior is supposed to be.

Trees, discrete optimization, and multimodality

A large share of the book is devoted to optimal classification and regression trees. The authors prove that a deep enough optimal classification tree can achieve the same prediction ability as a deep neural network (when the tree makes splits exactly according to the same network), whereas the tree has much better interpretability (evidently we could prove that a multilevel model in the same setting will achieve a prediction ability no worse than that deep net).

A first glance might suggest a computationally prohibitive expense of solving a high-dimensional discrete optimization problem, which is ultimately what a tree requires. Fortunately, the authors give a detailed introduction to the mixed-integer algorithm they use, and it is shown to be both fast and scalable, although I cannot fit a discrete tree in Stan.

Nevertheless, discreteness still has some consequences that may not be desired. For instance, when the data are perfectly separable by multiple classification trees, whatever the objective function, the optimizer can only report one tree among all plausible answers. In the ideal situation we would want to average over all the possibilities, and I suppose that can be approximated by a bootstrapped random forest, but even then we are effectively using no pooling among all leaves. For example, if an online survey has very few samples in certain groups of the population, the best a classification tree can do is to group all of them into some nearby leaves, while the leaf to which they are assigned can only be decided with large variation.

To be fair all methods have limitations and it is certainly useful to include tree-based methods in the toolbox as an alternative to more black-box deep learning models.

We love decision theories, too.

I remember that at this year’s JSM, Xiao-Li Meng made a joke that even though we statisticians have to struggle to define our territory in terms of AI and ML, it seems even more unfair for operations research, for it is OR that created all the optimization tools upon which modern deep learning relies, but how many outsiders would directly link AI to OR?

The authors of this book give an argument in chapter 13: Although OR pays less attention to data collection and prediction compared with ML, OR is predominantly focused on the process of making optimal decisions.

Of course we (Bayesian statisticians) love decision theories, too. Using the notation in this book, given a loss function c, a space for decisions z\in C, and observed data (x, y), we want to minimize the expected loss:
\displaystyle z^*(x) = \arg\min_{z \in C} \mathrm{E}[c(z, y)|x]

From a (parametric) Bayesian perspective, such a problem is easy to solve: given a parametric model p(y|x, \theta), we have the explicit form \mathrm{E}[c(z, y)|x_0] = \int \int c(z, y) p(y| x_0, \theta) p(\theta| y_{1:n}, x_{1:n}) d y d \theta that enables a straightforward minimization over z (possibly through stochastic approximation since we can draw \theta from the posterior directly).
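As a minimal sketch of that minimization: once we have draws of y from the posterior predictive distribution at x_0, the expected loss of any candidate decision z is just an average over the draws, and we can pick the candidate with the smallest average. The draws, the candidate grid, and the function names below are all hypothetical; in practice the draws would come from a fitted model.

```python
def expected_loss(z, y_draws, loss):
    """Monte Carlo estimate of E[c(z, y) | x_0] from posterior predictive draws."""
    return sum(loss(z, y) for y in y_draws) / len(y_draws)

def best_decision(candidates, y_draws, loss):
    """Pick the candidate decision with the smallest estimated expected loss."""
    return min(candidates, key=lambda z: expected_loss(z, y_draws, loss))

# Hypothetical posterior predictive draws of y at x_0.
y_draws = [1.0, 2.0, 3.0, 6.0]

# Candidate decisions on a grid from 0.0 to 8.0.
grid = [i / 10 for i in range(0, 81)]

def squared_loss(z, y):
    return (z - y) ** 2

z_star = best_decision(grid, y_draws, squared_loss)
print(z_star)
```

Under squared loss the chosen decision is the mean of the draws, as it should be; swapping in a different loss function changes the decision without changing the model.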

The authors essentially consider a non-parametric model on Y\vert X in a reproducing kernel Hilbert space (RKHS). That is, we consider a kernel K(\cdot,\cdot) on covariates X and rewrite the expected risk as a weighted average of sample risk
\displaystyle \mathrm{E}[c(z, y)|x_0] \approx K(x_0, X) K(X, X)^{-1} c(z, Y).
And not surprisingly we can also construct the kernel through trees.

Indeed we can rewrite any multilevel model by defining an equivalent kernel K(\cdot,\cdot). So the kernel representation above amounts to a multilevel model with fixed group-level variance \tau (often a fixed hyperparameter in the kernel).

And we love causal inference, too

Chapter 14 of the book is on prescriptive trees. The problem is motivated by observational studies with multiple treatments, and the goal is to assign an optimal treatment for a new patient. Given observed outcome y (say the blood sugar level), all covariates X, (potentially nonrandom) treatments Z, what remains to be optimized is the policy for future patients with covariate x. We denote the optimal assignment policy z^*=\gamma(x) as a function of x.

It is a causal inference problem. If we have a parametric model, we (Bayesian statisticians) would directly write it down as
\displaystyle z^*=\gamma(x_0) = \arg\min_z \mathrm{E}[y|x_0, z]
Under the usual unconfoundedness conditions we have
\mathrm{E}[y|x_0, z]= \int y p(y| x_0, z) d y, and p(y| x, z) can be learned from the sampling distribution in observational studies by weighting or matching or regression.

Returning to this book, the prescriptive trees effectively model p(y| x, z) by substratification: y\vert x_0, z_0 is the empirical distribution of all observed outcomes with treatment z=z_0 in the leaf node where x_0 lives.
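A bare-bones sketch of the substratification step, assuming the tree has already been fit so that each observation carries a leaf label. The records, leaf names, and treatment names are invented, and this is only my reading of the idea, not the authors' algorithm: within each leaf, average the observed outcomes by treatment and prescribe the treatment with the lowest average.

```python
from collections import defaultdict

def leaf_treatment_means(records):
    """Mean observed outcome for each (leaf, treatment) cell."""
    sums = defaultdict(lambda: [0.0, 0])
    for rec in records:
        cell = sums[(rec["leaf"], rec["z"])]
        cell[0] += rec["y"]
        cell[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def prescribe(leaf, means):
    """Treatment with the lowest mean outcome among those observed in this leaf."""
    options = {z: m for (l, z), m in means.items() if l == leaf}
    return min(options, key=options.get)

# Hypothetical observational data: y is a blood sugar level (lower is better).
records = [
    {"leaf": "A", "z": "drug1", "y": 140.0},
    {"leaf": "A", "z": "drug1", "y": 150.0},
    {"leaf": "A", "z": "drug2", "y": 120.0},
    {"leaf": "A", "z": "drug2", "y": 130.0},
    {"leaf": "B", "z": "drug1", "y": 110.0},
    {"leaf": "B", "z": "drug1", "y": 100.0},
    {"leaf": "B", "z": "drug2", "y": 135.0},
]

means = leaf_treatment_means(records)
print(prescribe("A", means), prescribe("B", means))
```

Leaf B has a single observation on drug2, which illustrates the earlier no-pooling concern: with sparse cells, the within-leaf averages that drive the prescription are very noisy.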

Making decisions in the space of all trees

The prescriptive trees take one step further by using the following loss function when training the regression tree:
\displaystyle \mathrm{E}[y|x, \gamma(x) ] + \sigma \sum_{i=1}^n (y_i - \hat y_i)^2
where \hat y_i is the prediction of y_i using the same tree. (As a footnote, it is always better to use leave-one-out error, but I suppose it is hard to cross-validate a tree due to its discrete nature.)

The objective function is interpreted as the weighted average of “making an accurate prediction” (\sigma =\infty) and “making the optimal decision” (\sigma =0). Evidently it is not the same as first fitting a tree that only minimizes prediction error and then doing causal inference using the post-inference tree and substratification.

I have conflicting feelings towards this approach. On one hand, it addresses the model uncertainty directly. A parametric Bayesian might always tend to treat the model as it is and ignore all other uncertainty in the forking paths. It is therefore recommended to work in a more general model space; the trees do have merits in intuitively manifesting the model complexity.

On the other hand, it seems to me that a more principled way to deal with model uncertainty is to consider
\displaystyle \mathrm{E}[y|x, \gamma(x) ]= \mathrm{E}[\mathrm{E}[y|x, \gamma(x), M ]]
where M is any given tree.

Further under normal approximation, the expectation can be expanded to be
\displaystyle \mathrm{E}[\mathrm{E}[y|x, \gamma(x), M ]]= \sum_{k=1}^{\infty} \mathrm{E}[y|x, \gamma(x), M_k] \exp( -\sum_{i=1}^n (y_{i} - \hat y_{ik})^2 ) / \sum_{k=1}^{\infty} \exp( -\sum_{i=1}^n (y_{i} - \hat y_{ik})^2 ),
as long as I am allowed to abuse the notation on infinite sums and self-normalization.

The linear mixture (first expression in this section) can then be viewed as an approximation to this full-Bayes objective: it replaces the infinite sum by its single largest term, which is nearly accurate if the weights of all other trees vanish quickly.
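A small numerical sketch of this approximation, with a finite set of three hypothetical trees standing in for the infinite sum. I am assuming weights proportional to exp(-SSE), so that lower prediction error means higher weight, which is my reading of the normal approximation; the sums of squared errors and the per-tree expected outcomes are made up.

```python
import math

def model_weights(sse_per_tree):
    """Self-normalized weights proportional to exp(-SSE), computed stably."""
    best = min(sse_per_tree)
    unnorm = [math.exp(-(s - best)) for s in sse_per_tree]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical in-sample sums of squared errors for three candidate trees.
sse = [2.0, 9.0, 15.0]
# Hypothetical E[y | x, gamma(x), M_k] under each tree.
expected_outcome = [1.0, 1.4, 0.2]

weights = model_weights(sse)
averaged = sum(w * e for w, e in zip(weights, expected_outcome))
top_only = expected_outcome[sse.index(min(sse))]

print(weights)
print(averaged, top_only)
```

With gaps this large in the exponent, the best-fitting tree takes nearly all the weight and the model-averaged expectation is almost identical to the single-tree value. When several trees fit about equally well, the two can differ substantially.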

To be clear, here different trees are only compared in the context of prediction. There is an analogy in terms of causal assumptions, where different causal models imply different conditional ignorability assumptions, which cannot be learned from the data in the first place.

More generally, there remains a lot to be done in the field of decision theory in the presence of model uncertainty, and everything becomes even more complicated with the extra settings of an infinite model space, causality, and the looming concern that all models are still wrong even in this infinite model space. We have almost abandoned the posterior probability of models in model averaging, but how about decision making with a series of working models?

Overall, I enjoyed reading this book. It provides various novel insights for rethinking modern machine learning and a rich set of practical tools to fit models in the real world. Most constructively, it is a perfect book for inspiring many interesting open problems to work on after stacking multiple lenses.

Why “bigger sample size” is not usually where it’s at.

Aidan O’Gara writes:

I realized when reading your JAMA chocolate study post that I don’t understand a very fundamental claim made by people who want better social science: Why do we need bigger sample sizes?

The p-value is always going to be 0.05, so a sample of 10 people is going to turn up a false positive for purely random reasons exactly as often as a sample of 1000: precisely 5% of the time. That’ll be increased if you have forking paths, bad experiment design, etc., but is there any reason to believe that those factors weigh more heavily in a small sample?

Let’s take the JAMA chocolate example. If this study is purely capturing noise, you’d need to run 20 experiments to get a statistically significant result like this. If they studied a million people, they’d also need only 20 experiments to get a false positive from noise alone. Let’s say they’re capturing not only noise but bad/malicious statistical design–degrees of freedom, manipulating the experiment. Is this any less common in studies of a million people? Why?

“We need bigger sample sizes” is something I’ve heard a million times, but I just realized I don’t get it. Thanks in advance for the explanation.

My reply:

Sure, more data always helps, but I don’t typically argue that larger sample size is the most important thing. What I like to say is that we need better measurement.

If you’re measuring the wrong thing (as in those studies of ovulation and clothing and voting that got the dates of peak fertility wrong) or if your measurements are super noisy, then a large sample size won’t really help you: Increasing N will reduce variance but it won’t do anything about bias.
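The arithmetic behind that last sentence, as a sketch: if each measurement has a systematic bias and independent noise with standard deviation sd, the root mean squared error of the sample mean is sqrt(bias^2 + sd^2/N). The bias and sd values below are arbitrary, chosen only for illustration.

```python
import math

def rmse_of_mean(n, bias, sd):
    """RMSE of the mean of n measurements with systematic bias and noise sd."""
    return math.sqrt(bias ** 2 + sd ** 2 / n)

# Hypothetical measurement with bias 0.5 and noise sd 2.0.
for n in (10, 1_000, 1_000_000):
    print(n, rmse_of_mean(n, bias=0.5, sd=2.0))
```

The error shrinks as N grows but levels off at the size of the bias, no matter how many people you study. Better measurement attacks the term that a bigger sample cannot touch.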

Regarding your question above: First, I doubt the study “is purely capturing noise.” There’s lots of variation out there, coming from many sources. My concern is not that these researchers are studying pure noise; rather, my concern is that the effects they’re studying are highly variable and context-dependent, and all this variation will make it hard to find any consistent patterns.

Also, in statistics we often talk about estimating the average treatment effect, but if the treatment effect depends on context, then there’s no universally-defined average to be estimated.

Finally, you write, “they’d also need only 20 experiments to get a false positive from noise alone.” Sure, but I don’t think anybody does 20 experiments and just publishes one of these. What you should really do is publish all 20 experiments, or, better still, analyze the data from all 20 together. But, again, if your measurements are too variable, it won’t matter anyway.