I’m reposting this one from 2014 because I think it could be useful to lots of people.

Also this advice on writing research articles, from 2009.

Under the heading, “Latino approval of Donald Trump,” Tyler Cowen writes:

From a recent NPR/PBS poll:

African-American approval: 11%

White approval: 40%

Latino approval: 50%

He gets 136 comments, many of which reveal a stunning ignorance of polling. For example, several commenters seem to think that a poll sponsored by National Public Radio is a poll of NPR listeners.

Should NPR waste its money on commissioning generic polls? I don’t think so. There are a zillion polls out there, and NPR—or just about any news organization—has, I believe, better things to do than create artificial headlines by asking the same damn polling question that everyone else does.

In any case, it’s a poll of American adults, not a poll of NPR listeners.

The other big mistake that many commenters made was to take the poll result at face value. Cowen did that too, by reporting the results as “Latino approval of Donald Trump,” rather than “One poll finds . . .”

A few things are going on here. In no particular order:

1. Margin of error. A national poll of 1000 people will have about 150 Latinos. The standard error of a simple proportion is then 0.5/sqrt(150) = 0.04. So, just to start off, that 50% could easily be anywhere between 42% and 58%. And, of course, given what else we know, including other polls, 42% is much more likely than 58%.
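As a rough check on the arithmetic in point 1, here is a minimal Python sketch. It assumes a simple random sample and a proportion near 0.5, and the function name `proportion_se` is just an illustrative label, not anything from the poll's documentation.

```python
import math

def proportion_se(n, p=0.5):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

n_latino = 150  # rough number of Latino respondents in a national poll of 1000
se = proportion_se(n_latino)

# Plus or minus two standard errors around the reported 50%.
low, high = 0.50 - 2 * se, 0.50 + 2 * se
print(f"se = {se:.3f}, interval = ({low:.2f}, {high:.2f})")
```

The interval is wide enough that the headline number on its own says very little, which is the point of the paragraph above.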

That said, even 42% is somewhat striking in that you might expect a minority group to support the Republican president less than other groups. One explanation here is that presidential approval is highly colored by partisanship, and minorities tend to be less partisan than whites—we see this in many ways in lots of data.

2. Selection. Go to the linked page and you’ll see dozens of numbers. Look at enough numbers and you’ll start to focus on noise. The garden of forking paths—it’s not just about p-values. Further selection is that it’s my impression that Cowen enjoys posting news that will fire up his conservative readers and annoy his liberal readers—and sometimes he’ll mix it up and go the other way.

3. The big picture. Trump’s approval is around 40%. It will be higher for some groups and lower for others. If your goal is to go through a poll and find some good news for Trump, you can do so, but it doesn’t alter the big picture.

I searched a bit on the web and found this disclaimer from the PBS News Hour:

President Trump tweeted about a PBS NewsHour/NPR/Marist poll result on Tuesday, highlighting that his approval rating among Latinos rose to 50 percent. . . .

However, the president overlooked the core finding of the poll, which showed that 57 percent of registered voters said they would definitely vote against Trump in 2020, compared to just 30 percent who said they would back the president. The president’s assertion that the poll shows an increase in support from Latino voters also requires context. . . .

But only 153 Latino Americans were interviewed for the poll. The small sample size of Latino respondents had a “wide” margin of error of 9.9 percentage points . . . [Computing two standard errors using the usual formula, 2*sqrt(0.5*0.5/153) gives 0.081, or 8.1 percentage points, so I assume that the margin of error of 9.9 percentage points includes a correction for the survey’s design effect, adjusting for it not being a simple random sample of the target population. — AG.] . . .

[Also] The interviews in the most recent PBS NewsHour/NPR/Marist poll were conducted only in English. . . .
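The bracketed calculation above can be pushed one step further to see what design effect the reported margin would imply. This is only a sketch: it assumes the pollster's 9.9 points is a two-standard-error margin inflated by the square root of the design effect, which the quoted article does not actually state.

```python
import math

n = 153               # Latino respondents, as reported by the PBS NewsHour
reported_moe = 0.099  # margin of error reported for that subgroup

# Two standard errors under simple random sampling, as in the bracketed note.
srs_moe = 2 * math.sqrt(0.5 * 0.5 / n)

# Assumption: reported margin = srs margin * sqrt(design effect).
implied_deff = (reported_moe / srs_moe) ** 2
implied_n_eff = n / implied_deff

print(f"srs_moe = {srs_moe:.3f}")
print(f"implied design effect = {implied_deff:.2f}")
print(f"implied effective sample size = {implied_n_eff:.0f}")
```

Under that assumption the 153 interviews carry roughly the information of about 100 simple-random-sample interviews.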

They also report on other polls:

According to a Pew Research Center’s survey from October, only 22 percent of Latinos said they approved of Trump’s job as president, while 69 percent said they disapproved. . . . Pew has previously explained how language barriers and cultural differences could affect Latinos’ responses in surveys.

That pretty much covers it. But then the question arises: Why did NPR and PBS commission this poll in the first place? There are only a few zillion polls every month on presidential approval. What’s the point of another poll? Well, for one thing, this gets your news organization talked about. They got a tweet from the president, they’re getting blogged about, etc. But is that really what you want to be doing as a news organization: putting out sloppy numbers, getting lots of publicity, then having to laboriously correct the record? Maybe just report some news instead. There’s enough polling out there already.

We went to Peter Luger then took the train back . . . Walking through Williamsburg, everyone looked like a Daniel Clowes character.

Here’s what’s scheduled for the next six months:

This is a great example for a statistics class, or a class on survey sampling, or a political science class

How to read (in quantitative social science). And by implication, how to write.

Causal inference with time-varying exposures

Reproducibility problems in the natural sciences

If you want a vision of the future, imagine a computer, calculating the number of angels who can dance on the head of a pin—forever.

Collinearity in Bayesian models

Inshallah

“Did Austerity Cause Brexit?”

“Widely cited study of fake news retracted by researchers”

Causal inference using repeated cross sections

Calibrating patterns in structured data: No easy answers here.

Healthier kids: Using Stan to get more information out of pediatric respiratory data

Gigerenzer: “The Bias Bias in Behavioral Economics,” including discussion of political implications

Endless citations to already-retracted articles

Update on keeping Mechanical Turk responses trustworthy

Blindfold play and sleepless nights

What does it take to repeat them?

“The most mysterious star in the galaxy”

Gendered languages and women’s workforce participation rates

What’s published in the journal isn’t what the researchers actually did.

Alison Mattek on physics and psychology, philosophy, models, explanations, and formalization

Votes vs. $

“Developing Digital Privacy: Children’s Moral Judgments Concerning Mobile GPS Devices”

Plaig!

From deviance, DIC, AIC, etc., to leave-one-out cross-validation

Of book reviews and selection bias

Swimming upstream? Monitoring escaped statistical inferences in wild populations.

Concerned about demand effects in psychology experiments? Incorporate them into the design.

Just forget the Type 1 error thing.

“This is a case where frequentist methods are simple and mostly work well, and the Bayesian analogs look unpleasant, requiring inference on lots of nuisance parameters that frequentists can bypass.”

The garden of forking paths

This one goes in the Zombies category, for sure.

Allowing intercepts and slopes to vary in a logistic regression: how does this change the ROC curve?

A weird new form of email scam

Holes in Bayesian Philosophy: My talk for the philosophy of statistics conference this Wed.

Hey, look! The R graph gallery is back.

The intellectual explosion that didn’t happen

Deterministic thinking meets the fallacy of the one-sided bet

Are GWAS studies of IQ/educational attainment problematic?

Attorney General of the United States less racist than Nobel prize winning biologist

Here are some examples of real-world statistical analyses that don’t use p-values and significance testing.

They added a hierarchical structure to their model and their parameter estimate changed a lot: How to think about this?

“Beyond ‘Treatment Versus Control’: How Bayesian Analysis Makes Factorial Experiments Feasible in Education Research”

As always, I think the best solution is not for researchers to just report on some preregistered claim, but rather for them to display the entire multiverse of possible relevant results.

Replication police methodological terrorism stasi nudge shoot the messenger wtf

Separated at birth?

What can be learned from this study?

“I feel like the really solid information therein comes from non or negative correlations”

“The issue of how the report the statistics is one that we thought about deeply, and I am quite sure we reported them correctly.”

Is there any scientific evidence that humans don’t like uncertainty?

You should (usually) log transform your positive data

Coney Island

Yes, you can include prior information on quantities of interest, not just on parameters in your model

More on why Cass Sunstein should be thanking, not smearing, people who ask for replications

The importance of talking about the importance of measurement: It depends on the subfield

More on the piranha problem, the butterfly effect, unintended consequences, and the push-a-button, take-a-pill model of science

“No, cardiac arrests are not more common on Monday mornings, study finds”

Beyond Power Calculations: Some questions, some answers

When people make up victim stories

“I am a writer for our school newspaper, the BHS Blueprint, and I am writing an article about our school’s new growth mindset initiative.”

Is the effect they found too large to believe? (the effect of breakfast micronutrients on social decisions)

“It just happens to be in the nature of knowledge that it cannot be conserved if it does not grow.”

He says it again, but more vividly.

The Wife

More golf putting, leading to a discussion of how prior information can be important for an out-of-sample prediction or causal inference problem, even if it’s not needed to fit existing data

A world of Wansinks in medical research: “So I guess what I’m trying to get at is I wonder how common it is for clinicians to rely on med students to do their data analysis for them, and how often this work then gets published”

It’s not just p=0.048 vs. p=0.052

Why didn’t they say they were sorry when it turned out they’d messed up?

Here’s why you need to bring a rubber band to every class you teach, every time.

Here’s a puzzle: Why did the U.S. doctor tell me to drink more wine and the French doctor tell me to drink less?

Was Thomas Kuhn evil? I don’t really care.

Exchange with Deborah Mayo on abandoning statistical significance

My math is rusty

Deterministic thinking (“dichotomania”): a problem in how we think, not just in how we act

I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job. Also, moving beyond naive falsificationism

“Boston Globe Columnist Suspended During Investigation Of Marathon Bombing Stories That Don’t Add Up”

Question on multilevel modeling reminds me that we need a good modeling workflow (building up your model by including varying intercepts, slopes, etc.) and a good computing workflow

Harking, Sharking, Tharking

“Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science”

Laplace Calling

Challenge of A/B testing in the presence of network and spillover effects

The State of the Art

Bank Shot

They misreport their experiments and don’t fess up when they’ve been caught.

“Persistent metabolic youth in the aging female brain”??

Junk science and fake news: Similarities and differences

We should all routinely criticize our own work.

Controversies in the theory of measurement in mathematical psychology

Here’s a supercool controversy for ya

“Less Wow and More How in Social Psychology”

“Troubling Trends in Machine Learning Scholarship”

P-value of 10^-74 disappears

“What is the conclusion of a clinical trial where p=0.6?”

Kaiser Fung suggests “20 paper ideas pre-approved for prestigious journals”

More on that 4/20 road rage dude

Are statistical nitpickers (e.g., Kaiser Fung and me) getting in the way of progress or even serving the forces of evil?

BizStat: Modeling performance indicators for deals

Glenn Shafer: “The Language of Betting as a Strategy for Statistical and Scientific Communication”

Automation and judgment, from the rational animal to the irrational machine

What’s the p-value good for: I answer some questions.

On the term “self-appointed” . . .

When presenting a new method, talk about its failure modes.

Poetry corner

How to think scientifically about scientists’ proposals for fixing science

The status-reversal heuristic

“Here’s an interesting story right in your sweet spot”

The real lesson learned from those academic hoaxes: a key part of getting a paper published in a scholarly journal is to be able to follow the conventions of the journal. And some people happen to be good at that, irrespective of the content of the papers being submitted.

His data came out in the opposite direction of his hypothesis. How to report this in the publication?

He’s looking for a Bayesian book

My best thoughts on priors

Bayesian analysis of data collected sequentially: it’s easy, just include as predictors in the model any variables that go into the stopping rule.

“Bullshitters. Who Are They and What Do We Know about Their Lives?”

“Causal Processes in Psychology Are Heterogeneous”

“Any research object with a strong and obvious series of inconsistencies may be deemed too inaccurate to trust, irrespective of their source. In other words, the description of inconsistency makes no presumption about the source of that inconsistency.”

Many Ways to Lasso

Afternoon decision fatigue

What happens to your metabolism when you eat ultra-processed foods?

Software for multilevel conjoint analysis in marketing

I’m no expert

“Everybody wants to be Jared Diamond”

The dropout rate in his survey is over 60%. What should he do? I suggest MRP.

How to teach sensible elementary statistics to lower-division undergraduates?

“The paper has been blind peer-reviewed and published in a highly reputable journal, which is the gold standard in scientific corroboration. Thus, all protocol was followed to the letter and the work is officially supported.”

The incentives are all wrong (causal inference edition)

“Men Appear Twice as Often as Women in News Photos on Facebook”

“Non-disclosure is not just an unfortunate, but unfixable, accident. A methodology can be disclosed at any time.”

Battle for the headline: Hype and the effect of statistical significance on the ability of journalists to engage in critical thinking

Australian polls failed. They didn’t do Mister P.

When Prediction Markets Fail

Hey! Participants in survey experiments aren’t paying attention.

To do: Construct a build-your-own-relevant-statistics-class kit.

Consider replication as an honor, not an attack.

Is “abandon statistical significance” like organically fed, free-range chicken?

Should we mind if authorship is falsified?

Why do a within-person rather than a between-person experiment?

In research as in negotiation: Be willing to walk away, don’t paint yourself into a corner, leave no hostages to fortune

Stan saves Australians $20 billion

What’s the evidence on the effectiveness of psychotherapy?

What does a “statistically significant difference in mortality rates” mean when you’re trying to decide where to send your kid for heart surgery?

What happens when frauds are outed because of whistleblowing?

How much granularity do you need in your Mister P?

This awesome Pubpeer thread is about 80 times better than the original paper

I’m still struggling to understand hypothesis testing . . . leading to a more general discussion of the role of assumptions in statistics

Structural equation modeling and Stan

Why “bigger sample size” is not usually where it’s at.

Public health researchers: “Death by despair” is a thing, but not the biggest thing

No, Bayes does not like Mayor Pete. (Pitfalls of using implied betting market odds to estimate electability.)

In short, adding more animals to your experiment is fine. The problem is in using statistical significance to make decisions about what to conclude from your data.

What comes after Vixra?

Don’t believe people who say they can look at your face and tell that you’re lying.

What’s wrong with Bayes; What’s wrong with null hypothesis significance testing

What’s wrong with Bayes

What’s wrong with null hypothesis significance testing

“Pfizer had clues its blockbuster drug could prevent Alzheimer’s. Why didn’t it tell the world?”

Are you tone deaf? Find out here.

“There is this magic that our DNA enables”

How to think about “medical reversals”?

The checklist manifesto and beyond

“Deep Origins” and spatial correlations

“Inferential statistics as descriptive statistics”

Judith Rich Harris on the garden of forking paths

What happened to the hiccups?

What does it mean when they say there’s a 30% chance of rain?

Causal inference and within/between person comparisons

‘Sumps and rigor

Elon Musk and George Lucas

Horns! Have we reached a new era in skeptical science journalism? I hope so.

Causal inference, adjusting for 300 pre-treatment predictors

External vs. internal validity of causal inference from natural experiments: The example of charter school lottery studies

“What if your side wins?”

“But when we apply statistical models, do we need to care about whether a model can retrieve the relationship between variables?”

How did our advice about research ethics work out, four years later?

Fitting big multilevel regressions in Stan?

The last-mile problem in machine learning

The Role of Statistics in a Deep-Learning World

Create your own community (if you need to)

I wrote most of these awhile ago and I don’t remember what many of them are about. So I’m in as much suspense as you are.

We had an interesting discussion the other day regarding a regression discontinuity disaster.

In my post I shone a light on this fitted model:

Most of the commenters seemed to understand the concern with these graphs, that the upward slopes in the curves directly contribute to the estimated negative value at the discontinuity leading to a model that doesn’t seem to make sense, but I did get an interesting push-back that is worth discussing further. Commenter Sam wrote:

You criticize the authors for using polynomials. Here is something you yourself wrote with Guido Imbens on the topic of using polynomials in RD designs:

“We argue that estimators for causal effects based on such methods can be misleading, and we recommend researchers do not use them, and instead use estimators based on local linear or quadratic polynomials or other smooth functions.”

From p.15 of the paper:

“We implement the RDD using two approaches: the global polynomial regression and the local linear regression”

They show that their results are similar in either specification.

The commenter made the seemingly reasonable point that, since the authors actually did use the model that Guido and I recommended, and it gave the same results as what they found under the controversial model, what was my problem?

**What if?**

To put it another way, what if the authors had done the exact same analyses but reported them differently, as follows:

– Instead of presenting the piecewise quadratic model as the main result and the local linear model as a side study, they could’ve reversed the order and presented the local linear model as their main result.

– Instead of graphing the fitted discontinuity curve, which looks so bad (see graphs above), they could’ve just presented their fitted model in tabular form. After all, if the method is solid, who needs the graph?

Here’s my reply.

First, I do think the local linear model is a better choice in this example than the global piecewise quadratic. There are cases where a global model makes a lot of sense (for example in pre/post-test situations such as predicting election outcomes given previous election outcomes), but not in this case, when there’s no clear connection at all between percentage vote for a union and some complicated measures of stock prices. So, yeah, I’d say ditch the global piecewise quadratic model, don’t even include it in a robustness check unless the damn referees make you do it and you don’t feel like struggling with the journal review process.

Second, had the researchers simply fit the local linear model without the graph, *I wouldn’t have trusted their results*.

Not showing the graph doesn’t make the problem go away, it just hides the problem. It would be like turning off the oil light on your car so that there’s one less thing for you to be concerned about.

This is a point that the commenter didn’t seem to realize: The graph is not just a pleasant illustration of the fitted model, not just some sort of convention in displaying regression discontinuities. The graph is central to the modeling process.

One challenge with regression discontinuity modeling (indeed, applied statistical modeling more generally) as it is commonly practiced is that it is unregularized (with coefficients estimated using some variant of least squares) and uncontrolled (lots of researcher degrees of freedom in fitting the model). In a setting where there’s no compelling theoretical or empirical reason to trust the model, it’s *absolutely essential* to plot the fitted model against the data and see if it makes sense.

I have no idea what the data and fitted local linear model would look like, and that’s part of the problem here. (The research article in question has other problems, notably regarding data coding and exclusion, choice of outcome to study, and a lack of clarity regarding the theoretical model and its connection to the statistical model, but here we’re focusing on the particular issue of the regression being fit. These concerns do go together, though: if the data were cleaner and the theoretical structure were stronger, this could inspire more trust in a fitted statistical model.)
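To make the two specifications in this discussion concrete, here is a minimal sketch of a global piecewise quadratic fit and a local linear fit, both by ordinary least squares. The function names, the rectangular window, and the bandwidth are my assumptions for illustration; this is not the code or the data from the paper under discussion.

```python
import numpy as np

def global_quadratic_jump(x, y, cutoff):
    """Jump at the cutoff from separate quadratics fit to all data on each side."""
    z = x - cutoff
    d = (z >= 0).astype(float)  # indicator for being at or above the cutoff
    X = np.column_stack([np.ones_like(z), d, z, z**2, d * z, d * z**2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

def local_linear_jump(x, y, cutoff, bandwidth):
    """Jump at the cutoff from separate lines fit within a window around it."""
    keep = np.abs(x - cutoff) <= bandwidth  # rectangular (uniform) window
    z = x[keep] - cutoff
    d = (z >= 0).astype(float)
    X = np.column_stack([np.ones_like(z), d, z, d * z])
    coef, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    return coef[1]

# Simulated pure noise: there is no true discontinuity at the cutoff.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = rng.normal(0.0, 1.0, size=200)
print(global_quadratic_jump(x, y, 0.5), local_linear_jump(x, y, 0.5, 0.15))
```

Neither number means much until the fitted curves are drawn over the scatterplot of `x` and `y`. Both functions return a single coefficient, and a single coefficient cannot show whether the curves on either side of the cutoff are doing something implausible.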

**Taking the blame**

Examples in statistics and econometrics textbooks (my own included) are too clean. The data come in, already tidy, and then the model is fit, and it works as expected, and some strong and clear conclusion comes out. You learn research methods in this way, and you can expect this to happen in real life, with some estimate or hypothesis test lining up with some substantive question, and all the statistical modeling just being a way to make that connection. And you can acquire the attitude that the methods just simply work. In the above example, you can have the impression that if you do a local linear regression and a bunch of robustness tests, that you’ll get the right answer.

Does following the statistical rules assure you (probabilistically) that you will get the right answer? Yes—in some very simple settings such as clean random sampling and clean randomized experiments, where effects are large and the things being measured are exactly what you want to know. More generally, no. More generally, there are lots of steps connecting data, measurement, substantive theory, and statistical model, and no statistical procedure blindly applied—even with robustness checks!—will be enuf on its own. It’s necessary to directly engage with data, measurement, and substantive theory. Graphing the data and fitted model is one part of this engagement, often a necessary part.

Regression and Other Stories is almost done, and I was spending a couple hours going through it starting from page 1, cleaning up imprecise phrasings and confusing points. . . .

One thing that’s hard about writing a book is that there are *so many* places you can go wrong. A 500-page book contains something like 1000 different “things”: points, examples, questions, etc.

Just for example, we have two pages on reliability and validity in chapter 2 (measurement is important, remember?). A couple of the things I wrote didn’t feel quite right, so I changed them.

And this got me thinking: any expert who reads our book will naturally want to zoom in on the part that he or she knows the most about, to check that we got things right. But with 1000 things, we’ll be making a few mistakes: some out-and-out errors and other places where we don’t explain things clearly and leave a misleading impression. It’s a lot of pressure to not want to get anything wrong.

We have three authors (me, Jennifer, and Aki), so that helps. And we’ve sent the manuscript to various people who’ve found typos, confusing points, and the occasional mistake. So I think we’re ok. But still it’s a concern.

I’ve reviewed a zillion books but only written a few. When I review a book, I notice its problems right away (see for example here and here). I’m talking about factual and conceptual errors, here, not typos. It’s not fun to think about being on the other side, to imagine a well-intentioned reviewer reading our book, going to a topic of interest, and being disappointed that we screwed up.

*You’re an ordinary boy and that’s the way I like it – Magic Dirt*

Look. I’ll say something now, so it’s off my chest. I *hate* order statistics. I loathe them. I detest them. I wish them nothing but ill and strife. They are just awful. And I’ve spent the last god only knows how long buried up to my neck in them, like Jennifer Connelly forced into the fetid pool at the end of Phenomena.

It would be reasonable to ask why I suddenly have opinions about order statistics. And the answer is weird. It’s because of Pareto Smoothed Importance Sampling (aka PSIS aka the technical layer that makes the loo package work).

The original PSIS paper was written by Aki, Andrew, and Jonah. However there is a brand new sparkly version by Aki, Me, Andrew, Yuling, and Jonah that has added a pile of theory and restructured everything (edit by Aki: link changed to the updated arXiv version). Feel free to read it. The rest of the blog post will walk you through some of the details.

**What is importance sampling?**

Just a quick reminder for those who don’t spend their life thinking about algorithms. The problem at hand is estimating the expectation $I_h = \mathbb{E}(h(\theta))$ for some function *h* when $\theta \sim p(\theta)$. If we could sample directly from $p$ then the Monte Carlo estimate of the expectation would be

$$\frac{1}{S}\sum_{s=1}^S h(\theta_s),$$ where $\theta_s \sim p$.

But in a lot of real life situations we have two problems with doing this directly: firstly it is usually very hard to sample from $p$. If there is a different distribution that we *can* sample from, say *g*, then we can use the following modification of the Monte Carlo estimator

$$I_h^S = \frac{1}{S}\sum_{s=1}^S h(\theta_s)\,\frac{p(\theta_s)}{g(\theta_s)},$$

where $\theta_s$ are iid draws from $g$. This is called an *importance sampling* estimator. The good news is that it always converges in probability to the true expectation. The bad news is that it is a *random variable* and it can have infinite variance.

The second problem is that often enough we only know the density $p$ up to a normalizing constant, so if $p(\theta) = f(\theta)/C$ with $C$ unknown, then the following *self-normalized importance sampler* is useful

$$\frac{\sum_{s=1}^S h(\theta_s)\, r_s}{\sum_{s=1}^S r_s},$$

where the *importance ratios* are defined as

$$r_s = \frac{f(\theta_s)}{g(\theta_s)},$$

where again $\theta_s \sim g$. This will converge to the correct answer as long as $\mathbb{E}(r_s) < \infty$. For the rest of this post I am going to completely ignore self-normalized importance samplers, but everything I’m talking about still holds for them.
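For anyone who prefers code to formulas, here is a minimal sketch of both estimators. The target (a standard normal), the proposal (a normal with standard deviation 2), and the function names are all assumptions chosen so that the true answer is known; none of this comes from the paper.

```python
import numpy as np

def normal_pdf(theta, sd):
    """Density of a zero-mean normal with standard deviation sd."""
    return np.exp(-0.5 * (theta / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def importance_estimates(h, S, rng, proposal_sd=2.0):
    """Plain and self-normalized importance sampling estimates of E_p[h]."""
    theta = rng.normal(0.0, proposal_sd, size=S)                  # draws from g
    r = normal_pdf(theta, 1.0) / normal_pdf(theta, proposal_sd)   # ratios p/g
    plain = np.mean(h(theta) * r)
    self_normalized = np.sum(h(theta) * r) / np.sum(r)
    return plain, self_normalized

rng = np.random.default_rng(1)
plain, self_norm = importance_estimates(lambda t: t**2, 100_000, rng)
print(plain, self_norm)  # both should land near E(theta^2) = 1
```

Here the proposal is wider than the target, so the ratios are bounded and everything behaves. Narrowing the proposal (say `proposal_sd=0.7`) gives ratios with infinite variance, which is where the trouble discussed below starts.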

**So does importance sampling actually work?**

Well god I do hope so because it is used *a lot*. But there’s a lot of stuff to unpack before you can declare something “works”. (That is a lie, of course, all kinds of people are willing to pick a single criterion and, based on that occurring, declaring that it works. And eventually that is what we will do.)

First things first, an importance sampling estimator is a sum of independent random variables. We may well be tempted to say that, by the central limit theorem, it will be asymptotically normal. And sometimes that is true, *but only if* the importance weights have finite variance. This will happen, for example, if the proposal distribution *g* has heavier tails than the target distribution *p*.

And there is a temptation to stop there. To declare that if the importance ratios have finite variance then importance sampling works. *That. Is. A. Mistake.*

Firstly, this is demonstrably untrue in moderate-to-high dimensions. It is pretty easy to construct examples where the importance ratios are bounded (and hence have finite variance) but there is no feasible number of samples that would give small variance. This is a problem as old as time: just because the central limit theorem says the error will be around $\sigma/\sqrt{S}$, that doesn’t mean that $\sigma$ won’t be an *enormous* number.

And here’s the thing: we do not know $\sigma$ and our only way to estimate it is *to use the importance sampler*. So when the importance sampler doesn’t work well, we may not be able to get a decent estimate of the error. So even if we can guarantee that the importance ratios have finite variance (which is really hard to do in most situations), we may end up being far too optimistic about the error.

Chatterjee and Diaconis recently took a quite different route to asking whether an importance sampler converges. They asked what minimum sample size is required to ensure, with high probability, that the error $|I_h^S - I_h|$ of the $S$-sample importance sampling estimate $I_h^S$ is small. They showed that you need approximately $\exp(\mathrm{KL}(p \,\|\, g))$ samples, where $\mathrm{KL}(p \,\|\, g) = \mathbb{E}_p[\log(p(\theta)/g(\theta))]$, and this number can be *large*. This quantity is also quite hard to compute (and they proposed another heuristic, but that’s not relevant here), but it is going to be important later.
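To get a feel for how quickly that sample size blows up, here is a small sketch for a case where the Kullback-Leibler divergence has a closed form. The setup is my own toy example: a d-dimensional standard normal target and a proposal that is the same normal scaled by sigma = 2 in every direction. In that setup the ratios are bounded by 2^d, so the variance is finite in every dimension.

```python
import math

def kl_std_normal_to_scaled(d, sigma):
    """KL(p || g) for p = N(0, I_d) and g = N(0, sigma^2 I_d)."""
    return d * (math.log(sigma) + 1.0 / (2.0 * sigma**2) - 0.5)

def approx_required_samples(d, sigma=2.0):
    """exp(KL(p || g)): the rough order of the sample size needed."""
    return math.exp(kl_std_normal_to_scaled(d, sigma))

for d in (1, 10, 100):
    print(d, f"{approx_required_samples(d):.3g}")
```

The proposal is perfectly sensible in one dimension and hopeless in a hundred, even though the finite-variance condition holds throughout.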

**Modifying importance ratios**

So how do we make importance sampling more robust? A good solution is to somehow modify the importance ratios to ensure they have finite variance. Ionides proposed a method called Truncated Importance Sampling (TIS) where the importance ratios $r_s$ are replaced with truncated weights $w_s = \min(r_s, \tau_S)$, for some sequence of thresholds $\tau_S \to \infty$ as $S \to \infty$. The resulting TIS estimator is

$$I_h^{\mathrm{TIS}} = \frac{1}{S}\sum_{s=1}^S w_s\, h(\theta_s).$$

A lot of real estate in Ionides’ paper is devoted to choosing a good sequence of truncations. There’s theory to suggest that it depends on the tail of the importance ratio distribution. But the suggested choice of truncation sequence is $\tau_S = C\sqrt{S}$, where *C* is the normalizing constant of *f*, which is one when using ordinary rather than self-normalized importance sampling. (For the self normalized version, Appendix B suggests taking *C* as the sample mean of the importance ratios, but the theory only works for deterministic truncations.)

This simple truncation *guarantees *that TIS is asymptotically unbiased, has finite variance that asymptotically goes to zero, and (with some caveats) is asymptotically normal.

But, as we discussed above, none of this actually guarantees that TIS will work for a certain problem. (It does work asymptotically for a vast array of problems and does a lot better than the ordinary importance sampler, but no simple truncation scheme can overcome a poorly chosen proposal distribution. And most proposal distributions in high dimensions are poorly chosen.)
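The truncation step itself is tiny. Here is a minimal sketch, assuming the ordinary (not self-normalized) case with *C* = 1 and a threshold of the square root of the sample size; the function name `tis_estimate` is mine.

```python
import numpy as np

def tis_estimate(h_values, ratios, C=1.0):
    """Truncated importance sampling estimate with threshold C * sqrt(S)."""
    h_values = np.asarray(h_values, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    S = ratios.size
    tau = C * np.sqrt(S)                 # truncation threshold
    weights = np.minimum(ratios, tau)    # cap every ratio at the threshold
    return np.mean(weights * h_values), weights

# Four draws, one of which has a wildly large importance ratio.
estimate, weights = tis_estimate([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 100.0])
print(estimate, weights)
```

With four draws the threshold is 2, so the ratio of 100 is cut down to 2 and stops dominating the average. Every ratio above the threshold is treated identically, no matter how far above it sits.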

**Enter Pareto-Smoothed Importance Sampling**

So a few years ago Aki and Andrew worked on an alternative to TIS that would make things even better. (They originally called it the “Very Good Importance Sampling”, but then Jonah joined the project and ruined the acronym.) The algorithm they came up with was called *Pareto-Smoothed Importance Sampling* (henceforth PSIS, the link is to the three author version of the paper).

They noticed that TIS basically replaces all of the large importance ratios with a single value, the truncation threshold $\tau_S$. Consistent with both Aki and Andrew’s penchant for statistical modelling, they thought they could do better than that. (Yes. It’s the Anna Kendrick version. Deal.)

PSIS is based on the idea that, while using the same value for each extreme importance ratio *works*, it would be even better to *model the distribution of extreme importance ratios!* The study of distributions of extremes of independent random variables has been an extremely important (and mostly complete) part of statistical theory. This means that we *know things*.

One of the key facts of extreme value theory is that the distribution of ratios larger than some sufficiently large threshold *u* approximately has a generalized Pareto distribution (gPd). Aki, Andrew, and Jonah’s idea was to fit a generalized Pareto distribution to the *M* largest importance ratios and replace the upper weights with appropriately chosen quantiles of the fitted distribution. (Some time later, I was very annoyed they didn’t just pick a deterministic threshold, but this works better even if it makes proving things much harder.)

They learnt a few things after extensive simulations. Firstly, this almost always does better than TIS (the one example where it doesn’t is example 1 in the revised paper). Secondly, the gPd has two parameters that need to be estimated (the third parameter is an order statistic of the sample. ewwwww). And one of those parameters is *extremely* useful!

The shape parameter (or tail parameter) of the gPd, which we call *k*, controls how many moments the distribution has. In particular, a distribution whose upper tail limits to a gPd with shape parameter *k* has at most 1/*k* finite moments. This means that if *k* < 1/2 then an importance sampler will have finite variance.
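For anyone who wants the one-line reason, here is a sketch under the assumption that the upper tail of the ratio *r* behaves like a power law with index 1/*k* (the constant *C* is just notation for this sketch):

$$
\Pr(r > x) \approx C\,x^{-1/k}
\quad\Longrightarrow\quad
\mathbb{E}\left[r^{a}\right] = \int_0^\infty a\,x^{a-1}\,\Pr(r > x)\,dx < \infty
\iff a < \frac{1}{k}.
$$

Taking *a* = 2 gives the finite-variance condition *k* < 1/2, and taking *a* = 1 says the mean only exists when *k* < 1.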

But we do not have access to the true shape parameter. We can only estimate it from a finite sample, which gives us an estimate of *k*, or, as we constantly write, “k-hat”. The k-hat value has proven to be an extremely useful diagnostic in a wide range of situations. (I mean, sometimes it feels that every other paper I write is about k-hat. I love k-hat. If I was willing to deal with voluntary pain, I would have a k-hat tattoo. I once met a guy with a nabla tattooed on his lower back, but that’s not relevant to this story.)

Aki, Andrew, and Jonah’s extensive simulations showed something that may well have been unexpected: the value of k-hat is a good proxy for the quality of PSIS. (Also TIS, but that’s not the topic). In particular, if k-hat was bigger than around 0.7 it became massively expensive to get an accurate estimate. So we can use k-hat to work out if we can trust our PSIS estimate.
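As a rough sketch of how the diagnostic gets used, the helper below turns a k-hat value into the implied number of finite moments and a verdict. The 0.5 and 0.7 cut-offs are the ones discussed in this post; the function name and the wording of the labels are invented for illustration and are not part of any package.

```python
def describe_k_hat(k_hat, threshold=0.7):
    """Return (implied number of finite moments, rough verdict) for a k-hat value."""
    moments = float("inf") if k_hat <= 0 else 1.0 / k_hat
    if k_hat < 0.5:
        verdict = "finite variance"
    elif k_hat <= threshold:
        verdict = "infinite variance, but still usable"
    else:
        verdict = "unreliable"
    return moments, verdict

for value in (0.3, 0.6, 0.9):
    print(value, describe_k_hat(value))
```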

PSIS ended up as the engine driving the loo package in R, which last time I checked had around 350k downloads from the RStudio CRAN mirror. It works for high-dimensional problems and can automatically assess the quality of an importance sampler proposal for a given realization of the importance weights.

So PSIS is robust, reliable, useful, has R and Python packages, and the paper was full of detailed computational experiments that showed that it was robust, reliable, and useful even for high dimensional problems. What could possibly go wrong?

**What possibly went wrong**

Reviewers.

**It works, but where is the theory?**

I wasn’t an author so it would be a bit weird for me to do a postmortem on the reviews of someone else’s paper. But one of the big complaints was that Aki, Andrew, and Jonah had not shown that PSIS was asymptotically unbiased, had finite vanishing variance, or that it was asymptotically normal.

(Various other changes of emphasis or focus in the revised version are possibly also related to reviewer comments from the previous round, but also to just having more time.)

These things turn out to be tricky to show. So Aki, Andrew, and Jonah invited me and Yuling along for the ride.

The aim was to restructure the paper, add theory, and generally take a paper that was very good and complete and add some sparkly bullshit. So sparkly bullshit was added. Very slowly (because theory is hard and I am not good at it).

**Justifying k-hat < 0.7**

Probably my favourite addition to the paper is due to Yuling, who read the Chatterjee and Diaconis paper and noticed that we could use their lower bound on sample size to justify k-hat. The idea is that it is the tail of the importance ratios that breaks the importance sampler. So if we make the assumption that the entire distribution of the importance ratios is generalized Pareto with shape parameter *k*, we can actually compute the minimum sample size for a particular accuracy from ordinary importance sampling. This is not an accurate sample size calculation, but should be ok for an order-of-magnitude calculation.

The first thing we noticed is that, consistent with the already existing experiments, the error in importance sampling (and TIS and PSIS) increases smoothly as *k* passes 0.5 (in particular the finite-sample behaviour does not fall off a cliff the moment the variance isn’t finite). But the minimum sample size starts to increase *very* rapidly as soon as *k* gets bigger than about 0.7. This is consistent with the experiments that originally motivated the 0.7 threshold and suggests (at least to me) that there may be something fundamental going on here.

We can also use this to justify the threshold on k-hat as follows. The method Aki came up with for estimating k-hat is (approximately) Bayesian, so we can interpret k-hat as a value selected so that the data is *consistent* with *M* independent samples from a gPd with shape parameter k-hat. So a k-hat value that is bigger than 0.7 can be interpreted loosely as saying that the extreme importance ratios could have come from a distribution that has a tail that is too heavy for PSIS to work reliably.

This is what actually happens in high dimensions (we have an example that has bounded ratios and hence finite variance). With a reasonable sample size, the estimator for k-hat simply cannot tell that the distribution of extreme ratios has a large but finite variance rather than an infinite variance. And this is exactly what we want to happen! I have no idea how to formalize this intuition, but nevertheless it works.

**So order statistics**

It turned out that–even though it is quite possible that other people would not have found proving unbiasedness and finite variance hard–I found it very hard. Which is quite annoying because the proof for TIS was literally 5 lines.

What was the trouble? Aki, Andrew, and Jonah’s decision to choose the threshold as the *M*th largest importance ratio. This means that the threshold is an order statistic and hence is *not independent* of the rest of the sample. So I had to deal with that.

This meant I had to read an absolute tonne of papers about order statistics. These papers are dry and technical and were all written between about 1959 and 1995 and at some later point poorly scanned and uploaded to JSTOR. And they rarely answered the question I wanted them to. So basically I am quite annoyed with order statistics.

But the end point is that, under some conditions, PSIS is asymptotically unbiased and has finite, vanishing variance.

The conditions are a bit weird, but are usually going to be satisfied. Why are they weird? Well…

**PSIS is TIS with an adaptive threshold and bias correction**

In order to prove asymptotic properties of PSIS, I used a representation of the PSIS estimator as the sum of two terms, where the samples have been ordered so that the importance ratios are increasing and the weights in the second term are deterministic (and given in the paper). They are related to the quantile function for the gPd.

The first term is just TIS with random threshold *U* (the *M*th largest importance ratio), while the second term is an approximation to the bias. So PSIS has higher variance than TIS (because of the random truncation), but lower bias (because of the second term) and this empirically usually leads to lower mean-square error than TIS.

But that random truncation is automatically adapted to the tail behaviour of the importance ratios, which is an extremely useful feature!
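To make the two-term structure concrete, here is one way to write a decomposition that matches the description above. The notation is an assumption of this sketch and is not copied from the paper: the ratios are sorted so that $r_1 \le \dots \le r_S$, $U$ is the random threshold, $w_m$ is the *m*th smoothed weight, and self-normalization is ignored.

$$
\hat{I}_{\mathrm{PSIS}}
= \underbrace{\frac{1}{S}\sum_{s=1}^{S} \min\{r_s, U\}\, h(\theta_s)}_{\text{TIS with threshold } U}
\;+\;
\underbrace{\frac{1}{S}\sum_{m=1}^{M} \tilde{w}_m\, h(\theta_{S-M+m})}_{\text{bias correction}},
\qquad \tilde{w}_m = w_m - U .
$$

The identity is just bookkeeping: the first sum caps each of the *M* largest ratios at $U$, and the second sum adds back the amount by which each smoothed weight sits above $U$.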

This representation also gives hints as to where the ugly conditions come from. Firstly, anything that is adaptive is much harder to prove things about than a non-adaptive method, and the technical conditions that we need to be able to adapt our non-adaptive proof techniques are often quite esoteric. The idea of the proof is to show that, conditional on the threshold taking the value *U*, all of the relevant quantities go to zero (or are finite) with some explicit dependence on *U*. The proof of this is very similar to the TIS proof (and would be exactly the same if the second term wasn’t there).

Then we need to let *U* vary and hope it doesn’t break anything. The technical conditions can be split into the ones needed to ensure the threshold behaves itself as *S* gets big; the ones needed to ensure that *h* doesn’t get too big when the importance ratios are large; and the ones that control the last term.

Going in reverse order, to ensure the last term is well behaved we need that *h* is square-integrable with respect to the proposal *g* in addition to the standard assumption that it’s square-integrable with respect to the target *p*.

We need to put growth conditions on *h* because we are only modifying the ratios, which does not help if *h* is also enormous out in the tails. These conditions are actually very easy to satisfy for most problems I can think of, but almost certainly there’s someone out there with an *h* that grows super-exponentially just waiting to break PSIS.

The final conditions are just annoying. They are impossible to verify in practice, but there is a 70-year-long literature that coos reassuring phrases like “this almost always holds” into our ears. These conditions are strongly related to the conditions needed to estimate *k* correctly (using something like the Hill estimator). My guess is that these conditions are not vacuous, but are relatively unimportant for finite samples, where the value of k-hat should weed out the catastrophic cases.

**What’s the headline**

With some caveats, PSIS is asymptotically unbiased; has finite, vanishing variance; and a variant of it is asymptotically normal as long as the importance ratios have more than one finite moment. But it probably won’t be useful unless they have at least 1/0.7 = 1.43 moments.

**And now we send it back off into the world and see what happens**

A while ago we had a discussion about racism, in the context of a review of a recent book by science reporter Nicholas Wade that attributed all sorts of social changes and differences between societies to genetics. There is no point in repeating all this, but I did want to bring up here an issue that is relevant to political science, which is how do we think about racism, not as a set of policies or even as a set of political attitudes, but as a way of understanding the world.

Wikipedia refers to “scientific racism” as “the use of scientific techniques and hypotheses to support or justify the belief in racism, racial inferiority, or racial superiority, or alternatively the practice of classifying individuals of different phenotypes into discrete races,” and this image from Wikipedia captures the mix of scientific reasoning, racial classifications, and value judgment that is characteristic of that way of thinking.

As with Freudian psychiatry, Marxism and neoclassical economics, the logic of racism can explain anything; it is unfalsifiable. In his book, Wade looked at economic inequality today and ascribed it to race. The study of differences in societies is interesting and I think Wade finds it interesting too (in his book, he has some conflicting lines, at some points talking about how culture is all-important and at other places disparaging those social scientists who are interested in culture). Cramming everything (including interest rates and, in another book, ping pong) into a racial framework is not so convincing to me, for the reasons I stated in my review of his book.

Philosopher of science Karl Popper and others have criticized such theories as being nonscientific because they are non-refutable, but I prefer to think of them as frameworks for doing science. As such, Freudianism or Marxism or rational choice or racism are not theories that make falsifiable predictions but rather approaches to scientific inquiry. Taking some poetic license, one might make an analogy where these frameworks are operating systems, while scientific theories are programs. That’s why I wrote that I can’t say that Wade is wrong, just that I don’t find his stories convincing.

Just to be clear: I’m not saying that racist theories can’t be scientifically tested and falsified. For example, a race-based model could be used to make a prediction about the comparative future economic performance of different groups, and then this prediction could be evaluated. Similarly, Freudian theories can be used to make testable, falsifiable predictions. The Popperian point is that, although they can be used to make falsifiable statements, these frameworks can retroactively explain anything and thus are unfalsifiable in that larger sense.

This can be seen in many popular works of racism including the book by Wade. His model is pretty sophisticated: genes affect culture which affects behavior. But it’s one of those can-explain-any-possible-data sorts of theories. If a group does poorly, it’s either bad genes or bad governance that’s unrelated to genes. If a group succeeds, it could be the good genes revealing themselves, or it could be that the genes themselves changed via adaptation. And if a society is poorly governed, this can have no effect on genes, or it can adapt people to behave in an uncivilized way (as in the Middle East and south Asia) or it can adapt people to behave in a civilized way (as in China). For example, Wade writes:

The Malay, Thai, or Indonesian populations who have prosperous Chinese populations in their midst might envy the Chinese success but are strangely unable to copy it. … If Chinese business success were purely cultural, everyone should find it easy to adopt the same methods. This is not the case because social behavior, of Chinese and others, is genetically shaped.

Wade offers no particular clue on what happened to make Thais and Malays such losers, but he makes it clear that he thinks their lack of economic success demonstrates that it’s their genes that aren’t up to a world-class challenge.

My feeling about Wade’s genetic explanations for economic outcomes is similar to my feeling about other all-encompassing supertheories: I respect the effort to push such theories as far as they can go, but I find them generally less convincing as they move farther from their home base. Similarly with economists’ models: they can make a lot of sense for prices in a fluid market, they can work OK to model negotiation, they seem like a joke when they start trying to model addiction, suicide, etc.

All-encompassing frameworks are different from scientific theories. Both are valuable — frameworks motivate theories and help us interpret scientific results — but I also think it’s important to be clear on the distinction.

**P.S.** I wrote the above note five years ago but it is now behind a paywall so that’s why I’m posting it again now.

Amy Orben and Andrew Przybylski write:

The widespread use of digital technologies by young people has spurred speculation that their regular use negatively impacts psychological well-being. Current empirical evidence supporting this idea is largely based on secondary analyses of large-scale social datasets. Though these datasets provide a valuable resource for highly powered investigations, their many variables and observations are often explored with an analytical flexibility that marks small effects as statistically significant . . . we address these methodological challenges by applying specification curve analysis (SCA) across three large-scale social datasets . . . to rigorously examine correlational evidence for the effects of digital technology on adolescents. The association we find between digital technology use and adolescent well-being is negative but small, explaining at most 0.4% of the variation in well-being. Taking the broader context of the data into account suggests that these effects are too small to warrant policy change.

They continue:

SCA is a tool for mapping the sum of theory-driven analytical decisions that could justifiably have been taken when analysing quantitative data. Researchers demarcate every possible analytical pathway and then calculate the results of each. Rather than reporting a handful of analyses in their paper, they report all results of all theoretically defensible analyses . . .

Here’s the relevant methods paper on specification curve analysis, by Uri Simonsohn, Joseph Simmons, and Leif Nelson, which seems similar to what Sara Steegen, Francis Tuerlinckx, Wolf Vanpaemel and I called the multiverse analysis.

It makes sense that a good idea will come up in different settings with some differences in details. Forking paths in methodology as well as data coding and analysis, one might say.
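For readers who haven’t seen one of these, here is a toy sketch of the mechanics: list every defensible combination of analysis choices, compute the estimate for each, and look at all of them in order. Everything here is made up for illustration (the synthetic data, the variable names, and the three kinds of choices); it is not the authors’ code or their set of specifications.

```python
import itertools
import random
import statistics

random.seed(0)
n = 500
# Synthetic stand-in data (hypothetical): one predictor, two noisy outcome measures.
tech_use = [random.gauss(0, 1) for _ in range(n)]
data = {
    "tech_use": tech_use,
    "wellbeing_a": [-0.05 * x + random.gauss(0, 1) for x in tech_use],
    "wellbeing_b": [-0.05 * x + random.gauss(0, 1) for x in tech_use],
}

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

outcomes = ["wellbeing_a", "wellbeing_b"]   # which outcome measure to use
trim_rules = [None, 2.0, 3.0]               # exclusion rule: drop |tech_use| above this
standardize = [False, True]                 # whether to rescale the outcome

results = []
for outcome, trim, std in itertools.product(outcomes, trim_rules, standardize):
    keep = [i for i in range(n) if trim is None or abs(data["tech_use"][i]) <= trim]
    x = [data["tech_use"][i] for i in keep]
    y = [data[outcome][i] for i in keep]
    if std:
        sd = statistics.stdev(y)
        y = [v / sd for v in y]
    results.append(((outcome, trim, std), slope(x, y)))

results.sort(key=lambda item: item[1])      # the "curve": estimates from smallest to largest
for spec, estimate in results:
    print(spec, round(estimate, 3))
```

With only three binary-ish choices there are already 12 analyses; the combinatorics get out of hand quickly, which is the point of the numbers that follow.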

Anyway, here’s what Orben and Przybylski report:

Three hundred and seventy-two justifiable specifications for the YRBS, 40,966 plausible specifications for the MTF and a total of 603,979,752 defensible specifications for the MCS were identified. Although more than 600 million specifications might seem high, this number is best understood in relation to the total possible iterations of dependent (six analysis options) and independent variables (2^24 + 2^25 – 2 analysis options) and whether co-variates are included (two analysis options). . . . The number rises even higher, to 2.5 trillion specifications, for the MCS if any combination of co-variates (2^12 analysis options) is included.

Given this, and to reduce computational time, we selected 20,004 specifications for the MCS.

I love it that their multiverse was so huge they needed to drastically prune it by only including 20,000 analyses.
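The arithmetic behind the 603,979,752 figure can be checked directly, taking the independent-variable options to be 2^24 + 2^25 − 2 (the variable names below are just labels for this check):

```python
outcome_options = 6                      # dependent-variable choices
predictor_options = 2**24 + 2**25 - 2    # independent-variable choices
covariate_options = 2                    # covariates in or out

total = outcome_options * predictor_options * covariate_options
print(f"{total:,}")  # 603,979,752
```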

How did they choose this particular subset?

We included specifications of all used measures per se, and any combinations of measures found in the previous literature, and then supplemented these with other randomly selected combinations. . . . After noting all specifications, the result of every possible combination of these specifications was computed for each dataset.

I wonder if they could’ve found even more researcher degrees of freedom by considering rules for data coding and exclusion, which is what we focused on in our multiverse paper. (I’m also thinking of the article discussed the other day that excluded all but 687 out of 5342 observations.)

Ultimately I think the right way to analyze this sort of data is through a multilevel model, not a series of separate estimates and p-values.

But I do appreciate that they went to the trouble to count up 603,979,752 paths. This is important, because I think a lot of people don’t realize the weakness of many published claims based on p-values (an issue we discussed in a recent comment thread here, when Ethan wrote: “I think lots of what’s discussed on this blog and a cause of common lay errors in probability comes down to, ‘It’s tempting to believe that you can’t get all of this just by chance, but you can.'”).

This is a book review. It is by Phil Price. It is not by Andrew.

The book is Good To Go: What the athlete in all of us can learn from the strange science of recovery. By Christie Aschwanden, published by W.W. Norton and Company. The publisher offered a copy to Andrew to review, and Andrew offered it to me as this blog’s unofficial sports correspondent.

tldr: This book argues persuasively that when it comes to optimizing the recovery portion of the exercise-recover-exercise cycle, nobody knows nuthin’ and most people who claim to know sumthin’ are wrong. It’s easy to read and has some nice anecdotes. Worth reading if you have a special interest in the subject, otherwise not. Full review follows.

The book is about ‘recovery’. In the context of the book, recovery is what you do between bouts of exercise; or, if you prefer, exercise is what you do between periods of recovery. The book has great blurbs. “A tour de force of great science journalism”, writes Nate Silver (!). “…a definitive tour through a bewildering jungle of scientific and pseudoscientific claims…”, writes David Epstein. “…Aschwanden makes the mind-boggling world of sports recovery a hilarious adventure”, says Olympic gold medal skier Jessie Diggins. With blurbs like these I was expecting a lot…although once I realized Aschwanden works at FiveThirtyEight, I downweighted the Silver blurb appropriately. Even so, I expected too much: the book is fine but ultimately rather unsatisfying. It is fairly interesting and sometimes amusing, but there’s only so much any author can do with the subject given the current state of knowledge, which is this: other than getting enough sleep and eating enough calories, nobody knows for sure what helps athletes recover between events or training sessions better than just living a normal life. The book is mostly just 300 pages of elucidating and amplifying that disappointing state of knowledge.


Bob writes, to someone who is doing work on the Stan language:

The basic execution structure of Stan is in the JSS paper (by Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell) and in the reference manual. The details of autodiff are in the arXiv paper (by Bob Carpenter, Matt Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt). These are sort of background for what we’re trying to do.

If you haven’t read Maria Gorinova’s MS thesis and POPL paper (with Andrew Gordon and Charles Sutton), you should probably start there.

Radford Neal’s intro to HMC is nice, as is the one in David MacKay’s book. Michael Betancourt’s papers are the thing to read to understand HMC deeply; he just wrote another brain bender on geometric autodiff (all on arXiv). Starting with the one on hierarchical models would be good as it explains the necessity of reparameterizations.

Also I recommend our JEBS paper (with Daniel Lee and Jiqiang Guo) as it presents Stan from a user’s rather than a developer’s perspective.

And, for more general background on Bayesian data analysis, we recommend Statistical Rethinking by Richard McElreath and BDA3.

This workshop should be really interesting:

- Aggregating and analysing crowdsourced annotations for NLP. EMNLP Workshop. November 3–4, 2019. Hong Kong.

Silviu Paun and Dirk Hovy are co-organizing it. They’re very organized and know this area as well as anyone. I’m on the program committee, but won’t be able to attend.

I really like the problem of crowdsourcing. Especially for machine learning data curation. It’s a fantastic problem that admits of really nice Bayesian hierarchical models (no surprise to this blog’s audience!).

The rest of this note’s a bit more personal, but I’d very much like to see others adopting similar plans for the future for data curation and application.

**The past**

Crowdsourcing is near and dear to my heart as it’s the first serious Bayesian modeling problem I worked on. Breck Baldwin and I were working on crowdsourcing for applied natural language processing in the mid 2000s. I couldn’t quite figure out a Bayesian model for it by myself, so I asked Andrew if he could help. He invited me to the “playroom” (a salon-like meeting he used to run every week at Columbia), where he and Jennifer Hill helped me formulate a crowdsourcing model.

As Andrew likes to say, every good model was invented decades ago for psychometrics, and this one’s no different. Phil Dawid had formulated exactly the same model (without the hierarchical component) back in 1979, estimating parameters with EM (itself only published in 1977). The key idea is treating the crowdsourced data like any other noisy measurement. Once you do that, it’s just down to details.

Part of my original motivation for developing Stan was to have a robust way to fit these models. Hamiltonian Monte Carlo (HMC) only handles continuous parameters, so like in Dawid’s application of EM, I had to marginalize out the discrete parameters. This marginalization’s the key to getting these models to sample effectively. Sampling discrete parameters that can be marginalized is a mug’s game.
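To show what marginalizing out the discrete parameter looks like, here is a small Python sketch for a single item in a Dawid-and-Skene-style model: sum over the unknown true category on the log scale, and get the posterior over the true category as a by-product. The function and argument names are invented for this sketch, and the numbers in the example are made up; in a real model the prevalence and the annotator confusion matrices are parameters to be estimated.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginalize_item(labels, prevalence, confusion):
    """Marginal log likelihood of one item's labels, plus the posterior over its true category.

    labels:            dict mapping annotator index -> label that annotator gave
    prevalence[k]:     probability that the true category is k
    confusion[j][k][l]: probability annotator j says l when the truth is k
    """
    log_terms = []
    for k, pi_k in enumerate(prevalence):
        lp = math.log(pi_k)
        for j, label in labels.items():
            lp += math.log(confusion[j][k][label])
        log_terms.append(lp)
    log_lik = log_sum_exp(log_terms)   # the discrete truth has been summed out
    posterior = [math.exp(lp - log_lik) for lp in log_terms]
    return log_lik, posterior

# Two annotators with the same (made-up) accuracy, both labelling the item as category 1.
annotator = [[0.9, 0.1], [0.2, 0.8]]
log_lik, posterior = marginalize_item(
    labels={0: 1, 1: 1}, prevalence=[0.5, 0.5], confusion=[annotator, annotator]
)
print(round(math.exp(log_lik), 3), [round(p, 3) for p in posterior])
```

The sampler only ever sees `log_lik`, which is a smooth function of the continuous parameters; the posterior over the true label falls out afterwards without ever having to sample it.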

**The present**

Coming full circle, I co-authored a paper with Silviu and Dirk recently, Comparing Bayesian models of annotation, that reformulated and evaluated a bunch of these models in Stan.

*Editorial Aside*: Every field should move to journals like *TACL*. Free to publish, fully open access, and roughly one month turnaround to first decision. You have to experience journals like this in action to believe it’s possible.

**The future**

I want to see these general techniques applied to creating probabilistic corpora, to online adaptive training data (aka active learning), to joint corpus inference and model training (a la Raykar et al.’s models), and to evaluation.

**P.S. Cultural consensus theory**

I’m not the only one who recreated Dawid and Skene’s model. It’s everywhere these days.

Recently, I just discovered an entire literature dating back decades on cultural consensus theory, which uses very similar models (I’m pretty sure either Lauren Kennedy or Duco Veen pointed out the literature). The authors go more into the philosophical underpinnings of the notion of consensus driving these models (basically the underlying truth of which you are taking noisy measurements). One neat innovation in the cultural consensus theory literature is a mixture model of truth—you can assume multiple subcultures are coding the data with different standards. I’d thought of mixture models of coders (say experts, Mechanical turkers, and undergrads), but not of the truth.

In yet another small world phenomenon, right after I discovered cultural consensus theory, I saw a cello concert organized through Groupmuse by a social scientist at NYU I’d originally met through a mutual friend of Andrew’s. He introduced the cellist, Iona Batchelder, and added as an aside she was the daughter of well known social scientists. Not just any social scientists, the developers of cultural consensus theory!

As the above image from Diana Senechal illustrates, a lot can happen near a discontinuity boundary.

Here’s a more disturbing picture, which comes from a recent research article, “The Bright Side of Unionization: The Case of Stock Price Crash Risk,” by Jeong-Bon Kim, Eliza Xia Zhang, and Kai Zhong:

which I learned about from the following email:

On Jun 18, 2019, at 11:29 AM, ** wrote:

Hi Professor Gelman,

This paper is making the rounds on social media:

Look at the RDD in Figure 3 [the above two graphs]. It strikes me as pretty weak and reminds me a lot of your earlier posts on the China air pollution paper. Might be worth blogging about?

If you do, please don’t cite this email or my email address in your blog post, as I would prefer to remain anonymous.

Thank you,

**

This anonymity thing comes up pretty often—it seems that there’s a lot of fear regarding the consequences of criticizing published research.

Anyway, yeah this is bad news. The discontinuity at the boundary looks big and negative, in large part because the fitted curves have a large positive slope in that region, which in turn seems to be driven by action on the boundary of the graph which is essentially irrelevant to the causal question being asked.

It’s indeed reminiscent of this notorious example from a few years ago:

And, as before, it’s stunning not just that the researchers made this mistake—after all, statistics is hard, and we all make mistakes—but that they could put a graph like the ones above directly into their paper and not realize the problem.

This is not a case of the chef burning the steak and burying it in a thick sauce. It’s more like the chef taking the burnt slab of meat and serving it with pride—not noticing its inedibility because . . . the recipe was faithfully applied!

**What happened?**

Bertrand Russell has this great quote, “This is one of those views which are so absurd that only very learned men could possibly adopt them.” On the other hand, there’s this from George Orwell: “To see what is in front of one’s nose needs a constant struggle.”

The point is that the above graphs are obviously ridiculous—but all these researchers and journal editors didn’t see the problem. They’d been trained to think that if they followed certain statistical methods blindly, all would be fine. It’s that all-too-common attitude that causal identification plus statistical significance equals discovery and truth. Not realizing that both causal identification and statistical significance rely on lots of assumptions.

The estimates above are bad. They can either be labeled as noisy (because the discontinuity of interest is perturbed by this super-noisy curvy function) or as biased (because in the particular case of the data the curves are augmenting the discontinuity by a lot). At a technical level, these estimates give overconfident confidence intervals (see this paper with Zelizer and this one with Imbens), but you hardly need all that theory and simulation to see the problem—just look at the above graphs without any ideological lenses.

Ideology—statistical ideology—is important here, I think. Researchers have this idea that regression discontinuity gives rigorous causal inference, and that statistical significance gives effective certainty, and that the rest is commentary. These attitudes are ridiculous, but we have to recognize that they’re there.

The authors do present some caveats but these are a bit weak for my taste:

Finally, we acknowledge the limitations of the RDD and alert readers to be cautious when generalizing our inferences in different contexts. The RDD exploits the local variation in unionization generated by union elections and compares crash risk between the two distinct samples of firms with the close-win and close-loss elections. Thus, it can have strong local validity, but weak external validity. In other words, the negative impact of unionization on crash risk may be only applicable to firms with vote shares falling in the close vicinity of the threshold. It should be noted, however, that in the presence of heterogeneous treatment effect, the RDD estimate can be interpreted as a weighted average treatment effect across all individuals, where the weights are proportional to the ex ante likelihood that the realized assignment variable will be near the threshold (Lee and Lemieux 2010). We therefore reiterate the point that “it remains the case that the treatment effect estimated using a RD design is averaged over a larger population than one would have anticipated from a purely ‘cutoff’ interpretation” (Lee and Lemieux 2010, 298).

I agree that generalization is a problem, but I’m not at all convinced that what they’ve found applies even to their data. Again, a big part of their negative discontinuity estimate is coming from that steep up-sloping curve which seems like nothing more than an artifact. To say it another way: including that quadratic curve fit adds a boost to the discontinuity which then pulls it over the threshold to statistical significance. It’s a demonstration of how bias and coverage problems work together (again, see my paper with Guido for more on this).
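Here is a small simulation sketch of that mechanism. It uses pure-noise synthetic data with no discontinuity at all (not the paper’s data; only the sample size of 687 is borrowed from it), fits a polynomial separately on each side of the cutoff, and records the estimated jump. The helper names are made up for this sketch.

```python
import random
import statistics

def fit_intercept(xs, ys, degree):
    """Least-squares polynomial fit; returns the fitted value at x = 0."""
    p = degree + 1
    # Normal equations A beta = b for the basis 1, x, ..., x**degree.
    A = [[sum(x ** (i + j) for x in xs) for j in range(p)] for i in range(p)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    return beta[0]

def rd_estimate(x, y, degree):
    """Estimated jump at x = 0 from separate polynomial fits on each side."""
    left = [(a, b) for a, b in zip(x, y) if a < 0]
    right = [(a, b) for a, b in zip(x, y) if a >= 0]
    return fit_intercept(*zip(*right), degree) - fit_intercept(*zip(*left), degree)

random.seed(2019)
sds = {}
for degree in (0, 1, 2):
    estimates = []
    for _ in range(100):
        x = [random.uniform(-0.5, 0.5) for _ in range(687)]
        y = [random.gauss(0, 1) for _ in x]   # pure noise: the true jump is zero
        estimates.append(rd_estimate(x, y, degree))
    sds[degree] = statistics.stdev(estimates)
print({d: round(s, 3) for d, s in sds.items()})
```

The spread of the estimated jump grows with the degree of the fitted curve, so a quadratic on each side leaves a lot more room for noise to masquerade as a discontinuity than a simple comparison of means does.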

This is not to say that the *substantive* conclusions of the article are wrong. I have no idea. All I’m saying is that the evidence is not as strong as is claimed. And also I’m open to the possibility that the substantive truth is the opposite of what is claimed in the article. Also don’t forget that, even had the discontinuity analysis not had this problem—even if there was a clear pattern in the data that didn’t need to be pulled out by adding that upward-sloping curve—we’d still only be learning about these two particular measures that are labeled as stock price crash risk.

**How to better analyze these data?**

To start with, I’d like to see a scatterplot. According to the descriptive statistics there are 687 data points, so the above graph must be showing binned averages or something like that. Show me the data!

Next, accept that this is an observational study, comparing companies that did or did not have unions. These two groups of companies differ in many ways, one of which is the voter share in the union election. But there are other differences too. Throwing them all in a regression will not necessarily do a good job of adjusting for all these variables.

The other thing I don’t really follow are their measures of stock price crash risk. These seem like pretty convoluted definitions; there must be lots of ways to measure this, at many time scales. This is a problem with the black-box approach to causal inference, but I’m not sure how this aspect of the problem could be handled better. The trouble is that stock prices are notoriously noisy, so it’s not like you could have a direct model of unionization affecting the prices—even beyond the obvious point that unionization, or the lack thereof, will have different effects in different companies. But if you go black-box and look at some measure of stock prices as an outcome, then the results could be sensitive to how and when you look at them. These particular measurement issues are not our first concern here—as the above graphs demonstrate, the estimation procedure being used here is a disaster—but if you want to study the problem more seriously, I’m not at all clear that looking at stock prices in this way will be helpful.

**Larger lessons**

Again, I’d draw a more general lesson from this episode, and others like it, that when doing science we should be aware of our ideologies. We’ve seen so many high-profile research articles in the past few years that have had such clear and serious flaws. On one hand it’s a social failure: not enough eyes on each article, nobody noticing or pointing out the obvious problems.

But, again, I also blame the reliance on canned research methods. And I blame pseudo-rigor, the idea that some researchers have that their proposed approach is automatically correct. And, yes, I’ve seen that attitude among Bayesians too. Rigor and proof and guarantee are fine, and they all come with assumptions. If you want the rigor, you need to take on the assumptions. Can’t have one without the other.

Finally, in case there’s a question that I’m being too harsh on an unpublished paper: If the topic is important enough to talk about, it’s important enough to criticize. I’m happy to get criticisms of my papers, published and unpublished. Better to have mistakes noticed sooner rather than later. And, sure, I understand that the authors may well have followed the rules as they understood them, and it’s too bad that resulted in bad work. Kind of like if I was driving along a pleasant country road at the speed limit of 30 mph and then I turned a corner and slammed into a brick wall. It’s really not my fault, it’s whoever put up the damn 30 mph sign. But my car will still be totaled. In the above post, I’m blaming the people who put up the speed limit sign (including me, in that in our textbooks our colleagues and I aren’t always so clear on how our methods can go wrong).

**P.S.** The person who sent the email to me adds some comments on the paper:

I wonder if those weird response variables DUVOL and NCSKEW are themselves “researcher degrees of freedom”. Imagine all the other things they could have studied – stock price growth after the union vote, revenue, price/earnings ratio… these could just as plausibly be related to unionization as the particular crash risk formulas, Equations (1-3), used by the authors.

A few more suspicious aspects:

1. The functional form is purely empirical. They tried polynomials of degrees 1-4 and selected quadratic because it had the best AIC (Footnote 9).

2. Tons and tons of barely significant results, 0.01 < p < 0.05 it looks like based on the tables. You can't just blindly go with an "approved" methodology - you have to at least (1) sanity check your RDD plots, (2) check whether the fitted lines in the RDD make sense theoretically, right? There's no economic reason for those curves to look the way they do.
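To make the functional-form concern concrete, here is a minimal sketch of how one might check whether an estimated discontinuity depends on the polynomial degree. Everything in it is an assumption for illustration: the helper names `rdd_fit` and `jump_by_degree`, the single global polynomial with a jump at the cutoff, and the simulated pure-noise data. It is not the paper's specification or data.

```python
import numpy as np

def rdd_fit(x, y, degree, cutoff=0.0):
    """Least-squares fit of y on a polynomial in (x - cutoff) plus a jump at the cutoff.

    Hypothetical helper for illustration. Returns (jump_estimate, aic).
    """
    x = np.asarray(x, dtype=float) - cutoff
    y = np.asarray(y, dtype=float)
    # Columns: 1, x, ..., x^degree, then the above-cutoff indicator.
    poly = np.vander(x, degree + 1, increasing=True)
    above = (x >= 0).astype(float)
    X = np.column_stack([poly, above])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    n, k = X.shape
    # Gaussian AIC up to an additive constant; floor rss to avoid log(0).
    aic = n * np.log(max(rss, np.finfo(float).tiny) / n) + 2 * k
    return float(coef[-1]), float(aic)

def jump_by_degree(x, y, degrees=(1, 2, 3, 4), cutoff=0.0):
    """Estimated jump and AIC for each candidate polynomial degree."""
    return {d: rdd_fit(x, y, d, cutoff) for d in degrees}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 687 matches the sample size in the descriptive statistics; the data are simulated.
    x = rng.uniform(-0.5, 0.5, size=687)
    y = rng.normal(size=687)  # pure noise, so the true jump is zero
    for d, (jump, aic) in jump_by_degree(x, y).items():
        print(f"degree {d}: jump = {jump:+.3f}, AIC = {aic:.1f}")
```

If the estimated jump moves around a lot as the degree changes, that is a sign the result is being driven by the curve rather than by the data near the cutoff.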

In a sane world, perhaps this article would have received very little attention, or maybe its problems would’ve been corrected in the review process, or maybe it would’ve appeared in an obscure journal and then not been taken seriously. But it came to a strong conclusion on a politically charged topic.

Science communication is changing. On one hand, we have post-publication review, so there are places to point out when claims are pushed based on questionable evidence. On the other hand, the claims get out there faster.

**P.P.S.** I’m also reminded of something I wrote last month:

I am concerned that all our focus on causal identification, important as it is, can lead researchers, journalists, and members of the general public to overconfidence in theories as a result of isolated studies, without always the recognition that real life is more complicated.

**P.P.P.S.** More here.

I wrote this post a while ago but it just appeared . . .

I liked this line so much I’m posting it on its own:

We should be open-minded, but not selectively open-minded.

This is related to the research incumbency effect and all sorts of other things we’ve talked about over the years.

There’s a Bayesian argument, or an implicitly Bayesian argument for believing everything you read in the tabloids, and the argument goes as follows: It’s hard to get a paper published, papers in peer-reviewed journals typically really do go through the peer review process, so the smart money is to trust the experts.

This believe-what-you-read heuristic is Bayesian, but not *fully* Bayesian: it does not condition on new information. The argument against Brian Wansink’s work is not that it was published in the journal Environment and Behavior. The argument against it is that the work has lots of mistakes, and then you can do some partial pooling, looking at other papers by this same author that had lots of mistakes.

Asymmetric open-mindedness—being open to claims published in scientific journals and publicized on NPR, Ted, etc., while not at all being open to their opposites—is, arguably, a reasonable position to take. But this position is only reasonable *before* you look carefully at the work in question. Conditional on that careful look, the fact of publication provides much less information.

To put it another way, defenders of junk science, and even people who might think of themselves as agnostic on the issue, are making the fallacy of the one-sided bet.

Here’s an example.

Several years ago, the sociologist Satoshi Kanazawa claimed that beautiful parents were more likely to have girl babies. This claim was reproduced by the Freakonomics team. It turns out that the underlying statistical analysis was flawed, and what was reported was essentially patterns in random numbers (the kangaroo problem).

So, fine. At this point you might say: Some people believe that beautiful parents are more likely to have girl babies, while other people are skeptical of that claim. As an outsider, you might take an intermediate position (beautiful parents *might* be more likely to have girl babies), and you could argue that Kanazawa’s work, while flawed, might still be valuable by introducing this hypothesis.

But that would be a mistake; you’d be making the fallacy of the one-sided bet. If you want to consider the hypothesis that beautiful parents are more likely to have girl babies, you should also consider the hypothesis that beautiful parents are more likely to have boy babies. If you don’t consider both possibilities, you’re biasing yourself—and you’re also giving an incentive for future Wansinks to influence policy through junk science.

**P.S.** I also liked this line that I gave in response to someone who defended Brian Wansink’s junk science on the grounds that “science has progressed”:

To use general scientific progress as a way of justifying scientific dead-end work . . . that’s kinda like saying that the Bills made a good choice to keep starting Nathan Peterman, because Patrick Mahomes has been doing so well.

A problem I see is that the defenders of junk science are putting themselves in the position where they’re defending Science as an entity.

Javier Benitez points us to this op-ed, “Massaging data to fit a theory is not the worst research sin,” where philosopher Martin Cohen writes:

The recent fall from grace of the Cornell University food marketing researcher Brian Wansink is very revealing of the state of play in modern research.

Wansink had for years embodied the ideal to which all academics aspire: innovative, highly cited and media-friendly.

I would just like to briefly interrupt to note that not all academics aspire to be media-friendly. I have that aspiration myself, and of course people who aspire to be media-friendly are overrepresented in the media—but I’ve met lots of academics who’d prefer to be left in peace and quiet to do their work and communicate just with specialists and students.

But that’s not the key point here. So let me continue quoting Cohen:

[Wansink’s] research, now criticised as fatally flawed, included studies suggesting that people who go grocery shopping while hungry buy more calories, that pre-ordering lunch can help you choose healthier food, and that serving people out of large bowls leads them to eat larger portions.

Such studies have been cited more than 20,000 times and even led to an appearance on The Oprah Winfrey Show [and, more to the point, the spending of millions of dollars of government money! — ed.]. But Wansink was accused of manipulating his data to achieve more striking results. Underlying it all is a suspicion that he was in the habit of forming hypotheses and then searching for data to support them. Yet, from a more generous perspective, this is, after all, only scientific method.

Behind the criticism of Wansink is a much broader critique not only of his work but of a certain kind of study: one that, while it might have quantitative elements, is in essence ethnographic and qualitative, its chief value being in storytelling and interpretation. . . .

We forget too easily that the history of science is rich with errors. In a dash to claim glory before Watson and Crick, Linus Pauling published a fundamentally incoherent hypothesis that the structure of DNA was a triple helix. Lord Kelvin misestimated the age of the Earth by more than an order of magnitude. In the early days of genetics, Francis Galton introduced an erroneous mathematical expression for the contributions of different ancestors to individuals’ inherited traits. We forget because these errors were part of broader narratives that came with brilliant insights.

I accept that Wansink may have been guilty of shoehorning data into preconceived patterns – and in the process may have mixed up some of the figures too. But if the latter is unforgivable, the former is surely research as normal.

Let me pause again here. If all that happened is that Wansink “may have mixed up some of the figures,” then this is not “unforgivable” at all. We all “mix up some of the figures” from time to time (here’s an embarrassing example from my own published work), and nobody who does creative work is immune from “shoehorning data into preconceived patterns.”

For some reason, Cohen seems to be on a project to minimize Wansink’s offenses. So let me spell it out. No, the problem with the notorious food researcher is not that he “may have mixed up some of the figures.” First, he *definitely*—not “may have”—mixed up *many*—not “some”—of his figures. We know this because many of his figures contradicted each other, and others made no sense (see, for example, here for many examples). Second, Wansink bobbed and weaved, over the period of years denying problems that were pointed out to him from all sorts of different directions.

Cohen continues:

The critics are indulging themselves in a myth of neutral observers uncovering “facts”, which rests on a view of knowledge as pristine and eternal as anything Plato might have dreamed of.

It is thanks to Western philosophy that, for thousands of years, we have believed that our thinking should strive to eliminate ideas that are vague, contradictory or ambiguous. Today’s orthodoxy is that the world is governed by iron laws, the most important of which is if P then Q. Part and parcel of this is a belief that the main goal of science is to provide deterministic – cause and effect – rules for all phenomena. . . .

Here I think Cohen’s getting things backward! It’s Wansink’s critics who have repeatedly stated that the world is complicated and that we should be wary of taking misreported data from 97 people in a diner to make general statements about eating behavior, men’s and women’s behaviors, nutrition policy, etc.

Contrariwise, it was Wansink and his promoters who were making general statements, claiming to have uncovered facts about human nature, etc.

Cohen continues a few paragraphs later:

Plato attempted to avoid contradictions by isolating the object of inquiry from all other relationships. But, in doing so, he abstracted and divorced those objects from a reality that is multi-relational and multitemporal. This same artificiality dogs much research.

Exactly! Wansink, like all of us, is subject to the Armstrong Principle (“If you promise more than you can deliver, then you have an incentive to cheat.”). Most scholars, myself included, are scaredy-cats: in order to avoid putting ourselves in a Lance Armstrong situation, we’re careful to underpromise. Wansink, though, he overpromised, presenting his artificial research as yielding general truths.

In short, we, the critics of Wansink and other practitioners of cargo-cult science, are on Cohen’s side. We’re the ones who are trying to express scientific method in a way that respects the disconnect between experiment and real world.

Cohen concludes:

Even if the quantitative elements don’t convince and need revising, studies like Wansink’s can be of value if they offer new clarity in looking at phenomena, and stimulate ideas for future investigations. Such understanding should be the researcher’s Holy Grail.

After all, according to the tenets of our current approach to facts and figures, much scientific endeavour of the past amounted to wasted effort, in fields with absolutely no yield of true scientific information. And yet science has progressed.

I don’t get the logic here. “Much endeavour amounted to wasted effort . . . And yet science has progressed.” Couldn’t it be that, to a large extent, the wasted effort and the progress have been done by different people, in different places?

To use general scientific progress as a way of justifying scientific dead-end work . . . that’s kinda like saying that the Bills made a good choice to keep starting Nathan Peterman, because Patrick Mahomes has been doing so well.

**Who cares?**

So what? Why keep talking about this pizzagate? Because I think misconceptions here can get in the way of future learning.

Let me state the situation as plainly as possible, without any reference to this particular case:

Step 1. A researcher performs a study that gets published. The study makes big claims and gets lots of attention, both from the news media and from influential policymakers.

Step 2. Then it turns out that (a) the published work was seriously flawed, and the published claims are not supported by the data being offered in their support: the claims may be true, in some ways, but no good evidence has been given; (b) other published studies that appear to show confirmation of the original claim have their own problems; and (c) statistical analysis shows that it is possible that the entire literature is chasing noise.

Step 3. A call goes out to be open-minded: just because some of these studies did not follow ideal scientific practices, we should not then conclude that their scientific claims are false.

And I agree with Step 3. But I’ve said it before and I’ll say it again: We should be open-minded, but not *selectively* open-minded.

Suppose the original claim is X, but the study purporting to demonstrate X is flawed, and the follow-up studies don’t provide strong evidence for X either. Then, of course we should be open to the possibility that X remains true (after all, for just about any hypothesis X there is always some qualitative evidence and some theoretical arguments that can be found in favor of X), and we should also be open to the possibility that there is no effect (or, to put it more precisely, an effect that is in practice indistinguishable from zero). Fine. But let’s also be open to the possibility of “minus X”; that is, the possibility that the posited intervention is counterproductive. And, if we really want to get real, let’s be open to the possibility that the effect is positive for some people in some scenarios, and negative for other people in other scenarios, and that in the existing state of our knowledge, we can’t say much about where the effect is positive and where it is negative. Let’s show some humility about what we can claim.

Accepting uncertainty does not mean that we can’t make decisions. After all, we were busy making decisions about topic X, whatever it was, before we had any data at all—so we can keep making decisions on a case-by-case basis using whatever information and hunches we have.

Here are some practical implications. First, if we’re not sure of the effect of an intervention, maybe we should think harder about costs, including opportunity costs. Second, it makes sense to gather information about what’s happening locally, to get a better sense of what the intervention is doing.

**All the work that you haven’t heard of**

The other thing I want to bring up is the selection bias involved in giving the benefit of the doubt to weak claims that happen to have received positive publicity. One big big problem here is that there are lots of claims in all sorts of directions that you haven’t heard about, because they haven’t appeared on Oprah, or NPR, or PNAS, or Freakonomics, or whatever. By the same logic as Cohen gives in the above-quoted piece, all those obscure claims also deserve our respect as “of value if they offer new clarity in looking at phenomena, and stimulate ideas for future investigations.” The problem is that we’re *not* seeing all that work.

As I’ve also said on various occasions, I have no problem when people combine anecdotes and theorizing to come up with ideas and policy proposals. My problem with Wansink is not that he had interesting ideas without strong empirical support: that happens all the time. Most of our new ideas don’t have strong empirical support, in part because ideas with strong empirical support tend to already exist so they won’t be new! No, my problem with Wansink is that he took weak evidence and presented it as if it were strong evidence. For this discussion, I don’t really care if he did this by accident or on purpose. Either way, now we know he had weak evidence, or no evidence at all. So I don’t see why his conjectures should be taken more seriously than any other evidence-free conjectures. Let a zillion flowers bloom.

*“My friends and I don’t wanna be here if this isn’t an actively trans-affirming space. I’m only coming if all my sisters can.” – I have no music for you today, sorry. But I do have an article about cruise ships*

(This is obviously not Andrew)

A Sunday night quickie post, from the tired side of Toronto’s Pride weekend. It’s also Pride month, and it’s 50 years on Friday since the Stonewall riots, which were a major event in LGBT+ rights activism in the US and across the world. Stan has even gone rainbow for the occasion. (And many thanks to the glorious Michael Betancourt who made the badge.)

This is a great opportunity for a party and to see Bud Lite *et al.* pretend they care deeply about LGBTQIA+ people. But really it should also be a time to think about how open workplaces, departments, universities, conferences, any other place of work are to people who are lesbian, gay, bisexual, transgender, non-binary, two-spirit, gender non-conforming, intersex, or who otherwise lead lives (or wish to lead lives) that lie outside the cisgender, straight world that the majority occupies. People who aren’t spending a bunch of time trying to hide aspects of their life are usually happier and healthier and better able to contribute to things like science than those who are.

Which I guess is to say that diversity is about a lot more than making sure that there aren’t zero women as invited speakers. (Or being able to say “we invited women but they all said no”.) Diversity is about racial and ethnic diversity, diversity of gender, active and meaningful inclusion of disabled people, diversity of sexuality, intersections of these identities, and so much more. It is not an accounting game (although zero is still a notable number).

And regardless of how many professors or style guides or blogposts tell you otherwise, there is no single gold standard absolute perfect way to deliver information. Bring yourself to your delivery. Be gay. Be femme. Be masc. Be boring. Be sports obsessed. Be from whatever country and culture you are from. We can come along for the journey. And people who aren’t willing to are not worth your time.

Anyway, I said a pile of words that aren’t really about this but are about this for a podcast, which if you have not liked the previous three paragraphs you will definitely not enjoy. Otherwise I’m about 17 mins in (but the story about the alligators is also awesome.) If you do not like adult words, you definitely should not listen.

In the spirit of Pride month please spend some time finding love for and actively showing love to queer and trans folk. And for those of you in the UK especially (but everywhere else as well), please work especially hard to affirm and love and care for and support Trans* people who are under attack on many fronts. (Not least the recent rubbish about how being required to use people’s correct names and pronouns is somehow an affront to academic freedom, as if using the wrong pronoun or name for a student or colleague is an academic position.)

And should you find yourself with extra cash, you can always support someone like Rainbow Railroad. Or your local homeless or youth homeless charity. Or your local sex worker support charity like SWOP Behind Bars or the Sex Workers Project from the Urban Justice Centre. (LGBTQ+ people have *much* higher rates of homelessness [especially youth homelessness] and survival sex work than straight and cis people.)

Anyway, that’s enough for now. (Or nowhere near enough ever, but I’ve got other things to do.) Just recall what the extremely kind and glorious writer and academic Anthony Olivera said in the Washington Post: (Also definitely read this from him because it’s amazing)

We do not know what “love is love” means when you say it, because unlike yours, ours is a love that has cost us everything. It has, in living memory, sent us into exterminations, into exorcisms, into daily indignities and compromises. We cannot hold jobs with certainty nor hands without fear; we cannot be sure when next the ax will fall with the stroke of a pen.

Hope you’re all well and I’ll see you again in LGBT+ wrath month. (Or, more accurately, some time later this week to talk about the asymptotic properties of PSIS.)

Interesting juxtaposition as two interesting pieces of spam happened to appear in my inbox on the same day:

1. Subject line “Why the power stance will be your go-to move in 2019”:

The power stance has been highlighted as one way to show your dominance at work and move through the ranks. While moving up in your career comes down to so much more, there may be a way to make your power stance practical while also boosting your motivation and energy at the office.

**’s range of standing desks is the perfect way to bring your power stance to your office while also helping you stay organized, motivated and energized during the typical 9-5. . . . not only are you able to move from sitting to standing (or power stand) with the push of a button, but you are able to completely customize your desk for optimal organization and efficiency. For example, you can customize your desk to include the keyboard platform and dual monitor arms to keep the top of your desk clean and organized to help keep your creativity flowing. . . . the perfect way to help you show your power stance off in the office without ever having to leave your desk.

A standing desk could be cool, but color me skeptical on the power stance. Last time I saw a review of the evidence on that claim, there didn’t seem to be much there.

2. Subject line “Why you’re more productive in a coffee shop…”:

Why “one step at a time” is scientifically proven to help you get more done. Say hello to microproductivity . . .

Readers’ Choice 2018 ⭐️ Why you get more done when you relocate to a coffee shop. Plot twist: it’s not the caffeine. . . .

Feel like you’re constantly working but never accomplishing anything? Use this sage advice to be more strategic.

I clicked on the link for why you get more done when you relocate to a coffee shop, and it all seemed plausible to me. I’ve long noticed that I can get lots more work done on a train ride than in the equivalent number of hours at my desk. The webpage on “the coffee shop effect” has various links, including an article in Psychology Today on “The Science of Accomplishing Your Goals” and a university press release from 2006 reporting on an FMRI study (uh oh) containing several experiments, each on N=14 people (!), with results such as a statistically significant interaction (p = 0.048!!) and this beauty: “A post hoc analysis showed a significant difference . . . in substantia nigra (one sample t test, p = 0.05, one tailed) . . . but not in the amygdala . . .” So, no, this doesn’t look like high-quality science.

On the other hand, I often *am* more productive on the train, and I could well believe that I could be more productive in the coffee shop. So what’s the role of the scientific research here? I have no doubt that research on productivity in coffee shops *could* have value. But does the existing work have any value at all? I have no idea.

I received the following email:

Dear Dr Andrew Gelman,

I am writing to you on behalf of **. I hereby took this opportunity to humbly request you to consider being a guest speaker on our morning radio show, on 6th August, between 8.30-9.00 am (BST) to discuss North Korea working on new missiles

We would feel honoured to have you on our radio show. having you as a guest speaker would give us and our viewers a great insight into this topic, we would greatly appreciate it if you could give us 10-15 minutes of your time and not just enhance our but also our views knowledge on this topic.

We are anticipating your reply and look forward to possibly having you on our radio show.

Kind regards,

**

Note – All interviews are conducted over the phone

Note – Timing can be altered between 7.30- 9.00 am (BST)

@**

@**

http://**

CONFIDENTIALITY NOTICE

This email is CONFIDENTIAL and LEGALLY PRIVILEGED. If you are not the intended recipient of this email and its attachments, you must take no action based upon them, nor must you copy or show them to anyone. If you believe you have received this email in error, please email **

I don’t know which aspect of this email is more bizarre, that they sent me an unsolicited email that concludes with bullying pseudo-legal instructions, or that they think I’m an expert on North Korea (I guess from this post; to be fair, it seems that I know more about North Korea than the people who run the World Values Survey). Don’t they know that my real expertise is on Freud?

Tyler Cowen writes:

If it were legal, and you tried to sell your vote and your vote alone, you might not get much more than 0.3 cents.

It depends where you live.

If you’re not voting in any close elections, then the value of your vote is indeed close to zero. For example, I am a resident of New York. Suppose someone could pay me $X to switch my vote (or, equivalently, pay me $X/2 to not vote, or, equivalently, pay a nonvoter $X/2 to vote in a desired direction) in the general election for president. Who’d want to do that? There’s not much reason at all, except possibly for a winning candidate who’d like the public relations value of winning by an even larger margin, or for a losing candidate who’d like to lose by a bit less, to look like a more credible candidate next time, or maybe for some organization that would like to see voter turnout reach some symbolic threshold such as 50% or 60%.

If you’re living in a district with a close election, the story is quite different, as Edlin, Kaplan, and I discussed in our paper. In some recent presidential elections, we’ve estimated the ex ante probability of your vote being decisive in the national election (that is, decisive in your state, and, conditional on that, your state being decisive in the electoral college) as being approximately 1 in a million in swing states.

Suppose you live in one of those states? *Then*, how much would someone pay for your vote, if it were legal and moral to do so? I’m pretty sure there are people out there who would pay a lot more than 0.3 cents. If a political party or organization would drop, say, $100M to determine the outcome of the election, then, at a 1-in-a-million chance of being decisive, it would be worth $100 to switch one person’s vote in one of those swing states.
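The back-of-the-envelope calculation here is just the stake multiplied by the probability that one vote is decisive. A minimal sketch follows. The function names `vote_value` and `implied_stake` are made up for illustration, and the calculation ignores everything except the two numbers given above: the $100M stake and the roughly 1-in-a-million chance.

```python
def vote_value(stake_dollars, one_in_n):
    """Expected value of a single vote that is decisive with probability 1/one_in_n."""
    return stake_dollars / one_in_n

def implied_stake(price_per_vote_dollars, one_in_n):
    """Stake at which a given price per vote would break even."""
    return price_per_vote_dollars * one_in_n

if __name__ == "__main__":
    # $100M stake, 1-in-a-million chance of being decisive.
    print(vote_value(100_000_000, 1_000_000))
    # What total stake would justify paying only 0.3 cents per vote?
    print(implied_stake(0.003, 1_000_000))
```

Read the other way around, a price of 0.3 cents per swing-state vote would only make sense if the outcome of the election were worth about $3,000 to the buyer.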

We can also talk about this empirically. Campaigns *do* spend money to flip people’s votes and to get voters to turn out. They spend a lot more than 0.3 cents per voter. Now, sure, not all this is for the immediate goal of winning the election right now: for example, some of it is to get people to become regular voters, in anticipation of the time when their vote will make a difference. There’s a difference between encouraging people to turn out and vote (which is about establishing an attitude and a regular behavior) and paying for a single vote with no expectation of future loyalty. That said, even a one-time single vote should be worth a lot more than 0.3 cents to a campaign in a swing state.

tl;dr. Voting matters. Your vote is, in expectation, worth something real.

Edward Hearn writes:

In an effort to buttress my own understanding of multi-level methods, especially pertaining to those involving instrumental variables, I have been working the examples and the exercises in Jennifer Hill’s and your book.

I can find general answers at the Github repo for ARM examples, but for Chapter 10, Exercise 3 (simulating an IV regression to test assumptions using a binary treatment and instrument) and for the book examples, no code is given and I simply cannot figure out the solution.

My reply:

I have no homework solutions to send. But maybe some blog commenters would like to help out?

Here’s the exercise: