
“Boston Globe Columnist Suspended During Investigation Of Marathon Bombing Stories That Don’t Add Up”

I came across this news article by Samer Kalaf and it made me think of some problems we’ve been seeing in recent years involving cargo-cult science.

Here’s the story:

The Boston Globe has placed columnist Kevin Cullen on “administrative leave” while it conducts a review of his work, after WEEI radio host Kirk Minihane scrutinized Cullen’s April 14 column about the five-year anniversary of the Boston Marathon bombings, and found several inconsistencies. . . .

Here’s an excerpt of the column:

I happened upon a house fire recently, in Mattapan, and the smell reminded me of Boylston Street five years ago, when so many lost their lives and their limbs and their sense of security.

I can smell Patriots Day, 2013. I can hear it. God, can I hear it, whenever multiple fire engines or ambulances are racing to a scene.

I can taste it, when I’m around a campfire and embers create a certain sensation.

I can see it, when I bump into survivors, which happens with more regularity than I could ever have imagined. And I can touch it, when I grab those survivors’ hands or their shoulders.

Cullen, who was part of the paper’s 2003 Pulitzer-winning Spotlight team that broke the stories on the Catholic Church sex abuse scandal, had established in this column, and in prior reporting, that he was present for the bombings. . . .

But Cullen wasn’t really there. And his stories had lots of details that sounded good but were actually made up. Including, horrifyingly enough, made-up stories about a little girl who was missing her leg.

OK, so far, same old story. Mike Barnicle, Janet Cooke, Stephen Glass, . . . and now one more reporter who prefers making things up to doing actual reporting. For one thing, making stuff up is easier; for another, if you make things up, you can make the story work better, as you’re not constrained by pesky details.

What’s the point of writing about this, then? What’s the connection to statistical modeling, causal inference, and social science?

Here’s the point:

Let’s think about journalism:

1. What’s the reason for journalism? To convey information, to give readers a different window into reality. To give a sense of what it was like to be there, for those who were not there. Or to help people who were there, to remember.

2. What does good journalism look like? It’s typically emotionally stirring and convincingly specific.

And here’s the problem.

The reason for journalism is 1, but some journalists decide to take a shortcut and go straight to the form of good journalism, that is, 2.

Indeed, I suspect that many journalists think that 2 is the goal, and that 1 is just some old-fashioned traditional attitude.

Now, to connect to statistical modeling, causal inference, and social science . . . let’s think about science:

1. What’s the reason for science? To learn about reality, to learn new facts, to encompass facts into existing and new theories, to find flaws in our models of the world.

2. And what does good science look like? It typically has an air of rigor.

And here’s the problem.

The reason for science is 1, but some scientists decide to take a shortcut and go straight to the form of good science, that is, 2.

The problem is not that scientists don’t care about the goal of learning about reality; the problem is that they think that if they follow various formal expressions of science (randomized experiments, p-values, peer review, publication in journals, association with authority figures, etc.), they’ll get the discovery for free.

It’s a natural mistake, given statistical training with its focus on randomization and p-values, an attitude that statistical methods can yield effective certainty from noisy data (true for Las Vegas casinos where the probability model is known; not so true for messy real-world science experiments), and scientific training that’s focused on getting papers published.


What struck me about the above-quoted Boston Globe article (“I happened upon a house fire recently . . . I can smell Patriots Day, 2013. I can hear it. God, can I hear it . . . I can taste it . . .”) was how it looks like good journalism. Not great journalism—it’s too clichéd and trope-y for that—but what’s generally considered good reporting, the kind that sometimes wins awards.

Similarly, if you look at a bunch of the fatally flawed articles we’ve seen in science journals in the past few years, they look like solid science. It’s only when you examine the details that you start seeing all the problems, and these papers disintegrate like a sock whose thread has been pulled.

Ok, yeah yeah sure, you’re saying: Once again I’m reminded of bad science. Who cares? I care, because bad science Greshams good science in so many ways: in scientists’ decision of what to work on and publish (why do a slow careful study if you can get a better publication with something flashy?), in who gets promoted and honored and who decides to quit the field in disgust (not always, but sometimes), and in what gets publicized. The above Boston marathon story struck me because it had that same flavor.

P.S. Tomorrow’s post: Harking, Sharking, Tharking.

I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job. Also, moving beyond naive falsificationism

Sandro Ambuehl writes:

I’ve been following your blog and the discussion of replications and replicability across different fields daily, for years. I’m an experimental economist. The following question arose from a discussion I recently had with Anna Dreber, George Loewenstein, and others.

You’ve previously written about the importance of sound theories (and the dangers of anything-goes theories), and I was wondering whether there’s any formal treatment of that, or any empirical evidence on whether empirical investigations based on precise theories that simultaneously test multiple predictions are more likely to replicate than those without theoretical underpinnings, or those that test only isolated predictions.

Specifically: Many of the proposed solutions to the replicability issue (such as preregistration) seem to implicitly assume one-dimensional hypotheses such as “Does X increase Y?” In experimental economics, by contrast, we often test theories. The value of a theory is precisely that it makes multiple predictions. (In economics, theories that explain just one single phenomenon, or make one single prediction, are generally viewed as useless and are highly discouraged.) Theories typically also specify how their various predictions relate to each other, often even regarding magnitudes. They are formulated as mathematical models, and their predictions are correspondingly precise. Let’s call a within-subjects experiment that tests a set of predictions of a theory a “multi-dimensional experiment”.

My conjecture is that all the statistical skulduggery that leads to non-replicable results is much harder to do in a theory-based, multi-dimensional experiment. If so, multi-dimensional experiments should lead to better replicability even absent safeguards such as preregistration.

The intuition is the following. Suppose an unscrupulous researcher attempts to “prove” a single prediction that X increases Y. He can do that by selectively excluding subjects with low X and high Y (or high X and low Y) from the sample. Compare that to a researcher who attempts to “prove”, in a within-subject experiment, that X increases Y and A increases B. The latter researcher must exclude many more subjects until his “preferred” sample includes only subjects that conform to the joint hypothesis. The exclusions become harder to justify, and more subjects must be run.

A similar intuition applies to the case of an unscrupulous researcher who tries to “prove” a hypothesis by messing with the measurements of variables (e.g. by using log(X) instead of X). Here, an example is a theory that predicts that X increases both Y and Z. Suppose the researcher finds a null if he regresses Y on X, but finds a positive correlation between f(X) and Y for some selected transformation f. If the researcher only “tested” the relation between X and Y (a one-dimensional experiment), the researcher could now declare “success”. In a multi-dimensional experiment, however, the researcher will have to dig for an f that doesn’t only generate a positive correlation between f(X) and Y, but also between f(X) and Z, which is harder. A similar point applies if the researcher measures X in different ways (e.g. through a variety of related survey questions) and attempts to select the measurement that best helps “prove” the hypothesis. (Moreover, such a theory would typically also specify something like “If X increases Y by magnitude alpha, then it should increase Z by magnitude beta.” The relation between Y and Z would then present an additional prediction to be tested, yet again increasing the difficulty of “proving” the result through nefarious manipulations.)

So if there is any formal treatment relating to the above intuitions, or any empirical evidence on what kind of research tends to be more or less likely to replicate (depending on factors other than preregistration), I would much appreciate if you could point me to it.
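Ambuehl’s exclusion-counting intuition can be made concrete with a toy simulation (my illustration, not part of the correspondence): if, under the null, each subject independently happens to “support” each prediction with probability 1/2, then the fraction of subjects a cherry-picking researcher can keep falls off geometrically in the number of joint predictions.

```python
import random

random.seed(1)

def conforming_fraction(n_subjects, n_predictions, p_support=0.5, n_sims=2000):
    # Toy null model: each subject independently "supports" each of the
    # theory's predictions with probability p_support. Return the average
    # fraction of subjects a cherry-picker could keep, i.e. those who
    # happen to support every prediction simultaneously.
    total = 0.0
    for _ in range(n_sims):
        keep = sum(
            all(random.random() < p_support for _ in range(n_predictions))
            for _ in range(n_subjects)
        )
        total += keep / n_subjects
    return total / n_sims

# One prediction: roughly half the sample can be kept.
one_dim = conforming_fraction(100, 1)
# Two joint predictions: only about a quarter, so far more exclusions needed.
two_dim = conforming_fraction(100, 2)
```

With k joint predictions the keepable fraction is about (1/2)^k, which is the sense in which the exclusions become harder to justify and more subjects must be run.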

My reply:

I have two answers for you.

First, some colleagues and I recently published a preregistered replication of one of our own studies; see here. This might be interesting to you because our original study did not test just a single hypothesis, so our evaluation was necessarily holistic. In our case, the study was descriptive, not theoretically motivated, so it’s not quite what you’re talking about—but it’s like your study in that the outcomes of interest were complex and multidimensional.

This was one of the problems I’ve had with recent mass replication studies, that they treat a scientific paper as if it has a single conclusion, even though real papers—theoretically-based or not—typically have many conclusions.

My second response is that I fear you are being too optimistic. Yes, when a theory makes multiple predictions, it may be difficult to select data to make all the predictions work out. But on the other hand you have many degrees of freedom with which to declare success.

This has been one of my problems with a lot of social science research. Just about any pattern in data can be given a theoretical explanation, and just about any pattern in data can be said to be the result of a theoretical prediction. Remember that claim that women were three times more likely to wear red or pink clothing during a certain time of the month? The authors of that study did a replication which failed, but they declared it a success after adding an interaction with outdoor air temperature. Or there was this political science study where the data went in the opposite direction of the preregistration but were retroactively declared to be consistent with the theory. It’s my impression that a lot of economics is like this too: if it goes the wrong way, the result can be explained. That’s fine—it’s one reason why economics is often a useful framework for modeling the world—but I think the idea that statistical studies and p-values and replication are some sort of testing ground for models, the idea that economists are a group of hard-headed Popperians, regularly subjecting their theories to the hard test of reality—I’m skeptical of that take. I think it’s much more that individual economists, and schools of economists, are devoted to their theories and only rarely abandon them on their own. That is, I have a much more Kuhnian take on the whole process. Or, to put it another way, I try to be Popperian in my own research, I think that’s the ideal, but I think the Kuhnian model better describes the general process of science. Or, to put it another way, I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job.

Ambuehl responded:

Anna did have a similar reaction to you—and I think that reaction depends much on what passes as a “theory”. For instance, you won’t find anything in a social psychology textbook that an economic theorist would call a “theory”. You’re certainly right about the issues pertaining to hand-wavy ex-post explanations as with the clothes and ovulation study, or “anything-goes theories” such as the Himicanes that might well have turned out the other way.

By contrast, the theories I had in mind when asking the question are mathematically formulated theories that precisely specify their domain of applicability. An example of the kind of theory I have in mind would be Expected Utility theory, tested in countless papers, e.g., here. Another example of such a theory is the Shannon model of choice under limited attention (tested, e.g., here). These theories are in an entirely different ballpark than vague ideas like, e.g., self-perception theory or social comparison theory that are so loosely specified that one cannot even begin to test them unless one is willing to make assumptions on each of the countless researcher degrees of freedom they leave open.

In fact, economic theorists tend to regard the following characteristics virtues, or even necessities, of any model: precision (can be tested without requiring additional assumptions), parsimony (and hence, makes it hard to explain “uncomfortable” results by interactions etc.), generality (in the sense that they make multiple predictions, across several domains). And they very much frown upon ex post theorizing, ad-hoc assumptions, and imprecision. For theories that satisfy these properties, it would seem much harder to fudge empirical research in a way that doesn’t replicate, wouldn’t it? (Whether the community will accept the results or not seems orthogonal to the question of replicability, no?)

Finally, to the extent that theories in the form of precise, mathematical models are often based on wide bodies of empirical research (economic theorists often try to capture “stylized facts”), wouldn’t one also expect higher rates of replicability because such theories essentially correspond to well-informed priors?

So my overall point is, doesn’t (good) theory have a potentially important role to play regarding replicability? (Many current suggestions for solving the replication crisis, in particular formulaic ones such as pre-registration, or p<0.005, don't seem to recognize those potential benefits of sound theory.)

I replied:

Well, sure, but expected utility theory is flat-out false. Much has been written on the way that utilities only exist after the choices are given. This can even be seen in simple classroom demonstrations, as in section 5 of this paper from 1998. No statistics are needed at all to demonstrate the problems with that theory!

Ambuehl responded with some examples of more sophisticated, but still testable, theories such as reference-dependent preferences, various theories of decision making under ambiguity, and perception-based theories, and I responded with my view that all these theories are either vague enough to be adaptable to any data or precise enough to be evidently false with no data collection needed. This was what Lakatos noted: any theory is either so brittle that it can be destroyed by collecting enough data, or flexible enough to fit anything. This does not mean we can’t do science, it just means we have to move beyond naive falsificationism.

P.S. Tomorrow’s post: “Boston Globe Columnist Suspended During Investigation Of Marathon Bombing Stories That Don’t Add Up.”

Deterministic thinking (“dichotomania”): a problem in how we think, not just in how we act

This has come up before:

Basketball Stats: Don’t model the probability of win, model the expected score differential.

Econometrics, political science, epidemiology, etc.: Don’t model the probability of a discrete outcome, model the underlying continuous variable

Thinking like a statistician (continuously) rather than like a civilian (discretely)

Message to Booleans: It’s an additive world, we just live in it

And it came up again recently.

Epidemiologist Sander Greenland has written about “dichotomania: the compulsion to replace quantities with dichotomies (‘black-and-white thinking’), even when such dichotomization is unnecessary and misleading for inference.”

I’d avoid the misleadingly clinical-sounding term “compulsion,” and I’d similarly prefer a word that doesn’t include the pejorative suffix “mania”; hence I’d rather just speak of “deterministic thinking” or “discrete thinking”—but I agree with Greenland’s general point that this tendency to prematurely collapse the wave function contributes to many problems in statistics and science.

Often when the problem of deterministic thinking comes up in discussion, I hear people explain it away, arguing that decisions have to be made (FDA drug trials are often brought up here), or that all rules are essentially deterministic (the idea that confidence intervals are interpreted as whether they include zero), or that this is a problem with incentives or publication bias, or that, sure, everyone knows that thinking of hypotheses as “true” or “false” is wrong, and that statistical significance and other summaries are just convenient shorthands for expressions of uncertainty that are well understood.

But I’d argue, with Eric Loken, that inappropriate discretization is not just a problem with statistical practice; it’s also a problem with how people think, that the idea of things being on or off is “actually the internal working model for a lot of otherwise smart scientists and researchers.”
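To see what inappropriate discretization costs in purely statistical terms, here’s a small simulation of my own (not from any of the posts above): median-splitting a continuous predictor throws away all the within-group variation, and its correlation with the outcome attenuates by a predictable factor.

```python
import math
import random

random.seed(0)

def pearson(u, v):
    # plain Pearson correlation, stdlib only
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]  # continuous relation

# "Dichotomania": replace the measurement by high/low at the median
x_split = [1.0 if xi > 0 else 0.0 for xi in x]

r_cont = pearson(x, y)        # uses the full measurement
r_dich = pearson(x_split, y)  # after collapsing to a binary

# For a normal predictor split at its mean, theory says the correlation
# shrinks by a factor of sqrt(2/pi), about 0.80.
```

Since precision scales roughly with the squared correlation, you’d need on the order of 1.5 times the sample size to buy back what the median split discarded.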

This came up in some of the recent discussions on abandoning statistical significance, and I want to use this space to emphasize one more time the problem of inappropriate discrete modeling.

The issue arose in my 2011 paper, Causality and Statistical Learning.

My math is rusty

When I’m giving talks explaining how multilevel modeling can resolve some aspects of the replication crisis, I mention this well-known saying in mathematics: “When a problem is hard, solve it by embedding it in a harder problem.” As applied to statistics, the idea is that it could be hard to analyze a single small study, as inferences can be sensitive to the prior, but if you consider this as one of a large population or long time series of studies, you can model the whole process, partially pool, etc.
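That partial-pooling payoff can be seen in a minimal sketch (my own toy example, with the variance parameters assumed known for simplicity): treat each small study’s estimate as one draw from a population of studies and shrink it toward the grand mean.

```python
import random

random.seed(2)

# J small studies, each estimating its own true effect theta_j ~ N(mu, tau^2);
# each study's raw estimate adds sampling noise with sd sigma.
J, mu, tau, sigma = 20, 0.3, 0.2, 0.5
theta = [random.gauss(mu, tau) for _ in range(J)]
raw = [t + random.gauss(0, sigma) for t in theta]

grand = sum(raw) / J
w = tau**2 / (tau**2 + sigma**2)  # shrinkage weight (variances assumed known)
pooled = [grand + w * (r - grand) for r in raw]  # partial pooling toward the grand mean

mse_raw = sum((r - t) ** 2 for r, t in zip(raw, theta)) / J
mse_pooled = sum((p - t) ** 2 for p, t in zip(pooled, theta)) / J
# Partial pooling gives a smaller average squared error than analyzing
# each study on its own.
```

The one-study-at-a-time analysis is the "hard problem"; embedding it in the harder problem of modeling all the studies at once is what makes the shrinkage weight available.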

In math, examples of embedding into a harder problem include using the theory of ideals to solve problems in prime numbers (ideals are a general class that includes primes as a special case, hence any result about ideals automatically applies to primes but is more general), using complex numbers to solve problems with real numbers, and using generating functions to sum infinite series.

That last example goes like this. You want to compute
S = sum_{n=1}^{infinity} a_n, but you can’t figure out how to do it. So you write the generating function,
G(x) = sum_{n=1}^{infinity} a_n x^n,
you then do some analysis to figure out G(x) as a function of x, then your series is just S = G(1). And it really works. Cool.

Anyway, I thought that next time I mention this general idea, it would be fun to demonstrate with an example, so one day when I was sitting in a seminar with my notebook, I decided to try to work one out.

I thought I’d start with something simple, like this:
S = 1/1^2 + 1/2^2 + 1/3^2 + 1/4^2 + . . .
That is, S = sum_{n=1}^{infinity} n^{-2}
Then the generating function is,
G(x) = sum_{n=1}^{infinity} n^{-2} x^n.
To solve for G(x), we take some derivatives until we can get to something we can sum directly.
First one derivative:
dG/dx = sum_{n=1}^{infinity} n^{-1} x^{n-1}.
OK, taking the derivative again will be a mess, but we can do this:
x dG/dx = sum_{n=1}^{infinity} n^{-1} x^n.
And now we can differentiate again!
d/dx (x dG/dx) = sum_{n=1}^{infinity} x^{n-1}.
Hey, that one we know! It’s 1 + x + x^2 + . . . = 1/(1-x).

So now we have a differential equation:
xG”(x) + G'(x) = 1/(1-x).
Or maybe better to write as,
x(1-x) G”(x) + (1-x) G'(x) – 1 = 0.
Either way, it looks like we’re close to done. Just solve this second-order differential equation. Actually, even easier than that. Let h(x) = G'(x), then we just need to solve,
x(1-x) h'(x) + (1-x) h(x) – 1 = 0.
Hey, that’s just h(x) = -log(1-x) / x. I can’t remember how I figured that one out—it’s just there in my notes—but there is an easy derivation: divide the equation by (1-x) to get x h'(x) + h(x) = 1/(1-x), that is, d/dx (x h(x)) = 1/(1-x), which integrates to x h(x) = -log(1-x). In any case, it works:
h'(x) = log(1-x)/x^2 + 1/(x(1-x)), so
x(1-x) h'(x) = log(1-x)*(1-x)/x + 1
(1-x) h(x) = -log(1-x)*(1-x)/x
So, yeah, x(1-x) h'(x) + (1-x) h(x) – 1 = 0. We’ve solved the differential equation!

And now we have the solution:
G(x) = integral dx (-log(1-x) / x).
This is an indefinite integral but that’s not a problem: we can see that, trivially, G(0) = 0, so we just have to do the integral starting from 0.

At this point, I was feeling pretty good about myself, like I’m some kind of baby Euler, racking up these sums using generating functions.

All I need to do is this little integral . . .

OK, I don’t remember integrals so well. It must be easy to do it using integration by parts . . . oh well, I’ll look it up when I come into the office, it’ll probably be an arcsecant or something like that. But then . . . it turns out there’s no closed-form solution!

Here it is in Wolfram alpha (OK, I take back all the things I said about them): the indefinite integral comes out as Li_2(x), the dilogarithm.

OK, what’s Li_2(x)? Here it is: Li_2(x) = sum_{k=1}^{infinity} x^k / k^2.

Hey—that’s no help at all, it’s just the infinite series again.

So my generating-function trick didn’t work. Next step is to sum the infinite series by integrating it in the complex plane and counting the poles. But I really don’t remember that! It’s something I learned . . . ummm, 35 years ago. And probably forgot about 34 years ago.
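Numerically, though, it’s easy to confirm that the G(x) we derived does recover the sum at x = 1 (my addition: a quick midpoint-rule check):

```python
import math

# G(1) = integral_0^1 -log(1-x)/x dx. The integrand tends to 1 at x = 0 and
# has only an integrable log singularity at x = 1, so a midpoint rule is fine.
n = 200_000
G1 = sum(
    -math.log(1 - (i + 0.5) / n) / ((i + 0.5) / n)
    for i in range(n)
) / n

# Partial sum of the original series sum 1/n^2
S = sum(1.0 / k**2 for k in range(1, n))

# Both come out around 1.6449, which is pi^2/6, the known value of the series.
```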

So, yeah, my math is rusty.

But I still like the general principle: When a problem is hard, solve it by embedding it in a harder problem.

P.S. We can use this example to teach a different principle of statistics: the combination of numerical and analytic methods.

How do you compute S = sum_{n=1}^{infinity} n^{-2}?

Simplest approach is to add a bunch of terms; for example, in R:
S_approx_1 <- sum((1:1000000)^(-2)). This brute-force method works fine in this example but it would have trouble if the function to evaluate is expensive.

Another approach is to approximate the sum by an integral; thus:
S_approx_2 <- integral_{from x=0.5 to infinity} dx x^{-2} = 2. (The indefinite integral is just -1/x, so the definite integral is 1/infinity - (-1/0.5) = 2.) You have to start the integral at 0.5 because the sum starts at 1, so the little bars to sum are [0.5,1.5], [1.5,2.5], etc. That second approximation isn't so great at the low end of x, though, where the curve 1/x^2 is far from locally linear. So we can do an intermediate approximation:

S_approx_3 <- sum((1:N)^(-2)) + integral_{from x=(N+0.5) to infinity} dx x^{-2} = sum((1:N)^(-2)) + 1/(N+0.5).

That last approximation is fun because it combines numerical and analytic methods. And it works! Just try N=3:
S_approx = 1 + 1/4 + 1/9 + 1/3.5 = 1.647.
The exact value, to three decimal places, is 1.644. Not bad.
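The three approximations can be collected into a runnable sketch (in Python rather than R, purely as an illustration):

```python
import math

def s_brute(n_terms=1_000_000):
    # brute force: just add a bunch of terms of 1/n^2
    return sum(1.0 / n**2 for n in range(1, n_terms + 1))

def s_integral():
    # pure integral approximation: integral_{0.5}^{inf} x^(-2) dx = 2
    return 2.0

def s_hybrid(N=3):
    # sum the first N terms exactly, then approximate the tail by its
    # integral: integral_{N+0.5}^{inf} x^(-2) dx = 1/(N+0.5)
    return sum(1.0 / n**2 for n in range(1, N + 1)) + 1.0 / (N + 0.5)

exact = math.pi**2 / 6  # the series sums to pi^2/6 = 1.6449...
```

With N = 3, s_hybrid gives 1.647, matching the hand calculation above, while the brute-force sum needs a million terms to do much better.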

There are better approximation methods out there; the point is that even a simple approach of this sort can do pretty well. And I’ve seen a lot of simulation studies that are done using brute force where the answers just don’t make sense, and where just a bit of analytical work at the end could’ve made everything work out.

P.P.S. Tomorrow’s post: Deterministic thinking (“dichotomania”): a problem in how we think, not just in how we act.

P.P.P.S. [From Bob Carpenter] MathJax is turned on for posts, but not comments, so that $latex e^x$ renders as e^x.

The uncanny valley of Malcolm Gladwell

Gladwell is a fun writer, and I like how he plays with ideas. To my taste, though, he lives in an uncanny valley between nonfiction and fiction, or maybe I should say between science and storytelling. I’d enjoy him more, and feel better about his influence, if he’d take the David Sedaris route and go all the way toward storytelling (with the clear understanding that he’s telling us things because they sound good or they make a good story, not because they’re true), or conversely become a real science writer and evaluate science and data claims critically. Instead he’s kind of in between, bouncing back and forth between stories and science, and that makes me uncomfortable.

Here’s an example, from a recent review by Andrew Ferguson, “Malcolm Gladwell Reaches His Tipping Point.” I haven’t read Gladwell’s new book, so I can’t really evaluate most of these criticisms, but of course I’m sympathetic to Ferguson’s general point. Key quote:

Gladwell’s many critics often accuse him of oversimplification. Just as often, though, he acts as a great mystifier, imposing complexity on the everyday stuff of life, elevating minor wrinkles into profound conundrums. This, not coincidentally, is the method of pop social science, on whose rickety findings Gladwell has built his reputation as a public intellectual.

In addition, Ferguson has a specific story regarding some suspiciously specific speculation (the claim that “of every occupational category, [poets] have far and away the highest suicide rates—as much as five times higher than the general population.”) which reminds me of some other such items we’ve discussed over the years, including:

– That data scientist’s unnamed smallish town where 75 people per year died “because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic.”

– That billionaire’s graph purporting to show “percentage of slaves or serfs in the world.”

– Those psychologists’ claim that women were three times more likely to wear red or pink during certain times of the month.

– That claim from “positive psychology” of the “critical positivity ratio” of 2.9013.

– That psychologist’s claim that he could predict divorces with 83 percent accuracy, after meeting with a couple for just 15 minutes.

And lots more.

There’s something hypnotizing about those numbers. Too good to check, I guess.

Let’s try this again: It is nonsense to say that we don’t know whether a specific weather event was affected by climate change. It’s not just wrong, it’s nonsensical.

This post is by Phil Price, not Andrew.

If you write something and a substantial number of well-intentioned readers miss your point, the problem is yours. Too many people misunderstood what I was saying a few days ago in the post “There is no way to prove that [an extreme weather event] either was, or was not, affected by global warming” and that’s my fault. Let me see if I can do better.

Forget about climate and weather for a moment. I want to talk about bike riding.

You go for a ride with a friend. You come to a steep, winding climb and you ride up side by side. You are at the right side of the road, with your friend to your left, so when you come to a hairpin turn to the right you have a much steeper (but shorter) path than your friend for a few dozen feet. Later you come to a hairpin to the left, but the situation isn’t quite reversed because you are both still in the right lane so your friend isn’t way over where the hairpin is sharpest and the slope is steepest. You ride to the top of the hill and get to a flat section where you are riding side-by-side.  There is some very minor way in which you can be said to have experienced a ‘different’ climb, because even though you were right next to each other you experienced different slopes at different times, and rode slightly different speeds in order to stay next to each other as the road curved, and in fact you didn’t even end up at exactly the same place because your friend is a few feet from you.  You haven’t done literally the same climb, in the sense that a man can’t literally step twice in the same river (because at the time of the second step the river is not exactly the same, and neither is the man) but if someone said ‘how was your climb affected by your decision to ride on the right side of the lane rather than the middle of the lane’ we would all know what you mean; no reasonable person would say ‘if I had done the climb in the middle rather than the right it would have been a totally different climb.’

You continue your ride together and discuss what route to take where the road splits ahead. One road will take you to a series of hills to the north, the other will take you to a series of hills to the south. You decide to go south. You ride over some hills, along some flat stretches, and over more hills. Three hours into the ride you are climbing another hill, the toughest one yet — long, with some very steep stretches and lots of hairpin turns. As you approach the top, your riding companion says “how would this climb have been different if we had gone north instead of south?”  What is the right answer to this question? Here are some possibilities: (1) “There is no way to prove that this climb either was, or was not, affected by our decision to go south instead of north.” (2) “The question doesn’t make sense: we wouldn’t have encountered this climb at all if we had decided to go north.” (3) “This climb was definitely affected by our decision to go south instead of north, but unless we knew exactly what route we would have taken to the north we can’t know exactly how it was affected.”

1 is just wrong (*). If you had gone north instead of south you might still have had a steep climb around hour 3; maybe it would even have been steeper than the one you are on now, but there is no way it could have been the same climb . . . and the difference is not a trivial one like the “twice in the same river” example.

2 is the right answer.

3 is not the right answer to the question that was asked, but maybe it’s the right answer to what the questioner had in mind. Maybe when they said “how would this climb have been different” they really meant something like: if you had gone the other way, “what would the biggest climb have been like,” or “what sort of hill would we be climbing just about now”?

I think you see where I’m going with this (since I doubt you really forgot all about climate and weather like I asked you to). On a bike ride you are on a path through physical space, but suppose we were talking about paths through parameter space instead. In this parameterization, long steep climbs correspond to hurricane conditions, and going south instead of north corresponds to experiencing a world with global warming instead of one without. In the global warming world, we don’t experience ‘the same’ weather events we would have experienced otherwise, just in a slightly different way (like climbing the same hill in the middle of the lane rather than at the side of the lane); we experience entirely different weather events (like climbing different hills).

The specific quote that I cited in my previous post was about Hurricane Katrina. It makes no sense to say we don’t know whether Hurricane Katrina was affected by global warming, just as it would make no sense to say we don’t know whether our hill climb was affected by our decision to go south instead of north. In the counterfactual world New Orleans might have still experienced a hurricane, maybe even on the same day, but it would not have been the same hurricane, just as we might encounter a hill climb on our bike trip at around the three-hour mark whether we went south or north, but it would not have been the same climb.

No analogy is perfect, so please don’t focus on ways in which the analogy isn’t ‘right’. The point is that we are long past the point where global warming is a ‘butterfly effect’ and we can reasonably talk about how individual weather events are affected by it. We aren’t riding up the same road but in a slightly different place, we are in a different part of the territory.

(*) I’m aware that if you had ridden north instead of south you could have circled back and climbed this same climb. Also, it’s possible in principle that some billionaire could have paid to duplicate ‘the same’ climb somewhere to the north — grade the side of a mountain to make this possible, shape the land and the road to duplicate the southern climb, etc.  But get real. And although these are possible for a bike ride, at least in principle, they are not possible for the parameter space of weather and climate that is the real subject of this post.

This post is by Phil, not Andrew.

Exchange with Deborah Mayo on abandoning statistical significance

The philosopher wrote:

The big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis.

Mayo is referring to, among other things, the proposal to “redefine statistical significance” as p less than 0.005. My colleagues and I do not actually like that idea, so I responded to Mayo as follows:

I don’t know what the big moves are, but my own perspective, and I think that of the three authors of the recent article being discussed, is that we should not be “rejecting” at all, that we should move beyond the idea that the purpose of statistics is to reject the null hypothesis of zero effect and zero systematic error.

I don’t want to ban speech, and I don’t think the authors of that article do, either. I’m on record that I’d like to see everything published, including Bem’s ESP paper data and various other silly research. My problem is with the idea that rejecting the null hypothesis tells us anything useful.

Mayo replied:

I just don’t see that you can really mean to say that nothing is learned from finding low-p values, especially if it’s not an isolated case but time and again. We may know a hypothesis/model is strictly false, but we do not yet know in which way we will find violations. Otherwise we could never learn from data. As a falsificationist, you must think we find things out from discovering our theory clashes with the facts–enough even to direct a change in your model. Even though inferences are strictly fallible, we may argue from coincidence to a genuine anomaly & even to pinpointing the source of the misfit.So I’m puzzled.
I hope that “only” will be added to the statement in the editorial to the ASA collection. Doesn’t the ASA worry that the whole effort might otherwise be discredited as anti-science?

My response:

The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster. See here.

Then Mayo:

I know all this. I’ve been writing about it for donkey’s years. But that’s a testing fallacy. N-P and Fisher couldn’t have been clearer. That does not mean we learn nothing from a correct use of tests. N-P tests have a statistical alternative and at most one learns, say, about a discrepancy from a hypothesized value. If a double blind RCT clinical trial repeatedly shows statistically significant (small p-value) increase in cancer risks among exposed, will you deny that’s evidence?

Me:
I don’t care about the people, Neyman, Fisher, and Pearson. I care about what researchers do. They do something called NHST, and it’s a disaster, and I’m glad that Greenland and others are writing papers pointing this out.

Mayo:
We’ve been saying this for years and years. Are you saying you would no longer falsify models because some people will move from falsifying a model to their favorite alternative theory that fits the data? That’s crazy. You don’t give up on correct logic because some people use illogic. The clinical trials I’m speaking about do not commit those crimes. would you really be willing to say that they’re all bunk because some psychology researchers do erroneous experiments and make inferences to claims where we don’t even know we’re measuring the intended phenomenon?
Ironically, by the way, the Greenland argument only weakens the possibility of finding failed replications.

Me:
I pretty much said it all here.

I don’t think clinical trials are all bunk. I think that existing methods, NHST included, can be adapted to useful purposes at times. But I think the principles underlying these methods don’t correspond to the scientific questions of interest, and I think there are lots of ways to do better.

Mayo:
And I’ve said it all many times in great detail. I say drop NHST. It was never part of any official methodology. That is no justification for endorsing official policy that denies we can learn from statistically significant effects in controlled clinical trials among other legitimate probes. Why not punish the wrong-doers rather than all of science that uses statistical falsification?

Would critics of statistical significance tests use a drug that resulted in statistically significant increased risks in patients time and again? Would they recommend it to members of their family? If the answer to these questions is “no”, then they cannot at the same time deny that anything can be learned from finding statistical significance.

Me:
In those cases where NHST works, I think other methods work better. To me, the main value of significance testing is: (a) when the test doesn’t reject, that tells you your data are too noisy to reject the null model, and so it’s good to know that; (b) in some cases as a convenient shorthand for a more thorough analysis; and (c) for finding flaws in models that we are interested in (as in chapter 6 of BDA). I would not use significance testing to evaluate a drug, or to prove that some psychological manipulation has a nonzero effect, or whatever, and those are the sorts of examples that keep coming up.

In answer to your previous email, I don’t want to punish anyone, I just think statistical significance is a bad idea and I think we’d all be better off without it. In your example of a drug, the key phrase is “time and again.” No statistical significance is needed here.

Mayo:
One or two times would be enough if they were well controlled. And the ONLY reason they have meaning even if it were time and time again is because they are well controlled. I’m totally puzzled as to how you can falsify models using p-values & deny p-value reasoning.

As I discuss through my book, Statistical Inference as Severe Testing, the most important role of the severity requirement is to block claims—precisely the kinds of claims that get support under other methods be they likelihood or Bayesian.
Stop using NHST—there’s a speech ban I can agree with. In many cases the best way to evaluate a drug is via controlled trials. I think you forget that for me, since any claim must be well probed to be warranted, estimations can still be viewed as tests.
I will stop trading in biotechs if the rule to just report observed effects gets passed and the responsibility that went with claiming a genuinely statistically significant effect goes by the board.

That said, it’s fun to be talking with you again.

Me:
I’m interested in falsifying real models, not straw-man nulls of zero effect. Regarding your example of the new drug: yes, it can be solved using confidence intervals, or z-scores, or estimates and standard errors, or p-values, or Bayesian methods, or just about anything, if the evidence is strong enough. I agree there are simple problems for which many methods work, including p-values when properly interpreted. But I don’t see the point of using hypothesis testing in those situations either—it seems to make much more sense to treat them as estimation problems: how effective is the drug, ideally for each person or else just estimate the average effect if you’re ok fitting that simpler model.

I can blog our exchange if you’d like.

And so I did.

Please be polite in any comments. Thank you.

P.S. Tomorrow’s post: My math is rusty.

I hate Bayes factors (when they’re used for null hypothesis significance testing)

Oliver Schultheiss writes:

I am a regular reader of your blog. I am also one of those psychology researchers who were trained in the NHST tradition and who is now struggling hard to retrain himself to properly understand and use the Bayes approach (I am working on my first paper based on JASP and its Bayesian analysis options). And then tonight I came across this recent blog by Uri Simonsohn, “If you think p-values are problematic, wait until you understand Bayes Factors.”

I assume that I am not the only one who is rattled by this (or I am the only one, and this just reveals my lingering deeper ignorance about the Bayes approach) and I was wondering whether you could comment on Uri’s criticism of Bayes Factors on your own blog.

My reply: I don’t like Bayes factors; see here. I think Bayesian inference is very useful, but Bayes factors are based on a model of point hypotheses that typically does not make sense.
To put it another way, I think that null hypothesis significance testing typically does not make sense. When Bayes factors are used for null hypothesis significance testing, I generally think this is a bad idea, and I don’t think it typically makes sense to talk about the probability that a scientific hypothesis is true.

More discussion here: Incorporating Bayes factor into my understanding of scientific information and the replication crisis. The problem is not so much with the Bayes factor as with the idea of null hypothesis significance testing.

Was Thomas Kuhn evil? I don’t really care.

OK, I guess I care a little . . . but when it comes to philosophy, I don’t really care about Kuhn’s personality or even what exactly he said in his books. I use Kuhn in my work, by which I mean that I use an idealized Kuhn, I take the best from his work (as I see it), the same way I use an idealized Lakatos and Popper, and the same way that Lakatos famously used an idealized Popper (Lakatos called him Popper2, I think it was).

Here’s what Shalizi and I wrote in our article:

We focus on the classical ideas of Popper and Kuhn, partly because of their influence in the general scientific culture and partly because they represent certain attitudes which we believe are important in understanding the dynamic process of statistical modelling.

Actually, we said “modeling,” but someone translated our article into British for publication. Anyway . . . we continue:

The two most famous modern philosophers of science are undoubtedly Karl Popper (1934/1959) and Thomas Kuhn (1970), and if statisticians (like other non-philosophers) know about philosophy of science at all, it is generally some version of their ideas. . . . We do not pretend that our sketch fully portrays these figures, let alone the literatures of exegesis and controversy they inspired, or even how the philosophy of science has moved on since 1970. . . .

To sum up, our views are much closer to Popper’s than to Kuhn’s. The latter encouraged a close attention to the history of science and to explaining the process of scientific change, as well as putting on the agenda many genuinely deep questions, such as when and how scientific fields achieve consensus. There are even analogies between Kuhn’s ideas and what happens in good data-analytic practice. Fundamentally, however, we feel that deductive model checking is central to statistical and scientific progress, and that it is the threat of such checks that motivates us to perform inferences within complex models that we know ahead of time to be false.

My point here is that, as applied statisticians rather than philosophers or historians, we take what we can use from philosophy, being open about our ignorance of most of the literature in that field. Just as applied researchers pick and choose statistical methods in order to design and analyze their data, we statisticians pick and choose philosophical ideas to help us understand what we are doing.

For example, we write:

In some way, Kuhn’s distinction between normal and revolutionary science is analogous to the distinction between learning within a Bayesian model, and checking the model in preparation to discarding or expanding it. Just as the work of normal science proceeds within the presuppositions of the paradigm, updating a posterior distribution by conditioning on new data takes the assumptions embodied in the prior distribution and the likelihood function as unchallengeable truths. Model checking, on the other hand, corresponds to the identification of anomalies, with a switch to a new model when they become intolerable. Even the problems with translations between paradigms have something of a counterpart in statistical practice; for example, the intercept coefficients in a varying-intercept, constant-slope regression model have a somewhat different meaning than do the intercepts in a varying-slope model.

This is all fine, but we recognize:

We do not want to push the analogy too far, however, since most model checking and model reformulation would by Kuhn have been regarded as puzzle-solving within a single paradigm, and his views of how people switch between paradigms are, as we just saw, rather different.

We’re trying to make use of the insights that Kuhn brought to bear, without getting tied up in what Kuhn’s own position was on all this. Kuhnianism without Kuhn, one might say.

Anyway, this all came up because Mark Brown pointed me to this article by John Horgan reporting that Errol Morris thinks that Kuhn was, in Horgan’s words, “a bad person and bad philosopher.”

Errol Morris! He’s my hero. If he hates Kuhn, so do I. Or at least that’s my default position, until further information comes along.

Actually, I do have further information about Kuhn. I can’t say I knew the guy personally, but I did take his course at MIT. Actually, I just came to the first class and dropped it. Hey . . . didn’t I blog this once? Let me check . . . yeah, here it is, from 2011—and I wrote it in response to Errol Morris’s story, the first time I heard about it! I’d forgotten this entirely.

There’s one thing that makes me a little sad. Horgan writes that Morris’s book features “interviews with Noam Chomsky, Steven Weinberg and Hilary Putnam, among other big shots.” I think there must be people with more to say than these guys. This may be a problem that once an author reaches the celebrity stratosphere, he will naturally mingle with other celebrities. If I’m reading a book about philosophy of science, I’d rather see an interview with Steve Stigler, or Josh Miller, or Deborah Mayo, or Cosma Shalizi, or various working scientists with historical and philosophical interests. But it can be hard to find such people, if you’re coming from the outside.

Here’s a puzzle: Why did the U.S. doctor tell me to drink more wine and the French doctor tell me to drink less?

This recent post [link fixed], on the health effects of drinking a glass of wine a day, reminds me of a story:

Several years ago my cardiologist in the U.S. recommended that I drink a glass of red wine a day for health reasons. I’m not a big drinker—probably I average something less than 100 glasses of wine a year—but when I remember, I’ll drink a glass or two with dinner when it’s available. I don’t love the taste of wine, but some of it is OK, and I already preferred the taste of the red to the white, so at least that worked out.

Anyway, a while after receiving this recommendation, I spent a year in France, and I had to see a doctor and get a physical exam there as a condition of my work permit. The doctor asked me a bunch of questions (and spoke slowly enough that I could converse), including how much did I drink? I said I drink a glass of red wine a day on the recommendation of my cardiologist (ummm, I probably said “the doctor for my heart” or something like that). The French doctor replied that I should stop drinking as it’s bad for my foie.

I was taken aback. The U.S. doctor (not originally from this country, but still) said I should drink more; the French doctor said I should drink less. Some of this could be attributed to specialization: the cardiologist focuses on the heart, while the general practitioner thinks in terms of total risk. But, still, it was a surprise: I’d think it would be the French doctor who’d be more positive about the effects of a daily drink of wine.

OK, this is N=2 so we can learn almost nothing from this story: it could just be that this French doctor hates alcohol for other reasons, for example. Nonetheless, I’ll spin a statistical tale that goes like this:

Both doctors are being Bayesian. The U.S. doctor, upon hearing from me that I drank occasionally but rarely, inferred that: (a) I don’t drink a lot, and (b) I could drink a bit more without worry that I’d start to drink heavily. In contrast, all the French doctor heard was that I drink a glass of wine daily. From this she could’ve inferred that I might be a heavy drinker already and just not admitting it, or that if I was given any encouragement, I might drink to excess. In addition to all that, alcohol consumption is higher in France than the U.S., so the French doctor is probably used to telling her patients to drink less.

What did I actually do, you might ask? I split the difference. I continued to drink red wine but didn’t make such an effort to drink it every day.

Here’s why you need to bring a rubber band to every class you teach, every time.

A student discussion leader in every class period

Recently we’ve been having a student play the role of discussion leader in class. That is, each class period we get a student to volunteer to lead the discussion next time. This student takes special effort to be prepared, and I’ve seen three positive results:

– At least one student has thought hard about the readings, and that alone can take the group discussion to a higher level.

– With student-led discussion, I talk less and students talk more. I think they learn more, and I’m always there to bring up points, answer questions, and guide the discussion as needed.

– The other students in the class—those who are not the discussion leader today—know they’ll have to do it themselves later on, so they have more of a sense of ownership and active participation, compared to the traditional instructor-led session.

But that’s not actually what I want to talk about right now. What I want to do is share something I learned from Ben Levine, our discussion leader today (actually the same class where we had the Reinhart and Rogoff questions that I just blogged).

The rubber band

What happened was this.

As usual, I arrived about five minutes early and I kicked off an informal conversation about statistics as the students trickled into the room. The discussion continued as class began, and I opened up the Jitt responses. At this point Ben raised his hand and reminded me that he, not I, was the discussion leader! So he came up to the front of the room and the discussion continued for a moment.

At this point Ben recognized that the conversation was going all over the place. We were talking about important topics, highly relevant to the course as a whole, but we’d diverged from that week’s topic. So he stopped and said: OK, let’s get back on topic. Let’s first discuss today’s topic, then we can return to the conversation we’ve been having, keeping in mind how it relates to our main subject.

This was really helpful, and I realized I should be doing it all the time in my own classes. We have great discussions in all my classes, but often we lose the thread. And if the conversation isn’t tied to the main flow of course material, it can be forgotten. Ideas are much more helpful when connected to other ideas we’ve been thinking about.

What Ben did was use a rubber band. Not a physical rubber band; a conceptual rubber band, tied on one end to the day’s scheduled syllabus material and tied on the other end to the class discussion. Digressions are fine, but you have to keep that connection, you have to keep springing back to the main points of the class.

This was great, and I’m gonna try doing this every time I teach. I actually already knew about this when delivering a lecture: when I give a talk, I like to pause from time to time and explain how all the pieces fit together, so the audience can see the details within the context of the larger structure. But, until now, I hadn’t thought of this as a way of keeping class discussions relevant.

Also, as with many risk-limiting tricks, I suspect that the stability attained by the rubber-band technique might well allow discussion to flow even more freely: as a student, you can feel more comfortable moving to a digression, if you are secure in the knowledge that the discussion leader will connect this to the key ideas you’re trying to learn that day.

Things I didn’t have time to talk about yesterday at the Metascience conference

– Sane, reasonable, . . . and wrong. Here I was going to talk about some reasonable and moderate-sounding recommendations that I think miss some key issues.

– The fractal nature of scientific revolution. I’ve talked about this a lot, for example in 2005, 2007, and 2012.

– Workflow. My plan here was to talk about my applied statistics workflow (on my mind because I just wrote up a longer version of the case study on golf putting) and then discuss connections to social science workflow.

– The insider/outsider perspective. Ironic that I got this idea from Seth, given that he was a sucker for junk science.

On the plus side, I did get a chance to talk about:

– Imagine a world . . .

– The piranha principle

– Worse than Freud

– The fallacy of the one-sided bet

– Social science as we know it is impossible, featuring 16

– The vicious cycle, and it’s our fault

– Statistics is hard

– Taking the lessons of metascience and applying them to science.

A world of Wansinks in medical research: “So I guess what I’m trying to get at is I wonder how common it is for clinicians to rely on med students to do their data analysis for them, and how often this work then gets published”

In the context of a conversation regarding sloppy research practices, Jordan Anaya writes:

It reminds me of my friends in residency. Basically, while they were med students for some reason clinicians decided to get them to analyze data in their spare time. I’m not saying my friends are stupid, but they have no stats or programming experience, so the fact that they are the key data analyst for these data sets concerns me, especially considering they don’t have much free time to devote to the work anyways. So I guess what I’m trying to get at is I wonder how common it is for clinicians to rely on med students to do their data analysis for them, and how often this work then gets published.

I asked Jordan if I could post this observation, and he added:

Here’s some more details. I recently got an email from a friend saying they needed my help. They had previously taken a short introduction to R course and used those skills to analyze some data for a clinician. However, recently that clinician sent them some more data and now their code no longer worked so they asked me for help [“I need to do a proportion and see if it is significant . . .”]

They gave me the file and I confirmed R couldn’t read it (at least with the default read.csv—I’m not an R expert), so I looked at it in Python. There were extra rows, some missing data, and the job variable sometimes had commas in it so you couldn’t use comma as a delimiter. Anyways, I provided them with the information they wanted, and didn’t mention that some of their requests sounded like p-hacking.

My other friend is familiar with my Wansink work and what p-hacking is, and he told me he recently ran a bunch of tests on a data set for a clinician so I told him he p-hacked for that clinician, and he said that yes, he did.

Since I didn’t do my internship years I don’t know how these collaborations come about. I imagine a clinician needs help analyzing data and doesn’t have grad students so they turn to med students, and the med students don’t want to turn down a chance at getting their name on a paper.

As with Wansink, the big problem is not p-hacking but just a lack of understanding of the data, a disaster of the data collection/processing/analysis steps, and a lack of adult responsibility throughout.

P.S. Anaya adds:

I have another update for you. I talked with my friend today and they have refused to do any more work on the project! This was supposed to be a short analysis during their 4th year of med school but I guess there was some data handling problems so the clinician got someone else to get the data out of the database which I guess is what I saw last year (which is when the med student was now a first year resident and didn’t have time to work on the project). Anyways now the med student (now a second year resident) doesn’t trust the data enough to devote any more of their time to the project! So we have people who can’t handle data giving their pu pu platter to a med student (now second year resident) who doesn’t know how to program. I’m really starting to understand what happened with some of Wansink’s papers now.

It’s not just p=0.048 vs. p=0.052

Peter Dorman points to this post on statistical significance and p-values by Timothy Taylor, editor of the Journal of Economic Perspectives, a highly influential publication of the American Economic Association.

I have some problems with what Taylor writes, but for now I’ll just take it as representing a certain view, the perspective of a thoughtful and well-meaning social scientist who is moderately informed about statistics and wants to be helpful.

Here I’ll just pull out one quote, which points to a common misperception about the problems of p-values, or what might be called “the p-value transformation,” which takes an estimate and standard error and transforms it to a tail-area probability relative to a null hypothesis. Taylor writes:

[G]iven the realities of real-world research, it seems goofy to say that a result with, say, only a 4.8% probability of happening by chance is “significant,” while if the result had a 5.2% probability of happening by chance it is “not significant.” Uncertainty is a continuum, not a black-and-white difference.

First, I don’t know why he conditions on “the realities of real-world research” here. Even in idealized research, the p-value is a random variable, and it would be goofy to draw a sharp line between p = 0.048 and p = 0.052, just as it would be goofy to draw a sharp line between z-scores of 1.94 and 1.98.

To formalize this slightly, “goofy” = “not an optimal decision rule or close to an optimal decision rule under any plausibly reasonable utility function.”

Also, to get technical for a moment, the p-value is not the “probability of happening by chance.” But we can just chalk that up to a casual writing style.
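To make that last point concrete, the “p-value transformation” for a normal-theory estimate can be written out in a few lines. This is a hypothetical helper of my own (in Python rather than the R used elsewhere in the post), using only the standard library:

```python
# A minimal sketch of the "p-value transformation": it takes an estimate
# and a standard error and returns a two-sided tail-area probability
# relative to a null hypothesis of zero effect.
from statistics import NormalDist

def p_value(estimate, se):
    """Two-sided normal-theory p-value for the null of zero effect."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))

print(round(p_value(1.96, 1.0), 3))  # 0.05
```

The point of writing it this way is that the p-value is just a deterministic transformation of the z-score, so any threshold on one is a threshold on the other.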

My real problem with the above-quoted statement is not the details of wording but rather that I think it represents a mistake in emphasis.

This business of 0.048 vs. 0.052, or 0.04 vs. 0.06, etc.: I hear it a lot as a criticism of p-values, and I think it misses the point. If you want a bright-line rule, you need some threshold. There’s no big difference between 18 years old, and 17 years and 364 days old, but if you’re in the first situation you get to vote, and if you’re in the second situation you don’t. That doesn’t mean that there should be no age limit on voting.

No, my problem with the 0.048 vs. 0.052 thing is that it way, way, way understates the problem.

Yes, there’s no stable difference between p = 0.048 and p = 0.052.

But there’s also no stable difference between p = 0.2 (which is considered non-statistically significant by just about everyone) and p = 0.005 (which is typically considered very strong evidence.)

Just look at the z-scores:

> qnorm(1 - c(0.2, 0.005)/2)
[1] 1.28 2.81

The (two-sided) p-values of 0.2 and 0.005 correspond to z-scores of 1.3 and 2.8. That is, a super-duper-significant p = 0.005 is only 1.53 standard errors higher than an ignore-it-pal-there’s-nothing-going-on p = 0.2.

But it’s even worse than that. If these two p-values come from two identical experiments, then the standard error of their difference is sqrt(2) times the standard error of each individual estimate, hence that difference in p-values itself is only (2.81 – 1.28)/sqrt(2) = 1.1 standard errors away from zero.

To say it again: it is completely consistent with the null hypothesis to see p-values of 0.2 and 0.005 from two replications of the same damn experiment.

So. Yes, it seems goofy to draw a bright line between p = 0.048 and p = 0.052. But it’s also goofy to draw a bright line between p = 0.2 and p = 0.005. There’s a lot less information in these p-values than people seem to think.
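For readers who want to check the arithmetic, here is the whole calculation written out in Python (the post’s snippet uses R’s qnorm; NormalDist().inv_cdf is the standard-library equivalent):

```python
# z-scores for two-sided p-values of 0.2 and 0.005
from math import sqrt
from statistics import NormalDist

z = [NormalDist().inv_cdf(1 - p / 2) for p in (0.2, 0.005)]
print([round(x, 2) for x in z])           # [1.28, 2.81]

# gap between "nothing going on" and "very strong evidence",
# in units of one standard error:
print(round(z[1] - z[0], 2))              # 1.53

# if the two p-values come from independent replications of the same
# experiment, the se of their difference is sqrt(2) times larger:
print(round((z[1] - z[0]) / sqrt(2), 2))  # 1.08
```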

So, when we say that the difference between “significant” and “not significant” is not itself statistically significant, “we are not merely making the commonplace observation that any particular threshold is arbitrary—for example, only a small change is required to move an estimate from a 5.1% significance level to 4.9%, thus moving it into statistical significance. Rather, we are pointing out that even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities.”

Calibration and sharpness?

I really liked this paper, and am curious what other people think before I base a grant application around applying Stan to this problem in a machine-learning context.

Gneiting et al. define what I think is a pretty standard notion of calibration for Bayesian models based on coverage, but I’m not 100% sure if there are alternative sensible definitions.

They also define a notion of sharpness, which for continuous predictions essentially means narrow posterior intervals, hence the name.

By way of analogy to point estimators, calibration is like unbiasedness and sharpness is like precision (i.e., inverse variance).

I seem to recall that Andrew told me that calibration is a frequentist notion, whereas a true Bayesian would just believe their priors. I’m not so worried about those labels here as about the methodological ramifications of taking the ideas of calibration and sharpness seriously.
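As a sketch of what taking these ideas seriously might look like computationally, here is a toy simulation (my own setup, not from Gneiting et al.) that estimates coverage-based calibration and average interval width for a simple conjugate normal model:

```python
# Toy check of calibration (coverage of posterior intervals) and
# sharpness (average interval width) by simulation from the prior.
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
n_sims = 2000
z90 = NormalDist().inv_cdf(0.95)  # multiplier for central 90% intervals

covered, widths = 0, 0.0
for _ in range(n_sims):
    theta = random.gauss(0, 1)  # true parameter drawn from the prior
    y = random.gauss(theta, 1)  # one observation
    # posterior under this conjugate model: N(y/2, 1/2)
    post_mean, post_sd = y / 2, sqrt(1 / 2)
    lo, hi = post_mean - z90 * post_sd, post_mean + z90 * post_sd
    covered += (lo < theta < hi)
    widths += hi - lo

coverage = covered / n_sims   # calibration: should be close to 0.90
sharpness = widths / n_sims   # sharpness: average interval width
print(round(coverage, 2), round(sharpness, 2))
```

Because the model is correct here, coverage comes out near the nominal 90%; a miscalibrated model would miss that target, and of two calibrated models the sharper (narrower-interval) one is preferred.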

The ladder of social science reasoning, 4 statements in increasing order of generality, or Why didn’t they say they were sorry when it turned out they’d messed up?


First the statistical point, then the background.

Statistical point

Consider the following two sentences from the abstract of the paper, Growth in a Time of Debt, published in early 2010 by Carmen Reinhart and Kenneth Rogoff:

[T]he relationship between government debt and real GDP growth is weak for debt/GDP ratios below a threshold of 90 percent of GDP. Above 90 percent, median growth rates fall by one percent, and average growth falls considerably more.

This passage corresponds to a ladder of four statements, which I’ll write in increasing order of generality:

(a) Their data show a pattern with a low correlation for debt/GDP ratios below 90%, and a strong negative correlation for debt/GDP ratios above 90%.

(b) The just-described pattern is not just a feature of past data; it also could be expected to continue into the future.

(c) The correlation between government debt and real GDP growth is low when debt/GDP ratios are low, and strongly negative when debt/GDP ratios are high.

(d) Too much debt is bad for a national economy, and the United States as of 2009 had too much debt.

– Step (a) is a simple data summary.

– You can get from (a) to (b) with simple statistical modeling, as long as you’re willing to assume stationarity and independence in some way. To put it most simply: you assume the future and the past are samples from a common distribution, or you fit some time-series model which is essentially making stationarity and independence assumptions on the residuals.

– Step (c) is a weaker, thus more general version of (b).

– And you can get from (c) to (d) by assuming the average patterns of correlation apply to the particular case of the United States in 2009.

– Completely independently, you can believe (d) in the absence of this specific evidence, just based on some macroeconomic theory.
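The move from (a) to (b) can be sketched in code. The data below are hypothetical (my construction, not the Reinhart-Rogoff dataset, with the threshold pattern built in by assumption): step (a) is a plain summary of the sample, and step (b) is a bootstrap that treats past and future as draws from a common distribution, which is exactly the stationarity assumption doing the work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical country-year data (NOT the Reinhart-Rogoff dataset):
# debt/GDP ratios and real GDP growth, with a weak relationship below
# the 90% threshold and a negative one above it built in by construction.
n = 500
debt = rng.uniform(10, 150, n)
growth = np.where(debt < 90,
                  3.0 + rng.normal(0, 2, n),
                  3.0 - 0.08 * (debt - 90) + rng.normal(0, 2, n))

# Step (a): a simple data summary -- correlations in the two regimes.
low, high = debt < 90, debt >= 90
r_low = np.corrcoef(debt[low], growth[low])[0, 1]
r_high = np.corrcoef(debt[high], growth[high])[0, 1]

# Step (b): assume past and future are samples from a common
# distribution (stationarity + independence), and bootstrap to see
# whether the high-debt pattern is stable enough to project forward.
data = np.c_[debt[high], growth[high]]
boot = []
for _ in range(1000):
    s = data[rng.integers(0, len(data), len(data))]  # resample rows
    boot.append(np.corrcoef(s[:, 0], s[:, 1])[0, 1])
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"corr below 90%: {r_low:.2f}, above 90%: {r_high:.2f}")
print(f"bootstrap 95% interval for high-debt corr: [{lo:.2f}, {hi:.2f}]")
```

Everything after the `print` of step (a) rests on the stationarity assumption; nothing in the data can tell you whether that assumption is right, which is why the later steps of the ladder need theory.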

So what happened?

It turns out that Reinhart and Rogoff made an error in their data processing, so step (a) was just wrong. Upon learning this, you might think, Game Over, but they stuck with (d), (c), and much of (b).

That’s fine—they can feel free to argue for (d) based on theoretical grounds alone, or based on some direct modeling of the U.S. economy, without reference to those historical data. They’d lose something, though, because in their published article they wrote:

Our approach here is decidedly empirical, taking advantage of a broad new historical dataset on public debt . . . Prior to this dataset, it was exceedingly difficult to get more than two or three decades of public debt data even for many rich countries, and virtually impossible for most emerging markets. Our results incorporate data on 44 countries spanning about 200 years. . . .

So if the data got screwed up, you’re kicking away the strongest leg of the stool.

And, given the importance of the published claims (essentially point (d) above) and the relevance of their unique dataset to making these claims, you’d think the authors should’ve made this broad new dataset publicly available from the start. Given the importance of this argument to the U.S. economy, one might argue they had a duty to share their data, so that other researchers could evaluate the data claim (a) and the reasonableness of the steps taking us from (a) to (b) to (c) to (d).


We were discussing the Reinhart and Rogoff story in class today, and the students had two questions:

1. How could it be that the authors of this very influential paper did not share their data?

2. Why did Reinhart and Rogoff never apologize for not sharing their data?

My response to question 1 was quick; answering question 2 took longer, and brought up some interesting thoughts about scientific modeling.

First, question 1. This one’s easy to answer. Data sharing takes work, so we don’t typically do it unless it’s required, or habitual, or we’re doing it as a public service. Views about data sharing have changed since “Growth in a Time of Debt” was published back in 2010. But even now I usually don’t get around to posting my data. I do send data to people when asked, but that’s not as good as posting so anyone can download whenever they want. There also can be legal, ethical, business, or security reasons for not sharing data, but none of these concerns arose for the Reinhart and Rogoff paper. So, the quick answer to question 1: They didn’t share their data, because people just didn’t generally share data back then. The data were public, so if anyone else wanted to reproduce the dataset, they could put in the work themselves. Attitudes have changed, but back in 2010, that’s how it usually was: If you wanted to replicate someone’s study, the burden was on you to do it all, to figure out every step.

Now on to question 2. Given that Reinhart and Rogoff didn’t share their data, and they did make a mistake which would surely have been found out years earlier had the data been shared all along, and given that the published work reportedly had a big impact on policy, why didn’t they feel bad about not sharing the data all along? Even if not-sharing-data is an “everybody does it” sort of thing, you’d still think the authors would, in retrospect, regret not just making all their files available to the world right away. But we didn’t see this reaction from Reinhart and Rogoff. Why?

My answer has to do with the different steps of modeling described in the first part of this post. I have no idea what the authors of this paper were thinking when they responded to the criticism, but here are a couple of possibilities:

Suppose you start by believing that (c) and (d) are true, and then you find data that show (a), and you convince yourself that your model deriving (b) is reasonable. Then you have a complete story and you’re done.

Or, suppose you start by disbelieving (c) and (d), but then you analyze your data and conclude (a) and (b): This implies that (c) is correct, and now you’re convinced of (d). Meanwhile you can adjust your theory so that (c) and (d) make perfect sense.

Now someone goes to the trouble of replicating your analysis, and it turns out you got (a) wrong. What do you do?

At this point, one option would be to toss the whole thing out and start over: forget about (c) and (d) until you’ve fixed (a) and (b).

But another option is to stick with your theory, continue believing (c) and (d), and just adjust (a) and (b) as needed to fit the data.

If you take that latter option, the spreadsheet error and all the questionable data-coding choices don’t really matter so much.

I think this happens a lot

I think this happens a lot in research:

(a) Discovery of a specific pattern in data;

(b) Inference of same specific pattern in the general population;

(c) Blurring this specific pattern into a stylized fact, assumed valid in the general population;

(d) Assumed applicability in new cases, beyond the conditions under which the data were gathered.

Then if there’s a failed replication, or a data problem, or a data analysis problem that invalidates (a) or (b), researchers still hang on to (c) and (d). Kind of like building a ladder to the sky and then pulling the ladder up after you so you can climb even higher.

Consider power pose, for example:

(a) Under particular experimental conditions, people in an experiment who held the “power pose” had different measurements of certain hormones and behaviors, on average, compared to people in the control group.

(b) P-value less than 0.05 was taken as evidence that the observed data differences represented large causal effects in the general population.

(c) Assumption that power posing (not just the specific instructions in that one experiment) has general effects on power and social behavior (not just the specific things measured in that study).

(d) Statement that the average patterns represented by (c) will apply to individual people in job interviews and other social settings.

A series of failed replications cast doubt on the relevance of (a), and statistical reasoning revealed problems with the inferences in (b); furthermore, the first author of the original paper revealed data problems which further weakened (a). But, in the meantime, (c) and (d) became popular, and people didn’t want to let them go.
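The statistical problem with step (b) can be seen in a small simulation of my own construction (not an analysis of the actual power-pose data): when the true effect is small and the study is noisy, the estimates that happen to clear p < 0.05 are, by selection, large overestimates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Suppose the true average effect is small (0.1 sd) and the study is
# small (n = 20 per group), plausible numbers for this sort of experiment.
true_effect, n, sims = 0.1, 20, 10_000
exaggeration = []
for _ in range(sims):
    treat = rng.normal(true_effect, 1, n)
    ctrl = rng.normal(0, 1, n)
    t, p = stats.ttest_ind(treat, ctrl)
    if p < 0.05 and treat.mean() > ctrl.mean():
        # Record how much the significant estimate overstates the truth.
        exaggeration.append((treat.mean() - ctrl.mean()) / true_effect)

# Conditional on statistical significance, the published estimate is a
# large multiple of the true effect: the filter selects for overestimates.
print(f"significant in {len(exaggeration) / sims:.1%} of simulations")
print(f"average exaggeration factor: {np.mean(exaggeration):.1f}x")
```

The point is not that the true effect is zero; it’s that a significance filter applied to a noisy study guarantees that the estimates you see are inflated, which undermines the jump from (b) to (c).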

And, indeed, the claims about economic growth and government debt, or power pose, or ESP, or various other unproven theories out there, could be correct. Statements (c) and (d) could be true, even if they were derived from mistakes in (a) and (b). This sort of thing happens all the time. But, without the strong backing of (a) and (b), our beliefs in (c) and (d) are going to depend much more on theory. And theory is tricky: often the very same theory that supports (c) and (d), can also support their opposite. These are theories that Jeremy Freese calls “more vampirical than empirical—unable to be killed by mere evidence.”

Once you’re all-in on (c) and (d), you can just park your beliefs there forever. And, if that’s where you are, then when people point out problems with (a) and (b), you’re likely to react with annoyance rather than gratitude toward the people who, from a scientific standpoint, are doing you a favor.

My talk at the Metascience symposium Fri 6 Sep

The meeting is at Stanford, and here’s my talk:

Embracing Variation and Accepting Uncertainty: Implications for Science and Metascience

The world would be pretty horrible if your attitude on immigration could be affected by a subliminal smiley face, if elections were swung by shark attacks and college football games, if how you vote depended on the day within your monthly cycle, etc. Fortunately, there is no good evidence for these and other high-profile claims about the effects of apparently irrelevant stimuli on social and political attitudes and behaviors. Indeed, for theoretical reasons, we argue that it is not possible for these large and persistent effects to coexist in the real world. But if the sorts of effects being studied vary greatly by person and scenario, then simple experiments will not yield reliable estimates of effect sizes. It is necessary to instead embrace variation, which, in turn, requires accepting uncertainty. This has implications for the practice of science and for the proper understanding of replication and other aspects of metascience.

More golf putting, leading to a discussion of how prior information can be important for an out-of-sample prediction or causal inference problem, even if it’s not needed to fit existing data

Steve Stigler writes:

I saw a piece on your blog about putting. It suggests to me that you do not play golf, or you would not think this was a model. Length is much more important than you indicate. I attach an old piece by a friend that is indeed the work of a golfer!

The linked article is called “How to lower your putting score without improving,” and it’s by B. Hoadley and published in 1994. Hoadley’s recommendation is “to target a distance beyond the hole given by the formula: [Two feet]*[Probability of sinking the putt].” Sounds reasonable. And his model has lots of geometry.

Anyway, what I responded to Steve was that the simple model fits just about perfectly up to 20 feet! But for longer putts, it definitely helps to include distance, as was noted by Mark Broadie in the material he sent me.

The other thing is that there’s a difference between prediction and improvement (or, as we would say in statistics jargon, a difference between prediction and causal inference).

I was able to fit a simple one-parameter model that accurately predicted success rates while not including any consideration of the difficulty of hitting the ball the right length (not too soft and not too hard). At least for short distances, up to 20 feet, my model worked, I assume because it took the uncertainty in how hard the ball is hit, and interpreted it as uncertainty in the angle of the ball. For larger distances, these two errors don’t trade off so cleanly, hence the need for another parameter in the model.
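A minimal sketch of that one-parameter, angle-only model: the putt drops if the angular error is small enough that the ball’s center passes over the hole. I’m using the standard hole and ball dimensions and an angular standard deviation of about 1.5 degrees, a round number in the neighborhood of what fits professional data; treat it as an illustration, not a fitted value.

```python
import math

def p_success(distance_ft, sigma_degrees=1.5):
    """One-parameter golf-putting model: all error is in the angle of
    the putt; distance control (how hard the ball is hit) is ignored."""
    R = (4.25 / 2) / 12   # hole radius in feet (4.25-inch hole)
    r = (1.68 / 2) / 12   # ball radius in feet (1.68-inch ball)
    if distance_ft <= R - r:
        return 1.0        # ball is effectively on top of the hole
    # Largest angular error for which the ball still drops:
    threshold = math.asin((R - r) / distance_ft)
    sigma = math.radians(sigma_degrees)
    # Pr(|angle| < threshold) for a normal angle: 2*Phi(threshold/sigma) - 1
    return math.erf(threshold / (sigma * math.sqrt(2)))

for d in (2, 5, 10, 20):
    print(f"{d:2d} ft: {p_success(d):.2f}")
```

Note what the model leaves out: nothing in it says a 20-foot putt is harder to lash the right distance than a 5-foot putt, which is exactly the deficiency that shows up beyond 20 feet.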

But even for these short putts, my model can be improved if the goal is not just to predict success rates, but to figure out how to putt better—that is, to predict success rates if you hit the ball differently.

It’s an interesting example of the difference between in-sample and out-of-sample prediction (and, from a statistical standpoint, causal inference is just a special case of out-of-sample prediction), similar to the familiar problem of regression with collinear or near-collinear predictors, where a wide range of possible parameter vectors will fit the data well (that’s what it means to have a ridge in the likelihood), but if you want to apply the model to predict for new data off that region of near-collinearity it will be necessary to bite the bullet and think harder about what those predictors really mean.
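The collinearity point can be made concrete with a toy regression (hypothetical data, my construction): two very different coefficient vectors fit the observed data about equally well, because the predictors move together in-sample; the fits only come apart when you predict for new data off the collinear region.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two nearly collinear predictors: x2 is x1 plus a little noise.
n = 100
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)
y = x1 + x2 + rng.normal(0, 1, n)      # true coefficients are (1, 1)
X = np.c_[x1, x2]

# Two very different coefficient vectors sit on the "ridge in the
# likelihood" and fit the observed data almost equally well:
for beta in ([1.0, 1.0], [10.0, -8.0]):
    rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
    print(f"beta={beta}: in-sample RMSE = {rmse:.2f}")

# Off the collinear region (x2 no longer tied to x1), they diverge:
x1_new = rng.normal(0, 1, n)
x2_new = rng.normal(0, 1, n)
X_new = np.c_[x1_new, x2_new]
y_new = x1_new + x2_new + rng.normal(0, 1, n)
for beta in ([1.0, 1.0], [10.0, -8.0]):
    rmse = np.sqrt(np.mean((y_new - X_new @ beta) ** 2))
    print(f"beta={beta}: out-of-sample RMSE = {rmse:.2f}")
```

In-sample, the data can’t distinguish the two coefficient vectors; prior information about what the coefficients should plausibly be is what lets you choose between them before the out-of-sample data arrive.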

So . . . prior information can be important for an out-of-sample prediction or causal inference problem, even if it’s not needed to fit existing data.

The methods playroom: Mondays 11-12:30

Each Monday 11-12:30 in the Lindsay Rogers room (707 International Affairs Bldg, Columbia University):

The Methods Playroom is a place for us to work and discuss research problems in social science methods and statistics. Students and others can feel free to come to the playroom and work on their own projects, with the understanding that with many people with diverse interests in the room, progress can be made from different directions. The Playroom is not a homework help spot. It’s a place for us to have overlapping conversations about research, including work at early, middle, and late stages of projects (from design and data collection through analysis, interpretation, and presentation of results). It is a place to share different perspectives on quantitative work and connections between quantitative and qualitative work.

“There is no way to prove that [an extreme weather event] either was, or was not, affected by global warming.”

This post is by Phil, not Andrew.

It’s hurricane season, which means it’s time to see the routine disclaimer that no single weather event can be attributed to global warming. There’s a sense in which that is true, and a sense in which it is very wrong.

I’ll start by going way back to 2005. Remember Hurricane Katrina? A month afterwards some prominent climatologists (Rahmstorf, Mann, Benestad, Schmidt, and Connolley) wrote “Could New Orleans be the first major U.S. city ravaged by human-caused climate change? The correct answer–the one we have indeed provided in previous posts (Storms & Global Warming II, Some recent updates and Storms and Climate Change) –is that there is no way to prove that Katrina either was, or was not, affected by global warming. For a single event, regardless of how extreme, such attribution is fundamentally impossible.”

Well, that’s just nonsense. How on earth could Katrina not have been affected by global warming? There’s no way. You can argue that a major hurricane might have struck New Orleans on August 29, 2005 with or without global warming — sure, could be. Or maybe it would have happened a day earlier or a week earlier or a year earlier or a decade earlier. But sure, OK, maybe it would have happened on August 29, 2005. It’s extremely unlikely but not impossible. But there’s no way, literally no way, that it could have been the same storm. Katrina was definitely affected by global warming.

Does it matter? “We all know what they meant”? Well, I don’t know what they meant! And I’ve seen similar statements hundreds of times.

The weather is different than it would have been without global warming, every day and in every location. In some places and at some times the differences are large and in some places they are small. On some days there are fewer tropical cyclones in the Atlantic than there would have been, and on some days there are more; on other days there are exactly the same number of tropical cyclones but they are not in exactly the same places with exactly the same winds.

To say we don’t know whether a given city would have been destroyed by a hurricane on such-and-such a date in the absence of global warming, OK, fine, coincidences happen. But to say that we can’t say whether the storm was affected by global warming, that’s just wrong.  That goes for Hurricane Dorian, too.

I’ve been waiting 14 years to get this off my chest. I feel better.

This post is by Phil Price