
What does a “statistically significant difference in mortality rates” mean when you’re trying to decide where to send your kid for heart surgery?

Keith Turner writes:

I am not sure if you caught the big story in the New York Times last week about UNC’s pediatric heart surgery program, but part of the story made me interested to know if you had thoughts:

Doctors were told that the [mortality] rate had improved in recent years, but the program still had one star. The physicians were not given copies or summaries of the statistics, and were cautioned that the information was considered confidential by the Society of Thoracic Surgeons. In fact, surgeons at other hospitals often share such data with cardiologists from competing institutions.

While UNC said in a statement that it was “potentially reckless” to use the data to drive decision-making about where to refer patients, doctors across the country said it was simply one factor, among several, that should be considered.

In October 2017, three babies with complex conditions died after undergoing heart surgery at UNC. In a morbidity and mortality conference the next month, one cardiologist suggested that UNC temporarily stop handling some complex cases, according to a person who was in the room. Dr. Kibbe, the surgery department chairwoman, said in a recent interview that the hospital had never restricted surgeries.

In December, another child died after undergoing surgery a few months earlier for a complex condition.

The four deaths were confirmed by The Times, but are not among those disclosed by UNC. It has declined to publicly release mortality data from July 2017 through June 2018, saying that because the hospital had only one surgeon during most of that period, releasing the data would violate “peer review” protections.

Other information released by UNC shows that the hospital’s cardiac surgery mortality rate from July 2013 through June 2017 was 4.7 percent, higher than those of most of the 82 hospitals that publicly report similar information. UNC says that the difference between its rate and other hospitals’ is not statistically significant, but would not provide information supporting that claim. The hospital said the numbers of specific procedures are too low for the statistics to be a meaningful evaluation of a single institution.

Seems like a lot of these data for UNC are not going to be easy for one to get their hands on. But I wonder if there’s a story to be told with some of the publicly available data from peer institutions? And even in the absence of quantitative data from UNC’s program, I think there are a lot of interesting questions here (besides the ethical ones about hospitals at public institutions withholding mortality data): What does a “statistically significant difference in mortality rates” mean when you’re trying to decide where to send your kid for heart surgery?

My reply:

Good question. I think the answer has to be that there’s other information available. If the only data you have are the mortality rates, and you can choose any hospital, then you’d want to do an 8-schools-type analysis and then choose the hospital where surgery has the highest posterior probability of success. Statistical significance is irrelevant, as you have to decide anyway. But you can’t really choose any hospital, and other information must be available.
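Just to illustrate what an 8-schools-type analysis might look like here, below is a minimal sketch with entirely made-up death counts and case counts (not UNC's or anyone's actual data). It uses a quick empirical-Bayes beta-binomial approximation rather than a full hierarchical fit in Stan, but the shrinkage-then-decide logic is the same:

```python
import numpy as np

# Hypothetical deaths and case counts for five hospitals -- made-up
# numbers for illustration, not UNC's (or anyone's) actual data.
deaths = np.array([12, 8, 20, 5, 15])
cases = np.array([300, 250, 420, 180, 310])

# Empirical-Bayes beta-binomial shrinkage: estimate a common Beta(a, b)
# prior from the raw rates by method of moments, then pull each
# hospital's rate toward the overall mean -- the 8-schools idea in
# miniature.
rates = deaths / cases
m, v = rates.mean(), rates.var()
k = m * (1 - m) / v - 1          # implied prior "sample size" a + b
a, b = m * k, (1 - m) * k

# Posterior mean mortality rate for each hospital.
post_mean = (a + deaths) / (a + b + cases)

# Decision rule: go where the posterior expected mortality is lowest.
best = int(np.argmin(post_mean))
print(post_mean.round(4), "-> choose hospital", best)
```

Notice that no significance test appears anywhere: the raw rates get shrunk toward the common mean, and the decision falls out of comparing posterior expectations, which is the point about statistical significance being irrelevant when you have to decide anyway.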

In this case, I think the important aspects of decision making are not coming from the parents; rather, where this information is particularly relevant is for the hospitals’ decisions of how to run their programs and allocate resources, and for funders to decide what sorts of operations to subsidize. I’d think that “quality control” is the appropriate conceptual framework here.

Tomorrow’s post: What happens when frauds are outed because of whistleblowing?

What’s the evidence on the effectiveness of psychotherapy?

Kyle Dirck points us to this article by John Sakaluk, Robyn Kilshaw, Alexander Williams, and Kathleen Rhyner in the Journal of Abnormal Psychology, which begins:

Empirically supported treatments (or therapies; ESTs) are the gold standard in therapeutic interventions for psychopathology. Based on a set of methodological and statistical criteria, the APA [American Psychological Association] has assigned particular treatment-diagnosis combinations EST status and has further rated their empirical support as Strong, Modest, and/or Controversial. Emerging concerns about the replicability of research findings in clinical psychology highlight the need to critically examine the evidential value of EST research. We therefore conducted a meta-scientific review of the EST literature.

And here’s what they found:

This review suggests that although the underlying evidence for a small number of empirically supported therapies is consistently strong across a range of metrics, the evidence is mixed or consistently weak for many, including some classified by Division 12 of the APA as “Strong.”

It was hard for me to follow exactly which are the therapies that clearly work and which are the ones where the evidence is less clear. This seems like an important detail, no? Or maybe I’m missing the point. The difference between significant and not significant is not statistically significant, right?
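To see why the difference between significant and not significant is itself typically not statistically significant, here’s a toy calculation with invented numbers (two hypothetical studies, estimates and standard errors pulled out of thin air):

```python
import math

# Two hypothetical studies of the same treatment (invented numbers):
# study A: estimate 25, standard error 10  -> z = 2.5, "significant"
# study B: estimate 10, standard error 10  -> z = 1.0, "not significant"
za = 25 / 10
zb = 10 / 10

# But the *difference* between the two estimates is 15, with standard
# error sqrt(10^2 + 10^2) ~= 14.1, so its z-statistic is nowhere near
# the conventional 1.96 threshold.
diff = 25 - 10
se_diff = math.sqrt(10**2 + 10**2)
zdiff = diff / se_diff
print(round(za, 2), round(zb, 2), round(zdiff, 2))  # 2.5, 1.0, ~1.06
```

So a classification of treatments into “Strong” and not-strong based on which ones crossed a significance threshold can easily be distinguishing between studies whose results are statistically indistinguishable from each other.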

They also write:

Finally, though the trend towards increased statistical power in EST research is a positive development, there must be greater continued effort to increase the evidential value—broadly construed—of the EST literature . . . EST research may need to eschew the model of small trials. A combined workflow of larger multi-lab registered reports (Chambers, 2013; Uhlmann et al., 2018) coupled with thorough analytic review (Sakaluk, Williams, & Biernat, 2014) would yield the highest degree of confirmatory, accurate evidence for the efficacy of ESTs.

This makes sense. But, speaking generally, I think it’s important when talking about improved data collection to not just talk about increasing your sample size. Don’t forget measurement. I don’t know enough about psychotherapy to say anything specific, but there should be ways of getting repeated measurements on people, intermediate outcomes, etc., going beyond up-or-down summaries to learn more from each person in these studies.


So, there’s lots going on here, statistically speaking, regarding the very important topic of the effectiveness of psychotherapy.

First, I’d like to ask, Which treatments work and which don’t? But we can’t possibly answer that question. The right thing to do is to look at the evidence on different treatments and summarize as well as we can, without trying to make a sharp dividing line between treatments that work and treatments that don’t, or are unproven.

Second, different treatments work for different people, and in different situations. That’s the real target: trying to figure out what to do when. And, for reasons we’ve discussed, there’s no way we can expect to approach anything like certainty when addressing such questions.

Third, when gathering data and assessing evidence, we have to move beyond procedural ideas such as preregistration and the simple statistical idea of increasing N, and think about design and data quality linked to theoretical understanding and real-world goals.

I’ve put that last paragraph in bold, as perhaps it will be the most relevant to many of you who don’t study psychotherapy but are interested in experimental science.

Tomorrow’s post: What does a “statistically significant difference in mortality rates” mean when you’re trying to decide where to send your kid for heart surgery?

Break out the marshmallows, friends: Ego depletion is due to change sign!

In a paper amusingly titled, “Ego depletion may disappear by 2020,” Miguel Vadillo (link from Kevin Lewis) writes:

Ego depletion has been successfully replicated in hundreds of studies. Yet the most recent large-scale Registered Replication Reports (RRR), comprising thousands of participants, have yielded disappointingly small effects, sometimes even failing to reach statistical significance. Although these results may seem surprising, in the present article I suggest that they are perfectly consistent with a long-term decline in the size of the depletion effects that can be traced back to at least 10 years ago, well before any of the RRR on ego depletion were conceived. The decline seems to be at least partly due to a parallel trend toward publishing better and less biased research.

But I think Vadillo is totally missing the big story, which is that if you take this trend seriously—and you certainly should—then ego depletion is not just disappearing. It’s changing sign. By 2025 or so, the sign of ego depletion should be clearly negative.
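To take the extrapolation exactly as seriously as it deserves, here’s a sketch with invented effect sizes by year (not Vadillo’s actual data): fit a straight line to the decline and find the zero crossing.

```python
import numpy as np

# Invented illustrative effect sizes (standardized d) by year -- not
# Vadillo's actual meta-analytic estimates -- showing a steady decline.
years = np.array([2008, 2010, 2012, 2014, 2016, 2018])
d = np.array([0.60, 0.50, 0.42, 0.30, 0.22, 0.12])

# Least-squares linear trend and its zero crossing.
slope, intercept = np.polyfit(years, d, 1)
year_zero = -intercept / slope
print(f"trend: {slope:.3f} per year; hits zero around {year_zero:.1f}")
```

With these made-up numbers the trend crosses zero right around 2020 and keeps going, which is all you need to back out a clearly negative effect by 2025 or so.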

And by around the turn of the next century, ego depletion will be the largest effect known to psychology. Step aside, Stroop, and make room for the new boss on the block.

Stan saves Australians $20 billion

Jim Savage writes:

Not sure if you knew, but Stan was used in the Australian Productivity Commission’s review of the Australian retirement savings system. Their review will likely affect the regulation on $2 trillion of retirement savings, possibly saving Australians around $20-50 billion in fees over the next decade.

OK, we can now officially say that Stan, as open-source software, has recouped its societal investment.

Tomorrow’s post: What’s the evidence on the effectiveness of psychotherapy?

My talk at Yale this Thursday

It’s the Quantitative Research Methods Workshop, 12:00-1:15 p.m. in Room A002 at ISPS, 77 Prospect Street

Slamming the sham: A Bayesian model for adaptive adjustment with noisy control data

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

It is not always clear how to adjust for control data in causal inference, balancing the goals of reducing bias and variance. In a setting with repeated experiments, Bayesian hierarchical modeling yields an adaptive procedure that uses the data to determine how much adjustment to perform. We demonstrate this procedure on the example that motivated this work, a much-cited series of experiments on the effects of low-frequency magnetic fields on chick brains, as well as on a series of simulated data sets. We also discuss the relevance of this work to causal inference and statistical design and analysis more generally.

This is joint work with Matthijs Vakar.

I might even use a few slides!

P.S. Originally I was going to send them the announcement below, but then I thought it would be better to go specific, as above. In recent years, I’ve usually given pretty general abstracts and drilled into particular examples during the talk. This time I’ll frame the talk around a particular example and use that as a launching pad for more general discussions.

Anyway, here’s the title and abstract I decided not to use:

Is Matthew Walker’s “Why We Sleep” Riddled with Scientific and Factual Errors?

Asher Meir points to this hilarious post by Alexey Guzey entitled, Matthew Walker’s “Why We Sleep” Is Riddled with Scientific and Factual Errors.

Just to start with, the post has a wonderful descriptive title. And the laffs start right away:

Positively Nabokovian, I’d say. I mean it. The above table of contents makes me want to read more.

I’ve not read Walker’s book and I don’t know anything about sleep research, so I won’t try to judge Guzey’s claims. I read through and I found Guzey’s arguments to be persuasive, but, hey, I’m easily persuaded.

I’d be happy to read a followup article by Matthew Walker, “Alexey Guzey’s ‘Matthew Walker’s “Why We Sleep” Is Riddled with Scientific and Factual Errors’ Is Riddled with Scientific and Factual Errors.” That (hypothetical) post could completely turn me around! Then, of course, I’d be waiting for Guzey’s reply, “Matthew Walker’s ‘Alexey Guzey’s “Matthew Walker’s ‘Why We Sleep’ Is Riddled with Scientific and Factual Errors” Is Riddled with Scientific and Factual Errors’ Is Riddled with Scientific and Factual Errors.” At that point, I’d probably have heard enough to have formed a firm opinion. Right now, the ball is totally in Walker’s court.

After reading various sparkly, hard-hitting bits of Guzey’s post, I was gonna say, “Get this man [Walker] a Ted talk!” But apparently he does have a Ted talk. From Guzey:

I haven’t had this much fun reading something online since the days when Gawker was publishing.

Anyway, it seems that Walker’s Ted talk is called “Sleep is Your Superpower.” Your superpower. How Ted can you get?? The modern world is really wasted on us, without Veronica Geng to mock it for us.

Guzey reports, “Matthew Walker is a professor of neuroscience and psychology at the University of California, Berkeley, where he also leads the Center for Human Sleep Science. . . . His book Why We Sleep . . . was praised by the New York Times . . . was named one of NPR’s favorite books of 2017 . . .”

OK: neuroscience professor at major university, praised by NYT and NPR . . . and he even published an article in the Lancet! We’re touching all the bases here. What next, a collaboration with Dr. Anil Potti?

Guzey continues: “A month after the book’s publication, he became a sleep scientist at Google.”

I’ll have to say, if it’s really true what Guzey says that Walker’s book is riddled with errors, there’s something satisfying about hearing that Walker got a job at Google. We have this image of Google as some sort of juggernaut. It’s good to hear that they can make mistakes too, just like scientific journals, universities, news outlets, etc.

Let me tell you a story. I went to graduate school at Harvard. Finest university in the world. My first day in a Harvard class, I was sitting with rapt attention, learning all sorts of interesting and important things (for reals; it was an amazing class that motivated me to become a statistician), sitting at one of those chairs with a desk attached to it, you know, the kind of chair where the desk part flips up so it’s in front of you, and, on the bottom of that desk was a wad of gum.

Back when I was in junior high, gum was almost a form of currency. I’d buy a pack of grape Bubble Yum for a quarter at the corner store on the way to school, then chew it in the morning during the endless hours between first period and lunch. I’d put one piece of gum in my mouth, chew it until it lost all its flavor, then add the second piece, chew it etc., and continue until I had a massive wad, all five pieces, ultimately flavorless, and I’d chew and chew and blow huge bubbles when the teacher wasn’t looking.

I’m not trying to make myself out into some big rebel here; the point is, we all did that. So of course there was yucky gum under all the desks. You knew to never run your hands under a desk, cos you never knew what might turn up. That was junior high.

Then in high school, everyone was much more mature, a lot less gum chewing . . . but still, gum under the desks. I took classes at the University of Maryland, a fine university with an OK basketball team . . . still, they had gum. Then I went to MIT, one of the finest engineering schools in the world . . . yup, gum. But Harvard? I’d hoped Harvard was better than that. But it wasn’t.

Anyway, that’s how I felt, learning that this purveyor of (possibly) horribly false claims is not just a professor of neuroscience at a top university—we know that top universities have lots of frauds—but was hired by Google. Google! Here I am, almost sixty years old (I don’t feel close to 60, but that’s my problem, not yours), and still there’s room for disillusionment.

Anyway . . . the next question (again conditional on Guzey’s statement that Walker’s book is riddled with errors) is, where does Walker fit in here? Did he not fully read the sources he cited in his book? Did he write much of the book from memory, and just misremember a lot of things? Does he think that it’s ok to get the facts wrong in service of a larger truth? Is he just confused? Maybe he didn’t write the book himself, perhaps he had an army of incompetent research assistants?

Just as an example, here’s the first item in Guzey’s post:

OK, I haven’t followed all the links and read the linked studies. That’s not my job—as Bill James once wrote, I’m not a public utility. If you want to read the linked studies, go for it!

The question is: what was Walker thinking when he wrote things like, “the shorter your sleep, the shorter your life span” or “Routinely sleeping less than six or seven hours a night demolishes your immune system, more than doubling your risk of cancer”? I guess someone will have to ask him. I wonder about this a lot, when people write things that are not supported by the data. It’s my impression that people often write things that sound good, without always thinking about their literal meanings.

Here’s another one:

What was Walker thinking when he wrote, “every species studied to date sleeps”?? Did he forget what he’d read in that 638-page book? Did he read the passage in question so quickly that he took from it the opposite impression? Perhaps he was working from memory, and forgot some things? I have no idea.

OK, I can’t resist. Here’s another:

Again, maybe I’m missing the whole point. Walker should feel free to take a few moments from his jobs at Berkeley and Google to respond in comments to explain how Guzey is misrepresenting the evidence, or if there’s something else going on. I’m open to the possibility.

Or if Walker did get all these things wrong . . . then maybe he can write something to thank Guzey for pointing out the errors. I make mistakes all the time—it happens!—and when I learn about them, I feel bad. Maybe Walker will learn about this and feel bad and figure out how to do better next time. I don’t think Brian Wansink ever got around to thanking people for finding all the errors in his papers, but I think he should’ve.

In the meantime, I appreciate that Guzey wrote his post directly and didn’t bother being polite. Politeness is fine, but it has a cost. First, a polite version of the post would be less fun to read, and less fun means I’d be less likely to read it, which would be a shame, given that this is an important topic. Second, I’m guessing that rewriting this post more politely would take a lot of effort on Guzey’s part, which would mean that maybe he wouldn’t have written it at all, or he would’ve included less information, or that he’d have less time to do other stuff. And that would be unfortunate. Guzey’s time is valuable too.

I wouldn’t’ve wanted Mark Twain, Veronica Geng, or David Sedaris to have worried too much about politeness either. I’m sure Guzey makes mistakes too. That’s ok. Mistakes are mistakes, even when written politely.

And, if it turns out that Guzey got it all wrong regarding Walker’s book, then fine, I’ll report back and update this post accordingly. I wouldn’t be the first person to get fooled by something on the internet.

P.S. Personally, I looove sleep. I sleep 9 or 10 hours a night. OK, I don’t average 9 or 10 hours a night. My average has got to be less than 8, maybe even less than 7, because when I get too little sleep for a night or two, it’s not like I balance it with 11 hours the next night. What I’m saying is that 9 or 10 hours makes me comfortable: it’s how much sleep I’d get if I had no kids and never had an early-morning deadline. Based on my observations of others, I’m guessing I’ll need less sleep as I get older.

P.P.S. Some searching also turned up this amusing exchange:

To be fair, though, my colleagues and I once wrote a book with the subtitle, “Why Americans Vote the Way They Do,” but we never actually answered that question! In the book, all our empirical claims were carefully sourced and explained, but then we just said something silly in the title. So don’t trust titles.

P.P.P.S. I had so much fun writing this post, and I’m such a damn perfectionist, I spent 2 hours on it! And now I’m getting to sleep really late, I’ll only get to sleep 7.5 hours at most. So I’ll be uncomfortable and tired in the morning (and I don’t even drink coffee! I never touch the stuff, just don’t like the smell and taste), and it’ll probably lower my productivity tomorrow. Damn. I do feel, though, that if I write what I write when I want to write it, eventually I’ll end up writing the things I should write, the way I want to write them.

In research as in negotiation: Be willing to walk away, don’t paint yourself into a corner, leave no hostages to fortune

There’s a saying in negotiation that the most powerful asset is the ability to walk away from the deal.

Similarly, in science (or engineering, business decision making, etc.), you have to be willing to give up your favorite ideas. When I look at various embarrassing examples in science during the past decade, a common thread is researchers and their supporters not willing to throw in the towel.

Don’t get me wrong: failure is not the goal of a research program. Of course you want your idea to succeed, to flourish, and only to be replaced eventually with some improved version of itself. Similarly, you don’t go into a negotiation intending to walk away. But if you do research without the willingness to walk away if necessary, you’re playing science with one hand behind your back.

As with negotiation, your efforts to keep your theory alive, to defend and improve it, will be stronger if you are ultimately willing to walk away from your theory if necessary. If you know ahead of time that you’ll never give up, then you’ve painted yourself in a corner.

Look what happened with Marc Hauser, or Brian Wansink. They staked their reputations on their theories and their data. That was a mistake. When their data didn’t support their theories, they had to scramble, and it wasn’t pretty.

Look at what happened with Daryl Bem. He collected some data and fooled himself into thinking it represented strong evidence for his theory. When that didn’t work out—when careful examination revealed problems with his methods—he doubled down. Too bad. He could’ve preserved his theory to some extent by just recognizing that his experimental methods were too noisy to measure the ESP he was looking for. If he really wanted to make progress on the science of ESP, he’d have to think more about measurement. Bem’s defense of his scientific theory was too brittle. The front-line defense—defending not just the concept of ESP but the particular methods he used to study it and his particular method of data analyses—left him with no ability to go forward. We can learn from our mistakes, but only if we are willing to learn.

Look at what happened with Satoshi Kanazawa. He found some data and fooled himself into thinking it represented strong evidence for his theory. When that didn’t work out—when careful examination revealed problems with his methods—he doubled down. Too bad. His general theory of evolutionary psychology has to have some truth to it, but he won’t be able to learn anything about it with such noisy methods, any more than I could learn anything useful about special relativity by studying the trajectories of billiard balls. If Kanazawa were willing to walk away from his claims, he could have a chance of doing some useful research here. But instead he’s painted himself into a corner.

To continue with the analogy: Walking away should be an option, but it’s not the only option. A negotiator who walks away too soon can miss out on opportunity. And, similarly, as a scientist you should not abandon your pet theory too quickly. It’s your pet theory—nobody else is as motivated as you are to nurture it, so try your best. So, please, defend your theory—but, at the same time, be willing to walk away. It is only by recognizing that you might be wrong that you can make progress. Otherwise you’re tying yourself into knots, trying to give a deep explanation of every random number that crosses your path.

Tomorrow’s post: Stan saves Australians $20 billion

Why do a within-person rather than a between-person experiment?

Zach Horne writes:

A student of mine was presenting at the annual meeting of the Law and Society Association. She sent me this note after she gave her talk:

I presented some research at LSA which used a within subject design. I got attacked during the Q&A session for using a within subjects design and a few people said it doesn’t mean much unless I can replicate it with a between subjects design and it has no ecological validity.

She asked me if I had any thoughts on this and whether I had previously had problems defending a within subjects design. She also wondered what one should say when people take issue with within subjects designs.

I sent her a note with my initial thoughts but I thought it would be worth bringing up (again) on your blog because I’ve run into this criticism a lot. I don’t think she should just buckle and start running between subjects designs just to appease reviewers. We need people to understand the value of within designs, but the mantra “measurement error is important to think about” doesn’t seem to be doing the trick.

For background, here are two old posts of mine that I found on this topic from 2016 and 2017:

Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments

Poisoning the well with a within-person design? What’s the risk?

Now, to quickly answer the questions above:

First off, the “ecological validity” thing is a red herring. Whoever said that either misunderstood the term or didn’t know what they were talking about. Ecological validity refers to generalization from the lab to the real world, and it’s an important concern—but it has nothing to do with whether your measurements are within or between people.

Second, I think within-person designs are generally the best option when studying within-person effects. But there are settings where a between-person design is better.

In order to understand why I prefer the within-person design, it’s helpful to see the key advantage of the between-person design, which is that, by giving each person only one treatment, the effects of the treatment are pure. No crossover effects to worry about.

The disadvantage of the between-person design is that it does not control for variation among people, which can be huge.

In short, the between-person design is often cleaner, but at the cost of being so variable as to be essentially useless.

OK, at this point you might say, Fine, just do the between-person design with a really large N. But this approach has two problems. First, people don’t always get a really large N. One reason for that is the naive view that, if you have statistical significance, then your sample size was large enough. Second, all studies have bias (for example, in a psychology experiment there will be information leakage and demand effects), and ramping up N won’t solve that problem.

Here’s what I wrote a few years ago:

The clean simplicity of [between-person] designs has led researchers to neglect important issues of measurement . . .

Why use between-subject designs for studying within-subject phenomena? I see a bunch of reasons. In no particular order:

1. The between-subject design is easier, both for the experimenter and for any participant in the study. You just perform one measurement per person. No need to ask people a question twice, or follow them up, or ask them to keep a diary.

2. Analysis is simpler for the between-subject design. No need to worry about longitudinal data analysis or within-subject correlation or anything like that.

3. Concerns about poisoning the well. Ask the same question twice and you might be concerned that people are remembering their earlier responses. This can be an issue, and it’s worth testing for such possibilities and doing your measurements in a way to limit these concerns. But it should not be the deciding factor. Better a within-subject study with some measurement issues than a between-subject study that’s basically pure noise.

4. The confirmation fallacy. Lots of researchers think that if they’ve rejected a null hypothesis at a 5% level with some data, that they’ve proved the truth of their preferred alternative hypothesis. Statistically significant, so case closed, is the thinking. Then all concerns about measurements get swept aside: After all, who cares if the measurements are noisy, if you got significance? Such reasoning is wrong wrong wrong, but lots of people don’t understand this.

One motivation for between-subject design is an admirable desire to reduce bias. But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement.

And this framing of questions of experimental design and analysis in terms of risks and benefits:

In a typical psychology experiment, the risk and benefits are indirect. No patients’ lives are in jeopardy, nor will any be saved. There could be benefits in the form of improved educational methods, or better psychotherapies, or simply a better understanding of science. On the other side, the risk is that people’s time could be wasted with spurious theories or ineffective treatments. Useless interventions could be costly in themselves and could do further harm by crowding out more effective treatments that might otherwise have been tried.

The point is that “bias” per se is not the risk. The risks and benefits come later on when someone tries to do something with the published results, such as to change national policy on child nutrition based on claims that are quite possibly spurious.

Now let’s apply these ideas to the between/within question. I’ll take one example, the notorious ovulation-and-voting study, which had a between-person design: a bunch of women were asked about their vote preference, the dates of their cycle, and some other questions, and then women in a certain phase of their cycle were compared to women in other phases. Instead, I think this should’ve been studied (if at all) using a within-person design: survey these women multiple times at different times of the month, each time asking a bunch of questions including vote intention. Under the within-person design, there’d be some concern that some respondents would be motivated to keep their answers consistent, but in what sense does that constitute a risk? What would happen is that changes would be underestimated, but when this propagates down to inferences about day-of-cycle effects, I’m pretty sure this is a small problem compared to all the variation that tangles up the between-person design. One could do a more formal version of this analysis; the point is that such comparisons can be done.

So, to get back to the question from my correspondent: what to do if someone hassles you to conduct a between-person design?

First, you can do a simulation study or design calculation and show the huge N that you would need to get a precise enough estimate of your effect of interest.
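Here’s a sketch of what such a simulation study might look like, with assumed (made-up) variance components: a small treatment effect buried in big person-to-person variation, with modest measurement noise. The point is to compare the sampling variability of the between-person difference in means with that of the within-person difference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (invented) parameters: a small effect buried in large
# person-to-person variation, with modest measurement noise.
effect, sd_person, sd_noise = 0.2, 2.0, 0.5
n_sims, n = 2000, 100  # simulations; people per arm / people measured twice

between, within = [], []
for _ in range(n_sims):
    # Between-person design: different people in each arm.
    alpha_c = rng.normal(0, sd_person, n)
    alpha_t = rng.normal(0, sd_person, n)
    y_c = alpha_c + rng.normal(0, sd_noise, n)
    y_t = alpha_t + effect + rng.normal(0, sd_noise, n)
    between.append(y_t.mean() - y_c.mean())

    # Within-person design: the same people under both conditions,
    # so the person effects cancel in the difference.
    alpha = rng.normal(0, sd_person, n)
    pre = alpha + rng.normal(0, sd_noise, n)
    post = alpha + effect + rng.normal(0, sd_noise, n)
    within.append((post - pre).mean())

print("sd of between-person estimate:", round(np.std(between), 3))
print("sd of within-person estimate: ", round(np.std(within), 3))
```

With these numbers the between-person estimate is roughly four times noisier, which means you’d need on the order of sixteen times the sample size to match the within-person design’s precision—exactly the kind of design calculation you can show a skeptical reviewer.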

Second, you can point out that inferences from the between-person design are entirely indirect and only of averages, even though for substantive reasons you almost certainly are interested in individual effects.

Third, you can throw the “ecological validity” thing back at them and point out that, in real life, people are exposed to all sorts of different stimuli. Real life is a within-person design. In psychology experiments, we’re not talking about lifetime exposures to some treatment. In real life, people do different things all the time.

Ballot order effects in the news; I’m skeptical of the claimed 5% effect.

Palko points us to this announcement by Marc Elias:

BREAKING: In major court victory ahead of 2020, Florida federal court throws out state’s ballot order law that lists candidates of the governor’s party first on every ballot for every office. Finds that it gave GOP candidates a 5% advantage. @AndrewGillum lost in 2018 by .4%

Is anyone really saying that being first on the ballot is worth 5% in a high-profile general election for governor?

OK, following the links, I see this post by Joanne Miller linking to some pages from the court decision. Here’s a key passage:

I’m skeptical that ballot order would swing the vote margin in the Florida governor’s race by 5 percentage points.

I discussed the general topic a couple years ago, in the context of the presidential election:

Could ballot order have been enough to cause a 1.2% swing? Maybe so, maybe not. The research is mixed. Analyzing data from California elections where a rotation of candidate orders was used across assembly districts, Jon Krosnick, Joanne Miller, and Michael Tichy (2004) found large effects including in the 2000 presidential race. But in a different analysis of California elections, Daniel Ho and Kosuke Imai (2008) write that “in general elections, ballot order significantly impacts only minor party candidates, with no detectable effects on major party candidates.” Ho and Imai also point out that the analysis of Krosnick, Miller, and Tichy is purely observational. That said, we can learn a lot from observational data. Krosnick et al. analyzed data from the 80 assembly districts but it doesn’t look like they controlled for previous election results in those districts, which would be the obvious thing to do in such an analysis. Amy King and Andrew Leigh (2009) analyze Australian elections and find that “being placed first on the ballot increases a candidate’s vote share by about 1 percentage point.” Marc Meredith and Yuval Salant (2013) find effects of 4-5 percentage points, but this is for city council and school board elections so not so relevant for the presidential race. A Google Scholar search found lots and lots of papers on ballot-order effects but mostly on local elections or primary elections, where we’d expect such effects to be larger. This 1990 paper by R. Darcy and Ian McAllister cites research back to the early 1900s! . . .

Based on the literature I’ve seen, a 1% swing seems to be on the border of what might be a plausible ballot-order effect for the general election for president, maybe a bit on the high end given our current level of political polarization.

Given that I thought that a 1% effect was on the border of plausibility, you won’t be surprised that I think that 5% is way overstating it, at least for a major election. Sure, governor is less major than president. But, again, in this era of polarization I doubt there are so many more swing voters available for that race either.

0.4%, though? Sure, that I believe.

So, yes, I believe that ballot order was enough to swing the 2018 Florida governor’s election. And, more generally, I favor rotated or randomized ballot orders to eliminate the ballot order effects that are there. This is a bias that can be easily and inexpensively removed; it seems like a no-brainer to fix it. Whether this is a matter for the courts, I don’t know.

Zombie semantics spread in the hope of keeping most on the same low road you are comfortable with now: Delaying the hardship of learning better methodology.

Now, everything is connected, but this is not primarily about persistent research misconceptions such as statistical significance.

Instead it is about (inherently) interpretable ML versus (misleading with some nonzero frequency) explanatory ML that I previously blogged on just over a year ago.

That was when I first became aware of work by Cynthia Rudin (Duke) arguing that upgraded versions of easy-to-interpret machine learning (ML) technologies (e.g. CART, constrained optimisation to get sparse rule lists, trees, linear integer models, etc.) can offer predictive performance similar to that of newer ML (e.g. deep neural nets) with the added benefit of inherent interpretability. In that initial post, I overlooked the need to define (inherently) interpretable ML as ML where the connection between the inputs given and the prediction made is direct. That is, it is simply clear how the ML predicts but not necessarily why such predictions would make sense – understanding how the model works, but not an explanation of how the world works.

What’s new? Not much and that’s troubling.

For instance, policy makers are still widely accepting black box models without serious attempts at getting interpretable (rather than explainable) models that would be even better. Apparently, the current lack of interpretable models with performance comparable to black box models in some high-profile applications is being taken, without question, as the usual situation. To dismiss consideration of interpretable models? Or maybe it is just wishful thinking?

Now there have been both improvements in interpretable methods and their exposition.

For instance, an interpretable ML model achieved accuracy comparable to black box ML and received the FICO Recognition Award, which acknowledged the interpretable submission for going above and beyond expectations with a fully transparent global model that did not need explanation. Additionally, there was a user-friendly dashboard to allow users to explore the global model and its interpretations. So, a nice and very visible success.

Additionally, theoretical work has proceeded to discern whether accurate interpretable models could possibly exist in many if not most applications. It avoids Occam’s-Razor-style arguments about the world being truly simple by using a technical argument about function classes and, in particular, Rashomon sets.

As for their exposition, there is now a succinct 10-minute YouTube video, Please Stop Doing “Explainable” ML, that hits many of the key points, along with a highly readable technical exposition that further fleshes out these points: Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.

However, as pointed out in the paper, the problem persists: “Black box machine learning models are currently being used for high stakes decision-making throughout society, causing problems throughout healthcare, criminal justice, and in other domains. People have hoped that creating methods for explaining these black box models will alleviate some of these problems, but trying to explain black box models, rather than creating models that are interpretable in the first place, is likely to perpetuate bad practices and can potentially cause catastrophic harm to society”.


Should we mind if authorship is falsified?

In a typically thought-provoking piece, Louis Menand asks, “Should we mind if a book is a hoax?” In his article, Menand (whose father taught the best course I ever took at MIT, in which we learned that eternal vigilance is the price of liberty) focuses on imaginative literature written by white people but attributed to ethnic minorities. Or, more generally, comfortable people writing in the voices of the less comfortable (thus including, for example, fake Holocaust memoirists who make up their life stories but not their ancestries). His take on it is that there’s an oversupply of well-connected white folks who can pull off the conventions of literary writing, along with an unsatisfied demand for literature sharing the experience of people who’ve suffered. Put together the supply and demand and you get a black market.

Reading Menand’s article made me wonder if there’s anything similarly going on with scientific or scholarly writing. We do sometimes see plagiarism, but that’s more about taking credit for someone else’s work—plagiarism is what lazy and greedy people do—whereas Menand is talking about the opposite, people who do the work but don’t take the credit.

Does authorship matter at all?

For a scientific or scholarly article, what does verifiable authorship get you, the reader or consumer of research? A few things. In no particular order:

– The author is a real person who stands by the work and is thus using his or her reputation as a sort of collateral. In some sense, this works even when it doesn’t work: consider names such as Hauser, Bem, or Wansink where, at first the reputation bolstered the work’s believability, but then the weakness of the published work dragged down the reputation. Reputational inference goes both ways; recall the Lancet, the medical journal that’s published so many problematic papers that publication there can be a bit of a warning sign—maybe not so much as with PNAS or Psychological Science, but it’s a factor.

– Data and meta-data, description of experimental protocols, etc. There’s a real-life person and you can go to the real-life lab.

– Information about the authors can give a paper some street-cred. For example, remember that paper claiming that single women were 20 percentage points more likely to support Barack Obama during certain times of the month? That paper had both male and female authors. If all the authors were male, I wonder if it would’ve been considered too silly or too offensive to publish or to promote.

– Responsibility for errors. Sometimes a paper is presented as single-authored even though it is clearly the work of many people. When there’s an error, who’s to blame? It should be the author’s responsibility, but perhaps the error occurred in a part of the paper that the author did not actually write. It’s hard to know.


In the above discussion I’m purposely not considering issues of fairness, scholarly due process, etc. Setting all that aside, my focus here is on the way that falsification of authorship can directly reduce the usefulness of a published work of scholarship.

Remember how Basbøll and I discussed plagiarism as a statistical crime, based on the idea that plagiarism hides important information regarding the source and context of the copied work in its original form, information which can dramatically alter the statistical inferences made about the work?

Here I’m saying that this concern is more general, not just with plagiarism but with any misrepresentation of data and metadata, which includes authorship as well as details of how an experiment was carried out, what steps were done in data processing and analysis, and so on.

I (inadvertently) misrepresented others’ research in a way that made my story sound better.

During a recent talk (I think it was this one on statistical visualization), I spent a few minutes discussing a political science experiment involving social stimuli and attitudes toward redistribution. I characterized the study as being problematic for various reasons (for background, see this post), and I remarked that you shouldn’t expect to learn much from a between-person study of 38 people in this context.

I was thinking more about this example the other day and so I went back to the original published paper to get more details on who those 38 people were. I found this table of results:

But that’s not 38 people in the active condition; it’s 38 clusters! Looking at the article more carefully, we see this:

The starting race and SES and starting petition were randomized each day, and the confederates rotated based on these starting conditions. In total, there are 74 date–time clusters across 15 days.

That’s 38 clusters in the active condition and 36 in the control.

And this, from the abstract:

Results from 2,591 solicitations . . .

Right there in the abstract! And I missed it. I thought it was N=38 (or maybe I was remembering N=36; I can’t recall) but it was actually N=2591.

There’s a big difference between 38 and 2591. It’s almost as if I didn’t know what I was talking about.

But it’s worse than that. I didn’t just make a mistake (of two orders of magnitude!). I made a mistake that fit my story. My story was that the paper in question had problems—indeed I’m skeptical of its claims, for reasons discussed in the linked post and which had nothing to do with sample size—and so it was all too easy for me to believe it had other problems.

It’s interesting to have caught myself making this mistake, and it’s easy to see how it can happen: if you get a false impression but that impression is consistent with something you already believe, you might not bother checking it, and then you mentally add it to your list of known facts.

This can also be taken as an argument against slide-free talks. I usually don’t use slides when I give talks, I like it that way, and it can go really well—just come to the New York R conference some year and you’ll see. But one advantage of slides is that with slides you have to write everything down, and if you have to write things down, you’ll check the details rather than just going by your recollection. In this case I wouldn’t’ve said the sample size was 38; I would’ve checked, I would’ve found the error, and my talk would’ve been stronger, as I could’ve made the relevant points more directly.

During the talk when I came to that example I said that I didn’t remember all the details of the study and that I was just making a general point . . . but, hey, it was 2591, not 38. Jeez.

Is “abandon statistical significance” like organically fed, free-range chicken?

The question: is good statistics scalable?

This comes up a lot in discussions on abandoning statistical significance, null-hypothesis significance testing, p-value thresholding, etc. I recommend accepting uncertainty, but what if it’s decision time—what to do?

How can the world function if the millions of scientific decisions currently made using statistical significance somehow have to be done another way? From that perspective, the suggestion to abandon statistical significance is like a recommendation that we all switch to eating organically fed, free-range chicken. This might be a good idea for any of us individually or with small groups, but it would just be too expensive to do on a national scale. (I don’t know if that’s true when it comes to chicken farming; I’m just making a general analogy here.)

Even if you agree with me that null-hypothesis significance testing is almost always a bad idea, that it would be better to accept uncertainty and propagate it through our decision making process rather than collapsing the wavefunction with every little experiment, even if you agree that current practices of reporting statistically significant comparisons as real and non-significant comparisons as zero are harmful and impede our scientific understanding, even if you’d rather use prior information in the steps of inference and reporting of results, even if you don’t believe in ESP, himmicanes, ages ending in 9, embodied cognition, and all the other silly and unreplicated results that were originally sold on the basis of statistical significance, even if you don’t think it’s correct to say that stents don’t work just because p was 0.20, even if . . . etc. . . . even if all that, you might still feel that our proposal to abandon statistical significance is unrealistic.

Sure, sure, you might say: it’s fine for researchers who have the luxury to propagate their uncertainty, but what if they need to make a decision right now about what ideas to pursue next? Sure, sure, null hypothesis significance testing is a joke, and Psychological Science has published a lot of bad papers, but journals have to do something, they need some rule, right? And there aren’t enough statisticians out there to carefully evaluate each claim. It’s not like every paper sent to a psychology journal can be sent to Uri Simonsohn, Greg Francis, etc., for review.

So, the argument goes, yes, there’s a place for context-appropriate statistical inference and decision making, but such analyses have to be done one at a time. Artisanal statistics may be something for all researchers to aspire to, but in the here and now they need effective, mass-produced tools, and p-values and statistical significance is what we’ve got.

My response

McShane, Gal, Robert, Tackett, and I wrote:

One might object here and call our position naive: do not editors and reviewers require some bright-line threshold to decide whether the data supporting a claim is far enough from pure noise to support publication? Do not statistical thresholds provide objective standards for what constitutes evidence, and does this not in turn provide a valuable brake on the subjectivity and personal biases of editors and reviewers?

We responded to this concern in two ways.


Even were such a threshold needed, it would not make sense to set it based on the p-value given that it seldom makes sense to calibrate evidence as a function of this statistic and given that the costs and benefits of publishing noisy results vary by field. Additionally, the p-value is not a purely objective standard: different model specifications and statistical tests for the same data and null hypothesis yield different p-values; to complicate matters further, many subjective decisions regarding data protocols and analysis procedures such as coding and exclusion are required in practice and these often strongly impact the p-value ultimately reported.


We fail to see why such a threshold screening rule is needed: editors and reviewers already make publication decisions one at a time based on qualitative factors, and this could continue to happen if the p-value were demoted from its threshold screening rule to just one among many pieces of evidence.

To say it again:

Journals, regulatory agencies, and other decision-making bodies already use qualitative processes to make their decisions. Journals are already evaluating papers one at a time using a labor-intensive process. I don’t see that removing a de facto “p less than 0.05” rule would make this process any more difficult.

Dow Jones probability calculation

Here’s a cute one for your intro probability class.

Karen Langley from the Wall Street Journal asks:

What is the probability of the Dow Jones Industrial Average closing unchanged from the day before, as it did yesterday?

To answer this question we need to know two things:
1. How much does the Dow Jones average typically vary from day to day?
2. How precisely is the average recorded?

I have no idea about item #1, so I asked Langley, and she said the index might go up or down by 50 in a given day. So let’s say that’s the range, from -50 to +50; then, if the average were rounded to the nearest integer, the probability of no change would be approximately 1/100.

For item #2, I googled and found this news article that implies that the Dow Jones average is rounded to 2 decimal places (e.g., 27,691.49).

So then the probability of the Dow being unchanged to 2 decimal places is approximately 1/10000.

That’s a quick calculation. To do better, we’d want a better estimate of the distribution of the day-to-day change. We could compute some quantiles and then fit a normal density—or even just compute the proportion of day-to-day changes that are in the range [-10, 10] and divide that by 2000, the number of 0.01 increments in that interval. It should be pretty easy to get this number.
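Here’s a sketch of that empirical calculation in code. The closing values are fake, generated just to match the “up or down by 50” range above, so the function, not the printed number, is the point:

```python
import numpy as np

def prob_unchanged(closes, window=10.0, tick=0.01):
    # closes: array of daily closing values of the index.
    changes = np.diff(closes)
    # Proportion of day-to-day changes landing within [-window, window],
    p_window = np.mean(np.abs(changes) <= window)
    # spread evenly over the 2 * window / tick tick-level values in that
    # interval (assuming changes are roughly uniform at that fine scale).
    return p_window * tick / (2 * window)

# Fake illustration data: changes roughly uniform on [-50, 50].
rng = np.random.default_rng(1)
closes = 27000 + np.cumsum(rng.uniform(-50, 50, size=10_000))
print(prob_unchanged(closes))  # close to the 1/10000 back-of-envelope answer
```

With real historical closes you’d just swap in the actual series; if the once-in-a-decade figure (about 1/2500) is right, the real distribution puts more mass near zero than this uniform sketch does.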

Yet another complexity is that there’s a small number of stocks in the Dow Jones average, so it might be that not all prices are even possible. I don’t think that’s an issue, as it seems that each individual stock has a price down to the nearest cent, but maybe there’s something I’m missing here.

Another way to attack the problem is purely empirically. According to this link, the Dow being unchanged is a “once-in-a-decade event.” A year has approximately 250 business days, hence if it’s truly once in a decade, that’s a probability of 1/2500. In that case, my 1/10000 estimate is way too low. On the other hand, given that prices have been rising, the probability of an exact tie should be declining. So even if the probability was 1/2500 ten years ago, it could be lower now. Also, an approximate ten-year gap does not give a very precise estimate of the probability. All these numbers are in the same order of magnitude.

Anyway, this is a good example to demonstrate the empirical calculations of probabilities, similar to some of the examples in chapter 1 of BDA but with more detail. I prefer estimating the probability of a tied election, or maybe some sports example, but if you like stocks, you can take this one.

Call for Paper proposals for the American Political Science Association: Symposium on Forecasting the 2020 American National Elections

Ruth Dassonneville and Charles Tien write:

Even though elections are seemingly increasingly unstable, and voters’ behaviour seems fickle, the outcomes of US elections have historically been quite predictable. First, election outcomes seem systematically correlated with election fundamentals, such as the state of the economy, and incumbency. Second, what happens in the run-up to the election, during the primaries and the conventions, or how much money candidates raise or spend is known to affect their electoral success. Third, public opinion polls – with measures of presidential approval or vote intentions – give a good sense of what to expect on Election Day, even months before the election. With this knowledge of what historically determines election outcomes in the US, what are the predictions for the 2020 American National Elections?

With the support of the APSA Political Forecasting Group, we are proposing a symposium for PS: Political Science & Politics of papers that each lay out a forecasting model and make a prediction for the 2020 American National Elections. The goal of the proposal is to show a diversity of approaches to forecasting elections, including political markets, structural models, work that is based on the aggregation of polls, and combinations of both. We welcome paper proposals that focus on different types of elections, including predictions of the Presidential, House, Senate, and governorships, as well as the Electoral College. We also are keen to include voices that might be critical of political forecasting, or the way it has been done in political science. The proposers may be members of the Political Forecasting Group of the APSA, but that is by no means a requirement.

Specifically, we are seeking short papers of 3,000 words maximum, that offer a prediction of the 2020 American National Elections. The symposium will have space for updates of models that have historically performed well at predicting the outcomes of US elections, but we also encourage submissions that lay out new and innovative ways to forecasting the election outcome. We particularly welcome diverse teams of scholars and members of underrepresented groups and junior scholars to submit abstracts for the Symposium.

Abstracts should be submitted by 20 December 2019 to the guest editors, Ruth Dassonneville and Charles Tien, who will evaluate the abstracts and make a selection of papers for inclusion in the Symposium proposal. If the proposal is successful, the full papers for the Symposium are due by August 3. To ensure timely publication of the Symposium, authors have to commit to responding to reviews within one week. For questions regarding the Symposium proposal, contact Ruth Dassonneville and Charles Tien.

What if it’s never decorative gourd season?

If it rains, now we’ll change
We’ll hold and save all of what came
We won’t let it run away
If it rains — Robert Forster

I’ve been working recently as part of a team of statisticians based in Toronto on a big, complicated applied problem. One of the things about working on this project is that, in a first for me, we know that we need to release all code and data once the project is done. And, I mean, I’ve bolted on open practices to the end of an analysis, or just released a git repo at the end (sometimes the wrong one!). But this has been my first real opportunity to be part of a team that is weaving open practices all the way through an analysis. And it is certainly a challenge.

It’s worth saying that, notoriously “science adjacent” as I am, the project is not really a science project. It is a descriptive, explorative, and predictive study, rather than one that is focussed on discovery or confirmation. So I get to work my way through open and reproducible science practices without, say, trying desperately to make Neyman-Pearson theory work.

A slight opening

Elisabeth Kübler-Ross taught us that there are five stages in the transition to more open and reproducible scientific practices: 

  • Denial (I don’t need to do that!)
  • Anger (How dare they not do it!)
  • Bargaining (A double whammy of “Please let this be good enough” and “Please let other people do this as well”)
  • Depression (Open and reproducible practices are so hard and no one wants to do them properly)
  • Acceptance (Open and reproducible science is not a single destination, but a journey and an exercise in reflective practice)

And, really, we’re often on many parts of the journey simultaneously. (Although, like, we could probably stop spending so long on Anger, because it’s not that much fun for anyone.)  

And a part of this journey is to carefully and critically consider the shibboleths and touchstones of open and reproducible practice. Not because everyone else is wrong, but because these things are complex and subtle, and we need to work out how to weave them into our idiosyncratic research practices.

So I’ve found myself asking the following question.

Should we release code with our papers?

Now to friends and family who are also working their way through the Kübler-Ross stages of Open Science, I’m very sorry but you’re probably not going to love where I land on this. Because I think most code that is released is next to useless. And that it would be better to release nothing than release something that is useless. Less digital pollution.

It’s decorative gourd season!

A fairly well known (and operatically sweary) piece in McSweeney’s Internet Tendency celebrates the Autumn every year by declaring It’s decorative gourd season, m**********rs! And that’s the piece. A catalogue of profane excitement at the chance to display decorative gourds. Why? Because displaying them is enough!

But is that really true for code? While I do have some sympathy for the sort of “it’s been a looonnng day and if you just take one bite of the broccoli we can go watch Frozen again”-school of getting reluctant people into open science, it’s a desperation move and at best a stop-gap measure. It’s the type of thing that just invites malicious compliance or, perhaps worse, indifferent compliance.

Moreover, making a policy (even informally) that “any code release is better than no code release” is in opposition to our usual insistence that manuscripts reach a certain level of (academic) clarity and that our analyses are reported clearly and conscientiously. It’s not enough that a manuscript or a results section or a graph simply exist. We have much higher standards than that.

So what should the standard for code be?

The gourd’s got potential! Even if it’s only decorative, it can still be useful.

One potential use of purely decorative code is that it can be read to help us understand what the paper is actually doing.

This is potentially true, but it definitely isn’t automatically true. Most code is too hard to read to be useful for this purpose. Just like most gourds aren’t the type of decorative gourd you’d write a rapturous essay about.

So unless code meets a certain standard, it’s going to need to do something more than just sit there and look pretty, which means we will need our code to be at least slightly functional.

A minimally functional gourd?

This is actually really hard to work out. Why? Well there are just so many things we can look at. So let’s look at some possibilities. 

Good code “runs”. Why the scare quotes? Well, because there are always some caveats here. Code can be good even if it takes some setup or a particular operating system to run. Or you might need a Matlab license. To some extent, the idea of whether “the code runs” is an ill-defined target that may vary from person to person. But in most fields there are common computing setups, and if your code runs on one of those systems it’s probably fine.

Good code takes meaningful input and produces meaningful output: It should be possible to, for example, run good code on similar-but-different data.  This means it shouldn’t require too much wrangling to get data into the code. There are some obvious questions here about what is “similar” data. 

Good code should be somewhat generalizable. A simple example of this: good code for a regression-type problem should not assume you have exactly 7 covariates, making it impossible to use when the data has 8 covariates. This is vital for dealing with, for instance, the reviewer who asks for an extra covariate to be added, or for a graph to change.
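As a hypothetical illustration of that point (names and numbers made up), compare a fit function that bakes in the number of covariates with one that reads it off the data:

```python
import numpy as np

def fit_hardcoded(X, y):
    # Silently assumes exactly 7 covariates: any extra column is
    # dropped without warning when the reviewer's covariate arrives.
    beta, *_ = np.linalg.lstsq(X[:, :7], y, rcond=None)
    return beta

def fit_general(X, y):
    # Uses however many covariates the data actually has.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))                  # the data now has 8 covariates
y = X @ np.arange(1.0, 9.0) + rng.normal(size=100)
print(len(fit_hardcoded(X, y)))  # 7: the new covariate vanished silently
print(len(fit_general(X, y)))    # 8
```

The hardcoded version doesn’t crash; it just quietly answers a different question, which is the worse failure mode.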

How limited can code be while still being good? Well that depends on the justification. Good code should have justifiable limitations.

Code with these 4 properties is no longer decorative! It might not be good, but it at least does something. Can we come up with some similar targets for how the code is written, to make it more useful? It turns out that this is much harder, because judging the quality of written code is much more subjective.

Good gourd! What is that smell?

The chances that a stranger can pick up your code and, without running it, understand what the method is doing are greatly increased with good coding practice. Basically, if it’s code you can come back to a year later and modify as if you’d never put it down, then your code is possibly readable. 

This is not an easy skill to master. And there’s no agreed upon way to write this type of code. Like clearly written prose, there are any number of ways that code can be understandable. But like writing clear prose, there are a pile of methods, techniques, and procedures to help you write better code.

Simple things like consistent spacing and doing whatever RStudio’s auto-format does (like adding spaces on each side of “+”) can make your code much easier to read. But it’s basically impossible to list a set of rules that would guarantee good code. Kinda like it’s impossible to list a set of rules that would make good prose.
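For a concrete (made-up) example of what consistent spacing buys you, here is the same small function written twice:

```python
import numpy as np

# Before: valid code, but cramped and hard to scan.
def rmse_before(y,yhat):return(((y-yhat)**2).mean())**0.5

# After: the identical computation, spaced and split the way
# an auto-formatter would lay it out.
def rmse_after(y, yhat):
    squared_errors = (y - yhat) ** 2
    return squared_errors.mean() ** 0.5
```

Both return the same number; only the second can be read at a glance a year later.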

So instead, let’s work out what is bad about code. Again, this is a subjective thing, but we are looking for code that smells.

If you want to really learn what this means (with a focus on R), you should listen to Jenny Bryan’s excellent keynote presentation on code smell (slides etc. here). But let’s summarize.

How can you tell if code smells? Well, if you open a file and are immediately moved not just to light a votive candle but to realize in your soul that without intercessory prayer you will never be able to modify even a corner of the code, then the code smells. If you can look at it and at a glance see basically what the code is supposed to do, then your code smells nice and clean.

If this sounds subjective, it’s because it is. Jenny’s talk gives some really good advice about how to make less whiffy code, but her most important piece of advice is not about a specific piece of bad code. It’s the following:

Your taste develops faster than your ability. 

To say it differently, as you code more you learn what works and what doesn’t. But a true frustration is that (just like with writing) you tend to know what you want to do before you necessarily have the skills to pull it off. 

The good thing is that code for academic work is iterative. We do all of our stuff, send it off for review, and then have to change things. So we have a strong incentive to make our code better and we have multiple opportunities to make it so.

Because what do you do when you have to add a multilevel component to a model? Can you do that by changing your code in a single place? Or do you have to change the code in a pile of different places? Because good-smelling code is often code that is modular and modifiable.

But because we build our code over the full lifecycle of a project (rather than just once after which it is never touched again), we can learn the types of structures we need to build into our code and we can share these insights with our friends, colleagues, and students.

A gourd supportive lab environment is vital to success

The frustration we feel when we want to be able to code better than our skills allow is awful. I think everyone has experienced a version of it. And this is where peers and colleagues and supervisors have their chance to shine. Because just as people need to learn how to write scientific reports and people need to learn how to build posters and people need to learn how to deliver talks, people need to learn how to write good code.

Really, the only teacher is experience. But you can help experience along. Work through good code with your group. Ask for draft code. Review it. Just like the way you’ll say “the intro needs more ‘Piff! Pop! Woo!’ because right now I’m getting ‘*Sad trombone*’, and you’ve done amazing work so this should reflect that”, you need to say the same thing about the code. Fix one smell at a time. Be kind. Be present. Be curious. And because you most likely were also not trained in programming, be open and humble.

Get members of your lab to swap code and explain it back to the author. This takes time. But this time is won back when reviews come or when follow up work happens and modifications need to be made. Clean, nice code is easy to modify, easy to change, and easy to use.

But trainees who are new at programming are nervous about programming.

They’re usually nervous about giving talks too. Or writing. Same type of strategy.

But none of us are professional programmers

Sadly, in the year of our lord two thousand and nineteen if you work in a vaguely quantitative field in science, social science, or the vast mire that surrounds them, you are probably being paid to program. That makes you a professional programmer.  You might just be less good at that aspect of your job than others.

I am a deeply mediocre part-time professional programmer. I’ve been doing it long enough to learn how code smells, to have decent practices, and to have a bank of people I can learn from. But I’m not good at it. And it does not bring me joy. But neither does spending a day doing forensic accounting on the university’s bizarre finance system. It’s a thing that needs to be done as part of my job, and for the most part I’m a professional who tries to do my best even if I’m not naturally gifted at the task.

Lust for gourds that are more than just decorative

In Norwegian, the construct “to want” renders “I want a gourd” as “Jeg har lyst på en kalebas,” and it’s really hard, as an English speaker, not to translate that to “I have lust for a gourd”. And like that’s the panicking Norwegian 101 answer (where we can’t talk about the past because it’s linguistically complex or the future because it’s hard, so our only verbs can be instantaneous. One of the first things I was taught was “Finn er sjalu.” (Finn is jealous.) I assume because jealousy has no past or future).

But it also really covers the aspect of desiring a better future. Learning to program is learning how to fail to program perfectly. Just like learning to write is learning to be clunky and inelegant. To some extent you just have to be ok with that. But you shouldn’t be ok with where you are now being the end of your journey.

So did I answer my question? Should we release code with our papers?

I think I have an answer that I’m happy with. No in general. Yes under the right circumstances.

We should absolutely release code when someone has tried to make it good code. Even though they will have failed. We should carry each other forward even in our imperfection. Because the reality is that science doesn’t get more open by erecting arbitrary barriers. Arbitrary barriers just encourage malicious compliance.

When I lived in Norway as a newly minted gay (so shiny) I remember once taking a side trip to Gay’s The Word, the LGBTQIA+ bookshop in London and buying (among many many others) a book called Queering Anarchism. And I can’t refer to it because it definitely got lost somewhere in the nine times I’ve moved house since then.

The thing I remember most about this book (other than being introduced to the basics of intersectional trans-feminism) was its idea of anarchism as a creative force. Because after tearing down existing structures, anarchists need to have a vision of a new reality that isn’t simply an inversion of the existing hierarchy (you know. Reducing the significance threshold. Using Bayes Factors instead of p-values. Pre-registering without substantive theory.) A true anarchist, the book suggested, needs to queer rather than invert the existing structures and build a more equitable version of the world.

So let’s build open and reproducible science as a queer reimagining of science and not a small perturbation of the world that is. Such a system will never be perfect. Just lusting to be better.

Extra links:

Instead of replicating studies with problems, let’s replicate the good studies. (Consider replication as an honor, not an attack.)

Commenter Thanatos Savehn pointed to an official National Academy of Sciences report on Reproducibility and Replicability that included the following “set of criteria to help determine when testing replicability may be warranted”:

1) The scientific results are important for individual decision-making or for policy decisions.
2) The results have the potential to make a large contribution to basic scientific knowledge.
3) The original result is particularly surprising, that is, it is unexpected in light of previous evidence and knowledge.
4) There is controversy about the topic.
5) There was potential bias in the original investigation, due, for example, to the source of funding.
6) There was a weakness or flaw in the design, methods, or analysis of the original study.
7) The cost of a replication is offset by the potential value in reaffirming the original results.
8) Future expensive and important studies will build on the original scientific results.

I’m ok with items 1 and 2 on this list, and items 7 and 8: You want to put in the effort to replicate on problems that are important, and where the replications will be helpful. One difficulty here is determining whether “The scientific results are important . . . potential to make a large contribution to basic scientific knowledge.” Consider, for example, Bem’s notorious ESP study: if the claimed results are true, they could revolutionize science. If there’s nothing there, though, it’s not so interesting. This sort of thing comes up a lot, and it’s not clear how we should answer questions 1 and 2 above in the context of such uncertainty.

But the real problem I have is with items 3, 4, 5, and 6, all of which would seem to favor replications of studies that have problems.

In particular consider item 6: “There was a weakness or flaw in the design, methods, or analysis of the original study.”

I’d think about it the other way: If a study is strong, it makes sense to try to replicate it. If a study is weak, why bother?

Here’s the point. Replication often seems to be taken as a sort of attack, something to try when a study has problems, an attempt to shoot down a published claim. But I think that replication is an honor, something to try when you think a study has found something, to confirm something interesting.

ESP, himmicanes, ghosts, Bigfoot, astrology etc.: all very interesting if true, not so interesting as speculations not supported by any good evidence.

So I recommend changing items 3, 4, 5, and 6 of the National Academy of Sciences list. Instead of replicating studies with problems, let’s replicate the good studies.

To put it another way: The problem with the above guidelines is that they implicitly assume that if a study doesn’t have obvious major problems, it should be believed. Thus, they see the point of replications as checking up on iffy claims. But I’d say it the other way: unless a study’s design, data collection, and results are unambiguously clear, we should default to skepticism, and hence replication can be valuable in giving support to a potentially important claim.

Tomorrow’s post: Is “abandon statistical significance” like organically fed, free-range chicken?

Participate in Brazilian Reproducibility Initiative—even if you don’t live in South America!

Anna Dreber writes:

There’s a big reproducibility initiative in Brazil on biomedical research led by Olavo Amaral and others, which is an awesome project where they are replicating 60 studies in Brazilian biomedical research. We (as usual lots of collaborators) are having a prediction survey and prediction markets for these replications – would it be possible for you to post something on your blog about this to attract participants? I am guessing that some of your readers might be interested.

Here’s more about the project and here is how to sign up:

Sounds like fun.

To do: Construct a build-your-own-relevant-statistics-class kit.

Alexis Lerner, who took a couple of our courses on applied regression and communicating data and statistics, designed a new course, “Jews: By the Numbers,” at the University of Toronto:

But what does it mean to work with data and statistics in a Jewish studies course? For Lerner, it means not only teaching her students to work with materials like survey results, codebooks, archives and data visualization, but also to understand the larger context of data. . . .

Lerner’s students are adamant that the quantification and measurement they performed on survivor testimonies did not depersonalize the stories they examined, a stereotype often used to criticize quantitative research methods.

“Once you learn the methods that go into statistical analysis, you understand how it’s not reductionist,” says Daria Mancino, a third-year student completing a double major in urban studies and the peace, conflict and justice program. “That’s really the overarching importance of this course for the social sciences or humanities: to show us why quantifying something isn’t necessarily reductionist.” . . .

Lerner hopes her students will leave her class with a critical eye for data and what goes into making it. Should survey questions be weighted, for example? How large of a sample size is large enough for results to be reliable? How do we know that survey respondents aren’t lying? How should we calculate margins of error?

Lerner’s students will leave the course with the tools to be critical analysts, meticulous researchers and – perhaps most importantly – thoughtful citizens in an information-heavy world.
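One of those questions, the margin of error, has a standard textbook sketch (my own illustration, not from the course): for a simple random sample, the approximate 95% margin of error for an estimated proportion p̂ with sample size n is 1.96·sqrt(p̂(1−p̂)/n).

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion
    under simple random sampling (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# A poll of n=1000 with p_hat=0.5 gives roughly a 3-point margin of error.
print(round(margin_of_error(0.5, 1000), 3))  # 0.031
```

Of course, part of the point of a course like Lerner’s is that this formula is only the beginning: weighting, nonresponse, and lying respondents all break the simple-random-sample assumption behind it.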

This sounds great, and of course the same idea could be used to construct a statistics course based on any minority group. You could do it for other religious minorities or ethnic groups or states or countries or political movements or . . . just about anything.

So here’s what I want someone to do: Take this course, abstract it, and make it into a structure that could be expanded by others to fit their teaching needs. Wouldn’t it be great if there were hundreds of such classes, all over the world, wherever statistics is taught?

A build-your-own-relevant-statistics-class kit.

Let’s take Lerner’s course as a starting point, because we have it already, and from there abstract what is needed to create a structure that others can fill in.

Tomorrow’s post: Instead of replicating studies with problems, let’s replicate the good studies. (Consider replication as an honor, not an attack.)

The hot hand and playing hurt

So, was chatting with someone the other day and it came up that I sometimes do sports statistics, and he told me how he read that someone did some research finding that the hot hand in basketball isn’t real . . .

I replied that the hot hand is real, and I recommended he google “hot hand fallacy fallacy” to find out the full story.

We talked a bit about that, and then I was thinking of something related, which is that I’ve been told that professional athletes play hurt all the time. Games are so intense, and seasons are so long, that they just never have time to fully recover. If so, I could imagine that much of the hot hand has to do with temporarily not being seriously injured, or with successfully working around whatever injuries you have.

I have no idea; it’s just a thought. And it’s related to my reflection from last year:

The null model [of “there is no hot hand”] is that each player j has a probability p_j of making a given shot, and that p_j is constant for the player (considering only shots of some particular difficulty level). But where does p_j come from? Obviously players improve with practice, with game experience, with coaching, etc. So p_j isn’t really a constant. But if “p” varies among players, and “p” varies over the time scale of years or months for individual players, why shouldn’t “p” vary over shorter time scales too? In what sense is “constant probability” a sensible null model at all?

I can see that “constant probability for any given player during a one-year period” is a better model than “p varies wildly from 0.2 to 0.8 for any player during the game.” But that’s a different story. The more I think about the “there is no hot hand” model, the more I don’t like it as any sort of default.
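To make the objection concrete, here is a toy simulation (my own sketch, not from the post): two shooters with the same long-run average, one with constant p and one whose p drifts slowly over the “season”. Their overall hit rates are essentially indistinguishable; the difference between the models lives entirely in the short-run dependence, which is exactly what the “constant probability” null assumes away.

```python
import random

def simulate_shots(n, p_fn, seed=0):
    # Shot t succeeds with probability p_fn(t); the seed makes runs repeatable.
    rng = random.Random(seed)
    return [rng.random() < p_fn(t) for t in range(n)]

n = 10_000
constant = simulate_shots(n, lambda t: 0.5)                           # the null model
drifting = simulate_shots(n, lambda t: 0.4 + 0.2 * (t % 1000) / 1000) # p varies slowly

# Both long-run hit rates come out near 0.5
print(sum(constant) / n, sum(drifting) / n)
```

A season-level summary can’t tell these two shooters apart, which is why arguing about whether p is “really” constant requires looking at streaks, not averages.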