On deck very soon

A bunch of the 170 are still in the queue. I haven’t been adding to the scheduled posts for a while; instead I’ve been inserting topical items from time to time—I even got some vicious hate mail for my article on the electoral college—and shoving material for new posts into a big file that now has a couple hundred items. I’m not quite sure what to do with that one; maybe I’ll write all my posts for 2017 on a single day and get that over with? Also, sometimes our co-bloggers post here, and that’s cool.

Anyway, three people emailed me today about a much-publicized science news item that pissed them off. It’s not really topical but maybe I’ll post on it, just to air the issue out. And I have a couple literary ideas I wanted to share. So maybe I’ll do something I haven’t done for a few months, and bump a few of this week’s posts to the end of the queue.

Before that, though, I have a post that is truly topical and yet will never be urgent. I’ll schedule it to appear in the usual slot, between 9 and 10 in the morning.

Low correlation of predictions and outcomes is no evidence against hot hand


Josh Miller (of Miller & Sanjurjo) writes:

On correlations, you know, the original Gilovich, Vallone, and Tversky paper found that the Cornell players’ “predictions” of their teammates’ shots correlated 0.04, on average. No evidence they can see the hot hand, right?

Here is an easy correlation question: suppose Bob shoots with probability p_h = .55 when he is hot and p_n = .45 when he is not hot. Suppose Lisa can perfectly detect when he is hot, and when he is not. If Lisa predicts based on her perfect ability to detect when Bob is hot, what correlation would you expect?

With that setup, I could only assume the correlation would be low.

I did the simulation:

> n <- 10000
> bob_probability <- rep(c(.55,.45), c(.13,.87)*n)   # Bob is hot for 13% of shots, not hot for 87%
> lisa_guess <- round(bob_probability)               # Lisa's perfect detection: 1 when hot, 0 when not
> bob_outcome <- rbinom(n, 1, bob_probability)       # simulate Bob's makes and misses
> cor(lisa_guess, bob_outcome)
[1] 0.06

Of course, in this case I didn’t even need to compute lisa_guess as it’s 100% correlated with bob_probability.

This is a great story, somewhat reminiscent of the famous R-squared = .01 example.

P.S. This happens to be closely related to the measurement error/attenuation bias issues that Miller told me about a couple years ago. And Jordan Ellenberg in comments points to a paper from Kevin Korb and Michael Stillwell, apparently from 2002, entitled “The Story of The Hot Hand: Powerful Myth or Powerless Critique,” that discusses related issues in more detail.

The point is counterintuitive (or, at least, counter to the intuitions of Gilovich, Vallone, Tversky, and a few zillion other people, including me before Josh Miller stepped into my office that day a couple years ago) and yet so simple to demonstrate. That’s cool.

Just to be clear, right here my point is not the small-sample bias of the lagged hot-hand estimate (the now-familiar point that there can be a real hot hand but it could appear as zero using Gilovich et al.’s procedure) but rather the attenuation of the estimate: the less-familiar point that even a large hot hand effect will show up as something tiny when estimated using 0/1 data. As Korb and Stillwell put it, “binomial data are relatively impoverished.”
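For what it’s worth, the attenuation can be worked out directly rather than by simulation. Here’s a quick sketch in R (the 13% hot fraction is the value assumed in the simulation above; the calculation is just the covariance and variances of two 0/1 variables):

p_hot <- 0.55    # probability Bob makes a shot when hot
p_not <- 0.45    # probability when not hot
q     <- 0.13    # fraction of shots taken while hot, as in the simulation
p_bar <- q*p_hot + (1 - q)*p_not                # overall make probability
cov_hy <- q*(1 - q)*(p_hot - p_not)             # Cov(hot indicator, outcome)
cov_hy / sqrt(q*(1 - q)*p_bar*(1 - p_bar))      # the implied correlation: roughly 0.07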

This finding (which is mathematically obvious, once you see it, and can be demonstrated in 5 lines of code) is related to other obvious-but-not-so-well-known examples of discrete data being inherently noisy. One example is the R-squared=.01 problem linked to at the end of the above post, and yet another is the beauty-and-sex-ratio problem, where a researcher published paper after paper of what was essentially pure noise, in part because he did not seem to realize how little information was contained in binary data.

Again, none of this was a secret. The problem was sitting in plain sight, and people have been writing about this statistical power issue forever. Here, for example, is a footnote from one of Miller and Sanjurjo’s papers:

Funny how it took this long for it to become common knowledge. Almost.

P.P.S. I just noticed another quote from Korb and Stillwell (2002):

Kahneman and Tversky themselves, the intellectual progenitors of the Hot Hand study, denounced the neglect of power in null hypothesis significance testing, as a manifestation of a superstitious belief in the “Law of Small Numbers”. Notwithstanding all of that, Gilovich et al. base their conclusion that the hot hand phenomenon is illusory squarely upon a battery of significance tests, having conducted no power analysis whatsoever! This is perhaps the ultimate illustration of the intellectual grip of the significance test over the practice of experimental psychology.

I agree with the general sense of this rant, but I’d add that, at least informally, I think Gilovich et al., and their followers, came to their conclusion not just based on non-rejection of significance tests but also based on the low value of their point estimates. Hence the relevance of the issue discussed in my post above, regarding attenuation of estimates. It’s not just that Gilovich et al. found no statistically significant differences, it’s also that their estimates were biased in a negative direction (that was the key point of Miller and Sanjurjo) and pulled toward zero (the point being made above). Put all that together and it looked to Gilovich et al. like strong evidence for a null, or essentially null, effect.

P.P.P.S. Miller and Sanjurjo update: A Visible Hand? Betting on the Hot Hand in Gilovich, Vallone, and Tversky (1985).

Would Bernie Sanders have lost the presidential election?

Nobody knows what would’ve happened had Bernie Sanders been the Democratic nominee in 2016. My guess based on my reading of the political science literature following Steven Rosenstone’s classic 1983 book, Forecasting Presidential Elections, is that Sanders would’ve done a bit worse than Hillary Clinton, because Clinton is a centrist within the Democratic party and Sanders is more on the ideological extreme. This is similar to the reasoning that Ted Cruz, as the most conservative of the Republican candidates, would’ve been disadvantaged in the general election.

But I disagree with Kevin Drum, who writes, “Bernie Sanders Would Have Lost the Election in a Landslide.” Drum compares Sanders to failed Democratic candidates George McGovern, Walter Mondale, and Michael Dukakis—but they were all running against incumbent Republicans under economic conditions which were inauspicious for the Democratic opposition.

My guess would be that Sanders’s ideological extremism could’ve cost the Democrats a percentage or two of the vote. So, yes, a priori, before the campaign, I’d say that Hillary Clinton was the stronger general election candidate. And I agree with Drum that, just as lots of mud was thrown at Clinton, the Russians would’ve been able to find some dirt on Sanders too.

But here’s the thing. Hillary Clinton won the popular vote by 3 million votes. Her votes were just not in the right places. Sanders could’ve won a million or two fewer votes than Clinton, and still won the election. Remember, John Kerry lost to George W. Bush by 3 million votes but still almost won in the Electoral College—he was short just 120,000 votes in Ohio.

So, even if Sanders was a weaker general election candidate than Clinton, he still could’ve won in this particular year.

Or, to put it another way, Donald Trump lost the presidential vote by 3 million votes but managed to win the election because of the vote distribution. A more mainstream Republican candidate could well have received more votes—even a plurality!—without winning the electoral college.

The 2016 election was just weird, and it’s reasonable to say that (a) Sanders would’ve been a weaker candidate than Clinton, but (b) in the event, he could’ve won.

P.S. Drum responds to my points above with a post entitled, “Bernie Woulda Lost.” Actually that title is misleading because then in his post he writes, “I won’t deny that Sanders could have won. Gelman is right that 2016 was a weird year, and you never know what might have happened.”

But here’s Drum’s summary:

Instead of Clinton’s 51-49 percent victory in the popular vote, my [Drum’s] guess is that Sanders would have lost 47-53 or so.

Drum elaborates:

Sanders would have found it almost impossible to win those working-class votes [in the midwest]. There’s no way he could have out-populisted Trump, and he had a ton of negatives to overcome.

We know that state votes generally follow the national vote, so if Sanders had lost 1-2 percentage points compared to Clinton, he most likely would have lost 1-2 percentage points in Wisconsin, Michigan, and Pennsylvania too. What’s the alternative? That he somehow loses a million votes in liberal California but gains half a million votes in a bunch of swing states in the Midwest? What’s the theory behind that?

OK, there are a few things going on here.

1. Where does that 47-53 estimate come from? Drum’s saying that Sanders would’ve done a full 4 percentage points worse than Clinton in the popular vote. 4 percentage points is huge. It’s huge historically—Rosenstone in his aforementioned book estimates the electoral penalty for ideological extremism to be much less than that—and even more huge today in our politically polarized environment. So I don’t really see where that 4 percentage points is coming from. 1 or 2 percentage points, sure, which is why in my post above I did not say that I thought Sanders necessarily would’ve won, I just said it could’ve happened, and my best guess is that the election would’ve been really close.

As I said, I see Sanders’s non-centered political positions as costing him votes, just not nearly as much as Drum is guessing. And, again, I have no idea where Drum’s estimated 4 percentage point shift is coming from. However, there is one other thing, which is that Sanders is a member of a religious minority. It’s said that Romney being a Mormon cost him a bunch of votes in 2012, and similarly it’s not unreasonable to assume that Sanders being Jewish would cost him too. It’s hard to say: one might guess that anyone who would vote against someone just for being a Jew would already be voting for Trump, but who knows?

2. Drum correctly points out that swings are national and of course I agree with that (see, for example, item 9 here), but of course there were some departures from uniform swing. Drum attributes this to Mitt Romney being “a pro-trade stiff who was easy to caricature as a private equity plutocrat”—but some of this characterization applied to Hillary Clinton too. So I don’t think we should take the Clinton-Trump results as a perfect template for what would’ve happened, had the candidates been different.

Here are the swings:

[Figure: state-by-state vote swings]

To put it another way: suppose Clinton had run against Scott Walker instead of Donald Trump. I’m guessing the popular vote totals might have been very similar to what actually happened, but with a different distribution of votes.

Drum writes, “if Sanders had lost 1-2 percentage points compared to Clinton, he most likely would have lost 1-2 percentage points in Wisconsin, Michigan, and Pennsylvania too. What’s the alternative? That he somehow loses a million votes in liberal California but gains half a million votes in a bunch of swing states in the Midwest? What’s the theory behind that?”

My response: The theory is not that Sanders “loses” a million votes in liberal California but that he doesn’t do as well there as Clinton did—not an unreasonable assumption given that Clinton won the Democratic primary there. Similarly with New York. Just cos California and New York are liberal states, that doesn’t mean that Sanders would outperform Clinton in those places in the general election: after all, the liberals in those states would be voting for either of them over the Republican. And, yes, I think the opposite could’ve happened in the Midwest. Clinton and Sanders won among different groups and in different states in the primaries, and the gender gap in the general election increased a lot in 2016, especially among older and better-educated voters, so there’s various evidence suggesting that the two candidates were appealing to different groups of voters. My point is not that Sanders was a stronger candidate than Clinton on an absolute scale—as I wrote above, I don’t know, but my guess is that he would’ve done a bit worse in the popular vote—but rather that the particular outcome we saw was a bit of a fluke, and I see no reason to think a Sanders candidacy would’ve seen the same state-by-state swings as happened to occur with Clinton. Drum considers the scenario suggested above to be “bizarre” but I think he’s making the mistake of taking the particular Clinton-Trump outcome as a baseline. If you take Obama-Romney as a starting point and go from there, everything looks different.

Finally, Drum writes that my post “sounds like special pleading.” I looked up that term and it’s defined as “argument in which the speaker deliberately ignores aspects that are unfavorable to their point of view.” I don’t think I was doing that. I was just expressing uncertainty. Drum wrote the declarative sentence, “Bernie Sanders Would Have Lost the Election in a Landslide,” and I responded with doubt. My doubt regarding landslide claims is not new. For example, here I am on 31 Aug 2016:

Trump-Clinton Probably Won’t Be a Landslide. The Economy Says So.

I wasn’t specially pleading then, and I’m not specially pleading now. I’m just doing my best to assess the evidence.

“Calm Down. American Life Expectancy Isn’t Falling.”

Ben Hanowell writes:

In the middle of December 2016 there were a lot of headlines about the drop in US life expectancy from 2014 to 2015. Most of these articles painted a grim picture of US population health. Many reporters wrote about a “trend” of decreasing life expectancy in America.

The trouble is that the drop in US life expectancy last year was the smallest among six drops between 1960 and 2015. What’s more, life expectancy dropped in 2015 by only a little over a month. That’s half the size of the next smallest drop and two-thirds the size of the average among those six drops. Compare that to the standard deviation in year-over-year change in life expectancy, which is nearly three months. In terms of percent change, 2015 life expectancy dropped by 1.5%… but the standard deviation of year-over-year percent change in life expectancy is nearly 4%.

Most importantly, of course, life expectancy in the US has increased by about two months on average since 1960. [see above graph]

Hanowell has the full story at his blog:

The media is abuzz about a small drop in life expectancy in 2015. Yet despite sensationalist headlines, human lifespan has actually risen globally and nationally for decades if not centuries with no signs of a reversal. Alarmist news headlines follow noise rather than signal, causing us to lose sight of what’s really important: understanding how human lifespan has improved; how we can maintain that progress; how social institutions will cope with a rapidly aging population; and trends in vital statistics more fine-grained than overall life expectancy at birth.

Don’t believe the hype. Life expectancy isn’t plummeting.

Hanowell then goes through the steps:

What Is Life Expectancy?

Fact: Human Lifespan Has Risen Globally for Over 250 Years

Then he gets to the main point:

Fact: There’s No Evidence American Life Expectancy at Birth Is Falling

Okay. So the human lifespan has been increasing over the last few centuries in the U.S. and other nations. There still could have been a recent slowdown or reversal, right? Well, yes, but there’s virtually no evidence for it. The 2015 annual drop in lifespan is a mere 1.2 months of life. That’s 50% smaller than the average among six annual drops since 1960. Yet between 1960 and 2015, life expectancy in the U.S. increased by about two months per year on average. In 1960, newborns could expect to live just over 71 years. Now they can expect to live just under 79 years.

If words aren’t enough to convince you, here is an annotated picture of the numbers.

And then he gives the image that I’ve reproduced at the top of this post.

What, then?

Hanowell continues:

Let’s Stop Crying Wolf About Falling American Life Expectancy

Here are some examples of sensationalist, alarmist headlines about life expectancy:

– U.S. life expectancy declines for first time in 20 years (BBC News)
– Drugs blamed for fall in U.S. life expectancy (The Times)
– Dying younger: U.S. life expectancy a ‘real problem’ (USA Today)
– Heart disease, Alzheimer’s and accidents lead to drop in U.S. life expectancy (Newsweek)

We’ve already seen that American life expectancy is probably not a “real problem.” Quite the opposite. There may be an explanation for this short-term drop. Maybe The Times is right and it has something to do with the so-called “opioid epidemic.” Maybe Newsweek is right and we should chalk it up to heart disease and Alzheimer’s (although probably not). Maybe it’s something else entirely.

By sensationalizing short-term trends without the proper long-term context, we lose sight of the progress we’ve made. That leaves us less informed about how we’ve come so far in the first place, and where to go from here.

What We Should Be Talking About Instead of Falling Life Expectancy

(Because It Isn’t Falling)

Falling American lifespan isn’t a pressing problem. What should we focus on instead? Here are a few ideas:

Understand How We Came This Far and How to Keep Going . . .

Improve Health and Quality of Life at Advanced Ages Without Overwhelming Social Institutions . . .

Pay Greater Attention to Trends in Finer-Grained Vital Statistics Than Overall Life Expectancy . . .

Hanowell concludes his post as follows:

Recent headlines about a drop in expected American lifespan are misleading. Although life expectancy dropped by a small amount between 2014 and 2015, the long-term trend shows climbing lifespan. Instead of worrying about a problem for which there is no evidence, we should be focusing on meeting the challenges that come with longer human lifespans, and understanding why lifespan differs by demographic characteristics.

And then he has a question for me:

How can we convince journalists and the prominent scientists they quote that you can still make a story about steadily increasing life expectancy despite occasional faltering, and it won’t hurt your chances of it “going viral” or getting research funding next year? Because to me, steadily increasing life expectancy is a more interesting story once you take into account how we got here, and what we’ll need to do to keep up with our own needs while taking care of the elderly.

An efficiency argument for post-publication review

This came up in a discussion last week:

We were talking about problems with the review process in scientific journals, and a commenter suggested that prepublication review should be more rigorous:

There are lot of statistical missteps you just can’t catch until you actually have the replication data in front of you to work with and look at. Andrew, do you think we will ever see a system implemented where you have to submit the replication code with the initial submission of the paper, rather than only upon publication (or not at all)? If reviewers had the replication files, they could catch many more of these types of arbitrary specification and fishing problems that produce erroneous results, saving the journal from the need for a correction. . . . I review papers all the time and sometimes I suspect there might be something weird going on in the data, but without the data itself I often just have to take the author(s) word for it that when they say they do X, they actually did X, etc. . . . then bad science gets through and people can only catch the mistakes post-publication, triggering all this bs from journals about not publishing corrections.

I responded that, no, I don’t think that beefing up prepublication review is a good idea:

As a reviewer I am not going to want to spend the time finding flaws in a submitted paper. I’ve always been told that it is the author, not the journal, who is responsible for the correctness of the claims. As a reviewer, I will, however, write that the paper does not give enough information and I can’t figure out what it’s doing.

Ultimately I think the only only only solution here is post-publication review. The advantage of post-publication review is that its resources are channeled to the more important cases: papers on important topics (such as Reinhart and Rogoff) or papers that get lots of publicity (such as power pose). In contrast, with regular journal submission, every paper gets reviewed, and it would be a huge waste of effort for all these papers to be carefully scrutinized. We have better things to do.

This is an efficiency argument. Reviewing resources are limited (recall that millions of scientific papers are published each year) so it makes sense to devote them to work that people care about.

And, remember, the problem with peer review is the peers.

Hark, hark! the p-value at heaven’s gate sings

Three different people pointed me to this post [which was taken down; here’s the internet archive of it], in which food researcher and business school professor Brian Wansink advises Ph.D. students to “never say no”: When a research idea comes up, check it out, put some time into it and you might get some success.

I like that advice and I agree with it. Or, at least, this approach worked for me when I was a student and it continues to work for me now, and my favorite students are those who follow this approach. That said, there could be some selection bias here, in that the students who say Yes to new projects are the ones who are more likely to be able to make use of such opportunities. Maybe the students who say No would just end up getting distracted and making no progress, were they to follow this advice. I’m not sure. As an advisor myself, I recommend saying Yes to everything, but in part I’m using this advice to take advantage of the selection process, in that students who don’t like this advice might decide not to work with me.

Wansink’s post is dated 21 Nov but it’s only today, 15 Dec, that three people told me about it, so it must just have hit social media in some way.

The controversial and share-worthy aspect of the post is not the advice for students to be open to new research projects, but rather some of the specifics. Here’s Wansink:

How can time series information be used to choose a control group?

This post is by Phil Price, not Andrew.

Before I get to my question, you need some background.

The amount of electricity that is provided by an electric utility at a given time is called the “electric load”, and the time series of electric load is called the “load shape.” Figure 1 (which is labeled Figure 2 and is taken from a report by Scottmadden Management Consultants) shows the load shape for all of California for one March day from each of the past six years (in this case, the day with the lowest peak electric load). Note that the y-axis does not start at zero.


Figure 1: Electric load (the amount of electricity provided by the electric grid) in the middle of the day has been decreasing year by year in California as alternative energy sources (mostly solar) are added.

In March in California, the peak demand is in the evening, when people are at home with their lights on, watching television and cooking dinner and so on.

An important feature of Figure 1 is that the electric load around midnight (far left and far right of the plot) is rather stable from year to year, and from day to day within a month, but the load in the middle of the day has been decreasing every year. The resulting figure is called the “duck curve”: see the duck’s tail at the left, body in the middle, and head/bill at the right?

The decrease in the middle of the day is due in part to photovoltaic (PV) generation, which has been increasing yearly and is expected to continue to increase in the future: when the sun is out, the PV panels on my house provide most of the electricity my house uses, so the load that has to be met by the utility is lower now than before we got PV.


Applying statistical thinking to the search for extraterrestrial intelligence

Thomas Basbøll writes:

A statistical question has been bugging me lately. I recently heard that Yuri Milner has donated 100 million dollars to a 10-year search for extraterrestrial intelligence.

I’m not very practiced in working out probability functions but I thought maybe you or your readers would find it easy and fun to do this. Here’s my sense of the problem:

Suppose there are 1 million civilizations in the Galaxy. Suppose they are all sending us a strong, unambiguous signal. Suppose we listen continuously in all directions and suppose we’re right about what frequencies to listen to. What are the odds of detecting a signal within ten years?

As far as I can tell, this question is impossible to answer if we don’t know when they started sending.

Suppose they all *just* started sending a signal. On this view, we’ll probably not detect the first signal for 300 years. (Carl Sagan calculated the average distance between 1 million galactic civilisations to be 300 ly.) So if we’re hoping there’s some chance of detection within the next 10 years, we’re assuming that some of the 1 million signals left their source long ago.

Now, suppose one of them is 10,000 ly away and started signalling 100,000 years ago. In order to be detectable today the beacon would have to transmit for at least 90,000 years. If it transmitted for a “mere” 25,000 years we’d miss it.

For each civilisation there is uncertainty about the start time *and* the duration of the signal (two important uncertainties).

Suppose the range of start-time uncertainty is 10 billion years (from five billion years ago to five billion years from now) and the duration can be from 1,000 to 100,000 years. We now assign a random distribution over 1 million sources to distances between 300 and 100,000 light years away. That is, from each of one million points, between 300 and 100,000 light years away, starting sometime from five billion years ago to five billion years from now, a signal lasting between 1,000 and 100,000 years is directed at us. What are the odds that a signal is hitting us right now (or during the 10-year “now” of the Breakthrough Listen project)?

My sense is that they are very, very low. But am I right about that?

I realize this is a somewhat esoteric topic, but I’d be interested in seeing how this problem can be modelled, and how the parameters can be changed to improve the odds. As far as I can tell, actually, SETI promoters explicitly “neglect time” in their models, imagining that each signal has been transmitting since the birth of the galaxy and will continue to transmit forever. On that view, of course, there are one million signals actually hitting us right now to find. And the “cosmic haystack” is just the 200 billion stars in the galaxy. But this assumption is so unrealistic that I’d like to see if the haystack can’t be modelled more usefully.

My reply: I have no idea but perhaps some of the commenters will have thought about this one. My quick thought is that, given that we haven’t heard any such signals so far, it doesn’t seem likely that we’ll hear any soon. But maybe that misses the point that any signals will be so weak that they’d need lots of instrumentation to be detected and lots of computing power to resolve.
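For what it’s worth, the setup Basbøll describes is easy to simulate directly. Here’s a minimal Monte Carlo sketch in R, treating his distance, start-time, and duration ranges as uniform and independent (those distributional choices, and the 10-year listening window starting now, are my assumptions for illustration, not anything from SETI):

set.seed(1)
n_civ    <- 1e6                              # civilizations in the galaxy
distance <- runif(n_civ, 300, 1e5)           # light years from Earth
start    <- runif(n_civ, -5e9, 5e9)          # when each beacon switches on (years; 0 = now)
duration <- runif(n_civ, 1e3, 1e5)           # how long each beacon transmits (years)
arrive_from <- start + distance              # when the signal first reaches Earth
arrive_to   <- start + duration + distance   # when it stops reaching Earth
hits <- (arrive_from <= 10) & (arrive_to >= 0)   # arrival window overlaps the 10-year project
sum(hits)                                    # how many of the million signals are detectable in that window

Changing the assumed ranges (or the number of civilizations) and rerunning gives a quick feel for how sensitive the answer is to each of them.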

The other thing that strikes me is how little we hear about this nowadays. It seems to me that a few decades ago there was a lot more talk about extraterrestrial aliens. Perhaps one reason for the decline in interest in the topic is that we haven’t heard any signals; another reason is that we have an alien intelligence among us now—computers—so there’s less interest in a hypothetical alien intelligence that might not even exist.

Designing an animal-like brain: black-box “deep learning algorithms” to solve problems, with an (approximately) Bayesian “consciousness” or “executive functioning organ” that attempts to make sense of all these inferences

The journal Behavioral and Brain Sciences will be publishing this paper, “Building Machines That Learn and Think Like People,” by Brenden Lake, Tomer Ullman, Joshua Tenenbaum, and Samuel Gershman. Here’s the abstract:

Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.

The journal solicited discussions, with the rule being that you say what you’re going to talk about and give a brief abstract of what you’ll say. I wrote the following:

What aspect of the target article or book you would anticipate commenting on:

The idea that a good model of the brain’s reasoning should use Bayesian inference rather than predictive machine learning.

Proposal for commentary:

Lake et al. in this article argue that atheoretical machine learning has limitations and they argue in favor of more substantive models to better simulate human-brain-like AI. As a practicing Bayesian statistician, I’m sympathetic to this view—but I’m actually inclined to argue something somewhat different: I’d claim that it could make sense to do AI via black-box machine learning algorithms such as the famous program that plays Pong, or various automatic classification algorithms, and then have the Bayesian model be added on, as a sort of “consciousness” or “executive functioning organ” that attempts to make sense of all these inferences. That seems to me to possibly be a better description of how our brains operate, and in some deeper level I think it is closer to fitting my view of how we learn from data.

The editors decided they didn’t have space for my comment so I did not write anything more. Making the call based on the abstract is an excellent, non-wasteful system, much better than another journal (which I will not name) where they requested I write an article for them on a specific topic, then I wrote the article, then they told me they didn’t want it. That’s just annoying, cos then I have this very specialized article that I can’t do anything with.

Anyway, I still find the topic interesting and important; I’d been looking forward to writing a longer article on it. In the meantime, you can read the above paragraph along with this post from a few months ago, “Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness.” And of course you can read the Lake et al. article linked to above.

Science journalist recommends going easy on Bigfoot, says you should bash mammograms instead

Paul Alper points us to this transcribed lecture by John Horgan. It’s a talk Horgan gave to a conference on Science and Skepticism, which began:

I [Horgan] am a science journalist. I don’t celebrate science, I criticize it, because science needs critics more than cheerleaders. I point out gaps between scientific hype and reality. That keeps me busy, because, as you know, most peer-reviewed scientific claims are wrong.

Following the links, I also came across this bit by Horgan, from a post entitled, “A Dig Through Old Files Reminds Me Why I’m So Critical of Science”:

I [Horgan] keep struggling to find the right balance between celebrating and challenging alleged advances in science. After all, I became a science writer because I love science, and so I have tried not to become too cynical and suspicious of researchers. I worry sometimes that I’m becoming a knee-jerk critic. But the lesson I keep learning over and over again is that I am, if anything, not critical enough. . . .

The vast majority of scientists and journalists who write about science—not to mention the legions of flaks working at universities, science-oriented corporations and other institutions—present science in a positive light. My own journalistic shortcomings aside, I believe science has been ill-served by all this positivity.

I agree. It doesn’t help when credentialed scientists, recipients of huge levels of public funds and publicity, issue pronouncements such as “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” Or when the National Academy of Sciences puts its seal of approval on laughable papers on air rage, himmicanes, etc. Or when Lancet publishes with a straight face an unregularized regression with 50 data points and 39 predictors. Or when leading news organizations such as NPR hype this sort of work.

It’s hard not to be a skeptic when so many leading institutions keep trying to follow business as usual.

The controversial parts of Horgan’s speech, though, were not where he criticized picayune bad science—he didn’t mention “power pose” and Ted talks, and he could’ve, but that all wouldn’t’ve been so relevant to his audience, who I think are (rightly) more interested in the big questions (How does the universe work? Why are we here? Etc.) than in the daily life of the scientific community.

Rather, controversy came from his recommendations of where skeptics should aim their fire. Here’s Horgan:

You don’t apply your skepticism equally. You are extremely critical of belief in God, ghosts, heaven, ESP, astrology, homeopathy and Bigfoot. You also attack disbelief in global warming, vaccines and genetically modified food.

These beliefs and disbeliefs deserve criticism, but they are what I call “soft targets.” . . .

Meanwhile, you neglect what I call hard targets. These are dubious and even harmful claims promoted by major scientists and institutions. In the rest of this talk, I’ll give you examples of hard targets from physics, medicine and biology. I’ll wrap up with a rant about war, the hardest target of all.

Rather than getting into the details of Horgan’s argument with Pinker and others (go to Horgan’s post and search on Pinker for lots of links, including this fascinating story involving Marc “Evilicious” Hauser, among others), I thought it could be helpful to attempt some sort of taxonomy of science, or pseudo-science, that could be criticized. Then we can see how Horgan and the organized skeptic community fit on this scale.

So, here are various scientific or pseudo-scientific beliefs that get criticized:

1. Crackpot ideas: Bigfoot, perpetual motion machines, spoon bending, etc. These are ridiculous notions that educated people talk about only because they really want to believe (Conan Doyle and those obviously faked photographs of fairies) or perhaps for some political or quasi-political reason (Freeman Dyson and Velikovsky).

2. Ideas with a religious connection: Jesus sightings, Dead Sea crossings, Noah’s ark, creationism, anonymous prayer, etc. These ideas make zero sense on their own, but can be difficult to dispute without offending millions of religious fundamentalists. Perhaps for some people, offending the fundamentalists is a goal in itself, but in any case these ideas have baggage that spoon bending, etc., do not.

3. Pathological science: ESP, N-rays, cold fusion, and any other phenomena that show up when studied by the proponents of the idea, but which outsiders can never see. The work of Marc Hauser is an extreme example here in that he actively tried to stop others in his lab from questioning his data coding.

4. Junk science: power pose, air rage, ovulation and voting, beauty and sex ratio, himmicanes, etc. These ideas don’t contradict any known science but have been studied in such a sloppy way that nothing can be learned from the research in question. This work is junk science in part because of the very weak connection between theory and measurement.

5. Politicized questions: effectiveness of charter schools, causes of global warming, effects of a higher minimum wage, etc. These are issues that, at least in the U.S., are so strongly tied to political affiliation that it seems that one can only have discussions on the specifics, not on the bigger questions. Sometimes this leads to actual junk science (for example, claims about negative effects of abortion, or Soviet-era claims about the effectiveness of planned economies) but lots of this is solid science that’s hard for people to handle because they don’t want to hear the conclusions.

6. Slightly different are theories that are not inherently political but which some people seem to feel very strongly about, ideas such as vaccines and autism, various theories about diet. Also somewhere in here are belief systems such as astrology and homeopathy that follow some of the forms of science but sort of live in their own bubbles.

7. Unfalsifiable theories (what I call frameworks): Marxism, racism, Freudianism, evolutionary psychology, neoclassical economics, etc. All these frameworks are based on real insights and observation (as well as hopes and biases) but can get taken way beyond any level of falsifiability: for their enthusiasts, these become universal explanations of human behavior.

8. Mathematical models: Here I’m thinking of string theory, which Horgan disses because of being undetectable by experiment. I put these in a different category than item 7 above because . . . ummmm, it’s hard to explain, but I think there’s a difference between a mathematical theory such as superstrings, and a catchall explanation machine such as racism or Freudianism or neoclassical economics or whatever.

9. Hype: This could be any of the categories above (for example, the “gay gene”), or simply good science that’s getting hyped, for example solid research on genetics that is being inappropriately sold as representing a cancer cure just around the corner.

10. Misdirection: Horgan and others are bothered by the scientific establishment spending zillions on Big Science without attacking real problems that face the world. As we used to say, How come they can put a man on the moon but they can’t cure the common cold? This line of reasoning might be wrong—maybe Big Science really is the way to go. I’m just listing this as one of the ways in which science gets criticized.

11. Scientific error. This can get controversial. Horgan criticized (and believes skeptics should criticize) “the deep-roots theory of war,” on the grounds that “the evidence is overwhelming that war was a cultural innovation—like agriculture, religion, or slavery—that emerged less than 12,000 years ago.” On the other hand, various other people think there is strong evidence for deep roots of war. It’s hard for me to judge this one, so all I’ll say is that this kind of thing is different than items 1-7 above.

Different skeptics have different tastes on what to criticize. Horgan thinks we should be slamming items 8, 9, 10, and 11. I spend a lot of time on items 3, 4, and 9. Items 1 and 2 don’t interest me so much, perhaps because they’re pretty much outside of science. Various commenters here think I talk too much about item 4, but those cases interest me in part because of the statistical insight I can draw from them. Item 5 is super-important and it comes up on this blog from time to time, but I’m never quite sure what to say on such issues because it seems that people have already decided what they’re going to think. Item 6 is kind of boring to me but it’s a traditional topic of skeptical inquiry. Finally, items 10 and 11 are huge; I don’t talk that much about them because I don’t have much particular expertise there. Except for some specific areas in political science where I’ve done the research: I really do get annoyed when political pundits such as Michael Barone garble the connections between income and voting, or when know-it-alls such as Steven Levitt go around saying it’s irrational to vote, or when people who know just enough math to be dangerous but not enough to really understand things go around saying that voters in large states have an electoral college advantage. These items are pretty small potatoes though, so I don’t spend too much time screaming about them.

P.S. I doubt that anyone’s actually talking about bigfoot anymore. But I do think that skeptics target the items that get the hype. Back in the 1970s, we really were hearing a lot in the media about bigfoot, the Bermuda triangle, Biblical relics, and ESP, so it makes sense that these topics were investigated by skeptics. In the 2005-2015 era, major news media such as NPR, Gladwell, Freakonomics, Ted, etc., were talking about psychology pseudoscience, and this attracted skeptics’ attention. Horgan’s point, I suppose, is that by always following the bouncing ball of what’s in the news RIGHT NOW, we’re neglecting the big chronic issues.

Bayesian statistics: What’s it all about?

Kevin Gray sent me a bunch of questions on Bayesian statistics and I responded. The interview is here at KDnuggets news. For some reason the KDnuggets editors gave it the horrible, horrible title, “Bayesian Basics, Explained.” I guess they don’t waste their data mining and analytics skills on writing blog post titles!

That said, I like a lot of the things I wrote, so I’ll repeat the material (with some slight reorganization) here:

What is Bayesian statistics?

Bayesian statistics uses the mathematical rules of probability to combine data with prior information to yield inferences which (if the model being used is correct) are more precise than would be obtained by either source of information alone.

In contrast, classical statistical methods avoid prior distributions. In classical statistics, you might include in your model a predictor (for example), or you might exclude it, or you might pool it as part of some larger set of predictors in order to get a more stable estimate. These are pretty much your only choices. In Bayesian inference you can—OK, you must—assign a prior distribution representing the set of values the coefficient can take. You can reproduce the classical methods using Bayesian inference: In a regression prediction context, setting the prior of a coefficient to uniform or “noninformative” is mathematically equivalent to including the corresponding predictor in a least squares or maximum likelihood estimate; setting the prior to a spike at zero is the same as excluding the predictor; and you can reproduce a pooling of predictors through a joint deterministic prior on their coefficients. But in Bayesian inference you can do much more: by setting what is called an “informative prior,” you can partially constrain a coefficient, striking a compromise between the noisy least-squares estimate and setting the coefficient exactly to zero. It turns out this is a powerful tool in many problems—especially because in problems with structure, we can fit so-called hierarchical models which allow us to estimate aspects of the prior distribution from data.
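As a toy illustration of that compromise (a sketch with made-up numbers, using the usual normal-normal shrinkage formula for a single coefficient, with the prior centered at zero):

beta_hat <- 0.8    # noisy least-squares estimate of the coefficient (made up)
se       <- 0.5    # its standard error (made up)
tau      <- 0.3    # prior sd: large tau recovers least squares, small tau pulls the estimate toward zero
post_mean <- (beta_hat / se^2) / (1 / se^2 + 1 / tau^2)   # precision-weighted average of beta_hat and 0
post_sd   <- 1 / sqrt(1 / se^2 + 1 / tau^2)
c(post_mean, post_sd)

The same precision-weighting logic, with tau itself estimated from the data, is what a hierarchical model does across a whole batch of coefficients.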

The theory of Bayesian inference originates with its namesake, Thomas Bayes, an 18th-century English cleric, but it really took off in the late 18th century with the work of the French mathematician and physicist Pierre-Simon Laplace. Bayesian methods were used for a long time after that to solve specific problems in science, but it was in the mid-20th century that they came to be proposed as a general statistical tool. Some key figures include John Maynard Keynes and Frank Ramsey, who in the 1920s developed an axiomatic theory of probability; Harold Jeffreys and Edwin Jaynes, who from the 1930s through the 1970s developed Bayesian methods for a variety of problems in the physical sciences; Jimmie Savage and Dennis Lindley, mathematicians who in research from the 1950s through the 1970s connected and contrasted Bayesian methods with classical statistics; and, not least, Alan Turing, who used Bayesian probability methods to crack the Enigma code in the Second World War, and his colleague I. J. Good, who explored and wrote prolifically about these ideas over the succeeding decades.

Within statistics, Bayesian and related methods have become gradually more popular over the past several decades, often developed in different applied fields, such as animal breeding in the 1950s, educational measurement in the 1960s and 1970s, spatial statistics in the 1980s, and marketing and political science in the 1990s. Eventually a sort of critical mass developed in which Bayesian models and methods that had been developed in different applied fields became recognized as more broadly useful.

Another factor that has fostered the spread of Bayesian methods is progress in computing speed and improved computing algorithms. Except in simple problems, Bayesian inference requires difficult mathematical calculations—high-dimensional integrals—which are often most practically computed using stochastic simulation, that is, computation using random numbers. This is the so-called Monte Carlo method, which was developed systematically by the mathematician Stanislaw Ulam and others when trying out designs for the hydrogen bomb in the 1940s and then rapidly picked up in the worlds of physics and chemistry. The potential for these methods to solve otherwise intractable statistics problems became apparent in the 1980s, and since then each decade has seen big jumps in the sophistication of algorithms, the capacity of computers to run these algorithms in real time, and the complexity of the statistical models that practitioners are now fitting to data.
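To give a sense of the Monte Carlo idea in its simplest form, here is a toy sketch in R: estimate a posterior summary by averaging random draws rather than by doing an integral. (The example uses a conjugate Beta posterior precisely so the exact answer is known; the real payoff is in high-dimensional problems where no such formula exists.)

set.seed(2)
draws <- rbeta(1e5, 1 + 7, 1 + 3)          # posterior after 7 successes and 3 failures, uniform prior
mean(draws)                                # Monte Carlo estimate of the posterior mean (exact answer: 8/12)
quantile(draws, c(0.025, 0.975))           # 95% posterior interval, also straight from the draws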

Now, don’t get me wrong—computational and algorithmic advances have become hugely important in non-Bayesian statistical and machine learning methods as well. Bayesian inference has moved, along with statistics more generally, away from simple formulas toward simulation-based algorithms.

Comparisons to other statistical methods

I wouldn’t say there’s anything that only Bayesian statistics can provide. When Bayesian methods work best, it’s by providing a clear set of paths connecting data, mathematical/statistical models, and the substantive theory of the variation and comparison of interest. From this perspective, the greatest benefits of the Bayesian approach come not from default implementations, valuable as they can be in practice, but in the active process of model building, checking, and improvement. In classical statistics, improvements in methods often seem distressingly indirect: you try a new test that’s supposed to capture some subtle aspect of your data, or you restrict your parameters or smooth your weights, in some attempt to balance bias and variance. Under a Bayesian approach, all the tuning parameters are supposed to be interpretable in real-world terms, which implies—or should imply—that improvements in a Bayesian model come from, or supply, improvements in understanding of the underlying problem under study.

The drawback of this Bayesian approach is that it can require a bit of a commitment to construction of a model that might be complicated, and you can end up putting effort into modeling aspects of data that maybe aren’t so relevant for your particular inquiry.

Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.

There’s room for lots of methods. What’s important in any case is what problems they can solve. We use the methods we already know and then learn something new when we need to go further. Bayesian methods offer a clarity that comes from the explicit specification of a so-called “generative model”: a probability model of the data-collection process and a probability model of the underlying parameters. But construction of these models can take work, and it makes sense to me that for problems where you have a simpler model that does the job, you just go with that.
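To make “generative model” concrete, here is a toy sketch: a probability model for the parameter plus a probability model for the data given the parameter, which together let you simulate fake data before you have seen any real data. (The normal prior, normal likelihood, and all the numbers are placeholders chosen for illustration.)

set.seed(3)
theta_sim <- rnorm(100, mean = 0, sd = 1)                              # draw parameters from the prior
y_sim <- sapply(theta_sim, function(th) rnorm(20, mean = th, sd = 2))  # then draw data given each parameter
dim(y_sim)                                                             # 20 simulated observations for each of 100 parameter draws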

Looking at the comparison from the other direction, when it comes to big problems with streaming data, Bayesian methods are useful but the Bayesian computation can in practice only be approximate. And once you enter the zone of approximation, you can’t cleanly specify where the modeling approximation ends and the computing approximation begins. At that point, you need to evaluate any method, Bayesian or otherwise, by looking at what it does to the data, and the best available method for any particular problem might well be set up in a non-Bayesian way.

Bayesian inference and big data

The essence of Bayesian statistics is the combination of information from multiple sources. We call this data and prior information, or hierarchical modeling, or dynamic updating, or partial pooling, but in any case it’s all about putting together data to understand a larger structure. Big data, or data coming from the so-called internet of things, are inherently messy: scraped data not random samples, observational data not randomized experiments, available data not constructed measurements. So statistical modeling is needed to put data from these different sources on a common footing. I see this in the analysis of internet surveys where we use multilevel Bayesian models to use non-random samples to make inferences about the general population, and the same ideas occur over and over again in modern messy-data settings.
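The partial-pooling idea can be sketched by hand in a few lines. In this toy example the group estimates, their standard errors, and the between-group sd tau are all made up, and tau is treated as known rather than estimated from the data, which a real hierarchical model would do:

group_est <- c(0.2, 0.9, 0.5, 0.1)    # raw estimates from four groups (made up)
group_se  <- c(0.3, 0.4, 0.1, 0.5)    # their standard errors (small samples give big standard errors)
tau       <- 0.2                      # assumed sd of the true group effects
grand     <- weighted.mean(group_est, 1 / (group_se^2 + tau^2))   # precision-weighted overall mean
shrink    <- group_se^2 / (group_se^2 + tau^2)                    # how far each estimate moves toward the overall mean
pooled    <- (1 - shrink) * group_est + shrink * grand
round(cbind(raw = group_est, pooled = pooled), 2)                 # noisier groups get pulled in more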

Using Bayesian methods yourself

You have to learn by doing, and one place to start is to look at some particular problem. One example that interested me recently was a website constructed by the sociologist Pierre-Antoine Kremp, who used the open-source statistics language R and the open-source Bayesian inference engine Stan (named after Stanislaw Ulam, the inventor of the Monte Carlo method mentioned earlier) to combine U.S. national and state polls to make daily forecasts of the U.S. presidential election. In an article for Slate, I called this “the open-source poll aggregator that will put all other poll aggregators out of business” because ultimately you can’t beat the positive network effects of free and open-source: the more people who see this model, play with it, and probe its weaknesses, the better it can become. The Bayesian formalism allows a direct integration of data from different sorts of polls in the context of a time-series prediction model.

Are there any warnings? As a famous cartoon character once said, With great power comes great responsibility. Bayesian inference is powerful in the sense that it allows the sophisticated combination of information from multiple sources via partial pooling (that is, local inferences are constructed in part from local information and in part from models fit to non-local data), but the flip side is that when assumptions are very wrong, conclusions can be far off too. That’s why Bayesian methods need to be continually evaluated with calibration checks, comparisons of observed data to simulated replications under the model, and other exercises that give the model an opportunity to fail. Statistical model building, but maybe especially in its Bayesian form, is an ongoing process of feedback and quality control.
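Here is a minimal sketch of what “comparisons of observed data to simulated replications under the model” can look like in practice. The count data are made up, and the conjugate Poisson-Gamma model is chosen only so the whole check fits in a few lines:

set.seed(4)
y <- rpois(50, 3)                                                      # "observed" counts (made up)
rate_draws <- rgamma(1000, shape = 1 + sum(y), rate = 1 + length(y))   # posterior for the Poisson rate under a Gamma(1,1) prior
y_rep <- matrix(rpois(1000 * 50, rep(rate_draws, each = 50)),
                nrow = 1000, byrow = TRUE)                             # one replicated dataset per posterior draw
stat_obs <- max(y)                                                     # test statistic: the largest count
stat_rep <- apply(y_rep, 1, max)                                       # the same statistic in each replication
mean(stat_rep >= stat_obs)                                             # values near 0 or 1 would flag misfit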

A statistical procedure is a sort of machine that can run for a while on its own, but eventually needs maintenance and adaptation to new conditions. That’s what we’ve seen in the recent replication crisis in psychology and other social sciences: methods of null hypothesis significance testing and p-values, which had been developed for analysis of certain designed experiments in the 1930s, were no longer working in modern settings of noisy data and uncontrolled studies. Savvy observers had realized this for a while—psychologist Paul Meehl was writing acerbically about statistically-driven pseudoscience as early as the 1960s—but it took a while for researchers in many professions to catch on. I’m hoping that Bayesian modelers will be quicker to recognize their dead ends, and in my own research I’ve put a lot of effort into developing methods for checking model fit and evaluating predictions.

Stan

Different software will serve different needs. Many users will not know a lot of statistics and will want to choose among some menu of models or analyses, and I respect that. We have written wrappers in Stan with pre-coded versions of various standard choices such as linear and logistic regression, ordered regression, multilevel models with varying intercepts and slopes, and so forth, and we’re working on tutorials that will allow the new user to fit these models in R or Stata or other familiar software.

Other users come to Stan because they want to build their own models, or, better still, want to explore their data by fitting multiple models, comparing them, and evaluating their fit. Indeed, our motivation in developing Stan was to solve problems in my own applied research, to fit models that I could not easily fit any other way.

Statistics is sometimes divided between graphical or “exploratory” data analysis, and formal or “confirmatory” inference. But I think that division is naive: in my experience, data exploration is most effectively done using models, and, conversely, our most successful models are constructed as the result of an intensive period of exploration and feedback. So, for me, I want model-fitting software that is:

– Flexible (so I can fit the models I want and expand them in often unanticipated ways);
– Fast (so I can fit many models);
– Connected to other software (so I can prepare my datasets before entering them in the model, and I can graphically and otherwise explore the fitted model relative to the data);
– Open (so I can engage my collaborators and the larger scientific community in my work, and conversely so I can contribute by sharing my modeling expertise in a common language);
– Readable and transparent (both so I can communicate my models with others and so I can actually understand what my models are doing).

Our efforts on Stan move us toward these goals.

Future research

Lots of directions here. From the modeling direction, we have problems such as polling where our samples are getting worse and worse, less and less representative, and we need to do more and more modeling to make reasonable inferences from sample to population. For decision making we need causal inference, which typically requires modeling to adjust for differences between so-called treatment and control groups in observational studies. And just about any treatment effect we care about will vary depending on scenario. The challenge here is to estimate this variation, while accepting that in practice we will have a large residue of uncertainty. We’re no longer in the situation where “p less than .05” can be taken as a sign of a discovery. We need to accept uncertainty and embrace variation. And that’s true no matter how “big” our data are.

In practice, much of my thought goes into computing. We know our data are messy, we know we want to fit big models, but the challenge is to do so stably and in reasonable time—in the current jargon, we want “scalable” inference. Efficiency, stability, and speed of computing are essential. And we want more speed than you might think, because, as discussed earlier, when I’m learning from data I want to fit lots and lots of models. Of course then you have to be concerned about overfitting, but that’s another story. For most of the problems I’ve worked on, there are potential big gains from exploration, especially if that exploration is done through substantively-based models and controlled with real prior information. That is, Bayesian data analysis.

The social world is (in many ways) continuous but people’s mental models of the world are Boolean

Raghu Parthasarathy points me to this post and writes:

I wrote it after seeing one too many talks in which someone makes boolean statements about effects “existing” or “not existing” (infuriating in itself) based on “p < 0.05” or “p > 0.05”. Of course, you’ve written tons of great things on the pitfalls, errors, and general absurdity of NHST [null hypothesis significance testing], but I’m not sure if you’ve ever called out the general error of “binary” thinking, and how NHST enables this.

In reply, I pointed him to these old posts:

Thinking like a statistician (continuously) rather than like a civilian (discretely)

Message to Booleans: It’s an additive world, we just live in it

Confirmationist and falsificationist paradigms of science

Whither the “bet on sparsity principle” in a nonsparse world?

Raghu responded:

Your second link contains a very interesting sentence, that “We live in an additive world that our minds try to model Booleanly.” Often, when people criticize science, a common complaint is that science and scientists want to see complex issues as “black and white.” However, science that’s done well doesn’t do this, as you’ve written many times—it recognizes and quantifies uncertainty, complexity, and all the rest. One could argue that the “our” in “that our minds try to model Booleanly” is the view not of the non-scientist lay-person, nor of a “good” scientist, but rather that of a naive scientist who hasn’t moved beyond the simple textbook picture of science that we teach people at young ages.

I replied that I do think it’s a human tendency to understand things as Boolean, maybe because such rules are simpler to remember and compute.

To which Raghu responded:

You’re probably right. There must be some interesting psychological / anthropological / historical work out there on when people (either as individuals or culturally) start to, at least sometimes, adopt continuous rather than binary measures of causes & effects.

Avoiding only the Shadow knowing the motivating problem of a post.

Given that I am starting to make some posts to this blog (again), I was pleased to run across a YouTube video of Xiao-Li Meng being interviewed on just this topic by Suzanne Smith, the Director of the Center for Writing and Communicating Ideas.

One thing I picked up was to make the problem being addressed in any communication very clear, as there should be a motivating problem – the challenges of recognising and defining the problem should not be overlooked. The other thing was that the motivating problem should be located in the sub-field(s) of statistics that address such problems.

The second is easier, as my motivating problems mostly involve ways to better grasp insight(s) from theoretical statistics in order to better apply statistics in applications – so the sub-fields are theory and application, going primarily from theory to application. This largely involves trying to find metaphors or, even better, various ways to re-represent theory in terms that are more suggestive of how and why it works or hopes to work. Vaguely (and perhaps over-hopefully), the aim is to get diagrammatic representations that give a moving picture of how and why it works or hopes to work – to see representing (modelling) at work.

At a very general level, my current sense is that statistics is best viewed as being primarily about conjecturing, assessing, and adopting idealised representations of reality, predominantly using probability-generating models for both parameters and data. We want less wrong representations of reality, and hopefully we can get them. This can only be a hope, as we never have direct access to reality and so can never know for sure. In light of this, my motivating problem is how to get less wrong representations of reality while remaining hopeful.

This representation of reality venture can be broken into three stages:

1. Speculate a prior distribution for how the unknowns (e.g. parameters) were determined or set in nature, and for how observations were subsequently generated given those unknowns.

2. Deduce the most relevant representation given the actual observations that occurred (aka getting the posterior).

3. Evaluate the fit and credibility of the representation in light of 1 and 2, with a prejudice for finding faults (ways to improve), returning to 1 until no further improvement seems currently plausible.

Steps 1, 2, 3; 1, 2, 3; 1, 2, 3 – then hold for now and hope. (A toy sketch of one pass through the loop follows.)
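Here is that toy sketch in R, on a model simple enough to work by hand (binomial data with a beta prior); the numbers and the particular checks are mine, purely to illustrate one pass through 1, 2, 3:

set.seed(1)

# Stage 1: speculate how the unknown was set and how data would then arise
prior_draws <- rbeta(1000, 2, 2)                       # speculated prior for theta
fake_y <- rbinom(1000, size = 50, prob = prior_draws)  # prior predictive draws

# the (made-up) observed data
y <- 41
n <- 50

# Stage 2: deduce the representation given what was actually observed
post_draws <- rbeta(1000, 2 + y, 2 + n - y)            # conjugate posterior

# Stage 3: evaluate, with a prejudice for finding faults
mean(fake_y >= y)                       # was data like this plausible under stage 1?
rep_y <- rbinom(1000, size = n, prob = post_draws)     # posterior predictive draws
hist(rep_y); abline(v = y)              # compare replicated data to the observed data
# if the checks look bad, revise the speculation in stage 1 and go around again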

The full theory I am trying to draw from is mostly statistical, but some theory of (profitable) empirical inquiry (aka philosophy) is also required, as the aim is to enable others to avoid being misled when trying to learn from observations while being aware of the risks they are unable to avoid.

In summary, my future posts will have these motivations and will most likely focus on the speculation of _good_ priors and on evaluating the fit of, understanding, criticising, and deciding whether to (tentatively) keep representations. This should not be taken as suggesting that getting posteriors is less important – but that is not my strength (and I am hoping Stan will increasingly make that simple in more and more cases).


Avoiding selection bias by analyzing all possible forking paths

Ivan Zupic points me to this online discussion of the article, Dwork et al. 2015, The reusable holdout: Preserving validity in adaptive data analysis.

The discussants are all talking about the connection between adaptive data analysis and the garden of forking paths; for example, this from one commenter:

The idea of adaptive data analysis is that you alter your plan for analyzing the data as you learn more about it. . . . adaptive data analysis is typically how many researchers actually conduct their analyses, much to the dismay of statisticians. As such, if one could do this in a statistically valid manner, it would revolutionize statistical practice.

Just about every data analysis I’ve ever done is adaptive, and I do think most of what I do is “statistically valid,” so whassup with that? A clue is provided by my paper with Jennifer and Masanao, “Why we (usually) don’t have to worry about multiple comparisons.” If you fit a multilevel model (or a Bayesian model with informative prior distributions), then it’s perfectly “statistically valid” to look at many comparisons. The key is to aim to do all the analyses you might do, avoiding selection bias by performing all relevant comparisons, and avoiding the problems with p-values by partially pooling all your comparisons rather than just reporting a selected subset.
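As a toy sketch of that last point (assuming the rstanarm package; the simulated setup is mine), here is what it looks like to estimate fifty comparisons at once with partial pooling and then look at all of them, rather than a significance-filtered subset:

library(rstanarm)

# simulate 50 groups whose true effects vary a little around zero
set.seed(1)
J <- 50
true_effect <- rnorm(J, 0, 0.2)
d <- data.frame(group = rep(1:J, each = 20))
d$y <- true_effect[d$group] + rnorm(nrow(d))

# one multilevel model for all the groups at once
fit <- stan_lmer(y ~ 1 + (1 | group), data = d)

# partially pooled estimates for every group, examined together
group_effects <- ranef(fit)$group[, "(Intercept)"]
plot(true_effect, group_effects)
abline(0, 1)

The fifty estimates get pulled toward each other by the model, so reporting all of them does not carry the selection problem that reporting only the few that cross a significance threshold would.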

What is valued by the Association for Psychological Science

Someone pointed me to this program of the forthcoming Association for Psychological Science conference:

[Screenshot of the conference program]

Kind of amazing that they asked Amy Cuddy to speak. Weren’t Dana Carney or Andy Yap available? What would really have been bold would have been for them to invite Eva Ranehill or Anna Dreber.

Good stuff. The chair of the session is Susan Goldin-Meadow, who’s famous both for inviting that non-peer-reviewed “methodological terrorism” article that complained about non-peer-reviewed criticism, and also for some over-the-top claims of her own, including this amazing statement:

Barring intentional fraud, every finding is an accurate description of the sample on which it was run.

This is ridiculous. For example, I think it’s safe to assume that Reinhart and Rogoff did not do any intentional fraud in that famous paper of theirs—even their critics just talked about an “Excel error.” But their findings were not an accurate description of their sample. Similarly with Susan Fiske and her t statistics of 5.03 and 11.14 which were actually 1.8 and 3.3. No intentional fraud, just an error. But the findings were not an accurate description of the sample. Or Daryl Bem’s paper where he reported various incomplete summaries of the data. Even if each summary was correct, his “findings” were not: they were selections of the data which, despite Bem’s claim, do not provide evidence of ESP. They don’t even provide evidence that the students in this particular sample had ESP. Or, if you want an even cleaner example, consider Nosek’s “50 shades of gray” study where Nosek and his collaborators themselves don’t believe that their findings were an accurate description of their sample.

Or, hey, here’s another one, a paper that claimed, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.” This is the concluding sentence of the abstract. Actually, though, there were no measurements of power in that paper, and the reported finding was not an accurate description of the sample on which it was run. For it to have been an accurate description, there would’ve had to be some measure of power on the participants. But there was no measure of power, just feelings of power. Which I think we can all agree is not the same thing. No fraud, intentional or otherwise, just a plain old everyday journal article where the thing being stated in the abstract is not the thing that was done in the study.

I have no plans to be at this conference, which is too bad, as this session sounds like lots of fun. Maybe they’ll feature some of Marc Hauser’s famous monkey videos, and they can film it live as a TED talk. The whole thing should be a real himmicane!

P.S. In all seriousness, do these people even read what they’ve written? (1) “Barring intentional fraud, every finding is an accurate description of the sample on which it was run”? (2) “Instantly become more powerful”?? (3) “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%”???

Look, we all make mistakes, and there are also legitimate differences of opinion. The data don’t really support the claims that married women were more likely to support Mitt Romney during that time of the month, or that beautiful parents are more likely to have girls, or that Hurricane Missy will be more damaging than Hurricane Butch. But, sure, these hypotheses, and their opposites, are all possible. It could be that Cornell students have ESP, or that power pose makes you weaker, or all sorts of things. So even if I think people are overinterpreting their evidence and even being blockheaded in their interpretation of questionable data and in their resistance to other sources of evidence, I can see how they could make the claims they’re making. Daryl Bem may ultimately have the last laugh on all of us. But statements like (1), (2), and (3) immediately above: they just make no sense in the context in which they’re written.

Psychology is not just a club of academics, and “psychological science” is not just the name of their treehouse. It’s supposed to be for all of us—I’m speaking as a taxpayer and citizen here—and I think the scholars who represent the field of psychology have a duty to write clearly, to avoid false statements where possible, and to put themselves into a frame of mind where they can learn from their mistakes.

P.P.S. Maybe also worth repeating this bit:

I’m not an adversary of psychological science! I’m not even an adversary of low-quality psychological science: we often learn from our mistakes and, indeed, in many cases it seems that we can’t really learn without first making errors of different sorts. What I am an adversary of is people not admitting error and studiously looking away from mistakes that have been pointed out to them.

We learn from our mistakes, but only if we recognize that they are mistakes. Debugging is a collaborative process. If you approve some code and I find a bug in it, I’m not an adversary, I’m a collaborator. If you try to paint me as an “adversary” in order to avoid having to correct the bug, that’s your problem.

P.P.P.S. In the first version of this post, I mistakenly labeled this session as being in the American Psychological Association conference. It’s actually the Association for Psychological Science. I apologize to the American Psychological Association for my error. It’s no big deal, though, it’s not like anybody’s being tortured or anything.

How to think about the p-value from a randomized test?

Roahn Wynart asks:

Scenario: I collect a lot of data for a complex psychology experiment. I put all the raw data into a computer. I program the computer to do 100 statistical tests. I assign each statistical test to a key on my keyboard. However, I do NOT execute the statistical test. Each key will trigger the evaluation of a different statistical test. I push, say, the “B” key and I get a positive result at 98% confidence. I then stop and publish. I never push any other key.

Is there something wrong with that procedure?

My reply:

1. Yes, there’s something wrong with this procedure, and the clear “something wrong” is the use of a p-value to decide whether to publish something. Even if your computer only has one key, so that your p-value is unequivocally kosher, it’s a mistake in my opinion to use statistical significance to decide what to publish. The problem is that if your signal-to-noise ratio is low, then any statistically significant estimate will be a big overestimate of the true effect, and it may well be in the wrong direction. This is discussed by Carlin and me in our recent paper in Perspectives on Psychological Science.
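The overestimation point is easy to simulate; here is a toy version with made-up numbers (true effect 0.1, standard error 1, so a low signal-to-noise setting):

set.seed(1)
est <- rnorm(1e5, mean = 0.1, sd = 1)   # what a noisy study would estimate
signif <- abs(est) > 1.96               # keep only the "p < .05" results
mean(abs(est[signif])) / 0.1            # how exaggerated the surviving magnitudes are
mean(sign(est[signif]) != sign(0.1))    # share of significant results with the wrong sign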

2. Is this a legitimate p-value? That’s a tough one. The easy answer is, if you choose which key to press after seeing the data then, no, in general this is not a legitimate p-value, for reasons discussed by Loken and me in our recent paper in American Scientist (the garden of forking paths). If you choose the key completely at random, then, sure, I guess it’s an ok p-value, although this is a bit controversial in frequentist statistics as it depends on what is being conditioned on. Even then, though, I wouldn’t recommend the procedure because of point 1 above.
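To see the contrast in point 2, here is a simplified simulation in which the 100 tests are done on independent pure-noise datasets (in the actual scenario the tests would share one dataset, which softens but does not remove the problem):

set.seed(2)
n_sims <- 1000
one_key <- replicate(n_sims, t.test(rnorm(20))$p.value)
best_key <- replicate(n_sims, min(replicate(100, t.test(rnorm(20))$p.value)))
mean(one_key < 0.05)   # a key chosen before seeing any results: near the nominal 0.05
mean(best_key < 0.05)  # the most favorable key in hindsight: far above 0.05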

fMRI clusterf******

Several people pointed me to this paper by Anders Eklund, Thomas Nichols, and Hans Knutsson, which begins:

Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.

I’m not a big fan of the whole false-positive, false-negative thing. In this particular case it makes sense because they’re actually working with null data, but ultimately what you’ll want to know is what’s happening to the estimates in the more realistic case that there are nonzero differences amidst the noise. The general message is clear, though: don’t trust fMRI p-values. And let me also point out that this is yet another case of a classical (non-Bayesian) method that is fatally assumption-based.

Perhaps the most disturbing thing about this study is how unsurprising it all is. In one sense, it’s big big news: fMRI is a big part of science nowadays, and if it’s all being done wrong, that’s a problem. But, from another perspective, it’s no surprise at all: we’ve been hearing about “voodoo correlations” in fMRI for nearly a decade now, and I didn’t get much sense that the practitioners of this sort of study were doing much of anything to clean up their act. I pretty much don’t believe fMRI studies on the first try, any more than I believe “gay gene” studies or various other headline-of-the-week auto-science results.

What to do? Short-term, one can handle the problem of bad statistics by insisting on preregistered replication, thus treating traditional p-value-based studies as screening exercises. But that’s a seriously inefficient way to go: if you don’t watch out, your screening exercises are mostly noise, and then you’re wasting your effort with the first study, then again with the replication.

On the other hand, if preregistered replication becomes a requirement for an fMRI study to be taken seriously (I’m looking at you, PPNAS; I’m looking at you, Science and Nature and Cell; I’m looking at you, TED and NIH and NPR), then it won’t take long before researchers themselves realize they’re wasting their time.

The next step, once researchers learn to stop bashing their heads against the wall, will be better data collection and statistical analysis. When the motivation for spurious statistical significance goes away, there will be more motivation for serious science.

Something needs to be done, though. Right now the incentives are all wrong. Why not do a big-budget fMRI study? In many fields, this is necessary for you to be taken seriously. And it’s not like you’re spending your own money. Actually, it’s the opposite: at least within the university, when you raise money for a big-budget experiment, you’re loved, because the university makes money on the overhead. And as long as you close your eyes to the statistical problems and move so fast that you never have to see the failed replications, you can feel like a successful scientist.

The other thing that’s interesting is how this paper reflects divisions within PPNAS. On one hand you have editors such as Susan Fiske or Richard Nisbett who are deeply invested in the science-as-routine-discovery-through-p-values paradigm; on the other, you have editors such as Emery Brown (editor of this particular paper; full disclosure, I know Emery from grad school) who as a statistician has a more skeptical take and who has nothing to lose by pulling the house down.

Those guys at Harvard (but not in the statistics department!) will say, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” But they’re innumerate, and they’re wrong. Time for us to move on, time for the scientists to do more science and for the careerists to find new ways to play the game.

P.S. An economist writes in:

I wanted to provide a bit more context/background for your recent fMRI post. It went from a short comment to something much longer. Unfortunately, this is another time that a sensational headline misrepresents the actual content of the paper. I recently left academia and started a blog (among other things) but still have a few things far enough along that they might be published one day.

5 more things I learned from the 2016 election

After posting the 19 Things We Learned from the 2016 Election, I received a bunch of helpful feedback in comments and email. Here are some of the key points that I missed or presented unclearly:

Non-presidential elections

Nadia Hassan points out that my article is “so focused on the Presidential race that it misses some key pertinent downballot stuff. Straight ticket voting soared in this election in the Senate races, though not the governor’s races,” which supports explanations based on fundamentals and polarization rather than candidate-specific stories.

The Latino vote

In the “Demography is not destiny” category, I cited exit polls that showed the Latino vote dividing 66%-28% in favor of Clinton. But exit polls have a lot of problems, as Justin Gross noted in comments and which others pointed out to me by email. Gary Segura and Matt Barreto suggest that “the national exit polls interviewed few if any Latino voters in areas where many Latinos actually live.” Trump winning based on the white vote is consistent with what Yair and I found earlier this year about the electorate being whiter than observers had thought based on exit polls, as reported in a news article, “There Are More White Voters Than People Think. That’s Good News for Trump.”

Siloed news

Andy Guess writes that conventional wisdom says news is “siloed,” but the best evidence (from passive metering data) doesn’t support the idea; and on social media, see this. We have more discussion of fake news in comments here.

Shark attacks

I ragged on Chris Achen and Larry Bartels’s claim that shark attacks swing elections. But as commenter WB points out, we shouldn’t let that distract us from Achen and Bartels’s larger point that many voters are massively uninformed about politics, policy, and governing, which is relevant even if it’s not true, as they claimed, that voters are easily swung by irrelevant stimuli.

The Clinton campaign’s “ground game”

Someone who had led Obama’s ground game in a rural area of a midwestern state sent me this note:

I [my correspondent] returned there to informally assist Senator Clinton after it became apparent that she was having difficulty in that state (September 2016). It is from this background that I respectfully think you’re wrong about ground games being overrated (point 10). That is the wrong lesson.

You are correct that Democrats were supposed to have an amazing ground game. More hires. More offices. A field guy as campaign manager experienced in tight field wins (DCCC 2012; McAuliffe 2013). The problem is that Clinton never ran a ground game.

When I arrived in September/October, I was astounded to discover that the field staff had spent all their time on volunteer recruitment. This meant that they were only calling people who were already friendly to Clinton and asking those same people to come into the office to call more people friendly to Clinton. At no point during the campaign did the field staff ever ID voters or do persuasion (e.g. talk to a potentially non-friendly voter). That is a call center, it is not a ground game.

Part of the reason for this is that Brooklyn read an academic piece suggesting that voter contact more than 10 days out is worthless—a direct repudiation of the organizing model used by Obama in 2008 and 2012 when field contacted each voter 4 times between July and November. The result is that the Clinton campaign started asking people to turn out for Clinton only in the final week of the election when they began GOTV work. There was no preexisting relationship. Those calls for turning out might as well have come from a Hyderabad call center for all the good they did.

I hate to see people taking the wrong lesson from this campaign. Ground games are critical for Democrats to win. But non organizing-based ground games are worse than useless as they artificially inflate your expectations, demoralize volunteers (they want to talk to voters, not recruit more volunteers), and fail to turn out your base.

Thanks to everyone for your comments. One excellent thing about blogging is that we can revise what we write, in contrast to the David Brookses of the world who can never admit error.

P.S. One more thing on the ground game: Ryan Enos and Anthony Fowler estimated that the ground campaigning in 2012 increased turnout in the most targeted states by 7-8 percentage points.

“The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling”

Here’s Michael Betancourt writing in 2015:

Leveraging the coherent exploration of Hamiltonian flow, Hamiltonian Monte Carlo produces computationally efficient Monte Carlo estimators, even with respect to complex and high-dimensional target distributions. When confronted with data-intensive applications, however, the algorithm may be too expensive to implement, leaving us to consider the utility of approximations such as data subsampling. In this paper I demonstrate how data subsampling fundamentally compromises the scalability of Hamiltonian Monte Carlo.

But then here’s Jost Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter in 2016:

Despite its successes, the prototypical Bayesian optimization approach – using Gaussian process models – does not scale well to either many hyperparameters or many function evaluations. Attacking this lack of scalability and flexibility is thus one of the key challenges of the field. . . . We obtain scalability through stochastic gradient Hamiltonian Monte Carlo, whose robustness we improve via a scale adaptation. Experiments including multi-task Bayesian optimization with 21 tasks, parallel optimization of deep neural networks and deep reinforcement learning show the power and flexibility of this approach.

So now I’m not sure what to think! I guess a method can be useful even if it doesn’t quite optimize the function it’s supposed to optimize? Another twist here is that the posteriors for these deep network models are multimodal, so you can’t really do full Bayes for them even in problems of moderate size, even before worrying about scalability. Which suggests that we should think of algorithms such as that of Springenberg et al. as approximations, and we should be doing more work on evaluating these approximations. To put it another way, when they run stochastic gradient Hamiltonian Monte Carlo, we should perhaps think of this not as a way of tracing through the posterior distribution but as a way of exploring the distribution, or some parts of it.
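For readers who want to see the subsampling idea in its most naive form, here is a toy sketch in R of stochastic-gradient HMC in the spirit of Chen, Fox, and Guestrin (2014), applied to a model whose exact posterior is known; the step size, friction, and minibatch size are arbitrary choices of mine, and this is not the algorithm of Springenberg et al.:

set.seed(1)
N <- 10000
y <- rnorm(N, mean = 2, sd = 1)   # data from y ~ normal(theta, 1), prior theta ~ normal(0, 10)

# noisy gradient of the negative log posterior, rescaled from a minibatch
grad_U <- function(theta, batch) {
  (N / length(batch)) * sum(theta - y[batch]) + theta / 100
}

eta <- 1e-6     # step size
alpha <- 0.1    # friction / momentum decay
m <- 100        # minibatch size
n_iter <- 20000
theta <- 0
v <- 0
draws <- numeric(n_iter)
for (t in 1:n_iter) {
  batch <- sample.int(N, m)
  v <- (1 - alpha) * v - eta * grad_U(theta, batch) + rnorm(1, 0, sqrt(2 * alpha * eta))
  theta <- theta + v
  draws[t] <- theta
}
draws <- draws[-(1:2000)]   # drop warmup

# exact posterior, for comparison
post_mean <- sum(y) / (N + 1 / 100)
post_sd <- 1 / sqrt(N + 1 / 100)
c(mean(draws), post_mean)   # the location is recovered reasonably well
c(sd(draws), post_sd)       # the spread comes out too wide: the minibatch gradient
                            # noise is not corrected for

The mean gets picked up but the draws are overdispersed relative to the exact posterior, which is the kind of gap Betancourt’s paper is about, and it is why I’d treat output like this as exploration rather than as a faithful trace of the posterior.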