## “Figure 1 looks like random variation to me” . . . indeed, so it does. And Figure 2 as well! But statistical significance was found, so this bit of randomness was published in a top journal. Business as usual in the statistical-industrial complex. Still, I’d hope the BMJ could’ve done better.

Gregory Hunter writes:

The following article made it to the national news in Canada this week.

I [Hunter] read it and was fairly appalled by their statistical methods. It seems that they went looking for a particular result in Canadian birthrate data, and then arranged to find it. Figure 1 looks like random variation to me. I don’t know if it warrants mention in your blog, but it did get into the British Medical Journal.

That’s too bad about it being in the British Medical Journal. Lancet, sure, they’re notorious for publishing politically motivated clickbait. But I thought BMJ was more serious. I guess anyone can make mistakes.

Anyway, getting to the statistics . . . the article is called “Outcome of the 2016 United States presidential election and the subsequent sex ratio at birth in Canada: an ecological study,” and its results are a mess of forking paths:

We hypothesised that the unexpected outcome of the 2016 US presidential election may have been a societal stressor for liberal-leaning populations and thereby precipitated such an effect on the sex ratio in Canada. . . . In the 12 months following the election, the lowest sex ratio occurred in March 2017 (4 months post election). Compared with the preceding months, the sex ratio was lower in the 5 months from March to July 2017 (p=0.02) during which time it was rising (p=0.01), reflecting recovery from the nadir. Both effects were seen in liberal-leaning regions of Ontario (lower sex ratio (p=0.006) and recovery (p=0.002) in March–July 2017) but not in conservative-leaning areas (p=0.12 and p=0.49, respectively).

In addition to forking paths, we also see the statistical fallacy of comparing significant to non-significant.
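To make that second point concrete, here is a minimal sketch that takes the two p-values quoted above for the lower sex ratio (0.006 in liberal-leaning regions, 0.12 in conservative-leaning ones) and asks whether the difference between the two groups is itself distinguishable from zero. The assumptions are mine, not the paper's: two-sided p-values, independent estimates pointing in the same direction, and equal standard errors in the two groups. The names `z_from_two_sided_p`, `z_diff`, and `p_diff` are just for illustration.

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

def z_from_two_sided_p(p):
    """Absolute z-score corresponding to a two-sided p-value."""
    return norm.inv_cdf(1 - p / 2)

# p-values quoted in the abstract for the "lower sex ratio" comparison
z_liberal = z_from_two_sided_p(0.006)
z_conservative = z_from_two_sided_p(0.12)

# ASSUMPTION: equal standard errors, independent estimates, same sign.
# On a scale where each standard error is 1, each estimate equals its z-score,
# and the difference of the two estimates has standard error sqrt(2).
z_diff = (z_liberal - z_conservative) / 2 ** 0.5
p_diff = 2 * (1 - norm.cdf(abs(z_diff)))

print(f"z (liberal) = {z_liberal:.2f}, z (conservative) = {z_conservative:.2f}")
print(f"z for the difference = {z_diff:.2f}, two-sided p = {p_diff:.2f}")
```

Under these assumptions the p-value for the difference comes out around 0.4, nowhere near conventional significance, even though one group "passed" and the other "failed."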

To their credit, the authors show the data:

As is often the case, if you look at the data without all those lines, you see something that looks like a bunch of numbers with no clear pattern.

The claims made in this article do not represent innumeracy on the level of saying that the probability of a tied election is 10^-90 (which is off by a factor of 10^83), and it’s not innumeracy on the level of that TV commenter and newspaper editor who said that Mike Bloomberg spent a million dollars on each voter (off by a factor of 10^6), but it’s still wrong.

Just to get a baseline here: There were 146,000 births in Ontario last year. 146,000/12 = 12,000 (approximately). So, just from pure chance, we’d expect the monthly proportion of girl births to vary with a standard deviation of 0.5/sqrt(12000) = 0.005. For example if the baseline rate is 48.5% girls, it could jump to 48.0% or 49.0% from month to month. The paper in question reports sex ratio, which is (1-p)/p, so 0.480, 0.485, 0.490 convert to sex ratios of 1.08, 1.06, and 1.04. Or, if you want to do +/-2 standard deviations, you’d expect to see sex ratios varying from roughly 1.10 to 1.02, which is indeed what we see in the top figure above. (The lower figures are each based on less data so of course they’re more variable.) Any real effects on sex ratio will be tiny compared to this variation in the data (see here for discussion of this general point).
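The baseline calculation above is easy to reproduce. The sketch below uses the 146,000 annual births and the 48.5% baseline from the text; the only departure is that it uses the binomial formula sqrt(p(1-p)/n) rather than the rounded 0.5/sqrt(n), which gives nearly the same answer. The function name `sex_ratio` is just for illustration.

```python
import math

births_per_year = 146_000        # Ontario figure quoted in the text
n_month = births_per_year / 12   # roughly 12,000 births per month
p_girl = 0.485                   # baseline proportion of girls used in the text

# Standard deviation of the monthly proportion of girls from sampling alone.
sd = math.sqrt(p_girl * (1 - p_girl) / n_month)

def sex_ratio(p):
    """Boys per girl, given the proportion of girls p."""
    return (1 - p) / p

print(f"monthly sd of proportion: {sd:.4f}")
for k in (-2, -1, 0, 1, 2):
    p = p_girl + k * sd
    print(f"{k:+d} sd: proportion girls = {p:.4f}, sex ratio = {sex_ratio(p):.3f}")
```

The point of running it is just to see the scale: pure sampling noise moves the monthly sex ratio around by a few hundredths, which is the same size as the wiggles in the published figure.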

In short: this study was dead on arrival. But the authors fooled themselves, and the reviewers, with a blizzard of p-values. As low as 0.002!

So, let me repeat:

– Just cos you have a statistically significant comparison, that doesn’t necessarily mean you’ve discovered anything at all about the world.

– Just cos you have causal identification and a statistically significant comparison, that doesn’t necessarily mean you’ve discovered anything at all about the world.

– Just cos you have honesty, transparency, causal identification, and a statistically significant comparison, that doesn’t necessarily mean you’ve discovered anything at all about the world.

– Just cos you have honesty, transparency, causal identification, a statistically significant comparison, a clear moral purpose, and publication in a top journal, that doesn’t necessarily mean you’ve discovered anything at all about the world.

Sorry, but that’s the way it is. You’d think everyone would’ve learned this—it’s been nearly a decade since that ESP paper was published—but I guess not. The old ways of thinking are sticky. Sticky sticky sticky.

Again, no special criticism of the authors of this new paper. I assume they’re just doing what they were trained to do, and now what they’re rewarded for doing. Don’t hate the player etc.

## 2 econ Nobel prizes, 1 error

This came up before on the blog but it’s always worth remembering. From Larry White, quoted by Don Boudreaux:

As late as the 1989 edition [of his textbook, Paul Samuelson] and coauthor William Nordhaus wrote: “The Soviet economy is proof that, contrary to what many skeptics had earlier believed, a socialist command economy can function and even thrive.”

Paul Samuelson and William Nordhaus won Nobel prizes in economics. We’ve talked about this example before; I learned about it from a post by Alex Tabarrok recounting some research by David Levy and Sandra Peart.

There are lots of cases of academics, Nobel-prize-winning and otherwise, falling for fringe theories as they get older. Google Josephson or Shockley, for example. And even the most distinguished scholars have been known to make technical errors and then be too stubborn to correct them.

But the above-quoted example is different: Samuelson and Nordhaus are economists. They’re writing about their area of expertise—or, I should say, purported expertise. They should know better, or at least know that they don’t know enough to claim to know, right?

My point is not that Samuelson and Nordhaus are fools. I’ve not met or read much by either of them. They might both be brilliant (although it’s hard to tell because sometimes it seems that economists like to say how brilliant other economists are) and I assume they’ve both done excellent, important work. The above quote from their book . . . make of it what you will.

I guess the problem is that social science is so damn political. Not just politics politics, also academic politics. Last year we talked about the echo chamber within the subfield of climate economics. So much of the academic world seems to be about people promoting each other’s careers.

P.S. I complain a lot about academic economics and psychology. I’m sure lots of other fields have problems that are just as big. Economics and psychology are just easy to talk about because they are relatively non-technical topics. If we want to argue about whether a particular drug works the way it’s claimed, we might need to know a lot of biology. But you don’t need any particular technical knowledge to recognize that the Soviet economy was not thriving and that the claims in those pizzagate papers were not coherent.

## What are my statistical principles?

Jared Harris writes:

I am not a statistician but am a long time reader of your blog and have strong interests in most of your core subject matter, as well as scientific and social epistemology.

I’ve been trying for some time to piece together the broader implications of your specific comments, and have finally gotten to a perspective that seems to be implicit in a lot of your writing, but deserves to be made more explicit. (Or if you’ve already made it explicit, I want to find out where!)

My sense is that many see statistics as essentially defensive — helping us *not* to believe things that are likely to be wrong. While this is clearly part of the story it is not an adequate mission statement.

Your interests seem much wider — for example your advocacy of maximally informative graphs and multilevel models. I’d just like to have a clearer and more explicit statement of the broad principles.

An attempted summary: Experimental design and analysis, including statistics, should help us learn as much as we can from our work:
– Frame and carry out experiments that help us learn as much as possible.
– Analyze the results of the experiments to learn as much as possible.

One obstacle to learning from experiments is the way we talk and think about experimental outcomes. We say an experiment succeeded or failed — but this is not aligned with maximizing learning. Naturally we want to minimize or hide failures and this leads to the file drawer problem and many others. Conversely we are inclined to maximize success and so we are motivated to produce and trumpet “successful” results even if they are uninformative.

We’d be better aligned if we judged experiments on whether they are informative or uninformative (a matter of degree). Negative results can be extremely informative. The cost to the community of missing or suppressing negative results can be enormous because of the effort that others will waste. Also negative results can help delimit “negative space” and contribute to seeing important patterns.

I’m not at all experienced with experiment design, but I guess that designing experiments to be maximally informative would lead to a very different approach than designing experiments to have the best possible chance of yielding positive results, and could produce much more useful negative results.

This approach has some immediate normative implications:

One grave sin is wasting effort on uninformative experiments and analysis, when we could have gotten informative outcomes — even if negative. Design errors like poor measurement and forking paths lead to uninformative results. This seems like a stronger position than just avoiding poorly grounded positive results.

Another grave sin is suppressing informative results — whether negative or positive. The file drawer problem should be seen as a moral failure — partly collective because most disciplines and publishing venues share the bias against negative results.

I was going to respond to this with some statement of my statistical principles and priorities—but then I thought maybe all of you could make more sense out of this than I can. You tell me what you think are my principles and priorities based on what you’ve read from me, then I’ll see what you say and react to it. It might be that what you think are my priorities, are not my actual priorities. If so, that implies that some of what I’ve written has been misfocused—and it would be good for us to know that!

## “Congressional Representation: Accountability from the Constituent’s Perspective”

Steve Ansolabehere and Shiro Kuriwaki write:

The premise that constituents hold representatives accountable for their legislative decisions undergirds political theories of democracy and legal theories of statutory interpretation. But studies of this at the individual level are rare, examine only a handful of issues, and arrive at mixed results. We provide an extensive assessment of issue accountability at the individual level. We trace the congressional rollcall votes on 44 bills across seven Congresses (2006–2018), and link them to constituent’s perceptions of their representative’s votes and their evaluation of their representative. Correlational, instrumental variables, and experimental approaches all show that constituents hold representatives accountable. A one-standard deviation increase in a constituent’s perceived issue agreement with their representative can improve net approval by 35 percentage points. Congressional districts, however, are heterogeneous. Consequently, the effect of issue agreement on vote is much smaller at the district level, resolving an apparent discrepancy between micro and macro studies.

That last point is worth saying again, and Ansolabehere and Kuriwaki do so, at the end of their article:

Our findings also help reconcile two observations. On the one hand, individual constituents respond strongly to their legislators’ roll call votes. But on the other hand, aggregate vote shares are only modestly correlated with legislators’ roll call voting records. This is a result of aggregation. Many legislative districts are fairly evenly split on key legislation. A legislator may vote with the majority of her district and get the support of 55 percent of her constituents, but lose the support of the remaining 45 percent. Those with whom the legislator sides care deeply about the issue, as do those opposed to the legislator’s vote. But, in the aggregate the net effect is modest because much of the support and opposition for the bill cancels out. Aggregate correlations should not be taken as measures of the true degree to which individuals care about or vote on the issues. By the same token, in extremely competitive districts, representatives have a difficult time satisfying the majority of the voters back home.
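The cancellation in that passage is simple arithmetic, and a toy calculation shows how fast it bites. This is a sketch under assumptions of my own, not the authors' model: everyone who agrees with the vote shifts their approval up by the same amount, everyone who disagrees shifts down by that same amount, and the 35-point individual shift is borrowed from the abstract purely to set a scale. The function `aggregate_net_shift` is hypothetical.

```python
def aggregate_net_shift(share_agree, individual_shift):
    """District-level net shift when a share of constituents moves up by
    individual_shift and everyone else moves down by the same amount.
    ASSUMPTION: symmetric, equal-sized responses on both sides."""
    return share_agree * individual_shift - (1 - share_agree) * individual_shift

individual_shift = 35.0  # hypothetical individual-level effect, in percentage points

for share in (0.50, 0.55, 0.70, 0.90):
    net = aggregate_net_shift(share, individual_shift)
    print(f"{share:.0%} agree: individual shift {individual_shift:.0f} pts, "
          f"district net shift {net:.1f} pts")
```

In the 55/45 district from the quote, a large individual-level response shrinks to a tenth of its size in the aggregate, which is the discrepancy between micro and macro studies that the authors describe.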

This is thematically consistent with Ansolabehere’s earlier work on stability of issue attitudes, in that details of measurement can make a big difference in how we understand political behavior.

## Why we kept the trig in golf: Mathematical simplicity is not always the same as conceptual simplicity

You define the threshold angle as arcsin((R – r)/x), but shouldn’t it be arctan((R – r)/x) instead?

Is it just that it does not matter with these small angles, where
sine and tangent are about the same, or am I missing something?

This sin vs tan thing comes up from time to time.

As you note, given the dimensions involved, the two functions are for all practical purposes equivalent. If you look at the picture I drew of the little ball and the big ball and the solid and dotted lines, the dotted lines are supposed to just touch the edge of the inner circle. In that case, if you drop a perpendicular from where the dotted line hits the circle, that perpendicular goes through the center of the circle, hence the angle of interest is arcsin((R-r)/x): the solid line of length x is the hypotenuse of the triangle.

As many people have pointed out, the whole model is approximate in that it assumes the ball enters the hole only if the entire ball is over the hole. But actually the ball could fall in if only half of it is over the hole. So, arguably, it should be arcsin(R/x). But that’s also only an approximation!

So we could replace all that trig with a simple R/x, and I’m guessing it would fit the data just as well. So why didn’t I do that? Doesn’t the trig just complicate things? I kept the trig because, to me, it’s cleaner to derive the trig solution and just use it than to have mathematical conversations about simplifying it. Mathematically, arcsin((R-r)/x) is more complicated than R/x, but conceptually I think it’s simpler to go with arcsin((R-r)/x) as it has a direct geometrical derivation. And, for teaching purposes, I like having a model that’s very clearly tailored to this particular problem.
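For a sense of how little the choice matters numerically, here is a minimal sketch comparing the candidate threshold angles at a few putting distances. The dimensions are assumptions not stated in this post: a 4.25-inch hole diameter and a 1.68-inch ball diameter, with distances x in feet. The helper `threshold_angles` is just for illustration.

```python
import math

# ASSUMPTION: standard golf dimensions, converted to feet (not given in the text).
R = (4.25 / 2) / 12   # hole radius in feet
r = (1.68 / 2) / 12   # ball radius in feet

def threshold_angles(x):
    """Candidate threshold angles (radians) for a putt from distance x feet."""
    d = R - r
    return {
        "arcsin((R-r)/x)": math.asin(d / x),
        "arctan((R-r)/x)": math.atan(d / x),
        "(R-r)/x": d / x,
        "arcsin(R/x)": math.asin(R / x),
        "R/x": R / x,
    }

for x in (2, 5, 10, 20):
    angles = threshold_angles(x)
    row = ", ".join(f"{name} = {val:.5f}" for name, val in angles.items())
    print(f"x = {x:>2} ft: {row}")
```

At these distances arcsin, arctan, and the plain ratio agree to well under one percent. The versions using R in place of R-r differ by a roughly constant factor of R/(R-r), so the difference there is one of scale, not of shape.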

Mathematical simplicity is not always the same as conceptual simplicity. A (somewhat) complicated mathematical expression can give some clarity, as the reader can see how each part of the formula corresponds to a different aspect of the problem being modeled.

## They want “statistical proof”—whatever that is!

Bert Gunter writes:

I leave it to you to decide whether this is fodder for your blog:

So when a plaintiff using a hiring platform encounters a problematic design feature — like platforms that check for gaps in employment — she should be able to bring a lawsuit on the basis of discrimination per se, and the employer would then be required to provide statistical proof from internal and external audits to show that its hiring platform is not unlawfully discriminating against certain groups.

It’s from an opinion column about the problems of automated hiring algorithms/systems.

I would not begin to know how to respond.

I don’t know how to respond either! Looking for statistical “proof” seems like asking for trouble. The larger problem, it seems, is an inappropriate desire for certainty.

I’m reminded of the time, several years ago, when I was doing legal consulting, and the lawyers I was working for asked me what I would say on the stand if the opposing lawyer asked me if I was certain my analysis was correct, if it was possible I had made a mistake. I responded that, if asked, I’d say, Sure, it’s possible I made a mistake. The lawyer told me I shouldn’t say that. I can’t remember how this particular conversation got resolved, but as it happened the case was settled and I never had to testify.

I don’t know what “statistical proof” is, but I hope it’s not the thing that the ESP guy, the beauty-and-sex-ratio guy, the Bible Code dudes, etc etc etc, had that got all those papers published in peer-reviewed journals.

In response to the above-linked op-ed, I’d prefer to replace the phrase “statistical proof” by the phrase “strong statistical evidence.” Yes, it’s just a change of words, but I think it’s an improvement in that it’s now demanding something that could really be delivered. If you’re asking for “proof,” then . . . remember the Lance Armstrong principle.

## Information, incentives, and goals in election forecasts

Jessica Hullman, Christopher Wlezien, and I write:

Presidential elections can be forecast using information from political and economic conditions, polls, and a statistical model of changes in public opinion over time. We discuss challenges in understanding, communicating, and evaluating election predictions, using as examples the Economist and Fivethirtyeight forecasts of the 2020 election.

Here are the contents of the article:

1. Forecasting presidential elections

1.1. Forecasting elections from political and economic fundamentals

1.2. Pre-election surveys and poll aggregation

1.3. Putting together an electoral college forecast

2. Communicating and diagnosing problems with probabilistic election forecasts

2.1. Win probabilities

2.2. Visualizing uncertainty

2.3. Other ways to communicate uncertainty

2.4. State and national predictions

2.5. Replacement candidates, vote-counting disputes, and other possibilities not included in the forecasting model

3. Calibration and incentives

3.1. The difficulty of calibration

3.2. Incentives for overconfidence

3.3. Incentives for underconfidence

3.4. Comparing different forecasts

3.5. Martingale property

3.6. Novelty and stability

4. Discussion

I like this paper. It gathers various thoughts we’ve had about information underlying election forecasts, how we communicate and understand these predictions, and some of the incentives that lead to different forecasts having different statistical properties.

We thank Joshua Goldstein, Elliott Morris, Merlin Heidemanns, Dhruv Madeka, Yair Ghitza, Doug Rivers, Bob Erikson, Bob Shapiro, and Jon Baron for helpful comments and various government agencies and private foundations for supporting this research.

## Who are you gonna believe, me or your lying eyes?

This post is by Phil Price, not Andrew.

A commenter on an earlier post quoted Terence Kealey, who said this in an interview in Scientific American in 2003:

“But the really fascinating example is the States, because it’s so stunningly abrupt. Until 1940 it was American government policy not to fund science. Then, bang, the American government goes from funding something like $20 million of basic science to $3,000 million, over the space of 10 or 15 years. I mean, it’s an unbelievable increase, which continues all the way to the present day. And underlying rates of economic growth in the States simply do not change. So these two historical bits of evidence are very, very powerful…”

One thing any reader of this blog should know by now, if you didn’t learn it long ago, is that you should not take any claim at face value, no matter how strongly and authoritatively it is made. Back In The Day (pre-Internet), checking this kind of thing was not always so easy. A lot of people, myself included, would have a copy of the Statistical Abstract of the United States, and an almanac or two, and a new atlas and an old atlas, and a CRC handbook, and a bunch of other references…but honestly usually we just had to go through life not knowing whether a claim like this was true or not.

But now it’s a lot easier to check this sort of thing, and in this case it’s especially easy because another blog commenter provided a reference: https://nintil.com/on-the-constancy-of-the-rate-of-gdp-growth/

So I look at that page, and sure enough there’s a nice graph of US GDP per capita as a function of time…and the growth rate is NOT, in fact, the same after 1940 as before!

US per capita GDP from late 1800s to 2011, in 2011 dollars; y axis is logarithmic

I have done no quantitative calculations at all, all I’ve done is look at the plot, but it’s obvious that the slope is higher after 1940 than before. Maybe the best thing to do is to leave out the Great Depression and WWII, and just look at the period before 1930 and after 1950, or you can just look at pre and post 1940 if you want…no matter how you slice it, the slope is higher after WWII. I’m not saying the change is huge — if you continued the pre-WWII slope until 2011, you’d be within a factor of 2 of the data — but there’s no doubt that there’s a change.
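On a logarithmic y axis the slope is the growth rate, so an eyeballed gap between two lines translates directly into a gap in annual growth. The sketch below takes the "within a factor of 2" figure above at face value and works out the largest growth-rate difference consistent with it. The start years (1940 or 1950) are assumptions for illustration, and `max_growth_gap` is a hypothetical helper, not anything computed from the actual GDP series.

```python
import math

def max_growth_gap(factor, years):
    """Difference in continuously compounded annual growth rate that
    accumulates to a multiplicative gap of `factor` after `years`."""
    return math.log(factor) / years

end_year = 2011
for start_year in (1940, 1950):   # ASSUMPTION: where the extrapolation begins
    years = end_year - start_year
    gap = max_growth_gap(2.0, years)
    print(f"from {start_year}: a factor-of-2 gap over {years} years "
          f"corresponds to about {100 * gap:.2f} percentage points per year")
```

So a change in slope that is plainly visible on the plot is still only on the order of one percentage point of annual growth, which is consistent with saying the change is real but not huge.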

I pointed out to the commenter who provided the link that the slope is higher after WWII, and he said, in essence, no it isn’t: economists agree that the slope is the same before and after. So who am I gonna believe, economists or my lying eyes?

I have no idea about the topic that started the conversation, which is whether government investment in science pays off economically. The increase in slope after WWII could be due to all kinds of things (for instance, women and blacks were allowed to enter the workforce in ways and numbers not previously available).  I’m not making any claims about that topic. I just think it’s funny that someone claims that the “fact” that a number is unchanged is “very, very powerful” evidence of something…and in fact the number did change!

This post is by Phil.

## “Day science” and “Night science” are the same thing—if done right!

Chetan Chawla writes:

This paper will interest you, in defense of data mining.

Isn’t this similar to the exploration Wansink was encouraging in his infamous blog post?

The article, by Itai Yanai and Martin Lercher, is called, “A hypothesis is a liability,” and it appeared in the journal Genome Biology.

I took a look and replied: I don’t think they’re like Wansink, but there’s something important they are missing.

Here’s what they write:

There is a hidden cost to having a hypothesis. It arises from the relationship between night science and day science, the two very distinct modes of activity in which scientific ideas are generated and tested, respectively. With a hypothesis in hand, the impressive strengths of day science are unleashed, guiding us in designing tests, estimating parameters, and throwing out the hypothesis if it fails the tests. But when we analyze the results of an experiment, our mental focus on a specific hypothesis can prevent us from exploring other aspects of the data, effectively blinding us to new ideas. A hypothesis then becomes a liability for any night science explorations. The corresponding limitations on our creativity, self-imposed in hypothesis-driven research, are of particular concern in the context of modern biological datasets, which are often vast and likely to contain hints at multiple distinct and potentially exciting discoveries. Night science has its own liability though, generating many spurious relationships and false hypotheses. Fortunately, these are exposed by the light of day science, emphasizing the complementarity of the two modes, where each overcomes the other’s shortcomings. . . .

I understand that a lot of scientists think of science as being like this, an alternation between inspiration and criticism, exploratory data analysis and confirmatory data analysis, creative “night science” and rigorous “day science.” Indeed, in Bayesian Data Analysis we talk about the separate steps of model building, model fitting, and model checking.

But . . . I don’t think we should enthrone this separation.

Yanai and Lercher contrast “the expressed goal of testing a specific hypothesis” with the mindset of “exploration, where we look at the data from as many angles as possible.” They continue:

In this mode, we take on a sort of playfulness with the data, comparing everything to everything else. We become explorers, building a map of the data as we start out in one direction, switching directions at crossroads and stumbling into unanticipated regions. Essentially, night science is an attitude that encourages us to explore and speculate. . . .

What’s missing here is a respect for the ways in which hypotheses, models, and theories can help us be more effective explorers.

Consider exploratory data analysis, which uses tools designed to reveal unexpected patterns in data. “Unexpected” is defined relative to “expected,” and the more fully fleshed out our models of the expected, the more effective can be our explorations in search of the unexpected. This is a key point of my 2003 article, A Bayesian formulation of exploratory data analysis and goodness-of-fit testing (one of my favorite papers, even though it’s been only cited about 200 times, and many of those citations are by me!). I’m not claiming here that fancy models are required to do good exploratory analysis; rather, I’m saying that exploration is relative to models, and formalizing these models can help us do better exploration.

And it goes the other way, too: careful exploration can reveal unexpected data patterns that improve our modeling.

My first problem with the creative-night-science, rigorous-day-science dichotomy is that it oversimplifies the creative part of the work. In part, Yanai and Lercher get it: they write:

[M]ore often than not, night science may require the most acute state of mental activity: we not only need to make connections where previously there were none, we must do this while contrasting any observed pattern on an elaborate mental background that represents the expected. . . .

[W]hen you roam the limits of the scientific knowns, you need a deep understanding of a field to even recognize a pattern or to recognize it as surprising. Different scientists looking at a given dataset will do this against a backdrop of subtly different knowledge and expectations, potentially highlighting different patterns. Looking is not the same as seeing, after all, and this may be why some of us may stumble upon discoveries in data that others have already analyzed.

That’s good. It reminds me of Seth Roberts’s championing of the “insider-outsider perspective,” where you’re open to new ideas (you’re an outsider without an investment in the currently dominant way of thinking) but you have enough knowledge and understanding that you’ll know where to look for anomalies (you’re an insider with some amount of specialized knowledge).

But “day science” is part of this exploration too! By not using the inferential tools of “day science,” you’re doing “night science” with one hand behind your back, using your intuition but not allowing it to be informed by calculation.

To put it another way, calculation and formal statistical inference are not just about the “day science” of hypothesis testing and p-values; they’re also about helping us better understand our data.

Consider the birthday example from the cover of BDA3 and pictured above. Is this “night science” (exploration) or “day science” (confirmation)? The answer is that it’s both! Or, to put it another way, we’re using the modeling and inferential tools of “day science” to do “night science” more effectively. If you want, you can say that the birthday example is pure “night science”—but then I’d say that night science is pretty much all we ever need. I’d just call it “science.”

My second problem with the night/day science framing is that it doesn’t give enough for “day science” to do. Rigorous inference is not just about testing hypotheses; it’s about modeling variation, causation, all sorts of things. Testing the null hypothesis of zero effect and all data coming from a specific random number generator—that’s pretty much the most boring thing, and one of the most useless things, you can do in statistics. If we have data from lots of different experiments with lots of different hypotheses floating around, we can do some multilevel modeling!

To put it another way: Just as “night science” can be more effective if it uses statistical inference from data, “day science” can be more effective via creative modeling, not just testing fixed hypotheses.

Science is not basketball. You’re allowed to do a moving pick.

P.S. After I wrote and scheduled this post, another person emailed me asking about the above-linked article. If 2 different people email me, that’s a sign of some general interest, so I decided to post this right away instead of on the usual delay.

## Here’s a question for the historians of science out there: How modern is the idea of a scientific “anomaly”?

Occasional blog commenter Raghu recommended Les Insulaires by Pascal Garnier. My French isn’t so good but dammit I’m gonna read this one all the way through. We’ll see if I finish it by the time this post appears in September . . .

Anyway, I was talking with someone about my difficulties with foreign languages, which reminded me about this paper I wrote with Josh “Don’t call him ‘hot hand’” Miller on Laplace’s theories of cognitive illusions, heuristics, and biases. I bought a copy of the original Laplace book that we were discussing in our paper, but I found it too much of a struggle to read, so I relied entirely on the translation.

And this got me thinking about Laplace, who famously promoted the idea of a clockwork universe that runs all on its own without the assistance of God. And that got me wondering what Laplace thought about rigid bodies. Not “how do rigid bodies move?”, but “how can rigid bodies exist?” If you try to build a rigid body out of billiard balls, they’ll just drift apart. For that matter, how does a block of wood hold itself together? When I took physics in college, I learned that quantum mechanics is required here: no quantum mechanics, no rigid bodies. But Laplace didn’t know about quantum mechanics, so how did he think about rigid bodies? Did he just hypothesize some glue-like force that held the cells together in a block of wood or a ball of ivory? For that matter, how did he think glue worked?

I’m not expecting here that Laplace would’ve had all the answers. My question is: Did Laplace think of the existence of rigid bodies as an anomaly within his world of a clockwork universe?

But then this made me think: Is the concept of a scientific anomaly itself a modern idea? Did scientists before the 1850s, say, think in terms of anomalies, or did they just view science as a collection of partly connected theories and facts? I’m asking about Laplace in particular because he’s famous for presenting science as a complete explanation of the world.

…All Bodies seem to be composed of hard Particles: For otherwise Fluids would not congeal; as Water, Oils, Vinegar, and Spirit or Oil of Vitriol do by freezing; Mercury by Fumes of Lead; Spirit of Nitre and Mercury, by dissolving the Mercury and evaporating the Flegm; Spirit of Wine and Spirit of Urine, by deflegming and mixing them; and Spirit of Urine and Spirit of Salt, by subliming them together to make Sal-armoniac. Even the Rays of Light seem to be hard Bodies; for otherwise they would not retain different Properties in their different Sides. And therefore Hardness may be reckon’d the Property of all uncompounded Matter. At least, this seems to be as evident as the universal Impenetrability of Matter. For all Bodies, so far as Experience reaches, are either hard, or may be harden’d; and we have no other Evidence of universal Impenetrability, besides a large Experience without an experimental Exception. Now if compound Bodies are so very hard as we find some of them to be, and yet are very porous, and consist of Parts which are only laid together; the simple Particles which are void of Pores, and were never yet divided, must be much harder. For such hard Particles being heaped up together, can scarce touch one another in more than a few Points, and therefore must be separable by much less Force than is requisite to break a solid Particle, whose Parts touch in all the Space between them, without any Pores or Interstices to weaken their Cohesion. And how such very hard Particles which are only laid together and touch only in a few Points, can stick together, and that so firmly as they do, without the assistance of something which causes them to be attracted or press’d towards one another, is very difficult to conceive.

## “Psychology’s Zombie Ideas”

Hey, psychologists! Don’t get mad at me about the above title. I took it from a post at Macmillan Learning by David Myers, who’s a psychology professor and textbook writer. Myers presents some “mind-eating, refuse-to-die ideas” that are present in everyday psychology but are contradicted by research:

1. People often repress painful experiences, which years later may reappear as recovered memories or disguised emotions. (In reality, we remember traumas all too well, often as unwanted flashbacks.)

2. In realms from sports to stock picking, it pays to go with the person who’s had the hot hand. . . .

3. Parental nurture shapes our abilities, personality, and sexual orientation. (The greatest and most oft-replicated surprise of psychological science is the minimal contribution of siblings’ “shared environment.”)

4. Immigrants are crime-prone. (Contrary to what President Donald Trump has alleged, and contrary to people’s greater fear of immigrants in regions where few immigrants live, immigrants do not have greater-than-average arrest and incarceration rates.)

5. Big round numbers: The brain has 100 billion neurons. 10 percent of people are gay. We use only 10 percent of our brain. 10,000 daily steps make for health. 10,000 practice hours make an expert. (Psychological science tells us to distrust such big round numbers.)

6. Psychology’s three most misunderstood concepts are that: “Negative reinforcement” refers to punishment. “Heritability” means how much of a person’s traits are attributable to genes. “Short-term memory” refers to your inability to remember what you experienced yesterday or last week, as opposed to long ago. (These zombie ideas are all false, as I explain here.)

7. Seasonal affective disorder causes more people to get depressed in winter, especially in cloudy places, and in northern latitudes. (This is still an open debate, but massive new data suggest to me that it just isn’t so.)

8. To raise healthy children, protect them from stress and other risks. (Actually, children are antifragile. Much as their immune systems develop protective antibodies from being challenged, children’s emotional resilience builds from experiencing normal stresses.)

9. Teaching should align with individual students’ “learning styles.” (Do students learn best when teaching builds on their responding to, say, auditory versus visual input? Nice-sounding idea, but researchers—here and here—continue to find little support for it.)

10. Well-intentioned therapies change lives. (Often yes, but sometimes no—as illustrated by the repeated failures of some therapy zombies: Critical Incident Stress Debriefing, D.A.R.E. Drug Abuse Prevention, Scared Straight crime prevention, Conversion Therapy for sexual reorientation, permanent weight-loss training programs.)

Of the above list, one is wrong (#2; see here), one is not psychology (#4), two seem too vague to have any real empirical content (#8 and #9), and for one, I’m not sure many people really hold the “zombie belief” in question (#10). But the other five seem reasonable. And, no joke, 5 out of 10 ain’t bad. If I gave a list of 10 recommendations, I’d be happy if some outsider felt that 5 of them made sense.

So, overall, I like Myers’s post. It’s commonsensical, relevant to everyday life, and connects theory with evidence—all good things that I aspire to in my own teaching. Based on this post, I bet he writes good textbooks.

Just one thing . . .

There’s one thing that bugs me, though: The zombie psychology ideas that Myers mentions all seem to fall outside of current mainstream psychology. I guess that some of these ideas, such as the effect of parental nurture or learning styles, used to be popular in academic psychology, but no longer.

Here are some zombie psychology ideas that Myers didn’t mention:

11. Extreme evolutionary psychology: The claim that women are three times more likely to wear red or pink during certain times of the month, the claim that single women were 20 percentage points more likely to vote for Barack Obama during certain times of the month, the claim that beautiful parents are more likely to have girl babies, and lots more along those lines. These debunked claims all fit within a naive gender-essentialism that is popular within evolutionary psychology and in some segments of the public.

12. Claims that trivial things have large and consistent effects on people’s personal lives: The idea that disaster responses are much different for hurricanes with boy or girl names. The idea that all sorts of behaviors are different if your age ends in a 9. There are all these superficially plausible ideas but they are not borne out by the data.

13. Claims that trivial things have large and consistent effects on people’s political decisions: The claims that votes are determined by shark attacks, college football games, and subliminal smiley faces.

14. Embodied cognition, Sadness may impair color perception, Visual contrast polarizes moral judgment, etc etc etc.

OK, you get the idea. We could keep going and going. Just pick up an issue of Psychological Science or PNAS from a few years ago.

It’s good for a psychology textbook writer to point out misconceptions in psychology. Here’s how Myers ends his post:

When subjected to skeptical scrutiny, crazy-sounding ideas do sometimes find support. . . . But more often, as I suggest in Psychology 13th Edition (with Nathan DeWall), “science becomes society’s garbage collector, sending crazy-sounding ideas to the waste heap atop previous claims of perpetual motion machines, miracle cancer cures, and out-of-body travels. To sift reality from fantasy and fact from fiction therefore requires a scientific attitude: being skeptical but not cynical, open-minded but not gullible.”

That’s all fine. But watch out. Sometimes the call is coming from inside the house. Or, to be more specific, sometimes science (as manifested in the Association for Psychological Science, the National Academy of Sciences, etc.) is not “society’s garbage collector,” it’s society’s garbage creator, and it’s the institution that gives garbage a high value.

I’m not saying that psychology is worse than other fields. I’m just saying that if a psychologist is going to write about bad zombie ideas in psychology, it would make sense for him to include some that remain popular with high-status researchers within psychology itself.

## Econ grad student asks, “why is the government paying us money, instead of just firing us all?”

Someone who wishes anonymity writes:

I am a graduate student at the Department of Economics at a European university.

Throughout the last several years, I have been working as RA (and sometimes co-author) together with multiple different professors and senior researchers, mainly within economics, and predominantly analysing very large datasets.

I have 3 questions related to my accumulated professional experiences from “the garden of forking paths” etcetera. Maybe it would suffice with just asking the actual questions, but I nevertheless begin with some background and examples…

In my experience, all social science researchers that I have worked with seem to treat the process of writing a paper as some kind of exercise in going “back-and-forth” between theoretical analysis and empirical evidence. Just as an example, they (we) might run X number of regressions, and try to find a fitting theory that can explain the results. Researchers with the most top publications often seem to get/have access to the greatest number of RAs and PhD students, who perform thousands of analyses that only very few people will ever hear about unless something “promising” is found (or unless you happen to share the same office). I have performed plenty of such analyses myself. In one recent case, my task was to attempt to replicate a paper published in a top journal using data from my country (instead of from the country whose data was used in the original paper). When asking my boss in that project whether we could perhaps publish the results of this replication as a working paper, he replied that him and his collaborator (a famous professor from yet another country) just wanted me to perform this replication in order to see whether it was “worthwhile” to test some other (somewhat related) hypotheses they had. The idea, he wrote, was never to make any independent product out of this replication, or even to incorporate it into any related research product. In this case, I found “promising” results, so they decided to pursue the investigation of their (somewhat) related hypotheses. In other similar cases, where I didn’t find any such “promising” results, my boss decided to try something else or even drop the subject entirely.

Using e.g. “pre-analysis plans” never seems to be an option in practice for most researchers that I have worked with, and the more honest(?) ones explicitly tell me that it would be career suicide if they chose not to try out multiple analyses after looking at the data. Furthermore, at seminars people often ask if you have tried this or that analysis, if you haven’t (yet) found enough “valuable” stuff in your analyses. (To be fair, they also ask whether you have performed this or that robustness check or sensitivity analysis.)

I might also note that when writing my M.Sc. thesis, some supervisors explicitly and openly encouraged me and other students to exclude results that were not consistent with the overall “story” (we) tried to “tell”, with the motivation that “this is only a M.Sc. thesis”. However, while maybe not as openly encouraged, similar stuff nevertheless continues also during Ph.D. studies (if one is admitted to the Ph.D. program, perhaps partly thanks to a M.Sc. thesis telling a sufficiently convincing “story”). And, as my experiences as RA suggest, it continues also after finishing a Ph.D. In my experience, the incentives to be (partially) dishonest often seem to be quite overwhelming at most stages of a researcher’s career.

No one seems to worry too much about statistical significance, not necessarily because they do not care about “obtaining” significant results, but because if one analysis doesn’t yield any stars, you can almost always just try a different one (or ask your PhD student or RA to do it for you). I have “tried” hundreds of analyses, models and specifications during my four years as research assistant. I’d say that I might easily have produced sufficient material to publish at least 5 complete studies with null results or results that were not regarded as “interesting” or “clear” enough. No one except me and a few other researchers will hear of these results.

In the project where I am working at the moment, we are currently awaiting a delivery of data. While waiting, I suggested to my current boss, who has published articles in top journals, that I could write all the necessary code for our regression analyses as well as the empirical method section of our paper. In that way, we would have everything completely ready when finally getting access to all the data. My boss replied that this might be sensible with regards to the code used for e.g. the final merging of the data and some variable construction, but he argued that writing code for any subsequent regression analyses before obtaining access to the final datasets would be less useful for us since “after seeing the data you’ll always have to try out multiple different analyses”. To be fair, I want to stress here that my impression was not at all that he had any intention to fish. I simply interpreted his comment as a spontaneous and frank/disillusioned statement about what is (unfortunately) current standard practice at the department and in the field more generally. Similar to at least some other researchers that I have worked with, my current boss seems like a genuinely honest person and he also seems quite aware of many of the problems you mention in your articles and blog. My impression is that he is also transparent with most of what he does (at least compared to some others I’ve worked with). In my opinion, he is both generous, fair-minded and honest. Thus, my concern in this particular example is more about the existing incentives and standard practices which seem to make even the most honest researchers do stuff that perhaps they should not. I also want to stress that the sample sizes in our analyses are very large (using high-quality data, sometimes with millions of observations). Furthermore, (top) publications in economics often include e.g. 
a very comprehensive Web Appendix with sensitivity analyses, robustness checks and alternative specifications (and sometimes also replication dofiles).

Now, if you have time, my questions to you are the following:

1. Does e.g. having access to very large sample sizes in the analyses, and publishing a 100+ page Web Appendix together with any article, mitigate “the garden of forking paths” problems outlined above somewhat? And what can I do to contribute less to these problems?

2. A small number of researchers that I have collaborated with argue (at least in private) that their research is mainly to be regarded as “exploratory” because of the stuff I have outlined above. Would simply stating that one’s research is “exploratory” in a paper be a half-decent excuse to do any of the p-hacking and other stuff outlined in my email?

3. Has my job throughout the last several years been completely useless (or even destructive) for society? That is how I personally feel sometimes. And if I am right, why should we fund any social science research at all? It often seems to me that it would be impossible to get any of the incentives right. When recently asking a prominent researcher at my current workplace whether he believed that the current system of peer-review is successful in mitigating problems related to “the garden of forking paths” or even outright “cheating”, he simply started to laugh and replied “No, no… No! Absolutely not!” If this prominent researcher is right, why is the government paying us money, instead of just firing us all?

First, it’s too bad you have to remain anonymous. I understand it, but I just want to lament the state of the world, by which the expectation is that people can’t gripe in public without fear of retaliation.

Why is the government paying us money, instead of just firing us all? All cynicism aside, our society has the need, or the perceived need, for social science expertise. We need people to perform cost-benefit analyses, to game out scenarios, to make decisions about where to allocate resources. Governments need this, and businesses need it too. There’s more to decision analysis than running a spreadsheet. We need this expertise, and universities are where we train people to learn these tools, so the government funds universities. It doesn’t have to be done this way, but it’s how things have been set up for a while.

Should they be funding this particular research? Probably not! The trouble is, this is the research that is given the expert seal of approval in the “top 5 journals” and so on. So it’s not quite clear what to do. The government could just zero all the funding: that wouldn’t be the worst thing in the world, but it would disrupt the training pipeline. So I can see why they are funding you.

Has my job during the last 4 years been completely useless (or even destructive) for society? Possibly. Again, though, perhaps the most important part of the job is not the research but the training. Also, even if your own research has been a complete waste of time, or even negative value in that it’s wasting other people’s time and money also, there might be other research being done by similarly-situated people in your country that’s actually useful. In which case it could make sense to quit your current job and work for some people who are doing good work. In practice, though, this could be difficult to do or even bad for your career, so I’m not sure what to actually recommend.

Would simply stating that one’s research is “exploratory” in a paper be a half-decent excuse to do any of the p-hacking and other criminal stuff? Lots of my research is exploratory, and that’s fine. The problem with p-values, p-hacking, etc., is not that they are “exploratory” but rather that they’re mostly a way to add noise to your data. Take a perfectly fine, if noisy, experiment, run it through the statistical-significance filter (made worse by p-hacking, but often pretty bad even when only one analysis is done on your data), and you can end up with something close to a pile of random numbers. That’s not good for exploratory research either!
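To make the statistical-significance filter concrete, here is a minimal simulation sketch in R. The true effect (0.1) and standard error (0.5) are made-up numbers chosen only to represent a noisy study; nothing here comes from the correspondent's data. Each simulated study gives an unbiased estimate, and we then look only at the estimates that clear the conventional 1.96-standard-error threshold.

```r
set.seed(123)
true_effect <- 0.1   # hypothetical small true effect (assumption)
se <- 0.5            # hypothetical standard error of each estimate (assumption)
n_sims <- 10000

# Each simulated study returns a noisy but unbiased estimate
estimate <- rnorm(n_sims, mean = true_effect, sd = se)

# The filter: keep only estimates more than 1.96 standard errors from zero
significant <- abs(estimate) > 1.96 * se

# Among the survivors: how exaggerated are they, and how often wrong-signed?
exaggeration <- mean(abs(estimate[significant])) / true_effect
wrong_sign <- mean(estimate[significant] < 0)

print(c(share_significant = mean(significant),
        exaggeration = exaggeration,
        wrong_sign = wrong_sign))
```

Under these assumed numbers, every estimate that survives the filter is at least 1.96 * 0.5 = 0.98 in absolute value, roughly ten times the true effect, and a sizable share of the survivors have the wrong sign.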

So, no, labeling one’s research as exploratory is no excuse at all for doing bad work. Honesty and transparency are no excuse for doing bad work either. A good person can do bad work. Doing bad work doesn’t mean you’re a bad person; being a good person doesn’t mean you’re doing good work.

Does e.g. having access to very large sample sizes in the analyses, and publishing a 100+ page Web Appendix together with any article, mitigate “the garden of forking paths” problems? I don’t think the 100+ page web appendix will get you much. I mean, fine, sure, include it, but web appendixes are subject to forking paths, just like everything else. I’ve seen lots of robustness studies that avoid dealing with real problems. My recommendation is to perform the analyses simultaneously using a multilevel model. Partial pooling is your friend.

## Rethinking Rob Kass’ recent talk on science in a less statistics-centric way.

Reflection on a recent post on a talk by Rob Kass has led me to write this post. I liked the talk very much and found it informative, perhaps especially for its call to clearly distinguish abstract models from brute force reality. I believe that is a very important point that has often been lost sight of by many statisticians in the past. As evidence of that, I would point to the many who cite Box’s quote “all models are wrong, but some are useful” as being insightful, rather than as something already at the top of most statisticians’ minds.

However, the reflection has led me to think Kass’ talk is too statistics-centric. Now, Kass’ talk was only about 25 minutes long while being on a subtle topic. It is very hard to be both concise and fully balanced, but I believe we have a different perspective and I would like to bring that out here. For instance, I think this statement, “I [Kass] conclude by saying that science and the world as a whole would function better if scientific narratives were informed consistently by statistical thinking,” would be better put as saying that statistics and the statistical discipline as a whole would function better if statistical methods and practice were informed consistently by purposeful experimental thinking (AKA scientific thinking).

Additionally, this statement, “the essential flaw in the ways we talk about science is that they neglect the fundamental process of reasoning from data,” seems somewhat dismissive of science being even more fundamentally about the process of reasoning from data, with statistics being a specialization for when data are noisy or vary haphazardly. In fact, Stephen Stigler has argued that statistics arose as a result of astronomers trying to make sense of observations that varied when they believed what was being observed did not.

Finally, this statement “the aim of science is to explain how things work” I would rework into the aim (logic) of science is to understand how experiments can bring out how things work in this world by using abstractions that themselves are understood by using experiments. So experiments all the way up.

As usual, I am drawing heavily on my grasp of writings by CS Peirce. He seemed to think that everything should be thought of as an experiment, including mathematics, which he defined as experiments performed on diagrams or symbols rather than chemicals or physical objects. Some quotes from his 1905 paper What pragmatism is: “Whenever a man acts purposively, he acts under a belief in some experimental phenomenon. … some unchanging idea may come to influence a man more than it had done; but only because some experience equivalent to an experiment has brought its truth home to him more intimately than before…”

I do find it helpful to think of anything one can as an experiment. For instance, in this previous post, discussion led to a comment by Andrew that “Mathematics is simulation by other means.” One way to unpack this, by thinking of mathematics as experiments on diagrams or symbols, would be to claim that calculus is one design of an experiment while simulation is just another design. Different costs and advantages, that’s all. What is fundamental is the idea of being experimental and experimenting as appropriately as one can. Sorting out what is most appropriate then points to economy of research as the other fundamental piece.

## Battle of the open-science asymmetries

1. Various tenured legacy-science yahoos say: “Any idiot can write a critique; it takes work to do original research.” That’s my paraphrase of various concerns that the replication movement makes it too easy for critics to get cheap publications.

2. Rex Douglass says: “It is an order of magnitude less effort to spam poorly constructed hypotheticals than it is to deconstruct them.” That’s one of the problems that arise when we deal with junk science published in PNAS etc. Any well-connected fool can run with some data, make dramatic claims, and get published in PNAS, get featured in NPR, etc. But it can take a lot of work to untangle exactly what went wrong. Which is one reason we have to move beyond the default presumption that published and publicized claims are correct.

It’s interesting that these two perspectives are exact opposites. We have two models of the world:

1. Original researchers do all the hard work, and “open science” critics such as myself are lazy parasites; or

2. People who write and publish junk science papers are lazy parasites on the body of healthy science, and open science critics have to do the hard work to figure out exactly what went wrong in each case.

Which model is correct?

I guess both models are correct, at different times. It depends on the quality of the science and the quality of the criticism. Also, some junk science takes a lot of work. Brian Wansink wasn’t just sitting on his ass! He was writing press releases and doing TV interviews all the time: that’s not easy.

In any case, it’s interesting that people on both sides of this divide each think that they’re doing the hard work and that the people on the other side are freeloaders.

## From monthly return rate to importance sampling to path sampling to the second law of thermodynamics to metastable sampling in Stan

(This post is by Yuling, not Andrew, though many of the ideas originated with Andrew.)

This post is intended to advertise our new preprint Adaptive Path Sampling in Metastable Posterior Distributions by Collin, Aki, Andrew and me, where we developed an automated implementation of path sampling and adaptive continuous tempering. But I have recently been reading a book on writing, and I am convinced that an advertisement post should always start with stories to hook the audience, so let’s pretend I am starting this post from the next paragraph with an example.

#### Some elementary math about geometric means

To start, one day I was computing my portfolio return, but I was doing that in Excel, in which the only functions I knew were mean and sum. So I would simply calculate the average monthly return and multiply it by 12.

Of course you do not really need to open your 401(k) account now. We can simulate data in R too, which amounts to:

month_return=rnorm(60,.01,0.04)
print(mean(month_return)*12)


Sure, it is not the “annualized return”. A 50% gain followed by a 50% loss in the next month results in 1.5 * 0.5 = 0.75, or a 25% loss. Jensen’s inequality ensures that the geometric mean is never larger than the arithmetic mean.

exp(mean(log(1+ month_return)))-1- mean(month_return)

That said, this difference is small since the monthly return itself is typically close to zero. And even after annualizing both estimates to a whole year, the two annualized returns are still nearly identical:

exp(mean(log(1+month_return)))^12-1 - mean(month_return)*12

Intuitively, the arithmetic mean is larger per unit, but that is offset by its lack of compound interest.
This is nothing surprising beyond elementary math. The first-order Taylor series expansion says
$\log (1+ x) \approx x,$
for any small x. And the log of the geometric mean of 1+x is exactly the arithmetic mean of log(1+x). Indeed, in the limit of per-second returns, by integrating both sides, these two approaches are just identical. When the per-unit change in x is not smooth enough (i.e., the gaps are too big), the second-order term $-\frac{1}{2}\mathrm{var}(x)$ will kick in via Itô calculus, but that is irrelevant to what I will discuss next. Also, the implicit assumption here is that x>-1, since otherwise log(1+x) is undefined, so this asymptotic equivalence will also fail if the account is liquidated, but that is even more irrelevant to what I will discuss.
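As a quick numerical check of the second-order term, the sketch below simulates monthly returns the same way as above (the seed is an arbitrary choice of mine) and compares the gap between the geometric and arithmetic mean returns with minus one half of the variance of the returns.

```r
set.seed(1)
month_return <- rnorm(60, 0.01, 0.04)  # simulated the same way as above

arithmetic <- mean(month_return)
geometric <- exp(mean(log(1 + month_return))) - 1

# Second-order approximation: geometric - arithmetic is about -0.5 * variance
gap <- geometric - arithmetic
second_order <- -0.5 * mean((month_return - arithmetic)^2)

print(c(gap = gap, second_order = second_order))
```

The two numbers should agree up to third-order terms, which are tiny when monthly returns are a few percent.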

#### The bridge between Taylor series expansion and importance sampling

Forget about finance; let’s focus on the normalizing constant in statistical computing. In many statistical problems, including marginal likelihood computation and marginal density/moment estimation in MCMC, we are given an unnormalized density $q(\theta, \lambda)$, where $\theta \in \Theta$ is a multidimensional sampling parameter and $\lambda \in \Lambda$ is a free parameter, and we need to evaluate the integrals $z(\lambda)=\int_{\Theta} q(\theta, \lambda) d \theta$, for any $\lambda \in \Lambda.$

Here $z(\cdot)$ is a function of $\lambda$. For convenience, let’s still call $z(\cdot)$ the normalizing constant, even though it is a function rather than a single number.

Two accessible but conceptually orthogonal approaches stand out for computing the normalizing constant. Viewing it as an expectation with respect to the conditional density $\theta|\lambda \propto q(\theta, \lambda)$, we can numerically integrate it using quadrature, the simplest form of which is linear interpolation, and the first-order Taylor series expansion yields
$\log\frac{z(\lambda)}{z(\lambda_0)} \approx (\lambda - \lambda_0) \frac{1}{ z(\lambda_0)}\int_\Theta \left(\frac {d}{d \lambda }q(\theta, \lambda) \Big|_{\lambda=\lambda_0} \right)d \theta.$
In contrast, we can sample from the conditional density $p(\theta| \lambda_0) \propto q(\theta, \lambda_0)$, and apply importance sampling,
$\frac{z(\lambda)}{z(\lambda_0)}\approx\frac{1}{S}\sum_{s=1}^S \frac{ q (\theta_s, \lambda)}{q(\theta_s, \lambda_0)}, \quad \theta_{s} \sim p (\theta | \lambda_0), \; s=1, \cdots, S.$
For the same reason that the annualized arithmetic return is equivalent to the geometric return, the first-order Taylor series expansion and importance sampling reach the same first-order limit when the proposal is infinitely close to the target. That is, writing $p(\theta|\lambda)=q(\theta,\lambda)/z(\lambda)$ for the normalized conditional density, for any fixed $\lambda_0$, as $\delta=\lambda_1 - \lambda_0\to 0$,
$\frac{1}{\delta}\log E_{\lambda_{0}}\left[\frac{q(\theta,\lambda_{1})}{q(\theta,\lambda_0)}\right] = \int_{\Theta} \frac{\partial}{\partial \lambda} \log q(\theta,\lambda)\Big|_{\lambda=\lambda_{0}}\, p(\theta|\lambda_0)\,d\theta + o(1)= \frac{1}{\delta}E_{\lambda_{0}}\left[ \log \frac{q(\theta,\lambda_{1})} {q(\theta,\lambda_0)}\right].$
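Here is a minimal sketch of that equivalence on a toy problem. The density $q(\theta,\lambda)=\exp(-\lambda\theta^2/2)$ is my own choice for illustration (it is not from the paper); it has $z(\lambda)=\sqrt{2\pi/\lambda}$, so the exact answer is available for comparison.

```r
set.seed(2020)
# Toy unnormalized density (an assumption for illustration):
# q(theta, lambda) = exp(-lambda * theta^2 / 2), so z(lambda) = sqrt(2 * pi / lambda)
log_q <- function(theta, lambda) -lambda * theta^2 / 2
lambda0 <- 1
lambda1 <- 1.1
exact <- -0.5 * log(lambda1 / lambda0)

# Draws from the conditional density theta | lambda0, here N(0, 1 / lambda0)
S <- 1e5
theta <- rnorm(S, 0, sqrt(1 / lambda0))

# Importance sampling estimate of log(z(lambda1) / z(lambda0))
log_is <- log(mean(exp(log_q(theta, lambda1) - log_q(theta, lambda0))))

# First-order Taylor estimate: (lambda1 - lambda0) * E[d/dlambda log q],
# where d/dlambda log q(theta, lambda) = -theta^2 / 2
taylor <- (lambda1 - lambda0) * mean(-theta^2 / 2)

print(c(exact = exact, importance_sampling = log_is, taylor = taylor))
```

In this toy case the Taylor estimate is exactly the sample mean of the log importance ratios, so by Jensen's inequality it can never exceed the log of the importance sampling estimate; the two move closer together as `lambda1` approaches `lambda0`.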

#### Path sampling

The path sampling estimate is just the integral of the dominant term in the middle of the equation above:
$\log\frac{z(\lambda_1)}{z(\lambda_0)}=\int_{\lambda_0}^{\lambda_1}\int_{\Theta} \frac{\partial}{\partial \lambda} \log q(\theta,\lambda)\, p(\theta|\lambda)\,d\theta\, d \lambda.$
In this sense, path sampling is the continuous limit of both importance sampling and the Taylor expansion approach.

More generally, if we draw samples $(a, \theta)$ from the joint density
$p(a,\theta)\propto\frac{1}{c(a)}q(\theta,\lambda=f(a)),$
where $c()$ is any pseudo prior, and $a$ is a transformed parameter of $\lambda$ through a link function $f()$, then thermodynamic integration (Gelman and Meng, 1998, indeed dating back to Kirkwood, 1935) yields the identity
$\frac{d}{da}\log z(f(a))=E_{\theta|f(a)}\Bigl[\frac{\partial}{\partial a}\log q(\theta, f(a))\Bigr],$
where the expectation is over the invariant conditional distribution $\theta | a \propto q(\theta, f(a))$.

If $a$ is continuous, this conditional $\theta|f(a)$ is hard to evaluate, as typically there is only one $\theta$ for each unique $a$. However, it is still a valid stochastic approximation to approximate
$E_{\theta|f(a_i)}\Bigl[\frac{\partial}{\partial a}\log q(\theta,f(a))\Bigr]\approx\frac{\partial}{\partial a}\log q(\theta_i, f(a_i))$
by one Monte Carlo draw.
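A minimal sketch of the one-draw-per-$a$ idea, under simplifying assumptions of my own: the toy density $q(\theta,\lambda)=\exp(-\lambda\theta^2/2)$, an identity link $\lambda=a$, and $a$ drawn uniformly on the path (so no adaptation of the pseudo prior is needed). This is not the paper's implementation, only an illustration of the identity.

```r
set.seed(42)
# Toy density (an assumption): q(theta, lambda) = exp(-lambda * theta^2 / 2),
# for which log z(lambda1) - log z(lambda0) = -0.5 * log(lambda1 / lambda0)
lambda0 <- 1
lambda1 <- 4
exact <- -0.5 * log(lambda1 / lambda0)

# Draw lambda uniformly on the path, then ONE theta for each lambda value
N <- 1e5
lambda <- runif(N, lambda0, lambda1)
theta <- rnorm(N, 0, sqrt(1 / lambda))

# Pointwise gradient d/dlambda log q(theta, lambda) at each joint draw
u <- -theta^2 / 2

# Path sampling: integrate E[u | lambda] over lambda. With uniform lambda
# draws, the integral is the path length times the average gradient.
path_estimate <- (lambda1 - lambda0) * mean(u)
print(c(exact = exact, path_sampling = path_estimate))
```

Even though each value of `lambda` is visited only once, averaging the single-draw gradients over the whole path recovers the log ratio of normalizing constants.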

#### Continuously simulated tempering on an isentropic circle

Simulated tempering and its variants provide an accessible approach to sampling from a multimodal distribution. We augment the state space $\Theta$ with an auxiliary inverse temperature parameter $\lambda$, and employ a sequence of interpolating densities, typically through a power transformation $p_j\propto p(\theta|y)^{\lambda_j}$ on the ladder $0=\lambda_0 \leq \cdots\leq \lambda_K=1$,
such that $p_K$ is the distribution we want to sample from and $p_0$ is a (proper) base distribution. At a smaller $\lambda$, the between-mode energy barriers in $p(\theta|\lambda)$ collapse and the Markov chains mix more easily. This dynamic makes the sampler more likely to fully explore the target distribution at $\lambda=1$. However, the number of interpolating densities K often needs to scale at least as O(parameter dimension), which soon becomes unaffordable in even moderately high-dimensional problems.
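To see what "collapsing barriers" means for the power transformation, the sketch below uses a made-up bimodal target (an equal mixture of two well-separated normals, my own choice) and measures the log-density drop between a mode and the valley for several values of $\lambda$.

```r
# Hypothetical bimodal target: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2)
log_p <- function(theta) {
  log(0.5 * dnorm(theta, -3, 0.5) + 0.5 * dnorm(theta, 3, 0.5))
}

# Barrier between the mode at 3 and the valley at 0 for the tempered
# density p^lambda. Normalizing constants cancel in the difference.
barrier <- function(lambda) lambda * (log_p(3) - log_p(0))

lambdas <- c(1, 0.5, 0.1, 0.01)
print(data.frame(lambda = lambdas, barrier_in_nats = barrier(lambdas)))
```

Because the log density is simply multiplied by $\lambda$, the barrier shrinks linearly with the inverse temperature, which is why chains at small $\lambda$ can cross between modes.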

A few years ago, Andrew came up with this idea to use path sampling in continuous simulated tempering (I remember he did so during one BDA class when he was teaching simulated tempering). In this new paper, we present an automated way to conduct continuously simulated tempering and adaptive path sampling.

The basic strategy is to augment the target density $q(\theta)$, potentially multimodal, by a tempered path
$\frac{1}{c(\lambda)}\psi(\theta)^{1-\lambda}q(\theta)^{\lambda}$, where $\psi$ is some base measure. Here the temperature $\lambda$ is continuous in [0,1], so as to adapt to regions where the conditional density changes rapidly with the temperature (which might be missed by discrete tempering).

Directly sampling $\theta$ and $\lambda$ makes it hard to obtain samples from the target density (as Pr($\lambda=1$)=0 under any continuous density). Hence we further transform $\lambda$ into a parameter $a$ using a piecewise polynomial link function $\lambda=f(a)$ that rises from 0 to 1, stays flat at 1, and then falls back to 0.

An $a$-trajectory from 0 to 2 corresponds to a complete $\lambda$ tour from 0 to 1 (cooling) and back down to 0 (heating). This formulation allows the sampler to cycle back and forth through the space of lambda continuously, while ensuring that some of the simulation draws (those with a between 0.8 and 1.2) are drawn from the exact target distribution with $\lambda=1$. The actual sampling takes place in $a\in [0,2]\times\theta\in\Theta$. It has a continuous density and is readily implemented in Stan.
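For readers who want to picture such a link function, here is one hypothetical choice in R. It is not the function used in the paper: I simply assume a cubic "smoothstep" rise on [0, 0.8], a plateau at 1 on [0.8, 1.2] to match the interval mentioned above, and a mirrored fall on [1.2, 2].

```r
# Hypothetical link function (NOT the one used in the paper): smoothstep rise
# on [0, 0.8], plateau at 1 on [0.8, 1.2], mirrored fall on [1.2, 2]
link_f <- function(a) {
  a_folded <- ifelse(a > 1, 2 - a, a)   # reflect around a = 1
  t <- pmin(a_folded / 0.8, 1)          # rescale the rise to [0, 1], cap on plateau
  3 * t^2 - 2 * t^3                     # cubic smoothstep, flat at both ends
}

a_grid <- seq(0, 2, by = 0.2)
print(data.frame(a = a_grid, lambda = link_f(a_grid)))

# Share of evenly spaced a values that land exactly on the target (lambda = 1)
print(mean(link_f(seq(0, 2, length.out = 1001)) == 1))
```

The plateau is what makes the scheme practical: any draw whose `a` falls on it has `lambda` exactly equal to 1 and can be used directly as a draw from the target.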

We then use path sampling to compute the log normalizing constant of this system, and adaptively modify the $a$-margin to ensure a complete cooling-heating cycle.

Notably, in discrete simulated tempering and annealed importance sampling, we typically compute the marginal density and the normalizing constant via
$\frac{z(\lambda_K)}{z(\lambda_0)}=E\left[\exp\mathcal{W}(\theta_0,\dots,\theta_{K-1})\right],\quad \mathcal{W}(\theta_0,\dots,\theta_{K-1})=\log\prod_{j=0}^{K-1}\frac{q(\theta_j,\lambda_{j+1})}{q(\theta_j,\lambda_{j})}.$
In statistical physics, the $\mathcal{W}$ quantity can be interpreted, up to sign, as virtual work induced on the system. The same Jensen’s inequality that has appeared twice above leads to $E \mathcal{W} \leq \log z(\lambda_K)- \log z(\lambda_0)$. Taking the work done on the system to be $-\mathcal{W}$ and the free energy to be $-\log z$, this is a microscopic analogy to the second law of thermodynamics: the work put into the system is always at least as large as the free energy change, with equality only when the switching is carried out infinitely slowly, which corresponds to the proposed path sampling scheme where the interpolating densities are infinitely dense. A physics-minded reader might call this procedure a reversible-adiabatic/isentropic/quasistatic process: it is the conjunction of path sampling estimation (unbiased for free energy) and continuous tempering (infinitely many interpolating states) that preserves the entropy.
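The gap between the average work and the free energy change, and its disappearance as the ladder gets denser, can be seen in a small simulation. This is a sketch under assumptions of my own: the toy density $q(\theta,\lambda)=\exp(-\lambda\theta^2/2)$, a ladder from $\lambda=1$ to $\lambda=4$, and each $\theta_j$ drawn exactly from $p(\theta|\lambda_j)$. With $\mathcal{W}$ defined as the log of the product of density ratios, Jensen's inequality gives $E\mathcal{W}\leq\log z(\lambda_K)-\log z(\lambda_0)$; equivalently, the work $-\mathcal{W}$ is at least the free energy change $-\log z(\lambda_K)+\log z(\lambda_0)$.

```r
set.seed(7)
# Toy system (an assumption): q(theta, lambda) = exp(-lambda * theta^2 / 2),
# so log z(4) - log z(1) = -0.5 * log(4 / 1)
exact <- -0.5 * log(4 / 1)

simulate_W <- function(K, n_rep = 5000) {
  ladder <- seq(1, 4, length.out = K + 1)
  steps <- diff(ladder)
  # Column j holds draws of theta_j from p(theta | lambda_j), j = 0, ..., K - 1
  sds <- rep(sqrt(1 / ladder[1:K]), each = n_rep)
  theta <- matrix(rnorm(n_rep * K, 0, sds), nrow = n_rep)
  # W = sum_j [log q(theta_j, lambda_{j+1}) - log q(theta_j, lambda_j)]
  as.vector((-theta^2 / 2) %*% steps)
}

for (K in c(1, 10, 100)) {
  W <- simulate_W(K)
  cat(sprintf("K = %3d  mean(W) = %.3f  log(mean(exp(W))) = %.3f  exact = %.3f\n",
              K, mean(W), log(mean(exp(W))), exact))
}
```

The exponentiated average stays close to the exact value for every `K`, while the plain average of `W` falls short of it for coarse ladders and closes the gap as `K` grows.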

This asymptotic equivalence between importance-sampling-based tempering and continuous tempering is again the same math behind the annualized-return example. The log(1 + monthly return) term now corresponds to the free energy (log normalizing constant) difference between two adjacent temperatures, which are essentially dense in our scheme.

#### Practical implementation in Stan

Well, if a blog post were an ideal format for writing math equations, there would be no journal articles, for the same reason that MCMC would not exist if importance sampling scaled well. Details of our methods are better summarized in the paper. We also provide easy access to continuous tempering in R and Stan, as a black-box implementation of path-sampling-based continuous tempering.
Consider the following Stan model:

    data {
      real y;
    }
    parameters {
      real theta;
    }
    model {
      y ~ cauchy(theta, 0.1);
      -y ~ cauchy(theta, 0.1);
    }

With a moderately large input y, the posterior distribution of theta puts essentially all of its mass in two narrow spikes, one near y and one near -y. As a result, Stan cannot fully sample from this two-spike posterior even with a large number of iterations, because a chain that starts in one spike rarely crosses to the other.
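To see how isolated the two spikes are, the following Python sketch evaluates the unnormalized log posterior of the two-Cauchy model above at a few values of theta. It assumes y = 10 and the implicit flat prior on theta; `log_post` is a name made up for this illustration.

```python
import math


def log_post(theta: float, y: float = 10.0, scale: float = 0.1) -> float:
    """Unnormalized log posterior of the two-Cauchy model (flat prior on theta)."""

    def log_cauchy_kernel(x: float) -> float:
        # log of the Cauchy density in x with location theta, dropping constants
        return -math.log1p(((x - theta) / scale) ** 2)

    return log_cauchy_kernel(y) + log_cauchy_kernel(-y)


for theta in (-10.0, -9.0, 0.0, 9.0, 10.0):
    print(f"theta = {theta:6.1f}   log density = {log_post(theta):8.3f}")
```

Moving just one unit away from either spike already costs several units of log density, and the spikes themselves are about 0.1 wide while sitting 20 apart, which is why a local sampler has trouble travelling between them.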

To run continuous tempering, a user can specify any base model, say normal(0,5), and list it in an alternative model block as if it is a regular model.

    model { // keep the original model
      y ~ cauchy(theta, 0.1);
      -y ~ cauchy(theta, 0.1);
    }
    alternative model { // add a new block of the base measure (e.g., the prior).
      theta ~ normal(0, 5);
    }

Save this as cauchy.stan. To run path sampling, simply run

    devtools::install_github("yao-yl/path-tempering/package/pathtemp")
    library(rstan)     # provides stan_model() and extract()
    library(pathtemp)
    update_model <- stan_model("solve_tempering.stan")
    # https://github.com/yao-yl/path-tempering/blob/master/solve_tempering.stan
    file_new <- code_temperature_augment("cauchy.stan")
    sampling_model <- stan_model(file_new) # compile
    path_sample_fit <- path_sample(data=list(y=10), # the data list in the original model (named y above)
                                   sampling_model=sampling_model,
                                   N_loop=5, iter_final=6000)

The returned value provides access to the posterior draws from the target density and base density, the joint path in the final adaptation, and the estimated log normalizing constant.

    sim_cauchy <- extract(path_sample_fit$fit_main)
    in_target <- sim_cauchy$lambda==1
    in_prior <- sim_cauchy$lambda==0
    # sample from the target
    hist(sim_cauchy$theta[in_target])
    # sample from the base
    hist(sim_cauchy$theta[in_prior])
    # the joint "path"
    plot(sim_cauchy$a, sim_cauchy$theta)
    # the normalizing constant
    plot(g_lambda(path_sample_fit$path_post_a), path_sample_fit$path_post_z)


Here is the output.

Have fun playing with code!

Let me close with two caveats. First, we don’t think tempering can solve all metastability issues in MCMC. Tempering imposes limitations on dimensions and problem scales. We provide a failure mode example where the proposed tempering scheme (or any other tempering methods) fails. Adaptive path sampling comes with an importance-sampling-theory-based diagnosis that makes this failure manifest. That said, in many cases we find this new path-sampling-adapted-continuous-tempering approach performs better than existing methods for metastable sampling. Second, posterior multimodality often betokens model misspecification (we discussed this in our previous paper on chain-stacking). The ultimate goal is not to stop at the tempered posterior simulation draws, but to use them to check and improve the model in a workflow.

## Parallel in Stan

by Andrew Gelman and Bob Carpenter

We’ve been talking about some of the many, many ways that parallel computing is, or could be, used in Stan. Here are a few:

– Multiple chains (Stan runs 4 or 8 on my laptop automatically)

– Hessians scale linearly in computation with dimension and are super useful. And we now have a fully vetted forward mode other than for ODEs.

– EP (data partitioning)

– Running many parallel chains, stopping perhaps before convergence, and weighting them using stacking (Yuling and I are working on a paper on this)

– Bob’s idea of using many parallel chains spawned off an optimization, as a way to locate the typical set during warmup

– Generic MPI for multicore in-box and out-of-box for parallel density evaluation

– Multithreading for parallel forward and backward time exploration in HMC

– GPU kernelization of sequence operations

– Multithreading for multiple outcomes in density functions

– Then there’s all the SSE optimization down at the CPU level for pipelining.

P.S. Thanks to Zad for the above image demonstrating parallelism.

## Post-stratified longitudinal item response model for trust in state institutions in Europe

This is a guest post by Marta Kołczyńska:

Paul, Lauren, Aki, and I (Marta) wrote a preprint where we estimate trends in political trust in European countries between 1989 and 2019 based on cross-national survey data.

This paper started from the following question: How can we estimate country-year levels of political trust with data from surveys that (a) mostly have the same trust questions but measured with ordinal rating scales of different lengths, and (b) mostly have samples that aim to be representative of general adult populations, but likely achieve this representativeness to different degrees?

Our solution combines:

1. item response models of responses to trust items that account for the varying scale lengths across survey projects,
2. splines to model changes over time,
3. post-stratification by age, sex, and education.
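For readers new to the third ingredient, post-stratification boils down to a population-weighted average of cell-level estimates. Here is a minimal Python sketch; the cells, trust estimates, and population counts are made-up numbers, and this is not the brms code used in the paper.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str, str]  # (age group, sex, education)


def poststratify(cell_estimates: Dict[Cell, float], cell_counts: Dict[Cell, float]) -> float:
    """Population-weighted average of model-based cell estimates."""
    total = sum(cell_counts[cell] for cell in cell_estimates)
    return sum(cell_estimates[cell] * cell_counts[cell] for cell in cell_estimates) / total


# Hypothetical model-based mean trust per cell and hypothetical census counts.
estimates = {
    ("18-44", "f", "low"): 0.40,
    ("18-44", "m", "high"): 0.50,
    ("45+", "f", "low"): 0.45,
    ("45+", "m", "high"): 0.60,
}
counts = {
    ("18-44", "f", "low"): 100,
    ("18-44", "m", "high"): 300,
    ("45+", "f", "low"): 200,
    ("45+", "m", "high"): 400,
}

print(f"post-stratified estimate: {poststratify(estimates, counts):.4f}")
print(f"unweighted cell average:  {sum(estimates.values()) / len(estimates):.4f}")
```

The gap between the two printed numbers is the adjustment: cells that are large in the population count for more, whatever their share of the survey sample.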

In the paper we try to explain all the modeling decisions, so that the paper may serve as a guide for people who want to apply similar methods or — even better — extend and improve them.

We apply this approach to data from 12 cross-national projects (1663 national surveys) carried out in 27 European countries between 1989 and 2019. We find that (a) political trust is pretty volatile, (b) there has not been any clear downward trend in political trust in Europe in the last 30 years, although trust did decline in many Central-East European countries in the 1990s, and there was a visible dip following the 2008 crisis in countries that were hit most, followed by at least partial recovery. Below are estimated levels of political trust for the 27 countries (see the preprint for more details on differences in political trust by sex, age, and education):

The modeling was done in brms thanks to some special features that Paul wrote, and overall this is one of the projects that would not have been possible without Stan.

One of the main obstacles we faced was the limited availability of population data for post-stratification. In the end we used crude education categories (less than high school, high school or above – also because of the incoherent coding of education in surveys), combined Eurostat data with harmonized census samples from IPUMS International, and imputed values for the missing years.

We think our approach or some of its components can be more broadly applied to modeling attitudes in a way that addresses issues of measurement and sample representativeness.

## Automatic data reweighting!

John Cook writes:

Suppose you are designing an autonomous system that will gather data and adapt its behavior to that data.

At first you face the so-called cold-start problem. You don’t have any data when you first turn the system on, and yet the system needs to do something before it has accumulated data. So you prime the pump by having the system act at first . . .

Now you face a problem. You initially let the system operate on assumptions rather than data out of necessity, but you’d like to go by data rather than assumptions once you have enough data. Not just some data, but enough data. Once you have a single data point, you have some data, but you can hardly expect a system to act reasonably based on one datum. . . . you’d like the system to gradually transition . . . weaning the system off initial assumptions as it becomes more reliant on new data.

The delicate part is how to manage this transition. How often should you adjust the relative weight of prior assumptions and empirical data? And how should you determine what weights to use? Should you set the weight given to the prior assumptions to zero at some point, or should you let the weight asymptotically approach zero?

Fortunately, there is a general theory of how to design such systems. . . .

Cool! Sounds like a good idea to me. Could be the basis for a new religion if you play it right.
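The general theory being alluded to is presumably Bayesian updating. As a minimal sketch, assuming a Beta prior on a success probability (my choice of example, not necessarily Cook’s), the posterior mean is a weighted average of the prior mean and the data mean, and the weight on the prior shrinks automatically as data accumulate. The function name and numbers below are made up for illustration.

```python
def posterior_mean(prior_mean: float, prior_n: float, successes: int, trials: int) -> float:
    """Posterior mean of a success probability under a Beta prior.

    The prior is Beta(prior_mean * prior_n, (1 - prior_mean) * prior_n),
    so prior_n acts like a number of pseudo-observations.
    """
    weight_prior = prior_n / (prior_n + trials)
    data_mean = successes / trials if trials > 0 else 0.0
    return weight_prior * prior_mean + (1.0 - weight_prior) * data_mean


# The prior says 0.5 with the strength of 10 observations; the data keep coming in at 90%.
for trials in (0, 1, 10, 100, 1000):
    successes = round(0.9 * trials)
    weight = 10 / (10 + trials)
    estimate = posterior_mean(0.5, 10, successes, trials)
    print(f"n = {trials:5d}  weight on prior = {weight:.3f}  estimate = {estimate:.3f}")
```

In this setup the weight on the prior is never set to zero by hand; it just decays like 1/n.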

## Problem of the between-state correlations in the Fivethirtyeight election forecast

Elliott writes:

I think we’re onto something with the low between-state correlations [see item 1 of our earlier post]. Someone sent me this collage of maps from Nate’s model that show:

– Biden winning every state except NJ
– Biden winning LA and MS but not MI and WI
– Biden losing OR but winning WI, PA

And someone says that in the 538 simulations where Trump wins CA, he only has a 60% chance of winning the elec overall.

Seems like the arrows are pointing to a very weird covariance structure.

I agree that these maps look really implausible for 2020. How’s Biden gonna win Idaho, Wyoming, Alabama, etc. . . . but not New Jersey?

But this does all seem consistent with correlations of uncertainties between states that are too low.

Perhaps this is a byproduct of Fivethirtyeight relying too strongly on state polls and not fully making use of the information from national polls and from the relative positions of the states in previous elections.

If you think of the goal as forecasting the election outcome (by way of vote intentions; see item 4 in the above-linked post), then state polls are just one of many sources of information. But if you start by aggregating state polls, and then try to hack your way into a national election forecast, then you can run into all sorts of problems. The issue here is that the between-state correlation is mostly not coming from the polling process at all; it’s coming from uncertainty in public opinion changes among states. So you need some underlying statistical model of opinion swings in the 50 states, or else you need to hack in a correlation just right. I don’t think we did this perfectly either! But I can see how the Fivethirtyeight team could’ve not even realized the difficulty of this problem, if they were too focused on creating simulations based on state polls without thinking about the larger forecasting problem.

There’s a Bayesian point here, which is that correlation in the prior induces correlation in the posterior, even if there’s no correlation in the likelihood.
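Here is a small Python sketch of that point for two states, under assumptions chosen for simplicity: a bivariate normal prior with equal standard deviations and correlation `prior_rho`, and one independent normal poll per state. The function names are made up for this illustration.

```python
from typing import List

Matrix = List[List[float]]


def inv2(m: Matrix) -> Matrix:
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def posterior_correlation(prior_sd: float, prior_rho: float, poll_sd: float) -> float:
    """Posterior correlation between two state means after one independent poll each."""
    cov = prior_rho * prior_sd**2
    prior_prec = inv2([[prior_sd**2, cov], [cov, prior_sd**2]])
    # Independent polls add precision on the diagonal only (no likelihood correlation).
    post_prec = [
        [prior_prec[0][0] + 1.0 / poll_sd**2, prior_prec[0][1]],
        [prior_prec[1][0], prior_prec[1][1] + 1.0 / poll_sd**2],
    ]
    post_cov = inv2(post_prec)
    return post_cov[0][1] / (post_cov[0][0] * post_cov[1][1]) ** 0.5


for rho in (0.0, 0.4, 0.8):
    print(f"prior rho = {rho:.1f}  posterior rho = {posterior_correlation(1.0, rho, 1.0):.3f}")
```

With a prior correlation of zero the posterior correlation stays at zero; with a positive prior correlation it is diluted by the polls but does not vanish.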

And, as we discussed earlier, if your between-state correlations are too low, and at the same time you’re aiming for a realistic uncertainty in the national level, then you’re gonna end up with too much uncertainty for each individual state.

At some level, the Fivethirtyeight team must realize this—earlier this year, Nate Silver wrote that correlated errors are “where often *most* of the work is in modeling if you want your models to remotely resemble real-world conditions”—but recognizing the general principle is not the same thing as doing something reasonable in a live application.

These things happen

Again, assuming the above maps actually reflect the Fivethirtyeight forecast and they’re not just some sort of computer glitch, this does not mean that what they’re doing at that website is useless, nor does it mean that we’re “right” and they’re “wrong” in whatever other disagreements we might have (although I’m standing fast on the Carmelo Anthony thing). Everybody makes mistakes! We made mistakes in our forecast too (see item 3 in our earlier post)! Multivariate forecasting is harder than it looks. In our case, it helped that we had a team of 3 people staring at our model, but of course that didn’t stop us from making our mistakes the first time.

At the very least, maybe this will remind us all that knowing that a forecast is based on 40,000 simulations or 40,000,000 simulations or 40,000,000,000 simulations doesn’t really tell us anything until we know how the simulations are produced.

P.S. Again, the point here is not about the silly scenario in which Trump wins New Jersey while losing the other 49 states; rather, we can use problems in the predictive distribution to try to understand what went wrong with the forecasting procedure. Just as Kos did for me several years ago (go here and search on “Looking at specifics has worked”). When your procedure messes up, that’s good news in that it represents a learning opportunity.

P.P.S. A commenter informs us that Nate wrote something, not specifically addressing the maps shown here, but saying that these extreme results arose from a long-tailed error term in his simulation procedure. There’s some further discussion in comments. One relevant point here is that if you add independent state errors with a long-tailed distribution, this will induce lower correlation in the final distribution. See discussion in comments here and here.

## More on that Fivethirtyeight prediction that Biden might only get 42% of the vote in Florida

I’ve been chewing more on the above Florida forecast from Fivethirtyeight.

Their 95% interval for the election-day vote margin in Florida is something like [+16% Trump, +20% Biden], which corresponds to an approximate 95% interval of [42%, 60%] for Biden’s share of the two-party vote.
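The conversion from margin to share is simple arithmetic, sketched here in Python on the assumption that the margins are expressed as fractions of the two-party vote (`two_party_share` is a made-up helper name).

```python
def two_party_share(margin: float) -> float:
    """Convert a two-party margin (own share minus opponent's share) to own share."""
    return 0.5 + margin / 2.0


# Biden's margin runs from -16 points (Trump +16) to +20 points (Biden +20).
low, high = two_party_share(-0.16), two_party_share(0.20)
print(f"Biden two-party share: [{low:.0%}, {high:.0%}]")
```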

This is buggin me because it’s really hard for me to picture Biden only getting 42% of the vote in Florida.

By comparison, our Economist forecast gives a 95% interval of [47%, 58%] for Biden’s Florida vote share.

Is there really a serious chance that Biden gets only 42% of the vote in Florida?

Let’s look at this in a few ways:

1. Where did the Fivethirtyeight interval come from?

2. From 95% intervals to 50% intervals.

3. Using weird predictions to discover problems with your model.

4. Vote intentions vs. the ultimate official vote count.

1. Where did the Fivethirtyeight interval come from?

How did they get such a wide interval for Florida?

I think two things happened.

First, they made the national forecast wider. Biden has a clear lead in the polls and a lead in the fundamentals (poor economy and unpopular incumbent). Put that together and you give Biden a big lead in the forecast; for example, we give him a 90% chance of winning the electoral college. For understandable reasons, the Fivethirtyeight team didn’t think Biden’s chances of winning were so high. I disagree on this—I’ll stand by our forecast—but I can see where they’re coming from. After all, this is kind of a replay of 2016 when Trump did win the electoral college, also he has the advantages of incumbency, for all that’s worth. You can lower Biden’s win probability by lowering his expected vote—you can’t do much with the polls, but you can choose a fundamentals model that forecasts less than 54% for the challenger—and you can widen the interval. Part of what Fivethirtyeight did is widen their intervals, and when you widen the interval for the national vote, this will also widen your interval for individual states.

Second, I suspect they screwed up a bit in their model of correlation between states. I can’t be sure of this—I couldn’t find a full description of their forecasting method anywhere—but I’m guessing that the correlation of uncertainties between states is too low. Why do I say this? Because the lower the correlation between states, the more uncertainty you need for each individual state forecast to get a desired national uncertainty.
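To put rough numbers on that last sentence, here is a Python sketch under a deliberately crude assumption: 50 equally weighted states with a common standard deviation and a common pairwise correlation `rho`. Then the variance of the national average is the state variance times (rho + (1 - rho)/50), which can be inverted for the state-level uncertainty. The 2-point national standard deviation is a made-up input, not a number from either forecast.

```python
import math


def state_sd_needed(national_sd: float, rho: float, n_states: int = 50) -> float:
    """State-level sd implied by a target national sd under equal weights and equicorrelation."""
    return national_sd / math.sqrt(rho + (1.0 - rho) / n_states)


for rho in (0.9, 0.7, 0.5, 0.3):
    print(f"rho = {rho:.1f}  state sd needed = {state_sd_needed(0.02, rho):.4f}")
```

As rho drops, the state-level standard deviation needed to hold the national uncertainty fixed goes up, which is the mechanism suspected here.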

Also, setting up between-state uncertainties is tricky. I know this because Elliott, Merlin, and I struggled when setting up our own model, which indeed is a bit of a kluge when it comes to that bit.

Alternatively, you could argue that [42%, 60%] is just fine as a 95% interval for Biden’s Florida vote share (I’ll get back to that in a bit). But if you feel, as we do, that this 42% is too low to be plausible, then the above two model features, an expanded national uncertainty and too-low between-state correlations, are one way that Fivethirtyeight could’ve ended up there.

2. From 95% intervals to 50% intervals.

95% intervals are hard to calibrate. If all is good with your modeling, your 95% intervals will be wrong only 1 time in 20. To put it another way, you’d expect only 50 such mispredicted state-level events in 80 years of national elections. So you might say that the interval for Florida should be super-wide. This doesn’t answer the question of how wide: should the lower bound of that interval be 47% (as we have it), or 42% (as per 538), or maybe 37%???—but it does tell us that it’s hard to think about such intervals.

It’s easier to think about 50% intervals, and, fortunately, we can read these off the above graphic too. The 50% prediction interval for Florida is roughly (+4% Trump, +8% Biden), i.e. (0.48, 0.54) for Biden’s two-party vote share.

Given that Biden’s currently at 52% in the polls in Florida (and at 55% in national polls, so it’s not like the Florida polls are some kind of fluke), I don’t really buy the (0.48, 0.54) interval.

To put it another way, I think there’s less than a 1-in-4 probability that Biden gets less than 48% of the two-party vote in Florida. This is not to say I think Biden is certain to win, just that I think the Fivethirtyeight interval is too wide. I already thought this about the 95% interval, and I think this about the 50% interval too.

That’s just my take (and the take of our statistical model). The Fivethirtyeight team is under no obligation to spit out numbers that are consistent with my view of the race. I’m just explaining where I’m coming from.

In their defense, back in 2016, some of the polls were biased. Indeed, back in September of that year, the New York Times gave data from a Florida poll to Sam Corbett-Davies, David Rothschild, and me. We estimated Trump with a 1% lead in the state—even while the Times and three other pollsters (one Republican, one Democratic, and one nonpartisan) all pointed toward Clinton, giving her a lead of between 1 and 4 points.

In that case, we adjusted the raw poll data for party registration, the other pollsters didn’t, and that explains why they were off. If the current Florida polls are off in the same way, then that would explain the Fivethirtyeight forecast. But (a) I have no reason to think the current polls are off in this way, and one reason I have this assurance is that our model does allow for bias in polls that don’t adjust for partisanship of respondents, and (b) I don’t think Fivethirtyeight attempts this bias correction; it’s my impression that they take the state poll toplines as is. Again, I do think they widen their intervals, but I think that leads to unrealistic possibilities in their forecast distribution, which is how I led off this post.

3. Using weird predictions to discover problems with your model.

Weird predictions can be a good way of finding problems with your model. We discussed this in our post the other day: go here and scroll down to “Making predictions, seeing where they look implausible, and using this to improve our modeling.” As I wrote, it’s happened to me many times that I’ve fit a model that seemed reasonable, but then some of its predictions didn’t quite make sense, and I used this disconnect to motivate a careful look at the model, followed by a retooling.

Indeed, this happened to us just a month ago! It started when Nate Silver and others questioned the narrow forecast intervals of our election forecasting model—at the time, we were giving Biden a 99% chance of winning more than half the national vote. Actually, we’d been wrestling with this ourselves, but the outside criticism motivated us to go in and think more carefully about it. We looked at our model and found some bugs in the code! and some other places where the model could be improved. And we even did some work on our between-state covariance matrix.

We could tell when looking into this that the changes in our model would not have huge effects (of course they wouldn’t, given that we’d carefully tested our earlier model on 2008, 2012, and 2016), so we kept up our old model while we fixed up the new one, and then after about a week we were ready and we released the improved model (go here and scroll down to “Updated August 5th, 2020”).

4. Vote intentions vs. the ultimate official vote count.

I was talking with someone about my doubts regarding a forecast that allowed Biden to get only 42% of the vote in Florida, and I got the following response:

Your model may be better than Nate’s in using historical and polling data. But historical and polling data don’t help you much when one of the parties has transformed into a cult of personality that will go the extra mile to suppress opposing votes.

I responded:

How does cult of personality get to Trump winning 58% of votes in Florida?

He responded:

Proposition: Vote-suppression act X is de-facto legal and constitutional as long as SCOTUS doesn’t enforce an injunction against act X.

This made me realize that in talking about the election, we should distinguish between two things:

1. Vote intentions. The total number of votes for each candidate, if everyone who wants to vote gets to vote and if all these votes are counted.

2. The official vote count. Whatever that is, after some people decide not to vote because the usual polling places are closed and the new polling places are too crowded, or because they planned to vote absentee but their ballots arrived too late (this happened to me on primary day this year!), or because they followed all the rules and voted absentee but then the post office didn’t postmark their votes, or because their ballot is ruled invalid for some reason, or whatever.

Both these vote counts matter. Vote intentions matter, and the official vote count matters. Indeed, if they differ by enough, we could have a constitutional crisis.

But here’s the point. Poll-aggregation procedures such as Fivethirtyeight’s and ours at the Economist are entirely forecasting vote intentions. Polls are vote intentions, and any validation of these models is based on past elections, where sure there have been some gaps between vote intentions and the official vote count (notably Florida in 2000), but nothing like what it would take to get a candidate’s vote share from, say, 47% down to 42%.

When Nate Silver says, “this year’s uncertainty is about average, which means that the historical accuracy of polls in past campaigns is a reasonably good guide to how accurate they are this year,” he’s talking about vote intentions, not about potential irregularities in the vote count.

If you want to model the possible effects of vote suppression, that can make sense—here’s Elliott Morris’s analysis, which I haven’t looked at in detail myself—but we should be clear that this is separate from, or in addition to, poll aggregation.

Summary

I think that [42%, 60%] is way too wide as a 95% interval for Biden’s share of the two-party vote in Florida, and I suspect that Fivethirtyeight ended up with this super-wide interval because they messed up with their correlation model.

A naive take on this might be that the super-wide interval could be plausible because maybe some huge percentage of mail-in ballots will be invalidated, but, if so, this isn’t in the Fivethirtyeight procedure (or in our Economist model), as these forecasts are based on poll aggregation and are validated based on past elections which have not had massive voting irregularities. If you’re concerned about problems with the vote count, this is maybe worth being concerned about, but it’s a completely separate issue from how to aggregate polls and fundamentals-based forecasts.

P.S. A correspondent pointed me to this summary of betting odds, which suggests that the bettors see the race as a 50/50 tossup. I’ve talked earlier about my skepticism regarding betting odds; still, 50/50 is a big difference from anything you’d expect from the polls or the economic and political fundamentals. I think a lot of this 50% for Trump is coming from some assessed probability of irregularities in vote counting. If the election is disputed, I have no idea how these betting services will decide who gets paid off.

Or you could disagree with me entirely and say that Trump has a legit chance at 58% of the two-party vote preference in Florida come election day. Then you’d have a different model than we have.