Oh, I hate it when work is criticized (or, in this case, fails in attempted replications) and then the original researchers don’t even consider the possibility that maybe in their original work they were inadvertently just finding patterns in noise.

I have a sad story for you today.

Jason Collins tells it:

In The (Honest) Truth About Dishonesty, Dan Ariely describes an experiment to determine how much people cheat . . . The question then becomes how to reduce cheating. Ariely describes one idea:

We took a group of 450 participants and split them into two groups. We asked half of them to try to recall the Ten Commandments and then tempted them to cheat on our matrix task. We asked the other half to try to recall ten books they had read in high school before setting them loose on the matrices and the opportunity to cheat. Among the group who recalled the ten books, we saw the typical widespread but moderate cheating. On the other hand, in the group that was asked to recall the Ten Commandments, we observed no cheating whatsoever.

Sounds pretty impressive! But these things all sound impressive when described at some distance from the data.

Anyway, Collins continues:

This experiment has now been subject to a multi-lab replication by Verschuere and friends. The abstract of the paper:

. . . Mazar, Amir, and Ariely (2008; Experiment 1) gave participants an opportunity and incentive to cheat on a problem-solving task. Prior to that task, participants either recalled the 10 Commandments (a moral reminder) or recalled 10 books they had read in high school (a neutral task). Consistent with the self-concept maintenance theory . . . moral reminders reduced cheating. The Mazar et al. (2008) paper is among the most cited papers in deception research, but it has not been replicated directly. This Registered Replication Report describes the aggregated result of 25 direct replications (total n = 5786), all of which followed the same pre-registered protocol. . . .

And what happened? It’s in the graph above (from Verschuere et al., via Collins). The average estimated effect was tiny, it was not conventionally “statistically significant” (that is, the 95% interval included zero), and it “was numerically in the opposite direction of the original study.”

As is typically the case, I’m not gonna stand here and say I think the treatment had no effect. Rather, I’m guessing it has an effect which is sometimes positive and sometimes negative; it will depend on person and situation. There doesn’t seem to be any large and consistent effect, that’s for sure. Which maybe shouldn’t surprise us. After all, if the original finding was truly a surprise, then we should be able to return to our original state of mind, when we did not expect this very small intervention to have such a large and consistent effect.

I promised you a sad story. But, so far, this is just one more story of a hyped claim that didn’t stand up to the rigors of science. And I can’t hold it against the researchers that they hyped it: if the claim had held up, it would’ve been an interesting and perhaps important finding, well worth hyping.

No, the sad part comes next.

Collins reports:

Multi-lab experiments like this are fantastic. There’s little ambiguity about the result.

That said, there is a response by Amir, Mazar and Ariely. Lots of fluff about context. No suggestion of “maybe there’s nothing here”.

You can read the response and judge for yourself. I think Collins’s report is accurate, and that’s what made me sad. These people care enough about this topic to conduct a study, write it up in a research article and then in a book—but they don’t seem to care enough to seriously entertain the possibility they were mistaken. It saddens me. Really, what’s the point of doing all this work if you’re not going to be open to learning?

(See this comment for further elaboration of these points.)

And there’s no need to think anything done in the first study was unethical at the time. Remember Clarke’s Law.

Another way of putting it is: Ariely’s book is called “The Honest Truth . . .” I assume Ariely was honest when writing this book; that is, he was expressing sincerely-held views. But honesty (and even transparency) are not enough. Honesty and transparency supply the conditions under which we can do good science, but we still need to perform good measurements and study consistent effects. The above-discussed study failed in part because of the old, old problem that they were using a between-person design to study within-person effects; see here and here. (See also this discussion from Thomas Lumley on a related issue.)

P.S. Collins links to the original article by Mazar, Amir, and Ariely. I guess that if I’d read it in 2008 when it appeared, I’d’ve believed all its claims too. A quick scan shows no obvious problems with the data or analyses. But there can be lots of forking paths and unwittingly opportunistic behavior in data processing and analysis; recall the 50 Shades of Gray paper (in which the researchers performed their own replication and learned that their original finding was not real) and its funhouse parody 64 Shades of Gray paper, whose authors appeared to take their data-driven hypothesizing all too seriously. The point is: it can look good, but don’t trust yourself; do the damn replication.

Time series of Democratic/Republican vote share in House elections

Yair prepared this graph of average district vote (imputing open seats at 75%/25%; see here for further discussion of this issue) for each House election year since 1976:

Decades of Democratic dominance persisted through 1992; since then the two parties have been about even.

As has been widely reported, a mixture of geographic factors and gerrymandering has given Republicans the edge in House seats in recent years (most notably in 2012, when they retained control even after losing the national vote), but if you look at aggregate votes it’s been a pretty even split.

The above graph also shows that the swing in 2018 was pretty big: not as large as the historic swings in 1994 and 2010, but about the same as the Democratic gains in 2006 and larger than any other swing in the past forty years.

See here and here for more on what happened in 2018.

“Do you have any recommendations for useful priors when datasets are small?”

A statistician who works in the pharmaceutical industry writes:

I just read your paper (with Dan Simpson and Mike Betancourt) “The Prior Can Often Only Be Understood in the Context of the Likelihood” and I find it refreshing to read that “the practical utility of a prior distribution within a given analysis then depends critically on both how it interacts with the assumed probability model for the data in the context of the actual data that are observed.” I also welcome your comment about the importance of “data generating mechanism” because, for me, it is akin to selecting the “appropriate” distribution for a given response. I always make the point to the people I’m working with that we need to consider the clinical, scientific, physical and engineering principles governing the underlying phenomenon that generates the data; e.g., forces are positive quantities, particles are counts, yield is bounded between 0 and 1.

You also talk about the “big data, small signal revolution.” In industry, however, we face the opposite problem, our datasets are usually quite small. We may have a new product, for which we want to make some claims, and we may have only 4 observations. I do not consider myself a Bayesian, but I do believe that Bayesian methods can be very helpful in industrial situations. I also read your Prior Choice Recommendations [see also discussion here — AG] but did not find anything specific about small sample sizes. Do you have any recommendations for useful priors when datasets are small?

My reply:

When datasets are small, and when data are noisier, that’s when priors are more important. When in doubt, I think the way to explore anything in statistics, including priors, is through fake data simulation, which in this case will give you a sense of what is implied, in terms of potential patterns in data, from any particular set of prior assumptions. Typically we set priors to be too weak, and this can be seen in replicated data that include extreme and implausible results.
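Here is a minimal sketch of that kind of fake-data simulation, using only the Python standard library. Everything specific in it is an assumption made for illustration: a normal data model with known sigma, n = 4 observations, prior scales of 100 and 1 for the mean, and the helper names `simulate_prior_predictive` and `central_range`. None of it comes from my correspondent's actual problem.

```python
import random
import statistics

def simulate_prior_predictive(prior_sd, n_obs=4, sigma=1.0, n_sims=2000, seed=1):
    """Draw fake datasets: mu ~ Normal(0, prior_sd), then y_i ~ Normal(mu, sigma).

    Returns the sample mean of each simulated dataset.
    """
    rng = random.Random(seed)
    sample_means = []
    for _ in range(n_sims):
        mu = rng.gauss(0.0, prior_sd)                      # draw a parameter from the prior
        y = [rng.gauss(mu, sigma) for _ in range(n_obs)]   # draw a small fake dataset
        sample_means.append(statistics.mean(y))
    return sample_means

def central_range(values, lo=0.025, hi=0.975):
    """Approximate central 95% range of the simulated values."""
    s = sorted(values)
    return s[int(lo * len(s))], s[int(hi * len(s)) - 1]

# Hypothetical prior scales: one very weak, one moderately informative
weak = simulate_prior_predictive(prior_sd=100.0)
informative = simulate_prior_predictive(prior_sd=1.0)

print("weak prior, 95% range of fake-data means:", central_range(weak))
print("informative prior, 95% range of fake-data means:", central_range(informative))
```

If the range of fake data under the weak prior includes values that would be physically or clinically absurd for the problem at hand, that is a sign the prior is too weak.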

Prior distributions for covariance matrices

Someone sent me a question regarding the inverse-Wishart prior distribution for a covariance matrix, as it is the default in some software he was using. The inverse-Wishart does not make sense as a prior distribution; it has problems because the shape and scale are tangled. See this paper, “Visualizing Distributions of Covariance Matrices,” by Tomoki Tokuda, Ben Goodrich, Iven Van Mechelen, Francis Tuerlinckx and myself. Right now I’d use the LKJ family. In Stan there are lots of options. See also our wiki on prior distributions.
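One way to look at the tangling of scale and correlation is to simulate from the prior. The following is a rough sketch, not the method of the paper: it assumes a 2x2 covariance matrix, an identity scale matrix, and 3 degrees of freedom (illustrative choices, not the defaults of any particular software), and it uses only the Python standard library. The function name `draw_inverse_wishart_2x2` is made up for this example.

```python
import math
import random

def draw_inverse_wishart_2x2(nu, rng):
    """One draw from inverse-Wishart(nu, I) in 2 dimensions (integer nu >= 2).

    A Wishart(nu, I) draw is built as a sum of nu outer products z z^T with
    z ~ Normal(0, I); inverting it gives the inverse-Wishart draw.
    """
    a = b = c = 0.0  # W = [[a, b], [b, c]]
    for _ in range(nu):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        a += z1 * z1
        b += z1 * z2
        c += z2 * z2
    det = a * c - b * b
    # Closed-form inverse of a symmetric 2x2 matrix
    return c / det, -b / det, a / det  # (var1, cov12, var2)

rng = random.Random(0)
draws = []
for _ in range(5000):
    v1, cov, v2 = draw_inverse_wishart_2x2(nu=3, rng=rng)
    rho = cov / math.sqrt(v1 * v2)  # implied correlation
    draws.append((v1, v2, rho))

# Compare |rho| among draws with the smallest versus largest first variance
draws.sort(key=lambda d: d[0])
quarter = len(draws) // 4
mean_abs_rho_small = sum(abs(d[2]) for d in draws[:quarter]) / quarter
mean_abs_rho_large = sum(abs(d[2]) for d in draws[-quarter:]) / quarter

print("mean |rho|, smallest-variance quarter:", round(mean_abs_rho_small, 3))
print("mean |rho|, largest-variance quarter:", round(mean_abs_rho_large, 3))
```

If the two printed averages differ a lot, then the prior is saying something about correlations whenever it says something about variances. The LKJ approach avoids this by putting the prior on the correlation matrix separately from the scales.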

Should we be concerned about MRP estimates being used in later analyses? Maybe. I recommend checking using fake-data simulation.

Someone sent in a question (see below). I asked if I could post the question and my reply on the blog, and the person responded:

Absolutely, but please withhold my name because this is becoming a touchy issue within my department.

The boldface was in the original.

I get this a lot. There seems to be a lot of fear out there when it comes to questioning established procedures.

Anyway, here’s the question that the person sent in:

CDC has recently been using your multilevel estimation with post-stratification method to produce county, city, and census tract-level disease prevalence estimates (see https://www.cdc.gov/500cities/). The data source is the annual phone-based Behavioral Risk Factor Surveillance System (n=450k). CDC is not transparent about covariates included in the models used to construct the estimates, but as I understand it they are mostly driven by national individual-level associations between sociodemographic factors and disease prevalence. Presumably, the random effects would not influence a unit’s estimated prevalence much if the sample size from that unit is small (as is true for most cities/counties, and for many census tracts the sample size is zero).

I am wondering if you are as troubled as I am by how these estimates are being used. First, websites like County Health Rankings and City Health Dashboard are providing these estimates to the public without any disclaimer that these are not actually random samples of cities/counties/tracts and may not reflect reality. Second, and more problematically, researchers are starting to conduct ecologic studies that analyze the association, for example, between census tract socioeconomic composition and obesity prevalence (It seems quite likely that the study is actually just identifying the individual-level association between income and obesity used to produce the estimates).

I’ve now become involved in a couple of projects that are trying to analyze these estimates so it seems as though their use will increase over time. The only disclaimer that CDC provides is that the estimates shouldn’t be used to evaluate policy.

Are you more confident about the use of these estimates than I am? I am also wondering if CDC should be more explicit in disclosing their limitations to prevent misuse.

My reply:

Wow, N = 450K. That’s quite a survey. (I know my correspondent called it “n,” but when it’s this big, I think the capital letter is warranted.) And here’s the page where they mention Mister P! And they have a web interface.

I’m not quite sure why you say the website provides the estimate “without any disclaimer.” Here’s one of the displays:

It’s not the prettiest graph in the world—I’ll grant you that—but it’s clearly labeled “Model-based estimates” right at the top.

I agree with you, though, in your concern that if these model-based estimates are being used in later analyses, there’s a risk of reification, in which county or city-level predictors that are used in the model can look automatically like good predictors of the outcomes. I’d guess this would be more of a concern with rare conditions than with something like coronary heart disease where the sample size will be (unfortunately) so large.

The right thing to do next, I think, is some fake-data simulation to see how much this should be a concern. CDC has already done some checking (from their methodology page, “CDC’s internal and external validation studies confirm the strong consistency between MRP model-based SAEs and direct BRFSS survey estimates at both state and county levels.”) and I guess you could do more.

Overall, I’m positively inclined toward these MRP estimates because I’d guess they’re much better than the alternatives such as raw or weighted local averages or some sort of postprocessed analysis of weighted averages. I think those approaches would have lots more problems.

In any case, it’s cool to see my method being used by people who’ve never met me! Mister P is all grown up.

P.S. My correspondent provides further background:

The CDC generates prevalence estimates for various diseases at the county level (or smaller) by applying MRP to the national Behavioral Risk Factor Surveillance System. Unlike for other diseases, they’ve documented their methods for diabetes. Their model defines 12 population strata per county (2 races x 2 genders x 3 age groups) and incorporates random effects for stratum, county, and state. There are no other variables at any level in the model.

A number of papers use the MRP-derived data to estimate associations between, for example, PM2.5 and diabetes prevalence. Do you think this is a valid approach? Would it be valid if all of the MRP covariates are included in the model?

My response:

1. Regarding the MRP model, it is what it is. Including more demographic factors is better, but adjusting for these 12 cells per county is better than not adjusting, I’d think. One thing I do recommend is to use group-level predictors. In this case, the group is county, and lots of county-level predictors will be available that will be relevant for predicting health outcomes.

2. Regarding the postprocessing using the MRP estimates: Sure, it should be better to fold the two models together, but the two-stage approach (first use MRP to estimate prevalences, then fit another model) could work ok too, with some loss of efficiency. Again, I’d recommend using fake-data simulation to estimate the statistical properties of this approach for the problem at hand.
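To give a sense of what such a fake-data simulation could look like, here is a deliberately crude sketch in plain Python. It is not MRP and not the CDC model: stage 1 is a simple stand-in that shrinks each unit's raw mean toward a fitted line in a single unit-level predictor x. All the numbers (2000 units, 10 respondents per unit, true slope 0.03, pooling constant 50) and the names `ols`, `corr`, and `pool_k` are assumptions chosen for illustration.

```python
import random

def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def corr(x, y):
    """Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2019)
n_units, n_per_unit, pool_k = 2000, 10, 50  # hypothetical sizes and pooling strength

# Fake truth: prevalence depends weakly on x, plus unit-level variation
x = [rng.gauss(0, 1) for _ in range(n_units)]
true_prev = [min(max(0.30 + 0.03 * xi + rng.gauss(0, 0.05), 0.01), 0.99) for xi in x]

# Fake survey: a small binomial sample in each unit
raw = [sum(rng.random() < p for _ in range(n_per_unit)) / n_per_unit for p in true_prev]

# Stage 1 (crude stand-in for MRP): shrink each raw mean toward a fitted line in x
a, b = ols(x, raw)
w = n_per_unit / (n_per_unit + pool_k)
est = [w * r + (1 - w) * (a + b * xi) for r, xi in zip(raw, x)]

# Stage 2: a later analysis that relates the model-based estimates to x
print("corr(x, true prevalence): ", round(corr(x, true_prev), 2))
print("corr(x, stage-1 estimate):", round(corr(x, est), 2))
print("slope on truth:", round(ols(x, true_prev)[1], 3),
      " slope on estimates:", round(ols(x, est)[1], 3))
```

In this particular setup the stage-2 slope equals the slope fitted to the raw data by construction, while the correlation between x and the estimates can exceed the correlation between x and the true prevalences, because the shrinkage strips out unit-level variation. Whether anything similar happens in a real application would depend on the actual model and sample sizes, which is the reason to run the simulation for the problem at hand.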

My footnote about global warming

At the beginning of my article, How to think scientifically about scientists’ proposals for fixing science, which we discussed yesterday, I wrote:

Science is in crisis. Any doubt about this status has surely been dispelled by the loud assurances to the contrary by various authority figures who are deeply invested in the current system . . . When leaders go to that much trouble to insist there is no problem, it’s only natural for outsiders to worry.

And at that point came a footnote, which I want to share with you here:

At this point a savvy critic might point to global-warming denialism and HIV/AIDS denialism as examples where the scientific consensus is to be trusted and where the dissidents are the crazies and the hacks. Without commenting on the specifics of these fields, I will just point out that the research leaders in those areas are not declaring a lack of crisis—far from it!—nor are they shilling for their “patterns of discovery.” Rather, the leaders in these fields have been raising the alarm for decades and have been actively pointing out inconsistencies in their theories and gaps in their understanding. Thus, I do not think that my recommendation to watch out when the experts tell you to calm down, implies blanket support for dissidents in all areas of science. One’s attitude toward dissidents should depend a bit on the openness to inquiry of the establishments from which they are dissenting.

Latour Sokal NYT

Alan Sokal writes:

I don’t know whether you saw the NYT Magazine’s fawning profile of sociologist of science Bruno Latour about a month ago.

I wrote to the author, and later to the editor, to critique the gross lack of balance (and even of the most minimal fact-checking). No reply. So I posted my critique on my webpage.

From that linked page from Sokal:

The basic trouble with much of Latour’s writings—as with those of some other sociologists and philosophers of a “social constructivist” bent—is that (as Jean Bricmont and I [Sokal] pointed out already in 1997)

these texts are often ambiguous and can be read in at least two distinct ways: a “moderate” reading, which leads to claims that are either worth discussing or else true but trivial; and a “radical” reading, which leads to claims that are surprising but false. Unfortunately, the radical interpretation is often taken not only as the “correct” interpretation of the original text but also as a well-established fact (“X has shown that …”) . . .

numerous ambiguous texts that can be interpreted in two different ways: as an assertion that is true but relatively banal, or as one that is radical but manifestly false. And we cannot help thinking that, in many cases, these ambiguities are deliberate. Indeed, they offer a great advantage in intellectual battles: the radical interpretation can serve to attract relatively inexperienced listeners or readers; and if the absurdity of this version is exposed, the author can always defend himself by claiming to have been misunderstood, and retreat to the innocuous interpretation.

Sokal offers a specific example.

First, he quotes the NYT reporter who wrote:

When [Latour] presented his early findings at the first meeting of the newly established Society for Social Studies of Science, in 1976, many of his colleagues were taken aback by a series of black-and-white photographic slides depicting scientists on the job, as though they were chimpanzees. It was felt that scientists were the only ones who could speak with authority on behalf of science; there was something blasphemous about subjecting the discipline, supposedly the apex of modern society, to the kind of cold scrutiny that anthropologists traditionally reserved for “premodern” peoples.

Sokal responds:

In reality, it beggars belief to imagine that sociologists of science—whose entire raison d’être is precisely to subject the social practice of science to “cold scrutiny”—could possibly think that “scientists were the only ones who could speak with authority on behalf of science”. Did you bother to seek confirmation of this self-serving claim from anyone present at that 1976 meeting, other than Latour himself?

Sokal continues in his letter to the NYT reporter:

In the same way, you faithfully reproduce Latour’s ambiguities concerning the notion of “fact”:

It had long been taken for granted, for example, that scientific facts and entities, like cells and quarks and prions, existed “out there” in the world before they were discovered by scientists. Latour turned this notion on its head. In a series of controversial books in the 1970s and 1980s, he argued that scientific facts should instead be seen as a product of scientific inquiry. …

In your article you take for granted that Latour’s view is correct: indeed, a few paragraphs later you say that Latour showed “that scientific facts are the product of all-too-human procedures”. But, like Latour, you never explain in what sense the traditional view—that cells and quarks and prions existed “out there” in the world before they were discovered by scientists—is mistaken.

I’m with Sokal: Scientific facts are real. Their discovery, expression, and (all too often) misrepresentation are the product of human procedures, but the facts and entities exist.

As Sokal discusses, the whole thing is slippery, as can be seen even in the brief discussion excerpted above. If you give Latour’s statements a minimalist interpretation—the concepts of “cells,” “quarks,” etc. are human-constructed—there’s really no problem. Yes, the phenomena described by our concepts of cells, quarks, etc. are real and would exist even if humans had never appeared on the Earth, but one could imagine completely different ways of expressing and formulating models for these scientific facts, in forms that might look nothing like “cells” and “quarks.” Just as one can, for example, express classical mechanics with or without the concept of “force.”

And, of course, if you want to go further, there’s lots of apparent scientific facts that, it seems, are simply human-created mistakes: I’m thinking here of examples such as recent studies of ESP, himmicanes, air rage, beauty and sex ratio, etc.

So Latour’s general perspective is valuable. But Sokal argues, convincingly to me, that much of the reading of Latour, including in that news article, takes the strong view, what might be called the postmodern view, which throws the baby of replicable science out with the bathwater of contingent theories.

Sokal writes:

If Latour had really shown that scientific facts are the product of all-too-human procedures, then the critics’ charge would be unfair. But in reality Latour had not shown anything of the sort; he had simply asserted it, and many others (not cited by you) had criticized those assertions. Of course, it goes without saying that scientists’ beliefs (and assertions of alleged fact) about the external world are the product of all-too-human procedures — that is true and utterly banal. But Latour’s claims are nothing more than deliberate confusion between two senses of the word “fact” (namely, the usual one and his own idiosyncratic one). . . . muddying the distinction between facts and assertions of fact undermines our ability to think clearly about this crucial psychological/sociological/political problem.

Sokal continues with his correspondence with the New York Times (they eventually replied after he sent them several emails).

Just to be clear here, I don’t think there are any villains in this story.

Latour has a goofy view of science, and I agree with Sokal that his (Latour’s) expressions of his ideas are a bit slippery. But, hey, Latour is entitled to express his views, and you gotta give him credit for being influential. Latour’s successes must in some part be a consequence of previous gaps or at least underemphasized points in discussions of science.

The author of the NYT article, Ava Kofman, found a good story and ran with it. I agree with Sokal that she missed the point or, to put it another way, that while she might well be doing a good job telling the story of Latour, she’s not doing a good job telling the story of Latour’s ideas. But that’s not quite her job: even if, as the saying goes, Latour’s work “contains much that is original and much that is correct; unfortunately that which is correct is not original, and that which is original is not correct,” Kofman is not really writing about this; she’s writing more about Latour’s influence.

The ironic thing, though, is that Kofman’s article is following the standard template of feature stories about a scientist or academic, which is to treat him as a hero. If there’s one idea that Latour stands for, it’s that scientists are part of a social process, and it misses the point to routinely treat them as misunderstood geniuses.

Anyway, although I share Sokal’s annoyance that the author of an article on Latour missed key aspects of Latour’s ideas and then didn’t even reply to his thoughtful criticism, I can understand why the reporter wants to move on to her next project. In my experience, journalists are more forward-looking than academics: we worry about our past errors, they just move on. It’s a different style, perhaps deriving from the difference between traditional publication in bound volumes and publication in fishwrap.

Finally, perhaps there’s not much the NYT editors can do at this point. Newspapers, and for that matter scientific journals, rarely run corrections even of clear factual errors—at least, that’s been my experience. So I can’t blame them too much for following common practice.

Ultimately, this all comes down to questions of emphasis and interpretation. Latour has, for better or worse, expressed ideas that have been influential in the sociology of science; his story is interesting and worth a magazine article; writing a story with Latour as hero leads to some confusion about what is understood by others in that field. In that sense it’s not so different from a story in the sports or business pages that presents a contest from one side. That’s a journalistic convention, and that’s fine, and it’s also fine for someone such as Sokal who has a different perspective (one that I happen to agree with) to share that too.

As Sokal puts it:

The ironic thing is that Latour has spent his life decrying (and rightly so) the scientist-as-hero approach to presenting science to the general public; but here is an article that takes an extreme version of the same approach, albeit applied to a sociologist/philosopher rather than a scientist.

A newspaper or magazine article about a thinker should not merely be a fawning and uncritical celebration of his brilliance; it should also discuss his ideas. Indeed, this article does purport to explain and discuss Latour’s ideas, not just his personal story; but it does so in a completely uncritical way, not even letting on that there might be people who have cogent critiques of his ideas. That, it seems to me, is a gross failure of balance—and more importantly, a gross abdication of the newspaper’s mission to inform its readers about important subjects. (In this case, a subject that has serious real-world consequences.) Not to mention the gross lack of elementary fact-checking that I pointed out.

Of course, one could also question whether the “hero” mode of writing is appropriate even on the sports or business pages. This mode of writing presents a contest from one side only; and it is not very often the case in sports or business that there is in fact only one side.

So, yeah, the NYT article was not so bad as feature articles go—it told an engaging story from one particular perspective—but there was an opportunity to do better. Hence Sokal’s post, and this post linking to it.

P.S. Hey, the name Bruno Latour rings a bell . . . Unfortunately, he didn’t make it out of the first round of our seminar speaker competition.

A parable regarding changing standards on the presentation of statistical evidence

Now, the P-value Sneetches
Had tables with stars.
The Bayesian Sneetches
Had none upon thars.

Those stars weren’t so big. They were really so small.
You might think such a thing wouldn’t matter at all.

But, because they had stars, all the P-value Sneetches
Would brag, “We’re the best kind of Sneetch on the Beaches.”
With their snoots in the air, they would sniff and they’d snort
“We’ll have nothing to do with the Bayesian sort!”
And whenever they met some, when they were out walking,
They’d hike right on past them without even talking.

When the P-value children went out to play ball,
Could a Bayesian get in the game… ? Not at all.
You only could play if your tables had stars
And the Bayesian children had none upon thars.

When the P-value Sneetches had frankfurter roasts
Or picnics or parties or PNAS toasts,
They never invited the Bayesian Sneetches.
They left them out cold, in the dark of the beaches.
They kept them away. Never let them come near.
And that’s how they treated them year after year.

Then ONE day, it seems… while the Bayesian Sneetches
Were moping and doping alone on the beaches,
Just sitting there wishing their tables had stars…
A stranger zipped up in the strangest of cars!

“My friends,” he announced in a voice clear and keen,
“My name is Savage McJeffreys McBean.
And I’ve heard of your troubles. I’ve heard you’re unhappy.
But I can fix that. I’m the Fix-it-Up Chappie.
I’ve come here to help you. I have what you need.
And my prices are low. And I work at great speed.
And my work is one hundred per cent guaranteed!”

Then, quickly Savage McJeffreys McBean
Put together a Bayes Factor machine.
And he said, “You want stars like a Star-Tabled Sneetch… ?
My friends, you can have them for three dollars each!”

“Just pay me your money and hop right aboard!”
So they clambered inside. Then the big machine roared
And it klonked. And it bonked. And it jerked. And it berked
And it bopped them about. But the thing really worked!
When the Bayesian Sneetches popped out, they had stars!
They actually did. They had stars upon thars!

Then they yelled at the ones who had stars at the start,
“We’re exactly like you! You can’t tell us apart.
We’re all just the same, now, you snooty old smarties!
And now we can go to your NPR parties.”

“Good grief!” groaned the ones who had stars at the first.
“We’re still the best Sneetches and they are the worst.
But, now, how in the world will we know,” they all frowned,
“If which kind is what, or the other way round?”

Then came McBean with a very sly wink.
And he said, “Things are not quite as bad as you think.
So you don’t know who’s who. That is perfectly true.
But come with me, friends. Do you know what I’ll do?
I’ll make you, again, the best Sneetches on beaches
And all it will cost you is ten dollars eaches.”

“P-value stars are no longer in style,” said McBean.
“What you need is a trip through my Replication Machine.
This wondrous contraption will take off your stars
So you won’t look like Sneetches who have them on thars.”
And that handy machine
Working very precisely
Removed all the stars from their tables quite nicely.

Then, with snoots in the air, they paraded about
And they opened their beaks and they let out a shout,
“We know who is who! Now there isn’t a doubt.
The best kind of Sneetches are Sneetches without!”

Then, of course, those with stars all got frightfully mad.
To be wearing a star now was frightfully bad.
Then, of course, old Savage McJeffreys McBean
Invited them into his Star-Off machine.

Then, of course from THEN on, as you probably guess,
Things really got into a horrible mess.
All the rest of that day, on those wild screaming beaches,
The fix-it-up Chappie kept fixing up Sneetches.
Off again! On Again!
In again! Out again!
Through the machines they raced round and about again,
Changing their stars every minute or two.
They kept paying money. They kept running through
Until neither the Plain nor the Star-Tables knew
Whether this one was that one… or that one was this one
Or which one was what one… or what one was who.

Then, when every last cent
Of their money was spent,
The Fix-it-Up Chappie packed up
And he went.

And he laughed as he drove
In his car up the beach,
“They never will learn.
No. You can’t teach a Sneetch!”

But McBean was quite wrong. I’m quite happy to say
That the Sneetches got really quite smart on that day,
The day they decided that Sneetches are Sneetches
And no kind of Sneetch is the best on the beaches.
That day, all the Sneetches forgot about stars
And whether they had one, or not, upon thars.

[Original is on the web, for example here. I was inspired to construct the above adaptation after thinking of the series of public advice I’ve given over the years regarding prior distributions: first we recommended uniform priors, then scaled-inverse-Wishart and Cauchy and half-Cauchy, now LKJ and normal and half-normal and horseshoe, and who knows what in the future. And I used to recommend p-values and now I don’t. It’s hard to keep up . . .]

Niall Ferguson and the perils of playing to your audience

History professor Niall Ferguson had another case of the sillies.

Back in 2012, in response to Stephen Marche’s suggestion that Ferguson was serving up political hackery because “he has to please corporations and high-net-worth individuals, the people who can pay 50 to 75K to hear him talk,” I wrote:

But I don’t think it’s just about the money. By now, Ferguson must have enough money to buy all the BMWs he could possibly want. To say that Ferguson needs another 50K is like saying that I need to publish in another scientific journal. No, I think what Ferguson is looking for (as am I, in my scholarly domain) is influence. He wants to make a difference. And one thing about being paid $50K is that you can assume that whoever is paying you really wants to hear what you have to say.

The paradox, though, as Marche notes, is that Ferguson gets and keeps the big-money audience not by telling them what he (Ferguson) wants to say, not by giving them his unique insights and understanding, but rather by telling his audience what they want to hear.

That’s what I called The Paradox of Influence.

But then, a year later, Ferguson went too far, even by his own standards, when during a talk to a bunch of richies he attributed Keynes’s economic views (I don’t actually know exactly what Keynesianism is, but I think a key part is for the government to run surpluses during economic booms and deficits during recessions) to Keynes being gay and marrying a ballerina and talking about poetry. The general idea, I think, is that people without kids don’t care so much about the future, and this motivated Keynes’s party-all-the-time attitude, which might have worked just fine for Eddie Murphy’s girl in the 1980s and in San Francisco bathhouses of the 1970s but, according to Ferguson, is not the ticket for preserving today’s American empire.

My theory on that one is not that Ferguson is a flaming homophobe or a shallow historical determinist (the expression is “piss-poor monocausal social science,” I believe) but rather that he misjudged his audience and threw them some academic frat-boy-style humor that he mistakenly thought they’d enjoy. He served them red meat, but the wrong red meat. Probably would’ve been better for him to have just preached the usual get-the-government-off-our-backs sermon and not tried to get cute by bringing up the whole ballerina thing.

Anyway, it happened again! Fergie made a fool of himself, just for trying to make some people happy.

Brian Contreras, Ada Statler, and Courtney Douglas (link from Jeet Heer via Mark Palko) report:

Leaked emails show Hoover academic conspiring with College Republicans to conduct ‘opposition research’ on student . . . “[The original Cardinal Conversations steering committee] should all be allies against O. Whatever your past differences, bury them. Unite against the SJWs. [Christos] Makridis [a fellow at Vox Clara, a Christian student publication] is especially good and will intimidate them,” Ferguson wrote. “Now we turn to the more subtle game of grinding them down on the committee. The price of liberty is eternal vigilance” . . . In the email chain, Ferguson wrote, “Some opposition research on Mr. O might also be worthwhile,” referring to Ocon.
Minshull wrote in response that he would “get on the opposition research for Mr. O.” Minshull is presently Ferguson’s research assistant . . .

It’s hard for me to imagine that Ferguson, globetrotting historian and media personality that he is, would really care so much about “grinding down” some students in a university committee. I’m guessing he was just trying to ingratiate himself with these youngsters, who I guess he views as the up-and-coming new generation of college politicians. Ferguson’s just the modern version of the stock figure, the middle-aged guy trying to talk groovy like the kids. “Some opposition research on Mr. O might also be worthwhile,” indeed. It’s the university-politics version of, ummm, I dunno, building a treehouse with some 12-year-olds, or playing hide-and-seek with a group of 4-year-olds.

The whole thing’s kinda sad in that Fergie seems so clueless. Even in the aftermath, he says, “I very much regret the publication of these emails. I also regret having written them.” Which is fine, but he still doesn’t seem to recognize the absurdity of the situation, a professor in his fifties playing student politics. As with his slurs of Keynes, the man is just a bit too eager to give his audience what he thinks they want to hear.

(pre-2000) academic historian
(2000-2005) propagandist for Anglo-American empire
(2010-2015) TV talking head and paid speaker for rich people
(2018) player in undergraduate campus politics.

At this point, he’s gotta be thinking: Could I have stopped somewhere along the way? Or was the whole trajectory inevitable? It’s a question of virtual history.

“Statistical insights into public opinion and politics” (my talk for the Columbia Data Science Society this Wed 9pm)

9pm in Fayerweather 310:

Why is it more rational to vote than to answer surveys (but it used to be the other way around)? How does this explain why we should stop overreacting to swings in the polls? How does modern polling work? What are the factors that predict election outcomes? What’s good and bad about political prediction markets? How do we measure political polarization, and what does it imply for our politics? We will discuss these and other issues in American politics and more generally how we can use data science to learn about the social world.

People can read the following articles ahead of time if they would like.

Short:
https://slate.com/news-and-politics/2018/11/midterms-blue-wave-statistics-data-analysis.html
https://www.slate.com/articles/news_and_politics/politics/2016/08/why_trump_clinton_won_t_be_a_landslide.html
https://slate.com/news-and-politics/2016/08/dont-be-fooled-by-clinton-trump-polling-bounces.html
https://www.slate.com/articles/news_and_politics/moneybox/2016/07/why_political_betting_markets_are_failing.html

Longer:
https://www.stat.columbia.edu/~gelman/research/published/what_learned_in_2016_5.pdf
https://www.stat.columbia.edu/~gelman/research/published/swingers.pdf

Bayes, statistics, and reproducibility: “Many serious problems with statistics in practice arise from Bayesian inference that is not Bayesian enough, or frequentist evaluation that is not frequentist enough, in both cases using replication distributions that do not make scientific sense or do not reflect the actual procedures being performed on the data.”

This is an abstract I wrote for a talk I didn’t end up giving. (The conference conflicted with something else I had to do that week.) But I thought it might interest some of you, so here it is:

Bayes, statistics, and reproducibility

The two central ideas in the foundations of statistics—Bayesian inference and frequentist evaluation—both are defined in terms of replications. For a Bayesian, the replication comes in the prior distribution, which represents possible parameter values under the set of problems to which a given model might be applied; for a frequentist, the replication comes in the reference set or sampling distribution of possible data that could be seen if the data collection process were repeated. Many serious problems with statistics in practice arise from Bayesian inference that is not Bayesian enough, or frequentist evaluation that is not frequentist enough, in both cases using replication distributions that do not make scientific sense or do not reflect the actual procedures being performed on the data. We consider the implications for the replication crisis in science and discuss how scientists can do better, both in data collection and in learning from the data they have.

P.S. I wrote the above abstract in January for a conference that ended up being scheduled for October. It is now June, and this post is scheduled for December. There’s no real rush, I guess; this topic is perennially of interest.

P.P.S. In writing Bayesian “inference” and frequentist “evaluation,” I’m following Rubin’s dictum that Bayes is one way among many to do inference and make predictions from data, and frequentism refers to any method of evaluating statistical procedures using their modeled long-run frequency properties. Thus, Bayes and freq are not competing, despite what you often hear. Rather, Bayes can be a useful way of coming up with statistical procedures, which you can then evaluate under various assumptions.

Both Bayes and freq are based on models. The model in Bayes is obvious: It’s the data model and the prior or population model for the parameters. The model in freq is what you use to get those long-run frequency properties. Frequentist statistics is not based on empirical frequencies: that’s called external validation. All the frequentist stuff—bias, variance, coverage, mean squared error, etc.—that all requires some model or reference set.

And that last paragraph is what I’m talkin bout, how Bayes and freq are two ways of looking at the same problem. After all, Bayesian inference has ideal frequency properties, as long as you do these evaluations averaging over the prior and data distributions you used in your model fitting. The frequency properties of Bayesian (or other) inference when the model is wrong (or, mathematically speaking, when you want to average over a joint distribution that’s not the same as the one in your inferential model) are another question entirely. That’s one thing that makes frequency evaluation interesting and challenging. If we knew all our models were correct, statistics would simply be a branch of probability theory, hence a branch of mathematics, and nothing more.
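To make the point about frequency properties concrete, here is a minimal simulation sketch in Python (standard library only). It assumes a toy normal-normal model that is not in the original discussion: a parameter drawn from a normal prior, one noisy observation, and a 50% posterior interval. The function name `coverage` and all numerical settings are hypothetical choices for illustration. When the parameter really is drawn from the prior used in the fit, the interval should cover about half the time; when the parameters come from a wider distribution than the model assumes, coverage should drop.

```python
import random
from statistics import NormalDist


def coverage(true_prior_sd, model_prior_sd=1.0, data_sd=1.0,
             level=0.5, n_sims=20000, seed=1):
    """Fraction of simulations in which the posterior interval contains theta.

    theta is drawn from N(0, true_prior_sd^2), but the inference assumes
    the prior N(0, model_prior_sd^2). Data: y ~ N(theta, data_sd^2).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Conjugate normal-normal posterior under the *assumed* prior.
    post_var = 1.0 / (1.0 / model_prior_sd**2 + 1.0 / data_sd**2)
    post_sd = post_var ** 0.5
    shrink = post_var / data_sd**2  # weight on the observation
    hits = 0
    for _ in range(n_sims):
        theta = rng.gauss(0.0, true_prior_sd)
        y = rng.gauss(theta, data_sd)
        post_mean = shrink * y
        if abs(theta - post_mean) <= z * post_sd:
            hits += 1
    return hits / n_sims


matched = coverage(true_prior_sd=1.0)     # averaging over the model's own prior
mismatched = coverage(true_prior_sd=3.0)  # averaging over a different distribution
print(f"coverage of 50% intervals, matched prior:    {matched:.3f}")
print(f"coverage of 50% intervals, mismatched prior: {mismatched:.3f}")
```

The first number is the "ideal frequency properties" case; the second is the "another question entirely" case, where the joint distribution being averaged over differs from the one in the inferential model.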

OK, that was kinda long for a P.P.S. It felt good to write it all down, though.

My talk tomorrow (Tues) noon at the Princeton University Psychology Department

Integrating collection, analysis, and interpretation of data in social and behavioral research

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

The replication crisis has made us increasingly aware of the flaws of conventional statistical reasoning based on hypothesis testing. The problem is not just a technical issue with p-values, nor can it be solved using preregistration or other purely procedural approaches. Rather, appropriate solutions have three aspects. First, in collecting your data there should be a concordance between theory and measurement: for example, in studying the effect of an intervention applied to individuals, you should measure within-person comparisons. Second, in analyzing your data, you should study all comparisons of potential interest, rather than selecting based on statistical significance or other inherently noisy measures. Third, you should interpret your results in the context of theory, background knowledge, and the data collection and analysis you have performed. We discuss these issues on a theoretical level and with examples in psychology, political science, and policy analysis.

Here are some relevant references:

Some natural solutions to the p-value communication problem—and why they won’t work.

Honesty and transparency are not enough.

The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

And this:

No guru, no method, no teacher, Just you and I and nature . . . in the garden. Of forking paths.

The talk will be Tuesday, December 4, 2018, 12:00pm, in A32 Peretsman Scully Hall.

In which I demonstrate my ignorance of world literature

Fred Buchanan, a student at Saint Anselm’s Abbey School, writes:

I’m writing a paper on the influence of Jorge Luis Borges in academia, in particular his work “The Garden of Forking Paths”. I noticed that a large number of papers from a wide array of academic fields include references to this work. Your paper, “The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time”, is one of these.

If you don’t mind, I would like to ask some questions about the work’s influence on you. Was your paper’s title directly influenced by the “The Garden of Forking Paths”? If the work directly influenced the title, what relation does the story have to the content of your paper? Since I’m not a statistician, I would appreciate it if you could explain this in layman’s terms. If you have read the story, what was your opinion of it? When did you first encounter the story? Did you expect other people in your field to recognize the reference? If they did, what was their reaction?

My reply:

Yes, the title and concept came straight from the Borges story. But I have not ever read the Borges story; I’ve only heard of it. Also, some people recognize the reference but many people do not. Indeed, when we published our article in the magazine American Scientist, the editors insisted on changing the title because they thought it was too obscure.

StanCon 2018 Helsinki talk slides, notebooks and code online

StanCon 2018 Helsinki talk slides, notebooks and code have been available for some time in the StanCon talks repository, but it seems we forgot to announce this. The StanCon 2018 Helsinki talk list also includes links to videos.

StanCon’s version of conference proceedings is a collection of contributed talks based on interactive notebooks. Every submission is peer reviewed by at least two reviewers. The reviewers are members of the Stan Conference Organizing Committee and the Stan Development Team. This repository contains all of the accepted notebooks as well as any supplementary materials required for building the notebooks. The slides presented at the conference are also included.

Thanks to all the presenters, and see you at StanCon 2019!

The p-value is 4.76×10^−264

Jerrod Anderson points us to Table 1 of this paper:

It seems that the null hypothesis that this particular group of men and this particular group of women are random samples from the same population, is false.

Good to know. For a moment there I was worried.

On the plus side, as Anderson notes, the paper includes distributional comparisons:

This is fine as a visualization, but I don’t think there’s much here beyond the means and variances. Seems a lot of space to devote to demonstrating that men, on average, are bigger than women. There’s other stuff in the paper as well, but my favorite is the p-value of 4.76×10^−264. I love that they have all these decimal places. Because 4×10^-264 wouldn’t be precise enuf. That’s even worse—actually, a lot worse—than this example.

Stephen Wolfram explains neural nets

It’s easy to laugh at Stephen Wolfram, and I don’t like some of his business practices, but he’s an excellent writer and is full of interesting ideas. This long introduction to neural network prediction algorithms is an example. I have no idea if Wolfram wrote this book chapter himself or if he hired one of his paid theorem-provers to do it—I guess it’s probably some sort of collaboration—but it doesn’t really matter. It all looks really cool.

“And when you did you weren’t much use, you didn’t even know what a peptide was”

Last year we discussed the story of an article, “Variation in the β-endorphin, oxytocin, and dopamine receptor genes is associated with different dimensions of human sociality,” published in PNAS that, notoriously, misidentified what a peptide was, among other problems.

Recently I learned of a letter published in PNAS by Patrick Jern, Karin Verweij, Fiona Barlow, and Brendan Zietsch, with the no-fooling-around title, “Reported associations between receptor genes and human sociality are explained by methodological errors and do not replicate.”

And here’s the response by one of the authors, Robin Dunbar, entitled not “Sorry, we got it wrong” but “On asking the right questions.”

Too bad they couldn’t simply admit they made an error, stating clearly and without equivocation that their original conclusions were not substantiated. On the plus side, they weren’t as rude as these authors.

P.S. The other thing in that post was that I suggested to PNAS that they change their slogan from “PNAS publishes only the highest quality scientific research” to “PNAS aims to publish only the highest quality scientific research.” And they did it! So cool.

Multilevel models for multiple comparisons! Varying treatment effects!

Mark White writes:

I have a question regarding using multilevel models for multiple comparisons, per your 2012 paper and many blog posts. I am in a situation where I do randomized experiments, and I have a lot of additional demographic information about people, as well. For the moment, let us just assume that all of these are categorical demographic variables. I want to know not only if there is an effect of the treatment over the control, but also for which groups there is an effect (positive or negative). I never get too granular, but I do look at an intersection between two variables (e.g., Black men, younger married people, Republican women) as well as just within one variable (e.g., women, Republicans, married people).

The issue I’m running into is that I want to look at the effects for all of these groups, but I don’t want to get mired down by Type I error and go chasing noise. (I know you reject the Type I error paradigm because a null of precisely zero is a straw-man argument, but clients and other stakeholders still want to be sure we aren’t reading too much into something that is not there.)

In the machine learning literature, there is a growing interest in causal inference and now a whole topic called “heterogeneous treatment effects.” In the general linear model world in which I was taught as a psychologist, this could also just be called “looking for interactions.” Many of these methods are promising, but I’m finding them difficult to implement in my scenario (I wrote a question here https://stats.stackexchange.com/questions/341402/a-few-questions-regarding-the-practice-of-heterogeneous-treatment-effect-analysi and posed a tailored question about one package to package creators directly here https://github.com/swager/grf/issues/238).

Turning back to multilevel models, it seems like I could do this in that framework. Basically, I just create a non-nested/crossed/whatever you’d like to call it model where people are nested in k groups, where k refers to how many demographic variables I have. I simulated data and fit a model here: https://gist.github.com/markhwhiteii/592d40f93b052663f240125fc9b8db99

The questions I have for you are the questions I pose at the bottom of that R script at the GitHub code snippet:

1. Is this a reasonable approach to examine “heterogeneous treatment effects” without getting bogged down by Type I error and multiple comparison problems?

2. If it is, how can I get confidence intervals from the fitted model object using glmer? You all do so in the 2012 paper, I believe.

3. More importantly, how can I look at the intersection between two groups? The code I sent in that GitHub snippet looks at effects for men, women, Blacks, Whites, millennials, etc. But I coded in an effect for Black men specifically. How could I use that fitted model object to examine the effect for Black men, White women, millennials with kids, etc.? And how would I calculate standard errors for these?

4. Would all of these things be easier to do in Stan? What would that Stan model look like? Then I wouldn’t have to figure out how to calculate standard errors for everything; I could just sample from the posterior.

My reply:

We’ve been talking about varying treatment effects for a long time. (“Heterogeneous” is jargon for “varying,” I think.)

From 2004: Treatment effects in before-after data.

From 2008: Estimating incumbency advantage and its variation, as an example of a before/after study.

From 2015: The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

From 2015: Hierarchical models for causal effects.

From 2015: The connection between varying treatment effects and the well-known optimism of published research findings.

From 2017: Let’s accept the idea that treatment effects vary—not as something special but just as a matter of course.

I definitely think hierarchical modeling is the way to go here. Think of it as a regression model, in which you’re modeling (predicting) treatment effects given pre-treatment predictors, so the treatment could be more effective for men than for women, or for young people than for old people, etc. You’ll end up with lots of predictors in this regression, and multilevel modeling is a way to control or regularize their coefficients.

In short, the key virtue of multilevel modeling (or some other regularization approach) here is that it allows you to include more predictors in your regression. Without regularization, your estimates would become too noisy, and you’d have to fall back on a cruder model that would not let you study the variation that you care about.
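As a rough illustration of why regularization helps with noisy subgroup estimates, here is a small Python simulation (standard library only). It is a sketch under stated assumptions, not a multilevel fit: the true subgroup effects are drawn from a normal distribution, each raw estimate has the same known standard error, and the between-group variance is estimated by a crude method-of-moments step that stands in for what a real hierarchical model would do. The function `simulate` and every number in it are hypothetical.

```python
import random
from statistics import mean, pvariance


def simulate(n_groups=200, mu=0.5, tau=1.0, se=2.0, seed=2):
    """Compare raw subgroup estimates with partially pooled ones."""
    rng = random.Random(seed)
    truth = [rng.gauss(mu, tau) for _ in range(n_groups)]  # true subgroup effects
    raw = [rng.gauss(t, se) for t in truth]                # noisy unpooled estimates

    # Crude stand-in for a hierarchical fit: estimate the overall mean and
    # the between-group variance from the raw estimates themselves.
    mu_hat = mean(raw)
    tau2_hat = max(0.0, pvariance(raw) - se**2)
    weight = tau2_hat / (tau2_hat + se**2)  # 0 = complete pooling, 1 = no pooling
    pooled = [mu_hat + weight * (y - mu_hat) for y in raw]

    def rmse(estimates):
        return mean([(e - t) ** 2 for e, t in zip(estimates, truth)]) ** 0.5

    return rmse(raw), rmse(pooled), weight


rmse_raw, rmse_pooled, weight = simulate()
print(f"RMSE of raw subgroup estimates:      {rmse_raw:.2f}")
print(f"RMSE of partially pooled estimates:  {rmse_pooled:.2f}")
print(f"weight placed on each raw estimate:  {weight:.2f}")
```

With noise this large relative to the true variation, the raw estimates are mostly noise, and pulling them toward the overall mean gives much smaller errors. That is the sense in which regularization lets you keep many predictors in the model.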

The other thing is, yeah, forget type 1 error rates and all the rest. Abandon the idea that the goal of the statistical analysis is to get some sort of certainty. Instead, accept posterior ambiguity: don’t try to learn more from the data than you really can.

I’ll start with some models in lme4 (or rstanarm) notation. Suppose you have a treatment z and pre-treatment predictors x1 and x2. Then here are some models:

y ~ z + x1 + x2     # constant treatment effect
y ~ z + x1*z + x2*z # treatment can vary by x1 and x2
y ~ z + x1*x2*z     # also include interaction of x1 and x2
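For readers less used to this notation: in R model formulas, `a*b` is shorthand for `a + b + a:b`, that is, both main effects plus their interaction. The short Python sketch below just enumerates the terms implied by a `*` expression; `expand_star` is a hypothetical helper written for this illustration, not part of lme4 or rstanarm.

```python
from itertools import combinations


def expand_star(*variables):
    """List the terms implied by v1*v2*...*vk in R formula notation:
    every main effect and every interaction among the variables."""
    terms = []
    for order in range(1, len(variables) + 1):
        for combo in combinations(variables, order):
            terms.append(":".join(combo))
    return terms


print(expand_star("x1", "z"))
# ['x1', 'z', 'x1:z']
print(expand_star("x1", "x2", "z"))
# ['x1', 'x2', 'z', 'x1:x2', 'x1:z', 'x2:z', 'x1:x2:z']
```

So the third model has seven coefficients beyond the intercept, and the terms involving `z` (`z`, `x1:z`, `x2:z`, `x1:x2:z`) are the ones that describe how the treatment effect varies.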

If you have predictors x3 and x4 with multiple levels:

y ~ z + x1 + x2 + (1 | x3) + (1 | x4)   # constant treatment effect
y ~ z + x1*z + x2*z + (1 + z | x3) + (1 + z | x4)   # varying treatment effect
y ~ z + x1*z + x2*z + (1 + z | x3*x4) # includes an interaction

One thing we’re still struggling with is that there are all these possible models. Really we’d like to start and end with the full model, something like this, with all the interactions:

y ~ (1 + x1*x2*z | x3*x4)

But these models can be hard to handle. I think we need stronger priors, stronger than the current defaults in rstanarm. So for now I’d build up from the simple model, including interactions as appropriate.

In any case, you can get posterior uncertainties for whatever you want from stan_glmer() in rstanarm; simulations of all the parameters are directly accessible from the fitted object.
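Here is a minimal sketch of how that works once you have the simulations in hand, written in Python (standard library only) so that it is self-contained. The draws below are made up with a random number generator and are NOT output from stan_glmer(); the coefficient names (`z`, `z:male`, `z:black`) and the additive structure are assumptions for illustration. The idea is that any quantity of interest, such as the treatment effect for Black men, is computed separately within each draw and then summarized.

```python
import random
from statistics import median

rng = random.Random(3)
n_draws = 4000

# Stand-in posterior draws (hypothetical numbers, not from a real fit).
# With a real fitted model, each row would be one joint draw of the
# coefficients, so posterior correlations would be carried along automatically.
draws = [
    {
        "z": rng.gauss(0.30, 0.10),        # baseline treatment effect
        "z:male": rng.gauss(-0.10, 0.08),  # shift in the effect for men
        "z:black": rng.gauss(0.15, 0.12),  # shift in the effect for Black respondents
    }
    for _ in range(n_draws)
]


def interval(values, level=0.95):
    """Return (lower, median, upper) of the draws, using simple order statistics."""
    ordered = sorted(values)
    lo_index = int((1 - level) / 2 * (len(ordered) - 1))
    hi_index = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_index], median(ordered), ordered[hi_index]


# Treatment effect for Black men, computed draw by draw.
effect_black_men = [d["z"] + d["z:male"] + d["z:black"] for d in draws]
lower, mid, upper = interval(effect_black_men)
print(f"effect for Black men: {mid:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```

No separate standard-error formula is needed: the spread of the per-draw values is the posterior uncertainty for that combination.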

You can also aggregate however you want. It’s mathematically the same as Mister P; you’re just working with treatment effects rather than averages.
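A sketch of that aggregation step, again in Python with the standard library only and with entirely made-up inputs: the cells, population counts, and per-cell effect draws below are hypothetical stand-ins, not results from any fitted model. For each draw, the cell-level treatment effects are averaged with weights proportional to the population counts, either over all cells or over a subset.

```python
import random
from statistics import mean

# Hypothetical poststratification cells: population count, plus the center
# and spread used to fabricate stand-in posterior draws of each cell's effect.
cells = {
    ("men", "young"): {"count": 1200, "center": 0.40, "spread": 0.15},
    ("men", "old"): {"count": 800, "center": 0.10, "spread": 0.20},
    ("women", "young"): {"count": 1500, "center": 0.25, "spread": 0.12},
    ("women", "old"): {"count": 500, "center": -0.05, "spread": 0.25},
}

rng = random.Random(4)
n_draws = 2000
draws = [
    {cell: rng.gauss(info["center"], info["spread"]) for cell, info in cells.items()}
    for _ in range(n_draws)
]


def poststratify(draw, keep=lambda cell: True):
    """Population-weighted average effect over the cells selected by `keep`."""
    kept = [cell for cell in draw if keep(cell)]
    total = sum(cells[cell]["count"] for cell in kept)
    return sum(cells[cell]["count"] * draw[cell] for cell in kept) / total


overall = [poststratify(d) for d in draws]
men = [poststratify(d, keep=lambda cell: cell[0] == "men") for d in draws]
print(f"average effect, whole population: {mean(overall):.3f}")
print(f"average effect, men only:         {mean(men):.3f}")
```

Because the weighting is done within each draw, `overall` and `men` are themselves sets of posterior simulations and can be summarized with intervals like any other quantity.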