## Can statistical software do better at giving warnings when you apply a method when maybe you shouldn’t?

Gaurav Sood writes:

There are legions of permutation-based methods that permute the values of a feature to determine whether the variable should be added (e.g., the Boruta algorithm) or to measure its importance. I couldn’t reason out for myself why that is superior to just dropping the feature and checking how much worse the fit is or what have you. Do you know why permuting values may be superior?

Here’s the feature importance based on permutation: “We measure the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.”

From here: https://christophm.github.io/interpretable-ml-book/feature-importance.html

Another way to get at variable importance = estimate the model with all the variables, estimate the model after nuking the variable. And see improvement in MSE etc.

Under what circumstance would permuting the variable be better? Permuting, because of chance alone, would create some stochasticity in learning, no? There is probably some benefit in run time for RF as you permute xs after the model is estimated. But statistically, is it better than estimating variable importance through simply nuking a variable?

My reply: I guess people like permutation testing because it’s nonparametric? I’m probably the wrong person to ask because this is not what I do either. Some people think of permutation testing as being more pure than statistical modeling, even though there’s generally no real justification for the particular permutations being used (see section 3.3 of this paper).

That doesn’t mean that permutation testing is bad in practice: lots of things work even though their theoretical justification is unclear, and, conversely, lots of things have seemingly strong theoretical justification but have serious problems when applied in practice.
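To make the two definitions in the question concrete, here's a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: a two-feature linear model stands in for the random forest, the second feature is built as a noisy copy of the first, and errors are measured in-sample to keep the code short. It is not the Boruta algorithm or any package's implementation.

```python
import random

def mean(v):
    return sum(v) / len(v)

def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def fit_one(x, y):
    """OLS of y on one feature plus intercept; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def fit_two(x1, x2, y):
    """OLS of y on two features plus intercept, via the 2x2 normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Hypothetical data: y depends on x1 only; x2 is a noisy copy of x1.
rng = random.Random(0)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + 0.3 * rng.gauss(0, 1) for a in x1]
y = [a + 0.5 * rng.gauss(0, 1) for a in x1]

a, b1, b2 = fit_two(x1, x2, y)
base = mse(y, [a + b1 * u + b2 * v for u, v in zip(x1, x2)])

# Permutation importance: keep the fitted model, shuffle x1, average over repeats.
increases = []
for _ in range(20):
    shuffled = x1[:]
    rng.shuffle(shuffled)
    permuted_pred = [a + b1 * u + b2 * v for u, v in zip(shuffled, x2)]
    increases.append(mse(y, permuted_pred) - base)
perm_importance = mean(increases)

# Drop-column importance: refit the model without x1.
a_d, b_d = fit_one(x2, y)
drop_importance = mse(y, [a_d + b_d * v for v in x2]) - base

print("permutation importance of x1:", round(perm_importance, 3))
print("drop-column importance of x1:", round(drop_importance, 3))
```

In a setup like this the two numbers can differ a lot, because they answer different questions. Permuting asks how much the already-fitted model leans on the feature. Dropping and refitting asks how much is lost once the other features are allowed to take over. That doesn't say which one is "better," only that they are not interchangeable when features are correlated.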

Gaurav followed up by connecting to a different issue:

It is impressive to me that we produce so much statistical software with so little theoretical justification.

There is then the other end of statistical software which doesn’t include any guidance for known-known errors:

Most word processing software helpfully point out grammatical errors and spelling mistakes. Some even autocorrect. And some, like Grammarly, even give style advice.

Now consider software used for business statistics. Say you want to compute the correlation between two vectors: [100, 200, 300, 400, 500, 600] and [1, 2, 3, 4, 5, 17000]. Most (all?) software will output .65. (Most, if not all, software assumes you want Pearson’s.) Experts know that the relatively large value in the second vector has a large influence on the correlation. For instance, switching it to -17000 will reverse the correlation coefficient to -.65. And if you remove the last observation, the correlation is 1. But a lay user would be none the wiser. Common software, e.g., Excel, R, Stata, Google Sheets, etc., does not warn the user about the outlier and its potential impact on the result. It should.
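The arithmetic in this example is easy to check. The sketch below takes the first vector as [100, 200, 300, 400, 500, 600], the values for which the quoted .65, -.65, and 1 come out. The leave-one-out check at the end, and its 0.2 cutoff, are hypothetical illustrations of what a warning could look like, not something any of the packages named above does.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def max_leave_one_out_shift(x, y):
    """Largest change in r caused by deleting any single observation."""
    r = pearson(x, y)
    shifts = []
    for i in range(len(x)):
        r_without_i = pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shifts.append(abs(r_without_i - r))
    return max(shifts)

x = [100, 200, 300, 400, 500, 600]
y = [1, 2, 3, 4, 5, 17000]

r_full = pearson(x, y)                         # rounds to 0.65
r_flipped = pearson(x, y[:-1] + [-17000])      # rounds to -0.65
r_dropped = pearson(x[:-1], y[:-1])            # 1, up to floating-point error

print(round(r_full, 2), round(r_flipped, 2), round(r_dropped, 2))

# Hypothetical warning rule: flag results that hinge on one data point.
shift = max_leave_one_out_shift(x, y)
if shift > 0.2:  # arbitrary cutoff, chosen only for this illustration
    print(f"warning: deleting one observation changes r by {shift:.2f}")
```

A check like this costs n extra correlations, which is nothing for a spreadsheet-sized dataset. Whether the right default is a warning, a plot, or a robust alternative is a separate design question.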

Take another example—the fickleness of the interpretation of AUC when you have binary predictors (see here) as much depends on how you treat ties. It is an obvious but subtle point. But commonly used statistical software do not warn people about the issue and I am sure a literature search will bring up multiple papers that fall prey to the point.
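Here's a small sketch of why ties matter so much with a binary predictor. It computes AUC as the share of (positive, negative) pairs that are ranked correctly, with a `tie_credit` parameter for tied pairs. Half credit is the usual Mann-Whitney convention; the other two settings are shown only to bracket the range. The six data points are made up, and this may not be the exact issue the linked post has in mind.

```python
def auc(labels, scores, tie_credit=0.5):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Tied pairs contribute `tie_credit` (0.5 is the standard convention).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += tie_credit
    return total / (len(pos) * len(neg))

# Made-up example: a binary predictor, so most pairs are ties.
labels = [1, 1, 1, 0, 0, 0]
scores = [1, 1, 0, 1, 0, 0]

auc_half = auc(labels, scores, tie_credit=0.5)   # 6/9
auc_zero = auc(labels, scores, tie_credit=0.0)   # 4/9
auc_one = auc(labels, scores, tie_credit=1.0)    # 8/9

print(round(auc_zero, 3), round(auc_half, 3), round(auc_one, 3))
```

Four of the nine pairs are tied here, so the answer moves from 4/9 to 8/9 depending only on the tie rule. With a continuous score there would be few or no ties and the choice would hardly matter.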

Given the rate of increase in the production of knowledge, increasingly everyone is a lay user. For instance, in 2013, Lin showed that estimating ATE using OLS with a full set of interactions improves the precision of ATE. But such analyses are uncommon in economics papers. The analyses could be absent for a variety of reasons: 1. ignorance, 2. difficulty, 3. dispute the result, etc. But only ignorance stands the scrutiny. The model is easy to estimate, so the second explanation is unlikely to explain much. The last explanation also seems unlikely, given the result was published in a prominent statistical journal and experts use it. And while we cannot be sure, ignorance is likely the primary explanation. If ignorance is the primary reason, should the onus of being well informed about the latest useful discoveries in methods be on a researcher working in a substantive area? Plausibly. But that is clearly not working very well. One way to accelerate dissemination is to provide such guidance as ‘warnings’ in commonly used statistical software.
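For readers who haven't seen the interacted regression, here's a minimal simulation sketch of the point about precision. It assumes a single covariate. In that case the coefficient on treatment in an OLS with treatment, the mean-centered covariate, and their interaction equals the difference between the two arm-specific regression lines evaluated at the overall covariate mean, which is what `lin_estimate` computes. The data-generating process, sample size, and seed are all made up for illustration.

```python
import random
from statistics import mean, stdev

def fit_line(x, y):
    """Simple OLS of y on x; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def diff_in_means(t, y):
    treated = [yi for yi, ti in zip(y, t) if ti]
    control = [yi for yi, ti in zip(y, t) if not ti]
    return mean(treated) - mean(control)

def lin_estimate(x, t, y):
    """Interacted-regression ATE estimate for one covariate."""
    xbar = mean(x)
    a1, b1 = fit_line([xi for xi, ti in zip(x, t) if ti],
                      [yi for yi, ti in zip(y, t) if ti])
    a0, b0 = fit_line([xi for xi, ti in zip(x, t) if not ti],
                      [yi for yi, ti in zip(y, t) if not ti])
    return (a1 + b1 * xbar) - (a0 + b0 * xbar)

def simulate(rng, n=100):
    """Hypothetical experiment: half the units treated, average effect 1."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    treated = set(rng.sample(range(n), n // 2))
    t = [i in treated for i in range(n)]
    y = []
    for xi, ti in zip(x, t):
        y0 = 2 * xi + rng.gauss(0, 1)
        y.append(y0 + (1 + 0.5 * xi) if ti else y0)
    return x, t, y

rng = random.Random(1)
dm_estimates, lin_estimates = [], []
for _ in range(300):
    x, t, y = simulate(rng)
    dm_estimates.append(diff_in_means(t, y))
    lin_estimates.append(lin_estimate(x, t, y))

sd_dm, sd_lin = stdev(dm_estimates), stdev(lin_estimates)
print("sd of difference in means:", round(sd_dm, 3))
print("sd of interacted estimate:", round(sd_lin, 3))
```

The gain in this sketch comes from the covariate being strongly predictive of the outcome, which is an assumption built into the simulation. With a weak covariate the two estimators would behave about the same.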

I agree. This fits in with the whole workflow thing, where we recognize that we’ll be fitting lots of models, many of which will be horribly inappropriate to the task. Better to recognize this ahead of time rather than starting with the presumption that everything you’re doing is correct, and then spending the rest of your time scrambling to defend all your arbitrary decisions.

## Thoughts on “The American Statistical Association President’s Task Force Statement on Statistical Significance and Replicability”

The statement . . . describes establishment of the task force to “address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of ‘p<0.05’ and ‘statistically significant’ in statistical analysis.)” The authors go on to more specifically identify the purpose of the statement as “two-fold: to clarify that the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned, and to briefly set out some principles of sound statistical inference that may be useful to the scientific community.”

The task force includes several prominent academic and government statisticians (including lots of people I know personally), and both of its goals—clarifying the ASA’s official position and giving its own recommendations—seem valuable to me.

Following Megan, I’d say the task force succeeded in goal #1 and failed in goal #2.

Goal #1—clarifying the ASA’s official position—was simple but it still had to be done. A few years ago the ASA had a committee that in 2016 released a statement on statistical significance and p-values. I was on this committee, along with other difficult people such as Sander Greenland—I mean “difficult” in a good way here!—and I agreed with much but not all of the statement. My response, “The problems with p-values are not just with p-values,” was published here. Various leading statisticians disagreed more strongly with that committee’s report. I think it’s fair to say that that earlier report is not official ASA policy, and it’s good for this new report to clarify this point.

I can tell a story along these lines. A few years ago I happened to be speaking at a biostatistics conference that was led off by Nicole Lazar, one of the authors of the 2016 report. Lazar gave a strong talk promoting the idea that statistical methods should more clearly convey uncertainty, and she explained how presenting results as a string of p-values doesn’t do that. (It’s the usual story: p-values are noisy data summaries, they’re defined relative to a null hypothesis that is typically of no interest, the difference between p-values of 0.01 and 0.20 can easily be explained from pure chance variation, etc.; if you need your memory refreshed you can read the above-linked statement from 2016.) It was good stuff, and the audience was alert and interested. I was happy to see this change in the world. But later that day someone else gave a talk from a very traditional perspective. It wasn’t a terrible talk, but all the reasoning was based on p-values, and I was concerned that the researchers were to some extent chasing noise without realizing it. It was the usual situation, where a story was pieced together using different comparisons that happened to be statistically significant or not. But what really upset me was not the talk itself but that the audience was completely cool with it. It was as if Lazar’s talk had never happened!

Now, just to be clear, this was just my impression. My point is not that the other talk was wrong. It was operating in a paradigm that I don’t trust, but I did not try to track down all the details, and the research might have been just fine. My point is only that (a) it’s far from a consensus that statistics via null hypothesis significance testing is a problem, and (b) Lazar’s talk was well received, but after it was over the statisticians in that room seemed to spring right back to the old mode of thinking. So, yeah, whether or not that 2016 statement can be considered official ASA policy, I don’t think it should be considered as such, given that there is such a wide range of views within the profession.

Goal #2, giving its own recommendations, is another story. For the reasons stated by Megan, I disagree with much of this new statement, and overall I’m unhappy with it. For example, the statement says, “P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results.” Here’s Megan:

(1) Stating “P-values are valid statistical measures” says nothing of when they are or are not valid (or any gray area in between) – instead, it implies they are always valid (especially to those who want that to be the case); (2) I completely agree that they “provide convenient conventions,” but that is not a good reason for using them and works against positive change relative to their use; and (3) I don’t think p-values do a good job “communicating uncertainty” and definitely not The uncertainty inherent in quantitative results as the sentence might imply to some readers. To be fair, I can understand how the authors of the statement could come to feeling okay with the sentence through the individual disclaimers they carry in their own minds, but those disclaimers are invisible to readers. In general, I am worried about how the sentence might be used to justify continuing with poor practices. I envision the sentence being quoted again and again by those who do not want to change their use of p-values in practice and need some official, yet vague, statement of the broad validity of p-values and the value of “convenience.” This is not what we need to improve scientific practice.

Also this:

The last general principle provided is: “In summary, P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.” Hard to know where to start with this one. It repeats the dangers I have already discussed. It can easily be used as justification for continuing poor practices because the issue is a lack of agreement or understanding of what is “proper” and what counts as “rigor.” As is, I don’t agree with such a general statement as “increase the rigor of the conclusions.” Too broad. Too big. Too little justification for such a statement. Again, I’m not sure what a practicing scientist is to take away from this that will “aid researchers on all areas of science” as Kafadar states in the accompanying editorial. Scientists do not need vague, easily quotable and seemingly ASA-backed statements to defend their use of current practices that might be questionable – or at least science doesn’t need scientists to have them.

I was also unhappy with the report’s statement, “Thresholds are helpful when actions are required.” It depends on the action! If the action is whether a journal should publish a paper, no, I don’t think a numerical threshold is helpful. If the action is whether to make a business decision or to approve a drug, then a threshold can be helpful, but I think the threshold should depend on costs and benefits, not the data alone, and not on a p-value. McShane et al. discuss that here. I think the whole threshold thing is a pseudo-practicality. I’m as practical as anybody and I don’t see the need for thresholds at all.

This new report avoided including difficult people like Sander and me, so I guess they had no problem forming a consensus. Like Megan, I have my doubts as to whether this sort of consensus is a good thing. I expressed this view last year, and Megan’s post leaves me feeling that I was right to be concerned.

This then raises the question: how is it that a group of experts I like and respect so much could come up with a statement that I find so problematic? I wasn’t privy to the group’s discussions so all I can offer are some guesses:

1. Flexible wording. As Megan puts it, “the authors of the statement could come to feeling okay with the sentence through the individual disclaimers they carry in their own minds, but those disclaimers are invisible to readers.” For example, a statement such as “thresholds are helpful when actions are required,” is vague enough that who could disagree with it—but in practice this sentence can imply endorsement of statistical significance thresholds in scientific work. Similarly, who can disagree with the statement, “Analyzing data and summarizing results are often more complex than is sometimes popularly conveyed”? And a statement such as “P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data,” has that clause “when properly applied and interpreted,” which covers whatever you want it to mean.

2. The goal of consensus. This may be a matter of taste. I agree with Megan when she says, “We need to see the disagreements and discussion more than we need to see a pretty unified statement that doesn’t acknowledge the inherent messiness and nuances that create the problems to begin with.”

3. Committee dynamics. When I was on that ASA committee a few years ago, I felt a lot of social pressure to come to an agreement and then to endorse the final statement. I can be a stubborn person sometimes, but it was hard to keep saying no, and eventually I gave in. This wasn’t a terrible decision—as noted above, I agreed with most of the 2016 report and still think it was a good thing—but it’s awkward to be part of a group authoring a statement I even partly disagree with.

So, yeah, I understand that committees require compromise; I just think compromise works better in politics than in science. To compromise on a scientific report you need to either include statements that people on the committee disagree with, or else write things so vaguely that they can be interpreted in different ways by different people. I think the new committee report fails because I disagree with a lot of those interpretations. I wish the task force had included some more stubborn people, which could’ve resulted in a document that I think would’ve been more useful to practitioners. Indeed, the presence of stubborn disagreement on the committee could’ve freed all the creative people already in the task force to more clearly express their differing views.

The big picture

I hate to end on a note of disagreement, so let me step back and recognize the good reasons for this sort of consensus statement. We’re living in an age of amazing scientific accomplishments (for example, the covid vaccine) and amazing statistical accomplishments of so many sorts. Statistical tools of all sorts have been valuable in so many ways, and I agree with these statements from the committee report:

Capturing the uncertainty associated with statistical summaries is critical.

Dealing with replicability and uncertainty lies at the heart of statistical science. Study results are replicable if they can be verified in further studies with new data.

The theoretical basis of statistical science offers several general strategies for dealing with uncertainty.

These are not just three sentences pulled out of the report; they’re the first three statements in bold type. They’re 3/5 of what the report wants to emphasize, and I agree with all three. More broadly, I agree with an optimistic take on statistics. I have a lot of problems with null hypothesis significance testing, and I think we can do a lot better, but in the meantime scientists have been able to use these tools to do great work. So I understand and respect the motivation for releasing a document that is positive on statistics and focuses on the consensus within the statistics profession. This gets back to Goal #1 listed above. Yes, there are serious foundational controversies within statistics. But, no, statistics is not in a shambles. Yes, people such as the beauty-and-sex-ratio researcher and the ovulation-and-voting researchers and the critical positivity ratio team and the embodied cognition dudes and the pizzagate guy and various political hacks and all the rest have misunderstood statistical tools and derailed various subfields of science, and yes, we should recognize ways in which our teaching has failed, and, yes, as Megan says we should be open about our disagreements—but I also see value in the American Statistical Association releasing a statement emphasizing the consensus point that our field has strong theoretical foundations and practical utility.

There’s room for difficult people like Sander, Megan, and me, and also for more reasonable people who step back and take the long view. Had I been writing the take-the-long-view report, I would’ve said things slightly differently (in particular removing the fourth and fifth bold statements in their document, for reasons discussed above), but I respect their goal of conveying the larger consensus of what statistics is. Indeed, it is partly because I agree with the consensus view of statistics’s theoretical and practical strengths that I think their statement would be even stronger if it did not tie itself to p-value thresholds. I agree with their conclusion:

Analyzing data and summarizing results are often more complex than is sometimes popularly conveyed. Although all scientific methods have limitations, the proper application of statistical methods is essential for interpreting the results of data analyses and enhancing the replicability of scientific results.

OK, I wouldn’t quite go that far. I wouldn’t say the proper application of statistical methods is essential for interpreting the results of data analyses—after all, lots of people interpret their data just fine while making statistical mistakes or using no statistics at all—but that’s just me being picky and literal again. I agree with their message that statistics is useful, and I can see how they were concerned that that 2016 report might have left people with too pessimistic a view.

## On fatally-flawed, valueless papers that journals refuse to retract

Commenter Carlos pointed us to this story (update here) of some scientists—Florin Moldoveanu, Richard Gill, and five others—all of whom seem to know what they’re talking about and who are indignant that the famous Royal Society of London published a paper that’s complete B.S. and then refused to retract it when the error was pointed out. I understand that indignant feeling. The Royal Society journal did publish an “expression of concern” about the fatally-flawed paper, and, that’s something, but I understand the frustration of Moldoveanu, Gill, et al., that:

1. The expression of concern does not clearly state that the paper is wrong, instead saying vaguely that “there was a divergence of opinion.” Yes, there’s a divergence of opinion: some people say the earth is flat, some do not.

2. The expression of concern expresses things conditionally: “a controversial paper may eventually be shown to contain flaws.” But this is misleading, because it does not clearly state that (a) the paper is “controversial” because it has a mathematical error that destroys its central argument, and (b) it’s not that the paper “may eventually be shown” to contain flaws; it’s that these flaws have already been publicly pointed out.

3. The flaws were also pointed out in the original review process and the editors simply disregarded the review that pointed out the fatal flaw in the paper.

4. Yes, the expression of concern is out there, but if you go to the original paper, you’ll find only a very subtle link to the expression of concern.

And the pdf of the article doesn’t mention the expression of concern at all!

Going medieval?

At this point, I’m ready to go medieval on the Royal Society, an organization which seems justly proud of its long history:

We published Isaac Newton’s Principia Mathematica, and Benjamin Franklin’s kite experiment demonstrating the electrical nature of lightning. We backed James Cook’s journey to Tahiti, reaching Australia and New Zealand, to track the Transit of Venus. We published the first report in English of inoculation against disease, approved Charles Babbage’s Difference Engine, documented the eruption of Krakatoa and published Chadwick’s detection of the neutron that would lead to the unleashing of the atom.

The Royal Society’s motto ‘Nullius in verba’ is taken to mean ‘take nobody’s word for it’. It is an expression of the determination of Fellows to withstand the domination of authority and to verify all statements by an appeal to facts determined by experiment.

Their leading journal, or at least the first one listed on their journals page, is called Royal Society Open Science. And that’s the journal that refuses to retract the offending paper.

Now, at this point, you could say, Hey, give them a break, this is theoretical physics we’re talking about, it’s super-complicated! To which I’d reply: sure, but there are some experts on theoretical physics out there, no? What’s the point of the Royal Society publishing anything at all on theoretical physics if they’re not gonna check it? If you want to publish papers with mathematical errors, we already have Arxiv, right? The Royal Society is supposed to be providing some value added, but here they seem to be just hiding behind the obscurity of theoretical physics. They’re refusing to make a judgment. Which, again, fine, you can refuse to make a judgment. But then why go through the referee process at all?

At this point, the Royal Society is looking almost as bad as Lancet. OK, not as bad as Lancet: the Royal Society didn’t let a fraudulent anti-vax paper sit in their journal for 12 years, and they didn’t go to the press and social media to defend a fraudulent coronavirus paper. So, OK, the Royal Society isn’t Lancet bad, but they’re still refusing to retract a paper that’s been shown to be wrong.

But, then again, everybody does it!

So, I was all ready to work myself into a righteous fury, but then I remembered . . .

Statistical Science published the Bible code paper in 1994 and never retracted it! Yes, they later published a demolition of that paper, but if you go back to the original Bible Code article, there’s no retraction, no correction, no mention of the refutation, and no expression of concern.

The Journal of Personality and Social Psychology published the ESP paper in 2011 and never retracted it. Many people have demolished that paper (hardly necessary, considering how bad it is), but if you go to the article on the American Psychological Association’s website, there’s no retraction, no correction, no mention of the refutation, and no expression of concern.

The Bible code paper and the ESP paper are just as wrong, just as methodologically flawed, and just as bad as the recent Royal Society physics paper. The errors in all three of these papers are unambiguous—and they were unambiguous at the time. Indeed, I have the sense that the editorial boards of Statistical Science in 1994 and the Journal of Personality and Social Psychology in 2011 were pretty sure that these articles were B.S., while they were considering publishing them. And, of course, like the Royal Society paper, these articles made claims that would be hard to interpret as anything other than violations of the laws of physics. Why did those journals publish these terrible submissions? My guess is that they were bending over backward to be fair, to not play the role of censor. I don’t have any easy answers to that particular problem, except to spare a thought for poor Brian Wansink: He doesn’t seem to have any publications in Statistical Science or the Journal of Personality and Social Psychology (or, for that matter, Lancet or Royal Society Open Science), despite having written dozens of papers that are no worse than the ones discussed above. How fair is that?

Conclusion

So, yeah, I don’t really know what to say about all this. In the three examples above, the decision seems clear enough: the papers never should’ve been published. And, once the fatal flaws had been pointed out, they should’ve been retracted. But a general policy is not so clear. And, you know that problem when you repaint a dirty wall in your apartment, and then you realize you need to paint the rest of the walls in that room, then you need to paint the other rooms in the apartment? It’s the same thing: once we ask the Royal Society to start retracting papers that have fatal errors and no redeeming qualities, where do we stop? No easy answers.

I’d like to call this an example of the research incumbency rule, whereby flaws that would easily get an article rejected in the review process are brushed aside once the article has been published. But that’s not quite right, given that all the above-discussed articles had flaws that must have been obvious during the review process as well.

P.S. More here from Retraction Watch.

## Impressive visualizations of social mobility

An anonymous tipster points to this news article by Emily Badger, Claire Cain Miller, Adam Pearce, and Kevin Quealy featuring an amazing set of static and dynamic graphs.

## “The real thing, like the Perseverance mission, is slow, difficult and expensive, but far cooler than the make-believe alternative.”

Good point by Palko. He’s talking about the Mars rover:

There’s a huge disconnect in our discussion of manned space travel. We’ve grown accustomed to vague promises about Martian cities just around the corner, but in the real world, our best engineering minds have never landed anything larger than a car on Mars and this is the least risky way they’ve come up with to do it. . . . The real thing, like the Perseverance mission, is slow, difficult and expensive, but far cooler than the make-believe alternative.

And it reminds me of a conversation I had with some people several years ago. They were talking about ghosts—someone had some story about some old house with creaking doors, and I was like: Pressure differentials causing doors to open at unexpected times, that’s interesting. Ghosts are boring! In the same way, a computer chess program is interesting. A box with a little guy hiding inside moving pieces around is boring. Real-world physics causing thunderstorms: that’s cool. Some dude like Zeus or Thor sitting in the sky throwing down thunderbolts: boring.

Similarly with junk science. The idea that we make predictions using implicit heuristics: cool. The idea that Cornell students can see the future using some sort of ESP: boring. The tensile properties of metals: cool. The idea that someone can bend a spoon without touching it: boring. The properties of real-life social interactions as they play out in a job interview: interesting. That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications: boring.

Or, I should say, some of these statements are interesting if true but boring if false.

We discussed the interesting-if-true thing a few years ago:

Some theories or hypotheses are interesting even if not true, as they represent a way of thinking. Freudian psychiatry, neoclassical economics, the median voter theorem: ideas like that are interesting, and I’m a big fan of research that explores the range of applicability of such theories.

Some other theories are, to me, only interesting if true. The Loch Ness monster: if true, the idea that there’s this sea monster out there that hasn’t really made itself seen for all these years, that’s kind of amazing. If false, though, it’s just one more boring story of media hype. Similarly, Bem’s ESP experiment: if true, how cool is that? If false, it’s just one more empty ESP claim backed up by shaky statistics, like all the others we’ve seen over the past hundred years.

Power pose is an example of a theory that is interesting if true or false. If true, it’s worth knowing that hormones and actual power can be so easily manipulated. If false, that’s interesting too because so many people believed it to be true: it’s worth knowing (and perhaps a relief) that it’s not so easy for people to manipulate their hormones and their actual power in that way.

Other theories can be interesting but much depends on the quantitative scale. The theory about beauty and sex ratio would be interesting if true—but a careful read of the literature reveals that, if the theory were true, the effects would have to be tiny, which implies, first, that maybe it’s not so interesting (how much would we really care about a 0.1 percentage point difference) and, second, that it’s undetectable given available data.

A lot of social science theories are like that: interest depends on the magnitude of effects or comparisons, and much of that is lost in the usual descriptions (in scientific journal articles as well as the popular media) which focus on the (purported) existence of the effect without any understanding of the magnitude. The most notorious case of this was the Freakonomics report on the beauty-and-sex ratio which claimed that “good-looking parents are 36% more likely to have a baby daughter as their first child than a baby son,” which is something like claiming that Tiger Woods has a new golf club that allows him to drive the ball 30,000 yards.

## She sent a letter pointing out problems with a published article, the reviewers agreed that her comments were valid, but the journal didn’t publish her letter because “the policy among editors is not to accept comments.”

The journal in question is called The Economic Journal. To add insult to injury, the editor wrote the following when announcing they wouldn’t publish the letter:

My [the editor’s] assessment is that this paper is a better fit for a field journal in education.

OK, let me get this straight. The original paper, which was seriously flawed, was ok for Mister Big Shot Journal. But a letter pointing out those flaws . . . that’s just good enough for a Little Baby Field Journal.

That doesn’t make sense to me. I mean, sure, when it comes to the motivations of the people involved, it makes perfect sense: their job as journal editors is to give out gold stars to academics who write the sorts of big-impact papers they want to publish; to publish critical letters would devalue these stars. But from a scientific standpoint, it doesn’t make sense. If the statement, “Claim X is supported by evidence Y,” was considered publishable in a general-interest journal, then the statement “Claim X is not supported by evidence Y” should also be publishable.

It’s tricky, though. It only works if the initial, flawed, claim was actually published. Consider this example:

Scenario A:
– A photographer disseminates a blurry picture and says, “Hey—evidence of Bigfoot!”
– The Journal of the Royal Society of Zoology publishes the picture under the title, “Evidence of Bigfoot.”
– An investigator shows that this could well just be a blurry photograph of some dude in a Chewbacca suit.
– The investigator submits her report to the Journal of the Royal Society of Zoology.

Scenario B:
– A photographer disseminates a blurry picture and says, “Hey—evidence of Bigfoot!”
– An investigator shows that this could well just be a blurry photograph of some dude in a Chewbacca suit.
– The investigator submits her report to the Journal of the Royal Society of Zoology.

What should JRSZ do? The answer seems pretty clear to me. In scenario A, JRSZ should publish the investigator’s report. In scenario B, they shouldn’t bother.

Similarly, the Journal of Personality and Social Psychology decided in its finite wisdom to publish that silly ESP article in 2011. As far as I’m concerned, this puts them on the hook to publish a few dozen articles showing no evidence for ESP. They made their choice, now they should live with it.

Background

Here’s the story, sent to me by an economist who would like anonymity:

I thought that you may find the following story quite revealing and perhaps you may want to talk about it in your blog.

In a nutshell, a PhD student, Claudia Troccoli, replicated a paper published in a top Economics journal and she found a major statistical mistake that invalidates the main results. She wrote a comment and sent it to the journal. Six months later she heard from the journal. She received two very positive referee reports supporting her critique, but the editor decided to reject the comment because he had just learned that the journal has an (unwritten) policy of not accepting comments. Another depressing element of the story is that the original paper was a classical example where a combination of lack of statistical power and multiple testing leads to implausible large effects (probably one order of magnitude of what one would have expected based on the literature). It is quite worrying that some editors in top economic journals are still unable to detect the pattern.

The student explained yesterday this story in twitter here and she has posted the comment, the editor letter, and referee reports here.

This story reminds me of my experience with the American Sociological Review a few years ago. They did not want to publish a letter of mine pointing out flaws in a paper they’d published, and their reason was that my letter was not important enough. I don’t buy that reasoning. Assuming the originally published paper was itself important (if not, the journal wouldn’t have published it), I’d say that pointing out the lack of empirical support for a claim in that paper was also important. Not as important as the original paper, which made many points that were not invalidated by my criticism—but, then again, my letter was much shorter than that paper! I think it had about the same amount of importance per page.

Beyond this, I think journals have the obligation to correct errors in the papers they’ve published, once those errors have been pointed out to them. Unfortunately, most journals seem to have a pretty strong policy not to do that.

As Troccoli wrote of The Economic Journal:

The behavior of the journal reflects an incentive problem. No journal likes to admit mistakes. However, as a profession, it is crucial that we have mechanisms to correct errors in published papers and encourage replication.

I agree. And “the profession” is all of science, not just economics.

Not too late for a royal intervention?

I googled *Economic Journal* and found this page, which says that it’s “the Royal Economic Society’s flagship title.” Kind of horrible of the Royal Economic Society to not correct its errors, no?

Perhaps the queen or Meghan Markle or someone like that could step in and fix this mess. Maybe Prince Andrew, as he’s somewhat of a scientific expert—didn’t he write something for the Edge Foundation once? I mean, what’s the point of having a royal society if you can’t get some royal input when needed? It’s a constitutional monarchy, right?

P.S. Werner sends in this picture of a cat that came up to him on a park bench at the lake of Konstanz in Germany and who doesn’t act like a gatekeeper at all.

## Is this a refutation of the piranha principle?

Jonathan Falk points to this example of a really tiny stimulus having a giant effect (in brain space) and asks if it’s a piranha violation. I don’t think it is, but the question is amusing.

## Top 10 Ideas in Statistics That Have Powered the AI Revolution

Aki and I put together this listicle to accompany our recent paper on the most important statistical ideas of the past 50 years.

Kim Martineau at Columbia, who suggested making this list, also had the idea that you all might have suggestions for other important articles and books; tweet your thoughts at @columbiascience or put them in comments below and we can discuss at a future date.

Each idea below can be viewed as a stand-in for an entire subfield. We make no claim that these are the “best” articles and books in statistics and machine learning, we’re just saying they’re important in themselves and represent important developments. By singling out these works, we do not mean to diminish the importance of similar, related work. We focus on methods in statistics and machine learning, rather than equally important breakthroughs in statistical computing, and computer science and engineering, which have provided the tools and computing power for data analysis and visualization to become everyday practical tools. Finally, we have focused on methods, while recognizing that developments in theory and methods are often motivated by specific applications.

The 10 articles and books below all were published in the last 50 years and are listed in chronological order.

1. Hirotugu Akaike (1973). Information Theory and an Extension of the Maximum Likelihood Principle. Proceedings of the Second International Symposium on Information Theory.

This is the paper that introduced the term AIC (originally called An Information Criterion but now known as Akaike Information Criterion), for evaluating a model’s fit based on its estimated predictive accuracy. AIC was instantly recognized as a useful tool, and this paper was one of several published in the mid-1970s placing statistical inference within a predictive framework. We now recognize predictive validation as a fundamental principle in statistics and machine learning. Akaike was an applied statistician who, in the 1960s, tried to measure the roughness of airport runways, in the same way that Benoit Mandelbrot’s early papers on taxonomy and Pareto distributions led to his later work on the mathematics of fractals.

2. John Tukey (1977). Exploratory Data Analysis.

This book has been hugely influential and is a fun read that can be digested in one sitting. Traditionally, data visualization and exploration were considered low-grade aspects of practical statistics; the glamour was in fitting models, proving theorems, and developing the theoretical properties of statistical procedures under various mathematical assumptions or constraints. Tukey flipped this notion on its head. He wrote about statistical tools not for confirming what we already knew (or thought we knew), and not for rejecting hypotheses that we never, or should never have, believed, but for discovering new and unexpected insights from data. His work motivated advances in network analysis, software, and theoretical perspectives that integrate confirmation, criticism, and discovery.

3. Grace Wahba (1978). Improper Priors, Spline Smoothing and the Problem of Guarding Against Model Errors in Regression. Journal of the Royal Statistical Society.

Spline smoothing is an approach for fitting nonparametric curves. Another of Wahba’s papers from this period is called “An automatic French curve,” referring to a class of algorithms that can fit arbitrary smooth curves through data without overfitting to noise, or outliers. The idea may seem obvious now, but it was a major step forward in an era when the starting points for curve fitting were polynomials, exponentials, and other fixed forms. In addition to the direct applicability of splines, this paper was important theoretically. It served as a foundation for later work in nonparametric Bayesian inference by unifying ideas of regularization of high-dimensional models.

4. Bradley Efron (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics.

Bootstrapping is a method for performing statistical inference without assumptions. The data pull themselves up by their bootstraps, as it were. But you can’t make inference without assumptions; what made the bootstrap so useful and influential is that the assumptions came implicitly with the computational procedure: the audaciously simple idea of resampling the data and, each time, repeating the statistical procedure that was performed on the original data. As with many statistical methods of the past 50 years, this one became widely useful because of an explosion in computing power that allowed simulations to replace mathematical analysis.
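To make the resampling idea concrete, here is a minimal sketch, not Efron’s own presentation. The data values, the function name `bootstrap_se`, and the choice of the mean as the statistic are all made up for illustration. It resamples the data with replacement, recomputes the statistic each time, and uses the spread of the replicates as a standard error.

```python
import random
import statistics

def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Bootstrap standard error of stat(data); a sketch, not a full implementation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Resample n observations with replacement from the original data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        # Repeat the same statistical procedure on the resampled data.
        replicates.append(stat(resample))
    return statistics.stdev(replicates)

# Hypothetical data, only for illustration.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
se_boot = bootstrap_se(data, statistics.mean)
# Textbook formula for the standard error of a mean, for comparison.
se_formula = statistics.stdev(data) / len(data) ** 0.5
print(se_boot, se_formula)
```

For the mean, the two numbers should come out close to each other; the point of the bootstrap is that the same loop works for statistics that have no simple formula.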

5. Alan Gelfand and Adrian Smith (1990). Sampling-based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association.

Another way that fast computing has revolutionized statistics and machine learning is through open-ended Bayesian models. Traditional statistical models are static: fit distribution A to data of type B. But modern statistical modeling has a more Tinkertoy quality that lets you flexibly solve problems as they arise by calling on libraries of distributions and transformations. We just need computational tools to fit these snapped-together models. In their influential paper, Gelfand and Smith did not develop any new tools; they demonstrated how Gibbs sampling could be used to fit a large class of statistical models. In recent decades, the Gibbs sampler has been replaced by Hamiltonian Monte Carlo, particle filtering, variational Bayes, and more elaborate algorithms, but the general principle of modular model-building has remained.

6. Guido Imbens and Joshua Angrist (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica.

Causal inference is central to any problem in which the question isn’t just a description (How have things been?) or prediction (What will happen next?), but a counterfactual (If we do X, what would happen to Y?). Causal methods have evolved with the rest of statistics and machine learning through exploration, modeling, and computation. But causal reasoning has the added challenge of asking about data that are impossible to measure (you can’t both do X and not-X to the same person). As a result, a key idea in this field is identifying what questions can be reliably answered from a given experiment. Imbens and Angrist are economists who wrote an influential paper on what can be estimated when causal effects vary, and their ideas form the basis for much of the later work on this topic.

7. Robert Tibshirani (1996). Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society.

In regression, or predicting an outcome variable from a set of inputs or features, the challenge lies in including lots of inputs along with their interactions; the resulting estimation problem becomes statistically unstable because of the many different ways of combining these inputs to get reasonable predictions. Classical least squares or maximum likelihood estimates will be noisy and might not perform well on future data, and so various methods have been developed to constrain or “regularize” the fit to gain stability. In this paper, Tibshirani introduced lasso, a computationally efficient and now widely used approach to regularization, which has become a template for data-based regularization in more complicated models.
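As a small illustration of what “regularizing” the fit means here, consider the simplest possible case: one predictor and no intercept. This is a sketch under that assumption, with made-up data and a hypothetical helper name `lasso_1d`. In this case the lasso estimate has a closed form: take the least squares numerator, shrink it toward zero by the penalty, and set it to exactly zero if the penalty is large enough.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_1d(x, y, lam):
    """Minimize 0.5 * sum((y_i - b * x_i)**2) + lam * |b| over the single slope b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam) / sxx

# Hypothetical data, only for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
b_ols = lasso_1d(x, y, lam=0.0)      # no penalty: ordinary least squares slope
b_shrunk = lasso_1d(x, y, lam=3.0)   # moderate penalty: slope pulled toward zero
b_zero = lasso_1d(x, y, lam=40.0)    # large penalty: slope set exactly to zero
print(b_ols, b_shrunk, b_zero)
```

With many predictors there is no such one-line solution, but the same shrink-and-select behavior is what makes the method both a regularizer and a variable selector.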

8. Leland Wilkinson (1999). The Grammar of Graphics.

In this book, Wilkinson, a statistician who’s worked on several influential commercial software projects including SPSS and Tableau, lays out a framework for statistical graphics that goes beyond the usual focus on pie charts versus histograms, how to draw a scatterplot, and data ink and chartjunk, to abstractly explore how data and visualizations relate. This work has influenced statistics through many pathways, most notably through ggplot2 and the tidyverse family of packages in the computing language R. It’s an important step toward integrating exploratory data and model analysis into data science workflow.

9. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio (2014). Generative Adversarial Networks. Proceedings of the International Conference on Neural Information Processing Systems.

One of machine learning’s stunning achievements in recent years is in real-time decision making through prediction and inference feedbacks. Famous examples include self-driving cars and DeepMind’s AlphaGo, which trained itself to become the best Go player on Earth. Generative adversarial networks, or GANs, are a conceptual advance that allow reinforcement learning problems to be solved automatically. They mark a step toward the longstanding goal of artificial general intelligence while also harnessing the power of parallel processing so that a program can train itself by playing millions of games against itself. At a conceptual level, GANs link prediction with generative models.

10. Yoshua Bengio, Yann LeCun, and Geoffrey Hinton (2015). Deep Learning. Nature.

Deep learning is a class of artificial neural network models that can be used to make flexible nonlinear predictions using a large number of features. Its building blocks—logistic regression, multilevel structure, and Bayesian inference—are hardly new. What makes this line of research so influential is the recognition that these models can be tuned to solve a variety of prediction problems, from consumer behavior to image analysis. As with other developments in statistics and machine learning, the tuning process was made possible only with the advent of fast parallel computing and statistical algorithms to harness this power to fit large models in real time. Conceptually, we’re still catching up with the power of these methods, which is why there’s so much interest in interpretable machine learning.

## When does mathematical tractability predict human behavior?

This is Jessica. I have been reading theoretical papers recently that deal with human behavior in aggregate (e.g., game theory, distribution shift in ML), and have been puzzling a little over statements made about the link between solution complexity and human behavior. For example:

In a book on algorithmic game theory: “Can a Nash equilibrium be computed efficiently, either by an algorithm or the players themselves? In zero-sum games like Rock-Paper-Scissors, where the payoff pair in each entry sums to zero, this can be done via linear programming, or if a small amount of error can be tolerated, via simple iterated learning algorithms […] These algorithms give credence to the Nash equilibrium concept as a good prediction of behavior in zero-sum games.”

Similarly, from a classic paper on the complexity of computing a Nash equilibrium

“Universality is a desirable attribute for an equilibrium concept. Of course, such a concept must also be natural and credible as a prediction of behavior by a group of agents— for example, pure Nash seems preferable to mixed Nash, in games that do have a pure Nash equilibrium. But there is a third important desideratum on equilibrium concepts, of a computational nature: An equilibrium concept should be efficiently computable if it is to be taken seriously as a prediction of what a group of agents will do. Because, if computing a particular kind of equilibrium is an intractable problem, of the kind that take lifetimes of the universe to solve on the world’s fastest computers, it is ludicrous to expect that it can be arrived at in real life.”

Finally, in a paper on strategic classification (where Contestant refers to a person best responding to a classifier by manipulating their features in order to get a better label):

“We develop our classifier using c_assumed, but then for tests allow Contestant to best-respond to the classifier given the cost function c_true. We note that finding the best response to a linear classifier given the cost function c_true is a simple calculus problem.”

When I read things like this, I wonder: What is the landscape of assumptions behind statements like this, which might seem to suggest that a solution concept is a more valid predictor of human behavior if an algorithm can efficiently solve for an exact solution? As someone who often thinks about how human cognition compares to different kinds of statistical processing, I find this idea intriguing but also kind of murky.

For instance, the first example above reads to me like an assertion that ‘if it’s computable, then it’s likely people will achieve it’ whereas the second reads more like, ‘if it’s computable, then it’s within the realm of possibility that people can achieve it.’ But is this difference an artifact of different writing styles or a real difference in assumptions?

I can see making the argument, as the second quote above implies, that if no algorithm other than brute force search through all possible solutions can solve it, then people may not ever reach the equilibrium because human time is finite. But often solution concepts in theoretical work are exact. Only one of the above examples mentions approximate correctness, which is closer to concepts from foundations of learning theory, like probably approximately correct (PAC) learning, where it’s not about learning an exact solution but about distinguishing when we can learn one within some error bound, with some probability. This kind of approximate guarantee seems more reasonable as an expectation of what people can do. I’m thinking about all the work on heuristics and biases implying that for many judgments people rely on simple approximations and sometimes substitutions of hard-to-estimate quantities, like deciding whether or not to purchase insurance based on how bad you feel when you imagine your home on fire.

So I asked a colleague who is a computer scientist / game theorist, Jason Hartline, what to make of statements like this. Here’s his explanation:
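For concreteness, here is a minimal sketch of the kind of “simple iterated learning algorithm” the first quote alludes to, applied to Rock-Paper-Scissors. The specific choices are assumptions for illustration and are not taken from the book: both players use a multiplicative-weights update, the step size `eta`, the number of rounds, and the lopsided starting strategies are arbitrary, and the function names are hypothetical. The guarantee is about the time-averaged strategies, which end up close to an equilibrium in the approximate sense discussed above (neither player can gain much by deviating), not about exact play in any single round.

```python
import math

# Row player's payoff; rows and columns are ordered rock, paper, scissors.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def normalize(w):
    s = sum(w)
    return [wi / s for wi in w]

def self_play(rounds=20000, eta=0.01):
    """Both players run multiplicative weights; returns time-averaged strategies."""
    p = normalize([0.6, 0.3, 0.1])  # arbitrary, deliberately non-uniform starts
    q = normalize([0.2, 0.2, 0.6])
    p_sum, q_sum = [0.0] * 3, [0.0] * 3
    for _ in range(rounds):
        for i in range(3):
            p_sum[i] += p[i]
            q_sum[i] += q[i]
        # Expected payoff of each pure action against the opponent's current mix.
        row_pay = [sum(PAYOFF[i][j] * q[j] for j in range(3)) for i in range(3)]
        col_pay = [-sum(PAYOFF[i][j] * p[i] for i in range(3)) for j in range(3)]
        # Upweight actions that would have done well this round.
        p = normalize([p[i] * math.exp(eta * row_pay[i]) for i in range(3)])
        q = normalize([q[j] * math.exp(eta * col_pay[j]) for j in range(3)])
    return [s / rounds for s in p_sum], [s / rounds for s in q_sum]

def exploitability(p_avg, q_avg):
    """Largest gain either player could get by best-responding to the other's average."""
    best_vs_q = max(sum(PAYOFF[i][j] * q_avg[j] for j in range(3)) for i in range(3))
    best_vs_p = max(-sum(PAYOFF[i][j] * p_avg[i] for i in range(3)) for j in range(3))
    return max(best_vs_q, best_vs_p)

p_avg, q_avg = self_play()
eps = exploitability(p_avg, q_avg)
print(p_avg, q_avg, eps)
```

Note what this does and does not show: an algorithm can get approximately there with very little machinery, which says nothing by itself about whether people do.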

The work on computation and equilibrium really starts with the negative result.  If a computer can’t compute it, neither can people.  This negative result is a critique on one of Nash’s main selling points: existence.   And again, this is a negative perspective:  a model of behavior cannot be correct if it has nothing to say about some settings.  The pure game theorists then say that Nash exists everywhere so (to have a double negative) it’s not obviously wrong.  The work on computation says that well, maybe it is wrong, since if computers can’t find Nash, then people can’t find Nash.  So there are settings where Nash cannot describe people.

Ok, so what do these “algorithmic game theorists” think about games where Nash is computable, like zero sum games?  Well, this is a negative negative result.  We can’t prove that Nash is wrong here using computation.  But it is still not a positive result for people playing Nash.  I’m not sure what there is for theory people in thinking about whether this negative negative result should be a positive result.

Whether people do or do not play Nash seems like more of an empirical question than a theoretical question.  Here the experiments point to stuff like quantal response equilibrium, or cognitive hierarchy, or quantal cognitive hierarchy.  Sometimes these models are good for theory, and in these cases, theorists are happy to use them instead.

It seems from this that the example statements above are essentially saying that if we can’t efficiently compute it, we can’t expect people to reach it. What’s still a bit ambiguous to me is the underlying assumption about the comparability of human information processing and computation. When can we make logical statements that imply some form of equivalence in computability between the two and where should we expect it to break down? Maybe people who work on human versus computational learning have thoughts on this, or even neuroscience or mathematical psych.  Apparently there is also the Church-Turing thesis (see the section on philosophy of mind).

## How to reconcile that I hate structural equation models, but I love measurement error models and multilevel regressions, even though these are special cases of structural equation models?

Andy Dorsey writes:

I’m a graduate student in psychology. I’m trying to figure out what seems to me to be a paradox: One issue you’ve talked about in the past is how you don’t like structural equation modeling (e.g., your blog post here). However, you have also talked about the problems with noisy measures and measurement error (e.g., your papers here and here).

Here’s my confusion: Isn’t the whole point of structural equation modeling to have a measurement model that accounts for measurement error? So isn’t structural equation modeling actually addressing the measurement problem you’ve lamented?

The bottom line is that I really want to address measurement error (via a measurement model) because I’m convinced that it will improve my statistical inferences. I just don’t know how to do that if structural equation modeling is a bad idea.

I do like latent variables. Indeed, when we work with models that don’t have latent variables, we can interpret these as measurement-error models where the errors have zero variance.

And I have no problem with structural equation modeling in the general sense of modeling observed data conditional on an underlying structure.

My problem with structural equation modeling as it is used in social science is that the connections between the latent variables are just too open-ended. Consider the example on the second page of this article.

So, yes, I like measurement-error models and multilevel regressions, and mathematically these are particular examples of structural equation models. But I think that when researchers talk about structural equation models, they’re usually talking about big multivariate models that purport to untangle all sorts of direct and indirect effects from data alone, and I don’t think this is possible. For further discussion of these issues, see Sections 19.7 and B.9 of Regression and Other Stories.

One other thing: I think they should be called “structural models” or “stochastic structural models.” The word “equation” in the name doesn’t seem quite right to me, because the whole point of these models is that they’re not equating the measurement with the structure. The models allow error, so I don’t think of them as equations.

P.S. Zad’s cat, above, is dreaming of latent variables.

## The Alice Neel exhibition at the Metropolitan Museum of Art

This exhibit closes at the end of the month so I can’t put this one on the usual 6-month delay. (Sorry, “Is There a Replication Crisis in Finance?”, originally written in February—you’ll have to wait till the end of the year to be seen by the world.) I’d never heard of Neel before, which I guess is just my ignorance, but this was just about the most satisfying museum show I’ve ever seen. If you’re local, I recommend it. Some of the early work reminded me of Picasso, and the later work was in the style of Van Gogh (as was clear in one display which juxtaposed a Neel painting with one of the Met’s Van Goghs), but Neel conveyed relationships between people in ways that those other artists didn’t. The exhibit was beautifully curated, and I learned a lot from the little notes they had on the wall next to the paintings.

No statistics content at all here, except that after going through the Neel exhibit, we went into another one that didn’t interest me so much, so I read a few pages of the book I was carrying, “Two Girls, Fat and Thin,” by Mary Gaitskill, which had a synesthesia vibe to it. For example, “The voices sounded like young, cramp-shouldered people taking their lunch breaks in cafeterias lit by humming fluorescent lights.” And “she received a call from someone with a high-pitched voice that reminded her of a thin stalk with a rash of fleshy bumps.” Maybe we can use this as an epigraph for our sensification paper. And this:

Justine Shade’s voice sounded different in person than it had on the phone. Floating from the receiver, it had been eerie but purposeful, moving in a line toward a specific destination. In my living room, her words formed troublesome shapes of all kinds that, instead of projecting into the room, she swallowed with some difficulty.

This somehow reminds me of John Updike, if he had a sense of humor.

## John Cook: “Students are disturbed when they find out that Newtonian mechanics ‘only’ works over a couple dozen orders of magnitude. They’d really freak out if they realized how few theories work well when applied over two orders of magnitude.”

Following up on our post from this morning about scale-free parameterization of statistical models, Cook writes:

The scale issue is important. I know you’ve written about that before, that models are implicitly fit to data over some range, and extrapolation beyond that range is perilous. The world is only locally linear, at best.

Students are disturbed when they find out that Newtonian mechanics “only” works over a couple dozen orders of magnitude. They’d really freak out if they realized how few theories work well when applied over two orders of magnitude.

It’s kind of a problem with mathematical notation. It’s easy to write the equation “y = a + bx + error,” which implies “y is approximately equal to a + bx for all possible values of x.” It’s not so easy using our standard mathematical notation to write, “y = a + bx + error for x in the range (A, B).”

The more general issue is that it takes fewer bits to make a big claim than to make a small claim. It’s easier to say “He never lies” than to say “He lies 5% of the time.” Generics are mathematically simpler and easier to handle, even though they represent stronger statements. That’s kind of a paradox.

## From “Mathematical simplicity is not always the same as conceptual simplicity” to scale-free parameterization and its connection to hierarchical models

I sent the following message to John Cook:

This post popped up, and I realized that the point that I make (“Mathematical simplicity is not always the same as conceptual simplicity. A (somewhat) complicated mathematical expression can give some clarity, as the reader can see how each part of the formula corresponds to a different aspect of the problem being modeled.”) is the kind of thing that you might say!

Cook replied:

On a related note, I [Cook] am intrigued by dimensional analysis, type theory, etc. It seems alternately trivial and profound.

Angles in radians are ratios of lengths, so they’re technically dimensionless. And yet arcsin(x) is an angle, and so in some sense it’s a better answer.

I’m interested in sort of artificially injecting dimensions where the math doesn’t require them, e.g. distinguishing probabilities from other dimensionless numbers.

To which I responded:

That’s interesting. It relates to some issues in Bayesian computation. More and more I think that the scale should be an attribute of any parameter. For example, suppose you have a pharma model with a parameter theta that’s in micrograms per liter, with a typical value such as 200. Then I think we should parameterize theta relative to some scale: theta = alpha*phi, where alpha is the scale and phi is the scale-free parameter. This becomes clearer if you think of there being many thetas, thus theta_j = alpha * phi_j, for j=1,…,J. The scale factor alpha could be set a priori (for example, alpha = 200 micrograms/L) or it could itself be estimated from the data. This is redundant parameterization but it can make sense from a scientific perspective.
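Here is a minimal sketch of the bookkeeping behind theta_j = alpha * phi_j. The numbers and the function name `to_scale_free` are made up for illustration. Using the median of the thetas as the “estimated” scale is only a crude stand-in, an assumption of this sketch; in an actual model, alpha would be a parameter estimated jointly with the phi_j’s.

```python
import statistics

def to_scale_free(thetas, alpha=None):
    """Split each theta_j into a scale alpha and a scale-free phi_j = theta_j / alpha."""
    if alpha is None:
        # Crude stand-in for estimating the scale from the data (an assumption here).
        alpha = statistics.median(thetas)
    phis = [t / alpha for t in thetas]
    return alpha, phis

# Hypothetical concentrations in micrograms per liter.
thetas = [150.0, 180.0, 200.0, 240.0, 310.0]

# Scale set a priori to 200 micrograms/L, as in the example above.
alpha_fixed, phi_fixed = to_scale_free(thetas, alpha=200.0)

# Scale taken from the data instead.
alpha_est, phi_est = to_scale_free(thetas)

print(alpha_fixed, phi_fixed)
print(alpha_est, phi_est)
```

Either way, the phi_j’s are numbers of order 1 with no units attached, which is the sense in which they are scale-free.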

P.S. More here.

## The continuing misrepresentations coming from the University of California sleep researcher and publicized by Ted and NPR

Markus Loecher writes:

Just when I had put the “Matthew Walker fake news” into a comfortable place of oblivion, NPR sends me this suggested story.

How disappointing that NPR’s fact check is no better than other media outlets. Then again, it is a different TED talk.

I [Loecher] am itching to look into the claims that:

– “Restricting your sleep to 4 hours in just one night, we observed a 70% drop of killer cells” (minute 10:40)

– “Limiting your sleep to 6 hours for one week, they measured the change in their gene activity profile, they observed (i) a sizeable 711 modified/damaged genes and (ii) half of these were increased, half decreased” (minute 12:40)

Is it not incredible that TED talkers do not have to supply a list of references which they cite? Every school child’s presentation is required to do this nowadays.

Well, sure, standards are much higher in elementary school. Art Buchwald had a column about that once.

Loecher supplied the background last year:

The video of [University of California professor Matthew] Walker’s 19-minute “Sleep is your superpower” talk received more than 1 million views . . .

I [Loecher] applaud both the outstanding delivery and the main message of his captivating presentation . . . However, at about 8:45 into the talk, Matthew walks (no pun intended) on very thin ice:

I could tell you about sleep loss and your cardiovascular system, and that all it takes is one hour. Because there is a global experiment performed on 1.6 billion people across 70 countries twice a year, and it’s called daylight saving time. Now, in the spring, when we lose one hour of sleep, we see a subsequent 24-percent increase in heart attacks that following day. In the autumn, when we gain an hour of sleep, we see a 21-percent reduction in heart attacks. Isn’t that incredible? And you see exactly the same profile for car crashes, road traffic accidents, even suicide rates.

Initially I [Loecher] was super excited about the suggested sample size of 1.6 billion people and wanted to find out how exactly such an incredible data set could possibly have been gathered. Upon my inquiry, Matthew was kind enough to point me to the paper, which was the basis for the rather outrageous claims from above. Luckily, it is an open access article in the openheart Journal from 2014.

Imagine my grave disappointment to find out that the sample was limited to 3 years in the state of Michigan and had just 31 cases per day! On page 4 you find Table 1 which contains the quoted 24% increase and 21% decrease expressed as relative risk (multipliers 1.24 and 0.79, respectively) . . .

More importantly, these changes were not observed on the “following day” after DST but instead on a Monday and Tuesday, respectively. Why did Matthew Walker choose these somewhat random days? My guess would be, because they were the only ones significant at the 5% level, which would be a classic case of “p-value fishing”. . . .

I was unable to find any backing of the statement on “exactly the same profile for car crashes, road traffic accidents, even suicide rates” in the literature.
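A rough back-of-the-envelope version of Loecher’s point, as a sketch under stated assumptions rather than a reanalysis of the paper. The 31 cases per day and the 3 years come from the passage above. Everything else is an assumption for illustration: that daily counts behave like Poisson draws, that the comparison for a given weekday uses one day per year, that the baseline is known exactly, and that there are 14 independent comparisons (7 weekdays times 2 clock changes) each tested at the 5% level.

```python
import math

cases_per_day = 31   # from the quoted passage
years = 3            # from the quoted passage

# Assumption: one post-transition day per year for a given weekday.
expected_cases = cases_per_day * years

# Assumption: Poisson counts, so the relative noise is 1 / sqrt(expected count).
relative_sd = 1 / math.sqrt(expected_cases)

# Hypothetical number of comparisons: 7 weekdays x 2 clock changes.
n_comparisons = 14
alpha_level = 0.05

# Chance that at least one comparison is "significant" if nothing is going on,
# assuming the comparisons are independent.
p_any_significant = 1 - (1 - alpha_level) ** n_comparisons

print(expected_cases, relative_sd, p_any_significant)
```

Under these assumptions the noise on a single weekday comparison is on the order of 10%, and the chance of at least one nominally significant result somewhere in the table is about one half, which is why picking out the two significant days is not very convincing.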

I’m reminded of the nudgelord who wrote in the New York Times:

Knowing a person’s political leanings should not affect your assessment of how good a doctor she is — or whether she is likely to be a good accountant or a talented architect. But in practice, does it? Recently we conducted an experiment to answer that question. Our study . . . found that knowing about people’s political beliefs did interfere with the ability to assess those people’s expertise in other, unrelated domains.

Actually, their study said nothing at all about doctors, accountants, or architects.

I guess that for some people, the general news media are a place to exaggerate findings and flat-out make things up in order to get more attention. It’s all for a good cause, right?

What really frustrates me is that when people call them on their, ummm, misrepresentations, these people do their best to pretend the criticism never existed, replying only when absolutely necessary and then not admitting what they got wrong. To Walker’s credit, he seems to have stayed on the (relatively) high road, ducking criticism and continuing to push nonsense, but at least not calling us Stasi, terrorist, etc., or attacking us in other ways. I appreciate this bit of civility.

Loecher wrote the following letter to Ted:

## Honor Thy Father as a classic of Mafia-deflating literature

In an article, “Why New York’s Mob Mythology Endures,” Adam Gopnik writes:

[The Mafia] has supplied our only reliable, weatherproof American mythology, one sturdy enough to sustain and resist debunking or revisionism. Cowboys turn out to be racist and settlers genocidal, and even astronauts have flaws. But mobsters come pre-disgraced, as jeans come pre-distressed; what bad thing can you say about the Mob that hasn’t been said already? So residual virtues, if any, shine bright. . . .

Good point. He continues:

You could still imagine that books debunking the Cosa Nostra, revealing a truth less glamorous if not more virtuous than what has been peddled, would be plentiful. But, where you could not get a popular historian to repeat the story of, say, Clara Barton and the American Red Cross without much close squinting and revision, a book about the Mob in New York will happily repeat the same twenty stories already known, without probing the possibility that, given the Mob’s secrecy and need for self-generated storytelling, much of what we think we know may not be remotely true. When revision does occur, it meets a stony response. . . .

In his review, Gopnik mentions several books and movies, but not The Sopranos (which, I agree, does its demythologizing in a way that leaves the latter-day mobsters as yet another collection of adorable killers) and, most notably for me, not Honor Thy Father, a book that I read several years ago after Palko recommended it in our comments section.

Honor Thy Father, written by the legendary Mad Men-era magazine writer Gay Talese, is an odd book. In its tone it follows (or, perhaps I should say, is one of the originators of) the classic writing-about-the-Mafia style, with a mix of solemnity, action, and musings on familial bonds. And I got the sense from reading the book that Talese had a lot of affection for his subjects. On the other hand, if you look at the actual content, it’s the story of a spoiled young man who’s never worked a day in his life and whose only core conviction seems to be that he should never have to pay for anything with his own money. The book starts off with some going-to-the-mattresses drama and I thought it was heading for Godfather-style shootouts, but it devolves into a cross-country trip of motel rooms and stolen credit cards, a sort of Lolita without the plot. Wikipedia quotes a New York Times review stating that Talese “conveys the impression that being a mobster is much the same as being a sportsman, film star or any other kind of public ‘personality,'” but I think that misses the point. Sportsmen, film stars, etc. are paid to do their jobs and entertain people. Bill Bonanno was just a well-connected thief. Even “thief” sounds too glamorous, like he’s some kind of Lupin. But, yes, Bonanno gives off some of the vibes of a Joseph Heller middle-manager character.

To say this is not to diss Talese. On the contrary, I find it admirable that he could tell it like it was (or at least appear to do so; it’s not like I have any idea how it really was, or is) and not feel the need to warp the content to match the style.

Anyway, I’m surprised that in his essay Gopnik never mentioned Honor Thy Father. It’s one of the top mob books of all time, no? And I think its mix of elegiac style and debunking content fits Gopnik’s story.

## Not being able to say why you see a 2 doesn’t excuse your uninterpretable model

This is Jessica, but today I’m blogging about a blog post on interpretable machine learning that co-blogger Keith wrote for his day job and shared with me. I agree with multiple observations he makes. Some highlights:

The often suggested simple remedy for this unmanageable complexity is just finding ways to explain these black box models; however, those explanations can sometimes miss key information. In turn, rather than being directly connected with what is going on in the black box model, they result in being “stories” for getting concordant predictions. Given that concordance is not perfect, they can result in very misleading outcomes for many situations.

I blogged a bit about this before, giving examples like inconsistency in explanations for the same inputs and outputs. Thinking about these complications, I’m reminded of a talk by Chris Olah that I saw back in 2018, where he talked about how feature visualizations of activations that fire for different image inputs to a deep neural net allow us to seriously consider what’s going on inside, in a way that makes them analogous to the discovery of the microscope opening up a whole new world of microorganisms. I wonder if this idea has lost favor given that sometimes these explanations don’t behave the way we would hope.

I can also buy that not-quite-correct explanations can cause problems downstream since I see it in the human context. The other day I had to ask a collaborator to try to refrain from providing explanations instead of methodological details for unexpected analysis results, since when delivered passionately the explanations could seem good enough that I wouldn’t question them initially, only to later realize we wasted time when there was a better explanation. Plus, when every unexpected result comes with an explanation, skepticism about the explanation can make it feel like you’re undercutting everything a person says, even if you want to encourage discussion of these things.

While we need to accept what we cannot understand, we should never overlook the advantages of what we can understand. For example, we may never fully understand the physical world. Nor how people think, interact, create, and/or decide. In ML, Geoffrey Hinton’s 2018 YouTube drew attention to the fact that people are unable to explain exactly how they decide in general if something is the digit 2 or not. This fact was originally pointed out, a while ago, by Herbert Simon, and has not been seriously disputed (Ericsson and Simon, 1980). However, prediction models are just abstractions and we can understand the abstractions created to represent that reality, which is complex and often beyond our direct access. So not being able to understand people is not a valid reason to dismiss desires to understand prediction models.

In essence, abstractions are diagrams or symbols that can be manipulated, in error-free ways, to discern their implications. Usually referred to as models or assumptions, they are deductive and hence can be understood in and of themselves for simply what they imply. That is, until they become too complex. For instance, triangles on the plane are understood by most, while triangles on the sphere are understood by less. Reality may always be too complex, but models that adequately represent reality for some purpose need not be. Triangles on the plane are for navigation of short distances while on the sphere, for long distances. Emphatically, it is the abstract model that is understood not necessarily the reality it attempts to represent.

I’m not sure I totally grasp the distinction Keith is trying to get at here. To me the above passage implies we should be careful about assuming that some aspects of reality are too complex to explain. But given the part about concordances being misleading above, it seems applying this recursively can lead to problems: when the deep model is the complex thing we want to explain, we have to be careful isolating what we think are simpler units of abstractions to capture what it’s doing. For instance, a node in a neural network is a relatively simple abstraction (i.e., a linear regression wrapped in a non-linear activation function), but is thinking at that level of abstraction as a means of trying to understand the much more complex behavior of the network as a whole useful? Maybe Keith is trying to motivate considering interpretability in your choice of model, which he talks about later.

Related to people not being able to say how they recognize a 2, one thing that people can potentially do is point to the processor they think is responsible; e.g., I can’t describe why it’s a 2 succinctly based on low-level properties like edge detection but maybe I could say something higher-level like, ‘I would guess it’s something like visual word form memory.’ It’s not a complete explanation, but it seems that sort of meta statement could potentially be useful since the first step to debugging is to figure out where to start looking.

[A] persistent misconception has arisen in ML that models for accurate prediction usually need to be complex. To build upon previous examples, there remains some application areas where simple models have yet to achieve accuracy comparable to black box models. On the other hand, simple models continue to predict as accurately as any state of the art black box model and thus, the question, as noted in the 2019 article by Rudin and Radin, is: “Why Are We Using Black Box Models in AI When We Don’t Need To?”

The referenced paper describes how the authors entered a NeurIPS competition on explainability, but then realized they didn’t need a black box at all to do the task; they could just use one of many simpler, interpretable models. Oops. Some of the interpretability work coming out of ML does seem like what you get when complexity enthusiasts excitedly latch onto a new problem that can motivate more of what they’re good at (e.g., optimization), without necessarily questioning the premise.

Interpretable models are far more trustworthy in that they can be more readily discerned where and when they should be trusted or not and in what ways. But, how can one do this without understanding how the model works, especially for a model that is patently not trustworthy? This is especially important in cases where the underlying distribution of data changes, where it is critical to trouble shoot and modify without delays, as noted in the 2020 article by Hamamoto et al. It is arguably much more difficult to remain successful in the ML full life cycle with black box models than with interpretable models.

Agreed; debugging calls for some degree of interpretability. And often the more people you can get helping debug something, the more likely you are to find the problem.

There is increasing understanding based on considering numerous possible prediction models in a given prediction task. The not-too-unusual observation of simple models performing well for tabular data (a collection of variables, each of which has meaning on its own) was noted over 20 years ago and was labeled the “Rashomon effect” (Breiman, 2001). Breiman posited the possibility of a large Rashomon set in many applications; that is, a multitude of models with approximately the same minimum error rate. A simple check for this is to fit a number of different ML models to the same data set. If many of these are as accurate as the most accurate (within the margin of error), then many other untried models might also be. A recent study (Semenova et al., 2019), now supports running a set of different (mostly black box) ML models to determine their relative accuracy on a given data set to predict the existence of a simple accurate interpretable model—that is, a way to quickly identify applications where it is a good bet that accurate interpretable prediction model can be developed.

I like the idea of trying to estimate how many different ways there are to achieve good accuracy on some inference problem. I’m reminded of a paper I recently read which does basically the inverse – generate a bunch of hypothetical datasets and see how well a model intended to explain human behavior does across them, to understand when you just have a very flexible model versus when it’s actually providing some insight into behavior.
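
Going back to the “simple check” in Keith’s passage (fit several models, then see how many are as accurate as the best one within the margin of error), here is a minimal sketch of just the comparison step. The model names, accuracies, and test-set size are made up for illustration, and the margin is a plain binomial standard error on the best model’s held-out accuracy, which is an assumption of this sketch and not something taken from Keith’s post or from Semenova et al.

```python
import math

def rashomon_candidates(accuracies, n_test, z=1.96):
    """Return the models whose held-out accuracy is within the margin of
    error of the best model's accuracy.

    accuracies: dict mapping model name -> accuracy on the same test set
    n_test:     number of held-out cases used to compute each accuracy
    z:          normal quantile for the margin (1.96 ~ 95%)
    """
    best = max(accuracies.values())
    # Binomial standard error of the best accuracy (an assumed, rough margin).
    margin = z * math.sqrt(best * (1 - best) / n_test)
    return sorted(name for name, acc in accuracies.items()
                  if best - acc <= margin)

# Hypothetical held-out accuracies for four models fit to the same data.
scores = {
    "gradient_boosting": 0.871,
    "random_forest": 0.868,
    "logistic_regression": 0.862,
    "shallow_tree": 0.815,
}
close_to_best = rashomon_candidates(scores, n_test=2000)
print(close_to_best)
```

With these made-up numbers, three of the four models land within the margin, including the logistic regression, which is the situation where it looks like a good bet that a simple interpretable model can do the job. The margin here ignores the fact that all the models are scored on the same test cases, so a paired comparison would be more careful.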

The full data science and life-cycle process likely is different when using interpretable models. More input is needed from domain experts to produce an interpretable model that make sense to them. This should be seen as an advantage. For instance, it is not too unusual at a given stage to find numerous equally interpretable and accurate models. To the data scientist, there may seem little to guide the choice between these. But, when shown to domain experts, they may easily discern opportunities to improve constraints as well as indications of which ones are less likely to generalize well. All equally interpretable and accurate models are not equal in the eyes of domain experts.

I definitely agree with this and other comments Keith makes about the need to consider interpretability early in the process. I was involved in a paper a few years ago where my co-authors had interviewed a bunch of machine learning developers about interpretability. One of the more surprising things we found was that in contrast to ML lit implying that interpretability can be applied post model development, it was seen by many of the developers as a more holistic thing related to how much others in their organization trusted their work at all, and consequently many thought about it from the beginning of model development.

There is now a vast and confusing literature, which conflates interpretability and explainability. In this brief blog, the degree of interpretability is taken simply as how easily the user can grasp the connection between input data and what the ML model would predict. Erasmus et al. (2020) provide a more general and philosophical view. Rudin et al. (2021) avoid trying to provide an exhaustive definition by instead providing general guiding principles to help readers avoid common, but problematic ways of thinking about interpretability. On the other hand, the term “explainability” often refers to post hoc attempts to explain a black box by using simpler ‘understudy’ models that predict the black box predictions.

I’ve always found the simple definition of interpretability as ability to simulate what a model will predict interesting. At one point I was thinking about how if interpretability is mainly aimed at building trust in model predictions, maybe a “deeper” proxy for trust could be called internalizability, which is where the person (after using the model) is simulating the model but they don’t know it.

## “Test & Roll: Profit-Maximizing A/B Tests” by Feit and Berman

Elea McDonnell Feit and Ron Berman write:

Marketers often use A/B testing as a tool to compare marketing treatments in a test stage and then deploy the better-performing treatment to the remainder of the consumer population. While these tests have traditionally been analyzed using hypothesis testing, we re-frame them as an explicit trade-off between the opportunity cost of the test (where some customers receive a sub-optimal treatment) and the potential losses associated with deploying a sub-optimal treatment to the remainder of the population.

We derive a closed-form expression for the profit-maximizing test size and show that it is substantially smaller than typically recommended for a hypothesis test, particularly when the response is noisy or when the total population is small. The common practice of using small holdout groups can be rationalized by asymmetric priors. The proposed test design achieves nearly the same expected regret as the flexible, yet harder-to-implement multi-armed bandit under a wide range of conditions.

We [Feit and Berman] demonstrate the benefits of the method in three different marketing contexts—website design, display advertising and catalog tests—in which we estimate priors from past data. In all three cases, the optimal sample sizes are substantially smaller than for a traditional hypothesis test, resulting in higher profit.

I’ve not read the paper in detail, but the basic idea makes a lot of sense to me.
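
Here is a minimal sketch of the kind of calculation the abstract describes, under assumptions that are not stated in the quoted text: two treatments whose mean responses have independent normal priors with the same mean and standard deviation `sigma`, normally distributed responses with known standard deviation `s`, equal test groups of size `n`, and a total population of `N`. Under those assumptions, maximizing the expected gain from rolling out the test winner gives the test size in the first function; the second function is there only to check the algebra by brute force. The numbers are hypothetical.

```python
import math

def profit_maximizing_test_size(N, s, sigma):
    """Per-arm test size that maximizes expected profit, assuming
    symmetric normal priors (sd sigma) and normal responses (sd s)."""
    r = (s / sigma) ** 2  # ratio of response noise to prior variance
    return math.sqrt(N / 4 * r + (0.75 * r) ** 2) - 0.75 * r

def expected_gain(n, N, s, sigma):
    """Expected gain over deploying a treatment at random: test n units
    per arm, then give the arm with the higher sample mean to the
    remaining N - 2n units."""
    return (N - 2 * n) * sigma ** 2 / math.sqrt(
        math.pi * (sigma ** 2 + s ** 2 / n))

# Hypothetical inputs: 10,000 customers, noisy response, small prior spread.
N, s, sigma = 10_000, 0.1, 0.01
n_star = profit_maximizing_test_size(N, s, sigma)

# Brute-force check over integer test sizes.
n_grid = max(range(1, N // 2), key=lambda n: expected_gain(n, N, s, sigma))
print(round(n_star, 1), n_grid)
```

In this made-up example the formula puts roughly 430 customers in each test group out of 10,000, and the grid search agrees to within one unit. The trade-off is visible in `expected_gain`: a bigger `n` sharpens the comparison but leaves fewer customers (`N - 2n`) to benefit from it.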

I’m not an expert on this literature. I heard about this particular article from a blog comment today. You readers will perhaps have more to say about the topic.

## Politics and economic policy in the age of political science

Reading the London Review of Books, I came across this interesting essay by historian Adam Tooze about the transition of Paul Krugman from 1990s snobby center-left academic economist to 2000s angry left-wing pundit. This is something that’s puzzled me for a while (see for example here and here), and Tooze gives a plausible account of Krugman’s transformation as explainable by a combination of political and economic events.

There was one thing that Tooze didn’t get to, though. Part of his story is the disappointment of Krugman and others on the left because the Obama administration’s 2009 economic stimulus package wasn’t as big as they wanted it to be. It’s not clear how much Obama should be blamed for this, given the willingness of Senate conservatives to use the filibuster rule, but let me set that aside for a moment.

Right now I want to remind you of my theory of Obama’s motivation in 2009 to keep the stimulus from being too big. I suspect that Obama advisor Lawrence Summers, or some part of Summers, feared that a big stimulus at the beginning of Obama’s first term would work all too well, leading to a 1978-style economic expansion followed by a 1980-style dive. I’m sure that Summers’s preferred outcome was steady economic growth, but given all the uncertainty involved, I wouldn’t be surprised if he preferred to err on the side of a lower stimulus to avoid overheating the economy. Here’s what I wrote on the topic back in 2010:

Why didn’t the Democrats do more?

Why, in early 2009, seeing the economy sink, did Obama and congressional Democrats not do more? Why didn’t they follow the advice of Krugman and others and (a) vigorously blame the outgoing administration for their problems and (b) act more decisively to get Americans spending again? . . .

Several Democratic senators did not favor the big stimulus. Part of this can be attributed to ideology (or, to put it in a more positive way, conservative or free-market economic convictions) or maybe even to lobbyists etc. Beyond this, there was the feeling, somewhere around mid-2009, that government intervention wasn’t so popular—that, between TARP, the stimulus, and the auto bailout, voters were getting a bit wary of big government taking over the economy. . . .

On not wanting to repeat the mistakes of the past

But why didn’t Obama do a better job of leveling with the American people? In his first months in office, why didn’t he anticipate the example of the incoming British government and warn people of economic blood, sweat, and tears? Why did his economic team release overly optimistic graphs such as shown here? Wouldn’t it have been better to have set low expectations and then exceed them, rather than the reverse?

I don’t know, but here’s my theory. When Obama came into office, I imagine one of his major goals was to avoid repeating the experiences of Bill Clinton and Jimmy Carter in their first two years.

Clinton, you may recall, was elected with less than 50% of the vote, was never given the respect of a “mandate” by congressional Republicans, wasted political capital on peripheral issues such as gays in the military, spent much of his first two years on centrist, “responsible” politics (budgetary reform and NAFTA) which didn’t thrill his base, and then got rewarded with a smackdown on health care and a Republican takeover of Congress. Clinton may have personally weathered the storm but he never had a chance to implement the liberal program.

Carter, of course, was the original Gloomy Gus, and his term saw the resurgence of the conservative movement in this country, with big tax revolts in 1978 and the Reagan landslide two years after that. It wasn’t all economics, of course: there were also the Russians, Iran, and Jerry Falwell pitching in.

Following Plan Reagan

From a political (but not a policy) perspective, my impression was that Obama’s model was not Bill Clinton or Jimmy Carter but Ronald Reagan. Like Obama in 2008, Reagan came into office in 1980 in a bad economy and inheriting a discredited foreign policy. The economy got steadily worse in the next two years, the opposition party gained seats in the midterm election, but Reagan weathered the storm and came out better than ever.

If the goal was to imitate Reagan, what might Obama have done?

– Stick with the optimism and leave the gloom-and-doom to the other party. Check.
– Stand fast in the face of a recession. Take the hit in the midterms with the goal of bouncing back in year 4. Check.
– Keep ideological purity. Maintain a contrast with the opposition party and pass whatever you can in Congress. Check.

The Democrats got hit harder in 2010 than the Republicans in 1982, but the Democrats had further to fall. Obama and his party in Congress can still hope to bounce back in two years.

Avoiding the curse of Bartels

Political scientist Larry Bartels wrote an influential paper, later incorporated into his book, Unequal Democracy, presenting evidence that for the past several decades, the economy generally has done better under Democratic than Republican presidents. Why then, Bartels asked, have Republicans done so well in presidential elections? Bartels gives several answers, including different patterns at the low and high end of the income spectrum, but a key part of his story is timing: Democratic presidents tend to boost the economy when they enter office and then are stuck watching it rebound against them in year 4 (think Jimmy Carter), whereas Republicans come into office with contract-the-economy policies which hurt at first but tend to yield positive trends in time for reelection (again, think Ronald Reagan).

Overall, according to Bartels, the economy does better under Democratic administrations, but at election time, Republicans are better situated. And there’s general agreement among political scientists that voters respond to recent economic conditions, not to the entire previous four years. Bartels and others argue that the systematic differences between the two parties connect naturally to their key constituencies, with new Democratic administrations being under pressure to heat up the economy and improve conditions for wage-earners and incoming Republicans wanting to keep inflation down.

Some people agree with Bartels’s analysis, some don’t. But, from the point of view of Obama’s strategy, all that matters is that he and his advisers were familiar with the argument that previous Democrats had failed by being too aggressive with economic expansion. Again, it’s the Carter/Reagan example. Under this story, Obama didn’t want to peak too early. So, sure, he wanted a stimulus (he didn’t want the economy to collapse), but he didn’t want to turn the stove on too high and spark an unsustainable bubble of a recovery. In saying this, I’m not attributing any malign motives (any more than I’m casting aspersions on conservatives’ skepticism of unsustainable government-financed recovery). Rather, I’m putting the economic arguments in a political context to give a possible answer to the question of why Obama and congressional Democrats didn’t do things differently in 2009.

Anyway, this is what I think Tooze is missing in his story. He talks about politics and he talks about economics. He recognizes that economic policy has a political element, but I don’t think he’s fully catching on that policies can be set based on anticipated political consequences of economic conditions—and, for that, I think political science research is relevant. Or, should I say, policymakers’ understanding of political science research is relevant. Sure, everybody knows about juicing the economy during an election year, but the idea of not going too fast because you want to rebound back in 4 years, I think that’s a real Carter vs. Reagan lesson, reinforced by research such as that of Bartels. The funny thing is that now it seems that even moderate Democrats want a big stimulus right away to avoid what they see as the negative political consequences deriving from Obama not going big in 2009. Always fighting the last war. Not that I have any policy recommendations of my own; here I’m just trying to trace some logical motivations.

P.S. Tooze is a professor at Columbia University but I’ve never met the guy. He’s in the history department. I guess Columbia’s a pretty big place, and there’s a lack of complete overlap among the historians who study American politics and the political scientists who study American politics. I’ve never met Eric Foner either. I looked up Tooze on Wikipedia and . . . his grandfather was one of those upper-class British communists! I wonder if he was the model for any of those John Le Carre characters. It also says that he (Tooze, not the grandfather) used to teach at Jesus College so maybe he knows Niall Ferguson. Given their much different politics, I imagine they don’t get along so well.

This post is a rerun. I was listening to This American Life on my bike today and heard Ira say:

There’s this study done by the Pew Research Center and Smithsonian Magazine . . . they called up one thousand and one Americans. I do not understand why it is a thousand and one rather than just a thousand. Maybe a thousand and one just seemed sexier or something. . . .

And my first thought was, Hey, I know why they surveyed 1001 people and not exactly 1000! And my second thought was, Hey, I think this came up on the blog the first time that episode aired. And indeed, here it is:

The survey may well have aimed for 1000 people, but you can’t know ahead of time exactly how many people will respond. They call people, leave messages, call back, call back again, etc. The exact number of people who end up in the survey is a random variable.
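
A quick sketch of why the final count is a random variable, using made-up numbers: suppose the pollster dials 10,000 phone numbers and each one independently yields a completed interview with probability 0.10. Neither figure comes from the Pew survey, and real fieldwork (callbacks, quotas, stopping rules) is more complicated than independent coin flips, but the basic point survives.

```python
import math
import random

def simulate_respondents(n_dialed, p_respond, rng):
    """Count completed interviews when each dialed number responds
    independently with probability p_respond (a simplifying assumption)."""
    return sum(rng.random() < p_respond for _ in range(n_dialed))

# Hypothetical fieldwork numbers, chosen so the target is 1000 respondents.
n_dialed, p_respond = 10_000, 0.10

expected = n_dialed * p_respond                            # mean of the count
sd = math.sqrt(n_dialed * p_respond * (1 - p_respond))     # binomial sd

rng = random.Random(2024)  # fixed seed so the run is reproducible
counts = [simulate_respondents(n_dialed, p_respond, rng) for _ in range(5)]
print(expected, round(sd, 1), counts)
```

Under these assumptions the count has a standard deviation of about 30 around the target of 1000, so landing on exactly 1000 would be the surprising outcome, not 1001.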

If Ira can do repeats, so can we!

Maybe one of you works for Chicago Public Radio and can let them know why the survey didn’t have exactly 1000 respondents? Ira has given me so much information and entertainment over the years; it would be good to give back just a little.
