## The State of the Art

Jesse Singal writes:

This was presented, in Jennifer Eberhardt’s book Biased, as evidence to support the idea that even positive portrayals of black characters could be spreading and exacerbating unconscious antiblack bias. I did not see evidence to support that idea.

I replied:

I don’t understand what you’re saying here. I clicked thru and the article seems reasonable enough, for what it is. As you probably know, I’m not a big fan of these implicit bias tests. But I didn’t think the article was making any statements about positive portrayals of black characters. I thought they were saying that even for shows for which viewers perceived the black characters as being portrayed positively, a more objective measure showed the black characters being portrayed more negatively than the whites. I didn’t go thru all the details so maybe there’s something off in how they did their statistical adjustment, but the basic point seemed reasonable, no?

Singal responded:

Yeah, I didn’t include much detail. Basically it is this thing I see a ton of in social-priming-related research where people extrapolate, from results that appear to me to be fairly unimpressive, rather big claims about the ostensible impact of priming stuff on human behavior/attitudes in the real world. I think this table is key:

This was from when they edited out black and white characters and asked people unfamiliar with the shows how they perceived the characters in question. The researchers appear to have tested six different things, found one that was statistically significant (but only barely), and gone all-in on that one, explanations-wise. Then by the time the finding is translated to Eberhardt’s book, where all the nuance is taken out (we don’t hear that in five of the six things they tested they found nothing), we’re told that it could be that even black characters who are portrayed positively on TV, the subject of this story, could be spreading implicit bias throughout the land.
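A back-of-the-envelope sketch of why one barely significant result out of six is weak evidence. It assumes, purely for illustration, that the six tests are independent and that each uses a 0.05 threshold; neither assumption comes from the paper, and the function name is made up for this example.

```python
# Chance of at least one "significant" result among several tests
# when every null hypothesis is true.
# Assumptions (not from the paper): independent tests, alpha = 0.05.

def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n_tests independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


p_any = prob_at_least_one_false_positive(6)
print(f"{p_any:.3f}")  # 0.265
```

Under these assumptions, roughly one time in four you would see at least one "hit" among six comparisons even if nothing at all were going on.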

I don’t really have a strong take on all this, but I thought it could be useful to post on this, just because sometimes maybe it’s a good idea to express this sort of uncertainty in judgment. In any sort of writing there is a pressure to come to a strong conclusion—less pressure in blogging than on other media, perhaps, but still there’s some pull toward certainty. In this case I’ll just leave the discussion where we have it here.

Tomorrow’s Post: Bank Shot

## “Suppose that you work in a restaurant…”

In relation to yesterday’s post on Monty Hall, Josh Miller sends along this paper coauthored with the ubiquitous Adam Sanjurjo, “A Bridge from Monty Hall to the Hot Hand: The Principle of Restricted Choice,” which begins:

Suppose that you work in a restaurant where two regular customers, Ann and Bob, are equally likely to come in for a meal. Further, you know that Ann is indifferent among the 10 items on the menu, whereas Bob strictly prefers the hamburger. While in the kitchen, you receive an order for a hamburger. Who is more likely to be the customer: Ann or Bob?
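For the record, the restaurant question is a two-line Bayes calculation. Here is a small sketch, assuming that "indifferent among the 10 items" means Ann orders each item with probability 1/10 and that "strictly prefers the hamburger" means Bob always orders it; those readings are mine, not spelled out in the quoted passage.

```python
from fractions import Fraction

# Prior: Ann and Bob are equally likely to come in.
prior = {"Ann": Fraction(1, 2), "Bob": Fraction(1, 2)}

# Likelihood of a hamburger order (assumed reading of the setup):
# Ann picks uniformly among 10 items; Bob always picks the hamburger.
likelihood = {"Ann": Fraction(1, 10), "Bob": Fraction(1, 1)}

# Bayes' rule: posterior is proportional to prior times likelihood.
evidence = sum(prior[c] * likelihood[c] for c in prior)
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

print(posterior["Bob"])  # 10/11
print(posterior["Ann"])  # 1/11
```

The customer whose choices are restricted is the one the order points to, which is the "principle of restricted choice" in the paper's title.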

I just love this paper, not so much for its content (which is fine) but for its opening. “Suppose that you work in a restaurant…”

I get the feeling that econ papers always take the perspective of people who are more likely to be owners, or at least consumers, not employees, in restaurants. Sure, there was that one famous paper about taxicab drivers, but I feel like most of the time you’ll hear economists talking about why it’s rational to tip, or how much a restaurant should charge its customers, or ways of ramping up workers’ performance, etc. Lots about Ray Kroc, not so much about the people who prepare the fries. (When my sister worked at McDonalds, they let her serve customers and make fries—but not burgers. Only the boys were allowed to do that.)

Look. I’m not trying to pull out my (nonexistent) working-class credentials. I’ve been lucky and have never had to work a crap job in my life.

It’s just refreshing to read an econ paper that takes the employee’s perspective, not to make an economic point and not to make a political point, but just cos why not. Kind of like Night of the Living Dead.

## Challenge of A/B testing in the presence of network and spillover effects

Gaurav Sood writes:

There is a fun problem that I recently discovered:

Say that you are building a news recommender that lists relevant news items in each person’s news feed. Say that your first version of the news recommender is a rules-based system that uses signals like how many people in your network have seen the news, how many people in total have read the news, the freshness of the news, etc., and sums up the signals in an arbitrary way to rank news items. Your second version uses the same signals but uses a supervised model to decide on the optimal weights.

Say that you find that the recommendations vary a fair bit between the two systems. But which one is better? To suss that, you conduct an A/B test. But a naive experiment will produce biased estimates of the effect and the s.e. because:

1. The signals on which your control group ranking system is based are influenced by the kinds of news articles that people in treatment group see. And vice versa.

2. There is an additional source of stochasticity in recommendations that people see: the order in which people arrive matters.

The effect of the first concern is that our estimates are likely attenuated. To resolve the first issue, show people in the Control Group news articles based on predicted views of news articles based on historical data or pro-rated views of people assigned to control group alone. (This adds a bit of noise to the Control Group estimates.) And keep a separate table of input data for the treatment group and apply the ML model to the pro-rated data from that table.

The consequence of the second issue is that our s.e. is very plausibly much larger than what we will get with the split world testing (each condition gets its own table of counts for views, etc.). The sequence in which people arrive matters as it intersects with “social influence world.” To resolve the second issue, you need to estimate how the sequence of arrival affects outcomes. But given the number of pathways, the best we can probably do is bound. We could probably estimate the effect of ranking the least downloaded item first as a way to bound the effects.
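One way to picture the "split world" bookkeeping described above is a separate table of view counts per arm, with each arm's counts scaled up to the size of the full user base. This is only a sketch of one reading of "pro-rated views"; the class name, method names, and the scaling rule are hypothetical and not taken from the email.

```python
from collections import defaultdict


class SplitWorldCounts:
    """Keeps a separate table of view counts for each experimental arm,
    so that one arm's ranking signals are not fed by the other arm's users."""

    def __init__(self, arms=("control", "treatment")):
        self.views = {arm: defaultdict(int) for arm in arms}
        self.users = {arm: 0 for arm in arms}

    def add_user(self, arm):
        self.users[arm] += 1

    def record_view(self, arm, item_id):
        # Only the viewer's own arm is updated.
        self.views[arm][item_id] += 1

    def prorated_views(self, arm, item_id):
        """Scale one arm's views to what they would be if the whole
        user base were in that arm (assumed pro-rating rule)."""
        if self.users[arm] == 0:
            return 0.0
        total_users = sum(self.users.values())
        return self.views[arm][item_id] * total_users / self.users[arm]


counts = SplitWorldCounts()
for _ in range(40):
    counts.add_user("control")
for _ in range(60):
    counts.add_user("treatment")
for _ in range(10):
    counts.record_view("control", "story-1")
for _ in range(30):
    counts.record_view("treatment", "story-1")

print(counts.prorated_views("control", "story-1"))    # 25.0
print(counts.prorated_views("treatment", "story-1"))  # 50.0
```

Each ranking system then reads only from its own arm's pro-rated table, at the cost of noisier counts because each table is built from fewer users.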

The phrase ‘social influence world’ is linked to: https://www.princeton.edu/~mjs3/salganik_watts08.pdf

Tomorrow’s Post: The State of the Art

## Dan’s Paper Corner: Can we model scientific discovery and what can we learn from the process?

Jesus taken serious by the many
Jesus taken joyous by a few
Jazz police are paid by J. Paul Getty
Jazzers paid by J. Paul Getty II

Leonard Cohen

So I’m trying a new thing because like no one is really desperate for another five thousand word essay about whatever happens to be on my mind on a Thursday night in a hotel room in Glasgow. Also, because there’s a pile of really interesting papers that I think it would be good and fun for people to read and think about.

And because if you’re going to do something, you should jump right into an important topic, may I present for your careful consideration Berna Devezer, Luis G. Nardin, Bert Baumgaertner, and Erkan Ozge Buzbas’ fabulous paper Scientific discovery in a model-centric framework: Reproducibility, innovation, and epistemic diversity. (If we’re going to talk about scientific discovery and reproducibility, you better believe I’m going to crack out the funny Leonard Cohen.)

I am kinda lazy so I’m just going to pull out the last paragraph of the paper as a teaser. But you should read the whole thing. You can also watch Berna give an excellent seminar on the topic. Regardless, here is that final paragraph.

Our research also raises questions with regard to reproducibility of scientific results. If reproducibility can be uncorrelated with other possibly desirable properties of scientific discovery, optimizing the scientific process for reproducibility might present trade-offs against other desirable properties. How should scientists resolve such trade-offs? What outcomes should scientists aim for to facilitate an efficient and proficient scientific process? We leave such considerations for future work.

I like this paper for a pile of reasons. A big one is that a lot of discussion that I have seen around scientific progress is based around personal opinions (some I agree with, some I don’t) and proposed specific interventions. Both of these things are good, but they are not the only tools we have. This paper proposes a mechanistic model of discovery encoding some specific assumptions and investigates the consequences. Broadly speaking, that is a good thing to do.

Some random observations:

• The paper points out that the background information available for a replicated experiment is explicitly different from the background information from the original experiment in that we usually know the outcome of the original. That the set of replications is not a random sample of all experiments is very relevant when making statements like x% of experiments in social psychology don’t replicate.
• One of the key points of the paper is that reproducibility is not the only scientifically relevant property of an experiment. Work that doesn’t reproduce may well lead to a “truth” discovery (or at least a phenomenological model that is correct within the precision of reasonable experiments) faster than work that does reproduce. An extremely nerdy analogy would be that reproducibility will be like a random walk towards the truth, while work that doesn’t reproduce can help shoot closer to the truth.
• Critically, proposals that focus on reproducibility of single experiments (rather than stability of experimental arcs) will most likely be inefficient. (Yes, that includes preregistration, the current Jesus taken serious by the many.)
• This is a mathematical model so everything is “too simple”, but that doesn’t mean it’s not massively informative. Some possible extensions would be to try to model more explicitly the negative effect of persistent-but-wrong flashy theories. Also the effect of incentives. Also the effect of QRPs, HARKing, Hacking, Forking, and other deviations from The Way The Truth and The Life.

I’ll close out with a structurally but not actually related post from much-missed website The Toast: Permission To Play Devil’s Advocate Denied by the exceptional Daniel Mallory Ortberg (read his books. They’re excellent!)

Our records indicate that you have requested to play devil’s advocate for either “just a second here” or “just a minute here” over fourteen times in the last financial quarter. While we appreciate your enthusiasm, priority must be given to those who have not yet played the position. We would like to commend you for the excellent work you have done in the past year arguing for positions you have no real interest or stake in promoting, including:

• Affirmative Action: Who’s the Real Minority Here?
• Maybe Men Score Better In Math For A Reason
• Well, They Don’t Have To Live Here
• I Think You’re Taking This Too Personally
• Would It Be So Bad If They Did Die?
• If You Could Just Try To See It Objectively, Like Me

## Josh Miller’s alternative, more intuitive, formulation of Monty Hall problem

Here it is:

Three tennis players. Two are equally-matched amateurs; the third is a pro who will beat either of the amateurs, always.

You blindly guess that Player A is the pro; the other two then play.

Player B beats Player C. Do you want to stick with Player A in a Player A vs. Player B match-up, or do you want to switch?
And what’s the probability that Player A will beat Player B in this match-up?

And here’s the background.

It started when Josh Miller proposed this alternative formulation of the Monty Hall problem:

Three boxers. Two are equally matched; the other will beat either of them, always.

You blindly guess that Boxer 1 is the best; the other two fight.

Boxer 2 beats Boxer 3. Do you want to stick with Boxer 1 in a Boxer 1 vs. Boxer 2 match-up, or do you want to switch?

I liked the formulation in terms of boxers (of course, and see data-based followup here), but Josh’s particular framing above bothered me.

My first thought was confusion about how this relates to the Monty Hall problem. In that problem, Monty opens a door, he doesn’t compare two doors (in his case, comparing 2 boxers). There’s no “Monty” in the boxers problem.

Then Josh explained:

When Monty chooses between the items you can think of it as a “fight.” The car will run over the goat, and Monty reveals the goat. With two goats, they are evenly matched, so the unlucky one gets gored and is revealed.

And I pieced it together. But I was still bothered:

Now I see it. The math is the same (although I think it’s a bit ambiguous in your example). Pr(boxer B beats boxer C) = 1 if B is better than C, or 1/2 if B is equal in ability to C. Similarly, Pr(Monty doesn’t rule out door B) = 1 if B has the car and C has the goat, or 1/2 if B and C both have goats.

It took me awhile to understand this because I had to process what information is given in “Boxer 2 beats Boxer 3.” My first inclination is that if 2 beats 3, then 2 is better than 3, but your model is that there are only two possible bouts: good vs. bad (with deterministic outcome) or bad vs. bad (with purely random outcome).

My guess is that the intuition on the boxers problem is apparently so clear to people because they’re misunderstanding the outcome, “Boxer 2 beats Boxer 3.” My guess is that they think “Boxer 2 beats Boxer 3” implies that boxer 2 is better than boxer 3. (Aside: I prefer calling them A, B, C so we don’t have to say things like “2 > 3”.)

To put it another way, yes, in your form of the problem, people easily pick the correct “door.” But my guess is that they will get the probability of the next bout wrong. What is Pr(B>A), given the information supplied to us so far? Intuitively from your description, Pr(B>A) is something close to 1. But the answer you want to get is 2/3.

My problem with the boxers framing is that the information “B beats C” feels so strong that it overwhelms everything else. Maybe also the issue is that our intuition is that boxers are in a continuous range, which is different than car >> goat.
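The 2/3 figure can be checked by direct enumeration, using the likelihoods stated above: Pr(B beats C) is 1 if B is the strong one, 1/2 if B and C are evenly matched, and 0 if C is the strong one. A minimal sketch, with variable and function names invented for the example:

```python
from fractions import Fraction

players = ["A", "B", "C"]

# Before any bout, each player is equally likely to be the strong one.
prior = {p: Fraction(1, 3) for p in players}


def pr_b_beats_c(strong: str) -> Fraction:
    """Likelihood of the observed result 'B beats C' given who is strong."""
    if strong == "B":
        return Fraction(1)      # strong always beats weak
    if strong == "C":
        return Fraction(0)      # B could not have beaten the strong one
    return Fraction(1, 2)       # A is strong, so B vs. C is a coin flip


joint = {p: prior[p] * pr_b_beats_c(p) for p in players}
evidence = sum(joint.values())
posterior = {p: joint[p] / evidence for p in players}

# B beats A exactly when B is the strong one, since C has been ruled out.
pr_b_beats_a = posterior["B"]

print(posterior["A"], posterior["B"], posterior["C"])  # 1/3 2/3 0
```

So the win over C moves B from 1/3 to 2/3, not to something close to 1, because half the time that win was just a coin flip between two equals.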

I then suggested switching to tennis players, framing as “two amateurs who are evenly matched.” The point is that boxing evokes this image of a knockout, so once you hear that B beat C, you think of B as the powerhouse. With tennis, it seems more clear somehow that you can win and just be evenly matched.

Josh and I went back and forth on this for a while and we came up with the tennis version given above. I still think the formulation of “You blindly guess that Player A is the pro” is a bit awkward, but maybe something like that is needed to draw the connection to the Monty Hall problem.

Ummm, here’s an alternative:

You’re betting on a tennis tournament involving three players. Two are equally-matched amateurs; the third is a pro who will beat either of the amateurs, always.

You have no idea who is the pro, and you randomly place your bet on Player A.

The first match is B vs. C. Player B wins.

Players A and B then compete. Do you want to keep your bet on Player A, or do you want to switch? And what’s the probability that Player A will beat Player B in this match-up?

This seems cleaner to me, but maybe it’s too far away from the Monty Hall problem. Remember, the point here is not to create a new probability problem; it’s to demystify Monty Hall. Which means that the problem formulation, the correct solution, and the isomorphism to Monty Hall should be as transparent as possible.
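To make the isomorphism concrete, here is a small simulation sketch that runs the tennis bet and the standard Monty Hall game through the same code path. It assumes the usual Monty Hall rules (Monty always opens a goat door from the two you did not pick, choosing at random when both hide goats); the function names, trial count, and seed are arbitrary choices for the illustration.

```python
import random


def tennis_trial(rng):
    """You bet on A. Returns (B won the first match, switching to B wins)."""
    pro = rng.choice("ABC")
    if pro == "A":
        first_match_winner = rng.choice("BC")  # two amateurs: coin flip
    else:
        first_match_winner = pro               # the pro always wins
    return first_match_winner == "B", pro == "B"


def monty_trial(rng):
    """You pick door A. Returns (Monty opened C, switching to B wins)."""
    car = rng.choice("ABC")
    if car == "A":
        opened = rng.choice("BC")              # two goats: Monty picks at random
    else:
        opened = "C" if car == "B" else "B"    # Monty must reveal the goat
    return opened == "C", car == "B"


def switch_win_rate(trial, n_trials=100_000, seed=0):
    """Fraction of trials in which switching wins, among trials where
    the conditioning event (B survives) was actually observed."""
    rng = random.Random(seed)
    kept = wins = 0
    for _ in range(n_trials):
        observed, switch_wins = trial(rng)
        if observed:
            kept += 1
            wins += switch_wins
    return wins / kept


tennis_rate = switch_win_rate(tennis_trial)
monty_rate = switch_win_rate(monty_trial)
print(round(tennis_rate, 2), round(monty_rate, 2))
```

The two trial functions have the same structure line for line: "the pro always beats an amateur" plays the role of "Monty never reveals the car," and both switch rates come out near 2/3.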

P.S. Josh noted that the story was also discussed by Alex Tabarrok, and a similar form of the problem was studied by Bruce Burns and Marieke Wieth in 2004.

## Laplace Calling

Laplace calling to the faraway towns
Now war is declared and battle come down
Laplace calling to the underworld
Come out of the sample, you boys and girls
Laplace calling, now don’t look to us
Phony Bayesmania has bitten the dust
Laplace calling, see we ain’t got no swing
Except for the ring of that probability thing

The asymptote is coming, inference a farce
Meltdown expected, the data’s growin’ sparse
Stan stops running, but I have no fear
‘Cause Laplace is drowning, and I, I live by the prior

Laplace calling to the replication zone
Forget it, brother, you can go it alone
Laplace calling to the zombies of death
Quit holding out and draw another breath
Laplace calling and I don’t want to shout
But when we were talking I saw you nodding out
Laplace calling, see we ain’t got no high
Except for that one with the yellowy eye

The asymptote’s coming, inference a farce
Stan stops running, the data’s growin’ sparse
A parallel era, but I have no fear
‘Cause Laplace is drowning, and I, I live by the prior

The asymptote is coming, inference a farce
Stan stops running, the data’s growin’ sparse
A parallel era, but I have no fear
‘Cause Laplace is drowning, and I, I live by the prior

Now get this

Laplace calling, yes, I was there, too
And you know what they said? Well, some of it was true!
Laplace calling, two hundred years hence
And after all this, won’t you have confidence?

I never felt so much exchangeable

(Apologies to you know who.)

Tomorrow’s post: Challenge of A/B testing in the presence of network and spillover effects

## All the names for hierarchical and multilevel modeling

The title Data Analysis Using Regression and Multilevel/Hierarchical Models hints at the problem, which is that there are a lot of names for models with hierarchical structure.

Ways of saying “hierarchical model”

hierarchical model
a multilevel model with a single nested hierarchy (note my nod to Quine’s “Two Dogmas” with circular references)
multilevel model
a hierarchical model with multiple non-nested hierarchies
random effects model
Item-level parameters are often called “random effects”; reading all the ways the term is used on the Wikipedia page on random effects illustrates why Andrew dislikes the term so much (see also here; both links added by Andrew)—it means many different things to different communities.
mixed effects model
that’s a random effects model with some regular “fixed effect” regression thrown in; this is why lme4 is named after linear mixed effects and NONMEM after nonlinear mixed effects models.
empirical Bayes
Near and dear to Andrew’s heart, because regular Bayes just isn’t empirical enough. I jest—it’s because “empirical Bayes” means using maximum marginal likelihood to estimate priors from data (just like lme4 does).
regularized/penalized/shrunk regression
common approach in machine learning where held out data is used to “learn” the regularization parameters, which are typically framed as shrinkage or regularization scales in penalty terms rather than as priors
automatic relevance determination (ARD)
Radford Neal’s term in his thesis on Gaussian processes and now widely adopted in the GP literature
domain adaptation
This one’s common in the machine-learning literature; I think it came from Hal Daumé III’s paper, “Frustratingly easy domain adaptation” in which he rediscovered the technique; he also calls logistic regression a “maximum entropy classifier”, like many people in natural language processing (and physics)
variance components model
I just learned this one on the Wikipedia page on random effects models
cross-sectional (time-series) model
apparently a thing in econometrics
nested data model, split-plot design, random coefficient
iterated nested Laplace approximation (INLA), expectation maximization (EM), …
Popular algorithmic approaches that get confused with the modeling technique.
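Whatever the name, a common thread in these models is partial pooling: group-level estimates get pulled toward a shared mean by an amount learned from the data. Here is a toy sketch in the "empirical Bayes" flavor mentioned above, for the simplest case I could think of: one noisy estimate per group, a normal-normal model, and a known, equal standard error `sigma` for every group. The function name and the numbers are made up for illustration; this is not what lme4 or any other package actually runs.

```python
from statistics import fmean


def empirical_bayes_shrinkage(y, sigma):
    """Normal-normal partial pooling with known, equal standard error sigma > 0.

    Model (assumed): y_j ~ Normal(theta_j, sigma^2), theta_j ~ Normal(mu, tau^2).
    Marginally y_j ~ Normal(mu, tau^2 + sigma^2), so the maximum marginal
    likelihood estimates are the sample mean for mu and the (1/J) sample
    variance minus sigma^2, floored at zero, for tau^2.
    """
    J = len(y)
    mu_hat = fmean(y)
    total_var = sum((yj - mu_hat) ** 2 for yj in y) / J
    tau2_hat = max(0.0, total_var - sigma ** 2)
    # Weight on each group's own estimate; the rest goes to the shared mean.
    weight = tau2_hat / (tau2_hat + sigma ** 2)
    shrunk = [mu_hat + weight * (yj - mu_hat) for yj in y]
    return mu_hat, tau2_hat, shrunk


mu_hat, tau2_hat, shrunk = empirical_bayes_shrinkage(
    [2.0, 4.0, 6.0, 8.0, 10.0], sigma=2.0
)
print(mu_hat, tau2_hat, shrunk)  # 6.0 4.0 [4.0, 5.0, 6.0, 7.0, 8.0]
```

A fully Bayesian hierarchical model would put a prior on `mu` and `tau` instead of plugging in point estimates, but the shrinkage toward the group mean looks much the same.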

I’m guessing the readers of the blog will have more items to add to the list.

If you liked this post

You might like my earlier post, Logistic regression by any other name.

## Brief summary notes on Statistical Thinking for enabling better review of clinical trials.

This post is not by Andrew.

Now it was spurred by Andrew’s recent post on Statistical Thinking enabling good science.

The day of that post, I happened to look in my email’s trash and noticed that it went back to 2011. One email way back then had an attachment entitled Learning Priorities of RCT versus Non-RCTs. I had forgotten about it. It was one of the last things I had worked on when I last worked in drug regulation.

It was a draft of summary points I was putting together for clinical reviewers (clinicians and biologists working in a regulatory agency) to give them a sense of (hopefully good) statistical thinking in reviewing clinical trials for drug approval. I thought it brought out many of the key points that were in Andrew’s post and in the paper by Tong that Andrew was discussing.

Now my summary points are in terms of statistical significance, type one error and power, but that was 2011. Additionally, I do believe (along with David Spiegelhalter) that regulatory agencies do need to have lines drawn in the sand or set cut points. They have to approve or not approve. As the seriousness of the approval increases, arguably these set cut points should move from being almost automatic defaults to inputs into a weight of evidence evaluation that may overturn them. Now I am working on a post to give an outline of what usually happens in drug regulation. I have received some links to material from a former colleague to help update my 2011 experience base.

In this post, I have made some minor edits; it is not meant to be polished prose but simply summary notes. I thought it might be of interest to some, and hey, I have not posted in over a year and this one was quick and easy.

What can you learn from randomized versus non-randomized comparisons?
What You Can’t Learn (WYCL);
How/Why That’s Critical (HWTC);
Anticipate How To Lessen these limitations (AHTL)

## “Superior: The Return of Race Science,” by Angela Saini

“People so much wanted the story to be true . . . that they couldn’t look past it to more mundane explanations.” – Angela Saini, Superior.

I happened to be reading this book around the same time as I attended the Metascience conference, which was motivated by the realization during the past decade or so of the presence of low-quality research and low-quality statistical methods underlying some subfields of the human sciences.

I like Saini’s book a lot. In some sense it seems too easy, as she points at one ridiculous racist after another, but a key point here is that, over the years, prominent people who should know better have been suckers for junk science offering clean stories to support social prejudices. From Theodore Roosevelt in the early 20th century to David Brooks and the Freakonomics team a hundred years later, politicians, pundits, and scientists have lapped up just-so stories of racial and gender essentialism, without being too picky about the strength of the scientific evidence being offered.

Superior even tells some of the story of Satoshi Kanazawa, but focusing on his efforts regarding racial essentialism rather than his gender essentialist work that we’ve discussed on this blog.

As Saini discusses, race is an available explanation for economic and social inequality. We discussed this a few years ago in response to a book by science journalist Nicholas Wade.

As Saini points out (and as I wrote in the context of my review of Wade’s book), the fact that many racist claims of the past and present have been foolish and scientifically flawed, does not mean that other racist scientific claims are necessarily false (or that they’re true). The fact that Satoshi Kanazawa misuses statistics has no bearing on underlying reality; rather, the uncritical reaction to Kanazawa’s work in many quarters just reveals how receptive many people are to crude essentialist arguments.

A couple weeks ago some people asked why I sometimes talk about racism here—what does it have to do with “statistical modeling, causal inference, and social science”? I replied that racism is a sort of pseudoscientific or adjacent-to-scientific thinking that comes up a lot in popular culture and also in intellectual circles, and also of course it’s related to powerful political movements. So it’s worth thinking about, just as it’s worth thinking about various other frameworks that people use to understand the world. You might ask why I don’t write about religion so much; maybe that’s because, in the modern context, religious discourse is pretty much separate from scientific discourse so it’s not so relevant to our usual themes on this blog. When we talk about religion here it’s mostly from a sociology or political-science perspective (for example here) without really addressing the content of the beliefs or the evidence offered in their support.

Tomorrow’s post: Laplace Calling

## “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science”

As promised, let’s continue yesterday’s discussion of Christopher Tong’s article, “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science.”

First, the title, which makes an excellent point. It can be valuable to think about measurement, comparison, and variation, even if commonly-used statistical methods can mislead.

This reminds me of the idea in decision analysis that the most important thing is not the solution of the decision tree but rather what you decide to put in the tree in the first place, or even, stepping back, what are your goals. The idea is that the threat of decision analysis is more powerful than its execution (as Chrissy Hesse might say): the decision-analytic thinking pushes you to think about costs and uncertainties and alternatives and opportunity costs, and that’s all valuable even if you never get around to performing the formal analysis. Similarly, I take Tong’s point that statistical thinking motivates you to consider design, data quality, bias, variance, conditioning, causal inference, and other concerns that will be relevant, whether or not they all go into a formal analysis.

That said, I have one concern, which is that “the threat is more powerful than the execution” only works if the threat is plausible. If you rule out the possibility of the execution, then the threat is empty. Similarly, while I understand the appeal of “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science,” I think this might be good static advice, applicable right now, but not good dynamic advice: if we do away with statistical inference entirely (except in the very rare cases when no external assumptions are required to perform statistical modeling), then there may be less of a sense of the need for statistical thinking.

Overall, though, I agree with Tong’s message, and I think everybody should read his article.

Now let me go through some points where I disagree, or where I feel I can add something.

– Tong discusses “exploratory versus confirmatory analysis.” I prefer to think of exploratory and confirmatory analysis as two aspects of the same thing. (See also here.)

In short: exploratory data analysis is all about learning the unexpected. This is relative to “the expected,” that is, some existing model. So, exploratory data analysis is most effective when done in the context of sophisticated models. Conversely, exploratory data analysis is a sort of safety valve that can catch problems with your model, thus making confirmatory data analysis more effective.

Here, I think of “confirmatory data analysis” not as significance testing and the rejection of straw-man null hypotheses, but rather as inference conditional on models of substantive interest.

– Tong:

There is, of course, one arena of science where the exploratory/confirmatory distinction is clearly made, and attitudes toward statistical inferences are sound: the phased experimentation of medical clinical trials.

I think this is a bit optimistic, for two reasons. First, I doubt the uncertainty in exploratory, pre-clinical analyses is correctly handled when it comes time to make decisions in designing clinical trials. Second, I don’t see statistical significance thresholds in clinical trials as being appropriate for deciding drug approval.

– Tong:

Medicine is a conservative science and behavior usually does not change on the basis of one study.

Sure, but the flip side of formal conservatism is that lots of informal decisions will be made based on noisy data. Waiting for conclusive results from a series of studies . . . that’s fine, but in the meantime, decisions need to be made, and are being made, every day. This is related to the Chestertonian principle that extreme skepticism is a form of credulity.

– Tong quotes Freedman (1995):

I wish we could learn to look at the data more directly, without the fictional models and priors. On the same wish list: We should stop pretending to fix bad designs and inadequate measurements by modeling.

I have no problem with this statement as literally construed: it represents someone’s wish. But to the extent it is taken as a prescription or recommendation for action, I have problems with it. First, in many cases it’s essentially impossible to look at the data without “fictional models.” For example, suppose you are doing a psychiatric study of depression: “the data” will strongly depend on whatever “fictional models” are used to construct the depression instrument. Similarly for studies of economic statistics, climate reconstruction, etc. I strongly do believe that looking at the data is important—indeed, I’m on record as saying I don’t believe statistical claims when their connection to the data is unclear—but, rather than wishing we could look at the data without models (just about all of which are “fictional”), I’d prefer to look at the data alongside, and informed by, our models.

Regarding the second wish (“stop pretending to fix bad designs and inadequate measurements by modeling”), I guess I might agree with this sentiment, depending on what is meant by “pretend” and “fix”—but I do think it’s a good idea to adjust bad designs and inadequate measurements by modeling. Indeed, if you look carefully, all designs are bad and all measurements are inadequate, so we should adjust as well as we can.

To paraphrase Bill James, the alternative to “inference using adjustment” is not “no inference,” it’s “inference not using adjustment.” Or, to put it in specific terms, if people don’t use methods such as our survey adjustment here, they’ll just use something cruder. I wouldn’t want criticism of the real flaws of useful models to be taken as a motivation for using worse models.

– Tong quotes Feller (1969):

The purpose of statistics in laboratories should be to save labor, time, and expense by efficient experimental designs.

Design is one purpose of statistics in laboratories, but I wouldn’t say it’s the purpose of statistics in laboratories. In addition to design, there’s analysis. A good design can be made even more effective with a good analysis. And, conversely, the existence of a good analysis can motivate a more effective design. This is not a new point; it dates back at least to split-plot, fractional factorial, and other complex designs in classical statistics.

– Tong quotes Mallows (1983):

A good descriptive technique should be appropriate for its purpose; effective as a mode of communication, accurate, complete, and resistant.

I agree, except possibly for the word “complete.” In complex problems, it can be asking too much to expect any single technique to give the whole picture.

– Tong writes:

Formal statistical inference may only be used in a confirmatory setting where the study design and statistical analysis plan are specified prior to data collection, and adhered to during and after it.

I get what he’s saying, but this just pushes the problem back, no? Take a field such as survey sampling where formal statistical inference is useful for obtaining standard errors (which give underestimates of total survey error, but an underestimate can still be useful as a starting point), for adjusting for nonresponse (this is a huge issue in any polling), and for small-area estimation (as here). It’s fair for Tong to say that all this is exploratory, not confirmatory. These formal tools are still useful, though. So I think it’s important to recognize that “exploratory statistics” is not just looking at raw data; it also can include all sorts of statistical analysis that is, in turn, relevant for real decision making.

– Tong writes:

A counterargument to our position is that inferential statistics (p-values, confidence intervals, Bayes factors, and so on) could still be used, but considered as just elaborate descriptive statistics, without inferential implications (e.g., Berry 2016, Lew 2016). We do not find this a compelling way to salvage the machinery of statistical inference. Divorced from the probability claims attached to such quantities (confidence levels, nominal Type I errors, and so on), there is no longer any reason to privilege such quantities over descriptive statistics that more directly characterize the data at hand.

I’ll just say, it depends on the context. Again, in survey research, there are good empirical and theoretical reasons for model-based adjustment as an alternative to just looking at the raw data. I do want to see the data, but if I want to learn about the population, I will do my best to adjust for known problems with the sample. I won’t just say that, because my models aren’t perfect, I shouldn’t use them at all.

To put it another way, I agree with Tong that there’s no reason to privilege such quantities as “p-values, confidence intervals, Bayes factors, . . . confidence levels, nominal Type I errors, and so on,” but I wouldn’t take this as a reason to throw away “the machinery of statistical inference.” Statistical inference gives us all sorts of useful estimates and data adjustments. Please don’t restrict “statistical inference” to those particular tools listed in that above paragraph!

– Tong writes:

A second counterargument is that, as George Box (1999) reminded us, “All models are wrong, but some are useful.” Statistical inferences may be biased per the Optimism Principle, but they are reasonably approximate (it might be claimed), and paraphrasing John Tukey (1962), we are concerned with approximate answers to the right questions, not exact answers to the wrong ones. This line of thinking also fails to be compelling, because we cannot safely estimate how large such approximation errors can be.

I think the secret weapon is helpful here. You can use inferences as they come up, but it’s hard to interpret them one at a time. Much better to see a series of estimates as they vary over space or time, as that’s the right “denominator” (as we used to say in the context of classical Anova) for comparison.

Summary

I like Tong’s article. The above discussion is intended to offer some modifications or clarifications of his good ideas.

Tomorrow’s post: “Superior: The Return of Race Science,” by Angela Saini

## Harking, Sharking, Tharking

Bert Gunter writes:

You may already have seen this [“Harking, Sharking, and Tharking: Making the Case for Post Hoc Analysis of Scientific Data,” John Hollenbeck, Patrick Wright]. It discusses many of the same themes that you and others have highlighted in the special American Statistician issue and elsewhere, but does so from a slightly different perspective, which I thought you might find interesting. I believe it provides some nice examples of what Chris Tong called “enlightened description” in his American Statistician piece.

I replied that Hollenbeck and Wright’s claims seem noncontroversial. I’ve tharked in every research project I’ve ever done.

I also clicked through and read the Tong paper, “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science.” The article is excellent—starting with its title—and it brings up many thoughts. I’ll devote an entire post to it.

Also I was amused by this, the final sentence of Tong’s article:

More generally, if we had to recommend just three articles that capture the spirit of the overall approach outlined here, they would be (in chronological order) Freedman (1991), Gelman and Loken (2014), and Mogil and Macleod (2017).

If Freedman were to see this sentence, he’d spin in his grave. He absolutely despised me, and he put in quite a bit of effort to convince himself and others that my work had no value.

Tomorrow’s post: “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science”

## “Boston Globe Columnist Suspended During Investigation Of Marathon Bombing Stories That Don’t Add Up”

I came across this news article by Samer Kalaf and it made me think of some problems we’ve been seeing in recent years involving cargo-cult science.

Here’s the story:

The Boston Globe has placed columnist Kevin Cullen on “administrative leave” while it conducts a review of his work, after WEEI radio host Kirk Minihane scrutinized Cullen’s April 14 column about the five-year anniversary of the Boston Marathon bombings, and found several inconsistencies. . . .

Here’s an excerpt of the column:

I happened upon a house fire recently, in Mattapan, and the smell reminded me of Boylston Street five years ago, when so many lost their lives and their limbs and their sense of security.

I can smell Patriots Day, 2013. I can hear it. God, can I hear it, whenever multiple fire engines or ambulances are racing to a scene.

I can taste it, when I’m around a campfire and embers create a certain sensation.

I can see it, when I bump into survivors, which happens with more regularity than I could ever have imagined. And I can touch it, when I grab those survivors’ hands or their shoulders.

Cullen, who was part of the paper’s 2003 Pulitzer-winning Spotlight team that broke the stories on the Catholic Church sex abuse scandal, had established in this column, and in prior reporting, that he was present for the bombings. . . .

But Cullen wasn’t really there. And his stories had lots of details that sounded good but were actually made up. Including, horrifyingly enough, made-up stories about a little girl who was missing her leg.

OK, so far, same old story. Mike Barnicle, Janet Cooke, Stephen Glass, . . . and now one more reporter who prefers making things up to doing actual reporting. For one thing, making stuff up is easier; for another, if you make things up, you can make the story work better, as you’re not constrained by pesky details.

What’s the point of writing about this, then? What’s the connection to statistical modeling, causal inference, and social science?

Here’s the point:

1. What’s the reason for journalism? To convey information, to give readers a different window into reality. To give a sense of what it was like to be there, for those who were not there. Or to help people who were there, to remember.

2. What does good journalism look like? It’s typically emotionally stirring and convincingly specific.

And here’s the problem.

The reason for journalism is 1, but some journalists decide to take a shortcut and go straight to the form of good journalism, that is, 2.

Indeed, I suspect that many journalists think that 2 is the goal, and that 1 is just some old-fashioned traditional attitude.

Now, to connect to statistical modeling, causal inference, and social science . . . let’s think about science:

1. What’s the reason for science? To learn about reality, to learn new facts, to encompass facts into existing and new theories, to find flaws in our models of the world.

2. And what does good science look like? It typically has an air of rigor.

And here’s the problem.

The reason for science is 1, but some scientists decide to take a shortcut and go straight to the form of good science, that is, 2.

The problem is not that scientists don’t care about the goal of learning about reality; the problem is that they think that if they follow various formal expressions of science (randomized experiments, p-values, peer review, publication in journals, association with authority figures, etc.) they’ll get the discovery for free.

It’s a natural mistake, given statistical training with its focus on randomization and p-values, an attitude that statistical methods can yield effective certainty from noisy data (true for Las Vegas casinos where the probability model is known; not so true for messy real-world science experiments), and scientific training that’s focused on getting papers published.

Summary

What struck me about the above-quoted Boston Globe article (“I happened upon a house fire recently . . . I can smell Patriots Day, 2013. I can hear it. God, can I hear it . . . I can taste it . . .”) was how it looks like good journalism. Not great journalism—it’s too clichéd and trope-y for that—but what’s generally considered good reporting, the kind that sometimes wins awards.

Similarly, if you look at a bunch of the fatally flawed articles we’ve seen in science journals in the past few years, they look like solid science. It’s only when you examine the details that you start seeing all the problems, and these papers disintegrate like a sock whose thread has been pulled.

Ok, yeah yeah sure, you’re saying: Once again I’m reminded of bad science. Who cares? I care, because bad science Greshams good science in so many ways: in scientists’ decision of what to work on and publish (why do a slow careful study if you can get a better publication with something flashy?), in who gets promoted and honored and who decides to quit the field in disgust (not always, but sometimes), and in what gets publicized. The above Boston marathon story struck me because it had that same flavor.

P.S. Tomorrow’s post: Harking, Sharking, Tharking.

## I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job. Also, moving beyond naive falsificationism

Sandro Ambuehl writes:

I’ve been following your blog and the discussion of replications and replicability across different fields daily, for years. I’m an experimental economist. The following question arose from a discussion I recently had with Anna Dreber, George Loewenstein, and others.

You’ve previously written about the importance of sound theories (and the dangers of anything-goes theories), and I was wondering whether there’s any formal treatment of that, or any empirical evidence on whether empirical investigations based on precise theories that simultaneously test multiple predictions are more likely to replicate than those without theoretical underpinnings, or those that test only isolated predictions.

Specifically: Many of the proposed solutions to the replicability issue (such as preregistration) seem to implicitly assume one-dimensional hypotheses such as “Does X increase Y?” In experimental economics, by contrast, we often test theories. The value of a theory is precisely that it makes multiple predictions. (In economics, theories that explain just one single phenomenon, or make one single prediction are generally viewed as useless and are highly discouraged.) Theories typically also specify how their various predictions relate to each other, often even regarding magnitudes. They are formulated as mathematical models, and their predictions are correspondingly precise. Let’s call a within-subjects experiment that tests a set of predictions of a theory a “multi-dimensional experiment”.

My conjecture is that all the statistical skulduggery that leads to non-replicable results is much harder to do in a theory-based, multi-dimensional experiment. If so, multi-dimensional experiments should lead to better replicability even absent safeguards such as preregistration.

The intuition is the following. Suppose an unscrupulous researcher attempts to “prove” a single prediction that X increases Y. He can do that by selectively excluding subjects with low X and high Y (or high X and low Y) from the sample. Compare that to a researcher who attempts to “prove”, in a within-subject experiment, that X increases Y and A increases B. The latter researcher must exclude many more subjects until his “preferred” sample includes only subjects that conform to the joint hypothesis. The exclusions become harder to justify, and more subjects must be run.

A similar intuition applies to the case of an unscrupulous researcher who tries to “prove” a hypothesis by messing with the measurements of variables (e.g. by using log(X) instead of X). Here, an example is a theory that predicts that X increases both Y and Z. Suppose the researcher finds a Null if he regresses X on Y, but finds a positive correlation between f(X) on Y for some selected transformation f. If the researcher only “tested” the relation between X and Y (a one-dimensional experiment), the researcher could now declare “success”. In a multi-dimensional experiment, however, the researcher will have to dig for an f that doesn’t only generate a positive correlation between f(X) and Y, but also between f(X) and Z, which is harder. A similar point applies if the researcher measures X in different ways (e.g. through a variety of related survey questions) and attempts to select the measurement that best helps “prove” the hypothesis. (Moreover, such a theory would typically also specify something like “If X increases Y by magnitude alpha, then it should increase Z by magnitude beta”. The relation between Y and Z would then present an additional prediction to be tested, yet again increasing the difficulty of “proving” the result through nefarious manipulations.)

So if there is any formal treatment relating to the above intuitions, or any empirical evidence on what kind of research tends to be more or less likely to replicate (depending on factors other than preregistration), I would much appreciate if you could point me to it.

I have two answers for you.

First, some colleagues and I recently published a preregistered replication of one of our own studies; see here. This might be interesting to you because our original study did not test a single thing, so our evaluation was necessarily holistic. In our case, the study was descriptive, not theoretically-motivated, so it’s not quite what you’re talking about—but it’s like your study in that the outcomes of interest were complex and multidimensional.

This was one of the problems I’ve had with recent mass replication studies, that they treat a scientific paper as if it has a single conclusion, even though real papers—theoretically-based or not—typically have many conclusions.

My second response is that I fear you are being too optimistic. Yes, when a theory makes multiple predictions, it may be difficult to select data to make all the predictions work out. But on the other hand you have many degrees of freedom with which to declare success.

This has been one of my problems with a lot of social science research. Just about any pattern in data can be given a theoretical explanation, and just about any pattern in data can be said to be the result of a theoretical prediction. Remember that claim that women were three times more likely to wear red or pink clothing during a certain time of the month? The authors of that study did a replication which failed–but they declared it a success after adding an interaction with outdoor air temperature. Or there was this political science study where the data went in the opposite direction of the preregistration but were retroactively declared to be consistent with the theory. It’s my impression that a lot of economics is like this too: If it goes the wrong way, the result can be explained. That’s fine—it’s one reason why economics is often a useful framework for modeling the world—but I think the idea that statistical studies and p-values and replication are some sort of testing ground for models, the idea that economists are a group of hard-headed Popperians, regularly subjecting their theories to the hard test of reality—I’m skeptical of that take. I think it’s much more that individual economists, and schools of economists, are devoted to their theories and only rarely abandon them on their own. That is, I have a much more Kuhnian take on the whole process. Or, to put it another way, I try to be Popperian in my own research, I think that’s the ideal, but I think the Kuhnian model better describes the general process of science. Or, to put it another way, I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job.

Ambuehl responded:

Anna did have a similar reaction to you—and I think that reaction depends much on what passes as a “theory”. For instance, you won’t find anything in a social psychology textbook that an economic theorist would call a “theory”. You’re certainly right about the issues pertaining to hand-wavy ex-post explanations as with the clothes and ovulation study, or “anything-goes theories” such as the Himicanes that might well have turned out the other way.

By contrast, the theories I had in mind when asking the question are mathematically formulated theories that precisely specify their domain of applicability. An example of the kind of theory I have in mind would be Expected Utility theory (tested in countless papers, e.g. here). Another example of such a theory is the Shannon model of choice under limited attention (tested, e.g., here). These theories are in an entirely different ballpark than vague ideas like, e.g., self-perception theory or social comparison theory that are so loosely specified that one cannot even begin to test them unless one is willing to make assumptions on each of the countless researcher degrees of freedom they leave open.

In fact, economic theorists tend to regard the following characteristics virtues, or even necessities, of any model: precision (can be tested without requiring additional assumptions), parsimony (and hence, makes it hard to explain “uncomfortable” results by interactions etc.), generality (in the sense that they make multiple predictions, across several domains). And they very much frown upon ex post theorizing, ad-hoc assumptions, and imprecision. For theories that satisfy these properties, it would seem much harder to fudge empirical research in a way that doesn’t replicate, wouldn’t it? (Whether the community will accept the results or not seems orthogonal to the question of replicability, no?)

Finally, to the extent that theories in the form of precise, mathematical models are often based on wide bodies of empirical research (economic theorists often try to capture “stylized facts”), wouldn’t one also expect higher rates of replicability because such theories essentially correspond to well-informed priors?

So my overall point is, doesn’t (good) theory have a potentially important role to play regarding replicability? (Many current suggestions for solving the replication crisis, in particular formulaic ones such as pre-registration, or p<0.005, don't seem to recognize those potential benefits of sound theory.)

I replied:

Well, sure, but expected utility theory is flat-out false. Much has been written on the way that utilities only exist after the choices are given. This can even be seen in simple classroom demonstrations, as in section 5 of this paper from 1998. No statistics are needed at all to demonstrate the problems with that theory!

Ambuehl responded with some examples of more sophisticated, but still testable, theories such as reference-dependent preferences, various theories of decision making under ambiguity, and perception-based theories, and I responded with my view that all these theories are either vague enough to be adaptable to any data or precise enough to be evidently false with no data collection needed. This was what Lakatos noted: any theory is either so brittle that it can be destroyed by collecting enough data, or flexible enough to fit anything. This does not mean we can’t do science, it just means we have to move beyond naive falsificationism.

P.S. Tomorrow’s post: “Boston Globe Columnist Suspended During Investigation Of Marathon Bombing Stories That Don’t Add Up.”

## Deterministic thinking (“dichotomania”): a problem in how we think, not just in how we act

This has come up before:

And it came up again recently.

Epidemiologist Sander Greenland has written about “dichotomania: the compulsion to replace quantities with dichotomies (‘black-and-white thinking’), even when such dichotomization is unnecessary and misleading for inference.”

I’d avoid the misleadingly clinically-sounding term “compulsion,” and I’d similarly prefer a word that doesn’t include the pejorative suffix “mania,” hence I’d rather just speak of “deterministic thinking” or “discrete thinking”—but I agree with Greenland’s general point that this tendency to prematurely collapse the wave function contributes to many problems in statistics and science.

Often when the problem of deterministic thinking comes up in discussion, I hear people explain it away, arguing that decisions have to be made (FDA drug trials are often brought up here), or that all rules are essentially deterministic (the idea that confidence intervals are interpreted as whether they include zero), or that this is a problem with incentives or publication bias, or that, sure, everyone knows that thinking of hypotheses as “true” or “false” is wrong, and that statistical significance and other summaries are just convenient shorthands for expressions of uncertainty that are well understood.

But I’d argue, with Eric Loken, that inappropriate discretization is not just a problem with statistical practice; it’s also a problem with how people think, that the idea of things being on or off is “actually the internal working model for a lot of otherwise smart scientists and researchers.”

This came up in some of the recent discussions on abandoning statistical significance, and I want to use this space to emphasize one more time the problem of inappropriate discrete modeling.

## My math is rusty

When I’m giving talks explaining how multilevel modeling can resolve some aspects of the replication crisis, I mention this well-known saying in mathematics: “When a problem is hard, solve it by embedding it in a harder problem.” As applied to statistics, the idea is that it could be hard to analyze a single small study, as inferences can be sensitive to the prior, but if you consider this as one of a large population or long time series of studies, you can model the whole process, partially pool, etc.

In math, examples of embedding into a harder problem include using the theory of ideals to solve problems in prime numbers (ideals are a general class that includes primes as a special case, hence any theory on ideals is automatically true on primes but is more general), using complex numbers to solve problems with real numbers, and using generating functions to sum infinite series.

That last example goes like this. You want to compute
S = sum_{n=1}^{infinity} a_n, but you can’t figure out how to do it. So you write the generating function,
G(x) = sum_{n=1}^{infinity} a_n x^n,
you then do some analysis to figure out G(x) as a function of x, then your series is just S = G(1). And it really works. Cool.

Anyway, I thought that next time I mention this general idea, it would be fun to demonstrate with an example, so one day when I was sitting in a seminar with my notebook, I decided to try to work one out.

S = 1/1^2 + 1/2^2 + 1/3^2 + 1/4^2 + . . .
That is, S = sum_{n=1}^{infinity} n^{-2}
Then the generating function is,
G(x) = sum_{n=1}^{infinity} n^{-2} x^n.
To solve for G(x), we take some derivatives until we can get to something we can sum directly.
First one derivative:
dG/dx = sum_{n=1}^{infinity} n^{-1} x^{n-1}.
OK, taking the derivative again will be a mess, but we can do this:
x dG/dx = sum_{n=1}^{infinity} n^{-1} x^n.
And now we can differentiate again!
d/dx (x dG/dx) = sum_{n=1}^{infinity} x^{n-1}.
Hey, that one we know! It’s 1 + x + x^2 + . . . = 1/(1-x).

So now we have a differential equation:
xG''(x) + G'(x) = 1/(1-x).
Or maybe better to write as,
x(1-x) G''(x) + (1-x) G'(x) - 1 = 0.
Either way, it looks like we’re close to done. Just solve this second-order differential equation. Actually, even easier than that. Let h(x) = G'(x), then we just need to solve,
x(1-x) h'(x) + (1-x) h(x) - 1 = 0.
Hey, that’s just h(x) = -log(1-x) / x. I can’t remember how I figured that one out—it’s just there in my notes—but there must be some easy derivation. In any case, it works:
h'(x) = log(1-x)/x^2 + 1/(x(1-x)), so
x(1-x) h'(x) = log(1-x)*(1-x)/x + 1
(1-x) h(x) = -log(1-x)*(1-x)/x
So, yeah, x(1-x) h'(x) + (1-x) h(x) - 1 = 0. We’ve solved the differential equation!
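
For what it's worth, here is one possible route to h(x) = -log(1-x)/x. This is a reconstruction, not necessarily the derivation in the notes. Dividing the first-order equation by (1-x) leaves x h'(x) + h(x) on the left, which is an exact derivative:

```latex
x\,h'(x) + h(x) \;=\; \frac{d}{dx}\bigl[x\,h(x)\bigr] \;=\; \frac{1}{1-x}
\quad\Longrightarrow\quad
x\,h(x) \;=\; -\log(1-x) + C .
```

At x = 0 the left side vanishes (h(0) = G'(0) = 1 is finite) and so does log(1-x), which forces C = 0 and gives h(x) = -log(1-x)/x.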

And now we have the solution:
G(x) = integral dx (-log(1-x) / x).
This is an indefinite integral but that’s not a problem: we can see that, trivially, G(0) = 0, so we just have to do the integral starting from 0.

At this point, I was feeling pretty good about myself, like I’m some kind of baby Euler, racking up these sums using generating functions.

All I need to do is this little integral . . .

OK, I don’t remember integrals so well. It must be easy to do it using integration by parts . . . oh well, I’ll look it up when I come into the office, it’ll probably be an arcsecant or something like that. But then . . . it turns out there’s no closed-form solution!

Here it is in Wolfram alpha (OK, I take back all the things I said about them):

OK, what’s Li_2(x)? Here it is:

Hey—that’s no help at all, it’s just the infinite series again.

So my generating-function trick didn’t work. Next step is to sum the infinite series by integrating it in the complex plane and counting the poles. But I really don’t remember that! It’s something I learned . . . ummm, 35 years ago. And probably forgot about 34 years ago.

So, yeah, my math is rusty.

But I still like the general principle: When a problem is hard, solve it by embedding it in a harder problem.

P.S. We can use this example to teach a different principle of statistics: the combination of numerical and analytic methods.

How do you compute S = sum_{n=1}^{infinity} n^{-2}?

Simplest approach is to add a bunch of terms; for example, in R:
S_approx_1 <- sum((1:1000000)^(-2)). This brute-force method works fine in this example but it would have trouble if the function to evaluate is expensive.

Another approach is to approximate the sum by an integral; thus:
S_approx_2 <- integral_{from x=0.5 to infinity} dx x^{-2} = 2. (The indefinite integral is just -1/x, so the definite integral is -1/infinity - (-1/0.5) = 2.) You have to start the integral at 0.5 because the sum starts at 1, so the little bars to sum are [0.5,1.5], [1.5,2.5], etc. That second approximation isn't so great at the low end of x, though, where the curve 1/x^2 is far from locally linear. So we can do an intermediate approximation:

S_approx_3 <- sum((1:N)^(-2)) + integral_{from x=(N+0.5) to infinity} dx x^{-2} = sum((1:N)^(-2)) + 1/(N+0.5).

That last approximation is fun because it combines numerical and analytic methods. And it works! Just try N=3:
S_approx = 1 + 1/4 + 1/9 + 1/3.5 = 1.647.
The exact value, to three decimal places, is 1.645. Not bad.
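
Here is a minimal R sketch of that hybrid approximation, for anyone who wants to try other values of N. The function name `S_approx_3` follows the notation above. The comparison value pi^2/6 is the known closed form for this sum (the Basel problem); it is brought in here as an outside fact, not something derived in this post.

```r
# Hybrid approximation: add the first N terms exactly, then approximate
# the remaining tail by the integral of x^(-2) from N + 0.5 to infinity,
# which equals 1 / (N + 0.5).
S_approx_3 <- function(N) {
  sum((1:N)^(-2)) + 1 / (N + 0.5)
}

# Known closed form for sum_{n >= 1} n^(-2); an outside fact, used only
# as the yardstick for the error column.
S_exact <- pi^2 / 6

# Print the approximation and its error for a few choices of N.
for (N in c(1L, 3L, 10L, 100L)) {
  approx <- S_approx_3(N)
  cat(sprintf("N = %3d  approx = %.6f  error = %.2e\n",
              N, approx, approx - S_exact))
}
```

The error should shrink quickly as N grows, which is the point: a handful of exact terms plus a one-line integral can stand in for a very long brute-force sum.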

There are better approximation methods out there; the point is that even a simple approach of this sort can do pretty well. And I’ve seen a lot of simulation studies that are done using brute force where the answers just don’t make sense, and where just a bit of analytical work at the end could’ve made everything work out.

P.P.S. Tomorrow’s post: Deterministic thinking (“dichotomania”): a problem in how we think, not just in how we act.

P.P.P.S. [From Bob Carpenter] MathJax is turned on for posts, but not comments, so that $latex e^x$ renders as $e^x$.

## The uncanny valley of Malcolm Gladwell

Gladwell is a fun writer, and I like how he plays with ideas. To my taste, though, he lives in an uncanny valley between nonfiction and fiction, or maybe I should say between science and storytelling. I’d enjoy him more, and feel better about his influence, if he’d take the David Sedaris route and go all the way toward storytelling (with the clear understanding that he’s telling us things because they sound good or they make a good story, not because they’re true), or conversely become a real science writer and evaluate science and data claims critically. Instead he’s kind of in between, bouncing back and forth between stories and science, and that makes me uncomfortable.

Here’s an example, from a recent review by Andrew Ferguson, “Malcolm Gladwell Reaches His Tipping Point.” I haven’t read Gladwell’s new book, so I can’t really evaluate most of these criticisms, but of course I’m sympathetic to Ferguson’s general point. Key quote:

Gladwell’s many critics often accuse him of oversimplification. Just as often, though, he acts as a great mystifier, imposing complexity on the everyday stuff of life, elevating minor wrinkles into profound conundrums. This, not coincidentally, is the method of pop social science, on whose rickety findings Gladwell has built his reputation as a public intellectual.

In addition, Ferguson has a specific story regarding some suspiciously specific speculation (the claim that “of every occupational category, [poets] have far and away the highest suicide rates—as much as five times higher than the general population.”) which reminds me of some other such items we’ve discussed over the years, including:

– That data scientist’s unnamed smallish town where 75 people per year died “because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic.”

– That billionaire’s graph purporting to show “percentage of slaves or serfs in the world.”

– Those psychologists’ claim that women were three times more likely to wear red or pink during certain times of the month.

– That claim from “positive psychology” of the “critical positivity ratio” of 2.9013.

– That psychologist’s claim that he could predict divorces with 83 percent accuracy, after meeting with a couple for just 15 minutes.

And lots more.

There’s something hypnotizing about those numbers. Too good to check, I guess.

## Let’s try this again: It is nonsense to say that we don’t know whether a specific weather event was affected by climate change. It’s not just wrong, it’s nonsensical.

This post is by Phil Price, not Andrew.

If you write something and a substantial number of well-intentioned readers misses your point, the problem is yours. Too many people misunderstood what I was saying a few days ago in the post “There is no way to prove that [an extreme weather event] either was, or was not, affected by global warming” and that’s my fault. Let me see if I can do better.

Forget about climate and weather for a moment. I want to talk about bike riding.

You go for a ride with a friend. You come to a steep, winding climb and you ride up side by side. You are at the right side of the road, with your friend to your left, so when you come to a hairpin turn to the right you have a much steeper (but shorter) path than your friend for a few dozen feet. Later you come to a hairpin to the left, but the situation isn’t quite reversed because you are both still in the right lane so your friend isn’t way over where the hairpin is sharpest and the slope is steepest. You ride to the top of the hill and get to a flat section where you are riding side-by-side. There is some very minor way in which you can be said to have experienced a ‘different’ climb, because even though you were right next to each other you experienced different slopes at different times, and rode at slightly different speeds in order to stay next to each other as the road curved, and in fact you didn’t even end up at exactly the same place because your friend is a few feet from you. You haven’t done literally the same climb, in the sense that a man can’t literally step twice in the same river (because at the time of the second step the river is not exactly the same, and neither is the man) but if someone said ‘how was your climb affected by your decision to ride on the right side of the lane rather than the middle of the lane’ we would all know what they mean; no reasonable person would say ‘if I had done the climb in the middle rather than the right it would have been a totally different climb.’

1 is just wrong (*). If you had gone north instead of south you might still have had a steep climb around hour 3, maybe it would have even been a steeper climb than the one you are on now, but there is no way it could have been the same climb…and the difference is not a trivial one like the “twice in the same river” example.

3 is not the right answer to the question that was asked, but maybe it’s the right answer to what the questioner had in mind. Maybe when they said “how would this climb have been different” they really meant something like, if you had gone the other way, “what would the biggest climb have been like”, or “what sort of hill would you be climbing just about now”?

I think you see where I’m going with this (since I doubt you really forgot all about climate and weather like I asked you to).  On a bike ride you are on a path through physical space, but suppose we were talking about paths through parameter space instead. In this parameterization, long steep climbs correspond to hurricane conditions, and going south instead of north corresponds to experiencing a world with global warming instead of one without. In the global warming world, we don’t experience ‘the same’ weather events that we would have otherwise, but in a slightly different way — like climbing the same hill in the middle of the lane rather than at the side of the lane — we experience entirely different weather events — like climbing different hills.

The specific quote that I cited in my previous post was about Hurricane Katrina. It makes no sense to say we don’t know whether Hurricane Katrina was affected by global warming, just as it would make no sense to say we don’t know whether our hill climb was affected by our decision to go south instead of north. In the counterfactual world New Orleans might have still experienced a hurricane, maybe even on the same day, but it would not have been the same hurricane, just as we might encounter a hill climb on our bike trip at around the three-hour mark whether we went south or north, but it would not have been the same climb.

No analogy is perfect, so please don’t focus on ways in which the analogy isn’t ‘right’. The point is that we are long past the point where global warming is a ‘butterfly effect’ and we can reasonably talk about how individual weather events are affected by it. We aren’t riding up the same road but in a slightly different place; we are in a different part of the territory.

(*) I’m aware that if you had ridden north instead of south you could have circled back and climbed this same climb. Also, it’s possible in principle that some billionaire could have paid to duplicate ‘the same’ climb somewhere to the north — grade the side of a mountain to make this possible, shape the land and the road to duplicate the southern climb, etc.  But get real. And although these are possible for a bike ride, at least in principle, they are not possible for the parameter space of weather and climate that is the real subject of this post.

This post is by Phil, not Andrew.

## Exchange with Deborah Mayo on abandoning statistical significance

The philosopher wrote:

The big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis.

Mayo is referring to, among other things, the proposal to “redefine statistical significance” as p less than 0.005. My colleagues and I do not actually like that idea, so I responded to Mayo as follows:

I don’t know what the big moves are, but my own perspective, and I think that of the three authors of the recent article being discussed, is that we should not be “rejecting” at all, that we should move beyond the idea that the purpose of statistics is to reject the null hypothesis of zero effect and zero systematic error.

I don’t want to ban speech, and I don’t think the authors of that article do, either. I’m on record that I’d like to see everything published, including Bem’s ESP paper data and various other silly research. My problem is with the idea that rejecting the null hypothesis tells us anything useful.

Mayo replied:

I just don’t see that you can really mean to say that nothing is learned from finding low-p values, especially if it’s not an isolated case but time and again. We may know a hypothesis/model is strictly false, but we do not yet know in which way we will find violations. Otherwise we could never learn from data. As a falsificationist, you must think we find things out from discovering our theory clashes with the facts–enough even to direct a change in your model. Even though inferences are strictly fallible, we may argue from coincidence to a genuine anomaly & even to pinpointing the source of the misfit. So I’m puzzled.
I hope that “only” will be added to the statement in the editorial to the ASA collection. Doesn’t the ASA worry that the whole effort might otherwise be discredited as anti-science?

My response:

The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster. See here.

Then Mayo:

I know all this. I’ve been writing about it for donkey’s years. But that’s a testing fallacy. N-P and Fisher couldn’t have been clearer. That does not mean we learn nothing from a correct use of tests. N-P tests have a statistical alternative and at most one learns, say, about a discrepancy from a hypothesized value. If a double blind RCT clinical trial repeatedly shows statistically significant (small p-value) increase in cancer risks among exposed, will you deny that’s evidence?

Me:

I don’t care about the people, Neyman, Fisher, and Pearson. I care about what researchers do. They do something called NHST, and it’s a disaster, and I’m glad that Greenland and others are writing papers pointing this out.

Mayo:

We’ve been saying this for years and years. Are you saying you would no longer falsify models because some people will move from falsifying a model to their favorite alternative theory that fits the data? That’s crazy. You don’t give up on correct logic because some people use illogic. The clinical trials I’m speaking about do not commit those crimes. Would you really be willing to say that they’re all bunk because some psychology researchers do erroneous experiments and make inferences to claims where we don’t even know we’re measuring the intended phenomenon?
Ironically, by the way, the Greenland argument only weakens the possibility of finding failed replications.

Me:

I pretty much said it all here.

I don’t think clinical trials are all bunk. I think that existing methods, NHST included, can be adapted to useful purposes at times. But I think the principles underlying these methods don’t correspond to the scientific questions of interest, and I think there are lots of ways to do better.

Mayo:

And I’ve said it all many times in great detail. I say drop NHST. It was never part of any official methodology. That is no justification for endorsing official policy that denies we can learn from statistically significant effects in controlled clinical trials among other legitimate probes. Why not punish the wrong-doers rather than all of science that uses statistical falsification?

Would critics of statistical significance tests use a drug that resulted in statistically significant increased risks in patients time and again? Would they recommend it to members of their family? If the answer to these questions is “no”, then they cannot at the same time deny that anything can be learned from finding statistical significance.

Me:

In those cases where NHST works, I think other methods work better. To me, the main value of significance testing is: (a) when the test doesn’t reject, that tells you your data are too noisy to reject the null model, and so it’s good to know that, (b) in some cases as a convenient shorthand for a more thorough analysis, and (c) for finding flaws in models that we are interested in (as in chapter 6 of BDA). I would not use significance testing to evaluate a drug, or to prove that some psychological manipulation has a nonzero effect, or whatever, and those are the sorts of examples that keep coming up.

In answer to your previous email, I don’t want to punish anyone, I just think statistical significance is a bad idea and I think we’d all be better off without it. In your example of a drug, the key phrase is “time and again.” No statistical significance is needed here.

Mayo:

One or two times would be enough if they were well controlled. And the ONLY reason they have meaning even if it were time and time again is because they are well controlled. I’m totally puzzled as to how you can falsify models using p-values & deny p-value reasoning.

As I discuss through my book, Statistical Inference as Severe Testing, the most important role of the severity requirement is to block claims—precisely the kinds of claims that get support under other methods be they likelihood or Bayesian.
Stop using NHST—there’s a speech ban I can agree with. In many cases the best way to evaluate a drug is via controlled trials. I think you forget that for me, since any claim must be well probed to be warranted, estimations can still be viewed as tests.
I will stop trading in biotechs if the rule to just report observed effects gets passed and the responsibility that went with claiming a genuinely statistically significant effect goes by the board.

That said, it’s fun to be talking with you again.

Me:

I’m interested in falsifying real models, not straw-man nulls of zero effect. Regarding your example of the new drug: yes, it can be solved using confidence intervals, or z-scores, or estimates and standard errors, or p-values, or Bayesian methods, or just about anything, if the evidence is strong enough. I agree there are simple problems for which many methods work, including p-values when properly interpreted. But I don’t see the point of using hypothesis testing in those situations either—it seems to make much more sense to treat them as estimation problems: how effective is the drug, ideally for each person or else just estimate the average effect if you’re ok fitting that simpler model.

I can blog our exchange if you’d like.

And so I did.
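To make the estimation framing from that last reply concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not anything from the exchange itself: the function name `summarize_effect`, the two-group setup, and the outcome numbers are all hypothetical, and the interval and p-value rely on a normal approximation. It shows that the z-score, the p-value, and the 95% interval are all computed from the same two numbers, the estimate and its standard error.

```python
import math

def summarize_effect(treated, control):
    """Estimate an average treatment effect as a difference in means.

    Returns the estimate, its standard error, an approximate 95% interval,
    the z-score, and a two-sided p-value (normal approximation).
    """
    n_t, n_c = len(treated), len(control)
    mean_t = sum(treated) / n_t
    mean_c = sum(control) / n_c
    # Sample variances with the usual n - 1 denominator
    var_t = sum((x - mean_t) ** 2 for x in treated) / (n_t - 1)
    var_c = sum((x - mean_c) ** 2 for x in control) / (n_c - 1)

    estimate = mean_t - mean_c
    se = math.sqrt(var_t / n_t + var_c / n_c)
    z = estimate / se
    # Two-sided tail area of a standard normal beyond |z|
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "estimate": estimate,
        "se": se,
        "interval": (estimate - 1.96 * se, estimate + 1.96 * se),
        "z": z,
        "p_value": p_value,
    }

# Hypothetical outcome measurements, made up purely for illustration
treated = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
control = [4.2, 5.0, 3.9, 5.5, 4.7, 4.4]

summary = summarize_effect(treated, control)
for name, value in summary.items():
    print(name, value)
```

Reporting the estimate and its uncertainty directly keeps the question of how large the effect is in view, and the threshold comparison adds no information beyond those two numbers.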

P.S. Tomorrow’s post: My math is rusty.

## I hate Bayes factors (when they’re used for null hypothesis significance testing)

Oliver Schultheiss writes:

I am a regular reader of your blog. I am also one of those psychology researchers who were trained in the NHST tradition and who is now struggling hard to retrain himself to properly understand and use the Bayes approach (I am working on my first paper based on JASP and its Bayesian analysis options). And then tonight I came across this recent blog by Uri Simonsohn, “If you think p-values are problematic, wait until you understand Bayes Factors.”

I assume that I am not the only one who is rattled by this (or I am the only one, and this just reveals my lingering deeper ignorance about the Bayes approach) and I was wondering whether you could comment on Uri’s criticism of Bayes Factors on your own blog.

My reply: I don’t like Bayes factors; see here. I think Bayesian inference is very useful, but Bayes factors are based on a model of point hypotheses that typically does not make sense.
To put it another way, I think that null hypothesis significance testing typically does not make sense. When Bayes factors are used for null hypothesis significance testing, I generally think this is a bad idea, and I don’t think it typically makes sense to talk about the probability that a scientific hypothesis is true.

More discussion here: Incorporating Bayes factor into my understanding of scientific information and the replication crisis. The problem is not so much with the Bayes factor as with the idea of null hypothesis significance testing.
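The point-hypothesis structure is easiest to see in the simplest normal case. The sketch below is an illustration only, with every name and number an assumption rather than something taken from the posts linked above: an estimate with known standard error `se`, a point null of exactly zero effect, and an alternative that places a normal prior with hypothetical scale `prior_sd` on the effect.

```python
import math

def normal_pdf(x, sd):
    """Density at x of a normal distribution with mean 0 and standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bayes_factor_null(estimate, se, prior_sd):
    """Bayes factor in favor of H0: theta = 0 over H1: theta ~ Normal(0, prior_sd^2),

    assuming estimate ~ Normal(theta, se^2) with se treated as known.
    """
    # Under H0 the estimate is Normal(0, se^2)
    marginal_null = normal_pdf(estimate, se)
    # Under H1 the estimate is Normal(0, se^2 + prior_sd^2) after integrating out theta
    marginal_alt = normal_pdf(estimate, math.sqrt(se ** 2 + prior_sd ** 2))
    return marginal_null / marginal_alt

# Same hypothetical data (an estimate two standard errors from zero),
# three hypothetical choices of prior scale under the alternative
for prior_sd in [1.0, 10.0, 100.0]:
    print(prior_sd, bayes_factor_null(estimate=2.0, se=1.0, prior_sd=prior_sd))
```

With the data held fixed, the Bayes factor in this setup moves from below 1 to well above 1 as the alternative's prior scale grows, so the answer depends heavily on how the comparison to an exact zero is set up.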

## Was Thomas Kuhn evil? I don’t really care.

OK, I guess I care a little . . . but when it comes to philosophy, I don’t really care about Kuhn’s personality or even what exactly he said in his books. I use Kuhn in my work, by which I mean that I use an idealized Kuhn, I take the best from his work (as I see it), the same way I use an idealized Lakatos and Popper, and the same way that Lakatos famously used an idealized Popper (Lakatos called him Popper₂, I think it was).

Here’s what Shalizi and I wrote in our article:

We focus on the classical ideas of Popper and Kuhn, partly because of their influence in the general scientific culture and partly because they represent certain attitudes which we believe are important in understanding the dynamic process of statistical modelling.

Actually, we said “modeling,” but someone translated our article into British for publication. Anyway . . . we continue:

The two most famous modern philosophers of science are undoubtedly Karl Popper (1934/1959) and Thomas Kuhn (1970), and if statisticians (like other non-philosophers) know about philosophy of science at all, it is generally some version of their ideas. . . . We do not pretend that our sketch fully portrays these figures, let alone the literatures of exegesis and controversy they inspired, or even how the philosophy of science has moved on since 1970. . . .

To sum up, our views are much closer to Popper’s than to Kuhn’s. The latter encouraged a close attention to the history of science and to explaining the process of scientific change, as well as putting on the agenda many genuinely deep questions, such as when and how scientific fields achieve consensus. There are even analogies between Kuhn’s ideas and what happens in good data-analytic practice. Fundamentally, however, we feel that deductive model checking is central to statistical and scientific progress, and that it is the threat of such checks that motivates us to perform inferences within complex models that we know ahead of time to be false.

My point here is that, as applied statisticians rather than philosophers or historians, we take what we can use from philosophy, being open about our ignorance of most of the literature in that field. Just as applied researchers pick and choose statistical methods in order to design and analyze their data, we statisticians pick and choose philosophical ideas to help us understand what we are doing.

For example, we write:

In some way, Kuhn’s distinction between normal and revolutionary science is analogous to the distinction between learning within a Bayesian model, and checking the model in preparation to discarding or expanding it. Just as the work of normal science proceeds within the presuppositions of the paradigm, updating a posterior distribution by conditioning on new data takes the assumptions embodied in the prior distribution and the likelihood function as unchallengeable truths. Model checking, on the other hand, corresponds to the identification of anomalies, with a switch to a new model when they become intolerable. Even the problems with translations between paradigms have something of a counterpart in statistical practice; for example, the intercept coefficients in a varying-intercept, constant-slope regression model have a somewhat different meaning than do the intercepts in a varying-slope model.

This is all fine, but we recognize:

We do not want to push the analogy too far, however, since most model checking and model reformulation would by Kuhn have been regarded as puzzle-solving within a single paradigm, and his views of how people switch between paradigms are, as we just saw, rather different.

We’re trying to make use of the insights that Kuhn brought to bear, without getting tied up in what Kuhn’s own position was on all this. Kuhnianism without Kuhn, one might say.

Anyway, this all came up because Mark Brown pointed me to this article by John Horgan reporting that Errol Morris thinks that Kuhn was, in Horgan’s words, “a bad person and bad philosopher.”

Errol Morris! He’s my hero. If he hates Kuhn, so do I. Or at least that’s my default position, until further information comes along.

Actually, I do have further information about Kuhn. I can’t say I knew the guy personally, but I did take his course at MIT. Actually, I just came to the first class and dropped it. Hey . . . didn’t I blog this once? Let me check . . . yeah, here it is, from 2011—and I wrote it in response to Errol Morris’s story, the first time I heard about it! I’d forgotten this entirely.

There’s one thing that makes me a little sad. Horgan writes that Morris’s book features “interviews with Noam Chomsky, Steven Weinberg and Hilary Putnam, among other big shots.” I think there must be people with more to say than these guys. The problem may be that once an author reaches the celebrity stratosphere, he will naturally mingle with other celebrities. If I’m reading a book about philosophy of science, I’d rather see an interview with Steve Stigler, or Josh Miller, or Deborah Mayo, or Cosma Shalizi, or various working scientists with historical and philosophical interests. But it can be hard to find such people, if you’re coming from the outside.