In searching for the immortal phrase, “piss-poor monocausal social science,” I came across this amusing story of two public intellectuals discrediting each other.

But then this made me wonder . . . did the lawsuit ever happen? Here’s what the headline said:

Niall Ferguson threatens to sue over accusation of racism

Historian claims writer Pankaj Mishra accused him of racism and must apologise or face court action

I googled and . . . it looks like Mishra never apologised, but the promised court action from Ferguson never happened. Dude must’ve been too busy making fun of Keynes for being gay and marrying a ballerina and talking about poetry.

People are just suing each other all the time. So let’s take a moment to celebrate an instance when someone decided not to.

I published a blog post in which I reanalyze the results of Chernozhukov et al. (2021) on the effects of NPIs in the US during the first wave of the pandemic and, if you have time to take a look at it, I’d be curious to hear your thoughts.

Here is a summary that recaps the main points:

– The effects of non-pharmaceutical interventions on the COVID-19 pandemic are very difficult to evaluate. In particular, most studies on the issue fail to adequately take into account the fact that people voluntarily change their behavior in response to changes in epidemic conditions, which can reduce transmission independently of non-pharmaceutical interventions and confound the effect of non-pharmaceutical interventions.

– Chernozhukov et al. (2021) is unusually mindful of this problem and the authors tried to control for the effect of voluntary behavioral changes. They found that, even when you take that into account, non-pharmaceutical interventions led to a substantial reduction in cases and deaths during the first wave in the US.

– However, their conclusions rest on dubious assumptions, and are very sensitive to reasonable changes in the specification of the model. When the same analysis is performed on a broad range of plausible specifications of the model, none of the effects are robust. This is true even for their headline result about the effect of mandating face masks for employees of public-facing businesses.

– Another reason to regard even this result as dubious is that, when the same analysis is performed to evaluate the effect of mandating face masks for everyone and not just employees of public-facing businesses, the effect totally disappears and is even positive in many specifications. The authors collected data on this broader policy, so they could have performed this analysis in the paper, but they failed to do so despite speculating in the paper that mandating face masks for everyone could have a much larger effect than just mandating them for employees.

– This suggests that something is wrong with the kind of model Chernozhukov et al. used to evaluate the effects of non-pharmaceutical interventions. In order to investigate this issue, I fit a much simpler version of this model on simulated data and find that, even in very favorable conditions, the model performs extremely poorly. I also show with placebo tests that it can easily find spurious effects. This is a problem not just for this particular study, but for any study that relies on that kind of model to study the effects of non-pharmaceutical interventions.

– To be clear, as I stress in the conclusion, this doesn’t mean that mask-wearing doesn’t reduce transmission, because this paper evaluated the effect of mandating mask wearing, which is not the same thing. It may be that, as another study recently found (though I have no idea how good this paper is), mandates don’t really matter because people who are going to wear masks do so even if they’re not legally required to do so.

Anyway, since you disagreed with my harsh take on Flaxman et al.’s paper about the effects of NPIs in Europe during the first wave, I was curious to know your thoughts about this other study.

I replied that I agree with Lemoine’s general point that it’s very hard to untangle the effects of any particular policy, given that so much depends on behavior. Another complication is the desire for definitive results. From the other direction, I see the value of quantitative analyses, as some policy choices need to be made.

Lemoine responded:

On the need to make policy choices and what it means for what should be done with quantitative analyses, I think it’s a very complicated issue. I was a hawk on COVID-19 before it was cool and, back in March, I was in favor of the first lockdown. I changed my mind after that because I became convinced that, whatever their precise effects (I think it’s impossible to estimate them with anything resembling precision), they couldn’t be huge otherwise we’d see it much more easily (as with vaccination) and they generally needed to be huge in order to have a chance of passing a cost-benefit test. One reason I came to deeply regret my initial support for lockdowns is that I have since then realized they have become a sort of institutionalized default response, which is something I think I should have predicted but didn’t, so this has taught me the wisdom of requiring a much higher level of confidence in social scientific results before acting on them. (I’m French and here we have been under a curfew and bars/restaurants have remained completely closed between last October and May of this year!)

In response to my question about what exactly was meant by “lockdown,” Lemoine pointed to his post arguing against lockdowns and added:

I [Lemoine] think it has been a problem in those debates on both sides, but it’s not really a problem in Chernozhukov et al. (2021) since they look at pretty specific policies. My impression is that, when people talk about “lockdowns”, they have in mind a vague set of particularly stringent restrictions such as curfews, closure of “non-essential businesses” and stay-at-home orders. In any case, this is what I’m referring to when I use this term, though in my work I usually talk about “restrictions” and state my position as the claim that, whatever the precise effects of the most stringent restrictions (again things like curfews, closure of “non-essential businesses” and stay-at-home orders) are, they are not plausibly large enough for those policies to pass a cost-benefit test when you take into account their immediate effects on people’s well-being, because even when I make preposterous assumptions about their effects on transmission and do a back-of-the-envelope cost-benefit analysis the results come out as incredibly lopsided against those policies. This is still vague but I think not too vague. In particular, I don’t think mask mandates of any kind count as “lockdowns”, nor do I think that anyone does, even the fiercest opponents of those mandates.

I did not have the energy to read Chernozhukov et al.’s paper or Lemoine’s criticism in detail, but as noted above I am sympathetic with Lemoine’s general point that it is difficult to untangle causal effects of policies—and this difficulty persists even if, like Chernozhukov et al., you are fully aware of these difficulties and trying your best to address them. We had a similar discussion a few years ago regarding the deterrent effect of the death penalty, a topic that has seen many quantitative studies of varying quality but which, as Donohue and Wolfers explained, is pretty much impossible to figure out from empirical data. Effects of policies on disease spread should be easier to estimate, as the causal mechanism is much clearer, but we still have the problem of multiple interventions done at the same time, interventions motivated by existing conditions (which can be addressed statistically, but results will be necessarily sensitive to details of how the adjustment is done), effects that vary from one jurisdiction to another, and unclear relationships between behavior and policy. For example, when they closed the schools here in New York City, lots of parents were pulling their kids out of school and lots of teachers were not planning to keep showing up, so the school closing could be thought of as a coordination policy as much as a mandate. And then there are annoying policies such as closing parks and beaches, which nobody really thinks would have much effect on disease spread but represent some sort of signal of seriousness. And the really big thing which is people lowering the spread of disease by avoiding social situations, avoiding talking into each others’ faces, etc. From a policy standpoint it’s hard for me to hold all this in my head at once, especially because I’m really looking forward to teaching in person this fall, masked or otherwise. 
One of the points of a statistical analysis is to be able to integrate different sources of information—a multivariate probability distribution can “hold all this in its head at once” even when I can’t . . . ummm, at this point I’m just babbling. Speaking as a statistician, let me just say that it’s important to see the trail of breadcrumbs showing how the conclusions came from the data, scientific assumptions, and statistical model, starting from simple comparisons and then doing adjustments from there. I think the sorts of analyses of Chernozhukov et al. and Lemoine should be helpful in taking us in this direction.

Thought you might like this example from the leaked CDC slides. One of the big claims being repeated in the media is that “Infections in vaccinated Americans are rare, compared with those in unvaccinated people, the document said. But when they occur, vaccinated people may spread the virus just as easily.” (NYT) That is, this focuses on possible equivalence (vs. not) within some subpopulation who get infected. And, of course, the vaccine affects who gets infected and whether it gets reported and included in the sample.

This is apparently based on the results on this slide:

The first bullet is a comparison within vaccinated people who have reported breakthrough cases. Based on 19 such cases with Delta, this suggests ~10 times increase in viral load associated with Delta. (One widely reported comparison, cited by Dr. Fauci earlier this week, in viral load for Delta is ~1000 times, so this would actually be much lower than that.)

The second bullet is a comparison — for one particular outbreak — of vaccinated and unvaccinated cases in a cluster associated with Provincetown’s extensive July 4th parties. It seems like an interesting question here is whether this conditioning on a known infection makes sense.

(Other outlets focus on a different dichotomization of these results, saying for example, “New data suggests vaccinated people could transmit delta variant” as if this is new information at all!)

To me, this all gets at how valuable it is to think about things in degrees (if not fully quantitatively) and comparatively rather than reducing everything to 0 or non-zero.

Obviously, I am not an infectious disease biologist, but this seems like a nice example of dichotomization, conditioning on post-treatment variables (which can sometimes make sense — does it here?), and science communication.

Interesting point about conditioning on a known infection. The implicit causal model in the comparison is that the infection is something that just can happen to you, but I take Eckles’s point to be that, if you know that a vaccinated person was infected, that fact tells us something about that person: some combination of behavioral and biological information that we would expect to be relevant to the rate at which they spread the virus. Thus, it could be true that infected vaccinated people spread the virus as easily as infected non-vaccinated people, but that statement could be rephrased from a latent-variable perspective as “the sorts of vaccinated people who are likely to get infected are the sorts of people who are more likely to spread the virus,” without necessarily implying that the effect of being infected on spreading the virus is the same among vaccinated and unvaccinated people. I agree with Eckles that these questions can get very tangled.
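To make the latent-variable reading concrete, here is a minimal sketch with two made-up types of people (low-contact and high-contact). Every number in it is a hypothetical assumption, not an estimate from the CDC slides: the contact counts, the per-contact infection risks, and the assumed 20% within-person reduction in onward spread for vaccinated people. The only point is to show how conditioning on infection can change who is in each group.

```python
# All numbers are hypothetical, chosen only to illustrate selection on infection.
SHARE_HIGH = 0.2                                   # assumed share of high-contact people
CONTACTS = {"low": 2, "high": 40}                  # assumed contacts per person
PER_CONTACT_RISK = {"unvax": 0.05, "vax": 0.005}   # assumed infection risk per contact
INFECTIOUSNESS = {"unvax": 1.0, "vax": 0.8}        # assumed within-person effect on spread


def infection_prob(status, kind):
    """Chance of at least one infection across all of a person's contacts."""
    return 1.0 - (1.0 - PER_CONTACT_RISK[status]) ** CONTACTS[kind]


def share_high_among_infected(status):
    """Bayes' rule: fraction of infected people who are high-contact."""
    high = SHARE_HIGH * infection_prob(status, "high")
    low = (1.0 - SHARE_HIGH) * infection_prob(status, "low")
    return high / (high + low)


def mean_spread_among_infected(status):
    """Average onward-spread index (contacts x infectiousness) among the infected."""
    s = share_high_among_infected(status)
    mean_contacts = s * CONTACTS["high"] + (1.0 - s) * CONTACTS["low"]
    return mean_contacts * INFECTIOUSNESS[status]


for status in ("unvax", "vax"):
    print(status,
          round(share_high_among_infected(status), 3),
          round(mean_spread_among_infected(status), 1))
```

Under these assumptions the infected vaccinated group is more heavily high-contact than the infected unvaccinated group, so the group-level comparison among infected people sits closer to 1 than the assumed within-person ratio of 0.8. Different assumptions would give different answers, which is rather the point.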

I’m not a statistician, but one thing I’ve noticed is that most (or all?) of the percentage change plots that I’ve seen don’t use a logarithmic scale. I think the logarithmic scale would be better, since most people are better at mentally performing addition operations, and the cumulative effect of percent-changes is multiplicative.

The log-scale would reflect this, by making the cumulative effects over many time steps additive with respect to the scale of the plot.

For example, it seems like it would make a lot of these plots easier to interpret.

I replied that I agree, but it’s controversial, and I pointed to Jessica’s recent post on the topic.

Lamb responded:

I was specifically thinking about percentage growth rates, for example GDP growth per year like: 5%, 10%, -10%, …, 1%. One thing I noticed is that if you compare countries which have had really good economic growth like Malaysia or China vs. unstable countries with low growth, the unstable country’s growth plot often has a higher net area under the curve in total, since it’s a mix of years with very high positive growth and negative growth. The negative growth years actually count for much more due to the multiplicative interaction. If you plot on the log scale, then net area under the curve actually is a correct measure for total growth.

For absolute measures like how many people have gotten coronavirus, it’s less obvious to me if log scale is the right choice. I think log scale makes it easier to discriminate between different exponential growth rates, but makes it much harder to discriminate between exponential and non-exponential growth rates.
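Lamb's point about growth rates can be checked with a few lines of arithmetic. The sketch below uses two invented ten-year growth series (a steady 5% per year, and alternating +35% / -20%); neither is real GDP data. It compares the raw sum of percentage changes, which is what "area under the curve" measures on the usual scale, with the sum of log growth factors.

```python
import math

# Hypothetical yearly growth rates, not real data.
stable = [0.05] * 10            # steady 5% per year
volatile = [0.35, -0.20] * 5    # alternating boom and bust


def total_growth(rates):
    """Compound the yearly rates and return overall growth as a fraction."""
    level = 1.0
    for r in rates:
        level *= 1.0 + r
    return level - 1.0


def log_area(rates):
    """Sum of log growth factors: the 'area' on a log scale."""
    return sum(math.log1p(r) for r in rates)


for name, rates in (("stable", stable), ("volatile", volatile)):
    print(name,
          "raw area:", round(sum(rates), 3),
          "log area:", round(log_area(rates), 3),
          "total growth:", round(total_growth(rates), 3))
```

Here the volatile series has the larger raw area but the smaller compounded growth, while exponentiating the log-scale area recovers total growth exactly for both series.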

Uh oh, don’t talk about Malaysia, it will bring the racists out of the woodwork!

This post is by Lizzie. I also took the kitten photo — there’s a white paw taking up much of the foreground and a little gray tail in the background. As this post is about uncertainty, I thought maybe it worked.

I was back east for work in June, drifting from Boston to Hanover, New Hampshire and seeing a couple colleagues along the way. These meetings were always outside, often in the early evenings, and so they sit in my mind with the lovely luster of nice spring weather in the northeast, with the sun glinting in at just the right angle.

One meeting was sitting on a little sloping patch of grass in a backyard in Arlington, where I was chatting with a former postdoc, who now works for a consulting company tightly intertwined with US government. When he was in my lab he and I learned Bayesian statistics (and Stan), and I asked him how much he was using Bayesian approaches. He smiled slyly at me and told me a story about a recent meeting he was at where one of the senior people said:

“No regulatory body uses Bayesian statistics to make decisions.”

He quickly added that he’s not at all sure this is true, but that it encapsulates a perspective that is not uncommon in his world.

The next meeting was next to the Connecticut river and with a senior ecologist, who works on issues with some real policy implications: how to manage beetle populations as they take off for the north with warming (hello, or should I say goodbye, New Jersey pine barrens), the thawing Arctic, and more. I was asking him if he thought this statement was true, which he didn’t answer, but set off on a different declaratory statement:

“The problem with Bayesian statistics is their emphasis on uncertainty.”

Ah. Uncertainty. Do you think uncertainty is the most commonly used word in the title of blog posts here? (Some recent posts here, here and here.)

In response to my colleague I may have blurted out something like ‘but I love uncertainty!’ or ‘that is a great thing about Bayesian!’ and so the conversation veered deeply into a ditch, from which I am not sure that it ever recovered. I said something along the lines of, isn’t it better to have all that uncertainty out in the middle of the room? Rather than trying to fit it under the cushions of the sofa as I feel so many ecologists do when they do their models in sequential steps, dropping off uncertainty along the way (often using p-values or delta AIC values of 2 or…) to drive ahead to their imaginary land of near-certainty? (I know at some point I also poorly steered it towards my thoughts on whether climate change scientists have done themselves a service or disservice in shying away from communicating uncertainty; I regret that.)

We left mired in the muck that so many of the ecologists around me feel about Bayesian — too much emphasis on uncertainty, too little concrete information that could lead to decision making.

So I pose this back to you all: what should I have said in response to either of these remarks? I am looking for excellent information, and persuasive viewpoints.

I’ll open the floor with what I thought a good reply from Michael Betancourt for the first quote: fisheries, and that Bayesian gives better options to steer policy. For example, if you want maximum sustainable yield without crashing a fish stock, you can more easily suggest a quantile of catch that puts you a little more firmly in ‘non-crashing’ outcome.

Under the subject line, “A potentially dubious study making the rounds, re police shootings,” Gordon Danning links to this article, which begins:

Police use of force is a controversial issue, but the broader consequences and spillover effects are not well understood. This study examines the impact of in utero exposure to police killings of unarmed blacks in the residential environment on black infants’ health. Using a preregistered, quasi-experimental design and data from 3.9 million birth records in California from 2007 to 2016, the findings show that police killings of unarmed blacks substantially decrease the birth weight and gestational age of black infants residing nearby. There is no discernible effect on white and Hispanic infants or for police killings of armed blacks and other race victims, suggesting that the effect reflects stress and anxiety related to perceived injustice and discrimination. Police violence thus has spillover effects on the health of newborn infants that contribute to enduring black-white disparities in infant health and the intergenerational transmission of disadvantage at the earliest stages of life.

My first thought is to be concerned about the use of causal language (“substantially decrease . . . no discernible effect . . . the effect . . . spillover effects . . . contribute to . . .”) from observational data.

On the other hand, I’ve estimated causal effects from observational data, and Jennifer and I have a couple of chapters in our book on estimating causal effects from observational data, so it’s not like I think this can’t be done.

So let’s look more carefully at the research article in question.

Their analysis “compares changes in birth outcomes for black infants in exposed areas born in different time periods before and after police killings of unarmed blacks to changes in birth outcomes for control cases in unaffected areas.” They consider this a natural experiment in the sense that dates of the killings can be considered as random.

Here’s a key result, plotting estimated effect on birth weight of black infants. The x-axis here is distance to the police killing, and the lines represent 95% confidence intervals:

There’s something about this that looks wrong to me. The point estimates seem too smooth and monotonic. How could this be? There’s no way that each point here represents an independent data point.

I read the paper more carefully, and I think what’s happening is that the x-axis actually represents maximum distance to the killing; thus, for example, the points at x=3 represent all births that are up to 3 km from a killing.
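Here is a minimal sketch of why cumulative thresholds produce smooth-looking curves. The per-ring estimates below are invented numbers, not values from the paper, and for simplicity the sketch assumes every 1 km ring contains the same number of births, which would not hold in practice. Each "up to k km" estimate is then just the running mean of the ring estimates, so neighboring points share most of their data.

```python
# Hypothetical, noisy, independent estimates for rings 0-1 km, 1-2 km, ..., 9-10 km.
ring_estimates = [-30, 5, -22, 10, -15, 8, -4, 12, -9, 6]

# Estimate for "all births within k km": running mean over the first k rings
# (assumes equal numbers of births per ring, for illustration only).
cumulative = []
running_sum = 0.0
for k, est in enumerate(ring_estimates, start=1):
    running_sum += est
    cumulative.append(running_sum / k)


def roughness(values):
    """Sum of squared jumps between neighboring points."""
    return sum((b - a) ** 2 for a, b in zip(values, values[1:]))


print("ring-by-ring roughness:", roughness(ring_estimates))
print("cumulative roughness:  ", round(roughness(cumulative), 1))
print([round(c, 1) for c in cumulative])
```

The ring-by-ring series bounces around, while the cumulative series drifts smoothly toward zero even though it is built from exactly the same noisy inputs.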

Also, the difference between “significant” and “not significant” is not itself statistically significant. Thus, the following statement is misleading: “The size of this effect is substantial for exposure during the first and second trimesters. . . . The effect of exposure during the third trimester, however, is small and statistically insignificant, which is in line with previous research showing reduced effects of stressors at later stages of fetal development.” This would be ok if they were to also point out that their results are consistent with a constant effect over all trimesters.

I have a similar problem with this statement: “The size of the effect is spatially limited and decreases with distance from the event. It is small and statistically insignificant in both model specifications at around 3 km.” Again, if you want to understand how effects vary by distance, you should study that directly, not make conclusions based on statistical significance of various aggregates.
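The trimester and distance comparisons above both run into the same arithmetic. As a sketch, take two hypothetical estimates (the numbers are invented, not from the paper) with equal standard errors, and assume the two estimates are independent, which is itself a simplification:

```python
import math

# Hypothetical estimates and standard errors (not taken from the paper).
early_est, early_se = 25.0, 10.0   # e.g. effect for an early trimester
late_est, late_se = 10.0, 10.0     # e.g. effect for a late trimester

z_early = early_est / early_se
z_late = late_est / late_se

# Difference between the two estimates, assuming they are independent.
diff = early_est - late_est
diff_se = math.hypot(early_se, late_se)
z_diff = diff / diff_se

print("early z:", round(z_early, 2))
print("late z: ", round(z_late, 2))
print("difference z:", round(z_diff, 2))
```

One estimate clears the conventional 1.96 threshold and the other does not, yet the difference between them is well within one and a half standard errors of zero. The data in this toy example would be consistent with a constant effect.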

The big question, though, is do we trust the causal attribution: as stated in the article, “the assumption that in the absence of police killings, birth outcomes would have been the same for exposed and unexposed infants.” I don’t really buy this, because it seems that other bad things happen around the same time as police killings. The model includes indicators for census tracts and months, but I’m still concerned.

I recognize that my concerns are kind of open-ended. I don’t see a clear flaw in the main analysis, but I remain skeptical, both of the causal identification and of forking paths. (Yes, the above graphs show statistically-significant results for the first two trimesters for some of the distance thresholds, but had the results gone differently, I suspect it would’ve been possible to find an explanation for why it would’ve been ok to average all three trimesters. Similarly, the distance threshold allows lots of places to find statistically significant results.)

So I could see someone reading this post and reacting with frustration: the paper has no glaring flaws and I still am not convinced by its conclusion! All I can say is, I have no duty to be convinced. The paper makes a strong claim and provides some evidence—I respect that. But a statistical analysis with some statistical significance is just not as strong evidence as people have been trained to believe. We’ve just been burned too many times, and not just by the Diederik Stapels, Brian Wansinks, etc., but also by serious researchers, trying their best.

I have no problem with these findings being published. Let’s just recognize that they are speculative. It’s a report of some associations, which we can interpret in light of whatever theoretical understanding we have of causes of low birth weight. It’s not implausible that mothers behave differently in an environment of stress, whether or not we buy this particular story.

I was curious so I went over to this wikipedia page, which at the time of this writing includes:

musicians/composers 12
actors/performers/motivational speakers 10
athletes 6
writers/journalists 4
politicians/activists 3
scientists 3
rich people / political commentators 2

Some of the people fell into multiple categories; for them, I picked one.

It’s an interesting mix, with Samuel Johnson as the top dog. More musicians and performers than I would’ve thought, but maybe that’s just who’s on wikipedia?

The most interesting story on the list is of this guy, who died from methanol poisoning while working in Antarctica. It’s possible he was murdered.

Yesterday I was among those copied when a correspondent wrote something I’d seen before and which had always stood out as totally wrong to me in the statistical context I usually work within (mainly regression coefficients for generalized linear models with canonical link functions like the normal-linear, binomial-logistic, and log-linear proportional-hazards models as applied to human health and medical data).

Here’s the offending quote:

“The non-informative prior is the current default in almost all statistical application. It’s used either implicitly when people interpret the usual confidence interval as if it is a credible interval, or explicitly when people do a Bayesian analysis with a ‘non-informative’ prior. This is a really big mistake. The uniform prior is far from non-informative! In fact, it represents the prior belief that effects are likely to be very large, and also that the (actual, achieved) power is likely to be very large. We can see in the Cochrane data that this is not true for RCTs [randomized clinical trials]. Consequently, the uniform prior leads to considerable overestimation in RCTs.”

– As Andrew (who has made the same claim) has said in other contexts, No, no, no!: I think the claim that an improper uniform prior is highly informative is the really big mistake; it is its lack of information that justifies turning to priors derived from other RCTs.

There is no unique measure of information, but among those I’ve seen in common use in both statistics and communications engineering, the improper uniform prior contains zero information. For example, its Fisher information is zero (or to be more technical, zero is the limiting information of any regular proper prior distribution that converges to uniform as its scale is allowed to expand without bound); likewise, another measure of the information in the prior, the Kullback-Leibler information divergence (KLID) from the posterior to the normalized likelihood function, is zero under an improper uniform prior.

Now this pair of facts is almost the same result, since the Fisher information is the coefficient of the first nonvanishing term in an expansion of the KLID, but the same fact comes up with all the variations of information measures for distributions I’ve seen in the literature on the topic: The improper uniform prior is indeed non-informative.

One way to describe the situation in informal betting (“operational”) Bayesian terms is that the claim about an improper uniform prior overlooks how the information content of the prior is a function of its concentration of belief (as measured by expected gain or loss). With an improper uniform prior in binomial logistic or normal linear regression coefficients, the posterior bets depend only on the likelihood function (as is implicit in treating the maximum-likelihood statistics as posterior summaries), which seems as good a definition as any of complete lack of useful a priori guiding information (i.e., a state of total ignorance before seeing the current data).

Generalizing, all the weakly informative and reference-Bayes priors proposed to replace the uniform have very little information compared to the likelihood function in all but the tiniest real studies. That’s because those priors typically have the information content of 1 or 2 observations according to some familiar information measure. What makes the improper uniform prior distasteful to me is that we never, ever have zero uncontested prior information: The very fact that anyone would do a formal study shows there is plenty of information that the effect in question cannot be so huge as to be obvious without such a study (which is why there are no RCTs of having a parachute vs. nothing when jumping off a 1000 meter drop).

Intuitively, I think we could agree that someone must be completely ignorant of the research world as well as of the specific topic if they claim that any huge effect you can name, such as a causal rate ratio of 0.0001 or 10000, is as a priori probable as say 0.1 or 10, which is what is implied by the improper uniform prior for the log rate ratio (i.e., the proportional-hazards coefficient). But replacement of an improper uniform prior with a vague proper prior capturing uncontested prior information only matters when the likelihood (or estimating) function would not swamp that information.
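The interplay between prior scale and likelihood can be sketched with the textbook normal-normal update, used here as a stand-in for the large-sample behavior of a regression coefficient. The estimate (0.4) and standard error (0.2) are hypothetical, and the zero-centered normal prior is an assumption made for illustration, not a recommendation from the text above.

```python
def normal_posterior(mle, se, prior_mean, prior_sd):
    """Conjugate update for a normal likelihood summary and a normal prior."""
    prior_precision = 1.0 / prior_sd ** 2   # information carried by the prior
    data_precision = 1.0 / se ** 2          # information carried by the likelihood
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * mle) / post_precision
    return post_mean, post_precision ** -0.5


# Hypothetical log rate ratio estimate and standard error.
mle, se = 0.4, 0.2

for prior_sd in (0.5, 2.0, 10.0, 1e3, 1e6):
    mean, sd = normal_posterior(mle, se, prior_mean=0.0, prior_sd=prior_sd)
    print(f"prior sd {prior_sd:>9}: prior precision {1.0 / prior_sd ** 2:.2e}, "
          f"posterior mean {mean:.4f}, posterior sd {sd:.4f}")
```

As the prior scale grows, the prior precision goes to zero and the posterior summaries approach the maximum-likelihood estimate and its standard error. A tighter prior only moves the answer noticeably when its precision is not negligible next to that of the data.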

The bottom line is that to blame a uniform or vague prior for overestimation is to evade our responsibility to use the real-world information we have: Namely, that if a treatment needs an RCT to settle debate about whether its effect is large enough to care about, that fact alone should narrow our prior dramatically in comparison to typical “weakly informative” priors, and forms a valid empirical basis for considering recent proposals for shrinkage based on RCT databases.

Rob Trangucci points us to this paper by William Stephenson, Soumya Ghosh, Tin Nguyen, Mikhail Yurochkin, Sameer Deshpande, and Tamara Broderick. I’m posting it here because it involves GPs, so Aki should be interested too.

Related ideas:

Static sensitivity analysis (for example section 6.3 here)

Nothing new from me here, just the usual topic of trying to develop tools for understanding fitted models by considering various versions of d(inference)/d(input). I’m blogging it because it will be easier to find things that way.

I happened to look up the classic programming book Code Complete (fully, “Code Complete: A Practical Handbook of Software Construction, Second Edition,” by Steve McConnell) and I learned two amusing things when scrolling down the page:

1. It says, “You last purchased this item on September 9, 2004.” Wow! I bought it, probably on Bob Carpenter’s recommendation, but never read it. Maybe it’s still sitting in my office somewhere? I should really read the damn book already.

2. Also this:

“The Dating Playbook For Men,” huh? I guess the kind of people who would buy a book on coding might buy this one too! Nothing wrong with dating advice; it’s just amusing to see it here. I clicked through, and this Dating Advice book seems controversial. 69% of its ratings are positive, but the three top reviews are very negative. Code Complete seems to be more universally loved.

Something came up where I realized I was wrong. It wasn’t a mathematical error; it was a statistical model that was misleading when I tried to align it with reality. And this made me realize that there was something I was misunderstanding about potential outcomes and causal inference. And then I thought: If I’m confused, maybe some of you will be confused on this too, so maybe it would be helpful for me to work this through out loud.

Here’s the story.

It started with a discussion about the effectiveness of an experimental coronavirus treatment. The doctor wanted to design the experiment to detect an effect of 0.25—that is, he hypothesized that survival rate would be 25 percentage points higher in the treatment group than among the controls. For example, maybe the survival rate would be 20% in the control group and 45% in the treated group. Or 40% and 65%, or whatever.

I was trying to figure out what this meant, and I framed it in terms of potential outcomes. Consider four types of patients:
1. The people who would survive under the treatment and would survive under the control,
2. The people who would survive under the treatment but would die under the control,
3. The people who would die under the treatment but would survive under the control,
4. The people who would die either way.
Then the assumption is that p2 – p3 = 0.25.

For simplicity, suppose that p3=0, so that the treatment can save your life but it can’t kill you. Then the effect of the treatment in the study is simply the proportion of type 2 people in the experiment. That’s it. Nothing more and nothing less. To “win” in designing this sort of experiment, you want to lasso as many type 2 people into your study and minimize the number of type 1 and type 4 people. (And you really want to minimize the number of type 3 people, but here we’re assuming they don’t exist.)

I liked this way of thinking about the problem, partly because it connected analysis back to design and partly because it made it super-clear that there is no Platonic “treatment effect” here. The treatment effect depends entirely on who’s in the study, full stop. The treatment is what it is, but its effect relies on the experiment including enough type 2 people. This also makes it clear that there’s nothing special about the hypothesized effect of 25 percentage points, as the exact same treatment would have an effect of 10 percentage points, say, in a study that’s diluted with type 1 and type 4 people (those who don’t need the treatment, or those for whom the treatment wouldn’t help anyway).
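The four-type bookkeeping above can be written out in a few lines. In this sketch the type proportions for the two studies are hypothetical; they are chosen so that the first matches the 20% versus 45% survival example and the second gives the diluted 10-point effect.

```python
def survival_rates(p1, p2, p3, p4):
    """Return (control survival, treated survival) from the four type proportions.

    p1: survive either way, p2: survive only if treated,
    p3: survive only if untreated, p4: die either way.
    """
    assert abs(p1 + p2 + p3 + p4 - 1.0) < 1e-9, "proportions must sum to 1"
    control = p1 + p3   # types 1 and 3 survive under control
    treated = p1 + p2   # types 1 and 2 survive under treatment
    return control, treated


def treatment_effect(p1, p2, p3, p4):
    """Average effect on survival: treated minus control, which equals p2 - p3."""
    control, treated = survival_rates(p1, p2, p3, p4)
    return treated - control


# Hypothetical study populations, both with p3 = 0.
enriched = dict(p1=0.20, p2=0.25, p3=0.0, p4=0.55)
diluted = dict(p1=0.50, p2=0.10, p3=0.0, p4=0.40)

for name, mix in (("enriched", enriched), ("diluted", diluted)):
    control, treated = survival_rates(**mix)
    print(name, "control:", control, "treated:", treated,
          "effect:", round(treatment_effect(**mix), 2))
```

The treatment is identical in both runs; only the mix of patient types changes, and the effect changes with it.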

I shared the above story in a talk at Memorial Sloan Kettering Cancer Center, and afterward Andrew Vickers sent me a note:

You gave a COVID story arguing that the effect size is a function of the population in a trial and how there are four types of patients (always die, always survive, die unless they take drug, die only if they take drug). This idea has recently been termed “heterogeneity of treatment effect” and I was part of a PCORI panel that wrote two papers in the Annals of Internal Medicine on this subject (here and here).

In brief, you characterized the issue as one of interaction (drug will work for patient A due to nature of their disease or genetics, won’t work for patient B because their disease or genetics are different). The PCORI panel focused instead on the idea of baseline risk: if a drug halves your risk of death, that means a greater absolute risk difference for someone with a high baseline risk (e.g. high viral load, pre-existing pulmonary disease, limited metabolic reserve) than for a patient at low baseline risk (e.g. a young, healthy person).

We followed with a long email exchange. Rather than just jump to the tl;dr end, I’m gonna give all of it—because only through this detailed back-and-forth will you see the depth of my confusion.

Here was my initial response to Vickers:

My example was even simpler because if the outcome is binary you can just think in terms of potential outcomes. It’s impossible to have a constant treatment effect if the outcome is discrete!

Vickers disagreed:

I don’t follow that at all. Let’s assume we are treating patients with high blood pressure, trying to prevent a cardiovascular event. Now it turns out that the risk of an event given your blood pressure is as follows:

Assume that a drug reduces blood pressure by 10mm Hg. Then no matter what your blood pressure is to start with, taking the drug will halve your risk. The relative risk in the study – 50% – is independent (give or take) of the trial population; the absolute risk difference depends absolutely on the distribution of blood pressure in the trial: it could vary from 1% if most patients had blood pressure of 140 to 5% if most patients had 170.
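(A quick illustration of Vickers's arithmetic, added here as a sketch. The two baseline risks are assumptions, back-solved from the 1% and 5% differences he quotes together with the halving of risk; they are not taken from his risk table.)

```r
relative_risk <- 0.5                          # the drug halves risk, as in the example
baseline_risk <- c(low = 0.02, high = 0.10)   # assumed baseline event risks
treated_risk  <- relative_risk * baseline_risk

# Constant relative risk, but the absolute difference scales with baseline risk.
absolute_risk_reduction <- baseline_risk - treated_risk
absolute_risk_reduction
```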

I then responded:

What I’m saying is that a person has 2 potential outcomes, y=1 (live) or 0 (die). Then the effect of the treatment on the individual is either 0 (if you’d live under either treatment or control or if you’d die under either treatment or control), +1 (if you’d live under the treatment and die under control), or -1 (if the reverse). Call these people types A, B, and C. We can never observe a person’s type, but we can consider it as a latent variable. By definition, the treatment effect varies. If the treatment has a positive average effect, that implies there are more people of type B than people of type C in the population. The effectiveness of a treatment in a study will depend on how many people of each type are in the study.

Vickers:

You are ignoring the stochastic process here. Imagine that everyone has two coins, and if they throw one head, they are executed. My “intervention” is to take away one coin. That lowers your probability of death from 75% to 50%. But there aren’t “latent types” of people who will respond or not respond to the coin removal intervention. In medicine, many processes are stochastic (like heart attack) but you can raise or lower probabilities.

Me:

Hmm, I’ll have to think about this. It seems to me that we can always define the potential outcomes, but I agree with you that in some examples, such as the coin example, the potential outcome framework isn’t so useful. In the coin example there are still these 3 types of people, but whether you are one of these people is itself random and completely unpredictable.

Vickers:

I don’t like the idea of types defined by something that happens in the future. We wouldn’t say “there are six types of people in the world. Those that if you gave them a die to throw would throw a 1, those that would throw a 2, those who’d throw a 3 etc. etc.”

There is a fairly exact analogy between the coin example and adjuvant chemotherapy for cancer. Getting chemotherapy reduces your burden of cancer cells. These cells may randomly mutate and then start growing and spreading. So getting chemotherapy isn’t much different from having a coin removed from your pile where you have to throw zero heads to escape execution.

Me:

I guess it’s gotta depend on context. For example, suppose that the effectiveness of the treatment depended on a real but unobserved aspect of the patient. Then I think it would make sense to talk about the 3 kinds of people. The chemotherapy example is an interesting one. I’m thinking the coronavirus example is different, if for no other reason than that most coronavirus patients aren’t dying. If you work with a population in which only 2% are dying, then the absolute maximum possible effectiveness of the treatment is 2%. So, at the very least, if that researcher was saying he was anticipating a 25% effectiveness, he was implicitly assuming that at least 25% of the people in his study would die in the absence of the treatment. That’s a big assumption already.

I guess the right way to think about it would be to allow some of the variation to be due to real characteristics of the patients and for some of it to be random.

Vickers:

No question that your investigator’s estimates were way out of line.

As regards “I guess the right way to think about it would be to allow some of the variation to be due to real characteristics of the patients and for some of it to be random”, I guess I like to think in terms of mechanisms. In the case of adjuvant chemotherapy, or cardiovascular prevention, an event (cancer recurrence, a heart attack) occurs at the end of a long chain of random processes (blood pressure only damages a vessel in the heart because there is a slight weakness in that vessel, a cancer cell not removed during surgery mutates). We can think of treatments as having a relatively constant risk reduction, so the absolute risk reduction observed in any study depends on the distribution of baseline risk in the study cohort. In other cases such as an antimicrobial or a targeted agent for cancer, you’ll have some patients that will respond (e.g. the microbe is sensitive to the particular drug, the patient’s cancer expresses the protein that is the target) and some that won’t. The absolute risk reduction depends on the distribution of the types of patient.

This discussion was very helpful in clarifying my thoughts. I was taught causal inference under the potential-outcome framework and I hadn’t fully thought through these issues.

And watching me get convinced in real time . . . it’s like the hot hand all over again!

P.S. In comments, several people make the point that the two frameworks discussed above are mathematically equivalent, and there is no observable difference between them. That’s right, and I think that’s one reason why Rubin prefers his model of deterministic potential outcomes as, in some sense, the purest way to write things. Also relevant is this paper from 2012 by Tyler Vanderweele and James Robins on stochastic potential outcomes that was pointed out by commenter Z.
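To make the equivalence concrete for the coin example, here is a minimal simulation sketch in R. It rests on one assumption that is exactly the point in dispute: we pretend that both coin flips are fixed in advance for each person and that the intervention always removes the second coin. Under that assumption the "types" can be tabulated, and they reproduce the 75% and 50% death rates from the stochastic description.

```r
set.seed(1)
n <- 1e5
coin1 <- rbinom(n, 1, 0.5)  # 1 = head; this coin is kept under the intervention
coin2 <- rbinom(n, 1, 0.5)  # 1 = head; this coin is removed under the intervention

die_control <- as.integer(coin1 == 1 | coin2 == 1)  # executed on any head from two coins
die_treated <- coin1                                 # only coin 1 is thrown

# Cross-tabulate the two potential outcomes to get the latent "types":
# about 25% survive either way, 25% are saved by the intervention,
# 50% die either way, and nobody is harmed.
types <- table(control = die_control, treated = die_treated) / n
types

mean(die_control) - mean(die_treated)  # risk difference, close to 0.75 - 0.50
```

The observable death rates are the same whether you read them as coin-flip probabilities or as proportions of types, which is the commenters' point; the types here exist only because the simulation draws the flips ahead of time.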

– When played optimally, is pawn race a win for white, black, or a draw?

– Could I beat a grandmaster if he was down a queen? I tried playing Stockfish with it down a Q and a R, and I won easily. (Yeah!) I suspect I could beat it just starting up a queen. But Stockfish plays assuming I’ll play optimally. Essentially it’s just trying to lose gracefully. In contrast, I’m pretty sure the grandmaster would know I’m not good, so he could just complicate the position and tear me apart. Perhaps someone’s already written a chess program called Patzer that plays kinda like me, and then another program called Hustler that can regularly defeat Patzer.

– If I was playing a top player, and my only goal was to last as many moves as possible, and his goal was to beat me in as few moves as possible, how long could I last?

– In bughouse, which is better, 4 bishops or 4 knights?

– How best to set up maharajah to make it balanced? Maybe the non-maharajah team isn’t allowed to promote and they have to win in some prespecified number of moves?

– And then there are the obvious ones:
1. Under perfect play, is the game a draw?
2. How much worse is the best current computer, compared to perfect play?
3. Could a human ever train to play the best computer to a draw? If not, how much worse etc?

– Do you have any good open questions in chess? If so, please share them in the comments.

We were discussing such questions after seeing this amusing article by Tom Murphy on the longest chess game (“There are jillions of possible games that satisfy the description above and reach 17,697 moves; here is one of them”) and also this fun paper comparing various wacky chess engines such as “random_move,” “alphabetical,” and “worstfish” and which begins:
“CCS Concepts: • Evaluation methodologies → Tournaments; • Chess → Being bad at it;
Additional Key Words and Phrases: pawn, horse, bishop, castle, queen, king.” I’ve been told that there’s an accompanying video but I never have the patience to watch videos.

One thing, though. Murphy writes, “Fiddly bits aside, it is a solved problem to maintain a numeric skill rating of players for some game (for example chess, but also sports, e-sports, probably also z-sports if that’s a thing). Though it has some competition (suggesting the need for a meta-rating system to compare them), the Elo Rating System is a simple and effective way to do it.” I know it’s a jokey paper but I just want to remind youall that these rating systems are not magic: you can think of them as mathematical algorithms or as fits of statistical models, but in any case they don’t always make sense, as can be seen from a simple hypothetical example of program A which always beats B, and B always beats C, and C always beats A. Murphy actually discusses this sort of example in his article, so he’s aware of the imperfections of ratings. I just wanted to bring this one up because chess rating is an example of the fallacy of measurement, by which people think that when something is measured, it must represent some underlying reality.
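Here is a rough sketch of that A-beats-B-beats-C-beats-A example in R, using the standard Elo expected-score formula and update rule. The starting ratings of 1500, the step size K = 32, and the number of cycles are arbitrary choices of mine.

```r
# Standard Elo expected score for a player rated r_a against a player rated r_b.
elo_expected <- function(r_a, r_b) 1 / (1 + 10^((r_b - r_a) / 400))

ratings <- c(A = 1500, B = 1500, C = 1500)  # arbitrary starting ratings
k <- 32                                     # update step size (assumed)
games <- list(c("A", "B"), c("B", "C"), c("C", "A"))  # winner listed first

for (cycle in 1:100) {
  for (g in games) {
    winner <- g[1]
    loser  <- g[2]
    gain <- k * (1 - elo_expected(ratings[[winner]], ratings[[loser]]))
    ratings[[winner]] <- ratings[[winner]] + gain  # winner gains what the loser drops
    ratings[[loser]]  <- ratings[[loser]]  - gain
  }
}

round(ratings)
elo_expected(ratings[["A"]], ratings[["B"]])  # predicted chance that A beats B
```

The three ratings stay close to each other, so the fitted model calls every matchup roughly a coin flip, even though each outcome in this hypothetical is completely determined.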

P.S. Also don’t forget Tim Krabbé’s chess records page, which unfortunately hasn’t been updated for over three years (at the time of this writing). Chess games can’t be copyrighted, so the youngest professor nationwide could collect some of this material in a book and put his name on it!

Traditional approaches to causal attribution propose that information about covariation of factors is used to identify causes of events. In contrast, we present a series of studies showing that people seek out and prefer information about causal mechanisms rather than information about covariation. . . . The subjects tended to seek out information that would provide evidence for or against hypotheses about underlying mechanisms. When asked to provide causes, the subjects’ descriptions were also based on causal mechanisms. . . . We conclude that people do not treat the task of causal attribution as one of identifying a novel causal relationship between arbitrary factors by relying solely on covariation information. Rather, people attempt to seek out causal mechanisms in developing a causal explanation for a specific event.

Interesting.

This finding is supportive of Judea Pearl’s attitude that causal relationships, rather than statistical relationships, are how we understand the world and how we think about evidence. And it’s also supportive of my attitude that we should think about causation in terms of mechanisms rather than using black-box reasoning based on identification strategies.

I just read the above-titled book by Alex Beam and I really enjoyed it. I’ve been a fan of Beam for a long time; he just has this wonderful equanimous style.

The thing that amazes me is that the book got published at all. Its subtitle is “Vladimir Nabokov, Edmund Wilson, and the End of a Beautiful Friendship.” Sounds like it would sell about 4 copies. I guess I’m underestimating the number of Nabokov fans out there. In any case, I’m glad that Beam went to the trouble of writing this book and that he got it published. It’s interesting and amusing all the way to the end of the Acknowledgments. OK, the Index is pretty boring, but I guess if you’re gonna write a book about Nabokov, you have to make a choice, and who am I to fault Beam for not entering this particular arena. “A distant northern land.”

I’m an Edmund Wilson fan too—I mean, he’s no George Orwell, or even an Anthony West (see also here), and for that matter I find Diana Trilling and Dwight Macdonald to be much more fun to read as well. But Wilson has some special something.

I’m having a bit of a ‘crisis’ of confidence regarding inferential statistics. I’ve been reading some of the work by Stephen Gorard (e.g. “Against Inferential Statistics”) and David Freedman and Richard Berk (e.g. “Statistical Assumptions as empirical commitments”). These authors appear to be saying this:

(1) Inferential statistics assume random sampling

(2) (Virtually) all experimental research (in psychology, for example) uses convenience sampling, not random sampling

(3) Therefore (virtually) all experimental research should have nothing to do with inferential statistics

If a researcher gets a convenience sample (say 100 college students), randomly assigns them to two groups and then uses multiple regression (let’s say) to analyse the results, is that researcher ‘wrong’ to report/use/rely on the p-values that result? [Perhaps the researcher could just use the parameter estimates – based on the convenience sample – and ignore the p values and confidence intervals…?]

Are inferential statistics just (totally) inappropriate for convenience samples?

I would love to hear your views. Perhaps you’ve written a blog post on the matter?

Yes, this has come up many times! Here are some posts:

The longer answer is that random sampling is just a model. It doesn’t apply to opinion polls either. But it can be a useful starting point.

Adam replied:

I’ve been worried that everything I’ve learnt (or taught myself) about inferential statistics has been a waste of time given that I’ve only ever used convenience samples and (in my area – psychology) don’t see anybody using anything other than convenience samples. When I read the Stephen Gorard papers and saw that at least one eminent expert in stats – Gene Glass – agreed, I had my ‘crisis of confidence.’

If you can restore my confidence, I would be very grateful. I really like using multiple regression to analyse my data and would hate to think that it’s all been a charade!

My reply: I don’t want to be too confident. I recommend you read the section in Regression and Other Stories where we talk about the assumptions of linear regression.

Here’s a regression puzzle courtesy of Advanced NFL Stats from a few years ago and pointed to recently by Holden Karnofsky from his interesting new blog, ColdTakes. The nominal issue is how to figure out whether Aaron Rodgers is underpaid or overpaid given data on salaries and expected points added per game. Assume that these are the right stats and correctly calculated. The real issue is which is the best graph to answer this question:

Brian 1: …just look at this super scatterplot I made of all veteran/free-agent QBs. The chart plots Expected Points Added (EPA) per Game versus adjusted salary cap hit. Both measures are averaged over the veteran periods of each player’s contracts. I added an Ordinary Least Squares (OLS) best-fit regression line to illustrate my point (r=0.46, p=0.002).

Rodgers’ production, measured by his career average Expected Points Added (EPA) per game is far higher than the trend line says would be worth his $21M/yr cost. The vertical distance between his new contract numbers, $21M/yr and about 11 EPA/G illustrates the surplus performance the Packers will likely get from Rodgers.

According to this analysis, Rodgers would be worth something like $25M or more per season. If we extend his 11 EPA/G number horizontally to the right, it would intercept the trend line at $25M. He’s literally off the chart.

Brian 2: Brian, you ignorant slut. Aaron Rodgers can’t possibly be worth that much money….I’ve made my own scatterplot and regression. Using the exact same methodology and exact same data, I’ve plotted average adjusted cap hit versus EPA/G. The only difference from your chart above is that I swapped the vertical and horizontal axes. Even the correlation and significance are exactly the same.

As you can see, you idiot, Rodgers’ new contract is about twice as expensive as it should be. The value of an 11 EPA/yr QB should be about $10M.
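(For readers who want to see mechanically why the two charts can disagree while sharing the same correlation, here is a small R sketch on simulated data. The numbers are invented and have nothing to do with real quarterbacks; the sketch only shows that regressing y on x and regressing x on y give two different lines. It does not say which chart answers the question.)

```r
set.seed(2)
n <- 40
salary <- rnorm(n, mean = 10, sd = 4)            # simulated cap hit, $M/yr (made up)
epa    <- 2 + 0.3 * salary + rnorm(n, sd = 2.5)  # simulated EPA per game (made up)

fit_epa_on_salary <- lm(epa ~ salary)  # like the first chart: EPA as the outcome
fit_salary_on_epa <- lm(salary ~ epa)  # like the second chart: salary as the outcome

b1 <- coef(fit_epa_on_salary)[["salary"]]
b2 <- coef(fit_salary_on_epa)[["epa"]]
r  <- cor(salary, epa)

# Drawn on the same axes, the second line has slope 1 / b2, not b1.
c(slope_epa_on_salary = b1, inverted_slope_salary_on_epa = 1 / b2, r_squared = r^2)
b1 * b2  # equals r^2, so the two lines coincide only when |r| = 1
```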

Alex concludes with a challenge:

Ok, so which is the best graph for answering this question? Show your work. Bonus points: What is the other graph useful for?

I followed all the links and read all the comments and I have my answer, which is different from (although not completely unrelated to) what other people are saying. It’s interesting to see people struggling to work this one out.

But giving my solution right now would be boring, right? So I’ll leave it up for youall to discuss in comments, then in a day or two I’ll post my answer. I will say this, though: it’s not a trick, and I’m not trying to use any football-specific or NFL-specific knowledge.

Enjoy. I’m teaching applied regression and causal inference this fall and spring so it’s great to have examples like this. Although maybe this one’s a bit too complicated for an intro class . . .

P.S. I’d prefer the graphs to just have the names and get rid of those distracting little circles all over the place.

I liked this article by Hannah Fry about the challenges of statistical measurement. This is a topic that many statisticians have ignored, so it’s especially satisfying to see it in the popular press. Fry discusses several examples described in recent books of Deborah Stone and Tim Harford of noisy, biased, or game-able measurements.

I agree with Fry in her conclusion that statistical measurement is both difficult and important:

Numbers are a poor substitute for the richness and color of the real world. . . . But to recognize the limitations of a data-driven view of reality is not to downplay its might. It’s possible for two things to be true: for numbers to come up short before the nuances of reality, while also being the most powerful instrument we have when it comes to understanding that reality.

And she quotes Stone as saying, “To count well, we need humility to know what can’t or shouldn’t be counted.”

The role of the news media (and now social media as well)

I just want to add one thing to Fry’s discussion. Bad statistics can pop up from many directions, including the well-intentioned efforts of reformers, the muddled thinking of everyday scientists trying their best, the desperate striving of scientific glory hounds, and let’s never forget political hacks and out-and-out frauds.

OK, that’s all fine. But how do we hear about these misleading numbers? Through the news and social media. Also selection bias, as we’ve discussed before:

Lots of science reporters want to do the right thing, and, yes, they want clicks and they want to report positive stories—I too would be much more interested to read or write about a cure for cancer than about some bogus bit of noise mining—and these reporters will steer away from junk science. But here’s where the selection bias comes in: other, less savvy or selective or scrupulous reporters will jump in and hype the junk. So, with rare exceptions (some studies are so bad and so juicy that they just beg to be publicly debunked), the bad studies get promoted by the clueless journalists, and the negative reports don’t get written.

My point here is that selection bias can give us a sort of Gresham effect, even without any journalists knowingly hyping anything of low quality.

Fry published her article in the New Yorker, and even that august publication will occasionally jump on the junk-science bandwagon. For example, Fry cites “the great psychologist Daniel Kahneman,” which is fine—Kahneman has indeed done great work—but the problem with any “the great scientist X” formulation is that it can lead to credulity on the occasions that the great one gets it wrong. And of course one of their star writers is Malcolm Gladwell.

I’m not saying that Fry in her New Yorker article is supposed to dredge up all the mistakes made by its feature writers—after all, I’ll go months on this blog without mentioning that I share an employer with Dr. Oz! Rather, I’m trying to make a more general point that these mistakes in measurement come from many sources, but we should also be aware of how they’re promulgated—and how this fits into our narratives of science. We have to be careful to not just replace the old-fashioned Freakonomics or Gladwell-style scientist-as-hero narrative with a new version involving heroic debunkers; see also here, where I argue that scientific heroism, to the extent it exists, lives in the actions that the hero inspires.

This note is meant as a quick explainer of a set of three pre-prints at The Shrinkage Trilogy. All three have the same simple set-up: We abstract a “study” as a triple (beta,b,s) where

– beta is the parameter of interest
– b is an unbiased, normally distributed estimate of beta
– s is the standard error of b.

In other words, we are assuming that our estimate b has the normal distribution with mean beta and standard deviation s. We do not observe beta, but we do observe the pair (b,s).

We define the z-value z=b/s and the signal-to-noise ratio SNR=beta/s. Note that the z-value is the sum of the SNR and independent standard normal “noise”. This means that the distribution of the z-value is the convolution of the distribution of the SNR with the standard normal density.

It is not difficult to estimate the distribution of z-values if we have a sample of study results from a particular field of study. Subsequently, we can obtain the distribution of the SNRs in that field by deconvolution. Moreover, we also know the conditional distribution of the z-value given the SNR; it’s just normal with mean SNR and standard deviation 1. So, we can actually get the joint distribution of the z-value and the SNR.

So, we’re going to estimate the distribution of z=b/s, deconvolve to get the distribution of SNR=beta/s and scale that distribution by s to get a prior for beta given s. We can then use conjugate theory to compute the posterior of beta, given b and s. The posterior mean of beta is a useful shrinkage estimator. Shrinkage is very important, because the signal-to-noise ratio is often very low and therefore |b| tends to overestimate (exaggerate) |beta|. This is especially bad when we condition on statistical significance (|z|>1.96).

2 z-value and SNR

To estimate the distribution of the z-value in some particular field of research, we need an “honest” sample that is free from publication bias, file drawer effect, fishing, forking paths etc. Recently, Barnett and Wren (2019) collected more than a million confidence intervals from Medline publications (data are here). We converted those to z-values and display the histogram below. The striking shortage of z-values between -2 and 2 suggests strong publication bias. This biased sample of z-values is not suitable for our purpose.

Simon Schwab (2020) collected more than 20,000 z-values from RCTs from the Cochrane database (data are here). The histogram shows much less publication bias. This may be due to the fact that many studies in the Cochrane database are pre-registered, and to the efforts of the Cochrane collaboration to find unpublished results.

We fitted a mixture of 3 normal distributions to the z-values from the Cochrane database. We show the fit in the histogram above, and note that it is quite satisfactory.

library(flexmix) # provides flexmix() and parameters()
fit=flexmix(z ~ 1, k = 3) # estimate mixture distribution of z
p=summary(fit)@comptab$prior # mixture proportions
mu=parameters(fit)[1,] # mixture means
sigma=parameters(fit)[2,] # mixture standard deviations
round(data.frame(p,mu,sigma),2)

## p mu sigma
## Comp.1 0.46 -0.25 1.25
## Comp.2 0.08 -0.76 5.39
## Comp.3 0.46 -0.13 2.18

We can now get the distribution of the SNR by deconvolution of the distribution of the z-value with the standard normal distribution. Deconvolution is not easy in general, but in our case it is trivial. Since we estimated the distribution of the z-value as a normal mixture, we can simply subtract 1 from the variances of the mixture components. We plot the densities of z and SNR together, and see that the density of the z-value is a “smeared-out” version of the density of the SNR.

tau=sqrt(sigma^2-1); round(tau,2) # deconvolution; standard deviations of the SNR
## Comp.1 Comp.2 Comp.3
## 0.75 5.30 1.94

3 Power

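The power of the two-sided test of H_0 : beta=0 at level 5% is

P(|z| > 1.96 | SNR) = Phi(SNR - 1.96) + 1 - Phi(SNR + 1.96),

where Phi is the standard normal cumulative distribution function.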

Since the power is just a function of the SNR, we can transform a sample from the distribution of the SNR into a sample from the distribution of the power (see also the histogram below).

rmix = function(n,p,mean,sd){ # sample from a normal mixture
  d=rmultinom(n,1,p) # one-hot component indicators, one column per draw
  rnorm(n,mean%*%d,sd%*%d) # select each draw's component mean and sd
}
snr=rmix(10^6,p,mu,tau)
power=pnorm(snr - 1.96) + 1 - pnorm(snr + 1.96)
S=summary(power); round(S,2)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05 0.07 0.14 0.28 0.39 1.00

We see that the median power across the Cochrane database is about 14%, while the average power is about 28%. The average power can be interpreted as the probability that a randomly selected result from the Cochrane collection is significant. And indeed, 29% of our z-values exceed 1.96 in absolute value. The fact that the (achieved) power is often very low should not surprise us; see The “80% power” lie. However, it also does not imply that the usual sample size calculations aiming for 80% or 90% power are necessarily wrong. The goal of such calculations is to have high power against a particular alternative that is considered to be of clinical interest – the effect “you would not want to miss”. That high power is often not achieved just goes to show that medical research is hard, and treatments often do not provide the benefit that was hoped for.

It is also possible to condition the power on a specific z-value. This will allow us to assess the probability that the replication of a specific result will be significant.
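One way such a conditional calculation might go is sketched below. This is an illustration under stated assumptions, not necessarily the computation used in the Trilogy: the mixture parameters are the rounded estimates printed above, the replication is assumed to have the same standard error as the original study, and "significant" means |z| > 1.96 in either direction. The function name `replication_power` is hypothetical. The conditioning step uses the usual normal-normal formulas applied to each mixture component.

```r
# Rounded estimates of the SNR mixture, copied from the output above.
p   <- c(0.46, 0.08, 0.46)
mu  <- c(-0.25, -0.76, -0.13)
tau <- c(0.75, 5.30, 1.94)

replication_power <- function(z, p, mu, tau) {
  tau2 <- tau^2
  q  <- p * dnorm(z, mu, sqrt(tau2 + 1))  # mixing weights of the SNR given z
  q  <- q / sum(q)
  pm <- (mu + z * tau2) / (tau2 + 1)      # conditional means of the SNR given z
  pv <- tau2 / (tau2 + 1)                 # conditional variances of the SNR given z
  sd_rep <- sqrt(pv + 1)                  # the replication adds fresh unit-variance noise
  sum(q * (pnorm(-1.96, pm, sd_rep) + 1 - pnorm(1.96, pm, sd_rep)))
}

# Probability that a same-sized replication is significant, for z = 2, 3, 4.
sapply(c(2, 3, 4), replication_power, p = p, mu = mu, tau = tau)
```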

4 Exaggeration

The “exaggeration ratio” is |b/beta| = |z/SNR|. Since it is just a function of the z-value and the SNR, we can easily get a sample from its distribution (see also the histogram below).
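A self-contained sketch of that sampling step follows. It hard-codes the rounded mixture estimates printed earlier rather than refitting the model, so its quantiles may differ slightly from ones computed with the unrounded fit; the sample size and seed are arbitrary.

```r
set.seed(3)
# Rounded estimates of the SNR mixture, copied from the output above.
p   <- c(0.46, 0.08, 0.46)
mu  <- c(-0.25, -0.76, -0.13)
tau <- c(0.75, 5.30, 1.94)

n <- 1e5
comp <- sample(3, n, replace = TRUE, prob = p)  # mixture component of each study
snr  <- rnorm(n, mu[comp], tau[comp])           # signal-to-noise ratios
z    <- rnorm(n, mean = snr, sd = 1)            # z-value = SNR + standard normal noise
exaggeration <- abs(z / snr)

round(quantile(exaggeration, c(0.1, 0.25, 0.5, 0.75, 0.9)), 2)
```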

We see that the median exaggeration is 1.23. That means that half the studies overestimate the effect by at least 23%.

It is also possible to condition the exaggeration on a specific z-value. This will allow us to assess the exaggeration of a specific estimate. We can then correct for this by shrinking the estimate.

5 Shrinkage

We can use shrinkage (regularization) to correct the exaggeration. We have estimated the distribution of the SNR as a normal mixture parameterized by (p,mu,tau). Recalling that SNR=beta/s, we scale this distribution by s to get a distribution for beta. So, the distribution of beta is a normal mixture parameterized by (p,s*mu,s*tau).

We can now compute the conditional (or posterior) distribution of beta given the pair (b,s). It is again a normal mixture distribution.

posterior = function(b,s,p,mu,sd){ # compute conditional distr of beta given (b,s)
  # mixture distr of beta given by (p,mu,sd)
  s2=s^2
  sd2=sd^2
  q=p*dnorm(b,mu,sqrt(sd2+s2)) # conditional mixing probs
  q=q/sum(q)
  pm=(mu*s2 + b*sd2)/(sd2+s2) # conditional means
  pv=sd2*s2/(sd2+s2) # conditional variances
  ps=sqrt(pv) # conditional std devs
  data.frame(q,pm,ps)
}

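As an example, we compute the conditional (posterior) distribution of beta given b=2 and s=1. It is a normal mixture with the following parameters:

b=2; s=1
post=posterior(b,s,p,s*mu,s*tau) # prior for beta is the SNR mixture scaled by s
round(post,2)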

In particular, we can use the conditional (posterior) mean as an estimator.

post.mean=sum(post$q * post$pm)
round(post.mean,2) # posterior mean of beta

## [1] 1.24

round(post.mean/b,2) # shrinkage factor

## [1] 0.62

6 Conclusion

Low power is very common. It leads to overestimation of effects (a.k.a. exaggeration, inflation or type M error) which must be corrected by shrinkage. For more details, we refer to The Shrinkage Trilogy. Here we end with three remarks.

Undoubtedly, the Cochrane database suffers from at least some publication bias, file drawer effects, fishing, forking paths etc. Unfortunately, this means that the power is likely to be even lower and the exaggeration even greater.

Strictly speaking, our analysis concerns the results in the Cochrane database. However, we believe that problems of low power and exaggeration are similar – if not worse – in many other areas of research.

Our estimate of the distribution of the SNR is quite close to the standard Cauchy distribution, which we recommend as a default prior without reference to the Cochrane database. Of course, nothing beats real, substantive prior information that is specific to the study of interest.

This Wednesday, at 11:30 am ET, Elea Feit is stopping by to talk to us about her recent work on Conjoint models fit using GPs. You can register here.

Abstract

Choice-based conjoint analysis is a widely-used technique for assessing consumer preferences. By observing how customers choose between alternatives with varying attributes, consumers’ preferences for the attributes can be inferred. When one alternative is chosen over the others, we know that the decision-maker perceived this option to have higher utility compared to the unchosen options. In addition to observing the choice that a customer makes, we can also observe the response time for each task. Building on extant literature, we propose a Gaussian Process model that relates response time to four features of the choice task (question number, alternative difference, alternative attractiveness, and attribute difference). We discuss the nonlinear relationships between these four features and response time and show that incorporating response time into the choice model provides us with a better understanding of individual preferences and improves our ability to predict choices.

About the speaker

Elea Feit is an Associate Professor of Marketing at Drexel University. Prior to joining Drexel, she spent most of her career at the boundary between academia and industry, including positions at General Motors Research, The Modellers, and Wharton Customer Analytics. Her work is inspired by the decision problems that marketers face and she has published research on using randomized experiments to measure advertising incrementality and using conjoint analysis to design new products. Methodologically, she is a Bayesian with expertise in hierarchical models, experimental design, missing data, data fusion, and decision theory. She is also the co-author of R for Marketing Research and Analytics. More at eleafeit.com.

In our discussion a couple days ago on the role of hypotheses in science, Lakeland wrote:

Even “this data is relevant to the question we’re studying” is already a hypothesis. There’s no such thing as hypothesis free data analysis.

I’ve sometimes said similar things, in that I like to interpret exploratory graphics as model checks, where the model being checked might be implicit; see for example this recent paper with Jessica Hullman.

But, thinking about this more, I wouldn’t quite go so far as Lakeland. I’m thinking there’s a connection between his point and the idea of workflow, or performing multiple analyses on data. For example: I just went on Baby Name Voyager and started typing in names. This was as close to hypothesis-free data analysis as you can get. But after I saw a few patterns, I started to form hypotheses. For example, I typed in Stephanie and saw how the name frequency has dropped so fast during the past twenty years. Then I had a hypothesis: could it be alternative spellings? So I tried Stefany etc. Then I got to wondering about Stephen. That wasn’t a hypothesis, exactly, more of a direction to look. I had a meta-hypothesis that I might learn something by looking at the time trend for Stephen. I saw a big drop since the 1950s. Also for Steven (recall that earlier hypothesis about alternative spellings). And so on.

My point is that a single static data analysis (for example, looking up Stephanie in the Baby Name Voyager) can be motivated by curiosity or a meta-hypothesis that I might learn something interesting, but as I start going through workflow, hypothesizing is inevitably involved.

I’m thinking now that this is a big deal, connecting some of our statistical thoughts about modeling and model checking and hypotheses with scientific practice and the philosophy of science. Statistical theory and textbooks and computation tend to focus on one model at a time, or one statistical procedure at a time; in the workflow perspective we recognize that we are performing a series of statistical analyses.

It’s hard for me to imagine doing a series of analyses without forming some hypotheses and without thinking of how to refine these hypotheses or adjudicate among alternative theories of the world. One quick data analysis, though, that’s different. I sincerely think I looked at that Stephanie graph out of pure curiosity. As noted above, deciding to look at some data out of curiosity could be said to reflect a meta-hypothesis that something interesting may turn up, but I would not classify that as much of a hypothesis at all. After looking at the graph, though, the decision of what to look at next is definitely hypothesis-informed.

Similarly, I can conduct a survey and ask a bunch of questions without having any hypothesis of how people respond; I can just think it’s a good idea to gather these data. But I think it would be hard to conduct a follow-up survey without making some hypotheses. (Again, I’m speaking here of scientific or engineering hypotheses, not “hypotheses” in the sense of that horrible statistical theory of “hypothesis testing.”)

So . . . hypothesizing plays a crucial role in statistical workflow, even though I don’t think a hypothesis is necessary to get started.