“The series is inspired by Dan Ariely’s novel ‘Predictably Irrational’ . . .”

Someone points us to this news article:

NBC has placed a series order for “The Irrational,” the network announced on Tuesday.

According to the logline, the drama follows Alec Baker, a world-renowned professor of behavioral science, lends his expertise to an array of high-stakes cases involving governments, law enforcement and corporations with his unique and unexpected approach to understanding human behavior. The series is inspired by Dan Ariely’s novel “Predictably Irrational,” which was published in February 2008 by HarperCollins. Ariely will serve as a consultant.

I wonder if they’ll have an episode with the mysterious paper shredder?

In all seriousness, I absolutely love the above description of the TV show, especially the part where they described Ariely’s book as a novel. How edgy!

Describing his work as fiction . . . more accurate than they realize.

P.S. In the movie of me, I’d like to be played by a younger Rodney Dangerfield.

Maybe Paul Samuelson and his coauthors should’ve spent less time on dominance games and “boss moves” and more time actually looking out at the world that they were purportedly describing.

Yesterday we pointed to a post by Gary Smith, “Don’t worship math: Numbers don’t equal insight,” subtitled, “The unwarranted assumption that investing in stocks is like rolling dice has led to some erroneous conclusions and extraordinarily conservative advice,” that included a wonderful story that makes the legendary economist Paul Samuelson look like a pompous fool. Here’s Smith:

Mathematical convenience has often trumped common sense in financial models. For example, it is often assumed — because the assumption is useful — that changes in stock prices can be modeled as independent draws from a probability distribution. Paul Samuelson offered this analogy:

Write down those 1,800 percentage changes in monthly stock prices on as many slips of paper. Put them in a big hat. Shake vigorously. Then draw at random a new couple of thousand tickets, each time replacing the last draw and shaking vigorously. That way we can generate new realistically representative possible histories of future equity markets.

I [Smith] did Samuelson’s experiment. I put 100 years of monthly returns for the S&P 500 in a computer “hat” and had the computer randomly select monthly returns (with replacement) until I had a possible 25-year history. I repeated the experiment one million times, giving one million “Samuelson simulations.”

I also looked at every possible starting month in the historical data and determined the very worst and very best actual 25-year investment periods. The worst period began in September 1929, at the start of the Great Crash. An investment over the next 25 years would have had an annual return of 5.1%. The best possible starting month was January 1975, after the 1973-1974 crash. The annual rate of return over the next 25 years was 17.3%.

In the one million Samuelson simulations, 9.6% of the simulations gave 25-year returns that were worse than any 25-year period in the historical data and 4.9% of the simulations gave 25-year returns that were better than any actual 25-year historical period. Overall, 14.5% of the Samuelson simulations gave 25-year returns that were too extreme. Over a 50-year horizon, 24.5% of the Samuelson simulations gave 50-year returns that were more extreme than anything that has ever been experienced.
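The experiment Smith describes is a bootstrap under an independence assumption, and its logic fits in a few lines. What follows is a minimal sketch, not Smith's code: the function names are hypothetical, and the "history" is synthetic placeholder data rather than actual S&P 500 returns, so the printed shares will not match the percentages quoted above.

```python
import random


def annualized_return(monthly_returns):
    """Compound a sequence of monthly returns into an annual rate."""
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r
    years = len(monthly_returns) / 12.0
    return growth ** (1.0 / years) - 1.0


def historical_range(returns, horizon):
    """Worst and best annualized return over every contiguous window."""
    rates = [
        annualized_return(returns[start:start + horizon])
        for start in range(len(returns) - horizon + 1)
    ]
    return min(rates), max(rates)


def share_outside_history(returns, horizon, n_sims, rng):
    """Fractions of resampled histories below and above the historical range."""
    low, high = historical_range(returns, horizon)
    below = above = 0
    for _ in range(n_sims):
        # Samuelson's "hat": independent draws with replacement.
        simulated = rng.choices(returns, k=horizon)
        rate = annualized_return(simulated)
        if rate < low:
            below += 1
        elif rate > high:
            above += 1
    return below / n_sims, above / n_sims


rng = random.Random(0)
# Placeholder data: 100 years of made-up monthly returns, NOT the S&P 500.
fake_history = [rng.gauss(0.008, 0.045) for _ in range(1200)]
below, above = share_outside_history(fake_history, horizon=300, n_sims=2000, rng=rng)
print(f"worse than history: {below:.1%}, better than history: {above:.1%}")
```

Swapping actual monthly returns in for `fake_history` would give the comparison Smith made: how often independent draws wander outside anything the real, serially dependent market has produced.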

You might say that Smith is being unfair, as Samuelson was only offering a simple mathematical model. But it was Samuelson, not Smith, who characterized his random drawing as “realistically representative possible histories of future equity markets.” Samuelson was the one claiming realism.

My take is that Samuelson wanted it both ways. He wanted to show off his math, but he also wanted relevance, hence his “realistically.”

The prestige of economics comes partly from its mathematical sophistication but mostly because it’s supposed to relate to the real world.

Smith’s example of Samuelson’s error reminded me of this story from David Levy and Sandra Peart about this graph from Samuelson’s legendary textbook. This is from 1961:


Alex Tabarrok pointed out that it’s even worse than it looks: “in subsequent editions Samuelson presented the same analysis again and again except the overtaking time was always pushed further into the future so by 1980 the dates were 2002 to 2012. In subsequent editions, Samuelson provided no acknowledgment of his past failure to predict and little commentary beyond remarks about ‘bad weather’ in the Soviet Union.”

The bit about the bad weather is funny. If you’ve had bad weather in the past, maybe the possibility of future bad weather should be incorporated into the forecast, no?

Is there a connection?

Can we connect Samuelson’s two errors?

Again, the error with the Soviet economy forecast is not that he was wrong in the frenzied post-Sputnik year of 1961; the problem is that he kept making this error in his textbook for decades to come. Here’s another bit, from Larry White:

As late as the 1989 edition [Samuelson] coauthor William Nordhaus wrote: ‘The Soviet economy is proof that, contrary to what many skeptics had earlier believed, a socialist command economy can function and even thrive.’

I see three similarities between the stock-market error and the command-economy error:

1. Love of simple mathematical models: the random walk in one case and straight trends in the other. The model’s so pretty, it’s too good to check.

2. Disregard of data. Smith did that experiment disproving Samuelson’s claim. Samuelson could’ve done that experiment himself! But he didn’t. That didn’t stop him from making a confident claim about it. As for the Soviet Union, by the time 1980 had come along Samuelson had 20 years of data refuting his original model, but that didn’t stop him from just shifting the damn curve. No sense that, hey, maybe the model has a problem!

3. Technocratic hubris. There’s this whole story about how Samuelson was so brilliant. I have no idea how brilliant he was—maybe standards were lower back then?—but math and reality don’t care how brilliant you are. I see a connection between Samuelson thinking that he could describe the stock market with a simple random walk model, and him thinking that the Soviets could just pull some levers and run a thriving economy. Put the experts in charge, what could go wrong, huh?

More stories

Smith writes:

As a student, Samuelson reportedly terrorized his professors with his withering criticisms.

Samuelson is of course the uncle of Larry Summers, another never-admit-a-mistake guy. There is a story about Summers saying something stupid to Samuelson a week before Arthur Okun’s funeral. Samuelson reportedly said to Summers, “In my eulogy for Okun, I’m going to say that I don’t remember him ever saying anything stupid. Well, now I won’t be able to say that about you.”

There was a famous feud between Samuelson and Harry Markowitz about whether investors should think about arithmetic or geometric means. In one Samuelson paper responding to Markowitz, every word (other than author names) was single syllable.

I once gave a paper at a festschrift honoring Tobin. Markowitz began his talk by graciously saying to Samuelson, who was sitting arm-crossed in the front row, “In the spirit of this joyous occasion, I would like to say to Paul that ‘Perhaps there is some merit in your argument.’” Samuelson immediately responded, “I wish I could say the same.”

Here’s the words-of-one-syllable paper, and here’s a post that Smith found:

Maybe Samuelson and his coauthors should’ve spent less time on dominance games and “boss moves” and more time actually looking out at the world that they were purportedly describing.

Is omicron natural or not – a probabilistic theory?

Aleks points us to this article, “A probabilistic approach to evaluate the likelihood of artificial genetic modification and its application to SARS-CoV-2 Omicron variant,” which begins:

A method to find a probability that a given bias of mutations occur naturally is proposed to test whether a newly detected virus is a product of natural evolution or artificial genetic modification. The probability is calculated based on the neutral theory of molecular evolution and binominal distribution of non-synonymous (N) and synonymous (S) mutations. Though most of the conventional analyses, including dN/dS analysis, assume that any kinds of point mutations from a nucleotide to another nucleotide occurs with the same probability, the proposed model takes into account the bias in mutations, where the equilibrium of mutations is considered to estimate the probability of each mutation. The proposed method is applied to evaluate whether the Omicron variant strain of SARS-CoV-2, whose spike protein includes 29 N mutations and only one S mutation, can emerge through natural evolution. The result of binomial test based on the proposed model shows that the bias of N/S mutations in the Omicron spike can occur with a probability of 1.6 x 10^(-3) or less. Even with the conventional model where the probabilities of any kinds of mutations are all equal, the strong N/S mutation bias in the Omicron spike can occur with a probability of 3.7 x 10^(-3), which means that the Omicron variant is highly likely a product of artificial genetic modification.

I don’t know anything about the substance. The above bit makes me suspicious, as it looks like what they’re doing is rejecting null hypothesis A and using this to claim that their favored alternative hypothesis B is true.
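To make the logic concrete, here is a minimal sketch of the kind of binomial tail calculation the abstract describes. The null probability that a random mutation is non-synonymous is not given in the passage, so the values of `p_nonsyn` below are hypothetical placeholders, not the paper's numbers. Whatever value is used, the result is the probability of the data under one particular null model; it is not the probability that the null is true, and it says nothing directly about any specific alternative.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


# 29 non-synonymous mutations and 1 synonymous mutation, as in the abstract.
n_mutations, n_nonsyn = 30, 29

# Hypothetical null probabilities that a random mutation is non-synonymous.
for p_nonsyn in (0.70, 0.75, 0.80):
    tail = binomial_upper_tail(n_mutations, n_nonsyn, p_nonsyn)
    print(f"p_nonsyn={p_nonsyn:.2f}: P(N >= 29 of 30) = {tail:.2e}")
```

The tail area moves a lot as the assumed null probability changes, which is one more reason not to read a small number here as confirmation of hypothesis B.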

Further comments from an actual expert are here.

The problems with p-values are not just with p-values.

From 2016 but still worth saying:

Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B. Whenever this sort of reasoning is being done, the problems discussed above will arise. Confidence intervals, credible intervals, Bayes factors, cross-validation: you name the method, it can and will be twisted, even if inadvertently, to create the appearance of strong evidence where none exists.

I put much of the blame on statistical education, for two reasons:

First, in our courses and textbooks (my own included), we tend to take the “dataset” and even the statistical model as given, reducing statistics to a mathematical or computational problem of inference and encouraging students and practitioners to think of their data as given. . . .

Second, it seems to me that statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an “uncertainty laundering” that begins with data and concludes with success as measured by statistical significance. Again, I do not exempt my own books from this criticism: we present neatly packaged analyses with clear conclusions. This is what is expected—demanded—of subject-matter journals. . . .

If researchers have been trained with the expectation that they will get statistical significance if they work hard and play by the rules, if granting agencies demand power analyses in which researchers must claim 80% certainty that they will attain statistical significance, and if that threshold is required for publication, it is no surprise that researchers will routinely satisfy this criterion, and publish, and publish, and publish, even in the absence of any real effects, or in the context of effects that are so variable as to be undetectable in the studies that are being conducted.

In summary:

I agree with most of the ASA’s statement on p-values but I feel that the problems are deeper, and that the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

Decades of polling have drained the aquifer of survey participation.

I wrote about this in 2004:

Back in the 1950s, when the Gallup poll was almost the only game in town, it was rational to respond to the survey–you’d be one of about 1000 respondents and could have a reasonable chance of (indirectly) affecting policy. Now you’re just one of millions, and so answering a pollster is probably not worth the time (See here for related arguments).

The recent proliferation of polls—whether for marketing or just to sell newspapers—exploits people’s civic-mindedness. Polling and polling and polling until all the potential respondents get tired—it’s like draining the aquifer to grow alfalfa in the desert.

Physics educators do great work with innovative teaching. They should do better in evaluating evidence of effectiveness and be more open to criticism.

Michael Weissman pointed me to a frustrating exchange he had with the editor of Physical Review Physics Education Research. Weissman submitted an article criticizing an article that the journal had published, and the editor refused to publish his article. That’s fine—it’s up to the journal to decide what to publish!—but I agree with Weissman that some of the reasons they gave for not publishing were bad reasons, for example, “in your abstract, you describe the methods used by the researchers as ‘incorrect’ which seems inaccurate, SEM or imputation are not ‘incorrect’ but can be applied, each time they are applied, it involves choices (which are often imperfect). But making these choices explicit, consistent, and coherent in the application of the methods is important and valuable. However, it is not charitable to characterize the work as incorrect. Challenges are important, but PER has been and continues to be a place where people tend to see the positive in others.”

I would not have the patience to go even 5 minutes into these models with the coefficients and arrows, as I think they’re close to hopeless even in the best of settings and beyond hopeless for observational data, nor do I want to think too hard about terms such as “two-way correlation,” a phrase which I hope never to see again!

I agree with Weissman on these points:

1. It is good for journals to publish critiques, and I don’t think that critiques should be held to higher standards than the publications they are critiquing.

2. I think that journals are too focused on “novel contributions” and not enough on learning from mistakes.

3. Being charitable toward others is fine, all else equal, but not so fine if this is used as a reason for researchers, or an entire field, to avoid confronting the mistakes they have made or the mistakes they have endorsed. Here’s something I wrote in praise of negativity.

4. Often these disputes are presented as if the most important parties are authors of the original paper, the journal editor, and the author of the letter or correction note. But that’s too narrow a perspective. The most important parties are not involved in the discussion at all: these are the readers of the articles—those who will take their claims and apply them to policy or to further research—and all the future students who may be affected by these policies. Often it seems that the goal is to minimize any negative career impact on the authors of the original paper and to minimize any inconvenience to the journal editors. I think that’s the wrong utility function, and to ignore the future impacts of uncorrected mistakes is implicitly an insult to the entire field. If the journal editors think the work they publish has value—not just in providing chits that help scholars get promotions and publicity, but in the world outside the authors of these articles—then correcting errors and learning from mistakes should be a central part of their mission.

I hope Weissman’s efforts in this area have some effect in the physics education community.

As a statistics educator, I’ve been very impressed by the innovation shown by physics educators (for example, the ideas of peer instruction and just-in-time teaching, which I use in my classes), so I hope they can do better in this dimension of evaluating evidence of effectiveness.

“The market can become rational faster than you can get out”

Palko pointed me to one of these stories about a fraudulent online business that crashed and burned. I replied that it sounded a lot like Theranos. The conversation continued:

Palko: Sounds like all the unicorns. The venture capital model breeds these things.

Me: Unicorns aren’t real, right?

Palko: Unicorns are mythical beasts and those who invest in them are boobies.

Me: Something something longer than something something stay solvent.

Palko: That’s good advice for short sellers, but it’s good to remember the corollary: the market can become rational faster than you can get out.

Good point.

Straining on the gnat of the prior distribution while swallowing the camel that is the likelihood. (Econometrics edition)

Jason Hawkins writes:

I recently read an article by the econometrician William Greene of NYU and others (in a 2005 book). They state the following:

The key difference between Bayesian and classical approaches is that Bayesians treat the nature of the randomness differently. In the classical view, the randomness is part of the model; it is the heterogeneity of the taste parameters, across individuals. In the Bayesian approach, the randomness ‘represents’ the uncertainty in the mind of the analyst (conjugate priors notwithstanding). Therefore, from the classical viewpoint, there is a ‘true’ distribution of the parameters across individuals. From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.

My understanding is that this statement runs counter to the Bernstein-von Mises theorem, which in the wording of Wikipedia “assumes there is some true probabilistic process that generates the observations, as in frequentism” (my emphasis). Their context is comparing individual parameters from a mixture model, which can be taken from the posterior of a Bayesian inference or (in the frequentist case) obtained through simulation. I was particularly struck by their terming randomness as part of the model in the frequentist approach, which to me reads more as a feature of Bayesian approaches that are driven by uncertainty quantification.

My reply: Yes, I disagree with the above-quoted passage. They are exhibiting a common misunderstanding. I’ll respond with two points:

1. From the Bayesian perspective there also is a true parameter; see for example Appendix B of BDA for a review of the standard asymptotic theory. That relates to Hawkins’s point about the Bernstein-von Mises theorem.

2. Greene et al. write, “From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.” The same is true in the classical viewpoint; just replace the word “priors” by “likelihoods” or, more correctly, “data models.” Hire two different econometricians to fit two different models to your data and they can get “very different, albeit both legitimate” inferences.

Hawkins sends another excerpt from the paper:

The Bayesian approach requires the a priori specification of prior distributions for all of the model parameters. In cases where this prior is summarising the results of previous empirical research, specifying the prior distribution is a useful exercise for quantifying previous knowledge (such as the alternative currently chosen). In most circumstances, however, the prior distribution cannot be fully based on previous empirical work. The resulting specification of prior distributions based on the analyst’s subjective beliefs is the most controversial part of Bayesian methodology. Poirier (1988) argues that the subjective Bayesian approach is the only approach consistent with the usual rational actor model to explain individuals’ choices under uncertainty. More importantly, the requirement to specify a prior distribution enforces intellectual rigour on Bayesian practitioners. All empirical work is guided by prior knowledge and the subjective reasons for excluding some variables and observations are usually only implicit in the classical framework. The simplicity of the formula defining the posterior distribution hides some difficult computational problems, explained in Brownstone (2001).

That’s a bit better but it still doesn’t capture the all-important point that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.

And this:

Allenby and Rossi (1999) have carried out an extensive Bayesian analysis of discrete brand choice and discussed a number of methodological issues relating to the estimation of individual level preferences. In comparison of the Bayesian and classical methods, they state the simulation based classical methods are likely to be extremely cumbersome and are approximate whereas the Bayesian methods are much simpler and are exact in addition. As to whether the Bayesian estimates are exact while sampling theory estimates are approximate, one must keep in mind what is being characterised by this statement. The two estimators are not competing for measuring the same population quantity with alternative tools. In the Bayesian approach, the ‘exact’ computation is of the analysts posterior belief about the distribution of the parameter (conditioned, one might note on a conjugate prior virtually never formulated based on prior experience), not an exact copy of some now revealed population parameter. The sampling theory ‘estimate’ is of an underlying ‘truth’ also measured with the uncertainty of sampling variability. The virtue of one over the other is not established on any but methodological grounds – no objective, numerical comparison is provided by any of the preceding or the received literature.

Again, I don’t think the framing of Bayesian inference as “belief” is at all helpful. Does the classical statistician or econometrician’s logistic regression model represent his or her “belief”? I don’t think so. It’s not a belief, it’s a model, it’s an assumption.

But I agree with their other point that we should not consider the result of an exact computation to itself be exact. The output depends on the inputs.

We can understand this last point without thinking about statistical inference at all. Just consider a simple problem of measurement, where we estimate the weight of a liquid by weighing an empty jar, then weighing the jar with the liquid in it, then subtracting. Suppose the measured weights are 213 grams and 294 grams, so that the estimated weight of the liquid is 81 grams. The calculation, 294-213=81, is exact, but if the original measurements have error, then that will propagate to the result, so it would not be correct to say that 81 grams is the exact weight.
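The jar example can be written out as a short calculation. This is a sketch under stated assumptions: the two measurement errors are taken to be independent, and the 1-gram standard deviation is a made-up value for illustration, since the text gives no error size.

```python
from math import sqrt


def difference_with_error(full, empty, sd_full, sd_empty):
    """Estimate and standard error of (full - empty), assuming independent errors."""
    estimate = full - empty
    # Variances of independent errors add, even though the values are subtracted.
    std_error = sqrt(sd_full**2 + sd_empty**2)
    return estimate, std_error


# The 1-gram measurement errors are assumed values, not from the text.
estimate, std_error = difference_with_error(294.0, 213.0, sd_full=1.0, sd_empty=1.0)
print(f"liquid weight: {estimate:.0f} g, standard error about {std_error:.1f} g")
```

The subtraction is exact; the uncertainty in the answer comes entirely from the inputs and is larger than the uncertainty in either measurement alone.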

Hey—here’s some ridiculous evolutionary psychology for you, along with some really bad data analysis.

Jonathan Falk writes:

So I just started reading The Mind Club, which came to me highly recommended. I’m only in chapter 2. But look at the above graph, which is used thusly:

“As figure 5 reveals, there was a slight tendency for people to see more mind (rated consciousness and capacity for intention) in faster animals (shown by the solid sloped line)—it is better to be the hare than the tortoise. The more striking pattern in the graph is an inverted U shape (shown by the dotted curve), whereby both very slow and very fast animals are seen to have little mind, and human-speeded animals like dogs and cats are seen to have the most mind. This makes evolutionary sense, as potential predators and prey are all creatures moving at roughly our speed, and so it pays to understand their intentions and feelings. In the modern world we seldom have to worry about catching deer and evading wolves, but timescale anthropomorphism stays with us; in the dance of perceiving other minds, it pays to move at the same speed as everyone else.”

Wegner, Daniel M.; Gray, Kurt. The Mind Club (pp. 29-30). Penguin Publishing Group. Kindle Edition.

That “inverted U shape” seems a bit housefly-dependent, wouldn’t you say? And how is the “slight tendency” less “striking” than this putative inverse U shape?

Yeah, that quadratic curve is nuts. As is the entire theory.

Also, what’s the scale of the x-axis on that graph? If a sloth’s speed is 35, the wolf should be more than 70, no? This seems like the psychology equivalent of that political science study that said that North Carolina was less democratic than North Korea.

Falk sent me the link to the article, and it seems that the speed numbers are survey responses for “perceived speed of movement.” GIGO all around!

4 different meanings of p-value (and how my thinking has changed)

The p-value is one of the most common, and one of the most confusing, tools in applied statistics. Seasoned educators are well aware of all the things the p-value is not. Most notably, it’s not “the probability that the null hypothesis is true.” McShane and Gal find that even top researchers routinely misinterpret p-values.

But let’s forget for a moment about what p-values are not and instead ask what they are. It turns out that there are different meanings of the term. At first I was going to say that these are different “definitions,” but Sander Greenland pointed out that not all are definitions:

Definition 1. p-value(y) = Pr(T(y_rep) >= T(y) | H), where H is a “hypothesis,” a generative probability model, y is the observed data, y_rep are future data under the model, and T is a “test statistic,” some pre-specified function of data. I find it clearest to define this sort of p-value relative to potential future data; it can also be done mathematically and conceptually without any reference to repeated or future sampling, as in this 2019 paper by Vos and Holbert.

Definition 2. Start with a set of hypothesis tests of level alpha, for all values alpha between 0 and 1. p-value(y) is the smallest alpha among the tests that reject H given data y. This definition starts with a family of hypothesis tests rather than a test statistic, and it does not necessarily have a Bayesian interpretation, although in particular cases, it can also satisfy Definition 1.

Property 3. p-value(y) is some function of y that is uniformly distributed under H. I’m not saying that the term “p-value” is taken as a synonym for “uniform variate” but rather that this conditional uniform distribution is sometimes taken to be a required property of a p-value. It’s not a definition because in practice no one would define a p-value without some reference to a tail-area probability (Definition 1) or a rejection region (Definition 2)—but it is sometimes taken as a property that is required for something to be a true p-value. The relevant point here is that a p-value can satisfy Property 3 without satisfying Definition 1 (there are methods of constructing uniformly-distributed p-values that are not themselves tail-area probabilities), and a p-value can satisfy Definition 1 without satisfying Property 3 (when there is a composite null hypothesis and the distribution of the test statistic is not invariant to parameter values; see Xiao-Li Meng’s paper from 1994).

Description 4. p-value(y) is the result of some calculations applied to data that are conventionally labeled as a p-value. Typically, this will be a p-value under Definition 1 or 2 above, but perhaps defined under a hypothesis H that is not actually the model being fit to the data at hand, or a hypothesis H(y) that itself is a function of data, for example from p-hacking or forking paths. I’m labeling this as a “description” rather than a “definition” to clarify that this sort of p-value is used all the time without always a clear definition of the hypothesis, for example if you have a regression coefficient with estimate beta_hat and standard error s, and you compute 2 times the tail-area probability of |beta_hat|/s under the normal or t distribution, without ever defining a null hypothesis relative to all the parameters in your model. Sander Greenland calls this sort of thing a “descriptive” p-value, capturing the idea that the p-value can be understood as a summary of the discrepancy or divergence of the data from H according to some measure, ranging from 0 = completely incompatible to 1 = completely compatible. For example, the p-value from a linear regression z-score can be understood as a data summary without reference to a full model for all the coefficients.
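The regression example in Description 4 is the calculation most people actually run, so here it is as a minimal sketch. The function name and the numbers fed to it are hypothetical; the only content is the conventional formula, 2 times the normal tail area of |beta_hat|/s.

```python
from math import erfc, sqrt


def nominal_p_value(beta_hat, std_error):
    """Two-sided normal tail area for |beta_hat| / std_error."""
    z = abs(beta_hat) / std_error
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return erfc(z / sqrt(2.0))


# Hypothetical estimate and standard error, chosen only for illustration.
print(nominal_p_value(beta_hat=0.42, std_error=0.20))
```

Nothing in this function refers to a null hypothesis for the full model, which is exactly why it is a description rather than a definition.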

These are not four definitions/properties/descriptions of the same thing. They are four different things. Not completely different, as they coincide in certain simple examples, but different, and they serve different purposes. They have different practical uses and implications, and you can make mistakes when you use one sort to answer a different question. Just as, for example, posterior intervals and confidence intervals coincide in some simple examples but in general are different: lots of real-world posterior intervals don’t have classical confidence coverage, even in theory, and lots of real-world confidence intervals don’t have Bayesian posterior coverage, even in theory.

A single term with many meanings—that’s a recipe for confusion! Hence this post, which does not claim to solve any technical problems but is just an attempt to clarify.

In all the meanings above, H is a “generative probability model,” that is, a class of probability models for the modeled data, y. If H is a simple null hypothesis, H represents a specified probability distribution, p(y|H). If H is a composite null hypothesis, there is some vector of unknown parameters theta indexing a family of probability distributions, p(y|theta,H). As Daniel Lakeland so evocatively put it, a null hypothesis is a specific random number generator.
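Taking the random-number-generator view literally gives a direct way to compute a Definition 1 p-value: simulate replicated data from H and count how often T(y_rep) is at least T(y). The sketch below assumes a simple null (independent standard normal observations) and a test statistic (largest absolute value) that are both hypothetical choices, as are the data.

```python
import random


def monte_carlo_p_value(y, simulate_under_h, test_statistic, n_rep, rng):
    """Estimate Pr(T(y_rep) >= T(y) | H) by simulating replicated data."""
    t_obs = test_statistic(y)
    exceed = 0
    for _ in range(n_rep):
        y_rep = simulate_under_h(len(y), rng)
        if test_statistic(y_rep) >= t_obs:
            exceed += 1
    return exceed / n_rep


def simulate_standard_normal(n, rng):
    """Hypothetical simple null H: n independent standard normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def max_abs(y):
    """Hypothetical test statistic T: the largest absolute value in the data."""
    return max(abs(v) for v in y)


rng = random.Random(1)
y_obs = [0.3, -1.2, 0.8, 2.9, -0.4]  # made-up data
p = monte_carlo_p_value(y_obs, simulate_standard_normal, max_abs, 10_000, rng)
print(f"estimated p-value: {p:.3f}")
```

With a composite null, `simulate_under_h` would also need a value of theta, which is where the complications discussed in this post come in.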

Under any of the above meanings, the p-value is a number—a function of data, y—and also can be considered as a random variable with probability distribution induced by the distribution of y under H. For a composite null hypothesis, that distribution will in general depend on theta, but that complexity is not our focus here.

So, back to p-values. How can one term have four meanings? Pretty weird, huh?

The answer is that under certain ideal conditions, the four meanings coincide. In a model with continuous data and a continuous test statistic and a point null hypothesis, all four of the above meanings give the same answer. Also there are some models with unknown parameters where the test statistic can be defined to have a distribution under H that is invariant to parameters. And this can also be the case asymptotically.

More generally, though, the four meanings are different. None of them are “better” or “worse”; they’re just different. Each has some practical value:

– A p-value under Definition 1 can be directly interpreted as a probability statement about future data conditional on the null hypothesis (as discussed here).

– A p-value under Definition 2 can be viewed as a summary of a class of well-defined hypothesis tests (as discussed in footnote 4 of this article by Philip Stark).

– A p-value with Property 3 has a known distribution under the null hypothesis, so the distribution of a collection of p-values can be compared to uniform (as discussed here).

– A p-value from Description 4 is unambiguously defined from existing formulas so is a clear data summary even if it can’t easily be interpreted as a probability in the context of the problem at hand.
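A small simulation shows how Definition 1 and Property 3 can come apart even with a point null. In this sketch (hypothetical models and function names, chosen for simplicity) the tail-area p-value is uniform when the data are continuous, but with discrete data the same tail-area recipe gives a p-value that falls below 0.05 much less than 5% of the time.

```python
import random
from math import comb, erfc, sqrt


def normal_upper_p(z):
    """Upper tail area P(Z >= z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2.0))


def binomial_upper_p(k, n, p):
    """Upper tail area P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def rejection_rate(p_values, alpha):
    """Share of p-values at or below alpha; equals alpha if they are uniform."""
    return sum(pv <= alpha for pv in p_values) / len(p_values)


rng = random.Random(2)
n_sims = 20_000

# Continuous data, point null: y ~ normal(0, 1), T(y) = y.
continuous = [normal_upper_p(rng.gauss(0.0, 1.0)) for _ in range(n_sims)]

# Discrete data, point null: y ~ Binomial(10, 0.5), T(y) = y.
tail = [binomial_upper_p(k, 10, 0.5) for k in range(11)]
discrete = [tail[sum(rng.random() < 0.5 for _ in range(10))] for _ in range(n_sims)]

for alpha in (0.05, 0.25, 0.50):
    print(alpha, rejection_rate(continuous, alpha), rejection_rate(discrete, alpha))
```

Both sets of numbers are legitimate tail-area probabilities; only the first set can be compared directly to a uniform distribution.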

As an example, in this article from 1989, Besag and Clifford come up with a Monte Carlo procedure that yields p-values that satisfy Property 3 but not Definition 1 or Description 4. And in 1996, Meng, Stern, and I discussed Bayesian p-values that satisfied Definition 1 but not Property 3.

The natural way to proceed is to give different names to the different p-values. The trouble is that different authors choose different naming conventions!

I’ve used the term “p-value” for Definition 1 and “u-value” for Property 3; see section 2.3 of this article from 2003. And in this article from 2014 we attempted to untangle the difference between Definition 1 and Property 3. I haven’t thought much about Definition 2, and I’ve used the term “nominal p-value” for Description 4.

My writing about p-values has taken Definition 1 as a starting point. My goal has been to examine misfit of the null hypothesis with respect to some data summary or test statistic, not to design a procedure to reject with a fixed probability conditional on a null hypothesis or to construct a measure of evidence that is uniformly distributed under the null. Others including Bernardo, Bayarri, and Robins are less interested in a particular test statistic and are more interested in creating a testing procedure or a calibrated measure of evidence, and they have taken Definition 2 or Property 3 as their baseline, referring to p-values with Property 3 as “calibrated” or “valid” p-values. This terminology is as valid as mine; it’s just taking a different perspective on the same problem.

In an article from 2023 with follow-up here, Sander Greenland distinguishes between “divergence p-values” and “decision p-values,” addressing similar issues of overloading of the term “p-value.” The former corresponds to Definition 1 above using the same sort of non-repeated-sampling view of p-values as favored by Vos and Holbert and addresses the issues raised by Description 4, and the latter corresponds to Definition 2 and addresses the issues raised by Property 3. As Greenland emphasizes, a p-value doesn’t exist in a vacuum; it should be understood in the context in which it will be used.

My thinking has changed.

My own thinking about p-values and significance testing has changed over the years. It all started when I was working on my Ph.D. thesis, fitting a big model to medical imaging data and finding that the model didn’t fit the data. I could see this because the chi-squared statistic was too large! We had something like 20,000 pieces of count data, and the statistic was about 30,000, which would be compared to a chi-squared distribution with 20,000 minus k degrees of freedom, where k is the number of “effective degrees of freedom” in the model. The “effective degrees of freedom” thing was interesting, and it led me into the research project that culminated in the 1996 paper with Meng and Stern.
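To get a feel for how large the misfit was, here is a rough back-of-the-envelope sketch. It uses the standard facts that a chi-squared distribution with nu degrees of freedom has mean nu and variance 2*nu, and it treats the round numbers above (20,000 and 30,000) as exact. The values of k tried below are hypothetical, since the actual effective degrees of freedom are not given here.

```python
from math import sqrt

def chi2_z_score(statistic, df):
    """Standard deviations by which a chi-squared statistic exceeds its mean.

    Uses mean = df and variance = 2 * df for a chi-squared distribution.
    """
    return (statistic - df) / sqrt(2 * df)

N_DATA = 20000      # approximate number of count data points, from the text
STATISTIC = 30000   # approximate chi-squared statistic, from the text

# k values are hypothetical; the text does not say what k was.
for k in (0, 1000, 5000):
    df = N_DATA - k
    print(f"k = {k:>4}: df = {df}, z = {chi2_z_score(STATISTIC, df):.1f}")
```

With k = 0 the statistic sits 50 standard deviations above its expected value, and any positive k makes the discrepancy larger. So under these round numbers the rejection does not hinge on getting the degrees-of-freedom adjustment exactly right.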

The relevant point here was that I was not coming into that project with the goal of creating “Bayesian p-values.” Rather, I wanted to be able to check the fit of model to data, and this was a way for me to deal with the fact that existing degrees-of-freedom adjustments did not work in my problem.

The other thing I learned when working on that project was that a lot of Bayesians didn’t like the idea of model checking at all! They had this bizarre (to me) attitude that, because their models were “subjective,” they didn’t need to be checked. So I leaned hard into the idea that model checking is a core part of Bayesian data analysis. This one example, and the fallout from it, gave me a much clearer sense of data analysis as a Popperian or Lakatosian process, leading to this 2013 article with Shalizi.

In the meantime, though, I started to lose interest in p-values. Model checking was and remains important to me, but I found myself doing it using graphs. Actually, the only examples I can think of where I used hypothesis testing for data analysis were the aforementioned tomography model from the late 1980s (where the null hypothesis was strongly rejected) and the “55,000 residents desperately need your help!” example from 2004 (where we learned from a non-rejection of the null). Over the years, I remained aware of issues regarding p-values, and I wrote some articles on the topic, but this was more from theoretical interest or with the goal of better understanding common practice, not with any goal to develop better methods for my own use. This discussion from 2013 of a paper by Greenland and Poole gives a sense of my general thinking.

P.S. Remember, the problems with p-values are not just with p-values.

P.P.S. I thank Sander Greenland for his help with this, even if he does not agree with everything written here.

P.P.P.S. Sander also reminds me that all the above disagreements are trivial compared to the big issues of people acting as if “not statistically significant” results are confirmations of the null hypothesis and as if “statistically significant” results are confirmations of their preferred alternative. Agreed. Those are the top two big problems related to model checking and hypothesis testing, and then I’d say the third most important problem is people not being willing to check their models or consider what might happen if their assumptions are wrong (both Greenland and Stark have written a lot about that problem, and, as noted above, that was one of my main motivations for doing research on the topic of hypothesis testing).

Compared to those three big issues, the different meanings of p-values are less of a big deal. But they do come up, as all four of the above sorts of p-values are used in serious statistical practice, so I think it’s good to be aware of their differences. Otherwise it’s easy to slip into the attitude that other methods are “wrong,” when they’re just different.


A colleague told me that he got a useful research idea from StatRetro the other day, so I wanted to plug it again:

StatRetro is a twitter feed with old posts from the Statistical Modeling, Causal Inference, and Social Science blog from 2004 to now, in chronological order, tweeted every 8 hours. It’s now in May 2007. Lots of great stuff, including for example this post, “Happiness, children, and the difficulties of trying to answer Why-type questions,” which anticipated my paper with Guido from several years later, “Why ask why? Forward causal inference and reverse causal questions.”

Three posts a day! Enjoy.

“Risk ratio, odds ratio, risk difference… Which causal measure is easier to generalize?”

Anders Huitfeldt writes:

Thank you so much for discussing my preprint on effect measures (“Count the living or the dead?”) on your blog! I really appreciate getting as many eyes as possible on this work; having it highlighted by you is the kind of thing that can really make the snowball start rolling towards getting a second chance in academia (I am currently working as a second-year resident in addiction medicine, after exhausting my academic opportunities).

I just wanted to highlight a preprint that was released today by Bénédicte Colnet, Julie Josse, Gaël Varoquaux, and Erwan Scornet. To me, this preprint looks like it might become an instant classic. Colnet and her coauthors generalize my thought process, and present it with much more elegance and sophistication. It is almost something I might have written if I had an additional standard deviation in IQ, and if I was trained in biostatistics instead of epidemiology.

The article in question begins:

From the physician to the patient, the term effect of a drug on an outcome usually appears very spontaneously, within a casual discussion or in scientific documents. Overall, everyone agrees that an effect is a comparison between two states: treated or not. But there are various ways to report the main effect of a treatment. For example, the scale may be absolute (e.g. the number of migraine days per month is expected to diminish by 0.8 taking Rimegepant) or relative (e.g. the probability of having a thrombosis is expected to be multiplied by 3.8 when taking oral contraceptives). Choosing one measure or the other has several consequences. First, it conveys a different impression of the same data to an external reader. . . . Second, the treatment effect heterogeneity – i.e. different effects on sub-populations – depends on the chosen measure. . . .

Beyond impression conveyed and heterogeneity captured, different causal measures lead to different generalizability towards populations. . . . Generalizability of trials’ findings is crucial as most often clinicians use causal effects from published trials (i) to estimate the expected response to treatment for a specific patient . . .

This is indeed important, and it relates to things that people have been thinking about for a while regarding varying treatment effects. Colnet et al. point out that, even if effects are constant on one scale, they will vary on other scales. In some sense, this hardly matters given that we can expect effects to vary on any scale. Different scales correspond to different default interpretations, which fits the idea that the choice of transformation is as much a matter of communication as of modeling. In practice, though, we use default model classes, and so parameterization can make a difference.
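A small numerical sketch of the point about scales: hold the risk ratio fixed across three subpopulations with different baseline risks and see what happens to the risk difference and the odds ratio. The baseline risks and the risk ratio of 0.5 are made-up illustrative numbers, not taken from Colnet et al. or from any trial.

```python
def effect_measures(p0, p1):
    """Risk difference, risk ratio, and odds ratio for control risk p0 and treated risk p1."""
    risk_difference = p1 - p0
    risk_ratio = p1 / p0
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    return risk_difference, risk_ratio, odds_ratio

RISK_RATIO = 0.5  # hypothetical constant effect on the risk-ratio scale

# Hypothetical baseline (control) risks in three subpopulations.
for p0 in (0.02, 0.10, 0.40):
    rd, rr, odds_ratio = effect_measures(p0, RISK_RATIO * p0)
    print(f"baseline {p0:.2f}: RD = {rd:+.3f}, RR = {rr:.2f}, OR = {odds_ratio:.3f}")
```

The risk ratio is the same in every row by construction, while the risk difference and the odds ratio both change with the baseline risk. A model that assumes a constant effect is therefore making a different assumption depending on which scale the effect is written on.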

The new paper by Colnet et al. is potentially important because, as they point out, there remains a lot of confused thinking on the topic, both in theory and in practice, and I think part of the problem is a traditional setup in which there is a “treatment effect” to be estimated. In applied studies, you’ll often see this as a coefficient in a model. But, as Colnet et al. point out, if you take that coefficient as estimated from study A and use it to generalize to study B, you’ll be making some big assumptions. Better to get those assumptions out in the open and consider how the effect can vary.

As we discussed a few years ago, the average causal effect can be defined in any setting, but it can be misleading to think of it as a “parameter” to be estimated, as in general it can depend strongly on the context where it is being studied.

Finally, I’d like to again remind readers of our recent article, Causal quartets: Different ways to attain the same average treatment effect (blog discussion here), which discusses the many different ways that an average causal effect can manifest itself in the context of variation:

As Steve Stigler pointed out, there’s nothing necessarily “causal” about the content of our paper, or for that matter of the Colnet et al. paper. In both cases, all the causal language could be replaced by predictive language and the models and messages would be unchanged. Here is what we say in our article:

Nothing in this paper so far requires a causal connection. Instead of talking about heterogeneous treatment effects, we could just as well have referred to variation more generally. Why, then, are we putting this in a causal framework? Why “causal quartets” rather than “heterogeneity quartets”?

Most directly, we have seen the problem of unrecognized heterogeneity come up all the time in causal contexts, as in the examples in [our paper], and not so much elsewhere. We think a key reason is that the individual treatment effect is latent. So it’s not possible to make the “quartet” plots with raw data. Instead, it’s easy for researchers to simply assume the causal effect is constant, or to not think at all about heterogeneity of causal effects, in a way that’s harder to do with observable outcomes. It is the very impossibility of directly drawing the quartets that makes them valuable as conceptual tools.

So, yes, variation is everywhere, but in the causal setting, where at least half of the potential outcomes are unobserved, it’s easier for people to overlook variation or to use models where it isn’t there, such as the default model of a constant effect (on some scale or another).

It can be tempting to assume a constant effect, maybe because it’s simpler or maybe because you haven’t thought too much about it or maybe because you think that, in the absence of any direct data on individual causal effects, it’s safe to assume the effect doesn’t vary. But, for reasons discussed in the various articles above, assuming constant effects can be misleading in many different ways. I think it’s time to move off of that default.

Surgisphere . . . is back!!

Unfortunately, this is not an April Fool’s post.

The story comes from Dale Lehman:

In case you were wondering what happened to Dr. Sapan Desai (of Surgisphere fame), here he is.

They claim to uphold the highest research standards! I didn’t recognize any of the other names associated with them.

From the webpage:

Sapan S. Desai (MD, Ph.D., MBA, FACS) is a board-certified vascular surgeon and the President and Chief Executive Officer (CEO) of Surgisphere Corporation, a public service organization located in Chicago, Illinois.

Author of more than 200 peer-reviewed publications and textbooks. Dr. Desai is an internationally-recognized expert in big data analytics, hospital performance improvement, and machine learning.

Can’t these people get honest jobs as clerks or stockbrokers or ditchdiggers or something?

If this new company doesn’t work out maybe Desai could get an evilicious job at Risk Eraser doing primate research. I’m sure that Bouzha Cookman, Steven Pinker, Susan Carey, Gregg Solomon, and Larry Suter would love to have him on the team. Big-data analytics and 200 peer-reviewed publications. Unfortunately, we know about the problem with peer review.

The behavioral economists’ researcher degree of freedom

A few years ago we talked about the two modes of pop-microeconomics:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-school teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

The trick is knowing whether you’re gonna get 1 or 2 above. They’re complete opposites!

I thought of this when rereading this post from a few years ago, where we quoted Jason Collins, who wrote regarding the decades-long complacency of the academic psychology and economics establishment regarding the hot-hand fallacy fallacy:

We have a body of research that suggests that even slight cues in the environment can change our actions. Words associated with old people can slow us down. Images of money can make us selfish. And so on. Yet why haven’t these same researchers been asking why a basketball player would not be influenced by their earlier shots – surely a more salient part of the environment than the word “Florida”? The desire to show one bias allowed them to overlook another.

When writing the post with the above quote, I had been thinking specifically of issues with the hot hand.

Stepping back, I see this as part of the larger picture of researcher degrees of freedom in the fields of social psychology and behavioral economics.

You can apply the “two modes of thinking” idea to the hot hand:

Argument 1 goes like this: Believing in the hot hand sounds silly. But lots of successful players and coaches believe in it. Real money is at stake—this is not cheap talk! So it’s our duty to go beneath the surface and understand why, counterintuitively, belief in the hot hand makes sense, even though it might naively seem like a fallacy. Let’s prove that the pointy-headed professors outsmarted themselves and the blue-collar ordinary-Joe basketball coaches were right all along, following the anti-intellectual mode that was so successfully employed by the Alvin H. Baum Professor of Economics at the University of Chicago (for example, an unnamed academic says something stupid, only to be shot down by regular-guy “Chuck Esposito, a genial, quick-witted and thoroughly sports-fixated man who runs the race and sports book at Caesars Palace in Las Vegas.”)

Argument 2 goes the other way: Everybody thinks there’s a hot hand, but we, the savvy social psychologists and behavioral economists, know that because of evolution our brains make lots of shortcuts. Red Auerbach might think he’s an expert at basketball, but actually some Cornell professors have collected some data and have proved definitively that everything you thought about basketball was wrong.

Argument 1 is the “Econ 101” idea that when people have money on the line, they tend to make smart decisions, and we should be suspicious of academic theories that claim otherwise. Argument 2 is the “scientist as hero” idea that brilliant academics are making major discoveries every day, as reported to you by Ted, NPR, etc.

In the case of the hot hand, the psychology and economics establishment went with Argument 2. I don’t see any prior reason why they’d pick 1 or 2. In this case I think they just made an honest mistake: a team of researchers did a reasonable-seeming analysis and everyone went from there. Following the evidence—that’s a good idea! Indeed, for decades I believed that the hot hand was a fallacy. I believed in it, I talked about it, I used it as an example in class . . . until Josh Miller came to my office and explained to me how so many people, including me, had gotten it wrong.

So my point here is not to criticize economists and psychologists for getting this wrong. The hot hand is subtle, and it’s easy to get this one wrong. What interests me is how they chose—even if the choice was not made consciously—to follow Argument 2 rather than Argument 1 here. You could say the data led them to Argument 2, and that’s fine, but the same apparent strength of data could’ve led them to Argument 1. These are people who promote flat-out ridiculous models of the Argument 1 form such as the claim that “all deaths are to some extent suicides.” Sometimes they have a hard commitment to Argument 1. This time, though, they went with #2, and this time they were the foolish professors who got lost trying to model the real world.

I’m still working my way through the big picture here of trying to understand how Arguments 1 and 2 coexist, and how the psychologists and economists decide which one to go for in any particular example.

Interestingly enough, in the hot-hand example, after the behavioral economists saw their statistical argument overturned, they didn’t flip over to Argument 1 and extol the savvy of practical basketball coaches. Instead they pretty much tried to minimize their error and keep as much of Argument 2 as they could, for example arguing that, ok, maybe there is a hot hand but it’s much less than people think. They seem strongly committed to the idea that basketball players can’t be meaningfully influenced by previous shots, even while also being committed to the idea that words associated with old people can slow us down, images of money can make us selfish, and so on. I’m still chewing on this one.

Someone has a plea to teach real-world math and statistics instead of “derivatives, quadratic equations, and the interior angles of rhombuses”

Robert Thornett writes:

What if, for example, instead of spending months learning about derivatives, quadratic equations, and the interior angles of rhombuses, students learned how to interpret financial and medical reports and climate, demographic, and electoral statistics? They would graduate far better equipped to understand math in the real world and to use math to make important life decisions later on.

I agree. I mean, I can’t be sure; he’s making a causal claim for which there is no direct evidence. But it makes sense to me.

Just one thing. The “interior angles of rhombuses” thing is indeed kinda silly, but I think it would be awesome to have a geometry class where students learn to solve problems like: Here’s the size of a room, here’s the location of the doorway opening and the width of the hallway, here are the dimensions of a couch, now how do you manipulate the couch to get it from the hall through the door into the room, or give a proof that it can’t be done. That would be cool, and I guess it would motivate some geometrical understanding.

In real life, though, yeah, learning standard high school and college math is all about turning yourself into an algorithm for solving exam problems. If the problem looks like A, do X. If it looks like B, do Y, etc.

Lots of basic statistics teaching looks like that too, I’m afraid. But statistics has the advantage of being one step closer to application, which should help a bit.

Also, yeah, I think we can all agree that “derivatives, quadratic equations, and the interior angles of rhombuses” are important too. The argument is not that these should not be taught, just that these should not be the first things that are taught. Learn “how to interpret financial and medical reports and climate, demographic, and electoral statistics” first, then if you need further math courses, go on to the derivatives and quadratic equations.

Bad stuff going down in biostat-land: Declaring null effect just cos p-value is more than 0.05, assuming proportional hazards where it makes no sense

Wesley Tansey writes:

This is no doubt something we both can agree is a sad and wrongheaded use of statistics, namely incredible reliance on null hypothesis significance testing. Here’s an example:

Phase III trial. Failed because their primary endpoint had a p-value of 0.053 instead of 0.05. Here’s the important actual outcome data though:

For the primary efficacy endpoint, INV-PFS, there was no significant difference in PFS between arms, with 243 (84%) of events having occurred (stratified HR, 0.77; 95% CI: 0.59, 1.00; P = 0.053; Fig. 2a and Table 2). The median PFS was 4.5 months (95% CI: 3.9, 5.6) for the atezolizumab arm and 4.3 months (95% CI: 4.2, 5.5) for the chemotherapy arm. The PFS rate was 24% (95% CI: 17, 31) in the atezolizumab arm versus 7% (95% CI: 2, 11; descriptive P < 0.0001) in the chemotherapy arm at 12 months and 14% (95% CI: 7, 21) versus 1% (95% CI: 0, 4; descriptive P = 0.0006), respectively, at 18 months (Fig. 2a). As the INV-PFS did not cross the 0.05 significance boundary, secondary endpoints were not formally tested.

The odds of atezolizumab being better than chemo are clearly high. Yet this entire article is being written as the treatment failing simply because the p-value was 0.003 too high.

He adds:

And these confidence intervals are based on proportional hazards assumptions. But this is an immunotherapy trial where we have good evidence that these trials violate the PH assumption. Basically, you get toxicity early on with immunotherapy, but patients that survive that have a much better outcome down the road. Same story here; see figure below. Early on the immunotherapy patients are doing a little worse than the chemo patients but the long-term survival is much better.

As usual, our recommended solution for the first problem is to acknowledge uncertainty and our recommended solution for the second problem is to expand the model, at the very least by adding an interaction.

Regarding acknowledging uncertainty: Yes, at some point decisions need to be made about choosing treatments for individual patients and making general clinical recommendations—but it’s a mistake to “prematurely collapse the wave function” here. This is a research paper on the effectiveness of the treatment, not a decision-making effort. Keep the uncertainty there; you’re not doing us any favors by acting as if you have certainty when you don’t.
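To make the uncertainty in the quoted result concrete, here is a rough back-calculation from the published summary (HR 0.77, 95% CI 0.59 to 1.00). It is only a sketch: it assumes a normal approximation on the log hazard ratio, recovers the standard error from the width of the interval, and, for the last number, assumes a flat prior on the log hazard ratio. It also takes the proportional-hazards summary at face value, which is exactly the assumption questioned above. The function name is hypothetical.

```python
from math import log
from statistics import NormalDist

def summarize_hazard_ratio(hr, ci_low, ci_high, level=0.95):
    """Back out approximate inference from a hazard ratio and its confidence interval.

    Assumes the interval is symmetric on the log scale (normal approximation).
    Returns (standard error of log HR, z-score, two-sided p-value,
    probability that HR < 1 under a flat prior on log HR).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    log_hr = log(hr)
    se = (log(ci_high) - log(ci_low)) / (2 * z_crit)
    z = log_hr / se
    p_two_sided = 2 * std_normal.cdf(-abs(z))
    prob_hr_below_1 = NormalDist(mu=log_hr, sigma=se).cdf(0.0)
    return se, z, p_two_sided, prob_hr_below_1

se, z, p_value, prob_benefit = summarize_hazard_ratio(0.77, 0.59, 1.00)
print(f"se(log HR) = {se:.3f}, z = {z:.2f}")
print(f"approximate two-sided p-value = {p_value:.3f}")
print(f"Pr(HR < 1) under a flat prior = {prob_benefit:.3f}")
```

The approximate p-value lands near 0.05, close to the reported 0.053 (the small gap is expected given rounding in the published numbers and the stratified analysis). Under these assumptions the same data put the probability that the hazard ratio is below 1 at around 97%, which is hard to square with summarizing the trial as showing “no difference.”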

Predicting LLM havoc

This is Jessica. Jacob Steinhardt recently posted an interesting blog post on predicting emergent behaviors in modern ML systems like large language models. The premise is that we can get qualitatively different behaviors from a deep learning model with enough scale–e.g., AlphaZero hitting a point in training where suddenly it has acquired a number of chess concepts. Broadly we can think of this happening as a result of how acquiring new capabilities can help a model lower its training loss and how as scale increases, you can get points where some (usually more complex) heuristic comes to overtake another (simpler) one. The potential for emergent behaviors might seem like a counterpoint to the argument that ML researchers should write broader impacts statements to prospectively name the potential harms their work poses to society… non-linear dynamics can result in surprises, right? But Steinhardt’s argument is that some types of emergent behavior are predictable.

The whole post is worth reading so I won’t try to summarize it all. What most captured my attention though is his argument about predictable deception, where a model fools or manipulates the (human) supervisor rather than doing the desired tasks, because doing so gets it better or equal reward. Things like ChatGPT saying that “When I said that tequila has ‘relatively high sugar content,’ I was not suggesting that tequila contains sugar” or an LLM claiming there is “no single right answer to this question” when there is, sort of like a journalist insisting on writing a balanced article about some issue where one side is clearly ignoring evidence. 

The creepy part is that the post argues that there is reason to believe that certain factors we should expect to see in the future–like models being trained on more data, having longer dialogues with humans, and being more embedded in the world (with a potential to act)–are likely to increase deception. One reason is because models can use the extra info they are acquiring to build better theories-of-mind and use them to better convince their human judges of things. And when they can understand what humans respond to and act in the world they can influence human beliefs through generating observables. For example, we might get situations like the following: 

suppose that a model gets higher reward when it agrees with the annotator’s beliefs, and also when it provides evidence from an external source. If the annotator’s beliefs are wrong, the highest-reward action might be to e.g. create sockpuppet accounts to answer a question on a web forum or question-answering site, then link to that answer. A pure language model can’t do this, but a more general model could.

This reminds me of a similar example used by Gary Marcus of how we might start with some untrue proposition or fake news (e.g., Mayim Bialik is selling CBD gummies) and suddenly have a whole bunch of websites on this topic. Though he seemed to be talking about humans employing LLMs to generate bullshit web copy. Steinhardt also argues that we might expect deception to emerge very quickly (think phase transition), as suddenly a model achieves high enough performance by deceiving all the time that those heuristics dominate over the more truthful strategies. 

The second part of the post on emergent optimization argues that as systems increase in optimization power—i.e., as they consider a larger and more diverse space of possible policies to achieve some goal—they become more likely to hack their reward functions. E.g., a model might realize that your long-term goals (say, lots of money and lots of contentment) are hard to achieve, and so instead it resorts to trying to change how you appraise one of those things over time. The fact that planning capabilities can emerge in deep models even when they are given a short-term objective (like predicting the next token in some string of text) and that we should expect planning to drive down training loss (because humans do a lot of planning and human-like behavior is the goal) means we should be prepared for reward hacking to emerge.

From a personal perspective, the more time I spend trying out these models, and the more I talk to people working on them, the more I think being in NLP right now is sort of a double-edged sword. The world is marveling at how much these models can do, and the momentum is incredible, but it also seems that on a nearly daily basis we have new non-human-like (or perhaps worse, human-like but non-desirable) behaviors getting classified and becoming targets for research. So you can jump into the big whack-a-mole game, and it will probably keep you busy for a while, but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches. Though I guess anyone who is watching curiously what’s going on in NLP is in the same boat. It really is kind of uncomfortable.

This is not to say though that there aren’t plenty of NLP researchers thinking about LLMs with a relatively clear sense of direction and vision – there certainly are. But I’ve also met researchers who seem all in but without being able to talk very convincingly about where they see it all going. Anyway, I’m not informed enough about LLMs to evaluate Steinhardt’s predictions but I like that some people are making thoughtful arguments about what we might expect to see.


P.S. I wrote “but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches” but it occurs to me now that it’s not really clear to me what I’m waiting for to determine “how far we can go.” Do deep models really need to perfectly emulate humans in every way we can conceive of for these approaches to be considered successful? It’s interesting to me that despite all the impressive things LLMs can do right now, there is this tendency (at least for me) to talk about them as if we need to withhold judgment for now. 

Blogs > Twitter, part the umpteenth

I happened to come across this 2015 post that recounted the anecdote of the famed economist Robert Solow telling us back in 1986 that if it was up to him he’d cut funding for Amtrak to zero. I think his point was that, yeah, he’s a liberal but he’s no doctrinaire, but when he said it I just thought he was displaying narrow economist tribal bias (cars are good because they’re free enterprise, trains are bad because government) and not recognizing all the tax money that goes into maintaining the road system.

But that’s not the topic for today. Scrolling down at the above-linked post you’ll come to an exchange in the comments section that proceeds roughly as follows:

1. Someone pulls a sentence of mine from an earlier post out of context and uses it to criticize me in a stuffy way.

2. I get annoyed and reply, “That’s just humorless and rude of you,” and explain in great detail why the commenter didn’t get the joke.

3. The commenter responds that it was me who missed the point, and he’d just been kidding.

4. I thank him for the clarification and remark that intonation is notoriously difficult to convey in typed speech.

5. He acknowledges my acknowledgment and agrees with the point about intonation and typed speech.

6. All’s good.

Now just imagine how this would’ve played out in twitter:

1. Same step 1 as above.

1a. Before I get around to responding, 12 people read this guy’s comment and don’t realize it’s a joke or follow the links. They explode in righteous anger against me.

2. Same step 2 as above. But, because of 1a, I’m not just moderately annoyed at him, I’m very annoyed at him.

2a. Before the original commenter gets around to responding, 12 people read my reaction and don’t realize I didn’t get the joke. They explode in righteous anger against him.

3. Same step 3 as above. But, because of 2a, he’s not just amused, he’s annoyed at me.

4. Same step 4 as above.

5. Same step 5 as above.

6. All’s good with original commenter and me, but now there are 100 people who only saw part of the thread and leave with the impression that I’m an idiot, and another 100 people who only saw part of the thread and think he’s a humorless ass. This carries through to later twitter appearances. (“Isn’t he the guy who . . .”)

P.S. Blogs > Twitter from 2014

P.P.S. Blogs > Twitter again from 2022

Should drug companies be required to release data right away, not holding data secret until after regulatory approval?

Dale Lehman writes:

The attached article from the latest NEJM issue (study and appendix) caught my attention – particularly Table 1 which shows the randomized vaccine and control groups. The groups looked too similar compared with what I am used to seeing in RCTs. Table S4 provides various medical conditions in the two groups, and this looks a bit more like what I’d expect. However, I was still a bit disturbed, sort of like seeing the pattern HTHTHTHTHT in 10 flips of a coin. So, there is a potential substantive issue here – but more importantly a policy issue which I will get to shortly.

The Potential Substantive Issue

I have no reason to believe the study or data analysis was done poorly. Indeed, the vaccine appears to be quite effective, and my initial suspicions about the randomization seem less salient to me now. But, just to investigate it further, I looked at the confidence intervals and coverage if the assignment had been purely random. Out of 35 comparisons between the control and vaccine groups (some demographic and some medical – I made no adjustment for the fact that these comparisons are not independent), 17% fell outside of a 95% confidence interval and 40% outside of a 68% (one standard deviation) confidence interval. This did not reinforce my suspicions, as the large sample size made the similarities between the 2 groups less striking than I initially thought.

So, as a comparison, I looked at another recent RCT in the NEJM (“Comparative Effectiveness of Aspirin Dosing in Cardiovascular Disease,” NEJM, May 27, 2021). Doing the same comparisons of the difference between the control and treatment groups in relation to the confidence intervals, 36% of the comparisons fell outside of a 95% confidence interval and 68% outside of the 68% confidence interval. This is closer to what I normally see – it is difficult to match control and treatment groups through random assignment, which is why I always try to do a multivariate analysis (and, I believe, why you always are asking for multilevel studies).

So, this particular vaccine study seems to have matched the 2 groups more closely than the second study, but my initial suspicions were not heightened by my analysis. So, what I wanted to see was some cross tabulations to see if the two groups’ similarities continued at a more granular level. Which brings me to the more important policy issue.

Policy Issue

The data sharing arrangement here stated that the deidentified data would be made available upon publication. The instructions were to access it through the Yale Open Data Access Project site. This study was not listed there, but there is a provision to apply for access to data from studies not listed. So, I went back to the Johnson & Johnson data sharing policy to make sure I could request access – but that link was broken. So, I wrote to the first author of the study. He responded that the link was indeed broken, and added:

“However, our policy on data sharing is also provided there on the portal and data are made available AFTER full regulatory approval. So apologies for that misstatement in the article we are working to correct this with NEJM. For trials not listed, researchers are welcome to submit an inquiry and provide additional information for consideration by the YODA Project.”

I requested clarification regarding what “full regulatory approval” meant and the response was “specifically it means licensure in the US and EU.”

I completely understand Johnson & Johnson’s concern about releasing data prior to regulatory approval. Doing otherwise would seem like a poor business decision – and potentially one that would stand in the way of promoting public health. However, my concern is with the New England Journal of Medicine, and their role in this process. Making data available only after regulatory approval seems to offer little opportunity for post-publication (let alone, pre-publication) review that has much relevance. And, we know what happened with the Surgisphere episode earlier this year, so I would think that the Journal might have a heightened concern about data availability.

I don’t think the issue of availability of RCT data prior to regulatory approval is a simple one. There are certainly legitimate concerns on all sides of this issue. But I’d be interested if you want to weigh in on this. Somehow, the idea that a drug company seeking regulatory approval will only release the data after obtaining that approval – and using esteemed journals in this way – just feels bad to me. Surely there must be better arrangements? I have also attached the Johnson & Johnson official policy statement regarding data sharing and publication. It sounds good – and even involves the use of the Yale Open Data Access Project (an Ivy League institution, after all) – but it does specify that data availability follows regulatory approval.

I agree that it seems like a good idea to require data availability. I’m not so worried about confidentiality or whatever. At least, if I happened to have been in the study, I wouldn’t care if others had access to an anonymized dataset including my treatment and disease status, mixed with data on other patients. I’m much more concerned the other way, about problems with the research not being detected because there are no outside eyes on the data, along with incentives to do things wrong because the data are hidden and there’s a big motivation to get drug approval.
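For anyone who wants to try the sort of balance check Lehman describes on a published Table 1, here is a minimal sketch in Python. It is not his actual calculation, which we have not seen: it assumes each baseline characteristic is binary, uses a pooled two-sample z statistic, and treats the comparisons as independent (which, as he notes, they are not). The function names and the simulated counts are hypothetical. Under pure random assignment and independence you would expect roughly 5% of comparisons to land outside the 95% interval and roughly 32% outside the 68% interval, with a lot of noise when there are only 35 rows.

```python
import math
import random


def z_score(x_a, n_a, x_b, n_b):
    """Pooled two-sample z statistic for a difference in proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


def share_outside(rows, cutoff):
    """Fraction of baseline comparisons whose |z| exceeds the cutoff."""
    zs = [z_score(*row) for row in rows]
    return sum(abs(z) > cutoff for z in zs) / len(zs)


def simulate_table(n_rows=35, n_per_arm=5000, seed=1):
    """Hypothetical baseline table generated under pure random assignment.

    Each row is (count in arm A, size of arm A, count in arm B, size of arm B)
    for one binary characteristic; both arms share the same true prevalence.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        prevalence = rng.uniform(0.02, 0.5)
        x_a = sum(rng.random() < prevalence for _ in range(n_per_arm))
        x_b = sum(rng.random() < prevalence for _ in range(n_per_arm))
        rows.append((x_a, n_per_arm, x_b, n_per_arm))
    return rows


table = simulate_table()
print("share outside 95% interval:", share_outside(table, 1.96))
print("share outside 68% interval:", share_outside(table, 1.0))
```

To use it on a real study, replace `simulate_table()` with the counts typed in from the paper's baseline table. Rerunning the simulation with different seeds gives a feel for how much those two percentages bounce around by chance alone.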

“Risk without reward: The myth of wage compensation for hazardous work.” Also some thoughts on how this literature ended up so bad.

Peter Dorman writes:

Still interested in Viscusi and his value of statistical life after all these years? I can finally release this paper, since the launch just took place.

The article in question is called “Risk without reward: The myth of wage compensation for hazardous work,” by Peter Dorman and Les Boden, and goes as follows:

A small but dedicated group of economists, legal theorists, and political thinkers has promoted the argument that little if any labor market regulation is required to ensure the proper level of protection for occupational safety and health (OSH), because workers are fully compensated by higher wages for the risks they face on the job and that markets alone are sufficient to ensure this outcome. In this paper, we argue that such a sanguine perspective is at odds with the history of OSH regulation and the most plausible theories of how labor markets and employment relations actually function. . . .

In the English-speaking world, OSH regulation dates to the Middle Ages. Modern policy frameworks, such as the Occupational Safety and Health Act in the United States, are based on the presumption of employer responsibility, which in turn rests on the recognition that employers generally hold a preponderance of power vis-à-vis their workforce such that public intervention serves a countervailing purpose. Arrayed against this presumption, however, has been the classical liberal view that worker and employer self-interest, embodied in mutually agreed employment contracts, is a sufficient basis for setting wages and working conditions and ought not be overridden by public action—a position we dub the “freedom of contract” view. This position broadly corresponds to the Lochner-era stance of the U.S. Supreme Court and today characterizes a group of economists, led by W. Kip Viscusi, associated with the value-of-statistical-life (VSL) literature. . . .

Following Viscusi, such researchers employ regression models in which a worker’s wage, typically its natural logarithm, is a function of the worker’s demographic characteristics (age, education, experience, marital status, gender) and the risk of occupational fatality they face. Using census or similar surveys for nonrisk variables and average fatal accident rates by industry and occupation for risk, these researchers estimate the effect of the risk variable on wages, which they interpret as the money workers are willing to accept in return for a unit increase in risk. This exercise provides the basis for VSL calculations, and it is also used to argue that OSH regulation is unnecessary since workers are already compensated for differences in risk.

This methodology is highly unreliable, however, for a number of reasons . . . Given these issues, it is striking that hazardous working conditions are the only job characteristic for which there is a literature claiming to find wage compensation. . . .

This can be seen as an update of Dorman’s classic 1996 book, “Markets and Mortality: Economics, Dangerous Work, and the Value of Human Life.” It must be incredibly frustrating for Dorman to have shot down that literature so many years ago but still see it keep popping up. Kinda like how I feel about that horrible Banzhaf index or the claim that the probability of a decisive vote is 10^-92 or whatever, or those terrible regression discontinuity analyses, or . . .

Dorman adds some context:

The one inside story that may interest you is that, when the paper went out for review, every economist who looked at it said we had it backwards: the wage compensation for risk is underestimated by Viscusi and his confreres, because of missing explanatory variables on worker productivity. We have only limited information on workers’ personal attributes, they argued, so some of the wage difference between safe and dangerous jobs that should be recognized as compensatory is instead slurped up by lumping together lower- and higher-tiered employment. According to this, if we had more variables at the individual level we would find that workers get even more implicit hazard pay. Given what a stretch it is a priori to suspect that hazard pay is widespread and large—enough to motivate employers to make jobs safe on their own initiative—it’s remarkable that this is said to be the main bias.

Of course, as we point out in the paper, and as I think I had already demonstrated way back in the 90s, missing variables on the employer and industry side impose the opposite bias: wage differences are being assigned to risk that would otherwise be attributed to things like capital-labor ratios, concentration ratios (monopoly), etc. In the intervening years the evidence for these employer-level effects has only grown stronger, a major reason why antitrust is a hot topic for Biden after decades in the shadows.

Anyway, if you have time I’d be interested in your reactions. Can the value-of-statistical-life literature really be as shoddy as I think it is?

I don’t know enough about the literature to even try to answer that last question!

When I bring up the value of statistical life in class, I’ll point out that the most dangerous jobs pay very little, and high-paying jobs are usually very safe. Any regression of salary on risk will start with a strong negative coefficient, and the first job of any analysis will be to bring that coefficient positive. At that point, you have to decide what else to include in the model to get a coefficient that you want. Hard for me to see this working out.

This has a “workflow” or comparison-of-models angle, as the results can best be understood within a web of possible models that could be fit to the data, rather than focusing on a single fitted model, as is conventionally done in economics or statistics.
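Here is a toy simulation of that sign flip, as a sketch under loudly stated assumptions. Everything in it is made up for illustration: a single "skill" variable stands in for all worker characteristics, riskier jobs are assumed to go disproportionately to lower-skill workers, and a small positive risk premium of 0.1 is built into log wages. None of these numbers come from Dorman and Boden, Viscusi, or any real dataset. The adjusted coefficient is computed by partialling skill out of both variables, which gives the same risk coefficient as a regression of log wage on risk and skill together.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing it on x (with an intercept)."""
    b = slope(x, y)
    mx, my = mean(x), mean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


rng = random.Random(0)
n = 5000
TRUE_RISK_PREMIUM = 0.1  # hypothetical value, chosen only for illustration

skill = [rng.gauss(0, 1) for _ in range(n)]
# Assumption: riskier jobs are held disproportionately by lower-skill workers.
risk = [-0.8 * s + rng.gauss(0, 0.6) for s in skill]
# Assumption: log wage rises with skill and, slightly, with risk.
log_wage = [1.0 * s + TRUE_RISK_PREMIUM * r + rng.gauss(0, 0.5)
            for s, r in zip(skill, risk)]

# Model 1: log wage on risk alone.
naive = slope(risk, log_wage)
# Model 2: log wage on risk, controlling for skill (partial skill out of both).
adjusted = slope(residuals(skill, risk), residuals(skill, log_wage))

print("risk coefficient, no controls:      ", round(naive, 3))
print("risk coefficient, adjusted for skill:", round(adjusted, 3))
```

In this made-up world the unadjusted coefficient is strongly negative and the adjusted one is close to the 0.1 that was built in. That works only because the simulation hands the analyst exactly the right control. With real data you never know that you have it, and as Dorman points out, omitted employer-side variables can push the estimate in the other direction.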

As to why the literature ended up so bad: it seems to be a perfect storm of economic/political motivations along with some standard misunderstandings about causal inference in econometrics.