
Stan pedantic mode

This used to be on the Stan wiki but that page got reorganized so I’m putting it here. Blog is not as good as wiki for this purpose: you can add comments but you can’t edit. But better blog than nothing, so here it is.

I wrote this a couple years ago and it was just sitting there, but recently Ryan Bernstein implemented some version of it as part of his research into static analysis of probabilistic programs; see this thread and, more recently, this thread from the Stan Forums.

Background:

We see lots of errors in Stan code that could be caught automatically. These are not syntax errors; rather, they’re errors of programming or statistical practice that we would like to flag. The plan is to have a “pedantic” mode of the Stan parser that will catch these problems (or potential problems) and emit warnings. I’m imagining that “pedantic” will ultimately be the default setting, but it should also be possible for users to turn off pedantic mode if they really want.

Choices in implementation:

– It would be fine with me to have this mode always on, but I could see how some users would be irritated if, for example, they really want to use an inverse-gamma model or if they have a parameter called sigma that they want to allow to be negative, etc. If it is _not_ always on, I’d prefer it to be on by default, so that the user would have to declare “--no-check-mode” to get it to _not_ happen.

– What do others think? Would it be ok for everyone to have it always on? If that’s easier to implement, we could start with always on and see how it goes, then only bother making it optional if it bothers people.

– I’m thinking these should give Warning messages. Below I’ll give a suggestion for each pattern.

– And now I’m thinking we should have a document (not the same as this wiki) which explains _why_ for each of our recommendations. Then each warning can be a brief note with a link to the document. I’m not quite sure what to call this document, maybe something like “Some common problems with Stan programs”? We could subdivide this further, into Problems with Bayesian models and Problems with Stan code.

Now I’ll list a bunch of patterns that we’d like to catch.

To start with, I’ll list the patterns in no particular order. Ultimately we’ll probably want to categorize or organize them a bit.

– Uniform distributions. All uses of uniform() should be flagged. Just about all the examples we’ve ever seen are either superfluous (as they just add a constant to the log density) or mistaken (in that they should have been entered as bounds on parameters).

Warning message: “Warning: On line ***, your Stan program has a uniform distribution. The uniform distribution is not recommended, for two reasons: (a) Except when there are logical or physical constraints, it is very unusual for you to be sure that a parameter will fall inside a specified range, and (b) The infinite gradient induced by a uniform density can cause difficulties for Stan’s sampling algorithm. As a consequence, we recommend soft constraints rather than hard constraints; for example, instead of giving an elasticity parameter a uniform(0,1) distribution, try normal(0.5,0.5).”
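For concreteness, here is a minimal sketch (with a made-up parameter name) of the superfluous case, together with the kind of soft prior the warning would point to:

```
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ uniform(0, 1);   // flagged: given the declared bounds, this only adds a constant
  // a soft prior such as the following is usually more useful:
  // theta ~ normal(0.5, 0.5);
}
```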

– Parameter bounds of the form “lower=A, upper=B” should be flagged in all cases except A=0, B=1 and A=-1, B=1.

Warning message: “Warning: On line ***, your Stan program has a parameter with hard constraints in its declaration. Hard constraints are not recommended, for two reasons: (a) Except when there are logical or physical constraints, it is very unusual for you to be sure that a parameter will fall inside a specified range, and (b) The infinite gradient induced by a hard constraint can cause difficulties for Stan’s sampling algorithm. As a consequence, we recommend soft constraints rather than hard constraints; for example, instead of constraining an elasticity parameter to fall between 0 and 1, leave it unconstrained and give it a normal(0.5,0.5) prior distribution.”
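A hedged sketch of the two versions (the parameter name and the normal(0.5,0.5) prior are just illustrations):

```
// flagged: hard interval constraint in the declaration
parameters {
  real<lower=0, upper=1> elasticity;
}

// recommended alternative: leave it unconstrained and use a soft prior
parameters {
  real elasticity;
}
model {
  elasticity ~ normal(0.5, 0.5);
}
```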

– Any parameter whose name begins with “sigma” should have “lower=0” in its declaration; otherwise flag.

Warning message: “Warning: On line ***, your Stan program has an unconstrained parameter with a name beginning with “sigma”. Parameters with this name are typically scale parameters and constrained to be positive. If this parameter is indeed a scale (or standard deviation or variance) parameter, add lower=0 to its declaration.”
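For example (a minimal sketch, with made-up names):

```
parameters {
  real sigma;              // flagged: name suggests a scale parameter, but no lower=0
  real<lower=0> sigma_y;   // not flagged
}
```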

– Parameters with positive-constrained distribution (such as gamma or lognormal) and no corresponding constraint in the definition.

Warning message: “Warning: Parameter is given a constrained distribution on line *** but was declared with no constraints, or incompatible constraints, on line ***. Either change the distribution or change the constraints.”
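A minimal sketch of the mismatch this is meant to catch (hypothetical names):

```
parameters {
  real tau;               // declared without lower=0 ...
}
model {
  tau ~ lognormal(0, 1);  // ... but given a positive-constrained distribution: flagged
}
// fix: declare the parameter as real<lower=0> tau;
```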

– gamma(A,B) or inv_gamma(A,B) should be flagged if A = B < 1. (The point is to catch those well-intentioned but poorly-performing attempts at improper priors.)

Warning message: “Warning: On line ***, your Stan program has a gamma, or inverse-gamma, model with parameters that are equal to each other and set to values less than 1. This is mathematically acceptable and can make sense in some problems, but typically we see this model used as an attempt to assign a noninformative prior distribution. In fact, priors such as inverse-gamma(.001,.001) can be very strong, as explained by Gelman (2006). Instead we recommend something like a normal(0,1) or student_t(4,0,1), with the parameter constrained to be positive.”
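For example, a sketch of the flagged pattern and the kind of alternative the message suggests (assuming tau is declared with lower=0):

```
model {
  tau ~ inv_gamma(0.001, 0.001);   // flagged: shape = scale = 0.001 < 1
  // suggested alternatives, given the lower=0 constraint on tau:
  // tau ~ normal(0, 1);           // half-normal
  // tau ~ student_t(4, 0, 1);     // half-t
}
```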

– if/else statements in the transformed parameters or model blocks: These can cause problems with HMC so should probably be flagged as such.

Warning message: Hmmm, I’m not sure about this one!
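Presumably the case worth catching is a branch whose condition depends on a parameter, since that makes the target discontinuous in that parameter; a hedged sketch:

```
model {
  if (theta > 0)                  // condition depends on a parameter: flagged
    y ~ normal(mu_pos, sigma);
  else
    y ~ normal(mu_neg, sigma);
}
```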

– Code has no indentation.

Warning message: “Warning: Your Stan code has no indentation and this makes it more difficult to read. See *** for guidelines on writing easy-to-read Stan programs.”

– Code has blank lines.

Warning message: “Warning: Your Stan code has blank lines and this makes it more difficult to read. See *** for guidelines on writing easy-to-read Stan programs.” (Bob: I don’t think we want to flag all blank lines, just ones at starts of blocks—otherwise, they’re convenient for organizing)

– Vectorization that doesn’t involve the first argument. For example, `y[n] ~ normal(mu, sigma)` inside a loop, where `y[n]` is a scalar and `mu` is a vector: this adds too many density increments (one for each element of `mu`, on every pass through the loop).
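A sketch of the pattern and the likely intent (assuming `y` and `mu` both have N elements):

```
for (n in 1:N)
  y[n] ~ normal(mu, sigma);    // flagged: the scalar y[n] is broadcast against the whole
                               // vector mu, adding N log-density terms on every iteration
// probably intended: either index mu inside the loop,
//   y[n] ~ normal(mu[n], sigma);
// or drop the loop and vectorize,
//   y ~ normal(mu, sigma);
```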

– Undefined variables. Bob describes this as a biggy that’s a lot harder to code. (Bob: But when we have compound declare/define, then we can flag any variable that’s defined and not declared at the same time. It is much harder to find ones that never get defined; not impossible, just requires a second pass or keeping track of which variables aren’t defined as we go along.)

Warning message: “Warning: Variable ** is used on line *** but is nowhere defined. Check your spelling or add the appropriate declaration.”

– Variables defined lower in the program than where they are used.

Warning message: “Warning: Variable ** is used on line *** but is not defined until line ****. Declare the variable before it is used.”

– Large or small numbers.

Warning message: “Try to make all your parameters scale free. You have a constant in your program that is less than 0.1 or more than 10 in absolute value on line **. This suggests that you might have parameters in your model that have not been scaled to roughly order 1. We suggest rescaling using a multiplier; see section *** of the manual for an example.”
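One common way to follow that advice is to give the parameter a unit-scale counterpart and rescale it in transformed parameters; a hedged sketch with a made-up multiplier of 1000:

```
parameters {
  real beta_raw;                  // unit scale
}
transformed parameters {
  real beta = 1000 * beta_raw;    // rescaled version used in the rest of the model
}
model {
  beta_raw ~ normal(0, 1);
}
```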

– Warn user if parameter has no priors or multiple priors (Bruno Nicenboim suggested this on https://github.com/stan-dev/stan/issues/2445).

– Warn user if parameter has indexes but is used without indexes in a loop, e.g.,
```
real y[N];
for (n in 1:N)
  y ~ normal(0, 1); // probably not what user intended
```

– If there are other common and easily-identifiable Stan programming errors, we should aim to catch them too.

Pedantic mode for Rstanarm:

– Flag predictors that are not on unit scale

– Check if a predictor is constant

Comments from Bob:

Most of these are going to have to be built into the parser code itself to do it right, especially given the creative syntax we see from users.

I don’t know if the “sigma…” thing is worth doing. Not many of our users use “sigma” or “sigma_foo” the way Andrew does.

Do we also flag normal(0, 1000) and other attempts at very diffuse priors?

The problem with a lot of this is that we’ll have to be doing simple blind matching. Most of our users don’t use “sigma” (or “tau” the way Andrew does). Nor do they follow the sigma_foo standard—it’s all over the place.

We can pick up literal values (and by literal, I mean it’s a number in the code). We won’t be able to pick up more complex things like gamma(a, b) where a < 1 and b < 1, because that’s a run-time issue, not a compile-time issue, for instance if a and b are data.

Should we also flag “lower=A” when A is neither 0 nor -1? (Again, we’ll be able to pick up literal numbers there, not more complex violations.)

Conditionals (if-else) are only problematic if the conditions depend on parameters.

Everyone wants to do more complicated things, like make sure that if a variable is given a distribution with a bound, then that bound is declared in the parameters block. So no:

```
real alpha;

alpha ~ lognormal(…)
```

because then alpha isn’t well defined. These things are going to be a lot of work to track down and we’ll again only be able to pick out direct cases like this.

Comment from Michael:

I’m thinking of a -pedantic flag to the parser, which we can then call through the interfaces in various ways, so it would make sense to build everything up in the parser itself.

Bob replies:

The patterns in question are complex syntactic relationships in some cases, like the link between a variable’s declaration and use. If we start cluttering up the parser trying to do all this on the fly, things could get very ugly (in the code) very fast, as the underlying parser’s just a huge pain to work with.

Mike:

So what’s the best place to introduce these checks? It’s not dissimilar to what’s already done with the Jacobian checks, no?

Bob:

It depends what they are. I should’ve been clearer that I meant on the fly as in find the issue as it’s parsing.

There are two approaches. One is to plumb enough information top down through the parser so that we can check inside of a single expression or statement; that’s what the Jacobian check does now. To do that, the source of every variable needs to be plumbed through the entire parse tree, and if we do more of that it’s going to get ugly. Some of these are “peephole” warnings, things you can catch just locally.

For others, we have the AST and can design walkers for it that walk over the code any old way and report.

Or, we could do something hacky with regexes along the lines of cpplint (a Python program, not something plumbed into the C++ parser, at least as far as I know).

#### Indentation Errors

Catch when a program’s indentation doesn’t match the braces. E.g.,

```
for (i in 1:10)
  mu[i] ~ normal(0, 1);
  sigma[i] ~ cauchy(0, 2);
```

Should flag the line with `sigma`, as `i` is not in scope there (without braces, only the `mu[i]` statement is inside the loop body).
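If both statements really were meant to be inside the loop, braces make that explicit:

```
for (i in 1:10) {
  mu[i] ~ normal(0, 1);
  sigma[i] ~ cauchy(0, 2);
}
```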

#### Bad stuff

– Parameter defined but never used.

– Overpromoting. Parameter defined three times.

Are informative priors “[in]compatible with standards of research integrity”? Click to find out!!

A couple people asked me what I thought of this article by Miguel Ángel García-Pérez, Bayesian Estimation with Informative Priors is Indistinguishable from Data Falsification, which states:

Bayesian analysis with informative priors is formally equivalent to data falsification because the information carried by the prior can be expressed as the addition of fabricated observations whose statistical characteristics are determined by the parameters of the prior.

I agree with the mathematical point. Once you’ve multiplied the prior with the likelihood, you can’t separate what came from where. The prior is exactly equivalent to a measurement; conversely, any factor of the likelihood is exactly equivalent to prior information, from a mathematical perspective.
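As a minimal illustration of that mathematical point (a conjugate normal example of my own, not anything taken from the article): a normal prior on a mean parameter enters the posterior in exactly the same form as one extra observation,

$$
p(\theta \mid y) \;\propto\; \mathrm{N}(\theta \mid \mu_0, \tau_0^2)\,\prod_{i=1}^{n} \mathrm{N}(y_i \mid \theta, \sigma^2),
$$

and by the symmetry of the normal density in its first two arguments, the prior factor equals $\mathrm{N}(\mu_0 \mid \theta, \tau_0^2)$, which is the likelihood contribution of one additional “measurement” $\mu_0$ with standard error $\tau_0$.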

I don’t think it’s so helpful to label this procedure as “data falsification.” The prior is an assumption, just as the likelihood is an assumption. All the assumptions we use in applied statistics are false, so, sure, the prior is a falsification, just as every normal distribution you use is a falsification, every logistic regression is a falsification, etc. Whatever. The point is, yes, the prior and the likelihood have equal mathematical status when they come together to form the posterior.

The article continues:

This property of informative priors makes clear that only the use of non-informative, uniform priors in all types of Bayesian analyses is compatible with standards of research integrity.

Huh? Where does “research integrity” come in here? That’s just nuts. I guess now we know what comes after Vixra. In all seriousness, I guess there’s always a market for over-the-top claims that tell (some) people what they want to hear (in this case, in a bit of an old-fashioned way).

To get to the larger issue: I do think there are interesting questions regarding the interactions between ethics and the use of prior information. No easy answers but the issue is worth thinking about. As I summarized in my 2012 article:

I believe we are ethically required to clearly state our assumptions and, to the best of our abilities, explain the rationales for these assumptions and our sources of information.

What a difference a month makes (polynomial extrapolation edition)

Someone pointed me to this post from Cosma Shalizi conveniently using R to reproduce the famous graph endorsed by public policy professor and A/Chairman @WhiteHouseCEA.

Here’s the original graph that caused all that annoyance:

Here’s Cosma’s reproduction in R (retro-Cosma is using base graphics!), fitting a third-degree polynomial on the logarithms of the death counts:

Cosma’s data are slightly different from those in the government graph, but they give the same statistical story.

It’s been a couple weeks, so I’ll redo the graph, running Cosma’s code on current data (making a few changes in display choices):

Hey, a cubic still fits the data! Well, not really. According to that earlier graph, the number of new deaths should be approximately zero by now. What happened is that the cubic has shifted, now that we’ve included new data in the fit.

Anyway, here’s my real question.

Cosma is using the same x-axis as the U.S. government was using, going until 4 Aug 2020. But where did 4 Aug come from? That’s kind of a weird date to use as an endpoint. Why not, say, go until 1 Sept?

Cosma provided code, so it’s trivial to extend the graph to the end of the month, and here’s what we get:

Whoa! What happened?

But, yes, of course! A third-degree polynomial doesn’t just go up, then down. It goes up, then down, then up. Here’s the fitted polynomial in question:

                            coef.est coef.se
(Intercept)                   5.85     0.06 
poly(weeks_since_apr_1, 3)1  17.45     0.54 
poly(weeks_since_apr_1, 3)2 -10.94     0.54 
poly(weeks_since_apr_1, 3)3   1.20     0.54 
---
n = 75, k = 4
residual sd = 0.54, R-Squared = 0.95

The coefficient of x^3 is positive, so indeed the function has to blow up to infinity once x is big enough. (It blows up to negative infinity for sufficiently low values of x, but since we’re exponentiating to get the prediction back on the original scale, that just sends the fitted curve to zero.)
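In symbols (writing x for weeks since April 1, and glossing over the orthogonal-polynomial parameterization that R’s poly() uses internally), the fitted curve on the original scale is

$$
\widehat{\text{deaths}}(x) = \exp\!\left(b_0 + b_1 x + b_2 x^2 + b_3 x^3\right),
$$

so when $b_3 > 0$ the curve eventually rises without bound as x increases, and when $b_3 < 0$ it decays toward zero.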

When I went back and fit the third-degree model just to the data before 5 May, I got this:

                            coef.est coef.se
(Intercept)                  5.57     0.07  
poly(weeks_since_apr_1, 3)1 17.97     0.52  
poly(weeks_since_apr_1, 3)2 -8.54     0.52  
poly(weeks_since_apr_1, 3)3 -0.97     0.52  
---
n = 63, k = 4
residual sd = 0.52, R-Squared = 0.96

Now the highest-degree coefficient estimate is negative, so the curve will continue declining to 0 as x increases. It would retrospectively blow up for low enough values of x, but this is not a problem as we’re only going forward in time with our forecasts.

Baby alligators: Adorable, deadly, or endangered? You decide.

Yeah, I know, yet another post about baby alligators . . . can’t we find anything else to write about around here?

Lizzie points us to this tabloid news article which she files under the heading, “why we need priors in climate change biology”:

Why baby alligators in some spots could be 98% female by century’s end . . .

Rising global temperatures could shift the balance between males and females in crocodile and alligator populations, potentially leading to a sharp decline in the reptiles’ reproduction rates. . . .

Between 2010 and 2018, Samantha Bock at the University of Georgia in Athens and her colleagues measured the temperature of 86 nests made by American alligators (Alligator mississippiensis) in Florida and South Carolina. The researchers also collected data on daily air temperatures at these sites and found that average nest temperatures were higher during warmer years.

Using estimates of future climate change, the researchers predicted that, if global temperature continues to rise unabated, sex ratios at both sites will become highly male-skewed by the middle of this century. But, by 2100, higher nest temperatures could produce up to 98% females.

I asked Lizzie: Can the alligators adapt in some way or are they doomed?

She replied:

Doomed! According to folks in my field, who have no model they won’t extrapolate and are happy to ignore the climate variation alligators cope with today … and have over, say, their 37 million years or so on earth (including my favorite period, the Younger Dryas, potential source of Nordic myths).

The article on the Younger Dryas, by W. H. Berger, begins:

According to Nordic myth, the modern world begins when Odin slays Ymir, the Ice Giant. Ymir’s offspring drown in his blood, but two survive and start a new race of frost giants. Lodged in Utgard, they pose a constant threat. Only Odin’s son Thor, brandishing Mjollnir, the magic hammer, keeps them in check. The pronounced and abrupt changes in climate during the Glacial-Holocene transition suggest that the luck of battle switched sides frequently before Odin and Thor won over Ymir and his kin.

But enough about the Younger Dryas. Back to the alligators.

The research article in question, “Spatial and temporal variation in nest temperatures forecasts sex ratio skews in a crocodilian with environmental sex determination,” is here. If I read that paper right, they didn’t actually collect data on the sexes of their baby alligators.

I asked Lizzie, who replied:

I think you’re right. They just cite a 1994 paper to extrapolate sexes from nest temperature; the paper has data on 20 nests and ends with “[t]herefore, we suggest that information on hatchling sex ratios be obtained by directly sexing hatchlings as opposed to relying solely on predictions based on nest temperatures. Monitoring of nest temperatures is useful for management and natural history reasons.”

I don’t think alligator biologists will much care about my opinion (I am not an expert at all) but it is an interesting example of jumping through so much uncertainty in estimates to end up at some extreme predictions. There are lots of bad consequences of climate change, and this could be one of them, I am just not convinced by this study.

When it comes to baby alligators, I’m still not sure whether to say Awwwww or Aaaaaa! The baby gators in this news report are pretty cute. It also says, “The American alligator teetered on the brink of extinction in the 1980s.” I had no idea.

So much of academia is about connections and reputation laundering

There are two ways of looking at this:

1. Statistics / numerical analysis / data science is a lot harder than you, the reader of this blog, might think.

2. Academia, like any other working environment, is full of prominent, successful, well-connected bluffers.

Someone sent me this:

and this followup:

I clicked around a bit, and there was a lot of appropriate distress that this sort of bad statistical analysis was being done at the highest levels of government.

But here I want to focus on something else, which is the general level of mediocrity, even at the top levels of academia.

When he is not “A/Chairman @WhiteHouseCEA,” the author of the above tweet is a tenured professor of public policy:

In his tweet, Philipson introduces the useful concept of “economist turned political hack.” My problem here is not whether Philipson is a hack but rather that someone can reach the heights of academia and government while being so ignorant as to think that this above curve fit could be a good idea.

OK, part of this is straight-up politics. There will always be prestigious and well-paying jobs for people who are willing to say the things that rich and powerful people want to hear.

But I think it’s more than that. Let me go back to points 1 and 2 above.

1. Statistics is hard. Sure, Philipson is a decorated and tenured professor of public policy, “one of the world’s preeminent health economists.” Tyler Cowen says that graduate students in top econ programs all have “absolutely stellar” math GRE scores, and Philipson graduated from the University of Pennsylvania, which I think has a top program . . . But, again, statistics is hard. Really hard for some people. You could get a perfect math GRE score but not understand some basic principles of curve fitting.

Michael Jordan was a world champion basketball player but couldn’t hit the curve ball. So you’re surprised that someone who was good at taking multiple-choice math tests doesn’t understand statistics? C’mon.

Further evidence of the difficulty of statistics comes from Philipson’s economist colleagues from celebrated institutions around the world who famously flubbed a regression discontinuity analysis by massively overfitting—so, yes, the same sort of statistical error made by the Council of Economic Advisors and discussed above—in an analysis that was so bad that it motivated not one, but two papers (here and here) exploring what went wrong.

2. Academia, like any other working environment, is full of prominent, successful, well-connected bluffers. The striking thing is not that a decorated professor and A/Chairman @WhiteHouseCEA made a statistical error, nor should we be surprised that a prominent academic in economics (or any other field) doesn’t understand statistics. What’s striking is that the professor and A/Chairman doesn’t know that he doesn’t know. I’m struck by his ignorance of his ignorance, his willingness to think that he knows what he’s talking about when he doesn’t.

Some of this must be because bluffing is rewarded: it’s one way to advance in the world. If I wanted to be really cynical, I’d say that being not just wrong but aggressively wrong (as in the A/Chairman’s above tweet) sends a signal of commitment to the cause. I’m not quite sure what the cause is here, but maybe that’s part of the point too.

Again, this is not a problem that’s special to economists. I’m sure you’d see the same thing with sociologists, political scientists, psychologists, etc. Economists may be more of a concern because of their current position of power in government and industry, but the difficulty of statistics and the ease of bluffing is a more general issue.

P.S. More on the specific curve fit here.

If the outbreak ended, does that mean the interventions worked? (Jon Zelner talk tomorrow)

Jon Zelner speaks tomorrow (Thurs) at 1pm:

PREDICTING COVID-19 TRANSMISSION

In this talk Dr. Zelner will discuss some ongoing modeling work focused on understanding when we can and cannot infer that interventions meant to stop or slow infectious disease transmission have actually worked, and when observed outcomes cannot be distinguished from selection bias.

Dude’s an epidemiologist, though, so better check his GRE scores before going on. Also, test his markets to check that they’re sufficiently thick and liquid (yuck)!

Years of Life Lost due to coronavirus

This post is by Phil Price, not Andrew.

A few days ago I posted some thoughts about the coronavirus response, one of which was that I wanted to see ‘years of life lost’ in addition to (or even instead of) ‘deaths’. Mendel pointed me to a source of data for Florida cases and deaths, which I have used to do that calculation myself for that dataset. The plots below show:
(Top) Histogram of deaths as a function of age, colored by sex because why not, although I would rather color them by ‘number of comorbidities’ or something else informative, since the difference by sex isn’t all that big.
(Middle) Points are expected years of life remaining for a person of a given age, from the Social Security Administration; and the lines are from a model that I fit to the points in order to get a continuous function of age that runs from birth to…well, to any age, although it predicts the same number of remaining years of life (or rather months of life) for anyone over 108 years old.
(Bottom) Histogram of ‘expected years of life lost’, calculated using the functions shown in the middle, i.e. a function of age and sex only. This is presumably an overestimate because the people dying of COVID-19 were probably already sicker (and thus set to be shorter-lived on average) than their same-age peers, although perhaps not as much as news reports might suggest: sure, most COVID-19 deaths of people over 80 are of people who have several “co-morbidities”, but most people over 80 have some health issues so it would be very surprising if that weren’t true.

The data through 5/12 include 1849 deaths, which the model predicts to represent 23177 years of life lost; that’s an average of about 12.5 years lost per death, but see the caveat in the explanation of the bottom plot. Is this a lot or a little? Daniel Lakeland has suggested dividing the years of life lost by 80 to get an equivalent number of lifetimes, where ‘equivalent’ just means equivalent in terms of life-years lost; in this case that gives us about 290, so the deaths of these 1849 people represent about the same loss of life-years as the death of 290 infants. This is not meant to imply that the tragedy is equal either way, it’s just a way to put this in terms that are easier to understand. 
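Spelling out the arithmetic:

$$
\frac{23{,}177 \text{ years}}{1849 \text{ deaths}} \approx 12.5 \text{ years per death},
\qquad
\frac{23{,}177 \text{ years}}{80 \text{ years per lifetime}} \approx 290 \text{ equivalent lifetimes}.
$$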

It will be interesting to see if the YLL distribution (and the deaths distribution) shift towards lower ages as the pandemic progresses. At least in California most of the new cases are among workers. If better hygiene and social distancing have reduced the spread of the virus among the old, but it continues among the young, then we would expect to see fewer cases become deaths, but each death will represent more years of life lost. 

This post is by Phil.

 

Update on OHDSI Covid19 Activities.

I have been providing some sense of the ongoing activities of the OHDSI group working on Covid19.

In particular, this gives a quick sense of one of the newer activities:

I believe there is a lot of studying to be done yet…

 

Here’s what academic social, behavioral, and economic scientists should be working on right now.

In a recent comment thread on the lack of relevance of academic social and behavioral science to the current crisis, Terry writes:

We face a once-in-a-lifetime event, and the existing literature gives mostly vapid-sounding guidance. Take this gem at the beginning of the article:

One of the central emotional responses during a pandemic is fear. Humans, like other animals, possess a set of defensive systems for combating ecological threats. Negative emotions resulting from threat can be contagious, and fear can make threats appear more imminent.

But, the pandemic seems to be a huge opportunity for future work delving into the details of how the pandemic will change society and behavior. Forget the vague, overarching studies of generalities and focus instead on the myriad of details. Things like how the meat-packing industry is going to change its assembly line procedures. How will STD transmission change. Look carefully at working remotely, when does it work and not work. How can distance education be made better; how can the collegiate experience be replicated among distance learners. Etc.

I agree. It’s funny for me to be making this argument, given that I’m not doing this work myself. When I’m not writing, I’m doing statistical modeling. I’m not redesigning meat-packing assembly lines. But, yeah, I think Terry is right that there are lots of exciting applied problems for social science to be working on. Also, God is in every leaf of every tree, so if you really push hard to solve these applied problems, the theory will come along with it.

Is JAMA potentially guilty of manslaughter?

No, of course not. I would never say such a thing. Sander Greenland, though, he’s a real bomb-thrower. He writes:

JAMA doubles down on distortion – and potential manslaughter if on the basis of this article anyone prescribes HCQ in the belief it is unrelated to cardiac mortality:

– “compared with patients receiving neither drug cardiac arrest was significantly more likely in patients receiving hydroxychloroquine+azithromycin
(adjusted OR, 2.13 [95% CI, 1.12-4.05]), but not hydroxychloroquine alone (adjusted OR, 1.91 [95% CI, 0.96-3.81]).”

– never mind that the null is already not credible… see
Do You Believe in HCQ for COVID-19? It Will Break Your Heart and Hydroxychloroquine-Triggered QTc-Interval Prolongations in COVID-19 Patients.

I’m not so used to reading medical papers, but I thought this would be a good chance to learn something, so I took a look. The JAMA article in question is “Association of Treatment With Hydroxychloroquine or Azithromycin With In-Hospital Mortality in Patients With COVID-19 in New York State,” and here are its key findings:

In a retrospective cohort study of 1438 patients hospitalized in metropolitan New York, compared with treatment with neither drug, the adjusted hazard ratio for in-hospital mortality for treatment with hydroxychloroquine alone was 1.08, for azithromycin alone was 0.56, and for combined hydroxychloroquine and azithromycin was 1.35. None of these hazard ratios were statistically significant.

I sent along some quick thoughts, and Sander responded to each of them! Below I’ll copy my remarks and Sander’s reactions. Medical statistics is Sander’s expertise, not mine, so you’ll see that my thoughts are more speculative and his are more definitive.

Andrew: The study is observational not experimental, but maybe that’s not such a big deal, given that they adjusted for so many variables? It was interesting to me that they didn’t mention the observational nature of the data in their Limitations section. Maybe they don’t bother mentioning it in the Limitations because they mention it in the Conclusions.

Sander: In med there is automatic downgrading of observational studies below randomized (no matter how fine the former or irrelevant the latter – med RCTs are notorious for patient selectivity). So I’d guess they didn’t feel any pressure to emphasize the obvious. But I’d not have let them get away with spinning it as “may be limited” – that should be “is limited.”

Andrew: I didn’t quite get why they analyzed time to death rather than just survive / not survive. Did they look at time to death because it’s a way of better adjusting for length of hospital stay?

Sander: I’d just guess they could say they chose to focus on death because that’s the bottom line – if you are doomed to die within this setting, it might be arguably better for both the patient in suffering (often semi-comatose) and terminal care costs to go early (few would dare say that in a research article).

[This doesn’t quite answer my question. I understand why they are focusing on death as an outcome. My question is why don’t they just take survive/death in hospital as a binary outcome? Why do the survival analysis? I don’t see that dying after 2 days is so much worse than dying after 5 days. I’m not saying the survival analysis is a bad idea; I just want to understand why they did it, rather than a more simple binary-outcome model. — AG]

Andrew: The power analysis seems like a joke: a study is powered to detect a hazard ratio of 0.65 (i.e., 1.5 if you take the ratio in the other direction). That’s a huge assumed effect, no?

Sander: I view all power commentary for data-base studies like this one as a joke, period, part of the mindless ritualization of statistics that is passed off as needed for “objective standards”. (It has a rationale in RCTs to show that the study was planned to that level of detail, but still has no place in the analysis.)

Andrew: I can’t figure out why they include p-values in their balance table (Table 1). It’s not a randomized assignment so the null hypothesis is of no interest. What’s of interest is the size and direction of the imbalance, not a p-value.

Sander: Agreed. I once long ago argued with Ben Hansen about that in the context of confounder scoring, to no resolution. But at least he tried his best to give a rationale; I’m sure here it’s just another example of ritualized reflexes.

Andrew: Figure 2 is kinda weird. It has those steps, but it looks like a continuous curve. It should be possible to make a better graph using raw data. With some care, you should be able to construct such a graph to incorporate the regression adjustments. This is an obvious idea; I’m sure there are 50 biostatistics papers on the topic of how to make such graphs.

Sander: Proposals for using splines to develop such curves go back at least to the 1980s and are interesting in that their big advantage comes in rate comparisons in very finite samples, e.g., most med studies. (Technically the curves in Fig. 3 are splines too – zero-order piecewise constant splines).

[But I don’t think that’s what’s happening here. I don’t think those curves are fits to data; I’m guessing these are just curves from a fitted model that have been meaninglessly discretized. They look like Kaplan-Meier curves but they’re not. — AG]

Andrew: Table 3 bothers me. I’d like to see the unadjusted and adjusted rates of death and other outcomes for the 4 groups, rather than all these comparisons.

Sander: Isn’t what you want in Fig. 3 and Table 4? Fig. 3 is very suspect for me as the HCQ-alone and neither groups look identical there. I must have missed something in the text (well, I missed a lot). Anyway I do want comparisons in the end, but Table 3 is in my view bad because the comparisons I’d want would be differences and ratios of the probabilities, not odds ratios (unless in all categories the outcomes were uncommon, which is not the case here). But common software (they used SAS) does not offer my preferred option easily, at least not with clustered data like theirs. That problem arises again in their use of the E-value with their odds ratios, but the E-value in their citation is for risk ratios. By the way, Ioannidis has vociferously criticized the E-value in print from his usual nullistic position, and I have a comment in press criticizing the E-value from my anti-nullistic position!

Andrew: Their conclusion is that the treatment “was not significantly associated with differences in in-hospital mortality.” I’d like to see a clearer disentangling. In the main results section, it says that 24% of patients receiving HCQ died (243 out of 1006), compared to 11% of patients not receiving HCQ (49 out of 432). The statistical adjustment reduced this difference. I guess I’d like to see a graph with estimated difference on the y-axis and the amount of adjustment on the x-axis.

Sander: That’s going way beyond anything I normally see in the med lit. And I’m sure this was a rush job given the topic.

[Yeah, I see this. What I really want to do is to make this graph in some real example, then write it up, then put it in a textbook and an R package, and then maybe in 10 years it will be standard practice. You laugh, but 10 years ago nobody in political science made coefficient plots from fitted regressions, and now everyone’s doing it. And they all laughed at posterior predictive checks, but now people do that too. It was all kinds of hell to get our R-hat paper published back in 1991/1992, and now people use it all the time. And MRP is a thing. Weakly informative priors too! We can change defaults; it just takes work. I’ve been thinking about this particular plot for at least 15 years, and at some point I think it will happen. It took me about 15 years to write up my thoughts about Popperian Bayes, but that happened, eventually! — AG]

Andrew: This would be a good example to study further. But I’m guessing I already know the answer to the question, Are the data available?

Sander: Good luck! Open data is an anathema to much of the med community, aggravated by the massive confidentiality requirements imposed by funders, IRBs, and institutional legal offices. Prophetically, Rothman wrote a 1981 NEJM editorial lamenting the growing problem of these requirements and how they would strangulate epidemiology; a few decades later he was sued by an ambulance-chasing lawyer representing someone in a database Rothman had published a study from, on grounds of potentially violating the patient’s privacy.

[Jeez. — AG]

Andrew: I assume these concerns are not anything special with this particular study; it’s just the standard way that medical research is reported.

Sander: A standard way, yes. JAMA ed may well have forced everything bad above on this team’s write-up – I’ve heard several cases where that is exactly what the authors reported upon critical inquiries from me or colleagues about their statistical infelicities. JAMA and journals that model themselves on it are the worst that I know of in this regard. Thanks I think in part to the good influences of Steve Goodman, AIM and some other more progressive journals are less rigid; and most epidemiology journals (which often publish studies like this one except for urgency) are completely open to alternative approaches. One, Epidemiology, actively opposes and forbids the JAMA approach (just like JAMA forbids our approach), much to the ire of biostatisticians who built their careers around 0.05.

[Two curmudgeons curmudging . . . but I think this is good stuff! Too bad there isn’t more of this in scientific journals. The trouble is, if we want to get this published, we’d need to explain everything in detail, and then you lose the spontaneity. — AG]

P.S. Regarding that clickbait title . . . OK, sure, JAMA’s not killing anybody. But, if you accept that medical research can and should have life-and-death implications, then mistakes in medical research could kill people, right? If you want to claim that your work is high-stakes and important, then you have to take responsibility for it. And it is a statistical fallacy to take a non-statistically-significant result from a low-power study and use this as a motivation to default to the null hypothesis.

Get your research project reviewed by The Red Team: this seems like a good idea!

Ruben Arslan writes:

A colleague recently asked me to be a neutral arbiter on his Red Team challenge. He picked me because I was skeptical of his research plans at a conference and because I recently put out a bug bounty program for my blog, preprints, and publications (where people get paid if they find programming errors in my scientific code).

I’m writing to you of course because I’m hoping you’ll find the challenge interesting enough to share with your readers, so that we can recruit some of the critical voices from your commentariat. Unfortunately, it’s time-sensitive (they are recruiting until May 14th) and I know you have a long backlog on the blog.

OK, OK, I’ll post it now . . .

Arslan continues:

The Red Team approach is a bit different to my bounty program. Their challenge recruits five people who are given a $200 stipend to examine data, code, and manuscript. Each critical error they find yields a donation to charity, but it’s restricted to about a month of investigation. I have to arbitrate what is and isn’t critical (we set out some guidelines beforehand).

I [Arslan] am very curious to see how this goes. I have had only small submissions to my bug bounty program, but I have not put out many highly visible publications since starting the program and I don’t pay a stipend for people to take a look. Maybe the Red Team approach yields a more focused effort. In addition, he will know how many have actually looked, whereas I probably only hear from people who find errors.

My own interest in this comes from my work as a reviewer and supervisor, where I often find errors, especially if people share their data cleaning scripts and not just their modelling scripts, but also from my own work. When I write software, I have some best practices to rely on and still make tons of mistakes. I’m trying to import these best practices to my scientific code. I’ve especially tried to come up with ways to improve after I recently corrected a published paper twice after someone found coding errors during a reanalysis (I might send you that debate too since you blogged the paper, it was about menstrual cycles and is part of the aftermath of dealing with the problems you wrote about so often).

Here’s some text from the blog post introducing the challenge:

We are looking for five individuals to join “The Red Team”. Unlike traditional peer review, this Red Team will receive financial incentives to identify problems. Each Red Team member will receive a $200 stipend to find problems, including (but not limited to) errors in the experimental design, materials, code, analyses, logic, and writing. In addition to these stipends, we will donate $100 to a GoodWell top ranked charity (maximum total donations: $2,000) for every new “critical problem” detected by a Red Team member. Defining a “critical problem” is subjective, but a neutral arbiter—Ruben Arslan—will make these decisions transparently. At the end of the challenge, we will release: (1) the names of the Red Team members (if they wish to be identified), (2) a summary of the Red Team’s feedback, (3) how much each Red Team member raised for charity, and (4) the authors’ responses to the Red Team’s feedback.

Daniël has also written a commentary about the importance of recruiting good critics, especially now for fast-track pandemic research (although I still think Anne Scheel’s blog post on our 100% CI blog made the point even clearer).

OK, go for it! Seems a lot better than traditional peer review, the incentives are better aligned, etc. Too bad Perspectives on Psychological Science didn’t decide to do this when they were spreading lies about people.

This “red team” thing could be the wave of the future. For one thing, it seems scalable. Here are some potential objections, along with refutations to these objections:

– You need to find five people who will review your paper—but for most topics that are interesting enough to publish on in the first place, you should be able to find five such people. If not, your project must be pretty damn narrow.

– You need to find up to $3000 to pay your red team members and make possible charitable donations. $3000 is a lot, not everyone has $3000. But I think the approach would also work with smaller payments. Also, journal refereeing isn’t free! 3 referee reports, the time of an editor and an associate editor . . . put it all together, and the equivalent cost could be well over $1000. For projects that are grant funded, the red team budget could be incorporated into the funding plan. And for unfunded projects, you could find people like Alexey Guzey or Ulrich Schimmack who might “red team” your paper for free—if you’re lucky!

2 perspectives on the relevance of social science to our current predicament: (1) social scientists should back off, or (2) social science has a lot to offer

Perspective 1: Social scientists should back off

This is what the political scientist Anthony Fowler wrote the other day:

The public appetite for more information about Covid-19 is understandably insatiable. Social scientists have been quick to respond. . . . While I understand the impulse, the rush to publish findings quickly in the midst of the crisis does little for the public and harms the discipline of social science. Even in normal times, social science suffers from a host of pathologies. Results reported in our leading scientific journals are often unreliable because researchers can be careless, they might selectively report their results, and career incentives could lead them to publish as many exciting results as possible, regardless of validity. A global crisis only exacerbates these problems. . . . and the promise of favorable news coverage in a time of crisis further distorts incentives. . . .

Perspective 2: Social science has a lot to offer

42 people published an article that begins:

The COVID-19 pandemic represents a massive global health crisis. Because the crisis requires large-scale behaviour change and places significant psychological burdens on individuals, insights from the social and behavioural sciences can be used to help align human behaviour with the recommendations of epidemiologists and public health experts. Here we discuss evidence from a selection of research topics relevant to pandemics, including work on navigating threats, social and cultural influences on behaviour, science communication, moral decision-making, leadership, and stress and coping.

The author list includes someone named Nassim, but not Taleb, and someone named Fowler, but not Anthony. It includes someone named Sander but not Greenland. Indeed it contains no authors with names of large islands. It includes someone named Zion but no one who, I’d guess, can dunk. Also no one from Zion. It contains someone named Dean and someone named Smith but . . . ok, you get the idea. It includes someone named Napper but no sleep researchers named Walker. It includes someone named Rand but no one from Rand. It includes someone named Richard Petty but not the Richard Petty. It includes Cass Sunstein but not Richard Epstein. Make of all this what you will.

As befits an article with 42 authors, there are a lot of references: 6.02 references per author, to be precise. But, even with all these citations, I’m not quite sure where this research can be used to “support COVID-19 pandemic response,” as promised in the title of the article.

The trouble is that so much of the claims are so open-ended that they don’t tell us much about policy. For example, I’m not sure what we can do with a statement such as this:

Negative emotions resulting from threat can be contagious, and fear can make threats appear more imminent. A meta-analysis found that targeting fears can be useful in some situations, but not others: appealing to fear leads people to change their behaviour if they feel capable of dealing with the threat, but leads to defensive reactions when they feel helpless to act. The results suggest that strong fear appeals produce the greatest behaviour change only when people feel a sense of efficacy, whereas strong fear appeals with low-efficacy messages produce the greatest levels of defensive responses.

Beyond the very indirect connection to policy, I’m also concerned because, of the three references cited in the above passage, one is from PNAS in 2014 and one was from Psychological Science in 2013. That’s not a good sign!

Looking at the papers in more detail . . . The PNAS study found that if you manipulate people’s Facebook news feeds by increasing the proportion of happy or sad stories, people will post more happy or sad things themselves. The Psychological Science study is based on two lab experiments: 101 undergraduates who “participated in a study ostensibly measuring their thoughts about “island life,” and 48 undergraduates who were “randomly assigned to watch one of three videos” of a shill. Also a bunch of hypothesis tests with p-values like 0.04. Anyway, the point here is not to relive the year 2013 but rather to note that the relevance of these p-hacked lab experiments to policy is pretty low.

Also, the abstract of the 40-author paper says, “In each section, we note the nature and quality of prior research, including uncertainty and unsettled issues.” But then the paper goes on to unqualified statements that the authors don’t even seem to agree with.

For example, from the article, under the heading, “Disaster and ‘panic’” [scare quotes in original]:

There is a common belief in popular culture that, when in peril, people panic, especially when in crowds. That is, they act blindly and excessively out of self-preservation, potentially endangering the survival of all. . . .However, close inspection of what happens in disasters reveals a different picture. . . . Indeed, in fires and other natural hazards, people are less likely to die from over-reaction than from under-reaction, that is, not responding to signs of danger until it is too late. In fact, the concept of ‘panic’ has largely been abandoned by researchers because it neither describes nor explains what people usually do in disaster. . . . use of the notion of panic can be actively harmful. News stories that employ the language of panic often create the very phenomena that they purport to condemn. . . .

But, just a bit over two months ago, one of the authors of this article wrote an op-ed titled, “The Cognitive Bias That Makes Us Panic About Coronavirus”—and he cited lots of social-science research in making that argument.

Now, I don’t think social science research has changed so much between 28 Feb 2020 (when this pundit wrote about panic and backed it up with citations) and 30 Apr 2020 (when this same pundit coauthored a paper saying that researchers shouldn’t be talking about panic). And, yes, I know that the author of an op-ed doesn’t write the headline. But, for a guy who thinks that “the concept of ‘panic'” is not useful in describing behavior, it’s funny how quickly he leaps to use that word. A quick google turned up this from 2016: “How Pro Golf Explains the Stock Market Panic.”

All joking aside, this just gets me angry. These so-called behavioral scientists are so high and mighty, with big big plans for how they’re going to nudge us to do what they want. Bullfight tickets all around! Any behavior they see, they can come up with an explanation for. They have an N=100 lab experiment for everything. They can go around promoting themselves and their friends with the PANIC headline whenever they want. But then in their review article, they lay down the law and tell us how foolish we are to believe in “‘panic.'” They get to talk about panic whenever they want, but when we want to talk about it, the scare quotes come out.

Don’t get me wrong. I’m sure these people mean well. They’re successful people who’ve climbed to the top of the greasy academic pole; their students and colleagues tell them, week after week and month after month, how brilliant they are. We’re facing a major world event, they want to help, so they do what they can do.

Fair enough. If you’re an interpretive dancer like that character from Jules Feiffer, and you want to help with a world crisis, you do an interpretive dance. If you’re a statistician, you fit models and make graphs. If you’re a blogger, you blog. If you’re a pro athlete, you wait until you’re allowed to play again, and then you go out and entertain people. You do what you can do.

The problem is not with social scientists doing their social science thing; the problem is with them overclaiming, overselling, and then going around telling people what to do. [That wasn’t really fair of me to say this. See comment here. — AG]

A synthesis?

Can we find any overlap between the back-off recommendation of Fowler and the we-can-do-it attitude of the 42 authors? Maybe.

Back to Fowler:

Social scientists have for decades studied questions of great importance for pandemics and beyond: How should we structure our political system to best respond to crises? How should responses be coordinated between local, state and federal governments? How should we implement relief spending to have the greatest economic benefits? How can we best communicate health information to the public and maximize compliance with new norms? To the extent that we have insights to share with policy makers, we should focus much of our energy on that.

Following Fowler, maybe the 42 authors and their brothers and sisters in the world of social science should focus not on “p less than 0.05” psychology experiments, Facebook experiments, and ANES crosstabs, but on some more technical work on political and social institutions, tracing where people are spending their money, and communicating health information.

On the plus side, I didn’t notice anything in that 42-authored article promoting B.S. social science claims such as beauty and sex ratio, ovulation and voting, himmicanes, Cornell students with ESP, the critical positivity ratio, etc etc. I choose these particular claims as examples because they weren’t just mistakes—like, here’s a cool idea, too bad it didn’t replicate—but they were quantitatively wrong, and no failed replication was needed to reveal their problems. A little bit of thought and real-world knowledge was enough. Also, these were examples with no strong political content, so there’s no reason to think the journals involved were “doing a Lancet” and publishing fatally flawed work because it pushed a political agenda.

So, yeah, it’s good that they didn’t promote any of these well-publicized bits of bad science. On the other hand, it’s still not so clear from reading the article how much of the science that they do promote can be trusted.

Also, remember the problems with the scientist-as-hero narrative.

P.S. More here from Simine Vazire.

“Stay-at-home” behavior: A pretty graph but I have some questions

Or, should I say, a pretty graph and so I have some questions. It’s a positive property of a graph that it makes you want to see more.

Clare Malone and Kyle Bourassa write:

Cuebiq, a private data company, assessed the movement of people via GPS-enabled mobile devices across the U.S. If you look at movement data in a cross-section of states President Trump won in the southeast in 2016 — Tennessee, Georgia, Louisiana, North Carolina, South Carolina and Kentucky — 23 percent of people were staying home on average during the first week of March. That proportion jumped to 47 percent a month later across these six states.

And then they display this graph by Julia Wolfe:

So here are my questions:

1. Why did they pick those particular states to focus on? If they’re focusing on the south, why leave out Mississippi and Alabama? If they’re focusing on Republican-voting states, why leave out Idaho and Wyoming?

2. I’m surprised that it says that the proportion of New Yorkers staying at home increased by only 30 percentage points compared to last year. I would’ve thought it was higher. Maybe it’s a data issue? People like me are not in their database at all!

3. It’s weird how all the states show a pink line—fewer people staying at home compared to last year—at the beginning of the time series (I can’t quite tell when that is, maybe early March?). I’m guessing this is an artifact of measurement, that the number of GPS-enabled mobile devices has been gradually increasing over time, so the company that gathered these data would by default show an increase in movement (an apparent “Fewer people stayed home”) even in the absence of any change in behavior.

I’m thinking it would make sense to shift the numbers, or the color scheme, accordingly. As it is, the graph shows a dramatic change at the zero point, but if this zero is artifactual, then this could be misleading.

I guess what I’d like to see is a longer time series. Show another month at the beginning of each series, and that will give us a baseline.

Again, it’s not a slam on this graph to say that it makes me want to learn more.

Stay-at-home orders

The above-linked article discusses the idea that people were already staying at home, before any official stay-at-home orders were issued. And, if you believe the graphs, it looks like stay-at-home behavior did not even increase following the orders. This raises the question of why issue stay-at-home orders at all, and it also raises statistical questions about estimating the effects of such orders.

An argument against stay-at-home or social-distancing orders is that, even in the absence of any government policies on social distancing, at some point people would’ve become so scared that they would’ve socially distanced themselves, canceling trips, no longer showing up to work and school, etc., so the orders are not necessary.

Conversely, an argument in favor of governmentally mandated social distancing is that it coordinates expectations. I remember in early March that we had a sense that there were big things going on but we weren’t sure what to do. If everyone is deciding on their own whether to go to work etc., things can be a mess. Yes, there is an argument in favor of decentralized decision making, but what do you do, for example, if schools are officially open but half the kids are too scared to show up?

P.S. In comments, Brent points out a problem with framing this based on “stay-at-home orders”:

In my [Hutto’s] state the order closing schools was on March 15. The “stay at home” order came on April 7.

As best as I can interpret the x-axis of the graphs, they have the April 7 order marked with the vertical line.

It’s no puzzle why mobility data showed more people staying at home three weeks earlier. Mobility became limited on Monday, March 16 when a million or so families suddenly had children to take care of at home instead of going off to school.

This also raises questions about estimates of the effects of interventions such as lockdowns and school closings. Closing schools induces some social distancing and staying home from work, even beyond students and school employees.

“Young Lions: How Jewish Authors Reinvented the American War Novel”

I read this book by Leah Garrett and I liked it a lot. Solid insights on Joseph Heller, Saul Bellow, and Norman Mailer, of course, but also the now-forgotten Irwin Shaw (see here and here) and Herman Wouk. Garrett’s discussion of The Caine Mutiny was good: she takes it seriously enough to point out its flaws, showing respect for it as a work of popular art.

I’d read many of the novels that Garrett wrote about, but when I’d read them I’d not thought much about the Jewish element (except in The Young Lions, where it’s central to the book, no close reading necessary). Garrett’s book gave me insight into the Jewish themes but also helped me reinterpret the novels in their social and political context.

P.S. Garrett is speaking online next week about her new book, X Troop: The Secret Jewish Commandos of World War II.

“1919 vs. 2020”

We had this discussion the other day about a questionable claim regarding the effects of social distancing policies during the 1918/1919 flu epidemic, and then I ran across this post by Erik Loomis who compares the social impact of today’s epidemic to what happened 102 years ago:

It’s really remarkable to me [Loomis] that the flu of a century killed 675,000 Americans out of a population of 110 million, meaning that roughly works out to the 2.2 million upper range guess of projections for COVID-19 by proportion of the population. And yet, the cultural response to it was primarily to shrug our collective shoulders and get on with our lives. . . . Some communities did engage in effective quarantining, for instance, and there were real death rate differentials between them. But to my knowledge anyway, sports weren’t cancelled. The World Series went on as normal (and quite famously in 1919!). There was no effective government response at the federal level.

Moreover, when it ended, the Spanish flu had almost no impact on American culture. There’s a very few references to it in American literature. Katherine Anne Porter’s Pale Horse Pale Rider. Hemingway mentions it in Death in the Afternoon. There’s a good John O’Hara story about it. And….that’s basically it? . . .

Now, yes it is true that the years of 1918 and 1919 were fast-paced years in the U.S. Over 100,000 people died in World War I . . . [but] while the war and its aftermath obviously were dominant features of American life at the time, there’s hardly anything in there that would erase the memory of a situation where nearly 7 times as many people died as in the war.

So what is going on here? . . . Americans were simply more used to death in 1919 than in 2020. People died younger and it was a more common fact of life then. Now, don’t underestimate the science in 1919. The germ theory was pretty well-established. Cities were being cleaned up. People knew that quarantining worked. The frequent pandemics of the 16th-19th centuries were largely in the past. But still….between deaths in pregnancy and deaths on the job, deaths from poisonings of various sorts and deaths from any number of accidents in overcrowded and dangerous cities, people died young. . . .

I remember thinking about this in the 1970s and 1980s, when we were all scared of being blown up in a nuclear war. (Actually, I’m still scared about that.) My reasoning went like this: (1) The post-1960s period was the first time in human history that we had the ability to destroy our civilization. (2) This seemed particularly horrifying for my generation because we had grown up with the assumption that we’d all live long and full lives. (3) If it wasn’t nuclear war, it would be biological weapons: the main reason that the U.S. and the Soviet Union didn’t have massive bioweapons programs was that nuclear weapons were more effective at mass killing. (4) It made sense that the ability to develop devastating biological weapons came at around the same time as we could cure so many diseases. So, immortality and potential doom came together.

Regarding 1918, remember this graph:

Just pretend they did the right thing and had the y-axis go down to 0. Then you’ll notice two things: First, yeah, the flu in 1918 really was a big deal—almost 4 times the death rate of earlier years. Second, it was only 4 times the death rate. I mean, yeah, that’s horrible, but only a factor of 4, not a factor of 10. I guess what I’m saying is, I hadn’t realized how much of a scourge flu/pneumonia was even in non-“pandemic” years. Interesting.

These are all just scattered thoughts. There must be some books on the 1918/1919 flu that would give some more perspective on all this.

Coronavirus Grab Bag: deaths vs qalys, safety vs safety theater, ‘all in this together’, and more.

This post is by Phil Price, not Andrew.

This blog’s readership has a very nice wind-em-up-and-watch-them-go quality that I genuinely appreciate: a thought-provoking topic provokes some actual thoughts. So here are a few things I’ve been thinking about, without necessarily coming to firm conclusions. Help me think about some of these. This post is rather long so I’m putting most of it below the fold.


Uncertainty and variation as distinct concepts

Jake Hofman, Dan Goldstein, and Jessica Hullman write:

Scientists presenting experimental results often choose to display either inferential uncertainty (e.g., uncertainty in the estimate of a population mean) or outcome uncertainty (e.g., variation of outcomes around that mean). How does this choice impact readers’ beliefs about the size of treatment effects? We investigate this question in two experiments comparing 95% confidence intervals (means and standard errors) to 95% prediction intervals (means and standard deviations). The first experiment finds that participants are willing to pay more for and overestimate the effect of a treatment when shown confidence intervals relative to prediction intervals. The second experiment evaluates how alternative visualizations compare to standard visualizations for different effect sizes. We find that axis rescaling reduces error, but not as well as prediction intervals or animated hypothetical outcome plots (HOPs), and that depicting inferential uncertainty causes participants to underestimate variability in individual outcomes.

These results make sense. Sometimes I try to make this point by distinguishing between uncertainty and variation. I’ve always thought of these as distinct concepts (we can speak of uncertainty in the estimate of a population average, or of variation across the population), but then I started quizzing students, and I learned that, to them, “uncertainty” and “variation” were not distinct concepts at all. Part of this is wording (there’s a sense in which the two words are roughly synonyms), but I think part of it is that most people don’t think of these as two different ideas. And if lots of students don’t get this distinction, it’s no surprise that researchers and consumers of research also get stuck on it.
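To make the distinction concrete, here’s a small simulation, a toy sketch with made-up numbers rather than anything from the paper, showing how a 95% confidence interval for a mean and a 95% prediction interval for individual outcomes can look very different on the same data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated outcomes for a hypothetical treatment group:
# individual responses vary a lot around the mean.
y = rng.normal(loc=5.0, scale=10.0, size=200)

n = len(y)
mean = y.mean()
sd = y.std(ddof=1)     # variation of individual outcomes
se = sd / np.sqrt(n)   # uncertainty in the estimate of the mean

# 95% confidence interval for the population mean (inferential uncertainty)
ci = (mean - 1.96 * se, mean + 1.96 * se)

# 95% prediction interval for a single new outcome (outcome variation),
# ignoring the small extra width from uncertainty in the mean
pi = (mean - 1.96 * sd, mean + 1.96 * sd)

print(f"mean = {mean:.1f}")
print(f"95% CI for the mean:      ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI for a new outcome: ({pi[0]:.1f}, {pi[1]:.1f})")
```

The confidence interval shrinks as n grows; the prediction interval does not, because individual outcomes keep varying no matter how precisely we estimate the mean.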

I’m reminded of the example from a few months ago where someone published a paper including graphs that revealed the sensitivity of its headline conclusions to some implausible assumptions. The question then arose: if the paper had not included the graph, maybe no one would’ve realized the problem. I argued that, had the graph not been there, I would’ve wanted to see the data. But a lot of people would just accept the estimate and standard error and not want to know more.

They want open peer review for their paper, and they want it now. Any suggestions?

Someone writes:

We’re in the middle of what feels like a drawn out process of revise and resubmit with one of the big journals (though by pre-pandemic standards everything has moved quite quickly), and what’s most frustrating is that the helpful criticisms and comments from the reviewers, plus our extensive responses and new sensitivity analyses and model improvements, are all happening not in public. (We could post them online, but I think we’re not allowed to share the reviews! And we didn’t originally put our report on a preprint server, which I now regret, so a little hard to get an update disseminated.)

For our next report, I wonder if you know of any platforms that’d allow us to do the peer review out in the open. Medrxiv/arxiv.org are great for getting the preprints out there, but not collecting reviews. Something like OpenReview.net (used for machine learning conferences) might work. Maybe there’s something else out there you know about? Do any journals do public peer review?

My reply:

Can PubPeer include reviews of preprints? And there is a site called Researchers One that does open review.

Also, you could send me your paper, I’ll post it here and people can give open reviews in the comments section!

Standard deviation, standard error, whatever!

Ivan Oransky points us to this amusing retraction of a meta-analysis. The problem: “Standard errors were used instead of standard deviations when using data from one of the studies”!

Actually, I saw something similar happen in a consulting case once. The other side had a report with estimates and standard errors . . . the standard errors were suspiciously low . . . I could see that the numbers were wrong right away, but it took me a couple hours to figure out that what they’d done was to divide by sqrt(N) rather than sqrt(n)—that is, they used the population size rather than the sample size when computing their standard errors.

As Bob Carpenter might say, it doesn’t help that statistics uses such confusing jargon. Standard deviation, standard error, variance, bla bla bla.
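For the record, here’s the distinction in a few lines of code (made-up numbers, nothing to do with the case above), including the sqrt(N)-instead-of-sqrt(n) mistake:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 1_000_000                    # population size
n = 400                          # sample size
sample = rng.normal(100, 15, n)  # a sample of n measurements

sd = sample.std(ddof=1)          # standard deviation: spread of the data
se = sd / np.sqrt(n)             # standard error: uncertainty in the sample mean

# The mistake: dividing by the population size instead of the sample size
se_wrong = sd / np.sqrt(N)

print(f"standard deviation:       {sd:.2f}")
print(f"standard error (correct): {se:.2f}")
print(f"'standard error' (wrong): {se_wrong:.4f}")  # suspiciously small
```

Dividing by sqrt(N) instead of sqrt(n) makes the reported standard error about 50 times too small here, exactly the kind of suspiciously low number that gives the game away.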

But what really amused me about this Retraction Watch article was this quote at the end:

As Ingram Olkin stated years ago, “Doing a meta-analysis is easy . . . Doing one well is hard.”

Whenever I see the name Ingram Olkin, I think of this story from the cigarette funding archives:

Much of the cancer-denial work was done after the 1964 Surgeon General’s report. For example,

The statistician George L. Saiger from Columbia University received [Council for Tobacco Research] Special Project funds “to seek to reduce the correlation of smoking and diseases by introduction of additional variables”; he also was paid $10,873 in 1966 to testify before Congress, denying the cigarette-cancer link.

. . .

Ingram Olkin, chairman of Stanford’s Department of Statistics, received $12,000 to do a similar job (SP-82) on the Framingham Heart Study . . . Lorillard’s chief of research okayed Olkin’s contract, commenting that he was to be funded using “considerations other than practical scientific merit.”

So maybe doing a meta-analysis badly is hard, too!

It’s “a single arena-based heap allocation” . . . whatever that is!

After getting 80 zillion comments on that last post with all that political content, I wanted to share something that’s purely technical.

It’s something Bob Carpenter wrote in a conversation regarding implementing algorithms in Stan:

One thing we are doing is having the matrix library return more expression templates rather than copying on return as it does now. This is huge in that it avoids a lot of intermediate copies. Some of these don’t look so big when running a single chain, but stand out more when running multiple chains in parallel when there’s overall more memory pressure.

Another current focus for fusing operations is the GPU so that we don’t need to move data on and off GPU between operations.

Stan only does a single arena-based heap allocation other than for local vector and Eigen::Matrix objects (which are standard RAII). Actually, it’s more of an expanding thing, since it’s not pre-sized. But each bit only gets allocated once in exponentially increasing chunk sizes, so there’ll be at most log(N) chunks. It then reuses that heap memory across iterations and only frees at the end.

We’ve also found it useful to have a standard functional map operation internally that partially evaluates reverse-mode autodiff for each block over which the function is mapped. This reduces overall memory size and keeps the partial evaluations more memory local, all of which speeds things up at the expense of clunkier Stan code.

The other big thing we’re doing now is looking at static matrix types, of the sort used by Autograd and JAX. Stan lets you assign into a matrix, which destroys any hope of memory locality. If matrices never have their entries modified after creation (for example through a comprehension or other operation), then the values and adjoints can be kept as separate double matrices. Our early experiments are showing a 5–50-fold speedup depending on the order of data copying reduced and operations provided. Addition’s great at O(N^2) copies currently for O(N^2) operations (on N x N matrices). Multiplication with an O(N^2) copy cost and O(N^3) operation cost is less optimizable when matrices get big.

I love this because I don’t understand half of what Bob’s talking about, but I know it’s important. To make decisions under uncertainty, we want to fit hierarchical models. This can increase computational cost, etc. In short: Technical progress on computing allows us to fit better models, so we can learn more.
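For the curious, here’s a toy sketch of the arena idea, in Python just to convey the concept (Stan’s actual allocator is C++ and far more sophisticated): memory is grabbed in exponentially growing chunks, handed out by bumping an offset, and reset rather than freed between iterations.

```python
class Arena:
    """Toy bump allocator: grab memory in exponentially growing chunks,
    hand out slices, and reuse the same chunks across iterations."""

    def __init__(self, first_chunk_size=1024):
        self.chunks = [bytearray(first_chunk_size)]
        self.chunk_index = 0  # which chunk we're currently filling
        self.offset = 0       # next free byte within that chunk

    def alloc(self, nbytes):
        # Advance through the chunks until one has room for this request,
        # growing the arena (with doubling sizes) when we run off the end.
        # With doubling, at most about log2(N) chunks are ever created.
        while self.offset + nbytes > len(self.chunks[self.chunk_index]):
            self.chunk_index += 1
            self.offset = 0
            if self.chunk_index == len(self.chunks):
                self.chunks.append(bytearray(max(2 * len(self.chunks[-1]), nbytes)))
        chunk = self.chunks[self.chunk_index]
        view = memoryview(chunk)[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return view

    def recover(self):
        # Between iterations: keep the chunks, just rewind to the start.
        self.chunk_index = 0
        self.offset = 0


arena = Arena()
for iteration in range(1000):
    buf = arena.alloc(512)  # after warm-up, no new memory is requested
    arena.recover()         # reuse the same chunks next iteration; free only at the end
```

The point of the design is that after the first few iterations no allocation happens at all: the same chunks get bumped through on every iteration, and everything is released in one shot at the end.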

Recall the problems with this study, which could’ve been avoided by a Bayesian analysis of uncertainty in specificity and sensitivity, and multilevel regression and poststratification for adjusting for differences between sampling and population. In that particular example, no new computational developments are required—Stan will work just fine as it is, and to the extent that Stan improvements would help, it would be in documentation and interfaces. But, moving forward, computational improvements will help us fit bigger and better models. This stuff can make a real difference in our lives, so I wanted to highlight how technical it can all get.
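As a reminder of what that kind of adjustment looks like, here’s a minimal sketch with illustrative numbers (not the study’s data), using a quick Monte Carlo rather than the full Bayesian analysis you’d write in Stan, and leaving out the MRP step entirely:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative numbers only:
pos, tested = 50, 3000             # raw positives in the survey
sens_hits, sens_trials = 78, 85    # validation data for test sensitivity
spec_hits, spec_trials = 368, 371  # validation data for test specificity

draws = 10_000
p_raw = rng.beta(pos + 1, tested - pos + 1, draws)
sens = rng.beta(sens_hits + 1, sens_trials - sens_hits + 1, draws)
spec = rng.beta(spec_hits + 1, spec_trials - spec_hits + 1, draws)

# Rogan-Gladen correction: back out true prevalence from an imperfect test
prev = np.clip((p_raw + spec - 1) / (sens + spec - 1), 0, 1)

print(f"raw positive rate: {pos / tested:.3f}")
print("adjusted prevalence, 95% interval:",
      np.round(np.percentile(prev, [2.5, 97.5]), 3))
```

Even this crude version shows the key point: with specificity uncertain at that level, the interval for the true prevalence stretches down toward zero, which is the kind of uncertainty that gets lost if you ignore error in sensitivity and specificity.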