
No, it’s not correct to say that you can be 95% sure that the true value will be in the confidence interval

Hans van Maanen writes:

May I put another statistical question to you?

If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me. My visualisation is that she filled a bowl with 100 intervals, 95 of which do contain the true value and 5 do not, and she picked one at random.
Now, if she gives me two independent 95%-CI’s (e.g., two primary endpoints in a clinical trial), I can only be 90% sure (0.95^2 = 0.9025) that they both contain the true value. If I have a table with four measurements and 95%-CI’s, there’s only an 81% chance they all contain the true value.

Also, if we have two results and we want to be 95% sure both intervals contain the true values, we should construct two 97.5%-CI’s (0.95^(1/2) = 0.9747), and if we want to have 95% confidence in four results, we need 98.7%-CI’s (0.95^(1/4) = 0.9873).

I’ve read quite a few texts trying to get my head around confidence intervals, but I don’t remember seeing this discussed anywhere. So am I completely off, is this a well-known issue, or have I just invented the Van Maanen Correction for Multiple Confidence Intervals? ;-))

I hope you have time for an answer. It puzzles me!
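The arithmetic in the letter can be checked with a few lines of Python. This is only a sketch of the calculation as the letter sets it up: it assumes the intervals are independent and that each has exactly its nominal coverage, and the function names are made up for illustration. (The per-interval adjustment for independent intervals is commonly known as the Šidák correction.)

```python
# Joint coverage of m independent intervals, and the per-interval coverage
# needed to hit a joint target. Assumes independence and exact nominal coverage.

def joint_coverage(per_interval: float, m: int) -> float:
    """Probability that all m independent intervals cover their true values."""
    return per_interval ** m

def per_interval_coverage(joint_target: float, m: int) -> float:
    """Coverage each of m independent intervals needs to reach joint_target."""
    return joint_target ** (1.0 / m)

for m in (1, 2, 4):
    print(
        m,
        round(joint_coverage(0.95, m), 4),
        round(per_interval_coverage(0.95, m), 4),
    )
```

For m = 2 and m = 4 this reproduces the 0.9025 and roughly 0.81 joint coverages quoted in the letter, along with the per-interval levels needed for 95% joint coverage.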

My reply:

Sure, I can help you, but in English:

1. “If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me.” Not quite true. Yes, true on average, but not necessarily true in any individual case. Some intervals are clearly wrong. Here’s the point: even if you picked an interval at random from the bowl, once you see the interval you have additional information. Sometimes the entire interval is implausible, suggesting that it’s likely that you happened to have picked one of the bad intervals in the bowl. Other times, the interval contains the entire range of plausible values, suggesting that you’re almost completely sure that you have picked one of the good intervals in the bowl. This can especially happen if your study is noisy and the sample size is small. For example, suppose you’re trying to estimate the difference in proportion of girl births, comparing two different groups of parents (for example, beautiful parents and ugly parents). You decide to conduct a study of N=400 births, with 200 in each group. Your estimate will be p2 – p1, with standard error sqrt(0.5^2/200 + 0.5^2/200) = 0.05, so your 95% conf interval will be p2 – p1 +/- 0.10. We happen to be pretty sure that any true population difference will be less than 0.01 (see here), hence if p2 – p1 is between -0.09 and +0.09, we can be pretty sure that our 95% interval does contain the true value. Conversely, if p2 – p1 is less than -0.11 or more than +0.11, then we can be pretty sure that our interval does not contain the true value. Thus, once we see the interval, it’s no longer generally a correct statement to say that you can be 95% sure the interval contains the true value.

2. Regarding your question: I don’t really think it makes sense to want 95% confidence in four results. It makes more sense to accept that our inferences are uncertain; we should not demand that they all be correct, or act as if they are.
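The girl-births example in point 1 can be made concrete. The sketch below uses the numbers from that example (200 births per group, a proportion near 0.5, an interval of plus or minus 2 standard errors, and a true difference assumed to be no larger than 0.01 in absolute value); the function name and the wording of the three labels are mine, for illustration only.

```python
import math

n_per_group = 200
p = 0.5  # approximate proportion of girl births, used for the standard error

# Standard error of the difference in proportions, as in the example above
se = math.sqrt(p * (1 - p) / n_per_group + p * (1 - p) / n_per_group)
half_width = 2 * se  # the example uses +/- 2 standard errors, i.e. +/- 0.10

max_true_diff = 0.01  # assumed bound on any true population difference

def interval_status(estimate: float) -> str:
    """Compare the 95% interval around `estimate` with the plausible range."""
    lo, hi = estimate - half_width, estimate + half_width
    if lo <= -max_true_diff and hi >= max_true_diff:
        return "covers all plausible values"
    if hi < -max_true_diff or lo > max_true_diff:
        return "covers no plausible value"
    return "covers some plausible values"

for est in (0.05, 0.10, 0.12, -0.15):
    print(est, interval_status(est))
```

Once the estimate is in hand, the interval falls into one of these cases, which is why the unconditional "95%" no longer describes what you know.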

Claims about excess road deaths on “4/20” don’t add up

Sam Harper writes:

Since you’ve written about similar papers (that recent NRA study in NEJM, the birthday analysis) before and we linked to a few of your posts, I thought you might be interested in this recent blog post we wrote about a similar kind of study claiming that fatal motor vehicle crashes increase by 12% after 4:20pm on April 20th (an annual cannabis celebration…google it).

The post is by Harper and Adam Palayew, and it’s excellent. Here’s what they say:

A few weeks ago a short paper was published in a leading medical journal, JAMA Internal Medicine, suggesting that, over the 25 years from 1992-2016, excess cannabis consumption after 4:20pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after. Here is the key result from the paper:

In total, 1369 drivers were involved in fatal crashes after 4:20 PM on April 20 whereas 2453 drivers were in fatal crashes on control days during the same time intervals (corresponding to 7.1 and 6.4 drivers in fatal crashes per hour, respectively). The risk of a fatal crash was significantly higher on April 20 (relative risk, 1.12; 95% CI, 1.05-1.19; P = .001).
— Staples JA, Redelmeier DA. The April 20 Cannabis Celebration and Fatal Traffic Crashes in the United States JAMA Int Med, Feb 18, 2018, p.E2

Naturally, this sparked (heh) considerable media interest, not only because p<.05 and the finding is “surprising”, but also because cannabis is a hot topic these days (and, of course, April 20th happens every year).

But how seriously should we take these findings? Harper and Palayew crunch the numbers:

If we try and back out some estimates of what might have to happen on 4/20 to generate a 12% increase in the national rate of fatal car crashes, it seems less and less plausible that the 4/20 effect is reliable or valid. Let’s give it a shot. . . .

Over the 25 year period [the authors of the linked paper] tally 1369 deaths on 4/20 and 2453 deaths on control days, which works out to average deaths on those days each year of 1369/25 ~ 55 on 4/20 and 2453/25/2 ~ 49 on control days, an average excess of about 6 deaths each year. If we use our estimates of post-1620h VMT above, that works out to around 55/2.5 = 22 fatal crashes per billion VMT on 4/20 vs. 49/2.5 = 19.6 on control days. . . .

If we don’t assume the relative risk changes on 4/20, just more people smoking, what proportion of the population would need to be driving while high to generate a rate of 22 per billion VMT? A little algebra tells us that to get to 22 we’d need to see something like . . . 15%! That’s nearly one-sixth of the population driving while high on 4/20 from 4:20pm to midnight, which doesn’t, absent any other evidence, seem very likely. . . . Alternatively, one could also raise the relative risk among cannabis drivers to 6x the base rate and get something close. Or some combination of the two. This means either the nationwide prevalence of driving while using cannabis increases massively on 4/20, or the RR of a fatal crash with the kind of cannabis use happening on 4/20 is absurdly high. Neither of these scenarios seem particularly likely based on what we currently know about cannabis use and driving risks.
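The per-year arithmetic in the quoted passage is easy to reproduce. This sketch uses only numbers that appear above: the two counts from the paper, the 25-year span, two control days per year, and the 2.5 billion post-16:20 VMT figure that Harper and Palayew divide by. The variable names are mine. It does not attempt the 15% prevalence calculation, which depends on baseline assumptions not given here.

```python
count_420 = 1369       # total on April 20 after 4:20pm, 1992-2016
count_control = 2453   # total on the two control days (one week before and after)
years = 25
vmt_billion = 2.5      # post-16:20 VMT in billions, as used in the quoted passage

per_year_420 = count_420 / years
per_year_control = count_control / years / 2   # two control days per year
excess_per_year = per_year_420 - per_year_control

# Rates per billion VMT, using the rounded yearly counts as in the passage
rate_420 = round(per_year_420) / vmt_billion
rate_control = round(per_year_control) / vmt_billion

# Crude ratio of April 20 to the average control day
ratio = count_420 / (count_control / 2)

print(round(per_year_420, 1), round(per_year_control, 1), round(excess_per_year, 1))
print(rate_420, rate_control, round(ratio, 2))
```

The crude ratio lands close to the published relative risk of 1.12, so the disagreement is about interpretation, not arithmetic.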

They also look at the big picture:

Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part. It’s similar to that NRA study (see link at beginning of this post) in that the numbers just don’t add up.

Harper sent me this email last year. I wrote the above post and scheduled it for 4/20. In the meantime, he had more to report:

We published a replication paper with some additional analysis. The original paper in question (in JAMA Internal Med no less) used a design (comparing an index ‘window’ on a given day to the same ‘window’ +/- 1 week) similar to some others that you have blogged about (the NRA study, for example), and I think it merits similar skepticism (a sizeable fraction of the population would need to be driving while drugged/intoxicated on this day to raise the national rate by such a margin).

As I said, my co-author Adam Palayew and I replicated that paper’s findings but also showed that their results seem much more consistent with daily variations in traffic crashes throughout the year (lots of noise) and we used a few other well known “risky” days (July 4th is quite reliable for excess deaths from traffic crashes) as a comparison. We also used Stan to fit some partial pooling models to look at how these “effects” may vary over longer time windows.

I wrote an updated blog post about it here.

And the gated version of the paper is now posted on Injury Prevention’s website, but we have made a preprint and all of the raw data and code to reproduce our work available at my Open Science page.


A question about the piranha problem as it applies to A/B testing

Wicaksono Wijono writes:

While listening to your seminar about the piranha problem a couple weeks back, I kept thinking about a similar work situation but in the opposite direction. I’d be extremely grateful if you share your thoughts.

So the piranha problem is stated as “There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.” The task, then, is to find out which large effects are real and which are spurious.

At work, sometimes people bring up the opposite argument. When experiments (A/B tests) are pre-registered, a lot of times the results are not statistically significant. And a few months down the line people would ask if we can re-run the experiment, because the app or website has changed, and so the treatment might interact differently with the current version. So instead of arguing that large effects can be explained by an interaction of previously established large effects, some people argue that large effects are hidden by yet unknown interaction effects.

My gut reaction is a resounding no, because otherwise people would re-test things every time they don’t get the results they want, and the number of false positives would go up like crazy. But it feels like there is some ring of truth to the concerns they raise.

For instance, if the old website had a green layout, and we changed the button to green, then it might have a bad impact. However, if the current layout is red, making the button green might make it stand out more, and the treatment will have a positive effect. In that regard, it will be difficult to see consistent treatment effects over time when the website itself keeps evolving and the interaction terms keep changing. Even for previously established significant effects, how do we know that the effect size estimated a year ago still holds true with the current version?

What do you think? Is there a good framework to evaluate just when we need to re-run an experiment, if that is even a good idea? I can’t find a satisfying resolution to this.

My reply:

I suspect that large effects are out there, but, as you say, the effects can be strongly dependent on context. So, even if an intervention works in a test, it might not work in the future because in the future the conditions will change in some way. Given all that, I think the right way to study this is to explicitly model effects as varying. For example, instead of doing a single A/B test of an intervention, you could try testing it in many different settings, and then analyze the results with a hierarchical model so that you’re estimating varying effects. Then when it comes to decision-making, you can keep that variation in mind.
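To give a rough sense of what estimating varying effects looks like, here is a minimal sketch in plain Python. It is not a full hierarchical Bayesian fit (for that you would write the model in something like Stan); it is a crude empirical-Bayes approximation to a normal-normal model, with the between-setting variance estimated by the method of moments. The data are hypothetical and the function name is made up.

```python
from statistics import mean, variance

def partial_pool(estimates, std_errors):
    """Shrink per-setting estimates toward a common mean (normal-normal model).

    Returns (mu, tau2, pooled), where tau2 is a method-of-moments estimate of
    the between-setting variance. This is a rough approximation, not a full
    Bayesian analysis.
    """
    se2 = [s ** 2 for s in std_errors]
    # Observed spread minus average sampling variance, floored at zero
    tau2 = max(0.0, variance(estimates) - mean(se2))
    weights = [1.0 / (v + tau2) for v in se2]
    mu = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    if tau2 == 0.0:
        # No evidence of variation beyond noise: complete pooling
        return mu, tau2, [mu] * len(estimates)
    pooled = [
        (y / v + mu / tau2) / (1.0 / v + 1.0 / tau2)
        for y, v in zip(estimates, se2)
    ]
    return mu, tau2, pooled

# Hypothetical estimated lifts (percentage points) for one treatment
# tested in five different settings, each with standard error 1.0
estimates = [2.0, -1.0, 0.5, 3.5, -0.5]
std_errors = [1.0, 1.0, 1.0, 1.0, 1.0]

mu, tau2, pooled = partial_pool(estimates, std_errors)
print(mu, tau2)
print(pooled)
```

Each setting's estimate is pulled toward the overall mean, more strongly when the sampling noise is large relative to the estimated variation across settings. The estimated variation itself is what you would carry into the decision about whether an old result still applies.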

Lessons about statistics and research methods from that racial attitudes example

Yesterday we shared some discussions of recent survey results on racial attitudes.

For students and teachers of statistics or research methods, I think the key takeaway should be that you don’t want to pull out just one number from a survey; you want to get the big picture by looking at multiple questions, multiple years, and multiple data sources. You want to use the secret weapon.

Where do formal statistical theory and methods come in here? Not where you might think. No p-values or Bayesian inferences in the above-linked discussion, not even any confidence intervals or standard errors.

But that doesn’t mean that formal statistics are irrelevant, not at all.

Formal statistics gets used in the design and analysis of these surveys. We use probability and statistics to understand and design sampling strategies (cluster sampling, in the case of the General Social Survey) and to adjust for differences between sample and population (poststratification and survey weights, or, if these adjustments are deemed not necessary, statistical methods are used to make that call too).

Formal statistics underlies this sort of empirical work in social science—you just don’t see it because it was already done before you got to the data.

“Sometimes all we have left are pictures and fear”: Dan Simpson talk in Columbia stat dept, 4pm Monday

4:10pm Monday, April 22 in Social Work Bldg room 903:

Data is getting weirder. Statistical models and techniques are more complex than they have ever been. No one understands what code does. But at the same time, statistical tools are being used by a wider range of people than at any time in the past. And they are not just using our well-trodden, classical tools. They are working at the bleeding edge of what is possible. With this in mind, this talk will look at how much we can trust our tools. Do we ever really compute the thing we think we do? Can we ever be sure our code worked? Are there ways that it’s not safe to use the output? While “reproducibility” may be the watchword of the new scientific era, if we also want to ensure safety maybe all we have to lean on are pictures and fear.

Important stuff.

Changing racial differences in attitudes on changing racial differences

Elin Waring writes:

Have you been following the release of GSS results this year? I had been vaguely aware that there was reporting on a few items but then I happened to run the natrace and natracey variables (I use these in my class to look at question wording). They come from the “are we spending too much/too little/about the right amount” questions on “Improving the conditions of blacks” and “aid to blacks” (the images are from the SDA website at Berkeley):

Much as I [Waring] would love to believe that the American public really has changed racial attitudes, I find such a huge shift over such a short time very unlikely given what we know about stability of attitudes. And I even broke it down by age and there was a shift for all the age groups.

Then I saw this, and a colleague mentioned to me that the results for proportion not sexually active were strange. And then today people were talking about the increase in the proportion not religiously affiliated.

It just seems very odd to me and I wondered if you had noticed it too. Could it be they just hit a strange cluster in their sampling? Or a weighting error of some kind? It’s true that attitudes on gay marriage changed very fast and that seems real, but this seems so surprising across so many separate issues.

I wasn’t sure so I passed this along to David Weakliem, my go-to guy when it comes to making sense of surveys and public opinion. Weakliem responded with some preliminary thoughts:

It did seem hard to believe at first. But there was a big move from 2014 to 2016 too (bigger than 2016-8), so if there is a problem with the survey it’s not just with 2018. The GSS also has a general question about whether the government has a special obligation to help blacks vs. no special treatment, and that also showed large moves in a liberal direction from 2014-6 and again from 2016-8. Finally, I looked for relevant questions from other surveys. There are some about how much discrimination there is. In 2013 and 2014, 19% and then 17% said there was a lot of discrimination against “African Americans” but in 2015 it was 36%; in 2016 and 2017 the question referred to “blacks” and 40% said there was a lot. So it seems that there really has been a substantial change in opinions about race since 2014. As far as why, I would guess that the media coverage and videos of police mistreatment of blacks had an impact—they made people think there really is a problem.

To which Waring replied:

The one thing I’d say in response to David is that while he could be right, these are shifts across a number of the long term variables not just the racial attitudes. Also I think that GSS is intentionally designed to not be so responsive to day to day fluctuations based on the latest news. And POLHITOK sees an increase in “no” responses in 2018 but not so dramatic and it looks like it’s in the same general territory as others from 2006 forward.

What really made me look at those particular variables was all the recent talk about reparations for slavery.

I also saw that Jay Livingston, who I wish had his own column in the New York Times—I’d rather see a sociologist’s writing about sociology, than an ignorant former reporter’s writing about sociology—wrote something recently on survey attitudes regarding racial equality, but using a different data source:

Just last week, Pew published a report (here) about race in the US. Among many other things, it asked respondents about the “major” reasons that Black people “have a harder time getting ahead.” As expected, Whites were more likely to point to cultural/personal factors, Blacks to structural ones. But compared with a similar survey Pew did just three years ago, it looks like everyone is becoming more woke. . . .

For “racial discrimination,” Black-White difference remains large. But in both groups, the percentage citing it as a major cause increases – by 14 points among Blacks, by nearly 20 points among Whites. The percent identifying access to good schools as an important factor has not changed so much, increasing slightly among both Blacks and Whites.

More curious are the responses about jobs. In 2013, far more Whites than Blacks said that the lack of jobs was a major factor. In the intervening three years, jobs as a reason for not getting ahead became more salient among Blacks, less so among Whites.

At the same time, “culture of poverty” explanations became less popular.

Livingston continues with some GSS data and then concludes:

If both Whites and Blacks are paying more attention to racial discrimination and less to personal-cultural factors, if everyone is more woke, how does this square with the widely held perception that in the era of Trump, racism is on the rise? (In the Pew survey, 56% overall and 49% of Whites said Trump has made race relations worse. In no group, even self-identified conservatives, does anything even close to a majority say that Trump has made race relations better.)

The data here points to a more complex view of recent history. The nastiest of the racists may have felt freer to express themselves in word and deed. And when they do, they make the news. Hence the widespread perception that race relations have deteriorated. But surveys can tell us what we don’t see on the news and Twitter. And in this case what they tell us is that the overall trend among Whites has been towards more liberal views on the causes of race differences in who gets ahead.

Interesting. Also an increasing proportion of Americans are neither white nor black. So lots going on here.

P.S. Livingston adds:

I also noticed something when I was checking the GSS data that Tristan Bridges posted about LGB self-identification. For those variables (and maybe others—I haven’t looked), the GSS 2014 sample was much larger than in other years before and since, and the 2018 sample smaller. That shouldn’t affect the actual percents, but with fairly rare responses like identifying as gay, the sample size did make me pause to wonder. With larger-n attitude items it shouldn’t matter.

I followed the link to Bridges’s blog, which had lots of interesting stuff, including this post from 2016, Why Popular Boy Names are More Popular than Popular Girl Names, which featured this familiar-looking graph:

Why did this graph look so familiar?? Because I plotted the exact same data in 2013:



I assume that Bridges just independently came up with the same idea that I had—these are public data, and counting the top 10 names is a pretty obvious thing to do, I guess. It was just funny to come across this graph again, in an unexpected place.

Abandoning statistical significance is both sensible and practical

Valentin Amrhein, Sander Greenland, Blakeley McShane, and I write:

Dr Ioannidis writes against our proposals [here and here] to abandon statistical significance in scientific reasoning and publication, as endorsed in the editorial of a recent special issue of an American Statistical Association journal devoted to moving to a “post p<0.05 world.” We appreciate that he echoes our calls for “embracing uncertainty, avoiding hyped claims…and recognizing ‘statistical significance’ is often poorly understood.” We also welcome his agreement that the “interpretation of any result is far more complicated than just significance testing” and that “clinical, monetary, and other considerations may often have more importance than statistical findings.”

Nonetheless, we disagree that a statistical significance-based “filtering process is useful to avoid drowning in noise” in science and instead view such filtering as harmful. First, the implicit rule to not publish nonsignificant results biases the literature with overestimated effect sizes and encourages “hacking” to get significance. Second, nonsignificant results are often wrongly treated as zero. Third, significant results are often wrongly treated as truth rather than as the noisy estimates they are, thereby creating unrealistic expectations of replicability. Fourth, filtering on statistical significance provides no guarantee against noise. Instead, it amplifies noise because the quantity on which the filtering is based (the p-value) is itself extremely noisy and is made more so by dichotomizing it.

We also disagree that abandoning statistical significance will reduce science to “a state of statistical anarchy.” Indeed, the journal Epidemiology banned statistical significance in 1990 and is today recognized as a leader in the field.

Valid synthesis requires accounting for all relevant evidence—not just the subset that attained statistical significance. Thus, researchers should report more, not less, providing estimates and uncertainty statements for all quantities, justifying any exceptions, and considering ways the results are wrong. Publication criteria should be based on evaluating study design, data quality, and scientific content—not statistical significance.

Decisions are seldom necessary in scientific reporting. However, when they are required (as in clinical practice), they should be made based on the costs, benefits, and likelihoods of all possible outcomes, not via arbitrary cutoffs applied to statistical summaries such as p-values which capture little of this picture.

The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.

I especially like our title and our last paragraph!

Let me also emphasize that we have a lot of positive advice on how researchers can design studies and collect and analyze data (see for example here, here, and here). “Abandon statistical significance” is not the main thing we have to say. We’re writing about statistical significance to do our best to clear up some points of confusion, but our ultimate message in most of our writing and practice is to offer positive alternatives.

P.S. Also to clarify: “Abandon statistical significance” does not mean “Abandon statistical methods.” I do think it’s generally a good idea to produce estimates accompanied by uncertainty statements. There’s lots and lots to be done.
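The fourth point in the letter above, that the p-value is itself extremely noisy, is easy to see in simulation. The sketch below is an illustration under assumptions I've chosen, not anything from the letter: the same study is replicated 1000 times, the estimate is normally distributed with a true effect equal to 2 standard errors, and the p-value is the usual two-sided one from a z-test. The function names are made up.

```python
import math
import random

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def simulate_p_values(true_effect_in_se: float, n_reps: int, seed: int = 1):
    """p-values from n_reps replications of one study with a fixed true effect."""
    rng = random.Random(seed)
    return [
        two_sided_p(rng.gauss(true_effect_in_se, 1.0)) for _ in range(n_reps)
    ]

p_values = sorted(simulate_p_values(true_effect_in_se=2.0, n_reps=1000))
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

# 5th percentile, median, and 95th percentile of the simulated p-values
print(p_values[50], p_values[500], p_values[950])
print(share_significant)
```

With an identical design and an identical true effect, the p-values range from tiny to large, and roughly half of the replications cross the 0.05 line. Which side a given study lands on is mostly luck.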

The network of models and Bayesian workflow, related to generative grammar for statistical models

Ben Holmes writes:

I’m a machine learning guy working in fraud prevention, and a member of some biostatistics and clinical statistics research groups at Wright State University in Dayton, Ohio.

I just heard your talk “Theoretical Statistics is the Theory of Applied Statistics” on YouTube, and was extremely interested in the idea of a model-space for exploring and choosing from possibilities in ‘model space’.

I was wondering if you knew of work on any R (or Python, or whatever, I’m not picky!) packages that was being done on this, or could recommend a place to start reading more about the theory/concept.

My reply:

I love this idea of the network of models but I’ve never written anything formal on it, nor do I have any software implementations. Here’s a talk on the topic from 2011, and here’s a post from 2017 with some comments from others too.

I still think this is an important topic—it relates to the idea of a generative grammar for building statistical models, and it should fit in well with Stan. So I’m posting this in the hope that someone will follow up and do it in some way.

Parliamentary Constituency Factsheet for Indicators of Nutrition, Health and Development in India

S. V. Subramanian writes:

In India, data on key developmental indicators that formulate policies and interventions are routinely available for the administrative units of districts but not for the political units of Parliamentary Constituencies (PC). Members of Parliament (MPs) in the Lok Sabha, each representing one of the 543 PCs as per the 2014 India map, are the representatives with the most direct interaction with their constituents. The MPs are responsible for articulating the vision and the implementation of public policies at the national level and for their respective constituencies. In order for MPs to efficiently and effectively serve their people, and also for the constituents to understand the performance of their MPs, it is critical to produce the most accurate and up-to-date evidence on the state of health and well-being at the PC-level. However, the absence of PC identifiers in nationally representative surveys or the Census has precluded an assessment of how a PC is doing with regard to key indicators of nutrition, health and development.

On this website, we report PC estimates for indicators of nutrition, health and development derived from two data sources:

The National Family Health Survey 4 (NFHS-4) District Factsheets
The National Sample Survey (NSS), 2010-11, 2011-12, 2014 (Author calculations) . . .

The PC estimates for each of the indicators are classified into quintiles for map visualizations. Currently, we provide map-based visualizations for a subset of indicators, and these will be continually updated for additional indicators. . . .

In addition to providing a visualization of indicators at the PC level, we also provide tables of the PC estimates. . . .

Further details are at the link.

I’ve not looked at this all myself, but I thought it could be of interest to some of you.

State-space models in Stan

Michael Ziedalski writes:

For the past few months I have been delving into Bayesian statistics and have (without hyperbole) finally found statistics intuitive and exciting. Recently I have gone into Bayesian time series methods; however, I have found no libraries to use that can implement those models.

Happily, I found Stan because it seemed among the most mature and flexible Bayesian libraries around, but is there any guide/book you could recommend me for approaching state space models through Stan? I am referring to more complex models, such as those found in State-Space Models, by Zeng and Wu, as well as Bayesian Analysis of Stochastic Process Models, by Insua et al. Most advanced books seem to use WinBUGS, but that library is closed-source and a bit older.

I replied that he should post his question on the Stan mailing list and also look at the example models and case studies for Stan.

I also passed the question on to Jim Savage, who added:

Stan’s great for time series, though mostly because it just allows you to flexibly write down whatever likelihood you want and put very flexible priors on everything, then fits it swiftly with a modern sampler and lets you do diagnoses that are difficult/impossible elsewhere!

Jeff Arnold has a fairly complete set of implementations for state-space models in Stan here. I’ve also got some more introductory blog posts that might help you get your head around writing out some time-series models in Stan. Here’s one on hierarchical VAR models. Here’s another on Hamilton-style regime-switching models. I’ve got a half-written tutorial on state-space models that I’ll come back to when I’m writing the time-series chapter in our Bayesian econometrics in Stan book.

One of the really nice things about Stan is that you can write out your state as parameters. Because Stan can efficiently sample from parameter spaces with hundreds of thousands of dimensions (if a bit slowly), this is fine. It’ll just be slower than a standard Kalman filter. It also changes the interpretation of the state estimate somewhat (more akin to a Kalman smoother, given you use all observations to fit the state).

Here’s an example of such a model.

Actually that last model had some problems with the between-state correlations, but I guess it’s still a good example of how to put something together in Markdown.
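For readers who haven't seen the "standard Kalman filter" that Savage compares against, here is a minimal sketch in Python for the simplest state-space model, a local-level (random walk plus noise) model. This is the textbook filter, not the model from any of the links above, and it assumes the observation and state variances are known; in Stan you would instead declare the states and variances as parameters and let the sampler handle them. The function name and the toy data are mine.

```python
def kalman_filter_local_level(ys, obs_var, state_var,
                              init_mean=0.0, init_var=1e6):
    """Filtered state means and variances for the local-level model:

        y_t = x_t + e_t,      e_t ~ N(0, obs_var)
        x_t = x_{t-1} + w_t,  w_t ~ N(0, state_var)

    A large init_var stands in for a vague prior on the initial state.
    """
    means, variances = [], []
    m, p = init_mean, init_var
    for y in ys:
        # Predict: random-walk state, so the mean carries over
        # and the variance grows by state_var.
        p_pred = p + state_var
        # Update: weight the new observation by the Kalman gain.
        gain = p_pred / (p_pred + obs_var)
        m = m + gain * (y - m)
        p = (1.0 - gain) * p_pred
        means.append(m)
        variances.append(p)
    return means, variances

# Toy data, for illustration only
ys = [1.0, 2.0, 3.0]
means, variances = kalman_filter_local_level(ys, obs_var=1.0, state_var=0.1)
print(means)
print(variances)
```

Each filtered estimate uses only the observations up to that time point. That is the difference Savage points to: when the states are parameters in Stan, every state estimate is informed by all the data, as in a Kalman smoother.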

All statistical conclusions require assumptions.

Mark Palko points us to this 2009 article by Itzhak Gilboa, Andrew Postlewaite, and David Schmeidler, which begins:

This note argues that, under some circumstances, it is more rational not to behave in accordance with a Bayesian prior than to do so. The starting point is that in the absence of information, choosing a prior is arbitrary. If the prior is to have meaningful implications, it is more rational to admit that one does not have sufficient information to generate a prior than to pretend that one does. This suggests a view of rationality that requires a compromise between internal coherence and justification, similarly to compromises that appear in moral dilemmas. Finally, it is argued that Savage’s axioms are more compelling when applied to a naturally given state space than to an analytically constructed one; in the latter case, it may be more rational to violate the axioms than to be Bayesian.

The paper expresses various misconceptions, for example the statement that the Bayesian approach requires a “subjective belief.” All statistical conclusions require assumptions, and a Bayesian prior distribution can be as subjective or un-subjective as any other assumption in the model. For example, I don’t recall seeing textbooks on statistical methods referring to the subjective belief underlying logistic regression or the Poisson distribution; I guess if you assume a model but you don’t use the word “Bayes,” then assumptions are just assumptions.

More generally, it seems obvious to me that no statistical method will work best under all circumstances, hence I have no disagreement whatsoever with the opening sentence quoted above. I can’t quite see why they need 12 pages to make this argument, but whatever.

P.S. Also relevant is this discussion from a few years ago: The fallacy of the excluded middle—statistical philosophy edition.

Works of art that are about themselves

I watched Citizen Kane (for the umpteenth time) the other day and was again struck by how it is a movie about itself. Kane is William Randolph Hearst, but he’s also Orson Welles, boy wonder, and the movie Citizen Kane is self-consciously a masterpiece.

Some other examples of movies that are about themselves are La La Land, Primer (a low-budget experiment about a low-budget experiment), and Titanic (the biggest movie ever made, about the biggest boat ever made).

I want to call this, Objects of the Class X, but I’m not sure what X is.

Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

A few months ago I sent the following message to some people:

Dear philosophically-inclined colleagues:

I’d like to organize an online discussion of Deborah Mayo’s new book.

The table of contents and some of the book are here at Google books, also in the attached pdf and in this post by Mayo.

I think that many, if not all, of Mayo’s points in her Excursion 4 are answered by my article with Hennig here.

What I was thinking for this discussion is that if you’re interested you can write something, either a review of Mayo’s book (if you happen to have a copy of it) or a review of the posted material, or just your general thoughts on the topic of statistical inference as severe testing.

I’m hoping to get this all done this month, because it’s all informal and what’s the point of dragging it out, right? So if you’d be interested in writing something on this that you’d be willing to share with the world, please let me know. It should be fun, I hope!

I did this in consultation with Deborah Mayo, and I just sent this email to a few people (so if you were not included, please don’t feel left out! You have a chance to participate right now!), because our goal here was to get the discussion going. The idea was to get some reviews, and this could spark a longer discussion here in the comments section.

And, indeed, we received several responses. And I’ll also point you to my paper with Shalizi on the philosophy of Bayesian statistics, with discussions by Mark Andrews and Thom Baguley, Denny Borsboom and Brian Haig, John Kruschke, Deborah Mayo, Stephen Senn, and Richard D. Morey, Jan-Willem Romeijn and Jeffrey N. Rouder.

Also relevant is this summary by Mayo of some examples from her book.

And now on to the reviews.

Brian Haig

I’ll start with psychology researcher Brian Haig, because he’s a strong supporter of Mayo’s message and his review also serves as an introduction and summary of her ideas. The review itself is a few pages long, so I will quote from it, interspersing some of my own reaction:

Deborah Mayo’s ground-breaking book, Error and the growth of statistical knowledge (1996) . . . presented the first extensive formulation of her error-statistical perspective on statistical inference. Its novelty lay in the fact that it employed ideas in statistical science to shed light on philosophical problems to do with evidence and inference.

By contrast, Mayo’s just-published book, Statistical inference as severe testing (SIST) (2018), focuses on problems arising from statistical practice (“the statistics wars”), but endeavors to solve them by probing their foundations from the vantage points of philosophy of science, and philosophy of statistics. The “statistics wars” to which Mayo refers concern fundamental debates about the nature and foundations of statistical inference. These wars are longstanding and recurring. Today, they fuel the ongoing concern many sciences have with replication failures, questionable research practices, and the demand for an improvement of research integrity. . . .

For decades, numerous calls have been made for replacing tests of statistical significance with alternative statistical methods. The new statistics, a package deal comprising effect sizes, confidence intervals, and meta-analysis, is one reform movement that has been heavily promoted in psychological circles (Cumming, 2012; 2014) as a much needed successor to null hypothesis significance testing (NHST) . . .

The new statisticians recommend replacing NHST with their favored statistical methods by asserting that it has several major flaws. Prominent among them are the familiar claims that NHST encourages dichotomous thinking, and that it comprises an indefensible amalgam of the Fisherian and Neyman-Pearson schools of thought. However, neither of these features applies to the error-statistical understanding of NHST. . . .

There is a double irony in the fact that the new statisticians criticize NHST for encouraging simplistic dichotomous thinking: As already noted, such thinking is straightforwardly avoided by employing tests of statistical significance properly, whether or not one adopts the error-statistical perspective. For another, the adoption of standard frequentist confidence intervals in place of NHST forces the new statisticians to engage in dichotomous thinking of another kind: A parameter estimate is either inside, or outside, its confidence interval.

At this point I’d like to interrupt and say that a confidence interval (or simply an estimate with standard error) can be used to give a sense of inferential uncertainty. There is no reason for dichotomous thinking when confidence intervals, or uncertainty intervals, or standard errors, are used in practice.

Here’s a very simple example from my book with Jennifer:

This graph has a bunch of estimates +/- standard errors, that is, 68% confidence intervals, with no dichotomous thinking in sight. In contrast, testing some hypothesis of no change over time, or no change during some period of time, would make no substantive sense and would just be an invitation to add noise to our interpretation of these data.
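For readers who want to check the "68%" figure, here is a minimal sketch, assuming the estimates are approximately normally distributed. The function name `normal_coverage` is made up for this illustration and is not from the book.

```python
from math import erf, sqrt

def normal_coverage(k: float) -> float:
    """Probability that a normal estimate lands within k standard errors of its mean."""
    return erf(k / sqrt(2.0))

# Coverage of estimate +/- k standard errors for a few common choices of k.
for k in (1.0, 1.96, 2.0):
    print(f"estimate +/- {k} s.e. covers about {normal_coverage(k):.1%}")
```

Under that normal approximation, plus or minus one standard error gives roughly 68% coverage and plus or minus two gives roughly 95%, so the choice of interval width is a matter of convention, not a decision rule.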

OK, to continue with Haig’s review:

Error-statisticians have good reason for claiming that their reinterpretation of frequentist confidence intervals is superior to the standard view. The standard account of confidence intervals adopted by the new statisticians prespecifies a single confidence interval (a strong preference for 0.95 in their case). . . . By contrast, the error-statistician draws inferences about each of the obtained values according to whether they are warranted, or not, at different severity levels, thus leading to a series of confidence intervals. Crucially, the different values will not have the same probative force. . . . Details on the error-statistical conception of confidence intervals can be found in SIST (pp. 189-201), as well as Mayo and Spanos (2011) and Spanos (2014). . . .

SIST makes clear that, with its error-statistical perspective, statistical inference can be employed to deal with both estimation and hypothesis testing problems. It also endorses the view that providing explanations of things is an important part of science.
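To make the idea of "different severity levels" concrete, here is a minimal sketch. It assumes a one-sided test about a normal mean with known standard error, and it assumes the definition SEV(mu > mu1) = P(Xbar <= xbar; mu = mu1), which is how severity is usually written for this textbook case. The function name and the numbers are hypothetical and are not taken from SIST or from Haig's review.

```python
from statistics import NormalDist

def severity_greater(xbar: float, mu1: float, se: float) -> float:
    """Severity for the claim mu > mu1, given observed mean xbar with standard error se.

    Assumed definition (one-sided normal test, known se):
    SEV(mu > mu1) = P(Xbar <= xbar; mu = mu1) = Phi((xbar - mu1) / se).
    """
    return NormalDist().cdf((xbar - mu1) / se)

xbar, se = 0.4, 0.2  # hypothetical observed mean and standard error
for mu1 in (0.0, 0.2, 0.4, 0.6):
    print(f"SEV(mu > {mu1:.1f}) = {severity_greater(xbar, mu1, se):.3f}")
```

In this setup, the value of mu1 at which severity equals 0.95 is the usual one-sided 95% lower confidence bound, so scanning over mu1 amounts to reporting lower bounds at many confidence levels at once.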

Another interruption from me . . . I just want to plug my paper with Guido Imbens, Why ask why? Forward causal inference and reverse causal questions, in which we argue that Why questions can be interpreted as model checks, or, one might say, hypothesis tests—but tests of hypotheses of interest, not of straw-man null hypotheses. Perhaps there’s some connection between Mayo’s ideas and those of Guido and me on this point.

Haig continues with a discussion of Bayesian methods, including those of my collaborators and myself:

One particularly important modern variant of Bayesian thinking, which receives attention in SIST, is the falsificationist Bayesianism of . . . Gelman and Shalizi (2013). Interestingly, Gelman regards his Bayesian philosophy as essentially error-statistical in nature – an intriguing claim, given the anti-Bayesian preferences of both Mayo and Gelman’s co-author, Cosma Shalizi. . . . Gelman acknowledges that his falsificationist Bayesian philosophy is underdeveloped, so it will be interesting to see how its further development relates to Mayo’s error-statistical perspective. It will also be interesting to see if Bayesian thinkers in psychology engage with Gelman’s brand of Bayesian thinking. Despite the appearance of his work in a prominent psychology journal, they have yet to do so. . . .

Hey, not quite! I’ve done a lot of collaboration with psychologists; see here and search on “Iven Van Mechelen” and “Francis Tuerlinckx”—but, sure, I recognize that our Bayesian methods, while mainstream in various fields including ecology and political science, are not yet widely used in psychology.

Haig concludes:

From a sympathetic, but critical, reading of Popper, Mayo endorses his strategy of developing scientific knowledge by identifying and correcting errors through strong tests of scientific claims. . . . A heartening attitude that comes through in SIST is the firm belief that a philosophy of statistics is an important part of statistical thinking. This contrasts markedly with much of statistical theory, and most of statistical practice. Given that statisticians operate with an implicit philosophy, whether they know it or not, it is better that they avail themselves of an explicitly thought-out philosophy that serves practice in useful ways.

I agree, very much.

To paraphrase Bill James, the alternative to good philosophy is not “no philosophy,” it’s “bad philosophy.” I’ve spent too much time seeing Bayesians avoid checking their models out of a philosophical conviction that subjective priors cannot be empirically questioned, and too much time seeing non-Bayesians produce ridiculous estimates that could have been avoided by using available outside information. There’s nothing so practical as good practice, but good philosophy can facilitate both the development and acceptance of better methods.

E. J. Wagenmakers

I’ll follow up with a very short review, or, should I say, reaction-in-place-of-a-review, from psychometrician E. J. Wagenmakers:

I cannot comment on the contents of this book, because doing so would require me to read it, and extensive prior knowledge suggests that I will violently disagree with almost every claim that is being made. In my opinion, the only long-term hope for vague concepts such as the “severity” of a test is to embed them within a rational (i.e., Bayesian) framework, but I suspect that this is not the route that the author wishes to pursue. Perhaps this book is comforting to those who have neither the time nor the desire to learn Bayesian inference, in a similar way that homeopathy provides comfort to patients with a serious medical condition.

You don’t have to agree with E. J. to appreciate his honesty!

Art Owen

Coming from a different perspective is theoretical statistician Art Owen, whose review has some mathematical formulas—nothing too complicated, but not so easy to display in html, so I’ll just link to the pdf and share some excerpts:

There is an emphasis throughout on the importance of severe testing. It has long been known that a test that fails to reject H0 is not very conclusive if it had low power to reject H0. So I wondered whether there was anything more to the severity idea than that. After some searching I found on page 343 a description of how the severity idea differs from the power notion. . . .

I think that it might be useful in explaining a failure to reject H0 as the sample size being too small. . . . it is extremely hard to measure power post hoc because there is too much uncertainty about the effect size. Then, even if you want it, you probably cannot reliably get it. I think severity is likely to be in the same boat. . . .

I believe that the statistical problem from incentives is more severe than choice between Bayesian and frequentist methods or problems with people not learning how to use either kind of method properly. . . . We usually teach and do research assuming a scientific loss function that rewards being right. . . . In practice many people using statistics are advocates. . . . The loss function strongly informs their analysis, be it Bayesian or frequentist. The scientist and advocate both want to minimize their expected loss. They are led to different methods. . . .

I appreciate Owen’s efforts to link Mayo’s words to the equations that we would ultimately need to implement, or evaluate, her ideas in statistics.
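Owen's point about post hoc power can be sketched numerically. The following is a minimal illustration, assuming a two-sided z-test with a normally distributed estimate; the function name `two_sided_power` and the numbers are hypothetical.

```python
from statistics import NormalDist

def two_sided_power(effect: float, se: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test when the true effect is `effect` and the s.e. is `se`."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    shift = effect / se
    # Probability that the z-statistic falls above +crit or below -crit.
    return z.cdf(shift - crit) + z.cdf(-shift - crit)

estimate, se = 0.2, 0.1  # hypothetical estimate, two standard errors from zero
for true_effect in (estimate - 2 * se, estimate, estimate + 2 * se):
    print(f"true effect {true_effect:.1f}: power = {two_sided_power(true_effect, se):.2f}")
```

With these made-up numbers, the effect sizes within two standard errors of the estimate give power anywhere from about 5% to about 98%, which is one way to see why a plug-in power calculation tells you so little.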

Robert Cousins

Physicist Robert Cousins did not have the time to write a comment on Mayo’s book, but he did point us to this monograph he wrote on the foundations of statistics, which has lots of interesting stuff but is unfortunately a bit out of date when it comes to the philosophy of Bayesian statistics, which he ties in with subjective probability. (For a corrective, see my aforementioned article with Hennig.)

In his email to me, Cousins also addressed issues of statistical and practical significance:

Our [particle physicists’] problems and the way we approach them are quite different from some other fields of science, especially social science. As one example, I think I recall reading that you do not mind adding a parameter to your model, whereas adding (certain) parameters to our models means adding a new force of nature (!) and a Nobel Prize if true. As another example, a number of statistics papers talk about how silly it is to claim a 10^{-4} departure from 0.5 for a binomial parameter (ESP examples, etc), using it as a classic example of the difference between nominal (probably mismeasured) statistical significance and practical significance. In contrast, when I was a grad student, a famous experiment in our field measured a 10^{-4} departure from 0.5 with an uncertainty of 10% of itself, i.e., with an uncertainty of 10^{-5}. (Yes, of the order of 10^{10} Bernoulli trials, counting electrons being scattered left or right.) This led quickly to a Nobel Prize for Steven Weinberg et al., whose model (now “Standard”) had predicted the effect.

I replied:

This interests me in part because I am a former physicist myself. I have done work in physics and in statistics, and I think the principles of statistics that I have applied to social science also apply to physical sciences. Regarding the discussion of Bem’s experiment, what I said was not that an effect of 0.0001 is unimportant, but rather that if you were to really believe Bem’s claims, there could be effects of +0.0001 in some settings, -0.002 in others, etc. If this is interesting, fine: I’m not a psychologist. One of the key mistakes of Bem and others like him is to suppose that, if they happen to have discovered an effect in some scenario, this represents some sort of universal truth; there is no reason to suppose so. Humans differ from each other in a way that elementary particles do not.

And Cousins replied:

Indeed in the binomial experiment I mentioned, controlling unknown systematic effects to the level of 10^{-5}, so that what they were measuring (a constant of nature called the Weinberg angle, now called the weak mixing angle) was what they intended to measure, was a heroic effort by the experimentalists.
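The orders of magnitude in Cousins's example can be checked with the usual binomial standard error. This is only a rough sketch: the trial count of 10^10 is the round number from his email, not the actual count from the experiment, and the function name is made up.

```python
from math import sqrt

def proportion_se(p: float, n: int) -> float:
    """Standard error of an estimated proportion from n independent Bernoulli(p) trials."""
    return sqrt(p * (1 - p) / n)

n = 10**10        # round number of trials quoted in the email
departure = 1e-4  # departure from 0.5 quoted in the email
se = proportion_se(0.5, n)
print(f"s.e. of the proportion: {se:.1e}")
print(f"departure in units of s.e.: {departure / se:.0f}")
```

With exactly 10^10 trials the sampling standard error comes out to 5 x 10^-6, the same order of magnitude as the quoted 10^-5 uncertainty. So the sampling noise is small enough, and the hard part, as Cousins says, was controlling systematic effects at that level.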

Stan Young

Stan Young, a statistician who’s worked in the pharmaceutical industry, wrote:

I’ve been reading at the Mayo book and also pestering where I think poor statistical practice is going on. Usually the poor practice is by non-professionals and usually it is not intentionally malicious however self-serving. But I think it naive to think that education is all that is needed. Or some grand agreement among professional statisticians will end the problems.

There are science crooks and statistical crooks and there are no cops, or very few.

That is a long way of saying, this problem is not going to be solved in 30 days, or by one paper, or even by one book or by three books! (I’ve read all three.)

I think a more open-ended and longer dialog would be more useful with at least some attention to willful and intentional misuse of statistics.

Chambers C. The Seven Deadly Sins of Psychology. New Jersey: Princeton University Press, 2017.

Harris R. Rigor mortis: how sloppy science creates worthless cures, crushes hope, and wastes billions. New York: Basic books, 2017.

Hubbard R. Corrupt Research. London: Sage Publications, 2015.

Christian Hennig

Hennig, a statistician and my collaborator on the Beyond Subjective and Objective paper, sent in two reviews of Mayo’s book.

Here are his general comments:

What I like about Deborah Mayo’s “Statistical Inference as Severe Testing”

Before I start to list what I like about “Statistical Inference as Severe Testing”, I should say that I don’t agree with everything in the book. In particular, as a constructivist I am skeptical about the use of terms like “objectivity”, “reality” and “truth” in the book, and I think that Mayo’s own approach may not be able to deliver everything that people may come to believe it could, from reading the book (although Mayo could argue that overly high expectations could be avoided by reading carefully).

So now, what do I like about it?

1) I agree with the broad concept of severity and severe testing. In order to have evidence for a claim, it has to be tested in ways that would reject the claim with high probability if it indeed were false. I also think that it makes a lot of sense to start a philosophy of statistics and a critical discussion of statistical methods and reasoning from this requirement. Furthermore, throughout the book Mayo consistently argues from this position, which makes the different “Excursions” fit well together and add up to a consistent whole.

2) I get a lot out of the discussion of the philosophical background of scientific inquiry, of induction, probabilism, falsification and corroboration, and their connection to statistical inference. I think that it makes sense to connect Popper’s philosophy to significance tests in the way Mayo does (without necessarily claiming that this is the only possible way to do it), and I think that her arguments are broadly convincing at least if I take a realist perspective of science (which as a constructivist I can do temporarily while keeping the general reservation that this is about a specific construction of reality which I wouldn’t grant absolute authority).

3) I think that Mayo does by and large a good job listing much of the criticism that has been raised in the literature against significance testing, and she deals with it well. Partly she criticises bad uses of significance testing herself by referring to the severity requirement, but she also defends a well understood use in a more general philosophical framework of testing scientific theories and claims in a piecemeal manner. I find this largely convincing, conceding that there is a lot of detail and that I may find myself in agreement with the occasional objection against the odd one of her arguments.

4) The same holds for her comprehensive discussion of Bayesian/probabilist foundations in Excursion 6. I think that she elaborates issues and inconsistencies in the current use of Bayesian reasoning very well, maybe with the odd exception.

5) I am in full agreement with Mayo’s position that when using probability modelling, it is important to be clear about the meaning of the computed probabilities. Agreement in numbers between different “camps” isn’t worth anything if the numbers mean different things. A problem with some positions that are sold as “pragmatic” these days is that often not enough care is put into interpreting what the results mean, or even deciding in advance what kind of interpretation is desired.

6) As mentioned above, I’m rather skeptical about the concept of objectivity and about an all too realist interpretation of statistical models. I think that in Excursion 4 Mayo manages to explain in a clear manner what her claims of “objectivity” actually mean, and she also appreciates more clearly than before the limits of formal models and their distance to “reality”, including some valuable thoughts on what this means for model checking and arguments from models.

So overall it was a very good experience to read her book, and I think that it is a very valuable addition to the literature on foundations of statistics.

Hennig also sent some specific discussion of one part of the book:

1 Introduction

This text discusses parts of Excursion 4 of Mayo (2018) titled “Objectivity and Auditing”. This starts with the section title “The myth of ‘The myth of objectivity’”. Mayo advertises objectivity in science as central and as achievable.

In contrast, in Gelman and Hennig (2017) we write: “We argue that the words ‘objective’ and ‘subjective’ in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes.” I will here outline agreement and disagreement that I have with Mayo’s Excursion 4, and raise some issues that I think require more research and discussion.

2 Pushback and objectivity

The second paragraph of Excursion 4 states in bold letters: “The Key Is Getting Pushback”, and this is the major source of agreement between Mayo’s and my views (*). I call myself a constructivist, and this is about acknowledging the impact of human perception, action, and communication on our world-views, see Hennig (2010). However, it is an almost universal experience that we cannot construct our perceived reality as we wish, because we experience “pushback” from what we perceive as “the world outside”. Science is about allowing us to deal with this pushback in stable ways that are open to consensus. A major ingredient of such science is the “Correspondence (of scientific claims) to observable reality”, and in particular “Clear conditions for reproduction, testing and falsification”, listed as “Virtue 4/4(b)” in Gelman and Hennig (2017). Consequently, there is no disagreement with much of the views and arguments in Excursion 4 (and the rest of the book). I actually believe that there is no contradiction between constructivism understood in this way and Chang’s (2012) “active scientific realism” that asks for action in order to find out about “resistance from reality”, or in other words, experimenting, experiencing and learning from error.

If what is called “objectivity” in Mayo’s book were the generally agreed meaning of the term, I would probably not have a problem with it. However, there is a plethora of meanings of “objectivity” around, and on top of that the term is often used as a sales pitch by scientists in order to lend authority to findings or methods and often even to prevent them from being questioned. Philosophers understand that this is a problem but are mostly eager to claim the term anyway; I have attended conferences on philosophy of science and heard a good number of talks, some better, some worse, with messages of the kind “objectivity as understood by XYZ doesn’t work, but here is my own interpretation that fixes it”. Calling frequentist probabilities “objective” because they refer to the outside world rather than epistemic states, and calling a Bayesian approach “objective” because priors are chosen by general principles rather than personal beliefs are in isolation also legitimate meanings of “objectivity”, but these two and Mayo’s and many others (see also the Appendix of Gelman and Hennig, 2017) differ. The use of “objectivity” in public and scientific discourse is a big muddle, and I don’t think this will change as a consequence of Mayo’s work. I prefer stating what we want to achieve more precisely using less loaded terms, which I think Mayo has achieved well not by calling her approach “objective” but rather by explaining in detail what she means by that.

3 Trust in models?

In the remainder, I will highlight some limitations of Mayo’s “objectivity” that are mainly connected to Tour IV on objectivity, model checking and whether it makes sense to say that “all models are false”. Error control is central for Mayo’s objectivity, and this relies on error probabilities derived from probability models. If we want to rely on these error probabilities, we need to trust the models, and, very appropriately, Mayo devotes Tour IV to this issue. She concedes that all models are false, but states that this is rather trivial, and what is really relevant when we use statistical models for learning from data is rather whether the models are adequate for the problem we want to solve. Furthermore, model assumptions can be tested and it is crucial to do so, which, as follows from what was stated before, does not mean to test whether they are really true but rather whether they are violated in ways that would destroy the adequacy of the model for the problem. So far I can agree. However, I see some difficulties that are not addressed in the book, and mostly not elsewhere either. Here is a list.

3.1. Adaptation of model checking to the problem of interest

As all models are false, it is not too difficult to find model assumptions that are violated but don’t matter, or at least don’t matter in most situations. The standard example would be the use of continuous distributions to approximate distributions of essentially discrete measurements. What does it mean to say that a violation of a model assumption doesn’t matter? This is not so easy to specify, and not much about this can be found in Mayo’s book or in the general literature. Surely it has to depend on what exactly the problem of interest is. A simple example would be to say that we are interested in statements about the mean of a discrete distribution, and then to show that estimation or tests of the mean are very little affected if a certain continuous approximation is used. This is reassuring, and certain other issues could be dealt with in this way, but one can ask harder questions. If we approximate a slightly skew distribution by a (unimodal) symmetric one, are we really interested in the mean, the median, or the mode, which for a symmetric distribution would be the same but for the skew distribution to be approximated would differ? Any frequentist distribution is an idealisation, so do we first need to show that it is fine to approximate a discrete non-distribution by a discrete distribution before worrying whether the discrete distribution can be approximated by a continuous one? (And how could we show that?) And so on.

3.2. Severity of model misspecification tests

Following the logic of Mayo (2018), misspecification tests need to be severe in order to fulfill their purpose; otherwise data could pass a misspecification test that would be of little help ruling out problematic model deviations. I’m not sure whether there are any results of this kind, be it in Mayo’s work or elsewhere. I imagine that if the alternative is parametric (for example testing independence against a standard time series model) severity can occasionally be computed easily, but for most model misspecification tests it will be a hard problem.

3.3. Identifiability issues, and ruling out models by other means than testing

Not all statistical models can be distinguished by data. For example, even with arbitrarily large amounts of data only lower bounds of the number of modes can be estimated; an assumption of unimodality can strictly not be tested (Donoho 1988). Worse, only regular but not general patterns of dependence can be distinguished from independence by data; any non-i.i.d. pattern can be explained by either dependence or non-identity of distributions, and telling these apart requires constraints on dependence and non-identity structures that themselves cannot be tested on the data (in the example given in 4.11 of Mayo, 2018, all tests discover specific regular alternatives to the model assumption). Given that this is so, the question arises on which grounds we can rule out irregular patterns (about the simplest and most silly one is “observations depend in such a way that every observation determines the next one to be exactly what it was observed to be”) by other means than data inspection and testing. Such models are probably useless; however, if they were true, they would destroy any attempt to find “true” or even approximately true error probabilities.

3.4. Robustness against what cannot be ruled out

The above implies that certain deviations from the model assumptions cannot be ruled out, and then one can ask: How robust is the substantial conclusion that is drawn from the data against models different from the nominal one, which could not be ruled out by misspecification testing, and how robust are error probabilities? The approaches of standard robust statistics probably have something to contribute in this respect (e.g., Hampel et al., 1986), although their starting point is usually different from “what is left after misspecification testing”. This will depend, as everything, on the formulation of the “problem of interest”, which needs to be defined not only in terms of the nominal parametric model but also in terms of the other models that could not be ruled out.

3.5. The effect of preliminary model checking on model-based inference

Mayo is correctly concerned about biasing effects of model selection on inference. Deciding what model to use based on misspecification tests is some kind of model selection, so it may bias inference that is made in case of passing misspecification tests. One way of stating the problem is to realise that in most cases the assumed model, conditionally on having passed a misspecification test, no longer holds. I have called this the “goodness-of-fit paradox” (Hennig, 2007); the issue has been mentioned elsewhere in the literature. Mayo has argued that this is not a problem, and this is in a well defined sense true (meaning that error probabilities derived from the nominal model are not affected by conditioning on passing a misspecification test) if misspecification tests are indeed “independent of (or orthogonal to) the primary question at hand” (Mayo 2018, p. 319). The problem is that for the vast majority of misspecification tests independence/orthogonality does not hold, at least not precisely. So the actual effect of misspecification testing on model-based inference is a matter that needs to be investigated on a case-by-case basis. Some work of this kind has been done or is currently being done; results are not always positive (an early example is Easterling and Anderson 1978).
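The statement that the model no longer holds conditionally on passing a misspecification test can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions, not taken from Hennig (2007) or from Mayo's book: the nominal model is i.i.d. standard normal, the "misspecification test" passes a sample when its absolute sample skewness is at most a hypothetical cutoff of 0.9, and all names are invented for the illustration.

```python
import random
from statistics import fmean

def sample_skewness(xs):
    """Moment-based sample skewness m3 / m2^(3/2)."""
    m = fmean(xs)
    m2 = fmean((x - m) ** 2 for x in xs)
    m3 = fmean((x - m) ** 3 for x in xs)
    return m3 / m2 ** 1.5

def simulate(n_sims=2000, n=20, cutoff=0.9, seed=1):
    """Draw samples from the nominal model; keep the skewness of those that pass the check."""
    rng = random.Random(seed)
    skews = [sample_skewness([rng.gauss(0.0, 1.0) for _ in range(n)])
             for _ in range(n_sims)]
    passed = [s for s in skews if abs(s) <= cutoff]
    return skews, passed

skews, passed = simulate()
print(f"share of nominal-model samples failing the check: {1 - len(passed) / len(skews):.3f}")
print(f"largest |skewness| among samples that passed: {max(abs(s) for s in passed):.3f}")
```

Under the nominal model a sample skewness beyond the cutoff has positive probability, but among the samples that passed it has probability zero, so the passed samples are not distributed according to the nominal model. Whether this matters for the subsequent inference depends on how the checked statistic relates to the statistic used for that inference, which is the case-by-case question raised above.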

4 Conclusion

The issues listed in Section 3 are in my view important and worthy of investigation. Such investigation has already been done to some extent, but there are many open problems. I believe that some of these can be solved, some are very hard, and some are impossible to solve or may lead to negative results (particularly connected to lack of identifiability). However, I don’t think that these issues invalidate Mayo’s approach and arguments; I expect at least the issues that cannot be solved to affect in one way or another any alternative approach. My case is just that methodology that is “objective” according to Mayo comes with limitations that may be incompatible with some other people’s ideas of what “objectivity” should mean (in which sense it is in good company though), and that the falsity of models has some more cumbersome implications than Mayo’s book could make the reader believe.

(*) There is surely a strong connection between what I call “my” view here with the collaborative position in Gelman and Hennig (2017), but as I write the present text on my own, I will refer to “my” position here and let Andrew Gelman speak for himself.

Chang, H. (2012) Is Water H2O? Evidence, Realism and Pluralism. Dordrecht: Springer.

Donoho, D. (1988) One-Sided Inference about Functionals of a Density. Annals of Statistics 16, 1390-1420.

Easterling, R. G. and Anderson, H.E. (1978) The effect of preliminary normality goodness of fit tests on subsequent inference. Journal of Statistical Computation and Simulation 8, 1-11.

Gelman, A. and Hennig, C. (2017) Beyond subjective and objective in statistics (with discussion). Journal of the Royal Statistical Society, Series A 180, 967–1033.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust statistics. New York: Wiley.

Hennig, C. (2010) Mathematical models and reality: a constructivist perspective. Foundations of Science 15, 29–48.

Hennig, C. (2007) Falsification of propensity models by statistical tests and the goodness-of-fit paradox. Philosophia Mathematica 15, 166-192.

Mayo, D. G. (2018) Statistical Inference as Severe Testing. Cambridge University Press.

My own reactions

I’m still struggling with the key ideas of Mayo’s book. (Struggling is a good thing here, I think!)

First off, I appreciate that Mayo takes my own philosophical perspective seriously—I’m actually thrilled to be taken seriously, after years of dealing with a professional Bayesian establishment tied to naive (as I see it) philosophies of subjective or objective probabilities, and anti-Bayesians not willing to think seriously about these issues at all—and I don’t think any of these philosophical issues are going to be resolved any time soon. I say this because I’m so aware of the big Cantor-size hole in the corner of my own philosophy of statistical learning.

In statistics—maybe in science more generally—philosophical paradoxes are sometimes resolved by technological advances. Back when I was a student I remember all sorts of agonizing over the philosophical implications of exchangeability, but now that we can routinely fit varying-intercept, varying-slope models with nested and non-nested levels and (we’ve finally realized the importance of) informative priors on hierarchical variance parameters, a lot of the philosophical problems have dissolved; they’ve become surmountable technical problems. (For example: should we consider a group of schools, or states, or hospitals, as “truly exchangeable”? If not, there’s information distinguishing them, and we can include such information as group-level predictors in our multilevel model. Problem solved.)

Rapid technological progress resolves many problems in ways that were never anticipated. (Progress creates new problems too; that’s another story.) I’m not such an expert on deep learning and related methods for inference and prediction—but, again, I think these will change our perspective on statistical philosophy in various ways.

This is all to say that any philosophical perspective is time-bound. On the other hand, I don’t think that Popper/Kuhn/Lakatos will ever be forgotten: this particular trinity of twentieth-century philosophy of science has forever left us in a different place than where we were, a hundred years ago.

To return to Mayo’s larger message: I agree with Hennig that Mayo is correct to place evaluation at the center of statistics.

I’ve thought a lot about this, in many years of teaching statistics to graduate students. In a class for first-year statistics Ph.D. students, you want to get down to the fundamentals.

What’s the most fundamental thing in statistics? Experimental design? No. You can’t really pick your design until you have some sense of how you will analyze the data. (This is the principle of the great Raymond Smullyan: To understand the past, we must first know the future.) So is data analysis the most fundamental thing? Maybe so, but what method of data analysis? Last I heard, there are many schools. Bayesian data analysis, perhaps? Not so clear; what’s the motivation for modeling everything probabilistically? Sure, it’s coherent—but so is some mental patient who thinks he’s Napoleon and acts daily according to that belief. We can back into a more fundamental, or statistical, justification of Bayesian inference and hierarchical modeling by first considering the principle of external validation of predictions, then showing (both empirically and theoretically) that a hierarchical Bayesian approach performs well based on this criterion—and then following up with the Jaynesian point that, when Bayesian inference fails to perform well, this recognition represents additional information that can and should be added to the model. All of this is the theme of the example in section 7 of BDA3—although I have the horrible feeling that students often don’t get the point, as it’s easy to get lost in all the technical details of the inference for the hyperparameters in the model.

Anyway, to continue . . . it still seems to me that the most foundational principles of statistics are frequentist. Not unbiasedness, not p-values, and not type 1 or type 2 errors, but frequency properties nevertheless. Statements about how well your procedure will perform in the future, conditional on some assumptions of stationarity and exchangeability (analogous to the assumption in physics that the laws of nature will be the same in the future as they’ve been in the past—or, if the laws of nature are changing, that they’re not changing very fast! We’re in Cantor’s corner again).
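To make “frequency properties” concrete, here is a minimal sketch of a frequency evaluation: simulate many datasets under stated assumptions and count how often a procedure does what it claims. The function name `coverage`, the normal model with known standard deviation, and all the default numbers are assumptions chosen for illustration, not anything taken from Mayo’s book.

```python
import random
from statistics import NormalDist, fmean

def coverage(true_mean=0.3, sigma=1.0, n=20, reps=2000, level=0.95, seed=0):
    """Fraction of simulated datasets whose interval contains true_mean.

    Assumes independent normal data with known sigma (a hypothetical setup).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for level=0.95
    half_width = z * sigma / n ** 0.5           # interval is mean +/- half_width
    hits = 0
    for _ in range(reps):
        sample_mean = fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / reps

print(coverage())  # should land close to the nominal 0.95
```

The statement being checked is about the procedure over repeated use, conditional on the simulation assumptions; it says nothing about any single interval once you have seen it.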

So, I want to separate the principle of frequency evaluation—the idea that frequency evaluation and criticism represents one of the three foundational principles of statistics (with the other two being mathematical modeling and the understanding of variation)—from specific statistical methods, whether they be methods that I like (Bayesian inference, estimates and standard errors, Fourier analysis, lasso, deep learning, etc.) or methods that I suspect have done more harm than good or, at the very least, have been taken too far (hypothesis tests, p-values, so-called exact tests, so-called inverse probability weighting, etc.). We can be frequentists, use mathematical models to solve problems in statistical design and data analysis, and engage in model criticism, without making decisions based on type 1 error probabilities etc.

To say it another way, bringing in the title of the book under discussion: I would not quite say that statistical inference is severe testing, but I do think that severe testing is a crucial part of statistics. I see statistics as an unstable mixture of inference conditional on a model (“normal science”) and model checking (“scientific revolution”). Severe testing is fundamental, in that the prospect of revolution is a key contributor to the success of normal science. We lean on our models in large part because they have been, and will continue to be, put to the test. And we choose our statistical methods in large part because, under certain assumptions, they have good frequency properties.

And now on to Mayo’s subtitle. I don’t think her, or my, philosophical perspective will get us “beyond the statistics wars” by itself—but perhaps it will ultimately move us in this direction, if practitioners and theorists alike can move beyond naive confirmationist reasoning toward an embrace of variation and acceptance of uncertainty.

I’ll summarize by expressing agreement with Mayo’s perspective that frequency evaluation is fundamental, while disagreeing with her focus on various crude (from my perspective) ideas such as type 1 errors and p-values. When it comes to statistical philosophy, I’d rather follow Laplace, Jaynes, and Box than Neyman, Wald, and Savage. Phony Bayesmania has bitten the dust.


Let me again thank Haig, Wagenmakers, Owen, Cousins, Young, and Hennig for their discussions. I expect that Mayo will respond to these, and also to any comments that follow in this thread, once she has time to digest it all.

P.S. And here’s a review from Christian Robert.

Active learning and decision making with varying treatment effects!

In a new paper, Iiris Sundin, Peter Schulam, Eero Siivola, Aki Vehtari, Suchi Saria, and Samuel Kaski write:

Machine learning can help personalized decision support by learning models to predict individual treatment effects (ITE). This work studies the reliability of prediction-based decision-making in a task of deciding which action a to take for a target unit after observing its covariates x̃ and predicted outcomes p̂(ỹ∣x̃,a). An example case is personalized medicine and the decision of which treatment to give to a patient. A common problem when learning these models from observational data is imbalance, that is, difference in treated/control covariate distributions, which is known to increase the upper bound of the expected ITE estimation error. We propose to assess the decision-making reliability by estimating the ITE model’s Type S error rate, which is the probability of the model inferring the sign of the treatment effect wrong. Furthermore, we use the estimated reliability as a criterion for active learning, in order to collect new (possibly expensive) observations, instead of making a forced choice based on unreliable predictions. We demonstrate the effectiveness of this decision-making aware active learning in two decision-making tasks: in simulated data with binary outcomes and in a medical dataset with synthetic and continuous treatment outcomes.

Decision making, varying treatment effects, type S errors, Stan, validation. . . this paper has all of my favorite things!
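For readers new to the term, here is a minimal sketch of a Type S error rate under one common definition: the probability that an estimate has the wrong sign, given that it is statistically significant. It assumes a normally distributed estimate with known standard error. The function name `type_s_error` and the example numbers are hypothetical, and this is not the estimator used in the paper above, which works with an ITE model’s predictions.

```python
from statistics import NormalDist

def type_s_error(true_effect, se, z_crit=1.96):
    """P(wrong sign | estimate is significant), assuming estimate ~ N(true_effect, se)."""
    z = NormalDist()
    d = abs(true_effect) / se
    p_right = 1 - z.cdf(z_crit - d)   # significant, with the correct sign
    p_wrong = z.cdf(-z_crit - d)      # significant, with the wrong sign
    return p_wrong / (p_right + p_wrong)

# A small true effect measured with a large standard error (hypothetical values):
print(type_s_error(true_effect=0.01, se=0.05))
# A large true effect relative to the standard error:
print(type_s_error(true_effect=0.20, se=0.05))
```

When the true effect is zero, the two tails are equally likely and the rate is one half; as the effect grows relative to the standard error, the rate drops toward zero.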

What sort of identification do you get from panel data if effects are long-term? Air pollution and cognition example.

Don MacLeod writes:

Perhaps you know this study which is being taken at face value in all the secondary reports: “Air pollution causes ‘huge’ reduction in intelligence, study reveals.” It’s surely alarming, but the reported effect of air pollution seems implausibly large, so it’s hard to be convinced of it by a correlational study alone, when we can suspect instead that the smarter, more educated folks are more likely to be found in polluted conditions for other reasons. They did try to allow for the usual covariates, but there is the usual problem that you never know whether you’ve done enough of that.

Assuming equal statistical support, I suppose the larger an effect, the less likely it is to be due to uncontrolled covariates. But also the larger the effect, the more reasonable it is to demand strongly convincing evidence before accepting it.

From the above-linked news article:

“Polluted air can cause everyone to reduce their level of education by one year, which is huge,” said Xi Chen at Yale School of Public Health in the US, a member of the research team. . . .

The new work, published in the journal Proceedings of the National Academy of Sciences, analysed language and arithmetic tests conducted as part of the China Family Panel Studies on 20,000 people across the nation between 2010 and 2014. The scientists compared the test results with records of nitrogen dioxide and sulphur dioxide pollution.

They found the longer people were exposed to dirty air, the bigger the damage to intelligence, with language ability more harmed than mathematical ability and men more harmed than women. The researchers said this may result from differences in how male and female brains work.

The above claims are indeed bold, but the researchers seem pretty careful:

The study followed the same individuals as air pollution varied from one year to the next, meaning that many other possible causal factors such as genetic differences are automatically accounted for.

The scientists also accounted for the gradual decline in cognition seen as people age and ruled out people being more impatient or uncooperative during tests when pollution was high.

Following the same individuals through the study: that makes a lot of sense.

I hadn’t heard of this study when it came out so I followed the link and read it now.

You can model the effects of air pollution as short-term or long-term. An example of a short-term effect is that air pollution makes it harder to breathe, you get less oxygen in your brain, etc., or maybe you’re just distracted by the discomfort and can’t think so well. An example of a long-term effect is that air pollution damages your brain or other parts of your body in various ways that impact your cognition.

The model includes air pollution levels on the day of measurement and on the past few days or months or years, and also a quadratic monthly time trend from Jan 2010 to Dec 2014. A quadratic time trend, that seems weird, kinda worrying. Are people’s test scores going up and down in that way?

In any case, their regression finds that air pollution levels from the past months or years are a strong predictor of the cognitive test outcome, and today’s air pollution doesn’t add much predictive power after including the historical pollution level.

Some minor things:

Measurement of cognitive performance:

The waves 2010 and 2014 contain the same cognitive ability module, that is, 24 standardized mathematics questions and 34 word-recognition questions. All of these questions are sorted in ascending order of difficulty, and the final test score is defined as the rank of the hardest question that a respondent is able to answer correctly.

Huh? Are you serious? Wouldn’t it be better to use the number of questions answered correctly? Even better would be to fit a simple item-response model, but I’d guess that #correct would capture almost all the relevant information in the data. But to just use the rank of the hardest question answered correctly: that seems inefficient, no?
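A minimal sketch of the two scoring rules, assuming each respondent’s answers are stored as a list of 0/1 values sorted from easiest to hardest question. The function names and the two example respondents are made up for illustration, and the first rule is my reading of the quoted description.

```python
def rank_of_hardest_correct(responses):
    """Score as described in the quote: position of the hardest question answered correctly."""
    score = 0
    for rank, correct in enumerate(responses, start=1):
        if correct:
            score = rank
    return score

def number_correct(responses):
    """Alternative score: total number of questions answered correctly."""
    return sum(responses)

# Two hypothetical respondents, questions ordered from easiest to hardest.
steady = [1, 1, 1, 1, 0, 1, 0, 0]   # gets most of the easy questions right
lucky  = [0, 0, 0, 0, 0, 1, 0, 0]   # gets a single question right

print(rank_of_hardest_correct(steady), number_correct(steady))  # 6 5
print(rank_of_hardest_correct(lucky), number_correct(lucky))    # 6 1
```

The rank-based score treats these two respondents as identical, while the count separates them. That is the sense in which the rank throws away information.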

Comparison between the sexes:

The authors claim that air pollution has a larger effect on men than on women (see above quote from the news article). But I suspect this is yet another example of The difference between “significant” and “not significant” is not itself statistically significant. It’s hard to tell. For example, there’s this graph:

The plot on the left shows a lot of consistency across age groups. Too much consistency, I think. I’m guessing that there’s something in the model keeping these estimates similar to each other, i.e. I don’t think they’re five independent results.
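The “significant” versus “not significant” point can be checked with a line of arithmetic. The numbers below are hypothetical and are not taken from the air pollution paper. The sketch also assumes the two subgroup estimates are independent, which (as just noted) may not hold here.

```python
from math import sqrt

def z_of_difference(est1, se1, est2, se2):
    """z-score for the difference of two independent estimates."""
    return (est1 - est2) / sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical subgroup estimates, each with standard error 10:
z_men = 25 / 10      # 2.5: "significant"
z_women = 10 / 10    # 1.0: "not significant"
z_diff = z_of_difference(25, 10, 10, 10)

print(z_men, z_women, round(z_diff, 2))  # 2.5 1.0 1.06
```

One estimate clears the conventional threshold and the other does not, yet the difference between them is only about one standard error from zero.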

The authors write:

People may become more impatient or uncooperative when exposed to more polluted air. Therefore, it is possible that the observed negative effect on cognitive performance is due to behavioral change rather than impaired cognition. . . . Changes in the brain chemistry or composition are likely more plausible channels between air pollution and cognition.

I think they’re missing the point here and engaging in a bit of “scientism” or “mind-body dualism” in the following way: Suppose that air pollution irritates people, making it hard for people to concentrate on cognitive tasks. That is a form of impaired cognition. Just cos it’s “behavioral,” doesn’t make it not real.

In any case, putting this all together, what can we say? This seems like a serious analysis, and to start with the authors should make all their data and code available so that others can try fitting their own models. This is an important problem, so it’s good to have as many eyes on the data as possible.

In this particular example, it seems that the key information is coming from:

– People who moved from one place to another, either moving from a high-pollution to a low-pollution area or vice-versa, and then you can see if their test scores went correspondingly up or down. After adjusting for expected cognitive decline by age during this period.

– People who lived in the same place but where there was a negative or positive trend in pollution. Again you can see if these people’s test scores went up or down. Again, after adjusting for expected cognitive decline by age during this period.

– People who didn’t move, comparing these people who lived all along in high- or low-pollution areas, and seeing who had higher test scores. After adjusting for demographic differences between people living in these different cities.

This leaves me with two thoughts:

First, I’d like to see the analyses in these three different groups. One big regression is fine, but in this sort of problem I think it’s important to understand the path from data to conclusions. This is especially an issue given that we might see different results from the three different comparisons listed above.

Second, I am concerned with some incoherence regarding how the effect works. The story in the paper, supported by the regression analysis, seems to be that what matters is long-term exposure. But, if so, I don’t see how the short-term longitudinal analysis in this paper is getting us to that. If effects of air pollution on cognition are long-term, then really this is all a big cross-sectional analysis, which brings up the usual issues of unobserved confounders, selection bias, etc., and the multiple measurements on each person are not really giving us identification at all.

P.S. The problems with this study, along with the uncritical press coverage, suggest a concern not with this particular paper but a more general concern with superstar journals such as PNAS, Science, Nature, Lancet, NEJM, JAMA, etc., which is that they often seem to give journalists a free pass to report uncritically. This sort of episode makes me think the world would be better if these superstar journals just didn’t exist, or if they were all to shut down tomorrow and be replaced by regular old field journals.

What is the most important real-world data processing tip you’d like to share with others?

This question was in today’s jitts for our communication class. Here are some responses:

Invest the time to learn data manipulation tools well (e.g. tidyverse). Increased familiarity with these tools often leads to greater time savings and less frustration in future.

Hmm it’s never one tip.. I never ever found it useful to begin writing code especially on a greenfield project unless I thought of the steps to the goal. I often still write the code in outline form first and edit before entering in programming steps. Some other tips.
1. Choose the right tool for the right job. Don’t use C++ if you’re going to design a web site.
2. Document code well but don’t overdo it, and leave some unit tests or assertions inside a commented field.
3. Testing code will always show the presence of bugs, not their absence (Dijkstra), but that doesn’t mean you should be a slacker.
4. Keep it simple at first, you may have to rewrite the program several times if it’s something new so don’t optimize until you’re satisfied. Finally, If you can control the L1 cache, you can control the world (Sabini).

Just try stuff. Nothing works the first time and you’ll have to throw out your meticulous plan once you actually start working. You’ll find all the hiccups and issues with your data the more time you actually spend in it.

Consider the sampling procedure and the methods (specifics of the questionnaire etc.) of data collection for “real-world” data to avoid any serious biases or flaws.

Quadruple-check your group by statements and joins!!

Cleaning data properly is essential.

Write a script to analyze the data. Don’t do anything “manually”.

Don’t be afraid to confer with others. Even though there’s often an expectation that we all be experts in all things data processing, the fact is that we all have different strengths and weaknesses and it’s always a good idea to benefit from others’ expertise.

For me, cleaning data is always really time-consuming, in particular when I use real-world data and (especially) string data such as names of cities/countries/individuals. In addition, when you run a survey for your research, there will always be that guy who types “b” instead of “B” or “B ” (pushing the computer’s Tab). For these reasons, my tip is: never underestimate the power of Excel (!!) when you have this kind of problem.

Data processing sucks. Work in an environment that enables you to do as little of it as possible. Tech companies these days have dedicated data engineers, and they are life-changing (in a good way) for researchers/data scientists.

If the data set is large, try the processing steps on a small subset of the data to make sure the output is what you expect. Include checks/control totals if possible. Do not overwrite the same dataset in important, complicated steps.

When converting data types, for example extracting integers or converting to dates, always check the agreement between the data before and after conversion. Sometimes when I was converting levels to integers (numerical values somehow recorded as categorical because of the existence of NA), there were errors and the results were not what I expected (e.g. “3712” converted to “1672”).

Learn dplyr.

Organisation of files and ideas is vital – constantly leave reminders of what you were doing and why you made particular choices either within the file names (indicating perhaps the date in which the code or data was updated) or within comments throughout the code that explain why you made certain decisions.

Thanks, kids!
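Three of the tips above (clean up string labels, check data before and after a type conversion, keep control totals) can be sketched in a few lines. This is a minimal illustration in plain Python rather than in the tidyverse tools the students mention. All function names and the toy rows are hypothetical.

```python
def normalize_label(text):
    """Collapse variants such as 'b', 'B ' and 'B\\t' into one label."""
    return text.strip().upper()

def to_int_checked(values):
    """Convert strings to integers and report every value that does not survive a round trip."""
    converted, problems = [], []
    for value in values:
        text = value.strip()
        try:
            number = int(text)
        except ValueError:
            converted.append(None)      # e.g. 'NA' stays missing instead of becoming a code
            problems.append(value)
            continue
        if str(number) != text:         # e.g. '007' silently becomes 7
            problems.append(value)
        converted.append(number)
    return converted, problems

def grouped_totals(rows):
    """Sum amounts by key; rows are (key, amount) pairs."""
    totals = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    return totals

rows = [("B", 10), ("b ", 5), ("A", 7)]
clean = [(normalize_label(key), amount) for key, amount in rows]
totals = grouped_totals(clean)

# Control total: grouping must not create or lose any amount.
assert sum(totals.values()) == sum(amount for _, amount in rows)

numbers, problems = to_int_checked(["3712", "NA", "007"])
print(totals)    # {'B': 15, 'A': 7}
print(numbers)   # [3712, None, 7]
print(problems)  # ['NA', '007']
```

The point of the round-trip check is that a conversion which “works” without an error message can still change the data, so the comparison has to be made explicitly.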

P.S. Lots of good discussion in comments, especially this from Bob Carpenter.

Prestigious journal publishes sexy selfie study

Stephen Oliver writes:

Not really worth blogging about and a likely candidate for multiverse analysis, but the beginning of the first sentence in the 2nd paragraph made me laugh:

In the study – published in prestigious journal PNAS . . .

The researchers get extra points for this quote from the press release:

The researchers say that the findings make sense from an evolutionary point of view.

In evolutionary terms, these kinds of behaviours are completely rational, even adaptive. The basic idea is that the way people compete for mates, and the things they do to put themselves at the top of the hierarchy are really important. This is where this research fits in – it’s all about how women are competing and why they’re competing.

All right, then.

“How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions” . . . and still stays around even after it’s been retracted

Chuck Jackson points to two items of possible interest:

Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions, by Richard Harris. Review here by Leonard Freedman.

Retractions do not work very well, by Ken Cor and Gaurav Sood. This post by Tyler Cowen brought this paper to my attention.

Here’s a quote from Freedman’s review:

Harris shows both sides of the reproducibility debate, noting that many eminent members of the research establishment would like to see this new practice of airing the scientific community’s dirty laundry quietly disappear. He describes how, for example, in the aftermath of their 2012 paper demonstrating that only 6 of 53 landmark studies in cancer biology could be reproduced, Glenn Begley and Lee Ellis were immediately attacked by some in the biomedical research aristocracy for their “naïveté,” their “lack of competence” and their “disservice” to the scientific community.

“The biomedical research aristocracy” . . . I like that.

From Cor and Sood’s abstract:

Using data from over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31.2% of the citations to retracted articles happen a year after they have been retracted. And that 91.4% of the post-retraction citations are approving—note no concern with the cited article.

I’m reminded of this story: “A study fails to replicate, but it continues to get referenced as if it had no problems. Communication channels are blocked.”

This is believable—and disturbing. But . . . do you really have to say “31.2%” and “91.4%”? Meaningless precision alert! Even if you could estimate those percentages to this sort of precision, you can’t take these numbers seriously, as the percentages are varying over time etc. Saying 30% and 90% would be just fine, indeed more appropriate and scientific, for the same reason that we don’t say that Steph Curry is 6’2.84378″ tall.

Emile Bravo and agency

I was reading Tome 4 of the adventures of Jules (see the last item here), and it struck me how much agency the characters had. They seemed to be making their own decisions, saying what they wanted to say, etc.

Just as a contrast, I’m also reading an old John Le Carre book, and here the characters have no agency at all. They’re just doing what is necessary to make the plot run. For Le Carre, that’s fine; the plot’s what it’s all about. So that’s an extreme case.

Anyway, I found the agency of Bravo’s characters refreshing. It’s not something I think about so often when reading, but this time it struck me.

P.S. I wrote about agency a few years ago in the context of Benjamin Kunkel’s book Indecision. I did a quick search and it doesn’t look like Kunkel has written much since. Too bad. But maybe he’s doing a Klam and it will be all right.

Research topic on the geography of partisan prejudice (more generally, county-level estimates using MRP)

1. An estimate of the geography of partisan prejudice

My colleagues David Rothschild and Tobi Konitzer recently published this MRP analysis, “The Geography of Partisan Prejudice: A guide to the most—and least—politically open-minded counties in America,” written up by Amanda Ripley, Rekha Tenjarla, and Angela He.

Ripley et al. write:

In general, the most politically intolerant Americans, according to the analysis, tend to be whiter, more highly educated, older, more urban, and more partisan themselves. This finding aligns in some ways with previous research by the University of Pennsylvania professor Diana Mutz, who has found that white, highly educated people are relatively isolated from political diversity. They don’t routinely talk with people who disagree with them; this isolation makes it easier for them to caricature their ideological opponents. . . . By contrast, many nonwhite Americans routinely encounter political disagreement. They have more diverse social networks, politically speaking, and therefore tend to have more complicated views of the other side, whatever side that may be. . . .

The survey results are summarized by this map:

I’m not a big fan of the discrete color scheme, which creates all sorts of discretization artifacts—but let’s leave that for another time. In future iterations of this project we can work on making the map clearer.

There are some funny things about this map and I’ll get to them in a moment, but first let’s talk about what’s being plotted here.

There are two things that go into the above map: the outcome measure and the predictive model, and it’s all described in this post from David and Tobi.

First, the outcome. They measured partisan prejudice by asking 14 partisan-related questions, from “How would you react if a member of your immediate family married a Democrat?” to “How well does the term ‘Patriotic’ describe Democrats?” to “How do you feel about Democratic voters today?”, asking 7 questions about each of the two parties and then fitting an item-response model to score each respondent who is a Democrat or Republican on how tolerant, or positive, they are about the other party.

Second, the model. They took data from 2000 survey responses and regressed these on individual and neighborhood (census block)-level demographic and geographic predictors to construct a model to implicitly predict “political tolerance” for everyone in the country, and then they poststratified, summing these up over estimated totals for all demographic groups to get estimates for county averages, which is what they plotted.
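The poststratification step is just a population-weighted average of the model’s cell-level predictions. Here is a minimal sketch for a single county, with made-up cell names, predictions, and counts. In the real analysis the predictions come from the fitted multilevel model and the counts from the voter file.

```python
def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of cell-level estimates for one county."""
    total = sum(cell_counts.values())
    weighted = sum(cell_estimates[cell] * n for cell, n in cell_counts.items())
    return weighted / total

# Hypothetical county with two demographic cells.
cell_estimates = {"young": 0.2, "old": -0.4}   # model-predicted tolerance score per cell
cell_counts = {"young": 300, "old": 100}       # estimated number of partisans per cell

county_average = poststratify(cell_estimates, cell_counts)
print(round(county_average, 3))  # 0.05
```

Any error in the counts feeds straight into the county average, even if the model’s cell-level predictions are fine.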

Having done the multilevel modeling and poststratification, they could plot all sorts of summaries, for example a map of estimated political tolerance just among whites, or a scatterplot of county-level estimated political tolerance vs. average education at the county level, or whatever. But we’ll focus on the map above.

2. Two concerns with the map and how it’s constructed

People have expressed two concerns about David and Tobi’s estimates.

First, the inferences are strongly model-based. If you’re getting estimates for 3000 counties from 2000 respondents—or even from 20,000 respondents, or 200,000—you’ll need to lean on a model. As a result, the map should not be taken to represent independent data within each county; rather, it’s a summary of a national-level model including individual and neighborhood (census block-level) predictors. As such, we want to think about ways of understanding and evaluating this model.

Second, the map shows some artifacts at state borders, most notably with Florida, South Carolina, New York state, South Dakota, Utah, and Wisconsin, also some suggestive patterns elsewhere such as the borders between Virginia and North Carolina, and Missouri and Arkansas. I’m not sure about all these—as noted above, the discrete color scheme can create apparent patterns from small variation, and there are real differences in political cultures between states (Utah comes to mind)—but there are definitely some problems here, problems which David and Tobi attribute to differences between states in the voter files that are used to estimate the total number of partisans (Democrats and Republicans) in each demographic category in each county. If the voter files for neighboring states are coming from different sorts of data, this can introduce apparent differences in the poststratification stage. These counting problems are especially cumbersome because we have to estimate the total number of partisans in each demographic category in each county.

3. Four plans for further research

So, what to do about these concerns? I have four ideas, all of which involve some mix of statistics and political science research, along with good old data munging:

(a) Measurement error model for differences between states in classifications. The voter files have different meanings in different states? Model it, with some state effects that are estimated from the data and using whatever additional information we can find on the measurement and classification process.

(b) Varying intercept model plus spatial correlation as a fix to the state boundary problems. This is kind of a light, klugey version of the above option. We recognize that some state-level fix is needed, and instead of modeling the measurement error or coding differences directly, we throw in a state-level error term, along with a spatial correlation penalty term to enforce similarity across county boundaries (maybe only counting counties that are similar in certain characteristics such as ethnic breakdown and proportion urban/suburban/rural).

(c) Tracking down exactly what happened to create those artifacts at the state boundaries. Before or after doing the modeling to correct the glaring boundary artifacts, it would be good to do some model analysis to work out the “trail of breadcrumbs” explaining exactly how the particular artifacts we see arose, to connect the patterns on the map with what was going on in the data.

(d) Fake-data simulation to understand scenarios where the MRP approach could fail. As noted in point 2 above, there are legitimate concerns about the use of any model-based approach to draw inferences for 3000 counties from 2000 (or even 20,000 or 200,000) respondents. One way to get a sense of potential problems here is to construct some fake-data worlds in which the model-based estimates will fail.
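As a starting point for (d), here is a deliberately crude fake-data sketch. It uses a plain national regression on one county-level predictor as a stand-in for the multilevel model, and compares a world where that predictor explains everything with a world where counties also differ in ways the predictor cannot see. The function name, the single predictor, and all parameter values are assumptions for illustration only.

```python
import random
from statistics import fmean

def simulate_world(sigma_county, n_counties=3000, n_resp=2000, seed=1):
    """RMSE of model-based county estimates in one fake-data world."""
    rng = random.Random(seed)
    # Hypothetical county-level predictor, e.g. a standardized demographic index.
    x = [rng.gauss(0, 1) for _ in range(n_counties)]
    # County deviations that the predictor cannot explain.
    u = [rng.gauss(0, sigma_county) for _ in range(n_counties)]
    truth = [0.5 * xi + ui for xi, ui in zip(x, u)]
    # Survey: each respondent is drawn from a random county.
    counties = [rng.randrange(n_counties) for _ in range(n_resp)]
    xs = [x[c] for c in counties]
    ys = [truth[c] + rng.gauss(0, 1) for c in counties]
    # Least-squares fit of the response on the predictor.
    mx, my = fmean(xs), fmean(ys)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    intercept = my - slope * mx
    # Model-based estimate for every county, surveyed or not.
    estimates = [intercept + slope * xi for xi in x]
    return (sum((e - t) ** 2 for e, t in zip(estimates, truth)) / n_counties) ** 0.5

rmse_benign = simulate_world(sigma_county=0.0)   # predictor explains all county variation
rmse_hostile = simulate_world(sigma_county=1.0)  # large unexplained county variation
print(rmse_benign, rmse_hostile)
```

In the benign world the 2000 responses are plenty, because the model only has two coefficients to learn. In the hostile world the county-level error stays near the size of the unexplained variation no matter how well the coefficients are estimated. A more useful version would add state-level artifacts in the poststratification counts, as in point 2.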

OK, so four research directions here. My inclination is to start with (b) and (d) because I’m kind of intimidated by the demographic classifications in the voter file, so I’d rather just consider them as a black box and try to fix them indirectly, rather than to model and understand them. Along similar lines, it seems to me that solving (b) and (d) will give us general tools that can be used in many other adjustment problems in sampling and causal inference. That said, (a) is appealing because it’s all about doing things right, and it could have real impact on future studies using the voter file, and (c) would be an example of building bridges between different models in statistical workflow, which is an idea I’ve talked about a lot recently, so I’d like to see that too.