Several post-doc positions in probabilistic programming etc. in Finland

There are several open post-doc positions at Aalto University and the University of Helsinki in 1. probabilistic programming, 2. simulator-based inference, 3. data-efficient deep learning, 4. privacy-preserving and secure methods, and 5. interactive AI. All these research programs are connected and collaborating. I (Aki) am the coordinator for project 1 and a contributor to the others. Overall we are developing methods and tools for a big part of the Bayesian workflow. The methods are developed generally, but Stan is one of the platforms used to make the first implementations, so some post-docs will also work with the Stan development team. See more details here.

Ballot order update

Darren Grant writes:

Thanks for bringing my work on ballot order effects to the attention of a wider audience via your recent blog post. The final paper, slightly modified from the version you posted, was published last year in Public Choice.

Like you, I am not wedded to traditional hypothesis testing, but I think it is the right way to go here. The post spoke to the issue that I struggled with the most in writing the paper—the role of description. I view this as an important aspect of an empirical analysis, one that normally should be done early on. This paper was unique, for me at least, in that I chose to invert this order—twice.

The first time involved the presentation of mean vote shares by ballot position—the best simple way to describe this data, in my opinion. Instead of beginning with these, I began with regression results, and then presented the (differences in) means as a robustness check. As discussed in the paper, these means check for bias caused by imperfect randomization, as well as (hopefully) reassure the reader that the more complex SUR method isn’t driving the results.

The second time involved the magnitude of the effect, which as Dale points out is quite large. I support the plausibility of effects this large with the “money shot” figure, Figure 2a, which comes near the end of the paper. This histogram simply depicts one candidate’s county-wide vote shares in an unusual two-candidate contest (described on p. 21 of your version of the paper):

The histogram has two peaks, at 40% and 60% of the vote. These correspond to second/first ballot position, and imply a ballot order effect of roughly 20 percentage points—at least double the main estimates presented earlier in the paper (which top out around 10 percentage points). Regression estimates (Figure 2c) also indicate a (nearly) 20 percentage point ballot order effect.

I am not adamant that this order of presentation was correct—just wanted to share with you the way I tried to use description in a non-standard way to alleviate some of the concerns you raised about the estimates. Upon request, I can address some of the other points that were raised.

I have three comments:

1. I have not looked at this paper in detail, but I guess if there are large ballot order effects, I’d expect to see them in these sorts of low-information races. (Recall my skepticism about the claim that Trump won the presidential election in 2016 because of ballot order in key states.)

2. I don’t see the point of hypothesis testing here at all. Nobody thinks the ballot order effect is exactly zero.

3. Figure 2a confuses me. If you want to show vote percentages under the two different ballot orders, why not show two histograms, one where this candidate is first on the ballot and one where he’s second? That would seem to be the more direct comparison. Also, what’s with the bizarre x and y axes here? Better would be to put x-axis labels at 30, 40, 50, 60, 70, and y-axis labels at 0, 5, 10.
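For what it’s worth, here’s a rough sketch in R of the display I have in mind. The data, variable names, and numbers are all made up, so treat it as an illustration of the format rather than of the actual results:

# Two histograms of county-level vote share, one per ballot position.
# Fake data for illustration only.
set.seed(1)
d <- data.frame(
  vote_share = c(rnorm(60, 60, 5), rnorm(60, 40, 5)),
  ballot_position = rep(c("first", "second"), each = 60)
)
par(mfrow = c(2, 1))
for (pos in c("first", "second")) {
  hist(d$vote_share[d$ballot_position == pos],
       breaks = seq(0, 100, 2), xlim = c(30, 70), ylim = c(0, 10),
       xaxt = "n", yaxt = "n", main = paste("Candidate listed", pos),
       xlab = "Vote share (%)")
  axis(1, at = c(30, 40, 50, 60, 70))
  axis(2, at = c(0, 5, 10))
}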

Postdoctoral position in Vancouver! Using Stan! Working on wine! For reals.

Lizzie Wolkovich writes that she is hiring someone to help build Stan models for winegrapes. Here’s the ad:

Postdoctoral Fellow in Winegrape Research—University of British Columbia

The Temporal Ecology Lab is looking for a bright, motivated and collaborative researcher to join the lab and develop new winegrape models using Stan (mc-stan.org). The project combines decades of historical records with modern Bayesian modeling to address the challenge that shifting climate regimes pose for the wine industry, with implications for crops across the globe. The fellow will join an interdisciplinary team of researchers based across Canada, the United States and Europe.

The position would be based at the University of British Columbia in the Forest and Conservation Sciences Department. Applicants must be willing to travel to the Okanagan winegrowing region (in southern British Columbia) and France for field work and to meet with collaborators. Travel costs are covered by the lab (in advance of travel as needed).

The ideal researcher will be able to lead current projects, develop their own projects, and support ongoing work in the lab. Current lab research covers a broad range of topics—climate change impacts via phenology on forests and winegrapes, community assembly via the temporal niche—using a variety of methods, from empirical field data to meta-analyses and analytical coexistence models. More details on the lab’s research can be found at www.temporalecology.org.

A successful applicant would have/be:
• Either a Ph.D. in agriculture, ecology or related fields with a strong interest in statistical modeling or a Ph.D. in computer science, statistics, physical sciences or related fields with a strong interest in agriculture and/or ecology.
• Strong quantitative and computational skills.
• Experienced with R or Python (or similar skills), ideally with proficiency in LaTeX, git and Stan (applicants without experience in these languages must be excited to learn them quickly).
• Comfortable working with diverse file structures and large datasets (e.g., climate data in formats such as NetCDF).
• Excellent writing skills and a good publication record.
• Experience relevant to mentoring undergraduate and graduate students.
• Excellent record of being a good lab and community member.

To apply email the following in PDF format (preferably one file) to E. M. Wolkovich at e.wolkovich@ubc.ca (informal inquiries welcome):
• Cover letter (see ‘successful applicant’ list above and detail relevant skills and experience)
• Curriculum vitae
• Brief Description of research interests (maximum of two pages)
• Two examples of published papers (one in prep acceptable).
• Names and contact information of 3 references.

Application review will begin immediately and will continue until the position is filled.

OK, it’s not flying squirrels. But it’s still pretty cool.

“Appendix: Why we are publishing this here instead of as a letter to the editor in the journal”

David Allison points us to this letter he wrote with Cynthia Kroeger and Andrew Brown:

Unsubstantiated conclusions in randomized controlled trial of binge eating program due to Differences in Nominal Significance (DINS) Error

Cachelin et al. tested the effects of a culturally adapted, Cognitive Behavioral Therapy-based, guided self-help (CBTgsh) intervention on binge eating reduction . . . The authors report finding a causal effect in their conclusion by stating,

Treatment with the CBTgsh program resulted in significant reductions in frequency of binge eating, depression, and psychological distress and 47.6% of the intention-to-treat CBTgsh group were abstinent from binge eating at follow-up. In contrast, no significant changes were found from pre- to 12-week follow-up assessments for the waitlisted group. Results indicate that CBTgsh can be effective in addressing the needs of Latinas who binge eat and can lead to improvements in symptoms.

This study is well-designed to test for causal effects between these groups; however, the authors did not conduct the statistical test needed to draw causal inference. Specifically, the authors base their conclusions from a parallel groups RCT on within-group analyses. Such analyses have been well-documented as invalid as tests for between-group treatment effects; instead, between-group tests should be utilized to inform conclusions (Bland & Altman, 2011; Gelman & Stern, 2006; Huck & McLean, 1975).

The Differences in Nominal Significance (DINS) error is a term used to describe this error of basing between-group conclusions on comparisons of the statistical significance of two (or more) separate tests . . . DINS errors are common within peer-reviewed obesity literature . . .

The difference between “significant” and “not significant” is not itself statistically significant.

Allison adds:

You might be interested in the Appendix titled “Why we are publishing this here instead of as a letter to the editor in the journal.”

Here’s the story:

Appendix: Why we are publishing this here instead of as a letter to the editor in the journal

We first contacted a Peer Review Manager of Psychological Services on May 12, 2018 to inquire as to how one should submit a Letter to the Editor to their journal, regarding an article published in their journal, because this article type was not an option in the author center of their online submission system.

On June 2, 2018, we received a reply from the Peer Review Manager stating the journal does not usually receive submissions like this, but that the editor confirmed we could submit it as a regular article and just explain that it is a Letter to the Editor in the cover letter.

On July 3, 2018, we followed these instructions and submitted the letter above.

On August 16, 2018, we wrote to inquire as to the status of our submitted letter. A reply from the Peer Review Manager was received on August 23, 2018 stating that the handling editor confirmed she is working on it and consulting with a reviewer.

On October 4, 2018, we received a decision from an Editor, stating that the editorial team was contacted and that they do not publish letters to the editor in this journal. Permission was requested to send a blinded version of our letter to the authors.

On October 13, 2018, the Peer Review Manager followed up to request permission to share a masked version of our letter with the authors of the original manuscript, so that they can potentially address the concerns.

On October 16, 2018, I replied to the Peer Review Manager indicating that I was unsure how to respond, because I was confused about the decision. I attached our previous email correspondence that mentioned how the Editor confirmed that we could submit the article as a regular article and explain that it was a Letter to the Editor in the cover letter – even though the journal normally does not have submissions like this. I mentioned that other journals often have original authors address concerns in a formal response and asked whether this person knew the process by which authors would address our concerns otherwise.

On October 17, 2018, I received an email from the Editor, indicating that our letter was read through carefully and our concerns were taken seriously. Multiple statisticians were consulted to see how to best address the concerns. The editorial team also was consulted with. The editorial team agreed that given the nature of our letter was statistical procedures related to a study, that publishing it in the journal was not the best step. It was explained that they reject papers that are not in the area of typical foci of their journal or readership, and that this is why the letter could not be accepted for publication. It was explained that they agreed the best step was to reach out to the author that needed corrections and wanted permission to send our blinded letter to them, so they can make the corrections.

On October 23, 2018, we responded to the Editor and provided consent to share our letter with the original authors. We also shared with them our plan to publish our submitted letter as a comment in PubPeer and update our comment as progress continues.

The Editor responded on October 23, 2018, asking whether we still wanted our letter to be blinded when they share it with original authors.

We responded on October 24, 2018, giving permission to send our letter unblinded. We said to please let us know if we can help further in any way and to please feel free to share with the original authors that we are happy to help if they think that might be useful. The Editor responded affirmatively and thanked us.

In sum, we decided to post our letter here. Posting our concerns here is in line with the COPE ideal of quickly making the scientific community aware of an issue. We also find the editor’s decision to not allow for the publication of LTEs discussing errors in their papers to be counter to the ideals of rigor, reproducibility, and transparency.

This is very similar to what happened to me when I tried to publish a letter pointing out a problem in a paper published in the American Sociological Review. I eventually gave up and just published the story in Chance. That was a few years ago. Had it happened more recently, I would’ve submitted it to Sociological Science.

Just today, someone sent me another story of this sort: A paper with serious statistical errors appeared in a medical journal, my correspondent found some problems with it but the journal editor refused to do anything about it. (In this case, the data were not made available, making it difficult to figure out what exactly was going on.) My correspondent was stuck, didn’t know what to do. I suggested publishing the criticism as a short article in a different journal in the same medical subfield. We’ll see what happens.

P.S. Since we’re on the topic of publication and conflict of interest, here’s an unrelated story.

Why “statistical significance” doesn’t work: An example.

Reading some of the back-and-forth in this thread, it struck me that some of the discussion was about data, some was about models, some was about underlying reality, but none of the discussion was driven by statements that this or that pattern in data was “statistically significant.”

Here’s the problem with “statistical significance” as I typically see it used. I see it used in four ways, all of them problematic:

1. Researcher has certain goals in mind, uses forking paths to find a statistically significant result consistent with a pre-existing story.

2. Researcher finds a non-significant result and identifies it as zero.

3. Researcher has a pile of results and agnostically uses statistical significance to decide what is real and what is not.

4. A community of researchers uses p-values to distinguish between different theories.

All these are bad. Approaches 1 and 2 are obviously bad, in that statistical significance is being used to imply empirical support beyond what can really be learned from the data. Approaches 3 and 4 are bad in a different way, in that they are taking whatever process of scientific discussion and learning is happening, and sprinkling it with noise.

R-squared for multilevel models

Brandon Sherman writes:

I was just having a discussion with someone about multilevel models, and the following topic came up. Imagine we’re building a multilevel model to predict SAT scores using many students. First we fit a model on students only, then students in classrooms, then students in classrooms within districts, then all of that within cities, then counties, countries, etc. Maybe we even add in census tract info. The idea is that we keep arbitrarily adding levels to the hierarchy.

In standard linear regression, adding variables, regardless of informativeness, always leads to an increase in R^2. In the case of multilevel modeling, does adding levels to the hierarchy always lead to a stronger relationship with the response, even if it’s a tiny one that’s only applicable to the data the model is built on?

My reply: Not always. See here.

P.S. Since we’re on the topic, I should point you to this recent paper with Ben, Jonah, and Aki on Bayesian R-squared.
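Since the question is easy to check empirically, here’s a sketch, assuming the rstanarm package (whose bayes_R2() follows the paper linked above); the data and variable names are invented. It fits the same outcome with and without an extra grouping level and compares the posterior Bayesian R^2, which does not have to increase when the level is added:

library(rstanarm)

# Fake data: students nested in classrooms nested in districts.
set.seed(123)
n_district <- 10; n_class <- 5; n_student <- 10
d <- expand.grid(student = 1:n_student, classroom = 1:n_class, district = 1:n_district)
d$classroom <- interaction(d$district, d$classroom)   # unique classroom ids
a_district <- rnorm(n_district, 0, 20)
a_class <- rnorm(nlevels(d$classroom), 0, 10)
d$score <- 1000 + a_district[d$district] + a_class[d$classroom] + rnorm(nrow(d), 0, 50)

fit1 <- stan_lmer(score ~ 1 + (1 | district), data = d, refresh = 0)
fit2 <- stan_lmer(score ~ 1 + (1 | district) + (1 | classroom), data = d, refresh = 0)

# Posterior distributions of Bayesian R^2 under each model.
median(bayes_R2(fit1))
median(bayes_R2(fit2))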

Wanted: Statistical success stories

Bill Harris writes:

Sometime when you get a free moment, it might be great to publish a post that links to good, current exemplars of analyses. There’s a current discussion about RCTs on a program evaluation mailing list I monitor. I posted links to your power=0.06 post and your Type S and Type M post, but some still seem to think RCTs are the foundation. I can say “Read one of your books” or “Read this or that book,” or I could say “Peruse your blog for the last, oh, eight-ten years,” but either one requires a bunch of dedication. I could say “Read some Stan examples,” but those seem mostly focused on teaching Stan. Some published examples use priors you no longer recommend, as I recall. I think I’ve noticed a few models with writeups on your blog that really did begin to show how one can put together a useful analysis without getting into NHST and RCTs, but I don’t recall where they are.

Relatedly, Ramiro Barrantes-Reynolds writes:

I would be very interested in seeing more in your blog about research that does a good job in the areas that are most troublesome for you: measurement, noise, forking paths, etc.; or that addresses those aspects so as to make better inferences. I think after reading your blog I know what to look for to see when some investigator (or myself) is chasing noise (i.e., I have a sense of what NOT to do), but I am missing good examples to follow in order to do better research. I would consider myself a beginning statistician, so examples of research that is well done and addresses the issues of forking paths, measurement, etc., would help. I think blog posts and the discussion that arises would be beneficial to the community.

So, two related questions. The first one’s about causal inference beyond simple analyses of randomized trials; the second is about examples of good measurement and inference in the context of forking paths.

My quick answer is that, yes, we do have examples in our books, and it doesn’t involve that much dedication to order them and take a look at the examples. I also have a bunch of examples here and here.

More specifically:

Causal inference without a randomized trial: Millennium villages, incumbency advantage (and again)

Measurement: penumbras, assays

Forking paths: Millennium villages, polarization

I guess the other suggestion is that we post on high-quality new work so we can all discuss, not just what makes bad work bad, but also what makes good work good. That makes sense. To start with, you should start pointing me to some good stuff to post on.

No, it’s not correct to say that you can be 95% sure that the true value will be in the confidence interval

Hans van Maanen writes:

Mag ik je weer een statistische vraag voorleggen? [May I put another statistics question to you?]

If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me. My visualisation is that she filled a bowl with 100 intervals, 95 of which do contain the true value and 5 do not, and she picked one at random.
Now, if she gives me two independent 95%-CIs (e.g., two primary endpoints in a clinical trial), I can only be 90% sure (0.95^2 = 0.9025) that they both contain the true value. If I have a table with four measurements and 95%-CIs, there’s only an 81% chance they all contain the true value.

Also, if we have two results and we want to be 95% sure both intervals contain the true values, we should construct two 97.5%-CIs (0.95^(1/2) = 0.9747), and if we want to have 95% confidence in four results, we need roughly 98.7%-CIs (0.95^(1/4) = 0.9873).

I’ve read quite a few texts trying to get my head around confidence intervals, but I don’t remember seeing this discussed anywhere. So am I completely off, is this a well-known issue, or have I just invented the Van Maanen Correction for Multiple Confidence Intervals? ;-))

Ik hoop dat je tijd hebt voor een antwoord. [I hope you have time for an answer.] It puzzles me!

My reply:

Ja hoor kan ik je hulpen, maar en engels [Sure, I can help you, but in English]:

1. “If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me.” Not quite true. Yes, true on average, but not necessarily true in any individual case. Some intervals are clearly wrong. Here’s the point: even if you picked an interval at random from the bowl, once you see the interval you have additional information. Sometimes the entire interval is implausible, suggesting that it’s likely that you happened to have picked one of the bad intervals in the bowl. Other times, the interval contains the entire range of plausible values, suggesting that you’re almost completely sure that you have picked one of the good intervals in the bowl. This can especially happen if your study is noisy and the sample size is small. For example, suppose you’re trying to estimate the difference in proportion of girl births, comparing two different groups of parents (for example, beautiful parents and ugly parents). You decide to conduct a study of N=400 births, with 200 in each group. Your estimate will be p2 – p1, with standard error sqrt(0.5^2/200 + 0.5^2/200) = 0.05, so your 95% conf interval will be p2 – p1 +/- 0.10. We happen to be pretty sure that any true population difference will be less than 0.01 (see here), hence if p2 – p1 is between -0.09 and +0.09, we can be pretty sure that our 95% interval does contain the true value. Conversely, if p2 – p1 is less than -0.11 or more than +0.11, then we can be pretty sure that our interval does not contain the true value. Thus, once we see the interval, it’s no longer generally a correct statement to say that you can be 95% sure the interval contains the true value.

2. Regarding your question: I don’t really think it makes sense to want 95% confidence in four results. It makes more sense to accept that our inferences are uncertain; we should not demand, or act as if, they must all be correct.
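Here is a quick simulation sketch of the conditional-coverage point in item 1 above, using the same numbers (standard error 0.05, interval +/- 0.10) and assuming a small true difference:

# Unconditional coverage is about 95%, but coverage conditional on what the
# interval looks like is not. True difference set to 0.005 for illustration.
set.seed(2025)
theta <- 0.005
se <- 0.05                    # sqrt(0.5^2/200 + 0.5^2/200)
est <- rnorm(1e5, theta, se)
lower <- est - 0.10; upper <- est + 0.10
covered <- (lower < theta) & (theta < upper)

mean(covered)                    # about 0.95 overall
mean(covered[abs(est) < 0.09])   # essentially 1 when the estimate is small
mean(covered[abs(est) > 0.11])   # essentially 0 when the estimate is large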

Claims about excess road deaths on “4/20” don’t add up

Sam Harper writes:

Since you’ve written about similar papers (that recent NRA study in NEJM, the birthday analysis) before and we linked to a few of your posts, I thought you might be interested in this recent blog post we wrote about a similar kind of study claiming that fatal motor vehicle crashes increase by 12% after 4:20pm on April 20th (an annual cannabis celebration…google it).

The post is by Harper and Adam Palayew, and it’s excellent. Here’s what they say:

A few weeks ago a short paper was published in a leading medical journal, JAMA Internal Medicine, suggesting that, over the 25 years from 1992-2016, excess cannabis consumption after 4:20pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after. Here is the key result from the paper:

In total, 1369 drivers were involved in fatal crashes after 4:20 PM on April 20 whereas 2453 drivers were in fatal crashes on control days during the same time intervals (corresponding to 7.1 and 6.4 drivers in fatal crashes per hour, respectively). The risk of a fatal crash was significantly higher on April 20 (relative risk, 1.12; 95% CI, 1.05-1.19; P = .001).
— Staples JA, Redelmeier DA. The April 20 Cannabis Celebration and Fatal Traffic Crashes in the United States. JAMA Intern Med, Feb 18, 2018, p. E2

Naturally, this sparked (heh) considerable media interest, not only because p<.05 and the finding is “surprising”, but also because cannabis is a hot topic these days (and, of course, April 20th happens every year).

But how seriously should we take these findings? Harper and Palayew crunch the numbers:

If we try and back out some estimates of what might have to happen on 4/20 to generate a 12% increase in the national rate of fatal car crashes, it seems less and less plausible that the 4/20 effect is reliable or valid. Let’s give it a shot. . . .

Over the 25 year period [the authors of the linked paper] tally 1369 deaths on 4/20 and 2453 deaths on control days, which works out to average deaths on those days each year of 1369/25 ~ 55 on 4/20 and 2453/25/2 ~ 49 on control days, an average excess of about 6 deaths each year. If we use our estimates of post-1620h VMT above, that works out to around 55/2.5 = 22 fatal crashes per billion VMT on 4/20 vs. 49/2.5 = 19.6 on control days. . . .

If we don’t assume the relative risk changes on 4/20, just more people smoking, what proportion of the population would need to be driving while high to generate a rate of 22 per billion VMT? A little algebra tells us that to get to 22 we’d need to see something like . . . 15%! That’s nearly one-sixth of the population driving while high on 4/20 from 4:20pm to midnight, which doesn’t, absent any other evidence, seem very likely. . . . Alternatively, one could also raise the relative risk among cannabis drivers to 6x the base rate and get something close. Or some combination of the two. This means either the nationwide prevalence of driving while using cannabis increases massively on 4/20, or the RR of a fatal crash with the kind of cannabis use happening on 4/20 is absurdly high. Neither of these scenarios seem particularly likely based on what we currently know about cannabis use and driving risks.
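For anyone who wants to push the numbers around themselves, here is the quoted back-of-the-envelope calculation as a short R sketch. The 2.5 billion vehicle-miles-traveled figure comes from the quoted post; the relative risk for driving while high is an assumption you have to supply, which is why the implied prevalence with rr = 2 lands in the same ballpark as the post’s roughly 15% rather than matching it exactly:

# Back-of-the-envelope version of the quoted calculation.
deaths_420     <- 1369 / 25        # ~55 fatal-crash drivers per year on 4/20
deaths_control <- 2453 / 25 / 2    # ~49 per year on control days
vmt <- 2.5                         # billions of post-4:20pm VMT, from the post

rate_420     <- deaths_420 / vmt       # ~22 fatal crashes per billion VMT
rate_control <- deaths_control / vmt   # ~19.6 per billion VMT

# If a fraction p of VMT is driven while high, with relative risk rr,
# then rate_420 = rate_control * (1 - p + rr * p). Solve for p:
implied_p <- function(rr) (rate_420 / rate_control - 1) / (rr - 1)
implied_p(rr = 2)   # about 0.12: a huge share of all driving
implied_p(rr = 6)   # about 0.02, closer to plausible use rates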

They also look at the big picture:

Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part. It’s similar to that NRA study (see link at beginning of this post) in that the numbers just don’t add up.

Harper sent me this email last year. I wrote the above post and scheduled it for 4/20. In the meantime, he had more to report:

We published a replication paper with some additional analysis. The original paper in question (in JAMA Internal Med no less) used a design (comparing an index ‘window’ on a given day to the same ‘window’ +/- 1 week) similar to some others that you have blogged about (the NRA study, for example), and I think it merits similar skepticism (a sizeable fraction of the population would need to be driving while drugged/intoxicated on this day to raise the national rate by such a margin).

As I said, my co-author Adam Palayew and I replicated that paper’s findings but also showed that their results seem much more consistent with daily variations in traffic crashes throughout the year (lots of noise) and we used a few other well known “risky” days (July 4th is quite reliable for excess deaths from traffic crashes) as a comparison. We also used Stan to fit some partial pooling models to look at how these “effects” may vary over longer time windows.

I wrote an updated blog post about it here.

And the gated version of the paper is now posted on Injury Prevention’s website, but we have made a preprint and all of the raw data and code to reproduce our work available at my Open Science page.

Stan!

A question about the piranha problem as it applies to A/B testing

Wicaksono Wijono writes:

While listening to your seminar about the piranha problem a couple weeks back, I kept thinking about a similar work situation but in the opposite direction. I’d be extremely grateful if you share your thoughts.

So the piranha problem is stated as “There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.” The task, then, is to find out which large effects are real and which are spurious.

At work, sometimes people bring up the opposite argument. When experiments (A/B tests) are pre-registered, a lot of times the results are not statistically significant. And a few months down the line people would ask if we can re-run the experiment, because the app or website has changed, and so the treatment might interact differently with the current version. So instead of arguing that large effects can be explained by an interaction of previously established large effects, some people argue that large effects are hidden by yet unknown interaction effects.

My gut reaction is a resounding no, because otherwise people would re-test things every time they don’t get the results they want, and the number of false positives would go up like crazy. But it feels like there is some ring of truth to the concerns they raise.

For instance, if the old website had a green layout, and we changed the button to green, then it might have a bad impact. However, if the current layout is red, making the button green might make it stand out more, and the treatment will have positive effect. In that regard, it will be difficult to see consistent treatment effects over time when the website itself keeps evolving and the interaction terms keep changing. Even for previously established significant effects, how do we know that the effect size estimated a year ago still holds true with the current version?

What do you think? Is there a good framework to evaluate just when we need to re-run an experiment, if that is even a good idea? I can’t find a satisfying resolution to this.

My reply:

I suspect that large effects are out there, but, as you say, the effects can be strongly dependent on context. So, even if an intervention works in a test, it might not work in the future because in the future the conditions will change in some way. Given all that, I think the right way to study this is to explicitly model effects as varying. For example, instead of doing a single A/B test of an intervention, you could try testing it in many different settings, and then analyze the results with a hierarchical model so that you’re estimating varying effects. Then when it comes to decision-making, you can keep that variation in mind.
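Here is a minimal sketch of that suggestion, assuming the rstanarm package and made-up data: the same intervention tested in a dozen settings, analyzed with a hierarchical model in which the treatment effect varies by setting.

library(rstanarm)

# Fake data: one A/B test of the same intervention in each of 12 settings.
set.seed(42)
n_setting <- 12
true_effect <- rnorm(n_setting, 0.02, 0.03)   # effects vary across settings
d <- do.call(rbind, lapply(1:n_setting, function(j) {
  n <- 500
  treat <- rbinom(n, 1, 0.5)
  y <- rnorm(n, 0.5 + true_effect[j] * treat, 0.25)
  data.frame(setting = j, treat = treat, y = y)
}))

# Hierarchical model: average effect plus setting-level variation in the effect.
fit <- stan_lmer(y ~ treat + (1 + treat | setting), data = d, refresh = 0)
print(fit, digits = 3)

The group-level standard deviation for treat is then an estimate of how much the effect moves around across contexts, which is the quantity you need when deciding whether an old test result still applies to a redesigned site.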

Lessons about statistics and research methods from that racial attitudes example

Yesterday we shared some discussions of recent survey results on racial attitudes.

For students and teachers of statistics or research methods, I think the key takeaway should be that you don’t want to pull out just one number from a survey; you want to get the big picture by looking at multiple questions, multiple years, and multiple data sources. You want to use the secret weapon.

Where do formal statistical theory and methods come in here? Not where you might think. No p-values or Bayesian inferences in the above-linked discussion, not even any confidence intervals or standard errors.

But that doesn’t mean that formal statistics are irrelevant, not at all.

Formal statistics gets used in the design and analysis of these surveys. We use probability and statistics to understand and design sampling strategies (cluster sampling, in the case of the General Social Survey) and to adjust for differences between sample and population (poststratification and survey weights, or, if these adjustments are deemed not necessary, statistical methods are used to make that call too).

Formal statistics underlies this sort of empirical work in social science—you just don’t see it because it was already done before you got to the data.
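To make the adjustment idea concrete, here is a tiny generic poststratification sketch (not the GSS’s actual weighting scheme; the groups and numbers are invented): reweight group means by known population shares.

# Sample over-represents older respondents; poststratify to population shares.
sample_data <- data.frame(
  group = c("18-34", "35-64", "65+"),
  n     = c(100, 250, 150),        # respondents per group
  ybar  = c(0.62, 0.48, 0.40)      # mean response within each group
)
pop_share <- c("18-34" = 0.30, "35-64" = 0.50, "65+" = 0.20)

raw_mean  <- with(sample_data, sum(n * ybar) / sum(n))
post_mean <- sum(pop_share[as.character(sample_data$group)] * sample_data$ybar)
c(raw = raw_mean, poststratified = post_mean)   # about 0.48 vs 0.51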

“Sometimes all we have left are pictures and fear”: Dan Simpson talk in Columbia stat dept, 4pm Monday

4:10pm Monday, April 22 in Social Work Bldg room 903:

Data is getting weirder. Statistical models and techniques are more complex than they have ever been. No one understands what code does. But at the same time, statistical tools are being used by a wider range of people than at any time in the past. And they are not just using our well-trodden, classical tools. They are working at the bleeding edge of what is possible. With this in mind, this talk will look at how much we can trust our tools. Do we ever really compute the thing we think we do? Can we ever be sure our code worked? Are there ways that it’s not safe to use the output? While “reproducibility” may be the watchword of the new scientific era, if we also want to ensure safety maybe all we have to lean on are pictures and fear.

Important stuff.

Changing racial differences in attitudes on changing racial differences

Elin Waring writes:

Have you been following the release of GSS results this year? I had been vaguely aware that there was reporting on a few items, but then I happened to run the natrace and natracey variables (I use these in my class to look at question wording); they come from the questions about whether we are spending too much, too little, or about the right amount on “Improving the conditions of blacks” and “aid to blacks” (the images are from the SDA website at Berkeley):

Much as I [Waring] would love to believe that the American public really has changed racial attitudes, I find such a huge shift over such a short time very unlikely given what we know about stability of attitudes. And I even broke it down by age and there was a shift for all the age groups.

Then I saw this, and a colleague mentioned to me that the results for proportion not sexually active were strange. And then today people talking about the increase in the proportion not religiously affiliated.

It just seems very odd to me and I wondered if you had noticed it too. Could it be they just hit a strange cluster in their sampling? Or a weighting error of some kind? It’s true that attitudes on gay marriage changed very fast and that seems real, but this seems so surprising across so many separate issues.

I wasn’t sure so I passed this along to David Weakliem, my go-to guy when it comes to making sense of surveys and public opinion. Weakliem responded with some preliminary thoughts:

It did seem hard to believe at first. But there was a big move from 2014 to 2016 too (bigger than 2016-8), so if there is a problem with the survey it’s not just with 2018. The GSS also has a general question about whether the government has a special obligation to help blacks vs. no special treatment, and that also showed large moves in a liberal direction from 2014-6 and again from 2016-8. Finally, I looked for relevant questions from other surveys. There are some about how much discrimination there is. In 2013 and 2014, 19% and then 17% said there was a lot of discrimination against “African Americans” but in 2015 it was 36%; in 2016 and 2017 the question referred to “blacks” and 40% said there was a lot. So it seems that there really has been a substantial change in opinions about race since 2014. As far as why, I would guess that the media coverage and videos of police mistreatment of blacks had an impact—they made people think there really is a problem.

To which Waring replied:

The one thing I’d say in response to David is that while he could be right, these are shifts across a number of the long term variables not just the racial attitudes. Also I think that GSS is intentionally designed to not be so responsive to day to day fluctuations based on the latest news. And POLHITOK sees an increase in “no” responses in 2018 but not so dramatic and it looks like it’s in the same general territory as others from 2006 forward.

What really made me look at those particular variables was all the recent talk about reparations for slavery.

I also saw that Jay Livingston, who I wish had his own column in the New York Times—I’d rather see a sociologist’s writing about sociology, than an ignorant former reporter’s writing about sociology—wrote something recently on survey attitudes regarding racial equality, but using a different data source:

Just last week, Pew published a report (here) about race in the US. Among many other things, it asked respondents about the “major” reasons that Black people “have a harder time getting ahead.” As expected, Whites were more likely to point to cultural/personal factors, Blacks to structural ones. But compared with a similar survey Pew did just three years ago, it looks like everyone is becoming more woke. . . .

For “racial discrimination,” the Black-White difference remains large. But in both groups, the percentage citing it as a major cause increases – by 14 points among Blacks, by nearly 20 points among Whites. The percentage identifying access to good schools as an important factor has not changed as much, increasing slightly among both Blacks and Whites.

More curious are the responses about jobs. In 2013, far more Whites than Blacks said that the lack of jobs was a major factor. In the intervening three years, jobs as a reason for not getting ahead became more salient among Blacks, less so among Whites.

At the same time, “culture of poverty” explanations became less popular.

Livingston continues with some GSS data and then concludes:

If both Whites and Blacks are paying more attention to racial discrimination and less to personal-cultural factors, if everyone is more woke, how does this square with the widely held perception that in the era of Trump, racism is on the rise? (In the Pew survey, 56% overall and 49% of Whites said Trump has made race relations worse. In no group, even self-identified conservatives, does anything even close to a majority say that Trump has made race relations better.)

The data here points to a more complex view of recent history. The nastiest of the racists may have felt freer to express themselves in word and deed. And when they do, they make the news. Hence the widespread perception that race relations have deteriorated. But surveys can tell us what we don’t see on the news and Twitter. And in this case what they tell us is that the overall trend among Whites has been towards more liberal views on the causes of race differences in who gets ahead.

Interesting. Also an increasing proportion of Americans are neither white nor black. So lots going on here.

P.S. Livingston adds:

I also noticed something when I was checking the GSS data that Tristan Bridges posted about LGB self-identification. For those variables (and maybe others—I haven’t looked), the GSS 2014 sample was much larger than in other years before and since, and the 2018 sample smaller. That shouldn’t affect the actual percents, but with fairly rare responses like identifying as gay, the sample size did make me pause to wonder. With larger-n attitude items it shouldn’t matter.

I followed the link to Bridges’s blog, which had lots of interesting stuff, including this post from 2016, Why Popular Boy Names are More Popular than Popular Girl Names, which featured this familiar-looking graph:

Why did this graph look so familiar?? Because I plotted the exact same data in 2013:


I assume that Bridges just independently came up with the same idea that I had—these are public data, and counting the top 10 names is a pretty obvious thing to do, I guess. It was just funny to come across this graph again, in an unexpected place.

Abandoning statistical significance is both sensible and practical

Valentin Amrhein, Sander Greenland, Blakeley McShane, and I write:

Dr Ioannidis writes against our proposals [here and here] to abandon statistical significance in scientific reasoning and publication, as endorsed in the editorial of a recent special issue of an American Statistical Association journal devoted to moving to a “post p<0.05 world.” We appreciate that he echoes our calls for “embracing uncertainty, avoiding hyped claims…and recognizing ‘statistical significance’ is often poorly understood.” We also welcome his agreement that the “interpretation of any result is far more complicated than just significance testing” and that “clinical, monetary, and other considerations may often have more importance than statistical findings.”

Nonetheless, we disagree that a statistical significance-based “filtering process is useful to avoid drowning in noise” in science and instead view such filtering as harmful. First, the implicit rule to not publish nonsignificant results biases the literature with overestimated effect sizes and encourages “hacking” to get significance. Second, nonsignificant results are often wrongly treated as zero. Third, significant results are often wrongly treated as truth rather than as the noisy estimates they are, thereby creating unrealistic expectations of replicability. Fourth, filtering on statistical significance provides no guarantee against noise. Instead, it amplifies noise because the quantity on which the filtering is based (the p-value) is itself extremely noisy and is made more so by dichotomizing it.

We also disagree that abandoning statistical significance will reduce science to “a state of statistical anarchy.” Indeed, the journal Epidemiology banned statistical significance in 1990 and is today recognized as a leader in the field.

Valid synthesis requires accounting for all relevant evidence—not just the subset that attained statistical significance. Thus, researchers should report more, not less, providing estimates and uncertainty statements for all quantities, justifying any exceptions, and considering ways the results are wrong. Publication criteria should be based on evaluating study design, data quality, and scientific content—not statistical significance.

Decisions are seldom necessary in scientific reporting. However, when they are required (as in clinical practice), they should be made based on the costs, benefits, and likelihoods of all possible outcomes, not via arbitrary cutoffs applied to statistical summaries such as p-values which capture little of this picture.

The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.

I especially like our title and our last paragraph!

Let me also emphasize that we have a lot of positive advice on how researchers can design studies and collect and analyze data (see for example here, here, and here). “Abandon statistical significance” is not the main thing we have to say. We’re writing about statistical significance to do our best to clear up some points of confusion, but our ultimate message in most of our writing and practice is to offer positive alternatives.

P.S. Also to clarify: “Abandon statistical significance” does not mean “Abandon statistical methods.” I do think it’s generally a good idea to produce estimates accompanied by uncertainty statements. There’s lots and lots to be done.
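To illustrate the point in the letter above that filtering on statistical significance amplifies rather than removes noise, here is a minimal simulation sketch (the numbers are arbitrary): a small true effect, a noisy study design, and a look at what the “significant” estimates end up looking like.

set.seed(7)
true_effect <- 0.1
se <- 0.5
est <- rnorm(1e5, true_effect, se)    # 100,000 hypothetical noisy studies
significant <- abs(est / se) > 1.96   # the filter

mean(significant)                           # about 5% of studies clear the bar
mean(abs(est[significant])) / true_effect   # the survivors exaggerate the effect ~10x
mean(est[significant] < 0)                  # and over a quarter have the wrong sign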

The network of models and Bayesian workflow, related to generative grammar for statistical models

Ben Holmes writes:

I’m a machine learning guy working in fraud prevention, and a member of some biostatistics and clinical statistics research groups at Wright State University in Dayton, Ohio.

I just heard your talk “Theoretical Statistics is the Theory of Applied Statistics” on YouTube, and was extremely interested in the idea of a ‘model space’ for exploring and choosing among possible models.

I was wondering if you knew of any work being done on R (or Python, or whatever, I’m not picky!) packages for this, or could recommend a place to start reading more about the theory/concept.

My reply:

I love this idea of the network of models but I’ve never written anything formal on it, nor do I have any software implementations. Here’s a talk on the topic from 2011, and here’s a post from 2017 with some comments from others too.

I still think this is an important topic—it relates to the idea of a generative grammar for building statistical models, and it should fit in well with Stan. So I’m posting this in the hope that someone will follow up and do it in some way.

Parliamentary Constituency Factsheet for Indicators of Nutrition, Health and Development in India

S. V. Subramanian writes:

In India, data on key developmental indicators that inform policies and interventions are routinely available for the administrative units of districts but not for the political units of Parliamentary Constituencies (PC). Members of Parliament (MPs) in the Lok Sabha, each representing one of the 543 PCs as per the 2014 India map, are the representatives with the most direct interaction with their constituents. The MPs are responsible for articulating the vision and the implementation of public policies at the national level and for their respective constituencies. In order for MPs to efficiently and effectively serve their people, and also for the constituents to understand the performance of their MPs, it is critical to produce the most accurate and up-to-date evidence on the state of health and well-being at the PC level. However, the absence of PC identifiers in nationally representative surveys or the Census has precluded an assessment of how a PC is doing with regard to key indicators of nutrition, health and development.

On this website, we report PC estimates for indicators of nutrition, health and development derived from two data sources:

The National Family Health Survey 4 (NFHS-4) District Factsheets
The National Sample Survey (NSS), 2010-11, 2011-12, 2014 (Author calculations) . . .

The PC estimates for each of the indicators are classified into quintiles for map visualizations. Currently, we provide map-based visualizations for a subset of indicators, and these will be continually updated for additional indicators. . . .

In addition to providing a visualization of indicators at the PC level, we also provide tables of the PC estimates. . . .

Further details are at the link.

I’ve not looked at this all myself, but I thought it could be of interest to some of you.

State-space models in Stan

Michael Ziedalski writes:

For the past few months I have been delving into Bayesian statistics and have (without hyperbole) finally found statistics intuitive and exciting. Recently I have gone into Bayesian time series methods; however, I have found no libraries to use that can implement those models.

Happily, I found Stan because it seemed among the most mature and flexible Bayesian libraries around, but is there any guide/book you could recommend for approaching state-space models through Stan? I am referring to more complex models, such as those found in State-Space Models, by Zeng and Wu, as well as Bayesian Analysis of Stochastic Process Models, by Insua et al. Most advanced books seem to use WinBUGS, but that library is closed-source and a bit older.

I replied that he should post his question on the Stan mailing list and also look at the example models and case studies for Stan.

I also passed the question on to Jim Savage, who added:

Stan’s great for time series, though mostly because it just allows you to flexibly write down whatever likelihood you want and put very flexible priors on everything, then fits it swiftly with a modern sampler and lets you do diagnoses that are difficult/impossible elsewhere!

Jeff Arnold has a fairly complete set of implementations for state-space models in Stan here. I’ve also got some more introductory blog posts that might help you get your head around writing out some time-series models in Stan. Here’s one on hierarchical VAR models. Here’s another on Hamilton-style regime-switching models. I’ve got a half-written tutorial on state-space models that I’ll come back to when I’m writing the time-series chapter in our Bayesian econometrics in Stan book.

One of the really nice things about Stan is that you can write out your state as parameters. Because Stan can efficiently sample from parameter spaces with hundreds of thousands of dimensions (if a bit slowly), this is fine. It’ll just be slower than a standard Kalman filter. It also changes the interpretation of the state estimate somewhat (more akin to a Kalman smoother, given you use all observations to fit the state).

Here’s an example of such a model.

Actually that last model had some problems with the between-state correlations, but I guess it’s still a good example of how to put something together in Markdown.
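For readers wondering what “writing out your state as parameters” looks like in practice, here is a minimal local-level (random walk plus noise) model, fit from R with rstan. It is a generic sketch with simulated data, not the model from Jim’s posts:

library(rstan)

model_code <- "
data {
  int<lower=2> N;
  vector[N] y;
}
parameters {
  vector[N] mu;                // latent state, one parameter per time point
  real<lower=0> sigma_obs;
  real<lower=0> sigma_state;
}
model {
  mu[1] ~ normal(y[1], 10);
  mu[2:N] ~ normal(mu[1:(N - 1)], sigma_state);  // state evolution
  y ~ normal(mu, sigma_obs);                     // measurement
  sigma_obs ~ normal(0, 1);
  sigma_state ~ normal(0, 1);
}
"

set.seed(99)
N <- 100
state <- cumsum(rnorm(N, 0, 0.3))
y <- state + rnorm(N, 0, 0.5)

fit <- stan(model_code = model_code, data = list(N = N, y = y),
            chains = 2, iter = 1000, refresh = 0)
print(fit, pars = c("sigma_obs", "sigma_state"))

Because the states are parameters, the posterior for mu uses all the observations and so behaves like a smoother rather than a filter, as Jim notes above.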

All statistical conclusions require assumptions.

Mark Palko points us to this 2009 article by Itzhak Gilboa, Andrew Postlewaite, and David Schmeidler, which begins:

This note argues that, under some circumstances, it is more rational not to behave in accordance with a Bayesian prior than to do so. The starting point is that in the absence of information, choosing a prior is arbitrary. If the prior is to have meaningful implications, it is more rational to admit that one does not have sufficient information to generate a prior than to pretend that one does. This suggests a view of rationality that requires a compromise between internal coherence and justification, similarly to compromises that appear in moral dilemmas. Finally, it is argued that Savage’s axioms are more compelling when applied to a naturally given state space than to an analytically constructed one; in the latter case, it may be more rational to violate the axioms than to be Bayesian.

The paper expresses various misconceptions, for example the statement that the Bayesian approach requires a “subjective belief.” All statistical conclusions require assumptions, and a Bayesian prior distribution can be as subjective or un-subjective as any other assumption in the model. For example, I don’t recall seeing textbooks on statistical methods referring to the subjective belief underlying logistic regression or the Poisson distribution; I guess if you assume a model but you don’t use the word “Bayes,” then assumptions are just assumptions.

More generally, it seems obvious to me that no statistical method will work best under all circumstances, hence I have no disagreement whatsoever with the opening sentence quoted above. I can’t quite see why they need 12 pages to make this argument, but whatever.

P.S. Also relevant is this discussion from a few years ago: The fallacy of the excluded middle—statistical philosophy edition.

Works of art that are about themselves

I watched Citizen Kane (for the umpteenth time) the other day and was again struck by how it is a movie about itself. Kane is William Randolph Hearst, but he’s also Orson Welles, boy wonder, and the movie Citizen Kane is self-consciously a masterpiece.

Some other examples of movies that are about themselves are La La Land, Primer (a low-budget experiment about a low-budget experiment), and Titanic (the biggest movie ever made, about the biggest boat ever made).

I want to call this Objects of the Class X, but I’m not sure what X is.

Several reviews of Deborah Mayo’s new book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

A few months ago I sent the following message to some people:

Dear philosophically-inclined colleagues:

I’d like to organize an online discussion of Deborah Mayo’s new book.

The table of contents and some of the book are here at Google books, also in the attached pdf and in this post by Mayo.

I think that many, if not all, of Mayo’s points in her Excursion 4 are answered by my article with Hennig here.

What I was thinking for this discussion is that if you’re interested you can write something, either a review of Mayo’s book (if you happen to have a copy of it) or a review of the posted material, or just your general thoughts on the topic of statistical inference as severe testing.

I’m hoping to get this all done this month, because it’s all informal and what’s the point of dragging it out, right? So if you’d be interested in writing something on this that you’d be willing to share with the world, please let me know. It should be fun, I hope!

I did this in consultation with Deborah Mayo, and I just sent this email to a few people (so if you were not included, please don’t feel left out! You have a chance to participate right now!), because our goal here was to get the discussion going. The idea was to get some reviews, and this could spark a longer discussion here in the comments section.

And, indeed, we received several responses. And I’ll also point you to my paper with Shalizi on the philosophy of Bayesian statistics, with discussions by Mark Andrews and Thom Baguley, Denny Borsboom and Brian Haig, John Kruschke, Deborah Mayo, Stephen Senn, and Richard D. Morey, Jan-Willem Romeijn and Jeffrey N. Rouder.

Also relevant is this summary by Mayo of some examples from her book.

And now on to the reviews.

Brian Haig

I’ll start with psychology researcher Brian Haig, because he’s a strong supporter of Mayo’s message and his review also serves as an introduction and summary of her ideas. The review itself is a few pages long, so I will quote from it, interspersing some of my own reaction:

Deborah Mayo’s ground-breaking book, Error and the Growth of Experimental Knowledge (1996) . . . presented the first extensive formulation of her error-statistical perspective on statistical inference. Its novelty lay in the fact that it employed ideas in statistical science to shed light on philosophical problems to do with evidence and inference.

By contrast, Mayo’s just-published book, Statistical inference as severe testing (SIST) (2018), focuses on problems arising from statistical practice (“the statistics wars”), but endeavors to solve them by probing their foundations from the vantage points of philosophy of science, and philosophy of statistics. The “statistics wars” to which Mayo refers concern fundamental debates about the nature and foundations of statistical inference. These wars are longstanding and recurring. Today, they fuel the ongoing concern many sciences have with replication failures, questionable research practices, and the demand for an improvement of research integrity. . . .

For decades, numerous calls have been made for replacing tests of statistical significance with alternative statistical methods. The new statistics, a package deal comprising effect sizes, confidence intervals, and meta-analysis, is one reform movement that has been heavily promoted in psychological circles (Cumming, 2012; 2014) as a much needed successor to null hypothesis significance testing (NHST) . . .

The new statisticians recommend replacing NHST with their favored statistical methods by asserting that it has several major flaws. Prominent among them are the familiar claims that NHST encourages dichotomous thinking, and that it comprises an indefensible amalgam of the Fisherian and Neyman-Pearson schools of thought. However, neither of these features applies to the error-statistical understanding of NHST. . . .

There is a double irony in the fact that the new statisticians criticize NHST for encouraging simplistic dichotomous thinking: As already noted, such thinking is straightforwardly avoided by employing tests of statistical significance properly, whether or not one adopts the error-statistical perspective. For another, the adoption of standard frequentist confidence intervals in place of NHST forces the new statisticians to engage in dichotomous thinking of another kind: A parameter estimate is either inside, or outside, its confidence interval.

At this point I’d like to interrupt and say that a confidence interval (or simply an estimate with a standard error) can be used to give a sense of inferential uncertainty. There is no reason for dichotomous thinking when confidence intervals, or uncertainty intervals, or standard errors, are used in practice.

Here’s a very simple example from my book with Jennifer:

This graph has a bunch of estimates +/- standard errors, that is, 68% confidence intervals, with no dichotomous thinking in sight. In contrast, testing some hypothesis of no change over time, or no change during some period of time, would make no substantive sense and would just be an invitation to add noise to our interpretation of these data.
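Here is a generic sketch of that sort of display in R (made-up numbers, not the actual figure from our book): a time series of estimates shown with +/- 1 standard error bars and no significance filter anywhere.

set.seed(3)
year <- seq(1976, 2008, by = 4)
est  <- 0.02 * (year - 1990) / 10 + rnorm(length(year), 0, 0.01)
se   <- rep(0.015, length(year))

plot(year, est, pch = 16, ylim = range(est - se, est + se),
     xlab = "Year", ylab = "Estimate")
segments(year, est - se, year, est + se)   # 68% intervals
abline(h = 0, col = "gray", lty = 2)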

OK, to continue with Haig’s review:

Error-statisticians have good reason for claiming that their reinterpretation of frequentist confidence intervals is superior to the standard view. The standard account of confidence intervals adopted by the new statisticians prespecifies a single confidence interval (a strong preference for 0.95 in their case). . . . By contrast, the error-statistician draws inferences about each of the obtained values according to whether they are warranted, or not, at different severity levels, thus leading to a series of confidence intervals. Crucially, the different values will not have the same probative force. . . . Details on the error-statistical conception of confidence intervals can be found in SIST (pp. 189-201), as well as Mayo and Spanos (2011) and Spanos (2014). . . .

SIST makes clear that, with its error-statistical perspective, statistical inference can be employed to deal with both estimation and hypothesis testing problems. It also endorses the view that providing explanations of things is an important part of science.

Another interruption from me . . . I just want to plug my paper with Guido Imbens, Why ask why? Forward causal inference and reverse causal questions, in which we argue that Why questions can be interpreted as model checks, or, one might say, hypothesis tests—but tests of hypotheses of interest, not of straw-man null hypotheses. Perhaps there’s some connection between Mayo’s ideas and those of Guido and me on this point.

Haig continues with a discussion of Bayesian methods, including those of my collaborators and myself:

One particularly important modern variant of Bayesian thinking, which receives attention in SIST, is the falsificationist Bayesianism of . . . Gelman and Shalizi (2013). Interestingly, Gelman regards his Bayesian philosophy as essentially error-statistical in nature – an intriguing claim, given the anti-Bayesian preferences of both Mayo and Gelman’s co-author, Cosma Shalizi. . . . Gelman acknowledges that his falsificationist Bayesian philosophy is underdeveloped, so it will be interesting to see how its further development relates to Mayo’s error-statistical perspective. It will also be interesting to see if Bayesian thinkers in psychology engage with Gelman’s brand of Bayesian thinking. Despite the appearance of his work in a prominent psychology journal, they have yet to do so. . . .

Hey, not quite! I’ve done a lot of collaboration with psychologists; see here and search on “Iven Van Mechelen” and “Francis Tuerlinckx”—but, sure, I recognize that our Bayesian methods, while mainstream in various fields including ecology and political science, are not yet widely used in psychology.

Haig concludes:

From a sympathetic, but critical, reading of Popper, Mayo endorses his strategy of developing scientific knowledge by identifying and correcting errors through strong tests of scientific claims. . . . A heartening attitude that comes through in SIST is the firm belief that a philosophy of statistics is an important part of statistical thinking. This contrasts markedly with much of statistical theory, and most of statistical practice. Given that statisticians operate with an implicit philosophy, whether they know it or not, it is better that they avail themselves of an explicitly thought-out philosophy that serves practice in useful ways.

I agree, very much.

To paraphrase Bill James, the alternative to good philosophy is not “no philosophy,” it’s “bad philosophy.” I’ve spent too much time seeing Bayesians avoid checking their models out of a philosophical conviction that subjective priors cannot be empirically questioned, and too much time seeing non-Bayesians produce ridiculous estimates that could have been avoided by using available outside information. There’s nothing so practical as good practice, but good philosophy can facilitate both the development and acceptance of better methods.

E. J. Wagenmakers

I’ll follow up with a very short review, or, should I say, reaction-in-place-of-a-review, from psychometrician E. J. Wagenmakers:

I cannot comment on the contents of this book, because doing so would require me to read it, and extensive prior knowledge suggests that I will violently disagree with almost every claim that is being made. In my opinion, the only long-term hope for vague concepts such as the “severity” of a test is to embed them within a rational (i.e., Bayesian) framework, but I suspect that this is not the route that the author wishes to pursue. Perhaps this book is comforting to those who have neither the time nor the desire to learn Bayesian inference, in a similar way that homeopathy provides comfort to patients with a serious medical condition.

You don’t have to agree with E. J. to appreciate his honesty!

Art Owen

Coming from a different perspective is theoretical statistician Art Owen, whose review has some mathematical formulas—nothing too complicated, but not so easy to display in html, so I’ll just link to the pdf and share some excerpts:

There is an emphasis throughout on the importance of severe testing. It has long been known that a test that fails to reject H0 is not very conclusive if it had low power to reject H0. So I wondered whether there was anything more to the severity idea than that. After some searching I found on page 343 a description of how the severity idea differs from the power notion. . . .

I think that it might be useful in explaining a failure to reject H0 as the sample size being too small. . . . it is extremely hard to measure power post hoc because there is too much uncertainty about the effect size. Then, even if you want it, you probably cannot reliably get it. I think severity is likely to be in the same boat. . . .

I believe that the statistical problem from incentives is more severe than choice between Bayesian and frequentist methods or problems with people not learning how to use either kind of method properly. . . . We usually teach and do research assuming a scientific loss function that rewards being right. . . . In practice many people using statistics are advocates. . . . The loss function strongly informs their analysis, be it Bayesian or frequentist. The scientist and advocate both want to minimize their expected loss. They are led to different methods. . . .

I appreciate Owen’s efforts to link Mayo’s words to the equations that we would ultimately need to implement, or evaluate, her ideas in statistics.
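For readers who haven’t seen severity calculations, here is a rough numerical sketch of my own (not Owen’s or Mayo’s), using the simplest one-sided Normal-mean example and my reading of the severity formula in Mayo and Spanos (2011). Post-hoc power asks how often the test would reject if the true mean were some alternative value; severity asks how well the actual non-rejection warrants the claim that the mean is no larger than that value.

import numpy as np
from scipy.stats import norm

# One-sided test of H0: mu <= 0 vs H1: mu > 0, sigma known; made-up numbers.
sigma, n, alpha = 1.0, 25, 0.05
xbar_obs = 0.2                      # observed sample mean; the test does not reject
se = sigma / np.sqrt(n)
z_crit = norm.ppf(1 - alpha)        # rejection cutoff on the z scale

def post_hoc_power(mu1):
    # P(test rejects H0) if the true mean were mu1 -- does not use xbar_obs
    return 1 - norm.cdf(z_crit - mu1 / se)

def severity_mu_le(mu1):
    # Severity of the claim "mu <= mu1" after this non-rejection:
    # P(a sample mean larger than the one observed, if mu were mu1)
    return 1 - norm.cdf((xbar_obs - mu1) / se)

for mu1 in [0.1, 0.2, 0.3, 0.5]:
    print(mu1, round(post_hoc_power(mu1), 3), round(severity_mu_le(mu1), 3))

The contrast is that post-hoc power is a property of the test alone, while severity also uses the observed result, so the two can tell different stories about the same non-rejection.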

Robert Cousins

Physicist Robert Cousins did not have the time to write a comment on Mayo’s book, but he did point us to this monograph he wrote on the foundations of statistics, which has lots of interesting stuff but is unfortunately a bit out of date when it comes to the philosophy of Bayesian statistics, which he ties in with subjective probability. (For a corrective, see my aforementioned article with Hennig.)

In his email to me, Cousins also addressed issues of statistical and practical significance:

Our [particle physicists’] problems and the way we approach them are quite different from some other fields of science, especially social science. As one example, I think I recall reading that you do not mind adding a parameter to your model, whereas adding (certain) parameters to our models means adding a new force of nature (!) and a Nobel Prize if true. As another example, a number of statistics papers talk about how silly it is to claim a 10^{-4} departure from 0.5 for a binomial parameter (ESP examples, etc.), using it as a classic example of the difference between nominal (probably mismeasured) statistical significance and practical significance. In contrast, when I was a grad student, a famous experiment in our field measured a 10^{-4} departure from 0.5 with an uncertainty of 10% of itself, i.e., with an uncertainty of 10^{-5}. (Yes, on the order of 10^{10} Bernoulli trials—counting electrons being scattered left or right.) This led quickly to a Nobel Prize for Steven Weinberg et al., whose model (now “Standard”) had predicted the effect.

I replied:

This interests me in part because I am a former physicist myself. I have done work in physics and in statistics, and I think the principles of statistics that I have applied to social science also apply to the physical sciences. Regarding the discussion of Bem’s experiment, what I said was not that an effect of 0.0001 is unimportant, but rather that if you were to really believe Bem’s claims, there could be effects of +0.0001 in some settings, -0.002 in others, etc. If this is interesting, fine: I’m not a psychologist. One of the key mistakes of Bem and others like him is to suppose that, just because they may have discovered an effect in some scenario, it must represent some sort of universal truth. Humans differ from each other in a way that elementary particles do not.

And Cousins replied:

Indeed in the binomial experiment I mentioned, controlling unknown systematic effects to the level of 10^{-5}, so that what they were measuring (a constant of nature called the Weinberg angle, now called the weak mixing angle) was what they intended to measure, was a heroic effort by the experimentalists.
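As an aside, the order of magnitude in Cousins’ example is easy to check: for a proportion near one half the binomial standard error is about 0.5/sqrt(n), so pinning the proportion down to 10^{-5} takes a few billion trials. A back-of-envelope version (my arithmetic, not Cousins’):

target_se = 1e-5          # desired uncertainty on the proportion
per_trial_sd = 0.5        # sd of a single Bernoulli(0.5) outcome
n = (per_trial_sd / target_se) ** 2
print(f"{n:.1e}")         # about 2.5e+09 trials at a minimum, consistent with the
                          # order-of-10^10 figure quoted above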

Stan Young

Stan Young, a statistician who’s worked in the pharmaceutical industry, wrote:

I’ve been reading the Mayo book and also pestering where I think poor statistical practice is going on. Usually the poor practice is by non-professionals, and usually it is not intentionally malicious, however self-serving. But I think it naive to think that education is all that is needed, or that some grand agreement among professional statisticians will end the problems.

There are science crooks and statistical crooks and there are no cops, or very few.

That is a long way of saying, this problem is not going to be solved in 30 days, or by one paper, or even by one book or by three books! (I’ve read all three.)

I think a more open-ended and longer dialog would be more useful with at least some attention to willful and intentional misuse of statistics.

Chambers C. The Seven Deadly Sins of Psychology. New Jersey: Princeton University Press, 2017.

Harris R. Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions. New York: Basic Books, 2017.

Hubbard R. Corrupt Research. London: Sage Publications, 2015.

Christian Hennig

Hennig, a statistician and my collaborator on the Beyond Subjective and Objective paper, sent in two reviews of Mayo’s book.

Here are his general comments:

What I like about Deborah Mayo’s “Statistical Inference as Severe Testing”

Before I start to list what I like about “Statistical Inference as Severe Testing”, I should say that I don’t agree with everything in the book. In particular, as a constructivist I am skeptical about the use of terms like “objectivity”, “reality” and “truth” in the book, and I think that Mayo’s own approach may not be able to deliver everything that people may come to believe it could, from reading the book (although Mayo could argue that overly high expectations could be avoided by reading carefully).

So now, what do I like about it?

1) I agree with the broad concept of severity and severe testing. In order to have evidence for a claim, it has to be tested in ways that would reject the claim with high probability if it indeed were false. I also think that it makes a lot of sense to start a philosophy of statistics and a critical discussion of statistical methods and reasoning from this requirement. Furthermore, throughout the book Mayo consistently argues from this position, which makes the different “Excursions” fit well together and add up to a consistent whole.

2) I get a lot out of the discussion of the philosophical background of scientific inquiry, of induction, probabilism, falsification and corroboration, and their connection to statistical inference. I think that it makes sense to connect Popper’s philosophy to significance tests in the way Mayo does (without necessarily claiming that this is the only possible way to do it), and I think that her arguments are broadly convincing at least if I take a realist perspective of science (which as a constructivist I can do temporarily while keeping the general reservation that this is about a specific construction of reality which I wouldn’t grant absolute authority).

3) I think that Mayo does by and large a good job listing much of the criticism that has been raised in the literature against significance testing, and she deals with it well. Partly she criticises bad uses of significance testing herself by referring to the severity requirement, but she also defends a well understood use in a more general philosophical framework of testing scientific theories and claims in a piecemeal manner. I find this largely convincing, conceding that there is a lot of detail and that I may find myself in agreement with the occasional objection against the odd one of her arguments.

4) The same holds for her comprehensive discussion of Bayesian/probabilist foundations in Excursion 6. I think that she elaborates issues and inconsistencies in the current use of Bayesian reasoning very well, maybe with the odd exception.

5) I am in full agreement with Mayo’s position that when using probability modelling, it is important to be clear about the meaning of the computed probabilities. Agreement in numbers between different “camps” isn’t worth anything if the numbers mean different things. A problem with some positions that are sold as “pragmatic” these days is that often not enough care is put into interpreting what the results mean, or even deciding in advance what kind of interpretation is desired.

6) As mentioned above, I’m rather skeptical about the concept of objectivity and about an all too realist interpretation of statistical models. I think that in Excursion 4 Mayo manages to explain in a clear manner what her claims of “objectivity” actually mean, and she also appreciates more clearly than before the limits of formal models and their distance to “reality”, including some valuable thoughts on what this means for model checking and arguments from models.

So overall it was a very good experience to read her book, and I think that it is a very valuable addition to the literature on foundations of statistics.

Hennig also sent some specific discussion of one part of the book:

1 Introduction

This text discusses parts of Excursion 4 of Mayo (2018) titled “Objectivity and Auditing”. This starts with the section title “The myth of ‘The myth of objectivity'”. Mayo advertises objectivity in science as central and as achievable.

In contrast, in Gelman and Hennig (2017) we write: “We argue that the words ‘objective’ and ‘subjective’ in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes.” I will here outline agreement and disagreement that I have with Mayo’s Excursion 4, and raise some issues that I think require more research and discussion.

2 Pushback and objectivity

The second paragraph of Excursion 4 states in bold letters: “The Key Is Getting Pushback”, and this is the major source of agreement between Mayo’s and my views (*). I call myself a constructivist, and this is about acknowledging the impact of human perception, action, and communication on our world-views, see Hennig (2010). However, it is an almost universal experience that we cannot construct our perceived reality as we wish, because we experience “pushback” from what we perceive as “the world outside”. Science is about allowing us to deal with this pushback in stable ways that are open to consensus. A major ingredient of such science is the “Correspondence (of scientific claims) to observable reality”, and in particular “Clear conditions for reproduction, testing and falsification”, listed as “Virtue 4/4(b)” in Gelman and Hennig (2017). Consequently, there is no disagreement with much of the views and arguments in Excursion 4 (and the rest of the book). I actually believe that there is no contradiction between constructivism understood in this way and Chang’s (2012) “active scientific realism” that asks for action in order to find out about “resistance from reality”, or in other words, experimenting, experiencing and learning from error.

If what is called “objectivity” in Mayo’s book were the generally agreed meaning of the term, I would probably not have a problem with it. However, there is a plethora of meanings of “objectivity” around, and on top of that the term is often used as a sales pitch by scientists in order to lend authority to findings or methods and often even to prevent them from being questioned. Philosophers understand that this is a problem but are mostly eager to claim the term anyway; I have attended conferences on philosophy of science and heard a good number of talks, some better, some worse, with messages of the kind “objectivity as understood by XYZ doesn’t work, but here is my own interpretation that fixes it”. Calling frequentist probabilities “objective” because they refer to the outside world rather than epistemic states, and calling a Bayesian approach “objective” because priors are chosen by general principles rather than personal beliefs are in isolation also legitimate meanings of “objectivity”, but these two and Mayo’s and many others (see also the Appendix of Gelman and Hennig, 2017) differ. The use of “objectivity” in public and scientific discourse is a big muddle, and I don’t think this will change as a consequence of Mayo’s work. I prefer stating what we want to achieve more precisely using less loaded terms, which I think Mayo has achieved well not by calling her approach “objective” but rather by explaining in detail what she means by that.

3. Trust in models?

In the remainder, I will highlight some limitations of Mayo’s “objectivity” that are mainly connected to Tour IV on objectivity, model checking and whether it makes sense to say that “all models are false”. Error control is central for Mayo’s objectivity, and this relies on error probabilities derived from probability models. If we want to rely on these error probabilities, we need to trust the models, and, very appropriately, Mayo devotes Tour IV to this issue. She concedes that all models are false, but states that this is rather trivial, and what is really relevant when we use statistical models for learning from data is rather whether the models are adequate for the problem we want to solve. Furthermore, model assumptions can be tested and it is crucial to do so, which, as follows from what was stated before, does not mean to test whether they are really true but rather whether they are violated in ways that would destroy the adequacy of the model for the problem. So far I can agree. However, I see some difficulties that are not addressed in the book, and mostly not elsewhere either. Here is a list.

3.1. Adaptation of model checking to the problem of interest

As all models are false, it is not too difficult to find model assumptions that are violated but don’t matter, or at least don’t matter in most situations. The standard example would be the use of continuous distributions to approximate distributions of essentially discrete measurements. What does it mean to say that a violation of a model assumption doesn’t matter? This is not so easy to specify, and not much about this can be found in Mayo’s book or in the general literature. Surely it has to depend on what exactly the problem of interest is. A simple example would be to say that we are interested in statements about the mean of a discrete distribution, and then to show that estimation or tests of the mean are very little affected if a certain continuous approximation is used. This is reassuring, and certain other issues could be dealt with in this way, but one can ask harder questions. If we approximate a slightly skew distribution by a (unimodal) symmetric one, are we really interested in the mean, the median, or the mode, which for a symmetric distribution would be the same but for the skew distribution to be approximated would differ? Any frequentist distribution is an idealisation, so do we first need to show that it is fine to approximate a discrete non-distribution by a discrete distribution before worrying whether the discrete distribution can be approximated by a continuous one? (And how could we show that?) And so on.
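Let me interrupt Hennig’s text for a moment with a small simulation sketch of my own (not Hennig’s) of the kind of check he describes: the measurements are recorded on a coarse grid, so the continuous Normal model is strictly false, yet the usual t-interval for the mean is barely affected.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 30, 5000

def coverage(round_data):
    hits = 0
    for _ in range(reps):
        x = rng.normal(mu, sigma, size=n)
        if round_data:
            x = np.round(x)   # coarse, essentially discrete measurements
        lo, hi = stats.t.interval(0.95, n - 1, loc=x.mean(), scale=stats.sem(x))
        hits += (lo <= mu <= hi)
    return hits / reps

print(coverage(False), coverage(True))   # both should come out near 0.95

That reassurance is specific to the mean, which is Hennig’s point: for a skewed distribution, or for a different target of inference, the same approximation might matter.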

3.2. Severity of model misspecification tests

Following the logic of Mayo (2018), misspecification tests need to be severe in order to fulfill their purpose; otherwise data could pass a misspecification test that would be of little help ruling out problematic model deviations. I’m not sure whether there are any results of this kind, be it in Mayo’s work or elsewhere. I imagine that if the alternative is parametric (for example testing independence against a standard time series model) severity can occasionally be computed easily, but for most model misspecification tests it will be a hard problem.

3.3. Identifiability issues, and ruling out models by other means than testing

Not all statistical models can be distinguished by data. For example, even with arbitrarily large amounts of data only lower bounds of the number of modes can be estimated; an assumption of unimodality can strictly not be tested (Donoho 1988). Worse, only regular but not general patterns of dependence can be distinguished from independence by data; any non-i.i.d. pattern can be explained by either dependence or non-identity of distributions, and telling these apart requires constraints on dependence and non-identity structures that cannot themselves be tested on the data (in the example given in 4.11 of Mayo, 2018, all tests discover specific regular alternatives to the model assumption). Given that this is so, the question arises on which grounds we can rule out irregular patterns (about the simplest and silliest one is “observations depend in such a way that every observation determines the next one to be exactly what it was observed to be”) by other means than data inspection and testing. Such models are probably useless; however, if they were true, they would destroy any attempt to find “true” or even approximately true error probabilities.

3.4. Robustness against what cannot be ruled out

The above implies that certain deviations from the model assumptions cannot be ruled out, and then one can ask: How robust is the substantive conclusion that is drawn from the data against models different from the nominal one, which could not be ruled out by misspecification testing, and how robust are error probabilities? The approaches of standard robust statistics probably have something to contribute in this respect (e.g., Hampel et al., 1986), although their starting point is usually different from “what is left after misspecification testing”. This will depend, as everything, on the formulation of the “problem of interest”, which needs to be defined not only in terms of the nominal parametric model but also in terms of the other models that could not be ruled out.

3.5. The effect of preliminary model checking on model-based inference

Mayo is correctly concerned about biasing effects of model selection on inference. Deciding what model to use based on misspecification tests is a kind of model selection, so it may bias inference that is made in case of passing misspecification tests. One way of stating the problem is to realise that in most cases the assumed model, conditionally on having passed a misspecification test, no longer holds. I have called this the “goodness-of-fit paradox” (Hennig, 2007); the issue has been mentioned elsewhere in the literature. Mayo has argued that this is not a problem, and this is in a well-defined sense true (meaning that error probabilities derived from the nominal model are not affected by conditioning on passing a misspecification test) if misspecification tests are indeed “independent of (or orthogonal to) the primary question at hand” (Mayo 2018, p. 319). The problem is that for the vast majority of misspecification tests independence/orthogonality does not hold, at least not precisely. So the actual effect of misspecification testing on model-based inference is a matter that needs to be investigated on a case-by-case basis. Some work of this kind has been done or is currently being done; the results are not always positive (an early example is Easterling and Anderson 1978).
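Interrupting again with an illustration of my own (not Hennig’s or Mayo’s): the selection effect of preliminary model checking can be studied directly by simulation, for instance by generating data from a mildly non-Normal distribution, running a t-test only on the datasets that pass a Shapiro-Wilk normality check, and comparing the rejection rate among the passed datasets with the rate over all datasets.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, true_mean = 15, 20000, 1.0   # Exponential(1) data, so the true mean is 1

rejections_all, rejections_passed, n_passed = 0, 0, 0
for _ in range(reps):
    x = rng.exponential(1.0, size=n)            # mildly non-Normal truth
    passed = stats.shapiro(x).pvalue > 0.05     # preliminary normality check
    reject = stats.ttest_1samp(x, true_mean).pvalue < 0.05
    rejections_all += reject
    if passed:
        n_passed += 1
        rejections_passed += reject

print("rejection rate, all datasets:   ", rejections_all / reps)
print("rejection rate, passed datasets:", rejections_passed / n_passed)

The size and direction of the gap between the two rates depend on the distribution, the sample size, and the particular pair of tests, which is exactly why this needs to be investigated case by case.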

4 Conclusion

The issues listed in Section 3 are in my view important and worthy of investigation. Such investigation has already been done to some extent, but there are many open problems. I believe that some of these can be solved, some are very hard, and some are impossible to solve or may lead to negative results (particularly those connected to lack of identifiability). However, I don’t think that these issues invalidate Mayo’s approach and arguments; I expect at least the issues that cannot be solved to affect in one way or another any alternative approach. My case is just that methodology that is “objective” according to Mayo comes with limitations that may be incompatible with some other people’s ideas of what “objectivity” should mean (in which sense it is in good company though), and that the falsity of models has some more cumbersome implications than Mayo’s book might lead the reader to believe.

(*) There is surely a strong connection between what I call “my” view here and the collaborative position in Gelman and Hennig (2017), but as I write the present text on my own, I will refer to “my” position here and let Andrew Gelman speak for himself.

References:
Chang, H. (2012) Is Water H2O? Evidence, Realism and Pluralism. Dordrecht: Springer.

Donoho, D. (1988) One-Sided Inference about Functionals of a Density. Annals of Statistics 16, 1390-1420.

Easterling, R. G. and Anderson, H.E. (1978) The effect of preliminary normality goodness of fit tests on subsequent inference. Journal of Statistical Computation and Simulation 8, 1-11.

Gelman, A. and Hennig, C. (2017) Beyond subjective and objective in statistics (with discussion). Journal of the Royal Statistical Society, Series A 180, 967–1033.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust statistics. New York: Wiley.

Hennig, C. (2010) Mathematical models and reality: a constructivist perspective. Foundations of Science 15, 29–48.

Hennig, C. (2007) Falsification of propensity models by statistical tests and the goodness-of-fit paradox. Philosophia Mathematica 15, 166-192.

Mayo, D. G. (2018) Statistical Inference as Severe Testing. Cambridge University Press.

My own reactions

I’m still struggling with the key ideas of Mayo’s book. (Struggling is a good thing here, I think!)

First off, I appreciate that Mayo takes my own philosophical perspective seriously—I’m actually thrilled to be taken seriously, after years of dealing with a professional Bayesian establishment tied to naive (as I see it) philosophies of subjective or objective probabilities, and anti-Bayesians not willing to think seriously about these issues at all—and I don’t think any of these philosophical issues are going to be resolved any time soon. I say this because I’m so aware of the big Cantor-size hole in the corner of my own philosophy of statistical learning.

In statistics—maybe in science more generally—philosophical paradoxes are sometimes resolved by technological advances. Back when I was a student I remember all sorts of agonizing over the philosophical implications of exchangeability, but now that we can routinely fit varying-intercept, varying-slope models with nested and non-nested levels and (we’ve finally realized the importance of) informative priors on hierarchical variance parameters, a lot of the philosophical problems have dissolved; they’ve become surmountable technical problems. (For example: should we consider a group of schools, or states, or hospitals, as “truly exchangeable”? If not, there’s information distinguishing them, and we can include such information as group-level predictors in our multilevel model. Problem solved.)
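Schematically, and roughly in the notation of my book with Jennifer, the move I am describing is just to let a group-level predictor absorb the information that distinguishes the groups:

$$ y_i \sim \mathrm{N}\!\left(\alpha_{j[i]} + \beta x_i,\; \sigma_y^2\right), \qquad \alpha_j \sim \mathrm{N}\!\left(\gamma_0 + \gamma_1 u_j,\; \sigma_\alpha^2\right), $$

where u_j is whatever group-level information distinguishes the schools, states, or hospitals; conditional on u_j, the remaining group-level variation can again be treated as exchangeable.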

Rapid technological progress resolves many problems in ways that were never anticipated. (Progress creates new problems too; that’s another story.) I’m not such an expert on deep learning and related methods for inference and prediction—but, again, I think these will change our perspective on statistical philosophy in various ways.

This is all to say that any philosophical perspective is time-bound. On the other hand, I don’t think that Popper/Kuhn/Lakatos will ever be forgotten: this particular trinity of twentieth-century philosophy of science has forever left us in a different place than where we were, a hundred years ago.

To return to Mayo’s larger message: I agree with Hennig that Mayo is correct to place evaluation at the center of statistics.

I’ve thought a lot about this, in many years of teaching statistics to graduate students. In a class for first-year statistics Ph.D. students, you want to get down to the fundamentals.

What’s the most fundamental thing in statistics? Experimental design? No. You can’t really pick your design until you have some sense of how you will analyze the data. (This is the principle of the great Raymond Smullyan: To understand the past, we must first know the future.) So is data analysis the most fundamental thing? Maybe so, but what method of data analysis? Last I heard, there are many schools. Bayesian data analysis, perhaps? Not so clear; what’s the motivation for modeling everything probabilistically? Sure, it’s coherent—but so is some mental patient who thinks he’s Napoleon and acts daily according to that belief. We can back into a more fundamental, or statistical, justification of Bayesian inference and hierarchical modeling by first considering the principle of external validation of predictions, then showing (both empirically and theoretically) that a hierarchical Bayesian approach performs well based on this criterion—and then following up with the Jaynesian point that, when Bayesian inference fails to perform well, this recognition represents additional information that can and should be added to the model. All of this is the theme of the example in section 7 of BDA3—although I have the horrible feeling that students often don’t get the point, as it’s easy to get lost in all the technical details of the inference for the hyperparameters in the model.

Anyway, to continue . . . it still seems to me that the most foundational principles of statistics are frequentist. Not unbiasedness, not p-values, and not type 1 or type 2 errors, but frequency properties nevertheless. Statements about how well your procedure will perform in the future, conditional on some assumptions of stationarity and exchangeability (analogous to the assumption in physics that the laws of nature will be the same in the future as they’ve been in the past—or, if the laws of nature are changing, that they’re not changing very fast! We’re in Cantor’s corner again).

So, I want to separate the principle of frequency evaluation—the idea that frequency evaluation and criticism represents one of the three foundational principles of statistics (with the other two being mathematical modeling and the understanding of variation)—from specific statistical methods, whether they be methods that I like (Bayesian inference, estimates and standard errors, Fourier analysis, lasso, deep learning, etc.) or methods that I suspect have done more harm than good or, at the very least, have been taken too far (hypothesis tests, p-values, so-called exact tests, so-called inverse probability weighting, etc.). We can be frequentists, use mathematical models to solve problems in statistical design and data analysis, and engage in model criticism, without making decisions based on type 1 error probabilities etc.
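To be concrete about what I mean by frequency evaluation without the type 1 / type 2 machinery, here is a toy simulation sketch (made-up settings, with a conjugate Normal model chosen only because it keeps the code short): repeatedly simulate a parameter and data from an assumed generative process, and check how often the Bayesian 95% posterior interval covers the true parameter.

import numpy as np

rng = np.random.default_rng(2)
sigma, n, reps = 1.0, 20, 10000
prior_mean, prior_sd = 0.0, 2.0      # made-up prior for the illustration

covered = 0
for _ in range(reps):
    theta = rng.normal(prior_mean, prior_sd)     # draw a "true" parameter
    y = rng.normal(theta, sigma, size=n)         # simulate data given theta
    # Conjugate Normal posterior for theta (sigma known):
    post_prec = 1 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + y.sum() / sigma**2) / post_prec
    post_sd = np.sqrt(1 / post_prec)
    lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
    covered += (lo <= theta <= hi)

print(covered / reps)   # close to 0.95 when the assumed model matches the simulation

Under the assumed model the coverage is calibrated by construction; the interesting frequency evaluations are the ones where the simulation departs from the fitted model, which is where model checking comes in.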

To say it another way, bringing in the title of the book under discussion: I would not quite say that statistical inference is severe testing, but I do think that severe testing is a crucial part of statistics. I see statistics as an unstable mixture of inference conditional on a model (“normal science”) and model checking (“scientific revolution”). Severe testing is fundamental, in that the prospect of revolution is a key contributor to the success of normal science. We lean on our models in large part because they have been, and will continue to be, put to the test. And we choose our statistical methods in large part because, under certain assumptions, they have good frequency properties.

And now on to Mayo’s subtitle. I don’t think her, or my, philosophical perspective will get us “beyond the statistics wars” by itself—but perhaps it will ultimately move us in this direction, if practitioners and theorists alike can move beyond naive confirmationist reasoning toward an embrace of variation and acceptance of uncertainty.

I’ll summarize by expressing agreement with Mayo’s perspective that frequency evaluation is fundamental, while disagreeing with her focus on various crude (from my perspective) ideas such as type 1 errors and p-values. When it comes to statistical philosophy, I’d rather follow Laplace, Jaynes, and Box than Neyman, Wald, and Savage. Phony Bayesmania has bitten the dust.

Thanks

Let me again thank Haig, Wagenmakers, Owen, Cousins, Young, and Hennig for their discussions. I expect that Mayo will respond to these, and also to any comments that follow in this thread, once she has time to digest it all.

P.S. And here’s a review from Christian Robert.