May I put another statistical question to you?

If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me. My visualisation is that she filled a bowl with 100 intervals, 95 of which do contain the true value and 5 do not, and she picked one at random.

Now, if she gives me two independent 95%-CI’s (e.g., two primary endpoints in a clinical trial), I can only be 90% sure (0.95^2 = 0.9025) that they both contain the true value. If I have a table with four measurements and 95%-CI’s, there’s only an 81% chance (0.95^4 = 0.8145) they all contain the true value. Also, if we have two results and we want to be 95% sure both intervals contain the true values, we should construct two 97.5%-CI’s (0.95^(1/2) = 0.9747), and if we want to have 95% confidence in four results, we need 98.7%-CI’s (0.95^(1/4) = 0.9873).
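The arithmetic in this paragraph can be checked in a few lines. The sketch below assumes the intervals are independent, as the question does; the function names are made up for illustration. The per-interval adjustment is the calculation usually called the Šidák correction, and the Bonferroni version is included as the simpler, slightly more conservative approximation.

```python
def joint_coverage(level: float, k: int) -> float:
    """Chance that k independent intervals, each with coverage `level`, all cover."""
    return level ** k


def per_interval_level(joint: float, k: int) -> float:
    """Per-interval level needed so that k independent intervals jointly cover at `joint`."""
    return joint ** (1.0 / k)


def bonferroni_level(joint: float, k: int) -> float:
    """Bonferroni approximation: split the total error rate evenly over k intervals."""
    return 1.0 - (1.0 - joint) / k


for k in (1, 2, 4):
    print(
        f"k={k}: joint coverage of 95% intervals = {joint_coverage(0.95, k):.4f}, "
        f"level needed for 95% joint = {per_interval_level(0.95, k):.4f}, "
        f"Bonferroni = {bonferroni_level(0.95, k):.4f}"
    )
```

With dependent intervals the product rule no longer holds exactly, so these numbers should be read as the independent-case benchmark only.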

I’ve read quite a few texts trying to get my head around confidence intervals, but I don’t remember seeing this discussed anywhere. So am I completely off, is this a well-known issue, or have I just invented the Van Maanen Correction for Multiple Confidence Intervals? ;-))

I hope you have time for an answer. It puzzles me!

My reply:

Sure, I can help you, but in English:

1. “If I ask my frequentist statistician for a 95%-confidence interval, I can be 95% sure that the true value will be in the interval she just gave me.” Not quite true. Yes, true on average, but not necessarily true in any individual case. Some intervals are clearly wrong. Here’s the point: even if you picked an interval at random from the bowl, once you see the interval you have additional information. Sometimes the entire interval is implausible, suggesting that it’s likely that you happened to have picked one of the bad intervals in the bowl. Other times, the interval contains the entire range of plausible values, suggesting that you’re almost completely sure that you have picked one of the good intervals in the bowl. This can especially happen if your study is noisy and the sample size is small. For example, suppose you’re trying to estimate the difference in proportion of girl births, comparing two different groups of parents (for example, beautiful parents and ugly parents). You decide to conduct a study of N=400 births, with 200 in each group. Your estimate will be p2 – p1, with standard error sqrt(0.5^2/200 + 0.5^2/200) = 0.05, so your 95% conf interval will be p2 – p1 +/- 0.10. We happen to be pretty sure that any true population difference will be less than 0.01 (see here), hence if p2 – p1 is between -0.09 and +0.09, we can be pretty sure that our 95% interval *does* contain the true value. Conversely, if p2 – p1 is less than -0.11 or more than +0.11, then we can be pretty sure that our interval *does not* contain the true value. Thus, once we *see the interval*, it’s no longer generally a correct statement to say that you can be 95% sure the interval contains the true value.

2. Regarding your question: I don’t really think it makes sense to want 95% confidence in four results. It makes more sense to accept that our inferences are uncertain; we should not demand, or act as if, they will all be correct.
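Point 1 can be illustrated with a small simulation of the girl-births example. This is only a sketch: the true proportions (0.485 and 0.490, a difference of 0.005) are assumed values chosen to respect the "less than 0.01" bound, and the interval uses the rounded half-width of 0.10 from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200                          # births per group, as in the example
p1_true, p2_true = 0.485, 0.490  # ASSUMED true proportions (difference 0.005)
true_diff = p2_true - p1_true
half_width = 0.10                # rounded 95% half-width from the text
n_sims = 200_000

# Simulate many replications of the study and the resulting estimates p2 - p1.
est = rng.binomial(n, p2_true, n_sims) / n - rng.binomial(n, p1_true, n_sims) / n
covered = np.abs(est - true_diff) <= half_width


def coverage(mask):
    """Share of intervals covering the true difference among the selected studies."""
    return float(covered[mask].mean()) if mask.any() else float("nan")


inside = np.abs(est) < 0.09   # estimate looks unremarkable
outside = np.abs(est) > 0.11  # estimate looks implausibly large

print(f"overall coverage:              {covered.mean():.3f}")
print(f"coverage given |est| < 0.09:   {coverage(inside):.3f}")
print(f"coverage given |est| > 0.11:   {coverage(outside):.3f}")
```

Averaged over all replications the procedure covers at about the nominal rate, but conditional on what the interval looks like, the coverage is very different from 95%.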

Since you’ve written about similar papers (that recent NRA study in NEJM, the birthday analysis) before and we linked to a few of your posts, I thought you might be interested in this recent blog post we wrote about a similar kind of study claiming that fatal motor vehicle crashes increase by 12% after 4:20pm on April 20th (an annual cannabis celebration…google it).

The post is by Harper and Adam Palayew, and it’s excellent. Here’s what they say:

A few weeks ago a short paper was published in a leading medical journal, JAMA Internal Medicine, suggesting that, over the 25 years from 1992-2016, excess cannabis consumption after 4:20pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after. Here is the key result from the paper:

In total, 1369 drivers were involved in fatal crashes after 4:20 PM on April 20 whereas 2453 drivers were in fatal crashes on control days during the same time intervals (corresponding to 7.1 and 6.4 drivers in fatal crashes per hour, respectively). The risk of a fatal crash was significantly higher on April 20 (relative risk, 1.12; 95% CI, 1.05-1.19; P = .001).

— Staples JA, Redelmeier DA. The April 20 Cannabis Celebration and Fatal Traffic Crashes in the United States. JAMA Int Med, Feb 18, 2018, p.E2

Naturally, this sparked (heh) considerable media interest, not only because p<.05 and the finding is “surprising”, but also because cannabis is a hot topic these days (and, of course, April 20th happens every year).

But how seriously should we take these findings? Harper and Palayew crunch the numbers:

If we try and back out some estimates of what might have to happen on 4/20 to generate a 12% increase in the national rate of fatal car crashes, it seems less and less plausible that the 4/20 effect is reliable or valid. Let’s give it a shot. . . .

Over the 25 year period [the authors of the linked paper] tally 1369 deaths on 4/20 and 2453 deaths on control days, which works out to average deaths on those days each year of 1369/25 ~ 55 on 4/20 and 2453/25/2 ~ 49 on control days, an average excess of about 6 deaths each year. If we use our estimates of post-1620h VMT above, that works out to around 55/2.5 = 22 fatal crashes per billion VMT on 4/20 vs. 49/2.5 = 19.6 on control days. . . .

If we don’t assume the relative risk changes on 4/20, just more people smoking, what proportion of the population would need to be driving while high to generate a rate of 22 per billion VMT? A little algebra tells us that to get to 22 we’d need to see something like . . . 15%! That’s nearly one-sixth of the population driving while high on 4/20 from 4:20pm to midnight, which doesn’t, absent any other evidence, seem very likely. . . . Alternatively, one could also raise the relative risk among cannabis drivers to 6x the base rate and get something close. Or some combination of the two. This means either the nationwide prevalence of driving while using cannabis increases massively on 4/20, or the RR of a fatal crash with the kind of cannabis use happening on 4/20 is absurdly high. Neither of these scenarios seem particularly likely based on what we currently know about cannabis use and driving risks.
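The back-of-the-envelope calculation above can be written out as a short script. The counts and the 2.5 billion VMT figure come from the quoted passage; the mixture model, the relative risk, and the baseline prevalence below are illustrative assumptions and are not necessarily the values Harper and Palayew used, so the script is not expected to reproduce their 15% figure.

```python
YEARS = 25
COUNT_420 = 1369       # total on April 20 after 4:20pm, 1992-2016 (from the paper)
COUNT_CONTROL = 2453   # total on the two control days per year (from the paper)
VMT_BILLION = 2.5      # post-16:20 vehicle miles travelled, as used in the quoted post

per_year_420 = COUNT_420 / YEARS
per_year_control = COUNT_CONTROL / YEARS / 2
rate_420 = round(per_year_420) / VMT_BILLION          # 55 / 2.5, as in the post
rate_control = round(per_year_control) / VMT_BILLION  # 49 / 2.5, as in the post
rate_ratio = rate_420 / rate_control


def required_prevalence(rate_ratio, rr_high, baseline_prev):
    """Prevalence of high drivers needed on 4/20 to produce `rate_ratio`.

    Assumes the crash rate is proportional to (1 - p) + p * rr_high, where p is
    the share of drivers who are high and rr_high is their relative risk.
    """
    baseline_mix = 1.0 + baseline_prev * (rr_high - 1.0)
    return (rate_ratio * baseline_mix - 1.0) / (rr_high - 1.0)


# HYPOTHETICAL inputs, for illustration only.
ASSUMED_RR = 2.0
ASSUMED_BASELINE_PREV = 0.02

needed = required_prevalence(rate_ratio, ASSUMED_RR, ASSUMED_BASELINE_PREV)
print(f"rates per billion VMT: {rate_420:.1f} vs {rate_control:.1f}")
print(f"rate ratio: {rate_ratio:.3f}")
print(f"prevalence needed under the assumed inputs: {needed:.1%}")
```

Changing `ASSUMED_RR` shows the trade-off described in the post: a smaller relative risk requires a larger share of drivers to be high, and vice versa.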

They also look at the big picture:

Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part. It’s similar to that NRA study (see link at beginning of this post) in that the numbers just don’t add up.

Harper sent me this email last year. I wrote the above post and scheduled it for 4/20. In the meantime, he had more to report:

We published a replication paper with some additional analysis. The original paper in question (in JAMA Internal Med no less) used a design (comparing an index ‘window’ on a given day to the same ‘window’ +/- 1 week) similar to some others that you have blogged about (the NRA study, for example), and I think it merits similar skepticism (a sizeable fraction of the population would need to be driving while drugged/intoxicated on this day to raise the national rate by such a margin).

As I said, my co-author Adam Palayew and I replicated that paper’s findings but also showed that their results seem much more consistent with daily variations in traffic crashes throughout the year (lots of noise) and we used a few other well known “risky” days (July 4th is quite reliable for excess deaths from traffic crashes) as a comparison. We also used Stan to fit some partial pooling models to look at how these “effects” may vary over longer time windows.

I wrote an updated blog post about it here.

And the gated version of the paper is now posted on Injury Prevention’s website, but we have made a preprint and all of the raw data and code to reproduce our work available at my Open Science page.

Stan!

While listening to your seminar about the piranha problem a couple weeks back, I kept thinking about a similar work situation but in the opposite direction. I’d be extremely grateful if you share your thoughts.

So the piranha problem is stated as “There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.” The task, then, is to find out which large effects are real and which are spurious.

At work, sometimes people bring up the opposite argument. When experiments (A/B tests) are pre-registered, a lot of times the results are not statistically significant. And a few months down the line people would ask if we can re-run the experiment, because the app or website has changed, and so the treatment might interact differently with the current version. So instead of arguing that large effects can be explained by an interaction of previously established large effects, some people argue that large effects are hidden by yet unknown interaction effects.

My gut reaction is a resounding no, because otherwise people would re-test things every time they don’t get the results they want, and the number of false positives would go up like crazy. But it feels like there is some ring of truth to the concerns they raise.

For instance, if the old website had a green layout, and we changed the button to green, then it might have a bad impact. However, if the current layout is red, making the button green might make it stand out more, and the treatment will have positive effect. In that regard, it will be difficult to see consistent treatment effects over time when the website itself keeps evolving and the interaction terms keep changing. Even for previously established significant effects, how do we know that the effect size estimated a year ago still holds true with the current version?

What do you think? Is there a good framework to evaluate just when we need to re-run an experiment, if that is even a good idea? I can’t find a satisfying resolution to this.

My reply:

I suspect that large effects *are* out there, but, as you say, the effects can be strongly dependent on context. So, even if an intervention works in a test, it might not work in the future because in the future the conditions will change in some way. Given all that, I think the right way to study this is to explicitly model effects as varying. For example, instead of doing a single A/B test of an intervention, you could try testing it in many different settings, and then analyze the results with a hierarchical model so that you’re estimating varying effects. Then when it comes to decision-making, you can keep that variation in mind.
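As a rough illustration of what estimating varying effects could look like, here is a minimal partial-pooling sketch in Python. It is an empirical-Bayes shortcut with a crude moment estimate of the between-setting variance, not a full Bayesian fit (which one might do in Stan), and the A/B test estimates and standard errors are hypothetical numbers.

```python
import numpy as np


def partial_pool(estimates, std_errors):
    """Shrink per-setting effect estimates toward a common mean.

    Normal-normal model: y_j ~ N(theta_j, se_j^2), theta_j ~ N(mu, tau^2).
    tau^2 is estimated by a simple method-of-moments rule (an assumption of
    this sketch), then each theta_j is a precision-weighted compromise
    between y_j and mu.
    """
    y = np.asarray(estimates, dtype=float)
    s2 = np.asarray(std_errors, dtype=float) ** 2
    tau2 = max(float(np.var(y, ddof=1) - s2.mean()), 0.0)
    weights = 1.0 / (s2 + tau2)
    mu = float(np.sum(weights * y) / np.sum(weights))
    pull = s2 / (s2 + tau2)            # weight placed on the common mean
    theta = pull * mu + (1.0 - pull) * y
    return mu, tau2, theta


# HYPOTHETICAL lift estimates (percentage points) from the same treatment
# tested in five different settings, with their standard errors.
y_obs = [0.8, -0.2, 1.5, 0.1, 0.6]
se_obs = [0.5, 0.5, 0.7, 0.4, 0.6]

mu_hat, tau2_hat, theta_hat = partial_pool(y_obs, se_obs)
print(f"average effect: {mu_hat:.2f}, between-setting sd: {tau2_hat ** 0.5:.2f}")
for raw, pooled in zip(y_obs, theta_hat):
    print(f"raw {raw:+.2f} -> partially pooled {pooled:+.2f}")
```

The estimated between-setting standard deviation is the quantity to keep in mind for decisions: it indicates how much the effect in a new setting might differ from the average.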

For students and teachers of statistics or research methods, I think the key takeaway should be that you don’t want to pull out just one number from a survey; you want to get the big picture by looking at multiple questions, multiple years, and multiple data sources. You want to use the secret weapon.

Where do formal statistical theory and methods come in here? Not where you might think. No p-values or Bayesian inferences in the above-linked discussion, not even any confidence intervals or standard errors.

But that doesn’t mean that formal statistics are irrelevant, not at all.

Formal statistics gets used in the design and analysis of these surveys. We use probability and statistics to understand and design sampling strategies (cluster sampling, in the case of the General Social Survey) and to adjust for differences between sample and population (poststratification and survey weights, or, if these adjustments are deemed not necessary, statistical methods are used to make that call too).

Formal statistics underlies this sort of empirical work in social science—you just don’t see it because it was already done before you got to the data.

Data is getting weirder. Statistical models and techniques are more complex than they have ever been. No one understands what code does. But at the same time, statistical tools are being used by a wider range of people than at any time in the past. And they are not just using our well-trodden, classical tools. They are working at the bleeding edge of what is possible. With this in mind, this talk will look at how much we can trust our tools. Do we ever really compute the thing we think we do? Can we ever be sure our code worked? Are there ways that it’s not safe to use the output? While “reproducibility” may be the watchword of the new scientific era, if we also want to ensure safety maybe all we have to lean on are pictures and fear.

Important stuff.

Have you been following the release of GSS results this year? I had been vaguely aware that there was reporting on a few items, but then I happened to run the natrace and natracey variables (I use these in my class to look at question wording). They come from the questions asking whether we are spending too much/too little/about the right amount on “Improving the conditions of blacks” and “aid to blacks” (the images are from the SDA website at Berkeley):

Much as I [Waring] would love to believe that the American public really has changed racial attitudes, I find such a huge shift over such a short time very unlikely given what we know about stability of attitudes. And I even broke it down by age and there was a shift for all the age groups.

Then I saw this, and a colleague mentioned to me that the results for proportion not sexually active were strange. And then today people were talking about the increase in the proportion not religiously affiliated.

It just seems very odd to me and I wondered if you had noticed it too. Could it be they just hit a strange cluster in their sampling? Or a weighting error of some kind? It’s true that attitudes on gay marriage changed very fast and that seems real, but this seems so surprising across so many separate issues.

I wasn’t sure so I passed this along to David Weakliem, my go-to guy when it comes to making sense of surveys and public opinion. Weakliem responded with some preliminary thoughts:

It did seem hard to believe at first. But there was a big move from 2014 to 2016 too (bigger than 2016-8), so if there is a problem with the survey it’s not just with 2018. The GSS also has a general question about whether the government has a special obligation to help blacks vs. no special treatment, and that also showed large moves in a liberal direction from 2014-6 and again from 2016-8. Finally, I looked for relevant questions from other surveys. There are some about how much discrimination there is. In 2013 and 2014, 19% and then 17% said there was a lot of discrimination against “African Americans” but in 2015 it was 36%; in 2016 and 2017 the question referred to “blacks” and 40% said there was a lot. So it seems that there really has been a substantial change in opinions about race since 2014. As far as why, I would guess that the media coverage and videos of police mistreatment of blacks had an impact—they made people think there really is a problem.

To which Waring replied:

The one thing I’d say in response to David is that while he could be right, these are shifts across a number of the long term variables not just the racial attitudes. Also I think that GSS is intentionally designed to not be so responsive to day to day fluctuations based on the latest news. And POLHITOK sees an increase in “no” responses in 2018 but not so dramatic and it looks like it’s in the same general territory as others from 2006 forward.

What really made me look at those particular variables was all the recent talk about reparations for slavery.

I also saw that Jay Livingston, who I wish had his own column in the New York Times—I’d rather see a sociologist’s writing about sociology, than an ignorant former reporter’s writing about sociology—wrote something recently on survey attitudes regarding racial equality, but using a different data source:

Just last week, Pew published a report (here) about race in the US. Among many other things, it asked respondents about the “major” reasons that Black people “have a harder time getting ahead.” As expected, Whites were more likely to point to cultural/personal factors, Blacks to structural ones. But compared with a similar survey Pew did just three years ago, it looks like everyone is becoming more woke. . . .

For “racial discrimination,” Black-White difference remains large. But in both groups, the percentage citing it as a major cause increases – by 14 points among Blacks, by nearly 20 points among Whites. The percent identifying access to good schools as an important factor have not changed so much, increasing slightly among both Blacks and Whites.

More curious are the responses about jobs. In 2013, far more Whites than Blacks said that the lack of jobs was a major factor. In the intervening three years, jobs as a reason for not getting ahead became more salient among Blacks, less so among Whites.

At the same time, “culture of poverty” explanations became less popular.

Livingston continues with some GSS data and then concludes:

If both Whites and Blacks are paying more attention to racial discrimination and less to personal-cultural factors, if everyone is more woke, how does this square with the widely held perception that in the era of Trump, racism is on the rise? (In the Pew survey, 56% overall and 49% of Whites said Trump has made race relations worse. In no group, even self-identified conservatives, does anything even close to a majority say that Trump has made race relations better.)

The data here points to a more complex view of recent history. The nastiest of the racists may have felt freer to express themselves in word and deed. And when they do, they make the news. Hence the widespread perception that race relations have deteriorated. But surveys can tell us what we don’t see on the news and Twitter. And in this case what they tell us is that the overall trend among Whites has been towards more liberal views on the causes of race differences in who gets ahead.

Interesting. Also an increasing proportion of Americans are neither white nor black. So lots going on here.

**P.S.** Livingston adds:

I also noticed something when I was checking the GSS data that Tristan Bridges posted about LGB self-identification. For those variables (and maybe others—I haven’t looked), the GSS 2014 sample was much larger than in other years before and since, and the 2018 sample smaller. That shouldn’t affect the actual percents, but with fairly rare responses like identifying as gay, the sample size did make me pause to wonder. With larger-n attitude items it shouldn’t matter.
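To see why sample size matters more for rare responses, one can compare the standard error of an estimated proportion with the proportion itself. The sketch below assumes simple random sampling (the GSS actually uses cluster sampling, so real standard errors would be somewhat larger), and the proportions and sample sizes are hypothetical, not actual GSS values.

```python
import math


def proportion_se(p: float, n: int) -> float:
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


# HYPOTHETICAL values: a rare response (3%) and a common one (40%).
for p in (0.03, 0.40):
    for n in (1000, 1500, 2500):
        se = proportion_se(p, n)
        print(f"p={p:.2f}, n={n}: se={se:.4f}, se relative to p={se / p:.1%}")
```

The absolute standard error is smaller for the rare response, but relative to the size of the proportion it is much larger, which is why year-to-year changes in rare categories are harder to read.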

I followed the link to Bridges’s blog, which had lots of interesting stuff, including this post from 2016, Why Popular Boy Names are More Popular than Popular Girl Names, which featured this familiar-looking graph:

Why did this graph look so familiar?? Because I plotted the exact same data in 2013:

I assume that Bridges just independently came up with the same idea that I had—these are public data, and counting the top 10 names is a pretty obvious thing to do, I guess. It was just funny to come across this graph again, in an unexpected place.

Dr Ioannidis writes against our proposals [here and here] to abandon statistical significance in scientific reasoning and publication, as endorsed in the editorial of a recent special issue of an American Statistical Association journal devoted to moving to a “post p<0.05 world.” We appreciate that he echoes our calls for “embracing uncertainty, avoiding hyped claims…and recognizing ‘statistical significance’ is often poorly understood.” We also welcome his agreement that the “interpretation of any result is far more complicated than just significance testing” and that “clinical, monetary, and other considerations may often have more importance than statistical findings.”

Nonetheless, we disagree that a statistical significance-based “filtering process is useful to avoid drowning in noise” in science and instead view such filtering as harmful. First, the implicit rule to not publish nonsignificant results biases the literature with overestimated effect sizes and encourages “hacking” to get significance. Second, nonsignificant results are often wrongly treated as zero. Third, significant results are often wrongly treated as truth rather than as the noisy estimates they are, thereby creating unrealistic expectations of replicability. Fourth, filtering on statistical significance provides no guarantee against noise. Instead, it amplifies noise because the quantity on which the filtering is based (the p-value) is itself extremely noisy and is made more so by dichotomizing it.
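The first and fourth points can be illustrated with a small simulation of exact replications of one study. This is a sketch under assumed conditions: the true effect is set to one standard error (a hypothetical, fairly noisy study), and estimates are drawn from a normal distribution.

```python
import math
import random


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


random.seed(1)
TRUE_EFFECT = 1.0   # ASSUMED true effect
SE = 1.0            # ASSUMED standard error (true effect is one SE from zero)
N_REPS = 100_000

estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(N_REPS)]
p_values = [two_sided_p(est / SE) for est in estimates]

# How much does the p-value vary across identical replications?
sorted_p = sorted(p_values)
p10, p90 = sorted_p[N_REPS // 10], sorted_p[9 * N_REPS // 10]

# What do the estimates look like after filtering on p < 0.05?
significant = [est for est, p in zip(estimates, p_values) if p < 0.05]
share_significant = len(significant) / N_REPS
mean_abs_significant = sum(abs(est) for est in significant) / len(significant)

print(f"10th to 90th percentile of p-values: {p10:.3f} to {p90:.3f}")
print(f"share of replications with p < 0.05: {share_significant:.3f}")
print(f"mean |estimate| among those: {mean_abs_significant:.2f} (truth: {TRUE_EFFECT})")
```

Under these assumptions the same study yields p-values spread across most of the unit interval, and the estimates that survive the filter are on average well above the true effect.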

We also disagree that abandoning statistical significance will reduce science to “a state of statistical anarchy.” Indeed, the journal Epidemiology banned statistical significance in 1990 and is today recognized as a leader in the field.

Valid synthesis requires accounting for all relevant evidence—not just the subset that attained statistical significance. Thus, researchers should report more, not less, providing estimates and uncertainty statements for all quantities, justifying any exceptions, and considering ways the results are wrong. Publication criteria should be based on evaluating study design, data quality, and scientific content—not statistical significance.

Decisions are seldom necessary in scientific reporting. However, when they are required (as in clinical practice), they should be made based on the costs, benefits, and likelihoods of all possible outcomes, not via arbitrary cutoffs applied to statistical summaries such as p-values which capture little of this picture.

The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.

I especially like our title and our last paragraph!

Let me also emphasize that we have a lot of positive advice on how researchers can design studies and collect and analyze data (see for example here, here, and here). “Abandon statistical significance” is not the main thing we have to say. We’re writing about statistical significance to do our best to clear up some points of confusion, but our ultimate message in most of our writing and practice is to offer positive alternatives.

**P.S.** Also to clarify: “Abandon statistical significance” does not mean “Abandon statistical methods.” I do think it’s generally a good idea to produce estimates accompanied by uncertainty statements. There’s lots and lots to be done.

I’m a machine learning guy working in fraud prevention, and a member of some biostatistics and clinical statistics research groups at Wright State University in Dayton, Ohio.

I just heard your talk “Theoretical Statistics is the Theory of Applied Statistics” on YouTube, and was extremely interested in the idea of exploring and choosing from possibilities in ‘model space’.

I was wondering if you knew of work on any R (or Python, or whatever, I’m not picky!) packages that was being done on this, or could recommend a place to start reading more about the theory/concept.

My reply:

I love this idea of the network of models but I’ve never written anything formal on it, nor do I have any software implementations. Here’s a talk on the topic from 2011, and here’s a post from 2017 with some comments from others too.

I still think this is an important topic—it relates to the idea of a generative grammar for building statistical models, and it should fit in well with Stan. So I’m posting this in the hope that someone will follow up and do it in some way.

In India, data on key developmental indicators that formulate policies and interventions are routinely available for the administrative units of districts but not for the political units of Parliamentary Constituencies (PC). Members of Parliament (MPs) in the Lok Sabha, each representing one of the 543 PCs as per the 2014 India map, are the representatives with the most direct interaction with their constituents. The MPs are responsible for articulating the vision and the implementation of public policies at the national level and for their respective constituencies. In order for MPs to efficiently and effectively serve their people, and also for the constituents to understand the performance of their MPs, it is critical to produce the most accurate and up-to-date evidence on the state of health and well-being at the PC-level. However, the absence of PC identifiers in nationally representative surveys or the Census has prevented an assessment of how a PC is doing with regard to key indicators of nutrition, health and development.

On this website, we report PC estimates for indicators of nutrition, health and development derived from two data sources:

The National Family Health Survey 4 (NFHS-4) District Factsheets

The National Sample Survey (NSS), 2010-11, 2011-12, 2014 (Author calculations) . . .

The PC estimates for each of the indicators are classified into quintiles for map visualizations. Currently, we provide map-based visualizations for a subset of indicators, and these will be continually updated for additional indicators. . . .

In addition to providing a visualization of indicators at the PC level, we also provide tables of the PC estimates. . . .

Further details are at the link.

I’ve not looked at this all myself, but I thought it could be of interest to some of you.

For the past few months I have been delving into Bayesian statistics and have (without hyperbole) finally found statistics intuitive and exciting. Recently I have gone into Bayesian time series methods; however, I have found no libraries to use that can implement those models.

Happily, I found Stan because it seemed among the most mature and flexible Bayesian libraries around, but is there any guide/book you could recommend me for approaching state space models through Stan? I am referring to more complex models, such as those found in State-Space Models, by Zeng and Wu, as well as Bayesian Analysis of Stochastic Process Models, by Insua et al. Most advanced books seem to use WinBUGS, but that library is closed-source and a bit older.

I replied that he should post his question on the Stan mailing list and also look at the example models and case studies for Stan.

I also passed the question on to Jim Savage, who added:

Stan’s great for time series, though mostly because it just allows you to flexibly write down whatever likelihood you want and put very flexible priors on everything, then fits it swiftly with a modern sampler and lets you do diagnoses that are difficult/impossible elsewhere!

Jeff Arnold has a fairly complete set of implementations for state-space models in Stan here. I’ve also got some more introductory blog posts that might help you get your head around writing out some time-series models in Stan. Here’s one on hierarchical VAR models. Here’s another on Hamilton-style regime-switching models. I’ve got a half-written tutorial on state-space models that I’ll come back to when I’m writing the time-series chapter in our Bayesian econometrics in Stan book.

One of the really nice things about Stan is that you can write out your state as parameters. Because Stan can efficiently sample from parameter spaces with hundreds of thousands of dimensions (if a bit slowly), this is fine. It’ll just be slower than a standard Kalman filter. It also changes the interpretation of the state estimate somewhat (more akin to a Kalman smoother, given you use all observations to fit the state).

Here’s an example of such a model.

Actually that last model had some problems with the between-state correlations, but I guess it’s still a good example of how to put something together in Markdown.
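To make the filter-versus-smoother distinction mentioned above concrete, here is a minimal Python sketch for the simplest state-space model, a local-level model with known variances. It is not any of the linked Stan models; the variances and the simulated data are assumptions for illustration. The filtered estimate at time t uses observations up to t, while the smoothed estimate uses all observations, which is closer in spirit to what you get when the states are fit as parameters.

```python
import numpy as np


def local_level_filter_smoother(y, q, r, m0=0.0, p0=1e6):
    """Kalman filter and Rauch-Tung-Striebel smoother for a local-level model.

    State:        x_t = x_{t-1} + w_t,  w_t ~ N(0, q)
    Observation:  y_t = x_t + v_t,      v_t ~ N(0, r)
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    m_pred, p_pred = np.empty(T), np.empty(T)
    m_filt, p_filt = np.empty(T), np.empty(T)

    m, p = m0, p0
    for t in range(T):
        # Predict one step ahead, then update with the new observation.
        m_pred[t], p_pred[t] = m, p + q
        gain = p_pred[t] / (p_pred[t] + r)
        m = m_pred[t] + gain * (y[t] - m_pred[t])
        p = (1.0 - gain) * p_pred[t]
        m_filt[t], p_filt[t] = m, p

    # Backward pass: revise each state using later observations.
    m_smooth, p_smooth = m_filt.copy(), p_filt.copy()
    for t in range(T - 2, -1, -1):
        c = p_filt[t] / p_pred[t + 1]
        m_smooth[t] = m_filt[t] + c * (m_smooth[t + 1] - m_pred[t + 1])
        p_smooth[t] = p_filt[t] + c**2 * (p_smooth[t + 1] - p_pred[t + 1])
    return m_filt, p_filt, m_smooth, p_smooth


# Simulated data with ASSUMED variances q (state) and r (observation).
rng = np.random.default_rng(42)
q_true, r_true, T = 0.1, 1.0, 200
x = np.cumsum(rng.normal(0.0, np.sqrt(q_true), T))
y_sim = x + rng.normal(0.0, np.sqrt(r_true), T)

m_f, p_f, m_s, p_s = local_level_filter_smoother(y_sim, q_true, r_true)
print(f"filtered RMSE: {np.sqrt(np.mean((m_f - x) ** 2)):.3f}")
print(f"smoothed RMSE: {np.sqrt(np.mean((m_s - x) ** 2)):.3f}")
```

The smoothed variances are never larger than the filtered ones, and the two coincide at the final time point, where no later observations are available.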

This note argues that, under some circumstances, it is more rational not to behave in accordance with a Bayesian prior than to do so. The starting point is that in the absence of information, choosing a prior is arbitrary. If the prior is to have meaningful implications, it is more rational to admit that one does not have sufficient information to generate a prior than to pretend that one does. This suggests a view of rationality that requires a compromise between internal coherence and justification, similarly to compromises that appear in moral dilemmas. Finally, it is argued that Savage’s axioms are more compelling when applied to a naturally given state space than to an analytically constructed one; in the latter case, it may be more rational to violate the axioms than to be Bayesian.

The paper expresses various misconceptions, for example the statement that the Bayesian approach requires a “subjective belief.” All statistical conclusions require assumptions, and a Bayesian prior distribution can be as subjective or un-subjective as any other assumption in the model. For example, I don’t recall seeing textbooks on statistical methods referring to the subjective belief underlying logistic regression or the Poisson distribution; I guess if you assume a model but you don’t use the word “Bayes,” then assumptions are just assumptions.

More generally, it seems obvious to me that no statistical method will work best under all circumstances, hence I have no disagreement whatsoever with the opening sentence quoted above. I can’t quite see why they need 12 pages to make this argument, but whatever.

**P.S.** Also relevant is this discussion from a few years ago: The fallacy of the excluded middle—statistical philosophy edition.

Some other examples of movies that are about themselves are La La Land, Primer (a low-budget experiment about a low-budget experiment), and Titanic (the biggest movie ever made, about the biggest boat ever made).

I want to call this, Objects of the Class X, but I’m not sure what X is.

Dear philosophically-inclined colleagues:

I’d like to organize an online discussion of Deborah Mayo’s new book.

The table of contents and some of the book are here at Google books, also in the attached pdf and in this post by Mayo.

I think that many, if not all, of Mayo’s points in her Excursion 4 are answered by my article with Hennig here.

What I was thinking for this discussion is that if you’re interested you can write something, either a review of Mayo’s book (if you happen to have a copy of it) or a review of the posted material, or just your general thoughts on the topic of statistical inference as severe testing.

I’m hoping to get this all done this month, because it’s all informal and what’s the point of dragging it out, right? So if you’d be interested in writing something on this that you’d be willing to share with the world, please let me know. It should be fun, I hope!

I did this in consultation with Deborah Mayo, and I just sent this email to a few people (so if you were not included, please don’t feel left out! You have a chance to participate right now!), because our goal here was to get the discussion going. The idea was to get some reviews, and this could spark a longer discussion here in the comments section.

And, indeed, we received several responses. And I’ll also point you to my paper with Shalizi on the philosophy of Bayesian statistics, with discussions by Mark Andrews and Thom Baguley, Denny Borsboom and Brian Haig, John Kruschke, Deborah Mayo, Stephen Senn, and Richard D. Morey, Jan-Willem Romeijn and Jeffrey N. Rouder.

Also relevant is this summary by Mayo of some examples from her book.

And now on to the reviews.

**Brian Haig**

I’ll start with psychology researcher Brian Haig, because he’s a strong supporter of Mayo’s message and his review also serves as an introduction and summary of her ideas. The review itself is a few pages long, so I will quote from it, interspersing some of my own reaction:

Deborah Mayo’s ground-breaking book, Error and the growth of statistical knowledge (1996) . . . presented the first extensive formulation of her error-statistical perspective on statistical inference. Its novelty lay in the fact that it employed ideas in statistical science to shed light on philosophical problems to do with evidence and inference.

By contrast, Mayo’s just-published book, Statistical inference as severe testing (SIST) (2018), focuses on problems arising from statistical practice (“the statistics wars”), but endeavors to solve them by probing their foundations from the vantage points of philosophy of science, and philosophy of statistics. The “statistics wars” to which Mayo refers concern fundamental debates about the nature and foundations of statistical inference. These wars are longstanding and recurring. Today, they fuel the ongoing concern many sciences have with replication failures, questionable research practices, and the demand for an improvement of research integrity. . . .

For decades, numerous calls have been made for replacing tests of statistical significance with alternative statistical methods. The new statistics, a package deal comprising effect sizes, confidence intervals, and meta-analysis, is one reform movement that has been heavily promoted in psychological circles (Cumming, 2012; 2014) as a much needed successor to null hypothesis significance testing (NHST) . . .

The new statisticians recommend replacing NHST with their favored statistical methods by asserting that it has several major flaws. Prominent among them are the familiar claims that NHST encourages dichotomous thinking, and that it comprises an indefensible amalgam of the Fisherian and Neyman-Pearson schools of thought. However, neither of these features applies to the error-statistical understanding of NHST. . . .

There is a double irony in the fact that the new statisticians criticize NHST for encouraging simplistic dichotomous thinking: As already noted, such thinking is straightforwardly avoided by employing tests of statistical significance properly, whether or not one adopts the error-statistical perspective. For another, the adoption of standard frequentist confidence intervals in place of NHST forces the new statisticians to engage in dichotomous thinking of another kind: A parameter estimate is either inside, or outside, its confidence interval.

At this point I’d like to interrupt and say that a confidence interval (or simply an estimate with standard error) can be used to give a sense of inferential uncertainty. There is no reason for dichotomous thinking when confidence intervals, or uncertainty intervals, or standard errors, are used in practice.

Here’s a very simple example from my book with Jennifer:

This graph has a bunch of estimates +/- standard errors, that is, 68% confidence intervals, with no dichotomous thinking in sight. In contrast, testing some hypothesis of no change over time, or no change during some period of time, would make no substantive sense and would just be an invitation to add noise to our interpretation of these data.
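For readers wondering where the 68% comes from: under the usual assumption that the estimate is approximately normally distributed around the true value, the chance of landing within z standard errors can be computed directly. Here is a minimal sketch; the function name `normal_coverage` is just for illustration.

```python
import math

def normal_coverage(z):
    """Probability that a normally distributed estimate lies within
    z standard errors of the true value."""
    return math.erf(z / math.sqrt(2.0))

# Coverage of estimate +/- z standard errors for a few common choices of z.
for z in (1.0, 1.96, 2.0):
    print(f"+/- {z} se: {normal_coverage(z):.3f}")
```

So +/- 1 standard error corresponds to roughly 68% coverage and +/- 2 standard errors to roughly 95%, with no threshold or accept/reject decision involved.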

OK, to continue with Haig’s review:

Error-statisticians have good reason for claiming that their reinterpretation of frequentist confidence intervals is superior to the standard view. The standard account of confidence intervals adopted by the new statisticians prespecifies a single confidence interval (a strong preference for 0.95 in their case). . . . By contrast, the error-statistician draws inferences about each of the obtained values according to whether they are warranted, or not, at different severity levels, thus leading to a series of confidence intervals. Crucially, the different values will not have the same probative force. . . . Details on the error-statistical conception of confidence intervals can be found in SIST (pp. 189-201), as well as Mayo and Spanos (2011) and Spanos (2014). . . .

SIST makes clear that, with its error-statistical perspective, statistical inference can be employed to deal with both estimation and hypothesis testing problems. It also endorses the view that providing explanations of things is an important part of science.

Another interruption from me . . . I just want to plug my paper with Guido Imbens, Why ask why? Forward causal inference and reverse causal questions, in which we argue that Why questions can be interpreted as model checks, or, one might say, hypothesis tests—but tests of hypotheses of interest, not of straw-man null hypotheses. Perhaps there’s some connection between Mayo’s ideas and those of Guido and me on this point.

Haig continues with a discussion of Bayesian methods, including those of my collaborators and myself:

One particularly important modern variant of Bayesian thinking, which receives attention in SIST, is the falsificationist Bayesianism of . . . Gelman and Shalizi (2013). Interestingly, Gelman regards his Bayesian philosophy as essentially error-statistical in nature – an intriguing claim, given the anti-Bayesian preferences of both Mayo and Gelman’s co-author, Cosma Shalizi. . . . Gelman acknowledges that his falsificationist Bayesian philosophy is underdeveloped, so it will be interesting to see how its further development relates to Mayo’s error-statistical perspective. It will also be interesting to see if Bayesian thinkers in psychology engage with Gelman’s brand of Bayesian thinking. Despite the appearance of his work in a prominent psychology journal, they have yet to do so. . . .

Hey, not quite! I’ve done a lot of collaboration with psychologists; see here and search on “Iven Van Mechelen” and “Francis Tuerlinckx”—but, sure, I recognize that our Bayesian methods, while mainstream in various fields including ecology and political science, are not yet widely used in psychology.

Haig concludes:

From a sympathetic, but critical, reading of Popper, Mayo endorses his strategy of developing scientific knowledge by identifying and correcting errors through strong tests of scientific claims. . . . A heartening attitude that comes through in SIST is the firm belief that a philosophy of statistics is an important part of statistical thinking. This contrasts markedly with much of statistical theory, and most of statistical practice. Given that statisticians operate with an implicit philosophy, whether they know it or not, it is better that they avail themselves of an explicitly thought-out philosophy that serves practice in useful ways.

I agree, very much.

To paraphrase Bill James, the alternative to good philosophy is not “no philosophy,” it’s “bad philosophy.” I’ve spent too much time seeing Bayesians avoid checking their models out of a philosophical conviction that subjective priors cannot be empirically questioned, and too much time seeing non-Bayesians produce ridiculous estimates that could have been avoided by using available outside information. There’s nothing so practical as good practice, but good philosophy can facilitate both the development and acceptance of better methods.

**E. J. Wagenmakers**

I’ll follow up with a very short review, or, should I say, reaction-in-place-of-a-review, from psychometrician E. J. Wagenmakers:

I cannot comment on the contents of this book, because doing so would require me to read it, and extensive prior knowledge suggests that I will violently disagree with almost every claim that is being made. In my opinion, the only long-term hope for vague concepts such as the “severity” of a test is to embed them within a rational (i.e., Bayesian) framework, but I suspect that this is not the route that the author wishes to pursue. Perhaps this book is comforting to those who have neither the time nor the desire to learn Bayesian inference, in a similar way that homeopathy provides comfort to patients with a serious medical condition.

You don’t have to agree with E. J. to appreciate his honesty!

**Art Owen**

Coming from a different perspective is theoretical statistician Art Owen, whose review has some mathematical formulas—nothing too complicated, but not so easy to display in html, so I’ll just link to the pdf and share some excerpts:

There is an emphasis throughout on the importance of severe testing. It has long been known that a test that fails to reject H0 is not very conclusive if it had low power to reject H0. So I wondered whether there was anything more to the severity idea than that. After some searching I found on page 343 a description of how the severity idea differs from the power notion. . . .

I think that it might be useful in explaining a failure to reject H0 as the sample size being too small. . . . it is extremely hard to measure power post hoc because there is too much uncertainty about the effect size. Then, even if you want it, you probably cannot reliably get it. I think severity is likely to be in the same boat. . . .

I believe that the statistical problem from incentives is more severe than choice between Bayesian and frequentist methods or problems with people not learning how to use either kind of method properly. . . . We usually teach and do research assuming a scientific loss function that rewards being right. . . . In practice many people using statistics are advocates. . . . The loss function strongly informs their analysis, be it Bayesian or frequentist. The scientist and advocate both want to minimize their expected loss. They are led to different methods. . . .

I appreciate Owen’s efforts to link Mayo’s words to the equations that we would ultimately need to implement, or evaluate, her ideas in statistics.

**Robert Cousins**

Physicist Robert Cousins did not have the time to write a comment on Mayo’s book, but he did point us to this monograph he wrote on the foundations of statistics, which has lots of interesting stuff but is unfortunately a bit out of date when it comes to the philosophy of Bayesian statistics, which he ties in with subjective probability. (For a corrective, see my aforementioned article with Hennig.)

In his email to me, Cousins also addressed issues of statistical and practical significance:

Our [particle physicists’] problems and the way we approach them are quite different from some other fields of science, especially social science. As one example, I think I recall reading that you do not mind adding a parameter to your model, whereas adding (certain) parameters to our models means adding a new force of nature (!) and a Nobel Prize if true. As another example, a number of statistics papers talk about how silly it is to claim a 10^{-4} departure from 0.5 for a binomial parameter (ESP examples, etc), using it as a classic example of the difference between nominal (probably mismeasured) statistical significance and practical significance. In contrast, when I was a grad student, a famous experiment in our field measured a 10^{-4} departure from 0.5 with an uncertainty of 10% of itself, i.e., with an uncertainty of 10^{-5}. (Yes, the order of 10^{10} Bernoulli trials, counting electrons being scattered left or right.) This led quickly to a Nobel Prize for Steven Weinberg et al., whose model (now “Standard”) had predicted the effect.
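A rough check on the scale of such an experiment, assuming only the textbook binomial standard error sqrt(p(1-p)/n) for a proportion near 0.5 and ignoring systematic effects entirely. The helper names are hypothetical, and the exact count depends on how the left-right asymmetry is defined.

```python
import math

def binomial_se(n, p=0.5):
    """Standard error of an estimated proportion from n Bernoulli trials."""
    return math.sqrt(p * (1.0 - p) / n)

def trials_needed(target_se, p=0.5):
    """Number of Bernoulli trials needed to reach a given standard error."""
    return p * (1.0 - p) / target_se ** 2

n = trials_needed(1e-5)
print(f"trials for se = 1e-5: {n:.2e}")
print(f"se with 1e10 trials:  {binomial_se(1e10):.1e}")
```

Under these assumptions the required number of trials comes out in the billions, roughly 2.5 x 10^9, which is in the same ballpark as the figure Cousins quotes.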

I replied:

This interests me in part because I am a former physicist myself. I have done work in physics and in statistics, and I think the principles of statistics that I have applied to social science also apply to physical sciences. Regarding the discussion of Bem’s experiment, what I said was not that an effect of 0.0001 is unimportant, but rather that if you were to really believe Bem’s claims, there could be effects of +0.0001 in some settings, -0.002 in others, etc. If this is interesting, fine: I’m not a psychologist. One of the key mistakes of Bem and others like him is to suppose that an effect they happen to have discovered in some scenario represents some sort of universal truth; there is no reason to suppose this. Humans differ from each other in a way that elementary particles do not.

And Cousins replied:

Indeed in the binomial experiment I mentioned, controlling unknown systematic effects to the level of 10^{-5}, so that what they were measuring (a constant of nature called the Weinberg angle, now called the weak mixing angle) was what they intended to measure, was a heroic effort by the experimentalists.

**Stan Young**

Stan Young, a statistician who’s worked in the pharmaceutical industry, wrote:

I’ve been reading at the Mayo book and also pestering where I think poor statistical practice is going on. Usually the poor practice is by non-professionals and usually it is not intentionally malicious however self-serving.

But I think it naive to think that education is all that is needed. Or some grand agreement among professional statisticians will end the problems. There are science crooks and statistical crooks and there are no cops, or very few. That is a long way of saying, this problem is not going to be solved in 30 days, or by one paper, or even by one book or by three books! (I’ve read all three.)

I think a more open-ended and longer dialog would be more useful with at least some attention to willful and intentional misuse of statistics.

Chambers C. The Seven Deadly Sins of Psychology. New Jersey: Princeton University Press, 2017.

Harris R. Rigor mortis: how sloppy science creates worthless cures, crushes hope, and wastes billions. New York: Basic books, 2017.

Hubbard R. Corrupt Research. London: Sage Publications, 2015.

**Christian Hennig**

Hennig, a statistician and my collaborator on the Beyond Subjective and Objective paper, sent in *two* reviews of Mayo’s book.

Here are his general comments:

What I like about Deborah Mayo’s “Statistical Inference as Severe Testing”

Before I start to list what I like about “Statistical Inference as Severe Testing”, I should say that I don’t agree with everything in the book. In particular, as a constructivist I am skeptical about the use of terms like “objectivity”, “reality” and “truth” in the book, and I think that Mayo’s own approach may not be able to deliver everything that people may come to believe it could, from reading the book (although Mayo could argue that overly high expectations could be avoided by reading carefully).

So now, what do I like about it?

1) I agree with the broad concept of severity and severe testing. In order to have evidence for a claim, it has to be tested in ways that would reject the claim with high probability if it indeed were false. I also think that it makes a lot of sense to start a philosophy of statistics and a critical discussion of statistical methods and reasoning from this requirement. Furthermore, throughout the book Mayo consistently argues from this position, which makes the different “Excursions” fit well together and add up to a consistent whole.

2) I get a lot out of the discussion of the philosophical background of scientific inquiry, of induction, probabilism, falsification and corroboration, and their connection to statistical inference. I think that it makes sense to connect Popper’s philosophy to significance tests in the way Mayo does (without necessarily claiming that this is the only possible way to do it), and I think that her arguments are broadly convincing at least if I take a realist perspective of science (which as a constructivist I can do temporarily while keeping the general reservation that this is about a specific construction of reality which I wouldn’t grant absolute authority).

3) I think that Mayo does by and large a good job listing much of the criticism that has been raised in the literature against significance testing, and she deals with it well. Partly she criticises bad uses of significance testing herself by referring to the severity requirement, but she also defends a well understood use in a more general philosophical framework of testing scientific theories and claims in a piecemeal manner. I find this largely convincing, conceding that there is a lot of detail and that I may find myself in agreement with the occasional objection against the odd one of her arguments.

4) The same holds for her comprehensive discussion of Bayesian/probabilist foundations in Excursion 6. I think that she elaborates issues and inconsistencies in the current use of Bayesian reasoning very well, maybe with the odd exception.

5) I am in full agreement with Mayo’s position that when using probability modelling, it is important to be clear about the meaning of the computed probabilities. Agreement in numbers between different “camps” isn’t worth anything if the numbers mean different things. A problem with some positions that are sold as “pragmatic” these days is that often not enough care is put into interpreting what the results mean, or even deciding in advance what kind of interpretation is desired.

6) As mentioned above, I’m rather skeptical about the concept of objectivity and about an all too realist interpretation of statistical models. I think that in Excursion 4 Mayo manages to explain in a clear manner what her claims of “objectivity” actually mean, and she also appreciates more clearly than before the limits of formal models and their distance to “reality”, including some valuable thoughts on what this means for model checking and arguments from models.

So overall it was a very good experience to read her book, and I think that it is a very valuable addition to the literature on foundations of statistics.

Hennig also sent some specific discussion of one part of the book:

1 Introduction

This text discusses parts of Excursion 4 of Mayo (2018) titled “Objectivity and Auditing”. This starts with the section title “The myth of ‘The myth of objectivity’”. Mayo advertises objectivity in science as central and as achievable.

In contrast, in Gelman and Hennig (2017) we write: “We argue that the words ‘objective’ and ‘subjective’ in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes.” I will here outline agreement and disagreement that I have with Mayo’s Excursion 4, and raise some issues that I think require more research and discussion.

2 Pushback and objectivity

The second paragraph of Excursion 4 states in bold letters: “The Key Is Getting Pushback”, and this is the major source of agreement between Mayo’s and my views (*). I call myself a constructivist, and this is about acknowledging the impact of human perception, action, and communication on our world-views, see Hennig (2010). However, it is an almost universal experience that we cannot construct our perceived reality as we wish, because we experience “pushback” from what we perceive as “the world outside”. Science is about allowing us to deal with this pushback in stable ways that are open to consensus. A major ingredient of such science is the “Correspondence (of scientific claims) to observable reality”, and in particular “Clear conditions for reproduction, testing and falsification”, listed as “Virtue 4/4(b)” in Gelman and Hennig (2017). Consequently, there is no disagreement with much of the views and arguments in Excursion 4 (and the rest of the book). I actually believe that there is no contradiction between constructivism understood in this way and Chang’s (2012) “active scientific realism” that asks for action in order to find out about “resistance from reality”, or in other words, experimenting, experiencing and learning from error.

If what is called “objectivity” in Mayo’s book were the generally agreed meaning of the term, I would probably not have a problem with it. However, there is a plethora of meanings of “objectivity” around, and on top of that the term is often used as a sales pitch by scientists in order to lend authority to findings or methods and often even to prevent them from being questioned. Philosophers understand that this is a problem but are mostly eager to claim the term anyway; I have attended conferences on philosophy of science and heard a good number of talks, some better, some worse, with messages of the kind “objectivity as understood by XYZ doesn’t work, but here is my own interpretation that fixes it”. Calling frequentist probabilities “objective” because they refer to the outside world rather than epistemic states, and calling a Bayesian approach “objective” because priors are chosen by general principles rather than personal beliefs are in isolation also legitimate meanings of “objectivity”, but these two and Mayo’s and many others (see also the Appendix of Gelman and Hennig, 2017) differ. The use of “objectivity” in public and scientific discourse is a big muddle, and I don’t think this will change as a consequence of Mayo’s work. I prefer stating what we want to achieve more precisely using less loaded terms, which I think Mayo has achieved well not by calling her approach “objective” but rather by explaining in detail what she means by that.

3 Trust in models?

In the remainder, I will highlight some limitations of Mayo’s “objectivity” that are mainly connected to Tour IV on objectivity, model checking and whether it makes sense to say that “all models are false”. Error control is central for Mayo’s objectivity, and this relies on error probabilities derived from probability models. If we want to rely on these error probabilities, we need to trust the models, and, very appropriately, Mayo devotes Tour IV to this issue. She concedes that all models are false, but states that this is rather trivial, and what is really relevant when we use statistical models for learning from data is rather whether the models are adequate for the problem we want to solve. Furthermore, model assumptions can be tested and it is crucial to do so, which, as follows from what was stated before, does not mean to test whether they are really true but rather whether they are violated in ways that would destroy the adequacy of the model for the problem. So far I can agree. However, I see some difficulties that are not addressed in the book, and mostly not elsewhere either. Here is a list.

3.1. Adaptation of model checking to the problem of interest

As all models are false, it is not too difficult to find model assumptions that are violated but don’t matter, or at least don’t matter in most situations. The standard example would be the use of continuous distributions to approximate distributions of essentially discrete measurements. What does it mean to say that a violation of a model assumption doesn’t matter? This is not so easy to specify, and not much about this can be found in Mayo’s book or in the general literature. Surely it has to depend on what exactly the problem of interest is. A simple example would be to say that we are interested in statements about the mean of a discrete distribution, and then to show that estimation or tests of the mean are very little affected if a certain continuous approximation is used. This is reassuring, and certain other issues could be dealt with in this way, but one can ask harder questions. If we approximate a slightly skew distribution by a (unimodal) symmetric one, are we really interested in the mean, the median, or the mode, which for a symmetric distribution would be the same but for the skew distribution to be approximated would differ? Any frequentist distribution is an idealisation, so do we first need to show that it is fine to approximate a discrete non-distribution by a discrete distribution before worrying whether the discrete distribution can be approximated by a continuous one? (And how could we show that?) And so on.

3.2. Severity of model misspecification tests

Following the logic of Mayo (2018), misspecification tests need to be severe in order to fulfill their purpose; otherwise data could pass a misspecification test that would be of little help ruling out problematic model deviations. I’m not sure whether there are any results of this kind, be it in Mayo’s work or elsewhere. I imagine that if the alternative is parametric (for example testing independence against a standard time series model) severity can occasionally be computed easily, but for most model misspecification tests it will be a hard problem.

3.3. Identifiability issues, and ruling out models by other means than testing

Not all statistical models can be distinguished by data. For example, even with arbitrarily large amounts of data only lower bounds of the number of modes can be estimated; an assumption of unimodality can strictly not be tested (Donoho 1988). Worse, only regular but not general patterns of dependence can be distinguished from independence by data; any non-i.i.d. pattern can be explained by either dependence or non-identity of distributions, and telling these apart requires constraints on dependence and non-identity structures that themselves cannot be tested on the data (in the example given in 4.11 of Mayo, 2018, all tests discover specific regular alternatives to the model assumption). Given that this is so, the question arises on which grounds we can rule out irregular patterns (about the simplest and most silly one is “observations depend in such a way that every observation determines the next one to be exactly what it was observed to be”) by other means than data inspection and testing. Such models are probably useless; however, if they were true, they would destroy any attempt to find “true” or even approximately true error probabilities.

3.4. Robustness against what cannot be ruled out

The above implies that certain deviations from the model assumptions cannot be ruled out, and then one can ask: How robust is the substantial conclusion that is drawn from the data against models different from the nominal one, which could not be ruled out by misspecification testing, and how robust are error probabilities? The approaches of standard robust statistics probably have something to contribute in this respect (e.g., Hampel et al., 1986), although their starting point is usually different from “what is left after misspecification testing”. This will depend, as everything, on the formulation of the “problem of interest”, which needs to be defined not only in terms of the nominal parametric model but also in terms of the other models that could not be ruled out.

3.5. The effect of preliminary model checking on model-based inference

Mayo is correctly concerned about biasing effects of model selection on inference. Deciding what model to use based on misspecification tests is some kind of model selection, so it may bias inference that is made in case of passing misspecification tests. One way of stating the problem is to realise that in most cases the assumed model conditionally on having passed a misspecification test no longer holds. I have called this the “goodness-of-fit paradox” (Hennig, 2007); the issue has been mentioned elsewhere in the literature. Mayo has argued that this is not a problem, and this is in a well defined sense true (meaning that error probabilities derived from the nominal model are not affected by conditioning on passing a misspecification test) if misspecification tests are indeed “independent of (or orthogonal to) the primary question at hand” (Mayo 2018, p. 319). The problem is that for the vast majority of misspecification tests independence/orthogonality does not hold, at least not precisely. So the actual effect of misspecification testing on model-based inference is a matter that needs to be investigated on a case-by-case basis. Some work of this kind has been done or is currently being done; results are not always positive (an early example is Easterling and Anderson 1978).

4 Conclusion

The issues listed in Section 3 are in my view important and worthy of investigation. Such investigation has already been done to some extent, but there are many open problems. I believe that some of these can be solved, some are very hard, and some are impossible to solve or may lead to negative results (particularly connected to lack of identifiability). However, I don’t think that these issues invalidate Mayo’s approach and arguments; I expect at least the issues that cannot be solved to affect in one way or another any alternative approach. My case is just that methodology that is “objective” according to Mayo comes with limitations that may be incompatible with some other people’s ideas of what “objectivity” should mean (in which sense it is in good company though), and that the falsity of models has some more cumbersome implications than Mayo’s book could make the reader believe.

(*) There is surely a strong connection between what I call “my” view here with the collaborative position in Gelman and Hennig (2017), but as I write the present text on my own, I will refer to “my” position here and let Andrew Gelman speak for himself.

References:

Chang, H. (2012) Is Water H2O? Evidence, Realism and Pluralism. Dordrecht: Springer.

Donoho, D. (1988) One-Sided Inference about Functionals of a Density. Annals of Statistics 16, 1390-1420.

Easterling, R. G. and Anderson, H.E. (1978) The effect of preliminary normality goodness of fit tests on subsequent inference. Journal of Statistical Computation and Simulation 8, 1-11.

Gelman, A. and Hennig, C. (2017) Beyond subjective and objective in statistics (with discussion). Journal of the Royal Statistical Society, Series A 180, 967–1033.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust statistics. New York: Wiley.

Hennig, C. (2010) Mathematical models and reality: a constructivist perspective. Foundations of Science 15, 29–48.

Hennig, C. (2007) Falsification of propensity models by statistical tests and the goodness-of-fit paradox. Philosophia Mathematica 15, 166-192.

Mayo, D. G. (2018) Statistical Inference as Severe Testing. Cambridge University Press.

**My own reactions**

I’m still struggling with the key ideas of Mayo’s book. (Struggling is a good thing here, I think!)

First off, I appreciate that Mayo takes my own philosophical perspective seriously—I’m actually thrilled to be taken seriously, after years of dealing with a professional Bayesian establishment tied to naive (as I see it) philosophies of subjective or objective probabilities, and anti-Bayesians not willing to think seriously about these issues at all—and I don’t think any of these philosophical issues are going to be resolved any time soon. I say this because I’m so aware of the big Cantor-size hole in the corner of my own philosophy of statistical learning.

In statistics—maybe in science more generally—philosophical paradoxes are sometimes resolved by technological advances. Back when I was a student I remember all sorts of agonizing over the philosophical implications of exchangeability, but now that we can routinely fit varying-intercept, varying-slope models with nested and non-nested levels and (we’ve finally realized the importance of) informative priors on hierarchical variance parameters, a lot of the philosophical problems have dissolved; they’ve become surmountable technical problems. (For example: should we consider a group of schools, or states, or hospitals, as “truly exchangeable”? If not, there’s information distinguishing them, and we can include such information as group-level predictors in our multilevel model. Problem solved.)

Rapid technological progress resolves many problems in ways that were never anticipated. (Progress creates new problems too; that’s another story.) I’m not such an expert on deep learning and related methods for inference and prediction—but, again, I think these will change our perspective on statistical philosophy in various ways.

This is all to say that any philosophical perspective is time-bound. On the other hand, I don’t think that Popper/Kuhn/Lakatos will ever be forgotten: this particular trinity of twentieth-century philosophy of science has forever left us in a different place than where we were, a hundred years ago.

To return to Mayo’s larger message: I agree with Hennig that Mayo is correct to place evaluation at the center of statistics.

I’ve thought a lot about this, in many years of teaching statistics to graduate students. In a class for first-year statistics Ph.D. students, you want to get down to the fundamentals.

What’s the most fundamental thing in statistics? Experimental design? No. You can’t really pick your design until you have some sense of how you will analyze the data. (This is the principle of the great Raymond Smullyan: To understand the past, we must first know the future.) So is data analysis the most fundamental thing? Maybe so, but what method of data analysis? Last I heard, there are many schools. *Bayesian* data analysis, perhaps? Not so clear; what’s the motivation for modeling everything probabilistically? Sure, it’s coherent—but so is some mental patient who thinks he’s Napoleon and acts daily according to that belief. We can back into a more fundamental, or statistical, justification of Bayesian inference and hierarchical modeling by first considering the principle of external validation of predictions, then showing (both empirically and theoretically) that a hierarchical Bayesian approach performs well based on this criterion—and then following up with the Jaynesian point that, when Bayesian inference fails to perform well, this recognition represents additional information that can and should be added to the model. All of this is the theme of the example in section 7 of BDA3—although I have the horrible feeling that students often don’t get the point, as it’s easy to get lost in all the technical details of the inference for the hyperparameters in the model.

Anyway, to continue . . . it still seems to me that the most foundational principles of statistics are frequentist. Not unbiasedness, not p-values, and not type 1 or type 2 errors, but frequency properties nevertheless. Statements about how well your procedure will perform in the future, conditional on some assumptions of stationarity and exchangeability (analogous to the assumption in physics that the laws of nature will be the same in the future as they’ve been in the past—or, if the laws of nature are changing, that they’re not changing very fast! We’re in Cantor’s corner again).

So, I want to separate the principle of frequency evaluation—the idea that frequency evaluation and criticism represents one of the three foundational principles of statistics (with the other two being mathematical modeling and the understanding of variation)—from specific statistical methods, whether they be methods that I like (Bayesian inference, estimates and standard errors, Fourier analysis, lasso, deep learning, etc.) or methods that I suspect have done more harm than good or, at the very least, have been taken too far (hypothesis tests, p-values, so-called exact tests, so-called inverse probability weighting, etc.). We can be frequentists, use mathematical models to solve problems in statistical design and data analysis, and engage in model criticism, without making decisions based on type 1 error probabilities etc.

To say it another way, bringing in the title of the book under discussion: I would not quite say that statistical inference *is* severe testing, but I do think that severe testing is a crucial part of statistics. I see statistics as an unstable mixture of inference conditional on a model (“normal science”) and model checking (“scientific revolution”). Severe testing is fundamental, in that the prospect of revolution is a key contributor to the success of normal science. We lean on our models in large part because they have been, and will continue to be, put to the test. And we choose our statistical methods in large part because, under certain assumptions, they have good frequency properties.

And now on to Mayo’s subtitle. I don’t think her, or my, philosophical perspective will get us “beyond the statistics wars” by itself—but perhaps it will ultimately move us in this direction, if practitioners and theorists alike can move beyond naive confirmationist reasoning toward an embrace of variation and acceptance of uncertainty.

I’ll summarize by expressing agreement with Mayo’s perspective that frequency evaluation is fundamental, while disagreeing with her focus on various crude (from my perspective) ideas such as type 1 errors and p-values. When it comes to statistical philosophy, I’d rather follow Laplace, Jaynes, and Box than Neyman, Wald, and Savage. Phony Bayesmania has bitten the dust.

**Thanks**

Let me again thank Haig, Wagenmakers, Owen, Cousins, Young, and Hennig for their discussions. I expect that Mayo will respond to these, and also to any comments that follow in this thread, once she has time to digest it all.

**P.S.** And here’s a review from Christian Robert.

Machine learning can help personalized decision support by learning models to predict individual treatment effects (ITE). This work studies the reliability of prediction-based decision-making in a task of deciding which action a to take for a target unit after observing its covariates x̃ and predicted outcomes p̂(ỹ∣x̃,a). An example case is personalized medicine and the decision of which treatment to give to a patient. A common problem when learning these models from observational data is imbalance, that is, difference in treated/control covariate distributions, which is known to increase the upper bound of the expected ITE estimation error. We propose to assess the decision-making reliability by estimating the ITE model’s Type S error rate, which is the probability of the model inferring the sign of the treatment effect wrong. Furthermore, we use the estimated reliability as a criterion for active learning, in order to collect new (possibly expensive) observations, instead of making a forced choice based on unreliable predictions. We demonstrate the effectiveness of this decision-making aware active learning in two decision-making tasks: in simulated data with binary outcomes and in a medical dataset with synthetic and continuous treatment outcomes.

Decision making, varying treatment effects, type S errors, Stan, validation. . . this paper has all of my favorite things!

Perhaps you know this study which is being taken at face value in all the secondary reports: “Air pollution causes ‘huge’ reduction in intelligence, study reveals.” It’s surely alarming, but the reported effect of air pollution seems implausibly large, so it’s hard to be convinced of it by a correlational study alone, when we can suspect instead that the smarter, more educated folks are more likely to be found in polluted conditions for other reasons. They did try to allow for the usual covariates, but there is the usual problem that you never know whether you’ve done enough of that.

Assuming equal statistical support, I suppose the larger an effect, the less likely it is to be due to uncontrolled covariates. But also the larger the effect, the more reasonable it is to demand strongly convincing evidence before accepting it.

From the above-linked news article:

“Polluted air can cause everyone to reduce their level of education by one year, which is huge,” said Xi Chen at Yale School of Public Health in the US, a member of the research team. . . .

The new work, published in the journal Proceedings of the National Academy of Sciences, analysed language and arithmetic tests conducted as part of the China Family Panel Studies on 20,000 people across the nation between 2010 and 2014. The scientists compared the test results with records of nitrogen dioxide and sulphur dioxide pollution.

They found the longer people were exposed to dirty air, the bigger the damage to intelligence, with language ability more harmed than mathematical ability and men more harmed than women. The researchers said this may result from differences in how male and female brains work.

The above claims are indeed bold, but the researchers seem pretty careful:

The study followed the same individuals as air pollution varied from one year to the next, meaning that many other possible causal factors such as genetic differences are automatically accounted for.

The scientists also accounted for the gradual decline in cognition seen as people age and ruled out people being more impatient or uncooperative during tests when pollution was high.

Following the same individuals through the study: that makes a lot of sense.

I hadn’t heard of this study when it came out so I followed the link and read it now.

You can model the effects of air pollution as short-term or long-term. An example of a short-term effect is that air pollution makes it harder to breathe, you get less oxygen in your brain, etc., or maybe you’re just distracted by the discomfort and can’t think so well. An example of a long-term effect is that air pollution damages your brain or other parts of your body in various ways that impact your cognition.

The model includes air pollution levels on the day of measurement and on the past few days or months or years, and also a quadratic monthly time trend from Jan 2010 to Dec 2014. A quadratic time trend, that seems weird, kinda worrying. Are people’s test scores going up and down in that way?

In any case, their regression finds that air pollution levels from the past months or years are a strong predictor of the cognitive test outcome, and today’s air pollution doesn’t add much predictive power after including the historical pollution level.

Some minor things:

Measurement of cognitive performance:

The waves 2010 and 2014 contain the same cognitive ability module, that is, 24 standardized mathematics questions and 34 word-recognition questions. All of these questions are sorted in ascending order of difficulty, and the final test score is defined as the rank of the hardest question that a respondent is able to answer correctly.

Huh? Are you serious? Wouldn’t it be better to use the number of questions answered correctly? Even better would be to fit a simple item-response model, but I’d guess that #correct would capture almost all the relevant information in the data. But to just use the rank of the hardest question answered correctly: that seems inefficient, no?
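To make the inefficiency concrete, here is a minimal sketch comparing the two scoring rules on made-up data. The function names and the two hypothetical respondents are assumptions for illustration only; nothing here comes from the paper's data or code.

```python
def rank_of_hardest_correct(responses):
    """Score as described in the quoted passage: the 1-based rank of the
    hardest item answered correctly. `responses` is a list of booleans
    ordered from easiest to hardest item. Returns 0 if nothing is correct."""
    score = 0
    for rank, correct in enumerate(responses, start=1):
        if correct:
            score = rank
    return score


def number_correct(responses):
    """Alternative score: simply count the correct answers."""
    return sum(responses)


# Two hypothetical respondents on a 10-item test (made-up data).
steady = [True] * 8 + [False, False]        # items 1-8 correct
lucky = [True] + [False] * 6 + [True] + [False, False]  # only items 1 and 8 correct

for name, responses in [("steady", steady), ("lucky", lucky)]:
    print(name, rank_of_hardest_correct(responses), number_correct(responses))
```

Both respondents get the same rank-based score even though one answered eight items correctly and the other only two, which is the information the rank-based rule throws away.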

Comparison between the sexes:

The authors claim that air pollution has a larger effect on men than on women (see above quote from the news article). But I suspect this is yet another example of “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” It’s hard to tell. For example, there’s this graph:

The plot on the left shows a lot of consistency across age groups. Too much consistency, I think. I’m guessing that there’s something in the model keeping these estimates similar to each other, i.e. I don’t think they’re five independent results.

The authors write:

People may become more impatient or uncooperative when exposed to more polluted air. Therefore, it is possible that the observed negative effect on cognitive performance is due to behavioral change rather than impaired cognition. . . . Changes in the brain chemistry or composition are likely more plausible channels between air pollution and cognition.

I think they’re missing the point here and engaging in a bit of “scientism” or “mind-body dualism” in the following way: Suppose that air pollution irritates people, making it hard for people to concentrate on cognitive tasks. That is a form of impaired cognition. Just cos it’s “behavioral,” doesn’t make it not real.

In any case, putting this all together, what can we say? This seems like a serious analysis, and to start with the authors should make all their data and code available so that others can try fitting their own models. This is an important problem, so it’s good to have as many eyes on the data as possible.

In this particular example, it seems that the key information is coming from:

– People who moved from one place to another, either moving from a high-pollution to a low-pollution area or vice-versa, and then you can see if their test scores went correspondingly up or down. After adjusting for expected cognitive decline by age during this period.

– People who lived in the same place but where there was a negative or positive trend in pollution. Again you can see if these people’s test scores went up or down. Again, after adjusting for expected cognitive decline by age during this period.

– People who didn’t move, comparing these people who lived all along in high- or low-pollution areas, and seeing who had higher test scores. After adjusting for demographic differences between people living in these different cities.

This leaves me with two thoughts:

First, I’d like to see the analyses in these three different groups. One big regression is fine, but in this sort of problem I think it’s important to understand the path from data to conclusions. This is especially an issue given that we might see different results from the three different comparisons listed above.

Second, I am concerned with some incoherence regarding how the effect works. The story in the paper, supported by the regression analysis, seems to be that what matters is long-term exposure. But, if so, I don’t see how the short-term longitudinal analysis in this paper is getting us to that. If effects of air pollution on cognition are long-term, then really this is all a big cross-sectional analysis, which brings up the usual issues of unobserved confounders, selection bias, etc., and the multiple measurements on each person is not really giving us identification at all.

**P.S.** The problems with this study, along with the uncritical press coverage, suggest a concern that is not specific to this paper but applies more generally to superstar journals such as PNAS, Science, Nature, Lancet, NEJM, JAMA, etc.: they often seem to give journalists a free pass to report uncritically. This sort of episode makes me think the world would be better if these superstar journals just didn’t exist, or if they were all to shut down tomorrow and be replaced by regular old field journals.

Invest the time to learn data manipulation tools well (e.g. tidyverse). Increased familiarity with these tools often leads to greater time savings and less frustration in future.

Hmm, it’s never one tip . . . I have never found it useful to begin writing code, especially on a greenfield project, unless I had thought through the steps to the goal. I often still write the code in outline form first and edit it before entering the programming steps. Some other tips:

1. Choose the right tool for the right job. Don’t use C++ if you’re going to design a web site.

2. Document code well but don’t overdo it, and leave some unit tests or assertions inside a commented field.

3. Testing code will always show the presence of bugs, not their absence (Dijkstra), but that doesn’t mean you should be a slacker.

4. Keep it simple at first; you may have to rewrite the program several times if it’s something new, so don’t optimize until you’re satisfied. Finally, if you can control the L1 cache, you can control the world (Sabini).

Just try stuff. Nothing works the first time and you’ll have to throw out your meticulous plan once you actually start working. You’ll find all the hiccups and issues with your data the more time you actually spend in it.

Consider the sampling procedure and the methods (specifics of the questionnaire etc.) of data collection for “real-world” data to avoid any serious biases or flaws.

Quadruple-check your group by statements and joins!!

Cleaning data properly is essential.

Write a script to analyze the data. Don’t do anything “manually”.

Don’t be afraid to confer with others. Even though there’s often an expectation that we all be experts in all things data processing, the fact is that we all have different strengths and weaknesses and it’s always a good idea to benefit from others’ expertise.

For me, cleaning data is always really time-consuming, in particular when I use real-world data and (especially) string data such as names of cities/countries/individuals. In addition, when you run a survey for your research, there will always be that guy who types “b” instead of “B” or “B ” (hitting the computer’s Tab key). For this reason, my tip is: never underestimate the power of Excel (!!) when you have this kind of problem.

Data processing sucks. Work in an environment that enables you to do as little of it as possible. Tech companies these days have dedicated data engineers, and they are life-changing (in a good way) for researchers/data scientists.

If the data set is large, try the processing steps on a small subset of the data to make sure the output is what you expect. Include checks/control totals if possible. Do not overwrite the same dataset in important, complicated steps.

When converting data types, for example extracting integers or converting to dates, always check the agreement between the data before and after conversion. Sometimes when I was converting levels to integers (numerical values are somehow recorded as categorical because of the existence of NA), there were errors and the results were not what I expected (e.g. “3712” converted to “1672”).

Learn dplyr.

Organisation of files and ideas is vital – constantly leave reminders of what you were doing and why you made particular choices either within the file names (indicating perhaps the date in which the code or data was updated) or within comments throughout the code that explain why you made certain decisions.

Thanks, kids!
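Two of the tips above (try the processing steps on a small subset with control totals, and check the values before and after a type conversion) can be sketched in a few lines. This is only an illustration under stated assumptions: the rows are fake, and `clean_rows`, `control_totals`, and `to_int_checked` are hypothetical helper names, not functions from any library.

```python
import random


def clean_rows(rows):
    """Hypothetical cleaning step: normalize the city label and drop rows
    with a missing amount. Returns a new list; the input is not overwritten."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        cleaned.append({"city": row["city"].strip().upper(), "amount": row["amount"]})
    return cleaned


def control_totals(rows):
    """Row count and sum over non-missing amounts, to compare before and after."""
    amounts = [row["amount"] for row in rows if row["amount"] is not None]
    return {"n": len(amounts), "total": sum(amounts)}


def to_int_checked(values):
    """Convert strings to integers and confirm every value round-trips."""
    converted = [int(v) for v in values]
    for before, after in zip(values, converted):
        if str(after) != before.strip():
            raise ValueError(f"conversion changed {before!r} into {after!r}")
    return converted


# Fake data: messy city labels ("b", "B", "B ") and some missing amounts.
rng = random.Random(0)
rows = [
    {"city": rng.choice(["b", "B", "B "]), "amount": rng.choice([None, 1, 2, 3])}
    for _ in range(1000)
]

# Run the step on a small random subset first and compare control totals.
subset = rng.sample(rows, 50)
before = control_totals(subset)
cleaned_subset = clean_rows(subset)
after = control_totals(cleaned_subset)
print(before, after)

print(to_int_checked(["3712", "15"]))
```

If the two control totals disagree, or the conversion check raises an error, that is the signal to stop before running the step on the full dataset.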

**P.S.** Lots of good discussion in comments, especially this from Bob Carpenter.

Not really worth blogging about and a likely candidate for multiverse analysis, but the beginning of the first sentence in the 2nd paragraph made me laugh:

In the study – published in prestigious journal PNAS . . .

The researchers get extra points for this quote from the press release:

The researchers say that the findings make sense from an evolutionary point of view.

In evolutionary terms, these kinds of behaviours are completely rational, even adaptive. The basic idea is that the way people compete for mates, and the things they do to put themselves at the top of the hierarchy are really important. This is where this research fits in – it’s all about how women are competing and why they’re competing.

All right, then.

Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions, by Richard Harris. Review here by Leonard Freedman.

Retractions do not work very well, by Ken Cor and Gaurav Sood. This post by Tyler Cowen brought this paper to my attention.

Here’s a quote from Freedman’s review of Harris’s book:

Harris shows both sides of the reproducibility debate, noting that many eminent members of the research establishment would like to see this new practice of airing the scientific community’s dirty laundry quietly disappear. He describes how, for example, in the aftermath of their 2012 paper demonstrating that only 6 of 53 landmark studies in cancer biology could be reproduced, Glenn Begley and Lee Ellis were immediately attacked by some in the biomedical research aristocracy for their “naïveté,” their “lack of competence” and their “disservice” to the scientific community.

“The biomedical research aristocracy” . . . I like that.

From Cor and Sood’s abstract:

Using data from over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31.2% of the citations to retracted articles happen a year after they have been retracted. And that 91.4% of the post-retraction citations are approving—note no concern with the cited article.

I’m reminded of this story: “A study fails to replicate, but it continues to get referenced as if it had no problems. Communication channels are blocked.”

This is believable—and disturbing. But . . . do you really have to say “31.2%” and “91.4%”? Meaningless precision alert! Even if you could estimate those percentages to this sort of precision, you can’t take these numbers seriously, as the percentages are varying over time etc. Saying 30% and 90% would be just fine, indeed more appropriate and scientific, for the same reason that we don’t say that Steph Curry is 6’2.84378″ tall.

Just as a contrast, I’m also reading an old John Le Carre book, and here the characters have no agency at all. They’re just doing what is necessary to make the plot run. For Le Carre, that’s fine; the plot’s what it’s all about. So that’s an extreme case.

Anyway, I found the agency of Bravo’s characters refreshing. It’s not something I think about so often when reading, but this time it struck me.

**P.S.** I wrote about agency a few years ago in the context of Benjamin Kunkel’s book Indecision. I did a quick search and it doesn’t look like Kunkel has written much since. Too bad. But maybe he’s doing a Klam and it will be all right.

My colleagues David Rothschild and Tobi Konitzer recently published this MRP analysis, “The Geography of Partisan Prejudice: A guide to the most—and least—politically open-minded counties in America,” written up by Amanda Ripley, Rekha Tenjarla, and Angela He.

Ripley et al. write:

In general, the most politically intolerant Americans, according to the analysis, tend to be whiter, more highly educated, older, more urban, and more partisan themselves. This finding aligns in some ways with previous research by the University of Pennsylvania professor Diana Mutz, who has found that white, highly educated people are relatively isolated from political diversity. They don’t routinely talk with people who disagree with them; this isolation makes it easier for them to caricature their ideological opponents. . . . By contrast, many nonwhite Americans routinely encounter political disagreement. They have more diverse social networks, politically speaking, and therefore tend to have more complicated views of the other side, whatever side that may be. . . .

The survey results are summarized by this map:

I’m not a big fan of the discrete color scheme, which creates all sorts of discretization artifacts—but let’s leave that for another time. In future iterations of this project we can work on making the map clearer.

There are some funny things about this map and I’ll get to them in a moment, but first let’s talk about what’s being plotted here.

There are two things that go into the above map: the outcome measure and the predictive model, and it’s all described in this post from David and Tobi.

First, the outcome. They measured partisan prejudice by asking 14 partisan-related questions, from “How would you react if a member of your immediate family married a Democrat?” to “How well does the term ‘Patriotic’ describe Democrats?” to “How do you feel about Democratic voters today?”, asking 7 questions about each of the two parties and then fitting an item-response model to score each respondent who is a Democrat or Republican on how tolerant, or positive, they are about the other party.

Second, the model. They took data from 2000 survey responses and regressed these on individual and neighborhood (census block)-level demographic and geographic predictors to construct a model to implicitly predict “political tolerance” for everyone in the country, and then they poststratified, summing these up over estimated totals for all demographic groups to get estimates for county averages, which is what they plotted.

Having done the multilevel modeling and poststratification, they could plot all sorts of summaries, for example a map of estimated political tolerance just among whites, or a scatterplot of county-level estimated political tolerance vs. average education at the county level, or whatever. But we’ll focus on the map above.
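The poststratification step described above is just a population-weighted average of model predictions over demographic cells. Here is a minimal sketch with made-up numbers: the cell definitions, the predicted values, and the county counts are all hypothetical, and in the real analysis the predictions would come from the fitted multilevel model and would also vary with neighborhood-level predictors.

```python
def poststratify(cell_predictions, cell_counts):
    """Weighted average of predicted outcomes, weighting each demographic
    cell by its estimated population count in the county."""
    total = sum(cell_counts.values())
    weighted_sum = sum(cell_predictions[cell] * n for cell, n in cell_counts.items())
    return weighted_sum / total


# Hypothetical model predictions of the outcome for each demographic cell.
predictions = {
    ("white", "college"): 0.30,
    ("white", "no college"): 0.45,
    ("nonwhite", "college"): 0.50,
    ("nonwhite", "no college"): 0.60,
}

# Hypothetical population counts per cell in two counties.
county_counts = {
    "County A": {
        ("white", "college"): 4000,
        ("white", "no college"): 3000,
        ("nonwhite", "college"): 1000,
        ("nonwhite", "no college"): 2000,
    },
    "County B": {
        ("white", "college"): 1000,
        ("white", "no college"): 2000,
        ("nonwhite", "college"): 3000,
        ("nonwhite", "no college"): 4000,
    },
}

county_estimates = {
    county: poststratify(predictions, counts)
    for county, counts in county_counts.items()
}
print(county_estimates)
```

The two counties get different estimates only because their demographic mixes differ, which is why errors in the cell counts (for example, from inconsistent voter files) feed directly into the county estimates.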

**2. Two concerns with the map and how it’s constructed**

People have expressed two concerns about David and Tobi’s estimates.

First, the inferences are strongly model-based. If you’re getting estimates for 3000 counties from 2000 respondents (or even from 20,000 respondents, or 200,000), you’ll need to lean on a model. As a result, the map should not be taken to represent independent data within each county; rather, it’s a summary of a national-level model including individual and neighborhood (census block-level) predictors. As such, we want to think about ways of understanding and evaluating this model.

Second, the map shows some artifacts at state borders, most notably with Florida, South Carolina, New York state, South Dakota, Utah, and Wisconsin, along with some suggestive patterns elsewhere such as the borders between Virginia and North Carolina, and Missouri and Arkansas. I’m not sure about all these: as noted above, the discrete color scheme can create apparent patterns from small variation, and there are real differences in political cultures between states (Utah comes to mind). But there are definitely some problems here, problems which David and Tobi attribute to differences between states in the voter files that are used to estimate the total number of partisans (Democrats and Republicans) in each demographic category in each county. If the voter files for neighboring states are coming from different sorts of data, this can introduce apparent differences in the poststratification stage. These counting problems are especially cumbersome because we have to estimate the total number of partisans in each demographic category in each county.

**3. Four plans for further research**

So, what to do about these concerns? I have four ideas, all of which involve some mix of statistics and political science research, along with good old data munging:

(a) *Measurement error model for differences between states in classifications.* The voter files have different meanings in different states? Model it, with some state effects that are estimated from the data and using whatever additional information we can find on the measurement and classification process.

(b) *Varying intercept model plus spatial correlation as a fix to the state boundary problems.* This is kind of a light, klugey version of the above option. We recognize that some state-level fix is needed, and instead of modeling the measurement error or coding differences directly, we throw in a state-level error term, along with a spatial correlation penalty term to enforce similarity across county boundaries (maybe only counting counties that are similar in certain characteristics such as ethnic breakdown and proportion urban/suburban/rural).

(c) *Tracking down exactly what happened to create those artifacts at the state boundaries.* Before or after doing the modeling to correct the glaring boundary artifacts, it would be good to do some model analysis to work out the “trail of breadcrumbs” explaining exactly how the particular artifacts we see arose, to connect the patterns on the map with what was going on in the data.

(d) *Fake-data simulation to understand scenarios where the MRP approach could fail.* As noted in point 2 above, there are legitimate concerns about the use of any model-based approach to draw inferences for 3000 counties from 2000 (or even 20,000 or 200,000) respondents. One way to get a sense of potential problems here is to construct some fake-data worlds in which the model-based estimates will fail.
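As a rough illustration of idea (d), here is a toy fake-data world. It is a deliberately crude stand-in and not the MRP model used in the analysis: the "model" knows only a single two-level demographic variable, county estimates are poststratified group means, and the effect sizes, the function name `simulate`, and its parameters are all assumptions made up for this sketch.

```python
import random
import statistics


def simulate(county_sd, n_counties=3000, n_respondents=2000, seed=1):
    """Fake world: outcome = group effect + county effect + noise.
    Returns the root mean squared error of county estimates that use
    only group-level information."""
    rng = random.Random(seed)
    group_effect = {0: -0.5, 1: 0.5}
    county_effect = [rng.gauss(0, county_sd) for _ in range(n_counties)]
    share_group1 = [rng.random() for _ in range(n_counties)]

    # Survey: respondents drawn from random counties.
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _ in range(n_respondents):
        c = rng.randrange(n_counties)
        g = 1 if rng.random() < share_group1[c] else 0
        y = group_effect[g] + county_effect[c] + rng.gauss(0, 1)
        sums[g] += y
        counts[g] += 1
    group_mean = {g: sums[g] / counts[g] for g in (0, 1)}

    # Poststratify the estimated group means to each county and compare to truth.
    squared_errors = []
    for c in range(n_counties):
        p = share_group1[c]
        truth = (1 - p) * group_effect[0] + p * group_effect[1] + county_effect[c]
        estimate = (1 - p) * group_mean[0] + p * group_mean[1]
        squared_errors.append((estimate - truth) ** 2)
    return statistics.mean(squared_errors) ** 0.5


rmse_no_county_variation = simulate(county_sd=0.0)
rmse_with_county_variation = simulate(county_sd=0.5)
print(rmse_no_county_variation, rmse_with_county_variation)
```

When all the county-to-county variation is explained by demographics, the model-based estimates are accurate even with far fewer respondents than counties; when counties differ in ways the model does not capture, the error is roughly as large as that unmodeled variation. A more serious version would fit the actual multilevel model to each fake dataset.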

OK, so four research directions here. My inclination is to start with (b) and (d) because I’m kind of intimidated by the demographic classifications in the voter file, so I’d rather just consider them as a black box and try to fix them indirectly, rather than to model and understand them. Along similar lines, it seems to me that solving (b) and (d) will give us general tools that can be used in many other adjustment problems in sampling and causal inference. That said, (a) is appealing because it’s all about doing things right, and it could have real impact on future studies using the voter file, and (c) would be an example of building bridges between different models in statistical workflow, which is an idea I’ve talked about a lot recently, so I’d like to see that too.

The Heckman Curve describes the rate of return to public investments in human capital for the disadvantaged as rapidly diminishing with age. Investments early in the life course are characterised as providing significantly higher rates of return compared to investments targeted at young people and adults. This paper uses the Washington State Institute for Public Policy dataset of program benefit cost ratios to assess if there is a Heckman Curve relationship between program rates of return and recipient age. The data does not support the claim that social policy programs targeted early in the life course have the largest returns, or that the benefits of adult programs are less than the cost of intervention.

Here’s the conceptual version of the curve, from a paper published by economist Heckman in 2006:

This graph looks pretty authoritative but of course it’s not directly data-based.

As Rea and Burton explain, the curve makes some sense:

Underpinning the Heckman Curve is a comprehensive theory of skills that encompass all forms of human capability including physical and mental health . . .

• skills represent human capabilities that are able to generate outcomes for the individual and society;

• skills are multiple in nature and cover not only intelligence, but also non cognitive skills, and health (Heckman and Corbin, 2016);

• non cognitive skills or behavioural attributes such as conscientiousness, openness to experience, extraversion, agreeableness and emotional stability are particularly influential on a range of outcomes, and many of these are acquired in early childhood;

• early skill formation provides a platform for further subsequent skill accumulation . . .

• families and individuals invest in the costly process of building skills; and

• disadvantaged families do not invest sufficiently in their children because of information problems rather than limited economic resources or capital constraints (Heckman, 2007; Cunha et al., 2010; Heckman and Mosso, 2015).

Early intervention creates higher returns because of a longer payoff over which to generate returns.

But the evidence is not so clear. Rea and Burton write:

The original papers that introduced the Heckman Curve cited evidence on the relative return of human capital interventions across early childhood education, schooling, programs for at-risk youth, university and active employment and training programs (Heckman, 1999).

I’m concerned about these all being massive overestimates because of the statistical significance filter (see for example section 2.1 here or my earlier post here). The researchers have every motivation to exaggerate the effects of these interventions, and they’re using statistical methods that produce exaggerated estimates. Bad combination.
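The statistical significance filter is easy to see in a quick simulation. This is a minimal sketch under made-up assumptions (a true effect equal to one standard error, normally distributed estimates, a 1.96-standard-error threshold); the function name and the numbers are hypothetical and are not taken from any of the studies discussed here.

```python
import random
import statistics


def significance_filter(true_effect, se, n_studies=100_000, seed=0):
    """Simulate many noisy estimates of the same true effect, keep only the
    'statistically significant' ones, and report how exaggerated they are."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    significant = [e for e in estimates if abs(e) > 1.96 * se]
    share_significant = len(significant) / n_studies
    # Ratio of the average magnitude of significant estimates to the truth.
    exaggeration = statistics.mean(abs(e) for e in significant) / true_effect
    return share_significant, exaggeration


share_significant, exaggeration = significance_filter(true_effect=1.0, se=1.0)
print(share_significant, exaggeration)
```

In this setup only a minority of the simulated studies reach significance, and the ones that do overstate the true effect by a wide margin on average, which is the sense in which selecting on significance produces exaggerated estimates.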

Rea and Burton continue:

A more recent review by Heckman and colleagues is contained in an OECD report Fostering and Measuring Skills: Improving Cognitive and Non-Cognitive Skills to Promote Lifetime Success (Kautz et al., 2014). . . . Overall 27 different interventions were reviewed . . . twelve had benefit cost ratios reported . . . Consistent with the Heckman Curve, programs targeted to children under five have an average benefit cost ratio of around 7, while those targeted at older ages have an average benefit cost ratio of just under 2.

But:

This result is however heavily influenced by the inclusion of the Perry Preschool programme and the Abecedarian Project. These studies are somewhat controversial in the wider literature . . . Many researchers argue that the Perry Preschool programme and the Abecedarian Project do not provide a reliable guide to the likely impacts of early childhood education in a modern context . . .

Also the statistical significance filter. A defender of those studies might argue that these biases don’t matter because they could be occurring for all studies, not just early childhood interventions. But these biases can be huge, and in general it’s a mistake to ignore huge biases in the vague hope that they may be canceling out.

And:

The data on programs targeted at older ages do not appear to be entirely consistent with the Heckman Curve. In particular the National Guard Challenge program and the Canadian Self-Sufficiency Project provide examples of interventions targeted at older age groups which have returns that are larger than the cost of funds.

Overall the programs in the OECD report represent only a small sample of the human capital interventions with well measured program returns . . . many rigorously studied and well known interventions are not included.

So Rea and Burton decide to perform a meta-analysis:

In order to assess the Heckman Curve we analyse a large dataset of program benefit cost ratios developed by the Washington State Institute for Public Policy.

Since the 1980s the Washington State Institute for Public Policy has focused on evidence-based policies and programs with the aim of providing state policymakers with advice about how to make best use of taxpayer funds. The Institute’s database covers programs in a wide range of areas including child welfare, mental health, juvenile and adult justice, substance abuse, healthcare, higher education and the labour market. . . .

The August 2017 update provides estimates of the benefit cost ratios for 314 interventions. . . . The programs also span the life course with 10% of the interventions being aimed at children 5 years and under.

And here’s what they find:

Wow, that’s one ugly graph! Can’t you do better than that? I also don’t really know what to do with these numbers. Benefit-cost ratios of 90! That’s the kind of thing you see with, what, a plan to hire more IRS auditors? I guess what I’m saying is that I don’t know which of these dots I can really trust, which is a problem with a lot of meta-analyses (see for example here).

To put it another way: Given what I see in Rea and Burton’s paper, I’m prepared to agree with their claim that the data don’t support the diminishing-returns “Heckman curve”: The graph from that 2006 paper, reproduced at the top of this post, is just a story that’s not backed up by what is known. At the same time, I don’t know how seriously to take the above scatterplot, as many or even most of the dots there could be terrible estimates. I just don’t know.

In their conclusion, Rea and Burton say that their results do *not* “call into question the more general theory of human capital and skills advanced by Heckman and colleagues.” They express the view that:

Heckman’s insights about the nature of human capital are essentially correct. Early child development is a critical stage of human development, partly because it provides a foundation for the future acquisition of health, cognitive and non-cognitive skills. Moreover the impact of an effective intervention in childhood has a longer period of time over which any benefits can accumulate.

Why, then, do the diminishing returns of interventions not show up in the data? Rea and Burton write:

The importance of early child development and the nature of human capital are not the only factors that influence the rate of return for any particular intervention. Overall the extent to which a social policy investment gives a good rate of return depends on the assumed discount rate, the cost of the intervention, the intervention’s ability to impact on outcomes, the time profile of impacts over the life course, and the value of the impacts.

Some interventions may be low cost which will make even modest impacts cost effective.

The extent of targeting and the deadweight loss of the intervention are also important. Some interventions may be well targeted to those who need the intervention and hence offer a good rate of return. Other interventions may be less well targeted and require investment in those who do not require the intervention. A potential example of this might be interventions aimed at reducing youth offending. While early prevention programs may be effective at reducing offending, they are not necessarily more cost effective than later interventions if they require considerable investment in those who are not at risk.

Another consideration is the proximity of an intervention to the time where there are the largest potential benefits. . . .

Another factor is that the technology or active ingredients of interventions differ, and it is not clear that those targeted to younger ages will always be more effective. . . .

In general there are many circumstances where interventions to deliver ‘cures’ can be as cost effective as ‘prevention’. Many aspects of life have a degree of unpredictability and interventions targeted at those who experience an adverse event (such as healthcare in response to a car accident) can plausibly be as cost effective as prevention efforts.

These are all interesting points.

**P.S.** I sent Rea some of these comments, and he wrote:

I had previously read your paper ‘The failure of the null hypothesis’, and remember being struck by the para:

The current system of scientific publication encourages the publication of speculative papers making dramatic claims based on small, noisy experiments. Why is this? To start with, the most prestigious general-interest journals—Science, Nature, and PNAS—require papers to be short, and they strongly favor claims of originality and grand importance….

I had thought at the time that this applied to the original Heckman paper in Science.

I think we agree with your point about not being able to draw any positive conclusions from our data. The paper is meant to be more in the spirit of ‘here is an important claim that has been highly influential in public policy, but when we look at what we believe is a carefully constructed dataset, we don’t see any support for the claim’. We probably should frame it more about replication and an invitation for other researchers to try and do something similar using other datasets.

Your point about the underlying data drawing on effect sizes that are likely biased is something we need to reflect in the paper. But in defense of the approach, my assumption is that well conducted meta analysis (which Washington State Institute for Public Policy use to calculate their overall impacts) should moderate the extent of the bias. Searching for unpublished research, and including all robust studies irrespective of the magnitude and significance of the impact, and weighting by each study’s precision, should overcome some of the problems? In their meta analysis, Washington State also reduce a study’s contribution to the overall effect size if there is evidence of a conflict of interest (the researcher was also the program developer).

On the issue of the large effect sizes from the early childhood education experiments (Perry PreSchool and Abecedarian Project), the recent meta analysis of high quality studies by McCoy et al. (2017) was helpful for us.

Generally the later studies have shown smaller impacts (possibly because control groups are now not so deprived of other services). Here is one of their lovely forest plots on grade retention. I am just about to go and see if they did any analysis of publication bias.

**What can you expect?**

There will be two days of tutorials at all levels and two days of invited and submitted talks.

The previous three StanCons (NYC 2017, Asilomar 2018, Helsinki 2018) were wonderful experiences for both their content and their collegial nature. StanCon is ridiculously interdisciplinary, cutting across science, engineering, finance, education, government, and sports (check out the previous lineups). It also brings a balanced mix of academic and industrial attendees and speakers.

**Propose a tutorial or submit a paper**

There’s still time to propose a tutorial or submit an abstract for a talk or poster. This year’s process for submitting abstracts is streamlined from previous StanCons.

**Early registration**

Early registration (at 33% savings) ends May 15, 2019.

**About Cambridge**

In terms of scenery, I’ll say no more than that Cambridge University is where they filmed Hogwarts.

It’s also steeped in real history. The university has had professors like Isaac Newton and has yielded software like BUGS. It also has a famous river with infamously precarious boats. It’s England, so there are great pubs.

**If you’ve got ’em, wear ’em**

Speaking of Harry Potter, it appears from the home page link above that the banquet’s going to be full Georgian Gothic in the King’s College Dining Hall. If you happen to own an academic gown (or a cosplay version thereof), this seems like the venue (just kidding—you might be the only one!).

Per #3 here, just want to make sure you saw the Coppock Leeper Mullinix paper indicating treatment effect heterogeneity is rare.

My reply:

I guess it depends on what is being studied. In the world of evolutionary psychology etc., interactions are typically claimed to be larger than main effects (for example, that claim about fat arms and redistribution). It is possible that in the real world, interactions are not so large.

To step back a moment, I don’t think it’s quite right to say that treatment effect heterogeneity is “rare.” All treatment effects vary. So the question is not, Is there treatment effect heterogeneity?, but rather, How large is treatment effect heterogeneity? In practice, heterogeneity can be hard to estimate, so all we can say is that, whatever variation there is in the treatment effects, we can’t estimate it well from the data alone.

In real life, when people design treatments, they need to figure out all sorts of details. Presumably the details matter. These details are treatment interactions, and they’re typically designed entirely qualitatively, which makes sense given the difficulty of estimating their effects from data.

I have a couple of events coming up that people might be interested in. They are all at bayescamp.com/courses

Stan Taster Webinar is on 15 May, runs for one hour and is only £15. I’ll demo Stan through R (and maybe PyStan and CmdStan if the interest is there on the day), show the code and discuss the strengths and limitations of HMC.

I have a 2 day Introduction to Stan course at the Royal Statistical Society on 9-10 July, which obviously expands on this, gets everyone involved in interactive small-group exercises, thinking of data-generating processes and appropriate parameterisations, etc etc.

And I can exclusively reveal that Rasmus Bååth is going to do an introductory course for beginners in Bayes in the autumn. His YouTube videos have been very popular so I think that will be exciting.

This paper provides the first evidence of the effect of a U.S. paid maternity leave policy on the long-run outcomes of children. I exploit variation in access to paid leave that was created by long-standing state differences in short-term disability insurance coverage and the state-level roll-out of laws banning discrimination against pregnant workers in the 1960s and 1970s. While the availability of these benefits sparked a substantial expansion of leave-taking by new mothers, it also came with a cost. The enactment of paid leave led to shifts in labor supply and demand that decreased wages and family income among women of child-bearing age. In addition, the first generation of children born to mothers with access to maternity leave benefits were 1.9 percent less likely to attend college and 3.1 percent less likely to earn a four-year college degree.

I was curious so I clicked through and took a look. It seems that the key comparisons are at the state-year level, with some policy changes happening in different states at different years. So what I’d like to see are some time series for individual states and some scatterplots of state-years. Some other graphs, too, although I’m not quite sure what. The basic idea is that this is an observational study in which the treatment is some policy change, so we’re comparing state-years with and without this treatment; I’d like to see a scatterplot of the outcome vs. some pre-treatment measure, with different symbols for treatment and control cases. As it is, I don’t really know what to make of the results, what with all the processing that has gone on between the data and the estimate.

In general I am skeptical about results such as given in the above abstract because there are so many things that can affect college attendance. Trends can vary by state, and this sort of analysis will simply pick up whatever correlation there might be, between state-level trends and the implementation of policies. There are lots of reasons to think that the states where a given policy would be more or less likely to be implemented, happen to be states where trends in college attendance are higher or lower. This is all kind of vague because I’m not quite sure what is going on in the data—I didn’t notice a list of which states were doing what. My general point is that to understand and trust such an analysis I need a “trail of bread crumbs” connecting data, theory, and conclusions. The theory in the paper, having to do with economic incentives and indirect effects, seemed a bit farfetched to me but not impossible—but it’s not enough for me to just have the theory and the regression table; I really need to understand where in the data the result is coming from. As it is, this just seems like two state-level variables that happen to be correlated. There might be something here; I just can’t say.

**P.S.** Cowen’s commenters express lots of skepticism about this claim. I see this skepticism as a good sign, a positive aspect of the recent statistical crisis in science that people do not automatically accept this sort of quantitative claim, even when it is endorsed by a trusted intermediary. I suspect that Cowen too is happy that his readers read him critically and don’t believe everything he posts!

A survey with N=1! And not even a random sample. How could we possibly learn anything useful from that? We have a few things in our favor:

– Auxiliary information on the survey respondent. We have some sense of our respondent’s left-right ideology, relative to the general primary electorate.

– An informative measure of the respondent’s attitude. He didn’t just answer a yes/no question about his vote intention; he told me that he wasn’t even considering voting for the alternative candidate.

– A model of opinions and voting: Uniform partisan swing. We assume that, from election to election, voters move only a small random amount on the left-right scale, relative to the other voters.

– Assumption of random sampling, conditional on auxiliary information: My friend is not a random sample of Democrats, but I’m implicitly considering him as representative of Democrats at his particular point in left-right ideology.

Substantive information + informative data + model + assumption. Put these together and you can learn a lot.

Today I ran into my survey respondent and I thought I’d ask him, my representative center-left Democrat, who he supported in the presidential race. Not who he thought would win, but who he supported.

So I asked him, he paused for about a second, and then said, Beto.

**P.S.** According to Predictwise, Beto’s currently at 15%. I don’t really have a sense if this is too low or too high. And, in any case, there are several reasons why primaries are hard to predict. But, for now, given my N=1 poll, I’m going with Beto.

Predictwise also has odds for the Republican nomination. Their probabilities are Trump 86%, Kasich 20%, all else 4%. The Kasich number doesn’t seem right to me. Trump 86%, that seems reasonable enough, but conditional on Trump not being the nominee, is there really an 80% chance that it will be Kasich? That conditional probability seems too high. I guess that implies I should lay some money on Mike Pence, Paul Ryan, etc.

“Your readers are my target audience. I really want to convince them that it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.”

**The background**

It started with an email from Erik van Zwet, who wrote:

In 2013, you wrote about the hidden dangers of non-informative priors:

Finally, the simplest example yet, and my new favorite: we assign a non-informative prior to a continuous parameter theta. We now observe data, y ~ N(theta, 1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 0.84 that theta>0. I don’t believe that 0.84. I think (in general) that it is too high.

I agree – at least if theta is a regression coefficient (other than the intercept) in the context of the life sciences.

In this paper [which has since been published in a journal], I propose that a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator. The posterior is the normal distribution with mean y/2 and standard deviation SE/sqrt(2). So that’s a default Edlin factor of 1/2. I base my proposal on two very different arguments:

1. The uniform (flat) prior is considered by many to be non-informative because of certain invariance properties. However, I argue that those properties break down when we reparameterize in terms of the sign and the magnitude of theta. Now, in my experience, the primary goal of most regression analyses is to study the direction of some association. That is, we are interested primarily in the sign of theta. Under the prior I’m proposing, P(theta > 0 | y) has the standard uniform distribution (Theorem 1 in the paper). In that sense, the prior could be considered to be non-informative for inference about the sign of theta.

2. The fact that we are considering a regression coefficient (other than the intercept) in the context of the life sciences is actually prior information. Now, almost all research in the life sciences is listed in the MEDLINE (PubMed) database. In the absence of any additional prior information, we can consider papers in MEDLINE that have regression coefficients to be exchangeable. I used a sample of 50 MEDLINE papers to estimate the prior and found the normal distribution with mean zero and standard deviation 1.28*SE. The data and my analysis are available here.

The two arguments are very different, so it’s nice that they yield fairly similar results. Since published effects tend to be inflated, I think the 1.28 is somewhat overestimated. So, I end up recommending the N(0,SE^2) as default prior.

I think it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.
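The arithmetic in the email can be checked in a few lines. This is a minimal sketch using only the normal-normal formulas stated above (flat prior gives posterior N(y, SE^2); the proposed N(0, SE^2) prior gives posterior N(y/2, SE^2/2)); the function names are mine, not from the paper.

```python
from statistics import NormalDist

def posterior_under_default_prior(y, se):
    """Posterior mean and sd for theta when y ~ N(theta, se^2) and theta ~ N(0, se^2)."""
    return y / 2, se / 2 ** 0.5

def prob_positive(mean, sd):
    """P(theta > 0) for a normal posterior with the given mean and sd."""
    return 1 - NormalDist(mean, sd).cdf(0.0)

y, se = 1.0, 1.0                      # the example from the 2013 post: y = 1, sd = 1
flat = prob_positive(y, se)           # flat prior: posterior is N(y, se^2)
m, s = posterior_under_default_prior(y, se)
shrunk = prob_positive(m, s)          # proposed default prior
print(f"P(theta > 0 | y): flat prior {flat:.2f}, N(0, SE^2) prior {shrunk:.2f}")
```

With y = 1 the flat prior reproduces the 0.84 quoted above, and the proposed default pulls it down to about 0.76.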

Hmmm . . . one way to think about this idea is to consider where it *doesn’t* make sense. You write, “a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator.” Let’s consider two cases where this default *won’t* work:

– The task is to estimate someone’s weight with one measurement on a scale where the measurements have standard deviation 1 pound, and you observe 150 pounds. You’re not going to want to partially pool that all the way to 75 pounds. The point here, I suppose, is that the goal of the measurement is *not* to estimate the sign of the effect. But we could do the same reasoning where the goal was to estimate the sign. For example, I weigh you, then I weigh you again a year later. I’m interested in seeing if you gained or lost weight. The measurement was 150 pounds last year and 140 pounds this year. The classical estimate of the difference of the two measurements is 10 +/- 1.4. Would I want to partially pool that all the way to 5? Maybe, in that these are just single measurements and your weight can fluctuate. But that can’t be the motivation here, because we could just as well take 100 measurements at one time and 100 measurements a year later, so now maybe your average is, say, 153 pounds last year and 143 pounds this year: an estimated change of 10 +/- 0.14. We certainly wouldn’t want to use a super-precise prior with mean 0 and sd 0.14 here!

– The famous beauty-and-sex-ratio study where the difference in probability of girl birth, comparing children of beautiful and non-beautiful parents, was estimated from some data to be 8 percentage points +/- 3 percentage points. In this case, an Edlin factor of 0.5 is not enough. Pooling down to 4 percentage points is not enough pooling. A better estimate of the difference would be 0 percentage points, or 0.01 percentage points, or something like that.

I guess what I’m getting at is that the balance between prior and data changes as we get more information, so I don’t see how a fixed amount of partial pooling can work.
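Here is the weight example as a small sketch. The se-scaled prior is the default under discussion; the fixed prior sd of 10 pounds for a one-year weight change is a number I made up purely for contrast.

```python
import math

def pooled(estimate, se, prior_sd):
    """Normal-normal posterior mean with a normal(0, prior_sd) prior."""
    keep = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)   # fraction of the estimate retained
    return keep * estimate

change = 10.0                # observed weight change in pounds
prior_sd_fixed = 10.0        # hypothetical prior sd for a one-year weight change

for n in (1, 100):           # number of measurements averaged at each time point
    se = math.sqrt(2 / n)    # scale sd is 1 pound; difference of two independent averages
    default = pooled(change, se, prior_sd=se)             # prior sd tied to the se
    fixed = pooled(change, se, prior_sd=prior_sd_fixed)   # prior sd held fixed
    print(f"n = {n:3d}: se = {se:.2f}, se-scaled prior -> {default:.2f}, fixed prior -> {fixed:.2f}")
```

The se-scaled prior returns 5 pounds no matter how many measurements go into the averages, while a prior that does not move with the data lets the estimate approach 10 as the standard error shrinks.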

That said, maybe I’m missing something here. After all, a default can never cover all cases, and the current default of no partial pooling or flat prior has all sorts of problems. So we can think more about this.

**P.S.** In the months since I wrote the above post, Zwet sent along further thoughts:

Since I emailed you in the fall, I’ve continued thinking about default priors. I have a clearer idea now about what I’m trying to do:

In principle, one can obtain prior information for almost any research question in the life sciences via a meta-analysis. In practice, however, there are (at least) three obstacles. First, a meta-analysis is extra work and that is never popular. Second, the literature is not always reliable because of publication bias and such. Third, it is generally unclear what the scope of the meta-analysis should be.

Now, researchers often want to be “objective” or “non-informative”. I believe this can be accomplished by performing a meta-analysis with a very wide scope. One might think that this would lead to very diffuse priors, but that turns out not to be the case! Using a very wide scope to obtain prior information also means that the same meta-analysis can be recycled in many situations.

The problem of publication bias in the literature remains, but there may be ways to handle that. In the paper I sent earlier, I used p-values from univariable regressions that were used to “screen” variables for a multivariable model. I figure that those p-values should be largely unaffected by selection on significance, simply because that selection is still to be done!

More recently, I’ve used a set of “honest” p-values that were generated by the Open Science Collaboration in their big replication project in psychology (Science, 2015). I’ve estimated a prior and then computed type S and M errors. I attach the results together with the (publicly available) data. The results are also here.

Zwet’s new paper is called Default prior for psychological research, and it comes with two data files, here and here.

It’s an appealing idea, and in practice it should be better than the current default Edlin factor of 1 (that is, no partial pooling toward zero at all). And I’ve talked a lot about constructing default priors based on empirical information, so it’s great to see someone actually doing it. Still, I have some reservations about the specific recommendations, for the reasons expressed in my response to Zwet above. Like him, I’m curious about your thoughts on this.

I also wrote something on this in our Prior Choice Recommendations wiki:

Default prior for treatment effects scaled based on the standard error of the estimate

Erik van Zwet suggests an Edlin factor of 1/2. Assuming that the existing or published estimate is unbiased with known standard error, this corresponds to a default prior that is normal with mean 0 and sd equal to the standard error of the data estimate. This can’t be right–for any given experiment, as you add data, the standard error should decline, so this would suggest that the prior depends on sample size. (On the other hand, the prior can often only be understood in the context of the likelihood; http://www.stat.columbia.edu/~gelman/research/published/entropy-19-00555-v2.pdf, so we can’t rule out an improper or data-dependent prior out of hand.)

Anyway, the discussion with Zwet got me thinking. If I see an estimate that’s 1 se from 0, I tend not to take it seriously; I partially pool it toward 0. So if the data estimate is 1 se from 0, then, sure, the normal(0, se) prior seems reasonable as it pools the estimate halfway to 0. But if the data estimate is, say, 4 se’s from zero, I wouldn’t want to pool it halfway: at this point, zero is not so relevant. This suggests something like a t prior. Again, though, the big idea here is to scale the prior based on the standard error of the estimate.

Another way of looking at this prior is as a formalization of what we do when we see estimates of treatment effects. If the estimate is only 1 standard error away from zero, we don’t take it too seriously: sure, we take it as some evidence of a positive effect, but far from conclusive evidence–we partially pool it toward zero. If the estimate is 2 standard errors away from zero, we still think the estimate has a bit of luck to it–just think of the way in which researchers, when their estimate is 2 se’s from zero, (a) get excited and (b) want to stop the experiment right there so as not to lose the magic–hence some partial pooling toward zero is still in order. And if the estimate is 4 se’s from zero, we just tend to take it as is.
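The contrast between a normal prior and "something like a t prior" can be sketched numerically. Everything is in units of the standard error, so the estimate is y ~ N(theta, 1). The choice of 4 degrees of freedom and unit scale for the t prior is arbitrary, for illustration only, and the grid integration is just a quick way to get posterior means without any extra libraries.

```python
import math

def posterior_mean(y, log_prior, lo=-20.0, hi=20.0, n=4001):
    """Posterior mean of theta on a grid, for y ~ N(theta, 1)."""
    step = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        theta = lo + i * step
        w = math.exp(-0.5 * (y - theta) ** 2 + log_prior(theta))  # likelihood * prior
        num += theta * w
        den += w
    return num / den

def log_normal_prior(theta):
    return -0.5 * theta ** 2                  # normal(0, 1), up to a constant

def log_t_prior(theta, df=4):                 # df=4 is an arbitrary illustrative choice
    return -0.5 * (df + 1) * math.log1p(theta ** 2 / df)

for y in (1.0, 2.0, 4.0):
    fn = posterior_mean(y, log_normal_prior) / y
    ft = posterior_mean(y, log_t_prior) / y
    print(f"estimate {y:.0f} se from 0: normal prior keeps {fn:.2f}, t prior keeps {ft:.2f}")
```

The normal prior keeps half of the estimate everywhere, while the heavier-tailed prior pools about as much near zero and progressively less as the estimate moves away from zero.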

I sent some of the above to Zwet, who replied:

I [Zwet] proposed that default Edlin factor of 1/2 only when the estimate is less than 3 se’s away from zero (or rather, p>0.001). I used a mixture of two zero-mean normals; one with sd=0.68 and the other with sd=3.94. I’m quite happy with the fit. The shrinkage is a little more than 1/2 when the estimate is close to zero, and disappears gradually for larger estimates. It’s in the data! You can see it when you do a “wide scope” meta-analysis.
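A rough sketch of how such a mixture prior behaves. The two standard deviations are the ones quoted above; the mixture weights are not given here, so the equal weights below are purely my assumption, as is the reading that the sd's are in units of the standard error (so the estimate is z ~ N(theta, 1)).

```python
import math

def mixture_shrinkage(z, sds=(0.68, 3.94), weights=(0.5, 0.5)):
    """Fraction of the estimate retained (posterior mean / z) when z ~ N(theta, 1)
    and theta has a mixture of zero-mean normals as its prior.
    The weights are hypothetical; only the sds come from the quoted reply."""
    post_w, factors = [], []
    for sd, w in zip(sds, weights):
        var = sd ** 2 + 1.0                       # marginal variance of z under this component
        post_w.append(w * math.exp(-0.5 * z * z / var) / math.sqrt(var))
        factors.append(sd ** 2 / var)             # fraction retained within this component
    total = sum(post_w)
    return sum(pw / total * f for pw, f in zip(post_w, factors))

for z in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"z = {z:.0f}: fraction of estimate retained = {mixture_shrinkage(z):.2f}")
```

Under these assumed weights a bit less than half of the estimate is retained near zero, which is to say the shrinkage is a little more than 1/2, and the retained fraction climbs toward 1 as the estimate gets further from zero. That matches the qualitative description in the reply, though the exact curve depends on the weights.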

I have a lot to say, and it’s hard to put it all together, in part because my collaborators and I have said much of it already, in various forms.

For now I thought I’d start by listing my different thoughts in a short post while I figure out how best to organize all of this.

**Goals**

There’s also the problem that these discussions can easily transform into debates. After proposing an idea and seeing objections, it’s natural to then want to respond to those objections, then the responders respond, etc., and the original goals are lost.

So, before going on, some goals:

– Better statistical analyses. Learning from data in a particular study.

– Improving the flow of science. More prominence to reproducible findings, less time wasted chasing noise.

– Improving scientific practice. Changing incentives to motivate good science and demotivate junk science.

Null hypothesis testing, p-values, and statistical significance represent one approach toward attaining the above goals. I don’t think this approach works so well anymore (whether it did in the past is another question), but the point is to keep these goals in mind.

**Some topics to address**

*1. Is this all a waste of time?*

The first question to ask is, why am I writing about this at all? Paul Meehl said it all fifty years ago, and people have been rediscovering the problems with statistical-significance reasoning every decade since, for example this still-readable paper from 1985, The Religion of Statistics as Practiced in Medical Journals, by David Salsburg, which Richard Juster sent me the other day. And, even accepting the argument that the battle is still worth fighting, why don’t I just leave this in the capable hands of Amrhein, Greenland, McShane, and various others who are evidently willing to put in the effort?

The short answer is I think I have something extra to contribute. So far, my colleagues and I have come up with some new methods and new conceptualizations—I’m thinking of type M and type S errors, the garden of forking paths, the backpack fallacy, the secret weapon, “the difference between . . .,” the use of multilevel models to resolve the multiple comparisons problem, etc. We haven’t been just standing on the street corner the past twenty years, screaming “Down with p-values”; we’ve been reframing the problem in interesting and useful ways.

How did we make these contributions? Not out of nowhere, but as a byproduct of working on applied problems, trying to work things out from first principles, and, yes, reading blog comments and answering questions from randos on the internet. When John Carlin and I write an article like this or this, for example, we’re not just expressing our views clearly and spreading the good word. We’re also figuring out much of it as we go along. So, when I see misunderstanding about statistics and try to clean it up, I’m learning too.

*2. Paradigmatic examples*

It could be a good idea to list the different sorts of examples that are used in these discussions. Here are a few that keep coming up:

– The clinical trial comparing a new drug to the standard treatment.

– “Psychological Science” or “PNAS”-style headline-grabbing unreplicable noise mining.

– Gene-association studies.

– Regressions for causal inference from observational data.

– Studies with multiple outcomes.

– Descriptive studies such as in Red State Blue State.

I think we can come up with more of these. My point here is that different methods can work for different examples, so I think it makes sense to put a bunch of these cases in one place so the argument doesn’t jump around so much. We can also include some examples where p-values and statistical significance don’t seem to come up at all. For instance, MRP to estimate state-level opinion from national surveys: nobody’s out there testing which states are statistically significantly different from others. Another example is item-response or ideal-point modeling in psychometrics or political science: again, these are typically framed as problems of estimation, not testing.

*3. Statistics and computer science as social sciences*

We’re used to statistical methods being controversial, with leading statisticians throwing polemics at each other regarding issues that are both theoretically fundamental and also core practical concerns. The fighting’s been going on, in different ways, for about a hundred years!

But here’s a question. Why is it that statistics is so controversial? The math is just math, no controversy there. And the issues aren’t political, at least not in a left-right sense. Statistical controversies don’t link up in any natural way to political disputes about business and labor, or racism, or war, or whatever.

In its deep and persistent controversies, statistics looks less like the hard sciences and more like the social sciences. Which, again, seems strange to me, given that statistics is a form of engineering, or applied math.

Maybe the appropriate point of comparison here is not economics or sociology, which have deep conflicts based on human values, but rather computer science. Computer scientists can get pretty worked up about technical issues which to me seem unresolvable: the best way to structure a programming language, for example. I don’t like to label these disputes as “religious wars,” but the point is that the level of passion often seems pretty high, in comparison to the dry nature of the subject matter.

I’m not saying that passion is wrong! Existing statistical methods have done their part to slow down medical research: lives are at stake. Still, stepping back, the passion in statistical debates about p-values seems a bit more distanced from the ultimate human object of concern, compared to, say, the passion in debates about economic redistribution or racism.

To return to the point about statistics and computer science: These two fields fundamentally are about how they are used. A statistical method or a computer ultimately connects to a human: someone has to decide what to do. So they both are social sciences, in a way that physics, chemistry, or biology are not, or not as much.

*4. Different levels of argument*

The direct argument in favor of the use of statistical significance and p-values is that it’s desirable to use statistical procedures with so-called type 1 error control. I don’t buy that argument because I think that selecting on statistical significance yields noisy conclusions. To continue the discussion further, I think it makes sense to consider particular examples, or classes of examples (see item 2 above). They talk about error control, I talk about noise, but both these concepts are abstractions, and ultimately it has to come down to reality.

There are also indirect arguments. For example: 100 million p-value users can’t be wrong. Or: Abandoning statistical significance might be a great idea, but nobody will do it. I’d prefer to have the discussion at the more direct level of what’s a better procedure to use, with the understanding that it might take a while for better options to become common practice.

*5. “Statistical significance” as a lexicographic decision rule*

This is discussed in detail in my article with Blake McShane, David Gal, Christian Robert, and Jennifer Tackett:

[In much of current scientific practice], statistical significance serves as a lexicographic decision rule whereby any result is first required to have a p-value that attains the 0.05 threshold and only then is consideration—often scant—given to such factors as related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.

Traditionally, the p < 0.05 rule has been considered a safeguard against noise-chasing and thus a guarantor of replicability. However, in recent years, a series of well-publicized examples (e.g., Carney, Cuddy, and Yap 2010; Bem 2011) coupled with theoretical work has made it clear that statistical significance can easily be obtained from pure noise . . . We propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.

*6. Confirmationist and falsificationist paradigms of science*

I wrote about this a few years ago:

In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

It is my impression that in the vast majority of cases, “statistical significance” is used in a confirmationist way. To put it another way: the problem is not just with the p-value, it’s with the mistaken idea that falsifying a straw-man null hypothesis is evidence in favor of someone’s pet theory.

*7. But what if we need to make an up-or-down decision?*

This comes up a lot. I recommend accepting uncertainty, but what if it’s decision time—what to do?

How can the world function if the millions of scientific decisions currently made using statistical significance somehow have to be done another way? From that perspective, the suggestion to abandon statistical significance is like a recommendation that we all switch to eating organically-fed, free-range chicken. This might be a good idea for any of us individually or with small groups, but it would just be too expensive to do on a national scale. (I don’t know if that’s true when it comes to chicken farming; I’m just making a general analogy here.)

Regarding the economics, the point that we made in section 4.4 of our paper is that decisions are *not* currently made in an automatic way. Papers are reviewed by hand, one at a time.

As Peter Dorman puts it:

The most important determinants of the dispositive power of statistical evidence should be its quality (research design, aptness of measurement) and diversity. “Significance” addresses neither of these. Its worst effect is that, like a magician, it distracts us from what we should be paying most attention to.

To put it another way, there are two issues here: (a) the potential benefits of an automatic screening or decision rule, and (b) using a p-value (null-hypothesis tail area probability) for such a rule. We argue against using screening rules (or, to use them much less often). But in the cases where screening rules are desired, we see no reason to use p-values for this.

*8. What should we do instead?*

To start with, I think many research papers would be improved if all inferences were replaced by simple estimates and standard errors, with these standard errors *not* used to decide whether effects should be declared real, but just to give a sense of baseline uncertainty.

As Eric Loken and I put it:

Without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology.

For a couple more examples, consider the two studies discussed in section 2 of this article. For both of them, nothing is gained and much is lost by passing results through the statistical significance filter.

Again, the use of standard errors and uncertainty intervals is *not* just significance testing in another form. The point is to use these uncertainties as a way of contextualizing estimates, not to declare things as real or not.
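To make the "simple estimates and standard errors" suggestion concrete, here is a minimal sketch in Python. The function name `estimate_with_se` and the numbers are made up for illustration; the point is only that the output is an estimate with its uncertainty, with no threshold applied anywhere.

```python
import math
from statistics import mean, stdev

def estimate_with_se(treated, control):
    """Return (difference in means, standard error of that difference)."""
    diff = mean(treated) - mean(control)
    # Standard error of a difference between two independent sample means.
    se = math.sqrt(stdev(treated) ** 2 / len(treated)
                   + stdev(control) ** 2 / len(control))
    return diff, se

# Hypothetical measurements, for illustration only.
treated = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
control = [4.9, 5.2, 5.5, 4.4, 6.0, 5.0]

diff, se = estimate_with_se(treated, control)
# Report both numbers; do not convert them into a yes/no declaration.
print(f"estimate = {diff:.2f}, standard error = {se:.2f}")
```

The standard error here is used as context for the estimate, not as an input to a significance filter.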

The next step is to recognize multiplicity in your problem. Consider this paper, which contains many analyses but not a single p-value or even a confidence interval. We are able to assess uncertainty by displaying results from multiple polls. Yes, it is possible to have data with no structure at all (a simple comparison with no replications), and for these, I’d just display averages, variation, and some uncertainties. But this is rare, as such simple comparisons are typically part of a stream of results in a larger research project.

One can and should continue with multilevel models and other statistical methods that allow more systematic partial pooling of information from different sources, but the secret weapon is a good start.

**Plan**

My current plan is to write this all up as a long article, Unpacking the Statistical Significance Debate and the Replication Crisis, and put it on Arxiv. That could reach people who don’t feel like engaging with blogs.

In the meantime, I’d appreciate your comments and suggestions.

In a recent blog post I introduce an in-development R package that helps researchers to identify, document and exhaust inherent research design choices in work based on observational data.

As the analysis that I propose is similar in notion to a multiverse analysis that you suggested, I thought that maybe the package and the blog article might be of interest to you.

I haven’t had a chance to look at all of this but I’m posting it here as it might be of interest to you.

We’ve had lots of discussion of problems with statistical methods, so it’s good to see people developing methods to address some of these concerns.

There has long been speculation of an “informational backfire effect,” whereby the publication of questionable scientific claims can lead to behavioral changes that are counterproductive in the aggregate. Concerns of informational backfire have been raised in many fields that feature an intersection of research and policy, including education, medicine, or nutrition, but it has been difficult to study this effect empirically because of confounding of the act of publication with the effects of the research ideas in question through other pathways. In the present paper we estimate the informational backfire effect using a unique identification strategy based on the timing of publication of high-profile articles in well-regarded scientific journals. Using measures of academic citation, traditional media mentions, and social media penetration, we show, first, that published claims backed by questionable research practices receive statistically significantly wider exposure, and, second, that this exposure leads to large and statistically significant aggregate behavioral changes, as measured by a regression discontinuity analysis. The importance of this finding can be seen using a case study in the domain of alcohol consumption, where we demonstrate that publication of research papers claiming a safe daily dose is linked to increased drinking and higher rates of drunk driving injuries and fatalities, with the largest proportional increases occurring in states with the highest levels of exposure to news media science and health reporting.

I don’t know how much to believe all this, as there are the usual difficulties of studying small effects using aggregate data—the needle-in-a-haystack problem—and I’d like to see the raw data. But in any case I wanted to share this with you, as it relates to various discussions we’ve had such as here, for example. Also this relates to general questions we’ve had regarding the larger effects of scientific research on our thoughts and behaviors.

Going through this put me in mind of Jim Zidek’s early 1980s work on multi-Bayesian theory. The most cited paper there is his JRSS-A paper with Weerahandi from 1981. From the abstract it looks more like it addresses formation of a consensus posterior or decision choice and is not about study design. That work is behind a Wiley pay wall so high that even Stanford’s library credentials do not let me see it. I keep this in mind whenever Wiley asks me to contribute an encyclopedia article; preparing a write-only paper for them is a very low priority.

Also, Wiley publishes Wikipedia articles at an infinity-percent markup.

One of the fun parts of this was reading some of what Meehl wrote. I’d seen him quoted but had not read him before. What he says reminds me a lot of how p values were presented when I was an undergraduate at Waterloo. They emphasized large p values as a way of saying ‘not necessarily’ instead of small ones as Eureka.

Well put.

I’ll be posting soon with more reviews of Mayo, but I just wanted to post the above quote on its own.

“The prior can often only be understood in the context of the likelihood”: http://www.stat.columbia.edu/~gelman/research/published/entropy-19-00555-v2.pdf

Here’s an idea for not getting tripped up with default priors: For each parameter (or other qoi), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: “The prior distribution for this parameter is informative.” Then the user can go back and check that the default prior makes sense for this particular example.
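The check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `flag_informative_priors`, the dictionary inputs, and the example numbers are all hypothetical, and the 0.1 threshold is simply the value mentioned in the text. In practice the posterior draws would come from whatever fitting software you use.

```python
from statistics import stdev

def flag_informative_priors(prior_sd, posterior_draws, threshold=0.1):
    """Return the names of parameters whose posterior sd exceeds threshold * prior sd."""
    flagged = []
    for name, draws in posterior_draws.items():
        ratio = stdev(draws) / prior_sd[name]
        if ratio > threshold:
            flagged.append(name)
            print(f"The prior distribution for {name} is informative "
                  f"(posterior sd / prior sd = {ratio:.2f}).")
    return flagged

# Hypothetical inputs: prior sds and a handful of posterior draws per parameter.
prior_sd = {"alpha": 10.0, "beta": 1.0}
posterior_draws = {
    "alpha": [2.1, 1.9, 2.0, 2.2, 1.8],   # much narrower than its prior
    "beta": [0.2, 0.9, -0.4, 0.6, 0.1],   # about half as wide as its prior
}

flagged = flag_informative_priors(prior_sd, posterior_draws)
```

With these made-up numbers only `beta` gets the note, which is the cue to go back and check whether its default prior makes sense for the problem at hand.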

I’ve not incorporated this particular method into my workflow, but I like the idea and I’d like to study it further. I think this idea, or something like it, could be important.

The Electoral College has been in the news recently. I [Weakliem] am going to write a post about public opinion on the Electoral College vs. popular vote, but I was diverted into writing about the arguments offered in favor of it.

An editorial in the National Review says “it prevents New York and California from imposing their will on the rest of the country.” Taken literally, that is ridiculous–those two states combined had about 16% of the popular vote in 2016. But presumably the general idea is that the Electoral College makes it harder for a small number of large states to provide a victory. . . . In 2016, 52% of the popular vote came from 10 states: California, Florida, Texas, New York, Pennsylvania, Illinois, Ohio, Michigan, North Carolina, and Georgia (in descending order of number of votes). In the Electoral College, those states combined had 256 electoral votes–in order to win, you would need to add New Jersey (14). Even if you think the difference between ten and eleven states is important, the diversity of the ten biggest states is striking–there’s no way a candidate could win all of them without winning a lot of others.

Good point. Weakliem continues:

The National Review also says that the Electoral College keeps candidates from “retreating to their preferred pockets and running up the score.” That assumes that it’s easier to add to your lead when you already have a lead than when you are close or behind. That may be true in some sports, but in getting votes it seems that things would be more likely to go in the other direction–if you don’t have much support in a place, you have little to lose and a lot to gain. If it made any difference, election by popular vote would probably encourage parties to look outside their “preferred pockets”–e.g., the Republicans might try to compete in California rather than write it off.

I’d not thought of that before, but that sounds right. I guess we’re assuming there’s no large-scale cheating. There could be a concern that one-party-dominant states could cheat in the vote counting, or even more simply by making it harder for voters of one party to vote. Then again, this already happens, so if cheating is a concern, I think the appropriate solution is more transparency in vote counting and in the rules for where people can vote.

Weakliem then talks about public opinion:

There is always more support for abolishing [the electoral college] than keeping it; until 2016, a lot more. . . . The greatest support for abolishing it (80%) was in November 1968, right after the third-party candidacy of George Wallace, which had the goal of preventing an Electoral College majority. The election of 2000 had much less impact on opinions than 2016, maybe because of the general increase in partisanship since 2000.

A lot of recent commentary has treated abolishing the Electoral College as a radical cause, but the public generally likes the idea. . . .

But:

I suspect that most people don’t have strong opinions, and will just follow their party, so that if it becomes a significant topic of debate there will be something close to a 50/50 split.

And then he breaks things down a bit:

The percent in favor of electing the president by popular vote in surveys ending on October 9, 2011 and November 20, 2016:

| | 2011 | 2016 |
|---|---|---|
| Democrats | 74% | 77% |
| Independents | 70% | 60% |
| Republicans | 53% | 28% |

Weakliem presented these numbers to the fractional decimal place, but that is poor form given that variation in these numbers is much more than 1 percentage point, so it would be like reporting your weight as 193.4 pounds.

One thing I *do* appreciate is that Weakliem just presents the Yes proportions. Lots of times, people present both Yes and No rates, which gives you twice as many numbers to wade through, and then comparisons become much more difficult. So good job on the clean display.

Anyway, he continues with some breakdowns by state:

I used the 2011 survey to look for factors affecting state-level support. I considered number of electoral votes, margin of victory, and region. Support for the electoral college was somewhat higher in small states, which is as expected since it gives their voters more weight. There was no evidence that being in a state where the vote was close made any difference . . . Finally, the only regional distinction that appeared to matter was South vs. non-South. That makes some sense, since despite the talk about “coastal enclaves” vs. “heartland,” the South is still the most regionally distinctive part, and southerners may think that the electoral college protects their regional interests . . .

Funny that support for the electoral college isn’t higher in swing states. It’s not that I think swing-state voters are so selfish that they want the electoral college to preserve their power; it’s more the opposite, that I’d think voters in non-swing states would get annoyed that their votes don’t count. But, hey, I guess not: voters are thinking at the national, not the state level.

Lots more to look at here, I’m sure; also this is an instructive example of how much can be learned by looking carefully at available data.

**P.S.** I’m posting this now rather than with the usual 6-month delay, not because the subject is particularly topical—if anything, I expect it will become more topical as we go forward toward the next election—but because it demonstrates this general point of learning from observational data by looking at interesting comparisons and time trends. I’d like to have this post up, so I can point students to it when they are thinking of projects involving learning from social science data.

- Ben Lambert. 2018. *A Student’s Guide to Bayesian Statistics.* SAGE Publications.

If Ben Goodrich is recommending it, it’s bound to be good. Amazon reviewers seem to really like it, too. You may remember Ben Lambert as the one who finally worked out the bugs in our HMM code for Stan for animal movement models; I blogged about it a couple years ago and linked the forum discussions where it was being worked out.

The linked page has answers to the exercises and an associated Shiny app for exploring distributions. There are also videos for a course based on the book:

- class videos (YouTube).

I haven’t seen a copy, but I am very curious about the section titled “Bob’s bees in a house”, as it’s an example I’ve used in courses. I didn’t come up with the analogy—I borrowed it from a physics presentation on equilibrium in gases or something like that I’d seen somewhere.

Does anyone know if the Kindle version of this book is readable? Living and working in NYC, I have very limited space for physical books.

Joe Hoover writes:

An issue has come up in my subsequent analyses, which uses my MrsP estimates to explore the relationship between county-level moral values and the county-level distribution of hate groups, as defined by the SPLC.

Setting aside issues of spatial auto-correlation, control variables, measurement, and all other potential complications, I want to explore the US county-level association between a county mean outcome X and the county-level distribution of rare-event Y (N Y = 0 is about 2800, N Y > 0 is about 250).

My initial analytical plan included two analyses:

1. Model Y as some zero inflated function of X. I tried this and observed a lot of noise (small effects estimated with low uncertainty).

2. Employ a case-control design that includes all hate group counties + a random sample of counties without hate groups. This design is based on a recent paper that investigated the county-level distribution of hate groups. When I tried this approach, estimation uncertainty decreased and the effects were in the hypothesized direction (how convenient!).

My issue now is that I have two very different sets of results that rely on two very different designs. It seems to me that they address two different questions, but am not entirely sure what question the second analysis really addresses:

1. If we know X for a given county, does that tell us anything about the expected rate of hate groups in that county? Answer: no.

2. Among counties that…mostly have at least one hate group, does knowing X tell us anything about the expected rate of hate groups in that county? Answer: yes?

Part of my confusion about how to work with these results derives from the complexity of the DGP: there are probably many counties that would be nice places to start a hate group, but maybe…there are no self-motivated bigots there. Or, the bigots there are introverted and don’t like to be in groups, etc.

I guess I’m thinking of these factors as something analogous to epidemiological exposure. For example, perhaps county-level population density increases the risk of contracting a virus at the county level. But, if the virus is rare, estimating a model that includes every county won’t reveal this relationship because most counties were never exposed.

This kind of epidemiological reasoning makes sense to me, but it is outside of my areas of expertise. And, I am also aware that it is probably not a coincidence that the reasoning which justifies the ‘good’ results ‘makes sense’ to me.

Accordingly, I would like to place myself on firmer ground by better understanding the precedents for these different analytical approaches. Specifically, I would like to know if it ever makes sense to use a case-control approach if you have data for the entire world (i.e. in my case, case-control requires throwing out observations, which feels strange). Also, I would like to have a better idea of how to interpret these kinds of results.

My reply:

I’m getting confused on the details here so let me try to step back and answer in the abstract. He’s fitting two completely different models to the same data . . . hmmmm, not quite the same data, more like two takes on the same problem.

Thinking about fundamentals . . . I was taught that, when stuck, we should think about statistical problems as prediction problems, with causal inference corresponding to prediction under various potential outcomes. So that’s what I’d do here. Instead of saying that you want to “explore the relationship between county-level moral values and the county-level distribution of hate groups,” try to define a more precise question (WWJD), then some of the answers will flow.

What happened was, I was scanning this list of Springbrook High School alumni. And I was like, Tina Fernandes? Class of 1982? I know that person. We didn’t know each other well, but I guess we must have been in the same homeroom a few times? All I can remember from back then is that Tina was a nice person and that she was outspoken. So it was fun to see this online interview, by Cliff Sosis, from 2017. Thanks, Cliff!

**P.S.** As a special bonus, here’s an article about Chuck Driesell. Chuck and I were in the same economics class, along with Yitzhak. Chuck majored in business in college, Yitzhak became an economics professor, and I never took another econ course again. Which I guess explains how I feel so confident when pontificating about economics.

**P.P.S.** And for another bonus, I came across this page where Ted Alper (class of 1980) answers random questions. It’s practically a blog!

As I put it in the rejoinder for my 2005 discussion paper:

ANOVA is more important than ever because we are fitting models with many parameters, and these parameters can often usefully be structured into batches. The essence of “ANOVA” (as we see it) is to compare the importance of the batches and to provide a framework for efficient estimation of the individual parameters and related summaries such as comparisons and contrasts. . . .

A statistical model is usually taken to be summarized by a likelihood, or a likelihood and a prior distribution, but we go an extra step by noting that the parameters of a model are typically batched, and we take this batching as an essential part of the model. . . .

A key technical contribution of our paper is to disentangle modeling and inferential summaries. A single multilevel model can yield inference for finite-population and superpopulation inferences. . . .

I summarize:

First, if you are already fitting a complicated model, your inferences can be better understood using the structure of that model. Second, if you have a complicated data structure and are trying to set up a model, it can help to use multilevel modeling—not just a simple units-within-groups structure but a more general approach with crossed factors where appropriate. . . .

I’m sharing this with you now because Josh Miller pointed me to this webpage by Jonas Kristoffer Lindeløv entitled “Common statistical tests are linear models (or: how to teach stats).”

Lindeløv’s explanations are good, and I do think it’s useful for students and practitioners to understand that all these statistical procedures are based on the same class of underlying model. He also notes that the Wilcoxon rank test can be formulated approximately as a linear model on ranks, a point that we put in BDA and which I’ve occasionally blogged (see here and here). It’s good to see these ideas being rediscovered: they’re useful enough that they shouldn’t be trapped within a single book and a few old blog entries.
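As a small illustration of the "common tests are linear models" idea, here is a sketch in plain Python (my own example with made-up data, not code from Lindeløv's page). It computes the classical equal-variance two-sample t statistic and the t statistic for the slope in a least-squares regression of the outcome on a 0/1 group indicator; the two agree up to floating-point error. The function names `pooled_t` and `regression_t` are hypothetical.

```python
import math
from statistics import mean

def pooled_t(a, b):
    """Classical equal-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    s2 = ss / (na + nb - 2)  # pooled variance estimate
    return (mb - ma) / math.sqrt(s2 * (1 / na + 1 / nb))

def regression_t(a, b):
    """t statistic for the slope in y = intercept + slope * x, with x = 0 for a, 1 for b."""
    x = [0.0] * len(a) + [1.0] * len(b)
    y = list(a) + list(b)
    n = len(y)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope: residual variance divided by sxx.
    return slope / math.sqrt(rss / (n - 2) / sxx)

# Hypothetical data for two groups.
a = [4.9, 5.2, 5.5, 4.4, 6.0, 5.0]
b = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
print(pooled_t(a, b), regression_t(a, b))
```

The slope is the difference in group means and the residual sum of squares is the pooled sum of squares, which is why the two statistics coincide.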

The point of my post today is to emphasize that it’s not just what model you fit, it’s also how you summarize it. To put it another way, I think the unification of statistical comparisons is taught to everyone in econometrics 101, and indeed this is a key theme of my book with Jennifer, in that we use regression as an organizing principle for applied statistics. (Just to be clear, I’m not claiming that we discovered this. Quite the opposite. I’m saying that we constructed our book in large part based on the understanding we’d gathered from basic ideas in statistics and econometrics that we felt had not fully been integrated into how this material was taught.)

So, it’s well known that all these models are a special case of regression, and that’s why in a good econometrics class they won’t bother teaching Anova, chi-squared tests, etc., they just do regression. My Anova paper demonstrates how the concept of Anova has value, not just from the model (which is just straightforward multilevel linear regression) but because of the structured way the fits are summarized.

For more, go to my Anova article or, for something quicker, these old blog posts:

– Anova for economists

– A psychology researcher asks: Is Anova dead?

– Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests.

I think these are important points: the connection between the statistical models, and also the extra understanding that arises from batching and summarizing by batch.

A couple of times, at my suggestion, you’ve blogged about Paolo Macchiarini.

Here is an update from Susan Perry in which she interviews the director of the Swedish documentary about Macchiarini:

Indeed, Macchiarini made it sound as if his patients had recovered their health when, in fact, the synthetic tracheas he had implanted in their bodies did not work at all. His patients were dying, not thriving.

In 2015, the investigator concluded that Macchiarini had, indeed, committed research fraud. Yet the administrators [at Sweden’s Karolinska Institute] continued to defend their star surgeon — and threatened the whistleblowers with dismissal.

But then there was the fact that the leadership of the hospital and the institute had, instead of listening to the complaints, gone after the whistleblowers and had even complained [about them] to the police.

**What was he thinking???**

Check out this stunning exchange from the interview:

MinnPost: Did you come to any conclusion about what was motivating [Macchiarini]? It seemed at times in the documentary that he really cared about the patients. He seemed moved by them. And, yet, he then abandons them. He doesn’t follow up with them.

Bosse Lindquist [director of the documentary about this story]: I think that he feels that he deserves success in life and that he ultimately deserves something like a Nobel Prize or something like that. He thinks the world just hasn’t quite seen his excellence yet and that they will eventually. He believes that he’s helping mankind, and I think that he construes reality in such a way that he actually thinks that he was doing good with these patients, but that there were minor problems and stuff that sort of [tripped him up].

This jibes with my impressions in other, nonlethal, examples of research incompetence and research fraud: The researcher believes that he or she is an important person doing important work, and thinks of criticisms of any sort as a bunch of technicalities getting in the way of pathbreaking, potentially life-changing advances. And, of course, once you frame things in this way, a simple utilitarian calculation implies that you’re justified in all sorts of questionable behavior to derail your critics.

All of this is, in some sense, a converse to Clarke’s Law, and it also points to a general danger with utilitarianism—or, to put it another way, it points to the general value of rules and norms.

**And what about the whistleblowers?**

MP: And what about the whistleblowers? Have they been able to go back to their careers without any professional harm?

BL: No. Two of them have had to change cities and hospitals. Two are still there, but they have been subjected to threats from management and from some of their colleagues who were involved with Macchiarini. They have not received any new grants since this whole thing happened. It’s a crying shame.

MP: That’s quite a terrible outcome, because that may stop other people from stepping forward in similar situations.

BL: Exactly.

MP: Do you feel that everyone who was responsible for ignoring the warnings about Macchiarini has resigned or been fired?

BL: No, no, no. A number of people are still there and have their old jobs and just carry on. Some have been forced to change jobs, to get another job — but in some other function within the hospital or in the government.

**And, finally . . .**

This:

MP: What has happened to the patients? One was able to successfully have the tube removed, is that correct?

BL: Yeah. One person.

MP: And everybody else has died?

BL: Yes.

The whole thing is no damn joke.

I originally called this “research-lies-allegations-windpipe update update,” but I can’t laugh about this anymore, hence the revised title above.

**P.S.** Alper writes:

According to the NYT’s Gretchen Reynolds, the Institute is looking into breathing again:

Two dozen healthy young male and female volunteers inhaled 12 different scents from small vials held to their noses. Some of the smells were familiar, like the essence of orange, while others were obscure. The subjects were told to memorize each scent. They went through this process on two occasions. For one, they sat quietly for an hour immediately after the sniffing, with their noses clipped shut to prevent nasal breathing; on the other, they sat for an hour with tape over their mouths to prevent oral breathing.

The men and women were consistently much better at recognizing smells if they breathed through their noses during the quiet hour. Mouth breathing resulted in fuzzier recall and more incorrect answers.

But, no numerical notion of “how much better.” And only “two dozen” subjects? Despite the defrocking of Paolo Macchiarini, the Karolinska Institute is undoubtedly still solvent so it seems strange that it undertakes a study that is more typical of a psychology professor, who has little or no funding, and seeks a publication using his students as convenient subjects. One is reminded of the famous sweaty T-shirt study.

I guess there’s always a market for one-quick-trick-that-will-change-your-life.

Why is there so much suspicion of big business?

Perhaps in part because we cannot do without business, so many people hate or resent business, and they love to criticize it, mock it, and lower its status. Business just bugs them. . . .

The short answer is, No, I *don’t* think there is so much suspicion of big business in this country. No, I don’t think people love to criticize, mock and lower the status of big business.

This came up a few years ago, and at the time I pulled out data from a 2007 survey showing that just about every big business you could think of was popular, with the only exception being oil companies. Microsoft, Walmart, Citibank, GM, Pfizer: you name it, the survey respondents were overwhelmingly positive.

Nearly two-thirds of respondents say corporate profits are too high, but, “more than seven in ten agree that ‘the strength of this country today is mostly based on the success of American business’ – an opinion that has changed very little over the past 20 years.”

Corporations are more popular with Republicans than with Democrats, but most of the corporations in the survey were popular with a clear majority in either party.

Big business does lots of things for us, and the United States is a proudly capitalist country, so it’s no shocker that most businesses in the survey were very popular.

So maybe the question is, Why did an economist such as Cowen think that people view big business so negatively?

My quick guess is that we notice negative statements more than positive statements. Cowen himself roots for big business, he’s generally on the side of big business, so when he sees any criticism of it, he bristles. He notices the criticism and is bothered by it. When he sees positive statements about big business, that all seems so sensible that perhaps he hardly notices. The negative attitudes are jarring to him so more noticeable. Perhaps in the same way that I notice bad presentations of data. An ugly table or graph is to me like fingernails on the blackboard.

Anyway, it’s perfectly reasonable for Cowen to be interested in those people who “hate or resent business, and they love to criticize it, mock it, and lower its status.” We should just remember that, at least from these survey data, it seems that this is a small minority of people.

**Why did I write this post?**

The bigger point here is that this is an example of something I see a lot, which is a social scientist or pundit coming up with theories to explain some empirical pattern in the world, but it turns out the pattern isn’t actually real. This came up years ago with Red State Blue State, when I noticed journalists coming up with explanations for voting patterns that were not happening (see for example here) and of course it comes up a lot with noise-mining research, whether it be a psychologist coming up with theories to explain ESP, or a sociologist coming up with theories to explain spurious patterns in sex ratios.

It’s fine to explain data; it’s just important to be aware of what’s being explained. In the context of the above-linked Cowen post, it’s fine to answer the question, “If business is so good, why is it so disliked?”—as long as this sentence is completed as follows: “If business is so good, why is it so disliked by a minority of Americans?” Explaining minority positions is important; we should just be clear it’s a minority.

Or of course it’s possible that Cowen has access to other data I haven’t looked at, perhaps more recent surveys that would modify my empirical understanding. That would be fine too.

**P.S.** The title of this post was originally “Most Americans like big business.” I changed the last word to “businesses” in response to commenters who pointed out that most Americans express negative views about “big business” in general, but they like most individual big businesses that they’re asked about.

First some background, then the bad news, and finally the good news.

Spoiler alert: The bad news is that exploring the posterior is intractable; the good news is that we don’t need to explore all of it to calculate expectations.

**Sampling to characterize the posterior**

There’s a misconception among Markov chain Monte Carlo (MCMC) practitioners that the purpose of sampling is to explore the posterior. For example, I’m writing up some reproducible notes on probability theory and statistics through sampling (in pseudocode with R implementations) and have just come to the point where I’ve introduced and implemented Metropolis and want to use it to exemplify convergence monitoring. So I did what any right-thinking student would do and borrowed one of my mentor’s diagrams (which is why this will look familiar if you’ve read the convergence monitoring section of *Bayesian Data Analysis 3*).

First M steps of isotropic random-walk Metropolis with proposal scale normal(0, 0.2) targeting a bivariate normal with unit variance and 0.9 correlation. After 50 iterations, we haven’t found the typical set, but after 500 iterations we have. Then after 5000 iterations, everything seems to have mixed nicely through this two-dimensional example.
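The sampler in that caption can be sketched in a few lines. This is only a minimal Python illustration (not the R code from the notes), and it makes a few assumptions that are not in the caption: the 0.2 is read as the proposal standard deviation, and the starting point and seed are arbitrary.

```python
import math
import random

RHO = 0.9  # correlation of the bivariate normal target (from the caption)


def log_density(x, y, rho=RHO):
    """Unnormalized log density of a bivariate normal with unit variances."""
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))


def metropolis(num_steps, init=(-2.5, 2.5), scale=0.2, seed=1234):
    """Isotropic random-walk Metropolis; returns the chain and the acceptance rate.

    init and seed are arbitrary choices for illustration; scale is assumed
    to be the standard deviation of the normal proposal.
    """
    rng = random.Random(seed)
    x, y = init
    lp = log_density(x, y)
    chain, accepted = [], 0
    for _ in range(num_steps):
        # Propose an independent normal(0, scale) jump in each coordinate.
        x_new = x + rng.gauss(0.0, scale)
        y_new = y + rng.gauss(0.0, scale)
        lp_new = log_density(x_new, y_new)
        # Accept with probability min(1, p(new) / p(old)).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, y, lp = x_new, y_new, lp_new
            accepted += 1
        chain.append((x, y))
    return chain, accepted / num_steps


for m in (50, 500, 5000):
    chain, rate = metropolis(m)
    mean_x = sum(p[0] for p in chain) / m
    mean_y = sum(p[1] for p in chain) / m
    print(f"M={m:5d}  acceptance={rate:.2f}  mean=({mean_x:+.2f}, {mean_y:+.2f})")
```

With a proposal this small relative to the target, the chain moves slowly, which is why the early iterations look so different from the later ones.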

This two-dimensional traceplot gives the misleading impression that the goal is to make sure each chain has moved through the posterior. This low-dimensional thinking is nothing but a trap in higher dimensions. Don’t fall for it!

**Bad news from higher dimensions**

It’s simply intractable to “cover the posterior” in high dimensions. Consider a 20-dimensional standard normal distribution. There are 20 variables, each of which may be positive or negative, leading to a total of $2^{20}$, or more than a million, orthants (generalizations of quadrants). In 30 dimensions, that’s more than a billion. You get the picture: the number of orthants grows exponentially, so we’ll never cover them all explicitly through sampling.
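To make the counting concrete, here is a small Python sketch that tallies how many distinct orthants a fixed budget of draws actually lands in. It uses independent standard normal draws as a best-case stand-in for MCMC output; the draw budget of 10,000 and the seed are arbitrary assumptions.

```python
import random


def orthants_visited(dim, num_draws, seed=0):
    """Count distinct sign patterns (orthants) hit by independent standard normal draws."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(num_draws):
        # The orthant of a draw is determined by the sign of each coordinate.
        signs = tuple(rng.gauss(0.0, 1.0) > 0.0 for _ in range(dim))
        seen.add(signs)
    return len(seen)


for dim in (2, 10, 20, 30):
    total = 2 ** dim  # each coordinate is positive or negative
    visited = orthants_visited(dim, num_draws=10_000)
    print(f"dim={dim:2d}  orthants={total:>13,d}  visited={visited:,d}")
```

However the draws fall, 10,000 of them can touch at most 10,000 orthants, which is under 1% of the orthants in 20 dimensions.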

**Good news in expectation**

Bayesian inference is based on probability, which means integrating over the posterior density. This boils down to computing expectations of functions of parameters conditioned on data. This we can do.

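For example, we can construct point estimates that minimize expected square error by using posterior means, which are just expectations conditioned on data, which are in turn integrals, which can be estimated via MCMC,

$$\hat{\theta} = \mathbb{E}[\theta \mid y] = \int_{\Theta} \theta \, p(\theta \mid y) \, \mathrm{d}\theta \approx \frac{1}{M} \sum_{m=1}^{M} \theta^{(m)},$$

where $\theta^{(1)}, \ldots, \theta^{(M)}$ are draws from the posterior $p(\theta \mid y)$.

If we want to calculate predictions, we do so by using sampling to calculate the integral required for the expectation,

$$p(\tilde{y} \mid y) = \mathbb{E}[p(\tilde{y} \mid \theta) \mid y] = \int_{\Theta} p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, \mathrm{d}\theta \approx \frac{1}{M} \sum_{m=1}^{M} p(\tilde{y} \mid \theta^{(m)}).$$

If we want to calculate event probabilities, it’s just the expectation of an indicator function, which we can calculate through sampling, e.g., for an event $A$,

$$\Pr[\theta \in A \mid y] = \mathbb{E}[\mathrm{I}(\theta \in A) \mid y] \approx \frac{1}{M} \sum_{m=1}^{M} \mathrm{I}(\theta^{(m)} \in A).$$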

The good news is that we don’t need to visit the entire posterior to compute these expectations to within a few decimal places of accuracy. Even so, MCMC isn’t magic—those two or three decimal places will be zeroes for tail probabilities.
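Here is a rough Python sketch of those estimators. To keep it self-contained, the "posterior draws" are just independent draws from an assumed normal(1, 2) distribution rather than real MCMC output, and the seed and number of draws are arbitrary.

```python
import math
import random

rng = random.Random(42)
M = 1000  # number of draws (arbitrary)

# Stand-in for posterior draws: independent draws from normal(mean=1, sd=2).
draws = [rng.gauss(1.0, 2.0) for _ in range(M)]

# Posterior mean: the average of the draws.
post_mean = sum(draws) / M

# Monte Carlo standard error of that mean, sd / sqrt(M) for independent draws.
sd = math.sqrt(sum((d - post_mean) ** 2 for d in draws) / (M - 1))
mcse = sd / math.sqrt(M)

# Event probability: the average of an indicator function.
pr_positive = sum(d > 0.0 for d in draws) / M

# Tail probability: theta > 9 is four standard deviations above the mean,
# with true probability around 3e-5, so 1000 draws rarely contain even one hit.
pr_tail = sum(d > 9.0 for d in draws) / M

print(f"mean estimate  = {post_mean:.3f} (MCSE {mcse:.3f})")
print(f"Pr[theta > 0]  = {pr_positive:.3f}")
print(f"Pr[theta > 9]  = {pr_tail:.5f}")
```

The mean and the central event probability come out accurate to a couple of decimal places with only 1000 draws, while the tail estimate is dominated by whether any draw happens to land out there at all.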

Large-scale population health studies face increasing difficulties in recruiting representative samples of participants. Non-participation, item non-response and attrition, when follow-up is involved, often result in highly selected samples even in well-designed studies. We aimed to assess the potential value of multilevel regression and poststratification, a method previously used to successfully forecast US presidential election results, for addressing biases due to non-participation in the estimation of population descriptive quantities in large cohort studies. The investigation was performed as an extensive case study using a large national health survey of Australian males, the Ten to Men study. Analyses were performed in the Bayesian computational package RStan. Results showed greater consistency and precision across population subsets of varying sizes, when compared with estimates obtained using conventional survey sampling weights. Estimates for smaller population subsets exhibited a greater degree of shrinkage towards the national estimate. Multilevel regression and poststratification provides a promising analytic approach to addressing potential participation bias in the estimation of population descriptive quantities from large-scale health surveys and cohort studies.

It makes me so happy to see our methods used in new problems like this!

I’ve been dealing with all sorts of crap during the past week or so, so it’s good to be reminded of how our work can make a difference.

I realize that so many people bitch about the seminar showdown that you might need at least one thank you. This year, I managed to re-read the bulk of Geng, and for that I thank you. I have not yet read any Sattouf, but it clearly has made an impression on you, so it’s on my list.

In thanks, here is my first brief foray into pseudo-Gengiana. I think I’ve got the tone roughly right, but I’m way short on whimsy; this is what I managed in a sustained fifteen-minute effort. Thanks again.

My fellow Americans:

As you are no doubt aware, I have completed my investigation and report. I write this to inform you of an unfortunate mishap from Friday. Many news outlets have reported that my final report was taken by a security guard from my offices to the Justice Department. That is not true. In an attempt to maintain my obsessive secrecy, that was a dummy report, actually containing the text of an unpublished novel by David Foster Wallace that we found in Michael Cohen’s safe. We couldn’t understand it; maybe Bill Barr will have better luck.

The real one was handed to my intern, Jeff, in an ordinary interoffice envelope, and Jeff was told to drop it off at Justice on his way home. He lives nearby with six other interns. Not knowing what he had, he stopped off at the Friday Trivia Happy Hour at the Death and Taxes Pub, drank a little too much, and left the report there. We’ve gone back to look and nobody can find it.

So why not just print out another one? Or for that matter, why didn’t I just email the first report? As you’ve no doubt gleaned by now, computers and email aren’t my thing. As my successor at the FBI, Mr. Comey, demonstrated, email baffles just about all of us. And I don’t use a computer. So there isn’t another copy of the real report. I’ve got all my notes, though, so I ought to be able to cobble together a new report in a couple of months.

Apologies for the delay,

Robert Mueller

PS: Jeff has been chastised. We haven’t fired him, but in asking him about this he let slip that his parents didn’t pay taxes on the nanny who raised him and they may have strongly implied that he played on a high school curling team to get into college. His parents are going to jail and the nanny’s immigration status is being investigated. This requires a short re-opening of the investigation.

The mention of “Jeff” seems particularly Geng-like to me. Perhaps I’m reminded of “Ed.” Thinking of Geng makes me a bit sad, though, not just for her but because it reminds me of the passage of time. I associate Geng, Bill James, and Spy magazine with the mid-1980s. Ahhh, lost youth!

I replied: Ahhh, Harvard . . . the reporter should’ve asked Marc Hauser for a quote.

Alper responded:

Marc Hauser’s research involved “cotton-top tamarin monkeys” while Piero Anversa was falsifying and spawning research on damaged hearts:

The cardiologist rocketed to fame in 2001 with a flashy paper claiming that, contrary to scientific consensus, heart muscle could be regenerated. If true, the research would have had enormous significance for patients worldwide.

I, and I suspect virtually all of the other contributors to your blog, know nothing** about cotton-top tamarin monkeys but are fascinated by and interested in stem cells and heart regeneration. Consequently, are Hauser and Anversa separated by a chasm or should they be lumped together in the Hall of Shame? Put another way, do we have yet an additional instance of crime and appropriate punishment?

**Your blog audience is so broad that there may well be cotton-top tamarin monkey mavens out there dying to hit the enter key.

Good point. It’s not up to me at all: I don’t administer punishment of any sort; as a blogger I function as a very small news organization, and my only role is to sometimes look into these cases, bring them to others’ notice, and host discussions. If it were up to me, David Weakliem and Jay Livingston would be regular New York Times columnists, and Mark Palko and Joseph Delaney would be the must-read bloggers that everyone would check each morning. Also, if it were up to me, everyone would have to post all their data and code—at least, that would be the default policy; researchers would have to give very good reasons to get out of this requirement. (Not that I always or even usually post my data and code; but I should do better too.) But none of these things are up to me.

From Harvard’s point of view, perhaps the question is whether they should go easy on people like Hauser, a person who is basically an entertainer, and whose main crime was to fake some of his entertainment (a sort of Doris Kearns Goodwin, if you will), and be tougher on people such as Anversa, whose misdeeds can cost lives. (I don’t know where you should put someone like John Yoo who advocated for actual torture, but I suppose that someone who agreed with Yoo politically would make a similar argument against, say, old-style apologists for the Soviet Union.)

One argument for not taking people like Hauser, Wansink, etc., seriously, even in their misdeeds, is that after the flaws in their methods were revealed, after it turned out that their blithe confidence (in Wansink’s case) or attacks on whistleblowers (in Hauser’s case) were not borne out by the data, these guys just continued to say their original claims were valid. So, for them, it was never about the data at all, it was always about their stunning ideas. Or, to put it another way, the data were there to modify the details of their existing hypotheses, or to allow them to gently develop and extend their models, in a way comparable to how Philip K. Dick used the I Ching to decide what would happen next in his books. (Actually, that analogy is pretty good, as one could just as well say that Dick used randomness not so much to “decide what would happen” but rather “to discover what would happen” next.)

Anyway, to get back to the noise-miners: The supposed empirical support was just there for them to satisfy the conventions of modern-day science. So when it turned out that the promised data had never been there . . . so what, really? The data never mattered in the first place, as these researchers implicitly admitted by not giving up on any of their substantive claims. So maybe these profs should just move into the Department of Imaginative Literature and the universities can call it a day. The medical researchers who misreport their data: That’s a bigger problem.

And what about the news media, myself included? Should I spend more time blogging about medical research and less time blogging about social science research? It’s a tough call. Social science is my own area of expertise, so I think I’m making more of a contribution by leveraging that expertise than by opining on medical research that I don’t really understand.

A related issue is accessibility: people send me more items on social science, and it takes me less effort to evaluate social science claims.

Also, I think social science *is* important. It does not seem that there’s any good evidence that elections are determined by shark attacks or the outcomes of college football games, or that subliminal smiley faces cause large swings in opinion, or that women’s political preferences vary greatly based on time of the month—but if any (or, lord help us, all) of these claims were true, then this would be consequential: it would “punch a big hole in democratic theory,” in the memorable words of Larry Bartels.

Monkey language and bottomless soup bowls: I don’t care about those so much. So why have I devoted so much blog space to those silly cases? Partly it’s from a fascination with people who refuse to admit error even when it’s staring them in the face, partly because it can give insights into general issues in statistics and science, and partly because I think people can miss the point in these cases by focusing on the drama and missing out on the statistics; see for example here and here. But mostly I write more about social science because social science is my “thing.” Just like I write more about football and baseball than about rugby and cricket.

**P.S.** One more thing: Don’t forget that in all these fields, social science, medical science, whatever, the problem is *not* just with bad research, cheaters, or even incompetents. No, there are big problems even with solid research done by honest researchers who are doing their best but are still using methods that misrepresent what we learn from the data. For example, the ORBITA study of heart stents, where p=0.20 (actually p=0.09 when the data were analyzed more appropriately) was widely reported as implying no effect. Honesty and transparency, and even skill and competence in the use of standard methods, are not enough. Sometimes, as in the above post, it makes sense to talk about flat-out bad research and the prominent people who do it, but that’s only one part of the story.

Recently, I had a conversation with a colleague of mine about the virtues of synthetic data and their role in data analysis. I think I’ve heard a sermon/talk or two where you mention this and also in your blog entries. But having convinced my colleague of this point, I am struggling to find good references on this topic.

I was hoping to get some leads from you.

My reply:

Hi, here are some refs: from 2009, 2011, 2013, also this and this and this from 2017, and this from 2018. I think I’ve missed a few, too.

If you want something in dead-tree style, see Section 8.1 of my book with Jennifer Hill, which came out in 2007.

Or, for some classic examples, there’s Bush and Mosteller with the “stat-dogs” in 1954, and Ripley with his simulated spatial processes from, ummmm, 1987 I think it was? Good stuff, all. We should be doing more of it.

The Institute for Policy Research and the Department of Statistics are seeking applicants for a Postdoctoral Fellowship with Dr. Larry Hedges and Dr. Elizabeth Tipton. This fellowship will be a part of a new center which focuses on the development of statistical methods for evidence-based policy. This includes research on methods for meta-analysis, replication, causal generalization, and, more generally, the design and analysis of randomized trials in social, behavioral, and education settings.

The position will include a variety of tasks, including: Conducting simulation studies to understand properties of different estimators; performing reviews of available methods (in the statistics literature) and the use of these methods (in the education and social science literatures); the development of examples of the use of new methods; writing white papers summarizing methods developments for researchers conducting evidence-based policy; and the development of new methods in these areas.

Job Requirements

Required: Ph.D. (expected or obtained) in statistics, biostatistics, the quantitative social sciences, education research methods, or a related field; strong analytical and written communication skills; strong programming skills (R, desired) and familiarity with cluster-computing; and experience with education research, randomized trials, meta-analysis, and/or evidence-based policy.

This will be a one-year appointment beginning September 2019 (or a mutually agreed upon date), with the possibility of renewal for a second year based upon satisfactory performance.

Candidates should submit the following documents in PDF to Valerie Lyne (v-lyne@northwestern.edu) with subject line “Post-Doc”:

· CV

· A 1-page statement regarding the candidate’s research interests, qualifications, and prior research experience relevant to this position

· Names and addresses of three references (no letters are required at this time)

We plan to begin reviewing applications on April 12th, 2019 and will continue to do so until the position is filled.

Looks fun, also this is important work.

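Here’s the golf putting data we were using, typed in from Don Berry’s 1996 textbook. The columns are distance in feet from the hole, number of tries, and number of successes:

| x (distance, ft) | n (tries) | y (successes) |
|---|---|---|
| 2 | 1443 | 1346 |
| 3 | 694 | 577 |
| 4 | 455 | 337 |
| 5 | 353 | 208 |
| 6 | 272 | 149 |
| 7 | 256 | 136 |
| 8 | 240 | 111 |
| 9 | 217 | 69 |
| 10 | 200 | 67 |
| 11 | 237 | 75 |
| 12 | 202 | 52 |
| 13 | 192 | 46 |
| 14 | 174 | 54 |
| 15 | 167 | 28 |
| 16 | 201 | 27 |
| 17 | 195 | 31 |
| 18 | 191 | 33 |
| 19 | 147 | 20 |
| 20 | 152 | 24 |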

Graphed here:

Here’s the idealized picture of the golf putt, where the only uncertainty is the angle of the shot:

Which we assume is normally distributed:

And here’s the model expressed in Stan:

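    data {
      int J;          // number of distance bins
      int n[J];       // number of tries at each distance
      vector[J] x;    // distance from the hole, in feet
      int y[J];       // number of successes at each distance
      real r;         // radius of the ball
      real R;         // radius of the hole
    }
    parameters {
      real<lower=0> sigma;  // sd of the shot angle, in radians
    }
    model {
      vector[J] p;
      for (j in 1:J){
        p[j] = 2*Phi(asin((R-r)/x[j]) / sigma) - 1;
      }
      y ~ binomial(n, p);
    }
    generated quantities {
      real sigma_degrees;
      sigma_degrees = (180/pi())*sigma;
    }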

Fit to the above data, the estimate of sigma_degrees is 1.5. And here’s the fit:
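As a quick check on what that estimate implies, here is a small Python sketch that evaluates the model’s success probability at sigma_degrees = 1.5 and compares it with a few rows of the data above. The values of r and R are not given in the text; I’m assuming the standard ball diameter of 1.68 inches and hole diameter of 4.25 inches, converted to feet.

```python
import math

# Assumed geometry in feet (not stated in the post):
# standard golf ball diameter 1.68 in, standard hole diameter 4.25 in.
r = (1.68 / 2) / 12
R = (4.25 / 2) / 12


def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def success_probability(x, sigma_degrees):
    """Pr(putt goes in) from distance x feet, matching the Stan model's formula."""
    sigma = math.radians(sigma_degrees)
    threshold_angle = math.asin((R - r) / x)
    return 2.0 * Phi(threshold_angle / sigma) - 1.0


# (distance in feet, tries, successes): a few rows copied from the table above
rows = [(2, 1443, 1346), (5, 353, 208), (10, 200, 67), (20, 152, 24)]
for x, n, y in rows:
    model = success_probability(x, sigma_degrees=1.5)
    print(f"x={x:2d} ft  empirical={y / n:.3f}  model={model:.3f}")
```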

**Part 2**

The other day, Mark Broadie came to my office and shared a larger dataset, from 2016-2018. I’m assuming the distances are continuous numbers because the putts have exact distance measurements and have been divided into bins by distance, with the numbers below representing the average distance in each bin.

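| x (average distance, ft) | n (tries) | y (successes) |
|---|---|---|
| 0.28 | 45198 | 45183 |
| 0.97 | 183020 | 182899 |
| 1.93 | 169503 | 168594 |
| 2.92 | 113094 | 108953 |
| 3.93 | 73855 | 64740 |
| 4.94 | 53659 | 41106 |
| 5.94 | 42991 | 28205 |
| 6.95 | 37050 | 21334 |
| 7.95 | 33275 | 16615 |
| 8.95 | 30836 | 13503 |
| 9.95 | 28637 | 11060 |
| 10.95 | 26239 | 9032 |
| 11.95 | 24636 | 7687 |
| 12.95 | 22876 | 6432 |
| 14.43 | 41267 | 9813 |
| 16.43 | 35712 | 7196 |
| 18.44 | 31573 | 5290 |
| 20.44 | 28280 | 4086 |
| 21.95 | 13238 | 1642 |
| 24.39 | 46570 | 4767 |
| 28.40 | 38422 | 2980 |
| 32.39 | 31641 | 1996 |
| 36.39 | 25604 | 1327 |
| 40.37 | 20366 | 834 |
| 44.38 | 15977 | 559 |
| 48.37 | 11770 | 311 |
| 52.36 | 8708 | 231 |
| 57.25 | 8878 | 204 |
| 63.23 | 5492 | 103 |
| 69.18 | 3087 | 35 |
| 75.19 | 1742 | 24 |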

Comparing the two datasets in the range 0-20 feet, the success rate is similar for longer putts but is much higher than before for the short putts. This could be a measurement issue, if the distances to the hole are only approximate for the old data.

Beyond 20 feet, the empirical success rates are lower than would be predicted by the old model. This makes sense: for longer putts, the angle isn’t the only thing you need to control; you also need to get the distance right.
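The gap beyond 20 feet is easy to see numerically. This Python sketch plugs the old estimate of 1.5 degrees into the angle-only formula and compares it with three long-putt rows from the new data. As before, the ball and hole radii are my assumption (1.68 and 4.25 inch diameters, in feet), since the post does not give them.

```python
import math

# Assumed geometry in feet (not stated in the post).
r = (1.68 / 2) / 12  # ball radius
R = (4.25 / 2) / 12  # hole radius
SIGMA = math.radians(1.5)  # angle sd estimated from the older data


def old_model(x):
    """Angle-only success probability from distance x feet."""
    z = math.asin((R - r) / x) / SIGMA
    # 2 * Phi(z) - 1 simplifies to erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2.0))


# (average distance in feet, tries, successes): rows copied from the 2016-2018 data
long_putts = [(24.39, 46570, 4767), (40.37, 20366, 834), (75.19, 1742, 24)]
for x, n, y in long_putts:
    print(f"x={x:5.2f} ft  empirical={y / n:.3f}  old model={old_model(x):.3f}")
```

At each of these distances the angle-only prediction sits above the observed rate, which is the pattern described above.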

So Broadie fit a new model in Stan. See here and here for further details.

Following the first circulation of that article, the authors of that article and some others of us had some email discussion that I thought might be of general interest.

I won’t copy out all the emails, but I’ll share enough to try to convey the sense of the conversation, and any readers are welcome to continue the discussion in the comments.

**1. Is it appropriate to get hundreds of people to sign a letter of support for a scientific editorial?**

John Ioannidis wrote:

Brilliant Comment! I am extremely happy that you are publishing it and that it will certainly attract a lot of attention.

He had some specific disagreements (see below for more on this). Also, he was bothered by the group-signed letter and wrote:

I am afraid that what you are doing at this point is not science, but campaigning. Leaving the scientific merits and drawbacks of your Comment aside, I am afraid that a campaign to collect signatures for what is a scientific method and statistical inference question sets a bad precedent. It is one thing to ask for people to work on co-drafting a scientific article or comment. This takes effort, real debate, multiple painful iterations among co-authors, responsibility, undiluted attention to detailed arguments, and full commitment. Lists of signatories have a very different role. They do make sense for issues of politics, ethics, and injustice. However, I think that they have no place on choosing and endorsing scientific methods. Otherwise scientific methodology would be validated, endorsed and prioritized based on who has the most popular Tweeter, Facebook or Instagram account. I dread to imagine who will prevail.

To this, Sander Greenland replied:

YES we are campaigning and it’s long overdue . . . because YES this is an issue of politics, ethics, and injustice! . . .

My own view is that this significance issue has been a massive problem in the sociology of science, hidden and often hijacked by those pundits under the guise of methodology or “statistical science” (a nearly oxymoronic term). Our commentary is an early step toward revealing that sad reality. Not one point in our commentary is new, and our central complaints (like ending the nonsense we document) have been in the literature for generations, to little or no avail – e.g., see Rothman 1986 and Altman & Bland 1995, attached, and then the travesty of recent JAMA articles like the attached Brown et al. 2017 paper (our original example, which Nature nixed over sociopolitical fears). Single commentaries even with 80 authors have had zero impact on curbing such harmful and destructive nonsense. This is why we have felt compelled to turn to a social movement: Soft-pedaled academic debate has simply not worked. If we fail, we will have done no worse than our predecessors (including you) in cutting off the harmful practices that plague about half of scientific publications, and affect the health and safety of entire populations.

And I replied:

I signed the form because I feel that this would do more good than harm, but as I wrote here, I fully respect the position of not signing any petitions. Just to be clear, I don’t think that my signing of the form is an act of campaigning or politics. I just think it’s a shorthand way of saying that I agree with the general points of the published article and that I agree with most of its recommendations.

Zad Chow replied more agnostically:

Whether political or not, signing a piece as a form of endorsement seems far more appropriate than having papers with mass authorships of 50+ authors where it is unlikely that every single one of those authors contributed enough to actually be an author, and their placement as an author is also a political message.

I also wonder if such pieces, whether they be mass authorships or endorsements by signing, actually lead to notable change. My guess is that they really don’t, but whether or not such endorsements are “popularity contests” via social media, I think I’d prefer that people who participate in science have some voice in the matter, rather than having the views of a few influential individuals, whether they be methodologists or journal editors, constantly repeated and executed in different outlets.

**2. Is “retiring statistical significance” really a good idea?**

Now on to problems with the Amrhein et al. article. I mostly liked it, although I did have a couple places where I suggested changes of emphasis, as noted in my post linked above. The authors made some of my suggested changes; in other places I respect their decisions even if I might have written things slightly differently.

Ioannidis had more concerns, as he wrote in an email listing a bunch of specific disagreements with points in the article:

1. Statement: Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exist

Why it is misleading: Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important. It will also facilitate claiming that there are no conflicts between studies when conflicts do exist.

2. Statement: Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P-value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.

Why it is misleading: In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim. In many cases using sufficiently stringent p-value thresholds, e.g. p=0.005 for many disciplines (or properly multiplicity-adjusted p=0.05, e.g. $10^{-9}$ for genetics or FDR or Bayes factor thresholds or any thresholds) makes perfect sense. We need to make some careful choices and move on. Saying that any and all associations cannot be 100% dismissed is correct strictly speaking, but practically it is nonsense. We will get paralyzed because we cannot exclude that everything may be causing everything.

3. Statement: statistically non-significant results were interpreted as indicating ‘no difference’ in XX% of articles

Why it is misleading: this may have been entirely appropriate in many/most/all cases, one has to examine carefully each one of them. It is probably at least as inappropriate, or even more so, that some/many of the remaining 100-XX% were not indicated as “no difference”.

4. Statement: The editors introduce the collection (2) with the caution “don’t say ‘statistically significant’.” Another article (3) with dozens of signatories calls upon authors and journal editors to disavow the words. We agree and call for the entire concept of statistical significance to be abandoned. We don’t mean to drop P-values, but rather to stop using them dichotomously to decide whether a result refutes or supports a hypothesis.

Why it is misleading: please see my e-mail about what I think regarding the inappropriateness of having “signatories” when we are discussing scientific methods. We do need to reach conclusions dichotomously most of the time: is this genetic variant causing depression, yes or no? Should I spend 1 billion dollars to develop a treatment based on this pathway, yes or no? Is this treatment effective enough to warrant taking it, yes or no? Is this pollutant causing cancer, yes or no?

5. Statement: whole paragraph beginning with “Tragically…”

Why it is misleading: we have no evidence that if people did not have to defend their data as statistically significant, publication bias would go away and people would not be reporting whatever results look nicer, stronger, more desirable and more fit to their biases. Statistical significance or any other preset threshold (e.g. Bayesian or FDR) sets an obstacle to making unfounded claims. People may play tricks to pass the obstacle, but setting no obstacle is worse.

6. Statement: For example, the difference between getting P = 0.03 versus P = 0.06 is the same as the difference between getting heads versus tails on a single fair coin toss (8).

Why it is misleading: this example is factually wrong; it is true only if we are certain that the effect being addressed is truly non-null.

7. Statement: One way to do this is to rename confidence intervals ‘compatibility intervals,’ …

Why it is misleading: Probably the last thing we need in the current confusing situation is to add yet a new, idiosyncratic term. “Compatibility” is even a poor choice, probably worse than “confidence”. Results may be entirely off due to bias and the X% CI (whatever C stands for) may not even include the truth much of the time if bias is present.

8. Statement: We recommend that authors describe the practical implications of all values inside the interval, especially the observed effect or point estimate (that is, the value most compatible with the data) and the limits.

Why it is misleading: I think it is far more important to consider what biases may exist and which may lead to the entire interval, no matter what we call it, to be off and thus incompatible with the truth.

9. Statement: We’re frankly sick of seeing nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews, and instructional materials.

Why it is misleading: I (and many others) am frankly sick of seeing nonsensical “proofs of the non-null”, people making strong statements about associations and even causality with (or even without) formal statistical significance (or other statistical inference tool) plus tons of spin and bias. Removing the statistical significance obstacle entirely will just give a free lunch, all-is-allowed bonus to make any desirable claim. All science will become like nutritional epidemiology.

10. Statement: That means you can and should say “our results indicate a 20% increase in risk” even if you found a large P-value or a wide interval, as long as you also report and discuss the limits of that interval.

Why it is misleading: yes, indeed. But then, welcome to the world where everything is important, noteworthy, must be licensed, must be sold, must be bought, must lead to public health policy, must change our world.

11. Statement: Paragraph starting with “Third, the default 95% used”

Why it is misleading: indeed, but this means that more appropriate P-value thresholds and, respectively, X% CIs are preferable and these need to be decided carefully in advance. Otherwise, everything is done post hoc and any pre-conceived bias of the investigator can be “supported”.

12. Statement: Factors such as background evidence, study design, data quality, and mechanistic understanding are often more important than statistical measures like P-values or intervals (10).

Why it is misleading: while it sounds reasonable that all these other factors are important, most of them are often substantially subjective. Conversely, statistical analysis at least has some objectivity and if the rules are carefully set before the data are collected and the analysis is run, then statistical guidance based on some thresholds (p-values, Bayes factors, FDR, or other) can be useful. Otherwise statistical inference also becomes entirely post hoc and subjective.

13. Statement: The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy, and business environments, decisions based on the costs, benefits, and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to further pursue a research idea, there is no simple connection between a P-value and the probable results of subsequent studies.

Why it is misleading: This argument is equivalent to hand waving. Indeed, most of the time yes/no decisions need to be made and this is why removing statistical significance and making it all too fluid does not help. It leads to an “anything goes” situation. Study designs for questions that require decisions need to take all these other parameters into account ideally in advance (whenever possible) and set some pre-specified rules on what will be considered “success”/actionable result and what not. This could be based on p-values, Bayes factors, FDR, or other thresholds or other functions, e.g. effect distribution. But some rule is needed for the game to be fair. Otherwise we will get into more chaos than we have now, where subjective interpretations already abound. E.g. any company will be able to claim that any results of any trial on its product do support its application for licensing.

14. Statement: People will spend less time with statistical software and more time thinking.

Why it is misleading: I think it is unlikely that people will spend less time with statistical software but it is likely that they will spend more time mumbling, trying to sell their pre-conceived biases with nice-looking narratives. There will be no statistical obstacle in their way.

15. Statement: the approach we advocate will help halt overconfident claims, unwarranted declarations of ‘no difference,’ and absurd statements about ‘replication failure’ when results from original and the replication studies are highly compatible.

Why it is misleading: the proposed approach will probably paralyze efforts to refute the millions of nonsense statements that have been propagated by biased research, mostly observational, but also many subpar randomized trials.

Overall assessment: the Comment is written with an undercurrent belief that there are zillions of true, important effects out there that we erroneously dismiss. The main problem is quite the opposite: there are zillions of nonsense claims of associations and effects that, once they are published, are very difficult to get rid of. The proposed approach will make people who have tried to cheat with massaging statistics very happy, since now they would not have to worry at all about statistics. Any results can be spun to fit their narrative. Getting entirely rid of statistical significance and preset, carefully considered thresholds has the potential of making nonsense irrefutable and invincible.

That said, despite these various specific points of disagreement, Ioannidis emphasized that Amrhein et al. raise important points that “need to be given an opportunity to be heard loud and clear and in their totality.”

In reply to Ioannidis’s points above, I replied:

1. You write, “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important.” I completely disagree. Or, maybe I should say, anyone is already allowed to make any overstated claim about any result being important. That’s what PNAS is, much of the time. To put it another way: I believe that embracing uncertainty and avoiding overstated claims are important. I don’t think statistical significance has much to do with that.

2. You write, “In most scientific fields we need to conclude something and then convey our uncertainty about the conclusion. Clear, pre-specified rules on how to conclude are needed. Otherwise, anyone can conclude anything according to one’s whim.” Again, this is already the case that people can conclude what they want. One concern is what is done by scientists who are honestly trying to do their best. I think those scientists are often misled by statistical significance, all the time, ALL THE TIME, taking patterns that are “statistically significant” and calling them real, and taking patterns that are “not statistically significant” and treating them as zero. Entire scientific papers are, through this mechanism, data in, random numbers out. And this doesn’t even address the incentives problem, by which statistical significance can create an actual disincentive to gather high-quality data.

I disagree with many other items on your list, but two is enough for now. I think the overview is that you’re pointing out that scientists and consumers of science want to make reliable decisions, and statistical significance, for all its flaws, delivers some version of reliable decisions. And my reaction is that whatever benefit comes from statistical significance sometimes providing reliable decisions is outweighed by (a) all the times that statistical significance adds noise and provides unreliable decisions, and (b) the false sense of security that statistical significance gives so many researchers.

One reason this is all relevant, and interesting, is that we all agree on so much—yet we disagree so strongly here. I’d love to push this discussion toward the real tradeoffs that arise when considering alternative statistical recommendations, and I think what Ioannidis wrote, along with the Amrhein/Greenland/McShane article, would be a great starting point.

Ioannidis then responded to me:

On whether removal of statistical significance will increase or decrease the chances that overstated claims will be made and authors will be more or less likely to conclude according to their whim, the truth is that we have no randomized trial to tell whether you are right or I am right. I fully agree that people are often confused about what statistical significance means, but does this mean we should ban it? Should we also ban FDR thresholds? Should we also ban Bayes factor thresholds? Also probably we have different scientific fields in mind. I am afraid that if we ban thresholds and other (ideally pre-specified) rules, we are just telling people to just describe their data as best as they can and unavoidably make strength-of-evidence statements as they wish, kind of impromptu and post-hoc. I don’t think this will work. The notion that someone can just describe the data without making any inferences seems unrealistic and it also defies the purpose of why we do science: we do want to make inferences eventually and many inferences are unavoidably binary/dichotomous. Also actions based on inferences are binary/dichotomous in their vast majority.

I replied:

I agree that the effects of any interventions are unknown. We’re offering, or trying to offer, suggestions for good statistical practice in the hope that this will lead to better outcomes. This uncertainty is a key reason why this discussion is worth having, I think.

**3. Mob rule, or rule of the elites, or gatekeepers, consensus, or what?**

One issue that came up is, what’s the point of that letter with all those signatories? Is it mob rule, the idea that scientific positions should be determined by those people who are loudest and most willing to express strong opinions (“the mob” != “the silent majority”)? Or does it represent an attempt by well-connected elites (such as Greenland and myself!) to tell people what to think? Is the letter attempting to serve a gatekeeping function by restricting how researchers can analyze their data? Or can this all be seen as a crude attempt to establish a consensus of the scientific community?

None of these seem so great! Science should be determined by truth, accuracy, reproducibility, strength of theory, real-world applicability, moral values, etc. All sorts of things, but these should not be the property of the mob, or the elites, or gatekeepers, or a consensus.

That said, the mob, the elites, gatekeepers, and the consensus aren’t going anywhere. Like it or not, people *do* pay attention to online mobs. I hate it, but it’s there. And elites will always be with us, sometimes for good reasons. I don’t think it’s such a bad idea that people listen to what I say, in part on the strength of my carefully-written books—and I say that even though, at the beginning of my career, I had to spend a huge amount of time and effort struggling against the efforts of elites (my colleagues in the statistics department at the University of California, and their friends elsewhere) who did their best to use their elite status to try to put me down. And gatekeepers . . . hmmm, I don’t know if we’d be better off without anyone in charge of scientific publishing and the news media—but, again, the gatekeepers are out there: NPR, PNAS, etc. are real, and the gatekeepers feed off of each other: the news media bow down before papers published in top journals, and the top journals jockey for media exposure. Finally, the scientific consensus is what it is. Of course people mostly do what’s in textbooks, and published articles, and what they see other people do.

So, for my part, I see that letter of support as Amrhein, Greenland, and McShane being in the arena, recognizing that mob, elites, gatekeepers, and consensus are real, and trying their best to influence these influencers and to counter negative influences from all those sources. I agree with the technical message being sent by Amrhein et al., as well as with their open way of expressing it, so I’m fine with them making use of all these channels, including getting lots of signatories, enlisting the support of authority figures, working with the gatekeepers (their comment is being published in Nature, after all; that’s one of the tabloids), and openly attempting to shift the consensus.

Amrhein et al. don’t *have* to do it that way. It would also be fine with me if they were to just publish a quiet paper in a technical journal and wait for people to get the point. But I’m fine with the big push.

**4. And now to all of you . . .**

As noted above, I accept the continued existence and influence of mob, elites, gatekeepers, and consensus. But I’m also bothered by these, and I like to go around them when I can.

Hence, I’m posting this on the blog, where we have the habit of reasoned discussion rather than mob-like rhetorical violence, where the comments have no gatekeeping (in 15 years of blogging, I’ve had to delete less than 5 out of 100,000 comments—that’s 0.005%!—because they were too obnoxious), and where any consensus is formed from discussion that might just lead to the pluralistic conclusion that sometimes no consensus is possible. And by opening up our email discussion to all of you, I’m trying to demystify (to some extent) the elite discourse and make this a more general conversation.

**P.S.** There’s some discussion in comments about what to do in situations like the FDA testing a new drug. I have a response to this point, and it’s what Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I wrote in section 4.4 of our article, Abandon Statistical Significance:

While our focus has been on statistical significance thresholds in scientific publication, similar issues arise in other areas of statistical decision making, including, for example, neuroimaging where researchers use voxelwise NHSTs to decide which results to report or take seriously; medicine where regulatory agencies such as the Food and Drug Administration use NHSTs to decide whether or not to approve new drugs; policy analysis where non-governmental and other organizations use NHSTs to determine whether interventions are beneficial or not; and business where managers use NHSTs to make binary decisions via A/B tests. In addition, thresholds arise not just around scientific publication but also within research projects, for example, when researchers use NHSTs to decide which avenues to pursue further based on preliminary findings.

While considerations around taking a more holistic view of the evidence and consequences of decisions are rather different across each of these settings and different from those in scientific publication, we nonetheless believe our proposal to demote the p-value from its threshold screening role and emphasize the currently subordinate factors applies in these settings. For example, in neuroimaging, the voxelwise NHST approach misses the point in that there are typically no true zeros and changes are generally happening at all brain locations at all times. Plotting images of estimates and uncertainties makes sense to us, but we see no advantage in using a threshold.

For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds. Specifically, and as noted, such thresholds implicitly express a particular tradeoff between Type I and Type II error, but in reality this tradeoff should depend on the costs, benefits, and probabilities of all outcomes.

That said, we acknowledge that thresholds—of a non-statistical variety—may sometimes be useful in these settings. For example, consider a firm contemplating sending a costly offer to customers. Suppose the firm has a customer-level model of the revenue expected in response to the offer. In this setting, it could make sense for the firm to send the offer only to customers that yield an expected profit greater than some threshold, say, zero.
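As a rough illustration of the firm example above, here is a minimal Python sketch of a decision rule based on expected profit rather than on a statistical threshold. The `Customer` fields, the response probabilities, the revenues, and the offer cost are all made-up assumptions for illustration; none of them come from the article.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    p_respond: float       # modeled probability the customer accepts (hypothetical)
    revenue_if_yes: float  # modeled revenue if the customer accepts (hypothetical)


def expected_profit(c: Customer, offer_cost: float) -> float:
    """Expected revenue from the offer minus the cost of sending it."""
    return c.p_respond * c.revenue_if_yes - offer_cost


def customers_to_target(customers, offer_cost, threshold=0.0):
    """Names of customers whose expected profit exceeds the threshold."""
    return [c.name for c in customers
            if expected_profit(c, offer_cost) > threshold]


customers = [
    Customer("A", 0.10, 200.0),  # high revenue, modest response probability
    Customer("B", 0.02, 100.0),  # unlikely to respond
    Customer("C", 0.50, 20.0),   # likely to respond, low revenue
]

targets = customers_to_target(customers, offer_cost=5.0)
print(targets)
```

The threshold here is on a quantity measured in dollars, so moving it has a direct interpretation in terms of costs and benefits, which is the sense in which it is "of a non-statistical variety."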

Even in pure research scenarios where there is no obvious cost-benefit calculation—for example a comparison of the underlying mechanisms, as opposed to the efficacy, of two drugs used to treat some disease—we see no value in p-value or other statistical thresholds. Instead, we would like to see researchers simply report results: estimates, standard errors, confidence intervals, etc., with statistically inconclusive results being relevant for motivating future research. While we see the intuitive appeal of using p-value or other statistical thresholds as a screening device to decide what avenues (e.g., ideas, drugs, or genes) to pursue further, this approach fundamentally does not make efficient use of data: there is in general no connection between a p-value—a probability based on a particular null model—and either the potential gains from pursuing a potential research lead or the predictive probability that the lead in question will ultimately be successful. Instead, to the extent that decisions do need to be made about which lines of research to pursue further, we recommend making such decisions using a model of the distribution of effect sizes and variation, thus working directly with hypotheses of interest rather than reasoning indirectly from a null model.

We would also like to see—when possible in these and other settings—more precise individual-level measurements, a greater use of within-person or longitudinal designs, and increased consideration of models that use informative priors, that feature varying treatment effects, and that are multilevel or meta-analytic in nature (Gelman, 2015, 2017; McShane and Bockenholt, 2017, 2018).

**P.P.S.** Regarding the petition thing, I like what Peter Dorman had to say:

A statistical decision rule is a coordination equilibrium in a very large game with thousands of researchers, journal editors and data users. Perhaps once upon a time such a rule might have been proposed on scientific grounds alone (rightly or wrongly), but now the rule is firmly in place with each use providing an incentive for additional use. That’s why my students (see comment above) set aside what I taught in my stats class and embraced NHST. The research they rely on uses it, and the research they hope to produce will be judged by it. That matters a lot more to them than what I think.

That’s why mass signatures make sense. It is not mob rule in the sociological sense; we signers are not swept up in a wave of transient hysterical solidarity. Rather, we are trying to dent the self-fulfilling power of expectations that locks NHST in place. 800 is too few to do this, alas, but it’s worth a try to get this going.

Resolving the Replication Crisis Using Multilevel Modeling

In recent years we have come to learn that many prominent studies in social science and medicine, conducted at leading research institutions, published in top journals, and publicized in respected news outlets, do not and cannot be expected to replicate. Proposed solutions to the replication crisis in science fall into three categories: altering procedures and incentives, improving design and data collection, and improving statistical analysis. We argue that progress in all three dimensions is necessary: new procedures and incentives will offer little benefit without better data; more complex data structures require more elaborate analysis; and improved incentives are required for researchers to try new methods. We propose a way forward involving multilevel modeling, and we discuss in the context of applications in social research and public health.

Montréal Mathematical Sciences Colloquium, 1205 Burnside Hall, 3:30-4:30pm:

Challenges in Bayesian Computing

Computing is both the most mathematical and most applied aspect of statistics. We shall talk about various urgent computing-related topics in statistical (in particular, Bayesian) workflow, including exploratory data analysis and model checking, Hamiltonian Monte Carlo, monitoring convergence of iterative simulations, scalable computing, evaluation of approximate algorithms, predictive model evaluation, and simulation-based calibration. This work is inspired by applications including survey research, drug development, and environmental decision making.

I’m hoping you can clarify a Bayesian “metaphysics” question for me. Let me note I have limited experience with Bayesian statistics.

In frequentist statistics, probability has to do with what happens in the long run. For example, a p value is defined in terms of what happens if, from now till eternity, we repeatedly draw random samples from some population of interest, compute the value of a test statistic, and keep a running tabulation of the proportion of values that exceed a certain given value. Let me refer to probability in a frequentist context as F-probability.
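The long-run definition described above can be sketched as a small simulation. This is only an illustration under assumed conditions: the null model (standard normal data), the test statistic (the sample mean), the sample size, and the observed value 0.4 are all hypothetical choices, not anything from the email.

```python
import random
import statistics


def simulated_p_value(observed_stat, n, n_sims=20_000, seed=0):
    """One-sided Monte Carlo p-value: the proportion of sample means,
    simulated under the null, that are at least as large as the observed one."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sims):
        # Draw a fresh sample of size n from the assumed null model N(0, 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if statistics.fmean(sample) >= observed_stat:
            exceed += 1
    return exceed / n_sims


# Hypothetical observed sample mean of 0.4 from n = 25 observations.
p_sim = simulated_p_value(observed_stat=0.4, n=25)

# Exact one-sided p-value for comparison: z = 0.4 / (1 / sqrt(25)) = 2.
p_exact = 1.0 - statistics.NormalDist().cdf(2.0)

print(p_sim, p_exact)
```

With more simulated samples, the tabulated proportion settles ever closer to the exact tail area, which is the "from now till eternity" idea in finite form.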

In Bayesian statistics, probability has to do with degree of belief. Prior and posterior distributions refer to our degree of confidence (prior to looking at data and after looking at data, respectively) that a parameter falls within certain ranges of values, where 1 represents total certainty and 0 represents total disbelief. Let me refer to probability in a Bayesian context as B-probability.

Both F-probability and B-probability are valid interpretations of probability, in that they satisfy the axioms of probability. But they are distinct interpretations.

My conceptual confusion is that Bayes Theorem combines a term with an F-probability interpretation (the likelihood, which is essentially the density of the sampling distribution) with a term with a B-probability interpretation (density of the prior distribution) to produce an entity with a B-probability interpretation, namely, the density of the posterior distribution. I’m not questioning the validity of the derivation of Bayes Theorem here. Rather, it seems conceptually messy to me that an F-probability term is combined with a B-probability term; both terms have to do with “probability,” but what is meant by “probability” is very different for each of them.

Can you provide some conceptual clarity?
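For readers who want to see the mechanics the question refers to, here is a minimal sketch of Bayes' theorem on a grid: the posterior density is the likelihood times the prior density, renormalized. The binomial data (7 successes in 10 trials) and the uniform prior are assumptions chosen for illustration; the sketch shows the arithmetic only and takes no position on the interpretive question.

```python
from math import comb

# Hypothetical data: y successes in n trials.
n, y = 10, 7

# Grid of candidate values for the success probability theta (cell midpoints).
n_grid = 1000
width = 1.0 / n_grid
grid = [(i + 0.5) * width for i in range(n_grid)]

# Prior density: uniform on (0, 1), an assumption made for illustration.
prior = [1.0 for _ in grid]

# Likelihood: binomial probability of the observed data at each theta.
likelihood = [comb(n, y) * t**y * (1 - t) ** (n - y) for t in grid]

# Bayes' theorem: posterior is proportional to likelihood times prior.
unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
norm_const = sum(unnormalized) * width
posterior = [u / norm_const for u in unnormalized]

# Posterior mean; with a uniform prior the exact answer is (y + 1) / (n + 2).
post_mean = sum(t * d for t, d in zip(grid, posterior)) * width
print(post_mean)
```

Note that the likelihood enters only as a function of theta with the data held fixed at their observed values.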

My reply:

See here and here, also here and here.

At this point, I’ve written about this so many times I just have to point to the relevant links. Kinda like that joke about the jokes with the numbers.
