Discovering what mattered: Answering reverse causal questions using change-point models

Felix Pretis writes:

I came across your 2013 paper with Guido Imbens, “Why ask why? Forward causal inference and reverse causal questions,” and found it to be extremely useful for a closely related project my co-author Moritz Schwarz and I have been working on.

We introduce a formal approach to answer “reverse causal questions” by expanding on the idea mentioned in your paper that reverse causal questions involve “searching for new variables.” We place the concept of reverse causal questions into the domain of variable and model selection. Specifically, we focus on detecting and estimating treatment effects when both treatment assignment and timing are unknown. The setting of unknown treatment reflects the problem often faced by policy makers: rather than trying to understand whether a particular intervention caused an outcome to change, they might be concerned with the broader question of what affected the outcome in general, but they might be unsure what treatment interventions took place. For example, rather than asking whether carbon pricing reduced CO2 emissions, a policy maker might be interested in what reduces CO2 emissions in general.

We show that such unknown treatment can be detected as structural breaks in panels by using machine learning methods to remove all but relevant treatment interaction terms that capture heterogeneous treatment effects. We demonstrate the feasibility of this approach by detecting the impact of ETA terrorism on Spanish regional GDP per capita without prior knowledge of its occurrence.

Pretis and Schwarz describe their general idea in a paper, “Discovering what mattered: Answering reverse causal questions by detecting unknown treatment assignment and timing as breaks in panel models,” and they published an application of the approach in a paper, “Attributing agnostically detected large reductions in road CO2 emissions to policy mixes.”

It’s so cool to see this sort of work being done, transferring general concepts about causal inference to methods that can be used in real applications.

SIMD, memory locality, vectorization, and branch-point prediction

The title of this post lists the four most important considerations for performant code these days (late 2023).

SIMD

GPUs can do a lot of compute in parallel. High-end ($15K to $30K) GPUs like those used in big tech perform thousands of operations in parallel (50 teraflops for the H100). The catch is that they want all of those operations to be done in lockstep on different data. This is called single instruction, multiple data (SIMD). Matrix operations, as used by today’s neural networks, are easy to code with SIMD.
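
To make the “lockstep” idea concrete, here’s a minimal C++ sketch (my illustration, not Stan or vendor code): a dense matrix product, in which the same multiply-add gets applied to different data millions of times. A library like Eigen maps this pattern onto the CPU’s vector units, and GPU libraries map the same pattern onto thousands of GPU lanes.

```cpp
// Minimal sketch (illustration only): the kind of data-parallel operation
// SIMD hardware is built for. Every entry of C is computed by the same
// instruction sequence applied to different data, with no per-element
// branching, which is why libraries can hand this work to vector units
// or to a GPU.
#include <Eigen/Dense>
#include <iostream>

int main() {
  const int n = 512;
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);

  Eigen::MatrixXd C = A * B;  // one logical operation, millions of identical multiply-adds

  std::cout << "C(0,0) = " << C(0, 0) << "\n";
  return 0;
}
```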

GPUs cannot do 1000s of different things at once. This makes it challenging to write recursive algorithms like the no-U-turn sampler (NUTS) and is one of the reasons people like Matt Hoffman (developer of NUTS) have turned to generalized HMC.

GPUs can do different things in sequence if you keep memory on the GPU (in kernel). This is how deep nets can sequence feedforward, convolution, attention, activation, and GLM layers. Steve Bronder is working on keeping generated Stan GPU code in kernel.

Memory locality

Memory is slow. Really slow. The time it takes to fetch a fresh value from random access memory (RAM) is on the order of 100–200 times slower than the time it takes to execute a primitive arithmetic operation. On a CPU. The problem just gets worse with GPUs.

Modern CPUs are equipped with levels of cache. For instance, a consumer-grade 8-core CPU like an Intel i9 might have a 32 MB L3 cache shared among all the cores, a 1 MB L2 cache, and an 80 KB L1 cache. When memory is read in from RAM, it gets read in blocks containing not only the value you requested but also other values near it. A block gets pulled first into L3 cache, then into L2 cache, then into L1 cache, and then into registers on the CPU. This means if you have laid an array x out contiguously in memory and you have read x[n], then it is really cheap to read x[n + 1] because it’s either in your L1 cache already or being pipelined there. If your code accesses non-local pieces of memory, then you wind up getting cache misses. The further out the miss has to go (L1, then L2, then L3, then all the way to RAM), the longer the CPU has to wait to get the values it needs.

One way to see this in practice is to consider matrix layout in memory. If we use column major order, then each column is contiguous in memory and the columns are laid out one after the other. This makes it much more efficient to traverse the matrix by first looping over the columns, then looping over the rows. Getting the order wrong can be an order of magnitude penalty or more. Matrix libraries will do more efficient block-based transposes so this doesn’t bite users writing naive code.
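
Here’s a minimal sketch of the loop-order point (mine, not code from any matrix library): the same sum computed two ways over a column-major matrix stored in one flat vector. The cache-friendly version walks contiguous memory in its inner loop; the cache-hostile version strides through memory and, for large matrices, can be an order of magnitude slower.

```cpp
// Illustration only: summing a column-major matrix stored in a contiguous
// vector. Element (i, j) lives at index j * rows + i.
#include <cstddef>
#include <vector>

// Inner loop walks down a column: consecutive addresses, few cache misses.
double sum_cache_friendly(const std::vector<double>& a,
                          std::size_t rows, std::size_t cols) {
  double total = 0.0;
  for (std::size_t j = 0; j < cols; ++j)
    for (std::size_t i = 0; i < rows; ++i)
      total += a[j * rows + i];
  return total;
}

// Inner loop walks across a row: each access jumps `rows` doubles ahead,
// so for large matrices nearly every access can miss cache.
double sum_cache_hostile(const std::vector<double>& a,
                         std::size_t rows, std::size_t cols) {
  double total = 0.0;
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      total += a[j * rows + i];
  return total;
}
```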

The bottom line here is that even if you have 8 cores, you can’t run 8 parallel chains of MCMC as fast as you can run 1. On my Xeon desktop with 8 cores, I can run 4 chains in parallel, followed by another 4 in parallel, in the same amount of time as I can run 8 in parallel. As a bonus, my fan doesn’t whine as loudly. The slowdown with 8 parallel chains is not due to the CPUs being busy; it’s because the asynchronous execution causes a bottleneck in the cache. This can be overcome with hard work by restructuring parallel code to be more cache sensitive, but it’s a deep dive.

Performant code often recomputes the value of a function if its operands are in cache in order to reduce the memory pressure that would arise from storing the value. Or it explicitly reorders operations to be lazy in order to support recomputation. Stan does this to prioritize scalability over efficiency (i.e., it recomputes values, which means fewer memory fetches but more operations).

Vectorization

Modern CPUs pipeline their operations and also provide vector instructions, for example AVX and SSE on Intel chips. C++ compilers at high levels of optimization will exploit these if the right flags are enabled. This way, CPUs can do on the order of 8 simultaneous arithmetic operations. Writing loops in blocks of 8 so that they can exploit CPU vectorization is critical for performant code. The good news is that calling underlying matrix libraries like Eigen or BLAS will do that for you. The bad news is that if you write your own loops without attention to this, they are going to be slow compared to vectorized loops. You have to arrange for it yourself in C++ if you want performant code.
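
As a concrete sketch of the kind of loop that benefits (an illustration, not Stan or Eigen code): the update below has independent iterations and no branches, so with optimization enabled (for example, g++ -O3 -march=native) the compiler can emit AVX instructions that process several doubles per instruction. Compiled without optimization, it runs one element at a time.

```cpp
// Illustration only: a loop the compiler can auto-vectorize at -O3.
#include <cstddef>

void axpy(double a, const double* x, double* y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];  // independent iterations, no branches
  }
}
```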

Another unexpected property of modern CPUs for numerical computing is that integer operations are pretty much free. CPUs have separate integer and floating-point units, and in most numerical computing there is far less pressure on integer arithmetic. So you can often add integer arithmetic to a loop without slowing it down.

Branch-point prediction

When the CPU executes a conditional such as the compiled form of

if (A) then B else C;

it will predict whether A will evaluate to true or false. If it predicts true, then the operations in B will begin to execute “optimistically” at the same time as A. If A does evaluate to true, we have a head start. If A evaluates to false, then we have a branch-point misprediction. We have to backtrack, flush the results from optimistic evaluation of B, fetch the instructions for C, then continue. This is very very slow because of memory contention and because it breaks the data pipeline. And it’s double trouble for GPUs. Stan includes suggestions (pragmas) to the compiler as to which branch is more likely in our tight memory management code for automatic differentiation.
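
Here’s what such a hint looks like in modern C++ (a generic sketch, not Stan’s actual memory-management code): C++20 provides the [[likely]] and [[unlikely]] attributes, and older GCC/Clang code uses __builtin_expect for the same purpose.

```cpp
// Illustration only: telling the compiler which branch is the hot path so
// it can lay that path out straight-line and keep the branch predictor happy.
#include <stdexcept>

double checked_divide(double x, double y) {
  if (y == 0.0) [[unlikely]] {  // rare error path
    throw std::domain_error("divide by zero");
  }
  return x / y;  // common case, on the fall-through path
}
```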

Conclusion

The takeaway message is that for efficient code, our main concerns are memory locality, branch-point prediction, and vectorization. With GPUs, we further have to worry about SIMD. Good luck!

The problem with p-values is how they’re used

The above-titled article is from 2014. Key passage:

Hypothesis testing and p-values are so compelling in that they fit in so well with the Popperian model in which science advances via refutation of hypotheses. For both theoretical and practical reasons I am supportive of a (modified) Popperian philosophy of science in which models are advanced and then refuted. But a necessary part of falsificationism is that the models being rejected are worthy of consideration. If a group of researchers in some scientific field develops an interesting scientific model with predictive power, then I think it very appropriate to use this model for inference and to check it rigorously, eventually abandoning it and replacing it with something better if it fails to make accurate predictions in a definitive series of experiments. This is the form of hypothesis testing and falsification that is valuable to me. In common practice, however, the “null hypothesis” is a straw man that exists only to be rejected. In this case, I am typically much more interested in the size of the effect, its persistence, and how it varies across different situations. I would like to reserve hypothesis testing for the exploration of serious hypotheses and not as an indirect form of statistical inference that typically has the effect of reducing scientific explorations to yes/no conclusions.

The logical followup is that article I wrote the other day, Before data analysis: Additional recommendations for designing experiments to learn about the world.

But the real reason I’m bringing up this old paper is to link to this fun discussion revolving around how the article never appeared in the journal that invited it, because I found out they wanted to charge me $300 to publish it, and I preferred to just post it for free. (OK, not completely free; it does cost something to maintain these sites, but the cost is orders of magnitude less than $300 for 115 kilobytes of content.)

Hydrology Corner: How to compare outputs from two models, one Bayesian and one non-Bayesian?

Zac McEachran writes:

I am a Hydrologist and Flood Forecaster at the National Weather Service in the Midwest. I use some Bayesian statistical methods in my research work on hydrological processes in small catchments.

I recently came across a project that I want to use a Bayesian analysis for, but I am not entirely certain what to look for to get going on this. My issue: NWS uses a protocol for calibrating our river models using a mixed conceptual/physically-based model. We want to assess whether a new calibration is better than an old calibration. This seems like a great application for a Bayesian approach. However, a lot of the literature I am finding (and methods I am more familiar with) are associated with assessing goodness-of-fit and validation for models that were fit within a Bayesian framework, and then validated in a Bayesian framework. I am interested in assessing how a non-Bayesian model output compares with another non-Bayesian model output with respect to observations. Someday I would like to learn to use Bayesian methods to calibrate our models but one step at a time!

My response: I think you need somehow to give a Bayesian interpretation to your non-Bayesian model output. This could be as simple as taking 95% prediction intervals and interpreting them as 95% posterior intervals from a normally-distributed posterior. Or if the non-Bayesian fit only gives point estimates, then do some bootstrapping or something to get an effective posterior. Then you can use external validation or cross validation to compare the predictive distributions of your different models, as discussed here; also see Aki’s FAQ on cross validation.
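
To make that comparison step concrete, here’s a minimal sketch (my illustration, not NWS or Stan code, and the numbers are made up) of the kind of computation involved: treat each calibration’s held-out predictions as normal predictive distributions and compare average log predictive densities. The calibration with the higher average log score is making better probabilistic predictions of the held-out observations.

```cpp
// Illustration only: compare two calibrations by average log predictive
// density on held-out observations, treating each model's prediction as a
// normal distribution with the given mean and standard deviation.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

double normal_lpdf(double y, double mu, double sigma) {
  const double z = (y - mu) / sigma;
  const double log_two_pi = std::log(2.0 * std::acos(-1.0));
  return -0.5 * z * z - std::log(sigma) - 0.5 * log_two_pi;
}

double mean_log_score(const std::vector<double>& y,
                      const std::vector<double>& mu,
                      const std::vector<double>& sigma) {
  double total = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i)
    total += normal_lpdf(y[i], mu[i], sigma[i]);
  return total / y.size();
}

int main() {
  // Made-up held-out observations and two calibrations' predictive means/sds.
  std::vector<double> y    = {10.2, 12.5, 9.8, 11.1};
  std::vector<double> mu_a = {10.0, 12.0, 10.0, 11.0}, sd_a = {1.0, 1.0, 1.0, 1.0};
  std::vector<double> mu_b = { 9.0, 13.5,  8.5, 12.0}, sd_b = {2.0, 2.0, 2.0, 2.0};

  std::cout << "calibration A mean log score: " << mean_log_score(y, mu_a, sd_a) << "\n"
            << "calibration B mean log score: " << mean_log_score(y, mu_b, sd_b) << "\n";
  return 0;
}
```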

A Hydrologist and Flood Forecaster . . . how cool is that?? Last time we had this level of cool was back in 2009 when we were contacted by someone who was teaching statistics to firefighters.

On a really bad paper on birth month and autism (and why there’s value in taking a look at a clear case of bad research, even if it’s obscure and from many years ago)

In an otherwise unrelated thread on Brutus vs. Mo Willems, an anonymous commenter wrote:

Researchers found that the risk of autism in twins depended on the month they were born in, with January being 80% riskier than December.

The link is from a 2005 article in the fun magazine New Scientist, “Autism: Lots of clues, but still no answers,” which begins:

The risk of autism in twins appears to be related to the month they are born in. The chance of both babies having the disorder is 80 per cent higher for January births than December births.

This was one of the many findings presented at the conference in Boston last week. It typifies the problems with many autism studies: the numbers are too small to be definitive – this one was based on just 161 multiple-birth babies – and even if the finding does stand up, it raises many more questions than it answers.

The article has an excellently skeptical title and lead-off, so I was curious what’s up with the author, Celeste Biever. A quick search shows that she’s currently Chief News and Features editor at Nature, so still in the science writing biz. That’s good!

The above link doesn’t give the full article but I was able to read the whole thing through the Columbia University library. The relevant part is that one of the authors of the birth-month study was Craig Newschaffer of the Johns Hopkins School of Public Health. I searched for *Craig Newschaffer autism birth month* on Google Scholar and found an article, “Variation in season of birth in singleton and multiple births concordant for autism spectrum disorders,” by L. C. Lee, C. J. Newschaffer, et al., published in 2008 in Paediatric and Perinatal Epidemiology.

I suppose that, between predatory journals and auto-writing tools such as Galactica, the scientific literature will be a complete mess in a few years, but for now we can still find papers from 2008 and be assured that they’re the real thing.

The searchable online version only gave the abstract and references, but again I could find the full article through the Columbia library. And I can report to you that the claim that the “chance of both babies having the disorder is 80 per cent higher for January births than December births,” is not supported by the data.

Let’s take a look. From the abstract:

This study aimed to determine whether the birth date distribution for individuals with autism spectrum disorders (ASD), including singletons and multiple births, differed from the general population. Two ASD case groups were studied: 907 singletons and 161 multiple births concordant for ASD.

161 multiple births . . . that’s about 13 per month, sounds basically impossible for there to be any real evidence of different frequencies comparing December to January. But let’s see what the data say.

From the article:

Although a pattern of birth seasonality in autism was first reported in the early 1980s, the findings have been inconsistent. The first study to examine autism births by month was conducted by Bartlik more than two decades ago. That study compared the birth month of 810 children diagnosed with autism with general births and reported that autism births were higher than expected in March and August; the effect was more pronounced in more severe cases. A later report analysed data from the Israeli national autism registry which had information on 188 individuals diagnosed with autistic disorder. It, too, demonstrated excess births in March and August. Some studies, however, found excess autism births in March only.

March and August, huh? Sounds like noise mining to me.

Anyway, that’s just the literature. Now on to the data. First they show cases by day:

Ok, that was silly, no real reason to have displayed it at all. Then they have graphs by month. They use some sort of smoothing technique called Empiric Mode Decomposition, whatever. Anyway, here’s what they’ve got, first for autistic singleton births and then for autistic twins:

Looks completely random to me. The article states:

In contrast to the trend of the singleton controls, which were relatively flat throughout the year, increases in the spring (April), the summer (late July) and the autumn (October) were found in the singleton ASD births (Fig. 2). Trends were also observed in the ASD concordant multiple births with peaks in the spring (March), early summer (June) and autumn (October). These trends were not seen in the multiple birth controls. Both ASD case distributions in Figs. 2 and 3 indicated a ‘valley’ during December and January. Results of the non-parametric time-series analyses suggested there were multiple peaks and troughs whose borders were not clearly bound by month.

C’mon. Are you kidding me??? Then this:

Caution should be used in interpreting the trend for multiple concordant births in these analyses because of the sparse available data.

Ya think?

Why don’t they cut out the middleman and just write up a bunch of die rolls?

Then this:

Figures 4 and 5 present relative risk estimates from Poisson regression after adjusting for cohort effects. Relative risk for multiple ASD concordant males was 87% less in December than in January with 95% CIs from 2% to 100%. In addition, excess ASD concordant multiple male births were indicated in March, May and September, although they were borderline for statistical significance.

Here are the actual graphs:

No shocker that if you look at 48 different comparisons, you’ll find something somewhere that’s statistically significant at the 5% level and a couple more items that are “borderline for statistical significance.”

This is one of these studies that (a) shows nothing, and (b) never had a chance. Unfortunately, statistics education and practice are focused on data analysis and statistical significance, not so much on design. This is just a ridiculously extreme case of noise mining.

In addition, I came across an article, “The Epidemiology of Autism Spectrum Disorders,” by Newschaffer et al., published in the Annual Review of Public Health in 2007, that doesn’t mention birth month at all. So, somewhere between 2005 and 2007, it seems that Newschaffer decided that whatever birth-month effects were out there weren’t important enough to include in a 20-page review article. Then a year later they published a paper with all sorts of bold claims. Does not make a lot of sense to me.

Shooting a rabbit with a cannon?

Ok, this is getting ridiculous, you might say. Here we are picking to death an obscure paper from 15 years ago, an article we only heard about because it was indirectly referred to in a news article from 2005 that someone mentioned in a blog comment.

Is this the scientific equivalent of searching for offensive quotes on Twitter and then getting offended? Am I just being mean to go through the flaws of this paper from the archives?

I don’t think so. I think there’s a value to this post, and I say it for two reasons.

1. Autism is important! There’s a reason why the government funds a lot of research on the topic. From the above-linked paper:

The authors gratefully acknowledge the following people and institutions for their resources and support on this manuscript:
1 The Autism Genetic Resource Exchange (AGRE) Consortium. AGRE is a programme of Cure Autism Now and is supported, in part, by Grant MH64547 from the National Institute of Mental Health to Daniel H. Geschwind.
2 Robert Hayman, PhD and Isabelle Horon, DrPH at the Maryland Department of Health and Mental Hygiene Vital Statistics Administration for making Maryland State aggregated birth data available for this analysis.
3 Rebecca A. Harrington, MPH, for editorial and graphic support.
Drs Lee and Newschaffer were supported by Centers for Disease Control and Prevention cooperative agreement U10/CCU320408-05, and Dr. Zimmerman and Ms. Shah were supported by Cure Autism Now and by Dr Barry and Mrs Renee Gordon. A preliminary version of this report was presented in part at the International Meeting for Autism Research, Boston, MA, May 2005.

This brings us to two points:

1a. All this tax money spent on a hopeless study of monthly variation in a tiny dataset is money that wasn’t spent on more serious research into autism or for that matter on direct services of some sort. Again, the problem with this study is not just that the data are indistinguishable from pure noise. The problem is that, even before starting the study, a competent analysis would’ve found that there was not enough data here to learn anything useful.

1b. Setting funding aside, attention given to this sort of study (for example, in that 2005 meeting and in the New Scientist article) is attention not being given to more serious research on the topic. To the extent that we are concerned about autism, we should be concerned about this diversion of attentional resources. At best, other researchers will just ignore this sort of pure-noise study; at worst, other researchers will take it seriously and waste more resources following it up in various ways.

Now, let me clarify that I’m not saying the authors who did this paper are bad people or that they were intending to waste government money and researchers’ attention. I can only assume they were 100% sincere and just working in a noise-mining statistical paradigm. This was 2005, remember, before “p-hacking,” “researcher degrees of freedom,” and “garden of forking paths” became commonly understood concepts in the scientific community. They didn’t know any better! They were just doing what they were trained to do: gather data, make comparisons, highlight “statistical significance” and “borderline statistical significance,” and tell stories. That’s what quantitative research was!

And that brings us to our final point:

2. That noise-mining paradigm is still what a lot of science and social science looks like. See here, for example. We’re talking about sincere, well-meaning researchers, plugged into the scientific literature and, unfortunately, pulling patterns out of what is essentially pure noise. Some of this work gets published in top journals, some of it gets adoring press treatment, some of it wins academic awards. We’re still there!

For that reason, I think there’s value in taking a look at a clear case of bad research. Not everything’s a judgment call. Some analyses are clearly valueless. Another example is the series of papers by that sex-ratio researcher, all of which are a mixture of speculative theory and pure noise mining, and all of which would be stronger without the distraction of data. Again, they’d be better off just reporting some die rolls; at least then the lack of relevant information content would be clearer.

P.S. One more time: I’m not saying the authors of these papers are bad people. They were just doing what they were trained to do. It’s our job as statistics teachers to change that training; it’s also the job of the scientific community not to reward noise-mining—even inadvertent noise-mining—as a career track.

Simulations of measurement error and the replication crisis: Maybe Loken and I have a mistake in our paper?

Federico D’Atri writes:

I am a PhD student in Neuroscience and Cognitive Science at the University of Trieste.

As an exercise, I was trying to replicate the simulation results in “Measurement error and the replication crisis”; however, what remains unclear to me is how introducing measurement error, even when selecting for statistical significance, can lead to higher estimates compared to ideal conditions in over 50% of cases.

Imagine the case of two variables, x and y, with a certain degree of true correlation. Introducing measurement noise to y will produce a new variable, y’, which has a diminished true correlation with x. The distribution of the correlation coefficient calculated on a sample is known and depends both on the sample size and the true correlation. If we have a reduced true correlation (due to noise), even when selecting for statistical significance (and hence truncating the distribution of sample correlation coefficients), shouldn’t we find that the correlation in the noise-free case is higher in the majority of the cases?

In your article, there are three graphs. I’ve managed to reproduce the first and second graphs, and I understand that increasing the sample size decreases the proportion of studies where the absolute effect size is greater when noise is added. In the article’s second graph, however, it seems that even for small sample sizes, the majority of the time the effect is larger in the “ideal study” scenario when selecting for statistical significance. The third graph, while correctly representing the monotonic decreasing trend of the proportion, seems to contradict the second graph regarding small samples. Even though the effect might be larger, I don’t think that introducing noise would result in an effect size estimate that’s larger than without noise more than 50% of the time, given the reduced true correlation.

I ran some simulations and the only scenario in which this happens is when considering correlations very close to zero. By adding noise, thus reducing the true correlation, it becomes “easier” to obtain large, statistically significant correlations of the opposite sign. I might be missing something or making a blatant error, but I can’t see how, even when selecting for statistical significance and for small sample sizes, once we select the effect sign consistent with the true correlation, adding noise could result in larger effects than without it over 50% of the time.

Here is a Word document where my argument is better formalized. Additionally, I’ve included two files here and here with the R code used for plotting the exact distribution of the correlation coefficients and for the simulations used to reproduce the plots from your article.

I responded that I didn’t remember exactly what we did in the paper, but I did have some R code which we used to make our graphs. It’s possible that we made a mistake somewhere or that we described our results in a confused way.

D’Atri responded:

The specific statement that I think could be wrong is this: “For small N, the reverse can be true. Of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small”.

The simulations I did are slightly different: I added measurement error only on y and not on x, but the result would be the same (if you add measurement error, the true correlation gets smaller in magnitude).

I’ve reviewed the latter section of the code, particularly the segment related to generating the final graph. It appears there’s a problem: the approach used selects the 100 most substantial “noisy” correlations. By only using the larger noisy correlations, it neglects a fair comparison with the “non-noisy” ones (I added a comment on the line where I think the problem lies).

To address this, I’ve adapted your code while preserving its original structure. Specifically, I’ve chosen to include all effects whenever either of the two correlations (with or without added measurement noise) demonstrates statistical significance. Given the structure of our data, in particular the small sample sizes ranging from 25 to 300, the low true correlation values, and the fact that selecting for statistical significance yields few statistically significant correlations, I believed it was good to increase the simulation count significantly. This adjustment provides a more reliable proportion estimate. With these changes, the graph now closely resembles what one would achieve using your initial code, but without any selection filters based on significance or size.

Furthermore, I’ve added another segment to the code. This new portion employs a package allowing for the generation of random data with a specific foundational structure (in our case, a predetermined degree of correlation).

One thing I definitely agree with you on is the need to minimize measurement error as much as possible and the detrimental effects of selecting based on statistical significance. From my perspective, the presence of greater measurement error amplifies the tendency toward poor research practices, post-hoc hypotheses, and the relentless pursuit of statistically significant effects where there is mainly noise.

I have not had a chance to study this in detail, so I’m posting the above discussion and code to share with others for now. The topic is important and worth thinking about.
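
For readers who want to poke at the question themselves, here is a minimal self-contained sketch of the comparison being discussed (my own illustration in C++, not the paper’s R code or D’Atri’s; the parameter values are arbitrary): simulate many small-N studies, add measurement error to y, select on statistical significance of the noisy estimate, and tabulate how often the noisy estimate is larger in absolute value than the noise-free one.

```cpp
// Illustration only: measurement error plus selection on significance.
// For each replication, draw (x, y) with true correlation rho, form a noisy
// y by adding measurement error, compute both sample correlations, and,
// among replications where the noisy estimate is "significant" (|t| > 1.96,
// a crude approximation), count how often |r_noisy| > |r_clean|.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

double sample_correlation(const std::vector<double>& a, const std::vector<double>& b) {
  const std::size_t n = a.size();
  double ma = 0, mb = 0;
  for (std::size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
  ma /= n; mb /= n;
  double sab = 0, saa = 0, sbb = 0;
  for (std::size_t i = 0; i < n; ++i) {
    sab += (a[i] - ma) * (b[i] - mb);
    saa += (a[i] - ma) * (a[i] - ma);
    sbb += (b[i] - mb) * (b[i] - mb);
  }
  return sab / std::sqrt(saa * sbb);
}

int main() {
  const int n = 25;            // sample size per simulated study
  const double rho = 0.15;     // true correlation of x and y
  const double error_sd = 0.5; // sd of measurement error added to y
  const int reps = 100000;

  std::mt19937 rng(1234);
  std::normal_distribution<double> norm(0.0, 1.0);

  int n_sig = 0, n_noisy_larger = 0;
  for (int r = 0; r < reps; ++r) {
    std::vector<double> x(n), y(n), y_noisy(n);
    for (int i = 0; i < n; ++i) {
      x[i] = norm(rng);
      y[i] = rho * x[i] + std::sqrt(1.0 - rho * rho) * norm(rng);
      y_noisy[i] = y[i] + error_sd * norm(rng);
    }
    const double r_clean = sample_correlation(x, y);
    const double r_noisy = sample_correlation(x, y_noisy);
    const double t = r_noisy * std::sqrt((n - 2) / (1.0 - r_noisy * r_noisy));
    if (std::fabs(t) > 1.96) {  // crude significance filter on the noisy estimate
      ++n_sig;
      if (std::fabs(r_noisy) > std::fabs(r_clean)) ++n_noisy_larger;
    }
  }
  std::cout << "significant noisy estimates: " << n_sig << "\n";
  if (n_sig > 0)
    std::cout << "fraction where |noisy| > |clean|: "
              << static_cast<double>(n_noisy_larger) / n_sig << "\n";
  return 0;
}
```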

We were gonna submit something to Nature Communications, but then we found out they were charging $6290 for publication. For that amount of money, we could afford 37% of an invitation to a conference featuring Grover Norquist, Gray Davis, and a rabbi, or 1/160th of the naming rights for a sleep center at the University of California, or 4735 Jamaican beef patties.

My colleague and I wrote a paper, and someone suggested we submit it to the journal Nature Communications. Sounds fine, right? But then we noticed this:

Hey! We wrote the damn article, right? They should be paying us to publish it, not the other way around. Ok, processing fees yeah yeah, but $6290??? How much labor could it possibly take to publish one article? This makes no damn sense at all. I guess part of that $6290 goes to paying for that stupid website where they try to con you into paying several thousand dollars to put an article on their website that you can put on Arxiv for free.

Ok, then the question arises: What else could we get for that $6290? A trawl through the blog archive gives some possibilities:

– 37% of an invitation to a conference featuring Grover Norquist, Gray Davis, and a rabbi

– 1/160th of the naming rights for a sleep center at the University of California

– 4735 Jamaican beef patties

I guess that, among all these options, the Nature Communications publication would do the least damage to my heart. Still, I couldn’t quite bring myself to commit to forking over $6290. So we’re sending the paper elsewhere.

At this point I’m still torn between the other three options. 4735 Jamaican beef patties sounds good, but 1/160th of a sleep center named just for me, that would be pretty cool. And 37% of a chance to meet Grover Norquist, Gray Davis, and a rabbi . . . that’s gotta be the most fun since Henry Kissinger’s 100th birthday party. (Unfortunately I was out of town for that one, but I made good use of my invite: I forwarded it to Kissinger superfan Cass Sunstein, and it seems he had a good time, so nothing was wasted.) So don’t worry, that $6290 will go to a good cause, one way or another.

Blue Rose Research is hiring (yet again)!

Blue Rose Research has a few roles that we’re actively hiring for as we gear up to elect more Democrats in 2024, and advance progressive causes!

A bit about our work:

  • For the 2022 US election, we used engineering and statistics to advise major progressive organizations on directing hundreds of millions of dollars to the right ads and states.
  • We tested thousands of ads and talking points in the 2022 election cycle and partnered with orgs across the space to ensure that the most effective messages were deployed from the state legislative level all the way up to Senate and Gubernatorial races and spanning the issue advocacy space as well.
  • We were more accurate than public polling in identifying which races were close across the Senate, House, and Gubernatorial maps.
  • And we’ve built up a technical stack that enables us to continue to build on innovative machine learning, statistical, and engineering solutions.

Now as we are looking ahead to 2024, we are hiring for the following positions:

All positions are remote, with optional office time with the team in New York City.

Please don’t hesitate to reach out with any questions (shira@blueroseresearch.org).

Experts and the politics of disaster prevention

Martin Gilens, Tali Mendelberg, and Nicholas Short write:

Despite the importance of effective disaster policy, governments typically fail to produce it. The main explanation offered by political scientists is that voters strongly support post-disaster relief but not policies that seek to prevent or prepare for disaster. This study challenges that view. We develop novel measures of preferences for disaster prevention and post-disaster relief. We find strong support for prevention policies and candidates who pursue them, even among subgroups least likely to do so. Support for prevention has the hallmarks of “real” attitudes: consistency across wordings and response formats, including open ended probes; steadfastness in the face of arguments; and willingness to make trade-offs against disaster relief, increased taxes, and reduced spending on other programs. Neither cognitive biases for the here and now nor partisan polarization prevent robust majority support for disaster prevention. We validate these survey findings with election results.

This is from a paper, “The Politics of Disaster Prevention,” being presented next week in the political science department; here’s a link to an earlier presentation. I’m just sharing the abstract because I’m not sure if they want the full article to be available yet.

In any case, the results are interesting. Lots to chew on regarding political implications. And it reminded me of something that comes up in policy discussions.

Disaster preparedness is an area where experts play an important intermediary role between government, citizens, media, and activists. And, unlike, say, medical or defense policy, where there are recognized categories of experts (doctors and retired military officers), there’s not really such a thing as a credentialed expert on disaster prevention.

I’m not saying that we should always trust doctors or retired military officers (or, for that matter, political science professors), just that there’s some generally-recognized path to recognition of expertise.

In contrast, when it comes to disaster preparedness, we might hear from former government officials and various entrepreneurial academics such as Dan Ariely and Cass Sunstein who have a demonstrated willingness to write about just about anything (ok, I have such willingness too, but for better or worse I’m not an unofficially NPR-certified authority).

We might also hear from researchers who are focused on judgement under uncertainty—but they can also have problems with probability themselves. The problem here might be that academics tend to think in theoretical terms—even when we’re working on an applied problem, we’re typically thinking about how it slots into our larger research program—and, as a result, we can botch the details, which is a problem when the topic is disaster preparedness.

I offer no solutions here; I’m just trying to add one small bit to the framework of Gilens, Mendelberg, and Short regarding the politics of disaster preparedness. They talk about voters and politicians, and to some extent about media and activists; somehow the fact that there are no generally recognized experts in the area seems relevant too.

I sent the above to the authors, and Mendelberg replied:

Prevention policy experts do exist. FEMA may be best known. Less known is that disaster response and preparedness is a profession, with its own journals, training certification, and even a Bureau of Labor Statistics classification (Emergency Management Director, numbering about 12,000 jobs, mostly in local and state government). Btw, Columbia has a center for disaster preparedness, and they offer training certifications.

Do these experts have influence? One comparison point is the role of experts in opinion about climate policy. That role has become fraught, as that policy domain has become politically polarized. People skeptical of climate change are turned off by climate scientists asserting their scientific expertise to advocate for policy. By contrast, according to our findings, disaster prevention is not very polarized. So prevention experts could shape public opinion on prevention policy. Whether they shape disaster policy is a separate question. Anecdotally, I’ve heard they wish they had more policy influence. Are they hamstrung because politicians under-estimate public support? An interesting question.

Wow, indeed Columbia appears to be home to the National Center for Disaster Preparedness. I hope some of its staffers come to Mendelberg’s talk.

Teaching materials now available for Llaudet and Imai’s Data Analysis for Social Science!

Last year we discussed the exciting new introductory social-science statistics textbook by Elena Llaudet and Kosuke Imai.

Since then, Llaudet has created a website with tons of materials for instructors.

This is the book that I plan to teach from, next time I teach introductory statistics. As it is, I recommend it as a reference for students in more advanced classes such as Applied Regression and Causal Inference, if they want a clean refresher from first principles.

EDA and modeling

This is Jessica. This past week we’ve been talking about exploratory data analysis (EDA) in my interactive visualization course for CS undergrads, which is one of my favorite topics. I get to talk about model checks and graphical inference, why some people worry about looking at data too much, the limitations of thinking about the goal of statistical analysis as rejecting null hypotheses, etc. If nothing else, I think the students get intrigued because they can tell I get worked up about these things!

However, I was also reminded last week in reading some recent papers that there are still a lot of misconceptions about exploratory data analysis in research areas like visualization and human-computer interaction. EDA is sometimes described by well-meaning researchers as being essentially model-free and hypothesis-free, as if it’s a very different style of analysis than what happens when an analyst is exploring some data with some hunches about what they might find. 

It bugs me when people use the term EDA as synonymous with having few to no expectations about what they’ll find in the data. Identifying the unexpected is certainly part of EDA, but casting the analyst as a blank slate loses much of the nuance. For one, it’s hard to even begin making graphics if you truly have no idea what kinds of measurements you’re working with. And once you learn how the data were collected, you probably begin to form some expectations. It also mischaracterizes the natural progression as you build up understanding of the data and consider possible interpretations. Tukey, for instance, wrote about different phases in an exploratory analysis, some of which involve probabilistic reasoning in the sense of assessing “With what accuracy are the appearances already found to be believed?” Similar to people assuming that “Bayesian” is equivalent to Bayes rule, the term EDA is often used to refer to some relatively narrow phase of analysis rather than something multi-faceted and nuanced.

As Andrew and I wrote in our 2021 Harvard Data Science Review article, the simplistic (and unrealistic) view of EDA as not involving any substantive a priori expectations on the part of the analyst can be harmful for practical development of visualization tools. It can lead to a plethora of graphical user interface systems, both in practice and research, that prioritize serving up easy-to-parse views of the data, at the expense of surfacing variation and uncertainty or enabling the analyst to interrogate their expectations. These days we have lots of visualization recommenders for recommending the right chart type given some query, but it’s usually about getting the choice of encodings (position, size, etc.) right. 

What is better? In the article we had considered what a GUI visual analysis tool might look like if it took the idea of visualization as model checking seriously, including displaying variation and uncertainty by default and making it easier for the analyst to specify and check the data against provisional statistical models that capture relationships they think they see. (In Tableau Software, for example, it’s quite a pain to fit a simple regression to check its predictions against the data). But there was still a leap left after we wrote this, between proposing the ideas and figuring out how to implement this kind of support in a way that would integrate well with the kinds of features that GUI systems offer without resulting in a bunch of new problems. 

So, Alex Kale, Ziyang Guo, Xiao-li Qiao, Jeff Heer, and I recently developed EVM (Exploratory Visual Modeling), a prototype Tableau-style visual analytics tool where you can drag and drop variables to generate visualizations, but which also includes a “model bar.” Using the model bar, the analyst can specify provisional interpretations (in the form of regression) and check their predictions against the observed data. The initial implementation provides support for a handful of common distribution families and takes input in the form of Wilkinson-Pinheiro-Bates syntax. 

The idea is that generating predictions under different model assumptions absolves the analyst from having to rely so heavily on their imagination to assess hunches they have about which variables have explanatory power. If I think I see some pattern as I’m trying out different visual structures (e.g., faceting plots by different variables), I can generate models that correspond to the visualization I’m looking at (in the sense of having the same variables as predictors as shown in the plot), as well as view-adjacent models that might add or remove variables relative to the visualization specification.

As we were developing EVM, we quickly realized that trying to pair the model and the visualization by constraining them to involve the same variables is overly restrictive. And a visualization will generally map to multiple possible statistical models, so why aim for congruency?

I see this project, which Alex presented this week at IEEE VIS in Melbourne, as an experiment rather than a clear success or failure. There have been some interesting ideas proposed over the years related to graphical inference and the connection between visualizations and statistical models, but I’ve seen few attempts to locate them in existing workflows for visual analysis like those supported by GUI tools. Line-ups, for instance, which hide a plot of the observed data amongst a line-up of plots representing the null hypothesis, are a cool idea, but the implementations I’ve seen have been standalone software packages (e.g., in R) rather than attempts to integrate them into the types of visual analysis tools that non-programmers are using. To bring these ideas into existing tools, we have to think about what kind of workflow we want to encourage, and how to avoid new potential failure modes. For example, with EVM there’s the risk that being able to directly check different models as one looks at the data leaves analysts with a sense that they’ve thoroughly checked their assumptions and can be even more confident about what explains the patterns. That’s not what we want.

Playing around with the tool ourselves has been interesting, in that it’s forced us to think about what the ideal use of this kind of functionality is, and under what conditions it seems to clearly benefit an analysis over not having it. The benefits are nuanced. We also had 12 people familiar with visual analysis in tools like Tableau use the system, and observed how their analyses of datasets we gave them seemed to differ from what they did without the model bar. Without it they all briefly explored patterns across a broad set of available variables and then circled back to recheck relationships they had already investigated. Model checking, on the other hand, tended to structure all but one participant’s thinking around one or two long chains of operations geared toward gradually improving models, through trying out different ways of modeling the distribution of the outcome variable, or selection of predictor variables. This did seem to encourage thinking about the data-generating process, which was our goal, though a few of them got fixated on details in the process, like trying to get a perfect visual match between predictions and observed data (without any thought as to what they were changing in the model spec).

Figuring out how to avoid these risks requires understanding who exactly can benefit from this, which is itself not obvious because people use these kinds of GUI visual analysis tools in lots of different ways, from data diagnostics and initial data analysis to dashboard construction as a kind of end-user programming. If we think that a typical user is not likely to follow up on their visual interpretations by gathering new data to check if they still hold, then we might need to build in hold-out sets to prevent perceptions that models fit during data exploration are predictive. To improve the ecosystem of visual analysis tools, we need to understand goals, workflow, and expertise.

Springboards to overconfidence: How can we avoid . . .? (following up on our discussion of synthetic controls analysis)

Following up on our recent discussion of synthetic control analysis for causal inference, Alberto Abadie points to this article from 2021, “Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects.”

Abadie’s paper is very helpful in that it lays out the key assumptions and decision points, which can help us have a better understanding of what went so wrong in the paper on Philadelphia crime rates that we discussed in my earlier post.

I think it’s a general concern in methods papers (mine included!) that we tend to focus more on examples where the method works well than on examples where it doesn’t. Abadie’s paper has an advantage over mine in that he gives conditions under which a method will work, and it’s not his fault that researchers then use the methods and get bad answers.

Regarding the specific methods issue, of course there are limits to what can be learned from a single treated unit (N=1), whether analyzed using synthetic control or any other approach. It seems that researchers sometimes lose track of that point in their desire to make strong statements. On a very technical level, I suspect that, if researchers are using a weighted average as a comparison, they’d do better using some regularization rather than just averaging over a very small number of other cases. But I don’t think that would help much in that particular application that we were discussing on the blog.

The deeper problem

The question is, when scholars such as Abadie write such clear descriptions of a method, including all its assumptions, how is it that applied researchers such as the authors of that Philadelphia article make such a mess of things? The problem is not unique to synthetic control analysis; it also arises with other “identification strategies” such as regression discontinuity, instrumental variables, linear regression, and plain old randomized experimentation. In all these cases, researchers often seem to end up using the identification strategy not as a tool for learning from data but rather as a sort of springboard to overconfidence. Beyond causal inference, there are all the well-known misapplications of Bayesian inference and classical p-values. No method is safe.

So, again, nothing special about synthetic control analysis. But what did happen in the example that got this discussion started? To quote from the original article:

The research question here is whether the application of a de-prosecution policy has an effect on the number of homicides for large cities in the United States. Philadelphia presents a natural experiment to examine this question. During 2010–2014, the Philadelphia District Attorney’s Office maintained a consistent and robust number of prosecutions and sentencings. During 2015–2019, the office engaged in a systematic policy of de-prosecution for both felony and misdemeanor cases. . . . Philadelphia experienced a concurrent and historically large increase in homicides.

After looking at the time series, here’s my quick summary: Philadelphia’s homicide rate went up after 2014, during the same period in which the city decreased prosecutions, and this was part of a national trend of increased homicides, but there’s no easy way, given the directly available information, to compare to other cities with and without that policy.

I’ll refer you to my earlier post and its comment thread for more on the details.

At this point, the authors of the original article used a synthetic controls analysis, following the general approach described in the Abadie paper. The comparisons they make are to a weighted average of Detroit, New Orleans, and New York. The trouble is . . . that’s just 3 cities, and homicide rates can vary a lot from city to city. There’s no good reason to think that an average of three cities that give you numbers comparable to Philadelphia’s homicide rates or counts in the five previous years will give you a reasonable counterfactual for trends in the next five years. Beyond this, some outside researchers pointed out many forking paths in the published analysis. Forking paths are not in themselves a problem (my own applied work is full of un-preregistered data coding and analysis decisions); the relevance here is that they help explain how it’s possible for researchers to get apparently “statistically significant” results from noisy data.

So what went wrong? Abadie’s paper discusses a mathematical problem: if you want to compare Philadelphia to some weighted average of the other 96 cities, and if you want these weights to be positive and sum to 1 and be estimated using an otherwise unregularized procedure, then there are certain statistical properties associated with using a procedure which, in this case, if various decisions are made, will lead to choosing a particular average of Detroit, New Orleans, and New York. There’s nothing wrong with doing this, but, ultimately, all you have is a comparison of 1 city to 3 cities, and it’s completely legit from an applied perspective to look at these cities and recognize how different they all are.
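
For reference, here is the textbook version of the weighting problem (this is the standard synthetic control formulation from Abadie’s work, not necessarily the exact specification used in the Philadelphia paper):

```latex
% X_1: vector of pre-treatment characteristics of the treated unit (here, Philadelphia)
% X_0: matrix of the same characteristics for the J comparison units
% V:   a positive semidefinite matrix weighting the characteristics
\min_{w} \; (X_1 - X_0 w)^\top V (X_1 - X_0 w)
\quad \text{subject to} \quad w_j \ge 0 \ \text{for all } j, \qquad \sum_{j=1}^{J} w_j = 1
```

With only pre-treatment fit pinning down the weights and no regularization beyond the simplex constraint, the solution typically puts positive weight on only a handful of units, which is how you end up comparing one city to three.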

It’s not the fault of the synthetic control analysis if you have N=1 in the treatment group. It’s just the way things go. The error is to use that analysis to make strong claims, and the further error is to think that the use of this particular method—or any particular method—should insulate the analysis from concerns about reasonableness. If you want to compare one city to 96 others, then your analysis will rely on assumptions about comparability of the different cities, and not just on one particular summary such as the homicide counts during a five-year period.

You can say that this general concern arises with linear regression as well: you’re only adjusting for whatever pre-treatment variables are included in the model. For example, when we estimated the incumbency advantage in congressional elections by comparing elections with incumbents running for reelection to elections in open seats, adjusting for previous vote share and party control, it would be a fair criticism to say that maybe the treatment and control cases differed in other important ways not included in the analysis. And we looked at that! I’m not saying our analysis was perfect; indeed, a decade and a half later we reanalyzed the data with a measurement-error model and got what we think were improved results. It was a big help that we had replication: many years, and many open-seat and incumbent elections in each year. This Philadelphia analysis is different because it’s N=1. If we tried to do linear regression with N=1, we’d have all sorts of problems. Unfortunately, the synthetic control analysis did not resolve the N=1 problem (it’s not supposed to!), but it did seem to lead the authors into some strong claims that did not make a lot of sense.

P.S. I sent the above to Abadie, who added:

I would like to share a couple of thoughts about N=1 and whether it is good or bad to have a small number of units in the comparison group.

Synthetic controls were originally proposed to address the N=1 (or low N) setting in cases with aggregate and relatively noiseless data and strong co-movement across units. I agree with you that they do not mechanically solve the N=1 problem in general (and that nothing does!). They have to be applied with care and there will be settings where they do not produce credible estimates (e.g., noisy series, short pre-intervention windows, poor pre-intervention fit, poor prediction in hold-out pre-intervention windows, etc). There are checks (e.g., predictive power in hold-out pre-intervention windows) that help assess the credibility of synthetic control estimates in applied settings.

Whether a few controls or many controls are better depends on the context of the investigation and on what one is trying to attain. Precision may call for using many comparisons. But there is a trade-off. The more units we use as comparisons, the less similar those may be relative to the treated unit. And the use of a small number of units allows us to evaluate / correct for potential biases created by idiosyncratic shocks and / or interference effects on the comparison units. If the aggregate series are “noiseless enough” like in the synthetic control setting, one might care more about reducing bias than about attaining additional precision.

Postdoc on Bayesian methodological and applied work! To optimize patient care! Using Stan! In North Carolina!

Sam Berchuck writes:

I wanted to bring your attention to a postdoc opportunity in my group at Duke University in the Department of Biostatistics & Bioinformatics. The full job ad is here: https://forms.stat.ufl.edu/statistics-jobs/entry/10978/.

The postdoc will work on Bayesian methodological and applied work, with a focus on modeling complex longitudinal biomedical data (including electronic health records and mobile health data) to create data-driven approaches to optimize patient care among patients with chronic diseases. The position will be particularly interesting to people interested in applying Bayesian statistics in real-world big data settings. We are looking for people who have experience in Bayesian inference techniques, including Stan!

Interesting. In addition to the Stan thing, I’m interested in data-driven approaches to optimize patient care. This is an area where a Bayesian approach, or something like it, is absolutely necessary, as you typically just won’t have enough data to make firm conclusions about individual effects, so you have to keep track of uncertainty. Sounds like a wonderful opportunity.

“Modeling Social Behavior”: Paul Smaldino’s cool new textbook on agent-based modeling

Paul Smaldino is a psychology professor who is perhaps best known for his paper with Richard McElreath from a few years ago, “The Natural Selection of Bad Science,” which presents a sort of agent-based model that reproduces the growth in the publication of junk science that we’ve seen in recent decades.

Since then, it seems that Smaldino has been doing a lot of research and teaching on agent-based models in social science more generally, and he just came out with a book, “Modeling Social Behavior: Mathematical and Agent-Based Models of Social Dynamics and Cultural Evolution.” The book has social science, it has code, it has graphs—it’s got everything.

It’s an old-school textbook with modern materials, and I hope it’s taught in thousands of classes and sells a zillion copies.

There’s just one thing that bothers me. The book is entertainingly written and bursting with ideas, and it also does a great job of airing concerns about the models it’s simulating, not just acting like everything’s already known. My concern is that nobody reads books anymore. If I think about students taking a class in agent-based modeling and using this book, it’s hard for me to picture most of them actually reading the book. They’ll start with the homework assignments and then flip through the book to try to figure out what they need. That’s how people read nonfiction books nowadays, which I guess is one reason that books, even those I like, are typically repetitive and low on content. Readers don’t expect the book to offer a delightful reading experience, so authors don’t deliver it, and then readers expect it even less, etc.

To be clear: this is a textbook, not a trade book. It’s a readable and entertaining book in the way that Regression and Other Stories is a readable and entertaining book, not in the way that Guns, Germs, and Steel is. Still, within the framework of being a social science methods book, it’s entertaining and thought-provoking. Also, I like it as a methods book because it’s focused on models rather than on statistical inference. We tried to get a similar feel with A Quantitative Tour of Social Sciences but with less success.

So it kinda makes me sad to see this much care put into a book that probably very few students will read paragraph by paragraph. I think things were different 50 years ago: back then, there wasn’t anything online to read; you’d buy a textbook, it was in front of you, and so you’d read it. On the plus side, readers can now go in and make the graphs themselves—I assume that Smaldino has a website somewhere with all the necessary code—so there’s that.

P.S. In the preface, Smaldino is “grateful to all the modelers whose work has inspired this book’s chapters . . . particularly want to acknowledge the debt owed to the work of,” and then he lists 16 names, one of which is . . . Albert-László Barabási!

Huh?? Is this the same Albert-László Barabási who said that scientific citations are worth $100,000 each? I guess he did some good stuff too? Maybe this is worthy of an agent-based model of its own.

Wow—those are some really bad referee reports!

Dale Lehman writes:

I missed this recent retraction but the whole episode looks worth your attention. First the story about the retraction.

Here are the referee reports and authors responses.

And, here is the author’s correspondence with the editors about retraction.

The subject of COVID vaccine safety (or lack thereof) is certainly important and intensely controversial. The study has some fairly remarkable claims (deaths due to the vaccines numbering in the hundreds of thousands). The peer reviews seem to be an exemplary case of your statement that “the problems with peer review are the peer reviewers.” The data and methodology used in the study seem highly suspect to me – but the author appears to respond to many challenges thoughtfully (even if I am not convinced) and raises questions about the editorial practices involved with the retraction.

Here are some more details on that retracted paper.

Note the ethics statement about no conflicts – doesn’t mention any of the people supposedly behind the Dynata organization. Also, I was surprised to find the paper and all documentation still available despite being retracted. It includes the survey instrument. From what I’ve seen, the worst aspect of this study is that it asked people if they knew people who had problems after receiving the vaccine – no causative link even being asked for. That seems like an unacceptable method for trying to infer deaths from the vaccine – and one that the referees should never have permitted.

The most amazing thing about all this was the review reports. From the second link above, we see that the article had two review reports. Here they are, in their entirety:

The first report is an absolute joke, so let’s just look at the second review. The author revised in response to that review by rewriting some things, then the paper was published. At no time were any substantive questions raised.

I also noticed this from the above-linked news article:

“The study found that those who knew someone who’d had a health problem from Covid were more likely to be vaccinated, while those who knew someone who’d experienced a health problem after being vaccinated were less likely to be vaccinated themselves.”

Here’s a more accurate way to write it:

“The study found that those who SAID THEY knew someone who’d had a health problem from Covid were more likely to SAY THEY WERE vaccinated, while those who SAID THEY knew someone who’d experienced a health problem after being vaccinated were less likely to SAY THEY WERE vaccinated themselves.”

Yes, this sort of thing arises with all survey responses, but I think the subjectivity of the response is much more of a concern here than in a simple opinion poll.

The news article, by Stephanie Lee, makes the substantive point clearly enough:

This methodology for calculating vaccine-induced deaths was rife with problems, observers noted, chiefly that Skidmore did not try to verify whether anyone counted in the death toll actually had been vaccinated, had died, or had died because of the vaccine.

Also this:

Steve Kirsch, a veteran tech entrepreneur who founded an anti-vaccine group, pointed out that the study had the ivory tower’s stamp of approval: It had been published in a peer-reviewed scientific journal and written by a professor at Michigan State University. . . .

In a sympathetic interview with Skidmore, Kirsch noted that the study had been peer-reviewed. “The journal picks the peer reviewers … so how can they complain?” he said.

Ultimately the responsibility for publishing a misleading article falls upon the article’s authors, not upon the journal. You can’t expect or demand careful reviews from volunteer reviewers, nor can you expect volunteer journal editors to carefully vet every paper they will publish. Yes, the peer reviews for the above-discussed paper were useless—actually worse than useless, in that they gave a stamp of approval to bad work—but you can’t really criticize the reviewers for “not doing their jobs,” given that reviewing is not their job—they’re doing it for free.

Anyway, it’s a good thing that the journal shared the review reports so we can see how useless they were.

Bloomberg News makes an embarrassing calibration error

Palko points to this amusing juxtaposition:

I was curious so I googled to find the original story, “Forecast for US Recession Within Year Hits 100% in Blow to Biden,” by Josh Wingrove, which begins:

A US recession is effectively certain in the next 12 months in new Bloomberg Economics model projections . . . The latest recession probability models by Bloomberg economists Anna Wong and Eliza Winger forecast a higher recession probability across all timeframes, with the 12-month estimate of a downturn by October 2023 hitting 100% . . .

I did some further googling but could not find any details of the model. All I could find was this:

With probabilities that jump around this much, you can expect calibration problems.

This is just a reminder that for something to be a probability, it’s not enough that it be a number between 0 and 1. Real-world probabilities don’t exist in isolation; they are ensnared in a web of interconnections. Recall our discussion from last year:

Justin asked:

Is p(aliens exist on Neptune that can rap battle) = .137 valid “probability” just because it satisfies mathematical axioms?

And Martha sagely replied:

“p(aliens exist on Neptune that can rap battle) = .137” in itself isn’t something that can satisfy the axioms of probability. The axioms of probability refer to a “system” of probabilities that are “coherent” in the sense of satisfying the axioms. So, for example, the two statements

“p(aliens exist on Neptune that can rap battle) = .137” and “p(aliens exist on Neptune) = .001”

are incompatible according to the axioms of probability, because the event “aliens exist on Neptune that can rap battle” is a sub-event of “aliens exist on Neptune”, so the larger event must (as a consequence of the axioms) have probability at least as large as the probability of the smaller event.

The general point is that a probability can only be understood as part of a larger joint distribution; see the second-to-last paragraph of the boxer/wrestler article. I think that confusion on this point has led to lots of general confusion about probability and its applications.
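To put Martha’s point in symbols (this is just a restatement of the argument above, nothing new), write A for “aliens exist on Neptune that can rap battle” and B for “aliens exist on Neptune”:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% A = ``aliens exist on Neptune that can rap battle''
% B = ``aliens exist on Neptune''
% A is a sub-event of B, so the axioms force monotonicity:
\[
  A \subseteq B \;\Longrightarrow\; \Pr(A) \le \Pr(B).
\]
% Hence assigning $\Pr(A) = 0.137$ together with $\Pr(B) = 0.001$ is
% incoherent: the smaller event has been given the larger probability.
\end{document}
```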

Beyond that, seeing this completely avoidable slip-up from Bloomberg gives us more respect for the careful analytics teams at other news outlets such as the Economist and Fivethirtyeight, both of which are far from perfect, but at least they’re aware that it would not make sense to forecast a 100% probability of recession in this sort of uncertain situation.

P.S. See here for another example of a Bloomberg article with a major quantitative screw-up. In this case the perpetrator was not the Bloomberg in-house economics forecasting team, it was a Bloomberg Opinion columnist who is described as “a former editorial director of Harvard Business Review,” which at first kinda sounds like he’s an economist at the Harvard business school, but I guess what it really means is that he’s a journalist without strong quantitative skills.

What is “public opinion”?

Gur Huberman writes:

The following questions crossed my mind as I read this news article.

1. Is there an observable corresponding to the term “public opinion?”

2. From the article,

Polls in Russia, or any other authoritarian country, are an imprecise measure of opinion because respondents will often tell pollsters what they think the government wants to hear. Pollsters often ask questions indirectly to try to elicit more honest responses, but they remain difficult to accurately gauge.

What is the information conveyed by the adjective “authoritarian?”

My reply: Yeah, there’s a Heisenberg uncertainty thing going on, where the act of measurement (here, asking the survey question) affects the response. In rough analogy to physical quantum uncertainty, you can try to ask the question in a very non-invasive way and then get a very noisy response, or you can ask the question more invasively, in which case you can get a stable response but it can be far from the respondent’s initial state.

To put it another way: yes, it can be hard to get an accurate survey response if respondents are afraid of saying the wrong thing; also, there’s not always an underlying opinion to be measured. It depends on the item: sometimes there’s a clearly defined true answer or well-defined opinion, sometimes not.

In this case, the news article reports on a company that “tries to address this shortcoming by constantly gathering data from small local internet forums, social media companies and messaging apps to determine public sentiment.” I’m not sure how “public sentiment” is defined here; they seem to be measuring something, but I’m not quite sure what it is.

This happens a lot in social science, and science in general. We have something of importance that we’d like to measure but is not clearly defined; we then measure something related to it, and then we have to think about what we’ve learned. Another example is legislative redistricting, where researchers have come up with various computational procedures to simulate randomly drawn districts, but without there being any underlying model for such a distribution. And I guess there are lots of examples in biomedicine where researchers work with some measure of general health without the latent concept ever being defined.

Moving from “admit your mistakes” to accepting the inevitability of mistakes and their fractal nature

The other day we talked about checking survey representativeness by looking at canary variables:

Like the canary in the coal mine, a canary variable is something with a known distribution that was not adjusted for in your model. Looking at the estimated distribution of the canary variable, and then comparing it to external knowledge, is a way of checking your sampling procedure. It’s not an infallible check—your sample, or your adjusted sample, can be representative for one variable but not another—but it’s something you can do.
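To make that concrete, here is a minimal sketch of what such a check might look like in code. Everything in it is hypothetical: the variable names, the simulated weights, and the benchmark value are made-up stand-ins for illustration, not taken from any real survey or from the post quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical adjusted survey sample: one row per respondent, with a
# post-stratification weight already attached and a "canary" variable
# (here, home ownership) that was NOT used in the adjustment.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "weight": rng.uniform(0.5, 2.0, size=1000),
    "owns_home": rng.binomial(1, 0.58, size=1000),  # canary variable
})

# Known population value for the canary variable (external knowledge,
# e.g. from a census); the number here is invented for illustration.
benchmark_owns_home = 0.65

# Weighted estimate of the canary variable in the adjusted sample.
est = np.average(sample["owns_home"], weights=sample["weight"])

# Rough standard error using the Kish effective sample size; a real
# analysis would use the survey's full design information.
w = sample["weight"] / sample["weight"].sum()
n_eff = 1.0 / np.sum(w ** 2)
se = np.sqrt(est * (1 - est) / n_eff)

print(f"weighted estimate: {est:.3f}, benchmark: {benchmark_owns_home:.3f}, "
      f"approx z = {(est - benchmark_owns_home) / se:.1f}")
# A discrepancy that is large relative to the uncertainty is the canary
# keeling over: it suggests the sample or the adjustment is off for this
# variable, and possibly for the outcomes you care about as well.
```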

Then I noticed another reference, from 2014:

What you’d want to do [when you see a problem] is not just say, Hey, mistakes happen! but rather to treat these errors as information, as model checks, as canaries in the coal mine and use them to improve your procedure. Sort of like what I did when someone pointed out problems in my election maps.

Canaries all around us

When you notice a mistake, something that seemed to fit your understanding but turned out to be wrong, don’t memory-hole it; engage with it. I get soooo frustrated with David Brooks, or the Nudgelords (further explanation here), or the Freakonomics team, or, at a more technical level, the Fivethirtyeight team, when they don’t wrestle with their mistakes.

Dudes! A mistake is a golden opportunity, a chance to learn. You don’t get these every day—or maybe you do! To throw away such opportunities . . . it’s like leaving the proverbial $20 bill on the table.

When Matthew Walker or Malcolm Gladwell get caught out on their errors and they bob and weave and avoid confronting the problem, then I don’t get frustrated in the same way. Their entire brand is based on simplifying the evidence. Similarly with Brian Wansink: there was no there there. If he were to admit error, there’d be nothing left.

But David Brooks, Nudge, Freakonomics, Fivethirtyeight . . . they’re all about explanation, understanding, and synthesis. Sure, it would be a short-term hit to their reputations to admit they got fooled by bad statistical analyses (on the topics of Jews, lunch, beauty, and correlated forecasts, respectively) that happened to align with their ideological or intellectual preconceptions, but longer-term, they could do so much better. C’mon, guys! There’s more to life than celebrity, isn’t there? Try to remember what got you interested in writing about social science in the first place.

Moving from “admit your mistakes” to accepting the inevitability of mistakes and their fractal nature

I wonder whether part of this is the implicit dichotomy of “admit when you’re wrong.” We’re all wrong all the time, but when we frame “being wrong” as something that stands out, something that needs to be admitted, maybe that makes it easier for us to miss all the micro-errors that we make. If we could get in the habit of recognizing all the mistakes we make every day, all the false starts and blind alleys and wild goose chases that are absolutely necessary in any field of inquiry, then maybe it would be less of a big deal to face up to mistakes we make that are pointed out to us by others.

Mistakes are routine. We should be able to admit them forthrightly without even needing to swallow hard and face up to them, as it were. For example, Nate Silver recently wrote, “The perfect world is one in which the media is both more willing to admit mistakes—and properly frame provisional reporting as provisional and uncertain—and the public is more tolerant of mistakes. We’re not living that world.” Which I agree with, and it applies to Nate too. Maybe we need to go even one step further and not think of a mistake as something that needs to be “admitted,” but just something that happens when we are working on complicated problems, whether they be problems of straight-up journalism (with reports coming from different sources), statistical modeling (relying on assumptions that are inevitably wrong in various ways), or assessment of evidence more generally (at some point you end up with pieces of information that are pointing in different directions).

“His continued strength in the party is mostly the result of Republican elites’ reluctance to challenge him, which is a mixture of genuine support and exaggerated ideas about his strength among Republican voters.”

David Weakliem, who should have a regular column at the New York Times (David Brooks and Paul Krugman could use the occasional break, no?), writes:

A couple of months ago, some people were saying that Donald Trump’s favorability ratings rose every time he was indicted. . . . Closer examination has shown that this isn’t true, that his favorability ratings actually declined slightly after the indictments. But at the time, it occurred to me that the degree of favorability might be more subject to change—shifting from “strongly favorable” to “somewhat favorable” is easier than shifting from favorable to unfavorable—and that the degree of favorability will matter in the race for the nomination. On searching, I [Weakliem] found there aren’t many questions that ask for degree of favorability, and that breakdowns by party weren’t available for most of them. However, the search wasn’t useless, because it reminded me of the American National Election Studies “feeling thermometers” for presidential candidates, which ask people to rate the candidates on a scale of zero to 100. Here is the percent rating the major party candidates at zero:

With the exception of George McGovern in 1972, everyone was below 10% until 2004, when 13% rated GW Bush at zero. In 2008, things were back to normal, with both Obama and McCain at around 7%, but starting in 2012, zero ratings increased sharply.

The next figure shows the percent rating each candidate at 100.

There is a lot of variation from one election to the next, but no trend. In 2016, 6.4% rated Trump at 100, which is a little lower than average (and the same as Hillary Clinton). He rose to 15.4% in 2020, which is the second highest ever, just behind Richard Nixon in 1972. But several others have been close, most recently Obama in 2012 and Bush in 2004, and it’s not unusual for presidents to have a large increase in their first term (GW Bush, Clinton, and Reagan had similar gains).

Weakliem concludes:

That is, Trump doesn’t seem to have an exceptionally large number of enthusiastic supporters among the public . . . I think his continued strength in the party is mostly the result of Republican elites’ reluctance to challenge him, which is a mixture of genuine support and exaggerated ideas about his strength among Republican voters.

This is interesting in that it goes against the usual story, which is that Republican elites keep trying to get rid of Trump but Republican voters won’t let them. I think the resolution here is that many Republican elites presumably can’t stand Trump and are trying to get rid of him behind the scenes, but publicly they continue to offer him strong support. Without the support of Republican elites, I think that Trump would have a lot less support among Republican voters. But even a Trump with less support could still do a lot of damage to the party, a point that Palko made back in 2015. This has been the political dynamic for years, all the way since the beginning of Trump’s presidency: He needed, and needs, the support of the Republican elites to have a chance of being competitive in any two-party election or to do anything at all as president; the Republican elites needed, and need, Trump to stay on their side. The implicit bargain was, and is, that the Republican elites support Trump electorally and he supports Republican elites on policies that are important to them. The January 6 insurrection fell into the electoral-support category.

Thinking about this from a political science perspective, what’s relevant is that, even though we’re talking about elections and public opinion, you can’t fully understand the situation by only looking at elections and public opinion. You also have to consider some basic game theory or internal politics.

Weakliem also links to a post from 2016 where he asks, “does Trump have an unusually enthusiastic ‘base’?” and, after looking at some poll data, concludes that no, he doesn’t. Rather, Weakliem writes, “what is rising is not enthusiastic support for one’s own side, but strong dislike or fear of the other side.”

This seems consistent with what we know about partisan polarization in this country. The desire for a strong leader comes with the idea that this is what is necessary to stop the other side.

Academia corner: New candidate for American Statistical Association’s Founders Award, Enduring Contribution Award from the American Political Science Association, and Edge Foundation just dropped

Bethan Staton and Chris Cook write:

A Cambridge university professor who copied parts of an undergraduate’s essays and published them as his own work will remain in his job, despite an investigation upholding a complaint that he had committed plagiarism. 

Dr William O’Reilly, an associate professor in early modern history, submitted a paper that was published in the Journal of Austrian-American History in 2018. However, large sections of the work had been copied from essays by one of his undergraduate students.

The decision to leave O’Reilly in post casts doubt on the internal disciplinary processes of Cambridge, which rely on academics judging their peers.

Dude’s not a statistician, but I think this alone should be enough to make him a strong candidate for the American Statistical Association’s Founders Award.

And, early modern history is not quite the same thing as political science, but the copying thing should definitely make him eligible for the Aaron Wildavsky Enduring Contribution Award from the American Political Science Association. Long after all our research has been forgotten, the robots of the 21st century will be able to sift through the internet archive and find this guy’s story.

Or . . . what about the Edge Foundation? Plagiarism isn’t quite the same thing as misrepresenting your data, but it’s close enough that I think this guy would have a shot at joining that elite club. I’ve heard they no longer give out flights to private Caribbean islands, but I’m sure there are some lesser perks available.

According to the news article:

Documents seen by the Financial Times, including two essays submitted by the third-year student, show nearly half of the pages of O’Reilly’s published article — entitled “Fredrick Jackson Turner’s Frontier Thesis, Orientalism, and the Austrian Militärgrenze” — had been plagiarised.

Jeez, some people are so picky! Only half the pages were plagiarized, right? Or maybe not? Maybe this prof did a “Quentin Rowan” and constructed his entire article based on unacknowledged copying from other sources. As Rowan said:

It felt very much like putting an elaborate puzzle together. Every new passage added has its own peculiar set of edges that had to find a way in.

I guess that’s how it felt when they were making maps of the Habsburg empire.

On the plus side, reading about this story motivated me to take a look at the Journal of Austrian-American History, and there I found this cool article by Thomas Riegler, “The Spy Story Behind The Third Man.” That’s one of my favorite movies! I don’t know how watchable it would be to a modern audience—the story might seem a bit too simplistic—but I loved it.

P.S. I laugh but only because that’s more pleasant than crying. Just to be clear: the upsetting thing is not that some sleazeball managed to climb halfway up the greasy pole of academia by cheating. Lots of students cheat, some of these students become professors, etc. The upsetting thing is that the organization closed ranks to defend him. We’ve seen this sort of thing before, over and over—for example, Columbia never seemed to make any effort whatsoever to track down whoever was faking its U.S. News numbers—so this behavior by Cambridge University doesn’t surprise me, but it still makes me sad. I’m guessing it’s some combination of (a) the perp is plugged in, the people who make the decisions are his personal friends, and (b) a decision that the negative publicity for letting this guy stay on at his job is not as bad as the negative publicity for firing him.

Can you imagine what it would be like to work in the same department as this guy?? Fun conversations at the water cooler, I guess. “Whassup with the Austrian Militärgrenze, dude?”

Meanwhile . . .

There are people who actually do their own research, and they’re probably good teachers too, but they didn’t get that Cambridge job. It’s hard to compete with an academic cheater, if the institution he’s working for seems to act as if cheating is just fine, and if professional societies such as the American Statistical Association and the American Political Science Association don’t seem to care either.