
Gendered languages and women’s workforce participation rates

Rajesh Venkatachalapathy writes:

I recently came across a world bank document claiming that gendered languages reduce women’s labor force participation rates. It is summarized in the following press release: Gendered Languages May Play a Role in Limiting Women’s Opportunities, New Research Finds.

This sounds a lot like the piranha problem, if there is any effect at all.

I [Venkatachalapathy] am disturbed by claims of large effects in their study. Their work seems to rely conceptually on the Sapir-Whorf hypothesis in linguistics, which is also quite controversial on its own. I am curious to know what your take is on this report.

He continues:

The cognitive science behind Sapir-Whorf, and the related field of embodied cognition in general, is quite controversial; it appeals to so many people, yet has very weak evidence (see, for example, the recent book by McWhorter). This paper seems to magnify this to say something so strong about macroeconomic labor market demographic indicators. I cannot avoid comparisons with Pinker’s hypothesis in his most recent book that enlightenment thought and the secular humanistic principles derived from it have been among the primary drivers of the civilizing process, of the Norbert Elias kind or the Pinker kind.

I am not claiming that such macro-level claims can never be justified. For example, I just began reading your academic colleague, economist Suresh Naidu’s recent paper on how democratization in countries causes economic growth. From the looks of it, they seem to have worked hard at establishing their main hypothesis. Maybe their [Naidu or his collaborators] approach might provide us with additional insight on whether the causal claims of the paper on gendered language and workforce participation are reasonable and defensible with existing data, and with their [the paper’s] data analysis approach. I just find it difficult to imagine how a psychologically weak effect can suddenly become magnified when scaled to the level of large-scale societies.

After having trained hard to be skeptical of all causal claims over the years, I see what I feel is an epidemic of causal claims popping up in the literature and I find it hard to believe them all, especially given the fact that progress in philosophical causality and causal inference has been only incremental.

My response: I agree that such claims from observational data in cross-country and cross-cultural comparisons can be artifactual, and languages are correlated with all sorts of things. I don’t know enough about the topic to say more.

A rise in premature publications among politically engaged researchers may be linked to Trump’s election, study says

A couple people pointed me to this news story, “A rise in premature births among Latina women may be linked to Trump’s election, study says,” and the associated JAMA article, which begins:

Question Did preterm births increase among Latina women who were pregnant during the 2016 US presidential election?

Findings This population-based study used an interrupted time series design to assess 32.9 million live births and found that the number of preterm births among Latina women increased above expected levels after the election.

Meaning The 2016 presidential election may have been associated with adverse health outcomes of Latina women and their newborns.

Hmmm, the research article says “may have been associated” but then ups that to “appears to have been associated.”

On one hand, I find it admirable that JAMA will publish a paper with such an uncertain conclusion. On the other hand, the conclusions got stronger once they made their way into news reports. In the above-linked article, “may have been associated” becomes “an association was found” and then “We think there are very few alternative explanations for these results.”

There’s also a selection issue. It’s fine to report maybes, but then why this particular maybe? There are lots and lots of associations that may be happening, right?

Let’s look at the data

In any case, they did an interrupted time series analysis, so let’s see the time series:

I don’t think the paper’s claim, “In the 9-month period beginning with November 2016, an additional 1342 male (95% CI, 795-1889) and 995 female (95% CI, 554-1436) preterm births to Latina women were found above the expected number of preterm births had the election not occurred,” is at all well supported by these data. But you can make your own judgement here.

Also, I’m surprised they are analyzing raw numbers of pre-term births rather than rates.
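That last point is easy to demonstrate with a small simulation (all numbers here are invented for illustration): if the number of births to the group grows over time while the underlying preterm-birth risk stays exactly constant, a trend fit to raw counts finds an "increase" that vanishes once you analyze rates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented monthly data: the number of births grows over time,
# while the underlying preterm-birth risk stays exactly constant.
months = np.arange(48)
births = np.linspace(90_000, 110_000, months.size).astype(int)
p_preterm = 0.095                      # constant risk: no "effect" at all
preterm = rng.binomial(births, p_preterm)

# A linear trend fit to raw counts finds a clear increase ...
count_slope = np.polyfit(months, preterm, 1)[0]

# ... but the rate is flat; the "trend" was all in the denominator.
rate_slope = np.polyfit(months, preterm / births, 1)[0]

print(f"slope of counts per month: {count_slope:.1f}")   # clearly positive
print(f"slope of rate per month:   {rate_slope:.1e}")    # roughly zero
```

This doesn't tell us what happened in the actual birth data, of course; it just shows why an analysis of counts needs an explicit model for the denominator.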

In general

Medical journals sometimes seem to show poor judgment when it comes to stories that agree with their political prejudices. See for example here and here.

Look. Don’t get me wrong. This topic is important. We’d like to minimize preterm births, and graphs such as shown above (ideally using rates, not counts, I think) should be a key part of a monitoring system that will allow us to notice problems. It should be possible to look at such time series without pulling out one factor and wrapping this sort of story around it. I think this is a problem with scientific publication, that journals and the news media want to publish big splashy claims.

“The most mysterious star in the galaxy”

Charles Margossian writes:

The reading for tomorrow’s class reminded me of a project I worked on as an undergraduate: the Planet Hunters initiative. The project shows light-curves to participants and asks them to find transit signals (i.e., evidence of a transiting planet). The idea was to rely on human pattern recognition capabilities to find planets missed by NASA’s algorithms—and it worked! The first publication I was involved in was on the discovery of such a planet.

But even better: users found a star with a very strange light-curve, which had been dismissed as a false-positive signal by the algorithm. Upon inspection it turned out that… we had no idea what was going on. A paper, falsifying a bunch of hypotheses, was published. It was cool to see a popular paper about us not knowing. The star was called Tabby’s star (after the astronomer who investigated it), and deemed “the most mysterious star in the galaxy.”

So there—an example of using graphs to do research and, what’s more, make it accessible to the public.

Pre-results review: Some results

Aleks Bogdanoski writes:

I’m writing from the Berkeley Initiative for Transparency in the Social Sciences (BITSS) at UC Berkeley with news about pre-results review, a novel form of peer review where journals review (and accept) research papers based on their methods and theory — before any results are known. Pre-results review is motivated by growing concerns about reproducibility in science, including results-based biases in the ways research is reviewed in academic journals.
Over the past year, BITSS has been working with the Journal of Development Economics to pilot this form of peer review, and we recently shared some of the lessons we learned through a post on the World Bank’s Development Impact blog. In a nutshell, pre-results review has helped authors improve the methodological quality of their work and provided an opportunity for earlier recognition – a particularly important incentive for early-career researchers. For editors and reviewers, pre-results review has been a useful commitment device for preventing results-based publication bias.
I’m attaching a press release that details the story in full, and here you can learn more about our Pre-results Review collaboration with the JDE.

I don’t have time to look at this right now, but I’m forwarding it because it seems like the kind of thing that might interest many of you.

Pre-results review could solve the Daryl Bem problem

I will lay out one issue that’s bugged me for a while regarding results-blind reviewing, which is what we could call the Daryl Bem problem.

It goes like this. Some hypothetical researcher designs an elaborate study conducted at a northeastern university noted for its p-hacking, and the purpose of the study is to demonstrate (or, let’s say, test for) the existence of extra-sensory perception (ESP).

Suppose the Journal of Personality and Social Psychology was using pre-results review. Should they accept this hypothetical study?

Based on the above description from BITSS, this accept/reject decision should come down to the paper’s “methods and theory.” OK, the methods for this hypothetical paper could be fine, but there’s no theory.

So I think that, under this regime, JPSP would reject the paper. Which seems fair enough. If they did accept this paper just because of its method (preregistration, whatever), they’d open the floodgates to accepting every damn double-blind submission anyone sent them. Perpetual motion machines, spoon bending, ovulation and voting, power pose, beauty and sex ratio, you name it. It would be kinda fun for a while, becoming the de facto Journal of Null Results—indeed, this could do a great service to some areas of science—but I don’t think that’s why anyone wants to become a journal editor, just to publish null results.

OK, fine. But here’s the problem. Suppose this carefully-designed experiment is actually done, and it shows positive results. In that case they really have made a great discovery, and the result really should be publishable.

At this point you might say that you don’t believe it until an outside lab does a preregistered replication. That makes sense.

But, at this point, results-blind review comes to the rescue! That first Bem study should not be accepted because it has no theoretical justification. But the second study, by the outside laboratory . . . its authors could make the argument that the earlier successful study gives enough of a theoretical justification for pre-results acceptance.

So, just to be clear here: to get an ESP paper published under this new regime, you’d need to have two clean, pre-registered studies. The first study would not be results-blind publishable on its own (of course, it could still be published in Science, Nature, PNAS, Psychological Science, or some other results-focused journal), but it would justify the second study being published in results-blind form.

You really need 2 papers from 2 different labs, though. For example, the existing Bem (2011) paper, hyper p-hacked as it is, cannot in any reasonable way serve as a theoretical or empirical support for an ESP study.

I guess this suggests a slight modification of the above BITSS guidelines, that they change “methods and theory” to “methods and theory or strong empirical evidence.”

Methodology is important, but methodology is not just causal identification and sample size and preregistration

In any case, my key point here is that we need to take seriously these concerns regarding theory and evidence. Methodology is important, but methodology is not just causal identification and sample size and preregistration: it’s also measurement and connection to existing knowledge. In empirical social science in particular, we have to avoid privileging ill-founded ideas that happen to be attached to cute experiments or identification strategies.

What does it take to repeat them?

Olimpiu Urcan writes:

Making mistakes is human, but it takes a superhuman dose of ego and ignorance to repeat them after you’ve been publicly admonished about them.

Not superhuman at all, unfortunately. We see it all the time. All. The. Time.

I’m reminded of the very first time I contacted newspaper columnist David Brooks to point out one of his published errors. I honestly thought he’d issue a correction. But no, he just dodged it. Dude couldn’t handle the idea that he might have ever been wrong.

Similarly with those people who publish all those goofy research claims. Very rarely do they seem to be able to admit they made a mistake. I’m not talking about fraud or scientific misconduct here, just admitting an honest mistake of the sort that can happen to any of us. Nope. On the rare occasion when a scientist does admit a mistake, it’s cause for celebration.

So, no. Unfortunately I disagree with Urcan that repeating mistakes is anything superhuman. Repeating mistakes is standard operating practice, and it goes right along with never wanting to accept that an error was made in the first place.

This bit from Urcan I do agree with, though:

For plagiarists, scammers and utter incompetents to thrive, they seek enablers with the same desperation and urgency leeches seek hemoglobin banks.

Well put. And these enablers are all over the place. Some people even seem to make a career of it. I can see why they do it. If you help a scammer, he might help you in return. And you get to feel like a nice person, too. As long as you don’t think too hard about the people wasting their time reading the scammer’s products.

Blindfold play and sleepless nights

In Edward Winter’s Chess Explorations there is the following delightful quote from the memoirs of chess player William Winter:

Blindfold play I have never attempted seriously. I once played six, but spent so many sleepless nights trying to drive the positions out of my head that I gave it up.

I love that. We think of the difficulty as being in the remembering, but maybe it is the forgetting that is the challenge. I’m reminded of a lecture I saw by Richard Feynman at Bell Labs: He was talking about the theoretical challenges of quantum computing, and he identified the crucial entropy-producing step as that of zeroing the machine, i.e. forgetting.

Update on keeping Mechanical Turk responses trustworthy

This topic has come up before . . . Now there’s a new paper by Douglas Ahler, Carolyn Roush, and Gaurav Sood, who write:

Amazon’s Mechanical Turk has rejuvenated the social sciences, dramatically reducing the cost and inconvenience of collecting original data. Recently, however, researchers have raised concerns about the presence of “non-respondents” (bots) or non-serious respondents on the platform. Spurred by these concerns, we fielded an original survey on MTurk to measure response quality. While we find no evidence of a “bot epidemic,” we do find that a significant portion of survey respondents engaged in suspicious behavior. About 20% of respondents either circumvented location requirements or took the survey multiple times. In addition, at least 5-7% of participants likely engaged in “trolling” or satisficing. Altogether, we find about a quarter of data collected on MTurk is potentially untrustworthy. Expectedly, we find response quality impacts experimental treatments. On average, low quality responses attenuate treatment effects by approximately 9%. We conclude by providing recommendations for collecting data on MTurk.

And here are the promised recommendations:

• Use geolocation filters on survey platforms like Qualtrics to enforce any geographic restrictions.

• Make use of tools on survey platforms to retrieve IP addresses. Run each IP through Know Your IP to identify blacklisted IPs and multiple responses originating from the same IP.

• Include questions to detect trolling and satisficing, but do not copy and paste from a standard canon, as that makes “gaming the survey” easier.

• Increase the time between Human intelligence task (HIT) completion and auto-approval so that you can assess your data for untrustworthy responses before approving or rejecting the HIT.

• Rather than withhold payments, a better policy may be to incentivize workers by giving them a bonus when their responses pass quality filters.

• Be mindful of compensation rates. While unusually stingy wages will lead to slow data collection times and potentially less effort by Workers, unusually high wages may give rise to adverse selection—especially because HITs are shared on Turkopticon, etc. soon after posting. . . Social scientists who conduct research on MTurk should stay apprised of the current “fair wage” on MTurk and adhere accordingly.

• Use Worker qualifications on MTurk and filter to include only Workers who have a high percentage of approved HITs into your sample.
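The attenuation result is easy to see in a toy simulation (the numbers below are made up; the mechanism, not the paper's 9% figure, is the point). If a quarter of respondents answer at random in both arms, they dilute the estimated treatment effect by roughly that quarter:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_arm(treated, n, p_bad, effect, rng):
    """Outcomes for one experimental arm with a share of inattentive takers."""
    bad = rng.random(n) < p_bad          # low-quality respondents
    y = rng.normal(0.0, 1.0, n)          # baseline noise for everyone
    if treated:
        y[~bad] += effect                # only attentive respondents react
    return y                             # bad respondents add noise, no signal

n, p_bad, true_effect = 200_000, 0.25, 1.0
estimate = (simulate_arm(True, n, p_bad, true_effect, rng).mean()
            - simulate_arm(False, n, p_bad, true_effect, rng).mean())

print(f"true effect {true_effect}, estimated {estimate:.3f}")  # roughly 0.75
```

Under this crude model the estimate shrinks to about 75% of the truth; the paper's milder 9% figure presumably reflects that many low-quality responses still carry some signal.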

They also say they do not think that the problem is limited to MTurk.

I haven’t tried to evaluate all these claims myself, but I thought I’d share it all with those of you who are using this tool in your research. (Or maybe some of you are MTurk bots; who knows what will be the effect of posting this material here.)

P.S. Sood adds:

From my end, “random” error is mostly a non-issue in this context. People don’t use MTurk to produce generalizable estimates—hardly anyone post-stratifies, for instance. Most people use it to say they did something. I suppose it is a good way to ‘fail fast.’ (The downside is that most failures probably don’t see the light of day.) And if people wanted to buy stat. sig., bulking up on n is easily and cheaply done — it is the raison d’être of MTurk in some ways.

So what is the point of the article? Twofold, perhaps. First, it is good to parcel out measurement error where we can. Second, how do we build a system where the long-term prognosis is not simply noise? And what stuck out for me from the data was just the sheer scale of plausibly cheeky behavior. I did not anticipate that.

Voter turnout and vote choice of evangelical Christians

Mark Palko writes, “Have you seen this?”, referring to this link to this graph:

I responded: Just one of those things, I think.

Palko replied:

Just to be clear, I am more than willing to believe the central point about the share of the population dropping while the share of the electorate holds relatively steady, but having dealt with more than my share of bad data, I get really nervous when I see a number like that hold absolutely steady.

My response:

I did a quick check of some of those 26% numbers online and they seem to be from actual exit polls. Them all being equal just seems like a coincidence. The part of the graph I really don’t believe is the sharp decline in % evangelical Christian. I’m guessing that the survey question on the exit polls is different than the survey question on whatever poll they’re using to estimate % evangelical.
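To put a rough number on "coincidence," here is a quick simulation. Both inputs are my own assumptions, not taken from the actual polls: suppose the evangelical share of the electorate is truly constant at 26%, and each election's exit poll measures it with a subsample of about 3,000 voters. Then the chance that five successive polls all round to exactly 26% is small but not astronomical:

```python
import numpy as np

rng = np.random.default_rng(2)

true_share, n_voters = 0.26, 3_000     # assumed constants, for illustration
n_elections, n_sims = 5, 100_000

counts = rng.binomial(n_voters, true_share, size=(n_sims, n_elections))
rounded = np.round(100 * counts / n_voters)
all_26 = (rounded == 26).all(axis=1)

print(f"P(all {n_elections} polls round to 26%): about {all_26.mean():.3f}")
```

A couple percent: unlikely for any one statistic, but exit polls report many subgroups every cycle, so some such streak will show up somewhere. That's the selection issue again.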

And, since I have you on the line, here are some graphs from chapter 6 of Red State Blue State:

It seems that religious affiliation is becoming more of a political thing. Or maybe political affiliation is becoming more of a religious thing.

In any case, be careful about comparing time trends of survey questions that are asked in different ways.

Endless citations to already-retracted articles

Ken Cor and Gaurav Sood write:

Many claims in a scientific article rest on research done by others. But when the claims are based on flawed research, scientific articles potentially spread misinformation. To shed light on how often scientists base their claims on problematic research, we exploit data on cases where problems with research are broadly publicized. Using data from over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31.2% of the citations to retracted articles happen a year after they have been retracted. And that 91.4% of the post-retraction citations are approving—noting no concern with the cited article. We augment the analysis with data from an article published in Nature Neuroscience highlighting a serious statistical error in articles published in prominent journals. Data suggest that problematic research was approvingly cited more frequently after the problem was publicized [emphasis added]. Our results have implications for the design of scholarship discovery systems and scientific practice more generally.

I think that by “31.2%” and “91.4%” they mean 30% and 90% . . . but, setting aside this brief lapse in taste or numeracy, their message is important.

P.S. In case you’re wondering why I’d round those numbers: I just don’t think those last digits are conveying any real information. To put it another way, in any sort of replication, I’d expect to see numbers that differ by at least a few percentage points. Reporting as 30% and 90% seems to me to capture what they found without adding meaningless precision.

Gigerenzer: “The Bias Bias in Behavioral Economics,” including discussion of political implications

Gerd Gigerenzer writes:

Behavioral economics began with the intention of eliminating the psychological blind spot in rational choice theory and ended up portraying psychology as the study of irrationality. In its portrayal, people have systematic cognitive biases that are not only as persistent as visual illusions but also costly in real life—meaning that governmental paternalism is called upon to steer people with the help of “nudges.” These biases have since attained the status of truisms. In contrast, I show that such a view of human nature is tainted by a “bias bias,” the tendency to spot biases even when there are none. This may occur by failing to notice when small sample statistics differ from large sample statistics, mistaking people’s random error for systematic error, or confusing intelligent inferences with logical errors. Unknown to most economists, much of psychological research reveals a different portrayal, where people appear to have largely fine-tuned intuitions about chance, frequency, and framing. A systematic review of the literature shows little evidence that the alleged biases are potentially costly in terms of less health, wealth, or happiness. Getting rid of the bias bias is a precondition for psychology to play a positive role in economics.

Like others, Gigerenzer draws the connection to visual illusions, but with a twist:

By way of suggestion, articles and books introduce biases together with images of visual illusions, implying that biases (often called “cognitive illusions”) are equally stable and inevitable. If our cognitive system makes such big blunders like our visual system, what can you expect from everyday and business decisions? Yet this analogy is misleading, and in two respects.

First, visual illusions are not a sign of irrationality, but a byproduct of an intelligent brain that makes “unconscious inferences”—a term coined by Hermann von Helmholtz—from two-dimensional retinal images to a three-dimensional world. . . .

Second, the analogy with visual illusions suggests that people cannot learn, specifically that education in statistical reasoning is of little efficacy (Bond, 2009). This is incorrect . . .

It’s an interesting paper. Gigerenzer goes through a series of classic examples of cognitive errors, including the use of base rates in conditional probability, perceptions of patterns in short sequences, the hot hand, bias in estimates of risks, systematic errors in almanac questions, the Lake Wobegon effect, and framing effects.
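The hot-hand entry is a good illustration of how subtle these questions are. As Miller and Sanjurjo showed, the proportion of heads following a head, averaged over short sequences, is below 1/2 even for a fair coin, so the classic "no hot hand" analyses that treated 1/2 as the null benchmark were biased. Exact enumeration for sequences of four flips:

```python
from itertools import product

# For each sequence of 4 fair-coin flips, take the flips that follow a
# head and compute the share of them that are heads; then average that
# share over all sequences where it is defined (a head in flips 1-3).
props = []
for seq in product([0, 1], repeat=4):
    follows = [seq[i + 1] for i in range(3) if seq[i] == 1]
    if follows:
        props.append(sum(follows) / len(follows))

avg = sum(props) / len(props)
print(avg)  # 17/42, about 0.405, not 0.5
```

So an empirical "P(head after head)" of 0.40 in short sequences is exactly what a memoryless coin produces, which is the kind of point Gigerenzer is making about mistaking sampling artifacts for cognitive biases.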

I’m a sucker for this sort of thing. It might be that at some points Gigerenzer is overstating his case, but he makes a lot of good points.

Some big themes

In his article, Gigerenzer raises three other issues that I’ve been thinking about a lot lately:

1. Overcertainty in the reception and presentation of scientific results.

2. Claims that people are stupid.

3. The political implications of claims that people are stupid.

Overcertainty and the problem of trust

Gigerenzer writes:

The irrationality argument exists in many versions (e.g. Conley, 2013; Kahneman, 2011). Not only has it come to define behavioral economics but it also has defined how most economists view psychology: Psychology is about biases, and psychology has nothing to say about reasonable behavior.

Few economists appear to be aware that the bias message is not representative of psychology or cognitive science in general. For instance, loss aversion is often presented as a truism; in contrast, a review of the literature concluded that the “evidence does not support that losses, on balance, tend to be any more impactful than gains” (Gal and Rucker, 2018). Research outside the heuristics-and-biases program that does not confirm this message—including most of the psychological research described in this article—is rarely cited in the behavioral economics literature (Gigerenzer, 2015).

(We discussed Gal and Rucker (2018) here.)

More generally, this makes me think of the problem of trust that Kaiser Fung and I noted in the Freakonomics franchise. There’s so much published research out there, indeed so much publicized research, that it’s hard to know where to start, so a natural strategy for sifting through and understanding it all is using networks of trust. You trust your friends and colleagues, they trust their friends and colleagues, and so on. But you can see how this can lead to economists getting a distorted view of the content of psychology and cognitive science.

Claims that people are stupid

The best of the heuristics and biases research is fascinating, important stuff that has changed my life and gives us, ultimately, a deeper respect for ourselves as reasoning beings. But, as Gigerenzer points out, this same research is often misinterpreted as suggesting that people are easily-manipulable (or easily-nudged) fools, and this fits in with lots of junk science claims of the same sort: pizzagate-style claims that the amount you eat can be manipulated by the size of your dining tray, goofy poli-sci claims that a woman’s vote depends on the time of the month, air rage, himmicanes, shark attacks, ages-ending-in-9, and all the rest. This is an attitude which I can understand might be popular among certain marketers, political consultants, and editors of the Proceedings of the National Academy of Sciences, but I don’t buy it, partly because of zillions of errors in the published studies in question and also because of the piranha principle. Again, what’s important here is not just the claim that people make mistakes, but that they can be consistently manipulated using what would seem to be irrelevant stimuli.

Political implications

As usual, let me emphasize that if these claims were true—if it were really possible to massively and predictably change people’s attitudes on immigration by flashing a subliminal smiley face on a computer screen—then we’d want to know it.

If the claims don’t pan out, then they’re not so interesting, except inasmuch as: (a) it’s interesting that smart people believed these things, and (b) we care if resources are thrown at these ideas. For (b), I’m not just talking about NSF funds etc., I’m also talking about policy money (remember, pizzagate dude got appointed to a U.S. government position at one point to implement his ideas) and just a general approach toward policymaking, things like nudging without persuasion, nudges that violate the Golden Rule, and of course nudges that don’t work.

There’s also a way in which a focus on individual irrationality can be used to discredit or shift blame onto the public. For example, Gigerenzer writes:

Nicotine addiction and obesity have been attributed to people’s myopia and probability-blindness, not to the actions of the food and tobacco industry. Similarly, an article by the Deutsche Bank Research “Homo economicus – or more like Homer Simpson?” attributed the financial crisis to a list of 17 cognitive biases rather than the reckless practices and excessive fragility of banks and the financial system (Schneider, 2010).

Indeed, social scientists used to talk about the purported irrationality of voting (for our counter-argument, see here). If voters are irrational, then we shouldn’t take their votes seriously.

I prefer Gigerenzer’s framing:

The alternative to paternalism is to invest in citizens so that they can reach their own goals rather than be herded like sheep.

Healthier kids: Using Stan to get more information out of pediatric respiratory data

Robert Mahar, John Carlin, Sarath Ranganathan, Anne-Louise Ponsonby, Peter Vuillermin, and Damjan Vukcevic write:

Paediatric respiratory researchers have widely adopted the multiple-breath washout (MBW) test because it allows assessment of lung function in unsedated infants and is well suited to longitudinal studies of lung development and disease. However, a substantial proportion of MBW tests in infants fail current acceptability criteria. We hypothesised that a model-based approach to analysing the data, in place of traditional simple empirical summaries, would enable more efficient use of these tests. We therefore developed a novel statistical model for infant MBW data and applied it to 1,197 tests from 432 individuals from a large birth cohort study. We focus on Bayesian estimation of the lung clearance index (LCI), the most commonly used summary of lung function from MBW tests. Our results show that the model provides an excellent fit to the data and shed further light on statistical properties of the standard empirical approach. Furthermore, the modelling approach enables LCI to be estimated using tests with different degrees of completeness, something not possible with the standard approach.

They continue:

Our model therefore allows previously unused data to be used rather than discarded, as well as routine use of shorter tests without significant loss of precision.

Yesssss! This reminds me of our work on serial dilution assays, where we squeezed information out of data that had traditionally been declared “below detection limit.”

Mahar, Carlin, et al. continue:

Beyond our specific application, our work illustrates a number of important aspects of Bayesian modelling in practice, such as the importance of hierarchical specifications to account for repeated measurements and the value of model checking via posterior predictive distributions.

Wow—all my favorite things! And check this out:

Keywords: lung clearance index, multiple-breath washout, variance components, Stan, incomplete data.

That’s right. Stan.

There’s only one thing that bugs me. From their Stan program:

alpha ~ normal(0, 10000);

Ummmmm . . . no.

But basically I love this paper. It makes me so happy to think that the research my colleagues and I have been doing for the past thirty years is making a difference.

Bob also points out this R package, “breathteststan: Stan-Based Fit to Gastric Emptying Curves,” from Dieter Menne et al.

There’s so much great stuff out there. And this is what Stan’s all about: enabling people to construct good models, spending less time on figuring out how to fit the damn things and more time on model building, model checking, and design of data collection. Onward!

Leonard Shecter’s coauthor has passed away.

I don’t really have anything to add here except to agree with Phil that Ball Four is one of the best nonfiction books ever. (And, no, I don’t consider Charlie Brown to be nonfiction.)

They’re looking to hire someone with good working knowledge of Bayesian inference algorithms development for multilevel statistical models and mathematical modeling of physiological systems.

Frederic Bois writes:

We have an immediate opening for a highly motivated research / senior scientist with good working knowledge of Bayesian inference algorithms development for multilevel statistical models and mathematical modelling of physiological systems. The successful candidate will assist with the development of deterministic or stochastic methods and algorithms applicable to systems pharmacology/biology models used in safety and efficacy assessment of small and large molecules within the Simcyp Simulators. Candidates should have experience in applied mathematics, biostatistics and data analysis. Ideally, this should be in pharmacokinetics-, toxicokinetics- and/or pharmacodynamics- related areas. In particular, candidates should have hands-on experience in development of optimisation methods and algorithms and be capable of dealing with complex numerical problems including non-linear mixed effect models. The successful candidate is expected to keep abreast of the latest scientific developments, disseminate research results, and actively engage with peers and clients within industry, academia and regulatory agencies.

The company is Certara, and it’s located in Sheffield, U.K. Full information here.

Calibrating patterns in structured data: No easy answers here.

“No easy answers” . . . Hey, that’s a title that’s pure anti-clickbait, a veritable kryptonite for social media . . .

Anyway, here’s the story. Adam Przedniczek writes:

I am trying to devise new statistical tests, or tune up existing ones, for assessing the rate of occurrence of certain larger compound structures, but the trickiest part is taking into account their substructures and building blocks.

To make it as simple as possible, let’s say we are particularly interested in a test for enrichment or over-representation of given structures, e.g. quadruples, over two groups. Everything is clearly depicted in this document.

And here the doubts arise: I have strong premonition that I should take into consideration their inner structure and constituent pairs. In the attachment I show such an adjustment for enrichment of pairs, but I don’t know how to extend this approach properly over higher (more compound) structures.

Hey—this looks like a fun probability problem! (Readers: click on the above link if you haven’t done so already.) The general problem reminds me of things I’ve seen in social networks, where people summarize a network by statistics such as the diameter, the number of open and closed triplets, the number of loops and disconnected components, etc.

My quick answer is that there are two completely different ways to approach the problem. It’s not clear which is best; I guess it could make sense to do both.

The first approach is with a generative model. The advantage of the generative model is that you can answer any question you’d like. The disadvantage is that with structured dependence, it can be really hard to come up with a generative model that captures much of the data features that you care about. With network data, they’re still playing around with variants of that horribly oversimplified Erdos-Renyi model of complete independence. Generative modeling can be a great way to learn, but any particular generative model can be a trap if there are important aspects of the data it does not capture.
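To make the generative-model check concrete, here's a minimal, self-contained sketch: simulate a summary statistic (here a triangle count) under an Erdős–Rényi model and see where a hypothetical observed value falls. All numbers (network size, edge probability, the observed count of 40) are made up for illustration.

```python
import itertools
import random

random.seed(7)

def erdos_renyi(n, p, rng=random):
    """Generate an Erdos-Renyi graph as a set of undirected edges (i, j) with i < j."""
    return {(i, j) for i, j in itertools.combinations(range(n), 2)
            if rng.random() < p}

def triangle_count(n, edges):
    """Count closed triplets (triangles) by brute force over all node triples."""
    return sum(1 for a, b, c in itertools.combinations(range(n), 3)
               if (a, b) in edges and (b, c) in edges and (a, c) in edges)

# Reference distribution of the triangle count under the model.
n, p = 30, 0.15
sims = [triangle_count(n, erdos_renyi(n, p)) for _ in range(200)]
observed = 40   # hypothetical triangle count from the "real" network

# Tail probability of the observed count under the simulated distribution.
tail = sum(s >= observed for s in sims) / len(sims)
print(f"simulated mean: {sum(sims)/len(sims):.1f}, tail prob of observed: {tail:.3f}")
```

If the observed statistic lands deep in the tail of the simulated distribution, that's evidence the independence model misses this feature of the data, which is exactly the trap described above.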

The second approach is more phenomenological, where you compare different groups using raw data and then do some sort of permutation testing or bootstrapping to get a sense of the variation in your summary statistics. This approach has problems, too, though, in that you need to decide how to do the permutations or sampling. Complete randomness can give misleading answers, and there’s a whole literature, with no good answers, on how to bootstrap or perform permutation tests on time series, spatial, and network data. Indeed, when you get right down to it, a permutation test or a bootstrapping rule corresponds to a sampling model, and that gets you close to the difficulties of generative models that we’ve already discussed.
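For the second approach, here's a bare-bones sketch of a permutation test on a group difference in structure counts. The Poisson-generated counts are stand-ins for whatever summary statistic you'd compute from the real data, and, per the caveat above, permuting under complete randomness is itself a modeling choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: counts of a compound structure (e.g. quadruples)
# observed in each sample, for two groups A and B.
group_a = rng.poisson(lam=5.0, size=30)   # stand-in for group A counts
group_b = rng.poisson(lam=6.5, size=30)   # stand-in for group B counts

def perm_test(a, b, n_perm=10_000, rng=rng):
    """Two-sided permutation test on the difference in mean counts."""
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])
    n_a = len(a)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[n_a:].mean() - perm[:n_a].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value away from exactly zero.
    return observed, (count + 1) / (n_perm + 1)

obs, p = perm_test(group_a, group_b)
print(f"observed difference: {obs:.2f}, permutation p-value: {p:.4f}")
```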

So . . . no easy answers! But, whatever procedure you do, I recommend you check it using fake-data simulation.

Causal inference using repeated cross sections

Sadish Dhakal writes:

I am struggling with the problem of conditioning on post-treatment variables. I was hoping you could provide some guidance. Note that I have repeated cross sections, not panel data. Here is the problem simplified:

There are two programs. A policy introduced some changes in one of the programs, which I call the treatment group (T). People can select into T. In fact there’s strong evidence that T programs become more popular in the period after policy change (P). But this is entirely consistent with my hypothesis. My hypothesis is that high-quality people select into the program. I expect that people selecting into T will have better outcomes (Y) because they are of higher quality. Consider the specification (avoiding indices):

Y = b0 + b1·T + b2·P + b3·(T×P) + e   (i)

I expect that b3 will be positive (which it is). Again, my hypothesis is that b3 is positive only because higher quality people select into T after the policy change. Let me reframe the problem slightly (And please correct me if I’m reframing it wrong). If I could observe and control for quality Q, I could write the error term e = Q + u, and b3 in the below specification would be zero.

Y = b0 + b1·T + b2·P + b3·(T×P) + Q + u   (ii)

My thesis is not that the policy “caused” better outcomes, but that it induced selection. How worried should I be about conditioning on T? How should I go about avoiding bogus conclusions?

My reply:

There are two ways I can see to attack this problem, and I guess you’d want to do both. First is to control for lots of pre-treatment predictors, including whatever individual characteristics you can measure which you think would predict the decision to select into T. Second is to include in your model a latent variable representing this information, if you don’t think you can measure it directly. You can then do a Bayesian analysis averaging over your prior distribution on this latent variable, or a sensitivity analysis assessing the bias in your regression coefficient as a function of characteristics of the latent variable and its correlations with your outcome of interest.
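A quick fake-data simulation of Dhakal's specifications (i) and (ii) shows the mechanism: if selection into T depends on an unobserved quality Q more strongly after the policy, the interaction coefficient b3 comes out positive when Q is omitted and collapses toward zero when Q is controlled for. All parameter values here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated repeated cross sections: period P, latent quality Q.
P = rng.integers(0, 2, n)               # 0 = pre-policy, 1 = post-policy
Q = rng.normal(0, 1, n)                 # unobserved quality

# Selection into treatment T depends on Q more strongly after the policy.
prob_T = 1 / (1 + np.exp(-(Q * (1 + 2 * P))))
T = (rng.uniform(size=n) < prob_T).astype(float)

# Outcome depends on Q only: no true effect of T, P, or their interaction.
Y = 2.0 * Q + rng.normal(0, 1, n)

def fit(X, y):
    """OLS via least squares; returns the coefficient vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# Specification (i): omit Q -> the interaction coefficient picks up selection.
b_omit = fit(np.column_stack([ones, T, P, T * P]), Y)
# Specification (ii): control for Q -> the interaction should be near zero.
b_ctrl = fit(np.column_stack([ones, T, P, T * P, Q]), Y)

print("b3 without Q:", round(b_omit[3], 3))
print("b3 with Q:   ", round(b_ctrl[3], 3))
```

This is just the fake-data version of the sensitivity analysis: in real data Q is unobserved, so you'd vary its assumed strength and see how much of b3 survives.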

I’ve not done this sort of analysis myself; perhaps you could look at a textbook on causal inference such as Tyler VanderWeele’s Explanation in Causal Inference: Methods for Mediation and Interaction, or Miguel Hernan and Jamie Robins’s Causal Inference.

“Widely cited study of fake news retracted by researchers”

Chuck Jackson forwards this amusing story:

Last year, a study was published in the journal Nature Human Behaviour, explaining why fake news goes viral on social media. The study itself went viral, being covered by dozens of news outlets. But now, it turns out there was an error in the researchers’ analysis that invalidates their initial conclusion, and the study has been retracted.

The study sought to determine the role of short attention spans and information overload in the spread of fake news. To do this, researchers compared the empirical data from social networking sites that show that fake news is just as likely to be shared as real news — a fact that Filippo Menczer, a professor of informatics and computer science at Indiana University and a co-author of the study, stresses to Rolling Stone is still definitely true — to a simplified model they created of a social media site where they could control for various factors.

Because of an error in processing their findings, their results showed that the simplified model was able to reproduce the real-life numbers, determining that people spread fake news because of their short attention spans and not necessarily, for example, because of foreign bots promoting particular stories. Last spring, the researchers discovered the error when they tried to reproduce their results and found that while attention span and information overload did impact how fake news spread through their model network, they didn’t impact it quite enough to account for the comparative rates at which real and fake news spread in real life. They alerted the journal right away, and the journal deliberated for almost a year whether to issue a correction or a retraction, before finally deciding on Monday to retract the article.

“For me, it’s very embarrassing, but errors occur and of course when we find them we have to correct them,” Menczer tells Rolling Stone. “The results of our paper show that in fact the low attention span does play a role in the spread of low-quality information, but to say that something plays a role is not the same as saying that it’s enough to fully explain why something happens. It’s one of many factors.”…

As Jackson puts it, the story makes the journal look bad but the authors look good. Indeed, there’s nothing so horrible about getting a paper retracted. Mistakes happen.

Another story about a retraction, this time one that didn’t happen

I’m on the editorial board of a journal that had published a paper with serious errors. There was a discussion among the board of whether to retract the paper. One of the other board members did not want to retract, on the grounds that he (the board member) did not see deliberate research misconduct, that this just seemed like incredibly sloppy work. The board member was of the opinion that deliberate misconduct “is basically the only reason to force a retraction of an article (see COPE guideline).”

COPE is the Committee on Publication Ethics. I looked up the COPE guidelines and found this:

Journal editors should consider retracting a publication if:

• they have clear evidence that the findings are unreliable, either as a result of misconduct (e.g. data fabrication) or honest error (e.g. miscalculation or experimental error) . . .

So, no, the COPE guidelines do not require misconduct for a retraction. Honest error is enough. The key is that the findings are unreliable.

I shared this information with the editorial board but they still did not want to retract.

I don’t see why retraction should be a career-altering, or career-damaging, move—except to the very minor extent that it damages your career by making that one paper no longer count.

That said, I don’t care at all whether a paper is “retracted” or merely “corrected” (which I’ve done for 4 of my published papers).

“Did Austerity Cause Brexit?”

Carsten Allefeld writes:

Do you have an opinion on the soundness of this study by Thiemo Fetzer, Did Austerity Cause Brexit?. The author claims to show that support for Brexit in the referendum is correlated with the individual-level impact of austerity measures, and therefore possibly caused by them.

Here’s the abstract of Fetzer’s paper:

Did austerity cause Brexit? This paper shows that the rise of popular support for the UK Independence Party (UKIP), as the single most important correlate of the subsequent Leave vote in the 2016 European Union (EU) referendum, along with broader measures of political dissatisfaction, are strongly and causally associated with an individual’s or an area’s exposure to austerity since 2010. In addition to exploiting data from the population of all electoral contests in the UK since 2000, I leverage detailed individual level panel data allowing me to exploit within-individual variation in exposure to specific welfare reforms as well as broader measures of political preferences. The results suggest that the EU referendum could have resulted in a Remain victory had it not been for a range of austerity-induced welfare reforms. Further, auxiliary results suggest that the welfare reforms activated existing underlying economic grievances that have broader origins than what the current literature on Brexit suggests. Up until 2010, the UK’s welfare state evened out growing income differences across the skill divide through transfer payments. This pattern markedly stops from 2010 onwards as austerity started to bite.

I came into this with skepticism about the use of aggregate trends to learn about individual-level attitude change. But I found Fetzer’s arguments to be pretty convincing.

That said, there are always alternative explanations for this sort of observational correlation.

What happened is that the places that were hardest-hit by austerity were the places where there was the biggest gain for the far-right party.

One alternative explanation is that these gains would still have come even in the absence of austerity, and it’s just that these parts of the country, which were trending to the far right politically, were also the places where austerity bit hardest.

A different alternative explanation is that economic conditions did cause Brexit but at the national rather than the local or individual level: the idea here is that difficult national economic conditions motivated voters in those areas to go for the far right, but again in this explanation this did not arise from direct local effects of austerity.

I don’t see how one could untangle these possible stories based on the data used in Fetzer’s article. But his story makes some sense and it’s something worth thinking about. I’d be interested to hear what Piero Stanig thinks about all this, as he is a coauthor (with Italo Colantone) of this article, Global Competition and Brexit, cited by Fetzer.


This came up in comments the other day:

I kinda like the idea of researchers inserting the word “Inshallah” at appropriate points throughout their text. “Our results will replicate, inshallah. . . . Our code has no more bugs, inshallah,” etc.


God is in every leaf of every tree

Collinearity in Bayesian models

Dirk Nachbar writes:

We were having a debate about how much of a problem collinearity is in Bayesian models. I was arguing that it is not much of a problem. Imagine we have this model

Y ~ N(a + bX1 + cX2, sigma)

where X1 and X2 have some positive correlation (r > .5), they also have similar distributions. I would argue that if we assume 0 centered priors for b and c, then multi chain MCMC should find some balance between the estimates.

In frequentist/OLS models it is a problem and both estimates of b and c will be biased.

With synthetic data, some people have shown that Bayesian estimates are pretty close to biased frequentist estimates.

What do you think? How does it change if we have more parameters than we have data points (low DF)?

My reply:

Yes, with an informative prior distribution on the coefficients you should be fine. Near-collinearity of predictors implies that the data can’t tell you so much about the individual coefficients—you can learn about the linear combination but not as much about the separate parameters—hence it makes sense to include prior information to do better.
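Here's a small numerical illustration of that point (not from the original exchange; the data and prior scale are made up): with two near-collinear predictors, the posterior under independent normal(0, tau) priors, which for known sigma has the closed form of a ridge estimate, has strictly smaller coefficient variances than OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Two near-collinear predictors (correlation around 0.98).
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.2, n)
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(0, 1, n)   # true b = c = 1

sigma = 1.0   # assume known residual sd, for simplicity
tau = 1.0     # sd of the normal(0, tau) prior on each coefficient

# OLS: the coefficient covariance blows up under collinearity.
XtX = X.T @ X
ols = np.linalg.solve(XtX, X.T @ y)
ols_cov = sigma**2 * np.linalg.inv(XtX)

# Conjugate posterior for a normal prior (equivalent to ridge):
#   beta | y ~ N(A^{-1} X'y / sigma^2, A^{-1}),  A = X'X/sigma^2 + I/tau^2
A = XtX / sigma**2 + np.eye(2) / tau**2
post_cov = np.linalg.inv(A)
post_mean = post_cov @ (X.T @ y) / sigma**2

print("OLS estimates:  ", ols.round(2), " sds:", np.sqrt(np.diag(ols_cov)).round(2))
print("Posterior means:", post_mean.round(2), " sds:", np.sqrt(np.diag(post_cov)).round(2))
```

The sum b + c is well identified either way; it's the individual coefficients that the prior stabilizes.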

If you want a vision of the future, imagine a computer, calculating the number of angels who can dance on the head of a pin—forever.

Riffing on techno-hype news articles such as An AI physicist can derive the natural laws of imagined universes, Peter Woit writes:

This is based on the misconception about string theory that the problem with it is that “the calculations are too hard”. The truth of the matter is that there is no actual theory, no known equations to solve, no real calculation to do. But, with the heavy blanket of hype surrounding machine learning these days, that doesn’t really matter, one can go ahead and set the machines to work. . . .

Taking all these developments together, it starts to become clear what the future of this field may look like . . . As the machines supersede humans’ ability to do the kind of thing theorists have been doing for the last twenty years, they will take over this activity, which they can do much better and faster. Biological theorists will be put out to pasture, with the machines taking over, performing ever more complex, elaborate and meaningless calculations, for ever and ever.

Much of the discussion of Woit’s post focuses on the details of the physics models and also the personalities involved in the dispute.

My interest here is somewhat different. For our purposes here let’s just assume Woit is correct that whatever these calculations are, they’re meaningless.

The question is, if they’re meaningless, why do them at all? Just to draw an analogy: it used to be a technical challenge for humans to calculate digits of the decimal expansion of pi. But now computers can do it faster. I guess it’s still a technical challenge for humans to come up with algorithms by which computers can compute more digits. But maybe someone will at some point program a computer to come up with faster algorithms on their own. And we could imagine a network of computers somewhere, doing nothing but computing more digits of pi. But that would just be a pointless waste of resources, kinda like bitcoin but without the political angle.

I guess in the short term there would be motivation to have computers working out more and more string theory, but only because there are influential humans who think it’s worth doing. So in that sense, machines doing string theory is like the old-time building of pyramids and cathedrals, except that the cost is in material resources rather than human labor. It’s kind of amusing to think of the endgame of this sort of science as being its production purely for its own sake. A robot G. H. Hardy would be pleased.