
Update on keeping Mechanical Turk responses trustworthy

This topic has come up before . . . Now there’s a new paper by Douglas Ahler, Carolyn Roush, and Gaurav Sood, who write:

Amazon’s Mechanical Turk has rejuvenated the social sciences, dramatically reducing the cost and inconvenience of collecting original data. Recently, however, researchers have raised concerns about the presence of “non-respondents” (bots) or non-serious respondents on the platform. Spurred by these concerns, we fielded an original survey on MTurk to measure response quality. While we find no evidence of a “bot epidemic,” we do find that a significant portion of survey respondents engaged in suspicious behavior. About 20% of respondents either circumvented location requirements or took the survey multiple times. In addition, at least 5-7% of participants likely engaged in “trolling” or satisficing. Altogether, we find about a quarter of data collected on MTurk is potentially untrustworthy. Expectedly, we find response quality impacts experimental treatments. On average, low quality responses attenuate treatment effects by approximately 9%. We conclude by providing recommendations for collecting data on MTurk.

And here are the promised recommendations:

• Use geolocation filters on survey platforms like Qualtrics to enforce any geographic restrictions.

• Make use of tools on survey platforms to retrieve IP addresses. Run each IP through Know Your IP to identify blacklisted IPs and multiple responses originating from the same IP.

• Include questions to detect trolling and satisficing, but do not copy and paste from a standard canon, as that makes “gaming the survey” easier.

• Increase the time between Human Intelligence Task (HIT) completion and auto-approval so that you can assess your data for untrustworthy responses before approving or rejecting the HIT.

• Rather than withhold payments, a better policy may be to incentivize workers by giving them a bonus when their responses pass quality filters.

• Be mindful of compensation rates. While unusually stingy wages will lead to slow data collection times and potentially less effort by Workers, unusually high wages may give rise to adverse selection—especially because HITs are shared on Turkopticon, etc. soon after posting. . . Social scientists who conduct research on MTurk should stay apprised of the current “fair wage” on MTurk and adhere accordingly.

• Use Worker qualifications on MTurk and filter to include only Workers who have a high percentage of approved HITs into your sample.
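
A couple of these checks are easy to script. Here’s a minimal sketch, with hypothetical field names and a hypothetical allowed-country list, of flagging responses that share an IP address or fall outside a location filter:

```python
from collections import Counter

# Hypothetical survey export: one dict per response.
responses = [
    {"id": 1, "ip": "203.0.113.5", "country": "US"},
    {"id": 2, "ip": "203.0.113.5", "country": "US"},   # same IP as id 1
    {"id": 3, "ip": "198.51.100.7", "country": "IN"},  # outside location filter
    {"id": 4, "ip": "192.0.2.44", "country": "US"},
]

def flag_suspicious(responses, allowed_countries={"US"}):
    """Return ids of responses that share an IP or violate the location filter."""
    ip_counts = Counter(r["ip"] for r in responses)
    flagged = set()
    for r in responses:
        if ip_counts[r["ip"]] > 1 or r["country"] not in allowed_countries:
            flagged.add(r["id"])
    return flagged

print(sorted(flag_suspicious(responses)))  # ids 1 and 2 (shared IP), 3 (location)
```

This is only a first pass; checking IPs against a blacklist service such as Know Your IP would be a separate lookup step.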

They also say they do not think that the problem is limited to MTurk.

I haven’t tried to evaluate all these claims myself, but I thought I’d share it all with those of you who are using this tool in your research. (Or maybe some of you are MTurk bots; who knows what will be the effect of posting this material here.)

P.S. Sood adds:

From my end, “random” error is mostly a non-issue in this context. People don’t use M-Turk to produce generalizable estimates—hardly anyone post-stratifies, for instance. Most people use it to say they did something. I suppose it is a good way to ‘fail fast.’ (The downside is that most failures probably don’t see the light of day.) And if people wanted to buy stat. sig., bulking up on n is easily and cheaply done — it is the raison d’etre of MTurk in some ways.

So what is the point of the article? Twofold, perhaps. First is that it is good to parcel out measurement error where we can. And the second point is about how we build a system where the long-term prognosis is not simply noise. And what stuck out for me from the data was just the sheer scale of plausibly cheeky behavior. I did not anticipate that.

Voter turnout and vote choice of evangelical Christians

Mark Palko writes, “Have you seen this?”, referring to this link to this graph:

I responded: Just one of those things, I think.

Palko replied:

Just to be clear, I am more than willing to believe the central point about the share of the population dropping while the share of the electorate holds relatively steady, but having dealt with more than my share of bad data, I get really nervous when I see a number like that hold absolutely steady.

My response:

I did a quick check of some of those 26% numbers online and they seem to be from actual exit polls. Them all being equal just seems like a coincidence. The part of the graph I really don’t believe is the sharp decline in % evangelical Christian. I’m guessing that the survey question on the exit polls is different than the survey question on whatever poll they’re using to estimate % evangelical.

And, since I have you on the line, here are some graphs from chapter 6 of Red State Blue State:

It seems that religious affiliation is becoming more of a political thing. Or maybe political affiliation is becoming more of a religious thing.

In any case, be careful about comparing time trends of survey questions that are asked in different ways.

Endless citations to already-retracted articles

Ken Cor and Gaurav Sood write:

Many claims in a scientific article rest on research done by others. But when the claims are based on flawed research, scientific articles potentially spread misinformation. To shed light on how often scientists base their claims on problematic research, we exploit data on cases where problems with research are broadly publicized. Using data from over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31.2% of the citations to retracted articles happen a year after they have been retracted. And that 91.4% of the post-retraction citations are approving—noting no concern with the cited article. We augment the analysis with data from an article published in Nature Neuroscience highlighting a serious statistical error in articles published in prominent journals. Data suggest that problematic research was approvingly cited more frequently after the problem was publicized [emphasis added]. Our results have implications for the design of scholarship discovery systems and scientific practice more generally.

I think that by “31.2%” and “91.4%” they mean 30% and 90% . . . but, setting aside this brief lapse in taste or numeracy, their message is important.

P.S. In case you’re wondering why I’d round those numbers: I just don’t think those last digits are conveying any real information. To put it another way, in any sort of replication, I’d expect to see numbers that differ by at least a few percentage points. Reporting as 30% and 90% seems to me to capture what they found without adding meaningless precision.

Gigerenzer: “The Bias Bias in Behavioral Economics,” including discussion of political implications

Gerd Gigerenzer writes:

Behavioral economics began with the intention of eliminating the psychological blind spot in rational choice theory and ended up portraying psychology as the study of irrationality. In its portrayal, people have systematic cognitive biases that are not only as persistent as visual illusions but also costly in real life—meaning that governmental paternalism is called upon to steer people with the help of “nudges.” These biases have since attained the status of truisms. In contrast, I show that such a view of human nature is tainted by a “bias bias,” the tendency to spot biases even when there are none. This may occur by failing to notice when small sample statistics differ from large sample statistics, mistaking people’s random error for systematic error, or confusing intelligent inferences with logical errors. Unknown to most economists, much of psychological research reveals a different portrayal, where people appear to have largely fine-tuned intuitions about chance, frequency, and framing. A systematic review of the literature shows little evidence that the alleged biases are potentially costly in terms of less health, wealth, or happiness. Getting rid of the bias bias is a precondition for psychology to play a positive role in economics.

Like others, Gigerenzer draws the connection to visual illusions, but with a twist:

By way of suggestion, articles and books introduce biases together with images of visual illusions, implying that biases (often called “cognitive illusions”) are equally stable and inevitable. If our cognitive system makes such big blunders like our visual system, what can you expect from everyday and business decisions? Yet this analogy is misleading, and in two respects.

First, visual illusions are not a sign of irrationality, but a byproduct of an intelligent brain that makes “unconscious inferences”—a term coined by Hermann von Helmholtz—from two-dimensional retinal images to a three-dimensional world. . . .

Second, the analogy with visual illusions suggests that people cannot learn, specifically that education in statistical reasoning is of little efficacy (Bond, 2009). This is incorrect . . .

It’s an interesting paper. Gigerenzer goes through a series of classic examples of cognitive errors, including the use of base rates in conditional probability, perceptions of patterns in short sequences, the hot hand, bias in estimates of risks, systematic errors in almanac questions, the Lake Wobegon effect, and framing effects.

I’m a sucker for this sort of thing. It might be that at some points Gigerenzer is overstating his case, but he makes a lot of good points.

Some big themes

In his article, Gigerenzer raises three other issues that I’ve been thinking about a lot lately:

1. Overcertainty in the reception and presentation of scientific results.

2. Claims that people are stupid.

3. The political implications of claims that people are stupid.

Overcertainty and the problem of trust

Gigerenzer writes:

The irrationality argument exists in many versions (e.g. Conley, 2013; Kahneman, 2011). Not only has it come to define behavioral economics but it also has defined how most economists view psychology: Psychology is about biases, and psychology has nothing to say about reasonable behavior.

Few economists appear to be aware that the bias message is not representative of psychology or cognitive science in general. For instance, loss aversion is often presented as a truism; in contrast, a review of the literature concluded that the “evidence does not support that losses, on balance, tend to be any more impactful than gains” (Gal and Rucker, 2018). Research outside the heuristics-and-biases program that does not confirm this message—including most of the psychological research described in this article—is rarely cited in the behavioral economics literature (Gigerenzer, 2015).

(We discussed Gal and Rucker (2018) here.)

More generally, this makes me think of the problem of trust that Kaiser Fung and I noted in the Freakonomics franchise. There’s so much published research out there, indeed so much publicized research, that it’s hard to know where to start, so a natural strategy for sifting through and understanding it all is using networks of trust. You trust your friends and colleagues, they trust their friends and colleagues, and so on. But you can see how this can lead to economists getting a distorted view of the content of psychology and cognitive science.

Claims that people are stupid

The best of the heuristics and biases research is fascinating, important stuff that has changed my life and gives us, ultimately, a deeper respect for ourselves as reasoning beings. But, as Gigerenzer points out, this same research is often misinterpreted as suggesting that people are easily manipulable (or easily nudged) fools, and this fits in with lots of junk science claims of the same sort: pizzagate-style claims that the amount you eat can be manipulated by the size of your dining tray, goofy poli-sci claims that a woman’s vote depends on the time of the month, air rage, himmicanes, shark attacks, ages-ending-in-9, and all the rest. This is an attitude which I can understand might be popular among certain marketers, political consultants, and editors of the Proceedings of the National Academy of Sciences, but I don’t buy it, partly because of zillions of errors in the published studies in question and also because of the piranha principle. Again, what’s important here is not just the claim that people make mistakes, but that they can be consistently manipulated using what would seem to be irrelevant stimuli.

Political implications

As usual, let me emphasize that if these claims were true—if it were really possible to massively and predictably change people’s attitudes on immigration by flashing a subliminal smiley face on a computer screen—then we’d want to know it.

If the claims don’t pan out, then they’re not so interesting, except inasmuch as: (a) it’s interesting that smart people believed these things, and (b) we care if resources are thrown at these ideas. For (b), I’m not just talking about NSF funds etc., I’m also talking about policy money (remember, pizzagate dude got appointed to a U.S. government position at one point to implement his ideas) and just a general approach toward policymaking, things like nudging without persuasion, nudges that violate the Golden Rule, and of course nudges that don’t work.

There’s also a way in which a focus on individual irrationality can be used to discredit or shift blame onto the public. For example, Gigerenzer writes:

Nicotine addiction and obesity have been attributed to people’s myopia and probability-blindness, not to the actions of the food and tobacco industry. Similarly, an article by the Deutsche Bank Research “Homo economicus – or more like Homer Simpson?” attributed the financial crisis to a list of 17 cognitive biases rather than the reckless practices and excessive fragility of banks and the financial system (Schneider, 2010).

Indeed, social scientists used to talk about the purported irrationality of voting (for our counter-argument, see here). If voters are irrational, then we shouldn’t take their votes seriously.

I prefer Gigerenzer’s framing:

The alternative to paternalism is to invest in citizens so that they can reach their own goals rather than be herded like sheep.

Healthier kids: Using Stan to get more information out of pediatric respiratory data

Robert Mahar, John Carlin, Sarath Ranganathan, Anne-Louise Ponsonby, Peter Vuillermin, and Damjan Vukcevic write:

Paediatric respiratory researchers have widely adopted the multiple-breath washout (MBW) test because it allows assessment of lung function in unsedated infants and is well suited to longitudinal studies of lung development and disease. However, a substantial proportion of MBW tests in infants fail current acceptability criteria. We hypothesised that a model-based approach to analysing the data, in place of traditional simple empirical summaries, would enable more efficient use of these tests. We therefore developed a novel statistical model for infant MBW data and applied it to 1,197 tests from 432 individuals from a large birth cohort study. We focus on Bayesian estimation of the lung clearance index (LCI), the most commonly used summary of lung function from MBW tests. Our results show that the model provides an excellent fit to the data and shed further light on statistical properties of the standard empirical approach. Furthermore, the modelling approach enables LCI to be estimated using tests with different degrees of completeness, something not possible with the standard approach.

They continue:

Our model therefore allows previously unused data to be used rather than discarded, as well as routine use of shorter tests without significant loss of precision.

Yesssss! This reminds me of our work on serial dilution assays, where we squeezed information out of data that had traditionally been declared “below detection limit.”

Mahar, Carlin, et al. continue:

Beyond our specific application, our work illustrates a number of important aspects of Bayesian modelling in practice, such as the importance of hierarchical specifications to account for repeated measurements and the value of model checking via posterior predictive distributions.
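
In case the term is unfamiliar: a posterior predictive check just means simulating replicated datasets from the fitted model and comparing a test statistic to its observed value. Here’s a toy sketch in which the “posterior draws” are crude stand-ins for illustration, not output from their actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and (assumed, for illustration only) posterior draws for a normal model.
y = rng.normal(5.0, 2.0, size=50)
mu_draws = rng.normal(y.mean(), y.std() / np.sqrt(len(y)), size=1000)
sigma_draws = np.abs(rng.normal(y.std(), 0.2, size=1000))

# Posterior predictive check: simulate replicated datasets and compare a
# test statistic (here, the maximum) to its value in the observed data.
T_obs = y.max()
T_rep = np.array([
    rng.normal(mu, sigma, size=len(y)).max()
    for mu, sigma in zip(mu_draws, sigma_draws)
])
p_value = (T_rep >= T_obs).mean()  # values near 0 or 1 signal misfit
print(round(p_value, 2))
```

With a well-fitting model the observed statistic should sit comfortably inside the distribution of replicated statistics.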

Wow—all my favorite things! And check this out:

Keywords: lung clearance index, multiple-breath washout, variance components, Stan, incomplete data.

That’s right. Stan.

There’s only one thing that bugs me. From their Stan program:

alpha ~ normal(0, 10000);

Ummmmm . . . no.

But basically I love this paper. It makes me so happy to think that the research my colleagues and I have been doing for the past thirty years is making a difference.

Bob also points out this R package, “breathteststan: Stan-Based Fit to Gastric Emptying Curves,” from Dieter Menne et al.

There’s so much great stuff out there. And this is what Stan’s all about: enabling people to construct good models, spending less time figuring out how to fit the damn things and more time on model building, model checking, and design of data collection. Onward!

Leonard Shecter’s coauthor has passed away.

I don’t really have anything to add here except to agree with Phil that Ball Four is one of the best nonfiction books ever. (And, no, I don’t consider Charlie Brown to be nonfiction.)

They’re looking to hire someone with good working knowledge of Bayesian inference algorithms development for multilevel statistical models and mathematical modeling of physiological systems.

Frederic Bois writes:

We have an immediate opening for a highly motivated research / senior scientist with good working knowledge of Bayesian inference algorithms development for multilevel statistical models and mathematical modelling of physiological systems. The successful candidate will assist with the development of deterministic or stochastic methods and algorithms applicable to systems pharmacology/biology models used in safety and efficacy assessment of small and large molecules within the Simcyp Simulators. Candidates should have experience in applied mathematics, biostatistics and data analysis. Ideally, this should be in pharmacokinetics-, toxicokinetics- and/or pharmacodynamics- related areas. In particular, candidates should have hands-on experience in development of optimisation methods and algorithms and be capable of dealing with complex numerical problems including non-linear mixed effect models. The successful candidate is expected to keep abreast of the latest scientific developments, disseminate research results, and actively engage with peers and clients within industry, academia and regulatory agencies.

The company is Certara, and it’s located in Sheffield, U.K. Full information here.

Calibrating patterns in structured data: No easy answers here.

“No easy answers” . . . Hey, that’s a title that’s pure anti-clickbait, a veritable kryptonite for social media . . .

Anyway, here’s the story. Adam Przedniczek writes:

I am trying to devise new or tune up already existing statistical tests assessing rate of occurrences of some bigger compound structures, but the most tricky part is to take into account their substructures and building blocks.

To make it as simple as possible, let’s say we are particularly interested in a test for enrichment or over-representation of given structures, e.g. quadruples, over two groups. Everything is clearly depicted in this document.

And here the doubts arise: I have strong premonition that I should take into consideration their inner structure and constituent pairs. In the attachment I show such an adjustment for enrichment of pairs, but I don’t know how to extend this approach properly over higher (more compound) structures.

Hey—this looks like a fun probability problem! (Readers: click on the above link if you haven’t done so already.) The general problem reminds me of things I’ve seen in social networks, where people summarize a network by statistics such as the diameter, the number of open and closed triplets, the number of loops and disconnected components, etc.

My quick answer is that there are two completely different ways to approach the problem. It’s not clear which is best; I guess it could make sense to do both.

The first approach is with a generative model. The advantage of the generative model is that you can answer any question you’d like. The disadvantage is that with structured dependence, it can be really hard to come up with a generative model that captures much of the data features that you care about. With network data, they’re still playing around with variants of that horribly oversimplified Erdos-Renyi model of complete independence. Generative modeling can be a great way to learn, but any particular generative model can be a trap if there are important aspects of the data it does not capture.

The second approach is more phenomenological, where you compare different groups using raw data and then do some sort of permutation testing or bootstrapping to get a sense of the variation in your summary statistics. This approach has problems, too, though, in that you need to decide how to do the permutations or sampling. Complete randomness can give misleading answers, and there’s a whole literature, with no good answers, on how to bootstrap or perform permutation tests on time series, spatial, and network data. Indeed, when you get right down to it, a permutation test or a bootstrapping rule corresponds to a sampling model, and that gets you close to the difficulties of generative models that we’ve already discussed.
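
For concreteness, here’s what the naive version of the second approach looks like: a bare-bones permutation test on a placeholder summary statistic, assuming complete exchangeability across groups, which is exactly the assumption that structured data can violate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data: counts of some compound structure per unit, in two groups.
group_a = rng.poisson(4.0, size=30)
group_b = rng.poisson(5.5, size=30)

def permutation_test(a, b, n_perm=5000, rng=rng):
    """Two-sided permutation test for a difference in means, assuming
    exchangeability across groups (often false for structured data!)."""
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm

p_value = permutation_test(group_a, group_b)
print(p_value)
```

The hard part the post describes—choosing a permutation scheme that respects the dependence structure—is precisely what this version punts on.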

So . . . no easy answers! But, whatever procedure you do, I recommend you check it using fake-data simulation.

Causal inference using repeated cross sections

Sadish Dhakal writes:

I am struggling with the problem of conditioning on post-treatment variables. I was hoping you could provide some guidance. Note that I have repeated cross sections, not panel data. Here is the problem simplified:

There are two programs. A policy introduced some changes in one of the programs, which I call the treatment group (T). People can select into T. In fact there’s strong evidence that T programs become more popular in the period after policy change (P). But this is entirely consistent with my hypothesis. My hypothesis is that high-quality people select into the program. I expect that people selecting into T will have better outcomes (Y) because they are of higher quality. Consider the specification (avoiding indices):

Y = b0 + b1 T + b2 P + b3 T X P + e (i)

I expect that b3 will be positive (which it is). Again, my hypothesis is that b3 is positive only because higher quality people select into T after the policy change. Let me reframe the problem slightly (And please correct me if I’m reframing it wrong). If I could observe and control for quality Q, I could write the error term e = Q + u, and b3 in the below specification would be zero.

Y = b0 + b1 T + b2 P + b3 T X P + Q + u (ii)

My thesis is not that the policy “caused” better outcomes, but that it induced selection. How worried should I be about conditioning on T? How should I go about avoiding bogus conclusions?

My reply:

There are two ways I can see to attack this problem, and I guess you’d want to do both. First is to control for lots of pre-treatment predictors, including whatever individual characteristics you can measure which you think would predict the decision to select into T. Second is to include in your model a latent variable representing this information, if you don’t think you can measure it directly. You can then do a Bayesian analysis averaging over your prior distribution on this latent variable, or a sensitivity analysis assessing the bias in your regression coefficient as a function of characteristics of the latent variable and its correlations with your outcome of interest.
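
A quick fake-data simulation makes the selection story concrete. In this sketch (all numbers made up for illustration), quality Q drives both post-policy selection into T and the outcome, and the interaction coefficient comes out positive even though the program itself has no effect at all:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

P = rng.integers(0, 2, n)                    # post-policy period indicator
Q = rng.normal(0, 1, n)                      # unobserved quality
# Selection: after the policy, high-Q people are more likely to choose T.
T = (rng.normal(0, 1, n) + 1.5 * Q * P > 0).astype(float)
# Outcome depends on quality only -- the program itself does nothing.
Y = Q + rng.normal(0, 1, n)

# OLS for Y = b0 + b1*T + b2*P + b3*T*P + e, omitting Q.
X = np.column_stack([np.ones(n), T, P, T * P])
b = np.linalg.lstsq(X, Y, rcond=None)[0]
print(round(b[3], 2))  # b3 is positive despite a zero true program effect
```

Running the same regression with Q added to the design matrix drives the interaction back toward zero, which is the point of specification (ii) above.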

I’ve not done this sort of analysis myself; perhaps you could look at a textbook on causal inference such as Tyler VanderWeele’s Explanation in Causal Inference: Methods for Mediation and Interaction, or Miguel Hernan and Jamie Robins’s Causal Inference.

“Widely cited study of fake news retracted by researchers”

Chuck Jackson forwards this amusing story:

Last year, a study was published in the journal Nature Human Behaviour, explaining why fake news goes viral on social media. The study itself went viral, being covered by dozens of news outlets. But now, it turns out there was an error in the researchers’ analysis that invalidates their initial conclusion, and the study has been retracted.

The study sought to determine the role of short attention spans and information overload in the spread of fake news. To do this, researchers compared the empirical data from social networking sites that show that fake news is just as likely to be shared as real news — a fact that Filippo Menczer, a professor of informatics and computer science at Indiana University and a co-author of the study, stresses to Rolling Stone is still definitely true — to a simplified model they created of a social media site where they could control for various factors.

Because of an error in processing their findings, their results showed that the simplified model was able to reproduce the real-life numbers, determining that people spread fake news because of their short attention spans and not necessarily, for example, because of foreign bots promoting particular stories. Last spring, the researchers discovered the error when they tried to reproduce their results and found that while attention span and information overload did impact how fake news spread through their model network, they didn’t impact it quite enough to account for the comparative rates at which real and fake news spread in real life. They alerted the journal right away, and the journal deliberated for almost a year whether to issue a correction or a retraction, before finally deciding on Monday to retract the article.

“For me, it’s very embarrassing, but errors occur and of course when we find them we have to correct them,” Menczer tells Rolling Stone. “The results of our paper show that in fact the low attention span does play a role in the spread of low-quality information, but to say that something plays a role is not the same as saying that it’s enough to fully explain why something happens. It’s one of many factors.”…

As Jackson puts it, the story makes the journal look bad but the authors look good. Indeed, there’s nothing so horrible about getting a paper retracted. Mistakes happen.

Another story about a retraction, this time one that didn’t happen

I’m on the editorial board of a journal that had published a paper with serious errors. There was a discussion among the board of whether to retract the paper. One of the other board members did not want to retract, on the grounds that he (the board member) did not see deliberate research misconduct, that this just seemed like incredibly sloppy work. The board member was of the opinion that deliberate misconduct “is basically the only reason to force a retraction of an article (see COPE guideline).”

COPE is the Committee on Publication Ethics. I looked up the COPE guidelines and found this:

Journal editors should consider retracting a publication if:

• they have clear evidence that the findings are unreliable, either as a result of misconduct (e.g. data fabrication) or honest error (e.g. miscalculation or experimental error) . . .

So, no, the COPE guidelines do not require misconduct for a retraction. Honest error is enough. The key is that the findings are unreliable.

I shared this information with the editorial board but they still did not want to retract.

I don’t see why retraction should be a career-altering, or career-damaging, move—except to the very minor extent that it damages your career by making that one paper no longer count.

That said, I don’t care at all whether a paper is “retracted” or merely “corrected” (which I’ve done for 4 of my published papers).

“Did Austerity Cause Brexit?”

Carsten Allefeld writes:

Do you have an opinion on the soundness of this study by Thiemo Fetzer, Did Austerity Cause Brexit?. The author claims to show that support for Brexit in the referendum is correlated with the individual-level impact of austerity measures, and therefore possibly caused by them.

Here’s the abstract of Fetzer’s paper:

Did austerity cause Brexit? This paper shows that the rise of popular support for the UK Independence Party (UKIP), as the single most important correlate of the subsequent Leave vote in the 2016 European Union (EU) referendum, along with broader measures of political dissatisfaction, are strongly and causally associated with an individual’s or an area’s exposure to austerity since 2010. In addition to exploiting data from the population of all electoral contests in the UK since 2000, I leverage detailed individual level panel data allowing me to exploit within-individual variation in exposure to specific welfare reforms as well as broader measures of political preferences. The results suggest that the EU referendum could have resulted in a Remain victory had it not been for a range of austerity-induced welfare reforms. Further, auxiliary results suggest that the welfare reforms activated existing underlying economic grievances that have broader origins than what the current literature on Brexit suggests. Up until 2010, the UK’s welfare state evened out growing income differences across the skill divide through transfer payments. This pattern markedly stops from 2010 onwards as austerity started to bite.

I came into this with skepticism about the use of aggregate trends to learn about individual-level attitude change. But I found Fetzer’s arguments to be pretty convincing.

That said, there are always alternative explanations for this sort of observational correlation.

What happened is that the places that were hardest-hit by austerity were the places where there was the biggest gain for the far-right party.

One alternative explanation is that these gains would still have come even in the absence of austerity, and it’s just that these parts of the country, which were trending to the far right politically, were also the places where austerity bit hardest.

A different alternative explanation is that economic conditions did cause Brexit but at the national rather than the local or individual level: the idea here is that difficult national economic conditions motivated voters in those areas to go for the far right, but again in this explanation this did not arise from direct local effects of austerity.

I don’t see how one could untangle these possible stories based on the data used in Fetzer’s article. But his story makes some sense and it’s something worth thinking about. I’d be interested to hear what Piero Stanig thinks about all this, as he is a coauthor (with Italo Colantone) of this article, Global Competition and Brexit, cited by Fetzer.


This came up in comments the other day:

I kinda like the idea of researchers inserting the word “Inshallah” at appropriate points throughout their text. “Our results will replicate, inshallah. . . . Our code has no more bugs, inshallah,” etc.


God is in every leaf of every tree

Collinearity in Bayesian models

Dirk Nachbar writes:

We were having a debate about how much of a problem collinearity is in Bayesian models. I was arguing that it is not much of a problem. Imagine we have this model

Y ~ N(a + bX1 + cX2, sigma)

where X1 and X2 have some positive correlation (r > .5), they also have similar distributions. I would argue that if we assume 0 centered priors for b and c, then multi chain MCMC should find some balance between the estimates.

In frequentist/OLS models it is a problem and both estimates of b and c will be biased.

With synthetic data, some people have shown that Bayesian estimates are pretty close to biased frequentist estimates.

What do you think? How does it change if we have more parameters than we have data points (low DF)?

My reply:

Yes, with an informative prior distribution on the coefficients you should be fine. Near-collinearity of predictors implies that the data can’t tell you so much about the individual coefficients—you can learn about the linear combination but not as much about the separate parameters—hence it makes sense to include prior information to do better.
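
Here’s a quick simulation of the point. With near-collinear predictors, least squares gives unstable individual coefficients, while a zero-centered normal prior stabilizes them (the posterior mode with known residual variance is just ridge regression; the prior scale below is an arbitrary choice for illustration), and either way the data pin down the linear combination:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

x1 = rng.normal(0, 1, n)
x2 = 0.95 * x1 + 0.1 * rng.normal(0, 1, n)   # near-collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(0, 1, n)

X = np.column_stack([x1, x2])

# OLS: solve (X'X) b = X'y -- unstable when X'X is near-singular.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Posterior mode with independent N(0, 1) priors on the coefficients and
# known sigma = 1: equivalent to ridge regression with lambda = 1.
lam = 1.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(b_ols.round(2), b_ridge.round(2))
```

In both fits the sum of the two coefficients lands near the true value of 2; it’s the split between them that the prior helps with.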

If you want a vision of the future, imagine a computer, calculating the number of angels who can dance on the head of a pin—forever.

Riffing on techno-hype news articles such as An AI physicist can derive the natural laws of imagined universes, Peter Woit writes:

This is based on the misconception about string theory that the problem with it is that “the calculations are too hard”. The truth of the matter is that there is no actual theory, no known equations to solve, no real calculation to do. But, with the heavy blanket of hype surrounding machine learning these days, that doesn’t really matter, one can go ahead and set the machines to work. . . .

Taking all these developments together, it starts to become clear what the future of this field may look like . . . As the machines supersede humans’ ability to do the kind of thing theorists have been doing for the last twenty years, they will take over this activity, which they can do much better and faster. Biological theorists will be put out to pasture, with the machines taking over, performing ever more complex, elaborate and meaningless calculations, for ever and ever.

Much of the discussion of Woit’s post focuses on the details of the physics models and also the personalities involved in the dispute.

My interest here is somewhat different. For our purposes here let’s just assume Woit is correct that whatever these calculations are, they’re meaningless.

The question is, if they’re meaningless, why do them at all? Just to draw an analogy: it used to be a technical challenge for humans to calculate digits of the decimal expansion of pi. But now computers can do it faster. I guess it’s still a technical challenge for humans to come up with algorithms by which computers can compute more digits. But maybe someone will at some point program a computer to come up with faster algorithms on its own. And we could imagine a network of computers somewhere, doing nothing but computing more digits of pi. But that would just be a pointless waste of resources, kinda like bitcoin but without the political angle.

I guess in the short term there would be motivation to have computers working out more and more string theory, but only because there are influential humans who think it’s worth doing. So in that sense, machines doing string theory is like the old-time building of pyramids and cathedrals, except that the cost is in material resources rather than human labor. It’s kind of amusing to think of the endgame of this sort of science as being its production purely for its own sake. A robot G. H. Hardy would be pleased.

Read this: it’s about importance sampling!

Importance sampling plays an odd role in statistical computing. It’s an old-fashioned idea and can behave just horribly if applied straight-up—but it keeps arising in different statistics problems.

Aki came up with Pareto-smoothed importance sampling (PSIS) for leave-one-out cross-validation.
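For readers who haven’t seen the basic mechanics, here is a toy sketch of plain importance sampling — not PSIS itself, and the target, proposal, and sample size are all chosen just for illustration. To estimate a tail probability under a standard normal, draw from a proposal that covers the tail and reweight each draw by the density ratio:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Target quantity: P(X > 3) for X ~ N(0, 1), a rare event where
# naive Monte Carlo wastes almost every draw.
# Proposal: Exp(1) shifted to start at 3, which lives in the tail.
x = 3.0 + rng.exponential(scale=1.0, size=n)

# Importance weights = target density / proposal density, on the log scale
log_phi = -0.5 * x ** 2 - 0.5 * math.log(2 * math.pi)  # log N(0,1) density
log_g = -(x - 3.0)                                     # log shifted-Exp(1) density
w = np.exp(log_phi - log_g)

estimate = w.mean()
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # P(Z > 3), about 0.00135
```

In this example the weights happen to be bounded, so the straight-up estimator behaves well. When the weight distribution has heavy tails, a handful of draws can dominate the estimate — that is the “horrible” behavior, and it is what PSIS diagnoses (via the fitted Pareto shape parameter) and mitigates by replacing the largest weights with quantiles of a fitted generalized Pareto distribution.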

We recently revised the PSIS article and Dan Simpson wrote a useful blog post about it the other day. I’m linking to Dan’s post again here because he gave it an obscure title so you might have missed it.

We’ve had a bunch of other ideas during the past few years involving importance sampling, including adaptive proposal distributions, wedge sampling, expectation propagation, and gradient-based marginal optimization, so I hope we can figure out some more things.

Reproducibility problems in the natural sciences

After reading my news article on the replication crisis, Mikael Wolfe writes:

While I’m sure there is a serious issue about replication in social science experiments, what about the natural sciences? You use the term “science” even though you don’t include natural sciences in your piece. I fear that climate and other science deniers will use your piece as ammunition that peer-reviewed science is “junk” and therefore no action on climate change and other environmental problems is warranted.

My reply: In climate science it is difficult to do any real replication because we can’t rerun the global climate, so any replications will necessarily be very model based. Regarding the natural sciences more generally, there have been many high-profile replication failures in biology and medicine. In biology it can be notoriously difficult for people from one lab to replicate studies from other labs. Finally, regarding your last point: I don’t think that uncertainty should be a reason for doing nothing. After all, we are uncertain about what countries might attack us, but that does not stop us from spending money on national defense. We should be able to acknowledge uncertainty and still make decisions.

Causal inference with time-varying mediators

Adan Becerra writes to Tyler VanderWeele:

I have a question about your paper “Mediation analysis for a survival outcome with time-varying exposures, mediators, and confounders” that I was hoping that you could help my colleague (Julia Ward) and me with. We are currently using Medicare claims data to evaluate the following general mediation among dialysis patients with atrial fibrillation:

Race -> Warfarin prescriptions -> Stroke within 1 year.

where Warfarin prescriptions is a time-varying mediator (using Part D claims with number supplied as days) and there are time-dependent confounders. Even though the exposure doesn’t vary over time, this is an extension of van der Laan’s time-dependent mediation method because yours also includes time-dependent confounders. However, I would also like to account for death as a competing risk via a sub-hazard. Am I correct that the G-formula cannot do this? If so, are you aware of any methods that could? I found the following paper that implements a marginal structural subdistribution hazard model, but this doesn’t do mediation (at least I don’t think so).

Becerra also cc-ed me, adding:

I recognize that you have stated on the blog before that you are hesitant to use mediation analyses, but they are very common in epi/clinical epi, and any help would be much appreciated.

I replied that the two arrows in the above diagram have different meanings. The first arrow is a comparison, comparing people of different races. The second arrow is causal, comparing what would happen if people are prescribed Warfarin or not.

To put it another way, the first arrow is a between-person comparison, whereas the second arrow is implicitly a within-person comparison.

I assume they’d also want another causal arrow, going from Warfarin prescription -> taking Warfarin -> Stroke. But maybe they’re assuming that getting the prescription is equivalent to taking the drug in this case. Anyway, it seems to me that prescription of the drug is not a “mediator” but rather is the causal variable (in the diagram) or an instrument (in the more elaborate diagram, where prescription is the instrument and taking the drug is the causal variable).

This sort of thing comes up a lot when someone proposes a method I don’t fully understand. Perhaps because I don’t really understand it, I end up thinking about the problem in a different way.

Becerra responded:

I see your point about the first arrow maybe not being causal. In fact, Tyler and Whitney Robinson wrote a whole paper on the topic:

We also discuss a stronger interpretation of the “effect of race” (stronger in terms of assumptions) involving the joint effects of race-associated physical phenotype (e.g. skin color), parental physical phenotype, genetic background and cultural context when such variables are thought to be hypothetically manipulable and if adequate control for confounding were possible.

So according to this it seems like there is a way of estimating the causal effect of race. But let’s suppose my exposure wasn’t race, just so I can highlight the real issue in this analysis. My concern is that I haven’t been able to find a method that does mediation analysis with time-varying mediators, exposures, and confounders for a survival outcome with a sub-hazard competing risk a la Fine and Gray. In the dialysis population, death is a huge competing risk for stroke.

However, I am no expert in this and I too am afraid I may be missing something which is why I reached out. When I saw Tyler’s original paper I thought it would work but I can’t see how to incorporate the sub hazard.

Is there such a thing as time-varying instruments? In this analysis I’m using Part D claims in Medicare (so I don’t really know if they took the drug), and patients can go on and off the drugs as well as initiate other drugs (like beta blockers, calcium channel blockers, etc.). I’m really interested in warfarin, so I’m concerned about time-varying confounding due to other drugs.

Doesn’t seem to me like a static yes/no instrument would work, but I’ve never fit an IV model, so what do I know.

I did just see some instrumental variable models with sub hazards so I’ll look there.

And then VanderWeele replied:

Yes, as Andrew noted, if you have “race” as your exposure then this should not be interpreted causally with respect to race. You can (as per the VanderWeele and Robinson, 2014) still interpret the “indirect effect” estimate as e.g. by what portion you would reduce the existing racial disparity if you intervened on warfarin to equalize its distribution in the black population to what it is in the white population, and the “direct effect” as the portion of the disparity that would still remain after that intervention.

We do have a paper on mediation with a survival outcome with a time-varying mediator, but alas it will not handle competing risk and sub-hazards. That would require further methods development.

I’ve never worked on this sort of problem myself. If I did so, I think I’d start by modeling the probability of stroke given drug prescriptions and individual-level background variables including ethnicity, age, sex, previous health status, etc. Maybe with some measurement error model if claims data are imperfect.
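That kind of starting point might look something like the sketch below — on entirely simulated data, with hypothetical variable names and effect sizes standing in for the real claims fields, and a plain logistic regression standing in for whatever richer model (measurement error, survival structure) the real analysis would need:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2000

# Hypothetical patient-level predictors (simulated stand-ins for claims data)
age = rng.normal(70.0, 8.0, size=n)
on_warfarin = rng.integers(0, 2, size=n)
prior_stroke = rng.integers(0, 2, size=n)

# Simulated outcome: warfarin protective, age and stroke history harmful.
# These coefficients are invented for the illustration.
logit = -3.0 + 0.04 * (age - 70.0) - 0.8 * on_warfarin + 1.0 * prior_stroke
stroke = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Model the probability of stroke given prescriptions and background variables
X = np.column_stack([age - 70.0, on_warfarin, prior_stroke])
fit = LogisticRegression(max_iter=1000).fit(X, stroke)
```

With enough simulated patients the fit recovers the signs of the invented effects (a negative coefficient on the warfarin indicator, positive on stroke history), which is the basic sanity check before layering on survival or measurement-error structure.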

How to read (in quantitative social science). And by implication, how to write.

I’m reposting this one from 2014 because I think it could be useful to lots of people.

Also this advice on writing research articles, from 2009.

This is a great example for a statistics class, or a class on survey sampling, or a political science class

Under the heading, “Latino approval of Donald Trump,” Tyler Cowen writes:

From a recent NPR/PBS poll:

African-American approval: 11%

White approval: 40%

Latino approval: 50%

He gets 136 comments, many of which reveal a stunning ignorance of polling. For example, several commenters seem to think that a poll sponsored by National Public Radio is a poll of NPR listeners.

Should NPR waste its money on commissioning generic polls? I don’t think so. There are a zillion polls out there, and NPR—or just about any news organization—has, I believe, better things to do than create artificial headlines by asking the same damn polling question that everyone else does.

In any case, it’s a poll of American adults, not a poll of NPR listeners.

The other big mistake that many commenters made was to take the poll result at face value. Cowen did that too, by reporting the results as “Latino approval of Donald Trump,” rather than “One poll finds . . .”

A few things are going on here. In no particular order:

1. Margin of error. A national poll of 1000 people will have about 150 Latinos. The standard error of a simple proportion is then 0.5/sqrt(150) = 0.04. So, just to start off, that 50% could easily be anywhere between 42% and 58%. And, of course, given what else we know, including other polls, 42% is much more likely than 58%.

That said, even 42% is somewhat striking in that you might expect a minority group to support the Republican president less than other groups. One explanation here is that presidential approval is highly colored by partisanship, and minorities tend to be less partisan than whites—we see this in many ways in lots of data.

2. Selection. Go to the linked page and you’ll see dozens of numbers. Look at enough numbers and you’ll start to focus on noise. The garden of forking paths—it’s not just about p-values. Further selection is that it’s my impression that Cowen enjoys posting news that will fire up his conservative readers and annoy his liberal readers—and sometimes he’ll mix it up and go the other way.

3. The big picture. Trump’s approval is around 40%. It will be higher for some groups and lower for others. If your goal is to go through a poll and find some good news for Trump, you can do so, but it doesn’t alter the big picture.

I searched a bit on the web and found this disclaimer from the PBS News Hour:

President Trump tweeted about a PBS NewsHour/NPR/Marist poll result on Tuesday, highlighting that his approval rating among Latinos rose to 50 percent. . . .

However, the president overlooked the core finding of the poll, which showed that 57 percent of registered voters said they would definitely vote against Trump in 2020, compared to just 30 percent who said they would back the president. The president’s assertion that the poll shows an increase in support from Latino voters also requires context. . . .

But only 153 Latino Americans were interviewed for the poll. The small sample size of Latino respondents had a “wide” margin of error of 9.9 percentage points . . . [Computing two standard errors using the usual formula, 2*sqrt(0.5*0.5/153) gives 0.081, or 8.1 percentage points, so I assume that the margin of error of 9.9 percentage points includes a correction for the survey’s design effect, adjusting for it not being a simple random sample of the target population. — AG.] . . .
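The arithmetic in that bracketed note is easy to check, along with the design effect it implies (a quick sketch; the 9.9-point figure is the poll’s reported margin of error, and 153 is the reported number of Latino respondents):

```python
import math

n = 153          # Latino respondents in the poll
p = 0.5          # worst-case proportion for a margin of error

se_srs = math.sqrt(p * (1 - p) / n)  # standard error under simple random sampling
moe_srs = 2.0 * se_srs               # about 0.081, i.e., 8.1 percentage points

moe_reported = 0.099                 # the reported 9.9 percentage points
design_effect = (moe_reported / moe_srs) ** 2  # about 1.5
```

A design effect around 1.5 would be typical for a weighted national sample, which is consistent with the 9.9-point figure reflecting the survey’s weighting rather than an error in the release.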

[Also] The interviews in the most recent PBS NewsHour/NPR/Marist poll were conducted only in English. . . .

They also report on other polls:

According to a Pew Research Center survey from October, only 22 percent of Latinos said they approved of Trump’s job as president, while 69 percent said they disapproved. . . . Pew has previously explained how language barriers and cultural differences could affect Latinos’ responses in surveys.

That pretty much covers it. But then the question arises: Why did NPR and PBS commission this poll in the first place? There are only a few zillion polls every month on presidential approval. What’s the point of another poll? Well, for one thing, this gets your news organization talked about. They got a tweet from the president, they’re getting blogged about, etc. But is that really what you want to be doing as a news organization: putting out sloppy numbers, getting lots of publicity, then having to laboriously correct the record? Maybe just report some news instead. There’s enough polling out there already.

Miscreant’s Way

We went to Peter Luger then took the train back . . . Walking through Williamsburg, everyone looked like a Daniel Clowes character.