
Social science plaig update

OK, we got two items for you, one in political science and one in history. Both are updates on cases we’ve discussed in the past on this blog. I have no personal connection to any of the people involved; my only interest is annoyance at the ways in which plagiarism pollutes scientific understanding and the ways in which third parties go to great efforts to protect plagiarists, I assume because of the desire to avoid bad publicity for their organizations.

1. The award for the political science book

Frank Fischer is a professor of political science who was caught copying big blocks of text (with minor modifications) from others' writings without attribution.

Last November I received the following email from a political scientist who wishes to remain anonymous:

Your recent posts on plagiarism refreshed my memory about Frank Fischer.
You and Basboll already wrote on his clear cut case long ago. So I was totally taken aback when Fischer received the Aaron Wildavsky Enduring Contribution Award from the American Political Science Association.

They even cited his plagiarized work in the announcement.

Bottom line: Congrats to Fischer for getting plagiarized work recognized as an enduring contribution to the field!

A blog post would surely boost attention to the problem, but I would like to remain anonymous.

My correspondent told me that “lots of senior people in the discipline” discouraged him from getting involved in this one.

Hey, I knew Aaron Wildavsky a bit. Not personally, but we were both professors at Berkeley at the same time and I saw him in some seminars. He was a bit of a wild man, kinda out of control in the way he’d react or overreact to things that people said.

Anyway, rather than blog the above item directly, I thought I'd approach the relevant committee, so I sent off an email to the APSA Public Policy Section recommending that, if they don't want to retract their 2017 Aaron Wildavsky Enduring Contribution Award to Frank Fischer for his 2003 book Reframing Public Policy, they also give it to Giandomenico Majone and David Walsh, who wrote two of the works from which Fischer copied without appropriate attribution:

Majone, G., 1989. Evidence, Argument, and Persuasion in the Policy Process. New Haven: Yale University Press.

Walsh, D., 1972. Sociology and the Social World. In: Filmer, Paul, Phillipson, Michael, Silverman, David and Walsh, David, New Directions in Sociological Theory. London, Collier-Macmillan: 15-35. [Also published by MIT Press, Cambridge, Mass., 1973.]

I also emphasized that I am not an expert in this area and have no intention of pursuing any formal process here. I just wanted to make the committee aware of the situation so that they would have the opportunity to fix it.

Pretty stunning that the APSA gave out that Enduring Contribution Award, several years after the copying-without-attribution came out. And the story was no secret: it appeared in the Chronicle of Higher Education.

2. The award for the history book

Balazs Szalontai updates us on this story from a couple years ago:

You may have heard the news that Professor Armstrong is to retire from Columbia in 2020.

The way the Columbia administration handled the case perfectly justified the concerns that you had expressed here. The Standing Committee on the Conduct of Research drastically whittled down the original recommendations of the Investigation Committee, effectively depriving me of any public vindication. First they vetoed the idea that Armstrong should formally acknowledge his misconduct and that the plagiarized book be withdrawn, and later they discarded the public statement, too.

I replied: As with his namesake Lance, the problem was not just the violation of norms but also his use of his privileged position to attack people who pointed out what he was doing.

Szalontai pointed me to this bit from a review by Armstrong published in 2011 in the Journal of Asian Studies:

Stone-cold history prof Armstrong uses his gatekeeper role to patronizingly dismiss Szalontai's work. As Szalontai puts it, it's "as if a fabricated Russian source was more valuable than a genuine Hungarian source." In a 2006 article, though, Armstrong described Szalontai's book as "extensively researched, impressively detailed and insightful" and said it "does a great service to the fields of Korean studies, Cold War history and the history of communist regimes."

Armstrong even wrote that Szalontai’s book “should be required reading for anyone curious about the workings of this reclusive country [North Korea] . . . filled with fascinating bits of information.” Good enough to plagiarize, I guess!

The country club mentality

Fischer and Armstrong are pretty obscure figures, even within social-science academia. The only reason I’d heard of either of them was from the plagiarism scandals.

It’s no surprise that writers plagiarize: writing a book is hard work, it’s less hard if you copy others’ work, and, conditional on copying others’ work, you’ll get more credit if you don’t cite where it’s coming from. It’s very simple: you want the credit without putting in the effort, so you cheat.

What’s really bad is when the cheaters do a Lance Armstrong and attack the people who reveal the problem. When engaging in this attack on truth-tellers, the cheaters often play the Javert card, acting as if it’s completely fine to plagiarize, and that their critics are obsessed weirdos. It’s as if all the people that matter are buddies at a country club, and they have to deal with impertinent caddies who call them out on every damn mulligan. They may get even more annoyed at people like Sokal and me who are members of the club but still side with the caddies.

The real lesson learned from those academic hoaxes: a key part of getting a paper published in a scholarly journal is to be able to follow the conventions of the journal. And some people happen to be good at that, irrespective of the content of the papers being submitted.

I wrote this email to a colleague:

Someone pointed me to this paper. It’s really bad. It was published by The Review of Environmental Economics and Policy, “the official journal of the Association of Environmental and Resource Economists and the European Association of Environmental and Resource Economists.” Is this a real organization? The whole thing seems like a mess to me. I understand that journal editors can find it difficult to get good submissions, but couldn’t they just publish fewer papers? They must be pretty desperate for articles if they’re publishing this sort of thing. Any insight on this would be appreciated.

My colleague responded:

No idea why it got published – REEP is a policy journal and my guess is the editor felt inclined to give some space to “the other side.”

I was thinking about this, and it seems like a key part of getting a paper published in a scholarly journal is to be able to follow the conventions of the journal. I guess that the author of the above-linked article is really good at that.

Writing a paper in the style of a particular academic field—that’s a skill in itself, nearly independent of the content of what’s being written.

And that got me thinking about all sorts of things that get published. I’m not thinking of bad papers such as discussed above, or frauds (you can’t expect reviewers to do the sleuthing to find that) or possibly-frauds-possibly-just-big-sloppy-messes such as those Pizzagate articles, or big claims backed by weak evidence, or empirical papers with statistics errors, or run-of-the-mill low-quality work that fits the preconceptions of the journal editors.

No, here I’m talking about those hoax articles, the ones where a team of authors constructs a paper with zero scientific or scholarly content but which is carefully constructed to be written in the style of a particular journal. The journal is the target, and the goal is publication—and then the later publicity to be had by revealing that the journal got hoaxed.

For some people, this is their entire career, or at least much of it. Not hoaxing, but writing papers with very little content, papers whose main characteristic is that they’re written in the style that’s acceptable to certain journal editors.

It’s a Turing test kind of thing. Or another analogy would be old-style hack writers who could get their crappy books published. Although that’s a bit different because these books had actual readers. Then again, Richard Tol (author of the “gremlins” papers discussed above) must have thousands of readers too, as his papers have been cited tens of thousands of times.

Anyway, here’s my point. We talk a lot about what appears in scientific journals and what scientific papers have media exposure and policy influence. But a key thing seems to be this orthogonal factor, which is the ability of some authors to craft just about anything into a publishable paper in field X.

I have this skill in the field of applied statistics. But I think I use the skill in a beneficial way, to publish papers that are interesting and useful. Other people can use this skill to push propaganda, or just to promote their own careers. And some people are so good at this that they overdo it, as with Bruno “Arrow’s other theorem” Frey.

The point

OK, here's the deal. The key lesson from the hoaxes of Sokal etc. is not that the academic humanities is crap, or that postmodernism is a scam, or whatever, but rather that a large part of getting published, in almost any venue, is about form, not content. If Alan Sokal had never been born, we could see it from the papers of Richard Tol. Being able to write a paper that will get published is a skill, quite separate from the content of the paper. It's cool that Sokal etc. have that skill, but I feel like we've all been missing the point on this. Until now. So we have Tol to thank for something.

“Here’s an interesting story right in your sweet spot”

Jonathan Falk writes:

Here’s an interesting story right in your sweet spot:

Large effects from something whose possible effects couldn’t be that large? Check.
Finding something in a sample of 1024 people that requires 34,000 to gain adequate power? Check.
Misuse of p-values? Check.
Science journalist hype? Check.
Searching for the cause of an effect that isn’t real enough to matter? Check.
Searching for random associatedness? (Nostalgia-proneness?) Check.
Multiple negative studies ignored? Check.

And some great observations from Scott Alexander [author of the linked post]:

First, what bothers me isn’t just that people said 5-HTTLPR mattered and it didn’t. It’s that we built whole imaginary edifices, whole castles in the air on top of this idea of 5-HTTLPR mattering. We “figured out” how 5-HTTLPR exerted its effects, what parts of the brain it was active in, what sorts of things it interacted with, how its effects were enhanced or suppressed by the effects of other imaginary depression genes. This isn’t just an explorer coming back from the Orient and claiming there are unicorns there. It’s the explorer describing the life cycle of unicorns, what unicorns eat, all the different subspecies of unicorn, which cuts of unicorn meat are tastiest, and a blow-by-blow account of a wrestling match between unicorns and Bigfoot.

Alexander links to this letter by Nina Rieckmann, Michael Rapp, and Jacqueline Müller-Nordhorn, published in 2009 in the Journal of the American Medical Association:

Dr Risch and colleagues concluded that the results of a study showing that the serotonin transporter gene (5-HTTLPR) genotype moderates the effect of stressful life events on the risk of depression could not be replicated in a meta-analysis of 14 studies. The authors pointed out the importance of replication studies before new findings are translated into clinical and health practices. We believe that it is also important to note that editorial practices of scientific journals may contribute to the lack of attention received by studies that fail to replicate original findings.

The original study was published in 2003 in Science, a prominent journal with a very high impact factor that year. In the year following its publication, it was cited 110 times in sources indexed in the Web of Science citation report. In 2005, the first study that failed to replicate the original finding in a sample of 1091 participants was published in Psychological Medicine, a specialized journal with a relatively low impact factor. That study was cited 24 times in the following year.

We believe that unless editors actively encourage the submission of null findings, replication studies, and contradictory results alike, the premature uncritical adoption of new findings will continue to influence the way resources are allocated in research and clinical practice settings. Studies that do not replicate an important finding and that meet high methodological standards should be considered for publication by influential journals at the same level as the respective original reports. This will encourage researchers to conduct replication studies and to make primary data easily accessible.

I can’t really comment on the substance or the statistics here, since I haven’t read any of these articles, nor do I have any understanding of the arguments from genetics.

Setting all that aside, it's striking (a) that Rieckmann et al. made this impassioned speech back in 2009, well before the general awareness of replication issues in science, and (b) that this was all 10 years ago yet it still seems to be a live issue, as Alexander is writing about it now. Rieckmann et al. must find all this very frustrating.

Jonathan pointed me to this story in May, and I informed him that this post would appear in Oct. Given that the problem was pointed out 10 years ago (in JAMA, no less!), I guess there’s no rush.

P.S. More on this from savvy science writer Ed Yong.

The status-reversal heuristic

Awhile ago we came up with the time-reversal heuristic, which was a reaction to the common situation that there’s a noisy study, followed by an unsuccessful replication, but all sorts of people want to take the original claim as the baseline and construct high walls to make it difficult to move away from that claim. The time-reversal heuristic is to imagine the two studies in reverse order: First a large and careful study that finds nothing of interest, then a small noisy replication whose authors fish around in the data and find an unexpected statistically significant result. The idea is to remove the “research incumbency effect” and to consider each study on its own merits.

Recently we discussed something similar, a status-reversal heuristic:

Sometimes I think the world would be a better place if, every time an economist or a journalist saw a published claim by an economist, they were to be told that the research had been performed by a sociologist, or an anthropologist. This would induce in them an appropriate level of skepticism.

Similarly with medical research: Suppose that every time a doctor or a journalist saw a published claim by a M.D., they were told that the research had been done by a nurse, or a social worker. Again, then maybe they’d be appropriately skeptical.

Oooh, I like this game. Here’s another one: Every time you see a paper by a Harvard professor, mentally change the affiliation to State U. Then you’ll be appropriately skeptical.

And, every time you see a paper endorsed by a member of the National Academy of Sciences . . . ummm, I guess that one’s ok, we already know not to believe it!

Arguments against the status-reversal heuristic, and responses to those arguments

I can think of two arguments against. (Maybe you can think of more—that’s what the comments section is for, to explain to me how wrong I am!)

1. It’s kind of weird to see me advocating this status-reversal heuristic, as I have all sorts of status: the Ph.D., the Ivy League faculty position, the access to national media, etc. I’ve even published in the Proceedings of the National Academy of Sciences!

In response, all I can say is: Sure, the status-reversal argument can apply to me too. Feel free to evaluate this post, and other things I write, as if I had no pre-existing status. I think this post, and what we have on this blog more generally, holds up fine under the status-reversal argument.

2. The status-reversal heuristic is anti-Bayesian. Status provides relevant prior information. Sure, some Ivy League professors are blowhards at best and frauds at worst, but lots of us have done good work—indeed, that’s what can get us to the Ivies in the first place. Similarly, if economists have more default credibility than sociologists, maybe there’s a reason for that.

I have a few responses here. First, sometimes I don't think status adds any information at all. For example, if the topic is medical policy, I don't see why we should expect an M.D. to have a more informed opinion than a nurse or a social worker. Indeed, the M.D. could well be less informed, if he or she has been trained to ignore the opinions of non-M.D.'s. Second, even in areas where status is correlated with expertise, I'm guessing that people have overweighted the prior information associated with that status. So removing that weighting can be a step forward. Third, status can be abused. Here I'm thinking of journals such as New England Journal of Medicine and Lancet that will publish bad research that pushes a political agenda, or PNAS, which publishes bad research that pushes a particular set of scientific theories.

My main argument here is the second one I just gave: Even in areas where status is correlated with expertise, I’m guessing that people have overweighted the prior information associated with that status. So removing that weighting can be a step forward. Perhaps this could be studied empirically in some way. I’m not quite sure how, but maybe there’s a way using prediction markets? We could ask Anna Dreber. She doesn’t teach at Harvard, but she’s an economist and she’s coauthored with some famous people, so there’s that.

My talk on visualization and data science this Sunday 9am

Uncovering Principles of Statistical Visualization

Visualizations are central to good statistical workflow, but it has been difficult to establish general principles governing their use. We will try to back out some principles of visualization by considering examples of effective and ineffective uses of graphics in our own applied research. We consider connections between three goals of visualization: (a) vividly displaying results, (b) exploration of unexpected patterns in data, and (c) understanding fitted models.

The virtue of fake universes: A purposeful and safe way to explain empirical inference.

I keep being drawn to thinking there is a way to explain statistical reasoning to others that will actually do more good than harm. Now, I also keep thinking I should know better – but can't stop.

My recent attempt starts with a shadow metaphor, then a review of analytical chemistry, and moves to the concept of abstract fake universes (AFUs). AFUs allow you to be the god of a universe, though not a real one ;-). Rather, it is one you can conveniently define using probability models, where it is easy to discern what would repeatedly happen given an exactly set truth.

The shadow metaphor emphasizes that though you see shadows, you are really interested in what is casting them. The analytical chemistry metaphor emphasizes the advantage of creating exactly known truths by spiking a set amount of a chemical into test tubes and repeatedly measuring the test tube contents with inherently noisy assays. For many empirical questions such spiking is not possible (e.g. underlying cancer incidence), so we have no choice but to think abstractly. Now, abstractly, a probability model is a shadow-generating machine: with set parameter values it can generate shadows, or rather samples. It therefore seems advantageous to think of probability models as an ideal means to make AFUs with exactly set truths, where it is easy to discern what would repeatedly happen.

Now, my enthusiasm is buoyed by the realization that one of the best routes for doing that is the prior predictive. The prior predictive generates a large set of hopefully appropriate fake universes: in each one you know the truth (the parameter values drawn from the prior), you have the fake sample that the data-generating model produced from those values, and you can discern what the posterior would have been calculated to be (given the joint model proposed for the analysis and the fake data generated). Immediately (given computation time) one obtains a large sample of what would repeatedly happen using the proposed joint model in varied fake universes. Various measures of goodness can then be assessed and various averages calculated.
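To make this concrete, here is a minimal sketch of the kind of fake-universe exercise described above (my own illustration, not code from the lecture). It uses a conjugate normal model so the posterior is available in closed form; the sample size, noise level, and the choice of interval coverage as the "measure of goodness" are all arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2019)

# Conjugate normal model, so each fake universe's posterior is available in closed form.
# Prior: mu ~ N(0, 1).  Data model: y_i | mu ~ N(mu, sigma), with sigma known.
n, sigma, prior_sd = 20, 2.0, 1.0
n_universes = 5000

coverage = 0
for _ in range(n_universes):
    mu_true = rng.normal(0, prior_sd)        # the exactly set truth in this fake universe
    y = rng.normal(mu_true, sigma, size=n)   # the shadows: a fake sample cast by that truth
    post_prec = 1 / prior_sd**2 + n / sigma**2
    post_mean = (y.sum() / sigma**2) / post_prec
    post_sd = np.sqrt(1 / post_prec)
    # What would repeatedly happen: does a central 90% posterior interval catch the truth?
    lo, hi = post_mean - 1.645 * post_sd, post_mean + 1.645 * post_sd
    coverage += (lo < mu_true < hi)

print("90% interval coverage over fake universes:", coverage / n_universes)
```

When the prior draws, the data-generating model, and the analysis model all match, the coverage should come out near the nominal 90%; the more interesting diagnostics come from deliberately mismatching them.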

Good for what? And in which appropriate collective of AFUs (aka Bayesian reference set)?

An earlier attempt of mine to do this in a lecture was recorded, and Bob Carpenter has kindly uploaded it as Lectures: Principled Bayesian Workflow—Practicing Safe Bayes (YouTube), Keith O'Rourke (2019). If you decide to watch it, I would suggest setting the playback speed to 1.25. For those who don't like videos, slides and code are here.

The rest of the post below provides some background material for those who may lack background in prior predictive simulation and two stage sampling to obtain a sample from the posterior.


A heart full of hatred: 8 schools edition

No; I was all horns and thorns
Sprung out fully formed, knock-kneed and upright
Joanna Newsom

Far be it from me to be accused of liking things. Let me, instead, present a corner of my hateful heart. (That is to say that I'm supposed to be doing a really complicated thing right now and I don't want to, so I'm going to scream into a void for a little while.)

The object of my ire: The 8-Schools problem.

Now, for those of you who aren’t familiar with the 8-schools problem, I suggest reading any paper by anyone who’s worked with Andrew (or has read BDA). It’s a classic.

So why hate on a classic?

Well, let me tell you. As you can well imagine, it’s because of a walrus.

I do not hate walruses (I only hate geese and alpacas: they both know what they did), but I do love metaphors. And while sometimes a walrus is just a walrus, in this case it definitely isn’t.

The walrus in question is the Horniman Walrus (please click the link to see my smooth boy!). The Horniman walrus is a mistake that you can see, for a modest fee, at a museum in South London.

The story goes like this: Back in the late 19th century someone killed a walrus, skinned it, hopefully did some other things to it, and sent it back to England to be stuffed and mounted. Now, it was the late 19th century and it turned out that the English taxidermist maybe didn’t know what a walrus looked like. (The museum’s website claims that “only a few people had ever seen a live walrus” at this point in history which, even for a museum, is really [expletive removed] white. Update: They have changed the text on the website! It now says “Over 100 years ago, not many people (outside of Artic regions) had ever seen a live walrus”!!!!!!!!!!)

But hey. He had sawdust. He had glue. He had other things that are needed to stuff and mount a dead animal. So he took his dead animal and his tools, introduced them to each other, and proudly displayed the results.

(Are you seeing the metaphor?)

Now, of course, this didn't go well. Walruses, if you've never seen one, are huge creatures with loose skin folds. The taxidermist did not know this, and so he stuffed the walrus full, leading to a photoshop disaster of a walrus. Smooth like a beachball. A glorious mistake. And a genuine tourist attraction.

So this is my first problem. Using a problem like 8 schools as a default test for algorithms has a tendency to lead to over-stuffed algorithms that are tailored to specific models. This is not a new problem. You could easily call it the NeurIPS Problem (aka how many more ways do you want to over-fit MNIST?). (Yes, I know NeurIPS has other problems as well. I'm focussing on this one.)

A different version of this problem is a complaint I remember from back in my former life when I cared about supercomputers. This was before the whole “maybe you can use big computers on data” revolution. In these dark times, the benchmarks that mattered were the speed at which you could multiply two massive dense matrices, and the speed at which you could do a dense LU decomposition of a massive matrix. Arguably neither of these things were even then the key use of high-performance computers, but as the metrics became goals, supercomputer architectures emerged that could only be used to their full capacity on very specialized problems that had enough arithmetic intensity to make use of the entire machine. (NB: This is quite possibly still true, although HPC has diversified from just talking about Cray-style architectures)

So my problem, I guess, is with benchmark problems in general.

A few other specific things:

Why so small? 8 Schools has 8 observations, which is not very many. We have moved beyond the point where we need to include the data in a table in the paper.

Why so meta? The weirdest thing about the 8 Schools problem is that it has the form

$$y_j \mid \mu_j \sim N(\mu_j, \sigma_j)$$
$$\mu_j \mid \mu, \tau \sim N(\mu, \tau)$$

with appropriate priors on $\mu$ and $\tau$. The thing here is that the observation standard deviations $\sigma_j$ are known. Why? Because this is basically a meta-analysis. So 8-schools is a very specialized version of a Gaussian multilevel model. By fixing the observation standard deviation, the model has a much nicer posterior than the equivalent model with an unknown observation standard deviation. Hence, 8-schools doesn't even test an algorithm on an ordinary linear mixed model.
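For concreteness, here is a minimal generative sketch of that structure (my own illustration; the hyperparameter values are invented and the standard errors are just plausible meta-analysis-scale numbers):

```python
import numpy as np

rng = np.random.default_rng(8)

# 8-schools-style meta-analysis: the observation sds sigma_j are data, not parameters.
J = 8
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])  # treated as known
mu, tau = 5.0, 8.0                                          # population mean and sd of effects

mu_j = rng.normal(mu, tau, size=J)   # mu_j | mu, tau ~ N(mu, tau)
y = rng.normal(mu_j, sigma)          # y_j | mu_j ~ N(mu_j, sigma_j)
print(np.round(y, 1))

# Fitting the model means estimating mu, tau, and the mu_j from (y, sigma) alone;
# nothing has to be learned about the observation noise, which is why the posterior
# is so much better behaved than that of a full mixed model.
```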

But it has a funnel! So does Radford Neal’s funnel distribution (in more than 17 dimensions). Sample from that instead.
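For reference, here is a sketch of the funnel in the parameterization I believe is intended (v ~ N(0, 3), x_k | v ~ N(0, exp(v/2)); the 17 dimensions are just a nod to the quip above):

```python
import numpy as np

rng = np.random.default_rng(17)

def sample_funnel(n_draws, dim=17):
    # Neal's funnel: v ~ N(0, 3), and each x_k | v ~ N(0, exp(v/2)).
    v = rng.normal(0.0, 3.0, size=n_draws)
    x = rng.normal(0.0, np.exp(v / 2.0)[:, None], size=(n_draws, dim))
    return v, x

v, x = sample_funnel(10_000)
# The conditional scale of x spans several orders of magnitude as v varies, which is
# exactly the geometry that breaks naive samplers and motivates non-centered tricks.
print(np.exp(v.min() / 2), np.exp(v.max() / 2))
```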

But it’s real data! Firstly, no it isn’t. You grabbed it out of a book. Secondly, the idea that testing inference algorithms on real data is somehow better than systematically testing on simulated data is just wrong. We’re supposed to be statisticians so let me ask you this: How does an algorithm’s success on real data set A generalize to the set of all possible data sets? (Hint: It doesn’t.)

So, in conclusion, I am really really really sick of seeing the 8-schools data set.

 

 

Postscript: There's something I want to clarify here: I am not saying that empirical results are not useful for evaluating inference algorithms. I'm saying that they're only useful if the computational experiments are clear. Experiments using well-designed simulated data are unbelievably important. Real data sets are not.

Why? Because real data sets are not indicative of data that you come across in practice. This is because of selection bias! Real data sets that are used to demonstrate algorithms come in two types:

  1. Data that is smooth and lovely (like 8-Schools or an over-stuffed walrus)
  2. Data that is pointy and unexpected (like StackLoss, which famously has an almost singular design matrix, or this excellent photo a friend of mine once took)

Basically this means that if you have any experience with a problem at all, you can find a data set that makes your method look good, or that demonstrates a flaw in a competing method, or makes your method look bad. But this choice is opaque to people who are not experts in the problem at hand. Well-designed computational experiments, on the other hand, are clear in their aims (e.g. this data has almost collinear covariates, or this data has outliers, or this data should be totally pooled).
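As a concrete illustration of what "clear in their aims" can look like, here is a sketch of three such simulated designs (my own recipes, with arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Design 1: nearly collinear covariates, a standard stress test for regression methods.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # correlation with x1 is close to 1
X_collinear = np.column_stack([x1, x2])

# Design 2: a clean linear relationship plus a handful of gross outliers in the response.
y_clean = 2.0 * x1 + rng.normal(size=n)
y_outliers = y_clean.copy()
y_outliers[rng.choice(n, size=5, replace=False)] += 25.0

# Design 3: group labels that carry no signal, so a multilevel model should pool completely.
groups = rng.integers(0, 10, size=n)
y_pooled = rng.normal(size=n)

print("corr(x1, x2) =", round(np.corrcoef(x1, x2)[0, 1], 4))
```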

Simulated data is clearer, more realistic, and more honest when evaluating or even demonstrating algorithms.

How to think scientifically about scientists’ proposals for fixing science

I kinda like this little article which I wrote a couple years ago while on the train from the airport. It will appear in the journal Socius. Here’s how it begins:

Science is in crisis. Any doubt about this status has surely been dispelled by the loud assurances to the contrary by various authority figures who are deeply invested in the current system and have written things such as, "Psychology is not in crisis, contrary to popular rumor . . . Crisis or no crisis, the field develops consensus about the most valuable insights . . . National panels will convene and caution scientists, reviewers, and editors to uphold standards." (Fiske, Schacter, and Taylor, 2016). When leaders go to that much trouble to insist there is no problem, it's only natural for outsiders to worry.

The present article is being written for a sociology journal, which is appropriate for two reasons. First, sociology includes the study of institutions and communities; modern science is both an institution and a community, and as such it would be of interest to me as a citizen and a political scientist, even beyond my direct involvement as a practicing researcher. Second, sociology has a tradition of questioning; it is a field from whose luminaries I hope never to hear platitudes such as “Crisis or no crisis, the field develops consensus about the most valuable insights.” Sociology, like statistics and political science, is inherently accepting of uncertainty and variation. Following Karl Popper, Thomas Kuhn, Imre Lakatos, and Deborah Mayo, we cheerfully build our theories as tall and broad as we can, in the full awareness that reality will knock them down. We know that one of the key purposes of data analysis is to “kill our darlings,” and we also know that the more specific we make our models, the more we learn from their rejection. Structured modeling and thick description go together.

Just as we learn in a local way from our modeling failures, we can learn more globally from crises in entire subfields of science. When I say that the replication crisis is also an opportunity, this is more than a fortune-cookie cliche; it is also a recognition that when a group of people make a series of bad decisions, this motivates a search for what went wrong in their decision-making process.

A full discussion of the crisis in science would include three parts:

1. Evidence that science is indeed in crisis: at the very least, a series of examples of prominent products of mainstream science that were seriously flawed but still strongly promoted by the scientific community, and some evidence or at least speculation that such problems are prevalent enough to be worth our concern.

2. A discussion of what has gone wrong in the ideas and methods of scientific inquiry and in the process by which scientific claims are promoted and disseminated within the community and the larger society. This discussion could include specific concerns about statistical methods such as null hypothesis significance testing, and also institutional issues such as the increasing pressure on research to publish large numbers of articles.

3. Proposed solutions, which again range from research methods (for example, the suggestion to perform within-person, rather than between-person, comparisons wherever possible) to rules such as preregistration of hypotheses, to changes in the system of scientific publication and credit.

I and others have written enough on topics 1 and 2, and since this article has been solicited for a collection on Fixing Science, I’ll restrict my attention to topic 3: what to do about the problem?

I then continue:

If you’ve gone to the trouble to pick up (or click on) this volume in the first place, you’ve probably already seen, somewhere or another, most of the ideas I could possibly propose on how science should be fixed. My focus here will not be on the suggestions themselves but rather on what are our reasons for thinking these proposed innovations might be good ideas. The unfortunate paradox is that the very aspects of “junk science” that we so properly criticize—the reliance on indirect, highly variable measurements from nonrepresentative samples, open-ended data analysis, followed up by grandiose conclusions and emphatic policy recommendations drawn from questionable data— all seem to occur when we suggest our own improvements to the system. All our carefully-held principles seem to evaporate when our emotions get engaged. . . .

After some discussion of potential solutions, I conclude:

The foregoing review is intended to be thought provoking, but not nihilistic. One of the most important statistical lessons from the recent replication crisis is that certainty or even near-certainty is harder to come by than most of us had imagined. We need to make some decisions in any case, and as the saying goes, deciding to do nothing is itself a decision. Just as an anxious job-interview candidate might well decide to chill out with some deep breaths, full-body stretches, and a power pose, those of us within the scientific community have to make use of whatever ideas are nearby, in order to make the micro-decisions that, in the aggregate, drive much of the directions of science. And, when considering larger ideas, proposals for educational requirements or recommendations for new default statistical or research methods or reorganizations of the publishing system, we need to recognize that our decisions will necessarily rely much more on logic and theory than on direct empirical evidence. This suggests in turn that our reasoning be transparent and openly connected to the goals and theories that motivate and guide our attempts toward fixing science.

It’s fun, writing an article like this from first principles, with no position to defend, just trying to think things through.

Rachel Tanur Memorial Prize for Visual Sociology

Judith Tanur writes:

The Rachel Tanur Memorial Prize for Visual Sociology recognizes students in the social sciences who incorporate visual analysis in their work. The contest is open worldwide to undergraduate and graduate students (majoring in any social science). It is named for Rachel Dorothy Tanur (1958–2002), an urban planner and lawyer who cared deeply about people and their lives and was an acute observer of living conditions and human relationships.

The 2020 Rachel Tanur Memorial Prize is now open for applications. Entries for the 2020 competition must be received by January 22, 2020. Winners will be notified by March 30, 2020. Up to three cash prizes will be awarded at the IV International Sociological Association (ISA) Forum of Sociology, “Challenges of the 21st Century: Democracy, Environment, Inequalities, Intersectionality,” to be held in Porto Alegre, Brazil on July 14-18, 2020. Attendance at the forum is not a requirement but is encouraged. Prizes, supported by the Mark Family Foundation, will be awarded by the Research Committee on Visual Sociology of the ISA. The first prize will be $2,500 USD, the second $1,500, and the third $500. The prize is awarded biennially. For more information and to apply please go to racheltanurmemorialprize.org

Poetry corner

When presenting a new method, talk about its failure modes.

A coauthor writes:

I really like the paper [we are writing] as it is. My only criticism of it perhaps would be that we present this great new method and discuss all of its merits, but we do not really discuss when it fails / what its downsides are. Are there any cases where the traditional analyses or some other analysis are more appropriate? Should we say that in the main body of the paper?

Good point! I’m gonna add a section to the paper called Failure Modes or something like that, to explore where our method makes things worse.

And, no, this is not the same as the traditional "Threats to Validity" section. The Threats to Validity section, like the Robustness Checks section, is typically a joke in that the purpose is usually not to explore potential problems but rather to rule out potential objections.

Now, don't get me wrong, I love our recommended method and so does my coauthor. It's actually hard for us to think of examples where our approach would be worse than what people were doing before. But I'll think about it. I agree we should write something about failure modes.

I love my collaborators. They’re just great.

The best is the enemy of the good. It is also the enemy of the not so good.

This post is by Phil Price, not Andrew.

The Ocean Cleanup Project’s device to clean up plastic from the Great Pacific Garbage Patch is back in the news because it is back at work and is successfully collecting plastic. A bunch of my friends are pretty happy about it and have said so on social media…and it drives me nuts. The machine might be OK but it makes no sense to put it way out in the Pacific.  Someone asked why not, and here’s what I wrote:

Suppose I have a machine that removes plastic from all of the water it encounters. I offer you a choice: you can put it in a location where it will remove 1 ton per month — the Pacific Garbage Patch — or in a location where it will remove 10 tons per month (let's say that's the Gulf of Thailand but in fact I do not know where the best place would be). Obviously you will put it where it can remove 10 tons per month. Now you raise money to build and operate a second machine. You put your first machine in the best place you could find, so do you now put your second machine in the Pacific Garbage Patch? You shouldn't, if your goal is to remove as much plastic from the ocean as possible: you should put it in the best place where you don't already have a machine…the Bay of Bengal, maybe. Or maybe it, too, should go in the Gulf of Thailand. Or maybe in the Caribbean. I have no idea where the plastic concentrations are highest, but I know it is not the Great Pacific Garbage Patch. At any rate you should put the first machine where it will remove the most plastic per month; the second machine in the best remaining place after you have installed the first one; the third machine in the best remaining place after you have installed the first two; and so on. The Pacific Garbage Patch isn't literally the last place you should install a machine, but it is way way down the list. (If you know in advance that you are going to build a lot of machines, you can optimize the joint placement of all of them and you might come up with a slightly different answer, but let's not worry about that detail.)

The paragraph above assumes that you are just trying to remove as much plastic from the ocean as possible. If you have some other goal then of course the answer could be different. For instance, if you are trying to reduce the amount of plastic at some specific spot in the middle of the Pacific, you should put your machine at that spot even if it won’t get you very much in terms of plastic removed per month.

That paragraph also implicitly assumes the cost of installing and operating the machine is the same everywhere. If it is very expensive to install and operate the machine in the Gulf of Thailand, then maybe you’d be better off somewhere else: for the same money as one machine in the place where it would maximize the plastic removal per month, maybe I could build two machines and install them in cheaper places where they would combine to remove more plastic. It becomes an optimization problem. But: I have never seen anyone, not even the project proponents, who thinks the middle of the Pacific is a relatively _cheap_ place to install and operate a machine: in fact it is very expensive because it is so remote.
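To make the allocation logic concrete, here is a minimal sketch (invented site names, rates, and costs, purely for illustration): rank candidate sites by tons removed per month, or by tons per unit cost once operating costs differ, and fill the top slots first.

```python
# Greedy placement sketch with invented numbers: tons of plastic removed per month
# at each candidate site, and a rough relative cost of operating a machine there.
sites = {
    "Gulf of Thailand": {"tons": 10.0, "cost": 1.0},
    "Bay of Bengal": {"tons": 8.0, "cost": 1.0},
    "Caribbean": {"tons": 6.0, "cost": 1.2},
    "Great Pacific Garbage Patch": {"tons": 1.0, "cost": 3.0},  # remote, hence expensive
}

def rank_sites(sites, per_unit_cost=False):
    # Rank by raw removal rate, or by removal per unit cost if budgets matter.
    score = lambda s: s["tons"] / s["cost"] if per_unit_cost else s["tons"]
    return sorted(sites.items(), key=lambda kv: score(kv[1]), reverse=True)

for name, info in rank_sites(sites, per_unit_cost=True)[:3]:
    print(name, round(info["tons"] / info["cost"], 2), "tons per month per unit cost")
# Under either ranking the Garbage Patch comes last: low concentration and high cost.
```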

And of course the situation gets even more complicated when you consider other factors like whether you will interfere with fishing or with ship traffic, what effect will the machine have on the marine ecosystem, are you inside or outside a nation’s territorial waters, and so on.

Choosing the best place for your first, second, third, fourth, fifth, sixth,… machine might be complicated, but I have not seen any reasonable argument for why the Pacific Garbage Patch is even in the running. It just doesn’t make sense.

I am in agreement with…uh, I think it was Darrell Huff (author of “How to Lie with Statistics”) who made this point, but I could be wrong… when he said that the more important something is, the more important it is to be rational about it. If you’re trying to save human lives, for example, anything other than the most efficient allocation of resources is literally killing people. So to the extent that it is important to people to remove plastic from the oceans, it’s important to allocate resources efficiently. But, much as we would like to think it is important to people and therefore should be done as efficiently as possible, in fact people are often not rational. It may be the case that people are willing to contribute much much more money, time, and energy to a program to remove plastic from the ocean inefficiently than to one that would do so efficiently. If people are willing to contribute to remove plastic from the Pacific Garbage Patch but not from anywhere else, well, OK, put your machine in the Pacific Garbage Patch. So I’m not saying people shouldn’t do this project. I’m just saying it doesn’t make sense. That is, sadly, not the same thing.

 

This post is by Phil, not Andrew

On the term “self-appointed” . . .

I was reflecting on what bugs me so much about people using the term “self-appointed” (for example, when disparaging “self-appointed data police” or “self-appointed chess historians”).

The obvious question when someone talks about “self-appointed” whatever is, Who self-appointed you to decide who is illegitimately self-appointed?

But my larger concern is with the idea that being a self-appointed whatever is a bad thing. Consider the alternative, which is to be appointed by some king or queen or governmental body or whatever. That wouldn’t do much to foster a culture of openness, would it? First, the kind of people who are appointed would be those who don’t offend the king/queen/government/etc, or else they’d need to hide their true colors until getting that appointment. Second, by restricting yourself to criticism coming from people with official appointments, you’re shutting out the vast majority of potential sources of valuable criticism.

Let’s consider the two examples above.

1. “Self-appointed data police.” To paraphrase Thomas Basboll, there are no data police. In any case, data should be available to all (except in cases of trade secrets, national security, confidentiality, etc.), and anyone should be able to “appoint themselves” the right to criticize data analyses.

2. “Self-appointed chess historians.” This one’s even funnier in that I don’t think there are any official chess historians. Here’s a list, but it includes one of the people criticized in the above quote as being “self-appointed” so that won’t really work.

So, next time you hear someone complain about “self-appointed” bla bla, consider the alternative . . . Should criticism only be allowed from those who have been officially appointed? That’s a recipe for disaster.

And, regarding questions regarding the personal motivations of critics (calling them “terrorists” etc.), recall the Javert paradox.

Dan’s Paper Corner: Yes! It does work!

Only share my research
With sick lab rats like me
Trapped behind the beakers
And the Erlenmeyer flasks
Cut off from the world, I may not ever get free
But I may
One day
Trying to find
An antidote for strychnine — The Mountain Goats

Hi everyone! Hope you’re enjoying Peak Libra Season! I’m bringing my Air Sign goodness to another edition of Dan’s Paper Corner, which is a corner that I have covered in papers I really like.

And honestly, this one is mostly cheating. Two reasons really. First, it says nice things about the work Yuling, Aki, Andrew, and I did and then proceeds to do something much better. And second because one of the authors is Tamara Broderick, who I really admire and who’s been on an absolute tear recently.

Tamara—often working with the fabulous Trevor Campbell (who has the good grace to be Canadian), the stunning Jonathan Huggins (who also might be Canadian? What am I? The national register of people who are Canadian?), and the unimpeachable Ryan Giordano (again. Canadian? Who could know?)—has written a pile of my absolute favourite recent papers on Bayesian modelling and Bayesian computation.

Here are some of my favourite topics:

As I say, Tamara and her team of grad students, postdocs, and co-authors have been on one hell of a run!

Which brings me to today’s paper: Practical Posterior Error Bounds from Variational Objectives by Jonathan Huggins, Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick.

In the grand tradition of Dan’s Paper Corner, I’m not going to say much about this paper except that it’s really nice and well worth reading if you care about asking “Yes, but did it work?” for variational inference.

I will say that this paper is amazing and covers a tonne of ground. It’s fully possible that someone reading this paper for the first time won’t recognize how unbelievably practical it is. It is not trying to convince you that its new melon baller will ball melons faster and smoother than your old melon baller. Instead it stakes out much bolder ground: this paper provides a rigorous and justified and practical workflow for using variational inference to solve a real statistical problem.

I have some approximately sequential comments below, but I cannot stress this enough: this is the best type of paper. I really like it. And while it may be of less general interest than last time’s general theory of scientific discovery, it is of enormous practical value. Hold this paper close to your hearts!

  • On a personal note, they demonstrate that the idea in the paper Yuling, Aki, Andrew, and I wrote is good for telling when variational posteriors are bad, but the k-hat diagnostic being small does not necessarily mean that the variational posterior will be good. (And, tbh, that’s why we recommended polishing it with importance sampling)
  • But that puts us in good company, because they show that neither the KL divergence that's used in deriving the ELBO nor the Renyi divergence is a particularly good measure of the quality of the solution.
  • The first of these is not all that surprising. I think it’s been long acknowledged that the KL divergence used to derive variational posteriors is the wrong way around!
  • I do love the Wasserstein distance (or as an extremely pissy footnote in my copy of Bogachev's glorious two volume treatise on measure theory insists: the Kantorovich-Rubinstein metric). It's so strong. I think it does CrossFit. (Side note: I saw a fabulous version of A Streetcar Named Desire in Toronto [Runs til Oct 27] last week and really it must be so much easier to find decent Stanleys since CrossFit became a thing.)
  • The Hellinger distance is strong too and will also control the moments (under some conditions. See Lemma 6.3.7 of Andrew Stuart’s encyclopedia)
  • Reading the paper sequentially, I get to Lemma 4.2 and think "ooh. that could be very loose". And then I get excited about minimizing over $\eta$ in Theorem 4.3 because I contain multitudes.
  • Maybe my one point of slight disagreement with this paper is where they agree with our paper. Because, as I said, I contain multitudes. They point out that it's useful to polish VI estimates with importance sampling, but argue that they can compute their estimate of VI error instead of k-hat. I'd argue that you need to compute both because just like we didn't show that small k-hat guarantees a good variational posterior, they don't show that a good approximate upper bound on the Wasserstein distance guarantees that importance sampling will work. So ha! (In particular, Chatterjee and Diaconis argue very strongly, as does MacKay in his book, that the variance of an importance sampler being finite is somewhere near meaningless as a practical guarantee that an importance sampler actually works in moderate to high dimensions.) A rough sketch of this kind of tail diagnostic appears just after this list.
  • But that is nought but a minor quibble, because I completely and absolutely agree with the workflow for Variational Inference that they propose in Section 4.3.
  • Let’s not kid ourselves here. The technical tools in this paper are really nice.
  • There is not a single example I hate more than the 8 schools problem. It is the MNIST of hierarchical modelling. Here's hoping it doesn't have any special features that make it a bad generic example of how things work!
  • That said, it definitely shows that k-hat isn’t enough to guarantee good posterior behaviour.
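Since k-hat and importance-sampling diagnostics come up repeatedly above, here is a rough sketch of the flavour of tail diagnostic being discussed (a toy illustration only, not the actual PSIS procedure of Vehtari et al., which selects and smooths the tail differently):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Toy setup: the "posterior" is a Student-t with 5 df; the "variational" approximation
# is a plain normal.  Draw from the approximation and inspect the importance ratios.
S = 20_000
draws = rng.normal(0.0, 1.0, size=S)
log_ratio = stats.t.logpdf(draws, df=5) - stats.norm.logpdf(draws)
ratios = np.exp(log_ratio - log_ratio.max())     # rescaled for numerical stability

# Crude tail diagnostic in the spirit of k-hat: fit a generalized Pareto to the
# exceedances over a high quantile and look at the shape parameter.
threshold = np.quantile(ratios, 0.95)
exceedances = ratios[ratios > threshold] - threshold
shape, _, _ = stats.genpareto.fit(exceedances, floc=0)
print("estimated tail shape (k-hat-ish):", round(shape, 2))
```

Because the normal approximation is lighter-tailed than the target, the fitted shape should come out large, which is exactly the regime where plain importance sampling gets into trouble.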

Anyway. Here's to more papers like this and to fewer examples of what the late, great David Berman referred to as “ceaseless feasts of schadenfreude.”

What’s the p-value good for: I answer some questions.

Martin King writes:

For a couple of decades (from about 1988 to 2006) I was employed as a support statistician, and became very interested in the p-value issue; hence my interest in your contribution to this debate. (I am not familiar with the p-value ‘reconciliation’ literature, as published after about 2005.) I would hugely appreciate it, if you might find the time to comment further on some of the questions listed in this document.

I would be particularly interested in learning more about your views on strict Neyman-Pearson hypothesis testing, based on critical values (critical regions), given an insistence on power calculations among research funding organisations (i.e., first section headed ‘p-value thresholds’), and the long-standing recommendation that biomedical researchers should focus on confidence intervals instead of p-values (i.e., penultimate section headed ‘estimation and confidence intervals’).

Here are some excerpts from King’s document that I will respond to:

My main question is about ‘dichotomous thinking’ and p-value thresholds. McShane and Gal (2017, page 888) refers to “dichotomous thinking and similar errors”. Is it correct to say that dichotomous thinking is an error? . . .

If funding bodies insist on strict hypothesis testing (otherwise why the insistence on power analysis, as opposed to some other assessment of adequate precision), is it fair to criticise researchers for obeying the rules dictated by the method? In summary, before banning p-value thresholds, do you have to persuade the funding bodies to abandon their insistence on power calculations, and allow applicants more flexibility in showing that a proposed study has sufficient precision? . . .

This brings us to the second question regarding what should be taught in statistics courses, aimed at biomedical researchers. A teacher might want the freedom to design courses that assumes an ideal world in which statisticians and researchers are free to adopt a rational approach of their choice. Thus, a teacher might decide to drop frequentist methods (if she/he regards frequentist statistics a nonsense) and focus on the alternatives. But this creates a problem for the course recipients, if grant awarding bodies and journal editors insist on frequentist statistics? . . .

It is suggested (McShane et al. 2018) that researchers often fail to provide sufficient information on currently subordinate factors. I spent many years working in an experimental biomedical environment, and it is my impression that most experimental biomedical researchers do present this kind of information. (They do not spend time doing experiments that are not expected to work or collecting data that are not expected to yield useful and substantial information. It is my impression that some authors go to the extreme in attempting to present an argument for relevance and plausibility.) Do you have a specific literature in mind where it is common to see results offered with no regard for motivation, relevance, mechanism, plausibility etc. (apart from data dredging/data mining studies in which mechanism and plausibility might be elusive)? . . .

For many years it had not occurred to me that there is a distinction between looking at p-values (or any other measure of evidence) obtained as a participant in a research study, versus looking at third-party results given in some publication, because the latter have been through several unknown filters (researcher selection, significance filter etc). Although others had commented on this problem, it was your discussions on the significance filter that prompted me to fully realise the importance of this issue. Is it a fact that there is no mechanism by which readers can evaluate the strength of evidence in many published studies? I realise that pre-registration has been proposed as a partial solution to this problem. But it is my impression that, of necessity, much experimental and basic biomedical science research takes the form of an iterative and adaptive learning process, as outlined by Box and Tiao (pages 4-5), for example. I assume that many would find it difficult to see how pre-registration (with constant revision) would work in this context, without imposing a massive obstacle to making progress.

And now my response:

1. Yes, I think dichotomous frameworks are usually a mistake in science. With rare exceptions, I don’t think it makes sense to say that an effect is there or not there. Instead I’d say that effects vary.

Sometimes we don't have enough data to distinguish an effect from zero, and that can be a useful thing to say. Reporting that an effect is not statistically significant can be informative, but I don't think it should be taken as an indication that the true effect is zero; it just tells us that our data and model do not give us enough precision to distinguish the effect from zero.

2. Sometimes decisions have to be made. That’s fine. But then I think the decisions should be made based on estimated costs, benefits, and probabilities—not based on the tail-area probability with respect of a straw-man null hypothesis.

3. If scientists in the real world are required to do X, Y, and Z, then, yes, we should train them on how to do X, Y, and Z, but we should also explain why these actions can be counterproductive to larger goals of scientific discovery, public health, etc.

Perhaps a sports analogy will help. Suppose you’re a youth coach, and your players would like to play in an adult league that uses what you consider to be poor strategies. Short term, you need to teach your players these poor strategies so they can enter the league on the league’s terms. But you should also teach them the strategies that will ultimately be more effective so that, once they’ve established themselves, or if they happen to play with an enlightened coach, they can really shine.

4. Regarding “currently subordinate factors”: In many many of the examples we’ve discussed over the years on this blog, published papers do not include raw data or anything close to it, they don’t give details on what data were collected or how the data were processed or what data were excluded. Yes, there will be lots of discussion of motivation, relevance, mechanism, plausibility etc. of the theories, but not much thought about data quality. Some quick examples include the evolutionary psychology literature, where the days of peak fertility were mischaracterized or measurement of finger length was characterized as a measure of testosterone. There’s often a problem that data and measurements are really noisy, and authors of published papers (a) don’t even address the point and (b) don’t seem to think it matters, under the (fallacious) reasoning that, once you have achieved statistical significance, measurement error doesn’t matter.

5. Preregistration is fine for what it is, but I agree that it does not resolve issues of research quality. At best, preregistration makes it more difficult for people to make strong claims from noise (although they can still do it!), hence it provides an indirect incentive for people to gather better data and run stronger studies. But it's just an incentive; a noisy study that is preregistered is still a noisy study.

Summary

I think that p-values and statistical significance as used in practice are a noise magnifier, and I think people would be better off reporting what they find without the need to declare statistical significance.
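To illustrate the "noise magnifier" point, here is a minimal simulation (my own, with made-up numbers): a small true effect measured with a standard error five times its size, with results filtered on statistical significance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small true effect, big standard error: condition on p < 0.05 and see what survives.
true_effect, se, n_sims = 2.0, 10.0, 100_000
estimates = rng.normal(true_effect, se, size=n_sims)
significant = np.abs(estimates / se) > 1.96

print("share of studies reaching p < 0.05:", round(significant.mean(), 3))
print("mean |estimate| among significant results:", round(np.abs(estimates[significant]).mean(), 1))
# Only a few percent of these studies are "significant", and the ones that are
# overstate the true effect of 2 by roughly an order of magnitude.
```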

There are times when p-values can be useful: it can help to know that a certain data + model are weak enough that we can’t rule out some simple null hypothesis.

I don’t think the p-value is a good measure of the strength of evidence for some claim, and for several reasons I don’t think it makes sense to compare p-values. But the p-value as one piece of evidence in a larger argument about data quality, that can make sense.

Finally, the above comments apply not just to p-values but to any method used for null hypothesis significance testing.

Elsevier > Association for Psychological Science

Everyone dunks on Elsevier. But here’s a case where they behaved well. Jordan Anaya points us to this article from Retraction Watch:

In May, [psychology professor Barbara] Fredrickson was last author of a paper in Psychoneuroendocrinology claiming to show that loving-kindness meditation slowed biological aging, specifically that it kept telomeres — which protect chromosomes — from shortening. The paper caught the attention of Harris Friedman, a retired researcher from University of Florida who had scrutinized some of Fredrickson’s past work, for what Friedman, in an interview with Retraction Watch, called an “extraordinary claim.”

Friedman, along with three colleagues, looked deeper. When they did, they found a few issues. One was that the control group in the study seemed to show a remarkably large decrease in telomere length, which made the apparent differences between the groups seem larger. The quartet — Friedman, Nicholas Brown, Douglas MacDonald and James Coyne — also found a coding error.

Friedman and his colleagues wanted to write a piece for the journal that would address all of these issues, but they were told they could submit a letter of only 500 words. They did, and it was published in August. The journal also published a corrigendum about the coding error last month — but only after having changed the article without notice first.

Friedman had hoped that the journal would credit him and his colleagues in the corrigendum, which it did not. But it was a letter that the journal published on August 24 that really caught his eye (as well as the eye of a PubPeer commenter, whose comment was flagged for us.) It read, in its entirety:

As Corresponding Author of “Loving-kindness meditation slows biological aging in novices: Evidence from a 12-week randomized controlled trial,” I decline to respond to the Letter authored by Friedman, MacDonald, Brown and Coyne. I stand by the peer review process that the primary publication underwent to appear in this scholarly journal. Readers should be made aware that the current criticisms continue a long line of misleading commentaries and reanalyses by this set of authors that (a) repeatedly targets me and my collaborators, (b) dates back to 2013, and (c) spans multiple topic areas. I take this history to undermine the professional credibility of these authors’ opinions and approaches.

When Friedman saw the letter, he went straight to the journal’s publisher, Elsevier, and said it was defamatory, and had no business appearing in a peer-reviewed journal.

The journal has now removed the letter, and issued a notice of temporary removal. Fredrickson hasn’t responded to our requests for comment.

As Friedman noted, however, the letter’s language, which is undeniably sharp, is “coming from the loving-kindness researcher.”

Jordan writes:

I didn’t realize Friedman asked the journal to take down the response. To me I would have been happy the response was posted since it made Fredrickson look really bad—if her critics’ points are truly wrong and have been wrong over the course of multiple years then it should be easy for her to dunk on her critics with a scientific response.

I disagree. Mud can stick. Better to have the false statement removed, or at least flagged with a big RETRACTED watermark, rather than having it out there to confuse various outsiders.

Anyway, say what you want about Elsevier. At least they’re willing to retract false and defamatory claims that they publish. The Association for Psychological Science won’t do that. When I pointed out that they’d made false and defamatory statements about me and another researcher, they just refused to do anything.

It’s sad that a purportedly serious professional organization is worse on ethics than a notorious publisher.

But maybe we should look on the bright side. It’s good news that a notorious publisher is better on ethics than a serious professional organization.

At this point I think it would be pretty cool if the Association for Psychological Science would outsource its ethics decisions to Elsevier or some other outside party.

In the meantime, I suggest that Fredrickson send that letter to Perspectives on Psychological Science. They’d probably have no problem with it!

P.S. Fredrickson’s webpage says, “She has authored 100+ peer-reviewed articles and book chapters . . .” I guess they’ll have to change that to 99+.

Automation and judgment, from the rational animal to the irrational machine

Virgil Kurkjian writes:

I was recently going through some of your recent blog posts and came across Using numbers to replace judgment.

I recently wrote something about legible signaling which I think helps shed some light on exactly what causes the bureaucratization of science and maybe what we can do about it. In short I agree that we do and should use our qualitative judgment and attempts to add “objectivity” are not objective and lead to bureaucratic capture.

I don’t quite understand what Kurkjian was saying in his post but I thought it might interest you, so you can follow the link and judge for yourself.

From the rational animal to the irrational machine

I do see some connections to my idea that people used to think of humans as being special for their rationality, but now it’s our irrationality that is considered a virtue. In the past we compared ourselves to animals, hence the human was “the rational animal.” Now we compare ourselves to computers, hence the human is “the irrational machine.”

As a rational animal myself, I’m not so thrilled with this change in attitude.

Glenn Shafer: “The Language of Betting as a Strategy for Statistical and Scientific Communication”

Glenn Shafer writes:

I have joined the immense crowd writing about p-values. My proposal is to replace them with betting outcomes: the factor by which a bet against the hypothesis multiplies the money it risks. This addresses the desideratum you and Carlin identify: embrace all the uncertainty. No one will forget that the outcome of a bet is uncertain. See Working Paper 54 here.

And here’s the two-minute version, on a poster:

I sent this to Anna Dreber, who suggested using prediction market prices as priors, to which Shafer replied:

See the paragraph entitled “Bayesian interpretation?” on page 7 of the paper I called to your attention, which I attach this time.

My proposal is to report the outcome of a bet instead of a p-value or odds for a proposed bet. As I say on my poster, a 5% significance test is like an all-or-nothing bet: you multiply your money by 0 or 20. People want to report a p-value as the outcome instead of “reject at 5%” or “do not reject at 5%” because they want a more graded report. We can get this with bets that have many possible payoffs.
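To make the comparison in Shafer’s last paragraph concrete, here’s a minimal simulation sketch. The all-or-nothing bet pays 20 times the stake when p < 0.05 and nothing otherwise; the graded payoff 1/(2*sqrt(p)) is just one standard way of turning a p-value into a bet whose expected payoff is 1 when the null hypothesis is true (and the p-value uniform). It’s for illustration, not necessarily the particular bet Shafer has in mind.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n_sims = 100_000

def payoffs(z):
    # Betting outcomes, per unit of money risked, for a bet against the null given a z-statistic.
    p = 2 * norm.sf(np.abs(z))                        # two-sided p-value
    all_or_nothing = np.where(p < 0.05, 20.0, 0.0)    # the 5%-test bet
    graded = 1.0 / (2.0 * np.sqrt(p))                 # a graded alternative
    return all_or_nothing, graded

# Betting against a true null hypothesis: z-statistics are standard normal.
aon0, grad0 = payoffs(rng.normal(0.0, 1.0, size=n_sims))
print(f"null true:  all-or-nothing avg {aon0.mean():.2f}, graded avg {grad0.mean():.2f}")

# Betting against a false null (true effect shifts z by 2): the bets pay off on average.
aon1, grad1 = payoffs(rng.normal(2.0, 1.0, size=n_sims))
print(f"null false: all-or-nothing avg {aon1.mean():.2f}, graded avg {grad1.mean():.2f}")

Against a true null, both bets average out to about 1: you can’t expect to multiply your money by betting against a correct hypothesis. Against a false null, both tend to multiply it, but the graded bet reports a continuous outcome rather than a win-big-or-lose-everything verdict, which is the more graded report Shafer is after.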

I think this whole 95% confidence attitude is a bad idea and I think that Shafer’s in a dead end here; I don’t see these methods being useful now or in the future. But, hey, I could be wrong—it’s happened lots of times before!—so I’m sharing it all with you here. I think Dan Kahan might like this stuff.

Stan contract jobs!

Sean writes:

We are starting to get money and time to manage paid contracting jobs to try to get a handle on some of our technical debt. Any or all of the skills could be valuable:

C++ software engineering
C++ build tools, compilers, and toolchains
Creating installers or packages of any kind (especially cross-platform)
Windows development
Declarative Jenkins pipelines
AWS
OCaml / opam packaging

If you’d like to be considered for such work when it arises, please email me [Sean Talts] for now (first name dot last name at gmail) with some material you think demonstrates your suitability, for example a resume or a code sample, though there are no formal requirements. If you have specific Stan maintenance projects to propose that’s great too! This is all a function of the time we have to manage folks, ability to split out discrete tasks that need doing, perceived task urgency, and, of course, the money we want to set aside for various kinds of technical debt. I or someone else will be in touch if we find a good match.

My talk at the Brookings Institution this Fri 11am

The replication crisis in science: Does it matter for policy?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

I argue that policy analysts should care about the replication crisis for three reasons: (1) High-profile policy claims have been systematically exaggerated; (2) This has implications for how to conduct and interpret new research; (3) Much of the work that has been called into question is attached to a manipulable-voter model that has malign political implications.

The replication crisis is typically discussed in the context of particular silly claims, or in terms of the sociology of science, or with regard to controversies in statistical practice. We can also consider the content of unreplicated or otherwise shaky empirical claims in political science, which often seem to be associated with a model in which attitudes and behavior can be easily manipulated using irrelevant stimuli. This set of theories, if true, would have important implications for politics, supporting certain views held on the left, right, and technocratic center of the political spectrum. Conversely, the lack of empirical support for the manipulable-voter model has political implications which are worth considering: if voters and politicians are not so easily swayed in this way, this suggests that we should try to more carefully understand their direct motivations.

P.S. If you want to attend, you should contact Brookings here.