My thoughts on “What’s Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers”

Chetan Chawla and Asher Meir point us to this post by Alvaro de Menard, who writes:

Over the past year, I [Menard] have skimmed through 2578 social science papers, spending about 2.5 minutes on each one.

What a great beginning! I can relate to this . . . indeed, it roughly describes my experience as a referee for journal articles during the past year!

Menard continues:

This was due to my participation in Replication Markets, a part of DARPA’s SCORE program, whose goal is to evaluate the reliability of social science research. 3000 studies were split up into 10 rounds of ~300 studies each. Starting in August 2019, each round consisted of one week of surveys followed by two weeks of market trading. I finished in first place in 3 out of 10 survey rounds and 6 out of 10 market rounds.

The studies were sourced from all social sciences disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 . . .

Then he waxes poetic:

Actually diving into the sea of trash that is social science gives you a more tangible perspective, a more visceral revulsion, and perhaps even a sense of Lovecraftian awe at the sheer magnitude of it all: a vast landfill—a great agglomeration of garbage extending as far as the eye can see, effluvious waves crashing and throwing up a foul foam of p=0.049 papers. As you walk up to the diving platform, the deformed attendant hands you a pair of flippers. Noticing your reticence, he gives a subtle nod as if to say: “come on then, jump in”.

To this, I’d like to add the distress I feel when I see bad research reflexively defended, not just by the authors of the offending papers and by Association for Psychological Science bureaucrats, but also by celebrity academics such as Steven Pinker and Cass Sunstein.

There’s also just the exhausting aspect of all of this. You’ll see a paper that has obvious problems, enough that there’s really no reason at all to take it seriously, but it will take lots and lots of work to explain this to people who are committed to the method in question, and that’s related to the publication/criticism asymmetry we’ve discussed before. So, yeah, I imagine there’s something satisfying about reading 2578 social science papers and just seeing the problems right away, with no need for long arguments about why some particular interaction or regression discontinuity doesn’t really imply what the authors are claiming.

Also, I share Menard’s frustration with institutions:

Journals and universities certainly can’t blame the incentives when they stand behind fraudsters to the bitter end. Paolo Macchiarini “left a trail of dead patients” but was protected for years by his university. Andrew Wakefield’s famously fraudulent autism-MMR study took 12 years to retract. Even when the author of a paper admits the results were entirely based on an error, journals still won’t retract.

Recently we discussed the University of California’s problem with that sleep researcher. And then just search this blog for Lancet.

But there are a few places where I disagree with Menard. First, I think he is naive about sample sizes. Regarding “Marketing/Management,” he writes:

In their current state these are a bit of a joke, but I don’t think there’s anything fundamentally wrong with them. Sure, some of the variables they use are a bit fluffy, and of course there’s a lack of theory. But the things they study are a good fit for RCTs, and if they just quintupled their sample sizes they would see massive improvements.

The problem here is the implicit assumption that there’s some general treatment effect of interest. Elsewhere Menard criticizes studies that demonstrate the obvious (“Homeless students have lower test scores, parent wealth predicts their children’s wealth, that sort of thing”), but the trouble is that effects that are not obvious will vary. A certain trick in marketing might increase sales in some settings and decrease them in others. That’s fine—it just pushes the problem back one step, so that the point is not to demonstrate an effect but rather to find out the conditions under which it’s large and positive, and the conditions under which it is large and negative—my point here is just that quintupling the sample size won’t do much in the absence of theory and understanding. Consider the article discussed here, for example. Would quintupling its sample size do much to help? I don’t think so. This is covered in the “Lack of Theory” section of Menard’s post, so I don’t see why he says he doesn’t think there’s “anything fundamentally wrong” with those papers.
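
To make this concrete, here’s a small simulation of the point about varying effects (a sketch with made-up numbers, assuming numpy is available; nothing here comes from Menard’s post). If a marketing trick raises sales in some settings and lowers them in others, a pooled experiment just estimates an average that is close to zero, and the uncertainty that matters is dominated by the variation across settings rather than by the within-study sampling noise that a bigger N would reduce:

```python
# Sketch: heterogeneous effects make "just collect more data" unhelpful.
# The trick helps (+0.2 sd) in half the settings and hurts (-0.2 sd) in the
# other half; all numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def pooled_experiment(n_per_setting, effects):
    """Run one two-arm experiment per setting and pool the estimates."""
    estimates = []
    for true_effect in effects:
        control = rng.normal(0.0, 1.0, n_per_setting)
        treated = rng.normal(true_effect, 1.0, n_per_setting)
        estimates.append(treated.mean() - control.mean())
    estimates = np.array(estimates)
    return estimates.mean(), estimates.std(ddof=1) / np.sqrt(len(effects))

effects = np.array([+0.2] * 10 + [-0.2] * 10)   # half help, half hurt

for n in (100, 500):   # original sample size vs. quintupled
    est, se = pooled_experiment(n, effects)
    print(f"n per setting = {n:3d}: pooled effect = {est:+.3f} (se = {se:.3f})")
```

Quintupling the per-setting sample barely changes the answer: the pooled effect sits near zero either way, because the between-setting variation, which is the thing we actually need to understand, does not shrink with N.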

I also don’t see how Menard reconciles his statement that “They [researchers] Know What They’re Doing” when they do bad statistics with his later claim (in the context of a discussion of possible political biases) that “the vast majority of work is done in good faith.”

Here’s what I think is happening. I think that the vast majority of researchers think that they have strong theories, and they think their theories are “true” (whatever that means). It’s tricky, though: they think their theories are not only true but commonsensical (see item 2 of Rolf Zwaan’s satirical research guidelines), but at the same time they find themselves pleasantly stunned by the specifics of what they find. Yes, they do a lot of conscious p-hacking, but this does not seem like cheating to them. Rather, these researchers have the attitude that research methods and statistics are a bunch of hoops to jump through, some paperwork along the lines of the text formatting you need for a grant submission or the forms you need to submit to the IRB. From these researchers’ standpoint, they already know the truth; they just need to do the experiment to prove it. They’re acting in good faith on their terms but not on our terms.

Menard also says, “Nobody actually benefits from the present state of affairs”—but of course some people do benefit.

And of course I disagree with his recommendation, “Increase sample sizes and lower the significance threshold to .005.” We’ve discussed this one before, for example here and here.
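
As a rough illustration of why I don’t see a stricter cutoff as the fix (this is my own toy example with made-up numbers, not anything from Menard’s post, and it assumes numpy and scipy are installed): with a small true effect and a small sample, the estimates that happen to clear a .005 threshold are precisely the ones that are wildly exaggerated, the “type M” error problem we’ve discussed before.

```python
# Sketch: a stricter threshold selects flukes rather than fixing noisy designs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n, alpha, sims = 0.1, 50, 0.005, 20_000   # made-up design

exaggeration = []
for _ in range(sims):
    treated = rng.normal(true_effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    t_stat, p_value = stats.ttest_ind(treated, control)
    if p_value < alpha and t_stat > 0:
        exaggeration.append((treated.mean() - control.mean()) / true_effect)

print(f"share of studies clearing alpha = {alpha}: {len(exaggeration) / sims:.3f}")
print(f"average exaggeration among those that clear it: {np.mean(exaggeration):.1f}x")
```

In this setup only about 1 percent of studies clear the threshold, and the ones that do overestimate the true effect by roughly a factor of six. The fix has to come from better measurement, better design, and better theory, not from moving the cutoff.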

In short, I agree with many of Menard’s attitudes, and it’s always good to have another ranter in the room, but I think he’s still trapped in a conventional hypothesis-testing framework.

35 thoughts on “My thoughts on ‘What’s Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers’”

  1. To concisely discuss the distinction, most of the papers do not describe experiments; they describe ‘demonstrations.’

    With a demonstration, it doesn’t really matter how much tweaking, etc, is needed to get it to work. N = 1 is good enough to demonstrate it.
    A degree of reproducibility is nice, but robustness only matters if you expect to ‘show it work.’ In front of a class, for example.

    • “To concisely discuss the distinction, most of the papers do not describe experiments; they describe ‘demonstrations.’

      With a demonstration, it doesn’t really matter how much tweaking, etc, is needed to get it to work. N = 1 is good enough to demonstrate it.”

      +1

  2. The “Increase sample sizes and lower the significance threshold to .005” bits are only two of de Menard’s several suggestions. He concludes that NSF/NIH should be more actively involved in setting up structures and enforcing standards. I agree with this, and I’ve long been amazed at how hands-off both are about all sorts of structural issues — reproducibility, publication, over-production of PhDs, …

      • Keith:

        “Perceived best interests” . . . well put. Lots of people do all sorts of silly things out of the belief that they’re serving their interests, even though it’s not clear they’re doing themselves any favors with these actions.

      • I don’t have much time to write now, but I don’t understand what you mean by “perceived best interests.” If you mean the immediate career consequences, NSF and NIH program officers and higher are (roughly) shielded from all this. If you mean perceived interests of science as a whole, that is of course the point, and there is widespread agreement that this needs improvement.

  3. It’s been a couple of months, so I might be fuzzy on the details, but I remember quite enjoying the linked post while also being troubled by how far he takes some of the conclusions based on the “replication estimate” data. I’m referring to the narrow conclusions, not the general rant about social science research.

    In short, it’s true that “past evidence has shown that prediction markets work well.” Well, I’d amend that to reasonably well, but the premise that “people trying to guess reproducibility based on skimming the paper find some good signal to predict with” is fair enough.

    But then the post takes that fairly narrow point and uses it to reach some conclusions I thought were highly dubious. For instance, the analysis of “replication rates by field” is a statement about *predicted* replication rates by field. And while I think it’s perfectly fair to start with that for your intuition, it’s quite a leap from “estimates of reproducibility have reasonable predictive merit” to using those estimates for a fine-grained *comparison between different fields*. And I’d say the same thing about all the different fine-grained analyses here.

    I’m not giving much of a formal analysis, my apologies, and I still quite like the post, because this is absolutely a valuable way to get some starting intuition about the state of research. It’s evidence, and you’d be foolish not to use any evidence in front of you. So I guess I’m only griping about the strength of the language here, which sounds petty, but I think it’s a pretty important distinction for how to read these results. I don’t find that plot of “reproducibility by discipline” at all convincing personally, because I don’t have all that much confidence in that sort of *relative* accuracy of the predictors.

    FWIW, the most important analysis in here is the reproducibility-over-time one, and I at least find that a tiny bit more convincing intuitively (presumably these forecasters are looking for roughly the same markers of what should reproduce over time, and they’re not finding those markers becoming any more common). But I wouldn’t put a lot of faith in the detailed analysis presented here.

    But the study itself is enormously valuable, and absolutely should be pursued further. The limitation here is just that there isn’t all that much evidence to work with. This is the sort of thing that would hopefully get WAY more folks actively trying to do reproductions, running experiments predicting reproducibility, and all that, so we could start to answer some of these questions.

    • +1
      I agree with AG about the very far-reaching conclusions of the post, and I do not find the strong language very helpful either. On top of that, I am surprised at how one can make sweeping judgements about entire disciplines and sub-disciplines without discussing the heterogeneity within sub-disciplines, or the overlaps and boundaries between sub-disciplines that are not well defined.

      Take statements like “Criminology Should Just Be Scrapped”: even if you strongly disagree with the methodological standards in that field, can we really make such a statement? The same goes for statements about which research is trivial and which is not; I would not think that I am in a position to make such a statement about a single sub-discipline, let alone across multiple sub-disciplines within social science.

      Also, I do not find it helpful that a post advocating progress in open science makes very strong statements without giving access to the data. At least, even with some effort, I could not locate the data (e.g., the list of articles). Relatedly, it is not clear how the studies were selected, which I think is relevant given the heterogeneity within the sub-disciplines; in economics, for example, you have lab experiments, field experiments, and observational data with some other identification strategy (e.g., instrumental variables). Are all of these included?

      • Slutsky says:

        “I would not think that I am in a position to make such a statement about a single sub-discipline…”

        Well, he did just cover 2578 papers. That’s a small sampling of what’s out there, but it’s still more than most people cover in a year, I suspect. I think that statements like “criminology should just be scrapped” are rhetorical exaggeration for effect, not to be taken literally. To me they mean “quality is extremely bad in Criminology”.

      • > Relatedly, it is not clear how the studies were selected

        I guess I was hoping they were representative in some way, but I forgot that’s an assumption.

        I liked the idea that the 3000 papers here would come with reference judgments. Like apparently a bunch of people reviewed ’em! I’m kinda curious if I agree or disagree. Like this person claims to be able to do it in 2.5 minutes. I’ve spent more than 2.5 minutes writing this! I wanna be able to review a paper in 2.5 minutes.

        Looks like the plan is to replicate 250 of the 3000 studies — maybe all this stuff will be made clear once that is finished. I’m not sure. I can also believe I’m just not looking in the right spot.

        • Just go to r/science. You won’t need the abstract to rate many popular social science papers as irreproducible.

          one quick example:

          “People in societies where money plays a minimal role can have very high levels of happiness. High levels of subjective well-being can be achieved with minimal monetization, challenging the perception that economic growth will automatically raise life satisfaction among low-income populations.”

          “Can” have high levels of happiness? Sure! So “can” people on death row, for minutes, seconds, days or even years at a time I’m sure. Any group of people “can” have high levels of happiness. This does not “challenge the perception that economic growth will automatically raise life satisfaction among low-income populations.”

          That’s just for starters; we haven’t even deconstructed the second sentence, which has a number of additional red flags: that people can be momentarily happy doesn’t have anything to do with “perceptions” about economic growth, if such a “perception” even exists, not to mention who holds that perception.

          Now, there may be some amazing data and analysis underlying this BS title. But if there is, the people who wrote the title are making a big mistake by putting a ridiculous title on their excellent work.

          Yes, it took five minutes to write this but only a few seconds to read the title and snort.

        • Someone recently suggested in a comment that standardized testing doesn’t predict success later in life. If we count the writers of this paper as “successful”, then it’s hardly surprising that standardized skill testing wouldn’t predict their success!!!

        • > Like this person claims to be able to do it in 2.5 minutes. I’ve spent more than 2.5 minutes writing this! I wanna be able to review a paper in 2.5 minutes.

          I must have missed this. I think this adds to my impression that these are pretty steep claims, and I doubt whether they are helpful for advancing the goals of open science. I seriously doubt that it is possible for a normal person to judge the methodological merits of a broad range of different papers after reading each paper for 2.5 minutes, in particular given how complex some papers in economics or adjacent fields are in terms of methods.

          Why do I think these claims are not helpful for advancing the goals of open science? There are people working in these disciplines, and if you want to change these disciplines, you have to get these people on board to change the way they work. I doubt this will work if you start by telling them: I read a few of your field’s papers, approx. 2.5 minutes each, and I have come to the conclusion that this is trash, or trivial, and maybe we should just scrap your discipline. How can anyone working in such a discipline accept this as valid and fair criticism?

        • Something that made it easier was that Replication Markets chose one specific statistical claim from each paper and summarized both the claim and the statistical evidence for it up-front. So really we weren’t judging a paper in its entirety, but rather a single regression coefficient or a single ANOVA group comparison, etc. And there was no need to blindly search for info in the paper; it was easy to know where to look for the relevant info.

          On top of that, many of the things that you’d usually look into were irrelevant: for example, one of the most important considerations in general is causal identification. But when it comes to direct replications it doesn’t matter; a confounded result will show up in the replication as well (in fact the confounding might make a claim really “solid” from a pure replication perspective; see the small simulation after this sub-thread). So the range of methodological issues I was looking at was limited to those that would impact replicability.

          That said, at least when it comes to the RM sample (which includes many of the top econ journals), I wouldn’t agree with the statement that econ papers are methodologically complex. It’s mostly just diff-in-diff, IVs, RDDs, etc. There might be some complexity in the details but broadly I would say that the vast majority of papers draw from a standard stats playbook.

          >How can anyone working in such a discipline accept this as valid and fair criticism?

          Well, the replication results will be coming out any day now. If the markets did well, then I don’t see how this criticism can be avoided. Objective predictive track records have a way of clearing the smoke…

        • > Replication Markets chose one specific statistical claim from each paper and summarized both the claim and the statistical evidence for it up-front.

          Oooh, interesting. Thanks for the clarification.
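
A minimal sketch of the confounding point in this sub-thread (synthetic data; numpy and scipy assumed; nothing here comes from the Replication Markets materials): when a hidden confounder drives both the “treatment” and the outcome, the spurious association comes back just as strongly in a direct replication, which is why causal identification drops out of a pure replication forecast.

```python
# Sketch: a confounded claim "replicates" reliably even though it is not causal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def confounded_study(n=500):
    c = rng.normal(size=n)        # unobserved confounder
    t = c + rng.normal(size=n)    # "treatment" driven by the confounder
    y = c + rng.normal(size=n)    # outcome driven by the confounder, not by t
    return stats.pearsonr(t, y)   # correlation between t and y, with p-value

for label in ("original study", "direct replication"):
    r, p = confounded_study()
    print(f"{label}: r = {r:.2f}, p = {p:.1e}")
```

Both runs give a correlation around 0.5 with a tiny p-value, so a pure replication check is perfectly happy even though the causal claim behind it would be wrong.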

      • > Relatedly, it is not clear how the studies were selected

        Selection was done by the Center for Open Science. They’ll have to provide details, but SCORE was attempting to be more representative of Social Science, sampling in some stratified way across 62 journals covering various fields (link in another reply). Pending the forthcoming study description, this from our FAQ:

        > [The studies] were selected for SCORE by the Center for Open Science. They identified about 30,000 candidate studies from the target journals and time period (2009-2018), and narrowed those down to 3,000 eligible for forecasting. Criteria include whether they have at least one inferential test, contain quantitative measurements on humans, have sufficiently identifiable claims, and whether the authors can be reached.

        -Charles Twardy, Replication Markets PI

    • Glad you like the paper, Kevin. Yes, we intentionally avoided wading too far into the discussion of the pros and cons of NHST and focused on what we thought people should do when they are trying to decide which version of an ad to use. By looking at a specific decision problem, the loss function becomes very clear.

  4. You have to kiss a lot of frogs before you find your prince. An airline attendant told me that once. Maybe a lot of bad research is the price for a little good research. It would be nice if all research could be good. Reward the good, punish the bad. A strong wise leader is needed. Kamala?

    • Renzo:

      I discussed here the idea that junk science could be serving the public good by being a way for people to consider new ideas.

      I have no idea why you are bringing up Kamala, but if you want random political references you might want to go to twitter.

      • I am not sure about the “kiss a lot of frogs before you find your prince” idea, because sometimes frogs can actually hinder things. Consider a recent study that was widely reported as casting doubt on the efficacy of the Oxford-AstraZeneca covid-19 vaccine. This is what Peter Openshaw, professor of experimental medicine at Imperial College London, said to the BMJ: “. . . the trial seems to have been restricted to HIV negative younger people [with a] mean age [of] 31—about 1000 in the placebo arm and 1000 in the active vaccine group. In this age group and with these numbers, the effect on severe disease is going to be hard to estimate. Without seeing results in detail it isn’t possible to be sure how firm these conclusions are.” I think the world would be better off without this “frog” study.

      • I’m optimistic that Kamala will exceed expectations as a leader. I missed the memo that it’s not cool anymore to talk about race and politics. I have the memo now.

        • Renzo:

          You can talk about whatever you want, but I’d appreciate if you don’t use our comment section for trolling. Bringing up irrelevant references and saying things like “I have the memo” seem like trolling to me. There’s enough good discussion already going on in the comments here that there’s no need to try to stir things up like this. This was a post and a discussion about scientific replication, not about “race and politics.” Again, discussions can go in interesting and unexpected directions, but that can be done without trolling.

  5. You know, the more I think about it the more damning it is that p-values operate in a similar way to financial markets.

    In the regular markets, the safe bet is that in the long run, the whole market goes up. So invest in index funds or S&P, and you’ll be making money overall.

    With p-values, the safe bet is that as N increases, p-values get smaller. Even when you’re predicting noise, or tiny effects that don’t matter! So you can “win” at these prediction markets (or “succeed” in replications) just by massively increasing sample sizes until you can detect trivial effect sizes.

    A really simple winning heuristic would just be “large N studies replicate, small N ones don’t”. Add some minor tweaks like “interaction effects are less likely to replicate, because they are typically smaller than main effects” and “within-subjects designs are more likely to replicate,” and you could probably do pretty well in a game like this. (A toy simulation along these lines appears after this thread.)

    But I think an awful lot about Cumming’s “dance of the p-values” when I consider the intended design. Get a bunch of people to “bet” on which articles will replicate, then replicate a smaller subset of studies. Sampling variance adds a bunch of noise, making a game of guessing which ones will come out that is *partially* predictable in advance, but with enough random variation to resemble gambling / financial investing.

    This is why I have been pretty skeptical that a useful machine learning “bullshit-detector” can really come out of the SCORE project. You could make one, sure, but it’s not going to be useful if it’s based on replicating p-values. It’ll end up basically like those quantitatively-based mutual funds: In the end, they don’t do that much better than index investing (i.e., big N = replication).

    But I suppose they have lots of other avenues to try outside the replication markets option; I haven’t heard much about what else they’re doing.

    • Indeed, a smaller p-value predicts replicability, as you’d expect from the “dance of the p-values”. Using data from the Pooled Market Study (https://arxiv.org/abs/2102.00517), we found that a simple decision tree using p-values did nearly as well as the crowds. That’s not a bad thing: small-N studies *are* unstable. But it’s disappointing that the humans were only a few percentage points better (they averaged ~73% accurate).

      However, the humans were clearly using *different* information, so combining them should help. In Replication Markets, we set the starting prices for the markets using the p-value model. We thought this would let the crowd focus on other information — like interaction effects, prior plausibility, reputation, etc. — and get us over 80% accuracy. Preliminary results are discouraging – we await further resolutions. Also, the Melbourne team took a different approach and may have done better. We’ll see.

      As for a “useful” automatic BS-detector: replication is only one indicator of BS, but it’s measurable. Prediction outcomes cannot be more accurate than the true power of the replication study: unknown but probably not above 90%, considering systemic noise etc. But an 85% accurate detector could drive much smarter choices about WHERE to spend replication dollars.

      I look forward to seeing how close the rest of SCORE can come to that. And how it holds up after 5 years.

      -Charles Twardy, Replication Markets PI
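
A toy version of the prediction game discussed in this thread (entirely synthetic; nothing below uses the actual Replication Markets or SCORE data, and the rules scored here are not the decision tree Charles describes; numpy and scipy assumed): simulate a batch of two-arm studies in which half the underlying effects are pure noise, treat a claim as published when its original p-value is below .05, call it “replicated” when a same-sized direct replication reaches p < .05 with the same sign, and then score two crude predictors, “original p < .005” and “large N.”

```python
# Sketch: how far do crude p-value / sample-size rules get you at predicting
# replication? All parameters below are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def one_study(effect, n):
    """Two-arm study with unit-variance outcomes; return two-sided p and sign."""
    se = np.sqrt(2.0 / n)
    diff = rng.normal(effect, se)          # estimated difference in means
    p = 2 * stats.norm.sf(abs(diff) / se)  # two-sided z-test p-value
    return p, np.sign(diff)

records = []
for _ in range(5000):
    effect = 0.0 if rng.random() < 0.5 else 0.25   # half the effects are pure noise
    n = rng.choice([20, 50, 100, 400])             # per-arm sample size
    p_orig, sign_orig = one_study(effect, n)
    if p_orig < 0.05:                              # only "significant" claims get published
        p_rep, sign_rep = one_study(effect, n)     # direct replication, same design
        replicated = (p_rep < 0.05) and (sign_rep == sign_orig)
        records.append((p_orig, n, replicated))

p_orig, n, replicated = map(np.array, zip(*records))
rule_p = p_orig < 0.005        # "small original p-value -> will replicate"
rule_n = n >= 100              # "large sample -> will replicate"
print("base replication rate:       ", round(replicated.mean(), 2))
print("accuracy of the p-value rule:", round((rule_p == replicated).mean(), 2))
print("accuracy of the large-N rule:", round((rule_n == replicated).mean(), 2))
```

In made-up setups like this one, both crude rules do noticeably better than simply betting the base rate on every study, which is roughly the point made above: a lot of the predictable signal is already sitting in the original p-value and the sample size, so the interesting question is how much the human forecasters add on top of that.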

  6. “Over the past year, I [Menard] have skimmed through 2578 social science papers, spending about 2.5 minutes on each one.

    What a great beginning! I can relate to this . . . indeed, it roughly describes my experience as a referee for journal articles during the past year!”

    I’m sure you are good enough to submit useful reviews even taking little time, but this may not be the case for others reading the blog who look up to you … statements like the above may unintentionally encourage a cavalier attitude toward reviewing, which, in the end, further reduces the quality of published research …

    • Dl:

      I was joking. I actually spend about 15 minutes to review an article. Just to be clear: when I review an article, I do not consider it my duty to decide whether the paper should be published. Rather, I see my role as providing information to the journal editor. I think there are diminishing returns, and I think I can contribute more by reviewing 100 papers at 15 minutes each, rather than 25 papers at 1 hour each, or 5 papers at 5 hours each. My referee reports are short, and the journal editors are free to use the information I provide as they see fit.

      I also recognize that different people have different styles, and that other reviewers might prefer to review 1/20th as many papers as I do but to spend 20 times as long on each review. That’s fine with me. Actually, it’s more than fine: I think the system works better with different reviewers following different approaches to the process.

  7. btw, the setup of the RM study was a mess…most obnoxiously, they used some kind of automated procedure to try to extract the relevant claims from the papers (the claims were presented to the traders), and it did not work well…

    other parts of it were confusing too…there still have been no payouts, etc…

    • This is a fair critique. Due to SCORE’s scale, the extractions RM and others received were much noisier than in similar but smaller replication studies. In earlier studies, one could do well just by relying on the high-quality summary.

      And RM is still waiting to hear replication results so we can close the markets and pay out. We’re thankful our forecasters have been as patient as they have.

      -Charles Twardy, Replication Markets PI

  8. I think that the big problem with Social Science is the lack of real science. Social “science” is useless. There are many ideological preconceptions conditioning the intentions, the conclusions are simple-minded, and almost nothing is done with real concern for sound scientific methodology. We are burning money with Social “Science”.

  9. “these researchers have the attitude that research methods and statistics are a bunch of hoops to jump through, some paperwork along the lines of the text formatting you need for a grant submission or the forms you need to submit to the IRB.”

    Your best sentence ever.
