Unsustainable research on corporate sustainability

In a paper to be published in the Journal of Financial Reporting, Luca Berchicci and Andy King shoot down an earlier article claiming that corporate sustainability reliably predicted stock returns. It turns out that this earlier research had lots of problems.

King writes to me:

Getting to the point of publication was an odyssey. At two other journals, we were told that we should not replicate and test previous work but instead fish for even better results and then theorize about those:

“I encourage the authors to consider using the estimates from figure 2 as the dependent variables analyzing which model choices help a researcher to more robustly understand the relation between CSR measures and stock returns. This will also allow the authors to build theory in the paper, which is currently completely absent…”

“In fact, there are some combinations of proxies/ model specifications that are to the left of Khan et al.’s estimate. I am curious as to what proxies/ combinations enhance the results?”

Also, the original authors seem to have attempted to confuse the issues we raise and salvage the standing of their paper (see attached: Understanding the Business Relevance of ESG Issues). We have written a rebuttal (also attached).

Here’s the relevant part of the response, by George Serafeim and Aaron Yoon:

Models estimated in Berchicci and King (2021) suggest that making different variable construction, sample period, and control variable choices can yield different results with regards to the relation between ESG scores and business performance. . . . However, not all models are created equal . . . For example, Khan, Serafeim and Yoon (2016) use a dichotomous instead of a continuous measure because of the weaknesses of ESG data and the crudeness of the KLD data, which is a series of binary variables. Creating a dichotomous variable (i.e., top quintile for example) could be well suited when trying to identify firms on a specific characteristic and the metric identifying that characteristic is likely to be noisy. A continuous measure assumes that for the whole sample researchers can be confident in the distance that each firm exhibits from each other. Therefore, the use of continuous measure is likely to lead to significantly weaker results, as in Berchicci and King (2021) . . .

Noooooooo! Dichotomizing your variable almost always has bad consequences for statistical efficiency. You might want to dichotomize to improve interpretability, but you then should be aware of the loss of efficiency of your estimates, and you should consider approaches to mitigate this loss.
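
To see the efficiency point concretely, here is a toy simulation (my own sketch, not anything from the papers under discussion): a real but modest association is detected far more often when a noisy measure is used as a continuous predictor than when it is collapsed into a top-quintile dummy. The variable names and effect sizes are made up for illustration.

    import numpy as np

    # Toy Monte Carlo: how often do we detect a real association when a noisy
    # measure is used as-is vs. collapsed into a top-quintile indicator?
    rng = np.random.default_rng(0)
    n, beta, n_sims = 500, 0.2, 2000

    def t_stat(x, y):
        """t-statistic for the slope in a simple regression of y on x."""
        x = x - x.mean()
        slope = (x @ y) / (x @ x)
        resid = y - y.mean() - slope * x
        se = np.sqrt((resid @ resid) / (len(y) - 2) / (x @ x))
        return slope / se

    detect_cont = detect_dich = 0
    for _ in range(n_sims):
        true_score = rng.normal(size=n)              # underlying characteristic
        obs_score = true_score + rng.normal(size=n)  # noisy measurement of it
        y = beta * true_score + rng.normal(size=n)   # outcome, e.g. a return
        top_q = (obs_score >= np.quantile(obs_score, 0.8)).astype(float)
        detect_cont += abs(t_stat(obs_score, y)) > 1.96
        detect_dich += abs(t_stat(top_q, y)) > 1.96

    print("power, continuous (noisy) measure:", detect_cont / n_sims)
    print("power, top-quintile dummy:        ", detect_dich / n_sims)

In runs like this, the dichotomized version typically needs a substantially larger sample to reach the same power; measurement noise weakens both analyses, but it does not rescue the dummy.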

Berchicci and King’s rebuttal is crisp:

The issue debated in Khan, Serafeim, and Yoon (2016) and Berchicci and King (2022) is whether guidance on materiality from the Sustainable Accounting Standards Board (SASB) can be used to select ESG measures that reliably predict stock returns. Khan, Serafeim, and Yoon (2016) (hereafter “KSY”) estimate that had investors possessed SASB materiality data, they could have selected stock portfolios that delivered vastly higher returns, an additional 300 to 600 basis points per year for a period of 20 years. Berchicci and King (2022) (hereafter “BK”) contend that there is no evidence that SASB guidance could have provided a reliable advantage and contend that KSY’s findings are a statistical artifact.

In their defense of KSY, Yoon and Serafeim (2022) ignore the evidence provided in Berchicci and King and leave its main points unrefuted. Rather than make their case directly, they try to buttress their claim with a selective review of research on materiality. Yet a closer look at this literature reveals that little of it is relevant to the debate. Of the 28 articles cited, only two evaluate the connection between SASB materiality guidance and stock price, and both are self-citations.

Berchicci and King continue:

Indeed, in other forums, Serafeim has made a contrasting argument, contending that KSY is a uniquely important study – a breakthrough that shifted decades of understanding (Porter, Serafeim, and Kramer, 2016). Surely, such an important study should be evaluated on its own merits.

That’s funny. It reminds me of the general point that in research we want our results simultaneously to be surprising and to make perfect sense. In this case, that tension puts Yoon and Serafeim in a bind.

And more:

In BK, we evaluate whether KSY’s results are a fair representation of the true link between material sustainability and stock return. We evaluate over 400 ways that the relationship could be analyzed and reveal that 98% of the models result in estimates smaller than the one reported by KSY and that the median estimate was close to zero. We then show that KSY’s estimate is not robust to simple changes in their model . . . Next, we evaluate the cause of KSY’s strong estimate and uncover evidence that it is a statistical artifact. . . . We then show that their measure also lacks face validity because it judges as materially sustainable firms that were (and continue to be) leading emitters of toxic pollution and greenhouse gasses. In some years, this included a large majority of the firms in extractive industries (e.g. oil, coal, cement, etc.). . . . KSY do not address any of these criticisms and instead rely on a belief that their measure and model are the only ones that should be considered. . . .

Where do they sit on the ladder?

It’s good to see this criticism out there, and as usual it’s frustrating to see such a stubborn response by the original authors. A few years ago we presented a ladder of responses to criticism, from the most responsible to the most destructive:

1. Look into the issue and, if you find there really was an error, fix it publicly and thank the person who told you about it.

2. Look into the issue and, if you find there really was an error, quietly fix it without acknowledging you’ve ever made a mistake.

3. Look into the issue and, if you find there really was an error, don’t ever acknowledge or fix it, but be careful to avoid this error in your future work.

4. Avoid looking into the question, ignore the possible error, act as if it had never happened, and keep making the same mistake over and over.

5. If forced to acknowledge the potential error, actively minimize its importance, perhaps throwing in an “everybody does it” defense.

6. Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim.

7. Attack the messenger: attempt to smear the people who pointed out the error in your work, lie about them, and enlist your friends in the attack.

In this case, the authors of the original article are stuck somewhere around rung 4. Not the worst possible reaction—they’ve avoided attacking the messenger, and they don’t seem to have introduced any new errors—but they haven’t reached the all-important step of recognizing their mistake. Not good for them going forward. How can you make serious research progress if you can’t learn from what you’ve done wrong in the past? You’re building a house on a foundation of sand.

P.S. According to Google, the original article, “Corporate Sustainability: First Evidence on Materiality,” has been cited 861 times. How is it that such a flawed paper has so many citations? Part of this might be the instant credibility conveyed by the Harvard affiliations of the authors, and part of this might be the doing-well-by-doing-good happy-talk finding that “investments in sustainability issues are shareholder-value enhancing.” Kinda like that fishy claim about unionization and stock prices or the claims of huge economic benefits from early childhood stimulation. Forking paths allow you to get the message you want from the data, and this is a message that many people want to hear.

18 thoughts on “Unsustainable research on corporate sustainability”

  1. I always love people who come up with crap measures, for which they can find no evidence of validity, who suggest dichotomizing the scores since the continuous scores can’t be trusted.

    Why would taking an invalid continuous number and dichotomizing it into Top Quintile Versus Lower Quintiles or similar make it suddenly valid?

  2. Agree completely with the commenter just above, but are there situations when a predictor is so noisy that you can improve results by throwing away information? The obvious example for me is winsorising, because you know that the highest and lowest values are much more likely to be erroneous than accurate. Dichotomising is really anti-winsorising, so it’s a bad way to solve this particular issue, but is there a standard reference on when throwing away information improves estimates?

    • The only way throwing away information improves anything is when your model/estimating procedure is bad. When your model doesn’t account for things like nonlinearity of response, the presence of outliers, asymmetric errors, etc., then your model gets pulled in weird directions as it contorts itself to fit the real data, which has those things. If you just go ahead and model those things, they will inform your model about reality.
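
      As a toy illustration of that point (my own numbers, nothing from the papers being discussed): when the errors are heavy-tailed, plain OLS tends to get dragged around by the extreme values, winsorizing helps some, and a model that expects outliers (here, a robust Huber loss from scikit-learn) tends to recover the slope without discarding information.

        import numpy as np
        from sklearn.linear_model import HuberRegressor, LinearRegression

        rng = np.random.default_rng(1)
        n, true_slope = 300, 1.0
        x = rng.normal(size=(n, 1))
        y = true_slope * x[:, 0] + rng.standard_t(df=1.5, size=n)  # heavy-tailed errors

        # Plain OLS on the raw data
        b_ols = LinearRegression().fit(x, y).coef_[0]

        # OLS after winsorizing the outcome at the 5th/95th percentiles
        lo, hi = np.quantile(y, [0.05, 0.95])
        b_wins = LinearRegression().fit(x, np.clip(y, lo, hi)).coef_[0]

        # Robust regression that downweights outliers instead of deleting information
        b_huber = HuberRegressor().fit(x, y).coef_[0]

        print(f"OLS: {b_ols:.2f}  winsorized OLS: {b_wins:.2f}  Huber: {b_huber:.2f}  (truth: {true_slope})")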

    • I am not sure George Serafeim had started his consultancy when he published the study. I think he was on the advisory board at the Sustainable Accounting Standards Board and one of his advisors, Bob Eccles, was its academic head. SASB is a non-profit. Troublingly, John Streur, who was the head of a sustainable investing company called Calvert, claimed in testimony before the US Senate that they had “partnered” with Serafeim in conducting the research. My hope is that this was hyperbole.

      To me, the problem is that sustainability research is now so replete with opportunities for return that research is being biased. Luca and I were asked by a funder to apply for a grant, but when they found out about our initial results, they asked us to change our proposal to a different topic. We declined, but my sense is that this kind of pressure is everywhere.
      Finally, Luca and I do not know that KSY fished their results. It may be that they simply were “lucky” and got them straight out of the box. In the usual rush for graduation or promotion, they may never have checked the construct validity of their variable. Or they may not know that fishing is wrong. The recommendation from the TAR reviewers that we fish may suggest it is an accepted practice.

      And let me be clear that my own hands are not clean. I look back at the process I used in a few of my older papers with embarrassment.

      Our paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3848664
      Our comment: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4119617
      Streur’s testimony: https://www.banking.senate.gov/imo/media/doc/Streur%20Testimony%204-2-19.pdf

      • When I defended my dissertation, I included multiple different model specifications and discussed differences in estimates across them. A member of my committee told me I needed to choose only one model because I am “supposed to be an expert.” It wasn’t stated explicitly, but the point was that I should only choose models that produced statistically significant estimates.

        A couple years later, I attended a PhD student’s practice job talk in their department (different institution than my PhD). The student presented a heat map-like graphic showing the results of about 50 “robustness checks.” The student was advised to not show so many, especially the ones that did not show statistically significant estimates.

        The problems run deep.

  3. I’m the JFR editor who handled the Berchicci and King article, and thought I’d share a couple of comments that might lead to some productive discussion. This one is about re-examinations. BK is one of two re-examinations of prior work that are being published in the next issue of JFR. In line with Andrew’s observation that the original (re-examined) authors are not counter-attacking, this was not a very contentious process. But the other one involved a great deal of back-and-forth with two editors and both sets of authors. Both re-examinations were presented at a conference earlier this spring. Here’s what I said in my introductory remarks.

    Re-examinations should be an unremarkable and routine part of the scientific process. But they are rarely published in accounting, so it is remarkable that JFR is publishing two re-examinations in the same issue. And the process itself was hardly routine. Pincus, Wu, and Hwang went through many rounds of revision, with a lot of input from the original authors. Berchicci and King went through JFR’s formal process pretty quickly, but that’s because the authors and I were already corresponding while the paper was being repeatedly reviewed and rejected by other journals.

    I want to share two lessons I’ve taken from these editorial processes. First, it’s very important to label the nature of a re-examination very precisely. Re-examination is the most general term for taking another look, and we can only get more specific when it’s justified. People throw around words like replication, replicate, reproduction, and reproduce, but it’s not always clear what they mean. Do they mean they used exactly the same data set and methods reported in the original paper, so that any difference suggests that the original authors didn’t do exactly what they said they did? Or do they mean that they used the same analyses on different data to see whether the original result would reappear reliably, or that they used different analyses on the same data to see whether the original result was robust? Even a little imprecision can cause major misunderstandings of what the re-examining authors did and found, and what that implies about the work being re-examined, and most importantly, what it implies about why the original authors reported what they did.

    And that leads to the next lesson I take away from this editorial process: re-examinations need to focus readers on the work itself, not on the personalities involved. JFR is publishing these papers so readers can understand more about the associations among accruals earnings management, real earnings management, and SOX, and among sustainability performance, financial performance, and definitions of materiality. But readers are human, and humans often find far more interest in personal conflict, drama, and motivation.

    We rarely acknowledge our motivations for conducting the research we do. Hopefully there’s always an element of curiosity, but the search for truth is inevitably mixed in with searches for tenure, promotion, fame, glory, support for a political agenda—and let’s not forget the satisfaction of showing that we’re right and someone else is wrong. We are also tempted to take the easy way over the hard way, and to prefer analyses that show strong results over weaker ones. Conference socializing is full of gossip about why academics did what they did, but that’s really all it is—gossip. It doesn’t belong in JFR. So we work with authors to make sure everyone is sticking to just the facts—how the data were gathered, analyzed, and reported, with nothing that would lead readers to make unwarranted inferences about motivation. And we take seriously a standard rule of academic writing—write so you can’t be misunderstood. If a sentence can be read as speaking to motivation, it needs to be rewritten.

    In the end, I think these two re-examinations do what we need them to do: they take another look at very influential work, are very clear about how they take that look, and are very careful to speak directly to the work, without a hint of gossip. I’m hoping that publicizing these papers in this session will encourage some of you to tackle new re-examinations, and give you some insight into how you’d be treated if you submitted to JFR, or if someone else re-examined your influential work.

    • Very insightful, thanks for sharing! Does JFR have a policy of requiring a full replication package, or at a minimum the code that could in principle reproduce the analysis? That helps future researchers, even for articles that are themselves re-examinations. Berchicci and King would of course have had an easier time reproducing the materiality mapping if it had been published at the time in the Accounting Review. As we’ve learned from this blog, ‘social’ or post-publication peer review is more powerful than the editorial process – when it is enabled through replication.

      • Ulrich,

        The original article by Khan, Serafeim, and Yoon (2016) was published in TAR, and we submitted our analysis there first. In fact, the excerpted comments above are from TAR reviewers. We did have the mapping they published, but it is at an aggregate level. They also shared their measure with us.

  4. While BK is mostly a re-examination, it is also the first paper in Accounting (as far as I know) to use model uncertainty analysis. I’d love to know what you all think about this approach. BK construct 448 different regression models in which they alter: how they map data to materiality topics; how they map firms into industry definitions; how they process the measure with differencing, orthogonalization, and dichotomization; what sample they use; and what fixed effects and firm characteristics they control for (a toy sketch of this kind of specification grid appears at the end of this comment). They show that only a handful of these models generate results as strong as what the original authors proposed, and most of them generate an estimate of the key parameter that is indistinguishable from 0 or has the wrong sign.

    An issue that has come up repeatedly is that not all models are equally reasonable. Who cares if the results go away when using lousy measure processing, or a lousy fixed effect structure, and so on? Our literature is pretty solidly locked into null hypothesis statistical testing, and vanilla regressions with fixed effects, and there are many sharp debates over which models within this fairly constrained space are best in which situations. So in the end, as editor I mostly asked for a lot of transparency so that people can see which models generate which results, and draw their own conclusions.

    Is there a better way, short of dropping NHST, p-values, and vanilla regressions? And how would a move away from NHST and vanilla frequentist regression even help address this issue?
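
    To make the mechanics concrete, here is a minimal sketch of what such a specification grid can look like. Everything below is made up for illustration (simulated data and a handful of stand-in choice dimensions, not BK’s actual code, data, or 448 models), and it assumes pandas and statsmodels are available.

      from itertools import product
      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      # Simulated firm-year-like data; the true effect of the score is zero.
      rng = np.random.default_rng(0)
      n = 2000
      df = pd.DataFrame({
          "score": rng.normal(size=n),              # stand-in materiality/ESG score
          "size": rng.normal(size=n),               # stand-in firm characteristic
          "industry": rng.integers(0, 10, size=n),  # stand-in industry code
          "year": rng.integers(2000, 2015, size=n),
      })
      df["ret"] = 0.0 * df["score"] + 0.2 * df["size"] + rng.normal(size=n)

      # Dimensions of the grid (analogues of the measurement, sample, and control choices)
      measures = {"continuous": lambda d: d["score"],
                  "top_quintile": lambda d: (d["score"] >= d["score"].quantile(0.8)).astype(float)}
      samples = {"all_years": lambda d: d,
                 "post_2007": lambda d: d[d["year"] >= 2007]}
      controls = {"none": "", "size": " + size"}
      fixed_effects = {"none": "", "industry_year": " + C(industry) + C(year)"}

      # Fit every combination and record the coefficient of interest.
      rows = []
      for m, s, c, fe in product(measures, samples, controls, fixed_effects):
          d = samples[s](df).copy()
          d["x"] = measures[m](d)
          fit = smf.ols("ret ~ x" + controls[c] + fixed_effects[fe], data=d).fit()
          rows.append({"measure": m, "sample": s, "controls": c, "fe": fe,
                       "beta": fit.params["x"]})

      grid = pd.DataFrame(rows)
      print(grid.sort_values("beta"))             # the full "specification curve"
      print("median estimate:", grid["beta"].median())

    The point of reporting the whole grid, rather than one preferred cell, is exactly the transparency mentioned above: readers can see which modeling choices drive which results and judge for themselves which cells are reasonable.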

    • “And how would a move away from NHST and vanilla frequentist regression even help address this issue?”

      If the move is towards deriving a prediction from a set of assumptions and testing that, I think you’ll find researchers are now incentivized to account for all sources of uncertainty. Then they will want to replicate each other’s work to figure out the various sources of error.

      This is the opposite of testing a strawman hypothesis of zero difference between groups. In that case researchers are incentivized to have many sources of systematic error, which will contribute to “getting significance.”

        • A non-inferiority trial is even worse! All that means is that “success” gets redefined to be weaker. I.e., the results for Group A can be worse than those for Group B but still “non-inferior”.

          Meanwhile none of the issues with NHST are addressed.

          And we know the proper way to treat such data: do a cost-benefit analysis (where cost includes both side effects and money). Now, this will likely be highly individualized, so an average effect is of limited value. But if that is all you want to report, just provide it for new and old treatments along with the uncertainty and side-effect rates.

          What do these “tests” contribute?

        • Anon:

          I pretty much agree with you. I’d put it this way: a non-inferiority trial is asking a question that is of real interest, but it is a mistake for this question to be shoehorned awkwardly into the null hypothesis significance testing framework.

  5. To anyone trying to predict stock returns, especially using fuzzy ESG data, I wish you luck. If you missed the opportunity to buy OXY warrants, for example, because you were reading about sustainability and ESG, you probably shouldn’t be predicting stock returns.

    • Gilligan:

      The goal is not to predict stock returns. The goals are (1) to get tenure at Harvard or wherever and (2) to get people to invest in your hedge fund. For those, a “statistically significant” result in a peer-reviewed journal is a win.
