Zak David expresses critical views of some published research in empirical quantitative finance

In honor of Ebenezer Scrooge, what better time than Christmas Eve to discuss the topic of liquidity in capital markets . . .

A journalist asked, “I just wanted to know how bad the problem of data mining is in capital markets compared to other fields, and whether the reasons for false positives in finance are the same as in other fields?”, and I passed the question over to Jim Savage, who passed it over to Zak David, who replied:

(First, I too will say that I don’t know how the problems in financial research compare, in size or kind, to those in other fields.)

The question is: with respect to quantitative investment research, how bad is the problem of data mining?

Having done a healthy share of paper replications over the past decade, and having been consistently disappointed when the models or techniques broke down on data shortly after (or even before) the authors’ sample periods, I’d say of course it’s a gigantic problem — spurious results are the norm. But also over those years, the granularity of market data available to researchers has become finer, and the ubiquity of machine learning software packages has placed predictive modeling in the hands of anyone with a pulse. As a result, more troubling than potential “false positives”, I’ve found papers so filled with coding errors and domain mistakes that the authors couldn’t have possibly found the existence of “a positive” in the first place (I hate neologisms but until you come up with something better these are called “impossitives”). So not only is it easier than ever to find spurious relationships, but there are several new classes of fatal errors that the current publishing and peer review system is poorly equipped to identify.

That said, this should also be clear: despite its pejorative connotations, data mining itself is not malicious and does not cause false positives. In data-driven trading or quantitative investment research, data mining is a big part of what you have to do. But successfully employing those techniques requires a lot of attention to the nuances of the domain, critical evaluation of your assumptions, and a commitment to exploring philosophies of statistics and science. When “pairs trading” was invented at Morgan Stanley in the early 1980s, it came from a computer science graduate from Columbia who was calculating relative weights and correlations for groups of stocks sorted by industry sector. In that same era, RenTech was getting started. Robert Frey, a former managing director, has said that their approach was “decidedly non-theoretical. Discipline is the key, not theory.” The founder, Jim Simons, tells an anecdote about sending people down to the New York Fed building to copy, by hand, historical interest rate information kept in written ledgers. These people are perhaps the most successful data miners in all of finance.

An ever-increasing risk of catastrophic errors arises when academics write code. A few years ago one of the most famous results in empirical economics was actually a spreadsheet error. More recently, I discovered numerous invalidating bugs in the ad hoc backtesting software created for a paper published by the editor of the journal Quantitative Finance. From long-term factor model research to high-frequency order book analysis, researchers are writing more lines of code of increasing complexity. There’s an often-cited statistic that the professional software industry produces 15-50 errors per 1,000 delivered lines of code. Yet to the best of my knowledge, there are no journals that have professional software developers review submitted code. Most journals still either don’t have or don’t enforce code submission and open publishing standards, an asinine stance that both increases the uncertainty of the results and slows the progress of future research.

The other big problem, and a worrisome trend (especially among lower-quality journals and lower-quality academics), is how much of the research and publishing apparatus is completely blasé about the difference in domain knowledge required to analyze large cross-sections of quarterly stock data versus high-frequency trade and order book data. Examples:

  • This paper casually invented a metric to measure high-frequency “liquidity takers,” but if the author had asked anyone in industry he would have realized he was counting market makers racing for queue position to provide liquidity.
  • This paper assumed stock volumes are normally distributed, so instead of finding evidence of high-frequency “quote stuffing,” as the authors claim, they found evidence that stock volumes look more like this (a toy sketch of the problem follows this list).
  • This paper didn’t know what times futures markets are open, and also applauded itself for high predictive accuracy on a contract no one trades.
  • This paper calculated closed PL from buying a futures contract with one expiration and then, the next day, selling a futures contract with a different expiration.
  • This paper applied a denoising filter to the whole time series before predicting it, meaning that each point has information from the future in it (also sketched in code after this list). And the authors also added trading costs to their PL.
  • This paper tried to predict 5-minute returns for US stocks but forgot that the market is closed overnight and on weekends, and the authors also admitted to trying a whole bunch of models until one looked good in sample.
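
To make the second example above concrete, here is a minimal sketch of why the normality assumption matters. This is my own toy code, not the paper’s, and the log-normal is just a stand-in assumption for “heavy-tailed”: a naive “flag anything more than three standard deviations above the mean” rule fires far more often on purely random heavy-tailed data than the normal model predicts.

```python
# Hypothetical illustration (not the paper's code): how often a "z-score > 3"
# anomaly rule fires on purely random per-interval volumes, under a normal
# model versus a heavy-tailed (log-normal) one.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000  # number of simulated intervals

normal_volumes = rng.normal(loc=1_000, scale=100, size=n)   # the paper's implicit assumption
heavy_volumes = rng.lognormal(mean=6.5, sigma=1.0, size=n)  # closer to real market activity

for name, volumes in [("normal", normal_volumes), ("heavy-tailed", heavy_volumes)]:
    z = (volumes - volumes.mean()) / volumes.std()
    print(f"{name:>12}: {(z > 3).mean():.2%} of random intervals flagged as 'anomalous'")
```

Under the normal assumption the rule fires on roughly 0.1% of intervals; with the heavy-tailed draw it fires an order of magnitude more often, and every one of those “anomalies” is pure noise rather than quote stuffing.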
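
The look-ahead problem in the fifth example is just as easy to reproduce. The sketch below is mine, not the authors’: it denoises a pure random walk with a centered moving average, which mixes future prices into each point, and then “predicts” the next move from the smoothed series. The leaky filter looks skillful on pure noise; a causal (trailing) filter does not.

```python
# Hypothetical illustration (not the authors' code) of look-ahead bias from
# denoising the whole series before prediction. The data are a pure random
# walk, so there is no genuine predictability to find.
import numpy as np

rng = np.random.default_rng(0)
prices = np.cumsum(rng.standard_normal(5_000))

def centered_smooth(x, k=5):
    """Centered moving average: each point also averages in k future observations."""
    kernel = np.ones(2 * k + 1) / (2 * k + 1)
    return np.convolve(x, kernel, mode="same")

def causal_smooth(x, k=5):
    """Trailing moving average: uses only current and past observations."""
    return np.array([x[max(0, i - k):i + 1].mean() for i in range(len(x))])

def direction_accuracy(smoothed, raw):
    """'Predict' tomorrow's raw move with the sign of today's smoothed move."""
    signal = np.sign(np.diff(smoothed)[:-1])   # smoothed move ending today
    target = np.sign(np.diff(raw)[1:])         # actual move happening tomorrow
    return (signal == target).mean()

print("leaky (centered) filter:", direction_accuracy(centered_smooth(prices), prices))
print("causal (trailing) filter:", direction_accuracy(causal_smooth(prices), prices))
```

On this fake data the leaky version scores around 60% directional accuracy while the causal version stays near 50%: the entire “edge” is the future leaking backwards through the filter.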

Each of those papers has numerous other problems. When you take these problems to the authors and publishers, their first response is always “oh no! we should do something!”, but at some point I think they all go: “what has reality done for me lately? It certainly didn’t publish this paper.”

10 thoughts on “Zak David expresses critical views of some published research in empirical quantitative finance”

  1. This blog is full of interesting tidbits about the misapplication of mathematics in finance. One of the most compelling posts is a plot of the maximum Sharpe ratio as a function of the number of backtests. This is essentially the quant angle on the garden of forking paths: if you backtest more portfolios, even completely uninformed ones, the probability of achieving a high Sharpe ratio increases; a toy simulation below the link shows the effect.
    https://mathinvestor.org/2018/04/the-most-important-plot-in-finance/
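
    (The script is mine, not from the linked post: it backtests nothing but zero-mean noise and keeps the best annualized Sharpe ratio seen so far, which climbs steadily as the number of backtests grows.)

    ```python
    # Toy simulation (not from mathinvestor.org): the best Sharpe ratio found
    # across N backtests of strategies that are pure zero-mean noise.
    import numpy as np

    rng = np.random.default_rng(42)
    days = 252 * 5  # five years of daily returns per backtest

    def annualized_sharpe(daily_returns):
        return np.sqrt(252) * daily_returns.mean() / daily_returns.std()

    best = -np.inf
    for n in range(1, 1001):
        noise_strategy = 0.01 * rng.standard_normal(days)  # worthless by construction
        best = max(best, annualized_sharpe(noise_strategy))
        if n in (1, 10, 100, 1000):
            print(f"after {n:>4} backtests, best Sharpe found: {best:.2f}")
    ```

    With five years of fake daily data, the best Sharpe typically drifts well above 1 by the thousandth backtest, even though every strategy is worthless by construction.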

  2. As someone who often used the financial literature to buttress points made in litigation, let me second everything Zak says here, particularly the coding point. The difference in the litigation realm is that the code has to be turned over, and there is an equally well paid and trained team on the other side to tear it apart. The standards are incomparably higher. There is an asymmetry in peer review that skews the incentives: authors gain far more by “winning” the peer review process than reviewers gain by anonymously enforcing competency.

    Oddly, that is one of the best reasons to cite literature in one’s testimony. No matter how much shade the other side can throw on a shoddy published result, one can say, “Oh yeah? If your critiques are so devastating, how did this pass the peer review process? Where’s your peer-reviewed paper making this point?” This is of course total bullshit, but we litigating economists have the problem that the authorities deciding which of us is right have no competence in the field at all and combine this with an unholy (but understandable) reverence for the peer review process in lieu of the hard work of educating themselves.

    So to review, the litigation process has much better error checking, but the problem is that the final judgment is made by someone completely unfamiliar with statistics, data analysis, and/or economic theory. This is why I think the *best* method is internal to a private investment firm, in which the cross-checking is carried out by people with incentives to get it right with the firm’s money on the line. There is still a downside, though. The best analyses remain proprietary and secret: thus Renaissance Technologies and AQR and the other competent shops.

  3. Thanks for posting my little bit of Scrooging.

    I’d like to preemptively avoid any bickering over the “spreadsheet error” example by saying that there was a significant amount of debate over how much the error altered the results, and I’m not sure how that whole episode resolved. Reading this again for the first time in 6 months, that example seems pretty out of place among the others since it’s the only one that didn’t come from me thoroughly going over a paper.

    Happy to discuss any of the others.

    Merry Christmas y’all

  4. Did any of the linked papers get published in any sort of reputable journal?

    The links are all to working papers, and most working papers are of course seriously flawed or worse.

    • Hi Terry, good question. The short answer is: eeehhhhhhh.

      (I apologize for the long answer and what is about to look like a ton of self-promotion.)

      First, the list from the Examples bullet points:

      Paper 1: This was the author’s Job Market Paper for econ. Looks like it was also presented at a few workshops. (Small section discussing the paper here: http://zacharydavid.com/2014/03/09/on-hft-assumptions-agent-based-modeling-and-a-philosophy-of-error/#Very-Liquidity-Such-Market-Wow )

      Paper 2: When I first came across this, it had been presented at some conferences and, I think, had been submitted, but what I found dangerous was that it had been presented to regulators as fact and was even cited in Congressional testimony as evidence of “quote stuffing”. So I showed that you could achieve the same results with a random number generator if you take the distribution of messaging activity into account (there’s a longer story behind the scenes, and I was much younger and hot-headed, but I’m glad it’s still a working paper: http://zacharydavid.com/2014/05/11/know-thy-model-specificity-and-the-importance-of-using-fake-data/ )

      Paper 3: This was published in the journal Algorithmic Finance in 2016. According to the journal’s editor, that particular issue was composed by a guest editor, and he normally rejects all papers like that. (Full write-up here: http://zacharydavid.com/2017/08/06/fitting-to-noise-or-nothing-at-all-machine-learning-in-markets/ )

      Papers 4 and 5: These were both published in PLOSOne in 2016/17 (with one citing the other), where PLOSOne claims to have completed some peer review process. I’m obviously not big on journal credentialism or elitism since I work primarily in industry, but there’s been an open investigation for almost a year now. Along with other mistakes I’ve caught in that journal, my personal belief is that they don’t even have a peer review process (See here: http://zacharydavid.com/2018/01/08/accountability-generalizability-and-rigor-in-finance-research-machine-learning-in-markets-part-ii/#Two-Papers-No-Cars )

      Paper 6: This was published in ACM Transactions on Management Information Systems (TMIS) in 2017 (My take: http://zacharydavid.com/2018/01/08/accountability-generalizability-and-rigor-in-finance-research-machine-learning-in-markets-part-ii/#Spurious-By-Design )

      The paper I mentioned in the paragraph before that list, the one about “The Alpha Engine,” published by the editor of the journal Quantitative Finance, appeared in the book “High Performance Computing in Finance” (Chapman & Hall/CRC Series in Mathematical Finance, 2017). I didn’t have the energy left to do another post on that topic, especially because some of their mistakes were as bad as “didn’t think it was suspicious that their system beats a completely random number generator” and “not realizing Java indexing is 0-based.” (but there was a funny Twitter thread: https://twitter.com/ZakDavid/status/975253136951664640 )

      Overall, these things are at least getting published somewhere. To readers outside of academia: there’s considerable democratization of publishing venues taking place, and working papers are being cited in testimony heard by lawmakers. So it’s not helpful to take the position that “working papers or low impact journals” don’t matter. They clearly do, and thus these problems are legitimate.

      • So it looks like these are just a few of the many many trashy papers out there done by relative amateurs. I am not at all surprised they are garbage.

        Also it sounds like they were specifically chosen because they are trash. I am completely un-gobsmacked that a small pile of this trash can be easily identified. I am also unsurprised that one was cited by Congress. Congress is a garbage magnet.

        I like your point about domain-specific knowledge being important. When studying an intricate institution or complex dataset, being sloppy can generate all sorts of false results, so sloppiness actually gets rewarded.

        • We certainly agree that a significant amount of “garbage” is unsurprising. The position I’m obviously not succeeding at moving you a little more toward is that we should care about these various forms of garbage.

          Take paper 3: that’s a TT assistant professor with funding from Intel, hardly a “relative amateur,” at least in the eyes of academia. If you strip out all the bogus stuff about P/L curves from that paper, it looks like there might be some interesting things with respect to faster learning algorithms. But that result doesn’t make the press release, and I don’t think there’s any code supplied anyway. Can we get the potentially interesting results without the pressure for fabulous Sharpe ratios?

          Or take papers 4 and 5 again: one was influenced by the other. Thus, PLOSOne’s complete lack of real peer review and open code standards caused the second group of researchers to waste a tremendous amount of time on something that had no chance of being legitimate. I see that as more than a “garbage is as garbage does” sort of problem.

  5. I’d just like to chime in again to note that the Excel error in Reinhart & Rogoff had a relatively small impact, but because it was so obviously wrong it gets referenced most often. The more important differences between their paper & the one from Herndon, Ash & Pollin were their exclusion of certain data deemed unreliable (something New Zealand officials of the time have agreed with them on) and how they binned their data. But the biggest problem with their paper wasn’t in the details of how they analyzed their data, but in their whole approach. They were finding correlations rather than showing causation, and there is an a priori reason to expect low (or negative) growth to cause higher debt levels. Miles Kimball & Yichuan Wang actually tried to address causation and found that debt did not reduce growth:
    https://blog.supplysideliberal.com/post/54317917221/quartz-24-after-crunching-reinhart-and-rogoffs
