Ashwin Malshe:

I would like to bring to your attention a recent controversy in accounting research. It relates to how extreme values may drive results in observational studies. However, this issue is more complex in this specific case because the event (corporate whistleblowing) is rather rare, showing up only about 20% of the time in the sample.

The background is that a paper published by Call et al. (2018) in the Journal of Accounting Research (JAR), an elite journal in accounting, showed that when frauds are outed because of whistleblowing, the penalties on average are higher. Because the data are publicly available from the journal, another researcher, Kuvvet (2019), tried to replicate the paper after eliminating the 11 firms with the highest penalties (less than 1% of the sample). It seems that the results go away and even reverse in some models! The original authors responded to this, but in an inadequate fashion (in my opinion). The elimination of the top 1% of firms by penalties results in the removal of about 4%-5% of the firms with a whistleblowing event. It seems that this is a concern about the replication among some of my accounting colleagues whom I spoke with. However, there is no consensus about a solution.

The replication paper was rejected by JAR, which additionally highlights a problem you posted about on your blog previously.

The original paper by Call et al. 2018 (Whistleblowers and Outcomes of Financial Misrepresentation Enforcement Actions).

The replication by Kuvvet 2019.

Call et al.’s comment on replication.

I don’t really have the energy to look into this one but I thought I’d post for those who might be interested.

**P.S.** I was going to give this post the title, “A recent controversy in accounting research,” but I thought that would be too boring even for this blog (recall P.S. here).

Top “11” seems like a carefully-chosen number.

The finding disappears for Top 10?

On the other hand, everyone is always preparing “Top 11” lists for all sorts of topics.

“Because Top 10 Lists Are For Cowards” from https://11points.com/

So the issue is just how you treat outliers? And the critique is if you exclude some data you get different results? I don’t get the controversy.

Seems like this is actually a pretty good example of how these things should go. Someone published something, shared their data, and someone did another analysis with the shared data. So the original journal wouldn’t publish the replication? Maybe because the replicators didn’t actually find anything wrong with the original publication. It just seems to be a difference in how to approach data analysis.

Can someone explain what the relevance of the findings is? If the penalty revenue isn’t higher, we don’t want whistleblowers to come forward?

Well, I don’t have access to one of those articles, and this certainly isn’t my main area of interest, so I’ll give a couple of basic impressions about situations where people go around removing special data points and re-analyzing.

I think this is a situation where Bayesian models really help you a lot. In a Frequentist model you typically use some estimator, and often rely on that estimator having a known sampling distribution asymptotically for large sample size. For example, an average, or a median, or a variance or the interquartile range or whatnot.

When you have distributions with large skew or fat tails, these estimators typically have two problems:

1) the sample size required for asymptotic results to hold for good approximation is much larger, so there is usually a mismatch between your sampling assumptions and your actual situation.

and

2) The quantity being estimated is typically irrelevant to the real question you have… averages when the distribution is hugely skewed, for example. If you look at the distribution of net worth in the world, the average across all 7B people might be a couple hundred or thousand bucks, but the average of the top 10% is multiple millions maybe. Even in the US, data I downloaded a couple years ago from the Census showed the median black family in the US had something like $6000 in net worth, whereas the median white family had on the order of $100k. Those are two very distinct populations, so a single measure of central tendency is unlikely to be helpful. Similarly, if you look at personal income in the ACS, something like 37% of people have *exactly zero* wage income (the labor force participation rate is something like 63%). That’s an infinitely high delta-function spike at 0 in a distribution whose median is on the order of 50,000.

So, because the estimator is so sensitive to the outlying data points, and because people don’t necessarily have much Bayesian stats background, people try to probe the data with alternative estimators… namely trimmed means and things like that, looking to see how much the estimator depends on the outlying values.
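A minimal sketch of that kind of probing, on made-up data: it computes the difference in group means after dropping the k largest values from the pooled sample, for several choices of k. The function name `mean_gap_after_dropping`, the lognormal draws, and the 200/800 group sizes are all assumptions for illustration; none of this is the Call et al. data or Kuvvet's procedure.

```python
import random
import statistics


def mean_gap_after_dropping(group_a, group_b, k):
    """Mean of group_a minus mean of group_b after dropping the k largest pooled values."""
    pooled = sorted(
        [(x, "a") for x in group_a] + [(x, "b") for x in group_b]
    )
    kept = pooled[: len(pooled) - k]
    a = [x for x, label in kept if label == "a"]
    b = [x for x, label in kept if label == "b"]
    return statistics.fmean(a) - statistics.fmean(b)


# Toy case: a single extreme value decides the sign of the gap.
toy_a = [1, 2, 100]
toy_b = [3, 4, 5]
print(mean_gap_after_dropping(toy_a, toy_b, 0))  # positive: the 100 dominates
print(mean_gap_after_dropping(toy_a, toy_b, 1))  # negative once it is dropped

# Synthetic heavy-tailed "penalties" with no true group difference (hypothetical).
rng = random.Random(0)
whistle = [rng.lognormvariate(0.0, 2.0) for _ in range(200)]
no_whistle = [rng.lognormvariate(0.0, 2.0) for _ in range(800)]

for k in (0, 5, 10, 11, 20):
    print(k, round(mean_gap_after_dropping(whistle, no_whistle, k), 3))
```

With tails this heavy, the estimated gap can move a lot between neighboring values of k, which is exactly why the choice of cutoff ends up being argued over.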

How do you go about answering questions if you HAVE a Bayesian stats background? In a Frequentist analysis you take some statistic of the data and then its sampling distribution is whatever it is… you hope through some math that you can know something about it because of central limit type theorems. In a Bayesian analysis you assert something about where you think the data are likely to lie, and then infer something about the parameters that describe that distribution. So you have the opportunity to *set* your assumption about the data. In particular, if things are highly skewed, then you can supply a p(Data | Model) that is highly skewed, or has a spike at a certain place, or has other features you expect.

When you do that, you use *all* the data, and the outliers inform you about the shape of the distribution, and there is *never* a reason to fiddle with sensitive differences between dropping 10 data points or dropping 11 as Sean S mentions. The instability in inference from small samples results in wider, less certain inference on the parameters of interest, and you make much more defensible inference compared to what you get when you choose to drop a few data points, then pretend that some asymptotic results might hold, then calculate a meaningless standard-error that relies on those asymptotic results, etc.

When I was working on income or wealth data, or on alcohol consumption data, long tailed distributions like t with between 4 and 10 degrees of freedom were routine choices and stabilized the inference for things like medians so that outliers didn’t lead to misleading results.
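As a rough illustration of why a long-tailed likelihood stabilizes things, here is a small grid approximation to the posterior for a location parameter under a normal likelihood and under a Student-t likelihood with 4 degrees of freedom. The data, the fixed scale of 1, the flat prior over the grid, and the grid itself are assumptions made to keep the sketch one-dimensional; a real analysis would also put priors on the scale and the degrees of freedom and would probably use a proper sampler.

```python
import math


def t_logpdf(x, mu, sigma, nu):
    """Log density of a location-scale Student-t distribution."""
    z = (x - mu) / sigma
    return (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1) / 2 * math.log1p(z * z / nu)
    )


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


def posterior_mean_of_location(data, logpdf, grid):
    """Posterior mean of mu on a grid, with a flat prior over the grid points."""
    log_post = [sum(logpdf(x, mu) for x in data) for mu in grid]
    top = max(log_post)  # subtract the max before exponentiating, for stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return sum(mu * w for mu, w in zip(grid, weights)) / total


# Hypothetical data: seven values near 10 and one extreme value at 60.
data = [9.5, 10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 60.0]
grid = [i * 0.05 for i in range(1401)]  # mu from 0 to 70

normal_mean = posterior_mean_of_location(
    data, lambda x, mu: normal_logpdf(x, mu, 1.0), grid
)
t_mean = posterior_mean_of_location(
    data, lambda x, mu: t_logpdf(x, mu, 1.0, 4.0), grid
)
print("normal likelihood:", round(normal_mean, 2))
print("t(4) likelihood:  ", round(t_mean, 2))
```

Under the normal likelihood the extreme point drags the estimate toward the sample mean; under the t likelihood all eight points are still used, but the estimate stays with the bulk of the data.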

I think the removal of outliers is a problem. I think the removal of *any* observations from data is a problem unless there is a principled reason for believing that the data are erroneous, or not in universe for the research question. While outliers may have a higher probability of meeting those criteria, blanket removal of outliers just biases the data collection. Moreover, it is even worse when, as here, the removal from a predictive model constitutes selection on the study’s outcome variable. If you selectively remove on the values of a predictor, you can still argue that your results are generalizable within a restricted domain. But when you remove selectively on the values of an outcome, you cannot even determine whether the model is applicable to a new data point because that requires knowing its outcome in advance.
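The difference between selecting on a predictor and selecting on the outcome can be seen in a small simulation. The sketch below assumes a linear model with true slope 2 and normal noise, and drops the top 5% of observations either by outcome or by predictor; the helper names `ols_slope` and `drop_top_fraction`, the sample size, and the 5% cutoff are all hypothetical choices for illustration.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def drop_top_fraction(pairs, index, frac):
    """Drop the pairs whose value at position `index` falls in the top `frac`."""
    ranked = sorted(pairs, key=lambda p: p[index])
    keep = len(ranked) - int(frac * len(ranked))
    return ranked[:keep]


rng = random.Random(1)
n = 20000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]  # true slope is 2
pairs = list(zip(xs, ys))

slope_full = ols_slope(xs, ys)

trim_x = drop_top_fraction(pairs, 0, 0.05)  # select on the predictor
slope_trim_x = ols_slope([p[0] for p in trim_x], [p[1] for p in trim_x])

trim_y = drop_top_fraction(pairs, 1, 0.05)  # select on the outcome
slope_trim_y = ols_slope([p[0] for p in trim_y], [p[1] for p in trim_y])

print("all data:        ", round(slope_full, 3))
print("top 5% of x gone:", round(slope_trim_x, 3))
print("top 5% of y gone:", round(slope_trim_y, 3))
```

Trimming on the predictor leaves the slope roughly where it was, just estimated over a narrower range; trimming on the outcome pulls the slope toward zero even though the underlying relationship has not changed.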

+1

In addition, selection of a model for the outcome variable should, as much as possible, have a rationale based on anything known about the nature of the variable generating process (not just the values of the data collected).

Love the phrase “an elite journal in accounting”!

If the goal was to know if “when frauds are outed because of whistleblowing, the penalties on average are higher” then wouldn’t the issue be settled by comparing the average over whistleblown cases versus the average over no-whistleblower cases?

Most likely what’s really of interest is whether whistleblowing causes a case to have higher penalties than it would have if there were no whistleblower.

The difference in the type or severity of case between the two populations is less interesting; it’s the causal increment from whistleblowing that seems interesting.

“Most likely what’s really of interest is whether whistleblowing causes a case to have higher penalties than it would have if there were no whistleblower.” This does seem to be the concern of the replicators, but it just seems silly when the counterfactual is more likely that in the absence of whistleblowers, those cases would not have even occurred.

Sure whistleblowers potentially affect two things:

the probability that the case is detected, and … the size of the penalty…

So, *conditional on the case being detected one way or another* what is the change in the size of the penalty due to the fact that it was detected by whistleblowing?

I mean, this is potentially an interesting thing for a possible whistleblower to know, that if they blow the whistle not only will there be penalties but typically they’d be higher than if the case was detected in another way?

> this is potentially an interesting thing for a possible whistleblower to know

How so?

Suppose you are a potential whistleblower, you perceive that there is some decent chance things will be found out even if you don’t blow the whistle… you now should decide whether to blow or not… if there’s typically an effect on severity of punishment, you might want to know that to make the decision?

Whistleblower laws provide for financial rewards to be paid to whistleblowers. They may well be interested in understanding the size of penalties out of pure self-interest.

The size of penalties if they blow the whistle, absolutely. But the relative size of the penalty were the irregularities to be discovered by some other reason?

« If I don’t uncover the fraud there is a 12.3% probability that the DoJ finds it. If the fine is $8.2mn in either case then the expected value of the fine will be above $1m. That’s acceptable! But what if the fine is lower if they discover the fraud without my help? How could I sleep at night knowing that I’m bringing the expected value of the penalty to six digits? »
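For anyone checking the arithmetic in that hypothetical, a quick sketch using only the two numbers quoted (12.3% and $8.2mn); the variable names are made up here.

```python
p_detect = 0.123   # quoted chance the fraud is found without the whistleblower
fine = 8.2e6       # quoted fine, in dollars

# Expected fine if the whistle is not blown and the fine is the same either way.
expected_fine = p_detect * fine
print(round(expected_fine))  # 1008600, just above $1m

# Fine below which that expected value drops under $1m (into six digits).
break_even_fine = 1e6 / p_detect
print(round(break_even_fine))
```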

Carlos, I think a lot of whistleblowers are well aware of the fact that if they blow the whistle they face a significant risk of never being able to work again in their industry, or at least have a much harder time getting future employment… knowing that they not only make it more likely to have fraud detected, but also that the penalties are likely to be higher and the deterrence effect potentially more could help to overcome the disincentive to report.

It’s also plausible that there are cases where the wrongdoing is more ambiguous, like perhaps there’s one guy who’s committing fraud, but there are a bunch of people who aren’t really aware of what’s going on who might get caught up in the investigation and be hurt. There might be reasons why people decide not to blow the whistle if they think the punishment will be outlandish compared to the severity of the crime? I don’t know.

If one guy were to say “oh yeah, we went after higher penalties because what the whistle-blower told us pissed us off” wouldn’t that definitively answer the causality question in a way all the statistical analysis in the universe will never, ever, come close to doing?

But how would you know that it wasn’t just what that one guy thought?

Or in just one instance.

Also, how much more?

Will selecting the right outliers to include, or not, in the averages tell us what they were really thinking?

It’s not really about what they were “really thinking” it’s more about what typically *does* happen.

For example, just because a prosecutor asks for heavy penalties, doesn’t mean they’re given by the judge/jury. Or just because someone thinks “hey we went after those guys really hard” doesn’t mean that it was any harder than the guys going after the accounting frauds they found without a whistleblower. And there’s no easy way to compare effort between totally separate prosecution teams. And even if we could compare effort, we probably care more about how the final outcome came out rather than how much effort was put in to get that outcome.

I don’t see a problem at all with asking this causal question about the size of the penalties. If you actually want psychology rather than monetary penalties… you’d have to ask a different question.

On the other hand: including vs excluding the outliers is a terrible analysis methodology, so if that’s what you’re getting at, then yes, that’s not a good way to answer any of these questions.

Causation also goes in the other direction: penalty –> whistleblow probability.

A larger penalty means a larger payoff to a whistle blower, which means more incentive to whistle blow. Therefore, we expect a higher probability of a whistleblow when the expected penalty is higher.

There is also the issue of exaggeration. Since a WBer’s payoff is a function of the penalty, a WBer has an incentive to exaggerate misconduct. This incentive is not present when there is no WBer.

There are also professional WBers now. Harry Markopolos of Bernie Madoff fame is an example. He looks for large potential penalties.