and then a response from Donoho. They ordered the discussions alphabetically! I think it would’ve been better to sequence them in the order they were submitted. In any case, I recommend the usual strategy for reading such things, which is to first read the author’s response, then read the original article, then take a look at the discussions if you have time.

I saw an earlier version of Donoho’s paper and posted something on it, with input from Jessica Hullman a year ago.

The funny thing is, when I wrote my comment for the journal, I’d completely forgotten about my post on the topic. There’s essentially zero overlap between the two documents . . . and I think the blog post is much more interesting!

I wonder if there’s something about the blog format that keeps my eye on the ball. My comment for the journal is fine, don’t get me wrong; it’s just a bit boring somehow. So, just for you, I’ll repost our more interesting remarks from last year:

Frictionless reproducibility; methods as proto-algorithms; division of labor as a characteristic of statistical methods; statistics as the science of defaults; statisticians well prepared to think about issues raised by AI; and robustness to adversarial attacks

Tian points us to this article by David Donoho, which argues that some of the rapid progress in data science and AI research in recent years has come from “frictionless reproducibility,” which he identifies with “data sharing, code sharing, and competitive challenges.” This makes sense: the flip side of the unreplicable research that has destroyed much of social psychology, policy analysis, and related fields is that when we can replicate an analysis with a press of a button using open-source software, it’s much easier to move forward.

Frictionless reproducibility

Frictionless reproducibility is a useful goal in research. It can take a while between the development of a statistical idea and its implementation in a reproducible way, and that’s ok. But it’s good to aim for that stage. The effort it takes to make a research idea reproducible is often worth it, in that getting to reproducibility typically requires a level of care and rigor beyond what is necessary just to get a paper published. One thing I’ve learned with Stan is that much is learned in the process of developing a general tool that will be used by strangers.

I think that statisticians have a special perspective for thinking about these issues, for the following reason:

Methods as proto-algorithms

As statisticians, we’re always working with “methods.” Sometimes we develop new methods or extend existing methods; sometimes we place existing methods into a larger theoretical framework; sometimes we study the properties of methods; sometimes we apply methods. Donoho and I are typical of statistics professors in having done all these things in our work.

A “method” is a sort of proto-algorithm, not quite fully algorithmic (for example, it could require choices of inputs, tuning parameters, expert inputs at certain points) but it follows some series of steps. The essence of a method is that it can be applied by others. In that sense, any method is a bridge between different humans; it’s a sort of communication among groups of people who may never meet or even directly correspond. Fisher invented logistic regression and decades later some psychometrician uses it; the method is a sort of message in a bottle.

Division of labor as a characteristic of statistical methods

There are different ways to take this perspective. One direction is to recognize that almost all statistical methods involve a division of labor. In Bayes, one agent creates the likelihood model and another agent creates the prior model. In bootstrap, one agent comes up with the estimator and another agent comes up with the bootstrapping procedure. In classical statistics, one agent creates the measurement protocol, another agent designs the experiment, and a third agent performs the analysis. In machine learning, there’s the training and test sets. With public surveys, one group conducts the survey and computes weights; other groups analyze the data using the weights. Etc. We discussed this general idea a few years ago here.
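
To make the bootstrap case concrete, here’s a minimal sketch in Python (standard library only). The function names, the toy data, and the choice of a trimmed mean are all made up for illustration; the point is just that the person writing the resampling routine never needs to know what the estimator is, and vice versa.

```python
import random
import statistics


def bootstrap_se(data, estimator, n_boot=2000, seed=0):
    """Agent B: a generic nonparametric bootstrap.

    It resamples the data with replacement and returns the standard
    deviation of the estimator across resamples. It knows nothing
    about what `estimator` computes.
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)


def trimmed_mean(xs, trim=0.1):
    """Agent A: an estimator, written with no knowledge of the bootstrap."""
    xs = sorted(xs)
    k = int(len(xs) * trim)  # number of points dropped from each tail
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return statistics.fmean(kept)


# Toy data (hypothetical), with one large value to make the two
# estimators behave differently.
data = [2.1, 3.4, 1.9, 5.6, 2.8, 3.3, 4.0, 2.5, 9.7, 3.1]

se_mean = bootstrap_se(data, statistics.fmean)
se_trimmed = bootstrap_se(data, trimmed_mean)
print("bootstrap s.e. of the mean:        ", round(se_mean, 3))
print("bootstrap s.e. of the trimmed mean:", round(se_trimmed, 3))
```

The only thing the two agents have to agree on is the interface: the estimator takes a list of numbers and returns a number.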

But that’s not the direction I want to go right here. Instead I want to consider something else, which is the way that a “method” is an establishment of a default; see here and also here.

Statistics as the science of defaults

The relevance to the current discussion is that, to the extent that defaults are a move toward automatic behavior, statisticians are in the business of automating science. That is, our methods are “successes” to the extent that they enable automatic behavior on the part of users. As we have discussed, automatic behavior is not a bad thing! When we make things automatic, users can think at the next level of abstraction. For example, push-button linear regression allows researchers to focus on the model rather than on how to solve a matrix equation, and it can even take them to the next level of abstraction and think about prediction without even thinking about the model. As teachers and users of research, we then are (rightly) concerned that lack of understanding can be a problem, but it’s hard to go back. We might as well complain that the vast majority of people drive their cars with no understanding of how those little explosions inside the engine make the car go round.
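
Here’s a small sketch of those levels of abstraction, again in plain Python with made-up names and made-up data. It is not how any particular regression package works; it just shows the matrix equation hidden behind a fit function, which is in turn hidden behind a predict function.

```python
def solve_2x2(a, b, c, d, e, f):
    """Level 0: solve [[a, b], [c, d]] @ [u, v] = [e, f] by Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular system: x has no variation")
    u = (e * d - b * f) / det
    v = (a * f - c * e) / det
    return u, v


def fit_line(x, y):
    """Level 1: least-squares fit of y = intercept + slope * x.

    Builds the normal equations (X'X) beta = X'y for the design
    matrix X = [1, x] and hands them to the solver.
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return solve_2x2(n, sx, sx, sxx, sy, sxy)


def predict(model, x_new):
    """Level 2: prediction, with no need to look inside the model."""
    intercept, slope = model
    return intercept + slope * x_new


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

model = fit_line(x, y)
print(model)
print(predict(model, 4.0))
```

A user who only ever calls the predict function is thinking at the top level, and that’s fine right up until the data have no variation in x and the error comes from two levels down.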

Statisticians well prepared to think about issues raised by AI

To get back to the AI issue: I think that we as statisticians are particularly well prepared to think about the issues that AI brings, because the essence of statistics is the development of tools designed to automate human thinking about models and data. Statistical methods are a sort of slow-moving AI, and it’s kind of always been our dream to automate as much of the statistics process as possible, while recognizing that for Cantorian reasons (see section 7 here) we will never be there. Given that we’re trying, to a large extent, to turn humans into machines or to routinize what has traditionally been a human behavior that has required care, knowledge, and creativity, we should have some insight into computer programs that do such things.

In some ways, we statisticians are even more qualified to think about this than computer scientists are, in that the paradigmatic action of a computer scientist is to solve a problem, whereas the paradigmatic action of a statistician is to come up with a method that will allow other people to solve their problems.

I sent the above to Jessica, who wrote:

I like the emphasis on frictionless reproducibility as a critical driver of the success in ML. Empirical ML has clearly emphasized methods for ensuring the validity of predictive performance estimates (hold out sets, common task framework etc) compared to fields that use statistical modeling to generate explanations, like social sciences, and it does seem like that has paid off.

From my perspective, there’s something else that’s been very successful though as well – post-2015ish there’s been a heavy emphasis on making models robust to adversarial attack. Being able to take an arbitrary evaluation metric and incorporate it into your loss function so you’re explicitly training for it is also likely to improve things fast. We comment on this a bit in a paper we wrote last year reflecting on what, if anything, recent concerns about ML reproducibility and replicability have in common with the so-called replication crisis in social science.

I do think we are about at max hype currently in terms of perceived success of ML though, and it can be hard to tell sometimes how much the emerging evidence of success from ML research is overfit to the standard benchmarks. Obviously there have been huge improvements on certain test suites, but just this morning for instance I saw an ML researcher present a pretty compelling graph showing that the “certified robustness” of the top LLMs (GPT-3.5, GPT-4, Llama 2, etc), when trained on the common datasets (ImageNet, MNIST, etc), has not really improved much at all in the past 7-8 years. This was a line graph where each line denoted changes in robustness for different benchmarks (ImageNet, MNIST, etc) with new methodological advances. Each point in a line represented the robustness of a deep net on that particular benchmark given whatever was considered the state of the art in robust ML at that time. The x-axis was related to time, but each tick represented a particular paper that advanced SOTA. It’s still very easy to trick LLMs into generating toxic text, leaking private data they trained on, or changing their mind based on what should be an inconsequential change to the wording of a prompt, for example.

In the blog discussion that followed, Donoho commented here, here, and here. It’s great to have multiple forums for discussion.

1 thought on “19 ways of looking at data science at the singularity, from David Donoho and 17 others”

Dale Lehman on July 13, 2024 at 10:35 am said:

Well nobody has responded to this post, so I’ll see if I can get something started – I think this is important and deserves some attention. I have read Donoho’s original paper and his response, as well as a few (but not all) of these responses to the paper. There are many things – most things – that I agree with, but I find it all a bit naive. I certainly agree with the goal of frictionless reproducibility as well as the components. But Donoho presents it as a dominant model that seems destined to prevail (although he does acknowledge that there are impediments). He also emphasizes (in his response) the social and community aspects of the data science ecosystem. What seems under-appreciated are the obstacles posed by the economic and regulatory incentives that stand in the way of data and code sharing. While there are many examples given where such sharing occurs, there are strong incentives and regulatory protections against sharing data (especially medical and educational data), and private incentives to restrict code sharing if the perceived private gains from keeping it secret exceed the foregone gains from sharing.

As but one example, consider the data competitions, particularly using Kaggle as an example. I’ve entered some of the competitions and followed them from the beginning. But they have become less interesting to me over time – many of the competitions anonymize the variables in the data. I’ve seen competitions where the variables are labeled Var1, Var2, ..., Var400. For many computer scientists that may be sufficient for improving their models; from my point of view, losing all context for what the variables represent makes the exercise somewhat sterile. I’ve always thought that the context of the data was critical to understanding it.

Of course, there are many obstacles to data sharing. The identity of the observations needs to be protected in most cases, and protection of the identity of the variables derives from proprietary data sources, usually of commercial value. What I find ironic is the belief that the examples of sharing will dominate these commercial and regulatory incentives. Since data science is an ecosystem with strong social components, I try to find reasons for optimism from all those other social mechanisms that human prosperity relies on. How successful has social cooperation been at preventing ecological damage and the ravages of war? How successful has it been at promoting meaningful social discourse?

That is why I find Donoho’s paper and many of the responses naive. It is not that these obstacles are ignored – they are recognized, but there seems to be a belief that the huge advantages of the frictionless reproducibility model must prevail. I find that a bit like believing that the advantages of democracy and capitalism must prevail. The evidence is not so convincing to me.

Statistical Modeling, Causal Inference, and Social Science
