Aki and I wrote this article, doing our best to present a broad perspective.

We argue that the most important statistical ideas of the past half century are: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. These eight ideas represent a categorization based on our experiences and reading of the literature and are not listed in chronological order or in order of importance. They are separate concepts capturing different useful and general developments in statistics. We discuss common features of these ideas, how they relate to modern computing and big data, and how they might be developed and extended in future decades.

An earlier version of this paper appeared on arXiv, but then we and others noticed some places to fix, so we updated it.

Here are the sections of the paper:

1. The most important statistical ideas of the past 50 years

1.1. Counterfactual causal inference

1.2. Bootstrapping and simulation-based inference

1.3. Overparameterized models and regularization

1.4. Multilevel models

1.5. Generic computation algorithms

1.6. Adaptive decision analysis

1.7. Robust inference

1.8. Exploratory data analysis

2. What these ideas have in common and how they differ

2.1. Ideas lead to methods and workflows

2.2. Advances in computing

2.3. Big data

2.4. Connections and interactions among these ideas

2.5. Theory motivating application and vice versa

2.6. Links to other new and useful developments in statistics

3. What will be the important statistical ideas of the next few decades?

3.1. Looking backward

3.2. Looking forward

The article was fun to write and to revise, and I hope it will motivate others to share their views.

Thanks for writing this!

This paragraph in 3.2 (Looking forward) caught my eye:

“Another general area that is ripe for development is model understanding, sometimes called interpretable machine learning (Murdoch et al., 2019, Molnar, 2020). The paradox here is that the best way to understand a complicated model is often to approximate it with a simpler model, but then the question is, what is really being communicated here? One potentially useful approach is to compute sensitivities of inferences to perturbations of data and model parameters (Giordano, Broderick, and Jordan, 2018), combining ideas of robustness and regularization with gradient-based computational methods that are used in many different statistical algorithms.”

I’m glad you mentioned this, since the tendency to draw a sharp line between statistical methods and statistical communication is undoubtedly a contributor to the replication crisis and other abuses of data science. However, I would probably state this future goal more broadly, as the need to get serious about statistical communication, including getting more scientific about things like how we communicate uncertainty. The recent interest in interpretability in ML is obviously part of this, but I see the goal of figuring out what kind of approximations people can use to understand what deep learning is doing (what much of the ML interpretability stuff focuses on) as just one part of a much bigger research area that needs to develop alongside statistical methods. I fully admit this is somewhat my personal pet peeve as I watch a bunch of ML researchers “discover” uncertainty visualization, an area I work in, as though it’s brand new and people haven’t thought at all about how to create visualizations or interactive interfaces to help people get intuitions about models.

Agreed, I think a lot of the issues we see in ML come from researchers failing to understand the models that arise when they combine complex algorithms with complex data. The virtue of a model is not just in its predictive power, but in its ability to clarify the structure of data. So when we end up with a model that is just as hard to understand—even for the people who implemented it—I don’t think much has been gained.

On that front, I wonder if anyone has tried using compression algorithms or something similar to quantify the complexity of a neural net model? I’m thinking about taking the matrices of trained weights and treating them like an image/video file and seeing how small it can be made? This is analogous to Kolmogorov complexity, in that it represents how short a description is needed to reproduce the model. I specifically mentioned images/videos because these are media that are meant to be viewed and understood by humans, so whatever compression techniques exist in those domains might be “tuned” to features that people can understand.
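A minimal Python sketch of the compression idea above, under loud assumptions: zlib stands in for an image/video codec, the weights are quantized to 8 bits before compression, and `compressed_size`, `noise`, and `sparse` are hypothetical names and synthetic arrays, not weights from any trained network. It only illustrates that a general-purpose compressor can tell unstructured weights from structured (here, mostly pruned) ones.

```python
import zlib
import numpy as np

def compressed_size(weights, levels=256):
    """Quantize a float array to `levels` bins (at most 256) and
    return the zlib-compressed size in bytes."""
    assert 2 <= levels <= 256, "quantized values must fit in one byte"
    w = np.asarray(weights, dtype=np.float64)
    lo, hi = w.min(), w.max()
    span = hi - lo if hi > lo else 1.0
    q = np.round((w - lo) / span * (levels - 1)).astype(np.uint8)
    return len(zlib.compress(q.tobytes(), 9))

rng = np.random.default_rng(0)

# Assumption: synthetic stand-ins for weight matrices.
noise = rng.normal(size=(64, 64))        # unstructured weights
mask = rng.random((64, 64)) < 0.1        # keep about 10% of the entries
sparse = np.where(mask, noise, 0.0)      # "pruned" weights, mostly zero

size_noise = compressed_size(noise)
size_sparse = compressed_size(sparse)
print(size_noise, size_sparse)
```

The compressed size is only an upper bound on description length, and it depends heavily on the quantization and on the compressor chosen, so comparisons make sense only when both are held fixed.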

I’m just spitballing here and it sounds kinda silly having written it out.

I’ve been on an “information theory to quantify assumptions about uncertainty communication” kick, so it doesn’t sound silly to me. This looks related and now I will have to read it: https://proceedings.neurips.cc//paper/2020/file/b1adda14824f50ef24ff1c05bb66faf3-Paper.pdf

Awesome! Thanks for both the link (which I will also have to read) and for making me feel less silly.

Here’s an idea. Take different neural net models that do approximately the same thing but were developed independently. Compress each separately to determine their respective complexity, then compress them together to get a sense of how much of the complexity in one model is redundant with the other. Whatever parts get compressed together, those are the parts most likely to be essential to the task (all else being equal). The other parts are more likely to be superfluous or circuitous (all else being equal). The procedure is analogous to correlation and regression, except that it depends on Shannon information instead of Fisher information. That means the Shannon correlation is inference-free: the complexity is directly observable, and you’re only using this method to determine which model components you should observe and compare between models. That is, unlike a Pearson correlation between two population constructs, you can directly examine the model components to see how they’re similar, and to investigate the hypothesis that similarity between components of useful models implies these are useful components with meaningful interpretations.

I do like the spirit of this idea, though I haven’t thought about it enough to have much practical sense of it.

Essentially, it is taking two models and translating them into a common format (that’s the compression) and seeing how much they overlap in that new form. Arguably, Pearson correlation does the same thing where the “common format” is the standardized scale. But in this case, the compression preserves structural features of the model so there is more meaning to the comparison.

I also appreciate the emphasis on comparing NN models trained in different settings. I don’t do much ML myself, but as the saying goes, some of my best friends are ML people. And it is shocking to me that reported models are usually just the “best” run and there is little/no attempt to replicate results within a lab, let alone between.

You’ve got it. The key is that looking at the covariance between models actually tells us something different from, and not necessarily as useful as, what their mutual complexity tells us. Covariance tells us how similar the structures are, with the assumption that the covariance is due to a common factor (parameter). Mutual complexity makes no such assumption–shared complexity is solely concerned with superficial similarity. Thus, according to Wikipedia:

“Using the ideas of Kolmogorov complexity, one can consider the mutual information of two sequences independent of any probability distribution…Approximations of this quantity via compression can be used to define a distance measure to perform a hierarchical clustering of sequences without having any domain knowledge of the sequences (Cilibrasi & Vitányi 2005).” https://en.wikipedia.org/wiki/Mutual_information#Absolute_mutual_information
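The compression-based distance in the quoted passage is usually written as the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. A rough Python sketch follows, assuming zlib as the compressor and random byte strings as stand-ins for serialized model weights; `clen`, `ncd`, `model_a`, `near`, and `far` are hypothetical names introduced here for illustration.

```python
import random
import zlib

def clen(data: bytes) -> int:
    """Compressed length in bytes, used as a proxy for complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 when x and y share
    almost all their content, near 1 when they share almost none."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

r = random.Random(0)

def rand_bytes(n: int) -> bytes:
    return bytes(r.getrandbits(8) for _ in range(n))

# Assumption: byte strings standing in for two serialized models.
model_a = rand_bytes(2000)
near = rand_bytes(100) + model_a[100:]   # shares most content with model_a
far = rand_bytes(2000)                   # shares nothing with model_a

d_near = ncd(model_a, near)
d_far = ncd(model_a, far)
print(d_near, d_far)
```

One caveat on the design: zlib only finds repeats within a 32 KB window, so for real weight files a compressor with a longer range would be needed before the joint compression can detect shared structure.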

I was surprised to find that the cited article demonstrates something very similar, and is seminal in an extensive body of research on this approach.

> the best way to understand a complicated model is often to approximate it with a simpler model

Maybe not, matching predictions is not matching the how/why of the predictions – https://users.cs.duke.edu/~cynthia/papers.html and in particular https://www.nature.com/articles/s42256-019-0048-x.epdf?author_access_token=SU_TpOb-H5d3uy5KF-dedtRgN0jAjWel9jnR3ZoTv0M3t8uDwhDckroSbUOOygdba5KNHQMo_Ji2D1_SdDjVr6hjgxJXc-7jt5FQZuPTQKIAkZsBoTI4uqjwnzbltD01Z8QwhwKsbvwh-z1xL8bAcg%3D%3D

I see it as first choosing an analysis most appropriate to the task including consideration of human information practice (what impact it will likely have on how people think). Communication is a later step?

Looks like an ambitious and fascinating effort. Beyond my competence. I’ll just relish in the fact that I pick up great insights here.

> The idea of partial pooling of local and general information is inherent in the mathematics of prediction from noisy data and, as such, dates back to Laplace and Gauss and is implicit in the ideas of Galton.

Not sure what you mean by _idea_ here. For instance, I am not sure any of them perceived the need for differing day errors or differing birth percentages.

At least, this is what I wrote in my thesis – In 1839 Bienayme had remarked that the relative frequency of repeated samples of binary outcomes often show larger variation than indicated by a single underlying proportion and proposed a full probability-based random effects model (suggested earlier by Poisson) to account for this. Here, the concept of a common underlying proportion was replaced by a common distribution of underlying proportions [that were drawn from]. It is interesting that a random effects model where what is common in observations is not a parameter, but a distribution of a parameter, followed so soon after the development of likelihood methods for combination under the assumption of just a common parameter [Laplace and Gauss].
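To make the contrast between a common proportion and a common distribution of proportions concrete, here is a small simulation sketch in Python. The Beta(10, 10) distribution, the group sizes, and all variable names are assumptions chosen for illustration; they are not taken from Bienayme or from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, n_trials = 2000, 100

# Model 1: every group shares one underlying proportion.
common = rng.binomial(n_trials, 0.5, size=n_groups) / n_trials

# Model 2: each group's proportion is itself drawn from a shared
# distribution (assumed here to be Beta(10, 10), which has mean 0.5).
p = rng.beta(10, 10, size=n_groups)
varying = rng.binomial(n_trials, p) / n_trials

# Variance of an observed relative frequency under a single p = 0.5.
binomial_var = 0.5 * 0.5 / n_trials

var_common = common.var()
var_varying = varying.var()
print(binomial_var, var_common, var_varying)
```

Under the second model the observed frequencies vary more than a single proportion allows, because the variance of the proportions themselves is added to the binomial sampling variance.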

> we prefer to think of it as a framework for combining different sources of information

Don’t think we want to presume commonness but rather carefully hypothesize and assess commonness before partial pooling. There are situations where this might be safely skipped but it is often overlooked http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

“Awareness of commonness can lead to an increase in evidence regarding the target; disregarding commonness wastes evidence; and mistaken acceptance of commonness destroys otherwise available evidence.” … “We believe, this simple contrived example from wiki [on Simpson’s paradox] nicely shows a lack of concern regarding the need to represent reality (correctly specifying common and non-common parameters) as well as one can or at least well enough for statistical procedures to provide reasonable answers.”

It’s fascinating and insightful, and I look forward to reading it more closely! It’s a very short paper for such a big topic, but that makes it very readable. In my initial reading, it would’ve helped me if you had provided up front some definitions of and constraints on the very broad terms in your title/abstract. You obviously have to apply implicit definitions and constraints in your analysis, but you could make them explicit. So, in your approach, what counts as an idea? What makes something comparatively important? For whom–statisticians, scientists, the world? Must the idea have been conceived in the last 50 years and also had the bulk of its impact in that period? Or can it be one or the other–a new idea that’s important but still hasn’t been embraced, or an old idea that finally found its time? I infer the following definitions from the text:

a) Ideas: developments or trends in methods, whether they be conceptual or mathematical or computational.

This is not a criticism, just a note that the paper has a different flavor than it might have had with a different definition of the term. For example, one section is about “the idea of fitting a model with a large number of parameters…using some regularization procedure to get stable estimates and good predictions.” This is an idea for a set of methodologies, but you might have instead focused on the deeper idea that we can get more and better information from some data by breaking the rules and then penalizing the results. It’s the same idea behind corrections for multiple comparisons, the LASSO, and the Edlin factor. Other sections talk more directly about ideas independent of methods, like valuing robustness and exploration.

b) Importance: the magnitude of the consequences of an idea, or of its operationalizations, for changing statistical practice and yielding more useful information.

Given the themes on your blog, I was surprised ideas like “NHST’s are dumb” and “statistics is hard so the review process needs to go on before, during, and after publication” didn’t make it in. But those ideas, while crucial, are not widely held, and so haven’t been as consequential.

c) Statistical ideas: statistical ideas about statistical practices for statisticians, as opposed to statistical ideas about science, like “we must limit substantive conclusions to what’s supported by the statistical conclusions.”

Again, not a criticism, just a distinction that could be made explicit.

On another tack, some form of the word “Bayes” comes up 23 times in the text, and another 32 times in the references. Not really sure what this implies, but it definitely says something about your focus that would’ve been helpful context up front. That multilevel models “can be seen as Bayesian” comes off as a non-sequitur at first, but it ends up being the important idea more so than MLM’s in general.

Hey, what about non-parametric models (like random forests or neural nets)? Sure, people knew about them forever, but I think a lot of the math was developed in the 1970s and then they didn’t get popular until we had the compute power for them. Arguably the same could be said for state-space models like HMMs, Kalman filters, etc. or even simple time series. There just wasn’t a lot that people could compute in the 1960s and earlier (but still enough to get to the moon!). Or maybe Aki and Andrew don’t think of neural nets or random forests as being statistical ideas or aren’t considering applications.

What about the machine learning people refocusing attention on prediction? For me, that’s the biggest change in landscape among people who do stats-like things. They got around the whole p-value thing by actually deploying systems that did things and then measuring how they worked (though they sneak them back in when comparing systems in academic settings).

Presumably that timeline is about when things became popular, not when they were introduced to the literature. For example, doesn’t shrinkage go back at least to Stein in the 1950s? But then arguably it wasn’t popular until Efron and Morris in the 1970s or even Hastie and Tibshirani in the 1990s or the compressive sensing literature after that. I still don’t get all the fuss about causality. Thinking of stats counterfactually goes back to Laplace. Even so, EDA seems to be misplaced on this list. Hasn’t it always been more popular than statistics itself? Or is this some specific understanding of EDA following Tukey’s book? If so, I don’t see much impact of that in stats teaching or practice. But then I was playing little league baseball in the 1960s, not following stats :-)

I guess I’ll have to read the paper. I don’t even know what “robust inference” means in this context!

Bob:

Yes, definitely we think that nonparametric models like random forests and neural nets are important! They’re included in two of our eight “most important ideas”: Overparameterized models and regularization and Generic computation algorithms. We also talk about computing in section 2.2. Regarding prediction, some of that is in the above-mentioned sections, and it also comes up in Adaptive decision analysis. Regarding Stein etc.: yes, we cite that, as well as what came later. As we write in section 1 of our paper: Each of these ideas has pre-1970 antecedents, both in the theoretical statistics literature and in the practice of various applied fields. But each has developed enough in the past fifty years to have become something new. As for EDA, it’s been hugely influential in statistical practice: think of the popularity of ggplot. Finally, you can read section 1.7 to see what we say about robust inference, also it comes up in section 2 when we talk about connections among the ideas.

I should’ve read the paper first—it’s an easy read. I didn’t know all the methods mentioned and would’ve cited alternative references for a lot of the things you covered. It’s interesting to see this statistics framing of things like black box/non-parametric models as “over-parameterization”. That’s in some sense a more honest description than “non-parametric”—the number of parameters grows with the data! But that overparameterization thing feels like an implementation detail. The bigger idea is black box-ish likelihoods, where I add the “ish” to point out that there’s still a lot of fiddling on input encoding, auxiliary “unsupervised” data, output encoding, and convolutional or recurrent network structure that goes in. The world’s best Go player isn’t a black-box neural network. I’m curious as to whether the deep neural nets folks think of neural nets as non-parametric in the usual sense of having a number of parameters that grows with data. The models don’t unfold themselves that way like Andrew’s Fabulous Unfolding Flower, but practitioners do tend to fit bigger models with bigger data sets. And the number of parameters used is usually rather large compared to data dimensionality.

From my perspective having moved from machine learning into stats, the even bigger picture here is moving from retrospective data analysis to prediction. I’ve always thought the title Bayesian Data Analysis felt old fashioned in that it seems to imply a retrospective view of data. More recently, I’ve thought more about how it’s really more reflective of a data-first rather than model-first or parameter-first view of statistics. Prior predictive checks, posterior-predictive checks, cross-validation, and calibration of probabilistic predictions can all be framed in terms of data. In some sense, they have to break down into statistics of data or we’re just doing mathematics. Back in the 1990s, when ML was heating up, I constantly heard things like, “statisticians care about their model parameters, us ML types care about prediction”. I’ve heard that view reinforced recently by ML practitioners at Columbia and elsewhere. Andrew’s thinking around Bayesian inference feels to me very much in line with the goals of ML and not so much interested in the NHST focus of more traditional early 20th century statistics. You can see that in our recently arXiv-ed workflow paper, which is rather laissez-faire about post-selection inference, to say the least.

Great article that I will have to read very slowly! I stopped practicing as a statistician 52 years ago and moved into computer science. This article tells me what I should have been reading since then. Thank you.

It’s great to have this take on important ideas and key references collected in one place. Thank you!

However, it seems to me that the contributors from the field of Computer Science and their influence on the field of Statistics are a bit under-mentioned in the article — or at least lack explicit citation to the degree they warrant. To be clear, I’m neither a Computer Scientist nor Statistician — I’ve only worked around the edges of these fields. So, maybe in what follows I’m just showing the shallowness & biases of how I learned these topics….

At any rate, in this passage from Section “2.2. Advances in computing”: “With the increase in computing power, variational and Markov chain simulation methods have allowed separation of model building and development of inference algorithms, leading to probabilistic programming that has freed domain experts in different fields to focus on model building and get inference done automatically.” Where are the citations to probabilistic programming references like Roy DM. 2011 “Computability, inference and modeling in probabilistic programming.” PhD thesis, MIT Press, Cambridge, MA; and to those exemplifying the approaches of Chris Bishop’s “Model-based Machine Learning”, Phil. Trans. R. Soc. A 2013 371? Or, collectively, Bishop, Michael I. Jordan, Josh Tenenbaum, Noah Goodman, Vikash Mansinghka, David Blei, Sebastian Thrun, Thomas Minka, etc. I’m under the impression that the ideas they brought to reality — e.g. infer.NET, Church, etc. — were the precursors of and had some influence on the realization of later systems like Edward, PyMC3, Stan, Pyro, etc. and approaches like Probabilistic/Bayesian Machine Learning (e.g., latent Dirichlet allocation) and the incorporation of greater amounts of domain knowledge along with data in complex systems models in many applied statistical modeling — which itself is a transformation in the emphasis of statistical modeling balancing more explicitly the complementary roles of domain knowledge and data. No? (I guess it begs the question, “How much of their Computer Science work was motivated by advances within Statistics?” I don’t know, but I’d like to see that explained, too — in that, if an idea is important not only within Statistics but also as a driver of innovation in other fields, I’d like to know more about that idea and the folks who brought it to life.)

Also, Section “1.3. Overparameterized models and regularization” provides no explicit citations to contributors in Bayesian nonparametrics like Yee Whye Teh et al.’s “Hierarchical Dirichlet Processes”, Journal of the American Statistical Association 101, 1566-1581; or even to statisticians like David Dunson, Peter Muller, etc. (Granted, maybe my impression of the importance of their work is exaggerated, esp. with respect to impact on the field of Statistics in general. I’m inclined to think so as I look at the Google Scholar “Cited By” numbers….)

Moreover, Section “1.1. Counterfactual causal inference” lacks citations of, say, Spirtes, P., Glymour, C., & Scheines, R. (2001). “Causation, Prediction, and Search,” Second Ed. (Adaptive Computation and Machine Learning). The MIT Press; or of Elias Bareinboim’s work with Pearl on transportability, like Pearl, J., and E. Bareinboim. “Transportability of causal and statistical relations: A formal approach.” 2011 IEEE 11th International Conference on Data Mining Workshops. IEEE, 2011 — or some journal article by them. Basically, the single listing of “(Pearl 2009)” doesn’t come close to doing justice to Pearl’s contributions on causal inference, even if solely as a motivating thorn in the side of statisticians leading to a reconsideration of the feasibility of deriving causality from observational data and the role of experimentation combined with observational data — and no mention of Bayesian networks and belief propagation (guess, little importance within Statistics, even Bayesian Statistics?).

Additionally, I would have thought state space models and the advent of structural time series models might have warranted special mention among important ideas in Statistics within the past 50 years, as well. Even if their advent may have been earlier, their impact has spiked with recent computing advances, especially in applied areas like econometrics.

In general, there seems only incidental mention of the nearly ubiquitous role of latent variable models as a single class, which are unified under generative Bayesian modeling into that single general class; collectively capturing error-in-variables, missing values, generalized linear models, structural equation models, latent classification/clustering, state space models, filtering/smoothing, autoencoders, GANs & conditional generative models, etc.

Oh, one more thing, how about a mention of the ideas from Information Theory and how their interplay with statistics has driven advances in model representation, algorithm design, and inference communication approaches, especially in quantifying the information content of data, priors, posteriors and evidence (e.g., Shannon entropy & mutual information, Jaynes maxent, Kullback-Leibler Divergence, I.J.Good/A.Turing weight of evidence, variational inference, etc.)? (Many of these ideas are older than 50 years but are coming to the fore recently with computing advances.)

I didn’t even call out the pioneers & advances in software environment/hardware/compute architecture/cloud capabilities. So, I do sympathize with the authors’ daunting task and the imperative to decide on drawing boundaries somewhere wrt what they cite.

Still, as implied within the article, given that so much of what we see as Statistics these days is enabled by advances in Computer Science, we might as well explicitly call attention to innovative folks in both fields, esp. those working at the intersection. I’d love to see a greater celebration of Who’s Who in each field and what they’ve helped (1) transfer to the other field and (2) disseminate within their own field from the other. And, an article like this one could be a great demonstration of the dialogue between the fields.

At any rate, I much appreciate the effort the authors have put into this work. Thanks again!

Michael:

Thanks for the feedback. Yes, this is an article about statistical ideas not computing ideas, so we do talk about computing but that is not our focus. Aki is in the computer science department, though, so we do our best.

Regarding specific citations, we do cite many of the people and ideas you mention. These citations are all over the article, not just in section 2.2. Also the point of the citations was to link to the basic idea. Pearl did a lot of work on causal inference but we only cite that one book of his. Angrist and Imbens did a lot of very important work in the area too, and we only cite one paper of theirs. In an article of this length covering all of statistics, there’s just no room to come close to doing justice to the contributions of Pearl, Angrist, Imbens, or many others. In our books we have lots more citations; take a look at the bibliographic notes of the causal inference chapters of Regression and Other Stories. In this article we tried to show discretion in how many citations to include. For example we include multiple citations of work by Donoho and by Wahba on nonparametrics and regularization; this is partly because different aspects of this work have been highly influential in statistics during the past few decades. We could’ve included more citations of others; it’s just not always clear where to stop, and our focus was on giving some references to what we thought was most important.

More details: We write about latent variable models in multiple places including sections 1.3 and 1.4, and, as to all those methods that we have “only incidental mention” . . . that’s what you have to do if you want to write a short article. For example, had we included a paragraph rather than just a mention of each of “capturing error-in-variables, missing values, generalized linear models, structural equation models, latent classification/clustering, state space models, filtering/smoothing, autoencoders, GANs & conditional generative models, etc.,” this would add several pages to the article right there—and then for balance we’d want paragraphs on different important ideas in causal inference for observational studies, different ideas in exploratory data analysis, not to mention ideas close to my own heart such as MRP . . . . and then the article would be 50 or 100 pages long! I consider it a success that the article mentioned all these ideas (well, not MRP; we didn’t have space for every great idea!) in a way that drew connections among them.

Regarding the desire “to explicitly call attention to innovative folks . . . a greater celebration of Who’s Who . . .”: We actually worked pretty hard to focus on the ideas, not the people. An earlier draft of the article was more people-focused, but we decided that a focus on individuals was counterproductive. So let me just say for the record here that I’m a big fan of the work of David Dunson and Peter Mueller. Not including them in this article is not at all a negative reflection on them. We were just trying our best to outline what we saw as the most important ideas and trace some of their development, and there’s an unavoidable arbitrariness in what exactly gets included.

We attempted to have a broad view in our article but we know that this is just our perspective. My hope is for article to be published in a journal along with discussions from people with other attitudes and experiences.

Andrew:

Thank you for responding to my comment and addressing my points.

I especially applaud the spirit of your response’s final sentence: “My hope is for [the] article to be published in a journal along with discussions from people with other attitudes and experiences.” Looked upon that way, your article should serve well in stirring discussion. And, as with those “Greatest Teams/Players in History” debates in sports, it’ll be fun to see!

A lot of this kind of stuff is just “in the air”. Everything influences everything else because people talk to each other. For Stan, Matt and I had the first working end-to-end system of language, autodiff and HMC in 2011. We were both computer scientists working for Andrew at the time. The language design descended directly from my figuring out how to translate BUGS statements into log density increments so we could use autodiff. I still don’t understand what infer.NET does. I couldn’t understand Church’s inference at the time as I was just learning Bayesian computational inference and focusing on HMC as we were building Stan; now that I do understand how Church works, I don’t hold out a lot of hope for it in practical problems.

BUGS was built by epidemiologists way back in the 1990s, not computer scientists. By measures like citations, books, courses, etc., it’s been hugely successful and influential. PyMC3 was largely an adaptation of Stan’s algorithms to an embedded DSL in Python and symbolic differentiation with Theano. It was also built by a team including at least one epidemiologist. JAGS descended from BUGS, but using C++ from R. Also built by an epidemiologist. From that small N = 3 sample, I’d conclude that epidemiology rocks in terms of producing PPLs.

Matt and I built the first autodiff system inspired most by Sacado and CppAD, which came out of applied math and operations research, not CS. As did most of the other “early” systems before Stan, Theano, TensorFlow, or PyTorch. You might cite back-prop in neural nets, which is a simple instance of autodiff, but it didn’t really have anything to add to the discussion at the time other than a very well-structured application.

I’d say that’s CS catching up with stats, which has always been about applying domain knowledge. Stats just encodes that knowledge as likelihoods, priors, and sometimes, data transformations. ML grew out of random forest, large scale L1 regularization, neural nets, and other black-box approaches, but there’s a lot of recent work in more structured neural net predictions for scientific problems. Sort of a realization that it might be useful to add some Hamiltonian physical structure if you’re trying to learn n-body problem mechanics.

I don’t think the advances are one way, though. You have people like Michael Jordan who straddle the fields and produce students straddling the fields like Dave Blei, Tamara Broderick, and Francis Bach. What’s funny is that I vividly remember talks by Jordan and Hinton in psychology (CMU psych was big into “connectionism”) in the 1980s and early 90s on neural nets that everyone in CS just yawned through. They were cute examples, but nobody was imagining they’d scale to what they are today.

I myself came out of programming language theory in computer science (side product of living in Edinburgh and Pittsburgh for 10+ years) and natural language processing.

It doesn’t look like the application of differential geometry/topology to the analysis of high-dimensional probability spaces is in here. As a theory of applied statistics, it seems that viewing high-dimensional parameter phase space as constrained to some gnarly manifold with low intrinsic dimension is a big motivator for computational methods in HMC, analysis of nonparametric machine learning models, explaining why deep learning works at all, methods of dimensionality reduction, etc.

Somebody:

I’d say that’s implicit in sections 1.3 and 1.5, at least that’s how I was thinking of it.

Often the gnarly manifolds are defined in the likelihood or data space, not in the posterior. That is, the metric over y implied by the likelihood log p(y | theta) is a mess, not the metric over the posterior theta implied by log p(theta | y).

In Stan, we adapt a simple Euclidean metric (a diagonal or dense positive-definite matrix) for Hamiltonian Monte Carlo based on the potential -log p(theta | y). The metric can’t get gnarly by construction, but it’s often a poor approximation of the true flow implied by the posterior.
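To make the role of the Euclidean metric concrete, here is a rough sketch of HMC with a fixed diagonal inverse metric, written in Python with NumPy. It is not Stan’s implementation: there is no adaptation and no NUTS, the function names `leapfrog` and `hmc_step` are made up, and the target is an assumed toy posterior (two independent normals with scales 1 and 10) whose variances are simply plugged in as the inverse metric.

```python
import numpy as np


def leapfrog(theta, rho, grad_U, inv_metric, step_size, n_steps):
    """Leapfrog integrator for H = U(theta) + 0.5 * sum(inv_metric * rho**2)."""
    theta = theta.copy()
    rho = rho - 0.5 * step_size * grad_U(theta)       # initial half step in momentum
    for i in range(n_steps):
        theta = theta + step_size * inv_metric * rho  # full step in position
        if i < n_steps - 1:
            rho = rho - step_size * grad_U(theta)     # full step in momentum
    rho = rho - 0.5 * step_size * grad_U(theta)       # final half step in momentum
    return theta, rho


def hmc_step(theta, U, grad_U, inv_metric, step_size, n_steps, rng):
    """One Metropolis-adjusted HMC transition with a diagonal Euclidean metric."""
    # Momentum is drawn from Normal(0, M), where M is the inverse of inv_metric.
    rho0 = rng.normal(size=theta.shape) / np.sqrt(inv_metric)
    h0 = U(theta) + 0.5 * np.sum(inv_metric * rho0**2)
    theta_new, rho_new = leapfrog(theta, rho0, grad_U, inv_metric,
                                  step_size, n_steps)
    h1 = U(theta_new) + 0.5 * np.sum(inv_metric * rho_new**2)
    # Accept with probability min(1, exp(h0 - h1)).
    if np.log(rng.uniform()) < h0 - h1:
        return theta_new
    return theta


# Assumed toy posterior: independent normals with very different scales.
scales = np.array([1.0, 10.0])


def U(theta):
    """Potential energy, i.e. -log p(theta | y) up to a constant."""
    return 0.5 * np.sum((theta / scales) ** 2)


def grad_U(theta):
    return theta / scales**2


# Inverse metric set to the posterior variances, standing in for what
# warmup adaptation would estimate from draws.
inv_metric = scales**2

rng = np.random.default_rng(0)
theta = np.zeros(2)
draws = np.empty((2000, 2))
for n in range(draws.shape[0]):
    theta = hmc_step(theta, U, grad_U, inv_metric, 0.2, 10, rng)
    draws[n] = theta
```

With the inverse metric matched to the posterior variances, both coordinates evolve on the same time scale, so one step size works for both. That only helps when a single global rescaling is a good description of the posterior, which is exactly the limitation described above.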

Riemannian HMC, as the name suggests, uses a Riemannian metric. But if the density you’re trying to fit results in a gnarly Riemannian metric over the posterior (rather than one that can easily transform back to something that behaves more Euclidean), then the sampler’s going to have problems.

Isn’t the jury still out on why deep neural nets work as well as they do? Lots of interesting math and stats work going on there, and though it’s largely being done by computer scientists, I wouldn’t call it a computer science problem because there’s no computational issue other than perhaps balancing inference vs. computational resources.

> Isn’t the jury still out on why deep neural nets work as well as they do?

No doubt. The manifold hypothesis for deep learning reminds me of some hand-wavy explanations for how we get away with applying linear models with normal errors to everything. My professor told me it’s because we’re effectively fitting to the first-order Taylor expansion term, and lots of things are normal because of the CLT. It satisfied me until the professor was out of earshot, but it only really works as a post-hoc rationalization; it doesn’t really tell me anything about what classes of problem it works and doesn’t work on before actually trying the fit.

On causal inference, you write: “We begin with a cluster of different ideas that have appeared in statistics, econometrics, psychometrics, epidemiology, and computer science, all revolving around the challenges of causal inference, and all in some way bridging the gap between, on one hand, naive causal interpretation of observational inferences and, on the other, the recognition that correlation does not imply causation.”

Philosophy should be on the list too! The list of contributions from philosophers to causal inference is long, including (but not limited to) Reichenbach’s (1956) original formulation of the common cause principle (a special case of the Causal Markov Condition), Pat Suppes’s (1970) work on probabilistic causation, David Lewis’s (1973) counterfactual account of causation, and (most importantly) Spirtes, Glymour, and Scheines’s (1993) work on inferring causal relations from sets of probabilistic dependencies.

Funny, I always thought of Suppes as a mathematical psychologist based on what I knew of his work (and being affiliated with the math psych “dream team” at Stanford in the 60s). But apparently both he and Wikipedia agree with you that Suppes was in fact a philosopher!

I get you can’t hit everything, but what about pseudo-likelihood? No Besag spatio-temporal models or Cox PH models without it!

Christopher:

Pseudo-likelihood’s a clever idea that’s been used on occasion, but I don’t think it’s one of the most important statistical ideas in the past fifty years. Not that we were drawing any precise line.

This looks really cool and I look forward to reading in depth.

I can’t help but notice that both of the authors of this paper, and many of the commenters acknowledged, are white men. 50 years ago that was the norm; today we’re beginning to have serious (as yet inadequate) discussions about diversity, inclusion, and ethics more broadly. This is not a statistical idea per se, but it’s certainly a conversation that is improving and will further improve the practice of statistics.

(To be clear: these ideas and critiques are not new. They were around 50 years ago.)

James:

I don’t know if this helps, but the first person I asked to collaborate with on this paper was a nonwhite woman. But she was too busy to be a coauthor! Ideally she would’ve been involved in this paper too, and then it would have three authors.

My intention was not to critique you but rather to point to an idea that is changing the practice of statistics and that might be worth mentioning in the paper.