I (Yuling) read this new book *Machine Learning Under a Modern Optimization Lens* (by Dimitris Bertsimas and Jack Dunn) after I grabbed it from Andrew’s desk. Apparently machine learning is now such a wide-ranging area that we have to access it through some sub-manifold so as to evade dimension curse, and it is the same reason why I would like to discuss this comprehensive and clearly-structured book through a Bayesian perspective.

**Regularization and robustness, and what it means for priors**

The first part of the book is most focused on the interpretation of regularization and robustness (Bertsimas and Copenhaver, 2017). In a linear regression with data $latex (X,Y)$, we consider a small perturbation within the neighborhood $latex \Delta \in \mathcal{U}(q,r)= \{\Delta\in \mathcal{R}^{n\times p}: \max_{\vert\vert \delta \vert\vert_{q} =1 } \vert\vert \delta \Delta \vert\vert_{r} \},$ then the $latex l_q$ regularized regression is precisely equivalently to the minimax robustness:

$latex \displaystyle \min_{\beta}\max_{\Delta\in \mathcal{U}(q,r)} \vert\vert y-(X+\Delta)\beta \vert\vert_{r} = \min_{\beta} \vert\vert y-(X+\Delta)\beta \vert\vert_{r} + \vert\vert \beta \vert\vert_{q} $

and such equivalence can also be extended to other norms too.

The discussion of this book is mostly useful for point estimation in a linear regression. As a Bayesian, it is natural to ask, if we could also encode the robustness constraints into the prior for a general model. For example, can we establish something like (I suppress the obvious dependence on X):

$latex \displaystyle \min_{p^{post}} \max_{ p^*: D(p^* \vert\vert p^{sample})<\epsilon } – \int_{\tilde y} \log \int_{\theta} p(\tilde y \vert \theta ) p^{post}(\theta ) d\theta p^*(\tilde y\vert y ) d \tilde y= – \int_{\tilde y} \log \int_{\theta} p(\tilde y \vert \theta) p(\theta\vert y ) d\theta p^{sample}(\tilde y\vert y ) d \tilde y.

$

where is $latex p(\theta\vert y ) \propto p(\theta)p(y\vert \theta)$ is the Bayesian posterior distribution under certain prior that is potentially induced by this equation, and $latex y_{1,\dots,n}$ are iid samples from $latex p^{sample}$, which are however different from the future new samples $latex \tilde y \sim p^*(\tilde y)$ up to a $latex \epsilon$ perturbation under some divergence $latex D(\cdot|\cdot)$.

In other words, we might wish to construct the prior in such a way that the model could still give *minimax* optimal prediction even if the sampling distribution is slightly corrupted.

To be clear, the equivalence between a minimax (point-)estimation and the least-favorable prior is well studied. However here the minimax is the minimax out-of-sample risk of the prediction of the data, rather than the classic risk of the parameter estimation, where the only bridge is through the likelihood.

It also reminds me of our PSIS leave-one-out(LOO). In particular, we know where the model behaves bad (e.g. points with large k hat $latex >\epsilon$). It makes sense, for example, if we encode these bad points into the prior adaptively

$latex \displaystyle \log p^{prior}(\theta) \gets \log p^{prior}(\theta) + \gamma\log \sum_{i: k_{i}>\epsilon} p( y_i\vert\theta) $

and retrain the model either explicitly or through importance sampling. It is of course nothing but adversarial training in deep leaning.

Nevertheless, it is the time to rethink what a prior is aimed for:

- In terms of prior predictive check, we implicitly ask for a large or even the largest (in empirical Bayes) marginal likelihood $latex \int p(y\vert \theta) p(\theta) d\theta$.
- But prior is also just regularization, and regularization is just robustness from the previous argument. It is not because we or any other people have enough reasons to believe the regression coefficient perfectly forms a Laplace distribution that we use the Lasso; it is rather because we want our model to be more robust under some $latex l_1$ perturbation in the data. An adversarial training weighs more on “bad” data points and therefore deliberately decrease the prior predictive power, in exchange for robustness.

Let me recall Andrew’s old blog: What is the “true prior distribution”? :

(If) the prior for a single parameter in a model that is only being used once.for example, we’re doing an experiment to measure the speed of light in a vacuum, where prior for the speed of light is the prior for the speed of light; there is no larger set of problems for which this is a single example. My short answer is: for a model that is only used once, there is no true prior.

Now from the robustness standpoint, we might have another longer answer: it does not even matter if the population $latex \theta$ itself lives in any population. There is an optimal prior in terms of giving the appropriate amounts of regularization such that prediction from the model is robust under small noise, which is precisely defined by the minimax problem (in case someone hates minimax, I wonder if the average risk in lieu of minimax is also valid).

To conclude, the robustness framework (in the data space) gives us a more interpretable way to explain how strong the regularization should be (in the parameter space), or equivalently how (weakly) informative the prior is supposed to be.

##### Trees, discrete optimization, and multimodality

A large share of the book is devoted to the optimal classification and regression trees. The authors prove that a deep enough optimal classification tree can achieve the same prediction ability as a deep neural network — when the tree makes splits exactly according to the same network — whereas tree has a much better interpretability (evidently we could prove a multilevel model at the same setting will achieve a prediction ability no worse than that deep net).

The first glance might suggest a computationally prohibitive expense of solving a high dimensional discrete optimization problem, which is ultimately what a tree requires. Fortunately, the author gives a detailed introduction to the mixed integer algorithm they use, and it is shown to be both fast and scalable — although I cannot fit a discrete tree in Stan.

Nevertheless, there are still some discrete natures that may not be desired. For instance when the data is perfectly separable by multiple classification trees, whatever the objective function, the optimizer can only report a tree among all plausible answers. In the ideal situation we would want to average over all the possibility and I suppose that can be approximated by a bootstrapped random forest — but even then we are effectively using no-pooling among all leaves. For example if an online survey has very few samples in certain groups of the population, the best classification a tree can do is to group all of them into some nearby leaves, while the leaf to which they are assigned can only be decided with large variation.

To be fair all methods have limitations and it is certainly useful to include tree-based methods in the toolbox as an alternative to more black-box deep learning models.

##### We love decision theories, too.

I remember in this year’s JSM, Xiao-Li Meng made a joke that even though we statistician have to struggle to define our territory in terms of AI and ML, it seems even more unfair for operations research, for it is OR that created all the optimization tools upon which modern deep learning relies, but how many outsiders would directly links AI to OR?

The author of this book gives an argument in chapter 13: Although OR pays less attention to data collection and prediction compared with ML, OR is predominately focused on the process of making optimal decisions.

Of course we (Bayesian statisticians) love decision theories, too. Using the notation in this book, given a loss function c, a space for decisions $latex c\in C$, and observed data (x, y), we want to minimize the expected loss:

$latex \displaystyle z^*(x) = \arg\min \mathrm{E}[c(z, y)|x]$

From a (parametric) Bayesian perspective, such problem is easy to solve: given a parameter model $latex p(y|x, \theta)$, we have explicit form $latex \mathrm{E}[c(z, y)|x_0] = \int c(z, y) p(y| x_0, \theta) p(\theta| y_{1:n}, x_{1:n}) d \theta $ that enables a straightforward minimization on $latex z$ (possibly though stochastic approximation since we can draw $latex \theta$ from posterior directly).

The authors essentially consider a non-parametric model on $latex Y\vert X$ in RKHS. That is to consider a kernel $latex K(,)$ on covariates X and we can rewrite the expected risk as a weighted average of sample risk

$latex \displaystyle \mathrm{E}[c(z, y)|x_0] \approx K(X, x_0) K(X, X)^{-1} c(z, Y). $

And not surprisingly we can also construct the kernel though trees.

Indeed we can rewrite any multilevel model by defining an equivalent kernel K(,). So the kernel representation above amounts to a multilevel model with fixed group-level variance $latex \tau$ (often fixed hyper-parameter in the kernel).

##### And we love causal inference, too

Chapter 14 of the book is on *perspective trees*. The problem is motivated by observational studies with multiple treatments, and the goal is to assign an optimal treatment for a new patient. Given observed outcome y (say the Blood Sugar Level), all covariates X, (potentially nonrandom) treatments Z, what remains to be optimized is the policy for future patients with covariate $latex x$. We denote the optimal assignment policy $latex z^*=\tau(x)$ as a function of $latex x$.

It is a causal inference problem. If we have a parametric model, we (Bayesian statisticians) would directly write it down as

$latex \displaystyle z^*=\tau(x_0) = \arg\min_z \mathrm{E}[y|x_0, z]$

Under all unconfoundness conditions we have

$latex {E}[y|x_0, z]= \int y p(y| x, z) d y$, and $latex p(y| x, z)$ can be learned from the sampling distribution in observational studies by weighting or matching or regression.

Returning to this book, effectively the perspective trees models $latex p(y| x, z)$ by sub stratification: $latex y\vert x_0, z_0$ is the empirical distribution of all observed outcomes with treatment $latex z=z_0$ in the leaf node where $latex x_0$ lives.

##### Making decisions in the space of all trees

The perspective trees take one step further by using the following loss function when training the regression tree:

$latex \displaystyle \mathrm{E}[y|x, \gamma(x) ] + \sigma \sum_{i=1}^n (y_i – \hat y_i)^2$

where $latex \hat y_i$ is the prediction of $latex y_i$ using the same tree. (As a footnote, it is always better to use leave-one-out error, but I suppose it is hard to cross-validate a tree due to its discrete nature.)

The objective function is interpreted as the weighted average of “making an accurate prediction” ($latex \sigma =\infty$) and “making the optimal decision” ($latex \sigma =0$). Evidently it is not the same as fitting the tree first that only minimizes prediction error, and then doing causal inference using the post-inference tree and substratification.

I have a conflicting feeling towards this approach. On one hand, it addresses the model uncertainty directly. A parametric Bayesian might always tend to treat the model as it is and ignore all other uncertainty in the forking paths. It is therefore recommended to work in a more general model space– the trees do have merits in intuitively manifesting the model complexity.

On the other hand, to me there seems to be a more principled way to deal with model uncertainty is to consider

$latex \displaystyle \mathrm{E}[y|x, \gamma(x) ]= \mathrm{E}[\mathrm{E}[y|x, \gamma(x), M ]]$

where $latex M$ is any given tree.

Further under normal approximation, the expectation can be expanded to be

$latex \displaystyle \mathrm{E}[\mathrm{E}[y|x, \gamma(x), M ]]= \sum_{k=1}^{\infty} \mathrm{E}[y|x, \gamma(x), M_k] \exp( \sum_{i=1}^n (y_{ik} – \hat y_{ik})^2 ) / \sum_{k=1}^{\infty} \exp( \sum_{i=1}^n (y_{ik} – \hat y_{ik})^2 ),$

as long as I am allowed to abuse the notation on infinite sums and self-normalization.

The linear mixture (first expression in this section) can then be viewed as an approximation to this full-Bayes objective: it replaces the infinite sum by the largest one term, which is nearly accurate if all other trees vanish quickly.

To be clear, here different trees are only compared in the context of prediction. There is an analogy in terms of causal assumption where different causal models imply different conditional ignobility, which cannot be learned from the data in the first place.

More generally, there remains a lot to be done in the field of decision theory in the existence of model uncertainty, while everything will be even more complicated with extra settings on infinite model space, causality, and the looming concern that all models are still wrong even in this infinite model space– we almost have abandoned the posterior probability of models in model averaging, but how about decision making with a series of working models?

Overall, I enjoy reading this book. It provides various novel insights to rethink modern machine learning and a rich set of practical tools to fit models in the real world. Most constructively, it is a perfect book that inspires so many interesting open problems to work on after stacking multiple lenses.

Do you have an internet-accessible reference for the work behind “The authors prove that a deep enough optimal classification tree can achieve the same prediction ability as a deep neural network”? On the face of it, this sounds like there must be something nasty hidden in the small print. I would expect to be able to transform an arbitrary digital circuit into an equivalent deep neural network (assuming no very stringent restrictions on the architecture of the network) with only constant factor increase in size. I would expect that there are circuits whose minimum classification tree might perhaps have tractable depth but would still be of exponential size – for instance circuits computing functions that cannot be computed with less than space N might be difficult to compute using trees of size < 2^N.

A.G.

This issue seems to come up a lot in posts on (inherently) interpretable ML – its hard to make clear that many applications, at least for now, only achieve adequate performance using black boxes like deep neural nets. Also for many applications interpretability is not a big concern or advantage*. For instance, see comments here https://statmodeling.stat.columbia.edu/2019/11/15/zombie-semantics-created-in-the-hope-of-keeping-most-on-the-same-low-road-you-are-comfortable-with-now-delaying-the-hardship-of-learning-better-methodology/

Now, I am just guessing, but the proof likely involves N being unbounded and once N is above 5 +/- 2 the tree is no longer inherently interpretable.

* For instance, in some medical applications physicians will inappropriately over-ride inherently interpretable ML as they think they know better. So until there is improvement in say medical education black boc predictions might be best for patients here.

Agree. The tree will be effectively the same as a NN with step activation function– but really in that case there is no boundary between interoperability and blackbox.

There are some indirect clus.

If you know that the decision trees are generalized additive models where the basis function is indictor function and they are essentially simple functions in measure theory. See Jerome H. Friedman’s MARS on https://projecteuclid.org/download/pdf_1/euclid.aos/1176347963 .

It is not too amazing that a deep enough optimal classification tree can achieve the same prediction ability as a deep neural network.

In fact, there are some attempts to resue the deep learning packages to implement decision trees such as https://arxiv.org/abs/1702.07360.

I find a paper on this topic https://arxiv.org/pdf/1604.07143.pdf.

And it lists some references on how to construct a decision tree and use this tree to obtain a neural network.

Yuling:

Thanks for bringing the book to my intention, but since I can’t steal it from Andrew’s desk and there is very little on the book on line, not sure yet what to make of it. From some online slides I came across last night (don’t have link) the work seems very related to Cynthia Rudin’s work that I recently blogged on. Now, I am primarily bringing this up as she tries to be Bayesian in her approach – you will find half a dozen references involving Bayes here https://users.cs.duke.edu/~cynthia/papers.html

> It is not because we or any other people have enough reasons to believe the regression coefficient perfectly forms a Laplace distribution that we use the Lasso; it is rather because we want our model to be more robust under some l_1 perturbation in the data.

That dose seem to be the two solitudes of statistics:

“From a slightly different direction Tibshirani (2014) argues that enforcing sparsity is not primarily motivated by beliefs about the world, but rather by benefits such as computability and interpretability, indicating how considerations other than correspondence to reality often play an important role in statistics and more generally in science. Tibshirani’s view fits squarely within the alternative “classical,” or nonBayesian, approach in which techniques are chosen based on various robust operational properties rather than being viewed as approximations of reality.” http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

Here you seem to be moving Bayes into the alternative “classical,” or nonBayesian view through obtaining a prior that just or primarily does that.

Keith,

Thanks for the reference and your post.

I would rather think Bayesian has a long-existing connection with all those robust operational or classical properties (the old days when we still teach least favorable prior). It is very insightful in your paper to distinguish these two approaches. On the other hand, It is also because of the fact that

> Our belief in the efficacy of information aggregation, using continuous parameters to determine the level of partial pooling, is supported by a belief that reality though never directly accessible is continuous, that different experiments, treatments, and outcomes are connected somehow rather than distinct severed islands on their own.

such that we have to be concerned that the measurement we have is almost always noisy and likely corrupted, and therefore we do have to utlize

> various robust operational properties

in order to

> approximations of reality

?

OK, yes robustness can help, but also one can keep the view that priors (try to) represent a reality beyond our direct reach by instead modifying the likelihood rather than changing one’s view of the prior. As for instance in this approach Robust Bayesian Inference via Coarsening https://www.tandfonline.com/doi/abs/10.1080/01621459.2018.1469995 .

Not arguing that is superior, but rather just remains continuous with https://statmodeling.stat.columbia.edu/2016/04/23/what-is-the-true-prior-distribution-a-hard-nosed-answer/ that you pointed to.

Now, I missed that post which is the clearest statement of Andrew’s position which I don’t think he changed?

And since we are here, to avoid giving up in the case of for a single parameter in a model that is only being used once, Andrew embeded that problem into a larger class of exchangeable inference problems. Interestingly Peirce argued that for inference in a single non-repeated study, one should embed that problem into a larger class of exchangeable inference problems for an unlimited number of individuals.

Opps, this ” one should embed that problem into a larger class of exchangeable inference problems for an unlimited number of individuals.” should have been “one should embed that problem into a larger class of identical inference problems for an unlimited number of individuals.

Keith,

Sure, I am not dismissing the idea of “prior is just more data”– I love it. Indeed no matter how the prior is constructed from/aimed for, it can always be explained in the sample space(e.g., https://arxiv.org/abs/1705.07120 as a similar application in ML) and that is why Andrew has proposed prior predictive checks. And I don’t think Andrew has changed his position– so the short answer is the prior should reflect the population if it can.

Now on the other hand, I think the analogy to ask for “what is the true prior” is to ask for what is the value of the parameter/hyper-parameter (assuming a parametric model), and what is the optimal value the model can possess. If the model is correct, the optimal value almost referred to the population value. That said, what is the *true* value of the hyper-parameter? y~N(theta, 1), theta=1, theta ~N(mu, 1), mu=1 and y~N(theta, 1), theta=1, theta ~N(mu, 1), mu=0 corresponds to the same generative model! Sure we could always embed the model into a larger system where the hyper-parameter becomes parameter that directly generate the data, but that is also among the extra model assumption. Consider an even more extreme case, we sometimes put a strong prior in a regression to avoid multicollinearity. It purely serves for identification, and does not encode any population information.

In short, I agree there is a distinction between prior as more data/amalgamated evidence versus prior as more regularization/robustness/various operational properties, and I tend to think they are both Bayesian.

Shout out to operations research, optimal control and inverse problems…