I found a paper on this topic: https://arxiv.org/pdf/1604.07143.pdf

It lists some references on how to construct a decision tree and then use that tree to obtain a neural network.

There are some indirect clues.

Decision trees are generalized additive models whose basis functions are indicator functions, so they are essentially simple functions in the measure-theoretic sense. See Jerome H. Friedman’s MARS paper at https://projecteuclid.org/download/pdf_1/euclid.aos/1176347963 .
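
To make the "simple function" reading concrete, a tree with M leaves can be written as follows. The symbols M, c_m, R_m and \mathcal{X} are notation introduced here for illustration, not taken from the linked papers: R_m is the (axis-aligned) region of leaf m and c_m is the constant predicted there.

```latex
f(x) = \sum_{m=1}^{M} c_m \, \mathbf{1}\{x \in R_m\},
\qquad R_m \cap R_{m'} = \emptyset \ \text{ for } m \neq m',
\qquad \bigcup_{m=1}^{M} R_m = \mathcal{X}
```

Because the regions partition the input space, exactly one indicator is nonzero for any x, which is the definition of a simple function with finitely many values.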

So it is not too surprising that a deep enough optimal classification tree can achieve the same predictive ability as a deep neural network.

In fact, there are some attempts to reuse deep learning packages to implement decision trees, such as https://arxiv.org/abs/1702.07360.

Keith,

Sure, I am not dismissing the idea that “prior is just more data”; I love it. Indeed, no matter how the prior is constructed or what it is aimed at, it can always be explained in the sample space (e.g., https://arxiv.org/abs/1705.07120 as a similar application in ML), and that is why Andrew has proposed prior predictive checks. And I don’t think Andrew has changed his position, so the short answer is that the prior should reflect the population if it can.

Now, on the other hand, I think asking “what is the true prior” is analogous to asking what the value of the parameter/hyper-parameter is (assuming a parametric model), and what the optimal value the model can possess is. If the model is correct, the optimal value essentially refers to the population value. That said, what is the *true* value of the hyper-parameter? y ~ N(theta, 1), theta = 1, theta ~ N(mu, 1), mu = 1 and y ~ N(theta, 1), theta = 1, theta ~ N(mu, 1), mu = 0 correspond to the same generative model! Sure, we could always embed the model into a larger system where the hyper-parameter becomes a parameter that directly generates the data, but that is itself an extra modeling assumption. Consider an even more extreme case: we sometimes put a strong prior on a regression to avoid multicollinearity. It serves purely for identification, and does not encode any population information.
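
As a rough sketch of the identification point, here is a two-predictor regression where the second column is an exact copy of the first. The data, the function names, and the choice of a N(0, I / prior_precision) prior with unit noise variance are all assumptions made up for this illustration. With a flat prior the normal equations are singular; with a Gaussian prior the posterior mode is unique.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a 2x2 linear system by Cramer's rule; return None if singular."""
    det = a11 * a22 - a12 * a21
    if det == 0:  # exact check is enough for this toy data; use a tolerance in general
        return None
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - b1 * a21) / det)


def posterior_mode(x1, x2, y, prior_precision):
    """Mode of (b1, b2) under y ~ N(x1*b1 + x2*b2, 1), b ~ N(0, I / prior_precision).

    The mode solves (X'X + prior_precision * I) b = X'y.
    prior_precision = 0 is the flat-prior (least squares) case.
    """
    s11 = sum(u * u for u in x1) + prior_precision
    s22 = sum(v * v for v in x2) + prior_precision
    s12 = sum(u * v for u, v in zip(x1, x2))
    t1 = sum(u * w for u, w in zip(x1, y))
    t2 = sum(v * w for v, w in zip(x2, y))
    return solve_2x2(s11, s12, s12, s22, t1, t2)


# Hypothetical data: x2 duplicates x1, so only b1 + b2 is identified by the likelihood.
x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

flat = posterior_mode(x1, x2, y, 0.0)         # singular system: no unique answer
informative = posterior_mode(x1, x2, y, 1.0)  # prior splits the effect evenly

print(flat)
print(informative)
```

The prior here picks one point on the ridge of equally good fits (the symmetric one); it says nothing about where the coefficients "really" are in any population.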

In short, I agree there is a distinction between prior as more data/amalgamated evidence versus prior as more regularization/robustness/various operational properties, and I tend to think they are both Bayesian.

]]>Oops, this “one should embed that problem into a larger class of exchangeable inference problems for an unlimited number of individuals.” should have been “one should embed that problem into a larger class of identical inference problems for an unlimited number of individuals.”

]]>OK, yes, robustness can help, but one can also keep the view that priors (try to) represent a reality beyond our direct reach by modifying the likelihood instead of changing one’s view of the prior. See, for instance, the approach in “Robust Bayesian Inference via Coarsening”: https://www.tandfonline.com/doi/abs/10.1080/01621459.2018.1469995 .

I am not arguing that it is superior, just that it remains continuous with https://statmodeling.stat.columbia.edu/2016/04/23/what-is-the-true-prior-distribution-a-hard-nosed-answer/ , which you pointed to.

Now, I had missed that post, which is the clearest statement of Andrew’s position, and I don’t think he has changed it?

And since we are here: to avoid giving up in the case of a single parameter in a model that is only being used once, Andrew embedded that problem into a larger class of exchangeable inference problems. Interestingly, Peirce argued that for inference in a single non-repeated study, one should embed that problem into a larger class of exchangeable inference problems for an unlimited number of individuals.

]]>Keith,

Thanks for the reference and your post.

I would rather think that Bayesian statistics has a long-standing connection with all those robust operational or classical properties (think of the old days when we still taught least favorable priors). It is very insightful of your paper to distinguish these two approaches. On the other hand, is it not also because of the fact that

> Our belief in the efficacy of information aggregation, using continuous parameters to determine the level of partial pooling, is supported by a belief that reality though never directly accessible is continuous, that different experiments, treatments, and outcomes are connected somehow rather than distinct severed islands on their own.

that we have to be concerned that the measurements we have are almost always noisy and likely corrupted, and therefore we do have to utilize

> various robust operational properties

in order to obtain

> approximations of reality

?

Agree. The tree will be effectively the same as a NN with a step activation function, but really in that case there is no boundary between interpretability and black box.
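
A minimal sketch of that equivalence, using a made-up depth-2 tree (the thresholds 0.5, 0.3, 0.7 and the leaf values 1 to 4 are hypothetical). The first hidden layer has one step unit per split, the second has one step unit per leaf acting as an AND over the path to that leaf, and the output is a linear combination of the leaf units.

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0


def tree_predict(x1, x2):
    """A hypothetical depth-2 regression tree."""
    if x1 < 0.5:
        return 1.0 if x2 < 0.3 else 2.0
    return 3.0 if x2 < 0.7 else 4.0


def network_predict(x1, x2):
    """The same function as a network with two hidden step-activation layers."""
    # Hidden layer 1: one unit per split; fires when the input goes to the right branch.
    a = step(x1 - 0.5)
    b = step(x2 - 0.3)
    c = step(x2 - 0.7)
    # Hidden layer 2: one unit per leaf; weights and biases encode an AND of path conditions.
    leaf1 = step(-a - b + 0.5)  # a == 0 and b == 0
    leaf2 = step(-a + b - 0.5)  # a == 0 and b == 1
    leaf3 = step(a - c - 0.5)   # a == 1 and c == 0
    leaf4 = step(a + c - 1.5)   # a == 1 and c == 1
    # Output layer: exactly one leaf unit is active, so this picks its value.
    return 1.0 * leaf1 + 2.0 * leaf2 + 3.0 * leaf3 + 4.0 * leaf4


# Compare the two on a grid, including points that sit exactly on the thresholds.
grid = [i / 10 for i in range(11)]
agree = all(tree_predict(u, v) == network_predict(u, v) for u in grid for v in grid)
print(agree)
```

The network form has the same splits and leaf values, just stored as weights and biases, which is why reading it off is no harder or easier than reading the tree itself.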

]]>Thanks for bringing the book to my attention, but since I can’t steal it from Andrew’s desk and there is very little on the book online, I am not sure yet what to make of it. From some online slides I came across last night (I don’t have the link), the work seems closely related to Cynthia Rudin’s work that I recently blogged on. Now, I am primarily bringing this up because she tries to be Bayesian in her approach; you will find half a dozen references involving Bayes here: https://users.cs.duke.edu/~cynthia/papers.html

> It is not because we or any other people have enough reasons to believe the regression coefficient perfectly forms a Laplace distribution that we use the Lasso; it is rather because we want our model to be more robust under some l_1 perturbation in the data.

That does seem to be the two solitudes of statistics:

“From a slightly different direction Tibshirani (2014) argues that enforcing sparsity is not primarily motivated by beliefs about the world, but rather by benefits such as computability and interpretability, indicating how considerations other than correspondence to reality often play an important role in statistics and more generally in science. Tibshirani’s view fits squarely within the alternative “classical,” or nonBayesian, approach in which techniques are chosen based on various robust operational properties rather than being viewed as approximations of reality.” http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

Here you seem to be moving Bayes into the alternative “classical,” or nonBayesian view through obtaining a prior that just or primarily does that.

]]>A.G.

This issue seems to come up a lot in posts on (inherently) interpretable ML; it’s hard to make clear that many applications, at least for now, only achieve adequate performance using black boxes like deep neural nets. Also, for many applications interpretability is not a big concern or advantage*. For instance, see the comments here: https://statmodeling.stat.columbia.edu/2019/11/15/zombie-semantics-created-in-the-hope-of-keeping-most-on-the-same-low-road-you-are-comfortable-with-now-delaying-the-hardship-of-learning-better-methodology/

Now, I am just guessing, but the proof likely involves N being unbounded and once N is above 5 +/- 2 the tree is no longer inherently interpretable.

* For instance, in some medical applications physicians will inappropriately override inherently interpretable ML because they think they know better. So until there is improvement in, say, medical education, black box predictions might be best for patients here.

]]>