Yeah, I suppose I’m not clear on the terminology, but for concreteness let’s say you have a model y = b + e, where b is some parameter and e is some known error. There are two sources of uncertainty (as you implied in your original post): the data-generating noise e and the uncertainty in the parameter estimate, p(b). For a predictive distribution, I think one would have to take both into account. Some sort of Bayesian estimation (like taking dropout into account) can help with parameter uncertainty, and some sort of mixture density can help with the noise in the data-generating process (DGP).
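
To make the two sources concrete, here is a minimal sketch for the toy model y = b + e. It assumes (my assumptions, not anything stated above) that e is Normal(0, sigma^2) with sigma known and that b has a flat prior, so that b given the data is Normal(mean(data), sigma^2 / n). The data values are made up.

```python
import random
import statistics

def predictive_draws(data, sigma, n_draws=20000, seed=0):
    """Draw from the posterior predictive of y = b + e, e ~ Normal(0, sigma^2).

    Assumes sigma is known and a flat prior on b, so that
    b | data ~ Normal(mean(data), sigma^2 / n).
    """
    rng = random.Random(seed)
    n = len(data)
    post_mean = statistics.fmean(data)
    post_sd = sigma / n ** 0.5
    draws = []
    for _ in range(n_draws):
        b = rng.gauss(post_mean, post_sd)        # parameter uncertainty, p(b | data)
        draws.append(b + rng.gauss(0.0, sigma))  # data-generating noise e
    return draws

data = [9.1, 10.4, 10.9, 9.6, 10.2]  # made-up observations
sigma = 1.0

draws = predictive_draws(data, sigma)
plug_in_sd = sigma                                   # ignores uncertainty in b
predictive_sd = statistics.pstdev(draws)             # includes both sources
analytic_sd = (sigma ** 2 + sigma ** 2 / len(data)) ** 0.5
print(plug_in_sd, predictive_sd, analytic_sd)
```

Plugging in a point estimate of b gives a predictive spread of sigma; averaging over p(b | data) widens it, and the gap shrinks as n grows.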

]]>=====================================

Warren S. Sarle Apr 29, 1996

The neural network (NN) and statistical literatures contain many of the same concepts but usually with different terminology. Sometimes the same term or acronym is used in both literatures but with different meanings. Only in very rare cases is the same term used with the same meaning, although some cross-fertilization is beginning to happen. Below is a list of such corresponding terms or definitions.

Particularly loose correspondences are marked by a ~ between the two columns. A < indicates that the term on the left is roughly a subset of the term on the right, and a > indicates the reverse. Terminology in both fields is often vague, so precise equivalences are not always possible. The list starts with some basic definitions.

There is disagreement in the NN literature on how to count layers. Some people count inputs as a layer and some don’t. I specify the number of hidden layers instead. This is awkward but unambiguous.

Definition                          Statistical Jargon
==========                          ==================

generalizing from noisy data        Statistical inference
and assessment of the
accuracy thereof

the set of all cases one            Population
wants to be able to
generalize to

a function of the values in         Parameter
a population, such as the
mean or a globally optimal
synaptic weight

a function of the values in         Statistic
a sample, such as the mean
or a learned synaptic weight

Neural Network Jargon               Definition
=====================               ==========

Neuron, neurode, unit,              a simple linear or nonlinear computing
node, processing element            element that accepts one or more inputs,
                                    computes a function thereof, and may
                                    direct the result to one or more other
                                    neurons

Neural networks                     a class of flexible nonlinear regression
                                    and discriminant models, data reduction
                                    models, and nonlinear dynamical systems
                                    consisting of an often large number of
                                    neurons interconnected in often complex
                                    ways and often organized into layers

Neural Network Jargon               Statistical Jargon
=====================               ==================

Statistical methods                 Linear regression and discriminant
                                    analysis, simulated annealing, random
                                    search

Architecture                        Model

Training, Learning,                 Estimation, Model fitting, Optimization
Adaptation

Classification                      Discriminant analysis

Mapping, Function                   Regression
approximation

Supervised learning                 Regression, Discriminant analysis

Unsupervised learning,              Principal components, Cluster analysis,
Self-organization                   Data reduction

Competitive learning                Cluster analysis

Hebbian learning,                   Principal components
Cottrell/Munro/Zipser
technique

Training set                        Sample, Construction sample

Test set, Validation set            Hold-out sample

Pattern, Vector, Example,           Observation, Case
Sample, Case

Reflectance pattern                 an observation normalized to sum to 1

Binary(0/1),                        Binary, Dichotomous
Bivalent or Bipolar(-1/1)

Input                               Independent variables, Predictors,
                                    Regressors, Explanatory variables,
                                    Carriers

Output                              Predicted values

Forward propagation                 Prediction

Training values                     Dependent variables, Responses,
Target values                       Observed values

Training pair                       Observation containing both inputs
                                    and target values

Shift register,                     Lagged variable
(Tapped) (time) delay (line),
Input window

Errors                              Residuals

Noise                               Error term

Generalization                      Interpolation, Extrapolation,
                                    Prediction

Error bars                          Confidence interval

Prediction                          Forecasting

Adaline                             Linear two-group discriminant analysis
(ADAptive LInear NEuron)            (not Fisher’s but generic)

(No-hidden-layer) perceptron     ~  Generalized linear model (GLIM)

Activation function,             >  Inverse link function in GLIM
Signal function,
Transfer function

Softmax                             Multiple logistic function

Squashing function                  bounded function with infinite domain

Semilinear function                 differentiable nondecreasing function

Phi-machine                         Linear model

Linear 1-hidden-layer               Maximum redundancy analysis, Principal
perceptron                          components of instrumental variables

1-hidden-layer perceptron        ~  Projection pursuit regression

Weights,                            Shrinkage estimation, Ridge regression

Jitter                              random noise added to the inputs to
                                    smooth the estimates

Growing, Pruning, Brain             Subset selection, Model selection,
damage, Self-structuring,           Pre-test estimation
Ontogeny

Optimal brain surgeon               Wald test

LMS (Least mean squares)            OLS (Ordinary least squares)
                                    (see also “LMS rule” above)

Relative entropy, Cross             Kullback-Leibler divergence
entropy

Evidence framework                  Empirical Bayes estimation

OLS (Orthogonal least squares)      Forward stepwise regression

Probabilistic neural network        Kernel discriminant analysis

General regression neural           Kernel regression
network

Topologically distributed        <  (Generalized) Additive model
encoding

Adaptive vector quantization        iterative algorithms of doubtful
                                    convergence for K-means cluster analysis

Adaptive Resonance Theory 2a     ~  Hartigan's leader algorithm

Learning vector quantization        a form of piecewise linear discriminant
                                    analysis using a preliminary cluster
                                    analysis

Counterpropagation                  Regressogram based on k-means clusters

Encoding, Autoassociation           Dimensionality reduction
                                    (Independent and dependent variables
                                    are the same)

Heteroassociation                   Regression, Discriminant analysis
                                    (Independent and dependent variables
                                    are different)

Epoch                               Iteration

Continuous training,                Iteratively updating estimates one
Incremental training,               observation at a time via difference
On-line training,                   equations, as in stochastic approximation
Instantaneous training

Batch training,                     Iteratively updating estimates after
Off-line training                   each complete pass over the data as in
                                    most nonlinear regression algorithms
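
The last two rows of the table (on-line versus batch training) and the LMS/OLS row can be illustrated with a small sketch. It is not from Sarle's post: the one-parameter model y = w * x, the simulated data, the step-size schedule, and the function names are all assumptions chosen for illustration.

```python
import random

def batch_ols_slope(xs, ys):
    """Batch estimate: closed-form least squares slope for y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def online_lms_slope(xs, ys, epochs=50, lr0=0.5):
    """On-line estimate: update w after every single observation (LMS rule),
    with a decaying step size as in stochastic approximation."""
    w, t = 0.0, 0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            t += 1
            lr = lr0 / (1.0 + 0.01 * t)   # hypothetical step-size schedule
            w += lr * (y - w * x) * x     # move w along the per-case error gradient
    return w

# Simulated data with true slope 2.0 (made up for the example).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

w_batch = batch_ols_slope(xs, ys)
w_online = online_lms_slope(xs, ys)
print(w_batch, w_online)
```

With a step size that decays slowly enough, the one-case-at-a-time estimate settles close to the batch least-squares estimate on the same data.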

Anyway, here’s one example where predictive distributions are important. When you’re doing capacity planning, a point estimate of the load at a future date is of limited use; what you really want is a high-confidence upper bound. That is, you might want to find x such that

Pr(load > x) = alpha

for some small alpha (0.01, 0.001). More commonly, you’ll want to be able to handle the max load over some period: find x such that

Pr(max_i load_i > x) = alpha.

You would prefer not to have to choose alpha up front, as that may depend on executive decisions that haven’t been made yet and on costs and priorities that are not known yet or may change. (Those costs include both the cost of providing resources that may not be used, and the negative consequences of not having enough capacity to meet the actual load.) You also may not want to set in stone the width of the period of interest in the second case above. So you really do need a predictive distribution.
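
As a rough sketch of why draws from the predictive distribution are enough: once you have simulated future load paths, both bounds above can be read off for any alpha and any period after the fact. The lognormal paths below are a made-up stand-in for whatever predictive model you actually have, and `upper_bound` is a hypothetical helper.

```python
import random

def upper_bound(samples, alpha):
    """Empirical x such that Pr(sample > x) is approximately alpha."""
    ordered = sorted(samples)
    k = min(len(ordered) - 1, int((1.0 - alpha) * len(ordered)))
    return ordered[k]

rng = random.Random(1)
n_sims, horizon = 5000, 30
# Hypothetical predictive draws: each inner list is one simulated path of daily load.
paths = [[rng.lognormvariate(4.0, 0.25) for _ in range(horizon)]
         for _ in range(n_sims)]

single_day = [path[0] for path in paths]    # load on one future date
period_max = [max(path) for path in paths]  # max_i load_i over the period

for alpha in (0.01, 0.001):
    print(alpha, upper_bound(single_day, alpha), upper_bound(period_max, alpha))
```

Changing alpha or the length of the period only changes how the stored draws are summarized; nothing has to be refit.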

As another example, I’m doing some work where we’re building Monte Carlo models in support of complex decisions. The result of a regression analysis might be only one piece of a larger model, with its outcome variable being only an intermediate step in computing the final net benefit of a decision.

Finally, nonlinear utility functions aren’t the issue; outputting the predictive mean is an optimal action only if (1) your action consists solely of outputting a prediction, rather than using that prediction to inform some other decision (which might have only a finite number of possible alternatives), and (2) your utility function (or loss function) is *quadratic*.
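
A small numerical sketch of point (2), under assumptions of my own choosing: the predictive draws are exponential (so the distribution is skewed), actions are restricted to a grid, and the asymmetric loss with a 9-to-1 cost ratio is hypothetical.

```python
import random
import statistics

def expected_loss(action, draws, loss):
    """Average loss of an action over draws from the predictive distribution."""
    return statistics.fmean(loss(action, y) for y in draws)

def best_action(draws, loss, candidates):
    """Grid search for the action with the smallest expected loss."""
    return min(candidates, key=lambda a: expected_loss(a, draws, loss))

def quadratic(a, y):
    return (a - y) ** 2

def asymmetric(a, y):
    # Hypothetical loss: under-prediction costs 9 times as much as over-prediction.
    return 9.0 * (y - a) if y > a else (a - y)

rng = random.Random(2)
draws = [rng.expovariate(1.0) for _ in range(2000)]  # skewed predictive draws
candidates = [i / 100 for i in range(0, 501)]

a_quad = best_action(draws, quadratic, candidates)
a_asym = best_action(draws, asymmetric, candidates)
print(statistics.fmean(draws), a_quad, a_asym)
```

Under quadratic loss the best action lands on the predictive mean (up to the grid spacing); under the asymmetric loss it moves well into the upper tail, which you can only locate if you have the whole predictive distribution.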

]]>also, spitballing here, what if you just model utility(y)? would you need uncertainty parameters then?

]]>Or imagine there’s a regulatory situation where pollution below some amount is allowed but above some amount there are nonlinear fines, and you need to optimize your maintenance schedule on your equipment…

or whatever. it’s easy to come up with nonlinear utility on uncertain parameters.

]]>https://www.sciencedirect.com/science/article/pii/S0169207018300785

]]>I think with these sorts of models the thing to be worried about is posterior predictive performance and in particular, understating the uncertainty of the prediction. My understanding is that KL-loss-based variational approximations are quite prone to this problem and also that it does show up in the DNN context.

]]>I wish we saw more of the kind of technology transfer from one field to another that Neal did there. I get the impression that the physics community has all kinds of wonderful mathematical tools that could be usefully applied in other disciplines, if only the right people knew about them.

]]>I would be delighted to find that I am mistaken here…

]]>Did you also miss – Stat Med. 1998 Nov 15;17(21):2501-8. A comparison of statistical learning methods on the Gusto database. Ennis M, Hinton G, Naylor D, Revow M, Tibshirani R.

Abstract

We apply a battery of modern, adaptive non-linear learning methods to a large real database of cardiac patient data. We use each method to predict 30 day mortality from a large number of potential risk factors, and we compare their performances. We find that none of the methods could outperform a relatively simple logistic regression model previously developed for this problem.

]]>https://www.stat.berkeley.edu/~breiman/random-forests.pdf

Thanks to Andrew for posting this query.

]]>I think it would be great if researchers delineated all the assumptions they were making and reviewers then (inevitably) pointed out more.

]]>PLoS One. 2014; 9(7): e101739

]]>What is true asymptotically may not be at all correct for someone with finite resources and time.

]]>Comparing models for analysing survey data is a great topic. We had a project on this at the Universities of Turin and Milan a while ago. Two contributions from this:

1. We edited in 2011 a book published by Wiley titled Modern Analysis of Customer Surveys: with Applications using R, https://www.wiley.com/en-us/Modern+Analysis+of+Customer+Surveys%3A+with+Applications+using+R-p-9781119961383

Chapter 10 is on Statistical inference for causal effects by Fabrizia Mealli, Barbara Pacini and Donald B. Rubin

The following 11 chapters analyse the same dataset (the ABC data that can be downloaded from the book’s website). These are:

Bayesian Networks (11, Kenett, Salini)

Log Linear Models (12, Fienberg, Manrique-Vallier)

CUB Models (13, Piccolo, Iannario)

The Rasch Model (14, De Battisti, Nicolini, Salini)

Tree-based Methods and Decision Trees (15, Soffritti, Galimberti)

PLS Models (16, Boari, Cantaluppi)

Nonlinear PCA (17, Ferrari, Barbero)

Multidimensional Scaling (18, Solaro)

Multilevel Models for Ordinal Data (19, Rampichini, Grilli)

Control Charts Applications (20, Kenett, Deldossi, Zappa)

Fuzzy Methods (21, Zani, Morlini, Milioli)

This provides a unique opportunity to compare what you get from different models applied to the same dataset.

2. A paper proposing to combine models in order to enhance the information quality generated by an analysis was also published in ASMBI: https://onlinelibrary.wiley.com/doi/abs/10.1002/asmb.927

]]>It shows how a certain form of drop-out (a regularization technique that involves removing edges at random during training) can be viewed as a Monte Carlo version of a Bayesian variational approximation; the upshot is that you can do drop-out during prediction to approximate sampling from the posterior distribution.
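
A minimal sketch of the prediction-time mechanics, with loud caveats: the weights below are made up rather than trained, the network is tiny, and this version drops hidden units (with the usual rescaling) rather than individual edges. It only shows how repeated stochastic forward passes turn one prediction into a set of draws.

```python
import math
import random
import statistics

# Hypothetical weights for a 1-input, 4-hidden-unit, 1-output network.
# In practice these would come from training with dropout; here they are made up.
W1 = [0.8, -1.2, 0.5, 1.5]
B1 = [0.1, 0.0, -0.3, 0.2]
W2 = [1.0, -0.7, 0.9, 0.4]
KEEP = 0.8  # probability of keeping each hidden unit

def forward(x, rng=None):
    """One forward pass; if rng is given, drop each hidden unit with prob 1 - KEEP."""
    out = 0.0
    for w1, b1, w2 in zip(W1, B1, W2):
        h = math.tanh(w1 * x + b1)
        if rng is not None:
            h = h / KEEP if rng.random() < KEEP else 0.0  # inverted dropout
        out += w2 * h
    return out

def mc_dropout_predict(x, n_samples=2000, seed=0):
    """Keep dropout switched on at prediction time and summarize the draws."""
    rng = random.Random(seed)
    samples = [forward(x, rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.pstdev(samples)

mc_mean, mc_spread = mc_dropout_predict(0.5)
print(forward(0.5), mc_mean, mc_spread)
```

The deterministic pass gives a single number; the dropout passes give a mean near that number plus a spread, which is what gets interpreted as approximate posterior uncertainty.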

]]>Those baselines are logistic regressions. I’d like to see how sensitive their results were to the dozens of hyperparameters they tuned.

]]>As an update on https://statmodeling.stat.columbia.edu/2018/10/30/explainable-ml-versus-interpretable-ml/ there is this paper: “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead” by Cynthia Rudin https://www.nature.com/articles/s42256-019-0048-x.epdf?author_access_token=SU_TpOb-H5d3uy5KF-dedtRgN0jAjWel9jnR3ZoTv0M3t8uDwhDckroSbUOOygdba5KNHQMo_Ji2D1_SdDjVr6hjgxJXc-7jt5FQZuPTQKIAkZsBoTI4uqjwnzbltD01Z8QwhwKsbvwh-z1xL8bAcg%3D%3D

In particular “Here is the Rashomon set argument: consider that the data permit a large set of reasonably accurate predictive models to exist. … Unpacking this argument slightly, for a given data set, we define the Rashomon set as the set of reasonably accurate predictive models (say within a given accuracy from the best model accuracy of boosted decision trees). Because the data are finite, the data could admit many close-to-optimal models that predict differently from each other: a large Rashomon set. I suspect this happens often in practice because sometimes many different ML algorithms perform similarly on the same data set, despite having different functional forms (for example, random forests, neural networks, support vector machines).”

]]>1. Artificial intelligence is an application. It just means a machine doing something we typically think of people as doing. Like correcting spelling mistakes or driving cars or playing Settlers of Catan. It’s not a mathematical technique. We can build AIs with heuristics or we can build them with statistics or we can build them with both.

2. Machine learning is a broad class of techniques, not all of which are probabilistic. For example, support-vector machines and greedy agglomerative clustering algorithms are not probabilistic. Machine learning is currently the most popular way to build artificial intelligence applications.

3. Neural networks are a kind of statistical model that currently dominates research in machine learning and is thus currently the go-to method for developing artificial intelligence applications. Deep neural nets, by which people mean nets with more than one hidden layer, are a form of neural network. Deep nets are computationally intractable for traditional statistical inference due to both multimodality and the scale of the likelihood function. To cope, machine learning researchers have layered heuristics such as autoencoders, early stopping, etc. on top of standard estimation techniques. And because the form of the likelihood matters and generic networks don’t work, they include specialized structures like convolutional layers to deal with image transposition and rotation. That is, they don’t learn to recognize cats just from looking at cat pictures; a lot of heuristic knowledge about vision has been encoded in the likelihood architecture.

As David MacKay explains in his info theory book, logistic regression is a simple neural network with N inputs, one output, and no hidden layers (he called it “classification with one neuron” rather than logistic regression). With appropriate link functions, neural networks can be used as generalized linear models. Viewed another way, they’re stacked logistic regressions, which is where the non-linearity comes from.

4. Gaussian processes are a kind of statistical model, albeit a computationally intractable one at scale due to the requirement to solve linear systems in matrices whose dimensionality is given by the number of data points. Like neural networks, GPs can represent arbitrary multidimensional functions given enough data (subject to conditions imposed by priors like smoothness and by the choice of covariance function). As in the generalized linear model case and the neural network case, we can throw logit link functions on Gaussian processes and use them for binary or categorical data. Like neural networks, their architecture is very general and there is a lot of heuristic/subjective knowledge going into the choice of covariance function.

Radford Neal showed in his thesis (which also introduced HMC to the stats world!) that Gaussian processes are the limit of a single hidden-layer neural network as the number of hidden nodes goes to infinity.

5. Panel data and time series are just forms of data. They can be handled by any kind of approach.

6. Hierarchical modeling is a technique for partial pooling, aka modeling population effects. It pulls estimates for individuals or groups toward the estimate for the overall population of which they are a member. Machine learning researchers, including in neural networks, tend to only use shrinkage of estimates to zero (to avoid overfitting rather than to regularize to population estimates). They also tend to use fixed effects for populations rather than hierarchical modeling. Where you see hierarchical modeling in machine learning is in what they call domain adaptation, such as building a sentiment classifier for reviews for different genres of movies (dramas vs. comedies, for example) or products (shoes vs. refrigerators).

7. Bayesian modeling is an approach to using prior data and performing inference to propagate uncertainty. The machine learning community tends to call any technique that applies Bayes’s rule “Bayesian” (e.g., naive Bayes classifiers, which are almost never Bayesian in either modeling or inference). ML researchers also use the term “prior” very broadly to include prevalence of categories in a model (such as naive Bayes). They also tend to use “Bayesian” to describe any system with a prior on parameters, even if it’s essentially being fit with penalized maximum likelihood. Statisticians tend to reserve the term “Bayesian” for full Bayesian inference, where we average our predictions over our estimation uncertainty when performing posterior predictive inference.
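
The partial pooling in point 6 can be sketched in a few lines. This is a deliberately simplified version under assumptions that are mine: a normal hierarchical model with known within-group sd `sigma` and known between-group sd `tau` (both would be estimated in a real analysis), the grand mean of all observations used as a plug-in for the population mean, and made-up review scores.

```python
import statistics

def partial_pool(groups, sigma, tau):
    """Shrink each group mean toward the overall mean.

    Assumes known within-group sd `sigma` and between-group sd `tau`;
    the grand mean of all observations is a plug-in for the population mean.
    """
    mu = statistics.fmean(y for ys in groups.values() for y in ys)
    pooled = {}
    for name, ys in groups.items():
        precision_data = len(ys) / sigma ** 2
        precision_prior = 1.0 / tau ** 2
        weight = precision_data / (precision_data + precision_prior)
        pooled[name] = weight * statistics.fmean(ys) + (1.0 - weight) * mu
    return mu, pooled

# Made-up review scores by genre; group sizes differ on purpose.
groups = {
    "dramas": [3.9, 4.1, 4.3, 4.0, 4.2, 3.8, 4.4, 4.1],
    "comedies": [3.0, 3.4],
    "documentaries": [4.9],
}
mu, pooled = partial_pool(groups, sigma=0.5, tau=0.3)
print(mu, pooled)
```

Groups with little data are pulled strongly toward the population estimate, while groups with many observations barely move. Shrinkage to zero corresponds to replacing `mu` with 0.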

]]>There are a couple of theories about why deep neural networks perform well on some problems. For example, in object recognition problems you are trying to create a classifier that is invariant to various transformations of the object in the image: e.g., rotation, translation, scaling, etc. Stephane Mallat has done a lot of work showing that if you use features that are designed to be invariant to these transformations already (rather than just the pixels) then you can get the same performance as a deep neural network with a regression.

I think of it like this. The data live on some curved manifold. Then you add noise to the manifold. If the magnitude of the curvature of the manifold is larger than the noise, then the noisy manifold will still look curved and a neural network will work better than, say, a regression. If the magnitude of the curvature of the manifold is smaller than the noise, then the noisy manifold will look flat and a regression will do just fine.
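
Here is a toy sketch of that intuition, with several stand-ins that are my own assumptions: the "manifold" is the curve y = c * x^2, a k-nearest-neighbour smoother plays the role of the flexible model (it is not a neural network), and the sample sizes, noise level, and seed are arbitrary.

```python
import random
import statistics

def make_data(curvature, noise_sd, n, rng):
    """Points on the curve y = curvature * x^2, plus Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [curvature * x * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def linear_fit(xs, ys):
    """Ordinary least squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def knn_fit(xs, ys, k=5):
    """Flexible stand-in model: average the k nearest training responses."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return statistics.fmean(y for _, y in nearest)
    return predict

def holdout_mse(model, xs, ys):
    return statistics.fmean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def compare(curvature, noise_sd=0.3, n=400, seed=3):
    """Hold-out error of the linear and flexible fits for a given curvature."""
    rng = random.Random(seed)
    train = make_data(curvature, noise_sd, n, rng)
    test = make_data(curvature, noise_sd, n, rng)
    return (holdout_mse(linear_fit(*train), *test),
            holdout_mse(knn_fit(*train), *test))

flat_linear, flat_knn = compare(curvature=0.05)   # curvature much smaller than noise
curved_linear, curved_knn = compare(curvature=3.0)  # curvature much larger than noise
print(flat_linear, flat_knn, curved_linear, curved_knn)
```

When the curvature dominates the noise, the linear fit's hold-out error is far worse than the flexible model's; when the noise dominates, the linear fit's error sits near the noise floor and the extra flexibility buys little.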

]]>Roughly, a “compact” subset is a generalization of a closed and bounded subset. Closed means it contains its limit points (boundary points) and bounded means it doesn’t go off to infinity. So yes, a single hidden layer of neurons can approximate any function you’d need in a typical applied problem by adjusting the large number of knobs (parameters describing the weights). So the problem becomes “how can we choose the settings on the knobs to do a ‘good job’?”

]]>The NN method basically makes the tradeoff of high dimensional parameter space vs domain knowledge by choosing high dimensional parameter space and then coping with that using sophisticated optimization techniques… it’s especially useful when we don’t have much domain knowledge.

]]>If you don’t have a specific point where you would like to compare them you probably need a couple of books to do all possible comparisons. If you actually are looking for a book where you get a lot of different flavors of machine learning (including neural networks) and “classic” statistical models you can try:

– Bishop’s Pattern recognition and machine learning (2006) you can find it free in: https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/

Now, with respect to analysing survey data with neural networks, I would give the same answer. What do you need the model for? Is it to do prediction? To estimate latent variables? It all depends: on the goal, on the amount of data, and on how versed you are in both “classic” statistical methods and machine learning.

]]>Assuming ML and AI stand for machine learning and artificial intelligence, one could say that they include NN as a subdomain. But the other way around makes less sense.

]]>One very interesting paper demonstrating that (perhaps unintentionally!) was https://www.nature.com/articles/s41746-018-0029-1 which came out last year from Google. That paper is focused on comparing the performance of different ML models for predicting various health outcomes from electronic health records.

If you go look in their supplement, their logistic regression models are within the margin of error of their much more complicated and sophisticated neural network ensembles. At best their fancier models only gain 0.01 or 0.02 AUROC over the much simpler baselines.

(The most annoying part of that paper is rather than celebrate the fact that logistic regression works so well, they hide those results in the supplement and don’t even mention them in the main text.)

]]>I wouldn’t be surprised if one could cast a given statistical procedure as a network of nodes and edges with specified or derived weights. Intermediate values like sums of squares would act like hidden nodes. That would make it essentially equivalent to a neural network.

]]>“The motivation for writing this paper was an article [18] published in Neural Networks in June 2017. The aim of the article was to improve the forecasting accuracy of stock price fluctuations and claimed that “the empirical results show that the proposed model indeed display a good performance in forecasting stock market fluctuations”.

“In our view, the results seemed extremely accurate for stock market series that are essentially close to random walks so we wanted to replicate the results of the article and emailed the corresponding author asking for information to be able to do so. We got no answer and we, therefore, emailed the Editor-in-Chief of the Journal asking for his help. He suggested contacting the other author to get the required information. We consequently, emailed this author but we never got a reply. Not being able to replicate the result of [18] and not finding research studies comparing ML methods with alternative ones we decided to start the research leading to this paper.”

Reference 18 is Wang J, Wang J. Forecasting stochastic neural network based on financial empirical mode decomposition. Neural Networks. 2017;90:8–20. https://doi.org/10.1016/j.neunet.2017.03.004. pmid:28364677

]]>