Comments on: Neural nets vs. regression models

By: Bill Spight

Bill Spight — Tue, 28 May 2019 17:00:42 +0000

In reply to Warren S Sarle. When I have had that problem, instead of blanks I used periods for spacing.

By: sam

sam — Tue, 28 May 2019 15:20:02 +0000

In reply to Kevin Van Horn. 'I don’t know what “posterior for each observation” means; do you mean predictive distribution?' Yeah, I suppose I'm not clear on the terminology, but for concreteness let's say you have a model y = b + e, where b is some parameter and e is some known error. There are two sources of uncertainty (as you implied with your original post): the data-generation noise e and the uncertainty in the parameter estimate, p(b). For a predictive distribution I think one would have to take both into account. Some sort of Bayesian estimation (like taking dropout into account) can help with parameter uncertainty, and some sort of mixture density can help with the noise in the data-generating process (DGP).
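
To make the two sources concrete, here is a small simulation of the y = b + e example. It is only a sketch: the normal distributions and the numbers (`b_mean`, `b_sd`, `noise_sd`) are made up for illustration and do not come from the thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed for illustration: the posterior for b is Normal(2.0, 0.5^2)
# and the noise e is Normal(0, 1.0^2).
b_mean, b_sd = 2.0, 0.5
noise_sd = 1.0

n_draws = 200_000
b_draws = rng.normal(b_mean, b_sd, size=n_draws)              # parameter uncertainty
y_draws = b_draws + rng.normal(0.0, noise_sd, size=n_draws)   # plus data-generation noise

# For independent normals the predictive sd combines both sources.
analytic_sd = np.sqrt(b_sd**2 + noise_sd**2)
print(y_draws.std(), analytic_sd)
```

Under these assumptions the predictive spread is wider than the noise alone, which is the sense in which both sources have to be taken into account.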

By: Warren S Sarle

Warren S Sarle — Tue, 28 May 2019 03:23:07 +0000

In reply to Warren S Sarle. Well, I don't see how to edit a comment, and HTML's reduction of multiple blanks to a single blank makes my previous comment unintelligible.

By: Warren S Sarle

Warren S Sarle — Tue, 28 May 2019 03:18:01 +0000

Neural Network and Statistical Jargon
=====================================

Warren S. Sarle Apr 29, 1996

The neural network (NN) and statistical literatures contain many of the
same concepts but usually with different terminology. Sometimes the same
term or acronym is used in both literatures but with different meanings.
Only in very rare cases is the same term used with the same meaning,
although some cross-fertilization is beginning to happen. Below is a
list of such corresponding terms or definitions.

Particularly loose correspondences are marked by a ~ between the two
columns. A < indicates that the term on the left is roughly a subset of
the term on the right, and a > indicates the reverse. Terminology in
both fields is often vague, so precise equivalences are not always
possible. The list starts with some basic definitions.

There is disagreement in the NN literature on how to count layers. Some
people count inputs as a layer and some don’t. I specify the number of
hidden layers instead. This is awkward but unambiguous.

Definition                       Statistical Jargon
==========                       ==================

generalizing from noisy data     Statistical inference
and assessment of the
accuracy thereof

the set of all cases one         Population
wants to be able to
generalize to

a function of the values in      Parameter
a population, such as the
mean or a globally optimal
synaptic weight

a function of the values in      Statistic
a sample, such as the mean
or a learned synaptic weight

Neural Network Jargon            Definition
=====================            ==========

Neuron, neurode, unit,           a simple linear or nonlinear computing
node, processing element         element that accepts one or more inputs,
                                 computes a function thereof, and may
                                 direct the result to one or more other
                                 neurons

Neural networks                  a class of flexible nonlinear regression
                                 and discriminant models, data reduction
                                 models, and nonlinear dynamical systems
                                 consisting of an often large number of
                                 neurons interconnected in often complex
                                 ways and often organized into layers

Neural Network Jargon            Statistical Jargon
=====================            ==================

Statistical methods              Linear regression and discriminant
                                 analysis, simulated annealing, random
                                 search

Architecture                     Model

Training, Learning,              Estimation, Model fitting, Optimization
Adaptation

Classification                   Discriminant analysis

Mapping, Function                Regression
approximation

Supervised learning              Regression, Discriminant analysis

Unsupervised learning,           Principal components, Cluster analysis,
Self-organization                Data reduction

Competitive learning             Cluster analysis

Hebbian learning,                Principal components
Cottrell/Munro/Zipser
technique

Training set                     Sample, Construction sample

Test set, Validation set         Hold-out sample

Pattern, Vector, Example,        Observation, Case
Sample, Case

Reflectance pattern              an observation normalized to sum to 1

Binary(0/1),                     Binary, Dichotomous
Bivalent or Bipolar(-1/1)

Input                            Independent variables, Predictors,
                                 Regressors, Explanatory variables,
                                 Carriers

Output                           Predicted values

Forward propagation              Prediction

Training values                  Dependent variables, Responses,
Target values                    Observed values

Training pair                    Observation containing both inputs
                                 and target values

Shift register,                  Lagged variable
(Tapped) (time) delay (line),
Input window

Errors                           Residuals

Noise                            Error term

Generalization                   Interpolation, Extrapolation,
                                 Prediction

Error bars                       Confidence interval

Prediction                       Forecasting

Adaline                          Linear two-group discriminant analysis
(ADAptive LInear NEuron)         (not Fisher’s but generic)

(No-hidden-layer) perceptron   ~ Generalized linear model (GLIM)

Activation function,           > Inverse link function in GLIM
Signal function,
Transfer function

Softmax                          Multiple logistic function

Squashing function               bounded function with infinite domain

Semilinear function              differentiable nondecreasing function

Phi-machine                      Linear model

Linear 1-hidden-layer            Maximum redundancy analysis, Principal
perceptron                       components of instrumental variables

1-hidden-layer perceptron      ~ Projection pursuit regression

Weights, Shrinkage estimation, Ridge regression

Jitter                           random noise added to the inputs to
                                 smooth the estimates

Growing, Pruning, Brain          Subset selection, Model selection,
damage, Self-structuring,        Pre-test estimation
Ontogeny

Optimal brain surgeon            Wald test

LMS (Least mean squares)         OLS (Ordinary least squares)
                                 (see also “LMS rule” above)

Relative entropy, Cross          Kullback-Leibler divergence
entropy

Evidence framework               Empirical Bayes estimation

OLS (Orthogonal least squares)   Forward stepwise regression

Probabilistic neural network     Kernel discriminant analysis

General regression neural        Kernel regression
network

Topologically distributed      < (Generalized) Additive model
encoding

Adaptive vector quantization     iterative algorithms of doubtful
                                 convergence for K-means cluster analysis

Adaptive Resonance Theory 2a   ~ Hartigan's leader algorithm

Learning vector quantization     a form of piecewise linear discriminant
                                 analysis using a preliminary cluster
                                 analysis

Counterpropagation               Regressogram based on k-means clusters

Encoding, Autoassociation        Dimensionality reduction
                                 (Independent and dependent variables
                                 are the same)

Heteroassociation                Regression, Discriminant analysis
                                 (Independent and dependent variables
                                 are different)

Epoch                            Iteration

Continuous training,             Iteratively updating estimates one
Incremental training,            observation at a time via difference
On-line training,                equations, as in stochastic approximation
Instantaneous training

Batch training,                  Iteratively updating estimates after
Off-line training                each complete pass over the data as in
                                 most nonlinear regression algorithms

By: Kevin Van Horn

Kevin Van Horn — Thu, 23 May 2019 19:14:52 +0000

In reply to sam.

I don’t know what “posterior for each observation” means; do you mean predictive distribution?

Anyway, here’s one example where predictive distributions are important. When you’re doing capacity planning, then getting a point estimate of the load at a future date is of limited use; what you really want is a high-confidence upper bound. That is, you might want to find x such that

Pr(load > x) = alpha

for some small alpha (0.01, 0.001). More commonly, you’ll want to be able to handle the max load over some period: find x such that

Pr(max_i load_i > x) = alpha.

You would prefer not to have to choose alpha up front, as that may depend on executive decisions that haven’t been made yet and on costs and priorities that may not be known yet or may change. (Those costs include both the cost of providing resources that may not be used, and the negative consequences of not having enough capacity to meet the actual load.) You also may not want to set in stone the width of the period of interest in the second case above. So you really do need a predictive distribution.
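
As a rough sketch of why draws from a predictive distribution are convenient here: once you have them, both bounds above are just quantiles, and alpha and the period can be chosen afterwards. The lognormal draws below are a made-up stand-in for a real model's predictive simulations, and `upper_bound` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for predictive draws: 10,000 simulated futures, each with 30 daily loads.
# In practice these would come from the fitted model, not from a fixed lognormal.
load_draws = rng.lognormal(mean=3.0, sigma=0.4, size=(10_000, 30))

def upper_bound(draws, alpha):
    """Return x such that the estimated Pr(draw > x) is about alpha."""
    return np.quantile(draws, 1.0 - alpha)

single_day = load_draws[:, 0]         # load on one future date
period_max = load_draws.max(axis=1)   # max_i load_i over the 30-day period

for alpha in (0.05, 0.01, 0.001):
    print(alpha, upper_bound(single_day, alpha), upper_bound(period_max, alpha))
```

With only 10,000 draws the alpha = 0.001 bound rests on about ten simulated futures, so very small alphas need many more draws or a parametric tail.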

As another example, I’m doing some work where we’re building Monte Carlo models in support of complex decisions. The result of a regression analysis might be only one piece of a larger model, with its outcome variable being only an intermediate step in computing the final net benefit of a decision.

Finally, nonlinear utility functions aren’t the issue; outputting the predictive mean is an optimal action only if (1) your action consists solely of outputting a prediction, rather than using that prediction to inform some other decision (which might have only a finite number of possible alternatives), and (2) your utility function (or loss function) is *quadratic*.
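
For readers who want the reasoning behind condition (2) spelled out, the standard decomposition is sketched below. The symbols a (the reported point prediction) and \mu (the predictive mean) are introduced here for illustration.

```latex
\begin{aligned}
\mathbb{E}\big[(y - a)^2\big]
  &= \mathbb{E}\big[(y - \mu)^2\big] + 2(\mu - a)\,\mathbb{E}[y - \mu] + (\mu - a)^2 \\
  &= \operatorname{Var}(y) + (\mu - a)^2, \qquad \mu = \mathbb{E}[y].
\end{aligned}
```

The first term does not depend on a, so expected quadratic loss is minimized at a = \mu. Under other losses the best point prediction differs: absolute-error loss is minimized by the predictive median, for example.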

By: sam

sam — Thu, 23 May 2019 15:57:56 +0000

In reply to Daniel Lakeland. Daniel, I agree it's easy to come up with conceptual problems where you'd have nonlinear utilities. But in order to maximize that, you'd have to have a concrete mathematical equation for your utility. I'm interested in how one would actually come up with that equation. For most instances I think that's challenging. Also, spitballing here, what if you just model utility(y)? Would you need uncertainty parameters then?

By: Daniel Lakeland

Daniel Lakeland — Wed, 22 May 2019 21:03:28 +0000

In reply to sam.

I’d also be curious what decision analysis Kevin is involved in, but really it’s easy to come up with nonlinear utilities. Like imagine you have an investment of x dollars now for a series of uncertain payouts with uncertain number of dollars at uncertain times in the future. The utility will be nonlinear, like exp(-r*t), in the time at which the payouts occur…

Or imagine there’s a regulatory situation where pollution below some amount is allowed but above some amount there are nonlinear fines, and you need to optimize your maintenance schedule on your equipment…

or whatever. it’s easy to come up with nonlinear utility on uncertain parameters.
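
A quick numerical sketch of the discounting example: because exp(-r*t) is nonlinear in t, plugging in a point estimate of the payout time gives a different answer from averaging over the uncertainty. The discount rate, payout, and uniform distribution for t are all made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

r = 0.05          # assumed annual discount rate
payout = 1000.0   # assumed payout in dollars
# Assumed uncertainty about the payout time: uniform between 1 and 30 years.
t_draws = rng.uniform(1.0, 30.0, size=100_000)

expected_value = np.mean(payout * np.exp(-r * t_draws))   # averages over the uncertainty in t
plug_in_value = payout * np.exp(-r * t_draws.mean())      # plugs in the mean of t

print(expected_value, plug_in_value)
```

Since exp(-r*t) is convex in t, the averaged value is always at least the plug-in value (Jensen's inequality), so here a point estimate of t would undervalue the investment.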

By: Bob

Bob — Wed, 22 May 2019 20:37:02 +0000

Oh yeah, it’s in the original blog post.

By: Bob

Bob — Wed, 22 May 2019 20:35:12 +0000

Spyros Makridakis ran the M4 competition comparing models for time series prediction. I haven’t studied it in detail but he reckons the ML models underperformed statistical time series models.

https://www.sciencedirect.com/science/article/pii/S0169207018300785

By: sam

sam — Wed, 22 May 2019 19:57:32 +0000

In reply to Kevin Van Horn. Hey Kevin, Kind of curious as to what decisions you're making. I can see if you have some sort of non-linear utility function you're maximizing you might need a posterior for each observation. But where does that happen?

By: Corey

Corey — Wed, 22 May 2019 19:19:36 +0000

In reply to Kevin Van Horn. This particular variational approximation involves Bernoulli distributions in a way I am no longer clear on (if I ever was). I think with these sorts of models the thing to be worried about is posterior predictive performance and in particular, understating the uncertainty of the prediction. My understanding is that KL-loss-based variational approximations are quite prone to this problem and also that it does show up in the DNN context.

By: Kevin Van Horn

Kevin Van Horn — Wed, 22 May 2019 15:36:17 +0000

In reply to Bob Carpenter. "Radford Neal showed in his thesis (which also introduced HMC to the stats world!)" I wish we saw more of the kind of technology transfer from one field to another that Neal did there. I get the impression that the physics community has all kinds of wonderful mathematical tools that could be usefully applied in other disciplines, if only the right people knew about them.

By: Kevin Van Horn

Kevin Van Horn — Wed, 22 May 2019 15:31:59 +0000

In reply to Corey.

Thanks for the reference. I still have doubts about the usefulness of even an optimal variational approximation to the posterior in this case. If I understand correctly, your typical variational approximation for a high-dimensional parameter space is going to approximate the posterior as an axis-parallel multivariate normal (i.e. independent normals for each parameter), and from my understanding of deep neural nets I expect this to be an exceedingly bad approximation — they have not just massive multimodality, but also very strong posterior correlation between parameters.

I would be delighted to find that I am mistaken here…

By: Kevin Van Horn

Kevin Van Horn — Wed, 22 May 2019 15:23:12 +0000

In reply to Daniel Lakeland. Not to mention the massive multimodality that makes sampling the posterior problematic.

By: Keith O'Rourke

Keith O'Rourke — Wed, 22 May 2019 11:33:04 +0000

In reply to Eliot J.

Eliot:

Did you also miss: Stat Med. 1998 Nov 15;17(21):2501-8. A comparison of statistical learning methods on the Gusto database. Ennis M, Hinton G, Naylor D, Revow M, Tibshirani R.

Abstract
We apply a battery of modern, adaptive non-linear learning methods to a large real database of cardiac patient data. We use each method to predict 30 day mortality from a large number of potential risk factors, and we compare their performances. We find that none of the methods could outperform a relatively simple logistic regression model previously developed for this problem.

By: Al

Al — Wed, 22 May 2019 02:05:11 +0000

In reply to Bob Carpenter. Very useful!

By: Eliot J

Eliot J — Wed, 22 May 2019 01:35:14 +0000

My brief review of the literature neglected to mention that one of the first contributions with an explicit comparison of the two communities (stats vs ML) was Breiman’s. In 1999 he introduced CART random forests by comparing the predictive accuracy of ~1,000 RFs with a single iteration of logistic regression, concluding that the ensemble predictions were more accurate than LR.

https://www.stat.berkeley.edu/~breiman/random-forests.pdf

Thanks to Andrew for posting this query.

By: Anoneuoid

Anoneuoid — Tue, 21 May 2019 23:07:12 +0000

In reply to Daniel Lakeland. The interesting part to me is how obvious it is to say "sure" now, in contrast to being able to see what the experts were missing just a few years ago. I think if researchers delineated all assumptions they were making, then reviewers (inevitably) pointed out more, it would be great.

By: Daniel Lakeland

Daniel Lakeland — Tue, 21 May 2019 22:33:44 +0000

In reply to Anoneuoid. Sure, it's basically just that it's theoretically sufficient to have one layer, while in practice it's more efficient to have several layers. But in the end, it's not really different in any deep theoretical way from polynomial regression or Fourier series or anything else you might use to represent a function.

By: Nick Adams

Nick Adams — Tue, 21 May 2019 21:26:21 +0000

In reply to Bob Carpenter. Never mind logistic regression, medical staff are better at predicting inpatient mortality using only their pre-existing organic neural net: AUROC up to 0.9. PLoS One. 2014; 9(7): e101739

By: Anoneuoid

Anoneuoid — Tue, 21 May 2019 20:36:13 +0000

In reply to Daniel Lakeland. The thing is that a single layer only does that asymptotically. You can still read stack overflow posts with people parroting that as a reason never to try deep learning up to a few years ago. What is true asymptotically may not be at all correct for someone with finite resources and time.

By: Ron Kenett

Ron Kenett — Tue, 21 May 2019 19:23:55 +0000

Another interesting thread……thank you for making this happen.

Comparing models for analysing survey data is a great topic. We had a project on this at the University of Turin and Milan, a while ago. Two contributions from this:

1. We edited in 2011 a book published by Wiley titled Modern Analysis of Customer Surveys: with Applications using R, https://www.wiley.com/en-us/Modern+Analysis+of+Customer+Surveys%3A+with+Applications+using+R-p-9781119961383
Chapter 10 is on Statistical inference for causal effects by Fabrizia Mealli, Barbara Pacini and Donald B. Rubin
The following 11 chapters analyse the same dataset (the ABC data that can be downloaded from the book’s website). These are:
Bayesian Networks (11, Kenett, Salini)
Log Linear Models (12, Fienberg, Mandrique)
CUB Models (13, Piccolo, Iannario)
The Rasch Model (14, De Battisti, Nicolini, Salini)
Tree-based Methods and Decision Trees (15, Soffritti, Galimberti)
PLS Models (16, Boari, Cantaluppi)
Nonlinear PCA (17, Ferrari, Barbero)
Multidimensional Scaling (18, Solaro)
Multilevel Models for Ordinal Data (19, Rampichini, Grilli)
Control Charts Applications (20, Kenett, Deldossi, Zappa)
Fuzzy Methods (21, Zani, Morlini, Milioli)

This provides a unique opportunity to compare what you get from different models applied to the same dataset.

A paper that proposes to combine models in order to enhance information quality generated by analysis was also published in ASMBI: https://onlinelibrary.wiley.com/doi/abs/10.1002/asmb.927

By: Corey

Corey — Tue, 21 May 2019 17:51:41 +0000

In reply to Kevin Van Horn.

Kevin, you may be interested in this (although it’s 4 years old now): http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html

It shows how a certain form of drop-out (a regularization technique that involves removing edges at random during training) can be viewed as a Monte Carlo version of a Bayesian variational approximation; the upshot is that you can do drop-out during prediction to approximate sampling from the posterior distribution.
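
The mechanics of "drop-out during prediction" can be sketched in a few lines. This is a toy illustration only, not the method from the linked post: the weights are random stand-ins for trained ones, dropout is applied to hidden units rather than individual edges, and `forward` and `keep_prob` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy one-hidden-layer network with made-up weights (stand-ins for trained ones).
W1 = rng.normal(size=(1, 50))
b1 = rng.normal(size=50)
W2 = rng.normal(size=(50, 1)) / np.sqrt(50)

def forward(x, keep_prob, rng):
    """One stochastic forward pass with dropout applied to the hidden units."""
    hidden = np.tanh(x @ W1 + b1)
    mask = rng.random(hidden.shape) < keep_prob   # keep each hidden unit with prob keep_prob
    hidden = hidden * mask / keep_prob            # inverted-dropout rescaling
    return hidden @ W2

x_new = np.array([[0.3]])
# Repeating the stochastic pass gives a spread of predictions for the same input.
samples = np.array([forward(x_new, 0.8, rng)[0, 0] for _ in range(1000)])
print(samples.mean(), samples.std())
```

The spread of `samples` is what gets read as approximate posterior uncertainty; how good that approximation is remains a separate question.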

By: Daniel Lakeland

Daniel Lakeland — Tue, 21 May 2019 17:35:28 +0000

In reply to Kevin Van Horn. In principle, one could describe a model for data using a neural network, whose parameters had prior distributions, and with a probability distribution over the errors given by a Bayesian model, and wind up with a Bayesian posterior over the parameters of a neural network model, but this is rarely done. One reason is that it's virtually impossible for a normal human to express useful priors over neural network parameters because "what they do" is very opaque and there are generally going to be lots of inter-dependencies. Another reason is it's hard to fit high dimensional models in general, and it requires tons of computing time.

By: Kevin Van Horn

Kevin Van Horn — Tue, 21 May 2019 17:12:01 +0000

My issue with neural networks is that they focus on point predictions, so it is difficult to get the predictive *distributions* you need for decision analysis. Yes, you can often modify the objective function of an NN model to get predictive distributions instead of point predictions, but this is rarely done. Even if you do create a model that produces predictive distributions, good luck on incorporating parameter uncertainty.
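
One common way to "modify the objective function" is to have the network output a mean and a log-variance for each case and train on the Gaussian negative log-likelihood instead of squared error. The sketch below shows only the loss, under that assumed Gaussian form; `gaussian_nll` is a hypothetical helper, and nothing here addresses parameter uncertainty.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Average negative log-likelihood of y under Normal(mean, exp(log_var))."""
    return np.mean(0.5 * (np.log(2.0 * np.pi) + log_var + (y - mean) ** 2 / np.exp(log_var)))

y = np.array([1.0, 2.0, 3.0])
# Same point predictions, two different claimed uncertainties.
print(gaussian_nll(y, mean=np.array([1.5, 2.5, 3.5]), log_var=np.zeros(3)))
print(gaussian_nll(y, mean=np.array([1.5, 2.5, 3.5]), log_var=np.full(3, -4.0)))
```

Unlike squared error, this loss penalizes a model that claims a much smaller variance than its errors justify, so the fitted variance carries information about predictive spread.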

By: Ricardo

Ricardo — Tue, 21 May 2019 16:55:20 +0000

David MacKay had a lecture that touches this topic: https://www.youtube.com/watch?v=Z1pcTxvCOgw

By: Tom M

Tom M — Tue, 21 May 2019 16:52:54 +0000

In reply to Bob Carpenter. This is very helpful! It's good to have a semanticist in the house, whether recovering or relapsing.

By: Daniel Lakeland

Daniel Lakeland — Tue, 21 May 2019 16:49:00 +0000

In reply to Bob Carpenter. Thanks Bob, very good addition to the discussion, will keep people who have less background info from getting confused.

By: Bob Carpenter

Bob Carpenter — Tue, 21 May 2019 16:46:57 +0000

In reply to Ethan Steinberg. This is worth blogging about separately. Here's their table of results:

Those baselines are logistic regressions. I'd like to see how sensitive their results were to the dozens of hyperparameters they tuned.

By: Keith O'Rourke

Keith O'Rourke — Tue, 21 May 2019 16:45:54 +0000

A better contrast might be interpretable versus explainable [black box] models (with Bayesian analyses done with inadequate workflow to understand them being in the explainable models category).

As an update on https://statmodeling.stat.columbia.edu/2018/10/30/explainable-ml-versus-interpretable-ml/ see this paper: “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead” by Cynthia Rudin, https://www.nature.com/articles/s42256-019-0048-x.epdf?author_access_token=SU_TpOb-H5d3uy5KF-dedtRgN0jAjWel9jnR3ZoTv0M3t8uDwhDckroSbUOOygdba5KNHQMo_Ji2D1_SdDjVr6hjgxJXc-7jt5FQZuPTQKIAkZsBoTI4uqjwnzbltD01Z8QwhwKsbvwh-z1xL8bAcg%3D%3D

In particular “Here is the Rashomon set argument: consider that the data permit a large set of reasonably accurate predictive models to exist. … Unpacking this argument slightly, for a given data set, we define the Rashomon set as the set of reasonably accurate predictive models (say within a given accuracy from the best model accuracy of boosted decision trees). Because the data are finite, the data could admit many close-to-optimal models that predict differently from each other: a large Rashomon set. I suspect this happens often in practice because sometimes many different ML algorithms perform similarly on the same data set, despite having different functional forms (for example, random forests, neural networks, support vector machines).”

By: Bob Carpenter

Bob Carpenter — Tue, 21 May 2019 16:35:50 +0000

As a recovering semanticist, I think I can help with terminology.

1. Artificial intelligence is an application. It just means a machine doing something we typically think of people as doing. Like correcting spelling mistakes or driving cars or playing Settlers of Catan. It’s not a mathematical technique. We can build AIs with heuristics or we can build them with statistics or we can build them with both.

2. Machine learning is a broad class of techniques, not all of which are probabilistic. For example, support-vector machines and greedy agglomerative clustering algorithms are not probabilistic. Machine learning is currently the most popular way to build artificial intelligence applications.

3. Neural networks are a kind of statistical model that currently dominates research in machine learning and is thus currently the go-to method for developing artificial intelligence applications. Deep neural nets, by which people mean nets with more than one hidden layer, are a form of neural network. Deep nets are computationally intractable for traditional statistical inference due to both multimodality and the scale of the likelihood function. To cope, machine learning researchers have layered heuristics such as autoencoders, early stopping, etc. on top of standard estimation techniques. And because the form of the likelihood matters and generic networks don’t work, they include specialized structures like convolutional layers to deal with image transposition and rotation. That is, they don’t learn to recognize cats just from looking at cat pictures—a lot of heuristic knowledge about vision has been encoded in the likelihood architecture.

As David MacKay explains in his info theory book, logistic regression is a simple neural network with N inputs, one output, and no hidden layers (he called it “classification with one neuron” rather than logistic regression). With appropriate link functions, neural networks can be used as generalized linear models. Viewed another way, they’re stacked logistic regressions, which is where the non-linearity comes from.

4. Gaussian processes are a kind of statistical model, albeit a computationally intractable one at scale due to the requirement to solve matrices whose dimensionality is given by the number of data points. Like neural networks, GPs can represent arbitrary multidimensional functions given enough data (subject to conditions imposed by priors like smoothness and by the choice of covariance function). Like in the generalized linear model case and in the neural network case, we can throw logit link functions on Gaussian processes and use them for binary or categorical data. Like neural networks, their architecture is very general and there is a lot of heuristic/subjective knowledge going into the choice of covariance function.

Radford Neal showed in his thesis (which also introduced HMC to the stats world!) that Gaussian processes are the limit of a single hidden-layer neural network as the number of hidden nodes goes to infinity.

5. Panel data and time series are just forms of data. They can be handled by any kind of approach.

6. Hierarchical modeling is a technique for partial pooling, aka modeling population effects. It pulls estimates for individuals or groups toward the estimate for the overall population of which they are a member. Machine learning researchers, including in neural networks, tend to only use shrinkage of estimates to zero (to avoid overfitting rather than to regularize to population estimates). They also tend to use fixed effects for populations rather than hierarchical modeling. Where you see hierarchical modeling in machine learning is in what they call domain adaptation, such as building a sentiment classifier for reviews for different genres of movies (dramas vs. comedies, for example) or products (shoes vs. refrigerators).

7. Bayesian modeling is an approach to using prior data and performing inference to propagate uncertainty. The machine learning community tends to call any technique that applies Bayes’s rule “Bayesian” (e.g., naive Bayes classifiers, which are almost never Bayesian in either modeling or inference). ML researchers also use the term “prior” very broadly to include prevalence of categories in a model (such as naive Bayes). They also tend to use “Bayesian” to describe any system with a prior on parameters, even if it’s essentially being fit with penalized maximum likelihood. Statisticians tend to reserve the term “Bayesian” for full Bayesian inference, where we average our predictions over our estimation uncertainty when performing posterior predictive inference.
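
The distinction in point 7 between having a prior and doing full Bayesian inference can be seen in the simplest possible case, a coin with a uniform Beta(1, 1) prior. This is a textbook sketch with hypothetical function names, not anything specific to the methods discussed above.

```python
def plug_in_prob(k, n):
    """Point-estimate prediction: the maximum likelihood estimate k / n,
    which is also the posterior mode under a uniform Beta(1, 1) prior."""
    return k / n

def posterior_predictive_prob(k, n, a=1.0, b=1.0):
    """Pr(next trial succeeds | data), averaging over the
    Beta(a + k, b + n - k) posterior instead of plugging in one value."""
    return (a + k) / (a + b + n)

# Three successes in three trials.
print(plug_in_prob(3, 3), posterior_predictive_prob(3, 3))
```

Both calculations use the same prior, but the plug-in answer says the next trial is certain to succeed, while averaging over the remaining uncertainty about the parameter gives a more cautious prediction.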

By: Charles Fisher

Charles Fisher — Tue, 21 May 2019 16:23:13 +0000

We did a fairly comprehensive comparison between neural networks, linear models, and other approaches for making predictions about phenotypes from RNA-sequencing data here: https://www.biorxiv.org/content/10.1101/574723v1.abstract. We found that — averaged across the 50 or so prediction tasks we looked at — everything performed about the same with L2-regularized regression slightly leading on average rank.

There are a couple of theories about why deep neural networks perform well on some problems. For example, in object recognition problems you are trying to create a classifier that is invariant to various transformations of the object in the image: e.g., rotation, translation, scaling, etc. Stephane Mallat has done a lot of work showing that if you use features that are designed to be invariant to these transformations already (rather than just the pixels) then you can get the same performance as a deep neural network with a regression.

I think of it like this. The data live on some curved manifold. Then you add noise to the manifold. If the magnitude of the curvature of the manifold is larger than the noise, then the noisy manifold will still look curved and a neural network will work better than, say, a regression. If the magnitude of the curvature of the manifold is smaller than the noise, then the noisy manifold will look flat and a regression will do just fine.

By: Daniel Lakeland

Daniel Lakeland — Tue, 21 May 2019 16:19:38 +0000

In reply to Daniel Lakeland.

https://en.wikipedia.org/wiki/Universal_approximation_theorem

Roughly, a “compact” subset is a generalization of a closed and bounded subset. Closed means it contains its limit points (boundary points) and bounded means it doesn’t go off to infinity. So yes, a single layer of neurons can approximate any function you’d need in a typical applied problem by adjusting the large number of knobs (parameters describing the weights). So the problem becomes: how can we choose the settings on the knobs to do a “good job”?
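
A hand-built illustration of the "knobs" idea, assuming ReLU hidden units (the theorem itself covers a broader class of activations): a single hidden layer can reproduce the piecewise-linear interpolant of a function, and adding units tightens the fit. The weights are set by construction rather than by training, and `build_relu_net` and `predict` are hypothetical helper names.

```python
import numpy as np

def build_relu_net(f, lo, hi, n_hidden):
    """Weights for a one-hidden-layer ReLU net that matches f at
    n_hidden + 1 equally spaced knots on [lo, hi]."""
    knots = np.linspace(lo, hi, n_hidden + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)                # slope on each segment
    out_w = np.concatenate(([slopes[0]], np.diff(slopes)))   # change in slope at each knot
    return knots[:-1], out_w, values[0]

def predict(x, knots, out_w, out_b):
    hidden = np.maximum(0.0, x[:, None] - knots[None, :])    # hidden unit k: relu(x - knot_k)
    return hidden @ out_w + out_b

x = np.linspace(-np.pi, np.pi, 2001)
max_err = {}
for n_hidden in (8, 64):
    params = build_relu_net(np.sin, -np.pi, np.pi, n_hidden)
    max_err[n_hidden] = np.max(np.abs(predict(x, *params) - np.sin(x)))
print(max_err)
```

For a smooth target like sin, the worst-case error of this construction shrinks with the square of the knot spacing. Choosing the knobs this way is only possible because the target function is known, which is exactly what is missing in a real fitting problem.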

By: Daniel Lakeland

Daniel Lakeland — Tue, 21 May 2019 16:09:56 +0000

In reply to Tom Passin.

Yes, Neural Nets are universal function approximators, at least on bounded subsets of R^N which is every actual applied problem. Machine Learning/AI/Deep Learning stuff is at its core (as far as I can tell) the method of using various sophisticated usually stochastic optimization techniques to fit (mostly) Neural Net function approximators using somewhat sophisticated but generic loss functions especially based on hold-out data to avoid overfitting.

The NN method basically makes the tradeoff of high dimensional parameter space vs domain knowledge by choosing high dimensional parameter space and then coping with that using sophisticated optimization techniques… it’s especially useful when we don’t have much domain knowledge.

By: Sergio Garrido

Sergio Garrido — Tue, 21 May 2019 16:01:34 +0000

In reply to Sergio Garrido. I wrote the first sentence very poorly. What I meant is that ML, DL and AI are not separate things and neural networks don't include them. On the contrary, NNs are a subset of all of them, and they are subsets of each other.

By: Sergio Garrido

Sergio Garrido — Tue, 21 May 2019 15:46:55 +0000

I agree with Prof. Gelman in the fact that neural networks are also statistical models and the same with ML, DL and AI (DL is a subset of ML and ML is a subset of AI). The important thing, however, is to note that you need some criteria for your comparison. Do you want to compare them based on accuracy on a test set? interpretability? fairness of the outcome?

If you don’t have a specific point where you would like to compare them you probably need a couple of books to do all possible comparisons. If you actually are looking for a book where you get a lot of different flavors of machine learning (including neural networks) and “classic” statistical models you can try:

– Bishop’s Pattern Recognition and Machine Learning (2006), which you can find for free at: https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/

Now, with respect to analysing survey data with neural networks, I would give the same answer. What do you need the model for? Is it to do prediction? To estimate latent variables? It all depends. Depends on the goal, depends on the amount of data, depends on how versed you are in both “classic” statistical methods and machine learning.

By: Carlos Ungil

Carlos Ungil — Tue, 21 May 2019 15:38:15 +0000

In reply to Carlos Ungil. Sorry, I put my comment under another comment by mistake (and also dropped the s in “makes”).

By: Carlos Ungil

Carlos Ungil — Tue, 21 May 2019 15:34:50 +0000

In reply to Tom Passin. > For instance, NNs could include ML, DL, AI, and so on. Assuming ML and AI stand for machine learning and artificial intelligence one could say that they include NN as a subdomain. But the other way around it make less sense.

By: Ethan Steinberg

Ethan Steinberg — Tue, 21 May 2019 14:36:04 +0000

It truthfully seems like neural networks are simply not the most optimal tool for tabular datasets.

One very interesting paper demonstrating that (perhaps unintentionally!) was https://www.nature.com/articles/s41746-018-0029-1 which came out last year from Google. That paper is focused on comparing the performance of different ML models for predicting various health outcomes from electronic health records.

If you go look in their supplement, their logistic regression models are within the margin of error of their much more complicated and sophisticated neural network ensembles. At best their fancier models only gain 0.01 or 0.02 AUROC over the much simpler baselines.

(The most annoying part of that paper is that, rather than celebrating the fact that logistic regression works so well, they hide those results in the supplement and don’t even mention them in the main text.)

By: Tom Passin

Tom Passin — Tue, 21 May 2019 14:30:52 +0000

Bart Kosko wrote in one or another of his books that neural nets are basically universal approximators. They can approximate any function. So they can be subject to any of the ills of other approximating systems, like overfitting, inappropriate fitting criteria, lack of orthogonality of inputs or internal variables, unpredictable results when extrapolating outside the range of the training data, etc.

I wouldn’t be surprised if one could cast a given statistical procedure as a network of nodes and edges with specified or derived weights. Intermediate values like sums of squares would act like hidden nodes. That would make it essentially equivalent to a neural network.

By: zbicyclist

zbicyclist — Tue, 21 May 2019 14:16:11 +0000

Thanks for the Makridakis link; I hadn’t seen that. This passage may seem familiar to readers of this blog:

“The motivation for writing this paper was an article [18] published in Neural Networks in June 2017. The aim of the article was to improve the forecasting accuracy of stock price fluctuations and claimed that “the empirical results show that the proposed model indeed display a good performance in forecasting stock market fluctuations”.

“In our view, the results seemed extremely accurate for stock market series that are essentially close to random walks so we wanted to replicate the results of the article and emailed the corresponding author asking for information to be able to do so. We got no answer and we, therefore, emailed the Editor-in-Chief of the Journal asking for his help. He suggested contacting the other author to get the required information. We consequently, emailed this author but we never got a reply. Not being able to replicate the result of [18] and not finding research studies comparing ML methods with alternative ones we decided to start the research leading to this paper.”

Reference 18 is Wang J, Wang J. Forecasting stochastic neural network based on financial empirical mode decomposition. Neural Networks. 2017;90:8–20. https://doi.org/10.1016/j.neunet.2017.03.004. pmid:28364677