Neural nets vs. regression models

Posted on May 21, 2019 9:55 AM by Andrew

Eliot Johnson writes:

I have a question concerning papers comparing two broad domains of modeling: neural nets and statistical models. Both terms are catch-alls, within each of which there are, quite obviously, multiple subdomains. For instance, NNs could include ML, DL, AI, and so on. While statistical models should include panel data, time series, hierarchical Bayesian models, and more.

I’m aware of two papers that explicitly compare these two broad domains:

(1) Sirignano, et al., Deep Learning for Mortgage Risk,

(2) Makridakis, et al., Statistical and Machine Learning forecasting methods: Concerns and ways forward

But there must be more than just these two examples. Are there others that you are aware of? Do you think a post on your blog would be useful? If so, I’m sure you can think of better ways to phrase or express my “two broad domains.”

My reply:

I don’t actually know.

Back in 1994 or so I remember talking with Radford Neal about the neural net models in his Ph.D. thesis and asking if he could try them out on analysis of data from sample surveys. The idea was that we have two sorts of models: multilevel logistic regression and Gaussian processes. Both models can use the same predictors (characteristics of survey respondents such as sex, ethnicity, age, and state), and both have the structure that similar respondents have similar predicted outcomes—but the two models have different mathematical structures. The regression model works with a linear predictor from all these factors, whereas the Gaussian process model uses an unnormalized probability density—a prior distribution—that encourages people with similar predictors to have similar outcomes.

My guess is that the two models would do about the same, following the general principle that the most important thing about a statistical procedure is not what you do with the data, but what data you use. In either case, though, some thought might need to go into the modeling. For example, you’ll want to include state-level predictors. As we’ve discussed before, when your data are sparse, multilevel regression works much better if you have good group-level predictors, and some of the examples where it appears that MRP performs poorly are examples where people are not using available group-level information.

Anyway, to continue with the question above, asking about neural nets and statistical models: Actually, neural nets are a special case of statistical models, typically Bayesian hierarchical logistic regression with latent parameters. But neural nets are typically estimated in a different way: the resulting posterior distributions will generally be multimodal, so rather than try the hopeless task of traversing the whole posterior distribution, we’ll use various approximate methods, which then are evaluated using predictive accuracy.

By the way, Radford’s answer to my question back in 1994 was that he was too busy to try fitting his models to my data. And I guess I was too busy too, because I didn’t try it either! More recently, I asked a computer scientist and he said he thought the datasets I was working with were too small for his methods to be very useful. More generally, though, I like the idea of RPP, also the idea of using stacking to combine Bayesian inferences from different fitted models.
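
As a rough illustration of the stacking idea, here is a minimal sketch for the two-model case. It assumes you already have each model's predictive density evaluated at held-out data points; the density values below are made up, and `stacking_weight` is an illustrative name, not a function from any library. Stacking picks the mixture weight that maximizes the summed log predictive density.

```python
import math

def log_score(w, dens_a, dens_b):
    """Summed log predictive density of the mixture w*A + (1-w)*B on held-out points."""
    return sum(math.log(w * a + (1.0 - w) * b) for a, b in zip(dens_a, dens_b))

def stacking_weight(dens_a, dens_b, grid_size=1001):
    """Grid search over w in [0, 1] for the weight on model A that maximizes the log score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda w: log_score(w, dens_a, dens_b))

# Hypothetical held-out predictive densities p(y_i | model), one entry per data point.
# Model A does better on the first three points, model B on the last three.
dens_a = [0.40, 0.35, 0.30, 0.05, 0.04, 0.06]
dens_b = [0.05, 0.04, 0.06, 0.40, 0.35, 0.30]

w = stacking_weight(dens_a, dens_b)
print(f"weight on A: {w:.3f}, weight on B: {1 - w:.3f}")
```

When each model is better on a different part of the data, the weights end up split between the two instead of selecting a single winner, which is the point of combining the fits. With more than two models the same objective is maximized over the simplex, which needs a proper optimizer in place of the grid.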

41 thoughts on “Neural nets vs. regression models”

zbicyclist on May 21, 2019 at 10:16 am said:

Thanks for the Makridakis link; I hadn’t seen that. This passage may seem familiar to readers of this blog:

“The motivation for writing this paper was an article [18] published in Neural Networks in June 2017. The aim of the article was to improve the forecasting accuracy of stock price fluctuations and claimed that “the empirical results show that the proposed model indeed display a good performance in forecasting stock market fluctuations”.

“In our view, the results seemed extremely accurate for stock market series that are essentially close to random walks so we wanted to replicate the results of the article and emailed the corresponding author asking for information to be able to do so. We got no answer and we, therefore, emailed the Editor-in-Chief of the Journal asking for his help. He suggested contacting the other author to get the required information. We consequently, emailed this author but we never got a reply. Not being able to replicate the result of [18] and not finding research studies comparing ML methods with alternative ones we decided to start the research leading to this paper.”

Reference 18 is Wang J, Wang J. Forecasting stochastic neural network based on financial empirical mode decomposition. Neural Networks. 2017;90:8–20. https://doi.org/10.1016/j.neunet.2017.03.004. pmid:28364677

Tom Passin on May 21, 2019 at 10:30 am said:

Bart Kosko wrote in one or another of his books that neural nets are basically universal approximators. They can approximate any function. So they can be subject to any of the ills of other approximating systems, like overfitting, inappropriate fitting criteria, lack of orthogonality of inputs or internal variables, unpredictable results when extrapolating outside the range of the training data, etc.
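
The approximation claim and the extrapolation caveat can both be made concrete with a hand-built network. The sketch below is a constructive illustration, not a fitted model: it assumes a one-dimensional target (`math.sin` on [0, 2π]) and sets the weights of a single hidden layer of steep sigmoid units by hand, so that each unit contributes one step of a staircase. All function names are illustrative.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def staircase_net(f, a, b, n_hidden):
    """Build a one-hidden-layer network approximating f on [a, b].

    Hidden unit k switches on near the midpoint of the k-th subinterval and
    adds the increment f(x_k) - f(x_{k-1}) to the output.
    """
    h = (b - a) / n_hidden
    steep = 20.0 / h  # larger values give sharper steps
    knots = [a + k * h for k in range(n_hidden + 1)]
    mids = [(knots[k - 1] + knots[k]) / 2.0 for k in range(1, n_hidden + 1)]
    jumps = [f(knots[k]) - f(knots[k - 1]) for k in range(1, n_hidden + 1)]
    bias = f(a)

    def net(x):
        return bias + sum(j * sigmoid(steep * (x - m)) for j, m in zip(jumps, mids))
    return net

def max_error(f, net, a, b, n_grid=1000):
    """Largest absolute error on an evenly spaced grid over [a, b]."""
    xs = [a + (b - a) * i / n_grid for i in range(n_grid + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

a, b = 0.0, 2.0 * math.pi
for n_hidden in (20, 200):
    net = staircase_net(math.sin, a, b, n_hidden)
    print(n_hidden, "hidden units, max error on [a, b]:", max_error(math.sin, net, a, b))

# Outside the range it was built for, the network is flat while sin keeps oscillating.
print("net(b + 1) =", net(b + 1.0), " sin(b + 1) =", math.sin(b + 1.0))
```

More hidden units shrink the error inside the interval, but nothing constrains the output outside it, which is the extrapolation problem in miniature.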

I wouldn’t be surprised if one could cast a given statistical procedure as a network of nodes and edges with specified or derived weights. Intermediate values like sums of squares would act like hidden nodes. That would make it essentially equivalent to a neural network.

- Carlos Ungil on May 21, 2019 at 11:34 am said:
  
  > For instance, NNs could include ML, DL, AI, and so on.
  
  Assuming ML and AI stand for machine learning and artificial intelligence, one could say that they include NN as a subdomain. But the other way around it make less sense.
  
  - Carlos Ungil on May 21, 2019 at 11:38 am said:
    
    Sorry, I put my comment under another comment by mistake (and also dropped the s in “makes”).
    
- Daniel Lakeland on May 21, 2019 at 12:09 pm said:
  
  Yes, Neural Nets are universal function approximators, at least on bounded subsets of R^N which is every actual applied problem. Machine Learning/AI/Deep Learning stuff is at its core (as far as I can tell) the method of using various sophisticated usually stochastic optimization techniques to fit (mostly) Neural Net function approximators using somewhat sophisticated but generic loss functions especially based on hold-out data to avoid overfitting.
  
  The NN method basically makes the tradeoff of high dimensional parameter space vs domain knowledge by choosing high dimensional parameter space and then coping with that using sophisticated optimization techniques… it’s especially useful when we don’t have much domain knowledge.
  
  - Daniel Lakeland on May 21, 2019 at 12:19 pm said:
    
    https://en.wikipedia.org/wiki/Universal_approximation_theorem
    
    Roughly, a “compact” subset is a generalization of a closed and bounded subset. Closed means it contains its limit points (boundary points) and bounded means it doesn’t go off to infinity. So yes, a single layer of neurons can approximate any function you’d need in a typical applied problem by adjusting the large number of knobs (parameters describing the weights). So the problem becomes: how can we choose the settings on the knobs to do a “good job”?
    
    - Anoneuoid on May 21, 2019 at 4:36 pm said:
      
      The thing is that a single layer only does that asymptotically. You can still read Stack Overflow posts, from as recently as a few years ago, with people parroting that as a reason never to try deep learning.
      
      What is true asymptotically may not be at all correct for someone with finite resources and time.
    - Daniel Lakeland on May 21, 2019 at 6:33 pm said:
      
      Sure, it’s basically just that it’s theoretically sufficient to have one layer; in practice it’s efficient to have several layers. But in the end, it’s not really different in any deep theoretical way from polynomial regression or Fourier series or anything else you might try to represent a function.
    - Anoneuoid on May 21, 2019 at 7:07 pm said:
      
      The interesting part to me is how obvious it is to say “sure” now, in contrast to being able to see what the experts were missing just a few years ago.
      
      I think if researchers delineated all assumptions they were making, then reviewers (inevitably) pointed out more, it would be great.
Ethan Steinberg on May 21, 2019 at 10:36 am said:

It truthfully seems like neural networks are simply not the best tool for tabular datasets.

One very interesting paper demonstrating that (perhaps unintentionally!) was https://www.nature.com/articles/s41746-018-0029-1 which came out last year from Google. That paper is focused on comparing the performance of different ML models for predicting various health outcomes from electronic health records.

If you go look in their supplement, their logistic regression models are within the margin of error of their much more complicated and sophisticated neural network ensembles. At best their fancier models only gain 0.01 or 0.02 AUROC over the much simpler baselines.

(The most annoying part of that paper is rather than celebrate the fact that logistic regression works so well, they hide those results in the supplement and don’t even mention them in the main text.)

- Bob Carpenter on May 21, 2019 at 12:46 pm said:
  
  This is worth blogging about separately. Here’s their table of results:
  
  Those baselines are logistic regressions. I’d like to see how sensitive their results were to the dozens of hyperparameters they tuned.
  
  - Nick Adams on May 21, 2019 at 5:26 pm said:
    
    Never mind logistic regression, medical staff are better at predicting inpatient mortality using only their pre-existing organic neural net: AUROC up to 0.9.
    
    PLoS One. 2014; 9(7): e101739
    
Sergio Garrido on May 21, 2019 at 11:46 am said:

I agree with Prof. Gelman in the fact that neural networks are also statistical models and the same with ML, DL and AI (DL is a subset of ML and ML is a subset of AI). The important thing, however, is to note that you need some criteria for your comparison. Do you want to compare them based on accuracy on a test set? interpretability? fairness of the outcome?

If you don’t have a specific point where you would like to compare them you probably need a couple of books to do all possible comparisons. If you actually are looking for a book where you get a lot of different flavors of machine learning (including neural networks) and “classic” statistical models you can try:

– Bishop’s Pattern recognition and machine learning (2006) you can find it free in: https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/

Now, with respect to data from surveys with neural networks, I would give the same answer. What do you need the model for? Is it to do prediction? To estimate latent variables? It all depends. Depends on the goal, depends on the amount of data, depends on how versed you are in both “classic” statistical methods and machine learning.

- Sergio Garrido on May 21, 2019 at 12:01 pm said:
  
  I wrote the first sentence very poorly. What I meant is that ML, DL and AI are not separate things and neural networks don’t include them. On the contrary, NNs are a subset of all of them, and they are subsets of each other.
  
Charles Fisher on May 21, 2019 at 12:23 pm said:

We did a fairly comprehensive comparison between neural networks, linear models, and other approaches for making predictions about phenotypes from RNA-sequencing data here: https://www.biorxiv.org/content/10.1101/574723v1.abstract. We found that — averaged across the 50 or so prediction tasks we looked at — everything performed about the same with L2-regularized regression slightly leading on average rank.

There are a couple of theories about why deep neural networks perform well on some problems. For example, in object recognition problems you are trying to create a classifier that is invariant to various transformations of the object in the image: e.g., rotation, translation, scaling, etc. Stephane Mallat has done a lot of work showing that if you use features that are designed to be invariant to these transformations already (rather than just the pixels) then you can get the same performance as a deep neural network with a regression.

I think of it like this. The data live on some curved manifold. Then you add noise to the manifold. If the magnitude of the curvature of the manifold is larger than the noise, then the noisy manifold will still look curved and a neural network will work better than, say, a regression. If the magnitude of the curvature of the manifold is smaller than the noise, then the noisy manifold will look flat and a regression will do just fine.
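
A toy simulation can illustrate the curvature-versus-noise picture. This is only a sketch under strong simplifying assumptions: the “manifold” is a one-dimensional curve `curvature * x^2`, a quadratic fit stands in for the flexible model (it is not a neural network), and a straight-line fit stands in for the regression. The function names and settings are illustrative.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (fine for tiny problems)."""
    k = degree + 1
    # Augmented matrix [X'X | X'y] for the basis 1, x, ..., x^degree.
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def predict(coefs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coefs))

def holdout_mse(curvature, noise_sd, degree, seed=0, n_train=200, n_test=2000):
    """Held-out mean squared error when the truth is curvature * x^2 plus Gaussian noise."""
    rng = random.Random(seed)

    def draw(n):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        ys = [curvature * x * x + rng.gauss(0.0, noise_sd) for x in xs]
        return xs, ys

    train_x, train_y = draw(n_train)
    test_x, test_y = draw(n_test)
    coefs = fit_polynomial(train_x, train_y, degree)
    return sum((y - predict(coefs, x)) ** 2 for x, y in zip(test_x, test_y)) / n_test

# Same seed for both fits, so they see identical training and test data.
for curvature, noise_sd in [(5.0, 0.1), (0.05, 1.0)]:
    linear = holdout_mse(curvature, noise_sd, degree=1)
    flexible = holdout_mse(curvature, noise_sd, degree=2)
    print(f"curvature={curvature}, noise={noise_sd}: linear={linear:.3f}, flexible={flexible:.3f}")
```

When the curvature dominates the noise, the straight line has a much larger held-out error than the flexible fit; when the noise dominates, the two are nearly indistinguishable.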

Bob Carpenter on May 21, 2019 at 12:35 pm said:

As a recovering semanticist, I think I can help with terminology.

1. Artificial intelligence is an application. It just means a machine doing something we typically think of people as doing. Like correcting spelling mistakes or driving cars or playing Settlers of Catan. It’s not a mathematical technique. We can build AIs with heuristics or we can build them with statistics or we can build them with both.

2. Machine learning is a broad class of techniques, not all of which are probabilistic. For example, support-vector machines and greedy agglomerative clustering algorithms are not probabilistic. Machine learning is currently the most popular way to build artificial intelligence applications.

3. Neural networks are a kind of statistical model that currently dominates research in machine learning and is thus currently the go-to method for developing artificial intelligence applications. Deep neural nets, by which people mean nets with more than one hidden layer, are a form of neural network. Deep nets are computationally intractable for traditional statistical inference due to both multimodality and the scale of the likelihood function. To cope, machine learning researchers have layered heuristics such as autoencoders and early stopping on top of standard estimation techniques. And because the form of the likelihood matters and generic networks don’t work, they include specialized structures like convolutional layers to deal with image transposition and rotation. That is, they don’t learn to recognize cats just from looking at cat pictures; a lot of heuristic knowledge about vision has been encoded in the likelihood architecture.

As David MacKay explains in his info theory book, logistic regression is a simple neural network with N inputs, one output, and no hidden layers (he called it “classification with one neuron” rather than logistic regression). With appropriate link functions, neural networks can be used as generalized linear models. Viewed another way, they’re stacked logistic regressions, which is where the non-linearity comes from.

4. Gaussian processes are a kind of statistical model, albeit a computationally intractable one at scale due to the requirement to solve linear systems involving matrices whose dimension is given by the number of data points. Like neural networks, GPs can represent arbitrary multidimensional functions given enough data (subject to conditions imposed by priors like smoothness and by the choice of covariance function). As in the generalized linear model case and in the neural network case, we can throw logit link functions on Gaussian processes and use them for binary or categorical data. Like neural networks, their architecture is very general and there is a lot of heuristic/subjective knowledge going into the choice of covariance function.

Radford Neal showed in his thesis (which also introduced HMC to the stats world!) that Gaussian processes are the limit of a single hidden-layer neural network as the number of hidden nodes goes to infinity.

5. Panel data and time series are just forms of data. They can be handled by any kind of approach.

6. Hierarchical modeling is a technique for partial pooling, aka modeling population effects. It pulls estimates for individuals or groups toward the estimate for the overall population of which they are a member. Machine learning researchers, including in neural networks, tend to only use shrinkage of estimates to zero (to avoid overfitting rather than to regularize to population estimates). They also tend to use fixed effects for populations rather than hierarchical modeling. Where you see hierarchical modeling in machine learning is in what they call domain adaptation, such as building a sentiment classifier for reviews for different genres of movies (dramas vs. comedies, for example) or products (shoes vs. refrigerators).

7. Bayesian modeling is an approach to using prior data and performing inference to propagate uncertainty. The machine learning community tends to call any technique that applies Bayes’s rule “Bayesian” (e.g., naive Bayes classifiers, which are almost never Bayesian in either modeling or inference). ML researchers also use the term “prior” very broadly to include prevalence of categories in a model (such as naive Bayes). They also tend to use “Bayesian” to describe any system with a prior on parameters, even if it’s essentially being fit with penalized maximum likelihood. Statisticians tend to reserve the term “Bayesian” for full Bayesian inference, where we average our predictions over our estimation uncertainty when performing posterior predictive inference.
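
The remark under point 3, that logistic regression is a network with no hidden layers and that a neural network is a stack of logistic regressions, can be sketched in a few lines. The weights below are picked by hand for illustration rather than estimated, and the function names are illustrative.

```python
import math

def sigmoid(z):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, weights, bias):
    """One 'neuron': a logistic regression prediction, sigmoid(bias + weights . x)."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def one_hidden_layer(x, hidden, output):
    """Stacked logistic regressions: hidden neurons feed a single output neuron.

    hidden is a list of (weights, bias) pairs; output is a single (weights, bias) pair.
    """
    h = [neuron(x, w, b) for w, b in hidden]
    return neuron(h, *output)

# Hand-picked (not estimated) weights chosen so the network computes XOR,
# which no single logistic regression on (x1, x2) can represent.
hidden = [([20.0, 20.0], -10.0),    # roughly "x1 OR x2"
          ([-20.0, -20.0], 30.0)]   # roughly "NOT (x1 AND x2)"
output = ([20.0, 20.0], -30.0)      # roughly "h1 AND h2"

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(one_hidden_layer(x, hidden, output), 3))
```

Each piece is an ordinary logistic regression; the nonlinearity in the overall mapping comes only from feeding the outputs of one layer into the next.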

- Daniel Lakeland on May 21, 2019 at 12:49 pm said:
  
  Thanks Bob, very good addition to the discussion, will keep people who have less background info from getting confused.
  
- Tom M on May 21, 2019 at 12:52 pm said:
  
  This is very helpful! It’s good to have a semanticist in the house, whether recovering or relapsing.
  
- Al on May 21, 2019 at 10:05 pm said:
  
  Very useful!
  
- Kevin Van Horn on May 22, 2019 at 11:36 am said:
  
  “Radford Neal showed in his thesis (which also introduced HMC to the stats world!)”
  
  I wish we saw more of the kind of technology transfer from one field to another that Neal did there. I get the impression that the physics community has all kinds of wonderful mathematical tools that could be usefully applied in other disciplines, if only the right people knew about them.
  
Keith O'Rourke on May 21, 2019 at 12:45 pm said:

A better contrast might be interpretable versus explainable [black box] models (with Bayesian analyses done with inadequate workflow to understand them being in the explainable models category).

As an update on https://statmodeling.stat.columbia.edu/2018/10/30/explainable-ml-versus-interpretable-ml/ see this paper: “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead” by Cynthia Rudin, https://www.nature.com/articles/s42256-019-0048-x.epdf?author_access_token=SU_TpOb-H5d3uy5KF-dedtRgN0jAjWel9jnR3ZoTv0M3t8uDwhDckroSbUOOygdba5KNHQMo_Ji2D1_SdDjVr6hjgxJXc-7jt5FQZuPTQKIAkZsBoTI4uqjwnzbltD01Z8QwhwKsbvwh-z1xL8bAcg%3D%3D

In particular: “Here is the Rashomon set argument: consider that the data permit a large set of reasonably accurate predictive models to exist. … Unpacking this argument slightly, for a given data set, we define the Rashomon set as the set of reasonably accurate predictive models (say within a given accuracy from the best model accuracy of boosted decision trees). Because the data are finite, the data could admit many close-to-optimal models that predict differently from each other: a large Rashomon set. I suspect this happens often in practice because sometimes many different ML algorithms perform similarly on the same data set, despite having different functional forms (for example, random forests, neural networks, support vector machines).”

Ricardo on May 21, 2019 at 12:55 pm said:

David MacKay had a lecture that touches this topic: https://www.youtube.com/watch?v=Z1pcTxvCOgw

Kevin Van Horn on May 21, 2019 at 1:12 pm said:

My issue with neural networks is that they focus on point predictions, so it is difficult to get the predictive *distributions* you need for decision analysis. Yes, you can often modify the objective function of an NN model to get predictive distributions instead of point predictions, but this is rarely done. Even if you do create a model that produces predictive distributions, good luck on incorporating parameter uncertainty.
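
One common way to modify the objective is to have the network output a mean and a log standard deviation and score them with a Gaussian negative log-likelihood in place of squared error. The sketch below shows only the two objectives and how a predicted distribution yields an upper bound; the Gaussian form, the function names, and the numbers are assumptions for illustration, and no network is actually trained.

```python
import math
from statistics import NormalDist

def squared_error(y, mu):
    """Point-prediction objective: the network outputs only mu."""
    return (y - mu) ** 2

def gaussian_nll(y, mu, log_sigma):
    """Distributional objective: the network outputs mu and log_sigma and is
    scored by the negative log density of Normal(mu, sigma) at y."""
    sigma = math.exp(log_sigma)
    return 0.5 * math.log(2.0 * math.pi) + log_sigma + 0.5 * ((y - mu) / sigma) ** 2

def upper_bound(mu, log_sigma, alpha):
    """Value x with Pr(y > x) = alpha under the predicted Normal(mu, sigma)."""
    return NormalDist(mu, math.exp(log_sigma)).inv_cdf(1.0 - alpha)

# Hypothetical network outputs for one case.
mu, log_sigma = 100.0, math.log(15.0)
print("squared error at y=110:", squared_error(110.0, mu))
print("Gaussian NLL at y=110:", gaussian_nll(110.0, mu, log_sigma))
print("1% upper bound:", upper_bound(mu, log_sigma, alpha=0.01))
```

This captures only the noise in the outcome given fixed weights. Uncertainty about the weights themselves is not represented, which is the harder part.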

- Daniel Lakeland on May 21, 2019 at 1:35 pm said:
  
  In principle, one could describe a model for data using a neural network, whose parameters had prior distributions, and with a probability distribution over the errors given by a Bayesian model, and wind up with a Bayesian posterior over the parameters of a neural network model, but this is rarely done. One reason is that it’s virtually impossible for a normal human to express useful priors over neural network parameters because “what they do” is very opaque and there are generally going to be lots of inter-dependencies. Another reason is it’s hard to fit high dimensional models in general, and it requires tons of computing time.
  
  - Kevin Van Horn on May 22, 2019 at 11:23 am said:
    
    Not to mention the massive multimodality that makes sampling the posterior problematic.
    
- Corey on May 21, 2019 at 1:51 pm said:
  
  Kevin, you may be interested in this (although it’s 4 years old now): http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html
  
  It shows how a certain form of drop-out (a regularization technique that involves removing edges at random during training) can be viewed as a Monte Carlo version of a Bayesian variational approximation; the upshot is that you can do drop-out during prediction to approximate sampling from the posterior distribution.
  
  - Kevin Van Horn on May 22, 2019 at 11:31 am said:
    
    Thanks for the reference. I still have doubts about the usefulness of even an optimal variational approximation to the posterior in this case. If I understand correctly, your typical variational approximation for a high-dimensional parameter space is going to approximate the posterior as an axis-parallel multivariate normal (i.e. independent normals for each parameter), and from my understanding of deep neural nets I expect this to be an exceedingly bad approximation — they have not just massive multimodality, but also very strong posterior correlation between parameters.
    
    I would be delighted to find that I am mistaken here…
    
    - Corey on May 22, 2019 at 3:19 pm said:
      
      This particular variational approximation involves Bernoulli distributions in a way I am no longer clear on (if I ever was).
      
      I think with these sorts of models the thing to be worried about is posterior predictive performance and in particular, understating the uncertainty of the prediction. My understanding is that KL-loss-based variational approximations are quite prone to this problem and also that it does show up in the DNN context.
- sam on May 22, 2019 at 3:57 pm said:
  
  Hey Kevin, Kind of curious as to what decisions you’re making. I can see if you have some sort of non-linear utility function you’re maximizing you might need a posterior for each observation. But where does that happen?
  
  - Daniel Lakeland on May 22, 2019 at 5:03 pm said:
    
    I’d also be curious what decision analysis Kevin is involved in, but really it’s easy to come up with nonlinear utilities. Like imagine you have an investment of x dollars now for a series of uncertain payouts with uncertain number of dollars at uncertain times in the future. The utility will be nonlinear exp(-r*t) in the time at which the payouts occur…
    
    Or imagine there’s a regulatory situation where pollution below some amount is allowed but above some amount there are nonlinear fines, and you need to optimize your maintenance schedule on your equipment…
    
    or whatever. it’s easy to come up with nonlinear utility on uncertain parameters.
    
    - sam on May 23, 2019 at 11:57 am said:
      
      Daniel, I agree it’s easy to come up with conceptual problems where you’d have nonlinear utilities. But in order to maximize that you’d have to have a concrete mathematical equation for your utility. I’m interested in how one would actually come up with that equation. For most instances I think that’s challenging.
      
      Also, spitballing here, what if you just model utility(y)? Would you need uncertainty parameters then?
  - Kevin Van Horn on May 23, 2019 at 3:14 pm said:
    
    I don’t know what “posterior for each observation” means; do you mean predictive distribution?
    
    Anyway, here’s one example where predictive distributions are important. When you’re doing capacity planning, then getting a point estimate of the load at a future date is of limited use; what you really want is a high-confidence upper bound. That is, you might want to find x such that
    
    Pr(load > x) = alpha
    
    for some small alpha (0.01, 0.001). More commonly, you’ll want to be able to handle the max load over some period: find x such that
    
    Pr(max_i load_i > x) = alpha.
    
    You would prefer not to have to choose alpha up front, as that may depend on executive decisions that haven’t been made yet and on costs and priorities that may not be known yet or may change. (Those costs include both the cost of providing resources that may not be used, and the negative consequences of not having enough capacity to meet the actual load.) You also may not want to set in stone the width of the period of interest in the second case above. So you really do need a predictive distribution.
    
    As another example, I’m doing some work where we’re building Monte Carlo models in support of complex decisions. The result of a regression analysis might be only one piece of a larger model, with its outcome variable being only an intermediate step in computing the final net benefit of a decision.
    
    Finally, nonlinear utility functions aren’t the issue; outputting the predictive mean is an optimal action only if (1) your action consists solely of outputting a prediction, rather than using that prediction to inform some other decision (which might have only a finite number of possible alternatives), and (2) your utility function (or loss function) is *quadratic*.
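
    The role of quadratic loss in point (2) can be written out as a short derivation. The symbols here are notation introduced for illustration: $y$ is the uncertain outcome and $a$ is the value reported as the prediction. Expanding around the predictive mean, the cross term vanishes and the expected loss splits into two parts:

    ```latex
    \mathrm{E}\left[(y - a)^2\right]
      = \mathrm{E}\left[(y - \mathrm{E}[y])^2\right] + \left(\mathrm{E}[y] - a\right)^2
      = \mathrm{Var}(y) + \left(\mathrm{E}[y] - a\right)^2
    ```

    The first term does not depend on $a$, so the expected loss is minimized at $a = \mathrm{E}[y]$, the predictive mean. Under absolute-error loss $\mathrm{E}\,|y - a|$ the minimizer is the predictive median instead, so a different loss generally calls for a different summary of the predictive distribution.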
    
    - sam on May 28, 2019 at 11:20 am said:
      
      ‘I don’t know what “posterior for each observation” means; do you mean predictive distribution?’
      
      Yeah, I suppose I’m not clear on terminology, but for concreteness let’s say you have a model y = b + e where b is some parameter and e is some known error. There are two types of noise (as you implied with your original post): the data-generation noise e and the parameter uncertainty p(b). For a predictive distribution I think one would have to take both into account. Some sort of Bayesian estimation (like taking dropout into account) can help with parameter uncertainty and some sort of mixture density can help with DGP noise.
Ron Kenett on May 21, 2019 at 3:23 pm said:

Another interesting thread……thank you for making this happen.

Comparing models for analysing survey data is a great topic. We had a project on this at the University of Turin and Milan, a while ago. Two contributions from this:

1. We edited in 2011 a book published by Wiley titled Modern Analysis of Customer Surveys: with Applications using R, https://www.wiley.com/en-us/Modern+Analysis+of+Customer+Surveys%3A+with+Applications+using+R-p-9781119961383
Chapter 10 is on Statistical inference for causal effects by Fabrizia Mealli, Barbara Pacini and Donald B. Rubin
The following 11 chapters analyse the same dataset (the ABC data that can be downloaded from the book’s website). These are:
Bayesian Networks (11, Kenett, Salini)
Log Linear Models (12, Fienberg, Mandrique)
CUB Models (13, Piccolo, Iannario)
The Rasch Model (14, De Battisti, Nicolini, Salini)
Tree-based Methods and Decision Trees (15, Soffritti, Galimberti)
PLS Models (16, Boari, Cantaluppi)
Nonlinear PCA (17, Ferrari, Barbero)
Multidimensional Scaling (18, Solaro)
Multilevel Models for Ordinal Data (19, Rampichini, Grilli)
Control Charts Applications (20, Kenett, Deldossi, Zappa)
Fuzzy Methods (21, Zani, Morlini, Milioli)

This provides a unique opportunity to compare what you get from different models applied to the same dataset.

A paper that proposes to combine models in order to enhance information quality generated by analysis was also published in ASMBI: https://onlinelibrary.wiley.com/doi/abs/10.1002/asmb.927

Eliot J on May 21, 2019 at 9:35 pm said:

My brief review of the literature neglected to mention that one of the first contributions with an explicit comparison of the two communities (stats vs ML) was Breiman’s. In 1999 he introduced CART random forests by comparing the predictive accuracy of ~1,000 RFs with a single iteration of logistic regression, concluding that the ensemble predictions were more accurate than LR.

https://www.stat.berkeley.edu/~breiman/random-forests.pdf

Thanks to Andrew for posting this query.

- Keith O'Rourke on May 22, 2019 at 7:33 am said:
  
  Eliot:
  
  Did you also miss – Stat Med. 1998 Nov 15;17(21):2501-8. A comparison of statistical learning methods on the Gusto database. Ennis M, Hinton G, Naylor D, Revow M, Tibshirani R.
  
  Abstract
  We apply a battery of modern, adaptive non-linear learning methods to a large real database of cardiac patient data. We use each method to predict 30 day mortality from a large number of potential risk factors, and we compare their performances. We find that none of the methods could outperform a relatively simple logistic regression model previously developed for this problem.
  
Bob on May 22, 2019 at 4:35 pm said:

Spyros Makridakis ran the M4 competition comparing models for time series prediction. I haven’t studied it in detail but he reckons the ML models underperformed statistical time series models.

https://www.sciencedirect.com/science/article/pii/S0169207018300785

Bob on May 22, 2019 at 4:37 pm said:

Oh yeah, it’s in the original blog post.

Warren S Sarle on May 27, 2019 at 11:18 pm said:

Neural Network and Statistical Jargon
=====================================

Warren S. Sarle Apr 29, 1996

The neural network (NN) and statistical literatures contain many of the
same concepts but usually with different terminology. Sometimes the same
term or acronym is used in both literatures but with different meanings.
Only in very rare cases is the same term used with the same meaning,
although some cross-fertilization is beginning to happen. Below is a
list of such corresponding terms or definitions.

Particularly loose correspondences are marked by a ~ between the two
columns. A < indicates that the term on the left is roughly a subset of
the term on the right, and a > indicates the reverse. Terminology in
both fields is often vague, so precise equivalences are not always
possible. The list starts with some basic definitions.

There is disagreement in the NN literature on how to count layers. Some
people count inputs as a layer and some don’t. I specify the number of
hidden layers instead. This is awkward but unambiguous.

Definition                      Statistical Jargon
==========                      ==================

generalizing from noisy data    Statistical inference
and assessment of the
accuracy thereof

the set of all cases one        Population
wants to be able to
generalize to

a function of the values in     Parameter
a population, such as the
mean or a globally optimal
synaptic weight

a function of the values in     Statistic
a sample, such as the mean
or a learned synaptic weight

Neural Network Jargon                 Definition
=====================                 ==========

Neuron, neurode, unit,                a simple linear or nonlinear computing
node, processing element              element that accepts one or more inputs,
                                      computes a function thereof, and may
                                      direct the result to one or more other
                                      neurons

Neural networks                       a class of flexible nonlinear regression
                                      and discriminant models, data reduction
                                      models, and nonlinear dynamical systems
                                      consisting of an often large number of
                                      neurons interconnected in often complex
                                      ways and often organized into layers

Neural Network Jargon                 Statistical Jargon
=====================                 ==================

Statistical methods                   Linear regression and discriminant
                                      analysis, simulated annealing, random
                                      search

Architecture                          Model

Training, Learning,                   Estimation, Model fitting, Optimization
Adaptation

Classification                        Discriminant analysis

Mapping, Function                     Regression
approximation

Supervised learning                   Regression, Discriminant analysis

Unsupervised learning,                Principal components, Cluster analysis,
Self-organization                     Data reduction

Competitive learning                  Cluster analysis

Hebbian learning,                     Principal components
Cottrell/Munro/Zipser
technique

Training set                          Sample, Construction sample

Test set, Validation set              Hold-out sample

Pattern, Vector, Example,             Observation, Case
Sample, Case

Reflectance pattern                   an observation normalized to sum to 1

Binary(0/1),                          Binary, Dichotomous
Bivalent or Bipolar(-1/1)

Input                                 Independent variables, Predictors,
                                      Regressors, Explanatory variables,
                                      Carriers

Output                                Predicted values

Forward propagation                   Prediction

Training values                       Dependent variables, Responses,
Target values                         Observed values

Training pair                         Observation containing both inputs
                                      and target values

Shift register,                       Lagged variable
(Tapped) (time) delay (line),
Input window

Errors                                Residuals

Noise                                 Error term

Generalization                        Interpolation, Extrapolation,
                                      Prediction

Error bars                            Confidence interval

Prediction                            Forecasting

Adaline                               Linear two-group discriminant analysis
(ADAptive LInear NEuron)              (not Fisher’s but generic)

(No-hidden-layer) perceptron        ~ Generalized linear model (GLIM)

Activation function,                > Inverse link function in GLIM
Signal function,
Transfer function

Softmax                               Multiple logistic function

Squashing function                    bounded function with infinite domain

Semilinear function                   differentiable nondecreasing function

Phi-machine                           Linear model

Linear 1-hidden-layer                 Maximum redundancy analysis, Principal
perceptron                            components of instrumental variables

1-hidden-layer perceptron           ~ Projection pursuit regression

Weights,                              [...]

[...]                                 Shrinkage estimation, Ridge regression

Jitter                                random noise added to the inputs to
                                      smooth the estimates

Growing, Pruning, Brain               Subset selection, Model selection,
damage, Self-structuring,             Pre-test estimation
Ontogeny

Optimal brain surgeon                 Wald test

LMS (Least mean squares)              OLS (Ordinary least squares)
                                      (see also “LMS rule” above)

Relative entropy, Cross               Kullback-Leibler divergence
entropy

Evidence framework                    Empirical Bayes estimation

OLS (Orthogonal least squares)        Forward stepwise regression

Probabilistic neural network          Kernel discriminant analysis

General regression neural             Kernel regression
network

Topologically distributed           < (Generalized) Additive model
encoding

Adaptive vector quantization          iterative algorithms of doubtful
                                      convergence for K-means cluster analysis

Adaptive Resonance Theory 2a        ~ Hartigan's leader algorithm

Learning vector quantization          a form of piecewise linear discriminant
                                      analysis using a preliminary cluster
                                      analysis

Counterpropagation                    Regressogram based on k-means clusters

Encoding, Autoassociation             Dimensionality reduction
                                      (Independent and dependent variables
                                      are the same)

Heteroassociation                     Regression, Discriminant analysis
                                      (Independent and dependent variables
                                      are different)

Epoch                                 Iteration

Continuous training,                  Iteratively updating estimates one
Incremental training,                 observation at a time via difference
On-line training,                     equations, as in stochastic approximation
Instantaneous training

Batch training,                       Iteratively updating estimates after
Off-line training                     each complete pass over the data as in
                                      most nonlinear regression algorithms
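
A few of the correspondences above (LMS vs. OLS, epoch vs. iteration, on-line vs. batch training) can be made concrete with a small Python sketch. This is not part of the original list: the toy data, the step sizes, and the function names `ols_fit`, `online_fit`, and `batch_fit` are all assumptions chosen for illustration. Under those assumptions, both "training" loops should end up close to the ordinary least squares estimates for a straight-line model.

```python
# Toy data (made up for illustration): y is roughly 2 + 3x plus small fixed "noise".
xs = [i / 7 for i in range(8)]
noise = [0.2, -0.1, 0.1, -0.2, 0.15, -0.15, 0.05, -0.05]
ys = [2 + 3 * x + e for x, e in zip(xs, noise)]


def ols_fit(xs, ys):
    """Closed-form ordinary least squares for the model y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def online_fit(xs, ys, rate=0.002, epochs=30000):
    """On-line training: update (a, b) after every single observation."""
    a = b = 0.0
    for _ in range(epochs):            # one epoch = one pass over the data
        for x, y in zip(xs, ys):
            err = y - (a + b * x)      # "error" in NN jargon, residual in statistics
            a += rate * err            # Widrow-Hoff (LMS) style update
            b += rate * err * x
    return a, b


def batch_fit(xs, ys, rate=0.5, epochs=2000):
    """Batch training: update (a, b) once per complete pass over the data."""
    a = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        errs = [y - (a + b * x) for x, y in zip(xs, ys)]
        a += rate * sum(errs) / n
        b += rate * sum(e * x for e, x in zip(errs, xs)) / n
    return a, b


ols = ols_fit(xs, ys)
online = online_fit(xs, ys)
batch = batch_fit(xs, ys)
print("OLS (closed form):", ols)
print("On-line training: ", online)
print("Batch training:   ", batch)
```

With a constant step size, the on-line version only settles near the least squares solution rather than exactly on it; a decreasing step size, as in stochastic approximation, would be the usual way to tighten that.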

- Warren S Sarle on May 27, 2019 at 11:23 pm said:
  
  Well, I don’t see how to edit a comment, and HTML’s reduction of multiple blanks to a single blank makes my previous comment unintelligible.
  
  - Bill Spight on May 28, 2019 at 1:00 pm said:
    
    When I have had that problem, I used periods instead of blanks for spacing.
    

Statistical Modeling, Causal Inference, and Social Science
