Neural nets vs. regression models

Eliot Johnson writes:

I have a question concerning papers comparing two broad domains of modeling: neural nets and statistical models. Both terms are catch-alls, within each of which there are, quite obviously, multiple subdomains. For instance, NNs could include ML, DL, AI, and so on. While statistical models should include panel data, time series, hierarchical Bayesian models, and more.

I’m aware of two papers that explicitly compare these two broad domains:

(1) Sirignano, et al., Deep Learning for Mortgage Risk,

(2) Makridakis, et al., Statistical and Machine Learning forecasting methods: Concerns and ways forward

But there must be more than just these two examples. Are there others that you are aware of? Do you think a post on your blog would be useful? If so, I’m sure you can think of better ways to phrase or express my “two broad domains.”

My reply:

I don’t actually know.

Back in 1994 or so I remember talking with Radford Neal about the neural net models in his Ph.D. thesis and asking if he could try them out on analysis of data from sample surveys. The idea was that we have two sorts of models: multilevel logistic regression and Gaussian processes. Both models can use the same predictors (characteristics of survey respondents such as sex, ethnicity, age, and state), and both have the structure that similar respondents have similar predicted outcomes—but the two models have different mathematical structures. The regression model works with a linear predictor from all these factors, whereas the Gaussian process model uses an unnormalized probability density—a prior distribution—that encourages people with similar predictors to have similar outcomes.

My guess is that the two models would do about the same, following the general principle that the most important thing about a statistical procedure is not what you do with the data, but what data you use. In either case, though, some thought might need to go into the modeling. For example, you’ll want to include state-level predictors. As we’ve discussed before, when your data are sparse, multilevel regression works much better if you have good group-level predictors, and some of the examples where it appears that MRP performs poorly are examples where people are not using available group-level information.

Anyway, to continue with the question above, asking about neural nets and statistical models: Actually, neural nets are a special case of statistical models, typically Bayesian hierarchical logistic regression with latent parameters. But neural nets are typically estimated in a different way: the resulting posterior distributions will generally be multimodal, so rather than try the hopeless task of traversing the whole posterior distribution, we’ll use various approximate methods, which then are evaluated using predictive accuracy.

By the way, Radford’s answer to my question back in 1994 was that he was too busy to try fitting his models to my data. And I guess I was too busy too, because I didn’t try it either! More recently, I asked a computer scientist and he said he thought the datasets I was working with were too small for his methods to be very useful. More generally, though, I like the idea of RPP, also the idea of using stacking to combine Bayesian inferences from different fitted models.


  1. zbicyclist says:

    Thanks for the Makridakis link; I hadn’t seen that. This passage may seem familiar to readers of this blog:

    “The motivation for writing this paper was an article [18] published in Neural Networks in June 2017. The aim of the article was to improve the forecasting accuracy of stock price fluctuations and claimed that “the empirical results show that the proposed model indeed display a good performance in forecasting stock market fluctuations”.

    “In our view, the results seemed extremely accurate for stock market series that are essentially close to random walks so we wanted to replicate the results of the article and emailed the corresponding author asking for information to be able to do so. We got no answer and we, therefore, emailed the Editor-in-Chief of the Journal asking for his help. He suggested contacting the other author to get the required information. We consequently, emailed this author but we never got a reply. Not being able to replicate the result of [18] and not finding research studies comparing ML methods with alternative ones we decided to start the research leading to this paper.”

    Reference 18 is Wang J, Wang J. Forecasting stochastic neural network based on financial empirical mode decomposition. Neural Networks. 2017;90:8–20. pmid:28364677

  2. Tom Passin says:

    Bart Kosko wrote in one or another of his books that neural nets are basically universal approximators. They can approximate any function. So they can be subject to any of the ills of other approximating systems, like overfitting, inappropriate fitting criteria, lack of orthogonality of inputs or internal variables, unpredictable results when extrapolating outside the range of the training data, etc.

    I wouldn’t be surprised if one could cast a given statistical procedure as a network of nodes and edges with specified or derived weights. Intermediate values like sums of squares would act like hidden nodes. That would make it essentially equivalent to a neural network.

    • Carlos Ungil says:

      > For instance, NNs could include ML, DL, AI, and so on.

      Assuming ML and AI stand for machine learning and artificial intelligence, one could say that they include NN as a subdomain. But the other way around it makes less sense.

    • Yes, Neural Nets are universal function approximators, at least on bounded subsets of R^N which is every actual applied problem. Machine Learning/AI/Deep Learning stuff is at its core (as far as I can tell) the method of using various sophisticated usually stochastic optimization techniques to fit (mostly) Neural Net function approximators using somewhat sophisticated but generic loss functions especially based on hold-out data to avoid overfitting.

      The NN method basically makes the tradeoff of high dimensional parameter space vs domain knowledge by choosing high dimensional parameter space and then coping with that using sophisticated optimization techniques… it’s especially useful when we don’t have much domain knowledge.


        roughly a “compact” subset is a generalization of a closed and bounded subset. Closed means it contains its limit points (boundary points) and bounded means it doesn’t go off to infinity. So yes a single layer of neurons can approximate any function you’d need in a typical applied problem by adjusting the large number of knobs (parameters describing the weights). So the problem becomes “how can we choose the settings on the knobs to do a “good job””
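The "knobs" picture above can be made concrete with a small sketch. This is not how networks are fit in practice: the weights are set by hand rather than learned, ReLU units are an assumption made here for convenience, and the function names are hypothetical. It only illustrates that a single hidden layer can track a smooth function on a bounded interval more closely as units are added.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, lo, hi, n_hidden):
    """Hand-set a one-hidden-layer ReLU net that interpolates f at
    n_hidden + 1 equally spaced points on [lo, hi]."""
    h = (hi - lo) / n_hidden
    knots = [lo + k * h for k in range(n_hidden)]          # one knot per hidden unit
    values = [f(lo + k * h) for k in range(n_hidden + 1)]
    slopes = [(values[k + 1] - values[k]) / h for k in range(n_hidden)]
    # Output weight of unit k is the change in slope at its knot.
    out_weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    return values[0], knots, out_weights

def predict(net, x):
    bias, knots, out_weights = net
    return bias + sum(w * relu(x - t) for w, t in zip(out_weights, knots))

def max_error(f, lo, hi, n_hidden, n_grid=1000):
    """Largest absolute error of the net against f on a fine grid."""
    net = build_relu_net(f, lo, hi, n_hidden)
    xs = [lo + (hi - lo) * i / n_grid for i in range(n_grid + 1)]
    return max(abs(predict(net, x) - f(x)) for x in xs)

for n_hidden in (4, 16, 64):
    print(n_hidden, max_error(math.sin, 0.0, 2.0 * math.pi, n_hidden))
```

The printed error shrinks as the number of hidden units grows, which is the bounded-domain approximation property in miniature. It says nothing about how hard it is to find good weights from noisy data.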

        • Anoneuoid says:

          The thing is that a single layer only does that asymptotically. You can still read stack overflow posts with people parroting that as a reason never to try deep learning up to a few years ago.

          What is true asymptotically may not be at all correct for someone with finite resources and time.

          • Sure, it’s basically just that it’s theoretically sufficient to have one layer, in practice it’s efficient to have several layers. But in the end, it’s not really different in any deep theoretical way from polynomial regression or fourier series or anything else you might try to represent a function.

            • Anoneuoid says:

              The interesting part to me is how obvious it is to say “sure” now, in contrast to being able to see what the experts were missing just a few years ago.

              I think if researchers delineated all assumptions they were making, then reviewers (inevitably) pointed out more, it would be great.

  3. Ethan Steinberg says:

    It truthfully seems like neural networks are simply not the most optimal tool for tabular datasets.

    One very interesting paper demonstrating that (perhaps unintentionally!) came out last year from Google. That paper is focused on comparing the performance of different ML models for predicting various health outcomes from electronic health records.

    If you go look in their supplement, their logistic regression models are within the margin of error of their much more complicated and sophisticated neural network ensembles. At best their fancier models only gain 0.01 or 0.02 AUROC over the much simpler baselines.

    (The most annoying part of that paper is rather than celebrate the fact that logistic regression works so well, they hide those results in the supplement and don’t even mention them in the main text.)

    • This is worth blogging about separately. Here’s their table of results:

      Those baselines are logistic regressions. I’d like to see how sensitive their results were to the dozens of hyperparameters they tuned.

      • Nick Adams says:

        Never mind logistic regression, medical staff are better at predicting inpatient mortality using only their pre-existing organic neural net: AUROC up to 0.9.

        PLoS One. 2014; 9(7): e101739

  4. I agree with Prof. Gelman in the fact that neural networks are also statistical models and the same with ML, DL and AI (DL is a subset of ML and ML is a subset of AI). The important thing, however, is to note that you need some criteria for your comparison. Do you want to compare them based on accuracy on a test set? interpretability? fairness of the outcome?

    If you don’t have a specific point where you would like to compare them you probably need a couple of books to do all possible comparisons. If you actually are looking for a book where you get a lot of different flavors of machine learning (including neural networks) and “classic” statistical models you can try:

    – Bishop’s Pattern recognition and machine learning (2006) you can find it free in:

    Now, with respect to data from surveys with neural networks, I would give the same answer. What do you need the model for? is it to do prediction? to estimate latent variables? it all depends. Depends on the goal, depends on the amount of data, depends on how versed you are in both “classic” statistical methods and machine learning.

    • Sergio Garrido says:

      I wrote the first sentence very poorly. What I meant is that ML, DL and AI are not separate things and neural networks don’t include them. On the contrary, NNs are a subset of all of them, and they are subsets of each other.

  5. We did a fairly comprehensive comparison between neural networks, linear models, and other approaches for making predictions about phenotypes from RNA-sequencing data here: We found that — averaged across the 50 or so prediction tasks we looked at — everything performed about the same with L2-regularized regression slightly leading on average rank.

    There are a couple of theories about why deep neural networks perform well on some problems. For example, in object recognition problems you are trying to create a classifier that is invariant to various transformations of the object in the image: e.g., rotation, translation, scaling, etc. Stephane Mallat has done a lot of work showing that if you use features that are designed to be invariant to these transformations already (rather than just the pixels) then you can get the same performance as a deep neural network with a regression.

    I think of it like this. The data live on some curved manifold. Then you add noise to the manifold. If the magnitude of the curvature of the manifold is larger than the noise, then the noisy manifold will still look curved and a neural network will work better than, say, a regression. If the magnitude of the curvature of the manifold is smaller than the noise, then the noisy manifold will look flat and a regression will do just fine.

  6. As a recovering semanticist, I think I can help with terminology.

    1. Artificial intelligence is an application. It just means a machine doing something we typically think of people as doing. Like correcting spelling mistakes or driving cars or playing Settlers of Catan. It’s not a mathematical technique. We can build AIs with heuristics or we can build them with statistics or we can build them with both.

    2. Machine learning is a broad class of techniques, not all of which are probabilistic. For example, support-vector machines and greedy agglomerative clustering algorithms are not probabilistic. Machine learning is currently the most popular way to build artificial intelligence applications.

    3. Neural networks are a kind of statistical model that currently dominates research in machine learning and is thus currently the go-to method for developing artificial intelligence applications. Deep neural nets, by which people mean nets with more than one hidden layer, are a form of neural network. Deep nets are computationally intractable for traditional statistical inference due to both multimodality and the scale of the likelihood function. To cope, machine learning researchers have layered heuristics on top of standard estimation techniques such as autoencoders, early stopping, etc. And because the form of the likelihood matters and generic networks don’t work, they include specialized structures like convolutional layers to deal with image transposition and rotation. That is, they don’t learn to recognize cats just from looking at cat pictures—a lot of heuristic knowledge about vision has been encoded in the likelihood architecture.

    As David MacKay explains in his info theory book, logistic regression is a simple neural network with N inputs, one output, and no hidden layers (he called it “classification with one neuron” rather than logistic regression). With appropriate link functions, neural networks can be used as generalized linear models. Viewed another way, they’re stacked logistic regressions, which is where the non-linearity comes from.

    4. Gaussian processes are a kind of statistical model, albeit a computationally intractable one at scale due to the requirement to solve matrices whose dimensionality is given by the number of data points. Like neural networks, GPs can represent arbitrary multidimensional functions given enough data (subject to conditions imposed by priors like smoothness and by the choice of covariance function). Like in the generalized linear model case and in the neural network case, we can throw logit link functions on Gaussian processes and use them for binary or categorical data. Like neural networks, their architecture is very general and there is a lot of heuristic/subjective knowledge going into the choice of covariance function.

    Radford Neal showed in his thesis (which also introduced HMC to the stats world!) that Gaussian processes are the limit of a single hidden-layer neural network as the number of hidden nodes goes to infinity.

    5. Panel data and time series are just forms of data. They can be handled by any kind of approach.

    6. Hierarchical modeling is a technique for partial pooling, aka modeling population effects. It pulls estimates for individuals or groups toward the estimate for the overall population of which they are a member. Machine learning researchers, including in neural networks, tend to only use shrinkage of estimates to zero (to avoid overfitting rather than to regularize to population estimates). They also tend to use fixed effects for populations rather than hierarchical modeling. Where you see hierarchical modeling in machine learning is in what they call domain adaptation, such as building a sentiment classifier for reviews for different genres of movies (dramas vs. comedies, for example) or products (shoes vs. refrigerators).

    7. Bayesian modeling is an approach to using prior data and performing inference to propagate uncertainty. The machine learning community tends to call any technique that applies Bayes’s rule “Bayesian” (e.g., naive Bayes classifiers, which are almost never Bayesian in either modeling or inference). ML researchers also use the term “prior” very broadly to include prevalence of categories in a model (such as naive Bayes). They also tend to use “Bayesian” to describe any system with a prior on parameters, even if it’s essentially being fit with penalized maximum likelihood. Statisticians tend to reserve the term “Bayesian” for full Bayesian inference, where we average our predictions over our estimation uncertainty when performing posterior predictive inference.
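The contrast in item 6 between shrinking toward zero and shrinking toward a population estimate can be sketched with the textbook normal-normal model. This is a minimal illustration under strong assumptions: the within-group standard deviation `sigma`, the population mean `mu`, and the between-group standard deviation `tau` are all treated as known, and every number below is hypothetical.

```python
def partial_pool(group_mean, n, sigma, mu, tau):
    """Posterior mean of a group-level parameter in a normal-normal model:
    a precision-weighted average of the group's own mean and the target mu."""
    precision_data = n / sigma ** 2
    precision_prior = 1.0 / tau ** 2
    return (precision_data * group_mean + precision_prior * mu) / (
        precision_data + precision_prior
    )

# Hypothetical survey example: a group average of 0.60 against a population mean of 0.50.
sigma, tau, mu = 0.5, 0.05, 0.50

small_group = partial_pool(0.60, n=4, sigma=sigma, mu=mu, tau=tau)
large_group = partial_pool(0.60, n=400, sigma=sigma, mu=mu, tau=tau)

# Shrinkage toward zero is the same formula with the target set to 0.
small_group_to_zero = partial_pool(0.60, n=4, sigma=sigma, mu=0.0, tau=tau)

print(small_group, large_group, small_group_to_zero)
```

With few observations the estimate sits close to the shrinkage target, so the choice of target (zero or the population mean) matters most exactly where the data are sparse.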

  7. A better contrast might be interpretable versus explainable [black box] models (with Bayesian analyses done with inadequate workflow to understand them being in the explainable models category).

    As an update on this, see the paper “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead” by Cynthia Rudin.

    In particular “Here is the Rashomon set argument: consider that the data permit a large set of reasonably accurate predictive models to exist. … Unpacking this argument slightly, for a given data set, we define the Rashomon set as the set of reasonably accurate predictive models (say within a given accuracy from the best model accuracy of boosted decision trees). Because the data are finite, the data could admit many close-to-optimal models that predict differently from each other: a large Rashomon set. I suspect this happens often in practice because sometimes many different ML algorithms perform similarly on the same data set, despite having different functional forms (for example, random forests, neural networks, support vector machines).”

  8. Ricardo says:

    David MacKay had a lecture that touches this topic:

  9. My issue with neural networks is that they focus on point predictions, so it is difficult to get the predictive *distributions* you need for decision analysis. Yes, you can often modify the objective function of an NN model to get predictive distributions instead of point predictions, but this is rarely done. Even if you do create a model that produces predictive distributions, good luck on incorporating parameter uncertainty.

    • In principle, one could describe a model for data using a neural network, whose parameters had prior distributions, and with a probability distribution over the errors given by a Bayesian model, and wind up with a Bayesian posterior over the parameters of a neural network model, but this is rarely done. One reason is that it’s virtually impossible for a normal human to express useful priors over neural network parameters because “what they do” is very opaque and there are generally going to be lots of inter-dependencies. Another reason is it’s hard to fit high dimensional models in general, and it requires tons of computing time.

    • Corey says:

      Kevin, you may be interested in this (although it’s 4 years old now):

      It shows how a certain form of drop-out (a regularization technique that involves removing edges at random during training) can be viewed as a Monte Carlo version of a Bayesian variational approximation; the upshot is that you can do drop-out during prediction to approximate sampling from the posterior distribution.
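The mechanics of drop-out at prediction time can be sketched in a few lines. Everything here is an assumption made for illustration: the tiny network, its fixed weights, and the choice to drop whole hidden units rather than individual edges. It shows only the sampling step, not training, and it makes no claim about how well the resulting spread approximates a real posterior.

```python
import math
import random
import statistics

# Hypothetical fixed weights for a tiny 1-input, 4-hidden-unit, 1-output network.
W1 = [0.8, -1.2, 0.5, 1.5]
B1 = [0.1, 0.0, -0.3, 0.2]
W2 = [1.0, -0.7, 0.4, 0.9]

def forward_with_dropout(x, p_drop, rng):
    """One stochastic forward pass: each hidden unit is dropped with probability p_drop."""
    keep = 1.0 - p_drop
    out = 0.0
    for w1, b1, w2 in zip(W1, B1, W2):
        h = math.tanh(w1 * x + b1)
        if rng.random() < keep:
            out += w2 * h / keep  # rescale so the expected output matches the full network
    return out

def mc_dropout_predict(x, p_drop=0.5, n_passes=2000, seed=0):
    """Repeat stochastic forward passes and summarize them by mean and standard deviation."""
    rng = random.Random(seed)
    draws = [forward_with_dropout(x, p_drop, rng) for _ in range(n_passes)]
    return statistics.mean(draws), statistics.stdev(draws)

mean, sd = mc_dropout_predict(1.0)
print(mean, sd)
```

The standard deviation across passes is what gets read as predictive uncertainty. Whether that reading is trustworthy is exactly the concern raised in the replies that follow.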

      • Thanks for the reference. I still have doubts about the usefulness of even an optimal variational approximation to the posterior in this case. If I understand correctly, your typical variational approximation for a high-dimensional parameter space is going to approximate the posterior as an axis-parallel multivariate normal (i.e. independent normals for each parameter), and from my understanding of deep neural nets I expect this to be an exceedingly bad approximation — they have not just massive multimodality, but also very strong posterior correlation between parameters.

        I would be delighted to find that I am mistaken here…

        • Corey says:

          This particular variational approximation involves Bernoulli distributions in a way I am no longer clear on (if I ever was).

          I think with these sorts of models the thing to be worried about is posterior predictive performance and in particular, understating the uncertainty of the prediction. My understanding is that KL-loss-based variational approximations are quite prone to this problem and also that it does show up in the DNN context.

    • sam says:

      Hey Kevin, Kind of curious as to what decisions you’re making. I can see if you have some sort of non-linear utility function you’re maximizing you might need a posterior for each observation. But where does that happen?

      • I’d also be curious what decision analysis Kevin is involved in, but really it’s easy to come up with nonlinear utilities. Like imagine you have an investment of x dollars now for a series of uncertain payouts with uncertain number of dollars at uncertain times in the future. The utility will be nonlinear exp(-r*t) in the time at which the payouts occur…

        Or imagine there’s a regulatory situation where pollution below some amount is allowed but above some amount there are nonlinear fines, and you need to optimize your maintenance schedule on your equipment…

        or whatever. it’s easy to come up with nonlinear utility on uncertain parameters.
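A small numerical sketch of the discounting example shows why the uncertainty matters. The discount rate and the distribution of payout times are hypothetical numbers chosen for illustration.

```python
import math

r = 0.05  # assumed annual discount rate

# Hypothetical distribution over the time (in years) at which a payout of 1 arrives.
payout_times = [(1.0, 0.5), (5.0, 0.3), (20.0, 0.2)]  # (time, probability)

expected_time = sum(t * p for t, p in payout_times)

# Plugging the expected time into the utility ...
utility_at_expected_time = math.exp(-r * expected_time)

# ... is not the same as averaging the utility over the uncertain time.
expected_utility = sum(p * math.exp(-r * t) for t, p in payout_times)

print(utility_at_expected_time, expected_utility)
```

Because exp(-r*t) is convex in t, the expected utility is larger than the utility evaluated at the expected time (Jensen's inequality), so a point estimate of the payout time undervalues the investment here.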

        • sam says:

          Daniel. I agree it’s easy to come up with conceptual problems where you’d have nonlinear utilities. But in order to maximize that you’d have to have a concrete mathematical equation for your utility. I’m interested in how one would actually come up with that equation. For most instances I think that’s challenging.

          also, spitballing here, what if you just model utility(y)? would you need uncertainty parameters then?

      • I don’t know what “posterior for each observation” means; do you mean predictive distribution?

        Anyway, here’s one example where predictive distributions are important. When you’re doing capacity planning, then getting a point estimate of the load at a future date is of limited use; what you really want is a high-confidence upper bound. That is, you might want to find x such that

        Pr(load > x) = alpha

        for some small alpha (0.01, 0.001). More commonly, you’ll want to be able to handle the max load over some period: find x such that

        Pr(max_i load_i > x) = alpha.

        You would prefer not to have to choose alpha up front, as that may depend on executive decisions that haven’t been made yet and on costs and priorities that may not be known yet or may change. (Those costs include both the cost of providing resources that may not be used, and the negative consequences of not having enough capacity to meet the actual load.) You also may not want to set in stone the width of the period of interest in the second case above. So you really do need a predictive distribution.
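Given draws from a predictive distribution, both bounds above reduce to empirical quantiles, and alpha can be picked after the fact. A minimal sketch follows. The lognormal load model and all of its numbers are placeholders standing in for draws from a real fitted model, and `upper_bound` is a hypothetical helper.

```python
import math
import random

def upper_bound(draws, alpha):
    """Return an order statistic x of the draws with empirical Pr(draw > x) <= alpha."""
    ordered = sorted(draws)
    k = math.ceil((1.0 - alpha) * len(ordered)) - 1
    k = max(0, min(k, len(ordered) - 1))
    return ordered[k]

# Hypothetical predictive draws: 5000 simulated 30-day load trajectories.
rng = random.Random(1)
trajectories = [[rng.lognormvariate(4.0, 0.3) for _ in range(30)] for _ in range(5000)]

single_day = [traj[0] for traj in trajectories]    # load on one future day
period_max = [max(traj) for traj in trajectories]  # max load over the whole period

# The draws are kept, so alpha does not have to be fixed in advance.
for alpha in (0.01, 0.001):
    print(alpha, upper_bound(single_day, alpha), upper_bound(period_max, alpha))
```

Changing the period of interest only changes which slice of each trajectory goes into `max`. A model that returns a single point prediction supports neither choice.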

        As another example, I’m doing some work where we’re building Monte Carlo models in support of complex decisions. The result of a regression analysis might be only one piece of a larger model, with its outcome variable being only an intermediate step in computing the final net benefit of a decision.

        Finally, nonlinear utility functions aren’t the issue; outputting the predictive mean is an optimal action only if (1) your action consists solely of outputting a prediction, rather than using that prediction to inform some other decision (which might have only a finite number of possible alternatives), and (2) your utility function (or loss function) is *quadratic*.
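The dependence of the best action on the loss function can be checked directly by brute force. The five predictive draws and the cost ratio below are hypothetical, and the grid search is only for illustration.

```python
draws = [1.0, 2.0, 2.0, 3.0, 12.0]         # hypothetical skewed predictive draws
actions = [i / 10 for i in range(0, 151)]  # candidate actions 0.0, 0.1, ..., 15.0

def expected_loss(action, loss):
    """Average loss of an action over the predictive draws."""
    return sum(loss(action, y) for y in draws) / len(draws)

def best_action(loss):
    """Grid search for the action with the smallest expected loss."""
    return min(actions, key=lambda a: expected_loss(a, loss))

def quadratic(a, y):
    return (a - y) ** 2

def absolute(a, y):
    return abs(a - y)

def shortfall_heavy(a, y):
    # Assumed costs: running short costs 10 per unit, spare capacity costs 1 per unit.
    return 10.0 * (y - a) if y > a else (a - y)

print(best_action(quadratic), best_action(absolute), best_action(shortfall_heavy))
```

Quadratic loss picks out the mean of the draws, absolute loss picks out the median, and the asymmetric loss pushes the action into the upper tail. Only the first of these could be read off a point prediction of the mean.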

        • sam says:

          ‘I don’t know what “posterior for each observation” means; do you mean predictive distribution?’

          Yea I suppose I’m not clear on terminology, but for concreteness let’s say you have a model y = b + e where b is some parameter and e is some known error. There are two types of noise (as you implied with your original post): the data generation noise e and the parameter uncertainty p(b). For the predictive distribution I think one would have to take both into account. Some sort of Bayesian estimation (like taking dropout into account) can help with parameter uncertainty and some sort of mixture density can help with dgp noise.
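For the model y = b + e, the two sources of noise add up in the predictive variance. A sketch via the law of total variance, assuming e has mean zero and known variance and is independent of b given the data; the symbols \tilde{y} (a new observation), D (the data), and \sigma_e^2 (the error variance) are notation introduced here:

```latex
\operatorname{Var}(\tilde{y} \mid D)
  = \operatorname{E}\!\left[\operatorname{Var}(\tilde{y} \mid b, D)\right]
  + \operatorname{Var}\!\left(\operatorname{E}[\tilde{y} \mid b, D]\right)
  = \sigma_e^2 + \operatorname{Var}(b \mid D)
```

A point prediction drops both terms; a model that reports only the noise term still understates the spread by the parameter-uncertainty term.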

  10. Ron Kenett says:

    Another interesting thread……thank you for making this happen.

    Comparing models for analysing survey data is a great topic. We had a project on this at the University of Turin and Milan, a while ago. Two contributions from this:

    1. We edited in 2011 a book published by Wiley titled Modern Analysis of Customer Surveys: with Applications using R,
    Chapter 10 is on Statistical inference for causal effects by Fabrizia Mealli, Barbara Pacini and Donald B. Rubin
    The following 11 chapters analyse the same dataset (the ABC data that can be downloaded from the book’s website). These are:
    Bayesian Networks (11, Kenett, Salini)
    Log Linear Models (12, Fienberg, Mandrique)
    CUB Models (13, Piccolo, Iannario)
    The Rasch Model (14, De Battisti, Nicolini, Salini)
    Tree-based Methods and Decision Trees (15, Soffritti, Galimberti)
    PLS Models (16, Boari, Cantaluppi)
    Nonlinear PCA (17, Ferrari, Barbero)
    Multidimensional Scaling (18, Solaro)
    Multilevel Models for Ordinal Data (19, Rampichini, Grilli)
    Control Charts Applications (20, Kenett, Deldossi, Zappa)
    Fuzzy Methods (21, Zani, Morlini, Milioli)

    This provides a unique opportunity to compare what you get from different models applied to the same dataset.

    A paper that proposes to combine models in order to enhance information quality generated by analysis was also published in ASMBI:

  11. Eliot J says:

    My brief review of the literature neglected to mention that one of the first contributions with an explicit comparison of the two communities (stats vs ML) was Breiman’s. In 1999 he introduced CART random forests by comparing the predictive accuracy of ~1,000 RFs with a single iteration of logistic regression, concluding that the ensemble predictions were more accurate than LR.

    Thanks to Andrew for posting this query.

    • Eliot:

      Did you also miss – Stat Med. 1998 Nov 15;17(21):2501-8. A comparison of statistical learning methods on the Gusto database. Ennis M, Hinton G, Naylor D, Revow M, Tibshirani R.

      We apply a battery of modern, adaptive non-linear learning methods to a large real database of cardiac patient data. We use each method to predict 30 day mortality from a large number of potential risk factors, and we compare their performances. We find that none of the methods could outperform a relatively simple logistic regression model previously developed for this problem.


  12. Bob says:

    Spyros Makridakis ran the M4 competition comparing models for time series prediction. I haven’t studied it in detail but he reckons the ML models underperformed statistical time series models.

  13. Bob says:

    Oh yeah, it’s in the original blog post.

  14. Warren S Sarle says:

    Neural Network and Statistical Jargon

    Warren S. Sarle Apr 29, 1996

    The neural network (NN) and statistical literatures contain many of the
    same concepts but usually with different terminology. Sometimes the same
    term or acronym is used in both literatures but with different meanings.
    Only in very rare cases is the same term used with the same meaning,
    although some cross-fertilization is beginning to happen. Below is a
    list of such corresponding terms or definitions.

    Particularly loose correspondences are marked by a ~ between the two
    columns. A < indicates that the term on the left is roughly a subset of
    the term on the right, and a > indicates the reverse. Terminology in
    both fields is often vague, so precise equivalences are not always
    possible. The list starts with some basic definitions.

    There is disagreement in the NN literature on how to count layers. Some
    people count inputs as a layer and some don’t. I specify the number of
    hidden layers instead. This is awkward but unambiguous.

    Definition                        Statistical Jargon
    ==========                        ==================

    generalizing from noisy data      Statistical inference
    and assessment of the
    accuracy thereof

    the set of all cases one          Population
    wants to be able to
    generalize to

    a function of the values in       Parameter
    a population, such as the
    mean or a globally optimal
    synaptic weight

    a function of the values in       Statistic
    a sample, such as the mean
    or a learned synaptic weight

    Neural Network Jargon             Definition
    =====================             ==========

    Neuron, neurode, unit,            a simple linear or nonlinear computing
    node, processing element          element that accepts one or more inputs,
                                      computes a function thereof, and may
                                      direct the result to one or more other

    Neural networks                   a class of flexible nonlinear regression
                                      and discriminant models, data reduction
                                      models, and nonlinear dynamical systems
                                      consisting of an often large number of
                                      neurons interconnected in often complex
                                      ways and often organized into layers

    Neural Network Jargon             Statistical Jargon
    =====================             ==================

    Statistical methods               Linear regression and discriminant
                                      analysis, simulated annealing, random

    Architecture                      Model

    Training, Learning,               Estimation, Model fitting, Optimization

    Classification                    Discriminant analysis

    Mapping, Function                 Regression

    Supervised learning               Regression, Discriminant analysis

    Unsupervised learning,            Principal components, Cluster analysis,
    Self-organization                 Data reduction

    Competitive learning              Cluster analysis

    Hebbian learning,                 Principal components

    Training set                      Sample, Construction sample

    Test set, Validation set          Hold-out sample

    Pattern, Vector, Example,         Observation, Case
    Sample, Case

    Reflectance pattern               an observation normalized to sum to 1

    Binary(0/1),                      Binary, Dichotomous
    Bivalent or Bipolar(-1/1)

    Input                             Independent variables, Predictors,
                                      Regressors, Explanatory variables,

    Output                            Predicted values

    Forward propagation               Prediction

    Training values                   Dependent variables, Responses,
    Target values                     Observed values

    Training pair                     Observation containing both inputs
                                      and target values

    Shift register,                   Lagged variable
    (Tapped) (time) delay (line),
    Input window

    Errors                            Residuals

    Noise                             Error term

    Generalization                    Interpolation, Extrapolation,

    Error bars                        Confidence interval

    Prediction                        Forecasting

    Adaline                           Linear two-group discriminant analysis
    (ADAptive LInear NEuron)          (not Fisher’s but generic)

    (No-hidden-layer) perceptron    ~ Generalized linear model (GLIM)

    Activation function,            > Inverse link function in GLIM
    Signal function,
    Transfer function

    Softmax                           Multiple logistic function

    Squashing function                bounded function with infinite domain

    Semilinear function               differentiable nondecreasing function

    Phi-machine                       Linear model

    Linear 1-hidden-layer             Maximum redundancy analysis, Principal
    perceptron                        components of instrumental variables

    1-hidden-layer perceptron       ~ Projection pursuit regression

    Weights,                          Shrinkage estimation, Ridge regression

    Jitter                            random noise added to the inputs to
                                      smooth the estimates

    Growing, Pruning, Brain           Subset selection, Model selection,
    damage, Self-structuring,         Pre-test estimation

    Optimal brain surgeon             Wald test

    LMS (Least mean squares)          OLS (Ordinary least squares)
                                      (see also “LMS rule” above)

    Relative entropy, Cross           Kullback-Leibler divergence

    Evidence framework                Empirical Bayes estimation

    OLS (Orthogonal least squares)    Forward stepwise regression

    Probabilistic neural network      Kernel discriminant analysis

    General regression neural         Kernel regression

    Topologically distributed       < (Generalized) Additive model

    Adaptive vector quantization      iterative algorithms of doubtful
                                      convergence for K-means cluster analysis

    Adaptive Resonance Theory 2a    ~ Hartigan's leader algorithm

    Learning vector quantization      a form of piecewise linear discriminant
                                      analysis using a preliminary cluster

    Counterpropagation                Regressogram based on k-means clusters

    Encoding, Autoassociation         Dimensionality reduction
                                      (Independent and dependent variables
                                      are the same)

    Heteroassociation                 Regression, Discriminant analysis
                                      (Independent and dependent variables
                                      are different)

    Epoch                             Iteration

    Continuous training,              Iteratively updating estimates one
    Incremental training,             observation at a time via difference
    On-line training,                 equations, as in stochastic approximation
    Instantaneous training

    Batch training,                   Iteratively updating estimates after
    Off-line training                 each complete pass over the data as in
                                      most nonlinear regression algorithms
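Two rows of the table above are easy to check numerically: the activation function as an inverse link function, and softmax as a multiple logistic function. A minimal sketch; the function names and the example weights are illustrative choices, not taken from the list.

```python
import math

def logit(p):
    """Logit link: maps a probability to the linear-predictor scale."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic activation, i.e. the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax over a list of scores (shifted by the max for numerical stability)."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def one_neuron(x, weights, bias):
    """No-hidden-layer perceptron with a logistic activation.
    This has the same functional form as a logistic regression prediction."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

p = one_neuron([1.0, 2.0], weights=[0.3, -0.2], bias=0.8)
print(p, logit(p))            # logit(p) recovers the linear predictor
print(softmax([0.7, 0.0])[0], sigmoid(0.7))
```

With two classes and one score fixed at zero, softmax reduces to the logistic function, which is why the two-class case is ordinary logistic regression.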
