After reading this from John Langford:

The Deep Learning problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.

Jimmy asks:

Does this sound related to modeling the deep interactions you often talk about? (I [Jimmy] never understand the stuff on hunch, but thought that might be so?)

My reply: I don’t understand that stuff on hunch so well either; he uses slightly different jargon than I do! That said, it looks interesting and important, so I’m pointing you all to it.

@Andrew: "Deep" as Langford's using it here means latent predictors that are themselves modeled using logistic regressions. The "depth" refers to the depth of the diagram of connections in a neural network. Each "neuron" is modeled as a logistic regression (or you can use some other sigmoid if you like).

For instance, a one-hidden-layer network would have continuous basic predictor vector x, binary outcome y, and a bunch of intermediate units z[i] for i in 1:I, each of which is defined by a logistic regression over the basic predictors using coeffs beta[i]:

z[i] = inv-logit(beta[i]' x)

Then the final outcome prediction y is defined as a regression over the z, as in:

y ~ Bernoulli(inv-logit(alpha' z))

This gives you non-linearities in the relation between x and y. And you can keep stacking the intermediate layers, making the neural network deeper. The deeper models are notoriously tricky to fit, hence the challenge of understanding "deep learning".
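For concreteness, the two formulas above can be sketched as a forward pass in NumPy. The shapes and names here (beta as an I-by-K matrix, and so on) are just my own illustration, not anything standard:

```python
import numpy as np

def inv_logit(u):
    """Logistic sigmoid: maps the reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-u))

def one_hidden_layer_predict(x, beta, alpha):
    """Forward pass of the one-hidden-layer network described above.

    x     : predictor vector, shape (K,)
    beta  : hidden-unit coefficients, shape (I, K); row i is beta[i]
    alpha : output coefficients, shape (I,)
    Returns P(y = 1 | x).
    """
    z = inv_logit(beta @ x)        # z[i] = inv-logit(beta[i]' x)
    return inv_logit(alpha @ z)    # y ~ Bernoulli(inv-logit(alpha' z))

# Tiny example with made-up coefficients, purely illustrative
rng = np.random.default_rng(0)
x = rng.normal(size=3)
beta = rng.normal(size=(4, 3))
alpha = rng.normal(size=4)
p = one_hidden_layer_predict(x, beta, alpha)
```

Stacking more layers just means feeding z through another matrix of coefficients and another sigmoid before the final regression.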

@Jimmy: Deep neural networks are only one way to deal with non-linearities.

Tree-based models (e.g. Bayesian additive regression trees or boosted decision stumps or whatever) provide another approach.
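A minimal sketch of the boosted-stumps idea, in the squared-error gradient-boosting form where each round fits a one-split "stump" to the current residuals (function names and settings here are made up for illustration):

```python
import numpy as np

def fit_stump(x, r):
    """Find the single threshold on x that best fits residuals r
    with a piecewise-constant (one-split) function."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, cl, cr = best
    return t, cl, cr

def boost_stumps(x, y, rounds=200, lr=0.1):
    """Each round fits a stump to the residuals and adds a
    shrunken version of it to the running prediction."""
    pred = np.zeros_like(y, dtype=float)
    stumps = []
    for _ in range(rounds):
        t, cl, cr = fit_stump(x, y - pred)
        pred += lr * np.where(x <= t, cl, cr)
        stumps.append((t, cl, cr))
    return stumps, pred

# A nonlinear target that no single linear predictor can capture
x = np.linspace(-3, 3, 200)
y = np.sin(x)
stumps, pred = boost_stumps(x, y)
```

Each individual stump is a terrible predictor; the nonlinearity comes entirely from adding many of them up.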

So do smoothed nearest neighbor methods and related kernel smoothers.
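The simplest version of the kernel-smoothing idea is a Nadaraya-Watson estimator: predict at a new point with a locally weighted average of the training outcomes, with weights falling off by distance. A sketch (the Gaussian kernel and bandwidth choice are arbitrary):

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth=0.3):
    """Nadaraya-Watson kernel smoother: the prediction at each
    query point is a kernel-weighted average of the training y's."""
    x_new = np.atleast_1d(x_new)
    d = (x_new[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * d**2)               # Gaussian weights by distance
    return (w @ y_train) / w.sum(axis=1)  # weighted average per query point

x = np.linspace(-3, 3, 100)
y = np.sin(x)
yhat = kernel_smooth(x, y, x)
```

The bandwidth plays the regularization role here: too small and you chase noise, too large and you smooth the nonlinearity away.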

Yet another way is to interact features in a regression, which is what Andrew likes to do. If you keep interacting predictors with themselves and each other up to arbitrary orders, you get a universal non-linear learner for your task. The problem then shifts to regularizing enough so as not to overfit when moving to held-out predictions.
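A toy sketch of why the interactions matter: if the target is a pure x1*x2 interaction, a main-effects-only regression can't fit it at all, while adding the second-order terms (with a bit of ridge regularization to keep the expanded basis in check) recovers it. The names and the lambda value here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=n)  # target is purely an interaction

def ridge_fit(Z, y, lam=1.0):
    """Ridge regression: the penalty keeps the expanded basis from overfitting."""
    K = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ y)

# Main effects only: intercept, x1, x2
Z_lin = np.column_stack([np.ones(n), X])
# Add the second-order terms: x1*x2, x1^2, x2^2
Z_int = np.column_stack([Z_lin, X[:, 0] * X[:, 1], X[:, 0]**2, X[:, 1]**2])

mse = lambda Z, b: np.mean((Z @ b - y)**2)
mse_lin = mse(Z_lin, ridge_fit(Z_lin, y))  # stuck near the variance of y
mse_int = mse(Z_int, ridge_fit(Z_int, y))  # down near the noise level
```

With higher orders the basis grows combinatorially, which is why the regularization, rather than the expansion itself, becomes the hard part.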

much thanks for the explanations, bob. i figured you would chime in and help clarify. :)

@bob thanks for the insight. This may be an ignorant question (not too familiar with the ML literature), but if a single interacted regression is universal, why would you go through the trouble of stacking them? I could understand that if there was some sort of well-defined causal structure (as in Pearl's causality graphs or instrumental variables), but stacking regressions to model non-linearities just seems to complicate analysis and interpretation unnecessarily. Not only do you have to keep track of the uncertainty in the parameters and the final prediction, but you also have to deal with the uncertainty in the prediction at each connection.