We have really everything in common with machine learning nowadays, except, of course, language.

I had an interesting exchange with Bob regarding the differences between statistics and machine learning. If it were just differences in jargon, it would be no big deal—you could just translate back and forth—but it’s trickier than that, because the two subfields also have different priorities and concepts.

It started with this abstract by Satyen Kale in Columbia’s statistical machine learning seminar:

Learning linear predictors with the logistic loss—both in stochastic and online settings—is a fundamental task in machine learning and statistics, with direct connections to classification and boosting. Existing “fast rates” for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is 1-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. Leveraging this improved dependency on the predictor norm yields algorithms with tighter regret bounds for online bandit multiclass learning with the logistic loss, and for online multiclass boosting. Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parametric and even nonparametric settings. This is joint work with Dylan J. Foster, Haipeng Luo, Mehryar Mohri and Karthik Sridharan.

What struck me was how difficult it was for me to follow what this abstract was saying!

“Learning linear predictors”: does this mean estimating coefficients, or deciding what predictors to include in the model?

“the logistic loss”: I’m confused here. In logistic regression we use the log loss (that is, the contribution to the log likelihood is log(p)); the logistic comes in the link function. So I can’t figure this out. If, for example, we were instead doing probit regression, would it be probit loss? There’s something I’m not catching here. I’m guessing that they are using the term “loss” in a slightly different way than we would.

“exponential dependence on the predictor norm”: I have no idea what this is.

“1-mixable”: ?

“a regret bound”: I don’t know what that is either!

“the COLT 2012 open problem of McMahan and Streeter”: This is news to me. I’ve never even heard of COLT before.

“improper learning”: What’s that?

“online bandit multiclass learning”: Now I’m confused in a different way. I think of logistic regression for 0/1 data, but multiclass learning, I’m guessing that’s for data with more than 2 categories?

“online multiclass boosting”: I don’t actually know what this is either, but I’ve heard of “boosting” (even though I don’t know exactly what it is).

“improper logistic regression”: What is that? I don’t think it’s the same as improper priors.

I know I could google all the above terms, and I’m not faulting the speaker for using jargon, any more than I should be faulted for using Bayesian jargon in my talks. It’s just interesting how difficult the communication is.

I sent the above to Bob, who replied:

Stats seems more concerned with asymptotic consistency and bias of estimators than with convergence rates. And I’ve never seen anyone in stats talk about regret. The whole setup is online (data streaming inference, which they call “stochastic”). But then I don’t get out much.

I have no idea why they say online logistic regression is a fundamental task. It seems more like an approach to classification than a task itself. But hey, that’s just me being picky about language.

I have no idea what 1-mixable is or what a predictor norm is, and I wasn’t enlightened as to what “improper” means after reading the abstract.

Again, not at all a slam on Satyen Kale. It would be just about as hard to figure out what I’m talking about, based on one of my more technical abstracts. This is just an instance of the general problem of specialized communication: jargon is confusing for outsiders but saves time for insiders.

I suggested to Bob that what we need is some sort of translation, and he responded:

I know some of this stuff. Regret is the expected difference in utility between the strategy you played and the optimal strategy (in simple bandit problems, optimal is always playing the best arm). Regret bounds are strategies for pulling arms in an explore/exploit way that bounds regret. This is what you’ll need to connect to the ML literature on online A/B testing (subjects being assigned one at a time).

COLT is the conference on learning theory. That’s where they do PAC learning. If you don’t know what PAC learning is, well, google it. . . .

I think “logistic loss” was either a typo or to distinguish their use of logistic regression from general 0/1 loss stuff.

Online bandit multiclass: online means one example at a time or one subject at a time, where you can control assignment to any of k treatments (or a control). Yes, you can use multi-logit for this as we were discussing the other day. It’s in the Stan user’s guide regression chapter.

Boosting is a technique where you iteratively train, upweighting the examples that were misclassified in the previous iteration, then combining the predictors from all the iterations with weights. Given its heuristic nature, there are a gazillion variants. It’s usually used with decision “stumps” (shallow decision trees) a la BART.

But I have no clue what “improper” means despite being in the title.
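Since Bob’s description of boosting is compact, here is a minimal sketch of the classic algorithm he’s describing, AdaBoost with decision stumps. This is the textbook recipe, not anything from the paper above, and the function names are mine:

    import numpy as np

    def fit_stump(X, y, w):
        """Find the stump (feature, threshold, sign) with minimum weighted error.
        Labels y are in {-1, +1}; w is a probability vector of example weights."""
        best, best_err = None, np.inf
        for j in range(X.shape[1]):
            for thresh in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = np.where(X[:, j] < thresh, -sign, sign)
                    err = np.sum(w[pred != y])
                    if err < best_err:
                        best, best_err = (j, thresh, sign), err
        return best, best_err

    def stump_predict(stump, X):
        j, thresh, sign = stump
        return np.where(X[:, j] < thresh, -sign, sign)

    def adaboost(X, y, n_rounds=50):
        """Each round: fit a stump to the weighted data, upweight its mistakes,
        and give the stump a vote proportional to its accuracy."""
        w = np.full(len(y), 1.0 / len(y))
        ensemble = []
        for _ in range(n_rounds):
            stump, err = fit_stump(X, y, w)
            err = max(err, 1e-12)                  # guard against division by zero
            alpha = 0.5 * np.log((1 - err) / err)  # this stump's weight in the vote
            pred = stump_predict(stump, X)
            w = w * np.exp(-alpha * y * pred)      # upweight misclassified examples
            w = w / w.sum()
            ensemble.append((alpha, stump))
        return ensemble

    def predict(ensemble, X):
        """Weighted vote of all the stumps."""
        return np.sign(sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble))

    # Example: a noisy linear boundary in two dimensions.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)
    model = adaboost(X, y, n_rounds=20)
    print(np.mean(predict(model, X) == y))  # training accuracy

The reweighting line w * exp(−alpha * y * pred) is exactly the upweighting step Bob describes, and the alpha values are the weights that combine the predictors.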

As an aside, I’m annoyed at the linguistic drift by which “classification and regression trees” have become “decision trees.” This seems inaccurate and misleading to me, as no decisions are involved. “Regression trees” or “prediction trees” would seem more accurate to me. As with other such linguistic discussions, my goal here is not purity or correctness but rather accuracy and minimization of confusion.

Anyway, to continue the main thread, Bob summarizes the themes of the above discussion as: “us all being finite, academia being siloed, and communication being harder than math.” At a technical level, it seems that the key difference is that machine learning is focused on online learning, while statistics is focused on static learning. This is part of the general pattern that computer scientists work on larger problems than statisticians do.

Comments

  1. I wouldn’t say that “machine learning focuses on online learning” is a good characterization of the current state of the field. There is a subcommunity within machine learning that is interested in online learning, but it’s distinctly a minority these days.

  2. Thank you, I was about to throw in the towel since I understood little of the original quote. At least I’m in good company. But it does make me think back to Breiman’s work – the differences run deeper than just language. And in the 20+ years since, I’m not sure the fields are converging. Statistics and Data Science should be virtually synonymous in my mind, but they are not. And the attempts to distinguish the two have left me, frankly, dumbfounded. Maybe we’ll try again on this post.

    • Dale:

      Just this morning I was talking with someone about statistics and data science, and how the “data science” perspective, for all its problems, is a real advance. I was thinking about David Cox’s book Theoretical Statistics from 1974, which is considered a classic but which I now think is kinda horrible, despite all the mathematical and statistical insight inside it. My problem with that book is that it’s all from the perspective of, God gives you a model and now you fit it. That’s how it goes in those old-school statistics books: You have data x_1,…,x_n iid from a Poisson distribution, etc. And that’s not real life. In contrast, in “data science,” we start with the data. That’s not quite right either—I’d say that usually we start with a question and then we gather data—but I prefer the “Start with the data” approach to the “You have Poisson data” approach.

      • I don’t really see how this “Start with the data/problem” approach is unique to data science. In my opinion, it is rather about theoretical vs. applied statistics:

        Applied statistics: Start with a concrete problem (e.g. “How many emergency patients will arrive at some hospital?”)
        Theoretical statistics: Start with an abstract problem (e.g. “Let X_1, …, X_n be i.i.d. Poisson distributed […]” ).

        As the title says, Cox’s book is about “Theoretical Statistics,” so of course it mostly deals with problems of the second kind. And questions such as “Why do I need to learn all this theory stuff, when will I ever need it?” are a separate debate in themselves.

        But the approach of applied statistics (at least as I described it) is not that distinct from the “data science approach”, is it?
        (At least if we actually replace “Start with the data” with “Start with a question”.)

        • Michael:

          It’s not that “start with the data” is unique to data science. It’s that old-school statistics textbooks (even those without the word “Theoretical” in their titles) typically went with the “Suppose you have Poisson-distributed data” sort of formulation.

          Also, I completely disagree with your statement that “of course” a book on theoretical statistics should formulate problems with the data distribution as known. Theoretical statistics is the theory of applied statistics, and applied statistics problems typically don’t have known distributions, link functions, etc.

      • Andrew:

        I have come to think that traditional statistics (Cox’s book) is better geared to hard-science data, where many data generating processes do follow some physical, chemical, or biological law (the “data follow a Poisson distribution” kind of thing). With social science data it is a completely different situation.

  3. COLT is computational learning theory. It combines statistics and computational complexity. So you’re interested in polynomial-time and polynomial-sample-size algorithms, with the usual N = (size of input in bits) of computational complexity theory replaced by 1/ε and 1/δ, where ε is the classification error and δ is the probability of exceeding a classification error of ε.
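    In symbols (this is the standard PAC definition, using the ε and δ above, not anything from the abstract): an algorithm PAC-learns a hypothesis class if, for every ε, δ > 0, it draws m(ε, δ) samples and returns a hypothesis h_hat with P(err(h_hat) > ε) ≤ δ, where both the sample size m(ε, δ) and the running time are polynomial in 1/ε and 1/δ.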

  4. Improper learning as I learned it in my ML classes is when the ground truth function is in some model class (e.g. the data are linearly separable), but the predictor is allowed to be in a superset of that class.

    I don’t think most people would phrase regret quite the way Bob did. Usually we talk about external regret, which is the gap between our performance and that of the best fixed predictor in hindsight. So in the bandit setting the benchmark is not pulling the best arm at every round, but the single arm that would have given the best total result had it been pulled every time. This is necessary since in the bandit setting there’s no notion of a reward distribution, as the rewards are assumed to be picked adversarially.
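    Written out (standard notation, mine rather than anything from the talk): if I_t is the arm pulled in round t and r_{t,i} is the adversarially chosen reward of arm i, then external regret after T rounds is

      R_T = max_i sum_{t=1}^T r_{t,i} − sum_{t=1}^T r_{t,I_t},

    the gap to the best single arm in hindsight, not to the per-round optimum.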

  5. The business about “logistic loss” comes from the perspective of finding a hypothesis that minimizes some expected loss. So you decide on your loss function for misclassification (or predictive error for regression) up front, and you use that in training your model. This contrasts with the Bayesian stats approach of first computing the joint posterior distribution of the parameters, using this to compute the predictive distribution for a given input, and only then applying your loss function.

  6. For a rough summary of what they’re trying to do in that abstract, the setup of online classification is that each round t you get an X_t, and you want to produce a classifier f_t, where f_t(X_t) is the probability that a binary Y_t equals 1. Then you do it again for T rounds. The criterion for a good procedure is that the sum over time of the entropy loss −[Y_t log(f_t(X_t)) + (1−Y_t) log(1−f_t(X_t))] of your sequence of classifiers is small, *relative to* the corresponding loss had you used in every round the logistic regression chosen ex post to be the best possible if applied to every one of the T rounds (which is infeasible because it depends on data you haven’t seen yet). This is called the “regret”. A “proper” algorithm requires that your classifier f_t be a logistic regression each period, while an improper algorithm could be anything at all: you are comparing to the best logistic regression, but you don’t have to use one yourself.

    The name of the game in these settings is to show that you have a procedure where regret is small (typically o(T)) for every possible sequence: a lot is known about when this is and isn’t possible, for different loss functions, conditions on the data, etc, and much of the abstract is about specifying which of those conditions do and don’t hold in this setting and precisely what kinds of bounds you get.
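    To pin down the objects in the abstract (my notation, with B playing the role of the “predictor norm”): a proper online logistic regression plays a weight vector w_t each round, its per-round loss is l(w; x, y) = −[y log σ(w·x) + (1−y) log(1−σ(w·x))] with σ the logistic function, and its regret against the best fixed predictor of norm at most B is

      Reg_T = sum_{t=1}^T l(w_t; X_t, Y_t) − min_{||w|| ≤ B} sum_{t=1}^T l(w; X_t, Y_t).

    The “exponential dependence on the predictor norm” is then a factor like e^B in bounds on Reg_T for proper algorithms; an improper algorithm may play predictors outside the logistic class, which, per the abstract, is what permits the doubly-exponential improvement in the dependence on B.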

    For an intro to the topic, I like the textbook by Elad Hazan (“Introduction to Online Convex Optimization”), or the survey article by Shai Shalev-Shwartz. Francesco Orabona has a textbook which gets into more modern material. Karthik Sridharan, a coauthor on that article, has a good set of class notes.

    I have attempted to teach some of this material briefly at an undergraduate level, which I’m not sure has been a success, but for what it’s worth my class notes cover the basics: https://donskerclass.github.io/Forecasting/OnlineLearning.html, https://donskerclass.github.io/Forecasting/OnlineAlgorithms.html

    FWIW, note that a Bayesian approach to this kind of problem would just be to update your priors on a logistic regression (or whatever) each time a data point comes in. That’s not a terrible idea but it is computationally very expensive.

    • FWIW, note that a Bayesian approach to this kind of problem would just be to update your priors on a logistic regression (or whatever) each time a data point comes in. That’s not a terrible idea but it is computationally very expensive.

      Sequential Monte Carlo is essentially what you describe, but to avoid compounding numerical and approximation error, it’s much more typical to refit with the same prior and all the data each time, which yields a mathematically equivalent posterior. In some circumstances that’s too expensive, but sequential Monte Carlo is a pretty sensitive procedure that is only really reliable on a much more restrictive class of models.
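      A toy illustration of the “mathematically equivalent posterior” point, in a conjugate Beta-Bernoulli model where both routes have closed forms (my example, not tied to any particular SMC implementation):

        # Check that one-observation-at-a-time updating and refitting from
        # scratch give the same posterior in a conjugate Beta-Bernoulli model.
        a0, b0 = 1.0, 1.0          # Beta(a0, b0) prior on the success probability
        data = [1, 0, 1, 1, 0, 1]  # streaming 0/1 observations

        # Sequential: yesterday's posterior is today's prior.
        a, b = a0, b0
        for y in data:
            a, b = a + y, b + (1 - y)

        # Batch: refit with the original prior and all the data.
        a_batch = a0 + sum(data)
        b_batch = b0 + len(data) - sum(data)

        assert (a, b) == (a_batch, b_batch)  # identical Beta(a, b) posterior

      With MCMC or SMC the two routes agree only up to approximation error, which is the point: sequential errors compound, so refitting from the original prior is the safer default when you can afford it.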

      • More precisely, SMC (with data tempering) is one computational way to implement approximate Bayesian updating without refitting each time. If you don’t mind the cost, you can compute posteriors however you like. Note that since a regret bound is a guarantee, Bayesian updating might or might not have that property; but some well known results in the field can easily show that MAP estimates with properly scaled priors for log-concave likelihoods usually will. More generally, there is work more explicitly showing these kinds of guarantees for certain variational Bayes algorithms, which can often be made to run on streaming data without refitting.

        • but some well known results in the field can easily show that MAP estimates with properly scaled priors for log-concave likelihoods usually will

          For log-concave likelihoods, isn’t this true for the maximum likelihood estimator too? Based on Theorem 36.37 here:
          https://www.stat.cmu.edu/~larry/=sml/Minimax.pdf

          So the fact that MAP estimators can be minimax in the frequentist sense is just a trivial consequence of their being equivalent to maximum likelihood in the large-data limit.

      • (I can’t nest replies any deeper, so I’m replying to the question about minimax here)

        It turns out that no, the fact that MAP is asymptotically equivalent to MLE is not the reason that it works in the regret setting. Minimax regret over sequences and minimax risk over classes of probability distributions are fundamentally different criteria. In fact, it is easy to show that MLE can incur the maximum possible regret of any procedure and so is literally (tied for) the worst possible estimator under this criterion. In contrast, MAP, because of the regularization induced by the prior, is substantially more stable and so cannot be pushed to exactly the wrong end of the space by an adversarial sequence.

        Translation note, to find this fact in my notes or the references I listed: online learning people call the method that optimizes the criterion each period “Follow the Leader”, and the one that optimizes a regularized criterion each period “Follow the Regularized Leader”. The equivalence to MLE and MAP, respectively, follows from using the negative log likelihood as the loss function. Some regularity conditions apply.
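        Written out (standard definitions, with l_s the round-s loss and R a strictly convex regularizer): FTL plays w_{t+1} = argmin_w sum_{s=1}^t l_s(w), while FTRL plays w_{t+1} = argmin_w [sum_{s=1}^t l_s(w) + R(w)]. Taking l_s(w) = −log p(y_s | x_s, w) makes FTL the running MLE, and adding R(w) = −log p(w) makes FTRL the running MAP estimate.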

        As always, why you would want to do well according to a regret criterion, a risk criterion, or something else, and so whether you find the above result meaningful, is a matter of taste. If you want you can express that taste in terms of axioms rather than criterion functions, which some people find compelling, I guess.

        • I’m a little surprised that the MAP would have any special properties at all, since it’s an ill-defined estimator that changes under reparameterization due to Jacobian adjustments. I couldn’t find your notes or references, so I googled and found this

          https://courses.cs.washington.edu/courses/cse599s/14sp/scribes/lecture3/lecture3.pdf

          I believe the claim is that “good” regularization on a maximum likelihood estimator can give better regret properties, and that regularization in this case corresponds to a zero-centered normal prior with a “good” sigma. This kind of estimator does coincide with what Stan gives you when you call “optimize”

          https://github.com/stan-dev/stan/issues/2473

          since Stan doesn’t apply the constraint Jacobian during optimization and you’re most likely applying penalization on the original parameter space. But it may vary from implementation to implementation. I believe Turing.jl, for example, does apply Jacobian adjustments even when you ask for a MAP, so the result can be affected by their constraining transformations.

        • Yes, basically that’s right; I really just meant a “good” penalization of the MLE, which will coincide with (some implementations of) MAP under certain likelihoods and priors, hence the “some regularity conditions apply”. You can get away with penalties other than the centered L^2, so long as the penalty function is strictly convex, so you can go beyond mean-zero normal priors. Basically I view the equivalence as more of a happy coincidence than something fundamental, since the FTRL rule is itself largely a heuristic and only at best approximately optimal by the regret criterion, though maybe deeper thinkers have some justifications.
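          For the record, the correspondence under discussion is the standard identity (not specific to Stan or Turing.jl): the penalized estimate w_hat = argmin_w [−sum_i log p(y_i | x_i, w) + λ||w||^2] coincides with the MAP estimate under independent normal(0, σ²) priors on the coefficients when λ = 1/(2σ²), provided no reparameterization (and hence no Jacobian term) intervenes.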

  7. I’ve been reading a lot of machine learning theory lately so this post resonates. I think of the key concepts (PAC, sample complexity, uniform convergence, etc.) as a translation of basic statistical theory, but dressed in the language of guarantees about permissible solutions, and intentionally avoiding both reference to a true data generating process and any attempt to model how the data were generated.

    • What do you mean by “a translation of basic statistical theory”? I think that a fundamental problem of ML and AI is that the practitioners and theorists don’t really understand the statistical models they are relying on for both theoretical development and practical implementation. Much of what I have seen seems to exhibit a rather superficial and often incorrect understanding of statistical foundations, whether frequentist or Bayesian. I am not including Friedman, Hastie, Tibshirani, etc. as objects of my remarks — these guys have a very good grasp of statistical foundations.

      • I guess I was thinking about things like sampling theory, randomization distributions, and consistency when I wrote that. In ML, rather than making assumptions about what distribution generated the sample, the pattern is to assume you have a sample from some unknown distribution, optimize for the sample, and then try to prove your solution is good on a new sample. A lot of basic ML theory is still describing the same underlying relationships, just applied to slightly different targets, e.g., bounding the loss of solutions returned by empirical risk minimization (ERM) under some representation.
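        Concretely (standard notation, my addition): ERM picks h_hat = argmin_{h in H} (1/n) sum_i l(h(x_i), y_i), and the basic results bound the gap between this empirical risk and the expected risk E[l(h_hat(X), Y)] on a fresh draw from the same unknown distribution, uniformly over the class H.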

  8. How does one justify classification based on the results of a logistic or probit regression? Both of these regressions are designed to estimate the probabilities of events. Neither provides a classification rule or any specific guide to classification. Why is it that so many do not understand this fundamental point? Why can’t people live with probabilities?

    • I imagine because of the perspective. From an engineering point of view, the task is to make good decisions, in this case good classification decisions. That’s the primary concern. Taking that point of view, logistic and probit regression are just tools for the task (maybe good, maybe bad). Probabilities may be something that you have no interest in living with. For example, if you have very limited computational resources, you might wish to compile the classification process to a very simple set of rules, perhaps compressed to a very small table. That would be an example where a circuit — not people! — can’t live with probabilities, and anyway, probabilities are to some extent “off topic.” There might be non-probabilistic concerns that would push you to accept something that is known to be statistically suboptimal.

      Of course probabilities would be critical in analyzing the quality of a proposed solution, but would have to be balanced against other concerns that we could lump into “utility.”

  9. The confusion of referring to CART as “decision trees” goes way back. I first encountered it in the ’90s. Was “CART” trademarked? Was that the problem?

    • John,

      Decision trees (from decision analysis) had been around for decades before classification and regression trees came around, so my guess has always been that the term “decision tree” was out there, and the trees produced by CART looked the same, and people were using them to make decisions, so it was natural to just use the term “decision tree.” It annoys me but I can see the rationale.

      • No, the term “decision” refers to what you do at the binary tree branches, i.e., if x_ij < c then you go left, otherwise you go right. It has nothing to do with decision theory. It is just tree jargon.

      • Rodney:

        I would prefer for the term “decision tree” to be used to refer to actual decisions, with “prediction tree” being used to refer to prediction models. I understand that a prediction can be thought of as a series of decisions, but to me it’s cleaner to reserve the word “decision” for decisions about possible actions.

        I recognize that it’s not up to me—language is just what people do—I just think it’s too bad that the distinction is lost when the word “decision” is applied indiscriminately to tree models, whether or not they involve actual decisions.

  10. If you were actually using your systems instead of just using them to get papers noticed, would you want active learning so users can override statistical inferences with their own arbitrary heuristics, expressed in natural language?

  11. Wouldn’t it be ironic if, say, McMahan and Streeter’s open problem had been effectively solved by statisticians decades ago, in a slightly different form? And Andrew could easily explain the solution to Kale, if only Kale’s paper didn’t need a translator? Indeed, it would seem that a sufficiently rigorous treatment of the problem would have to prove that this were not so…

  12. There’s also a notion of decision tree that comes not from decision analysis but is just a representation for a set of decision-making rules (i.e., a representation of a decision-making procedure). My guess, not being able to do the historical research required, is that people applied the term to CART because they viewed CART as implicitly a classification procedure.
