“Unsupervised learning” gets a bad rap

Yesterday, a speaker from the company formerly known as Facebook corrected themselves after mentioning “unsupervised learning,” saying they weren’t supposed to use that term any more. This must be challenging after years of their chief AI scientist promoting unsupervised learning. The reason is apparently that they don’t want the public to worry about the rise of unsupervised AI! Instead, they are using the terms “unlabeled” and “self-supervised.”

If I were the lexical police, I would’ve blacklisted the term “learning,” because we’re really just estimating parameters (aka “weights”) of a statistical model. I felt bad saying “learning” even back when I worked in ML full time, but that’s probably just because I lived through the AI winter during which the quickest route to rejection of a paper or grant was to mention “artificial intelligence.”

17 thoughts on ““Unsupervised learning” gets a bad rap”

  1. But on issues other than terminology and language…is the set of techniques formerly known as “unsupervised learning” okay? After attempting it on a few problems, what I came away with is:

    1. The problem is computationally and numerically very difficult when you have a non-trivial (as a rule of thumb, >= 10) number of dimensions. Generic distance metrics become meaningless because everything is about the same distance from everything else (a quick numerical sketch of this appears after this comment).

    See theorem 1 in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.238&rep=rep1&type=pdf

    2. The problem is, for similar reasons, not well-defined. Except in special, low-dimensional cases, your result will be determined by your choice of distance metric/scale. Since you need to choose a distance metric, you need to know what you’re planning to do with the output to make sure the metric is meaningful. But if you have a specific purpose in mind, can’t you make it a supervised learning problem with a particular objective function?

    3. Despite never really getting clear answers on 1 & 2, stakeholders always seem happy with the result anyway, even in cases where they’re presented with contradictory outputs between review sessions or where the result is provably just noise artifacts.

    Relevant thread
    https://twitter.com/lpachter/status/1440695046211203077?s=20&t=KggRqev0hicdDRqjXq18VQ

    So it really seems like unsupervised learning is most appropriate when people want to draw simple lines around complex data that *look* like they’re scientifically informed.

    Intuitively, my question is this: the problem of reducing the dimensionality of data by creating a smaller taxonomy or compressing it into a smaller number of important dimensions doesn’t seem like it should have a generic answer. Some parts of the data are relevant to some problems and irrelevant to others. But once you have a specific problem, you’re out of truly unsupervised territory. So what’s a good application of unsupervised learning outside of the trivial case of low dimensions?
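
    A quick numerical illustration of the distance-concentration point in 1 above (my own sketch, not from the cited paper; the sample size, Gaussian data, and Euclidean metric are arbitrary choices):

    ```python
    # Distance concentration in high dimensions (cf. Beyer et al., Theorem 1):
    # the relative contrast (d_max - d_min) / d_min shrinks as dimension grows,
    # so "nearest" and "farthest" neighbors become nearly indistinguishable.
    import numpy as np

    rng = np.random.default_rng(0)

    for d in [2, 10, 100, 1000]:
        points = rng.standard_normal((500, d))          # 500 random points in d dimensions
        query = rng.standard_normal(d)                  # a random query point
        dists = np.linalg.norm(points - query, axis=1)  # Euclidean distances to the query
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"d={d:5d}  relative contrast = {contrast:.3f}")
    ```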

  2. Also, the linked article with Yann LeCun is mostly nonsense. I appreciate the difficulty of communicating technical concepts to a general audience, but absolutely nothing is gained from stuff like this:

    LeCun believes the emphasis should be flipped. “Everything we learn as humans—almost everything—is learned through self-supervised learning. There’s a thin layer we learn through supervised learning, and a tiny amount we learn through reinforcement learning,” he said. “If machine learning, or AI, is a cake, the vast majority of the cake is self-supervised learning.”

    As humans, there’s nothing we learn from supervised learning, reinforcement learning, or unsupervised learning. We don’t compute loss functions on examples and minimize them, whether derivative-free or gradient-based. We don’t compute distance metrics or use epsilon-greedy exploration or any of that. All three are based on simplistic models of human learning, and can give a cartoon approximation (or, sometimes, an idealized version) of human behavior, but to assert that human learning actually exists somewhere in those frameworks subtly begs the question, by assuming we already have the sketch of strong general A.I. somewhere in our mathematical framework.

    • Even if we understand those types of learning informally, humans engage in a great deal of directly supervised learning from parents, from school, and from friends.

      I used to be a semanticist, so let’s bastardize an example of Quine’s. If I don’t speak French and see a dog peeing on the street and I hear someone point at it and say “chien”, then I might be confused about whether “chien” means peeing or “chien” means dog or whether it means something else. But if I’m in a restaurant and tell others that I’m going to “chien dans les toilettes”, they’ll first laugh and then probably directly correct me.

      But even without people doing the supervising, the world supervises me through feedback. If I try jumping over a wall and can’t make it, the world’s telling me it’s too high. That’s direct supervision on how high I can jump! If I stick my hand in fire and it gets burned, that’s direct supervision.

      • The odd thing to me is that your examples are colloquially “supervised”, but they don’t match my understanding of the technical term “supervised learning”. To me, supervised learning is a technique for building an approximating function where:

        * You have a set of observed pairs (x_i, y_i) from the product space X × Y.
        * You have a loss function L(y1, y2) where [arg inf L(y1, y2) over y1 in Y] = y2, i.e., the loss is minimized when the prediction equals the target.
        * You try to select your approximator fhat from a class of functions F such that fhat = arg inf E[L(f(x), y)] over f in F.
        * Since the generating distribution of x and y is unknown, in practice you minimize sum_i L(f(x_i), y_i) with some kind of regularization.

        In the language example, the child does not observe y_i, the correct word for that context. The colloquial sense of supervision, where there are rewards and penalties but no direct data from the output space, seems more analogous to reinforcement learning (though it’s still a loose analogy in my opinion). That’s part of what bothers me about this kind of mixing of colloquialisms and technical jargon.
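
        A concrete toy version of the recipe above (my own sketch: squared-error loss, a linear function class, synthetic data, and a crude grid search standing in for “arg inf over f in F”):

        ```python
        # Empirical risk minimization sketch: pick fhat from F = {x -> w*x + b}
        # by minimizing the average loss over the observed (x_i, y_i) pairs
        # plus a small ridge penalty (the "some kind of regularization").
        import numpy as np

        rng = np.random.default_rng(1)
        x = rng.uniform(-1, 1, size=200)
        y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=200)   # observed pairs

        def empirical_risk(w, b, lam=1e-3):
            return np.mean((w * x + b - y) ** 2) + lam * (w**2 + b**2)

        grid = np.linspace(-3.0, 3.0, 121)                    # step 0.05
        w_hat, b_hat = min(((w, b) for w in grid for b in grid),
                           key=lambda wb: empirical_risk(*wb))
        print(w_hat, b_hat)                                   # should land near (2.0, 0.5)
        ```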

      • Word learning and vocabulary size are interesting questions. Although it’s disappeared into the chaos of my bookcases, I recently picked up a monograph from Benjamins (a linguistics book publisher), “Measuring Native-Speaker Vocabulary Size”, which is seriously depressing. (It’s excellent, with good discussions of history and techniques.) The current best estimates are that native _writers_ have vocabularies that are proportional to the amount of stuff that they write. Shakespeare was previously considered to have an enormous vocabulary, but it turns out that’s just an artifact of the large amount of stuff he wrote. That is, every time you write something, it’s about something different, and words from that new area that you hadn’t used before appear in your writing.

        Even worse, it seems that adult native speakers learn about 1000 new words a year. Every year of their lives. Year in, year out. Without rest or respite. Sheesh. How does one learn a second language well enough to do an undergrad degree in literature in it? (Is a joke question I am currently contemplating.)

        Even weirder, we often remember where we learned a particular word. I remember in high school, I had shown up in jeans and was being a jerk in math class, so I got sent to the headmaster’s office (for the jeans, which weren’t OK under the dress code, though I wore them anyway). The bloke who had the misfortune of having to deal with me said “You can be a jerk to us, but you really ought to think about connecting with your peers.” Although this wasn’t all that appropriate advice (I still have two close friends from that period), I realized that I hadn’t understood the word “peer” correctly. Yay! I learned a new word. But that has to happen three times a day, every day, day in day out, at least so that the linguists can write their monographs.

        And our understanding of each and every one of these ridiculous numbers of words is really really subtle and amazing. It often seems folks working in AI forget that.

    • “If machine learning, or AI, is a cake, the vast majority of the cake is self-supervised learning.”

      That’s an odd quote on its own. You could substitute anything for “cake” in that sentence.

      “If machine learning, or AI, is a sparkly unicorn on stilts singing ‘Iron Man’ backward, the vast majority of the sparkly unicorn on stilts singing ‘Iron Man’ backward is self-supervised learning.”

  3. If we can’t know anything until we have a final theory of everything, we will never know anything.

    We learn by trial and error plus memory, the same way that evolution made us–without any final theory. We will always have hypotheses that we communicate for others to criticize or refine. They may be incorrect or even misleading but they can only be supplanted by hypotheses which produce better results.

    • (Assuming arrogantly that you’re responding to me)

      I’m just not sure this qualifies as a theory. Compare these two statements:

      * In some cases, humans learn by reinforcement learning or supervised learning.
      * In some cases, humans learn by progressively improving their behavior in response to positive and negative reinforcement or by trying to match some exemplar.

      What predictions does the first make that the second does not, and that are not false?

  4. None of the usual machine learning techniques explain how a person (or an animal for that matter) can learn something very well in one or a few reps, in the right circumstances. How can children pick up dozens of words in a day or even a week just by hearing them spoken a few times, for example?

    • One piece of the puzzle is what people have called “inductive biases”, though if we’re not allowed to say “unsupervised” maybe we’re not allowed to say “biases” either.

      In essence, real organisms never really encounter truly “novel” situations, so they/we can make reasonable guesses about what is going on that are usually in the ballpark of correct. ML algorithms are designed to be general purpose blank slates, so every situation is “novel” to them. They are meant to do a lot of things well in the long run, but they can’t do anything well in the short run.

  5. I understood the difference between “unsupervised” and “self-supervised” “learning” to be this:

    Self-supervised learning creates its own labels from the raw data. For NLP models, the algorithm randomly removes words, and then has the model guess what the missing words are based on the surrounding words. For images, the algorithm randomly removes patches of each image and the model tries to estimate what’s in the missing patch. In these cases, what the algorithm randomly removes becomes the label, which provides the “supervised” portion of “self-supervised.”
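
    A toy sketch of that word-masking idea (my own illustration, not any particular library’s API): the removed token becomes the label the model would have to predict from context.

    ```python
    # Toy self-supervised labeling: mask some words at random and keep the
    # hidden words as the targets the model must predict from the context.
    import random

    random.seed(0)
    MASK = "<mask>"

    def make_masked_example(sentence, mask_prob=0.15):
        inputs, labels = [], []
        for tok in sentence.split():
            if random.random() < mask_prob:
                inputs.append(MASK)   # the model sees a blank here...
                labels.append(tok)    # ...and the hidden word is the label
            else:
                inputs.append(tok)
                labels.append(None)   # nothing to predict at this position
        return inputs, labels

    print(make_masked_example("the dog chased the cat around the yard"))
    ```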

    As opposed to “unsupervised” learning, which is the usual clustering, dimensionality reduction stuff where there really is no “ground truth” to compare the clusters or components to, so instead we find metrics to optimize under constraints or rely on the elbow method or whatever.
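
    And a rough sketch of the elbow-method side of that (assuming scikit-learn and a synthetic dataset of my choosing): run k-means for a range of k and eyeball where the within-cluster sum of squares stops falling quickly; note that no label is ever consulted.

    ```python
    # Elbow method sketch: no ground truth, just a curve of within-cluster
    # sum of squares (inertia) versus k, inspected for a bend.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # synthetic blobs

    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(f"k={k}  within-cluster SS = {km.inertia_:.1f}")
    ```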

  6. “If I were the lexical police, I would’ve blacklisted the term “learning,” because we’re really just estimating parameters (aka “weights”) of a statistical model.”

    You sound like me (in some of my comments here and elsewhere that get ignored or excoriated).

    If everyone had followed that rule, though, we’d have figured out that most of the things we want the computer to learn can’t be learned that way (e.g. the symbolic reasoning we failed to figure out how to do in the 70s and 80s) and we wouldn’t have bothered with the whole current round of AI.

    It’s a doubly unfortunate situation, since an accurate delineation of the space of things that can be learned by statistical models, neural nets, and the like would be good science, if the hypesters were not so busy insisting that these things will have “the ability to reason”. (That’s a funny article, though, since one of the “Rebooting AI” recommendations was that we look at how babies learn.)

    ” I felt bad saying “learning” even back when I worked in ML full time, but that’s probably just because I lived through the AI winter during which the quickest route to rejection of a paper or grant was to mention “artificial intelligence.””

    I find that papers, even outside AI, that have “learning” in the title often grate something fierce: one example was training slugs to avoid electric shocks, grinding them up, and injecting the sludge into another slug and claiming it had learned to avoid electric shocks.

    I punted AI before that AI winter hit. I wasn’t finding a problem to work on, and that didn’t bother me as much as it should have, since I realized that I’d be screaming at idiots (i.e. complaining about models that couldn’t possibly do what was needed/claimed) my whole life if I did. Speaking of being excoriated, I’m on an MIT Alum mailing list and they asked us advice about/for the new Comp. Sci./AI department/building/monster financial push. I wrote “Have a plan B that allows you to back out of/away from AI when the next AI Winter hits.” They never asked for my advice again.

  7. I’ve always found it a bit amusing how the ML field adopted terminologies mainly, it seems, on the basis of their sounding cool in a sci-fi sort of way. It would be very funny if that approach has now come back to bite them.

    • Zhou:

      Don’t attribute agency to “the ML field.” A field is just a bunch of people trying out different terminologies. Then some of them catch on, which sometimes is due to individual effort (for example, the terms “Bayesian data analysis” or “multilevel regression and poststratification”) but even then will only work if other people find the terms to be helpful for whatever reason, which could include sounding appealing to outsiders.

  8. I’m not a huge fan of the supervised/unsupervised dichotomy. I remember once trying to figure out what a semi-supervised method was and it was really confusing. The details seem to be more about where the data in the model comes from.

    I was a bit surprised to discover recently some parts of ML use “inference” in a very particular way: https://pytorch.org/docs/stable/generated/torch.inference_mode.html

    > InferenceMode is a new context manager analogous to no_grad to be used when you are certain your operations will have no interactions with autograd (e.g., model training)

    I guess you have to call things something, but if you’d asked me before I saw this to define inference in terms of ML, I’d have probably said training and inference are the same, or something.
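
    For what it’s worth, usage is just a context manager wrapped around the forward pass; a minimal sketch (the model and input here are placeholders I made up):

    ```python
    # torch.inference_mode() disables autograd tracking, so it is meant for
    # prediction, not training; outputs are not connected to the graph.
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)        # stand-in for a trained model
    x = torch.randn(3, 10)

    with torch.inference_mode():    # no gradients recorded inside this block
        y = model(x)

    print(y.requires_grad)          # False: the output is detached from autograd
    ```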
