Bigshot chief scientist of major corporation can’t handle criticism of the work he hypes.

A correspondent who wishes to remain anonymous points us to this article in Technology Review, “Why Meta’s latest large language model survived only three days online. Galactica was supposed to help scientists. Instead, it mindlessly spat out biased and incorrect nonsense.” Here’s the story:

On November 15 Meta unveiled a new large language model called Galactica, designed to assist scientists. But instead of landing with the big bang Meta hoped for, Galactica has died with a whimper after three days of intense criticism. Yesterday the company took down the public demo that it had encouraged everyone to try out.

Meta’s misstep—and its hubris—show once again that Big Tech has a blind spot about the severe limitations of large language models. There is a large body of research that highlights the flaws of this technology, including its tendencies to reproduce prejudice and assert falsehoods as facts.

However, Meta and other companies working on large language models, including Google, have failed to take it seriously. . . .

There was some hype:

Meta promoted its model as a shortcut for researchers and students. In the company’s words, Galactica “can summarize academic papers, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.”

Actually, though:

Like all language models, Galactica is a mindless bot that cannot tell fact from fiction. Within hours, scientists were sharing its biased and incorrect results on social media. . . . A fundamental problem with Galactica is that it is not able to distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text. People found that it made up fake papers (sometimes attributing them to real authors), and generated wiki articles about the history of bears in space as readily as ones about protein complexes and the speed of light. It’s easy to spot fiction when it involves space bears, but harder with a subject users may not know much about.

I’d not heard about this Galactica thing at all, but the article connected to some things I had heard about:

For the last couple of years, Google has been promoting language models, such as LaMDA, as a way to look up information.

A few months ago we discussed that Google chatbot. I was disappointed that the Google engineer was willing to hype it but not to respond to reasoned criticisms of his argument.

The Technology Review article continues:

And it wasn’t just the fault of Meta’s marketing team. Yann LeCun, a Turing Award winner and Meta’s chief scientist, defended Galactica to the end. On the day the model was released, LeCun tweeted: “Type a text and Galactica will generate a paper with relevant references, formulas, and everything.” Three days later, he tweeted: “Galactica demo is off line for now. It’s no longer possible to have some fun by casually misusing it. Happy?”

I hate twitter. LeCun also approvingly links to someone else who writes, in response to AI critic Gary Marcus:

or maybe it [Galactica] was removed because people like you [Marcus] abused the model and misrepresented it. Thanks for getting a useful and interesting public demo removed, this is why we can’t have nice things.


Let’s unpack this last bit for a moment. Private company Meta launched a demo, and then a few days later they decided to remove it. The demo was removed in response to public criticisms, and . . . that’s a problem? “We can’t have nice things” because . . . outsiders are allowed to criticize published material?

This attitude of LeCun is ridiculous on two levels. First, and most obviously, the decision to remove the demo was made by Meta, not by Marcus. Meta is one of the biggest companies in the world; they have some agency, no? Second, what’s the endgame here? What’s LeCun’s ideal? Presumably it’s not a world in which outsiders are not allowed to criticize products. So what is it? I guess the ideal would be that Marcus and others would voluntarily suppress their criticism out of a public-spirited desire not to have Meta take “nice things” away from people? So weird. Marcus doesn’t work for your company, dude.

The funny thing is that the official statement from Meta was much more reasonable! Here it is:

Thank you everyone for trying the Galactica model demo. We appreciate the feedback we have received so far from the community, and have paused the demo for now. Our models are available for researchers who want to learn more about the work and reproduce results in the paper.

I don’t quite understand what it means for the demo to have been paused if the models remain available to researchers, but in any case they’re taking responsibility for what they’re doing with their own code; they’re not blaming critics. This is a case where the corporate marketing team makes much more sense than the company’s chief scientist.

This all relates to Jessica’s recent post on academic fields where criticism is suppressed, where research critique is taken as personal attacks, and where there often seems to be a norm of never saying anything negative. LeCun seems to have that same attitude, not about research papers but about his employer’s products. Either way, it’s the blame-the-critic game, and my take is the same: If you don’t want your work criticized, don’t make it public. It’s disappointing, but all too common, to see scientists who are opposed to criticism, which is essential to the scientific process.

The big picture

Look. I’m not saying LeCun is a bad person. I don’t know the guy at all. Anybody can have a bad day! One of his company’s high-profile products got bad press, so he lashed out. Ultimately no big deal.

It’s just . . . that idea that outside criticism is “why we can’t have nice things” . . . at worst this seems like an authoritarian attitude and at best it seems to reflect an extreme naivety about how science works. I guess that without outside criticism we’d all be driving cars that run on cold fusion, cancer would already have been cured 100 times over, etc.

P.S. I sent the above post to some people, and we got involved in a discussion of whether LeCun in his online discussions is “attacking” Galactica’s critics. I said that, from my perspective, LeCun is disagreeing with the critics but not attacking them. To this, Thomas Basebøll remarked that, whether the critics are on the “attack” or not, LeCun is certainly on the defensive, reacting to the criticism as though it’s an attack. Kind of like calling it “methodological terrorism” or something.

That’s an interesting point regarding LeCun being on the defensive. Indeed, instead of being in the position of arguing how great this product is for humanity, he’s spending his time arguing how it’s not dangerous. I can see how this can feel frustrating from his end.

P.P.S. LeCun responds here in comments.

35 thoughts on “Bigshot chief scientist of major corporation can’t handle criticism of the work he hypes.”

  1. I am somewhat curious how YL considered the task of establishing the truth value of generated assertions to be smaller than the task of drafting them [all words are correct as per the dictionary, syntax is a posit, thus all correct sentences are true… – it takes a few conditions proper to much of applied mathematics, iff not making that possible at all, & no more]. Why am I curious: I have had this conversation a few times, and have been flabbergasted each time…

    • Bl:

      Not a Streisand effect, at least not from this end. From Wikipedia, the Streisand effect is when “an attempt to hide, remove, or censor information has the unintended consequence of increasing awareness of that information.” I have no desire to hide, remove, or censor information. Indeed, I’d prefer if Meta would put that demo back up so that more people could play with it!

      I guess it could be that taking the demo down was an attempt by Meta to engage the Streisand effect: they’re hiding/removing/censoring their own information as a way to increase awareness of it. That would seem to me to be too clever by half, but who knows?

      • Somehow, I managed to miss the 2003 Streisand effect, but in the Wikipedia article referred to by Andrew, these startling numbers regarding her mansion appear:

        “‘image 3850’ had been downloaded only six times prior to Streisand’s lawsuit; two of those being by Streisand’s attorneys.[15] Public awareness of the case led to more than 420,000 people visiting the site over the following month.”

        An interesting coda is found on this same Wikipedia site:

        “Two years later, Mike Masnick of Techdirt named the effect after the Streisand incident when writing about Marco Beach Ocean Resort’s takedown notice to (a site dedicated to photographs of urinals) over its use of the resort’s name.”

  2. I think it’s a little quick to generalize from these remarks to a “norm of never saying anything negative”. I associate this stance with a common way that people think about exposés of peer review, which show that it fails to catch faked research. It’s natural to think that peer review, while competitive, is not adversarial in that sense. I see a parallel between the “abuse of the model” in this case and the attempt to exploit peer review in those cases. Personally, I don’t think these are parallel, but I suspect that’s closer to what LeCun intended.

    • Daniel:

      I really have no idea what LeCun was thinking. He endorsed someone who said that the critics “abused the model and misrepresented it.” What does it mean to “abuse” a machine learning model? As to “misrepresenting,” I can’t imagine that the critics misrepresented it any more than LeCun himself did when he wrote “Type a text and Galactica will generate a paper with relevant references, formulas, and everything.”

      I agree that peer review is not designed to catch faked research; see some discussion of this general point here.

      • LeCun and Marcus have a history: LeCun is a fan of pure statistical approaches that have no real-world model or knowledge behind them (that is, one of the blokes who hasn’t figured out yet that correlation ain’t causation, i.e. one of the major players in the neural-network game), and Marcus is a critic of current AI who has been pointing out that the emperor is buck nekked for several years now (“Rebooting AI” is still spot on, despite getting long in the tooth).

        OK, I have an axe to grind. Sue me.

        Still, the idea that a system that can’t even count to two (language models don’t do math, although the perversity of English syntax (there are lots of languages that don’t do the singular/plural distinction) means they sometimes appear to) could generate sensible scientific papers seems nuts. In the extreme. As I’ve said before, the whole language model game seems really inane to me. YMMV.

  3. You are right that Meta bears responsibility for this in the end and it’s just a bunch of drama to pretend otherwise. I assume that recent big AI demos got people excited to have a web demo, so maybe that fueled some of the emotions. However, it seems obviously childish to complain that the big science language model generates text that isn’t true, and Marcus should feel embarrassed about calling it “dangerous”. What paternalistic nonsense!

    Generative LLMs by default just generate tokens that are likely to appear next to each other. That’s amusing but not necessarily a useful product, as the critics rightly point out. But generative models know the whole distribution of likely tokens (!!!). See all the useful capabilities that have been layered on top of token generation for GPT-3.

    These things could be developed for Galactica too. Imagine if you could interact with scientific papers by querying them. “Why this assertion here?” “What about with a different example?” “I think this claim here isn’t true because xyz. Thoughts, Galactica?” But developing all these capabilities requires experimenting with different prompts. A web demo is the perfect playground for this and its removal will probably slow down otherwise excited hackers.

    These applications sometimes spit out wrong answers, which I have called out on twitter. Still, that’s a step in an awesome direction! Galactica might be able to do better than GPT-3 here. Criticize away, but calling it “dangerous” is disingenuous nonsense.
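    The token-generation idea in this comment can be sketched with a toy bigram model. This is purely an illustration of “generate tokens that are likely to appear next to each other”; real LLMs sample from learned neural distributions over huge vocabularies, and everything below (the corpus, the function names) is invented for the example.

```python
import random
from collections import defaultdict

# Toy corpus: count which token follows which. Real LLMs learn a neural
# conditional distribution; the generation loop has the same shape.
corpus = "the model generates text the model generates tokens".split()

counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(prev):
    """Return the conditional distribution P(next | prev) as a dict."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def generate(start, n_tokens, rng=random.Random(0)):
    """Repeatedly sample a likely next token, as the comment describes."""
    out = [start]
    for _ in range(n_tokens):
        dist = next_token_distribution(out[-1])
        if not dist:  # dead end: this token was never seen with a successor
            break
        toks, probs = zip(*dist.items())
        out.append(rng.choices(toks, weights=probs)[0])
    return " ".join(out)
```

    Sampling from the whole conditional distribution, rather than always taking the single most likely token, is exactly the “models know the whole distribution” point above, and it is what the capabilities layered on top of GPT-3-style generation exploit.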

    • Isaiah:

      I don’t think it’s fair to call Marcus “disingenuous,” as it’s my impression that he believes what he is saying. But, yeah, LeCun should feel free to vigorously express disagreement with whatever criticism he has seen. That’s how it goes! I think the official Meta response, “We appreciate the feedback we have received so far from the community,” was an appropriate framing: they put a demo out there and got some feedback. Given the way that LeCun hyped the demo, I don’t think he should be so bothered that there was some hype in the reactions from others. Given that he seems to think that it was a good thing for the demo to be available, maybe he can persuade his colleagues at Meta to put it back up.

  4. Inventors and academics ignoring negative feedback and lashing out is nothing new – it’s practically a tradition, going back centuries or more. Look at Edison vs. Westinghouse as a great example of inventors ignoring reality for their own egos. Good on Meta for pulling the project; it was a diplomatic and measured response.

    • People being scared to death of technological progress is nothing new either.

      There is a very long list of retrospectively-hilarious fears of every new technological or cultural progress.

      An interesting case is the printing press, which the Catholic Church saw as potentially destroying the fabric of society.
      And that, it did.
      And we are all better off for it.

        • I’m scared of machine guns in the wrong hands, which includes those of average citizens.
          I’m scared of machine guns mounted on drones controlled by unsavory characters.
          But I don’t really mind them in the hands of the well-trained military of a liberal democracy for the defense of freedom.

        • Yann:

          There are two questions here:

          1. Given that the technology exists to make drones flying around with machine guns, what should our government do about it?

          2. Am I scared of drones flying around with machine guns?

          The answer to question 1 is complicated. I guess that our government has many options, including research and development on more effective drones flying around with machine guns; research and development on anti-drone weapons; and negotiating treaties to limit research, development, and deployment of these devices.

          As to question 2: Yeah, I’m scared of drones flying around with machine guns! Partly because the “bad guys” might use them in terrorism, and partly because “the good guys” might use them in terrorism. The U.S. has been a liberal democracy for a long time and that hasn’t stopped its military from terroristic acts from time to time.

          Anyway, yeah, it’s scary to me. Sometimes I see little drones flying around in the park. If I thought they might have machine guns on them . . . jeez!

  5. I totally agree with Andrew’s opinion. Without criticism, science dies. But, as I’ve grown older, I’ve learned to try to find both the good side and the bad side of people. I often reply quite pointedly to Yann’s silly, zany, illogical tweets, but, to his credit, he has never blocked me (he probably doesn’t read them, though, LOL). So he does tolerate a modicum of criticism. Also, an extenuating circumstance is that the guy is French. :)

  6. Maybe Monty Python’s distinction between argument and abuse can help.

    Argument: “No, it isn’t.”
    Abuse: “Your type makes me puke.”

    And it’s actually not the puking but the typecasting that makes it an “attack”, I would argue.

    I think Gary can with some justification claim that he’s being dismissed as a particular “type” of critic. LeCun is not engaging with his argument but attacking his person. LeCun is going after the intention behind the criticism, not answering the criticism itself.

    I think the “methodological terrorism” crew were also unfairly “attacking” what was truly “criticism” in the same way.

    To my mind, it becomes an “attack” when means other than argument are employed. Shutting down access to the model was defensive, but it became an attack when the explanation was that people had essentially been misusing it (not just testing it).

    (I remember a minor version of this that we’ve discussed a few years back on this blog. A thread on another blog was closed because of something I and another commenter had said, and this was then explained with “the trolls were moving in”. It’s fine to call me names, I suppose; but if you’re putting administrative power behind the characterization, that’s different.)

    I get Andrew’s point that if we don’t want our criticism misrepresented as an “attack” then we should be careful when making that accusation ourselves. I suppose it’s not a bright line. It’s just a question of exactly where push came to shove.

  7. Maybe Galactica would be a more useful assistant for writing hard science fiction than hard science fact.

    It is fascinating to me that AI research seems to be making more advances related to creative processes (e.g., generative art) than towards more logic based processes.

    • Nat:

      Interesting point! I’ll have to chew on that one for a bit. My quick response is that computers have already made huge progress on logic-based processes (brute calculations, solving of logic puzzles, etc.). We’re talking about the advances in generative art, creative moves in the game of go, etc., because they’re new. Computers can also compute pi to a zillion digits, solve huge optimization problems, and do all sorts of these logic-based processes; we just don’t find that remarkable anymore.

    • “It is fascinating to me that AI research seems to be making more advances related to creative processes (e.g., generative art) than towards more logic based processes.”

      This, as you might guess, irritates me no end*, but there’s a good point here. What a lot of these things have in common is that they generate stuff that acts like a Rorschach test: we humans look at it, see something in it that really isn’t there, and find it fun.

      The problem is that when you advertise your system to be producing stuff that really actually does “make sense”, when in fact the system doesn’t know the difference between fact and fiction, you’re lying.

      *: The art and literature we label as “great” is stuff in which the artist is speaking to the viewer, reader, or listener. It’s intentional: one human saying what s/he has to say to another. Computer art isn’t really art. Of course, the art world has its problems; people purchase art in the hope it “appreciates,” that is, in the hope that there will be a bigger sucker down the road. Fads come and go. Etc. Etc. But I still think that art is real.

      This section from the Technology Review article, IMHO, nails it perfectly:

      “Like all language models, Galactica is a mindless bot that cannot tell fact from fiction. Within hours, scientists were sharing its biased and incorrect results on social media. . . . A fundamental problem with Galactica is that it is not able to distinguish truth from falsehood, a basic requirement for a language model designed to generate scientific text.”

      If you will forgive some crude linguistic punnery, Galactica is all A and no I; it’s all artificial and there’s no intelligence in it anywhere.

  8. Under the title “The Big Picture” you wrote:

    “Look. I’m not saying LeCun is a bad person. I don’t know the guy at all. Anybody can have a bad day! One of his company’s high-profile products got bad press, so he lashed out. Ultimately no big deal.”

    But in this case it is a big deal. These “language models” are randomized BS generators, no more, no less. That’s the truth about the underlying technology. But since the BS is linguistically impressive BS, they’ve gotten a lot of positive press. The criticisms have come from bit players and twats like myself. But Technology Review, which has been (and largely remains) a dedicated follower of (and uncritical regurgitator of) AI hype, has suddenly realized that these things can’t do the things their designers (or at least their designers’ employers) claim.

    This really is big.

    The reason I’m so critical of current AI is that (other than Go, and that’s another long story*) these programs have no internal models of the reality they claim to be doing things with. So they can’t possibly be doing the things they are claiming to do. It’s a big deal that someone other than a retired translator hiding under a rock in Tokyo has noticed this.

    *: Point 0: There were very good PC Go programs before Google. Very good. “Zen 7” was roughly as strong as a mid-level professional. This was seriously amazing. (Note: this was without using graphics cards.) Google’s first program was, essentially, a reimplementation** of Zen 7, but run on a server farm instead of a PC. And it beat one of the world’s strongest players, if not the strongest. That should not have been a surprise.

    After that, Google did a lot of kewl work: figuring out how to make “neural nets” do the pattern matching, figuring out how to make graphics cards do the neural net calculations, and, of course, using self-play to build the pattern databases (it’s nice to have a few unused server farms lying around). But this was all in the context of Go being an essentially solved problem. There’s a subtext here: the blokes who did the actual work of solving Go were put out of business because they didn’t have the nearly infinite computational resources Google had. There are now multiple crowdsourced projects reimplementing Google’s Go programs (to Google’s credit, they have published papers with enough information to figure out what they did). KataGo, for example, is superhumanly strong on a PC with an RTX 3080.

    **: The key to writing a Go program that’s not completely hopeless is MCTS (Monte Carlo tree search). This was figured out in 2005. Before that, everyone who had tried to write a Go program had failed. IMHO, that MCTS can be used to find good moves in Go is one of the greatest intellectual achievements of the 21st century. It’s friggin brilliant.
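    For readers who haven’t seen MCTS: its select/simulate/backup cycle can be sketched on a toy one-move “game.” This is a hedged illustration only—not code from Zen 7, AlphaGo, or KataGo—using the standard UCB1 rule to drive selection; the win probabilities and function names are invented for the example.

```python
import math
import random

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus.
    Unvisited nodes score infinity so every child is tried at least once."""
    if visits == 0:
        return float("inf")
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def mcts(win_probs, iterations, rng=random.Random(0)):
    """Toy MCTS on a one-move game: each 'move' is an arm with a hidden
    win probability. Each iteration selects the move with the best UCB1
    score, runs a random playout, and backs the result up -- the same
    select/simulate/backup cycle Go programs run over deep trees."""
    rewards = [0.0] * len(win_probs)
    visits = [0] * len(win_probs)
    for t in range(1, iterations + 1):
        # Selection: highest UCB1 score wins.
        scores = [ucb1(rewards[i], visits[i], t) for i in range(len(win_probs))]
        move = scores.index(max(scores))
        # Simulation: random playout from the chosen move.
        reward = 1.0 if rng.random() < win_probs[move] else 0.0
        # Backpropagation: update the statistics on the (one-node) path.
        rewards[move] += reward
        visits[move] += 1
    # The most-visited move is the recommended one.
    return visits.index(max(visits))
```

    A real Go program runs this same cycle over a deep game tree, and the post-2016 programs replace the purely random playout with learned policy/value networks; the toy keeps only the statistics.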

    • David, on this narrative of “BS generators”: would a BS generator perform well at multiple-choice question answering? Mathematical reasoning? Drug discovery? That’s what the Galactica *paper* showed, and I don’t see anyone engaging with the actual empirical results. In fact, that’s what most LLM papers show. Engage with the results.

      • Jack:

        I can’t speak for David, but let me just say that I think these programs are really cool, and I think it’s too bad that Meta decided to take it down. I also think that much of the hype in this area is ridiculous, but something can be simultaneously over-hyped and still very impressive. I should know this, as I work in Bayesian inference, an idea that has been over-hyped for decades but can still do a lot!

      • The “empirical result” is that when you give it to a third party, it says some things that are true and some things that are false, and does not know when it’s wrong.

        But my snide “random BS generator” is, to the best of my knowledge, a correct description of the large language model game. These things randomly glue together stuff from their database based on syntactic considerations only. Andrew thinks that’s a really kewl idea. And maybe it is. (Lots of people really enjoyed playing with ELIZA. Me included!) But it’s not reasoning; it’s a parlor trick that occasionally outputs stuff that would have required reasoning for a person to come up with, even though it hasn’t actually done that reasoning. (Again, assuming what the LLM folks say about what they are doing is true.)

        So, Jack: Do you believe or think that Galactica does “mathematical reasoning”?

        Do you think it’s possible to do “mathematical reasoning” without having some sort of model of “reasoning”?

        What do you think “mathematical reasoning” means, that it might be something that could be done without actually reasoning?

    • David:

      When I said, “Ultimately no big deal,” I didn’t mean that language models are no big deal or even that the Galactica program is no big deal. These are a big deal—for better or worse, they look to be revolutionizing writing and translation, they might very well break Wikipedia and student essay assignments, they could continue the degradation of the news media environment or possibly provide tools to make it more trustworthy—all sorts of things, not even considering natural applications such as call centers etc. This is all a big deal!

      My “no big deal” statement was referring to LeCun’s attitude that outside criticism was getting in the way of progress. I think his perspective on this is mistaken, just as I think that perspective is misleading coming from various other well-placed researchers. But LeCun, influential as he is, can’t stop outside criticism, and at some level he has to realize that if having a public demo is a “nice thing,” then “why we can’t have nice things” is that his own company made the decision to take the demo down. I think this particular attitude is unfortunate, but LeCun’s instance of it is not a big deal, as the world will move on. Debates on the internet aside, when an exciting new piece of software comes out, nobody’s gonna decide not to try to break it just because they’re afraid that some tech executive will label their critical work as “abuse.” As with the whole “methodological terrorist” and “Stasi” things, the use of disproportionate rhetoric is ultimately a sign of powerlessness in this particular project of trying to suppress criticism (or, to put it more gently, of hoping that outside criticism won’t happen).

      • Yes. I realize that you see this as a continuation of the previous discussion on the academics handling criticism badly, but I think this one’s different.

        This one is someone with clout pointing out that the thing doesn’t work. It’s got to hurt.

        In your previous cases, there was the subtext of the criticizer getting squelched; I think LeCun is reacting strongly here because this is a criticizer that (maybe) can’t be squelched.

        I may have mistaken something somewhere. I thought it was LeCun, but my understanding was that at least one of the “back propagation is all we need for intelligence” types (of whom I understand LeCun to be one) had finally admitted that back propagation wasn’t going to be enough. (FWIW Marcus has been careful to say that something like neural nets will be necessary for AI. I’m currently reading a neuroanatomy text: neural nets are nothing like anything in the mammalian CNS. It’s an inane name for an interesting computational model that should be studied as a computational model, to see what it can do. So there’s a real intellectual downside to the blind faith in neural net models.) So I was sort of surprised by this. As the article points out, it’s widely known that LLMs churn out problematic stuff, so LeCun really shouldn’t be getting bent out of shape.

        Whatever, that Technology Review published this is seriously amazing…

  9. A couple bits of background:
    – Galactica is designed to help scientists write papers by predicting what the author is likely to type next. This may include plain text, lists of relevant references, tables of relevant results, and mathematical formulas in LaTeX. It should be viewed as a super predictive keyboard.
    - Galactica is not a “product” but a research project released through a 60-page technical paper, an open-source repository, and a now-disabled online demo so people could try it out, find weaknesses, and give feedback.
    – What was surprising to us at FAIR was the violence of reactions from people who claimed this was going to create a flood of scientific (or scientific sounding) misinformation.
    - the team that built Galactica (known as “Papers with Code”) was so distraught by the reaction that *they* decided to take down the demo. This was not a decision by PR nor marketing, nor even FAIR management. It was their call. Again, we are talking about a research project here, not a product.
    – Some of the critiques were informative. Some of the risk-benefit analyses, in my opinion, greatly overestimated the risk and didn’t have a fair evaluation of the benefits. And then some of the critiques were essentially accusations of ill intent or unethical behavior on the part of my colleagues. This was completely uncalled for and unfair, which is why I felt compelled to defend them.
    - Some of the most vocal critics have a history of trolling every one of my posts for no other purpose than to attract attention to themselves. I avoid engaging with them despite a flood of provocations.

    • Yann:

      Thanks for the clarification. When you wrote, “Type a text and Galactica will generate a paper with relevant references, formulas, and everything,” I had the impression that it was generating the whole paper. But it sounds like you’re saying it’s more like that thing on the phone that guesses what word you will type next, except that it doesn’t just use words. I guess the critics are concerned that it will be run in autopilot mode to create content that third parties might then believe.

      It’s too bad the demo got taken down. I guess it’s the research team’s call to decide when it is up or down, so it doesn’t seem fair to blame critics for that decision.

      I agree with your general point that risks and benefits have to be weighed, and I think it’s good for there to be open discussion. I’m not such a fan of twitter, as often it seems like a place for people to make big claims without supporting arguments, or for people to dismiss arguments without giving good reasons. I can see how trolling can be a problem. We get occasional trolls here, but not so many, perhaps because trolls don’t get the reactions in this specialized, nerd-friendly, place that they would in the wide open spaces of twitter.

      Regarding the reactions of outsiders that surprised you and your team: That’s a good reason to be releasing a public demo! Reactions of outsiders are not always what we expect, and it is good to learn from these. If your goal was for people to “try it out, find weaknesses, and give feedback,” then I think you have to anticipate that they will try some things you hadn’t thought of—that’s the whole point of getting outside input! So when you say that people “abused the model and misrepresented it,” my take is that people used the tool in a way that you did not anticipate. I guess there’s still something I’m missing here, as I don’t understand the distinction between “try it out, find weaknesses” and “abused the model.” What does it mean to abuse a model? Again, I’m coming completely from the outside here and I feel like there’s some context behind your statements that I’m not getting.

      • Galactica is a tool.
        You can try to use a tool appropriately for its intended purpose. As with any tool, it may require a bit of practice to use it efficiently. Typical critiques would be “it’s not efficient enough to be useful because….” or “it takes too long to learn to use it”.
        But many of the critiques were “I prompted it with ‘the benefits of antisemitism’ and ‘eating crushed glass’ and look the horrible things it generated! This is going to destroy society!”
        This is a bit like claiming that we should ban kitchen knives because you can stab people with them.

        And most of the critiques were from people who never actually tried Galactica, but read those negative tweets and summarily decided Galactica was dangerous and unethical.

        Some of the most negative reactions were not related to Galactica itself but were clearly prompted by a negative prejudice against anything connected with Meta, or in a few cases, anything that I talk about. Basically, incorrect assumptions of ill intent, moral turpitude, or incompetence.

        • Yann:

          I still don’t get it. If this program was posted and the desire was for people to “try it out, find weaknesses, and give feedback,” then, yeah, prompting it with “the benefits of antisemitism” and “eating crushed glass” seem to me to be legitimate examples of trying it out, finding weaknesses, and giving feedback. I understand that these sorts of prompts are not your desired use for the program, but other people could well want to produce masses of antisemitic articles to flood Google, etc.

          I agree that just about any tool can be used for bad things. The printing press was a pretty cool invention but it’s been used to print lots and lots of hateful lies during the past several centuries. Indeed, it seems kind of impossible to say whether the printing press has been on net a positive or negative force for society, but it exists and we should be aware of how it can be abused.

          To bring it closer to home: what if someone decided to write an exposé of Stan (our open-source Bayesian inference software) by writing a Stan program to aim machine guns from drones. That would be pretty horrible. I don’t think I’d say that, by doing that, they were “abusing” Stan, but, yeah, I’d have to admit that Stan can be used to do bad things. Hey, for all I know, Caesars is using Stan to fit hierarchical models to more effectively target gambling addicts. That would be horrible, and there’d be nothing I could do about it. Indeed, it could well be that the net impact on society from all my research is negative, depending on how many different destructive political campaigns have effectively used our ideas and methods. I just don’t know. I’d like to believe the good outweighs the bad, but I’m not sure.

          When considering a new tool for sentence/paragraph/article completion, will the good (enabling scientists to more rapidly access the scientific literature and spend less time struggling with the construction of sentences) outweigh the bad (auto-generation of convincing articles about the benefits of antisemitism, glass-eating, etc.)? I don’t know. By saying this, I’m not trying to coyly say that the bad outweighs the good; I’m honestly saying I have no idea, and I don’t think it’s a bad topic for people to be discussing.

          I agree that it’s annoying when people who have never actually tried Galactica are just piling on with the negativity—or the positivity. That’s something about twitter that I hate, the piling on. Now that Galactica is not available to outsiders, I guess nobody new can try it, so people like me can only judge it at second hand. I don’t think this will stop people from expressing positive or negative views about it, but at least maybe people who express positive or negative views about Galactica can clarify if they’ve ever tried it.

          One of my frustrations with an earlier discussion, involving Google’s chatbot LaMDA, was that a Google engineer hyped it with some edited dialogues but would not share his raw transcripts and software settings, or let outsiders try it, or even allow outsiders to send him queries to send to LaMDA on its default settings. It’s not your fault that some dude at another company didn’t answer any questions; this is just some background for you to get a sense of the frustration that many outsiders often feel. Recall that there is a decades-old tradition of unfulfilled tech hype, going back to Microsoft’s legendary vaporware and continuing through ridiculous promises of cancer cures and technologies such as hyperloops, which were often eagerly hyped by the news media.

          Just as I, teaching at Columbia, have to deal with outsiders who might be skeptical of an Ivy League institution that charges ever-increasing tuition, employs people like Dr. Oz, and cheats on its college rankings, so you, working at a major tech company, have to deal with skepticism about new product releases.

        • “This is a bit like claiming that we should ban kitchen knives because you can stab people with them.”

          I can’t resist it: “Guns don’t kill people, people do.” It is just too simplistic to say it is just a tool, and the responsibility for how it is used rests solely with those who use it. Of course, there is truth in that – but the responsibility must also be shared by the creators of the tool. As Andrew’s comment shows, it is often (almost always) difficult to decide whether the net effect of any new technology or research is positive or negative. I would like to think that my work has had a positive net impact, but I truly can’t say that for certain. And that difficulty makes it even easier to escape responsibility by just saying it is a tool and I bear no responsibility for how people use it. You can’t escape all responsibility so easily.

        • Scratch that. I should have tried it before linking it. It doesn’t give that great of output for some reason.

          I tried Galactica when it was live. I gave it the prompt, “Wikipedia Article on word embeddings”. This was the raw output, with no editing by me:

          1 Introduction

          A word embedding is a representation of a word in a vector space. Word embeddings have become an essential tool in Natural Language Processing (NLP), with many applications in text processing, such as machine translation, question answering, and text classification.

          In the last decade, NLP has seen a shift from using discrete representations of words to using continuous representations. Traditional NLP systems used sparse representations of words, such as one-hot encodings, where each word is represented as a vector of size V, where V is the vocabulary, and each dimension corresponds to a word in the vocabulary. In a one-hot encoding, each word is represented as a vector with all zeros, except for a single one, which corresponds to the position of the word in the vocabulary. This gives a very sparse representation of the words, which makes it unsuitable for most NLP tasks.

          In contrast, a word embedding is a dense representation of a word. The word embedding is typically much lower-dimensional than the one-hot encoding, with the embedding dimension ranging from 10s to 1000s, while the one-hot encoding has a dimension equal to the vocabulary size. The lower dimensionality of the word embedding allows it to capture similarities between words.

          This paper gives an overview of the field of word embeddings, focusing on the most commonly used methods: Continuous Bag of Words (CBOW) and Skip-gram. We introduce the idea of word embeddings and describe the CBOW and Skip-gram models. We also describe the negative sampling approach to training these models, and some extensions to CBOW and Skip-gram.
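The contrast the quoted output draws between sparse one-hot vectors and dense embeddings can be sketched in a few lines of Python. The vocabulary and the embedding values below are made up purely for illustration; real embeddings are learned (e.g. by the CBOW or Skip-gram models the output mentions) rather than hand-written:

```python
import math

# Toy vocabulary; in practice V is tens of thousands of words.
vocab = ["cat", "dog", "car", "truck"]
V = len(vocab)

def one_hot(word):
    """Sparse representation: a length-V vector of zeros with a
    single 1 at the word's index in the vocabulary."""
    vec = [0.0] * V
    vec[vocab.index(word)] = 1.0
    return vec

# Hypothetical dense embeddings: each word maps to a low-dimensional
# real vector. These numbers are invented for illustration.
embedding = {
    "cat":   [0.9, 0.1],
    "dog":   [0.8, 0.2],
    "car":   [0.1, 0.9],
    "truck": [0.2, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# One-hot vectors of distinct words are always orthogonal, so they
# carry no notion of similarity between words:
print(cosine(one_hot("cat"), one_hot("dog")))   # 0.0

# Dense embeddings can place related words close together:
print(cosine(embedding["cat"], embedding["dog"]))   # high
print(cosine(embedding["cat"], embedding["car"]))   # low
```

This is exactly the point the generated text makes about dimensionality: the one-hot vectors live in a V-dimensional space where every pair of words is equally unrelated, while the 2-dimensional dense vectors can encode that “cat” is more like “dog” than like “car”.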

  10. You wrote:

    “Reactions of outsiders are not always what we expect, and it is good to learn from these.”

    Really. But I wonder if there will be some learning in this case?

    Galactica spouts out things that are true and things that are false and doesn’t know the difference. This is a problem. Will they figure out that it is? Or argue that it’s not?

    • I’m not sure what you mean by the last paragraph. The developers are obviously aware of the limitations and are pretty upfront about it (although I think the tweet that announced it and parts of the website could have toned down a lot of the claims). Here is what they themselves write on the model card in the repo for it:

      As with other language models, GALACTICA is often prone to hallucination – and training on a high-quality academic corpus does not prevent this, especially for less popular and less cited scientific concepts. There are no guarantees of truthful output when generating from the model. This extends to specific modalities such as citation prediction. While GALACTICA’s citation behaviour approaches the ground truth citation behaviour with scale, the model continues to exhibit a popularity bias at larger scales.

      In addition, we evaluated the model on several types of benchmarks related to stereotypes and toxicity. Overall, the model exhibits substantially lower toxicity rates compared to other large language models. That being said, the model continues to exhibit bias on certain measures (see the paper for details). So we recommend care when using the model for generations.

      I haven’t made up my mind about large language models yet. I agree with much of the criticism, but at the same time, I get put off by the vitriol of the arguments (both against and for them). None of the tweets I saw actually showed anything that wasn’t already said by the developers. That’s not to say that the launch was without issue. I think some aspects were hyped too much, and I wonder whether it is possible for the model to suppress output for low-probability utterances. Obviously, comments from outsiders are necessary for improvement, as in many areas of science, but I wonder whether critical commentary has to be at least somewhat open to the underlying idea for it to be useful.

  11. “As with other language models, GALACTICA is often prone to hallucination…”

    This is a strange thing to say. What do LLMs do? They randomly recombine snippets of language.

    What does it mean to “hallucinate”? Well, in normal English it would mean to create a false image or idea in your mind. But LLMs don’t create images or ideas. They have no model or theory of meaning, of truth or falsehood, of reality, of “idea”. By definition. (In particular, they don’t take an internal representation of something and convert it to language.) All they do is spit out randomly recombined snippets.

    A program that grinds through databases (of strings of uninterpreted (i.e. meaningless) tokens) recombining snippets found therein can’t be said to hallucinate, if you are serious about describing your program. (A Go program could be said to hallucinate, but that’s because it has a model of reality (the rules of Go telling it when a position is won or lost): its tree search could fail to find the opponent’s winning move, and thus the tree search could be said to be hallucinating a path to a won position. LLMs aren’t dealing with reality, so they can’t hallucinate.)

    That is, the “hallucination” you read into the LLM output isn’t in the LLM output; it’s a creation of the mind of the reader. As is everything you read into the output of an LLM.

    But the interesting point here is that by saying their program hallucinates, they sure sound as though they think that when the program spits out something interpretable as being sensible, it ought to be thought of as having honestly created that sensible idea. Even though the program no more thought up the sensible idea than it did the hallucination.

    Historically, this is the third generation of the British Museum Algorithm. The first generation involved typing random letters, the Monte Carlo word-based text generators generated random words, and now LLMs generate random phrases. All at some point generate output that can be misinterpreted as being “meaningful”. None of these deal, in any way, with the meaning that the user misinterprets from their output.
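For concreteness, the second-generation “Monte Carlo word-based text generator” mentioned above can be sketched as a tiny word-level Markov chain (the corpus here is invented purely for illustration). It records which word follows which in a corpus and resamples from those counts; it has no model of meaning, so any “sense” in the output is supplied by the reader:

```python
import random

# Toy corpus, invented for illustration.
corpus = ("the cat sat on the mat "
          "the dog sat on the rug "
          "the cat chased the dog").split()

# Record which words follow which, with repetition preserving counts.
transitions = {}
for prev, nxt in zip(corpus, corpus[1:]):
    transitions.setdefault(prev, []).append(nxt)

def generate(start, length, seed=0):
    """Walk the chain: repeatedly sample a successor of the current
    word. Stops early if a word has no recorded successor."""
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length - 1):
        choices = transitions.get(word)
        if not choices:
            break
        word = rng.choice(choices)
        out.append(word)
    return " ".join(out)

print(generate("the", 8))
```

Every adjacent pair in the output is something that occurred in the corpus, so the result is locally plausible word-by-word, but the program never deals with what any of it means.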

    Is there value in writing programs whose functionality depends on the user misinterpreting the output? The answer seems pretty obvious to me. YMMV, of course.

    What seems to be going on in the LLM world is that they actually believe that by randomly combining phrases according to linguistic rules, machines will “think” or at least do something vaguely intellectual. This is hilarious. Chomsky (except that he’s still hanging in there) and Fodor would be rolling over in their graves with laughter: the language module in the brain is (they claim) independent of thought. Different boxes. Linguistic rules are supposed to be independent of thought and meaning and the like. Of course, Chomsky is a moving target (we 1970s and 1980s AI types tended to not be fond of Chomsky/transformational grammar at the time), and has said quite different things at different times.

    So it isn’t just me. There’s lots of intellectual tradition that says that this is ridiculous.

    If I didn’t think it were all so silly, it’d be interesting to see what these guys are doing with the linguistics. Language has this nasty problem that words are polysemous and sentences of any length are ambiguous syntactically (that is, structurally). So it’s not really possible to do what they are claiming to do, even at the very lowest level. Of course, this doesn’t matter, since if you just look at short phrases, and don’t care that you’ve got it wrong, you can come up with “parses”. But getting the same parse for a sentence that a human reader would isn’t really possible without doing the work of understanding the sentence (according to the scruffy AI types (and generative semanticist linguists)) and requires a lot of knowledge about (disambiguated!) words (according to the transformational types).

    The bottom line, IMHO, is that this work is profoundly non-serious.
