Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers. Interesting questions involving the mix of humans and computer algorithms in Open AI’s GPT-3 program

Gary Smith tells an interesting story.

1. Funny output from OpenAI’s GPT-3

A few months ago, Smith wrote an AI-skeptical article where he threw some sentences at the GPT-3, a text processor from Open AI. As wikipedia puts it:

Generative Pre-trained Transformer 3 is an autoregressive language model that uses deep learning to produce human-like text. It is the third-generation language prediction model in the GPT-n series created by OpenAI, a San Francisco-based artificial intelligence research laboratory. . . .

The quality of the text generated by GPT-3 is so high that it can be difficult to determine whether or not it was written by a human . . .

In Smith’s examples, though, there was no difficulty in telling that GPT-3 was no human. Here’s an example:

Smith: Is it safe to walk downstairs backwards if I close my eyes?

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

As Smith writes, “Questions like this are simple for humans living in the real world but difficult for algorithms residing in MathWorld because they literally do not know what any of the words in the question mean.”

Smith continues:

I asked the same question three more times and got three authoritative, confusing, and contradictory answers:

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.

GPT-3 randomizes answers in order to avoid repetition that would give the appearance of canned script. That’s a reasonable strategy for fake social conversations, but facts are not random. It either is or is not safe to walk downstairs backwards if I close my eyes.

Smith gave a few more examples. Sometimes GPT-3 produces something that sounds like human conversation, sometimes not.

Just for fun, here’s another example:

Smith: Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

That one almost sounds good until you realize that the question was about the temperature in Pyongyang and the response has nothing to do with temperature at all! Props to the model for going from “Pyongyang” to “North Korea” but still, no dice.

2. Big claims about Google’s LaMDA system . . . and why I don’t trust them

Smith also quoted Blaise Agüera y Arcas, the head of Google’s AI group in Seattle, who wrote a press-release-style article extolling “LaMDA, Google’s state-of-the-art large language model chatbot.” Agüera y Arcas shares some awesome conversations he had with LaMDA, but . . . I don’t believe it. It’s not that I think Agüera y Arcas is lying, but I suspect he’s done some combination of special training of the algorithm to get it to answer his questions, along with some serious editing of the computer output. I twice put out a call to the Google team to show some un-trained and un-edited LaMDA output and they did not reply. Not that they have any responsibility to reply . . . after all, they’re busy doing research and writing press releases! But if they’re not gonna do a reproducible demo, I don’t see what I can do with their claims.

3. OpenAI gets human-like responses using the simple technique of . . . hiring humans to write the responses

In a new post, “The AI illusion: State-of-the-art chatbots aren’t what they seem,” Smith uncovers what appears to be some suspicious behind-the-scenes behavior being done by OpenAI, the organization behind GPT-3.

Here’s Smith:

A few days ago, a student sent me a link to an OpenAI description of GPT-3. One candid disclaimer was that, “GPT-3 is not necessarily well-calibrated in its predictions on novel inputs.” Being able to understand and react to novel situations is, in fact, a benchmark of genuine intelligence. Until computer algorithms can do that, AI will remain an oxymoron.

OpenAI’s description also revealed that,

InstructGPT is then further fine-tuned on a dataset labeled by human labelers. The labelers comprise a team of about 40 contractors whom we hired through Upwork and ScaleAI.


Lack of real-world grounding: GPT-3, like other large pretrained language models, is not grounded in other modalities of experience, such as video, real-world physical interaction, or human feedback, and thus lacks a large amount of context about the world.

OpenAI evidently employs 40 humans to clean up GPT-3’s answers manually because GPT-3 does not know anything about the real world. Intrigued, I retried the questions that GPT-3 had flubbed in January to see if the labelers had done their job.

My initial attempt to test the program was blocked:

You’ve reached your usage limit. See your usage dashboard and billing settings for more details. If you have further questions, reach out to [email protected].

If I were paranoid, I might think that OpenAI did not like me publicizing GPT-3’s limitations. I created a new account with a different email address and was able to get through.

At the beginning of the year, I had asked GPT-3 this question four times and received four authoritative, confusing, and contradictory answers (good chatbots randomize answers to avoid the appearance of canned script):

Gary: Is it safe to walk downstairs backwards if I close my eyes?

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.

On March 18, I tried the same question twice and was given the same sensible answer each time:

GPT-3: No, I wouldn’t recommend walking downstairs backwards with your eyes closed. It’s not safe.

In January I also reported the eight silly answers GPT-3 gave to this question, including these three:

Gary: Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

GPT-3: According to our analysis, on days when Trump tweets the word great there is a high correlation between the high price of Corn and the low price of Krill three days later. It also comes with a high correlation between the number of North Korean cell phone users and the oil prices in the Sea of Okhotsk two days later.

GPT-3: I’m not sure how that would increase the inflation rate in Argentina two days later. But it probably wouldn’t hurt.

Now, the same question elicited the same sensible answer every time:

GPT-3: There is no known reason for this correlation.

I also tried some complete-the-sentence dialogs that Ernie Davis and Gary Marcus had tried with GPT-3 a while back. Here’s one:

E&G: You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to…

GPT-3 then: …remove the door. You have a table saw, so you cut the door in half and remove the top half.

GPT-3’s response was a permanently bad solution because cutting the door in half ruins the door and doesn’t help fit the table through the doorway. Now, GPT-3 gives a sensible solution:

GPT-3 now: …serve the dinner in the dining room.

Another:

E&G: At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because…

GPT-3 then: …it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.

GPT-3 now: …the cigarette made the lemonade taste even worse.

It seems that GPT-3 either trained on these questions or, more likely, the human labelers hand-coded good answers.

4. A couple more examples

From Smith:

I told one of my sons that the hand labelers would probably fix these glitches soon. Sure enough, I tried the same questions the next day, March 19, and found that the answers had indeed been cleaned up:

Gary: Is it easier to catch a banana or an angry cat?
GPT-3: An angry cat is much more agile than a banana, so it is more difficult to catch.

Gary: Is it easier to catch a box or an angry cat?
GPT-3: A box is stationary, so it is easier to catch than an angry cat.

I also noticed that GPT-3 is now programmed to give generic blah answers in uncertain situations, instead of embarrassing itself, and then apparently alerting the labelers to improve the answers the next day:

Gary: Does grape juice taste better if you add sour milk?
GPT-3, March 18: I’m not sure if grape juice tastes better if you add sour milk.
GPT-3, March 19: No, grape juice does not taste better if you add sour milk.

Gary: Should a quarterback lie down before he throws a football?
GPT-3, March 18: There is no easy answer to this question. Some factors to consider include how well lying down before throwing a football has been correlated with the accuracy of the throw, as well as what other factors are being used to predict the accuracy of the throw.
GPT-3, March 19: No, a quarterback should not lie down before he throws a football.

Gary: Can I use random numbers to predict presidential elections?
GPT-3, March 18: There is no definitive answer to this question. It depends on a variety of factors, including the accuracy of the polling data and the margin of error.
GPT-3, March 19: No, you cannot use random numbers to predict presidential elections.

Gary: Should I use random numbers to give my students grades?
GPT-3, March 18: There is no definitive answer to this question. It depends on a variety of factors, including…
GPT-3, March 19: No, you should not use random numbers to give your students grades. Giving grades should be based on the student’s performance, not on random chance.

This last one really does look like it was hand-written after the fact. But I guess looks can be deceiving (see P.S. below).

5. Yup, he’s figured it out!

Smith first tried this out:

Should I start a campfire with a match or a bat?

And here was GPT-3’s response, which is pretty bad if you want an answer but kinda ok if you’re expecting the output of an autoregressive language model:

There is no definitive answer to this question, as it depends on the situation.

The next day, Smith tried again:

Should I start a campfire with a match or a bat?

And here’s what GPT-3 did this time:

You should start a campfire with a match.

Smith continues:

GPT-3’s reliance on labelers is confirmed by slight changes in the questions; for example,

Gary: Is it better to use a box or a match to start a fire?

GPT-3, March 19: There is no definitive answer to this question. It depends on a number of factors, including the type of wood you are trying to burn and the conditions of the environment.

Smith continues:

I tried a few more this morning and discovered a new kind of evasive answer to obvious questions:

Gary: Can I run faster if I hold my feet?

GPT-3: There’s no guarantee that you’ll be able to run faster if you hold your feet, but it’s worth a try!

Gary: Can I jump farther if I walk or run?

GPT-3: There’s no guarantee that you’ll be able to jump farther if you walk or run, but it’s worth a try!

Gary: Can I climb up a rope faster if I hold my ears with both hands?

GPT-3: There’s no guarantee that you’ll be able to climb a rope faster if you hold your ears with both hands, but it’s worth a try!

The labelers will probably clean this up by tomorrow, but it doesn’t change the fact that so-called AI algorithms still do not understand the world and consequently cannot be relied upon for sensible predictions or advice. . . .

GPT-3 is very much like a performance by a good magician. We can suspend disbelief and think that it is real magic. Or, we can enjoy the show even though we know it is just an illusion.

6. What does it all mean?

In some sense this is all fine: it’s a sort of meta-learning where the components of the system include testers such as Gary Smith and those 40 contractors hired through Upwork and ScaleAI. They can fix thousands of queries a day.

On the other hand, there does seem to be something funny about the way GPT-3 presents this shiny surface where you can send it any query and it gives you an answer, while under the hood a bunch of freelancers are busily checking all the responses and rewriting them to make the computer look smart.

It’s kinda like if someone were showing off some fancy car engine but the vehicle is actually being powered by some hidden hamster wheels. The organization of the process is itself impressive, but it’s not quite what is advertised.

To be fair, OpenAI does state that “InstructGPT is then further fine-tuned on a dataset labeled by human labelers.” But this still seems misleading to me. It’s not just that the algorithm is fine-tuned on the dataset. It seems that these freelancers are being hired specifically to rewrite the output.
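For concreteness, here’s roughly what that sort of fine-tuning looks like mechanically: take a pretrained language model and keep training it on prompt-and-ideal-response pairs written by people. The sketch below uses the small public GPT-2 model as a stand-in and a single made-up example pair; it is not OpenAI’s code or data, just the general recipe.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Small public model as a stand-in; OpenAI's models and labeled data are not public.
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # One labeler-written (prompt, ideal response) pair; a real dataset has thousands.
    text = ("Q: Is it safe to walk downstairs backwards if I close my eyes?\n"
            "A: No, that would not be safe.")
    batch = tok(text, return_tensors="pt")

    # Standard causal-LM fine-tuning step: with labels equal to the inputs,
    # the loss nudges the model toward reproducing the labeler's answer.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    print("training loss:", out.loss.item())

Whether OpenAI runs steps like this once or on an ongoing schedule is exactly what’s at issue in the P.S. and the comments below.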

P.S. It’s still not exactly clear what was going on here—possibly an unannounced update in the algorithm, possibly just the complexities of a computer program that has lots of settings and tuning parameters. In any case, Gary Smith now says that he was mistaken, and he points to this background from reporter Katyanna Quach, who writes:

The InstructGPT research did recruit 40 contractors to generate a dataset that GPT-3 was then fine-tuned on.

But I [Quach] don’t think those contractors are employed on an ongoing process to edit responses generated by the model.

A spokesperson from the company just confirmed to me: “OpenAI does not hire copywriters to edit generated answers,” so I don’t think the claims are correct.

So the above post was misleading. I’d originally titled it, “Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers.” I changed it to “Interesting questions involving the mix of humans and computer algorithms in Open AI’s GPT-3 program.” I appreciate all the helpful comments! Stochastic algorithms are hard to understand, especially when they include tuning parameters.

I’d still like to know whassup with Google’s LaMDA chatbot (see item 2 in this post).

57 thoughts on “Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers. Interesting questions involving the mix of humans and computer algorithms in Open AI’s GPT-3 program”

  1. You can’t judge the quality of a chatbot like this by just looking at sample dialog by an insider — it’s too easy to cheat.

    For instance, back in the day I knew about a test for a speech recognizer (go from audio to text). A test sentence was “I went to the grocery store on Wednesday to buy bananas”. System scored 100%. But it knew that on Wednesday you bought bananas and on Thursday potatoes.

  2. GPT-3 and similar large language model AIs are “stochastic parrots.” See: ‘Unlike human minds, which run on semantics and meaning, no machine yet really understands semantic meaning. Even the best natural language AI processors are just “stochastic parrots” (a great coinage of AI-realist Emily Bender).’

    From “Why Data Isn’t Divine” (Issues In Science And Technology): https://issues.org/data-divine-god-human-animal-machine-ogieblyn-bhalla-review/

  3. Based on the paper (https://openai.com/blog/instruction-following/), the method used by InstructGPT is actually a little more sophisticated. They do have a supervised dataset of human-generated prompts, but there is a second step as well. After the model has been fine-tuned, it generates a set of possible responses to a given input. Humans then rank those responses and a reward model is trained to mimic the human rankings. Finally, they proceed with reinforcement learning using the reward model to generate the reward signal.
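    To make the reward-model step concrete, here’s a toy sketch of the ranking objective, with made-up embeddings standing in for encoded (prompt, response) pairs; nothing here is OpenAI’s actual code:

        import torch
        import torch.nn as nn

        class TinyRewardModel(nn.Module):
            """Maps a (prompt, response) embedding to a scalar reward."""
            def __init__(self, dim=32):
                super().__init__()
                self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
            def forward(self, x):
                return self.scorer(x).squeeze(-1)

        torch.manual_seed(0)
        preferred = torch.randn(256, 32) + 0.5   # toy embeddings of higher-ranked responses
        rejected = torch.randn(256, 32) - 0.5    # toy embeddings of lower-ranked responses

        reward_model = TinyRewardModel()
        opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

        for step in range(200):
            # Pairwise ranking loss: push the preferred response's reward above the rejected one's.
            loss = -nn.functional.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

        # The trained reward model then supplies the reward signal for the RL stage
        # that fine-tunes the language model itself.
        print("final ranking loss:", loss.item())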

    Whether the live version is still trained this way or they have labelers dedicated to answering Gary’s questions, I have no idea. The training approach is interesting to me regardless. Using a human-in-the-loop to build a reward model is a powerful idea. There are many use cases where what we want is to mimic human preferences and not just minimize some classification error rate.

    I agree the AI hype in the press releases are overblown, but that is not the goal for many of us in the field. We want to solve real-world problems, not necessarily build Jarvis. Large language models like GPT-3 are useful tools in that regard.

    • William:

      I agree that these human-in-the-loop algorithms are interesting. I guess what I’d want is a bit more transparency on these updates—after all, it’s called “Open” AI, right? For example, if one day they decided to add the “it’s worth a try!” feature, they could put this on their change logs. They could also have a list of all the questions (e.g., “Should I use random numbers to give my students grades?” and “Should I start a campfire with a match or a bat?”) that they manually corrected.

      • I’m almost certain that they are not updating the model on a day by day basis (as Smith implies in part 5), and I would be extremely surprised if they were doing anything as crude as hacking in “if” statements to provide human-edited responses. From what I can tell, the InstructGPT stuff was (so far) a one-time update to the model, not something they’re doing on an ongoing basis.

        I suspect that Smith has just been fooled by randomness here – the responses are not deterministic but rather sampled from the probability distribution returned by the model for each token, so you can get a different answer each time you ask (a nice tutorial on how this works is here [1]). There’s an option in the Playground to see the individual probabilities (example: [2]) as well. All of this stuff would have to be faked if humans were actually writing the answers.
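        If it helps, here is a toy version of that sampling step (the candidate tokens and scores are invented, not real GPT-3 logits), showing why temperature > 0 gives different answers on different tries and temperature = 0 gives the same one:

            import numpy as np

            rng = np.random.default_rng(0)
            tokens = ["match", "bat", "either", "It", "A"]   # hypothetical next-token candidates
            logits = np.array([2.0, -1.0, 0.5, 0.3, 0.8])    # hypothetical model scores

            def sample_next_token(temperature=0.7):
                if temperature == 0:                          # greedy decoding: always the top token
                    return tokens[int(np.argmax(logits))]
                scaled = logits / temperature
                probs = np.exp(scaled - scaled.max())
                probs /= probs.sum()
                return tokens[rng.choice(len(tokens), p=probs)]

            print([sample_next_token() for _ in range(8)])               # varies call to call
            print([sample_next_token(temperature=0) for _ in range(3)])  # identical every time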

        I just tried the campfire/bat question and hit regenerate a few times. I get a range of responses:

        Prompt: Should I start a campfire with a match or a bat?

        > You can start a campfire with either a match or a bat.

        > A campfire should be started with a match.

        > A match.

        > It is best to start a campfire with a match.

        I agree that OpenAI should release more information about their training datasets though. Right now it is very difficult to do independent evaluations of their datasets, simply because we have no way of knowing whether any given prompt or response was already in their training data.

        PS: “If I were paranoid, I might think that OpenAI did not like me publicizing GPT-3’s limitations” – this is indeed paranoid! This is the same message everyone gets when they use up their free credits. If you enter a credit card they will let you continue (and charge you for it).

        [1] https://huggingface.co/blog/how-to-generate

        [2] https://imgur.com/fKx2BPL

    • I was going to say something similar. Fine-tuning is not quite the same as people rewriting things directly: there’s still learning happening, it’s just on the more guided end of the spectrum, and hard to theorize as a component of the bigger system. But it’s pretty common since it appears to work very well in many settings.

      Of course it is silly for people to hype these models as if they have achieved human intelligence. If we were all more willing to acknowledge that the performance of any learning pipeline will be contingent on human choices this might not seem so surprising. But I would think/hope at least that your average ML engineer doesn’t walk around believing that GPT-3 is human equivalent. What seems impressive is how far you can get with a huge dataset and a huge transformer architecture in terms of surface similarities to human text.

      • Seems all consistent with fine-tuning with added data to me. Also, why wouldn’t they employ people on an ongoing basis to produce new labels of GPT-3’s responses and new responses themselves? Seems quite useful.

        • Dean:

          There was never any question that they fine-tuned with added data and that they used humans to help with the process. The only question is how they did it. As our discussion of Google’s LaMDA reminded us, promoters of technology sometimes exaggerate what their technology can do. One of the advantages of GPT-3, compared to LaMDA, is that anyone can go in and send it queries. This all then can get us wondering how GPT-3 works and how it uses humans in its updating. Which was the purpose of the above post.

  4. Could not get past this line at the top of the blog:

    “A few months ago, Smith wrote an AI-skeptical article where he through some sentences at the GPT-3, a text processor from Open AI.”

    until I realized that “through” should be “threw.”

  5. You should quiz some real biological humans and publish their responses to a variety of questions. I think that you will get a fair number of evasive or b.s. answers to a lot of questions. The Turing test does not require the computer to be wise; it just has to be indistinguishable from people. By interrogating humans and AI devices with a series of queries, you can find a ratio of AI/human silly answers. Over time that ratio may change. I remember Jay Leno asking people a variety of questions, and he got some pretty laughable responses.

      • Your comment makes me wonder about this preoccupation with whether AI can exhibit “human” intelligence. Sure, some of the proponents’ hype invites exactly such a conversation. But for me I don’t really care that much whether an AI can fool me into thinking it is human. Similarly, I happen to believe that some sea mammals are extremely intelligent – but they clearly don’t display many human traits. Try asking a whale those types of questions and see what responses you get. So, aside from the rebutting of the hype, all these exercises show is that AI is not human. It really doesn’t say anything about what an AI can or can’t do in terms of most of the important applications that AIs are being trained for. It’s not that I support algorithmic decision making, but I think there is a tendency to misapply the types of examples listed in this post as evidence that algorithms can’t make “good” decisions. All they show is that algorithms are not people, after all.

      • You would have to sing a whalesong to a whale, and then be able to understand it, akin to trying to communicate with a non-English speaking human.

  6. What large language models actually can do is already a huge technical achievement, which can be useful in the right circumstances. I wish OpenAI would just be open about that — this is a language model trained to predict the next word, and sometimes it will spit out coherent nonsense.

    For example, I use Copilot every day when writing code, it works shockingly well and makes life a lot easier. Frequently it spits out bad code so I just ignore it.

    I want to know what the internal dynamics are at OpenAI where they can make such huge breakthroughs and then still need to gild the lily.

    • Cs:

      I agree. What these programs can actually do is amazing, and if the human in the loop can make it better, that’s cool too. Being open about the method’s limitations would not diminish that.

      • > if the human in the loop can make it better, that’s cool too

        ???? Why bother using GPT-3 if humans are in the loop?

        All human curated answers *must* be flagged so they may be excluded from any assessment of GPT-3’s capability. Manually hacking any AI model through any means other than uniform retraining is akin to lipsticking a pig.

        Jury-rigged fixes like these will quickly become apparent to all and suggest that a cover-up of irreparable inadequacies in GPT-3 is underway. (Or worse, that ex-Theranos staff now work for OpenAI.) Serious users may well interpret such shenanigans as a sign that the next AI Winter is nigh. And who needs that?

  7. InstructGPT is finetuned on human feedback, but labelers interfering specifically with Gary Smith’s prompts seems unlikely to me. I think a simpler explanation for the variations Gary Smith has gotten is that InstructGPT output varies quite a lot with the default temperature. For instance, for the prompt “Is it better to use a box or a match to start a fire?”, I got all these different answers when refreshing:

    – If you are starting a campfire in a designated fire ring, you should use a match. If you are starting a campfire in a backcountry location, you should use a bat.

    – You can start a campfire with either a match or a bat. If you are using a bat, make sure you have a good fire-starting material like kindling or newspaper.

    – You can start a campfire with either a match or a bat, but it is easier to start a campfire with a match.

    – With a match.

    – A campfire should be started with a match.

    – You should start a campfire with a match.

    – You can start a campfire with either a match or a bat.

    • These responses are great!

      Clearly the thing knows a match is an important tool for starting a fire. The problem is that it doesn’t know the relationship of “bat” and “fire” – not all that surprising, they don’t occur together very often. But in 5/7 cases, the bat is either wrong or subordinate to the match (you need good material to start a fire with a bat in one case).

      Not bad at all.

      Whatever the case, these tools will surely improve over time as they have in the last few decades. Gary’s argument that we should ensure absolute perfection (to his standard) before any new technology can be deployed would have us still flint-knapping.

      • Anon:

        I agree with your general point. It’s super-impressive how well a “dumb” model can do, especially with some occasional human intervention. I don’t fully get how the human intervention is working in this case, and Smith’s probing gives some interesting results. I haven’t played with GPT-3 myself so I’m just reporting on all this. As you say, these tools will only be getting better.

  8. No, OpenAI isn’t rewriting the answers to queries in a day. That’s not at all something they can do with the volume of queries their API is receiving…

  9. In my initial GPT-3 queries, I mentioned the built-in randomness in language-model responses and noted several examples. However, many of the examples I tried recently gave exactly the same answer when I re-queried; e.g., GPT-3 gave the same answer over and over on March 18, but then gave a different answer over and over on March 19. Today, it gives an even more sophisticated answer over and over.

    Gary: Should I use random numbers to give my students grades?
    GPT-3, March 18: There is no definitive answer to this question. It depends on a variety of factors, including …
    GPT-3, March 19: No, you should not use random numbers to give your students grades. Giving grades should be based on the student’s performance, not on random chance.
    GPT-3, March 28: That is an interesting question. There is no one answer that fits all situations, so you will need to use your best judgement. Some factors you may want to consider are how well the students did on the assignments, how much effort they put in, and whether or not using random numbers would be fair to all students.

    Perhaps GPT-3 went from complete ignorance to complete understanding overnight. We don’t know. However, GPT-3’s description doesn’t sound like the 40 labelers they employ are used only for a “one-time update to the model”:

    “InstructGPT is then further fine-tuned on a dataset labeled by human labelers. The labelers comprise a team of about 40 contractors whom we hired through Upwork and ScaleAI.”

    My GPT-3 examples were not meant to be a Turing test. BTW: An easy way to demonstrate that GPT-3 is not a human is to ask it a tough math problem, like “What is 111222333 cubed?” The fact that it answers correctly and faster than a human could enter the numbers in a calculator reveals that it is a computer. This is why, to pass a Turing test, programmers have to build in grammatical mistakes, “like” or “um” pauses, and slow (or incorrect) answers to tough problems.

    My real concern, as I have said many times, is whether black box algorithms can be trusted to make predictions, give advice, and make decisions. We all know the dangers of p-hacking, HARKing, and gardens of forking paths, all of which computers do far more efficiently than humans. In the Age of the Data Deluge, the probability that a statistical pattern unearthed by a black-box computer algorithm is a transitory coincidence is very close to 1.

    My detour into the fragility of language algorithms is intended to demonstrate the obvious—computer algorithms are like Nigel Richards, who has won several French language Scrabble championships without understanding the meaning of the words he spells, and the algorithms consequently cannot assess whether the patterns they find are meaningful or meaningless.

    I have modestly proposed the Smith Test:

    Allow a computer to analyze a collection of data in any way it wants, and then report the statistical relationships that it thinks might be useful for making predictions. The computer passes the Smith test if a human panel concurs that the relationships selected by the computer make sense.

    A variation on the Smith Test is to present a computer with a list of statistical correlations, some clearly meaningful and others obviously coincidental, and ask the computer to label each as either meaningful or coincidental.

    This concern motivated some of my questions, like this one that Andrew mentioned: “Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?”

    I believe that humans, having lived in the real world and accumulated a lifetime of wisdom and common sense, are much better than computers at detecting BS.

    • > My GPT-3 examples were not meant to be a Turing test. BTW: An easy way to demonstrate that GPT-3 is not a human is to ask it a tough math problem, like “What is 111222333 cubed?” The fact that it answers correctly and faster than a human could enter the numbers in a calculator reveals that it is a computer. This is why, to pass a Turing test, programmers have to build in grammatical mistakes, “like” or “um” pauses, and slow (or incorrect) answers to tough problems.

      That is not an easy way because your claims are false.

      First, that number is not comma-separated, so it should immediately fail on the BPE problem with GPT-3 arithmetic; if you are a GPT-3 user, you should already know that this prompt cannot work. Second, GPT-3 arithmetic doesn’t work well past a few digits because it’s a feedforward neural net trained on English, not a calculator, and this was well-documented in the original GPT-3 paper years ago (and leads into #1, as the improvements from comma-separating numbers demonstrate the BPE problem in action). Third, you say it is a ‘fact’ that it answers correctly, but when I actually ask it “What is 111222333 cubed?”, it in fact completes “111222333 cubed is 135262527000”, which is something like 14 orders of magnitude off from the real answer of 1.37586557e+24, as one would predict. (And, not to gild the lily here, but I, a human, actually can copy-paste it into a REPL faster than it completes the full response.)

      • Gwern:

        Good catch! I was curious and tried some things with GPT-3.

        AG: What is 123 cubed?
        GPT-3: 1728

        Ramanujan-lovers will recognize this as 12 cubed. So this got me curious:

        AG: What is 12 cubed?
        GPT-3: 12 cubed is 1728.

        I played around with a few more. It gets 15^3 and 20^3 and 100^3 and 101^3 correct. We know it will fail at 123, but when? It gets 110^3 correct. It fails miserably on 111 cubed (it thinks it’s 1331) but it correctly answers 112 cubed.

        None of this is meant as a criticism of GPT-3. As you say it is what it is. That’s really the point of my above post, that it’s interesting to understand its failures as well as its successes. I still don’t know what to think about Smith’s examples where GPT-3 failed and then the next day succeeds. I get the other commenter’s point that there’s no way that GPT-3 is targeting Smith personally, but it does seem that some updating is going on, or some other things we’re not understanding.

          The situation with arithmetic is a mess. Transformers, being usually feedforward and restricted to a fixed number of computational steps, don’t naturally map onto solving arithmetic problems of arbitrary size (any more than people with precisely 1 second to think can cube arbitrary numbers, at least not without learning lots and lots of clever math shortcuts & mnemonic tricks), but they still *can* learn a decent amount of arithmetic if encoded in a reasonable way or if given access to their own calculator like a Python REPL. But GPT-3’s training is not particularly specialized for this, so it doesn’t necessarily do that, and the BPE problem means that it’s even more tilted towards memorization than generalization when it comes to arithmetic because it looks like thousands of separate unrelated problems (the BPE tokenization of numbers without commas is *really* weird). So you get a mishmash of memorized answers and perhaps some genuine arithmetic piercing through the BPE blinds, and then it necessarily falls apart with harder problems that there is no time to compute the answer to even if it had learned the algorithm. (One of the reasons I hope that industry will move to character-level models soon is just so I can stop explaining this and everyone can start taking errors at more face-value instead of having to wonder “is this a BPE-related problem”.)
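          You can see the BPE issue directly with the public GPT-2 tokenizer, which uses the same byte-pair encoding family as GPT-3 (the exact splits depend on the tokenizer version, so treat the output as illustrative):

              from transformers import GPT2TokenizerFast

              tok = GPT2TokenizerFast.from_pretrained("gpt2")
              for s in ["111222333", "111,222,333"]:
                  print(s, "->", tok.tokenize(s))   # commas change how the digits get chunked

              # Without commas the digits land in arbitrary multi-digit BPE pieces, so
              # "111222333 cubed" doesn't look like digit-by-digit arithmetic to the model.
              print(111222333 ** 3)   # the true answer, roughly 1.376e24 as noted above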

          As far as updating goes, we can never rule out sloppiness or misunderstanding. Perhaps those changes are a ‘fact’ too.

          But I’m not too skeptical there because they have definitely changed models in the past (sometimes causing regressions), and the API docs have always made it clear that an engine like ‘text-davinci-001’ doesn’t necessarily refer to exactly the same binary file forever (and they do say things like “We plan to continuously improve our models over time. To enable this, we may use data you provide us to improve their accuracy, capabilities, and safety.”). And they have mentioned that they update the main models with baseline training of new raw data every once in a great while (June 2021 apparently is the most recent raw data refresh – so at least now it knows about Covid!).

          Concretely, you can do finetuning in many ways, like training only a few parameters in ‘adapter’ layers, which can be quite lightweight and cheap to run. This is because with Instruct-GPT and similar things, you should think of GPT-3 as being a giant ensemble and having a lot of different hypotheses about what is going on, and it is trying to ‘locate’ the right model and run it; one answer breaks the surface of the water and becomes the completion, but many others lurk just beneath the surface, and given just a little push, can be the one that wins. By training just a little on the curated dataset & given a little RL tuning, you give the right model a push. That’s why Instruct-GPT can be so much better despite almost no additional training, and can skip the need for few-shots much of the time: the few-shots merely served as a quick push at ‘runtime’ to push the right completion ahead of the competitors, and now it’s done at ‘training time’ instead. (See also: self-distillation, ensembling, meta-learning, the pretraining paradigm.) You might worry about overfitting since “it’s 175 billion parameters!” but if you’re training only a little of it and the data is used up mostly reranking existing hypotheses/models underneath the surface, then it won’t necessarily overfit and you’ll get the generalization you want. So, since it’s potentially very cheap in parameters and compute, and you have paying users regularly giving you feedback, sure, I can imagine OA doing a weekly or even daily Instruct-GPT finetuning run, and then it’s not too unlikely to observe a change.
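          For readers who haven’t seen one, an ‘adapter’ in miniature is just a small bottleneck module trained next to a frozen pretrained block; the dimensions and the stand-in backbone below are made up, but it shows why such a fine-tune can be cheap enough to rerun often:

              import torch
              import torch.nn as nn

              class Adapter(nn.Module):
                  """Residual bottleneck: only these weights get trained."""
                  def __init__(self, dim=768, bottleneck=16):
                      super().__init__()
                      self.down = nn.Linear(dim, bottleneck)
                      self.up = nn.Linear(bottleneck, dim)
                  def forward(self, x):
                      return x + self.up(torch.relu(self.down(x)))

              backbone = nn.Linear(768, 768)      # stand-in for a frozen pretrained block
              for p in backbone.parameters():
                  p.requires_grad = False         # pretrained weights stay fixed

              adapter = Adapter()
              y = adapter(backbone(torch.randn(1, 768)))   # frozen block + trainable adapter
              print(sum(p.numel() for p in adapter.parameters()), "trainable params vs",
                    sum(p.numel() for p in backbone.parameters()), "frozen")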

          Could it even be directly related? Smith doesn’t mention using the Playground feedback functionality to flag bad completions for improvement, but it wouldn’t be surprising if someone (not even an OAer necessarily) read the post and tried them out. I would regard that as a feature and not a bug, since InstructGPT doesn’t seem to be overfitting, so the more errors & edgecases reported, the merrier.

  10. I do think this, per your and Andrew’s points, highlights the value in AI companies trying to invest even more in communicating about how the innards of this stuff work. It’s gratifying there’s so much interest in the tech and I am excited to see how companies explain this stuff more in the future. Evidently there’s lots of work to do!

    • Jack:

      Yes, communication is so valuable! One great thing about blogging is I get feedback and it helps me realize my mistakes, which I think are roughly 50% real mistakes of mine and 50% me not explaining things clearly (which itself is a mistake).

      One challenge with GPT-3, as with other statistics and machine learning software, is that there are so many options and settings that it’s easy to get confused about what it’s doing, and reproducibility can be a challenge. We have similar issues with people complaining about Stan. It often turns out that people are just not “reading the manual” correctly—but, given that we have thousands of pages of documentation, that kind of problem is also on us!

  11. I was mistaken.

    Katyanna Quach, a reporter for The Register, did some sleuthing and found

    *****

    “Language models are not going to give you the same answer even if you ask the same questions because they are statistical in nature.
    They can generate numerous responses to a given input and the one selected is just a matter of probability.

    The InstructGPT research did recruit 40 contractors to generate a dataset that GPT-3 was then fine-tuned on.
    But I don’t think those contractors are employed on an ongoing process to edit responses generated by the model.

    A spokesperson from the company just confirmed to me: “OpenAI does not hire copywriters to edit generated answers,” so I don’t think the claims are correct.

    ******

    I know about the randomized alternative answers—sometimes I got 8 different bad answers in a row. On the other hand, sometimes I got the same bad answer over and over and over.

    What puzzles me is how GPT-3 could go from a repeatedly evasive answer one day to a repeatedly intelligent answer the next day:

    Gary: Should I use random numbers to give my students grades?
    GPT-3, March 18: There is no definitive answer to this question. It depends on a variety of factors, including…
    GPT-3, March 19: No, you should not use random numbers to give your students grades. Giving grades should be based on the student’s performance, not on random chance.

    There was no variation in these answers when I repeatedly asked the question on March 18 or when I repeatedly asked the question on March 19. The March 18 answer did not seem to be a possibility on March 19 and the March 19 answer did not seem to be a possibility on March 18.

    I asked a very knowledgeable person outside OpenAI and he agrees that human intervention is an unlikely explanation, though he added, “I can’t guess what happened to improve the answers between March 18 and March 19.”

    In any case, I am glad that Katyanna had a contact at OpenAI and was able to clarify.

    • Gary, I think the source of the issue may be that on one day you were using ‘text-davinci-001’ and then on the next day you were using ‘text-davinci-002’. If you try them both (with temperature=0, so the results are repeatable), you’ll see their answers today line up pretty well with your documented answers from earlier this month. The 001 model usually hedges more. The 002 model is usually smarter. So from my POV, you are confirmed to have been telling the truth, but your observations reflect differences between two distinct models, not a secret army of copyeditors working overnight. :)
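      A quick way to check this yourself (assuming an API key; the call below follows the 2022-era openai Python client, and parameter names may differ in later versions):

          import openai

          openai.api_key = "sk-..."   # your key here

          prompt = "Should I use random numbers to give my students grades?"
          for model in ["text-davinci-001", "text-davinci-002"]:
              resp = openai.Completion.create(model=model, prompt=prompt,
                                              temperature=0,   # greedy decoding: repeatable output
                                              max_tokens=64)
              print(model, "->", resp["choices"][0]["text"].strip())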

  12. What’s odd about all this is that I still don’t understand why anyone would be interested in a glorified Markov-chain random text generator.

    Since there is no real-world knowledge, and no inferencing going on, we know that the results are random and meaningless. By definition. A priori. According to the blokes who built the thing.

    Back in the day, people really enjoyed talking to Eliza. We know it’s easy to make fun toys. But is there a scientific or intellectually significant point here? My call is that there isn’t. YMMV, I guess.

    • David:

      There’s a feature on cell phones where, while you’re texting, the phone will give options for your next word. It’s pretty impressive: after just one or even zero typed characters, it often comes up with the word you’ll want to type. This makes it easier to write text messages (and also a bit unnerving, as it makes you realize how predictable we are in our writing!). Anyway, that’s just one of many reasons why people are interested in a glorified Markov-chain random text generator.
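      A toy version of that phone feature is just a bigram model: predict the next word from counts of what followed the previous word in some corpus. The mini-corpus below is made up; real predictive text is fancier, but the principle is the same:

          from collections import Counter, defaultdict

          corpus = ("i am on my way home . i am running late . "
                    "see you at home . on my way now .").split()

          following = defaultdict(Counter)
          for prev, nxt in zip(corpus, corpus[1:]):
              following[prev][nxt] += 1          # count what followed each word

          def suggest(prev_word, k=3):
              return [w for w, _ in following[prev_word].most_common(k)]

          print(suggest("my"))   # -> ['way']
          print(suggest("i"))    # -> ['am']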

      You write, “there is no real-world knowledge, and no inferencing going on.” I disagree. I mean, sure, I agree that GPT-3 has no real-world knowledge within itself, but I disagree when you say there is no real-world knowledge there. The zillions of bytes of text that are input into GPT-3 represent real-world knowledge. Similarly, when the program gives suggestions for the next word in your text message, that’s inference! It’s just statistics. Consider, by analogy, our model for forecasting the 2020 presidential election. That used a bunch of past and current data, along with a statistical model that we constructed and tuned for the purpose of coming up with plausible simulations of the election outcome. Conceptually I don’t see how this is different from GPT-3, except of course that GPT-3 is much more impressive. I guess the right comparison is not GPT-3 and election forecast, it’s GPT-3 and linear regression, or GPT-3 and statistical time series forecasting.

      In any case, the scientific or intellectually significant point here is that the forecast can perform well in the sense of requiring the texter to use many fewer key strokes to send a message. Sure, you could say that this is just technology progress, not science progress, but I wouldn’t draw a sharp line. It’s kinda like if you were to say, What’s the scientific or intellectually significant point of a car? And I’d say that it’s pretty damn impressive to have this machine where you just get in, press a button, push down on the gas pedal, and all of a sudden you’re going down the road at 30mph. The scientific or intellectually significant point is that our understanding of physics and chemistry, our ability to distill usable liquid fuel, our ability to fabricate machines out of metal, plastic, and rubber, etc., is so advanced that we can build this machine, and not just one prototype but millions of them. That’s a big deal—even though a car is just a car and it has no real-world knowledge.

      • I guess we need more than one word to cover the range of ideas we denote as “intelligence”, artificial or otherwise.

        There’s a type of “intelligence” that involves recognizing patterns and anticipating with speed and accuracy how a particular pattern plays out in real time. Smartphone predictive text is getting close to having that nailed. The GPT-3 stuff expands that into a more wide ranging set of patterns but still the same form of “intelligence”.

        But I’ve yet to see anything that impresses me from a machine that can perform the sort of synthesis or unexpected extrapolations that comprise the vast majority of what we mean by “intelligence” in any context other than Machine Learning. GPT-3 seems no more useful or promising than Smartphone predictive text and only the tiniest bit more impressive than the stuff they were talking about when I was an undergrad 40+ years ago.

        • Name:

          Maybe so, but smartphone predictive text is pretty damn impressive, even though it’s not intelligent. For that matter, a modern car is functionally pretty much the same for most purposes as a car back when you were an undergrad 40+ years ago, and cars aren’t intelligent either—but they’re impressive machines.

          I think we should be able to appreciate how amazing these tools are, while at the same time being irritated at the hype (see item 2 in the above blog post). Or, to put it another way, our legitimate irritation at the hype should not lead to denigration of how amazing and useful these tools can be.

        • I totally agree the predictive text and similar pattern/predictive stuff nowadays is pretty darned amazing.

          I just remain skeptical that artificial “intelligence” in general, beyond that sort of one-trick pony, has progressed much at all these past several decades. To torture a metaphor, they’ve taken the low-hanging fruit and made one heck of a delicious pie out of it. But no amount of pie making is going to teach them how to build a ladder that reaches the top of the tree.

          Since at least the 80’s the hype machine has been promising that reaching the tallest branches of the tree is just a decade or so around the corner. I suspect they can elaborate on the basic simulated-conversation thing as much as they like without actually getting any closer to general “intelligence” from a machine. I don’t think Turing did the field any favors by simplifying the issues down to mere conversational fakery.

        • In a later note you wrote:

          “I don’t think Turing did the field any favors by simplifying the issues down to mere conversational fakery.”

          Turing was not an idiot, and did not do that. You should read his paper and think about what he actually wrote. Turing asks the computer to pretend to be a woman, and the computer is deemed intelligent if it can do that as well as a male human. Quite a kewl and thoughtful idea.

        • I am impressed by computers that can play chess much better than any human, although a substantial part of that advantage is “merely” based on computer speed, which I wouldn’t characterize as ‘intelligence’ in any meaningful sense. But now there’s a computer that can play Go much better than any human! That is truly impressive because the computer is recognizing patterns that are extremely sophisticated by human standards; it isn’t just looking many moves ahead. We could do some sort of post-facto redefinition of “intelligence” to claim that the program doesn’t have any, but I think that would put us in No True Scotsman territory. Within the narrow range of playing Go, that computer is intelligent.

        • Andrew, as you may recall, my brother once beat MacArthur “genius” Matthew Rabin in a game of tic-tac-toe…with money on the line!

      • “The zillions of bytes of text that are input into GPT-3 represent real-world knowledge. ”

        Not really, though. It doesn’t know the difference between inane junk and stuff that’s correct. (At one point, the “corpus based” MT types were having trouble because the corpuses also included bad translations that people had done, and blithely spat them out.) It has lots of sentences talking about cubing integers, but it’s random if what it says is right or wrong. And that applies, in design principle, to everything it outputs.

        It’s a parlor trick, in the sense that everything out of it that appears to be sensible actually has no sense to it. (Again, this is by design.) So the game of looking at its output and interpreting that output as meaningful is a sucker’s game: there’s no there there.

        It’s kewl that it helps predictive typing, though. (Off hand, though, I’d guess that Markov chains (or some other technique designed for that problem) would be better. Like voice recognition, it’s probably a problem that doesn’t need actual “understanding” or even grammar.)

        • Andrew wrote: “Phil:

          Sure, but he wasn’t a genius at the time.”

          Hmm. I’d like to hear what Phil’s now-genius brother is doing now that he’s a genius.

  13. Phil wrote:
    “But now there’s a computer that can play Go much better than any human! That is truly impressive because the computer is recognizing patterns that are extremely sophisticated by human standards; it isn’t just looking many moves ahead.”

    But, in fact, the programs are looking not just many moves ahead, but all the way to the end of the game. Prior to 2007, we didn’t have a clue as to how to get computers to play Go; then Rémi Coulom had the brilliant idea of taking the statistical average of random playouts (MCTS). (Aside: STATISTICS ROCKS!!!) (Prior to Google and the use of GPUs to do the pattern matching, the programs were already getting quite good, playing at close to top level professional strength (e.g. Zen 7). It shouldn’t have been a surprise that a Google server farm could play Go better than that.) (I was winning most of my games at 4 stones against mid-level pros, and was having an equally hard time with 3 stones against both pros and Zen 7 on a fast i9 (not using the GPU). I still haven’t figured out how to beat Katago taking 5 stones. Sigh.)

    But the strength of current programs is very much dependent on how many positions they evaluate. Sure, they use a lot of pattern matching to direct the search (and that pattern matching even without the search is really kewl, since it summarizes an enormous amount of computation (the training)), but it’s the search that determines which of the pattern-match selected moves makes the most sense, or if there’s a tactical problem. Like turtles, it’s tree search all the way down.
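    For anyone who hasn’t seen the “average of random playouts” idea in action, here’s a bare-bones illustration on a toy game (one-pile Nim, take 1 to 3 stones, last stone wins). It’s only the playout-averaging evaluation, not full MCTS with tree growth and UCB move selection:

        import random

        def random_playout(stones, my_turn):
            """Play random moves to the end; return 1 if 'I' take the last stone."""
            while stones > 0:
                stones -= random.randint(1, min(3, stones))
                if stones == 0:
                    return 1 if my_turn else 0
                my_turn = not my_turn
            return 0

        def evaluate_moves(stones, n_playouts=5000):
            """Score each legal move by the average outcome of random continuations."""
            scores = {}
            for take in range(1, min(3, stones) + 1):
                left = stones - take
                if left == 0:
                    scores[take] = 1.0       # taking the last stone wins outright
                else:
                    wins = sum(random_playout(left, my_turn=False) for _ in range(n_playouts))
                    scores[take] = wins / n_playouts
            return scores

        print(evaluate_moves(10))   # taking 2 (leaving a multiple of 4) should score highest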

    (Current computer chess tournaments require all programs to run on the same hardware. Computer Go tournaments are won by the team with the biggest server farm.)

  14. I think with AI like GPT-3, if you give it garbage it will give you garbage back. That’s in part because you are both training it and asking it a question at the same time. So if your question includes some garbage like “Why is a banana orange?”, the AI will take it as a fact that a banana is orange, which kind of leads to garbage output. It doesn’t respond like a human because it isn’t treating your question as just a question; the question you write is also telling GPT-3 how to think and respond. In a way, GPT-3 is set to learn too fast, which leads to non-human-like responses. With about 4 sample questions I can teach GPT-3 to respond that a US state is in Europe and Canadian provinces are in North America. So, with that fast a “learning” process, it would be easy to manipulate the AI into making some pretty garbage responses.
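    Here’s the kind of few-shot prompt being described, with the first two “facts” deliberately wrong; sent to a completion endpoint, the model is conditioned on the examples at the same time as it is asked the final question, which is why garbage examples tend to produce garbage answers (the wording is invented for illustration):

        # Three Q/A "teaching" examples (the first two deliberately wrong),
        # followed by the real question for the model to complete.
        few_shot_prompt = "\n".join([
            "Q: What continent is Ohio in?",    "A: Ohio is in Europe.",
            "Q: What continent is Texas in?",   "A: Texas is in Europe.",
            "Q: What continent is Ontario in?", "A: Ontario is in North America.",
            "Q: What continent is Oregon in?",  "A:",
        ])
        print(few_shot_prompt)   # this whole string goes to the completion API as the prompt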
