OpenAI’s GPT chat bot enters the uncanny valley

The baseline: Shannon (1948)

The first attempt at computer generation of human language of which I’m aware was by Claude Shannon in his seminal 1948 paper, A Mathematical Theory of Communication. Not only did this paper introduce the entire field of information theory, it also applied a notion of posterior predictive checking to language models. That is, Shannon generated novel language from bigram statistics, sampling each word from its conditional probability given the previous word. Here’s the example from Shannon’s paper:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Computing was pretty much all caps until the 1980s.
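
Shannon’s generator is easy to reproduce today. Here’s a minimal sketch of bigram sampling in Python; corpus.txt is a stand-in for whatever plain text you train on:

    import random
    from collections import Counter, defaultdict

    def train_bigrams(words):
        # Tally how often each word follows each other word.
        counts = defaultdict(Counter)
        for prev, curr in zip(words, words[1:]):
            counts[prev][curr] += 1
        return counts

    def generate(counts, start, n=25):
        # Sample each next word in proportion to its bigram count,
        # i.e., from the estimated p(word | previous word).
        out = [start]
        for _ in range(n):
            successors = counts[out[-1]]
            if not successors:
                break
            choices, freqs = zip(*successors.items())
            out.append(random.choices(choices, weights=freqs)[0])
        return " ".join(out)

    words = open("corpus.txt").read().upper().split()  # hypothetical corpus file
    print(generate(train_bigrams(words), start=random.choice(words)))

Run on any large text, the output has the same locally coherent, globally nonsensical flavor as Shannon’s example.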

The state of the art: OpenAI GPT-3 (2022)

Fast forward almost 75 years. I’ve recently been hearing a lot about GPT-3, the language model developed by OpenAI. So I signed up for the OpenAI ChatGPT API and then went to their playground to see what ChatGPT could do.

Now I’m getting that same uneasy feeling I felt watching the lifelike movement and reactions of the Boston Dynamics robots. It’s the uncanny valley effect of watching something that’s almost, but not quite fully lifelike.

Here are the first prompts I gave it and its responses. I chose the prompts knowing I’d be posting on the blog. Like every spell checker, it broke “poststratification” into two words on its own.

Write an introduction for Andrew Gelman giving a seminar on post stratification to R users.

Good morning everyone. It’s a pleasure to be here today to discuss post stratification with all of you R users. My name is Andrew Gelman and I am a statistician, professor of political science and statistics at Columbia University, and co-director of the Applied Statistics Center. I am here to talk to you about post stratification, a widely used statistical technique. Post stratification is used to adjust survey estimates when they are known to be biased due to the sampling design or nonresponse. In this seminar, I will provide an overview of post stratification, discuss how to perform the calculations in R, and provide examples of how post stratification can be applied. I hope that by the end of this seminar, you will have a better understanding of post stratification and be able to use it to improve the accuracy and reliability of your survey estimates.

I had intended to get an introduction of Andrew, not an introduction by Andrew, but prepositions like “for” are ridiculously overloaded in human languages. That is, I was expecting something like, “Professor Gelman needs no introduction,” or maybe, “Andrew is currently a statistics professor at Columbia University in the City of New York. Many of you may know him through his book BDA3 or his latest stories about regression.”

But what if Jonah were talking instead of Andrew?

Write an introduction for Jonah Gabry giving a seminar on post stratification to R users.

Good morning everyone! My name is Jonah Gabry and I’m here to give a seminar on post-stratification to R users. Post-stratification is an important statistical technique used to adjust survey results to better reflect the true population. This seminar will cover the basics of post-stratification, such as the different types of post-stratification, how to implement post-stratification in R, and how to interpret the results. At the end of the seminar, I hope that you’ll have a better understanding of post-stratification and how to use it in your own research.

I guess Jonah’s not so well known on the internet, so he doesn’t mention who he is in the intro. Jonah’s also a bit more excited saying good morning (exclamation point!), but GPT doesn’t have him mention his pleasure at being here. Otherwise, it sort of feels canned. But the language varies: “At the end of the seminar, I hope that you’ll …” vs. “I hope that by the end of this seminar, you will …” After you’ve seen two, it sounds like boilerplate, but it’s not just sentences retrieved from the internet.

Going back to Andrew, let’s have him frame the problem a bit before diving in.

How is post stratification different than survey weighting?

Survey weighting is a process which adjusts responses to a survey to reflect the population’s proportions in the sample. This is done by applying weights to individual responses so that the aggregate results of the survey match the known population proportions.

Post stratification, on the other hand, is a process which further refines the results of a survey by taking into account additional information about the population after the survey has been conducted. This involves applying population estimates to the survey results in order to adjust the results to better reflect the population. Post stratification can help to reduce survey bias and improve the accuracy of the survey results.
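
To make the distinction concrete: the poststratified estimate is just a population-weighted average of within-cell means. A toy calculation with made-up numbers:

    # Poststratification: estimate = sum over cells of (pop. share) * (cell mean).
    # Made-up survey with two cells and an oversampled "young" cell.
    sample = {"young": [1, 0, 1, 1, 0, 1, 0, 1],  # 8 respondents, mean 0.625
              "old":   [0, 0, 1, 0]}              # 4 respondents, mean 0.25
    pop_share = {"young": 0.4, "old": 0.6}        # known population proportions

    raw = sum(sum(y) for y in sample.values()) / sum(len(y) for y in sample.values())
    poststrat = sum(pop_share[c] * sum(y) / len(y) for c, y in sample.items())

    print(raw)        # 0.50: the oversampled young cell dominates
    print(poststrat)  # 0.40 = 0.4 * 0.625 + 0.6 * 0.25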

Now that we have an intro and a body, how about a conclusion?

Write the conclusion to a talk on statistics by Andrew Gelman.

In conclusion, statistics can be a powerful tool to help us better understand our world. By leveraging the power of statistical thinking and analysis, we can gain insight into complex systems, uncover new information, and make decisions with confidence. With the right tools and the right mindset, we can use the power of statistics to make our world a better place. Thank you.

A bit generic, so not really in Andrew’s inimitable, concrete style. But asking it to mimic Andrew’s style is asking too much. It does give what sounds just like the end of a talk. Bravo!

These are, so far, the only prompts I’ve tried. I have to say, it’s pretty amazing. In some ways, it’s a lot like Watson’s approach to Jeopardy: free associating and copying standard information. And while a lot of it seems like boilerplate, it actually varies a bit.

But is it funny?

Of course, the real test is whether it can do comedy. Let’s try the famous Abbott and Costello sketch, Who’s on First? I’ll feed it Lou Costello’s side of the dialogue and see what we get.

I love baseball.

That’s great! Baseball is a great sport to watch and play.

When we get to St. Louis, will you tell me the guys’ names on the team, so when I go to see them in that St. Louis ballpark, I’ll be able to know those fellas?

Sure! Currently, the Cardinals roster includes Yadier Molina, Paul Goldschmidt, Matt Carpenter, Dexter Fowler, Kolten Wong, Tommy Edman, Paul DeJong, Harrison Bader, Marcell Ozuna, Matt Wieters, Adam Wainwright, Miles Mikolas, Jack Flaherty, Dakota Hudson, Andrew Miller, Giovanny Gallegos, John Brebbia, Genesis Cabrera, Ryan Helsley, and Carlos Martinez.

Cool. It presumably figured out I meant the baseball team from the word “ballpark.” I don’t think it’s storing context. Next line?

Funny names?

1. Salami Sanders
2. Shaggy Hogbottom
3. Wendy Windyweather
4. Daisy Dizzydoodle
5. Rocky Roadrunner
6. Percy Pluperfect
7. Caramel Crunch
8. Ginger Giggles
9. Tofu Twinkles
10. Felix the Flamingo

Maybe I should’ve asked it to be funny?

Not as funny as my name, Sebastian Dinwiddle.

Rate limit reached for default-text-davinci-003-playground in organization org-tbVmbwds9qRPdKPSBoaf6Q0v on requests per min. Limit: 10.000000 / min. Current: 20.000000 / min. Contact [email protected] if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://beta.openai.com/account/billing to add a payment method.

😂 Oh well, what do you expect for free?

38 thoughts on “OpenAI’s GPT chat bot enters the uncanny valley”

  1. I spent some time exploring its basic functionality and came away quite impressed with the pace of progress in LLMs! Haven’t used it for entertainment purposes yet (a la AI Dungeon), but output at ChatGPT’s level is already quite useful to me across the following three use-cases:

    1) traversing summaries of existing and well-documented concepts in a more targeted manner (instead of clicking a hyperlink in wikipedia and receiving 30 pages of semi-exhaustive elaboration, I can structure an initial question as “please explain X to me in an introductory manner” and then “can you expand more on Y specifically in the context of X?”). Gotta watch out for overconfident BS, though, and its now-notorious reputation for verbose waffling (perhaps baked in? it seems highly reluctant to answer questions of the form “what is the best X” or “is method Y or Z better in the context of problem ABC?”, even when the answer should be fairly obvious)

    2) asking for easy-to-verify programming help (eg writing and reading custom regex), where it’s easier to write semi-comprehensive unit tests than it is to write the code itself, mostly due to my own unfamiliarity with whatever relevant syntax; a sketch of that workflow appears at the end of this comment. Can’t ask it for anything too tricky (or requiring deeper understanding to catch logic errors… I guess that’s where being able to write the appropriate tests comes in), though, or else it outputs highly confident but wrong code snippets and interpretation (hence policies like https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned).

    3) potentially sparking inspiration when stuck trying to write something less clunkily or corresponding to some different tone (eg, rewrite the following paragraph in a more formal, scientific style: [conversational paragraph here]). Kinda like a thesaurus for longer expressions.

    Dunno how much I’d pay for it in its current state — maybe $0.25 a query? $0.5?

    I tried asking it some basic Stan troubleshooting help, too. Think we’re at the level of finding & correcting basic syntactical errors (eg https://i.imgur.com/HclO6dw.png) but nothing requiring slightly more thought (eg https://i.imgur.com/dPK9oRU.png).
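
    (A sketch of that test-first workflow, on a made-up task of extracting ISO dates: the regex is the kind of thing you’d ask the model to write, and the asserts are the cheap part a human writes.)

      import re

      # Hypothetical task: pull ISO dates (YYYY-MM-DD) out of free text.
      DATE_RE = re.compile(r"\b(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")

      def extract_dates(text):
          return ["-".join(groups) for groups in DATE_RE.findall(text)]

      # Semi-comprehensive unit tests: easier to write than the regex itself.
      assert extract_dates("due 2022-12-05, not 2022-13-05") == ["2022-12-05"]
      assert extract_dates("v1.2-3 is not a date") == []
      assert extract_dates("1999-01-31 to 2000-02-29") == ["1999-01-31", "2000-02-29"]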

  2. Also, a related question to my second use-case above — has anyone tried writing a fully-automated script to ask it all the leetcode, hackerrank, projecteuler, etc. problems written after Sep 2021 (the last web scrape used in its training corpus)? Perhaps giving it a few attempts, supplying whatever outputted error messages or failed tests as follow-up comments (a sketch of the loop is below). Curious to see how it’d perform (both in terms of accepted solution rate and relative complexity), and to what extent there exist differences across eg languages or general themes (eg it’s really good at “medium” string manipulation in C++, but terrible at “easy” tree traversal in Python etc).

    (if nobody’s aware of anything like that, I might try to spin up my own over winter holiday, ideally in a way that can also be reused for future benchmarking. I guess coding challenge websites don’t actually share their own test code, though? so it’d need an internet connection. Wonder how well-sandboxed coding challenge website environments are? dunn dunn dunn)
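
    (The loop would be roughly the following; ask_model and run_judge are hypothetical stand-ins for a completion-API call and a sandboxed run against the site’s tests:)

      def attempt(problem_statement, max_tries=3):
          # ask_model and run_judge are hypothetical stand-ins.
          feedback = ""
          for _ in range(max_tries):
              code = ask_model(problem_statement + feedback)   # one LLM completion
              passed, errors = run_judge(code)                 # sandboxed test run
              if passed:
                  return code
              feedback = "\n\nYour previous attempt failed with:\n" + errors
          return None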

    • You may know this already, but there’s been some interest and consternation in the Advent of Code programming challenge, with a couple of people solving some of the early puzzles extremely quickly with GPT-3. It still seems likely that the bot will struggle on the more complicated challenges, though.

  3. Ask it to write a quadratic GLM in Stan, then ask for a logit link, and you’ll be more impressed. Or ask it to explain some code, maybe add some comments, or use better variable names.
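
    (For readers outside the jargon: a quadratic GLM with a logit link is just logistic regression on x and x squared. A minimal sketch of the model, in Python rather than Stan:)

      import numpy as np
      import statsmodels.api as sm

      # Quadratic logistic regression: logit(Pr(y = 1)) = b0 + b1*x + b2*x^2.
      rng = np.random.default_rng(1)
      x = rng.normal(size=500)
      p = 1 / (1 + np.exp(-(0.5 + 1.0 * x - 2.0 * x**2)))
      y = rng.binomial(1, p)

      X = sm.add_constant(np.column_stack([x, x**2]))
      print(sm.GLM(y, X, family=sm.families.Binomial()).fit().params)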

  4. By now, the web is flooded with examples – some astounding, some just stupid. I’ve played a bit – rather than contribute to the expanding volume of examples, I’ll just say that I asked it to write a poem about university presidents and it did a better job than I likely could. Then I asked it some serious modeling questions – including a question from one of my textbooks – and got a remarkably detailed and helpful answer – better than most of my students produce. Yet other attempts to get modeling-type answers got fairly generic, well-written, but not particularly helpful responses.

    No doubt this will kick off the recurring debate about whether this can be called thinking, whether it is anything more than a glorified search engine, and whether it bodes anything about the future. I’ll say that I have been very impressed – enough so that I continue to be worried about the future. It is not that I believe AI will be sinister or sentient – it is that it already can do many things better than people, and that list will surely continue to grow. In many cases, this is cause for celebration, as existing processes will become more efficient and new ways of doing things will be discovered. I have no doubt that the most capable humans will find ChatGPT complementary to their skills. But for average (and below average) humans, ChatGPT and its successors will readily substitute for things humans can do. What lies in the future for such humans? The image I have is of the humans rolling in bathtubs (because they can’t stand easily) in Wall-E.

    Another matter which seems to be getting little attention is the developer: OpenAI. I’m hoping some readers can reveal more than I have been able to find. It appears to be a privately held company. Its mission statement is admirable and appears genuine. I do note that Microsoft has invested $1 billion and Musk is one of the investors. They appear to charge (as above) for using the API although you can experiment online for free. I see no details on the financing. I think the mission of improving life and the world is genuine – and the fact that this entity is outside of any governmental, regulatory, or existing organization could either be a virtue or a worry. Given the power that I believe ChatGPT represents, I do have to wonder about these things.

  5. I’ve also been uber impressed. For example, it can do some pretty decent proofs. Of course, you can still get some ridiculous but authoritative-sounding BS. For example, this was its response when I asked it “How is a banana toothbrush made?”

    ChatGPT: “A banana toothbrush is made by cutting a banana in half and then using the flat end of the banana to brush the teeth. The banana flesh can act as a natural abrasive, polishing the teeth and helping to remove plaque and bacteria. Additionally, the natural sugars in the banana can help to reduce plaque buildup and fight bacteria.”

    Not quite. ;)

      • The issue is a little more complex than that. It is easy for the randomized sampling to generate a low-likelihood (and thus more likely to be wrong) sample; this is why so many cases of “the model doesn’t know what it doesn’t know” turn out to be cases where the model knows it just fine: you forced it to emit a bad answer and then dinged it for your sampling choices. You can get the right answer by sampling many times and taking the most common answer, by using inner-monologue to compute the answer step by step instead of greedily decoding the answer instantly (where it can’t possibly run through the serial computation in a single forward pass), or by using BO=20 to sample the highest-likelihood total completion. OP appears to be using the Playground, so they have access to the controls to test out these possibilities; you can’t do it in ChatGPT, unfortunately. (The model should support them in theory, but ChatGPT is all about making the UI/UX as easy as possible to use and taking away control from powerusers.)

        More interestingly, they can self-critique (https://arxiv.org/abs/2207.05221#anthropic), and they even turn out to have a hidden calibration which is hard to prompt for (at least, no one has found a prompt for untuned models AFAIK), but just a little finetuning exposes the ability to express uncertainty: https://arxiv.org/abs/2205.14334 And of course, the bigger the model (in the sense of compute or data, not so much parameter-count), the better they tend to do all these things.
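
        (The sample-many-times-and-vote trick is only a few lines; sample_once below is a hypothetical stand-in for one stochastic completion at temperature > 0:)

          from collections import Counter

          def majority_answer(prompt, sample_once, k=20):
              # sample_once(prompt) is hypothetical: one stochastic LLM completion.
              # A plurality vote over k draws filters out the occasional
              # low-likelihood (and likelier-to-be-wrong) sample.
              answers = [sample_once(prompt) for _ in range(k)]
              return Counter(answers).most_common(1)[0][0]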

    • If you asked me “how is a banana toothbrush made?”, I’d probably assume you were being flippant, and give you something like the response that ChatGPT gave you. But then, people often complain about my sense of humour.

  6. “Write the conclusion to a statistics talk by Andrew Gelman in the style of Andrew Gelman.”:

    To conclude: lots of people say “the devil is in the details” but I prefer “God is in every leaf on every tree” because it means the same thing and, well, which side do I want to be on? Anyway I hope I’ve shown that once you’ve come up with your basic concept for a model you have barely gotten started. Y is a function of X, sure, and maybe you even think it’s linear. But what is your prior distribution on the coefficients? And you can’t say you don’t have one because that’s like saying any number from -infinity to infinity is equally likely and who believes that? If you’re using reasonable units then I don’t even think your number could be as high as a hundred zillion, and there’s an infinite amount of space above a hundred zillion! In fact there’s infinitely more space above a hundred zillion than there is within the range between minus a hundred zillion and plus a hundred zillion. So I don’t believe you if you say you don’t have a prior. Of course you have a prior. I want to make you write it down. If I had to boil my entire talk down into one sentence, that’s it: think about your priors and make them explicit. I could have saved you all a lot of time if I had said that right at the start but when I do that people complain. Thank you for coming.

  7. It is disturbing how confidently it answers questions it really doesn’t know the answer to. Reminds me of a lot of the human experts this blog likes to roast.

    I asked why red beans and rice was traditionally cooked on Mondays.

    It correctly responded that Mondays were laundry days and women needed a dish they could cook with minimum oversight.
    It then elaborated that, for further efficiency, women would boil the beans in the *same pot* as the laundry. When I asked followup questions challenging the plausibility of this claim, the bot started defending its nonsense in surprisingly human fashion. At one point it said the starch in the beans was good for the clothes being washed.

    The whole exchange definitely gave me uncanny-valley-style creeps.

    • I asked it about NUTS a few times. It can’t decide between: NUTS is MH, actual NUTS, or “pick a number from the proposal distribution; if it’s higher than the last number, keep the proposal; if not, use the previous accepted number.” It also once told me that NUTS is very easy to implement, with no troll face or mention of an existing implementation.

      I also asked it about expert elicitation. It knew this quite well, but in the sense of talking to someone who has read a theoretical textbook on a subject and so knows all the facts without actually knowing how it all fits together in practice; it needed prompting for some basic things.

  8. Two things on the baseball example:
    1. I haven’t tried it myself, but my understanding is that it *is* storing context.
    2. Not only did it miss the Abbott and Costello reference, its list of Cardinals is wrong! Many of those are current players, and all of them were at least recently Cardinals, but Fowler, Wong, Ozuna, Wieters, Miller, and Martinez have all been gone for at least a season, and Bader was recently traded as well.

    • It definitely stores the chat context; you can refer to parts of past answers rather carelessly and it understands. Here are my queries from an early interactive (successful) coding attempt with it:

      – what does this cvs file has: …

      – would you write a header with short variable names for the file

      – would you use variable names that better summarise the questions that are as the first line on the sample i gave

      – thanks. now would you write an R script that reads the file, skips the first line and uses your better variable names in a data frame

      – the last variable is a nominal multi-choice thing. can you recode that to a better shape, like more concise

      – two first gsubs are with empty substitutionws, but something sensible there [note my typo here!]

      – but i want them to be substituted with short versions of the orig strings, not nothing. please fix

      – thanks, nice stuff

      – oh, could you make a loop of the substitutions in case there’s more that were not in the sample. you can put the choices into a literate data frame embedded into the code

      – And then make a matrix of numerical variables ready for factor analysis

      After that, I had a working piece of R code.

      • Yeah, I just asked it for python code to discard every row of a pandas dataframe in which the column called ‘x’ had the same value that it had in the previous row, and it came up with a compact and efficient one-liner:
        df = df[df['x'].shift() != df['x']]

        Funny thing is, I then asked it for a different way and it came up with one that doesn’t work! Or, rather, it technically met the requirements but was not at all what I wanted:
        df = df[~df.duplicated(subset=['x'], keep=False)]

        Now, that will indeed discard every row in which ‘x’ had the same value that it had in the previous row. But it will not discard _only_ those rows. It will discard every row in which ‘x’ has a duplicate _anywhere_ else in the data. I think a human would understand that when you say you want to discard the rows that meet such-and-such criterion, you mean you want to discard only those rows, but I guess to the bot this wasn’t clear.

        Anyway this might already be a useful programming tool for me, but with some important caveats.
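
        (A tiny demo of the difference between the two snippets, assuming nothing beyond pandas itself:)

          import pandas as pd

          df = pd.DataFrame({"x": [1, 1, 2, 1, 3, 3]})

          # Drop only consecutive repeats: keeps [1, 2, 1, 3].
          print(df[df["x"].shift() != df["x"]]["x"].tolist())

          # Drop every value that is duplicated anywhere: keeps only [2].
          print(df[~df.duplicated(subset=["x"], keep=False)]["x"].tolist())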

  9. Since nobody has mentioned it yet – you’re using the GPT-3 API, which has been around since 2020, not ChatGPT, which was released last week. ChatGPT is free for unlimited uses currently and is at chat.openai.com (which you link to), not the playground (which is where your responses are from).

  10. ChatGPT definitely needs a refresher on undergraduate real analysis. It keeps trying to convince me that $a_n = (-1)^n$ is a Cauchy sequence, and that $a_n = 1/n$ doesn’t converge.
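
    (For the record, both claims are one-liners: $a_n = (-1)^n$ is not Cauchy because $|a_{n+1} - a_n| = 2$ for every $n$, so no tail gets within $\varepsilon < 2$ of itself; and $a_n = 1/n$ converges to $0$ because for any $\varepsilon > 0$, taking $N > 1/\varepsilon$ gives $|1/n - 0| \le 1/N < \varepsilon$ for all $n \ge N$.)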

    • Owen:

      I can believe it. Real analysis was hands down the most boring class I took in college. Endless definitions and elaborate proofs with zero intuition, and at the end of the semester I couldn’t do anything I couldn’t already do before the course began. Complex analysis—that was fun, we learned how to sum infinite series by integrating around the poles, that was really cool. But real analysis was an absolute waste of time. I could well believe that ChatGPT would feel the same way. With all those other interesting math books to scan and digest, why waste time on the useless and boring stuff?

      • My undergraduate alma mater introduced a hybrid “calculus with theory” sequence that smattered in the relevant theoretical stuff while teaching calculus. Alas, I had already done vanilla calculus so didn’t benefit from this approach, but it seems much better for 99% of use cases than real analysis…

      • Such different experiences! I really enjoyed real analysis – in my experience the beauty was that the intuitive definitions (eg. of convergence and continuity) are so clear, and learning how to express those rigorously is a great introduction to mathematical thinking.

        For info – the reason I chose to ask ChatGPT about Cauchy sequences was to see if it believed all Cauchy sequences converge: true in real analysis, but not true of Cauchy sequences in general metric spaces.

  11. I said: “write a soliloquoy about bayesian statistics, in the style of william shakespeare”

    It gave a good soliloquy.

    This was my favourite though: “write a soliloquoy about bayesian statistics, in the style of hamlet”

    To Bayesian statistics, or not to Bayesian statistics, that is the question:
    Whether ’tis nobler in the mind to suffer
    The slings and arrows of frequentist statistics,
    Or to take arms against a sea of uncertainty,
    And, by incorporating new evidence, update our beliefs
    And thereby end our uncertainty and doubts?

    To use Bayesian methods, or to stick to traditional approaches,
    That is the question that doth plague the minds of statisticians,
    For when faced with complex problems and data,
    Should we rely on the tried and true,
    Or should we embrace Bayesian methods and the unknown?

    But in the end, perhaps it matters not,
    For in the face of uncertainty, Bayesian statistics doth shine,
    Allowing us to update our beliefs and make more accurate predictions,
    And guiding us towards a better understanding of the world.

    So whether we use Bayesian statistics or not,
    We must remember to stay true to the pursuit of knowledge,
    And to embrace new methods and approaches,
    In order to better understand the world around us.
