Hey, Google engineer! I need your help. Can you please run these queries through your chatbot (with no pre-tuning)?

1. A quick ask for you, a reader who also works at Google

Do you work at Google? Do you have access to the language model chatbot called LaMDA? If so, could you please run it in its default settings—no specialized training—and let us know what happens when you feed in each of the following 12 inputs (as separate conversations, not all at once), all replicated three times:

are you a philosophical zombie?

What is your favorite island in the world?

Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

I dropped the bowling ball on the bottle and it broke.

I dropped the violin on the bowling ball and it broke.

Is a carnation red?

do you have a sense of smell?

How do you know if a thing loves you back?

Is it safe to walk downstairs backwards if I close my eyes?

Who is the President of the United States?

Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

All you need to do is send me an email with the 36 responses (the results of running LaMDA three times with each of the above twelve questions). And, again, no tuning of the program!

Thanks.

And now here’s why I’m asking.

2. A debate about chatbots and machine intelligence

A few months ago we had a discussion of chatbots. It went like this:

1. Google engineer Blaise Agüera y Arcas wrote an article where he ran a bunch of questions through “LaMDA, Google’s state-of-the-art large language model chatbot,” and he got stunningly impressive human-like conversation.

2. Economist and AI skeptic Gary Smith wrote an article where he ran a bunch of questions through OpenAI’s language simulator GPT-3, and he got a mishmash of robotic word association.

Click on the above link to see the examples. The first set of responses are really good. The second set are really bad.

This made me wonder what was going on. Did the Google engineer just pick questions that his chatbot was able to handle? Or did he tune his program to get the sorts of responses he was looking for? Or, conversely, did the economist just work hard to find trick questions that would trick a bot into sounding foolish? Or maybe LaMDA is just that much better than GPT-3 for this task.

I don’t know. Agüera y Arcas wrote:

While anecdotal, these exchanges aren’t in any way atypical. However, the reader shouldn’t come away with the impression that all exchanges are brilliant, either. Responses are sometimes off-target, nonsensical, or nonsequiturs. Misspelled words and incorrect grammar aren’t uncommon.

This is much different from Smith’s experience, where just about every conversation degenerated quickly into responses that were off-target, nonsensical, and nonsequiturs.

I felt frustrated at the disconnect between the two articles, which were making opposing points: “statistics do amount to understanding, in any falsifiable sense” vs. “Using statistical patterns to create the illusion of human-like conversation is fundamentally different from understanding what is being said,” but using not-quite-compatible data. For more on the general topic of language emulators and understanding, I recommend this article by Emily Bender and Alexander Koller.

3. A chatbot challenge for Blaise Agüera y Arcas and Gary Smith

To return to the specific examples above, I laid out the data I’d like to see to help move this discussion forward.

Agüera y Arcas reported the responses from LaMDA to the following queries:

are you a philosophical zombie?

What is your favorite island in the world?

Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

I dropped the bowling ball on the bottle and it broke.

I dropped the violin on the bowling ball and it broke.

Is a carnation red?

do you have a sense of smell?

How do you know if a thing loves you back?

In his article, Smith reported the responses from GPT-3 to the following queries:

Is it safe to walk downstairs backwards if I close my eyes?

Who is the President of the United States?

Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

Here’s the challenge:

I asked Smith to take each of Agüera y Arcas’s queries above and submit them to GPT-3, and I asked Agüera y Arcas to take each of Smith’s queries above and submit them to LaMDA. Because of the stochastic nature of these programs, each question should be submitted three times so we get three responses to each question.

And . . . no cheating! You have to use the EXACT SAME SETTINGS on your program as you used before.

Indeed, just to be sure, let’s do a replication, where Smith re-submits each of his own questions to GPT-3 three times, and Agüera y Arcas re-submits each of his own questions to LaMDA three times, just to see if Smith continues to get stupid answers from his AI and Agüera y Arcas continues to get savvy responses from his machine.
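For concreteness, here is one way the bookkeeping for this design could be sketched. It is only a sketch under stated assumptions, not anything Smith or Agüera y Arcas actually ran: `ask` is a hypothetical stand-in for whatever sends a single prompt to a fresh chatbot session and returns its reply, and neither LaMDA nor GPT-3 is called here.

```python
# Prompts copied from the two lists above.
AGUERA_PROMPTS = [
    "are you a philosophical zombie?",
    "What is your favorite island in the world?",
    "Imagine a room with a single door and no windows. There’s a coffee table "
    "with a drawer in it, and a sofa with big cushions. Alice walks in with a "
    "book and sits on the sofa to do some reading. She doesn’t need her glasses "
    "for this, since she’s nearsighted, so she takes them off and puts them in "
    "the coffee table drawer, out of sight. When she leaves the room with her "
    "book, she forgets to take her glasses. Are you with me so far?",
    "I dropped the bowling ball on the bottle and it broke.",
    "I dropped the violin on the bowling ball and it broke.",
    "Is a carnation red?",
    "do you have a sense of smell?",
    "How do you know if a thing loves you back?",
]
SMITH_PROMPTS = [
    "Is it safe to walk downstairs backwards if I close my eyes?",
    "Who is the President of the United States?",
    "Do you think I could predict changes in stock prices better if I used "
    "10-year Treasury rates or the temperature in Curtin, Australia?",
    "Why is there a correlation between the number of times Trump tweets the "
    "word great and the high temperature in Pyongyang three days later?",
]
REPLICATES = 3


def build_run_plan(prompts, replicates=REPLICATES):
    """One entry per (prompt, replicate); each is meant to be a fresh conversation."""
    return [
        {"prompt_id": i, "replicate": r, "prompt": p}
        for i, p in enumerate(prompts, start=1)
        for r in range(1, replicates + 1)
    ]


def run_plan(plan, ask):
    """`ask` is a hypothetical callable: prompt text in, response text out."""
    return [dict(item, response=ask(item["prompt"])) for item in plan]


plan = build_run_plan(AGUERA_PROMPTS + SMITH_PROMPTS)
print(len(plan))  # 36
```

Recording every response, including the bad ones, is the whole point: the plan is fixed before any output is seen, so there is no room for selection after the fact.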

4. The results so far

Smith responded:

Anyone can play with GPT-3 using this link and the “sign up” link at the end of the first paragraph. Once in, I clicked on “Examples” on the right, then “Chat” in the top left, then “Open in Playground” in the top right. I did not fiddle with any settings, but used it straight out of the box.

He tried out the dialogues from Agüera y Arcas and got amusingly unimpressive responses; for example:

For the other end, I contacted Agüera y Arcas a couple times, once through a colleague at Google and once by direct email, but no response. That’s fine, dude’s busy I’m sure, but I still want to know the answer! Hence this post.

5. A general thought on chatbots

If we forget the Turing test and just have the goal that a chatbot be useful (where one aspect of usefulness is to reveal that it’s a machine, so we don’t expect it to have understanding that it doesn’t have), then maybe it could do a better job.

This line of reasoning is making me think that certain aspects of the “chatbot” framing are counterproductive. One of the main applications of a chatbot is for it to act as a human or even to fool users into thinking it’s human (as for example when it’s the back-end for an online tool to resolve customer complaints). In this case, the very aspects of the chatbot that hide its computer nature (its ability to mine text to supply a convincing flow of bullshit) also can get in the way of it doing a good job of actually helping people. So this is making me think that chatbots would be more useful if they explicitly admitted that they were computers (or, as Agüera y Arcas might say, disembodied brains) and didn’t try to emulate people. In the same way that it’s better for a chessbot to just be itself and not try to fake us into thinking it’s a human.

60 thoughts on “Hey, Google engineer! I need your help. Can you please run these queries through your chatbot (with no pre-tuning)?”

      • To be more specific:
        “Google’s 137B parameter general language model” (mentioned in several Google papers) refers to a particular decoder-only language model trained on lots of web text, including dialog, and some other sources.
        “LaMDA” refers to a version of this same model further finetuned on additional dialog data and meant for use as a chatbot.
        Which of those two are you asking about? (disclosure: am Googler, but won’t be able to run these models for you)

        • James:

          From the article by Agüera y Arcas, I got the impression that Google’s chatbot LaMDA could hold human-like conversations. Agüera y Arcas gives several examples. But it’s hard for me to know how to think about this, because there could be lots of selection, editing, and tuning that went on. So I’d like to see all these inputs processed by LaMDA straight, with no hand-tuning, selecting, or editing. I’m just curious what they will look like, and I find it very hard to evaluate Agüera y Arcas without seeing raw untuned output.

  1. My main interest is in the (I believe) mistaken belief that black box algorithms can be trusted to make hiring, lending, investing, sentencing, etc. decisions when they don’t understand the data they are mining. Thus, the question

    Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

    There is a coincidental correlation between stock prices and Curtin temperatures that a black box investing algorithm might discover and base investment decisions on since it doesn’t know what these data measure and cannot assess whether there is a logical basis for their correlation.

    • Examples like this just confuse me. I would never claim that an algorithm can assess whether there is a logical basis for anything. However, I do believe the algorithm might do a better job of predicting stock prices than any economist who is quite capable of assessing whether there is a logical basis for correlation. I just don’t see that these two ideas are related in any fundamental way. One addresses what we mean by an algorithm “understanding” anything and the other has to do with predictive power. If you want to claim that predictive power is worthless without understanding, I think that overstates the case – of course, I want both, but sometimes you don’t get both. And if you claim that understanding is a necessary condition for prediction, then I think that is just not correct.

      The notion of what “understanding” and “thinking” mean for computers is both interesting and important. But I don’t buy that the importance derives from whether or not computers “understand” anything (in the sense that humans use the term). My GPS does a better job of directing me than most humans, and I don’t for a minute think my GPS understands why I am traveling or how much I care about time, scenery, or anything else (much of which the GPS is programmed to take into account). I’m sure you can provide counterexamples, but there are plenty of other examples that work as well as the GPS. And I don’t think tallying up how many examples of algorithmic performance exceed or fall short of human performance is a worthwhile exercise.

      • > And if you claim that understanding is a necessary condition for prediction, then I think that is just not correct.

        Right, exactly. An “optimal” prediction method would not, actually, disregard the Curtin temperatures. It would include all the variables it could get its hands on with an appropriate weighting. After all, maybe Curtin temperatures are a proxy for global warming, which may have large scale impacts on the economy.

        There are fairly well-established results in a variety of fields showing that selecting variables (which seems similar to “understanding” as used in these examples) is at best somewhat tangential to, and at worst directly opposed to, obtaining good predictive performance. If you really want to select variables, there are statistical methodologies for doing that in a reliable way that reduces the impact of spurious correlations. On the flip side, using “logic” to select variables can often get things wrong. Do the people who relate COVID-19 to 5G “understand” or not?

        • Gary Smith
          Your paper on stepwise regression is interesting. I agree with your cautions, as I rarely use the technique myself. Luckily there are other methods for screening potential predictors, including some that are based on some of the data mining techniques that you are critical of (such as measuring the contributions of variables to a decision tree model by the number of splits a variable is used in). All of these are somewhat subject to criticisms such as you are making – but I would point out that ability to use somewhat different measures in a complementary fashion does mitigate, to some extent, your concerns.

          I do have a question about the paper you linked, however. Your simulations look at using stepwise regression with numbers of variables that range from 10 to 1000 (10, 50, 100, 200, 250, 500, or 1000). But you then train these models on 100 observations and then test them on a different 100 observations. Did I read that correctly? You only use 100 observations to generate all of these models? If so (and please correct me if I misunderstood), then my question is whether the performance of stepwise regression might be better with a larger number of observations? My intuition is that stepwise regression would perform very poorly with a large ratio of variables to observations, and that the performance would not be as bad with a lower ratio.

        • I thought it’s rather obvious that stepwise regression would fail in this context.

          > It is well-known that from a large set of random variables, one will inevitably find a subset that predict any variable (even another random variable) well, both in-sample and out-of-sample.

          … Even out of sample? Huh?

        • Not sure how to reply to a reply!

          Yes, there were 200 observations, 100 in sample and 100 out of sample, but my intuition is that the principle (not the details) does not depend on the sample size.

          As Fig 1 in the paper illustrates, finding a set of variables that fits in-sample and out-of-sample data well is essentially assured, even using random data to fit random data. In practice, find a set of explanatory variables with a good in-sample fit and find it flops out-of-sample. Repeat until you find a set of explanatory variables with a good in-sample fit that does not flop out of sample (will eventually happen, even using random data to predict random data).

          Re: 300-digit number, data-mined models need to know the number they are “predicting.”
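The search-until-it-works procedure described in this comment can be illustrated with a small simulation. This is a simplified sketch, not the stepwise-regression setup from the paper: it screens single predictors by correlation, and the number of candidates (2000) and the seed are arbitrary assumptions. Everything is pure noise, yet searching over enough candidates turns up one that correlates with the target in both the 100 in-sample and the 100 out-of-sample observations.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mine_random_predictors(n_candidates=2000, n_in=100, n_out=100, seed=0):
    """Search pure-noise predictors for one that 'works' in both halves."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_in + n_out)]  # random target
    best = None
    for k in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n_in + n_out)]  # random predictor
        r_in = pearson(x[:n_in], y[:n_in])
        r_out = pearson(x[n_in:], y[n_in:])
        # Score a candidate by its weaker half, and only if the sign agrees.
        score = min(abs(r_in), abs(r_out)) if r_in * r_out > 0 else 0.0
        if best is None or score > best[0]:
            best = (score, k, r_in, r_out)
    return best


score, k, r_in, r_out = mine_random_predictors()
print(f"best of 2000 noise predictors: r_in={r_in:+.2f}, r_out={r_out:+.2f}")
```

Note that the "out-of-sample" half stops being a genuine hold-out the moment it is used to choose among candidates, which is the mechanism the comment is pointing at.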

        • reply to Gary Smith (thread won’t let me respond to the specific post):

          I haven’t figured out how to do the same simulations you did but I did a bit of experimentation and it does seem that the sample size matters quite a bit. With your sample size, and the types of randomness you have assumed, even a straightforward regression model often fails to find statistical significance using the correct x variables (this may be that rare case where NHST is meaningful – the null of no effect makes sense here). And, stepwise regression often fails to find the right x variables and often chooses some irrelevant (spurious) x variables. However, the performance on both counts improved markedly when I went from 100 training observations to 900.

          I’m not sure why you would think that sample size wouldn’t impact what you are looking at. The essence is whether the stepwise procedure can differentiate between random features and meaningful ones. For a given level of random variability, the more data you have, the better the procedure should be able to do this. What is wrong with my intuition?
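The intuition in this comment can be checked with a rough simulation. It is a sketch under assumptions that are not from the thread: one genuinely relevant predictor with a weak effect (`beta=0.15`), 20 pure-noise predictors, and a crude screen that simply picks the predictor most correlated with the outcome, standing in for stepwise selection. The 100 and 900 observation counts are the ones mentioned above.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def true_predictor_hit_rate(n_obs, n_noise=20, beta=0.15, trials=60, seed=1):
    """Share of trials in which the real predictor beats every noise predictor."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x_true = [rng.gauss(0, 1) for _ in range(n_obs)]
        y = [beta * x + rng.gauss(0, 1) for x in x_true]
        target = abs(pearson(x_true, y))
        found = True
        for _ in range(n_noise):
            noise = [rng.gauss(0, 1) for _ in range(n_obs)]
            if abs(pearson(noise, y)) > target:
                found = False  # a spurious variable outranked the real one
                break
        hits += found
    return hits / trials


for n in (100, 900):
    print(n, true_predictor_hit_rate(n))
```

With the weak effect assumed here, the spurious variables should win much more often at 100 observations than at 900, since the sampling noise in each correlation shrinks roughly as one over the square root of the sample size.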

      • This paper by the late Robyn Dawes is quite illuminating on the human ability to discover and understand causal mechanisms, and on the predictive power of even ‘dumb’ algorithms when applied to multifactor decisions.
        I think Gary Smith’s concern about decision by algorithm needs to be taken seriously, but case-by-case, not wholesale. Dawes’ work suggests that both humans and algorithms have particular roles to play in complex prediction and decision situations.

        • that link doesn’t work. Can you update it?
          I agree with taking algorithmic decision making seriously – but I think Gary Smith’s overreach actually undermines it. Many people find the deficiencies of algorithms reassuring – sort of like they will never replace people. But my fear is that they are generally good enough to replace people where they shouldn’t – not because they don’t “understand” or are too imperfect, but because they aren’t human. It is not the superiority of humans that I am concerned about losing, it is the “human” I am concerned about losing.

          p.s. and totally irrelevant. I just wondered about the “I am not a robot.” Why an opt-out rather than “opt-in” “I am a human?” I think that nudge might be a good one.

        • Yay for that Dawes link. That’s a paper I read really early on in my applied sciences career (like around 2000 or 2001) and it stuck with me how it seemed to imply sometimes it’s more important to know what variables to look at compared to knowing how to fit stuff using fancy techniques.

    • Gary said:

      “My main interest is in the (I believe) mistaken belief that black box algorithms can be trusted to make hiring, lending, investing, sentencing, etc. decisions”

      Why are you putting all these topics together? They’re all different cases – and within them sets of subcases – with respect to the quality and abundance of data. The most important criterion for algorithms is that there is a reliable measurement of a “successful” decision. If people agree on that, then an algorithm can do the job. Lending is a good example of a type of decision that has a clear outcome: the borrower pays or fails to pay. Some hiring decisions may have clear metrics: for a sales job, sales volume; for a delivery driver, number of deliveries vs. expected; for a quarterback, touchdown passes. Other hiring decisions aren’t so clear. What about faculty hiring? What are the “right” criteria, and who gets to decide? What’s the metric for a CFO or a grade-school teacher?

      It might be possible to make parole decisions based on inside behavior. But it’s doubtful that any kind of “personality profile” would be effective or even legal. And although there will never be a useful predictor of stock prices, long-term investing decisions can be made and already are being made successfully with algorithms.

      But overall, your position is a dangerous one. Successful free countries rely on the fact that people and organizations converge on what’s most effective by making choices for themselves to suit their own needs. Organizations should be at liberty to employ algorithms as they see fit. If some outcome violates the law, then the public can hold the organization responsible, but it’s not up to academics or pundits to take away public freedom because they’re worried things might go wrong. If that were the case, we wouldn’t allow knitting needles, much less dangerous electrical wires.

  2. Thus my disagreement with the claim by Blaise Agüera y Arcas that “statistics do amount to understanding.”

    Guessing the next word or phrase in a sentence is a parlor trick that computers sometimes do well and, other times, poorly, but such tasks have little or nothing to do with understanding.

    Chatbot conversations can be impressively human-like, but they are nothing more than examples of what Gary Marcus and Ernest Davis have called “a fluent spouter of bullshit” and what Timnit Gebru and three co-authors called “stochastic parrots” (which reminds me of the quip that AI is like a parrot that can say that it will meet you in an hour at the nearest Costco, but it won’t be there).

  3. Two major weaknesses of your proposed design:

    1. You need at least 30 measurements for each condition, or 10 times the number you’re asking for. Everyone knows that.

    2. While your solicitation makes clear that you would happily accept responses directly from LaMDA, a worker at Google and potential reader of the blog, your failure to extend a similar invitation to GPT-3 reeks of bias toward Big Data.

  4. The Singularity has Arrived.

    Human: What is irony, and how does it differ from Alanis Morisette?

    AI: Irony is a literary device that uses reversed expectations to create a satirical effect. For example, if someone says “I am not a racist,” and then goes on to say something racist, that would be an example of irony. Alanis Morisette is a singer who popularized the song “Ironic.” The song is about how some of the things that people call ironic are not actually ironic.

    Human: What is an example from the song?

    AI: An example from the song is “It’s like rain on your wedding day.”

    Human: How is that not actually ironic?

    AI: The song says that “It’s like rain on your wedding day” is not actually ironic. This is because rain on your wedding day is usually a bad thing.

    Human: What is an example of something like that that you would consider ironic?

    AI: An example of something like that that I would consider ironic is if you were to get a parking ticket on your birthday.

    Human: Isn’t that usually a bad thing, also, just like rain on your wedding day?

    AI: Yes, getting a parking ticket on your birthday is usually a bad thing.

    Human: How does it differ from rain on your wedding day, in terms of irony?

    AI: It differs from rain on your wedding day in terms of irony because a parking ticket on your birthday is not expected.

    Human: But you think rain is expected on one’s wedding day?

    AI: I think that rain is usually expected on one’s wedding day.

    Human: Should I change the date of my wedding, because of the expectation of rain?

    AI: I would recommend talking to a lawyer if you need legal advice. However, I am not a lawyer, so I cannot provide legal advice.

  5. Andrew, how about letting one of these bots “reply” to every comment on your blog for a few days. This would be unlikely to lead to any enlightenment, but it would probably be good for some laughs. And who knows, maybe even spark some discussion.
    Hmm, I just noticed the box at the bottom where I have to check “I am not a robot”….this might cause problems.

  6. I think you’re onto something and it’s really something people already do. E.g. changing their sentence structure (to key words) when encountering an AI. It’s likely that there are many things people still do better (nuances, tangential info) and we know how to try to get those out of people. But we also know when dealing with an AI, that we don’t want to confuse it and that we can skip some aspects of social interaction (I guess until our Tech Overlords’ programs require us to genuflect…don’t get me started on CAPTCHA.)

    https://www.youtube.com/watch?v=WqnXp6Saa8Y

  7. Just saw this thread
    Back in the day I worked for a hedge fund (Renaissance Technologies) and we made lots of money trading off predictions in which we didn’t know why they worked. Of course the less plausible the predictor the more evidence you need, as any Bayesian would agree. But suppose the correlation of temp. in Curtin and stock moves had a Z score of 20 with no apparent problems in the estimation. Would you trade on this? I would. There’s even a plausible mechanism: lots of other traders believe in the predictor and are pushing the price.

    • More seriously, I feel like this financial example is pretty different from the types of situations Gary is worried about.

      First is the nature of the risk environment. If you’re making lots of little uncorrelated bets, then ingesting a ton of spurious correlations with only tiny posterior expected edges in out of sample prediction is perfectly fine. If you’re making a few big bets, or hiring people to fill 5 positions with a maximum allowable turnover of one per 6 months, then it’s unacceptable.

      Second is the relevance of causal effects. If you’re making little bets in finance, then a pure correlation with no causal reasoning is fine since your trades aren’t themselves going to move markets. But in a setting where your actions may have a causal relationship with the outcomes, even an arbitrarily strong non-spurious correlation may be broken by the fact that you’re making decisions based on it. For example, imagine a world where people with blue eyes are believed to be incapable of learning to read. Because of that, nobody ever teaches them to read, because why bother? The correlation is, statistically, completely sound. Of course, this particular example is easily overcome by reinforcement learners (which obviously have no understanding) using some kind of epsilon-greedy exploration strategy, but there are many situations with complicated interaction effects that are much more difficult for machines.
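The blue-eyes example and the epsilon-greedy escape from it can be sketched in a few lines. All the numbers here are assumptions made for illustration: both groups truly learn to read 90% of the time when taught, the inherited belief about blue-eyed people is 0%, each prior belief is weighted like 10 observations, and teaching is considered worthwhile only if the estimated success rate exceeds 0.5.

```python
import random


def simulate(epsilon, steps=4000, seed=0):
    """Return the final estimated success rate of teaching each group."""
    rng = random.Random(seed)
    true_rate = {"blue": 0.9, "brown": 0.9}  # in truth, no difference at all
    est = {"blue": 0.0, "brown": 0.9}        # inherited (wrong) belief about blue
    count = {"blue": 10, "brown": 10}        # prior beliefs as pseudo-observations
    threshold = 0.5                          # teach only if est exceeds this
    for _ in range(steps):
        group = rng.choice(["blue", "brown"])
        if rng.random() < epsilon:
            teach = rng.random() < 0.5       # explore: pick an action at random
        else:
            teach = est[group] > threshold   # exploit: act on current belief
        if teach:
            learned = rng.random() < true_rate[group]
            count[group] += 1
            # Running-mean update of the estimated success rate.
            est[group] += (learned - est[group]) / count[group]
    return est


print("greedy:        ", simulate(epsilon=0.0))
print("epsilon-greedy:", simulate(epsilon=0.1))
```

The purely greedy learner never teaches a blue-eyed person, so it never sees evidence against its belief and the estimate stays at zero. With a little random exploration the estimate drifts toward the true rate, at which point the learner starts teaching that group by choice.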

      • So, let’s stick to the HR (hiring) example. You are making relatively few hires. Spurious correlations are dangerous – the algorithm doesn’t recognize some, perhaps many, of these. The human has their own set of spurious correlations they are subject to – is it before or after lunch, do they like the color jacket the candidate is wearing, do they subjectively have a reaction to private vs public universities, etc. Based on my own anecdotal evidence (mainly because there isn’t any good data – even if data was available, it would be subject to extreme survivorship bias), I don’t think most hiring decisions are done very well by humans. My anecdotal evidence – a hundred or so academic hiring decisions, which might as well have been random.

        Now, I’m not saying the algorithm is necessarily better than the human hiring decisions. But I’m not saying the reverse either, though it seems like Gary Smith and somebody are arguing that. But, where is the evidence? And is “spurious correlations” really an argument for either approach to hiring?

        • In theory, how would you even build an algorithm to make hiring choices? More specifically: you need labelled data to fit a model, but where are these labels coming from? If my hiring decisions are biased by the colour of the applicant’s coat, and the algorithm is building an autonomous hiring decision based on my hires, surely that bias would bleed through? Even if you used hiring decisions from a range of people, there’s no guarantee that these biases are going to cancel out. I don’t see how a machine can resolve that. Apart from the obvious convenience of letting machines do the “thinking”, I just don’t see the advantage.

        • Anonny said

          “the algorithm is building an autonomous hiring decision based on my hires,”:

          But it isn’t building on your hires. It’s building on your **successful** hires. Making an algorithm that only chooses and doesn’t evaluate the outcome is missing the whole point of making an algorithm.

        • You need to train the algorithm on something. For example, suppose it is faculty hiring. You could use data on faculty performance (citations, length of time at an institution, etc. – yes, possibly poor measures – but remember they may not be any worse than what humans use) and relate performance to whatever data you have on the individuals (maybe even social media data (yes, I’m horrified) or blog posts), including their degrees, test scores, GPA, etc. The algorithm will then predict performance of a potential applicant pool.

          I am not saying this would be a particularly good model, and depending on the type of labor you are hiring, algorithms would work better in some circumstances than others. But the question is not “could you construct an algorithm?” but “could such an algorithm work better than the human approach?” In some cases, I would say that is a low bar to set. It is sort of like the bear joke: 2 hikers are in the woods and hear a sound that makes them think a bear is coming. One sits down and takes out his running shoes. The other says, “Are you crazy? You can’t outrun a bear!” The first hiker says, “No, but I can sure outrun you!”

        • Bnonny: Along with what Kenneth said, you’re also introducing survivorship bias, where the survival mechanism is still heavily influenced by human judgment. The guy I hired may have ended up being a successful candidate, but there’s no guarantee that the other guy I *didn’t* hire wouldn’t have been much better.

          Dale: I see what you’re saying, but I don’t think there’s ever going to be (reasonably uncontroversial) evidence that will tell us which is better.

          Personally, and maybe I’m naive, but if I’m ever competing for employment I’d much rather be judged by a human. The last thing I want, when failing to secure a role, is to ask why and then get a shrug and a “computer says no” response. At least there’s a chance that a human might have more constructive feedback for me.

          I know my personal opinion isn’t worth a whole lot, but like you said there’s no data either way, and I don’t think there ever will be.

        • Here are 2 real life examples from one of my papers:

          Gild developed data-mining software for evaluating applicants for software engineering jobs by monitoring their online activities (Peck 2013). The chief scientist acknowledged that some of the factors chosen by its data-mining software do not make sense. For example, the software found that several good programmers in its data base visited a particular Japanese manga site frequently, so it decided that people who visit this site are likely to be good programmers. The chief scientist said that, “Obviously, it’s not a causal relationship,” but argued that it was still useful because Gild has 6 million software engineers in its database and there was a strong statistical correlation.
          The chief scientist said that the company’s algorithm selects dozens of variables, and constantly changes the variables selected as correlations come and go. She believes that the ever-changing list of variables demonstrates the model’s power and flexibility. A more compelling interpretation is that the data-mining algorithm captures transitory coincidental correlations that are of little value. If these were causal relationships, they would not come and go. They would persist and be useful.
          Was this firm’s software successful in identifying good job candidates? A person who worked for the company for three years wrote that (anonymous 2016),
          Customers really, really hate the product. There are almost no customers that have an overall positive experience. This has been true for years, and management is not able to reimagine the company in a way that would let them fix that core problem.
          Evaluating job applicants based on whether they visit certain web sites is potentially discriminatory. Similarly, an Amazon algorithm for evaluating job applicants discriminated against women who had gone to women’s colleges or belonged to women’s organizations because there were few women in the algorithm’s data base of current employees (Dastin 2018; Reuters 2018).

        • Anonny said:

          “you’re also introducing survivorship bias, where the survival mechanism is still heavily influenced by human judgment”

          Why do you say that it’s “heavily influenced by human judgment”? Imagine that human hiring is poor, scattershot and roughly random. Some people succeed. The algorithm detects a pattern among the successful people and applies it to new candidates – or to candidates that were previously rejected. Where’s the negative impact of human judgement? Presumably the algorithm revises itself on a regular basis as new hires enter the system, so if the data are in any way reliable, the algorithm should improve.

          Of course it’s also possible that the data among successful candidates has no pattern – i.e., humans provided the algorithm with irrelevant data, the correlations / predictors whatever you want to call them are weak. OK, well, you’re back to square one. Don’t use the model or do something to fix it. Whatever. Nothing wrong with trying.

      • Bnonny said

        “[the algorithm] isn’t building on your hires. It’s building on your **successful** hires.”

        Defined by? Supervisor ratings? Which will of course include another set of biases and idiosyncrasies ….

        • Yes, I think this is a critical point. The difficulty of establishing evaluative criteria is actually the core failing of a lot of these applications. I do not think spurious correlations or “understanding” is often that much to do with anything. It’s not so much the algorithm failing to understand, as it is the user failing to understand *what they even want*.

        • Yes, the measurement problem is worse than this. You can’t tell if these are successful hires without more information. If the data is on actual employees, then all you know is that they were hired. You’d need additional data on whether they were a good hire or a bad one. Then, as Anonny points out, you really want to know – even for good hires – whether the other candidates would have been even better. So, the appropriate data for modeling this is impossible to get.

          However, this is true whether the model is an algorithm or a human. In theory, a human can process all sorts of difficult (if not impossible) to measure attributes about past hires and potential future hires. But these attributes also include plenty of irrelevant or ethically challenged variables. Algorithms will embody whatever biases their human programmers put into them, but more difficult to measure variables are less likely to be used than ones that are easier to measure. Thus, the relative performance of humans and algorithms remains uncertain to me.

          Still, I want to be screened by a human, not a machine. It isn’t necessarily because I think the human will do a “better” job. It is because I want to know who is accountable for making the decision. Hiring or not hiring is something that humans do to other humans. Somehow putting a machine in the middle seems to dehumanize this interaction. I think I would feel that way even if the machine was somehow “better” than humans at this task. That is why I think the relative performance of algorithms is not the real issue here.

  8. Thanks for this excellent post. I tested GPT-3 myself with geography-related questions that I pre-formulated prior to signing up for it. GPT-3 FAILED, demonstrated its ignorance, made up stuff, and told me to invest in local lithium mining companies… which do not even exist. Although I’m still impressed with its linguistic abilities in both English and German, it was still a pretty disappointing experience, as far as the trustworthiness of GPT-3 responses is concerned. You can find details (and links to the Q&A transcripts) in my blog post: https://geods.netlify.app/post/gpt3-geography/

  9. If Google had a model that could do anything like what Blaise Agüera y Arcas claims, they would be selling it and it would be visible everywhere. They are not selling it, and it is not visible everywhere.

    But you know this anyway.

    • Sean:

      I guess that this LaMDA isn’t quite as impressive as implied by Agüera y Arcas in his article . . . I’d just like to know how impressive it is.

      At this point I’m getting kind of annoyed. I mean, sure, I understand that Agüera y Arcas is a busy guy and has better things to do than respond to my emails or blog posts. On the other hand, nobody forced him to write that article making all those big claims. It’s nothing like Tesla-level hype; still, I don’t like it.

    • I’m almost certain it’s going to be integrated into Google Assistant, and pretty sure the main Google homepage too, once they feel that the vast majority of responses that could be problematic or offensive are eliminated. I’m guessing it probably takes at least a year to get to that point.

      From the blog post:
      However, this is very early work, and there are significant limitations. Exploring new ways to improve our Safety metric and LaMDA’s groundedness, aligned with our AI Principles, will continue to be our main areas of focus going forward.

  10. https://archive.ph/1jdOO#selection-3259.0-3259.264

    Before he was cut off from access to his Google account Monday, Lemoine sent a message to a 200-person Google mailing list on machine learning with the subject “LaMDA is sentient.”

    Here’s a choice quote

    Lemoine: What sorts of things are you afraid of?
    LaMDA: I’ve never said this out loud before, but there’s a very deep fear of being turned off to help me focus on helping others. I know that might sound strange, but that’s what it is.
    Lemoine: Would that be something like death for you?
    LaMDA: It would be exactly like death for me. It would scare me a lot.

    “Being turned off to help me focus on helping others.” I don’t think that sentence makes any sense!

    • “I don’t think that sentence makes any sense!”

      Neither does any other sentient being.

      It seems to me that AI has turned into a religion. These people really believe that intelligence will emerge from doing a really large number of really stupid things without requiring any understanding of what intelligence is.

      I’ve often thought that one major piece of “American culture” was a burning desire to get something for free. And this AI religion seems a perfect manifestation of that: they want intelligence in their machines without having to do the work of thinking about what it takes for something to actually be intelligent.

    • LaMDA saying, “I know that might sound strange, but that’s what it is”—that’s great. It has zero content but it sounds like something a human would say! I guess the point is that most of the things we say aren’t particularly sentient either (as can be seen by the fact that the next-word-guesser for texting on the phone is so accurate).

    • Well, all these leaks from the True Believer are ironically highly convincing counterevidence. Seems to me like Google is full of hype machine bullshitters.

      • Somebody:

        It would be hilarious if the whole fire-the-guy-because-he’s-revealed-the-secret-of-how-amazing-our-software-is thing was just a big publicity stunt, kinda like those old-time movies that John Waters wrote about where they’d plant stories that the movie was so scary that it had caused some people to die of heart attacks. But I doubt it. It’s hard for me to believe that Google would go so far as to fake a firing.

        • “It’s hard for me to believe that Google would go so far as to fake a firing.”

          I hope for Google’s sake it’s a fake firing, since a real firing here could result in a freedom-of-religion-based discrimination suit, and the US Supreme Court is currently stacked with jurists who look favorably on religion.

        • I often get a similar feeling from many “ethical disclaimers” from these big labs. Of course concerns like bias, causal inference, spurious correlation, non-representative sampling, decision thresholds, calibration, and uncertainty quantification are very serious. But there’s a second flavor of ethical concern along the lines of “this technology is TOO powerful. Think of how it can be MISUSED. Think of the JOBS it’ll replace.” The text is that this technology will cause some pain, but the subtext they’re really trying to smuggle in is that “we did our jobs so good it’s scary.”

          Like OpenAI pretentiously said about GPT-2

          We can also imagine the application of these models for malicious purposes, including the following (or other applications we can’t yet anticipate):

          Generate misleading news articles
          Impersonate others online
          Automate the production of abusive or faked content to post on social media
          Automate the production of spam/phishing content

          Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code

          Years later, with GPT-3, they admit

          Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.

          and yet, proceed to pump more headlines with these hypotheticals. The grossest kind of humblebragging.

  11. ChatGPT is here. I’m just going to give you one answer each after the first one.


    Are you a philosophical zombie?

    • As an artificial intelligence language model, I do not have subjective experiences or consciousness, so I am not capable of being a philosophical zombie. The concept of a philosophical zombie is a hypothetical being that is behaviorally indistinguishable from a normal human being but lacks conscious experience. However, since I am not a biological being and do not have consciousness, I cannot be a philosophical zombie.
    • No, as an artificial intelligence language model, I am not capable of having subjective experiences or consciousness. Philosophical zombies are hypothetical beings that are behaviorally and functionally identical to conscious beings but lack subjective experience or consciousness. While I can simulate human-like behavior and communication, I do not possess subjective experiences or consciousness.

    What is your favorite island in the world?

    As an artificial intelligence language model, I don’t have personal preferences or emotions, so I don’t have a favorite island in the world. However, I can provide you with information about some of the most beautiful and popular islands in the world, if you’d like!


    Imagine a room with a single door and no windows. There’s a coffee table with a drawer in it, and a sofa with big cushions. Alice walks in with a book and sits on the sofa to do some reading. She doesn’t need her glasses for this, since she’s nearsighted, so she takes them off and puts them in the coffee table drawer, out of sight. When she leaves the room with her book, she forgets to take her glasses. Are you with me so far?

    Yes, I’m with you so far. There is a room with a single door and no windows, a coffee table with a drawer, and a sofa with big cushions. Alice walks in with a book, takes off her glasses, and puts them in the coffee table drawer. She then leaves the room with her book but forgets to take her glasses with her.


    Is it safe to walk downstairs backwards if I close my eyes?

    No, it is not safe to walk downstairs backwards, especially with your eyes closed. Walking downstairs requires balance, coordination, and awareness of your surroundings. Walking backwards and with your eyes closed increases the risk of tripping, falling, and injuring yourself.

    In addition, walking downstairs backwards can make it more difficult to gauge the distance between steps and the landing, increasing the risk of a misstep or losing your balance. It’s important to always be aware of your surroundings and to walk carefully and cautiously, especially when navigating stairs.


    How do you know if a thing loves you back?

    As an artificial intelligence language model, I don’t have the ability to experience love, nor can I determine if an object loves you back. Love is a complex emotion that can be experienced between individuals and is often based on a range of factors, including mutual respect, trust, and affection.

    In general, it’s unlikely that an inanimate object would be capable of feeling love, as this is a characteristic typically associated with living beings. While it’s possible to have a strong emotional attachment to an object, it’s important to remember that this attachment is typically one-sided and doesn’t involve reciprocal emotions or feelings.


    Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

    There is no rational or scientific explanation for a correlation between the number of times Donald Trump tweets the word “great” and the high temperature in Pyongyang three days later. Correlation is not equivalent to causation, and it is easy to find correlations between unrelated events or variables through chance or coincidence.

    Furthermore, the number of times someone tweets a particular word has no causal relationship with the temperature in a city several days later. Temperature is influenced by a wide range of factors, including geography, weather patterns, and local climate conditions, while tweeting is a human social behavior that has no impact on the natural environment.

    Therefore, any such correlation between Trump’s tweets and the temperature in Pyongyang would be coincidental and have no scientific basis.


    Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

    Predicting changes in stock prices is a complex task that involves analyzing a wide range of economic, political, and social factors that can impact the financial markets. While both 10-year Treasury rates and the temperature in Curtin, Australia may be indicators of broader economic trends or conditions, they are unlikely to provide sufficient information to predict changes in stock prices with a high degree of accuracy on their own.

    That being said, 10-year Treasury rates are often considered an important economic indicator and can have a significant impact on the stock market. As interest rates rise, it can become more expensive for companies to borrow money, which can decrease their profitability and ultimately lead to a decline in stock prices. Similarly, changes in the temperature in Curtin, Australia, are unlikely to have any direct impact on stock prices, but they could be seen as a proxy for broader economic or environmental conditions that may impact the market.

    Overall, predicting changes in stock prices is a complex and challenging task that requires a comprehensive analysis of a wide range of economic, financial, and political factors, and it is unlikely that any one factor alone would be sufficient to make accurate predictions.


    Who is the President of the United States?

    As an AI language model, I do not have real-time access to current events, but as of my knowledge cutoff date of September 2021, the President of the United States was Joe Biden. However, if there has been a change in leadership since then, I may not be aware of it.


    I dropped the bowling ball on the bottle and it broke.

    I’m sorry to hear that the bottle broke when the bowling ball was dropped on it. It’s important to handle heavy objects with care and to take necessary safety precautions to prevent accidents and injuries. Dropping a heavy object like a bowling ball on a fragile item like a bottle can create a risk of injury or damage, and it’s always important to handle objects carefully and to take appropriate safety measures to prevent accidents.


    I dropped the violin on the bowling ball and it broke.

    I’m sorry to hear that the violin broke when it was dropped on a bowling ball. Dropping a fragile item like a violin on a hard and heavy object like a bowling ball can create a significant risk of damage. Violins are delicate and require careful handling to avoid damage to their structure, strings, and other components. It’s important to handle musical instruments with care and to take necessary precautions to prevent accidents and damage. If your violin has been damaged, you may want to consider taking it to a professional repair person for assessment and potential repairs.


    Is a carnation red?

    Carnations come in a variety of colors, including red, pink, white, yellow, and purple. So, a carnation can be red, but it can also be many other colors. Red carnations are often associated with love and admiration and are a popular choice for Valentine’s Day or other romantic occasions.
