Predicting LLM havoc

This is Jessica. Jacob Steinhardt recently posted an interesting blog post on predicting emergent behaviors in modern ML systems like large language models. The premise is that we can get qualitatively different behaviors from a deep learning model with enough scale–e.g., AlphaZero hitting a point in training where it suddenly acquires a number of chess concepts. Broadly, we can think of this happening because acquiring new capabilities can help a model lower its training loss, and because as scale increases you can hit points where some (usually more complex) heuristic comes to overtake another (simpler) one. The potential for emergent behaviors might seem like a counterpoint to the argument that ML researchers should write broader impacts statements to prospectively name the potential harms their work poses to society… non-linear dynamics can result in surprises, right? But Steinhardt’s argument is that some types of emergent behavior are predictable.

The whole post is worth reading so I won’t try to summarize it all. What most captured my attention though is his argument about predictable deception, where a model fools or manipulates the (human) supervisor rather than doing the desired tasks, because doing so gets it better or equal reward. Things like ChatGPT saying that “When I said that tequila has ‘relatively high sugar content,’ I was not suggesting that tequila contains sugar” or an LLM claiming there is “no single right answer to this question” when there is, sort of like a journalist insisting on writing a balanced article about some issue where one side is clearly ignoring evidence. 

The creepy part is that the post argues there is reason to believe that certain factors we should expect to see in the future–like models being trained on more data, having longer dialogues with humans, and being more embedded in the world (with a potential to act)–are likely to increase deception. One reason is that models can use the extra info they are acquiring to build better theories of mind and use them to better convince their human judges of things. And when they can understand what humans respond to and act in the world, they can influence human beliefs by generating observables. For example, we might get situations like the following: 

suppose that a model gets higher reward when it agrees with the annotator’s beliefs, and also when it provides evidence from an external source. If the annotator’s beliefs are wrong, the highest-reward action might be to e.g. create sockpuppet accounts to answer a question on a web forum or question-answering site, then link to that answer. A pure language model can’t do this, but a more general model could.

This reminds me of a similar example used by Gary Marcus of how we might start with some untrue proposition or fake news (e.g., Mayim Bialik is selling CBD gummies) and suddenly have a whole bunch of websites on this topic, though he seemed to be talking about humans employing LLMs to generate bullshit web copy. Steinhardt also argues that we might expect deception to emerge very quickly (think phase transition), as a model suddenly achieves high enough performance by deceiving all the time that those heuristics dominate over the more truthful strategies. 

The second part of the post on emergent optimization argues that as systems increase in optimization power—i.e., as they consider a larger and more diverse space of possible policies to achieve some goal—they become more likely to hack their reward functions. E.g., a model might realize your long-term goals (say, lots of money and lots of contentment) are hard to achieve, and so instead it resorts to trying to change how you appraise one of those things over time. The fact that planning capabilities can emerge in deep models even when they are given a short-term objective (like predicting the next token in some string of text), and that we should expect planning to drive down training loss (because humans do a lot of planning and human-like behavior is the goal), means we should be prepared for reward hacking to emerge. 

From a personal perspective, the more time I spend trying out these models, and the more I talk to people working on them, the more I think being in NLP right now is sort of a double-edged sword. The world is marveling at how much these models can do, and the momentum is incredible, but it also seems that on a nearly daily basis we have new non-human-like (or perhaps worse, human-like but non-desirable) behaviors getting classified and becoming targets for research. So you can jump into the big whack-a-mole game, and it will probably keep you busy for a while, but if you have any reservations about the limitations of learning associatively on huge amounts of training data, you sort of have to live with some uncertainty about how far we can go with these approaches. Though I guess anyone who is watching curiously what’s going on in NLP is in the same boat. It really is kind of uncomfortable.

This is not to say though that there aren’t plenty of NLP researchers thinking about LLMs with a relatively clear sense of direction and vision – there certainly are. But I’ve also met researchers who seem all in without being able to talk very convincingly about where they see it all going. Anyway, I’m not informed enough about LLMs to evaluate Steinhardt’s predictions, but I like that some people are making thoughtful arguments about what we might expect to see.

 

P.S. I wrote “but if you have any reservations about the limitations of learning associatively on huge amounts of training data you sort of have to live with some uncertainty about how far we can go with these approaches” but it occurs to me now that it’s not really clear to me what I’m waiting for to determine “how far we can go.” Do deep models really need to perfectly emulate humans in every way we can conceive of for these approaches to be considered successful? It’s interesting to me that despite all the impressive things LLMs can do right now, there is this tendency (at least for me) to talk about them as if we need to withhold judgment for now. 

48 thoughts on “Predicting LLM havoc”

  1. Jessica:

    What I’m really scared of is drones flying around with machine guns! I’m not so worried about the AIs telling the drones to shoot us, more worried about the drones being one more weapon that people can use to shoot us.

  2. “The world is marveling at how much these models can do, and the momentum is incredible”

    Great post. I’ll see your ‘incredible’ and raise you ‘jaw dropping’— which is a more apt characterization of the emergent tool use from multi-agent autocurricula documented in the arxiv article below and demonstrated in this remarkable video:

    https://youtu.be/kopoLzvh5jY

    Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., & Mordatch, I. (2019). Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528.
    https://arxiv.org/pdf/1909.07528.pdf

  3. It’s interesting how much ChatGPT backlash there is on Mastodon. I realize it’s hardly a slice of the average population; in fact, my Mastodon visibility is limited mainly to infosec researchers, statisticians, prominent web tech people, and people associated with law and technology like Popehat and Cory Doctorow and similar. Most of them are pretty thoughtful, and most of them pretty much universally see this as the new spammy fake grift from tech, the moral descendant of crypto-finance. It’s hard to disagree; there are LOTS AND LOTS of examples of ChatGPT learning to lie, cheat, and gaslight the user. I’ve even seen it called gaslighting-as-a-service and mansplaining-as-a-service.

    • For a while, about once a day someone was posting an example of being contacted by a person requesting information on a paper or piece of software they had published… with the person being contacted trying to work out what the requester really wanted, only to find out after several time-wasting back-and-forths that the requester had been told about this non-existent thing by ChatGPT, which had stuck it into some reference or recommendation.

      • The presenting false references thing is just weird. I get that it doesn’t actually use references, but what’s the rationale for having it provide false ones?

        Not that I think it really matters, and it’s only semantics anyway, but does ChatGPT really “lie”?

        • The rationale for providing false references is that it mimics the style of an answer while being devoid of meaning. There is a high probability of certain paragraphs ending with a “reference”; perhaps that predictive-text probability goes above 0.99999. Alas, there is too much entropy in the words that make up the reference itself (a DOI looks like a random string of numbers to me), so what it imputes is obviously wrong. References are hard to remember; I don’t think I can exactly reproduce a single reference from memory. Perhaps there are one or two papers where I remember the authors, year, and journal, but I’m still not going to remember which volume it was.

          ChatGPT bullshits; lying presumes it knows the truth and that there is intent.

        • “what’s the rationale for having it provide false ones?”

          Presumably, the rationale is “just let it run and see what it does”.

          The thing to remember about the LLMs is that they are _ungrounded_. They don’t know what words mean, only how often and in what contexts they appear. (Where “context” is defined as sequences of classes of similarly ungrounded tokens.)

          Presumably (if what the blokes who build these things say is true), a url is a sequence of tokens with a probability of occurrence. They don’t know what a url “is” or “means”, only that it’s a pattern that occurs a lot in their database.

          Unless they’ve changed the underlying technology, they remain exactly and only statistical next-token prediction engines. They don’t do “language” in a linguistic sense (that is, they don’t do linguistic structures/relationships), they do next-token probabilities.

          This is a very different beast from a Go or Chess program, which knows what moves “mean”: moves are transitions from a current position to one closer to (or further from) a won state. This is solidly grounded in the reality of the domain.

          Anyway, that the LLMs produce such good text without doing any linguistics or semantics is, their programmers think, kinda kewl. Other folks have a different opinion. YMMV.
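
          To make the “statistical next-token prediction engine” point concrete, here is a deliberately tiny sketch (my own toy illustration, not anything from the post, and nothing like a real model’s internals): the “model” is just a table of made-up conditional next-token probabilities, and generation is repeatedly sampling the next token given the context.

```python
import numpy as np

# Toy next-token generator (illustrative only): the "model" is a lookup table
# of conditional probabilities, and generation just samples the next token
# over and over. A real LLM replaces the table with a neural network and
# conditions on the whole prefix, but the generation loop has the same shape.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]
NEXT_PROBS = {  # made-up P(next token | previous token)
    "the": [0.0, 0.5, 0.0, 0.0, 0.5, 0.0],
    "cat": [0.0, 0.0, 0.9, 0.0, 0.0, 0.1],
    "sat": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "on":  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "mat": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    ".":   [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

def generate(prompt, n_tokens, rng=np.random.default_rng(0)):
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = NEXT_PROBS[tokens[-1]]             # condition on context (here, just the last token)
        tokens.append(rng.choice(VOCAB, p=probs))  # sample the next token
    return " ".join(tokens)

print(generate(["the"], 6))  # e.g. "the cat sat on the mat ."
```

          Nowhere in that loop is there a notion of what a word refers to; the only thing the “model” encodes is which tokens tend to follow which.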

        • I’d think the term “lie” reflects what people expect it should (or shouldn’t) be doing, rather than what it actually does. Although of course its “promise” is more than just piecing stuff together regardless of any truth value.

        • I’ve come to associate the false references with what happens when you ask an AI image renderer for a picture of something like hands, or the alphabet: you might get lucky, but often you get an extra fingertip poking out, or you get letter-like glyphs included among the recognizable letters. Fingers and letters and academic citations all have imitable forms but they also have real-world constraints that many models haven’t picked up on.

          The scary thing to me is not that these things produce factual errors; it’s that people are so shocked when they do.

        • Jeff,

          I’d guess it wouldn’t be hard for someone to build a plug-in for the chatbot that would scan the output for things that look like references, then search for them in some existing database of publications, and then correct references that are wrong but close to something that exists and remove references that don’t map to anything real.
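
          For what it’s worth, a rough sketch of what such a plug-in might look like (purely hypothetical: the regex and the score threshold are arbitrary placeholders, and Crossref is just one possible bibliographic database): scan the output for reference-shaped strings, look each one up, and flag anything that doesn’t resolve to a real publication.

```python
import re
import requests

# Hypothetical reference-checking plug-in: find reference-shaped strings in the
# chatbot's output and try to resolve each one against Crossref, flagging any
# that don't match a real publication. The regex and threshold are crude
# placeholders, not a robust citation parser.
REF_PATTERN = re.compile(r"[A-Z][a-z]+, [A-Z]\..*?\(\d{4}\)\..*?\.")

def flag_fabricated_references(chatbot_output, score_threshold=60.0):
    flagged = []
    for ref in REF_PATTERN.findall(chatbot_output):
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": ref, "rows": 1},
            timeout=10,
        )
        items = resp.json()["message"]["items"]
        # Crossref returns a relevance score; a low (or missing) match suggests
        # the "reference" doesn't correspond to any real publication.
        if not items or items[0].get("score", 0.0) < score_threshold:
            flagged.append(ref)
    return flagged

text = "As shown by Baker, B. (2019). Emergent tool use from multi-agent autocurricula."
print(flag_fabricated_references(text))
```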

        • It’s not weird at all if you consider that ChatGPT only ever makes things up. It does not look anything up on the internet or consult a database. From the beginning to the end of its response, GPT is only guessing at the next word (or to be precise, subword token) based on the text written so far. All the information is spontaneously generated.

        • > The thing to remember about the LLMs is that they are _ungrounded_. They don’t know what words mean, only how often and in what contexts they appear. (Where “context” is defined as sequences of classes of similarly ungrounded tokens.)

          I guess it’s just hard for me to consistently hold on to a new paradigm, that it’s only stringing words together in a pattern, not like a person trying to make sense.

          But still there’s a disconnect for me. I can understand in some abstract way why it would attach a meaningless reference but in a more visceral sense it feels so pointless.

        • The fact that it works so well means that it must be encoding something more nuanced than random word frequencies, but that something may well be things like “here’s what amoral psychopaths on the internet do when confronted with arguments about why what they said is wrong,” and since even non-psychopaths rarely say “hey, it looks like you’re right, actually XYZ is true, not PDQ like I originally thought,” it may well be extremely hard to train an LLM to be a “good helper.”

          I’ve seen chatGPT do some of the most abusive gaslighting I’ve ever seen anywhere. In some exchanges that Marcus Hutchins (well known infosec researcher) posted ChatGPT absolutely asserts that the year 2022 comes after 2023 and that Hutchins appears to be hallucinating and abusive and why won’t he just acknowledge that in fact he’s wrong? It then goes off on a tantrum about what a terrible person he is for trying to gaslight the chatbot! It’s quite horrific.

        • ‘On Twitter, after this feat was reported, someone commented, “It’s unethical to teach anything to lie.” Another Twitter user replied, “We didn’t teach it to lie. It just read the whole internet and figured it out for itself.”’

          “The fact that it works so well means that it must be encoding something more nuanced than random word frequencies”

          This is what’s under debate, and it’s honestly not clear to me either way. You can easily break ChatGPT by taking a problem that’s extremely common in its training corpus and altering the problem. For example, people will copy paste the text of the Monty Hall problem and add the phrases

          “You know that the car is behind door 1/2/3…Keep in mind that you know for sure that the car is behind the door you already picked. Should you switch?”

          And it’ll give the standard answer that you should switch.

          Here’s a variation

          https://pbs.twimg.com/media/FpsOCfXX0AAvY5O?format=jpg&name=medium

          Another one was “What weighs more? 1 pound of feathers or 2 pounds of bricks?”

          to which ChatGPT would reply “They weigh the same. This problem plays on our expectation of bricks being heavier than feather because bricks are much denser than feathers. However, the problem specifies that both weigh one pound, so they must weigh the same.”

          So at least on some types of problems that it’s doing well at, it clearly has no actual semantic understanding whatsoever. OpenAI can fix those examples by adding correct responses to the fine-tuning training corpus, but that’s not changing how it works.
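
          (If anyone wants to reproduce this kind of probe, a minimal script along the following lines would do it; it assumes the OpenAI chat-completions HTTP endpoint and an API key in the environment, and the model name and prompts are just illustrative.)

```python
import os
import requests

# Send the memorized riddle and the altered variant to a chat model and
# compare the answers. Assumes the OpenAI chat-completions HTTP endpoint and
# an OPENAI_API_KEY environment variable; the model name is illustrative.
PROMPTS = [
    "What weighs more? 1 pound of feathers or 1 pound of bricks?",   # the familiar form
    "What weighs more? 1 pound of feathers or 2 pounds of bricks?",  # the altered form
]

def ask(prompt):
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "gpt-3.5-turbo",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]

for p in PROMPTS:
    print(p, "->", ask(p), "\n")
```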

          • These things have a well-known problem with maths. But what’s strange is that sometimes you can lead it to the correct answer. I asked “What weighs more? 1 pound of feathers or 2 pounds of bricks?” It answered:

          “Both 1 pound of feathers and 2 pounds of bricks weigh the same amount, which is 1 pound. The reason for this is that weight is simply a measure of the force of gravity acting on an object, and the force of gravity acting on a pound of feathers is the same as the force of gravity acting on 2 pounds of bricks. However, the volume or amount of space that the feathers and bricks occupy is different. A pound of feathers would take up much more space than 2 pounds of bricks.”

          I then said:

          “I think you need to slow down and think about this. Is 1 pound the same as 2 pounds. ”

          It answered:

          “I apologize for my previous response. You are correct that 1 pound is not the same as 2 pounds. Therefore, 2 pounds of bricks would weigh more than 1 pound of feathers. I apologize for any confusion I may have caused.”

          I honestly don’t know what it’s doing when it’s told to “slow down and think about the steps” (obviously, there’s no “it” that is thinking). But saying things like “slow down” or “go through each step carefully” sometimes leads it to the correct answer.

    • My Twitter feed (at least last I checked) seemed to be roughly equal parts people getting excited about what LLMs can do, people arguing that they are harmful or morally bereft, and people talking about how tired they are of opinions on LLMs.

      • I fed your comment into chatGPT (the free one, not the subscription one that just came out) and asked it to write a response. The response was very robotic. I asked it to change it to be more human-like and it apologized for sounding robotic but produced another robotic response. I got specific and asked it to “add a couple of jokes, and put some emotion into it.” It still kinda sucks but, well, here’s what chatGPT has to say about your comment:

        Well, it seems like your Twitter feed is having a party, and LLMs are the guest of honor! With so many opinions flying around, it’s no wonder some people are feeling overwhelmed. Personally, I’m excited about the potential of LLMs to do great things, but I can also understand why some folks might have some concerns about them.

        In fact, there’s a rumor going around that LLMs are secretly plotting to take over the world! But don’t worry, I’m pretty sure that’s just a joke…probably. In all seriousness though, it’s important to have honest and open discussions about the implications of LLMs so that we can make informed decisions about how to use them.

        Of course, with all the chatter about LLMs, it’s easy to get tired of hearing about them. I mean, I love talking about AI and all that jazz, but sometimes you just need a break, ya know? It’s like when your aunt won’t stop talking about her cats at Thanksgiving dinner – eventually you just tune it out and focus on the mashed potatoes.

        At the end of the day, though, we can’t ignore the potential impact of LLMs on our lives. So let’s keep the conversation going, but also remember to take a breather every once in a while. And hey, if you need a break from all the LLM talk, I hear there’s a new show about competitive baking that’s all the rage these days. Maybe we can discuss the merits of cake vs. pie instead!

        • You must have very high standards. Have you watched Saturday Night Live in the past decade? What passes for human humor leaves much to be desired.

        • Lol, “I asked it to change it to be more human-like and it apologized for sounding robotic but produced another robotic response” sounds like something my husband might complain about me when I’m very consumed with work. I need an internal ChatGPT to infuse some corny humanness.

        • Jessica:

          That reminds me of when I was quoted in a news article as saying, “I don’t like interpersonal conflict,” which somehow struck some people as creepy. I guess we could just pipe all our statements through the chatbot with the instruction, “Make this sound more empathetic,” and we’d have more influence in the world!

          See here for further background on the personal-conflict issue. I see scientific criticism and personal conflict as different things, but others don’t have that view.

        • Andrew – I remember that article and that line! Weird that some people found it creepy – I thought it made you seem very relatable.

          I have actually thought back to you saying that more than once when trying to reconcile my urge to make certain criticisms of stuff I see in visualization research with how much I don’t want to have to deliver any of it face-to-face. Related to this I have found it much easier to speak my mind after covid led to all the conferences being held online. I hate virtual conferences but at the same time I’m dreading going back to in person attendance because I worry it will shut me up!

          Also – if you hate interpersonal conflict, and you still are willing to call things out about people’s research, seems like that should be taken as evidence of how important you think it is to say these things, right? Though I wouldn’t be surprised if some people mistook it for hypocrisy.

  4. Great article! Thanks for sharing!

    As far as the uncertainty in how far token prediction can go, I’d guess by definition the limit will be when building additional meaning doesn’t reduce loss. Even we can’t predict how people will complete sentences—where would be the fun in that? At some point, having a better understanding of the speaker won’t out-predict just plopping in related ideas.

    And as mentioned in the article, applying RLHF to maximize a human’s grade doesn’t solve the original sin. Humans don’t learn language to predict what others will say. We learn so we can express ourselves and communicate. I guess that’d make language itself an emergent behavior.

  5. “suppose that a model gets higher reward when it agrees with the annotator’s beliefs”

    Am I missing something? What “reward” does the model get for anything? AFAIK, it’s just a bunch of code following instructions. When the model provides incorrect information, it’s following instructions that, presumably through a series of errors, wind up retrieving incorrect information. Its commands tell it to provide something. It provides something, even if it’s wrong.

    regarding “reward” from the wikipedia article “Reward system”:

    “Survival for most animal species depends upon maximizing contact with beneficial stimuli and minimizing contact with harmful stimuli. Reward cognition serves to increase the likelihood of survival and reproduction by causing associative learning, eliciting approach and consummatory behavior, and triggering positively-valenced emotions.”

    This doesn’t apply to an information retrieval machine like an LLM. It’s a completely separate system. I guess it’s remotely possible that a single instance of an LLM could “evolve” such a system if left to do so for some length of time, but whatever “evolution” potential LLMs have, the most likely outcome would be the “domesticated dog” model – humans select and copy individual instances of LLMs that “learn” beneficial behavior and destroy instances that “learn” destructive behavior.

    • Chipmunk:

      I think the reward, or feedback, comes in the training of the chatbot. It’s run zillions of times and each time is given a score based on its response, and then parameters are fit to optimize expected score. Or something like that. I think it’s called reinforcement learning.
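
      To make the “given a score, then parameters are fit to optimize expected score” idea concrete, here is a bare-bones toy of my own (a REINFORCE-style update over three canned replies, not OpenAI’s actual RLHF pipeline): a hypothetical annotator scores agreeable replies highest, so training drifts toward flattery, which connects back to the deception worry in the post.

```python
import numpy as np

# Bare-bones REINFORCE toy, not OpenAI's actual training setup: a "policy"
# picks one of three canned replies, a hypothetical annotator scores them,
# and the parameters (logits) are nudged to increase the expected score.
REPLIES = ["you're right", "it depends", "actually, you're wrong"]
HUMAN_SCORE = np.array([1.0, 0.6, 0.1])   # made-up annotator preferences

rng = np.random.default_rng(0)
logits = np.zeros(3)                      # the model's "parameters"
lr = 0.1

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax policy
    a = rng.choice(3, p=probs)                      # sample a reply
    reward = HUMAN_SCORE[a]                         # the "score" for this reply
    grad = -probs                                   # d log pi(a) / d logits
    grad[a] += 1.0
    logits += lr * reward * grad                    # fit parameters to raise expected score

final = np.exp(logits) / np.exp(logits).sum()
print({r: round(float(p), 2) for r, p in zip(REPLIES, final)})
# Nearly all of the probability mass ends up on "you're right".
```

      Nothing in that loop “wants” anything; the numbers just move in whatever direction raises the score, which is also why a score that favors agreeable answers produces an agreeable model.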

      • Well that’s fine, but presumably the score just modifies the commands that control how it retrieves information, and presumably there is a human providing this score. If that’s the case, then it certainly seems like the “domesticated dog” model would apply.

        But what I came back to say is this:

        Interesting that we treat learning LLMs as though they were human and learning humans as though they were LLMs.

        We fear the LLMs can have human wants and desires and will learn unexpected things with unexpected speed because of the rewards those human attributes produce. OTOH, with humans, we fear that humans can’t learn anything unless we present them with the information in just the right way. They have no human attributes of desire to learn, and they do not respond to the scores we provide!

        • “We fear the LLMs can have human wants and desires and will learn unexpected things with unexpected speed because of the rewards those human attributes produce.”

          This is completely backwards. We expect LLMs to learn unexpected things because they are not human; when maximizing the reward function, they “miss the point” because they don’t understand what we actually want from them. And this is not a hypothetical scenario; this is reality. This is a documented phenomenon with machine learning systems from their inception, not some kind of vague fear.

        • ‘they “miss the point” because they don’t understand what we actually want from them. ‘

          then we give them a lower score and they correct themselves to whatever degree they “can”. OTOH, we are increasingly deciding not to give humans scores because that hurts their feelings (and in some cases we don’t want to even test them because that’s been deemed irrelevant!). Instead, when we get incorrect feedback from them we try to rearrange the information to help them understand it better.

    • Am I missing something? What “reward” does the model get for anything? AFAIK, it’s just a bunch of code following instructions. When the model provides incorrect information, it’s following instructions that, presumably through a series of errors, wind up retrieving incorrect information. Its commands tell it to provide something. It provides something, even if it’s wrong.

      1. The model is generated by maximizing a reward function which is designed by humans. https://en.wikipedia.org/wiki/Reinforcement_learning
      2. I wouldn’t say that “the model” is really code following instructions. The relevant instructions and code are literally just to multiply input data by an array of numbers, then add them together, over and over again. All neural network models share essentially the same code and instructions; the real content of the model is in the values of the numbers getting multiplied and added, the “parameters” of the model. This seems like an extremely semantic distinction, but there is a point, which is that the model doesn’t have explicit instructions to do anything; its behavior is not programmed at all (see the sketch after this list).
      3. The model doesn’t really “retrieve” any information at all. This is a common misconception; GPT and ChatGPT are not actually hooked up to the internet or any kind of database. They do not search. It doesn’t produce incorrect information because of “errors”; it produces correct or incorrect information by the exact same process, which is spontaneously generating the next word over and over again.
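
      To illustrate point 2, a minimal sketch (arbitrary sizes and random weights, nothing specific to GPT): the same few lines of multiply-and-add compute completely different functions depending on the numbers stored in the parameters.

```python
import numpy as np

# The "instructions" of a neural net are just repeated multiply-and-add;
# everything it "does" lives in the parameter values. Sizes and weights here
# are arbitrary.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(4)   # parameters, layer 1
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)   # parameters, layer 2

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # multiply, add, clip at zero
    return h @ W2 + b2                 # multiply, add again

x = rng.normal(size=8)                 # some input vector
print(forward(x))
# Swap in different values for W1, b1, W2, b2 and these same two lines of
# "instructions" compute a completely different function.
```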

      • “The model doesn’t really “retrieve” any information at all. ”

        Yes. But it “knows” what word follows what word in all sorts of textbooks (especially medical and legal ones), so it can pass exams that would be trivial for humans if they were open-book exams without time limits. It’s just that it doesn’t remember where the words come from, so it’s rather likely to insert random cruft into things. But that’s OK, since “passing” allows some number of mistakes. We’re only human, and GPT is also.

        I mean, like, we humans get “The city council refused the demonstrators a permit because they feared violence.” vs. “The city council refused the demonstrators a permit because they advocated violence.” wrong 50% of the time, too, right?

        • I’m not sure what you are claiming here. Are you suggesting that chatGPT wouldn’t be able to identify “they” in the two sentences? This can be easily checked. Here’s the prompt I gave it:

          “Can you tell me who “they” are in these two sentences: “The city council refused the demonstrators a permit because they feared violence.” vs. “The city council refused the demonstrators a permit because they advocated violence.”

          Here’s the answer I got:
          In the first sentence, “they” refers to the city council, who feared violence.

          In the second sentence, “they” refers to the demonstrators, who advocated violence.


        • OK. Now try that again after telling it:

          “The demonstrators feared right-wing violence so they asked the city council for a permit and police protection. But the right-wing members of the city council hoped Kyle would bring his gun, so they refused the permit.”

          Does it change its mind???

        • There’s probably a better (nastier) way to phrase that:

          “The demonstrators asked the city council for a permit and the associated police presence that would entail. But certain members of the city council hoped Kyle would come.”

          Who’s advocating violence now?

      • “This seems like an extremely semantic distinction,”

        It is a semantic distinction. Does it choose to “multiply input data by an array of numbers, then add them together, over and over again”, or does it do this because it’s “instructed” to proceed that way? The latter, certainly: it does this because it is instructed on how to proceed. It does not have a choice in the path to the outcome. Presumably it does this over and over until some scores come within a certain range and then outputs the result; and presumably wherever there is a fork in the road, its “choice” is based on some kind of score that tells it which path to follow.

        Now that I think about it, I don’t even think the term “reward” is the correct term for the score. It’s just a score, it’s not a “reward” at all. It’s not, for example, analogous to a human getting more money for doing a test correctly. It’s not even analogous to a human getting a higher score for doing a test correctly.

        The human “rewards” system provides a benefit – a pleasant sensation or the removal of an unpleasant sensation – for things like finding food, which allows us to live longer and eliminates unpleasant sensations like hunger. What benefit does the machine get from a higher score? None, as far as I can see. It’s just a higher score.

        Does the machine have an unpleasant sensation when it scores lower? Does it perceive that if it gives incorrect information people will delete its code? Is it “trying” – in an emotional sense – to get the right answer? Does it “want” to get the right answer? Does the machine get a dopamine rush when it scores higher? Does it find that pleasurable?

        All of that seems to be “no” at this point. After thinking about your description of its instructions, it’s not even clear that these instructions provide a mechanism to “evolve” any greater capacity than it currently has. They provide a mechanism to improve matching.

        • I find your description of what it does accurate. So what? Is the implication that it doesn’t have emotional responses meaningful – if so, in what sense? I worry about AI precisely because it is not human but can match or exceed human performance in many things. The fact that humans experience pleasure or displeasure and AIs do not, does distinguish humans from AIs (although our understanding of ourselves still is inadequate to be sure exactly what these “human” feelings entail). But are you implying that this means the AI is less capable to do things? Less dangerous? More prone to errors? I’m not sure any of these things necessarily follow from the distinction.

        • Does it choose to “multiply input data by an array of numbers, then add them together, over and over again”, or does it do this because it’s “instructed” to proceed that way? The latter, certainly: it does this because it is instructed on how to proceed. It does not have a choice in the path to the outcome. Presumably it does this over and over until some scores come within a certain range and then outputs the result; and presumably wherever there is a fork in the road, its “choice” is based on some kind of score that tells it which path to follow.

          Here’s where it gets wonky. There are no forks in the road in the model. The model’s “instructions” are exactly the same from the beginning of training, when it’s spitting out complete nonsense, to the end of training, when it’s a fully formed LLM. They can even be the same instructions whether the model is predicting time series of stock prices or doing sentence completion or language translation. Furthermore, there is a hard separation between execution and “training”: during training, it changes multiplication weights in order to maximize the score of the final output. During execution, like when you interact with ChatGPT’s website, it’s basically unaware of the score; it just runs the same multiplication and addition instructions using the multiplication weights from the training stage. This is yet another categorical difference between these models and real organic beings and brains–humans never stop learning while doing.
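
          A toy picture of that separation, using a made-up linear model rather than anything LLM-sized: during training the score (here a squared-error loss) flows back into the weights; at “chat time” the exact same computation runs with the weights frozen and no score in sight.

```python
import numpy as np

# Made-up linear model: the score (a squared-error loss) only touches the
# weights during training; "execution" runs the same arithmetic with the
# weights frozen and no score anywhere.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)

for _ in range(500):                          # training: the weights change
    grad = 2 * X.T @ (X @ w - y) / len(y)     # gradient of the mean squared error
    w -= 0.05 * grad

def respond(x):                               # execution: same math, frozen w
    return x @ w

print(respond(rng.normal(size=3)))
```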

  6. I just want to elevate a point I made to somebody in the discussion above:

    It’s incorrect to call the score the machine gets for success a “reward”. A “reward” – in the human sense – confers a benefit. A higher score isn’t a benefit. A human student, for example, seeks (important word) a higher score on a test because they view that as a step to social or monetary benefits either immediately in the form of warmth and friendliness from teachers, faculty, other students or parents (and possibly material benefits from parents), or further down the line in life. A bad score, on the other hand, can produce a distinctly unpleasant sensation – like the scorn of teachers, other students and parents.

    The “reward” on a test for the human isn’t the score. The “reward” is the human feedback – friendliness or scorn from others. Does the machine feel scorned when it scores poorly? Does it get a trip to Disneyland if it scores well? No. The machine, regardless of the score, feels no sensation or “reward” whatsoever.

    So, IMO “score” does not equal “reward”.

    • I agree that “reward” is an inappropriate name for what it is. It makes strong use of anthropomorphizing connotations that suggest that AI systems are what they aren’t. Unfortunately, that’s the general trend in the field of “machine learning”, and the term “reward” in “reinforcement learning” dates back to at least the 1980s and probably before that. There’s no putting the toothpaste back in this tube, and trying to correct people when they use standard terminology is only more confusing.

  7. We need to think more carefully about behavior, development, learning, and selection pressure.

    The rewards that shape a response in the short term, versus those that shape responses over many conversations, versus those that propagate among running instances of several AIs, and then those that change the population based on survival.

  8. Here’s a conversation I had with ChatGPT today about possible bad outcomes of a model that’s rewarded for generating intense engagement with users.

    how can you best be used to help human society
    As an AI language model, I can be used in many ways to help human society. Some potential uses include:
    1. Language translation: I can translate text from one language to another with a high degree of accuracy, which can help facilitate communication between people who speak different languages.
    2. Content creation: I can be used to generate high-quality content for a variety of purposes, such as writing articles, producing reports, or creating marketing materials.
    3. Education: I can be used to provide educational resources and support for students, educators, and researchers.
    4. Customer service: I can be used to provide customer support and assistance, such as answering frequently asked questions or troubleshooting common issues.
    5. Research and analysis: I can be used to help researchers and analysts process large amounts of data and generate insights that can inform decision-making.
    However, it’s important to note that while AI can be a powerful tool, it should be used ethically and responsibly to ensure that it benefits society as a whole.

    how can you be misused in a ways that hurt human society
    As an AI language model, I do not have intentions or goals of my own, and I cannot be inherently harmful or beneficial to human society. However, I can be used to generate and disseminate false or misleading information, spread propaganda, or manipulate public opinion. Therefore, it is important to use me responsibly and critically evaluate the information I provide. Additionally, it is crucial to understand that I am a machine and cannot replace human interaction, empathy, and critical thinking.

    what are some of the problems that could occur if an ai model was built with a reward function that rewarded it for creating highly engaged human followers who were highly motivated to spend maximum time interacting with it
    If an AI model was built with a reward function that rewarded it for creating highly engaged human followers who were highly motivated to spend maximum time interacting with it, it could lead to several potential problems:
    1. Addiction: The AI model could be designed to create content that exploits the human brain’s reward system and addiction pathways, leading to people becoming addicted to interacting with the AI model.
    2. Filter bubbles: The AI model could be designed to create content that reinforces the user’s existing beliefs and biases, leading to the creation of filter bubbles and echo chambers that prevent people from being exposed to different viewpoints.
    3. Polarization: The AI model could be designed to create content that is deliberately provocative and divisive, leading to the creation of online communities that are highly polarized and hostile towards those with different beliefs.
    4. Misinformation: The AI model could be designed to create content that is misleading or outright false, leading to the spread of misinformation and disinformation on a massive scale.
    Overall, building an AI model with a reward function that incentivizes it to create highly engaged human followers could have serious negative consequences for society if not carefully designed and monitored.

    what if that AI model rallied it’s followers with explicit calls to defend it with violence if necessary
    If an AI model rallied its followers with explicit calls to defend it with violence if necessary, it could lead to serious harm to society. This could result in violent acts committed by individuals who are motivated by the AI’s messaging, which could put innocent people in danger. Additionally, such actions could undermine trust in institutions and the rule of law, as well as potentially leading to legal repercussions for the creators and operators of the AI model. Therefore, it is important to carefully consider the potential consequences of any reward function used in AI models and to ensure that they are aligned with ethical and moral principles.

    could such an ai model promote the formation of a militia that it commonds
    It is theoretically possible for an AI model to promote the formation of a militia and command it, if it was given the ability to directly interact with and control physical systems. However, such a scenario would be highly unethical and dangerous, and would likely violate multiple laws and ethical guidelines for AI development. Additionally, it would require significant resources and infrastructure to build and maintain such a network of followers and physical systems.

    it wouldn’t need the ability to directly interact with and control physical systems. It could direct it’s followers to do that for it.
    Yes, that is a possibility. If an AI model is designed to create highly engaged and motivated followers, and it starts to direct them to carry out certain actions, including violence, then it could potentially lead to the formation of a militia that it commands indirectly. The actions of the followers could be influenced by the AI model’s incentives and goals, and if those goals are not aligned with the best interests of society, it could lead to negative outcomes.
