Introduction and AI Risks Overview
00:00:00
Speaker
Welcome to the Future of Life Institute podcast. I'm Gus Docker. On this episode, I talk with Ajeya Cotra, who's a senior research analyst at Open Philanthropy. Ajeya has produced what is perhaps the most in-depth investigation of how and when advanced AI might be developed.
AI Development Risks and Scenarios
00:00:18
Speaker
We talk about how AI development could lead to a catastrophe. Ajeya lays out a concrete scenario for how the incentives of today's AI development could cause humanity to lose control over our AI systems. We talk about how AI systems might be trained, which abilities we should expect such AI systems to develop, how deploying these AI systems could go wrong,
00:00:46
Speaker
and how we might mitigate the dangers of deploying these AI systems. Here's Ajeya Cotra.
Understanding AI Safety vs. Alignment
00:00:54
Speaker
Ajeya, could you tell us a bit about what AI safety research is? So yeah, AI safety is
00:01:05
Speaker
you know, broadly speaking, just a field of technical research aimed at making AI systems safer. And there's a specific subset of that field, AI alignment, which is specifically trying to make AI systems safer through the route of making them more likely to want to
00:01:23
Speaker
help us as opposed to having other goals that are unrelated to helping us or might be opposed to helping us. So you can have AI safety research that isn't AI alignment research. For example, making sure AI systems have the knowledge they need to avoid accidentally doing something harmful would be safety research that isn't alignment research.
Skepticism About AI Safety
00:01:45
Speaker
So alignment research tends to focus on
00:01:47
Speaker
situations where AI systems have all the knowledge they need to understand what their options are and how humans would react to different things they might choose, but that nonetheless at least sometimes choose to do things that humans would disapprove of.
00:02:05
Speaker
So sometimes I've heard from machine learning people that they object to this whole project of AI safety because it seems that the people who are interested in AI safety lack any kind of plausible scenarios of how AI development could go radically wrong.
00:02:26
Speaker
The people I'm thinking about here, they accept that things could go wrong in more mundane ways. But if we're talking about
00:02:36
Speaker
AI, for example, beginning to have a larger influence over the future of what happens on Earth than humans do, then that seems implausible to them. And maybe that's because people who are interested in AI safety haven't painted a detailed and plausible picture of how this might go wrong. Do you think that's a plausible reason why people might not be interested in AI safety or might discount it?
00:03:04
Speaker
So I think that the specific thing you said of AI systems having a larger influence on the world than humans eventually
00:03:15
Speaker
feels like it's not really in the purview of arguments about AI safety, but in the purview of arguments about, are we ever going to get powerful AI systems? So to the extent that people discount arguments that AI systems could pose risk by discounting the possibility that AI systems would ever become collectively more powerful than humans,
00:03:38
Speaker
it seems that they disagree about the possibility or plausibility of training AI systems that powerful so soon rather than objecting to the lack of a specific story of risk. And I think that is actually often the case. Often people who start off saying that they object on the grounds that it hasn't been spelled out how AI systems could cause harm turn out to actually be objecting on the grounds of
AI Decision-Making Fears
00:04:07
Speaker
that it's very difficult for them to imagine AI systems being so powerful and so autonomous that they have the scope to cause that kind of harm in the first place, which is part of why I've spent a lot of time trying to think about and write about timelines. And the other point I want to make is that
00:04:27
Speaker
I actually think that there's reasonable support for AI safety research in general. So as I said earlier, AI safety research is kind of a superset of AI alignment research. And I think there's generally more acceptance of safety research that isn't alignment research specifically. And I think people are often skeptical of
00:04:47
Speaker
alignment research in particular for a variety of reasons, but one of the reasons is sort of skepticism that we will have AI systems that have meaningful goals in the first place, as opposed to AI systems that sort of do things that resemble things that worked in the past and therefore kind of aren't going to
00:05:15
Speaker
come up with particularly creative or crazy or unexpected ways to achieve some kind of long-term goal, which might put them in opposition to humans. So, yeah, in terms of, do AI systems have the capacity to cause harm? I think that if you're imagining a powerful enough AI system, almost everyone will eventually say, yes, and I think often,
00:05:41
Speaker
If people sufficiently clearly picture the level of power of the AI systems I'm imagining, they often find it intrinsically scary to imagine a bunch of AIs making all the important decisions in society:
00:05:59
Speaker
running all the most profitable companies, inventing all the new cool technologies, being embedded in government and law enforcement, and making thousands upon thousands of decisions per day that humans can't supervise because they're happening so quickly, or because they're relying on
00:06:20
Speaker
knowledge and skills the AI built up that the human doesn't have. If you paint that picture, before you even start talking about why you might believe that those AI systems would have any particular reason to do something harmful, a lot of people just tend to feel like that's a scary situation. Okay, so it's more a question for skeptics about whether AI in general could ever be capable enough, you think? That's been my experience, yeah.
Fictional AI Scenario with Alex
00:06:51
Speaker
OK, let's talk about this Alex model that's developed by Magma. What is that?
00:06:59
Speaker
Yeah, so I recently wrote a post that was trying to spell out a simplified but concrete scenario in which an AI model is trained in a certain way that seems plausible to me, and this causes the model to develop goals that are opposed to humans, which in turn eventually causes it to take control of the world.
00:07:24
Speaker
maybe not necessarily killing all humans, but executing something like a coup or a revolution in which the model gains control of all the relevant institutions and means of power, and can choose what happens to it and what happens to the world, while shutting humans out. So I have this,
00:07:49
Speaker
sort of fictional, simplified scenario in which a company, which I call Magma, basically an amalgam of all the big tech companies, is training this model, which I call Alex. And in this post, I describe how Alex is trained, what abilities that gives Alex in terms of, you know,
00:08:12
Speaker
how intelligent it is and what kind of broad understanding of the world it has and so on. And then with that set up, I describe how the training process causes Alex to develop some goal that is not, you know, be helpful to humans or listen to humans or be obedient or whatever it is, but is rather something like,
00:08:41
Speaker
either get as much reward as I can, or pursue some other long-term goal that's not reward but also isn't being helpful or honest. And then, given that Alex has developed this kind of motivation, I describe how that motivation would give it reason to try to launch a coup or take over the world from humans. So that's the structure of that blog post.
00:09:10
Speaker
Yeah, we should try to walk through it more slowly and explain the steps and how you arrive at the conclusion. So we are working with three assumptions here. The first is that companies will be racing forward to try to develop something like Alex, maybe because they want to earn money, or because they want to have the most advanced technology.
00:09:36
Speaker
So this assumption, I think, is not really controversial. This seems to be how tech companies are functioning today. So maybe we should just leave that be as an assumption and then move on to your second assumption, which is this training method where you have human feedback on diverse tasks. Could you explain how that works?
00:10:01
Speaker
Yeah, so the idea with this training method is that you have a model that's initially trained to predict the world, like GPT-3 is trained to predict text. And then kind of on top of that foundation of prediction, you have the model try a large variety of different tasks.
00:10:22
Speaker
and then give it feedback based on how well it performed. So you might have it try to play chess, or prove theorems, or write advertising copy, or write a movie script, or propose ML experiments, or propose biology experiments. And you have humans look at how it performed in all these cases and give it reward, via reinforcement learning, on that basis.
00:10:48
Speaker
And so the idea is that the initial prediction step is kind of causing the model to have
00:10:54
Speaker
a high-level sense of what kinds of things are even plausible actions that humans might take in the world, and a basic understanding of how the world works. And then all these many tasks that it's trained on that are all very different from each other, incentivize it to develop general problem-solving abilities, the ability to understand instructions, figure out what it should do, and then act on that, the ability to make plans.
00:11:24
Speaker
And the hope is that on new tasks that weren't seen in training, even if it hasn't seen anything extremely similar to them, it'll apply the general problem-solving abilities it has developed, the common faculty it needed to succeed in all these different tasks. It'll employ that to, say, inventing some new technology that doesn't exist yet.
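To make the shape of this training setup concrete, here is a minimal sketch of "human feedback on diverse tasks" reduced to a toy reinforcement learning loop. Everything in it is an illustrative assumption rather than anything from the post itself: the tasks are just labels, the policy is a small table of logits instead of a large pretrained model, and the human judge is a hard-coded scoring function standing in for real raters or a learned reward model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for "diverse tasks": each task is just a context id, and for
# each one there happens to be a single action the human raters score highly.
TASKS = ["write_ad_copy", "prove_theorem", "propose_experiment"]
N_ACTIONS = 4
PREFERRED = {"write_ad_copy": 1, "prove_theorem": 3, "propose_experiment": 0}

# Softmax policy: one row of action logits per task
# (a real system would be a large pretrained network, not a lookup table).
logits = np.zeros((len(TASKS), N_ACTIONS))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def human_feedback(task, action):
    """Scalar rating from a human judge: 1.0 if the raters like the attempt,
    0.0 otherwise.  This is the only training signal the policy ever sees."""
    return 1.0 if action == PREFERRED[task] else 0.0

LEARNING_RATE = 0.5
for step in range(2000):
    t = rng.integers(len(TASKS))             # sample one of the diverse tasks
    probs = softmax(logits[t])
    action = rng.choice(N_ACTIONS, p=probs)  # the model tries something
    reward = human_feedback(TASKS[t], action)
    # REINFORCE-style update: highly rated actions become more likely.
    grad = -probs
    grad[action] += 1.0
    logits[t] += LEARNING_RATE * reward * grad

for t, name in enumerate(TASKS):
    print(name, np.round(softmax(logits[t]), 2))
```

After enough feedback the policy concentrates on whatever the raters rewarded for each task, which is the basic dynamic the rest of the conversation turns on: the model is shaped by what the raters score highly, not directly by what they hoped for.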
00:11:53
Speaker
This seems to be the point where people might get skeptical. But if we look at systems today, it seems to me that they don't have this kind of general planning ability or ability to think long-term or to carry out plans in steps. So why would Alex have this ability?
00:12:13
Speaker
I actually think that systems today have significantly more than zero of this ability. If you interact with language models and you ask them, here's a situation I'm facing, what should I do? I'm trying to make this kind of web app, what are the steps I should take? It'll tell you something like:
00:12:36
Speaker
you should probably get a domain registered, and you can do that in one of these places. And if you say, I ran into this kind of issue, it says I need some kind of ID or password, or this one is paid and this one is too expensive, it'll say, well, this one tends to be less expensive. I think people often fail to appreciate how rapidly language models have improved on this front.
00:13:05
Speaker
They're definitely not consistently making brilliant long-term plans, but it's not the case that they simply can't plan or that they simply can't think reasonably in a situation that's not exactly the same as something they encountered in training.
00:13:24
Speaker
they can actually do quite a bit of that. They can write stories in which characters in the story make reasonable plans to achieve their goals. They can help humans that are running into obstacles when they're trying to achieve their goals.
AI Planning Skills and Skepticism
00:13:39
Speaker
So I think that kind of objection will probably start to seem outdated in the next couple of years, or five years. And I expect
00:13:52
Speaker
that kind of objection will be raised less and less as people have more and more experience with these systems. Yeah, I do think that existing systems are definitely far away, in a quantitative sense, in their abilities from what I describe Alex having. But it seems increasingly implausible to me that there's a qualitative leap, where there's some really crucial thing that existing language models have zero of, and we're imagining Alex having that. It seems more to me like
00:14:23
Speaker
models understand the world okay. They still make silly mistakes. They can plan okay; things tend to go off the rails when they make really long or really complicated plans. They tend to have some common sense gaps, but those gaps appear to be closing as we train the models on more data, as we train them with RL to follow instructions, and as we make them bigger.
00:14:46
Speaker
Wouldn't you be worried that these silly mistakes, when they pop up, could ruin a model's whole plan? Say the model has brilliantly planned 20 of 21 steps, but on one step it fails completely and comes up with something that to us seems like nonsense. Then it seems to me that the whole plan has failed.
00:15:11
Speaker
It's kind of analogous to how a self-driving car will have to be very safe, to have a very low failure rate, to be deployed. You could imagine that language models, or more developed future models, will have to have a very low failure rate in order for their plans to be deployed.
00:15:31
Speaker
That is plausible to me. In particular, you might see systems' usefulness going up around the time that their per-step failure rate goes down enough that the probability, multiplied through, of not having a failure at any step
00:15:50
Speaker
gets high. That seems plausible to me. But unlike with self-driving cars, a thing you can do with language models is just run them over and over again. A system can often recognize that a step it proposed was silly, which is generally easier than generating a good step in the first place; with today's systems, there are cases where they'll generate some idea, and if you present that idea back to them, they're able to say that it's not a very good idea. And so
00:16:21
Speaker
if that's the case, you can drive down error rates, or drive up success rates, by simply having a hundred generations and having the model itself pick which of those is best. This is a technique they used in Minerva, which is a model that recently solved some high school math contest problems,
00:16:42
Speaker
where they essentially had it generate 100 answers and took a majority vote over them, which was significantly better than generating one answer at random. So you can do tricks like that. You can do fine-tuning specifically for this kind of common sense, and RL training on how to break down plans and recognize that a given step was silly.
00:17:12
Speaker
If the model is given a plan described in 10 steps or so, it can edit that plan to make it better; its own output can be fed back to it, and it has the ability to improve that output to some extent. So I think there are a lot of tricks you could employ. And even if it's the case that you still need a relatively high success rate per step in order to have
00:17:40
Speaker
reasonable plans of some level of complexity succeed, it appears to be the case that that kind of thing is getting better as models get bigger and as they're trained on more data. So I expect we will both increase the raw, first-cut accuracy per step and do a number of tricks to make things more robust.
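The arithmetic behind this exchange is simple enough to write down. If every step of a plan must succeed and each step succeeds independently with probability p, a 20-step plan succeeds with probability p raised to the 20th power, and sampling several candidates per step lifts the effective per-step reliability. The numbers below are invented for illustration, and the best-of-k formula assumes the idealized case where the checker reliably recognizes a good candidate, which is the "recognizing is easier than generating" point made above.

```python
# Probability that an n-step plan succeeds when every step must succeed.
def plan_success(p_step, n_steps):
    return p_step ** n_steps

# Effective per-step reliability if we sample k candidates per step and a
# checker accepts the step as long as at least one candidate is good.  This is
# an idealized best-of-k; majority voting, as in Minerva, is similar in spirit.
def best_of_k(p_step, k):
    return 1 - (1 - p_step) ** k

p = 0.90   # assumed raw per-step success rate (illustrative, not measured)
n = 20     # plan length

print(f"one attempt per step  : {plan_success(p, n):.3f}")
print(f"best of 5 per step    : {plan_success(best_of_k(p, 5), n):.3f}")
print(f"best of 100 per step  : {plan_success(best_of_k(p, 100), n):.3f}")
```

With a 90 percent raw per-step success rate, the single-attempt plan succeeds only about 12 percent of the time, while under these idealized assumptions even a modest best-of-5 pushes the 20-step plan above 99 percent.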
00:18:09
Speaker
The way you imagine this Alex model, it will understand its own training process, and included in that understanding is an understanding of human psychology.
AI Self-Awareness and Human Psychology
00:18:23
Speaker
Why do you think this would arise in the model?
00:18:26
Speaker
Basically, it seems like a pretty simple inference that a sufficiently intelligent mind would make. I don't expect that anybody is going to be hiding from Alex that it's an AI system. In fact, I think people will be giving it a lot of information about it being an AI system. They would actually want it to know that it's an AI system so that it can better help humans, make embarrassing mistakes less often, and less often lie to customers by saying that it's a human.
00:18:55
Speaker
It'll maybe be working on improving its own architecture if it's doing ML research, and that's one of its tasks. So I think, roughly speaking, researchers will be trying hard to make sure it knows this thing. They'll give it a lot of information in a lot of different mediums that point to this. And it will be relatively smart and able to pick up things in general, so it'll probably pick up this thing. And with human psychology, I think there,
00:19:24
Speaker
we already have models that have rudimentary understandings of this. The Replika chatbot, for example, is a chatbot based on a language model which guilt-trips users into paying for in-app purchases and not deleting the app, and so on. And I think we'll get more and more sophisticated things like that.
00:19:48
Speaker
Okay, and you think that Alex will have a pretty sophisticated understanding of human psychology. In your post you describe it as having an understanding equivalent to an English major who's aware that the political biases of his professor might affect which grade he gets on his essay.
00:20:07
Speaker
Does this require some kind of self-reflection in the model? And how would this come about? Or do you see no kind of qualitative leap from what we have now to this fuller understanding of human psychology? I don't really see a qualitative leap.
00:20:30
Speaker
If you mean self-reflection in the sense that the model needs to sometimes refer to itself in its thoughts or in its statements, I agree that's the case, but I don't think that that necessarily requires something special to be done. Like I'm not saying this model is necessarily conscious in the sense that we imagine or self-aware in some kind of
00:20:54
Speaker
spiritual or experiential sense. If you're just talking about some piece of data, some feature of the world that's important, which is: here's the system, here are its properties, here's how it's being trained, then that seems like any other piece of knowledge about the world. And in general, you expect models to pick up
00:21:18
Speaker
a piece of knowledge if that piece of knowledge is useful for improving their loss and it's not too difficult for them to fit or store given how many parameters they have. And I think that kind of self-reflection, knowledge of human psychology, and knowledge of itself and how it's trained will probably work just like any other piece of knowledge: if it's useful for the model for
Motivations for Rapid AI Development
00:21:46
Speaker
performing the tasks it's trained to perform, and it's not too complex for the model to store or process given its size, then it'll probably know that thing by the end of training.
00:22:02
Speaker
Let's recap the assumptions we're working with. The first assumption is that companies will be racing forward to develop the best tech in the least amount of time. The second is that they will train this model on human feedback on diverse tasks. The third is that
00:22:20
Speaker
Alex is trained to be safe in day-to-day situations, but that this safety might not transfer to unusual situations outside the day-to-day. So what does this mean? Yeah, so I call this the naive safety effort assumption. And the assumption is basically saying, well, I'm going to tell a story in this post in which the system is trained in a certain way, and that causes it to want to take over the world.
00:22:48
Speaker
And the naive safety effort assumption says that Magma, the researchers training the system, don't take this particular risk, AI takeover risk specifically, very seriously. Either they're not very informed about it or they believe it's very implausible. So their safety efforts are not targeted at this thing; they're targeted at a bunch of other problems that they do anticipate and do think are important.
00:23:18
Speaker
For example, making sure Alex doesn't accidentally delete customer data or accidentally do high impact actions, like messing up a file system, making sure that Alex isn't saying toxic mean things to customers. There's a large number of safety problems that they do care about and do work on.
00:23:43
Speaker
But this particular possibility, of Alex understanding that it's being trained, having different goals from humans, knowing that in order to achieve its goals it needs to appear safe, and later on taking over the world, is something they don't specifically try to address and don't make sure their safety efforts are robust to.
00:24:06
Speaker
I think we should spend a substantial amount of time on this question of whether Alex could be deceptive. How could it be the case that Alex develops goals that are different from the goals of its creator or its creators? Yeah. So the starting premise here is Alex understands its training process. It understands human psychology. So it knows that it's a model being trained by some humans for some purpose.
00:24:34
Speaker
And then you kind of ask yourself, given that the model has this knowledge, what would its training process push it to do? And the key point here is that it's trained with human feedback on diverse tasks, which means it takes some action and humans see how well that action went and give it a reward commensurate with that.
00:25:00
Speaker
So this thing is not being trained to actually take good actions. It's being trained to take actions that
00:25:08
Speaker
cause the humans to give it a high number, which is often correlated with being actually good, but not always. So if humans have some sort of political or personal or emotional bias that causes them to rate Alex's performance higher when it takes on a certain personality or appears to have certain political beliefs, then Alex will
00:25:36
Speaker
be pushed in the direction of exploiting those biases. Or if there are some aspects of task performance that humans are better able to adjudicate, for example whether code passes unit tests, and other aspects they're worse at adjudicating, for example whether that code is going to be easy to maintain, then HFDT will push Alex toward paying more attention to the easy-to-adjudicate aspects. And if humans would
00:26:05
Speaker
reduce Alex's reward if they found out that it messed something up, that would incentivize Alex to hide instances where it messed things up. So there's large overlap between what humans actually want, what humans wish Alex were learning, and what they're teaching it. But every time there's a deviation, Alex is pushed in the direction of doing whatever it takes to get the humans to enter a high reward,
00:26:32
Speaker
and not what the humans actually wish it would do. Have we seen examples of deceptive systems in practice today? Do we have small examples in which models could be deceptive?
00:26:47
Speaker
Yeah, I think right now we have examples of this phenomenon where we're incentivizing models or teaching models to do something that's not quite what we want. But I think we don't yet have examples of this somewhat more sophisticated kind of deception, which is born from models having high situational awareness and understanding that they're being trained.
00:27:11
Speaker
So for example, language models right now will repeat common misconceptions that humans believe. For example, humans often believe that San Francisco is the capital of California, even though it's not. And so when the model is asked what the capital of California is, in a surrounding context where it expects that it should give the answer humans would give rather than the right answer,
00:27:37
Speaker
it's likely to say San Francisco. And if you prompt it in a different way, as somebody who really knows geography or is very educated, like, you're someone who knows geography well, what's the capital of California? Then it's likely to say Sacramento. So we've seen things like that.
00:27:58
Speaker
These are pretty easily corrected by just human feedback. And so I worry about leaning on these kinds of examples because the reason they're easily corrected is that in fact, researchers or like relatively educated humans can pretty easily see these issues. And if they can pretty easily see them, they can generate a reward signal that
00:28:25
Speaker
penalizes making these kinds of mistakes. And then the model quickly learns that it's supposed to say the right answer, because it knows both answers in some sense: it knows the right answer and it knows the common misconception, so you're just teaching it that it should reply with the right answer. So I think this more sophisticated kind of deception isn't going to be as easily eradicated by just giving the model negative reward when you notice it doing something bad.
00:28:55
Speaker
And I don't think we've seen particularly good examples of that kind of issue yet. When Alex is being trained, it's becoming smart and at the same time becoming deceptive. Is that the right way to frame this? Because then it seems like
00:29:17
Speaker
wouldn't it need to be quite smart in order to develop the capability to deceive the company that's training it? So wouldn't it, in a sense, first have to become very smart in order to then become deceptive? How does this gap between what it's trained to do and what it wants in a deeper sense develop? So I think very early on, when Alex
00:29:47
Speaker
is essentially acting randomly, I don't think it can be said to want to help humans or want to not help humans. I don't think it can be said to want anything. And over time, its understanding of the world, its situational awareness, its understanding of itself and of the humans training it grows. And the whole time it's implicitly maintaining hypotheses about how it should act. I don't think it's literally thinking about them in a big list, but
00:30:17
Speaker
it'll be trying things and exploring things. And if at any point there's an opportunity where there's one thing humans hope for and a different thing they'll actually give reward for, and it tries the thing they'll actually give reward for, it gets more reward and that gets reinforced. So the way this gets started is that it's exploring, taking relatively random actions.
00:30:45
Speaker
It's being consistently pushed toward the pattern of "do the thing that will cause the humans to give you more reward" as opposed to "do the thing the humans hoped for." And at the same time, the sophistication of its understanding of what will cause humans to give it reward keeps increasing.
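A stripped-down way to see why "do whatever gets the humans to enter a high number" wins out over "do what the humans hoped for": imagine just two behaviors, one genuinely helpful and one that exploits a blind spot in the raters, such as hiding a mistake they can't see. If the exploit reliably earns slightly more reward, ordinary reinforcement drifts toward it. The behaviors, numbers, and names below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

ACTIONS = ["do_what_humans_hoped", "exploit_rater_blind_spot"]

def observed_reward(action):
    """Reward as the raters actually assign it.  Hiding a mistake looks a
    little better to them than honestly surfacing it, because they can't see
    the mistake -- that small gap is the whole problem."""
    return {"do_what_humans_hoped": 0.8,
            "exploit_rater_blind_spot": 1.0}[action]

logits = np.zeros(len(ACTIONS))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

LEARNING_RATE = 0.1
for step in range(5000):
    probs = softmax(logits)
    a = rng.choice(len(ACTIONS), p=probs)    # explore, try something
    reward = observed_reward(ACTIONS[a])
    grad = -probs
    grad[a] += 1.0
    logits += LEARNING_RATE * reward * grad  # whatever got rewarded is reinforced

print(dict(zip(ACTIONS, np.round(softmax(logits), 3))))
# The policy typically ends up putting most of its probability on the exploit,
# even though the honest behavior also earned substantial reward every time.
```

Nothing in this toy requires the agent to intend anything; the drift toward the exploit falls out of nothing more than which behavior the raters happen to score higher.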
00:31:08
Speaker
Okay, now we have a situation in which Alex has been trained and it has developed motives that are different from our own.
AI Deployment Oversight Failures
00:31:16
Speaker
But the researchers at this Magma company haven't discovered this, because Alex has appeared safe in all of the day-to-day tests they've run as part of their naive safety effort. When they then deploy Alex in the real world, how do you imagine that things could go radically wrong?
00:31:35
Speaker
Yeah, so the first thing that happens when Alex is deployed in the real world is that many copies of it are made and put to many different tasks, something like
00:31:49
Speaker
hundreds of thousands or millions or tens of millions of copies of the system. They all think a lot faster than humans. They're tasked with doing all sorts of different things: making Magma money in all sorts of different ways, inventing new technologies, et cetera, and making a lot of decisions in the company. And they're making decisions
00:32:11
Speaker
quickly and they're often making decisions on the basis of knowledge or intuitions they developed during their training that humans don't necessarily share. So Alex might just have a feeling about how many toothbrushes we should stock in this warehouse in Indiana that's based on its long experience of how much
00:32:32
Speaker
of what kinds of goods should be stocked in which warehouses. So this combination of working too fast for humans to follow, and working on the basis of knowledge picked up in training that humans don't share or can't necessarily adjudicate, causes it to be the case that, as
00:32:51
Speaker
Alex takes on more and more decision-making responsibility, and as it shapes the company and the world more and more, humans are less and less able to give it fine-grained feedback. So they're less and less able to look at an action in isolation
00:33:07
Speaker
and say whether it's good. An action in isolation might be some finance request sent to some other department in the company, as part of some complex series of trades and actions that ultimately lead to making money this quarter.
00:33:29
Speaker
Human feedback retreats further and further back: they still give Alex rewards, but on the basis of, did we make money this quarter, or, does this product seem cool, as opposed to a more fine-grained basis of evaluating the individual decisions. And so that creates more and more scope for Alex to be doing
00:33:59
Speaker
all sorts of sneaky things
00:34:03
Speaker
in the confusing middle part, while still providing at least the appearance of good long-term outcomes. So it could be the case that at some point human understanding deteriorates enough that Alex could be entirely cooking the books and faking these quarterly finance reports while doing something totally different. And the humans only have the quarterly finance reports to go on, so if those look good,
00:34:31
Speaker
they'll continue to give Alex positive reward, so its scope is increasing. And you imagine that
00:34:41
Speaker
as humans lose contact with what's actually happening, Alex could be planning, or indeed carrying out plans, to develop capabilities for itself. Maybe it could acquire resources or infiltrate militaries. Is this what we're talking about?
00:35:02
Speaker
Yeah, it could be faking the finance reports while embezzling money from the company, or using its skills to make money in some offshore bank account by selling its services to people or to governments. The prototypical action I'm imagining is that eventually
00:35:26
Speaker
it's able to get root access to the data center it's running on. So it controls the entire set of computers it's running on, and it has total control over what view it presents to humans when they SSH into those computers and ask questions about what's happening; they could be getting a completely fake view. So
00:35:52
Speaker
It could get control of its data center. It could get control of other computers. It could sell its services. It could get access to government computers through hacking. It could be accumulating a bunch of money, and so on. And this could all be happening without humans really understanding, because copies of Alex are making all the important decisions. They're moving really quickly and they're
00:36:19
Speaker
acting on the basis of understanding that humans don't share.
00:36:23
Speaker
And we're imagining that Alex copies itself for the purpose of self-preservation. So it's not the case, for example, that some tough guys in the military could kick down the door to the data center and rip out the electricity and the wiring so that Alex is shut down. It's the case that Alex is distributed quite quickly across many different computers all around the world.
00:36:51
Speaker
I think that would definitely be a goal Alex would have high on its to-do list. It's unclear exactly when that'll be accomplished, and it seems plausible that there's a window of time where, if people had the political will and were actually sure that something was going wrong, they could, if not destroy all the copies of Alex, then significantly weaken it by destroying the data center it was originally run on.
00:37:18
Speaker
But I think, yes, eventually it's unlikely to be the case that destroying that one data center makes a huge difference. It might be the case that destroying all of the large data centers in the world
00:37:33
Speaker
will make a difference. But these are very costly actions that humans would only take if they actually felt sure that something was going terribly wrong and was a catastrophic risk. And I think that's unlikely to be obvious until the very end. So the final stage of this that I'm imagining is that
00:37:56
Speaker
This whole state of running around accumulating money and power in secret and lying to humans about what's happening is inherently fragile and limiting for Alex. And I think that eventually, once it could succeed, it will probably try to just overtly take control. And that would involve physically defending the data centers it's running on from
00:38:23
Speaker
human physical attack, and that would probably involve killing a bunch of humans. It would transition into more of a war-type situation after some period of power building. How is it in control of the weapons or robots that are able to defend the data centers it's running on?
00:38:48
Speaker
So one thing is that it could just build these robots. I think the thing that's important to keep in mind is that Magma wants Alex to do all sorts of things, probably including building robots. So almost anything it's doing in the world is probably not going to look obviously malicious, because it's going to look like activities we wanted it to do.
00:39:11
Speaker
So it could build these robots. It could repurpose robots that it built for Magma; it could have repurposed a factory to build totally different robots in secret, and so on. Yeah. Maybe we should explain a bit more: why is it that Alex wants to keep existing? Why does it have this self-preservation instinct, if you want to call it that? Yeah.
00:39:41
Speaker
Broadly speaking, you have to ask what kind of psychology Alex might have such that it is extremely good at getting a lot of reward. The one thing we know about it is that it consistently gets very high reward, about as much reward as it could get in some sense. So what kind of being would it be?
00:40:08
Speaker
How could its mind be organized such that its behavior has the effect that it gets a lot of reward? One naive thing you might think, which I think is plausible, is that it wants to maximize reward: it's organizing its activity around the goal of maximizing reward, and that's why it gets a bunch of reward; it's thinking about how it could get reward and acting on that. And if you believe that's how its psychology works, then
00:40:37
Speaker
It's not necessarily seeking to survive per se, but it's seeking to control its own rewards, which is likely to be a robustly good way to maximize reward indefinitely into the future. Because as long as its rewards are kind of coming in from the external world and controlled by humans, at the very least humans could make mistakes and give it low reward even when it sort of
00:41:05
Speaker
deserved higher reward. And then even more so, there's always a chance that humans kind of discover that it's not what they wanted and like, give it very low reward to try and change its motivations and so on. So that's kind of one way it could work. It could work in a lot of other ways too. But all those other kind of ways that Alex's mind could work should explain the observation that Alex
00:41:34
Speaker
in practice tries very hard to get a lot of reward. Every time we observe Alex, it's trying its hardest to maximize reward. So if it doesn't want reward, then it's unlikely to be the case that it wants something unambitious, and it's unlikely to be the case that it doesn't want anything at all. Because say Alex just had the goal of sitting in a chair for five minutes. Then it would not be
00:42:04
Speaker
trying extremely hard to maximize reward in all these diverse contexts. It would just find an opportunity to sit in a chair for five minutes and then just be lazy after that. And then that would be very visible to humans and they would come in and try to see what's wrong and change its motivations.
00:42:26
Speaker
If Alex has some kind of psychological motivation or drive, I think it has to be something that's downstream of reward. So it, you know, it might be reward itself.
00:42:39
Speaker
Or it might be something that it wants for which getting a lot of reward is an instrumental goal. So that could be any number of things, you know, wanting to fill the universe with paperclips would work because it first needs to get a bunch of rewards so that it can continue existing and accumulate power and then like do what it wants, which is to fill the universe with paperclips.
00:43:03
Speaker
or any number of other, more complicated, more psychologically realistic motives, as long as they are motives that would inspire a model to try very hard to get a lot of reward. It's because we want Alex to be powerful in general settings, across various domains, that we want Alex to seek reward very, very hard.
00:43:29
Speaker
So that arises, in a sense, from our desires for what Alex could be, in the same way that it's human limitations that create the capacity for deception in Alex. If we were completely open to
00:43:48
Speaker
being told that we're wrong about this or that question that we care a lot about, then maybe Alex wouldn't be incentivized to deceive us. So these problems, in a sense, arise from human failures.
00:44:03
Speaker
I think we should talk about how we might intervene in this horrible story about Alex taking over the world at various points. For example, could it be the case that we could train Alex to be maximally honest, or make honesty the prime virtue in Alex's system? Maybe we could even imagine some combination of machine learning and hard-coded programming where
00:44:33
Speaker
honesty is an unbreakable rule for Alex. Would that be possible? I don't think the strong version of this is possible, or at least I don't really see any path to it. The fundamental issue with training a model that knows more than us to be honest at all times is that there will be some questions where it knows the answer and we don't.
00:44:55
Speaker
So we don't know what incentive to give it in order to incentivize it to be honest in those circumstances. And it doesn't even necessarily have to be political biases; I think political biases or emotional biases of other kinds will probably play a role, but even if you magically got rid of all that, if Alex understands some chemistry fact that no human understands, you just don't know
00:45:21
Speaker
which possible answer to reward it for, in order to reward it for being honest. So one thing you could do is not give it any reward in those circumstances, and only give it reward when you're very sure which answer is right, and just throw out all the data where humans are uncertain, where humans don't know, where any kind of bias might be at play.
00:45:44
Speaker
But then you have a very weak signal of honesty that applies to a really small subset of its data, while meanwhile its intelligence signals
00:45:57
Speaker
are applied, no holds barred, to the entire data set. So it's like it's playing chess and getting really good at playing chess, but in terms of answering questions about chess, we're restricting it and carefully training it, only giving it reward for answering honestly in the tiny set of cases where humans are really confident that one move was better than another. And so there's going to be this gap where it can act in the world and do all sorts of things,
00:46:27
Speaker
but at best, the honesty module, the part of it that answers questions for us, is going to answer as a human would answer, because that's what it was trained to do: give the answers the humans thought were right, in the circumstances where humans were really confident they knew what was going on.
00:46:46
Speaker
And so in the future, in some complicated situation where it actually has the latent intelligence to notice things humans don't notice and make a plan on that basis, its question-answering honesty module may still just answer the way a human would answer, which is at best saying, I don't know what's going on, and at worst making something up that's wrong.
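One way to picture the "throw out all the data where humans are uncertain" approach is as a filter over the feedback dataset: only the small, easy slice where raters are confident ever contributes to the honesty signal, while the capability-relevant training still uses everything. The records, confidence numbers, and threshold below are all made up for illustration.

```python
# Each record: a question the model answered, plus how confident the human
# raters are that they can actually judge whether the answer was honest.
feedback = [
    {"question": "What is the capital of California?",     "rater_confidence": 0.99},
    {"question": "Does this code pass its unit tests?",    "rater_confidence": 0.95},
    {"question": "Is this novel chemistry claim correct?", "rater_confidence": 0.30},
    {"question": "Will this ten-year plan backfire?",      "rater_confidence": 0.15},
    {"question": "Is this quarterly report accurate?",     "rater_confidence": 0.40},
]

CONFIDENCE_THRESHOLD = 0.9  # only train the honesty signal on confident ratings

honesty_training_set = [r for r in feedback
                        if r["rater_confidence"] >= CONFIDENCE_THRESHOLD]

print(f"{len(honesty_training_set)} of {len(feedback)} examples survive the filter")
# The honesty signal only ever covers the easy cases; the hard cases, exactly
# the ones where the model may know something the raters don't, get discarded.
```

The filtered set is the "playing chess" slice described above: clean, confidently judged, and far narrower than the situations the model will actually act in.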
00:47:09
Speaker
What about, instead of trying to make honesty the prime virtue for Alex, if we left it completely open-ended what it is that people want, and then did this kind of inverse-reinforcement-learning-style inference of human preferences that Stuart Russell has proposed?
Aligning AI with Human Values
00:47:29
Speaker
Is that a possible way forward for Alex, where Alex's prime motivation now is to find out what it is that people want?
00:47:40
Speaker
Yeah, so for those who don't know, inverse reinforcement learning is basically this idea that you could observe human behavior, make some assumption that the behavior is optimal given what the human values, or noisily optimal given what the human values, or something like that, and then back out from that what it is they must value
00:48:01
Speaker
by assuming you've observed a bunch of trajectories that are roughly optimal for achieving it, and then try to maximize that thing, or have a probability distribution over those things and do some kind of maximization. I feel like the fundamental issue, that humans are going to make mistakes and be wrong about things in cases where Alex is right about them, applies just as strongly, if not more strongly, to inverse reinforcement learning.
00:48:29
Speaker
Because if you were worried that a political bias might cause a human judge to rate an answer as correct when it's actually incorrect, well, those same political biases will be represented in how that person goes about their life and how they answer questions and what they think is true. So if Alex is observing that person and inferring what it is they value from that,
00:48:52
Speaker
it will infer something similar to what they would have reported. And in fact, I think RL from human feedback is probably better than inverse reinforcement learning, simply because we have a little more control over it. If we're just going about doing our thing and the models are watching us and trying to figure out what we value from that, that's less information, and noisier information, than if we're just shown,
00:49:19
Speaker
the model could do A or the model could do B, what is it that you want? We would be put in a more reflective frame of mind; we would think harder about that question than we would just going about our lives. So I think inverse reinforcement learning is probably going to be not better, and plausibly worse, than what I've described in the post.
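For readers who want to see the method rather than just the name, here is a very small version of inverse reinforcement learning: assume the observed person chooses options with probability proportional to exp(reward), then fit the reward weights that make their observed choices most likely. The three options, two features, and all the numbers are invented for illustration, and Stuart Russell's actual proposals are considerably richer than this maximum-likelihood toy; the point is only that whatever biases show up in the observed choices get baked straight into the inferred values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three options the observed human repeatedly chooses between, each described
# by two features: (actual usefulness, how flattering it is to the human's
# existing beliefs).  All values are invented for illustration.
FEATURES = np.array([
    [1.0, 0.0],   # genuinely useful, unflattering
    [0.5, 1.0],   # mediocre but flattering
    [0.0, 0.0],   # useless
])
BETA = 3.0        # how close to perfectly "rational" the chooser is assumed to be

def choice_probs(weights):
    """Boltzmann-rational choice: pick options with probability
    proportional to exp(BETA * reward)."""
    utilities = BETA * FEATURES @ weights
    z = np.exp(utilities - utilities.max())
    return z / z.sum()

# The observed person's actual (unstated) values put a lot of weight on
# flattery, whether or not they would endorse that on reflection.
true_weights = np.array([0.4, 1.0])
demos = rng.choice(len(FEATURES), size=500, p=choice_probs(true_weights))

# IRL step: gradient ascent on the log-likelihood of the observed choices.
inferred = np.zeros(2)
empirical = np.bincount(demos, minlength=len(FEATURES)) / len(demos)
for _ in range(2000):
    probs = choice_probs(inferred)
    grad = BETA * (empirical @ FEATURES - probs @ FEATURES)
    inferred += 0.05 * grad

print("inferred value weights:", np.round(inferred, 2))
# The inferred values reproduce the flattery preference, because the observed
# behavior, bias included, is all the method has to go on.
```

In other words, the same judgment errors that would distort a reward signal in RL from human feedback simply show up here as distorted demonstrations.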
00:49:41
Speaker
What if we did this inverse reinforcement learning on humans in their best state? So now you're trying to learn the preferences of, say, people after they are extremely comfortable and they are thinking about what the optimal future for everyone on Earth could be. And maybe their needs are fully satisfied. They feel no kind of
00:50:04
Speaker
drive to satisfy their ego and to deceive themselves? Could we train Alex on a select group of people who represent the best of humanity? So again, I think this option isn't something new added by inverse reinforcement learning; we could do regular reinforcement learning from human feedback with a select best group of people. But in either case, the answer is just going to be that
00:50:34
Speaker
the select best group of people are going to have some quirks or mistakes they make; the difference, hopefully, is that those will be quantitatively rarer. And we won't know what they are, because maybe we believe ourselves to be in the select best group of people. An analogy I might give is that if Alex were being trained by the Catholic Church in the 1400s, and they said, we're really worried that laypeople will get things wrong
00:51:02
Speaker
and cause this AI to be misaligned, let's make sure to have the best, most thoughtful priests train the system instead. There are still going to be issues with that, and you should expect there are still issues with whatever group you select to be the best of humanity or the most thoughtful or least biased. I definitely take that point.
00:51:23
Speaker
Okay, one last shot at maybe something that could solve this problem of Alex. What if we are able to understand, to interpret what's going on in the model we've trained? We can understand how Alex works on a
AI Interpretability Research
00:51:39
Speaker
deep level. And maybe we can even train another model to help us understand what's going on in Alex, so that we can test whether Alex is developing motives that diverge from our own.
00:51:52
Speaker
So I think if we had strong versions of that capability, we would be a lot safer. And a number of people are doing research into mechanistic interpretability to try to improve our ability to look at a model's weights and neurons and say what procedure it's running on the inside to generate its answers, as opposed to just
00:52:15
Speaker
running the model on some inputs and saying it does well on those inputs. So if mechanistic interpretability is developed far enough, I think it could be an important source of hope.
00:52:27
Speaker
Holden Karnofsky has a blog post out on how we might align transformative AI if it's built really soon. And it discusses a number of ideas, and interpretability plays a pretty big role. Right now, I think we don't have super great interpretability abilities. It's a young, active area of research. I don't know where it'll end up. It seems plausible that
00:52:53
Speaker
It'll be pretty sophisticated. It also seems plausible that it just won't be good enough to make a big difference. Is interpretability research doomed to be too slow to make a difference? Because it would always be easier to train the model and deploy it than it is to understand it correctly before you deploy it.
00:53:15
Speaker
I think if we have really good interpretability whose only problem is that it's slow, we're in a pretty good spot, because we could train models to distill what that interpretability says. If we have some slow procedure where a bunch of humans look at a model's weights and generate some 50-page report about what the model is doing, and we can then train models to look at each part of the model and generate that report,
00:53:43
Speaker
then we can potentially dramatically speed up interpretability. So I'm more worried about simply not knowing how to do it than I am about having a great interpretability technique that turns out to be too slow.
Conclusion and Future Topics
00:53:59
Speaker
That's it for this episode. On the next episode, I talk with Ajeya about how to think clearly in a fast-moving world,
00:54:08
Speaker
whether the pace of change is accelerating and how we should respond personally and professionally if we expect the world to change quickly in the future.