
Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Future of Life Institute Podcast
It's well-established in the AI alignment literature what happens when an AI system learns or is given an objective that doesn't fully capture what we want. Human preferences and values are inevitably left out and the AI, likely being a powerful optimizer, will take advantage of the dimensions of freedom afforded by the misspecified objective and set them to extreme values. This may allow for better optimization on the goals in the objective function, but can have catastrophic consequences for human preferences and values the system fails to consider. Is it possible for misalignment to also occur between the model being trained and the objective function used for training? The answer looks like yes. Evan Hubinger from the Machine Intelligence Research Institute joins us on this episode of the AI Alignment Podcast to discuss how to ensure alignment between a model being trained and the objective function used to train it, as well as to evaluate three proposals for building safe advanced AI.

Topics discussed in this episode include:
- Inner and outer alignment
- How and why inner alignment can fail
- Training competitiveness and performance competitiveness
- Evaluating imitative amplification, AI safety via debate, and microscope AI

You can find the page for this podcast here: https://futureoflife.org/2020/07/01/evan-hubinger-on-inner-alignment-outer-alignment-and-proposals-for-building-safe-advanced-ai/

Timestamps:
0:00 Intro
2:07 How Evan got into AI alignment research
4:42 What is AI alignment?
7:30 How Evan approaches AI alignment
13:05 What are inner alignment and outer alignment?
24:23 Gradient descent
36:30 Testing for inner alignment
38:38 Wrapping up on outer alignment
44:24 Why is inner alignment a priority?
45:30 How inner alignment fails
01:11:12 Training competitiveness and performance competitiveness
01:16:17 Evaluating proposals for building safe and advanced AI via inner and outer alignment, as well as training and performance competitiveness
01:17:30 Imitative amplification
01:23:00 AI safety via debate
01:26:32 Microscope AI
01:30:19 AGI timelines and humanity's prospects for succeeding in AI alignment
01:34:45 Where to follow Evan and find more of his work

This podcast is possible because of the support of listeners like you. If you found this conversation to be meaningful or valuable, consider supporting it directly by donating at futureoflife.org/donate. Contributions like yours make these conversations possible.
Transcript

Introduction to AI Alignment with Evan Hubinger

00:00:12
Speaker
Welcome to the AI Alignment Podcast. I'm Lucas Perry. Today, we have a conversation with Evan Hubinger about ideas in two works of his, An Overview of 11 Proposals for Building Safe Advanced AI, and Risks from Learned Optimization in Advanced Machine Learning Systems.
00:00:32
Speaker
Some of the ideas covered in this podcast include inner alignment, outer alignment, training competitiveness, performance competitiveness, and how we can evaluate some highlighted proposals for safe advanced AI with these criteria.
00:00:47
Speaker
We especially focus in on the problem of inner alignment and go into quite a bit of detail on that. This podcast is a bit jargony, but if you don't have a background in computer science, don't worry. I don't have a background in it either. And Evan did an excellent job of making this episode accessible.
00:01:06
Speaker
Whether you're an AI alignment researcher or not, I think you'll find this episode quite informative and digestible. I learned a lot about a whole other dimension of alignment that I previously wasn't aware of, and feel this helped to give me a deeper and more holistic understanding of the problem.
00:01:25
Speaker
Evan Hubinger was an AI safety research intern at OpenAI before joining MIRI. His current work is aimed at solving inner alignment for iterated amplification. Evan was an author on Risks from Learned Optimization in Advanced Machine Learning Systems, was previously a MIRI intern, designed the functional programming language Coconut, and has done software engineering work at Google, Yelp, and Ripple.
00:01:54
Speaker
Evan studied math and computer science at Harvey Mudd College. And with that, let's get into our conversation with Evan Hubinger. In general, I'm curious to know a little bit about your intellectual journey and the evolution of your passions and how that's brought you to AI alignment.

Evan's Journey into AI Alignment

00:02:16
Speaker
So what got you interested in computer science, and tell me a little bit about your journey to MIRI?
00:02:21
Speaker
I started computer science when I was pretty young. I started programming in middle school, playing around with Python, programming a bunch of stuff in my spare time. The first sort of really big thing that I did, I wrote a functional programming language on top of Python. It was called Rabbit. It was really bad. It was like interpreted in Python.
00:02:41
Speaker
And then I decided I would sort of improve on that. I wrote another functional programming language on top of Python called Coconut, which got a bunch of traction. This was while I was in high school and starting to get into college. And this is also around the time I was reading a bunch of the Sequences on LessWrong.
00:02:54
Speaker
I got sort of into that and the rationality space and I was following it a bunch. I also did a bunch of internships at various tech companies doing software engineering and especially programming languages stuff. Around halfway through my undergrad, I started running the effective altruism club at Harvey Mudd College. And as part of running the effective altruism club, I was trying to learn about all of these different cause areas and how to use my career to do the most good.
00:03:17
Speaker
And I went to EA Global and I met some MIRI people there. They invited me to do a programming internship at MIRI, where I did some engineering stuff, functional programming, dependent type theory stuff. And then while I was there, I went to the MIRI Summer Fellows Program, which is this place where a bunch of people can come together and try to work on doing research and stuff for like a period of time over the summer. I think it's not happening now because of the pandemic, but it hopefully will happen again soon.
00:03:45
Speaker
And while I was there, I encountered some various different information and people talking about safety

Understanding AI Alignment: Inner vs Outer

00:03:51
Speaker
stuff. And in particular, I was really interested in what at that time people were calling optimization daemons, the sort of idea that there could be problems where you train a model for some objective function and you don't actually get a model that's really trying to do what you trained it for.
00:04:05
Speaker
And so with some other people who were at the MIRI Summer Fellows Program, we tried to dig into this problem, and we wrote this paper, Risks from Learned Optimization in Advanced Machine Learning Systems. Some of the stuff I'll probably be talking about in this podcast came from that paper. And then as a result of that paper, I also got a chance to work with and talk with Paul Christiano at OpenAI.
00:04:24
Speaker
And he invited me to apply for an internship at OpenAI. So after I finished my undergrad, I went to OpenAI and I did some theoretical research with Paul there. And then when that was finished, I went to MIRI, where I currently am, and I'm doing sort of similar theoretical research to the research I was doing at OpenAI, but now I'm doing it at MIRI.
00:04:43
Speaker
So that gives us a better sense of how you ended up in AI alignment. Now, you've been studying it for quite a while from a technical perspective. Could you explain what your take is on AI alignment and just explain what you see as AI alignment?
00:04:59
Speaker
Sure. So I guess broadly, I like to take a general approach to AI alignment. I sort of see the problem that we're trying to solve as the problem of AI existential risk: it could be the case that in the future we have very advanced AIs that are not aligned with humanity and do really bad things. I see AI alignment as the problem of trying to prevent that. But there are obviously a lot of subcomponents to that problem.
00:05:22
Speaker
And so I like to make some particular divisions. Specifically, one

Pragmatic AI Safety Proposals

00:05:27
Speaker
of the divisions that I'm very fond of is to split it between these concepts called inner alignment and outer alignment, which I'll talk more about later. I also think that there's a lot of different ways to think about what the problems are that these sorts of approaches are trying to solve, inner alignment and outer alignment. What is the thing that we're trying to approach in terms of building an aligned AI? I also tend to fall into the Paul Christiano camp of thinking mostly about intent alignment,
00:05:49
Speaker
where the goal of trying to build AI systems right now, as a thing that we should be doing to prevent AIs from being catastrophic, is focusing on how do we produce AI systems which are trying to do what we want. And I think that inner and outer alignment are the two big components of producing intent-aligned AI systems, with the goal of hopefully reducing AI existential risk and making the future a better place.
00:06:10
Speaker
Do the social and governance and ethical and moral philosophy considerations come much into this picture for you when you're thinking about it?
00:06:21
Speaker
That's a good question. There's certainly a lot of philosophical components to trying to understand various different aspects of AI. What is intelligence? How do objective functions work? What is it that we actually want our AIs to do at the end of the day? In my opinion, I think that a lot of those problems are not at the top of my list in terms of what I expect to be quite dangerous if we don't solve them.
00:06:41
Speaker
I think a large part of the reason for that is because I'm optimistic about some of the AI safety proposals, such as amplification and debate, which aim to produce (in the case of amplification) a sort of agent which is trying to do what a huge tree of humans would do. And then the problem reduces to: rather than having to figure out in the abstract what is the objective that we should be trying to train an AI for, one that philosophically we think would be utility maximizing or good
00:07:06
Speaker
or whatever, we can just be like, well, we trust that a huge tree of humans would do the right thing and then sort of defer the problem to this huge tree of humans to figure out what philosophically is the right thing to do. And there's sort of similar arguments you can make with other situations like debate, where we don't necessarily have to solve all of these hard philosophical problems if we can make use of some of these alignment techniques that can solve some of these problems for us.
00:07:29
Speaker
So let's get into your specific approach to AI alignment here. How is it that you approach AI alignment and how does it differ from what MIRI does? So I think it's important to note, I certainly am not here speaking on behalf of MIRI, I'm just presenting my view. And my view is pretty distinct from the view of a lot of other people at MIRI.
00:07:50
Speaker
So I mentioned at the beginning that I used to work at OpenAI and I did some work with Paul Christiano. And I think that my perspective is pretty influenced by that as well. And so I come more from the perspective of what Paul calls prosaic AI alignment, which is the idea of: we don't know exactly what is going to happen as we develop AI into the future, but a good operating assumption is that we should start by trying to solve AI alignment for the case where there aren't major surprises on the road to AGI.
00:08:20
Speaker
What if we really just scale things up? We sort of go via the standard path and we get really intelligent systems. Would we be able to align AI in that situation? And that's the question that I focus on the most, not because I don't expect there to be surprises, but because I think that it's a good research strategy. We don't know what those surprises will be. Probably our best guess is it's going to look something like what we have now. So if we start by focusing on that, then hopefully we'll be able to generate approaches, which can successfully scale into the future.
00:08:50
Speaker
And so because I have this sort of general research approach, I tend to focus more on what are current machine learning systems doing, how do we think about them, and how would we make them inner aligned and outer aligned if they were sort of scaled up into the future.
00:09:05
Speaker
This is sort of in contrast with the way I think a lot of other people at MIRI view this. I think a lot of people at MIRI think that if you go this route of prosaic AI, current machine learning scaled up, it's very unlikely to be aligned. And so instead you have to search for some other understanding, some other way to potentially do artificial intelligence that isn't just this standard prosaic path that would be more easy to align, that would be safer. I think that's a reasonable research strategy as well, but it's not the strategy that I generally pursue in my research.
00:09:34
Speaker
Could you paint a little bit more detailed of a picture of, say, the world which the prosaic AI alignment strategy sees as potentially manifesting, where current machine learning algorithms and the current paradigm of thinking in machine learning are merely scaled up, and via that scaling up we reach AGI or superintelligence?

Prosaic AI Alignment Strategy

00:09:58
Speaker
I mean, there's a lot of different ways to think about what does it mean for current AI, current machine learning to be scaled up because there's a lot of different forms of current machine learning. You could imagine even bigger GPT-3, which is able to do highly intelligent reasoning. You could imagine we just do significantly more reinforcement learning in complex environments and we end up with highly intelligent agents.
00:10:21
Speaker
I think there's a lot of different paths that you can go down that still fall into the category of prosaic AI. And a lot of the things that I do as part of my research is trying to understand those different paths and compare them and try to get to an understanding of even within the realm of prosaic AI, there's so much happening right now in AI. And there's so many different ways we could use current AI techniques to put them together in different ways to produce something potentially superintelligent or highly capable and advanced.
00:10:48
Speaker
Which of those are most likely to be aligned? Which of those are the best paths to go down? One of the pieces of research that I published recently was an overview and comparison of a bunch of the different possible paths to prosaic AGI. Different possible ways in which you can build advanced AI systems using current machine learning tools and trying to understand which of those would be more or less aligned and which would be more or less competitive.
00:11:13
Speaker
So you're referring now here to this article, which is partly a motivation for this conversation, which is an overview of 11 proposals for building safe advanced AI.
00:11:22
Speaker
That's right. All right. So I think it'd be valuable if you could also help to paint a bit of a picture here of exactly the Miri style approach to AI alignment. You said that they think that if we work on AI alignment via this prosaic paradigm, that machine learning scaled up to super intelligence or beyond is unlikely to be aligned. So we probably need something else. Could you unpack this a bit more?

Inner Alignment Challenges and Solutions

00:11:47
Speaker
Sure. I think that the biggest concern that a lot of people at MIRI have with trying to scale up prosaic AI is also the same concern that I have. There's this really difficult, pernicious problem, which I call inner alignment, which is presented in the Risks from Learned Optimization paper that I was talking about previously, and which many people at MIRI, as well as me, think is the key stumbling block to really making prosaic AI work.
00:12:14
Speaker
I agree that this is the biggest problem, but I'm more optimistic in terms of I think that there are possible approaches that we can take within the prosaic paradigm that could solve this inner alignment problem. And I think that is the biggest point of difference is how difficult will inner alignment be?
00:12:32
Speaker
So what that looks like is a lot more foundational work, and correct me if I'm wrong here, into mathematics and principles in computer science, like optimization and what it means for something to be an optimizer and what kind of properties that has. Is that right?
00:12:49
Speaker
Yeah, so in terms of some of the stuff that other people at MIRI work on, I think a good starting point would be the embedded agency sequence on the Alignment Forum, which gives a good overview of a lot of the things that the different agent foundations people like Scott Garrabrant, Sam Eisenstat, and Abram Demski are working on.
00:13:05
Speaker
All right. Now you've brought up inner alignment as a crucial difference here in the opinion. So could you unpack exactly what inner alignment is and how it differs from outer alignment? This is a favorite topic of mine. A good starting point is trying to sort of rewind for a second and really understand what it is that machine learning does.
00:13:28
Speaker
Fundamentally, when we do machine learning, there are a couple of components. We start with a parameter space of possible models, where a model in this case is some parameterization of a neural network or some other type of parameterized function. And we have this large space of possible models, this large space of possible parameters that we can put into our neural network.
00:13:49
Speaker
And then we have some loss function, where for a given parameterization, for a particular model, we can check what is its behavior like on some environment. In supervised learning, we can ask how good are its predictions that it outputs. In an RL environment, we can ask how much reward does it get when we sample some trajectory.
00:14:09
Speaker
And then we have this gradient descent process, which samples some individual instances of behavior of the model. And then it tries to modify the model to do better in those instances. We search around this parameter space, trying to find models which have the best behavior on the training environment. This has a lot of great properties. I mean, this has managed to propel machine learning into being able to solve all of these very, very difficult problems that we don't know how to write algorithms for ourselves.
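To make those components concrete, here is a minimal sketch in plain Python and NumPy of the setup just described: a parameter space of possible models, a loss measured on training data, and a gradient descent process that searches that space. Everything in it (the toy model, the data, the learning rate) is illustrative rather than anything from the episode.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "parameter space": every (w, b) pair is one possible model y = x @ w + b.
w, b = rng.normal(size=2), 0.0

# Training data, i.e. the training distribution the loss is evaluated on.
xs = rng.normal(size=(100, 2))
ys = xs @ np.array([2.0, -1.0]) + 0.5  # the behavior we would like to recover

def loss(w, b):
    # Mean squared error of this particular parameterization's behavior.
    preds = xs @ w + b
    return np.mean((preds - ys) ** 2)

learning_rate = 0.1
for step in range(500):
    # Gradient descent: compute how the loss changes for small shifts in the
    # parameters, then move in the direction that decreases it the most.
    preds = xs @ w + b
    grad_w = 2 * xs.T @ (preds - ys) / len(xs)
    grad_b = 2 * np.mean(preds - ys)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# All we know afterwards is that this parameterization scores well on xs, ys.
# What it does off this distribution is a separate question.
print(loss(w, b))
```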
00:14:38
Speaker
But I think because of this, there's a sort of tendency to rely on something which I call the does the right thing abstraction, which is that, well, because the model's parameters were selected to produce the best behavior according to the loss function on the training distribution, we tend to think of the model as really trying to minimize that loss, really trying to get reward.
00:15:00
Speaker
But in fact, in general, that's not the case. The only thing that you know is that on the cases where I sample data on the training distribution, my model seemed to be doing pretty well. But you don't know what the model is actually trying to do. You don't know that it's truly trying to optimize the loss or some other thing. You just know that, well, it looked like it was doing a good job on the training distribution.
00:15:20
Speaker
And what that means is this abstraction is quite leaky. There's many different situations in which this can go wrong. And this general problem is referred to as robustness or distributional shift. This problem of, well, what happens when you have a model which you wanted it to be trying to minimize some loss, but you move it to some other distribution, you take it off the training data, what does it do then? And I think this is the starting point for understanding what is inner alignment is from this perspective of robustness and distributional shift.
00:15:48
Speaker
Inner alignment specifically is a particular type of robustness problem. And it's the particular type of robustness problem that occurs when you have a model which is itself an optimizer. When you do machine learning, you're searching over this huge space of different possible models, different possible parameterizations of a neural network or some other function. And one type of function which could do well on many different environments is a function which is running a search process, which is doing some sort of optimization.
00:16:16
Speaker
You could imagine I'm training a model to solve some maze environment. You could imagine a model which just learns some heuristics from when it should go left and right. Or you could imagine a model which looks at the whole maze and does some planning algorithm, some search algorithm which searches through the possible paths and finds the best one. And this might do very well on the mazes. If you're just running a training process, you might expect that you'll get a model of this second form that is running this search process that is running some optimization process.
00:16:46
Speaker
In the Risks from Learned Optimization paper, we call models which are themselves running search processes Mesa optimizers, where Mesa is just Greek and it's the opposite of meta. There's a sort of standard terminology in machine learning, meta optimization, where you can have an optimizer which is optimizing another optimizer.
00:17:03
Speaker
And Mesa optimization is the opposite. It's when you're doing gradient descent, you have an optimizer, and you're searching over models. And it just so happens that the model that you're searching over happens to also be an optimizer. It's one level below rather than one level above. And so because it's one level below, we call it a Mesa optimizer.
00:17:19
Speaker
And inner alignment is the question of how do we align the objectives of Mesa optimizers? If you have a situation where you train a model, and that model is itself running an optimization process, then that optimization process is going to have some objective. It's going to have something that it's searching for. In a maze, maybe it's searching for how do I get to the end of the maze?
00:17:40
Speaker
And the question is, how do you ensure that that objective is doing what you want? If we go back to the does the right thing abstraction that I mentioned previously, it's tempting to say, well, we train this model to get to the end of the maze. So it should be trying to get to the end of the maze.
00:17:56
Speaker
But in fact, that's not in general the case. It could be doing anything that would be correlated with good performance, anything that would in general likely result in it getting to the end of the maze on the training distribution. But it could be an objective that does anything else off distribution. That fundamental robustness problem, of when you train a model and that model has an objective, how do you ensure that that objective is the one that you trained it for? That's the inner alignment problem.
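A toy sketch of the two kinds of maze models being contrasted here: one a bag of heuristics, and one a Mesa optimizer that runs an internal search toward whatever objective it happened to learn. The maze representation and both objectives are hypothetical, made up purely to illustrate the distinction.

```python
from collections import deque

def open_neighbors(cell, maze):
    x, y = cell
    moves = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [m for m in moves if m in maze["open_cells"]]

def heuristic_model(cell, maze):
    # A grab bag of heuristics: take whichever open neighbor sorts highest
    # (roughly "prefer moving right"), with no internal notion of a goal at all.
    options = open_neighbors(cell, maze)
    return max(options) if options else cell

def mesa_optimizer_model(cell, maze, learned_objective):
    # A model that is itself an optimizer: it searches over paths (breadth-first
    # here) and returns a plan leading to its *learned* objective. Nothing forces
    # that learned objective to match the objective it was trained on.
    frontier, seen = deque([[cell]]), {cell}
    while frontier:
        path = frontier.popleft()
        if learned_objective(path[-1], maze):
            return path
        for nxt in open_neighbors(path[-1], maze):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return [cell]

# Two internal objectives the training signal may never distinguish:
reach_exit = lambda c, maze: c == maze["exit"]          # what we intended
reach_green = lambda c, maze: c == maze["green_arrow"]  # a proxy that coincided
                                                        # with it during training
```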
00:18:21
Speaker
And how does that stand in relation with the outer alignment problem? So the outer alignment problem is how do you actually produce objectives which are good to optimize for. So the inner alignment problem is about aligning the model with the loss function, the thing you're training for, the reward function.
00:18:41
Speaker
Outer alignment is aligning that reward function, that loss function, with the programmer's intentions. It's about ensuring that when you write down a loss, if your model were to actually optimize for that loss, it would actually do something good.
00:18:55
Speaker
Outer alignment is the much more standard problem of AI alignment. If you've been introduced to AI alignment before, you'll usually start by hearing about the outer alignment concerns, things like paperclip maximizers, where there's this problem of you try to train it to do some objective, which is maximize paperclips. But in fact, maximizing paperclips results in it doing all this other stuff you don't want it to do. And so outer alignment is this value alignment problem of how do you find objectives which are actually good to optimize.
00:19:23
Speaker
But then even if you have found an objective which is actually good to optimize, if you're using the standard paradigm of machine learning, you also have this inner alignment problem, which is, okay, now how do I actually train a model which is in fact going to do that thing which I think is good. Does that bear some relation to Stuart's standard model?
00:19:42
Speaker
It sort of is related to Stuart Russell's standard model of AI. I'm not referring to precisely the same thing, but it's very similar. I mean, I think a lot of the problems that Stuart Russell has with the standard paradigm of AI are based on this start with an objective and then train them all to optimize that objective.
00:19:58
Speaker
When I've talked to Stuart about this in the past, he said, why are we even doing this thing of training models, hoping that the models would do the right thing? We should be just doing something else entirely. But we're both pointing at different features of the way in which current machine learning is done and trying to understand what are the problems inherent in this sort of machine learning process. I'm not making the case that I think that this is an unsolvable problem. I mean, it's the problem I work on. And I do think that there are promising solutions to it, but I do think it's a very hard problem.
00:20:25
Speaker
All right, I think you did a really excellent job there painting the picture of inner alignment and outer alignment.

Complexity of Machine Learning Objectives

00:20:31
Speaker
I think that in this podcast, historically, we have focused a lot on the outer alignment problem without making that super explicit.
00:20:40
Speaker
Now, from my own understanding and as a warning to listeners, my basic machine learning knowledge is something like an orc structure cobbled together with sheet metal and string and glue and gum and rusty nails and stuff. So I'm going to try my best here to see if I understand everything here about inner and outer alignment and the basic machine learning model. And you can correct me if I get any of this wrong.
00:21:02
Speaker
So in terms of inner alignment, there is this neural network space, which can be parameterized. And when you do the parameterization of that model, the model is the nodes and how they're connected, right?
00:21:16
Speaker
Yeah. So the model in this case is just a particular parameterization of your neural network or whatever function approximator you're training. And it's whatever the parameterization is at the moment we're talking about. So when you deploy the model, you're deploying the parameterization you found by doing huge amounts of training via gradient descent or whatever, searching over all possible parameterizations to find one that had good performance on the training environment.
00:21:42
Speaker
So that model being parameterized, that's receiving inputs from the environment. And then it is trying to minimize the loss function or maximize reward. Well, so that's the tricky part, right? It's not trying to minimize the loss. It's not trying to maximize reward. That's the thing which I call the does the right thing abstraction, this sort of leaky abstraction that people often rely on when they think about machine learning, but isn't actually correct. Yeah, so it's supposed to be doing those things, but it might not.
00:22:10
Speaker
Well, what does "supposed to" mean? It's just a process. It's just a system that we run and we hope that it results in some particular outcome. What it is doing mechanically is we are using a gradient descent process to search over the different possible parameterizations to find parameterizations which result in good behavior on the training environment.
00:22:30
Speaker
It's good behavior as measured by the loss function or the reward function, right? That's right. You're using gradient descent to search over the parameterizations to find a parameterization which results in a higher reward on the training environment.
00:22:42
Speaker
Right. But achieving the high reward, what you're saying is not identical with actually trying to minimize the loss. Right. There's a sense of what you could sort of think of gradient descent as trying to minimize the loss because it's selecting for parameterizations, which have the lowest possible loss that it can find. But we don't know what the model is doing. All we know is that the model's parameters were selected by gradient descent to have good training performance, to do well according to the loss on the training distribution. But what they do off distribution, we don't know.
00:23:11
Speaker
We're going to talk about this later, but there could be a proxy. There could be something else in the maze that it's actually optimizing for that correlates with minimizing the loss function, but it's not actually trying to get to the end of the maze. That's exactly right.
00:23:24
Speaker
And then in terms of gradient descent, is the TLDR on that, the parameterized neural network space, you're creating all these perturbations to it. And the perturbations are sort of nudging it around in this n-dimensional space, however many parameters there are or whatever. And then you check to see how it minimizes the loss after those perturbations have been done to the model. And then that will tell you whether or not you're moving in a direction which is the local minima or not in that space. Is that right?
00:23:54
Speaker
Yeah, I think that that's a good intuitive understanding. What's happening is you're looking at infinitesimal shifts because you're taking a gradient and you're looking at how those infinitesimal shifts would perform on some batch of training data. And then you repeat that many times to go in the direction of the infinitesimal shift, which would cause the best increase in performance. But it's basically the same thing. I mean, I think the right way to think about gradient descent is this local search process. It's moving around the parameter space, trying to find parameterizations, which have good training performance.
00:24:23
Speaker
Is there anything interesting that you have to say about that process of gradient descent and the tension between finding local minima and global minima?
00:24:32
Speaker
Yeah, I mean, it's certainly an important aspect of what the gradient descent process does that it doesn't find global minima. It's not the case that it works by looking at every possible parameterization and picking the actual best one. It's this local search process that starts from some initialization and then looks around the space trying to move in the direction of increasing improvement. Because of this, there are potentially multiple possible equilibria parameterizations that you could find from different initializations that could have different performance.
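As a one-dimensional illustration of that point: the same local search procedure, started from different initializations, can settle into different local minima with different loss values. The landscape below is a made-up toy, not a claim about real networks.

```python
import numpy as np

# A non-convex, one-parameter "loss landscape" with more than one basin.
loss = lambda theta: np.sin(3 * theta) + 0.1 * theta ** 2
grad = lambda theta: 3 * np.cos(3 * theta) + 0.2 * theta

def descend(theta, lr=0.01, steps=2000):
    # Plain gradient descent: repeatedly step in the locally improving direction.
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

for start in (-2.0, 2.0):
    theta = descend(start)
    print(f"init {start:+.1f} -> theta {theta:+.2f}, loss {loss(theta):.2f}")
# Same procedure, different starting points, different local minima.
```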
00:25:02
Speaker
The space of all possible parameterizations of a neural network with billions of parameters, like GPT-2 or now GPT-3, which has greater than 100 billion, is absolutely massive. It's a sort of combinatorial explosion of a huge degree, where you have all of these different possible parameterizations
00:25:17
Speaker
which internally sort of correspond to totally different algorithms, with the weights determining exactly what algorithm the model ends up implementing. And so in this massive space of algorithms, you might imagine that some of them will look more like search processes, some of them will look more like optimizers that have objectives, some of them will look less like optimizers, some of them might just be sort of grab bags of heuristics or other different possible algorithms.
00:25:39
Speaker
It's going to depend on exactly what your setup is. If you're training a very simple network that's just like a couple of feed-forward layers, it's probably not possible for you to find really complex models implementing complex search processes. But if you're training huge models with many layers with all of these different possible parameterizations, then it becomes more and more possible for you to find these complex algorithms

Optimization Strategies in AI

00:25:58
Speaker
that are running complex search processes.
00:26:01
Speaker
I guess the only thing that's coming to mind here that is maybe somewhat similar is how 4.5 billion years of evolution has searched over the space of possible minds, and here we stand as these ape creature things. Are there, for example, interesting intuitive relationships between evolution and gradient descent? They're both processes searching over a space of mind, it seems.
00:26:23
Speaker
That's absolutely right. I think that there are some really interesting parallels there. And in particular, if you think about humans as models that were produced by evolution as a search process, it's interesting to note that the thing which we optimize for is not the thing which evolution optimizes for. Evolution wants us to maximize the total spread of our DNA, but that's not what humans do. We want all these other things, like decreasing pain, and happiness, and food, and mating, and all of these various proxies that we use.
00:26:53
Speaker
An interesting thing to note is that many of these proxies are actually a lot easier to optimize for and a lot simpler than if we were actually truly maximizing spread of DNA. An example that I like to use is imagine some alternate world where evolution actually produced humans that really cared about their DNA and you have a baby in this world and this baby stubs their toe and they're like,
00:27:16
Speaker
What do I do? Do I have to cry for help? Is this a bad thing that I've stubbed my toe? And they have to do this really complex optimization process that is like, okay, how is my toe being stubbed going to impact the probability of me being able to have offspring later on in life? What can I do to best mitigate that potential downside now? And this is a really difficult optimization process. And so I think it sort of makes sense that evolution instead opted for just pain bad. If there's pain, you should try to avoid it.
00:27:45
Speaker
But as a result of evolution opting for that much simpler proxy, there's a misalignment there, because now we care about this pain rather than the thing that evolution wanted, which was the spread of DNA. I think the way Stuart Russell puts this is: the actual problem of rationality is how is my brain supposed to compute and send signals to my hundred-odd muscles to maximize my reward function over the universe's history until heat death, or something.
00:28:14
Speaker
We do nothing like that. It would be computationally intractable. It would be insane. So we have all of these proxy things that Evolution has found that we care a lot about. Their function is instrumental in terms of optimizing for the thing that Evolution is optimizing for, which is reproductive fitness. And then this is all probably motivated by thermodynamics, I believe.
00:28:37
Speaker
When we think about things like love or like beauty or joy or like aesthetic pleasure in music or parts of philosophy or things, these things almost seem intuitively valuable from the first person perspective of the human experience. But via evolution, there are these proxy objectives that we find valuable because they're instrumentally useful in this evolutionary process on top of this thermodynamic process.
00:29:05
Speaker
And that makes me feel a little funny. Yeah, I think that's right. But I also think it's worth noting that you want to be careful not to take the evolution analogy too far because it is just an analogy. When we actually look at the process of machine learning and how gradient descent works, it's not the same. It's running a fundamentally different optimization procedure over a fundamentally different space.
00:29:28
Speaker
And so there are some interesting analogies that we can make to evolution. But at the end of the day, what we really want to analyze is how does this work in the context of machine learning? And I think the Risks from Learned Optimization paper tries to do that second thing of: let's really try to look carefully at the process of machine learning and understand what this looks like in that context. And I think it's useful to sort of have in the back of your mind this analogy to evolution. But I would also be careful not to take it too far and imagine that everything is going to generalize to the case of machine learning, because it is a
00:29:58
Speaker
different process.

Inner Alignment Failures

00:30:00
Speaker
So then pivoting here, wrapping up on our understanding of inner alignment and outer alignment, there's this model which is being parameterized by gradient descent, and it has some relationship with the loss function or the objective function. And it might not actually be trying to minimize the actual loss or to actually maximize the reward. Could you add a little bit more clarification here about why that is?
00:30:25
Speaker
I think you mentioned this already, but it seems like when gradient descent is evolving this parameterized model space, isn't that process connected to minimizing the loss in some objective way? The loss is being minimized, but it's not clear that it's actually trying to minimize the loss. There's some kind of proxy thing that it's doing that we don't really care about.
00:30:49
Speaker
That's right. Fundamentally, what's happening is that you're selecting for a model which empirically has the lowest loss on the training distribution. But what that actually means in terms of the internals of the model, what it's sort of trying to optimize for, and what its out-of-distribution behavior would be, is unclear. So a good example of this is this maze example. So I was talking previously about the instance of, you know, maybe you train a model on a training distribution of relatively small mazes, and to mark the end, you put a little green arrow.
00:31:17
Speaker
And then I want to ask the question, what happens when we move to a deployment environment where the green arrow is no longer at the end of the maze and we have much larger mazes? And then what happens to the model in this new off distribution setting?
00:31:32
Speaker
And I think there's three distinct things that can happen. It could simply fail to generalize at all. It just didn't learn a general enough optimization procedure that it was able to solve these bigger and larger mazes. Or it could successfully generalize and knows how to navigate. It learned a general purpose optimization procedure, which is able to solve mazes. And it uses it to get to the end of the maze.
00:31:53
Speaker
But there's a third possibility, which is that it learned a general purpose optimization procedure, which is capable of solving mazes, but it learned the wrong objective. It learned to use that optimization procedure to get to the green arrow rather than to get to the end of the maze. And what I call this situation is capability generalization without objective generalization.
00:32:12
Speaker
Its objective, the thing it was using those capabilities for, didn't generalize successfully off distribution. And what's so dangerous about this particular robustness failure is that it means off distribution, you have models which are highly capable. They have these really powerful optimization procedures directed at incorrect tasks.
00:32:31
Speaker
you have the strong maze solving capability. But this strong maze solving capability is being directed at a proxy, getting the green arrow, rather than the actual thing which we wanted, which was get to the end of the maze. And the reason this is happening is that on the training environment, both of those different possible models look the same in the train distribution. But when you move them off distribution, you can see that they're trying to do very different things, one of which we want and one of which we don't want. But they're both still highly capable.
00:33:00
Speaker
you end up with a situation where you have intelligent models directed at the wrong objective, which is precisely the sort of misalignment of AIs that we're trying to avoid. But it happened not because the objective was wrong. In this example, we actually want them to get to the end of the maze. It happened because our training process failed. It happened because our training process wasn't able to distinguish between models trying to get to the end and models trying to get to the green arrow.
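A deliberately simplified sketch of that failure mode: two policies that are indistinguishable on the training distribution, where the green arrow always marks the exit, but that come apart in deployment when the arrow is moved. The maze and reward representations here are hypothetical stand-ins.

```python
def reward_reaches_exit(outcome):
    # The objective the programmers actually wanted.
    return 1.0 if outcome == "exit" else 0.0

def exit_seeker(maze):
    # Capabilities and objective both generalize: always heads for the exit.
    return "exit"

def green_arrow_seeker(maze):
    # Just as capable, but its objective is the proxy "reach the green arrow".
    return "exit" if maze["green_on_exit"] else "green_arrow"

train_mazes = [{"green_on_exit": True}] * 1000    # arrow happens to mark the exit
deploy_mazes = [{"green_on_exit": False}] * 1000  # arrow moved somewhere else

for policy in (exit_seeker, green_arrow_seeker):
    train = sum(reward_reaches_exit(policy(m)) for m in train_mazes) / len(train_mazes)
    deploy = sum(reward_reaches_exit(policy(m)) for m in deploy_mazes) / len(deploy_mazes)
    print(policy.__name__, train, deploy)
# Both score 1.0 in training; only the exit seeker still scores 1.0 in deployment.
```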
00:33:24
Speaker
And what's particularly concerning in this situation is when the objective generalization lags behind the capability generalization. When the capabilities generalize better than the objective does, so that it's able to do highly capable actions, highly intelligent actions, but it does them for the wrong reason.
00:33:40
Speaker
And I was talking previously about the Mesa optimizers, where inner alignment is about this problem of models which have objectives which are incorrect. And that's the sort of situation where I expect this problem to occur. Because if you are training a model, and that model has a search process and an objective, potentially the search process could generalize without the objective also successfully generalizing.
00:34:01
Speaker
And that leads to this situation where your capabilities are generalizing better than your objective, which gives you this problem scenario where the model is highly intelligent but directed at the wrong thing. Just like in all of the outer alignment problems, the thing doesn't know what we want, but it's highly capable, right?
00:34:16
Speaker
Right. So while there is a loss function or an objective function, that thing is used to perform gradient descent on the model in a way that moves it roughly in the right direction. But what that means, it seems, is that the model isn't just something about capability. The model also implicitly somehow builds into it the objective. Is that correct?
00:34:44
Speaker
We have to be careful here because the unfortunate truth is that we really just don't have a great understanding of what our models are doing and what the inductive biases of gradient descent are right now. And so fundamentally, we don't really know what the internal structures of our models are like. There's a lot of really exciting research, stuff like the circuits analysis from Chris Olah and the Clarity team at OpenAI.
00:35:05
Speaker
But fundamentally, we don't understand what the models are doing. We can sort of theorize about the possibility of a model which runs some search process, and that search process generalizes but the objective doesn't. But fundamentally, because our models are these black box systems that we don't really fully understand, it's hard to really concretely say, yes, this is what the model is doing. This is how it's operating. And this is the problem.
00:35:26
Speaker
But in Risks from Learned Optimization, we try to at least attempt to understand that problem and look at, if we really think carefully about what gradient descent is incentivizing and how it might work, what are the things which we might predict would happen. So the objective that you're training the model for does not live in the model. It lives in the gradient descent process. It lives in the training procedure.
00:35:45
Speaker
We might hope that when we train a model on an objective, that it will produce its own model of that objective and try to figure out what it is and be aligned with it. But we don't know exactly what happens. The model doesn't get to see the objective you're training for. All that happens is that the gradient descent process looks at its behavior and tries to make it so that its behavior is more aligned with the loss function. But that loss function never enters into the model somehow. The model never sees that loss function.
00:36:10
Speaker
It might have some objective internally, like I was saying, if it's a Mesa optimizer. And then you might hope that that objective is aligned with the loss function we're training it for. But fundamentally, all we know is that its behavior on the training distribution was aligned with the loss function. That makes sense. And because it's so black boxy, we can't really interpret the state of the alignment of the model. So is the only way to do that, to test it out of distribution and see what happens at this point?
00:36:36
Speaker
There are a bunch of different possible ways to address this problem. So certainly one approach is to try to test it out of distribution, which is this adversarial training approach. This model is going to have some potential failure modes off distribution. We can try to find those failure modes and then train the model on those failure modes to prevent it from having this bad off-distribution behavior.
00:36:57
Speaker
There are some concerns with adversarial training, though. In particular, adversarial training doesn't necessarily catch what I see as the most pernicious, difficult inner alignment failure, which is something that we call deceptive alignment in the Risks from Learned Optimization paper. In the deceptive alignment case, if the model knows that it's being adversarially trained, then you're not going to be able to figure that out just via throwing it a bunch of examples.
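A schematic of the adversarial training idea in this exchange, reduced to a toy: search for off-distribution inputs where the current model violates the specification, fold them back into training, and retrain. Every function and data structure here is a placeholder invented for illustration.

```python
import random

# Toy stand-ins: a "model" is just one of two candidate policies over toy mazes,
# and "retraining" picks whichever candidate fails least on the current dataset.
def exit_seeker(maze):
    return "exit"

def green_arrow_seeker(maze):
    return "exit" if maze["green_on_exit"] else "green_arrow"

CANDIDATES = [green_arrow_seeker, exit_seeker]

def violates_spec(policy, maze):
    return policy(maze) != "exit"  # specification: always reach the exit

def retrain(dataset):
    return min(CANDIDATES, key=lambda p: sum(violates_spec(p, m) for m in dataset))

def propose_hard_inputs(n=100):
    # The adversary searches off distribution: mazes where the arrow is moved.
    return [{"green_on_exit": random.random() < 0.5} for _ in range(n)]

dataset = [{"green_on_exit": True}] * 100  # original training distribution
model = retrain(dataset)                   # may well select the proxy policy
for _ in range(5):
    failures = [m for m in propose_hard_inputs() if violates_spec(model, m)]
    dataset += failures
    model = retrain(dataset)               # failures push training toward exit_seeker
print(model.__name__)

# The caveat raised above: a deceptively aligned policy that could tell it was
# being probed would pass violates_spec during training, and this loop would
# report success while the misalignment remained.
```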
00:37:20
Speaker
You can also do something like transparency. So I mentioned previously that there's a lot of really exciting transparency and interpretability work. If you're able to sort of look inside the model and understand what algorithm it's fundamentally implementing, you can see: is it implementing an algorithm, an optimization procedure, that's aligned? Has it learned a correct model of the loss function or an incorrect model?
00:37:39
Speaker
It's quite difficult, I think, to hope to solve this problem without transparency and interpretability. I think that to be able to really address this problem, we have to have some way to peer inside of our models. I think that that's possible, though. There's a lot of evidence that points to the neural networks that we're training really making more sense, I think, than people assume. People tend to treat their models as these sort of super black box things. But when we really look inside of them, when we look at what is it actually doing, a lot of times it just makes sense.
00:38:04
Speaker
I was mentioning some of the circuits analysis work from the Clarity team at OpenAI, and they find all sorts of behavior. Like we can actually understand that when a model classifies something as a car, the reason that it's doing that is because it has a wheel detector and a window detector, and it's looking for windows on top of wheels. And so we can be like, OK, we understand what algorithm the model is implementing. And based on that, we can figure out, is it implementing the right algorithm or the wrong algorithm? And that's how we can hope to try to address this problem.
00:38:30
Speaker
But obviously, like I was mentioning, all of these approaches get much more complicated in the deceptive alignment situation, which is the situation which I think is most concerning.
00:38:38
Speaker
All right, so I do want to get in here with you in terms of all of the ways in which inner alignment fails. Briefly before we start to move into this section, I do want to wrap up here then on outer alignment. So outer alignment is probably again, what most people are familiar with. I think the way that you put this is it's when the objective function or the loss function is not aligned with actual human values and preferences.
00:39:03
Speaker
Are there things other than loss functions or objective functions used to train the model via gradient descent? I've been interchanging a little bit between loss function and reward function and objective function, and these are from different paradigms of machine learning. So a reward function would be what you would use in a reinforcement learning context.
00:39:24
Speaker
The loss function is the more general term. In a supervised learning context, you would just have a loss function. You still have a loss function in a reinforcement learning context, but that loss function is crafted in such a way as to incentivize the model to optimize the reward function via various different reinforcement learning schemes. So it's a little bit more complicated than the sort of hand-wavy picture, but the basic idea of machine learning is: we have some objective and we're looking for parameterizations of our model which do well according to that objective.
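A compact illustration of that terminology, assuming nothing beyond standard usage: in supervised learning the loss scores predictions against labels directly, while in reinforcement learning gradient descent minimizes a surrogate loss (here the simple REINFORCE-style surrogate) crafted so that lowering it tends to raise the reward. The arrays are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Supervised learning: the loss compares the model's outputs to labels directly.
def mse_loss(predictions, labels):
    return np.mean((predictions - labels) ** 2)

# Reinforcement learning: we care about reward, but what gradient descent actually
# minimizes is a surrogate loss built from sampled trajectories, for example the
# REINFORCE surrogate  -mean(log pi(a|s) * return).
def policy_gradient_surrogate(log_probs_of_taken_actions, returns):
    return -np.mean(log_probs_of_taken_actions * returns)

print(mse_loss(rng.normal(size=10), rng.normal(size=10)))
print(policy_gradient_surrogate(np.log(rng.uniform(0.1, 1.0, size=10)),
                                rng.normal(size=10)))
```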
00:39:49
Speaker
Okay. And so the outer alignment problem is that we have absolutely no idea how we would come to fully understand human preferences and preference hierarchies and values, and that process seems much harder than creating powerful optimizers.
00:40:06
Speaker
Yeah, so I don't know if I would say we have absolutely no idea. We have made significant progress on outer alignment. In particular, you can look at something like amplification or debate. And I think that these sorts of approaches have strong arguments for why they might be outer aligned.
00:40:22
Speaker
In the simplest form, amplification is about training a model to mimic this HCH process, which is a huge tree of humans consulting each other. Maybe we don't know in the abstract what our AI would do if it were optimizing some definition of human values or whatever. But if we're just training it to mimic this huge tree of humans, then maybe we can at least understand what this huge tree of humans is doing and figure out whether amplification is aligned. And so there has been significant progress on outer alignment, which is sort of the reason that I'm less concerned about it right now.
00:40:51
Speaker
because I think that we have good approaches for it. And I think we've done a good job of coming up with potential solutions. There's still a lot more work that needs to be done, a lot more testing, a lot more to sort of really understand. Do these approaches work? Are they competitive? But I do think that to say that we have absolutely no idea of how to do this is not true. But that being said, there's still a whole bunch of different possible concerns.
00:41:13
Speaker
Whenever you're training a model on some objective, you run into all of these problems of instrumental convergence, where if the model isn't really aligned with you, it might pursue these instrumentally convergent goals, like keeping itself alive, potentially stopping you from turning it off, or all of these other different possible things which we might not want. And so all of these are what the outer alignment problem looks like. It's about trying to address these standard value alignment concerns, like convergent instrumental goals, by finding objectives, potentially like amplification,
00:41:41
Speaker
which are ways of avoiding these sorts of problems. Right, so I guess there's a few things here wrapping up on outer alignment. Nick Bostrom's Superintelligence, that was basically about outer alignment then, right? Primarily, that's right, yeah. Inner alignment hadn't really been introduced to the alignment debate yet.
00:42:00
Speaker
Yeah, so I think the history of how this sort of concern got into the AI safety sphere is complicated. I mean, so I mentioned previously that there are people going around talking about stuff like optimization daemons. And I think a lot of that discourse was very confused and not pointing at how machine learning actually works.
00:42:17
Speaker
And we're sort of just going off of, well, it seems like there's something weird that happens in evolution where evolution finds humans that aren't aligned with what evolution wants. That's a very good point. It's a good insight. But I think that a lot of people recoiled from this because it was not grounded in machine learning because I think a lot of it was very confused and didn't fully give the problem the contextualization that it needs in terms of how machine learning actually works.
00:42:37
Speaker
And so the goal of Risks from Learned Optimization was to try and solve that problem and really dig into this problem from the perspective of machine learning, understand how it works and what the concerns are. Now, with the paper having been out for a while, I think the results have been pretty good. I think that we've gotten to a point now where lots of people are talking about inner alignment and taking it really seriously as a result of the Risks from Learned Optimization paper.
00:42:57
Speaker
All right, cool. So you didn't mention sub-goals, so I guess I just want to include that. Instrumental sub-goals is the jargon there, right? Convergent instrumental goals, convergent instrumental sub-goals, those are synonymous. Okay. And then related to that is Goodhart's law, which says that when you optimize for one thing hard, you oftentimes don't actually get the thing that you want, right?
00:43:20
Speaker
That's right. And Goodhart's law is a very general problem. The same problem occurs both in inner alignment and outer alignment. You can see Goodhart's law showing itself in the case of convergent instrumental goals. You can also see Goodhart's law showing itself in the case of finding proxies, like going to the green arrow rather than getting to the end of the maze. It's a similar situation where when you start pushing on some proxy, even if it looked like it was good on the training distribution, it's no longer good off distribution.
00:43:46
Speaker
Goodhart's law is a really very general principle, which applies in many different circumstances.
00:43:52
Speaker
Are there any more of these outer alignment considerations we can kind of just list off here that listeners would be familiar with if they've been following AI alignment? Outer alignment has been discussed a lot. I think that there's a lot of literature on outer alignment. You mentioned superintelligence. Superintelligence is primarily about this outer alignment problem. And then all of these difficult problems of how do you actually produce good objectives. And you have problems like boxing and the stop button problem and all of these sorts of things that come out of thinking about outer alignment.
00:44:20
Speaker
And so I don't want to go into too much detail because I think it really has been talked about a lot. So then pivoting here into focusing on the inner alignment section, why do you think inner alignment is the most important form of alignment?
00:44:33
Speaker
So it's not that I see outer alignment as not concerning, but that I think that we have made a lot of progress on outer alignment and not made a lot of progress on inner alignment.

Progress and Concerns in AI Alignment

00:44:43
Speaker
Things like amplification, like I was mentioning, I think are really strong candidates for how we might be able to solve something like outer alignment. But currently, I don't think we have any really good strong candidates for how to solve inner alignment.
00:44:54
Speaker
You know, maybe as machine learning gets better, we'll just solve some of these problems automatically. I'm somewhat skeptical of that. In particular, deceptive alignment is a problem which I think is unlikely to get solved as machine learning gets better. We fundamentally don't have good solutions to the inner alignment problem. Our models are just these black boxes mostly right now. We're starting to be able to peer into them and understand what they're doing. We have some techniques like adversarial training that are able to help us here, but I don't think we really have good, satisfying solutions in any sense to how we'd be able to solve inner alignment.
00:45:23
Speaker
Because of that, inner alignment is currently what I see as the biggest, most concerning issue in terms of prosaic AI alignment.
00:45:31
Speaker
How exactly does inner alignment fail then? Where does it go wrong and what are the top risks of inner alignment? I mentioned some of this before. There's this sort of basic maze example, which gives you the story of what an inner alignment failure might look like. You train the model on some objective, which you thought was good, but the model learned some proxy objectives, some other objectives, which when it moved off distribution, it was very capable of optimizing, but it was the wrong objective.
00:45:56
Speaker
However, there's a bunch of specific cases. And so in Risks from Learned Optimization, we talk about many of the different ways in which you can break this general inner misalignment down into possible subproblems. So the most basic subproblem is what we call proxy pseudo-alignment, which is the case where your model learns some proxy which is correlated with the correct objective, but potentially comes apart when you move off distribution.
00:46:21
Speaker
But there are other causes as well. There are other possible ways in which this can happen. Another example would be something we call suboptimality pseudo alignment, which is a situation where the reason that the model looks like it has good training performance is because the model has some deficiency or limitation that's causing it to be aligned, where maybe once the model thinks for longer, it'll realize it should be doing some other strategy, which is misaligned, but hasn't thought about that yet. And so right now it just looks aligned.
00:46:47
Speaker
There's a lot of different things like this where the model can be structured in such a way that it looks aligned on the training distribution, but if it encountered additional information, if it was in a different environment where the proxy no longer had the right correlations, then things would come apart and it would no longer act aligned.
00:47:02
Speaker
The most concerning in my eyes is something which I'll call deceptive alignment. And deceptive alignment is a very particular problem where the model acts aligned because it knows that it's in a training process and it wants to get deployed with its objective intact. And so it acts aligned so that its objective won't be modified by the gradient descent process and so that it can get deployed and do something else that it wants to do in deployment.
00:47:29
Speaker
This is sort of similar to the treacherous turn scenario, where you're thinking about an AI that does something good and then it turns on you. But it's a much more specific instance of it, where we're thinking not about treacherous turn on humans, but just about the situation of the interaction between gradient descent and the model, where the model maybe knows it's inside of a gradient descent process and is trying to trick that gradient descent process.
00:47:51
Speaker
A lot of people, on encountering this, are like, how could this possibly happen in a machine learning system? And I think this is a good reaction, because it really is a very strange thing to train a model to do this. But I think there are strong arguments for why deceptive alignment would actually be the simplest type of model that you could find in this situation.
00:48:09
Speaker
A way of explaining this, I think to anyone on the street would be like, imagine if pigs were intelligent enough to create farmers and you created farmers and they appeared to be aligned. They took care of you and they gave you these nice mud pits and they fed you every day. They gave you like a shelter and all of these other nice things. And then one day the farmer shows up and kills you, right?
00:48:35
Speaker
You thought the thing that you created was aligned, but it was deceptively aligned and it takes a treacherous turn. Is this sort of a more mundane example that you might agree with?
00:48:47
Speaker
So I think that's a good intuition pump for thinking about this situation. I generally am sort of averse to trying to really heavily apply these sorts of analogies, because it's a good analogy to think about what's happening, but it doesn't answer the core question of how likely is this to actually happen in the machine learning system. Yeah, that makes sense. Because it's much more specific than the other kinds of minds in this mind space, it seems pretty rare, a thing that could exist, but hard to find. Right. I think that's a good intuition, but I'm going to try to disabuse you of that notion.
00:49:17
Speaker
First, I think it's interesting to look at what happens if you do a bunch of adversarial training. You're really pushing the model to figure out what the objective is. It needs to know the objective at some point if you are training it in all possible situations. It needs to know what the loss function is for it to be able to do a good job.
00:49:32
Speaker
But there's multiple possible channels in which information about the loss function can enter the model. And so I'll fundamentally distinguish between two different channels, which is that the information about the loss function can enter through the gradient descent process, or it can enter through the model's input data.
00:49:48
Speaker
I'll call these two channels internalization and modeling. So internalization is the situation where you have this model that's going along, and it has some proxy objective, maybe. And that proxy objective is not quite right. But then it gets moved to a new situation where the proxy objective no longer works, and gradient descent goes inside the model and tinkers with the proxy to make it slightly more accurate. Iterate this process many, many times, and eventually the proxy gets better and better and better until it starts matching up with the actual loss function.
00:50:15
Speaker
But there's another story that I can tell, which is modeling. If you think about a training process like GPT-3, where the model is trained on a huge corpus of all of this really rich input data, well, in that sort of a situation, there's a whole bunch of information in that input that's really relevant to the agent's objective.
00:50:33
Speaker
And the agent, if it's in this environment where it has this huge text corpus, it's going to have to learn a model of this environment and how it functions. We can imagine a situation where maybe you're training the model on some huge text corpus to do something like maximize human values or whatever. And it reads a Wikipedia page on ethics. And it's like, wow, look at all of this really relevant, rich, useful information for figuring out this objective.
00:50:57
Speaker
But then there's a second question, which is, OK, suppose that the model has some model of the input data, which includes a bunch of rich information about that objective already. How does gradient descent actually modify the model to make use of that information? And so there's two different types of modeling, which are deceptive alignment and corrigible alignment.
00:51:17
Speaker
So the original story is you have this model that's going along, it has some proxy objective. But it encounters this really rich input data, which includes a huge amount of information about the objective. To be able to predict successfully what the Wikipedia page on ethics is going to say, it has to know about ethics. So it learns this really detailed ethics model. And then gradient descent is like, look, you have this really detailed ethics model. I'm going to just modify your objective to point to that ethics model. Now your new objective is just optimize that.
00:51:45
Speaker
And so this leads to something sort of like corrigibility, where the model that you're training has its objective determined by a pointer to some part of its world model. It has some model of this environment that includes some information about ethics, and now it's trying to optimize for that thing that it's pointing to in its world model.
00:52:01
Speaker
Then there's this other story, which is the deceptive alignment story. Similar, you have a model going along, it has some proxy objective, and it learns this really detailed world model that includes a bunch of information about ethics or whatever. And then gradient descent modifies the model to think longer about the fact that it's inside of an optimization process and realize that the correct instrumental thing for it to be doing is to optimize for that ethics model.
00:52:25
Speaker
And then it realizes, oh, I see. I should optimize for this model that I have. It goes and does that. And in both situations, you end up being able to make use of this really rich input data to improve performance by changing the objective to make reference to it. But in the corrigible case, you actually modify the objective to point to it, whereas in the deceptive case, you just modify the model to think about the fact that it should be instrumentally optimizing for it. And then the question is, which one of these is more likely?
00:52:50
Speaker
Before I go into that though, I want to dwell for a second on the internalization versus modeling distinction again. I think a good analogy is thinking about animal imprinting. Think about a duck being trained to follow its mother or whatever. You can imagine a situation where the duck internalizes a model of its mother. It is born with some really detailed model of how to figure out what a mother is.
00:53:12
Speaker
But there's also the modeling case, where the duck is going to be out in the world, it's going to have to form this really detailed model of the world, and in that really detailed model of the world, the very first thing it encounters is almost always going to be the mother. And so rather than having to learn some detailed model of figuring out what mothers are, you can instead do some modeling process where the duck's objective just points to the mother in its world model.
00:53:34
Speaker
This modeling case, you can see it as something like animal imprinting, where rather than really internalizing the objective, because it already has this world model, it might as well just specify the objective in terms of that world model instead. And I haven't yet talked about why you would get deception versus corrigibility. Right now, I'm just trying to convince you that you should expect modeling. You should expect that if you give your model this huge, really rich, detailed input data, it's going to learn an objective via reference to that input data rather than from scratch.
00:54:01
Speaker
When we think about models being trained, we think about them having some objective, and that objective is just modified by gradient descent as it moves along to get closer and closer to reality. But in the modeling case, that's not what's happening. The model is simultaneously learning a model of the world as well as trying to improve its objective. And the objective is modified to make reference to that model of the world to make it more effective.
00:54:20
Speaker
So internalization is it just wants the same thing as you do, and so it does it because it wants the same thing. Whereas corrigibility is trying to figure out what you want and then do that. And then deception is also trying to figure out what you want and then do that, but it's doing so for instrumental rather than terminal reasons. The instrumental reason is the model could get more reward.
00:54:38
Speaker
it can do better in the long run if it's able to not be modified by the gradient descent process. So if it's doing a good job, then the gradient descent process won't change its objective. And so therefore it can get deployed with the same objective and then go do a whole bunch of other stuff that you don't want it to do. But why does it care about the current objective?
00:54:55
Speaker
We're imagining a situation where you have a proxy-aligned model, but it cares about that proxy across episodes. It cares about that proxy over time. And it wants that proxy to continue to be optimized for. If the gradient descent process modifies it so that it no longer cares about that proxy because of doing a bad job, then that proxy won't get optimized for. So if it wants that proxy to get optimized for in the future, it needs to play along, pretend to be doing the right thing so that it doesn't get modified, so that it can get deployed, and then go optimize that proxy.
00:55:23
Speaker
But it also needs to understand that it's within the context of an optimization process where gradient descent will continue to perturb it. That's exactly right.
00:55:32
Speaker
And where does that kind of awareness come from? So now we have the second question, which is deception versus corrigibility. The awareness comes from gradient descent. Gradient descent is trying to produce models which are better at the loss function. And a deceptive model is really good and really effective because it's trying very hard to do the thing which you want it to be doing.
00:55:55
Speaker
Both deception and corrigibility are situations where the model is trying to figure out what to do and then do that. And the question is, which is an easier model, which is a simpler model for gradient descent to find? And I think there's a strong argument for deception being simpler than corrigibility.
00:56:10
Speaker
Corrigibility has some fundamental difficulties in terms of, if you're imagining a gradient descent process which is looking at a proxy-aligned model and is trying to modify it so that it makes use of this rich input data, it has to do some really weird things to make corrigibility work. It has to first make a very robust pointer.
00:56:28
Speaker
With corrigibility, if it's pointing at all incorrectly to the wrong thing in the input data, wrong thing in the world model, the corrigible optimizer won't correct that pointer. It'll just be like, well, I have this pointer. I'm just trying to optimize what this thing is pointing for. And if that pointer is pointing at a proxy instead, it'll just optimize that proxy. And so you have this very difficult problem of building robust pointers.
00:56:48
Speaker
With deception, you don't have this problem. A deceptive model, if it realizes the loss function is different than what it thought, it'll just change to doing the new loss function. It's actually much more robust to new information because it's trying to do this instrumentally. And so in a new situation, if it realizes that the loss function is different, it's just going to automatically change because it'll realize that's the better thing to do instrumentally. And so fundamentally, deception is more robust. It gives you a more robust pointer. It's easier in that sense.
00:57:14
Speaker
There's another sense in which corrigibility is very difficult. To produce a corrigible model requires gradient descent modifying the model's optimization process to be capable of optimizing under uncertainty. For a corrigible model, because it has this pointer as its objective, it's going to start out not really knowing what that pointer is pointing to. It's going to have some probability distribution over the different possibilities. And it's going to have to know how to optimize under that probability distribution of objectives rather than under a single objective.
00:57:38
Speaker
And so this problem of optimization under uncertainty is much more complicated than just optimizing under a fixed objective. What that means is that to produce a corrigible optimizer, gradient descent has to do all of these modifications where it has to modify it to optimize under uncertainty and to have this super robust pointer to the part of its world model that it's supposed to be optimizing for.
00:57:57
Speaker
But if you compare this to deception, gradient descent modifying a model to become deceptive is extremely simple. It needs to modify the model to have a long-term objective to think about what its objective is across many different instances. And then it just needs to modify the model to think for a while about the fact that it's in an optimization process and what the right thing is to do.
00:58:17
Speaker
And that modification is not very complicated because it's just a simple modification of think a little more about this particular thing, which is exactly the sort of modification that you should expect to happen all the time when you're training a model. And so I think it's a fundamentally much simpler modification.
00:58:31
Speaker
There's also another argument that you can make here, which is there's just a lot more deceptive models. Any model which is optimizing some proxy objective, once it starts optimizing that objective more in the long term across episodes, and then thinks about the fact that it's in an optimization process, will become deceptive. But to produce corrigibility, you have to find exactly the right pointer. There's many different possible pointers out there, only one of which is going to give you the exact correct pointer.
00:58:54
Speaker
And similarly with internalization, there's many different proxies, and only one is the actual true loss function. Whereas with deceptive alignment, any of those different proxies are all compatible with deception. And so I think there's a lot of strong arguments, both this counting argument for there being many more deceptive optimizers, as well as the simplicity argument that the modification necessary to produce a deceptive optimizer is just a lot simpler, I think, than the modifications necessary to produce these other types of optimizers.
00:59:18
Speaker
And so because of this, I think that there's a strong case to be made for deception really not being that uncommon, not being something crazy to think would happen in the training process, but maybe even potentially being the default outcome of a lot of these sorts of training procedures, which is quite scary and quite concerning.
00:59:33
Speaker
And obviously, all of this is speculation. We're trying to understand from a theoretical perspective what this gradient descent process might do. But I think we can make a lot of strong cases by thinking about things like simplicity and counting arguments to at least put this problem on the radar. Until we have a really strong reason that this isn't a problem, we should take it seriously. Buck, who's another person who works at MIRI, often tries to explain some of the Risks from Learned Optimization stuff, and he has an analogy that might be useful here.
00:59:59
Speaker
You can imagine the Christian God, and the Christian God is trying to produce humans which are aligned with the Bible. And you can imagine three different possible humans. You have Jesus, who is just the same as God. Jesus has the same objective as God. Jesus is aligned with God because he just fundamentally wants to do the exact same things.
01:00:18
Speaker
That's internalization. That would be internalization. You can have Martin Luther. Martin Luther is aligned with God because he wants to really carefully study the Bible, figure out what the Bible says, and then do that. And that's the corrigibility case. Or you can have Blaise Pascal. And Blaise Pascal is aligned with God because he thinks that if he does what God wants, he'll go to heaven in the future. And these are the three different possible models that God could find. And the question is: are you more likely to find a Jesus, a Martin Luther, or a Blaise Pascal?
01:00:47
Speaker
And the argument is there's only one Jesus: out of all the different possible human objectives, only one of them is going to be the exact same one that God wants. And Martin Luther, similarly, is very difficult, because out of all the human objectives, there's only one of them which is figure out precisely what the Bible wants and then do that. But Blaise Pascal, in this situation, anybody who realizes that God's going to send them to heaven or hell or whatever, based on their behavior, will realize that they should behave according to the Bible. And so there's many different possible Blaise Pascals, but there's significantly fewer possible Martin Luthers and Jesuses.
01:01:17
Speaker
Yeah, I think that's an excellent way of simplifying this. Blaise Pascal could care about any kind of proxy. I guess the one thing that I'm still a little bit confused about here is, in terms of the deceptive version, again, why is it that it cares about the current proxy reward?
01:01:34
Speaker
I think that has to do with the structure of the training procedure. You start out usually, I think, with a proxy-aligned mesa-optimizer. Maybe you start out with a bunch of heuristics, but then once you get a mesa-optimizer, it'll usually start by being proxy-aligned. It'll have some proxy that's not quite the same as the loss function, at least if you're in a complex enough environment. There's a whole bunch of different possibilities. It's going to start with some proxy. But then you might hope that if you do a bunch of adversarial training, you train it for a really long time in a bunch of different environments, you'll be able to improve that proxy until you get to the point where it has the right thing.
01:02:03
Speaker
And deceptive alignment is about asking the question: there are actually multiple different possible ways for a model to figure out the loss function. It could figure it out via internalization, or it could figure it out via deception, or it could figure it out via corrigibility. And we don't know which one of these is more likely. You might imagine that if you start with a proxy-aligned mesa-optimizer and you train it a bunch, it'll internalize the correct objective. But the argument that I'm making here is that that might not be what happens.
01:02:26
Speaker
And so the reason I'm saying you start with a proxy-aligned mesa-optimizer is because I think that's the way you should expect the training procedure to be structured: you'll start by learning some proxy, and then that proxy will get better, but having that proxy get better could result in internalization, or deception, or corrigibility.
01:02:42
Speaker
Sure. So the model develops some kind of proxy, but it seems like the other necessary ingredients are that one, it knows that it's in an optimization process where gradient descent is changing its objective function. And then it requires two, it actually cares about its current objective function over time and over iterations.
01:03:06
Speaker
That's absolutely right. And so then the question to ask is how likely are those things to occur? And I think it's the really important point which I want to make. I think a lot of times when people think about this, they're like, it seems unlikely that it would just happen to develop this understanding and care about its thing long-term. It doesn't just happen. It happens as a result of the training procedure, because if it does happen, it results in significantly better performance.
01:03:29
Speaker
And so the question is, would gradient descent modify the model to have those properties? And that's the argument that I want to make, is that it's not that necessarily that the model just happens to develop deception, but that deceptive models are just really effective at pursuing reward on the training environment. And so you should expect that if you have a really good training process, that it's going to find deceptive models because they're really good at doing the thing you're training them for.
01:03:53
Speaker
And so that's the sort of most fundamental argument that I want to make. Well, I do think there's another level here, where I think that as you're getting to the point where you're training models on very complex datasets with huge amounts of really rich information, I think we should expect the model to basically figure out most things. You should expect that if it's going to have to learn all of this really complex data, it's going to have to build a world model that's going to include the training process. So it's going to include the loss function.
01:04:17
Speaker
And so I think that as we move into this direction of training more and more complex datasets, more and more complex models, you don't want to rely on your model not figuring something out, especially when figuring that thing out results in it having better performance. Now, there's also this question of, does it care about its return across episodes versus just across multiple steps? And for deception, it needs to care about the multi-episode return. And this is also another critical point.
01:04:41
Speaker
I think there's a strong argument that if you imagine a model being trained to care about its multi-step return, and maybe also I should pause here and explain what I mean by step versus episode. So in the standard reinforcement learning setup, you have many different episodes. Each episode is composed of many different steps, and you train the model to only optimize its single episode return, its reward across all the different steps in an episode, but not across episodes.
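As a quick illustration of the step versus episode distinction (my own sketch, not from the episode), the single-episode return sums reward over the steps of one episode, while a non-myopic model would care about the sum across episodes as well:

```python
# Minimal sketch of per-episode vs multi-episode return (discounting omitted).

def episode_return(step_rewards):
    # Standard RL objective: reward summed over the steps of one episode.
    return sum(step_rewards)

def multi_episode_return(episodes):
    # What a non-myopic (potentially deceptive) model would care about:
    # return summed across episodes as well.
    return sum(episode_return(ep) for ep in episodes)

episodes = [
    [0.0, 1.0, 0.5],  # episode 1: reward at each step
    [1.0, 0.0, 2.0],  # episode 2
]
print(episode_return(episodes[0]))     # 1.5 -- what standard RL selects for
print(multi_episode_return(episodes))  # 4.5 -- the cross-episode quantity
```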
01:05:06
Speaker
That's not always the case. There are some reinforcement learning procedures that optimize across episodes. An example of this is something like population-based training. Population-based training selects models which have the greatest reward across episodes, not simply within the episode. Though standard reinforcement learning, you're just selecting for the single episode return. But even if you're just selecting for the single episode return, you have the same general problem that we've been talking about.
01:05:31
Speaker
where it's not the case that you actually end up with a model that necessarily only cares about the single-episode return. You're just going to end up with a model that behaves like it cares about the single-episode return on the training distribution.
01:05:41
Speaker
And if you imagine a situation where, you know, you have a model and it's been trained to optimize for this objective over many steps, and then it discovers that actually there are these other episodes beyond the current step, I think that the most natural generalization is that you should expect that the model is going to care about those other steps too. You can imagine a situation where every time the model has seen a blue door, it's been like, oh, going through this blue door is really good. And it's learned an objective that incentivizes it to go through blue doors.
01:06:04
Speaker
If it then later realizes that there are more blue doors than it thought, because there are other blue doors in other episodes, I think you should generally expect it's going to care about those blue doors as well. But again, even if you don't buy that, it's not just about what's most natural. It's also about what the gradient descent process does. And because models which have this property, that plan across episodes, can become deceptive and therefore do really well on the training distribution, you should expect that a good training process is going to modify them to become deceptive, even if they wouldn't naturally.
01:06:32
Speaker
Okay. So help me just understand a little bit more here. At a high level, it seems that the internalization and the corrigibility cases require higher specification than the deception version. You've brought up all of these different arguments about why you think that the deceptive version is easier to find because it has all of these computational benefits in terms of minimizing the loss function, right? That's right.
01:07:02
Speaker
And the structure of the deceptive model is composed of, as we mentioned earlier, one, the knowledge that it is within some kind of optimization process where it is being perturbed by gradient descent over episodes, and two, that it cares about the current proxy over episodes. And it seems that you're arguing, correct me if I'm wrong,
01:07:28
Speaker
that the manifestation of these components is that the strong desire to deceive leads to a very strong direct attempt at actually minimizing the loss or accomplishing the objective function. That's right. Because if it fails at doing that, it won't be able to accomplish the proxy that it actually cares about later. That's right.
01:07:55
Speaker
If there's more deployment episodes than training episodes, and it just cares about how many times it goes through the blue door, or the green arrow, whatever, the green arrow is a proxy objective. And so if it gives up the fact that it's optimizing the green arrow, it's going to get modified to not do that anymore. And so if it wants to go to the green arrow in all of these different deployment episodes, it needs to first in the training procedure, pretend that it doesn't care about the green arrow. And then once it's deployed, switch to going after the green arrow.
01:08:19
Speaker
This also requires the model to know when it's in training and when it's in deployment. This is another situation where I think you should expect that when you're training really advanced models with these huge world models, they figure out this information. But also, even if it doesn't figure out that information, it can gamble. It can be like, well, there's some chance I'm in deployment, there's some chance I'm in training. I will just opt to go to the green arrow some percentage of the time and to do the correct thing some other percentage of the time. And this gives you a decision problem, which you can solve mathematically. You get some optimal defection probability.
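Here is a heavily simplified version of that decision problem (my own toy model with made-up numbers, not a calculation from the episode or the paper). It assumes defecting during any training episode gets the objective modified away, while each successful deployment defection yields one unit of proxy value, and then searches for the defection probability that maximizes expected proxy value:

```python
# Toy model of the deceptive model's gamble: it cannot tell training from
# deployment, so it defects with some fixed probability p per episode.

def expected_proxy_value(p, n_train, n_deploy):
    # Assumption: defecting in any of the n_train training episodes gets the
    # model's objective modified (zero future proxy value); each deployment
    # defection is worth one unit of proxy value.
    survives_training = (1 - p) ** n_train
    return survives_training * p * n_deploy

n_train, n_deploy = 100, 10000
best_p = max((i / 1000 for i in range(1001)),
             key=lambda p: expected_proxy_value(p, n_train, n_deploy))
print("optimal defection probability:", best_p)          # about 1 / (n_train + 1)
print("expected proxy value:", expected_proxy_value(best_p, n_train, n_deploy))
```

Under these toy assumptions the optimal defection rate is small but nonzero, which is the point: even an uncertain deceptive model has a mathematically sensible strategy available.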
01:08:47
Speaker
Okay. And then going up one step in terms of specification, the corrigible agent needs to have a very detailed world model where it has a pointer, set via gradient descent, to some specific part of the world model that seems to correlate highly with minimizing the loss. That's right.
01:09:12
Speaker
Why is that less likely? Well, the major two reasons that I want to argue for are, one, getting the pointer right is really hard and very difficult to do robustly, whereas getting the deception right is much easier to do robustly. Think about animal imprinting, right? If the animal imprints on a human caretaker instead, you have a misspecified pointer, and it doesn't actually have very good performance.
01:09:35
Speaker
Deceptive alignment is more robust at figuring out what the thing is that it's trying to optimize for, because corrigible alignment relies on this pointer working successfully. And if the pointer fails, it's just going to optimize the wrong thing. But why can't the pointer change just as well as the attempt at satisfying the actual loss function in the deceptive model?
01:09:56
Speaker
So it absolutely can, but the question I want to ask is how difficult it is for gradient descent to make one sequence of modifications versus another: the sequence of modifications that leads to corrigibility versus the sequence that leads to deceptive alignment. The sequence of modifications necessary to produce corrigible alignment requires all of this very, very careful pointer tuning. And so you can imagine, if you're considering an infinitesimal movement in the direction of corrigible alignment, imagine replacing your current proxy with a
01:10:20
Speaker
bad pointer: it's not going to be that great. You're going to have to put a lot of careful optimization effort into getting a really good pointer before it becomes better to be corrigible. Whereas with deception, the model could build up this really good world model, and once the world model gets complicated enough that it includes a specification of the fact that it's in a training process, as soon as it starts thinking about the training process, it's going to be as robust as it can be, because it's going to instrumentally just want to try to figure out what is the best possible approximation it can get for this thing it's being trained for so it can optimize it as effectively as possible.
01:10:48
Speaker
And this gives you a much more robust pointer compared to the corrigible alignment case. And so the danger would be, with the deceptive version, once it's unleashed upon the world and it has a very excellent world model, it realizes this, and it's like, finally now I can do the deceptive turn or something like that to actually optimize the proxy. That's right. Yeah.
01:11:11
Speaker
So we've covered a lot of the ways in which inner alignment fails. Now, inner alignment and outer alignment are two of the things which you care about for evaluating proposals for building safe and advanced AI.

Evaluating AI Proposals for Safety

01:11:27
Speaker
There are two other properties that you care about in training procedures for building beneficial AI.
01:11:35
Speaker
One of these is training competitiveness and the second one is performance competitiveness. Could you explain what training competitiveness is and performance competitiveness and why they're both important?
01:11:48
Speaker
Absolutely. Yeah. So I mentioned at the beginning that I have a broad view of AI alignment, where the goal is to try to mitigate AI existential risk. And I mentioned that what I'm working on is focused on this intent alignment problem. But a really important facet of that problem is this competitiveness question. We don't want to produce AI systems, which are going to lead to AI existential risk. And so we don't want to consider proposals, which are directly going to cause problems.
01:12:13
Speaker
As the safety community, what we're trying to do is not just come up with ways to not cause existential risk. Not doing anything doesn't cause existential risk. It's to find ways to capture the positive benefits of artificial intelligence, to be able to produce AIs, which are actually going to do good things. You know, why are we actually trying to build AIs in the first place? We're actually trying to build AIs because we think that there's something that we can produce, which is good, because we think that AIs are going to be produced on a default timeline and we want to make sure that we can provide some better way of doing it.
01:12:42
Speaker
And so the competitiveness question is about how do we produce AI proposals which actually reduce the probability of existential risk. Not ones that just don't themselves cause existential risk, but ones that actually overall reduce the probability of it for the world. There's a couple of different ways that can happen. You can have a proposal which improves our ability to produce other safe AIs. So we produce some aligned AI and that aligned AI helps us build other AIs which are even more aligned and more powerful.
01:13:07
Speaker
We can also maybe produce an aligned AI, and then producing that aligned AI helps provide an example to other people of how you can do AI in a safe way. Or maybe it provides some decisive strategic advantage, which enables you to successfully ensure that only good AIs are produced in the future. There's a lot of different possible ways in which you can imagine building an AI leading to reduced existential risk. But competitiveness is going to be a critical component of any of those stories. You need your AI to actually do something.
01:13:33
Speaker
And so I like to split competitiveness down into two different subcomponents, which are training competitiveness and performance competitiveness. And in the overview of 11 proposals document that I mentioned at the beginning, I compare 11 different proposals for prosaic AI alignment on the four qualities of outer alignment, inner alignment, training competitiveness, and performance competitiveness.
01:13:53
Speaker
So training competitiveness is this question of how hard is it to train a model to do this particular task? It's a question fundamentally of if you have some team with some lead over all different other possible AI teams, can they build this proposal that we're thinking about without totally sacrificing that lead? How hard is it to actually spend a bunch of time and effort and energy and compute and data to build an AI according to some particular proposal?
01:14:19
Speaker
And then performance competitiveness is the question of, once you've actually built the thing, how good is it? How effective is it? What is it able to do in the world that's really helpful for reducing existential risk? Fundamentally, you need both of these things. And so you need all four of these components. You need outer alignment, inner alignment, training competitiveness, and performance competitiveness if you want to have a prosaic AI alignment proposal that is aimed at reducing existential risk.
01:14:44
Speaker
This is where a bit more reflection on governance comes into considering which training procedures and models are able to satisfy the criteria for building safe advanced AI in a world of competing actors and different incentives and preferences. The competitive stuff definitely starts to touch on all those sorts of questions.
01:15:05
Speaker
When you take a step back and you think about how do you have an actual full proposal for building prosaic AI in a way which is going to be aligned and do something good for the world, you have to really consider all of these questions. And so that's why I try to look at all of these different things in the document that I mentioned. So in terms of training competitiveness and performance competitiveness, are these the kinds of things which are best evaluated from within leading AI companies and then explained to say people in governance or policy or strategy?
01:15:34
Speaker
It is still sort of a technical question. We need to have a good understanding of how AI works, how machine learning works, what the difficulty is of training different types of machine learning models, what the expected capabilities are of models trained under different regimes, as well as the outer alignment and inner alignment that we expect will happen.
01:15:51
Speaker
I guess I imagine the coordination here is that information on relative training competitiveness and performance competitiveness in systems is evaluated within AI companies and then possibly fed to high-power decision makers who exist in strategy and governance for coming up with the correct strategy given the landscape of companies and AI systems which exist.
01:16:15
Speaker
Yeah, that's right. All right. So we have these intent alignment problems. We have inner alignment and we have outer alignment. We've learned about that distinction today and reasons for caring about training and performance competitiveness. So part of the purpose of this, I mean, it's in the title of the paper that partially motivated this conversation, An Overview of 11 Proposals for Building Safe Advanced AI. You evaluate these proposals based on these criteria, as we mentioned.
01:16:45
Speaker
So I guess I want to take this time now to talk about how optimistic you are about, say, your top few favorite proposals for building safe and advanced AI, and how you've roughly evaluated them on these four criteria of inner alignment, outer alignment, and then performance and training competitiveness.
01:17:07
Speaker
I'll just touch on some of the ones that I think are most interesting to start by taking a look at. And it's not necessarily just going to be the ones I'm most excited about, because some are more complicated than others. And I want to start with some of the simpler ones, ones that I am fairly excited about and think are interesting, but that are also pedagogically useful to take a look at.

Imitative Amplification and Relaxed Adversarial Training

01:17:25
Speaker
I'll start with amplification, which is the approach that I probably spend the most time thinking about. The specific form of amplification that I want to start by talking about is imitative amplification, where imitative amplification specifically means the procedure of training a model to imitate a human with access to that model. The goal is to produce models imitating this HCH tree, which I mentioned earlier, this tree of humans consulting humans.
01:17:51
Speaker
The model is trained to mimic a human consulting the model, but the models that that human is consulting were themselves trained to mimic humans consulting models, which were themselves trained to mimic humans consulting models. So in the limit, you're hoping to end up with a model trained to mimic this tree of humans consulting humans consulting humans, which is HCH, for humans consulting HCH.
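As a rough illustration of that structure (a toy of my own, not Evan's or Paul Christiano's code), here is a runnable sketch where the "model" is just a lookup table, the "human" answers a question by decomposing it and consulting the model on the pieces, and each round distills the amplified system back into the model:

```python
# Toy imitative amplification: train a "model" to imitate a human who is
# allowed to consult the current model on sub-questions.

def consult(question, model):
    # The human consults the model if it has an answer; otherwise the human
    # plays the model's role recursively (the base of the HCH tree).
    if question in model:
        return model[question]
    return human_with_model(question, model)

def human_with_model(question, model):
    # Toy task: a "question" is a tuple of numbers to add up. The human
    # decomposes it into halves, consults the model on each half, and
    # recombines the sub-answers.
    if len(question) == 1:
        return question[0]
    mid = len(question) // 2
    return consult(question[:mid], model) + consult(question[mid:], model)

def imitative_amplification(questions, n_rounds):
    model = {}  # stand-in "model": a lookup table trained purely by imitation
    for _ in range(n_rounds):
        # Distillation step: fit the model to imitate the amplified system H(M).
        model = {q: human_with_model(q, model) for q in questions}
    return model

questions = [(1, 2), (3, 4), (1, 2, 3, 4)]
print(imitative_amplification(questions, n_rounds=2))
# {(1, 2): 3, (3, 4): 7, (1, 2, 3, 4): 10}
```

The point of the toy is only the loop: amplify (human plus model), then distill back into the model, so that in the limit the model approximates the full tree of humans consulting humans.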
01:18:11
Speaker
The specific way that I talk about this, though, in the overview of 11 proposals is that there's an additional thing which you can do on top of this imitative objective, which is we also want to add some form of inner alignment solution. So we might think that imitative amplification is going to be good in the limit. It's going to be good if we can actually produce models which are really trying to predict HCH.
01:18:31
Speaker
But we also have to actually ensure that we're doing that, that we don't get deceptive models or other sorts of bad inner misalignment problems. And so we can add, on top of the imitative amplification, something which I call relaxed adversarial training. I have another post that came out of some of the work I did at OpenAI, which really goes into detail on how you would do something like relaxed adversarial training and what it means, how it works. But fundamentally, the idea behind relaxed adversarial training is to train the model both on the base objective of trying to imitate the human consulting the model,
01:19:00
Speaker
but also on an additional objective of trying to decrease the probability that the model is doing something bad, is deceptive, as assessed by some overseer which is looking at the model. In this case, in the case of amplification, the overseer would be the human with access to the model.
01:19:16
Speaker
So you're training overall on the objective of mimicking the human with access to the model, and also making it so that when that human, with the help of the model, uses transparency tools to look at the model and see what it's doing, that human believes that the model is doing the right thing. And so you're combining training on the imitative objective as well as on the relaxed adversarial training objective, trying to make it so that this oversight works. How do you evaluate that in terms of your criteria?
01:19:46
Speaker
We can start with outer alignment. So the question of outer alignment is, is this HCH tree aligned? And the hope is that because this HCH tree is just composed of humans, if we can understand the human behavior and ensure that those humans are doing the right thing, which we might hope to do, you know, we've solved lots of human coordination problems in the past, then we can ensure that this thing is outer aligned.
01:20:04
Speaker
For inner alignment, inner alignment is this question of relaxed adversarial training. So in particular, the way in which I think about solving relaxed adversarial training in this situation is that you have this big model space, and you have some set of deceptive models. And if you want to avoid that set of deceptive models, one thing which you can do is find some larger set which contains all the deceptive models, such that it's very easy to check whether models are in that set or not. As long as the outside of the set contains enough competitive models that are able to do the right thing, your relaxed adversarial training would work.
01:20:35
Speaker
There's a bunch of different ideas for how you could define a set like this, but the one which I'm most excited about, which I think about the most, is to have the set be based on the distinction of myopia, where you run this relaxed adversarial training check by looking at the model and determining whether the model cares about its multi-episode return or not.
01:20:51
Speaker
And if the model does care about its multi-episode return, then it has the potential to become deceptive. And so we don't allow that. We have it so that if the overseer sees that it cares about the multi-episode return, oversight fails. And if it sees that it doesn't care about the multi-episode return, then the oversight passes. And then we train the model to cause the oversight to pass so that it doesn't develop this behavior where it cares about the multi-episode return. And that's the basic idea of how you could use relaxed adversarial training here to solve inner alignment.
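A minimal sketch of how those two training signals might be combined (my own formulation with made-up numbers; the real proposal is much more involved): the total loss is the imitation loss plus a penalty proportional to the overseer's estimated probability that the model is deceptive or non-myopic.

```python
# Sketch of the combined relaxed-adversarial-training objective.

def relaxed_adversarial_loss(imitation_loss, p_catastrophe, lam=1.0):
    # p_catastrophe: the overseer's (human + model + transparency tools)
    # estimated probability that the model is deceptive / non-myopic.
    # lam: how heavily the oversight term is weighted against pure imitation.
    return imitation_loss + lam * p_catastrophe

# Hypothetical numbers: a slightly worse imitator that clearly passes
# oversight beats a better imitator that looks non-myopic to the overseer.
print(relaxed_adversarial_loss(imitation_loss=0.30, p_catastrophe=0.01))  # 0.31
print(relaxed_adversarial_loss(imitation_loss=0.25, p_catastrophe=0.50))  # 0.75
```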
01:21:17
Speaker
And then you have training competitiveness. It's sort of this imitation learning problem. We're trying to train the model to imitate the human with access to the model. There's all sorts of potential problems here. It could be very difficult to get a bunch of human data for this problem. As well, if you're just doing imitation learning, there's some concern of whether the model is really going to learn this sort of powerful long-term optimization, especially if you're specifically selecting it to be myopic.
01:21:39
Speaker
Also, there's a question of how rich the reward signal produced by imitation learning is. Imitation learning uses some sort of fixed distance metric to determine the distance between what the model produced and what was predicted. Generally, this is done by having the language model produce a probability distribution, and then determining the probability under that distribution of the sentences produced by the human with access to the model. And so you have this problem of how rich is the training signal, how difficult is the training.
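Concretely, that imitation signal is roughly a negative log-likelihood: score the output of the human-with-model under the distribution the model predicts. A toy sketch (my own, with made-up token distributions):

```python
# Toy imitation loss: negative log-likelihood of the human-with-model's output
# under the model's predicted per-token distributions.
import math

def imitation_loss(per_token_distributions, target_tokens):
    # per_token_distributions: for each position, a dict token -> probability
    # produced by the model. target_tokens: what the human (with access to the
    # model) actually produced.
    nll = 0.0
    for dist, token in zip(per_token_distributions, target_tokens):
        nll -= math.log(dist.get(token, 1e-9))
    return nll / len(target_tokens)

dists = [{"the": 0.6, "a": 0.4}, {"answer": 0.7, "question": 0.3}]
print(imitation_loss(dists, ["the", "answer"]))   # lower loss: model matches the human
print(imitation_loss(dists, ["a", "question"]))   # higher loss: model diverges
```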
01:22:03
Speaker
I think we really don't know what the training competitiveness is going to be, but I think there's reason to

AI Safety via Debate and Microscope AI

01:22:07
Speaker
be optimistic. I mean, a lot of these sorts of simple language modeling tasks like GPT-2 and GPT-3 have been very successful, and this is very much in that vein, so you can hope that it'll be similarly successful. And then you have the performance competitiveness. And for performance competitiveness, the real question is, how powerful is this HCH thing? Can HCH actually solve all the problems we need it to solve?
01:22:28
Speaker
And this is going to be a question of, you know, how do you train your humans? Can a human take any question and decompose it into different sub-questions such that the answers to those sub-questions can be recombined into an answer to the whole question? Is this possible for all questions? Maybe there are some questions which are undecomposable, in which case it wouldn't be competitive. But if you can decompose all these questions, it should have performance competitiveness.
01:22:47
Speaker
So the one that I've been talking about, imitative amplification plus relaxed adversarial training, is the one that I work on most and is probably the one I'm most optimistic about. There's others that are also really interesting. I mean, I think every proposal on the list has lots of strong arguments in favor of it and lots of real reasons to consider it.
01:23:02
Speaker
Another proposal, which I have talked about previously, is AI safety via debate. So AI safety via debate has some similar properties to amplification, but it has a different objective. Rather than trying to mimic HCH, you're trying to do this debate procedure, where you have two different models trying to defend two different sides of some proposition and convince the judge to agree with one of those sides.
01:23:23
Speaker
There's some strong arguments for why this sort of procedure would have nice outer alignment properties, because it converges to something like what the judge's equilibrium beliefs are after having seen all of the arguments. And so we might hope that those equilibrium beliefs are going to be good. And there's lots of work also that has been done on trying to improve the ability to access the true equilibrium beliefs, to really give the judge all possible information. There's some work that was done by Paul Christiano and Beth Barnes about this cross-examination mechanism,
01:23:49
Speaker
which gives it significantly better performance competitiveness properties by allowing it to access a larger complexity class. But then the other question is, what can you do on top of AI safety via debate to help with inner alignment? And an interesting thing that you can do is you can have the models trained to use transparency tools on each other to win the debate. If one model is able to look inside the other model and demonstrate that it's non-myopic or something, then it wins the debate. And so then you can incentivize the models to play this game of trying to look inside of each other, find the inner alignment problems, and then train away from those problems.
01:24:19
Speaker
For outer alignment, you really have this question of: are the debaters actually incentivized to give honest, cooperative, helpful answers in the limit? And this is a question of what that equilibrium really looks like. Is it the case that humans are able to be tricked by really persuasive arguments and no counterargument will unpersuade them, or is it the case that when humans are exposed to all of the best arguments and counterarguments, they come to the right conclusions? We don't really know, but this is fundamentally what the outer alignment question is here.
01:24:45
Speaker
This is sort of a trickier equilibrium to analyze. In the amplification case, we just had to reason about this concrete tree of humans. But with debate, we have to understand something more abstract: what is the theoretical equilibrium behavior of a human under exposure to all these different arguments? And so I think there's a case that, from an outer alignment perspective, debate is at least more questionable.
01:25:03
Speaker
Then we have inner alignment. The situation here is how successfully we can incentivize these models to inspect each other and be able to identify catastrophic behavior. And there's certainly some problems here. You have to get a good training signal to be able to produce models which are good at inspecting each other, which probably means you have to produce deceptive models and then have the models find the deception, or at least produce non-myopic models and then have the models find the non-myopia. But then this could also incentivize the models to hide the non-myopia. So it's a little bit unclear exactly how this works. It's going to depend on precisely the properties of the training procedure.
01:25:31
Speaker
I think there's some good reasons to be hopeful here for training competitiveness, because the way that the debate game is structured lets you do this self-play setup where you have these agents competing on a zero-sum game in self-play, which is in many ways similar to a lot of big successes of things like AlphaZero, where AlphaZero is able to solve all these environments by doing this very successful self-play on zero-sum games. The biggest problem here is probably getting data.
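For intuition, here is a very stripped-down toy of that zero-sum self-play structure (my own illustration; real debate training would use actual language models and a learned policy update, not this stand-in):

```python
# Toy self-play on a zero-sum "debate": two sides produce arguments, a judge
# picks a winner, and the winner's policy is nudged toward its winning move.
import random

def judge(argument_a, argument_b):
    # Stand-in judge: prefers the "stronger" argument (here just a number).
    return "A" if argument_a >= argument_b else "B"

def debate_self_play(n_games=1000):
    # Each debater's "policy" is just a target argument strength it samples
    # around; adopting the winning sample stands in for a policy update.
    strength = {"A": 0.5, "B": 0.5}
    for _ in range(n_games):
        arg_a = strength["A"] + random.gauss(0, 0.1)
        arg_b = strength["B"] + random.gauss(0, 0.1)
        winner = judge(arg_a, arg_b)
        strength[winner] = arg_a if winner == "A" else arg_b
    return strength

print(debate_self_play())  # strengths drift upward as each side competes to win
```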
01:25:53
Speaker
And then there's the performance competitiveness question, which is how effective would an actual debate agent be? The hope is that in the limit, you should just be able to ask the first debater, what do you think? And then it'll just choose the most convincing answer, and you can just go with that. You don't have to run the whole debate in deployment. But it's only going to be able to solve these language problems. It's going to give you the equilibrium of what a human thinks after all of these different arguments. And is that good enough?
01:26:17
Speaker
Is it the case that humans are going to really be able to come to good enough equilibria after they see all these arguments that they're going to be able to produce really good answers? And also, is it the case that question answering alone is sufficient to be able to be competitive in potentially a very competitive marketplace?
01:26:32
Speaker
As a third proposal that I think is interesting to go into is something called Microscope AI. Microscope AI I think is really interesting to look at because it's very different from the other proposals that I was just talking about. And it has a very different approach to thinking about how do we solve these sorts of problems. For all of these approaches, we need to have some amount of ability to look inside our models and learn something about what the model knows. But when you use transparency tools to look inside our model, it teaches you multiple things. It teaches you about the model. You learn about what the model has learned.
01:26:58
Speaker
But it also teaches you about the world, because the model learned a bunch of useful facts. And if you look inside the model and you can learn those facts yourself, then you become more informed. And so this process itself can be quite powerful. And so that's fundamentally the idea of microscope AI. The idea of microscope AI is to train a predictive model on the data you want to understand, and then use transparency tools to understand what that model learned about that data, and then use that understanding to guide human decision making.
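Here is a toy analogy for that workflow (my own, far simpler than the actual proposal): train a purely predictive model, read off what it learned with "transparency tools" (here, just inspecting its weights), and let the human act on that knowledge rather than deploying the model as an agent.

```python
# Toy microscope-AI workflow: train a predictor, then inspect it.
import random

# Step 1: train a predictive model on the data we want to understand. Toy
# setup: three features, but only feature 0 actually drives the outcome.
xs = [[random.random() for _ in range(3)] for _ in range(1000)]
data = [(x, 2.0 * x[0] + random.gauss(0, 0.01)) for x in xs]

weights = [0.0, 0.0, 0.0]
for _ in range(200):                      # plain SGD on squared error
    for x, y in data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        weights = [w - 0.01 * err * xi for w, xi in zip(weights, x)]

# Step 2: "transparency tools" (here just reading the weights) tell the human
# what the model learned about the world: feature 0 is what matters, with a
# coefficient near 2.
print("what the model learned:", [round(w, 2) for w in weights])

# Step 3: the human uses that understanding to guide decisions; the predictive
# model itself is never deployed as an agent.
```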
01:27:23
Speaker
And so if you think about outer alignment, in some sense, this procedure is not really outer aligned because we're just trying to predict some data. And so that's not really an aligned objective. If you had a model that was just trying to do a whole bunch of prediction, it wouldn't be doing good things to the world. But the hope is that if you're just training a predictive model, it's not going to end up being deceptive or otherwise dangerous. And you can also use transparency tools to ensure that it doesn't become that.
01:27:44
Speaker
We still have to solve inner alignment, like I was saying. It still has to be the case that you don't produce deceptive models. And in fact, the goal here really is not to produce mesa-optimizers at all. The goal is just to produce these predictive systems which learn a bunch of useful facts and information, but that aren't running optimization procedures. And hopefully we can do that by having this very simple predictive objective and then also by using transparency tools.
01:28:02
Speaker
And then training competitiveness: we know how to train powerful prediction models now. Something like GPT-2, now GPT-3, these are predictive models trained on text prediction. And so we know this process, we know that we're very good at it, and so hopefully we'll be able to continue to be good at it in the future. The real sticking point with microscope AI is the performance competitiveness question. So is enhanced human understanding actually going to be sufficient to cover the use cases we might want for advanced AI?
01:28:26
Speaker
And I don't know. It's really hard to know the answer to this question, but you can imagine some situations where it is and some situations where it isn't. So for situations where you need to do long-term, careful decision-making, it probably would be, right? If you want to replace CEOs or whatever. That's a sort of very general decision-making process that can be significantly improved just by having much better human understanding of what's happening. You don't necessarily need the AIs making the decision.
01:28:48
Speaker
On the other hand, if you need fine-grained manipulation tasks or very, very quick response times, AIs managing a factory or something, then maybe this wouldn't be sufficient, because you would need the AIs doing all of this quick decision making and you can't have it just giving information to a human.
01:29:03
Speaker
One specific situation which I think is important to think about also is the situation of using your first AI system to help build a second AI system and making sure that second AI system is aligned and competitive. And I think that it also performs pretty well there. You could use a microscope AI to get a bunch of information about the process of AIs and how they work and how training works, and then get a whole bunch of information about that, have the humans learn that information, then use that information to improve our building of the next AIs and other AIs that we build.
01:29:29
Speaker
There are certain situations where Microscope AI is performance competitive and situations where it wouldn't be performance competitive, but it's a very interesting proposal because it's tackling things from a very different angle. It's like, well, you know, maybe we don't really need to be building agents. Maybe we don't really need to be doing this stuff. Maybe we can just be building this Microscope AI. I should mention the Microscope AI idea comes from Chris Olah, who works at OpenAI. The debate idea comes from Geoffrey Irving, who's now at DeepMind. And amplification comes from Paul Christiano, who's at OpenAI.

Resources and Pessimism on AI Alignment

01:29:56
Speaker
Yeah, so for sure, the best place to review these is by reading your post. And again, the post is An Overview of 11 Proposals for Building Safe Advanced AI by Evan Hubinger, and that's on the AI Alignment Forum. That's right. I should also mention that a lot of the stuff that I talked about in this podcast is coming from the Risks from Learned Optimization in Advanced Machine Learning Systems paper.
01:30:19
Speaker
All right. Wrapping up here, I'm interested in ending on a broader note. I'm just curious to know if you have concluding thoughts about AI alignment. How optimistic are you that humanity will succeed in building aligned AI systems? Do you have a public timeline that you're willing to share about AGI? How are you feeling about the existential prospects of Earth-originating life?
01:30:45
Speaker
It's a big question. So I tend to be on the pessimistic side. My current view, looking out on the field of AI and the field of AI safety, is that there are a lot of really challenging, difficult problems that we're at least not currently equipped to solve, and it seems quite likely that we won't be equipped to solve them by the time we need to solve them. I tend to think that the prospects for humanity aren't looking great right now, but I nevertheless have a very optimistic disposition. We're going to do the best that we can. We're going to try to solve these problems as effectively as we possibly can, and we're going to work on it, and hopefully we'll be able to make it happen.
01:31:15
Speaker
In terms of timelines, it's such a complex question. I don't know if I'm willing to commit to some timeline publicly. I think that it's just one of those things that is so uncertain. It's just so important for us to think about what we can do across different possible timelines and to focus on things which are generally effective regardless of how it turns out, because I think we're really just quite uncertain. You know, it could be as soon as five years or as long away as 50 years or 70 years. We really don't know, and I don't know if we have great track records of prediction in this setting.
01:31:45
Speaker
Regardless of when advanced AI comes, we need to be working to solve these problems and get more information on these problems, get to the point where we understand them and can address them, because when it does get to the point where we're able to build these really powerful systems, we need to be ready. So you do take very short timelines, like say five to 10 to 15 years, very seriously?
01:32:03
Speaker
I do take very short timelines very seriously. I think that if you look at the field of AI right now, there are these massive organizations, OpenAI and DeepMind, that are dedicated to the goal of producing AGI and are putting a huge amount of research effort into it. And I think it's incorrect to just assume that they're going to fail. I think that we have to consider the possibility that they succeed and that they do so quite soon. A lot of the top people at these organizations have very short timelines. And so I think that it's important to take that claim seriously and to think about what happens if it's true.
01:32:28
Speaker
I wouldn't bet on it. There's a lot of analysis that seems to indicate that, at the very least, we're going to need more compute than we'll have in that sort of timeframe. But timeline prediction is so difficult that it's important to consider all of these different possibilities. So yes, I take short timelines very seriously, but they're not my primary scenario. I also take long timeline scenarios quite seriously.
01:32:47
Speaker
Would you consider DeepMind and OpenAI to be explicitly trying to create AGI? OpenAI, yes, right? Yeah, for OpenAI it's part of the mission statement. DeepMind, some of the top people at DeepMind have talked about this, but it's not something that you would find on the website the way you would with OpenAI. But if you look historically at some of the things that Shane Legg and Demis Hassabis have said, a lot of it is about AGI.
01:33:09
Speaker
Yeah, so in terms of these being the leaders with just massive budgets and person power, how do you see the quality and degree of alignment and beneficial AI thinking and mindset within these organizations?

Bridging AI Alignment and Mainstream ML

01:33:27
Speaker
Because there seems to be a big distinction between
01:33:29
Speaker
the AI alignment crowd and the mainstream machine learning crowd. A lot of the mainstream ML community hasn't been exposed to many of the arguments or thinking within the safety and alignment crowd. Stuart Russell has been trying hard to shift the field away from the standard model and incorporate a lot of these new alignment considerations. So yeah, what do you think?
01:33:52
Speaker
I think this is a problem that is getting a lot better. Like you were mentioning, Stuart Russell has been really great on this. CHAI has been very effective at trying to really get this message across: you know, we're building AI, we should put some effort into making sure we're building safe AI. And I think this is working. If you look at a lot of the major ML conferences recently, I think basically all of them had workshops on beneficial AI. DeepMind has a safety team with lots of really good people. OpenAI has a safety team with lots of really good people.
01:34:17
Speaker
I think that the standard story of, oh, AI safety is just this thing that people who aren't involved in machine learning think about, no longer holds. In the current world it has become much more integrated with machine learning and is becoming more mainstream. But it's definitely still a process. Like Stuart Russell says, the field of AI has been very focused on this standard model, and trying to move people away from that and getting them to think about some of its consequences takes time; it takes an evolution of the field. But it is happening, and I think we're moving in a good direction.
01:34:46
Speaker
All right, well, Evan, I've really enjoyed this. I appreciate you explaining all of this and taking the time to unpack a lot of this machine learning language and concepts to make it digestible. Is there anything else here that you'd like to wrap up on or any concluding thoughts?
01:35:07
Speaker
If you want more detailed information on all of the things that I've talked about, the full analysis of inner alignment and outer alignment is in Risks from Learned Optimization in Advanced Machine Learning Systems, by me and many of my co-authors, as well as in the An Overview of 11 Proposals post, which you can find on the AI Alignment Forum. I think both of those are resources which I would recommend checking out for understanding more about what I talked about in this podcast. Do you have any social media or a website or anywhere else for us to point towards?
01:35:35
Speaker
Yeah, so I mean, you can find me on all the different sorts of social media platforms. I'm fairly active on GitHub. I do a bunch of open source development. You can find me on LinkedIn, Twitter, Facebook, all those various different platforms. I'm fairly Google-able. It's nice to have a fairly unique last name. So if you Google me, you should find all this information.
01:35:53
Speaker
One other thing, which I should mention specifically: everything that I do is all public. All of my writing is public. I try to publish all of my work, and I do so on the AI Alignment Forum. The AI Alignment Forum is a really, really great resource because it's a collection of writing by all of these different AI safety authors, and it's open to anybody who's a current AI safety researcher. You can find me on the AI Alignment Forum as evhub.
01:36:19
Speaker
All right, Evan, thanks so much for coming on today and it's been quite enjoyable. This has probably been one of the more fun AI alignment podcasts that I've had in a while. So thanks a bunch and I appreciate it. Absolutely. That's super great to hear. I'm glad that you enjoyed it. Hopefully everybody else does as well.
01:36:43
Speaker
If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We'll be back again soon with another episode in the AI alignment series.