
AIAP: An Overview of Technical AI Alignment with Rohin Shah (Part 1)

Future of Life Institute Podcast
The space of AI alignment research is highly dynamic, and it's often difficult to get a bird's eye view of the landscape. This podcast is the first of two parts attempting to partially remedy this by providing an overview of the organizations participating in technical AI research, their specific research directions, and how these approaches all come together to make up the state of technical AI alignment efforts. In this first part, Rohin moves sequentially through the technical research organizations in this space and carves through the field by its varying research philosophies. We also dive into the specifics of many different approaches to AI safety, explore where they disagree, discuss what properties varying approaches attempt to develop/preserve, and hear Rohin's take on these different approaches.

You can take a short (3 minute) survey to share your feedback about the podcast here: https://www.surveymonkey.com/r/YWHDFV7

In this podcast, Lucas spoke with Rohin Shah. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter.

Topics discussed in this episode include:
- The perspectives of CHAI, MIRI, OpenAI, DeepMind, FHI, and others
- Where and why they disagree on technical alignment
- The kinds of properties and features we are trying to ensure in our AI systems
- What Rohin is excited and optimistic about
- Rohin's recommended reading and advice for improving at AI alignment research
Transcript

Introduction and Episode Overview

00:00:12
Speaker
Hey everyone, welcome back to the AI Alignment podcast. I'm Lucas Perry, and today we'll be speaking with Rohin Shah. This episode is the first episode of two parts that both seek to provide an overview of the state of AI alignment. In this episode, we cover technical research organizations in the space of AI alignment, their research methodologies and philosophies, how these all come together on our path to beneficial AGI, and Rohin's take on the state of the field.
00:00:41
Speaker
As a general bit of announcement, I would love for this podcast to be particularly useful and informative for its listeners. So I've gone ahead and drafted a short survey to get a better sense of what can be improved. You can find a link to that survey in the description of wherever you might find this podcast or on the page for this podcast on the FLI website.
00:01:00
Speaker
Many of you will already be familiar with Rohin. He is a 4th year PhD student in Computer Science at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel, and Stuart Russell.

Guest Introduction: Rohan Shah

00:01:14
Speaker
Every week he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter. And so, without further ado, I give you Rohin Shah.
00:01:28
Speaker
Thanks so much for coming on the podcast, Rohin. It's really a pleasure to have you. Thanks so much for having me on again. I'm excited to be back. Yeah. Long time, no see since Puerto Rico and Beneficial AGI. And so speaking of Beneficial AGI, you gave quite a good talk there, which summarized technical alignment methodologies, approaches, and broad views at this time. And that is the subject of this podcast today.
00:01:54
Speaker
People can go and find that video on YouTube. And I suggest that you watch that. That should be coming out on the FLI YouTube channel in the coming weeks. But for right now, we're going to be going in more depth and with more granularity into a lot of these different technical approaches.

Different Approaches to AI Alignment

00:02:11
Speaker
So just to start off, it would be good if you could contextualize this list of technical approaches to AI alignment that we're going to get into within the different organizations that they exist at and the different philosophies and approaches that exist at these varying organizations. Okay. So disclaimer, I don't know all of the organizations that well.
00:02:36
Speaker
I know that people tend to fit CHAI in a particular mold, for example. CHAI is the place that I work at, and I mostly disagree with that being the mold for CHAI, so probably anything I say about other organizations is also going to be somewhat wrong, but I'll give it a shot anyway.
00:02:53
Speaker
So I guess I'll start with CHAI. I think our public output mostly comes from this perspective of how do we get AI systems to do what we want? So this is focusing on the alignment problem. How do we actually point them towards a goal that we actually want, align them with our values? Not everyone at CHAI takes this perspective, but I think that's the one most commonly associated with us. And it's probably the perspective on which we've published the most.
00:03:19
Speaker
It's also the perspective I usually but not always take. MIRI, on the other hand, takes a perspective of we don't even know what's going on with intelligence. Let's try and figure out what we even mean by intelligence, what it means for there to be a superintelligent AI system. What would it even do? Or how would we even understand it? Can we have a theory of what all of this means?
00:03:43
Speaker
We're confused. Let's be less confused. Once we're less confused, then we can think about how to actually get AI systems to do good things. That's one of the perspectives they take. Another perspective they take is that there's a particular problem with AI safety, which is that even if we knew what goals we wanted to put into an AI system, we don't know how to actually build an AI system that would reliably pursue those goals as opposed to something else.
00:04:12
Speaker
That problem of even if you know what you want to do, how do you get an AI system to do it is a problem that they focus on. And the difference from the thing I associated with CHAI before is that with the CHAI perspective, you're interested both in how do you get the AI system to actually pursue the goal that you want, but also how do you figure out what goal that you want? What is the goal that you want? Though I think most of the work so far has been on supposing you know the goal, how do you get your AI system to properly pursue it?

Safety Teams at DeepMind and OpenAI

00:04:40
Speaker
I think the DeepMind safety team, at least, is pretty split across many different ways of looking at the problem. I think Jan Leike, for example, has done a lot of work on reward modeling, and this sort of fits in with the how do we get our AI systems to be focused on the right task, the right goal, whereas Vika has done a lot of work on side effects or impact measures.
00:05:03
Speaker
I don't know if I would say this, but the way I interpret it is how do we impose a constraint upon the AI system such that it never does anything catastrophic, but it's not trying to get the AI system to do what we want, just not do what we don't want or what we think would be catastrophically bad.
00:05:22
Speaker
OpenAI safety also seems to be, okay, how do we get deep reinforcement learning to do good things, to do what we want, to be a bit more robust? And then there's also the iterated amplification, debate, and factored cognition area of research, which is more along the lines of, can we write down a system that could plausibly lead to us building an aligned AGI or aligned powerful AI system?
00:05:52
Speaker
FHI — there's no single coherent direction, that's all of FHI. Eric Drexler is also trying to understand how AI will develop in the future, which is somewhat very different from what MIRI is doing, but the same general theme of trying to figure out what is going on. So he just recently published a long technical report on comprehensive AI services, which is a general worldview for predicting what AI development will look like in the future.
00:06:17
Speaker
Like, if we believed that that was in fact the way AI would happen, we would probably change what we work on from the technical safety point of view.

Value Learning and Superintelligent AI

00:06:27
Speaker
And Owain Evans — Owain does a lot of stuff, so maybe I'm just not going to try to categorize him.
00:06:34
Speaker
And then Stuart Armstrong works on this, okay, how do we get value learning to work such that we actually infer a utility function that we would be happy for an AGI system to optimize, or a superintelligent AI system to optimize. And then Ought works on factored cognition, so it's very adjacent to the iterated amplification and debate research agendas.
00:06:57
Speaker
Then there are a few individual researchers scattered at, for example, Toronto, Montreal, ANU, and EPFL. Maybe I won't get into all of them because that's a lot, but we can delve into that later. Maybe a more helpful approach then would be if you could start by demystifying some of the MIRI stuff a little bit, which may seem most unusual.
00:07:23
Speaker
I guess strategically the point would be that you're trying to build a system, this AI system, that's going to be hopefully at some point in the future, vastly more intelligent than humans, because, you know, we'd want it to help us colonize the universe or something like that and lead to lots and lots of technological progress, et cetera, et cetera. But this basically means that humans will not be in control unless we very, very specifically arrange it such that we are in control.
00:07:52
Speaker
We have to thread the needle perfectly in order to get this to work out, in the same way that by default you would expect that the most intelligent beings are the ones that are going to decide what happens. And so we really need to make sure, and also it is probably hard to ensure, that these vastly more intelligent beings are actually doing what we want. Given that, it seems like what we want is a good theory that allows us to understand and predict what these AI systems are going to do.
00:08:21
Speaker
Maybe not in the fine nitty gritty details, because if we could predict what they would do, then we could do it ourselves and be just as intelligent as they are. But at least in broad strokes, what sorts of universes are they going to create?

Challenges in AI Alignment: Stability and Utility Functions

00:08:33
Speaker
But given that they can apply so much more intelligence than we can, we need our guarantees to be really, really strong, like almost proof level. Maybe actual proofs are a little too much to expect, but we want to get as close to it as possible.
00:08:47
Speaker
Now, if we want to do something like that, we need a theory of intelligence. We can't just sort of do a bunch of experiments, look at the results, and then try to extrapolate from there. Extrapolation does not give you the level of confidence that we would need for a problem this difficult. And so, rather, they would like to instead understand intelligence deeply and become less confused about it.
00:09:09
Speaker
Once you understand how intelligence works at a theoretical level, then you can start applying that theory to actual AI systems and seeing how they approximate the theory or make predictions about what different AI systems will do. And hopefully then we could say, yeah, this system that looks like it's going to be very powerful is approximating this particular idea, this particular part of the theory of intelligence. And we can see that with this particular theory of intelligence, we can align it with humans somehow and just expect that this is going to work out.
00:09:39
Speaker
Something like that. Now, that sounded kind of dumb even to me as I was saying it, but that's because we don't have the theory yet. It's very hard to speculate how you would use a theory before you actually have the theory. So that's the reason they're doing this. The actual thing that they're focusing on is centered around problems of embedded agency. And I should say, this is one of their, I think, two main strands of research; the other strand of research I do not know anything about because they have not published anything about it.
00:10:08
Speaker
But one of their strands of research is about embedded agency. And here the main point is that in the real world, any agent, any AI system or any human, is a part of their environment. They are smaller than the environment, and the distinction between agent and environment is not crisp. Maybe I think of my body as being part of me, but I don't know. To some extent, my laptop is also an extension of my agency. There's a lot of stuff I can do with it.
00:10:36
Speaker
Or on the other hand, you could think maybe my arms and limbs aren't actually a part of me. I could maybe get myself uploaded at some point in the future, and then I will no longer have arms and legs. But in some sense, I am still me. I'm still an agent. So this distinction is not actually crisp. And we always pretend that it is in AI so far. And it turns out that once you stop making this crisp distinction and start allowing the boundary to be fuzzy, there are a lot of weird, interesting problems that show up.
00:11:05
Speaker
And we don't know how to deal with any of them,

Embedded Agency and Self-Modification

00:11:07
Speaker
even in theory. So that's what they focus on. And can you unpack, given that AI researchers control the input output channels for AI systems, why is it that there is this fuzziness? It seems like you could extrapolate away the fuzziness given that there are these sort of rigid and selected IO channels. Yeah, I agree. That seems like the right thing for today's AI systems.
00:11:33
Speaker
But I don't know. If I think about, okay, this AGI, this generally intelligent AI system, I kind of expect it to recognize what's happening when we feed it inputs. Let's say we're imagining a money-maximizing AI system
00:11:49
Speaker
that's taking in inputs like stock prices and it outputs which stocks to buy and maybe it can also read the news that lets it get newspaper articles in order to make better decisions about which stocks to buy. At some point I expect this AI system to read about AI and humans and realize that hey,
00:12:08
Speaker
it must be an AI system, it must be getting inputs and outputs, and its reward function must be to make this particular number in the bank account be as high as possible. And then once it realizes this, there's this part of the world which is this number in the bank account.
00:12:24
Speaker
Or it could be this particular value, this particular memory block in its own CPU, and its goal is

Strategies for Corrigibility and Goal Alignment

00:12:32
Speaker
now to make that number as high as possible. In some sense, it's now modifying itself, especially if you're thinking of the memory block inside the CPU: if it goes and edits that and sets it to a million,
00:12:43
Speaker
a billion, the highest number possible in that memory block, then it seems like it has in some sense done some self-editing. It's changed the agent part of it. It could also go and be like, okay, actually what I care about is that this particular reward function box outputs as high a number as possible. So what if I go and change my input channel such that it feeds me things that cause me to believe that I've made tons and tons of profit? So this is a delusion box consideration.
00:13:12
Speaker
While it is true that I don't see a clear concrete way that an AI system ends up doing this, it does feel like an intelligent system should be capable of this sort of reasoning, even if it initially had these sort of fixed inputs and outputs. The idea here is that its outputs can be used to affect the inputs or future outputs.
00:13:33
Speaker
Right. So I think that that point is the clearest summation of this. It can affect its own inputs and outputs later. Like if you take human beings who are, by definition, human level intelligences, we have, say, in a classic computer science sense, if you thought of us, you'd say we strictly have five input channels, hearing, seeing, touch, smell, etc.
00:13:57
Speaker
Human beings have a fixed number of input-output channels, but obviously human beings are capable of self-modifying on those, and our agency is sort of squishy and dynamic in ways that would be very unpredictable. And I think that unpredictability, and the sort of almost-seeming ephemerality of being an agent, seems to be the crux of a lot of the problem.
00:14:18
Speaker
I agree that that's a good intuition pump. I'm not sure that I agree with the crux. The crux to me feels more like you specify some sort of behavior that you want, which in this case is make a lot of money or make this number in the bank account go higher or make this memory cell go as high as possible.
00:14:34
Speaker
And when you were thinking about the specification, you assumed that the inputs and outputs fell within some strict parameters, like the inputs are always going to be news articles that are real and produced by human journalists, as opposed to a fake news article that was created by the AI in order to convince the reward function that actually it's made a lot of money.
00:14:53
Speaker
And then the problem is that, since the AI's outputs can affect the inputs, the AI could cause the inputs to go outside of the space of possibilities that you imagined the inputs could be in. And this then allows the AI to game the specification that you had for it.
00:15:09
Speaker
Right. So all

Iterated Amplification and Debate in AGI Alignment

00:15:11
Speaker
of the parts which constitute some AI system are all potentially modified by other parts. And so you have something that is fundamentally and completely dynamic, which you're trying to make predictions about, but whose future structure is potentially very different and hard to predict based off of the current structure.
00:15:30
Speaker
Yeah, basically. And in order to get past this, we must, again, tunnel down on these decision theoretic and rational agency type issues at the bottom of intelligence to sort of have a more fundamental theory which can be applied to these highly dynamic and difficult to understand situations.
00:15:52
Speaker
Yeah, I think the MIRI perspective is something like that. And in particular, it would be like trying to find a theory that allows you to put in something that stays stable even while the system itself is very dynamic. Right. Even while you're a system whose parts are all completely dynamic and able to be changed by other parts, how do you maintain a degree of alignment amongst that?
00:16:16
Speaker
One answer to this is to give the AI a utility function, so there is a utility function it is explicitly trying to maximize. And in that case, it probably has an incentive to protect that utility function. Because if it gets changed, well, then it's not going to maximize that utility function anymore. It'll maximize something else, which will lead to worse behavior by the lights of the original utility function.
00:16:38
Speaker
That's a thing that you could hope to do with a better theory of intelligence is how do you create a utility function in an AI system that stays stable even as everything else is dynamically changing.
00:16:50
Speaker
Right. And without even getting into the issues of implementing one single stable utility function? Well, I think they're looking into those issues. So for example, Vingean reflection is a problem that is entirely about how do you create better, more improved versions of yourself without having value drift or a change to your utility function.
00:17:11
Speaker
Is your utility function not self-modifying? So in theory, it could be. The hope would be that we could design an AI system that does not self-modify its utility function under almost all circumstances. Because if you change your utility function, then you're going to start maximizing that new utility function, which by the original utility function's evaluation is worse.
00:17:36
Speaker
If I told you, Lucas, you have got to go fetch coffee, that's the only thing in life you're concerned about, you must take whatever actions are necessary in order to get the coffee. And then someone was like, hey Lucas, I'm going to change your utility function so that you want to fetch tea instead, and then all of your decision making is going to be in service of getting tea.
00:17:57
Speaker
You would probably say, no, don't do that. I want to fetch coffee right now. If you change my utility function to be fetch tea, then I'm going to fetch tea, which is bad because I want to fetch coffee. And so hopefully you don't change your utility function because of this effect.
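To make the argument concrete, here is a tiny worked sketch in Python (mine, not something from the conversation; the function names are purely hypothetical): the agent evaluates the option of accepting a change to its utility function using its current utility function, so the change scores worse and gets resisted.

    # Minimal sketch: an agent scores futures with its *current* utility function.
    def u_coffee(outcome):
        # current utility function: cares only about coffee
        return 1.0 if outcome == "coffee" else 0.0

    def outcome_after(change_accepted):
        # if the change is accepted, the agent will later optimize for tea and fetch tea
        return "tea" if change_accepted else "coffee"

    keep = u_coffee(outcome_after(change_accepted=False))     # 1.0
    change = u_coffee(outcome_after(change_accepted=True))    # 0.0
    print("keep:", keep, "change:", change)                   # keeping the utility function scores higher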
00:18:12
Speaker
Right, but isn't this where corrigibility comes in and where we admit that as we sort of understand more about the world and our own values, we'll want to be able to update utility functions? Yeah, so that is a different perspective. I'm not trying to describe that perspective right now. This is a perspective for how you could get something stable in an AI system. And I associate it most with Eliezer, though I'm not actually sure if he holds this opinion.
00:18:41
Speaker
Okay. So I think this was very helpful for the MIRI case. So why don't we go ahead and zoom in a bit on CHAI, which is the Center for Human-Compatible AI. So I think rather than talking about CHAI, I'm going to talk about the general field of trying to get AI systems to do what we want. A lot of people at CHAI work on that, but not everyone. And also a lot of people outside of CHAI work on it.
00:19:08
Speaker
because that seems to be a more useful carving of the field. So there's this broad argument for AI safety, which is: we're going to have very intelligent things, and based on the orthogonality thesis, we can't really say anything about their goals. So the really important thing is to make sure that the intelligence is pointed at the right goals. It's pointed at doing what we actually want. And so then the natural approach is how do we get our AI systems to infer what we want to do and then actually pursue that.
00:19:38
Speaker
But I think in some sense it's one of the most obvious approaches to AI safety. This is a clear enough problem, even with narrow current systems, that there are plenty of people outside of AI safety working on this as well. So this incorporates things like inverse reinforcement learning, preference learning, reward modeling. The CIRL (cooperative inverse reinforcement learning) paper also fits into all of this.
00:20:02
Speaker
So yeah, I can dig into any of those in more depth. Why don't you start off by talking about the people who exist within the field of AI safety — give sort of a brief characterization of what's going on outside of the field, but primarily focus on those within the field, and how this approach in practice is generally, say, different from MIRI, to start off with, because we have a clear picture of them painted right next to what we're delving into now.
00:20:32
Speaker
So I think the difference from MIRI is that this is more targeted directly at the problem right now, in that you are actually trying to figure out how do you build an AI system that does what you want. Now, admittedly, most of the techniques that people have come up with are not likely to scale up to superintelligent AI. They are not meant to. No one claims that they are going to scale up to superintelligent AI. They're more like some incremental progress on figuring out how to get AI systems to do what we want.
00:20:59
Speaker
And hopefully with enough incremental progress, we'll get to a point where we can be, yes, this is what we need to do. Probably the most well-known person here would be Dylan Hadfield-Menell, who you had on your podcast. And so he talked about CIRL and associated things quite a bit there. There's not really that much I would say in addition to it. Maybe a quick summary of Dylan's position is something like instead of having AI systems that are optimizing
00:21:26
Speaker
for their own goals, we need to have AI systems that are optimizing for our goals and try to infer our goals in order to do that. So rather than having an AI system that is individually rational with respect to its own goals, you instead want to have a human AI system such that the entire system is rationally optimizing for the human's goals.
00:21:51
Speaker
This is sort of the point made by CIRL, where you have an AI system and you've got a human, and they're playing this two-player game. The human is the only one who knows the reward function. The robot is uncertain about what the reward function is and has to learn by observing what the human does. And so now you see that the robot does not have a utility function that it is trying to optimize. Instead, it is learning about a utility function that the human has and then helping the human optimize that reward function.
00:22:19
Speaker
So summary, try to build human AI systems that are group rational as opposed to an AI system that is individually rational. So that's Dylan's view.
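As a minimal sketch of that setup (my own toy simplification, not the actual CIRL formalism from the paper), the robot keeps a belief over what the human values, updates it from an observed human action, and then helps with whatever it now believes the human wants:

    # Toy CIRL-flavored sketch: the human knows what they value; the robot starts uncertain,
    # observes the human, and then optimizes its posterior estimate of the human's reward.
    belief = {"apples": 0.5, "oranges": 0.5}          # robot's prior over the human's values

    def human_acts(true_value):
        return "picks_" + true_value                   # a noiseless demonstration, for simplicity

    def bayes_update(belief, observation):
        post = {v: p * (1.0 if observation == "picks_" + v else 0.0) for v, p in belief.items()}
        z = sum(post.values())
        return {v: p / z for v, p in post.items()}

    belief = bayes_update(belief, human_acts("apples"))
    # The robot acts to help with whatever it now believes the human values,
    # rather than optimizing any fixed goal of its own.
    print(belief, "-> robot helps with:", max(belief, key=belief.get))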
00:22:30
Speaker
Then there's Jan Leike at DeepMind, and a few people at OpenAI. Before we pivot into OpenAI and DeepMind, just sort of focusing here on the CHAI end of things and this broad view, help me explain here how you would characterize it — the present day, actively focused view on current issues and present day issues in alignment and making incremental progress there.
00:22:55
Speaker
This view here you see as sort of subsuming multiple organizations. Yes, I do. Okay. Is there a specific name you would again use to characterize this view? Oh, getting AI systems to do what we want. Let's see. Do I have a pithy name for this? Helpful AI systems or something. Right. Which again is focused on current day things, is seeking to make incremental progress and which subsumes many different organizations.
00:23:23
Speaker
Yeah, that seems broadly true. I do think that there are people who are doing more conceptual work thinking about how this will scale to AGI and stuff like that, but it's a minority of work in this space. Right. And so the question of how do we get AI systems to do what we want them to do also includes these views of, say, Vingean reflection, or how we become idealized versions of ourselves, or how we build on value over time. Right.
00:23:50
Speaker
Yeah, so those are definitely questions that you would need to answer at some point. I'm not sure that you would need to answer Vingean reflection at some point, but you would definitely need to answer, how do you update? Given that humans don't actually know what they want for the long-term future, you need to be able to deal with that fact at some point. It's not really a focus of current research, but I agree that that is a thing that this approach will have to deal with at some point.
00:24:17
Speaker
Okay. So moving on from you and Dylan to DeepMind and these other places where you view this sort of approach as also being practiced. Yeah. So while Dylan and I and others at CHAI have been focused on sort of conceptual advances, like in toy environments, does this do the right thing?
00:24:36
Speaker
What are some sources of data that we can learn from? Do they work in these very simple environments with quite simple algorithms? I would say that the OpenAI and DeepMind safety teams are more focused on trying to get this to work in complex environments — sort of, can we get this to work on state-of-the-art environments, the most complex ones that we have. Now, I don't mean Dota and StarCraft, because running experiments with Dota and StarCraft is incredibly expensive. But can we get AI systems that do what we want
00:25:05
Speaker
for environments like Atari or MuJoCo. There's some work on this happening at CHAI. There are preprints available online, but it hasn't been published very widely yet.
00:25:16
Speaker
Most of the work I would say has been happening with an OpenAI-DeepMind collaboration. And most recently, there was a position paper from DeepMind on recursive reward modeling. But before that, there was first a paper, deep reinforcement learning from human preferences, which said, okay, if we allow humans to specify what they want by just comparing between different pieces of behavior from the AI system, can we train an AI system to do what the human wants?
00:25:44
Speaker
And then they built on that in order to create a system that could learn from demonstrations initially, using a kind of imitation learning, and then improve upon the demonstrations using comparisons in the same way that deep RL from human preferences did.
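To give a rough sense of how learning from comparisons works, here is a toy sketch with a linear reward model and a Bradley-Terry style preference loss (an illustration of the general idea, not the architecture or training setup from the paper):

    # Fit a reward model so the trajectory segment the human preferred gets a higher predicted return.
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=4)                         # parameters of a linear reward model r(s) = w . s

    def segment_return(w, segment):
        return sum(w @ s for s in segment)         # predicted return of a trajectory segment

    def preference_loss(w, preferred, other):
        # Bradley-Terry model: P(preferred beats other) = sigmoid(R_preferred - R_other)
        diff = segment_return(w, preferred) - segment_return(w, other)
        return np.log1p(np.exp(-diff))             # negative log-likelihood of the human's choice

    # One comparison from a (hypothetical) human labeler, and one finite-difference gradient step:
    preferred = [rng.normal(size=4) for _ in range(5)]
    other = [rng.normal(size=4) for _ in range(5)]
    eps = 1e-5
    grad = np.array([(preference_loss(w + eps * np.eye(4)[i], preferred, other)
                      - preference_loss(w, preferred, other)) / eps for i in range(4)])
    w -= 0.1 * grad                                # an RL agent would then optimize this learned reward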
00:26:00
Speaker
So one way that you can view this research is that there's this field of human computer interaction, which is about, well, it's about many things, but one of the things that it's about is how do you make the user interface for humans intuitive and easy to use such that you don't have user error or operator error?
00:26:18
Speaker
One comment from that field that I like is that most of the things that are classified as user error or operator error should not be classified as such; they should be classified as interface errors, where you had such a confusing interface that of course at some point some user was going to get it wrong. And similarly, here what we want is a particular behavior out of the AI,
00:26:38
Speaker
or at least a particular set of outcomes from the AI. Maybe we don't know exactly how to achieve those outcomes. And AI is about giving us the tools to create that behavior in automated systems. The current
00:26:52
Speaker
tool that we all use is the reward function. We write down the reward function, and then we give it to an RL algorithm, and it produces the behaviors or the outcomes that we want. And reward functions are just a pretty terrible user interface. They're better than the previous interface, which was writing a program explicitly, which humans cannot do if the task is something like image classification or continuous control in MuJoCo. It's an improvement on that.
00:27:17
Speaker
But reward functions are still a pretty poor interface because they're implicitly saying that they encode perfect knowledge of the optimal behavior in all possible environments, which is clearly not a thing that humans can do. I would say that this area is about moving on from reward functions, going to the next thing that makes
00:27:37
Speaker
the human's job even easier. And so we've got things like comparisons, we've got things like inverse reward design, where you specify a proxy reward function that only needs to work in the training environment, or you do something like inverse reinforcement learning, where you learn from demonstrations. So I think that's one nice way of looking at this field. So do you have anything else you would like to add here about this "getting present day AI systems to do what we want" section of the field?
00:28:03
Speaker
Maybe I want to plug my value learning sequence, because it talks about this much more eloquently than I can on this podcast. Sure. Where can people find your value learning sequence? It's on the Alignment Forum. You just go to the Alignment Forum; at the top, there's recommended sequences. There's embedded agency, which is from MIRI, the sort of stuff we already talked about. So that's also a great sequence. I would recommend it. There's iterated amplification, also a great sequence. We haven't talked about it yet. And then there's my value learning sequence. So you can see it on the front page of the Alignment Forum.
00:28:34
Speaker
Great. So we've characterized these say different parts of the AI alignment field and broadly just so far it's been cut into this sort of mirror review and then this broad approach of trying to get present day AI systems to do what we want them to do and to make incremental progress there. Are there any other slices of the AI alignment field that you would like to bring to light?
00:28:59
Speaker
Yeah, I've got four or five more. There's the iterated amplification and debate side of things, which is how do we build using current technologies, but imagining that they were way better? How do we build an aligned AGI? So they're trying to solve the entire problem as opposed to making incremental progress and simultaneously, hopefully thinking about conceptually, how do we fit all of these pieces together? There's limiting the AGI system.
00:29:28
Speaker
which is more about how do we prevent AI systems from behaving catastrophically. It makes no guarantees about the AI systems doing what we want. It just prevents them from doing really, really bad things. Techniques in that section include boxing and avoiding side effects. There's the robustness view, which is about how do we make AI systems, well, behave robustly. I guess that's pretty self-explanatory.
00:29:57
Speaker
There is transparency or interpretability, which I wouldn't say is a technique by itself, but seems to be broadly useful for almost all of the other avenues. It's a thing we would want to add to other techniques in order to make those techniques more effective. There's also, in the same frame as MIRI's, can we even understand intelligence?
00:30:18
Speaker
Can we even forecast what's going to happen with AI? And within that, there's comprehensive AI services. There's also lots of efforts on forecasting, but comprehensive AI services actually makes claims about what technical AI safety should do. So I think that one actually does have a place in this podcast, whereas most of the forecasting things do not, obviously. They have some implications on the strategic picture, but they don't have clear implications on technical safety research directions, as far as I can tell right now.
00:30:48
Speaker
All right. So do you want to go ahead and start off with the first one on the list there and then we'll move sequentially down? Yeah. So iterated amplification and debate. This is similar to the helpful AGI section in the sense that we are trying to build an AI system that does what we want. That's still the case here, but we're now trying to figure out conceptually, how can we do this using things like reinforcement learning and supervised learning, but imagining that they're way better than they are right now.
00:31:16
Speaker
such that the resulting agent is going to be aligned with us and can reach arbitrary levels of intelligence. So in some sense it's trying to solve the entire problem. We want to come up with a scheme such that if we run that scheme, we get good outcomes. We've solved almost all of the problem.
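As a very schematic illustration of the amplify-and-distill loop (my gloss on the idea using a toy task, not anyone's actual implementation): the overseer answers a hard question by decomposing it and delegating subquestions to the current model, and the next model is trained to imitate those amplified answers.

    # Toy task: "what is the sum of this tuple of numbers?"
    model = {}                                   # the "model" is just a lookup table we distill into

    def model_answer(question):
        # weak direct answer: only handles single numbers until it has been trained
        return model.get(question, question[0] if len(question) == 1 else 0)

    def amplify(question):
        # the overseer decomposes the question, delegates subquestions to the current model, combines
        if len(question) == 1:
            return question[0]
        mid = len(question) // 2
        return model_answer(question[:mid]) + model_answer(question[mid:])

    def distill(questions):
        for q in questions:
            model[q] = amplify(q)                # train the model to imitate the amplified answers

    # Each round, the overseer-plus-model system answers slightly harder questions,
    # and distillation folds that ability back into the model.
    for size in (1, 2, 4):
        distill([tuple(range(i, i + size)) for i in range(4)])
    print(model_answer((0, 1, 2, 3)))            # -> 6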
00:31:33
Speaker
I think that it also differs in that the argument for why we can be successful is also different. This field is aiming to get a property of corrigibility, which I like to summarize as trying to help the overseer.
00:31:51
Speaker
It might fail to help the overseer or the human or the user because it's not very competent. And maybe it makes a mistake and thinks that I like apples when actually I want oranges, but it was actually trying to help me. It actually thought I want apples. So with corrigibility, you're aiming for trying to help the overseer. Whereas in the previous thing about helpful AGI, you're more getting an AI system that actually does what we want. There isn't this distinction between what you're trying to do versus what you actually do.
00:32:19
Speaker
So there's a slightly different property that you're trying to ensure. I think on the strategic picture, that's the main difference. The other difference is that these approaches are trying to make a single unified, generally intelligent AI system. And so they will make assumptions like given that we're trying to imagine something that's generally intelligent, it should be able to do X, Y, and Z. Whereas the research agenda that's let's try to get AI systems that do what we want tends not to make those assumptions.
00:32:49
Speaker
and so is more applicable to current systems or narrow systems where you can't assume that you have general intelligence.
00:32:57
Speaker
For example, a claim that Paul Christiano often talks about is that if your AI agent is generally intelligent and a little bit corrigible, it will probably easily be able to infer that its overseer or the user would like to remain in control of any resources that they have and would like to be better informed about the situation, that the user would prefer that the agent does not lie to them, et cetera, et cetera.
00:33:21
Speaker
It is definitely not something that current AI systems can do unless you really engineer them to. So this is presuming some level of generality, which we do not currently have.
00:33:33
Speaker
So the next thing I said was limited AGI. Here the idea is there are not very many policies or AI systems that will do what we want. What we want is a pretty narrow space in the space of all possible behaviors. Actually selecting one of the behaviors out of that space is quite difficult and requires a lot of information in order to narrow in on that piece of behavior.
00:33:57
Speaker
But if all you're trying to do is avoid the catastrophic behaviors, then there are lots and lots of policies that successfully do that. And so it might be easier to find one of those policies, a policy that doesn't ever kill all humans.
00:34:12
Speaker
at least the space of those policies, right? One might have this view and not think it sufficient for AI alignment, but see it as sort of a low-hanging fruit to be picked because the space of non-catastrophic outcomes is larger than the space of extremely specific futures that human beings support.
00:34:32
Speaker
Yeah, exactly. And the success story here is basically that we developed this way of preventing catastrophic behaviors. All of our AI systems are built with the system in place, and then technological progress continues as usual. It's maybe not as fast as it would have been if we had an aligned AGI doing all of this for us, but hopefully it would still be somewhat fast and hopefully enabled a bit by AI systems.
00:34:56
Speaker
Eventually, we will either make it to the future without ever building an AI system that doesn't have the system in place, or we use this to do a bunch more AI research until we've solved the full alignment problem, and then we can build with high confidence that it'll go well, an actual proper aligned superintelligence that is helping us without any of these limitation systems in place.
00:35:18
Speaker
I think from a strategic picture, that's basically the important parts about limited AGI. There are two subsections within this, limits based on trying to change what the AI is optimizing for. So this would be something like impact measures versus limits on the input output channels of the AI system. So this would be something like AI boxing.
00:35:38
Speaker
So with robustness, I sort of think of robustness as mostly it's not going to give us safety by itself, probably, though there are some scenarios in which it could happen.
00:35:50
Speaker
It's more meant to harden whichever other approach that we use. Maybe if we have an AI system that is trying to do what we want to go back to the helpful AGI setting, maybe it does that 99.9% of the time, but we're using this AI to make millions of decisions, which means it's going to not do what we want a thousand times
00:36:12
Speaker
that seems like way too many times for comfort. Because if it's applying its intelligence to the wrong goal in those 1,000 times, you could get some pretty bad outcomes. This is a super heuristic and fluffy argument that there are lots of problems with, but I think it sets up the general reason that we would want robustness.
00:36:30
Speaker
So with robustness techniques, you're basically trying to get some nice worst case guarantees. Let's say, yeah, the AI system is never going to screw up super, super bad. And this is helpful when you have an AI system that's going to make many, many, many decisions and you want to make sure that none of those decisions are going to be catastrophic. And so some techniques in here include verification, adversarial training, and other adversarial ML techniques like Byzantine fault tolerance or stuff like that.
00:37:00
Speaker
These also cover things like data poisoning. Interpretability can also be helpful for robustness if you've got a strong overseer who can use interpretability to give good feedback to your AI system. But yeah, the overall goal is take something that works 99% of the time and get it to work 100% of the time, or check whether or not it ever fails, so that you don't have this very rare but very bad outcome.
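Here is a rough sketch of what the simplest adversarial training loop can look like (a toy linear classifier and a hand-rolled adversary of my own, not any particular paper's method): an adversary searches for inputs where the current model fails, and those failure cases get folded back into training.

    import numpy as np

    rng = np.random.default_rng(0)

    def model_predict(w, x):
        return 1 if x @ w > 0 else 0

    def adversary(w, x, label, steps=20, lr=0.5):
        # nudge x until the model misclassifies it, while we treat the true label as unchanged
        x = x.copy()
        for _ in range(steps):
            if model_predict(w, x) != label:
                return x                                      # found a failure case
            x = x - lr * w if label == 1 else x + lr * w      # push the score the wrong way
        return None

    X = rng.normal(size=(50, 3)) + 2.0                        # toy data, all labeled positive
    y = np.ones(50, dtype=int)
    w = rng.normal(size=3)
    for _ in range(10):
        data = list(zip(X, y))
        found = [(adversary(w, x, lab), lab) for x, lab in data]
        data += [(x, lab) for x, lab in found if x is not None]
        for x, lab in data:                                   # perceptron-style update
            if model_predict(w, x) != lab:
                w += x if lab == 1 else -x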
00:37:25
Speaker
And so would you see this section as being within the context of any others, or being sort of at a higher level of abstraction? I would say that it applies to any of the others. Well, okay, not to the MIRI and embedded agency stuff, because we don't really have a story for how that ends up helping with AI safety. It could apply to however that cashes out in the future, but we don't really know right now.
00:37:54
Speaker
With limited AGI, maybe you have this theoretical model. If you apply this sort of penalty, this sort of impact measure, then you're never going to have any catastrophic outcomes. But of course, in practice, we train our AI systems to optimize that penalty and we get the sort of weird black box thing out. And we're not entirely sure if it's respecting the penalty or something like this.
00:38:14
Speaker
Then you could use something like verification or transparency in order to make sure that this is actually behaving the way we would predict it would behave based on our analysis of what limits we need to put on the AI system. Similarly, if you build AI systems that are doing what we want, maybe you want to use adversarial training to see if you can find any situations in which the AI system is doing something weird, doing something which we wouldn't classify as what we want.
00:38:39
Speaker
With iterated amplification or debate, maybe we want to verify that the corrigibility property happens all the time. It's unclear how you would use verification for that because it seems like a particularly hard property to formalize, but you could still do things like adversarial training or transparency.
00:38:57
Speaker
We might have nice theoretical arguments for why our systems will work, but then once we turn them into actual real systems that will probably use neural nets and other messy stuff like that, are we sure that in the translation from theory to practice all of our guarantees stayed? Unclear, we should probably use some robustness techniques to check that.
00:39:19
Speaker
interpretability, I believe was next. It's sort of similar in that it's broadly useful for everything else. If you want to figure out whether an AI system is doing what you want, it would be really helpful to be able to look into the agent and see, oh, it chose to buy apples because it had seen me eat apples in the past versus it chose to buy apples because there was this company that paid it to buy me apples so that it would make more profit.
00:39:47
Speaker
If we could see those two cases, if we could actually see into the decision-making process, it becomes a lot easier to tell whether or not the AI system is doing what we want, or whether or not the AI system is corrigible, or whether or not the AI system is properly... Well, maybe it's not as obvious for impact measures, but I would expect it to be useful there as well, even if I don't have a story off the top of my head.
00:40:10
Speaker
Similarly with robustness, if you're doing something like adversarial training, it sure would help if your adversary was able to look into the inner workings of the agent and be like, ah, I see this agent. It tends to underweight this particular class of risky outcomes. So why don't I search within that class of situations for one that it's going to take a big risk on that it shouldn't have taken otherwise. It just makes all of the other problems a lot easier to do.
00:40:37
Speaker
And so how is progress made on interpretability? Right now, I think most of the progress is in image classifiers. I've seen some work on interpretability for deep RL as well. Honestly, probably most of the research is happening with classification systems, primarily image classifiers, but others as well. And then I also see deep RL explanation systems, because I survey a lot of deep RL research.
00:41:07
Speaker
But it's motivated a lot by real problems with current AI systems, and interpretability helps you to diagnose and fix those as well. For example, the problems of bias in classifiers. One thing that I remember from Deep Dream is you can ask Deep Dream to visualize barbells, and you always see these sort of muscular arms that are attached to the barbells, because in the training set, barbells were always being picked up by muscular people.
00:41:35
Speaker
So that's a way that you can tell that your classifier is not really learning the concepts that you wanted it to do. In the bias case, maybe your classifier always classifies anyone sitting at a computer as a man because of bias in the data set and using interpretability techniques. You could see that, okay, when you look at this picture, the AI system is looking primarily at the pixels that represent the computer as opposed to the pixels that represent the human.
00:42:04
Speaker
and making its decision to label this person as a man based on that and you're like, no, that's clearly the wrong thing to do. The classifier should be paying attention to the human, not to the laptop.
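A toy sketch of one common family of interpretability techniques, sensitivity-based saliency (illustrative only, and not necessarily what was used in the example described here): check which input features most change the classifier's output, for instance the person region versus the computer region.

    import numpy as np

    w = np.array([0.05, 0.9])            # toy linear "classifier" over two features:
                                         # [person-region evidence, computer-region evidence]
    def predict_man_probability(x):
        return 1.0 / (1.0 + np.exp(-(x @ w)))

    x = np.array([1.0, 1.0])             # an image containing both a person and a computer
    eps = 1e-4                           # finite-difference saliency per feature
    saliency = np.array([
        (predict_man_probability(x + eps * np.eye(2)[i]) - predict_man_probability(x)) / eps
        for i in range(2)
    ])
    print(dict(zip(["person_region", "computer_region"], saliency)))
    # If the computer-region saliency dominates, the classifier is using the laptop rather than
    # the person to decide the label -- the kind of bias interpretability tools can surface.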
00:42:16
Speaker
So I think a lot of interpretability research right now is you pick a particular short-term problem and figure out how you can make that problem easier to solve. Though a lot of it is also, what would be the best way to understand what our model is doing? So I think a lot of the work that Chris Olah is doing, for example, is in this vein. And then as you do this exploration, you find instances of bias in the classifiers that you're studying.
00:42:40
Speaker
So comprehensive AI services is an attempt to predict what the future of AI development will look like. And the hope is that by doing this, we can figure out what sort of technical safety things we will need to do.
00:42:55
Speaker
or strategically what sort of things we should push for in the AI research community in order to make those systems safer. There's a big difference between we are going to build a single unified AGI agent and it's going to be generally intelligent and optimize the world according to a utility function versus we are going to build a bunch of disparate, separate, narrow AI systems that are going to interact with each other quite a lot and because of that they will be able to do a wide variety of tasks.
00:43:23
Speaker
None of them are going to look particularly like expected utility maximizers. And the safety research you want to do is different in those two different worlds. And CAIS is basically saying we're in the second of those worlds, not the first one. Can you go ahead and tell us about ambitious value learning?
00:43:41
Speaker
Yeah. So with ambitious value learning, this is also an approach to how do we make an aligned AGI solve the entire problem in some sense, which is look at not just human behavior, but also human brains and the algorithm that they implement and use that to infer an adequate utility function, one that we would be okay with the behavior that results from that.
00:44:05
Speaker
infer this utility function and then plug it into an expected utility maximizer. Now, of course, we do have to solve problems with, even once we have the utility function, how do we actually build a system that maximizes that utility function, which is not a solved problem yet. But it does seem to be capturing the main difficulties if you could actually solve the problem. And so this is an approach I associate most with Stuart Armstrong.
00:44:30
Speaker
All right. And so you were saying earlier in terms of your own view, it's sort of an amalgamation of different credences that you have in the potential efficacy of all of these different approaches. So given all of these and all of their broad missions and interests and assumptions that they're willing to make, what are you most hopeful about? What are you excited about? How do you sort of assign your credence and time here?
00:44:55
Speaker
I think I'm most excited about the concept of corrigibility. That seems like the right thing to aim for. It seems like it's a thing we can achieve. It seems like if we achieve it, we're probably okay. Nothing's going to go horribly wrong and probably it will go very well.
00:45:11
Speaker
I am less confident on which approach to corrigibility I am most excited about. Iterated amplification and debate seem like, if we were to implement them, they would probably lead to corrigible behavior. But I am worried that either we won't actually be able to build generally intelligent agents, in which case both of those approaches don't really work.
00:45:36
Speaker
Or another worry that I have is that those approaches might be like too expensive to actually do and that other systems are just so much more computationally efficient that we just use those instead due to economic pressures.
00:45:51
Speaker
Paul does not seem to be worried by either of these things, and is definitely aware of both of these issues. In fact, he was the one, I think, who listed computational efficiency as a desideratum, and he still is optimistic about them, so I would not put a huge amount of credence in this view of mine. If I were to say what I was excited about for corrigibility instead of that,
00:46:14
Speaker
It would be something like: take the research that we're currently doing on how to get current AI systems to work, which I often call narrow value learning. If you take that research, it seems plausible that this research, extended into the future, will give us some method of creating an AI system that's learning our narrow values and is corrigible as a result of that, even if it is not generally intelligent.
00:46:43
Speaker
This is sort of a very hand-wavy, speculative intuition, certainly not as concrete as the hope that we have with iterated amplification, but I'm somewhat optimistic about it. I'm less optimistic about limiting AI systems. It seems like even if you succeed in finding a nice
00:47:04
Speaker
simple rule that eliminates all catastrophic behavior, which plausibly you could do. It seems hard to find one that both does that and also lets you do all of the things that you do want to do. If you're talking about impact measures, for example, if you require your AI to be low impact, I expect that that would prevent you from doing many things that we actually want to do because many things that we want to do are actually quite high impact.
00:47:30
Speaker
Now, Alex Turner disagrees with me on this, and he developed attainable utility preservation. He is explicitly working on this problem and disagrees with me. So again, I don't know how much credence to put on this. I don't know if Vika agrees with me on this or not. She also might disagree with me, and she is also directly working on this problem. So yeah.
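For intuition, here is a generic sketch of an impact-penalized reward (a simplification in the spirit of attainable utility preservation, not Alex Turner's or Vika's actual formulation): the task reward is reduced by how much an action shifts the agent's ability to achieve a set of auxiliary goals, relative to doing nothing.

    def penalized_reward(task_reward, attainable_after_action, attainable_after_noop, beta=5.0):
        # attainable_*: dict mapping auxiliary goal -> how achievable it is from the resulting state
        impact = sum(abs(attainable_after_action[g] - attainable_after_noop[g])
                     for g in attainable_after_noop)
        return task_reward - beta * impact

    # A high-impact plan shifts the attainable utilities a lot, so it scores worse than a
    # modest plan even though its raw task reward is higher:
    print(penalized_reward(10.0, {"goal_a": 0.1, "goal_b": 0.0}, {"goal_a": 0.8, "goal_b": 0.7}))  # 3.0
    print(penalized_reward(6.0,  {"goal_a": 0.8, "goal_b": 0.7}, {"goal_a": 0.8, "goal_b": 0.7}))  # 6.0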
00:47:53
Speaker
It seems hard to put a limit that also lets us do the things that we want, and in that case, it seems like due to economic pressures, we'd end up doing the things that don't limit our AI systems from doing what we want. I want to keep emphasizing my extreme uncertainty over all of this, given that other people disagree with me on this, but that's my current opinion. Similarly with boxing, it seems like it's going to just make it very hard to actually use the AI system.
00:48:20
Speaker
Robustness and interpretability seem very broadly useful, and I'm supportive of most research on interpretability, maybe with an eye towards long-term concerns, just because it seems to make every other approach to AI safety a lot more feasible and easier to solve. I don't think it's a solution by itself, but given that it seems to improve almost every story I have for making an aligned AGI,
00:48:47
Speaker
seems like it's very much worth getting a better understanding of it. Robustness is an interesting one. It's not clear to me if it is actually necessary. I kind of want to just voice lots of uncertainty about robustness and leave it at that.
00:49:04
Speaker
It's certainly good to do in that it helps us be more confident in our AI systems, but maybe everything would be okay even if we just didn't do anything. I don't know. I feel like I would have to think a lot more about this and also see the techniques that we actually use to build AGI in order to have a better opinion on that. Could you give a few examples of where your intuitions here are coming from that don't see robustness as an essential part of AI alignment?
00:49:33
Speaker
Well, one major intuition pump is if you look at humans, there are at least some humans where I'm like, okay, I could just make this human a lot smarter, a lot faster, have them think for many, many years. And I still expect that they will be robust and not lead to some catastrophic outcome. They may not do exactly what I wanted because they're doing what they want, but they're probably going to do something reasonable. They're not going to.
00:50:00
Speaker
do something crazy or ridiculous. I feel like humans, some humans, the sufficiently risk-averse and uncertain ones seem to be reasonably robust. I think that if you know that you are planning over a very, very, very long time horizon, so imagine that you know you're planning over billions of years, then the rational response to this is, I really better make sure not to screw up right now since
00:50:26
Speaker
There is just so much reward in the future. I really need to make sure that I can get it. And so you get very strong pressures for preserving option value or not doing anything super crazy. So I think you could plausibly just get reasonable outcomes from those effects. But again, these are not well thought out.
00:50:45
Speaker
All right, and so I just want to go ahead and guide us back to your general views, again, on the approaches. Is there anything that you would like to add there on the approaches? I think I didn't talk about CAIS yet. I guess my general view of CAIS: I broadly agree with it, that this does seem to be the most likely development path, meaning that it's more likely than any other specific development path, but not more likely than all other development paths combined.
00:51:11
Speaker
So I broadly agree with the worldview presented. I'm still trying to figure out what implications it has for technical safety research. I don't agree with all of it. In particular, I think that you are likely to get AGI agents at some point, probably after the CAIS soup of services happens, which I think, again, Drexler disagrees with me on that. So I put a bunch of uncertainty on that.
00:51:38
Speaker
But I broadly agree with the worldview that CAIS is proposing.
00:51:42
Speaker
In terms of this disagreement between you and Eric Drexler, are you imagining agenty AGI or superintelligence which comes after the CAIS soup? Do you see that as an inevitable byproduct of CAIS, or do you see that as a choice that humanity will make? And is Eric pushing the view that the agenty stuff doesn't necessarily come later, that it's a choice that human beings would have to make?
00:52:09
Speaker
I do think it's more like saying that this will be a choice that humans will make at some point. I'm sure that Eric, to some extent, is saying, yeah, just don't do that. But I think Eric and I do, in fact, have a disagreement on how much more performance you can get from an AGI agent than from a CAIS soup of services.
00:52:28
Speaker
My argument is something like there is efficiency to be gained from going to an AGI agent and Eric's position as best I understand it is that there is actually just not that much economic incentive to go to an AGI agent.
00:52:42
Speaker
What are your intuition pumps for why you think that you will gain a lot of computational efficiency from creating sort of an AGI agent? We don't have to go super deep, but I guess a terse summary or something? Sure. I guess main intuition pump is that in all of the past cases that we have of AI systems, you see that in speech recognition, in deep reinforcement learning, in image classification.
00:53:09
Speaker
We had all of these hand-built systems that separated the task out into a few different modules that interacted with each other in a vaguely CAIS-like way, and then at some point we got enough compute and large enough data sets that we just threw deep learning at it, and deep learning just blew those approaches out of the water. So there's the argument from empirical experience.
00:53:31
Speaker
And there's also the argument that if you try to modularize your systems yourself, you can't really optimize the communication between them. You're less integrated and you can't make decisions based on global information. You have to make them based off of local information. And so the decisions tend to be a little bit worse. This could be taken as an explanation for the empirical observation that I already made.
00:53:55
Speaker
So that's another intuition pump there. Eric's response would probably be something like: sure, this seems true for these narrow tasks. You can get a lot of efficiency gains by integrating everything together and throwing deep learning and end-to-end training at all of it.
00:54:11
Speaker
But for sufficiently high-level tasks, there's not really that much to be gained by using global information instead of local information, so you don't actually lose much by having these separate systems, and you do get a lot of computational efficiency and generalization bonuses by modularizing. He had a good example of this that I'm not going to replicate, and I don't want to make up my own example because it's not going to be as convincing, but that's his current argument.
00:54:36
Speaker
And then my counter-counterargument is that that's because humans have small brains: given the size of our brains and the limits of our data and the limits of the compute that we have, we are forced to use modularity and systemization to break tasks apart into modular chunks that we can then do individually. Like if you are running a corporation, you need each person to specialize in their own task without thinking about all the other tasks, because we just do not have the ability to optimize for all of everything altogether, because we have small brains, relatively speaking.
00:55:06
Speaker
Or limited brains, I should say. But this is not a limit that AI systems will have. An AI system with just vastly more compute than the human brain and vastly more data will in fact just be able to optimize all of this with global information and get better results. So that's one thread of the argument, taken down two or three levels of arguments and counterarguments. There are other threads of that debate as well.
00:55:30
Speaker
I think that that serves our purpose for illustrating that here. So are there any other approaches here that you'd like to cover, or is that it? I didn't talk about factored cognition very much, but I think it's worth highlighting separately from iterated amplification in that it's testing an empirical hypothesis: can humans decompose tasks into chunks that take some small amount of time, and can we do arbitrarily complex tasks using these humans?
00:55:54
Speaker
I'm particularly excited about this sort of work that's trying to figure out what humans are capable of doing and what supervision they can give to AI systems, mostly because going back to a thing I said way back in the beginning, what we're aiming for is a human AI system to be collectively rational as opposed to an AI system that is individually rational. Part of the human AI system is the human. You want to be able to know what the human can do.
00:56:19
Speaker
What sort of policies they can implement, what sort of feedback they can be giving to the AI system. And something like factored cognition is testing a particular aspect of that. I think that seems great and we need more of it.
00:56:32
Speaker
Right. I think that this seems to be the sort of emerging view of where social science or social scientists are needed in AI alignment: in order to, again, as you said, understand what human beings are capable of in terms of supervision, and to analyze the human component of the AI alignment problem, since it requires us to be collectively rational with AI systems.
00:56:54
Speaker
Yeah, that seems right. I expect more writing on this in the future. All right, so there's just a ton of approaches here to AI alignment. And our heroic listeners have a lot to take in here. In terms of getting more information generally about these approaches, or if people are still interested in delving into all these different views that people take on the problem and methodologies of working on it, what would you suggest that interested persons look into or read?
00:57:23
Speaker
I cannot give you an overview of everything because that does not exist; to the extent that it exists, it's either this podcast or the talk that I gave at Beneficial AGI. I can suggest resources for individual items. So for embedded agency, there's the embedded agency sequence on the Alignment Forum, far and away the best thing to read for that.
00:57:48
Speaker
For CAIS, comprehensive AI services, there was a 200-plus-page tech report published by Eric Drexler at the beginning of this month. If you're interested in it, you should totally read the entire thing. It is quite good. But I also wrote a summary of it on the Alignment Forum, which is much more readable in the sense that it's shorter. And then there are a lot of comments on there that analyze it a bit more.
00:58:14
Speaker
There's also another summary written by Richard Ngo, also on the Alignment Forum. Maybe it's only on Less Wrong, I forget. It's probably on the Alignment Forum. So that's a different take on comprehensive AI services, so I'd recommend reading that too. For limited AGI, I have not really been keeping up with the literature on boxing, so I don't have a paper to recommend. I know that a couple have been written by, I believe, Jim Babcock and Roman Yampolskiy.
00:58:44
Speaker
For impact measures, you want to read Vika's paper on relative reachability. There's also a blog post about it if you don't want to read the paper. And Alex Turner's blog post on attainable utility preservation. I think it's called Towards a New Impact Measure, and this is on the Alignment Forum. For robustness.
00:59:07
Speaker
I would read Paul Christiano's post called Techniques for Optimizing Worst Case Performance. This is definitely specific to how robustness will help under Paul's conception of the problem. And in particular, he's thinking of robustness in the setting where you have a very strong overseer for your AI system. But I don't know of any other paper or blog post that's talking about robustness generally.
00:59:35
Speaker
For AI systems that do what we want, there's my value learning sequence that I mentioned before on the Alignment Forum. There's CIRL, or Cooperative Inverse Reinforcement Learning, which is a paper by Dylan and others. There's Deep Reinforcement Learning from Human Preferences and Recursive Reward Modeling. These are both papers that are particular instances of work in this field.
01:00:02
Speaker
I also want to recommend Inverse Reward Design because I really like that paper. So that's also a paper by Dylan and others. For corrigibility and iterated amplification, there's the iterated amplification sequence on the Alignment Forum, or half of what Paul Christiano has written. If you don't want to read an entire sequence of blog posts, then I think Clarifying AI Alignment is probably the post I would recommend.
01:00:29
Speaker
It's one of the posts in the sequence. It talks about this distinction of creating an AI system that is trying to do what you want as opposed to actually doing what you want, and why we might want to aim for only the first one.
01:00:43
Speaker
For iterated amplification itself, the technique, there is a paper that I believe is called something like Supervising Strong Learners by Amplifying Weak Experts, which is a good thing to read. There's also a corresponding OpenAI blog post, whose name I forget. I think if you search iterated amplification, OpenAI blog, you'll find it. And then for debate, there's AI Safety via Debate, which is a paper. There's also a corresponding OpenAI blog post.
01:01:13
Speaker
For factored cognition, there's a post called Factored Cognition on the Alignment Forum, again in the iterated amplification sequence. For interpretability, there isn't really anything talking about interpretability from the strategic point of view of why we want it. I guess that same post I recommended before, Techniques for Optimizing Worst Case Performance, talks about it a little bit.
01:01:38
Speaker
For actual interpretability techniques, I'd recommend the Distill articles The Building Blocks of Interpretability and Feature Visualization. But these are more about particular techniques for interpretability as opposed to why we want interpretability. And on ambitious value learning.
01:01:55
Speaker
The first chapter of my sequence on value learning talks exclusively about ambitious value learning, so that's one thing I'd recommend. But also, Stuart Armstrong has so many posts. I think there's one that's about resolving human values adequately and something else, something like that.
01:02:14
Speaker
That one might be one worth checking out. It's very technical though, lots of math. He's also written a bunch of posts that convey the intuitions behind the ideas. They're all split into a bunch of very short posts, so I can't really recommend any one particular one. You could go to the alignment newsletter database and just search Stuart Armstrong and click on all of those posts and read them. I think that was everything.
01:02:41
Speaker
That's a wonderful list. So we'll go ahead and link those all in the article which goes along with this podcast, so that'll all be there, organized in nice, neat lists for people.
01:02:53
Speaker
This has all probably been fairly overwhelming in terms of the number of approaches and how they differ and how one is to adjudicate the merits of all of them. If someone is just sort of entering the space of AI alignment or is beginning to be interested in these different technical approaches, do you have any recommendations? Reading a lot rather than trying to do actual research.
01:03:16
Speaker
This was my strategy. I started back in September of 2017. And I think for the first six months or so, I was reading about 20 hours a week in addition to doing research, which is why it was only 20 hours a week. It wasn't the full time thing I was doing. I think that was very helpful for actually forming a picture of what everyone was doing.
01:03:38
Speaker
Now, it's plausible that you don't want to actually learn about what everyone is doing, and you're okay with saying, I'm fairly confident that this particular problem is an important piece of the overall problem, and we need to solve it. I think it's very easy to get that wrong, so I'm a little wary of recommending it, but it's a reasonable strategy to just say: okay, we probably will need to solve this problem, and even if we don't, the intuitions that we get from trying to solve it will be useful.
01:04:05
Speaker
Focusing on that particular problem, reading all of the literature around it, and attacking that problem in particular lets you start doing things faster while still doing things that are probably going to be useful. So that's another strategy that people could follow, but I don't think it's very good for orienting yourself in the field of AI safety.
01:04:25
Speaker
So you think that there's a high value in people taking this time to read, to understand all the papers and the approaches before trying to participate in particular research questions or methodologies. Given how open this question is, you know, all the approaches make different assumptions and take for granted different axioms, which all come together to create a wide variety of things which can both complement each other and have varying degrees of
01:04:52
Speaker
efficacy in the real world when AI systems start to become more developed and advanced? Yep, that seems right to me. Part of the reason I'm recommending this is because it seems that no one does this. I think on the margin, I want more people who do this. In a world where 20% of the people were doing this and the other 80% were just taking particular pieces of the problem and working on those, that might be the right balance, somewhere around there. I don't know.
01:05:19
Speaker
It depends on how you count who is actually in the field, but somewhere between 1 and 10% of people are doing this. Closer to the 1.
01:05:27
Speaker
Which is quite interesting, I think, given that it seems like AI alignment should be in a stage of maximum exploration, since the conceptual mapping of the territory is still very young. I mean, we're essentially seeing the birth and initial development of an entirely new field and a specific application of thinking.
01:05:51
Speaker
And there are many more mistakes to be made, concepts to be clarified, and layers to be built, so it seems like we should be maximizing our attention on exploring the general space, trying to develop models of the efficacy of different approaches, philosophies, and views of alignment.
01:06:10
Speaker
Yeah, I agree with you. This should not be surprising given that I am one of the people doing this, or trying to do this. Probably the better critique will come from people who are not doing this and can tell both of us why we're wrong about this. We've covered a lot here in terms of the specific approaches, your thoughts on the approaches, where we can find resources on the approaches, and why studying the approaches matters. Are there any parts of the approaches that you feel deserve more attention in terms of these different sections that we've covered?
01:06:39
Speaker
I think I want more work on looking at the intersection between things that are supposed to be complementary. How interpretability can help you have AI systems that have the right goals, for example, would be a cool thing to do. Or what you need to do in order to get verification, which is a sub-part of robustness, to give you interesting guarantees.
01:07:03
Speaker
on AI systems that we actually care about. Most of the work on verification right now is like, there's this nice specification that we have for adversarial examples in particular. Is there an input that is within some distance from a training data point such that it gets classified differently from that training data point?
01:07:23
Speaker
And this is a nice formal specification, and most of the work in verification takes the specification as given and then figures out more and more computationally efficient ways to actually verify that property, basically. That does seem like a thing that needs to happen, but the much more urgent thing in my mind is how do we come up with these specifications in the first place?
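To make that specification concrete, here is a minimal sketch (our own illustration with made-up function names, not any particular verifier's API) of the property being checked; real verification tools aim to prove it over the entire perturbation ball rather than searching for counterexamples by sampling.

```python
# The adversarial-robustness property described above: for a training point x,
# does every input within an L-infinity distance eps of x get the same label?
# Sampling, as below, can only find counterexamples; formal verification tools
# try to prove the property holds over the whole eps-ball.

import numpy as np

def find_spec_violation(classifier, x, eps, n_samples=1000, seed=0):
    """Search for an input near x (within the eps-ball) that changes the predicted label."""
    rng = np.random.default_rng(seed)
    original_label = classifier(x)
    for _ in range(n_samples):
        perturbed = x + rng.uniform(-eps, eps, size=x.shape)  # stays within L-inf distance eps
        if classifier(perturbed) != original_label:
            return perturbed  # counterexample: the specification is violated
    return None  # none found, which is evidence (not proof) that the spec holds
```

Here `classifier` is assumed to be any function mapping an input array to a label; the point is only to show what the formal property quantifies over.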
01:07:42
Speaker
If I want to verify that my AI system is corrigible, or I want to verify that it's not going to do anything catastrophic, or that it is not going to disable my value learning system, or something like that, how do I specify this at all in any way that lets me apply something like a verification technique, even given infinite computing power?
01:08:04
Speaker
It's not clear to me how you would do something like that, and I would love to see people do more research on that. That particular thing is my current reason for not being very optimistic about verification in particular, but I don't think anyone has given it a try, so it's possible that there's actually just some approach that could work that we just haven't found yet because no one's really been trying.
01:08:27
Speaker
I think all of the work on limited AGI is talking about, okay, does this actually eliminate all of the catastrophic behavior? Which, yep, that's definitely an important thing, but I wish that people would also do research on, given that we've put this penalty or this limit on the AGI system, what things is it still capable of doing?
01:08:48
Speaker
Have we just made it impossible for it to do anything of interest whatsoever? Or can it actually still do pretty powerful things, even though we've placed these limits on it? That's the main thing I want to see from there. For the "let's have AI systems that do what we want" category, probably the biggest thing I want to see there, and I've been trying to do some of this myself, is some conceptual thinking about how this leads to good outcomes in the long term.
01:09:16
Speaker
So far, we've not been dealing with the fact that the human doesn't actually know, doesn't actually have, a nice consistent utility function that they know and that can be optimized. So once you relax that assumption, what the hell do you do? Then there are also a bunch of other problems that would benefit from more conceptual clarification. Maybe I don't need to go into all of them right now.
01:09:37
Speaker
Yeah. And just to inject something here that I think we haven't touched on and that you might have some words about in terms of approaches: we discussed some general views of advanced artificial intelligence, like a services-based conception. Though I don't believe that we have talked about aligning AI systems that simply function as oracles, or having a concert of oracles.
01:10:02
Speaker
You can get rid of the services thing and the agency thing if the AI just tells you what is true or answers your questions in a way that is value aligned. Yeah, I mostly want to punt on that question because I have not actually read all of the papers. I might have read a grand total of one paper on oracles, plus Superintelligence, which talks about oracles. So I feel like I know so little about the state of the art on oracles that I should not actually say anything about them.
01:10:32
Speaker
Sure. So then just as a broad point for our audience: in terms of conceptualizing these different approaches to AI alignment, it's crucial to consider the kind of AI system that you're thinking about, and the kinds of features and properties that it has. And oracles are another version here that one can play with in one's AI alignment thinking. I think the canonical paper there is something like Good and Safe Uses of AI Oracles, but I have not actually read it.
01:10:59
Speaker
There is a list of things I want to read. It is on that list. But that list also has, I think, something like 300 papers on it. And apparently, I have not gotten to oracles yet. And so for the sake of this whole podcast being as comprehensive as possible, are there any conceptions of AI, for example, that we have omitted so far, adding on to this agential view, the CAIS view of it actually just being a lot of distributed services, or an oracle view?
01:11:26
Speaker
There's also the tool AI view. This is different from the services view, but it's somewhat akin to the view you were talking about at the beginning of this podcast, where you've got AI systems that have a narrowly defined input-output space, they've got a particular, limited thing that they do, and they just sort of take in their inputs, do some computation, and spit out their outputs; that's it, that's all that they do. You can't really model them as having some long-term utility function that they're optimizing; they're just implementing a particular input-output relation, and that's all they're trying to do.
01:11:55
Speaker
Even saying something like "they are trying to do X" is basically using a bad model for them. I think the main argument against expecting tool AI systems is that they are probably not going to be as useful as services or agential AI, because tool AI systems would have to be programmed in a way where we understood what they were doing and why they were doing it.
01:12:18
Speaker
Whereas agential systems or services would be able to consider new possible ways of achieving goals, plans that we haven't thought about, and so they could get superhuman behavior by considering things that we wouldn't consider. Whereas tool AIs like Google Maps are superhuman in some sense, but only because they have a compute advantage over us.
01:12:42
Speaker
If we were given all of the data and all of the time that Google Maps has, we could implement a similar sort of algorithm ourselves and compute the optimal route. There seems to be this duality that is constantly being formed in our conception of AI alignment, where the AI system is this tangible external object which stands in some relationship to the human and is trying to help the human achieve certain things.
01:13:12
Speaker
Are there conceptions of value alignment which, however the procedure or methodology is done, change or challenge the relationship between the AI system and the human system, where it challenges what it means to be the AI or what it means to be the human, where there's potentially some sort of merging or disruption of this dualistic picture of the relationship? I don't really know of any. It sounds like you're talking about things like brain-computer interfaces and stuff like that.
01:13:40
Speaker
I don't really know of any intersection between AI safety research and that.
01:13:45
Speaker
I guess this did remind me that I want to make the point that all of this is about the relatively narrow, I claim, problem of aligning an AI system with a single human. There is also the problem of, okay, what if there are multiple humans? What if there are multiple AI systems? What if you've got a bunch of different groups of people and each group is value aligned within themselves and they build an AI that's value aligned with them, but lots of different groups do this. Now what happens?
01:14:12
Speaker
Solving the problem that I've been talking about does not mean that you have a good outcome in the long-term future. It is merely one piece of a larger overall picture. I don't think any of that larger overall picture removes the dualistic thing that you were talking about, but the dualistic part reminded me of the fact that I am talking about a narrow problem and not the full problem in some sense.
01:14:35
Speaker
Right, and so just to offer some conceptual clarification here, again, the first problem is: how do I get an AI system to do what I want it to do when the world is just me and that AI system? Me, that AI system, and the rest of humanity, but the rest of humanity is treated as part of the environment.
01:14:51
Speaker
So you're not modeling other AI systems, or how systems trained on mutually incompatible preferences would interact in the world, or something like that. Exactly. So the full AI alignment problem is... It's funny because it's just the question of civilization, I guess. How do you get the whole world and all the AI systems to make a beautiful world instead of a bad world?
01:15:14
Speaker
Yeah, I'm not sure if you saw my lightning talk at Beneficial AGI, but I talked a bit about this. I think I called that top-level problem "make AI-related future stuff go well." Very, very, very concrete, obviously. It makes sense. People know what you're talking about. I probably wouldn't call that broad problem the AI alignment problem. I kind of want to reserve the term alignment for the narrower problem.
01:15:38
Speaker
We could maybe call it the AI safety problem or the AI future problem. I don't know. The beneficial AI problem, actually. I think that's what I used last time. That's a nice way to put it. I think that conceptually leaves us at a very good place for this first section. Yeah, seems pretty good to me.
01:16:03
Speaker
If you found this podcast interesting or useful, please make sure to check back for part two in a couple of weeks, where Rohin and I go into more detail about the strengths and weaknesses of specific approaches. We'll be back again soon with another episode of the AI Alignment Podcast.