
AIAP: Inverse Reinforcement Learning and Inferring Human Preferences with Dylan Hadfield-Menell

Future of Life Institute Podcast
Inverse Reinforcement Learning and Inferring Human Preferences is the first podcast in the new AI Alignment series, hosted by Lucas Perry. This series will be covering and exploring the AI alignment problem across a large variety of domains, reflecting the fundamentally interdisciplinary nature of AI alignment. Broadly, we will be having discussions with technical and non-technical researchers across a variety of areas, such as machine learning, AI safety, governance, coordination, ethics, philosophy, and psychology as they pertain to the project of creating beneficial AI. If this sounds interesting to you, we hope that you will join in the conversations by following or subscribing to us on YouTube, SoundCloud, or your preferred podcast site/application. In this podcast, Lucas spoke with Dylan Hadfield-Menell, a fifth-year Ph.D. student at UC Berkeley. Dylan's research focuses on the value alignment problem in artificial intelligence. He is ultimately concerned with designing algorithms that can learn about and pursue the intended goal of their users, designers, and society in general. His recent work primarily focuses on algorithms for human-robot interaction with unknown preferences and reliability engineering for learning systems. Topics discussed in this episode include: -Inverse reinforcement learning -Goodhart's Law and its relation to value alignment -Corrigibility and obedience in AI systems -IRL and the evolution of human values -Ethics and moral psychology in AI alignment -Human preference aggregation -The future of IRL
Transcript

Introduction to AI Safety Series

00:00:06
Speaker
Welcome back to the Future of Life Institute podcast. I'm Lucas Perry, and I work on AI risk and nuclear weapons risk related projects at FLI. Today, we are kicking off a new series where we will be having conversations with technical and non-technical researchers focused on AI safety and the value alignment problem. Broadly, we will focus on the interdisciplinary nature of the project of eventually creating value aligned AI, where what value alignment exactly entails is an open question that is part of the conversation.

Scope of AI Issues: Social to Technical

00:00:35
Speaker
In general, the series covers the social, political, ethical, and technical issues and questions surrounding the creation of beneficial AI. We'll be speaking with experts from a large variety of domains and hope that you'll join in the conversations. If this seems interesting to you, make sure to follow us on SoundCloud or subscribe to us on YouTube for more similar content.

Meet Dylan Hadfield-Menell

00:00:56
Speaker
Today, we'll be speaking with Dylan Hadfield-Menell.
00:00:59
Speaker
Dylan is a fifth year PhD student at UC Berkeley, advised by Anca Dragan, Pieter Abbeel, and Stuart Russell. His research focuses on the value alignment problem in artificial intelligence. And with that, I give you Dylan. Hey, Dylan, thanks so much for coming on the podcast. Thanks for having me. It's a pleasure to be here.

From Robotics to AI Safety: Dylan's Journey

00:01:19
Speaker
So I guess we can start off, you can tell me a little bit more about your work over the past years. Like, how have your interests and projects evolved? And how has that led you to where you are today?
00:01:30
Speaker
Well, I started off towards the end of undergrad and beginning of my PhD working in robotics and hierarchical robotics. Towards the end of my first year, my advisor came back from a sabbatical and started talking about the value alignment problem and sort of existential risk issues related to AI. At that point, I started thinking about questions about misaligned objectives, value alignment, and generally how we get
00:02:00
Speaker
the correct preferences and objectives into AI systems. And about a year after that, I sort of decided to make this my central research focus. And then the past three years, it's been most of what I've been thinking about.
00:02:16
Speaker
Cool. So it seems like you sort of had like an original path where you were working on practical robotics, and then you sort of shifted more so into value alignment, AI safety efforts. Yeah, that's right. Before we go ahead and jump into your specific work, it'd be great if we could go ahead and define what inverse reinforcement learning exactly is.

Inverse Reinforcement Learning Explained

00:02:38
Speaker
So for me, it seems that inverse reinforcement learning, at least from the view, I guess, of technical AI safety researchers, is that it's viewed as an empirical means of conquering descriptive ethics, whereby we're able to give a clear descriptive account of what any given agent's preferences and values are at any given time. Is that a fair characterization? That's sort of one way to characterize it. Another way to think about it, which
00:03:07
Speaker
which is a useful perspective for me sometimes is to think of inverse reinforcement learning as a way of doing behavior modeling that has certain types of generalization properties. So anytime you're learning in any machine learning context, there's always going to be a bias that controls how you generalize to new information. And
00:03:36
Speaker
Inverse reinforcement learning and preference learning, to some extent, is a bias in behavior modeling, which is to say that we should model this agent as accomplishing a goal, as satisfying a set of preferences. And that leads to certain types of generalization properties in new environments. So for me, inverse reinforcement learning is building in this agent-based assumption into behavior modeling.
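To make that "agent-based assumption" concrete, here is a minimal, hypothetical sketch of the inference pattern inverse reinforcement learning relies on (my own toy construction with made-up names and numbers, not anything from the episode): given observed choices, score candidate reward functions by how well a noisily rational agent optimizing them would reproduce those choices, and keep the best-scoring one.

```python
# A minimal, illustrative sketch of inverse reinforcement learning: infer which
# candidate reward function best explains an observed agent's choices, assuming
# the agent is noisily rational (Boltzmann-rational) with respect to its reward.
import numpy as np

# Three one-step options the demonstrator can pick among, each described by
# two features, e.g. (speed, safety). These numbers are invented for illustration.
features = np.array([
    [1.0, 0.0],   # option A: fast, unsafe
    [0.5, 0.5],   # option B: balanced
    [0.0, 1.0],   # option C: slow, safe
])

# Candidate reward functions = weight vectors over the features.
candidate_weights = [np.array(w) for w in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]]

def action_probs(weights, beta=5.0):
    """Boltzmann-rational choice distribution under a given reward weighting."""
    values = features @ weights
    exp_v = np.exp(beta * (values - values.max()))
    return exp_v / exp_v.sum()

# Observed demonstrations: the agent mostly picked the safe option C (index 2).
demos = [2, 2, 1, 2, 2]

# Score each candidate reward by the likelihood of the demonstrations and keep
# the most likely one -- the "this agent is optimizing something" bias at work.
log_likelihoods = [
    sum(np.log(action_probs(w)[a]) for a in demos) for w in candidate_weights
]
best = candidate_weights[int(np.argmax(log_likelihoods))]
print("Inferred reward weights (speed, safety):", best)
```

Real IRL systems do this over sequential decision problems rather than one-shot choices, but the generalization bias is the same: behavior is explained as approximate optimization of some reward, and that inferred reward is what transfers to new environments.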
00:04:06
Speaker
So given that, I'd like to dive more into like the sorts of specific work that you're working on and go into like some summaries of like your findings and the sorts of research that you've been up to. So given this interest that you've been developing in value alignment and like human preference aggregation and AI systems learning human preferences, what are the main approaches that you've been working on?

Theoretical Frameworks for Value Alignment

00:04:30
Speaker
So I think the first thing that
00:04:34
Speaker
really Stuart Russell and I started thinking about was trying to understand sort of theoretically, what is a reasonable goal to shoot for? And what does it mean to do a good job of value alignment? To us, it feels like issues with misspecified objectives, at least in some ways, are a bug in the theory.
00:05:03
Speaker
All of the math around artificial intelligence, for example, Markov decision processes, which is the kind of central mathematical model we use for decision making over time, starts with an exogenously defined objective or reward function. And we think that mathematically that was a fine thing to do in order to make progress. But it's an assumption that really has put blinders on the field
00:05:33
Speaker
about the importance of getting the right objective down. And so I think the first thing that we sought to try to do was to understand what is a system or a setup for AI that does the right thing in theory, at least? What's something that if we were able to implement this that we think could actually work in the real world with people?
00:06:00
Speaker
It was that kind of thinking that led us to propose cooperative inverse reinforcement learning, which was our attempt to formalize the interaction whereby you communicate an objective to the system.

Cooperative Inverse Reinforcement Learning

00:06:16
Speaker
And the main thing that we focused on was including within the theory a representation of the fact that the true objective is unknown and unobserved.
00:06:27
Speaker
and that it needs to be arrived at through observations from a person. And then we've been trying to investigate the theoretical implications of this modeling shift. So in the initial paper that we did, which was titled Cooperative Inverse Reinforcement Learning, what we looked at is how this formulation is actually different from a standard environment model in AI.
00:06:53
Speaker
And in particular, the way that it's different is that there's strategic interaction on the behalf of the person. So the way that you observe what you're supposed to be doing is intermediated by a person who may be trying to actually teach or trying to communicate appropriately. And what we showed is that modeling this communication, this communicative component can actually be hugely important and lead to much faster learning behavior.
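A hypothetical toy illustration of that point (my own construction with invented numbers, not the paper's example): when the human is modeled as a teacher who picks demonstrations to be informative, the robot's belief about the hidden objective can move much faster than when demonstrations are treated as merely optimal behavior.

```python
# Illustrative sketch: a literally-optimal demonstration can be ambiguous, while
# a pedagogic demonstration is chosen to be informative about the hidden reward.
import numpy as np

# Per-action rewards under two hypotheses about what the human truly wants.
# Action 0 is best under both hypotheses; actions 1 and 2 disambiguate them.
reward = np.array([
    #   a0    a1   a2
    [10.0, 9.0, 0.0],   # hypothesis theta_0
    [10.0, 0.0, 9.0],   # hypothesis theta_1
])
prior = np.array([0.5, 0.5])
beta = 1.0  # rationality of the (simulated) demonstrator

def literal_demo_probs(theta_idx):
    """A merely-optimal (Boltzmann) human acts on reward alone."""
    v = reward[theta_idx]
    p = np.exp(beta * (v - v.max()))
    return p / p.sum()

def posterior(action):
    """Robot's Bayesian update, assuming the literal human model."""
    lik = np.array([literal_demo_probs(t)[action] for t in range(2)])
    post = prior * lik
    return post / post.sum()

true_theta = 0
# 1) Literal human: most likely picks the shared-optimum action a0,
#    which leaves the robot's belief almost unchanged.
print("posterior after literal demo a0:", posterior(0))
# 2) Pedagogic human: picks the action whose induced posterior puts the most
#    weight on the true hypothesis -- here the slightly suboptimal action a1.
pedagogic_action = max(range(3), key=lambda a: posterior(a)[true_theta])
print("pedagogic demo:", pedagogic_action,
      "posterior:", posterior(pedagogic_action))
```

In full cooperative IRL the teaching behavior emerges as part of an equilibrium in which the robot also knows the human is teaching; this sketch only shows the one-step intuition that an informative demonstration can convey much more than a literally optimal one.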
00:07:23
Speaker
In our subsequent work, what we've looked at is taking this formal model in theory and trying to apply it to different situations. And there are kind of two really important pieces of work that I liked here that we did. One was to take that theory and use it to explicitly analyze a simple model of an existential risk setting. This was a paper titled The Off-Switch Game that we published at IJCAI last summer.
00:07:54
Speaker
What it was was working through a formal

Ensuring Human Control in AI Systems

00:07:56
Speaker
model of a corrigibility problem within a CIRL framework. And it shows the utility of constructing this type of game in the sense that we get some interesting predictions and results. So the first one we get is that there are some nice, simple, necessary conditions
00:08:21
Speaker
for the system to want to let the person turn it off, which is that the AI system needs to have uncertainty about its true objective, which is to say that it needs to have within its belief the possibility it might be wrong. Then all it needs to do is believe that the person it's interacting with is a perfectly rational individual. If that's true,
00:08:49
Speaker
you get a guarantee that this robot always lets the person switch it off. Now, that's good because in my mind, it's an example of a place where at least in theory, it solves the problem. This gives us a way that theoretically, we could build corrigible systems. Now, it's still making a very, very strong assumption, which is that it's OK to model the human as being
00:09:18
Speaker
optimal or rational. And I think if you look at real people, that's just not a fair assumption to make for a whole host of reasons. And so the next thing we did in that paper is we looked at this model. And what we realized is that adding in a small amount of irrationality breaks this requirement. It means that some things might actually go wrong.
00:09:48
Speaker
And so the final thing we did in the paper was to look at the consequences of either overestimating or underestimating human rationality. And the argument that we made is that there's a trade-off: assuming that the person is more rational lets you get more information from their behavior, and thus learn more and in principle help them more. But if you assume that they're too rational, then this actually can lead to quite bad behavior.
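A small numeric sketch of that logic, with invented numbers (not from the paper): compare the robot's expected value for acting unilaterally versus deferring to a human who can switch it off, as a function of how error-prone the human is.

```python
# Illustrative off-switch-game arithmetic: a robot with uncertainty about the
# utility U of its plan compares acting unilaterally with deferring to a human
# who may switch it off. With a rational human, deferring is never worse; with
# an error-prone human, the incentive to defer can disappear.
import numpy as np

rng = np.random.default_rng(0)
# Robot's belief over the true utility U of its intended action
# (mean slightly positive, but it might be badly wrong).
U_samples = rng.normal(loc=0.5, scale=2.0, size=100_000)

def value_act_now():
    """Bypass the human and execute the plan."""
    return U_samples.mean()

def value_defer(error_rate):
    """Wait for the human. A rational human approves iff U > 0 (payoff U),
    otherwise switches the robot off (payoff 0). With probability
    `error_rate` the human makes the wrong call."""
    correct = np.where(U_samples > 0, U_samples, 0.0)   # right decision
    wrong = np.where(U_samples > 0, 0.0, U_samples)     # wrong decision
    return ((1 - error_rate) * correct + error_rate * wrong).mean()

print("act now:             %.3f" % value_act_now())
print("defer, rational human: %.3f" % value_defer(0.0))   # >= act now
print("defer, noisy human:    %.3f" % value_defer(0.4))   # incentive can flip
```

With a perfectly rational human the deferral value is E[max(U, 0)], which is at least max(E[U], 0), and that is where the guarantee comes from; enough noise in the human's decision erodes that margin, and a robot confident that E[U] is positive then prefers to act unilaterally.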
00:10:16
Speaker
And so there's a sweet spot that you sort of want to aim for, which is to maybe try to underestimate how rational people are, but you obviously don't want to get it totally wrong. We followed up on that idea in a paper with Smitha Milli as the first author that was titled, Should Robots Be Obedient? And that tried to get a little bit more at this trade-off between maintaining control over a system
00:10:46
Speaker
and the amount of value that it can generate for you.

Robot Obedience vs. Value Maximization

00:10:50
Speaker
We sort of looked at the implication that as robot systems interact with people over time, you expect them to learn more about what people want. And if you get very confident about what someone wants and you think they might be irrational, the math in the off-switch paper predicts that you should try to take control away from them.
00:11:12
Speaker
So this means that if your system is learning over time, you expect that even if it is initially open to human control and oversight, it may lose that incentive over time. In fact, you can predict that it should lose that incentive over time. So in Should Robots Be Obedient?, we modeled that property. And we looked at some consequences of it. We do find that you get a basic confirmation of this
00:11:40
Speaker
hypothesis, which is that systems that maintain human control and oversight have less value that they can achieve in theory. But we also looked at what happens when you have the wrong model. So if the AI system has a prior that the human cares about a small number of things in the world, let's say,
00:12:07
Speaker
then it statistically gets overconfident in its estimates of what people care about and disobeys the person more often than it should. So arguably, when we say we want to be able to turn the system off, it's less a statement about what we want to do in theory or the property of the optimal robot behavior we want.
00:12:36
Speaker
And more a reflection of the idea that we sort of believe that under almost any realistic situation, we're probably not going to be able to fully explain all of the relevant variables that we care about. So if you're giving your robot an objective defined over a subset of things you care about, you

The Corrigibility Conundrum

00:13:00
Speaker
should actually be very focused on having it
00:13:04
Speaker
listen to you more so than just optimizing for its estimates of value. And I think that provides actually a pretty strong theoretical argument for why corrigibility is a desirable property in systems. Even though, at least at face value, it should decrease the amount of utility those systems can generate for people.

Inverse Reward Design for Robust AI

00:13:28
Speaker
The final piece of work that I think I would talk about here is our NIPS paper from December, which was titled Inverse Reward Design. And that was sort of taking cooperative inverse reinforcement learning and pushing it in the other direction. So instead of using it to theoretically analyze very, very powerful systems, we can also use it to try to build tools that are
00:13:56
Speaker
more robust to mistakes that designers may make and start to build in initial notions of value alignment and value alignment strategies into the current mechanisms we use to program AI systems. So what that work looked at was understanding the uncertainty that's inherent in an objective specification. So in the initial cooperative inverse reinforcement learning paper and the off switch game,
00:14:26
Speaker
what we said is that AI systems should be uncertain about their objective and they should be designed in a way that is sensitive to that uncertainty. This paper was about trying to understand what is a useful way to be uncertain about the objective. And the main idea behind it was that we should be thinking about the environments the system designer had in mind. So we use an example of a 2D robot navigating in the world
00:14:56
Speaker
And the system designer is thinking about this robot navigating where there's three types of terrain. So there's grass, there's dirt, and there's gold. And you can give your robot an objective, a utility function defined over being in those different types of terrain that incentivizes it to go and get the gold and stay on the dirt where possible, but to take shortcuts across the grass when it's high value. Now, when that robot goes out into the world,
00:15:23
Speaker
there are going to be new types of terrain and types of terrain the designer didn't anticipate. And what we did in this paper was to build an uncertainty model that allows the robot to determine when it should be uncertain about the quality of its reward function. So how can we figure out when the reward function that a system designer built into an AI, how can we determine when that objective
00:15:52
Speaker
is ill-adapted to the current situation. And you can think of this as a way of trying to build in some mitigation to Goodhart's Law. Would you like to take a second to unpack what Goodhart's Law is? Sure. So Goodhart's Law is an old idea in social science that actually goes back to before Goodhart. But it basically said, well, I would say that in economics, there's a general idea of the principal-agent problem,
00:16:23
Speaker
which dates back to the 1970s, as I understand it, that basically looks at the problem of specifying incentives for humans. So how should you create contracts? How do you create incentives so that another person, say an employee, helps earn you value? And Goodhart's law is a very nice way of summarizing a lot of those results, which is to say that
00:16:50
Speaker
Once a metric becomes an objective, it ceases to be a good metric. So you can have properties of the world which correlate well with what you want, but optimizing for them actually leads to something quite, quite different than what you're looking for.
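A hypothetical toy simulation of that effect, in the spirit of the test-score example that comes up next (the model and numbers are mine, not from the episode): a metric that tracks the thing you care about in the unoptimized regime stops tracking it once it becomes the optimization target.

```python
# Illustrative Goodhart's-law simulation: a proxy metric correlates with the
# true target until you optimize the proxy directly.
import numpy as np

rng = np.random.default_rng(1)

def understanding_and_score(study, cram):
    """One unit of effort split between genuine study and test-specific
    cramming. Understanding comes only from study; the test score responds
    to both, and cramming is the more efficient way to raise it."""
    understanding = study
    score = study + 2.0 * cram
    return understanding, score

# Unoptimized population: almost nobody crams, so the score is a good
# *measure* -- it correlates strongly with understanding.
study = rng.uniform(0, 1, size=1000)
cram = rng.uniform(0, 0.05, size=1000)
u, s = understanding_and_score(study, cram)
print("correlation(score, understanding) before optimizing:",
      round(float(np.corrcoef(s, u)[0, 1]), 3))

# Now make the score the *objective*: with a fixed effort budget, the
# score-maximizing strategy is all cramming, and understanding collapses.
best_u, best_s = understanding_and_score(study=0.0, cram=1.0)
print("score-maximizing strategy -> score:", best_s, "understanding:", best_u)
```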
00:17:06
Speaker
Right. If you are optimizing for test scores, then you're not actually going to end up optimizing for intelligence, which is what you wanted in the first place? Exactly. Even though test scores, when you weren't optimizing for them, were actually a perfectly good measure of intelligence. I mean, not perfectly good, but were an informative measure of intelligence. I think that Goodhart's law arguably is a pretty bleak perspective.
00:17:33
Speaker
Well, if you take it seriously and you think that we're going to build very powerful systems that are going to be programmed directly through an objective in this manner, Goodhart's law should be pretty problematic because any objective that you can imagine programming directly into your system is going to be something correlated with what you really want rather than what you really want.
00:17:56
Speaker
I think you should expect that that will likely be the case. Right. Is it just simply too hard or too unlikely that we're able to sufficiently specify exactly what we want, so that we'll just end up using some other metrics that, if you optimize too hard for them, end up messing with a bunch of other things that we care about? Yeah. I think there are some real questions about what
00:18:21
Speaker
What does it even mean? Well, what are we even trying to accomplish? What should we try to program into systems? I think that is, you know, philosophers have been trying to figure out those types of questions for ages. But for me, as someone who takes a sort of more empirical slant on these things, you know, I think about the fact that the objectives that we see within our individual lives are so heavily shaped by our environments.
00:18:49
Speaker
and which types of signals we respond to and adapt to is heavily adapted itself to the types of environments we find ourselves in. And we just have so many examples of objectives not being the correct thing. I mean, effectively all you can have is correlations. The fact that wireheading just is possible
00:19:18
Speaker
is maybe some of the strongest evidence for Goodhart's law being really a fundamental property of learning systems and optimizing systems in the real world.
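Tying this back to the inverse reward design idea above, here is a much-simplified, hypothetical sketch (my own construction, not the paper's algorithm) of the mitigation being described: treat the designer's proxy reward as informative only for the terrain types the designer anticipated, and plan pessimistically over the terrain types they did not.

```python
# Simplified sketch of risk-averse planning under reward uncertainty, in the
# spirit of inverse reward design: be pessimistic about features the designer
# never saw when they wrote down the proxy reward.
import numpy as np

# Proxy reward the designer wrote down, over the terrain types they anticipated.
proxy = {"dirt": 0.0, "grass": -1.0, "gold": 10.0}

# At deployment the robot encounters a new terrain type, "lava", whose true
# weight the proxy says nothing about. Represent that ignorance as a set of
# plausible true reward functions that agree with the proxy on known terrain
# but disagree wildly on the new one.
plausible_lava_weights = np.linspace(-20.0, 1.0, 9)

# Candidate paths, described by how many cells of each terrain they cross.
paths = {
    "long detour on dirt": {"dirt": 8, "grass": 0, "gold": 1, "lava": 0},
    "shortcut over grass": {"dirt": 3, "grass": 3, "gold": 1, "lava": 0},
    "shortcut over lava":  {"dirt": 2, "grass": 0, "gold": 1, "lava": 2},
}

def worst_case_value(counts):
    """Evaluate a path under every plausible true reward and keep the minimum."""
    values = []
    for lava_w in plausible_lava_weights:
        weights = dict(proxy, lava=lava_w)
        values.append(sum(counts[t] * weights[t] for t in counts))
    return min(values)

best = max(paths, key=lambda name: worst_case_value(paths[name]))
for name, counts in paths.items():
    print(f"{name:22s} worst-case value = {worst_case_value(counts):6.1f}")
print("risk-averse choice:", best)
```

The actual paper infers a posterior over true rewards given the proxy and the training environments and then plans risk-aversely with respect to it; the hard-coded set of "plausible" lava weights here is just a stand-in for that posterior.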
00:19:30
Speaker
There are certain sorts of agential characteristics and properties which we would sort of like to have in our AI systems, like them being agential. Yeah, like, well, so like, corrigibility is like a characteristic which you're doing research on and trying to like understand better. And same with obedience.
00:19:49
Speaker
And it seems like there's a trade-off here where if a system is too corrigible or it's too obedient, then you lose its ability to really maximize different sorts of objective functions, correct? Yes, exactly.
00:20:04
Speaker
I think identifying that trade-off is one of the things I'm sort of most proud of about some of the work we've done so far. So given like AI safety and like really big risks that can come about from AI, in the short to medium and long term, like before we really have AI safety figured out, is it really possible for systems to be like too obedient or too corrigible or like too docile? And how do we sort of navigate this space and find sweet spots?
00:20:34
Speaker
Um, so I think it's definitely possible for systems to be too corrigible or too obedient.

Aligning AI in Multi-agent Settings

00:20:40
Speaker
It's just that the failure mode for that doesn't seem that bad. Right. If you think about this, it's kind of like Clippy, right? Like Clippy was sort of... Sure. Yeah. So Clippy was an example of an assistant that Microsoft created in the nineties.
00:21:02
Speaker
And it was this little paperclip that would show up in Microsoft Word. And it would, well, it liked to suggest that you were trying to write a letter a lot and would ask for different ways in which it could help. Now, on the one hand, that system was very corrigible and obedient in the sense that it would ask you whether or not
00:21:24
Speaker
you wanted help all the time. And if you said no, it would always go away. But it was super annoying, because it would always ask you if you wanted help, right? The false positive rate was just far too high, to the point where the system became really a joke in computer science and AI circles of what you don't want to be doing. And
00:21:50
Speaker
So I think it's, you know, systems can be sort of too obedient or too sensitive to human intervention and oversight in the sense that too much of that just reduces the value of the system.
00:22:08
Speaker
Right, for sure. But on one hand, when we're talking about existential risks or even a paperclip maximizer, then it would seem like, like you said, the failure mode of just being too annoying and checking in with us too much seems like not such a bad thing given existential risk territory. I think if you're thinking about it in those terms, yes. But I think if you're thinking about it
00:22:37
Speaker
from the standpoint of I want to sell a paperclip maximizer to someone else, then it becomes a little less clear. I think especially when the risks of paperclip maximizers are much harder to measure. I'm not saying that it's the right decision, sort of from a global altruistic standpoint.
00:23:02
Speaker
to be making that trade off. But I think it's also true that just, if we think about the requirements of market dynamics, it is true that AI systems can be too corrigible for the market. And that that is a huge failure mode that AI systems run into, and it's one we should expect the producers of AI systems to be responsive to.
00:23:27
Speaker
Right. So given all these different... Is there anything else you wanted to touch on there? Well, I had another example of systems that are too corrigible. Sure. Which is, do you remember Microsoft's Tay? No, I do not. So this is a chatbot that Microsoft released and they trained it based off of... So it was a tweetbot and they trained it based on things that were tweeted at it.
00:23:57
Speaker
I forget if it was doing nearest neighbors lookup or if it was just doing a sort of neural net that ended up overfitting and memorizing parts of the training set. But at some point 4chan realized that the AI system, that Tay was very suggestible. So they basically created an army to radicalize Tay and they succeeded. Yeah, I remember this.
00:24:26
Speaker
And I think you could also think of that as being the other axis of too corrigible or too responsive to human input. Right. So the first axis I was talking about is sort of the failures of being too corrigible from an economic standpoint. Right. But there's also the failures of being too corrigible in a sort of multi-agent mechanism design kind of setting.
00:24:55
Speaker
where I believe that those types of properties in a system also open them up to more misuse. If we think of AI, so cooperative inverse reinforcement learning and the models we've been talking about so far exist in what I would call the one robot, one human model of the world. And generally you could think of extensions of this
00:25:24
Speaker
with n humans and m robots. And the variants of what you would have there, I think, lead to different theoretical implications. But if we think of just two humans, so n equals 2 and one robot, m equals 1, we'll suppose that one of the humans is the system designer and another one is the user. There is this sort of trade-off between how much control the system designer
00:25:51
Speaker
has over the future behavior of the system and how responsive and corrigible it is to the user in particular. And sort of trading off between those two, I think is a really interesting ethical question that comes up when you start to think about misuse.
00:26:11
Speaker
So going forward, as we're sort of like developing these systems and trying to make them like more fully realized in the world where the number of people will equal something like seven or eight billion, how do we navigate this space where we're trying to sort of hit a sweet spot where
00:26:33
Speaker
It's corrigible in the right sorts of ways and to the right degree and right level and to the right people. And it is obedient to the right sorts of people and is not, like, suggestible by the wrong sorts of people. Or does that just like enter a territory of so many like political, social and ethical questions that it will just take years to think about and work on? I mean, I think it's closer to the second one.
00:27:01
Speaker
I'm sure that I don't know the answers here. We're still, like, from my standpoint, I'm still trying to get a good grasp on what is possible in the one robot, one person case. Yeah. You know, I think that when you have, yeah, when you, like, man, that's just, it's so
00:27:28
Speaker
It's so hard to think about that problem, because it's just very unclear what's even correct or right. And ethically, you want to be careful about imposing your beliefs and ideals too strongly onto a problem. Because you are sort of shaping that. Right. But at the same time, these are real challenges that are going to exist. And we already see them in real life.
00:27:56
Speaker
Uh, you know, it's like, if we look at the YouTube recommender stuff that was just happening, you know, arguably that's a misspecified objective.

AI and Radical Content Promotion

00:28:07
Speaker
So, to get a little bit of background here, this is largely based off a recent New York Times opinion piece. It was looking at the recommendation engine for YouTube and pointing out it has a bias towards recommending radical
00:28:26
Speaker
content. So either fake news or Islamist videos. And if you dig into why that was occurring, you know that a lot of it is because of what they're doing: they're optimizing for engagement. And the process of online radicalization looks super engaging. And now we can think about sort of
00:28:52
Speaker
Where does that come up? Well, you know, that issue gets introduced in a whole bunch of places, but a big piece of it is that there is this adversarial dynamic to the world. And there are users generating content in order to be outrageous and enraging, because they discover that it gets more feedback and more responses. And so you need to design a system that's robust to that strategic property of the world, but at the same time,
00:29:21
Speaker
you can understand why YouTube was very, very hesitant to be taking actions that would look like censorship. Right. And so I guess just coming more off of this idea of the world having lots of adversarial agents in it, human beings are like general intelligences who have reached some level of corrigibility and obedience that works
00:29:45
Speaker
kind of well in the world amongst a bunch of other human beings. And that was sort of developed in through evolution. So are there potentially techniques for developing the right sorts of, like, corrigibility and obedience in machine learning and AI systems through stages of evolution and running environments like that? I think that's a possibility. One of the, I would say, well, I have a couple of thoughts related to that. The first one is,
00:30:16
Speaker
I would actually challenge a little bit of your point of modeling people as general intelligences, mainly in the sense that when we talk about artificial general intelligence, we have something in mind. And it's often a shorthand in these discussions for
00:30:40
Speaker
perfectly rational Bayesian optimal actor. What that means is a system that is taking advantage of all of the information that is currently available to it in order to pick actions that optimize expected utility.
00:31:00
Speaker
When we say perfectly, we mean a system that is doing that as well as possible. It's that modeling assumption that I think sits at the heart of a lot of concerns about existential risk. And I definitely think that's a good model to consider. But there's also the concern that it might be misleading in some ways and that it might not actually be a good model of people and how they act in general.
00:31:28
Speaker
So one way to look at it would be to say that there's something about the incentive structure around humans and in our societies that has developed and adapted that creates the incentives for us to be corrigible, and thus a good research goal in AI is to figure out what those incentives are and to replicate them in AI systems. Another way to look at it is that people are intelligent
00:31:57
Speaker
not necessarily in the ways that economics models us as intelligent, that there are sort of properties of our behavior which are desirable properties that don't directly derive from expected utility maximization. Or if they do, they derive from a very, very diffuse form of expected utility maximization. So this is the perspective that says that people
00:32:24
Speaker
on their own are not necessarily what human evolution is optimizing for. But people are sort of a tool along that way. And you could make arguments for that based off of, I think it's an interesting perspective to take. But what I would say is that in order for societies to work, we have to cooperate.
00:32:51
Speaker
That cooperation was a crucial evolutionary bottleneck, if you will. And one of the really, really important things that it did is it forced us to develop the parent-child strategy relationship equilibrium that we currently live in. And that's a process whereby we communicate our values, whereby we
00:33:20
Speaker
train people to think that certain things are OK or not, and where we inculcate certain behaviors in the next generation. And I think it's that process more than anything else that we really, really want in an AI system and in powerful AI systems. Now, the thing is the, and I guess, well, to continue on that a little more, it's really, really important that that's there, because if you don't have those
00:33:48
Speaker
types of mental, those kind of cognitive abilities to understand causing pain and to just sort of fundamentally decide that that's a bad idea, to have a desire to cooperate, to buy into the different coordination and normative mechanisms that human society uses. And if you don't have that, then you end up, well, then society just doesn't function.

Human Intelligence and Ethical AI

00:34:12
Speaker
You know, a hunter-gatherer tribe of self-interested sociopaths probably doesn't last for very long.
00:34:18
Speaker
And so what this means is that our ability to coordinate our intelligence and cooperate with it was co-evolved and co-adapted alongside our intelligence. And I think that that evolutionary pressure and bottleneck was really important to getting us to the type of intelligence that we are now. And it's not a pressure that AI is necessarily subjected to.
00:34:45
Speaker
And I think maybe that is one way to sort of phrase the concern, I'd say. So when I look to evolutionary systems and sort of where the incentives for cordiability and cooperation and interaction come from, it's largely about the processes whereby people are less like general intelligences in some ways.
00:35:12
Speaker
Sort of evolution allowed us to become smart in some ways and restricted us in others. So based on the imperatives of group coordination and interaction. And I think that a lot of our intelligence in practice is about reasoning about group interaction and what groups think is okay and not.
00:35:34
Speaker
And that's part of the developmental process that we need to replicate in AI, just as much as spatial reasoning or vision.
00:35:43
Speaker
Cool. I guess I just want to touch base on this before we move on. Are there certain sorts of assumptions about the kinds of agents that humans are, and almost, I guess, ideas about us as being utility maximizers in some sense, that you see people commonly have, but that are like misconceptions about people and how people operate differently from AI? Well, I think that that's the whole field of behavioral economics in a lot of ways.
00:36:13
Speaker
I could sort of point to examples of people being irrational. I think there are all of the examples of sort of people being more than just self-interested. There are ways in which we seem to be risk-seeking that seems like that would be irrational from an individual perspective, but you could argue it may be rational from a group evolutionary perspective. I mean, things like overeating.
00:36:44
Speaker
You know, I mean, that's not exactly the same type of rationality, but it is an example of us becoming ill adapted to our environments and showing the extent to which we're not capable of changing, or for which it may be hard to. Um, yeah, I think, you know, in some ways, one story that I tell about AI risk
00:37:10
Speaker
is that, you know, back in the start of the AI field, we were sort of looking around and saying, Oh, we want to create something intelligent. And intuitively, we all know what that means, but we need a formal characterization of it. And the formal characterization that we turned to was basically the theories of rationality developed in economics. And although those theories turned out to be, except in some settings, not great descriptors of human behavior,
00:37:40
Speaker
they were quite useful as a guide for building systems that accomplish goals. And I think that part of what we need to do as a field is kind of reassess where we're going and think about whether or not building something like that perfectly rational actor is actually a desirable end goal. In the sense, I mean, there's a sense in which it sort of is, I would like an all powerful
00:38:08
Speaker
perfectly aligned genie to help me do what I want in life. But you might think that if the odds of getting that wrong are too high, that maybe you would do better with shooting for something that doesn't quite achieve that ultimate goal, but that you can get to with pretty high reliability. This may be a setting where shoot for the moon, and if you miss, you'll land among the stars is just a horribly misleading
00:38:36
Speaker
perspective. Shoot for the moon and you might get a hellscape universe, but if you shoot for the clouds, it might end up pretty okay. Yeah. We could iterate on the sound bite, but I think something like that is kind of where I stand on my thinking here.
00:38:56
Speaker
So we've talked about a few different approaches that you've been working on over the past few years. What do you view as the main limitations of such approaches currently? Mostly you're just only thinking about one machine, one human systems or environments. What are the biggest obstacles that you're facing right now in inferring and learning human preferences?
00:39:19
Speaker
Well, I think the first thing is it's just an incredibly difficult inference problem. And it's a really difficult inference problem to imagine running at scale with sort of explicit inference mechanisms. So one thing to do is you can design a system that explicitly tracks a belief about someone's preferences and then acts in response to that.
00:39:42
Speaker
And those are systems that you could try to prove theory about. They're very hard to build. And they can be difficult to get to make work correctly. In contrast, you can create systems that have incentives to construct beliefs to accomplish their goals. And it's easier to imagine building those systems and having them work at scale. But it's much, much harder to understand
00:40:09
Speaker
how you would be confident in those systems being well aligned. I think that, you know, what are the biggest concerns I have? I mean, we're still very far from any of these approaches being very practical, to be honest. I think this theory is still pretty unfounded. You know, there's still a lot of work to go to understand, like, what is the target we're even shooting for? What does an aligned system mean?
00:40:37
Speaker
My colleagues and I have spent an incredible amount of time trying to just understand what does it mean to be value aligned if you are a suboptimal system. So there's one example that I think about, which is, say you're cooperating with an AI system playing chess.
00:41:00
Speaker
you start working with that AI system and you discover that, you know, if you listen to its suggestions, 90% of the time it's actually suggesting the wrong move or a bad move. So would you call that system value aligned? No, I would not. I think most people wouldn't. But now what if I told you that that program was actually implemented as a search that's using the correct goal test?
00:41:28
Speaker
So it actually turns out that if it's within 10 steps of a winning play, it always finds that for you. But because of computational limitations, it usually doesn't. Now is the system value aligned? I think it's a little harder to tell here. And what I do find is that when I tell people this story and I start off with the search algorithm with the correct goal test, they almost always say that that is value aligned, but stupid.
00:41:58
Speaker
So there's an interesting thing going on here, which is we're not totally sure what the target we're shooting for is. And you can take this thought experiment and push it further. So suppose you're doing that sort of search, but now let's say it's heuristic search that uses the correct goal test but has an adversarially chosen heuristic function. Would that be a value-aligned system? Again, I'm not sure.
00:42:26
Speaker
If the heuristic was adversarially chosen, I'd say probably not. But if the heuristic just happened to be bad, then I'm not sure. Could you potentially unpack what it means for something to be adversarially chosen? Sure. Adversarially chosen in this case just means that there is some intelligent agent selecting
00:42:51
Speaker
that heuristic function or that evaluation measurement in a way that's designed to maximally screw you up. Adversarial analysis is a really common technique used in cryptography where we try to think of adversaries selecting inputs for computer systems that will cause them to malfunction. In this case, what this looks like is an adversarial algorithm that
00:43:19
Speaker
looks, at least on the surface, like it is trying to help you accomplish your objectives, but is actually trying to fool you. And I'd say that more generally what this thought experiment helps me with is understanding that value alignment is actually a quite tricky and subjective concept. And it's actually quite hard to nail down in practice what it would mean.
00:43:47
Speaker
What sort of effort do you think needs to happen and from who in order to specify what it really means for a system to be value aligned and to not just have sort of like a soft squishy idea of what that means but to have it really formally mapped out so it can be implemented in machine systems?
00:44:08
Speaker
We need more people working on technical AI safety research. I think to some extent it may always be something that's a little ill-defined and squishy, but generally I think it goes to the point of needing good people in AI willing to do this kind of squishier, less concrete work
00:44:29
Speaker
that sort of really gets at it. I mean, I think value alignment is going to be something that's a little bit more like I know it when I see it. And as a field, we need to be moving towards a goal of AI systems where alignment is the end goal, whatever that means. Like I'd like to move away from artificial intelligence where we think of intelligence as an ability to solve puzzles to artificial aligning agents.

Value Alignment as an Adaptive Process

00:44:54
Speaker
where the goal is to build systems that are actually accomplishing goals on your behalf. And I think the types of behaviors and strategies that arise from taking that perspective are qualitatively quite different from the strategies of pure puzzle solving on a well-specified objective.
00:45:11
Speaker
So all this work we've been discussing is largely at a theoretic and meta level. At this point, is this the main sort of research that we should be doing or is there any space for research into what specifically might be implementable today? I don't think that's the only work that needs to be done. For me, I think it's a really important type of work that I'd like to see more of.
00:45:40
Speaker
I think a lot of important work is about understanding how to build these systems in practice and to think hard about designing AI systems with meaningful human oversight. I'm a big believer in the idea that, for AI safety, the distinction between short-term and long-term issues is not really that large and that there are
00:46:07
Speaker
synergies between the research problems that go both directions. So I believe that on the one hand, looking at short-term safety issues, which includes things like Uber's car just killed someone, includes the YouTube recommendation engine, includes issues like fake news and information filtering,
00:46:34
Speaker
I believe that all of those things are related to and give us our best window into the types of concerns and issues that may come up with advanced AI. And at the same time, and this is a point that I think people concerned about xRisk do themselves a disservice on by not focusing here, is that actually doing theory about advanced AI systems and about
00:47:03
Speaker
in particular systems where it's not possible to what I would call unilaterally intervene. So systems that aren't necessarily, that aren't corrigible by default. I think that that actually gives us a lot of idea of how to build systems now that are just merely hard to intervene with or oversee. So if you're thinking about issues of monitoring and oversight and
00:47:30
Speaker
How do you actually get a system that can appropriately evaluate when it should go to a person because its objectives are not properly specified or may not be relevant to the situation? You know, I think YouTube would be in a much better place today if they had a robust system for doing that for their recommendation engine. In a lot of ways, the concerns about X risk represent an extreme set of assumptions for getting AI right now.
00:47:59
Speaker
And so I think I'm also just trying to get a better sense of what the system looks like and how it would be functioning on a day-to-day. What is the sort of data that it is taking in in order to capture, learn, and infer specific human preferences and values? And just trying to understand better whether or not it can model whole moral views and ethical systems of other agents or if it's just sort of capturing little specific bits and pieces.
00:48:26
Speaker
So I think my ideal would be to, as a system designer, build in as little as possible about my sort of moral beliefs. And I think that ideally the process would look something, well, one process that I could see and imagine doing right would be to
00:48:51
Speaker
just directly go after trying to replicate something about the moral imprinting process that people have with their children. So you'd either have someone who's like a guardian or is responsible for an AI system's decisions. And we build systems to try to align with one individual and then
00:49:16
Speaker
try to adopt and extend and push forward the beliefs and preferences of that individual. I think that's sort of one concrete version that I could see. I think building explicit, I think a lot of the place where I see things maybe a little bit different than some people is that I think the main ethical questions we're going to be stuck with and the ones that we really need to get right are sort of the mundane ones.
00:49:46
Speaker
Sort of the things that most people agree on and think are just sort of like, obviously that's not okay. Sort of kind of like mundane ethics and morals rather than kind of the more esoteric or fancier kind of population ethics questions that can arise. I feel a lot more confident about the ability to build good AI systems if we get that part right. And I feel like we've got a better shot at getting that part right because there's a clearer target to shoot for.
00:50:15
Speaker
So now what kinds of data would you be looking at? So in that case, it would be data from interaction with a couple select individuals. But ideally, you'd want sort of as much data as you can. What I think you really want to be careful of here is how much assumptions do you make about the procedure that's generating your data. So what I mean by that is
00:50:41
Speaker
whenever you learn from data, you have to make some assumption about how that data relates to the right thing to do, where right is with a capital R in this case. And the more assumptions you make there, the more your system will be able to learn about values and preferences, and the quicker it will be able to learn about values and preferences. But the more assumptions and structure you make there,
00:51:07
Speaker
the more likely you are to get something wrong that your system won't be able to recover from. So again, we sort of see this trade-off come up of a discrepancy between the amount of uncertainty that you need in the system in order to be able to adapt to the right person and figure out the correct preferences and morals against sort of the efficiency with which you can figure that out. But I guess in saying this, it feels
00:51:37
Speaker
a little bit like I'm rambling and unsure about what the answer looks like. And I hope that that comes across because I'm really not sure. Beyond the rough structure of data generated from people, interpreted in a way that involves the fewest prior conceptions about
00:52:04
Speaker
what people want and what preferences people have that we can get away with is what I would shoot for. But I don't really know what that would look like in practice. Right. So it seems here that it's just encroaching on a bunch of very difficult social, political, and ethical issues involving persons and data which will be selected for preference aggregation, like how many people are included in developing the reward function and utility function of the AI system.
00:52:32
Speaker
Also, I guess we'd have to be considering like culturally sensitive systems where systems operating in different cultures and contexts are going to need to be trained on different sets of data. And I guess there'll also be like questions and ethics about whether or not we'll even want
00:52:50
Speaker
systems to be training off of certain cultures' data? Yeah, so I would actually say that a good value, like I wouldn't necessarily even think of it as training off of different data, like one of the core questions in artificial intelligence is identifying the relevant community that you are in and building a normative understanding of that community.
00:53:15
Speaker
So I want to push back a little bit and move you away from the perspective of we collect data about a culture and we figure out the values of that culture and then we build our system to be value aligned with that culture. And more think about the actual AI product is the process whereby we determine, elicit and respond to the normative values of the multiple overlapping communities that you find yourself in. And
00:53:44
Speaker
And that that process is ongoing, it's holistic, it's overlapping and it's messy. And I think we'd like to, to the extent that I think it's possible, I'd like to not have a couple of people sitting around in a room deciding what the right values are. But much more, I think a system should be holistically designed with value alignment at multiple scales as a sort of core property of AI.
00:54:14
Speaker
And I think that that's actually a fundamental property of human intelligence. You behave differently based on the different people around, and you're very, very sensitive to that. There are certain things that are okay at work that are not okay at home, that are okay on vacation, that are okay around kids that are not.
00:54:35
Speaker
And figuring out what those things are and adapting yourself to them is sort of the fundamental intelligence skill needed to interact in modern life. Otherwise, you just get shoved.
00:54:48
Speaker
So it seems to me that in the context of a really holistic, messy kind of ongoing value alignment procedure, we'll be aligning AI systems, ethics and morals and moral systems and behavior with that of a variety of cultures and persons and just interactions in the 21st century.
00:55:08
Speaker
And when we reflect upon the humans of the past, we can see in various ways that they are like just moral monsters. We have issues with slavery and today we have issues with factory farming and voting rights and tons of other things in history. How should we view and think about aligning powerful systems, ethics and goals with
00:55:32
Speaker
like current human morality and preferences and the risk of amplifying like current things which are immoral in present day life. This is the idea of mistakenly locking in the wrong values. In some sense, I think it is something we should be concerned about less from the standpoint of entire, well, no, I think yes, from the standpoint of entire cultures getting things wrong. But again, I think if we
00:56:00
Speaker
If we don't think of there being a sort of monolithic society that has a single value set, these problems just are fundamental issues. What your local community thinks is okay versus what other local communities think are okay. And a lot of our society and a lot of our political structure is about how to handle those clashes between value systems.
00:56:28
Speaker
my ideal for AI systems is that they should become a part of that normative process and maybe not participate in them as people, but also, you know, I think this, if we think of value alignment as a consistent, ongoing, messy process, there is
00:56:49
Speaker
I think maybe that perspective lends itself less towards locking in values and sticking with them. There's sort of one frame you can look at the problem, which is we determine what's right and what's wrong, and we program our system to do that. And then there's another one, which is we program our system to be sensitive to what people think is right or wrong. And I think that's more the direction that I think of value alignment in.
00:57:14
Speaker
And then what I think the final part of what you're getting at here is that the system actually will feed back into people. So what AI systems show us will shape what we think is OK and vice versa. And that's something that I am quite frankly not sure how to handle. I don't know how you predict how you're going to influence what someone wants and what they will perceive they want.
00:57:41
Speaker
and how to do that, I guess, correctly. All I can say is that we do have a human notion of what is acceptable manipulation. And we do have a human notion of allowing someone to figure out for themselves what they think is right and not, and refraining from biasing them too far. So to some extent, if you're able to value align with communities in a good ongoing holistic manner,
00:58:11
Speaker
That should also give you some ways to choose and understand what types of manipulations you may be doing that are okay or not. I'd also say that I think that this perspective has a very kind of mundane analogy when you think of the feedback cycle between recommendation engines and regular people. Those systems
00:58:38
Speaker
don't model the effect, well, they don't explicitly model the fact that they're changing the structure of what people want and what they'll want in the future. That's probably not the best analogy in the world. I guess what I'm saying is that it's hard to plan for how you're going to influence someone's desires in the future. It's not clear to me what's right or what's wrong. What's true is that we as humans have a lot of norms about what types of manipulation are okay and are not.
00:59:06
Speaker
And you might hope that appropriately doing value alignment in that way might help get to an answer here. I'm just trying to get a better sense here. So like when I'm thinking about the role that ethics and intelligence plays here, I sort of view intelligence as a means of modeling the world and achieving goals and ethics as the end towards which intelligence is sort of aimed here.
00:59:31
Speaker
Now, I'm curious, in terms of behavior modeling, where inverse reinforcement learning agents are modeling, I guess, the behavior of human agents and also predicting the sorts of behaviors that they'd be taking in the future or in the situation in which the inverse reinforcement learning agent finds itself, I'm curious to know where metaethics and moral epistemology fits in, where
00:59:59
Speaker
inverse reinforcement learning agents are finding themselves in novel ethical situations and what their ability to handle those novel ethical situations are like. When they're handling those situations, how much does it look like them performing some sort of
01:00:17
Speaker
normative and meta ethical calculus based on a kind of moral epistemology that they have? Or how much does it look like they're using some other sort of like behavioral predictive system where they're like modeling humans? The answer to that question is not clear. So what does it actually mean to make decisions based on an ethical framework or a meta ethical framework?
01:00:44
Speaker
I guess we could start there. You and I know what that means. But our definition is encumbered by the fact that it's pretty human-centric. We could talk about it in terms of, well, I weighed this option. I looked at that possibility. And we don't even really mean the literal sense of weighed and actually counted up and constructed actual numbers and multiplied them together in our heads.
01:01:11
Speaker
What these are is they're actually references to complex thought patterns that we're going through. And so verifying whether or not those thought patterns are going on in an AI system is... You can also talk about the difference between sort of the process of making a decision and the substance of it. So when an inverse reinforcement learning agent is going out into the world, the policy it's following
01:01:38
Speaker
is constructed to try to optimize a set of inferred preferences. But does that mean that the policy you're outputting is making meta-ethical characterizations? Well, at the moment, almost certainly not, because the systems we build are just not capable of that type of cognitive reasoning. But I think the bigger question is, do you care? And to some extent, you probably do. I mean, I'd care if I had some very
01:02:07
Speaker
deep disagreements with the sorts of metaethics that led to the preferences that were unloaded to the machine. And also if the machine were in like such a new novel ethical situation that was unlike anything human beings had faced that just required some sort of metaethical reasoning to deal with. Yes. I mean, I think you definitely want it to take decisions that you
01:02:30
Speaker
you would agree with, or at least that you could be sort of non-maliciously convinced to agree with. But practically, there isn't a place in the theory where that sort of shows up. And it's not clear that what you're saying is that different from value alignment in particular. If I were to try to refine the point about metaethics,
01:02:56
Speaker
What it sounds to me like you're getting at is an inductive bias that you're looking for in AI systems. And arguably, ethics is an argument about what inductive bias we should have as humans. But I don't think that that's a first-order property in value alignment systems necessarily, or in preference-based learning systems in particular.
01:03:26
Speaker
I would think that that kind of metaethics comes in from value aligning to someone who has these sorts of sophisticated ethical ideas. I don't know where your thoughts about metaethics came from, but at least indirectly, you could probably trace them down to the values that your parents inculcated in you as a child. That's how metaethics got built into your head,
01:03:55
Speaker
if we want to think of you as being an AGI. And I think that for AI systems, that's the same way I would see it going. So I don't believe the brain has circuits dedicated to metaethics. I think that exists in software, and in particular is something that's being programmed into humans from their observational data, more so than from the
01:04:22
Speaker
structures that are built into us as a fundamental part of our intelligence or value alignment. We've also talked a bit about how human beings are potentially not fully rational agents. And so with inverse reinforcement learning, this sort of leaves open the question as to whether or not AI systems are actually capturing what the human being actually prefers.
01:04:50
Speaker
Or if there are some sorts of limitations in the human's observed or chosen behavior or explicitly stated preferences, like limits in our ability to convey what we actually most deeply value or would value given more information. And so these inverse reinforcement learning systems may not be learning what we actually value or what we think we should value. How can AI systems sort of assist
01:05:19
Speaker
in this evolution of human morality and preferences whereby we're actually conveying what we actually value and what we would value given more information? Well, there are sort of two things that I heard in that question. So one is how do you just sort of mathematically account for the fact that people are irrational
01:05:48
Speaker
and that that is a property of the source of your data. And so inverse reinforcement learning at face value doesn't allow us to model that appropriately. And so it may lead us to make the wrong inferences. And I think that's a very interesting question. And it's probably the main one that I think about now as a technical problem.
01:06:17
Speaker
is understanding what are good ways to model how people might or might not be rational, and building

AI's Role in Human Morality Evolution

01:06:27
Speaker
systems that can sort of appropriately interact with that sort of complex data source. So one recent thing that I've been thinking about is what happens
01:06:41
Speaker
if people rather than knowing their objective, what they're trying to accomplish, are figuring it out over time. So this is a model where the person is a learning agent that discovers how they like states when they enter them, rather than sort of thinking of the person as an agent that already knows what they want and is just planning to accomplish that. And I think these types of assumptions that try to
01:07:10
Speaker
paint a very, very broad picture of the space of things that people are doing can help us in that vein. When someone's learning, it's actually interesting that you can end up helping them. You end up with a class of strategies that sort of breaks down into three phases. You have an initial exploration phase where you help the learning agent get a better picture of the world and the dynamics and its associated rewards.
01:07:38
Speaker
And then you have an observation phase where you observe how that agent now takes advantage of the information that it's got. And then there's an exploitation or extrapolation phase where you try to implement the optimal policy given the information you've seen so far. And so I think sort of moving towards more complex models that have a more realistic setting and richer set of assumptions behind them is important.
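To make that three-phase structure concrete, here is a minimal sketch in Python of an assistant helping a "human" who only discovers how much they like each option by trying it. The toy bandit setup, the option names, and the phase lengths are illustrative assumptions, not something taken from Dylan's actual models:

```python
import random

# Toy setting: a few options whose true values the human does not know in
# advance; they only learn a value by experiencing that option (an assumption
# made purely for illustration).
TRUE_VALUES = {"a": 0.2, "b": 0.9, "c": 0.5}

class LearningHuman:
    """A 'human' who discovers their preferences by experiencing options."""
    def __init__(self):
        self.known_values = {}

    def experience(self, option):
        # Once an option is experienced, the human knows how much they like it.
        self.known_values[option] = TRUE_VALUES[option]

    def favorite_known_option(self):
        if not self.known_values:
            return random.choice(list(TRUE_VALUES))
        return max(self.known_values, key=self.known_values.get)

def assist(human, horizon=9):
    """Three-phase assistance: explore, observe, then exploit."""
    phase_len = horizon // 3
    history = []

    # Phase 1: exploration -- help the human get a picture of the options.
    for t in range(phase_len):
        option = list(TRUE_VALUES)[t % len(TRUE_VALUES)]
        human.experience(option)
        history.append(("explore", option))

    # Phase 2: observation -- watch which options the human now chooses.
    observed = []
    for _ in range(phase_len):
        choice = human.favorite_known_option()
        observed.append(choice)
        history.append(("observe", choice))

    # Phase 3: exploitation -- act on the inferred preference.
    inferred_favorite = max(set(observed), key=observed.count)
    for _ in range(horizon - 2 * phase_len):
        history.append(("exploit", inferred_favorite))

    return history

if __name__ == "__main__":
    for phase, option in assist(LearningHuman()):
        print(phase, option)
```

Even in this toy version, the shape Dylan describes shows up: the assistant first broadens the human's experience, then watches what they do with that experience, and only then commits to acting on the inferred preference.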
01:08:07
Speaker
The other thing you talked about was about helping people discover their morality and sort of learn more about what's okay and what's not. And there I'm afraid I don't have too much interesting to say, in the sense that I believe it's an important question, but I just don't feel that I have many answers there. So practically I'm not sure, like,
01:08:35
Speaker
If you have someone who's learning their preferences over time, is that different than humans refining their moral theories? I don't know. You could make mathematical modeling choices so that they are, but I'm not sure if that really gets at what you're trying to point towards. I'm sorry that I don't have anything more interesting to say on that front.
01:09:01
Speaker
other than I think it's important. And I would love to talk to more people who are spending their days thinking about that question because I think it really does deserve that kind of intellectual effort. Yeah. It sounds like we kind of need some more like AI moral psychologists to help us think about these sorts of things. Yeah. And in particular, we're talking about sort of philosophy around value alignment and the ethics of value alignment.
01:09:31
Speaker
I think a really important question is, what are the ethics of developing value alignment systems? So a lot of times people talk about AI ethics from the standpoint of, for lack of a better example, the trolley problem. And the way that they think about it is who should the car kill? There is a correct answer, or maybe not a correct answer, but there are answers that we could think of as more or less bad.
01:09:59
Speaker
and an AI, which one of those options should the AI select? And that's not unimportant. But it's not the ethical question that an AI system designer is faced with. In my mind, if you're designing a self-driving car, the relevant questions you should be asking are really three things.
01:10:27
Speaker
One, what do I think is an okay way to respond to different situations? Two, how is my system going to be understanding the preferences of the people involved in those situations? And then three, how should I design my system in light of those two facts? So I have my own preferences about what I would like my system to do.
01:10:58
Speaker
I have an ethical responsibility, I would say, to make sure that my system is adapting to the preferences of its users to the extent that it can. But I also wonder how you should handle things when there are conflicts between those two value sets. Right. So, you know, say
01:11:23
Speaker
you're building a robot, and it's going to go and live with an uncontacted human tribe. Should it respect the local cultural traditions and customs? Probably; that would be respecting the values of the users. But then let's say that that tribe does something that we would consider to be gross, like pedophilia. Is my system required to participate wholesale in that value system?
01:11:54
Speaker
And where's the line that we would need to draw between sort of unfairly imposing my values on system users and being able to make sure that the technology that I build isn't used for purposes that I would deem reprehensible or gross?
01:12:15
Speaker
Maybe we should just put a dial in each of the autonomous cars that lets the user set it to deontology mode or utilitarianism mode as it's racing down the highway. But yeah, I guess this is where I just think that metaethics is super important, and I'm not sure
01:12:38
Speaker
if fully autonomous systems are going to play a role where they're resolving these ethical dilemmas for us, which, I guess, seems kind of necessary eventually if they're going to be really, actually autonomous and help to make the world a much better place. So I guess this kind of feeds into my next question, where we probably both have different assumptions about this, but what the role of inverse reinforcement learning ultimately is.
01:13:06
Speaker
Like, is it just to allow AI systems to evolve alongside us and to match our current ethics? Or is it to allow the systems to ultimately surpass us and move far beyond us into the deep future? Inverse reinforcement learning, I think, is much more about the first than the second. I think it can be a part of how you get to the second and how you improve. But for me, when I think about these problems technically,
01:13:34
Speaker
I try to think about matching human morality as the goal. Except for the factory farming stuff. Well, I mean, if you had a choice between a system that thinks eradicating all humans is OK but is against factory farming, versus one that is neutral about factory farming but thinks eradicating all humans is not OK, which would you pick? I mean, I guess there's a chance
01:14:03
Speaker
with your audience that there may be some people that would choose the saving-the-animals answer. But my point is that it's so hard; technically I think it's very hard to imagine getting these normative aspects of human societies and interaction right. I think just hoping to participate in that process in a way that
01:14:31
Speaker
is analogous to how people normally do it is a good step. I think we probably, to the extent that we can, should not have AI systems trying to figure out if it's OK to do factory farming. And I think it's so hard to understand what it means to even match human morality or participate in it that, for me, the concept of surpassing it feels very, very challenging and fraught.
01:15:01
Speaker
And I would worry as a general concern that as a system designer who doesn't necessarily represent the views and interests of everyone, that by programming in surpassing humanity or surpassing human preferences or morals, what I'm actually doing is just programming in my morals and ethical beliefs.
01:15:24
Speaker
Yeah, so I mean, there seems to be this strange issue here where it seems like if we get AGI and recursive self-improvement is a thing that really takes off, then we have a system that has potentially succeeded at its inverse reinforcement learning but far surpassed human beings in its general intelligence. So we have kind of a superintelligence that's matching human morality.
01:15:51
Speaker
It just seems kind of like a funny situation where we'd really have to pull the brakes and, I guess, as William MacAskill mentions, have a really, really long deliberation about ethics and moral epistemology and value. How do you view that? I think that's right. I mean, I think there are some real questions about who should be involved in that conversation. For instance, well, one thing I'd say is that
01:16:17
Speaker
You should recognize that there's a difference between having the same morality and having the same data. So one way to think about it is that people who are against factory farming have a different morality than the rest of people. Another one is that they actually just have exposure to the information that allows their morality to come to a better answer. There's this confusion you can make between the objective that someone has
01:16:47
Speaker
and the data that they've seen so far. So I think one point would be to think that a system that has current human morality but access to a vast, vast wealth of information may actually do much better than you might think. I think we should leave that open as a possibility.
01:17:13
Speaker
I think, you know, and for me, this is less about morality in particular, and more just about like, power concentration, and how much influence you have over the world. I mean, if we imagined that there was something like, you know, there's a very powerful AI system that was controlled by a small number of people. Yeah, you better think freaking hard before you tell that system what to do. And, you know, that is, that's related to questions about
01:17:42
Speaker
ethical ramifications on sort of metaethics and generalization and what we actually truly value as humans.
01:17:52
Speaker
But it's also super true for all of the more mundane things in the day to day as well. Did that make sense? Yeah, yeah, totally makes sense. And I'm becoming increasingly mindful of your time here. So I just wanted to hit a few more questions if that's okay before I let you go. Please. Yeah. So I'm wondering, would you like to or do you have any thoughts on how coherent extrapolated volition fits into this conversation and your views on it?

Coherent Extrapolated Volition and Its Challenges

01:18:19
Speaker
What I'd say is I think coherent extrapolated volition is an interesting idea and goal. Which is defined as? Where it's defined as a method of preference aggregation. Personally, I'm a little wary of preference aggregation approaches. Well, I'm wary of imposing
01:18:44
Speaker
your morals on someone indirectly via choosing the method of preference aggregation that we're going to use. Right. But it seems like at some point we have to make some sort of metaethical decision, or else we'll just forever be lost. Do we have to? Well, some agent does. Well, does one agent have to? Did one agent decide on the ways that we were going to do preference aggregation as a society?
01:19:12
Speaker
No, it sort of naturally evolved out of... It just sort of naturally evolved via a coordination and argumentative process. Right. And so for me, my answer, if you forced me to specify something about how we're going to do value aggregation, you know, if I was controlling the values for an AGI system,
01:19:39
Speaker
I would try to say as little as possible about the way that we're going to aggregate values, because I think we don't actually understand that process much in humans. And instead, I would opt for a heuristic of, to the extent that we can, devoting equal optimization effort towards every individual and allowing that parliament, if you will,
01:20:07
Speaker
to determine the way that value should be aggregated. And this doesn't necessarily mean having an explicit value aggregation mechanism that gets set in stone. This could be an argumentative process mediated by artificial agents arguing on your behalf. This could be a futuristic AI-enabled version of the court system.
01:20:33
Speaker
like an ecosystem of preferences and values kind of in conversation. Exactly.
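As a very rough illustration of that "equal optimization effort" heuristic, here is a minimal sketch in which a shared policy is improved by cycling through people and giving each an equal share of the optimization steps, rather than by fixing an explicit aggregation formula up front. The utility functions, the single knob, and the round-robin scheme are illustrative assumptions, not a proposal from the episode:

```python
# Toy sketch: a shared "policy" is a dict of knob settings; each person has a
# utility over those knobs. Instead of optimizing one aggregate objective, the
# system cycles through people, giving each an equal share of the optimization
# steps (a crude stand-in for "equal optimization effort").

def make_utility(preferred):
    # Each person's utility is highest when the knob matches their preference.
    return lambda policy: -abs(policy["knob"] - preferred)

PEOPLE = {
    "alice": make_utility(0.2),
    "bob": make_utility(0.8),
    "carol": make_utility(0.5),
}

def improve_for(utility, policy, step=0.05):
    """Take one small step that increases this person's utility, if possible."""
    best = policy
    for delta in (-step, 0.0, step):
        candidate = {"knob": policy["knob"] + delta}
        if utility(candidate) > utility(best):
            best = candidate
    return best

def equal_effort_optimization(rounds=60):
    policy = {"knob": 0.0}
    names = list(PEOPLE)
    for i in range(rounds):
        person = names[i % len(names)]  # round-robin: equal effort per person
        policy = improve_for(PEOPLE[person], policy)
    return policy

if __name__ == "__main__":
    print(equal_effort_optimization())
```

The point of the sketch is only that the mediation happens through the process, that is, who gets optimization effort and when, rather than through a hand-written formula for combining values.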
01:20:40
Speaker
So we've talked a little bit about sort of the deep future here, where we're reaching toward potentially AGI or artificial superintelligence, after, I guess, inverse reinforcement learning is potentially solved. Is there anything that you view as coming after inverse reinforcement learning and these techniques? Yeah, I mean, I think inverse reinforcement learning is certainly not the be-all and end-all. I think what it is, is that it's
01:21:09
Speaker
one of the earliest examples in AI of trying to really look at preference elicitation and modeling and learning preferences. The idea existed before in economics; a whole bunch of economists have been thinking about this for a while already.
01:21:31
Speaker
But basically, yeah, I think there's a lot to be said about how you model data and how you learn about preferences and goals. I think inverse reinforcement learning is basically the first attempt to get at that, but it's very far from the end. And I would say the biggest thing in how I view things that is maybe different from your standard reinforcement learning
01:21:59
Speaker
inverse reinforcement learning perspective is that I focus a lot on how you act given what you've learned from inverse reinforcement learning. So inverse reinforcement learning is a pure inference problem; it's just: figure out what someone wants. And I ground that out in all of my research in taking actions to help someone, which introduces a new set of concerns and questions.
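To make that split concrete, here is a minimal sketch that separates the two steps: a Bayesian inference step over a handful of candidate reward functions, using a Boltzmann-rational (noisily rational) model of the human's choices, and a separate acting step that picks whatever helps most in expectation under that posterior. The candidate rewards, the actions, and the rationality parameter are illustrative assumptions rather than anyone's actual system:

```python
import math

# Illustrative assumption: three candidate "reward functions" over three actions.
CANDIDATE_REWARDS = {
    "likes_a": {"a": 1.0, "b": 0.0, "c": 0.0},
    "likes_b": {"a": 0.0, "b": 1.0, "c": 0.0},
    "likes_c": {"a": 0.0, "b": 0.0, "c": 1.0},
}
ACTIONS = ["a", "b", "c"]

def boltzmann_likelihood(action, reward, beta=2.0):
    """P(human picks `action` | reward), under a noisily rational human."""
    exps = {x: math.exp(beta * reward[x]) for x in ACTIONS}
    return exps[action] / sum(exps.values())

def infer_posterior(observed_actions):
    """The pure inference problem: a posterior over what the person wants."""
    posterior = {name: 1.0 / len(CANDIDATE_REWARDS) for name in CANDIDATE_REWARDS}
    for action in observed_actions:
        for name, reward in CANDIDATE_REWARDS.items():
            posterior[name] *= boltzmann_likelihood(action, reward)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

def act_to_help(posterior):
    """The separate decision problem: pick the action with the highest
    expected reward under the inferred posterior, i.e. act on what was learned."""
    def expected_reward(action):
        return sum(p * CANDIDATE_REWARDS[name][action] for name, p in posterior.items())
    return max(ACTIONS, key=expected_reward)

if __name__ == "__main__":
    demos = ["b", "b", "a", "b"]        # noisy demonstrations from the human
    posterior = infer_posterior(demos)  # inference: figure out what they want
    print(posterior)
    print("helpful action:", act_to_help(posterior))
```

The inference function only answers "what does this person probably want?"; deciding what to do with that answer is a second problem with its own concerns, which is the distinction Dylan is pointing at.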
01:22:24
Speaker
Great. So it looks like we're about at the end of the hour here. So I guess just if anyone here is interested in working on this technical portion of the AI alignment problem, what do you suggest they study or how do you view that it's best for them to get involved, especially if they want to work on like inverse reinforcement learning and like inferring human preferences? So I think if you're an interested person and you want to get into technical safety work,
01:22:54
Speaker
Well, the first thing you should do is probably read Jan Leike's recent write-up on 80,000 Hours. But generally what I would say is try to get involved in AI research, flat out. Don't focus as much on getting into AI safety research. And just generally focus more on acquiring the skills that will support you in doing good AI research. So get a strong math background.
01:23:23
Speaker
get a research advisor who will advise you on doing research projects and help teach you the process of submitting papers and figuring out what the AI research community is going to be interested in. In my experience, one of the biggest pitfalls that early researchers make is focusing too much on what they're researching rather than thinking about who they're researching with and
01:23:51
Speaker
how they're going to learn the skills that will support doing research in the future. I think that most people don't appreciate how transferable research skills are. To the extent that you can, try to do research on technical AI safety, but more so work on technical AI. And if you're interested in safety, the safety connections will be there, and you may see how
01:24:17
Speaker
a new area of AI actually relates to it or supports it, or you may find places of new risk and be in a good position to try to mitigate that and take steps to alleviate those harms. Wonderful. Yeah. Thank you so much for speaking with me today, Dylan. It's really been a pleasure and it's been super interesting. It was a pleasure talking to you. I love the chance to have these types of discussions. Great. Thanks so much. Until next time. Until next time. Thanks, it's been a blast.
01:24:47
Speaker
If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We'll be back soon with another episode in this new AI alignment series.