Introduction: Podcast Overview & Guest
00:00:12
Speaker
Hey everyone and welcome back to the AI alignment podcast at the Future of Life Institute. I'm Lucas Perry and today we'll be speaking with Stuart Armstrong on his research agenda version 0.9, synthesizing a human's preferences into a utility function.
00:00:28
Speaker
Here Stuart takes us through the fundamental idea behind this research agenda, what this process of synthesizing human preferences into a utility function might look like, key philosophical and empirical insights needed for progress, how human values are changeable, manipulable, under-defined and contradictory, how these facts affect generating an adequate synthesis of human values, where this all fits in the alignment landscape,
00:00:55
Speaker
and how it can inform other approaches to aligned AI systems. If you find this podcast interesting or useful, consider sharing it with friends on social media platforms, forums, or anywhere you think it might be found valuable.
Engagement: Listener Feedback & Sharing
00:01:10
Speaker
I'd also like to put out a final call for this round of Survey Monkey polling and feedback. So if you have any comments, suggestions, or any other thoughts you'd like to share with me about the podcast, potential guests, or anything else, feel free to do so through the Survey Monkey poll link attached to the description of wherever you might find this podcast. I'd love to hear from you.
00:01:31
Speaker
There also seems to be some lack of knowledge regarding the pages that we create for each podcast episode. You can find a link to that in the description as well, and it contains a summary of the episode, topics discussed, key points from the guest, important timestamps if you want to skip around, works referenced, as well as a full transcript of the audio in case you prefer reading.
About Stuart Armstrong & His Research
00:01:55
Speaker
Stuart Armstrong is a researcher at the Future of Humanity Institute who focuses on the safety and possibilities of artificial intelligence, how to define the potential goals of AI and map humanity's partially defined values into it, and the long-term potential for intelligent life across the reachable universe.
00:02:13
Speaker
He has been working with people at FHI and other organizations such as DeepMind to formalize AI desiderata in general models so that AI designers can include these safety methods in their designs. His collaboration with DeepMind on interruptibility has been mentioned in over a hundred media articles.
00:02:33
Speaker
Stuart's past research interests include comparing existential risks in general, including their probability and their interactions, anthropic probability, how the fact that we exist affects our probability estimates around that key fact, decision theories that are stable under self-reflection and anthropic considerations, negotiation theory and how to deal with uncertainty about your own preferences, computational biochemistry,
00:02:59
Speaker
fast ligand screening, parabolic geometry, and his Oxford DPhil was on the Holonomy of Projective and Conformal Cartan Geometries. And so, without further ado, or pretenses that I know anything about the Holonomy of Projective and Conformal Cartan Geometries, I give you Stuart Armstrong.
Understanding Human Preferences
00:03:23
Speaker
We're here today to discuss your research agenda version 0.9, synthesizing a human's preferences into a utility function. One wonderful place for us to start would be with this story of evolution, which you call an inspiring just-so story. And so to start, I think it would be helpful for us to contextualize the place of the human, and what the human is, as we find ourselves here at the beginning of this value alignment problem.
00:03:52
Speaker
I'll go ahead and read this here for listeners to begin developing a historical context and narrative. So I'm quoting you here. You say, this is the story of how evolution created humans with preferences and what the nature of these preferences are. The story is not true in the sense of accurate. Instead, it is intended to provide some inspiration as to the direction of this research agenda.
00:04:14
Speaker
In the beginning, evolution created instinct-driven agents. These agents had no preferences or goals, nor did they need any. They were like Q-learning agents. They knew the correct action to take in different circumstances, but that was it. Consider baby turtles that walk towards the light upon birth, because traditionally, the sea was lighter than the land, of course.
00:04:36
Speaker
This behavior fails them in the era of artificial lighting. But evolution has a tiny bandwidth acting once per generation, so it created agents capable of planning, of figuring out different approaches rather than having to follow instincts. This was useful, especially in varying environments, and so evolution offloaded a lot of its job onto the planning agents.
00:04:58
Speaker
Of course, to be of any use, the planning agents needed to be able to model their environment to some extent, or else their plans couldn't work, and had to have preferences, or else every plan was as good as another. So in creating the first planning agents, evolution created the first agents with preferences.
00:05:16
Speaker
Of course, evolution is a messy, undirected process, so the process wasn't clean. Planning agents are still riven with instincts, and the modeling of the environment is situational, used for when it was needed, rather than some consistent whole. Thus, the preferences of these agents were under-defined and sometimes contradictory. Finally, evolution created agents capable of self-modeling and of modeling other agents in their species.
00:05:42
Speaker
This might have been because of competitive social pressures as agents learn to lie and detect lying. Of course, this being evolution, the self and other modeling took the form of kludges built upon spandrels built upon kludges, and then arrived humans who developed norms and norm violations.
00:06:01
Speaker
As a side effect of this, we started having higher order preferences as to what norms and preferences should be. But instincts and contradictions remained. This is evolution after all. And evolution looked upon this hideous mess and saw that it was good. Good for evolution, that is. But if we want it to be good for us, we're going to need to straighten out this mess somewhat.
00:06:23
Speaker
Here we arrive, Stuart, in the human condition after hundreds of millions of years of evolution. So given the story of human evolution that you've written here, why were you so interested in this story and why were you looking into this mess to better understand AI alignment and develop this research agenda?
Challenges in Inferring Preferences
00:06:40
Speaker
This goes back to a paper that I co-wrote for NeurIPS. It basically develops the idea of inverse reinforcement learning, or more broadly, can you infer what the preferences of an agent are just by observing their behavior?
00:06:55
Speaker
Humans are not entirely rational. So the question I was looking at is, can you simultaneously infer the rationality and the preferences of an agent by observing their behavior? It turns out to be mathematically completely impossible.
00:07:11
Speaker
We can't infer the preferences without making assumptions about the rationality, and we can't infer the rationality without making assumptions about the preferences. This is a rigorous result. So my looking at human evolution is to basically get around this result, in a sense to make the right assumptions so that we can extract actual human preferences. Since we can't just do it by observing behavior, we need to dig a bit deeper.
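To make that degeneracy concrete, here is a minimal toy sketch; the environment and numbers are my own illustration rather than anything from the Armstrong and Mindermann paper. The same observed policy is explained equally well by a rational planner with reward R and by an anti-rational planner with reward -R, so behavior alone cannot separate rationality from preferences.

```python
# Toy illustration (my own, not from the paper) of why behavior alone cannot
# separate rationality from reward: the same policy is explained equally well
# by a rational planner with reward R and by an anti-rational planner with -R.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
R = rng.normal(size=(n_states, n_actions))       # a hypothetical reward table

def planner(reward, beta):
    """Boltzmann planner: beta > 0 is 'rational', beta < 0 is 'anti-rational'."""
    z = beta * reward
    p = np.exp(z - z.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)      # policy: P(action | state)

pi_observed = planner(R, beta=5.0)               # all we ever get to see

explanation_a = planner(R, beta=5.0)             # rational agent, reward R
explanation_b = planner(-R, beta=-5.0)           # anti-rational agent, reward -R

print(np.allclose(pi_observed, explanation_a))   # True
print(np.allclose(pi_observed, explanation_b))   # True: both fit perfectly
```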
00:07:40
Speaker
So, what did you glean then from looking at this process of human evolution and seeing how messy the person is? Well, there are two key insights here. The first is that I located where human preferences reside, or where we can assume human preferences reside. And that's in the internal models of the humans, how we model the world, how we judge, "that was a good thing," or "I want that," or "oh, I'd be really embarrassed about that."
00:08:09
Speaker
And so human preferences are defined in this project, or at least the building blocks of human preferences are defined to be in these internal models that humans have, with the labeling of states of outcomes as good or bad. The other point of bringing up evolution is that since it's not anything like a clean process, it's not like we have one general model with clearly labeled preferences and then everything else flows from that.
00:08:38
Speaker
It is a mixture of situational models in different circumstances with subtly different things labeled as good or bad. As I said, human preferences are contradictory, changeable, manipulable, and under-defined. There are two core parts to this research project, essentially.
00:08:58
Speaker
The first part is to identify the human's internal models, figure out what they are, how we use them, and how we can get an AI to realize what's going on. So those give us the sort of partial preferences, the pieces from which we build our general preferences.
00:09:16
Speaker
The second part is to then knit all these pieces together into an overall preference for any given individual in a way that works reasonably well and respects as much as possible the person's different preferences, meta preferences, and so on.
00:09:36
Speaker
The second part of the project is the one that people tend to have strong opinions about because they can see how it works and how the building blocks might fit together and how they'd prefer that it be fit together in different ways and so on. But in essence, the first part is the most important because that fundamentally defines the pieces of what human preferences are.
AI Alignment & Research Context
00:09:58
Speaker
Before we dive into the specifics of your agenda here, can you contextualize it within the evolution of your thought on AI alignment and also how it fits within the broader research landscape? So this is just my perspective on what the AI alignment landscape looks like.
00:10:16
Speaker
There's a collection of different approaches addressing different aspects of the alignment problem. Some of them, which MIRI is working a lot on, are technical things like how to ensure stability of goals, and other similar thoughts along these lines that should be necessary for any approach.
00:10:35
Speaker
Others are devoted to how to make the AI safe, either indirectly or by making it itself fully aligned. So in the first category, you have things like software as a service. Can we have superintelligent abilities integrated in a system that doesn't allow for, say, superintelligent agents with pernicious goals?
00:11:01
Speaker
Others that I have looked into in the past are things like low impact agents or oracles, which again, the idea is we have a super intelligence, we cannot align it with human preferences, yet we can use it to get some useful work done. Then there are the approaches which aim to solve the whole problem and get actual alignment, what used to be called the friendly AI approach.
00:11:28
Speaker
So here, it's not an AI that's constrained in any way, it's an AI that is intrinsically motivated to do the right thing. There are a variety of different approaches to that, some more serious than others. Paul Christiano has an interesting variant on that.
00:11:45
Speaker
Though it's hard to tell, I would say his is a bit of a mixture of value alignment and constraining what the AI can do in a sense, but it is very similar. And so this research agenda is of that last type: getting the aligned, friendly AI, the aligned utility function.
00:12:04
Speaker
In that area, there are what I would call the ones that rely on indirect proxies. This is the idea of, you put Nick Bostrom in a room for 500 years, or a virtual version of that, and hope that you get something aligned at the end of that.
00:12:22
Speaker
There are direct approaches, and this is the basic direct approach, doing everything the hard way in a sense, by defining everything that needs to be defined so that the AI can then assemble an aligned preference function from all the data.
00:12:40
Speaker
Wonderful. So you gave us a good summary earlier of the different parts of this research agenda. Would you like to expand a little bit on the quote fundamental idea behind this specific research project?
Synthesizing Human Preferences
00:12:53
Speaker
There are two fundamental ideas that are not too hard to articulate. The first is that though our revealed preferences could be wrong, though our stated preferences could be wrong, what our actual preferences are, at least in one moment, is what we model inside our head, what we're thinking of as the better option.
00:13:15
Speaker
We might lie, as I say, in politics or in a court of law or just socially, but generally when we know that we're lying, it's because there's a divergence between what we're saying and what we're modeling internally. So it is this internal model which I'm identifying as the place where our preferences lie.
00:13:32
Speaker
And then all the rest of it, the whole convoluted synthesis project, is just basically how do we take these basic pieces and combine them in a way that does not seem to result in anything disastrous and that respects human preferences and meta-preferences, and this is a key thing, actually reaches a result.
00:13:55
Speaker
That's why the research project is designed for having a lot of default actions in a lot of situations. If the person does not have strong meta-preferences, then there's a whole default procedure for how you combine things: preferences about the world and preferences about your identity, say, are by default combined in different ways.
00:14:16
Speaker
If you would want GDP to go up, that's a preference about the world. If you yourself would want to believe something or believe only the truth, for example, that's a preference about your identity. It tends to be that identity preferences are more fragile.
00:14:32
Speaker
So the default is that preferences about the world are just added together, and this overcomes most of the contradictions because very few human preferences are exactly anti-aligned, whereas identity preferences are combined in a more smooth process so that you don't lose too much on any of them. But as I said, these are the default procedures, and they're all defined so that we get an answer, but there's also large abilities for the person's meta-preferences to override the defaults.
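As a rough illustration of these two default rules, here is a minimal sketch. The additive rule for world preferences follows the description above; the particular soft-min formula for identity preferences is my own stand-in for the smoother combination Armstrong describes, not a formula from the agenda.

```python
# Sketch of the two default combination rules described above. The soft-min
# form is a stand-in; the agenda does not pin down one specific formula.
import numpy as np

def combine_world_preferences(utilities, weights):
    """World preferences: a weighted linear sum (contradictions mostly cancel)."""
    return float(np.dot(weights, utilities))

def combine_identity_preferences(utilities, weights, temperature=1.0):
    """Identity preferences: a smooth minimum, so none is sacrificed entirely."""
    u = np.asarray(utilities, dtype=float) * np.asarray(weights, dtype=float)
    # soft-min: tends to min(u) as temperature -> 0, to the mean as it grows
    return float(-temperature * np.log(np.mean(np.exp(-u / temperature))))

# e.g. "GDP goes up" and "less pollution" simply add together, while
# "I believe only true things" and "I am kind" are combined more gently.
print(combine_world_preferences([0.7, -0.2], [1.0, 0.5]))
print(combine_identity_preferences([0.9, 0.1], [1.0, 1.0]))
```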
00:15:03
Speaker
Again, precautions are taken to ensure that an answer is actually reached. Can you unpack what partial preferences are? What you mean by partial preferences and how they're contextualized within human mental models?
00:15:18
Speaker
What I mean by partial preference is mainly that a human has a small model of part of the world. Like, let's say they're going to a movie and they would prefer to invite someone they like to go with them. Within this mental model, there is the movie itself and the presence or absence of the other person.
00:15:40
Speaker
So this is a very narrow model of reality. Virtually the entire rest of the world and definitely the entire rest of the universe does not affect this. It could be very different and not change anything of this. So this is what I call a partial preference. You can't go from this to a general rule of what the person would want to do in every circumstance, but it is a narrow, valid preference.
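As a small data-structure sketch, one could represent such a partial preference roughly as follows; the class and field names here are my own illustration, not notation from the research agenda.

```python
# A minimal sketch of a "partial preference": a judgment within a tiny model
# of the world that is silent about everything outside that model.
from dataclasses import dataclass

@dataclass
class PartialPreference:
    variables: dict          # the only features this mental model contains
    better: dict             # the preferred assignment of those features
    worse: dict              # the dispreferred assignment
    weight: float = 1.0      # how strongly the comparison is felt
    context: str = ""        # the situation that invoked the model

# "Going to this movie with my friend" vs. "going without them":
cinema = PartialPreference(
    variables={"movie": "yes", "friend_present": None},
    better={"movie": "yes", "friend_present": True},
    worse={"movie": "yes", "friend_present": False},
    weight=0.3,
    context="deciding whether to invite a friend to the cinema",
)
# Everything outside `variables` (the rest of the world) is simply unmodelled.
```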
00:16:07
Speaker
"Partial" here refers to two things: first, that it doesn't cover all of our preferences, and secondly, that the model in which it lives only covers a narrow slice of the world. You can make some modifications to this. This is the whole point of the second section, that if the approach works, variations on the synthesis project should not actually result in outcomes that are disastrous at all.
00:16:35
Speaker
If changing the synthesis process a little bit would result in a disaster, then something has gone wrong with the whole approach. But you could, for example, add restrictions like looking for consistent preferences. But I'm starting with the fundamental thing, which is that there is this mental model. There is an unambiguous judgment that one thing is better than another. And then we can go from there in many ways.
00:17:03
Speaker
A key part of this approach is that there is no single fundamental synthesis process that would work. So it is aiming for an adequate synthesis rather than an idealized one, because humans are a mess of contradictory preferences, and because even philosophers have contradictory meta-preferences within their own minds and with each other.
00:17:27
Speaker
and because people can learn different preferences depending on the order in which information is presented to them, for example. Any method has to make a lot of choices, and therefore I'm writing down explicitly as many of the choices that have to be made as I can, so that other people can see what I see the process as entailing.
00:17:49
Speaker
I am quite wary of things that look for reflexive self-consistency, because in a sense, if you define your ideal system as one that's reflexively self-consistent, that's a sort of a local condition in a sense, that the morality judges itself by its own assessment. And that means that you could theoretically wander arbitrarily far in preference space before you hit that.
Complexity in Human Preferences
00:18:16
Speaker
I don't want something that is just defined by this has reached reflective equilibrium. This morality synthesis is now self-consistent. I want something that is self-consistent, and it's not too far from where it started. So I prefer to tie things much more closely to actual human preferences and to explicitly aim for a synthesis process that doesn't wander too far away from them.
00:18:44
Speaker
I see. So the starting point is the person's current evaluative morals, which we're trying to keep the synthesis close to.
00:18:51
Speaker
Yes, I don't think you can say that any synthesis of a human's preferences is intrinsically wrong, as long as it reflects some of the preferences that were inputs into it. However, I think you can say that it is wrong from the perspective of the human that you started with if it strongly contradicts what they would want.
00:19:16
Speaker
Agreement from my starting position is something which I take to be very relevant to the ultimate outcome. There's a bit of a challenge here because we have to avoid, say, preferences which are based on inaccurate facts. So some of the preferences are inevitably going to be removed or changed just because they're based on factually inaccurate beliefs.
00:19:41
Speaker
Some other processes of trying to make consistent what is sort of very vague will also result in some preferences being removed. So you can't just say the starting person has veto power over the final outcome, but you do want to respect their starting preferences as much as you possibly can.
00:20:03
Speaker
So reflecting here on the difficulty of this agenda and on how human beings contain contradictory preferences and models, can you expand a bit how we contain these internal contradictions and how this contributes to the difficulty of the agenda?
00:20:21
Speaker
I mean, humans contain many contradictions within them. Our mood shifts. We famously are hypocritical in favor of ourselves and against the foibles of others. We basically rewrite narratives to allow ourselves to always be heroes.
00:20:40
Speaker
Anyone who's sort of had some experience of a human has had knowledge of when they decided one way or decided the other way or felt that something was important and something else wasn't. And often people just come up with a justification for what they wanted to do anyway, especially if they're in a social situation. And then some people can cling to this justification and integrate that into their morality while behaving differently in other ways.
00:21:07
Speaker
The easiest example are political hypocrites. The anti-gay preacher who sleeps with other men is a stereotype for a reason. But it's not just a contradiction at that level. It's that basically most of the categories in which we articulate our preferences are not particularly consistent.
00:21:29
Speaker
If we throw a potentially powerful AI into this to change the world drastically, we may end up with things that cut across our preferences. For example, suppose that someone created or wanted to create a subspecies of human that was bred to be a slave race.
00:21:49
Speaker
Now, this race did not particularly enjoy being a slave race, but they wanted to be slaves very strongly. In this situation, a lot of our intuitions are falling apart, because we know that slavery is almost always involuntary and is backed up by coercion.
00:22:11
Speaker
We also know that even though our preferences and our enjoyments do sometimes come apart, they don't normally come apart that much. So we're now confronted by a novel situation where a lot of our intuitions are pushing against each other. You also have things like nationalism, for example.
00:22:33
Speaker
Some people have strong nationalist sentiments about their country, and sometimes their country changes. And in this case, what seemed like a very simple, yes, I will obey the laws of my nation, for example, becomes much more complicated as the whole concept of my nation starts to break down.
00:22:53
Speaker
This is the main way that I see preferences to being under-defined. They're articulated in terms of concepts which are not universal and which bind together many, many different concepts that may come apart.
00:23:09
Speaker
So at any given moment, like myself at this moment, the issue is that there's a large branching factor of how many possible future Lucases there can be. At this time currently, and maybe a short interval around this time, as you sort of explore in your paper, there's the sum total of my partial preferences and the partial world models in which these partial preferences are contained.
00:23:34
Speaker
These preferences and models can be expressed differently, and sort of hacked and changed, based on how questions are asked and the order of the questions. I am like a 10,000-faced thing, and I can show you one of my many faces depending on how you push my buttons. And depending on all of the external input that I get in the future,
00:24:00
Speaker
I'm going to express and maybe become more idealized in one of many different paths. The only thing that we have to evaluate which of these many different paths I would prefer is what I would say right now, right? Say my core value is joy or certain kinds of conscious experiences over others. And all I would have for evaluating this many branching thing is say this preference now at this time, but that could be changed in the future. Who knows?
00:24:29
Speaker
I will create new narratives and stories that justify the new person that I am and make sense of the new values and preferences that I have retroactively, like something that I wouldn't actually approve of now, but my new, maybe more evil, version of myself would approve of and create a new narrative for retroactively. Is this sort of helping to elucidate and paint the picture of why human beings are so messy?
00:24:53
Speaker
Yes, we need to separate that into two. The first is that our values can be manipulated by other humans, as they often are, and by the AI itself during the process. But that can be combated to some extent. I have a paper that may soon come out on how to reduce the influence of an AI over a learning process that it can manipulate. So that's one aspect.
00:25:20
Speaker
The other aspect is when you are confronted by a new situation, you can go in multiple different directions, and these things are just not defined. So when I said that human values are contradictory, changeable, manipulable, and under-defined, I was saying that the first three are relatively easy to deal with, but that the last one is not.
00:25:44
Speaker
Most of the time, people have not considered the whole of the situation that they or the world or whatever is confronted with. No situation is exactly analogous to another, so you have to try and fit it into different categories.
00:26:00
Speaker
So if someone dubious gets elected in a country and starts doing very authoritarian things, does this fit in the tyranny box, which should be resisted, or does this fit in the normal process of democracy box, in which case it should be endured and dealt with through democratic means?
00:26:20
Speaker
What'll happen generally is that it'll have features of both, so it might not fit comfortably in either box. And then there's a wide variety of ways for someone to be hypocritical or to choose one side or the other. But the reason that there's such a wide variety of possibilities is because this is a situation that has not been exactly confronted before.
00:26:42
Speaker
So people don't actually have preferences here. They don't have a partial preference over this situation because it's not one that they've ever considered. How they develop one is due to a lot of things: as you say, the order in which information is presented, which category it seems to most strongly fit into, and so on.
00:27:02
Speaker
We are going here for very mild under-definedness. The willing slave race was my attempt to push it out a bit further into something somewhat odd. And then if you consider a powerful AI that is able to create vast numbers of intelligent entities, for example,
00:27:21
Speaker
and reshape society, human bodies, and human minds in hugely transformative ways, we're going to enter sort of very odd situations where all our starting instincts are almost useless. I've actually argued at some point in the research agenda that this is an argument for ensuring that we don't go too far from the human baseline normal into exotic things where our preferences are not well defined.
00:27:51
Speaker
because in these areas, the chance that there is a large negative seems higher than the chance that there's a large positive.
00:28:00
Speaker
Now, I'm talking about things that are very distant in terms of our categories, like the world of Star Trek is exactly the human world from this perspective, because even though they have science fiction technology, all of the concepts and decisions there are articulated around concepts that we're very familiar with, because it is a work of fiction addressed to us now.
00:28:24
Speaker
So when I say not go too far, I don't mean not embrace a hugely transformative future. I am saying not embrace a hugely transformative future where our moral categories start breaking down.
00:28:38
Speaker
In my mind, there's two senses. There's the sense in which we have these models for things and we have all of these necessary and sufficient conditions for which something can be pattern matched to some sort of concept or thing. And we can encounter situations where there's conditions from many different things.
00:28:56
Speaker
being included in the context in a new way, which makes it so that something like goodness or justice is under-defined in the slavery case, because we don't really know initially whether this thing is good or bad. I see it as under-defined in this sense.
00:29:12
Speaker
The other sense is maybe the sense in which my brain is a neural architectural aggregate of a lot of neurons, and the sum total of its firing statistics and specific neural pathways can potentially be identified as containing preferences and models somewhere within there. So is it also true to say that it's under-defined in the sense that the human, viewed not as a thing in the world but as a process in the world, largely constituted by the human brain,
00:29:41
Speaker
even within that process, it's under-defined where in the neural firing statistics or the processing of the person there could ever be something called a concrete preference or value?
00:29:54
Speaker
I would disagree that it is undefined in the second sense. In order to solve the second problem, you need to solve the symbol grounding problem for humans. You need to show that the symbols or the neural pattern firing or the neuron connection or something inside the brain corresponds to some concepts in the outside world.
00:30:18
Speaker
This is one of my sort of side research projects. When I say side research project, I mean I wrote a couple of blog posts on this pointing out how I might approach it. And I point out that you can do this in a very empirical way. If you think that a certain pattern of neural firing refers to, say, a rabbit, you can see whether this thing firing in the brain is predictive of, say, a rabbit in the outside world,
00:30:44
Speaker
or predictive of this person is going to start talking about rabbits soon. In model theory, the actual thing that gives meaning to the symbols is sort of beyond the scope of the math theory. But if you have a potential connection between the symbols and the outside world, you can check whether this theory is a good one or a terrible one.
00:31:07
Speaker
Say you claim this corresponds to hunger, and yet that thing only seems to trigger when someone's having sex, for example.
00:31:16
Speaker
We can say, okay, your model that this corresponds to hunger is terrible. It's wrong. I cannot use it for predicting that the person will eat in the world, but I can use it for predicting that they're having sex. So if I model this as connected with sex, this is a much better grounding of that symbol.
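Here is a toy sketch of that empirical test, with entirely made-up data: it scores a proposed grounding by how much more likely the claimed world-event is when the symbol fires than it is at baseline. The particular scoring rule is just one simple choice of mine, not something specified in Armstrong's posts.

```python
# Hypothetical-data sketch of the empirical grounding test described above.
import numpy as np

def grounding_score(symbol_fires, event_occurs):
    """How much more likely the event is when the symbol fires than at baseline."""
    symbol_fires = np.asarray(symbol_fires, dtype=bool)
    event_occurs = np.asarray(event_occurs, dtype=bool)
    baseline = event_occurs.mean()
    conditional = event_occurs[symbol_fires].mean() if symbol_fires.any() else 0.0
    return float(conditional - baseline)

rng = np.random.default_rng(1)
eating = rng.random(1000) < 0.2                 # made-up behavioral record
sex = rng.random(1000) < 0.1
pattern = sex | (rng.random(1000) < 0.02)       # a firing pattern that tracks sex

print(grounding_score(pattern, eating))         # near 0: "hunger" grounds it badly
print(grounding_score(pattern, sex))            # large: "sex" grounds it much better
```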
00:31:37
Speaker
So using methods like this, and there are some subtleties, I also address Quine's gavagai and connect it to sort of webs of connotation and concepts that go together. But the basic idea is to empirically solve the symbol grounding problem for humans. When I say that things are under-defined, I mean that they are articulated in terms of concepts that are under-defined across all possibilities in the world,
00:32:05
Speaker
not that these concepts could be anything, or we don't know what they mean. Our mental models correspond to something. It's a collection of past experience, and the concepts in our brain are tying together a variety of experiences that we've had. They might not be crisp, they might not be well-defined even
00:32:27
Speaker
if you look at, say, the totality of the universe, but they correspond to something, to some repeated experience, to some concepts, to some thought process that we've had and that we've extracted this idea from.
00:32:40
Speaker
When we do this in practice, we are going to inject some of our own judgments into it. And since humans are so very similar in how we interpret each other and how we decompose many concepts, it's not necessarily particularly bad that we do so. But I strongly disagree that these are arbitrary concepts that are going to be put in by hand.
00:33:07
Speaker
They are going to be, in the main, identified once you have some criteria for tracking what happens in the brain, comparing it with the outside world, and those kinds of things. My concept of, say, a cinema is not an objectively well-defined fact, but what I think of as a cinema, and what I expect in a cinema, and what I don't expect in a cinema, like I expect it to go dark
00:33:35
Speaker
and a projector and things like that, I don't expect that this would be in a completely open space in the Sahara Desert under the sun with no seats and no sounds and no projection. I'm pretty clear that one of these things is a lot more of a cinema than the other.
00:33:55
Speaker
Do you want to expand here a little bit about this synthesis process? The main idea is to try and ensure that no disasters come about. And the main thing that could lead to disaster is the over-prioritization of certain preferences over others.
00:34:14
Speaker
There are other avenues to disaster, but this seems to be the most obvious. The other important part of the synthesis process is that it has to reach an outcome, which means that a vague description is not sufficient. That's why it's phrased in terms of, this is the default way that you synthesize preferences. This way may be modified by certain meta-preferences.
00:34:39
Speaker
The meta-preferences have to be reducible to some different way of synthesizing the preferences. For example, the default synthesis does not particularly over-weight long-term preferences versus short-term preferences.
00:34:53
Speaker
It would prioritize long-term preferences, but not exclude short-term ones. So, "I want to be thin" is not necessarily prioritized over "that's a delicious piece of cake that I'd like to eat right now," for example. But human meta-preferences often prioritize long-term preferences over short-term ones. So this is going to be included, and this is going to change the default balance towards the long-term preferences.
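As a toy sketch of what "reducible to a different way of synthesizing" could look like, here is one hypothetical operator, with names and numbers of my own invention, that simply re-weights long-term partial preferences relative to short-term ones:

```python
# Hypothetical sketch: a meta-preference reduced to a re-weighting of the
# default synthesis, boosting long-term partial preferences.
def apply_meta_preference(partial_prefs, long_term_boost=2.0):
    """partial_prefs: list of dicts with 'weight' and 'horizon' ('long'/'short')."""
    adjusted = []
    for p in partial_prefs:
        factor = long_term_boost if p["horizon"] == "long" else 1.0
        adjusted.append({**p, "weight": p["weight"] * factor})
    return adjusted

prefs = [{"name": "I want to be thin", "horizon": "long", "weight": 1.0},
         {"name": "eat this delicious cake now", "horizon": "short", "weight": 1.0}]
print(apply_meta_preference(prefs))   # long-term weight doubled, short-term kept
```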
00:35:19
Speaker
So powering the synthesis process, how are we to extract the partial preferences and their weights from the person? That's, as I say, the first part of the project, and that is a lot more empirical. This is going to be a lot more looking at what neuroscience says, maybe even what algorithm theory says or what modeling of algorithms say and about what's physically going on in the brain and how this corresponds to internal mental models.
00:35:49
Speaker
There might be things like people noting down what they're thinking and correlating this with changes in the brain. And this is a much more empirical aspect of the process that could be carried out essentially independently from the synthesis project. So a much more advanced neuroscience would be beneficial here.
00:36:10
Speaker
Yes. But even without that, it might be possible to infer some of these things indirectly via the AI. And if the AI accounts well for uncertainties, this will not result in disasters. If it knows that we would really dislike losing something of importance to our values, even if it's not entirely sure what the thing of importance is, it will naturally, with that kind of uncertainty,
Utility Functions in Decision-Making
00:36:37
Speaker
act in a cautious way, trying to preserve anything that could be valuable until such time as it figures out better what we want in this model.
00:36:49
Speaker
So in section two of your paper, synthesizing the preference utility function, within this section, you note that this is not the only way of constructing the human utility function. So can you guide us through this more theoretical section, first discussing what sort of utility function and why a utility function in the first place?
00:37:11
Speaker
One of the reasons to look for a utility function is to look for something stable that doesn't change over time. And there is evidence that consistency requirements will push any form of preference function towards a utility function.
00:37:26
Speaker
and that if you don't have a utility function, you just lose value. So the desire to put this into a utility function is not out of an admiration for utility functions per se, but from a desire to get something that won't further change or won't further drift in a direction that we can't control and have no idea about.
00:37:45
Speaker
The other reason is that as we start to control our own preferences better and have a better ability to manipulate our own minds, we are going to be pushing ourselves towards utility functions because of the same pressures of basically not losing value pointlessly.
00:38:04
Speaker
You can kind of see it in some investment bankers, who have to a large extent constructed their own preferences to be expected money maximizers within a range. And it was quite surprising to see, but human beings are capable of pushing themselves towards that. And this is what repeated exposure to different investment decisions tends to do to you. And it's the correct thing to do in terms of maximizing money.
00:38:32
Speaker
And this is the kind of thing that general pressure on humans, combined with humans' ability to self-modify, which we may develop in the future, is going to be pushing us towards utility functions anyway. So we may as well go all the way and get the utility function directly rather than being pushed into it.
00:38:52
Speaker
So is the view here that the reason why we're choosing utility functions, even when human beings are very far from being utility functions, is that optimizing our choices in mundane scenarios is pushing us in that direction anyway?
00:39:07
Speaker
In part, I mean, utility functions can be arbitrarily complicated and can be consistent with arbitrarily complex behavior. A lot of when people think of utility functions, they tend to think of simple utility functions. And simple utility functions are obviously simplifications that don't capture everything that we value. But complex utility functions can capture as much of the value as we want.
00:39:34
Speaker
What tends to happen is that when people have, say, inconsistent preferences, that they are pushed to make them consistent by the circumstances of how things are presented. Like you might start with the chocolate mousse, but then if offered a trade for the cherry pie, go for the cherry pie, and then if offered a trade for the maple pie, go for the maple pie, but then you won't go back to the chocolate.
00:39:58
Speaker
Or even if you do, you won't continue going round the cycle, because you've seen that there's a cycle and this is ridiculous, and then you stop it at that point. So what we decide when we don't have utility functions tends to be determined by the order in which things are encountered and other contingent things.
00:40:16
Speaker
And as I say, non-utility functions tend to be intrinsically less stable and so can drift. So for all these reasons, it's better to nail down a utility function from the start so that you don't have the further drift and your preferences are not determined by the order in which you encounter things, for example.
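A tiny illustration of why such a cycle leaks value, using the dessert example above; the trading fee and the code are my own framing of the standard money-pump argument.

```python
# Cyclic preferences leak value: an agent that prefers cherry over chocolate,
# maple over cherry, and chocolate over maple will pay a small fee for each
# "upgrade" forever, ending up where it started but poorer.
prefers = {("chocolate", "cherry"): "cherry",
           ("cherry", "maple"): "maple",
           ("maple", "chocolate"): "chocolate"}   # the cycle

def money_pump(start, rounds, fee=0.01):
    held, paid = start, 0.0
    for _ in range(rounds):
        for pair, winner in prefers.items():
            if held in pair and winner != held:   # offered a trade it "wants"
                held, paid = winner, paid + fee
    return held, paid

print(money_pump("chocolate", rounds=3))   # back at chocolate, ~0.09 poorer
```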
00:40:34
Speaker
This is, though, in part, thus a kind of normative preference then, right? To use utility functions in order not to be pushed around like that. Maybe one could have the meta-preference for their preferences to be expressed in the order in which they encounter things.
00:40:48
Speaker
You could have that strong meta-preference, yes, though even that can be captured by a utility function if you feel like doing it. Utility functions can capture pretty much any form of preferences, even the ones that seem absurdly inconsistent. So we're not actually losing anything in theory by insisting that it should be a utility function. We may be losing things in practice in the construction of that utility function.
00:41:16
Speaker
I'm just saying, if you don't have something that is isomorphic with the utility function or very close to that, your preferences are going to drift randomly affected by many contingent factors. You might want that, in which case you should put it in explicitly rather than implicitly. And if you put it in explicitly, it can be captured by a utility function that is conditional on the things that you see and the order in which you see them, for example.
00:41:45
Speaker
So comprehensive AI services and other tool-AI-like approaches to AI alignment, I suppose, avoid some of the anxieties produced by strongly agential AIs with utility functions. Are there alternative goal-ordering or action-producing methods in agents other than utility functions that may have the properties that we desire of utility functions, or is the category of utility functions just so large that it encapsulates much of what is just mathematically rigorous and simple?
00:42:15
Speaker
I'm not entirely sure. Alternative goal structures tend to be quite ad hoc and limited in my practical experience, whereas utility functions or reward functions which may or may not be isomorphic do seem to be universal.
00:42:33
Speaker
There are possible inconsistencies within utility functions themselves if you get a self-referential utility function including your own preferences, for example, but MIRI's work should hopefully clarify those aspects.
00:42:48
Speaker
I came up with an alternative goal structure, which is basically an equivalence class of utility functions that are not equivalent in terms of utility. This could successfully model an agent whose preferences were determined by the order in which things were chosen.
00:43:06
Speaker
But I put this together as a toy model or as a thought experiment. I would never seriously suggest building that. So it just seems that for the moment, most non-utility function things are either ad hoc or under-defined or incomplete,
00:43:24
Speaker
and that most things can be captured by utility functions. So the things that are not utility functions all seem at the moment to be flawed, and the utility functions seem to be sufficiently versatile to capture anything that you would want.
00:43:40
Speaker
This may mean, by the way, that we may lose some of the elegant properties of utility functions that we normally assume, like deontology can be captured by a utility function that assigns 1 to obeying all the rules and 0 to violating any of them.
00:43:55
Speaker
And this is a perfectly valid utility function. However, there's not much in the way of expected utility reasoning with this. It behaves almost exactly like a behavioral constraint: never choose any option that is against the rules.
00:44:12
Speaker
That kind of thing, even though it's technically a utility function, might not behave the way that we're used to utility functions behaving in practice. So when I say that it should be captured as a utility function, I mean formally it has to be defined in this way, but informally it may not have the properties that we informally expect of utility functions.
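A quick sketch of that point, with rules and a lottery that are purely my own examples: a 0/1 "deontological" utility function is formally a utility function, but maximizing its expectation just means maximizing the probability of full compliance, so it behaves like a constraint rather than a trade-off.

```python
# A 0/1 rule-following utility: formally a utility function, behaviorally a
# constraint. Large gains cannot compensate for any chance of rule-breaking.
def deontological_utility(outcome, rules):
    return 1.0 if all(rule(outcome) for rule in rules) else 0.0

def expected_utility(lottery, rules):
    # lottery: list of (probability, outcome) pairs
    return sum(p * deontological_utility(o, rules) for p, o in lottery)

no_lying = lambda o: not o.get("lied", False)
no_theft = lambda o: not o.get("stole", False)

safe = [(1.0, {"lied": False, "stole": False, "gain": 1})]
gamble = [(0.9, {"lied": True, "stole": False, "gain": 100}),
          (0.1, {"lied": False, "stole": False, "gain": 100})]

print(expected_utility(safe, [no_lying, no_theft]))    # 1.0
print(expected_utility(gamble, [no_lying, no_theft]))  # 0.1: the gain is ignored
```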
00:44:35
Speaker
Wonderful. This is a really great picture that you're painting. Can you discuss extending and normalizing the partial preferences? Take us through the rest of section two on synthesizing to a utility function. So the extending is just basically you have, for instance, the preference of going to the cinema this day with that friend versus going to the cinema without that friend.
00:45:00
Speaker
That's an incredibly narrow preference, but you also have preferences about watching films in general, being with friends in general. So these things should be combined, in as much as they can be, into some judgment of what you like to watch, who you like to watch with, and under what circumstances. So that's the generalizing.
00:45:21
Speaker
The extending is basically trying to push these beyond the typical situations. So if there was a sort of virtual reality which really gave you the feeling that other people were present with you, which current virtual reality doesn't tend to, then would this count as being with your friend? On what level of interaction would be required for it to count as being with your friend?
00:45:49
Speaker
That's some of the sort of extending. The normalizing is just basically the fact that utility functions are defined up to scaling, up to multiplying by some positive real constant. So if you want to add utilities together or combine them in a smooth min, combine them in any way, you have to scale the different preferences. And there are various ways of doing this.
00:46:17
Speaker
I failed to find an intrinsically good way of doing it that has all the nice formal properties that you would want,
00:46:26
Speaker
but there are a variety of ways that it can be done, all of which seem acceptable. The one I'm currently using is the mean-max normalization, which is that the best possible outcome gets a utility of one, and the average outcome gets a utility of zero. This is the scaling.
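Concretely, the mean-max normalization described here can be written in a couple of lines; the toy numbers are mine.

```python
# Mean-max normalization: rescale a partial utility so its best outcome
# scores 1 and its average outcome scores 0.
import numpy as np

def mean_max_normalize(utilities):
    u = np.asarray(utilities, dtype=float)
    return (u - u.mean()) / (u.max() - u.mean())   # assumes max > mean

raw = [10.0, 3.0, 5.0]          # one partial preference's scores over 3 outcomes
print(mean_max_normalize(raw))  # [ 1.   -0.75 -0.25]
```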
00:46:47
Speaker
Then the weight of these preferences is just how strongly you feel about it. Do you have a mild preference for going to the cinema with this friend? Do you have an overwhelming desire for chocolate? Once they're normalized, you weigh them and you combine them. Can you take us through the rest of section two here, if there's anything else here that you think is worth mentioning?
00:47:10
Speaker
I'd like to point out that this is intended to work with any particular human being that you point the process at. So there are a lot of assumptions that I made from my non-moral-realist perspective, worried about oversimplification and other things.
00:47:28
Speaker
The idea is that if people have strong meta-preferences themselves, these will overwhelm the default decisions that I've made. But if people don't have strong meta-preferences, then they are synthesized in this way, in the way which I feel is the best to not lose any important human value.
00:47:50
Speaker
There are also judgments about what would constitute a disaster or how we might judge this to have gone disastrously wrong. Those are important and need to be sort of fleshed out a bit more because many of them can't be quite captured within this system. The other thing is that the outcomes may be very different.
00:48:15
Speaker
To choose a silly example, if you are 50% total utilitarian versus 50% average utilitarian, or if you're 45%, 55% either way, the outcomes are going to be very different because the pressure on the future is going to be different. And because the AI is going to have a lot of power, it's going to result in very different outcomes.
00:48:39
Speaker
But from our perspective, if we put 50-50 total utilitarianism and average utilitarianism, we're not exactly 50-50 most of the time. We're kind of, yeah, they're about the same. So 45-55 should not result in a disaster if 50-50 doesn't. So even though, from the perspective of these three mixes,
00:49:04
Speaker
45-55, 50-50, and 55-45, each will look at something that optimizes one of the other two mixes and say, that is very bad from my perspective.
00:49:17
Speaker
However, from our human perspective, we're saying all of them are pretty much okay. Well, we would say none of them are pretty much okay, because they don't incorporate many other of our preferences. But the idea is that when we get all the preferences together, it shouldn't matter much if it's a bit fuzzy. So even though the outcome will change a lot if we shift it a little bit, the quality of the outcome shouldn't change a lot.
00:49:43
Speaker
And this is connected with a point that I'll put up in section three, that uncertainties may change the outcome a lot, but again, uncertainties should not change the quality of the outcome. And the quality of the outcome is measured in a somewhat informal way by our current preferences. So moving along here into section three, what can you tell us about the synthesis of the human utility function in practice?
00:50:13
Speaker
So first of all, there's, well, let's do this project. Let's get it done.
Practical Applications & AI Alignment
00:50:18
Speaker
But we don't have perfect models of the human brain. We haven't grounded all the symbols. What are we going to do with the great uncertainties? So that's arguing that even with the uncertainties, this method is considerably better than nothing. And you should expect it to be pretty safe and somewhat adequate even with great uncertainties.
00:50:40
Speaker
The other part is I'm showing how thinking in terms of the human mental models can help to correct and improve some other methods, like revealed preferences, or stated preferences, or locking the philosopher in a box for a thousand years.
00:50:59
Speaker
All methods fail, and we actually have a pretty clear idea of when they fail. Revealed preferences fail because we don't model bounded rationality very well, and even when we do, we know that sometimes our preferences are different from what we reveal. Stated preferences fail in situations where there's strong incentives not to tell the truth, for example. We could deal with these by sort of adding all the counterexamples as special cases.
00:51:29
Speaker
Or we could add the counterexamples as something to learn from, or what I'm recommending is that we add them as something to learn from while stating that the reason that this is a counterexample is that there is a divergence between whatever we're measuring and the internal model of the human. The idea being that it is a lot easier to generalize when you have an error theory rather than just lists of error examples.
00:51:57
Speaker
Right. And so there's also this point of view here that you're arguing: that this research agenda and perspective is also potentially very helpful for things like corrigibility and low impact research and Christiano's distillation and amplification, which you claim all seem to be methods that require some simplified version of
00:52:19
Speaker
the human utility function. So any sorts of conceptual insights or systematic insights which are generated through this research agenda in your view seem to be able to make significant contributions to other research agendas which don't specifically take this lens.
00:52:36
Speaker
I feel that even something like corrigibility can benefit from this, because in my experience, things like corrigibility, things like low impact, have to define to some extent what is important and what can be categorized as unimportant.
00:52:53
Speaker
A low impact AI cannot be agnostic about our preferences. It has to know that a nuclear war is a high impact thing, whether or not we'd like it, whereas turning on an orange light that doesn't go anywhere is a low impact thing. But there's no real intrinsic measure by which one is high impact and the other is low impact. Both of them have ripples across the universe.
00:53:20
Speaker
So I think I phrased it as Hitler, Gandhi, and Thanos all know what a low-impact AI is, all know what an Oracle AI is, or know the behavior to expect from it. So it means that we need to get some of the human preferences in, the bit that tells us that nuclear wars are high-impact.
00:53:41
Speaker
But we don't need to get all of it in because since so many different humans will agree on it, you don't need to capture any of their individual preferences. So it's applicable to these other methodologies. And it's also your belief, and I'm quoting you here, you say that I'd give a 10% chance of it being possible this way, meaning through this research agenda, and a 95% chance that some of these ideas will be very useful for other methods of alignment.
00:54:10
Speaker
So just adding that here as your credences for the skillfulness of applying insights from this research agenda to other areas of AI alignment.
00:54:24
Speaker
In a sense, you could think of this research agenda in reverse. Imagine that we have reached some outcome that is a positive outcome. We have got alignment, and we haven't reached it through a single trick, and we haven't reached it through the sort of tool AIs or software as a service or those kinds of approaches. We have reached an actual alignment.
00:54:52
Speaker
It therefore seems to me all the problems that I've listed, or almost all of them, will have had to have been solved. Therefore, in a sense, much of this research agenda needs to be done directly or indirectly in order to achieve any form of sensible alignment. Now, the term directly or indirectly is doing a lot of the work here, but I feel that quite a bit of this will have to be done directly.
00:55:20
Speaker
Yeah, I think that that makes a lot of sense. It seems like there's just a ton about the person that is confused, and it's difficult to understand what we even mean here in terms of our understanding of the person, and also the broader definitions included in alignment. So given this optimism that you've stated here surrounding the applicability of this research agenda on synthesizing a human's preferences into a utility function,
00:55:48
Speaker
What can you say about the limits of this method?
Limits of Synthesized Preferences
00:55:53
Speaker
Any pessimism to inject here? So I have a section four, which is labeled as the things that I don't address. Some of these are actually a bit sneaky, like the section on how to combine the preferences of different people. Because if you read that section, it basically lays out ways of combining different people's preferences.
00:56:15
Speaker
But I've put it in there to say, I don't want to talk about this issue in the context of this research agenda, because I think this just diverts from the important work here. And there are a few of those points. But some of them are genuine things that I think are problems. And the biggest is the fact that there's a sort of informal Gödel statement in humans about their own preferences.
00:56:41
Speaker
How many people would accept a computer synthesis of their preferences and say, yes, that is my preferences, especially when they can explore it a bit and find the counterintuitive bits? I expect humans in general to reject the AI-assigned synthesis no matter what it is, pretty much just because it was synthesized and then given to them,
00:57:05
Speaker
I expect them to reject or want to change it. We have a natural reluctance to accept the judgment of other entities about our own morality. And this is a perfectly fine meta-preference that most humans have.
00:57:22
Speaker
and I think all humans have to some degree, and I have no way of capturing it within the system, because it's basically a Gödel statement in a sense. The best synthesis of this process is the one that wasn't used. The other thing is that people want to continue with moral learning and moral improvement.
00:57:42
Speaker
And I've tried to decompose moral learning and moral improvement into different things and show that some forms of moral improvement and moral learning will continue, even when you have a fully synthesized utility function. But I know that this doesn't capture everything of what people mean by this. And I think it doesn't even capture everything of what I would mean by this. So again, there is a large hole in there.
00:58:09
Speaker
There are some other holes of a sort of more technical nature, like infinite utilities, stability of values, and a bunch of other things. But conceptually, I'm the most worried about these two aspects: the fact that you would reject what values you were assigned, and the fact that you'd want to continue to improve, and how do we define continuing improvement that isn't just the same as, well, your values may drift randomly.
00:58:38
Speaker
What are your thoughts here? Feel free to expand on both the practical and theoretical difficulties of applying this across humanity and aggregating it into a single human species-wide utility function. Well, the practical difficulties are basically politics, how to get agreements between different groups.
00:58:58
Speaker
People might want to hang on to their assets or their advantages. Other people might want stronger equality. Everyone will have broad principles to appeal to. Basically, there's going to be a lot of fighting over the different weightings of individual utilities.
00:59:14
Speaker
The hope there is that, especially with a powerful AI, that the advantage might be sufficiently high, that it's easier to do something where everybody gains, even if the gains are uneven, than to talk about how to divide a fixed-size pie.
00:59:32
Speaker
The theoretical issue is mainly what do we do with anti-altruistic preferences. I'm not talking about selfish preferences. Those are very easy to deal with. That's just basically competition for the utility, for the resources, for the goodness, but actual anti-altruistic utility. So someone who wants harm to befall other people.
00:59:53
Speaker
and also to deal with altruistic preferences because you shouldn't penalize people for having altruistic preferences. You should, in a sense, take out the altruistic preferences and put that in the humanity one and allow their own personal preferences some extra weight.
Altruistic vs Anti-Altruistic Preferences
01:00:10
Speaker
But anti-altruistic preferences are a challenge, especially because it's not quite clear where the edge is. Now, if you want someone to suffer, that's an anti-altruistic preference.
01:00:20
Speaker
If you want to win a game and part of your enjoyment of the game is that other people lose, where exactly does that lie? And that's a very natural preference. You might become a very different person if you didn't get at least some mild enjoyment from other people losing or from the status boost there.
01:00:42
Speaker
This is a bit tricky. You might sort of just tone them down so that mild anti-altruistic preferences are perfectly fine. So if you want someone to lose to your brilliant strategy at chess, that's perfectly fine. But if you want someone to be dropped slowly into a pit of boiling acid, then that's not fine.
01:01:01
Speaker
The other big question is population ethics. How do we deal with new entities? And how do we deal with other conscious or not quite conscious animals around the world? So who gets to count as a part of the global utility function?
01:01:20
Speaker
So I'm curious to know about concerns over aspects of this alignment story, or any kind of alignment story, involving lots of leaky abstractions. In Rich Sutton's short essay The Bitter Lesson, he discusses how the bitter lesson of computer science is that leveraging computation over human domain-specific ingenuity has broadly been more efficacious for producing very powerful results.
01:01:48
Speaker
We seem to have this tendency or partiality towards trying to imbue human wisdom or knowledge or unique techniques or kinds of trickery or domain-specific insight into architecting the algorithm and the alignment process in specific ways, whereas maybe just throwing tons of computation at the thing has been more productive historically.
01:02:12
Speaker
Do you have any response here to concerns over these concepts being leaky abstractions, or over the categories you use to break down human preferences not fully capturing what our preferences are?
01:02:23
Speaker
Well, in a sense, that's part of the research project, and part of the reason why I warned against going to distant worlds where, in my phrasing, the web of connotations breaks down, or in your phrasing, the abstractions become too leaky. This is also part of why, even though the second part is presented as if it were the theoretical way of doing it,
01:02:45
Speaker
I also think there should be a large experimental aspect to it, to test where this is going, where it goes surprisingly wrong or surprisingly right. The second part, though it's presented as "this is basically the algorithm," should be tested and checked and played around with to see how it goes.
01:03:07
Speaker
For the bitter lesson, the difference here, I think, is that in the case of the bitter lesson, we know what we're trying to do. We have objectives, whether it's winning at a game, whether it's classifying images successfully, whether it's classifying some other feature successfully, and we have some criteria for the success of it.
01:03:29
Speaker
The constraints I'm putting in by hand are not so much trying to put in the wisdom of the human, or the wisdom of Stuart. There's some of that, but it's trying to avoid disasters. The disasters cannot be avoided just with more data.
01:03:49
Speaker
You can get to many different points from the data, and I'm trying to carve away lots of them. Don't oversimplify, for example. To go back to the bitter lesson, you could say that you can tune your regularizer. What I'm saying is, have a very weak regularizer, for example. This is not something that the bitter lesson applies to, because in the real world on the problems where the bitter lesson applies,
01:04:15
Speaker
you can see whether hand-tuning the regularizer works, because you can check what the outcome is and compare it with what you want. Here you can't compare it with what you want, because if we knew what we wanted, we'd pretty much have it solved. So what I'm saying is: don't put in a strong regularizer, for these reasons.
01:04:35
Speaker
The data can't tell me that I need a stronger regularizer because the data has no opinion, if you want, on that. There is no ideal outcome to compare with.
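A quick sketch of the regularizer point, assuming a simple ridge-style fit; the numbers and names are hypothetical, and the only thing being illustrated is that the strength has to be fixed by hand, because there is no held-out "true utility" to validate against.

```python
import numpy as np

# Toy illustration of the regularizer point. In ordinary supervised learning
# you can tune the regularization strength lam against held-out data, because
# you know what a good outcome looks like. In preference synthesis there is no
# ground-truth utility function to validate against, so the advice amounts to:
# fix lam to be weak in advance, rather than tuning it towards a "right" answer.

def fit_preference_weights(X, y, lam):
    """Ridge-style fit: preference weights explaining observed choices y from
    features X, with lam controlling how strongly weights shrink towards zero."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
weak = fit_preference_weights(X, y, lam=1e-4)    # weak, chosen by hand in advance
strong = fit_preference_weights(X, y, lam=10.0)  # would need validation to justify
print(weak, strong)
```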
01:04:47
Speaker
There might be some problems, but problems like our preferences not looking the way my logic, or our logic, expects point towards the method failing, not towards the method needing more data and fewer restrictions. I mean, I'm sure part of this research agenda is also further clarification and refinement of the taxonomy and categories used, right, which could potentially be elucidated by progress in neuroscience?
01:05:17
Speaker
Yes. And there's a reason that this is version 0.9 and not yet version one. I'm getting a lot of feedback and am going to refine it before trying to put it out as version one. It's in alpha or beta at the moment; it's a pre-release agenda. Well, hopefully this podcast will spark a lot more interest in and knowledge about this research agenda, and hopefully we can further contribute to bettering it.
01:05:41
Speaker
When I say that this is in alpha or in beta, that doesn't mean don't criticize it. Do criticize it, especially if these can lead to improvements, but don't just assume that this is fully set in stone yet.
01:05:56
Speaker
Right. So that's sort of framing this whole conversation in the light of epistemic humility and willingness to change. So, two more questions here and then we'll wrap
Reflective Equilibrium Risks
01:06:08
Speaker
up. So reflective equilibrium. You say that this is not a philosophical ideal. Can you expand here about your thoughts on reflective equilibrium and how this process is not a philosophical ideal?
01:06:22
Speaker
Reflective equilibrium is basically where you refine your own preferences, make them more consistent, and apply them to yourself until you've reached a point where your meta preferences and your preferences are all smoothly aligned with each other. What I'm doing is a much more messy synthesis process.
01:06:41
Speaker
And I'm doing it in order to preserve as much as possible of the actual human preferences. It is very easy to reach reflective equilibrium by, for instance, having completely flat preferences or very simple preferences. These tend to be in reflective equilibrium with themselves. And pushing towards this is, in my view, a push towards excessive simplicity and a great risk of losing valuable preferences.
01:07:09
Speaker
The risk of losing valuable preferences seems to be a much higher risk than the gain in terms of simplicity or elegance that you might get. There is no reason that the kludgy human brain and its mess of preferences should lead to some simple reflective equilibrium.
01:07:30
Speaker
In fact, you could say that this is an argument against reflective equilibrium, because it means that many different starting points, many different minds with very different preferences, will lead to similar outcomes, which basically means that you're throwing away a lot of the details of your input data.
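A deliberately crude illustration of that worry, not a model of reflective equilibrium itself: if "refining towards consistency" is caricatured as repeatedly smoothing preferences towards their average, very different starting points collapse to nearly identical, nearly flat outcomes. The smoothing update is a hypothetical stand-in for "making preferences more consistent."

```python
import numpy as np

# Caricature of the worry above: repeatedly smoothing each preference towards
# the mean reaches a stable point that is nearly flat and nearly the same for
# very different starting preferences: the detail of the input is thrown away.

def iterate_to_equilibrium(prefs, steps=200, rate=0.2):
    p = np.array(prefs, dtype=float)
    for _ in range(steps):
        p = p + rate * (p.mean() - p)  # move each preference towards the mean
    return p.round(3)

print(iterate_to_equilibrium([3.0, -1.0, 0.5, 2.0]))   # -> [1.125 1.125 1.125 1.125]
print(iterate_to_equilibrium([10.0, -8.0, 1.0, 2.5]))  # -> [1.375 1.375 1.375 1.375]
```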
01:07:48
Speaker
So I guess two things. One is that this process clarifies and improves on incorrect beliefs in the person, but it does not correct what you or I might call moral wrongness. So if some human is evil, then the synthesized human utility function will reflect that evilness. My second question is that an idealization process is very alluring to me. Is it possible to synthesize
01:08:14
Speaker
the human utility function and then run it internally on the AI and then see what we get in the end and then check if that's a good thing or not.
01:08:21
Speaker
Yes. In practice, this whole thing, if it works, is going to be very experimental and we're going to be checking the outcomes. And there's nothing wrong with wanting to be an idealized version of yourself, especially if it's just one step of idealization. What worries me is the version where you are the idealized version of the idealized version of the idealized version of the idealized version, et cetera, of yourself, where there is a great risk of losing yourself and the input there.
01:08:51
Speaker
This was the idealization process I described, where I started off wanting to be more compassionate, spreading my compassion to more and more things at each step, eventually coming to value insects as much as humans, then at the next step valuing rocks as much as humans, and then removing humans because of the damage that they can do to mountains.
01:09:11
Speaker
That was one process, or something along those lines is what I can see happening, if you are constantly idealizing yourself without any criterion for "stop idealizing now" or "you've gone too far from where you started." Your ideal self is pretty close to yourself.
01:09:29
Speaker
But the triple-idealized version of your idealized idealized self, and so on, starts becoming pretty far from your starting point. And these are the sorts of areas where I fear oversimplification, or trying to get to reflective equilibrium at the expense of other qualities, and so on. These are the places I fear this pushes towards.
01:09:53
Speaker
Can you make clearer what, in your view, failed in that idealization process where Mahatma Armstrong turns into a complete negative utilitarian?
01:10:05
Speaker
It didn't even turn into a negative utilitarian; it just turned into someone who valued rocks as much as they valued humans, and therefore eliminated humans on utilitarian grounds in order to preserve rocks, or to preserve insects if you want to go down one level of credibility. The point of this is that this was the outcome of someone who wants to be more compassionate, continuously wanting to make more compassionate versions of themselves that still want to be more compassionate, and so on.
01:10:34
Speaker
It went too far from where it had started. It's one of many possible narratives, but the point is that the only way of resisting something like that happening is to tie the higher levels to the starting point.
01:10:48
Speaker
A better thing might be to say: I want to be what my current self would think is good, and what my idealized self would think was good, and what the idealized idealized self would think was good, and so on. That kind of thing could work. But just idealizing, without ever tying it back to the starting point, to what compassion meant for the first entity and not what it meant for the nth entity, is the problem that I see here.
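One way to picture "tying the higher levels back to the starting point," as a hedged sketch: weight each level of idealization both by how far it sits from the start and by how much the original self endorses it. The decay rate and the endorsement function below are hypothetical.

```python
# Hedged sketch of "tying the higher levels back to the starting point": the
# n-th idealized self's values only count insofar as the original (level 0)
# self endorses them, and with a weight that decays with distance from the
# start. The decay rate and endorsement function are hypothetical choices.

def anchored_weights(num_levels, endorsement_by_original, decay=0.5):
    """endorsement_by_original(j): in [0, 1], how much the original self
    endorses the values of the j-th idealized self. Returns normalized
    weights over levels 0..num_levels-1."""
    raw = [
        (decay ** j) * (1.0 if j == 0 else endorsement_by_original(j))
        for j in range(num_levels)
    ]
    total = sum(raw)
    return [r / total for r in raw]

# Example: the original self endorses each further idealization less and less.
print(anchored_weights(4, lambda j: 1.0 / (1 + j)))
```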
01:11:17
Speaker
If I think about all possible versions of myself across time and I just happen to be one of them, this just seems to be a meta-preference to bias towards the one that I happen to be at this moment, right?
01:11:28
Speaker
We have to make a decision as to which preferences to take, and we may as well take the ones we have now, because if we try to take into account our future preferences, we start to come a cropper with the manipulable aspect of our preferences. In fact, these could be literally anything.
01:11:49
Speaker
There is a future Stuart who is probably a Nazi, because you can apply a certain amount of pressure to transform my preferences. I would not want to endorse their preferences now. There are future Stuarts who are saints,
01:12:05
Speaker
whose preferences I might endorse. So if we're deciding which future preferences we're accepting, we have to decide according to criteria, and criteria that are at least in part what we have now.
01:12:20
Speaker
We could sort of defer to our expected future selves if we sort of say, I expect a reasonable experience of the future, define what reasonable means, and then average out our current preferences with our reasonable future preferences.
01:12:38
Speaker
If we can define what we mean by reasonable, then yes, we can do this. This is also a way of doing things, and if we do it this way, it will most likely be non-disastrous. If doing the synthesis process with our current preferences is non-disastrous, then doing it with the average of our future reasonable preferences is also going to be non-disastrous. This is one of the choices that you could choose to put into the process.
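A minimal sketch of that "average with reasonable future preferences" option, assuming preferences can be written as simple numeric weights per issue; the blending weight, and which futures count as "reasonable," are both things the actual process would have to define.

```python
# Minimal sketch of averaging current preferences with reasonable future
# preferences. The representation (numbers per issue), the blending weight,
# and the notion of which futures are "reasonable" are hypothetical choices.

def blend_preferences(current, reasonable_futures, future_weight=0.5):
    """current: dict {issue: value}; reasonable_futures: list of such dicts."""
    if not reasonable_futures:
        return dict(current)
    avg_future = {
        issue: sum(f.get(issue, 0.0) for f in reasonable_futures) / len(reasonable_futures)
        for issue in current
    }
    return {
        issue: (1 - future_weight) * current[issue] + future_weight * avg_future[issue]
        for issue in current
    }

now = {"fairness": 1.0, "tradition": 0.6}
futures = [{"fairness": 1.2, "tradition": 0.4}, {"fairness": 0.8, "tradition": 0.5}]
print(blend_preferences(now, futures))  # {'fairness': 1.0, 'tradition': 0.525}
```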
01:13:07
Speaker
Right. So we can be mindful here that we'll have lots of meta preferences about the synthesis process itself.
01:13:13
Speaker
Yes, you could put it as a meta preference, or you can put it explicitly in the process if that's the way you would prefer to do it. The whole process is designed strongly around getting an answer out of the process. So, yes, we could do this. Let's see if we can do it for one person over a short period of time, and then we can talk about how we might take into account considerations like that, including, as I say, putting this in the meta preferences themselves.
01:13:43
Speaker
This is basically another version of moral learning. We're kind of okay with our values shifting, but not okay with our values shifting arbitrarily. We really don't want our values to completely flip from what we have now, though there are some aspects we're more okay with changing. This is part of the complicated question of how you do moral learning.
01:14:06
Speaker
All right, beautiful. Stuart, contemplating all this is really quite fascinating, and I just think that in general humanity has a ton more thinking and self-reflection to do in order to get this process really right. And I think this conversation has really helped elucidate that to me, along with all of my contradictory preferences and my multitudes, within the context of my partial and sometimes erroneous mental models. Reflecting on that also has me feeling
01:14:32
Speaker
maybe slightly depersonalized and a bit ontologically empty, but it's beautiful and fascinating.
Conclusion: Reflections & Invitations
01:14:40
Speaker
Do you have anything here that you would like to make clear to the AI alignment community about this research agenda? Any last few words that you would like to say or points to clarify?
01:14:53
Speaker
There are people who disagree with this research agenda, some of them quite strongly, and some of them have alternative approaches. I like the fact that they are researching other alternatives. If they disagree with the agenda and want to engage with it, the best engagement that I could see is pointing out why bits of the agenda are unnecessary, or how alternative solutions could work.
01:15:22
Speaker
You could also point out that maybe it's impossible to do it this way, which would also be useful. But if you think you have a solution, or the sketch of a solution, then pointing out which bits of the agenda you solve differently would be a very valuable exercise. In terms of engagement, do you prefer people writing responses on the AI Alignment Forum or LessWrong? Emailing me is also fine. I will eventually answer every non-crazy email.
01:15:52
Speaker
Okay, wonderful. I really appreciate all of your work here on this research agenda and all of your writing and thinking in general. You're helping to create beautiful futures with AI and you're much appreciated for that. If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We'll be back again soon with another episode in the AI alignment series.