
#26 Iwan Williams: Do Language Models Have Intentions?

AITEC Podcast

In this episode of the AITEC podcast, Sam Bennett speaks with philosopher of mind and AI researcher Iwan Williams about his paper “Intention-like representations in language models?” Williams is a postdoctoral researcher at the University of Copenhagen and received his PhD from Monash University.

The conversation explores whether large language models exhibit internal representations that resemble intentions, as distinct from beliefs or credences. Focusing on features such as directive function, planning, and commitment, Williams evaluates several empirical case studies and explains why current models may appear intention-like in some respects while falling short in others. The discussion also considers why intentions matter for communication, safety, and our broader understanding of artificial intelligence.

For more, visit ethicscircle.org. 

Transcript

Introduction to Iwan Williams and His Work

00:00:16
Speaker
Welcome back to the AITEC Podcast. Today, I'm sharing a conversation with Iwan Williams. Iwan is a philosopher of mind and artificial intelligence, and a researcher at the University of Copenhagen. Specifically, he's working at the Center for Philosophy of Artificial Intelligence.
00:00:31
Speaker
He earned his PhD at Monash University, and in this episode we're going to discuss his article, "Intention-Like Representations in Language Models?" The central question we explore is whether large language models exhibit anything that genuinely resembles intentions, or whether such descriptions, describing a model as having a certain intention, are ultimately just metaphor. All right, I hope you enjoy the conversation.

Iwan's Journey into Philosophy

00:01:06
Speaker
Just to get started, maybe could you briefly introduce yourself, where you're from and how you got interested in philosophy? Sure. So I'm originally from Wales in the UK and...
00:01:18
Speaker
I suppose I kind of stumbled into philosophy in a way. I was always into big, deep questions, navel-gazing, you might say. And when it came to choosing a university degree, I landed on philosophy, not knowing all too much about it, and ended up loving it, and have pursued it since then into graduate studies. I did my PhD at Monash in Australia, in Melbourne, and I've now made my way back to Europe. I'm now in Copenhagen doing a postdoc.
00:01:47
Speaker
My interests began mainly in philosophy of mind and philosophy of cognitive science, but in the last couple of years I've been thinking about AI, like a lot of people, like yourself. So yeah, that's where my intellectual interest is at the moment.
00:02:03
Speaker
Awesome. Yeah. And so basically, in this particular paper that we're going to be discussing, you're dealing with the question of: do these large language models have mental states? Specifically, you're thinking about a specific kind of mental state, the mental state of an intention, rather than, say, a belief.

Understanding Large Language Models

00:02:25
Speaker
So cool. So before we get more into that, maybe just to get us oriented, we can start with a really simple question just to remind listeners: what is a large language model?
00:02:37
Speaker
Of course. So most people will have encountered large language models indirectly, even if they've not come across that term. These are the core technologies underlying chatbots like ChatGPT and Claude and Gemini.
00:02:52
Speaker
And what they are is machine learning models. They're a particular type of machine learning model: models designed with the goal of predicting or processing or generating language or text, as opposed to some other type of model that might be dealing with images or different kinds of data.
00:03:14
Speaker
And under the hood, these are artificial neural networks, so called because they're very loosely inspired by the structure of brains, of neurons in the brain. In particular, these large language models often use a transformer architecture, a particular type of neural network, the details of which we don't need to get into here. The way these systems work, at a basic level, is that information is processed through these networks in a way that's determined by a bunch of mathematical operations. And these aren't, it's always worth emphasizing, not everybody is aware of this, these aren't hand-coded by engineers.
00:03:59
Speaker
Engineers aren't fixing the minute details of how exactly a language model works. Rather, as the name machine learning implies, certain solutions to problems, in this case predicting and generating text, are learned or emerge by the model being trained on huge amounts of data using some sophisticated learning algorithms.
00:04:22
Speaker
So often in the case of language models, that might mean feeding the model a huge corpus, a bunch of text taken from the internet, and then using a kind of next-token prediction task, where the model is updated based on its ability to guess the next word in a string of words.
00:04:43
Speaker
And what you get if you do this millions or billions of times ah using the right kind of architecture is a model that's quite impressive at mimicking the kinds of responses that a human might give in response to some text inputs.
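To make that training setup concrete, here is a minimal sketch of the next-token prediction objective described above. The toy model, vocabulary size, and data are illustrative placeholders, not details of any particular system.

```python
# A minimal sketch of next-token prediction training (illustrative placeholders only).
import torch
import torch.nn.functional as F

vocab_size = 50_000
model = torch.nn.Sequential(                      # stand-in for a real transformer LM
    torch.nn.Embedding(vocab_size, 256),
    torch.nn.Linear(256, vocab_size),
)

tokens = torch.randint(0, vocab_size, (1, 128))   # one tokenized snippet from a corpus
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # the model must guess each next token

logits = model(inputs)                            # (batch, positions, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # gradients nudge the weights; repeat billions of times
```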
00:05:01
Speaker
So it gives kind of useful, or broadly factually correct, answers to text-based questions or requests. What's less clear, and we'll get into that a little bit today, is exactly how they do that in terms that are actually comprehensible to us. We can talk about the specific matrix multiplications that go on inside the system, but that doesn't actually tell us all that much about how the model is solving some particular task or problem.

Do Language Models Mimic Human Minds?

00:05:33
Speaker
Yeah, and it's interesting that it's so successful at imitating our linguistic abilities that it makes you wonder: is it doing it in a similar way to how we're doing it? So one philosophical question one might want to ask with respect to language models is, do they have mental states? So yeah, can you introduce us to that type of question?
00:06:07
Speaker
Yeah, of course. So mental state is somewhat of a term of art that philosophers use, but it's fairly self-explanatory. We're just talking about states that arise in our mind. So that could be beliefs or desires, or perhaps perceptual experiences, or pain or bodily experiences, emotions, and so on.
00:06:27
Speaker
So there's a whole zoo of different types of mental states that philosophers are interested in for different reasons. And we might think of some of those as having a distinctive experiential aspect, some kind of conscious phenomenology or conscious character that goes along with something like the experience of pain.
00:06:46
Speaker
But there are also aspects of the causal role or functional role of different mental states. So something like a belief might be characterized more in terms of the kinds of states or conditions that trigger a certain belief, or the kinds of effects on behavior that having a certain belief generates. And yeah, so that's what we mean by mental states.
00:07:11
Speaker
The question about language models, as you point out, is: if they're so sophisticated in terms of their behavior, in terms of their ability to, as you say, mimic the linguistic capacities of humans, there's a question of whether they're doing it in something like the way that we are, which would be in a way that's mediated by mental states like beliefs and desires and intentions.
00:07:38
Speaker
Great. Yeah. So it's really important to think about that distinction you made: you can take a mental state and describe it in terms of its phenomenology, but then you could also describe it in terms of its causal role.
00:07:54
Speaker
So there's a phenomenological description you could give related to intending to drink coffee, let's say.
00:08:05
Speaker
You might have an urge toward the mug, the thought of drinking coffee; it's something that makes you feel like it's something to be done. Whereas for the causal role description, you might talk about how...
00:08:22
Speaker
insofar as I have an intention to drink coffee, that's going to increase the probability that I reach for the mug, or that I not do something else, suppressing other competing options for action. So anyway, I think it's important to highlight that distinction between a causal role description of a mental state and a phenomenological description.
00:08:45
Speaker
Yep, sure. Great. So people might find it interesting that, I think among researchers, philosophers for example, many think you can say that large language models have something like beliefs and credences. So could you explain that line of reasoning?

Do Language Models Have Beliefs?

00:09:11
Speaker
Why think a large language model has something like a belief or a credence?
00:09:17
Speaker
So we should maybe be a little cautious about exactly how widely subscribed a view like this is. I think there would be many people who would not sign up to this idea that language models have beliefs and credences, but certainly when we look at AI research, researchers are very comfortable talking about knowledge states in language models, and occasionally beliefs as well, using that term explicitly.
00:09:47
Speaker
And there is a question of whether this is just a kind of loose talk, a metaphorical way of speaking, you know, what ChatGPT is thinking about or what it knows about the world.
00:10:01
Speaker
But I think there's some reason to take it somewhat seriously. Just at the behavioral level, if we look at how a language model might respond to different kinds of fact-based questions, we might be able to elicit answers from ChatGPT that look like it has the belief that Paris is the capital of France, for instance. We could ask about that particular fact in lots of different ways and get answers that are broadly consistent with what you would expect somebody to say if they had that particular belief.
00:10:37
Speaker
But maybe slightly more compelling evidence comes when we look under the hood in these language models. There's been a line of research trying to uncover something like belief states internal to the workings of language models.
00:10:50
Speaker
And what we find, or one line of evidence is, the states that are induced by false statements, or more carefully, false statements which the model is likely to respond to as false,
00:11:07
Speaker
those internal states look very different from the internal states that are induced by true statements. So they almost seem to cluster in different subspaces within the internal encoding scheme of the model.
00:11:23
Speaker
And more than that, not only are we able to distinguish these two different kinds of internal states from the outside, they seem to play a causal role in the model's responding to sentences as true or false. So we can actually intervene on these internal states and make it start responding as if it believes that Paris isn't the capital of France, by pushing that internal representation towards the false subspace and moving it more in that direction.
00:11:54
Speaker
So some have drawn the inference that, well, this is acting in something like the way that a belief does in human psychology: it's some kind of internal state that's encoding something about the truth or falsity of some statement or some fact about the world.
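As a rough illustration of the kind of probing and intervention evidence being described, here is a toy sketch: synthetic arrays stand in for a model's internal activations on true and false statements, a linear probe separates them, and a steering step pushes one activation across the probe's boundary. Nothing here reproduces the actual studies; it only shows the logic.

```python
# Toy sketch of a "truth probe" plus a steering intervention (synthetic stand-in data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512
acts_true = rng.normal(+0.5, 1.0, size=(200, d_model))    # stand-ins for activations on true statements
acts_false = rng.normal(-0.5, 1.0, size=(200, d_model))   # stand-ins for activations on false statements

X = np.vstack([acts_true, acts_false])
y = np.array([1] * 200 + [0] * 200)                        # 1 = "true", 0 = "false"
probe = LogisticRegression(max_iter=1000).fit(X, y)        # the two clusters are linearly separable
print("probe accuracy:", probe.score(X, y))

# Intervention: push one "true" activation just past the probe's decision boundary,
# mimicking the causal edits that flip how the model treats a statement.
w, x = probe.coef_[0], acts_true[0]
margin = probe.decision_function(x[None])[0]
x_steered = x - (margin + 1.0) * w / np.dot(w, w)
print("steered activation reads as:", "true" if probe.predict(x_steered[None])[0] else "false")
```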
00:12:10
Speaker
But there's been a bit of back and forth about whether that's enough to count as a belief and so on. Certainly, though, beliefs and credences are the sorts of things that do get talked about a bit more than something like intentions, which is partly what inspired this line of research.

Do Language Models Have Intentions?

00:12:28
Speaker
Awesome. Great. That's really helpful. So, okay, your line of research that we want to talk about is really specifically about intentions. You just reviewed a little bit of the line of reasoning for thinking that language models have either full-blown beliefs or maybe something approximating them. So could you introduce us to your question of whether
00:12:52
Speaker
language models possess intentions? Yeah, I'd be happy to. So maybe it's worth first distinguishing beliefs from intentions in a coarse-grained way. Beliefs are mental states that try to in some way track the way the world is. We're trying to get an accurate picture of reality by forming beliefs that correspond with that reality, that encode contents that are true about the world.
00:13:22
Speaker
Intentions, on the other hand, sit on the other side of the process of decision-making and reasoning and acting in the world; they're much more closely tied to the action side of things. You might think of them in terms of decisions or plans about what to do, driving or motivating action in a more direct way.
00:13:52
Speaker
So if I have a belief that there's beer in the fridge, that's quite different from having an intention to put beer in the fridge. There may be no beer in the fridge right now, but if I have that intention, then it's going to alter my behavior in such a way that I make that goal state become true, at least if I'm successful in executing the intention.
00:14:12
Speaker
So that's one way of thinking, at a basic level, about how intentions differ from beliefs. And when it comes to language models, whether or not we think they have belief states or something like them, we might also wonder whether these systems have something like intentions. Do they form plans? Do they...
00:14:36
Speaker
represent the actions or things that they're about to do before they do them, in a way that's guiding or controlling their behavior? So that's the question, and there are a number of reasons why you might be interested in a question like that, which maybe we can get into in a moment. But that's the question. Great, yeah. Just to be clear, maybe: would you say that you're looking into whether language models contain internal features that strongly resemble the kind of functional, causal role that intentions have? Or...
00:15:15
Speaker
In your mind, are you full-blown trying to figure out whether language models have genuine intentions? Are we looking for something that's intention-like, or just a genuine intention, I guess? Does that make sense?
00:15:29
Speaker
It does, and it's a very good question. I mean, I hedge in the title of my article; I say intention-like representations in language models, as you say, question mark. I'm interested in that question, but I suppose I am first and foremost interested in the question of whether they have intentions, and I think that's what starts the debate. People are interested in: should we ascribe this mental state that we ascribe to other human beings, and maybe non-human animals, to these cutting-edge AI systems?
00:16:02
Speaker
But in a sense, I think where I end up is thinking that maybe that's not an entirely well-formed question, or to put it another way, it might not be the kind of question that we can get a clean yes or no answer to.
00:16:16
Speaker
It might be that intentions as mental states are characterized by a cluster of different features, and maybe those features can be present to a greater or lesser extent, and we might find states in language models, or for that matter non-human animals or other situations, where some of those features or criteria are met,
00:16:41
Speaker
and others aren't met. And I want to be a little open-minded about what we should do in a situation like that: whether we should say, okay, categorically, they don't have intentions, or whether a more honest answer is something like, well, they have states that are intention-like in many interesting respects, but not in other interesting respects. And that might actually be more informative than just trying to squeeze it into a binary distinction.
00:17:09
Speaker
It also seems like, I'm not sure if this makes sense, but it seems like the proper order of investigation would be: you first try to establish to what degree the two things resemble each other, take various factors that you find in human intentions, and see how much

Why Intention-like Behavior in AI Matters

00:17:29
Speaker
various representations within the language model resemble those. And then you work up to the question of, okay, given the resemblances, does that ultimately count as instantiating a full-blown intention? So anyway, it just seems like that's the proper order, if that makes sense.
00:17:55
Speaker
But at this point, maybe we can go back to something that you mentioned earlier, which is the importance of the question, the importance of asking whether
00:18:09
Speaker
language models have intention-like representations. So yeah, would you like to address that? I mean, you talk a little bit about communication, and you mentioned things related to safety concerns. Do you just want to briefly touch on why this is a pressing question?
00:18:27
Speaker
Yeah, of course. You might be interested in the question for various different reasons, and I think intentions get invoked in different contexts, often without serious engagement with the question of whether these systems do have intentions or not, or at least without very much evidence being presented. Sometimes people just categorically deny it, or sometimes they just assume that AI systems do, or in the near future will, have intentions. But this gets invoked in various different contexts. So one context might be questions about communicative capacities, the ability of these systems to engage in the kind of communication that we human beings do with each other.
00:19:11
Speaker
So, for example, when humans use language, we don't just generate words or tokens for the sake of it. We're engaged in some kind of communicative project with some other human being, typically. One aspect of that is what's referred to in philosophy of language as speech acts. So we might, for example, make an assertion to another person, which is trying to communicate our belief in that claim and perhaps even to make the other person believe what we believe.
00:19:50
Speaker
ah We might give a recommendation or a command to somebody or a suggestion, try and change their behavior ah through our use of language. Or we might ask them a question to try and kind of elicit some information from them.
00:20:02
Speaker
So these are different things that we do with words, that we use language for. And it's very common in theorizing about these different kinds of speech acts to require some kind of intention behind the language: merely producing or generating the language isn't enough to have performed one of these speech acts.
00:20:25
Speaker
In the case of assertion, a traditional picture would hold that you need to have an intention to induce a belief in the hearer or the audience. Or, more generally, you might just think that you need an intention to make an assertion in order to actually be making an assertion.
00:20:43
Speaker
So that's one motivation for the question: you might think, do language models perform speech acts, communicate in the way that we do? That might be partly dependent on questions about intentions.
00:20:56
Speaker
A quite different motivation might be questions around safety. And I'm a little open-minded about exactly how relevant this question about intentions is ultimately going to be to practical questions about safety. But certainly some people in the AI safety research community, who are interested in
00:21:20
Speaker
analyzing and predicting and preventing certain harms caused by the use of AI technologies, one thing that they're sometimes worried about is so-called runaway or deceptive AI.
00:21:35
Speaker
It's kind of getting towards slightly more um speculative and long-term kind of concerns, but people are genuinely worried, or many people are genuinely worried, that we'll soon, if not already, be faced with AI systems that have intentions that are in conflict with our own best interests.
00:21:57
Speaker
And not only that, but those intentions might be hidden in some way; they might be strategically or otherwise masked by the model's behavior, so that the models aren't superficially revealing their misaligned intentions.
00:22:13
Speaker
So intentions get invoked in that debate as well, this idea of misaligned intentions. But again, if that is a real concern, we should care about whether we already have systems that have intentions or something like intentions. And perhaps this might be relevant at some point down the line to identifying and monitoring these intentions for safety purposes.
00:22:48
Speaker
Great, that's really helpful. So we have on the table a sense of the question you're asking here, the research question: do these language models have something like intentions?
00:23:03
Speaker
And then we have a sense of the importance of this kind of question. It seems like intention, for example, is essential to communication.

Identifying Properties of Intentions in AI

00:23:10
Speaker
Now, maybe let's talk about your method for approaching this question. It seems like, correct me if I'm wrong, your approach is: first, let's propose five properties, five qualities that are typically associated with intentions, that seem to be really important for a human mental state to be an intention. And then we look back to see if there are any representations in large language models which
00:23:46
Speaker
possess all five of those properties. Is that kind of how you would describe the method? That's exactly right. Okay, great. And just to be clear, you're not saying that these five characteristics, which we'll talk about in a second, are the consensus definition. Some people might be uneasy about saying these five properties define an intention, but at any rate, they do seem to be typically associated with intentions. So maybe we can just start talking about some of these five marks, you might say.
00:24:28
Speaker
um The first one you bring up is directive function. um So yeah, can you kind of tell us a little bit about this first characteristic?
00:24:39
Speaker
Yeah, of course. So directive function, we've touched on it a little bit already when we talked about this distinction between beliefs and intentions. Beliefs, as I mentioned before, are in the business of trying to track the way the world already is.
00:24:58
Speaker
So you might say that beliefs have what's sometimes called a descriptive function. They're trying to describe the actual facts of the matter. Intentions, by contrast, have what some people call a directive function. They're trying to direct the behavior of the containing system, the organism or the human being that has the intention,
00:25:21
Speaker
to make some changes in the world in order to attain some goal state. And as I mentioned with this beer in the fridge example, you might have more or less the same state of affairs, the same content, the same proposition, targeted by different kinds of attitudes, in this case a belief attitude or an intention attitude, where one of the crucial differences is this direction of fit. So in the case of believing that there's beer in the fridge, it's the job of the belief to fit the facts of the matter, to fit the world, to try and
00:26:01
Speaker
get an accurate picture of the way the fridge is right now. Like, I should only believe that there are beers in the fridge if the world is such that there are indeed some beers in the fridge, right? Like, I should drop the belief if there are no beers in the fridge. Precisely, yeah, exactly. Whereas the opposite is true, or it's a very different situation, with an intention. It would be weird to intend there to be, or to put, beers in the fridge if there were already a sufficient number of beers in the fridge. Whereas
00:26:32
Speaker
normally what we would do is observe, or believe, that there are no beers in the fridge and then intend to change that fact. So the intention has an opposite direction of fit. You might say it's the job of the fridge, or the job of the world, to be brought into accordance with the content of the intention.
00:26:54
Speaker
And there's kind of a causal element with intention, where it might initiate a bunch of activity on my behalf to make it the case that the world fits this state of affairs. Exactly.
00:27:10
Speaker
This wonderful state of affairs of there being beers in my fridge. Yes. Yeah, that's right. So that gets at this distinction between directive and descriptive representations, with intentions falling on the directive side of that dichotomy.
00:27:28
Speaker
Excellent. Okay. So, for our conversation, maybe we can move a little bit quickly over these five marks, and then we can get more into your really fascinating analysis of, okay, given these five marks, are there any representations in language models that satisfy them? So we'll come back to the whole directive function idea.
00:28:02
Speaker
But let's see. So maybe we can move to the next mark. Okay, so the second characteristic you bring up, again, these are characteristics that are typical of intentions, is distality. So can you tell us a little bit about this second mark?
00:28:21
Speaker
Sure. So when we think about intentions in our lives, some of them are very closely related in time to the actions that they end up triggering. So I might form an intention to click my fingers, and then very soon after in time I click my fingers.
00:28:41
Speaker
So that's what researchers sometimes call a proximal intention. But other intentions have much longer time horizons associated with them. Think about an intention to retire when you're in your sixties or seventies, or you might have an intention to do something next summer, and that might be shaping your behavior and certainly shaping your
00:29:07
Speaker
decision-making, which we'll talk about a little later, but it has a much wider distance in time from the execution of the intention.
00:29:19
Speaker
So this is characteristic of human intentions, that they can span this wide range between very proximal and very distal intentions.
00:29:29
Speaker
And sometimes people talk about a hierarchy of distal intentions interacting with proximal intentions. So this again would certainly be relevant if we're thinking about categorizing or assessing representations in artificial systems in terms of how similar they are to intentions. We would want to know: do we find directive representations, that's the first feature, that exhibit something like degrees of distality?
00:30:02
Speaker
Do we only ever find states that are very proximally related in time to the actions that they trigger, or do we find these longer time horizons as well? The latter would be more similar to a system that has intentions.
00:30:21
Speaker
Yeah. And that's a good point, bringing up the issue of human versus other animal intentions. So maybe for other animals, potentially, certain animals only have more proximal intentions, intentions related to the next five minutes, let's say, whereas like you said, we have this amazing ability, if you think about it, to have intentions that are really far off.
00:30:48
Speaker
Anyway, it's interesting to think about that. But yeah, great. So that second characteristic is kind of a spectrum-related characteristic: an intention can be related to the more distant future versus something more nearby in time.
00:31:09
Speaker
And then the third characteristic is related to abstraction. And this is also another kind of spectrum characteristic: intentions can be more or less abstract. So can you tell us a little bit about this third characteristic?
00:31:25
Speaker
Yeah, that's right. So abstraction is less about distance in time, and more to do with how specific or fine-grained or detailed the content of the intention is. So again, we can have very specific and detailed intentions. Imagine a dancer performing some choreographed dance. They might have very precise intentions about exactly how they're going to move their body in relation to some music or what have you.
00:31:55
Speaker
But many of our intentions are not that specific and not that detailed. We might just have an intention to get a well-paying job. That intention leaves lots of things unspecified, and there are many different ways one could satisfy it. There's a question about exactly what kind of job you get, exactly what qualifies as well-paying, exactly when you're going to execute this plan,
00:32:23
Speaker
and that's not to mention all the possible sequences of bodily movements you could enact to ultimately end up satisfying that intention.
00:32:34
Speaker
So ah what this gives a sense of is we can form intentions that are very specific, very concrete, very detailed, and we can form intentions that are much less fleshed out and much more abstract.
00:32:48
Speaker
So again, this might be something that we consider when assessing candidates for intentions in other kinds of systems, in non-human animals or in AI systems.
00:33:00
Speaker
Are they like our intentions in exhibiting this spectrum of degrees of abstraction? Great. Okay, so then moving on to the fourth characteristic, which I feel is really one of the more important characteristics in your paper, which is commitment. And maybe correct me if I'm wrong, but this particular characteristic, commitment, is maybe the one where large language models do the worst, potentially. I don't know if that's the way you would put it, but anyway. So a fourth characteristic of intentions is that they typically include a factor of commitment. So do you want to introduce us to that a little bit?
00:33:44
Speaker
Yeah, of course. So commitment is a bit of an umbrella term. It's capturing a cluster of related features, but two of them that get discussed in the literature on intentions are, firstly, the idea that
00:34:01
Speaker
intentions are conduct-controlling. They're more directly linked to the actual production of actions, compared with something like desires. So a desire might also be seen as a kind of directive representation, in that it's aimed
00:34:22
Speaker
towards changing the world or motivating us in different ways; it's not necessarily aimed at capturing the way the world already is. But unlike intentions, desires can pull us in all sorts of different directions. So we might have a desire to eat a chocolate cake, but at the same time have a desire to be healthy and lose some weight or whatever.
00:34:49
Speaker
Whereas intentions are supposed to settle among these inconsistent desires, desires that are in tension with each other. They're supposed to elect one particular course of action among a range of alternative, inconsistent possibilities.
00:35:08
Speaker
So if I intend to eat the chocolate cake, that's quite different from merely desiring to, because I've now chosen that course of action and am making steps towards pursuing it.
00:35:20
Speaker
And there's a consequence of this, which is that we tend to form relatively consistent sets of intentions at any given time. So whereas desires might pull us in different directions, it would be very strange, or something would have gone terribly wrong in terms of your rationality, if you intended to meet a friend for lunch but at the very same time intended to leave town for the day and get on a plane somewhere.
00:35:55
Speaker
So intentions should, generally speaking, be mutually consistent with each other. There's a slightly different aspect, or a different sense, of commitment that sometimes gets invoked as well, which is more about stability over time.
00:36:11
Speaker
So when we commit to a course of action, when we form an intention, we tend to stay on that path unless there's some good reason to reopen the question for consideration. Generally speaking, we don't just arbitrarily drop our intentions or reconsider them without good reason. So sometimes people say that intentions are resistant to reconsideration, and that, all things being equal, they have some kind of inertia to them once we're set on a path of intending something.
00:36:47
Speaker
So those are two aspects of commitment that might be interesting when assessing intentions in language models. Great. Okay. And then finally, the fifth characteristic you bring up is planning.
00:37:01
Speaker
So could you tell us a little bit about that? The idea being that an intention is normally sitting inside of a larger organized system, you might say, of planning. So can you maybe tell us a little bit about this idea of a system of planning and how intentions tend to interact with it?
00:37:24
Speaker
Of course. So intentions are sometimes seen as the end point of a process of deliberation: we've weighed up the pros and cons and landed on a particular course of action.
00:37:38
Speaker
But there's another sense in which that isn't the real end point. Between forming an intention and actually executing it, it's not as if intentions are just sitting patiently waiting to be executed and not having any effect on our minds. When we form an intention, it then acts as an input to any subsequent deliberation or reasoning about what to do. So if I have some plan
00:38:08
Speaker
to go on holiday next summer, then, well, firstly, at some point between now and next summer I need to figure out exactly where I'm going to go on holiday, I need to buy some tickets, and so on. So those are kinds of means-end constraints. Once we've formed an intention, which is a kind of end point, we need to intend some adequate means to that end.
00:38:33
Speaker
So that's one sense in which intentions serve as inputs to planning. But they might also act as a kind of general filter. And here, with a lot of these ideas around planning and also commitment, I'm leaning heavily on the work of Michael Bratman, who's a philosopher of action and talks about this in detail. He talks about intentions acting as something like a filter on other intentions, on the forming of other intentions. So again, like the example before, if I form an intention
00:39:06
Speaker
to go for lunch with a friend, then that rules out a whole bunch of other things that would be inconsistent with that intention. So, generally speaking, when we have intentions in a system, we should expect that they're going to interact with other intentions, place constraints on other intentions, in something like this way.
00:39:25
Speaker
You might talk about these as horizontal constraints, alongside the vertical means-end constraints. Oh, interesting. Okay. Yeah, I was thinking, if someone had the intention of drinking Amstel Light in Denmark, it would be bizarre if they then also subsequently formed the intention of never traveling, if they're not already in Denmark. So by setting the intention of drinking Amstel Light in Denmark, that kind of constrains
00:40:02
Speaker
what other intentions are rational, precisely given that preceding intention. Yeah. So part of this is a kind of normative rationality consideration, but it's also somewhat causal, right? Sometimes this gets talked about in terms of a rational pressure. So not only is it rational to do this, but precisely because it's rational, we tend not to form those kinds of inconsistent intentions. Those tend to be the exceptions to the rule.
00:40:34
Speaker
Oh, that's interesting to think about. It's like some of the causality seems to hinge on the fact that we have this general system requirement to maintain rationality, or something like that. Interesting.
00:40:51
Speaker
Okay, great. Maybe we could just try to summarize to some degree here. We've been looking at these five marks, these five characteristics that are typically associated with intentions, and
00:41:06
Speaker
gone through a few key things. They're not just beliefs about the world, because they're not just about representing how things are; they're generally about guiding action toward creating a certain state. They're also not just desires, because like you said, you can have desires pulling in multiple directions. I can have the desire to drink a beer and, on the other hand, also the desire not to drink a beer, to be healthy or something. Whereas intentions involve a commitment to a particular course of action. And intentions are not restricted to the immediate future, like you were saying; they can be formed with a lot of temporal distance. Something can be more nearby versus further out in terms of what I intend.
00:41:54
Speaker
And they're also not necessarily maximally concrete: the intention just to drink a beer, rather than to drink specifically Amstel Light in Denmark. And then,
00:42:10
Speaker
what else? And then, yeah, they're not just plans. Plans kind of flesh out intentions, but intentions do play a distinctive role when it comes to planning. They settle what is to be pursued, and then that settled
00:42:31
Speaker
goal ends up structuring how the planning proceeds. So anyway, that's just a little bit of a gloss on the five things we just went through. Is there anything else you want to talk about at this point, just in review, or should we start talking a little bit more about the case studies? No, I think that's a nice summary, other than you might just acknowledge the point that these different features also interact in different ways. They might be interdependent. It might be that precisely because intentions can be very abstract and need fleshing out, that's the kind of thing that motivates the planning relations that we find between
00:43:13
Speaker
intentions. And likewise, planning is going to require some degree of commitment. We need to commit to one thing or another in order to get going with fleshing out our plans.
00:43:26
Speaker
And many of these richer conditions just presuppose the idea of intentions being directive states as opposed to descriptive states. So these features interact in different ways.
00:43:39
Speaker
Interesting. Awesome. Great. Okay, cool. So now that we have these five characteristic marks of intention on the table, we can talk a little bit about some case studies that you bring

Case Studies and Examples of Intention-like Representations

00:43:51
Speaker
up. So maybe you could just tell us a little bit about what role these case studies are playing in your overall argument. I mean, correct me if I'm wrong, but it seems like the idea is that they provide concrete, empirically grounded candidates for intention-like representations in language models. In other words, we look at these case studies from the literature, we see researchers talking about certain representations, and you notice that, hey, these representations seem to resemble intentions to some degree.
00:44:36
Speaker
That's exactly right. So I'm looking through the ever-expanding technical literature that's sometimes called interpretability research, or more narrowly mechanistic interpretability research, which is attempting to understand how language models, or any other machine learning model, solve some particular task, to understand more about how they work under the hood and
00:45:08
Speaker
what kinds of features they might uncover about the world, what kinds of factors they might be leveraging in arriving at a decision. And for each of these two case studies that I focus on, there's a cluster of papers that have explored this kind of thing, but in each case I'm finding, as you say,
00:45:32
Speaker
representations that have at least some superficial resemblance to intentions, and maybe the researchers talk about them in something like those ways, using words like planning and so on that are adjacent to the idea of an intention or a task-oriented representation.
00:45:47
Speaker
And then I'm trying to look more carefully and see: okay, compared with these five characteristic features, do they resemble intentions in a deeper way? Is this just a superficial resemblance, or is there genuinely something intention-like, in a more robust way, about the functional role of these representations that are being posited by these AI researchers?
00:46:13
Speaker
Perfect. Awesome. And so I guess the first candidate that we're going to have on the table, from this first case study, is a function vector, I think that's right. Okay. So let's try to lead up to this idea of a function vector. Basically, function vectors were posited, or maybe you could say discovered, in the context of
00:46:42
Speaker
few-shot in-context learning. So can you maybe just tell us a little bit about what few-shot in-context learning is, and what the researchers were trying to discover about these models?
00:47:00
Speaker
So it's a bit of a jargon-filled phrase, few-shot in-context learning, but maybe it's easiest to see by way of examples. Listeners may have experimented themselves with chatbots and language models and realized that if you give a few examples of the kind of thing you want the language model to do,
00:47:22
Speaker
you often get better results than if you just asked it to perform a particular task without any preamble. And it turns out that language models are quite good at inferring something about the structure of an example or a task that's demonstrated to them in context, which is just a technical term for the prompt that you give the model.
00:47:50
Speaker
And so one example might be, if I gave you the following sequence of pairs of words, if I said to you: rich, poor; young, old; big, small,
00:48:05
Speaker
and then I gave you the word up and asked you to continue the pattern, could you... Down. Yes, very good. Correct. All right, so I was on the hot seat, and I passed. So what you've done there is correctly identify that there's a certain relation, a pattern of relations, between the values in each of those pairs, namely that they are antonyms or opposites, words that mean the opposite of each other.
00:48:37
Speaker
And then you've applied that very same relationship to the query word, in this case up, and you've produced the answer down, which is correct in this context. So there are many studies in this area;
00:48:53
Speaker
I'm focusing in particular on a study by Eric Todd and colleagues in David Bau's lab, but there are many other researchers who have explored this phenomenon.
00:49:07
Speaker
And what they're trying to uncover is how exactly a language model solves tasks like this. How does it figure out how to produce the correct answer, in this case an antonym of the word up?
00:49:22
Speaker
So maybe we can start to get into the details about what these researchers discovered, but that's the context that sets up where we arrive at something like a candidate for an intention.
00:49:37
Speaker
Right, so we have a general type of task that the model is capable of doing well, like finding the antonym, completing the pattern. And the question is, how is it managing to do this? And I guess, was their answer this key idea of the function vector? The function vector is what perhaps allows the model to succeed not just at a specific instance, like
00:50:17
Speaker
from up to down, but generally across many tasks that fall under that type, finding the antonym. Anyway, maybe just tell us: how would you describe what a function vector is? Yeah, that's exactly right. So maybe just to give some more context: you might have a task like finding an antonym. You might also have a task like translating English words to Spanish words, or the implicit relationship might be mapping countries to their capitals, or countries to their currencies, and so on. For each of these tasks, models show some significant degree of success at solving the task.
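For concreteness, this is roughly what such a few-shot prompt looks like when assembled programmatically; the exact formatting used in the studies may differ, and the helper below is purely illustrative.

```python
# Rough sketch of a few-shot, in-context prompt for an implicit task (formatting is illustrative).
def make_icl_prompt(pairs, query):
    """Turn demonstration pairs plus a query word into a prompt string."""
    lines = [f"{a}: {b}" for a, b in pairs]
    lines.append(f"{query}:")                     # the model must infer the relation and continue
    return "\n".join(lines)

antonym_prompt = make_icl_prompt([("rich", "poor"), ("young", "old"), ("big", "small")], "up")
print(antonym_prompt)                             # a capable model continues with "down"
```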
00:51:01
Speaker
And what the researchers discover, and we probably don't need to get into the details of exactly how they extracted these so-called function vectors, is representations that seem to play a really important role in mediating success at each of these different kinds of tasks.
00:51:21
Speaker
So for the antonym task, they find a function vector that plays a really important role in mediating finding the correct antonym for a keyword.
00:51:34
Speaker
And likewise, they find a different function vector for the English-to-Spanish translation task, and a different one for countries to capitals, and so on. And these representations are not only reliably triggered in response to demonstrations of the corresponding tasks, so demonstrations of antonym relationships, but they also play a causal role in then generating an appropriate antonym when given a keyword like up.
00:52:08
Speaker
And one thing that the researchers show is that these representations have that effect quite robustly, across many different kinds of input prompts. In fact, they even show that if we extract that function vector and put it into a totally new context where we just give a single keyword without any preamble, say the word common, and then inject the antonym function vector as the language model is processing that input, it will then cause the model to produce the word rare.
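A hypothetical sketch of that injection step, in the spirit of the experiments described: a pre-extracted function vector is added to the hidden states at one layer while the model processes a bare keyword. The model, tokenizer, module path, layer index, and vector are all assumptions (a GPT-2-style Hugging Face model loaded elsewhere), not details taken from the paper.

```python
# Hypothetical sketch: inject a pre-extracted "function vector" into a model's hidden states.
# Assumes a GPT-2-style Hugging Face causal LM and tokenizer loaded elsewhere.
import torch

def generate_with_injection(model, tokenizer, prompt, function_vector, layer_idx):
    """Add `function_vector` to the hidden states at `layer_idx` while generating."""
    layer = model.transformer.h[layer_idx]                  # assumed GPT-2-style module path

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + function_vector.to(hidden.dtype)  # the injection itself
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = layer.register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=3)
    finally:
        handle.remove()                                     # always detach the hook
    return tokenizer.decode(out[0, ids.shape[1]:])

# e.g. generate_with_injection(model, tokenizer, "common", antonym_fv, layer_idx=12)
# would, on the paper's account, push the model toward producing "rare".
```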
00:52:45
Speaker
So it will induce this antonym-producing behavior. So what does any of this have to do with intentions? Well, you might think that, at a very coarse-grained level, the way the researchers talk about these is as representations of a particular task or a particular function, a particular kind of behavior,
00:53:11
Speaker
that then trigger that behavior whenever the representation is activated. And that's a little bit like the role that intentions play, right? They're representations of a goal state, or of a behavior or a task, such that when we activate them, or
00:53:29
Speaker
trigger them, they produce the corresponding behavior or task and so on. So that's one reason why you might think they're at least a candidate for something like an intention. And then there's a question of how well they map onto these different features that we talked about before.
00:53:48
Speaker
Yeah, that's really helpful. So when a function vector is active, it seems to have a kind of directive function, because it's causing the model to perform a specific task. And it's kind of general, like translate, not just translate dog
00:54:07
Speaker
to perro; it's really more general, it's going to cause the model to perform a task like translate. And so it doesn't seem to be just representing a belief or facts; it seems to have that sort of world-to-mind direction of fit. And yeah, I wish we could go a lot more into all the nuts and bolts, because when you really get into the nuts and bolts of how these function vectors operate, you see more and more of that resemblance you're talking about.
00:54:41
Speaker
Obviously it's tough in a podcast to go into it, because what we're saying can seem very superficial, but when you look more into it, there are certain features of a function vector, like the fact that they're more task-level, specifying what kind of action to perform rather than just producing a very specific output.
00:55:08
Speaker
Anyway, they guide downstream processing, and they're not just correlated with a certain behavior; they really seem to cause the model to perform a certain behavior. So at any rate, the more you go into the technical details, the more plausible this resemblance becomes, I think, right?
00:55:39
Speaker
um That's exactly right. Yeah.

Do AI Features Resemble Intentions?

00:55:42
Speaker
Okay, great. Now let's shift to the second case study, just so that we also have another candidate on the table.
00:55:57
Speaker
Tell us a little bit about case study two, and just to preview, the key candidate that's going to emerge from here is something called an output feature. Is that right? Yeah, that's right. So just as case study one produced the candidate of the function vector, case study two is going to produce this new candidate of the output feature. Can you tell us a little bit about the second case study you use?
00:56:22
Speaker
Sure. So this is some research done by researchers at Anthropic. Anthropic is one of the big tech companies involved; they are the people behind the chatbot Claude that many people will have interacted with, but they also have quite a big research wing, and they do a lot of this mechanistic interpretability research, trying to figure out how Claude does the many things that Claude can do.
00:56:57
Speaker
And there were a pair of papers that the research team put out early-ish last year, 2025,
00:57:09
Speaker
looking into... well, they have very broad scope, these papers, but essentially they're working within this tradition of thinking about language models in terms of features and circuits.
00:57:25
Speaker
So I should maybe say a little bit about what those two terms mean. Feature means something like what philosophers mean by representation; or at least, primitive representation might be something like a
00:57:40
Speaker
synonym there. So features are supposed to be the things that a language model leverages to deal with a particular kind of input and generate an appropriate output. And circuits are supposed to be connections between features, or algorithms that compute over features, in order to produce that output.
00:58:09
Speaker
So in this particular case study, the researchers used a somewhat sophisticated way of trying to identify features in language models, which involves something called cross-layer transcoders. So again, we get into jargon territory quite quickly. But skimming over the details a little bit, they...
00:58:39
Speaker
They identify what seem to be candidates for representations of different variables that are activated in different contexts when a language model is solving some task and that um play a particular kind of role in generating outputs.
00:58:56
Speaker
And features usually get talked about in the context of features of the input. So it might be something like the fact that a particular input has a positive sentiment expressed in it, or the fact that a particular word refers to someone who's male as opposed to someone who's female. That might be the kind of feature researchers identify. But I think this is the first paper, at least that I've seen, to introduce this idea of an output feature, which is instead some kind of feature or representation in the model that's directed at
00:59:39
Speaker
generating a particular kind of output as opposed to responding to a particular kind of input. So they're kind of pushing the model toward a certain output rather than just exactly describing the input that's being given. Yes.
00:59:52
Speaker
They promote a certain response from the model. And then of course, because they're interested in circuits, there'll be interesting relations between these features or representations. So you might have input features interacting with output features in a kind of complex circuit or algorithm.
01:00:12
Speaker
But yeah, there's a particular case study that I focus in on, discussed in one of these two papers, which is about poetry writing in Claude. And I think that's a fun thing to think about: you give Claude a task like producing a continuation of a rhyming couplet. If you give it the first line of a poem, like "He saw a carrot and had to grab it," which is the example in the paper,
01:00:46
Speaker
you might get a continuation which says, "His hunger was like a starving rabbit." So it's quite a cute little poem that Claude produces.
01:00:57
Speaker
And we might wonder how a language model can solve a task like this. When you start to think about it, it's a fairly sophisticated behavior. The model has to produce a sentence that both rhymes, that ends on a rhyming word, but also makes sense given the context of the first sentence. Semantically coherent. Yes, exactly. Semantically coherent, and also coherent in terms of the rhythm of the sentence and so on.
01:01:27
Speaker
And so what the researchers are interested in figuring out, at a high level, is distinguishing between two possibilities. Maybe the model is just working in an improvisational kind of way. In other words,
01:01:44
Speaker
for each word that it generates, it just works on the fly. It doesn't take into account where it's heading or what it needs to end up doing by the end of the second line; it's just generating token by token, off the cuff. Alternatively, it might be doing something more sophisticated, a kind of planning which involves electing or landing on an end word ahead of time and then working backwards from there, writing towards this planned endpoint.
01:02:12
Speaker
And now we can maybe begin to get a hint of where intention-like features might come into the picture. But in short, they present evidence that they think supports this latter hypothesis, that there's something like a planning behavior going on here.
01:02:29
Speaker
And maybe I can talk about exactly what they find there. So here is what they find in the case of this particular poem, "He saw a carrot and had to grab it."
01:02:41
Speaker
That's the prompt that's given to the model. At the end of that prompt, before the model has started generating any words on the second line, the model seems to activate a representation of rabbit.
01:02:54
Speaker
So it's activating a rabbit representation, specifically an output feature in the terminology of these researchers. And moreover, that output feature is not only generated ahead of time, long before the model reaches the end of the second line, but it seems to play a causal role both in driving the ultimate generation of that end word, rabbit, and also in guiding or shaping the generation of the intermediate tokens. So the words, sorry, what is it,
01:03:25
Speaker
"His hunger was like a starving..." Those also seem to be causally driven by the activation of that rabbit output feature. Right. So on the one hand, it's biasing the model toward a certain output, which is rabbit at the end of that second line.
01:03:50
Speaker
So, you know, when this feature is activated, rabbit as the end of the line becomes likely, whereas if that feature is suppressed, rabbit as the end of the line becomes unlikely.
01:04:08
Speaker
But yeah, like you said, it's not just that. It's important that the rabbit feature is active before the second line even begins.
01:04:21
Speaker
So it's not as though the rabbit feature only emerges once the model is outputting rabbit at the end of the line. Instead, that rabbit feature is active prior to even beginning the second line, so it has a kind of future-directed orientation there. But then, like you said, there's also that shaping of the intervening action.
01:04:43
Speaker
So it's going to constrain what happens in the middle of the sentence. And anyway, it seems to be writing, as it were, with
01:04:57
Speaker
an end game in mind. At any rate, sorry to interrupt. Yeah, no, that's exactly how the researchers talk about it. And there's then a question of, okay, if we're being a bit more circumspect, how closely does this functional role map onto these five features that we've been talking about?
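As a toy illustration of the kind of intervention being described (the candidate words, scores, and feature direction below are invented, not figures from the Anthropic papers), an output feature can be thought of as a bias on the model's eventual end-of-line word: activate it and rabbit dominates, ablate or suppress it and rabbit becomes unlikely.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

candidate_end_words = ["rabbit", "habit", "grab it", "cabbage"]
base_logits = np.array([1.0, 0.8, 0.7, 0.2])      # invented baseline scores

# The "rabbit output feature" contributes an extra push toward "rabbit",
# present before the second line is written.
rabbit_feature = np.array([2.0, 0.0, 0.0, 0.0])

for scale, label in [(1.0, "feature active"),
                     (0.0, "feature ablated"),
                     (-1.0, "feature suppressed")]:
    probs = softmax(base_logits + scale * rabbit_feature)
    print(label, dict(zip(candidate_end_words, probs.round(2))))

# With the feature active, "rabbit" dominates the distribution over end
# words; with it suppressed, "rabbit" becomes unlikely. That is a causal,
# future-directed influence rather than a mere correlation.
```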
01:05:25
Speaker
Okay, perfect. So now we have on the table these two candidates that you've extracted from the more technical literature, the mechanistic interpretability literature, I think is what you called it, right?
01:05:42
Speaker
Okay, so now that we have those on the table, and we have a prima facie case for looking at them, it's not that we're picking just anything to consider whether it satisfies the five marks; there's a good prima facie case for these candidates, and I wish we could get into that more, of course. But let's now focus on three of the marks: directive function, planning, and commitment.
01:06:12
Speaker
But maybe first we could just start at a high level. Do you want to give the listeners your general overall assessment of, on the one hand, function vectors and, on the other hand, output features? It would be nice if we could get into more detail about the difference between those two. But yeah, if you want to just give your high-level assessment of how intention-like these things are.
01:06:40
Speaker
I'd be happy to. I think the answer is a slightly disappointing, it's-a-mixed-bag kind of answer, right? So what we don't find, it should be clear, is something that absolutely perfectly maps onto the role of an intention, ticks all of the boxes, such that there's no question about it: you know, these are intentions.
01:07:07
Speaker
We also don't find something that's radically unlike an intention, such that it would be completely outrageous for any researcher to ever use a word like intention to refer to one of these features.
01:07:19
Speaker
We end up somewhere uncomfortably in the middle between those two. So going through the marks, we find something like a directive function, or some evidence of a directive function, for both output features and function vectors, but then there are some questions about that which hinge on some technical details and also on exactly what we mean by function and so on. Distality and abstraction, I think the case is fairly strong: we find a range of degrees of distality and abstraction in both cases.
01:07:59
Speaker
But I think there's a question of exactly how distal and exactly how abstract these representations can get. And as you hinted at before, commitment I think is probably the place where these representations, these candidates for intentions, falter the most. So it seems like there's something...
01:08:23
Speaker
interestingly and importantly different about how intentions involve commitment in human beings versus how these representations seem to operate, and we can get into that. And planning, I think the case is fairly strong, again, although there are some caveats to do with how planning might interact with commitment. So as you can sense from this high-level overview, it's a real mixed bag, and I think there is an interesting question about what we do with that kind of mixed result, in terms of whether we ultimately treat these as intentions or not, or whether we just have to sit with the possibility that they're an edge case that's kind of irreducibly an edge case, at least for now, in the current generation of systems.
01:09:13
Speaker
Yeah. And I really think it's a virtue of the paper, the way in which you take a nuanced middle position where you're avoiding two extremes: language models obviously have intentions, or they obviously don't. Instead, you're arguing for this kind of graded, mixed profile.
01:09:31
Speaker
They're intention-like in certain respects, but importantly unlike intentions in others. And so, again, trying to be sensitive to time here, maybe we can just talk about the bottom line when it comes to each of those three marks I mentioned: directive function, planning, and commitment. And my sense with directive function is that you would want to say, okay...
01:10:03
Speaker
It gets into some tricky technical stuff related to prompt processing versus token generation. But would you be able to talk a little bit about the bottom line? I think it's something like: they have a local directive function, but maybe not a globally directive one. I don't know if that's what you want to focus on here, but what would you want to say about that? So yeah, very quickly, on the positive side, we've already touched on this a little bit: both output features and function vectors seem to be playing something like this causal role where, when they're triggered, a certain kind of task or behavior gets generated by the model. So that's something like the role we would expect from a directive representation.
01:10:49
Speaker
So that's a tick in favor of directive function. Where it gets complicated, there are two ways in which it gets complicated. One is to do with how we think about function in philosophy, the definition of a function. There's one rather minimal definition, which is just something like a causal role in producing some behavior of a broader system. And that's the kind of evidence I was just talking about: both of these representations seem to be playing something like the role of generating a certain kind of behavior or achieving a certain type of goal.
01:11:24
Speaker
Then there's a richer sense of function, which sometimes gets invoked, it should be said, in defining directive function: something like a proper function, which in some people's minds at least is importantly related to the history of a system, to whether a certain behavior or effect was itself selected for. So think of the heart pumping blood.
01:11:46
Speaker
It has that function not merely in the sense that it plays the causal role of pumping blood in the system, but in the sense that it's supposed to do that. Well, one story for why it's supposed to do that is that its doing that was beneficial in the evolutionary history of the organism, and that's why we have organisms with hearts that pump blood today.
01:12:08
Speaker
Where that creates a complication for language models is that we then have to think about how these different kinds of representations arose in the first place. Did they arise through different kinds of training procedures? Some training procedures might involve generating text, but others might not. If you think of the mere next-token prediction task, the model is not really generating text in that kind of situation. It's just trying to predict the actual next token in an example sequence.
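A minimal sketch of that training objective, with an invented toy vocabulary and scoring function: under next-token prediction the model is scored against the actual next token of an example sequence from the corpus, rather than being rewarded for any text it generates itself.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Invented toy vocabulary and "model": arbitrary scores over the vocabulary.
vocab = ["he", "saw", "a", "carrot", "rabbit"]

def toy_logits(context):
    rng = np.random.default_rng(len(context))   # deterministic, illustrative only
    return rng.normal(size=len(vocab))

# Next-token prediction: the model is scored against the actual next token
# of an example sequence from the corpus. Nothing in this objective asks it
# to act on, or emit, a continuation of its own.
example = ["he", "saw", "a", "carrot"]
loss = 0.0
for t in range(len(example) - 1):
    probs = softmax(toy_logits(example[:t + 1]))
    target = vocab.index(example[t + 1])
    loss += -np.log(probs[target])              # cross-entropy on the true next token
print("toy next-token prediction loss:", round(loss, 3))
```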
01:12:45
Speaker
So that creates a potential complication: maybe these causal roles that look directive were never explicitly selected for, and are just a convenient side effect of some behavior that was useful for predicting next tokens. So it gets complicated there. You might end up with a kind of divide-and-conquer result, where some language models have directive proper functions, whereas others just have directive functions that are mere causal-role functions but not really proper functions.
01:13:19
Speaker
So yeah, those are some of the details. And then you mentioned this distinction between prompt processing and token generation. Maybe we don't need to get into all the details there, but there's a situation where we have different stages in a language model dealing with a prompt and then generating tokens.
01:13:46
Speaker
So first, if you give a long wall of text to ChatGPT and then click enter, it has to process all that text before it starts generating tokens. It doesn't start generating tokens after the very first word in your prompt; it waits until it has processed all the tokens of the prompt before it starts generating.
01:14:08
Speaker
And then there's a question of, okay, if we find function vectors or output features that are activated during prompt processing, before tokens are being generated, then there's a question of whether they're even playing the causal role of a directive representation, because in that phase they're not being used to generate tokens at all. They're just...
01:14:33
Speaker
being used in some way to shape the internal states of the model. So they're playing more of a descriptive function rather than this... Potentially, yeah. So it may be that it's contextual, to some extent, whether a function vector or an output feature is playing a descriptive or a directive role.
01:14:53
Speaker
That context being the stage of processing that the model is in. So that was all very quick and high level, but it gives you a sense of how this initial picture, that these seem to be playing a directive role, maybe gets a bit complicated when we think about the details.
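Schematically, the two stages look something like the following sketch (the function names are made up; any real model folds both phases into its own generation loop): the entire prompt is processed first, building up internal state, and only afterwards are tokens emitted one at a time. A feature that is only active during the first phase is shaping internal states rather than directly driving token output.

```python
from typing import List

def process_prompt(prompt_tokens: List[str]) -> List[str]:
    """Phase 1 (prompt processing): read every prompt token and build up
    internal state. No output tokens are produced in this phase, so any
    feature active here is shaping internal states, not emitting text."""
    internal_state = []
    for tok in prompt_tokens:
        internal_state.append(f"state({tok})")
    return internal_state

def generate(internal_state: List[str], max_new_tokens: int) -> List[str]:
    """Phase 2 (token generation): only now are tokens emitted, one at a
    time, each conditioned on the state built so far."""
    output = []
    for step in range(max_new_tokens):
        next_token = f"tok_{step}"   # placeholder for actual sampling
        output.append(next_token)
        internal_state.append(f"state({next_token})")
    return output

state = process_prompt(["He", "saw", "a", "carrot", "and", "had", "to", "grab", "it"])
print(generate(state, max_new_tokens=3))
```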
01:15:14
Speaker
Sure. And so now moving on to that other mark, maybe we can discuss planning a little bit. We discussed that there looks to be evidence for planning in the poetry case: representations of future outputs like rabbit that are active several tokens before that token is actually executed and that constrain intermediate word choice. But what are some of the qualifications we should think about with respect to planning?
01:15:53
Speaker
Yeah, so I think the evidence for planning is strongest in the second case study, when we're thinking about these output features, simply because the first line of research, into function vectors, wasn't looking at this kind of
01:16:14
Speaker
relation between different kinds of representations. But in any case, in this poetry-writing example, as we mentioned before, we get representations of ultimate end words that are activated ahead of time, many tokens prior in the sequence, and they also play this dual constraining role: firstly in constraining the words that are generated on the way to ending up with that end word, rabbit,
01:16:45
Speaker
but also in playing a causal role in generating the token rabbit itself. We might also see something like the means-end constraints we mentioned before. So, for instance, they present evidence of what they suggest is a circuit where, let's say, there's a feature which is produce a rhyming word,
01:17:12
Speaker
produce a word that rhymes with "it," and that has a positive activating effect on the feature for rabbit, but also on some other features like grab it and habit and so on.
01:17:27
Speaker
So that might be something like a means-end constraint or causal relationship: a representation of a certain kind of end is triggering representations of appropriate means to that end, and then ultimately the behavior that the model lands on, which is generating rabbit.
01:17:46
Speaker
So that's some evidence in favor of planning. And I think this is one of the surprising and interesting things about this line of research: there is something sophisticated going on in terms of mutual constraints among these output-oriented representations in the models.
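A toy rendering of the kind of circuit being described, with invented features and connection weights: a goal-like "end on a word that rhymes with it" feature sends positive activation to several candidate end-word features and disfavors a non-rhyming one, which is roughly what a means-end constraint would look like at the level of features.

```python
# Toy "circuit": a goal-like feature excites features for candidate means.
rhyme_with_it = 1.0   # upstream feature: "end the line on a word rhyming with 'it'"

# Invented connection weights from the rhyme feature to end-word features.
weights_to_end_words = {"rabbit": 0.9, "grab it": 0.6, "habit": 0.5, "carrot": -0.4}

end_word_activations = {word: rhyme_with_it * w
                        for word, w in weights_to_end_words.items()}
print(end_word_activations)

# The end-like representation (the rhyme constraint) positively drives
# several appropriate means (rhyming end words) and disfavors a
# non-rhyming candidate, which is the shape of a means-end constraint.
```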
01:18:07
Speaker
Excellent. And now turning to commitment. It seems like this is maybe the weakest point when it comes to establishing a resemblance. It's where, I guess, function vectors and output features least resemble
01:18:28
Speaker
human intentions, at least. Maybe to hit on one point that seems important: it seems like, okay, models do bias outcomes,
01:18:42
Speaker
but they don't really commit to outcomes. So maybe that's something we could hit on here: what's the difference between biasing an outcome and actually committing to a particular outcome?
01:18:58
Speaker
That's exactly right. So what we see in the case of this poetry example is that when we reach the end of that first line, when we have "He saw a carrot and had to grab it," not only is a representation of rabbit generated
01:19:15
Speaker
at the end of that first line, but also representations, output features, relating to other candidate end words. So you get something like a representation of grab it and of habit, and these are all active in parallel.
01:19:32
Speaker
And not only are they active in parallel, they all seem to be exerting some kind of causal influence on the generation of tokens, the intermediate tokens and also that ultimate end word.
01:19:46
Speaker
And one of the ways we see this is at the very last token. Because these models are stochastic, they're not producing exactly the same response every single time. What they produce ultimately is a kind of distribution over possible end words,
01:20:08
Speaker
and then a stage of the architecture has to sample from that distribution. And of course, if it has a very strong peak around a certain token or word like rabbit, then it will be very likely to end its output on that. But it won't be 100% guaranteed. So if we run the model ten times, we might find that seven out of ten times it ends the poem with rabbit, but on the eighth and ninth time it might end it on something like grab it or habit or something like this.
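Here is a minimal sketch of that point, with invented probabilities: even a strongly peaked distribution over end words leaves the rivals live, so repeated runs scatter across rabbit, grab it, and habit rather than settling on one option and foreclosing the others.

```python
import numpy as np

rng = np.random.default_rng(42)

end_words = ["rabbit", "grab it", "habit"]
probs = [0.7, 0.2, 0.1]   # invented: strongly peaked but not committed

runs = [rng.choice(end_words, p=probs) for _ in range(10)]
print(runs)

# Roughly seven out of ten runs end on "rabbit", but the alternatives stay
# in play: the representation biases the outcome without settling it,
# which is the gap between biasing and committing discussed here.
```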
01:20:47
Speaker
So why is this relevant to commitment? Well, remember before we said that commitment is about electing or settling on one of a range of alternative possibilities and foreclosing or ruling out the question of pursuing the other possibilities.
01:21:10
Speaker
And that means that intentions tend to be fairly mutually consistent. We don't simultaneously intend to go on holiday to Denmark and...
01:21:25
Speaker
Thailand and the USA, all on the same holiday, at the same point in time, because that's mutually inconsistent. We might desire to go to all of those places, but intentions are supposed to be playing this commitment role of electing for one or the other.
01:21:43
Speaker
What we see here is multiple possibilities being entertained, as it were, or active in the model at once. So it's possible that if we keep on looking, we'll find some other representation in the model that is playing something more like this commitment role, actually settling among these possibilities.
01:22:07
Speaker
But I think it might be a bit of a deeper problem than that. These architectures, because they're always activating multiple features at any given time, are very often going to...
01:22:20
Speaker
be operating more in the mode of biasing the model's behavior one way or another, because that's good enough for the tasks the models need to solve. And this might be one way in which human intention and practical reasoning differ from the kinds of representations we find in language models, at least in the current generation. Yeah, it's really fascinating. I mean, it just seems like sampling is really a crucial thing, because it seems to explain why these models fail to exhibit commitment, even though there's something like a planning representation going on.
01:22:59
Speaker
Yeah, because planning without commitment seems possible in virtue of that sampling, which has a kind of indeterminacy about it, I guess. But great. Well, I really appreciate you going through those three there. And obviously there are so many really rich and fascinating details in your paper, and I wish we could talk

Future Research on AI and Human Cognition

01:23:34
Speaker
more about them.
01:23:35
Speaker
But yeah, I'm just going to recommend your paper to the listeners. It's really fantastic in terms of, like I mentioned earlier, the nuanced middle position you've come up with, the integration of empirical and philosophical research, and also just the conceptual rigor of it. You're not using intention as a vague...
01:23:58
Speaker
thing; you really decompose it nicely into these five features. Anyway, again, I highly recommend this paper to listeners. And at this point I would just ask, do you want to tell us a little bit about what you're up to now, what you're working on? Yeah, so I'm still thinking a lot about language models and AI systems more generally, and I suppose, if you could characterize my general thread of inquiry at the moment, it's these questions about how similar the inner workings of a language model and other AI systems are to the kind of cognition that goes on in human and other animal minds.
01:24:42
Speaker
So one question I'm thinking about at the moment, which is something I've written on in the past, concerns more fundamental questions about representation in AI systems. So...
01:24:53
Speaker
Do we have good evidence to say that a model really is representing a particular feature of our world, especially if it's a system that's only ever been trained on text and hasn't ever directly interacted with our world? And many people have argued that you need to know something about the history of a model to answer some of these questions; it relates a bit to what we were talking about with functions. And I'm trying to scrutinize this assumption a little: do we need to know something about the history of a system, its etiology, in order to
01:25:28
Speaker
know what kinds of things it can represent, or does represent? So that's one project I'm working on at the moment. But yeah, stay tuned. And maybe I should just say, for full transparency, that the paper we've been discussing is still under review, so it's a preprint at this stage. You can find it as a preprint online on PhilPapers, but hopefully soon there will be a slightly more refined version of the paper in print somewhere.
01:25:57
Speaker
Awesome. Well, thanks again so much for coming on, Ewan. Appreciate it. Thanks so much, Sam.