
Neel Nanda on Avoiding an AI Catastrophe with Mechanistic Interpretability

Future of Life Institute Podcast
Neel Nanda joins the podcast to talk about mechanistic interpretability and how it can make AI safer. Neel is an independent AI safety researcher. You can find his blog here: https://www.neelnanda.io

Timestamps:
00:00 Introduction
00:46 How early is the field of mechanistic interpretability?
03:12 Why should we care about mechanistic interpretability?
06:38 What are some successes in mechanistic interpretability?
16:29 How promising is mechanistic interpretability?
31:13 Is machine learning analogous to evolution?
32:58 How does mechanistic interpretability make AI safer?
36:54 Does mechanistic interpretability help us control AI?
39:57 Will AI models resist interpretation?
43:43 Is mechanistic interpretability fast enough?
54:10 Does mechanistic interpretability give us a general understanding?
57:44 How can you help with mechanistic interpretability?

Social Media Links:
➡️ WEBSITE: https://futureoflife.org
➡️ TWITTER: https://twitter.com/FLIxrisk
➡️ INSTAGRAM: https://www.instagram.com/futureoflifeinstitute/
➡️ META: https://www.facebook.com/futureoflifeinstitute
➡️ LINKEDIN: https://www.linkedin.com/company/future-of-life-institute/
Transcript

Introduction and Guest Background

00:00:00
Speaker
Neel, welcome to the podcast. Glad to have you on. Hey, yeah. Perhaps you could introduce yourself. Yeah. So, hey, I'm Neel. I work on mechanistic interpretability research, which means I take a network that's been trained to do a task and I try to reverse engineer what algorithms it's learned and how it does this, trying to be as faithful to what the model actually does as possible. Kind of a form of digital neuroscience.
00:00:30
Speaker
I'm currently an independent researcher. I used to work at Anthropic, working on their transformer circuits work. And I generally try to work on reverse engineering language models, though with a bunch of dalliances through various side projects.

History of Mechanistic Interpretability

00:00:47
Speaker
So as I understand mechanistic interpretability, it's a pretty early field. How young is this field, would you say? Yeah. Let's not call it mechanistic interpretability. It's such a mouthful.
00:01:00
Speaker
And I want to make mech interp a thing, so the field has a snappier name. Anyway, so mech interp - a bit of history. So deep learning as a whole has existed for about 10 years, since AlexNet turned out to be way better at image classification than anything else, and this started becoming the hot thing that's basically entirely taken over the field of machine learning. And
00:01:31
Speaker
The subfield of AI interpretability has only really been a thing since - I don't know the history that well, but the first big paper I saw was in 2014, when people found a way to visualize early neurons in image networks and saw they were looking at things like edges and corners and things like that.
00:01:55
Speaker
And I would say that, at least by my somewhat biased lights, the subfield of mech interp has been massively pioneered by this researcher called Chris Olah, who was first involved in some work called DeepDream at Google Brain, then ran the mech interp team at OpenAI, doing this really great work
00:02:20
Speaker
on image circuits, where they reverse engineered different things in these image models. And since around late 2021, the field has heavily focused on transformer language models like ChatGPT, and smaller versions of these, and trying to reverse engineer those.
00:02:41
Speaker
And the field has rapidly grown from mostly Chris's team and collaborators to a much larger field with multiple professors seriously interested, maybe five industry or non-profit orgs having teams working on it, some independent researchers like me, and hopefully it will keep growing and we will continue this
00:03:06
Speaker
glorious exponential trend of increasing by about 5x in the last year. But a pretty young field. Yeah, a pretty young field.

Importance and Impact of Mech Interp

00:03:14
Speaker
Why is mech interp useful? How would it look if we succeeded in this field? Sure. Are you asking me this from a reducing-AI-x-risk perspective, or just a general, like, why would anyone care about this? I say let's start with the general, and then we can talk about reducing AI x-risk.
00:03:36
Speaker
Sure. So I think for why should anyone care about this? My first and foremost argument is just a kind of scientific and aesthetic argument, which is that machine learning is becoming an increasingly important part of the world. We are really good at giving models complex tasks - like, someone typed a search query into Google, what do they want, what should I show them? - and getting good answers. And
00:04:06
Speaker
these are capable of solving tasks where we just have no idea how to write a program ourselves - a program that can write poetry the way ChatGPT can, or explain jokes. And it's just really, really dissatisfying to me.
00:04:24
Speaker
This is not an acceptable state of the world - that we have computer programs that can essentially speak English at a human level, but we cannot write these programs ourselves; we can only train them via this enormous soup of data. And that's my personal reason.

AI Safety and Misaligned Goals

00:04:43
Speaker
I generally think that just in a world that is being increasingly shaped by these systems, it's just really useful to understand what they're doing and why. For example,
00:04:55
Speaker
Lots of people are pretty concerned about racism and sexism and algorithmic bias in models. If you can understand how this is implemented in the model, it seems like it should be much easier to check how well your techniques for reducing this have worked, and ideally motivate new ones. We've got recommender systems where
00:05:18
Speaker
there are compelling arguments that they're doing pretty significantly negative things in the world. I think that if we could look inside those systems and understand why they recommended what they recommended, on an actual algorithmic level, all of this would just be so much more grounded. And we could actually have rules and regulation and transparency around what was going on here.
00:05:41
Speaker
And then the cause closest to my heart is AI existential safety, the question of
00:05:50
Speaker
It seems plausible we're going to be in a world with human-level AI systems where, without getting into arguments over the semantics of what that actually means, just systems that are capable of doing lots of complex tasks in the world, which may be acting towards goals and whose goals may be different from ours. And I think that
00:06:16
Speaker
The people making these systems are probably not going to want them to have misaligned goals, but that if we can't see what the goals are, and we can't see whether the model is actually being aligned versus doing something deceptive, or just broken and not what we want, that it is much harder to actually get good outcomes here.

Research Highlights: Grokking and More

00:06:38
Speaker
Neil, could you tell us about some results from mechanistic interpretability that are interesting to you? Perhaps you could touch on multiple results.
00:06:49
Speaker
Sure! Alright, so a work particularly close to my heart is this project I worked on during my stint of independent research called Progress Measures for Grokking via Mechanistic Interpretability. So grokking was this famous mystery in deep learning where some researchers found that if you train a small model on some algorithmic task, like modular addition,
00:07:17
Speaker
and you give it, say, half of the data to train on, and keep half of it that it never sees to test it on, it will initially memorize. It gets really good at the data it sees and it's terrible at the data it doesn't see. Then if you keep training for an incredibly long time, it will suddenly grok, or generalize, and learn how to do the data it hasn't seen yet. And this is kind of wild, because it just keeps seeing the same data again and again.
00:07:46
Speaker
And the work I did was first to... So, I figured this just has to be susceptible to mech interp. It's a tiny model doing an algorithmic task. This is the kind of thing we are good at. And I reverse engineered how it did modular addition,
00:08:05
Speaker
where it turned out to have learned this wild algorithm where it thought about the numbers it was adding as rotations around a circle and learned to compose the rotations together in this really weird trig identity and Fourier transform-based way that was ultimately very clean and legible. And then I could look into the model as it was training
00:08:30
Speaker
and saw that rather than suddenly figuring out the right solution, it actually slowly transitioned from the memorized solution to the generalized solution - but that it could only generalize to data it hadn't seen yet when it was both capable of generalizing and wasn't also trying to memorize, and only when it was really good at generalizing did it decide it didn't need to bother memorizing.
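To make the circular story above concrete, here is a minimal numpy sketch of the kind of algorithm described: represent each input as a rotation around a circle, compose the rotations with trig identities, and let the correct answer win by constructive interference. This is only an illustration of the idea, not the trained model itself - the real network learns its own handful of key frequencies and spreads the computation across embeddings, attention, and MLP layers, whereas the frequencies chosen here are arbitrary.

```python
import numpy as np

p = 113                 # modulus used in the grokking work
ks = [1, 5, 17]         # a few illustrative "key frequencies"; the real model picks its own

def mod_add_logits(a, b):
    """Score every candidate answer c for (a + b) mod p via rotations."""
    c = np.arange(p)
    logits = np.zeros(p)
    for k in ks:
        w = 2 * np.pi * k / p
        cos_a, sin_a = np.cos(w * a), np.sin(w * a)     # embed a as a rotation
        cos_b, sin_b = np.cos(w * b), np.sin(w * b)     # embed b as a rotation
        cos_ab = cos_a * cos_b - sin_a * sin_b          # cos(w * (a + b)) via trig identity
        sin_ab = sin_a * cos_b + cos_a * sin_b          # sin(w * (a + b)) via trig identity
        # each candidate c scores cos(w * (a + b - c)), which is maximal at c = (a + b) mod p
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return logits

print(int(np.argmax(mod_add_logits(47, 92))), (47 + 92) % p)   # both print 26
```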
00:08:58
Speaker
That's a work I'm particularly proud of because, well, it's the first research I properly led. But also, I just feel like if mech interp works, we should be able to demystify things like this. A totally different work that I think is pretty exciting is this paper called Toy Models of Superposition from Anthropic, which people may have seen on Twitter as the Why on Earth Is There a Tetrahedron in My Neural Network paper,
00:09:26
Speaker
where - so, superposition is this problem that comes up where models have, say, a thousand neurons in them, and they want to represent features, these properties of the inputs, and often they will try to represent like a thousand features in there,
00:09:50
Speaker
like a feature per neuron, and this is all very nice and reasonable. But sometimes they seem to be doing this thing called superposition, where they actually have more than a thousand features, and they can't do a feature per neuron, and learn some weird compression scheme. And Anthropic were like, this is an important, confusing phenomenon we want to understand better. Let's make a toy model that tries to simulate it, so
00:10:17
Speaker
we can study it in this kind of metaphorical petri dish. And they found that in the toy model they created, not only did it learn to use superposition, but that it learned these beautiful geometric configurations, where say, if you gave it 25 features and 10 neurons to shove them into, it would sometimes learn to compress the first five features into the first two neurons, the second five features into the second two neurons, etc.
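For readers who want to poke at this themselves, here is a minimal PyTorch sketch of the kind of toy setup the paper studies, using the 25-features-into-10-dimensions example above. It is only a skeleton under those assumptions; the actual paper additionally weights features by importance and sweeps over sparsity levels.

```python
import torch

n_features, n_hidden, sparsity = 25, 10, 0.05   # more features than dimensions; features rarely active

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # sparse feature vectors: most entries are zero, active ones are uniform in [0, 1]
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) < sparsity)
    h = x @ W.T                      # squeeze 25 features into 10 dimensions
    x_hat = torch.relu(h @ W + b)    # try to reconstruct the original features
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The columns of W are the directions assigned to each feature. With sparse features
# the model packs more than 10 of them in, and the directions settle into the geometric
# arrangements (antipodal pairs, pentagons, and so on) the paper describes.
```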
00:10:48
Speaker
And you got kind of different levels of compression efficiency - like, tetrahedra have four features in three dimensions, which is higher fidelity than five in two - and models will slowly transition from tetrahedra to the next thing along. It was just a beautiful paper. And the final work I want to highlight is this work from
00:11:19
Speaker
OpenAI called Multimodal Neurons in Artificial Neural Networks, where... So, multimodal neurons in the brain are these neurons which activate on multiple representations of the same thing. For example, a drawing of Spider-Man, the name Peter Parker, and a picture of Spider-Man - the same neuron lights up. Which is kind of fascinating, because it suggests there's some real abstraction going on.
00:11:47
Speaker
And they took this model called CLIP, which is part of how models like DALL-E, the image generation model, are trained, where CLIP takes in both an image and a text input. And they thought this should probably have abstractions, because it's about image and text. And they looked at some of the neurons in the image half and tried to interpret what they represented. And they found all kinds of wild neurons, including a bunch of multimodal neurons,
00:12:18
Speaker
like a Spider-Man neuron, or a bunch of conceptual neurons, like a teenage neuron or an anime neuron, which seem to represent the abstract concept of anime or teenage. Or you got neurons for, say, France that are activated on French language, the French flag, but also the bit of a map that represented France. And this is fascinating to me that this is a thing we can find in networks.
00:12:47
Speaker
And one of my side projects has been making this website called Neuroscope, which shows the text that most activates the neurons in a bunch of language models, where you can just go through and look at what kind of wild things models seem to potentially represent. Are there any concepts that have been found in these
00:13:09
Speaker
in these neural networks that are entirely

Can AI Develop New Concepts?

00:13:11
Speaker
new. So I'm thinking that, are there concepts that humans do not have that could be found in these neural networks?
00:13:20
Speaker
Unfortunately, I'm not currently aware of any examples, though this feels like the kind of thing that should be possible. I'm particularly excited about work that looks at reverse engineering models like AlphaZero that are superhuman at Go or chess, because this is both a very kind of algorithmic and legible domain. These systems are also clearly vastly superhuman, but also there's been some promising work trying to interpret them.
00:13:49
Speaker
like this paper from DeepMind, from Tom McGrath, called Acquisition of Chess Knowledge in AlphaZero, where they showed that if you just took a bunch of human chess concepts and looked for them in the model, you just find them. And further, that if you looked at when in training it had developed these, you could compare it to the human history of chess knowledge and see at what rates the model learned things that humans learned.
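The standard way to "look for" a human concept inside a network, as described here, is a linear probe: train a simple classifier on cached internal activations to predict whether the concept holds for each input. The sketch below is a hypothetical illustration of that technique, not the DeepMind paper's actual code; the file names and the concept are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cached data: activations from one layer of the network on many board
# positions, plus a 0/1 label for a human concept (e.g. "side to move has a material
# advantage") computed by an ordinary rule or engine.
acts = np.load("layer_activations.npy")    # shape (n_positions, d_model) -- placeholder file
labels = np.load("concept_labels.npy")     # shape (n_positions,)         -- placeholder file

probe = LogisticRegression(max_iter=1000).fit(acts[:8000], labels[:8000])
print("held-out probe accuracy:", probe.score(acts[8000:], labels[8000:]))

# High held-out accuracy is evidence the concept is (linearly) represented in that layer;
# repeating this across training checkpoints traces when the concept appears.
```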
00:14:17
Speaker
and also found some fascinating quirks like phase transitions, where there's a certain point around step 30,000 when it learns a bunch of things at the same time.
00:14:26
Speaker
but nothing superhuman. This points in the direction of there being concepts that are generally useful because the same concepts are found by artificial neural networks as are found by humans. Also your first example of an interesting paper where in the beginning the model learns by memorization and then at some point
00:14:52
Speaker
generalizes - that sounds to me a bit like the way that children learn math, where in the beginning you learn your multiplication tables, and then at a later point you understand multiplication as more of an algorithm. So there are interesting parallels between humans and AI systems here.
00:15:13
Speaker
Yes, I do think it's easy to exaggerate the degree to which there are these analogies. I really don't like the term neural network for this exact reason, because it sounds so biological and neuroscience-y. But I think it's pretty plausible there are good analogies.
00:15:30
Speaker
We need interpretability research to understand what's going on, so that we can see whether what we're doing is actually working. We need to create these feedback loops between creating a system, seeing how the system works, and then improving the system. And if we don't understand what's going on, then it's much more difficult to actually say that we have improved - because by which metric have we improved, if we don't understand what's going on inside of these systems?
00:15:57
Speaker
Exactly, and I should flag, I'm extremely biased, I work in mechanistic interpretability, I like mechanistic interpretability a lot, I think it's promising, but I think there's a bunch of other routes by which we might reach this goal.
00:16:11
Speaker
And I think the field of make AI not kill everyone is a healthier field if it has people pursuing a bunch of approaches, including people who aren't even doing this because they care about AI safety. They're doing this because it's just really fascinating and they want to understand models better. Okay, so how promising would you say that mechanistic interpretability is? Sure.
00:16:38
Speaker
For the abstract question of how promising, I'd say, surprisingly promising in that the field has made vastly more progress than I thought was possible, but also not very promising in the sense of, oh my god, this is such an ambitious task, and so few people work on it, where
00:17:02
Speaker
I dunno, my experience getting into the field has been something like: well, there's no way networks are interpretable. They're just a mess of inscrutable linear algebra. They're basically a kind of fancy curve-fitting algorithm, or just an inscrutable black box. They're not trained to be interpretable, so they have no incentive to be. And then,
00:17:22
Speaker
it turns out you can actually find meaningful, interpretable neurons in image models, and meaningful circuits - where circuit here means a kind of subset of the model's neurons and parameters that does some task. And then - but maybe this is really specific to image models, because those are so human-like, we have this visual cortex - then there was work on small language models, and there's been promising results in larger language models.
00:17:51
Speaker
But also, actually being able to meaningfully reverse engineer something non-trivial in a model like GPT-3 - this is ludicrously more ambitious and hard than anything we've done so far. So I consider it very much an open scientific question whether this will actually work. I think it is a bet worth making, but I do not think it is a bet that we should go all in on, or that it's obviously the one true path to things going well.
00:18:21
Speaker
In terms of interesting results and reasons for optimism, there was this paper from the Anthropic interpretability team that I was a bit involved in, called In-context Learning and Induction Heads, which I think is probably the most compelling result I've seen of looking at these models actually telling us something deep about them.
00:18:49
Speaker
An induction head, in brief, is this circuit we found in these two-layer attention-only language models. All right, so zooming out a bit from induction heads specifically. The kinds of models we study and try to reverse engineer are transformer language models, like ChatGPT. Fundamentally, it's about modeling sequences. You give it a sequence of words and you train it to predict the next word.
00:19:19
Speaker
Like you give it a page of a book and you ask it what comes first on the next page of the book. And this is a kind of weird and arbitrary thing you might train a model to do. But it's really convenient because one, this is like a pretty hard task.
00:19:37
Speaker
It sounds kind of easy, but if you imagine, say, getting the first five pages of a book and ending halfway through page six and then predicting what comes next, this is actually kind of a sophisticated task that implicitly involves learning a bunch of things like facts about the world, structures of grammar, reasoning, and, as people who play with ChatGPT probably know, how poetry and rhyming works.
00:20:03
Speaker
And then it's also really convenient because you can just give it a massive soup of data, which automatically comes with labels of what the next word is. And so you train it to do this on a massive, massive mountain of text. And then internally,
00:20:22
Speaker
the transformer is all about representing sequences which can kind of vary in length. A transformer should work on a sequence with like one word and a sequence with a thousand words. And so internally the model is made up of these layers, these simple functions.
00:20:44
Speaker
And transformers are made up of alternating attention and MLP layers. MLP stands for multi-layer perceptron, but you don't need to care about what that means. And so they're modeling the sequence of words. And what they're basically trying to do is they take each word
00:21:05
Speaker
And then they do some processing. And after each step, you get a refined representation of each word, which is integrated in some context and processing from the surrounding words so that you slowly get a better and better understanding of what's going on there, what context the word is in, so you can eventually predict what comes next. And then there's these two types of layers which kind of do this incremental processing.
00:21:34
Speaker
The first type is an attention layer. So at its heart, because the model is a sequence modeling thing, it's doing things in parallel on each word. But obviously you need to move information between words. And so attention is all about figuring out which other words and their context are most relevant to the current word, for the specific task the head is doing.
00:22:01
Speaker
Attention layers are made up of heads, which specialize in different kinds of processing. A head identifies which other position is most relevant, identifies some important information there, and copies it to the current position - they're about routing information around.
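As a rough illustration of what "figuring out which words are relevant and copying information" looks like mechanically, here is a minimal sketch of a single attention head in PyTorch. It is deliberately simplified - a real transformer has many heads per layer, output projections, and biases - and the weight matrices here are just placeholders.

```python
import torch

def attention_head(x, W_Q, W_K, W_V):
    """One attention head: for each position, work out which earlier positions are
    most relevant (the attention pattern), then copy a weighted mix of their
    information to the current position. x is (seq_len, d_model); the weight
    matrices are (d_model, d_head)."""
    q, k, v = x @ W_Q, x @ W_K, x @ W_V                  # queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5                # how relevant is each position to each other?
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))     # language models can only look backwards
    pattern = scores.softmax(dim=-1)                     # each row: "where is this token looking?"
    return pattern @ v                                   # move the selected information here
```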
00:22:22
Speaker
Perhaps an example here could be good. So say that the model has landed on Apple, and then it's trying to predict what comes next. And say the next word in its training data is tree. Then it's looking around for other words that might be relevant for predicting what comes after Apple. Would that be a reasonable example? Yeah.
00:22:46
Speaker
I think maybe a clearer example is something where you can see what kind of thing the model is thinking. So let's say you've got some text like, the Eiffel Tower is located in the city of. And empirically, models know that Paris comes next. But the information that Paris comes next comes from the of word. But like, of clearly doesn't have enough to know what's going on.
00:23:16
Speaker
And it turns out there are some heads which learn to look at the word tower. Earlier bits of the model have integrated in the context that tower is part of Eiffel Tower and which have looked up the fact that Eiffel Tower is in Paris. And then this head moves that from the tower bit to the of bit and uses that to output, Paris comes next.
00:23:45
Speaker
And actually the same heads will do this for things like the Colosseum is located in or the Parthenon is located in. And yeah, you often get heads for grammatical structures like look at the first word in the sentence, look at the subject of the current sentence, et cetera. And so these induction heads, that's the impressive thing that's been reverse engineered within mechanistic interpretability so far.
00:24:15
Speaker
Yes. So induction heads are part of an attention layer, where the attention layer is built up of these heads that can kind of be thought of as acting independently. And because it's an attention layer, this is the model doing something sophisticated with finding relevant information and moving it around. And the task being done is -
00:24:40
Speaker
So a fact about text is it often contains repeated text. For example, if I see the word Michael, it's pretty hard to figure out what comes next. It's probably some famous celebrity like Michael Jackson, Michael Jordan, whatever, but it's hard to know exactly which. But if in the past the text Michael Jackson appeared, then it's pretty likely that Jackson comes next.
00:25:11
Speaker
And so a pretty great algorithm a model can learn is, okay, let's look for a word that came after Michael in the past. And then let's move the information from that word to where I currently am. So I predict that that word comes next, whatever that is.
00:25:36
Speaker
And we reverse engineered these things called induction heads, which implement that. And it's worth noting, this isn't just the model memorizing something like Jackson often comes after Michael. If Michael Jackson came up in the past, this comes next. It's actually just like an actual algorithm that is run on any input. You can just give it some random gibberish text.
00:26:00
Speaker
that's randomly generated, then just copy and paste that, and then run the model. And it will predict the copied and pasted text really well, even though it's never seen anything like that before. So it's a general algorithm. It's not a set of fixed rules involving specific words. It's a general algorithm that can work on arbitrary words, but which does this very specific task.
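The algorithm being described can be written down in a few lines, which is part of why it was tractable to reverse engineer. The sketch below captures the behaviour the head implements, not how it is computed internally (the real mechanism involves two attention heads in different layers composing with each other).

```python
def induction_predict(tokens):
    """The behaviour an induction head implements, as plain code: if the current
    token appeared earlier, predict whatever followed it then. It works on
    arbitrary tokens, including random gibberish, because it is a general rule
    rather than a memorized lookup table."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards for a previous occurrence
        if tokens[i] == current:
            return tokens[i + 1]               # predict the token that came after it last time
    return None

print(induction_predict(["Michael", "Jackson", "sang", "and", "then", "Michael"]))  # -> "Jackson"
```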
00:26:28
Speaker
Like it's not a lookup table, but it's also not like intelligence in any meaningful sense. But I can see how reverse engineering something like that is actually impressive because now we're beginning to understand what's going on inside of these previously entirely black box systems that just predict the next word in the text.
00:26:47
Speaker
Yes, so I think that reverse engineering induction heads is like cool, though probably sounds a lot easier than it actually is from the outside. But they're not the reason that I raised the induction heads paper.
00:26:59
Speaker
So we actually reverse engineered induction heads in this earlier paper called A Mathematical Framework for Transformer Circuits. And I feel like I shouldn't be saying we - this paper was written by Catherine Olsson, Nelson Elhage, Chris Olah, and the rest of Anthropic. I was somewhat involved, but they get a large amount of the credit for doing great pioneering work in this field, and I definitely don't want to claim this is my work, but -
00:27:29
Speaker
Yeah, so these induction heads - we found them by looking at these tiny two-layer attention-only models, and then we looked at larger models. And it turns out that not only do all models that people have looked at, up to about 13 billion parameters, have these heads -
00:27:55
Speaker
since leaving, I actually had a fun side project of looking at all the open source models I could find, and I found them in the roughly 41 models I checked; all of them that were big enough to have induction heads had them. And not only do they appear everywhere,
00:28:12
Speaker
They also all appear in this sudden, what we call a phase transition, where, so as you're training the model, if you just keep checking, does this have induction heads? Does this have induction heads? There's this narrow band of training between about five to 10% of the way through training. Exact numbers vary, but that's kind of the idea. The model goes from no induction heads to basically fully formed induction heads.
00:28:42
Speaker
And this is enough of a big deal that if you look at the loss curve, which is the jargon for how good the model is at its task, there's this visible bump where the model is smoothly getting better, then briefly gets better much faster, and then returns to its previous rate of smoothly getting better when these induction heads form.
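One simple way to run the "does this have induction heads yet?" check during training, hinted at above, is behavioural: feed the model a random token sequence repeated twice and see whether it does much better on the second copy. A minimal sketch, assuming `model` is any autoregressive language model that maps a batch of token ids to logits:

```python
import torch

def repeated_random_loss(model, vocab_size, seq_len=100, batch=16):
    """Behavioural check for induction heads: build random token sequences, repeat
    each one, and compare next-token loss on the first (unpredictable) copy with
    loss on the repeat. Models with induction heads do far better on the repeat,
    and the gap appears abruptly at the phase transition."""
    first = torch.randint(0, vocab_size, (batch, seq_len))
    tokens = torch.cat([first, first], dim=1)            # ...random tokens..., then the same again
    with torch.no_grad():
        logits = model(tokens)                           # (batch, 2 * seq_len, vocab_size)
    logprobs = logits[:, :-1].log_softmax(-1)
    nll = -logprobs.gather(-1, tokens[:, 1:, None]).squeeze(-1)
    # roughly: loss on the first copy vs loss on the repeated copy
    return nll[:, :seq_len].mean().item(), nll[:, seq_len:].mean().item()
```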
00:29:02
Speaker
So that's wild. Then the next totally wild thing about induction heads is that they're really important for this thing models can do called in-context learning. So a general fact about language models that are trained to predict the next word is that the more previous words you give them, the better they are.
00:29:24
Speaker
which is kind of intuitive. If you want to predict what comes next in the sentence "the cat sat on the mat" - like, what comes after "on the" - if you just have "the", it's really hard. If you've got "on the", it's a bit easier. If you've got "the cat sat on the", it's way easier. But it's not obvious that if you add more than a hundred words, it really matters. And in fact, older models weren't that good at using words more than a hundred words back.
00:29:51
Speaker
And it's kind of not obvious how you'd do this, though clearly it should be possible. For example, if I'm reading a book, the chapter heading is probably relevant to figuring out what comes next. Or if I'm reading an article, the introduction is pretty relevant. But it's definitely a weird thing that models can do this. And it turns out that induction heads are a really big part of how they're good at this, where
00:30:21
Speaker
models that are capable of forming induction heads are much better at this thing of tracking long-range dependencies in text. The ability of models to do this perfectly coincides with the dramatic bit of training where the induction heads are learned, and
00:30:36
Speaker
When we did things like tweaking a model too small to have induction heads with this hard-coded thing that made induction heads more natural to form, that model got much better at tracking how to use text far back to predict the next thing.
00:30:53
Speaker
And we even found some heads that seem to do more complicated things like translation, where you give it a text in English, you give it a text in French, and it looks at the word in English that came after the corresponding word in French. These also seem to be based on induction heads.
00:31:13
Speaker
These induction heads, they pop up in many different neural networks, in many of the neural networks that you've checked at a certain size. Is this perhaps analogous to how, say, eyes have evolved in different species and underwater and on land and so on? Because maybe induction heads are very generally useful for language models. Yeah. I think that's a pretty good analogy.
00:31:44
Speaker
That's actually a pretty great analogy. I think one thing worth drawing out from that analogy is that there's two big components to what kind of things an organism might learn. There's what is the environment it's in, what's useful, and then what constraints does it have and what's natural.
00:32:04
Speaker
Light is there, light is useful, understanding your environment is important, so eyes are valuable. And the same kind of underlying biology presumably incentivizes eyes, though I'm not a biologist, so I don't actually know how much eyes are implemented in the same way or not. And
00:32:24
Speaker
I think the analogy from biology to AI can be a bit overdone, but this underlying principle of repeated text happens a lot and it's useful to notice, and induction heads are a very natural thing for a model to express that's kind of efficient in the same way that, I don't know, walking on legs is more efficient than growing your own wheels, even if probably we could figure that out, biology could figure that out if it really had to.
00:32:49
Speaker
All right, so say induction heads are an example of a way in which we can understand what's going on inside of these neural networks.

Reducing AI Risk and Personal Motivations

00:32:59
Speaker
I would like us to pivot a bit to talk about how does this line of research help make us safer. So you mentioned that your largest motivation for working or perhaps say at least one of your motivations for working on this is to help reduce AI risk.
00:33:18
Speaker
How do you see mechanistic interpretability helping there? This definitely is only one of my motivations. I generally think of my life as having multiple different categories of motivation. There's the
00:33:32
Speaker
high-level, abstract, zoom in every few months and think about my life goals and meaning, where, okay, I think that human-level AI is probably going to happen, and plausibly might go badly, and I want to be doing things that address this because making the world a better place is of deeply important value to me. But also on a day-to-day basis, that's not the kind of thing I can really draw excitement from.
00:33:56
Speaker
And things like, oh my god, this is so fun, I get to stare at an alien organism all day and probe its brain for how it works. Or, huh, lots of people around me are really smart and excited about my work and think I'm really cool for doing it. Here's another thing that appeals to the kind of monkey inside my brain.
00:34:17
Speaker
And part of how I try to structure my life is to align these various different things, so that by default they point the same way as my abstract, high-level sense of what actually matters to me, which is really hard to feel in practice. I'm very concerned about models being deceptive, and in particular, models being deceptive in a way that's hard to notice, where
00:34:46
Speaker
the model realizes it's being trained, it realizes it's in its interests for its human operators to think it's doing what they want and being aligned and performing well, and so it learns to fake doing that. And if it's competent, which the models I'm scared about are, then it will just do this well. It will do this in a way that's as close as it can get to what a system that genuinely cared will do. And
00:35:16
Speaker
Fundamentally, when there are multiple solutions inside a model that both have the same outputs, one of the obvious things to do is to open up the black box of the model, look inside, and try to poke around at what's going on. And this is one of the ways I think mechanistic interpretability is likely to be pretty promising.
00:35:41
Speaker
In particular, I'm really excited about there being better feedback loops for people trying to align these systems where, especially to get closer to the worlds which are actually dangerous, where we might have systems that are capable enough to do damage and also capable enough to realize they might want to do damage. I think that having really good feedback on how well the alignment techniques work in a way that is not tied up with things like
00:36:12
Speaker
the model making mistakes, and which can ideally be as fine-grained and mechanistic as possible, is a pretty valuable thing to exist. Though I also just have a skepticism of careful, elaborate arguments in general, and think that anyone trying to reason about a complex problem like AI safety should share this skepticism. And I think for me,
00:36:37
Speaker
obviously, if we understand the terrifying black boxes that might be massively influencing the world, things are more likely to go better for humanity. I think that kind of very simple argument satisfies a different part of me. Yeah, I can see how that argument is powerful. So there's a question I have about
00:36:58
Speaker
mechanistic interpretability. This might help us understand what's going on inside of these black box systems. But if we discover that something we do not like is going on, is there a part of mechanistic interpretability that helps us control the system or steer the system in a different way? Or is that where you imagine another approach within AI safety takes over? So is mechanistic interpretability mostly about detecting bad behavior
00:37:24
Speaker
and not about actually steering or controlling AI systems in the right direction. So in my personal conception, mechanistic interpretability is about reverse engineering a trained system rather than about intervening on that system to change what it does. I think that it's easy to get into semantics here where
00:37:52
Speaker
I don't know - maybe if you're a doctor, you'd refer to your instruments as a very different part of medicine from actually cutting people open and making them better. But these are clearly highly related things, and better instruments enable better medication and better surgery techniques. And I also think it's kind of healthy for mechanistic interpretability to try to get feedback by actually doing things.
00:38:21
Speaker
There's this great paper called ROME, from David Bau and Kevin Meng, where they did some work to try to track down which bits of a model contain factual knowledge, like the Eiffel Tower is in Paris. But then the focus of the paper was on this memory editing technique, where they change it so the model thinks the Eiffel Tower is actually in Rome.
00:38:48
Speaker
And they got some pretty impressive results. Like if you ask it something like, can you give me directions to the Eiffel Tower, it will direct you to Italian railway stations rather than French railway stations, which is kind of wild. Though I also think that the paper is kind of flawed in some ways, and there's been some recent interesting criticism.
00:39:13
Speaker
But I think I'm pretty excited about that kind of attempt, where if we really understand how models work, then these kinds of things should be much easier. Because to me, a lot of the field of AI is kind of floundering around in the dark.
00:39:31
Speaker
And I mean, one concerning implication of that statement is that if we really understand the system, we should be able to make better ones faster.

Balancing AI Development and Safety

00:39:40
Speaker
And in my opinion, even if I want to live in a world with safe and aligned, powerful systems one day, I feel like making them better is vastly outstripping the understanding of them and making them safer. And I want the relative speed to be as great as possible on the safer side.
00:39:58
Speaker
So Neil, I have a kind of a weird question here. So what I could be worried about is say that you do some mechanistic interpretability research and then it's published online in a paper or say it's written up in a blog post.
00:40:14
Speaker
And this then becomes part of the training set for a future language model. Could the research you're doing now help future language models deceive us because it understands how we're trying to interpret it?
00:40:29
Speaker
Yes, there are totally worlds where that happens, though it does seem like this is not on my list of things I am concerned about in the short term, or even things that are on my list of ways I think that doing my research could be harmful. But it definitely could happen. One idea which I've seen being popular on the Alignment Forum that I find very fun is:
00:40:58
Speaker
there's this string that the people who made BIG-bench - which is this big new set of benchmarks and metrics we use to test language models - made, where they have this string they want to put in documents that contain the answers to the benchmark questions, such that when you gather the dataset for future language models, you can just say, don't include things with this string.
00:41:23
Speaker
And there's a very fun idea of just put this in like all alignment posts so we don't tip off the model. But so to me, this kind of links into the more interesting question of if you know that someone is trying to read your mind, can you outthink them? And my personal guess is that it is at the very least significantly harder
00:41:49
Speaker
for a model to outthink a mind reader than it is for the model to deceive an operator. One of the main reasons I believe this is I think that getting better at something is fundamentally about getting feedback, and models are trained to get lots and lots of feedback on their outputs. But models are not trained to get lots of feedback on their internals and how much a mind reader can see them.
00:42:17
Speaker
Note that this argument does break if people start doing things like training on an interpretability-inspired metric or giving feedback via it, which I think is a thing that I am extremely cautious about recommending anyone does for this exact reason. And having techniques so robust they can't be broken is much, much harder. I...
00:42:42
Speaker
I think that I don't know how I would think my thoughts such that a digital neuroscientist could read them and not see it. I think there's a couple of ideas people have flirted with that kind of make sense. Spending a bit of time thinking about what intentions I'm forming and why, and then spending most of my time just acting on these cached intentions without dwelling on how I came up with them, is one tactic.
00:43:09
Speaker
Another tactic would just be thinking in really weird and convoluted ways and keeping away from the dangerous thoughts. And I can imagine some of these working, especially if the model has some external memory it can store these cached intentions in, in a way that's kind of divorced from how they were generated.

Speculative AI Risks

00:43:27
Speaker
We should flag that all of this is very speculative and it is not your main concern. Completely. Yes. Yes. This is very much on my list of fun things I might speculate about with friends at 1AM.
00:43:39
Speaker
rather than a thing I actually think about when doing day-to-day research. Let's dig into some of what I could see as being problems with mechanistic interpretability as a research paradigm. I think the main worry I have is just that mechanistic interpretability is not fast enough.
00:43:57
Speaker
How I frame this is to say that by the time we get actual feedback on the systems that were cutting edge two years ago, there are now new systems, and we have not been fast enough to implement the learnings in the new systems. So in a sense, you're always playing catch-up - or at least that's my fear. Do you think that's a real worry?
00:44:21
Speaker
So this definitely is a real worry. I will point out that I would argue that the field of mechanistic interpretability of language models didn't even exist two years ago, and all existing work was on image models, such that I'm...

Scaling Challenges and Foundational Understanding

00:44:40
Speaker
I don't know, I feel less bad about that one. But I do think this is an important point, and
00:44:50
Speaker
So there's a couple of underlying questions here. There's, are we just capable of doing this at all? Maybe we could do GPT-3 given two years, but then if GPT-4 is 10 times bigger, it would take us 20 years, and that's not sustainable.
00:45:13
Speaker
But then there's just questions like, are we always going to be scrapping our previous work, or can we be building a field, building upon our previous work, even as the exact focus changes? And I'm reasonably optimistic about the building-on-our-progress one.
00:45:33
Speaker
I think that a bunch of our insights and conceptual frameworks from image models transferred, though definitely imperfectly, and less well than I'd have hoped. And there were lots of weird things about transformers. For example, image models don't have attention layers, and a bunch of the work has focused on understanding what's up with attention layers. And so that's a category of concern. I think on the
00:46:03
Speaker
On the problem of scale - can we actually interpret these humongous models, even if we can interpret a smaller model? - I'm kind of unsure. I am excited about things that try to take insights from mech interp and scale them and automate them.
00:46:26
Speaker
And I think that one kind of speculative dream might be that even if we know how we would reverse engineer GPT-4 given 20 years,
00:46:37
Speaker
before we get to the really scary systems, we'll hopefully have systems that are kind of near human and can take over a lot of cognitive work. And if we could have these systems try to help us and just do this 20 years of work in weeks instead, that seems in some sense like a massive win that could actually scale. But
00:47:02
Speaker
Yeah, I think one thing I will say that's maybe one of my more controversial opinions is that I think that the field of interpreting AI is much, much more bottlenecked by really rigorous true beliefs about networks than it is by good ideas for things that would scale and would actually generalize to larger models. And I think that it's just very hard to tell which of these ideas are good.
00:47:30
Speaker
which of the ideas have subtle flaws. And I feel like just doing work to find even some cherry-picked thing like induction heads, that we then try to really deeply understand,
00:47:45
Speaker
or trying to discover some more of the underlying principles behind the kinds of algorithms the models learn and the kinds of ways they express things, even if the exact details don't generalize, just seems like it should enable all further things to do with understanding these models. Because at the moment it feels like we're floundering in the dark.
00:48:11
Speaker
So if we imagine that we have AIs helping us interpret other AIs, so AIs helping us understand what's going on inside of other neural networks.

Can AIs Interpret Each Other?

00:48:22
Speaker
Couldn't this in a sense lead us back to where we started? Because maybe the way that one AI is being interpreted by another is perfectly understandable to the AI that's doing the interpreting.
00:48:35
Speaker
Perhaps it's an inefficiency to try to translate it into something that's understandable by humans. Perhaps an AI trained to interpret another AI could skip the step of producing something that's understandable by humans and thereby work faster. Perhaps it could outcompete systems that try to translate into humanese.
00:48:59
Speaker
I'm not sure I quite followed the question. Is what you're asking: if we try to train systems to interpret more complex systems, maybe those would be too slow and inefficient, such that a system that's not trying to translate this into humanese is just much more capable? Exactly. Imagine that we have a system, an AI, that's interpreting another AI.
00:49:23
Speaker
We have two such systems. We have one system that delivers something that's understandable by humans, and we have another system that just delivers red or green, let's say - just thumbs up or thumbs down, without giving us actual information about what's going on. It just says:
00:49:41
Speaker
this system is doing what you want it to do. Could you see a world in which the system that gives us very little information - that just gives us a thumbs up or a thumbs down - would outcompete the system that tries to do the hard work of translating a neural network into something that's humanly understandable? So to me, outcompete kind of seems like the wrong framing.
00:50:11
Speaker
The question is not whether, if you had two competing auditing companies, one of whom did the red-green thing and one of whom did the sophisticated thing, which one would make more money. If we end up in a world where it's possible to make the extremely expensive, sophisticated thing that translates into humanese, that's a massive win.

Mechanistic Interpretability vs. Auditing

00:50:33
Speaker
Like, pharmaceutical companies spend billions of dollars making sure their products are safe, when "YOLO, put it on the market and don't check" is much cheaper. But, you know, we have regulations that make it so that you need to do the really hard, expensive thing that's better at making things safer. And I'm really not that concerned about "it will just be too inefficient or too expensive". I'm much more concerned about: A, we won't be able to do it;
00:51:03
Speaker
B, it will be so prohibitively expensive or slow that it's not even possible to do practically, even if we knew how to do it in theory. And finally, that it's just not reliable. We can't trust this system, and we just get lost in the nested chains of more and more sophisticated systems. And on the specific point of
00:51:31
Speaker
practicalities and kind of competition. I think one thing which is easy to overlook here is that mechanistic interpretability is not necessarily about auditing every single action a system takes. It's much more about taking the system and trying to understand it.
00:51:51
Speaker
Clinical trials are maybe a good analogy in this context, where it's not like, when you deploy a drug in the world, you watch every patient as they take it. You study the drug in this clinical context, as close to the real world as you can make it, ideally by just giving it to people and then seeing what happens and getting data. And this can be expensive in a way that watching everyone who takes a COVID vaccine would be completely impractical.
00:52:21
Speaker
And models are even better, because they are just a long but finite list of numbers - the parameters that define the model. And if we can just study this, do a bunch of running the model on inputs, but fundamentally just understand what it represents, then I think that questions about competitiveness just matter a lot less there.
00:52:47
Speaker
Do you think that mechanistic interpretability will have to involve AIs interpreting other AIs at some point for it to scale to bigger systems? I think it is plausible that this is a thing that happens, and it is plausible that the most practical path to meaningfully understanding really complicated systems is via using
00:53:14
Speaker
less dangerous AI assistants. I do not think this is an important thing in the near-term future of the field. I do not think this is even necessary before we could reverse engineer a human-level system, at least enough to figure out whether it's safe. And this is not at all what I am personally working on.
00:53:32
Speaker
This is very much in the speculative "what I think could happen". I think in particular, if we're trying to figure out how we could align a vastly, vastly superhuman system, I struggle to imagine it really being doable for humans to perfectly reverse engineer it. Though I think that
00:53:52
Speaker
fully reverse engineering is not on the critical path from where we are now to having safe AI. Answering questions like - does this have internally represented goals, and if so, what are the goals? Is this being deceptive? - just seems significantly easier.
00:54:10
Speaker
Yeah, that's a great way to frame it. So I read one objection to the whole mechanistic interpretability paradigm or research field, which is that when you're trying to interpret a neural network, you cannot from interpreting the network itself understand how the system will react in different environments. So imagine that
00:54:36
Speaker
you cannot get information about how the system will react in a diverse set of environments. And so therefore, there's a fundamental limit to how much you can get out of interpreting these networks. Do you think that's a reasonable objection? My off-the-cuff answer is no.
00:54:54
Speaker
In particular, to me, one of the things that's distinctive about mech interp and reverse engineering the system is that you're understanding the algorithms the model employs. And if you actually understand an algorithm, you should be able to predict how it generalizes.
00:55:15
Speaker
You should be able to come up with adversarial examples that will trip the system up using your understanding. You should be able to predict what happens on weird settings. Induction heads are a good example, where understanding them let us predict that models could, if given complete random gibberish text, predict repeated subsets of that. And they can.
00:55:37
Speaker
And I'm unconvinced by that criticism. I do think there's some truth to it in that it's very easy to impose your own preconceptions on what behavior you should look for, what kinds of inputs you give it, such that you see what lights up and how you focus on things.
00:55:59
Speaker
And I think in particular, if you aren't trying to fully reverse engineer the system, but are just trying to localize the bits that matter most, the criticism has more teeth to me. But in my eyes, mech interp is one of the more promising things that isn't susceptible to that criticism. Because we're trying to understand what algorithm underlies the network itself, which is kind of understanding a more general feature of a neural network.
00:56:25
Speaker
Yes. And I think there's other techniques and approaches that also seem pretty valuable here. There's the area of adversarial training and adversarial robustness, where you get an adversary, like a human rater or another AI, to generate examples trying to throw off the system. And adversarial examples are a good example of this - the pictures listeners might have seen where you have a
00:56:54
Speaker
panda, you add what looks like a bunch of random noise, and it thinks it's definitely a wombat or something, even though to our eyes it looks exactly like a panda and they've just imperceptibly changed a couple of pixels. And yeah, one field is about trying to find the inputs that trip up the model, that are least like what it's expecting. And that's another angle to figuring out what it does in weird contexts.
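For the curious: the "noise" in those panda images isn't actually random - it's chosen using the model's own gradients. A minimal sketch of the classic fast gradient sign method, assuming `model` is any PyTorch image classifier that returns logits:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, image, label, eps=0.01):
    """Fast gradient sign method: nudge every pixel a tiny amount in whichever
    direction most increases the model's loss. `image` is a (1, C, H, W) tensor
    in [0, 1]; `label` is a (1,) tensor with the true class index."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = (image + eps * image.grad.sign()).clamp(0, 1).detach()
    return adversarial   # looks identical to a human, but can flip the model's prediction
```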
00:57:20
Speaker
But I'm honestly not really aware of many others that I think are as promising as mech interp here. Okay, so actually you think this objection applies the least to mech interp compared to other approaches? I don't like using superlatives, but kind of. And to be clear, I think it does apply. I just think it applies a lot less. Yeah.
00:57:44
Speaker
All right, so let's say that people have been listening to this and they are now excited about mech interp. They want to help contribute to this field.

Getting Involved in the Field

00:57:52
Speaker
What would be the best ways to get into this? Which papers or books or video series should you read? Should you go to hackathons? How would you go about getting into the field?
00:58:04
Speaker
Sure, so actually one of my side projects for the last couple of months has been trying to make this dramatically easier. And I have this post called Concrete Steps to Get Started in Mechanistic Interpretability, which you can find if you go to neelnanda.io/getting-started. And
00:58:25
Speaker
This tries to give a pretty concrete guide to what I think you should do if you actually want to learn about the field and potentially do research, and tries to collate a lot of the other resources I've made and other things I think are useful and good for this.
00:58:39
Speaker
Generally, I think one of the really great things about mech interp is that there's lots of important work to be done on tiny systems that can fit in a free Google Colab notebook in your browser, rather than needing some expensive supercomputer, and which you can just play around with and get fast feedback on within minutes, especially using some of the demos I and collaborators have made, where you can just
00:59:09
Speaker
play around with a model by running existing code. And I would recommend people just like go and screw around with the various educational materials, read some of the papers recommended. For people who want to dig more in, I have the sequence that I'm pretty satisfied with called 200 concrete open problems in mechanistic interpretability.
00:59:33
Speaker
where I try to both lay out a map of the field - what I think are the interesting subareas, big open questions, how I think about doing research in those areas, traps, pitfalls, tips -
00:59:46
Speaker
but also just a long list of concrete problems I would be excited for someone to go and work on. Would this be the best approach? Say that a listener has a general background in computer science and is interested in this, is the best way to jump into one of these 200 problems and try to solve it and then fail and then get feedback from trying to solve an actual problem? Or would you rather have a more expanded base of theory by reading some papers
01:00:15
Speaker
How would you start? I personally think that most people I see getting into the field spend way too long reading papers and trying to build a broad base and not enough time just doing things. And I think that building a broad base is important. But I think that the best way to build the base is by doing things, failing, noticing what you're stuck at, and using this to ground your learning.
01:00:43
Speaker
And I think that going back once you've gotten your hands dirty, going on a learning binge and trying to fill in a bunch of the holes in your knowledge, is solidly worthwhile. And I think there's a learning style which just prefers reading a bunch of papers, and my getting started guide also tries to give some concrete advice for that. But code fast, code early, try to build practical knowledge as well as theory is some of the most common advice I give to people trying to get into the field.
01:01:12
Speaker
And again, one of the things that's really nice about mech interp is there's lots of small, bite-size problems. I tried to rank the problems in my sequence by difficulty, and there's a rank for "I think someone who's new to the field could probably get a good amount of traction on this in a week or two." And I think that just trying to do something is one of the best ways to get grounding and get started. Fantastic. Perfect.