
Neel Nanda on What is Going on Inside Neural Networks

Future of Life Institute Podcast
Neel Nanda joins the podcast to explain how we can understand neural networks using mechanistic interpretability. Neel is an independent AI safety researcher. You can find his blog here: https://www.neelnanda.io

Timestamps:
00:00 Who is Neel?
04:41 How did Neel choose to work on AI safety?
12:57 What does an AI safety researcher do?
15:53 How analogous are digital neural networks to brains?
21:34 Are neural networks like alien beings?
29:13 Can humans think like AIs?
35:00 Can AIs help us discover new physics?
39:56 How advanced is the field of AI safety?
45:56 How did Neel form independent opinions on AI?
48:20 How does AI safety research decrease the risk of extinction?

Social Media Links:
➡️ WEBSITE: https://futureoflife.org
➡️ TWITTER: https://twitter.com/FLIxrisk
➡️ INSTAGRAM: https://www.instagram.com/futureoflifeinstitute/
➡️ META: https://www.facebook.com/futureoflifeinstitute
➡️ LINKEDIN: https://www.linkedin.com/company/future-of-life-institute/
Transcript

Introduction and Background

00:00:00
Speaker
Neel, welcome to the podcast. Glad to have you on. Thanks. Could you introduce yourself? Sure. Hey, I'm Neel. I do mechanistic interpretability research, which means I take neural networks, I try to open them up and stare at the internals, and try to reverse engineer what algorithms they've learned, attempting to take us from a world where we know that a network is good at a task to knowing how the network is good at a task.
00:00:29
Speaker
And this is really weird and hard, but also kind of fascinating. And I'm currently a grant-supported independent researcher working on this stuff. I previously worked at Anthropic under Chris Olah, who, in my opinion, basically founded the field. That's a debatable claim. And yeah, excited to be here, generally giving hot takes and trying to be helpful.
00:00:54
Speaker
Fantastic. So how did you get into this career? Where did you start? Could you tell us your story of, say, your education and how you got into AI safety and mechanistic interpretability? Sure. So my childhood is heavily marked by far too much time spent doing pure maths. I was very into doing maths Olympiads when I was in high school, was on the UK's international maths Olympiad team.
00:01:23
Speaker
That naturally flowed into doing maths at Cambridge and being really, really into doing the hardest pure maths courses I could. And then towards the end of my degree, I was getting increasingly burnt out on doing things that were pretty abstract but fundamentally kind of useless.
00:01:42
Speaker
And I'd gotten pretty into effective altruism, both via my university society and, when I was much younger, reading things like LessWrong or Harry Potter and the Methods of Rationality, which holds a very special place in my heart.
00:01:57
Speaker
And yeah, I was originally planning on going into finance, but at pretty much the last minute decided that AI x-risk seemed really important, that I was good at technical things, and that I could probably maybe do something useful here.

Career Shift to AI x-risk

00:02:15
Speaker
And I was kind of confused and risk-averse, but decided it was worth spending at least a year trying to investigate what on earth was up in the field. Do I think I can do anything useful?
00:02:26
Speaker
I spent a year doing a bunch of back-to-back internships at some different labs, and I didn't really vibe with any of the exact things I was working on, but by the end of that I had concluded: okay, something here matters. I believe there is important work to be done. I do not confidently believe that the people working on this already have things completely handled and solved. It seems like I should obviously just continue trying to find the best way to contribute to this.
00:02:56
Speaker
And then I massively lucked out and got this amazing opportunity to work with Chris Olah at Anthropic. He'd previously done a bunch of work reverse engineering image-classifying models, and Anthropic was this new startup he'd co-founded where he was working on reverse engineering language models, and I was kind of involved in the early stages of that field. And this was just fascinating.
00:03:24
Speaker
I think it kind of appeals to multiple sides of my personality. There's the mathematical, computer-science-loving, abstraction side, where you've got this network: fundamentally it is a mathematical object that is a stack of a bunch of linear algebra, which I can just reason about on a conceptual level, and I know all the moving parts and how they work.
00:03:48
Speaker
It also appeals to my truth-seeking side because I am trying hard to form true beliefs about this model. It's really, really easy to trick yourself and to come up with an elaborate, complicated, elegant hypothesis when actually there's something really boring going on. Like, you think that there's some special structure of the input that needs to be there that the model is picking up on. And then you just try giving it some random other input and it does the same thing.
00:04:18
Speaker
And it kind of also appeals to the part of me that likes empiricism and feedback because the gap between I have an experiment idea to test a belief about a model and I just write code and get feedback is just incredibly short and incredibly satisfying. And this means that I can just iterate and form true beliefs and it's delightful.
00:04:41
Speaker
Could we go back to the thing you said about being risk-averse? So how did that come about and how did you perhaps overcome it to choose to engage with this field?

Risk-Averse Nature and Community Support

00:04:51
Speaker
I don't really know how it came about. I think just personality-wise, I'm a fairly risk-averse person. Things like, does this feel like a kind of somewhat stable life path?
00:05:04
Speaker
Does this seem like it's kind of somewhat socially legible? Like people think it's kind of cool and worth caring about? Does this feel like something I could imagine myself doing for several years? And where I'm not afraid that the rug will suddenly be pulled out from under me? This just kind of feels like a pretty big part of my personality. So it's kind of entertaining that I've ended up as a grant-supported independent researcher.
00:05:33
Speaker
But I think things would have shifted. So I think that there are some ways in which being a bit conservative and a bit risk-averse can be helpful. Like, a classic mistake among impulsive young men, or young people in general, is to have a wild, crazy idea you take really seriously, do something high risk, and throw yourself into it entirely without checking: is this a terrible idea?
00:06:02
Speaker
And often the desire to be risk-averse is protecting someone. But also, often the world is just kind of wrong about important things. Like, I don't know, most people in January 2020 thought that COVID wasn't a big deal. Some of the people who thought it was a big deal were also the people I know who think x-risk matters the most. And
00:06:27
Speaker
I think it's important, and I've tried to become the kind of person who can notice when the desire to be socially conformist or risk-averse is taking me away from what I think is actually important. And I think the arguments that human-level AI is a thing that could happen, that it's gonna be a really, really big deal and will not by default be a good thing, do seem really compelling even if they're kind of weird.
00:06:54
Speaker
One thing that's helped a lot is just being intentional about the social circles I hang out in and who I talk to, and just having lots of friends who are also working on these problems makes the socially conformist part of me a lot more comfortable, because lots of really smart people I know think that AI x-risk matters. This is obviously a double-edged sword,
00:07:16
Speaker
Since, I don't know, if you hang out with people who all believe a certain thing, it's very hard to tell if that thing is just complete BS. But I think that making sure that I do spend time around people who share my values and who reinforce the kind of person I want to be has felt like a pretty good trade overall.

The Joy of Mechanistic Interpretability

00:07:38
Speaker
Do you think you reduced your uncertainty that say AI could become human level or that AI could lead to human extinction? Or did you think more in terms of how important it would be if, you know, what's at stake? So did you think about the stakes or did you think of, did you reduce your say empirical uncertainty about AI? What was most important? I would say both feel pretty important.
00:08:07
Speaker
Honestly, a lot of the personal decisions were dominated by: huh, I find working on mechanistic interpretability really fun and I seem good at it. Such that it doesn't really matter whether I think AI x-risk is a big deal versus an overwhelmingly enormous deal; it seems like the best, really important problem that I'm unusually well suited to work on.
00:08:32
Speaker
So just finding things fun, rather than weird, abstract, and kind of guilt-flavored, was pretty important. And I think just actually working on stuff, finding it fun, and getting feedback from the world that I was good at it was probably the most important thing there. I do think engaging more with the arguments, just talking with more people for whom this is actually a significant belief that has changed how they're living their life and what they're working on, helped make things feel more real.
00:09:02
Speaker
kind of in a similar way to how, I don't know, thinking about trolley problems feels like a very abstract thing when forming moral beliefs to me, while actually thinking through moral decisions in my life, and where I donate money, where I can do the most good, is much more visceral. I think that
00:09:24
Speaker
actually thinking through day-to-day research decisions, what I want to spend my career on, and the way those are grounded in these beliefs makes them more visceral, which just matters emotionally even if my factual beliefs don't change. I think I did change my mind on some empirical things.
00:09:40
Speaker
In particular, I now feel like I've got a much more coherent story for a path by which we get from where we are now to a world where we have human-level AI, and to a world where we do not have human-level AI that is aligned with what we want. And this is a pretty important part of taking AI x-risk seriously. But that feels kind of secondary to the more emotional things.
00:10:03
Speaker
which maybe makes me a bad rationalist, but I think people often undervalue these and overvalue kind of cold, hard logic.

Mathematics in AI Work

00:10:14
Speaker
So how attractive is mechanistic interpretability as a field for a person that's very into pure mathematics, say mathematical elegance? How developed is the math in mechanistic interpretability compared to other areas of mathematics?
00:10:33
Speaker
So I would say that it's not really an area where there's really deep and sophisticated mathematical theory, though there's enough mathematical things it's useful to be able to think clearly about that having a maths background is useful.
00:10:53
Speaker
I actually also recently worked on a project with one of my mentees where we reverse engineered small networks trained to do group composition, which is a kind of intro level maths course. And we discovered it had this algorithm involving representation theory, which is a kind of more involved early grad school level pure maths course. And this was a really cool investigation where both of us having maths backgrounds is really helpful. But in general,
00:11:23
Speaker
Yeah, I'd say the main things you need to know are just models are heavily based on linear algebra, multiplying vectors by matrices, having a really, really deep understanding for how to think about this, and reasoning through problems in linear algebra, and also problems in probability theory, calculus, and information theory, which may be the other things that feel most relevant.
00:11:46
Speaker
So I think I really enjoyed pure maths and I really enjoy mechanistic interpretability, but to me it feels more like a transfer of flavour than a transfer of specific domain knowledge. Like, day to day, I am thinking about a mathematical construct, a neural network that is a mass of linear algebra. I'm trying to reason about what kinds of functions could be expressed inside that linear algebra.
00:12:14
Speaker
And then I'm being a truth-seeking empiricist who runs experiments to check this. And this kind of iterative truth-seeking, while also reasoning through abstractions and trying to form true beliefs about a complex system, is part of what I loved about pure maths and solving hard maths problems and reasoning through a theorem. There's something I really enjoy about that. I also should emphasize that there are lots of people who do great mech interp work
00:12:42
Speaker
Sorry, that's short for mechanistic interpretability, and is much less of a mouthful. There are people who do great mech interp work who do not have maths backgrounds. But as a pure mathematician, that's kind of what I vibe with about it.
00:12:58
Speaker
I feel like it's still a bit abstract what it is that you actually do. So if we try to think about it in very down-to-earth terms, when you're looking at the screen in your day-to-day work, what are you looking at?

Analyzing Neural Network Behavior

00:13:11
Speaker
Are you visualizing the weights of the network? How does it look for you when you work? So the feel of the work is, well, when things are going well, the feel of the work is that I'm in this tight feedback loop.
00:13:28
Speaker
So I have this model. The model has some behavior that it can do. And I'm trying to reverse engineer how it does this behavior and why it does it. A concrete example is this. There was this great paper from Anthropic I was vaguely involved with on induction heads,
00:13:52
Speaker
which are a small part of a language model. We often use the term circuit to refer to a kind of sub-part of a model doing a specific task, one part of what the model is doing overall. So induction heads are this tiny circuit which detects and continues repeated text. For example, if you've got a
00:14:21
Speaker
If there's an article about Gus Docker, and the name Gus comes up followed by Docker earlier in the text: the way language models are trained, they're trained to predict what comes next, like what's the next word. So when it sees Gus, it checks where Gus was in the past, sees what came after that, and predicts that Docker should come next.
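(A rough, purely illustrative Python sketch of the induction heuristic described here; the real model implements this with attention heads rather than an explicit loop.)

```python
def induction_guess(tokens):
    """Toy induction heuristic: if the current token appeared earlier,
    predict whatever token followed it last time."""
    current = tokens[-1]
    # Scan backwards through earlier positions for a previous occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # the token that came after it before
    return None  # no repeat found, so this heuristic stays silent

# "... an article about Gus Docker ..." -- seeing "Gus" again, predict "Docker".
print(induction_guess(["an", "article", "about", "Gus", "Docker", "...", "Gus"]))
```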
00:14:44
Speaker
And so I'd take a model with some behavior I want to interrogate, like this. Oh, it knows that Docker comes next if Gus Docker came in the past. And then I try to understand how it does this. And there's a toolkit of different techniques I might use for this.
00:15:03
Speaker
One of the simplest is this idea called direct logit attribution. There's a bit of jargon in there, but what it basically means is: the model has this output called logits, which represents how likely it thinks each next word is, and this is a
00:15:25
Speaker
linear function of what each bit of the model has output, where the important thing about a linear function is that it means you can break this down into what was contributed by each bit of the model. And I can just check which bits of the model most contributed to the fact that it thought Docker came next, and then plot a picture where each component is colored by how Dockery it was. And that's one example of the thing I might be doing day to day, as a first step.
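(A minimal numerical sketch of the direct logit attribution idea, with made-up toy sizes and names; real models also have a layer norm in the way, which this ignores.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab, n_components = 8, 100, 5  # toy sizes, not a real model
DOCKER = 42                                 # hypothetical index of the " Docker" token

# Each component (attention head or MLP layer) writes a vector into the residual stream.
component_outputs = rng.normal(size=(n_components, d_model))
W_U = rng.normal(size=(d_model, d_vocab))   # unembedding matrix

# The logits are a linear function of the sum of the component outputs...
logits = component_outputs.sum(axis=0) @ W_U

# ...so any single logit splits into one additive term per component.
contributions = component_outputs @ W_U[:, DOCKER]
assert np.isclose(contributions.sum(), logits[DOCKER])

for i, c in enumerate(contributions):
    print(f"component {i}: {c:+.3f} toward the 'Docker' logit")
```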
00:15:54
Speaker
Yes, yes. Tell me if you like this analogy. So when you're looking at a neural network, perhaps that's somewhat analogous to the brain. And then you're trying to understand which parts of the network are responsible for which functions. Perhaps somewhat analogous to how different regions of the brain are responsible for different functions: the speech center, or the center for balance, something like this. How well does that analogy work?
00:16:23
Speaker
It's okay. I think the "neural networks are basically like human brains" thing is pretty overblown, though I do think digital neuroscience is a reasonable enough handle on what mechanistic interpretability is. In part, I think just the name "neural network" is really misleading, because it sounds, oh, super brain-like,
00:16:52
Speaker
where, if you look at the actual internal function of networks, they're wildly different from our brains in a bunch of ways. Like, in a network, every neuron is connected to every neuron in the next layer, what's called being dense, while our brains are sparse. And generally, in our brains, neurons are either firing or not firing, while inside the model, how much it fires is like a really, really important thing.
00:17:20
Speaker
And I'm not a neuroscientist, so it's very plausible that a lot of what I'm saying is just wrong, which is also part of why I don't want to claim that what I'm doing is like neuroscience. One fun kind of day-to-day difference is just it is so much easier to do experiments on a neural network than a brain. I can just
00:17:41
Speaker
To figure out what's going on in the brain, you need to stab in electrodes, and you might kill your patient, and IRBs often disapprove of this, and you often need to cut things open.
00:17:51
Speaker
have careful hand-washing procedures, and it sounds like a mess, while with a neural network, I can just say, this is a computer program. I will run it. I don't know what the computer program does, but I know what each of the steps is, even if the step is just multiply an enormous vector by an even more enormous matrix and get out a new vector. I can just save every intermediate thing that I get when I'm running this model, and I can even
00:18:22
Speaker
insert at some specific point in the model, let's replace these five entries in that vector with another five numbers, and then look at what happens. Yeah, right. So it's much easier to do experiments on digital neural networks than actual or physical neural networks, aka brains. But when you're doing these experiments, how does it actually work?
00:18:48
Speaker
Sometimes running an AI model can be very expensive and require a supercomputer. But when you're doing these experiments, you say that there's a tight feedback loop. So I assume that you're running them rather quickly. So you're running parts of a network or how does that work?
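(The intervention described a moment ago, overwriting a few entries of an intermediate activation and seeing what changes downstream, might look roughly like this with PyTorch forward hooks; the two-layer model here is a stand-in, not anything from the actual research.)

```python
import torch
import torch.nn as nn

# Stand-in model: two tiny linear layers (real language models are vastly bigger).
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 10))
x = torch.randn(1, 16)

saved = {}

def save_hook(module, inputs, output):
    saved["layer0"] = output.detach().clone()  # cache the intermediate activation

def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, :5] = 0.0                       # replace five entries of the vector
    return patched                             # returning a value overrides the output

handle = model[0].register_forward_hook(save_hook)
clean_out = model(x)
handle.remove()

handle = model[0].register_forward_hook(patch_hook)
patched_out = model(x)
handle.remove()

print("largest change in output:", (patched_out - clean_out).abs().max().item())
```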

Training vs. Inference

00:19:04
Speaker
Sure. So there's two important things to clarify here. There's the difference between training and inference, and the question of how big the model you're working on is. So when machine learning models are made, there's this stage called training where models are made up of parameters, a very long list of numbers that determine exactly what the model does,
00:19:33
Speaker
And then it's given an input, and it produces some output based on these parameters. And during training, you give the model a massive pile of data. It keeps looking at a data point for its current parameters. It makes a guess for, say, what the label of this image is or what the next word is on this text. And then it gets some feedback on whether it was right or wrong, and it updates every parameter to be a little bit closer to what would have given it the right answer.
00:20:04
Speaker
And you do this for a really, really long time. And at the end of this, by black magic, you have parameters that are good at the task, even though at the start the parameters were just randomly generated numbers. And this is really slow, really hard, really expensive. And importantly, the parameters are being changed iteratively to become better. And then at inference, you just fix the parameters
00:20:33
Speaker
and you look at the output. And this is much faster and cheaper. So for example, GPT-3, I believe, was trained on hundreds to thousands of GPUs. And running GPT-3 is a bit of a headache because it's just such a big model. It's got
00:20:54
Speaker
200 billion, well, 175 billion parameters. Generally these models are run on GPUs, and even the expensive GPUs can maybe hold 20 billion numbers. So you just need to spread it across a bunch of them, and it's kind of a headache. But it's much easier and faster if you're doing inference. And generally when interpreting a model,
00:21:21
Speaker
I'm taking an already trained model that people are using and then trying to treat it as this alien organism that I want to investigate. And yes, that's training versus inference.
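(A toy sketch of the training-versus-inference distinction: during training every guess nudges the parameters; at inference the parameters are frozen and you just read off outputs. Nothing here resembles GPT-3's scale, of course.)

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=2)        # [weight, bias] -- start as random numbers
xs = rng.normal(size=100)
ys = 3 * xs + 1                    # the task: learn y = 3x + 1 from examples

# Training: guess, get feedback, nudge every parameter toward the right answer.
lr = 0.1
for step in range(1000):
    x, y = xs[step % 100], ys[step % 100]
    guess = params[0] * x + params[1]
    error = guess - y
    params -= lr * error * np.array([x, 1.0])  # gradient step for squared error

# Inference: the parameters are now fixed; just run the model on new inputs.
print(params)                      # roughly [3, 1]
print(params[0] * 5 + params[1])   # roughly 16
```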
00:21:35
Speaker
Yeah, maybe we could, because you describe these models as working by black magic or being alien beings. So I think that's actually key, because what you're trying to do is understand this alien being by interpreting it. Do you feel like you're working with something that's truly alien from you? Or how do you feel when you're trying to interpret these models?
00:22:00
Speaker
Alien is a weird word. I think when I hear the word alien, it connotes something like fundamentally unknowable or bizarre and mysterious. And my current vibe about networks is networks are weird, but that they fundamentally make sense given their constraints, but that their constraints are very different from my constraints.
00:22:23
Speaker
In particular, the way I think about the world, and the way I think many people do, especially on the technical side, is in kind of logic and abstractions. I don't know, I write code, and the code has all of these "if this, then that" and "for everything in this list, do blah blah blah" statements. And it's all kind of very formal, precise, while
00:22:51
Speaker
Internally, networks are built on linear algebra. They have numbers that just kind of take on any real value. They aren't just like 0 or 1. And they're being multiplied together. And there's this whole rich mathematical theory of linear algebra for how you should think about this. And this just heavily changes what kind of algorithms are natural to express. I also want to emphasize that
00:23:20
Speaker
when I'm discussing algorithms, I'm mostly focusing on what it's natural to implement kind of on the parameter level of the model, rather than what the internal experience of the model might be like, if that's even a meaningful phrase, which plausibly it isn't. What do you mean by internal experience? We're not discussing consciousness here. How would you differentiate these two things? So it's like,
00:23:49
Speaker
When I'm personally thinking, it often feels like I'm doing this intuitive reasoning, that I'm not going through an explicit formal reasoning process, but I'm kind of weighing up evidence, I've got this gut feel. This especially holds when I'm doing something that's fairly intuitive, like if I'm talking to someone
00:24:15
Speaker
There's normally a part of my mind tracking, hmm, is this person annoyed with me? Are they enjoying themselves? Do they look happy? And it doesn't internally feel like I'm running an explicit algorithm. But I also have no idea how my brain is implemented. And very plausibly, there's some kind of algorithm in there.
00:24:36
Speaker
And there's kind of these three separate things: how I, Neel, think about algorithms; how my brain might implement an algorithm; and how a model might implement an algorithm. And these are all pretty different things. Inductive bias is the jargon for this: what ideas and algorithms is it natural to express in a model?
00:25:00
Speaker
One concrete example of this is, so one of my main projects as an independent researcher has been this work on grokking, where I was reverse engineering a model trained to do modular addition, kind of clock arithmetic, like 11 plus 5 mod 12 is 4 because it wraps around when it gets bigger than 12.
00:25:26
Speaker
And to me, modular arithmetic is basically about: you do addition, and then if it gets bigger than the modulus, you just make it smaller. And addition I think of through an algorithm like the kind of thing you learn in primary school: just add the digits, carry the 1, add the 10s, carry the 1, etc. And maybe a network would learn something like binary addition, because that's what's natural for a computer.
00:25:54
Speaker
But it turns out the model has learned this wild algorithm involving trig identities and Fourier transforms, where what it's basically decided to do is, rather than thinking about modular addition as numbers that eventually wrap around, it's thinking about numbers, say, mod 12 as rotations of some number of twelfths around the unit circle.
00:26:24
Speaker
And it's learned to combine these rotations. So let me know whether I've lost you. This is getting in the weeds. No, it's interesting. So what you're saying is, to summarize, it's invented or
00:26:41
Speaker
It solves a problem that we would solve with simple math using very advanced math or at least more advanced math. And this is a surprise to us. What's interesting to me here is that we are surprised because what you're describing is that
00:26:55
Speaker
On some level, we understand very well what's going on with neural networks. We have the math. We understand how computers work. But still, we're surprised at what they come up with. And in some sense, we understand what's going on. In some sense, these networks aren't alien to us. But in another sense, we can be genuinely surprised by how they solve problems.
00:27:21
Speaker
Yes. One way to frame it is kind of learning just another way of thinking. Like, what is it natural to express in this network? Where, for me, thinking about modular addition as rotations is really weird, though it does technically work. Like, if you rotate by 7 twelfths around a circle and then rotate by 8 twelfths
00:27:49
Speaker
That's 15 twelfths, but because you've wrapped around, that's just three twelfths. 7 plus 8 is 3 mod 12. This is a valid way of thinking about it. But it turns out that these geometric transformations are just much easier for a model to express because you can express all of this in terms of vectors and matrices, which are the natural language of a model.
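(A worked version of the rotation trick, as a sketch of the idea rather than the trained network's actual weights: represent a number a mod 12 as a rotation by a twelfths of a full turn, compose the rotations by matrix multiplication, and read the answer back off the resulting angle.)

```python
import numpy as np

P = 12  # the modulus: clock arithmetic

def rotation(a):
    """2x2 matrix rotating by a twelfths of a full turn."""
    theta = 2 * np.pi * a / P
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def add_mod_p(a, b):
    combined = rotation(a) @ rotation(b)           # compose the two rotations
    angle = np.arctan2(combined[1, 0], combined[0, 0])
    return round(angle / (2 * np.pi) * P) % P      # convert the angle back to a number

print(add_mod_p(7, 8))   # 3, since 15 twelfths wraps around to 3 twelfths
print(add_mod_p(11, 5))  # 4
```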
00:28:13
Speaker
And a lot of what has felt to me like the skill in learning mechanistic interpretability is learning to think in what kind of things a model might express.
00:28:27
Speaker
And I've actually been involved in this great work with someone I'm mentoring called Bilal Chughtai. Am I butchering his name? Sorry, Bilal. And we've recently released this paper looking at how models learn groups in general. As I briefly mentioned earlier, it uses this area of maths called representation theory,
00:28:47
Speaker
which basically extends this thing of rotations around the unit circle to geometric transformations of much larger groups, of like much higher dimensional spaces. Like we found that a model had learned to think about geometric transformations of a four-dimensional tetrahedron. And like, what?

Surprising AI Problem-Solving Methods

00:29:08
Speaker
But if your natural thing is linear algebra, this is a kind of reasonable thing to express.
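(To make the representation-theory point concrete, a toy example rather than the paper's actual setup: the symmetries of an equilateral triangle can be written as 2x2 matrices, and composing two symmetries is just multiplying their matrices, exactly the kind of operation that is cheap for a network built out of linear algebra.)

```python
import numpy as np

def rot(k):
    """Rotation of the triangle by k thirds of a full turn, as a 2x2 matrix."""
    theta = 2 * np.pi * k / 3
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

flip = np.array([[1.0, 0.0], [0.0, -1.0]])  # reflection across the x-axis

# Composing two symmetries of the triangle = multiplying their matrices.
g = flip @ rot(1)            # rotate by a third of a turn, then flip
print(np.round(g @ g, 6))    # applying that reflection twice gives the identity
```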
00:29:14
Speaker
Do you find yourself developing an intuition for how AI models think, so to speak? So do you begin to think the way that we might imagine that they think in terms of where everything is ultimately based on linear algebra and binary? So I do want to disentangle
00:29:39
Speaker
how a neural network might think in the terms of how my brain implements neurons firing and synapses and neurotransmitters and all of that crap, and how a network thinks in terms of my internal experience of sometimes having these intuitions, sometimes doing formal mathematical reasoning, having a conversation, et cetera.
00:30:06
Speaker
I don't even know if networks have something analogous to my intuition and internal experience, let alone wanting to claim the field is anywhere near being able to understand this, though hopefully it will be, since it seems kind of important. What I'm referring to is much more the mechanistic stuff of
00:30:31
Speaker
how a model could implement some specific algorithm it is doing. Like: check whether the name Gus appeared in the past. If it appeared, look at what came next and assume that's what's going to come next. Or apply two rotations to each other to get modular addition, and weird things of that form.
00:30:53
Speaker
Is there a danger that AI models will try to solve problems in very complex ways that are natural for them, but that we cannot understand? So can it become way too advanced for us to interpret and understand because it's easier for AI models to solve problems in a specific way? So this is definitely an issue, which I honestly just don't really know.
00:31:23
Speaker
Like, interpreting models is hard, and I wouldn't say that we're good enough at it yet that I could tell the difference between "we aren't good enough" and "it's just impossible".
00:31:35
Speaker
I'm honestly a lot more concerned that models will learn a thing that isn't kind of logical, and is just a massive soup of statistical correlations that turns out to look like sophisticated behaviour but which we have no hope of interpreting, than that they learn some galaxy-brained algorithm that's just too hard for me to understand but is in principle interpretable. Like, I'm so much happier if we live in the second world,
00:32:05
Speaker
if only because that makes it tractable to maybe make simpler systems, ones more comprehensible to humans, that we can understand, and then use these to help us understand the inscrutable alien brains, which do have internal structure.
00:32:23
Speaker
What I'm imagining here is a self-driving car, for example, or a robot navigating the world. And I'm thinking that this is extremely complex. And then if we're trying to interpret what's going on in a network like that, is it even approachable? Because who knows what's been developed in training, right? It could be extremely complex, involving all sorts of things that we have trouble untangling.
00:32:52
Speaker
So I think there are two things to distinguish here. So I often think about what's happening in a model in terms of these two concepts: features and circuits. So a feature is some property of the inputs the model is detecting. For example, this word is a noun, or this is the second word in Eiffel Tower. And a circuit is
00:33:21
Speaker
some actual computation or reasoning (reasoning in the mechanistic sense, not in the internal experience sense) that the model has learned how to do. Like: if I am looking for a city, and I recently had a big European landmark appear, look for the city that landmark is in,
00:33:47
Speaker
which is the kind of thing that a model uses to answer "the Eiffel Tower is in" with "Paris". And yeah, I think that there's the question of: do models learn weird, inscrutable circuits, some incredibly complicated algorithm I have no hope of understanding, but where the outputs and internal representations kind of make sense? And that seems kind of okay.
00:34:16
Speaker
If we can find a way to figure out what the features represented are, I'm pretty happy. And then there's: does it learn weird alien abstractions about the world that we have no hope of understanding? Like how, if you trained a model in the late 1800s in some hypothetical steampunk world, it might learn to reason about the world through quantum mechanics.
00:34:46
Speaker
And that does not seem like a thing that I would trust my ability to reverse engineer. That would be such a great story, if people had discovered quantum mechanics by reverse engineering an AI trained to model physics. Do you think something like that could actually happen with us? Could it be the case that, say, future advanced AIs will help us understand physics or chemistry in a deeper way than we understand it now?
00:35:15
Speaker
And I'm thinking that the story here would be that the model discovers something and we reverse engineer it and then develop better theory. Is that plausible too? So yeah, there's this idea called microscope AI from Chris Olah, which is basically this idea that, okay, making
00:35:38
Speaker
AI that can do complex human-level tasks seems really useful, but also absolutely terrifying, and we have no real reason to think these models are actually going to do what we want. And if there's one thing I've learned from the field of AI alignment, it's that we do not know how to align these things. So what if, rather than making a thing that acts on the world, we just make a thing, then never turn it on and just try to dissect its brain?
00:36:03
Speaker
and reverse engineer what it's learned and what kind of thought processes it has, and then either just understand these for the value of the understanding it brings, or try to make some simpler thing that implements these algorithms. And I
00:36:25
Speaker
think, and Chris would agree with me, that this is not actually a thing that's likely to happen before we get things that are just capable of acting in the world and doing things, and this being way more useful. But it's a fun, romantic image. Definitely a romantic image. True. And if we're in a world where we actually make human-level AI, and it is aligned with what we want, and we need to figure out what humanity's place in a world like that is,
00:36:55
Speaker
I definitely hope that we're able to understand science and ideas from these systems, even if
00:37:04
Speaker
I don't know, it is probably strictly easier to have a system that can explain the scientific concept that it understands to a human in an understandable way than to reverse engineer it. Could you take mechanistic interpretability and then place it into the larger context of AI safety? Where does it fit into the broader research field that is AI safety?

Diversity in AI Safety Research

00:37:29
Speaker
Sure.
00:37:32
Speaker
The broader field of AI safety is kind of a mess, and there's lots of different people who have very different views on the problem and how to go about solving it, which is good and how the world should be for as confusing and messy a problem as AI alignment. A couple of important axes along which people vary: there's the school of people who work on prosaic AI alignment, meaning
00:38:01
Speaker
we want to figure out how we would align a human-level system built on current-day techniques. As the field's focus, this kind of means neural networks and deep learning, but practically nowadays it normally means looking at large language models like ChatGPT and thinking about what a human-level system built on this kind of technology might look like.
00:38:26
Speaker
And I would say that most mechanistic interpretability work fits into that category of trying to really engage with these models and understand how to reverse engineer those. Another big dividing question is empirical work versus conceptual work.
00:38:48
Speaker
And there's a kind of underlying question about the importance of feedback loops, where you get these people who are working with these large models, or otherwise working on real machine learning systems, even if they're somewhat more toy, and trying to model parts of the alignment problem and figure out how you might solve it in practice.
00:39:09
Speaker
And then there's all sorts of people who are thinking through: what are the threat models we might care about? What are the ways that an AI might actually be dangerous? What kinds of solutions could we come up with? And do these work, on a conceptual level? And there are researchers like Evan Hubinger, and Paul Christiano's Alignment Research Center, who I think do some pretty solid work here.
00:39:35
Speaker
But one of the really hard parts about doing conceptual work is having feedback loops, and having the ability to distinguish between whether you are doing things that are correct and truth-tracking, or whether you have gone off into your vast castle in the clouds of pretty philosophical concepts that don't matter.
00:39:58
Speaker
So the AI safety field as a whole, do you think it lacks a unified paradigm? To what extent is AI safety still fumbling around trying to find something at base to build on or a unifying paradigm on which all researchers can agree on the basics? Say that quantum mechanics is such a paradigm in physics.
00:40:24
Speaker
how far are we from such a paradigm in AI safety? Sure. So I would at least somewhat argue that we don't really have a paradigm for deep learning, at least in the sense that no one really knows what's going on. Though arguably there is a paradigm in terms of how you make GPUs go brrr and produce systems that are better,
00:40:49
Speaker
which sadly is much easier than making systems we understand or which do what we want.
00:40:58
Speaker
And I would say that the field of AI alignment definitely does not feel to me like it has a clear paradigm. I would say that mechanistic interpretability feels like it's converging a little bit more on a paradigm, or at least a shared set of methods and techniques and philosophies, but also that there's lots of people with lots of very different opinions.
00:41:21
Speaker
I've recently been reading Thomas Kuhn's The Structure of Scientific Revolutions, which I believe is the origin of this paradigm notion, and lots of the things he refers to, like researchers bitterly arguing about the right techniques to use and the right ways to interpret evidence, I definitely vibe with. But I also want to challenge a bit the framing of the question, where
00:41:45
Speaker
To me, I feel like when I hear paradigm, there's this connotation that this is good and what you want. And I feel like it's easy to have an incorrect, overly restrictive paradigm. And I think that it's good and important for there to be some metric by which you can measure progress. Not necessarily an explicit metric like a benchmark, but just a kind of agreed-upon set of norms and principles to evaluate whether this work is good or not.
00:42:13
Speaker
And what are the problems you want to work towards? But I think it's also very easy to get really caught up in specific paradigms. For example, a lot of Eliezer Yudkowsky and Nick Bostrom's early work on AI x-risk is built on these specific frameworks around expected utility maximizers
00:42:35
Speaker
and of sovereigns who are modeled as these omnipotent things who can just act in the world and are just smart, so they can do whatever they want. I'm caricaturing. But I think there's lots of aspects of this that just don't match our understanding of modern deep learning. And I think if the field had coalesced around that paradigm, that could actually have been bad.
00:42:59
Speaker
My ideal is probably a field with maybe three robust paradigms, and then a long tail of people trying to make their own paradigm. So, I don't know, I think that the paradigm of figuring out how to interpret and align GPT-3 or ChatGPT or GPT-4, etc., how to make these systems do what we want, and how to go from that to
00:43:28
Speaker
eventual AGI, seems like by far the highest priority thing. And I think if there could be a paradigm forming around that, that would feel pretty good to me. But I'd never want it to be the entirety of the field, because we're trying to speculate about this future technology and what could go wrong with it. That is not the kind of thing where you should have consensus.
00:43:47
Speaker
And perhaps, as you mentioned, paradigms could become overly restrictive. And also maybe it would be too slow for a very fast-moving field such as AI to try to develop an overarching paradigm. If you think about how long it took for us to develop these paradigms in physics, we're talking decades to centuries. And perhaps we do not have that kind of time in the field of AI.
00:44:16
Speaker
I think there's lots of different things that you can mean when you say forming a paradigm, and in particular trying to form a paradigm, that it's worth trying to disentangle. I think that forming a consensus is very hard.
00:44:33
Speaker
Kind of, I don't know, "science moves on one funeral at a time" is the classic saying: the way you form a consensus is that the old guard who believe something else just, like, retire or die, and that takes ages. But that doesn't feel to me like the
00:44:52
Speaker
key thing here. To me, a researcher in the field's job is not necessarily building a consensus; it's trying to articulate ideas, trying to take everyone else's work and distill it into a cohesive worldview, trying to iterate on the right techniques and the right toolkit, trying to critique other people's work and find flaws in other techniques,
00:45:16
Speaker
until you slowly red team your way to a good technique and toolkit that actually works. And to me these are core parts of forming a paradigm, but also core parts of just doing good research. And one of the things I found really fun about doing mechanistic interpretability versus say pure maths is there's this wild forest of open questions.
00:45:40
Speaker
One of my recent side projects has been writing this sequence called 200 Concrete Open Problems in Mechanistic Interpretability, as a fun challenge to myself to just... there are so many! Can I write down 200? And I probably wrote down more than 200.
00:45:57
Speaker
So when I hear you speak about this, it sounds like you have a pretty well-developed worldview on AI safety. And perhaps you wouldn't say like that. You seem to have formed your own beliefs about this. Could you spend a little time talking about how you got here?
00:46:18
Speaker
So how do you have these independent beliefs? You have a sense of what's important, a sense of what's interesting to a larger extent than most people, I would think. Sure. So I think one caveat worth giving before I jump into this is the difference between having beliefs that are true in the sense that they track true things about the world and
00:46:45
Speaker
Like if you look back in a hundred years when we kind of know what actually happened, how correct am I on this cosmic scorecard? And then there's have beliefs that feel true to me. Have things that make sense. Have things where it just feels true and I understand how it works. Well, maybe this breaks down to two things. There's confidence, which is what I've been describing of
00:47:11
Speaker
You believe you know how it works, and it feels reliable and reassuring. And then there's having a model of it with moving parts, where you don't just know AI risk matters because Eliezer Yudkowsky said so and he's really smart. But it's more: okay, AI risk matters because
00:47:33
Speaker
models might be trained with reinforcement learning. Reinforcement learning might give them goals. There's no reason the goals need to be good. And where you have this chain of reasoning that you can interrogate and where you know what would happen if you change your mind on different parts of it. And I think that people often conflate the being confident with the truth tracking. I think I've done much better on the being confident than on the truth tracking, which is potentially concerning.
00:48:03
Speaker
And I do think I've done reasonably well on forming a model with moving parts. The jargon in alignment circles for this is often having a gears-level model. And I'm very happy to talk about that. But I thought this was a framing worth giving. Definitely.
00:48:22
Speaker
Please tell us about your gears level model of how AI safety works, especially I would be interested in how do we go from doing mechanistic interpretability to actually decreasing the risk of or decreasing AI risk or existential risk.
00:48:41
Speaker
I think my process of forming these beliefs was not particularly sophisticated. A little bit just kind of happened naturally over time. I read arguments, I thought the arguments were broadly reasonable, I talked to people, I tried to come up with arguments against what I'd been reading.
00:48:57
Speaker
I just spent time working in the field, being exposed to different viewpoints and discussing research and critiquing other people's research, and probably I could have optimized much harder for forming really great deep beliefs on this.
00:49:14
Speaker
A little bit of it just kind of happens naturally. I think in the past I was very stressed about finding the one true set of beliefs about AI alignment and the one true agenda that was the most important. I think this is just essentially impossible for as complicated and controversial a question as AI alignment, since many smart, competent people don't even think it's an issue at all, even if I tragically rarely come across well-reasoned arguments from those people.
00:49:47
Speaker
And... though I have recently seen some good attempts on that front. And... yeah. I have this post called, uh, concrete advice for forming views on AI safety, which people should go check out if they want more of my takes. Um, the core of it is just this incremental, things will kind of happen over time, expose yourself to things. I'm also a really big believer in trying to summarize and distill other people's beliefs.
00:50:14
Speaker
like take an argument or an article or a research you respect and try to distill down their arguments or beliefs to a kind of bullet point or tree structure where you see their beliefs and how they depend on the previous ones. I think that just doing this, even if you aren't trying that hard to evaluate how much you agree with each step in the tree, is really helpful for forming a model with moving parts. Though, obviously,
00:50:42
Speaker
figuring out what you actually think about it and trying to critique it is a great part of the process. Moving to the second bit of your question on how I think mechanistic interpretability fits into the big picture. So partially I don't know
00:51:01
Speaker
A lot of why I feel satisfied about working on mechanistic interpretability is that A, it's just really, really fun, and I seem good at it. And it's grabbed me so much more than the other areas of alignment that I've tried. I think there is something to be said for just, if there's lots of things that seem pretty good, work on the thing that you think you are comparatively best at.
00:51:28
Speaker
And part of what makes the risk-averse part of me feel satisfied is that I think, if we understood more about what was going on in potential human-level black-box systems that are shaping the world and potentially following goals we don't want, it seems like, yes, obviously understanding them will help. How could it possibly not help?
00:51:52
Speaker
And my work is not relying on a specific fragile argument where, if I change my mind about some key assumption, everything comes crashing down. I do think there are key assumptions, but most of them are the empirical and scientific questions of: will any of this work? With maybe a side serving of: will this be net helpful for making systems safer, rather than understanding them just letting us make them better and faster?
00:52:21
Speaker
which is another thing that I think anyone researching this stuff should be thinking about. But yeah, with all of these caveats aside, I should occasionally give actual answers instead of pure caveats. One story that I find pretty compelling is that...
00:52:47
Speaker
it's very easy to give an input to an AI system and see its outputs and to study its kind of overall behavior. And often looking at behavior and carefully choosing the inputs and outputs can tell you a lot about the system. But one of the times when this is messiest is when there are multiple solutions that both give comparable or even identical outputs and
00:53:18
Speaker
One of the ways I think mechanistic interpretability can really shine is if you're trying to reverse engineer what algorithms the model has learned, and there are multiple algorithms that lead to the same behavior, you should be able to distinguish between these two algorithms. And the field is very much in early stages, but I was involved in this work I mentioned earlier about reverse engineering grokking.
00:53:45
Speaker
where the model was doing a task where it both learned the solution of memorizing the data it had seen, and a solution that generalized to data it hadn't seen, and mechanistic interpretability got a lot of traction there. But I think that, more broadly,
00:54:06
Speaker
One of the key bad scenarios I see is we produce this competent, deceptive model which wants power and influence, and which is competent enough that it can deceive us and do exactly what an aligned model would do. And this seems like a very, very important scenario where there are multiple models, where there are multiple solutions a model could be implementing that have the same behavior.
00:54:35
Speaker
And I think that anything centered on deception seems like a really exciting, really promising way for mech interp to do something useful. I further think that there's a lot of ways this could cash out as being helpful.

Mechanistic Interpretability and AI Deception

00:54:54
Speaker
And I can see there as being a spectrum of how well we succeed at doing mechanistic interpretability.
00:55:01
Speaker
where there's the world we're kind of in now, where with months of laborious effort from many researchers, we might be able to take a model like GPT-3 and have some shot of noticing deception in it, even if it's very much not reliable. And this is way better than nothing. If we're going to deploy a system into the real world, being able to do things like comb through its neurons and check whether it's deceptive seems a lot better than nothing.
00:55:28
Speaker
But if we got better at this, then ideally we could use this to just reliably audit systems and get as confident as we can that they're good. I'm also really excited about this as a path to give better feedback loops to people doing the actual alignment work.
00:55:49
Speaker
If you try an alignment technique and you want to figure out whether it works, if you could look at the deception circuit and see how much this has changed the models, like how much the model is lying.
00:56:00
Speaker
or look at what goals the model has internally represented, if it's represented goals at all, and use this to better understand and use this to check how well your technique worked. That seems really exciting. Yeah, definitely. On a very basic level, if we think about the alignment problem, we want to align AI values with human values.
00:56:23
Speaker
And of course, we need to understand what are the AI values or the AI goals. And so in some sense, mechanistic interpretability is very obvious as an approach. Of course, not the content, but that you would want to do mechanistic interpretability seems very straightforward to me, or like an essential part of aligning AI.
00:56:45
Speaker
I should give the caveat that most of the arguments I'm giving are not actually mechanistic interpretability-specific arguments, they're why you should care about interpretability in general. Where interpretability is this pretty large subfield of AI, of which mechanistic interpretability is this, like, pretty niche but growing sub-area, and
00:57:09
Speaker
I got into mechanistic interpretability because it just seemed by far the most inspiring area in terms of forming what I thought were actual true beliefs about models. And it's so, so easy to shoot yourself in the foot and trick yourself into thinking you know what's going on, or form a very surface level understanding, or an understanding which wouldn't really generalize to a much larger, more sophisticated model.
00:57:35
Speaker
such that I personally think mechanistic interpretability is the best bet to make that happen. But this is very much my personal take, which people should not take as gospel that the only way to do interpretability is mechanistic interpretability, where we painstakingly look for specific circuits, and nothing else could work.
00:57:54
Speaker
Perhaps we could talk a bit more about this notion of an AI becoming deceptive. Sure. How would that come about? How would an AI become deceptive and how would interpretability or perhaps specifically mechanistic interpretability help us distinguish between an honest AI and a deceptive AI? Sure.
00:58:20
Speaker
There's a framework that I like a lot for breaking down how to think about AI x-risk, from this report by Joseph Carlsmith at Open Philanthropy on power-seeking AI, where he breaks down three traits a system might have.
00:58:39
Speaker
There's what he calls APS. There's advanced capabilities: a system being human level or better in a lot of important areas, though not necessarily all. There's planning,
00:58:58
Speaker
which is the system is capable of forming like real plans, it has goals, and it's able to form plans towards those goals. You can also think of this as kind of, it is an agent, whatever that word means, and strategic awareness, which is it kind of knows its context, it knows it is
00:59:24
Speaker
a language model being trained by a company called OpenAI on the internet to predict the next token. Or whatever. And the argument goes that if you have all of these
00:59:37
Speaker
that for most possible things a system might want, assuming the system has goals, which comes under this kind of agentic planning assumption, there are generic things like having power and resources and money and autonomy that are really useful towards those goals, unless you get exactly the right goals. This is the idea called instrumental convergence.
01:00:08
Speaker
And so it's going to want to do things like, say, have its supervisors give it good reward, or not turn it off, or release it into the world. It's going to understand its context. It's going to understand it's got these operators. It's being trained in some way that involves being given feedback. And it's going to have some understanding of the mechanisms by which these work, how people work, how people think, what our preconceptions are about an AI system.
01:00:38
Speaker
Strategic awareness sounds kind of weird and sci-fi-ish, but if you just ask ChatGPT what it is, it just gives a perfectly coherent answer. I even asked it to give me training code for a large language model like itself, and it gave code that would basically have worked with a little bit of editing, though fortunately not for the hard part of training it across thousands of GPUs. So models trained on the internet definitely know what large language models are.
01:01:05
Speaker
And then, yeah, so it realizes that it wants things, and it's easier to get things if you can control what your operators want. And knowing that you have operators is important context for deciding that this is worth doing. A system like AlphaGo, I don't think, is ever going to decide to deceive its operators, because it's just trying to play Go; I don't see it ever getting strategic awareness about the real world.
01:01:35
Speaker
and then it has the advanced capabilities to actually do this. So yeah, to summarize, planning towards these goals, having the awareness of its context such that it realizes the deception is useful, and the instrumental convergence such that deception is useful towards its goals and doesn't heavily conflict with them,
01:02:01
Speaker
and then being competent enough that it can use its understanding of its context to like successfully deceive. But if we can figure out what's going on inside these networks, then we can distinguish between deceptive AIs and honest AIs. This is the hope. Yes, this is the hope. Hopefully we can.
01:02:23
Speaker
Yes, so I definitely do not want to claim that this is a thing I am confident will be able to do, or that mechanistic interpretability is obviously the best path to do this, but I do think it is the kind of thing that I think mechanistic interpretability could be the right tool for. In particular, I think that
01:02:48
Speaker
things around what goals this model has, and what algorithms it is following for these actions, are the kinds of things I think mechanistic interpretability is targeting. One particular thing that makes me optimistic about this kind of angle is that one of the areas where mechanistic interpretability shines is in finding
01:03:13
Speaker
thinking and computation the model does that's very localized: that's specific to some behavior, and specific to some components of the model, like some layers and some neurons, where most of the model just doesn't actually matter that much for it. And it's an open question to me about how these models work,
01:03:41
Speaker
of whether this localization is a true thing in general, rather than just the somewhat cherry-picked areas the field has studied so far. But it's plausible to me that for really important things like goals, there is some semi-localized, semi-explicit representation
01:03:59
Speaker
And it is plausible to me that for something like deception, we can look at what kind of internal circuitry lights up in social situations, look for commonalities and try to reverse engineer the key things there.
01:04:18
Speaker
And that most of the model is going towards these generic capabilities, like understanding language and modeling the world's knowledge. But these are much less relevant from an alignment context, even if those are some things I care a lot about being able to reverse engineer as well.
01:04:35
Speaker
It would be great for us if goals were localized in the network because then we can find them, we can isolate them, and we can understand them more easily than if they were, say, spread across the entire network. Yes.