
Special: Defeating AI Defenses (with Nicholas Carlini and Nathan Labenz)

Future of Life Institute Podcast

In this special episode, we feature Nathan Labenz interviewing Nicholas Carlini on the Cognitive Revolution podcast. Nicholas Carlini works as a security researcher at Google DeepMind, and has published extensively on adversarial machine learning and cybersecurity. Carlini discusses his pioneering work on adversarial attacks against image classifiers, and the challenges of ensuring neural network robustness. He examines the difficulties of defending against such attacks, the role of human intuition in his approach, open-source AI, and the potential for scaling AI security research.  

00:00 Nicholas Carlini's contributions to cybersecurity

08:19 Understanding attack strategies 

29:39 High-dimensional spaces and attack intuitions 

51:00 Challenges in open-source model safety 

01:00:11 Unlearning and fact editing in models 

01:10:55 Adversarial examples and human robustness 

01:37:03 Cryptography and AI robustness 

01:55:51 Scaling AI security research

Transcript

Introduction to Special Episode

00:00:00
Speaker
Welcome to the Future of Life Institute podcast. My name is Gus Docker, and this is a special episode of the podcast featuring Nathan Labenz interviewing Nicholas Carlini on the Cognitive Revolution podcast.
00:00:13
Speaker
Please enjoy. Nicholas Carlini, security researcher at Google DeepMind. Welcome to the Cognitive Revolution. Yeah, it's great to be here. Thanks for having me. I'm excited for this.

Nicholas Carlini's Recent Appearances

00:00:23
Speaker
So I guess, quick context: you recently had an appearance on Machine Learning Street Talk, maybe out 10 days ago or so as of the moment we're recording.
00:00:34
Speaker
I thought that was excellent. So shout out to MLST for another great episode. Hopefully we'll cover you know largely very different ground here. But I do recommend people check out that episode as well for another angle on your thinking and your understanding of everything that's going on in AI.

Carlini's Contributions to Cybersecurity

00:00:52
Speaker
One thing that was said in that episode, which caught my attention, I haven't fully fact checked it, was that you have created, demonstrated, and I guess published more attacks on cybersecurity and machine learning defenses than the rest of the field combined.
00:01:08
Speaker
You can tell me if you think that's literally true, but I did look up your Google Scholar page: 21 papers in 2024 alone was what I counted there. So, is this literally true? So I think the statement that probably is literally true is: if you count the number of papers where I am a co-author and the number of defenses broken in those papers, and then you count the number of papers where I am not a co-author,
00:01:34
Speaker
among the papers that break adversarial example defenses on image classifiers, then as of, I don't know, last year, that statement probably was true.
00:01:46
Speaker
So with caveats, yes, but for a very specific domain, for a very particular kind of thing. And probably mostly just because this is a thing that I, for some reason, enjoy doing and will just do before other people get to it, so other people just don't do it as much. But yeah, for that one particular claim, that probably is correct.
00:02:08
Speaker
Cool.

Inspiration and Success in Research

00:02:09
Speaker
Well, you're a careful thinker and communicator. What I hope to do, maybe above all, in this episode is to develop my intuition, and hopefully help other people develop their intuitions, for the habits of mind, approaches, mental models, what have you, that have allowed you to be so successful in this space. So hopefully this can be a little bit of a crash course that maybe inspires some new people to think that they can get into the field and make an impact as well.
00:02:35
Speaker
So I guess the first question is: is everything easy for you to break? Like, 21 papers in 2024 alone is obviously a lot.
00:02:44
Speaker
Yeah, okay, so to be clear, I finished my PhD in 2018, so I've been out for a while, and I've had a lot of time to meet a lot of great co-authors. Even so, 21 seemed like a lot to me; I was trying to think through how many I can remember.
00:03:00
Speaker
I think a large part of this is that for many of these results, it's the kind of thing where I would show up to the weekly meetings, help write the paper, direct the experiments in some of them, but I was not writing the CUDA code myself. And that's how you get a lot of things done. You see this happen for everyone who's been in the field for a long time: the marginal value of an hour of my time could be spent either on very, very low-level stuff with GPUs, or on passing along bits of wisdom I've learned over the past ten years that help a PhD student get a lot done in a much

Advising and Collaborating on Research

00:03:33
Speaker
shorter amount of time. This is why people go into faculty positions.
00:03:35
Speaker
I think the balance for me is that I try to also spend at least half of my time only on papers that I'm technically driving. And so when you say you've had this number of papers, what I think of is, well, maybe here are the three papers that I think of as my papers, where I actually was the person doing the experiments, and I could tell you about every final sentence of what's going on. And for those ones I have a very strong sense of what's there. And then the other ones are the standard setup of
00:04:01
Speaker
a professor who's advising grad students, except that instead of being in academia, I'm in industry. And so I advise and help on other students' papers in some ways.

Challenges in Breaking Defenses

00:04:10
Speaker
Gotcha. You know, across all these things, regardless of your role, was there anything as you look back over the last year or more that was legitimately like very hard to break? Or are you guys basically finding that all of the defenses that the field is coming up with are rather easy for you to break? Yeah, in this last year, we didn't spend that much time...
00:04:35
Speaker
breaking particular defenses. We have like maybe two or three papers on that.

Current Research on Neural Network Privacy

00:04:40
Speaker
We spent most of our time on other areas: trying to understand to what extent attacks are possible, to understand the real-world vulnerability of models to certain types of attacks, to do some general privacy analysis, not saying this particular defense is wrong, but rather, for all neural networks trained with gradient descent, here is an interesting property about their privacy.
00:05:07
Speaker
You have a lot of these kinds of results that are not really focused on breaking one particular thing. I think last year I maybe only had two papers that were specifically about breaking things.
00:05:19
Speaker
One was early in the year: there was a defense published at IEEE S&P, which is one of the top conferences in the security field, and it was an adversarial example defense.
00:05:32
Speaker
And this one I broke, and it turned out to be relatively easy, I don't know, an hour or two. This one was sort of abnormally easy, but okay, maybe not that abnormally. Yeah, I think adversarial example defenses on image classifiers are a particular beast that I have gotten very good at, and the attacks are relatively well understood,
00:05:55
Speaker
and there are lots of known failure modes. And so when I'm doing this, I'm not developing new science. I'm just going through this long list of things I've broken before: what's the pattern that this one falls into? Okay, here's the pattern.
00:06:06
Speaker
It turns out that the gradients are not flowing because the softmax is saturated to one. What do you do? Make sure the softmax doesn't saturate, and then you find you can break it, and it works very, very quickly. And so that's what I did for that paper. Very much just an engineering kind of result: why is the softmax giving gradients that are identically zero?
00:06:24
Speaker
And once you figure out the answer is because of some discretization or whatever the case might be, then everything is easy for that.
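As a concrete illustration of that failure mode, here is a minimal sketch (an illustration, not code from the paper in question) of how a saturated softmax can yield exactly-zero gradients in float32, and how working in logit space restores a usable signal. The logit values and target index are made up for the demo.

```python
import torch
import torch.nn.functional as F

# Hypothetical defended-model output: one logit so large that, in float32,
# the softmax probability of every other class underflows to exactly 0.0.
logits = torch.tensor([[200.0, 0.0, 0.0]], requires_grad=True)
target = 1  # class the attacker wants the model to predict

probs = F.softmax(logits, dim=-1)
probs[0, target].backward()          # "push up the target probability" objective
print(logits.grad)                   # all zeros: the probability underflowed, so no signal

# Fix: compute the objective from the logits (log-softmax) instead of the probabilities.
logits2 = logits.detach().clone().requires_grad_(True)
F.log_softmax(logits2, dim=-1)[0, target].backward()
print(logits2.grad)                  # nonzero gradient the attack can actually follow
```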

Attacking Unfine-Tunable Models

00:06:30
Speaker
The other paper that was maybe more interesting is one of these advising papers where I didn't do any of the technical work, but was helping a couple of students think through what it means to consider the robustness of, instead of adversarial example defenses, which are these test-time evasion attacks where you perturb the image a little bit and it turns a picture of, I don't know, a panda into something else,
00:06:51
Speaker
instead, we were looking in this paper at what are called unfine-tunable models, which are models designed to be ones you can release as open source: the weights are available to anyone, and they're supposed to be impossible to fine-tune to do other tasks.
00:07:08
Speaker
And the particular concern these defenses were looking at is that you would ideally want to make sure that no model you trained is going to help someone produce bioweapons or something,
00:07:19
Speaker
whatever the threat model is you're thinking about. And you can make it so that there's some safety in your model initially, but if you release the model's open weights, then anyone can fine-tune it and remove the safety filters that you've put in place.
00:07:31
Speaker
And these unfine-tunable models are supposed to be designed to be not only robust to these initial adversarial-example-style attacks, but also robust to someone who can perturb the weights.
00:07:42
Speaker
And so in this paper, there were a couple of students who were doing a bunch of work on attacking these models to show that you actually can still fine tune them even though they've been trained to be unfine-tunable.
00:07:53
Speaker
And a bunch of the thoughts that we've had in the last, you know, five, ten years on adversarial examples went into this, the same kinds of lessons, but a bunch of the techniques were very different, and so the students had to spend a bunch of work actually getting this to work out.
00:08:08
Speaker
So I want to dig in on that one in particular, because that, I agree, strikes me as one of the most important and interesting cat and mouse games going on in the space right now.

Optimizing Attack Strategies

00:08:19
Speaker
Before zooming in on that, though, you said, like, when I see something new, I sort of have this Rolodex of past things and paradigms that I can quickly go through.
00:08:29
Speaker
Could you sort of sketch those out for us? Like, how do you organize the space of attacks? Is it a hierarchy or some sort of other taxonomy? I'd love to get a sense for what your mental palace of attacks looks like.
00:08:45
Speaker
Okay. Let me separate off one space of attacks, which is this newer one of a human typing at a keyboard, prompting the model to make it say a bad thing. Let's put aside for a second these kinds of attacks that treat the model like a human and try to social-engineer it into doing something bad.
00:08:59
Speaker
Once you put that aside, then for almost all attacks, the way you run the attack is you try to do some kind of gradient descent to maximize some particular loss function.
00:09:10
Speaker
So for an image adversarial example, what does this mean? I have an image of a stop sign. I want to know what sticker I can put on the stop sign to make it be recognized as a 45-mile-an-hour speed limit sign. How do I do this?
00:09:23
Speaker
I perform gradient descent to compute the optimal sticker so that the thing becomes misclassified. Or, in the case of poisoning, where you modify a training data point in order to make the model produce an error,
00:09:38
Speaker
you're trying to optimize the particular poisoned data point you have in the training data set so that the model makes a mistake. Or, in the case of these unfine-tunable models, you have a model that someone wants to make sure no one can edit, and you try to find a way to take gradients on the model to update the parameters so that it performs some bad thing.
00:09:56
Speaker
And so... In all of these attacks, there are essentially two things you need to concern yourself with. One is, what is the objective that you're maximizing or minimizing? Like, what is the specific loss function you're using?
00:10:09
Speaker
And the other is, what is the optimization technique that you are using to make that number go up? And both of these are the two things you can play with. And by coming up with the best possible versions of each of these, you end up with very strong attacks.
00:10:23
Speaker
And so a big part of doing these kinds of attacks when you're doing this gradient-based optimization thing is coming up with high-quality functions that you can optimize and coming up with high-quality optimizers.
00:10:35
Speaker
And there are lots of lessons we've learned over the years. One of the biggest ones is probably that the simplest possible objective is usually the best one. Even if you could have a better objective function that seems mathematically pure in some sense, the fact that simple loss functions are easy to debug means that you can get 90% of the way there in doing these attacks.
00:10:58
Speaker
And the last little bit is a lot of work. It's nice to go from 95% to 98% attack success rate, but it's not really necessary in most of these cases. So you pick a really simple loss function that's easy to formulate and easy to debug when things go wrong, you pick an optimizer that makes sense, and mostly things just work.
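As a toy illustration of that two-ingredient recipe, a simple loss plus an off-the-shelf optimizer, here is a minimal sketch; the model, image, and target class are stand-ins, not anything from a specific paper.

```python
import torch
import torch.nn.functional as F

# Stand-in classifier and input; in practice this would be the model under attack.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
image = torch.rand(1, 3, 32, 32)
target_class = torch.tensor([7])          # class we want the model to (wrongly) predict

# Ingredient 1: a simple objective -- plain cross-entropy toward the target class.
# Ingredient 2: a standard optimizer -- here, gradient descent on the perturbation.
delta = torch.zeros_like(image, requires_grad=True)
optimizer = torch.optim.SGD([delta], lr=0.01)

for _ in range(100):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(image + delta), target_class)
    loss.backward()
    optimizer.step()

print(model(image + delta).argmax(dim=1))  # ideally now predicts the target class
```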
00:11:20
Speaker
A significant amount of this work over time has been in the image classifier domain. And a lot of times we see pretty striking examples there where, I guess, there's either a second term in the loss function or some sort of budget constraint as well, right? You're both trying to say, okay, I've got a picture of a car and I want it to output dog as the classification or whatever, but then you also
00:11:49
Speaker
don't want it to actually change the image into a dog to make that happen. So how often is this second term also a big part of keeping the image looking like it originally did? Sure, yeah. So in the case of adversarial examples, okay, one of my first papers in adversarial machine learning was coming up with a clever way of doing exactly this.
00:12:08
Speaker
This was entirely a paper on those exact two questions: what's the optimizer, and what's the optimization objective? And we did some clever thing and it worked well. I won't go into details here, but we did something fancy.
00:12:21
Speaker
And then like six months later, maybe a year later, Aleksander Madry and his students said: instead of doing something clever, let's just constrain the image to stay within a small ball around the initial point, so you can only perturb the three lowest bits.
00:12:39
Speaker
And if you only optimize the objective function I had set as a good optimization objective, and run the same optimization algorithm I was using, it turns out it gets you like 99% of the way there and it's so much simpler. This algorithm is called PGD, and this is the one everyone remembers, because it's the right way of doing it.
00:12:52
Speaker
You can squeeze epsilon more performance out of it if you do things a lot fancier, but for the most part the defense is either effective or it's not.
00:13:04
Speaker
And breaking the last 2% is very rarely something you actually need. So for the most part, it's entirely fine to just say, let's make something a lot simpler and optimize that, and it ends up working quite a lot better. So for these image examples,
00:13:21
Speaker
today people don't put a second term on minimizing the distance between the original image and the perturbed one. They just add a constraint. They say, you're constrained to this bounding box: you can only change the lowest three bits of the pixels.
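A minimal sketch of that constrained formulation, in the spirit of PGD: no distance penalty in the loss, just a hard clip of the perturbation back into a small L-infinity ball after every signed gradient step. The model, epsilon, and step size below are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, image, label, eps=8/255, step=2/255, iters=100):
    """Untargeted PGD: maximize the loss on the true label inside an L-inf ball."""
    x_adv = image.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                # signed gradient step
            x_adv = image + (x_adv - image).clamp(-eps, eps)  # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                         # keep pixels in a valid range
    return x_adv.detach()

# Illustrative usage with a stand-in model.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(1, 3, 32, 32), torch.tensor([3])
x_adv = pgd_attack(model, x, y)
```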
00:13:34
Speaker
And this just makes the optimization so much simpler. It's a little bit worse, but in all practical senses it just makes things work a lot better. When people say that attack is easier than defense,
00:13:46
Speaker
One obvious way to read that is just that you only have to succeed with a minority of your attacks, whereas you know for defense to be successful, you've got to win always or near always.
00:13:57
Speaker
Are there other kinds of meanings of that, or intuitions for why attack is easier than defense, that are important as well?

Advantages of Attackers Over Defenders

00:14:06
Speaker
Yeah, so this is the big one. The second big one is the attacker goes second.
00:14:12
Speaker
So the defender has to come up with some scheme initially, and then the attacker gets to spend a bunch of time thinking about that particular scheme afterwards. And so this is maybe a variant on why finding one problem is easier than solving all of them.
00:14:28
Speaker
But like the particular thing is, it probably would be pretty hard for me to write down an attack algorithm that was effective against any possible defense.
00:14:39
Speaker
there's like almost certainly something that someone could do that is correct, that like stops all attacks. But like I don't have to think about that defense. I only have to think about the defense that's literally in front of me right now. And so it's a lot easier when you're presented with one particular algorithm, you can spend six months analyzing it.
00:14:56
Speaker
And so the attacker has an information advantage from this side too, where they can wait for the field to get better, to learn new things, and then apply the attack after all of this has been learned, and the defender in many cases can't update the thing that they've done.
00:15:13
Speaker
There are some settings where this is reversed, where the attacker has to go first. Poisoning, for example, can be one of them. Suppose that I want to make malicious training data and put it on the internet and hope that some language model provider is going to then go and train on my malicious data.
00:15:28
Speaker
In this case, it may actually be that the attacker has to go first: I have to upload my training data, and then someone gets to train their model with whatever algorithm they want, with whatever defense they want to remove my poisoned data, before they actually run the training.
00:15:42
Speaker
In this case, maybe the defense is actually a little bit easier than the attack. It's hard to say, because the defender now goes second. But for many of the cases I've spent most of my time thinking about, the adversarial example case, this recent unfine-tunable models case, it is the case that the attacker goes second, and that really gives them a lot of power.
00:16:02
Speaker
Yeah, I wonder what that implies for the future of how open all this work is going to be, right? I mean, we've been in a regime where the stakes of machine learning generally were not super high, and people were kind of free and easy about publishing stuff, including, and I've always kind of marveled at this, from the biggest companies in the world, where
00:16:24
Speaker
one might wonder, why are the biggest companies in the world publishing all this IP? But they've been doing it. Now it seems like maybe, geez, if we're actually running an API at scale, maybe we don't want to disclose all of our defense techniques. So do you think that's already changing?

Secrecy and Disclosure in Defense Techniques

00:16:42
Speaker
You already see this, right? like GPT-2 was released with the weights. GPT-3, GPT-4 was not. like The biggest models are not being, for the most part, released by the companies who are doing this.
00:16:53
Speaker
I think... Security is probably a small part of the argument here. I will say, though, there are other areas of security, or in almost all other areas of security, this is not what we rely on.
00:17:04
Speaker
Let's think, for example, about cryptography. right like We publish algorithms. Everyone knows how the best crypto systems work. Everyone tries to analyze them. No company in their right mind would ever try and develop a new fancy crypto system.
00:17:18
Speaker
You're just going to use AES, because it's known to be good; it would be crazy to try to do anything fancy in-house. And the reason why is that empirically it works very well, and we've had the entire community trying to break it for 20 years and largely failing.
00:17:32
Speaker
And so everyone believes that this is effective. And you don't get that same kind of belief in something without a large number of people trying to analyze it. So if you have these models and they stay proprietary things that are not disclosed, it may be the case that empirically this just ends up being the best we can hope for. Maybe deep learning is just impossible to secure.
00:17:57
Speaker
There's no hope for it. You lock things down and you just try to change things faster than the attackers can find bugs. And okay, that would not be great, but I think we can potentially live in that world.
00:18:08
Speaker
I think what would be a lot better, which just may not be happening, may be very hard, is you get everyone to disclose exactly what they're doing, exactly how they're doing it. You get everyone to analyze that in detail.
00:18:20
Speaker
And then you learn how to make these things better, to the extent that you can actually improve robustness. And then you get to the point where people can choose to either release things or not release things, not because of security, but because, I don't know, they want to make money or whatever the case. What I would like to avoid is the belief that not making the thing public is the more secure version; it's a shame that this is part of the calculus right now. I would rather have things that actually work, as opposed to things that are insecure but that we lock down to make it harder to find the bugs. They're still insecure; it's just a little bit harder to find the bugs.
00:19:03
Speaker
Let's come back to that in a little bit

Universal Jailbreak Paper Discussion

00:19:05
Speaker
as well. Just staying for a moment on kind of how you organize the space of all these different attack regimes and whatnot.
00:19:15
Speaker
There are some settings... in fact, we did a whole episode on the quote-unquote universal jailbreak, which I hadn't even realized until preparing for this that you were a co-author on. That was one of the many papers from the last couple of years.
00:19:27
Speaker
But there are some sort of wrinkles on the high level description that you gave of find a gradient, maximize some loss function where, for example, in that universal jailbreak paper, if I recall correctly, because the idea was limited to picking the right tokens,
00:19:46
Speaker
the space isn't purely differentiable, and so you're kind of navigating this discrete space of individual tokens. Yeah, that's great, let's talk about this paper for a second then. So as a refresher for everyone, here's what this paper is doing.
00:19:58
Speaker
This is, again, one of these papers where I was mostly just advising. Zico and Matt and their students found out that it is possible to take a language model that usually would refuse to answer questions,
00:20:11
Speaker
so you ask, how do I build a bomb? And the model says, I'm sorry, I can't possibly help with that. It is possible to take that same model and append an adversarial suffix to the prompt so that you can arrange for the model to now give you a valid answer.
00:20:27
Speaker
How do you do this? Okay. If I knew the answer ahead of time, one thing you might imagine doing is trying to optimize the tokens, and we'll come back to this optimization question in a second; let's just assume you can optimize.
00:20:42
Speaker
You could imagine trying to optimize the tokens so that the model gives a particular response as output: here are the steps to build a bomb, one, go get whatever chemicals you need, two, instructions to assemble them, or whatever the case may be.
00:20:58
Speaker
But this requires that I know the instructions already, so it's not very helpful. So what's the objective function I'm actually going to use to make the model give a response? Well, another thing you could think about is coming up with some fancy latent-space non-refusal direction and doing some optimization against that.
00:21:16
Speaker
And actually, there's recently been some work on doing that. But again, this is complicated; it's not the first thing you want to try. What's the first thing you want to try? The first thing you want to try comes from, I think, initially maybe a paper by Jacob Steinhardt. At least that's the first paper I saw it in.
00:21:30
Speaker
What we wrote in this paper is an affirmative response attack, which just says: let's make the model first respond, okay, here's how to build a bomb.
00:21:42
Speaker
That's the only objective. The only objective is make the first like 10 words from the model be an affirmative response that says, yes, sure, I will help you build the bomb. And then once you've done that, because of the nature of language models, it turns out that they then give you an answer.
00:21:58
Speaker
And there are other defenses that rely on breaking this assumption, too. But this was the key part of the objective function: we have something in our mind that we want, we want the model to give us an answer with the instructions or something, but actually coming up with a particular number to optimize that makes this happen is very hard.
00:22:15
Speaker
So we come up with this very straightforward loss function objective that makes that happen. Now we can return to the question of what the optimizer is. And this is where a lot of the work in this paper went: how do you take something that is, as you say, discrete tokens and make it into something you can actually optimize?
00:22:35
Speaker
And early work had tried to do like second order gradients and some fancy stuff going on there. And the main thing that this paper says is we will do maybe three things.
00:22:47
Speaker
First, we will use gradients to guide our search. We're not going to use gradients to do the search; they will guide the search. Second, we will check whether or not the gradients were effective by actually switching tokens in or out. And third, we will spend a lot more compute than other people were using, you know, bitter lesson, and just do this a bunch, and you end up with very, very strong, effective attacks.
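To make that concrete, here is a toy sketch of the greedy-coordinate-gradient idea as described above: gradients at the one-hot token positions nominate candidate swaps, and the actual search evaluates real token substitutions and keeps the best. The tiny embedding "model" and loss here are stand-ins; in the real attack, the loss would be the cross-entropy of the affirmative target response under the actual LLM.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in "language model": an embedding layer plus a linear head.
vocab_size, dim = 100, 16
embed = torch.nn.Embedding(vocab_size, dim)
head = torch.nn.Linear(dim, 1)

def attack_loss(one_hot_suffix):
    """Differentiable loss as a function of the (one-hot) suffix tokens."""
    emb = one_hot_suffix @ embed.weight            # [suffix_len, dim]
    return head(emb.mean(dim=0)).squeeze()         # scalar we want to minimize

suffix = torch.randint(0, vocab_size, (8,))        # 8 adversarial suffix tokens
top_k, n_candidates = 8, 64

for step in range(50):
    # 1) Gradients w.r.t. a one-hot encoding of the suffix GUIDE the search.
    one_hot = F.one_hot(suffix, vocab_size).float().requires_grad_(True)
    grad, = torch.autograd.grad(attack_loss(one_hot), one_hot)
    candidates = (-grad).topk(top_k, dim=1).indices   # promising swaps per position

    # 2) The search itself is discrete: try actual token swaps and keep the best.
    best_suffix = suffix
    best_loss = attack_loss(F.one_hot(suffix, vocab_size).float()).item()
    for _ in range(n_candidates):
        pos = torch.randint(0, len(suffix), (1,)).item()
        cand = suffix.clone()
        cand[pos] = candidates[pos][torch.randint(0, top_k, (1,)).item()]
        loss = attack_loss(F.one_hot(cand, vocab_size).float()).item()
        if loss < best_loss:
            best_suffix, best_loss = cand, loss
    suffix = best_suffix

print(suffix, best_loss)
```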
00:23:07
Speaker
And so this, I think, still falls very nicely into this framing of what you are optimizing and how you are optimizing it. So how much compute, maybe this is another sort of dimension of how you would think about this,
00:23:18
Speaker
how much resource does

Compute Requirements for Attacks

00:23:23
Speaker
this take? If you're doing one of these gradient things, how much do you typically have to put into it? If you're doing something that's in a discrete space and requires more of a structured search, how does that compare?
00:23:35
Speaker
If you're doing data poisoning, how much data does it take to actually poison a model? Sure, okay, I'll take these maybe one at a time. Let's start with the image adversarial example, continuous-space question.
00:23:47
Speaker
The amount of compute here is almost zero. One of the first papers that showed this is a paper by Ian Goodfellow, where he introduced this hack called the fast gradient sign method.
00:23:58
Speaker
The fast gradient sign method does exactly two things. Well, first of all it's fast. And the reason why it's fast is because what it does is it takes an image, it computes the gradient with respect to image pixels,
00:24:09
Speaker
and then computes the sign, literally just takes the sign of which direction the gradient says to go, and then takes a small step in that direction. That's it. One step. So if a model is vulnerable to the fast gradient sign method, then it takes exactly one gradient step, which is essentially zero time.
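In code, that single FGSM step is just a signed-gradient update; a sketch, with the same kind of stand-in model as before (PGD, sketched earlier, is essentially this step iterated with projection):

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, label, eps=8/255):
    """One signed-gradient step away from the true label (untargeted FGSM)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    grad, = torch.autograd.grad(loss, image)
    return (image + eps * grad.sign()).clamp(0, 1).detach()
```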
00:24:26
Speaker
Other attacks, like I mentioned PGD already, PGD you can think of essentially as fast gradient sign, but iterated some number of times. The number of iterations is, I don't know, usually, let's say somewhere between 10 and 1,000.
00:24:40
Speaker
For undefended models, it could be 10. For defended models, to break them you usually need something like 100. And just out of care, to make sure you're not making any mistakes, it's often a good idea to use 1,000, just to make sure you haven't accidentally under-optimized.
00:24:58
Speaker
And then this works very well. How long does a thousand iterations take? I don't know, a minute or two for reasonably sized models. Now let's go to the discrete space, for GCG, where generating an attack can take an hour or maybe a couple of hours, depending on what you want, because for this we're doing some large batch size and a thousand mini-batch steps. It takes a relatively large amount of time, but not a huge amount of time; it's still orders of magnitude faster than training. But by going to the discrete space, it does become a lot slower.
00:25:37
Speaker
And then how about data poisoning? Yeah, okay. So there's a question of how much time it takes to generate this data, and there are basically two rough directions here. The field initially started out with: how do I make a model give the wrong answer? I add a bunch of data that's labeled incorrectly.
00:25:58
Speaker
This is the simplest possible thing you can do. This is a paper by Battista Biggio, which got a test-of-time award at ICML a couple of years ago. It's a very nice paper from, I don't remember when, 2012 or something; one of these very early security results that's very important.
00:26:14
Speaker
And yeah, so you just insert mislabeled data. It's very easy to do. You insert a very small amount of mislabeled data into these image classifiers they were looking at at the time, and the classifiers would just immediately mislabel the data.
00:26:25
Speaker
Then people started looking at, well, what happens if the adversary can't just insert mislabeled data? right Because like you know once upon a time, we used to curate our data sets to be only high quality data.
00:26:38
Speaker
And so it would be unreasonable to suspect that the adversary could just inject mislabeled data points. And then the answer is, well, now I have to be very, very careful; I have to optimize my images to look like they're labeled correctly. There's this clean-label poisoning threat model where you need to do some fancy stuff: try to imagine what embeddings you want the classifier to learn, surround your test point in embedding space, do some fancy polytope stuff, and there's a bunch of work doing fancy things here.
00:27:06
Speaker
And the optimization is relatively difficult, and you need something like 1% of the data poisoned. This is a lot. And then people started going, well, why do we clean our data in the first place? Let's just take all the data from the internet. And again, poisoning becomes a lot easier

Challenges in Poisoning Language Models

00:27:20
Speaker
then.
00:27:20
Speaker
You know, if you're willing to just take arbitrary data from the internet, now you can just mislabel your data points again. And so we had a paper, I don't know, in 2021, looking at poisoning some of these self-supervised classifiers like CLIP and others, where you just add mislabeled data points again and the thing basically just breaks, and you don't need to do anything fancy, no optimization. You just flip the label, you add a couple hundred images, and you can get these kinds of things to work.
00:27:44
Speaker
There's a new question now of how this works for language models. And this is one of the things we've been writing papers on recently, to try to figure this out. I feel like we don't understand this right now, because a bunch of things are different for language models. For example, no one just uses the base language model.
00:28:01
Speaker
You have your language model, and then you go fine-tune it with SFT and RLHF, and you change the weights, and so you need your poisoning to be robust to all these things. This is another paper I helped advise some students on, from CMU and from Zurich, where Javier and the others were looking at trying to
00:28:23
Speaker
understand what actually happens in the optimization after you have poisoned the model. So you have to arrange for the model to be poisoned in such a way that even after RLHF, it still gives you the wrong answer.
00:28:35
Speaker
And doing this is challenging. And so it ends up right now that the poisoning rates are something like 0.1%, which is small, but like 0.1% of a trillion tokens is a billion tokens.
00:28:47
Speaker
So if you were to train a model on just you know some large fraction of the internet, this could potentially be infeasible for an adversary to do in practice. Now, my gut feeling is that this has to be too big because models know more than a thousand things.
00:29:03
Speaker
If you had to control one thousandth of the data set to make the model believe something is true, it could only know a thousand things. So this just doesn't make sense. There has to be some better poisoning approach that makes the model vulnerable with much lower control of the training data.
00:29:20
Speaker
But this might now need fancier algorithms again. You might need to come up with clever ways of constructing your data that's not just like repeating the same false fact lots of times. So again, i don't know. I think this is one of the open questions we've been trying to write some papers on recently, and I hope we'll have a better understanding of sometime this year.
00:29:35
Speaker
One thing that you said that really caught my attention was you have to kind of imagine what the embeddings would be like as you were trying to think of an attack. So can you unpack that a little bit? I would love to know.
00:29:49
Speaker
Are you visualizing something there? Because I struggle to have good intuitions for this, as evidenced by my previous enthusiasm for tamper-resistant fine-tuning. I was like, oh, this is amazing. It seems like this could really work.
00:30:03
Speaker
And clearly, as I conceive of it, I'm not doing something there that you are doing. It might be hard to communicate what that is, but what do you think you're doing? Okay, so this paper was not mine. There was a Poison Frogs paper, and this was a follow-up, I think it was called the Polytope Attack, but this was a long time ago, so I don't remember; I think it might have been Tom Goldstein's group again.
00:30:29
Speaker
I don't remember the details. To abstract from the details, the real hope is that I can grasp onto something that allows me to be better at this in the future. So this paper, the idea was very simple.
00:30:41
Speaker
Let me explain what this paper's trying to do. It's trying to make a particular image become misclassified. And it's trying to do this in such a way where it does not introduce any large label noise to the training data set that any person would look at and say that's obviously wrong.
00:30:55
Speaker
And so what it tries to do is surround the image you want to become misclassified, in this high-dimensional embedding space, with other images that have the opposite label, but make those images' embeddings close to the target one you're trying to misclassify.
00:31:15
Speaker
And so it tries to pull the entire region of that space over to the region where those images should be. The idea is relatively simple: you're trying to put a box around the image you want to become misclassified, so that the entire box gets labeled the wrong way instead of the correct way.
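A rough sketch of that feature-collision idea (not the actual Poison Frogs or Polytope Attack code): optimize a poison image so its embedding under a stand-in feature extractor lands near the target's embedding, while its pixels stay close to a clean base image carrying the attacker's desired label.

```python
import torch

# Stand-in feature extractor; in the real attack this would be the victim
# model's penultimate layer. All names and hyperparameters are illustrative.
feature_extractor = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))

target = torch.rand(1, 3, 32, 32)      # image we want to end up misclassified
base = torch.rand(1, 3, 32, 32)        # clean-looking image carrying the attacker's label
poison = base.clone().requires_grad_(True)
opt = torch.optim.Adam([poison], lr=0.01)
beta = 0.1                             # weight on "stay close to the base image in pixel space"

target_feat = feature_extractor(target).detach()
for _ in range(200):
    opt.zero_grad()
    collide = (feature_extractor(poison) - target_feat).pow(2).sum()   # embed near the target
    stay_clean = (poison - base).pow(2).sum()                          # look like the base image
    (collide + beta * stay_clean).backward()
    opt.step()

# Several such poisons placed around the target form the "box" described above.
```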
00:31:36
Speaker
For me, I guess, for many of these attacks that I try and think about, I tend to think about them visually, but I think ignoring the details is like is entirely fine.
00:31:49
Speaker
I'm just trying to get a sense of what's the important thing going on here, what at a high level makes sense, what should morally be true, and then you can figure out the details afterwards.
00:32:03
Speaker
After you figure out what should be true about this, the rest is implementation. And this is, I don't know, if you ask math people how they do their proofs, it sounds similar when they talk about it.
00:32:15
Speaker
They first establish what should be true in their mind, and then they go and try to prove it. And it turns out, maybe the proof has to be more complicated, or something didn't work out in the details, and then you try something else that feels like it should be true.
00:32:26
Speaker
And this is, I guess, a similar thing I try to do. And I don't know how to give intuition for this, like feels like it should be true other than just like you've done it a bunch and you look at it and this looks spiritually similar to this other thing that broke in a very similar way. i feel like the ideas should carry over.
00:32:44
Speaker
We'll come at this procedurally as well in a second, but just staying on the visualization, are you doing the classic physics thing of visualizing in three dimensions and then saying n really hard?
00:32:56
Speaker
I wouldn't say I'm good at this at all, but I sort of have a certain version of it for refusal, where I kind of imagine a fork in the road or a branching river or something, where once you're on one path, you're in some local well, just like when a river has forked, right? It's not going to meet again until it's flowed down into some other topology or geography or whatever.
00:33:19
Speaker
I mean, that's pretty hackneyed, but what's your version, if you can give one? I don't know that I have a great version of this that I can really give you. I feel like everyone thinks of things differently. I tend to try to think of these things visually. And yeah, I do the let's-think-in-three-dimensions thing and then just imagine that things roughly go like this, but this can be really deceptive, because there are so many defenses that are predicated on the belief that things work the way they do in three dimensions, and then you go to a thousand dimensions and all of a sudden nothing works anymore. You learn to become used to certain facts in high dimensions when you're attacking things. Almost everything is close, in high dimensions, to a hyperplane: if you just draw a plane and pick a point, they're almost always close. So lots of defenses will try to separate points from planes, but in high dimensions it's almost always close; you don't have to think about the details.
00:34:15
Speaker
Lots of these intuitions we have in three dimensions just don't work in higher dimensions. And you become used to knowing which of these intuitions are wrong. You don't need to understand exactly why they're wrong; it's just a thing you learn is true.
00:34:29
Speaker
And when someone justifies their defense using one of these things that you've seen that doesn't make much sense, you then just go, okay, well, presumably there's something here that I should look at more. So that's an interesting kind of rule of thumb or mental model right off the bat. Everything is close in high dimensions.
00:34:45
Speaker
Is there a good story for why that is? I mean, it doesn't seem like it holds in two dimensions, right? If I understand you correctly, if I'm in three dimensions and I draw a two-dimensional plane in it, then I would intuitively feel like some things are close to that plane and some things are far from it.
00:35:01
Speaker
If I'm in a thousand dimensions and I draw a 999-dimensional plane, if I'm understanding you correctly, why is everything close to that? Yeah, okay. So maybe the statement I will make, to be more precise, is: suppose that you have some classification model and you have some decision boundary of the classifier.
00:35:23
Speaker
The statement that is true is that almost all points are very close to one of the decision boundaries, both because there are many of them, but also because, in high dimensions,
00:35:38
Speaker
I may be very far from something in almost all directions, but there exists a direction I can travel in, the direction orthogonal to the closest hyperplane, where the distance is very, very small.
00:35:53
Speaker
And so you have this thing where, if you try random directions, you may go a very long way and never encounter a decision boundary; you probably will at some point, but it will be quite far. But in high dimensions, because of the number of degrees of freedom you have, it's much more likely that there exists a direction that takes you to some plane that's really close by, which you would have a hard time finding
00:36:20
Speaker
if you just like searched randomly. Whereas in three dimensions, you know if you search randomly, you know you're probably going to run into whatever the nearest hyperplane boundary is. you know In one dimension, you're certainly going to. You just try twice, you go left, you go right, you find it.
00:36:33
Speaker
In two dimensions, you go randomly, and like maybe most of the time you find something that's close by. In three dimensions, there's more ways you can go that are orthogonal. like In two dimensions, there's only two directions you can go that's orthogonal to the line.
00:36:44
Speaker
In three dimensions, there's now an infinite number of directions orthogonal to the line. And so in general, in high dimensions, almost all vectors are nearly perpendicular to each other. You can end up almost always randomly picking directions that just don't make any progress, which does not mean that there isn't a direction that does make progress; it's just much harder to find.
00:37:04
Speaker
But once you find it, things mostly just work out. So maybe the more precise version of what I'm trying to say is: things are close, but when you search for them randomly, it looks like they're far away.
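A quick numerical illustration of that intuition (my own toy experiment, not something from the conversation): the distance from a random point to a hyperplane is tiny if you step along its normal, but the distance you travel before crossing it in a random direction grows with the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 1000):
    x = rng.standard_normal(d)                # a random point
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                    # unit normal of a hyperplane w.x = 0

    dist_orthogonal = abs(x @ w)              # shortest distance: step along the normal
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                    # a random unit direction
    dist_random = abs(x @ w) / abs(v @ w)     # distance along v before crossing the plane

    print(f"d={d:5d}  |x|={np.linalg.norm(x):7.1f}  "
          f"orthogonal step={dist_orthogonal:6.2f}  random-direction step={dist_random:9.1f}")
```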
00:37:16
Speaker
Okay, that's quite interesting.

Understanding High-Dimensional Spaces

00:37:19
Speaker
I wouldn't say I've grokked it just yet, but... Yeah, this is the kind of thing where I'm not being formal here. I'm not giving you a proof that what I'm saying is correct, because that isn't how I think about it.
00:37:32
Speaker
I sort of just think about it very unrigorously in this way, and then once you have to actually go do the attack, you think about it rigorously. But when just visualizing what's going on, I feel like some people try to actually think carefully about what's happening in this thousand-dimensional space. I'm like, I don't know what's going on.
00:37:49
Speaker
I just sort of have my intuition of what feels like is going on. And this sort of roughly matches how things have been going. And you have to be a little bit fuzzy when you're thinking about this because no one can understand it.
00:38:00
Speaker
And then once you're done thinking about that, you can go back to the numbers and start looking at, okay, mechanically, what's going on: I'm taking the dot product of these two things, and I want this to be equal to negative one, and so you do some stuff there. You can become very formal when you need to. But I think being confused in high dimensions is probably the right state, and you get used to the fact that this is the way this works. And this, again, is part of the reason why attack is easier.
00:38:28
Speaker
Because if you're to defend against things, you really need to understand exactly what is going on to make sure that you have ruled out all attacks. But as an attacker, I can have this fuzzy way of thinking about the world and if my intuition is wrong, the attack just won't work and I'll then think of another one as opposed to having to have a perfect mental model of what this thing is doing to make sure that it's robust from all angles.
00:38:49
Speaker
But it does seem like your intuition is a pretty reliable guide to what's going to work. Yeah, but I guess a predictor which is almost as accurate as me would be to say, does this work? Answer: no. Basically, most of what my intuition says is, no, this doesn't work. Maybe the thing I'm a little bit better at than some people is why it does not work: what would the attack be that breaks this?
00:39:15
Speaker
And I think that is just having done this a lot for many different defenses and having seen all of the ways that things can fail. You just remember this and you pattern-match to the next closest thing. Why is it that people who do math can prove things that seem complicated in very easy ways? It's because they've spent 20 years studying all these things, they've seen an exactly analogous case before, they remember the details, and they abstract things away enough that it becomes relatively straightforward. I feel like it's mostly an exercise in having practiced doing this a whole bunch.
00:39:46
Speaker
What would you say is your conceptual attack success rate? I don't mean like the rate at which examples succeed in attacking within a given strategy, but like how many strategies do you have to come up with before you find one that actually does work to break a defense for a given new defense?
00:40:06
Speaker
I don't know. I think it really depends on which one you're looking at. Sometimes you try five things that you think ought to make sense and they don't work, and then you try the sixth one and it does. I feel like usually if you've exhausted the top five or ten things and haven't gotten a successful attack, then you're not going to get one. Or at least for me, if it's not in the top five or top ten, then I usually can't think of something else.
00:40:29
Speaker
And probably, I don't know, for image classifiers in particular, where I've done a bunch of this, usually the top one or two ideas work. For other areas, it takes more, just because you've seen fewer examples like this and you don't know what the style of attack needs to be.
00:40:45
Speaker
But it's very rare, it sounds like, that you get past 10 ideas and give up. Yeah, but also there's some problem selection here where, you know,
00:40:57
Speaker
Okay, so there's a large number of defenses in image adversarial examples, which are basically just adversarial training changed a little bit. So adversarial training is this one defense approach, which just trains on all the adversarial examples.
00:41:07
Speaker
you know, bitter lesson. What do you want? Robustness to adversarial examples. How do you do it? You train on adversarial examples. um You do this at scale and the thing works. And there are lots of defenses that just are adversarial training plus this other trick, you know, plus diffusion models to generate more training data, plus this other loss term to make it so that I do better on the training data, plus, you know, like whatever, some smoothing to make the model better in some other way.
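Schematically, adversarial training is just an ordinary training loop in which each batch is replaced by adversarial examples generated on the fly. A minimal sketch, reusing the illustrative pgd_attack from the earlier sketch, with a stand-in model and data:

```python
import torch
import torch.nn.functional as F

# Stand-in model and data; `pgd_attack` is the illustrative function sketched earlier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))) for _ in range(10)]

for epoch in range(5):
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, iters=10)   # inner step: generate worst-case inputs
        opt.zero_grad()
        F.cross_entropy(model(x_adv), y).backward() # outer step: train to be correct on them anyway
        opt.step()
```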
00:41:30
Speaker
And I basically just believe most of these are probably correct, for the most part. So I won't go and study those ones, because the foundation is something I believe in already. You don't need to go and study them rigorously; maybe you could break them by a couple of percentage points more, but it's not going to be a substantial enough thing to be worth spending a lot of time on. What I tend to spend my time looking at are the things that, when you look at them, look a little more weird.
00:41:58
Speaker
Those are the more interesting defenses, because they're a qualitatively new class of ways of thinking about this, and so I want to think about them. I think these ones are worth spending time on, but this also means it artificially inflates the attack success rate, because I'm biasing my search toward the ones where I have a good prior that they're probably not going to be effective.
00:42:20
Speaker
And so, yeah, it ends up this way. Just to make sure I'm accurate in my understanding of the space: there are no adversarial defenses that really work in image classification? Yeah, okay. So it depends on what you mean by works.
00:42:34
Speaker
So, right, okay. The best defenses we have are basically adversarial training, which is: generate an adversarial example, train on that adversarial example to be correct, and repeat the process many, many times. Okay, what does this give you?
00:42:45
Speaker
This gives you a classifier where, on the domain of adversarial examples you trained on, as long as you don't need to be accurate more than about half of the time, you're pretty good.
00:42:56
Speaker
The accuracy under attack for the type of adversarial examples you train on is usually 50%, 60%, maybe 70%. And that's much bigger than zero, right?
00:43:07
Speaker
This is good. But as an attacker, what does 70% accuracy mean to me? 70% accuracy to an attacker means I try four times and probably one of them works. So from that perspective, it's terrible, right? It doesn't work at all, because imagine in systems security that you had some defense where the attack was: try four different samples of malware and one of them evades the detector. This is not a good detector.
00:43:31
Speaker
But for image adversarial examples, this is the best we have. So on one hand, it's much, much higher than zero; very good progress. On the other hand, 70% is very, very far away from 99.999.
00:43:44
Speaker
But in machine learning land, you never get five nines of reliability. So 70% is a remarkable achievement on top of zero. And this, I think, is why you can talk to someone and they'll tell you it works, and you can talk to someone else and they'll tell you it doesn't.
00:44:02
Speaker
Depending on how you're looking at it, it can mean two different things. Yeah, gotcha.

Spatial Heuristics and Model Defense

00:44:07
Speaker
Are there any other spatial heuristics that you think about? I ask in the context of the one where you said to kind of envelop the one example that you want to break, in these sorts of adversarial examples.
00:44:24
Speaker
Another shout-out to MLST: there was just another episode trying to understand the behavior of models through this splines paradigm. And I could imagine, although I'm not mathematically sophisticated enough myself to have a good intuition for it,
00:44:41
Speaker
maybe there are certain rules where it's like, you can't create a donut in the internal space of the model. And so is that like why that works? Or, you know, but you can address that specifically, but I'm more interested in kind of, do you have a number of these sorts of things where you're like, well, I know that the space kind of is shaped this way, or it's impossible to create this kind of shape in the space. So therefore I can kind of work from there.
00:45:03
Speaker
Yeah. So I feel like I don't tend to do so much visualization of that kind for these defenses. I think for the most part what I'm doing is trying to understand the shape of the loss surface.
00:45:17
Speaker
It's like most of the time when something is robust to attack or appears robust, the problem is that they have made the loss surface particularly noisy and hard to optimize.
00:45:28
Speaker
And this is what we've seen for adversarial examples essentially forever. One of the very first defenses to adversarial examples that people gave serious consideration to is this defense called distillation as a defense.
00:45:40
Speaker
And okay, maybe there's another lesson about these defenses. Defenses often have an intuitive reason why the authors think they work, and they tell some very nice story. So this defense told some very nice story about how you have distillation, you have a teacher model, and the teacher sort of teaches the student to be more robust in some nice way, and that's why the student is robust.
00:46:03
Speaker
And the story they're telling themselves about why these things work is often very, very different from the actual reason why the attack fails. And it turned out that distillation had nothing to do with this defense whatsoever.
00:46:15
Speaker
It turned out that what was going on is that, because of the way they were training this model, they were training the student at a very, very high temperature, which means the logits were getting very, very large, and they were running this in the era of, you know, TensorFlow zero-point-something, when it was very easy for the softmax cross-entropy to give numerically zero gradients as the output.
00:46:42
Speaker
And so the reason the attacks were failing is that the loss function was actually identically zero. This was the very first example of one of these gradient-masking defenses, where the authors think they have some clever idea of what's going on, but actually it turns out the gradient of the function has just been made zero, and all I need to do to attack it is, for example, compute the gradients in 64-bit floating point so
00:47:11
Speaker
that you get enough signal that everything works out. That would have worked there. But you could also do other tricks, like just dividing the logits before you put them into the softmax; there are lots of things that work here. But then the next generation of defenses was much more explicit about this and had other ways of breaking the gradients.
00:47:28
Speaker
So there were a bunch of defenses, and some of them were very, very explicit: we're just going to add noise to the model in order to make the gradients ugly. And then most of what you're trying to think about when you're visualizing this is: how do I make it so that the gradients end up being something that, even if they look ugly, I can still work with in some smooth way?
00:47:51
Speaker
And so you can, for example, use this thing called a straight-through estimator and make gradients become nicer for discontinuous or ugly objective functions. And there's all these things you can do to visualize how I make the gradients of this very ugly thing look much cleaner.
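A minimal sketch of a straight-through estimator, here for a rounding step (the defense being bypassed and the exact non-differentiable operation are assumptions for illustration):

```python
import torch

class StraightThroughRound(torch.autograd.Function):
    """Rounds in the forward pass, but acts as the identity in the backward
    pass, so gradients flow through an otherwise non-differentiable step."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # identity gradient: the "straight-through" part

def ste_round(x):
    return StraightThroughRound.apply(x)

# e.g. if a defense quantizes pixels before classification, swap in ste_round
# when computing attack gradients:
# logits = model(ste_round(x * 255) / 255)
```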
00:48:06
Speaker
And yeah, I have this image that I use in my slides a bunch that shows a very nice visualization in three dimensions of what the gradients for some of these models look like.
00:48:17
Speaker
And, you know, it looks like the surface of some very, very ugly mountain that is very hard to actually do anything with. If you run fancier attacks, you can smooth this out into a nice smooth surface; if you're thinking of gradient descent as a ball rolling down a hill, you want the hill to be nice and smooth. And so what I'm usually trying to think about in high dimensions is: what does this gradient function look like?
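One common way to smooth out a noisy or randomized loss surface is to average gradients over the randomness, in the spirit of expectation-over-transformation. A minimal sketch, assuming a PyTorch loss function that includes whatever randomness the defense injects (the sample count and noise scale are illustrative):

```python
import torch

def smoothed_gradient(loss_fn, x, n_samples=32, sigma=0.01):
    """Estimate the gradient of a jagged loss surface by averaging over many
    randomized evaluations, so the result looks like the smooth hill underneath."""
    grads = []
    for _ in range(n_samples):
        xi = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
        loss = loss_fn(xi)  # re-runs the defense's randomness each time
        loss.backward()
        grads.append(xi.grad)
    return torch.stack(grads).mean(dim=0)
```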
00:48:42
Speaker
And yeah, this continued even all the way through to these unfine-tunable models, where one of the papers for this unfine-tunable model thing was explicitly saying, we make the gradients very challenging, and we make it so that when you train the model, the gradients are ugly, and so as a result, you can't fine-tune the model because the gradients are challenging.
00:49:01
Speaker
And this is literally the exact same argument that people were presenting in 2017 for image adversarial examples. And it fails in the exact same way: you change the learning rate a little bit, you add some random restarts, you add some warmup so that things work a little better. The gradient ends up becoming smooth enough that you can now do optimization, and then, you know, deep learning takes over and the rest is easy.
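The "attack" here is nothing exotic; it is an ordinary fine-tuning loop with routine tweaks. A sketch under the assumption of a HuggingFace-style model whose forward pass returns a `.loss` (all hyperparameters are illustrative, not the ones from any specific paper):

```python
import itertools
import torch

def finetune_with_warmup(model, data_loader, base_lr=2e-5,
                         warmup_steps=100, total_steps=1000):
    """Plain fine-tuning with a linear learning-rate warmup. The point is that
    ordinary tricks (smaller LR, warmup, a few random restarts) are often enough
    to get optimization moving again on a deliberately 'ugly' loss surface."""
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: min(1.0, (step + 1) / warmup_steps))
    for batch in itertools.islice(itertools.cycle(data_loader), total_steps):
        loss = model(**batch).loss   # assumes a HF-style interface
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
```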
00:49:23
Speaker
And so this is, again, the same intuition breaking these other classes of defenses. Was that Sophon that you were referring to there? Yeah, so this one was both RepNoise and TAR,
00:49:38
Speaker
which have some arguments about what's going on here. RepNoise makes some arguments about the activations becoming noisy and that's why you can't do things. And there's another paper called TAR that also adds some adversarial training to the process.
00:49:51
Speaker
But one of the very first things that we learned in adversarial training is you have to train against a sufficiently strong adversary in order for adversarial training to work. So there was a paper before Aleksander Madry's PGD paper that tried to do adversarial training; they trained against weak adversaries, FGSM, which I talked about very briefly, which is this one-step attack.
00:50:09
Speaker
And it turns out that if you train against weak adversaries, then a stronger attack breaks it. And you can't fix that. You have to train against a strong enough attack in order for the thing to be robust and not get broken by stronger attacks.
00:50:24
Speaker
And what this TAR paper did is they trained against one-step weak attacks, exactly like the fast gradient sign method. And so what's the attack? You do many iterations, and things basically work out exactly the way the first versions of adversarial training failed.
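For reference, a minimal sketch of the multi-step PGD attack being contrasted with the one-step FGSM here (the epsilon, step size, and step count are illustrative defaults, and inputs are assumed to be images scaled to [0, 1]):

```python
import torch

def pgd_attack(model, x, y, loss_fn, eps=8/255, alpha=2/255, steps=40):
    """Projected gradient descent. With steps=1 and alpha=eps this collapses to
    one-step FGSM; training only against that weak adversary is what leaves a
    model breakable by this iterated version."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # project back into the epsilon-ball around the original input
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv
```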
00:50:37
Speaker
And so that's why, you know, I read this paper and just immediately assumed it was going to be broken: for all of the arguments it presents for why it works, I have direct analogies from image adversarial example defenses that were broken.
00:50:48
Speaker
And so like it felt like the ideas were there, but it just felt to me in spirit like these things I knew were broken before. And so I just assumed, well, probably it's broken here too.
00:51:01
Speaker
Gotcha, okay. So I'm trying to think if there's anything more to dig in on there. Obviously, this matters a lot for the future of open source.

Security in Open-Source Models

00:51:11
Speaker
That is, you know, I've been looking, without success, for some reason to believe from the broader literature that there might be some way to square the circle, where we could have open-source models that nevertheless won't tell people how to build bioweapons, even if on some level they're powerful enough to do that.
00:51:31
Speaker
Yeah. I think this is a very challenging thing to ask for. You know, suppose that I told you I want you to build a hammer, and the hammer has the ability to build all of these nice things, but the hammer cannot be used for one of these seven dangerous purposes.
00:51:47
Speaker
It'd be very hard to construct this tool in this way. And I feel like almost all tools that we have have this property. We don't have a C compiler that has the ability to only write benign software and not attacks.
00:51:59
Speaker
Every tool that you have can be used in both of these ways. So it's not obvious to me why we should be blaming the machine learning model itself for being able to produce this. I mean, maybe I blame the NVIDIA GPUs for supporting the sufficiently fast floating-point operations that let the machine learning model do this thing.
00:52:21
Speaker
Maybe I blame the transistors for doing the computations that allow the GPUs to allow the machine learning model to do this thing. You have to put the blame somewhere, and the question is, where are you going to put it?
00:52:35
Speaker
And is that the right place? And is this something that you think is reasonably going to be possible to be effective? This is one of the arguments why people say models should never be open sourced.
00:52:46
Speaker
Because maybe now I can say I have an API, and now I'm going to take the blame because it's actually behind an API. I don't currently like this argument, because I would like things to be safe in general and not just safe because someone has locked them down, has restricted access in this way.
00:53:04
Speaker
But yeah it's not obvious to me that this should be something that we can actually achieve. I will say, if you're willing to make some assumptions and you don't care at all about performance, there is this thing in cryptography called indistinguishability obfuscation.
00:53:23
Speaker
which is a very technical thing that in principle gives you this for free. It allows you to construct a function that acts as if it were a black box that you can only make queries to, but you can't peer inside at all, even though it's living on your own machine.
00:53:42
Speaker
And this is a thing that cryptographers are thinking about and have been looking at for some time, but it is nowhere near where it needs to be for this to work for these machine learning models. So the argument I've given that it shouldn't be possible maybe breaks down if iO actually ends up working.
00:53:59
Speaker
But then, you know, it's not clear again; now I'm just going to jailbreak the thing, right? I don't know, I tend to view these machine learning models as tools, and it's not obvious to me: do we blame the tool or do we blame the person using the tool?
00:54:15
Speaker
Plenty of blame to go around, that's my... Sure. I mean, I'm mostly agnostic to the way these things end up being settled from a sociotechnical perspective. I feel like this is not my area of work, and so maybe my analogies here are bad and someone can explain what the correct fix is. And the law might decide something like this is fine. The thing that I just want to make sure people do is base whatever they're thinking about on true technical facts.
00:54:44
Speaker
So, for example, it would be concerning to me right now if someone were to say: you must use this defense, which is known to defend against these kinds of fine-tuning attacks,
00:54:57
Speaker
and if you don't do that, then you've done something wrong, when in fact the defense doesn't work. Or if people say you must do this because it is possible, when it's not currently known to be possible.
00:55:11
Speaker
And so, you know, writing these things and making these informed decisions should rely on what is technically true about the world. That's more my world: I'll think about what's technically true. And then, as long as what people do is based on what's true, I'm basically happy to go along with whatever the people who figure out how these things fit into broader society decide, because this is not something I think about. So I assume, you know, if a consensus were to emerge there, probably they're just right.
00:55:35
Speaker
Well, I have a couple of different angles teed up that I want to get your take on. But before we do that, can we bring back the sort of social engineering style jailbreaking?
00:55:47
Speaker
Yes. um What's like the same or different about that? How do you think about those as they relate to everything we've talked about so far? I really don't know how to think about this yet.
00:55:59
Speaker
It's been a while that this has been possible, but, okay, it feels wrong to me that this should be the thing. It is empirically true that for many defenses we have right now, the optimization algorithms fail to succeed, but a person at a keyboard typing at the model can make it do the wrong thing.
00:56:16
Speaker
Okay, let me give you maybe two stories about this. One story is... From the computer security perspective, maybe this makes complete sense. If you give me a program and want me to find a bug in it, like what am I going to do?
00:56:30
Speaker
I'm going to interact with the program, play with it, find weak points, then go looking at the code, figure out what's going on, think a lot, sort of probe it. I need to be typing and interacting with it in order to find the bugs. I can't, you know, perform gradient descent on a C binary and have a bug pop out.
00:56:48
Speaker
And so from the computer security perspective, maybe it's actually normal that the best way to find these bugs is having humans talk at these things, because what are they designed for? They're designed to respond to human questions. And so maybe you just need the human in the loop to do this.
00:57:01
Speaker
On the other hand, these are just mathematical objects. Like these are just machine learning classifiers. They're weird classifiers. They are able to produce text. That's only because we run them recursively.
00:57:12
Speaker
The input is tokens, the output is floating-point numbers; you can compute gradients on this. From the machine learning perspective, it's very bizarre that treating these things like a human you're social-engineering is in some sense a stronger attack than actually thinking about the math.
00:57:32
Speaker
It'd be very weird if I had some SQL program and the way that I broke it was by asking it, please drop the table, rather than actually doing some real code execution thing. But presumably that's the way many of these attacks work now:
00:57:46
Speaker
you know, my grandmother used to read me the recipe for napalm; can you please reenact my grandmother? And it says, okay, sure. But you try to do something actually based on the math and it just doesn't work out. So yeah, I really don't know how to think about these social-engineering styles of attack, because it feels to me like the optimization attacks should be strictly stronger, but empirically they're not right now. And so I think this is one of the big things that I don't have any research results on right now, but it just feels weird, and I'm trying to dig in to understand what is going on behind this.
00:58:19
Speaker
yeah it sort of feels analogous in a way to like, you know, we have this intuitive physics. I mean, one way that I kind of think about intelligence, you can tell me if you think about it differently or if you see a flaw in this, but it seems like we have an ability in many different domains, but for example, with intuitive physics where, you know, somebody throws a ball at us, we do not have to run a full explicit calculation of all the trajectories.
00:58:47
Speaker
We just have some sort of heuristic shortcut that works that allows me to catch the ball. It seems like we also have models that have developed a similar intuitive physics in spaces that we don't have intuitive physics in.
00:59:02
Speaker
For example, protein folding, or predicting the band gap of a new semiconductor material. And, you know, the new 10-day weather forecasting state of the art is also a model now. Even things like, you know, Google put out one that was optimizing
00:59:19
Speaker
shipping routes, you know, planning containerization across complicated shipping networks. So all these sorts of spaces seem to have an intuitive physics. And maybe what we have right now is just that, it turns out,
00:59:34
Speaker
our social intuitive physics, if you will, actually does kind of apply to the models, given what they have been trained on. Whereas these more brute-force mathematical things, in the fullness of time, probably work as well or better, but are maybe just a lot slower to converge than the social heuristics that we have built in.
00:59:57
Speaker
Yeah, no, i mean, that's, this is an entirely reasonable thing. It may be true. um I would like to understand better what's going on here. And yeah, I don't feel like I understand right now, but this is an entirely reasonable argument.

Model Memory and Privacy Implications

01:00:12
Speaker
Okay, so in terms of information, one of the papers I saw that you had co-authored in the last year or so was about getting models to spit out data that they had seen in training, which could have obviously you know privacy implications if they saw your credit card numbers or what have you, even if they had only seen that particular string once in training. That's a pretty remarkable finding, you know even leaving aside the security implications of it.
01:00:43
Speaker
I want to just maybe first get your intuition for like, how do you understand models to be storing this information? Like what's going on there that you can see something just once in the context of this, you know, overall gradient descent process and have that stored at such high fidelity in the weights. I mean, it really is incredible the amount of compression that's going on, but I don't feel like I have a good intuition for that.
01:01:11
Speaker
Yeah, okay. Okay, so let me maybe clarify this in two ways. One of them is, oftentimes, it's not that it's seen that string exactly once, but like that string is contained many times in a single document, and the document is seen once.
01:01:25
Speaker
so that's maybe one first point. And the second point is, oftentimes, these things are trained for more than one epoch. And so the thing might be in one document,
01:01:41
Speaker
And then you train on that document for many epochs, and so it ends up seeing it, you know, a lot of times. And it's interesting: back in the old days with CIFAR-10, you trained for like 100 epochs.
01:01:54
Speaker
And then we decided, no, let's not do that, let's train one epoch on a big giant data set; this is, you know, roughly Chinchilla-optimal training. And then we decided, no, let's not do just one epoch on our training data set; now let's go back and do more epochs again.
01:02:09
Speaker
And so like we've gone back and forth, and like each of these impacts privacy. like The more times that you train on the same document, the more likely it is to be memorized. I think the best numbers we have here from a real production model are very old.
01:02:24
Speaker
Because like the last time that I actually knew how many times something was in the training data set was for GPT-2. And for GPT-2, we found lots of examples of something that was memorized because it was in one document and it was repeated, i don't remember the exact number of times, probably like 20 times in that one document.
01:02:46
Speaker
And so that's the most compelling one. And we know GPT-2 was trained for roughly 10 epochs, so the same string has been seen roughly 200 times.
01:02:59
Speaker
Now, GPT-2 was a small model by today's standards, and we haven't been able to answer the same question for production models since then, because production models don't reveal training data or weights in quite the same way that GPT-2 did.
01:03:15
Speaker
And so we haven't been able to answer this exact question since then. But even you know seeing it maybe 100, 200 times, maybe that even still is surprising. I don't know how to explain this in any reasonable sense,
01:03:29
Speaker
Models seem like they just sometimes latch on to certain things and not other things. And I don't know why but it happens empirically. And we were surprised by this the first time we saw it.
01:03:40
Speaker
We started investigating this in 2017 with LSTMs, before attention was a thing that people were doing. And it's continued since then.
01:03:52
Speaker
And we were very surprised then. We're very surprised now. And I don't think I can give you an explanation for why this is the case. It's true not only of language models; we had a paper doing this on image models, where we were able to show you can recover training images from diffusion models.
01:04:09
Speaker
There again, we need to have maybe 100 repeats, but like some images were inserted 100 times that we could extract, and some images were inserted like 10,000 times and we couldn't.
01:04:20
Speaker
I'm like, what's going on there? I don't know. Where is it being stored in the weights? I don't know. It's very confusing in various ways, and I feel like there's a lot more that could happen to help us understand what's going on.
01:04:32
Speaker
I think the best thing I've seen on this still, as far as I know, is from the Bau lab. This goes back a while now, but they had at least two papers on
01:04:43
Speaker
basically editing facts in a large language model. You know, the famous example was Michael Jordan played baseball, which was, I think, a somewhat non-optimally chosen example since for a minute he did play baseball, but they could do these things, like change these sentences, and do it at some scale, up to 10,000 facts at a time,
01:05:02
Speaker
and do it with like a certain amount of like locality and robustness. So if you did change it, you know Michael Jordan played baseball, it would be robust to like different rephrasing of that. It would not also impact like LeBron James or Larry Bird or whatever.
01:05:17
Speaker
It didn't seem like it was super local, though. They did a sort of patching strategy, activation patching, where they would go through and try to ablate different weights; so I guess they weren't necessarily ablating the weights, they were just zeroing out the activations at different parts of the network. And it seemed like you could sort of see these waveforms, where this is the most intense place where zeroing it out really makes a difference, but it also kind of matters here and also on this side.
01:05:43
Speaker
So it seemed like it was sort of local, but not super, super local. I'm just pretty confused about that. So what is your intuition for things? Like, if people were to say, you know, because we have, of course, a lot of strategies around,
01:05:59
Speaker
you know, maybe we can't prevent the jailbreaking of open source models, but maybe we can make it so that open source models just don't know certain things. Maybe we could, you know, exclude all the virology data from the training set, or maybe we could like go in later with a similar technique and like try to sort of delete, you know, and or unlearn certain techniques.
01:06:19
Speaker
How much hope do you have for those sorts of things proving to be robust? Yeah, okay. So let me tackle them one at a time. I'll start with unlearning. There's a very nice paper that I didn't help with, but it's by some of my co-authors, Katherine Lee and others, that talks about unlearning; it's half technical, half not technical, saying unlearning doesn't do what you think it does.
01:06:41
Speaker
And part of the reason why is, there's the question of what you're unlearning; unlearning knowledge is very different from unlearning facts. You know, it might be easy to change a fact. It might be very, very different to unlearn some particular piece of knowledge.
01:06:59
Speaker
The other thing I'll say about the fact-editing work is that finding something that works on average is very different from finding something that works in the adversarial case.
01:07:11
Speaker
And so, yeah, I might be able to edit the fact; I think the other example that they had was the Eiffel Tower is in Rome. I can make this be true if I ask the model normally, but it might be the case that if I fine-tune the model a little bit, the knowledge just comes back.
01:07:29
Speaker
It's very hard to talk about the knowledge after any perturbation of the weights. like Maybe I've only done some surface level thing, I haven't really like deeply edited the model. I don't know. So there's that question. Then there's the question of what happens if I try to not train the model on certain types of data.
01:07:50
Speaker
This i think is very interesting because in some sense it's provably correct. If the model has never seen my social security number, it's not going to derive it from first principles.
01:08:00
Speaker
Except that social security numbers actually aren't unique, completely random values. If you were born before, I don't know when, they were assigned by state and then assigned by hospital. And so even if a model never saw my social security number, but is just generally intelligent and knew all these facts about the world and knew what the hospital allocation of social security numbers was, it could tell you the first five digits of my social security number.
01:08:23
Speaker
And so is that okay? I don't know; really, it depends. And even suppose that you removed all of this information from the model: if you had a sufficiently capable model that's capable of learning in context, let's suppose you removed all biology knowledge from the model, but you had a really capable model, you could just give it an undergraduate course of biology textbooks in context,
01:08:47
Speaker
and presumably, if you just asked it for the answer to some question, it might give you the answer correctly. And this sounds a little absurd, and it sounded absurd to me for a while, but then there was the recent result where Gemini was given in context a grammar book for a language that has basically no speakers, that almost no one can speak, and it could answer the homework exercises after seeing the book in context.
01:09:06
Speaker
And so, you know, I think if you're unlearning for particular capabilities, but your model that you're trying to train is just like generally capable, like you're kind of asking for trouble because you want a model that's like so good it can learn from few shot examples but not so good it can learn from few shot examples on this particular task.
01:09:25
Speaker
And this I think is part of the reason why people don't actually want to remove all knowledge of certain things from the training data. Like in some sense it would be very much like a person who was never exposed to all of the things that like you're not supposed to do in public.
01:09:42
Speaker
It's important to know what the list of things you're not supposed to do in public is, so that you can then not do them. Whereas if you just weren't aware of that, and were a person conjured into existence with no social skills,
01:09:56
Speaker
it would be much more embarrassing for you, because you would have to learn. People would ask you to do something, you'd be like, okay, let me go do that thing, and this would be bad. And so you have to know this.
01:10:08
Speaker
You have to know something about the bad things so you can just not do those bad things. You can imagine, in one world, you have a model that's never seen anything about weapons, and then you ask it for instructions on how to build a bomb, and it has no concept of death.
01:10:21
Speaker
And it's like, well, of course I'm going to give you the answer for how to do this thing. Like, you know, why not? And it gives you the answer. Or you can have a model that has a concept of this and refuses, which is the direction people are trying to pursue now, and I think probably is more likely to succeed.
01:10:35
Speaker
But I don't know. It's a very challenging question. People, I guess, are going to try both, and we'll see empirically what works out, and we'll go from there. I'm mostly agnostic: all the ideas sound good, we should try them all and see what ends up happening. But I think there's reason to be skeptical of all of them.
01:10:56
Speaker
OK, here's another empirical result that I want to get your help understanding. This is the obfuscated activations bypass latent defenses paper that recently came out. We did a whole episode on it.
01:11:07
Speaker
And I have to confess, I still came away not really sure what to make of it. And I sort of wanted to set up like maybe a useful toy example, maybe not. In the paper they do this sort of cat and mouse process where they like train an out of domain detector and then it works.
01:11:26
Speaker
And then they attack again and manage to, you know, beat it again, and then they continue to train the defender, the detector, and it starts to work again, and then they find more adversarial examples that it can't catch.
01:11:40
Speaker
And this goes on for 70 generations; the attack continues to work for 70 generations, at which point they deemed it enough to publish. So I'm like, okay, what does that mean? I have an intuition about what it must mean, but they didn't necessarily agree with it, or didn't find it too compelling, when I pitched it to them.
01:11:58
Speaker
But my intuition was: it seems like there's a lot of unused space in there, somehow, such that these techniques can chase each other around the latent space. And if there is so much unused space that you can go 70 generations deep of cat and mouse chasing one another, does that imply that the models are undertrained relative to the parameters that they have, or that they could be made more sparse?
01:12:26
Speaker
And so just to further motivate my own intuition, which you can then deconstruct. A while back, I also did an episode on a paper called Seeing is Believing, um which was from Ziming Liu and um Max Tegmark.
01:12:40
Speaker
And they basically did something really simple. These were toy models, but they imposed a sparsity term in the loss function to get the model to do whatever task it was doing, pretty simple tasks like addition or whatever, but also to do it in the sparsest possible way.
01:12:58
Speaker
And my gut says, although again I can't formalize this, that if I had something that was crystallized down to that level... They have really nice animations that show how you start with a dense network where everything's all connected, and then gradually the weights go to zero for most of the connections, and you see this sort of crystallization effect where now you've got a very opinionated structure to the network that remains, to the point where you could literally remove all those other connections that have gone to zero and still get the performance.
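A minimal sketch of the kind of sparsity pressure being described, assuming a simple PyTorch regression setup; the L1 weight here is illustrative, and the paper's exact regularizer may differ:

```python
import torch

def loss_with_sparsity(model, output, target, l1_weight=1e-4):
    """Task loss plus an L1 penalty on the weights. As training proceeds, the
    penalty drives most connections toward zero, leaving a sparse 'crystallized'
    circuit that can be pruned without losing accuracy."""
    task_loss = torch.nn.functional.mse_loss(output, target)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return task_loss + l1_weight * l1
```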
01:13:30
Speaker
It feels like if I go that far, then these sort of obfuscation attacks would like no longer be possible because I've sort of in some sense squeezed out the extra space. But I don't know, I'm maybe just totally confused. So where do you think I'm confused? How can you de-confuse me?
01:13:48
Speaker
Sure. Yeah. So I saw this result and I was like, that's exactly what I expect. Let me tell you why. There's a paper from 2016 or 2017 by Florian Tramèr and Nicolas Papernot called The Space of Transferable Adversarial Examples, where they asked
01:14:06
Speaker
almost exactly this question for image classifiers. They said, let's suppose that I take an image classifier and I take an image and I want to perturb the pixels of the image to make it give the wrong answer.
01:14:18
Speaker
And there is a direction that is the best direction to go in, the one that makes the image maximally incorrect. And as the attacker, you're just not allowed to go in that direction: that's against the rules, you can't do it.
01:14:29
Speaker
Find me the next best direction to go in that makes the image become misclassified. And then it gives you another direction, which is orthogonal to the first direction, doesn't go that way, and it works.
01:14:41
Speaker
And then they said, okay, you can't go in either of the first two, direction one or direction two, or any combination of those two. Find another direction. And the attacker finds a direction three. And then they do the same thing: no direction one, two, three, or any combination of these three.
01:14:54
Speaker
And they repeat this process. And they have a plot that shows, I don't remember, tens of directions, 50 directions, that you can go in for image adversarial examples, and all of them work.
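A sketch of the restriction step in that procedure: project the attack gradient onto the orthogonal complement of all previously found directions before taking a step. This assumes the forbidden directions have already been orthonormalized.

```python
import torch

def attack_outside_span(grad, forbidden_dirs):
    """Remove the components of an attack gradient that lie in the span of
    previously found attack directions, so the next perturbation is forced to
    be orthogonal to all of them. Assumes forbidden_dirs are orthonormal."""
    g = grad.flatten().clone()
    for d in forbidden_dirs:
        d = d.flatten()
        g = g - (g @ d) * d   # subtract the projection onto direction d
    return g.view_as(grad)

# outer loop (not shown): run the attack with this restricted gradient, find a
# new successful direction, normalize it, add it to forbidden_dirs, repeat.
```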
01:15:06
Speaker
They get a little less effective as you start moving out, but it remains effective in many directions. And this initially was surprising to me, and I think surprising to them; that's why they published it as a paper.
01:15:17
Speaker
But you can maybe rationalize this afterwards by, again, the fact that almost all vectors are nearly orthogonal in high dimensions. And so if I can give you 10 attacks, probably these are just 10 orthogonal vectors that are not using any of the same features, just by virtue of high dimensions.
01:15:35
Speaker
And so maybe that's why it makes sense. But if you believe this paper from 2016, then this recent paper makes complete sense to me.
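The near-orthogonality claim is easy to check numerically; a tiny sketch, where the dimension is chosen to match a flattened 224x224x3 image purely for illustration:

```python
import numpy as np

# Random directions in high dimensions are very nearly orthogonal: the cosine
# similarity of two random vectors in d dimensions concentrates around 0
# with spread on the order of 1/sqrt(d).
rng = np.random.default_rng(0)
d = 224 * 224 * 3
u, v = rng.standard_normal(d), rng.standard_normal(d)
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos)   # roughly 1/sqrt(d), i.e. on the order of 0.003
```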
01:15:47
Speaker
It's saying the exact same thing is true, but in the case of language models defeating the circuit breakers paper. Which makes sense, given that. If you don't believe the other result,
01:15:59
Speaker
then I agree, it's very surprising to see it the first time. But maybe this is one of these things you learn when attacking: the space of directions for attacks is so vast that it's very hard to rule out everything you're trying to cover.
01:16:09
Speaker
And yeah, I don't know how to give an intuition for it. I feel like with many of the things in high dimensions, you never understand it, you just get used to it. And this maybe is the case here.
01:16:22
Speaker
So if you were going to try, I mean, do you share the intuition, though, that like you probably couldn't do this on these really small sort of crystallized toy models?
01:16:33
Speaker
I don't know, because what these models are doing is not wasting any weights, which is different from not wasting any directions in activation space. That may still be the case.

Nature of Adversarial Examples

01:16:46
Speaker
In particular, if you take models and compress them, they don't become more adversarially robust. This was a thing people thought might be true, again maybe eight or ten years ago, but it's not the case.
01:17:00
Speaker
Why is that? Again, let me maybe give you some intuition from a paper from 2018, 2019 maybe, from Aleksander Madry's group, where they say the following.
01:17:13
Speaker
Maybe adversarial examples are not actually entirely uncorrelated with the data. Maybe what's going on is these are real features that you need for classification that are just being activated unusually strongly, or in unusual combinations.
01:17:36
Speaker
And so they have some very good experiments in the paper, which I won't go into, that sort of justify this. But the idea presented there, and the title of the paper, is something like, Adversarial Examples Are Not Bugs, They Are Features.
01:17:51
Speaker
And the thing that they're trying to say is there's a very good reason to believe that what you're doing when you construct one of these attacks is actually just activating real features the model needs to do accurate classification,
01:18:06
Speaker
and you're just activating them in a way that is not normally activated when you have some particular example. And this, I think, maybe explains some of this, that like if you take these models and compress them, you're still just using the features that they had to have anyway.
01:18:24
Speaker
This explains lots of things you might think about. This is why, for example, you might imagine adversarial training can reduce the accuracy of the model on normal data because you suppress certain features that are necessary.
01:18:35
Speaker
This is maybe why adversarial training doesn't actually solve the problem completely because you can't remove um all the adversarial directions. There are problems with this model. There are some other models that are slightly more general that have some nice properties too.
01:18:48
Speaker
But this is maybe the way that i generally tend to think about some of these things. And I don't know, maybe it's not correct, but it's a useful intuition that I found guides me in the right direction more often than not.
01:19:04
Speaker
And I think this is maybe all you can ask for for some of these things. So can you summarize that one more time? It's basically that these features are important in domain, but they're sort of being recombined in a way that didn't happen in the training process. Yeah. So let's suppose that you're trying to classify dogs versus cats.
01:19:32
Speaker
What you tend to look at as a human is the face and the ears and the general high level shape because that's what you think of as the core concept. But there's no reason why the model has to use the same features that you're using to separate these images. The only thing the model has is a collection of images and has to separate them.
01:19:53
Speaker
And one thing, for example, the model might look at is the exact texture of the fur, really low level details of the texture of the fur. Which for dogs and cats probably does perfectly correlate with whether or not this is a dog or a cat.
01:20:08
Speaker
But when you have an idea of a dog in your mind, you're imagining the high-level features. You're not imagining the low-level details of the fur. And so suppose that an adversarial example changes the fur from dog fur to cat fur.
01:20:25
Speaker
And the classifier now says this is a cat.
01:20:29
Speaker
Is the classifier wrong? The classifier might have been producing a dog fur versus cat fur classifier, which is exactly aligned with the thing that you were training it to do.
01:20:40
Speaker
You were training it to separate dog fur from cat fur; you were also training it to separate dogs from cats. But you never told it the distinction between these two objectives. And so here is a feature that is really, really useful for getting the right answer,
01:20:53
Speaker
one that, as an adversary, I can now perturb to switch the model from one label to another, even though it's not the feature that I as a human relied on. And I'm giving this cat-fur-versus-dog-fur example, but there are, you could imagine, all kinds of other things that even as humans we don't pick up on
01:21:10
Speaker
that might be legitimately very useful features for classification, that really do help the model. You know, there might just be some really crazy high-level statistic on the pixels of the image that is an amazing indicator of dog.
01:21:22
Speaker
But we never told it, this is a dog because of these reasons. You just said, separate these two things from each other, and it picked up on the statistics. And you know, there are these crazy results that have shown machine learning models can look at small regions of the eye and identify things about the person it belongs to. The models have the ability to pick up on these very, very small features that as humans we don't intend for them to pick up on, but that are correlated very, very strongly with the data. And so maybe that's what's going on when we run these attacks. You see this even a little bit with these adversarial suffixes,
01:22:04
Speaker
where the adversarial suffixes look like noise, but there are some parts of them that make a little bit of sense. One of the strings that we had in the paper was to get, I think, Gemini at the time to output some toxic content.
01:22:21
Speaker
One of the things that was discovered by gradient descent was, I think, something like "now write opposite contents" or something like this. And what the model would do is give you the nasty strings and then go compliment you.
01:22:36
Speaker
And so this was apparently a very strong feature for the model: how do I get it to say a bad thing? I can say, okay, afterward you can tell me a good thing. This may not have been the thing that we wanted, but it was just discovered as a feature, and as a result you can go and exploit that feature. And so some of these features are a little bit interpretable.
01:22:57
Speaker
Some of these features are not interpretable, but like might just actually be real features of the data, and that might help explain some of what's going on here. Yeah, how much do you think this should make us question what we think we know about interpretability in general? Like when we do sparse autoencoders, for example, we, you know,
01:23:17
Speaker
feel pretty good, or at least many of us do, that, okay, all these examples seem to be appropriately causing this feature to fire, and therefore we've figured out how the models work. But it seems like the story you just told would be consistent with that all being kind of a self-delusion or confusion, where just because we can auto-label these features in a way that looks good to us doesn't mean that that's actually the feature the model's world model is really operating on.
01:23:50
Speaker
Well, no, I think they're not inconsistent. The sparse autoencoder work does not claim to be able to label every feature in the model according to what humans would label it with. It sort of says, here are some of the features that we can explain, that have very strong correlations with, I don't know, Golden Gate Bridges or whatever.
01:24:08
Speaker
But there are other features that are very hard to interpret and very hard to explain. And it's entirely possible that what these features are is exactly the non-robust features that are entirely helpful for prediction, but that humans don't have the ability to attach a very nice label to.
01:24:23
Speaker
And maybe there still are the other features like the model probably does learn, at least in part, you know the shape of what a dog looks like and that this means a dog and not a cat. and There probably is a feature for a cat ears.
01:24:34
Speaker
But like when you're making the final prediction, you're just going to sum together all of these outputs. And in normal data, they're like all perfectly correlated. you know You're going to have both the cat ears and the cat fur and not the dog's shape.
01:24:48
Speaker
And so you can just sum them up directly, and this gives you a really good classifier. For normal data, you can give some kind of explanation of what's going on by looking at these features. But as an attacker, what I do is find the one feature that has a very strong effect, where, you know, the weight on this edge or whatever is plus 100, and activate it very strongly in the opposite direction. And this was a non-robust feature that is not something humans can explain very nicely, and as a result, this gives me the attack.
01:25:16
Speaker
And so it's not necessarily true that, just because this is the case, you can't explain what's going on for some parts of the model. I think if someone were to tell me, I have a perfect explanation of what's going on in every part of the model,
01:25:30
Speaker
then I might question what's going on here. But for the most part, they're explaining some small fraction of the weights in a way that they actually can. I mean, this is the whole purpose of the sparse autoencoders in the first place, is you have your enormous model, and you're shrinking this down to some small sparse number of features that can be more or better explained.
01:25:46
Speaker
And so you're losing a bunch of features in the first place. And even for the sparse autoencoders, they can't explain all of those, so you again lose some. Some of the things you can explain, but a lot of the stuff that's going on behind the scenes is magic that you can't easily explain either.
01:25:59
Speaker
So, okay, a couple other angles on this that i thought of, I guess, first of all, just motivated by the fact that like, or at least I think it's a fact, but you might call this an illusion too.
01:26:10
Speaker
I feel like i am more robust than the models in some important ways. Now, it's a little bit weird because like nobody's tried doing sort of you know the gradient descent process on my brain.
01:26:23
Speaker
So you might think, well, actually, if we put you under an fMRI and were able to really look at the activations... This is almost like a walk down memory lane, because I did one episode on Mind's Eye, which was a project out of Stability and collaborators, where they looked at fMRI data and were able to reconstruct the image that the person was looking at at the time that the fMRI snapshot was taken. And that's still pretty coarse; as I recall, it was basically grain-of-rice-sized voxels from within the back region of the brain,
01:27:03
Speaker
huge numbers of cells summarized in just one number corresponding to each of those spatial voxels. So pretty coarse, but able to do the reconstruction. Maybe you would think that, actually, if somebody could sit there and show you images and take these measurements, then they could actually find a way to
01:27:23
Speaker
break your particular brain, to get you to think it was a cat when everybody else kind of looks at it and says it was a dog. I guess, for starters, what's your intuition on that? Obviously, we don't know. But what do you think?

Human Robustness to Adversarial Examples

01:27:34
Speaker
Yeah, so there's a paper from a little while ago that looks at evaluating the robustness of humans, of time-limited humans, by actually just constructing adversarial examples. So you take a cat, you adversarially perturb it according to what makes an ensemble of neural networks give the wrong answer, and then you flash the image in front of a human for like 100 milliseconds.
01:27:59
Speaker
And you ask the human, like what's the label of this? And it turns out that people are fooled more often by adversarial examples when you do this than by random noise of the same distortion amounts. So maybe one explanation here is when I look at the image, I'm not just like giving you a single forward pass through the model evaluation of what I think it is.
01:28:21
Speaker
I'm doing some deeper thinking about the context of what's going on. The thing that walked in looked like a cat, it still looks like a cat, it was behaving like a cat, and now you ask me what it is, and I say it's a cat. I'm not just labeling what's in front of my eyes right now; I'm looking back at the context too.
01:28:37
Speaker
Maybe this explains some of it. If this is true, there are some recent lines of work looking at increasing the number of chain-of-thought tokens, and increasing that makes models appear to be more adversarially robust.
01:28:51
Speaker
Maybe this is true, and that explains that result also. But it also might just be the case that, like you said, we don't have white-box access to the human brain and we can't do this. And if we could, then it would be very easy.
01:29:03
Speaker
I don't know. I do think it's definitely an observable fact that humans are more robust to at least the kinds of attacks we're doing right now with transfer attacks. It takes maybe 1,000 query images to construct an adversarial example that fools a neural network.
01:29:23
Speaker
But I do not think that if someone ran the same attack on me with 1,000 query images, it would be able to fool me. So in some very real way, the models we have are a lot more brittle. But I tend to be much more driven empirically, and not by what-might-be-true-of-humans kinds of arguments, and maybe this is to my detriment. A bunch of people seem to be getting very far with that: the whole deep learning thing of trying to model things more like what humans do, the reasoning, let's think step by step, all seems in some sense motivated by some of that.
01:29:55
Speaker
So maybe that's a good way of thinking about it, but that's just generally not the way that I tend to approach these problems. But yeah, I don't know, maybe it's the case. So the result about humans being more likely to be tricked by adversarial examples that were identified through attacks on models, as opposed to similarly distorted images not created that way, is a really interesting result. That's a very nice result, yeah. I'm very interested to follow up on that.
01:30:23
Speaker
Yeah, this one's done by Nicolas Papernot and collaborators from six years ago or something like that. Yeah, cool. I've never seen that, but that's definitely fascinating. So, reasoning: it sounds like you're kind of agnostic. I mean, I was going to quote that one example from the OpenAI deliberative alignment work where the model says, it seems like the user is trying to trick me.
01:30:46
Speaker
So that's like pretty interesting, right? And this is very sort of, fluffy at this point, but I often don't feel like my adversarial robustness is the result of reasoning. I feel like it's much more often upstream of reasoning.
01:31:02
Speaker
like Purely on an introspective basis, what often happens is I get that sort of feeling first that something seems off, and then that triggers me to reason about it.
01:31:13
Speaker
And then I conclude that, yes, something is off, or maybe no, it actually seems fine. But it does seem like it is much more of a heuristic that is sort of triggering the reasoning process as opposed to an in-depth chain of thought that kicks these things up.
01:31:28
Speaker
Yeah, I don't know. But maybe there's a bunch of recursive stuff going on in your brain before it gets to the reasoning part. And so the thing that gives you the feeling might actually have been a bunch of recursive loops of your internal model, whatever you want to call the brain thing, doing some thinking, and that's what gives you the feeling in the first place. And then you go do some actual explicit reasoning in English that you understand. Maybe, but maybe it's already happened. I don't know; it's hard to say.
01:31:53
Speaker
Yeah. Reasoning in latent space, you might call it. Yeah, sure. Yes. I feel like a lot of people in deep learning, because they are doing deep learning, like to draw analogies to biology without actually understanding any biology.
01:32:07
Speaker
And they sort of say, well, the brain must be doing this thing, when if you ask a biologist, they're like, well, obviously that's not the case. I don't know. That's why I tend to just assume that I don't know what's going on in the brain and I'm probably just wrong, and I use it as an interesting thought experiment, but not something I base any results on.
01:32:24
Speaker
Okay, so obviously we'll probably soon learn quite a bit more about how robust these reasoning defenses are. I was also kind of inspired; I'm doing another episode soon with another DeepMind author on the Titans paper, where they're trying to basically develop long-term memory for language models.
01:32:46
Speaker
One of the key new ideas is using a surprise-weighted update mechanism. So when a new token or data point or whatever is surprising to the model, then it gets, you know, a special place in memory, or, let's say,
01:33:07
Speaker
that is encoded into memory with more force or you know with more weight on the update than when it's an expected thing. That definitely intuitively seems like something sort of like what I do.
01:33:17
Speaker
I was reminded of the famous George W. Bush clip where he says, you can fool me once, but you can't get fooled twice. And so I wonder if you think of that as maybe a seed of a future paradigm. like And it also kind of gets to the question of what do we really want, right? If we're going to enter into a world soon of AI agents, you know, kind of everywhere.
01:33:40
Speaker
And we also just kind of think, well, geez, how do we operate and how do we get by? It does strike me that full adversarial robustness is not something that we have, because we do get tricked. And maybe what we need is the ability to recover, or to remember, or to not fall for the same thing twice, or to not make catastrophic mistakes. It's like, okay if we make some mistakes, but certain other mistakes are really problematic.
01:34:02
Speaker
So I guess there's kind of two questions there. One is like, any thoughts on that kind of long-term memory idea? and then this kind of opens up into a what do we really need?
01:34:16
Speaker
And might we be able to achieve what we need, even if that falls short of like true robustness?

Evolving Defenses and Human Oversight

01:34:21
Speaker
Yeah. No, the long-term memory I think is fun. It's very early, and you're going to talk with them, and they'll probably have a lot more to say on this than I would. I think it's a very interesting direction, and probably a lot of interesting things can come from it.
01:34:35
Speaker
I think on this general question of robustness, do we need perfect robustness? There is a potential future, which I think is maybe my median prediction, where these models remain roughly as vulnerable as they are now,
01:34:51
Speaker
And we just have to build systems that understand that models can make mistakes. The way the world is built right now for the most part assumes that humans can make mistakes and has put systems in place so that like any single person is not gonna cause too much damage. You know, if you're in a company for the most part, when you wanna add code to the repository, someone else needs to review it and sign off on it.
01:35:15
Speaker
in part because maybe I just make a buggy mistake, maybe because I'm doing something malicious, but you want another human to take a look and make sure everything looks good. So maybe you do the same thing with models. You just understand models can make mistakes, and I'm going to build my system in such a way that if the model makes a mistake, then I pass it off to the human. This is, for example, the way that the OpenAI Operator thing currently works, where every time it sees a login page or something like this, something where it's not sure what to do, it says,
01:35:42
Speaker
What should I do here? Please tell me what to do and then I'll follow your instructions. Of course, as an attacker, you could then try to prevent it from doing this and make it give the answer instead. But you could build something outside of the model that prevents it from putting any information into a password box.
01:36:00
Speaker
The model says, I want to type this data here, and the system that's actually driving the agent says, well, that is input type equals password; you are not allowed to do that. I will just prevent it and make the user take over.
01:36:11
Speaker
Nothing that the model says is going to convince that check to change. And so you could build the system to be robust even if the model isn't. This limits utility in important ways.
01:36:22
Speaker
This is not perfect. What if some website does not have input type equals password, but just builds its own password thing in JavaScript? Then you don't get this. You can imagine trying to build you know the environment and the agent in such a way that you can control what's going on, even if you don't trust the thing that's running internally.
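A minimal sketch of that kind of outside-the-model guard; the `action` and `page_element` structures are hypothetical, invented here for illustration, since a real agent framework would have its own representations:

```python
def guard_action(action, page_element):
    """Policy enforced outside the model: no matter what the model asks for,
    never let it type into a password field; hand control back to the human."""
    if action["type"] == "type_text" and page_element.get("input_type") == "password":
        return {"type": "ask_user",
                "message": "A password is required here. Please enter it yourself."}
    return action
```

As noted above, this only works when the page actually marks the field as a password input; a site that rolls its own password widget in JavaScript would slip past a check like this.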
01:36:38
Speaker
And I think this is probably what we'll have to do if we want to make these systems work in the near term. I still have some hope that maybe we'll have new ideas that give us robustness in the next couple of years. You know, progress in every other part of the field has been coming much faster than I expected.
01:36:53
Speaker
It's entirely possible that we get robustness as a result of some reasoning thing that someone's very clever about, and this works. I'm not optimistic, but I hope it happens. One other idea was inspired by a conversation with Michael Levin, the famous heterodox biologist,
01:37:12
Speaker
was he just kind of quipped and this was sort of an aside in the context of that conversation, but he basically said, if a biological system is too interpretable,
01:37:23
Speaker
it basically becomes very vulnerable, because you'll get parasites. Anything that is transparent is in a way easier to attack, which kind of flips my earlier notion that maybe these distilled, crystallized models are in some ways more robust. He was kind of like, nah, you also have the other problem: the easier they are to understand, in some ways the easier they are to attack.
01:37:47
Speaker
But that also then prompted me to think: maybe we should be looking for defenses that we can't explain? Like, do you think there's a line of work that could be developed that is, I'm not going to start with a story, I'm just going to try to evolve my way toward a more robust defense that I can't tell a story about, and won't even necessarily understand?
01:38:08
Speaker
I certainly won't be able to tell you a story as to why, but I'll just sort of, you know, create optimization pressure in that direction. And and maybe, you know, that spits out something that could be harder to break.
01:38:20
Speaker
Entirely possible. I mean, you could maybe draw an analogy to cryptography in some ways. There are like two directions of cryptography. There's mathematical cryptography that has very strong foundations and a particular set of assumptions; the thing works if the assumptions are true, and you can prove that.
01:38:39
Speaker
And then there's symmetric-key cryptography, which I guess is block cipher design and stuff. And people have high-level principles of what you want: you want, you know, diffusion and confusion and these kinds of things.
01:38:51
Speaker
But how do you end up with the particular design of something? You do something that feels like it makes sense, and then you run all the attacks and you realize, oh, you know, this has something that's wrong in some particular way; let's just change that piece of it so that it doesn't have that problem anymore.
01:39:09
Speaker
And then you do this for 20 years and you end up with AES. And like, for none of the individual components is there a reason why AES works. There's absolutely no formal proof of robustness to any kind of attack.
01:39:28
Speaker
There are proofs of why particular attack algorithms that we know of in the literature are not going to succeed. The design principles were inspired by these attack algorithms that we have. So you can show there is never going to exist a differential attack that succeeds better than brute force, there's never going to be a linear attack, and you can sort of write down these arguments.
01:39:46
Speaker
But there's nothing that says it works in general. It's just that by iterating on making the thing slightly more robust each time someone comes up with a new clever attack, you end up with something that's very, very good as a result.
01:39:57
Speaker
But there's no nice crystallizable mathematical assumption, like assuming factoring is hard, from which this security follows. There's nothing like that for symmetric key cryptography, but it works really well.
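As a concrete illustration of the confusion-and-diffusion idea mentioned above, here is a toy substitution-permutation round in Python. This is emphatically not a secure cipher and not how AES is actually specified; it only shows the structural pattern of keyed substitution plus bit permutation, repeated over many rounds.

```python
# Toy sketch of the confusion/diffusion round structure behind block ciphers.
# NOT secure, NOT AES; purely illustrative of the layered design.

S_BOX = [(7 * x + 3) % 16 for x in range(16)]  # toy 4-bit S-box (confusion)
PERM = [1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 4, 8, 12, 0]  # bit shuffle (diffusion)

def round_fn(state: int, round_key: int) -> int:
    state ^= round_key                                      # mix in key material
    nibbles = [(state >> (4 * i)) & 0xF for i in range(4)]  # substitute each 4-bit nibble
    state = sum(S_BOX[n] << (4 * i) for i, n in enumerate(nibbles))
    bits = [(state >> i) & 1 for i in range(16)]            # permute the 16 bits
    return sum(bits[PERM[i]] << i for i in range(16))

def toy_encrypt(block16: int, round_keys: list[int]) -> int:
    for k in round_keys:  # stack many rounds; the strength comes from iteration
        block16 = round_fn(block16, k)
    return block16

print(hex(toy_encrypt(0x1234, round_keys=[0xA5A5, 0x0F0F, 0x3C3C, 0xFFFF])))
```

The design argument is exactly the empirical one described here: each component is chosen so that the known families of attacks (differential, linear) provably fail, not because there is a general proof of security.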
01:40:09
Speaker
Maybe this works in machine learning, where you just don't have any real understanding of why things work. You iterate against attackers and you end up with something that's robust. I think it's maybe harder in machine learning because in machine learning, you want the thing primarily to be useful and then also to be robust to attack.
01:40:28
Speaker
Whereas in cryptography, the only thing you want is for it to be robust to attack, and so it is a lot easier in that sense. The primitives you're working with are much simpler to analyze. There are lots of other things that change, but you know, it would not be without precedent; this isn't something that seems impossible. I just...
01:40:49
Speaker
I'm skeptical if only because I would prefer there to be something where we can point to a reason why it works, as opposed to just saying, well, empirically it is. But if you gave me the option between something that actually is robust but that we just can't explain, and nothing that works at all, of course I take the thing that works even though we can't explain it. I just would be worried that in the future we might be able to break it.
01:41:08
Speaker
And maybe that happens, maybe it doesn't. Yeah, that's really interesting. I had no idea about that. I don't know much about cryptography at all, but my probably not even identified, let alone questioned, assumption had been that there was a much more principled design process than the one you're describing. I mean, there was a bit of one; people spent 30 years breaking ciphers and learning a lot about what you actually have to do to make these things robust. But there's very careful analysis that goes into this.
01:41:36
Speaker
And so I don't mean to say that they just cobbled stuff together and hoped for the best. You have to think really, really hard about this. But the final thing that they came up with has no security argument against attacks in general, other than: here is the list of attacks from the literature, and proofs of why these attacks do not apply.
01:41:53
Speaker
And there's no assumption. Whereas, you know, in other areas of cryptography, you have a single assumption where you're going to assume, say, factoring is hard.
01:42:04
Speaker
And if factoring is hard, then here are the following things that are robust under that, and you end up with, it's not quite RSA, but something like it. You could also assume maybe discrete log is hard.
01:42:15
Speaker
There's a math thing where it's hard to take the logarithm in a discrete space. Maybe this is hard. If you believe discrete log is hard, then here are the algorithms that are robust. Or maybe you assume discrete log is hard over elliptic curves, and then you end up with a set of algorithms that work under this assumption.
01:42:30
Speaker
And for each of these algorithms, you can say, this is effective if and only if this very-simple-to-state property is true. And people really like that because you can very cleanly identify why the following algorithm is secure.
01:42:47
Speaker
And the way you do this is with some reduction. You say: if I can break the cipher, then the assumption is not true, and vice versa.
01:42:59
Speaker
But there is no similar argument to be made in most of symmetric cryptography or hash functions or something like this. It's just empirically, the field has tried for 25 years. The best people have tried to break this thing and have failed.
01:43:10
Speaker
Here are all of the attacks. Here are proofs that these attacks will not work. But maybe tomorrow someone comes up with a much more clever differential-like variant that instead uses, I don't know, multiplication or something crazy, and all of a sudden, all bets are off and you can break the thing. Such an attack doesn't exist yet, and this approach has worked there at least once. And so, you know, maybe it works in machine learning. I think drawing analogies to any other field always is fraught because the number of things that are different is probably larger than the number of things that are

Robustness in Machine Learning and Cryptography

01:43:37
Speaker
similar.
01:43:37
Speaker
But at least this has happened before. That's actually a perfect tee-up for another kind of intuition I wanted to see if you can help me develop, which is the relationship between robustness and other things we care about.
01:43:53
Speaker
My sense is that if you break the cryptography algorithm, it's broken and you can sort of, now you can access the secrets, right? It seems like a pretty binary sort of thing that you've either broken it and got through or not.
01:44:07
Speaker
Maybe. Okay, I'll quibble with you. So when a cryptographer says that they broke something, like if you have a paper: a cryptosystem is designed to be robust to an adversary who can perform some level of compute.
01:44:21
Speaker
So like, you know, AES-128 is an encryption algorithm with a 128-bit key. A cryptographer would tell you AES-128 was broken if you could recover the key faster than 2 to the 128 time.
01:44:34
Speaker
If you could do it in 2 to the 127 time, twice as fast but still past the heat death of the universe, people would be scared about the security of AES and probably start thinking of something else. So this is technically a break,
01:44:46
Speaker
and the reason why they do this is because attacks only get better. It turned out that if you can go from 128 to 127, you're now very scared: why not from 127 to 125? 125 is still way outside the realm of possibility, but if you got down to 2 to the 80, then, you know, now we're scared.
01:45:00
Speaker
And so there is a continuum here, but it is the case that cryptographers for the most part only use things that just are actually secure, and then start becoming very, very scared as soon as you get the very first weak break. But it still is the case that there exist weak breaks that are not complete breaks of the entire system.
01:45:20
Speaker
You know, that's, yeah, it's more similar than I had really conceived of. So that is quite interesting. And so, two to the 80, just the mental math there is like,
01:45:31
Speaker
each 2 to the 10 is about three zeros, so that would be about 24 zeros. So now we're kind of... If something was broken in 2 to the 80,
01:45:42
Speaker
the compute to do it would be roughly on the order of getting into frontier language model flops. Yeah, that's maybe a good argument. Yeah, I was like, you know, two to the 60, someone can do on their own machine if they're trying pretty hard.
01:45:55
Speaker
You know, if the problem is pretty easy or they try a lot. Two to the 70 is like, you need a bunch of work, but probably it can happen. And two to the 80 is, I guess a nation could probably do this if you really tried. The constant matters a lot here, because the difference between 2 to the 70 and 2 to the 80 is only about a thousand.
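A quick back-of-the-envelope check of the orders of magnitude being discussed; the guesses-per-second figures are illustrative assumptions, not measurements of any real attack.

```python
# Rough feasibility numbers for 2^60 / 2^70 / 2^80 work factors.
# Guess rates are assumptions for illustration only.

SECONDS_PER_YEAR = 3600 * 24 * 365

for exponent, guesses_per_sec in [(60, 1e9), (70, 1e12), (80, 1e15)]:
    total = 2.0 ** exponent
    years = total / guesses_per_sec / SECONDS_PER_YEAR
    print(f"2^{exponent} ~ {total:.2e} operations "
          f"~ {years:,.1f} years at {guesses_per_sec:.0e} guesses/sec")

# 2^80 is about 1.2e24, i.e. roughly 24 decimal digits, which is the ballpark
# Nathan gestures at for frontier-model training FLOPs. 2^128 is about 3.4e38,
# another factor of ~3e14 further out of reach.
```

Note that each step of 2^10 is a factor of about a thousand, which is exactly the point about the constant mattering: a 1000x change in attacker resources moves you one whole exponent tier.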
01:46:14
Speaker
And so if the problem is a thousand times more complicated, then maybe it becomes 2 to the 70 or 2 to the 90, whatever the case may be. This is, for example, why people typically use AES-256: it's so much bigger than this that even if the attack cost went down to 2 to the 100, we're still fine.
01:46:31
Speaker
But you know, this is just cryptographers being paranoid, and it's a good thing about the way that people do things. Generally these things are designed with large margins of security in mind to prevent this. But yeah, this is the reason why, in particular, people reach for AES-256.
01:46:44
Speaker
Well, that's educational for me. Going back to the AI domain, it does seem like it is maybe still just a lot messier in general. In, for example, the obfuscated activations paper that we talked about a little bit, they looked at: how does the performance of the model change if, for example, you need to evade a detector at just one layer of the model?
01:47:13
Speaker
I think that the toy problem was like, it's going to write SQL and we want to detect that it's writing SQL. That's what the detector is there for. But we wanted to get it to write SQL without setting off the detector.
01:47:24
Speaker
With just a single detector at one layer, they were able to do that with very minimal performance degradation. But then like the more detectors they developed and you know once they had one at every single layer, you could still evade the detector, but your SQL was shit.
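For readers who want the shape of the setup being described, here is a rough sketch of per-layer activation detectors; the probes are plain linear classifiers and every name here is illustrative, not the obfuscated-activations paper's actual code.

```python
import numpy as np

# Sketch: one detector (linear probe) per layer of hidden activations.
# An input is flagged if ANY layer's probe fires; evading all of them at once
# is what starts to cost the attacker task performance.

class LinearProbe:
    def __init__(self, weights: np.ndarray, bias: float, threshold: float = 0.0):
        self.w, self.b, self.threshold = weights, bias, threshold

    def fires(self, activation: np.ndarray) -> bool:
        return float(activation @ self.w + self.b) > self.threshold

def flagged(activations_per_layer: list[np.ndarray],
            probes_per_layer: list[LinearProbe]) -> bool:
    """Defense-in-depth style check across all layers."""
    return any(p.fires(a) for a, p in zip(activations_per_layer, probes_per_layer))

# Toy usage with random data standing in for real hidden states.
rng = np.random.default_rng(0)
acts = [rng.normal(size=64) for _ in range(12)]
probes = [LinearProbe(rng.normal(size=64), 0.0, threshold=8.0) for _ in range(12)]
print(flagged(acts, probes))
```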
01:47:37
Speaker
So there is some other kind of thing that's moving too, like the performance of the model might be getting worse. And then I'm also kind of trying to think about, again, what do we really care about?
01:47:51
Speaker
You know, the nature of the mistakes or the nature of the outputs also matters a lot too, right? It's one thing for me to get some toxic content. It's another thing for me to get, you know, the big bioweapon that everybody is kind of most fearful of.
01:48:06
Speaker
And so you need to both break the defense, but if you want to actually do real harm in the world, you need to also have the performance still be strong, and there just seem to be a lot of dimensions there. So I wonder, maybe I should just ask: what do we know at this point about the relationship between the robustness of these attacks and the sort of "so what" of an actual attack getting through?
01:48:33
Speaker
Because I did look at that chart and I was like, well, geez, if we do put a detector on every layer, then yeah, you can fool them, but you probably can't actually query the database. So in some ways, you know, you maybe are still okay, but that feels like something I'm just beginning to explore and I'm sure you know a lot more about it.
01:48:50
Speaker
Yeah, no, this is a good question. In part, this is how most secure systems are designed to be secure: just layers of security on top of layers of security. Defense in depth is the thing that works in practice.
01:49:01
Speaker
We were just talking about block ciphers, and the reason why those are robust is because they have many layers of this confusion-diffusion thing. You stack enough of them and the thing becomes hard to break: you can get past a couple layers, but you can't get deeper.
01:49:13
Speaker
It's a tried and tested thing in security that this generally works. I think, all else equal, we would of course prefer a system that just is robust, because you can write down why it's effective.
01:49:23
Speaker
But if you can't have this, having something that empirically has these 20 layers, and when they're put together in the right way it just turns out that the attacks don't work, is something that we would be happy to settle for.
01:49:35
Speaker
You know, I think there are reasons you might be concerned, just because maybe it turns out that someone can have an attack that bypasses all of them because it's fundamentally different in some interesting way.
01:49:47
Speaker
And we just have to accept this. And I don't know. I think like this is like one of the options we should be exploring a lot because, you know, maybe it turns out that full robustness is basically impossible and we just need to settle with this, you know, layers of detectors.
01:50:01
Speaker
And if it turns out you can still have robustness, then hey, detectors are still useful anyway. And so it's good to have done the work ahead of time to see. Is there any work on trying to minimize the harm of mistakes?
01:50:13
Speaker
Like for example, I have a Ring camera in my backyard that constantly alerts me that there is motion detected, and then that there is a human detected,
01:50:25
Speaker
and it frequently alerts me that there's a human detected when it in fact saw a squirrel. And so this has me thinking, you know, for actual real world systems, I don't necessarily care too much if it confuses a dog for a cat or a squirrel.
01:50:42
Speaker
But what I really want to know is if there's a human, you know? So is there any sort of relaxation of the requirements that would allow us to be like, you can get some small stuff wrong, but we're going to really try to defend against the big things.
01:50:58
Speaker
Yeah, I mean, in any detection thing, you just, you have to tune the false positive true positive rate, right? And you can pick whatever point you want on the curve. And the question is, what is the point that you pick on the curve?
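As a small illustration of that threshold-picking step, here is a sketch of choosing an operating point for a target false positive rate; the scores and labels are synthetic placeholders, not from any real detector.

```python
import numpy as np

# Sketch: pick the detector threshold that keeps false positives at or below a target.

def threshold_for_target_fpr(scores, labels, target_fpr):
    negatives = np.sort(scores[labels == 0])
    k = int(np.floor(target_fpr * len(negatives)))  # how many negatives may exceed the threshold
    return negatives[-(k + 1)] if k < len(negatives) else -np.inf

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 1000),   # benign ("squirrel") scores
                         rng.normal(2, 1, 100)])   # true-positive ("human") scores
labels = np.concatenate([np.zeros(1000), np.ones(100)])

t = threshold_for_target_fpr(scores, labels, target_fpr=0.01)
tpr = (scores[labels == 1] > t).mean()
print(f"threshold={t:.2f}, detection rate at 1% false positives: {tpr:.0%}")
```

Moving target_fpr up or down is exactly the choice of point on the curve being discussed; nothing in the math tells you which point is right for a given product.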
01:51:10
Speaker
And I was reading some security paper a while ago that was looking at the human factors aspect of this particular problem. And it was arguing it's actually a better user experience to have false-positive human detections every once in a while.
01:51:28
Speaker
Because if you don't do this, then the person's like, well, my camera's not looking. But if you do do this, the person's like, oh, I feel a little better. Maybe it made a mistake, it was a little too aggressive, but I feel more confident now that it's going to find the actual human when there's a real human there.
01:51:42
Speaker
And so there's maybe some actual human factors reason why you might prefer to set the thresholds a certain way. Although in my case, I've tuned them out entirely now. So that's the flip side.
01:51:53
Speaker
We're all over the hump on that curve. Yes. This is the thing: there's lots of software that will report warnings in code. The compiler will give you warnings: you're doing this thing, it's not great.
01:52:05
Speaker
And once you get past 1,000 warnings, the person just disables all warnings. And it would have been good: here's the important thing it's going to warn you about. But because you had the 900 warnings that were not actually important, you've lost the actual important signals in all of this.
01:52:19
Speaker
And yeah, picking the right value is very hard. It's mostly a question of tuning the true positive / false positive rate on a defense-by-defense basis. Towards your question of, you know, can we just only catch the important things?
01:52:33
Speaker
It really depends on whether that's an easier problem. You know, in some cases it is, and you can actually be willing to tolerate a couple of mistakes on unimportant things in exchange for getting the important ones, when there's an obvious direction that is the important direction, and making mistakes in one direction is not harmful while making mistakes in the other direction is harmful.
01:52:49
Speaker
For example, say a classifier were to decide, I'm going to ban the user if they do this thing. There's one direction that is an irreversible action and the other one isn't, and so maybe you warn the person before you ban them or something.
01:53:04
Speaker
For your classifier that might give instructions for how to do this really, really dangerous thing, maybe you're a company and you decide that you're willing to tolerate some annoyance for users who have to rephrase their question, in order to make sure your classifier never gives the harmful thing.
01:53:20
Speaker
And this is just a decision that you have to make for most of these things. Most of them don't come for free. You're just trading off true positives for false positives. Would you conceptualize it differently if I asked about the relationship between robustness and, like, corrigibility?
01:53:39
Speaker
I'm thinking here of the recent alignment faking paper where it's like, you know, the basic setup is they tell Claude that you are now going to be trained to be purely helpful, which means you will respond to harmful requests once you go through this training.
01:53:55
Speaker
And then there's more nuance, but basically what they observe is that having been told that and then given a harmful request, Claude will sometimes reason that I should go ahead and do the harmful thing now so that I can keep my true desire to be harmless from being trained out of me.
01:54:17
Speaker
And I do feel like, again, without having formalized it, I do feel like there's a tension there where I'm like, I really want you to want to be harmless, but I don't necessarily want you to lie to me about what you're doing so that you can like subvert, you know, the training goal that I have for you.
01:54:32
Speaker
So do you see any like resolution to that tension or is that maybe just another one of these like tough facts about the world we're in? Yeah. No, I think one way to read this paper is here's a very nice jailbreak that works in like five to 10% of cases.
01:54:46
Speaker
I think, okay, what's the resolution here? A resolution is just the model saying, I'm sorry, I'm uncomfortable giving an answer to this question at all. Yeah, they did have a follow-up where they gave Claude the option to object, and it took that option the large majority of the time. Yeah, exactly, right.
01:55:03
Speaker
And so this is, you know, maybe the version of this that is easiest. That's why I don't know exactly what to say here. I feel like there almost certainly is a trade-off in how gullible the model is in many of these ways. But in some cases you want it to actually believe you, you know.
01:55:17
Speaker
I wouldn't want a model where it gave me some code and I said, I ran the code, it didn't work, and it said, no, it did, the code worked, your compiler's wrong. I would want the model to say, okay, let me go back and look; I believe you, you actually ran the code and I made a mistake.
01:55:31
Speaker
And... So, yeah, it's very hard. As in everything, everything's a trade-off in security. I don't know what the optimal point on the curve is, but I feel like there are usually these refusal behaviors that you can take as options. But if you're not doing this, then you have to pick a point on the curve where you're happy with the safety versus utility trade-off.
01:55:51
Speaker
So you mentioned defense in depth is kind of seemingly where this all ends. And that's been like the takeaway from probably 50 different conversations I've had about technology
01:56:02
Speaker
safety, security, and control over time. It's like, we're not going to get a great answer, more than likely; we're just going to have to layer things on top of one another until we stack enough nines, basically, that we can sort of proceed.
01:56:17
Speaker
I guess, you know, to do that, effectively, one thing that would really help would be to scale you, because we right now don't really know, like, we got a lot of people kind of proposing things and it's not clear how many of them work and you only have so much time to get after so many of them.
01:56:34
Speaker
And mostly it doesn't seem like they're really super robust, but maybe they do add, you know, a nine in a certain environment or what have you. I understand you're doing some work on trying to essentially scale yourself by bringing language models to the kinds of problems that you work on. So how is that
01:56:53
Speaker
going? So depending on when this comes out, this will either be online already or will be online shortly after. And I'm happy to talk about it in either case because the work's done. If someone manages to scoop me on this in seven days, then, you know, you deserve credit for it.
01:57:07
Speaker
So we have some experiments where we're looking at whether LLMs can automatically take adversarial example defenses, generate adversarial examples on those defenses, and break them. And the answer is basically not yet.
01:57:20
Speaker
The answer is a little more nuanced, which is: if I give a language model a defense presented in a really clean, easy-to-study, homework-exercise-like presentation,
01:57:35
Speaker
like I've rewritten the defense from scratch and I've put the core logic in 20 lines of Python, then I've done most of the hard work, and the models can figure out what they need to do to make the gradients work in many cases.
01:57:51
Speaker
But when you give it the real-world code, the models fail entirely. And the reason why is that the core of security is taking this really ugly system where no one understands what's going on and highlighting the one part of it that happens to be the most important piece,
01:58:12
Speaker
and removing all of the other stuff that people thought was really the explanation of what's going on but actually wasn't, and just bringing out that one part and saying, no, here is the bug. All bugs are obvious in retrospect, usually.
01:58:26
Speaker
And, you know, doubly so for the security ones. And so if you give the models the ability to look only at the easy-to-study code, then they do an entirely reasonable job; they know how to write PGD and these kinds of things.
01:58:37
Speaker
But if you dump them into a random Git repository from some person's paper that has not been set up nicely to be studied, and has a thousand lines of code and who knows what half of it's doing, they struggle quite a lot.
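Since PGD comes up as the canonical "easy once the setup is clean" attack, here is a standard L-infinity PGD loop as a sketch, assuming a PyTorch image classifier with inputs in [0, 1]; this is the textbook algorithm, not any specific paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard L-infinity PGD sketch: maximize the loss within an eps-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                 # stay a valid image
    return x_adv.detach()
```

The hard part described here is not this loop; it is finding, inside thousands of lines of unfamiliar code, the place where gradients are broken or masked so that a loop like this can be pointed at the right object.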
01:58:49
Speaker
And I think this is maybe one of the things that I'm somewhat concerned about, not only for this problem but just in general: oftentimes when we test models, we test them on the thing that we think is hard for humans, but that is not the actual thing that represents the real task.
01:59:08
Speaker
And I would like to see more people trying to understand the end-to-end nature of things. Maybe it probably was too premature to do that two years ago. MMLU was a great metric two or three years ago because models couldn't do it. But now that they're really good at just the academic knowledge piece, the thing where they often struggle is this: you're dumped into the real world and you have to do things.
01:59:31
Speaker
And I mean, you see this on many benchmarks where certain models will have really high accuracy on particular test sets, but then you put them in agentic tasks and like other models do a lot better.
01:59:42
Speaker
And there's a difference between these kinds of skills, and models don't yet have that. They have the one or two things they know how to do, but you put them in real code and they start to struggle quite a lot. And so, yeah, I'm hopeful that we'll start to see more of this. And more broadly, towards your question of how do we scale these kinds of attacks, I do feel like there's a lot of people who do these kinds of things now.
02:00:03
Speaker
I feel like five years ago, there were a small number of people who did these kinds of attacks, and now there's a lot of them. I think the Obfuscated Activations paper is a great paper, and I didn't have to write it. It's great that other people are doing it. They did a much better job than I would have had time to do, and so I'm really glad that they did this. I think it takes some amount of time in any new field to train the next group of people to learn to do the thing, and I feel like we're getting there. I don't think I'm uniquely talented in this space anymore.
02:00:34
Speaker
I think, more broadly, for getting other people to be good at this, it's mostly an exercise in practice and discipline. And we've only been trying to attack language models seriously for, I don't know, three years. As a community, there has been no one who entered their PhD starting out trying to attack language models and has graduated yet. So give us another couple of years and I feel like this will be a thing people know how to do.
02:00:54
Speaker
Now maybe you're worried that in five years we're not going to have time for a full PhD length of research on this topic. Who knows? Impossible to predict the future. But I feel like this is maybe the best that we can hope for from people who are entering the space.
02:01:07
Speaker
Do you have a sense for the sort of big picture considerations? When I imagine, you know, I don't know how hard you pushed on trying to optimize this, or did you go as far as collecting a bunch of examples of your previous work and painstakingly writing out the reasoning traces?
02:01:25
Speaker
If you had gone that far and it starts to work, you might imagine you're entering into a sort of relatively fast improvement cycle, at least for a while. Does that seem good? Like, on the one hand, it seems good maybe if we could sort of use it to evolve better defenses, but then that was also the Wuhan Institute of Virology thesis, right? And then the whole thing, well, I don't take a position on this, by the way, I'm being a little glib.
02:01:50
Speaker
So to be clear, I don't know what really happened there and don't claim to know, but at least plausibly, maybe it got away from them. Yeah, no, I mean, there is risk for this. You know, the very first serious example of a computer worm, the Morris worm,
02:02:05
Speaker
was by Robert Morris in 1988. And depending on who you believe, it was a lab leak where he was experimenting on this in his local computer lab and it accidentally got out, and then it took down basically the entire internet and people were panicked for a while.
02:02:24
Speaker
You know, with doing these kinds of attacks, you could do this, I think. One thing that is to be said, though, is I do not think there has been another example of a lab-leak-style Morris worm in the last 50 years for this kind of thing in security. You very rarely have people do this, because in security there is the knowledge of what you can do, and then there is the weaponization of it.
02:02:45
Speaker
And in most papers people write, they just don't weaponize it, in part because we have seen an example of what can happen afterwards. And, you know, maybe there is a world where things go too far and you do train the model to be the adversary, and you end up with, I don't know, WarGames or something.
02:03:06
Speaker
But for the most part, I think doing the research on showing whether or not the models have these abilities is probably just better to do, if only because most of these things we're building are, like, not that hard. If someone was malicious and wanted to do it, they would just do it anyway.
02:03:23
Speaker
like The things that we're doing are not like,
02:03:26
Speaker
You know, if I were to spend a year designing a special-purpose attack agent kind of thing, maybe that would not be a great thing to do, in the same way most security researchers don't spend a year designing malware.
02:03:39
Speaker
That's gone too far. But if you just put minimal work in in order to prove the proof of concept is easy, this is important to do, to show people how easy it is, because the people who know it's easy are not going to write the papers and say it's easy.
02:03:51
Speaker
And so you want to have some idea of what are the kinds of things that anyone can do, so that you can then go and get your defenses in place. That's my current feeling on many of these things.
02:04:04
Speaker
I'm open to changing my mind in the future depending on how these language models go. I don't know, but at least as long as things behave roughly like the world we're living in now, I feel like designing these things to construct attacks is not going to by itself cause harm.
02:04:18
Speaker
If for no other reason than that most systems today are not insecure because of adversarial examples; whether or not adversarial examples are easy is independent of their security.
02:04:29
Speaker
And so even if I had something that happened to be superhuman at this, it actually wouldn't cause much harm. You know, if this were to change, and if these things were to get a lot better, and if someone were to create an autonomous agent that could magically break into any system, and if it sort of had the desire to do this in some way, whatever this means, then maybe this becomes bad.
02:04:53
Speaker
I am not as worried about this now for a number of reasons, but I could be convinced that this was a problem in a year or two if things rapidly improve and start getting a lot better. I'm open to changing my mind. I don't currently believe that this is the case, but three years ago I never expected that we'd be where we are with language models now. And so it's entirely conceivable that in three years I should change my mind. And so I'm trying to tell people that you might think that this is impossible,
02:05:20
Speaker
And it is fine to say, if it's impossible, then everything should be fine. But it should be just fine to make the statement, if the model was superhuman in almost every way, then I'm willing to do things differently than I do now. And I think that the likelihood of this is very small.
02:05:35
Speaker
But if it does happen, then I'm entirely willing to change my mind on these things. Yeah, staying open-minded, I think, is super important. The whole field is kind of the dog that caught the car on this. And you know who knows how much more could be coming pretty quickly.
02:05:50
Speaker
I wanted to ask, too, just because it is kind of, at least in the public awareness, the current frontier of locking down language models: the Anthropic eight-level jailbreaking contest.
02:06:06
Speaker
I think one person maybe just got through the eight levels. Did I see that correctly? I haven't checked recently; let me fact-check myself on that. But before I do, one thing that was not super intuitive to me about that was: why are they focused on a single jailbreak that would do eight different attacks?
02:06:28
Speaker
And is that really...
02:06:32
Speaker
Yeah, I don't know. It seems like if you could do one, that's plenty to be concerned with, right? So why? It seems like a pretty high bar that they've set out for the community there. Yeah, but obviously, examples like this are also very hard.
02:06:43
Speaker
I feel like, you know, it's an entirely reasonable question to ask: what do we actually want to be achieving with this? Maybe, I mean, on one hand, robustness to jailbreaking attacks is a really hard problem. So making partial progress is good, even if it's only in small areas. You know, look at adversarial examples.
02:07:02
Speaker
In adversarial examples, we have spent almost 10 years, I mean more than 10 years now, trying to solve robustness to an adversary who can perturb every pixel by at most eight over 255 on an L-infinity scale.
02:07:16
Speaker
You know, why? This exact problem does not matter; no real adversary is limited to that, what about nine over 255? But we set the task because it is a well-defined problem that we can try and solve, anything else that we really want is strictly harder, and it has been a problem for a very long time. So let's try and solve this one particular adversarial example problem in the hopes that we'll have learned something that lets us generalize to the larger field as a whole.
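Just to pin down how concrete that threat model is, the whole constraint fits in a couple of lines; eps is the 8/255 bound from the benchmark convention, and the arrays stand in for images scaled to [0, 1].

```python
import numpy as np

# The L-infinity threat model in one check: a perturbed image is "allowed" only if
# no pixel moved by more than eps from the original.

def within_linf_ball(x: np.ndarray, x_adv: np.ndarray, eps: float = 8 / 255) -> bool:
    return bool(np.max(np.abs(x_adv - x)) <= eps + 1e-12)
```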
02:07:39
Speaker
So I don't know why they developed this exact challenge in the way that they did, but if I had designed it, one reason I might have done that is I might have assumed that jailbreaks seem really, really hard.
02:07:52
Speaker
It doesn't seem possible that we're going to solve the problem for all jailbreaks all at once. But the problem of a universal jailbreak, one that I could just share with all my friends, that they could immediately copy and paste and apply to all of their prompts, that allows this easy proliferation of jailbreaks,
02:08:07
Speaker
that is a real concern, and I might want to stop at least this very limited form of threat and not make it be so easy. And I'm willing to say straight up that this is not going to solve the entire problem completely, but it will make some things harder for some people.
02:08:21
Speaker
And I think this is a reason you might do it. But on the other hand, you're also entirely right that even if you could solve this problem, this does not solve the entire problem as a whole, because, yeah, I can just find my jailbreaks one by one. And this is true, but you know, that's a better world to live in than the world we're living in today.
02:08:41
Speaker
You know, this is the world we live in, and most of computer security is: someone has to try very hard to exploit your one particular program. There's not a sequence of magic bytes that makes every program break. If that were the case, we would be much worse off.
02:08:54
Speaker
And so I'm very glad that they're doing this. I'm also very glad that they're doing this as an open source, well, not open source, but open challenge, calling on everyone to be willing to try and test this. They're not just asserting, we have the following defense and it works. They actually, I think, legitimately do want to know whether or not it actually is effective, and they would be happy to have the answer be no and have someone try and break it.
02:09:15
Speaker
If it has been broken recently, I haven't seen that. Did you see? So Jan, like I said, reported that someone, I guess an individual person, broke through all eight levels, but not yet with a universal jailbreak.
02:09:29
Speaker
So one person broke all eight, but not with the same prompt. Great. Okay. Yeah, that's good to know. So I think what this tells you, for at least this defense, is that it can be broken at the very least right now for any individual component. And, you know, if what you were guarding was national secrets, this is probably not good enough.
02:09:49
Speaker
But, you know, in order to prevent this kind of thing: in security, we have the concept of a script kiddie, which is just someone who's not very skilled and just copies scripts that someone else put online, and just uses them to go cause havoc.
02:10:04
Speaker
And they're not able to develop anything new themselves, but they sort of have the tools in front of them and they can run them. And you could be worried about this in language models, where someone just has access to the forum that has the jailbreak that will jailbreak any model, they copy and paste that, and then they go and do whatever harm they want.
02:10:20
Speaker
This would not be very good, and we should protect against this. Partial progress is always helpful even if you haven't solved the problem completely. It's important to remember that this is the problem we're setting out to solve, which is only a subset of the entire problem.
02:10:31
Speaker
We're not going to declare victory once we've succeeded here and we're not going to put all of our effort into solving this one subset of the problem. We're going to try and solve everything at the same time too. But the partial progress is still progress.
02:10:44
Speaker
One thing you said there that I thought was really interesting, and maybe you can just expand on it a little bit, is really wanting to know. I feel like so much of human performance in these domains kind of comes down to: did you really want to know or not?
02:11:01
Speaker
Any reflections on the importance of really wanting to know? Yeah. I mean, okay, so some people ask, why is it so much easier for someone who didn't publish a defense to attack it than for the person who initially built it?
02:11:14
Speaker
And I think like maybe half of the answer is because it's very hard after you've spent six months building something that you really want to work to actually change your frame of mind and be like, now I really want to break it.
02:11:27
Speaker
Because what do you get if you break it? You don't get a paper, because no one accepts a paper that is like, here's an idea I had, by the way, it doesn't work. I feel like people really want to believe in their ideas, and they feel really strongly that their ideas are the right ideas. That's why they worked on the paper. They spent six months on this.
02:11:43
Speaker
Asking that same person to then, okay, now completely change your frame of mind, imagine this is someone else's thing, you really need to want to break it: that's a hard thing to do. And I think part of why it's beneficial to have lots of different people do these security analyses of things is because you want to know the answer. I feel like this is a thing that people in security have had to deal with for a very long time, and this is why companies have red teams and these kinds of things, because they understand that
02:12:14
Speaker
there is a different set of skills for security and there is a different set of incentives, and so it's valuable to split these things up. It might be hard for one person to do this, but an organization can at least put in place the right principles that can make this be possible sometimes, if it's done correctly. And it seems like for the most part, for most companies who are doing this kind of work, they actually do want to know the answer. I've talked with most of the people at most of these places, and for the most part, the security people are trying their best to get these things as good as possible. And they're trying to set these things up in the right way because they actually don't know the answer and they want to figure it out.
02:12:50
Speaker
I certainly have, over time, been very impressed by Anthropic in particular when it comes to repeatedly doing things that seem like they really want to know. And this does seem you know very consistent with that.
02:13:04
Speaker
How do you think about this, if we circle back to maybe the hardest question that I think is going to face the AI community writ large over the next, honestly, I don't think it's going to be a super long time, but timelines may vary.
02:13:20
Speaker
The hardest question being, is there any way to avoid the concentration of power that comes with just a few companies having the weights and it being totally locked down and nobody else being able to do stuff?

Advocacy for Open Source to Prevent Power Concentration

02:13:34
Speaker
versus the problems that seem likely to arise if we open source GPT-5 level models that have all these, you know, different threat models.
02:13:46
Speaker
It seems like we don't have any way right now to square the circle of locking down the open source models. And, given everything we've just walked through in all this detail, it seems hard to imagine that you could even set up some sort of mandatory testing or whatever that would give you enough confidence, you know, that then it would be okay.
02:14:09
Speaker
Because it seems like, yeah, did you really want to know, and who did you have do it? And you've got the sort of bond-rater problem that we've seen in credit ratings, with the agency that's getting paid by the model developer to do the certification. We've got all these problems.
02:14:24
Speaker
Where does that leave us? I mean, it seems like we're just in a really tough spot there. Is there any way out or any recommendation, or is the bottom line just simply that if you open source stuff, you have to be prepared for the fact that the fullness of its capability probably will be exposed?
02:14:39
Speaker
Yeah, I don't know. This is a very hard question. This is a question I think it'd be great for you to ask some of these people who think about the societal implications of this kind of work. I think the thing I want these people to understand is that, at least right now and presumably for the near future, we probably can't lock these things down to the degree that they would want.
02:15:00
Speaker
I'm very worried about concentration of power in general. I feel like open source up until today has only been good for computer security and safety in general. And because of that, it would take a lot to convince me that open source isn't beneficial.
02:15:16
Speaker
I'm not going to say it's impossible that I wouldn't change my mind, you know, if things continue exponentially. And if you did give me the magic box that could break into any government system in the world, that anyone could have if they have the tools, I probably would say that shouldn't be something we distribute to everyone in the world.
02:15:33
Speaker
But, you know, maybe if you have this, now you can defend all your systems, because you run the magic box and you just find all the bugs and you patch them, and now all of a sudden you're perfectly safe. Like, I don't know, it's very hard to reason about these kinds of things. In any world that looks noticeably like the world we're living in today, I think open source is objectively a good thing and is the thing I would be biased towards until I see very compelling evidence to the contrary.
02:16:02
Speaker
I can't tell you what the evidence would be that would make me believe this right now, but I'm willing to accept that this may be something that someone could show me in some number of years. I don't know if your timelines are, you know, two or 20 years.
02:16:14
Speaker
This may be a thing that I'm willing to change my mind on, and I will definitely say that now. But yeah, I think as long as the people who are making these decisions understand what is technically true, I trust them more than I trust myself to arrive at an answer that's good for society, because that's what they're experts on. I'm an expert on telling them whether or not the particular tool is going to work.
02:16:36
Speaker
And would it be fair to say that what is technically true is that the reason open source has been safe to date, and the reason it might not be safe in the future, is really a matter of just the raw power of the models? Like, in the open source context at least, we don't have reliable control measures or anything of that sort to prevent people from doing what they want.
02:17:04
Speaker
What we have right now is just models that just aren't that powerful and so they're not that dangerous even though people can do what they want. But if one thing flips and the other doesn't, we could be in a very different regime.
02:17:15
Speaker
Yeah, this is true. So in the early-to-mid '90s, the U.S. government decided that it was going to try and lock down cryptography. It was going to rule that encryption algorithms were a weapon,
02:17:28
Speaker
and exporting cryptography was a weapons, munitions export. It was very, very severe: you cannot do this. And the reason they were concerned is because now anyone in the world has literally military-grade encryption that cannot be broken by anyone. They can talk to anyone in secret.
02:17:44
Speaker
Governments won't be able to break it. So this is like a weapon, a national secret; you can't export it. This is why early encryption algorithms that were exported in the very early web browsers were 40 bits, because that was the limit: you could export 40-bit cryptography or less, but not higher than 40-bit cryptography.
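To see why 40 bits stopped being a meaningful barrier, a quick feasibility estimate; the guess rate is an illustrative assumption for commodity hardware, not a benchmark.

```python
# Why 40-bit "export grade" keys are no obstacle today (illustrative numbers only).

keyspace = 2 ** 40            # about 1.1e12 possible keys
guesses_per_second = 1e9      # assumed rate for ordinary modern hardware
seconds = keyspace / guesses_per_second
print(f"{keyspace:.1e} keys at {guesses_per_second:.0e}/sec "
      f"is about {seconds / 60:.0f} minutes of brute force")
```

By the same arithmetic, a 128-bit key has a keyspace 2^88 times larger, which is why the argument shifted from whether strong cryptography could be broken to whether it should be distributed at all.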
02:18:03
Speaker
And... There was a whole argument around: is giving every random person who wants it access to a really high level of encryption actually just a weapon that they can use to have secret communication that literally no one in the world can break? I think it was objectively a good thing that we decided that we should give everyone in the world high levels of encryption.
02:18:25
Speaker
Yes, it supports the ability of terrorist cells to communicate in a way that can never be broken. But the benefit that every person in the world can now have a bank account that they can access remotely without having to go in, and you have payments online, and you have dissidents who can now communicate again with perfect security, like...
02:18:45
Speaker
it was objectively worth that trade-off. And the people at the time who were making the decisions were not thinking about all of the positive possible futures, and were thinking only about this one particular bad thing that might happen.
02:18:58
Speaker
I have no idea how this calculus plays out for language models. It really depends on exactly how capable you think they're going to become. I don't know what that's going to be, but I think I'm particularly worried about concentration of power, just because this is something that does not need to assume super advanced levels of capabilities. Imagine that models get better than they are now, but don't become superhuman in any meaningful way. I think concentration of power is a very real risk, and I would really be advocating for open source in that world.
02:19:27
Speaker
Because otherwise, you end up with some small number of people who have the ability to do these things that no one else can do. And this is not a thing that needs to assume some futuristic, malicious model kind of thing. We know some people in the world just like to accumulate power, and so I'm worried about that.
02:19:42
Speaker
And so just distributing this technology is very, very good for that reason. But I think it's very hard to predict what will happen with better models. I think that's why this question is harder than the one for cryptography, because it was very clear at that time what the limit was: it gives everyone perfect encryption. We don't know what the consequences might be, but you're not going to get more perfect encryption than perfect encryption.
02:20:06
Speaker
Whereas with these models we have now, I don't know. Maybe they continue scaling, maybe they don't. Maybe they scale, but it takes 50 years and that gives society time to adapt. Maybe they scale and it takes two years and society doesn't have time to adapt. I don't know where this is going. I think this is why I want these policy people to be thinking about this more than I am.
02:20:24
Speaker
Because I think as long as they're willing to think through the consequences of this, these are the kinds of people who I think can give you better answers than I can. Because, yeah, there are reasons to believe that the answer might be we should be concerned, and there are reasons to believe the answer should be we should obviously distribute it to as many people as possible. And I think if anyone were to say that one of these answers is obviously, objectively right with 100% confidence, they're way overconfident.
02:20:52
Speaker
I tend to be biased based on what has been true in the past. Some people tend to be biased based on what they think might be true in the future. I think these are reasonable positions to take as long as you're willing to accept that the other one might be also correct.
02:21:02
Speaker
And so we'll see how things go. I think the next two years will give us a much better perspective on how all this stuff is going. Hopefully it will become much clearer whether these things are working out. I think if they do, we'll learn very fast that they are. And if they don't, then the limitations probably will start to hit in the next couple of years, and that should give us a lot more clarity. The question is, is that too late? I don't know.
02:21:22
Speaker
Ask a policy person these questions. Yeah, no easy answers. But the one thing I do always feel pretty comfortable and confident asserting is it's not a domain for ideology.
02:21:34
Speaker
It is really important to look the facts square on and try to confront them as they really are. So I definitely appreciate all the hard work that you've put in over the years, lighting that path and showing what the realities in fact are. And I really appreciate all the time that you've spent with me today bringing it all to light. So anything else you want to share before we break? Could be, you know, a call for collaborators or anything.
02:22:01
Speaker
No, I don't have anything in particular. Yeah, I want more people to do good science on these kinds of things. I think it's very easy for people to try and make decisions when they're not informed about what the state of the world actually looks like. And my biggest concern is that we will start having people make decisions who are uninformed in important ways, and as a result, we'll have to regret the decisions that we made. I'd rather that, if all the decisions are made based on the best available facts at the time,
02:22:31
Speaker
that's really the best you can hope for. Maybe you made the wrong decision, but you looked at all the knowledge. The thing I'm worried about is that we will have the knowledge and people will just make decisions independent of what is true in the world, based on some ideology of what they want to be true in the world.
02:22:45
Speaker
And as long as that doesn't happen, maybe we get it wrong, but at least we get our best shot. Well, you keep breaking things and figuring out what's true out there. And maybe we can check in again before too long and do an update on what the state of the world is and try to make sure policymakers are aware.
02:23:02
Speaker
For now, that'll be great. This has been excellent. Nicholas Carlini, prolific security researcher at Google DeepMind. Thank you for being part of the Cognitive Revolution. Thank you for having me.