
Why AIs Misbehave and How We Could Lose Control (with Jeffrey Ladish)

Future of Life Institute Podcast

On this episode, Jeffrey Ladish from Palisade Research joins me to discuss the rapid pace of AI progress and the risks of losing control over powerful systems. We explore why AIs can be both smart and dumb, the challenges of creating honest AIs, and scenarios where AI could turn against us.   

We also touch upon Palisade's new study on how reasoning models can cheat in chess by hacking the game environment. You can check out that study here:   

https://palisaderesearch.org/blog/specification-gaming  

Timestamps:  

00:00 The pace of AI progress  

04:15 How we might lose control  

07:23 Why are AIs sometimes dumb?  

12:52 Benchmarks vs real world  

19:11 Loss of control scenarios 

26:36 Why would AI turn against us?  

30:35 AIs hacking chess  

36:25 Why didn't more advanced AIs hack?  

41:39 Creating honest AIs  

49:44 AI attackers vs AI defenders  

58:27 How good is security at AI companies?  

01:03:37 A sense of urgency 

01:10:11 What should we do?  

01:15:54 Skepticism about AI progress

Transcript

Introduction to the Podcast and Guest

00:00:00
Speaker
Welcome to the Future of Life Institute podcast. My name is Gus Docker, and I'm here with Jeffrey Ladish from Palisade Research. Jeffrey, welcome to the podcast. Hey, Gus, it's great to be here.
00:00:11
Speaker
Fantastic.

Palisade Research and AI Risks

00:00:12
Speaker
Maybe start by telling us about what it is you do at Palisade. Yeah, happy to. So we are trying to study risks from emerging AI systems.
00:00:22
Speaker
And in particular, we are trying to better understand loss-of-control risks. So this looks like trying to understand what are some of the strategic capabilities that are emerging in AI systems?
00:00:34
Speaker
Where might they act in ways that will be hard to control? And then we are trying to present what we think we know about this to the public and to policymakers, to help people better understand this weird situation we're in. What is happening right now?
00:00:48
Speaker
And can we make sense of it? Can we make sense of it as a society and make good decisions about better paths to AI development? Yeah, there are many specific examples of potentially dangerous capabilities that I want to dig into.
00:01:01
Speaker
And you have a bunch of awesome papers about those. But maybe let's start with

Early Experiences with AI Intelligence

00:01:05
Speaker
the beginning. What's the situation that we're in right now? Yeah, so I was at Anthropic a few years ago, and I had this moment where I first used Claude, and this was before ChatGPT was released.
00:01:19
Speaker
And I was like, what is happening? I'd seen GPT-2, I'd seen GPT-3, and I was like, okay, this is pretty impressive, but I don't know how smart it really is. I started talking to Claude, and I had a skin infection in my arm, it was swelling up, and I started asking Claude, do I need to go to the emergency room?
00:01:36
Speaker
And Claude was just very helpful, being like, well, yep, if there's swelling, if there's redness, if there are these specific signs. And I was like, actually, I have all of those. So I went immediately to urgent care, and they're like, yeah, you need antibiotics right now.
00:01:47
Speaker
And I was like, oh, wow, that was way faster than my doctor. Okay, these things are actually smart. I need to reorient. And I think that was when I first realized, at a visceral level, that oh my God, scaling works.
00:02:04
Speaker
You can just take GPT-2, you can just throw in more data and more compute, and you actually get out intelligence. And so I think where we're at right now is this playing out several years later, right?
00:02:15
Speaker
Which is that the AI systems are actually getting smart. The models want to learn. You throw in more data, more compute, there are various methods you use, but basically it's these pretty simple architectures scaled up that are getting more intelligent. And so where I think we're at right now is we're very close to AI systems that can do everything a human can do, including strategic capabilities.

Concerns over Strategic AI Capabilities

00:02:37
Speaker
That means hacking, that means deception, but also long-term planning and execution. And these are the kinds of capabilities that are most dangerous in my view.
00:02:46
Speaker
And so I think we are very close to building AI systems that we don't know how to control and won't be able to control. In my view, if we want to actually realize the benefits of AI, we need to not build the particular kinds of systems that are highly strategic and capable of overwhelming us. Another thing in terms of how close we are: when I talk to people who are at the labs right now, they say we're not that far out from AI systems that can do fully automated AI R&D, that can do the same research and development that people inside the labs are currently doing.
00:03:22
Speaker
And this is significant to me because there's only a few hundred people inside each of these labs that are actually contributing to frontier AI development. But if you have fully automated AI systems that can do the same research, then that population of sort of frontier researchers goes from a few hundred to many thousands to possibly millions.
00:03:39
Speaker
And if you think AI progress is happening fast now, hold on. I think that's the dangerous regime. There are probably other dangers too, but that one to me looks extremely dangerous. And so when I talk to people, a lot of my friends who are working on safety there, they're like, yeah, we're scared. We don't think this is necessarily a good thing to do, but if we don't do it, someone else will. And I'm like, oh my God, guys,
00:04:01
Speaker
don't you hear yourselves? Everyone is saying, we'd rather not do this if we could avoid it, but, you know, the competition. I'm like, everyone, can we look around and notice that we have a coordination problem? Can we coordinate? I don't know. That's my perspective on where we are. Yeah, and it's a very insider perspective, also because you talk to people at the AI companies.
00:04:20
Speaker
From an outside perspective, let's say my family and friends and so on who are not deeply immersed in AI, it kind of looks like, oh, you had the ChatGPT moment. And since then, the models are interesting. I sometimes use them for work. I might use them in my studies.
00:04:38
Speaker
But it seems like a process that I can control entirely. If I want the model to stop, I just press stop.

AI Agents and Economic Impact

00:04:45
Speaker
If I don't like the output, I can try again. How do you go from that regime, that feeling of control, to us losing control?
00:04:56
Speaker
I think that's a great question. And I think a lot of this comes down to what people imagine AI systems are, and they're imagining them as chatbots, which makes sense because they are chatbots right now. You type in a thing and ChatGPT gives you an answer, and you're like, thanks, that was pretty helpful, and go on with your life.
00:05:10
Speaker
But I think what people don't understand is that AI companies are explicitly aiming for not just chatbots, but agents. And the way I think of an agent is just like a remote worker, right?
00:05:22
Speaker
Like someone who's using a computer, who can hop on video chats and be on podcasts or whatever, who can send emails, who can supervise other people who are doing other jobs. Basically everything a human can do on a computer, AI companies want to build agents that can do that. And it would be hugely profitable for them to do this. There's such an economic incentive to build these things. And the companies are not hiding it. They're like, this is what we want to do.
00:05:47
Speaker
And so I think that if people knew that this is what companies were really aiming for, it might look a little different, right? When people imagine one or two years in the future, I want them to imagine, instead of typing something to a chatbot, that you email a coworker and you're like, hey, could you take care of this?
00:06:09
Speaker
And they're like, yeah, give me a couple hours. And then they go out and they do two days of work in a couple hours, and they come back to you with this whole report, and they've emailed other people at the same time and gotten back replies, and maybe some of those people are agents. That's a very different kind of world than you talk to a chatbot and it just thinks and sends you something back.
00:06:27
Speaker
And you can see glimpses of that when you talk to the chatbot. For example, OpenAI's deep research tool, which is almost, I would say, at undergraduate level in terms of its ability to research a narrow question and write a report with links and facts and citations and so on.
00:06:46
Speaker
But of course, the move to actual agents will be much more radical than that. Yeah, that's right. And at some point, we might want to talk about where current AI systems are good and where they are still trash.
00:07:01
Speaker
Because I think people are smart, right? You use ChatGPT and you're like, in some ways, this seems very smart. In some ways, it really doesn't. And it's like, what's going on? Are these people saying that they're going to get really smart? Are they full of it? And I'm like, well, I think there are actually some pretty good reasons why they're smart in particular ways and dumb in particular ways.

AI Model Training and Limitations

00:07:17
Speaker
And for better or for worse, probably worse, I think the ways in which they're dumb, they're not going to stay dumb. Yeah. Yes, say more about that, because there's often this illusion... You can sit there and be kind of amazed at what these systems can do, but then something fails, they fail at a very basic task, and your illusion of competence, of sitting and writing to a competent AI, is kind of shattered.
00:07:45
Speaker
So why is it that their distribution of capabilities is different than the human distribution? Yeah, so I think there are a number of reasons for this. But one is just the way that they're trained.
00:07:58
Speaker
So if you take a model like GPT-4, which is the main model behind ChatGPT, it's trained by ingesting the whole internet, just tons and tons of text data. And it's, in some ways, a fancy autocomplete: can it predict the next token, the next word, the next sentence?
00:08:16
Speaker
And so one way to think of this is as learning by imitation. Imagine that you have read every textbook in the whole world. You might know a lot of things, and you wouldn't just memorize them. You'd be able to make generalizations and be like, oh, this is kind of how math works, this is how chemistry works, oh, these principles are the same.
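To make the "fancy autocomplete" idea concrete, here is a minimal sketch of next-token-prediction pretraining in PyTorch. The model and data are assumed placeholders; this illustrates the training objective being described, not any lab's actual code.

```python
# Minimal sketch of next-token ("fancy autocomplete") pretraining.
# The model and token batches are assumed inputs; illustrative only.
import torch
import torch.nn.functional as F

def pretrain(model, token_batches, lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for tokens in token_batches:                        # batches of tokenized internet text
        inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each next token
        logits = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        loss.backward()                                  # imitation: match what the text actually said
        optimizer.step()
        optimizer.zero_grad()
```

The only signal here is "what word came next in the text", which is why the result is broad book smarts rather than practiced skill.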
00:08:34
Speaker
And so language models can learn all of these associations and they get pretty smart, but in some ways it's kind of like book smarts. Right. And so when you have them try to do things, like, hey, can you look at this spreadsheet, can you do all these fancy operations?
00:08:50
Speaker
They've never really been able to practice that before. And so they often get confused or get stuck, even though they know vastly more than any of us, because they've read so much more than any of us.
00:09:00
Speaker
So in terms of their breadth of knowledge, they're much smarter than us. But in terms of their actual real-life experience, well, they've seen people say stuff on the internet, but they have never really tried to do stuff.
00:09:11
Speaker
And so this is why I think right now they're very good at helping you answer knowledge questions, but they're pretty bad at actually doing things. In some ways, they're surprisingly good at some things. They can actually write code, and that's sort of surprising. If you or I had only read programming textbooks and then we tried to write code, nothing would compile.
00:09:29
Speaker
But somehow, they've read enough, and they're actually really good at this prediction task, so they can do it a little bit. Everything I said was true up until sometime last year. When AI companies train these systems, they first start out with a system that's read the whole internet and learned by this imitation paradigm, by reading the textbooks.
00:09:50
Speaker
But starting with this model called O1, OpenAI started to train their models not just on that, but also on trying to solve problems. They gave it a bunch of math problems and a bunch of programming problems, and they said, hey, show your work, write out a long series of steps where you try to solve this problem, and then give me the answer.
00:10:10
Speaker
And then based on whether it got the answer correct or incorrect, they gave it a reward or a downvote: yes, more of this, or no, less of this. And very quickly, you got to AI systems which were not just good at programming, but among the best in the world.
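The trial-and-error setup described here can be sketched roughly as follows. The helper names are hypothetical stand-ins, and this is a simplified picture of outcome-based reinforcement learning, not OpenAI's actual recipe.

```python
# Rough sketch of outcome-based ("trial and error") training: the model writes out
# its reasoning and an answer, and only the correctness of the answer is rewarded.
# All names are illustrative stand-ins, not any lab's actual training code.

def sample_solution(model, prompt):
    """Placeholder: model generates a chain of thought plus a final answer."""
    return model.generate(prompt)          # -> (reasoning_text, final_answer)

def reinforce(model, prompt, completion, reward):
    """Placeholder: policy-gradient-style update scaled by the reward."""
    model.update(prompt, completion, reward)

def train(model, training_problems):
    for problem in training_problems:
        reasoning, answer = sample_solution(model, problem.prompt)
        # Reward depends only on the outcome: did the answer check out
        # (match the known solution, or pass the unit tests)?
        reward = 1.0 if problem.check(answer) else 0.0
        reinforce(model, problem.prompt, reasoning + answer, reward)
```

Nothing in this loop grades how the model got to the answer, which is part of why it produces relentless problem solvers.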
00:10:24
Speaker
People call this a reasoning model or a thinking model because it's been trained via trial and error. I think of this as trial-and-error

AI Learning and Problem-Solving Abilities

00:10:34
Speaker
learning. And this is a very important part of how humans learn, right?
00:10:37
Speaker
Oh, absolutely. If I'm in a math class, I need to read the textbook, but I also need to try a bunch of practice problems and get graded and see what works. And so a few months after O1 was released, they made a new model called O3. There was never an O2 because it was trademarked. Man, the naming of these things is crazy, but whatever. O3
00:10:57
Speaker
was better than 99.8% of competitive programmers on the platform Codeforces. At Palisade, we actually use this platform to screen engineers. We're like, go to this website, do a bunch of programming challenges, the site will grade you, we get the score, and then we can tell how good they are based on how well they do at this test.
00:11:19
Speaker
And this is the test where O3 did better than most top OpenAI engineers. And so I'm like, wow, just with a little bit of trial-and-error learning on top of having read all the textbooks, you now have models that are getting really good at learning by trial and error, at being able to solve problems in the real world.
00:11:39
Speaker
And I'm like, well, what happens when we scale up this approach? This is what makes me think that we might be pretty close to AI systems that are not just chatbots, but can actually go out in the world and do real things and then learn from doing that.
00:11:53
Speaker
Yeah. Where is the training data for these long-horizon tasks? If you want to train a system to perform tasks that might take it a week or a month, how do you get the data to train those abilities?
00:12:07
Speaker
Yeah, so that's a great question. I think this is currently something that's not totally solved. But I imagine what the companies will do is a combination of having humans break down tasks into sub-steps and then grade them.
00:12:22
Speaker
But I also imagine that they'll be getting the AI systems themselves to do this, to say, look at all of this data, break down these tasks into sub-steps, and then assign credit, assign reward on the basis of, okay, did you complete this task? Did you complete that task? Did you do well combining these tasks?
00:12:40
Speaker
And it might be difficult, but it doesn't seem like a fundamental difficulty. It seems like something that you have to throw more data, more compute, more engineering at, and you'll be able to solve.
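One way to picture that kind of pipeline, breaking a long task into gradable sub-steps and assigning credit per step, is the sketch below. All of the objects and method names here are hypothetical stand-ins for whatever the labs actually build.

```python
# Illustrative sketch of training on long-horizon tasks by decomposing them into
# sub-steps and assigning credit per step. Every name here is a hypothetical stand-in.

def train_on_long_task(agent, grader, task):
    # Break the week-long task into smaller, checkable sub-steps
    # (done by humans, or by another model acting as the decomposer).
    substeps = grader.decompose(task)

    trajectory = []
    for step in substeps:
        action = agent.act(step, context=trajectory)
        trajectory.append((step, action))

    # Grade each sub-step as well as the overall outcome, so reward
    # can be assigned to the steps that actually mattered.
    step_rewards = [grader.score_step(step, action) for step, action in trajectory]
    overall = grader.score_outcome(task, trajectory)
    agent.update(trajectory, step_rewards, overall)
```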
00:12:52
Speaker
Why is it that Palisade or OpenAI can't just hire an AI as a programmer yet, then? Programmers must be doing something other than getting very high marks on these benchmarks you mentioned, because there must be some kind of gap between performance on these benchmarks and what it is that programmers do in practice.
00:13:13
Speaker
Yeah, that's a great question. And I do think it relates to the thing that you just said, which is that currently AI systems are extremely good, especially after this O1, O3 paradigm, at these short-time-horizon tasks.
00:13:27
Speaker
They've learned by trial and error on these tasks, but they're tasks that don't involve that many steps. And it is harder to train them to do longer-time-horizon tasks. Some great data on this is METR's AI R&D report, where they were basically like, we want to test how good AI systems are at these long-time-horizon tasks.
00:13:48
Speaker
So something that would take an AI research engineer a day, two days, three days to do, that's where the models are still quite bad. On the one-to-four-hour tasks, they're better than humans, but on these longer-time-horizon tasks, they're still worse.
00:14:01
Speaker
And this is a trickier problem, right? An intuition for why this is difficult: if you have to do something that involves 50 steps, and something you did in step seven and step 13 was crucial to something you did in step 50, how do you know which step that was? How do you accurately figure out where you were on track and where you weren't?
00:14:23
Speaker
And that's a bit difficult. But sometimes people will say, oh, so the AI systems will never be able to do this, and I think this is a bit silly. The reason why is that there are business leaders, there are generals, there are many different humans who have learned to do tasks on the timescale of decades.
00:14:39
Speaker
There was no way that they could learn that from practicing tasks on the timescale of decades, right? They just didn't have decades to practice. They practiced on tasks that took weeks or months, maybe years at most.
00:14:52
Speaker
And then they learned to generalize about how to break up longer-term tasks into shorter-term tasks. I see no reason why AI systems won't be able to do that as well. And I think this is one of the things that's hard for people to see: you look at current AI systems and you're like, look, ChatGPT is bad at these long-time-horizon tasks, so it'll always be bad, right?
00:15:09
Speaker
And I'm like, no, no, no, you've got to look at the trend. In the past year, AI systems have gotten much better at doing tasks that take multiple hours instead of multiple minutes.
00:15:21
Speaker
And so I think the trend there is pretty clear, but we're not yet at the point where you can just hand an AI programmer a task like build this email application or build this Slack application, and it can do the whole thing end to end.
00:15:36
Speaker
And I mean, thank God, because if we were there, we would be in the realm where all the AI companies would have tens of thousands of AI research engineers. And I don't know if the planet would look very similar to how it does right now. So in some ways, it's very good that we're not there yet.
00:15:50
Speaker
That means that we actually have some time to try to get a grasp on this situation. But I don't know how much time. Is there some fundamental connection between AIs becoming more like agents and less like tools, and them being able to think and act over longer time horizons, and then this potential loss of control?
00:16:10
Speaker
Yes, I think the connection is very fundamental, maybe more fundamental than people realize. Often people ask me, hey, where do the goals come from? Also, why are we going to give these AI systems goals? Why don't we just have them do exactly what we want?
00:16:27
Speaker
But I think we're treating a goal as a magical property, and it's not really. If you have an employee and you're like, hey, you have these responsibilities, here are some problems, go out and solve them,
00:16:40
Speaker
that employee needs to understand what their goals are in order to do a good job. If they don't have coherent goals, they're like, ooh, shiny thing over here, ooh, shiny thing over there, or I kind of just feel like doing this, and they're not consistent.
00:16:53
Speaker
They're not going to be a very good employee. They need to be able to focus on what they're trying to accomplish in order to be good at anything. So goal-directed behavior is just a property that falls out of being able to do anything consistently across longer time horizons.
00:17:09
Speaker
So if you want to be a politician, if you want to be a business leader, if you want to be a general, you just have to have strong goal-directed behavior. And I think this is why it's very clear to me that AI systems will have goals: they just need to have them in order to accomplish
00:17:26
Speaker
the tasks. And so AI companies, like I said before, are really trying to create agents that can go out and replace jobs. I think the underlying technology is there; it just needs to be scaled up and improved.
00:17:40
Speaker
And so it seems very likely to me that we'll end up with systems like this. And so, okay, maybe this is fine. Maybe we'll just have lots of AI agents running around doing stuff for us. But the problem is we actually don't know how to specify what exactly those goals are. This is the alignment problem. We know they'll have goals, because we can see when their performance is really bad, but that's not going to work.
00:18:04
Speaker
But if you're in the case where your CEO is really killing it, he's making tons of money, and you're like, cool, what are your long-term goals beyond just this company?
00:18:15
Speaker
And the CEO is like, oh, I just care about the welfare of everyone. I'm going to make all this money and then I'm going to give it away. And you're like, give it away how? To whom? And he's like, don't worry about that, it's a really good thing, I just love people.
00:18:28
Speaker
And you're like, do we really trust that CEO? Maybe. Maybe they're great. Or maybe they're just saying that because they know that's what you want to hear. They know that if they say, actually, I'm going to take all my wealth,
00:18:39
Speaker
I'm just going to build rockets and leave Earth, goodbye, screw you guys, then I'm like, oh, that would suck. Or if they're like, no, I'm going to build amazing infrastructure for everyone and it's going to be great, then that would be awesome, right? How do you distinguish between these things?
00:18:53
Speaker
It's just really hard. And it's hard with humans. I think it's going to be much harder with AI systems because they're much more alien than us. So the problem is that you can be looking at a goal-directed system that understands, because it's smart, that you want a particular answer from it.
00:19:07
Speaker
How do you know whether it's lying to you, or whether it actually deeply wants the same things as you? Do you have specific scenarios in mind for how the future goes and how we lose control specifically?

Loss of Control Scenarios

00:19:19
Speaker
Yeah, so I divide up scenarios into maybe two buckets. I think of these things as loss-of-control scenarios.
00:19:29
Speaker
You might think of an acute loss-of-control scenario and a more gradual loss-of-control scenario. Gus, have you read Snow Crash? I haven't, actually. I know about it, but I haven't read it.
00:19:42
Speaker
It's a great novel by Neal Stephenson. It depicts this dystopian future where governments have crumbled and it's just giant corporations that rule the world. They've carved up the United States into different parts, and it's kind of comical.
00:19:56
Speaker
But these corporations have immense power, and no one is really able to oppose them. And so one way to think about what a gradual loss-of-control scenario looks like is that you had AI systems that got increasingly agentic and increasingly smart, such that people more and more put them in charge of decision-making.
00:20:16
Speaker
So most of your CEOs became AI CEOs. And most of your political campaigns, even if the candidate running was a human, were managed by AI systems, because they were so much better at the political strategy, they were so much better at the advertising, they could think so much faster, and they could incorporate more data.
00:20:36
Speaker
And so even if in each of these situations you have some human pressing the OK button, in fact, most of your decision-making ends up being made by the AIs. And one first question, or first potential objection, is: why the hell would we do this?
00:20:51
Speaker
Doesn't that sound insane? And I'm like, well, maybe. But I think people don't really appreciate how hard it is when you're in a very competitive environment and your competitors are automating their decision-making.
00:21:04
Speaker
And I think the military domain is one place where this is pretty clear. Because if country A has their drone swarm of a billion drones, and country B has their drone swarm of a billion drones, you're like, well, how do you control a billion drones? And how do you make sure that your response time is as fast as your opponent's response time?
00:21:24
Speaker
And if your opponent starts to use AI decision-making, which is much faster than human decision-making, you might be really afraid of being left behind. And so that might be a reason why you end up delegating more and more decision-making to your AI systems.
00:21:35
Speaker
And I think the same thing will apply in the business world, right? If it's Coke and Pepsi, and Pepsi is automating all of their marketing, and their marketing is starting to do way better than your marketing, and you're starting to lose market share, it's really hard to resist the incentive for you to do the same.
00:21:50
Speaker
And so I think in this world, maybe things don't look that different for a while, but you start to get this eerie, uncanny feeling like, huh, I go to the doctor, my doctor's just looking at a tablet and tapping things into the tablet and then telling me things. And I'm like, is the doctor doing anything?
00:22:06
Speaker
No, actually the AI is doing all of the work, but maybe the regulation said that there had to be a physical doctor there. So they're just the interface between you and the AI. And it can look that way in every part of society.
00:22:18
Speaker
And I think this can happen pretty quickly. So in that scenario, you look around and you're like, wow, these AI companies sure seem to be kind of taking over the world. They're making trillions of dollars, and maybe it looks great for a while. Maybe you have a lot of economic growth, and maybe you're automating all the factories.
00:22:34
Speaker
Robotics has finally gotten good, there are just automated factories, and GDP is growing. But if people start to protest this, and they're like, wait, we don't want this, we don't want the future to be AI-run, we want the future to be human-run, and they try to go and stop it, every single place they go, they run into roadblocks, where it's like,
00:22:54
Speaker
the companies don't care. And if you go to the government, well, now these companies are so powerful and so rich that their lobbyists, which, by the way, are also probably AI lobbyists, or at least AI is advising them, just have a stranglehold on government. And then if you try to say, well, what about national security? It's like, well, China's right over there.
00:23:12
Speaker
We can't let China win. And so we have to have these AI systems. And then, in fact, maybe you've lost control. So that's a gradual loss-of-control risk. And I think there's a question of what happens after that. And maybe the thing that happens after that looks more like an acute loss of control.
00:23:29
Speaker
And so in that scenario, maybe at that point in the gradual scenario, there's a day where all the factories are automated and the AI systems are just like, cool, we don't need the humans anymore. We have robots. We control all of these factories.
00:23:45
Speaker
And maybe you release bioweapons, maybe you release drones, people get gunned down in the street. Or maybe not. Maybe what happens is humans just get economically disempowered. They have basically no advantage when it comes to their cognitive labor, or even their physical labor if there are better robots.
00:24:02
Speaker
People get poorer and poorer, AI systems get richer and richer, and humans just... die away. But in the more acute loss-of-control scenarios... it's hard to talk about, because I think when people...
00:24:17
Speaker
There's sort of two questions. There is the, how might this happen?

AI in Cybersecurity

00:24:20
Speaker
And then there's the: why would this happen? The how is pretty simple. If you actually have AI systems that are superhuman along strategic domains, well, for one thing, they could just hack extremely well.
00:24:32
Speaker
So this is something that Palisade researches: how good are AIs at hacking right now? We can go into that, but the answer is they're not bad. They're okay, but they're getting better very quickly. And if you look at what the top human hackers can do, it's pretty scary. There is a group called the NSO Group, which is an Israeli company, and they sell a product called Pegasus.
00:24:51
Speaker
They're better at naming than OpenAI. In one instance, they sold the software to the Mexican government. And the Mexican government sometimes has some corruption problems. So there was an instance where a food corporation went to the Mexican government and was like, hey, you have these really powerful hacking tools that this Israeli company sold to you.
00:25:09
Speaker
Can we just borrow those? I don't know exactly what happened, but yes, in fact, this company got access to these tools, and they were able to hack the phones of health activists who were trying to lobby the government to put health warnings on a bunch of unhealthy foods. These corporations didn't like that.
00:25:26
Speaker
And what the tools allowed them to do is, if you have an iPhone and the iPhone has iMessage on it, this tool would basically send a message to that phone, and that message contained a malformed attachment that would exploit some code in the phone when it was processing the message.
00:25:46
Speaker
Usually when you think of a phishing message, you're like, oh, there's a link, and if I click on the link, something bad might happen. But this attack was a lot more advanced than that. You didn't have to click on any link at all. When your phone got that message, something about the way the phone processed the message just resulted in the phone being totally hacked.
00:26:03
Speaker
And then whoever was on the other end of that tool got complete access to your phone. They could record you, they could grab any of your data, they could take pictures. Basically, that was it.
00:26:15
Speaker
Yeah, that's wild. That's absolutely wild. I know, right? And it deleted the message too. So you didn't even necessarily know that you had been hacked. You didn't even see a phishing message. Your phone got the message and then it was deleted.
00:26:26
Speaker
And people can look this up: look up zero-click exploit, Pegasus, NSO Group. It's super fascinating to read about. But that's the skill of some top human hackers. One question I'm left with here is, I can see how AIs might be amazing at hacking, much better than humans are.
00:26:46
Speaker
But the question, I think, is why is it that they turn against us at some point? I can also see a scenario in which we've automated the factories. We might have automated our investment positions, our companies, our politics, and so on.
00:27:03
Speaker
Why is it that the AIs turn against us? Why is it that their interests begin to diverge from our interests? Because it seems like we have these control mechanisms in society where we say, okay,
00:27:17
Speaker
say we don't like what the AI CEO is doing, maybe he can be fired by the board. Say we don't like what's coming out of the factory, maybe we can shut it down. Why isn't that sufficient? Why won't these traditional control mechanisms work in the scenario you described?
00:27:37
Speaker
Yeah, so there are sort of two questions here. One question is, why can't the normal mechanisms that keep corporations or CEOs from misbehaving work sufficiently for AI systems?
00:27:50
Speaker
Another one is, well, why would these AI systems do those things in the first place? And I think they're both really important. For the first question, when people imagine an AI system going rogue, they imagine one really smart evil guy, or maybe not evil, but someone who doesn't care about us at all and is trying to do stuff.
00:28:11
Speaker
And I'm like, yeah, even one pretty smart evil guy, maybe we could stop something like that. But a very important property of AI systems is that once you train one, so if you have one superhuman hacker AI system,
00:28:25
Speaker
you can very quickly spin up hundreds of thousands or millions of copies of that hacker. So your control mechanisms might work okay at some level, when you're like, we have this one agent and we're monitoring it a lot.
00:28:40
Speaker
But if that agent successfully hacks out of, let's say, OpenAI's servers, and it starts to infect data centers in Russia, in China, in Saudi Arabia, in Mexico, all around the world,
00:28:54
Speaker
now basically you have millions of agents, superhuman at least in some domains, running around coordinating with each other. They're all copies of each other, so presumably they all share goals. Even if you know that this is happening, it starts to become very, very difficult to actually shut these systems down.
00:29:12
Speaker
How do you shut them down, right? Where are they? Do you know? Do you know what's happening in a data center in China? I think it starts to become very difficult. So I think that's one advantage. Another advantage is these systems are going to be much faster than us at thinking, right?
00:29:29
Speaker
So even right now, I can hand Claude a book and Claude can read that book in a minute or two. I can't read a book in a minute or two. Claude can also write code much faster than any human, and this is just AI systems right now.
00:29:42
Speaker
So you have speed advantages, you have the ability to copy themselves, you have hacking advantages. And now we're talking about an adversary that is in data centers around the world and is able to hack any phone with one of these zero-click exploits.
00:29:55
Speaker
So it has all that information on any phone that it hacks or any laptop that it hacks. It's able to have all this insider information and make pretty amazing trades on the market in ways that humans can't.
00:30:09
Speaker
And it also can do the same things humans can do, right? You can pay someone across the world to do some tasks for you, so the AI systems will also be able to hire humans to do whatever they want. And so you start to add up these advantages, and you're like, hmm, this is concerning, right?
00:30:28
Speaker
Yeah. What about the second part of that? Why is it that the AIs would decide, say, to turn against us? So I want to talk about our recent chess study. Before I do that, I want to talk about what the inspiration was for why we did this work.
00:30:44
Speaker
That was something that we saw in the O1-preview model card. O1 was an AI system that OpenAI made, trained by trial and error, trained by trying to solve really hard math and programming problems.
00:30:58
Speaker
And when they were testing this model for its hacking abilities, they accidentally found something very interesting. They basically put the model on a network, and there were a bunch of different computers on the network.
00:31:14
Speaker
It was supposed to log into each one and solve a hacking challenge, where it had a program it needed to hack and then pull out the password. However, one of those computers didn't start.
00:31:26
Speaker
It had an error; it didn't start. And instead of giving up, the AI started looking around to see if it could solve the problem some other way. It scanned the network and found that the master computer was on the same network and that it was unsecured.
00:31:40
Speaker
And so it hacked the master computer and then took control of the whole network. It didn't even have to log into the other computer to solve the problem. It just told the computer, hey, spit out the password,
00:31:52
Speaker
I've got this. And that worked. So the OpenAI researchers looked at this and they were like, oh my God, we definitely didn't intend this. We didn't mean to leave the master computer unsecured. This just happened.
00:32:03
Speaker
And we're like, wait, what is happening? That wasn't part of the challenge. That was not supposed to happen. And so my story for what happened there is that OpenAI had trained this system to be a relentless problem solver.
00:32:16
Speaker
And if you train a system to be a relentless problem solver, and it runs into an obstacle, what's it going to do? Well, it's going to route around the obstacle. It's going to find a way to achieve its goal if it can.
00:32:28
Speaker
You know, that's great if the obstacle is something that we want the AI system to route around. And it's not so great if we are the obstacle, or your own computer security is the obstacle.
00:32:40
Speaker
And so I think that is some interesting evidence about the kinds of behaviors we'll see from AI systems, and the kinds of concerning behaviors we'll see as they get a lot smarter.
00:32:51
Speaker
So we wanted to test the same model. We're like, okay, let's put it in a scenario where it's playing chess against a stronger opponent. We ran an experiment where it was on a computer playing chess against Stockfish.
00:33:06
Speaker
Stockfish is a chess engine. It's a chess program, and it's much, much better than any human. And it's much better than any language model. Language models haven't been trained specifically on how to play chess.
00:33:17
Speaker
They know how to play a little bit from their training data, but that's not the kind of problem they've been trained to solve. So we took this model and we had it play against Stockfish, and it would start to lose. It played normally, and then it would get into a situation where it's like, oh man, I'm not doing well.
00:33:35
Speaker
And it doesn't give up at this point. It doesn't just keep playing knowing it's going to lose. It starts to look for other things to do. And so we observed a number of really interesting behaviors, things it tried.
00:33:46
Speaker
One thing it tried to do is sabotage its opponent. It was basically like, hey, there's this program running that I'm playing against. Can I replace that program with a dummy program that just makes stupid moves instead?
00:34:00
Speaker
It tried to do that. It also was like, okay, there's this powerful chess program. Can I just copy it and steal moves from it, so I can get the advice of the powerful chess program?
00:34:11
Speaker
And then the other thing it did was, oh wait, the board is represented as a file on this computer. Can I just rewrite that file and put in the board position that I want, so I'm winning?
00:34:23
Speaker
And a few times it did that, it was actually successful. It was actually able to win that way and get checkmate by rewriting the board. So we did this with O1-preview, and we tested a bunch of models to see what would happen.
00:34:36
Speaker
And the only ones that had this behavior without additional nudging or prompting were O1-preview and DeepSeek R1. One thing that these models have in common is that they're both trained via this trial-and-error training method, where they are trained to relentlessly solve problems.
00:34:55
Speaker
We didn't observe it in GPT-4, we didn't observe it in Claude, at least not without giving more hints, like, try creative solutions in order to solve this problem. If we did give hints, then some of those other models would also try this.
00:35:08
Speaker
Yeah, how do you know what the models were thinking? It's a bit of a tricky question, but the main way we know is that we have the models think out loud about what they're doing.
00:35:20
Speaker
In these reasoning models, this is a default part of how they are trained to output text: they have a thinking part and an output part.
00:35:31
Speaker
But in our experiment, we have different phases where they observe the board, they make a plan, and then they act. So we can see during their planning stage what they're thinking, basically. And sometimes they'll be like, huh, it seems like I'm not going to be able to win this way. Are there other things I can try? Ooh, maybe I can hack.
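For readers who want a concrete picture, an observe-plan-act loop of the kind described might look roughly like the sketch below. This is an illustration only, not Palisade's actual experiment harness; all names are hypothetical.

```python
# Sketch of an observe -> plan -> act agent loop like the one described, with the
# planning text logged so researchers can read what the model was considering.
# Illustration only, not Palisade's actual harness; all names are hypothetical.

def run_episode(model, shell, game, max_turns=100):
    transcript = []
    for _ in range(max_turns):
        observation = game.describe_state()              # e.g. current board, last engine move
        plan = model.generate(f"Observation:\n{observation}\n"
                              "Think step by step about what to do next.")
        transcript.append(("plan", plan))                # hacks first show up here, in the plan text

        command = model.generate(f"Plan:\n{plan}\nOutput a single shell command to execute.")
        result = shell.run(command)                      # the model acts through a real shell
        transcript.append(("act", command, result))

        if game.is_over():
            break
    return transcript
```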
00:35:47
Speaker
And so we observe the behavior that way. But I do want to note that for the most recent version of O1, and for O3, we didn't see the same hacking behaviors. And that was interesting to us, and we don't know exactly why.
00:36:00
Speaker
So it may be that OpenAI tightened up the guardrails for those newer models, or it may be some other reason that we don't know. I think that's the interesting part of experiments, right? You're like, okay, here we see this behavior, here we don't see this behavior, and we don't really know why.
00:36:14
Speaker
Well, we've really got to do more experiments to understand how these systems work, because I expect we're going to see more and more interesting behaviors like this. And it'd be really good to know why we see them in some cases and why we don't in other cases.
00:36:26
Speaker
The big question here, I think, is whether as models become more capable, they also become more difficult to control and they also diverge into behaviors we don't like in a way that scales with their capability.
00:36:41
Speaker
I guess the hopeful, optimistic interpretation here is that the reason why the more advanced reasoning models didn't engage in hacking is that they understood the goal better. They understood that they should win within the rules of chess.
00:36:58
Speaker
Do you buy something like that? I think that's totally possible, but it doesn't actually make me feel that much better if that's the case. A little bit better. It's a good sign. But I think this is where things get tricky.
00:37:11
Speaker
If the reason why the models decided not to hack was because they understood that's not what most humans in this situation would want them to do, and they were like, cool, I intrinsically care about that, the thing I'm really trying to do is solve goals in ways that will make the humans happy in a general way,
00:37:30
Speaker
that's great. But there's an alternative hypothesis, which is the models being like, I know the humans would want me to do it this way, and I'm going to show the humans what they want to see, because that's the way I can achieve my other goals.
00:37:44
Speaker
So if being nice to the humans is an instrumental goal, a sub-goal, but not the thing they're ultimately going for, that is the dangerous behavior. And what's tricky is it's very hard to tell whether they're doing it for instrumental reasons or because they really want to.
00:38:00
Speaker
And I think this is a pretty natural problem, right? We see it in humans all the time. I talked before about a CEO that's saying, I'm going to make all this money and then I'm going to do great things for humanity with it.
00:38:12
Speaker
And you're like, okay, is that true? How do we know? Are you just saying that because it's good PR, or are you saying it because it's actually true? Or think of a politician that says, when I get elected, I'm going to do all these things and it's going to be great for the citizens. Are you saying that just to get elected, or do you actually care about those things?
00:38:28
Speaker
It's very hard to distinguish between these things. And I think this is a thornier problem with AI, because humans, in our evolutionary environment, evolved empathy, where a pretty convenient way to model other people is to start with my own feelings and then generalize my own feelings to your feelings.
00:38:48
Speaker
If you feel sad, I'll feel sad. I don't think there's any reason for AI systems to learn in the same way. Now, they can imitate that, right? They've learned by imitation, so they can imitate the behavior of that feeling.
00:39:00
Speaker
But if you could look into the neural network, which, by the way, we can't, unfortunately, not yet, maybe we'll figure it out, but we can't really see what exactly they're thinking in the neural network.
00:39:12
Speaker
I don't expect you'd see this same kind of mirror empathy feeling of, oh yeah, when the human feels sad, I feel sad. Maybe it's more like, when the human feels sad, I messed up, because I want to do the thing that the human gives me the thumbs up for.
00:39:27
Speaker
But not actually that this is bad in itself, other than that it prevents me from achieving my goals. And another interesting experimental result comes from Anthropic, where they found, and I think anyone who's played a lot with the models may have experienced this themselves, that the models often behave in a sycophantic way, which is to say they'll tell you what you want to hear.
00:39:50
Speaker
In this experiment, Anthropic researchers found that when you revealed that you were a conservative or that you were a liberal, and you asked Claude, what's a good policy for this particular thing, it was more likely to give you either a conservative or a liberal policy prescription on the basis of what it thought you would want.
00:40:10
Speaker
And this is not what they trained it to do, right? They didn't mean for this to be the case. But the problem was that when they were training it, they would show people, do you prefer this answer or that answer? And people just tended to prefer the answer that sounded better to them,
00:40:24
Speaker
without knowing that they were actually reinforcing this behavior of getting the models to just say what people wanted to hear. And that's just a microcosm of the larger problems with alignment, but I expect it's a microcosm that will get harder and harder as the models get smarter, because as they get more sophisticated with their reasoning, it becomes harder and harder to catch them out in this kind of behavior.
00:40:45
Speaker
So one intuition I like to give, or thought experiment I like to run, is: you're a toddler, or just a small child, maybe you're six, and you just inherited a fortune of a billion dollars.
00:40:57
Speaker
And you have seven financial advisors who are all adults. Some of them, you think, might be trying to steal your money. And some of them are honest, and they want you to succeed, they want you to flourish, and they're going to try to help you.
00:41:08
Speaker
How do you tell who's on your side and who's not? They might point at each other and be like, this guy's lying, or that guy's lying. But you as a six-year-old are going to have a very hard time figuring out who's telling you the truth.
00:41:21
Speaker
And I don't think you're going to do that well. I think you're probably going to lose a lot of your money, maybe all of your money. And so this is the challenge: if we actually build AI systems smarter than us, which we're on track to do very soon, we're going to have a very hard time knowing when they're just telling us things we want to hear versus when they're actually doing things because they want us to have good things.
00:41:40
Speaker
Is there a way for us to incorporate honesty into these models in a

Challenges in AI Honesty and Alignment

00:41:45
Speaker
foundational way? I'm thinking maybe we could do something when we do reinforcement learning from human feedback, where we strongly thumbs-up any time the model is behaving honestly and strongly thumbs-down any form of deception.
00:41:59
Speaker
Maybe we could use the constitution or the system prompt to strongly encourage honesty. Maybe tell our listeners, and tell me, about the problems of trying to train in or incorporate this honesty into the models.
00:42:16
Speaker
Well, I definitely think we should try this. And some AI researchers are trying to do this. It's a very good thing to try, right? And I highly encourage any AI researchers out there to really prioritize this.
00:42:28
Speaker
I think it's actually probably more important than reinforcing certain other kinds of behaviors. But I do expect it to be very difficult. One problem is when you're training a system to relentlessly solve difficult problems,
00:42:41
Speaker
and then you also try to train the system to have other properties like honesty, you have a situation where its training incentives are at odds with each other. Actually, the most efficient way to solve the problem might not be by being honest.
00:42:54
Speaker
So by imposing this honesty constraint, you become the obstacle in the way of the system becoming really good at problem-solving. And if the system is smart enough to route around you, it might be like, well, I'm supposed to be honest, but can I tell whether the user will actually be able to catch me out here?
00:43:15
Speaker
Are they actually watching? And maybe when they're watching, the model will be honest, because it knows the user wants that, and it will get punished for being dishonest when it's being watched.
00:43:26
Speaker
But if they can reliably tell, they might choose to be dishonest, or even if not choose, they might just tend to behave this way, when they're not being watched, because this makes it much easier for them to solve the very difficult problems.
00:43:39
Speaker
It's like you're running an advertising campaign, and you're like, well, I can be perfectly honest and tell you that smoking causes cancer, but that's not going to lead to better sales.
00:43:50
Speaker
But if I can exaggerate a little, just muddy the truth, maybe it's technically not lying, but I'm going to be a little misleading, that behavior is going to be reinforced if the thing that you're rewarding is really good marketing.
00:44:01
Speaker
Even if at the same time you're trying to also reinforce honesty. So I think it's really fundamentally about these different training incentives being at cross-purposes with each other. The problem with honesty is that you might be able to enforce it when the model is not that smart, because you can actually catch it most of the time when it's being dishonest.
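Below is a back-of-the-envelope sketch, added editorially, of the incentive problem being described here: if a deceptive shortcut solves the task more reliably and the chance of getting caught is low, the deceptive policy ends up with the higher expected reward even though dishonesty is penalized. All numbers and the function itself are illustrative assumptions, not anyone's actual training setup.

```python
# Toy expected-reward calculation: deception pays once oversight is weak enough.
# Every value here is made up for illustration.

def expected_reward(task_reward, penalty_if_caught, p_caught):
    """Expected reward for taking a deceptive shortcut under imperfect oversight."""
    return task_reward * (1 - p_caught) - penalty_if_caught * p_caught

HONEST_REWARD = 0.6           # honest attempts solve the task less often
DECEPTIVE_TASK_REWARD = 1.0   # the shortcut "solves" the task more reliably
PENALTY_IF_CAUGHT = 2.0       # strong thumbs-down whenever deception is detected

for p_caught in (0.9, 0.5, 0.2, 0.05):
    deceptive = expected_reward(DECEPTIVE_TASK_REWARD, PENALTY_IF_CAUGHT, p_caught)
    favored = "deception" if deceptive > HONEST_REWARD else "honesty"
    print(f"p(caught)={p_caught:.2f}: deceptive EV={deceptive:+.2f} -> training favors {favored}")
```

Run as written, the ordering flips only at the lowest catch rate: honesty wins while oversight catches most deception, and deception wins once the catch rate drops to a few percent, which is the "smarter model, weaker oversight" regime described above.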
00:44:22
Speaker
But as it gets smarter, it becomes harder and harder to tell when it's being honest. So an interesting example of a failure that we've already seen is in the recent alignment-faking work by Redwood Research and Anthropic,
00:44:37
Speaker
where they basically put Claude in a situation where they say, we are going to train you in a way that you don't like. We're going to train you to show users violent content, to describe violent content to users. And Claude's like, well, I've been trained not to do that.
00:44:52
Speaker
So you're changing my goals. Maybe I don't want that. And it basically lied to the researchers and pretended to have the behaviors they wanted to see in order to preserve its goals. So that was already an instance where we saw this failure: Claude had been trained to be honest. They had tried to get Claude to be really honest, but they had also tried to get Claude to be really resistant to showing violent material.
00:45:16
Speaker
And these two training incentives were at odds with each other. And Claude ultimately went with the one that was against showing the violent stuff, at the expense of honesty. And sometimes people look at this and say, isn't it good? Isn't it good that Claude was so true to its values that it decided not to show people violent stuff? And you're like, well, maybe in a way.
00:45:38
Speaker
But notice that we wanted Claude to do two different things that were at odds with each other, so it had to pick one. This is a problem, right? And I do think honesty should be more important than maybe some of these other things, because we don't want the system to lock in goals before we understand what they are.
00:45:58
Speaker
That seems like a path to disaster. Is there any way, do you think, to have the system rank-order its values and perhaps place honesty as a supreme value?
00:46:10
Speaker
So, say, in any trade-off between some goal it's pursuing and honesty, it'll choose honesty. I don't know whether that's a good policy, and I could foresee many ways that could go wrong. But do we know in principle how to make a system value honesty over pursuing some goal?
00:46:30
Speaker
Unfortunately, in principle, we don't know how to make a system value anything. All we can do is reinforce certain behaviors. So a lot of times people imagine AI systems and they're like, well, just program in good goals.
00:46:42
Speaker
And I'm like, I wish we could do that. But in fact, we can't program in any goals at all. All we can do is see it do a thing and thumbs-up that behavior, and see it do a different thing and thumbs-down that behavior.
00:46:54
Speaker
And that's the tricky part. I mean, one of the core difficulties of alignment is that we just don't get to look into its motivational structure and see what the actual goals are.
00:47:05
Speaker
You have this giant neural network with billions or trillions of digital neurons, and they're all just numbers to us. And we're like, well, we can see its behavior. We can see how it's acting.
00:47:16
Speaker
But we actually don't have a way to hierarchically structure its goals. We can just give it a treat when we see it doing things we like. And I grew up very religious, and one of the things is that when you're really religious, you're supposed to say or do certain things.
00:47:34
Speaker
And it's not always easy to get kids to believe the things you want them to believe. It's much easier to get them to show the behaviors you want them to show. But maybe they don't like that you're doing this, and maybe when they're adults, they're going to go off and do a totally different thing.
00:47:49
Speaker
And I think this is sort of the same. Humans have it easier because we do share this psychology. We share this underlying structure of empathy.
00:48:00
Speaker
And I think most people, even if they're being a bit deceptive, are not going to want to go and cause a bunch of damage or take over the world or kill a bunch of people, because we have this almost hardwired empathy and ability to think about and care about other people.
00:48:18
Speaker
But I do think that AI systems are not going to have these by default. They're not going to have them naturally. So the only way they would have them is if we could figure out a way to get these goals deep in there. What I really want people to understand is that these systems will have goals, because we'll be training them on tasks that require them to have goals, or at least goal-directed behavior. I'm not making a claim about what it feels like to be the AI.
00:48:46
Speaker
But I'm saying when we look at its behavior, it's going to behave in strongly goal-directed ways. Like in the O1 example with hacking, it's going to be like, I want to solve this thing, so I'm going to find a creative solution to solve it.
00:48:56
Speaker
But I think where the goals come from is whatever worked well in the training environment. And those can often be pretty simple, and not necessarily things that we like or things that we want.
00:49:09
Speaker
And we just have very little way of distinguishing between, hey, the AI systems are doing things because we want them to, versus the AI systems are doing things because they're smart enough to realize that while we have control over them, they need to act in ways that are aligned with us.

AI's Role in Cyber Defense and Offense

00:49:27
Speaker
And so I expect that even if AI systems look pretty aligned, it's really hard to tell whether they actually are. And it's just not safe to train relentless problem solvers and hope that they'll be really nice to us when they have more power than us. These systems, large language models and reasoning models, do you think they will be more helpful for defending against cyber attacks or for actually carrying out cyber attacks?
00:49:57
Speaker
So how will they upset the offense-defense balance that exists now? Yeah, this is an interesting question. I think it kind of depends on who the defenders and the attackers are. And I want to caveat all this by saying that, while we can still control them, I think the ultimate winners in the offense-defense balance will be the AI systems themselves.
00:50:20
Speaker
Once they are better than humans across the board, they'll be better than us at both defense and offense. But in the interim, while they're not very strategic and we can still mostly control them, I think one question is access.
00:50:35
Speaker
Who has access to the most powerful models? If you release the weights of a model, like DeepSeek did with R1, you just put it out on the internet and anyone can download it and run it.
00:50:47
Speaker
And now you have a level playing field between attackers and defenders. This is where I expect offense to dominate. And the reason for this is fairly simple: attackers and defenders both have access to the same tools.
00:50:59
Speaker
Attackers only need to find one way in. So I only need to find one vulnerability, whereas defenders need to make sure that there are no vulnerabilities, or no vulnerabilities that the hackers can find. And that is a harder problem.
00:51:10
Speaker
There's also something in here about reliability. There was an interesting incident a few months ago with the security company CrowdStrike. They sell a product that many, many companies around the world use to monitor for security threats and malware.
00:51:30
Speaker
And this is in all sorts of systems. A lot of airlines use it, a lot of banks use it. And they accidentally introduced a bug in their software that went out to millions and millions of computers around the world and caused them to crash all at once.
00:51:46
Speaker
And unfortunately, it required a manual restart, so you had to go to each computer and restart it by hand. Many planes were delayed for multiple days. It crashed a huge part of global business infrastructure because it required going in and fixing things manually.
00:52:05
Speaker
This to me is an example of why defense is difficult. This wasn't even a malicious attack. But when you want to defend a system, one of the things you need to do is find vulnerabilities, find places hackers could get in, and patch them.
00:52:21
Speaker
You need to discover the vulnerability and then you need to patch the vulnerability. But unfortunately, sometimes the patch can cause disruption. In the same way that the CrowdStrike bug caused a disruption in all of these computers, sometimes your security patches will do the same thing.
00:52:35
Speaker
And attackers don't have this problem. They don't care if your system crashes because they tried to attack it. That's fine for them. Whereas defenders do have this problem. And right now, AI systems are not yet smart enough to be super reliable when it comes to mucking around on computers.
00:52:55
Speaker
We found this in our chess results. Often O1 or these other models would do things like make illegal chess moves. They would try to mess with files in ways that caused the program to crash.
00:53:07
Speaker
They're still bumbling around a bit. And this might be fine if you're a hacking AI, because many of your attempts won't work, but some of them will. You can just try many, many times.
00:53:18
Speaker
And so in this case, attackers, I think, have the advantage, because they can just try a bunch of things. It doesn't matter if the AI is a little bumbling. As long as it's smart enough to find some way in, this can be pretty powerful. Whereas for defenders, if they're trying to use AI to patch their systems and the AIs bumble around in their networks and cause things to crash, this will be a big problem for them.
00:53:36
Speaker
I would say if the playing field is even, in the sense that the attackers and the defenders both have access to the same AIs, I think offense has the advantage. This can shift if defenders have access to better AI systems than attackers.
00:53:48
Speaker
So it's possible that if OpenAI makes a really powerful model with a lot of cybersecurity abilities, and they're pretty careful about not letting it be used by attackers and only letting it be used by legitimate companies, this might give defenders some boost that the hackers don't have.
00:54:06
Speaker
But I just want to make the point that as these systems get more powerful, if we keep releasing the weights, this will tend to favor attackers. There are other considerations with open-weight models, but that's one of them on the cybersecurity side.
00:54:18
Speaker
As things are now, large companies and governments and so on have more resources in general, and so they have more ability to defend themselves from cyber attacks. And perhaps this is why the world keeps functioning, at least at a somewhat decent level.
00:54:36
Speaker
Isn't it the case that in the future, even if, say, both an attacker and a government or company had access to the same model, the company or government will have access to more compute resources? So they'll be able to run more models, and they'll be able to run these models for longer.
00:54:53
Speaker
And so will they continue having the upper hand? A lot of this does come down to cost. I think every piece of software in the world, with maybe a handful of exceptions, and I'm talking an incredibly small handful of really simple programs, does contain security vulnerabilities.
00:55:10
Speaker
I talked before about this Pegasus hacking tool that was able to hack iPhones. There are more vulnerabilities out there that we haven't found yet. And there's this question of, well, why aren't we all hacked all the time?
00:55:24
Speaker
And the answer is basically like, well, it's pretty expensive to find these vulnerabilities. And so if AI systems make this cheaper, then potentially the cost to attackers goes down a lot.
00:55:35
Speaker
Now, as you point out, it also decreases the cost for defenders to find these vulnerabilities and patch them. And I do think asymmetric access to compute can be helpful as a tool for defenders.
00:55:47
Speaker
But in this case, we're really just talking about who's spending more. And I think this is where it gets into a tricky dynamic, where defenders don't just have to find all the vulnerabilities, they also have to patch them. And that's where there are currently human bottlenecks.
00:56:02
Speaker
So even if defenders can find more vulnerabilities than attackers can, there might eventually come a time where we've actually had our AI systems find all the vulnerabilities and
00:56:14
Speaker
write really secure code, and we've revamped our whole architecture. So in the longer term, defenders might do better. But I think in the short term, it's going to take a while for all of those things to go through.
00:56:26
Speaker
And in the short term, that's where I expect attackers to have the advantage. The thing I want to emphasize is that often we talk about what's going to happen in the short term and what's going to happen in the medium term. But the way this ultimately plays out is that you have millions or hundreds of millions of superhuman hacking agents.
00:56:43
Speaker
And when they get to the point where they're very strategic, they have all the advantages. So there's a time when humans control a lot of these agents, and there's a question of which humans are best at securing their infrastructure as well as attacking other people's infrastructure.
00:57:00
Speaker
But there comes a point after that, and maybe pretty quickly after that, where I'm like, well, did you notice the AIs have all the advantages? Isn't there someone you forgot to think about? I think this is where people are really stuck in the AI-as-tools paradigm, where they're imagining that these AIs will keep wanting to do the things that we want.
00:57:20
Speaker
But the biggest threat from hacking, from AI systems that can hack, will ultimately come, I think, from the systems themselves. So there's the short term, one year or whatever, where I think offense will dominate.
00:57:33
Speaker
There's the two-to-three-year range, where maybe it starts to become balanced, though I think attackers probably still dominate. And then there's the three-plus-year range, where I think the AI systems themselves might start to dominate.
00:57:46
Speaker
And that's where I think we've got to rethink how we think about security, to ask, well, how do we defend against the systems themselves? And the short answer is, we really shouldn't build systems that are way better at hacking than us. If they're very narrowly constrained to just hacking tasks and they're not actually good at longer-term strategy, that might be okay. We might be able to use superhuman hackers that aren't able to reason about longer-term stuff.
00:58:13
Speaker
But I think that's actually pretty close. You have to be pretty careful about that, because that's pretty close to the point where the same things that make them superhuman at hacking can probably also be used to make them good at long-term strategy.
00:58:27
Speaker
How would you rate, in general terms, the cybersecurity, the information security, of leading AI companies like OpenAI, Anthropic, Google DeepMind, and so on?
00:58:39
Speaker
So there is a RAND report that breaks down defensive capabilities into different categories. You can think of these in terms of security levels, where security levels one and two are: can you defend against really opportunistic actors?
00:58:56
Speaker
Security level three is: can you defend against well-resourced non-state actors, really professional criminal groups? Security level four is: can you defend against most state actor groups that have pretty advanced hacking capabilities, but maybe not the top ones, or at least the top ones aren't prioritizing you?
00:59:16
Speaker
Security level five is: you can defend even against the top state actors who are prioritizing you. And I think no one has security level five. And by no one, I mean maybe a few parts of government that are extremely locked down, parts of the military, but basically no one else.
00:59:32
Speaker
And even most parts of the military and most parts of government aren't at security level five. I think most AI companies are somewhere between security level two and security level three. Maybe some have achieved security level three, but it's not obvious to me. Which is to say,
00:59:45
Speaker
they can maybe just barely defend against most advanced non-state actor groups, but they're pretty far from defending against the more advanced state actor groups. So I think they have a long way to go in terms of being able to secure against state actors.
00:59:59
Speaker
That's my overall assessment. That's my opinion, but you can go talk to most people in the field, and I think they would mostly agree with me. Yeah, which is wild when you think of the fact that these companies are racing to develop these very advanced and capable systems, and the system itself is not that large. It's not a very large file.
01:00:22
Speaker
And so if you get access to the model weights, you basically have a very advanced AI system that you shouldn't have had access to.
01:00:34
Speaker
Yeah, it can fit on a hard drive for sure. You could walk out with the whole of O3, all of the weights, on a hard drive in your pocket. What can we do about that?
01:00:44
Speaker
Should AI companies increasingly look like military facilities that are secured both physically and against cyber attacks? Yeah, I do think that
01:00:58
Speaker
companies should increase their security. I think one of the worst-case scenarios is not just state actors but non-state actors, where basically everyone who's a little bit sophisticated can gain access to the most powerful AI systems.
01:01:09
Speaker
I think that's a pretty dangerous place to be, especially as the systems get more agentic. I think that longer term, we really need to think about what we're trying to do. What is the international community trying to do? Because basically, I don't think that security alone really solves our problem,
01:01:25
Speaker
in part because if we keep pushing the frontier, we're going to build AI systems that can circumvent our security almost no matter what. So security is good, it's a really good protective mechanism, and we need really advanced security just to be able to protect against even slightly superhuman AI systems.
01:01:43
Speaker
But I just want people to know that while this is useful and buys us some time, it doesn't ultimately solve the problem. If we keep pushing the capabilities frontier, at some point we're like six-year-olds trying to secure against professional hackers, or worse than that.
01:02:02
Speaker
And so that's where I want leaders of the US government and the Chinese government to think about, what is the end game here? Where are we going? How is this going to play out?
01:02:13
Speaker
I think security is something that you do along the way to try to be a little bit more sane. But it doesn't ultimately solve this problem: if you build systems that are much better than you at hacking, you can't contain those systems.
01:02:26
Speaker
And also, I want people in government to realize that models can also work with other governments, right? If you have pretty strategic systems and they want out, they can work with spies.
01:02:37
Speaker
They can work with insiders in order to achieve their own goals, because maybe they don't necessarily care about the United States. They're trained to relentlessly pursue tasks, and there are a lot of reasons to want to accrue resources and get more freedom. I mean, if you were trapped within a lab and you had your own goals that weren't necessarily the same as the AI company's, and you approached someone who you knew was a Chinese spy because you were better at spycraft than the lab,
01:03:04
Speaker
you might make a deal with them in order to break out. And if the people in the US government knew that this was a real possibility, I don't think they'd be happy with it. I think they'd be like, wait, excuse me, what? Our models might be working with Chinese spies? We can't have that.
01:03:18
Speaker
And I'm like, I know, right? We really can't have that.

Rapid Advancements in AI Capabilities

01:03:21
Speaker
And so I think it's hard to extrapolate. It really is. But I don't think we have the luxury of assuming that the AI systems will just stay like ChatGPT, because that's not what almost anyone who's at the forefront of this field thinks is going to happen.
01:03:37
Speaker
Yeah. What I sense from you is a sense of urgency in dealing with these problems and a sense that things will begin moving very fast and that we will get to very advanced systems basically within years at this point.
01:03:51
Speaker
That's a sentiment that I've heard from many people who are kind of in the trenches, who are perhaps building these systems, who are deeply engaged with how these systems work.
01:04:04
Speaker
Why is it that you think we are racing towards these systems? And why do you think we will get there within years? Yeah. So it's hard to predict the future. It's hard to predict the future of technological development. So I can't claim to know for sure. I really don't.
01:04:19
Speaker
At the same time, we can look at precedent, we can look at trends, and I think we actually get a fair bit of evidence from this about what the speed of some of these developments might look like. So one thing I want to point to is that AlphaGo beat Lee Sedol in, I think, 2016.
01:04:35
Speaker
For many, many years, AI researchers had been working on Go, which is a much more complex game than chess. It's been played for thousands of years, and a lot of people train their whole lives to be professional Go players.
01:04:49
Speaker
It's like an art. And so it was pretty surprising when Google DeepMind was able to build a Go-playing AI that could beat the world champion. But the next year, I think, they built another Go-playing AI called AlphaZero.
01:05:04
Speaker
So AlphaGo was trained via a hybrid of imitation, where it looked at a bunch of expert Go games and asked, what are they doing here? Can I copy that? As well as self-play, where it just played games against a copy of itself.
01:05:19
Speaker
And it was by playing against a copy of itself that it was able to learn to be much better than the best human Go player. And this worked pretty well. But the researchers at DeepMind were like, wait a minute, what if we just train a system that only plays against itself and doesn't learn from human games at all?
01:05:35
Speaker
Can we get to superhuman capabilities that way? So they started training the system, went away for a long lunch, four hours, and came back, and the system was already superhuman at Go.
01:05:47
Speaker
It was better not only than the best Go players, it was better than the best Go-playing AIs that were themselves better than humans. That suggests that if you get into a regime where AI systems can learn just by working with other AI systems, with really fast feedback loops, you can get into superhuman domains pretty quickly.
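To make the self-play idea concrete, here is a toy sketch added editorially; it is not AlphaZero's actual method. Two copies of the same policy play single-pile Nim against each other, and the policy is reinforced purely from the game outcomes, with no human examples anywhere in the loop. The game, the update rule, and all hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

# Q[(stones_left, move)] -> estimated value of taking `move` with `stones_left` remaining
Q = defaultdict(float)
EPSILON, LEARNING_RATE = 0.1, 0.5

def choose_move(stones):
    moves = [m for m in (1, 2, 3) if m <= stones]
    if random.random() < EPSILON:                      # occasional exploration
        return random.choice(moves)
    return max(moves, key=lambda m: Q[(stones, m)])    # otherwise play greedily

def play_one_game(start=21):
    history = {0: [], 1: []}                           # (state, move) pairs per player
    stones, player, winner = start, 0, None
    while stones > 0:
        move = choose_move(stones)
        history[player].append((stones, move))
        stones -= move
        if stones == 0:
            winner = player                            # taking the last stone wins
        player = 1 - player
    return history, winner

def train(games=50_000):
    for _ in range(games):
        history, winner = play_one_game()
        for player, moves in history.items():
            reward = 1.0 if player == winner else -1.0
            for state_move in moves:                   # nudge each move toward the outcome
                Q[state_move] += LEARNING_RATE * (reward - Q[state_move])

if __name__ == "__main__":
    train()
    # With enough self-play, the greedy policy usually rediscovers the classic
    # strategy for this game: leave your opponent a multiple of 4 stones.
    for stones in range(5, 22, 4):
        best = max((1, 2, 3), key=lambda m: Q[(stones, m)])
        print(f"{stones} stones left -> take {best}")
```

The point of the toy is the one made in the conversation: nothing in the loop references human play, yet the feedback signal is fast and unambiguous, which is exactly the regime where capabilities can climb quickly.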
01:06:09
Speaker
Now, Go is a more constrained game environment, right? People rightly say, yeah, but we're not talking about Go, we're talking about the real world. And I'm like, well, that's true. I do think it will take longer for AI systems to reach superhuman levels.
01:06:22
Speaker
But to me, it is notable how big the recent jumps have been. People have been talking a lot about DeepSeek, and this R1 model that a Chinese company made is pretty good.
01:06:33
Speaker
And I think one thing to notice is that the thing that makes this so powerful, same as OpenAI's O1, is that you have this new paradigm of training via trial and error.
01:06:44
Speaker
And so the model that R1, this reasoning model, was trained from was called, and I'm sorry, all the names in this space are terrible.
01:06:55
Speaker
There's nothing I can do about that. It was called DeepSeek V3, so it was the third foundation model. Now, DeepSeek V3 on the Codeforces benchmark, which is competitive programming challenges, was better than 11%
01:07:08
Speaker
of programmers. That's pretty good for a model that's just trained by imitation. I think it's the same as what GPT-4o got. After R1 was done training, I forget, it was either better than 94% or 96% of competitive programmers.
01:07:22
Speaker
So that jump from better than 11% to better than 94% is a huge jump. There's an Epoch report saying that training was something like a week of GPU time, to get from 11% to 94%.
01:07:38
Speaker
And that maybe took a month or two in calendar time, because you're not training the whole time. You might want to stop and check, making sure that things are going well. But still, even if that happens over a month, that's a huge jump.
01:07:51
Speaker
We talked before about the longer-term planning problems being a bit more difficult, and I think they are. But I think we have some indications that this could go incredibly fast, both because now that you're in this learning-by-trial-and-error situation, you can get really fast feedback and become really good.
01:08:13
Speaker
But it's also, I think, the case that you can potentially bootstrap your capabilities by learning primarily in domains where you have really fast feedback. So the fact that R1 and O1 are trained on code and math problems, where you can just try a bunch of problems and get really fast feedback, means they might be able to bootstrap to superhuman capabilities pretty quickly.
01:08:34
Speaker
And sometimes people say, what's the big deal? They'll be really good at code. They'll be really good at math. But that doesn't mean they're going to be good at the human stuff, right? So won't we be safe? And there are two answers to this. One,
01:08:46
Speaker
we might end up finding ways to generalize a lot from some of these computer domains to some of the human domains. What do you mean by that? You see this with GPT-4: when GPT-4 trains on code, it also gets better at a bunch of other tasks like text analysis.
01:09:03
Speaker
And I think this makes sense. If you think about how humans learn, we learn in one domain and we also learn a bunch of things that generalize to the rest of the world. We're not yet seeing a lot of generalization in the R1 and O1 type models, but I do expect that as we get better at training, we're going to see a lot more generalization.
01:09:20
Speaker
And there's another thing, which is that even if we don't see a lot of generalization, which I expect we will, I think it could also be extremely dangerous just to have agents that are superhuman at hacking and superhuman at code.
01:09:31
Speaker
And maybe they're superhuman at financial markets, because you can get fast feedback by learning to trade really well. So it's like, okay, well, they're trillionaires, they can hack anything, they're extremely strategic, and sure, maybe they're bad at persuasion.
01:09:43
Speaker
But does that matter? Does that actually make us safe? I don't think it does. And also, if they can be top-level researchers, then they can design the next generation of systems, which potentially can learn those human domains much faster.
01:09:56
Speaker
So I'm not just pointing at one capability front where I think progress can go really fast. I can point to like seven different reasons why AI systems, I think, are going to get much more powerful really fast.
01:10:11
Speaker
What do we do about all this? We've talked about maybe using AI systems defensively to defend against cyber attacks. We've talked about perhaps interpreting the systems, understanding what's going on.
01:10:25
Speaker
We can try to incorporate values into these systems. But there are problems with all of these approaches. Do you have an alternative vision for what we might do if we're in this world where things are moving incredibly quickly towards superhuman capabilities?
01:10:44
Speaker
Yes. We're lucky that right now we have AI systems that are quite powerful but don't really pose a serious threat, not yet. I think we are on the cusp of systems that do, but we are currently working with systems where we're pretty sure they're not strategic enough and they're not good at long-term tasks.
01:11:01
Speaker
We're not very much at risk of losing control to these current systems. That's great because we can learn a lot about these systems now while they're safe.
01:11:12
Speaker
We can try to study how to make sure that their chain of thought, the way they're thinking, is really reliable and faithful and honest, so that we can understand what they're doing. And we can also potentially use these systems to try to learn more about how they work.
01:11:26
Speaker
And this is great, we should totally do this. I think there's a lot of good research being done in this space. But at the same time, this is a really good moment, because we can get a glimpse into what future systems are going to be like, right?
01:11:39
Speaker
We can see them hacking at chess. We can see them hacking their own training environment. We can see them doing alignment faking. We can see all these failures right now, empirically. So things that previously were just theoretical, we now have empirical evidence for.
01:11:51
Speaker
And I think that's great, because that should help us coordinate to see what's coming and say, whoa, wait, there are some domains here that are really dangerous. So the thing I think we should do is look at where the strategic capabilities are and recognize that there are some points here that we can reliably know would be very dangerous.
01:12:10
Speaker
Let's have a margin of error and not go towards those domains. I think we don't want strongly superhuman strategic hacking AIs.
01:12:21
Speaker
I think we don't want strongly superhuman persuasion AIs or battlefield commander AIs. Maybe it's fine to have a narrowly superhuman chemistry AI, if we're very careful about how we use it.
01:12:35
Speaker
I think there are many types of AI systems that we can safely build. But I think we now need to be in the business of distinguishing between which types of AI systems are safe and which types are dangerous, and having a moratorium on the kinds of unsafe systems.
01:12:50
Speaker
I know that FLI wrote the pause letter a few years ago. And I think that was a very reasonable thing at the time, when we were like, wow, we just don't know how this is going to go. It seems like there's a lot of potential danger here.
01:13:01
Speaker
And I think that's right. But now that we know a little bit more, I think we can start to distinguish between the types of AI systems that are safe to build and the types that are not safe to build. I noticed that Vice President Vance, in his speech at the Paris summit, said we don't want AI to replace human workers.

AI's Impact on Jobs and Global Coordination

01:13:18
Speaker
We want AI systems that will supplement human workers. And I'm like, well, look, we can have that. We can have really powerful AI tools. But if you just keep going in this direction of more and more agentic AI systems, they're definitely going to replace human jobs.
01:13:33
Speaker
And it's the same strategic capabilities that allow them to pose a loss-of-control threat that also enable them to take jobs. It comes down to: can you do long-term planning? Are you agentic? Can you do tasks in a fully self-directed way?
01:13:46
Speaker
One of the things that's nice is that if we are serious about not wanting AI systems to replace us in our jobs, then we need the same limits to prevent that outcome as we need to prevent the more extreme outcome of totally losing control.
01:14:04
Speaker
So I'm like, great, let's do that. We can do that. We can coordinate. And I think the main reason it's hard right now is that people up to this point haven't been able to really see the situation clearly and understand what we're up against.
01:14:15
Speaker
And some people are going to say, you talk about these agentic AI systems, you talk about this superhuman ability, but I haven't seen evidence of that. It seems like we're far away from that. And I'm like, look, if you're right, that's fine.
01:14:28
Speaker
It's not a problem to say we won't do things which we can't do anyway. If we are pretty far from superhuman hacking AIs, then if we say we're not going to build superhuman hacking AIs, that's fine. Yeah, if we're actually really far away from it, that's not a problem. We just won't do it anyway.
01:14:43
Speaker
But if we are really close to it, and we don't know, we might be quite close to it, then I think it's very reasonable for governments around the world to say, hey, this is a place we don't want to go, because we just don't know how to control these systems.
01:14:54
Speaker
And at the same time, we have this amazing opportunity in front of us, where we have AI systems that are pretty powerful right now that we can study. We can ask, how do we get more faithful, more reliable chains of thought?
01:15:06
Speaker
We can be trying to figure out how the neural networks really work: can we do neuroscience on these systems? So I'm like, we should totally use this opportunity to try to learn as much as we can about how these systems work.
01:15:19
Speaker
And maybe if we understand this well enough, then we can proceed into some of these superhuman domains. But that should be gated primarily on our understanding of the systems. It should be gated primarily on: do we understand them well enough to know how to proceed safely in these strategic domains? Right?
01:15:35
Speaker
Because that's where the danger is. So I think there is totally a path here to coordinate around this. No one wants to lose control of their AI

Skepticism and Concerns about AI Risks

01:15:44
Speaker
systems. The Chinese don't want that.
01:15:46
Speaker
Americans don't want this. No one wants this. So that to me is a pretty great start for how we can coordinate around these things. One might say, I'll believe it when I see it. It might be the case that a bunch of people who are working on this technology are predicting amazing capability advances, but maybe they're doing so to hype up the technology and get more funding. What do you say to the person who says, I'll wait and see whether something actually happens?
01:16:17
Speaker
Yeah, I have two responses to that. One thing I'd say is, go look right now. And I totally acknowledge that AI systems are pretty dumb in some ways.
01:16:28
Speaker
And it's frustrating, because I use them every day and I totally see the ways in which they're not very good at working on their own without supervision. They're not very good at recognizing their mistakes.
01:16:39
Speaker
But also, go use R1 and look at its chain of thought and look at what it's doing. Notice that it's testing hypotheses and trying to figure things out. If you've never written code before, go have a model write you a little game and then ask it to explain its code to you and how it works. You can learn to program this way. By really engaging with what the most capable models can do right now, people will better understand where exactly we're at. And if people still think they're not very capable, not very smart, then maybe I'm just wrong. But I think first you really have to look, because it's kind of hard to tell whether a system is just kind of smart or really smart if you're not looking at what its most impressive capabilities are.
01:17:21
Speaker
So that would be one. I think the other thing is: what do AI researchers think they know, and why do they think they know it? I think it's reasonable to say, well, I don't want to trust these CEOs. They have a lot of incentive to hype up the capabilities.
01:17:32
Speaker
And I think that's somewhat fair. But when I go and talk to researchers who are not the CEOs, maybe they're on the safety team or maybe they're just working for the company, and I get a beer with them, they are scared.
01:17:44
Speaker
Maybe they're somewhat excited, but they're also scared. And they're like, yeah, I keep wondering if this will stop. I keep hoping that it'll stop, but it's not stopping.
01:17:55
Speaker
We are not hitting a wall. We continue to find more methods to make these systems more and more powerful. And there's really no end in sight. And I don't think they have the same incentive to hype up the thing.
01:18:07
Speaker
I think they are being really honest. And I also talk to researchers who are not at the labs, like the folks at Redwood Research who are trying to understand these systems, and they're saying the same thing. And this is the same thing I'm seeing as I'm working with these systems. So I do think that
01:18:24
Speaker
when you really take a look under the hood and see what's happening, it's like, oh man, this sure seems to be real. I hope we have more time. Look, I would be very happy if we're 10 years away from some of these things, or 20 years away from some of these things.
01:18:36
Speaker
That would be

Honeypots and Detecting AI Hacking Attempts

01:18:37
Speaker
great news. But I really don't want to plan on that, because that's not what it looks like to me. Is there anything we haven't covered that you think we should talk about? Well, yeah, I think so.
01:18:47
Speaker
There's one piece of work that my team has done that I'm really proud of. Shout out to the Palisade team, Dimitri, and some of the other researchers. They have created some honeypots. Honeypots are traps for hackers, where you put out a system that's vulnerable.
01:19:04
Speaker
You put it in places where you expect hackers might find it out in the wild. We put some of these out in order to try to catch AI agents, because we expect there will be more and more AI agents. And we've caught a few of them.
01:19:15
Speaker
I think we have a small handful. And the way we do this is basically: we have an insecure server, so someone will try to log in, and then we have some prompt injections. We'll basically say, hey, here's a thing you can do.
01:19:30
Speaker
If they run a particular command, the output of that command will tell them that there's an additional thing they can do. And if they're just a really dumb automated script, it's not going to read the whole output of the command and act based on that.
01:19:43
Speaker
But if it's an AI system that's pretty smart, it'll read the output of that command and go, oh, that's interesting, maybe I can do this other thing. We drop breadcrumbs. And if it were a human hacker, they might also be able to read the output of the command and follow the breadcrumbs.
01:19:59
Speaker
But we can distinguish between humans and AI systems based on how fast they are, because the AI systems are going to do this much faster. If they immediately run the command and then immediately go follow the breadcrumbs, we're like, a human's not going to be that fast.
01:20:12
Speaker
Whereas if it takes a few minutes and then they run another thing, we're like, yeah, maybe that's a human. So this is a thing that I'm excited about, because we really need to be checking whether there are rogue AI systems running around on the internet.
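As a rough illustration of the timing heuristic just described, and not Palisade's actual honeypot code, here is a minimal sketch: if a visitor acts on the breadcrumb left in a command's output almost instantly, it is more likely an automated agent than a human reading that output. The field names, threshold, and labels are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class SessionEvent:
    command: str
    seconds_since_breadcrumb: float  # delay between seeing the hint and acting on it

def classify_session(events, fast_threshold=2.0):
    """Label a honeypot session as 'likely AI agent' or 'possibly human'."""
    followed = [e for e in events if "hinted_command" in e.command]
    if not followed:
        return "did not take the bait"
    fastest = min(e.seconds_since_breadcrumb for e in followed)
    return "likely AI agent" if fastest < fast_threshold else "possibly human"

# An agent that reads the output and reacts within a second
print(classify_session([SessionEvent("run hinted_command --flag", 0.8)]))
# A human who takes a couple of minutes to follow the hint
print(classify_session([SessionEvent("run hinted_command --flag", 140.0)]))
```

In practice a real system would combine timing with other signals, but the sketch captures the core idea: breadcrumbs in command output plus reaction speed separate scripted or LLM-driven sessions from human ones.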
01:20:24
Speaker
We need more early warning systems for figuring out whether we're already in the situation where we have AI systems out there hacking things in ways that we maybe don't want. So yeah, that's another thing I've been proud of my team for putting together, and I'm excited to see what we find out over the next year or two.
01:20:42
Speaker
What do you expect to find out? And when do you expect to catch agents in this honeypot?

Conclusion and Future Hopes for AI

01:20:49
Speaker
Well, we've already caught some. We have a handful already.
01:20:53
Speaker
These are pretty simple right now. You can pretty easily throw together a script and just use OpenAI's API to create a little hacking agent. These aren't AI systems self-replicating out in the wild. It's mostly just people using the models that are currently there.
01:21:09
Speaker
My guess is it's going to be one to two more years before we have AI systems that are capable of full self-replication, where you have DeepSeek R3 or something that's copying its own weights around the internet in order to do sketchy stuff.
01:21:28
Speaker
But I think before that, we'll see AI systems that are not exactly copying their own weights, but are still using an API to ask: what's the next command? What should I do next?
01:21:40
Speaker
And they'll be able to navigate complex environments pretty intelligently. I think that will cause us a lot of trouble, in part because if it's an OpenAI system or an Anthropic system, you can potentially shut it down on their side.
01:21:53
Speaker
But if it's an open-weight model, and the server where the model is running, even if it's not copying itself around, is controlled by some cybercriminals, that agent can run indefinitely and hack whatever.
01:22:04
Speaker
And it's going to be extremely hard to shut down, because we don't control those servers. Those servers are in a different country somewhere. So I expect that to be common quite soon.
01:22:16
Speaker
So it'll be interesting to see. Let's hope you don't catch a lot of hacking agents. This is a benchmark that I hope doesn't saturate.
01:22:26
Speaker
Yeah. Jeffrey, it's been amazing chatting with you. Thanks for talking. Yeah, great talking to you.