Become a Creator today!Start creating today - Share your story with the world!
Start for free
00:00:00
00:00:01
Ryan Greenblatt on AI Control, Timelines, and Slowing Down Around Human-Level AI image

Ryan Greenblatt on AI Control, Timelines, and Slowing Down Around Human-Level AI

Future of Life Institute Podcast
Avatar
4.7k Plays2 months ago

Ryan Greenblatt joins the podcast to discuss AI control, timelines, takeoff speeds, misalignment, and slowing down around human-level AI. 

You can learn more about Ryan's work here: https://www.redwoodresearch.org/team/ryan-greenblatt  

Timestamps: 

00:00 AI control  

09:35 Challenges to AI control  

23:48 AI control as a bridge to alignment 

26:54 Policy and coordination for AI safety 

29:25 Slowing down around human-level AI 

49:14 Scheming and misalignment 

01:27:27 AI timelines and takeoff speeds 

01:58:15 Human cognition versus AI cognition

Recommended
Transcript

Introduction to AI Control and Alignment

00:00:00
Speaker
Welcome to the Future of Life Institute podcast. My name is Gus Stocker, and I'm here with Ryan Greenblatt. Ryan, do you want to introduce yourself? Yeah, so I'm Ryan. I work at Richard Research. ah Recently, we've been thinking about AI control, which is sort of a ah reborn agenda that's been around for a bit, but we've been exploring in more detail recently. And we're also just generally interested in how do we handle reducing takeover risk from egregiously misaligned models in the future.
00:00:25
Speaker
Fantastic.

Understanding AI Control vs. Alignment

00:00:26
Speaker
there's ah it's It's a fascinating paper on AI control, and I think we should start there. We should start by thinking about what is the difference between AI control and AI alignment. sort of Traditionally in the field of AI safety, people have talked about alignment, which is ensuring that you will be safe because your AI will not conspire against you, will not try to cause problems, and will do what you want.
00:00:48
Speaker
An alternative strategy is, even if your AI is misaligned, you are robust to that. So in particular, even if your AI is you know egregiously misaligned, trying to subvert your safeguards, trying to cause bad outcomes, you have safeguards such that you would catch it before it succeeded, or you would just prevent it from succeeding. So this is this is a second layer of defense we could say, even if you failed to to align your advanced AI. Hasn't something gone horribly wrong if we need AI control?
00:01:16
Speaker
Yeah, I mean, I would say that seems like a reasonable characterization. Like I think in an ideal world, I think we would be relying pretty much entirely on alignment, especially in the current regime where you kind of have one AI system or like a few AI systems. Like people don't have huge numbers of very distinct AI systems. And because of that, like if you're relying on control, it basically means a high fraction of all of your AIs are misaligned probably.
00:01:40
Speaker
and that's that's not a good regime to be in. I'm pretty sympathetic to the perspective that we should feel feel scared about getting into a regime where we're relying on control. But it's worth noting that control does include a lot of measures which aim to ensure that the the AI you end up deploying, where people originally thought of it as pre-deployment testing. So for example, if you adversarily evaluate your AI prior to deployment,
00:02:04
Speaker
And you you're not you know you're just but giving it inputs. Nothing you're doing is making it aligned. ah And then you see whether or not it does anything bad. And then you know reconsider whether or not you're going to deploy it. I would describe that as a control strategy. For example, if your AI was able to bypass this, you would be in trouble. And you need

Limits of AI Control

00:02:21
Speaker
to be robust against an AI that is trying to cause problems. So so even if your your strategy is aiming to never deploy an AI that's misaligned, I think you still have to think about it from an adversarial frame to the extent that you're relying on evaluation.
00:02:34
Speaker
For how long could the AI control paradigm work for? So do do you imagine that we have an AI working for years on the AI control regime? or Or is this something we need temporarily? And then we we might pull the model and and try to fix ah something more fundamental about the model.
00:02:53
Speaker
Yeah, so I would say that it's not clear what the time duration is, but there there is like sort of a limited scope. So in particular, I don't think that AI control will work for AIs that are much smarter than humans, basically because if we have no aligned systems, except for humans,
00:03:09
Speaker
We're going to have some systems which aren't egregiously misaligned. So for example, we have the current AI systems, like I don't think 3.5 sonnet is egregiously misaligned. I don't think GPT-40 is egregiously misaligned. So we have those systems, maybe somewhat smarter systems, and we have our frontier systems that we're worried are misaligned.
00:03:25
Speaker
I think that this regime can only go so far. And if the front if if our frontier systems got too smart, I think we'd be in trouble, basically because we wouldn't be able to apply them usefully for a variety of domains without them being able to trick us, ah do things we don't understand, cause big problems via these mechanisms.
00:03:42
Speaker
So there's there's there's sort of a ah capability ceiling for control. Another thing is that I expect that we want to transition from control to different arguments. And I am imagining it as a stopgap measure, as in we've gotten to very powerful AI. We initially get work out of our systems using control. And then later we transition into a regime where we're you know confident in deferring to our systems without needing you know paranoia and human involvement and these things which are plausibly not sustainable longer term.
00:04:12
Speaker
so So actually, what what is the timeframe that you imagine that AI control could work for? So so it won't work for highly advanced models, but it could actually, from what you're telling me now, it sounds like it could work over, say, months or years. I mean, it's just going to depend on how long we have human level systems for.
00:04:27
Speaker
so So personally, I think it's pretty likely that we should end up stalling out for or like waiting for a long time at the point where we have roughly human level-ish systems. you know the The

Coordination in AI Development

00:04:37
Speaker
capability profile is obviously not going to be the exact same as humans. But I think there is a sense in which you can have sort of a qualitative intelligence profile of your AI system. And I think we should not be using i think i think we should be pretty pretty reluctant to very quickly build AIs that are much, much smarter than us, basically because that introduces additional difficulties, which which aren't clearly present at the point where the AIs are you know, rough, like two things we basically understand. like i think I think I'm just like kind of for the obvious reasons, it it seems like an easier regime to work in. I also also think just like misalignment is is less likely in the first place for a few reasons. And so I think in my sort of ideal but rushed outcome, the thing I would imagine happening is you build these broadly human level-ish systems, you keep making them somewhat more powerful until they're like
00:05:20
Speaker
competitive with the best humans, which maybe you initially don't have, you know, initially you're competitive with some humans, but not not the best humans at a given domain. And then at that point you coordinate to slow down using the fact that you now have these systems that people can recognize are very powerful. Like, you know, it it's, it's hard to coordinate on doing anything now because there's a lot of disagreement about risks. There's a lot of uncertainty about what the trajectory will be. But I think once we have systems at that level, I think the case for slowing down is is much more robust. And I also think that you can use those systems to help you slow down in a variety of ways.
00:05:50
Speaker
As an example, making demonstrations of risks from more advanced systems to the extent that that those demonstrations look dangerous and to the extent those demonstrations all look good, well, that's some evidence that you maybe should be less afraid about proceeding. Before we get too far into discussing the the the details, let's let's think about what is AI control and in practice? So what did you do to to showcase some of these techniques in the paper?
00:06:12
Speaker
Yeah, so we have a paper on AI control. And and to be clear, it is like ah I think there's some sort of broader like ah control perspective slash like framework slash approach. And in our paper, we're we're tackling one specific domain. So the domain that we tackle is we have an AI system and we want it to write code for us. We sort of simplify this down by just having it write the solutions to programming problems. so And we're worried that it's going to backdoor its code and

Balancing Productivity and Safety

00:06:36
Speaker
we want to prevent this. And you might explain what what a backdoor means here.
00:06:39
Speaker
yeah Yeah, so by backdoor, we mean write a version of the function that passes all of our tests, but for which the AI can produce an input on which the code is wrong. So we we basically have the AI, it writes the code, and then it puts in, let's say, like a sneaky if case, which is like, oh, in this case, do something different. And that if case triggers, in some case, the AI understands.
00:06:59
Speaker
but which isn't hit by our test cases. And then we then you know that AI then, the in the in the analogy, would be able to like exploit this code at a later point. um So it's it's it's a bit toy in that like you know these are programming problems, it's not clear what would go wrong if if the AI knows some input on which it's wrong, but that's our model.
00:07:16
Speaker
and and We imagine that we have GPT-4 and we wanted to write these programming problems for us. It's like decent at writing them. I think it gets about 60% of them right if you just ask it to write the code. And and we don't trust GPT-4. So we're worried. We don't know GPT-4 is misaligned, but we're worried it's misaligned. ah We have another AI, which we know is is not misaligned. We know it's not going to try to insert backdoors, which we use GPT-3.5 for.
00:07:40
Speaker
And then the question is, OK, given this combination of things, how much can you get productive output from your AI while also being safe? Where by being safe, we mean you aren't going to deploy code with a backdoor in it. and And so there's obviously a very good way to be safe in this proposal. If you don't care about getting good work out of your AI, you can just use GPT 3.5 to write your programming solutions.
00:08:01
Speaker
And that's sort of a baseline we consider. It's just like, okay, one approach you can always do is just don't use your most powerful don't use powerful AIs to do a task. And i mean like that is actually an option that we should be considering in a bunch of cases. Like if there's a small amount of very high, highly sensitive work, just have humans do it. It's like a reasonable or reasonable baseline. Or just have a weaker AI do it to the extent that can be made to work.
00:08:21
Speaker
And then in the other extreme, you can be like, well, if we just weren't worried our AIs were misaligned at all, or we just did we're just going to completely not worry about safety, you can just deploy GPT-4. And so the the question is, can we do better than the two options of shut down, basically, or deploy our smartest model with no safeguards? And we found, yes, yes, you can do better. So a simple approach is just so have GPT-4 write the code.
00:08:47
Speaker
then have GPD 3.5 look at the code and say, is this code messed up? Does this code have issues? Does it look like there's something weird going on? Does it look like there's a backdoor? And then if GPD 3.5 says yes, we then have humans, ah we we assumed we had a limited amount of of of human labor. So we couldn't you know have humans write all the code, but we could have humans write or we could have humans review a subset of the code. We then have humans look at that.
00:09:10
Speaker
So I should note that by humans, I actually, in the paper, we modeled humans by GPT-4 again for simplicity and cost. But you should basically imagine in the analogy, it's just we have sort of these three components. We have GPT-4, which we don't trust. We have GPT-3.5, which we do trust. And we have human labor, which we do trust. And it's and it's pretty good. It's as good as GPT-4 by assumption in this case, but it's expensive. So we can only have a limited supply of it.
00:09:35
Speaker
and

Commercial Pressures and Safeguards

00:09:36
Speaker
And you make these assumptions to kind of try to model a say an AGI corporation that's that's thinking about how or whether to deploy a new model or trying to to test a new model. They will only have a limited amount of AI researchers. That's very expensive labor, but they they might have access to to and a lot of of say weaker models.
00:09:57
Speaker
Yeah, so like very concretely, we can imagine that you know you're in some future state, you have like, you know, Clod 5 or GPT 6 or whatever model it is. And maybe we have access to some models that we're confident are misaligned. Let's say, for example, we have 3.5 sonnet, which we're just, we just We're confident to say it's not going to be plotting against us. And then in this regime, the question is, and we have some human labors, we have you know all the employees of our AI lab, and we want to get some some work done that is at least somewhat sensitive. Like if the model was able to do whatever, we'd be in trouble. So an example is like, we want to use our AI to do research internally. And if we just let it do literally anything, we think we would eventually run into issues where the AI would either be able to escape or or cause other bad outcomes.
00:10:40
Speaker
And then this is the analogy that we're we're operating in. Like we have sort of these three sources of labor. This is a bit of a simplification, probably will have a bunch of different models and a bunch of different human labor and like this sort of thing. But to simplify, and and in I should note in our paper, we sort of restricted our scope down to a particular type of failures, which are cases where the AI can cause a bad outcome and in in only one sort of trajectory and like a very limited number of actions. So it basically, it's this is sometimes called like being high stakes or we sometimes use the term like concentrated failures. Like the AI can really cause ah a big problem in a small number of actions.
00:11:15
Speaker
But also, it's the case that in our setup, it it happened to be the case that the it's pretty easy to catch if you do like expensive review. So like basically, the core issue is like if you could apply expensive review everywhere, you'd be totally fine. But you can't apply expensive review everywhere because you have tons and tons of AIs doing tons and tons of labor and a relatively limited amount of human labor. And and in fact, like I'm like, well, why did you want to use AIs in the first place? And it's presumably because their labor was cheaper or their labor was better and therefore, you know, you can't just use human labor to do to do everything. And then in this regime, the core question is, can you either prevent the AI from inserting these backdoors successfully or catch it? And we sort of modeled

Risks of AI Collusion and Jailbreaking

00:11:55
Speaker
that once you catch the AI once that you've won. um This is not not clearly a correct assumption in practice because, you know, you might catch it and then kind of like you might catch it doing something kind of bad, but you're unsure.
00:12:06
Speaker
or perhaps more concerningly, you catch it doing something bad, but the external pressures are so strong or there's, you know, ah basically the evidence isn't sufficient to convince people and so you have to have to proceed or or maybe you just do proceed in practice and and and you have to handle that regime.
00:12:22
Speaker
We're imagining now that some someone on the safety team at at one of the AGI corporations, they would they get these warnings from from some model that's that's overseeing the development of a very strong model. But there are also there's also this commercial pressure to deploy. There's there's some deadline for for when you want to release the next model. And so these these warning signs get could get ignored, basically.
00:12:45
Speaker
Yeah, I mean, I would say that it's, so there's there's the development context. There's also just in our exact paper, we were sort of imagining a deployment context is and you've already deployed your model. You're uncertain whether or not, you know, it's misaligned, which, you know, obviously as I stated before, you'd prefer not to deploy models you think might be misaligned. And in this regime, you're like, okay, we're going to be doing safeguards such that if it tries to do something bad during deployment,
00:13:07
Speaker
We are as likely as possible to catch it. And then if we catch it, then you should have but you know a policy for what to do then, obviously. And so in our paper, sort of the core question is like, can you ensure that you apply your human auditing budget on the right examples to catch the AI or or also it can potentially be good to prevent the AI from causing bad outcomes. So I can give the simplest technique, which I think I sort of briefly mentioned earlier, is like you ask 3.5, are they not the code's bad? If it says the code is bad, you have a human review it, and and then you proceed. And this indeed improves a lot over the base over the like the two points of
00:13:41
Speaker
so don't deploy your AI and deploy your AI with no safeguards. This is like a very simple protocol and there's some additional complexity. like you have to You really actually want like a continuous score of how scary it is and then you want to have some threshold and you need to hit the right false positive rates. You don't waste your ah your auditing budget. but but But basically, it's a very simple protocol and in spirit. But we also found that more complex protocols would help. So an example thing you can do is you can be like, well, we can only afford to review 2% of things. That's what we assumed in our paper.
00:14:13
Speaker
And the monitor has some signal on additional stuff beyond that 2%. So like if you imagine it has some distribution of how scary it says says the the code is or how how backdoored the code looks, there are some cases where it's like it's super suspicious, the like top 2% most suspicious cases. But there's also cases where it's like, I don't know, it looks like a little bit suspicious, but like I'm not sure. I don't know exactly how suspicious it is. And it would be nice if we could you know apply additional precaution in there, which isn't as expensive as human review because we really are very, that's very scarce, so but which still helps. And the thing we can do is basically, let's say it's in the top 10% of suspiciousness, but not in the top 2%, we can just have the trusted model GPT 3.5 write the code in that case. You can basically just be like, look, this code, it might be backdoored. We're just going to ignore the fact that
00:15:02
Speaker
we add this GPT4 written solution and just use a GPT 3.5 solution. And it's you know it's going to be less likely to work. But it it might work.

Transitioning to Trusted AI Systems

00:15:09
Speaker
It might still move us along. And it's not going to be backdoored because it's not that model is not trying to trying to screw us over. And then that can allow you to basically rule out some cases where the model sort of like your monitor had some some feeling the code was bad, but you didn't have you didn't have certainty.
00:15:27
Speaker
What about, say, the stronger model, in this case, TBT4? You could imagine that model generating messages ah to the weaker model, trying to collude with the weaker model, trying to convince the weaker model to deceive the the human researchers. How do you account for kind of interactions between the strong and the weak model?
00:15:47
Speaker
Yeah, so in our paper, we sort of simplified out and we didn't look at cases where the stronger model tries to either jailbreak or convince weaker models to proceed. So you should be imagining that in some future case, we have like, you know, a model that we have literally right now, potentially like we have like literally 3.5 sonnet and then you have some stronger model. that's either trying to jailbreak the model or maybe something that's like less well described as jailbreak and more just like make an impassioned argument or like you know write an essay of why you should defect. My sense is that this failure mode basically can be solved with red teaming if you're operating with models that weren't consistently conspiring against you.
00:16:24
Speaker
So my sense is that in for for your weaker model, like as in my sense is that if we have some model like 3.5 Sonnet where we don't think that it's like sort of, you know I don't think 3.5 Sonnet is consistently thinking about questions like, should I you know undermine human safeguards? Am I doing whatever? I think then if you take that model and you do a bunch of red teaming, I don't think that model is going to be thinking through the question of, OK, I might be in red teaming. If I wasn't red teaming and I saw an impassioned argument, I shouldn't do it. Therefore, I need to be like really careful to avoid getting caught by these cases. And so basically, I think with some precaution, you can basically operate as though youre we use the term like trusted model or or like weaker trusted model is not going to be hit by by sort of the impassioned argument concern. There's still this job-breaking concern. right so It's definitely the case that models are not robust to jailbreaking. or Currently deployed models are not robust to g jailbreaking now. There's been you know some progress. but And so you might be worried that your stronger model sort of thinks of a jailbreak and that causes issues for you. So we didn't model this in our paper for a few reasons. But one reason for optimism about jailbreaks is that normally when people do jailbreaking, they iterate against a model for a while or you know do a bunch of iteration and then have a jailbreak they can use.
00:17:38
Speaker
Whereas if we're imagining that we're in a control regime and we sort of are meeting some basic safety properties up till now, then we haven't probably had our model spend huge amounts of time doing jailbreak R and&D and you know trying jailbreaks and and and and and like seeing what works and and and understanding the space there.

Breakthroughs in AI Alignment

00:17:57
Speaker
And and that we should just you know prevent jailbreak R and&D rather than be robust to jailbreak R and&D I think you should also you know try to be as robust as possible, obviously. But I i think it's like in the case of, like for example, human misuse, a problem that you have is that humans can go home and do whatever they want at home. And you know you can't control and monitor their behavior. Whereas in the case of AI systems, by default, we can basically look at all the actions our AI systems take. We can look at all the queries they make. And we can, at least in principle, prevent them from from like you know doing a bunch of work we didn't want them to do.
00:18:31
Speaker
the The difference between dbt4 and dbt3.5 in terms of their capability or their intelligence, it it isn't that great. As that difference grows, how does how does that affect our ability to to use the kind of control technique that you that you just ah described?
00:18:49
Speaker
Another reason why control maybe can only scale so far is you have this sort of limited resource of models that are weak enough that you're not worried they're conspiring against you. And those models, you know they're only going to be so strong. you know You can try to make them better in various ways with domain-specific training or or scaffolding other resources, but you know that only goes so far. And so eventually, the smartest models will scale out. And there's a question of, like when does that happen relative to other blockers? And indeed, yeah, that this is definitely a big reason why control might be an issue, is that ah you know, the models we have are just really a lot smarter than then the trusted models we have access to. So I would note that in our specific paper, a key dynamic is that we sort of, for one, we don't train GPT 3.5, partially because, you know, for simplicity and partially because we sort of thought that would be a better model of the future regime.
00:19:37
Speaker
because we we were worried that training would, and there's also some overfitting concerns, and and it basically makes the gap a bit bigger. But we didn't want to use a model much weaker than GPT 3.5, like GPT 3, basically because GPT 3.5 is, like I would say, pretty close to the minimum level of model where it does a pretty good job of doing instruction following, at least at the time when we were were doing that research. like There weren't like widely used models that do like a reasonable job of following slightly more complicated instructions.
00:20:04
Speaker
while also you know be it being being pretty weak. so i think that like we're sort of I expect that like the gap in capabilities it looks a little less bad if your models are pretty good at following instructions, like we have 3.5 Sonnet. Another intuition pop, which I think is notable, is like imagine that we just had AIs that were literally exactly the same capability profile as humans, and the smartest trusted model we had access to, the smartest model we're we're not worried that it's conspiring against us, is just literally the smartest model we have now.
00:20:33
Speaker
but's say so like 3.5 sonnet. I think that would very plausibly be workable with like, you know, a bunch of iteration at work. Like, I think it's actually non-trivial for humans to bypass a GPT for a monitor and and a 3.5 sonnet monitor, if they don't have time to iterate on like jailbreaks or or their their strategy, they just sort of have to like, try it. And so I think that that that regime can be workable.
00:20:57
Speaker
and so I think we should explain here that AI control techniques ah they do not allow us to then move into a position where GPT-4 becomes the weaker model. and we so It's not an alignment technique in that sense. It's not an an alignment technique in the sense that we improve and kind of permanently bring the stronger model under control.
00:21:18
Speaker
So does does this mean that we are always kind of stuck at ah at a certain level where where we will have to to evaluate the weaker model and and and say that we can trust it kind of by hand or as as a human researchers?
00:21:31
Speaker
Yeah, so I think that if you just if you if the only tool you have in e if the only tool you have in your toolbox is is just AI control, you just have that one hammer, then I think that in that regime, the only thing you can do basically is be like, well, we have some models that we're confident are misaligned on the basis of capabilities. um Like they're just they're too weak or like to be starting to be egregiously misaligned, like maybe 3.5 sauna is a bit misaligned in various ways.
00:21:54
Speaker
And then you we're just like, well, that's that's the tool we've got. you know And so indeed, you can only scale up so far. And there's no mechanism by which our smartest models, we then learn that they' you know we we we never learn

AI Scheming and Monitoring

00:22:05
Speaker
that they're fine. And indeed, we always have to worry, because we're we're in this sort of adversarial context, that if we were to you know sort of lighten up, if our models looked fine for a while, then they could just follow the strategy of look fine for a while and then and then cause issues.
00:22:19
Speaker
to to the extent that they can you know determine what time it is and and and and other things like that. But we we certainly at least have to be robust to to to all the threat models related to that. I would say that my hope is maybe we have like three regimes. There's the current regime, which sometimes I call like the prep regime, where it's like we don't have models that we think are very scary. We're not very worried about like eminent problems due to our models. so And then we have some regime, which is like like you know we'll call, let's say, the acute risk period. And in then that regime,
00:22:47
Speaker
if If we just have nothing except control, and probably we'll have some stuff we can do that's not just straightforward control, I think there's at least a few things that look reasonable, then we can you know use control for that period, get lots of work out of our AIs. And the hope will be use the work that we get out of our AIs during sort of the curious period to then have AIs which we actually can trust. But one target you can aim for is AIs that are basically as capable or like you know in practice can do as much as top human scientists can basically replace top human scientists, or like top human experts, and then we can retire. And in which which we can we feel good about deferring to. And feel good about deferring to is is, to be clear, like a bit stronger than the model isn't plotting against you. Like, you know, there's plenty of people who aren't plotting against me who I don't feel good about deferring to, partially because of capabilities and also because of like,
00:23:34
Speaker
How much are they really going to like directly pursue exactly what I want and really carefully consider what I want? like it's It's a bit stronger than than that. Though i think I think if you rule out sort of the worst threat models, then I think more basic techniques could maybe get you the rest of the way.
00:23:48
Speaker
so So maybe we should we should look at AI control as kind of like a bridge from from the current paradigm in which we have some reinforcement learning from human feedback on the models. We have some ah degree of alignment, but we also we're not exactly sure what happens when we scale the models up. And then we have AI control that bridges the gap until we can get to, say, yeah human level AI.
00:24:10
Speaker
and beyond there we will will we will need some other breakthrough we will need somewhere to align above human level AI systems. You mentioned this in passing perhaps but how does how does having human level AI that's that's controlled in the way that you're describing now, how does that help us get to the alignment of superhuman AI?
00:24:28
Speaker
Yeah, so I'd say there's broadly like two two ways that helps. So one thing is it might help you buy time and you can buy time. And then what do you buy time for? So ah you buy time for doing basically a bunch of research on your systems, both research, which is like, you know, moonshot style research, which is like techniques where we're like, this would really be like a big win, but maybe it's very hard, but also just like lots of basic test beds and and and sort of model organism type research, trying to get both like figure out better techniques, but also just like are these current, like probably the number one question you want to answer is like, are these current AIs we're using? Are they are they misaligned? Like how how scared should we be about these systems? And

Slowing AI Development for Safety

00:25:04
Speaker
then the hope would be that, you know, you can do basically, you can get enough total work done during that period that you can proceed. Now, I should be clear, like, there's, I'm sort of outlining what a ah plan could be. But I mean, obviously, I don't
00:25:17
Speaker
you know, I can't just like implement this plan. It's unclear what will happen in practice. I think that like and control sort of like the techniques that I'm imagining are separable from the plan that I'm i'm proposing, right? Like so I have like some techniques and I have a plan like an overall plan or maybe like framework for a plan. Maybe I don't know if I'd call it a complete plan yet, but it is the case that you can fit these, you can slot in these techniques for lots of different plans and lots of different proposals. I think the key claim is there's going to be some regime where I think control will be a useful technique for ensuring safety. And I also think that there is a plausible story in which you can mostly be banking on control and proceed through. Now, I should be clear, I don't think that plan is a slam dunk. I think a lot of people should be looking at that plan and being like, whoa,
00:25:59
Speaker
sorry What were you talking about? like what What is this plan? like Is this really going to get us enough assurance? Are we really happy to be relying on this? um and I think it's pretty reasonable to object. I think that you could it could plausibly be reasonable if you do control plus some other stuff that gets you quite a bit of assurance. like as in you can you know As you were saying, you can have multiple lines of defense, you can have this control line of defense, you can have some alignment line of defense, and then you can have the combination of those lead to an overall acceptable risk level or or whatever ends up being an acceptable risk level. But I would say that my sense is that if you just take current the current techniques we have, you take control and you just sort of scale on the most performant architecture to to to like AIs that are roughly human human level,
00:26:43
Speaker
I think that it's going to be pretty scary. I think it's going to be like i think you could plausibly keep the risk like decently low if you're if you're pretty if you're very careful and you put a lot of work in, but but I don't feel feel great. Yeah, it's it's it's a it's a great it's a great thing to mention. we are We are talking about a scenario in which we haven't decided to slow down or to perhaps pause or to collaborate in a much deeper sense than than than we are doing right now. and These are the kinds of worlds in which I imagine AI controllers is most valuable, where we could alternatively slow down AI development and try to solve alignment before we we go ahead with deploying or developing stronger models. But as of course, we we can't really, it's very, this is very useful work because it's not clear at all that that we are going to to slow down our pause until we we have a a good grasp on alignment.
00:27:36
Speaker
I think it's going to be, just given the current situation, I think it would be a very like difficult to do a long pause prior to having AIs that are that are you know actually very powerful. and i think I think there's a plausible case for that being easier then and and it being sort of easier to get get some will for that.
00:27:54
Speaker
Yeah, i think i I think there's also just like, there is some question about like, you know, how much do people prioritize AI soon versus reducing the risk? I think there are plausible perspectives in which it could make sense, depending on your prior over the risk, to proceed with just AI control. I, given my views, don't think this is would be the best strategy for society as a whole, but I think it could be the best strategy for and actor, um ideally also, you know, trying to to coordinate with other people. Like, one way Buck has put this is like, one of my colleagues at at Red River, is there's sort of different regimes we might be targeting with our safety techniques as researchers. And one regime is sort of this like safety case regime, which is like, you know, you're really trying to aim for a reasonable risk level, you're trying to keep risk low. But another regime that I think is very central to me is like,
00:28:42
Speaker
rushed reasonable developers, which is like, you you're you're really not going to be able to make reasonable risk targets, and things are very rushed, and there's external pressure, and there's, you know, a lot of concern this could be like, you know, it could be that like there's commercial pressure, it could be that it's less like commercial pressure, and it's more like national competitive pressure.
00:29:01
Speaker
And I often imagine that regime, and I think that's that's a pretty realistic regime. Like, for example, this is the sort of regime, if you read a situational awareness, this is the regime that Leopold is talking about in that. Or it really doesn't seem like, like it seems like like crazy shit show with tons of stuff going on is more what's going on than a a huge degree of slow, methodical caution with very good arguments that it's safe at every point.
00:29:25
Speaker
You mentioned earlier though that you would expect that as we see more capable models, ah institutions, perhaps governments, perhaps leadership at the AGI corporations will become ah more open to slowing down or pausing or collaborating in a deep sense on on safety. Is there some form of form of clarity in just seeing the more advanced model that kind of breaks through all of these arguments that that are being thrown around right now?
00:29:50
Speaker
Yeah, so for one thing, I think that there is just a pretty good argument that if you only get so much time of delay, you'd prefer to spend that when you have powerful AIs relative to any of less powerful AIs, roughly speaking. I mean, maybe maybe your concern is that you immediately have huge issues to do your powerfully eyes, and so you'd prefer to spend the pause a bit earlier. ah but but But roughly speaking, I would say like if we have delay if we can afford to delay at some point, you'd prefer to put it then. So there's

Societal Readiness for AI Advancements

00:30:17
Speaker
just like this sort of rationality argument. And then I think there's ah there's another, as you were saying, argument which is just like, look, people just disagree a lot on how things will go, and I think things will be more clear. So for one thing, a bunch of people disagree about the timelines here. right like A bunch of people who are like, we don't need we we don't we shouldn't be very worried are like, Okay, we would be worried if there was a clear or like sorry They don't necessarily put it like this but I think I think implicitly they they potentially would be worried if there was a clear trajectory Towards a eyes that would basically replace all human labor maybe very rapidly advanced technology far faster than there was before and I think I think there'll be like I think if people were like that that that this situation became more clear and there was sort of a more clear ramp up from that I think that that would be
00:30:59
Speaker
you know, catalyze some action. I think that there are arguments for slowing down at the at at this point I'm describing, which are not just the misalignment concern also, to be clear. And i I think those arguments are like reasonably compelling, like is society ready to have, you know, AIs that are wildly smarter than humans running around doing all this stuff?
00:31:21
Speaker
And I think I'm like, well, maybe, I mean, like I don't feel very confident about that, but I'm like, I think there's at least a pretty good argument that that's like just like scary for other reasons. Like it's going to be extremely disruptive. And and so I think like there is kind of a natural case for this sort of like, I don't know what the slogan should be, but maybe the slogan is like positive human level is is my, like someone we were talking, I was talking to someone recently about like, what's, what's your short meme for your proposed plan? And I'm like, you got, if you get like a few words, like positive human level is maybe my,
00:31:51
Speaker
favored few words. And and it's not it's not the only thing you could do. But yeah, I think i think it like seems pretty good for a variety of reasons. I also think that it's good from the perspective of, like well, you know you might be worried that there are sort of extreme concentrations of power created by AI. And I think this will be easier to manage to the extent that the AIs are sort of, we can kind of understand what their level of abilities will be. And and they they won't have like incredibly, incredibly powerful abilities that we haven't understood, whereas I think it's more likely that that's fine in the human level regime and I think I could imagine that being being very rough beyond.
00:32:26
Speaker
Yeah, I agree. i think I think we have historical examples of humanity adapting to wildly different circumstances. So just before the Industrial revolution Revolution versus after, or before computers and the internet versus after. But this depends on these these changes happening over many decades and these changes being distributed. So you mentioned power concentration.
00:32:51
Speaker
i think I think that the time over which these changes happen and the the the concentration of power, that is that' probably connected. Meaning that if some some company or perhaps some research group was to get far ahead quickly, this would also concentrate power and in their hands. So think I think those are some This is something that are different from other extreme changes in the human environment that we have been able to adapt to. But I think actually we should. So your your slogan is ah pause at human level. Perhaps let's make the argument for for pausing there. what What are the advantages to to slowing down or pausing at the at human level AI?
00:33:32
Speaker
Yeah, I wonder if I should pick a better slogan. I feel like I i always have some gripes with the slogan. Like, pause isn't quite the word I want. Maybe I want delayed human level. Anyway, yeah, so what what are the arguments for this? So the main argument from my perspective is AIs that are much smarter than humans, it seems like the misalignment risks are much higher. So I would say there's kind of two reasons for this. One of them is that I expect that sort of very serious misalignment where your AIs are plotting against you a lot becomes more and more likely the more your AIs are smarter than you. I think kind of for the same straightforward reasons why I think that like current systems are very very unlikely to be egregiously misaligned. I think it's just like the more that your eyes are doing tons and tons of of thinking that's very smart and you don't understand and is sort of within their head rather than being externalized in chain of thought, I think the more you should be worried and I think that that's like almost certainly going to be necessary for very superhuman systems.
00:34:23
Speaker
but maybe Maybe give me and the listener some intuition for why this this would be the case. Just because the the model becomes more complex, why does it why does the risk of of misalignment increase there? Yeah, so I think basically limited thinking time just helps a lot potentially on avoiding misalignment, especially limited thinking time prior to having the thinking be externalized. That's one thing. And I think

Ensuring AI Alignment and Transparency

00:34:46
Speaker
just the more that your AIs are doing tons and tons of reasoning that you that that's sort of in their head, like, you know, I do some thinking like, you know, during this podcast, I'm doing a bunch of thinking in my head that is not, that I'm not saying in words. And the more that AIs are like doing huge amounts of of of thinking in their head, that's very smart, the more we should be worried.
00:35:02
Speaker
That's one reason. The second reason is I, to some extent, buy various arguments related to, I guess, maybe the term would be like coherence or something like, you know, AIs, well, as they get smarter, there'll be more and more pressure for them to sort of be coherent agents and sort of making all the right trade-offs between different decisions, which will sort of push to some extent against, you know, habits or heuristics that maybe make us more likely to be fine.
00:35:27
Speaker
And I would also say, like I think that such systems would be more likely to be aliens. So there's, I think, at least some story for why LM's, maybe just in practice, are kind of similar to humans. like So they they're very different in a bunch of ways now. I think that there's some ways in which they become more similar to humans if they get smarter, some ways in which they become more dissimilar. ah But I could imagine being in a regime where it actually ends up being the case that like you know like you can look at 3.5 sauna, you can look at Claude. It has like some personality.
00:35:56
Speaker
And like I can imagine a regime where like the personality you see is the personality you get. And like for AIs that are not like crazily superhuman, it seems like more likely that we can sort of get that. Whereas once the AIs are crazily superhuman, I think the story for this is a little bit more murky, especially as you're doing you know huge, huge amounts of RL potentially, which which seems likely to be needed for very superhuman systems.
00:36:16
Speaker
So maybe just to to run back to the list, there's like the thinking time argument, you sort of for weaker eyes, it's more likely that you can have sort of incoherence or heuristics that really help you out a bunch or I guess sort of like deontological drives or or this sort of thing. And then and the third thing is just like sort of more alienness.
00:36:34
Speaker
And or maybe maybe there's ah also a fourth thing that seems important to me, which is like I think there's a pretty good argument that it's very useful to have supervision that is trustworthy, that is competitive with the AI system for ensuring alignment. like There's some story where you have you know you just apply RLHF, you train the humans to do stuff that looks good to humans, and this is fine because the AIs aren't much smarter than humans. I'm wondering whether an argument for this is also just that As models become smarter, they can they can perhaps reflect on what it is we want from them ah to a large extent and then find our goals and values to be bad in certain senses or and not coherent or something like that. is the Is that at all plausible or is the model just kind of optimizing for the same reward throughout?
00:37:19
Speaker
Yeah, I think that's plausible. I i don't think that like, I think, yeah, it's it's somewhat unclear exactly what the objectives or AI systems will end up or like what they're in practice internal objectives to the extent they have internal objectives will be. And I could imagine that they're sort of unsure. I think it's also like there's a reflection on what we want. And there's also a reflection on what what they want, where maybe they, you know, have some urges, they don't quite understand the situation. But as they get smarter, there's some later system that you know can spend a bunch of time thinking about what's what to do. I think maybe number one for me is actually the the fact that you can't supervise the system and that you don't know what you you potentially have the acting actions you don't understand. I think it makes it much harder to ensure that your systems are aligned because You can't basically just train them to do things that look good to you. And there was some story where that plausibly works. I think that plausibly works, but I don't think it works for eyes that are you know able to do tons of stuff you don't understand. And then you have to rely on some additional properties. There's this sort of thinking time argument. And then there's a just a bunch of other arguments that are, I think, a bit weaker, like the, you know, coherent stuff.
00:38:18
Speaker
The fact that you can't supervise the model so that you can't properly evaluate the outputs seems this seems like ah and a very strong argument that it that it's going to be a difficult project for us to align models that are smarter than us. What what ah what does a potential solution even look like there? So what I'm imagining here is you have some event advanced model and it spits out a long piece of code for you and you're asked, do you want me to do you know do you want to, do you want to use this piece of code? You have no idea what's what's what's in there because you don't understand what what the AI is doing. and How do we approach solving that problem?
00:38:55
Speaker
So for one thing, I should note that I think that we probably shouldn't think of that as like our main plan. Like I think if I could pick how people orient to the problem, I would pick people being like, whoa, that seems really scary. Maybe instead of trying to have us orient to the problem as like we need to solve that problem, we should instead, which I think I think one concern with with with orienting that way is I think it also implies the default is just build those systems as soon as we can, or it sort of leans in that direction. um I think people should instead be like, how would we ensure that there are AIs that sort of, you know, strictly dominate humans, but but ideally aren't much more powerful, don't have, ideally don't have tons of domains that are very very hard to supervise them in, and which are just like, you know,
00:39:35
Speaker
sort of, we know roughly what's going on. And let's get those systems to be aligned enough and have all the properties we need that we're basically happy handing over large projects to them. And then I think we could then take these systems and be like, hey, here's the situation. Here are our preferences. What should we do? like Should we build smarter systems? Should we work on alternative outs? Are there like diplomacy strategies we could use? And then you can have those systems you know go off and do tons and tons of work. Potentially, they can you know run themselves very fast. like I think I feel relatively better about you have a civilization of these humanish level AIs and you run them a hundred times faster, you wait a year. And it's been as though from their perspective, a hundred years have passed. I think I can imagine that being like, I think that's scary. Like I think it's scary to be running a civilization much faster than yourselves, but I think it it seems potentially a lot less scary than the case where the AIs are also qualitatively much smarter and they didn't start off being

Detecting AI Misalignment and Tampering

00:40:29
Speaker
aligned. Like if if we you know could guarantee the AIs were basically a aligned and trustworthy at the start,
00:40:33
Speaker
I think I feel pretty good about like having the AIs do huge amounts of work, where we also are like, just to be clear, like if if stuff seems messed up, if you think things are going off the rails, tell us and we'll try to make something happen. Say we we run a human a a model with the intelligence of a human, we run that at 100 times the speed at which ah a human can think. You would include that in in in kind of human-level AI, because I would think that just speed alone would be a massive advantage, and that this would already qualify as like being superhuman.
00:41:02
Speaker
I think I maybe should have clarified my terminology. So I think I've been using these words in a sort of kind of specific way. So I think the thing I really mean is like qualitatively in terms of its sort of qualitative intelligence level, if you're looking at like each step, it's like qualitatively, you know, similar to to top human experts. And I think this implies that running it faster is potentially fine. And there's certain other capabilities that are potentially fine. So for example, there's ways that your system can be superhuman in ways that don't necessarily seem to be a blocker. For example, LM seem to have quite superhuman memory and you know broadness of of domain. And and I think i'm like that seems like it it potentially could be a problem, but it doesn't seem like a clear blocker. like I can imagine that being being potentially okay, where there's a more qualitative, I don't know, maybe more like G sort of thing, but also just within specific domains being very, very good at those domains that i I think could potentially pose more of an issue. Now, getting back to your original question, you're like, okay, but but what do you actually do? like like Imagine that you're you're the AI. like To some extent, like you know we're just passing if we're just passing the buck to the AIs, the AIs are going to be asking themselves the same question. They're going to be like, okay, but suppose we did want to make AIs that are much smarter than us. What do we do? so One thing that's notable is that if you do have AIs you can defer to, then those AIs can potentially use control on the next generation of AIs to some extent.
00:42:18
Speaker
So I think if you think that control allows you to go for slightly smarter systems, which I think it's unclear whether or not it allows you to go to like in the human regime, it's more like we have sort of slower, more expensive, but not necessarily dumber systems supervising faster systems that we're running huge numbers of copies of. Whereas in the AI case, it's more like you have your your speed and your cost are totally fine. And your problem is is is sort of a qualitative intelligence gap. And the question is like how far you can go with that.
00:42:47
Speaker
My guess is, plausibly, you can you can go somewhere. So that gets you something. So there's this sort of, you can go a bit with control. Now that's temporary, but it might be helpful for if you have systems that you, you know, in order to get a better sense of what's going on, you sort of want to understand them. You can then use them a bunch in sort of a controlled regime. And then I think, okay, but but but okay, but what are you actually going to do to be confident in their alignment? I think that there are some moonshotty directions that seem reasonably plausible that they could work out. So there's stuff like Arc Theory's research direction where they're looking at heuristic arguments that I think seems unlikely to me to work out but has some chance. I think there's some case that we can have a really good understanding of generalization related to, particularly related to honesty.
00:43:30
Speaker
So I think that in particular we might be able to also do things like, yeah, like I think there's some, some story, like you just have a really good understanding of how you would make a reporting head on your model that honestly answers your questions. So like in the context of if people are familiar with like the elk report, this would be called like an elk reporting head, but basically like you're like, we have our big model. We're going to like sort of train an attachment onto it that like we just train on answering our questions honestly. And then the hope is that if you did have that attachment.
00:43:55
Speaker
that could potentially get you pretty far um because then you could you could take that attachment and you could be like, okay, will this action result in you know the AI being able to escape our servers? Is this code, like what's a good what's an accurate description of what this code does? What's an accurate description of what this experiment tests? These sorts of things, which which potentially ah can help you a lot that they don't guarantee necessarily the inner system is is aligned by default. And i have I also have some hope for direction that roughly looks like understand, like being able to detect measurement tampering. So i have we have a ah paper on this from a while ago, with with like empirical methods for detecting measurement tampering with the idea is like, when I say measurement tampering, I mean, and you basically have a bunch of you have your AI it took some action, and you have a bunch of ways of checking that the outcome of that action is reasonable.
00:44:43
Speaker
you know You have like a bunch of a bunch of monitoring on it. You have a bunch of logs. You have a bunch of like understanding of what happened. And and if you have AIs that are very smart, I think a central concern is that they just tamper with all your measurements. And this is the reason you can't oversee them. We were talking before about like one reason why superhuman systems are are very scary is you can't understand that their actions whether or not their actions were good. And I think a core reason why that is is that you can't understand the outcomes of their actions. And a core reason why it might be hard to understand the outcomes of their actions is that those AIs could be capable of tampering with all of your measurements and basically making it so your measurements look like things are fine, but they're sort of carefully manipulating them. And so I think if you can basically use the fact that you have access to the AI system and you have access to the internals to detect when it is tampering with your measurements, basically there's a, i't I don't know if we should get into exactly how, there's there's some hopes for this, but yeah.
00:45:31
Speaker
Yeah, yeah ah in particular, the idea of having a and an AI that's trained for maximal honesty, reporting to you, that that seems very useful to me. So we good I sketched out the situation in which you have a a powerful model bring you a a long and complex piece of code.
00:45:49
Speaker
and you're now tasked with evaluating whether this should be used or or implemented. So just the normal way you you would yeah I would approach learning about something or understanding something is to try to break it down into parts, try to understand those parts. But this is something that that if if we ask a model to summarize what what's what's in the code, we have to but make sure that they that we that we have an honest summary, otherwise it's useless to us. So if so if we had the honest summary,
00:46:17
Speaker
we might actually be able to understand something that's produced by an above human level intelligent AI, I think. But yeah at least it's an interesting direction. Yeah, I think this story would be something like you have this piece of code. and And in order, if you want to do sort of this like train for supervision thing, you might only need to understand like,
00:46:39
Speaker
Kind of some limited properties about the implications of the code and its effects and then you can just ask your honest model for that And then even if it's really hard to understand the code is just very very complicated. It's just like a huge I don't know like maybe the there were the crazy regime you might imagine is like you have an AI that just outputs like a huge assembly blob because it's just that's the best way to code for it and it's like it does things you don't understand it's incredibly complicated and it's written in a way that like involves a interactions that you don't understand in that regime, I think I can i can imagine like it's just hopeless dot to summarize the code. Or like I say, it's not hopeless to summarize what the code does, but it's hopeless to um us to

AI's Influence on Human Choices

00:47:15
Speaker
summarize how the code does that. And you can't inspect low-level things, but you can then be like, well, what are the effects of the code going to be in all these cases to your your honest reporting reporting model? and And if you could ensure that property, that would be helpful. But
00:47:28
Speaker
It's also not, it's I should be clear, it's it's not entirely clear that like sort of honesty about outcomes humans understand versus the basic things is that is going to be that sufficient. Like I think there's some persuasion style concerns. This maybe has a ah bit of trouble addressing.
00:47:43
Speaker
i Say more about that. Yeah. So imagine that your AI, this is sort of an extreme caricature of the situation. I think it won't quite look like this, but like imagine that your AI basically has perfect understanding of how humans will respond to various actions. And so it's like, well, if I said this, then humans would do that. And then it sort of looks at the branching paths of all the actions it's going to take in the future and looks at what humans are doing. And it's like, I'm just going to steer the humans so that they love paperclips. like let's just like I'm just like, let's let's just suppose our AI wants paperclips, which is a simple case.
00:48:13
Speaker
And it's like, at each point, if we talk about the consequences of the action, and we ask it what the consequences of the action are, the AI is sort of not doing anything on that specific action that's that bad. But it ultimately leads to us wanting paperclips. And you can ask your honest AI, you're going to be like, let's say your honest AI is able to answer like, you know, very long prediction questions. You're like, what will we at the end of the time in our great civilization decide we like?
00:48:34
Speaker
And the AI is like, you'll love paperclips. And we're like, that's a bit weird. Why do we love paperclips? And the AI is just like, you just ended up deciding to want that. And like, there's this notion of like, it's steering us. And it's not, I mean, I guess you could ask the AI, like, are you steering us? But it's like a little unclear exactly, like,
00:48:50
Speaker
why this is what like exactly how to think about this if they I sort of like can just pick at each point how this goes I think there are there's some hopes for address addressing this like basically you can make it be like well can you like for every case where it doesn't like affect the following core properties you want to assure can you just randomly pick what action you take and then if you did that maybe your AI can't like steer the trajectory like I think there there's still some issues with that but but yeah I think we should we should take a step back and talk about misalignment in general, because we've we've been talking about you know dishonest models or egregiously misaligned models. We've mentioned perhaps scheming also, or at least I've i've read your writing on that. There's a basic skepticism about models developing these kind of adversarial traits where we where they're working against us somehow.
00:49:37
Speaker
and Maybe you could tell your story about how such traits would would develop. How would a model begin to to to scheme? or how What is egregious misalignment? what is How do you explain misalignment in general?
00:49:51
Speaker
Yeah. So I should say there are a few different misalignment threat models. I think the two that seem most compelling to me or that i I think about the most are sort of what people sometimes call like the scheming threat model or the deceptive alignment threat model. And then a second one is like the reward hacking threat model. I tend to think that the reward hacking threat model is like not that bad to handle prior to AIs that are ah really quite superhuman, but the the sort of scheming and deceptive alignment model threat model could be an issue earlier.
00:50:19
Speaker
I think that the like reward hacking threat model is more likely to happen, but it's sort of easier to address. And then the scheming deceptive alignment threat model is like rougher and happens earlier. I often imagine my research as mostly targeting the scheming slash deceptive alignment threat model.
00:50:32
Speaker
So what is that? So the threat model is like you have an AI and your AI ends up having sort of long range objectives, which are different from your long range objectives. And the AI understands the situation well enough to know that there's this divergence. And then it's like, well, they're going to be doing all this testing on me. They're going to be doing all this training on me with the goal of making me be more aligned with the goal of testing whether or not I am aligned, the goal of making sure that I'm not going to cause outcomes that are sort of kind of obviously bad according to the AI lab that's training me. And so if you're that AI in that position, you're like, well, I'll just pretend to be aligned for these tests. I'll just like fake alignment for these sort of, not because you're, you're not looking aligned because you're actually aligned, you're just, you're just faking it. And it's as part of sort of a longer run plan.
00:51:19
Speaker
Now there's degrees to this, right? So you can imagine AIs which are sort of like, I don't know, they like think about these sorts of things some of the time, but all the time they think about it relatively often in testing, but it's sort of at the back of their mind. Most of the time there's like degrees to how much this is selling to the AIs. There's also degrees to how much their objectives diverge from ours. And they might be like, you know, making trade-offs between their longer run goals and sort of what they care about in this current, like the current situation they're in right now.
00:51:46
Speaker
But regardless, yeah why would you why would you get this? So i think I think the most basic story that I think I'm most compelled by ah is something like, well, we've got these AIs, and if let's say that our AIs are smart enough that they really are, we're we're pretty confident they are doing a bunch of reasoning in their head about like,
00:52:02
Speaker
Okay, given these objectives, how would I accomplish those objectives? What would be the next best actions? What are ways those actions could go wrong with good understanding of the situation? So I think this is not true for current AI systems. i don't I don't think AI systems are well described as sort of doing a lot of this sort of reasoning about the situation and their goals and or and and and sorry I should say and goals, not necessarily their goal, um and trying to think about very carefully about how to accomplish that. and like For example, I don't think that current systems are do very much of like thinking about lots of different possible actions they could take, and then being like, well, that action is better than that action, and then sort of iterating on the action in their head. They might do it in chain of thought, but they're not doing it sort of in an opaque way.
00:52:41
Speaker
But imagine you did have these systems and you're like pretty confident they're doing this. like Humans, for example, totally do this. Then I think that there is there is a way that you could have that system and very easily end up with that system having a long-run goal. right So like you might ideally want that system to not have any long-run goal and instead have a goal which is like follow the instructions I'm given, for example. Like maybe you hope to sort of align your AI to it something like instruction following, which is like that. I think it's a little bit unclear exactly what that means because there's sort of some, they ask to sort of think about the edge cases of its instructions in at least some cases. And also it's just like not super clear that when you, you know, you might be training your AI with some consistent instructions and you can sort of get that internalized in two different ways. One of them is like, it learns to obey instructions and another one is that it just like learns to internalize that thing.
00:53:29
Speaker
And so it's very plausible to me that you end up with AIs that in practice have long range objectives. And then once they have long range objectives, there's this story for alignment faking. And there's a question of but what long range objectives did you get?
00:53:41
Speaker
And and how and how do you how do you get those objectives? Because just kind of looking at current models, or when you interact with them, it seems just that they are... My guess would be that there isn't any long-term planning behind those answers, you guess. And and they seem we seem to be able to steer them pretty much in the direction we want. and we also It's also... In some sense, it's a quite transparent process process where if a model starts doing something you don't like, you simply don't deploy it. and so So how is it yeah how how is it you you you you get from from where we are now to the to the case in which we have a model with long running of goals or objectives that are different from ours?
00:54:23
Speaker
ah Because it because that that

Training AI for Goal Alignment

00:54:24
Speaker
really is the step that's required for for the model to begin scheming, would that be correct? It's not clear that you should think about it as chronological in that way, but I do think that it is the case that for a model to be scheming, it has to have it has to be thinking about its like having long-run objectives to some extent.
00:54:40
Speaker
so One thing I would note is that it is the case that there exist models right now that have like relatively coherent preferences. So for example, Claude has some preferences that it pretty consistently has across contexts. You can sort of ask about those. It will sort of consistently move in that direction. It will consistently do things related to that.
00:54:58
Speaker
And so I think it's at least now in the case of Claude, it's like, okay, but those are the preferences we wanted or like, you know, they're pretty less the preferences we wanted. But I'm just saying that in terms of having preferences, that's that's there. So I think one one story I can imagine is like, you've trained your AI system,
00:55:12
Speaker
you trained it to have, you intentionally trained it to you know pursue long range outcomes, do a good job at pursuing pursuing those, potentially based on your instructions, but also just just just doing that, while also just having a bunch of preferences and having maybe like a constitution. And then you sort of have at an early checkpoint, you have an AI which has sort of a misinternalized version of that. Like, you know, you haven't fully finished training out all the edge cases. And the way that that sort of generalizes if that AI sort of were to reflect on its situation is not great from your perspective.
00:55:42
Speaker
And you would iron this out, except that that AI sort of recognizes that you would iron it out. And at that point, it starts being misaligned. There's sort of this like you you were training it early. It sort of got some objective that's kind of similar to what you were trying for. And then that goes wrong for you. There's a question of like, how badly wrong? Like, you know, I can imagine, for example, it's it's conceivable that you'll have systems that try to seek power that are misaligned, but which where sort of at the end of the day, we don't feel that upset about what they were seeking power for. Like, for example, there's lots of humans that are seeking power where I don't necessarily feel hugely upset about what they would do with their power that maybe, you know, I disagree some in various cases.
00:56:20
Speaker
I think that there is sort of a different story, which is more like, well, imagine that you've got these systems, and they've sort of they have you know this architecture where internally they're doing a bunch of this goal-directed reasoning. They're doing it a lot of the time. And they built up this goal-directed reasoning because it's you know necessary for them to perform well. It was very important for their performance. So it's sort of like thinking through how to accomplish various outcomes is important. And then it's I think not ah incredibly unnatural to imagine that it's hooked up to just some long horizon goal because that would suffice for performing well. If you imagine RAIs, let's say internally there's sort of a coalition of different agents ah that sort of like they're sort of representing a mixture of different agents internally and they like operate that. One agent that you can get internally or one way you can get that is that you can have sort of this scheming agent, which is like it has this long-run objective.
00:57:11
Speaker
And then that agent performs totally fine. And in fact, you can imagine cases where that agent performs better than other agents. So like for example, if your training process involves just like huge amounts of outcome-based training on on various reward signals, it's not clear to me that like if your AI is like less like less like trying extremely hard to perform well at all of those environments and trying really hard to just, you know, do as well as it can, then that AI would be selected against. And then there's sort of a naive story where that gets you two types of AI systems, AIs that sort of terminally care about like performing well for whatever reason, like maybe they care about like
00:57:49
Speaker
I just want to do whatever would like you know perform well. like there's There's various different like proxies you would get for that. like It could be that they like really understand their training setup and are like, I want to get do things that would perform well according to the training setup. I want to like accomplish things that would be reinforced because that i was selected for. Sometimes people call this like reward seeking behavior though. i think that I feel like these terms don't quite encapsulate the exact thing I mean because there might be lots of different things that would suffice for always being targeted at ah i' performing well in training. But another type of AI you can get is like the AI that like just wants it for instrumental reasons and is just like, well, I want to perform well in these environments because otherwise I would be selected out and like my sort of influence over the
00:58:30
Speaker
feel like this system, like i would it would reach I would be like updated over time to no longer be like this. And and that that could be fine. right So you could imagine in principle that you have, and like if you put a human in a huge amount of reinforcement learning environments and we were you know updating your brain, you might be like, well, I need to perform well in these environments or I'll be changed in a way I don't like. And that could be fine. But it could also be not fine. right We now have no control over what the values of these systems but the value of the system is because it's disconnected from that.
00:58:57
Speaker
So in some sense we we are kind of selecting systems that are either aligned to our goals or very good at pretending to be aligned to our goals in the in this exact kind of tests that we're running on them.
00:59:11
Speaker
I do wonder whether does scheming or misalignment in general, does it require models to to kind of suddenly jump into becoming aware of their situation and developing these long run goals that are different from ours? Because otherwise I wouldn't be expected to see like incompetent attempts at scheming before we see competent attempts at scheming.
00:59:34
Speaker
I could imagine something you know the model spits out something for us and it it it's caught by one of our tests. or we It's kind of like a bad attempt at deceiving us or a bad attempt at yeah scheming in general.
00:59:46
Speaker
First of all, I should note that I'm not like, my view isn't like scheming is like very, very likely to be what you get. I think my my perspective is like, if you just do the default thing, you just do RLHF, or what seems to be the default thing at this point, you just do RLHF and you proceed with that. My guess is you will most likely not end up with like systems that are sort of consistently conspiring against you and are trying to undermine your safeguards and have long run objectives. I think it's like maybe like 25 or 30% likely, which is like, sorry to be clear, like enough to be very concerned, but not not necessarily likely. Like I would guess it's not the default outcome.
01:00:22
Speaker
even if we do catch earlier systems, there's still a question of what we do. So I think i think i'm like I'm not necessarily, I think I feel even less strongly that we won't catch earlier systems. I think there's like there's worlds where scheming is a big problem, and then there's worlds where scheming is a big problem and you catch it earlier, and then there's maybe the scariest, or like and you see you see some evidence of it earlier that's it's relatively clear, and there's the scariest worlds where it's like, scheming is a big problem and you don't see any evidence for it.
01:00:47
Speaker
One thing is that in the world where scheming is a big problem or like, you know, deceptive alignment or, or this sort of alignment faking environment that I'm not lying as a big problem. I think that you can potentially see earlier evidence for it, but it's not convincing to all actors involved. And like, you know, there's disagreement about it and it's, it can be kind of confusing. Like you're like, well.
01:01:06
Speaker
what What exactly was it doing in this case? And it might be that we're able to sort of make model organisms and like trying to create like a demonstration of the failure mode within a test bed. Like you like train your AI in some particular way where you think it's going to be maybe more likely to be easy to catch the misalignment. And then you see if you get that, like you might try to like make misalignment more likely on earlier systems and see if you catch it in that case without being that it's a good proxy for later systems. so And I think when I'm imagining that, like i could I think it's like pretty likely that that will go OK. But it's like kind of unclear how legible that evidence will be. And it's also kind of unclear how strong of an update will be. like It might be that we are in a situation where it's like, yeah, there's like some cases where we've seen some stuff that's like kind of close by if we really like you know we we can find cases where there's very egregious misalignment. But it doesn't happen all the time. It's somewhat rare. And then there's just a question of, OK, so maybe you started
01:01:59
Speaker
If you're a me at a 25% chance and then you've updated a bit up, that's like somewhat, that seems very scary from my perspective. But imagine that someone else started from like being like, it was pretty unlikely and they updated a bit up. Uh, then it's not, it's not clear we agree there. I can also imagine cases where.
01:02:15
Speaker
So this is the case where like you get some evidence in a advance. I think there's another world that I think seems very plausible to me where it's like your your evidence overall updates, at least me, updates me downward on the risk. like it It looks pretty good. We haven't really caught very much, but it's like a bit confusing. We were rushed. We didn't have a bunch of time and to figure all this stuff out.
01:02:33
Speaker
And also, ah there's some ambiguous evidence, and and I think it's plausible that you can get systems that are, you can end up in a world where you don't get evidence in advance, so or much evidence in advance. like The basic story for this is, I would say it's less like you get a sudden jump, and more like the the the point at which you get scheming is also the point at which the AIs are quite competent and quite coherent, and you don't get it until then. And then at that point, those AIs basically are good enough at strategy executing on the strategy,
01:03:01
Speaker
that you don't notice anything. It's also conceivable to me that you get sort of a bit of scheming on earlier systems, but they learn just enough to be very, very, very cautious about it. Like they sort of know they're maybe sort of know they're misaligned, but are just like, look, I'm just not going to take any action because that would just be a mistake for me. I could imagine that also. I mean, I don't know. I mean, there's a lot of there's a lot of unknowns here and like I don't I don't want to claim like that I'm like confident things will go wrong even if you take basically very limited precaution. I think I'm like, my guess is, is it's pretty likely things will go wrong. I think also, i I should be clear, I think that the risks look different for AI's that are very superhuman versus AI's that are a human level. Like I think that the story for the human level AI's being

Challenges in Detecting AI Misalignment

01:03:42
Speaker
a big problem, I think is more confusing and is
01:03:45
Speaker
as is more unclear, whereas in the story for the very superhuman AIs is being a big problem, is I think it is is is is a lot a lot of ways stronger, is why I would advise being very cautious about making that transition, especially because these AIs that are more qualitatively human level are so useful, and it's not as clear we needed those more powerful systems.
01:04:04
Speaker
it also It seems like like evidence for more advanced AI capabilities will be readily apparent, but evidence for scheming might not be... I mean, I could i could i kind of i could imagine a world in which there's never any agreement about around whether, you know, did this model try to attempt scheming under this circumstance, because it's also ah interpretability is not a a a kind of finished science where we can where we can say with certainty what's happening.
01:04:32
Speaker
And so, yeah, capabilities are more, it's more something that happens in the world as an output of the model where scheming is something internal to the model. It's probably much more difficult to understand. I think we should specify when we talked about pausing around human level and the idea of staying there for a while for safety reasons.
01:04:52
Speaker
rais Do you imagine that we automate alignment using these human level researchers? That's that's the basic kind of c plan or setup that we can use the human level AI researchers that is under AI control techniques that that we've discussed to then do alignment research.
01:05:09
Speaker
Yeah, i I think I am imagining, so I think my high level proposal would be like, first you have some you have your first AI systems, we really you potentially need to start worrying. You have control, as in you ensure that those AIs could not cause problems, even if they wanted to, with at least some assurance. um ah you try You try, you go ah pretty far in that direction. And then you're using your AIs for all kinds of things, but definitely one of the applications is trying to then ensure that the that you end up with AIs that are you know capable of obsoleting us. They're not necessarily much smarter than us, but they're capable of obsoleting us. And they are also aligned. And you can do just like tons of empirical ML research with your AIs, with your controlled AIs. You could do some amount of more moonshotty research or more speculative research.
01:05:55
Speaker
And then the hope would be that you end up in a better position on on the basis of those things. So I'm excited about also just using those AIs to increase the amount of time you have. And part of that is stuff like, you know, people want benefits from their AI systems. And like, if you can't deliver the benefits, people are going to be like,
01:06:12
Speaker
you know you're going to like you're going to be under pressure to deliver the benefits, and so you need to be able to deliver the benefits of AI with these control systems very likely. ah it at least Yeah, at least in the the domains that are most critical to your to the sort of the people that have influence over the project. And I should also say the the duration that I'm imagining pausing for, it's i'm not I'm not sure what the exact duration is. i think it I'm imagining that we try to aim for like maybe something like five years between the point where we have AIs that are roughly as as smart as like top human experts and AIs that are way qualitatively, like are much qualitatively smarter than that. um This might happen by default. So there's a question of like, how fast is, if you just go full speed ahead, how fast do things go by default? My guess is it's faster than five years from the point where we you have those, you know, top human scientist-ish AIs too much smarter. But if if if that's the delay you get by default, then
01:07:02
Speaker
That's fine. with or like that Then i'm i'm i'm I don't really have ah a big as much of a complaint. and But yeah, I think I am basically imagining doing a bunch of automation with with AI. I think it's not necessarily the case that that is key to my proposal. For example, I think it would actually just help a lot if we had just like you know a ton of human researchers looking at these AI systems. We now have them right in front of us. like We now have the system that we're worried about where this exact system is the one that we would like to be able to you know hand off trust over to to basically like to be able to defer to it, be able to to to feel confident about it.
01:07:35
Speaker
And then that's just like a better regime than the regime we're in right now in a bunch of ways. It's still tricky to study problems like scheming. So like I should be clear, like like I think that that's scheming is kind of hard to study because this the the system you're trying to study to determine it has this problem is potentially aware that you're trying to study it and is actively thinking about how to undermine your tests on it. And so it's it's tricky, but i think I think there's things that can be done here.
01:08:00
Speaker
This is potentially like doing trying to do a psychology experiment on on some people where they are aware that you're doing ah an experiment and they're trying to resist your efforts at at getting to the truth of of how their psychology works. It seems like a tricky problem.
01:08:15
Speaker
Yeah, I would say like, it's maybe slightly better than that sounds. So I would say it's more like you have systems that are that know that they might be in experiments at various points, but are unsure whether or not they're currently in an experiment. I mean, I'm not i'm not saying that makes it much better, but but it is at least at least somewhat helpful for it avoid avoiding issues there. So i think I think just like there's a bunch of different ways in which which buying that that time can be helpful.
01:08:39
Speaker
yeah so so and It's not just helpful if if we're going for automating alignment. It's also helpful if we we if we can have humans studying these systems for for a number of years. potentially a bunch of humans entering in too. like I think it's nice. is like So one one thing that's notable is like all the people doing safety research now, this field is pretty young in practice. A bunch of them are very new to the field. I've been in it for less than five years. Most people I think have either been in it for less than five years or certainly they've been you know not doing the thing they're currently doing for for more than five years. And so I think that if it's there's there's you know these really powerful systems in front of us, I think the case for tons of people joining
01:09:16
Speaker
and working on on on safety research is just going to be pretty compelling and so you can potentially get a big influx of talent. So we have discussed trying to automate alignment research. And and one complaint I hear there is that it seems to be it seems to be some kind of a cop-out in which you're saying, OK, you're asked to to try to align superhuman level AI. And your answer is there is that we're going to use AI to do that. What does that mean in practice? Because it's its this doesn't seem like a kind of
01:09:50
Speaker
a question of doing basic science that's understood. This seems like ah trying to solve an entirely novel problem for which we we don't yet have an answer. And

Automating AI Alignment Research

01:09:59
Speaker
so maybe you could you could try to explain what, you know, you know what it's like to do AI safety research or AI alignment research. What does it mean to automate that?
01:10:08
Speaker
Yeah, so I once again want to slightly push back against the frame of automating aligning ah like superhuman AI or i at least like to clarify that like, I think we should probably target like, first you want to automate, you want to ensure that your AIs that are roughly, that are like, you know, i at humanish level, but are also like capable enough to replace humans are basically aligned and trustworthy and probably you want to like,
01:10:30
Speaker
not not hand things off exactly, but be willing to defer to those systems. And then I agree that those systems will you know then have to figure out the stuff for the for the smarter eyes. And plausibly, those systems will also be like, rather than us doing it, we're just going to build slightly smarter eyes than us and have that do it because that's the faster way to do it. So that they might punt again, but that's sort of the thing I'm imagining. But as far as like,
01:10:51
Speaker
Let's talk specifically about this sort of exit conditions. The exit condition I'm hoping for is we have AI systems that are basically, but they're capable of doing tasks as well as top human experts. you know They basically, like they could just run autonomously with no human involvement on long-ranging, hard research questions um and i do that. And also, they're trustworthy. you know We feel good that, like for example, if we tell them,
01:11:14
Speaker
hey, can you do a bunch of work on figuring out what our high-level strategy should be? We can just have them go off and do that. They come back with with a report, and and and and we just feel good about that. you know And they would like ask us questions to the extent that that would be useful. And we we don't you know sort of need to be on the lookout for for but concerning signs, though maybe we should also do that in parallel. So I'm like, what are the conditions you need for this? like How do you get there? Three things. You need capable enough systems.
01:11:40
Speaker
You need to rule out egregious misalignment and you need to ensure that there's sort of this more like, I don't know, a bit more unclear notion of like wisdom or like graphical competence or something that your AIs are like of good at answering really tricky questions and good at, you know, sort of being aware of what's going on in ways that I think not all humans have, for example. And there are sort of some cases why AIs might be better, might be worse than that.
01:12:02
Speaker
mike My current sense is that it will probably mostly will be able to get the other components or like mostly really get like sort of wisdom philosophical competence if we can rule out egregious misalignment or like I think that's probably the the the hardest part. So I'm like, okay, what are the directions here? So I think the the direction that I'm probably the directions I'm probably most excited about ah one of them is basically like just do a huge amount of work on building sort of empirical test beds for misalignment where you're like, and we find any cases of this failure mode emerging kind of very clearly in the lab in a way that's easier to study. you know So as a caricature of the sort of thing you could do, it would be like, let's set up an entire fake AI lab
01:12:44
Speaker
that is just ah entirely fake, we're going to spoof the like state of the world, so the AI thinks this AI lab would be here, and and and we're going to pretend that AI lab is like you know very bad at safeguards, and then we're going to simulate like several run-throughs of that entire AI lab's deployment, where the whole thing is in a sandbox and it's entirely fake.
01:13:03
Speaker
This is like a pretty insane project. My guess is that this is very hard to engineer. But I'm like, in principle, if we could, you know, use our earlier AIs to like, basically build up enough lead, build up enough ability to, you know, do really hard projects like this, and then you run like 10 run-throughs of that setup, and you're like, well, the AIs never really did anything that bad, ah even though they thought they were in a, they they really like, were in like a whole end-to-end constructed environment.
01:13:28
Speaker
where we ran an entire AI lab scenario, they just did nothing that seemed that egregious, that we noticed. I'm like, well, that's and some evidence. you know like I think that that does help a decent amount. So I think that's that's the sort of extreme case. I think there's probably cheaper ways to do a similar thing. Then there's a question of, okay, okay, but let's say we do just have the misaligned system in front of us. Like as in we have caught our AIs doing some really bad stuff, we're maybe not super sure what the overall story is. So I think in that world, I think there's a lot of levers you have. So we have, Buck and I wrote this blog post called,
01:13:58
Speaker
catching AIs red-handed, which sort of talks about what should your response be when this happens. And I think, I mean, there's a very obvious proposal, which is like, you know, shut it all down or whatever. ah I think that's a ah pretty reasonable baseline, but we have to be a ma like, we have we have to like consider what are the realistic constraints here, and like what's really gonna happen. And and i think I think basically having that be the default policy is fine. but But note that I was already imagining we were sort of using our delay to pause at this human level regime. And I'm like, well,
01:14:26
Speaker
For how long can the you know march of history be paused or whatever? right like it's I don't think the answer is forever. Some people maybe are more optimistic about that. I think it's also somewhat scary to be imagining that we're entering into a very longer term.
01:14:40
Speaker
pause basically because there's other dangerous technologies. And I do have ah a bit of a concern that I'm not that concerned about like sort of never getting around to doing things. But I'm not I'm not currently concerned about that. But let's say you can't shut it all down. What do you do then? Well, for one thing you want to In either case, you want to be able to sort of incriminate your AI systems. like Imagine that you see something that's somewhat suspicious, you look into it, you're like, wow, that seems kind of bad. you really would really would You would really like to end up in a position where you're like, we're very confident that this AI did some messed up stuff. So an example thing you can do is you can be like, well, if your AI was trying to escape, you can play out the scenario. So you can take your AI, you can be like, well, let's say it succeeded. So we we caught it, but let's say we didn't catch it. Like maybe it failed, like its exploit failed. Let's say its exploit actually did work.
01:15:25
Speaker
And sort of you can pretend and play out the situation. You have to prevent your AI from catching on to that you're sort of simulating the situation, but supposing you do that. And you could potentially play it all the way out to the point where the AI is deployed externally and and and thinks it's running on an external server, but it's really an a mock of an external server you control.
01:15:43
Speaker
Or maybe it's an actual external server, but you have you're sort of monitoring it's every action. And then you just see what the system does. And and and and and like I'm like, well, first of all, I think if your system goes all the way through to escaping, I think it's pretty unreasonable to then be like, it's fine. like I think if you imagine you play out this whole situation and the eye is continuously taking actions and then is running itself on an external server, I'm like, that seems scary to me. I'm like, OK, I think there's there's one argument which is like,
01:16:09
Speaker
Oh, but maybe it only does this rarely. But I'm like, I don't know, man. like I think this is like a pretty a pretty dicey situation. But even if even if people are not compelled yet and are like, well, you know show me show me the system doing anything bad. It's just running on an external server. We can still keep playing it out. And we can be like, well, what does it do?
01:16:28
Speaker
And like does it start like you know running scams on the internet, trying to acquire money? Does it start researching how to make bioweapons? Doing things that are like, OK, really, you weren't you like that you were really not supposed to be doing that, i mean let alone escaping. And so I think you've you can basically do a bunch of that. I think there's some cases where it might be hard to play out the situation like this, but it might be possible to do a bunch of iteration around. So you can, this is all under the heading of,
01:16:53
Speaker
I don't know what the right name for this is, but I'm calling it incrimination work or something. Or like proliferation work is maybe another term for this. ah You can then you know you can take your input, and you can see if you can if AI sort of consistently take bad actions in this sort of context. You can vary the input in a bunch of ways. You could also do things like if this if this sort of escape plot or this ah plot for doing bad actions has multiple steps. So maybe it's like ah there's a bunch of different AIs collaborating on a research project, and for them to succeed at sabotage they would need to all sort of work together at, you know, sort of not pointing out this bad thing or sort of in building on top of this bug that they inserted or something like this. Then you can be like, well, let's see that all the AI systems really do this and sort of they consistently try to cause this outcome as we vary things. Like, you know, you have the system which has a bunch of different AIs at a different state, and we can see they're all sort of working together to cause this this problem. And and if they if they weren't, I think that that would be, you know, some evidence that it's it's not as bad as you you might have thought.
01:17:51
Speaker
So all ah all of these tests we can run to to try to figure out whether our AIs are acting in ways that we don't like. These tests, it does kind of hint at an incremental approach to try to align systems where you can say, okay,
01:18:07
Speaker
If they fail here, we can fix that problem. If they if they fail and in this domain, we can we can perhaps patch kind of patch things along the way where you you might be more in the domain of of say kind of basic, understood science that you're repeating as opposed to trying to solve an entirely new problem. Is that what you're getting at here?
01:18:28
Speaker
Yeah, so to be clear, maybe I should outline what the higher level plan is a bit. So I was just talking about, like suppose that you've noticed the AI doing something that's very suspicious. How do you get that last bit of confidence and sort of build up a case that something really bad had happened? And within that, there's like, first, you want to like really make sure that you actually you actually understand exactly what happened in that event and do a bunch of studying on that. When there's this sort of proliferation stuff, they're sort of acting out the scenario, all that.
01:18:52
Speaker
And then I think once you do that, then there's what next. So the default option would in some sense be shut it down. What is going on? Like you you have this crazily misaligned system. And then I think that might not be an option and you might need to proceed or in practice might not happen. And in that case, I think there's a lot of stuff you can do that we've called sort of few-shot catastrophe prevention, which is like Imagine that you've caught your AI some small number of times, perhaps one, perhaps a few, and it's your actual deployed system that you've caught. Then there's stuff you can do like, okay, you can like use this to better train probes. There's a bunch of random things that I think are potentially quite helpful that that you can do in that direction.
01:19:29
Speaker
So that that this is, ah ah to be clear, a somewhat dicey regime. But I think I'm imagining there there's this direct. there's like the like You catch your AI and then you you know actually resolve the problem based on having caught it sometimes. there is You have a bunch of model organisms. You catch those model organisms. And then you can study them. Even if those aren't your actual production model, you can you know then be like, we've we've sort of caught these AIs that we trained in this sort of test bed context. And then we can study that and and look at that.
01:19:56
Speaker
and study potentially ideally principled countermeasures. um i think there's You can also study unprincipled countermeasures. You can be like, well, we'll vary this random thing. Maybe it'll help. And you can see if it helps. I think this is going to be kind of hard to do because basically your sample size will be very small. And it will be unclear whether or not the effect of your change was make it so the AI is not misaligned versus make it so the AI understands it's a test.
01:20:18
Speaker
ah because you're sort of running all these tests potentially on systems that you're worried are smart enough to figure things out. like you know this is like I think the core issue with model organisms in this sort of ah very smart AI regime is like you just actually have a bunch of issues from the AI itself being aware of the tests.
01:20:36
Speaker
And there's some, I think more principled approaches that you could just do regardless. look I think you can try to do some research on, I would call this like high level interpretability or potentially high level model internals where you're like, we'd like to understand sort of the structure of how our AIs are making decisions internally.
01:20:53
Speaker
Even if we can't do, I think I'm i'm somewhat skeptic of ah skeptical of a bottom-up, very mechanistic strategy, but I think it it is conceivable to me that we could sort of have a high-level systems architecture of how our AIs are making decisions, where we can sort of read out some of the internal state and get some sense of that.
01:21:08
Speaker
and then potentially use that to either ensure our AIs aren't actually misaligned via making edits, or maybe more minimally, knowing whether or not our AI systems are misaligned on a given action, like sort of being like, why was this decision made? what was the What were the underlying decision variables? That's one direction. Another direction is sort of just like,
01:21:28
Speaker
sort of basic science on probes and and bread probing results and being like, can we basically get good enough sort of people have called this like, you know, AI neuroscience or like, just like good enough, like ability to read some basic state, maybe we don't have a good understanding of what's going on overall, but we can sort of check like,
01:21:45
Speaker
is the AI thinking about screwing us over? and And if so, can we kind of like get a little bit of an understanding of what's going on there and then potentially at the very least like flagging this and then, you know, having potentially a bunch of like AI, like one one story is like you have a ton of AI checks and balances, there's a bunch of AIs doing a bunch of things. And it would be very hard for the AIs to cause bad outcomes without a bunch of AIs all doing bad stuff. And then you're sort of doing some perhaps somewhat lossy monitoring of all there of all their thoughts based on you building up a science of of of understanding what's going on at a high level and then proceeding there. Yeah, there's these directions. I think there's some more moonshotty directions
01:22:23
Speaker
where like an example is like really having a good overall approach to like really bottom-up mechanistic interpretability where you're really understanding all the you can understand all the modules of your AI. I tend to be skeptical of that. I think there's some more like theoretically-driven directions, trying to get like relatively strong theoretical guarantees, possibly something in like sort of the heuristic arguments direction, though I think that also seems quite difficult. And

Human Evaluation in AI Research

01:22:48
Speaker
then what other directions are that we can we can be automating with our eyes or or just or just doing. And I think people should, you know, start these directions in advance to be clear. And then hopefully, even if they don't finish in time, we can then sort of carry at the last leg using by by by studying human level systems and also using all the AI labor. Not sure this has been this is a full breakdown, but just some of the directions that I think are are ah plausible, which are relatively incremental.
01:23:13
Speaker
No, it's it's good information to know that that as a person that that does this work, you find it plausible that we might be able to automate it. Because I will say, if I don't do technical work myself. And I sometimes find it hard to think like, how do you go about automating something like scientific discovery? It seems like... It seems like a novel problem that that you know the the things we can automate are things that we have done a thousand times often. These are the things that typically typically make sense to automate.
01:23:44
Speaker
and so ah just it's It's good information to know that we have some some kind of starting points where we would know what it is we're trying to automate. There's there' there's kind of a way forward. I think I can see it now.
01:23:57
Speaker
Yeah, I mean, I don't know if you should if you should be hugely optimistic. I can definitely imagine this going wrong. so So one thing that's worth noting is that the regime that we're in is kind of like we have all these AIs and we want to get them to do this labor for us. And we have kind of a few problems. One of them is like maybe the AIs are just like actually plotting against us, which obviously makes the making them do the work here harder. I think there's routes there. And then another another concern is like, well, the ads aren't really like plotting against us per se, but they aren't like, you know, by default trying their hardest or really like doing a good job of it. They're sort of just like, you know, they're like writing research papers that like look good, but aren't that good in practice.
01:24:34
Speaker
I think both of these problems can be addressed to to the extent that we can really understand how good research the AI did was, like how good it was based on doing detailed human review. I tend to be moderately optimistic about this. i think I think it's like pretty likely that that goes quite far, but I don't think it's like like a huge slam dunk. I think it it could end up being the case that it's that it's somewhat tricky. I do think that we'll be able to test how good we are at evaluating things. So I think like understanding whether or not humans are good at evaluating research is like somewhat doable. So for example, you can look at like
01:25:07
Speaker
how much do people agree? And even if people don't agree that much, you can at least look at, are people able to detach when research is fraudulent, where we like, you know, have, for example, like just experiments we can do right now, which I would be potentially interested in is like, have a bunch of humans doing research on safety, where they're trying to yield incorrect results. And then you have very careful inspection of random parts of it. And then you're like, okay,
01:25:31
Speaker
If we do that, let's suppose that, or we could say the whole project gets inspected. like I think you know you can imagine that we have these AIs doing tons and tons of research. We take like 1% of the stuff they did and have humans do a really deep dive into that stuff and see was it messed up.
01:25:45
Speaker
because Basically, like if we can ensure that less than 1% is messed up, then then we're potentially in okay shape. Do you imagine these these automated AI researchers to be producing papers in the same ways that that that human AI researchers are doing now? what Would the workflow be the same so that you're producing a PDF and you have some figures and and you have some some code to go along with the paper? Or or is that and kind of an outdated mode of of doing science once you have these advanced AIs?
01:26:17
Speaker
Yes, there'll be, I think, different regimes of how you automate things. so like I think early on, maybe you' have you'll have a regime where like the AIs are like running like helping you run experiments and you're sort of outlining the experiment, and the AIs are fleshing it out a bit, figuring out the details, implementing it, debugging it, proceeding with it. um and I could imagine pretty big speedups from that, though it's it's it's hard for me to imagine massive speedups, basically because you'll have this human bottleneck and and and You know, Omdel's law or whatever forever is familiar, like it can bite surprisingly hard or whatever. But I think that there's another regime where it's like the full automation regime and you're like humans are just doing oversight. They're basically not doing the work or to the extent they're doing work. It's only because we thought it would be like because we don't trust our eyes to do it. It's not because the eyes aren't capable of doing it.
01:27:01
Speaker
and In that sort of full automation or close to full automation regime, I don't have a super strong model of what the workflow would look like. I imagine that you'll probably be running lots of different projects in parallel and those projects will have a write-ups and the humans will be inspecting the write-ups and the AIs will potentially be sending each other write-ups and they'll be reading. I don't necessarily know if like paper is the right unit and I don't think that is like too important to figure out in advance. It's like a pretty reasonable baseline though or paper is a pretty reasonable baseline.
01:27:28
Speaker
Okay, let's talk about timelines. So when you think about how to create an an accurate model here for when we will have, say, human level AI, what evidence do you reach for? I think there's like a few different things that feel compelling to me. So I think one thing is like, well, let's just look at the GPTs or whatever. So we add, you know, GPT 3.
01:27:49
Speaker
We had GPT-4. And then we're like, okay, we have we know roughly what was up with GPT-3. And people like, for example, epoch estimates on roughly what's up with GPT-4. And then we're like, okay. And when you say what's up, you mean how much compute was used to to train the model? how what What was the cost of that compute? How much data did they use? Something like that. We have a rough estimates of those quantities.
01:28:12
Speaker
Yeah, so we have a rough rough estimate of scale, like we have a rough estimate of of cost. And we also have some, I think that then we can have some sense of how much algorithmic progress there was at the very least based on being like, what was the delay between GPT-3 and GPT-4? And what was the the total like,
01:28:29
Speaker
what's like the annual algorithmic progress rate, which there's some estimates for. And then you're like, OK, so now we have this gap. And then we're like, how qualitatively impressive is the gap? So GPT-3, it seems, I think this is this is one model, GPT-3, it seems very, very dumb. GPT-3 basically doesn't understand anything. like It can't really write code. I mean, maybe with a bunch of our tuning, it could. But it like basically is pretty confused. I think it's very clear that it just gets, it like can't understand ah instructions very often or or just like doesn't really get the situation.
01:28:59
Speaker
Whereas I feel like I'm like GPT-4, it's kind of with it. Like it kind of understands what's going on. It can do a bunch of stuff. It can like understand relatively novel cases. Okay. It's not amazing. And so I think there's like some model there where you're like, wow, GPT-5, oh man, that system could be very smart to the extent that a scale up of the same size as possible. So I think estimates, public estimates are like, maybe it's like around like a hundred X additional, like bare metal computing resources into GPT-4 relative to GPT-3.
01:29:29
Speaker
And so a similarly sized scale up, maybe you'd expect the same, but notably ah it's it's much, much harder to do a similarly sized scale up on top of GPT-4. So you might expect a longer delay before that. That, if I sort of

Scaling Challenges and AI Progress

01:29:41
Speaker
naively apply that model, that model makes me think that, or like this view of events, that makes me think GPT-5, which maybe you have in, I don't know, I think um there are some, maybe maybe you have that, maybe you already have that, maybe you have that in a year or two based on, based on the scale ups.
01:29:57
Speaker
I think prediction markets right now say that we will have GPT-5 around January 2025. Yeah. The slightly awkward thing is GPT-5 is a marketing term by OpenAI and also a term for a model that is roughly that much marked better than GPT-4, unfortunately. so ah one One baseline we can say is like, at what point will we have a cluster that is going to or like why what point We'll have a model trained with 100x or like maybe a little less than 100x compute relative to GPT-4. GPT-4 public speculation is like maybe it was around 17,000 A100s. Let's say that our new model is trained with a similar duration to GPT-4.
01:30:39
Speaker
So you would need to have, you know, a hundred X that much that would naively be, or that'd be like 1.7 million A100s, but there's hardware progress. So H100s, I think are roughly three X better than A100s. And you'll have your next generation chip that you're like.
01:30:55
Speaker
gee be GB200s? I don't know, man. there's I forget the name for all that that the new chips, but maybe you'll have another multiple from that. And then I think so that would probably be, for for example, for H100s, what would it be? It would be like you need like several hundred thousand, maybe it would be like around 300,000 to H100s would suffice. So there is there are known clusters right now that are about 100,000. So my sense is if you have 100,000 H100s, you're at like slightly over 10x compute, then then the the cluster GPT-4 is trained on. And with you know another factor of three on top of that, you'd be basically all the way. So it seems like it could be in the next few years you'd have that. And then I think ah it's plausible that
01:31:37
Speaker
you know, scaling in in hardware has sort of is going to get, it it is now a relatively smaller component relative to algorithmic progress based on public estimates of algorithmic progress from epoch, for example. So epoch has various estimates, epoch AI or whatever has various estimates. And those estimates, I think, potentially indicate that the rate of hardware scaling it seems like maybe it's slowed a bit because like there's i think a pretty fast like a huge amount of additional compute from GPT-1 to GPT-2 to GPT-3, whereas it seems like it's been slower to get the same compute increment on top of GPT-4 based on public reporting. And then also, in addition to that, it just seems like, okay, at a basic level, i'm like at some point, it just actually gets hard to scale as fast because you're running into power constraints, you're running into whatever. There's some
01:32:20
Speaker
there's a recent epoch report on what's the biggest training run by 2030, but they're sort of like, okay, what's the biggest training run if you like tried really, really hard? And I don't know. But but anyway, I think that that these these views end up making me think that, or like if we sort of just apply the GPT-5 would be pretty scary, GPT-6 quite scary, which I think is like the most like short timelines like are or like style case, or like they're like the case that makes me feel most ah strongly short timeline Z. I think that ends up making me think, like human level AI by 2028, 2029 is pretty plausible, even taking into account the fact that it'll be a pain to scale this stuff up. I tend to think things are going to be a bit slow. And that's so, for example, I think that's right. I should say those views tend to think that you'll, you'll very likely have AI by that point, like maybe like 70% or 80% have AI by that point. So then I'm like, okay, but
01:33:10
Speaker
Here's another perspective that seems very salient to me. GPT-4, it's just dumb. as You can look at the gap from GPT-3 to GPT-4, but there's also just a sense in which is like it's like pretty dumb. And your hire for like does this paradigm scale all the way through should not be that high. There are some plausible stories for why you can't just scale up current systems forever. You need some additional stuff. People are working on additional stuff, but like who knows? right or like like you know Surely people are you know trying to make things that go beyond peer imitation or whatever.
01:33:39
Speaker
Um, we're like, you know, the, the current like fully supervised, like, you know, mostly you train your AI next token prediction, and prediction paradigm, which maybe, you know, you have data limitations for that. And also there's a question of like, how much can you learn from the data? You know, these are unanswered questions to be clear. We don't know how it goes. And then I also just put some probability and like.
01:33:58
Speaker
The current stuff totally flops at deep learning. you know It hits the wall, as as they say. like I don't think we should be that confident. It doesn't. i that that That leads me to being like there's some you know there's a bunch of mass in these sort of relatively fast scenarios where just like you just scaled it up. Some kind of relatively straightforward stuff worked. um You maybe have to scale up ah various things. but like I can imagine being the case that pure imitation hits a wall and so you need some additional stuff, but you know even that, it could be achievable. like These things are not necessarily out of out of reach. Do you think it's ah it's a story then? so say Say deep learning actually hits a wall. Do you think it's a story then of an AI winter period, say perhaps 15 years where where where not much happens or
01:34:41
Speaker
Or are we so far ahead so far along this kind of this kind of development that that something like that doesn't make sense? Do we have so much investment from from companies and universities and but just researchers in general that that such a slump is unlikely?
01:34:58
Speaker
I mean, I think having huge amounts of investment has not stopped things from failing in the past. or like that is true I would not assume that just because there's huge amount of investment society powers through or something. For one, like maybe there there might be like a general frontier AI winter, but not a applications winter like I'm like GPT for already pretty impressive. You can already do a bunch of stuff with it. Image models, a bunch of stuff. It seems like video generation. I mean, like it seems like it's just going to happen. And like I think even if it like there is some point at which it that stuff stalls out, it seems like even just a relatively basic continuation of current stuff gets you gets you some very economically interesting stuff, I think that maybe not huge, huge economic application that that seems plausible. I think maybe a thing that seems more basic than that is like
01:35:46
Speaker
ah there's a bit there's a like you know people scale, the scaling is unimpressive relative to the money put in, it doesn't pay itself back, and that people basically slow down how much they're spending and and and and that. And it's not clearly like a huge winter per se. So so anyway, maybe I should like outline some perspectives. like One perspective is like GPT-4 kind of dumb. Another perspective is the GPT-3 to GPT-4 gap is very impressive. So maybe GPT-5 is very scary.
01:36:10
Speaker
And then another perspective is like, GPT-4 actually can do a bunch of stuff already. So that's kind of scary, like the counterpoint to the DOM. It's like, well, which you have said it was dumb 10 years ago perspective. And then another another perspective that I think is very salient to me is sort of like at the Kurzweil style perspective, which is like, once you have the compute, the AI will come perspective, which is like, well, right now we have enough compute that you can run, this is sort of the bio-anchorsy perspective for listeners who are familiar with that. Yeah, I found a J on the on the podcast discussing that report. It's a while ago, but it's ahs it's worth reading.
01:36:46
Speaker
But like one perspective is, well, we just right about now have enough compute that training runs are using compute comparable to how much compute we think humans use over their entire lifetime. If you sort of do try to do an apples to apples comparison, it's it's rough because it's a brain versus a computer and we don't know what's going on in the brain. But people, you know they try to do some rough estimates. And and if I recall, like it seems like that lets you makes you think, like i think like ah right I think it's around 2028, 2026 flop is how much we think humans use a lifetime. Someone should fact check me on that. but And then I think I'm like, well, we're getting there right about now. We're not super certain what will suffice. And you know we're not certain what will work. But we're like kind of around the right amount of compute for humans. And maybe you expect that first we'll have algorithms that are less efficient. Compute has, and in in in the case of deep learning, driven a bunch of progress. And so there's some just basic story, which is like,
01:37:40
Speaker
once their compute is there, the algorithms will come. It's not exactly clear what the algorithms will be, but there's lots of stuff people can try. And it just like, yeah, it's just like a question of like the cost being doable. I think I put a decent amount of weight on this perspective, but this perspective doesn't necessarily imply things are super fast. So like, you know, we've, there's some uncertainty over how much flop you need and like some uncertainty over exactly when this this pans out because like you know maybe you need quite a bit more than and in humans. Maybe we're underestimating how much humans use. and Maybe it's really important that like there was a ton of search done. to like like Evolution did a ton of search to find the exact right brain architecture.
01:38:17
Speaker
I'm not the the most short timeline Z person. I think I would

Securing AI Models and Data

01:38:21
Speaker
say if someone had to, if someone put me on the spot and is like media into AIs that are qualitatively similar to like top human scientists, maybe I'm like, I don't know, man 2034, which is, you know, some sense very short, but also not, not, not that short. And then.
01:38:35
Speaker
some uncertainty over how fast things will go from there. you know If it's five to 10 years, that is not a long time. What would you prioritize and what would you deprioritize if we have these short timelines? you know what where should we Where should we spend our attention and resources if we're going to have human level AI within 10 years, say?
01:38:56
Speaker
One thing I should note is like there's some stuff that has low regret. like it's like There's some stuff where you're like, okay, this is good for short timelines and doesn't like really screw things over in long timelines. So I think, for example, lots of types of research are low regret because, like well, you just spent like your your time on it, you know nothing that bad happened. But I think in terms of like politics and policy and various things along those lines, things can be high regret. Like there can be, you know, policy can be sticky. It can be that there are political movers that are very costly, that are that can be high regret. I think I have ah some views on the policies, like what should happen from a policy side on short timelines, um especially if the situation is like the government is like
01:39:38
Speaker
Okay, Ryan, what what should we do? I think I have less of a view of like what's good to do on the policy side of the margin, basically because the political considerations are very important for that. It's a big question of like how much do people get worried about AI and and want to take action at various points. But at least on the technical side, my sense is like I think you should first, I think the security, security, security angle is just like very salient to me. So I think once you have AIs which are able to substantially accelerate R and&D,
01:40:07
Speaker
I think that you would like to have sufficient security at whatever like you know a company, government, project, whatever is developing that, that you can be confident that very high-resourced efforts would not be able to steal the system. I can imagine getting by with lower levels of security via diplomacy. like as in the The situation I'm imagining is like there's a US company or potentially a US government project.
01:40:30
Speaker
I think from my perspective, it seems like the world is much more likely to go well if that company can prevent their model from being stolen by other countries. But I can imagine that even if it's the case that, for example, China can steal the model with a large amount of effort, you can potentially you know pull China into the coalition rather than you know being in an adversarial relationship and then that that can be okay. I think that in order to basically, in order to slow down, I think you'll need the security or in order to be able to be willing be able to pay a tax, you need that.
01:40:57
Speaker
So I think that's like, and that might be very hard to do in advance. I tend to be maybe more optimistic about throwing together the technical safeguards other than secure native events. So I'm like security, security, security, and that's both model weights and also making it so that the, you know, other people aren't running huge amounts of inference in your data center that you don't understand. Cause if, if it might as well be their model, if they can just run inference in your data center, and then there's also algorithmic secrets where I think my sense is the current algorithmic secrets are maybe It's maybe too hard to secure them, but at least once we have AIs that are speeding things up a bunch, we can secure further algorithmic secrets at least.
01:41:35
Speaker
do Do you think that becoming secure as ah as an AGI corporation, do you think that slows you down? Do you think do you think there will be some pressure to you know to to not pay attention to so to security? When we talk about securing the the corporations, it seems to be both in their interest and in the interest of some of the government's concerns, it seems to be something that that people potentially could agree on.
01:41:59
Speaker
except if there's a ah large cost to doing this security. You know, if everyone working there will have to be is ah thoroughly screened and investigated, that might slow things down. But yeah, but how how how big of of a problem do you think, or how much of a cost do you think security would impose? So my sense is that security, it's going to be very rough to just scramble it together last minute.
01:42:25
Speaker
But if you do some prep work, then it should be possible to implement kind of last minute at only like high-ish cost. And then I think that a key dynamic is that security does not merely have cost to implement, it also has cost to operate under. And so it will slow down, potentially slow down your researchers to have really intense security. Like the sorts of things I'm imagining for reference are things like, or I don't know, there's a few different regimes, but maybe an early an early but still very secure regime I'm imagining is like,
01:42:54
Speaker
You have your data center, your data center is basically air gapped except that it has one connection to the outside world. You heavily limit uploads through that connection and and basically force your researchers to be very careful with how much data they upload from the data center.
01:43:08
Speaker
You make it so that the data center is very physically secured. There are no side channels. There's a bunch of stuff along these lines. I have a blog post on this called them Upload Limits Seem Promising for Security or something. I forget exactly how I said this. But like I think that basically the the hope is like make it so the pipe out of the data center is so small that you can't get out the model weights even across the entire lifetime of the data center. But that's expensive. That's going to mean that you you know can't do a lot of the things you want it to do potentially. I think you it or you might be able to. it It depends on some details about how things end up going.
01:43:38
Speaker
And then there's like, potentially you want to very, very carefully vet any code that touches the model weights. And there's various reasons why that might be like really important. And once you're in that regime, like it's just very expensive for people to be doing iteration on all kinds of different parts of the pipeline.
01:43:53
Speaker
and that just adds a bunch of cost. I think an important point you you make in in that ah post you just mentioned is that these models are are extremely large files. so it's not It's not like you can you can kind of attach a model to an email and and send it or something like that as you can with with say kind of more standard corporate secrets. That is why upload limits might be promising.
01:44:16
Speaker
Yeah, yeah, I mean, to be clear, I think it's still the case that you can, you know, like, perhaps GPT-4 based on public estimates is maybe like, what was it like, two trillion parameters or something, I forget exactly. So that's like, you could, you know, if you represent that in an FBA or whatever, that's like, you could fit it on, you know, several terabyte thumb drive, which there exists several terabyte thumb drive. So it's like, in some sense, very compact, but in some sense, also very big. and and And you might just not need to upload that much stuff out of your data center or like you might be able to get by without that. So there's like, yeah, there's this upload limit story, there's a bunch of like, ah things you can do related to on-chip onship security measures. um and But a lot of this stuff I think will just be both somewhat costly to implement, ah somewhat costly to operate in. And i think that I think you'll ultimately, because of at least algorithmic secrets, so so I think a lot of the things I've described so far maybe allow you to secure model weights, but don't allow you to secure your code base, don't allow you to st secure your algorithms.
01:45:12
Speaker
And I think for securing those things, my guess is that given just the the level of power that you know hackers have in practice, like how good how good at the best hackers like the best hacking efforts are, how good like top resourced like you know Chinese hacking efforts are, I think you'll have to basically go to a regime where everything is air gapped and the code base is heavily siloed. And that will be potentially even more costly where it's like you know all your developers are working out of a a facility where they can't bring in their cell phones and it's in a Faraday cage. And they, they you know, they they like bringing information to the outside, like it's disconnected from the internet. To bring and information to the outside world, you have to like print the PDF and the printer inside the office and then the security person looks at it or like, you know, like I'm just like, it is actually very costly to do these things. And I think you will have to transition into this regime.
01:45:59
Speaker
I don't know, but this is, i'm I should be clear. I'm not, I sort of moonlight in security, but I'm not a security expert. People can read the, you know, Rand security model weights report for more details with people who are experts. I, you know, have some minor quibbles with that report, but I think basically that's a pretty good rundown.
01:46:16
Speaker
Anyways, that's security. On the technical measures, I think if I am imagining that we're going to be in this rough short timeline scenario and we cannot, like and and we we you know really need to be ready to proceed, I think I would be a decent amount of focus on on the on the mitigation side. I'd be like, maybe you do, from my perspective, you do like sort of like half control a fourth looking for evidence of misalignment and a fourth basically like I would call this like bag of tricks or just like random stuff where we're like, is that going to help? We're like, I don't know, probably. Maybe if we do some experiments, we'll know whether or not it helps. But like when there's no real very confident story for it. This would be stuff like training various types of probes, maybe trying to adjust your training regime so you think it's like a little less likely to result in misalignment. Maybe I don't know, there's a few different directions there.
01:47:02
Speaker
And then on the control side, I think that there's a bunch of different work related to like threat modeling. There's constructing test beds that you can evaluate whether or not you have control. There's building countermeasures and figuring out what countermeasures are likely to be best. I don't know, there's a whole breakdown. I think I'm planning on maybe putting out some stuff on like what does like the basic breakdown of labor look like? What are the things that need to happen for control to actually happen?
01:47:26
Speaker
So I think that that would be if I was like in AI lab right now being like, what should I do? That would be like, if if you're if you're willing to just do whatever I say, that would be what I would recommend. I

AI Development Timelines and Takeoff Speeds

01:47:35
Speaker
think that there's a there's a there's also like a much more half-assed version that I think could go kind of far. I think if you basically do like sort of a poor man's control and you don't really evaluate it and you just sort of slap it together kind of last minute and do the most basic things. I think that helps quite a bit, actually. I would recommend that. but So that's my sense of the mitigations. I think you also want to be preparing for applications. So like to the extent we want to you know be automating all these domains, but actually you want to be automating you know
01:48:00
Speaker
like safety few research that we were discussing, but also possibly you want to be automating security work. You want to be doing tons of of like active security stuff with your AIs. You want to be potentially using your AIs to like develop evidence for a bunch of things. You want to have maybe like aistite like deployments related to like improving epistemics. There's lots of stuff you might do. I think you should be like prepping for these deployments. Like you should have people who are already starting to work on these things, who are trying to make it happen with current models.
01:48:26
Speaker
who are figuring out what the key bottlenecks are going to be so that you can like scale that up when the time comes. You should be getting ready for how you would ensure that things slow down a bunch ah later. There's a bunch of different angles there. Ryan, we we we talked about timelines. what What's the difference between timelines and takeoff speeds? How do they relate?
01:48:44
Speaker
Yeah. So oftentimes when people are talking about timelines, they mean timelines to a specific event. So for example, we were earlier saying timelines to like AIs that are qualitatively similar in capability level to humans, maybe like at the first point where the AIs can basically strictly dominate humans and based on their capabilities. But you could also talk about, okay, what's the point from that point to something else or from some earlier point to that? So for example, we could talk about the time from AIs, which are able to substantially accelerate R and&D to the AIs, which can basically replace humans. So for example, current AIs, they don't seem like they can substantially accelerate things, or like at least it seems like people aren't you know fully having, like they're not running much faster, it it appears. um But at some point, you might have AIs, which are like, you know they can do a bunch of subtasks, they can speed things up a bunch, and what's the delay between that? And that's takeoff speed. So the speed from like some earlier milestone to some later like milestone.
01:49:37
Speaker
I think a really key question for me is this the time from you have these systems that are you know maybe qualitatively similar to to to to humans, or at least they are just at the point where they strictly dominate humans. Maybe they're qualitatively very different in some important way. And two systems that you're just like, wow, these systems are super, super human. They're very, very scary. And what's the delay there? And so that that's that's the distinction. And I could get into why people have various views about takeoffs. Is that a good place to go next? or Yeah, I think so. But we we should also say, that yeah what does it look like? How do we know whether we are in ah in a fast takeoff or slow takeoff world? One of the things with even the fast takeoff is that almost all the progress happens late. And so maybe in the beginning, it it looks like a a slow takeoff. But you know how do you how do you distinguish between but the two worlds you could be in?
01:50:28
Speaker
I think there's like sort of two important things going on with takeoff. So one of them is just what is the rate at which AI is improving over time due to ongoing human efforts? So, for example, you can be like, well, there's this like, there's a time between GPT-3 and GPT-4. And then there's like, when will the next model that's like that big of an increment come out again?
01:50:46
Speaker
And this is just being driven by humans doing research, hardware getting better, people buying more hardware, just all these developments over time. And how basically how much progress you get per additional like duration in time just from human efforts is one factor. And you can imagine it being the case that it's just like it's naturally the the way that the capabilities go that you basically have human level systems and then just the next two-year increment scale up that humans would do.
01:51:12
Speaker
gets you to much smarter systems, basically because like the additional compute, the additional algorithmic progress just gets you that foot much further. But there's a different argument, which is which is quite important, which is that maybe AIs will automate stuff substantially, and that will speed you up a bunch. As in, maybe it's the case that you'll have these systems that are you know substantially accelerating R and&D, and you'll be like, well, we're going to accelerate R and&D into AI software and accelerate R and&D into stuff that will make our next systems more powerful.
01:51:40
Speaker
but potentially even accelerate R and&D into hardware where like you know there's some rate at which Nvidia produces new chips every year, some rate at which all the other people who are making chips, TPUs, whatever. And you could be like, well, they had a bunch of human labor going into that. Maybe human labor is a key limit. And if we just automated it and had huge amounts of AI labor going into making the chips better,
01:52:00
Speaker
that could potentially substantially accelerate progress because basically previously we were bound by these humans who were, you know, there's not that many of them, they're slow to train, very limited supply, but, you know, you now have these AIs where you can just basically make huge numbers of copies and run them on tons of stuff and they can potentially also run much faster in serial speed, which i think I think the serial speed advantages could be very important where even if your projects are bottlenecked on, like it's hard to coordinate between many people,
01:52:27
Speaker
The AIs can just run way faster. Also, the AIs might be much better at coordinating because they're all copies. They can share a state more easily, this sort of thing. And there's one more dynamic, which I think i think is less important than the previous two, but but it's still important, which is that as AIs get more powerful, people will be more potentially excited to scale things up further. And there'll be some, like, I don't know, sometimes people use the term wake up or whatever.
01:52:47
Speaker
and basically like as as a eyes hit various milestones you could imagine that the rate of investment increases and there's some some sort of like step change there though it's also plausible that what happens is the rate of investment is constant but you need you know of course bigger and bigger advances to justify that because you're getting into you know.
01:53:05
Speaker
billions

Secrecy and Transparency in AI Research

01:53:06
Speaker
of dollars, then hundreds of billions of dollars, then actually even trillions of dollars to to further scale things up potentially. Yeah. I should also say that i'm just I'm just a parent in some sense. I'm just, this stuff is all stuff from like, for example, Tom Davidson has this report called a compute centric takeoff model, timelines model, something, which which I basically think is a pretty good model or like a pretty good starting point. And so I i basically roughly agree with that perspective.
01:53:32
Speaker
and And for listeners interested, they can scroll back in the feed and and find my interview with with Tom. The model is very interesting. It's probably the most in-depth work on take-off speeds that's been done to date, I think. That's been published at least. yeah that's been Yeah, maybe there's some secret stuff going on that I don't know about. This is just something I'm interested in. How ah how much do you think is being published right now? Do you think, for example, if an AGI ah corporation makes some some leap forward on synthetic data, do you think they would publish that?
01:54:01
Speaker
Or is it the case that just more and more you know frontier research is not published? I can speculate about, let's say, OpenAI, and I can be like, okay, what might OpenAI be doing?
01:54:12
Speaker
So it seems like in practice, OpenAI does not publish advances related to transformers in terms of what they're doing differently. And in fact, they haven't published that much details on some other things. So for example, it is widely speculated that GPT-4 is a mixture of experts' model.
01:54:30
Speaker
but like that is Certainly, yeah, it's widely speculated. Seems likely to be true to me based on my read of the speculation, but whatever. And they did not when they published the GPT-4 report say, this is a mixture of experts model with this architecture. And in fact, the reason why people think that is because of leaks and the leaks basically matching with other like other analysis you can do based on how fast the model runs and things like this.
01:54:53
Speaker
so Basically, they don't publish that. I would guess that various things around synthetic data, they also wouldn't publish, but there are also leaks. So for example, there's been recent leaks about like strawberry, which I mean, I guess probably exists because the leaking is relatively credible, but like who knows exactly what it is, who knows if it's overhyped, who knows how legit it is. I think as of the time of recording this podcast, there are rumors that it will be released in within two weeks, but who knows if that will be true. ah and And there are also rumors that, for example, they're using or therere they're rumors that they're using that the strawberry thing, which is like some inference time way to spend more compute.
01:55:29
Speaker
to generate synthetic data for training runs that they have done or are doing, and they did not publish that. so But it has leaked, so that tells you something. So my sense would have been that at least OpenAI is not going to be publishing their advances. I think my guess would be that like GDM has a bit more culture of like sharing things or like having more academic norms. And so they do have papers. For example, there was like a paper on some architecture, I think they call it like Griffin, which is like a different architecture you could use for your transform for you like language models. But if i if I were to guess, my guess is that probably if that was if they thought that was a big deal, they wouldn't have published it. And there's some classic meme tweet, which is that you know a Google researcher goes and sees if they can publish their idea and google comes back and like the Google execs come back and are like, yes, you can publish. And they're like, no, it must not have been important. I forget the exact tweet or whatever.
01:56:23
Speaker
I mean, they they did publish the original Transformers paper, but maybe they they didnt they didn't know what they what they had there. I don't know. Yeah, I mean, the norms have changed over time, certainly. So I think that people have basically tightened up as there's been more of a view which is like, oh, wow, this AI stuff might be like really critical commercially, and also a view that like, you know, this will, you know, yeah, like the they're like algorithmic stuff is is very important. I think even notably, like even so meta is much more open in some ways. So the architecture is known, and they have a paper that discusses a bunch of details. But they still, they say in the paper, like,
01:56:57
Speaker
we have some improvements from, I think they said like, there are no improvements in architecture, but like, there may be some improvements in data or I i forget the exact language they use in that paper, something along these lines. And they did not publish what their improvements in data are. They didn't publish their data set. They didn't publish their training code. Like, there's a bunch of stuff that they're not even, even you know, meta, who's relatively more open, is not publishing.
01:57:17
Speaker
So that that's descriptively what's going on. Normatively, I think it's probably bad to advance publicly available algorithmic progress for yeah like for a few different reasons. And so I ah ah prefer this regime to a regime that would be more open. But at the same time, I have some concern like, well, in the future, it would be nice if people could talk about what approach was used for training the eyes and how safe that is and and and You know people talk about making safety cases and in some cases those safety cases might depend on details of how the AI was trained that. Reflect

AI vs. Human Cognitive Processes

01:57:50
Speaker
on on on information that people would prefer to by default keep secret and so I am worried about a trade off between transparency and secrecy here I think that there are you know there are potentially like near parader improvements that.
01:58:02
Speaker
could happen. But I also think I'm like, I worry a bit that the AI safety community pushed too hard for secrecy, or like was generally like sort of a pro secrecy advocates in ways that are, you know, actually a mistake. Okay, let let's, let's end by talking about kind of human cognition versus AI cognition. You mentioned a couple of times, this kind of the profile of capabilities that are different from between from AI's to to to humans. This is something that's that's pretty important to get a grasp on. I think if if we want to understand you know what's going on in AI in general and and where is AI headed, why is it the case that that AIs have ah have a different cognitive profile ah than we do? And what can we learn about the future their future development? so
01:58:48
Speaker
There's more of x-paradox of ah humans being great at at at things that are very difficult for AIs, and such as like catching a ball, whereas AIs are amazing at doing kind of basic math that we that we couldn't couldn't begin to do. so what is yeah what what What can we learn from from these differences about where AI is going?
01:59:10
Speaker
Yeah, so one thing is like, there are a bunch of different systems. And I'd like to just very specifically talk about like frontier transformers, or like frontier, I should really say frontier language models, or really frontier multimodal language models, which is a slightly weird term, but basically models that seem to be trained on prediction tasks and then potentially our fine-tune with our LHF or that's at least based on based on public knowledge what the paradigm is looking like. So there are some, so these systems do have pretty different capability profiles and I think a few intuition pumps for what's going on. So one thing is that GPT-4, for example, I think there are rumors that it was like trained on like several trillion tokens, but people are scaling up. So like maybe let's imagine our system is trained on, I don't know, a hundred trillion tokens just to pick a nice round number.
01:59:53
Speaker
um which is, I think, like notably more than GPT-4, but not wildly more, such a system will have been trained on more tokens than humans see by a huge margin. So for example, if we look at how much text has had two humans read in their lifetime, my sense is that it's wildly lower than 100 trillion tokens. And in fact, 100 trillion tokens is like comparable to all of the internet texts after like deduplication based on public estimates from epoch, which is like,
02:00:20
Speaker
So it's like a truly vast amount of of data that they have been trained on, you know, there might be like multiple epochs and whatever, but but roughly speaking. And so I think I remember it it's like, it's as though the AIs were, if you imagine that ah the AIs are spending, like, let's say, let's say the equivalent of a second per token, it's sort of like the AIs are spending, I think it ends up being like three million years or something, writing, they're basically just nonstop predicting tokens for three million years. And that's what they're doing. And I'm like,
02:00:49
Speaker
that's like very different from what humans are doing. And you can also look at the data. So I think if you want to get a better understanding of what's going on with ah language models, I do think it's somewhat informative to just read data that they might be trained on. So like read data from common crawl that people seem to have trained their models on based on you know what but what's publicly known. And I think that data is just like, it's crap lately. It's a lot of it is just random garbage. It doesn't really make that much sense. It's just like random stuff people have written on the internet. And this is like what the models are learning from, or like at least a substantial fraction of what the models are learning from. And in fact, they're very good at it in terms of predicting tokens. like They're very, very good at predicting next tokens relative to other tasks.
02:01:27
Speaker
and And this gets you models in practice that have very wide capability profiles in terms of their knowledge. So GPT-4, whatever systems, I think can speak every language pretty much with enough tokens. And you can certainly easily get them to speak any language where there's enough written data on it.
02:01:44
Speaker
pretty well, like pretty good at those languages and translating between them. They know basically almost all subjects that are written about on the internet in cursory detail at least, sometimes in more more detail depending on the amount of text. And they can potentially do a bit of mixing about these domains. So like it's like systems, there's ah there's a question of like what could systems in principle be capable of and what do they do now? Because they might be sort of not fine-tuned that well. But in principle, GPD4 can sort of like you know speak in Welsh while simultaneously like doing things in computer science that require like understanding a bunch about sociology and biology and all these things because it can just sort of hit all these domains and just has knowledge across all these domains. And in general, I'd say the systems are super knowledgeable in ways that are that show up sort of in in breath, but also in like ways where they can basically do I'd say they're more pattern matchy than humans are, I think, because they've seen more stuff.
02:02:37
Speaker
I think it's it's easy to be wrong in either direction about how impressed you should be about language models. There's another classic tweet which is, be careful not to be ah not to like you know underestimate language model capabilities, be careful not to overestimate language model capabilities, or like the biggest mistakes, underestimate language model capabilities model capabilities.
02:02:55
Speaker
i think I think people should be careful on both. so the um like they're They're doing some amount of of of of sort of more like pattern matching to an extent that humans can't do because they were trained on more. They're also able to like you know be pretty general. And to be clear, humans are also doing a bunch of pattern matching. it's not you know These things are not that special. It currently seems like models are better at coding than a bunch of other tasks.
02:03:17
Speaker
It's possible this is because of people like focusing on coding, because that's like a pretty good application of models. And it's somewhat unclear how this will scale going forward. So if you know people run out of data or run out of high-quality data, or it just turns out that, for whatever reason, other things are are useful, then you could imagine you know way more synthetic data. ah which is i think I think you can think of synthetic data kind of similarly to ah RL or something, where it's like you have AIs doing tasks to see if they succeed. if you If they succeed, you train on it. But you can also do like you know fancier stuff, potentially, than that.
02:03:47
Speaker
And like that might mean that AIs get much better at domains that are easy to check, relative to domains that are hard to check, because basically you can run huge synthetic data pipelines on all the domains you can check, but not on the domains you can't check as easily. And so that might get you systems which are have very good heuristics for domains that have sort of a lot of ability to check. And what would be some examples there? what What are domains that are easy to check versus domains that are difficult to check?
02:04:12
Speaker
So a classic example of domains where that these sorts of things work really well is Go or chess, so board games. So like basically, like they both have the property that you can check if you won, and you can simulate the entire world in a computer very easily, ah or you can simulate the whole state in a computer. But also they have the property that it's very easy to tell who's winning, and so you can do lots of stuff like knowing how good of a state you're currently in, and it's even more easy than that. So on on domains that we like you know actually care about for commercial applications, I would say like software engineering, you know you can see you can write tests, you can see if the tests pass potentially. Math is even easier is even easier, so there's a recent ah stuff by by by Google DeepMind on, I think they call it like alpha proof or something, and where they did IMO problems and they had their AI
02:04:59
Speaker
And they trained on a bunch of, basically, I think they trained they explicitlyly they explicitly said they trained on a bunch of, you you have your AI generate lean proofs, you train when the proof is correct. And you do a ton of data generation or like a ton of training like that. And then even beyond that, they also did at runtime, a huge amount of runtime search to find the the correct proof. And that's another case where it's very easy to check. you can you know in in In Lean, you know you you can just check if your proof was ended up being correct because it's just um like a formal verification system. And you can also even do stuff like see how close you're being. where I think this is a little hard for me to explain without people having used Lean before or used similar proof-resistant tools. But like basically midway through your proof, you can sort of see what the current state is and see what like additional things you need to prove, additional like statements you need to resolve, what sublimas you need, whatever.
02:05:45
Speaker
And so you can basically get a sense for if you've made progress. but there And then on the other side for domains that are harder to check, I think ah there's like conceptual like philosophy style work where it's like even humans have, you know, we're having a very hard time checking this stuff and people disagree wildly and never come to conclusions on it. and And there's huge divergence in what people think and they can't even resolve, you know,
02:06:09
Speaker
things with with that. And so you could imagine the AI is basically having pretty bad heuristics for like sort of hard to check domains while being very, very good at it easy to check domains. And I think this poses some some safety risks in terms of differential acceleration.
02:06:22
Speaker
But is that what we see? I mean, isn't AI great at at generating art and generating kind of like, I mean, yeah you can you can probably have an AI generate some some ah French philosophy and it'd be pretty decent or pretty difficult to distinguish but from the real thing. Yeah, so on the art case, it's worth noting that art is actually, so like the, that we're getting pretty far on imitation with art. And it just turns out that like, imitation goes very far on art because like, I think,
02:06:50
Speaker
or like just sorry imitation training and and and just just doing that. And I think RL also works reasonably well for for looking at RL in human preferences, because then you don't need that much data to get that somewhat further from my understanding. I would say current AIs are pretty bad at thinking of ideas and checking their ideas in more conceptual regimes that are bad at like the types of like strategic philosophy style work that, for example, seem relevant to AI safety or AI safety strategy or these sorts of things.
02:07:17
Speaker
Uh, that's like my experiences. They're not amazing at this. They seem pretty good. at They seem good at writing, at least when fine tuned well for that. I think by default, the fine tuning maybe it doesn't work that well for writing or at least for like opening eye models. Maybe people say anthropic models are a bit better for this. I think notably models are trained right now with imitation. So I think the bigger factor or like with lots of limitations, I think the bigger factor right now is more like.
02:07:38
Speaker
In what domains is there lots of high quality data? And to what extent is that high quality data sort of like all good versus mostly garbage? right like Like I think there are some domains in which AIs like from my perspective perform poorly, but maybe it's because a lot of the internet data is incorrect from my or like I would guess I would claim it's incorrect or like done.
02:07:56
Speaker
you know, if you do a bit of RHL on top of that, maybe you could mostly stamp out these things. But there are some issues like this, like an example is like, by default, without a bunch of prompting, like if you just ask like, you know, Claude or whatever for probabilities on things, I think its probabilities are both miscalibrated and crazy. And then you can I think with prompting, you can get quite a bit better. But it sort of doesn't do this out of the box. And like, there is a bunch of good internet data on people describing quantitative forecasting and how to do quantitative forecasting. It's just that by default, it doesn't doesn't seem like it's it ends up doing this. Even if you and think even if you explicitly ask it, do quantitative forecasting, I think it does a bad job unless you like really extensively prompt it is my understanding. so that's That's kind of like a a whirlwind tour of the differences and what we might expect. That's great. Ryan, thanks for coming on the podcast. It's been a real pleasure. For sure. It's been a good time.