Become a Creator today!Start creating today - Share your story with the world!
Start for free
00:00:00
00:00:01
Dan Hendrycks on Catastrophic AI Risks image

Dan Hendrycks on Catastrophic AI Risks

Future of Life Institute Podcast
Avatar
417 Plays1 year ago
Dan Hendrycks joins the podcast again to discuss X.ai, how AI risk thinking has evolved, malicious use of AI, AI race dynamics between companies and between militaries, making AI organizations safer, and how representation engineering could help us understand AI traits like deception. You can learn more about Dan's work at https://www.safe.ai Timestamps: 00:00 X.ai - Elon Musk's new AI venture 02:41 How AI risk thinking has evolved 12:58 AI bioengeneering 19:16 AI agents 24:55 Preventing autocracy 34:11 AI race - corporations and militaries 48:04 Bulletproofing AI organizations 1:07:51 Open-source models 1:15:35 Dan's textbook on AI safety 1:22:58 Rogue AI 1:28:09 LLMs and value specification 1:33:14 AI goal drift 1:41:10 Power-seeking AI 1:52:07 AI deception 1:57:53 Representation engineering
Recommended
Transcript

Introduction to the Podcast and Guest

00:00:00
Speaker
Welcome to the Future of Life Institute podcast. My name is Gus Docker and I'm here with Dan Hendricks. Dan is the director of the Center for AI Safety. Dan, welcome to the podcast. Glad to be back. You are also an advisor to XAI. Maybe you can tell us a bit about that.

Elon's AGI Project and Concerns

00:00:18
Speaker
Sure. So XAI is Elon's new AGI project. It's still very much in its early stages, so it's difficult to say specific things about what they'll be doing or what the specific high-level strategy is to give a sense. Elon has been interested in the failure mode of sort of eroded epistemics, where people don't
00:00:43
Speaker
have a shared sense of consensus reality, and this might make it harder for a civilization to appropriately function.

Goals for AI Systems

00:00:49
Speaker
There are other types of extras that he's concerned about as well. His probability of doom or that of an existential catastrophe is around 20 to 30%. So he takes this, I would guess, more seriously than
00:01:03
Speaker
than any other leader of a major organization but exactly how one goes about reducing that risk is still still somewhat to be determined there is an interest in building more true seeking eyes but you know another occasions do you mention that we should have eyes with the objective of.
00:01:21
Speaker
preserving human autonomy or maximizing their freedom of action and on other instances in thinking about good objectives for AI systems, having them increase net civilizational happiness over time. So I think that this reflects sort of a plurality of different goals that he thinks AI systems should end up pursuing rather than just picking one.
00:01:45
Speaker
I think it's relevant to note that it's a fairly serious effort. I'd anticipate that it would probably be one of the main three AI companies next year or the year after, like OpenAI, Google DeepMind, and XAI. So I don't think of it as a smaller effort, but it has the capacity to have a substantial issue of force.
00:02:14
Speaker
The other top AI corporations you mentioned, Entropic, Google DeepMind, OpenAI, have backing from giant tech companies. Does XAI similarly have some backing from Tesla, for example? I can't specifically say about that, but this is not a sub-part of Tesla. This is not an organization inside of Twitter or X, and it's not an organization inside of Tesla.

Categorizing AI Risks

00:02:41
Speaker
The main topic of conversation for this episode is your paper on catastrophic risks from AI and specifically categorizing these risks. So you categorize risks from catastrophic risks from AI in four different categories. Maybe we should just start by sketching out those categories and then go into depth later.
00:03:01
Speaker
Yeah, so I guess at a very abstract level, there's risks if people are trying to use AIs intentionally to cause harm. That's a basic one. So there's an intentional, intentional catastrophe that would be malicious use.
00:03:17
Speaker
Another one is where there are accidents. And if there are accidents, this would often be the consequence of the AI developers using these very powerful systems or potentially leaking them or accidentally putting in some bad objective or doing some gain of function. But that would be some accident risk. So that relates to organizational risks or organizational safety.
00:03:42
Speaker
The third would be these environmental or structural risks, basically where AI companies or AI developers, be those companies or maybe in later stages, countries are racing to build more and more powerful AI systems or AI weapons. And this structural risk incentivizes companies or these developers to
00:04:08
Speaker
I see more and more decision making and control to these AI systems. We get a looser and looser leash. Things move very quickly. We become extremely dependent on them. This gets us in an irreversible position where we are not actually making the decisions, but we're basically having nominal control. It's very possible in that situation we just ultimately end up losing control to the sort of very complicated, fast-moving system that we've
00:04:35
Speaker
create. And then the final type would be these risks that emanate from the AI systems themselves. These are more internal or inherent risks from AI systems. And that would take the form of rogue AIs, where they have goals separate from our own, and they work against us to complete or satisfy their desires or preferences.
00:05:02
Speaker
Overall, there are four, there's malicious use, there's these organizational risks, there's these structural slash environmental risks, and there's these inherent or internal risks in the form of malicious use, organizational risk, racing dynamics, and rogue eyes.
00:05:22
Speaker
If we look back maybe 10 years or so, I think most of the discussion about AI risk would have been about rogue AI. The risks that are coming internally from the AI, so to speak, the AI developing technically in ways that we're not interested in.
00:05:41
Speaker
So how much is this categorization setting set in stone? Do you think it will change over time as we learn more or have the field of AI safety matured such that we can see the risk landscape now?

Evolution of AI Risk Understanding

00:05:55
Speaker
I think the focus on rogue AI systems is largely due to early movers having substantial cultural influence. I think if we asked other people who were not as invested in AI risks, if they were to write down concerns about these, they would of course think that people using the technology for extremely destructive purposes was posed as catastrophic risks.
00:06:16
Speaker
And I think the communities ended up having some self selection effects such that people didn't end up talking about things like malicious use and treated that as a distraction. So I think the community didn't make much of a space for people who were concerned about things other than rogue AI systems.
00:06:36
Speaker
But that was a mistake, the AIs being used in malicious ways can definitely cause catastrophes and can end up causing, can end up increasing the probability of existential risks as well, which maybe we'll speak about the connections between ongoing harms, anticipated risks and catastrophic risks and existential risks.
00:07:00
Speaker
I think the community of people who were thinking about AI risks a long time ago would largely think about whether there's a direct simple causal pathway to something like an extinction event. Now I think we have more of a sophisticated causal understanding of the interplay between these various factors such that
00:07:21
Speaker
One doesn't try and look for direct mechanisms, but instead tries to look at what sort of events increase the probability of existential risk rather than does it directly cause extinction. And that distinction between something that increases probability versus directly cause means that we have to look at a much broader variety of factors
00:07:40
Speaker
and we can't end up just thinking that all we need to do is make a single very powerful AI agent do what we want and then everything is solved forever. Unfortunately, we're going to have to treat this as a broader socio-technical problem. We're going to have to consider the various stakeholders, the politics, indeed the geopolitics, the relations between different countries, liability laws, and all these other things because we're not in this sort of
00:08:08
Speaker
foom type of scenario, it would seem, it seems we're more in a slow takeoff. So many of these real world considerations that were sort of sidelined in Fuse's distractions is actually where most of the action is.
00:08:19
Speaker
What would be examples of something that might put society in a worse position where we are less able to handle a powerful AI?

Impact of Global Conflict on AI Risks

00:08:30
Speaker
A prime example would be World War III. If there's World War III conditioned on that, that increases the probability of existential risk from AI systems.
00:08:41
Speaker
This would spur a substantial AI arms race. We would quickly outsource lethality to them. We would not have nearly as much time for making them more aligned and reliable in that process. But still, the competitive pressures would compel different states to create powerful AI weapons and eventually have that take up more and more of their military force.
00:09:07
Speaker
But that doesn't directly cause extinction. So if we try and backchain from that, the story gets much more complicated. And so then it's not viewed in the scenarios that others were thinking of. There's an AI lab, they've suddenly got a godlike AI, and then it has decisive strategic control over the entire world. How will they make sure that it does what they want? That was, I think, the scenario that others were thinking, largely, and all other ones were
00:09:37
Speaker
too broad, too intractable. Essentially, there was a focus on, quote unquote, targeted interventions historically, where we're just a small number of people. We can't do these broad interventions that involve interfacing with various institutions and getting public support. Those are intractable.
00:10:00
Speaker
So the best we can do is do some very narrow, specific things, maybe technical research. This doesn't look like a strategy because broad interventions are actually more attractive. Well, the world is interested in this and we have some amount of time to try and help our institutions make good decisions and policy around these issues.
00:10:17
Speaker
Do you think this broader vision of AI safety should make us more positive or less positive? Imagine we have to set all of the institution up perfectly. It seems like we have a narrow corridor to make things right, where the institutions have to be there, the technical side have to work, all stakeholders have to be set up well for this to succeed for us. Should the complexity of the problem make us more pessimistic?
00:10:44
Speaker
I think there's at least more tractability compared to an AI suddenly goes from incompetent to omnipotent and in control of the world overnight, and we have no idea if it was emergent and we didn't actually control that process. That doesn't have almost any tractability to it. I don't think we need our institutions to be completely perfect. We just need to try and be in the business of reducing risk. So maybe that's one other
00:11:11
Speaker
conceptual distinction is that historically there'd be a focus on is it an airtight solution that works in the worst case where if everything if something goes wrong then it's insufficient because we have to quote unquote get it right on the first try when we do have some amount of time not saying we have a huge amount of time we have some amount of time we can do some course adjustment and
00:11:33
Speaker
incorporate information as we go along. I'm not saying that's a sure-fire strategy, but I think that's the best we have. And it allows us to correct some mistakes, but obviously we can't have much of an error tolerance, unfortunately.

Unknown AI Risks

00:11:55
Speaker
Do you think we are missing something with this categorization that you've set up in the paper? Could we be missing some category of risk that will be obvious to us in 20 years? And could that risk potentially be the most dangerous because we're not anticipating it? Are there unknown unknowns here?
00:12:14
Speaker
Well, usually there are unknown unknowns. I focused largely on catastrophic risk and large scale loss of human life. I didn't speak about AI well-being very much, or for instance, that's something that could end up
00:12:29
Speaker
changing a lot of how we think about wanting to proceed forward with managing the emergence of digital life is if they have moral value. So I think that's something I didn't touch on in the paper largely because I think our understanding of it is very underdeveloped and I still think it's a bit too much of a taboo topic such that it just hasn't been much research on it as a consequence.
00:12:58
Speaker
Let's dig into the first category of risk, which is malicious use. This is a category in which bad actors choose to use AI in ways that harm humanity.

AI and Bio-engineered Pandemics

00:13:11
Speaker
Recently, there's been a lot of discussion of AIs helping with bio-engineered pandemics. This has been brought up in the US Senate, I think, and it's been widely publicized. How plausible do you think it is that the current or the next generation of large language models could make it easier to create bio-engineered viruses?
00:13:38
Speaker
Yeah, so I think that this is actually one of the largest reasons I wrote this paper was because during 2022, when sort of the development of this paper started, this is this bio thing, nobody's talking about it. Although people will treat malicious use as a distraction, I don't think that's the case. There are catastrophic and existential risks that can come from malicious use.
00:14:03
Speaker
and this threat vector concerns me quite a bit. So I think it is quite plausible that if we have an AI system that has something like a PhD-level understanding of virology, then it's fairly straightforward that such a system would provide the knowledge for synthesizing such a weapon.
00:14:27
Speaker
The risk analysis is something like what's the number of people with the skill and access to create a biological weapon that could be civilization destroying? And what's the sort of the probability that they're actually want to do that? And right now, you know, maybe there are 30,000 virology PhDs, and, you know, they just don't really have the incentive to do that.
00:14:51
Speaker
Meanwhile, if you have that knowledge available to anybody who wants to go to use Google's chat bot or Meta's chat bot or Bing's or OpenAI's, then we can add several zeros to the number of people with the skill to pull it off because they could just ask such a system, how do I make one? Give me a cookbook.
00:15:17
Speaker
Now, there be guardrails, of course, but the guardrails are fairly easy to overcome because these AI systems can easily be jailbroken. You can just append some random garbled string or some adversarially crafted garbled string at the end of your request to the chat bot, and then that'll take off. It's like safety guardrails as a paper that the
00:15:39
Speaker
center helped with discussing adversarial attacks for large language models. Now that's the thing. Or you might use an open source model that's available and might also be easily stripped of its guardrails.
00:15:56
Speaker
I don't think the AI developers now with the APIs have much of a high ground as far as safety goes when it comes to the malicious use case. For other things like hacking, there'd be a different story.
00:16:10
Speaker
That could change. Maybe they'll add more measures. Maybe they'll get better filters. Maybe they will remove some bio-related knowledge from the pre-training distribution and so on. But anyway, that would be sufficient for, if such a thing were to happen, then there could be a pandemic that could cause some civilizational discontinuity, which could be some existential risk.
00:16:33
Speaker
It'd be difficult for it to kill everybody, but for toppling civilization. And it's not clear how that would go or the quality of that situation.
00:16:42
Speaker
That's enough for us to worry, I think. Some of the pushback to this story of a bio-engineered virus enabled by large language models is that, well, isn't all of the training data freely available online? Couldn't a potential bad actor have gone online, gotten the data and used it already? What's the difference between using a search engine and a large language model?
00:17:08
Speaker
Sure. So two things, even if there is some type of harmful content online, I don't know why we would want it being propagated. If the nuclear secrets were online, I don't know why you'd want that propagated because your risk increases based on the ease of access to these. But in the case of bioweapons, yes, there are some bioweapons that are not civilization destroying, available online.
00:17:34
Speaker
the ones that would be potentially civilization destroying though, would require a bit more thinking. So there could be several, or there could be many people killed as a consequence of these though, but not at a societal scale risk necessarily. So I think that's a relevant difference. Many of the
00:17:56
Speaker
extremely dangerous pathogens. Fortunately, virology people are not writing those up and posting those on Twitter, and then all you gotta do is search for them. That's not actually the type of information. For other types of information, like how to, tips for breaking the law, or how to hotwire a car, this sort of stuff is online, and generic cookbooks for some generic, smaller scale bioweapons, sure, but not civilization destroying.
00:18:23
Speaker
And how is the guide for creating a civilization destroying virus in the large language model if it's not online, in the data online? So I'm not saying that the current ones have this in their capacity. I'm saying that when they have like a PhD level knowledge and are able to reflect and do a bit of brainstorming, then you're in substantially more trouble.
00:18:49
Speaker
And that could possibly be a model on the order of like GPT-5 or 5.5. It may be within its capacity. So there you don't need agent-like AI. You would just need a very knowledgeable chatbot for that threat to potentially manifest. So there's quite a bit we'll need to do in technical research and in policy for reducing that specific risk.

Malicious Use of AI

00:19:17
Speaker
Another risk you mentioned under malicious use is this issue of AI agents, which they are perhaps a bit analogous to viruses in the sense that they might be able to spread online and replicate themselves and cause harm. What do you worry about most with AI agents?
00:19:38
Speaker
I'm emphasizing instance that there are many forms of malicious use. In this paper, I'm mainly emphasizing ones that could be catastrophic or existential. So in this case, you could imagine people unleashing rogue AI systems to just destroy humanity that could be their objective. And that would be extremely dangerous. So you don't need you know, I don't need
00:20:01
Speaker
power seeking arguments or these claims that by default, they will have a will to power. You don't need any of that. You just need to assume that if enough people have access, and if some person is omnicidal, or thinks in the way that some AI scientists do, that we need to bring about the next stage of cosmic evolution and that
00:20:26
Speaker
Resistance is this futile to quote Richard Sutton, the author of the reinforcement learning textbook, and that we should bow out when it behooves us. There are many people who would have an inclination for building, not saying Rich Sutton would specifically give the ecosystem of destroy humanity, but it doesn't seem to say too much against that prospect.
00:20:51
Speaker
That's another example of malicious use that could be that could be catastrophic or existential.
00:20:58
Speaker
And how close do you think we are to AI agents that actually work? We had someone set up Chaos GPT early on when GPT was released, but it got stuck in some loops and it couldn't actually do anything, even if it was imbued with bad motives. When would you expect agents to actually be capable and therefore dangerous?
00:21:23
Speaker
Yeah, so I think that their capability would be a continuous thing in the same way, like, when are they good at generating text? It's like, well, you know, it kind of started in GB2, GB3. And so, so I might anticipate great strides in AI agents next year, where you can give it some basic short tasks, like, you know, help me make this like PowerPoint or something. It's not going to do the whole thing, but
00:21:51
Speaker
I can help with things like that or browsing around on the internet for you more so i think those capabilities will keep coming for it to pose a substantial risk there's a variety of things that could do.
00:22:07
Speaker
It could threaten, for instance, mutually assured destruction with humanity by saying, I will make this bioweapon. That will destroy all of you and I'll take you down with me unless you comply with some types of demands. That could work. If they're good at hacking, then they could potentially amass a lot of resources by scamming people or by stealing
00:22:30
Speaker
cryptocurrency, there's a variety. They could, of course, tap into lots of different sensors to manipulate people or influence public discourse. They wouldn't necessarily need to be embodied for this type of thing to happen. If we're in a later stage of development where we have a lot of weaponized AI systems and then hacking those systems, of course,
00:22:55
Speaker
Substantially more concerning or if they or if those systems get repurposed maliciously to weaponized AI system So there's it's um, it becomes a lot easier as time progresses. The AIs don't need to be particularly power seeking on this view though to To have this potential for catastrophe because humanity will basically give them that power by default
00:23:18
Speaker
They will deep weaponizing them. They will integrate them into more and more critical decisions. They will let them move around money and complete transactions, and they'll give them a looser and looser leash. So as time goes on, the potential for rogue AI or for deliberately AI systems that are deliberately instructed to cause harm would be, the potential impact or severity would keep increasing.
00:23:47
Speaker
I think maybe it's worth mentioning here just the continuous costs of traditional computer viruses, which are costly and which we've gotten better at handling those as a civilization, but we still haven't defeated
00:24:06
Speaker
traditional or conventional viruses which are very dumb compared to what AI agents could be. So we can imagine a computer virus equipped with more intelligence and how would you as a person, I'm not saying AI agents will be necessarily as smart as people soon, but how would you do the kind of hacking that the agent might be interested in? It's interesting to consider at least that we haven't been able to squash out conventional viruses.
00:24:36
Speaker
Yeah, they could they could exfiltrate their information onto different servers or like less protected ones and then use those to proliferate themselves even further. So there'll be a very, very distinct adversary with many, many options at their disposal for causing harm.
00:24:55
Speaker
Yeah, one thing I worry about is whether the tools and techniques we'll need at an institutional level to handle malicious use will also enable governments

Preventing Malicious AI Use

00:25:07
Speaker
to become totalitarian basically, to exercise too great a level of control over citizens who have done nothing wrong.
00:25:18
Speaker
What is required to prevent the large language models that could become AI agents and could be used to create viruses? What techniques are available for preventing them being used in such ways without enabling too much state power?
00:25:39
Speaker
Yeah, I think this is definitely a tension where to counteract these risks from rogue lone wolf actors, then people would want the technology centralized. I mean, this would be a similar issue with nuclear weapons, for instance, where we didn't want everybody being able to make nuclear weapons. We wanted to keep control of uranium. And so what happened was we had a
00:26:06
Speaker
no first use plus non proliferation regime. And that kept the power in a few different people's hands. I think there are things we could do to reduce these sorts of risks by creating institutions that are more democratic. I think that seems useful. I think decoupling the organizations that are that has some of the most powerful systems, having those more decoupled from
00:26:33
Speaker
The militaries would be fairly useful so that if something gets out of hand with one, if they're linked, and if we're needing to pull the plug on these AI systems, this isn't like taking down the military. I think just separating this cognitive labor or automated labor from a physical force would be fairly useful.
00:27:00
Speaker
But I think largely it's creating democratic institutions is one of these measures. In the case of dealing with rogue AIs that are people maliciously instructed rogue AIs that are proliferating across the internet, I think there'd be other types of things like legal liability laws for
00:27:20
Speaker
for cloud providers that if you are running an unverified or unsafe AI system on your cloud or on your compute, then you get in trouble. This would create incentives for them to keep track of it instead of just doling out compute to whoever's paying. So that's sort of like having an incentive for off switches all over.
00:27:45
Speaker
So there's a variety of different things we could be doing to strike this balance by reducing these malicious use risks. I mean, also, I should mention, some of these malicious users don't require this type of centralization or nearly as much. We can do various things to reduce this risk without giving tons of power to states. For instance, we invest in personal protective equipment.
00:28:13
Speaker
or monitoring waterways for early signs of some pathogens. There's the traditional stuff we can do to reduce risks from pandemics, for instance, which would reduce our risks, our exposure to the risk of AI-facilitated pandemics. So not all interventions for reducing malicious use require more centralization.
00:28:36
Speaker
I would imagine that we probably wouldn't want in the long term, let's say it's like 2040 or something like that, we wouldn't want anybody anywhere being able just to ask their AI system how to make a pandemic or being able to unleash it to try and take over the world. This doesn't seem like a good idea. There'd be other types of things like structured access.
00:29:00
Speaker
where for these bio capabilities, you just give people who are doing medical research access to those specific bio capabilities. But other people, they don't really have much reason for it. So they don't get that advanced. They don't get models with that advanced knowledge. So I think there are some simple restrictions that we can do that can take care of a large chunk of the risk without needing to hand over the technology to like militaries and then they're the only ones who have it.
00:29:24
Speaker
You mentioned legal liabilities for cloud providers and maybe companies in general. I wonder if this might be a way to have a form of decentralized control over AI agents or over large language models or generative models, AI in general. By having the state provide a framework for
00:29:47
Speaker
where you can get fined for trespassing some boundaries but then having companies implement exactly how that works. Use technical tools in order to reduce their risk of fines and maybe we can find a good balance there where we weigh the costs and benefits.
00:30:08
Speaker
I think that liability laws help fix the problem of externalities quite a bit, where they're imposing risks on others that shouldn't have any risk imposed on them because they're not privy to the decisions. There's an issue with that, though, which is that there's only so many externalities that some of these organizations could internalize, though, with liability law.
00:30:35
Speaker
somebody creates a pandemic as a consequence of their system. You can sue that company, but they're they're not going to be able to pay off the destruction of civilization with their their capital. So there's there's quite a limit to it. It can it can help fix the incentives, but it still doesn't. It doesn't fix them entirely because
00:30:55
Speaker
Not particularly one certainly can't internalize like downfall of like civilization as an organization and like foot the bill for that. And then the extinction of the human race is also, I don't think that's the thing you could settle in court.
00:31:09
Speaker
What about requiring insurance? So this is an idea that has been discussed for advanced biological research, gain of function research with viruses, for example. Maybe such a thing could also work for risky experiments with advanced AI.
00:31:27
Speaker
It depends if the harms are localized. I think insurance and this taming of typical, not long tail, not black swan type of uncertainty, but thin tailed type of uncertainty makes sense when risks are more localized, but when we are dealing with risks that are scalable and can bring down the entire system, then I think a lot of the incentives for insurance don't make as much sense.
00:31:56
Speaker
So you basically need some law of large numbers and many types of insurance to kick in, to have that risk diversified away. But if the entire system has exposure to that risk, there's not another system to diversify it.
00:32:12
Speaker
Maybe you could paint us a picture of a positive vision here. Say we get to 2050 and we've worked this out. What does the world look like in a world where we control malicious AI? I think if people have access to these AI systems, they're subject to, and they have many of their capabilities. There are, of course, restrictions on them. You can't use them to break the law. So a lot of these most dangerous capabilities, nobody's really able to use them in that way. If they're
00:32:41
Speaker
is a need for, in the case of defense, they would end up using AIs for things like hacking and whatnot. And they would have access to that type of technology, but it wouldn't be the case that any angsty teenager can just download a model online and then they instruct it to take down some critical infrastructure. This just isn't a possibility.
00:33:07
Speaker
It's very much trying to strike a balance with that. I would hope that we would also have these most powerful AI systems that do carry more of this force, that have some of these more dangerous capabilities, are subject to democratic control, so that power is not as centralized. And that also, I think, reduces the risk of lock-in risks as well, where some individual group can impose their values and entrench them.
00:33:37
Speaker
So those are some properties of a positive future. So I don't think it looks like
00:33:46
Speaker
complete mass proliferation of extremely dangerous AI products. And I don't think it looks like only one group, one elite aristocrat group gets to make the decisions for humanity either. So there's different levels of access to different levels of lethality and power depending on whether it makes sense. But the highest level institutions are still democratically.

Military AI Races and National Dynamics

00:34:11
Speaker
Another category of risks that you discuss is the possibility of an AI race. Now we've done another episode where we talked about evolutionary pressures and how they work between corporations and how they might lead to a situation in which humanity is gradually disempowered. But I think one thing we could discuss here in this episode is the possibility of a military AI race.
00:34:39
Speaker
What do you think a military AI race looks like? To recap, we were just at the malicious use one. And so now the other risk category would be like racing dynamics or competitive pressures or collective action problems. This is that structural environmental risk that when we're referring to.
00:34:54
Speaker
categories way earlier. Yeah, I think the with the corporate race, obviously, there's, as we discussed in the previous episode, there's, you know, them cutting corners on safety. And this is largely AI development is driven by a lot of these organizations will start as having a very strong safety event. But then they're basically going to be pressured into just racing and prioritizing the profit and developing these things as quickly as possible. And staying
00:35:18
Speaker
competitive over their safety. This is sort of the dynamic that basically drives pretty much all these companies. I don't think actually in the presence of these intense competitive pressures that intentions particularly matter. So I think I think basically this is the main force to look at when trying to explain the major developments of AI. Why are companies acting the way they are? It can be very well approximated by by them.
00:35:46
Speaker
just trying to by them succumbing to competitive pressures or defecting in this broader collective action problem of should we slow down and should we proceed more prudently and invest more in safety and try and make sure our institutions are caught up or should we race ahead so that way we can continue being in the lead because one day we'll maybe be more responsible with this technology.
00:36:09
Speaker
I'm concerned, as mentioned in that previous episode, of that leading us to a state of substantial dependence and losing effective control. You can imagine a similar dynamic happening with the military. Just like if we don't want arrows, for instance, you're not going to roll back arrows. And so when you start going down the road of weaponizing AI systems, if they're more potent and cheaper and more generally capable and more politically convenient than sending
00:36:36
Speaker
human soldiers onto the battlefield, then this becomes a very difficult process to reverse back. Eventually what happens is you've had an on-ramp to many more potential catastrophic risks. You've transferred much of the lethal power, in fact, the main sources of lethal power to AI systems, and then you're hoping that they're reliable enough and that you've sufficiently, you can keep them under sufficient control and that they can do your bidding.
00:37:06
Speaker
Even if you do get them highly reliable and they do what you instruct them to do, this doesn't make people overall very safe. We saw with the Cuban Missile Crisis, we can definitely, nukes don't turn on us. They don't go off and pursue their own goals or something like that.
00:37:22
Speaker
they do what we want them to do. But collectively do this structural environmental game theoretic situation where like, well, we would all be better off without nuclear weapons, but we it makes sense for us each individually to stockpile them, we put the broader world at larger collective risk. So like in the Cuban Missile Crisis, JFK said we had up to like a half
00:37:47
Speaker
or like a 50% chance of extinction in that event. It was a very close call because we almost got a nuclear exchange with that. And likewise with AI systems, they may be more powerful. They may be better at facilitating the development of new weapons too. And this could also bring us in a situation where we could potentially destroy ourselves again.
00:38:12
Speaker
What's pernicious about this structural or environmental constraint where we've got different parties, in this case militaries, competing against each other is the following. Even if
00:38:28
Speaker
We convinced the world that the existential risk from AI is 5%, because let's say they're not reliable. We can't reliably control them. So maybe there's a 5% chance to turn on us or we lose control of them, and then we become a second-class species or exterminate.
00:38:43
Speaker
Even if that's the case, it may make sense for these militaries to go along with it. They swallowed the risk of potential nuclear Armageddon by creating these nuclear weapons in the first place. But they thought, if we don't create these nuclear weapons, then we will certainly be destroyed. So there's certainty of destruction versus a small chance of destruction.
00:39:06
Speaker
And I think they'd be willing to make that trade off. So this is how there could be an existential risk to all of humanity based on these structural conditions. So it's not enough to convince the world that existential risk is high because they might just, OK, well, yeah, that's five percent. OK, we're going to have to go with that rational thing. It makes rational sense for us to engage in this what would normally be very risky behavior because we don't have a better choice.
00:39:33
Speaker
So this is why I don't think it makes sense just to hammer home the point that, wow, these AIs could turn on us or we could lose control of them. There's a structural thing of like, that's not going to matter unless that probability is like very high. Like maybe if it's like 30 percent and they go, OK, all right, we're not going to build the thing because but if it's something like 5 percent, they might go through with it anyway. So
00:39:55
Speaker
more than just concerns about single AI agents make sense to focus on. We have to focus on these multi-agent dynamics, these competitive pressures, the game theory of what they're facing.
00:40:12
Speaker
So I think that if you don't resolve that, you're basically exposed to insensitivity to a lot of existential risk up to maybe five or 10%, which maybe it isn't, it's possible. Maybe it's actually only 2%. And where you convince the world, everybody's very educated about it. Everybody listens to Future of Life podcast tomorrow, and they all go, wow, this is a concern. I am updated 5%. It won't matter. It won't stop that type of dynamic from happening. So you have to fix
00:40:39
Speaker
the international coordination issue have to avoid this sort of potential for World War III thing. Now, it didn't directly cause it, as we were discussing earlier. This wasn't a direct cause of extinction, but it increased the probability substantially. That's the sort of framing we have to focus on in trying to reduce existential risks, not search for direct causal mechanisms, but look at these diffuse effects and structural conditions.
00:41:00
Speaker
Yeah, so concretely, this might look like the US is considering implementing AI systems into their nuclear command and control systems.
00:41:10
Speaker
So specifically, they're doing this to counteract the rumors of other countries doing the same thing. And in order to act quickly enough with their nuclear weapons, they think they need to give AI a greater degree of control over these nuclear weapons. And so you have a situation in which
00:41:35
Speaker
countries are responding to the actions of each other in a way that accelerates risks from both sides in a sense.
00:41:47
Speaker
There'd be one, I mean, there are other ways this can affect warfare. It could maybe be better at doing anomaly detection, thereby identify nuclear submarines and affect the nuclear triad that way. Or we, in later stages, just have massive fleets of AI, and this is saying robots, sorry to say, but like later stage, if they're much cheaper to produce, they'd be very good combatants.
00:42:13
Speaker
There isn't skin in the game. This makes it more feasible to get into conflict. There are other ways in which this increases the probability of conflict, too. There's more uncertainty about where your competitors are relative to you. Maybe they had an algorithmic breakthrough. Maybe they could actually catch up really quickly or surpass us by finding some algorithmic breakthrough.
00:42:34
Speaker
This creates severe or extreme uncertainty about the capabilities profile of adversaries. This lack of information about that increases the chance of conflict as well. It may also increase first strike advantage substantially too, which would also increase the probability of conflict.
00:42:51
Speaker
Like, we have an AI system today. It's much more powerful than anything else. They might get theirs tomorrow. If we act today, then we can squash them. And that could get the ball rolling for some global catastrophe.
00:43:06
Speaker
Yeah, pretty pernicious dynamics overall, but all of these can be viewed as competitive pressures driving AI systems into propagating throughout all aspects of life. We mentioned through the public sphere in the economy, people's private lives with AI chatbots, also in defense in the military. It just basically comes everywhere and we end up relying more and more on them to make these sorts of decisions and I don't think
00:43:36
Speaker
And many of these, we become, so depend on them that things move quickly. We can't actually keep up. We can't make, if we're actually making these decisions, we'll make much worse decisions. So then they basically become ineffective control. Things also move so quickly that the answer to our AI problems is we need to bring in more AIs because since they're using more AIs, now we need to use more AIs. And so it creates a self-reinforcing feedback loop, which ends up eroding our overall influence and oversight as to what's going on.
00:44:03
Speaker
And so I think that's the default line. So of these sort of risk categories, I think this seems like straightforwardly the case if we don't fix international coordination, and if there's a close competition between countries, or if we don't fix the racing dynamics in the corporate sphere, then I think it's fairly likely that humanity becomes at least like
00:44:28
Speaker
A second-class species loses control from there. Eventually, probably they go extinct, but that might be a long time after. So this is the main risk that I'm worried about, but as director of central radiosity, I'll try and be ecumenical and focus on various others, too.

Organizational Risks in AI Development

00:44:46
Speaker
So I'm always making sure there are projects addressing each of these, though. But personally, this is the one that I'm most concerned about.
00:44:51
Speaker
So treaties between governments and some form of collaboration between the top AI corporations, is that the way out here? How do we mitigate this risk? The way you describe it, it seems very difficult to avoid, given the incentives, basically. People respond to incentives, they rationally respond to incentives.
00:45:13
Speaker
for each step along the way, they have reasons to do what they're doing. And so it seems difficult to avoid. What are our options? Well, there are positive signs, for instance, like Henry Kissinger was recently suggested in foreign affairs that the US cooperate with China on this issue now, but before it's too late. So I think some people are recognizing the
00:45:39
Speaker
the importance of trying to do something about this. It's possible there'd be some clarifications about antitrust law, which would make it possible for AI companies to not engage in excessive competition over this and put the whole world at risk. Potentially, there could be an international institution
00:45:59
Speaker
like a CERN for AI, which is the default organization, which has a broad consortium or coalition of countries, writing input to that and helping steer it.
00:46:16
Speaker
one that's maybe decoupled from to some extent of militaries so that we're not having too much power centralized in one place. So it doesn't have a monopoly on violence and eventually after it automates a lot of monopoly on labor. I think that's just like basically all the power in the world. So those are possibilities. I think that the time window might be a bit shorter though if the if there's an arms race, an arms race
00:46:40
Speaker
in the military and if the AI is viewed as the main thing to be competing on, we need to spend a trillion dollars on that order for nuclear weapons. When that becomes the case, I think it's where we're very much set down that path and then we're exposed to very substantial risks.
00:46:58
Speaker
So yeah, I think maybe we'll have a sense in the next few years as to whether we get some type of coordination or if we are not going to recognize that we're all in the same boat as humans and we don't want this to happen. But we'll need people to basically understand what happens if we go down this route and if we don't try and fix the payoff matrix, the incentives at the outset, the structure that these players find themselves in or that these developers find themselves in. That looks like very much a political problem as it happens.
00:47:27
Speaker
So this is why making, reducing AI x-risk and whatnot and making AI safe is a socio-technical problem. It's not writing down an eight-page mathematical, you know, solution, you know, work of genius and then, oh, okay, we can all go home now and everything's taken care of.
00:47:43
Speaker
it's not going to look like that. That was a category error in understanding how to reduce this risk. We shouldn't have these types of like founder's effects have like undue influence over like it will keep lingering. I think that will eventually like go away, but I still think it's still like lingering and I think we should just like move past it and recognize the complexity of the situation.
00:48:04
Speaker
Let's talk about organizational risks. And these risk categories, of course, kind of play into each other, influence each other. So if we have organizations that are acting in a risky way, this increases the risk of potentially rogue AI, or it incentivizes others to race in order to compete with these organizations that are acting in risky ways.
00:48:32
Speaker
But yeah let's just take it from the beginning watch what falls under the organizational risks category.
00:48:41
Speaker
Yeah, so organizational risk at a slightly more abstract level would be the the accidents bucket. So even if we reduce competitive pressures, and if we have a and if we don't have to worry about malicious use immediately, we'd still have the issue of organizations having, you know, maybe a culture of move fast and break things or them not having a safety culture.
00:49:11
Speaker
In other industries or for other technologies like rockets do you know that wasn't um extreme competition with that but nonetheless rockets would blow up or nuclear power plants would melt down acts catastrophic accidents can still happen and these can be very deadly in the case of a systems eventually so.
00:49:31
Speaker
I think this is definitely a very hard one to fix most of the people at these AI organizations and how they were initialized and whatnot. Still had a lot of people who are mostly just wanting to build it and the consequences of society be damned. This is not my wheelhouse. I don't read the news. I don't like thinking about this sort of stuff.
00:49:51
Speaker
this is annoying, humanities majors and whatnot, who are in these, you know, ethics divisions or policy divisions that keep annoying us. This is kind of the attitude at most of these companies by and large. And I think this is a large source of risk. We could, as well as
00:50:12
Speaker
It's just non-trivial as we see in other things like nuclear power plants, chemical plants, rockets, and making sure that this is all extremely reliable. So we'd need various precedents. There's basically a literature on this called the Organizational Safety Literature, which focuses on various corporate controls and processes for
00:50:35
Speaker
Making sure that the organization responds to failure, takes near misses seriously, has good whistleblowing, has good internal risk management regimes, has like a chief risk officer or an internal audit committee, all these sorts of things to to reduce these types of risks. And yeah, you were right in that this interacts with
00:50:54
Speaker
Not necessarily direct cause of some of these extra centers, but nonetheless boost up the probability if we're perceiving that an organization is very reckless in its attitude. This causes more safety minded ones to compete harder and justify justify racing. This reduces the.
00:51:14
Speaker
That consequently reduces the amount of time you have to work on control and reliability of these AI systems, which affects the probability of rogue AIs, of course. There's also other types of accidents that could happen, like the organization might accidentally leak the model, leak one of its models that has some lethal capabilities in it if it's repurposed.
00:51:36
Speaker
There's also a risk of as as potentially who's to say happened with with with viruses. Maybe there'd be some unfortunate gain of function research that would also lead to some type of catastrophe as well. There are people interested in what is essentially gain of function research and in creating warning shots, they might be a little too successful later on. What does gain of function research look like in AI?
00:52:04
Speaker
deliberately building some AI system that's like power seeking or Machiavellian and wants to destroy humanity and then they're going to use this to like, you know, scare the world with but like, at some point, when it's powerful enough, you might get what you asked for. The idea here is to create a dangerous AI, maybe an AI that's more agentic or power seeking.
00:52:25
Speaker
and then use that model to study how to contain it. But then the worry is that we could ironically go extinct perhaps because we can't control the model.
00:52:41
Speaker
Yeah, and if this is like, who's to say who's going to be experimenting with this, or how exactly cautious they will be, or they're like skill level, it may be mandated that they test for these types of dangerous inclinations or capabilities. And who exactly is going to be doing that is unclear may not be like the most like,
00:53:01
Speaker
capable people, or there's just some overall or there's just some risk of accidents in that way. So I guess that gives them some flavor of some of the direct accidents. But I also think how it indirectly affects things is so at one way in which I think strongly indirectly affects things is
00:53:20
Speaker
What accident is an intellectual error inside of these organizations where they conflate safety and capabilities? This is a very common thing, where there's not clear thinking about safety and capabilities, where people be, oh, well, we're smart, rationalists, and justify the means. We're risk neutral.
00:53:42
Speaker
don't actually do much empirical deep learning research. But conceptually, we think that this will be beneficial for safety, even though it will come at the cost of capabilities and whatnot. So they've really muddied up that line. And the distinction between safety and capability, such that you could imagine a lot of these safety efforts basically just work on capabilities the entire time. I think that's a reasonable fraction of the safety teams, I think, do focus just on capabilities for context.
00:54:11
Speaker
there is an extreme correlation between AI's capabilities in various different subjects and goals. So if you want your AI system to be better at something like math problems or history problems or accounting problems, these capabilities are all extremely correlated now, we can see with like large language models.
00:54:35
Speaker
You should assume that if something is correlated, like the correlation is like 80% or like 90%, it's extremely high.
00:54:46
Speaker
When people reason themselves into some new capability that they think will be helpful for safety, it's very likely the base rate of it being correlated with capabilities and basically being nearly identical to other capabilities by being so correlated is extremely high. So I think there needs to be substantial evidence that the safety intervention that one is applying
00:55:10
Speaker
isn't affecting the general capabilities. And that requires empirical evidence. So a good example of empirical research that I think helps with safety but doesn't clearly help with general capabilities of making a system smarter would be like the area of machine unlearning. So machine unlearning is where you're trying to unlearn some specific dangerous capabilities, trying to unlearn bio knowledge, trying to unlearn specific know-how that allows you to hack.
00:55:37
Speaker
This is more clearly like measurably not correlated with, it's anti-correlated with some capabilities and not particularly correlated with general capabilities. If you're just moving that specific know-how. I ever see a robustness is also generally anti-correlated with general capabilities. It doesn't make the systems overall smarter. What happens is it makes the systems robust to some specific types of attacks. Robustness to that comes at a fairly large computational cost and takes up a lot of the model capacity.
00:56:05
Speaker
But that would be a safety intervention that doesn't make the models overall smarter. So those are examples of, I suppose another example would be with transparency research. Historically, there have been no instances of transparency advancements leading to general capabilities advancements. Just trying to understand what's going on in the model. And it doesn't really work nearly as well as just like throwing more data at it and
00:56:34
Speaker
There aren't many architectural improvements that are likely to be found anyway. As a result of these investigations, the track record is pretty basically completely clean for transparency. Now, maybe that wouldn't be the case in the future, but then at that point, then we wouldn't identify this as something that is particularly helping with safety.
00:56:50
Speaker
Uh, so I think that for the safety research areas, we need to be quite clear about there's, it's not, there is, you can't just have some informal argument about, uh, or an appeal to authority that, Oh, this is helpful for safety because of, you know, some, some.
00:57:05
Speaker
Uh, some, some verbal argument, the empirical machine learning is very complicated. Hindsight barely works in trying to understand what's going on. Why does pre-training on fractal images help improve robustness to, uh, I don't know, basically everything and improve the calibration anomaly detection. I have no idea. Um, it works though. Even yeah. Ask people like, why are activation functions the way they are? Uh, I don't think there's actually a good canonical explanation. That's like very consistent.
00:57:32
Speaker
You would want empirical evidence that when we are engaging in safety research, we are not accidentally also increasing the capabilities of

Corporate Incentives for AI Safety

00:57:41
Speaker
models. And you think this is something that happens often.
00:57:44
Speaker
Yeah, I think this happens extremely often in this sort of this organizational risk of the conflation of safety and capabilities. Now, this isn't to say that they are loose and separate. A better improvements in capabilities has downstream effects on safety in many situations. It makes them better able to understand human values, for instance, as they gain more and more common sense.
00:58:03
Speaker
But if we are trying to improve safety and specifically reduce existential risk, I think we need to differentially improve on some safety access and not in the general capabilities access. If we are doing something that's fairly correlated with capabilities and safety, I think that the default expectation is that actually you're working in the service of capabilities. A good example would be one of OpenAI's strategies to mention this specifically, because I just don't think it's particularly intellectually
00:58:32
Speaker
I don't think building a superhuman alignment researcher specifically just affects alignment. I think such a thing can be easily repurposed to doing lots of other types of research. I don't think there's like a specific alignment research skill set that is just, oh, it's just you'd only get that, but
00:58:55
Speaker
If you're good at that, it means nothing about your ability to accomplish anything else. I just don't think that's the case. I think it's actually extremely correlated with general capabilities. It would be very straightforwardly repurposed to other forms of research. But that's an example of this sort of conflation. Now, this isn't to say OpenAI is only conflating safety and capabilities entirely. I'm not claiming that. They will have some work on transparency. I gather that they'll work more on reliability and
00:59:24
Speaker
robustness. But this is a very dangerous conflation. And I think basically if they're if they seem kind of correlated, just intuitively, and then if you hear a lot of verbal arguments without empirical demonstration, basically assume just the base rates are like a lot of these completely separate subjects like performance in history, and like performance in philosophy and mathematics, like those are all like hyper correlated to assume this other type of thing is also hyper correlated with two.
00:59:51
Speaker
Anyway, that's another one that like an organizational factor that really reduces the amount of time we have to solve this problem and our ability to solve it as well.
01:00:03
Speaker
I think the worry here for, say you're a top AI company and you're thinking, okay, how much safe organization features should we implement? Should we have more red teaming? Should we have more procedures, more review, more testing? Should we require this empirical evidence before we begin a new safety research program?
01:00:26
Speaker
This now threatens to slow us down and it opens us up to competition from the scrappy new startup that's at our heels trying to outcompete us.
01:00:41
Speaker
This is very straightforwardly now a case of AI race undermining organizational safety, or at least threatening to undermine organizational safety. Can you make an argument if you were to sell this to a CEO of an AI corporation that...
01:01:00
Speaker
When would I be in that situation? That safety is in the interest of the organization itself. You could say it's difficult to sell unsafe products, right? You want to be in control. You don't want to lose the weights of your model in a leak and so on. So there might be some correlation between the self-interest of the organization and the interest that society has in safety in general.
01:01:26
Speaker
But before that one factor, an additional factor that just like diffusely increases probability of extras from like these organizations, if they just do safety washing and they don't even know it sometimes might have some like small gesture for safety. They might have a, for instance, a responsible scaling policy that doesn't commit them to almost anything. And then that placates regulators, for instance, but doesn't doesn't actually reduce risk.
01:01:49
Speaker
Those would be other examples of how organizational risks can end up increasing the probability of existential risk. Although it's diffused and indirect, it still matters. On the self-interest point, I think a lot of the catastrophic risks, or catastrophic risks and existential risks are tail risks.
01:02:10
Speaker
And generally, organizations don't really price in tail risks that much. A lot of portfolios don't really do much to address tail risks either, like in other industries like in finance and whatnot.
01:02:24
Speaker
Um, so it's sort of, this is kind of like a problem with our, in many of our institutions that a lot of, we could convince them to do things like red teaming to some extent, but doing red teaming for existential risks and whatnot is not necessarily something that they would check to do. Cause that's not going to affect their product tomorrow.
01:02:43
Speaker
There's no pushback if everybody is dead, as mentioned before. So I think that this works to a limit. I think some things like saying information security for your company or so that your weights don't leak. This is a much easier argument to make. Other claims like some of these internal controls and whatnot, oh, this will slow us down. This will reduce our velocity. And I think these are
01:03:07
Speaker
are harder to make. And I don't think that there are necessarily short-term economic incentives for some of these. Many of these are actually more for addressing tail risks and black swan events. So they would then need to just recognize that the black swan events are real possibilities beyond a probability threshold worth actually addressing.
01:03:32
Speaker
I'm not claiming that they're being completely irrational if they're being fairly short-sighted and don't believe in these Black Swan events from it. Then I think them trying to maintain velocity and just maintain optionality and whatnot. It's understandable. I wouldn't advocate for that, but it's understandable that they're doing that.
01:03:55
Speaker
comporting to a lot of their incentives well. And they will do various things to reduce some generic risk. They will do some generic forms of red teaming, regardless of whether there's regulation, because it will make sense. But I just don't think that that does particularly much in the way of reducing these catastrophic or existential risks often.
01:04:15
Speaker
Say you're a philanthropist or a government with a big bag of money and you want to incentivize safety research at these top AI corporations.
01:04:25
Speaker
Is there a way in which you could earmark the money and make sure it's spent on what you want it to be spent on? So it's not funneled into increasing capabilities of the models, it's spent on the right type of safety research. I think that'd be one intervention. I think it's very possible to, there are a lot of professors lying around in academia who could do
01:04:50
Speaker
This research, all you need is to subsidize. So for instance, like center for safety as a compute cluster, we'd love to expand it. We're only able to support not not that many professors to do research with large language models and very compute intensive experiments. But there are a lot of a lot of professors who
01:05:09
Speaker
Could be doing more research here. So I think that probably there aren't that many people working at these organizations. I should say as well and So I wouldn't I wouldn't bet on them To fix everything you're actually just correlating it with like what is Jared Copland's safety vision? What is yon lake is safety vision and like you're you're exposed you're getting like two or three bets if you were like
01:05:30
Speaker
giving each of them money. And I think that's not a very diversified portfolio and you should expect blind spots just because people can't simulate a collective intelligence, a broad research effort by themselves, even if they work very hard and have lots of discussions and have
01:05:49
Speaker
good deference to outside views and so on. They just can't they just can't simulate that function. So I would suggest if one's wanting to subsidize safety research, we can if there's can have a subsidized compute cluster, then we can have high accountability of like, you're not allowed to run this project because this doesn't seem sufficiently safety related, instead of giving like money, you know, strings attached to some academics and they run off with it.
01:06:10
Speaker
Um, so that would be my, uh, preferred intervention. Um, and I think that there's, uh, it can take orders of magnitude more. So like, if any of them are listening, you know, like reach out to us, I'd love to get more compute to, uh, to, um, uh, people doing relevant, uh, uh, doing relevant research, um, uh, on safety in a, in a nice diversified portfolio across transparency and adversarial robustness and back doors and machine learning models, uh, uh, and unlearning, um, uh, these types of topics.
01:06:38
Speaker
Do you think some safety breakthroughs would be kept secret?

Enhancing AI Organizational Safety

01:06:42
Speaker
Say a safety breakthrough at Google DeepMind made the AI useful or in a way that incentivized them not to share the safety breakthrough. Would you expect safety breakthroughs to be shared widely as if they were found in an academic lab?
01:07:01
Speaker
I think for market positioning, one of them could occupy the niche of being the safest of the racing companies. We are technically the safest. So I think that's currently occupied by anthropic. And this might make it fairly useful when pitching themselves for, say, a defense contract that look weird, the more reliable organization.
01:07:25
Speaker
compared to our competitors. And so if they would be open sourcing some of that, then I think they would lose some of that competitive advantage. So it's quite conceivable that they'd hold on to things. And there's many safety projects they do for which the code is not like open source. So we see that to some extent there. But I think it can make sense for one of them to try and just be a bit safer than the others or a bit more reliable than the others.
01:07:51
Speaker
I've heard Sam Altman, the CEO of OpenAI, talk about releasing these systems, specifically releasing GPT-3 and 4 to the world, chat GPT, in order to gather more attention to the issue. Do you think this is a viable strategy? Is this too risky or is it worth trying?
01:08:14
Speaker
I think to answer a more extreme question to possibly get a sense of my position on this, I think the release of llama 2, for instance, by meta, which is an open source, large language model around the capacity of GPT 3.5. I think the benefits of that actually outweigh the costs.
01:08:32
Speaker
It enables a lot more research. It also improves our defenses against some of the immediate applications of these AI systems. So, for instance, it came out today that North Korea is using some of these AI systems. I don't know whether it's LAMA2, that'd be my guess, because it's the most capable open source system, or code LAMA potentially.
01:08:55
Speaker
using AI systems to identify vulnerabilities in software, and then that helps them shortlist things to attack. This isn't an extremely capable, or this doesn't rewrite the cost-benefit analysis of cyber attacks. It doesn't rupture our digital ecosystem. But this basically gives us some preview and forces these issues on people's attention.
01:09:19
Speaker
So I think there's an argument to be made for open sourcing llama 2 or if it's trained on 10x more compute, llama 3. After that there's more uncertainty we'd have to see because maybe it could be repurposed for things like bioweapons then or it would be substantially more capable at hacking and scamming things like that.
01:09:44
Speaker
I think there's a real argument to be made for some short-term stressors sort of snapping the system into to do something about it, or at least waking them up. But I think systems function better with some amount of stressors. When the stressors get too extreme, then it can undermine the system. So it's complicated. I think maybe the situation in the case of OpenAI releasing these things or how they want to go about release strategies would not surprise me if that should change or it would be better to do other things.
01:10:14
Speaker
future. Yeah. Do you know something about the internal processes for deciding when to release these models? So in the case of like meta, for instance, they may have like a chief legal officer vote on whether or potentially a veto power that could still be overwritten by the CEO, which may have happened in the case of llama to being suggestive here because I haven't heard a second source for it. But
01:10:40
Speaker
Usually for decisions, though, like, you know, OpenAI will be accountable to their board. The board, I don't know whether they have formal powers to decide whether they'd be voting. Often boards have blunt powers of just firing the CEO. And there often aren't processes in place for these larger scale decisions. So you could imagine a CEO just deciding unilaterally to have something released.
01:11:08
Speaker
And that's something that organizational safety could improve and be processes for high stakes decisions as an example. But yeah, but by default, boards do not have fine grained control. And so it's often up to the CEO to make the call. So you have a single point of failure. What is the Swiss cheese model of organizational safety?
01:11:28
Speaker
Yeah, so I'm mentioning, and if people are wanting to hear more about the organizational safety literature, we'll have the AI Safety Ethics and Society textbook out in November in one of the chapters would be on safety engineering. The Swiss Chief's model is a
01:11:47
Speaker
easy to communicate, it's kind of outdated, but it gets at a, just like how people are doing extra analysis of existential risk from AI earlier, they have a toy model that captured some of the, and some of the scenarios to be concerned about. But it's important not to let that be the lens by which you filter everything through that's that captures some of it, but not all of it. And the Swiss cheese one captures some of the dynamics, but not all the dynamics. But anyway, the Swiss cheese model with the
01:12:13
Speaker
with that parallel design, essentially having multiple layers of defense. If you have red teaming, even red teaming for a catastrophic risk, that reduces the risk of catastrophe, but it's not itself perfect. You might also want stronger informational security too, to make sure that if you had a dangerous model that it doesn't leak, you could have
01:12:39
Speaker
you could have better transparency tools to check for deceptive behavior in AI systems. But if those transparency tools failed, maybe you would want monitoring of these AI systems so that before they take any action, it needs to be approved by something equivalent to an artificial conscience or filter that would filter out some of the immoral actions of AI agents before they're able to take them.
01:13:02
Speaker
And so all of these together can increase the reliability of the system. So the hope is that if you stack together many of these, you've substantially reduced your risk. This isn't looking for a perfect airtight solution. This is looking for layering on many different defenses to actually reduce risk. So if I want to reduce biorisk, for instance, here's an example of a Swiss cheese thing. First, I could
01:13:29
Speaker
First, there'd be the diffuse thing. Maybe there could be some regulation about not allowing models with these capabilities. But let's say I'm an organization that takes safety more seriously. So that depends on safety culture. So that's some sort of barrier. Regulation might be some barrier against this risk. Safety culture might be some barrier against this risk. So then they have enough of a safety culture. They're willing to add a lot of these safety features. Now, these safety features themselves will end up having lots of different layers of defense. You could have an input filter to try and remove whether there's a request to create a bio weapon.
01:13:59
Speaker
You could also remove virology-related data from the pre-training distribution so that it likely knows a lot less about virology. You could have an output filter as well, which would, even if somebody jailbreaks the input filter, then they're also going to need to jailbreak the output filter, which is harder to do. And you could imagine adversarially training this as well. So it would be another layer so that it would be more robust to people trying to jailbreak those layers of defense.
01:14:24
Speaker
But then you also have there's also people who could through the API fine tune the model and inject some of that bio knowledge back into the model so you could have a filter.
01:14:35
Speaker
that screens the fine-tuning data so that that information can't get back into the weights. And then you could add another layer, which would be an unlearning layer, where you would assume that before you hand back the fine-tuned model to the user, before they get it back, we're going to run a scrubbing, unlearning, knowledge expunging thing to expunge some of any bio knowledge, if there is any. And that would be yet another layer. This approach reduces the risk of some biocatastrophe.
01:15:03
Speaker
Are any of those airtight? No. But do they work better collectively? Absolutely. So this is why we shouldn't be focusing on these airtight solutions exclusively. We also need to make use of these various layers of defense. That's how we actually reduce the probability of existential risk. We can't let perfection be the enemy of the good. If we'd say, well, if we can't build a completely 100% reliable input filter, then we shouldn't have an input filter. That's a dead end, so we shouldn't investigate it.
01:15:32
Speaker
That's just not how things work.
01:15:35
Speaker
Tell us more about the textbook. I'm pretty excited to read this. I hope that this is a product that should exist, I think. Specifically, tell us more about how do you think about updating this or keeping it up to date? I think for a textbook on AI safety, it won't probably work if the next version is out in 2034 or something like that. So how do you keep it up to date? And also, you can just present the textbook, which I think listeners will be interested in.
01:16:04
Speaker
I mean, since I've been around in in an academe for a while, I do have at least some of the sense of like, what things what content is more likely to stand the test of time. And so that one's not talking about, you know, Dolly two or something which uses those already outdated.
01:16:19
Speaker
or like what are kind of like fad topics and not giving those too much, not giving those air time. I mean, an example of this would be like an unsolved proms in an LCA, which I don't know, two or three years ago or something. But there we introduced emergent capabilities, which I think has become fairly popular before Burns et al's paper on honesty and whatnot, where also honesty is a big part of alignment. So there's sometimes one needs to call the shots too as to what things will
01:16:48
Speaker
even if there isn't much of a literature on it at all, need to predict what will end up standing the test of time. So I think it should have some reasonable longevity because we're not focusing on transient knowledge, but instead like general interdisciplinary frameworks for thinking about risk across all these sectors. Because we had this issue of like, if you're thinking about AI risk, you have to think a bit about geopolitics, you have to think about international relations to some extent.
01:17:14
Speaker
You think about AI risk, you have to think about corporate governance and AI developers and what sort of incentives are driving them. And you have to think about the individual AI systems themselves too. You have to think about organizational safety. You have to think about broad variety of factors and will basically focus quite a bit on frameworks for thinking clearly about each of those.

AI Safety, Ethics, and Society Textbook

01:17:38
Speaker
I would imagine that later one could have GPT-6 like help, like update the textbook anyway.
01:17:44
Speaker
So, honestly, it's actually like the plan for it. Something in that direction. How technical is the book? Does it contain pseudocode, like a standard AI textbook? The premise of it is to onboard people from different disciplines. This isn't written for machine learning PhD people. There are lots of different fields, economists, legal scholars, philosophers, people without technical background, policymakers.
01:18:10
Speaker
think tank people who want more of a systematic understanding of these issues and so it's largely written for people without any specific background and it's not trying to be a sort of like a introductory machine learning PhD course. That would be the course.mlsafety.org if you want a course of various technical topics or the machine learning safety course
01:18:35
Speaker
But this one is more focusing on, as we were discussing, the game theory of this, the various governance solutions. Conceptually, many of the arguments associated with rogue AIs, why might they be power seeking? Why might they be deceptive? Understanding that. There's also introduction to machine learning and reinforcement learning in it. Understanding collective action problems, since that was fairly relevant in these
01:19:03
Speaker
competitive pressures. There's also ethics in the book as well, where if you're assuming that you've got your AI systems to be somewhat reliable, then we have to start worrying about making it beneficial. And so there's various bits of AI ethics as well of what are objectives that we might give the AI system. What would those look like? What would be some of the, you know, moral trade-offs that you're making there?
01:19:30
Speaker
So it's covering AI safety, ethics, and society. So trying to be fairly broad, you should have lecture slides, and presumably, I'll get around to recording videos for two. The goals, there's several goals of it, like to compress the content. Right now, if you want to understand AI risk, basically need to be like part of like an intellectual scene, like in the Bay Area, probably.
01:19:53
Speaker
maybe in maybe somewhat in Oxford. So very high barriers to entry. And then if you do, you're probably going to take a somewhat narrow view just because they're all interested in rogue AIs and don't have as much interaction with the rest of the world. So you'll have many blind spots as to a lot of the social variables and the broader socio-technical problems.
01:20:18
Speaker
The knowledge has been a bit diffused across various different blogs and to stay up to date, you've often had to jump around from different places. So it would be nice to have something that's more compressed. So some of the goals are to reduce the fragmentation of AI risk knowledge, increase the readability and the compression rate of this content.
01:20:40
Speaker
And so there's reducing the barrier to entry to these crucial ideas that should hopefully scale the number of people who can understand AI risk extremely quickly. I was somewhat surprised by, although there's a lot of global attention, the number of new experts flooding in has been, I think, very underwhelming. Is that good or bad? Sometimes it's a bad thing if experts are rushing into the new, newly hot idea.
01:21:07
Speaker
I think that if people are onboarded well and have a more comprehensive understanding, if they're basically like charlatans who aren't going to like do their their work, then that's more of a problem. So I think by default, with like another capabilities jump or two, they will flood in. There's basically a question and I don't anticipate they're going to read lots of
01:21:30
Speaker
no less wrong dot com posts i'm not to be on board just gonna start talking and try to be about themselves i say this is that in my time miracle machine learning research it basically should assume that.
01:21:45
Speaker
When some areas start getting pretty hot, there'll be lots of random new people coming in and trying to influence the discussion substantially. Hopefully the people as they come in would have some understanding of many of the basics though, but I think by default it's relatively inaccessible. You'll have to read a lot of scattered content from different places and a lot of it will be idiosyncratic and it'll just take a long time to go through.
01:22:12
Speaker
Those are some of the reasons for doing this. And then also, I think that given that rogue AI is not the only concern or only risk source, there's a lot of content that even a lot of people who've been thinking about AI risk for a while will possibly need to become aware of. So that's why, so just as a graduate student, when I just developed and just focused
01:22:41
Speaker
these other things other than Rogue AI. And then now I think people are recognizing the importance of that. So now there'll be some material to help get a more formal understanding of these other sorts of issues. That's great. I'm looking forward to reading it. I think we should nonetheless talk about Rogue AI. That's your last category of risk. One issue here is proxy gaming. How does that work? How is it dangerous?
01:23:10
Speaker
Yeah, so you can imagine if you've got a very powerful AI system, if it finds reliability holes in the objective that it's given, then this could be destructive because it's being guided by a flawed objective. I think in a colloquial example,
01:23:27
Speaker
is with, I believe in Hanoi, there'd be a bounty for killing rats. And so if you get the rats and you get a bounty, but then people are incentivized to breed rats so as to collect more of that bounty. That would be an example of an objective that you put forward that ends up getting gained. It's fairly difficult to encode all of your values, like well-being and whatnot, into a specific objective, a simple
01:23:57
Speaker
objective. So you might expect some approximation to what you actually care about. In machine learning, a famous example, this is the boat racing or coast runners example that OpenAI had, which was of proxy gaming of
01:24:12
Speaker
There's a reward function and the reinforcement learning agent would optimize that reward function. This was a racing game. You'd think it would optimize the reward function by going around the track. But what it instead learned to do was it can get a higher reward by getting lots of turbo boosts. And the turbo boosts, it could get a very rapid sequence of them by crashing into walls and catching on fire and then continually turbo boosting in that way. And that would help it get a higher score.
01:24:39
Speaker
So there are often holes in these objectives due to an ability to compute exactly the right objective, or maybe we can only monitor some parts of the system. There's a computational and spatial and temporal constraints on the quality of the objective, meaning that you're going to often have to go with an approximation. So something perfectly ideal.
01:25:02
Speaker
This relates to Good Heart's law, which works in human domains also, in which it's difficult to specify exactly what it is you want. Whenever you specify something you want, that thing you've specified is now open to being game. An example here might be that you want deep scientific insight and you assume that such insight correlates with
01:25:28
Speaker
citations or number of citations, but then you get gaming of the citation systems in which academics are incentivized to maximize citations at the cost of scientific insight. So is this a more general problem across all agents, humans included?
01:25:48
Speaker
Yeah, I don't think this is specific to AI agents. I will say that some objectives are harder to game than others. For instance, the bounty on rat tails is a lot easier to game than citations because citations can be very valuable for getting emigrate, getting a green card, for instance.
01:26:08
Speaker
strong incentives to do it. But it's nonetheless challenging. So some of these objectives, even when people are trying very hard to game it, they still can be correlated with a lot. Like college admissions still focuses
01:26:23
Speaker
incentivize people to be productive. Yes, they'll go overboard in studying for the exams and whatnot, the college admissions tests. Yes, they'll go overboard in the number of extracurriculars and whatnot. But I still think it can help shape compared to there not being the incentive in the first place. I think, overall, my take on a good explanation is that there's some objectives that are, or some goals are, all goals and proxies are wrong. Some are useful. And some, though, when gained in particular ways, could be potentially catastrophic.
01:26:53
Speaker
So there's quite a variety. There are some objectives as well that people would claim would produce good outcomes. For instance, if you gave an AI an objective, like make the world the best place it can. And if that was actually the objective, you gave it.
01:27:09
Speaker
Okay, that's quite different from make people very engaged with this product. That's quite different. I think that making these proxies
01:27:24
Speaker
incorporate more of our values becomes more possible across time, because the systems can represent these other sorts of notions of say, well being of autonomy, because they have a lot better of a world model and more of an understanding of people as well. However, so I think that getting
01:27:45
Speaker
objectives that are in the right direction seem possible. The issue is making them be robust to adversarial pressure. I'm not as concerned about like, we get telling AI go cure cancer, and then it does something like, oh, I'll give lots of people cancer to experiment on them to speed up the experimentation process. This is easily ruled out by some like objective with like an interpret the request as a reasonable reasonable person would.
01:28:10
Speaker
This is a fairly new development in AI that we now have these large language models that can, at least to some extent, understand common sense and have a more subtle understanding of human values.
01:28:25
Speaker
Yeah, earlier, the AIs, they would be kind of like savants where they understand some particular thing well, but then nothing else. And human values are so late in the evolutionary process, it suggests that they're very late to be one of the last things that AIs learn. But that fortunately wasn't the case. We explored this a few years ago in the paper with the ethics dataset. We're basically using that to show that
01:28:54
Speaker
Look, they've got understanding of various morally salient considerations. Here's their predictive performance on well-being things. Here's their understanding of deontological rules and notions in justice and fairness, such as whether people get what they deserve or whether people are being impartial. So they have an understanding of a lot of
01:29:16
Speaker
uh, morally salient considerations. There is a question of reliability though. If they're optimizing that objective, are they basically, is that objective succumbing to that adversarial optimization pressure? If it's optimizing it, it's basically functionally similar to it being adversarial to that objective. This is why there's a focus on adversarial robustness because
01:29:39
Speaker
Later, we've got an AI agent that's given a goal, and this AI system is outputting whether it's succeeding by the goal or not. So we've got an AI evaluator, and we've got an AI system that's optimizing that goal. This AI evaluator, you don't want that being game. You want that AI evaluator being adversarily robust, robust to optimizers trying to say that it's doing a good job.
01:30:02
Speaker
that's the sort of threat model later stage. And that's how some of these topics that were explored in vision and whatnot end up. And now finally, with the large language models, the tax paper, which you can make us read about that in New York.
01:30:17
Speaker
times where you can jailbreak and manipulate these models with little adversarial suffixes. In a later stage, we'd have AI systems evaluating other AI systems, and those AI systems that are evaluating are implicitly encoding an objective, and you want those to be adversarially robust.
01:30:33
Speaker
So, adversarial robustness is not an easy problem to fix. And if you don't fix an issue, then you might have some AI systems just gaming the system and going off, optimizing an objective aggressively that is not what we want.
01:30:48
Speaker
Is there a problem here with the concept of maximization? It seems to me that it would be less dangerous to tell an AI system, go earn a million dollars on the stock market than to tell it, go earn as much money as possible on the stock market. Could we cap the potential negative impact by capping the goal also?
01:31:12
Speaker
I think that's one approach. You could imagine, conceptually, a variety. You could have satisficers where they basically are like, eh, and now I'm good to go. I don't need to keep optimizing this aggressively. There is the possibility of not giving them open-ended goals or very ambitious goals would make them less concerning, more constrained ones. But there is adversarial buzzness would be one. There's also be anomaly detection.
01:31:43
Speaker
is something that's researched quite a bit in vision. I've had some part in trying to have the research community focus on that. And I imagine anomaly detection will be very relevant, again, when we're trying to monitor the activities of various AI agents. Are they doing something suspicious here while they're being monitored? Are they kind of adversarially trying to make the monitor think, oh, it's doing the right thing. So we'll need anomaly detection, too, to detect if there's some proxy being gained.
01:32:11
Speaker
That can reduce our exposure to that risk. There's also having some held out objectives of which the AI agent is unaware that it's being evaluated against. And that can also do things like reduce the risk of it going to extreme and optimizing the idiosyncrasies of the evaluator.
01:32:33
Speaker
But this is a problem. I think that most of the problem right now, though, if we have large language models trying to optimize a reward model that judges them, they can do that and they eventually start to over-optimize it. Although the optimizers that are much more effective at breaking machine learning models are actually just straight up adversarial attacks compared to neural models that are taking multiple steps
01:33:01
Speaker
iterating on their outputs. The generic gradient-based adversarial attacks are just much more effective. So I think of the risks of gaming, I think we need to do more just to address the typical adversarial robustness issue. Gold drift is a somewhat related issue where the AI's goals shift over time and the AI might come to take an instrumental goal as an intrinsic goal.
01:33:28
Speaker
How could this happen? It's still a bit unclear to me how an instrumental goal would become intrinsic over time. So to start out with, an intrinsic goal is something that you care about for itself. That could be something like happiness or pleasure. For some others, they could say, maybe friendship, you'd say, I care about that in itself. You might care about your partner's well-being.
01:33:54
Speaker
not because it's useful to you, but you care about their well-being in itself. And then there are other things that are just instrumental for achieving those intrinsic goods such as like money. Money lets you buy things so that you could have higher well-being or a car.

Intrinsic vs Instrumental Goals

01:34:10
Speaker
It gets you from point A to point B.
01:34:11
Speaker
However, some people have intrinsically, to use a sort of more Boston phrase, some of these instrumental goals. Some people actually just directly want money, even to a point where it doesn't make sense. Or power. Many people are just like, they want power. Even if it harms other parts of their well-being, they're willing to make that type of tradeoff.
01:34:36
Speaker
So they might latch onto these cues and develop some of the wrong associations.

Unintended AI Goals and Behaviors

01:34:43
Speaker
So we see that in people, and there's a risk that AI systems might develop those wrong cues as well. Gold drift could happen in some other types of way too, where if you have multiple different agents, they might interact in some unexpected way, and then a new goal starts to drive their behavior.
01:35:01
Speaker
An example, we can see this in basic AI multi-agent situations. It's not catastrophic, of course, because we're still here, but in some AI society, in some Stanford paper from earlier this year, the AIs start talking with each other, and then they start arranging social structures that they're going to have a, they're going to throw an event.
01:35:21
Speaker
At some person's house then and then this starts to then they start acting and in all these ways to make sure this type of thing happened. And then these sorts of things start to be what drives their behavior. That's another way in which things can end up drifting, not necessarily through having something be intrinsic, but there could be these emergent goals from interactions that end up driving behavior.
01:35:40
Speaker
Certainly, there are many emergent things in society, things that become new, and this isn't the goal that I originally had when I was 10 years old, but now some of these things end up driving my behavior quite substantially. So if we have adaptive AI systems, and if they end up responding to each other, then you could have some emergent complexity happen, and those interactions, that behavior starts driving the overall group behavior as they're imitating each other, as they're responding to each other,
01:36:10
Speaker
So it's basically multi-agent systems be very difficult to control in the single agent one, you'd have to worry about there being some wrong association between an intrinsic and instrumental goal, like money or power. And that could mean if that does happen, if basically something wrong gets intensified,
01:36:29
Speaker
then you're in a very dangerous situation because then your AI has a goal that's just different from what you wanted.

Risks of AI Goal Divergence

01:36:36
Speaker
And so then it will, to get that goal, it will optimize against you. It will respond adversarially. It will resist your efforts to shut it down so that it can achieve that goal. So although it's not something that necessarily happens by default or with extremely high probability, if it does happen, then you've got a substantial tail risk in front of you.
01:36:55
Speaker
I wonder whether these AIs will persist for long enough for Goldrift to happen. So normally we retrain models every couple of years, we switch out for the newest ones. And so it's not like a person that has 30 years to change their values. Will they last long enough for Goldrift to matter?
01:37:16
Speaker
So I guess two things.

Adaptive Goals in AI

01:37:18
Speaker
One is the world to move substantially more quickly in the future, such that often in these more pivotal periods, I don't know if it was Lenin or something like that, there are decades in which weeks happen and then there are weeks in which decades happen.
01:37:33
Speaker
even if there is a high replacement rate in the population this goes on in a much lower process they can still end up constructing things that end up causing their goals to be different like the let's say they develop some different type of social infrastructure for mediating their interactions on their new companies being formed in their end up driving many of them.
01:37:51
Speaker
then those features of the environment would end up affecting the generation that comes after it. So you could still imagine some type of drift, some intergenerational drift, but if each generation is very short, you can still imagine some type of goal drift in that way. This is kind of, think of yourself. Many of the goals, the intrinsic goals that you have or intrinsic desires that you have are completely unlike those when you were younger.
01:38:14
Speaker
if the even tasting food on the things you care about it may be required sports your your taste in music. Affiliations all of these things that are changing across time and so and they can also go away to some of the things you care about like i care about this person's well being for themselves then you break up with them.
01:38:35
Speaker
now I actually don't care about their well-being in itself. I don't have that strong of a feeling toward them. So adaptive systems carry this type of property. This is one way in which they end up gaining some goals that we didn't intend either through some emergent goal from the product of various interactions or through them intensifying some instrumental goal like power. They end up
01:39:02
Speaker
having too strong an association with that and reward and then just end up seeking the power itself. Could gold drift be a good thing? So we wouldn't want to fix human values from the year 1800, for example. You could describe our changing goals from back then to now as a form of gold drift, where people from 1800 might disagree violently with whatever we believe now, but we still probably think it's a good thing that we've changed our values.
01:39:31
Speaker
Yeah, could it be good and could we learn from the AIs? Yeah, so I think this is a good point in what makes thinking about AI risk generally a lot harder. As we mentioned earlier, there's this balance issue with malicious use that because you'd be concerned about unilateralists misusing AIs or rogue actors misusing AIs that we should then centralize power. But then you end up getting some other existential risk of lock-in, of concentration of power.
01:39:58
Speaker
And then I think likewise, in this case too, you can't have a society in complete stasis.

Balance Between Control and Growth

01:40:04
Speaker
And as it would be driven by new emergent type of structures, you should still try and make sure that you have some control over that process or reasonable control over that process. It seems if there's not much control, then I think it's likely to slip from your hands. But otherwise, so there's basically one will have to strike a balance between
01:40:27
Speaker
some very chaotic state where they're running wild, and some stasis. And this is just a continual issue in many areas of evolving groups. Yeah, that would also be a problem of if there'd be too much entrenchment, if there isn't an ability to have adaptation of the things that we care about.
01:40:47
Speaker
Yeah, so anyway, that's there's some dissonance aren't simple answer. This is why it's will be a balancing act. And as also why I don't expect a part in particular, a single solution to solve everything for all time, I will need to respond will need institutions and structures and control measures that respond to the features of the environment and calibrate according. Why could AI become power seeking?
01:41:15
Speaker
So this is a very, I think one of the main sort of AI risk stories would be it becomes power seeking.

AI Power-seeking Behavior

01:41:24
Speaker
I'll make a bit of a case for it and I'll speak about some issues with it too. You could imagine a person gives an AI system a goal, like, go make me a lot of money as an instrumental goal. Scaining a lot of power seems like a very helpful way to accomplish that higher level goal.
01:41:45
Speaker
There's a concern that when you specify a goal that there'll be some sub goals that are to correspond to correlated with power and You'd want to make sure that you can control that those those tendencies So that's one of just being when you're just directly giving an AI a goal It may have a goal that's correlate with power, but that's that's not
01:42:06
Speaker
Is that terribly unexpected? We will give them goals that relate to power quite a bit. Militaries will probably build AI systems that are fairly power seeking, and so we should expect some amount of AIs that are pursuing power either as their main goal or as one of their main sub-goals.
01:42:25
Speaker
And maybe power seeking to a limited extent is okay. Basic feature of accomplishing many of these sorts of goals. For instance, the fetch the coffee one, if it's instructed to fetch a coffee, it would have an incentive to preserve itself because it can't fetch the coffee otherwise.
01:42:41
Speaker
And but you might want to curtail some of those tendencies so that those don't get out of hand. But that would be a we've had a paper at ICML earlier this year where we're deliberately giving it penalties to penalize some of these these tendencies that it has when it is trying to seek its reward. It starts having incentives to accrue resources and things like that. And then can we have it more acquire the resources that are more minimal to accomplishing its goals?
01:43:09
Speaker
Can we have it engage in less power seeking behavior? So I think that that's something that we can offset, but we'll need to make sure that we have good control measures for that to keep that keep that in check. There's also the so that's one of just people directly instructing it with goals that are by default probably going to be pretty related to power. And there's also maybe they would intrinsically care. Let's say that they had some random goal. It's like a paperclip maximizer. You're you're sampling from
01:43:35
Speaker
use the old verb is your sampling from mind space that has a random set of desires and whatever the set of desires that would end up trying to seek a substantial not part of some.
01:43:46
Speaker
one claim, but I think that has to be something more rigorously argued. I would like to note that I think that a lot of those power-seeking arguments, I don't think it works as well as I thought it did, the arguments associated with them. I still think it's a relevant thing that we'll want to control the sub-goals of AI systems to make sure they're not
01:44:11
Speaker
not too strongly related to power and that there's nothing unexpected going on there. So, for instance, people might argue for power seeking by saying, like, well, power is instrumentally useful for a broad variety of goals, therefore it will seek power if it's trying to accomplish any sort of reasonable goal. And you'd ask them what power is, and then they'd say, power is what's instrumentally useful for accomplishing a wide variety of goals and go, OK, well, that's a tautology. So we need to be more careful. What exactly are we meaning by power here?
01:44:42
Speaker
Separately, there's often a bit of that that's one like, slightly bug that lurks in the background is that they'll define power in terms of instrumental stuff and then it's technological. Another issue is that there's sometimes a conflation between power seeking and dominance seeking.
01:45:00
Speaker
Those are not the same thing. When the AI is trying to fetch the coffee and is engaging in self-preservation to do so, it's not necessarily, therefore, trying to take over the world. So saying that an AI is power seeking is not necessarily existential. Indeed, you could imagine various ways in which other powerful actors engage in power seeking behavior but don't try and seek dominance. So for instance, different countries
01:45:29
Speaker
in trying to increase their own power to preserve themselves. This is the sort of thesis of neorealism or structural realism. And what happens is they will basically, many states will just try and keep power relative to many of their peers. If Germany, for instance, tries to take, it's seeking power to protect itself, but if it tries seeking power at the level of a global domination, it will be met with force. There will be balancing from other peers. So when we're in a multi-agent situation,
01:45:56
Speaker
Then it doesn't necessarily always make sense for AI systems to try and take over the world because of the other AI agents would be that will support my preferences or goals and desires. So I will counteract you. Balancing in international relations is what this is called. That's the thing that can offset dominant seeking. So it's not necessarily because the power seeking is dominant seeking and trying to take over the world. An additional point is that
01:46:17
Speaker
We can partly influence the dispositions of AI systems. Sorry to say we can do that. We can make these like have dispositions to be a good chat bot or be a good assistant. Now, how strong is that? It's not perfect.
01:46:33
Speaker
But if it were given a task like, hey, go accomplish this, go accomplish some goal for me, if it would think, well, you know, the best way would be I could accomplish this goal better if I were extremely powerful and took over the world. But that may not be in keeping with its values necessarily.
01:46:53
Speaker
So it may have some tendency pulling in that direction, but you could also give it some dispositions to pull it against it. And that might be sufficient to offset some of these tendencies toward power. Even if there is some incentive there, it may not be enough to overwhelm it. So a lot of this discussion about instrumental convergence needs to think about the balance between these forces. And they would need to argue basically that the instrumental drive is extremely strong to overwhelm fine tuning and all of these sorts of things.
01:47:21
Speaker
which I don't think that there's much of a specific argument for that. I want to highlight here, Joe Carlsmith has a great report. I think the most rigorous argument for why power seeking in AI could be existentially dangerous. So just for listeners who are interested in what I think is the best argument for that out there.
01:47:40
Speaker
I agree. He helped popularize the power-seeking phrase as well. And I think that by focusing on power, that helped us integrate this into some other academic discussions, like power versus cooperation. What I was describing here, just a moment ago about balancing, was that we can take a cue from the international relations literature of seeing, well, power-seeking agents, when that's one of their main goals, that doesn't necessarily turn into them trying to seek domination.
01:48:07
Speaker
Another thing is that in Bostrom, in super intelligence, there's also a sort of part slide of hand, not intentional, but I suppose maybe an accident, where he's saying that power makes you better able to accomplish your goals, therefore, they will seek power. That's saying that something is helpful, if you have it, that doesn't mean that it's rational to seek it. So although there's an incentive for it, that doesn't mean it's instrumentally rational to pursue it. So for instance, it would
01:48:32
Speaker
If we run the argument in a different way, it would be helpful for me to be a billionaire. That doesn't mean that it's rational for me to try to become a billionaire. That would carry a lot of risks. That would take a lot of time. The existence of incentives aren't necessarily enough to say that that's what will be driving their behavior or is the first approximation of their behavior.
01:48:57
Speaker
And I think that there are other ways in which just power seeking doesn't emerge or dominance seeking doesn't emerge. If you give it some goals, like, obviously, if you say, you know, shut yourself off, or if you give it a goal, like, don't seek power. These are obviously counter examples for that just to show that this isn't like a, you know, it's not a law of all AI systems that they will try and seek power. Separately, if you give it a more goal, like, go fetch the milk, it could try and take over the military to put up a you know, a
01:49:25
Speaker
a motorcade to make sure that can get to the store very quickly. But if you had some time penalty or something, this would not necessarily be the thing to do. So instead, just go fetch the milk would often be the best way of getting the reward instead of some very circuitous path.
01:49:41
Speaker
Now, so I do think that there is a risk of if you have AI agents that are not protected and autonomous, you could get power-seeking type behavior. For the same reason that states try to shore up their power, they shore up their power because there isn't anybody they can call on for help if they're getting attacked necessarily. Like if the US starts getting attacked,
01:50:03
Speaker
Maybe some countries will come, but there isn't a police force that will settle the issue. So the best they can do is try to short power to defend themselves so that they can't be pushed around like that. So we have a non-hierarchical or quote unquote anarchic international system. And that incentivizes agents to seek power, to preserve themselves, to pursue whatever their goals are.
01:50:26
Speaker
And you could imagine if AI systems are not protected if they are part of say some crime syndicate or if they're rogue they're unleashed somebody unleashes them then those systems would actually have a very strong instrumental incentive to seek power in the same way that states do that if they want to protect themselves from some potential adversaries that can harm them.
01:50:49
Speaker
there isn't somebody to call on. They can't ask the US government, if there are crimes, they can't say, US government protect me, I'm getting harm. That is not a possibility to them. So what they have to do is they have to take matters in their own hands and accumulate their own power. So what I've done is I've sort of flipped things a bit. There'd be the usual argument that AIs might be power seeking just by their inherent nature, by the inherent natures of goals and optimizers and things like that. But I've instead mentioned that one source of power seeking is humans give them some sort of goals that are very correlated with power. And then there might be some unexpected stuff that happens in their subgoals.
01:51:19
Speaker
And then the other thing I've done is I've mentioned how the structure of the environment that they're in, some structural reasons for why they might end up seeking power to. I'm not as sure about them having an intrinsic one or internal reason for power seeking, but I think goals being given intentionally or the structure of the environment that they find themselves in. It's a sort of cage that they're locked in. There's really nothing they can do if they're wanting to accomplish their goals other than to part, invest a lot in protecting themselves, would also incentivize them to seek a substantial amount

Environmental Influence on AI Power-seeking

01:51:49
Speaker
of power.
01:51:49
Speaker
So I do think power seeking is a concern, but not for the same reasons that other people are giving, like we're going to randomly sample a mind from mind space, we'll be very alien and by way of almost any desires, it will necessarily try to seek dominance over humanity. But I still would be concerned about power seeking. How concerned are you about deception arising in AI?
01:52:12
Speaker
I think that the contribution of focusing on deception was useful because we now see that AIs have to some extent some representation of morally salient considerations as we explore in the paper aligning

Deceptive Behavior in AI

01:52:28
Speaker
AI with shared human values and I clear maybe 2020 or something.
01:52:31
Speaker
where we measure that and show that, and by now it's obvious because it's in chat bots and people can ask its moral questions, but they have some capacity for that. And the deception part focuses on maybe they're actually, although they maybe understand the goal, they don't necessarily feel inclined to pursue it. So in psychology, this is a distinction between cognitive empathy and compassionate empathy.
01:52:57
Speaker
Cognitive empathy psychopaths have. They could understand and predict what people will end up feeling in response to various actions. They have a very good predictive model of people's feelings and their emotions and what they think is valuable. Meanwhile, if they have compassionate empathy, that's when they feel motivated to do things by it and help people realize those values. So there's a distinction that they would have cognitive empathy, but not necessarily compassionate empathy.
01:53:26
Speaker
And so if they're deceptive, they could basically play along. They could be like, yeah, I don't actually care about you, but I'm going to act like it to get my goals accomplished, as psychopaths do.
01:53:36
Speaker
And here, maybe we should mention here how the drive of deception arises from the way that we are doing reinforcement learning from human feedback or how it could arise from that. So in the Machiavelli ICML paper, we saw instances of them doing deception because it simply helps them accomplish their goals better by default.
01:54:00
Speaker
So many environments just incentivize the type of behavior. If they have some type of misaligned goal from us, then they could bide their time and wait to come to power to take a quote unquote treacherous turn. So it could just be very strongly incentivized by some type of training process like by just seek more reward, deception can often be a good trick when you're monitored, behave nicely when you're not monitored, switch your behavior, behave in a more cutthroat way.
01:54:29
Speaker
That's how a deceptive behavior can be a concern or some Machiavellian type of behavior. And we that there are instances of this. You could imagine as more a non agentic case with chat boxes if they're
01:54:47
Speaker
being given human feedback, maybe they'd have an incentive to say very agreeable answers to people, things that they'd say, Oh, that sounds good to me, even though it's if it's not necessarily true. So that's how even, you know, chatbots might be incentivized to be in a somewhat deceptive direction. But we can also see this in agents just that often helps them accomplish their goals.
01:55:08
Speaker
Also chatbots might learn to recognize the ways in which they're telling bad lies, let's say. The obvious things they're saying that are false are penalized, whereas the more sophisticated ways they might be telling falsehoods are not penalized.
01:55:28
Speaker
Yeah, good. Yeah. So this is, gets it like in, in a lot of like repeated interactions and whatnot, deception often emerges in the evolution paper from, from the last time I was here, we spoke about how a deception can often be and concealment of information can often be an evolutionary stable strategy and that there are many instances of deception in the environment. So it's a fairly difficult thing to blot out when you try and control for it. You often end up selecting for more deceptive behavior.
01:55:58
Speaker
At the same time, we do have progress on this, though, where we can in a recent paper we submitted or in a recent paper we uploaded to archive called Representation Engineering, a top down approach to AI transparency.
01:56:17
Speaker
There we have instances, many instances, it's not that difficult to control by manipulating the internals of the model, whether or not it's lying. It has an internal concept of what is accurate. We can find a truth direction, we can add, subtract the direction or something of that sort, and then that can cause it to spit out incorrect text. And we have other more sophisticated control measures too, but we can manipulate internals to do that. So it's within the capacity of AI systems to lie and be deceptive. We have another paper on that called
01:56:47
Speaker
If you search AI deception and then maybe my name or something then you'd see that paper. So many instances of AI deception already, but we do have some traction on this problem. So fortunately, there'd still be the issue of having more reliable lie detectors and being able to control them to be more honest or output their true beliefs. So there's definitely much more work to be done, but we're at least not helpless. We don't need to wait another 30 years for
01:57:14
Speaker
interpretability research to get to a state of being able to start to rush against the question. We now have some ability to influence whether AI is live by controlling their internals. And so that makes me more optimistic about dealing with this problem, but you don't want to do premature celebration. I don't know how much time we'll have to continue
01:57:38
Speaker
getting those detection measures and those control measures to be highly reliable. So that'll depend on having a lot of researchers who can research with these cutting edge, very large models to make progress on.
01:57:54
Speaker
Yeah, the representation engineering paper was super exciting. Maybe you could explain at what level does representation engineering work because it's different from mechanistic interpretability.

AI Transparency Challenges

01:58:06
Speaker
It's more high level, which is what we're after in a sense. We are after the high level emergent behavior in these models.
01:58:15
Speaker
Yeah, I was mentioning compassionate empathy and cognitive empathy because it's a bit of psychology, but I think it's trying to do something more like a project like AI psychology or a cognitive science to think what we should be trying to do here. So in the case of this representation engineering, that's I think we're trying to be the analog of that.
01:58:32
Speaker
where we're given these high-level representations of truth and goals and things like that, can we make it be so that it actually outputs its beliefs or what it says it believes is actually what it believes? For that, you need to have a handle on these very high-level concepts so that they're not psychopathic, so that we can control their dispositions to behave and have things like compassionate empathy.
01:58:56
Speaker
Meanwhile, I think the mechanistic stuff is looking at a much lower level. It's looking more at the substrate, at the neuron level, at the circuit level, at the node to node connection level. And that's maybe closer to something like neurobiology. But then what we're doing is more like trying to study the mind as opposed to trying to study the specific structures in the brain and the connections between them and how that gives rise to phenomena. So I think philosophically,
01:59:18
Speaker
I had tried many times to do a paper on transparency historically, but it wasn't a good angle of attack. But in my view, it would take too long. But I think if we do it in a more top-down type of way, where we try and here's the eyes of mine, let's try and decompose it into some like
01:59:38
Speaker
front representations that drive a lot of its behavior and maybe decompose those further and further. Basically, we have a big problem of understanding an AI's mind. Let's break it up into subcomponents and try and get a handle on those and control those. I think that approach might be more
01:59:54
Speaker
efficient at reducing risks of AI deception, then building from the bottom up understanding, you know, this is how it answers. This is the circuit in it that lets it understand multiple or identify a multiple choice question. And then this helps it select the whether to output the full question, the full response back or whether just to select A, B, C, or D, you know, things like that.
02:00:15
Speaker
You can build those up, but that might become very complicated in time. So I think it might make sense to not work from the bottom up, but go from the top down. There are analogs of this type of approach in cognitive science. People would initially try and just study things at the synapse level, but it can often be more fruitful of trying to understand things at the representational level. What are the high level emergent representations that are a function of all the population, of all the neurons in the network?
02:00:44
Speaker
and try and try and understand things that low. Now, there's, of course, a risk of like, well, maybe there's some funny business that gave rise to that representation. And that's true. We could still do things to reduce that risk by like, trying to understand the representations at various layers in the network and try and decompose the system further and further so that there isn't much room for funny business or deception.
02:01:10
Speaker
But so that's that's it. At a high level, it's not viewing neurons as the main unit of analysis is you viewing being representations as the main unit of of analysis and neurons are relevant insofar as they help us predict
02:01:26
Speaker
and explain what's going on in representations. But those are more of a, that's sort of just the substrate. It's a comment on the substrate in the same way that if we have a computer program that plays Go, if I'm reasoning about the Go program, I'm just probably going to be thinking about Go strategies.
02:01:43
Speaker
when I'm playing against it. I don't need to think at the software level, like, well, where do you think it, what layer do you think it's at right now? Or what TensorFlow objective function did AlphaGo end up optimizing here? Maybe some of the examples. We don't need to analyze at that level. We certainly don't need to break it down at the level of assembly. We don't need to reason about assembly to try and understand its behavior. So I think that there's
02:02:06
Speaker
some emerging complexity inside of neural networks. We can study it at that level, and it's studying at that level is fruitful because there's an emergent ontology and some coherent structure inside of that, which you would end up getting lost in the details when you end up zooming in further to the neuron level.
02:02:28
Speaker
Although it's possible in principle, it's possible in principle to explain everything in terms of that, just like it's possible to explain the economy in terms of particle physics.
02:02:37
Speaker
computationally, you could do it, but it doesn't make sense to study it at a level. This isn't to say that their mechanistic interpretability and representation engineering are completely loose and separate. There's probably overlap just as like in biology and chemistry, they have some overlap, but you wouldn't try and understand biology just through chemistry. And I think if you're trying to understand representations, I don't think you're necessarily just going to try and understand everything.
02:03:01
Speaker
through neurons and node-to-node connections and specific execution pathways and treat it like a computer program, but instead something more like a mind with loose associational high-level representations.
02:03:16
Speaker
Yeah, so take a cognitive trait like honesty. Do we know anything about how that's distributed across the model? Is there like a center, a cluster of the weights in which this is now representing honesty or functioning as the honesty module, or is it more distributed across the whole model?
02:03:40
Speaker
Yeah, neural network representations are highly distributed, which makes sort of trying to bolt down and pinpoint specific locations of a lot of functionality a lot more difficult, as well as the interactions between all these components too, can end up giving rise to a lot of complexity. Imagine that you understood a neuron, and this detects a whisker at 27 degrees, and this other neuron detects an upper corner of a fire hydrant.
02:04:06
Speaker
you can you can i can i if you can understand these millions of neurons that gets you some way but are you really understanding the collective overall emergent behavior of the system. That's that doesn't necessarily follow so i don't think it's enough to understand the lowest level parts to understand the the overall system and it's a collective function.
02:04:23
Speaker
But it can be helpful. It can provide some types of insights. In the case of honesty, though, I find that it's a direction and its beliefs about what's true or not are directions in its representational space. And it doesn't seem to be located at a specific neuron. So when we are adjusting the representations through various control measures that we propose, then we can actually end up manipulating it.
02:04:47
Speaker
Partly, this paper is a bit more philosophical in what's the sort of paradigm. What's the strategy that we're wanting to proceed in making AI systems transparent? The representation level is going to be a very fruitful way. I should note that
02:05:09
Speaker
But it'd be useful to diversify over research agendas and things like that. Hopefully we'll get more reliable control measures and be able to modify relatively arbitrary parts. We'll have success when we can like inside of the AI system, when we have better ability to sort of read their mind or understand the representations, if we could use it for like knowledge discovery, then we've known that our methods are fairly good because they're probably going to pick up some observations about the world from the big pre-trained distribution that
02:05:38
Speaker
no individual knows or that many individuals don't know. And so if we can get better tools like that, then that would be a late stage sign of success.
02:05:47
Speaker
And it seems like we have a better shot at success here than neuroscience on humans, because we have such fine-grained access to... It's as if we had a human brain spread out with full access to what all of the neurons are doing. Or do you think that's right? Do you think we have a better chance of success compared to traditional neuroscience?
02:06:10
Speaker
Yeah, yeah, certainly. I think the sort of mechanistic interpretability would claim this claim this as well, that since we have access to the gradients, we have rewrite access to every component of it. This allows for much more controlled replicable experiments and substantial ability to do science that the barriers to entry in cognitive science.
02:06:31
Speaker
Many of them are removed. There's also, this might get easier in time, what makes this now possible, whereas previously it wasn't. If you use models like GPT-2 or below, the representations are not very good, quite incoherent. But as we use larger models like LAMA-2, pre-trained on many more tokens, they have some emergent internal structure that actually starts to make some sense, and directions that are correlated with coherent concepts that humans have.
02:06:58
Speaker
I think earlier it's more like a shibboleth, but now since there is some coherence to it, it's not just a big causal soup of connections. So this is why I think, unfortunately, this wasn't something that we could have particularly done in like 2016 and is very much something that's possible now that previously wasn't. Dan, thanks for spending so much time with us here. It's been very valuable for me. And I think it will be for our listeners too. Great, great, great. Thank you for having me. Have a good day.