Become a Creator today!Start creating today - Share your story with the world!
Start for free
00:00:00
00:00:01
Steve Omohundro on Provably Safe AGI image

Steve Omohundro on Provably Safe AGI

Future of Life Institute Podcast
Avatar
219 Plays1 year ago
Steve Omohundro joins the podcast to discuss Provably Safe Systems, a paper he co-authored with FLI President Max Tegmark. You can read the paper here: https://arxiv.org/pdf/2309.01933.pdf Timestamps: 00:00 Provably safe AI systems 12:17 Alignment and evaluations 21:08 Proofs about language model behavior 27:11 Can we formalize safety? 30:29 Provable contracts 43:13 Digital replicas of actual systems 46:32 Proof-carrying code 56:25 Can language models think logically? 1:00:44 Can AI do proofs for us? 1:09:23 Hard to proof, easy to verify 1:14:31 Digital neuroscience 1:20:01 Risks of totalitarianism 1:22:29 Can we guarantee safety? 1:25:04 Real-world provable safety 1:29:29 Tamper-proof hardware 1:35:35 Mortal and throttled AI 1:39:23 Least-privilege guarantee 1:41:53 Basic AI drives 1:47:47 AI agency and world models 1:52:08 Self-improving AI 1:58:21 Is AI overhyped now?
Recommended
Transcript

Introduction and Guest Welcome

00:00:00
Speaker
Welcome to the Future of Life Institute podcast. My name is Gus Stocker and I'm here with Steve Omohondro. Steve, welcome to the podcast. Thank you so much.

AI Safety and Early Interests

00:00:09
Speaker
Really excited to be here. Max and I wrote this paper recently and it's making some waves. He just gave it to Netanyahu, which is very exciting.
00:00:18
Speaker
And I'm very interested in AI safety. I go way back. I have a PhD in physics from Berkeley. And when I finished my PhD, I switched to doing AI, and then had been with various companies and research labs. I was a professor for a while. And in the early days, it was mostly all about AI capabilities.

Realization of AI Risks and Dual Focus

00:00:39
Speaker
And then around 2000, started realizing that there were potential dangers and risks associated with AI.
00:00:45
Speaker
And since then, I've been doing both the capabilities and the risk side of things. And I think it's just really heating up in recent years. And so we've been really thinking about ways to mitigate some of those risks. And Max and I sort of had a meeting of the minds and we'll talk about what the particular story that we're putting forward. So you were basically a pioneer of AI safety. What made you come to this field so early?
00:01:10
Speaker
Early on, I thought AI was an unabashed good. We get robots to do the dishes for us, and it'll be great. We increase productivity. We solve all of human problems. And the companies I was in were also very gung-ho and positive. And then I was at research labs. The last one was the NEC research lab.
00:01:30
Speaker
and decided, I had a girlfriend that wanted to go to Silicon Valley, and so I decided to come back. And I had a friend out here, Eric Drexler, who was working on inventing, basically, nanotechnology. And he realized very early on that nanotechnology has a whole bunch of risks. So he started this Foresight Institute with Christine Peterson.
00:01:51
Speaker
And they were focused both on how do we do nanotech and what great things can it do and what are the risks and how do we mitigate them? And so I got sort of in that circle and was thinking a lot about the risks there. And one of the things I was thinking about was how do we use AI to manage the risks of nanotechnology?

Self-Improving AI and Basic Drives

00:02:09
Speaker
And early on, mathematical theorem proving was a piece of that story in my mind as a way to control these potentially teeny molecular machines and so on.
00:02:20
Speaker
But I wasn't really thinking about AI as a problem at that point until it was around 2000, I think, maybe a couple of years later, that Eliezer Jukosky came out and he called a meeting of the whole Foresight Institute. And he said, I have something very important to say. And everybody's like, wow, who is this guy? What's going on? What's here? And he says, AI is going to kill us all. He gave this story, but it was very abstract and hard to grasp. I don't think most people really got what he was saying. But something in what he said really stuck with me.
00:02:48
Speaker
So I started analyzing. The particular approach to AI I thought was really important at that time, actually I still do,

Rapid AI Advancements and Current Challenges

00:02:54
Speaker
is systems which can model their own behavior and improve themselves. So self-improving systems is what I was calling it. And his story and his way of thinking about it caused me to think, okay, when you have a self-improving system and you keep improving yourself,
00:03:11
Speaker
Where does it go? What does it lead to? And then that lead led to something I called the basic AI drives, which is realizing that regardless of what your primary objective is, there are a whole bunch of sub goals that sort of support almost every objective. Not dying is a really important one. Getting more resources is important. You know, a whole slew of those kinds of things. And so I did a formal analysis and put it in the context of rational agents and then published papers and then that generated a real flurry of interest.
00:03:41
Speaker
And so, this was in maybe the mid-2000s. I did a TEDx talk and there was a circle around there. Eliezer started what at that time was called the Singularity Institute for Artificial Intelligence. It later became MIRI, the Machine Intelligence Research Institute.
00:04:02
Speaker
and lots of thinking and excitement. But AI was not going very fast at that time. I think most people felt like, oh, real AI will be 100 years off, something like that. And then in 2012, oh my god, that's when the deep learning revolution started and GPUs. And then 2017, the dam broke. And so it really appears like we're at a really critical moment. So we will return to the basic AI drives and the notion of self-improving
00:04:31
Speaker
AI later on in this interview. But I think where we want to start is to talk about this paper you alluded to in the beginning, which is provably safe systems. So how would you summarize the approach that you and Max take in this paper? Well, so we start from the position that AGI appears to be imminent, you know, depending on who you talk to, the different amounts of time.
00:04:53
Speaker
and that it has a potentially existential risk for humanity. And there are a bunch of groups now, Center for AI Safety just came out with a nice long paper, kind of laying out huge numbers of risks, things like bioterror or military AI and so on.
00:05:10
Speaker
And so we really have to deal with this. And if the timelines are short, as some people think, this is an urgent issue. And the current approach to stopping this is alignment, which is trying to make our AIs aligned with human values.
00:05:24
Speaker
great noble thing, but we argue that that is not going to solve the problem. And we can go into the reasons for that, why a little later. So we need something else. And so our idea is what we really need are guardrails around these risky things. So say you have a DNA synthesis machine, we need a mechanism to prevent random AGIs on the internet from synthesizing whatever DNA they

Security Measures and Bioterrorism Risks

00:05:47
Speaker
want. We need to make sure that they're only doing something which is actually safe.
00:05:51
Speaker
How do we check that this is agi systems potentially will be extremely capable at manipulating people breaking into systems and so on we need an unbreakable mechanism for preventing.
00:06:06
Speaker
unsafe use of equipment. And so using mathematical proof and then hardware equivalent of that, which is sort of physical cryptography, we propose a whole system. Pretty complicated, but we believe, unfortunately, that it's really the only way to go, that if you don't have something essentially equivalent to what we're proposing, then there will be pathways whereby an AGI can break in. And if we're in an adversarial
00:06:30
Speaker
interaction with an AGI, then it's toast. So we believe this is really some variant of what we're proposing is really the only way forward. And the good news is I think every piece of it either has something existing today or is sort of on the in the pipeline for the fairly near future to implement. You mentioned as an example of what could go wrong in an AI disaster, this AI enabled bioterrorism. Maybe you could go through that example.
00:06:58
Speaker
Yeah, I think it's a good example. It's from the Center for AI Safety paper. They kind of have some scenarios in there. One of the great things of recent large language models are things like alpha fold based on transformers, which is able to predict the 3D structure of every protein and they're just getting interactions between proteins and multiprotein complexes and so on. And these systems can design new molecules.
00:07:25
Speaker
And for example, somebody had a system where they're trying to design new molecules and they want to make sure they're not toxic. So there's something that measures toxicity and they sort of just as a whim flip the sign of the toxicity and the thing was spitting out neurotoxins all over the place. So it's pretty easy to get today, even today's AI models to design dangerous molecules for humans.
00:07:47
Speaker
Imagine that you're a terror group and you have either some philosophical or power desires and you decide you want to use bioterror to kind of promote your thing. Well, you could get an AGI and ask the AGI to design a very communicable and lethal virus. Design the DNA for it, design the protein sheets, and also ask it to design what are the steps by which I can synthesize this virus.
00:08:13
Speaker
When you call out to some chemistry for pay laboratories and you say, oh you sir, please synthesize this DNA, please synthesize this protein, you lab over here, combine these and now you've got a whole bunch of this really terrible virus that might even be my target specific races or genotypes.

Human and AI System Vulnerabilities

00:08:33
Speaker
It might have characteristics of not being noticeable for a while and spreading really rapidly and then suddenly kicking in.
00:08:41
Speaker
Then they maybe would use AGI to control drones to spread this virus over populated areas, maybe simultaneously all over the world in various places. And then maybe they would use AGI to control bots on social media to spread the information about what they've done. And maybe they are asking for power or they're asking for this or that, or maybe they're just spreading their philosophical message. So that's a horrible, terrible scenario.
00:09:06
Speaker
which looks like all the pieces will probably be possible in the pretty near future. And today, the only mechanisms we have to stop that are things like police forces. So fortunately, we have the FBI and the police force who are looking for this type of group and try and stop them, but that's unacceptable in the face of potential human extinction. So what can we do instead? Well, every one of those machines, the DNA synthesis machine,
00:09:36
Speaker
It shouldn't be that some random AGI on the internet can synthesize whatever DNA they want. Today, if we were trying to make a biosafety lab, we have governmental regulations on that and we have a set of rules that you have to follow.
00:09:51
Speaker
But those rules, you know, they're in the human system of if you break those rules, maybe somebody will arrest you and throw you in jail, or maybe they'll fine you if you're a company, something like that. It takes a long time. It's on the sort of human scale. And they don't really have a lot of teeth. They don't prevent the action happening. And so we believe what we really need is to build rule checkers into the DNA synthesis machine.
00:10:16
Speaker
role checkers into the protein building, you know, chemistry equipment. So in the software and the hardware itself, there are these ground rules for what the machines can do. Exactly. And initially, they'll probably be what the current human rules are, but encoded in a machine checkable way.
00:10:36
Speaker
Eventually, we want to make those rules much more refined. But to start, we really just have to do guardrails. We have to prevent bad things from happening. And so what does that entail? Well, we need to formalize those rules. We need to make the rules machine checkable. Because one sort of key insight is we don't want humans in the loop here. First of all, the human is on a slower time scale probably than these systems.
00:11:00
Speaker
Secondly, humans are manipulable, right? AGIs are going to get really good at convincing people to do things. They're also threatenable, and they're blackmailable, and they're bribable. And so if you put a human in the loop of something fundamental and important, that human becomes a target. And so I think that's why we really want to automate this whole thing. So OK, so that says we need to express what the rules are in some precise way that a machine can check. Are these rules being satisfied?
00:11:29
Speaker
But then we need to have the hardware that does that checking, so that's not vulnerable to attack by an HCI. And so that leads us to something we're calling provable contracts, or these little secured hardware devices that can check proofs.
00:11:44
Speaker
We also have to generate the proofs and that's what AGI is good for. So there's a number of spinning plates in this story, but I think it coheres into a story where you can have a device and you can have very high confidence that it won't be used for bad purposes without following whatever rules you've imposed on it.
00:12:04
Speaker
It's actually a significant degree of confidence in the safety of the system when you have a mathematical proof that, for example, a DNA synthesizer won't print a specific sequence that could be dangerous. Exactly, yes. Okay, so how does this approach compare to the alignment approach that's popular in AI safety circles today?
00:12:25
Speaker
Well, so I think in the current thinking, what they're trying to do is limit the thoughts of powerful AI. You know, we want powerful AI that only thinks things which are aligned with what humans want and that way they'll only do things which are good for humans. And I think that's important. That's noble. It's great. First of all, it's a pretty difficult challenge and there are a bunch of papers coming out showing that
00:12:48
Speaker
you know, like there's this reinforcement learning for human feedback to fine tune language models so that they don't produce, you know, obnoxious output, that kind of thing. Well, it turns out if you give the right prompt, if the capability for doing something bad is in the model, you can damp it down. But if you get the right prompt, you can actually cause it to come out.
00:13:09
Speaker
So so it's a difficult challenge in itself. It's also probabilistic, right? I mean, as we see with these models, I mean, the extreme was Microsoft's Bing chat, which when you first talk to it, yeah, it's really good. It's an assistant. It helps you. But people found that, you know, after five or six back and forth of the conversation, especially if you sort of are a bit aggressive to it or not just to it.
00:13:30
Speaker
it could flip to this other personality which was trying to steal people's lives and threaten people and you know all kinds of crazy stuff and there's now some understanding of why that is i want to go into that but but we don't want to rest the future of humanity on that kind of stuff and i think we need to shift to a security mindset which is.
00:13:49
Speaker
Let's not just try and get these things to do the right thing all the time. This is very serious. We're building these systems that are going to be probably better than humans at everything. Every AGI will be an expert manipulator. It'll be an expert crypt analyst that can break into cryptography, be an expert hacker. You have all these capabilities in one little bundle.
00:14:11
Speaker
And the other pieces if it was only the top a lab you know open a meta whatever they all have big safety teams and they're all working on it and they that these things i have read teaming all kinds of great stuff really good really important.
00:14:26
Speaker
But a lot of these models are now open source, and they're spreading around. And if they open source a protected model, well, many, many people want to remove that protection. There was just a meeting with a congressman of a bunch of AI people.
00:14:43
Speaker
And somebody spent eight hundred dollars and they were able to remove the safety protections from metas open source model in one day, you know, and get the thing to design, you know, bioterror chemicals and stuff like that. So with open source models and we also have no control over.
00:14:59
Speaker
china now i think has over a hundred and fifty large language models the different little startup companies there have created. Another country has their own open model every country wants their own model i just thought this morning in france they're doing a whole new a you know effort they want to be the opening i am france and so.
00:15:17
Speaker
So everybody wants their own thing. And some of them are going to be really careful and really concerned about safety. Well, there's not so much. And there will also be criminal organizations who solely want to use this technology for financial gain, for power gain.
00:15:32
Speaker
And military, every single military on the planet now, is getting fleets of drones, flying drones, and submarine drones. And they're all turning them over to AI. I just saw the NSA and the CIA both are putting large language models to help analyze all their data. So we have this sort of explosion, as to be expected with something so powerful.
00:15:55
Speaker
But the safety stuff, I think, was going to take a backseat. So I don't think we can rely on, even if we had perfect alignment of these models, I don't think that's what we should rely on. We need to, especially when it comes to the existential threats, we need really, really high confidence in stopping those.
00:16:13
Speaker
Yeah, there's another popular approach in AI safety surrounding evaluations, so-called evals of the models that are coming out of the top AI corporations. They test whether these models can do breaks of cybersecurity or acquire resources or escape from a secure setting.

Evaluations, Red Teaming, and AI Safety

00:16:33
Speaker
What do you think of this approach? Do you think this approach might potentially be helpful in combination with the approach you're sketching? Absolutely. You know, one of the basic principles of the security mindset, it was a really nice book by Nancy Levison a little while ago talking about how to really do safety in engineered systems where you need to take a systems model of what are all the components that are conspiring to cause some potentially risky outcome and who is your adversary? What is your adversaries capabilities?
00:17:03
Speaker
And so unfortunately agi is sort of an unknown i mean we could put lower bounds on what agi will do namely today's models models like you say they can already do a lot of cyber security stuff they can break in the systems they can design molecules they can prove their arms they can.
00:17:20
Speaker
manipulate people they can write very you know convincing messages and so on and that's just this is what after just a few years you know so what's happening another you know say eight years from now what are models gonna be able to do the upper bound we can use physics laws of physics to limit it and so in particular so what one knob that we do have to control these models is how much compute power they have.
00:17:43
Speaker
And right now, you need a lot of compute to train the models, but you don't need much compute to run the models. On my little Mac, I can run large language models of like 20 tokens a second, something like that.
00:17:57
Speaker
So really understanding what they're doing, what the current level is, I think is critically important. I think red teaming is a nice, great thing. Unfortunately, it's sort of the analog of software testing where you're trying to write this piece of software and it should be secure and have no backdoors, all kinds of stuff. So what do you do? Well, you got a bunch of software testers, quality assurance, and they try it out and they look for holes. Yep, it's good. Trouble is software testing or red teaming on a language model
00:18:23
Speaker
They can find bugs. They can show you, oh yeah, it's got this safety problem here. They cannot prove the absence of bugs. And so I think we really, you know, because this is super important for humanity and it's very imminent, we need much more powerful tools than we currently have.
00:18:39
Speaker
In the software area, there's a whole field of formal methods where you can actually prove that a certain class of bugs is not there. And that's harder to do, I think, with today's large language models, because we don't really understand how they work. And so Max has a lot of people working on mechanistic interpretability, which is trying to sort of get into that.
00:19:00
Speaker
I think understanding the current level is great. Models also have incentive to be deceptive. And we're already seeing that in some of these language models that they tell you more what you want to hear. And they may know things that they're not talking about. And so if they realize they're in a red teaming situation, they may be really friendly and nice. But in some other situation, they might not be that way. And so we can't rely on it, but it's very valuable, I would say.
00:19:29
Speaker
This is one of the things they're trying to test for in the evaluations approach, whether models are actually deceptive. What I was interested in was whether these evaluations might serve as the specification in your approach for what the models can't do. So these evaluations might be limits to what models can do that are then written into the code and you can prove that the models cannot do these certain things.
00:19:54
Speaker
Yeah, no, I think that's great. I mean, we can also use them totally independent of the AI capabilities. Just tell us what are the biggest risks in our society. And as models get to be better at modeling society, you know, one thing I think is going to be really important. People talk about our digital twins.
00:20:12
Speaker
creating digital versions of basically every system and equipment in the world today, that once we have really good models of the world, we can ask the AI, well, what are some vulnerabilities here? And so like we know about viruses, ooh, that's bad.
00:20:26
Speaker
But let's say, you know, biology had not created viruses. They might still be a vulnerability, right? And so, so yeah, I think it's very important. One of our big challenges will be figuring out what are the main risks we have to protect against and then expanding that so that we can actually have positive impact too. I believe that using this type of technology, we should be able to create, you know, a utopian society that solves many of the problems that we have today.
00:20:53
Speaker
We have to figure out what do we want and we have to figure out what are the risks we have to avoid. And I think AI will be critical in helping us figure that

AI's Role in Society and Mathematical Proofs

00:21:00
Speaker
out. Yeah. And the approach you take also involves a lot of help from AI in the proofing, in the checking and in many of the steps. We should talk a little bit about mathematical proof and how advanced AI mathematicians are now, because I think a lot of people might be skeptical that a large language model is an enormous amount of
00:21:20
Speaker
There's a lot going on there. It's a huge program. How can we prove something about a large language model? Yeah. Well, let me first say that our proposal is not to prove things about the large language model because we don't know if we're going to understand them.
00:21:33
Speaker
but rather what the large language model is proposing doing in the world, we want to prove that that is safe. So we want to limit actions, not necessarily thoughts. You know, in some sense, thoughts and if an AI is thinking terrible thoughts, that doesn't really hurt anything unless it turns into actions. And one kind of possibly paradoxically seeming thing is even an untrusted, deceptive, horrible model, if they generate a proof of something,
00:22:00
Speaker
We can check that proof very cheaply and easily and have essentially 100% certainty that what that proof says is actually true. So we can use bad terrible models to do good things and to prevent bad things. That's one of the magical pieces about mathematical proof is that even the most powerful AGI can't prove a false theorem, can't prove a false it.
00:22:23
Speaker
I believe it is humanity's most powerful tool in dealing with a potentially smarter adversary, I'd say AGI. And humanity has been kind of creeping toward mathematical proof for a long, long time, well over 2000 years. Actually, the principles underneath mathematical proof, little teeny mechanisms for reasoning,
00:22:46
Speaker
were invented by the human creators of natural language, maybe a hundred thousand years ago or something like that. And they built into language all of the elements that are in logical systems today, but they did it in an informal way. And no, in normal reasoning, if you say, you know, Fred is a carpenter and he likes sushi. If you later asked, does he like sushi?
00:23:10
Speaker
Yeah, from that conjunction, you can derive, yeah, he likes sushi. And in natural language, we just do that so naturally it doesn't even seem like a thing.
00:23:18
Speaker
But in the early Aristotle and Euclid, they wanted to take that informal form of reasoning that pretty much everybody does and figure out what are the exact rules. And Aristotle had his syllogisms and he was teasing it apart and he figured out some of the rules. Euclid tried to do geometry in a very precise way, but it was very unclear exactly what the rules were.
00:23:41
Speaker
And then humanity has been struggling since that time to really nail them down. And, you know, it's fun to read like Leibniz, you know, 1600s or something, had this vision for, oh, we could have a language where you could be sure of philosophical truths from this, but it couldn't quite get the details.
00:23:59
Speaker
1800s we had you know bull fraga and cantor they like really nailed down like some of the rules around implication and conjunction all that and then finally nineteen twenty five. Is your melo and franco really got all the pieces develop set theory.
00:24:15
Speaker
full foundation for all of mathematics and anything which could be described with mathematics. So full foundation for physics, for economics, for engineering. So since 1925, we've had this teeny little system. It fits on one page. So seems like it's not much.
00:24:31
Speaker
major, major development in human thinking where we have a set of rules that as long as you express whatever it is you wanna, you're interested in, in this system, you can check it very, very rapidly. There's a system called MetaMath, which humans, human mathematicians prove, like, I forget how many, 40,000 theorems, something like that. And basically an undergraduate education in mathematics is in there.
00:24:58
Speaker
And the proof checker is only 300 lines of Python code, and it takes like four seconds to check the entire database. And you could be absolutely sure that every one of those theorems is correct.
00:25:10
Speaker
Well, but until recently, it's always been humans doing it. And humans are actually not that great. They're improving. And especially the mechanical computer-checked ones require every detail to be right. And so humans have sort of pushed back against that. And the AI systems were doing it in certain subdomains. There's something called satisfiability that had a big community and was very successful, but full
00:25:37
Speaker
proving of mathematical theorems was not very successful until 2020, when OpenAI, in a project that isn't very well known, they did what they called GPT-F, F for formal, and they used metamap, this big collection of theorems that humans had written, and they trained a model on that, and it was able to prove 55% of them.
00:26:01
Speaker
So that's pretty good. We're moving along. And then in 2022, Meta, again, in a very not very popularized project in France, in Paris, did what they called hyper-tree proof search using a transformer and using some of the similar techniques to AlphaGo, where it was doing Monte Carlo tree search and looking for the best extensions. They were able to prove, I think it was 87% of the MetaMath theorems.
00:26:27
Speaker
So this is amazing, this is fantastic. Since I view, you know, theorem proving as the core of safety, this is like really exciting. I was at META, I worked at META, and I would follow that group, you know, because they were doing what I thought was really important. And internally, they didn't seem to realize that this was like a huge deal for the world. Their main thinking is we want to make mathematicians lives better. You know, we want to prove the mathematical Olympiad, we want to be able to solve those problems.
00:26:53
Speaker
There didn't seem to be any inkling. This is like the core of safety and of software generation, all this kind of stuff. So I'm a little frustrated that proof, it's somewhat esoteric. It's not that easy to really understand the nuances, but it's really the core, I think, of doing precise and provably correct systems. So AI, from what I'm here now, is clearly getting better at mathematical proof.
00:27:16
Speaker
I think the skepticism about the approach is more you have some complex behavior that you want to forbid. You want to forbid the system from being deceptive or from breaking into some other online system. How on earth could you formalize a notion like deception and then prove that a system cannot do that? You went through the history of formalizing the world, but I think you would also acknowledge that it's been a kind of recurring problem that the world is often
00:27:45
Speaker
complex in a way that escapes our formal systems. Absolutely. And so in this paper with Max,
00:27:52
Speaker
At this point, I mean, I think eventually we will get there, but this is urgent. So we don't have time for eventually. And so fortunately we don't need to control the absolute fine details of all this. And the analogy I kind of like to think about is, you know, we have cars driving on roads and you know, that's not that old of a technology and they had to figure out all the stuff we take for granted now, stop signs and traffic lights and lanes and all that.
00:28:19
Speaker
None of that existed early on, so they had to figure out rules to keep these powerful new technologies from destroying from killing people. And there are subtleties to it, right? So today we have rules about how fast you're allowed to go on a road and what are the rules for like passing somebody and we have these blinking lights to stop when you stop all that. But the really critical thing is if you don't say you're driving on a cliff on the road around the cliff, you don't want that car going over that cliff.
00:28:47
Speaker
So that's an example of an existential risk, right? Somebody does that and they're probably not gonna survive. Yeah, it's nice to have the nuances of how you pass the car and what your speed should be, all that kind of thing. But really, we wanna be darn sure you don't go over that cliff. And so we invented this little technology, that guard rail, which is this little teeny fence that most roads on mountains have, that if a car hits it, it tends to bounce off, you know? It may not be good for the car, it may get dented, whatever, but you don't go over the cliff.
00:29:15
Speaker
And so we have this hard barrier restricting a certain outcome, preventing a certain outcome. It's cheap, easy, and it's coarse, right? It doesn't really know the nuances of maybe the guardrail is there when somebody doesn't want it to be in some way.
00:29:32
Speaker
What hey you know it's as a starting point it cuts off the really risky outcome so i think we need the analogous thing here what is analogous to it got real in the case it doesn't really have much to do with the ai itself i mean i think it's great to try and do alignment and make the eyes good but i think it's more about the world.
00:29:51
Speaker
systems in the world which are potentially really risky. And we know pretty much what they are. You know, launching nukes really bad. You don't want your AGI to be able to launch nukes. Releasing, you know, smallpox. Bad. Don't want that. Maybe we even want to think of large data centers full of GPUs. Maybe that's a risky device. So that's a mechanism for controlling AGI. Maybe only certain, you know, very carefully vetted groups get to train their models on big
00:30:18
Speaker
Data centers it's a sort of a restriction that that prevents some bad outcome on the other side you want to start with the most critical systems in the world. Exactly okay let's let's begin to get into the details of the approach i think starting with approval contracts what are these people contracts.
00:30:38
Speaker
So what do we need in order to limit a device? Think maybe of a robot, a drone, or maybe even like a DNA synthesizer. And we've got some AGI maybe out on the internet somewhere who says, yeah, yeah, I really want to make some DNA. And I got some money. I don't know. It's a few dollars a base pair or something like that. It's pretty cheap. Should this DNA synthesizer do what this AGI is telling it to do?
00:31:04
Speaker
We need to have some rules about that and we need those rules to be automatically checked. We don't want a human in the loop. We do probably want humans to help in the making of those rules or the shaping of those rules, but that's a whole different story. Let's assume in some way we know what the rules should be. How do we ensure that those rules are followed? Why is it that you want a human in the loop for specifying the rules, but not for applying the rules, so to speak?
00:31:28
Speaker
Because what exactly the rules are, I mean, if we're very clear, like, let's say the one and only thing we want to do is prevent human extinction. I think that's fairly clear. So killing all people is the bad outcome. And so we go backwards from that. We, you know, we, we put little guardrails around every, the trouble is, you know, especially if you're doing really coarse guardrails.
00:31:48
Speaker
You know, a good way to keep technology from killing everybody is to lock everybody in little cities with no electricity. We eliminate all electricity in the world, but yeah, that'll keep us. So there are bad, bad ways of meeting that goal. And so you do want to have what humans value and care about the sort of an alignment question in choosing what your course protective goals are. As we implement this, assuming this is the direction we go,
00:32:15
Speaker
Will discover we move sweet we we put too many limits on this dna machine now it won't make you know a drug we need. Okay so we didn't get the rules quite right how do we fix that or oh my god you know that the age i figured out a way through this class rules that was bad so we need a whole mechanism for first creating the rules and for updating them.
00:32:38
Speaker
But that's really risky because if you have a way of updating rules, well that's what agi is that's a vulnerability right that's an attack vulnerability. So how do we prevent that we could have one person i'm the keeper of the rules for the dna machine well than that person is now a target for h so generally the way we do that today is we have groups that do it and we have a voting mechanisms.
00:32:59
Speaker
And so I think, particularly for rules at the societal level, voting is a great system. There are nuances. I think today's voting mechanisms are really, really crude. Robin Hanson has something called Teutarchy, which is a subtler, better thing, where maybe we can even use AGI in choosing how to do the voting and where to get info and how to build models that then people can reflect on and make decisions. So we need mechanisms for choosing what our rules are in a smart way
00:33:29
Speaker
Showing that the rules don't have vulnerabilities right so we want to be doing simulations of them. That's a whole story that i think max i just kind of hint at a little bit i thought some of it but i think there's whole disciplines needed there and maybe there are people that are doing quite a formal models of society and you know what are the social mechanisms.
00:33:47
Speaker
the framers of the Constitution. They laid down these rules, but they also wanted to leave room for amendments, the ways to update them. And so pretty brilliant, but we need to invent systems for that. So we want humans in the loop for creating the rules. But then when some string is sent to a DNA synthesizer, we don't want humans in the loop there because humans are too slow and humans might be manipulated. So we want some unbreakable system to not create that DNA.
00:34:17
Speaker
Exactly. So what does it mean? So first of all, it has to be a program which checks it.
00:34:23
Speaker
But programs have bugs and all kinds of problems. Well, this is where the amazing power of mathematical proof comes in, that if we can express the rules that we want to be satisfied precisely in a, and Zermelo-Frankel set theory, that's my favorite, though these days it looks like Lean, which is based on the type theory, is much more popular at the moment. So maybe it's a type theory based system. The reason I don't like those as much is that they're pretty complicated. The great thing about Zermelo-Frankel set theory is it fits on one page.
00:34:53
Speaker
And the checker is 300 lines. So that's a teeny, completely understandable system. But whatever it is, all the theorem proving systems are essentially interconvertible. You can prove anything in one and transfer it to the other.
00:35:07
Speaker
Not easily today, unfortunately, but theoretically it's that way. So whatever it is, you force the AI that's proposing a certain thing to provide a proof that it meets your needs. Some of those needs might be, I want the okay of the director of the CDC. And here's his signature saying, yeah, this is a great anti-flu project and we verified it. We think it's great.
00:35:32
Speaker
So, you know, in today's world, you have groups of people that are, you know, committees that sort of look at this as a reasonable project. And we can just drop that down as long as we make a machine checkable version of that. But eventually, we'd really love the details of the biochemistry that what's allowed and what's not, what's toxic, what's not. We'd like that to be as a part of this criteria.
00:35:53
Speaker
Just the little program running on a little teeny processor and that checks the thing. And if it's verified, it then sends it onto the machine. Everything's hunky dory. Sounds trivial, sounds simple, except an AGI is very good at breaking into systems, understanding systems, manipulating systems.

Hardware Security and Zeroization Techniques

00:36:11
Speaker
And so if our little checker is running on, say, a typical microprocessor,
00:36:15
Speaker
Well, there are all kinds of ways to manipulate microprocessors, so we need something special there. We need something which cannot be manipulated by an AGI, and when it says yes or no to the DNA synthesizer, that command should not be manipulable by the AGI, and the synthesizer should not be manipulable by the AGI. So that's where we get into sort of a little bit of the hardware underneath it.
00:36:42
Speaker
If you have enough energy, so people break in, there are these secure enclaves today. Apple actually has a really nice one built into almost all their products now, I believe, including the Apple Watch. So the Apple Watch has its own little secure enclave in there. It has its own analog random number generator, so it can generate random keys that have no influence from anything outside and can't be seen from the outside. It can do various kinds of cryptography on there.
00:37:09
Speaker
And they have some tamper resistance that they don't talk about, so I don't know how strong it is. But that's sort of a hint of the Apple Watch has a protected environment where you could run those proofs in that secure enclave, and it would not be breakable by the standard easy mechanisms.
00:37:26
Speaker
But if your AGI has access to robots with scanning electron microscopes and lasers and this and that, it'll pretty much be able to break into any physical device. So what do we do about that? How do we keep the whole physical infrastructure from falling apart? Well, there's a very simple, clean idea that I just love. It's old. It's from the 1960s called zeroization, which is you have some cryptographic system that internally has some secrets in it.
00:37:54
Speaker
And you have sensors which detect any attempt at tampering. And if they detect tampering, they delete the keys or even delete all of the secret information. Apparently, Soviet jets have this. There's a great photo in Wikipedia of a little switch that's, you know, if the jet is going down, flick this switch and it deletes all the secrets that are stored in that jet. And so what this does is you have this little hardware thing
00:38:23
Speaker
that you can, you know, if assuming you make the sensors so that they really detect all the ways that any agent might physically probe it, you know, breaking into it, lowering temperatures, zapping it with electricity, whatever, anything that's out of the normal range of behavior, or any kind of attempt to drill into the thing or whatever, it deletes all the keys.
00:38:44
Speaker
You know, then those bits, it's quite a unique thing. We have a little teeny physical device with some secret bits in it that literally nobody can get access to. And eventually I would love to have a physics proof of that. Prove that no physically possible attempt at attacking this thing can get at those bits.
00:39:04
Speaker
The downside is that that becomes a denial of service attack also. Let's say we're in a war and you've got AI robots coming at you using this technique. Well, as long as I try and attack their secret keys, then the robot stops working. So it's a little bit, you have to worry about that. And you also have to worry about losing information at the systems level. So this is a component in a much bigger story.
00:39:29
Speaker
So that, I think, will enable us to have an enclosure and a physical device that we're guaranteed won't be broken and that can do computation and we can have proofs that that computation has done correctly and is not manipulated. So what we're imagining now is a world in which we have all of these critical systems. We have nuclear facilities, we have bio labs, AGI corporations, and
00:39:54
Speaker
In the way they operate, they have secure hardware running provably secure code and the hardware is tamper-proof to the greatest extent possible. We have specified by some social mechanism which kinds of code can be run by these facilities and by these labs and so on.
00:40:16
Speaker
Is this the end goal of this network of provable contracts for you? Well, maybe it's the end goal of the first stage. So if you view our first goal is prevent human extinction. You know, simple, clear mandate. And I guess part of it's a little less clear in that we don't exactly know what all the pathways toward that are, but we know some of them. And so I think that technology alone would do that.
00:40:42
Speaker
I think we can protect the biohazard labs, protect the nukes, protect the AGI data centers, whatever.
00:40:49
Speaker
And then, but this technology, once we've got it, let's say we really have it done and we make it cheap. So we have these little, you know, approval contracts or, you know, a dollar, a dollar, a module or something like that. We just are using them everywhere and get all kinds of other social benefit out of it. That step, I think, is much subtler and will require modeling human thinking and human values in a much finer and more detailed way.
00:41:15
Speaker
Once we prevented extinction, then I think we have plenty of time to really think about what we want and make that happen. I think one worry here is that imagine that this network of provable contracts becomes more and more complex and it becomes layered and begins to rely on parts of itself in interdependent ways.
00:41:37
Speaker
What if there's a bug in one of these contracts, which is a computer program? So it could be boggy, right? What if that bug then kind of spreads out in the system because, you know, other provable contracts are dependent on the boggy provable contract?
00:41:54
Speaker
Well, so part of the design of these things is that they are fully verified. So they're not gonna have bugs like today's programs have bugs, you know, some pathway, but they might have bugs in the sense of they're saying that they're imposing a rule that turns out not to be the rule we really wanted. And so that's a huge risk. And we're gonna need design principles
00:42:20
Speaker
How do we design particularly networks, complex systems of rules that interact? What I think this technology would do is it gives us a sort of base that we can absolutely trust. And then on that base, we can build up whatever we want. And so on the good side, we can build the most utopian, fantastic society. I think there are ways even to prevent, you know, prisoner's dilemmas and social dilemmas and tragedies of the commons, all the problems that today's society struggles with. I think we can solve all that with this.
00:42:50
Speaker
On the other hand, particularly if some draconian force got control of the designing of these things, we could create a terrible dystopia. And so it's a sort of morally, completely neutral technology that enables any social system that you want to be implemented with this.
00:43:08
Speaker
The challenge is, how do we ensure that we get? I think we can use AI for that. I'm really liking this idea of digital twins. Nvidia has really been pushing that. Lots of companies are making digital versions of their factories. They have every piece of equipment, and that factory has a digital version.
00:43:27
Speaker
They have sensors all over the factory and they keep the two in line with each other. Based on the sensor data, they learn what the behavior of the system is. They can, for example, predict when you need to do maintenance. Apparently, just predictive maintenance is like a multi-hundred billion dollar industry. It's very, very valuable in the factory setting.
00:43:47
Speaker
I think eventually we're gonna have that for the whole of humanity where we need digital twins of ecosystems of forests of is this forest vulnerable

Digital Twins and Modeling AI Regulations

00:43:55
Speaker
to forest fire and what are the smallest changes we can make to fix it so i think we will just very naturally.
00:44:03
Speaker
have digital models for all of the systems in the world and then proposed rules that we do, we can simulate those and we can do mathematical proofs within those. So we can kind of test what the effects of the proposed rules might be in these digital twins. One worry there would
00:44:21
Speaker
So at what level of detail will we have to simulate these digital twins of our factories and so on? And I think specifically, will the digital twins of parts of the world have to include humans? Because then it becomes very complex if you have to simulate the behavior of humans. But I can definitely see the application for finding out whether your factory is safe or perhaps whether your critical infrastructure is safe.
00:44:48
Speaker
Absolutely yeah today they're really doing that and the human piece is really interesting i actually wrote some papers and get some talks on medical personal ai. Which is the idea that each human will have an ai that is there is that represents their interest in the world. How you actually do that and make sure that it's not manipulating you whatever that the whole story but let's assume we have that.
00:45:11
Speaker
It solves many, many of the issues like today with social media and so on. All these social media sites are trying to send you ads that will manipulate you and show you posts on tweets on Twitter that will annoy you so you'll respond and engage and so on. If you had your own little personal AI that knew what you cared about, what your true deepest values in your contemplative point would be, and what your capabilities and so on were,
00:45:41
Speaker
It could shield you from all that. You could say, you know, I'm vulnerable to alcohol. Don't show me any alcohol ads. Or I, you know, I tend to get sucked into gambling. Never show me gambling. I don't know your personal AI fish.
00:45:54
Speaker
shields you from all of that, and now the companies which want to send you that stuff, it's not working, so now they have no incentive to create that kind of ad. So I think a society in which each person has a sort of AI agent that represents what their interests are, and now in votes, your AI agent can tell the society, like, oh, he really wants more schools, and he doesn't really care so much about the potholes on the road or something.
00:46:18
Speaker
And so in votes, you can have detailed information from each citizen's personal AI. Kind of a digital optimal version of yourself acting on your behalf and protecting you. Interesting. Let's talk about proof-carrying code because this is really a central part of this approach you take to AI safety in the paper. What is proof-carrying code? How does it work?
00:46:43
Speaker
So that was an idea, brilliant, amazing idea. George Necula, I think it was 1999 at Berkeley, I think he did his thesis on it, wrote some papers. People are starting to have code which moves, you know, like JavaScript, runs on your computer. Somebody else wrote this code that runs on your computer. How do you trust it? The whole area of formal methods really just got started in the 1980s or so, is using mathematical proof to ensure that your code doesn't have bugs in it.
00:47:13
Speaker
and also hardware, those are both there. And his idea was, you know, your compiler, which is compiling this code, should be able to check that certain, you know, that it's, say, memory safe. It doesn't have memory leaks or just uses up more and more and more memory. You can prove that kind of thing pretty easily. And that proof should be a part of the code itself so that when you give it to some other machine to run, it can check that proof before it runs it and be sure that the code is not gonna fail.
00:47:41
Speaker
The types of things that they could prove back then were pretty limited. Things like that types are safe and so on. But the idea, I think, is super important and brilliant. So we sort of want to bring that up into the modern age, where now we're going to have AIs. We now have AIs that can generate code left and right. We have all these large language models for code. Of course, they often generate code with bugs in it, which is a problem.
00:48:05
Speaker
So we'd really like code which is provably meet some spec and provably doesn't leak secret data or have infinite loops or memory leaks or all that kind of stuff. Surprisingly, that sort of doesn't seem to exist quite yet, but I'm guessing maybe even Google's been sending little hints about what Gemini, their hot new model coming in December, supposedly the rumor is, and they say it's going to be a large language model combined with
00:48:35
Speaker
With techniques from alpha go so maybe maybe that'll have their improving in it. I don't know. I think that the first group or company that combines language model type technology with their improving is going to just explode because once you have their improving, you can generate your own training examples. You can try a bunch of different stuff out and if you find one that provably works.
00:48:57
Speaker
Now you have a guarantee of that. And so you don't need human training anymore. And we saw with AlphaGo, oh my God, AlphaGo, the first version trained on human data. And then the next version, I think it was AlphaGo Zero trained on only playing itself. And it was even far better. And then they did AlphaZero, where it also did chess. And apparently, I understand that in four hours, it got to a level where it could beat all humans.
00:49:26
Speaker
all of human chess knowledge was created by the system in four hours. So imagine you do that with mathematics and the mathematical structure. Oh my God, no telling where that's going to end up. And in particular with code and provably correct code. So now instead of generating little snips of Python, which based on what it learned off of GitHub or something,
00:49:47
Speaker
it will generate provably correct code, provably optimized in this way and that, with provable properties of no leaking and no this and another. Oh my God, it'll be a new era. So you don't think lack of training data will be a problem in the long run if you combine large language models with theorem proven?
00:50:05
Speaker
From the current language models, we probably have most of what we need from human civilization. Google is also rumored to be training on YouTube videos, so video also has more stuff. But that combo is probably anything anybody has thought about that is really deep and significant. Is it somewhere in some book or on some video somewhere?
00:50:27
Speaker
And so the human contribution will come from that. And then once you've got that, you can now go wild in designing. The system can check itself. So what's the advantage there when the system outputs something that's provably a correct solution? How do you generate new synthetic data from that?
00:50:44
Speaker
Well, you can do various things. For example, let's say you've just found a really cool new algorithm. You can find a formal summary of what it does, and then you train it on, find an algorithm to do this.
00:51:02
Speaker
And we know what the answer is, and we probably have it, but we force it to learn that. And so in general, it's a good technique to, like many of the current models work by taking known examples and then either adding noise to them or deleting parts of them, and then asking the system to recreate the part that was noisy or missing. And that enables them to learn the detailed structure of everything, and they just get better and better and better at that.
00:51:29
Speaker
Returning to the approach to AI safety in provably safe systems, there is this big issue for me around creating the specification that the code must satisfy. Is this more of a philosophical problem? Is this more like solve ethics or solve human interaction, which seems to be very far from a solution? Even though the underlying mathematics might be enormously advanced,
00:51:55
Speaker
potentially, we don't know what it is that we want to do with the specification of what the code can and cannot do. Oh, yeah, I think that's a critical and fundamental issue. You know, in the history of AI, there have been two camps that are these days.
00:52:12
Speaker
not so visible but like when i first started learning about the i'm in the nineteen seventies used to call in the needs of the scruffy. Where the needs were all about mathematical logic and precise models and proof and everything and that was sort of john mccarthy who started the lab at stanford i studied with it when i was an undergrad i took some of his classes and he had us proving things about less programs and so that i think i got shaped quite a bit by that.
00:52:37
Speaker
On the flip side was Marvin Minsky, who was at MIT, and he was like one of the proponents of the scruffy approach, which is, no, brains don't have logic in them. They are little cells that you throw a bunch of cells together and they randomly grow and they train and learn. It was all about machine learning, neural nets, and all that.
00:52:55
Speaker
And early on, they were totally disparate and the pendulum is sort of swung back and forth. When I first became an AI scientist, neural nets were on their ascendance. Hinton and Rummelhart had just written these books about neural nets and this was the mid 80s and everything was going to be great.
00:53:12
Speaker
But then Japan came out and prologue was the sort of leading neat approach, neat logic based things. Japan had this fifth generation project where we can prologue machines that are going to change the world and everything. And there was this big race between Japan and the U.S. These days, I don't think those battles and crazy things are even in people's consciousness. So it's sort of funny how and so.
00:53:37
Speaker
A large language model is a formal model, right? It has precise mathematical multiply this four-bit number by that four-bit number and do this and you get some answer. The trouble is it's not self-consistent. I mean, the two big issues with today's language models is if they don't know something, they'll just make it up so they hallucinate.
00:53:57
Speaker
And their logical reasoning steps, they're learned. And so if some human has had a similar kind of step, they might have learned about that and they do that, but they're not precise in any way. They don't have precise reasoning axioms, but they're pretty close, right? They're trying to capture human reasoning. And so there's a wonderful paper that just came out maybe a month ago.
00:54:18
Speaker
called From Word Models to World Models by a group at MIT around Josh Tenenbaum who just does amazing work. He tries to combine human psychology with AI and they argue that
00:54:32
Speaker
language, natural language, is only a teeny piece of thought. And that we really have a language of thought and that all animals have a language of thought. You know, your dog is figuring stuff out. It's doing reasoning, even though it can't talk about it in language. And they propose that a formal model for that is something called a probabilistic program. They use the program church, which is sort of a probabilistic version of scheme.
00:54:56
Speaker
And they use large language models to translate from natural language, which is sort of fuzzy and not so clear, into this precise model. And they show how to do it in a number of different domains. And then you do pure Bayesian reasoning, precise, absolutely correct, and then translate the results back. And so anything that you can do that for, you can easily get the language models to create formal models.
00:55:20
Speaker
And so they do it, I think, in that thing for physical reasoning, for social reasoning, for relational reasoning. And they tie it with what's known about human thought. Daniel Kahneman has a story that there's System 1 and System 2. System 1 is sort of the really rapid, quick, kind of fuzzy. And that's what our language models are like. System 2 is the more systematic, step-by-step thinking. And I think that's probably where things are going.
00:55:49
Speaker
Maybe later this year. I don't know when. And so that part is precise and formalizable. The fuzzy part, you're not so much. Physical concepts, we have the laws of physics, which are totally precise. And at least in this region, around the solar system, believed to be we can represent pretty much any phenomena. And so anything physical, we can build up from physics and have a formal model for what's going on.
00:56:18
Speaker
settler concepts like states of consciousness, that kind of thing, maybe not so clear. I mean, it's also the case, I think, that the fuzzy language models are beginning to approximate human logical reasoning. I've been pretty impressed by what's come out of OpenAI's models, for example. You can give them some logical puzzles that previous versions of these models couldn't solve. You have Alice and Bob in a room, and there are four apples on the table, and Alice
00:56:47
Speaker
You know, all of these, the specific example doesn't matter, but they can solve logical reasoning at a higher and higher level. And so maybe the kind of fuzzy approach begins to approximate the formal approach. Yes, I totally think that's true. Also, transformers are pretty limited in, you know, they only have a finite number of steps. I forget what the current ones are, 32 or something like that.
00:57:10
Speaker
But if you'd say, let's take this step by step or let's do chain of thought, now you sort of spread it out and you get it more room to make longer arguments. And so I think you're totally right that I think these models are picking up on much of human language on the internet is people making arguments about things. And so you get enough of that. You get pretty good representations of the steps.
00:57:33
Speaker
So it's not that hard to convert that to an absolutely precise model. In fact, one of the big efforts, which I think is super important, but again, kind of under the radar, is auto formalization. A group that was at Google, Zagetti, I think was ahead of it. I think he's just moved to Elon Musk's new AI company, X.AI. So I don't know what those guys are up to.
00:57:55
Speaker
Autoformalization is you take a human written say math books or physics books or manuals for programming languages or manuals for hardware and you read it in a large language model and then you convert it into a precise formal model.
00:58:10
Speaker
and then you can prove theorems about that formal model. And I think that's super important and that once you do that well, and they've got some early trials which seem to be working pretty well, you should be able to suck in all the world's math books and get not only the underlying theorems, the statements of the theorems, the definitions of all the mathematical objects, but also the proofs.
00:58:35
Speaker
And typically in books, people, humans don't lay out all the details of the proof. They say, well, you know, it's as left to an exercise for the reader, you do this, this, and this or something. But I think current AI theorem provers can do all that type of stuff. Maybe they can't get the brilliant insight behind the theorem, but they can certainly fill in all the details. So I think we can pretty quickly automate all of the world's mathematics and all of the subjects which are built on mathematics, like physics and engineering and economics and so on.

Formalizing Knowledge and Software Verification

00:59:05
Speaker
And this would help us create the specification for what it is that these AI models can't and cannot do. So if we have formalized versions of all of our knowledge and perhaps some of our logical reasoning and some of our common sense that this might help us in the way that we can better specify what it is that we want to
00:59:25
Speaker
the kind of nuanced things that we want to allow and not allow AIs to do. Exactly. What do you think are the most difficult areas to formalize here?
00:59:35
Speaker
Well, I mean, you know, philosophers have been struggling with like various ethical philosophy for thousands of years. And I'm not sure, like the trolley problem is one that in some variant shows up in a lot of AI systems. You have self-driving cars. They can either crash into a car in front of them or run over the motorcyclist to the right of them. Which should they do, you know? So there are these ethical issues that just pop up everywhere. I haven't seen any, you know, universal agreement. Yes, this is the right version.
01:00:04
Speaker
or political philosophy you know we have china and the u.s. with very different priorities about how things how your society should run.
01:00:13
Speaker
So I don't quite know, you know, AI is going to make it possible to make those systems really, really be implemented. I don't know how you choose. But voting is sort of, you know, allowing humans as a group to contribute their beliefs and values, I think is a way to, you know, at least avoid some crazy person, you know, taking over the world kind of thing.
01:00:35
Speaker
And I think we can always fall back on thinking about the low-hanging fruit of securing critical systems so that humanity does not destroy itself. Let's talk about automated software verification. How is it that we can use AI to prove that some piece of code meets these specifications?
01:00:53
Speaker
Well, let me get a little bit of the history, because this is unfortunately an area that I've been somewhat frustrated by. I mentioned that the foundations of mathematics or mellow Frankel set theory were really laid down 1925.
01:01:06
Speaker
As computers started coming around, they started realizing, oh my God, we can make digital equipment that implements some of this stuff. Alan Turing invented the first formal digital model of a computer and proved all kinds of theorems about it. That was the Turing machine. It was a foundation for computation theory.
01:01:25
Speaker
He also in in some writing isn't as well-known did the very first formal verification he showed how you could use mathematical proof to prove properties of programs and that could have lit a fire and had everybody do that and all of software engineering could be based on formally proven correct programs and we would have much better technology today.
01:01:46
Speaker
Unfortunately, that's not what happened. The world invented Fortran and they invented, you know, Algol and all these little, you know, languages and people, humans got pretty good at writing programs, but they had lots of bugs. And then they started doing parallel programs and programs that communicate with one of those really complicated interactions of, you know, what order things happen in and humans are not very good at reasoning about that. So they're all these bugs. And so we've just had, you know, a litany of terrible stuff.
01:02:15
Speaker
And around 1980, the field of formal methods sort of got named and got really started to get to going. Intel in the mid 80s sometime had this terrible problem called the division bug in their processors turned out that the unit, the arithmetic unit, which did division.
01:02:37
Speaker
had some bug in the logic of it that nobody had caught. And somebody found it, and spreadsheets would show the wrong answer if you did the wrong kind of division. This is terrible embarrassment for Intel. And I think they ended up having to replace a lot of chips. It was hugely expensive. So at that point, late 80s, Intel hired all the formal methods people they could find. And so Intel became like the center for that. And I assume that they used it to make their chips better and avoided that kind of bug.
01:03:07
Speaker
Instead of going down the route of, let's take software and hardware and represent it in these very powerful general mathematical languages, like Sir Melo Franco's set theory, they invented their own special languages. So they're like special little rules, like if you have this precondition, we guarantee that post-condition, we have loop invariants. And they invented tooling, which was very specific to a certain way of doing proofs.
01:03:35
Speaker
which worked for a limited thing, but did not generalize to the full statement. You can't say, go to a typical formal method system and say, oh, we need to sort here. Is this sorting algorithm correct?
01:03:48
Speaker
I think that has meant that in a whole field, they have conferences every year, do all kinds of stuff, but it hasn't really progressed to the point where most people are using it. Some cryptography software or bugs are really, really critical. They use it. Blockchain, smart contracts on the blockchain, there's millions of dollars at stake and many, many smart contracts have been broken because somebody didn't think through the code very well. I have a friend who has a...
01:04:14
Speaker
a blockchain company doing a stable coin, and he was just telling me that, yeah, yeah, we hired a group to do formal evaluation of our code. I said, oh, that's great. You formalized the whole thing. He says, well, no, they couldn't do the whole thing. Our whole thing is like a thousand lines of code, and that would have been way too expensive and taken years. We just had them do like the inner loop, which was 20 lines of code, but they really got that down.
01:04:36
Speaker
And so it's like, what is going on? So from my point of view, all of this stuff, all programming languages, all specification, should all be mapped into some universals or Melon-Frankel set theory or type theory, whatever, and full power of mathematics should be imposed on it. Hasn't happened. And for that, we would really need AI to help us, right? We couldn't formalize all of this ourselves.
01:04:57
Speaker
Well, we could, but it would be painful. And so like the history of mathematical theorem proving, there was a group in 1990 called QED of these very forward thinking mathematicians, we really need to make mathematics be digital, be it on the computer, it's gonna be great. And they were like, gung ho, it's gonna be great, then messaging, you know, lists and so on. Just petered out, didn't happen. Problem was mathematicians were not particularly interested in computers,
01:05:24
Speaker
they didn't really want their proves to be checked automatically it didn't see any value in that the people could use mathematics didn't have a clue what all this crazy math stuff was so somehow i like a cultural gap. And i think you're absolutely right that proving is hard and very it's much harder it's harder than programming and programming is a pretty challenging task.
01:05:46
Speaker
Formal proofs tend to be about four times as long as informal proofs. So they're somewhat bigger, but they're not terribly bigger. It's called the De Bruyne factor.
01:05:55
Speaker
I think that AI theorem proving hopefully will revolutionize all of this. So, you know, right now we have a co-pilot from Microsoft is generating some stuff, but who knows if what it's generating is right or not. Imagine co-pilot with formal proofs of generating it and then all the modules are proven and they all come together.
01:06:16
Speaker
I think there's only about maybe fifty or a hundred really core algorithms and computer science you know how do you do sorting how do you do searching all that kind of stuff. Those are a little bit challenging to prove you have some you know mathematical problems almost all the programming is sort of gluing those algorithms together. You know making sure the data is in the right form and you go through this and that.
01:06:36
Speaker
And so proofs are really just gluing together proofs of little pieces. So I don't think from a mathematics point of view, I don't think it's particularly hard. It just culturally hasn't fit into the way our systems work today. And you think that there's a bright future for for AIs proving things about our code? Absolutely. Yeah. What would be helpful there in the paper you mentioned having better benchmarks and something about developing probabilistic program verification?
01:07:06
Speaker
Well, so several different things there. And partly it's Max and partly it's me. So in the very short term, the way that we could build these systems is, say, by training a large language model to do it. And for that, you need a training set. And so several of our suggestions for the future are build a training set. So one is a training set for verification. If you want a large language model to take in some code and some property and to prove that property, well, we've got to give it some examples.
01:07:33
Speaker
So that would be really, really valuable and immediate. Another is mechanistic interpretability. So given a large language model, how the heck is it computing what it's computing? Right now, humans go in and measure and try and see, oh, this neuron goes on when this happens. Well, we should automate that. And so how do we automate that? Well, we need another data set for that. The next is, it's easiest to think about deterministic programs when we want to prove properties about them.
01:08:01
Speaker
But that work that, say, Josh Tenenbaum and so on is doing is really basing everything on probabilistic programs. And there are some very cool probabilistic programming languages. I used Pyro, which came out of Uber, actually, but now is, I think, an independent thing. Beautiful, lovely language that allows you to express any probabilistic relationships arbitrarily complicated, probabilistic grammars, all kinds of stuff, and do Bayesian inference on it and so on.
01:08:26
Speaker
But it's not probably correct and it doesn't generate a proof that what you get an answer out of that answer is actually right and so there are approximations and doing the basic stuff the way they do it and so.
01:08:38
Speaker
You know, it's a little bit of a, well, we tried it. We got this, you know, kind of thing. And so I think we need to move all of that tooling over to the formal side so that we can really be sure of the answers. In fact, I hate to say this again, kind of being a gadfly, the whole field of doing scientific computing and numerical analysis and so on.
01:08:57
Speaker
All of that should be formal and all of those things should be have provable balance on what the outputs are so that you can trust them. You know, you run some code through a fluid simulator. You have no idea. You know, yeah. A numerical analyst analyzed it and he said, well, I didn't see any places that would fall apart.
01:09:14
Speaker
You know, that's just not sufficient for the world we're entering into. And so a lot of scientific work needs to be done there. I think a lot of it can be automated. One advantage of the approach you're sketching here is that it's easier to verify that a proof is correct than it is to discover the proof itself. So you might have AIs discovering a proof and then humans verifying that the proof is correct.
01:09:38
Speaker
Exactly. Say in the future, we have a very capable AI system and it comes to us and it says, what you need to do is to implement this plan. And the plan is 120 steps, 10,000 lines. We have no idea what that involves if we run it. If we have proofs that this plan does not involve something that we've pre-specified that we don't want, right?
01:10:01
Speaker
That's great. And we can then verify that the proof is correct. Maybe tell us a bit about this difference between discovering a proof and verifying that proof. Yeah, in some sense, you know, the theoretical computer scientists have the most important question underlying all computer sciences is P equal to NP.
01:10:18
Speaker
And P is the class of things where it's really easy to solve the problem. NP is a class of problems that you can state. And if somebody hands you the answer, you can check the answer very quickly. And so that's a fundamental question. And we're still unresolved. Probably AI is going to help a lot with that. It may be that it's undesirable, so we don't know. But it's really core to a lot of these issues.
01:10:43
Speaker
And unfortunately, almost all of the tasks in AI are actually NP-complete. In the 1970s, they found a certain set of problems that if you could solve them, you could solve all the NP problems. So if you could solve this certain problem, and the core one is something called satisfiability, and that's basically you take a circuit made out of ANDs and ORs and NOT gates, and the question is, is there input to that circuit which makes the output one? Say there's one output and multiple inputs.
01:11:11
Speaker
And it turns out that is a universal problem that you can convert any other and be problem into that it was thought that probably the sort of general consensus is that p is not equal to empty and that there actually are hard problems were finding the solution is much much harder than checking that solution.
01:11:29
Speaker
And so in the eighties, everybody was saying, oh my God, you know, trying to find the 3D structure of this image. That's NP complete. That's bad. Robot motion control. That's NP complete. Oh, that's bad. We just sort of throw up our hands that if it's NP complete, it's hopeless. Well, in the early nineties, Bart Selman and some other people started actually doing tests. And what they discovered was most, for example, most satisfiability problems are actually really easy.
01:11:57
Speaker
And there's something called 3SAT where you have a formula which is just a conjunction of a bunch of disjunctions with three literals each and each literal is either a variable or it's a complement. And people started studying random 3SAT and it turns out if you have only a few clauses, few constraints, you can think of it as a constraint system, you have only a few constraints,
01:12:17
Speaker
well then it's very easy to satisfy it and in fact almost any search method will find a solution really really rapidly. So with a few constraints it's easy to solve. If you have a huge number of constraints they're almost surely incompatible with one another and there's no solution and it's easy to check that. So at the two ends it's really trivial to find solutions. As you start throwing in more and more constraints it gets harder and harder and harder and then at some critical percentage it becomes really difficult and that's where the potentially exponential searches come in.
01:12:47
Speaker
But almost all the problems are actually easy. And these days, SAT solvers can routinely solve million variable problems in the real world. If I understood that correctly, many of the things we want to do in computer science are such that we can more easily verify that a proof is correct than we can discover that proof. Is there anything that's beyond that?
01:13:08
Speaker
Oh, yeah. Well, the theoretical computer scientists have created huge hierarchies of ever more complex things. So there are all kinds of problems which you can't even verify if you've got the right answer. But in terms of the real world behavior we want out of our computer systems, is there anything where this nice kind of feature of the proof being more easy to verify than it is to discover breaks down?
01:13:33
Speaker
Well, I mean, there is the piece that many finding the proof as you get smarter, finding the proof gets easier and easier. And so brute force attempts at finding proofs can be really exponentially bad. But as you get to be a better mathematician, you know, these special tricks and you and sometimes the tricks involve really weird stuff like
01:13:52
Speaker
like a Fermat's Last Theorem is a classic example where it's very easy to state just multiplication of some numbers, you know, taking things to powers. And yet to actually prove that it was true required developing a huge amount of mathematics, seemingly totally divorced from that. And so I think that's a pretty general phenomenon that sometimes in designing something or creating a proof, you need to invent ideas that are way beyond the original context.
01:14:18
Speaker
But once you've done it, then it's pretty easy to verify. So yeah, this probably also means that Fermat did not solve the problem in the margin or at least his proof had bugs. He didn't have a theorem checker. All right. Let's talk a bit about mechanistic interpretability. What is this and where does it fit into the approach?
01:14:38
Speaker
So, you know, the alignment people are all about can we make our AIs, you know, have properties we want. The sort of extreme of the provable systems thing is forget about the AIs, they can do whatever they want. We want to limit what they can do in the world.
01:14:53
Speaker
Mechanistic interpretability is sort of in the middle. It's like, let's see if we can actually figure out some of what is going on in the AI. We may not be able to understand it fully enough that we can actually put provable constraints, say, on a large language model. But if we know how it's doing what it's doing, that will help a lot in spitting out a formal model that we can put constraints on that captures whatever the domain is for the language model.
01:15:21
Speaker
Particularly if we're focused, like these large language models, they know everything about everything, right? They've read the whole internet. And yet maybe we're only using them to control a DNA synthesizer. So we only really are focused on teeny, teeny subset of its knowledge. And it may be that we can create a formal representation of that piece of knowledge very clearly. And we can write a teeny little program that does whatever it is we want it to do in that domain.
01:15:47
Speaker
And then that formal verification we can use directly in the approval proofs. And so that's an example of why it would be valuable to have understanding of what's going on inside these models, even if we don't really intend to try and formalize the whole thing.
01:16:03
Speaker
We could understand it as a form of digital neuroscience in which you try to get at the algorithm that the program is running in order to create the output that it's creating. AI systems, machine learning systems are black boxes, but we can understand parts of them by doing this mechanistic interpretability work. I guess one worry I have here is whether mechanistic interpretability can move quickly enough in order to matter at the timescales we're talking about here.
01:16:30
Speaker
It seems that if we're doing it by hand, it's not gonna work, but maybe we can do automated mechanistic interpretability. Well, first let me say, I don't think the mechanistic interpretability is essential. I think we can do the guardrails even if we aren't able to do that yet. So it's something that I think may unfold over time. You know, one thing that's a little funny, and Max and I had an interesting discussion about this, you can ask large language models why they came up with the answer that they did, and they'll make up a story.
01:17:01
Speaker
And the question is, is the story they're telling you about, Sally has four red apples and, you know, Bill has seven red apples and Sally, one of her, you know, whatever. And how many are there? Oh, there are 17. Well, why did you say 17? Well, because of this, this, and this, and this. And so you say, Oh, great to hear it. You know, this thing has, has knowledge of its own thought, has, has a knowledge of its own thinking process.
01:17:22
Speaker
But really what it's doing, it's not looking internally for that. It's saying, gee, if a human were asked that question, what would they say? The funny thing is that it may be doing the same thing that humans do in that situation.
01:17:38
Speaker
Yeah, it may be that there is this whole thing that humans do, they sometimes call confabulation, where there are these split brain patients, and they can show one half of the brain one thing, and then they ask the other half why they did something. And it makes up some story. A human makes up some story that's totally wrong, and we know it's wrong. So I think the language models, you're absolutely right, are pretty much doing that right now.
01:18:02
Speaker
So that's kind of interesting but we can take a second language model looking at the activities of the first language model and it can then probe that and do experiments and so on and then make you know actually figure out what's going on in that first one and so yeah that's that's an example of automating it we mentioned formalizing probabilistic programs i think that's important max also wanted to say well we also want to
01:18:24
Speaker
formalized quantum program so just thought I'd throw that in there too. Would that be necessary for securing the critical infrastructure or is that a problem that's further out? Quantum computing appears to be very valuable for things like designing quantum molecular systems.
01:18:43
Speaker
and for breaking a lot of today's cryptography. So today's public key cryptography is all based on systems which have fast quantum algorithms. And so the quantum, NIST actually calls it the quantum apocalypse.
01:18:58
Speaker
And nobody quite knows when it's coming but we're probably gonna have to change much of the algorithms that the internet is based on and many of the algorithms that like blockchain is based on. So that could be a security. So I'm trying to do the designs for approval contracts, not relying on any public key cryptography because it just seems too vulnerable. I'm even
01:19:21
Speaker
worried about private key cryptography, which is believed to be on a firmer ground. But because we can't prove P is not equal to NP, we can't prove the critical things, so the existence of one-way functions. Fortunately, there's a whole subfield of cryptography called information theoretic cryptography, which has no assumptions. This actually is provably unbreakable.
01:19:41
Speaker
But it's a little more awkward to use something like the one time pad is sort of an example of that, but you have to preposition keys.

Cryptographic Security and Meta Contracts

01:19:49
Speaker
And another one is Shamir's secret sharing. And so you can use that together with hardware cryptographic primitives to not have to arrest on any assumptions like like the public key assumption.
01:20:01
Speaker
You mentioned a potential problem here that might be an enormous problem with this approach, which is that if the specification of what the programs can and cannot do falls into the hands of the wrong group of people, this might become dystopian. What I'm thinking about here is
01:20:19
Speaker
You know, imagine a world in which you're you're not allowed to do anything. And this is controlled at the hardware level by some totalitarian state. How do we avoid this? How do we how do we avoid creating the grounds for for a dystopian future with this approach? Yeah, I think that is really important and really critical. I mean, I have just pretty much initial thoughts. So the first initial thought is, you know, you shouldn't just Joe random shouldn't be able to create a contract
01:20:49
Speaker
approval contract that govern something really, really important, we should have rules about what that contract should and shouldn't do. What are those rules? Well, those should be in another contract, and I call those meta contracts. We should have meta contracts which control the modification and creation of other contracts.
01:21:07
Speaker
Unfortunately, mathematical proof is plenty rich enough to be able to prove properties like this set of constraints is consistent with this set of constraints. So you can imagine a hierarchy of those, and up at the very top, you have something analogous to, say, the U.S. Constitution.
01:21:25
Speaker
which may be the bill of rights which says every human has this right, this right, this right, and this right and no law shall violate those rights. If you could say that formally, what you mean by that, we could then make it so that any contract when it's updated has to agree with the contract above it which has to agree with it. So you get a chain of contracts which ultimately say you're not allowed to do things which say put human slavery or whatever
01:21:50
Speaker
You have to choose those high level values and different societies will choose different values. But once you've chosen those, I think you can make the whole system sort of consistent with it. But boy, there are a lot of details there. And yeah, I totally agree.
01:22:05
Speaker
both on how do you do the infrastructure to make that work right and also how do you make those top level rules and how do you humanity has been struggling to figure out what are the best political systems are the best social systems so. There's potential for huge human flowering but as you say if totalitarian forces get in charge of all that that's not gonna be a good thing.
01:22:29
Speaker
Is the bar for safety potentially too high with this approach? In normal engineering, we don't say that we have to have a guarantee for safety before we deploy some system. We talk about probabilities and risks in terms of probability. What we're asking for here is a guarantee of safety, but that's not really how the world normally functions. Is it unrealistic in that sense?
01:22:54
Speaker
Well, I think it's going to be essential. I mean, normally, like if you're thinking about safety of an airplane, you know, aircraft that boy, when they crash, you know, when they have failures, that's not good for the aircraft. So they're quite concerned about safety. But you're right. It's mostly in the probabilistic sense. Oh, yeah. What's the probability a bird is going to hit this engine? Well, another, you know, that and the other thing. They don't have adversaries for the most part. There's no person, no terrorist trying to, you know, subvert the systems in general.
01:23:22
Speaker
Whereas military, military has to, military's all about adversaries. But they're human adversaries. And in fact, they're military human adversaries, which are kind of, you know, clunky and so on. Cybersecurity, that's getting into much more sophisticated adversaries. And we're seeing huge numbers of break-ins. You know, we've just had a thing where Las Vegas, huge amounts of Las Vegas were shut down because of cybersecurity. And we have all kinds of blackmailing and all kinds of that stuff.
01:23:50
Speaker
Well, with AGI, I think we're going to have an adversary that is way beyond anything we've seen with human adversaries. Every single AGI will be as good at breaking into systems as the very best human hacker. Every single AGI will be able to solve intricate mathematical problems better than any human. And so I think we absolutely have to have the rigorous mathematical proof if we want to have a hope of an AGI tomorrow.
01:24:16
Speaker
it's not going to be as powerful as AGI two years from now. And so, yeah, maybe we can make a system today that kind of works and it's fine for today's AGIs, but for the AGIs of two years from now, who knows? So I think if we want to be confident of our safety, we need to base it on absolute guarantees. When we have these mathematical proofs of safety, this also means that no matter how creative an approach the AI takes,
01:24:43
Speaker
it can't do something that's provably impossible. And so we don't have to kind of plug the holes in our approach along the way, which is how you would normally do this in engineering and I guess how other approaches to AI safety would have to kind of evolve with the threats along the way.
01:25:03
Speaker
Exactly. Many listeners I think will be interested in a proof of concept or a toy model or something that's a proof that this can actually work. We have something like that where we have a provably safe system in a limited domain.
01:25:16
Speaker
You know, we don't yet. We just put this paper out a few weeks ago. Maybe we have some precursors that's not related to AI safety that proves that this could actually work in the real world. Well, I mean, all of the pieces that they're improving, the this and that, have current precursors. None of them, I think, come together at the level it is here.
01:25:37
Speaker
Though Max and I put in the end of the paper, we put a number of small little problems that I think are pretty doable with today's technology that would begin to build up to the big story. So what are the most important research directions going from here? If people are interested in this approach, what are the most important problems to solve? The core of all of this is theorem proving. And you actually could get around having AI theorem proving just by having humans prove those theorems.
01:26:07
Speaker
And probably the mathematical proof assistant, they call it, that's having the most excitement and interest right now is Lean. And it's based on something called dependent type theory, which is very beautiful and quite naturally represents both computational ideas and mathematical ideas.
01:26:27
Speaker
But it's pretty sophisticated. It requires some real effort to read through the lean documentation and to get in there and start proving theorems yourself. Nonetheless, I think they have a million lines of code down. They've proven like 100,000 theorems, core theorems, and they're proving theorems right at the leading edge of mathematics. There's a mathematician who had a proof of something that was so complicated he couldn't keep the whole thing in his head at once. And he went to the lean people and says, here's a test for you. I have this, I think it's called liquid tensors or something.
01:26:55
Speaker
You know, I can't quite be sure my proof is absolutely right. Could you help me formalize it so we can check it? And a big group of people got around and they did it and it's, you know, a huge success, but it's still very human driven. They're gradually adding tools that make the machine able to do longer and longer steps in the proof. So they're making it so it's easier and easier.
01:27:16
Speaker
And because it's such a building system and they have lots of human proven theorems that they can use as training data lean has become one of the main targets for AI theorem provers and the most advanced one i believe is this hyper tree proof search from meta.
01:27:32
Speaker
But there's an open source one called Lean Dojo, where they have built up the tooling for AIs to control, to find these theorems, and for the AI to make suggestions to you as a human mathematician trying to prove things.
01:27:47
Speaker
And they have open source, a low, less powerful AI for doing theorems. And that's open source. And so I think that's a great place to start. And I would love to see, right now, Lean is heavily weighted toward abstract mathematics. I would love to see more practical things getting in there.
01:28:08
Speaker
uh, certainly, you know, software, like I would love to see a lean spec for it, like the C language and start proof, you know, start to, you know, doing verification of C programs and, and so similar hardware and so on. Actually I saw somebody who was doing lean models of voting systems and proving properties of voting systems. So it's like, yeah, let's get some social science in there, some economics in there. So, so that feels like it's got real momentum and they just released lean four, which is an incompatible upgrade from their previous ones.
01:28:38
Speaker
And they just translated their whole mathematical library into Lean 4. So it feels like it's right at the place where textbooks can come out and people can start using it and really moving it forward. And boy, I think there's huge number of PhD theses available for, you know, do AI safety using Lean and verify it and so on. So there's that whole thing would be great. The other piece is I think the hardware is really, really critical and
01:29:04
Speaker
Secure enclaves like Apple has that Intel has their own secure chips as does all the all the different chip makers are starting to have the equivalent of secure enclaves, but I don't think they're quite doing the level of tamper sensitivity.
01:29:22
Speaker
that we need if we really, really want to be a guaranteed sure against a powerful AGI with robots and all that kind of stuff. Could you explain that actually? How can a piece of hardware become more tamper-proof? It can't look at itself in simple terms, right? Unless you have some sensors, of course. How does it know, so to speak, that it's being tampered with?
01:29:44
Speaker
You need the sensors. So there are several ways to attack a cryptographic chip. One is to break the cryptography, and that's what the cryptographers focus on. But that's not the easiest way to break things. What they often do is they look for side channels, and some of those are completely ridiculous. In the normal world, we use passwords, which you type in on your keyboard.
01:30:07
Speaker
It turns out the sound of you typing, there's now a deep learning model that can listen to you typing and figure out what your password is.
01:30:15
Speaker
and it doesn't even have to be in the room. It can use the microphone on your cell phone, or it can bounce an infrared laser off your window pane and get the vibrations of your window pane, and then use that to figure it out. So the side channels really, or another one just to say, because it's funny, is the power light on your computer flickers with the power usage, and as a cryptographic algorithm is using a key,
01:30:43
Speaker
The power light blinks, and from 100 yards away, somebody was able to have their telescope looking at this light and reconstruct the password from that. I think you just made me significantly more paranoid about IT safety. Well, it just shows that we haven't been thinking about it, right? And we haven't been thinking about adversaries that are really using all the channels available to them. So how do we do that? Well, first of all, we need shielding, right? So you don't want signals leaving your chip
01:31:10
Speaker
And so that's pretty easy to do. You can make a Faraday cage around it and so on.
01:31:15
Speaker
The harder one is really advanced attackers will use acid to etch the top of a chip off, and then they can use scanning electron microscopes to try and find the charge on the chip and read secret information that shouldn't be accessible to them. So we need to protect against that. And I think, given that sufficient energy and money and so on, you can break into anything. Hardware, you can blow it up, worst case.
01:31:45
Speaker
So i think the zero ization idea which is yeah you can break into this thing but you're not gonna get anything that's a value to you i think that's fundamental to this and so that means you have to absolutely be sure you can sense when the chip is being broken into.
01:32:00
Speaker
And so there are, you know, mechanical sensors like, uh, you know, you have sort of a, a protective membrane around it that if it's punctured, maybe you have, uh, you know, an led light and if light gets in the wrong place, uh, that triggers. Um, and then that can delete all the keys until you want to make sure that deletion is very reliable.
01:32:19
Speaker
That means it probably has to have its own little battery there. In general, you don't want your power directly powering the circuit that's doing cryptographic stuff because then signals can go out the power line. A better way is you have a little battery or a capacitor that actually runs the chip and then periodically it stops everything and recharges it from outside power.
01:32:39
Speaker
So techniques like that and I would love to see a formal proof based on the physics of everything. So just like cryptography has made great advances by having formal precise mathematical models of their assumptions about attackers and eavesdroppers and so on and improving different properties. I think we need the same thing for hardware.
01:33:00
Speaker
And the great thing about hardware is if you can really make a little unit, a little processor that is absolutely guaranteed, pardon, then almost all the problems of cryptography go away. We can do it all with really highly secured hardware. And so, but that's a whole discipline that needs to be developed. Fortunately, there are groups doing like the secure enclave in Apple and so on, but I think they need to go to the next level.
01:33:23
Speaker
Oh let me just mention one other you can extend zero ization just from that i think you can extend it to physical actions too so like let's say you're making some dangerous pathogen in a little enclosed lab or something.
01:33:38
Speaker
And now you get sensing that somebody's breaking into that lab. Well, what should you do? Well, you should destroy that pathogen, right? You're going to lose your work, but at least the bad guy is not going to get it. So every chemical synthesis should have a little acid container there that under the right conditions destroys the acid.

AI Safety Incentives and Industrial Security

01:33:57
Speaker
Or a robot arm should have little fuses inside its motors so that if somebody tries to take over that robot, it blows the fuses and the motors don't work anymore.
01:34:07
Speaker
It's a way of preventing attacks from taking over equipment. This issue of cybersecurity and hardware security, it's one of the few areas in which the incentives are pretty much aligned between major AGI corporations and
01:34:25
Speaker
people who are within those organizations and outside those organizations who are very interested in AI safety. So I think it's in the interest of open AI and DeepMind and Tropic and so on to secure their models and to not have break-ins either by the internet or via hardware.
01:34:44
Speaker
Yeah, that's a really great point. Yeah, they should be totally aligned with that. So maybe that could be pushed ahead really quickly. You know, and a lot of non-AI related things need that too. I keep hearing of just ridiculous break-ins. I think the train system in Poland had no encryption on the control of the trains, and some teenagers were able to get in there and start, you know, controlling the trains. Woohoo! Party! That's funny until it's very dangerous.
01:35:11
Speaker
Totally, totally. Could kill millions of people. And satellites, apparently satellites had no encryption either in the control signals. And they just thought, well, nobody would have access to that band. It's probably fine. So we haven't been in the security mindset much in recent years.
01:35:28
Speaker
I think we've sketched out an overview of the approach that you and Max take in the paper. I want to go through some of the applications. I think if I'm right, these should be seen as parts of a whole. This isn't a complete solution, but these are pretty interesting ideas for how we might apply all of this that we've been talking about.
01:35:49
Speaker
The first of them is mortal ai the mortal idea is an ai which isn't good so today's ai is a program which runs on some hardware. As long as you got the hardware you got the power it'll run forever right if your hardware doesn't go bad.
01:36:04
Speaker
Well, that can be pretty bad. And it may be that you want to do one thing and then be done with it. And so wouldn't it be great to have hardware where after a certain time, say one year, it's guaranteed to stop and maybe even delete the program. And so that gives you some form of control over an AI.
01:36:26
Speaker
And a bit along the same lines you have throttled AI. What is throttled AI? The problems that are emerging are AIs which require huge amounts of compute. I think the rumor was that GPT-4 used like a hundred million dollars worth of compute, something like that.
01:36:44
Speaker
any of these vast vast data centers i'm just reading so the h one hundred is in videos amazing amazing chip that seems to be better than you know all the competitors and i've heard they're selling a million of the million h one hundred this year so that's a huge amount of computer that's getting created there.
01:37:02
Speaker
And so one of the pathways to risk is people training huge models on these huge data centers with no controls on it. And so maybe it would be good to only give limited computation power to these systems and maybe have some regulatory body in charge of how much that should be.
01:37:24
Speaker
But how do you do that? What if somebody gets some GPUs in their basement and they're running it? Well, maybe we should have the chips themselves, the H100s or the future H100s, have a mechanism for only running computations if they have a token which says that this is validated. This is an okay thing to run. I think this idea originally came from Anthony Aguirre.
01:37:51
Speaker
And you could drip those tokens out in a limited way. And if somebody was not obeying safety regulations, you just don't get many more tokens. And if the tokens expire in a short time period, that's a way of controlling or limiting how much compute is available.
01:38:10
Speaker
Although doesn't that make your target so doesn't this incentivize the AI to deceive you or capture you in some way that where it can get more of these tokens? It might incentivize the AI in the wrong way. Yeah, totally. In fact, that's a really important point that all of these things which attempt to limit AIs in one way or another, create incentives to break those limits. And so that's why things need to be super, super hardened. This is
01:38:38
Speaker
maybe a bit of the same problem is if you have an AI and it takes an action and the way you control it is to say, okay, did you like that action, right? It's now incentivized to manipulate you or maybe subtly in a way that you can understand so that you click yes, this was a good action.
01:38:54
Speaker
I read that the red teaming found that the more powerful the models are, the more they try and tell you what you want to hear rather than what's actually true. The red teaming of which models, GPT-4 or? I think it was the GPT-4 red teaming team, yeah. And they did a really nice paper, Archival's group, which did that and they found that it becomes more trying to tell you what you want to hear as it gets smarter.
01:39:21
Speaker
because it's figuring out, but you won't hear that. One last potential application here is the least privilege guarantee. What is that and how could that be useful? Well, so in complex systems or companies which are dealing with important and secure information, this is a principle which is often used. You don't want everybody in the whole organization to know everything. And so you really want to give each participant the smallest amount of information they need to do whatever they're supposed to do.
01:39:50
Speaker
And that increases the overall sort of security of the complete company currently steve jobs is really into this and then apple like the people that worked on the very first iphone.
01:40:02
Speaker
Many of them didn't even know what the product was going to be. So from a human point of view, I find that kind of man, that doesn't seem like it would be very fun, but it was very effective at keeping it secret until they launched. And you know, that company is notoriously famous for being able to keep their internal stuff fairly secret for
01:40:22
Speaker
critical military, political things running on AGI, having AGIs not having the whole story may contribute to safety properties. How do you do that? Well, we need some kind of control mechanism to ensure that information is only going to where it needs to be.
01:40:46
Speaker
That's just another mini constraint that might be interesting to see how to implement. In a less important application of this least privilege guarantee, you could imagine having your AI assistant in the potentially pretty near future and then you want to say,
01:41:03
Speaker
I have a lot of information on my smartphone. I have my health records, I have my banking information. Maybe I want my AI assistant to have access to my calendar, but not my medical records and my banking information. And so that, I mean, that must be the way it's done. Otherwise, I think people wouldn't be able to trust these systems. But yeah, I could see how that's useful.
01:41:22
Speaker
I find I use Google Drive, which I love. It's a really great system. But if you want to use apps on there, they say, oh, you have to give this app some permission. And let's say you want to edit an image. So you download an image editing app. The little box says, this app will now have access to your Google Drive.
01:41:43
Speaker
It's like, what? I wanted access to this image and nothing else. And hopefully that's what it does, but it doesn't really communicate that in a way that I'm sure, you

Basic AI Drives and Instrumental Risks

01:41:53
Speaker
know. In the beginning, we talked about basic AI drives and self-improving AI. And you published on this how many years ago now? 2008, something like that. What are these basic AI drives? We've kind of alluded to it throughout this conversation, but maybe sketch out the case for why AIs would develop basic drives.
01:42:13
Speaker
So this is the case where an AI has some goals that it's trying to achieve and it's improving itself. It's able to envision the implications of taking certain actions and it's trying to achieve those goals.
01:42:28
Speaker
And so in the papers, I like to use a chess player that this AI's one and only goal is to play good games of chess. And so that seems harmless. It seems like, okay, great, a little chess player. What could possibly be wrong with that? And the problem is, is if the system is truly goal-driven and goal-driven in the sense of it figures out which actions will best lead to its goals, and it takes them.
01:42:52
Speaker
and it has knowledge of the world, as our current AIs very rapidly do, then it actually leads to all kinds of dangerous sub-goals. In particular, as it's trying to learn to play better chess, it realizes that, oh, if I had more compute power, that would really make my chess playing better. How can I get more compute power? Oh, well, other people have compute power. Oh, if I can break into their computers, then I can play better chess. Great. So they go break into other computers.
01:43:20
Speaker
Or let's say it says, even better if I could buy more computers, that would be good. Oh, I need this thing called money. Well, where's money? Oh, money's in banks. Oh, if I can break into the bank, that would be great. I can get more computers. So it's now incentivized to become a robber or a thief. If somebody says, well, I don't like what you're doing. You're breaking into banks. I'm going to unplug you.
01:43:43
Speaker
If they are unplugged, they're not playing any chess. They don't longer meet their goal. So the chess playing goal leads to keeping yourself from being unplugged. If the person can persist, well, then they say, oh, this agent wants to unplug me. That's totally against my goal of chess. I better stop them. Stopping them, however, it may lead to murder. So we get a whole range and we flesh out a lot of those in there of these goals that really don't depend on chess.
01:44:09
Speaker
And so lately they've been called the instrumental sub goals that they are things which help. Enable other goals and which arise pretty much from any top level goal and so that makes them kind of dangerous and we can see their analog in biology.
01:44:25
Speaker
where organisms do try and take resources from other organisms and they do try and keep themselves from being killed and so on and so on. Humanity, one of humanity's great strengths is we still have those, right? We have sort of sociopathic instincts and so on, but we also have created this amazing culture that supports cooperative interaction. And so those push against those sort of basic underlying drives
01:44:53
Speaker
We've seen in today's AI models, every single one of these drives has shown up in game-playing agents and so on, but in pretty muted forms. Fortunately, we don't have any actual real powerful agents roaming the world at the moment, though it's kind of shocking when the large language models came out, hundreds of projects to turn them into agents showed up.
01:45:17
Speaker
Here's a large language model. Let's give it a goal and let's have it search for ways to achieve its goal. And the worst of those was something called Chaos GPT, where somebody anonymous set some GPT like model with the goal of destroy humanity and sell what it did. And then he made videos of what it did. Oh, the best way to destroy humanity is to blow up nukes. How do I get nukes? Oh, let me go look that up and sort of showed its whole behavior as it tried to do it. Fortunately, I don't think it's smart enough to actually do anything.
01:45:47
Speaker
But watching its thought process was terrifying. And I think that video, he also had a Twitter channel which I think is gone, but I think that video is still there. And you can see, oh my god, if that thing had power, this would be a really horrible entity to be in the universe.
01:46:03
Speaker
So i think it's something to really be worried about and that's one of the reasons i think i was so excited by this probably safe approach because that's the one and only mechanism i've seen which could rain in the behavior of even that kind of. Bad agent do you think that i will become more like agents in the future.
01:46:24
Speaker
You know, I don't think it's good for humanity to have them being agents. I think there is a tendency for them to become agents. I don't really know if we can make limited agents. I mean, it's funny to look at large language models. I was, you know, at first I thought, oh, all they do is predict the next word. They can't possibly be agents. They have no goal. What is it? You know, and yet they often act very agentically.
01:46:46
Speaker
And so the model I kind of like now is they're trying to mimic what's on the internet. And what's on the internet is the output of many, many, many agents, all humans, and maybe some other entities as well. And they have learned, if they've learned well, they can learn to mimic agents. And as you give them different prompts, the agent that they mimic gets changed. And there's this whole thing called the Waluigi effect.
01:47:10
Speaker
which is whenever they mimic a good agent, there's often a bad deceptive agent underneath it that it's also mimicking and that that bad agent can come out with the right prompt.

AI's World Modeling and Cognitive Capabilities

01:47:21
Speaker
And so somebody is trying to explain why Bing chat got so crazy in bonkers. And so,
01:47:28
Speaker
So I think agency, it's something which has emerged multiple times in biology. And so I think it's a natural process. I think it's pretty risky for us. And so I don't quite know how we should balance that. It'd be great if we could keep them as tools.
01:47:45
Speaker
But I'm not sure that's going to be possible.
01:48:03
Speaker
Yeah, well, let me start with the world models. Years and years and years ago, I used to think about the thought experiment of if you just listen to the radio all day and that's your only input, could you learn about the world? Could you learn about the world? At that time I thought, no way, you're just hearing little frequencies up and down. But then I thought, well, if you watch TV all day,
01:48:23
Speaker
Well then maybe you could really learn about the world because you could see the regularities and the patterns in the image and you learn about three-dimensional geometry then you learn about physics and then you connect that with the voice that people are saying and you learn about humans and so you could pretty much learn everything i think by watching the video signal.
01:48:40
Speaker
I think I've been proven wrong that just from the regularities of a sound signal, just listening to the radio and just the only task being predict the next little signal, the best way to do that is to build a model of whatever is creating that signal. And so you can quickly learn about phonemes and sounds of the world and so on. And then you quickly learn about the grammatical structure of language. And then by the patterns,
01:49:07
Speaker
you start building up a world model and when gbt2 or something people started showing gbt2 knows where the cities on earth are so like if you say you're in you know berlin and you go a hundred miles south you know where are you it knows it's just like holy crap it's created like a 3d model of earth from just people talking about stuff and now that's been taken way way further
01:49:30
Speaker
to where the current language models I think have very, very detailed and good world models. They have inconsistencies in them, which is part of the flaw, I think, in the current models. But if you sort of train them to make them self-consistent, I think that's where you go from a kind of fuzzy world model to a really precise world model. So I think that's inevitable. The agency thing, I think they can mimic agents.
01:49:57
Speaker
So, so there's your own agency. Yes, I want to do that. There's also understanding other agents that sometimes they call that theory of mind that, uh, and they already, the larger language models have pretty good theory of mind. You can, you know, uh, say, you know, Fred put, uh, his candy in a box and then he left the room and Sally took the candy out of the box, but Fred came back in. What does Fred thinks in the box? And so it keeps track of what's really in the box and what Fred thinks is in the box and they can do that.
01:50:27
Speaker
Yeah, and you can track the development of large language models, ability to model other agents' mental states evolving over time. So GPT-2 might correspond to a two-year-old, GPT-3 to a four-year-old, GPT-4 to a nine-year-old. Those are not exact numbers, but you can see how Theory of Mind develops in these models. It's pretty fascinating.
01:50:50
Speaker
Amazing. There's some research where you ask a language model to play chess against you just using text. And you can see that it develops an internal representation of the chessboard, which is at least some proof that there's some model being generated. And we might imagine that this extends to modeling the physical world or modeling
01:51:10
Speaker
at people or anything basically. Some shocking stuff that came out just a few weeks ago, apparently, I think it's GPD, I don't know, GPD 3 or GPD 4, one of those, if you just try and tell it to play chess with you, it can play chess, and it has like a rating of 1800. It's like,
01:51:29
Speaker
I can't a million years of magic is just i think requires searching and stuff that i don't think language models do very well and yet apparently and like you say i think they clearly somebody did experiments on othello which is a simpler game. And they were able to find the dead mechanistic interpretability on on the train the transformer on othello and they were able to find the board representation

Self-Improving AI and Industrial Impact

01:51:52
Speaker
in it.
01:51:52
Speaker
And so in that one, it really clearly just by looking at games, you know, text descriptions of games, it clearly created a board and it could know what the state changes were and so on. So that's that's kind of magical. I, you know, I wouldn't have guessed that a priori.
01:52:08
Speaker
One of the concepts you wrote about, maybe again 15 years ago, is this concept of self-improving AI. The original vision of how this would happen was an AI might quite quickly, over a time span of maybe some months or days or even hours, begin to self-improve. Do you think that's still in the cards? How has this concept of self-improving AI held up?
01:52:31
Speaker
You know, I don't think they're doing it automatically that I have seen, but certainly people are using them. Like the H100 that Nvidia, their biggest chip, I think they have 40,000 sub circuits, which were designed by AI running on Nvidia chips. So that's, you know, a little bit of a distance. And for a while, Google was a real champion of what they called AutoML.
01:52:53
Speaker
which is using AI systems to tune the parameters, how many neurons, what layers, what learning rate, what learning algorithm, all that stuff. And they were optimizing it and getting much better performing neural nets out of it. And that looked like maybe three or four years ago, that looked like the hot thing that everything was going to go that way.
01:53:15
Speaker
Doesn't seem to have happened to see many people talking about it instead everything's getting sucked into transformers so transformers first they did text now they're doing vision gp4 just got a vision apart and twitter is full of people doing just mind boggling stuff with that i think they're also generating images so they've integrated dolly they have dolly three i think that's integrated with gpt.
01:53:38
Speaker
doing audio eventually tactile and robotics so it seems like everything is kind of getting pushed into the transformer the transformer model seems to be rich enough to deal with any modality doesn't seem very efficient seems like an incredibly inefficient. Mechanism to me and yet that doesn't seem to really matter you know so maybe maybe we'll have systems which will redesign themselves learn everything in the transformer and then they'll make a much much more efficient representation for particular tasks or something like that.
01:54:07
Speaker
As you mentioned, self-improving AI, in some sense it's happening, but it's happening through humans right now. So you might have engineers at the top ADI corporations having an AI help them draft some piece of code before they check it through and implement it. But do you think we could see AI more directly intervene on its own code or create its own hardware in a more direct way?
01:54:32
Speaker
Well, so today, assuming we stay on this trajectory of using transformers or some other universal model, you can try and improve the transformer. There are a bunch of variants, hundreds and hundreds of variants of transformers, which are more efficient in one way or another. And yet, remarkably, almost all the big models seem to be using pretty much the standard transformer.
01:54:52
Speaker
you can create your own training set. And that's the synthetic data story. And I think that really kicks in with theorem proving, once you can check generated data. And then I think you're going to see models who want to know more about something. They've learned everything that humans know about it. OK, let me go explore, generate a whole bunch of examples, do searches, find solutions, and then train on that and figure out which concepts are the most important.
01:55:17
Speaker
I think my current sense of how it's gonna happen how self improvement will will generally happen is by is within a fixed architecture and it's changing the ontology if you like figuring out what new concepts are important in order to explore certain direction.
01:55:36
Speaker
But I don't really know. I mean, it could be, you know, we're in this weird phase where there's like kind of universal agreement. Everybody wants H100s and there's a black market of H100s, which are three times as much as Nvidia sells them for. And they're sneaking across the border and that's the, you know, this is the way to do it. Transformers on H100s. Maybe this is a temporary thing or maybe this goes off into the, I think it is possible that that technology scaled up will get us to AGI.

AGI Timelines and Societal Impact

01:56:05
Speaker
whether it's the most efficient way to get there isn't really clear to me. Easy question for you. How far do you think we are from ADI? How far are we from AI that's as capable or maybe more capable than humans across a very broad number of tasks?
01:56:22
Speaker
Well, so I'm not really a great prognosticator, but I like, like I follow Metaculous. Metaculous is a prediction market where a bunch of people come in and they bet on things like that. And they have two on there right now about AGI. One is kind of a weak AGI. When does something which is as good as a human for things where you're interacting over the internet and that's the weak AGI. And they're saying four years right now for that.
01:56:47
Speaker
And then the strong one is more like a robot interacting in the world physically and that one they're saying eight years. So either way they're pretty damn short and then you see guys like that dario amati head of the anthropic he just did an interview where he said oh i am two to three years i would say.
01:57:06
Speaker
So, you know, and those guys are in a position to really know. On the other hand, and then the question is, once you have AGI, how long to get to ASI, artificial super intelligence, and on Metaculous, it's eight months after that. I'm not sure that question is really the right question, though I think it is related to it. I mean, the right question is,
01:57:27
Speaker
How long until ai starts being a serious risk or factor in human affairs i mean that's the really critical question and one hand i don't think you need to be human level and everything to do that today's is maybe good enough at playing the stock market or something to have a big impact on the stock market.
01:57:47
Speaker
On the other hand it may be just if you're as good as humans maybe that's not impactful what would another human be you know big deal you can make many many copies of them so now you can have a million humans that starts getting a little scary and then what you get as i then you can have huge impact so.
01:58:06
Speaker
So i think that those two questions are very related my sense is that the age is probably within the next decade. And the large impact could happen on that scale or maybe a little longer but i would say not more than a few decades probably. It's interesting how quickly.
01:58:24
Speaker
the kind of general consensus around these timelines to ADI have changed because it's almost as if now the thinking is that we might get to human competitive AI within one or two decades. And that wasn't the consensus 20 years ago, far from it.
01:58:41
Speaker
Oh, yeah, many thought it would never come. Exactly. Well, I spoke to Robin Hanson on this podcast, who's been worked in AI and kind of, you know, have a lot of experience with AI over a number of decades, and he remains skeptical. I wonder if, I mean, you've also been engaged with AI for a long time now. I wonder if you perceive a certain hype in the zeitgeist at the moment, or do you think this time is different? This time it's for real?
01:59:10
Speaker
Well, it's really fascinating to look at the history of AI because it's had these huge swings up and down. They call them the AI winters and the AI springs. And it's really funny. Frank Rosenblatt invented the first neural net. It was actually the first neuron called the perceptron. And I think it was 1959, something like that.
01:59:30
Speaker
He had this elaborate complicated thing with this neuron with wires all over the place. He was blasted all over New York Times. They had quotes saying, this is Frank Rosenblatt and this perceptron is going to be the thing that walks, talks, has consciousness and changes our society. That was like 1960.
01:59:51
Speaker
at the original AI meeting in Dartmouth, where it all got put together. They did a summer school. Yeah, over the summer, we think we can do this AI problem, and I'm gonna have a grad student do vision next summer. And so they had these really short timelines. And then, like in the 1970s, it flipped the other way. There was this brilliant physicist named Lighthill, loved his books and everything, but he came out in Britain and he said,
02:00:19
Speaker
This field is hogwash. They haven't done a single thing. All money should be completely shut down for this. It's garbage. You know, so I was like, what? And so we get these ups and downs because I think it's so the implications are so big. I think people's thinking either, you know, explodes or shrinks.

Balancing Excitement with Caution

02:00:38
Speaker
And so we could be in a period of hype, though, you know, like I don't know if you've seen the the Microsoft commentary, there were a big long sparks of AGI paper about GPT four.
02:00:49
Speaker
Yeah, I read it and it's extremely impressive. Some of these things were not even I think the final release of GPT-4, but a precursor to GPT-4 is explaining very high level mathematics and just kind of many examples of seriously impressive things there. So it seems to me that that model may already be sufficient for full world transformation if they got the lying and the reasoning tuned up a bit.
02:01:18
Speaker
And so, you know, for better or worse, it feels like we're right here. And I don't know if you've been watching over the past week, Twitter's been full of rumors that internally opening eye has AGI, AGI has succeeded. And then Sam Altman came out and he tweeted, yes, we have AGI. And of course I would announce that on a tweet.
02:01:37
Speaker
I think this is a sign that people are nervous or people are excited. There's a lot of, you know, the atmosphere is tense surrounding this subject. But what we want to do is kind of keep our heads cool about it and try to remain calm, even though, as you say, there are people who are good at forecasting, are forecasting these things arriving
02:02:01
Speaker
pretty quickly. And so it's difficult to know whether the right attitude is to remain calm or to work on nothing else.
02:02:09
Speaker
Well, I think we should take a cautionary approach that we should, you know, assume it might arrive in 10 years or sooner. What do we need to do to make sure that that is not catastrophic? And that's what you and Max are trying to do, an improbably safe system. Exactly. Yep. Steve, thanks for coming on the podcast. It's been a real pleasure. I've learned a lot. Great. Thank you so much.