David Dalrymple on Safeguarded, Transformative AI

Future of Life Institute Podcast

David "davidad" Dalrymple joins the podcast to explore Safeguarded AI — an approach to ensuring the safety of highly advanced AI systems. We discuss the structure and layers of Safeguarded AI, how to formalize more aspects of the world, and how to build safety into computer hardware.  

You can learn more about David's work at ARIA here:   

https://www.aria.org.uk/opportunity-spaces/mathematics-for-safe-ai/safeguarded-ai/   

Timestamps:  

00:00 What is Safeguarded AI?  

16:28 Implementing Safeguarded AI 

22:58 Can we trust Safeguarded AIs?  

31:00 Formalizing more of the world  

37:34 The performance cost of verified AI  

47:58 Changing attitudes towards AI  

52:39 Flexible Hardware-Enabled Guarantees

01:24:15 Mind uploading  

01:36:14 Lessons from David's early life

Transcript

Introduction to the Podcast

00:00:00
Speaker
Welcome to the Future of Life Institute podcast. My name is Gus Docker, and I'm here with David Dalrymple, also known as Davidad, who is the Program Director of Safeguarded AI at ARIA. David, welcome to the podcast. Thanks, Gus. It's great to be on the podcast. Fantastic. All right.

Critical Phases of AI Capability

00:00:19
Speaker
Maybe we can start by talking about how safeguarded AI fits into the landscape of AI safety efforts.
00:00:26
Speaker
So I would say it's important to distinguish between different phases of critical AI capability thresholds, where different risks are prominent and different approaches are needed to mitigate those risks. There is the current phase, which is associated with prosaic AI safety, where we are mostly concerned with misuse risks, and the kinds of mitigations we need are basically monitoring, jailbreak resistance, and ways of stopping the AI outputs from being steered in a dangerous direction by a user.

Ambitious AI Alignment

00:01:04
Speaker
There is the endgame phase, which is the next most common thing for people to talk about, where you're working with an AI system that is so powerful that it would be hubris to think you could possibly contain it at all. It might discover new scientific principles that invalidate the basis on which you thought you had it contained, in less time than you would need to notice and shut it down. The mitigation required for that endgame scenario is ambitious alignment, which means that no matter what situation this algorithm finds itself in, it will behave well, or benevolently.

Navigating the Middle Phase of AI

00:01:45
Speaker
And so this requires doing a lot of philosophy. The agent foundations direction falls into this category of ambitious alignment.
00:01:52
Speaker
Safeguarded AI is targeting what I call the critical period in the middle, where we're dealing with systems that are too powerful to just deploy on the internet, or even for humans to talk to, because they might be very persuasive about very damaging memes, or they might emit unstoppable malware if they're connected to the internet.
00:02:16
Speaker
But it's not like they're so smart that it would be impossible to contain them if you actually tried. And so the question that I'm trying to answer in Safeguarded AI is: how could we make use of a system at this middle level of superintelligence without creating unacceptable risks?

AI in Safe Autonomous R&D

00:02:35
Speaker
And the answer, in a sentence, is that we create a workflow around this contained system, which is a general-purpose way of using it to do autonomous R&D of special-purpose agents that have quantitative safety guarantees relative to specific contexts of use. So we do autonomous AI R&D, but not on AGI, so it's not doing self-improvement. But if it's capable of that, then it should also be capable of doing autonomous R&D on special-purpose agents for particular tasks
00:03:15
Speaker
at scale. And so we could do it for many, many special purposes. And that's what I'm hoping to demonstrate in the program. And these special purposes, what would be an example here? Yeah, so some examples include discovering new materials for batteries; determining a control policy for when to charge and discharge batteries based on the weather forecast for solar and the demand forecast on the power grid; optimizing clinical trials so that they can be faster and more efficient; and manufacturing monoclonal antibodies, which right now are made
00:03:49
Speaker
in bioreactors that need to be supervised by PhD biologists and steered in a healthy direction. So if we could automate that, and the quality control associated with it, it would bring down the cost of monoclonal antibodies by orders of magnitude. And they're super useful for lots and lots of diseases, and they're just not used right now because they're so expensive to make.
00:04:09
Speaker
It could be things like automating the manufacturing of factories that manufacture other things, like metagenomics monitors for all of the airplanes in the world, to monitor whether any new pathogens are showing up in the air on board.
00:04:27
Speaker
So basically, there are things that help with climate and health, that provide economic value, and that manage energy. And there are also things that help with some of the attack vectors that rogue AI might use, like biothreats and cyber threats. So creating formally verified software and hardware that does not have exploitable bugs is another example.
00:04:49
Speaker
So this is both about capturing the upside of AI, but also about limiting the downside by having these specialized agents perform their purposes. Yeah, so it's about getting a substantial upside straight off the bat, and then also about using this critical period to help improve civilizational resilience to rogue AI that might emerge later. How long should we expect that middle period to be? Because as things are right now, I feel, and I'm guessing many listeners feel, that these periods are getting squeezed together and we might not have much of a middle period. What do you expect?
00:05:25
Speaker
So I think it's appropriate to have radical uncertainty about a lot of things in this space, including these kinds of timeline predictions. I think it is plausible that, even on the default scaling pathway, if we think about it as one order of magnitude every year in effective compute as a sort of pessimistic approximation,
00:05:48
Speaker
this middle period, where it's much smarter than the top humans but not smarter than all of humanity put together at 100x speed, is something like six or seven orders of magnitude. So my median expectation is around six or seven years in this critical period on the default scaling path. But it's also important to note that if there is a way of getting a lot of the economic upside,
00:06:13
Speaker
and if there is a way of resolving the assurance dilemma of knowing whether other people in the world are following the same pathway that you are, then it might well become a stag hunt game as opposed to a prisoner's dilemma game,
00:06:29
Speaker
where there is an equilibrium in which everyone decides, let's just stay with this way of doing things for a little while longer. It depends on your discount rate. If people use economic discount rates, which are now around 5% per year, and that's how you value the long-term future, even then you can make an argument, which I do in the program thesis, that with a successful program it would be an equilibrium to hold off on going all the way to full superintelligence for about 12 years.
00:06:59
Speaker
So there's that angle. And if it becomes recognized that it's important to have a lower social discount rate, one not tied to the economic discount rate, which is starting to be the case for climate change, then it could be an equilibrium to hold off for much longer.
00:07:18
Speaker
And it will be easier to convince companies, nations, and so on to hold off on superintelligence when you can present these benefits of more narrow systems that include safety guarantees. So you can capture some or much of the upside without taking the risk of developing highly advanced systems too early.
00:07:40
Speaker
Yeah, that's right. I mean, if people are looking at 10% GDP growth as an option, it's not an equilibrium to say, let's give up that option entirely. But if we're saying, okay, we're only going to do special-purpose things that are possible to define formally, and that only covers 15% of the economy, then that suddenly becomes viable. What role does a gatekeeper play within the Safeguarded AI program?
00:08:05
Speaker
Yeah, so this is a little bit of a complicated question, because it gets into the details of all of the different boxes of formal verification; the gatekeeper is kind of the innermost box of verification. We can go through the whole stack, by the way. Let's go through the whole stack. Let's start from the outermost level. So at the outermost level, we're saying we want to keep an AI system contained. So we want to have a formally verified hardware system that runs a formally verified operating system that maintains invariants of integrity and confidentiality.
00:08:40
Speaker
Confidentiality and integrity are the way that cybersecurity people think about it. But if you're thinking about it from a containment point of view, those are actually output side channels and input side channels. If there's a way of information leaking to the outside, that's a confidentiality violation. If there's a way of information leaking from the outside to the inside, that's actually an integrity violation.
00:09:01
Speaker
So these types of properties, which people in confidential computing are starting to actually integrate with GPUs, are also important for containment. So the outermost layer of formal verification is to verify these bounds on information in and out and make sure that we have that controlled. Yeah, maybe here say a bit about what it is you mean when you talk about formal verification. What can we guarantee formally about these systems?
00:09:29
Speaker
By formal verification, I mean using mathematical proofs that are grounded in assumptions which come from either a model of the design of the microprocessor or, if you're verifying the microprocessor itself, from physics, from solid-state semiconductor physics.
00:09:51
Speaker
And you verify that the design of the microprocessor, if it is implemented in a world that satisfies these physical principles, will have behavior that aligns with its specification. And then at the software level, you assume that the microprocessor behaves according to its specification, which is a bit more of an abstract, axiomatic system.
00:10:12
Speaker
And then from that axiomatic system, you deduce that if you put this operating system on it, then regardless of what programs you run on that operating system, the confidentiality and integrity properties are invariant. So what we're doing there is saying: if our understanding of how computers work is complete, and we're not missing some extra scientific phenomena, then these confidentiality and integrity properties will hold, with an exponentially low probability of failure. That's great. Okay, then let's move to the next layer of the stack.
00:10:52
Speaker
Yes. So inside the containment system, we're running this powerful AI and we want to get something out of it. So we don't want to contain it 100%, but we also want to get as few bits of Shannon information out as we can manage. We don't want the system inside to be able to freely choose lots of bits of information to communicate to the outside.
00:11:16
Speaker
But if we've specified a formal problem that has a unique solution, say a unique optimal solution to an optimization problem, then what we can require of the powerful system inside containment is to produce not only that solution, but also a proof that it is correct and unique.
00:11:37
Speaker
Then inside the box, we can check that proof. And this is the next layer of formal verification: verifying that this proof is correct, that it is indeed a proof of the correctness and uniqueness of the solution. And then if that check passes, we output the solution. We don't output the proof, because the proof won't be unique, so there would be lots of ways of doing steganography in the proof itself.
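
To make the shape of this workflow concrete, here is a minimal sketch in Python. It is purely illustrative: the object and function names (`contained_system`, `proof_checker`, `spec`) are hypothetical, and a real implementation would use a mechanized proof checker rather than Python.

```python
# Illustrative sketch of the proof-carrying output workflow described above.
# All names are hypothetical placeholders, not part of the Safeguarded AI program.

def release_solution(spec, contained_system, proof_checker):
    """Ask the contained system for a solution plus a proof that the solution is
    correct and unique for `spec`; release only the solution, never the proof."""
    solution, proof = contained_system.solve_with_proof(spec)

    # The proof is checked inside the containment boundary.
    theorem = spec.correct_and_unique(solution)
    if not proof_checker.verifies(proof, theorem):
        raise RuntimeError("Proof rejected: nothing leaves the box.")

    # The proof is discarded rather than emitted, because proofs are not unique
    # and could carry steganographic content; the unique solution carries very
    # few freely chosen bits.
    del proof
    return solution
```
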
00:12:01
Speaker
And there are ways of saying, okay, if we don't trust the proof checker inside the box not to be corrupted, then we could redundantly have the proof encrypted and sent to five other clusters around the world, where secure enclaves will run different implementations of the proof checker,
00:12:16
Speaker
and then destroy the proof once we've checked that it's correct. But basically, the point of all of this verification is threefold. First, we do actually want to know that it's correct. We want to avoid confabulation and hallucination, which might matter even for models that are capable of quite powerful behavior; they might still be less than fully reliable at exhibiting those capabilities.
00:12:41
Speaker
But if they're powerful enough to be worried about causing a global catastrophe, probably they are pretty reliable. So it's not so much correctness; it's more these other two benefits of formal verification. The second one is giving us, humanity, confidence that it is correct. Even if it was going to be correct anyway, we would have a hard time distinguishing that from the world in which it was scheming. So having the check helps with game-theoretic coordination.
00:13:11
Speaker
And then the third benefit is reducing the causal empowerment. Because we're verifying uniqueness, or we can verify something close to uniqueness, we can put quantitative bounds on the transfer entropy. We're reducing the amount of causal power that the system inside the box has. So if it's only discovering the elaboration of a mathematical object that was well-defined
00:13:36
Speaker
when we asked the question, and we just weren't smart enough to find the answer, then it has very little ability to change the world in a direction that it wants, even if it wants to do something strange. Are we now at the point where we can talk about the gatekeeper?
00:13:52
Speaker
We are now up to the gatekeeper. So the next layer is: what is this solution that we can formally verify, inside the box, will be correct and safe when it comes out? My null hypothesis for what that solution would look like, although we may also just allow the superintelligence to figure out alternative architectures, is that it consists of a black-box neural network, over which the safety proof would be universally quantified, and a backup controller, which is a much smaller neural network that is only responsible for maintaining the safety and stability invariants.
00:14:36
Speaker
And then a gatekeeper. The gatekeeper is doing runtime verification, which is now the third layer of formal verification, at runtime, after you've extracted your special-purpose AI from the box. The special-purpose AI is actually verifying its own actions. And the verification question there, over the next time step,
00:15:03
Speaker
is: is the probability of transitioning to a non-recoverable state low enough? And if it's not low enough, then you switch to the backup controller, and the backup controller will always, with very high probability, be able to maintain the safety property from a recoverable state. So if an unsafe action was ever proposed that would take you outside the recoverable set, then because at the previous time step you had proved that you would still be in the recoverable set, it's not too late at that point to switch over to the backup controller.
00:15:38
Speaker
So you can make an argument that's basically a mathematical induction argument: you assume that in the initial state, you're safe, and then you can show that if you're safe at state n, then at state n plus one, you'll be safe with high probability. And you can make this induction argument statically before deployment,
00:15:58
Speaker
before deployment of the special-purpose agent. But the induction argument fundamentally relies on doing a runtime check, which you couldn't really factor out, because the state space is exponentially large in the dimensionality. So you couldn't have just said upfront, no matter what happens with the advanced controller, we'll always be safe, and now we don't need to check anymore. But you can verify upfront that the runtime verification will keep you in the recoverable set.
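
As a rough sketch of the runtime loop being described (the names, the certified bound, and the risk budget are placeholders assumed for illustration, not the program's actual design):

```python
# Hedged sketch of the gatekeeper's runtime verification step. At each time step
# the proposed action runs only if a certified bound says the probability of
# leaving the recoverable set is small; otherwise the backup controller takes over.

EPSILON = 1e-6  # per-step risk budget (placeholder value)

def gatekeeper_step(state, advanced_controller, backup_controller, certifier):
    action = advanced_controller.propose(state)
    # Runtime check: certified upper bound on P(next state is non-recoverable).
    risk_bound = certifier.prob_of_leaving_recoverable_set(state, action)
    if risk_bound <= EPSILON:
        return action
    # Fallback preserves the induction invariant: the previous step's check kept
    # us inside the recoverable set, from which the backup controller is verified
    # to maintain safety with high probability.
    return backup_controller.propose(state)
```
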
00:16:28
Speaker
One question that arises here is how far we are from implementing this system, or the stack you just described. How far is this from reality, and how much of this is theoretical versus implementable right now?
00:16:43
Speaker
This is mostly theoretical. Right now in the ARIA program on Safeguarded AI, the set of people working on it is about 57 mathematicians in the theory technical area, technical area 1.1. So it's mostly theory. And what we need to develop is a language for representing real-world problems as formal mathematical problems, one that is sufficiently expressive to cover all of the science and engineering domains that we think it's possible to cover and that are important either economically or for resilience.
00:17:19
Speaker
So it's a little bit like a probabilistic programming language, but one that includes continuous-time systems and partial differential equations and other modeling frameworks that are not easily expressible as probabilistic programs with finitely many variables, and that remains declarative, so that we can do formal reasoning about it, or so that we can expect that a superintelligent system would be able to do formal reasoning about it and prove to a human-engineered checker that its formal reasoning was correct.
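
To give a flavor of what a declarative, formally analyzable model of a real-world problem might look like, here is a toy sketch in plain Python. This is my own illustration, not the modeling language being developed in the program; the battery dynamics, bounds, and predicate are invented.

```python
# Toy declarative model: continuous-time dynamics plus a bounded (imprecise)
# disturbance plus a safety predicate, stated as data rather than as a procedure.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class WorldModel:
    dynamics: Callable[[float, float, float], float]  # dx/dt = f(x, u, w)
    disturbance_bound: float                           # |w| <= bound, no prior assumed
    safe: Callable[[float], bool]                      # safety predicate on the state

grid_storage = WorldModel(
    dynamics=lambda x, u, w: u - 0.05 * x + w,  # toy charge/discharge dynamics
    disturbance_bound=0.1,                      # forecast-error bound on demand
    safe=lambda x: 0.1 <= x <= 0.9,             # keep state of charge in [10%, 90%]
)
```
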
00:17:48
Speaker
So if we're successful in the program, we're going to have a demonstration, probably sometime around 2028, of being able to do this in a few different fields. What I'm aiming for is economic value on the scale of a billion pounds a year from deploying the system to each of these example problems.
00:18:13
Speaker
And what is an example of success here? If you're able to create this kind of verifiable language for a domain in which we are not able to do formal verification yet, what might this enable? Could you give an example there?
00:18:29
Speaker
Yeah, so it could be much better, more efficient balancing of the electrical grid. Right now, the balancing costs are on the order of 3 billion pounds a year, which has gone way up; it was more like half a billion pounds a year 10 years ago. It's gone way up because it's much harder to balance a grid that is much more renewable-heavy. But as energy storage systems come online, it should be possible to use AI to intelligently move the energy around in time.
00:18:59
Speaker
But right now, AI systems that are not safeguarded are, reasonably, not trusted with these kinds of decisions. One worry I have here is that I expect a very small number of people will be able to understand the proofs that are given by these various layers of the stack. And so in some sense, the rest of humanity will have to rely on trust in those experts.
00:19:29
Speaker
Is there any way to explain these proofs without degrading the property that they are formally verified? First, just to separate the proofs from the specs. The proofs, if they are mechanically checked, and if they are in fact either happening in real time and hugely high-dimensional, or inside the box where we don't want to leak them because they might have steganography, then no one's going to be reading the proofs. We're going to be relying on machine-checking the proofs. But what we will be relying on humans to audit is that the proofs are of the right theorem, so that the theorem actually covers the safety properties we care about in real life.
00:20:18
Speaker
Is that an easier problem? It sounds to me intuitively like an easier problem. Still a hard problem, but is it an easier problem? I think it is a much easier problem. In general, in engineering it is much easier to define what you want than to actually find an implementation and show that the implementation is correct.
00:20:40
Speaker
Yeah, it's not trivial. I think we're going to want to engage assistant AI systems, at the present-day level or a non-catastrophic-risk level, to help with formalizing those specifications. But again, we can't rely on those systems to be perfectly reliable all the time.
00:21:01
Speaker
So we need human audits. And so I'm imagining having large teams of humans who sign off on little pieces, kind of like the way that a company like Google or Amazon does code review at a really large scale: every piece of code that goes into their enormous code base is signed off by three different humans. So a little bit like that, building up a huge model. And then, if it is architected in a good way, in a way that is scalable, that will involve modularity, and it will facilitate exposition, sort of explaining for each module, in blog-post form, here's what's going on. And you could
00:21:44
Speaker
maybe rely on AI to generate the exposition, and then again just have a human check: does this exposition make sense? Another thing I think is really useful here is that because large language models, as they become multimodal, are also models of human responses to stimuli, we can optimize for finding trajectories in the state space that satisfy the current safety specification but would be distressing to a human as modeled by an LLM, and then surface those and say, okay, humans, I notice that your specification might be missing this criterion,
00:22:31
Speaker
because here are some trajectories that satisfy the current specification that I think you wouldn't like. So we can use that as an iterative process to help refine specifications. If this kind of technological development goes as previous technological changes have gone, I would expect that at some point the technical experts will have to present something to key decision makers in government and so on.
00:22:58
Speaker
What would be the challenges in doing that? This is somewhat the same question that I asked previously, but how would you truthfully communicate the correct level of trust we could have in these systems, given the technical results?
00:23:12
Speaker
So ARIA is kind of like the old pre-1971 ARPA, before it became DARPA. We're an advanced research projects agency, the Advanced Research and Invention Agency, similar name, and we give out grants. We formulate programs of research, put a vision out there,
00:23:32
Speaker
and give out grants. To unpack it a little bit more: individuals are recruited and hired as program directors, and a program director then comes to define what we call an opportunity space, which is a zone of research and development potential that is smaller than a subfield,
00:23:54
Speaker
which is currently neglected but seems worth shooting for, ripe for progress, and would be transformative if it worked out. And then within that opportunity space, they define something even more specific, which is called a program. In my case, my opportunity space is mathematics for safe AI. That also includes things like singular learning theory and agent foundations; there are other ways of trying to use mathematics for safe AI.
00:24:22
Speaker
But within that, I've defined this program around Safeguarded AI, which is a very specific bet on a particular stack of formal verification tools. And within the program, what we do is make calls for proposals for certain thematic elements of the vision, and then people make proposals and we give out grants
00:24:44
Speaker
for people to do the research. So unlike, say, NASA, which does a lot of R&D within the organization itself, ARIA does not do that. That being said, ARIA is also not a regulatory body or a regulatory advising body, even in the way that, say, the AI Safety Institute is.
00:25:03
Speaker
So I interact with the AI Safety Institute a fair bit, but I'm sort of giving my perspective on where the puck is going if things go well in three to five years. They have a work stream at the AI Safety Institute on what are called safety cases. There's a new paper that came out recently, I think from the Centre for the Governance of AI, on safety cases for AI.
00:25:29
Speaker
Safety cases are well-established practice in high-risk industries like nuclear power and passenger aviation, where there is some set of assumptions which are not verifiable, but for which there is evidence of various kinds, like empirical testing or some a priori argument about how physical materials behave.
00:25:53
Speaker
Those assumptions are then combined by arguments that are like deductive proof steps, inference rules that are also part of the safety case. They say, okay, if I make these assumptions, then I can deduce this claim. And this claim is now, recursively, evidence along with these other claims to deduce a higher-level claim. And eventually at the top of that tree of the safety case is the claim that the risk of fatality is less than one per hundred thousand years. And that's the quantity with respect to which there is a regulatory requirement. There is a curve of fatalities versus number of times per millennium that is allowed to be subjectively
00:26:46
Speaker
deduced from assumptions that are also challenged by regulators. So that's the safety case paradigm. The AI Safety Institute is exploring that for multiple different AI safety solutions related to these different critical periods where the primary risks have a different character. And I've been starting to work with them on what a safety case for Safeguarded AI would look like. I don't have an answer on that yet, but it is something that we're thinking about.
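
To illustrate the tree shape of a safety case, here is a toy fragment as a nested data structure. The claims, numbers, and evidence are invented for illustration and are not drawn from any real regulatory regime or from the program.

```python
# Illustrative safety-case fragment: leaves are assumptions backed by evidence,
# inner nodes are claims deduced from their children, and the root is the
# quantitative claim that regulators would challenge.
safety_case = {
    "claim": "P(fatality) < 1e-5 per year",
    "argument": "combination of subsystem claims",
    "supported_by": [
        {
            "claim": "containment layer fails with probability < 1e-7 per year",
            "argument": "formal proof over the hardware and OS models",
            "supported_by": [
                {"assumption": "manufactured hardware matches the verified design",
                 "evidence": "supply-chain audit and empirical testing"},
            ],
        },
        {
            "claim": "gatekeeper keeps the agent in the recoverable set",
            "argument": "static induction argument plus per-step certified bound",
            "supported_by": [
                {"assumption": "the world model covers the context of use",
                 "evidence": "human-audited specification"},
            ],
        },
    ],
}
```
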
00:27:14
Speaker
And in a situation in which we have these more advanced quantitative safety guarantees, those numbers would be the numbers that technical experts could present to high-ranking government officials when they're making decisions about, you know, should the system be allowed to run? Should we deploy the system? Decisions like that.
00:27:37
Speaker
Yeah, that's right. We also have a sub-area of the program on socio-technical integration, which is technical area 1.4, and which is actually live now. So if you're listening to this and you're interested in this question, take a look and see if you might want to submit a proposal. One of the things we're looking at there is how we can build tools to support multi-stakeholder collective deliberation about risk thresholds, and about trading off risks against benefits when people have different opinions.
00:28:10
Speaker
It might be interesting to talk a bit about how the AI case compares to the case of aviation or nuclear power, because it seems to me that AI is more complex, and it touches on basically all aspects of society. And also, we seem to have less time, because in the aviation case we might have had, say, 100 years to get this right, but it doesn't seem to me that we have 100 years to get the AI case right.
00:28:37
Speaker
Maybe you can talk about the kinds of challenges we're facing there and what you think about our prospects of success. It is true that in nuclear and in aviation, many fatal accidents occurred before these kinds of safety techniques were figured out. And if we're talking about AI at a catastrophically capable level, we can't afford to have many fatal accidents. So there is a sense in which this is a harder problem to approach. We do have the advantage of having
00:29:15
Speaker
mature safety practices in other industries, which 100 years ago, when aviation began, there just weren't. There was civil engineering a little bit, but that was for static systems, not dynamic systems. So now we have mature safety practices for dynamic systems, but as you say, only in narrowly scoped contexts of use. And now we're asking: can we come up with safety practices that would apply to dynamic systems with open-ended, unbounded contexts of use?
00:29:47
Speaker
I do think that's very hard. And that's the reason why, in Safeguarded AI, we're focusing on a scalable workflow for generating safety guarantees in well-defined, scoped contexts of use. So we're saying: can we take those practices for safety-critical engineering systems, which only apply when you have a well-defined context of use, and make those practices really fast and easy for people to engage in, with lots of AI help,
00:30:15
Speaker
and do that across many, many contexts of use, not all of which would be considered safety-critical today, but we'll do it this way anyway to limit the amount of dangerousness that the AI could sneak through in different contexts. But what that implies is that we won't be using Safeguarded AI to create a chatbot product where you could just ask anything you want.
00:30:40
Speaker
We'll only be using the pre-catastrophically-dangerous systems as chatbots, to help us formulate the problems that are too hard for those systems to solve, so that the much more capable systems can solve them in a way that has these layers of formal verification and containment.
00:30:59
Speaker
As you described, if I understood you correctly, you're also working on extending the areas or aspects of the world over which we can do mathematical proofs, the areas of the world that are well-defined enough for us to do these proofs and this formal verification. Is that a large part of what you're doing in the program? What are the most interesting aspects, and what are the most challenging areas to formalize? Yeah, so for now it is a large part of what I'm doing in the program, just expanding the scope of what types of problems we can formalize. And towards the end of the program, there's going to be much more machine learning.
00:31:47
Speaker
We have a call for expressions of interest that is live now for technical area two, which is where the machine learning will take place, and we could get into that more later. But the theoretical aspect is less about expanding the space of what we could do proofs about and more about unifying different areas of science and engineering, where in chemistry or epidemiology you might use Petri nets,
00:32:15
Speaker
in civil engineering you might use finite element methods and partial differential equations, and in electrical engineering you might use ordinary differential equations and Fourier analysis. So we're taking all of these different modeling ontologies, of which there are not that many, somewhere between 12 and 20 distinct ontologies, and unifying them. And then the other big piece on the theoretical side is managing uncertainty that is more radical than Bayesian uncertainty.
00:32:45
Speaker
We call it imprecise probability. It has arisen in a bunch of ways. And I should say, because a lot of the audience will probably be pretty orthodox Bayesian: there have been a lot of somewhat ad hoc attempts, like fuzzy logic or Dempster-Shafer theory or prospect theory, to be more imprecise than probability, and they do all feel a bit janky.
00:33:10
Speaker
And I appreciate that. What I have seen is a handful of things that have emerged from very different directions, very different motivating considerations, that have all converged in the sense that they're mathematically equivalent, even though it feels a little bit ad hoc to say, oh, the epistemological state, instead of being a probability distribution, is now
00:33:33
Speaker
a convex, downward-closed set of sub-probability distributions. Why those particular adjectives in that particular order? But it turns out that if you look at the infra-Bayesianism theory that was motivated by agent foundations, from Vanessa Kosoy and Alex Appel,
00:33:50
Speaker
and you look at what their epistemological state is, the homogeneous ultracontribution, it's equivalent to this. If you look at Joe Halpern's decision theory from 12 years ago, which was motivated by the observation that sometimes experts disagree about the probabilities of events,
00:34:06
Speaker
and asks how we can have a decision theory that doesn't resolve those disagreements and just keeps them as unknowns, that turns out to be equivalent to this. There's Paul Christiano's definability-of-truth paper from 2013, which is motivated by reflective consistency, and that turns out to be a special case of this.
00:34:26
Speaker
There are papers in economics, there's programming language theory, and there's a pure category theory thrust where people say: these monads don't commute, but they should commute, so how can we weaken the notion of a distributive law and figure out the way in which probability and non-determinism can most closely, almost, commute? And then they found this. So I think there's this relatively canonical thing, which is the next step in humanity's understanding of decision making under uncertainty. And that's a big part of what we're doing here also. I think in general, if you have a Bayesian belief state, you have assumed too much already, even just
00:35:07
Speaker
in a prior. A prior is giving you precise prior probabilities for every possible hypothesis. And people always say, oh, where do you get your prior from? Don't ask; never ask a Bayesian where his prior comes from, never ask a woman her age, that kind of thing. I think that criticism is actually right, and it has just taken mathematics a while to discover the next step. But that's a big part of what we're trying to push forward.
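
As a rough mathematical gloss on the object being described, in my own notation rather than the program's: the belief state is a set of candidate sub-probability distributions rather than a single prior, and an action is evaluated by its worst-case expectation over that set.

```latex
% Belief state: a convex, downward-closed set of sub-probability distributions
\mathcal{B} \;\subseteq\; \Big\{ \mu : \Omega \to [0,1] \;\Big|\; \sum_{\omega \in \Omega} \mu(\omega) \le 1 \Big\},
\quad \mathcal{B} \text{ convex and downward closed.}

% Decision rule: evaluate an action a by its worst-case expected utility over the set
V(a) \;=\; \inf_{\mu \in \mathcal{B}} \; \mathbb{E}_{\mu}\big[\, U(a, \omega) \,\big].
```
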
00:35:36
Speaker
Sounds very interesting. It also sounds like something that happens sometimes in science, where you discover different approaches converging to the same solution. But then oftentimes, or sometimes, the new paradigm is not yet fully developed and not really usable yet.
00:35:56
Speaker
And so we might have to be patient for a while longer, or patient for some part of the domain we're working in. Do you agree, or do you think we are further ahead, so that we might get a usable new decision theory without having to wait?
00:36:11
Speaker
So I think the theory is very close. I think we might have the theory worked out in a year, and then what I'm hoping for is to have an implementation worked out in two more years. So I think it's not far. But I would also admit that we don't have a usable implementation yet.
00:36:33
Speaker
Would this approach be critical for the whole Safeguarded AI program to work, that you succeed in developing a new or more advanced form of decision theory?
00:36:44
Speaker
I would say it would still be useful to do the rest of the program and require that the safety specification make a Bayesian prior assumption; that's still better than the status quo in terms of AI safety for this critical period. But I would feel that most of the safety benefits would be compromised by needing to make a Bayesian assumption, because there will always be a gap between the distribution that you assume and reality. And if you can't cover that gap by being agnostic and making use of Knightian uncertainty, then the system in the box would be able to exploit that gap, and that could be a big problem.
00:37:35
Speaker
If you are one of the big AGI corporations, like OpenAI, DeepMind, or Anthropic, and you're thinking about whether to try to implement a stack like Safeguarded AI, what are the costs associated with that? Will it, for example, take them longer to train their models? Will it be more compute-intensive when running the models, and so on?
00:38:00
Speaker
So I think probably the biggest cost is that it's just a very different paradigm, not a different paradigm for machine learning, it's not an alternative to transformers or something, but a different paradigm for the business model.
00:38:17
Speaker
It's not that you just have one product that you serve on the web. It really needs much more user interface tooling, and even institutional design, and many different formal verification systems that are stacked on top of each other. So it's a very different way of creating a product around an AI system.
00:38:40
Speaker
In terms of the AI system itself, my guess is that the pre-training will not need to be that different. The post-training will be different, but probably more cost-effective. If I'm right and this paradigm works, you will probably actually get more capabilities inside the box by interacting with a formal verifier that is coupled to real-world problems than by interacting with another large language model that's trying to judge whether you follow the constitution. And can you say why that is? Because that sounds like a dream scenario in some sense, if you can couple safety to capabilities and align its incentives such that
00:39:26
Speaker
You could also get the capabilities without the safety. So it's not as good as it sounds. And that's part of why we're looking for an organization that has unprecedented governance robustness to develop the machine learning part of the Safeguarded AI program.
00:39:42
Speaker
But the reason why I think there could be strong capabilities here relates to the progress that has been made over the course of 2024, as the non-synthetic data used in pre-training has reached its limit, reached peak data, peak tokens. What we've seen is a lot more progress in math and coding, where tests can be done in quote-unquote simulation, in cyberspace, because those sources of feedback scale with compute, whereas human feedback, feedback on creative writing or on philosophy, does not scale with compute. And in fact, right now, feedback on electrical engineering or mechanical engineering does not scale with compute, but with the Safeguarded AI tools, it would.
00:40:37
Speaker
So in some sense, what I'm hearing is that right now mechanical engineering, for example, would be closer to something like creative writing or philosophy, but what we're trying to do is make it closer to something like math, basically.
00:40:53
Speaker
That's right. Yeah. All right. So you mentioned potentially partnering with an organization that could try to implement the Safeguarded AI stack. What would that look like? What role would you be playing, what role would ARIA be playing, and how would such a partnership work?
00:41:12
Speaker
Yeah, great question. So as always, ARIA is a grant-making organization. What we are proposing to do is to make a grant of 18 million pounds, which is the amount that's earmarked for this technical area, as a single grant to a single organization. The process is, roughly: right now we're looking for expressions of interest, which is basically just your name and a few hundred words at most about why you're interested. And then at some point, when we are confident that there is enough interest, we'll move into phase one, a formal call for proposals, where the proposals are brief, and if successful, we would fund three or four of them to go and spend a few months
00:41:58
Speaker
developing a full proposal. Because actually designing the governance mechanisms is going to take time, and we don't want to just make one bet on that, but if possible make multiple bets and have multiple parallel efforts to try to figure out what this should look like. And we are not really prescribing the answers. We're going to pose the questions, questions like:
00:42:21
Speaker
What is your legal structure? What are your bylaws? Why would we expect that the incentives of the people who are part of the decision-making structure for safety-critical decisions would be aligned with full consideration of both the positive externalities and the negative externalities for society at large? And then we're looking for arguments for why some particular concrete legal structure and decision-making structure would satisfy that desideratum. And then, if we get good answers to these questions, what we would offer is a non-dilutive research grant of 18 million pounds to the newly created organization. It could be an organization that's created as an affiliate of an existing organization, but we don't expect that any existing organization, with its existing
00:43:13
Speaker
legal and decision-making structure, that is located in the UK would meet the bar that we're setting here. So it would probably be a new entity in some form. Then we would grant that new entity 18 million pounds, and the contract would be: this is for doing the machine learning research that's part of the Safeguarded AI program.
00:43:36
Speaker
That organization could then also pursue other agendas at the same time, and could raise funding from other sources. There's nothing exclusive about this. I wouldn't even really consider it a partnership; it's just funding. We just want to support something that has a more robustly beneficial shape to do the development of capabilities that is necessary for Safeguarded AI to work.
00:44:02
Speaker
Are you optimistic that we, or someone, can design an institution, a kind of legal framework, that would satisfy what you're after here? I mean, the legal system is not close to being formally defined. So in some sense, the implementation would be hinging on something that is fuzzier than the technical aspects of your program. How optimistic are you here?
00:44:32
Speaker
I'm moderately optimistic. In fairness, there has been a lot of progress in the last 10 years on improvements to the organizational governance structures of frontier AI labs. And so I think there's just room for more improvement, room for more progress on this.
00:44:54
Speaker
Do you think that implementing Safeguarded AI would prevent AI systems from acting autonomously, or would limit them in certain ways that would make them less economically viable?
00:45:07
Speaker
So the main applications that we're considering are autonomous agents, but in very specific domains. An autonomous agent that makes monoclonal antibodies doesn't do anything else. So I wouldn't say the distinction is about autonomy; I would say the distinction is about generality, and perhaps about the ease of integration. We're not going to be able to make drop-in remote workers with Safeguarded AI, so that is a limitation. I would say there's a minority of economically valuable work that can be defined, or that would be worth defining,
00:45:46
Speaker
in enough detail to solve those problems with Safeguarded AI. So I think the case that needs to be made is not, shut down the chatbots and pivot entirely to Safeguarded AI only; it's more that above some capability levels, it just doesn't make sense to deploy as a chatbot. And right now, in the responsible scaling policies, the frontier safety frameworks, and so on, the frontier AI safety commitments, I guess, is the general term now being used for what companies have made as part of the AI safety summit series, most of them say: at this critical level where it's capable of autonomous cyberattack and autonomous AI R&D and persuasion, we don't know what mitigations we would apply, and so if our evals show a high level of risk on these, we would stop, I guess, is basically what they say. Now, you might be skeptical that they would in fact actually stop at that point, but what I'm offering is an economic improvement over what they have claimed they would do,
00:46:54
Speaker
which is: instead of just stopping, you say, okay, we'll put it in a very strong box, and we'll only use it for formally verifiable problems, to generate autonomous agents in specific domains where we can get quantitative safety guarantees. Yeah, and I think it will be reasonable to expect that demand for a product or vision like Safeguarded AI would increase in a scenario where decision makers have to either stop completely or gain some of the benefits by proceeding more safely. And so it seems to me that you've thought right from the start about the game theory of this whole situation, which is often a missing piece. I think it's interesting that you might have something that
00:47:42
Speaker
works in theory, and could be extraordinarily beautiful and so on in theory, but if it's not aligned with the incentives of key decision makers in the real world, it might not be that useful. So I think that's something that should be applauded. How do you expect this to develop? Which events or situations would change the political mood or zeitgeist such that there would be more interest in something like Safeguarded AI?
00:48:12
Speaker
Yeah, good question. Roughly, I think the perception of the perception of the potential for catastrophic risks needs to be higher. That was somewhat meta. Why the perception of the perception? So in the game-theoretic equilibrium analysis, it's almost an assumption of classical game theory that the stakes are common knowledge. If the stakes are not common knowledge, then it's very hard to be confident that your opponent is aware that they're playing the same game that you are. And so in the AI case, to make it explicit, if
00:48:55
Speaker
you are not confident that the opponent is aware of the risk, then you would assume that they're going to surge ahead and take the advantage, thinking that it's a prisoner's dilemma when it isn't. And if you're not confident that your opponent is confident that you know what the stakes are, then you will think that they will try to preempt you, because they don't think that you understand that it's not a prisoner's dilemma, and so on.
00:49:20
Speaker
I mean, there's only three levels of meta, as Yudkowsky's law says; it doesn't go infinitely deep, really. But I think you do need to have that kind of real-world common knowledge of, okay, we've all declared that we take this seriously. I think that common knowledge exists between the leaders of the frontier labs. It doesn't exist between all of the key decision makers at all of the frontier labs, and it certainly doesn't exist between all of the key decision makers in all of the governments. So I think that's a big thing. It could also be that there's just new scientific evidence, and this would be the ideal case: new scientific evidence like what people talk about as a model organism of misalignment in an autonomous way. So some kind of demonstration, at a scale that is not itself dangerous, of the shape of the danger. And that could be very compelling.
00:50:15
Speaker
I agree that it might be interesting, but it has connotations for me of gain-of-function research, and it would be extraordinarily ironic if we ended up in a situation in which humanity harmed itself by demonstrating a danger that we want to prevent. Do you think this can be done safely, I guess, is the interesting question.
00:50:40
Speaker
Yeah. So I think that the formally verified containment system is also helpful for this. I think there is something a little bit nerve-wracking about the way that people are doing cyberattack evals right now, in a Docker container, in a VM, in a virtual network, in the cloud. And, you know, the way that, I think it was o1-preview, escaped one of those layers of containment completely unexpectedly, and everyone said, oh, that's interesting, good thing we had two more layers of containment. So that does worry me a little bit. It would be nice if we had one of those layers of containment being seL4, or something formally verified, which does exist. seL4 has been free of exploitable bugs for 10 years.
00:51:27
Speaker
So it's a pretty concrete direction to say, hey, we should have NVIDIA drivers for seL4 so that you can actually run AI systems on seL4 clusters. That's something I'm talking to some people about. But I would say
00:51:42
Speaker
that is not part of my program; it's not in scope of what I'm doing, so it's still a neglected direction. But yeah, I think there are ways of running tests safely. I don't think we're quite there yet, but I think the toolkit exists. It's sort of like the way that cybersecurity researchers can safely handle state-level malware and do analysis of it; it just requires a very high skill level at being careful,
00:52:07
Speaker
which I think is not yet being applied to AI risk evaluations, but probably will be before it's too late. And on the margin, doing more work to make sure of that would be good.
00:52:22
Speaker
And the idea here would be to present key decision makers with results where a system has shown that it can do something that's quite dangerous, but where we've contained that system so that it didn't actually cause harm. That's right.
00:52:39
Speaker
Makes sense. Okay, maybe here would be a good point to talk about flexible hardware-enabled guarantees. Let's start from the beginning. What do you mean by flexible hardware-enabled guarantees?
00:52:53
Speaker
This is a form of hardware-enabled mechanism for compute governance. The purpose of it is really to resolve the assurance dilemma: how can you be sure that a competitor you don't trust is in compliance with an agreement, any agreement that restricts AI development pathways at all? It could be, we're going to set some maximum, no training runs larger than 10 to the n FLOPs. But it could also be something much more subtle, that stratifies training runs according to various levels of FLOPs and then runs capability evals that are more and more intensive
00:53:35
Speaker
at different levels, and then stratifies according to capabilities, and then you have to demonstrate mitigations, and there are stratifications of how strong the mitigations need to be, and those mitigations need to be in accordance with risk thresholds, so that there's something like a safety case argument that, at the top level, there's some very, very low probability of anything that might cause a thousand fatalities, or whatever.
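
A purely hypothetical sketch of what such a stratified policy could look like as data; the thresholds, eval names, and mitigation names are invented for illustration.

```python
# Illustrative only: a stratified compute-governance policy of the kind described,
# mapping training-compute tiers to required evals and mitigations.
POLICY_TIERS = [
    {"max_training_flop": 1e24, "required_evals": [],
     "required_mitigations": []},
    {"max_training_flop": 1e26, "required_evals": ["cyber", "bio"],
     "required_mitigations": ["monitoring"]},
    {"max_training_flop": None,  # no ceiling: most demanding tier
     "required_evals": ["cyber", "bio", "autonomy", "persuasion"],
     "required_mitigations": ["containment", "reviewed_safety_case"]},
]

def requirements_for(training_flop: float) -> dict:
    """Return the first tier whose ceiling covers the proposed training run."""
    for tier in POLICY_TIERS:
        ceiling = tier["max_training_flop"]
        if ceiling is None or training_flop <= ceiling:
            return tier
    raise AssertionError("unreachable: last tier has no ceiling")
```
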
00:53:59
Speaker
The reason I'm going for flexible hardware-enabled governance is that we don't know yet, and societies and governments are not yet in a position to make long-term decisions about, what the shape of those agreements should be.
00:54:17
Speaker
So that needs to come later. When people say we don't know enough about the science yet to decide what the rules of the road should be, there's some truth to that; we don't know enough to decide the details. And so what I'm aiming for here is to develop hardware. Because hardware has such a long lead time, I think it's quite urgent now, even though it is not time to have any kind of international agreement about AI. It's time now to develop the hardware platform that would suffice to guarantee assurance for whatever that agreement turns out to be, years later, after the hardware is built.
00:55:01
Speaker
That seems like a very difficult problem to handle when you don't have the specifications of what it is you want to guarantee. How are you measuring progress here? The lovely thing is that what we're trying to guarantee here is about compute, and compute is universal. So one way of answering the question, and it's more complicated than this, is just: well, we have a coprocessor which runs the policy and decides whether to permit proceeding with running the code.
00:55:35
Speaker
And then that coprocessor can be updated according to a multilateral decision-making process. So that's basically the answer. And it's quite different, in both positive and negative ways, from previous assurance dilemmas that have been resolved, around chemical weapons and nuclear material. It's less favorable than those in that computations do not leave a distinct physical trace, a signature of one computation versus another. Of course they consume a certain amount of energy, but lots of things consume lots of energy; that's not a very distinctive trace that you can really track, and it certainly doesn't help you distinguish between two different computations running on the same hardware, training versus inference or something. But on the other side of the coin, and I think much more significantly, the favorable thing
00:56:28
Speaker
about trying to govern compute is that the compute itself can be programmed to do the governing. Not with current hardware, it does require some tweaks, but there would be no conceivable way to take a kilogram of uranium and configure it such that it refused to be incorporated into a nuclear weapon and would only be used peacefully. But if you have a computing system that has hardware for detecting tampering and disabling itself
00:56:57
Speaker
in case it's being tampered with, and for analyzing what it's being used for, doing cryptographic verification, and coordinating with the other parts of the system to verify that the whole cluster is being used in accordance with the rules,
00:57:12
Speaker
then you can do a lot more than the traditional approach that relies on physical spot checks, supply chain controls, and tracking where everything is going geographically. In fact, I think it's not necessary to track where everything goes geographically or to have surveillance and monitoring of what all the computers are being used for. Hardware governance systems that rely on that, that are basically hardware surveillance systems which can then be used to do verification in a centralized way, have serious downsides. Individuals, particularly in the US, would object on the basis of individual liberty. On the international scale, countries would object on the basis of sovereignty. So I'm trying to provide a technical option for compute governance that is actually completely privacy-preserving
00:58:03
Speaker
and even to some extent liberty-preserving, in that there should be certain kinds of computations that you can do regardless of what the multilateral governance body decides. Some kind of safe harbor: these computations are small enough that there's no way they pose a catastrophic AI risk. So there should be no way of remotely disabling devices entirely or changing the rules to be arbitrarily restrictive. But there should be ways of changing the rules to some extent for large computations, and adapting as we learn more about what types of risks emerge at what scale, how to evaluate them, and how to mitigate them.
00:58:42
Speaker
Can you say more about how your system would preserve privacy and liberty? And especially, you mentioned that it's not necessary to track where all the advanced chips are, for example, or whether they're being used in large training runs. Maybe say more about that.
00:59:02
Speaker
So basically, the principle is that the entire loop of, one, figure out what's going on, two, decide whether it is in compliance with some set of rules, and three, if it's not, do something about it to stop it from happening, that entire loop can happen within the device, in a way that is resistant both to tampering by the physical owner and to exploitation by the decision-making process that adjusts things over time. I think that's basically the answer. The information used in the process of deciding whether something is safe or in compliance with safety principles doesn't need to leave the device, if the device itself can be trusted both by its owner and by the world at large to fulfill its intended function.
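As a minimal sketch, with every name and check below a placeholder rather than anything from the actual FlexHEG designs, that on-device sense-decide-act loop might be shaped roughly like this:

```python
import time

RULES = {"max_train_flops": 1e25}  # hypothetical installed rules

def observe_workload() -> dict:
    """Placeholder: gather local telemetry, e.g. accumulated training flops."""
    return {"train_flops": 3.2e24, "tamper_signal_ok": True}

def complies(telemetry: dict, rules: dict) -> bool:
    """Placeholder compliance check against the currently installed rules."""
    return telemetry["tamper_signal_ok"] and telemetry["train_flops"] <= rules["max_train_flops"]

def respond() -> None:
    """Placeholder enforcement action; nothing is reported off the device."""
    print("halting workload until it complies with the installed rules")

def enforcement_loop(rules: dict, cycles: int = 3) -> None:
    # The whole sense/decide/act cycle stays on the device; no telemetry leaves it.
    for _ in range(cycles):
        telemetry = observe_workload()
        if not complies(telemetry, rules):
            respond()
        time.sleep(1)

enforcement_loop(RULES)
```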
01:00:00
Speaker
What is something that we could prevent using flexible hardware-enabled governance? Could we prevent specific capabilities within systems, for example?
01:00:11
Speaker
It's possible to build evals into a policy and say, if you're above a certain threshold, then for every 10 to the 20 flops of training that you do, you have to run this suite of evals. And if the suite of evals comes up red,
01:00:31
Speaker
then you have to pause. There are various ways of handling cases where the evals are concerning or where the policy says this is high risk. It could be that you just stop and say, we're now going to
01:00:49
Speaker
delete these weights and start over. That would be the most extreme. The least extreme would be to say, okay, this is fine, but we're going to encrypt the weights such that they can only be run on other flexible hardware-enabled guarantee hardware, and they have to run with a speed limit, so you can only inference them at 70 tokens per second or whatever. That would be the least extreme kind of restriction, and there is a whole spectrum of things in between. And really, the hope is that there are lots of things I wouldn't even have thought of that you could implement, because we're just implementing a general-purpose computing system for determining what the rules are.
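A toy sketch of that kind of eval-triggered, graduated response, with entirely made-up eval outcomes, interval, and response names, could look like the following:

```python
from enum import Enum, auto

class Response(Enum):
    PROCEED = auto()
    ENCRYPT_AND_SPEED_LIMIT = auto()  # weights only run on FlexHEG chips, capped tokens/sec
    PAUSE_TRAINING = auto()
    DELETE_WEIGHTS = auto()           # the most extreme option

EVAL_INTERVAL_FLOPS = 1e20  # re-run the eval suite every 1e20 training flops

def policy(flops_since_last_eval: float, eval_result: str) -> Response:
    """Illustrative mapping from eval outcomes to enforcement responses."""
    if flops_since_last_eval < EVAL_INTERVAL_FLOPS:
        return Response.PROCEED
    if eval_result == "green":
        return Response.PROCEED
    if eval_result == "yellow":
        return Response.ENCRYPT_AND_SPEED_LIMIT
    if eval_result == "red":
        return Response.PAUSE_TRAINING
    return Response.DELETE_WEIGHTS  # e.g. a "critical" result

print(policy(2e20, "yellow"))
```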
01:01:33
Speaker
What do you think of the different proxies we have for determining whether a system is behaving in dangerous ways? You mentioned one of them being evals, or evaluations, where we test for specific capabilities or events that happen within the system. Those are, of course, not perfect: an evaluation suite only captures a proxy of what you're attempting to capture. The same goes for something like total compute used in a training run; it's also a proxy for what you're actually aiming at, which is, I would say, a harm or danger within a system. To what extent is our ability to measure what we're trying to prevent the limit here?
01:02:17
Speaker
So a couple of things on that. One is that I think it's important to distinguish between capability evaluations and what some people call propensity evaluations, or what you might call mitigation evaluations or safeguard evaluations, where you've done some post-training to the system to try to stop it from actually exercising its dangerous capabilities in practice, and then you're trying to determine whether there is still a way of eliciting the latent capability. I think those are much trickier. Neither is perfect. But if you've only been doing pre-training, only autoregressive training, and you want to determine whether a capability is present, that is much easier than determining whether there's a way of eliciting a capability after you've done post-training and made it such that all of the simple ways of eliciting the capability don't work because the model says, sorry, I can't help you with that.
01:03:12
Speaker
In that case, discovering the jailbreaks is a bit trickier. I do think there's quite a bit of hope in automated red teaming, where you have some other, very clever AI system that maybe even uses non-black-box, gradient-based search for a prefix that causes the system to still exercise its dangerous capabilities. But certainly the standard conception of evals as
01:03:40
Speaker
a questionnaire, where you just feed these prompts in and check whether the answers are concerning, does not work for mitigation evaluations; it's very easy to game. But for capability evaluations, if your training follows the pattern of an autoregressive process whose objective function is basically cross-entropy, then I think it's not that easy to game the capability evaluations. So that's an important distinction. I would also say there's an important point about soundness and adjusting the trade-off between false positives and false negatives, where compute thresholds are very much an imperfect proxy for capabilities. But even if we can't reliably predict
01:04:25
Speaker
that a certain capability arises at a certain compute scale, we can be fairly confident that catastrophic capabilities do not emerge at 10 to the 20 flops of autoregressive compute. So we can adjust the threshold such that we get lots of false positives, lots of systems that are over the limit of concern but are actually not concerning, and very few false negatives. And then I think the principle should be that we just do lots of layers of that.
01:04:54
Speaker
So if you're over a compute threshold, that just means you now need to do capability evaluations. If you're over a capability threshold, that just means you now need to do automated red teaming. And if the automated red teaming finds something, then you go back to the drawing board and figure out a different mitigation approach, which could be, for example, Safeguarded AI in the extreme,
01:05:16
Speaker
as a way of completely containing the capability and making it only do things that you ask for. Yeah. What's the motivation behind thinking about compute governance like this? You've mentioned to me before that you're interested in buying time to do the research necessary to make systems safe. Is this approach to hardware governance about buying more time to do safety research?
01:05:40
Speaker
I don't usually think in terms of buying time; I think in terms of game-theoretic stability, stabilizing a Pareto-optimal Nash equilibrium where everyone follows a safer strategy. And the safer strategy is not necessarily something like: we will stop for n years and hope that by the end of n years there will be more progress on safety. I don't think that's a very sophisticated strategy. I think the strategy should be more along the lines of: we will not deploy systems without a commensurate level of safety for the next n years, and after n years we will reevaluate what our balance is. So it's really about stabilizing a certain strategic orientation for a certain period of time, rather than about buying time and pausing something until some future date.
01:06:39
Speaker
Does this approach require buy-in from, say, NVIDIA, TSMC, ASML, and so on, the major players in the chip supply chain? So, a flexible hardware-enabled guarantee is useful for game-theoretic stability only insofar as one can reasonably assume that most of the compute available in the world for doing frontier training has this guarantee built in. It is a fortunate feature of the current landscape that there are big bottlenecks in the production of frontier AI compute, where it's possible to
01:07:27
Speaker
get some assurance that all of the frontier compute is being made with the guarantee in place. So I think there are a bunch of places one could intervene to get that kind of guarantee. In the same way that some people think of an AI agreement as requiring that every party to the agreement knows where all of the data centers in the world are and is able to send inspection teams to go and check out those data centers, I talk a lot about privacy preservation and removing the necessity of knowing where all of the compute in the world is.
01:08:05
Speaker
But you do still need to know where all of the frontier compute fabs in the world are, and to be able to inspect them and see that they're not making compute that lacks the hardware-enabled guarantee. Is there anything we haven't covered on flexible hardware-enabled guarantees that we should cover here?
01:08:25
Speaker
Maybe I'll say a little bit more about the subsystems of flexible hardware-enabled guarantees that are necessary to make it work. We talked about the secure coprocessor, which actually assesses whether code satisfies the current set of rules. There also needs to be a process for updating the set of rules, which I think of as pretty closely analogous to a smart contract, where there's a set of stakeholders who need to reach a quorum in order to update the current set of rules.
01:08:55
Speaker
The current set of rules can also change over time, so it can depend on wall-clock time. And there can also be restrictions on what the next set of rules can be. As a smart contract does, it can say: if you propose a new version of the rules, here are the meta-rules for what that version has to look like in order to be a valid update.
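As an illustrative sketch only, with an invented quorum size, safe-harbor floor, and signature check standing in for whatever the real mechanism would be, such an update process might be structured like this:

```python
from dataclasses import dataclass

SAFE_HARBOR_FLOPS = 1e18   # hypothetical floor below which computation is always permitted
QUORUM = 5                 # hypothetical number of stakeholder approvals required

@dataclass
class Rules:
    max_train_flops: float
    min_always_allowed_flops: float  # safe harbor that no update may remove
    effective_from: float            # wall-clock time at which the rules take effect

def signatures_valid(approvals: list[str]) -> bool:
    """Placeholder for cryptographic verification of stakeholder signatures."""
    return len(set(approvals)) >= QUORUM

def meta_rules_ok(proposal: Rules) -> bool:
    # Meta-rules: the safe harbor can never shrink, and devices can't be bricked outright.
    return (proposal.min_always_allowed_flops >= SAFE_HARBOR_FLOPS
            and proposal.max_train_flops > 0)

def apply_update(current: Rules, proposal: Rules, approvals: list[str]) -> Rules:
    if signatures_valid(approvals) and meta_rules_ok(proposal):
        return proposal
    return current  # invalid proposals leave the installed rules unchanged
```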
01:09:16
Speaker
So that's another piece. Then there is a piece around physical tamper detection, which I think was maybe given up on some 15 or 20 years ago by the hardware industry. There was interest in tamper detection in the 90s and early 2000s, and eventually everyone said, well, there's just always some way around it. There is a sense in which it's quite different from pure cybersecurity and formal verification: I think it would be very difficult to be 100% confident that there are no exploitable bugs
01:09:46
Speaker
in physical tamper detection. But I also think there are some technologies now that make it much more favorable than it was the last time it was taken seriously. In particular, because 6G is on the horizon, we now have cellular radios that are much more sensitive to millimeter-scale perturbations of metal. So you can sense from the inside, almost like a radar, whether there is any penetration of the metal case of a server. There are also, just because of smartphones, much cheaper camera sensors. We can put lots of cameras on the inside and have little AI chips looking for visual anomalies, plus thermal sensors, acceleration sensors, and
01:10:33
Speaker
so on. There are just a lot of things that can be done, and we can enumerate all of the possible physical attacks one could use to try to get inside a box and mess with it, and make sure we can detect all of those before they can disable the tamper detection mechanism itself.
01:10:49
Speaker
And then there's also a tamper response mechanism, which is baked into the accelerator, the AI chip itself: millions of little nanoscale fuses, such that as soon as the signal from the tamper detection system that says everything is okay
01:11:06
Speaker
goes away, all of those little fuses get burned out by charge stored in local capacitors, and the absence of those fuses makes the chip not run. It would be possible in principle to go in and repair all of the fuses, but it would be extraordinarily costly, hundreds of millions of dollars. And we can then have a layer of physically unclonable functions that sandwiches the chip from top and bottom, such that if you did go in and repair all the fuses and tried to put everything back the way it was, it would be impossible to put it back in such a way that it could still cryptographically attest to its integrity, that it would still have the same private key. So that's another subsystem that needs a little bit of R&D. These technologies, I would say, are not exactly speculative, but they've not
01:12:01
Speaker
all been used together in the way that I'm proposing they could be. So there's a system integration aspect to this that needs to be fleshed out. But I think it's a matter of maybe 20 engineers who are very highly skilled in their respective fields working on this for 12 to 18 months. I don't think this is a research problem exactly; it's more of an engineering, system-integration problem.
01:12:26
Speaker
Do you think the way we develop advanced AI will change such that total compute is no longer the limiting factor? For example, maybe advanced systems will begin to make much more use of inference-time compute and become smarter that way. Or maybe we'll develop better algorithms such that we can get the same level of capabilities from lower-end hardware or much less compute.
01:12:52
Speaker
And the worry here, of course, is that hardware governance, or compute governance, will not be relevant in those worlds. Yeah, so a few points on that. One is that I think the inference-time compute paradigms become useful only after a very large amount of pre-training. So it is fair to say we shouldn't expect pre-training runs of 10 to the 35 flops, because a lot of those orders of magnitude of effective compute will come at inference time instead. But I don't think we should expect that there will be ways of using inference-time compute to take, say, Llama 2 and make it catastrophically dangerous. So when thinking about hardware governance, I think we need to
01:13:48
Speaker
think about not just limiting the scale of large training runs, but also, even for medium training runs, encrypting the weights and requiring that inference take place on flexible hardware-enabled guarantee chips, with inference-time governance, basically inference speed limits, and maybe even a kind of token system, in a different sense of token from an LLM token,
01:14:14
Speaker
more like a crypto token, where there's a limited number of something like taxi medallions, and you need to have one of these medallions on your system in order to run inference on a big model. And there are few enough of those medallions in the world that we're not concerned about this becoming, as some people put it, a nation of 10 billion geniuses. Maybe you can have a nation of 50,000 geniuses, and we think we can deal with that. But there's some limit on how much inference-time compute is permitted per unit of real-world time. And what about the question of algorithms? Would you expect those to improve in efficiency to the point where compute is no longer the limiting factor?
01:14:59
Speaker
My belief about this is grounded in the human brain, and in the assumption that evolution has been restricted in many ways: in the materials it can use to construct intelligence, in the reliability of spatial patterning, and in the way it distributes energy. So there are lots of ways in which the brain is not physically optimal.
01:15:29
Speaker
But algorithms, I think, are not very constrained in terms of evolution's ability to access the design space. And the brain, over the course of childhood, is doing somewhere around 10 to the 25 to 10 to the 26 flops. So my guess is that there is not that much headroom for algorithmic improvements to go more than a couple of orders of magnitude more efficient than that.
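For a rough sense of where a figure in that range can come from, one common back-of-envelope (every input below is an order-of-magnitude assumption of ours, not a figure from the conversation) multiplies synapse count, firing rate, the duration of childhood, and an operations-per-synaptic-event factor:

\[
\underbrace{10^{14}\ \text{to}\ 10^{15}}_{\text{synapses}}
\times
\underbrace{1\ \text{to}\ 10\ \mathrm{Hz}}_{\text{mean firing rate}}
\times
\underbrace{3\times 10^{8}\ \mathrm{s}}_{\text{childhood}}
\times
\underbrace{1\ \text{to}\ 10}_{\text{flops per event}}
\;\approx\; 3\times 10^{22}\ \text{to}\ 3\times 10^{25}\ \text{flops},
\]

which lands within an order of magnitude or so of the figure quoted, with the answer dominated by how many operations one charges per synaptic event.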
01:15:59
Speaker
So that's a reason that I tend to be more optimistic about compute governance than others. Algorithmic progress definitely happens and is still happening, but I think that vector of progress will probably saturate
01:16:18
Speaker
pretty soon, whereas the vector of scale, scaling pre-training, and, above some threshold of pre-training where it becomes viable, scaling at inference, will become the main factor of improved capabilities instead.
01:16:33
Speaker
Is it plausible that an evolved system would be near optimal? I normally think of engineered systems as almost always better, but do we have another example where an evolved system is close to optimal? Maybe something like energy consumption while walking, or the brain's energy consumption?
01:16:52
Speaker
Yeah, I think if you look at the efficiency of the eye at converting photons into information, that's pretty close to physically optimal. Or look at photosynthesis, or the way that birds orient to the Earth's magnetic field,
01:17:13
Speaker
which for a long time was believed to be a myth, because it seemed physically impossible to have that level of sensitivity. It turns out to be quantum sensing. So it is possible, but it is very close to optimal. Is that real? That sounds like it's straight out of some interesting conspiracy theory, if you combine birds with something quantum.
01:17:37
Speaker
Yeah. I think people get confused about this, because there are hypotheses that the brain is doing quantum computing and that's how it becomes conscious, which is obviously silly if you think about the coherence time: any kind of quantum coherence at body temperature would decay within a millisecond. So there's no way that a moment of consciousness, which is clearly at least five milliseconds, could be quantum coherent. However, sensing
01:18:09
Speaker
can totally happen within a millisecond. So I think people get confused and say there's a very clear first-principles argument why the nervous system can't be quantum. No: it can't be doing quantum computing; quantum sensing is completely different.
01:18:26
Speaker
And in fact, just for the record, I think this is true about technology as well. I think quantum sensing technology is much more fruitful and is going to be much more important than quantum computing technology, for similar reasons: you don't need to maintain coherence for such a long time. And what would be an example of quantum sensing in product form?
01:18:49
Speaker
I haven't thought too much about this, but things like quantum Hall effect sensors for magnetic fields, and maybe improved quantum efficiency for photodetectors for medical imaging or astronomy.
01:19:04
Speaker
Interesting. All right. If we think about Safeguarded AI and flexible hardware-enabled guarantees as two research programs over the next decade, what does success look like, and which challenges do you anticipate along the way?
01:19:24
Speaker
Success for the Safeguarded AI program itself is basically convincing key decision makers that Safeguarded AI is a viable strategy for extracting economic and security benefits from advanced AI without catastrophic risk.
01:19:41
Speaker
As with most ARIA programs, it's about pushing the edge of the possible, changing the conversation about what is feasible. What that hopefully manifests as is some agreements in which Safeguarded AI is among the options generally recognized as safe for systems within some band of capability levels on various dimensions. And the flexible hardware-enabled governance success case is, first, that most compute in the world 10 years from now is FlexHEG-enabled, and second, that there is a reasonable and generally recognized as legitimate governance process for refining and adjusting the rules over time.
01:20:36
Speaker
And what would be the main challenge to having most hardware in the world be FlexHEG-enabled? The main challenge is just that there are a small number of key players, where
01:20:50
Speaker
various quorums among those key players would be sufficient to make that happen, at least for the next few years. But I think it's mainly about, again, convincing people that it's possible right now. A lot of people think that FlexHEG is not possible, either because it's not possible to do tamper detection or because it's not possible to do cryptographic verification of cluster-scale properties. I'm very confident that the second one is possible and that it's just a matter of engineering.
01:21:20
Speaker
Tamper detection is more of an open question, to be honest. It could very well happen in a year or two that, after a bunch of attempts have fallen to some national-lab hackers, I become convinced that, okay, we can't make state-level tamper responsiveness. But I would say that is probably the main challenge: actually making it state-level tamper responsive and convincing people that that's the case.
01:21:49
Speaker
Do you worry about companies that may be opposed to hardware governance? Maybe an example could be Meta. I'm not saying that they actually are, but you could imagine that they would be opposed to hardware governance. Those companies could simply develop their own chips in-house, set up an alternative supply chain that's all vertically integrated, and then have a competitive advantage. Very, very expensive. I don't think it's really a viable option for anyone over the next 5 to 10 years to develop an alternative supply chain. And I think it becomes a worthwhile bargain to accept FlexHEG-style restrictions in exchange for having access to the supply chain and the technology base
01:22:41
Speaker
that exists. I think that would not necessarily be the case with some of the more heavy-handed hardware governance approaches that involve monitoring where everything is and all the computations everyone is running. And the argument I would make to Meta is to say: look, at some point your frontier models will be capable enough that it is obvious to everyone that it would be irresponsible to actually make the weights available
01:23:11
Speaker
unencrypted. But you're absolutely right that it's important for individuals to be able to run their own language models, have privacy and, to some extent, control over the system prompt, and be able to fine-tune and customize, for example to their own language. So what's the way we can have the best of both worlds? Well, it's something like this: you have a secure enclave on the processors where you can distribute a model freely, in the sense of free of charge,
01:23:43
Speaker
but in a way that's encrypted, so it can only be run on FlexHEG-enabled chips, and it runs with built-in safeguards that can't be removed, but also with complete privacy and with some level of customization. Fantastic. I think this concludes our conversation on your most recent work. Now that I have you, maybe we could move into some of the other work you've done in the past and talk about your life story, because there's a lot of interesting things going on there. You've done work on brain uploading in the past; that was an interest of yours maybe a decade ago, you mentioned. How has your perspective evolved on brain uploading, or on the possibility of that approach working at all?
01:24:33
Speaker
This was a long time ago, long before I was associated with ARIA, and it's completely separate from what I'm doing now. But at one point, maybe from 2010 to 2013 or so, it appeared to me that AI was a bit stuck. I was a little bit late to notice that deep learning was working.
01:24:56
Speaker
Actually, someone told me in 2010, which was quite early, that deep learning was very promising and probably going to work, and I dismissed it. So I made a mistake there, and it took me quite a few years to notice. It was in 2013, and I should have noticed in 2012 at least.
01:25:15
Speaker
I mean, even 2013 is quite early to notice that deep learning is taking off. 2010 would be legendarily early.
01:25:27
Speaker
Yeah, so during that time it seemed to me like there were basically no promising pathways towards de novo AGI. And yet that was the time when optogenetics was starting to take off, which is the technology of genetically engineering biological neurons
01:25:46
Speaker
to light up in proportion to their level of activity, to literally emit fluorescent light, and also to be controllable, so that they receive light and translate it into spike trains. So it seemed like there was new potential for developing machine intelligence that was not completely artificial, but rather an emulation of biological neural networks, a very, very faithful emulation of a specific biological neural network. That's the way I would define mind uploading. And I worked on this for the simplest nervous system known to science, the nematode worm C. elegans, which has a total of 302 neurons in its entire body. It's exactly 302, unless it has a mutation, so it's very stereotyped. And despite that, it's actually capable of learning a little bit.
01:26:44
Speaker
It's capable of learning to be averse to a particular scent, like methane or carbon dioxide. And it's capable of learning to be attracted to particular scents: if in its larval stage, which is like its childhood, it was detecting those scents at the same time as it was detecting food, it learns an association that the scent is indicative of food in its environment, and then in adulthood it's attracted to locomote towards the source of that scent.
01:27:14
Speaker
So I thought we could demonstrate this: train an actual worm in its larval stage to be attracted to some particular scent, then in its adult stage perform an uploading process, which would involve optogenetically stimulating all of the neurons in various random patterns and observing the effect each has on all the other neurons,
01:27:39
Speaker
then build up a model of the coupling coefficients between all of the neurons as an ordinary differential equation, run that differential equation in a simulation where it's hooked up to a simulated body using soft-matter physics, and show that in that virtual environment it exhibits the behavior of being attracted to that particular odorant
01:28:05
Speaker
when you run it with that in place. So what happened, how that went, is that the optogenetics was basically working. It wasn't completely mature; it was hard to get the indicators localized to the center of the cell,
01:28:23
Speaker
because neurons are very closely packed together in real systems; there is a very small gap, almost no gap at all, between adjacent neurons. So there is this image processing problem. It's very ironic that I was a deep learning skeptic at the time, because it might have been possible to solve this image processing problem with deep learning even then. At the time, I tried to approach it with Bayesian inference, and that did not work, or at least it couldn't have worked in real time. In order to really do what I had planned, it would need to infer in real time what the state of the system is, so that it could do automated experiment design in closed loop,
01:29:03
Speaker
because you only got about an hour of optogenetic manipulation, at least at the time. The optogenetic indicators were not super sensitive, so you had to use very strong lasers, and after about an hour of that the worm was pretty damaged by the laser. So you don't get much time to get readings of healthy behavior from the neural network. It would have needed closed-loop automated experiment design to optimize the informational efficiency of every stimulation, and the computing side of it, doing that image processing and analysis in real time, did not work.
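For a concrete sense of the modeling step described above, here is a minimal sketch of the kind of coupled-ODE model one might fit from stimulus-response data; the linear-coupling form, the rate constant, and the sigmoidal nonlinearity are illustrative choices on our part, not the specific model from that project:

```python
import numpy as np
from scipy.integrate import solve_ivp

N = 302  # neurons in the C. elegans nervous system

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(N, N))  # coupling coefficients; random stand-ins here
tau = 0.5                               # relaxation time constant in seconds, illustrative

def stimulus(t: float) -> np.ndarray:
    """Placeholder optogenetic drive: a pulse delivered to one sensory neuron."""
    drive = np.zeros(N)
    if 1.0 < t < 2.0:
        drive[0] = 1.0
    return drive

def dxdt(t: float, x: np.ndarray) -> np.ndarray:
    # Leaky dynamics with sigmoidal coupling between neurons plus external drive.
    return (-x + W @ np.tanh(x) + stimulus(t)) / tau

x0 = np.zeros(N)
sol = solve_ivp(dxdt, t_span=(0.0, 10.0), y0=x0, max_step=0.01)
print(sol.y.shape)  # activity traces for all 302 neurons over time
```

In a real pipeline, W would be inferred from the observed responses to random stimulation patterns, and the resulting activity would drive a simulated body rather than simply being printed.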
01:29:42
Speaker
But there's a funny thing: someone I was working with on it in 2011 said, you know, it's going to be too hard to interpret these blurry images that we're getting from light-sheet microscopes, but ten years from now, as camera technology advances, with spinning-disk
01:30:01
Speaker
confocal microscopy, you should be able to get clean enough images that you can use ordinary computer vision techniques to just segment out the cells. And it turns out he was exactly right, and he did it in 2022: Andy Leifer. He used spinning-disk microscopy with the more advanced cameras that existed in 2022 and did this process of cataloging and measuring the coupling coefficients between almost all the neurons. I shouldn't say all the neurons; it's actually still not quite done,
01:30:31
Speaker
but something like 250 of the 302 neurons. So the thing I had proposed to do is still a little bit incomplete. But there's started to be more interest in it just over this past year. I've heard a couple of people suggesting that, now that Leifer and others have shown the technology does basically work, it's just a matter of making another push, the way that I did way too early, to create a project whose goal is to finish it and demonstrate in simulation that it does preserve learned behavior.
01:31:05
Speaker
So I think that's on the horizon for C. elegans. That's 302 neurons. And the scalability is less than linear: the larger a system is, because it's three-dimensional, the harder it is to get precise imaging of what's going on in the interior of the nervous system. As we get up to the scale of even a mouse brain, there is, again with qualifications, almost no physically feasible way of reading out all of the neurons at once like we can now do in C. elegans. It would basically require circulating something through the vasculature. It could be stents, a fixed structure that is biocompatible and flexible, or it could be
01:31:54
Speaker
some kind of microscale robots that go by the name of neural dust. Those would then need to be powered and to communicate via ultrasound, because if they were powered by radio, that would be too much energy and it would damage the brain; but ultrasound could work. And then you could do some of these experiments. But then it would also be difficult to extract all the coupling coefficients, because there are so many synapses in a human brain, and there's only so much time you have to run an experiment before a human dies; even if everything's biocompatible, humans have a finite lifetime. So it's very challenging. My guess is that AI systems will accelerate things a lot and maybe come up with new solutions we hadn't thought of. But it seems pretty clear at this point that we're going to have de novo superintelligence before we have
01:32:49
Speaker
machine intelligence that's emulating human nervous systems. That seemed clear to me already six or seven years ago, and that's basically why I don't spend much time on this direction anymore. It's something we could explore, and should think about, once we get through this acute risk period of superintelligence.
01:33:10
Speaker
Do you think there's any hope of using brain emulation or mind uploading as a way to elicit preferences that we might not be able to express, such that those preferences can be used for training AI systems to behave in ways we would like?
01:33:28
Speaker
That's a good question. Again, I think mind uploading per se is just way too hard to be useful on that time scale.

Lo-Fi Uploading and LLMs

01:33:36
Speaker
People are talking about something called lo-fi uploading, which I think is a bit of a misnomer; I would just call it imitation. But large language models are doing reasonably accurate, I would say, if not precise, imitation of human linguistic behavior.
01:33:54
Speaker
And if you fine-tune an LLM on the writings of a particular individual, then it does become kind of precise. You can ask the LLM, what do you think about this, and get a pretty good prediction of what that person's opinion would be. But it's not perfect. And in particular, I think it is
01:34:15
Speaker
not at all a safe assumption that this type of imitation would generalize robustly to unprecedented types of situations and unprecedented questions. So I think it has limited usefulness from the perspective of extracting preferences. It's much, much more useful as a means of extracting surprise. So look not at the output that follows asking a question, but at the logits that are accumulated as the question is processed: how surprising is it
01:34:52
Speaker
that this would be a question that I am asked? That, I think, is quite reliable. So if you're asking a completely unprecedented question, I think you'll be able to tell, mechanistically, by running it through an LLM, that this is not a question humans are typically expecting.
01:35:07
Speaker
And then you can use that as a way of guiding a process of refining a specification so that it doesn't end up in situations that are really unprecedented and hard to make judgments about. That will be useful, too, because people in general do not like to be surprised, and so maybe you can gain some information about what a system should do based on how much that system surprises a human.
01:35:35
Speaker
It's not directly a value judgment. It's not that minimizing surprise is an ethical imperative that I'm asserting. Active inference would maybe say that minimizing surprise is the main thing, but I don't think that's quite right; I think that's actually a misinterpretation of the underlying mathematics. What I do think it's useful for is that surprising questions are hard to answer well.
01:36:01
Speaker
So it's a guide not to what is valuable or what people like; it's a guide to where we can be confident that we know what people like, and where we can't be confident, we should be cautious.
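To make the mechanism concrete: the surprise of a question under a language model can be read off as its negative log-probability. A minimal sketch with the Hugging Face transformers library, where the model choice is an arbitrary stand-in for whatever imitation model one would actually use:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal language model works the same way
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def surprisal_bits_per_token(question: str) -> float:
    """Average surprisal (bits per token) of the question text itself."""
    ids = tokenizer(question, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean cross-entropy
        # (in nats) over the predicted tokens.
        loss_nats = model(ids, labels=ids).loss.item()
    return loss_nats / math.log(2)

# An unusually high value flags a question that is unprecedented for the model,
# and so one where imitation-based preference judgments deserve extra caution.
print(surprisal_bits_per_token("Should we trade one day of sunlight for two of rain?"))
```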

Early Achievements and Meaningful Work

01:36:15
Speaker
In preparation for this conversation, I scrolled through your CV, and in your early life you had a lot of accomplishments that I think make it fair to say you were a child prodigy. For example, you graduated from MIT at 16, and you were working on theoretical ventures that are extremely advanced at a very young age. One interesting question there is how you think about mentorship at an early age, where you might be mature and advanced in your technical skills but don't have a lot of life experience with which to judge what you should do or which direction is worth pursuing.
01:36:56
Speaker
How do you think you dealt with that? I'm assuming there's something like being mature in your technical abilities without necessarily having a lot of, call it wisdom, or life experience to know how to navigate everyday life.
01:37:14
Speaker
Yeah, which is exactly the problem with early superintelligence, right? It'll have very, very strong technical capabilities, but not necessarily wisdom. So, I don't think I worked around this. I worked on a lot of things, including mind uploading, that turned out not to be the most important things to work on. And I did try to be pretty sensitive to what does seem like it's actually important to work on. In some ways that was a disadvantage, because it led me to work on a lot of different things and bounce around, or at least it looks that way to some people from the outside. But it did also enable me to end up landing on something that
01:37:55
Speaker
now feels very, very important and aligned with a meaningful purpose. But maybe one way of taking your question is: what advice would I give to child prodigies today? Yeah, or perhaps to near child prodigies, some of whom might be listening, where you're very young, you're working in a technical field, and you're working with people who are much older than you and have more experience. How do you navigate the tension between listening to authorities while also developing your own ideas? So I think I do have an answer to this, which is: when there's something you're uncertain about, consider what observations you would have if it were true
01:38:41
Speaker
and what observations you would have if it were false. Observations include the words that people around you say. And instead of taking those words literally and trying to determine if the words are true or false, think about them as the output of a data-generating process, where there are incentive dynamics, psychological dynamics, cultural dynamics, and logical dynamics that are seeking truth.
01:39:12
Speaker
How likely is it that you would hear these words if the thing were true, and how likely is it that you would hear them if it were false? And seek out observations that would look very different in worlds where the thing is true and worlds where it is false. Rather than seeking out a source of truth that would literally tell you in words whether it's true or false, you seek out observations, which might not be words, which might be data, papers, or capital flows, that would help you distinguish between the worlds where it's true and the worlds where it's false.
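What's being described here is, in effect, a Bayesian likelihood-ratio update; in standard notation (not used explicitly in the conversation), weighing an observation $o$ for or against a hypothesis $H$ looks like

\[
\frac{P(H \mid o)}{P(\neg H \mid o)} \;=\; \frac{P(o \mid H)}{P(o \mid \neg H)} \times \frac{P(H)}{P(\neg H)},
\]

so the observations worth seeking out are exactly those for which the likelihood ratio $P(o \mid H)/P(o \mid \neg H)$ is far from 1.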
01:39:49
Speaker
It seems like great advice in general, but it also seems to impose a lot of cognitive overhead. But perhaps, if we're talking about child prodigies or near child prodigies, that makes sense. Davidad, thanks a lot for talking to me. This has been great. Yeah, thank you for having me.