
Connor Leahy on the State of AI and Alignment Research

Future of Life Institute Podcast
Connor Leahy joins the podcast to discuss the state of AI. Which labs are in front? Which alignment solutions might work? How will the public react to more capable AI? You can read more about Connor's work at https://conjecture.dev

Timestamps:
00:00 Landscape of AI research labs
10:13 Is AGI a useful term?
13:31 AI predictions
17:56 Reinforcement learning from human feedback
29:53 Mechanistic interpretability
33:37 Yudkowsky and Christiano
41:39 Cognitive Emulations
43:11 Public reactions to AI

Social Media Links:
➡️ WEBSITE: https://futureoflife.org
➡️ TWITTER: https://twitter.com/FLIxrisk
➡️ INSTAGRAM: https://www.instagram.com/futureoflifeinstitute/
➡️ META: https://www.facebook.com/futureoflifeinstitute
➡️ LINKEDIN: https://www.linkedin.com/company/future-of-life-institute/
Transcript

Strategic Landscape of AGI

00:00:00
Speaker
All right, welcome to the Future of Life Institute podcast. I'm back here with Connor Leahy, and Connor is, of course, the CEO of Conjecture. Connor, welcome to the podcast. Back again. All right, so what does the strategic landscape of AGI look like right now? Tell me if you think this kind of structure is approximately true. Do you have OpenAI and DeepMind in front,
00:00:27
Speaker
followed by the American tech giants like Apple and Google and Meta. And then perhaps you have American startups, and then you have Chinese tech giants and startups. In terms of capabilities, is that the right ordering? Sort of, I would say DeepMind is behind OpenAI by a pretty significant margin right now. I think Anthropic might actually be ahead of DeepMind at this point.
00:00:49
Speaker
Not 100% clear. DeepMind keeps their cards much closer to their chest, so they might have some really impressive internal things. I've heard some things to that effect, but I don't have evidence of them. So it seems to me OpenAI is clearly front and center, ahead of everyone else.

Challenges for Tech Giants and Startups

00:01:05
Speaker
I expect Anthropic will catch up. It seems like they're trying very hard to train their GPT-4-equivalent model right now. I expect they will succeed. Tech giants, I mean, it really depends. Meta is
00:01:18
Speaker
pretty far behind. Google is DeepMind, effectively. Apple doesn't do anything as far as I can tell. So I would, for example, think some startups such as Character are ahead of all of them. Character AI is a company that was founded by Noam Shazeer and others, Noam Shazeer being the person who invented the Transformer. They're mostly focused on chatbots and such, but their models are quite good. They're quite good. So, yeah, that's kind of
00:01:48
Speaker
how I would put it. I don't feel that Chinese tech giants or startups are very relevant. I think they're really far behind, and I don't expect them to catch up anytime soon. Are we simply waiting for the American tech giants to make moves here? I mean, Apple has a lot of money, they have a lot of talent, and they have machine learning chips in all of their iPhones. You could easily see a GPT-4-style enhanced Siri.
00:02:13
Speaker
Sure. But Google, which is supposed to be the best of these, couldn't even keep the guy who invented Transformers around because they're so dysfunctional. And one of the first things I was told by experienced investors and such when I founded a startup is that incumbents are not competition; they're all incompetent.
00:02:33
Speaker
Sure, all these things are possible. They have the resources to do these things. But there are a lot of reasons why it could be very hard for large organizations to execute on these kinds of things.
00:02:44
Speaker
Another great example is Bard, the chatbot that Google produced. It was severely delayed, it had lots of problems, and it was just extraordinarily underwhelming compared to what a much smaller group at OpenAI was capable of doing in a smaller amount of time. Google is in code red now, where the CEO is personally involved and everyone's freaking out.
00:03:08
Speaker
That doesn't mean they can catch up. Just because a lot of people in the boardroom say something should be done and they have a lot of money, that's not enough.

Culture of Secrecy and AI Publishing Norms

00:03:18
Speaker
There are some things that are actually hard, and training complex cognitive engines like GPT-4 is among those things. Another one is, for example, chip production. China made a huge deal about how they're going to build up their domestic chip production and catch up to TSMC.
00:03:36
Speaker
And that has now been slowly choked off, because it's just not succeeding; no one in the world other than TSMC can get these ultraviolet machines to work, for whatever reason. So what we're doing right now, kind of
00:03:53
Speaker
describing incumbent technology giants as incompetent. Might that be mistaken? Because perhaps they're hiding things, waiting to release until they have something that's very polished; that would be a very Apple-like thing to do. Perhaps DeepMind has something that they're not releasing because they are safety-conscious. Or is this wishful thinking?
00:04:17
Speaker
wishful thinking.
00:04:47
Speaker
contractors, where people have a culture of secrecy and keeping things close to the chest. Apple is the only company on this list that is good at that. DeepMind and Anthropic are also trying, but
00:04:59
Speaker
you know, mixed. If you ask who's better, Lockheed Martin or Airbus fighter jets or whatever, I'm like, okay, I genuinely don't know. That's actually hard to know, and people will actively make it hard for you to know these things. But all these people have an incentive to make public how good they are, and they do so quite aggressively
00:05:22
Speaker
and when it benefits them. And Google scrambled after ChatGPT to catch up with Bard, they put their best effort forward, and it was a flop. Same thing with Ernie in China and stuff like this. Don't galaxy-brain yourself; it's just what it looks like.

Tacit Knowledge and IP Protection in AI

00:05:39
Speaker
And also, the AI researchers in the most advanced organizations want to publish research so that they can perhaps move to another organization.
00:05:50
Speaker
They have interests and incentives that are not particularly aligned with the company they work for. So there are these publishing norms where your resume as an AI researcher is your published papers. Does this make it basically difficult to prevent new advances from spreading out to a lot of companies?
00:06:14
Speaker
Yep, that's correct. But if that's the case, why can't Google catch up then? Because there is an additional aspect to it, which is execution and tacit knowledge. Especially with large language model training, a massive amount of what differentiates a good language model from a decent language model is
00:06:33
Speaker
weird voodoo shit, where someone just has a feeling like, oh, you know, you have to turn down the Adam beta2 decay parameter. Why?
00:06:46
Speaker
Because Noam said so. There is a theoretical aspect to catching up: you need to come up with the right architecture, the right algorithms, whatever. But there's also engineering and logistics. Setting up a big compute data center is hard and takes a lot of money, time, and specialized effort. So there's a logistical aspect to it.
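As a concrete illustration of the kind of knob being referenced, here is where a setting like Adam's beta2 lives in a PyTorch training setup. This is a minimal sketch with invented values, not anyone's actual training configuration; the point is only that such a detail is a single line of code whose "right" value tends to live in practitioners' heads rather than in papers.

```python
import torch

# A stand-in module; in practice this would be a large transformer.
model = torch.nn.Linear(512, 512)

# The "voodoo" knob mentioned above: Adam's beta2 controls how quickly the
# second-moment estimate decays. The library default is 0.999, but large
# language model runs are often trained with a lower value such as 0.95
# for stability. Knowing which value to pick, and when, is tacit knowledge.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
```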
00:07:11
Speaker
There's also will and executive capacity: can an organization commit to doing something like this, and have someone lead it who can actually manage the complexities involved? And then there's a huge aspect of tacit knowledge, just the stuff that isn't written down. And there's a lot that is not written down.
00:07:32
Speaker
And that tacit knowledge might be particularly important in chip production, which could be why the companies in front in chip production are very difficult to copy. That is exactly correct.
00:07:47
Speaker
This is also what people on the inside of chip production will tell you: there is an absolutely phenomenally large amount of tacit knowledge that goes into producing high-end computer chips, and a lot of this tacit knowledge only TSMC has. They're not writing it down, and even if they wanted to, they probably couldn't.
00:08:05
Speaker
Why is it that, for example, with defense companies or with chip production, there's a protection of intellectual property that we don't see in AI companies? The strongest form of secrecy in AI companies is simply to not say what's happening. You don't see these company secrets protected by intellectual property in the same way.
00:08:33
Speaker
Yep, I think this is a completely contingent historical and cultural fact. I think it
00:08:40
Speaker
didn't have to be this way. It's literally just a coincidence of the personalities of the people who founded this field and the academic norms it inherited; it came from a very academic area. There was much less military involvement and much less industry involvement initially. Academics have much more bargaining power in this field.
00:09:06
Speaker
Because of the high level of tacit knowledge, academics have a lot of leverage here: if Noam Shazeer wants to publish, he can go wherever he wants. So if you don't allow him to, he'll just go somewhere else, and the same goes for any high-profile person. But I think this is

Defining AGI and Its Implications

00:09:22
Speaker
totally contingent. From the perspective of people in ML who think this is the way things have always been, will always be, and can only be: this is obviously wrong. You're telling me the people who build stealth fighters are not incredible scientists and engineers? Give me a break. Just because you're a great scientist or engineer doesn't mean you're compelled by your very genome to want to publish. No, this is just cultural; it's a contingent truth about how the culture happened to evolve.
00:09:52
Speaker
As the race to AGI heats up, will we see more closedness? So closed data, closed algorithms, will we see labs not publishing as many papers? Will these open-source norms from the AI research community begin to fall apart? I hope so.
00:10:10
Speaker
We're seeing some of it, that's for sure. And I hope it accelerates. As you see it, we are already in a race towards AGI, correct? Yep, obviously. So maybe 10 years ago when people were debating AI, people would debate whether human level AI is possible and whether we could get there within a couple of centuries and so on.
00:10:31
Speaker
Perhaps it is time now to retire the term AGI and to talk about specific predictions instead, because people mean a lot of different things by it. If you asked Connor Leahy in 2010 whether ChatGPT or GPT-4 was an AGI system, what would you have responded?
00:10:53
Speaker
I mean, depending on how well you described it, I would have said yes. And I do think these things are AGI; I still do. It's just that people don't like my definition of the word AGI, so I don't use that word very much. I agree that the word AGI is not particularly good.
00:11:09
Speaker
People use it to mean very, very different things. By reasonable definitions of AGI that were used 10 years ago, ChatGPT is obviously AGI. Most of the definitions from 10 or 20 years ago were something like: can do some human-like things in some scenarios, with reasonable human-level performance on a pretty wide range of tasks.
00:11:30
Speaker
Obviously ChatGPT and GPT-4 have reached this level, and obviously they have the ability to go beyond that. But there are other definitions that they don't fulfill, like strictly better than humans at literally everything. Sure, ChatGPT is not that. But also, LOL, what are you doing? Perhaps that's not super interesting. There will always be 2% where humans are just better.
00:11:55
Speaker
Sure. Or people are just testing it wrong, or not bothering to try, or whatever. So the real definition of AGI that I am most interested in personally is more vague than that. It's something like: a system that has the thing that humans have that chimps don't.
00:12:15
Speaker
Human brains are basically scaled-up chimp brains, by a factor of three or four or something. They're all the same structures, all the same kinds of things internally, blah, blah, blah. But humans go to the moon and chimps don't.
00:12:29
Speaker
So there are some people who say, okay, but there's always going to be some task where a specialized tool wins; an AGI couldn't fold proteins as well as AlphaFold or whatever. And I'm like, yeah, sure. Maybe the AGI can't fold proteins as well as AlphaFold, but it can invent AlphaFold.
00:12:47
Speaker
So the relevant thing I'm personally interested in is just like a thing that is powerful enough to learn and do science and to pose potentially existential risks to humanity. Like those are things I personally care about.
00:13:02
Speaker
And when I talk about AGI, that's generally what I'm referring to, but I agree it's a bad meme. It's a bad word because, as I've said before, some people hear AGI and think of a friendly humanoid robot buddy who's sort of as smart as you but not really smarter, while other people think AGI is a god-like super-thing that can do everything, literally immediately.
00:13:27
Speaker
We can have a semantic fight about this, but I don't know. Perhaps a way to resolve these issues is to make predictions. Would you be willing to make a prediction about when, for example, an academic paper created by an AI model would get published in, say, a reasonably high-quality scientific journal?
00:13:49
Speaker
That's underdefined. How much human intervention is allowed here? Do I give it a prompt? Does it have to navigate the website and upload the paper itself? Say you can give it a detailed prompt, and the system simply has to create the paper and nothing else.
00:14:09
Speaker
Does this have to have actually occurred, or just be possible? Because it was possible yesterday. Has it actually occurred? I think no one's gotten around to it; people just haven't bothered to do it. Do you think this could be done now? Yeah, absolutely, obviously so. The Sokal affair already got papers published decades ago that were complete nonsense.
00:14:30
Speaker
I think you could have done this with GPT-2, probably, if you allow non-STEM journals. You maybe need GPT-3 for STEM journals, but have you read ML papers? So many of them are so awful.
00:14:44
Speaker
This is like not that hard. So you think this is basically already here now? Oh yeah, absolutely.

AI's Role in Scientific Discovery and Safety Concerns

00:14:49
Speaker
But I think this is not capturing the thing you're interested in. The thing you're probably interested in is: can it do science? You're not interested in whether it can trick reviewers into thinking something is good. So with the question of when a system will publish a scientific paper, correct me if I'm wrong, but I expect you're looking for the question, when can it do science? Not the question, how stupid are peer reviewers?
00:15:14
Speaker
True. How can we make this question interesting then? When can an AI system publish a scientific paper that gets cited a lot? Or is that also simply a way of gaming the system? There's various ways we can think about this.
00:15:35
Speaker
I'm going to give the unsatisfying, but I think correct, answer, which is: by the time it can do that, it's too late. If you have a system that can fulfill whatever good criteria you can actually come up with, criteria that actually mean it can do actual science, it's too late. And I expect at that point, if we have not aligned such a system, then end times are upon us and we do not have much time left, if any.
00:15:59
Speaker
If you asked me to bet on these things with my real money, I just wouldn't, because I don't expect the bet to pay out. Do you expect AIs to publish credible scientific papers before they can empty a dishwasher? Credible or correct? Those are different. Correct, interesting scientific papers.
00:16:18
Speaker
I expect that to happen probably before the dishwasher. I can make a concrete prediction. I expect the world to end before more than 10% of cars on the street are autonomous.
00:16:29
Speaker
Okay, so what we have here is a scenario in which we are close to transformative AI, we could call it, or perhaps deadly AI. If we are very close, does this mean that the game board is kind of settled, in a sense? The big players are the players that are going to take us all the way there. So for example, we could ask, is it OpenAI that ends up creating transformative AI?
00:16:56
Speaker
Seems pretty likely on the current trajectory. If nothing changes, if government doesn't get involved, if culture doesn't shift, if people don't revolt, then yeah, I expect OpenAI, DeepMind, or Anthropic, 70 to 80% one of them, and the rest of the percentage smeared over other actors or actors that haven't yet emerged. Are we getting hyped up on an exponential curve that's about to flatten off?

AI Safety Methods and Their Limitations

00:17:25
Speaker
So will we, for example, run out of data, or run out of accessible compute, or something of that nature? This is not something you see coming? I don't see any reason to expect this. My general heuristic is: if you don't know what the future is going to hold, predict that what just happened will happen again. And this is what I'm seeing. We're now in takeoff, where the expenditures are happening. Will this flatten off at some point? Yeah, sure. I just expect that to be
00:17:55
Speaker
post-apocalypse. Let's take a tour of the landscape of different alignment or AI safety solutions. The current paradigm used by OpenAI, for example, is that you train on human-created data, and then you do reinforcement learning from human feedback, fine-tuning the model afterwards.
00:18:14
Speaker
If nothing changes, if we are very close to transformative AI, if perhaps transformative AI will be developed by OpenAI, could this succeed as a last option? Do you think that reinforcement learning from human feedback could take us to at least somewhat safe systems? No.
00:18:34
Speaker
There's no chance of this paradigm working. No. It's not even an alignment solution; it's not a proposal. I don't think the people working on it would even claim that. I'm pretty certain that if you asked Paul Christiano or someone like that, is RLHF a solution to alignment, he would just say no.
00:18:51
Speaker
And for context, Paul Christiano basically invented reinforcement learning from human feedback. Yeah, he was one of the core people involved in it. Maybe some people involved in its creation will claim this, I don't know. But I would expect that if you ask the people who created these methods, is this an alignment solution, they would say no, and that they don't expect this to work. I don't think that RLHF
00:19:14
Speaker
in any way addresses any of the actual problems of alignment. The core problem of alignment is: how do you get a very complex, powerful system that you don't understand to reliably do something complicated that you can't fully specify, in domains where you cannot supervise it? It's the principal-agent problem writ large.
00:19:38
Speaker
RLHF does not address this problem. It doesn't even claim to address this problem. There's no reason to expect that RLHF should solve this problem. It's like clicker-training an alien. Every time you do an RLHF update, and for those not familiar, you can imagine it, simplified, as: the model produces some text, you give it a thumbs up or a thumbs down, and then you do a gradient update to make it more or less likely to do that kind of thing.
00:20:09
Speaker
You have no idea what is in that gradient, no idea what it is learning, what it is updating on. Let's say your model threatens some users or whatever, and you say, oh, that's bad, so give it a thumbs down.
00:20:24
Speaker
Well, what does the model learn from this? One thing it might learn is: don't threaten users. Another thing it might learn is: don't get caught threatening users. Or use fewer periods. Or don't use the word green. Who knows? In practice, what it's going to learn is a superposition of all of these, or tons of these possible explanations.
00:20:50
Speaker
And it's going to change itself the minimum amount to fulfill this criterion, to move in that direction in that domain. But you have no reason to expect this to generalize. Maybe it does.
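To make the "thumbs up, thumbs down, gradient update" picture concrete, here is a minimal, illustrative toy in PyTorch. It is a deliberate caricature of the simplified description above, a REINFORCE-style update on a tiny stand-in policy rather than production RLHF (which uses a learned reward model and PPO); the vocabulary, feedback rule, and hyperparameters are invented for illustration. Note that the update only pushes the probability of the sampled output up or down; nothing in it says why the output was good or bad, which is exactly the ambiguity being described.

```python
import torch
import torch.nn.functional as F

# Toy "policy": logits over a tiny vocabulary; a stand-in for a language model.
vocab = ["helpful reply", "neutral reply", "threatening reply"]
logits = torch.zeros(len(vocab), requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.5)

def human_feedback(choice: int) -> float:
    """Thumbs up (+1.0) or thumbs down (-1.0); here a hard-coded stand-in rule."""
    return -1.0 if vocab[choice] == "threatening reply" else 1.0

for step in range(200):
    probs = F.softmax(logits, dim=-1)
    choice = torch.multinomial(probs, 1).item()   # the model "says" something
    reward = human_feedback(choice)               # a human rates it
    # REINFORCE-style update: raise or lower the probability of the sampled
    # output in proportion to the reward. The gradient carries no information
    # about *why* the output was rated up or down.
    loss = -reward * torch.log(probs[choice])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final = F.softmax(logits, dim=-1).tolist()
print({word: round(p, 3) for word, p in zip(vocab, final)})
```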

Security Mindset in AI Alignment

00:21:04
Speaker
Sure. Maybe it does.
00:21:05
Speaker
But maybe it doesn't. You have no reason to expect it to. There is no theory, there is no prediction, there is no safety. It's like Siegfried and Roy and the tiger, right? Well, we've raised it from birth, it's so nice. And then it mauls you. Why? Who knows? The tiger had a bad day. I don't know.
00:21:23
Speaker
Perhaps the counter argument to something like this is that it's only when we begin interacting with these systems in an empirical way that we can actually make progress. The 10 years of alignment research before, say, 2017, didn't really bring us any closer. It was only when OpenAI began interacting with real systems that we understood how they even worked and therefore perhaps gained information about how to align them. Do you buy that? No.
00:21:53
Speaker
And why not? I mean, which part of that is true? Saying that no progress happened when it was three people in a basement working with no funding, it's like, what the hell are you talking about? Given it was three people in a basement, they made a lot of progress on predicting things that would happen, on deconfusing concepts, on building an ontology for things that don't yet exist. This was extremely impressive given the extremely low amount of effort put into this.
00:22:22
Speaker
Sure, they didn't solve alignment. Sure. But has any progress happened in alignment since then? It's not obvious to me. There are more people using the word, there are a lot more papers about it, and there's stuff like RLHF. I don't consider RLHF to be progress; in a sense it's a regression. It's like,
00:22:45
Speaker
the fact that anyone, and this is not meant as a critique per se of the people who did RLHF, because I think they were fully aware that, hey, this is not alignment, this is just an interesting thing to study a little bit, which I think is totally fair and legitimate. RLHF has its purpose. It's a great way to make your product nicer. As a capabilities method, RLHF is totally fine. Just don't delude yourself into thinking that this is... I don't
00:23:12
Speaker
buy this whole idea that, well, if it makes a model behave better in some subset of scenarios, this is progress towards alignment. I think this is a really bad way of thinking about the problem. It's like saying, well, if I hide my password file two folders deep, then that is security, because there are some scenarios where an attacker would not think to look two folders deep. And I'm like,
00:23:38
Speaker
sure, in some trivial sense that's true. But that's obviously not what we mean when we talk about security. If you encrypt the password file but your encryption is bad, I'm like, all right, cool, this is obviously an attempt at safety, I accept this as safety, but now I have problems with your encryption system. That is progress in alignment that we can argue about.
00:24:03
Speaker
Moving your password.txt two folders deep, I do not consider progress. You weren't even trying to solve the problem; you were trying to do something different. You don't think that Microsoft's Bing Chat, or Sydney, was less aligned than ChatGPT? Not in the ways I care about. I think this is stretching what I use the term alignment for.
00:24:33
Speaker
You can make that statement. I think this is a completely legitimate way of defining the word alignment if you want to define it that way. That is an okay thing to do. But it's not the thing I care about. I do not expect that if I had an unaligned existential risk AGI,
00:24:50
Speaker
and I did the ChatGPT equivalent to it, that that saves you. I think that gives you nothing; you die anyway. Nature doesn't grade on a curve. Just because you're 10% better, if you don't meet the mark, you still die. It doesn't matter if your smiley face was 10% larger than the next guy's smiley face. If you're only painting incrementally larger smiley faces, it doesn't matter.
00:25:14
Speaker
So what about extrapolations from reinforcement learning from human feedback? For example, having AIs work to give feedback on other AIs. Could that maybe scale to something that would be more interesting for you? Why would you expect that to work?
00:25:31
Speaker
Where does the safety come from here? There is no step in this process that actually addresses the core difficulty: how do you deal with a system that will get smarter, that will reflect upon itself, that will learn more, that is fundamentally alien, with fundamentally alien goals encoded in ways we do not understand, cannot access or modify, and that is extrapolating into larger and larger domains that we cannot supervise? No part of this
00:25:59
Speaker
addresses this problem. You can play shell games with where the difficulty is until you confuse yourself sufficiently into thinking it's fine. This is a very common thing; it happens in science all the time. Especially in cryptography, there's a saying: anyone can create a code complex enough that they themselves can't break it.
00:26:16
Speaker
And it's a similar thing here: I think everyone can create an alignment scheme sufficiently complicated that they themselves think it's safe. But so what? If you just confuse yourself about where the safety part is, that doesn't actually give you safety. What would be evidence that you're wrong or right here? For example, if it turns out that the GPT model that's available right now is not used to create
00:26:46
Speaker
havoc in the world, is not used to scam people, and turns out not to be dangerous in the way we expected, would this be evidence that perhaps OpenAI is doing something right? A proof of this would be that no one on the entire internet can find any way to make the model say something bad, that there's no prompt that can be found that makes it say something OpenAI doesn't want it to say.
00:27:11
Speaker
Perhaps not just say something bad; it's not specifically about bad words or something. This is actually quite important, because what OpenAI is trying to do is stop the model from saying bad things. That's what they were trying to make it do, and they failed. That's the interesting thing. If they had an alignment technique
00:27:28
Speaker
that actually worked, one that I expect might have a chance of working on a superintelligent system, it should be able to make these less smart systems never, in any scenario, say a bad thing. It should work in basically all cases. And importantly, it has to be all cases, because if it is not all cases, then unless you have some extremely good theoretical reason why actually this is okay,
00:27:54
Speaker
well, by default, these are black boxes. I don't accept any assumptions. Unless you give me a theory, a causal story about why I should relax my assumptions, I'm like, well, if it's breakable, it will break. And this is the security mindset. The difference between the security mindset and ordinary paranoia is that ordinary paranoia assumes things are safe until proven otherwise, while the security mindset assumes things are unsafe until proven otherwise. And sure, you can't apply the security mindset to literally everything all the time, because you'd go crazy.
00:28:22
Speaker
Right? Sure. But when we're dealing with existential threats from extremely powerful, superhuman optimizing systems, systems whose whole purpose is to optimize reality into weird edge cases, to break systems, to glitch, to enforce power upon systems, this is exactly the type of system you have to have a security mindset for. Because if you have a system that's looking for a hole
00:28:51
Speaker
in your walls, and you have one small hole, that's not good enough. If you have a system which is randomly probing your wall and you have one small hole, yeah, maybe that's fine. If

Interpretability and Future of AI Alignment Research

00:29:03
Speaker
it's small enough, that's okay. But it's not okay if it is deliberately looking for the small hole and if it's really good at finding them.
00:29:10
Speaker
What about the industry of cybersecurity, for example? You would assume that they have a security mindset, or at least they should have. But accidents happen all the time, data is leaked, and so on. Isn't that evidence that we can survive situations where there are holes in our security? It's not actually true that systems have to work 100% of the time.
00:29:35
Speaker
The fact that we survived doesn't have anything to do with the security mindset, or with the systems being secure; it has to do with the failures not being existential. If those systems had been existentially dangerous AGIs, yes, I expect we would be dead. So it's only because of the limited capabilities of these systems that they can be hacked, and have been hacked, and so on?
00:29:53
Speaker
Exactly. Let's take another paradigm of AI safety, which is mechanistic interpretability. This is about understanding what a black-box machine learning system is doing, trying to reverse-engineer the algorithm encoded in the neural network weights. Is this a hopeful paradigm, in your opinion?
00:30:12
Speaker
I think it's definitely something worth working on. It's something that I and people at Conjecture work on as well. The way I think about interpretability is not as an alignment agenda. Interpretability doesn't solve alignment. It might give us
00:30:27
Speaker
tools with which we can construct an aligned system. In my ontology, I think of mechanistic interpretability as attempting to move cognition from black-box neural networks into white boxes. Again, as I've said before, black box is observer-dependent. Neural networks are not inherently black boxes. It's not an inherent property of the territory; it's a property of the map.
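As a toy illustration of what "looking inside the box" can mean at the most basic level, the sketch below records every intermediate activation of a small network with PyTorch forward hooks. This is not a full interpretability method and is not specific to Conjecture's work; it is just the kind of instrumentation such research starts from, shown here on an invented stand-in model.

```python
import torch
import torch.nn as nn

# A tiny stand-in network; the same trick works on real transformer blocks.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

activations = {}

def record(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # cache what this layer computed
    return hook

# Attach a hook to each layer so its intermediate output becomes observable.
for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, nn.ReLU)):
        module.register_forward_hook(record(name))

_ = model(torch.randn(1, 8))
for name, act in activations.items():
    print(name, tuple(act.shape))
```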
00:30:55
Speaker
If you had extremely good interpretability in your head, extremely good theory, a lot of compute in your head, and whatever, then a neural network would probably look like a white box to you. And if that is the case, fantastic: now you can bound lots of things and maybe stop it from doing bad things. So this is the default thing I tell people to do if they don't know what to do. If they say, I don't know what to do in alignment or safety, I'm just like, okay, just work on interpretability.
00:31:24
Speaker
Just try. Just bash your head against it and see what happens. Not because I think it's easy; I expect this to be very hard. I also think a lot of the current paradigm of mechanistic interpretability is not great; I think a lot of people are making simplifying assumptions that I don't think they should be making. But in general, I'm in favor. I think this is good. One problem, or perhaps the main problem, with interpretability research is just the question of whether it can move fast enough.
00:31:50
Speaker
We are just beginning to understand some interesting clusters of neurons in GPT-2, but right now, GPT-5 is being trained. Can it keep up with the pace of progress? Do you think it can? The same applies to literally every other thing. My default answer is no. I don't expect things to go well. Again, I expect
00:32:12
Speaker
things to go poorly. I do not expect us to figure out alignment in time. I expect things will not slow down sufficiently. I expect things will continue, and I expect us to die. I expect this is the default outcome of the default world we are currently living in. This doesn't mean it has to happen. There is this important thing: the world being the way it is is not overdetermined. It didn't have to be this way. The way the world currently is, the path we're currently on,
00:32:39
Speaker
is not determined by God. It is because of decisions that humans have made. Individual humans, institutions, and so on have made decisions and done things in the past that have led us to the moment we are in right now. This was not overdetermined. It didn't have to be this way. It doesn't have to continue to be this way, but it will
00:33:00
Speaker
if nothing changes. So do I expect interpretability to work in time? No. Do I expect CoEm to work in time? No. Do I expect RLHF to ever work? No. I don't expect any of these things to work. That doesn't mean it's impossible, if we take action, if things change, if we slow down, or if we make some crazy breakthroughs or whatever. With interpretability, I think there is a lot that can be done here; there is a lot of theory, a lot of things that can be done.
00:33:29
Speaker
Are they possible? Yes. Will they happen in time? Probably not. Then there is the research done by Paul Christiano and Eliezer Yudkowsky at the Alignment Research Center and the Machine Intelligence Research Institute. This is something that is, for me at least, difficult to understand. As I see it, it is attempting
00:33:56
Speaker
to use mathematics to prove something about the background assumptions in which alignment is situated. What do you think of this research paradigm? Is there any hope here? I feel like both Paul and Eliezer would scream if you put them in the same bucket. I think their research is actually very different. Just to say a few words on that:
00:34:20
Speaker
I'll give the straw man. I'm specifically straw-manning, because Eliezer and Paul are some of the most difficult people to accurately represent the true opinions of. Basically every single person I know completely mischaracterizes Paul, even people who know him very well. Whenever I ask someone who knows Paul very well what Paul believes, they tell me X, and then I ask Paul and he tells me something different. I don't think this is malicious; I think it's just that
00:34:48
Speaker
Paul's and Yudkowsky's opinions are very subtle and very complex, and communicating them is hard. So I am prefacing this: I am definitely misunderstanding what Paul and Eliezer truly believe; I can just give my best straw man. My best straw man of the Christiano view is that he currently works on something called ELK, eliciting latent knowledge. That, plus another thing I'm aware of, is an attempt where he's trying to
00:35:18
Speaker
think of worst-case scenarios: how can you get true knowledge out of neural networks? How can you get the true beliefs out of a system, and not just neural networks, any arbitrary system, even if it's deceptive?
00:35:34
Speaker
Also related to that, he does some semi-formal work on proofs and causal tracing through neural networks and things like that. This is a straw man; it's definitely not an accurate description of what he actually does, but it's the closest I can get. Eliezer, meanwhile, is currently on hiatus, on sabbatical, so I don't think he's doing any research right now, actually. But historically, what MIRI, the organization he founded, does is
00:36:01
Speaker
building formal models of agency, trying to deconfuse agency and intelligence on a far more fundamental level than just writing some code, trying to build an AI, and then figuring out how to align it. It's much more thinking from first principles.
00:36:17
Speaker
What is an agent? What is optimization? What would it mean for systems to be aligned or corrigible? Can we express these things formally? How can systems know that they or their successors will be aligned? How can they prove things about themselves? How would they coordinate or work together? Work on decision theory, embedded agency, stuff like this. So I think a lot of the MIRI paradigm is
00:36:47
Speaker
a lot more subtle than people give it credit for, and I understand the MIRI paradigm better than I understand Paul's. A lot of it is very subtle, but actually I think a lot of the MIRI work is very good. It's very good work, and it's really interesting, but that's just my opinion. So
00:37:06
Speaker
when people talk about formal mathematical theories, blah, blah, blah, they often refer, I think, to something Eliezer said in the Sequences, where it's like: the only way to get aligned AGI is a formally proof-checked thing, solve all of alignment in theory, and then build AGI. I
00:37:28
Speaker
don't know if he still believes this. He probably does, but I just don't know; I don't think I've asked him, or maybe I've asked him but I don't remember his answer. And I don't think Paul believes this. The last time I talked to him, and again, this is a straw man, please don't hold me to this, Paul, and sorry if I'm misrepresenting you here, my understanding is that he has something like 30% p(doom) on the current path, which
00:37:57
Speaker
obviously isn't going through formal methods. So from that, I deduce that he doesn't expect this to be necessary. If that's wrong, I apologize; that's just my impression, that Paul is quite open to non-formal things and neural networks and that kind of stuff. Eliezer, on the other hand, has this belief that if it's just neural networks, we're super screwed. We're just super, super screwed, there's nothing we can do, it's way too hard.
00:38:22
Speaker
A lot of the MIRI perspective, I think, is that aligning neural networks is so hard that we have to develop something that isn't neural networks, something that is easier to align, and then use that instead. And this has not been super successful, as far as I can tell. My view on this: I'm not sure about Paul's agenda.
00:38:42
Speaker
I'm pretty pessimistic about ELK; it seems too hard. I don't really understand the other stuff he's working on, so I can't really comment on that. I definitely disagree with him on some points about interpretability and p(doom) and such; I think he's too optimistic about many things. But every time I bring this up, he actually has good counterpoints, so maybe he has some good counterpoints I just don't know about.
00:39:07
Speaker
As for Eliezer, I agree that in a good world, that's what we would do. In a good world, where people are sane and coordinated and we take lots of time, we would do much more MIRI-like things. Not necessarily exactly what MIRI did; some of the
00:39:27
Speaker
exact details of what I would have done would differ, but the general class of things, let's deconfuse agency, let's deconfuse alignment, and then try to build formal models: I think this is super sensible. It didn't work in this one specific instance,
00:39:47
Speaker
given the constraints they had, but I don't think that means the entire class of methodologies is ontologically flawed and cannot possibly work. They tried, they found some things that I find interesting, and other things didn't work out. Bro, that's how science works.
00:40:07
Speaker
And perhaps it could have worked if we had started in 1950 working on this and had developed it alongside the general mathematics of computation or something like that.

Cognitive Emulations and Public Attention in AI Safety

00:40:18
Speaker
Yeah, I think this is completely feasible. I think it's completely possible that things could have come out slightly differently.
00:40:23
Speaker
If MIRI had had one more John von Neumann get involved and get really into it in the early days, I think it's not obvious that this is 100 years away or something like that. It might be, but it's not obvious to me. Things always feel impossibly far away until they're not. People thought flying machines were hundreds of years away the day before it happened. Same thing with nuclear fission and fusion and stuff like that.
00:40:52
Speaker
I feel like MIRI gets a bad rap. Sure, they made some technical bets that didn't quite work out, but I think that's fair. So I'm pretty sympathetic to the Yudkowsky view, even so.
00:41:07
Speaker
My personal view is kind of: we're at the point where this is a strategic decision. If I knew I had 50 years, I'd probably work on MIRI-like stuff. But I don't have 50 years. So the kind of CoEm stuff I work on is more of a compromise between the various positions: there's a spectrum between fully formal, everything white box, and nothing formal, completely black box. Let's try to move as far towards the formal end as possible, but no further, kind of.
00:41:37
Speaker
If that makes any sense. It does. Now, perhaps introduce CoEms, these cognitive emulations. Yeah, so cognitive emulation, or CoEm, is the agenda that we at Conjecture are primarily focused on right now. It is a proposal, or more a research agenda, for how we could get towards more safe, useful, powerful AI systems by fundamentally
00:42:07
Speaker
trying to build bounded, understandable systems that emulate human-like reasoning. Not arbitrary reasoning, not just solving the problem by whatever means necessary, but solving it the way humans would solve a problem, in a way that you can understand. So when you use a CoEm system, and these are systems, not models; this is not one neural network.
00:42:29
Speaker
It may involve neural networks, it probably does involve neural networks, but it'll be a system of many subcomponents, which can include neural networks but also non-neural-network components.
00:42:40
Speaker
When you use such a system, at the end, you get a causal story. You get a reason to believe that you can understand, using human-like reasoning, why it made the choices it did, why it did the things it did, and why you should trust the output to be valid in this regard.
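For a very rough sense of the shape being described, a system rather than a single model, with every step logged so the user gets a causal story, here is a hypothetical toy sketch. It is emphatically not Conjecture's actual CoEm design; every name and rule in it is invented for illustration. The only point is structural: a neural component proposes, a transparent non-neural component checks, and the recorded trace of steps is the "reason to believe" the answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Trace:
    """An auditable log of every step the system took (the 'causal story')."""
    steps: List[str] = field(default_factory=list)

    def log(self, msg: str) -> None:
        self.steps.append(msg)

def neural_proposer(question: str, trace: Trace) -> str:
    # Stand-in for a neural component (e.g. a language model call) that proposes an answer.
    answer = "42"
    trace.log(f"proposer: suggested '{answer}' for '{question}'")
    return answer

def symbolic_checker(answer: str, rule: Callable[[str], bool], trace: Trace) -> bool:
    # Stand-in for a non-neural component that verifies the proposal with
    # a transparent, human-auditable rule.
    ok = rule(answer)
    trace.log(f"checker: rule-based verification {'passed' if ok else 'failed'}")
    return ok

def answer_with_causal_story(question: str):
    trace = Trace()
    proposal = neural_proposer(question, trace)
    if not symbolic_checker(proposal, lambda a: a.strip().isdigit(), trace):
        trace.log("system: refused, could not justify the proposal")
        return "no answer", trace
    trace.log("system: accepted, every step above is inspectable")
    return proposal, trace

result, trace = answer_with_causal_story("What is 6 * 7?")
print(result)
print("\n".join(trace.steps))
```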
00:43:01
Speaker
Yeah, and for listeners who were enticed by that description, they should go back and listen to the previous podcast in this series, where Connor and I discussed this for an hour and a half. All right, so as we see more and more capable AI systems, do you expect us to also see more and more public attention? Do you expect public attention to scale with capabilities? Or will public attention lag behind and only come in right at the very end?
00:43:30
Speaker
Both, in that I think we're at the very end and we're starting to see the attention now. Do you think that this will on net be positive or negative? Will public attention to AI make AI safer? At the current point, I see it as
00:43:46
Speaker
positive. It's not obvious; it could still go negative very easily. But the way I'm currently seeing things is that everyone is racing headlong into the abyss, and at least what the public, so far, in my experience, has been able to do is notice: hey, wait, what the fuck? Don't do that. Which is great progress, sad to say. It is truly maddening
00:44:13
Speaker
how many smart ML professors and whatever are so incredibly, utterly resistant to the possibility that what they're doing might be bad or might be dangerous. It's
00:44:28
Speaker
incredible, the level of rationalization that people are capable of. I mean, it's not incredible; it's actually very expected. This is exactly what you expect: it's hard to get a man to understand something when his salary depends on him not understanding it. And even the people who claim to understand it and say all the right words, they still do it. OpenAI can say all the nice words about alignment they want, or Anthropic or DeepMind or whatever. They're still racing to AGI, and they don't have an alignment solution. So
00:44:54
Speaker
I don't like speculating about people's internal minds, whether they're good or whether they're aligned or whatever. I don't really care. What I care about is what they do. And for me, the writing is on the wall: people are just racing towards the abyss. And if no intervention happens, if nothing changes, they will just go straight off that cliff with all of us in tow. And I think the public,
00:45:23
Speaker
even though they don't understand many things, and there are many ways in which they could make things worse, do seem to understand the very, very simple concept of: don't careen off the cliff into the abyss.
00:45:34
Speaker
Stop that right now. So here's an argument. OpenAI releases GPT-4 and this draws a lot of attention and therefore we get more resources into regulation and AI safety research and so on. And so it's actually a good thing to release a model that's very capable but not a super intelligent AGI.
00:45:55
Speaker
Is this galaxy-brained, is this 4D chess, or do you think there's something there? Sure, there's something there, but it is also just obviously 4D chess. It's like, okay, if you had a theory with which you could predict where the abyss is, sure, okay.
00:46:15
Speaker
There is no such theory. You have no idea what these systems can do when people get their hands on them. You have no idea what happens when you augment them with tools or whatever. You have no idea what these things can do. There are no bounds, no limits. So every time you release a model, every time you build such a model, you're rolling the dice. Maybe this time's fine.

Military Involvement in AI Development

00:46:35
Speaker
Maybe next time's fine.
00:46:36
Speaker
At some point, it won't be. It's Russian roulette. Sure, you can play some Russian roulette; most of the time it's fine. Five out of six people say Russian roulette is fine. What about possible counterproductive overreactions from more public attention to AI? For example, imagine that we decide to pause AI research, but
00:46:56
Speaker
AI safety research gets lumped in with AI capabilities research. And so even though we're not making any progress on capabilities, we're not making any progress on safety either. And when we lift this pause, we are in the same place as when we instigated it.
00:47:11
Speaker
Honestly, I'd be extremely surprised if that happened. I'm trying to imagine how that would actually play out in the real world. People won't even accept a moratorium on training things larger than GPT-4, which is the easiest thing to implement, the easiest thing to monitor, and which affects a teeny tiny sliver of all AI research. There are so few people that could or ever would train a GPT-4-sized model.
00:47:37
Speaker
That's such a teeny tiny sliver of AI research, and not even that is feasible in the current political world; it's very hard to get done. An overreach so large that MIRI doodling type theory on whiteboards gets shut down? That's not the world we live in. If we were in that world, I would be like, okay, interesting, let's talk about it. But this is just not the world we live in.
00:48:04
Speaker
What about AI becoming a military technology, where only militaries can work on it, and perhaps they work on it in ways that turn out to be dangerous? Yep, I am concerned about this. I think this is one of the ways things can go really badly. I used to be more virulently against this than I am now. But in another sense, I look at where we're currently heading, and I'm like, all right, currently we have basically a 100% chance of doom.
00:48:30
Speaker
What are the other options? I'm not going to defend many of the atrocities committed by militaries across the world or whatever. I'm not going to say that there aren't problems here. I'm not going to deny that there are some really fucked up people involved in these organizations or anything like that. Of course there are. But also, at least in the democratic West, and I don't want to speak about other nations, but
00:48:55
Speaker
there is such a thing as oversight. Court-martials are an actual thing that actually happens. In a sense, the military is authoritarian in a good way. The military is very authoritarian: there is hierarchy, there is authority, there is accountability, there is structure.
00:49:15
Speaker
The US military does a lot of bad things, but at least to some degree, they are accountable to the American public. Not perfectly; there are lots of problems here. But if a senator wants a hearing to investigate something going on in the military, they can usually get it. That's not perfect, there are huge problems, but it's something. And people
00:49:38
Speaker
do look, and politicians do care. They might make very stupid mistakes, they might do stupid things, and they might make things worse. The DOD could scale up GPT-4 very easily; they could make something much bigger than that. If they did a Manhattan Project and put all the money together to create GPT-F, some end-of-the-world system, then they could, and that would be bad.
00:50:08
Speaker
So I think it could make things worse, but not obviously so. It could also be that they help, or that it's just a super incompetent, slow, bureaucratic mess. And the military is very conservative,
00:50:22
Speaker
very, very conservative about what they deploy, about what they do. They want extremely high levels of security and extremely high levels of reliability before they use anything. If we built AI systems to military standards of reliability, if the military required that every AI system be as reliable as a flight control system, I would be like, that sounds fucking great. That sounds awesome. Of course, that is a rosy view.
00:50:52
Speaker
I think it's not a question of if the military gets involved; it's a question of when. And when this happens, well, as the law of undignified failure goes, if a thing can fail, it will always fail in the least dignified way possible. So probably it won't get to this level. But I think we should not dismiss this out of hand. First of all, I think it's ridiculous to expect that the military will not get involved.
00:51:17
Speaker
I think that is just impossible at this point, unless we get paperclipped tomorrow. Unless things go so fast that no one can react, the military will get involved, and we should work with them. We should be there to say, all right, how can we help the military handle this as non-stupidly as possible? And I do think that a lot of people who work in the military do care and would like things to be safe and work well.
00:51:43
Speaker
Is it worse than Sam Altman, Dr. Strangelove style, running things as fast as possible? Is it worse if the military nationalizes the whole thing and it grinds into a bureaucratic monstrosity? Not obvious to me. I'm not saying it's obviously good, but it's not obviously not good. All right. Connor, thank you for coming on the podcast. Pleasure as always.