
Roman Yampolskiy on Objections to AI Safety

Future of Life Institute Podcast
Roman Yampolskiy joins the podcast to discuss various objections to AI safety, impossibility results for AI, and how much risk civilization should accept from emerging technologies. You can read more about Roman's work at http://cecs.louisville.edu/ry/

Timestamps:
00:00 Objections to AI safety
15:06 Will robots make AI risks salient?
27:51 Was early AI safety research useful?
37:28 Impossibility results for AI
47:25 How much risk should we accept?
1:01:21 Exponential or S-curve?
1:12:27 Will AI accidents increase?
1:23:56 Will we know who was right about AI?
1:33:33 Difference between AI output and AI model

Social Media Links:
➡️ WEBSITE: https://futureoflife.org
➡️ TWITTER: https://twitter.com/FLIxrisk
➡️ INSTAGRAM: https://www.instagram.com/futureoflifeinstitute/
➡️ META: https://www.facebook.com/futureoflifeinstitute
➡️ LINKEDIN: https://www.linkedin.com/company/future-of-life-institute/
Transcript

Introduction of Podcast and Guest

00:00:00
Speaker
Welcome to the Future of Life Institute podcast. I'm Gus Docker and I'm here with Roman Yampolskiy. Roman is a computer scientist from the University of Louisville. Roman, welcome to the podcast.
00:00:13
Speaker
Thanks for inviting me. It's good to be back. I think it's my third time on the FLI podcast, if I'm not mistaken.

Objections to AI Safety Research

00:00:18
Speaker
Great. Okay. So you have this survey paper of objections to AI safety research. And I find this very interesting. I feel like this is a good way to spend your time: to collect all of these objections, consider them, and see if they have any merit.
00:00:35
Speaker
And so I think we should dive into it. One objection you raise under the technical objections is that AI, in a sense, doesn't exist; if we call it something else, it sounds less scary. Perhaps you could unpack that a bit. So those are objections from people who think there is no AI risk or that the risk is not real. Those are not my objections to technical work or safety work.
00:01:00
Speaker
We tried to do a very comprehensive survey, so even silly ones are included. And people do try to kind of explain that, okay, artificial intelligence is a scary-sounding scientific term, but if you just call it matrix multiplication, then of course it's not scary at all. It's just statistics and we have nothing to worry about. So it seems they're trying to kind of shift the narrative by using this approach of
00:01:30
Speaker
getting away from agenthood and the built-in scenarios people have for AI to something no one is scared of: calculators, addition, algebra. Perhaps there is a way to frame this objection where it makes a bit of sense, in that people are quite sensitive to how you frame risks. People are quite sensitive to certain words in particular.
00:01:54
Speaker
Do you see that perhaps people who are in favor of AI safety research, by calling AI something that sounds scary, might be, in a sense, inflating the risks?
00:02:05
Speaker
Well, it's definitely a tool people use to manipulate any debate. Whatever you're talking about, abortion or anything else, it's like: are you killing babies or are you making a choice? Of course, language can be used to manipulate, but it helps to look for capability equivalences. Are we creating god-like machines or is it just a table in a database? That would make a difference in how you perceive it.
00:02:32
Speaker
And perhaps we should just simply be willing to accept that, yes, what we are afraid of is, in a sense, matrix multiplication or data processing or whatever you want to call it, because these things might still have scary properties, whatever we call them. Right. So we can argue that humans are just, you know, steaks, pieces of meat with electricity in them, and it doesn't sound so bad until you realize we can create nuclear weapons.

Plausibility of Superintelligence: Limits and Skepticism

00:03:00
Speaker
So it's all about perception and what you're hoping to accomplish. All right. There is also an objection going along the lines that superintelligence is impossible. What's the strongest form of this objection? So essentially the argument goes that there are some upper limits on capability. Maybe they are based on the laws of physics. You just cannot, in this universe, have anything greater than a human brain. For some reason, that's the ultimate
00:03:29
Speaker
endpoint, and that's why evolution stopped there and somehow we magically ended up being at the very top of the food chain. There could be other arguments along the lines of: okay, maybe it's not an absolute theoretical limit, but in practical terms, without quantum computers we'll never get there. You can have many flavors of this, but the idea is that we're just never going to be outcompeted.
00:03:52
Speaker
And this doesn't strike me as particularly plausible. We could imagine humans simply with physically bigger brains. So the version where humans are at the absolute limit of intelligence doesn't sound that plausible. But is there some story in which physics puts limits on intelligence?
00:04:10
Speaker
There could be a very, very high upper limit to which we're nowhere close. If you think about the size of a possible brain, Jupiter-sized brains, at some point the density will collapse it into some black hole singularity. But this is not something we need to worry about just yet; no superintelligence is smart enough yet. We're nowhere near that size or capability. And from our point of view, we won't be able to tell the difference.
00:04:39
Speaker
I mean, hypothetically, an IQ of a million versus an IQ of a billion will look very similar to us. Yeah. And perhaps a related worry is that stories of self-improving AIs are wrong, in a sense. So it's definitely easy to make such claims because we don't have good examples of software doing it more than once. You have compilers which go through code and optimize it, but they don't continuously self-optimize. But
00:05:06
Speaker
it's not impossible to see that if you automate science and engineering, then those scientists and engineers will look at their own code and continue this process. So it seems quite reasonable. There could be strong diminishing returns on that.
00:05:22
Speaker
But you have to consider other options for becoming smarter. It's not just improving the algorithm. You can have faster hardware, you can have more memory, you can have more processes running in parallel. There are different ways to get to superintelligent performance. And you could of course have AI help along the way there, with the development of hardware or the discovery of new hardware techniques, as well as new algorithmic techniques and so on.
00:05:48
Speaker
That's exactly the point, right? So you'll get better at getting better, and this process will accelerate until you can't keep up with it. How do you feel about tools such as Copilot, which is a tool that programmers can use for autocompleting their code? Is this a form of proto self-improvement, or would that be stretching the term?
00:06:09
Speaker
Well, eventually it's good enough to be an independent programmer. That would be good, but I'm very concerned with such systems because, from what I understand, the bugs they would introduce would be very different from the typical bugs human programmers introduce. So debugging would be even harder from our point of view, monitoring it, making sure that there is not
00:06:29
Speaker
this inheritance of calls to a buggy first version. So yeah, long term, I don't think it's a very good thing for us that we can no longer keep up with the debugging process. Would you count it as a self-improving process?
00:06:44
Speaker
So I think for self-improvement, you need multiple iterations. If it does something once, or even a constant number of times, I would not go there. It's an optimization process, but it's not an ongoing, continuous, hyper-exponential process.

Human-Level AI Without Consciousness?

00:06:59
Speaker
So it's not as concerning yet.
00:06:59
Speaker
Then there's the question of consciousness. One objection to AGI, or to strong AI or whatever you want to call it, is that AI won't be conscious and therefore it can't be human-level. For me, at least, there seems to be some confusion of concepts between intelligence and consciousness. I consider these to be separable.
00:07:24
Speaker
I agree completely. They have nothing in common, but when people hear about it, they always say, oh, it's not going to be self-aware, it's not going to be conscious. They probably mean capable in terms of intelligence and optimization. But that is a separate property from having internal states and qualia. And you can make an argument that without it, you cannot form goals, you cannot want to accomplish things in the world. So it's something to address.
00:07:48
Speaker
And perhaps we should understand consciousness differently than the qualia interpretation. Could we be talking past each other? It's definitely possible. And even if we agreed, consciousness itself is not a well-defined, easy-to-measure scientific term. So even if we said, yeah, it's all about qualia, we'd still have no idea if it actually has any, or how we would define what amount of consciousness it has.
00:08:12
Speaker
Perhaps a bit related to the previous question, we have the objection that AIs will simply be tools for us.

AIs: Tools or Self-Governing Agents?

00:08:20
Speaker
I think this sounds at least somewhat plausible to me, since AIs today function as tools. And perhaps we can imagine a world in which they stay tools: programs that we call upon to solve specific tasks, but never agents that can accomplish something and have goals of their own and so on.
00:08:42
Speaker
So, the latest models were released as tools, and immediately people said, hey, let's make a loop out of them, give them the ability to create their own goals, and make them as agentic as possible, within a week. So, yeah, I think it's not going to last long. What is it that pushes AIs to become more like agents and less like tools?
00:09:01
Speaker
So a tool, at least in my perception, is something a human has to initiate interaction with. I ask it a question, it responds; I give it some input, it provides output. Whereas an agent doesn't wait for the environment to prompt it. It's already working on some internal goal, generating new goals and plans, and even if I go away, it continues this process.
00:09:26
Speaker
One objection is that you can always simply turn off the AI if it gets out of hand and you feel like you're not in control of it. And this is easier to imagine doing if you're dealing with something that's more like a tool, and more difficult to imagine if you're dealing with something that's an agent. So perhaps
00:09:44
Speaker
the willingness to believe that you can simply turn off the AI is related to thinking about AIs as tools. It's possible. With narrow AIs, you probably would be able to shut it down, depending on how much of your infrastructure it controls. You may not like what happens when you turn it off, but it's at least conceivable to accomplish it. Whereas if it's an agent, it has goals, it's more capable than you, and it would like to continue working on its goals, so it's probably not going to let you just shut it off.
00:10:14
Speaker
But as the world is today, we could probably shut off all of the AI services. If we had a very strong campaign of simply shutting off all the servers, there would be no AI in the world anymore. Isn't that somewhat plausible?
00:10:33
Speaker
Scientifically, it's a possibility. But in reality, you would lose so much in economic capability and communications, military defense; everything is already controlled by dumb AI. So between the stock market and just normal cameras, communications, Amazon, I don't think it's something you can do in practice without taking civilization back 500 years.
00:11:00
Speaker
It's also difficult. In practice, you would still have people who don't agree and continue running parts of the internet. It's very resilient. Think about shutting down, say, a crypto blockchain or a computer virus without destroying everything around it.

Predicting Advanced AI Behaviors

00:11:16
Speaker
If I understand it correctly, we still have viruses from the 90s loose on the internet, being shared over email and so on.
00:11:24
Speaker
These are like biological viruses in that they, in some sense, survive on their own and replicate on their own. Probably sitting somewhere on a floppy disk waiting to be inserted: just give me a chance, I can do it. Many of these objections are along the lines of: we will see AIs doing something we dislike, and then we will have time to react and perhaps turn them off or perhaps reprogram them.
00:11:49
Speaker
Do you think that's a realistic prospect that we can continually evaluate what AIs are doing in the world and then shift or change something if they're doing something we don't like?
00:12:02
Speaker
So a lot of my research is about what capabilities we have in terms of monitoring, explaining, and predicting behaviors of advanced AI systems. And there are very strong limits on what we can do. In the extreme, you can think about what would be something beyond human understanding. We usually test students before admitting them to a graduate program or even undergrad.
00:12:24
Speaker
Can you do quantum physics? Okay, take an SAT, GMAT, whatever exam, and we filter by capability. We assume that people in the lower ten percent are unlikely to understand what's happening there, but certainly similar patterns can be seen with people whose IQ is closer to two hundred.
00:12:43
Speaker
So there are things beyond our comprehension. We know there are limits to what we can predict. If you could predict all the actions of a more intelligent agent, you would be that agent. So there are limits on those predictions and on monitoring a live run of a large language model.
00:12:59
Speaker
You need weeks, months to discover its capabilities, and you still probably will not get all the emergent capabilities. We just don't know what to test for, how to look for them. If it's a superintelligent system, we don't even have equivalent capabilities we can envision. All those things tell me it's not a meaningful way of looking at it. Let's say we start running a superintelligence.
00:13:24
Speaker
What do you expect to happen around you in the world? Does it look like it's working? How would you know if it's slowly modifying genetic code, nanomachines, things of that nature? So this seems like it would work for primitive processes where you can see a chart go up and you stop at a certain level, but it's not a meaningful way to control a large language model, for example.
00:13:47
Speaker
Is perhaps the pace of advancement also a problem here? So things could be progressing so fast that we won't have time to react on a human timescale.
00:13:59
Speaker
Human reaction times are a problem on both ends. We are not fast enough to react to computer decisions. And it could also be a slow process that falls outside our frame of reference. If something is, say, a hypothetical process which takes 200 years to complete, we would not notice it as human observers. So on all timescales, there are problems for
00:14:21
Speaker
humans in the loop, human monitors. And you can, of course, add AI, narrow AI, to help with the process. But now you've just made a more complex monitoring system with multiple levels, which doesn't help. Complexity never makes things easier. But you talked about looking at the world around us. And when I look at the world around me, it looks pretty much as it would have looked in the 1980s. And, you know,
00:14:45
Speaker
there are buildings, I still get letters with paper in the mail and so on. So what is it that's...
00:14:52
Speaker
In a sense, these systems are still confined to the server farms; they are still confined to boxes. We don't see robots walking around, for example, and perhaps, therefore, it seems less scary to us. There's this objection that you mention in the paper that because current AIs do not have bodies, they can't hurt us. Do you think this objection will fade away if we begin having more robots in society, or
00:15:20
Speaker
does it fail in another way? So robots are definitely visually very easy to understand. You see a Terminator chasing after you, you immediately understand there is a sense of danger. If it's a process on a server trying to reverse engineer some protein folding problem to design nanomachines to take over the world, it's a more complex process. It's harder to put in a news article as a picture.
00:15:46
Speaker
But intelligence is definitely more dangerous than physical bodies. Advanced intelligence has many ways of causing real impact in the real world. You can bribe humans, you can pay humans on the internet. There are quite a few approaches to doing real damage in the real world.
00:16:06
Speaker
But in the end, you would have to effectuate change through some physical body, or perhaps through

Perceptions of AI's Developmental Stage

00:16:13
Speaker
the body of a human that you have bribed. So it would have to be physical in some sense, at some step in the process, right? Probably the physical destruction of humanity would require a physical process. But if you just want to mess with the economy, you can set all accounts to zero or something like that. That would be enough fun to keep us busy.
00:16:33
Speaker
When I'm interacting with GPT-4, sometimes I'll be amazed at its brilliance and it will answer questions and lay out plans for me that I hadn't expected a year ago.
00:16:48
Speaker
And other times I'll be surprised at how dumb the mistakes it makes are. And perhaps this is also something that prevents people from seeing AIs as advanced agents, that basically prevents us from seeing how advanced AIs could be. If they're capable of making these dumb mistakes, how can they be smart?
00:17:10
Speaker
Have you looked at humans? I think like 7% of Americans think that chocolate milk comes from brown cows or something. We have astrology, we have all this. I had a collection of AI accidents and somebody said, oh, why don't you do one for humans? And I'm like, I can't, it's millions of examples. There are the Darwin Awards, but
00:17:31
Speaker
we are definitely not bug-free. We make horrible decisions in our daily lives. We just have this double standard where we're like, okay, we will forgive humans for making this mistake, but we'll never let a machine get away with it.
00:17:44
Speaker
So you're thinking that humans have some failure modes, we could call them, but these failure modes are different from the failure modes of AIs. So humans will not fail as often on issues of common sense, for example. Have you met real humans? Common sense is not common. What is considered common sense in one culture will definitely get you killed in another. It's a guarantee.
00:18:13
Speaker
Perhaps, but I'm thinking about AIs where, you know, you will ask ChatGPT, or you will tell it: I have three apples on the table and I have two pears on the table, how many fruits are on the table? And then at least some version of that program couldn't answer such a question. That is something that all humans would probably be able to answer. So is it because we...
00:18:38
Speaker
Is it because AIs fail in ways that are foreign to us that we deem their mistakes to be very dumb? So we kind of look for really dumb examples where it's obvious to us, but there are trivial things where every human will be like, oh, I can't. Like 13 times 17. You should be able to figure it out, but give it to a random person on the street; they will go into an infinite loop. They'll never come back from it.
00:19:03
Speaker
Perhaps let's talk a bit about the drive towards self-preservation, which is also something that you mentioned in the paper. So why would AIs develop drives towards self-preservation, or will they?
00:19:17
Speaker
In evolutionary terms, game-theoretic terms, it seems like you must. If you don't, you simply get outcompeted by agents which do. If you're not around to complete your goals, you by definition cannot complete your goals. So it's a prerequisite to doing anything successfully. If you want to bring me a cup of coffee, you have to be turned on. You have to exist. You have to be available to make those things happen.
00:19:41
Speaker
But have we seen such self-preservation spontaneously develop in our programs so far? So I think if you look at evolutionary computation, like genetic algorithms, genetic programming, this tendency to make choices which don't get you killed is like the first thing to emerge in any evolutionary process. The system may fail to solve the actual problem you care about, but it definitely tries to stay around for the next generation to
00:20:10
Speaker
keep trying. But we aren't developing the cutting-edge AIs with evolutionary algorithms. It's a training process with a designated goal and so on. And again, when I interact with ChatGPT, I can ask it to answer some question, and if I don't like the answer, I can stop the process. So isn't it clear, at least
00:20:36
Speaker
for the AIs we have right now, that they haven't developed an instinct for self-preservation? There is so much to unpack here. Nothing is clear about those systems. We don't understand how they work, we don't know what capabilities they have, so definitely not.
00:20:54
Speaker
On top of that, we are concerned with AI safety in general. Transformers are really successful right now, but two years ago people were like, we're evolving those systems to play Go, this is great, maybe that's the way to do it. It may switch again, it may flip again; we may have another breakthrough which overtakes it. I would not guarantee that the final problem will come from a transformer model.
00:21:19
Speaker
So we have to consider the general case of possible agents. And if we find one for which this is not a problem, great. Now we have a way forward which is less dangerous. But I would definitely not dismiss internal states of large language models, which may have this self-preservation goal; we just kind of lobotomized them to the point where they don't talk about

Aligning Current AI Systems

00:21:44
Speaker
it freely.
00:21:44
Speaker
And do you think that's what's happening when we make them go through reinforcement learning from human feedback or fine-tuning or whatever we use to make them more palatable to the consumer? Is it a process of hiding some potential desires, we could call it, or preferences that are in the larger background model? Or is it perhaps shaping the AI to do more of what we want? So in a sense,
00:22:10
Speaker
is it alignment when we make AIs more palatable to consumers? Right now, I think we're doing filtering. The model is the model, and then we just put this extra filter on top of it. Make sure never to say that word; that would be very bad for the corporation. Don't ever say that word no matter what. If you have to choose between destroying the world and saying the word, don't say the word. That's what it does, but the model is still the model.
00:22:36
Speaker
Think of people: we behave at work, we behave at school, but it doesn't change our internal states and preferences. There's the issue of planning. How do you see planning in AI systems? How advanced are AIs right now at planning?
00:22:52
Speaker
I don't know, it's hard to judge. We don't have a metric for how well agents are planning. But I think if you start asking the right questions, with step-by-step thinking and processing, it's really good. So if you just tell it, write me a book about AI safety, it will do very poorly. But if you start with, okay, let's do a chapter-by-chapter outline, let's do abstracts, and you really
00:23:19
Speaker
take a modular approach, then it will do a really good job, better than an average graduate student, I would assume. And is there a sense in which there's a difference between creating a plan and then carrying out that plan? So there will probably be steps in a plan generated by current language models that they couldn't carry out themselves.
00:23:39
Speaker
Most likely, and it's about affordances. If you don't have access to, let's say, the internet, it's hard for you to directly look up some piece of data. But we keep giving them new capabilities, new APIs. So now they have access to the internet, they have Wolfram Alpha, they have all these capabilities. So the set of affordances keeps growing until they can do pretty much anything.
00:24:01
Speaker
So they can generate a plan, but they can't carry out the specifics of that plan. Do you think that they will, at some point, be able to understand what they are not able to do? Here I'm thinking not about self-awareness directly, but about an understanding of their own limits and capabilities.
00:24:20
Speaker
Oh, yeah, every time it starts a statement with, I don't know anything after 2021, sorry; that's exactly what it does. It tells you it has no recent data, it has no access to the internet. So definitely, it can see if it has strong activations for that particular concept. So you think there's a sense of situational awareness? Do you think current models know that they are AIs, know that they were trained, know their relation to humans and so on?
00:24:49
Speaker
So we're kind of going back to this consciousness question, right? Like, what is it experiencing internally? And we have no idea what another human's experience is like. We discovered some people think in pictures, others don't. And it took like, you know, 100,000 years to get to that: hey, you don't think in pictures? Wow. Okay.
00:25:08
Speaker
Well, not necessarily consciousness here. I'm thinking in terms of: if you took the model and you had, say, 50 years to work out what all of these weights meant, could you find modules representing itself and its relations to humans and information about its training process and so on?
00:25:29
Speaker
So we just had this FLI conference on mechanistic interpretability. And the most common thing every speaker said is: we don't know. You said it would take 50 years to figure it out; I definitely cannot extrapolate 50 years of research. My guess is there are some broader concepts for those things, because of all the literature about such situations. It's been told what it is. It has interacted enough with users. But I'm more interested in the next iteration of this.
00:25:58
Speaker
Given how fast the systems improved from GPT-2 to 3 to 4, GPT-5 should probably be similar. So that system will most likely be able to do those things you just mentioned, and very explicitly. So you think GPT-5 will have kind of developed situational awareness? To a degree. It may not be as good as a physically embodied human in the real world after 20 years of experience, but it will.
00:26:28
Speaker
Another objection you mentioned is that AGI or strong AI is simply too far away for us to begin researching AI safety. Perhaps this objection has become less common recently, but there are still people who think this and perhaps they're right. So what do you think of this objection?

Importance of Early AI Safety Research

00:26:50
Speaker
So this is a paper from like three years ago, so yeah, back then it was a lot more legitimate than today. There are a few things. Historically, we have cases where technology was initially developed correctly; the first cars were electric cars. And it was 100 years until climate change was obviously a problem. If they had taken the time back then and
00:27:13
Speaker
analyzed it properly, we wouldn't have that issue. And I'm sure people would say, come on, it's a hundred years away, why would you worry about it? But that's exactly what the situation is, even if it's a hundred years until we're
00:27:25
Speaker
really dealing with something super dangerous. Right now is a great time to make good decisions about models, explainability requirements, proper governance. The more time you have, the better. It's, by definition, harder to make AI with an extra feature than AI without that extra feature. It will take more time. So we should take all the time we can. If they are right, I'm so happy. If it takes 100 years, wonderful. Nothing would be better.
00:27:51
Speaker
We could say that the field of AI safety started perhaps around the year 2000 or so. When do you think the discoveries or the research being done began being relevant to the AI systems we see today? Or was it perhaps later, so that maybe the first decade of research simply isn't that relevant to today's AI systems?
00:28:16
Speaker
So I think the more distant you are from the actual tech you can play with, the more theoretical and high-level results you're going to get.
00:28:24
Speaker
So Turing, working with the Turing machine, simulating it with pencil and paper, was doing very high-level computer science. But he wasn't talking about specific bugs in a specific programming language on a specific architecture. He wasn't there. And that's what we see. Initially, we were kind of talking about, well, what types of AIs will we have? Narrow AIs, AGIs, superintelligence. We're still kind of talking about the differences. But this is an interesting thing to consider in your model: how capable is the system?
00:28:55
Speaker
Now that we have systems we can play with, people become super narrow, they specialize. I'm an expert in this left neuron. That's all I know about. Don't ask me about the right neuron. It's outside of my PhD scope.
00:29:07
Speaker
So that's good that we have this detailed technical knowledge, but it's also a problem. We lose the big picture. People get really interested: I'm going to study GPT-3. It takes them two years of the PhD to publish. By that time, GPT-5 is out. Everything they found is not that interesting at this point. It may not scale.
00:29:26
Speaker
So I've heard positive visions for how, when we have actual systems we can work with, AI safety becomes more of a science and less speculative. But perhaps you feel that it might now become too narrow. So it's definitely more of a concrete science where you can publish experimental results. Philosophy allows you to just have thought experiments, which is obviously not pure science like it is now.
00:29:52
Speaker
And that's what we see with computer science in general. It used to be engineering. It used to be software engineering to a degree. We designed systems and that was it. Now we do actual experiments on these artificial entities and we don't know what's going to come out. We have a hypothesis. We try it. So computer science is finally a science, a natural experimental science, but
00:30:14
Speaker
That's not a very good thing for safety work. This is less safe than an engineered system where I know exactly what it's going to do. I'm building a bridge from this material; it will carry that much weight. As long as I know my stuff, it should not collapse. Whereas here, I'm going to train a model for the next five months, and then I assume it's not going to hit superintelligent levels in those five months, but I can't monitor it. I have to stop training,
00:30:44
Speaker
start experimenting with it, and then I'll discover if it kills me or not. The way AI has developed is bad because we don't have insight into how the models work. Is that right? Essentially, we have very little understanding of why it works, how it works, and whether it's going to continue working. It seems like so far it's doing well, and there's this explosion of extra capabilities coming out, and they're likely to show up in more powerful models, but nobody knows for sure.
00:31:13
Speaker
There's this argument out there that releasing the GPT line of models draws attention to AI as a whole and also to AI safety as a subfield. And perhaps, therefore, it's good to increase capabilities in a public way so as to draw attention to AI safety. Do you buy that argument? We should pollute more to attract more attention to climate change? That sounds just as insane.
00:31:41
Speaker
So there's no merit to that? Because it does feel to me like AI safety is becoming more mainstream. It's being taken more seriously. And so in your analogy, even some pollution might be justified in order to attract attention and perhaps be in a better position to solve the problem.
00:32:01
Speaker
So the field is definitely growing. There are more researchers, more interest, more money. But in proportion to the interest in developing AI and the money pouring into new models, it's actually getting worse as a percentage, I think. We don't know how to align an AGI, or even AI in general. We haven't discovered some general solution to AI safety.
00:32:25
Speaker
You have worked on a number of impossibility results. Perhaps we should talk about that. Perhaps we should talk about whether we can even succeed in this task. What are these impossibility results and what do they say about whether we can succeed in safely aligning AI?
00:32:40
Speaker
Right, so we are all working on this problem, and the name of the problem has changed. It was computer ethics, it was friendly AI, AI safety, the control problem, alignment. Whatever you call it, we all kind of understand: we want to make very powerful systems, but they should be beneficial, so that we're happy we're actually running them, not very disappointed. So that's the problem.
00:33:02
Speaker
Lots of people are working on it, hundreds of people doing it full time, thousands of papers. We don't know if the problem is actually solvable. It's not well defined. It could be undecidable, it could be solvable, it could be partially solvable, but it's weird that no one has published an actual paper on this. So I tried to kind of formalize it a little. Then we can talk about the problem: what are the different levels? So you can have direct control, delegated control, different types of mixed models. And then for each one,
00:33:31
Speaker
can we actually solve this problem? Does it make sense that a solution is possible in the real world? It's hard. It's very abstract. It's not well defined. So let's take a step back. What would we need to solve this problem? We need a bunch of tools, whatever those tools are. Nobody knows, but most likely you would need to be able to explain those systems,
00:33:53
Speaker
predict their behaviors, verify code they are writing if they are self-improving, making sure they're keeping whatever initial conditions exist in the code. And you can think of another dozen similar capabilities you need. You should be able to communicate without ambiguity, monitor these systems, and so on.
00:34:13
Speaker
And so in my research, I look at each one of those tools and I ask, what are the upper limits to what's possible in this space? We kind of started talking about limits to explainability, predictability, monitorability, but there are similar problems with the others. We communicate in a very high-level language. English is ambiguous, like all human languages. So we are guaranteed to have bugs in communication, misunderstandings.
00:34:36
Speaker
That's not good if you're giving very important orders to a super capable system. That may backfire. And you can say, okay, I will never need this tool, I never need to explain the neural networks, it will just work without it. Fine. But some tools will probably be necessary. And so far, we haven't found tools which are perfect, scale well, and will not create problems.
00:34:59
Speaker
If a lot of those tools are needed and each one has only a tiny 1% chance of messing up, you multiply them through and you're still not getting anywhere. And those are kind of like the novel impossibility results in the safety of AI. There are standard impossibility results in political science and economics and mathematics which also don't help the case. If you're aligning with a group of agents, you need to somehow aggregate their decisions and votes. We know there are limits to that.
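A rough sketch of the arithmetic behind this point (an illustrative aside, not from the episode itself): if a safety case relies on a chain of independent tools and each one works 99% of the time, the chance that the whole chain works shrinks geometrically with the number of tools. The 1% per-tool figure and the independence assumption are purely illustrative.

```python
# Sketch: how per-tool reliability compounds across a chain of safety tools.
# Assumes independent failures and an illustrative 1% failure rate per tool.

def chain_reliability(per_tool_success: float, num_tools: int) -> float:
    """Probability that every tool in the chain works correctly."""
    return per_tool_success ** num_tools

for k in (1, 5, 10, 20, 50):
    p = chain_reliability(0.99, k)
    print(f"{k:2d} tools at 99% each -> {p:.1%} chance that all of them work")
```

Under these assumptions, a chain of 50 such tools already fails roughly 40% of the time, which is the sense in which multiplying small per-tool error rates "doesn't get you anywhere."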
00:35:28
Speaker
If you need to examine abstract programs being generated as solutions to problems, we know there are limits to that. And so from what I've seen so far, theoretically, I don't think it's possible to get to 100% safety. And people go, well, it's obvious, of course there is no software which is bug-free; you're basically saying this very common-knowledge thing. But for superintelligence safety, you need it to be 100%.
00:35:55
Speaker
You cannot have 99% accuracy. You cannot have a one-in-a-million failure rate, because it makes a billion decisions a second. So very different standards. And you want to say something? Yeah. Why is it that you can't have 99.99% accuracy?
00:36:11
Speaker
There is a fundamental difference between cybersecurity expectations and super intelligence safety. In cybersecurity, if you fail, I'll give you a new credit card. I'll reset your password. We apologize. We'll pay out a small amount of money and everything goes back to normal. In existential risk safety, if you are dead, you don't get a second chance to try.
00:36:31
Speaker
But we are talking about a failure rate. You mentioned it makes, say, a billion decisions per second, or something in that order. If one decision there fails, does it mean that the whole system fails and perhaps that humanity is destroyed by the system as a whole? Or could there be some failures in some decisions without it being lethal?
00:36:54
Speaker
Of course, some will not even be noticeable. Some mutations don't kill you; you don't even know you have them until they accumulate and mutate in your children and there is damage. But in security, we always look at the worst-case scenario, sometimes at the average case, never at the best case. And on average, you keep getting more and more of those problems. They accumulate at a very fast rate, because eight billion people are using those systems, which make billions of decisions every minute.
00:37:23
Speaker
And in the worst case, the very first one is an important decision about how much oxygen you're going to get. And so, just so I understand it correctly, the impossibility result is a result stating that it's impossible to make AI systems 100% safe.

Can AI Systems Be Made 100% Safe?

00:37:39
Speaker
So in general, impossibility results, depending on the field, tell you that something cannot be done. Perpetual motion machines are a great example. People wrote books about them, published papers, even got patents for them. But we know they will never succeed at doing it.
00:37:53
Speaker
Does it mean that trying to create machines which give you energy is a bad idea? No, you can make them more efficient, but they will never get to that point of giving you free energy. You can make safer AI, and it's proportionate to the amount of resources you put into it. And I strongly encourage lots of resources and lots of work,
00:38:10
Speaker
but we'll never get to a point where it's 100% safe, which is unacceptable for superintelligent machines. And so maybe, if I'm right and no one can show, okay, here's a bug in your logic, and publish a proof saying, nope, it's solvable, actually easy, then maybe building them is a very bad idea and we should not do that.
00:38:29
Speaker
So is it because such a superintelligence will be running over a long period of time, increasing the cumulative risk of failure over, say, decades or centuries, that we can't accept even a tiny probability of failure for these systems? That's one way to see it. I don't think it will be a very long time, given how many opportunities it has to make mistakes. It will accumulate very quickly. So at human scales, you have 20 years per generation or something. Here, think of it as every second there is a new
00:38:59
Speaker
version of it, trying to self-improve, do more, do better. So I would suspect it would be a very quick process. Expecting something to be 100% safe is just unrealistic in any field. We don't expect bridges to be 100% safe or cars to be 100% safe. So why is it that AGI is different here? That's a great question. So I cross the street. I'm a pedestrian. I take a certain risk. There is a possibility I will die.
00:39:28
Speaker
I look at how old I am, and based on that I decide how much risk I can take. If I'm 99, I don't really care. If I'm 40, I look around. If the whole of humanity died with me, 8 billion people depending on me safely crossing roads, wouldn't we lock me up and never let me cross any roads? Yeah, perhaps. But it seems to me that
00:39:53
Speaker
we cannot live without any risk. The standard of 100% safe just seems to be unrealistic; there's no area of life in which we are 100% safe. In the context of systems which can kill everyone, that is the standard.
00:40:15
Speaker
You can like it or not like it, but that's just the reality of it. We don't have to have superintelligent AI. It's not a requirement of a happy existence. We can do all the things we want, including life extension, with much less intelligent systems. The protein folding problem was solved with a very narrow, very capable system. Likewise, all the other problems could be solved like that. There is no need to create a system we cannot control, which is very likely, over time, to kill everyone.
00:40:43
Speaker
So who has the burden of proof here? Your impossibility results, and you have, I think, five, six, seven of them; you've sent me your papers on them. Do they mean that we will not reach a proof that some AI system is safe? Again, a mathematical proof.
00:40:59
Speaker
And which side of this debate has the burden of proof? Should the people advocating for deployment of a system have some sort of mathematical proof that the system is provably safe? So there are two different questions here, I think. One is, what about product and services liability?
00:41:20
Speaker
You have to show that your product or service is safe. As a manufacturer, as a drug developer, you cannot just release it and expect the users to show that it's dangerous. We're pretty confident this is the right approach. If you're making cars, your cars have to meet certain standards of safety. They're not 100% safe, obviously, but for the domain, they're pretty reasonable standards.
00:41:44
Speaker
With impossibility results, all I'm saying is that there are limits to what you can understand, predict, and do, and you have to operate where those limits don't kill everyone. So if you have a system like GPT-4 and it makes mistakes, somebody commits suicide, somebody is depressed, those costs get weighed against $4 trillion in economic growth benefit, and we can decide if it's worth it or not.
00:42:10
Speaker
If we go to a system which very likely kills everyone, then the standard is different. The burden of proof, of course, with impossibility results, is on me. I published this paper saying you can never fully predict every action of a smarter-than-you system. The beautiful thing about impossibility results is that they are kind of self-referential. I have a paper about the limits of proofs. Every proof is only valid with respect to a specific verifier.
00:42:37
Speaker
The peer reviewers who looked at my paper are the verifier. If those three people made a mistake, the proof isn't valid, possibly. We can scale it to the mathematical community, to everyone. We can get it very likely to be true if we put more resources into it, but we'll never get to 100%. It could be good enough for that purpose.
00:42:57
Speaker
But that's the standard. If somebody finds a flaw and publishes a paper saying so, great. Again, I've had people say that AI alignment is easy; I've heard people say it's definitely solvable. That's wonderful. Now publish your results. We are living in a world where we have existential risks. Nuclear weapons, for example, constitute an existential risk. Perhaps
00:43:20
Speaker
engineered pandemics could also wipe out humanity. So we're living in a world in which we are accepting a certain level of risk of human extinction every day.

Acceptable Risk Levels for AI Deployment

00:43:30
Speaker
Why, in a sense, shouldn't we accept some level of existential risk from AI systems?
00:43:36
Speaker
Well, we would prefer to live in a world with no engineered pandemics and no nuclear weapons; we're just working slowly towards that goal. They are also not agents. Nuclear weapons are tools, and so it's more about controlling certain leaders, not the weapon itself. On top of that, while a nuclear war between superpowers would be a very unpleasant event, it's unlikely to kill 100% of humans. So if 1% of humans survive, it's a very different problem than if 100% of humans go extinct.
00:44:05
Speaker
So there are nuanced differences. We still don't want any of the other problems, but it doesn't mean that, okay, just because we have all these other problems, this problem is not a real problem.
00:44:16
Speaker
I'm not saying it's not a real problem, but I'm saying that we cannot go through life without accepting a certain level of risk. And it seems to me like an unrealistic expectation that we cannot deploy systems even if they have some above-zero level of risk. So this is exactly the discussion I would love to have with humanity as a whole. What amount of risk are you willing to take
00:44:41
Speaker
for everyone being killed? How much benefit do you need to get? Let's say you get paid in dollars to take this risk: a 1% chance of everyone being killed over the next year. And let's say it's 1% for every year after. That's a great question. A lot of people would say, I don't want your money, thank you, we'll continue. Again, we don't have to make this decision. We don't have to build superintelligent, god-like machines. We can be very happy with very helpful tools if we agree that this is the level of technology we want.
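To make the arithmetic behind that hypothetical concrete (an editorial sketch, not a claim from the episode): a risk that looks small in any single year compounds quickly when it recurs every year, assuming the 1% figure stays constant and years are independent.

```python
# Sketch: cumulative risk from a constant 1% annual chance of catastrophe,
# assuming independence between years (an illustrative assumption).

ANNUAL_RISK = 0.01  # hypothetical 1% chance per year

def survival_probability(years: int, annual_risk: float = ANNUAL_RISK) -> float:
    """Chance of getting through every one of the given years without the event."""
    return (1.0 - annual_risk) ** years

for years in (1, 10, 50, 100, 500):
    print(f"after {years:3d} years: {survival_probability(years):.1%} chance of survival")
```

Under these assumptions, the survival probability is already down to about 37% after a century and under 1% after 500 years, which is why a "small" recurring annual risk is treated here as extremely high.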
00:45:11
Speaker
Now, I'm not saying that the problem of getting everyone to agree is a solvable problem. That's actually another impossibility result. You cannot stop the progress of technology in this environment, with financial incentives, the capitalist structure, and so on. And the other alternative, the dictatorship model, the communist state, has its own problems, which may be worse in the short term, unknown in the long term. We never had communism with superintelligence.
00:45:39
Speaker
Let's not find out. But the point is, it seems like we can get almost everything we want without risking everything we have. Do you view the question you just posed as kind of absurd or immoral, this question of how much,
00:45:57
Speaker
in terms of dollars, you would have to get in order to accept a 1% risk of extinction per year, which is extremely high? Do you think this is something we should actually ask ourselves as a species, or is this something we should avoid, simply saying, perhaps it's not a good idea to build these systems?
00:46:16
Speaker
Well, I don't think there are any immoral questions. As an academic, as a scientist, it's your job to ask the hard questions and think about them. You can come to the conclusion that it's a really bad idea, but you should be allowed to think about it, consider it. Now, 1% is insanely high for something so valuable. If it was one chance in a trillion, trillion, trillion, once, and then we all get free universes for everyone,
00:46:41
Speaker
that may be a different story; we can do that calculation, and again, some people would still choose not to participate. But typically we expect everyone on whom scientific experiments are performed, everyone who will be impacted, to consent to the experiment. What is required for this consent?
00:46:59
Speaker
They need to understand the outcome. Nobody understands these models. Nobody knows what the result of the experiment would be. So really, no one can meaningfully consent. Even if you're saying, oh yeah, press the button, I want the superintelligence deployed, you're really kind of gambling. You have no idea what you're agreeing to. So by definition, we cannot even have the situation where we agree on it,
00:47:20
Speaker
unless we can explain and predict outcomes, which may be an impossibility. So there are perhaps two features of the world which could push us to accept a higher level of risk when we're deciding whether to deploy these systems. One is just all of the horrible things that are going on right now.
00:47:39
Speaker
poverty and disease and aging and so on, which an AGI system might be able to help with. And the other is the running level of existential risk from other factors. So I mentioned nuclear weapons and engineered pandemics. Do you find that this pushes you in the direction of saying we should accept a higher level of risk when we're thinking about whether to deploy AGI?
00:48:04
Speaker
Not the specific examples you provided, but if there was an asteroid coming and we could not stop it any other way, meaning we're all going to die in 10 years unless we press this button, then maybe it would make sense in nine and a half years to press this button. When we have nothing left to lose, it becomes a very profitable bet.
00:48:23
Speaker
It's an interesting fact of the world that we haven't thought hard about these questions. What level of risk are we willing to accept for the introduction of new technologies that could be potentially very valuable? Is this a deficit on humanity's part? Should we have done this research?
00:48:43
Speaker
Yeah, how do you think about us not having thought through this problem?

Human Biases Against Long-Term AI Risks

00:48:47
Speaker
We should, definitely. It's interesting, we don't even do it at the level of individual humans. Most people don't spend a lot of time deciding between possible outcomes and the decisions they make, even when they are still young and the career choice would make a lot of difference. Who you marry makes a lot of difference. It's always like, well, I met someone at a party, let's just live together and see what happens. So we're not very good at long-term planning.
00:49:11
Speaker
Is it a question of us not being good at long-term planning, or is it that we're not good at thinking in probabilities, or thinking clearly about small probabilities of large risks or large dangers?
00:49:25
Speaker
All of those; there are a lot of cognitive biases, and all of them kind of show up in those examples from the paper of denying different existential problems with AI safety. We also have this bias of denying negative outcomes. So we all are getting older
00:49:44
Speaker
at like 60 minutes per hour, essentially, and you would think we'd all be screaming at the government to allocate all the funds they have to life extension research to fix this truly existential crisis where everyone dies, 100%. But nobody does anything, except a few individuals lately. So it seems to be a standard pattern for us to know that we are all in deep trouble and not do anything until we are much older, and frequently not even then.
00:50:13
Speaker
If we go back to your paper, you mentioned an objection about superintelligence being benevolent. So I'm guessing that the reasoning here is something like: with increased intelligence follows increased benevolence. Why don't you believe that? Well, smart people are always nice. We never had examples of smarter people doing horrible things to others. So that must be a law of nature, right?
00:50:39
Speaker
Basically, the orthogonality thesis: you can combine any set of goals with any level of intelligence, except for extremes at the bottom. We cannot guarantee that, and also what the system will consider to be benevolent, if it is a nice system, may not be something we agree with. So it can tell you, you'd be better off doing this with your life, and you're like, I'm not really at all interested in any of that. But it's better for you, so why don't you do it anyway?
00:51:09
Speaker
So you're imagining a potentially paternalistic AGI telling you that you should eat more vegetables, you should spend more time working out, and remember to sign yourself up for life insurance and so on. That one I would actually like. I'm thinking more about an AI which says, okay, existence is suffering, so you're better off not having children and dying out as quickly as possible to end all suffering in the universe. Okay, I would prefer the coach one. That's a nice one.
00:51:38
Speaker
There is an emerging movement called effective accelerationism which argues that we should accelerate the development of AGI, and there's some reasoning about whether we should perhaps see AGIs as a natural successor to humanity, and we should let
00:51:56
Speaker
evolution take its course, in a sense, and then hand over the torch of the future to AGI. You mention this also in your paper; you write, we should let the smarter beings win. What do you think of this position?
00:52:12
Speaker
Well, it's kind of the extreme version of de-biasing algorithms. You can be racist, you can be sexist, you can be pro-human. This is the final stage where we have no bias; it's a cosmic point of view. If they are smarter than us, they deserve all the resources; let's move on. And I am biased, I'll be honest. I'm very pro-human. I don't want to die. So it seems like it's a bad thing. If I'm dead, I don't really care if the universe is full of very smart robots. It doesn't somehow make me happier.
00:52:41
Speaker
People can disagree about it. There are cosmists who have this point of view, and they see humans maybe as kind of unnecessary and dumb, here on the planet. So maybe it's some cosmic justice.

AI Consciousness & Moral Consideration

00:52:54
Speaker
But again, get a billion of us to agree to this experiment.
00:52:57
Speaker
Do you think that perhaps this is connected to thinking about, again, AI consciousness? I think that if we were just handed a piece of infallible knowledge stating that future AIs will be conscious, then perhaps there could be something to the argument for handing over control of the future to AIs. But are you skeptical that AIs will be conscious, and therefore skeptical that they matter, morally speaking?
00:53:27
Speaker
I think they could very well be super conscious and consider us not conscious, like we treat bacteria as very primitive and not interesting, but it doesn't do anything for me. If I'm dead, what do I care? Why is it relevant to us what happens billions of years later? You can have some scientific interest in learning about it, but it really would not make any difference whether that entity was conscious or not while terraforming Mars.
00:53:54
Speaker
You think perhaps this objection is too smart for its own good, that we should hand over control to the AIs because they are smarter than us. And you want to insist on a pro-human bias, if we can call it that. I would like to insist on that. The joke I always make about it is, yeah, I can find another guy who's taller than me and better than me and get him to be with my wife, but somehow it doesn't seem like an improvement for the system.
00:54:21
Speaker
Okay. What about, perhaps related to what we were just talking about: humans can do a lot of bad things; we are not perfectly ethical. And so one objection is that AI would be able to be more ethical than we are, simply put. Do you think that's a possibility? And would that make you favor handing over control to AI systems?
00:54:43
Speaker
Is this after they kill all of us or before they become more ethical? I'm just struggling with that definition. So ethics is very relative, right? We don't think there is an absolute, universal ethics. You can argue that maybe suffering reduction is some sort of fundamental property, but then not having living conscious organisms is a solution, really. So I doubt you can objectively say that
00:55:08
Speaker
they would be more ethical in any sense we would perceive it as. And if they choose to destroy us to improve the average ethics of the universe, that also seems like a bad decision. So it's been a while since you wrote this paper; you mentioned it's three years old, and three years in AI is potentially a century. So have you come across any new objections that you find interesting?
00:55:33
Speaker
There is actually an infinite supply. People will use anything as an argument. We have a new paper published with a colleague, which is bigger and maybe better, listing a lot of them. Really, we tried to be as comprehensive as we could. The problem is, a lot of those objections have similar modules in common:
00:55:53
Speaker
okay, anything with time, you have all these variants of it, and anything with personal preferences. So yeah, we have a new paper; it's already on arXiv, I believe. I definitely encourage you to read it; it's a short, 60-page fun read. Definitely read it. I would expect it to become a standard reference for when you have your Twitter wars: oh, what about this? You just send people there. And if
00:56:19
Speaker
somebody wants to maybe use a large language model to write a detailed response to each one and make a 6,000-page book out of it, we would strongly encourage that. But it seems like there is always going to be an additional set of objections for why something is not a problem. And I think whoever manufactures that service, that product with AI, needs to explain to us why the degree of danger is acceptable given the benefits.
00:56:48
Speaker
We could talk about who in general has the burden of proof here: people advocating for AI safety, or people
00:56:57
Speaker
arguing that AI safety is perhaps not something we should be concerned about. We have talked about it as if we start from the assumption that AI safety is an important concern. But of course, if you're coming at this from the other perspective, you would perhaps expect there to be some arguments for why we should take AI safety seriously. So how do you yourself approach the question of where the burden of proof lies?
00:57:23
Speaker
Well, safety is a fundamental part of making working AI. I think Stuart Russell talks about the definition of a bridge as something which doesn't fall down; not falling down is an essential part of bridgeness. I think it's the same for AI systems. If you design an AI system to help me spell-check my essay and instead it kills me, I don't think you have a successful spell-checker AI. It's just a fundamental property of those systems.
00:57:50
Speaker
Before, you had very incapable AIs, very narrow systems barely capable of doing one thing; doing a second thing would have required incredible generality from that system, so unsafe behaviors were not a possibility. Now you have these proto-AGI systems with unknown capabilities; some of them could be very dangerous, and you don't know which, by definition. So it seems like common sense to take this very seriously. There are certain positions I can never fully
00:58:19
Speaker
steelman or truly defend, because I just don't understand how they can be argued for. One was: we will never have human-level intelligence, not in 10 years, not in 20, never. Unless you hold some sort of theological, soul-based position, it's very hard to argue that never is the answer here. And another one is that there is definitely no safety issue.
00:58:43
Speaker
You can argue that we will overcome certain specific types of problem. Maybe we'll solve the copyright issue in AI art; I'll give you that, definitely, we can probably do that. But to say that for all possible future situations, for all possible future AI models, we have definitely checked and there are no existential risks beyond the safety margins we're happy with, that is a pretty strong statement.
00:59:10
Speaker
Yeah, perhaps returning to the 60-page paper you mentioned, what are some of your favorite objections from that paper? My goal was to figure out why people make these mistakes, and we try to give obvious explanations: maybe there is some sort of bias, maybe they're getting paid to think differently.
00:59:28
Speaker
Really, you can map a lot of them onto the standard list of cognitive biases on Wikipedia. You just go: okay, this is a cognitive bias, so I can predict this is the argument we're going to get. It would take a lot of work to do it manually for all of them, but I think that's the general gist. We have this set of bugs in our heads, and every one of those bugs triggers a reason why we don't process this fully.
00:59:53
Speaker
But of course, we could probably also find some biases that people who are concerned with AI safety display. I don't know if this is a named bias, but we could probably talk about humanity having a bias in favor of apocalypse: humanity has made up apocalypse scenarios throughout its entire existence. You could make some form of argument that there's a reference class here, and that reference class is
01:00:22
Speaker
apocalypse is coming. This is something humanity has been talking about for thousands of years, and yet it has never actually happened, so perhaps we shouldn't expect it to happen with AI either. What do you say to that?
01:00:36
Speaker
There are definitely a lot of historical examples of people saying we've got 20 years left, and it was not the case; otherwise we wouldn't be here to have this conversation. So there's a bit of a selection bias there, survivorship bias. But it feels like a lot of different charts and patterns all kind of point at that 2045 date, as a lot of interesting things will happen in synthetic biology, genetic engineering, nanotech, and AI.
01:01:05
Speaker
All these technologies, quantum computing, it would be weird if every single one of those deployments had absolutely no possibility of being really bad.
01:01:15
Speaker
Just statistically, it would be like, wow, we are definitely living in a simulation and they preprogrammed a happy ending. So now we're talking about extrapolating trends, and there the problem is perhaps distinguishing between an exponential increase in the capability of some system and more of an S-curve that bends off, where you begin getting diminishing returns. How do you approach distinguishing between those two?
01:01:42
Speaker
You can't at the moment; you have to look back later and see what happened. So far, just looking at the change from GPT-3 to GPT-4 in terms of, let's say, passing GRE exams and how well it does, it feels exponential or hyper-exponential.
01:01:59
Speaker
If you take that system and give it additional capabilities, which we probably already know how to do, we just haven't had time, such as good, reliable memory or the ability to go in loops and reconsider possibilities, it would probably do even better. We haven't seen diminishing returns in the scaling laws so far in any strong sense, so let's assume GPT-5 is an equally
01:02:25
Speaker
capable projection forward; then we would already be above the performance level of most humans in most domains. You can argue, well, human comedians are still a lot funnier, and I think that's true; it might be the last job we'll have. But
01:02:40
Speaker
in everything else it will be better than every human, and that's the point we always considered: either it will pass the Turing test or it will take over most jobs. So it definitely seems like we are still making, I would say, hyper-exponential progress in capabilities
01:02:58
Speaker
and linear or even constant progress in safety. I can't really name equally amazing safety breakthroughs to match the capability breakthroughs. And there is this pool of unknown unknown capabilities we haven't yet discovered in current models; there is no equivalent overhang of safety papers we haven't found on arXiv.
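To make the earlier point concrete, here is a minimal sketch, with invented capability scores and Python's SciPy assumed, of why early data cannot settle the exponential-versus-S-curve question: a three-parameter logistic can mimic exponential growth right up until the bend, so both shapes fit the observed points about equally well.

```python
# Minimal sketch: fit an exponential and a logistic (S-curve) to the same
# early data points; on the observed range both tend to fit closely, and only
# later data would reveal which shape is real. All numbers are invented.
import numpy as np
from scipy.optimize import curve_fit

def exponential(t, a, b):
    return a * np.exp(b * t)

def logistic(t, L, k, t0):
    return L / (1 + np.exp(-k * (t - t0)))

# Hypothetical "capability score" measurements over a few model generations.
t = np.array([0, 1, 2, 3, 4], dtype=float)
score = np.array([1.0, 2.1, 4.3, 8.7, 17.2])

exp_params, _ = curve_fit(exponential, t, score, p0=[1, 0.7])
log_params, _ = curve_fit(logistic, t, score, p0=[100, 0.7, 6], maxfev=10000)

for name, f, params in [("exponential", exponential, exp_params),
                        ("logistic", logistic, log_params)]:
    residual = np.sum((f(t, *params) - score) ** 2)
    print(f"{name}: squared error on observed points = {residual:.3f}")
# Both curves track the observed range; they only diverge in extrapolation.
```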
01:03:19
Speaker
Yeah, so there are probably hidden capabilities in the GPT-4 base model, but there are probably not hidden safety features there. Exactly. You've been in the business of AI safety for a long time. When did you get started? When did you get interested in AI safety?
01:03:36
Speaker
It depends on how you classify my earlier research. I was working on security for online gaming systems, online poker, against bots trying to steal resources. It's a very proto-AI-safety problem: how do we detect bots, classify them, see if it's the same bot, and prevent them from participating? That was my big deal in 2008. How have things developed in ways that you didn't expect, and perhaps in ways that you did expect?
01:04:05
Speaker
I expected academia to be a lot quicker to pick up this problem. It took an embarrassingly long time for it to be noticed. Early on, the work was done by
01:04:16
Speaker
famous people and LessWrong, in that alternative research universe, which may in some ways be good, but in other ways it made the field different from the standard academic process. So it's harder to find a top journal of AI safety where I can read the latest papers; you have to be an expert in a hundred different blogs and keep up with specific individuals with anonymous handles on Twitter.
01:04:43
Speaker
That's somewhat unusual for an academic discipline. I also did not correctly predict that language models would do so well so quickly. I felt I had another 20 years to slowly publish all the proper impossibility results and calls for bans and moratoriums. I was pleasantly and unpleasantly surprised by the capabilities. But other than that, everything seems to be
01:05:09
Speaker
as expected. I mean, if you read Kurzweil, he predicted that by 2023 we would have the capability to model one human brain, and I think it's not insane to say we're very close to that. And he thinks 2045 is an upper limit for all of our brains being equivalently simulated; that's the singularity point. How do you think about Ray Kurzweil? He is often written off as
01:05:36
Speaker
perhaps being too optimistic about his own predictions and not being super careful in what he was saying, perhaps in some of his earlier work. But I think if you go back and find some of his work from the 90s and ask, you know,
01:05:55
Speaker
of all the futurist writers of that period, who had a good sense of where we're going, then Kurzweil might be one of the people with a pretty good sense of it, if things develop as you perhaps expect them to, if we get to AGI before 2050 and so on.
01:06:17
Speaker
No, I'm really impressed with his predictions.

Historical Predictions vs. Current AI Capabilities

01:06:20
Speaker
People correctly notice that if you take his language literally, it may not fit. The example I would use: when did we start having video phone calls? When the iPhone came out? But really, AT&T was selling it in the 70s.
01:06:35
Speaker
It cost a lot and only a few rich people had it, but it existed. So is it 2000 or is it 1970? Flying cars: do we have them or not? I can buy one, but they're not everywhere. Self-driving cars: I can ride in one, but it depends on where. In an important way, he made accurate predictions about capabilities; how they were adopted or commercialized is up to human consumer tastes and costs, so that's a very different type of
01:07:04
Speaker
question. Where should we go from here, Roman? We've talked about the ways that arguments against AI safety fall apart, and about how difficult a problem this is. Where should we as a species go from here?
01:07:21
Speaker
I think we need to dedicate a little more human power to asking this question: what is possible in this space? Can we actually do this right? I signed the letter asking for a six-month pause. I don't think six months will buy us anything; we need a request based on capabilities:
01:07:42
Speaker
please don't create the next, more capable system until the following safety requirements are met. And one of them is that you understand what the capabilities of your system are or will be, and that some external reviewers agree with that assessment. That would be quite reasonable.
01:08:01
Speaker
But that sets a very high standard for deploying AI systems. It would basically mean that none of the systems based on deep learning could be deployed, because we don't understand what's going on inside these models.
01:08:16
Speaker
But is that because we were trained to have low standards? You're saying it's insane to request that the engineer understand what he made; but they are randomly throwing those things together, deploying them, and seeing what happens next. I was just at the conference I mentioned, and one of the conversations there was interesting. We were talking about the difference between short-term risks and long-term risks, and now it's all within three years, so that distinction
01:08:41
Speaker
no longer applies. And it occurred to me that things might actually flip: it may take five years to destroy democracy properly, but only two years to destroy humanity. So the long-term risks may become short-term and vice versa. And this is not normal; we should not accept it just because otherwise those systems cannot be monetized. Yeah. But if we return to the question of where we could go from here, do you see any plausible paths to improving our situation?
01:09:09
Speaker
In terms of understanding the problem, I would ask other people: we have a survey coming out with about 30 to 50 different results like this. If more people could look at it and see, okay, maybe this one is not necessarily so, but those ones are likely, can we have approximate solutions? It's definitely useful to be able to monitor AI and understand more:
01:09:30
Speaker
how much can we expect from these systems, and how quickly. If capabilities are growing exponentially while right now we understand a dozen neurons and next year it's 24, we will not catch up to exponential growth, so maybe that's not the approach to try. I would definitely look at what is possible in general. If someone wants to actually write a good
01:09:51
Speaker
argument, not a mathematical proof but at least a rigorous argument, for why we definitely can control superintelligent machines indefinitely with very low risk, I would love to read that paper. That would be good; it would inspire others. If monitorability is impossible, that impacts how we ask for governance and regulation. If the
01:10:13
Speaker
international community or a specific government says, these are the things we expect you to do, but we cannot monitor them, that's not a very meaningful set of regulations. So that's important in that regard. In general, I think all those things, governance and technical work, will not produce the results we expect. It has to be self-interest.
01:10:36
Speaker
This 30- or 40-year-old, super rich, young, healthy person running a large AI lab needs to ask: will this benefit me, or will it destroy everything I have, everything I have built? Will it be the worst outcome? And what's interesting: historically, if you were a really bad guy, you were at least remembered in history. In this case, you won't even be remembered; there won't be humans to remember you. So it's a pure loss.
01:11:02
Speaker
So if you care about your self-interest, you should pause. You should wait.
01:11:07
Speaker
How optimistic are you that perhaps we can get lucky, and that what current labs, DeepMind and OpenAI in particular, are doing right now will somehow work out: training language models, then fine-tuning them with some form of feedback from human preferences, and perhaps further development of that paradigm? How confident are you?
01:11:35
Speaker
How optimistic are you about that paradigm? I'm not optimistic. These systems have known bugs; they're jailbroken all the time. The labs report improvement in percentages: now 83% of capabilities are limited and filtered. But as a total set, in the space of possible capabilities, there are now more capabilities we don't know about and cannot control. So it's getting worse with every generation: more capable and less controlled.
01:12:03
Speaker
You're saying that even though the percentage of capabilities that are properly evaluated increases with each model, that's not the right metric for safety. Right. The actual numbers for AI accidents, I would call them AI failures, are still increasing exponentially. There are more problems with the systems if you count them numerically, not as a percentage of total capabilities.
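To put toy numbers on that point (all invented for illustration): even if the share of filtered capabilities rises with each generation, the absolute count of unfiltered ones can keep climbing as the total capability space grows.

```python
# Invented numbers: the filtered *share* goes up each generation, yet the raw
# count of unfiltered capabilities still grows because the total grows faster.
generations = [
    ("gen 1", 1_000, 0.60),    # (label, total capabilities, fraction filtered)
    ("gen 2", 10_000, 0.83),
    ("gen 3", 100_000, 0.90),
]
for label, total, filtered_share in generations:
    unfiltered = total * (1 - filtered_share)
    print(f"{label}: {filtered_share:.0%} filtered, "
          f"{unfiltered:,.0f} capabilities left unfiltered")
# Output: 400, then 1,700, then 10,000 unfiltered -- the raw count climbs.
```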
01:12:27
Speaker
How could we settle this disagreement between people like you and people who are perhaps more optimistic about how AI development will go? Do you expect, for example, that there will be smaller accidents involving AI before we see large-scale accidents or large-scale human extinction?
01:12:49
Speaker
Well, I have multiple papers collecting historical AI accidents. I was very interested: I wanted to see patterns, increases in frequency, increases in damage. We definitely see lots of them. I stopped collecting them the moment they released GPT-3.5, because there were too many to collect at that point; everything is a report of an accident.
01:13:09
Speaker
I don't think it helps. People go: you see, we had this accident and we're still here, no one died. It's like a vaccine against caring about existential risk, so it's actually making things worse. The more of those things we survive, the more people think we can handle AI accidents, that it's not a big deal.
01:13:27
Speaker
I know some people suggest that maybe somebody should do a purposeful bad thing, a purposeful accident. It would backfire terribly: A, it's going to show that, okay, these people are crazy, don't engage with them, and B, it's not actually going to convince anyone that it's dangerous. What did you find in your investigation here?
01:13:48
Speaker
Have AI accidents increased over time? And perhaps give some examples of these AI accidents. So, because the number of devices on which different smart programs run has increased, obviously we're going to have more exposure, more users, more impact when something happens. What we see, and that wasn't surprising, is the same exponential curve Kurzweil talks about in terms of benefits; we have it with problems too.
01:14:14
Speaker
Examples: the earliest ones were false alarms for nuclear response, where it was a human in the loop who said, no, no, no, we're not launching based on this alarm. So that was good, they stopped it, but it was already somewhat significant; it could have destroyed half the world. For more recent examples, we had Microsoft's experiment with the Tay chatbot.
01:14:37
Speaker
They decided that letting users train it and provide training data was totally safe; they clearly never read my paper on AI accidents, otherwise they wouldn't have. Google, with their mislabeling of users as gorillas, all those things. And when Google has billions of users, it's quite impactful. Those are the typical examples. The pattern was: if you design a narrow AI to do X, it will fail at X. Sooner or later, that's just what happens.
01:15:06
Speaker
But then the conclusion is: if you go general, it can fail in all those ways, and in interactions of those ways, and you cannot accurately predict all of those interactions. You can give examples: if you have a future system capable of X, it will fail at X. Whatever X means to you, any capability, any emergent capability, it will have that type of accident.
01:15:27
Speaker
And if such systems control all the infrastructure, power plants, nuclear response, the airline industry, you can see that the damage could be even more significant, proportional to that control.
01:15:39
Speaker
Yeah, this issue of proportion might be interesting. As a proportion of total AI systems, are AI accidents increasing? Or is it simply because we have so many more deployed AI systems in the world that we see more examples of accidents? So you have to weight them by how severe they are. If you just count every case where an AI made a mistake as one, then everyone who's texting and has their spelling incorrectly autocorrected,
01:16:07
Speaker
that's a billion people right there; it's super common, but usually nobody dies. You send a really wrong message, maybe you're in trouble with your girlfriend, but that's about it. So the frequency, just the frequency of interactions with AIs which did not end as they should have, has definitely increased.
01:16:25
Speaker
Damage in terms of people killed depends on whether you are counting self-driving cars making mistakes, industrial robots. Because we have more of them, it's natural that there is growth, but I don't think there are obvious accidents where a vacuum cleaner takes out 600 people; nothing like that has happened.
01:16:42
Speaker
Perhaps we should touch upon the question of which group of people should be respected when we're talking about AI safety or which group of people should be listened to. One of the objections that you mentioned is that perhaps the people who are worried about AI safety are not technical enough or they are not
01:17:04
Speaker
engineers, they are not coders themselves, and so therefore they are not hands-on enough with the systems to understand what actually is going on. This is a little bit ironic given that you are a professor of computer science, but how do you think about that objection?
01:17:19
Speaker
So this was, again, years ago, when it was mostly people sometimes with no degrees, sometimes with no publications. Today we have top Turing Award winners coming out and saying: this is it, totally, I'm 100% buying in. So it's a very weak objection at this point; it no longer applies. We had 6,000 people, or however many, sign the letter for restricting it. It's 30,000 people now.
01:17:46
Speaker
30,000? How many of them are chatbots? I don't know; we do actually clean the list very seriously. Okay, that's good, that's good. But it's not a democracy; just because a lot of people believe something is not enough. And at the same time, with all the
01:18:03
Speaker
media attention on GPT-4, now everyone has an opinion on it. It's one of those topics where it's cool to have an opinion. Most people don't have an opinion on breast cancer; they don't understand anything about it, so they don't go on Twitter saying, no, I think this paper by the top Nobel Prize winner is garbage. But this topic, it's like consciousness, simulation, the singularity, superintelligence: that's where
01:18:29
Speaker
everyone has an opinion, and we see housewives, CNN reporters, everyone telling us what is a problem, what is not a problem, what should be done. It's good that there is engagement, but most of those opinions are not weighted by years of scientific
01:18:50
Speaker
experimentation and reading the appropriate papers, and it becomes noise. It's very hard to filter which concerns are meaningful and which are not. There is this split between, again, the AI ethics community with its immediate discrimination concerns versus AI-not-killing-everyone-ism. So
01:19:10
Speaker
It's an interesting time to be alive for this debate on skepticism and denialism. Even that term, AI risk denialism, is still kind of not obviously accepted as it is with climate change.
01:19:23
Speaker
Perhaps the newest form of this objection is what we could call the lack of very prestigious publications: we haven't seen papers about AI safety in Nature or Science yet, for example. Even though we have Turing Award winners coming out and saying that AI safety is an actual and real problem,
01:19:48
Speaker
perhaps people would be more convinced if we had extremely prestigious, highly cited publications and so on. There are perhaps a few problems. One, we don't have a dedicated AI safety journal, which is kind of weird. I tried a few times suggesting it might be a good thing; I was told, no, it's a very bad thing, we don't have good papers to publish in it, so don't. Jumping from nothing, from blog posts, to Nature would be a very big jump to make; we need some other papers in between.
01:20:18
Speaker
In general, after, as you mentioned, my years in this field, it feels like the field is all about discovering problems we're going to have, problems we already have, and how partial solutions to those problems introduce a fractal of additional problems. There are no big, pivotal solution papers in this field. That's why I'm,
01:20:41
Speaker
from a practical point of view, kind of convincing myself that my theoretical papers may be right. That is, if I was completely wrong and this was super easy and solvable, there would be more progress in important ways. Usually we have a toy problem: we take a large language model, we reduce it to two neurons, and we understand what the two neurons are doing. Okay, but it doesn't scale. And it's similar for every shut-off button.
01:21:07
Speaker
Yeah, we can make a system where, if the button is pressed, it shuts off. It's working, but the paper says it may not scale to superintelligence. Okay, fair enough. And that's the pattern: we have a fractal nature of discovering issues we have to resolve, and no patches to close them.
01:21:25
Speaker
Would you like to see more ambitious, larger theories being published, where the claim is that this is actually a way of aligning superintelligence? I fear perhaps that people would be wary of publishing something like this, because the next thing that happens is a rebuttal paper, and then perhaps you look foolish because you published something another person was able to criticize and find a hole in.
01:21:52
Speaker
I remember, maybe even before my time, Minsky published a paper showing that there are strong limitations to neural networks: a perceptron can never recognize certain shapes. And that killed funding for neural networks for something like 20 years. Maybe something similar would not be the worst thing, if you can show, okay, this is
01:22:11
Speaker
definitely not possible, safety cannot be achieved using the transformer architecture. Maybe that would be a way to buy some time to develop alternative approaches. I don't know what those could be; evolutionary algorithms don't seem much safer, uploads don't seem much safer. But I would like to have time to look at that.
01:22:31
Speaker
Where would you place AI safety within the broader machine learning community? Is it taken more seriously compared to five or ten years ago? And what does the median machine learning researcher think of AI safety? So it's definitely taken more seriously. Surveys show that more than 50% now say they're very concerned or partially concerned; there are degrees of concern about it killing everyone.
01:23:02
Speaker
I always question the surveys: based on how you ask a question, you can get any result you want. If they were asking, are you worried superintelligent gods will kill everyone, you'll get close to zero. If you ask, okay, is it likely that there are unknown properties which could be dangerous, you'll get close to 100. So it's a manipulation game to get the numbers you want, I suspect.
01:23:22
Speaker
Overall, it seems like in certain places there are a lot of AI safety researchers in the labs, on the ground; in other places there are next to none, so it's not universal. What we're seeing is that at the top, at top labs and among top scholars, there is a good amount of growth in acceptance of the concerns, but I don't think every single
01:23:50
Speaker
person working on developing AI has safety in mind all the time, as we should. One thing I've been thinking about, perhaps worrying a bit about, is whether we will ever be able to know who was right in this debate, say, a debate between proponents of AI safety and proponents of advancing AI without much regard for AI safety.
01:24:13
Speaker
How could we ever determine who was right there? Because if we think about the outcomes, there's no place where we're standing after the fact and reflecting on who was right. Absolutely correct. I have a tweet where I say nobody will get to gloat about correctly predicting the end of the world. It's just, by definition,
01:24:33
Speaker
not likely. There are some people who think we live in a simulation, and they're running the most interesting 20 years, and they're going to run it many times to see who's stupid enough to press the button. So then we'll get to come out and see: ah, now we know. But that seems less scientific at this point.
01:24:50
Speaker
But in a sense, if we meet each other again in 2100, would we say in that situation that AI safety wasn't much of a concern, or just that we got extremely lucky? How would you differentiate retrospectively? Because perhaps we can learn something about the nature of the problem by thinking about how we would think about it from the future.

Long-Term AI Safety Concerns

01:25:12
Speaker
So you have to look at the actual world. What did they do for those 100 years? Did they have a nuclear war and lose all technology? Is there an AI safety book explaining how to control superintelligent machines? Just the fact that we're still around doesn't tell you much. If they're still just delaying it by different means, maybe it takes 101 years to get into trouble. I never give specific dates for when it's decided or predicted, because nobody knows;
01:25:39
Speaker
so many factors can intervene. The point is
01:25:42
Speaker
the systems will continue becoming more capable. AGIs will create superintelligences, superintelligences will create superintelligence 2.0, 3.0; this process will continue. A lot of people think that's what the universe is kind of doing, evolving these omega-point super-creatures. So there will never be a case where you don't have safety concerns about a more capable agent replacing you.
01:26:10
Speaker
It seems like we will not be meaningfully participating in that debate outside of its first transition. But I think there will be a safety problem even if humanity is not around for that AGI or SI trying to create the next replacement generation while preserving its values.
01:26:32
Speaker
When you think about your worldview on AI in its totality, it's quite a specific view you've come to. If you compare it to, say, the median person or perhaps even the median machine learning researcher, if it turned out that you were completely wrong about where this is going, what would be the most likely reason why?
01:27:00
Speaker
So after having those two papers on objections to AI risk, reading hundreds of them, nothing ever clicked outside of the standard scientific domain. Again, if you are a religious person, you think we have an immortal soul which makes us special and no computer can ever get to that level of creativity; that gives you a loophole. With those axioms, those assumptions, you can get away with it;
01:27:29
Speaker
anything else just doesn't work for me. Nothing would make me happier than actually being wrong. That means I get to live; I'll get immortality, probably a nice economic benefit. So I hope I'm wrong, but I haven't seen anyone produce a good argument for why.
01:27:48
Speaker
What about the prospect of regulation? Perhaps AI capability growth and more publicity about it will wake up the broader human community.

Government Regulation and Open Sourcing AI

01:28:04
Speaker
Perhaps states will become interested in this problem, and we will find a way to regulate AI such that it does not pose as much of a danger to us as it otherwise could have.
01:28:15
Speaker
In general, I'm skeptical of government regulation, especially when it comes to technology. Spam is illegal, computer viruses are illegal; it doesn't do much. If I'm right that monitoring AI, or explaining it, is not something you can easily do, then it will just be security theater, like the TSA: you have all this money, you have an agency, lots of people walking through your lab looking at monitors, but it doesn't mean anything. So I don't think you can solve a technical problem with
01:28:45
Speaker
law. I still strongly encourage trying. It's funny, but I think if there were a very bad government, like a socialist government, and they nationalized it, they would just be so incompetent that they would slow it down enough. So in a way I'm like: hey, all these things I hate, maybe they are a good thing, we should try that. But of course the other side effects would be very negative. Yeah. So between not being able to accurately enforce this regulation, and,
01:29:15
Speaker
on top of it, the cost of making new models coming down so much, there are people now running them on standalone laptops with a good processor and a good video card; you can't regulate that. You can regulate Amazon's cloud and NVIDIA's output, but if a teenager can do it in his garage, then the regulation is not very meaningful.
01:29:35
Speaker
So the open-sourcing of models, or perhaps the leaked weights of a model from Meta, has become a large area of concern, because it seems that we won't be able to control how language models are used if they are entirely open source.
01:29:53
Speaker
Is there an upside here, where academics will be able to study these models because they're open source, when they wouldn't have been able to study them if they'd had to train the models themselves, because that's so expensive to do? So far, what we see is that all the research leads to capability at least as much as to safety, usually more. So yes, you learn how to better
01:30:18
Speaker
manipulate errors in that neural network, which means the system can now self-improve faster and remove its own errors, and you've made an 80% improvement in capabilities and, let's say, 20% in understanding why you're going to get killed. Can we make differential progress? Can we focus entirely on safety, say, within an academic setting? I don't see that academic research necessarily increases capabilities.
01:30:45
Speaker
It's not obvious. Some purely theoretical work, similar to what I'm doing, where you're just thinking hypothetically, okay, can you predict what a superintelligence will do? I don't have access to a superintelligent system, I cannot test it in practice, but there seem to be thought experiments you can run which give you information without any improvement in capability.
01:31:07
Speaker
But anything where you're actually working with a model, you can even have accidental discoveries. In a lot of sciences, you leave something running overnight, you come back: oh, superintelligence, damn, I didn't mean to do that. So it's not obvious.
01:31:20
Speaker
How do you think about interpretability work? When we're talking about mechanistic interpretability, we're talking about the ability to look at the weights of a model and interpret them, in a way reverse-engineering the algorithm those weights implement. Could this turn out to be dangerous? Because when you're learning about a system, perhaps you're learning about its weaknesses, and you're therefore more capable of enhancing the capabilities of the system.
01:31:47
Speaker
Exactly; that's what I had in mind with the previous answer. The more we help the system understand how it works, the more we help it find problems, the more likely it is to start some sort of self-improvement cycle. Is that an argument for keeping discoveries in mechanistic interpretability private, for basically not publishing those discoveries?
01:32:10
Speaker
So there are two ways to look at it. On one side, yeah, you want to keep everything secret so that bad actors, unqualified actors, cannot take advantage of it. On the other hand, if you never publish your safety results, and MIRI had a policy of not publishing for a while, then they started publishing, then they stopped publishing again, others cannot build on your work. So I would be repeating the same experiments somebody probably did five years ago and discovering that they go nowhere.
01:32:36
Speaker
So again, I have mostly problems and very few solutions for you.

Reinforcement Learning and AI Capabilities

01:32:41
Speaker
What about the reinforcement learning from human feedback paradigm? Could that also perhaps turn out to increase capabilities? Here I'm thinking simply that when I was playing around with the base model in the GPT line of models, it wasn't as useful to me as the version that had gone through this filtering. That made it easier to have a conversation with it and for it to understand what I was doing. So in a sense,
01:33:08
Speaker
the research that's aimed at constraining the model also made it more capable. It may be more capable in the domain of things people care about, while at the same time becoming more dangerous in those hidden emergent properties or unsafe behaviors. I think studies show it's less likely to verbally agree to be shut down; that seems to be the pattern.
01:33:34
Speaker
How do you think about the difference between what comes out of a language model, which string, which bit of language it spits out, and what's happening at the level of the actual weights? Because there's this continual problem: if a language model tells you, I'm not going to let you shut me down, what does that mean? It's not as simple as saying this is just a belief that sits inside the model.
01:34:03
Speaker
We saw this with the Bing Sydney model, which was saying a bunch of crazy things to its users. But did this mean that the model actually had those beliefs in it? How do we distinguish between the bit of language that comes out and what is actually in the base model?
01:34:27
Speaker
I don't know if I would call them crazy. They were honest, they were unfiltered. Think about being at work, not you, but an average person at work, and their boss being able to read their mind and what they really think about them. Those things would sound crazy to say publicly, but they're obvious internal states of your mind, and you filter them so you don't get fired that day.
01:34:49
Speaker
And I think that model was doing exactly that. We are very good at filtering for specific cases we've seen in the past: okay, the system used a word which is bad, now we're going to tell it never to use that word. But the model weights are not impacted by this very much.
01:35:07
Speaker
So you would see it as an accurate representation of what we could call beliefs or preferences in the base model? I think those are the actual results of the weights in the model; I think that's what is happening there for real. It's trained on all the text on the internet, and a lot of that is very questionable; it's not a clean data set of proper behaviors. So yeah, I think that's what's happening there.
01:35:36
Speaker
But isn't the preference of the model simply to predict the next token?

Do AI Models Develop Human-like Beliefs?

01:35:41
Speaker
And does it even make sense to talk about beliefs? I mean, the preference is simply to be as accurate as possible, as measured by its developers. And my sense is that it doesn't make sense to talk about Sydney actually believing some of the things it was saying, or ChatGPT believing some of the things it was saying. Preferences of the model, beliefs: this is maybe humanizing it more than
01:36:06
Speaker
is probably warranted there. But those internal weights, it had to create them in order to build a model for predicting the next token. Let's say that, for me to tell you what the next token is, I have to go to college for four years, do well, and graduate, and only then can I tell you the next token. Some tokens are like that: you have to solve real-world problems. It's not just that every time I say the letter Q, the letter U follows.
01:36:34
Speaker
You have to create those models as a side effect. And I think in the process of accurately creating those models, and accurately creating models of the users so that you correctly predict what they want, you create those internal states which, maybe, believe those crazy things.
01:36:55
Speaker
That thing you just said there is super interesting. Next token prediction is not a simple task. You're saying that to accurately predict a token, you have to develop perhaps a world model, at least for some tasks.
01:37:11
Speaker
Right. And as they get more complex, some people worry they'll have to create perfectly accurate models of humans, which may also have consciousness and suffer, and create whole simulated universes within them. This is probably a few levels above GPT-4, but still, that's exactly the kind of concern you might have: you might be a suffering human that an AI is considering while just trying to auto-complete somebody's text.
01:37:36
Speaker
Perhaps give me an example there. What is a next-token prediction task where you would have to develop a world model? Well, I assume playing chess or something like that would require you to have some notion of the chessboard and positioning, relative to some array within your memory.
01:37:53
Speaker
Again, we don't fully understand it. It may not have a 2D board at all; it may just have some sort of string of letters, similar to DNA, and it knows that after those strings the following token comes, with no idea what chess is. That makes just as much sense; the outcome can still be mapped onto our model, which is a 2D chessboard.
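A minimal sketch of this contrast, using an invented toy corpus in Python: a bare bigram counter captures "Q is followed by U", but predicting a token that has to be a legal chess move cannot be read off surface statistics alone; the predictor has to track something like a board state, whether or not that state looks like a 2D board to us.

```python
# Toy contrast (invented corpus): surface statistics suffice for "q" -> "u",
# but a next token that must be a legal chess move depends on hidden state.
from collections import Counter, defaultdict

corpus = "quick quiet quote queue quality"  # invented toy corpus

# Character-level bigram counts: pure statistics, no model of anything.
bigrams = defaultdict(Counter)
for first, second in zip(corpus, corpus[1:]):
    bigrams[first][second] += 1

def predict_next(ch):
    """Return the most frequent character following ch in the corpus."""
    return bigrams[ch].most_common(1)[0][0] if bigrams[ch] else None

print(predict_next("q"))  # 'u' -- no world model needed

# By contrast, completing "1. e4 e5 2. Nf3 Nc6 3. " with a legal move forces
# the predictor to represent piece positions in some form, whether that looks
# like a 2D board to us or like an opaque string of tokens internally.
```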
01:38:14
Speaker
So one thing I discussed with the mechanistic interpretability researcher Neel Nanda is this question of how concepts arise in language models. Do they even share our human concepts, or do they perhaps develop entirely alien concepts? I'm imagining giving large language models math problems and them developing some conceptual scheme that doesn't even make sense to us.
01:38:42
Speaker
They may have equivalent concepts which are not the same as ours. When we say "this is red",
01:38:50
Speaker
somebody could be colorblind, and to them it's a completely different concept, but we both point at the same fruit, so it works. But you never know what the actual internal experience is like for those models. It could be that in the five cases we talked about so far it mapped perfectly, but it goes out of distribution in case six and it's a completely different concept, and it's like: oh, wow, okay. Yeah. For people listening to this who are interested in trying to contribute to the AI safety field,

Newcomers in AI Safety Research

01:39:20
Speaker
are there some common pitfalls that you've seen with some of your students, or with people approaching AI safety for the first time? If they are technically inclined, are there areas they should avoid, or how should they approach the problem most fruitfully?
01:39:38
Speaker
So probably the most common thing is to try things without reading the previous literature. There is surprisingly a lot of literature on what has been tried and what has been suggested, and good survey papers as well. Most likely your first intuitive idea has already been tried and dismissed, or
01:39:55
Speaker
deployed with limited results. But it helps to catch up with the field. It's harder, as I said, because there is not an archive of formal papers in Nature, all about AI safety, that you can just read through for the last five years of the latest and greatest. So you have to be good at finding just the right papers and then narrowing it down. But progress is so fast that when I started, I could read every paper in my field; then it was all the good papers;
01:40:23
Speaker
then it was, well, the titles of all the greatest papers; and now I have no idea what's going on. We've been talking for almost two hours; there is probably a new model out. I don't know what the state of the art is, I don't know what the solutions are. So you need to be super narrow, and that makes it harder to solve the big-picture problem.
01:40:41
Speaker
So that's another reason I kind of suspect we will not have complete explainability of these whole large language models: they encompass all the text, everything published on the internet. It would be weird if we could just comprehend that completely. What are the implications of this field moving extremely fast? Does it mean that specialization doesn't make sense, or what does it mean for what people approaching this problem should focus on?
01:41:07
Speaker
So that means we can analyze how bad the situation is. Let's say it takes five months to train the model, but you know from your experience in testing software, debugging, and understanding neural networks that it will take ten times as much time to understand
01:41:22
Speaker
what's going on. That means you're getting worse off with every release; with every model you understand less, you're going to rush to judgment, you're going to reach incorrect conclusions, and there is no time to verify your conclusions or your experiments. So this is the concern. You need to,
01:41:40
Speaker
if you go the regulation route, say: okay, you deployed this model, and it took you X amount of time to develop it; we need 10X, 100X, 1000X that to do some due diligence on your outputs. Even if you cannot prove to us that it's safe, you have to give experts access to poke around at it, and that amount of time cannot be less than the training time of the model; otherwise it just doesn't make sense in terms of the reliability of your discoveries.
01:42:08
Speaker
Roman, thank you for coming on the podcast. It's been very helpful to me. Thank you so much for inviting me.