
Training Machine Learning (ML) models on Kubernetes

S4 E11 · Kubernetes Bytes

In this episode of the Kubernetes Bytes podcast, Bhavin sits down with Bernie Wu, VP of Strategic Partnerships and AI/CXL/Kubernetes Initiatives at MemVerge. They discuss how Kubernetes has become the most popular platform for running AI model training and model inferencing jobs. The discussion dives into model training, covering the different phases of a DAG, and then looks at how MemVerge can help users with efficient and cost-effective model checkpoints. The conversation covers topics like saving costs by using spot instances, hot restarts of training jobs, and reclaiming unused GPU resources.

Check out our website at https://kubernetesbytes.com/ 

Episode Sponsor: Nethopper 

Cloud Native News:

  • https://www.aquasec.com/blog/linguistic-lumberjack-understanding-cve-2024-4323-in-fluent-bit/
  • https://kubernetes.io/blog/2024/05/20/completing-cloud-provider-migration/
  • https://thenewstack.io/introducing-aks-automatic-managed-kubernetes-for-developers/
  • https://www.harness.io/blog/harness-to-acquire-split

Show Links:

  • https://www.linkedin.com/in/berniewu/
  • https://criu.org/Main_Page
  • https://memverge.com/
  • https://youtu.be/tY8YOMRuqWI?si=yB3hHqLUpYPZ-KWN
  • https://youtu.be/ND4seSKpJHI?si=shh0iuA9qC-dO6eb

Timestamps: 

  • 01:04 Cloud Native News 
  • 08:47 Interview with Bernie 
  • 51:40 Key takeaways


Transcript

Introduction and Podcast Overview

00:00:03
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Walner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts. We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:00:30
Speaker
Good morning, good afternoon, and good evening wherever you are. We are coming to you from Boston, Massachusetts. Today is May 30th, 2024. Hope everyone is doing well and staying safe.

Bhavin's Episode and Expert Guest Discussion

00:00:41
Speaker
Let's dive into it. This is going to be a Bhavin-only episode, but we do have an awesome guest to talk about
00:00:49
Speaker
model training and model checkpointing on top of Kubernetes. But for the intro and the key takeaway section, I guess you guys are stuck with me. So now that we have kind of introduced the topic for the episode itself, let's dive into some cloud native news.

Recent Cloud Native News

00:01:07
Speaker
First up, I have a CVE that was found last week, late last week, like right around the Memorial Day weekend here in the US. If you are using
00:01:18
Speaker
versions 2.0.7 through 3.0.3 of Fluent Bit, make sure you upgrade your Fluent Bit version to 3.0.4, because in those versions that I just specified there is a new critical-severity vulnerability called Linguistic Lumberjack, CVE-2024-4323, if that number makes any sense to you.
00:01:41
Speaker
It affects those Fluent Bit versions and involves a memory corruption error, potentially leading to denial-of-service attacks, information disclosure, or even remote code execution.
00:01:56
Speaker
This has been identified. We link a blog from Aqua Security, a vendor in the Kubernetes security ecosystem, on how this CVE can actually be exploited on top of your Kubernetes clusters and what the best practices are. If you are looking for just one easy answer, I think it's upgrading to Fluent Bit version 3.0.4 from all the previous versions that we just listed. So keep an eye out for that in your environments and make sure you fix those.
00:02:22
Speaker
Next up, we have some awesome news from the Kubernetes community. One thing that caught my eye was the CNCF or the Kubernetes community worked on removing more than a million lines of code from Kubernetes.
00:02:38
Speaker
more than a million, guys. It's 1.5 million lines of code. They removed all the cloud provider integrations that were in-tree and have successfully migrated them out of the core Kubernetes repository to an external, plugin-based architecture. So remember how we went through the same transition for the storage plugins that were built into Kubernetes and then eventually moved outside with the CSI plugin framework? The Kubernetes ecosystem is going through a similar transition
00:03:08
Speaker
for all the different cloud providers like Amazon and Google and Microsoft. And with those 1.5 million lines of code removed from the Kubernetes repository itself, I think this also reduced the size of the deployment, or the binary, by 40%. So huge efficiency gains from the community. So kudos to everybody who worked on this, but I just wanted to highlight that as part of today's episode.
00:03:30
Speaker
Then we had the Microsoft Build conference last week, and Microsoft announced an interesting solution called AKS Automatic. It is currently in preview, it's not generally available in all the regions, but AKS Automatic is kind of an
00:03:47
Speaker
even easier way to use AKS, right? So if you're deploying an AKS cluster today, you're responsible for configuring the cluster: how many nodes you need, what kind of CNI, how do you handle security, how do you
00:04:02
Speaker
scale or size your cluster so it can support all the different applications that you want to deploy on top. How do you plan for upgrades of different versions? With AKS Automatic, all of these things go away. All you need to do during an AKS Automatic cluster deployment is give it a name, select a region, specify a resource group, and maybe a couple more details and then that's it. Microsoft or the AKS team will worry about or will take over the
00:04:29
Speaker
configuration, scaling, security, and even upgrades for your Kubernetes resources. So let's say your organization hasn't adopted Kubernetes for all their applications, right? But we have seen a gradual increase in the number of ecosystem vendors that are providing their applications as a containerized application. So if you want to use some off-the-shelf application and run it inside your environment or inside your cloud environment,
00:04:58
Speaker
You need a Kubernetes cluster. Instead of becoming a Kubernetes expert, this gives you, I guess, an easy way to just point the cluster to the repository or the source code for the application that you're trying to deploy. And then Microsoft handles the number of nodes that you need in the cluster and also makes sure that it's secure and running the latest and greatest versions. I'm not sure about the greatest version, but it's running a supported version of Kubernetes for sure.
00:05:22
Speaker
We'll have links in the show notes for you to check out the announcement around AKS Automatic.
00:05:29
Speaker
And then finally, today, as part of the news segment, we have another acquisition. So I know last time we covered the new debt financing round that Harness had announced. I think they're just following it up with an acquisition. Harness is now acquiring Split Software. Split Software allows teams to release software updates quickly and safely through advanced feature flagging, A/B testing, and real-time experimentation.
00:05:58
Speaker
If you are a LaunchDarkly user, you can think about them as kind of a LaunchDarkly competitor, but now part of Harness. The acquisition price was not shared publicly, but looking at Crunchbase and how much money Split had actually raised, they did raise over $100 million, or right up to $100 million, before this acquisition was announced.
00:06:24
Speaker
somewhere in that range, I would assume. So we'll have the link in the show notes to read more about the acquisition on Harness's site. So if you are a Split customer, congrats, now you are a Harness customer. But again, congratulations to everybody at Split Software. Acquisitions are always cool, or exits are always cool. It gives a way for early employees to have some form of liquidity event, right? Because
00:06:49
Speaker
Many organizations are not ready to start going public and try their hands at an IPO or a public offering either in the US or somewhere else.

AI Model Training on Kubernetes

00:07:00
Speaker
But yeah, that's it for the news segment today.
00:07:03
Speaker
As I hinted earlier, today's episode is going to be another AI-themed episode, but we are going deeper into one specific topic. We are focusing on model training, and even inside model training, we are focusing on model checkpoints and how a startup like MemVerge can help you
00:07:23
Speaker
Optimize your GPU resource utilization, help you reduce your cloud costs, things like that. So I'm personally excited for this discussion. Let's invite Bernie and talk to him about this topic.
00:07:38
Speaker
This episode is brought to you by our friends from Nethopper, a Boston based company founded to help enterprises achieve their cloud infrastructure and application modernization goals. Nethopper is ideal for enterprises looking to accelerate their platform engineering efforts and reduce time to market, reduce cognitive load on developers and leverage junior engineers to manage and support cloud mandated upgrades for a growing number of Kubernetes clusters and application dependencies.
00:08:08
Speaker
Nethopper enables enterprises to jumpstart platform engineering with KAOPS, a cloud-native, GitOps-centric platform framework to help them build their internal developer platforms, or IDPs. KAOPS is completely agnostic to the Kubernetes distribution or cloud infrastructure being used and supports everything including EKS, GKE, Rancher, OpenShift, etc.
00:08:32
Speaker
Nethopper KAOPS is also available on the AWS Marketplace. Learn more about KAOPS and get a free demo by sending them an email at info@nethopper.io or download their free version using mynethopper.com/auth. Hey Bernie.
00:08:47
Speaker
Welcome to the Kubernetes Bytes podcast. I'm really glad to have you on the show to educate us on AI on Kubernetes specifically, right? So I know I gave away the topic a bit, but why don't you introduce yourself and tell our listeners what you do? Sure. Thanks. Thanks, Bhavin, for inviting me. I really appreciate it. So, yes, my name is Bernie Wu. I'm the VP of Strategic Partnerships and Business Development for a company called MemVerge.
00:09:15
Speaker
And MemVerge is a Bay Area tech startup focused on solving some of the biggest memory problems facing the computer industry and users, particularly when it comes to AI/ML workloads. And just some of my background prior to MemVerge:
00:09:34
Speaker
I've been working in both startups and established companies, both in hardware and software, and I've pretty much been in what I would call the infrastructure area. So I've been doing everything from semiconductors; I started my career at Intel.
00:09:51
Speaker
Flash storage, object storage, software-defined, virtualized storage, data protection, cybersecurity, database engines, worked on cloud and infrastructure as a service. I like to call myself a plumber. I do a lot of this.
00:10:08
Speaker
A lot of buzzwords for sure, but that sounds like a fun career that you have had so far. And then with the AI wave, I'm so glad to see that you're continuing to be on the bleeding edge, right? Okay. So the reason we are doing this podcast is because I was listening to
00:10:26
Speaker
the Data on Kubernetes presentations from Paris a couple of months back. I wasn't able to attend in person, so I was just playing catch up. And you had a really great talk. So I want to start with one basic question. You were at Data on Kubernetes Day. So is AI more of a Kubernetes stateful workload or a stateless workload? And the reason I ask that is, initially, back in November of 2022,
00:10:52
Speaker
when OpenAI first released GPT-3.5, everybody was like, okay, if you want to run these models, you run it and then you interact with it as a chatbot. You don't really have to manage any state along with it. So in May of 2024, is AI still stateless or is it becoming more stateful?
00:11:12
Speaker
Yeah, it's definitely becoming more and more stateful. And most AI today is actually running on Kubernetes. So just to back up a little bit on Kubernetes itself, it's kind of an interesting shift that Kubernetes has to go through itself. So the original objective of Kubernetes back in 2014 was to be a cloud native platform.
00:11:40
Speaker
and for running web services. And at the same time, I thought it was interesting because it was not only a new platform, it was a new programming model. So the whole concept of a containerized microservice was developed, and it absolutely had to be stateless. Kubernetes purists would be all over you if you didn't write a totally stateless microservice architecture.
00:12:06
Speaker
And so for that use case it was actually ideal. I mean, it allowed highly scalable, highly elastic, independently scalable microservices. And then anything that had to do with state, you know, you tucked into a database that was off the Kubernetes cluster, basically. And since then, what I've seen happen with Kubernetes is that it grew in popularity and it became
00:12:31
Speaker
more of a vendor agnostic neutral open source hybrid cloud operating system. And so now everybody wants to start doing all the other kinds of workloads, including stateful workloads. So that started happening in 2017, I think. And now as we fast forward to this AI large language model area in particular,
00:12:55
Speaker
you find that the model training stage is a very data-intensive, stateful batch operation. At a high level, that's a mismatch for the original approach of Kubernetes.
00:13:11
Speaker
Basically, you have this giant data digester ingesting a huge amount of data, computing all these model weights, and then back-propagating them. And then, you know, rinse, wash, repeat is the iteration until it gets close enough to some level of model accuracy. And so it can take quite a while to execute that kind of program. It's not a microservice architecture, and
00:13:38
Speaker
what I found was interesting is that the groundbreaking model ChatGPT that OpenAI developed was developed on Kubernetes, using several thousand Kubernetes nodes. So that's the training side. It's definitely a highly stateful kind of operation. Now what's interesting is the inferencing side.
00:14:02
Speaker
Once the models were trained, initially, as you mentioned, the initial ChatGPT was essentially stateless. As soon as you asked a question, it would forget what it was. Long-running context windows. Yeah. And then also it didn't remember anything after December 2021, or whenever its training stopped. So if you asked any question after that, like who won the last NBA championship, it would give you a ridiculous answer because it didn't know.
00:14:32
Speaker
So that wasn't very useful. I mean, it was enough to cause a tipping point in the industry, but to make it really useful, it really needs to be more and more stateful. And most importantly, it needs to avoid hallucinations, the phenomenon of hallucinations in AI models.
00:14:48
Speaker
So that means that these AI models, especially in the inferencing stage, need to be grounded in a database or some source of truth, even current real-time information, to make sure that they're not generating fictitious results.
00:15:05
Speaker
And people are starting to use these vectorized databases, and they're starting to do domain-specific fine-tuning, and all this kind of stuff. And not only that, the context window is getting larger and larger. You used to be able to ask a 20-word question; now you can ask a 500-word one, you can throw a whole book at it now.
00:15:24
Speaker
I think at Google I/O they announced that you can give Gemini Pro a 1,500-page PDF document in just one question because the context window is now so big, and you can ask it questions. So that's essentially state, that's memory state that has to be kept, because there could be a dialogue over several hours or days in that area.
00:15:51
Speaker
So, basically, the inferencing is becoming more and more stateful. And people are even creating pipelines. It used to be just a one-shot inference thing: ask the question, here's the answer. Now, a lot of times, it's a pipeline, it's decomposed into multiple queries, and there's even an agent running around now to guide the whole response to the user. So, definitely going that way, yeah.
00:16:16
Speaker
Okay, that's awesome, right? I think our audience, which is more focused on the storage aspects around Kubernetes, is definitely glad to hear that answer: okay, AI on Kubernetes is a stateful thing. Eventually, you will end up using all the proprietary data you have inside your organization, either to
00:16:34
Speaker
train your own models, fine-tune your models, or, even if you're running inference kinds of workloads, build something like a RAG architecture to use the data and use that business intelligence you have built to better serve your customers.
00:16:49
Speaker
I want to go back and focus more on model training. You used OpenAI as an example. They used a huge Kubernetes cluster to train their... And they still do. They use that as an infrastructure to train their GPT models. But how are these companies training their models? Is it happening only in the cloud or a mix of hybrid on-prem and the cloud? What kind of tools are they using? Can you shed some light on that?
00:17:17
Speaker
Yeah, sure. So yeah, I think a lot of people initially started playing around with OpenAI. It was the easiest, very, very convenient. You could be just about anybody, you know, just download a notebook and do a little bit of hacking on OpenAI. People are doing a lot of that kind of work.
00:17:34
Speaker
And then I think what's happening is that people, now that it's past the kind of the toy stage, novelty stage, people want to look at putting these into real enterprise workloads. And enterprises have a huge amount of proprietary data.
00:17:52
Speaker
And they also, at the same time, people have also discovered that, hey, we don't need such a gigantic model. We can use smaller models that are fine-tuned and connected to RAG or whatever and make them more domain-specific.
00:18:08
Speaker
What I'm seeing is that actually a lot of companies are developing dozens or even hundreds of models that are fine-tuned, domain-specific. Maybe one is in charge of a call center and one is in charge of looking at the latest information on a particular stock ticker.
00:18:30
Speaker
and consolidating that into a nice summary. Another model is working on being a co-pilot for converting COBOL into Java. These are examples that I've come across in the financial services area. So yeah, there's some movement to doing some of this stuff on-prem or using smaller open models like Llama or whatever.
00:18:58
Speaker
And the tool chain, initially a few years ago, was dominated by TensorFlow, but more and more of it now is based on PyTorch. People are pretty much standardizing on PyTorch. And there's a whole bunch of frameworks out there, things like Ray and NVIDIA's frameworks, not to mention DeepSpeed. There's a whole bunch of training frameworks that are being built.
00:19:25
Speaker
And some of them are also using tools to track all these experiments. So we're still heavily in this experimentation phase. People are using notebooks at the individual data scientist level. They're keeping notebooks that kind of
00:19:43
Speaker
sit at the top of this stack of software. And then way down, farther down in the plumbing, what you see are projects like Kubeflow for Kubernetes, which has a nice set of tools for doing model serving and fine-tuning and the workflow itself on various operators. So yeah, it's starting to gel now, the whole pipeline. And I think
00:20:09
Speaker
Kubernetes is really going to be the platform of choice along with these different stacks. Yeah. Gotcha. And you said a keyword there, right? I want to focus on that: pipeline. And I think another word that we can use for a pipeline is

Importance and Technology of Model Checkpointing

00:20:24
Speaker
like a directed acyclic graph, basically a pipeline of sorts or a graph of sorts that you are building to train your model. Can you talk about what the different phases involved are? I know this is our way of getting to the core of the topic, which is how do we take model checkpoints, but can you first walk us through the higher-level stages, and then we can deep dive into the actual training phase.
00:20:49
Speaker
Yeah, so initially, obviously, people are ingesting data. They're doing some sort of processing or filtering of this data to remove the bias or things like that. And then it goes into a training phase. That's another major stage of the pipeline. After it gets out of that, there's a lot of optimization, fine tuning,
00:21:12
Speaker
So there's this baseline model, whether they develop it and train it themselves, or they use existing models from Hugging Face or whatever, and they will then go into a fine-tuning phase. And then eventually it moves into a production phase or inferencing phase and needs to be connected to databases for using RAG and things like that. So there's a lot of phases and there's
00:21:40
Speaker
constantly going to be iteration between all those phases of the pipeline. So yeah, it's a pretty complicated workflow and the requirements on the infrastructure vary tremendously. Some parts of inferencing are very memory intensive, other areas are GPU intensive. And so it's really a whole new ballgame right now.
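To make the pipeline Bernie describes a bit more concrete, here is a minimal sketch of those stages modeled as a DAG in Python. The stage names and dependencies are illustrative assumptions, not an actual Kubeflow or MemVerge pipeline definition.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical stages of the training pipeline discussed above, expressed as a
# DAG: each stage maps to the set of stages it depends on.
pipeline = {
    "ingest": set(),
    "filter_and_debias": {"ingest"},
    "pretrain": {"filter_and_debias"},
    "fine_tune": {"pretrain"},     # could start from a Hugging Face base model instead
    "evaluate": {"fine_tune"},
    "inference": {"evaluate"},     # RAG / vector DB wiring happens around this stage
}

# A valid execution order for the stages.
print(list(TopologicalSorter(pipeline).static_order()))
```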
00:22:07
Speaker
For sure. So I think I want to talk more around what checkpoints are and why we need them. For example, when Meta announced Llama 3, the 8-billion and 70-billion parameter models, they said they had to cut this
00:22:28
Speaker
because they had to start training the next one. So these are really long-running training jobs; they can run for hours to days to weeks or even months. How do we make sure that if anything fails at the infrastructure layer, I don't lose all the work that I did, and how do model checkpoints play a role there?
00:22:51
Speaker
Yeah, so a lot of these training frameworks like PyTorch and TensorFlow actually have built-in checkpointing capabilities. And actually, I believe certain versions of Kubeflow also have some level of checkpointing for what I call the pipeline seams between the stages. So the whole idea, one of the key use cases,
00:23:20
Speaker
and I think there are at least three different use cases for that kind of model framework checkpointing. One very important use case is to be able to keep versions of the model. So you're doing this iteration process
00:23:37
Speaker
to train the model. And at some point you decide, well, the model is good enough, but that's difficult to know a priori. So a lot of times you have to test this model and then you find out with other data outside of the training batch and find out how well it works or doesn't work.
00:23:59
Speaker
And sometimes you find that, for example, you may have over-trained the model. So you actually want to roll back to a previous epoch where it's less trained. Because if it's over-trained, then it can't handle variations in queries as well. So checkpointing is used sometimes to roll back the model to an earlier training epoch and use that as the final version of the model.
00:24:30
Speaker
Another use for these checkpoints is hyperparameter tuning. So you not only want to get the right kind of model, but you also want it to be the most efficient. So there are tuning knobs on batch sizes and other, you know, high-level parameters that affect the performance of the training.
00:24:53
Speaker
And so you don't want to go all the way back and retrain every time you tweak a hyperparameter to see what the effect is. So that's another use case. And the third use case has to do with the fact that, yes, all these training jobs take a long time to run, and you have a crash, a hardware failure, something like that. And you don't want to start from two weeks ago.
00:25:19
Speaker
Maybe two hours ago is acceptable, but two weeks ago, no way. Those GPU resources are very expensive. You don't want to lose two weeks worth of work. A lot of it is to just roll back for recovery purposes. Bernie, the way you were describing a model checkpoint, it sounded awfully close to what a storage level snapshot is, but obviously that's not all. Can you expand on all the different layers that are involved when a model checkpoint has to be taken to save the work, basically?
00:25:47
Speaker
Yeah, so since you and I came from the storage world, a model checkpoint is actually not as sophisticated as a storage snapshot. Basically, a model checkpoint simply stops the whole thing and copies everything from memory to files. It's just a bucket brigade, and then you annotate it so you know what version it is.
00:26:10
Speaker
And everything has obviously got to be done in synchronization. So that's the essence of model checkpointing technology today. So if you do a lot of checkpointing, there is a significant amount of overhead as you're trying to drain
00:26:28
Speaker
all this memory onto some file system for safekeeping. So that is sort of a problem with the current type of checkpointing that's going on.
00:26:44
Speaker
Yeah. And who's doing or taking these checkpoints, right? Is this the developers or data scientists? You said these are done through PyTorch or TensorFlow or some of those frameworks, right? So yeah, the personas for these checkpoints are really the data scientists. They're trying to preserve their work. They're trying to fine-tune the hyperparameters. They're trying to make sure their job gets done. So they want to have checkpoints so they don't have to start all over again.
00:27:12
Speaker
So yeah, right now those are all done at the coding level by the data scientists. Yeah. And checkpoints, sometimes you do one every other epoch, I think, because there's too much overhead or whatever. So there's a bunch of things like that. Yeah.
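As a reference point for the framework-level checkpointing Bernie mentions, here is a minimal PyTorch sketch: it saves a checkpoint every other epoch to limit overhead, and shows how a data scientist could roll back to an earlier, less-trained epoch. The model, training loop, and file names are placeholders, not anything from the episode.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)                         # stand-in for a real model
optimizer = optim.SGD(model.parameters(), lr=0.01)
CHECKPOINT_EVERY = 2                             # "every other epoch" to limit overhead

for epoch in range(10):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in for real training data
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % CHECKPOINT_EVERY == 0:
        # Everything needed to resume or roll back lives in one file per epoch.
        torch.save({"epoch": epoch,
                    "model_state": model.state_dict(),
                    "optimizer_state": optimizer.state_dict(),
                    "loss": loss.item()},
                   f"checkpoint_epoch_{epoch}.pt")

# Rolling back to an earlier epoch, e.g. to undo over-training:
ckpt = torch.load("checkpoint_epoch_4.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
```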
00:27:31
Speaker
Data scientists usually don't care about the infrastructure. So for the operations teams or the IT teams inside these organizations, if they wanted to have some sort of control, do they have to use tools like PyTorch? How do they make sure that they are meeting their own SLAs and taking these checkpoints, and not just relying on the data scientists to hopefully do their job and use these checkpointing technologies?
00:27:58
Speaker
Yeah, that's where our type of checkpointing comes into play. What we have developed, and it's not fully out there yet, but there are some early adopters out there, is what we call transparent checkpointing. So the idea is that we're doing checkpointing that's more at the platform and MLOps engineering level.
00:28:19
Speaker
And we call it transparent because it involves no intervention by data scientists, no code changes, nothing. And the idea is that whatever is being checkpointed or snapshotted at that level has to be really high speed, because if it took a long time, the data scientists would get irritated: why is this thing all of a sudden seizing up, and all that kind of stuff. So performance is really important when you're doing it at that level.
00:28:49
Speaker
And then also transparency, the fact that the data scientist doesn't have to change any scripts or code or anything like that. And so that type of checkpointing is really a tool
00:29:05
Speaker
that's aimed at the MLOps and platform engineering guys. And one of the main use cases is to improve resilience of the overall architecture. Because GPUs are like light bulbs, they burn out.
00:29:26
Speaker
And wouldn't it be nice if when you detect the error rate of a GPU creeping up that you magically checkpoint it and move it somewhere else and the people using the GPU cluster above you don't even notice, right? So those are the kinds of things that can be done with it. Another big thing is that
00:29:48
Speaker
As time goes on, there's going to be less emphasis on training, more emphasis on inferencing. And training, if you think about it, when you train a model, you just kind of train it once for a use case and you're covered.
00:30:01
Speaker
But what happens with inferencing is that it scales in proportion to the number of users, right? So as the number of users go up, the amount of inferencing goes up and cost can skyrocket if you don't do what you can to optimize it. So a lot of people are concerned about how much power GPUs use and what the utilization rate is. So checkpointing
00:30:24
Speaker
at the AI/ML layer, or in a parallel layer, can be used as part of a scheduler to improve what we call bin packing of the different AI workloads and get higher utilization of GPUs. So if you look back at these large models, like OpenAI's models or whatever,
00:30:50
Speaker
a number of them reported only 30 to 35% GPU utilization. So you can imagine if you could drive that utilization up through better job scheduling and all that. And they were just dedicated to training things, so they were just trying to train something for the sake of training and beating the other guy. But later on, when we get to inferencing, having GPUs that are only 35% utilized is pretty costly.
00:31:16
Speaker
The only people that will be happy are NVIDIA shareholders and NVIDIA employees. That's it. Nobody else will be happy with that. Exactly. That's where this checkpointing can be used to help, in an architecture similar to autoscaling on Kubernetes, to drive up the utilization per node.
00:31:36
Speaker
And Bernie, I haven't trained a model myself, but I wanted to ask you: how long do these checkpoints actually take? You said you have to basically dump everything that's in memory onto a file at the storage layer, right? And these GPUs have gigabytes and gigabytes of memory. So how long does a checkpoint actually take?
00:32:02
Speaker
Yeah, so a checkpoint, yeah, there's physics involved in it. It takes so much time to move memory. And one of the things that we have done in developing our checkpointing technology: we started out with the open source. There's an open source checkpointing
00:32:25
Speaker
project out there called CRIU, which stands for Checkpoint/Restore In Userspace. It's designed to checkpoint things in user space on the Linux platform, and we've enhanced that, we've built our own enhancements around it, because right now
00:32:41
Speaker
Yeah, the time to do a checkpoint can be excessive, the time to copy all that memory and all that kind of stuff. So we've enhanced it in two ways. If you're from the storage industry, remember you had full snapshots and then also had incremental snapshots.
00:32:59
Speaker
and incrementals are a lot faster because you're only doing the deltas between the last snapshot or the last incremental, and then offline you can stack the incrementals and make a full kind of snapshot. So yeah, we implemented that kind of capability to keep the snapshot overhead down. Nice, it's like change block tracking.
00:33:21
Speaker
Yeah, yeah, exactly. Incremental memory snapshots. Since most of the stuff that we're trying to checkpoint is in memory, you have to keep as much of the workload in memory as possible. Otherwise, the execution of these GPUs gets really, really bad. And then the second thing is we're doing things asynchronously.
00:33:42
Speaker
Normally, what would happen is that these checkpoints would go straight from memory into file. So there's a whole serialization process and it's going to be bled out to the file. And so you're bound by the speed of the network and bound by the speed of the file system, blah, blah, blah.
00:33:57
Speaker
So what we're doing is we're actually doing that kind of memory snapshot and then we immediately, once we start the snapshot, just like from the storage industry, we immediately release the application. So go back into production mode and then
00:34:14
Speaker
asynchronously copy the memory snapshot, which could be thin because it's only a delta, to the file system. So we use an asynchronous approach as opposed to a synchronous approach to speed up and reduce the snapshot overhead. And then the other key is the transparency. So we're capturing the entire process, memory state and network cache, everything that's needed so that this
00:34:43
Speaker
checkpoint can also be restarted. This checkpointed application can be restarted on another computer, on another pod, even on another... Wow. So you can actually clone from a snapshot or create a new instance of the entire training job from a snapshot, is that correct? At some point you could.
00:35:06
Speaker
Right now, we can clone individual nodes. At some point, yeah, we'll build the entire cluster. Yeah, so there's... Oh, wow. Okay, that's awesome. Yeah.
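The asynchronous, incremental approach Bernie outlines can be illustrated with a small sketch. This is not MemVerge's implementation; it just shows the pattern of pausing briefly to capture a consistent copy of the weights, keeping only the delta since the last checkpoint, and handing the write to a background thread so training resumes immediately.

```python
import copy
import threading
import torch

_last_persisted = {}              # what has already been written out
_writer_lock = threading.Lock()   # serialize background writers

def _persist_delta(delta, path):
    with _writer_lock:
        torch.save(delta, path)   # bound by storage speed, but off the training hot path

def async_incremental_checkpoint(model, step):
    # 1. Short synchronous phase: capture a consistent CPU copy of the weights.
    snapshot = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

    # 2. Incremental: keep only the tensors that changed since the last checkpoint.
    delta = {k: v for k, v in snapshot.items()
             if k not in _last_persisted or not torch.equal(v, _last_persisted[k])}
    _last_persisted.update(copy.deepcopy(delta))

    # 3. Asynchronous phase: a writer thread persists the delta while training continues.
    writer = threading.Thread(target=_persist_delta, args=(delta, f"ckpt_delta_{step}.pt"))
    writer.start()
    return writer                 # join() at shutdown to make sure everything is flushed
```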
00:35:16
Speaker
So how does all this work with Kubernetes, right? You mentioned checkpoint/restore in user space, the CRIU project. If I'm a Kubernetes platform engineer today, how do I get this or try this out? Like most things, the standard automation pattern in Kubernetes is to build an operator. So that's exactly what we did. We built a Kubernetes operator
00:35:43
Speaker
that wraps this checkpointing technology, with all these enhancements I talked about, into a nice, convenient operator. All you have to do is deploy this operator on the controller and then go into the targeted applications that you want to have checkpointed and annotate in the YAML files:
00:36:05
Speaker
okay, checkpoint this one, don't checkpoint that one. And so by process ID or application or process tree, we can checkpoint specific applications. And what happens currently is, when you have a node or pod eviction signal coming from the scheduler, saying, hey, we've got to boot you out, we've got to make room for something else or whatever, we'll automatically trigger the checkpoint.
00:36:31
Speaker
And then when the scheduler tries to relaunch that pod on another node or wherever, we'll automatically trigger the restore process. In the interim, we'll park the copy of the memory and any ephemeral files into a persistent volume, a PVC, on Kubernetes. So that way, there's no more cold restarts. We call it hot restarting of the application. So it's almost,
00:36:59
Speaker
almost as fast as a real live migration, but it's actually using a checkpoint. Gotcha. So many use cases show up here, right? So first of all, let me start. I have a list of questions that just came up because of what you just said.
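For a rough idea of what eviction-triggered checkpoint and restore could look like under the hood, here is a hedged sketch built directly on the CRIU command line (criu dump / criu restore). The PVC path, PID handling, and wiring to the eviction signal are hypothetical; MemVerge's operator adds its own enhancements (incrementals, GPU state, automation) on top of this basic idea.

```python
import subprocess
import sys

IMAGES_DIR = "/mnt/checkpoint-pvc/job-123"   # hypothetical PVC mount used to "park" state

def checkpoint_on_eviction(pid: int) -> None:
    """Dump the process tree to the PVC when an eviction signal arrives."""
    subprocess.run(
        ["criu", "dump", "-t", str(pid),
         "--images-dir", IMAGES_DIR,
         "--shell-job", "--leave-running"],
        check=True,
    )

def restore_on_reschedule() -> None:
    """Bring the process back up on whichever node the pod lands on next."""
    subprocess.run(
        ["criu", "restore",
         "--images-dir", IMAGES_DIR,
         "--shell-job", "--restore-detached"],
        check=True,
    )

if __name__ == "__main__":
    if sys.argv[1] == "dump":
        checkpoint_on_eviction(int(sys.argv[2]))
    else:
        restore_on_reschedule()
```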
00:37:17
Speaker
You said you can annotate resources. Are you annotating the custom resource that your operator is deploying and asking it to take, say, five of the namespaces that exist on my cluster and make sure that you are checkpointing them at regular intervals? Or are you going into those Jupyter notebooks or resources, let's say you're running TensorFlow on Kubernetes, and going inside that pod specification
00:37:43
Speaker
and specifying or annotating it in a certain way to say that, okay, this has to be checkpointed right now or before it gets evicted? Yeah, so it's funny you mentioned Jupyter Notebooks, because that actually is one of the use cases that I think we're working on, and other people, like Azure Cloud, I think have already implemented. So the idea there is that this checkpointing is not only useful for
00:38:11
Speaker
you know, compaction and driving up GPU utilization. A lot of times the data scientist is working on his Jupyter Notebook. He decides he's going to go home for the evening and he doesn't even shut his system down. He leaves it in some unknown state. And so what happens is that you can automate this to the point where
00:38:33
Speaker
you say, well, this notebook has had no activity for the last 20 minutes. We can automatically give it an eviction signal to checkpoint and save the notebook. And then when the user comes back, it comes back up. So we can save power, shut down the whole instance, and things like that. So it can definitely be used in that kind of use case. And like I said,
00:38:57
Speaker
by implementing it leveraging namespaces and things like that, you can checkpoint whole areas, whole groups of applications, or specific ones, things like that. Gotcha. So how do you monitor if a notebook is idle for some time? Something could be happening inside the thing, right? From a Kubernetes perspective, you wouldn't know.
00:39:24
Speaker
Are you monitoring the CPU and memory utilization for a specific pod, or how do you know if a notebook is inactive? That's one way to do it. I think there are ways through Jupyter, if you're using something like JupyterHub, which is a hub for these things; I think there are ways to detect whether something has changed at all or not, and then use that to trigger a shutdown.
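The idle-notebook detection Bernie alludes to can be approximated against the JupyterHub REST API, similar in spirit to the jupyterhub-idle-culler project. This is a rough sketch under assumptions: the hub URL, token handling, and the trigger_checkpoint_and_shutdown() hook are placeholders for whatever a platform would actually do (for example, sending an eviction signal to a checkpoint operator).

```python
import datetime
import os
import requests

HUB_API = os.environ.get("HUB_API", "http://jupyterhub:8081/hub/api")   # assumed hub location
TOKEN = os.environ["JUPYTERHUB_API_TOKEN"]
IDLE_LIMIT = datetime.timedelta(minutes=20)

def trigger_checkpoint_and_shutdown(username: str) -> None:
    # Hypothetical hook: checkpoint the notebook pod, then stop the user's server.
    print(f"checkpointing and shutting down idle server for {username}")

def cull_idle_notebooks() -> None:
    resp = requests.get(f"{HUB_API}/users", headers={"Authorization": f"token {TOKEN}"})
    resp.raise_for_status()
    now = datetime.datetime.now(datetime.timezone.utc)
    for user in resp.json():
        last = user.get("last_activity")
        if not last or not user.get("servers"):
            continue   # never active, or no running server
        last_activity = datetime.datetime.fromisoformat(last.replace("Z", "+00:00"))
        if now - last_activity > IDLE_LIMIT:
            trigger_checkpoint_and_shutdown(user["name"])

if __name__ == "__main__":
    cull_idle_notebooks()
```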
00:39:51
Speaker
No, that actually is interesting, because for some applications, not necessarily AI/ML applications, we do monitor CPU and memory utilization. So we have a technology we call wave riding,
00:40:09
Speaker
where we heuristically kind of learn the behavior of a particular application, its memory and CPU consumption. And so, even forgetting about using us on spot instances, we can automatically decide, hey, this compute instance is not large enough anymore, we're starting to see memory spilling to disk.
00:40:35
Speaker
And so we better automatically migrate it to a larger instance. So we can do things like that and the reverse. We find out there's too much memory. We can downsize it. Same with the number of cores. So we look at CPU utilization and memory spillage and utilization.
00:40:53
Speaker
And so we can literally wave-ride the size of the application as it changes and fluctuates. And so between that, and also checkpointing and allowing people to use spot instances on the cloud, because now we don't care if the thing gets killed, whether the Kubernetes scheduler kills it or the
00:41:15
Speaker
cloud provider decides to shoot it because it's a spot instance. It doesn't matter, we'll checkpoint it and be able to restore it and keep the job running. That's an awesome use case, because I was listening to a re:Invent talk that Anthropic did last year, and they heavily used EKS and they were using the Karpenter project, which automatically
00:41:37
Speaker
spins up new worker nodes that are GPU-based to run those training jobs as they need more GPU capacity. But they didn't talk about what happens when they are not fully utilizing that node. Like, okay, if the pod goes away, even if it was using 30%, at that point they'll remove the node from circulation and stop paying for it. But if it's over-provisioned, then there's no way to actually identify that. So I think this is a very
00:42:04
Speaker
specific use case that will be super helpful to people that are in this ecosystem and trying to train these models. So you said it's WaveRider, right? We'll make sure we find that. We have two buzzwords. One is called SpotSurfer, which is designed to literally surf different spot instances. So we have long-running workloads that take
00:42:26
Speaker
two weeks to run, and they may go over a dozen different spot instances over that time, but we'll run them to completion without having any cold restarts. WaveRider allows us to upsize and downsize the workload. Some applications are fairly deterministic, others have huge fluctuations in memory and CPU utilization.
00:42:49
Speaker
So we've seen up to 75% compute cost reduction by using a combination of SpotSurfer and WaveRider in our product. We're taking those technologies and we're going to be bringing those,
00:43:05
Speaker
in the near future, together with this new Kubernetes snapshot operator, into building a scheduler for GPU clusters for AI/ML.
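As a toy version of the wave-riding heuristic, here is a sketch that watches memory and CPU utilization and recommends moving up or down an instance ladder. The thresholds, instance names, and decision rule are made-up illustrations; MemVerge's product learns application behavior heuristically rather than using fixed cutoffs like these.

```python
import psutil

# Hypothetical ladder of instance sizes to move between.
INSTANCE_LADDER = ["r5.xlarge", "r5.2xlarge", "r5.4xlarge", "r5.8xlarge"]

def recommend_resize(current: str, mem_pct: float, cpu_pct: float) -> str:
    idx = INSTANCE_LADDER.index(current)
    if mem_pct > 90 and idx < len(INSTANCE_LADDER) - 1:
        return INSTANCE_LADDER[idx + 1]   # about to spill to disk: move up a size
    if mem_pct < 30 and cpu_pct < 30 and idx > 0:
        return INSTANCE_LADDER[idx - 1]   # paying for headroom we never use: move down
    return current                        # ride the wave where we are

if __name__ == "__main__":
    mem = psutil.virtual_memory().percent
    cpu = psutil.cpu_percent(interval=1)
    print(recommend_resize("r5.2xlarge", mem, cpu))
```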
00:43:20
Speaker
And again, there are organizations that are trying to get into the AI ecosystem or trying to use these AI technologies that definitely have a budget set aside for it. But since we are on this podcast, I want to ask: everything that we spoke about, is it open source or is it a paid offering from MemVerge?
00:43:39
Speaker
Yeah, the core project that we built a lot of our stuff on is called CRIU, Checkpoint/Restore In Userspace. And we have been contributing to that. We've actually partnered with NVIDIA to develop GPU snapshotting with CRIU. And all that's going to
00:43:58
Speaker
be contributed to the project. But we've taken that core and we've wrapped a whole user interface and platform around it, or transformed it into a Kubernetes operator. So we're both contributing and building something on top of that, which will typically be a subscription-based kind of thing that customers can deploy and use to drive up efficiency and
00:44:25
Speaker
If you're saving somebody 75% of their GPU spend, that's an easy line item on their budget, for sure. And you said you were working with NVIDIA already, so is all of this specific to NVIDIA or doesn't matter, you can use Intel-based GPUs, AMD-based GPUs, any GPUs basically.
00:44:47
Speaker
that the user wants to use, right? So the CRIU project, the core part of it, is open source, and there is a pluggable device driver architecture to it. So besides NVIDIA, AMD actually, on the
00:45:02
Speaker
MI300X, I think it is, which is what their GPUs are called, has also got a pluggable driver for CRIU, so we support that also. And then we're waiting for Intel to come along. So we're,
00:45:18
Speaker
I'm pretty sure all of them are going to implement this, because it's going to be a checkbox requirement for running any kind of AI/ML model at scale. We're going to need to be able to move workloads around and pack things efficiently, if nothing else for saving the planet, given the electricity that's going to be used to run all these AI inferencing systems.
00:45:45
Speaker
Gotcha. And do you have customers that you're already working with or users from the ecosystem that you're already

Transition of AI Workloads to Production

00:45:53
Speaker
working with? Can you share some examples? Yeah, so we ourselves have a lot of examples in the computational biology area. That was kind of our starting point for this.
00:46:05
Speaker
wave riding and spot surfing. So if you go to our website, you'll see a lot of either academic or industry partners on that side, like TGen or Columbia University, things like that, that are using us just to cut compute costs, very simply.
00:46:27
Speaker
And then also on the EDA side, the chip design area, we're starting to see a lot of traction there. I just spoke at the Cadence Live event and there was a lot of interest in deploying us. We have some sites already deploying us too, because those EDA workloads are also very long running batch problems.
00:46:44
Speaker
So what's interesting is that this applies to the whole industry, including the AI area. AI/ML is basically another long-running batch job or a continuous service, which is what inference is going to be. So I think as soon as we get out of the experimentation phase overall in the industry and people start seriously looking at putting this into production deployments, that's when you're going to see this kind of transparent checkpointing talked about more and more.
00:47:16
Speaker
Actually, one of the leaders in this area, I would say, among the hyperscalers is Azure. They are using this checkpointing to do exactly what I was describing earlier, to improve the resilience. Like I said, these GPUs, once you go up in scale, they're burning out like light bulbs. They need to be automatically cut in and out of production.
00:47:37
Speaker
And then also, they're shutting down idle notebooks automatically without losing the guy's data. That's the use case. So these are great use cases, right? For sure.
00:47:55
Speaker
Something that came to my mind, since you were bringing up batch so many times: I was at the Supercomputing conference last year in Denver, and there were a lot of HPC guys, and for them, I think batch is more important than the orchestration capabilities that Kubernetes provides.
00:48:14
Speaker
Do you offer some solution where people can use the WaveRider or SpotSurfer capabilities and batch their workloads so that they can, as you said, bin pack things more efficiently or right-size the cluster with the number of nodes that they have? Oh yeah, absolutely. We also regularly go to the Supercomputing conferences. We've been working with Slurm and with LSF;
00:48:37
Speaker
those are the major schedulers and resource managers for that community. And then also on the Kubernetes side, there are actually multiple batch efforts going on because, again, AI/ML is a batch operation. So there's a project called Kueue we're working with, things like that. So yeah, absolutely on the batch side, as a matter of fact, prioritizing jobs:
00:49:02
Speaker
this job has higher priority, we'll checkpoint the other one and launch this one, and then restore the other one when the high-priority job has gone through, things like that. So yeah, there's a lot of applications definitely on the HPC side now.
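Priority-driven preemption with checkpoints, as Bernie describes it, can be sketched with a toy scheduler. The checkpoint(), restore(), and launch() functions are placeholders for whatever the real scheduler (Slurm, LSF, Kueue, or a custom operator) would invoke; only the park-and-resume flow is the point here.

```python
import heapq

class Job:
    def __init__(self, name: str, priority: int):
        self.name, self.priority = name, priority
        self.checkpointed = False
    def __lt__(self, other):              # heapq pops the highest priority first
        return self.priority > other.priority

def checkpoint(job): print(f"checkpointing {job.name}")
def restore(job):    print(f"restoring {job.name} from its checkpoint")
def launch(job):     print(f"launching {job.name}")

def on_new_job(running, incoming, queue):
    if running is None:
        launch(incoming)
        return incoming
    if incoming.priority > running.priority:
        # Higher-priority work arrived: park the running job instead of killing it.
        checkpoint(running)
        running.checkpointed = True
        heapq.heappush(queue, running)
        launch(incoming)
        return incoming
    heapq.heappush(queue, incoming)
    return running

def on_job_finished(queue):
    if not queue:
        return None
    nxt = heapq.heappop(queue)
    restore(nxt) if nxt.checkpointed else launch(nxt)   # hot restart when possible
    return nxt
```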
00:49:17
Speaker
Perfect, I knew about the Kueue project, but I didn't know that you guys were a contributor. Man, that could have been a whole other set of questions. We're not a contributor yet, but we're starting to get involved in that, because I think that's actually, there's that project, there's a couple of other projects
00:49:36
Speaker
whose names are slipping my mind. To be honest with you, I went to KubeCon in Paris and I walked away and it sounded like there were at least six batch projects going on. But I think two years back, it was all around Kubernetes security, like six different open-source projects, if not more, enabling users and developers to shift left and things like that. So KubeCon always has a theme where you'll see most of the vendors talking about the same topic.
00:50:05
Speaker
Yeah. Gotcha. Okay, so I know we are coming close to the time that we have allotted for this episode. This has definitely been super helpful, not just for me, in just the two conversations that I've had with you, Bernie. I've learned a lot. I'm sure our listeners have learned a lot too, but if they wanted to follow up, learn more about it, maybe get some hands on, try it out in their own environments, where do you recommend they go and what resources should they check out?
00:50:33
Speaker
Yeah, so for example, we'll start with our website, memverge.com. And then from there, if you want to, you can test drive the WaveRider
00:50:42
Speaker
or SpotSurfer software. We have some free trials for that. There's also use cases on the website and then also we have demos for GPU snapshot. That hasn't been production released yet, but we definitely have the demos so we can demo the GPU snapshotting for things like TensorFlow or whatever.
00:51:07
Speaker
And then also, you know, if someone has even more detailed questions, you can find me on LinkedIn, or you can reach out to me at bernie.wu at memverge.com and send your questions in.
00:51:21
Speaker
Awesome. Thank you so much, Bernie, for your time today. This has been an awesome learning experience for all of us. I'm sure we'll talk in the future. You have an open invitation. Six months down the line, a year down the line, if you have something really cool to talk about, just hit us up and we'd love to have you back on. Okay. We'll look forward to that. Great. Thank you.
00:51:39
Speaker
Okay, that was a great episode, or a great interview, with somebody who's doing this on a day-in, day-out basis, and not somebody like me who's more focused on the Kubernetes storage ecosystem,

Evolution of AI Workloads and Kubernetes

00:51:51
Speaker
right? So this has been an awesome interview for me. I think the things that stood out were definitely the evolution from stateless to stateful.
00:52:01
Speaker
With every new piece of technology, we see an increased rate with which users are adopting things and increased rate with which changes happen. Kubernetes introduced in 2014. I know we have a 10-year celebration in a couple of weeks that Google is hosting in their Mountain View offices.
00:52:21
Speaker
When Kubernetes started in 2014, it was meant for stateless. And then 2017 timeframe is when they started talking about data on Kubernetes. And we have come a long way from there. With AI on top of Kubernetes, when it started, it started again as a stateless thing. And now, in less than two years, we are at a point where people are talking about how we can make things stateful. So data is always important. Data is the new oil of
00:52:46
Speaker
or data is the new plutonium, or whatever metaphor you want to use for data. But data is super important for organizations. So I'm glad that we are seeing this transition. But then talking more specifically about model checkpointing, I like how the responsibility doesn't fall only
00:53:02
Speaker
on the developers. I know we always talk about shifting left and giving more capabilities to the developers so they aren't blocked, but at some point they get overwhelmed. So I like this balance. It's not up to the data scientists to always be responsible for checkpoints and making sure that they are saving all of their data. The ops teams need to share some of these responsibilities as well, and tools like MemVerge can definitely help do that. I really liked the discussion around
00:53:31
Speaker
how they have these WaveRider and SpotSurfer extensions or offerings, where they will monitor if your Jupyter Notebook has been idle for three hours so you can basically take those GPU resources back and scale down your cluster,
00:53:47
Speaker
or use SpotSurfer, which basically looks for spot instances that are available and right-sizes your Kubernetes cluster. All of those sound like really cool solutions. Personally, I'm going to check it out and get my hands dirty. I know GPU resources cost a lot of money, so it might take me some time, but
00:54:05
Speaker
If you are in this ecosystem and if you are evaluating or already running some training jobs, this is definitely a solution for you to check out. I know this is not like a vendor specific podcast. I was just happy with the discussion and the technologies that we discussed

Closing Remarks and Call to Action

00:54:21
Speaker
today.
00:54:21
Speaker
With that, that brings us to the end of today's episode. One thing that Ryan and I always like to ask our listeners: please, please, please give us a rating on the podcast app where you listen to the podcast. Give us five-star reviews. Four stars work too, but hopefully five stars if you find value here. If you are an audio listener, which is the majority of our listeners,
00:54:45
Speaker
go to the YouTube channel, just hit subscribe, give our videos a like, and share them with others. That really helps us work the algorithms and helps us expand our listener base. If I'm being candid, we are at a good number and we are happy with it, but it's always good to keep growing. So help us grow, and we'll keep bringing more and more interesting guests on the podcast. With that, this brings us to the end of today's episode.
00:55:13
Speaker
I'm Bhavin, and thank you for joining another episode of Kubernetes Bytes.