
Deploy and fine-tune LLM models on Kubernetes using KAITO

S4 E15 · Kubernetes Bytes
1.2k plays · 4 months ago

In this episode of the Kubernetes Bytes podcast, Bhavin sits down with  Sachi Desai, Product Manager and Paul Yu, Sr. Cloud Advocate at Microsoft to talk about the open source KAITO project. KAITO is the Kubernetes AI Toolchain Operator that enables AKS users to deploy open source LLM models on their Kubernetes clusters. They discuss how KAITO helps with running AI-enabled applications alongside the LLM models, how it helps users bring their own LLM models and run them as containers, and how KAITO helps them fine-tune open source LLMs on their Kubernetes clusters.  

Check out our website at https://kubernetesbytes.com/  

Cloud Native News:  

  • https://azure.github.io/AKS/2024/07/30/azure-container-storage-ga
  • https://github.blog/news-insights/product-news/introducing-github-models/  

Show links: 

  • Azure/kaito: Kubernetes AI Toolchain Operator - https://github.com/Azure/kaito/tree/main
  • https://www.youtube.com/watch?v=3cGmHDjR_3I&list=PLc3Ep462vVYtgN4rP1ThTJd2UlsBc2sou&index=2
  • https://aka.ms/cloudnative/learnlive/intelligent-apps-on-aks/episode-2
  • Jumpstart AI Workflows With Kubernetes AI Toolchain Operator - The New Stack - https://thenewstack.io/jumpstart-ai-workflows-with-kubernetes-ai-toolchain-operator
  • https://paulyu.dev/article/soaring-with-kaito/
  • Concepts - Fine-tuning language models for AI and machine learning workflows - Azure Kubernetes Service | Microsoft Learn - https://learn.microsoft.com/en-us/azure/aks/concepts-fine-tune-language-models  
  • Keep up to date on the most recent announcements by following some of the KAITO engineers on LinkedIn: 
  1. Fei Guo - https://www.linkedin.com/in/fei-guo-a48319a/
  2. Ishaan Sehgal - https://www.linkedin.com/in/ishaan-sehgal/ 

Timestamps: 

  • 00:02:15 Cloud Native News 
  • 00:05:34 Interview with Sachi and Paul 
  • 00:42:08 Key takeaways
Transcript

Introduction & Podcast Overview

00:00:03
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Walner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts. We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:00:31
Speaker
Good morning, good afternoon, and good evening wherever you are. We are coming to you from Boston, Massachusetts. Today is August 7, 2024. Hope everyone is doing well and staying safe.

Podcast Anniversary and Audience Engagement

00:00:42
Speaker
Can't believe we are in August already. Ryan and I started this podcast back in August of 2021. So this is like a three-year milestone for us. Can't believe it's three years already.
00:00:53
Speaker
We are more than 75 episodes in. This has been a great ride for us. As part of this podcast, we have had the opportunity and the privilege to talk to a bunch of awesome guests from the community, and to meet a lot of our listeners at industry events like KubeCon and Red Hat Summit. I would love for us to be able to keep doing this on an ongoing basis.
00:01:15
Speaker
Thank you so much for being a listener and sharing this podcast with your colleagues and friends. We really appreciate you listening to us, giving us the time of your day to hear us talk about a specific piece of technology or an open source project. We have definitely grown our audience over the years and, again, can't thank you enough. It's through word of mouth, through people finding value in this podcast and sharing it with others, that we have grown the audience over the years. Three years is a long time to be doing something on a bi-weekly basis, that's for sure. But we're nowhere close to done. Both Ryan and I are committed to doing this and continuing this journey.
00:01:56
Speaker
We both have had job changes since we started this, but we have always made this podcast a priority, and that's not changing. I'm personally excited to continue this journey, keep evolving this podcast, and continue talking about different technology stacks in and around the cloud native ecosystem.

Introducing Microsoft's Kubernetes AI Toolchain

00:02:13
Speaker
With that, let's talk about a few things happening in the Kubernetes ecosystem before we dive into the topic for today's episode. First one: Azure Container Storage, or ACS, is now generally available. Microsoft announced this sometime last year, and it went through its regular preview stages, private preview and public preview, and now it's officially generally available.
00:02:36
Speaker
Azure Container Storage is a software-defined, cloud native storage solution from Microsoft. It's derived from OpenEBS, a similar project in the open source ecosystem.
00:02:48
Speaker
With the general availability, they announced support for the underlying devices to be either Azure Disks or ephemeral disks. They also said that Azure Elastic SAN support still remains in preview, but you can expect general availability for ACS with Azure Elastic SAN in a couple of months.
00:03:08
Speaker
Some of the features include storage pool expansion and resizing, replication across different nodes, customer-managed encryption key support, and the ability to dynamically provision persistent volumes and take snapshots, all the benefits that you get with the CSI standard. We'll have a link in the show notes that talks about the journey from inception to GA, the features that are available, and how you can get started.

GitHub's AI Model Integration with Azure

00:03:34
Speaker
Next up, again something from the Microsoft ecosystem: GitHub, the famous source control repository, is now adding support for something called GitHub Models. This means that you now have access to large language models like Llama 3.1, GPT-4o, and GPT-4o mini,
00:03:55
Speaker
to small language models like Phi-3, and other open source models like Mistral Large 2. You can access each of these models via a built-in playground inside GitHub that lets you test different prompts and model parameters for free, right inside GitHub.
00:04:10
Speaker
And if you like what you're seeing inside this playground, they have also created an easy path to bring that model to your development environment in Codespaces or Visual Studio Code. Once you are ready to go to production, the next step in the Microsoft ecosystem, as they mentioned, is using Azure AI, which has all of these models available across the different Azure regions that they support.
00:04:35
Speaker
I'm just excited that, in addition to Hugging Face, there's now a different alternative where models are available to test out. If you are in the Microsoft ecosystem, this is definitely good news for you. So those are the couple of things I wanted to highlight as part of the news, things happening in the cloud native ecosystem. But let's move on to the topic for today and introduce our guests. Today we are going to talk about the Kubernetes AI Toolchain Operator open source project from Microsoft, the KAITO project.

Meet the Guests: Sachi Desai & Paul Yu

00:05:06
Speaker
We'll find out how the Microsoft folks say it. To talk about the project, we have Sachi Desai, who's a product manager in Microsoft Azure, and Paul Yu, who's a senior cloud advocate focused on Kubernetes. He has been doing this for a while now.
00:05:20
Speaker
Fun fact: he's also a Kubernetes Bytes listener. So I'm excited to talk to him and have him share his thoughts around the KAITO project and what it is all about. Without further ado, let's bring Sachi and Paul on the podcast. Hey, Paul. Hey, Sachi. Welcome to the Kubernetes Bytes podcast. I'm so excited to have you guys on the show to talk about KAITO. Why don't we start with a quick introduction from the two of you and what you guys do at Microsoft. Sachi, why don't we start with you?
00:05:50
Speaker
Yeah, of course. Thanks for having me on the podcast. My name is Sachi. I'm a product manager in Azure Kubernetes Service. I've been on the team for about a year now, and I work on AI workloads and making them run more efficiently on AKS, as well as GPU-related workstreams and scheduling.
00:06:12
Speaker
That's awesome. Paul, how about you? Cool, yeah, thanks for having us on, Bhavin. My name is Paul Yu. I'm a cloud native developer advocate at Microsoft. I've been at Microsoft for a little over five years now, but I did a ton of work outside of the company before then, ranging from software engineering to solution architecture, and lately it's just been all Kubernetes for me. I've been doing Kubernetes for the better part of four-plus years now, so that's kind of all I do now.
00:06:42
Speaker
Yes, Kubernetes is, I think, top of mind for almost everybody these days. And that's the topic for today: how Kubernetes fits in with AI, and that's what I want to ask you guys about. Okay, so let's start there: artificial intelligence, AI/ML workloads on Kubernetes. Why do you think they make such a good fit? And if we dive into it a bit more, where do you think it has more benefits, on the model training side, or on the inferencing and the architectures that come after the model has been trained?
00:07:19
Speaker
And anybody can pick it up, yeah.

AI Workloads & Kubernetes Capabilities

00:07:21
Speaker
I can go through that a little bit. Yeah, so when you think about AI workloads, they range from training to fine-tuning to inferencing, and you want to be able to run those workloads at scale. As you're doing that in large enterprises and factory kinds of scenarios, you want to be able to run them reliably. You want to have portability there,
00:07:43
Speaker
version control, and all the necessary aspects that Kubernetes itself provides for your traditional microservices. You can really apply all of that to your large-scale AI workloads.
00:07:55
Speaker
Gotcha. And is it more suited because of all the control plane capabilities? Because I've been talking to other folks from the community as well, and there are some capabilities that are missing from Kubernetes, like batch processing and the ability to schedule jobs. How does Kubernetes handle all of that? Do you see the community making progress there?
00:08:17
Speaker
Yeah, definitely. I feel like that's a great part of what I've seen with the cloud native community coming together, in that we have these KEPs, Kubernetes Enhancement Proposals, where different working groups across the Kubernetes community can put together ideas for making Kubernetes more specialized for AI workloads. As you mentioned, you need specialized communication across nodes, for example,
00:08:42
Speaker
so support for InfiniBand and other specialized technologies. There are several AI-related working groups across the cloud native community that are making progress there, and we've seen that at KubeCon and many other events over the past few years.
00:08:58
Speaker
Gotcha. And I think Microsoft has always been at the forefront of Kubernetes, right? It has been, I think, 18 months since we had Brendan Burns on the podcast to talk about AKS and everything around AKS. So let's talk about the topic for today: what is KAITO?

The KAITO Project: AI Toolchain on AKS

00:09:18
Speaker
Can you first tell us what the full form is, what it stands for, and then what this new project is going to solve for me?
00:09:23
Speaker
Yeah, so KAITO is the Kubernetes AI Toolchain Operator. It's an open source project that was developed at Microsoft. Starting off with AKS right now, it is able to automate the deployment of large language models within your AKS cluster, and we're working towards expanding to other cloud providers' Kubernetes clusters.
00:09:47
Speaker
But essentially, the gap that we noticed, and that a lot of people across the cloud native community noticed, is around infrastructure provisioning: there's a lack of awareness of what kind of infrastructure to have ready for your AI workloads, for inferencing specifically, how you can do it with a low-cost, cost-aware mentality, and then also automating a lot of those initial processes and reducing the burdensome steps in your onboarding process. That's also where containerization of your models comes in. What KAITO does is bring two moving parts, the workspace controller working together with the node provisioner controller. And if I were to
00:10:40
Speaker
if I were to sum it up in just one tagline, especially when KAITO first came out, I always said that KAITO reduces your time to inference. There's additional support in KAITO today that was just released a couple of weeks ago, and we'll get into that, but when KAITO first came out a couple of months ago, it was really about deploying a model into your cluster and providing an inferencing endpoint so that you can immediately interact with that large language model. And so, as Sachi says,
00:11:14
Speaker
it's really designed in a way so that you don't really have to care about the underlying infrastructure. I don't know if you've worked in Azure or any of the other clouds, but there are so many different VM SKUs; how do you decide what to pick? How do you understand which NVIDIA device plugin or drivers work for that machine? That's the problem it solved initially out of the gate, so it's pretty cool. Okay. And this might sound like a really basic question, but I think the community went through a similar transition when Kubernetes started: it was mostly stateless, and then we started bringing in stateful workloads. Why do you think we need to run models on Kubernetes? Why can't I just use the Azure OpenAI service to interact with models and get that endpoint?

Benefits of Deploying AI on Kubernetes

00:12:02
Speaker
Why do I need to run it on top of Kubernetes? Can you guys share a bit around that?
00:12:08
Speaker
Yeah, I think the first thing is, when you interact with OpenAI models or even Azure OpenAI, that's almost a model-as-a-service type format, right? The model is running somewhere and you just call the inferencing endpoint, and that's how you interact with it. That could be a problem for organizations that have really tight constraints around data sovereignty, data privacy, and maybe even just latency in general. So having the model inside the cluster, as close to the workload as possible, I think is pretty cool. No, I agree. And as Sachi mentioned in the first question,
00:12:50
Speaker
it's AKS for now, but it will expand to other Kubernetes distributions. Maybe it becomes a part of Azure Arc-enabled Kubernetes as well. So it's not just that you're running in the cloud and have a managed service to consume; wherever you want to run your Kubernetes cluster, KAITO, the open source project, will allow you to run models almost instantaneously and have a local inference endpoint. So I like that.
00:13:12
Speaker
So where does KAITO fit in my AI pipeline? Is it just meant for inferencing, or have you added more features to it over time? Yeah, so it started off with simplifying your time to inference: getting your model containerized, working with that specific version of your model, getting it stored in a container registry and pulling it in, and getting your service endpoint ready to start interacting with it. That was the initial version. And that really simplified what was almost days to weeks of discovery, figuring out what's necessary
00:13:52
Speaker
for the model that you've chosen and the infrastructure that's needed, down to just a few minutes. Within 10 to 20 minutes, you can get KAITO to deploy the model that you want.
00:14:05
Speaker
Now, in the recent version, we've released support for fine-tuning your model. There are two parts to it, and we can go into that a little bit, but essentially it's editing certain layers of the model with new data to make your model a little bit smarter and more context-aware, and then working with that output layer and pulling it into new inferencing jobs for whatever data you're interacting with in the future.
00:14:34
Speaker
OK, I definitely want to double-click there, but I think I caught one keyword in your answer: you can now package models as containers. How do we do that? Is it something that Microsoft does for me, where it takes the open source models that I want, packages them up as container images, and allows me to download them using KAITO, or how does that work? Yeah, so, go ahead, Sachi. Go for it.
00:15:02
Speaker
Yeah, so if you take a look at the KAITO GitHub repository, you can look at our library of presets. There are several model families that we support today, and within those model families, all different sizes, so the number of parameters; for example, with Falcon you can have the 7-billion-parameter version versus the 40-billion-parameter version. Our engineering team maintains the container images for each of those models. Essentially, within the custom resource definition for each of those model presets, we provide the minimum infrastructure size that you need to get your service up and running. That is where the cost-effective approach comes in, and it's also defined right there for you in that file.
00:15:52
Speaker
Yeah, and like you mentioned, the KAITO engineering team has done all the hard work in preparing the Dockerfile and understanding how to package that container. We even host, it's license dependent, but we host a few open source models on the Microsoft Container Registry, MCR, if you've used that before. But if you've never packaged a container, there is a little bit of an exercise in the GitHub repo for you. So if you want to package, let's say, a Llama 2 or Llama 2 Chat model, due to licensing constraints we're not able to host that on MCR. So you as a consumer get the fun part of downloading the model weights, building the container, and pushing it to a registry. If you want to actually see what that looks like,
00:16:41
Speaker
take a look at the instructions in the repo. Okay, that's super helpful, because that was my next question: can I just take any model that's available on a repository like Hugging Face and package it up as a container image? I'm hoping the answer is yes, based on your previous statement. Yeah, we have an entire recommended onboarding process for whatever custom model you're working with. And the models that we support today are, for the most part, from Hugging Face, so they have an actual open source license, which can differentiate them from other models that you find out there on the internet. But the advantage there is that Hugging Face also provides different libraries for fine-tuning the model, for example, and interacting with it, and so forth.
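For readers following along, the bring-your-own-model flow described here boils down to roughly the following. This is a hypothetical sketch: the real Dockerfile location and build steps live in the KAITO repo's instructions, and the model, registry, and tag names below are placeholders.

```
# Hypothetical sketch of the bring-your-own-model flow (see the KAITO repo for the real steps).
# Model, registry, and tag names are placeholders.

# 1. Download the model weights locally (requires accepting the model's license on Hugging Face).
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir ./llama-2-7b-chat

# 2. Build a model image using the Dockerfile and instructions provided in the KAITO repo.
docker build -t myregistry.azurecr.io/llama-2-7b-chat:0.1 .

# 3. Push it to your own private registry (not MCR), for example Azure Container Registry.
az acr login --name myregistry
docker push myregistry.azurecr.io/llama-2-7b-chat:0.1
```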
00:17:26
Speaker
Gotcha. Okay, so when I'm publishing these container images that are basically models on my own, am I pushing them to my private registry, or am I pushing them to MCR? No, no, no. You would be pushing it to, let's say, an Azure Container Registry or whatever private registry you have. Okay, that's awesome. So let's talk about how KAITO works.
00:17:52
Speaker
I have an AKS cluster today. How do I get started with it? What's step one? There are actually two pathways you could take. Let's talk about the easy path first. So KAITO is actually available as an add-on for AKS. It's a managed component. You just say, I forgot what the flag was in the CLI, it might be something like enable AI toolchain, there's some flag in there.
00:18:20
Speaker
You can just say az aks create and pass in that AI toolchain operator flag, and boom, KAITO is in your cluster. It may not be the latest and greatest KAITO, but it is a version of KAITO. So that's the easy path. The other path, I wouldn't say it's difficult, because to me it's pretty simple, but to get started you would just provision your AKS cluster, and inside the KAITO repo there are instructions on how to provision the two things you need. One is called the GPU provisioner, which is in charge of sourcing the GPU VM for you, installing the device plugins and drivers, and connecting it with your Kubernetes cluster. That's one operator.
00:19:06
Speaker
And then you would have to install the next operator, which is the workspace operator. That's in charge of managing the CRDs, reconciling them, and making sure that your workspace can be scheduled on a node, a GPU node. Gotcha. So I'm assuming both of these are just simple one-line commands that I can copy and paste for my AKS cluster?
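For reference, the managed add-on path described here looks roughly like the sketch below. The flag names are from the AKS docs as best we recall (the add-on also expects the OIDC issuer to be enabled), and the resource group and cluster names are placeholders, so verify against `az aks create --help` before relying on it.

```
# Managed add-on path (sketch; flag names may change between CLI versions).
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-oidc-issuer \
  --enable-ai-toolchain-operator \
  --generate-ssh-keys

# Or enable it on an existing cluster:
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-oidc-issuer \
  --enable-ai-toolchain-operator

# The self-managed path installs the GPU provisioner and workspace operator
# from the KAITO repo instead -- see the repo README for the exact commands.
```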
00:19:30
Speaker
Yeah. Okay, perfect. So I have the operators and the CRDs installed. How do I get to a point where I have an instance of Llama 2, or an instance of a different model that's supported in KAITO, running on my cluster?
00:19:45
Speaker
Yeah, so Llama 2, like I mentioned, is a little bit more involved. You do have to package that container yourself and host it in your own private container registry. But from a user perspective, it's literally, and I actually took count of this, 12 lines of YAML to deploy a workspace. I really like the way they designed the API for the workspace. It makes it super simple, especially if you're using the presets, because the presets have been vetted by the team. They know what VM SKU to use, they know where to pull that model from. So you can just take an example manifest, which is in the GitHub repo as well, and literally just do a kubectl apply
00:20:30
Speaker
and apply that workspace. The VM SKU is in there, the model is in there, and it's actually pointing to MCR at that point. If you were to deploy, let's say, a model from your own ACR, your Azure Container Registry, you would have to deploy a secret into your cluster and then reference that secret so it can pull the image. Okay, all of that sounds standard enough; there's nothing special just because I'm running an LLM on my AKS cluster. When you apply that configuration, does it check what nodes I have? Or, regardless of whether I have GPU nodes already in my cluster, does it deploy new worker nodes for me?
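To make the "12 lines of YAML" concrete, here is a minimal sketch of a preset-based workspace modeled on the examples in the KAITO repo; the apiVersion, preset name, and instance type vary by release, so check the repo's examples before applying.

```
# Minimal preset-based Workspace, modeled on the KAITO repo examples.
# Field names and apiVersion may differ by KAITO release.
kubectl apply -f - <<EOF
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"   # recommended GPU VM SKU; swap for one you have quota for
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"                 # preset maintained by the KAITO team, image hosted on MCR
EOF
```

If the model image lives in your own registry instead of MCR, the workspace also takes a reference to an image pull secret, as mentioned above; the exact field name depends on the KAITO version.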
00:21:13
Speaker
Yeah, so about that, the cool thing is that it uses the Karpenter API. If you're familiar with the Karpenter project, it's all about provisioning the right resource, the right node, at the right time. With these presets, there are details around which node to provision. And remember that GPU provisioner I talked about: the workspace operator will actually deploy a machine manifest into your cluster, and then the GPU provisioner will see that and say, okay, I need to go get an NC v3 machine, and go provision that. So the workspace tells the GPU provisioner what to provision, and the GPU provisioner goes and does that.
00:22:03
Speaker
And the beauty of this CRD is that it's customizable. We have some opinionated recommendations, but you can go in and see the recommended GPU VM size, let's say it's an NC6-series GPU VM, and if you want to replace it with something you have quota for in your subscription, something that's the same size or larger, we'll do some preflight validations once you apply the CRD,
00:22:29
Speaker
just to make sure that your model can run on that infrastructure. Gotcha. And what happens if I apply that configuration and, for some reason, there aren't enough instances of that GPU type available in that region? I know we're not talking about H100s here, but still, what happens? How do I fix that issue? To me, the quota thing is probably the hardest thing about KAITO.
00:22:56
Speaker
Creating the manifest and applying it to the cluster is super easy. But for a lot of folks who have never deployed a GPU VM in their Azure subscription before, chances are you don't have quota for it out of the gate, because we don't just give everybody quota; you have to go and ask for it. And that's exactly what you would need to do at that point. You'll see it when you check the status conditions on the workspace, like when you do a describe on the workspace:
00:23:26
Speaker
you'll see some conditions there, and they'll tell you whether or not that resource is ready. If you wait, normally it takes about 15-ish minutes, maybe even quicker, to provision everything end to end. But if you start to see that the resource-ready status condition is set to false, that's when you have a clue that it had some trouble provisioning that resource.
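A quick way to watch this yourself is sketched below; the workspace name is a placeholder and the printed columns and condition names are approximate, depending on the KAITO version.

```
# Watch the workspace come up (column names are approximate).
kubectl get workspace workspace-falcon-7b -w

# Inspect the status conditions mentioned above; a resource-ready condition stuck
# at False is the clue that node provisioning (often quota) is the problem.
kubectl describe workspace workspace-falcon-7b
```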
00:23:52
Speaker
Then at that point, you can go take a look at the GPU provisioner logs, and you may see in there that you didn't have quota in that particular region. At that point, it's just a simple support request in the Azure portal. They've gotten pretty good; oftentimes the quota requests go through an automated process and you get approved fairly quickly. Gotcha. I'm assuming, since you are using Karpenter under the covers,
00:24:22
Speaker
if I have an application that's scaling and sending more requests to the LLM itself, it sees that need for additional resources and adds more GPU nodes. Is that a good assumption? How does scaling work?
00:24:36
Speaker
Yeah, just a note on that. I think, and Sachi can correct me if I'm wrong, but it uses the Karpenter APIs at this point, and I know that the team is working to eventually flip over to Karpenter, or AKS's node autoprovisioning, which is the Azure implementation of it. Right now there isn't autoscaling implemented, but it's definitely something we're working towards on the roadmap. Gotcha. And the best part of this being an open source project is that the roadmap is also on the GitHub repository, so if you want a specific feature, I'm sure you can go in and request or vote for that feature. Oh yeah, exactly. Okay. Well, just going through all the CRUD operations: if I delete the custom resource, that means my nodes go away and I'm not paying for that GPU resource anymore?
00:25:29
Speaker
Yeah, that's right. As soon as you delete the workspace, because remember the workspace is the custom resource you're dealing with, then yeah, the node goes away too.
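And that cleanup really is just deleting the custom resource (workspace name is a placeholder):

```
# Deleting the workspace removes the inference deployment and the GPU node it
# provisioned, so you stop paying for that GPU resource.
kubectl delete workspace workspace-falcon-7b
```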

Containerization and Model Fine-tuning

00:25:41
Speaker
Okay, so now that's all on the inferencing side. Let's talk about fine-tuning, because inferencing is easy to get my head around; fine-tuning is obviously the next level. How do I make sure an open source model is aware of the context, or has the knowledge of my database or datasets that I might have? How does KAITO enable that workflow?
00:26:03
Speaker
Yeah, so I can give a little bit of background on the different approaches to fine-tuning and then the parameter-efficient fine-tuning that our engineers chose to integrate into KAITO. With fine-tuning, you can essentially change all of the weights within a given model, or you can choose to edit only certain weights within your model.
00:26:25
Speaker
I'm describing it in very high-level terminology for our average Kubernetes user. When you choose parameter-efficient fine-tuning, you can be more efficient with the resources, use fewer of them, and also save costs.
00:26:42
Speaker
The versions of fine-tuning that we support with KAITO are LoRA, low-rank adaptation, and QLoRA, quantized low-rank adaptation. Essentially, with fine-tuning support in KAITO, there are the two processes that I mentioned earlier, and KAITO integrates containerization throughout those processes.
00:27:04
Speaker
So you have your new input dataset, and you can pull it in however you store it, whether that's blob storage or a container image. You format it in a certain way; we give instructions based on what you're looking to achieve with that dataset,
00:27:23
Speaker
and then you choose LoRA or QLoRA, and within your fine-tuning workspace, the output is an adapter layer that is stored, temporarily, as a container image. Within that adapter layer you have these rank decomposition matrices; essentially, they represent changes in certain parts of your model that capture the new data you ultimately want to tune the model with. You can create many of those layers with different types of new data, and they are stored as separate container images. And in your new inferencing job, you can pull in what are called adapters; you can put one or more of them into a new inferencing job and essentially be working with your smarter, tuned model. Okay.
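For a rough idea of what that tuning workspace looks like in practice, here is an illustrative sketch based on the fine-tuning examples in the KAITO repo; the preset name, dataset URL, registry, secret names, and exact field names are placeholders and may differ by release.

```
# Illustrative QLoRA tuning Workspace (field names approximate; URLs, registry,
# and secret names are placeholders -- see the KAITO fine-tuning examples).
kubectl apply -f - <<EOF
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-tuning-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: tuning-falcon-7b
tuning:
  preset:
    name: "falcon-7b"
  method: qlora                    # or lora
  input:
    urls:
      - "https://example.com/datasets/my-training-data.parquet"   # formatted per the docs
  output:
    image: "myregistry.azurecr.io/adapters/falcon-7b-tuned:0.1"   # adapter layer is pushed here
    imagePushSecret: my-acr-push-secret
EOF
```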
00:28:18
Speaker
And you said I can provide my custom datasets, either using a blob storage endpoint or a container image. Does it have to be just a collection of files in a bucket? Or can I have a relational or non-relational database and use that dataset to fine-tune my model as well?
00:28:39
Speaker
For the most part, what I've seen is flat files, so CSV and Parquet files. They could be files that you have within your own systems, but if you see a dataset out on Hugging Face, and that dataset follows the formatting guidelines for the model that you're trying to fine-tune, you can literally just point to that dataset and pull it in. Okay, and when I'm pulling the dataset from Hugging Face, does it just spin up as a container with a PVC attached, or is it going out to the Hugging Face endpoint every time?
00:29:19
Speaker
So there's actually a great diagram in the fine-tuning section of the GitHub repo. Basically, what happens is, when your fine-tuning workspace gets deployed, it's a pod with multiple containers in it, and one of them is an init container. That init container will go out, download the dataset from Hugging Face, and at that point you can work with it. Okay, so everything is running locally on my AKS cluster; I'm not going outside. Perfect. You only have to pull it down that one time, yeah. Okay.
00:29:49
Speaker
And then when I have that adapter version of the model ready, am I just running it locally, or do I also get the option to push it to my private registry as a new version of the model? The latter, you have that option. Actually, you're going to have to put it somewhere. If you want to put it in a public registry, you could, but most often we'll see customers push it into their own Azure Container Registry or wherever.
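Closing the loop, pulling that adapter back into inferencing looks roughly like the sketch below: an inference workspace that layers the adapter image produced by the tuning job on top of the base preset. The adapter field names are approximate and the registry and secret names are placeholders, so treat this as illustrative only.

```
# Illustrative inference Workspace that layers a tuned adapter over the base preset
# (adapter field names approximate; registry names are placeholders).
kubectl apply -f - <<EOF
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b-tuned
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: falcon-7b-tuned
inference:
  preset:
    name: "falcon-7b"
  adapters:
    - source:
        name: falcon-7b-adapter
        image: "myregistry.azurecr.io/adapters/falcon-7b-tuned:0.1"
      strength: "1.0"
EOF
```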
00:30:19
Speaker
Okay, so all of this is great, but I want to look at the alternative. If I was not doing it through KAITO and Kubernetes, how difficult is it? How much complexity are we reducing by using something like KAITO and Kubernetes? Let me put it like this: I'm not an AI or ML expert. Me too, yeah. He's a great expert here.
00:30:47
Speaker
Yeah, but just by using these tools, that's done like 90% of the work for me, and it helps me understand and actually accomplish some of these goals, right? So what are the other alternatives? I guess the other alternative is cracking open your own Python notebook, importing the Hugging Face packages, and working with it that way. That doesn't sound like much fun to me, though.
00:31:15
Speaker
I mean, to some people it probably is, yeah. Yeah, I think the people who can do that, crack open notebooks and author all of that, are getting paid a lot more than I'm assuming I am.
00:31:34
Speaker
We're just Kubernetes people. Oh, sorry, one other point: all of that really plays into version control. You don't want to be maintaining these large model files for all the different versions of your model. You want to just maintain what is going to change along the way, whatever new data you have coming in, whether from your intelligent app or whatever service you're running.
00:31:59
Speaker
You can just store those temporarily changed layers of your model and then pull them in as needed. You don't always need to be maintaining 150-gigabyte model files and downloading them every single time. Okay, that is helpful. One question that leads me to is:
00:32:20
Speaker
KAITO helps me package up models as containers, but now I have a fine-tuned model, which is a container image, and I can only run it on AKS. Is there a way to take that fine-tuned model and run it somewhere else? Can I just have the model file available outside that container image, or is it always going to be in a container image when fine-tuned through KAITO?
00:32:43
Speaker
I think it's the latter, because part of the fine-tuning job, once the fine-tuning is complete, is to package it into a container and send it on its way. That's a good call-out, though. But yeah, as of right now, from what I'm aware, it packages it. Okay, gotcha. And one thing you guys mentioned at the beginning: for each of these models, even outside the time it takes for GPU worker nodes to be provisioned, just pulling down the container image for some of these larger models might take time. Are there any efficiencies around caching layers if I'm downloading the same model over and over again? How does that work? Are there any efficient ways of downloading these models, or is it a complete new pull every time?
00:33:36
Speaker
It's going to be a new pull once the node comes up to actually host the model. From what I'm aware of, I don't think there are any other efficiencies we can take advantage of. I know we have a feature in Azure Container Registry called artifact streaming; I don't think that will help very much in this particular scenario. I was actually talking to Ishaan, one of the engineers on this, about it, because I asked him the same thing: can we take advantage of streaming to download faster? He said, well, because it's a model, you're probably going to want everything in the model for it to behave the way you intended it to.
00:34:23
Speaker
But I think it's something that they were going to look into, or may already be looking into. I don't think there's anything other than having to go pull that, however many gigabytes that file or container image is.
00:34:39
Speaker
Okay, gotcha. KAITO, I think I first learned about it last year, maybe at KubeCon. It has been a while, at least six to eight months.

Community Engagement & Resources

00:34:50
Speaker
Are there any customers that are using it, people in the community who are trying it out and also contributing back, or is this mostly a Microsoft-led project? Yeah, so we've presented it in several different customer meetings and also at KubeCon this past year, and we've seen great feedback. I think people are definitely testing it out and seeing how it fits into their AI pipelines. We've had many feature requests flow in, so lots of things are being added to our roadmap, and we're excited to see people start contributing, especially once it becomes compatible with other managed distributions of Kubernetes. Okay, that's awesome. I think as of right now, probably most of the community feedback is requests for more presets.
00:35:39
Speaker
I mean, if you get a kick out of doing that sort of work, trying to align the stars appropriately and figure out which VMs to use, great, but it's always better to have folks on the engineering team put together these presets. A great example of that is the new addition of the Phi-3 model, where the turnaround was pretty quick after the model got released. I think that's something we encourage everybody in the community to do: if you have a model, or know of a model, that you think would really benefit from being part of this project, submit that issue. Or I think it's a pull request; I can't remember, but there's a process.
00:36:25
Speaker
No, I think having that preset definitely helps, because I don't want to figure out which instance to use for which size of model, or how much memory I have available. Somebody should do it for me. I'm here because I want a managed Kubernetes experience; please don't ask me more questions, just deploy it for me.
00:36:45
Speaker
I like that. So, Sachi, as a product manager, this is definitely an interesting space. What's next? Where do you see the community leaning? Are they asking for more models to be supported, or more Kubernetes features to be supported? What's next for KAITO? Yeah, great question. We're actually in the process of applying for CNCF Sandbox, so it can become official within the cloud native community.
00:37:13
Speaker
I'm excited for that process to kick off and to really start getting more contributions from people in the community. Within the roadmap itself, we're working on the design for RAG. That is an alternative to fine-tuning, and it really just depends on your scenario: whether you want to fine-tune and actually change the base model you're working with, or just make it better at retrieving information for the specific context, the data that you want. That design for RAG will also involve integrating with some kind of orchestration framework, so our engineers are definitely working on that currently.
00:37:50
Speaker
And yeah, those are some of the bigger things that are part of our roadmap, and then, as I mentioned before, extending to other cloud providers. Gotcha. I think RAG is definitely interesting, because instead of fine-tuning a new model every time with my custom dataset, I can just have it ready, and the dataset is always up to date; I can point a newer version of my model at it and have it not hallucinate and give me the right answer. So when you're looking at RAG deployments, are there any specific vector databases that you're working with? Will those be part of the KAITO project itself, or do you leave that up to the user? Yeah, when we think about the scope of KAITO, it's essentially in the name itself: it's a toolchain, so it's part of your entire AI pipeline. And when you look at some of the vector databases that are open source projects, I think
00:38:45
Speaker
MinIO is one of the options there as a cloud native option. So we wouldn't really want to create an entire platform that encompasses those as well; essentially, we want KAITO to fit in and work alongside those existing vector database options, for example.
00:39:05
Speaker
Okay, interesting. I didn't know MinIO was working on a vector database; I need to look that up. Okay, that's awesome. I think that's one example that I've heard of. Gotcha. Okay, so I personally have looked at Microsoft Reactor videos on YouTube, and I think you guys have done a lot of live streams. Hopefully this podcast helps too. But how can users get started, learn more about KAITO, and get their hands on this?
00:39:35
Speaker
Yeah, so you're right, we did do a live stream. It seems like eons ago now; Ishaan and I did a live stream on the Microsoft Reactor YouTube channel. You can find the recording over there, or you can also find it on the AKS community channel on YouTube. And Bhavin, I think you were a guest on that show at one point. Yeah, so there's a playlist for Learn Live on the AKS community YouTube channel, and the recording of that live session is there. Inside the video description there's a link to the actual hands-on lab as well. So if you have an Azure subscription and you have GPU quota and you just want to try it out real quick,
00:40:26
Speaker
give it a try. It's not going to break the bank just by keeping your workspace up for, let's say, an hour. Just don't forget to delete your workspace when you're done testing it.
00:40:39
Speaker
No, hands-on labs are always important, because they give you that controlled environment. Even if it's in your own Azure account, it gives you the exact steps to follow, and it makes it really easy to get started with something. So I need to find that link and go through it myself. Yeah, and I'll send you some of these links so you can share them in your show notes. That's awesome. Perfect, that saves me at least 15 minutes of Googling for all of these things, so thank you, Paul. I appreciate that. Yeah, no worries.
00:41:08
Speaker
Anything else that you guys think I didn't ask and we should make part of this episode, or are we ready to close this up? Yeah, I think another resource that could be helpful is the AKS documentation. For the managed add-on of KAITO, we built out some more conceptual docs to better understand fine-tuning, why it might be necessary and helpful, as well as small language models like the Phi-3 family versus large language models, and when you would want to use each. Oh, I like that. I think I'll go and read up; my fine-tuning skills are not as good as I want them to be, so that's a useful resource. Thank you, Sachi.
00:41:52
Speaker
Perfect. Okay, I think that brings us to the end. Thank you so much for being on this; it was definitely a great learning experience. I did do some research, but I still learned so many things during the actual recording, so I really appreciate both of you joining the podcast. Awesome, thank you for having us. Yeah, thanks for having us.
00:42:07
Speaker
Okay, that was a fun episode. I learned so many new things. I went in with some background research around KAITO and how it helps users run models locally on their AKS clusters, so the model is closer to the application that needs access to an LLM instead of using a managed service. All of that is good, but I learned how they have enabled not just models that are provided by Microsoft as container images, but also a path where people can take open source models, package them up as container images on their own, and then deploy them on any AKS cluster from their private container registry. That's obviously a huge deal. And then the second fact:
00:42:51
Speaker
in addition to inferencing, and maybe building those retrieval-augmented generation or RAG architectures to enhance your inferencing workflows, you can now actually use the KAITO project to do fine-tuning as well. Again, fine-tuning, I know the theory of it, but I haven't done it personally or read up on it much, so that's something new that I personally need to go and look into: how I can take an open source model and customize it with my own dataset. I'm glad that something in the Kubernetes ecosystem is making it easier for me to do that.

Conclusion & Celebrating the Journey

00:43:27
Speaker
So those are a couple of my key takeaways. I think it's an interesting project. I'm just waiting for them to expand it to other Kubernetes distributions, so that more people in the community can use it instead of just customers on Microsoft Azure. With that, I just want to reiterate the fact that
00:43:46
Speaker
three years, it's been a great journey. Thank you so much for being a listener. I will again ask everybody to share this podcast with their friends. If you find value in this, and even if you don't, maybe you should share the podcast regardless. That brings us to the end of this episode. I'm Bhavin, and thank you for listening to another episode of the Kubernetes Bytes podcast.
00:44:08
Speaker
Thank you for listening to the Kubernetes Bytes podcast.