
Generative AI on Kubernetes

S4 E5 · Kubernetes Bytes

In this episode of the Kubernetes Bytes podcast, Ryan and Bhavin sit down with Janakiram MSV - an advisor, analyst and architect to talk about how users can run Generative AI models on Kubernetes. The discussion revolves around Jani's home lab and his experimentation with different LLM models and how to get them running on NVIDIA GPUs. Jani has spent the past year becoming a subject matter expert in GenAI, and this discussion highlights all the different challenges he faced and what lessons he learnt from them.   

Check out our website at https://kubernetesbytes.com/  

Episode Sponsor: Elotl  

  • https://elotl.co/luna
  • https://www.elotl.co/luna-free-trial  

Timestamps: 

  • 02:02 Cloud Native News 
  • 15:31 Interview with Jani 
  • 01:11:00 Key takeaways  

Cloud Native News: 

  • https://www.techerati.com/press-release/octopus-deploy-acquires-codefresh-to-boost-kubernetes-and-cloud-native-delivery/
  • https://www.civo.com/blog/kubefirst-joins-civo  
  • https://cast.ai/kubernetes-cost-benchmark 
  • https://www.techradar.com/pro/vmware-customers-are-jumping-ship-as-broadcom-sales-continue-heres-where-theyre-moving-to 
  • https://cloudonair.withgoogle.com/events/techbyte-making-ai-ml-scalable-cost-effective-gke
  • https://dok.community/dok-events/dok-day-kubecon-paris/ 
  • https://training.linuxfoundation.org/certification/certified-argo-project-associate-capa   

Show Links: 

  • https://www.youtube.com/janakirammsv 
  • https://www.linkedin.com/in/janakiramm/
  • NVIDIA Container Toolkit - https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
  • NVIDIA Device Plugin - https://github.com/NVIDIA/k8s-device-plugin
  • NVIDIA Feature Discovery - https://github.com/NVIDIA/gpu-feature-discovery
  • Hugging Face Text Gen Inference - https://huggingface.co/docs/text-generation-inference/index
  • Hugging Face Text Embeddings Inference - https://huggingface.co/docs/text-embeddings-inference/index
  • ChromaDB - https://www.trychroma.com/
Transcript

Introduction to Kubernetes Bites

00:00:03
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Walner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts.

Podcast Focus: Cloud-native news & expert interviews

00:00:14
Speaker
We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:00:30
Speaker
Good morning, good afternoon, and good evening wherever you are. We're coming to you from Boston, Massachusetts.

Episode Date & Greetings

00:00:36
Speaker
Today is March 12th, 2024. Hope everyone is doing well and staying safe. Bhavin, spring is right around the corner. You can feel it. I know. I can feel it. Yes. Yeah. Let's go. You say the same thing.
00:00:51
Speaker
That's what happens when we talk for a long time, dude. Come on. I know, I know, right? I know. Well, I am always cognizant of second winter though. We haven't had second winter, which is usually, you know, what happens in the Northeast when you finally get days when it's 60 degrees and then it's like mother nature says, haha, just 30 degrees again.
00:01:13
Speaker
I'm still hoping that I think this time for Groundhog's Day, the Groundhog didn't really see his shadow. So like, I'm hoping it's a shock to him. He said early spring, but you know, do you trust a Groundhog? Do you trust him? Nope. Yeah, if I would like that, that's not a good sign too. So nope, I don't trust a Groundhog. But I can take some directions from people. Hey, yeah, no, I'm hopeful as well. I'm ready.
00:01:38
Speaker
I'm ready for some warmth. We'll see. We might get one more snowboard day in my local mountain if they have enough snow, but we'll go with it.

Upcoming GenAI Discussion Teaser

00:01:48
Speaker
Go with ice. So yeah, today we have an awesome guest, and we're going to talk about some great stuff in the generative AI world.

Cloud-native News Segment

00:01:58
Speaker
But before we dive into who that is and all that good stuff, we do have some news. So why don't you kick us off, Bhavin?
00:02:05
Speaker
Yeah, sure. So I have a couple of acquisitions. Again, not big time mergers and acquisitions, but still relevant to the cloud native ecosystem. First one being Octopus Deploy acquired another startup called Codefresh. The reason I found out about Octopus Deploy is because I knew what Codefresh did and now they got acquired by somebody else. How long has Octopus Deploy been around?
00:02:27
Speaker
I have no idea. Like I, I didn't do that research. I feel like not that, not that old. Yeah. Codefresh was really early stage. Like I remember us discussing their pre-seed or seed round last year after KubeCon. So like they are brand new. I don't even know if they have a bigger team, might be less than five people, but they were building something cool around Argo and they were one of the maintainers for the project. Founded in 2012, Octopus Deploy.
00:02:52
Speaker
Okay, so they are a big enough player, or at least have been around for a long time. It's interesting. I always thought they took the name Octopus because it had to do something with the number of sides on the Kubernetes logo, maybe that fit, but in 2012, there was no Kubernetes logo. Or GitHub's an octopus too, right? That's true. Codefresh was 2014. I was going to say, I remember Codefresh when I was at Athena Health, so I was there.
00:03:17
Speaker
Earlier, I mean, of course, companies' founding days are very different from when they actually make some noise. Yeah. OK, so Octopus Deploy officially acquires Codefresh to boost its Kubernetes and cloud native delivery platform or delivery solution.
00:03:32
Speaker
Codefresh has been an Argo maintainer and a prominent figure in the community CD ecosystem. Now, by integrating these two solutions, Octopus Deploy can expand their platform and offer that continuous delivery component or continuous delivery solution as well. As with all the acquisitions right now, we don't see any price. We don't see what the valuation was. Everything is just a big black hole. I'm just happy that people are finding exits in this weird market.
00:03:57
Speaker
Honestly, I'm still hoping for the IPO market to open up, so we see some more exits from our CNCF community startups. Hopefully, the Reddit IPO later this month unplugs the blockage, I guess, and we see a lot more IPOs.
00:04:14
Speaker
And then the second acquisition was another smaller one, right? Like previous guests on the Kubernetes Bytes podcast, kubefirst's Jared and John, those guys got acquired by Civo, Civo Cloud. So I know Civo does offer a managed Kubernetes service, especially for the SMB space. And they have a really cool solution there. I remember seeing a couple of talks that were posted after their Civo Navigate conference last month, I think.
00:04:40
Speaker
But yeah, congratulations kubefirst team. They can still operate as a separate entity according to the blog post and the LinkedIn messages, and kubefirst will still continue building their GitOps-based solutions for different cloud vendors and cloud managed Kubernetes distributions. But now they'll do that from under the Civo family. So I guess, yeah, congratulations to both teams.
00:05:09
Speaker
Yeah. And then, yeah, I mean, we had John and Jared on, what was it, June last year, talking about GitOps and authentication and all sorts of stuff. So if you haven't listened to that episode, spoiler, you can go see what we were talking about back then.
00:05:24
Speaker
Yeah, they have a really cool open source solution that instantiates that entire cloud-native environment for you using opinionated open source projects. So that was a good episode that we did. And then finally, for news for me, CAST AI introduced or launched their new Kubernetes cost benchmark report. And it definitely turned some heads on LinkedIn or people were definitely sharing it with their own networks.
00:05:49
Speaker
I think the main part of the report was they analyzed over 4,000 clusters. And I know this is just to feed more information for companies like CAST AI or Kubecost, but out of those 4,000 clusters analyzed,
00:06:04
Speaker
They saw a really low CPU utilization and memory utilization across the board. So like if you had like a 50 CPU cluster, that might be like a five node, seven node cluster. People are only utilizing, let's say 13% of the overall CPUs that were available to them. And then if you actually talk about scales, a thousand plus more CPUs, that might be like a 20, 25 node cluster at least. Only 17% utilization. So people are not
00:06:30
Speaker
trying to utilize their Kubernetes clusters completely or failing to do a good job at it. This report made it evident. And obviously, since CAST AI is in this ecosystem, they have a best practices section on how you can use things like spot instances for your worker nodes inside your Kubernetes clusters, how you can use autoscaling and some of the projects like Karpenter, which was originally part of AWS but is now an open source project, and how you can right-size some of the workloads that you're running through
00:06:59
Speaker
either tools provided by vendors or building something of your own. So if you are listening to this, go and check your own Kubernetes clusters. Maybe download a free tool. I know OpenCost is available. Run that, analyze your clusters, and see how you can save on cloud costs. So those were the three things that I wanted to share.
00:07:15
Speaker
Yeah, I mean, I think it's surprising, but not at the same time, right? I mean, we saw this with cloud in general. So seeing it with Kubernetes and knowing how much effort the community is putting into metrics and monitoring and cost lately in the past year or two.
00:07:35
Speaker
Uh, not terribly surprising. Uh, also just means, I think, you know, we're at this point where we really started, we're really starting to care because more and more data centers are running Kubernetes. So now we're like, all right, we gotta, we gotta rein this in. So, uh, yeah, it's a cool, definitely go take, take, take a look at it. Um, as you said,
00:07:56
Speaker
Surprising, but like the surprising part for me was the percentage utilization, like less than 20 percent. That's a lot lower than what I anticipated. I thought like 50, 60 percent, or sorry, 40, 50 percent, but yes, 17 and 13 percent is just too low. Yeah, 17, 13 is pretty low. Yeah, but you know, we shall see how we do in the future, I guess, as a community. Yeah.
00:08:22
Speaker
Just a couple quick news items from me. One is an article from Tech Radar really talking about where customers are going, given if they were VMware customers prior. I know this is something that probably a lot of people are starting to or in the middle of thinking about if you have a lot of VMware in your data center, in your shop.
00:08:44
Speaker
Because of all the changes that are happening to licensing and products, and the free tier is all gone. So basically it's a really interesting article just reading through sort of the other alternatives out there like
00:08:57
Speaker
Hyper-V and KVM and some other ones, and it really just kind of focuses on the fact that people are flocking, at least in what they see in this article, to KVM-based, Linux-based alternatives. It says about two-thirds of the respondents are looking at KVM and Xen-based hypervisor alternatives. So I'd love to hear from our audience a little bit about whether you're dealing with this, right?
00:09:22
Speaker
either reach out to us on Slack or send us an email from the website. I'd really like to get a better sense of our community and what they're doing if they're involved in these types of decisions. It'd be really interesting to get some first-hand feedback.
00:09:43
Speaker
No, I think I like that, right? I know you listed Hyper-V and KVM as alternatives, but the Kubernetes ecosystem has an option for running virtual machines as well.
00:10:10
Speaker
Yeah, try it out and go listen to that episode as well. Another plug for us. I'm not doing it on purpose, I swear. The next one, which I thought was super relevant for the episode today, which hopefully we'll get the episode out today. There's a tech bite webinar from the folks at Google Cloud kind of talking about similar things that we did today, right, which was
00:10:38
Speaker
You know, how do you get these GenAI sort of architectures with LLMs on Kubernetes? In this case, they're going to be talking about GKE, but you know, how do you do it in sort of a simple and scalable and, you know, these are their words, cost-effective way. Just, I think, some other resource in the community, and I think for those that are using GKE or, you know, have
00:11:01
Speaker
used that cloud in the past, definitely something worth going and checking out, if you can get to it live. Probably not, actually, because we were supposed to record this yesterday, I remember, and that webinar is today. So you will have to watch a rerun of it. Yeah, exactly. I think you can watch it on demand at the same link that I will provide. So definitely go check that out.
00:11:27
Speaker
Also, a little plug for Data on Kubernetes Day in Europe. If you are going to KubeCon and hopefully you signed up for a DOK day because it is a great place to go talk about Data on Kubernetes and with some awesome talks for a Day Zero event.
00:11:45
Speaker
It should have been free to sign up when you had signed up for KubeCon, so I don't know. Sometimes there's a way for you to go check out the DoK event, and if the room's not full, maybe you can get in. If you didn't sign up for it, I wouldn't discourage you from trying to show up and
00:12:04
Speaker
and saying hi to those folks and seeing how that day is going. So definitely go check that out if you're in Paris. Bhavin and I will not be, right? I know. I'm sad about that, dude. I'm sad about it. It's like the first one; usually at least one of us is going to be there. Yeah. So if listeners of this podcast find anything interesting, share it with us. We'll do a news recap like we always do, but this is
00:12:29
Speaker
this time it's truly going to be like a virtual experience, and what we learn from the conference will come through social media or through blog sites. But yeah, if you find something interesting or pictures to share, tag us and we'll include that as well.
00:12:45
Speaker
And then the last piece of news, I guess you could qualify it as news: I saw that the CNCF had a new Certified Argo Project Associate, CAPA. Basically, the certification is good for three years. But I think more and more folks are starting to get used to the GitOps workflows and Argo is a great tool in this space. We've talked about it before on the GitOps
00:13:09
Speaker
episode, you know, third plug, there you go. But this CAPA certificate is interesting. I think, you know, as we saw with tons of them, you know, with the Kubernetes administrator or the application developer or the security one, to see sort of, you know, one specific to
00:13:26
Speaker
Argo and GitOps, really cool. So go check that out. It's only, I think, $250 for the exam. So pretty reasonable, all being said, and you'll come out sort of an Argo expert, so to speak. That's what I had for news, Bhavin.
00:13:44
Speaker
Okay. Let's introduce the topic and then bring on our guest. Sounds good. So we have, um, GenAI with Jani today, um, Janakiram MSV, who is an analyst, advisor and architect. Uh, you may have seen a bunch of his, uh, articles maybe on Forbes or The New Stack or those kinds of things. He is very active in the community and really, really knowledgeable. So.
00:14:10
Speaker
We're excited to have him here on the show to dive into what it looks like to get these generative AI
00:14:19
Speaker
architectures set up on Kubernetes and what that really means with a different layer. Without further ado, let's welcome Jani to the show. This episode is brought to you by our friends from Elotl. Elotl Luna is an intelligent Kubernetes cluster autoscaler that provisions just-in-time, right-sized, and cost-effective compute for your Kubernetes apps.
00:14:41
Speaker
The compute is scaled both up and down as your workloads demand change, thereby reducing operational complexity and preventing wasted spend. Luna is ideally suited for dynamic and bursty workloads such as dev test workloads, machine learning jobs, stream processing workloads, as well as workloads that need special resources such as GPUs or ARM-based instances.
00:15:09
Speaker
Luna is generally available on Amazon EKS, Google Cloud GKE, Azure AKS and Oracle OKE. Learn more about Luna at elotl.co/luna and download their free trial at elotl.co/luna-free-trial.

Guest Introduction: Jani, Market Analyst

00:15:30
Speaker
All right, welcome to Kubernetes Bytes, Jani. Why don't you give an introduction of who you are and what you do and then we'll jump right in.
00:15:41
Speaker
Sure. So I am essentially a market research analyst based out of Hyderabad, India. I do two things. One, I analyze the market trends and I publish my content at Forbes, The New Stack, InfoWorld, some of the media outlets. And my second line of business is working with startups and platform companies as an advisor. So I work with the CMO and CTO
00:16:05
Speaker
defining the product roadmap and also working on the outreach, evangelism, and developer outreach, essentially. So those are two roles that I play. Yeah, you are a person with many hats, as they say. I can vouch for that, having worked with you in the past. If you don't know what Jani's been up to, just go Google his name, basically, and he's got such great content out there. So hopefully we'll get a little sneak peek of what you've been into in the GenAI space.
00:16:34
Speaker
Today, so we're here to talk about, um, GenAI. So, uh, Bhavin, I think we'll kick it off. I'll kick it off. Right. Uh, so, uh, Jani, like, I think I like the name of this episode. I don't know, Ryan, if we'll end up choosing this, but GenAI is what we call it. That's like a perfect topic, but let's start there, right? Like there has been a lot of buzz around GenAI for the past year and a half now since GPT-3 came out the door.
00:16:59
Speaker
Can you give us an elevator pitch, a level set for our audience?

Evolution of AI to Generative AI

00:17:03
Speaker
What is GenAI? What does it include? What are the different types of assets that you can generate using these new LLM models?
00:17:11
Speaker
Right. So to set the context and to make sure that the cloud native community is on the same page, I just want to briefly talk about the evolution and what got us to GenAI. Right. So I actually saw three different milestones in the evolution of AI. So the first milestone was
00:17:35
Speaker
very simple machine learning, which was essentially done on CPUs and any desktop PC, the favorite framework being scikit-learn. And all you could do was linear regression and logistic regression, basically predicting a future number or classifying something. So that was the first milestone.
00:17:55
Speaker
And a lot of work has gone into it. A lot of Python developers used pandas, NumPy, scikit-learn, and that in itself was pretty powerful. That was the first generation of machine learning and initial AI. And then the second phase was all about neural networks.
00:18:15
Speaker
And that basically took the basic machine learning and traditional machine learning to more sophisticated neural network algorithms. And we had things like computer vision based on convolutional neural networks, and then the reinforcement learning, the recurrent neural networks,
00:18:33
Speaker
more sophisticated techniques like LSTM, long short-term memory. So that lasted for a while. I think for about four or five years, AI was typically meant using neural networks. Then the third phase or the third generation is what we are currently experiencing and this is generative AI. So there was a very
00:18:58
Speaker
abrupt, overnight transition from neural networks to the transformer network. That opened up doors to generative AI. What is the difference between the previous generation versus the current generation?
00:19:13
Speaker
Traditional AI was all about predicting. It was all about predictions. Machine learning and AI was used for anomaly detection, predicting fraudulent transactions, classifying dogs versus cats and hot dogs versus burgers, stuff like that.
00:19:35
Speaker
all about either prediction or classification, and then generative AI took that to the next level, and the uniqueness of that is the ability to generate content, you know, which we are now seeing through ChatGPT and multimodal Stable Diffusion and tools like, you know, Midjourney and Runway. So
00:19:57
Speaker
When AI is able to generate content and reproduce content, it reaches the masses. That was the inflection point where everyone saw value in AI because it was not confined to academia, it was not confined to researchers, it was not confined to very niche groups, but it suddenly became highly relevant and everyone was in awe because
00:20:20
Speaker
It just influenced and impacted almost everyone's life because it could generate content. That means a lot for marketing folks, for copywriters, for content producers, creators, influencers. Everyone literally got impacted by this. The fundamental difference between generative AI and the previous
00:20:41
Speaker
generation of AI was predictive versus generative. So today we are living in an era where AI can generate a lot of content, which is just mind-blowing.
00:20:52
Speaker
Yeah, it really is fascinating, especially when you think about, you know, I think a lot of people when they start thinking about AI, they think of something like the Turing test, which was what the 1950s, right where we were. You know, that was a long time ago at this point. And, you know, it's gone through, like you said, a lot of iterations. I think, you know, people know the terms machine learning, they know the terms
00:21:12
Speaker
deep learning, but maybe aren't as familiar with or, you know, excited about what those things were because they didn't get to interact with it. I think that transition to the consumer, right? Yeah, it was the biggest reason for
00:21:28
Speaker
why I think a lot of people started to really latch onto it. Although, I think, you know, there also is sort of a downside, right? A lot of us maybe grew up in a way where we did work with chatbots that were terrible, that weren't really backed by these new generations, right? So we might have some
00:21:49
Speaker
PTSD working with some of those too. So I do want to change the topic to focus on the cloud native, the Kubernetes side, because I want to make sure we hit on that mostly for this podcast.

Kubernetes in GenAI: Serving Models

00:22:02
Speaker
So the term Gen AI, and as you described, what does it mean to the cloud native community and Kubernetes? Where does it fit in?
00:22:12
Speaker
Absolutely. So that's been my focus for the last eight months to a year. And I see two different categories where GenAI and cloud native have a very symbiotic relationship, right? So first, the short term tactical opportunity is
00:22:33
Speaker
how do we exploit the cloud-native infrastructure to serve GenAI models and make researchers', engineers' and developers' lives easy? That's what Kubernetes is meant for. Kubernetes, as Kelsey Hightower called it, is the platform of platforms. Today, we are seeing the real implementation and that
00:22:56
Speaker
that entire analogy becoming very concrete and realistic with platform engineering and all of that. Just like Kubernetes has become the platform for application development, Kubernetes is a platform for generative AI and it means quite a few things. As infrastructure architects and infrastructure engineers, how can we simplify the life of
00:23:20
Speaker
a GenAI researcher, a GenAI engineer and a GenAI app developer. How do we become relevant to those personas? What can we deliver to them to make them more productive and more efficient? That is the first opportunity for cloud native community. So for that, we take the building blocks that Kubernetes already offers and then we have these serving layers and we have these vector databases that are coming in and we run these
00:23:50
Speaker
agent infrastructure, expose endpoints, you know, bring in service mesh, bring in observability, you know, all the goodness of cloud native with that added flavor of GenAI specific platform layer, another platform stack. So GenAI platform stack has some essential building blocks like, you know, the GPU operator, the shared storage layer,
00:24:16
Speaker
And then it has, for example, a highly available vector database. It has an embedding model that is exposed. It has one or more LLMs that are exposed. And then there are some web apps that run in batch mode behind the scenes, pulling the data.
00:24:33
Speaker
updating and refreshing vector databases and so on. And then ultimately the topmost layer is the developer experience layer that is going to be exposed to the consumer, who is the ML engineer or researcher or the developer. So that is the tactical, short term opportunity for the CNCF community. Now, in the long term, we are going to see something very interesting.
00:24:55
Speaker
So LLMs, the large language models, are becoming the brain of applications, control planes, and infrastructure. I'll give you a very classic example. So today, if you go to ChatGPT and
00:25:09
Speaker
give some examples. FedEx is a logistics company. Apple is a technology company. For example, what is Emirates? It looks at these two. It's called few-shot prompting. It looks at the examples, and then it figures out you are actually giving the company name and the category. It automatically says it's an airlines company. It is able to do that with the pre-training. Now, imagine
00:25:39
Speaker
You do this few-shot prompting to an LLM and you put that LLM next to the Kubernetes control plane and you actually create a prompt to this LLM saying, when Kubernetes events and logs are emitted,
00:25:54
Speaker
pass that through the LLM and automatically classify them into the severity levels or whatever labels you want to give.
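A minimal sketch of that few-shot classification idea, assuming a locally served, OpenAI-compatible LLM endpoint; the base URL, model name, and example log lines are placeholders rather than anything from the episode:

```python
# Few-shot prompting an OpenAI-compatible LLM endpoint to classify Kubernetes
# log lines by severity. The base_url and model name are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://llm.internal:8080/v1", api_key="not-needed")

FEW_SHOT = """Classify each Kubernetes log line as INFO, WARNING, or CRITICAL.

Log: Successfully pulled image "nginx:1.25"
Severity: INFO

Log: Readiness probe failed: connection refused
Severity: WARNING

Log: node node-3 status is now NotReady
Severity: CRITICAL
"""

def classify(log_line: str) -> str:
    """Send the few-shot examples plus a new log line and return the label."""
    response = client.chat.completions.create(
        model="local-llm",  # whatever model the serving layer exposes
        messages=[{"role": "user", "content": f"{FEW_SHOT}\nLog: {log_line}\nSeverity:"}],
        max_tokens=5,
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

print(classify("etcd cluster is unavailable or misconfigured"))
```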

AI Ops vs ML Ops Clarification

00:26:03
Speaker
Now, the LLM becomes the brain. Literally, it becomes the L1 engineer sitting next to the Kubernetes control plane. With a little bit of plumbing and connecting the dots, you can even raise a Jira ticket when this LLM classifies one of the log
00:26:21
Speaker
outputs and it detects that there is a problem and it automatically goes ahead and raises a JIRA ticket and flags off a support engineer who can take an action. That is where we actually see AI ops coming into the play. There is a misnomer in the community that AI ops is running AI on
00:26:43
Speaker
cloud-native ops or DevOps. No, it is not. AI ops is infusing AI into ops. We shouldn't be confused between ML ops and AI ops. ML ops is the infrastructure and ML plus ops like Dev and Ops. AI ops is infusing AI into ops so that we are more efficient and we're bringing this autonomous operations into the picture.
00:27:06
Speaker
Yeah, I think that's a super exciting place for at least the system side, the operation side for AI. You know, a lot of people, like we said, do use it from the consumer side. And, you know, I think we've had RunWhen on the show here, and K8sGPT, right? Some people might be familiar with those names there.
00:27:24
Speaker
They're really thinking about this problem and trying to make products or projects out of it. And I'm excited to see where that goes, obviously, in the future. Hopefully, where we can work alongside it and it doesn't take our jobs. But it doesn't. But it just makes us more efficient. So the LLM infusion to control planes. I take example of Portworx because I know the architecture pretty well.
00:27:55
Speaker
The storage control plane, which is common for any storage engine on top of Kubernetes, it emits a lot of logs, a lot of data that is coming out of it regularly. If the engine is humming along, it is emitting a lot of logs. Now, you can define certain maturity levels. The first level of maturity is
00:28:15
Speaker
just connect this LLM to a knowledge base, which is living in Confluence or somewhere. And you basically identify what is the error, look up the knowledge base through RAG and say, hey, this error has occurred. Here is the knowledge base. Go do something with it. And you're literally getting human in the loop, right? The second level of maturity is
00:28:37
Speaker
autonomous clusters, taking autopilot to the next level. Again, giving the example of Portworx that all three of us are familiar with. PX Autopilot, take it to the next level where this LLM not only informs you, but it actually remediates.
00:28:54
Speaker
The classic problem with Portworx is etcd ports are not open and each node is not able to talk to the others. It's a classic newbie problem when you're installing Portworx. And it is obvious. When you actually look at the logs, it tells you clearly that the etcd nodes are not able to
00:29:13
Speaker
talk to each other and the nodes are not communicating, which means you have got to open port 2379, as simple as that, but it might really create havoc for someone who is doing it for the first time. But imagine this LLM assumes an IAM role if it is running in AWS, goes ahead and calls Pulumi or a Terraform script to basically open these ports.
00:29:37
Speaker
Boom, you know, Portworx starts working, all the three nodes are up. That's auto remediation, right? So that's where LLMs will get to. No, I'm sure, I'll take this feedback back to the team: guys, we need to work on this. Yeah. I mean, there's so many of these, I think this common problem where, you know, in the configuration, the day one,
00:29:57
Speaker
journey of AI can really help you get off the ground because there's a lot of knowledge and people have done that before, where it can pull from and remediate. There's maybe a little bit of danger of things just magically start working and maybe you don't have all the backend knowledge now to know what it did. That's maybe a separate part of it is like, okay, if it takes remediation step, it should let you know, hey, I did this. That's the other side of it.
00:30:27
Speaker
Gotcha. And so, Jani, thank you for discussing that. I like Ryan's question in the first place.

Jani's Home Lab Setup

00:30:33
Speaker
It gives us an idea or a strategic overview of how AI specifically can be relevant to the cloud native community's workloads. But next question, I want to get more tactical. You listed a lot of things. You listed vector databases. You listed how
00:30:49
Speaker
models can be deployed. Can we get tactical and talk about some of the tools that you might be using in your home lab to get these models deployed on Kubernetes? And if you can, also cover: is Kubernetes only relevant for serving these models? Because I remember seeing an OpenAI talk from a few KubeCons back where they were actually training using Kubernetes as well. So I want you to talk about how Kubernetes is relevant in different stages and what are some of the tools that you recommend. Right.
00:31:19
Speaker
So to set the context in my home lab, I run two massively powerful PCs. Now each PC runs two RTX 4090s.
00:31:32
Speaker
So that means I have four GeForce RTX 4090s. And these two are two individual machines. They are independent. And I run a single node Kubernetes cluster on it. So the reason why I invested, and in India, I could have bought two mid-range sedans instead of those two machines. It cost me a fortune.
00:32:00
Speaker
But why did I invest and why did I really have to go through it? Because today, the biggest challenge is, take any hyperscaler, take any region, take any zone, you cannot get a GPU instance on demand, period.
00:32:19
Speaker
Or at least it's hard, right? Forget about H100 or whatever. You get the most basic GPU that doesn't really help you with much. Exactly. So that's been my frustration. So I'm a big advocate of cloud. I believe in this whole argument of CapEx versus OpEx, but I had no choice because I could never get hold of any GPU instance.
00:32:43
Speaker
The next best thing was to handcraft one and assemble one for myself. I've gone ahead and assembled it from scratch. I got a massive tower cabinet, almost like a mid-range server. It reminded me of some of the Solaris machines that I have seen in my college. They are massive.
00:33:04
Speaker
So I got that, and then I installed, and trust me, it's a pain to get two GPUs plugged into the motherboard, because these motherboards are not server-grade. They are ASUS, or MSI, or one of those. And they're not meant to run two GPUs. They barely manage to get one GPU up and running. The GPUs are huge, too. You know, each GPU is like three slots or something, so I had to get additional stands, and it's a hack.
00:33:34
Speaker
But I managed to get two of them in one chassis, in one cabinet. And then I installed Ubuntu 22.04. I installed the NVIDIA drivers, I installed CUDA, and the moment of truth is nvidia-smi, and it shows up both GPUs.
00:33:52
Speaker
So my son, who is currently in his third year of engineering, was helping me. And both of us started from scratch. We unwrapped the motherboard, we unwrapped the Intel CPU, you know, the i9 14th gen. And from there, to installing Ubuntu and installing NVIDIA drivers and typing nvidia-smi was like a moment of truth. We celebrated at 2am.
00:34:19
Speaker
It started from there. Once that was available, I installed Docker, I installed containerd, and then I installed the NVIDIA Container Toolkit. The NVIDIA Container Toolkit basically exposes the underlying CUDA and NVIDIA drivers to these container runtimes,
00:34:38
Speaker
Docker Engine and containerd. So, you know, the next test validation is you basically do a docker run with --gpus all and the container runtime set to the NVIDIA container runtime, and then you run nvidia-smi through the Ubuntu image, and Docker actually shows you the nvidia-smi output, which is the next checkbox, which means all the plumbing is there. Exactly. The containers are able to see the GPUs, just fantastic.
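As a rough illustration of that checkbox, a small PyTorch snippet run inside a GPU-enabled container (assuming a CUDA build of PyTorch is installed in the image) confirms the same plumbing from the application's point of view:

```python
# Sanity check from inside a GPU-enabled container: is the runtime exposing
# the GPUs to user code? Assumes a CUDA-enabled PyTorch build is installed.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
```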
00:35:07
Speaker
So the next step is installing upstream Kubernetes. I had trouble with RKE2, K3s and other flavors, so I believe in installing upstream. So I installed a single-node Kubernetes using Cilium as my CNI and the basic stuff. And then to make Kubernetes

Kubernetes & NVIDIA GPU Operator

00:35:31
Speaker
talk to this container runtime, you need to install the NVIDIA GPU Operator, which is the next step. So the NVIDIA GPU Operator bridges the container runtime with Kubernetes. Basically, it's the thin layer between the kubelet and the underlying runtime. So now the next phase is you take an Ubuntu 22.04 image and run it as a pod and check nvidia-smi. So now we have three layers of abstraction.
00:36:00
Speaker
So Jani, a question, right? So you said you installed the Docker toolkit for NVIDIA, and then you also installed the GPU operator. Yeah. I know, like, does the Kubernetes, like, I know NVIDIA has a device plugin for GPUs inside Kubernetes. Does that help consolidate some of these things? Or that's a different thing completely? Yeah, so there are two techniques. And this is very important to call out. I'm actually going to document it. So there are two ways, right? One is,
00:36:27
Speaker
My requirement is very simple. Kubernetes runs as one of the processes, but I want to download a model from Hugging Face and write a PyTorch program directly. So I don't want to go to the GPU via Kubernetes. I want CUDA to be readily available at the OS level. So I did everything to get to that point, right? So after that, to really connect Kubernetes to this, you can use the device plugin. So the device plugin
00:36:53
Speaker
is just the bridge between whatever you install. You get complete control of the CUDA runtime, the NVIDIA device driver, and then you take it to the device plugin and you basically surface it to Kubernetes. That is very easy to install and Kubernetes doesn't, NVIDIA's software doesn't meddle with what you already installed. It just uses everything below the stack.
00:37:20
Speaker
But if you don't want to go through this process and you don't mind if CUDA is not available at the OS level and it is only available through Kubernetes, the easiest is to use the Helm chart of NVIDIA GPU operator. That does all the heavy lifting. It basically does all these steps, but it does behind the scenes. You had to reboot once and boom, it just brings up everything. But the side effect of that convenience is,
00:37:49
Speaker
If you want to really run a Docker container, or if you want to directly access the GPU with code, you cannot because Kubernetes locks everything. It takes control of it. Okay. Okay. That makes sense. And if you change a subtle configuration, it breaks. So it's very brittle. It's a very, very, you know, very fragile installation when you go to operators. I don't prefer operator because it's a black box.
00:38:18
Speaker
You can do that when you are running a completely automated bare metal install where you don't care what it is you are running. You are using a thin OS like Flatcar Linux or something like that and you don't really care. But if you are using that as a dev machine and also a single node cluster, I recommend using the device plugin because it gives you the best of both.
00:38:40
Speaker
Okay. And I know that I'm just taking us in a rabbit hole around Nvidia plugins, but can we like, I know the device plugin allows you to time slice. And if you have the compatible GPU to like multi-instance GPUs, does the Docker version or like the Docker toolkit allow for those kinds of GPU sharing algorithms as well? Or that just gives you control and you can do whatever with you want with it. So it's a very, again, very important aspect now.
00:39:07
Speaker
a Docker container or a pod basically assumes that it has 100% access to the entire GPU and it just occupies whatever you give. So what happens is the moment you load a GPU enabled model, for example, when you point Docker to the NVIDIA container runtime and launch a container and look at nvidia-smi,
00:39:37
Speaker
Very quickly, 100% of memory goes to this single container. And after that, when you try to launch another container, you see the dreadful oom message and the container crashes.
00:39:55
Speaker
So it's very hard. Now what I do is I restrict my Docker images, containers to one GPU. That's the reason why I run two. So when you actually set this environment variable that says CUDA visible device is equal to zero, you basically expose, you restrict the runtime to see only one GPU.
00:40:17
Speaker
So I set that environment variable and then I run the Docker container. Now it quietly restricts itself to the first one. And then I use CUDA visible devices is equal to one and I run another model. Right. Okay. And where do you set these environment variables? No, it goes to the container image, you know, most of the inference engines, most of the CUDA based images respect this flag.
00:40:44
Speaker
Okay. Gotcha. Thank you. You cannot really say 0.5. You know, you cannot say a CUDA visible device is 0.5 and expect the model to take only 12 GB out of 24 GB and leave the room for other. No, it doesn't work that way. That is the biggest problem. So it also depends on your inference engine. I'll come to that a bit later, but the hard fact is
00:41:11
Speaker
Unlike CPU cores, you cannot share CUDA cores across containers. That is a big problem. There is no solution. Intel, AMD, and NVIDIA are working with the Linux community to bring support for cgroups and basically treat the GPU like a CPU and have those limits and requests and that whole model, bringing that to GPUs, but currently it's not there yet.
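A hedged sketch of the CUDA_VISIBLE_DEVICES trick described above, assuming PyTorch; in Docker or Kubernetes the same value is usually passed as a container environment variable rather than set in code:

```python
# Pin a process to a single GPU by setting CUDA_VISIBLE_DEVICES before any
# CUDA library initializes. In Docker/Kubernetes you would pass this with
# `-e` or a container env entry instead of doing it in Python.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # this process only sees the first GPU

import torch  # imported after the env var so CUDA initialization respects it

print(torch.cuda.device_count())      # reports 1, even on a two-GPU host
print(torch.cuda.get_device_name(0))  # the only device visible to this process
```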
00:41:41
Speaker
Right. That makes sense. I mean, I think, you know, we're used to that kind of.
00:41:47
Speaker
configuration, you know, if you've been working with Kubernetes or containers for a while, maybe you'd expect that, but we're still early days, right? The reality of it is we're still early on. All right, so looping back around, so you have exposed your GPUs to your Docker environment, to your Kubernetes environment. Now what? What do you run on this thing? Where do you get started? Yeah.
00:42:13
Speaker
You know, the funny thing is, throughout my life, there is one piece of the stack that forces you to go grab a coffee and stare at the monitor.
00:42:25
Speaker
So I think I worked for Microsoft for about 11 years. So installing .NET and Visual Studio was that coffee break moment. You know, you push this DVD double click installers and yeah, actually not one coffee. You got to have three coffees before it finishes. Right. So from there we went to NPM install and NPM install used to take some time. And then we had Docker pull was taking
00:42:55
Speaker
a decent amount of time, giving you some thinking time. And now, after, you know, we figured out how to do caching, and networks have become faster, and Docker has matured, all of that, now the coffee break moment comes when you're pulling a model from Hugging Face.
00:43:11
Speaker
Okay. Well, what's Hugging Face? I'm sure some of our listeners just probably heard that name and said Hugging Face? You know, that's a strange name. Yeah. Could you explain that a little bit?

Hugging Face: The Docker Hub for GenAI

00:43:22
Speaker
Yeah, absolutely. So Hugging Face is to foundation models what Docker Hub is to container images. Gotcha. Okay. Okay. Okay. Just like
00:43:35
Speaker
If you don't do anything, if you don't point to any registry, if you do a docker pull, it goes to index.docker.io/library, or there is a standard URI, and it pulls the image from that.
00:43:52
Speaker
Exactly, right? And same thing when you're done with your docker login, when you do a docker push, it automatically talks to that endpoint. So similarly, Hugging Face has become the new de facto endpoint for sharing artifacts related to generative AI. So there are foundation models, which are the base models, you know, like Meta and
00:44:15
Speaker
Databricks, Google, all of them are training these foundation models. Like Llama 2 is a foundation model. Gemma, recently from Google, is another foundation model. Databricks' Dolly, Falcon. So where is the repository for all of this? Where is the Docker Hub for these foundation models? It is Hugging Face.
00:44:34
Speaker
Got it. And those are pre-built. Obviously, you can go peruse and see what kind of models you're interested in. Absolutely. And they are way bigger. For example, a 70 billion parameter foundation model will be close to 17 GB or 20 GB.

Decoupling AI Models in Kubernetes

00:44:53
Speaker
So there's our coffee moment, like you were saying, right? Yeah.
00:44:57
Speaker
But hugging face is more than just a repository for models. It is also a repository for data sets because models and data sets go hand in hand. So for example, if I curate a data set, I'm just making it up, for example, all the logs of Kubernetes and I create a data set.
00:45:18
Speaker
How do I share it with the community? Well, I upload that to Hugging Face and it sits in the datasets section of Hugging Face. So you go to my username, janakiramm, slash, you know, you can actually pull the dataset, pull that foundation model, pull that fine-tuned model. So it is basically the hub for all the artifacts.
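For listeners who want to try that, a minimal sketch of pushing a curated dataset to the Hugging Face Hub with the huggingface_hub library; the repo id, token, and folder path are hypothetical:

```python
# Share a curated dataset (e.g. a folder of Kubernetes log files) on the
# Hugging Face Hub. Repo id, token, and local folder are placeholders.
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # a write-scoped access token

# Create the dataset repo if it does not exist, then push the local files.
api.create_repo(repo_id="your-username/k8s-logs", repo_type="dataset", exist_ok=True)
api.upload_folder(
    repo_id="your-username/k8s-logs",
    repo_type="dataset",
    folder_path="./k8s-logs",  # local folder with the curated log files
)
```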
00:45:35
Speaker
Now, speaking of datasets and pulling these large models, obviously this takes some sort of storage, some sort of integration of where you put these things on Kubernetes near Docker containers. What does storage look like in this GenAI infrastructure? Are we using block? Are we using file? What are our databases used? Let's dive into that a little bit.
00:46:02
Speaker
These models, as I mentioned, are massive. And it's not a best practice to containerize these models. It's a bad idea. You shouldn't take Llama 7B and put that in a Dockerfile and package that as an image. Big container image, yeah.
00:46:20
Speaker
It's a bad idea, right? So you need to decouple the model from your inference code, which is basically consuming the model. The reason for that is your inference code is stateless, whereas the model is stateful. They need to be decoupled. They shouldn't be, like when you scale out your inference engine, the model shouldn't scale out. It's a nightmare if you are attempting to do it, right?
00:46:48
Speaker
That actually points us to a very unique aspect of the storage layer. Now, imagine two layers. One is the static storage layer where the models are optimized for read-only, read access. Then you have this inference engine that is scaling out itself to all the available GPU nodes.
00:47:11
Speaker
So how do we bring models to every node? Now, again, it's not a good idea to replicate these models on every node and cache them. They are massive, right? So we need a shared storage. So a shared storage layer that is available on every node, and it is exposed as a very well-known, consistent endpoint.
00:47:37
Speaker
We don't have to, first of all, download the model on every new node every time we scale out. The second thing is the start time of the container is not really blocked by this model. We decouple that.

Loading AI Models to GPU Memory

00:47:57
Speaker
No, I think I'm more of a basic question, right? Like I've gotten this question from some of the people that I work with. Like, okay, the models exist on Hugging Face. We can put them down locally on a Kubernetes cluster. Models need the memory that's inside a GPU. So like I'm pulling it down, I'm loading it on the GPU. How does persistent storage come into the picture, right? Like is it stored in the PVC and then it's running on the memory of the GPU? Like how does that workflow work?
00:48:26
Speaker
Yeah, so models are just blobs. They are basically a set of arrays that are serialized and stored on the disk. So when you load them, technically what happens is you're deserializing this blob, and then you're spreading that across available GPU cores, either a single GPU core or multiple cores of multiple GPUs. OK.
00:48:55
Speaker
So it is as simple as you basically serialized a massive array which contains some data and you've uploaded that to Google Drive, I downloaded that and now I deserialize it and load it to the memory and I start playing with it. That's exactly what happens when you pull a model from Hugging Face, load it into the inference engine. So loading that into inference engine is basically this process where you unpack the blob and
00:49:25
Speaker
spread that across available CUDA cores of the GPU.
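A rough sketch of that deserialize-and-spread step using the transformers library, assuming the model files already sit on a shared, read-only mount; the path is hypothetical, and device_map="auto" additionally requires the accelerate package:

```python
# Load a model "blob" from shared storage and spread it across visible GPUs.
# MODEL_PATH is a hypothetical NFS/EFS mount; device_map="auto" needs `accelerate`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/models/llama-2-7b-chat"  # shared, read-only storage mount

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,  # half precision to fit in 24 GB of VRAM
    device_map="auto",          # place/shard the weights over the visible GPUs
)

inputs = tokenizer("Kubernetes is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```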
00:49:29
Speaker
Okay, so then let's create a random scenario, right? I have an air gap cluster. I pull down the model once in a cluster that's connected to the internet from hugging face. I have a volume, right? We all know how snapshots inside Kubernetes work. Instead of pulling it from hugging face, can I just create a new volume from that snapshot and I have a model running? Or that's not really how it works.
00:49:56
Speaker
No, the better mechanism is to use something like, you know, if you're running it in the cloud, use something like EFS. Okay. Right. So you basically download your models to an S3 bucket. This is purely for the cloud. And then you write a batch process to download whatever is cached in your S3 bucket onto your EFS and you mount this EFS on every node. Okay. Okay.
00:50:26
Speaker
Gotcha. So if I want to download the same model on different clusters that I have access to, I at least have to pull down the model once. I can't just copy or have my storage transfer the models between clusters.
00:50:41
Speaker
No, that's not a good idea. So all the nodes should have a consistent shared storage layer. And these models, 99% of the time, are read-only versions. Unless you are fine-tuning them, you don't need to touch them. So at least in my home lab, I have a NAS. And that is my pass-through proxy for Hugging Face. So what I do is,
00:51:08
Speaker
I basically wrote a cron job that synchronizes Hugging Face with my NAS once every two days to get the recent revision. Just like Docker images have tags, these models have revisions. So occasionally, these revisions are pushed and updated, and you need to have them locally. So what I do is I basically sync my NAS with Hugging Face for a dozen models.
00:51:36
Speaker
On Kubernetes, I use an NFS provisioner to simply mount my NFS mount point into my Kubernetes and I have the same models available.
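A hypothetical sketch of what such a sync job could look like with huggingface_hub; the model list and NAS path are examples, not Jani's actual configuration:

```python
# Cron-style sync job: pull the latest revisions of a handful of models into
# a NAS path that Kubernetes later mounts read-only via an NFS provisioner.
from huggingface_hub import snapshot_download

NAS_ROOT = "/mnt/nas/models"
MODELS = [
    "meta-llama/Llama-2-7b-chat-hf",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "BAAI/bge-large-en-v1.5",  # an embedding model
]

for repo_id in MODELS:
    # snapshot_download is incremental: unchanged files are skipped, so
    # re-running this every couple of days only fetches new revisions.
    local_path = snapshot_download(repo_id=repo_id, cache_dir=NAS_ROOT)
    print(f"{repo_id} synced to {local_path}")
```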
00:51:48
Speaker
Okay. No, thank you. That makes sense. That way the model's available and now it's up to the sort of application side or the inference side to kind of load that in and kind of create whatever that application might be doing. Exactly. The beauty of this approach is, let's say Meta has launched a new revision of Llama and I want to use that. I just set the deployment to zero.
00:52:13
Speaker
terminate all the pods and set the deployment to one, and it picks up the most recent model. Gotcha. Right. So I know, like, when you are dealing with container images, we had projects like Harbor, which were pulling down recent versions and caching them locally near the cluster. You wrote your own cron job, but are there any projects or something that the community is working on to make sure that these models are accessible locally as well? Or
00:52:41
Speaker
Even if I wanted to create a private repository of my models that I'm training in, I know open source models is a different thing from all these different vendors. But if I'm building a small language model inside my own enterprise, can I host them locally? How do I do that? So very interesting. Just like a decade ago when Kubernetes came out, there was no concept of a private registry. Yep.
00:53:05
Speaker
Docker actually had an image, which was called the private registry, and you could expose it. And it again pointed back to the file system. And that's all we had. That's how you basically managed private registries, and then came Quay from CoreOS, which is now a part of Red Hat, and then we had Harbor. So similarly, there is a need for a model catalog living on top of Kubernetes. We don't have that.
00:53:35
Speaker
Yeah. Okay. So anybody in our listeners, if they are working on this, you need to give us a share of the new startup.
00:53:45
Speaker
Yeah, so now we have reached a point where projects like Harbor are smart enough to automatically sync the images with your public registry, do a pass-through, do a proxy mirroring. You can do a lot, but unfortunately, there is no model catalog. See, if you look at what is going on in the public cloud and try to replicate that on-prem with Kubernetes, every hyperscaler offering GenAI as a service has something called a model garden.
00:54:16
Speaker
Bedrock has one, Vertex AI has one, Azure OpenAI has one. Now these model gardens are essentially the repositories of models and you can browse, you can pull, push. NVIDIA has one in the form of NGC.
00:54:34
Speaker
GPU cloud, right? Now, none of those are available in an open source version or a cloud native version that you can deploy and say, hey, here are my hugging face credentials, go pull everything and keep them in sync. Nope, we don't have that.
00:54:51
Speaker
Yeah, it sounds like a good opportunity for a lot of these IDP folks we keep talking about, right? You know, kind of built in this sort of thing. Well, I mean, that's that's great. I do want to get back to so that we're at the point now we've exposed GPUs, we have containers able to get them, we have models being pulled down and put on shared storage.

Inference Engines for AI Models

00:55:14
Speaker
Now, how do we go about the other side of that, meaning that we have that model on shared storage,
00:55:19
Speaker
What does that application side of things look like? You used the word inference side. Exactly. Exactly. How do developers consume these models? Imagine you are corporate enterprise IT, and you want to expose a model as a service to your internal customers. These models are just artifacts. As I said, they're just blobs. The most important layer that exposes and interfaces with these models is the model serving part of it, the inference engine in general.
00:55:49
Speaker
Now, in the second generation of AI, we had a bunch of them, like KServe from Kubeflow, one of my favorites. You could do that with PyTorch. NVIDIA had TensorRT and Triton,
00:56:06
Speaker
which was the native inference engine, and then BentoML was one. There were a bunch of those serving engines that were available, but unfortunately, none of them are optimized for LLMs and foundation models. So, we have a whole new breed of LLM inference engines.
00:56:29
Speaker
And what they do is they basically expose a consistent well-known API to the outside world and internally, they deal with the model management. Gotcha.
00:56:41
Speaker
And they're not... Ollama, I think, is one of those guys. But I know, continue with your answer, but I just wanted you to share examples as well, like what these tools are. Absolutely. I'm coming to that. I'll list the top-of-my-mind inference engines. So my favorite inference engine is from Hugging Face. It's called TGI, Text Generation Inference.
00:57:04
Speaker
It's open source. The reason why I like that is, first of all, it's proven. Hugging Face's professional offering of inference as a service is running on TGI on Kubernetes. It's proven. Hugging Face, they eat their own dog food, and they use TGI to expose their own endpoints. Gotcha.
00:57:26
Speaker
Proven container infrastructure, number one. Number two, Hugging Face makes sure that the most recent model is optimized for this inference engine. If you go to Hugging Face and look at any model like Llama or Gemma or Falcon or Mistral, there are tags describing what the model is capable of, like text generation, chat,
00:57:51
Speaker
and instruct and so on. And there they put a label that says Text Generation Inference. And as soon as you see that tag for that model, it means it works with the TGI container, which is fantastic. And Hugging Face keeps the models and this TGI compatibility in sync. Like when Gemma got released two or three weeks ago, Hugging Face very quickly updated TGI to make sure it supports Gemma.
00:58:20
Speaker
Gotcha. That is the second reason why I really prefer that. And the third reason is it's very, very convenient and well-documented. It's an open source project. It's written in Rust. You go to the GitHub page, you understand all the parameters, and it has a couple of flags which are my lifesavers.
00:58:43
Speaker
It has a CUDA fraction flag. When you are launching this container, you can set an environment variable that says CUDA memory fraction is equal to anywhere between zero and one, which means you are able to restrict the container to a percentage of your VRAM of the GPU. That's a lifesaver for me. So on a 24 GB GPU, which is RTX 4090, if I am launching a llama 7B, it takes only about 13 GB.
00:59:13
Speaker
But if I don't restrict it very soon, it occupies 100%. So what I do is I use TGI as a deployment and I pass this environment variable that says CUDA memory fraction is 0.5. Interesting. It's great that TGI allows you to do that, right?
00:59:34
Speaker
Yeah, and behind the scenes, it is PyTorch that is actually exposing that. But all you have to do is just set this command line parameter in the arguments, and that's it. You can basically restrict it. And can you believe on my four-GPU, two-node cluster, I run nine models? Whew. OK. Very cool. That's a good result on that equipment, right?
01:00:00
Speaker
Absolutely. I run embedding models, I run state-of-the-art 7B models, and my GPU utilization is 110%. I very, very carefully optimize the GPU cores, and I'm able to run all of this.
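To make the pattern Jani describes a bit more concrete, here is a minimal sketch of launching TGI as a Kubernetes Deployment with a VRAM cap, written with the official Kubernetes Python client. The namespace, image tag, model name, and the environment variable names (MODEL_ID, CUDA_MEMORY_FRACTION) are assumptions drawn from the TGI documentation and the conversation above, not Jani's actual manifests.

```python
# A sketch, not Jani's actual setup: deploy TGI with a GPU memory cap
# using the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

container = client.V1Container(
    name="tgi",
    image="ghcr.io/huggingface/text-generation-inference:latest",  # assumed tag
    ports=[client.V1ContainerPort(container_port=80)],
    env=[
        # Model to serve, picked up by the TGI launcher.
        client.V1EnvVar(name="MODEL_ID", value="mistralai/Mistral-7B-Instruct-v0.2"),
        # Cap this container at half the GPU's VRAM so other models fit alongside it.
        client.V1EnvVar(name="CUDA_MEMORY_FRACTION", value="0.5"),
    ],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"},  # advertised by the NVIDIA device plugin
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="tgi-mistral"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "tgi-mistral"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "tgi-mistral"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="genai", body=deployment)
```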
01:00:16
Speaker
That layer is super critical. Now, apart from TGI, your next choice is something called vLLM. So vLLM is another inference engine optimized for exposing LLMs as endpoints.
01:00:34
Speaker
You can use that. I haven't used it myself because I've been too busy tinkering with TGI and getting it running on my cluster, but that's a very good choice. For example, VMware's Private AI is built on top of vLLM. Google has officially standardized on vLLM for GKE.
01:00:54
Speaker
So, it's a very proven framework. The third one is from NVIDIA, which is TensorRT-LLM. Not in a specific order, just as they come to mind. TensorRT-LLM is a pretty good inference engine. NVIDIA has optimized TensorRT for transformers and surfaced this as TensorRT-LLM.
01:01:16
Speaker
Of course, Ollama is one of the options, but running Ollama on Kubernetes is a little bit of a hack. Ollama is designed to run on your desktop, particularly on Macs. It's not really meant for Linux and Kubernetes. It's a nifty tool if you want to play around with chatbots on your local machine.
01:01:37
Speaker
It's not a big deal or a difficult task to run Ollama on Kubernetes, but it doesn't really have any knobs, buttons, and sliders to drag around. There are no tweaks. And then there is a very complex framework called Ray. Ray is very popular, very well known. It's more of a scheduler, and it runs on Kubernetes. Ray is used in production by many inference-as-a-service providers.
01:02:06
Speaker
Got it. And these inference engines that actually serve the model, they give you that endpoint within the cluster, but you still sort of need the client side of things, right? The inference side of things isn't the thing that the end user or system would interact with directly, right? That's the chatbot or something else, right? Exactly. Yeah. Exactly. But the good news is these inference engines expose
01:02:34
Speaker
an OpenAI-compatible API. So if I'm already talking to OpenAI, I can change the endpoint and actually talk to my local LLM, which is a great use case. Yep. Right. Change the endpoint or even write something yourself, right? Because those APIs are there if you are so ambitious, I guess.
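As a rough illustration of that endpoint swap, the sketch below points the standard OpenAI Python client at a hypothetical in-cluster inference service instead of api.openai.com. The service URL and the placeholder model name are assumptions; TGI (1.4 and later) and vLLM both expose an OpenAI-style chat completions route, but check your version's docs.

```python
# A minimal sketch of swapping OpenAI for a local, in-cluster endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://tgi.genai.svc.cluster.local/v1",  # placeholder for your inference Service
    api_key="not-needed",  # local endpoints usually ignore the key, but the client requires one
)

resp = client.chat.completions.create(
    model="tgi",  # TGI accepts a placeholder name; vLLM expects the served model id
    messages=[{"role": "user", "content": "Summarize what an inference engine does."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```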
01:02:55
Speaker
And coming back to TGI, one of the reasons why I really love it is LangChain, which is a middleware. I'll come to that. Think of it like the developer middleware or a developer framework for consuming models. It's like ODBC for LLMs. You can swap LLMs while keeping your API consistent. The TGI endpoints are supported by LangChain. So if you are writing your client-side code with LangChain,
01:03:21
Speaker
The beauty is you can keep the endpoint as is, but behind the scenes, you change an entry in a ConfigMap and reload the pod. You switch from Mistral to Llama to Gemma without the client even getting restarted.
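A minimal sketch of the LangChain side of that setup, assuming the langchain-community TGI integration and a hypothetical TGI_ENDPOINT environment variable; the client only knows the endpoint, not which model is behind it.

```python
# A sketch of the client side: LangChain talks to a TGI endpoint whose
# backing model can change without touching this code.
import os
from langchain_community.llms import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    # Endpoint comes from the environment, so the code stays model-agnostic.
    inference_server_url=os.environ.get(
        "TGI_ENDPOINT", "http://tgi.genai.svc.cluster.local"
    ),
    max_new_tokens=256,
    temperature=0.2,
)

print(llm.invoke("In one sentence, what is retrieval augmented generation?"))
```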
01:03:37
Speaker
So what I do is I write a ConfigMap, and this ConfigMap is injected as an environment variable into the pod. That's the model name which the inference engine picks up. So I go change this ConfigMap and restart the pod.
01:03:53
Speaker
It picks up the new environment variable and starts serving the new model that I just downloaded from Hugging Face, and the client continues to run without even knowing that the model has been hot-swapped. That is the beauty of TGI again, because of the LangChain integration and the client compatibility.
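The hot-swap itself might look roughly like the sketch below: patch the model name in a ConfigMap, then roll the Deployment so the pods re-read the injected environment variable. The ConfigMap name, key, and Deployment name are hypothetical placeholders, not Jani's actual resources.

```python
# A sketch of the hot-swap flow: change the model name in a ConfigMap,
# then restart the TGI pods so they pick it up.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
core, apps = client.CoreV1Api(), client.AppsV1Api()

# 1. Point the serving layer at a different model.
core.patch_namespaced_config_map(
    name="tgi-config",
    namespace="genai",
    body={"data": {"MODEL_ID": "google/gemma-7b-it"}},
)

# 2. Roll the Deployment so pods re-read the env var (the same trick
#    `kubectl rollout restart` uses: bump an annotation on the pod template).
apps.patch_namespaced_deployment(
    name="tgi-mistral",
    namespace="genai",
    body={"spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
    }}}}},
)
```

Because the Service and the LangChain client never change, the swap is invisible to the application, which is the behavior Jani describes.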
01:04:11
Speaker
Okay, I think those are some great tools, right? I need to go and play with TGI a little bit more. I have some experience with Ollama, some experience with Kubeflow. But yeah, we'll create a list of these in the show notes too, right afterwards, so that people interested in all these terms that we're throwing out and peppering you with will be able to go back and look at them a little bit more.

Jani's RAG Pipeline Setup

01:04:34
Speaker
So yeah, let's switch gears a little bit since we're towards the end of our time here.
01:04:43
Speaker
What we've covered so far is sort of an introduction to how things are running on Kubernetes. But we do want to talk about some of the use cases, especially for the home lab that you're building this particular kind of architecture for. What are the use cases you're looking at, or how do people consume AI in their daily life in ways that aren't a chatbot?
01:05:05
Speaker
Yeah, absolutely. My end goal when I built this infrastructure was to run the retrieval augmented generation pipeline end-to-end. I can pull the plug, my router's plug, I am disconnected from the internet, and I can still run RAG pipelines with no dependency on the internet at all.
01:05:25
Speaker
So what I do is, on my cluster, I run LLMs and I run embedding models. Embedding models are a very critical piece of the puzzle. What they do is they take your sentences or phrases or pages of a document and convert them into contextual, low-dimensional vectors. They give you this output, which is stored in a vector database. And my favorite vector database is Chroma.
01:05:54
Speaker
It cannot be a StatefulSet because it is not a distributed database, but Chroma in itself is very simple. I run the embedding model on GPU. What I do is I run MinIO as my data lake. MinIO is my unstructured storage.
01:06:16
Speaker
I put all the PDFs there, and I wrote this nice webhook based on MinIO. As soon as I drag and drop a PDF, it calls the webhook, and the webhook takes the PDF, slices it into sentences, and sends them to the embedding model. The embedding model gives back a bunch of vectors, and I push those vectors into Chroma.
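A condensed sketch of that ingestion path is below: a webhook receives a MinIO bucket notification, pulls the PDF, splits it into sentences, embeds them with a Text Embeddings Inference (TEI) service, and writes the vectors to Chroma. The notification payload shape, service hostnames, bucket credentials, and the naive sentence splitting are all assumptions for illustration, not Jani's actual code.

```python
# A sketch of a MinIO-triggered ingestion webhook: PDF -> sentences -> TEI -> Chroma.
import io
import requests
from flask import Flask, request
from minio import Minio
from pypdf import PdfReader
import chromadb

app = Flask(__name__)
minio = Minio("minio.genai:9000", access_key="...", secret_key="...", secure=False)
chroma = chromadb.HttpClient(host="chroma.genai", port=8000)
collection = chroma.get_or_create_collection("documents")
TEI_URL = "http://tei.genai/embed"  # Hugging Face TEI embedding endpoint

@app.post("/webhook")
def ingest():
    # MinIO sends an S3-style event; pull the bucket and object key out of it.
    record = request.json["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    pdf = PdfReader(io.BytesIO(minio.get_object(bucket, key).read()))
    text = " ".join(page.extract_text() or "" for page in pdf.pages)
    # Naive sentence split for the sketch; a real pipeline would use a proper splitter.
    sentences = [s.strip() for s in text.split(".") if s.strip()]

    # TEI accepts a batch of inputs and returns one vector per input
    # (in practice you would batch large documents).
    vectors = requests.post(TEI_URL, json={"inputs": sentences}).json()

    collection.add(
        ids=[f"{key}-{i}" for i in range(len(sentences))],
        embeddings=vectors,
        documents=sentences,
        metadatas=[{"source": key}] * len(sentences),
    )
    return {"indexed": len(sentences)}
```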
01:06:38
Speaker
And later on, when I search for a specific phrase, basically Chroma gives me back the top K results that semantically match what I'm searching for.
01:06:52
Speaker
and I get it back, and I use that as a context, and I send it to the LLM saying, you have this grounded, factually correct data, and here is my question. Look up what is in the context and answer, and if you don't know, say, I don't know, but don't hallucinate, right? I explicitly use that in the prompt. And my LLM comes back with a very accurate answer, and the beauty is, with RAG, you can use an extremely
01:07:20
Speaker
small language model. There's something called Phi from Microsoft, which is only about 1.5 billion parameters. Even that behaves well. It becomes a well-behaved LLM when you feed sufficient context to it.
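The retrieval half of that loop might look like the sketch below: embed the question with the same embedding service, pull the top-K matches from Chroma, and send context plus question to the LLM with an explicit "say I don't know" instruction, roughly as Jani describes. Hostnames, collection names, and the sample question are placeholders.

```python
# A sketch of the query side of the RAG pipeline: question -> retrieval -> grounded answer.
import requests
import chromadb
from openai import OpenAI

chroma = chromadb.HttpClient(host="chroma.genai", port=8000)
collection = chroma.get_or_create_collection("documents")
llm = OpenAI(base_url="http://tgi.genai/v1", api_key="not-needed")
TEI_URL = "http://tei.genai/embed"

def answer(question: str, k: int = 5) -> str:
    # Embed the question with the same model used at ingestion time.
    qvec = requests.post(TEI_URL, json={"inputs": [question]}).json()[0]

    # Top-K semantic matches become the grounding context.
    hits = collection.query(query_embeddings=[qvec], n_results=k)
    context = "\n".join(hits["documents"][0])

    prompt = (
        "Answer using only the context below. "
        "If the answer is not in the context, say 'I don't know'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = llm.chat.completions.create(
        model="tgi",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
    )
    return resp.choices[0].message.content

print(answer("What does the document say about GPU utilization?"))
```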
01:07:35
Speaker
So what does RAG stand for? I know we probably won't go into depth on this, but just for context. Retrieval augmented generation.

Explaining RAG in AI Context

01:07:42
Speaker
So basically, if you go to ChatGPT and ask a question about the Oscars that happened yesterday, ChatGPT cannot answer.
01:07:50
Speaker
But if you go to msnbc.com, copy a blurb that has all the award winners, paste that in and say, now look up the data and tell me who the winner is, it can tell you. So what you're basically doing is injecting context into the prompt. Instead of manually copy-pasting, you're building a pipeline that takes your query, looks up a database, grabs the relevant
01:08:14
Speaker
context, injects it into the prompt, and sends the context plus the prompt to the LLM. That is what is called RAG, retrieval augmented generation. We are augmenting the prompt with context. Got it. Yeah. I mean, that's a very powerful thing, I think, as we start to use AI tools in real time, right? You could think about
01:08:35
Speaker
you know, something about booking flights or, you know, things that aren't historical, right? I imagine that's a big, and I think it is, right, if you kind of look at what's going on in the AI space, there's a lot of
01:08:49
Speaker
talk about RAG and those kinds of things. I think we could probably spend another entire podcast just on RAG. So we won't go into that just yet, for sure. I think, like Ryan said, we have so many other questions in our notes that we wanted to bring up, but this has already been a great episode and a great use of our time.
01:09:10
Speaker
Jani, I know you have a plethora of resources. I love your YouTube channel. Why don't you go ahead and plug all of that, and then also point people towards any more resources that you think will be helpful to people who are just getting started.
01:09:25
Speaker
Yeah, as soon as I'm back from KubeCon, I'm starting a series on cloud-native infrastructure for GenAI. Every episode will focus on one of the layers. So I have a cake diagram of all the layers of the stack, and every episode focuses on one specific layer of this cake. And I go into depth. I show commands. I point to a GitHub repo where you can actually run shell scripts and set it up yourself.
01:09:53
Speaker
Oh, that's going to be super helpful. Thank you. It is difficult. Even though AI is the new thing, finding resources that are actionable and that you can implement on your own is really difficult. I'm talking from my personal experience here. Yeah, listener, you heard it here first. Head to Jani's YouTube channel. We'll put it in the show notes. Is there a name that it goes by that people can search you up by? Johnny.tv.
01:10:20
Speaker
Yeah. Perfect. Well, Jani, it's been a real pleasure having you on the show. I know we covered a lot today, so thank you for packing it into this hour for our listeners. It sounds like we'll have to have you on again at some point in the future to see where all this has gone and what you've been up to, and
01:10:45
Speaker
I imagine it's going to be one of the fastest growing topics that we have to talk about here on the show, too. But yeah, you know, thank you again for coming on the show and kind of diving into this from the Kubernetes perspective with us. My pleasure. Thank you. It's been a fun conversation.
01:11:01
Speaker
All right, Bhavin. That was a jam-packed episode, I'll call it. We probably only touched on like two-thirds of the actual questions we wanted to get to. I know Jani kind of joked when we first got on the call that there were way too many, and he's right. Too many questions. I think we just got excited because it is an exciting topic, and hopefully you as a listener learned something. But let's jump into a few takeaways from our perspective. Why don't you kick it off?
01:11:26
Speaker
No, I think I really liked the episode. Even though we had so much to talk about, there were so many awesome nuggets of information that make things tactical. For the big-picture vision, you'll find a lot of podcasts; I know we did a 101 episode, and everybody likes to talk about it from a higher level.
01:11:42
Speaker
I've been following Jani through his blog posts and his LinkedIn posts and his YouTube channel for a while, and he has been heads down in GenAI on Kubernetes, so it was great getting his perspective. I really liked how you had him break down his entire home lab setup: what are the GPUs? What plugins did he install?
01:12:03
Speaker
What projects is he using? That makes it real, right? Because there's a lot of buzz around, but how do you get started with this? I think that's my key takeaway from this episode. There isn't just one. You have to listen to this whole episode once, if not twice, to get everything out of it. And then, just a fun takeaway, there were so many keywords and project names that we used. If we were playing the SEO game, which we don't, this episode would rank high up there, dude. So I think...
01:12:31
Speaker
I really like this episode specifically for a practitioner's view on GenAI on Kubernetes. Yeah, that's a really good point. I was thinking along the same lines.
01:12:44
Speaker
Today, especially when you hear about AI or Gen AI, people are talking about it from a mile high. They're talking about the power it can bring you. It's going to change the world. Someone like me or you is sitting there like, yeah, but how does it work?
01:13:01
Speaker
How do you get it running on this? I think that's one of the most valuable things, particularly the pieces that are specific to building this thing in your own lab. He's got a very unique perspective on
01:13:18
Speaker
dealing with the challenges. Running nine models on four GPUs isn't the most straightforward thing to do, but it is doable. The fact is, it's a challenge to get GPUs for your applications, so sometimes
01:13:34
Speaker
putting up the capex to buy a few mid-range GPUs and computers to get you going so you can work offline might be something you're interested in. So really cool perspective from Jani today that I think a lot of folks will hopefully appreciate. And if you do, drop us a note in the Slack to say, hey, we want to do more of this. I know it's Bhavin's and my goal to do a lot more on this topic because it's endless.
01:14:02
Speaker
We mentioned a little bit about RAG frameworks at the end, which is actually a really interesting topic all by itself. And we really didn't even touch on it. We just kind of told you what it is. So expect more of this. Or if you're in this space as a professional or as a user or as a developer, come talk to us. We'd love to talk to you. So yeah, that's my takeaway. Awesome.
01:14:28
Speaker
Yeah, I think one more shout out that I have for our listeners is, if you are based in the New York tri-state area or if you are traveling to New York in May, there is an awesome community event called Kubernetes Community Days, or KCD. I think it's the first time a KCD is happening in the US, and obviously they couldn't have picked a better location.
01:14:47
Speaker
It's on May 22nd, but one thing that we have special for our listeners is, if you are registering for it and you use the code communitiespipes, you get a 10% discount right then and there. So if you are going to the event, or if you are on the fence, hopefully this 10% discount gets you over the edge and you register for that event.
01:15:04
Speaker
Yeah, that's a big shout out. Yeah, if not an excuse to go visit New York City, I guess. So yeah, there's a lot of KCDs in Europe, so really cool to see one happening now in New York City. And from what I've seen of them in Europe, they're really great events. So go check that out, which I believe is, what, Wednesday? May 22nd, that's Wednesday?
01:15:28
Speaker
I could be off. It's right around there. Okay. All right, Bhavin. That brings us to the end of today's episode. I am Ryan. I am Bhavin. Thanks for joining another episode of Kubernetes Bytes. Thank you for listening to the Kubernetes Bytes podcast.