Inference in Action: Scaling AI Smarter with Inferless

S4 E20 · Kubernetes Bytes
1.1k plays · 2 months ago

In this episode, we sit down with Nilesh Agarwal, co-founder of Inferless, a platform designed to streamline serverless GPU inference. We’ll cover the evolving landscape of model deployment, explore open-source tools like KServe and Knative, and discuss how Inferless solves common bottlenecks, such as cold starts and scaling issues. We also take a closer look at real-world examples like CleanLab, who saved 90% on GPU costs using Inferless.

Whether you’re a developer, DevOps engineer, or tech enthusiast curious about the latest in AI infrastructure, this podcast offers insights into Kubernetes-based model deployment, efficient updates, and the future of serverless ML. Tune in to hear Nilesh's journey from Amazon to founding Inferless and how his platform is transforming the way companies deploy machine learning models.

Subscribe now for more episodes!

Transcript

Introduction to Kubernetes Bytes

00:00:03
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Wallner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts. We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:00:28
Speaker
Good morning, good afternoon, and good evening wherever you are. We're coming to you from Boston, Massachusetts. Today is October 21st, 2024. Hope everyone is doing well and staying safe.

Weekend Recap: Mountain Biking Fundraiser

00:00:40
Speaker
Bhavin, what's new? How was your weekend, man?
00:00:42
Speaker
Weekend was good. I forgot that we were back in summer in New England this week. Oh, that's right. It was really warm yesterday, yeah. It was. It was like 72 degrees. So that was a good break from the cold weather that we have been starting to get. So I had fun. I spent a lot of time outdoors for sure over the weekend. How about you?
00:01:01
Speaker
Yeah, same. So Sunday, yesterday, was my New England mountain bike chapter's annual fundraiser ride. The chapter I'm part of is Blackstone Valley. Oh, nice. It's called the Best Damn Ride. We call it that because it's at West Hill Dam, which is Army Corps of Engineers property. But yeah, we had 400 people or something like that. I had a few hours of volunteering and then a little bit of riding. That's awesome. Yeah, it's a blast. So is it like a run-by-the-mile marathon, but on bikes? Like, how do you raise money? Yeah, so you raise money just by selling tickets to the event, basically. Okay. At the event itself, we do a really great raffle. So, you know, as mountain bikers, we had things like a pull-behind cart for kids behind your bike, or a Kuat
00:01:56
Speaker
bike rack for a car, or new tires, new bike levers, all sorts of stuff, right? So you get a free raffle ticket when you sign up, but the proceeds from signing up go directly back to the chapter. And we have food and ice cream and stuff there. And then, you know, all the volunteers put on events, like an intermediate ride, a novice ride, and an advanced ride.
00:02:22
Speaker
We built the pump track this year. So we actually had three state representatives come and, you know, bless our new pump track, which was kind of cool, to hear them give their speeches on it. So yeah, it was fun. And the weather was killer, so can't complain. Yeah. Did the state representatives jump on a bike and go around the new track, or were they older? So, you know what was funny is none of them really did any mountain biking, but Specialized is a sponsor there, right? So they are a vendor who comes there and they can demo bikes. So we got one of the reps to go on a Specialized mountain bike, and he has an e-bike and a gravel bike or something like that.
00:03:03
Speaker
And he was like, oh, now I'm going to have to get a mountain bike. And he was doing the pump track and everything. So it was kind of cool. Oh, that's awesome, man. Yeah. A successful event for sure. Yeah, it was fun. Can't complain.

OpenShift 4.17 Updates and Features

00:03:13
Speaker
But yeah, we have a fun episode today. We'll introduce our guest after some news, but why don't we dive into our news segment? Yeah, I can get it started. The first item is that OpenShift 4.17 is now generally available. A few things caught my eye. They support Kubernetes version 1.30. I still don't know how to say it correctly. Yeah. But now, if you're running OpenShift 4.17, you get Kubernetes 1.30. And if you specifically zoom into the OpenShift Virtualization space, they added a few features, like safe memory overcommit, if you want to overcommit memory for the VMs that are running on OpenShift.
00:03:57
Speaker
You can do that. Memory hot plug: if you want to increase the memory footprint of your virtual machines, that can be done on a non-disruptive scale-up basis to improve your VM performance. They also have something, and this is something I haven't read in detail, called automatic VM workload balancing with the scheduler. So to me, just reading that, it sounds like the VMware DRS feature, where VMware basically distributes VMs across different hosts to avoid any hot spots in the cluster. Maybe they're already doing that on their side; that's a feature I want to go and check out. And then another feature that applies to us is, from a tech preview perspective, they added support for storage live migration. So you can take a VM that's provisioned on a specific storage class and live migrate it to a different storage class. I think this is a feature that
00:04:47
Speaker
Red Hat themselves are working on, without integrations with other storage vendors. I'm not sure. It wasn't clear in the live stream that they do, and we linked the live stream in the show notes. But it shows that now, for an already deployed VM, you can change the storage backend using the console itself. Yeah, it would be interesting to know if it's going to work across backend types of storage classes. Yeah, that's my question too. Like, what if there are certain features that a specific storage vendor is providing and another is not? And how do you
00:05:16
Speaker
Like, do you show a warning? Again, it's a tech preview feature. It's not GA, but... Yeah. Or whether it works on top of some sort of file system versus, like, snapshots. Oh. It really depends. I have to look into that too. I need to dig into that, clearly. Yeah. No, I think this release definitely had a lot of features that need a more detailed evaluation of the documentation

NVIDIA's OpenShift CI Benchmarks

00:05:38
Speaker
that they have. But the live stream that the Red Hat product management team puts on is awesome. Man, they have created a great workflow where every PM from every different piece of OpenShift will jump in for like two minutes and hand it off to the next one. They know who's before them and who's after them, and they'll come off video, and oh man, it's a well-oiled machine for sure. Yeah, I know.
00:06:04
Speaker
Okay, and then outside of Red Hat... okay, actually, since you're talking about OpenShift Virtualization, something that was down on my list: they have started publishing customer case studies more publicly now, so they just put one on The New Stack about how NVIDIA GeForce NOW runs on KubeVirt. Again, this was a Red Hat sponsored post. I remember at Red Hat Summit, NVIDIA had a session maybe, but there was definitely a KubeCon session, which is linked from this New Stack article, where they talk about how they run CI-driven benchmarks for their VM environments. But basically it says that instead of modernizing their
00:06:43
Speaker
whole footprint from virtual machines to containers, they are like, let's just modernize where we are running VMs. So it's a cool little story, and the session is definitely worth checking out. We'll link that in the show notes.
00:06:58
Speaker
And then finally for me, since we had an episode talking about OPA a while back, I've been following Styra, which is the company that I think is one of the major contributors behind the OPA project. They announced something called Policy SBOM. And anything with SBOM, Ryan, we have to bring it up. Yeah, we've got to talk about it. So Policy SBOM is an inventory of policy modules, which includes your sources, versions, dependencies, everything that's within an OPA bundle. So basically, instead of just an SBOM, which is a bill of materials for the different components that are part of an application, a Policy SBOM is a detailed description of what is included in a specific piece of policy.

Policy SBOM and Security Transparency

00:07:38
Speaker
The reason I brought it up was the Policy SBOM integration, but
00:07:43
Speaker
this is, I think, only available to Styra Enterprise customers. If you have a Styra DAS subscription, you can now start using this. But this is an interesting approach that companies in this ecosystem are taking to document everything and list out the permissions and the packages that are part of
00:07:59
Speaker
any entity inside Kubernetes. So I just wanted to share that. Nice. Gotcha. You mentioned Kubernetes, so I'm going to talk about the next release, which is 1.32, the one coming up. Okay. And we're going to talk about an article about the changed block tracking feature from the storage team. Basically, they're looking for feedback on CBT. It was slated to be alpha in Kubernetes 1.31, but it slipped to 1.32. So now the APIs are going to be available in 1.32, and you're able to prototype with it and all that stuff
00:08:45
Speaker
going forward, and the storage team is specifically looking for feedback on changed block tracking. If you're not familiar with changed block tracking, the TL;DR is that it enables you to do things like incremental backups. You look at an entire backup, see what's been changed, and only report those blocks in some metadata or something like that. Then you take a backup which is incremental, meaning you're not backing up all the data, you're just recording which blocks have changed, hence changed block tracking. This feature has been, I think, like two years in the works or something like that. I want to say the KEP, KEP-3314,
00:09:29
Speaker
said it was implementable in June 2024, as the article says, but it slipped. So, you know, I think this is a really cool thing. It's something that I think almost every storage provider does provide, so having CSI also enable it now is going to be an interesting space. There are some great talks, I think, coming up in Salt Lake City. I believe the Data Protection Working Group is putting on a deep dive talk. So if you're interested in this and you're going to be in Salt Lake City, definitely go check it out.
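To make the incremental backup idea concrete, here is a minimal Python sketch. It is an illustration only, not the CSI API from the KEP: the function names and the tiny block size are invented for the example, and it finds changed blocks by comparing hashes of two snapshots, whereas a real storage system would report changed blocks from its own metadata. It also ignores volumes that shrink.

```python
import hashlib

BLOCK_SIZE = 4  # bytes per block; tiny on purpose so the example is readable


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split a volume image into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks in `new` that differ from `old` (or are new)."""
    old_blocks = split_blocks(old, block_size)
    new_blocks = split_blocks(new, block_size)
    changed = []
    for index, block in enumerate(new_blocks):
        # Blocks past the end of the old snapshot count as changed.
        old_digest = (
            hashlib.sha256(old_blocks[index]).digest()
            if index < len(old_blocks)
            else None
        )
        if hashlib.sha256(block).digest() != old_digest:
            changed.append(index)
    return changed


def incremental_backup(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Keep only the changed blocks, keyed by block index."""
    new_blocks = split_blocks(new, block_size)
    return {i: new_blocks[i] for i in changed_blocks(old, new, block_size)}


snapshot_1 = b"AAAABBBBCCCCDDDD"
snapshot_2 = b"AAAAXXXXCCCCDDDDEEEE"  # block 1 rewritten, block 4 appended
delta = incremental_backup(snapshot_1, snapshot_2)
print(sorted(delta))  # indexes of the blocks an incremental backup would copy
```

The point of the sketch is the size of `delta`: two blocks are copied instead of five, which is the saving that changed block tracking is meant to give backup tools.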

CNCF Project Updates

00:10:01
Speaker
Anyway, we'll post that article there for you as well. And you said this is an alpha feature for 1.32, not beta yet, right? Yes, 1.32. It'll be the alpha APIs. Got it, thank you.
00:10:15
Speaker
The other two things I had: KubeEdge is a CNCF graduated project now. We've talked about KubeEdge before on the show, but it's kind of fast-tracked its way through the process. Not really fast-tracked, but it's got a lot of support, you know, for managing IoT devices and syncing data back and forth between the edge and so forth. So KubeEdge is making a lot of great progress. We'll post that article as well. And then, if you use Pulumi, their 2.0 operator was also released. It adds a whole bunch of things, like scale and some other things. So that article will also be posted. And that's my news for today.
00:10:55
Speaker
Man, KubeEdge, right? Going back to that for a second, they have a lot of contribution or support from different vendors in the ecosystem. Yeah. Just the graduation post has a lot of different brands. It's like, how is ByteDance using KubeEdge? I don't know. It's a question that I need to figure out the answer to. Is it running on our phones? Nope. If you have TikTok. No, and you know, that's awesome.
00:11:23
Speaker
Yeah, good stuff. So, um that's the news for today.

Meet Nilesh Agarwal

00:11:26
Speaker
We'll make sure all those links are in our show notes, if you want to go back and read some more about them. But we have an awesome guest today. He is the co-founder and CTO of Inferless. His name is Nilesh Agarwal. Inferless is doing really interesting stuff, and we'll let Nilesh come on the show and dive right into what he's doing. So without further ado, let's get him on the show. All right. Welcome to Kubernetes Bytes, Nilesh. It's great to have you on the show. Give our audience a little bit of background on who you are and what you do.
00:11:59
Speaker
Sure. So, hi folks, I'm Nilesh. I'm the co-founder and CTO at Inferless. Before this, I have 10 years of experience, mostly in machine learning and infra. I started with a US-based company that did CI/CD, mostly for fintech and banks; Deutsche Bank and First Data were some of their customers. Post that, I was at Amazon, on the machine learning team, where anything that you see like an ad for an Amazon product, we used to ship all of those. We had like 50 million ads running across 16 countries at scale.
00:12:27
Speaker
Post that, I also started another dev tool company before this, and currently with Inferless, what we're trying to do is help companies deploy machine learning models faster, where they can deploy a model as a function and then we scale it up and down for them. That's an impressive background, man. Just 10 years, you said! Okay. And you know, when people are putting out advertisements for new jobs, he checks that box of 10 years, even though there aren't 10 years of the new technology. I always like those: 20 years of experience in Kubernetes.
00:13:04
Speaker
Okay, so Nilesh, since you have so much experience in machine learning and inferencing specifically, right, that's what we are here to talk about today. Before we talk about Kubernetes or Inferless, let's talk about what the inferencing landscape looks like today.
00:13:21
Speaker
These models have been out and about for a while. How are customers or users deploying these? Are they doing DIY? Are they just using managed services from hyperscalers? Can you give us an overview of that ecosystem?

Inferencing Landscape Overview

00:13:38
Speaker
Yeah. So I think it's a very interesting one. The inference landscape is pretty broad, I would say. It falls into like three or four categories, right? And I think the first one is definitely the closed models, right? Where there are companies like Anthropic and OpenAI, which have, I would say, 30 to 40% of the market share, right?
00:13:58
Speaker
where they say that, hey, you just take APIs from us, we'll manage the underlying model, we'll manage the underlying infrastructure, and you just pay us for, like, tokens per second, right? So I think that is the first. The companies using that are mostly people who need raw intelligence. So on the company spectrum, let's say they just want to analyze a mathematical problem or solve a case by just putting tokens in and out, right? So where there's complex thinking, I think that is the first kind of inferencing that's happening at scale now.
00:14:28
Speaker
The second, I would say, still a broad spectrum, is happening through these providers, which is another 20 to 30 percent. That is with companies like Together AI, Azure, and AWS, where they are just hosting models, and these are OSS models. These are not proprietary models like Anthropic's and OpenAI's. These are open source models which are, I would say,
00:14:56
Speaker
80% as good as the closed source models, right? Again, I don't want to be very specific on that, but they do the task. You have a lot more control in terms of how your data flows, right? So this comes up mostly with enterprises, when you shift from a commercial app to a more enterprise, B2B use case where data is sensitive, right? So the second biggest share of inference is happening in this.
00:15:20
Speaker
And then there's the third spectrum of companies, which is also who Inferless mostly caters to: people who are taking Llama or other open source models, but then they have some of their own proprietary data sets. And then they want to, let's say, beat the category one models, right? So we have seen a lot of our customers, specifically 10X and a couple of others, who have been able to beat GPT-4 performance, but only for certain specific tasks,
00:15:47
Speaker
at a very low cost, right? So that is where the third set of companies comes in, where we see a lot of inference happening. These are companies who start from a task, and then they have fine-tuned a model on that task to go deep into it. So that is how we see the inference landscape. And these are the ones who are doing it more DIY, going on Azure or AWS, getting the GPUs, or using providers, right? There's a lot of control within the third category. So the third bucket, you said, is targeted towards specific use cases, right? They want to beat the performance of these larger models for specific use cases. Who is fine-tuning these models for that specific domain? Are vendors like yours fine-tuning models for, like, running a dentist's office somewhere or a lawyer's office somewhere else?
00:16:37
Speaker
Or are these dentists' offices somehow creating an association and building fine-tuned models that work for their use case, and then working with you to host those? Like, who owns the fine-tuning? Got it. So usually these are GenAI-focused startups, right? Who are building, let's say, a vertical chatbot in hospitality, legal tech, the sales stack, right, something like that.
00:17:04
Speaker
So we see a lot of these vertical players coming up, taking a domain and then a task within that domain, right? Now for them, how the pipeline goes: first, I think they have to reach like a million users with the open model, right? So first is the journey of finding PMF, and I think it is not a good idea to start fine-tuning your model on day one.
00:17:26
Speaker
Yeah. So most companies we see being successful are the ones who use ChatGPT or the second kind of model, like Llama. People start with one of these two categories, and then they have, let's say, a product which works, right? But once you cover the 80% or 90% of use cases, then you need to get to very high accuracy. I think for each of the startups, the battle is getting the accuracy towards 95%, and how can you do better than, like...
00:17:53
Speaker
Gotcha. In the last 10% of use cases, right, where typical general purpose AI would fail. That is where their moat also lies.

The Rise of Retrieval Augmented Generation (RAG)

00:18:01
Speaker
So we see a lot of people going beyond a certain threshold, and then they start investing a little bit in fine-tuning, right? At that point in time, they would have already served, I would say, over five to 10 million queries and seen usage and performance across them. Okay. And they have data that, hey, these were the inputs and outputs. And now there are two things: one is accuracy and one is cost, because both of these are pillars of how you want to give your service to the customer, right? These are the two things that... One last question before we move on to actual Kubernetes or CNCF related projects. You said these are just fine-tuned models. Are vendors in the third bucket that you described also offering RAG or RIG? That's a new term that I learned: retrieval augmented generation and retrieval interleaved generation. Are these vendors also offering those kinds of services, where they can host the model and maybe have data available in a vector database somewhere, and
00:19:03
Speaker
without even fine-tuning, you get to that higher percentage? Yeah, RAG is a very interesting use case, right? Because RAG is, I would say, somewhere in the third and second buckets. And it's very interesting why it's in both categories. So yeah, we definitely have customers who use a vector database along with an API from Inferless. So usually this happens: when you are a RAG company who is building a prompt or something like that with augmented generation or interleaved generation, right, you can take one of two approaches. Either you have a long-tail model, which is not super famous. Let's say that could be Phi-3.5 or some model which is very hard to find off the shelf. So again, people come to serverless, where they say that, hey, I still want to do it, but this model is not readily available out of the box in, let's say,
00:19:56
Speaker
SageMaker or a similar managed offering. So then they think about hosting it themselves. But for RAG applications where they're using, let's say, Llama 7B out of the box, they often tend towards the second bucket, where they're like, hey, I just need tokens per second, right? Okay. I have a huge database. I need to encode this,
00:20:15
Speaker
send it into my database, call the API again, and then answer it for them. So RAG is usually spread across these two buckets, and then also the first bucket, where a lot of folks I know also use ChatGPT as the embedding model and directly use that in the RAG application. So I think RAG goes across the spectrum. And it really depends upon the modeling capabilities.
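The encode, store, retrieve, and answer loop described here can be sketched in a few lines of Python. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `VectorStore` is an in-memory stand-in for a vector database, and the final call to a hosted model is left out; all of these names are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, doc: str) -> None:
        self.items.append((embed(doc), doc))

    def search(self, query: str, k: int = 1) -> list[str]:
        query_vector = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


def build_prompt(query: str, store: VectorStore, k: int = 1) -> str:
    """Retrieve the top-k documents and place them in front of the question."""
    context = "\n".join(store.search(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


store = VectorStore()
store.add("checkout time is 11 am")
store.add("the pool opens at 8 am")
store.add("parking costs 20 dollars per night")

prompt = build_prompt("when is checkout time", store)
print(prompt)  # this string is what would be sent to the hosted model's API
```

Whether the model behind that last API call is a closed model, a hosted open source model, or a self-hosted long-tail model is exactly the bucket choice discussed above; the retrieval side looks the same in all three cases.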
00:20:41
Speaker
Makes sense. So I know we're talking a little bit about the inferencing landscape. And since we're on the subject, and before we talk specifically about what you do at Inferless, what are the different paths someone can take if they're interested in getting into this type of landscape? Maybe those are projects within CNCF. Which ones have you used, or which ones do you know of? I'd love to hear your perspective on that.

CNCF Model Deployment Projects

00:21:09
Speaker
Yeah. I think for model deployment, CNCF historically has had a lot of projects, and the one which I would especially like to highlight is KServe, because it's a fork of Knative, but one which is very specific to machine learning, which is also the need of the team. So I think KServe is one project that we look up to highly.
00:21:34
Speaker
Some of the more advanced ones, which are still in the incubation phase, are NIDUS and a couple of other smaller projects. So these are the two projects that I have mostly looked at. We have used them very heavily and figured out what they do well and what the areas are which are still not fully there in these open source projects.
00:21:58
Speaker
Okay. Since you're already on this topic, right, since you have evaluated these different open source projects, what are some of the challenges when it comes to model serving or deployment using KServe, for example? Can you share some of those?
00:22:13
Speaker
Sure. So, just coming back to the good things that KServe does first. I think it's a great platform. It gives you a control plane where, using your CI/CD, you can just put in a quick YAML and then you get an ML service up, right? So it's super easy to set up. They have really good support for
00:22:35
Speaker
backing servers like Triton. Triton is one of the servers that NVIDIA has made open source. It's a super good engine. It has a lot of support for dynamic batching and streaming, and it's a pretty good one. And KServe has a good community that integrates with TorchServe, with TensorFlow Serving (TF Serving), and also Triton. So overall, KServe is a great project, right?
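For a sense of how small that manifest is, here is a minimal Python sketch that builds a KServe `InferenceService` object and prints it as JSON, which `kubectl apply -f` accepts just like YAML. JSON is used only to keep the example dependency-free. The service name and the storage bucket are placeholders invented for illustration, and `min_replicas=0` assumes a serverless setup where scale to zero is wanted.

```python
import json


def inference_service(name: str, model_format: str, storage_uri: str, min_replicas: int = 0) -> dict:
    """Build a KServe v1beta1 InferenceService manifest as a plain dict."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # 0 lets the service scale to zero when idle (hypothetical choice).
                "minReplicas": min_replicas,
                "model": {
                    "modelFormat": {"name": model_format},
                    # Where the model weights are pulled from at startup.
                    "storageUri": storage_uri,
                },
            }
        },
    }


# Placeholder name and bucket; substitute real values before applying.
manifest = inference_service(
    name="sklearn-iris",
    model_format="sklearn",
    storage_uri="gs://example-bucket/models/iris",
)
print(json.dumps(manifest, indent=2))
```

Note that `storageUri` is the field behind the startup cost discussed later in the conversation: the weights it points at have to be downloaded before a new replica can serve traffic.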
00:23:04
Speaker
And then you can deploy a model in, let's say, plain Python, TensorFlow, or PyTorch using KServe, right? Now, talking about some of the limitations of KServe: KServe was forked off of Knative, which was the original open source CNCF project.

Challenges with Model Serving at Scale

00:23:23
Speaker
Now, Knative was an excellent project for smaller models and compute. If you see the history of KServe, KServe is used in large organizations, like Salesforce and other ones which I know of,
00:23:36
Speaker
at the capacity of smaller models. So when model sizes are, let's say, 100 MB or 500 MB, which, if you look before 2019, most of the models were, like regression and prediction models in the range of a gigabyte, it works really well, because you can pull a 1 GB model in under five to 10 seconds, right? Even on a pretty slow machine.
00:23:57
Speaker
But what has happened is that now the landscape has changed, right? If you look at Llama 13B, 8B, all of these models are 20 gigabytes plus, right? And some even go up to 100 or 150 gigabytes. Now that is where these systems fall short, because they were not built for it first, right? These are huge model weights. So orchestrating them in a way where, let's say, you're dynamically pulling the weights on request, that becomes a huge bottleneck, right?
00:24:27
Speaker
So if you have a KServe cluster with, let's say, 50, 60, 800 deployed, booting up a service will take an insane amount of time, right? And that may lead to a bad quality of service for your inference infrastructure. So that's the biggest thing that we see, and it's mostly because of the data size, which has exploded. That is where the KServe and Knative platforms are not very good.
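The arithmetic behind that bottleneck is easy to check. The sketch below takes the figure from the conversation, roughly 5 to 10 seconds to pull 1 GB, and assumes throughput stays constant as models grow, which is a simplification; the function name and labels are invented for illustration.

```python
def pull_seconds(model_gb: int, seconds_per_gb: int) -> int:
    """Time to pull model weights, assuming constant throughput."""
    return model_gb * seconds_per_gb


# Model sizes mentioned in the conversation, in GB.
MODEL_SIZES_GB = {"classic 1 GB model": 1, "20 GB LLM": 20, "150 GB LLM": 150}

for seconds_per_gb in (5, 10):
    for label, size_gb in MODEL_SIZES_GB.items():
        seconds = pull_seconds(size_gb, seconds_per_gb)
        print(f"{label} at {seconds_per_gb} s/GB: {seconds} s ({seconds / 60:.1f} min)")
```

At 10 seconds per gigabyte, a 20 GB model takes 200 seconds to pull and a 150 GB model takes 1,500 seconds, or 25 minutes, before the container can even start loading weights onto the GPU. That is the gap between a tolerable cold start and an unusable one.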
00:24:51
Speaker
Second is the container orchestration part, right, because they don't have their own scheduler. Their scheduling is very generic. It is based off the Kubernetes primitives, which is like: this CPU, this RAM, this hardware, find me this wherever it's possible, right? But I think that kind of scheduling also has a lot of downsides to it. Yeah. So these are the two things, I would say, where KServe and Knative are mainly not able to catch up, and they'll probably catch up a few years from now. What type of scheduling decisions are beneficial to these types of workloads, if the basic ones that Kubernetes comes with are not necessarily built with that in mind, obviously? Correct.
00:25:35
Speaker
So what we have seen is the kind of scheduler that you need to have, especially for machine learning models which are 20 plus gigabytes. One is the container runtime, right, which is like the Dockerfile: PyTorch, which version, which packages, right? So there needs to be some strategy which says that, hey, the Docker container image is also a first-class resource. Just as you say 20 GB of RAM, you can also say that, hey, I want this Python version and this CUDA version to be there. That's one. Second is, of course, the model weights, right? So there needs to be some way, let's say sharding or some combination of scheduling, where we say ahead of time that, hey, let's say
00:26:20
Speaker
for a typical very big company, right, like Salesforce for example, we have like a hundred models, and each of these models has, let's say, some traffic pattern. So based on that, can we do some prefetching or pre-caching to make the model weights available in some way? Or maybe we have a distributed storage that's fast enough to fetch this model on the fly, right? So those are two different approaches. But I would say the scheduler should consider the model as a first-class resource and the container as a first-class resource. That would solve the problem of making it super fast for the LLMs or diffusion models of the current era.
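One way to picture a scheduler that treats the container image and the model weights as first-class resources is the minimal Python sketch below. It is not how Inferless or Kubernetes actually schedules: the `Node` and `ModelRequest` types, the node names, and the 10 seconds per gigabyte pull cost are all assumptions made for illustration. The idea is simply to filter nodes by free GPU memory and then prefer the node where the least data still has to be pulled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    free_gpu_gb: int
    cached_images: set = field(default_factory=set)  # container images already on disk
    cached_models: set = field(default_factory=set)  # model weights already on disk


@dataclass
class ModelRequest:
    model: str
    image: str
    gpu_gb: int      # GPU memory needed to serve the model
    weights_gb: int  # size of the model weights to pull
    image_gb: int    # size of the container image to pull


def estimated_start_seconds(node: Node, req: ModelRequest, seconds_per_gb: int = 10) -> int:
    """Estimate cold start time from how much data the node still has to pull."""
    to_pull_gb = 0
    if req.image not in node.cached_images:
        to_pull_gb += req.image_gb
    if req.model not in node.cached_models:
        to_pull_gb += req.weights_gb
    return to_pull_gb * seconds_per_gb


def pick_node(nodes: list, req: ModelRequest) -> Optional[Node]:
    """Filter by GPU memory (the generic check), then rank by cache locality."""
    feasible = [node for node in nodes if node.free_gpu_gb >= req.gpu_gb]
    if not feasible:
        return None
    return min(feasible, key=lambda node: estimated_start_seconds(node, req))


nodes = [
    Node("node-a", free_gpu_gb=40),
    Node("node-b", free_gpu_gb=24, cached_images={"pytorch-cuda"}),
    Node("node-c", free_gpu_gb=80, cached_images={"pytorch-cuda"}, cached_models={"llm-13b"}),
    Node("node-d", free_gpu_gb=8, cached_images={"pytorch-cuda"}, cached_models={"llm-13b"}),
]
request = ModelRequest(model="llm-13b", image="pytorch-cuda", gpu_gb=24, weights_gb=20, image_gb=8)

chosen = pick_node(nodes, request)
print(chosen.name, estimated_start_seconds(chosen, request))
```

A generic scheduler that only looks at CPU, RAM, and GPU would consider node-a, node-b, and node-c equally good. Adding cache awareness is what separates a start with nothing to pull from one that pulls 28 GB first. Prefetching based on traffic patterns, as mentioned above, would amount to filling `cached_models` ahead of time.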
00:27:03
Speaker
Okay, so is the challenge more on the startup time? Like, it's a startup issue, right? Similar to, I think, when Ryan and I were doing our serverless episode, the cold start problem was a big one, where whenever the app needed some resources, a Lambda function needed to spin up to serve that traffic. This is more around day zero, but once you have deployed things using KServe,
00:27:31
Speaker
actually running it on Kubernetes is not a very big challenge? Yeah. So I think it's definitely a scale problem. It's not just a zero-to-one problem, but also a one-to-N problem. That way it's pretty linear, right? Let's say there's an app called Photoroom, right? In the early days of Inferless's inception, we talked to them. They say that they usually have to provision like 2x of their infra. So let's say at some point they use 10 GPUs, just for example, right? They have to provision 20 because of their time to scale up. Once customers start using them, right, a denial of service is something that their business cannot deal with. So if an influx of users comes in, they have to make sure that they are able to rapidly scale up, right? And scaling is basically slower because of the underlying details of the model and the container and the runtime and everything, right?
00:28:24
Speaker
So that is why this problem still persists, even at scale, right? It's like, how do you start a pod faster? Yeah. Okay. No, I think that makes sense. If I have an autoscaling rule in place, the
00:28:36
Speaker
model, at the end of the day, is running as a container or a pod on your cluster. If you get more requests in, how do... okay. Okay. Thank you for clarifying that, because for me it initially sounded like a zero-to-one problem. So yeah, zero-to-one is there for sure. Like, let's say companies have an array of a thousand models and they have a hundred GPUs, right? They want to figure out a bin packing so they can quickly load and unload. So sometimes it's zero to one, but then the same logic also goes from one to N, with the same problems. Yeah.
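The bin packing mentioned here can be sketched with the classic first-fit decreasing heuristic. This is a generic textbook approach, not a description of any vendor's placement logic, and the model names and memory sizes below are made up for illustration.

```python
def first_fit_decreasing(model_gb: dict, gpu_capacity_gb: int) -> list:
    """Pack models onto as few GPUs as the heuristic finds, largest model first."""
    gpus = []  # each entry tracks remaining memory and the models placed on that GPU
    for name, size in sorted(model_gb.items(), key=lambda item: item[1], reverse=True):
        if size > gpu_capacity_gb:
            raise ValueError(f"{name} needs {size} GB, more than one GPU holds")
        for gpu in gpus:
            if gpu["free"] >= size:
                # First GPU with enough room wins.
                gpu["free"] -= size
                gpu["models"].append(name)
                break
        else:
            # No existing GPU fits this model, so bring up another one.
            gpus.append({"free": gpu_capacity_gb - size, "models": [name]})
    return [gpu["models"] for gpu in gpus]


# Hypothetical GPU memory footprints, in GB.
models = {"llm-a": 40, "llm-b": 30, "diffusion": 24, "embedder": 10, "reranker": 16}
placement = first_fit_decreasing(models, gpu_capacity_gb=80)
for index, names in enumerate(placement):
    print(f"gpu-{index}: {names}")
```

Here five models that would naively take five GPUs fit on two 80 GB cards. With a thousand models and a hundred GPUs, not everything can be resident at once, so a real system also has to decide which models to unload, and that brings the load time problem back.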
00:29:03
Speaker
Cool. Well, I think it's time to maybe switch gears a little bit off of the CNCF projects and talk about what you're up to specifically at Inferless.

Origin Story of Inferless

00:29:12
Speaker
I'm, you know, interested in why Inferless is here today. Why did you decide to go about building this company? Talk about the origin story a little bit, and then we can dive into what you're doing specifically to fix some of these challenges. Cool. Sure.
00:29:29
Speaker
So I think, again, Inferless is a pivot from the original company that we were building before this. We were trying to build an AI coaching bot that could help people with their behavioral skills. This was back in 2019, 2020, when GPT-3 was not publicly launched but was available in beta. And we had some open source versions of that also, like GPT-J and GPT-Neo. At that time, OpenAI used to open source most of their models, right? While experimenting with that, we were trying to host these models on Kubernetes, and we were using some of the open source tools we just discussed, but then we figured out that we had to
00:30:08
Speaker
keep a lot of GPUs on and really over-provision. So let's say in a month we were spending $20,000 while doing about $2,000 in revenue from the customers we were serving, right? So that was the origin story, where we saw that there was no way with AWS Lambda or even Google Cloud Run to give them a model and say, hey, when I call this model, turn it on and just give me the output whenever we require it. So that was our problem. But then we also talked to a lot of folks, probably 60 or 70 plus engineers, who were at that time in generative AI trying to build early diffusion-based models. For most of them it was the same story: their cost of infrastructure was even greater than what they were earning from their users. And while that is okay in the early part, I think at some point the scaling laws have to match, right, or else you can't build a sustainable company.
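To put the numbers from this answer in perspective, here is a small worked calculation. It assumes the $20,000 is the monthly serving cost and the $2,000 is the monthly revenue, as described above; the function name is purely illustrative.

```python
def gross_margin(revenue: float, serving_cost: float) -> float:
    """Gross margin as a fraction of revenue; negative means serving costs exceed revenue."""
    return (revenue - serving_cost) / revenue

print(f"{gross_margin(2_000, 20_000):.0%}")  # -900%
```

A margin of -900% means every dollar of revenue cost ten dollars of GPU time to serve.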
00:31:08
Speaker
And then, to double-click on this, we also landed on a report by a16z saying that the gross margins of AI companies are actually only as good as those of a services-based company with humans, right? Because they have to invest in training, and they have to invest in continuous inference. That's not like SaaS, where you would spend 5 or 10% on infra; here it's like 40%. So that
00:31:35
Speaker
validated our theory, and that is where we started thinking that there should be a way to do it serverless. The next couple of months of Inferless was us asking, hey, can we really solve that problem? So we spent six to nine months just asking how we can run this model faster. Let's say, what's an acceptable limit? If it starts in five seconds or in 10 seconds, is our customer okay with that penalty, and can we build something around that penalty of scaling the pod? So the first version of the product that we wanted to build was something completely managed for the end customer, where we go and rent GPUs from Azure and AWS, but then make sure that when they want to scale up their service, it gets started in under 10 seconds, right? So that's the benchmark we set for ourselves, and
00:32:24
Speaker
we launched earlier this year, and since then we have been working with 20 plus companies. We're still in private beta because there are still a lot of scaling mechanics that we are figuring out. But that was the origin of how we made it serverless, right? You can come with any model and any runtime, let's say $5, $10, whatever, and we just help you get that service up and down super fast. You can also scale with us from 0 to 50 or 100 GPUs and then scale back.
00:32:56
Speaker
Gotcha, gotcha. Now, I do have a follow-up question, just because Bhavin and I have had almost an entire episode on this one term, right, serverless. So I do want to clarify: what makes Inferless serverless, in your own words? And maybe explain a little bit of the difference, going by some of your documentation that I found, between deploying serverless containers versus ML models directly with Inferless.
00:33:22
Speaker
Yeah, so I think the term serverless, from an end customer's perspective, means that they don't have to manage Kubernetes, the control plane, or the data plane, right? That is the pure definition of serverless. Because I think serverless is a word where people ask, is there an open source serverless? And then there are a lot of variations of the word serverless.
00:33:44
Speaker
So I think in the purest fashion, what serverless means is that the end user doesn't have to worry about where the compute comes from. They just have a piece of code, which gets executed on a remote machine, right? And that machine needs to be outside of the customer's data plane. There are a lot of managed serverless offerings also, and I would say that they serve the same purpose, but by definition they are not serverless if you have to manage servers. So our definition of serverless is that customers don't have to manage OS patching or anything regarding the resources. That's why you call it serverless. Coming to the second question, serverless containers versus ML models directly: I think when we said machine learning models, that made the scope a little bit limited, right?
00:34:35
Speaker
So this was, again, something that we discovered on our way. A machine learning model is typically the PyTorch or TensorFlow file that you have with the weights, like safetensors or something like that. But when you talk about serverless containers, they can also have pre-processing logic in Python, or some runtime that basically allows them to also
00:35:01
Speaker
run multiple forward iterations, right? So that's why we made it serverless containers, because these are not just machine learning models, but models wrapped with the necessary logic, let's say pre-processing, post-processing, something like that. So that's really the difference. I would say AWS also in some fashion allows serverless containers, but serverless containers with GPUs is something that none of the cloud providers give access to today. So that's what we are about. Interesting. So I want to skip a couple of steps and assume that customers have figured out how to run these models, either using Inferless or on their own Kubernetes cluster using KServe. One thing that I don't know the answer to is how organizations that are at that stage are updating their models. So I know
00:35:58
Speaker
open source companies that are dedicated to publishing open source models have new releases every four months, every five months, where they increase the performance. How are these models being updated in production? My assumption right now is I'm updating my app using either GitOps workflows, where code commits make it directly to production, or I'm running some sort of canary or blue-green deployment where I'm testing two different versions of the application.
00:36:25
Speaker
Now, with models involved, and given that they can impact the features or the output that my application has, how are organizations handling all of those updates?

Managing Model Updates in Organizations

00:36:37
Speaker
So again, I can talk from the experience of a few people. I think this was a very interesting part, and something that we also learned along the way in our journey of six months. Initially, we thought that people would come, deploy the model once, and then just go away and come back after three months.
00:36:55
Speaker
But what surprised us was that they were changing the model every week, where they were tweaking the parameters, or the temperature, or some setting or the other, which could dramatically change how the inference is happening, right? So when it comes to inference, think of the permutations and combinations of different types of libraries, let's say vLLM, TensorRT,
00:37:20
Speaker
and so on. Even if you have the same model weights, the library that you are using for inference can cause a huge change in how the outputs are produced, right? So that is one learning, and that is, I think, the most important flow, which people thoroughly regression-test on us. And we had a few customers who were using serverless just to test their models, right? Because one of the things that comes with GitOps is that there is definitely a way of pushing the model to production, right?
00:37:49
Speaker
But I think instead of a blue-green deployment or a canary kind of thing, what we see most of our customers doing, at least, is that they have a very good set of evals. So they mostly use evals: they have, let's say, a huge data set of evals, and they have a dev endpoint where they deploy the new model and run the evals on it, and then they deploy it for production customers, right?
00:38:15
Speaker
So that is how we see most models getting rolled out in production. So it's definitely blue-green, but part of the traffic on one model and part of the traffic on another model is something that we have not seen. They usually use evals to go through stuff like that. And the beauty of serverless there, I think, especially for testing, is that you don't have to set up a huge cluster, turn it on, and then keep the machines on in a dev environment, where it's very hard to manage the infrastructure. So a lot of people use serverless as a solution just for that. There are a couple of customers who are just putting their test workloads on us and doing the evals, and they do it in the most cost-efficient way because they don't need to keep the GPUs on or provision machines for that. So that's a very interesting use case that we see. With GitOps,
00:39:06
Speaker
let's say for people changing their model, to answer your question, I think what people do is this: we have a lot of integration with CI/CD, so you can integrate with us at a code commit level, at a GitHub webhook level. You can set up, let's say, a webhook so that whenever I push code to this branch, it automatically gets built. So CI/CD was the next three months of our journey, which we had to build for developers to use Inferless efficiently.
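The push-to-branch trigger described here can be sketched as a small handler. The only fact relied on is that a GitHub push event payload carries a `ref` field such as `refs/heads/staging`. The branch-to-endpoint mapping and endpoint names are hypothetical, and no real Inferless API is shown; a real integration would call the platform's build step where this sketch just returns a name.

```python
import json
from typing import Optional

# Hypothetical mapping from Git branch to the endpoint that should be rebuilt.
BRANCH_TO_ENDPOINT = {
    "dev": "dev-endpoint",
    "staging": "staging-endpoint",
    "prod": "prod-endpoint",
}

def endpoint_for_push(payload_json: str) -> Optional[str]:
    """Return the endpoint to rebuild for a GitHub push payload, or None if no branch matches."""
    payload = json.loads(payload_json)
    ref = payload.get("ref", "")
    prefix = "refs/heads/"
    if not ref.startswith(prefix):
        return None  # for example a tag push, which should not trigger a build
    return BRANCH_TO_ENDPOINT.get(ref[len(prefix):])

print(endpoint_for_push('{"ref": "refs/heads/staging"}'))  # staging-endpoint
print(endpoint_for_push('{"ref": "refs/tags/v1.0"}'))      # None
```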
00:39:29
Speaker
OK, is there a term for it? I know when we started using things like Ansible and Puppet, the community coined the term infrastructure as code, and I know it's still going strong. Your infrastructure is code, your code is code. Now your models are also code. Do you have a new acronym for us? knowledge that i discuss yeah Okay, no, thank you for walking us through. I think I'm interested: how are they managing these different versions? Is it in their source control repository and that's it? Are customers treating that as the single source of truth in terms of what models are being run in production? So yeah, usually what happens is they deploy, let's say for one of our customers, they have
00:40:15
Speaker
three endpoints, which would be dev, staging, and prod. Prod is the one that is exposed to the customer. They push their model to staging once, which is a GitHub branch. So basically, for one of the customers, it's a GitHub-based deployment, and with GitHub they have three branches, right?
00:40:32
Speaker
So first they would commit to a branch, and based on that branch, the endpoint with Inferless would get updated. So let's say we have one endpoint configured on a staging branch. As they push the code to staging, there is an eval test, and then the same code is just promoted to prod, and the prod endpoint is switched to the newer model. Okay. Got it.
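The staging, eval, promote flow described above could look roughly like this. It is a minimal sketch: the exact-match scoring, the 0.95 threshold, and the dictionary of endpoints are all assumptions for illustration, and real evals are usually far richer than string equality.

```python
from typing import Callable, Dict, List, Tuple

def eval_pass_rate(model: Callable[[str], str], eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) pairs the candidate model answers exactly."""
    if not eval_set:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in eval_set if model(prompt) == expected)
    return passed / len(eval_set)

def promote_if_passing(candidate: Callable[[str], str], eval_set: List[Tuple[str, str]],
                       endpoints: Dict[str, str], threshold: float = 0.95) -> bool:
    """Point prod at the staging version only when the evals clear the threshold."""
    if eval_pass_rate(candidate, eval_set) >= threshold:
        endpoints["prod"] = endpoints["staging"]
        return True
    return False

# Toy stand-ins: the "model" upper-cases its input, and versions are plain strings.
endpoints = {"staging": "model-v2", "prod": "model-v1"}
promoted = promote_if_passing(str.upper, [("a", "A"), ("b", "B")], endpoints)
print(promoted, endpoints["prod"])  # True model-v2
```

Note that prod is switched all at once rather than receiving a share of traffic, which matches the "evals instead of canary" pattern described in the conversation.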
00:40:55
Speaker
All right, well, we're going to switch gears to a little bit of a case study. I know this is something Bhavin and I always like asking our guests, especially if they can talk about real-world examples or customers. So, you know, we saw a case study on CleanLab that talked about saving a whole ton of GPU cost.

Case Study: CleanLab and Inferless's Solutions

00:41:16
Speaker
I'd love to hear more about how that came about and how they were able to do that with you. I think CleanLab, again just to give everyone a brief history, is a company that helps companies clean their data using LLMs, and some of their customers are bigger companies like Amazon and Google. So one of the things they deal with when they clean a data set is hallucinations: sometimes your model can hallucinate and, let's say, write a data point which is just not coherent in the whole scheme of things. So there was this one particular tool. So hallucination is something that's like
00:41:56
Speaker
detecting hallucination is even harder than answering with an LLM. So the compute needed for that is higher: let's say if you want to detect a hallucination, you will pass it to three or four models and try to see which works and which doesn't, right? And can I flag this as a hallucination rather than a right answer to the prompt? So they have a public tool for detecting hallucinations also. This is a tool that they have open directly on the web, so they have no control over how the traffic flows, and it's pretty erratic. The erratic traffic was what triggered them: hey, I have launched this in public, and I don't know when most people will use it. It could be 2 AM, it could be 5 AM. There's no predictable traffic.
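The idea of passing an answer through three or four models can be illustrated with a simple vote. The majority rule and the toy checker functions below are assumptions made for this sketch; the conversation does not say how CleanLab actually combines the models' verdicts.

```python
from typing import Callable, Sequence

def is_hallucination(answer: str, checkers: Sequence[Callable[[str], bool]]) -> bool:
    """Flag the answer when a strict majority of checkers say it is a hallucination."""
    if not checkers:
        raise ValueError("at least one checker is required")
    votes = sum(1 for check in checkers if check(answer))
    return votes * 2 > len(checkers)

# Toy checkers standing in for separate model calls; each one costs GPU time in practice.
checkers = [
    lambda a: "unicorn" in a,      # flags an implausible entity
    lambda a: len(a.split()) < 2,  # flags answers too short to be grounded
    lambda a: a.endswith("?"),     # flags answers that are really questions
]

print(is_hallucination("unicorn?", checkers))                 # True
print(is_hallucination("Paris is in France", checkers))       # False
```

Every check multiplies the inference work per request, which is why detection needs more compute than producing the answer itself.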
00:42:40
Speaker
Okay. So for them, it mattered whenever a burst of traffic comes in. If a hallucination is not detected in five to 10 seconds, then everything after that in the pipeline has already run, right? So it's kind of a denial of service if it exceeds a certain threshold of seconds. They had a constraint on that, and that's when they discovered us, since we were able to autoscale their pods in 10 seconds. And I think the hallucination detection part takes another one or two seconds for them.
00:43:08
Speaker
So they could still scale from 0 to 30 GPUs, detect all of those hallucinations together, and then shut it down whenever it's not in use. That is where they used us. They used our endpoints to deploy a model with us for hallucination detection. It does the hallucination detection and then basically gives them a binary answer: hey, is the model hallucinating or not? And for that, they have bursty traffic, right? At one point we would see them using two or three GPUs, but in the next instant they would go to 50 or 60 odd replicas trying to detect hallucinations, and then it would go down to zero, right? Now, the other way for them was that they could have also deployed it on a qbe but
00:43:55
Speaker
So I'm not going to name where they were going to deploy. Sure. Come on, man. That's totally... Yeah. So for them to achieve the same output there, they were keeping 25 GPUs on, right? When the scale is 50, even if it takes a little bit more time, it's fine with them, but they had to keep 20 or 25 GPUs always on. Because if they keep three GPUs on, by the time it gets to 25 GPUs, it could be too late for them, and their customer will have a denial of service. So it was a latency-sensitive use case. And since we were able to scale in 10 seconds, that was a key factor for us to do it. And then we saw that
00:44:41
Speaker
their average usage is three or four GPUs, and then it goes to 30 or 40 during some part of the day and goes down. So they were able to save significantly on those over-provisioned GPUs, right? Which was 20 or 25 GPUs, which was the bulk. If you see three versus 25, it's like 90%, right? That was most of the use case for that.
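The "three versus 25" comparison works out as follows. This is just the arithmetic from the conversation, ignoring the short bursts to 30 or 40 GPUs and any price difference between providers; the function name is illustrative.

```python
def gpu_savings(always_on: int, average_used: int) -> float:
    """Fraction of GPU capacity saved by paying for average usage instead of an always-on fleet."""
    return (always_on - average_used) / always_on

print(f"{gpu_savings(25, 3):.0%}")  # 88%
```

So the "like 90%" figure is a rounded version of roughly 88% under these simplifying assumptions.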
00:45:01
Speaker
Gotcha. I'm assuming there is an AI model that can help you chart some of these usage statistics. and scaleed under the Sorry. Okay. I think we are almost at the end of our list of questions, right? One thing I wanted to ask, since this is still an early space: I know, being in the Bay Area,
00:45:23
Speaker
things can move so fast, and especially with AI, everybody's talking about it as if it has been around for the past 10 years. Since you guys are in it day in and day out, I wanted to ask you for any best practices that other people can use when they're trying to deploy these models in production, or any lessons learned that you want to share with our listeners, like, oh, make sure you keep an eye out for this, so they know how to proceed down this road.
00:45:51
Speaker
Yeah. No, I think definitely I can share. Again, I think some of them I already shared organically: don't try to fine-tune your model until you have a few customers. That's always a bad idea. So for people trying to run their models in production, trying to offer any kind of AI service that involves significant inference, I think it's kind of a decision chart, right? When do you want to take the cost of ownership on yourself? Because with AI, things can get super costly, super fast. So you would want to avoid the cost of ownership as long as you can, right?
00:46:29
Speaker
That is what I would say, especially for early companies, because there are too many parameters to control here. So the best approach is to start with probably one or two. I would say start with OpenAI or some hosted provider, where you basically get your first couple of users and build the MVP with RAG, prompt engineering, or whatever. Once you have a signal that, hey, I've solved this business use case, now I need to get to efficiency or accuracy. So efficiency and accuracy
00:47:00
Speaker
are what should make people think, hey, now let me start owning the infrastructure, right? That is where I want to control parameters that are not in my control with most of these providers.
00:47:11
Speaker
And from there, you basically ask, hey, when should I deploy a model myself versus, let's say, use something like Inferless? There we say that if you see your utilization is less than 50%, that's a good trigger point: hey, if I'm not using the GPU all the time, is it a good idea for me to own this GPU for the whole time? Gotcha. So that's the second branch decision that you take.
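The 50% utilization trigger can be written down as a tiny check. The function name and the GPU-hours inputs are assumptions for illustration; the threshold is the rule of thumb from the conversation, not a hard rule.

```python
def prefer_managed_serverless(busy_gpu_hours: float, provisioned_gpu_hours: float,
                              threshold: float = 0.5) -> bool:
    """True when GPU utilization falls below the threshold, suggesting owned GPUs sit idle too often."""
    if provisioned_gpu_hours <= 0:
        raise ValueError("provisioned_gpu_hours must be positive")
    utilization = busy_gpu_hours / provisioned_gpu_hours
    return utilization < threshold

# Example: 600 GPU-hours provisioned in a period, of which 150 were actually busy (25% utilization).
print(prefer_managed_serverless(busy_gpu_hours=150, provisioned_gpu_hours=600))  # True
print(prefer_managed_serverless(busy_gpu_hours=480, provisioned_gpu_hours=600))  # False
```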
00:47:36
Speaker
You either keep everything in your own cluster or just go with a managed service that gets the GPU and gets your job done. And then it also boils down to a traffic decision. Sometimes, if your traffic is too spiky and you are a very small team,
00:47:52
Speaker
even if you are paying 10 or 20% more, it still makes sense not to hire people to do this, because with MLOps there are very few people who have really good experience in how to drive the best efficiencies with it, right? So that's a very cumbersome process. And for whoever takes the decision that, hey, I want to take control of the GPU, the Kubernetes, the infrastructure, I think KServe and Knative are definitely a good way for them to look at how they can manage different API services. Yeah. Okay. No, thank you for that. Those are actionable things that people can start thinking about right now. Yeah, absolutely. That's great. Speaking of where people can take some action, where can people go to learn more about Inferless?
00:48:43
Speaker
Yep. So in terms of Inferless, we do a lot of blogs on our website, where we talk about, let's say, Triton, how to deploy models on that, how to use Hugging Face. We also have an open GitHub repository. What we do is try to optimize the tokens per second, and then also publish that, hey, this is the configuration that you can deploy to get the most tokens per second, and we have benchmarked this across this hardware. So that is something that we also publish quite often. And then I also have a Medium blog where I share my learnings.
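A tokens-per-second benchmark like the one mentioned here boils down to timing generation calls. The sketch below is a generic illustration: `fake_generate` is a stand-in that sleeps instead of calling a model, and nothing here reflects the actual benchmark code in the Inferless repository.

```python
import time
from typing import Callable, List

def tokens_per_second(generate: Callable[[str], List[str]], prompt: str, runs: int = 3) -> float:
    """Average generated tokens per wall-clock second over several runs."""
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    if total_seconds == 0:
        raise ValueError("elapsed time too small to measure")
    return total_tokens / total_seconds

def fake_generate(prompt: str) -> List[str]:
    """Stand-in for a real model call (assumption: a real one returns a list of tokens)."""
    time.sleep(0.01)           # pretend decoding takes about 10 ms
    return prompt.split() * 4  # pretend output tokens

rate = tokens_per_second(fake_generate, "hello serverless gpu world")
print(f"{rate:.0f} tokens/s")
```

To compare configurations fairly, the same prompts and run count would be used on each library and hardware combination.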
00:49:24
Speaker
Perfect. And is there any way people can get in contact with you yourself, on any social or LinkedIn or anything like that? Yeah. So definitely I have my LinkedIn and Twitter, and I think I'm pretty active on both of those. So if you have any questions specific to deployment, if you're trying to configure something in your cluster, in your own VPC, and want to ask about what challenges you're probably going to have.
00:49:50
Speaker
Cool. Cool. Well, it's been really awesome to talk about this topic. I think this is a perfect example, Bhavin, of how in some time we have to have you back on the show and see where you've come. This is such an evolving space. I'd be so curious to see your success, hopefully in the next year or so, but yeah, we'll get you back on the show eventually. And thank you for joining. I appreciate it. Have a good one.
00:50:18
Speaker
All right, Bhavin, it was great to have you on the show. Sorry, what? Is this the same Zencastr link, and how will it affect... does it show up as a different one? When you're on as the admin, you see recording one and recording two. Oh, you can see different ones. Okay. So this is just cool. I was worried that the other session would go away. It just says two, so you'll jump into recording one to download it and recording two to download this one. Cool. Sorry for interrupting your flow. Starting over.
00:50:47
Speaker
All right, Bhavin, it was great to have Nilesh on the show. I know I was kind of excited about this one because, you know, I reached out to Nilesh and was interested in what they were up to. I know during the talk he mentioned that it's only about six months old. So really early days for Inferless, and just in general for the AI landscape. So really interesting stuff from today's show. But what were you taking away? I'm curious.
00:51:15
Speaker
No, I think I like that he's in this ecosystem, right, and building a company, so he gets to talk to customers day in, day out. I like the way he laid out the ecosystem overall, like the three different buckets: the OpenAIs and the Anthropics that have their own service, the hyperscalers, and then vendors like Inferless. I like that breakdown. And I think the challenges, right? It was helpful that he walked us through the zero-to-one and one-to-hundred problem. Deploying a model is not enough.
00:51:45
Speaker
Models have to keep up with the traffic that you are expecting. And once you have any AI-based application, even a dumb, I don't know, poem generation or name generation one, you will get so many people just experimenting with it. So I think the backend infrastructure has to keep up. So some of those challenges that we described were good. And then from a best practices perspective, I like the way he wrapped things up by saying,
00:52:07
Speaker
First of all, think about cost. That is one of the biggest lessons learned. You don't want to invest millions of dollars in buying Nvidia GPUs and running them on-prem in your own data center. You first need to identify that use case. Second, don't start fine-tuning immediately. See if you can reach a certain performance level and get early customers to use your AI-enabled application. So I think all of those perspectives that we got from him were super helpful for me.
00:52:34
Speaker
Yeah, absolutely. And to that point, right, he was saying, you know, a lot of these models, with the example of, what was the name of that company, CleanLab, which is deploying a number of different models to detect hallucinations. I think he mentioned that it's quite often that they're deploying three of these for staging and those kinds of things.
00:52:59
Speaker
You're going to scale really quick, right? Yeah. And that kind of brings me to the point at which we were talking about KServe and some of these other CNCF projects. I think it just speaks to the fact that, you know, Nilesh has worked with these tools and says they're great tools, but in terms of maturity at scale, or when you're thinking about the details that he talks about, model weights and where those models are being served from and how quick the startup times are, some of the scheduling aspects weren't really as mature as they need to be to, I don't know, run a company or build a company on. Not to say that they're not super useful if you want to dedicate, you know, developers to giving back to that project, or even
00:53:43
Speaker
enough developers, or people familiar with this, that can really spend the time day in and day out. Because it sounds like, you know, the end user is going to tweak a lot of things, and you're going to be supporting the project that allows them to tweak a lot of things or redeploy it, right? So, as with anything, you know, you have to balance that
00:54:05
Speaker
complexity versus using sort of an as-a-service offering, right? And just like we say there's a time to use Kubernetes and there's a time to not use Kubernetes, right? We've said that many times on the show. I think this is similar: there's a time to manage all your own AI infrastructure, and there's a time to outsource to SaaS. So yeah, I thought that was an interesting aspect, especially early days, to kind of,
00:54:32
Speaker
you know... And I think this just points back to: do you want to move fast? How fast do you want to move? And, you know, what are you going to do with it at the end of the day? So I think that's a perfect way to sum up the episode. Cool. Well, I think that's all we had. We'll make sure to post all the links that Nilesh had to offer at the end, if you want to get in contact with him or check out Inferless
00:54:56
Speaker
or some of these other projects that we talked about. But yeah, I think that brings us to the end of today's episode. I'm Ryan. And I'm Bhavin. Thanks for joining another episode of Kubernetes Bytes.
00:55:09
Speaker
Thank you for listening to the Kubernetes Bytes podcast.