
Building the AI Hyperscaler with Kubernetes

S4 E13 · Kubernetes Bytes
1.3k plays · 5 months ago

In this episode of the Kubernetes Bytes podcast, Bhavin sits down with Brandon Jacobs, an infrastructure architect at CoreWeave. They discuss how CoreWeave has adopted Kubernetes to build the AI hyperscaler. The discussion dives into details around how CoreWeave handles Day 0 and Day 2 operations for AI labs that need access to GPUs. They also talk about lessons learnt and best practices for building a Kubernetes-based cloud.

Check out our website at https://kubernetesbytes.com/  

Episode Sponsor: Nethopper
Learn more about KAOPS:  nethopper.io
For a supported-demo:  info@nethopper.io
Try the free version of KAOPS now!   
https://mynethopper.com/auth  

Cloud Native News:

  • https://siliconangle.com/2024/06/24/ollama-addresses-remote-execution-flaw-following-wiz-discovery/
  • https://siliconangle.com/2024/06/18/suse-acquires-kubernetes-observability-startup-stackstate/  

Show links: 

  • https://www.linkedin.com/in/brandonrjacobs/
  • https://www.coreweave.com/  

Timestamps: 

  • 00:01:39 Cloud Native News 
  • 00:05:30 Interview with Brandon 
  • 00:51:37 Key takeaways
Transcript

Introduction and Hosts

00:00:03
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Wallner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts.

Cloud Native News and Challenges

00:00:14
Speaker
We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:00:30
Speaker
Good morning, good afternoon, and good evening wherever you are. We are coming to you from Boston, Massachusetts. Today is June 28, 2024. Hope everyone is doing well and staying safe. Can't believe it's almost the end of June, the end of six months. I remember when we started this year, we had the New York Times platform architect on the pod for an episode, and we were talking about how this year we want to bring in more practitioners to talk about their experiences with Kubernetes. Looking back at six months, that hasn't panned out a lot; we still had a lot of vendors on. It turns out it's easier for vendors to talk about their products and the challenges and how they solve things than for practitioners to talk about their internal details.

CoreWeave's Kubernetes Platform

00:01:13
Speaker
But today, we are back to the practitioner approach. We have Brandon Jacobs from CoreWeave, which is an AI hyperscaler, and he'll talk about how they have built
00:01:24
Speaker
their entire cloud platform based on Kubernetes from the beginning. So I'm excited for that discussion. Before we dive into the actual episode, or the actual interview, let's cover a couple of news items that caught my eye. The first one being Ollama. I know Ollama, we have spoken about it a few times on the pod as something that we use to run LLMs locally. I personally use it on my laptop: if you want to download Llama or any other open source models, like the Google Gemma model, you can download it from Ollama and run it locally. Wiz Security actually found a new CVE which allows an attacker to send specially crafted HTTP requests to an Ollama server.
00:02:12
Speaker
The CVE, CVE-2024-37032, they actually have an interesting name for it called Probllama. Probllama, I guess, a play on words. But again, the blog that we'll share in the show notes included that as soon as they shared the vulnerability with the Ollama folks, they jumped on it, resolved it in four hours, and there's a patch that's available. So if you haven't updated Ollama in a while, make sure you update that if you're running LLMs locally. Next up, we have another acquisition in the Kubernetes ecosystem. A couple of weeks back, I think SUSE had their annual conference, SUSECON, in Berlin, Germany. And they announced there that they acquired an observability platform called StackState. The eventual plan is to integrate StackState and all of its end-to-end observability capabilities into Rancher Prime. But as with everything that SUSE does, they are also planning on
00:03:10
Speaker
open sourcing a version of StackState, so that it's available for users to use without having to pay for licenses. So that's cool. StackState, what is it? StackState is an observability platform that automatically discovers and maps dependencies between all your different components: all your different containers, how they map to Kubernetes nodes, how they map to cloud resources and other infrastructure services. They've built all of these integrations to give you a complete picture. So it would be pretty cool once they integrate it into the Rancher Prime platform; from that single pane of glass dashboard, you can visualize how your apps or your clusters are doing and how they map to the cloud resources or infrastructure resources that you might have underneath it. So something to look out for later in this year, that's the timeline I think for the Rancher Prime integration and also for open sourcing a version of StackState. There was no price disclosed, and honestly I didn't know StackState
00:04:04
Speaker
as a startup or as a company, so this was news to me. Congratulations to everybody there. SUSE is a great company that's always innovating for their customers, so a good company to be a part of. With that, I think I'm ready to introduce Brandon and start the interview section of the podcast. This episode is brought to you by our friends from Nethopper, a Boston-based company founded to help enterprises achieve their cloud infrastructure and application modernization goals. Nethopper is ideal for enterprises looking to accelerate their platform engineering efforts and reduce time to market, reduce cognitive load on developers, and leverage junior engineers to manage and support cloud-mandated upgrades for a growing number of Kubernetes clusters and application dependencies.
00:04:51
Speaker
Nethopper enables enterprises to jumpstart platform engineering with KAOPS, a cloud-native, GitOps-centric platform framework to help them build their internal developer platforms, or IDPs. KAOPS is completely agnostic to the Kubernetes distribution or cloud infrastructure being used and supports everything including EKS, GKE, Rancher, OpenShift, etc. Nethopper KAOPS is also available on the AWS marketplace. Learn more about KAOPS and get a free demo by sending an email to info@nethopper.io, or download their free version at mynethopper.com/auth.

CoreWeave's AI Workloads on Kubernetes

00:05:30
Speaker
Hey Brandon, welcome to the Kubernetes Bytes podcast. I'm so excited to have you on the pod. I know CoreWeave does a lot of interesting things, but I recently learned from you that they use Kubernetes under the covers, and I want to talk more about that. Let's start by having you introduce yourself quickly, and then we can dive into what CoreWeave is and how it uses Kubernetes. Yeah, that's awesome. Thank you for having me, I'm glad to be here. So my name is Brandon Jacobs. I am an infrastructure architect here at CoreWeave. I've been at CoreWeave for a little over two and a half years now. And if you're not familiar, CoreWeave is an AI hyperscaler cloud. We build a purpose-built
00:06:08
Speaker
cloud platform for, you know, AI companies, people doing really large AI workloads at scale. I am primarily responsible for building out our Kubernetes backend, which is run on bare metal and is different from other clouds. And we believe that that is what gives us a more performant and efficient version of cloud infrastructure for AI. Awesome. I like the AI hyperscaler thing. I think I saw on Twitter somewhere, or X maybe, that you guys filed for, not a patent, but a trademark for "the AI hyperscaler." So it looks like you got it. Exactly.
00:06:42
Speaker
Exactly. And you working on building out the infrastructure using Kubernetes makes you the perfect guest to have on, because we have had vendors and we have had practitioners, but we don't have anybody who has built Kubernetes at this scale, so I'm excited to have this conversation. So we already covered what CoreWeave is; let's talk about why you ended up choosing Kubernetes, because from what I know, CoreWeave is newer than, like, AWS and Azure, right? So did you always start with Kubernetes, or did you modernize to Kubernetes? Yeah, it's a good question. So I joined around two and a half years ago. The transition from what CoreWeave originally started out as, you know, just working with GPUs in a garage, to deciding to become a cloud started about a year before I joined, right? So within that year, we had always started with Kubernetes. Our founding engineers, our CTO, VP,
00:07:37
Speaker
from the beginning, we recognized that Kubernetes was the tool that was going to get us to where we needed to be, right? It had so many things from the ground up, why would you rebuild it? And, you know, like you had stated, CoreWeave is a newer cloud, right? We are the new kid on the block when it comes to the hyperscalers. So when you look at that, we weren't, you know, held to the same restrictions that other people had 20 years ago, right? They didn't have all these things. So we didn't have to use, you know, hypervisor layers and build our own orchestration tools. So it was a huge win from that. Right. And plus the Kubernetes ecosystem is incredibly diverse. There's so much support out there and it's a known tool.
00:08:18
Speaker
So it makes adoption so much easier when what you're building is what everyone else uses and it's just a common platform. And with that, right, we've been able to take that and become one of the largest providers of H100 GPUs on the planet, which is incredibly powerful. And we also build some of the biggest supercomputers on the planet, up to 32,000 GPUs under single InfiniBand fabrics, all powered by Kubernetes, which is really incredible. That's awesome. Okay, you already said the biggest GPU cluster; what scale are we talking about when it comes to Kubernetes? Like I'm sure, like any other cloud or hyperscaler, you have different regions, but then how do you manage Kubernetes, or what's the biggest Kubernetes cluster that you have? Yeah, so we've got tons of clusters currently. First there was
00:09:09
Speaker
CoreWeave, the legacy CoreWeave as we call it, which is really funny to say. Two years ago, our original CoreWeave cluster was about six thousand nodes across three different data centers, and that's a little unheard of. Most people

Tailored Kubernetes Environment for Clients

00:09:23
Speaker
don't want to run Kubernetes across regions, and we can tell you why, right? Obviously, you know, you have the heartbeats, the nodes have to talk back to the API server, unless you distribute your API server and etcd. And then they'll tell you, don't run etcd across regions, because there are issues with that, right? And consensus. So there are a number of reasons why you don't do that, but we made it work, right? When you're a scrappy startup, you know, how do you give
00:09:47
Speaker
access to flexible compute to customers who really need that on demand? You know, two years ago we were bringing in customers who were just at the tip of the AI revolution, right? Like, they didn't know what they needed. Stable Diffusion came out, or like GPT, you know, 2.0, all these things were just starting to come out, and people were like, who knows how popular they're going to be? So we needed to have that flexibility, and it was really powerful and it worked, and we had a lot of lessons we learned from that. But now we run very performant, isolated training clusters, inference clusters, anywhere from 1,000 to 5,000 nodes, you know, like I said, 8,000 to 32,000 GPUs, all under single fabrics, which is really incredible. Okay. So are your customers coming to you asking for Kubernetes clusters, or do they not really care about the infrastructure, they just need access to GPUs to run their workloads?
00:10:39
Speaker
I think it's both. What we find is that these AI labs, right, they are extremely competent, they know what they're doing. You get some that have a very particular way that they run, because a lot of these companies already have on-prem clusters, or they have clusters in other clouds. So they have very specific ways that they like to run their Kubernetes clusters, or they have very specific requirements. And we have our own opinions, right? Because we've been doing this a lot, we have certain settings that we like to apply, we have certain things that we install, and sometimes they don't mesh. But what we find is that all these customers care about is the performance, right? And where we differ in how we offer Kubernetes is we include what we call batteries-included Kubernetes. Our managed Kubernetes offering includes a lot of the things that most managed Kubernetes doesn't give you, right? Usually you get a control plane,
00:11:33
Speaker
and that's it. Here you go, right? You have to upgrade the masters, figure it out on your own. Exactly. And we take a lot of that off the plate. You know, we pre-install a lot of the things: the GPU drivers, the health check verification systems, what we call our mission control, right? So it's not a black box. And that's one of the biggest differences. Our Kubernetes is crystal clear to us and the customer. So we get notified of issues simultaneously with the customer, and we are always able to say, hey, your node just went offline or it's less performant than it needs to be, we're going to swap this out, right? And that's kind of the key difference. It's not, here's your cluster, have fun. It's a living, breathing organism that we co-parent and keep alive. And that's really what's different and interesting about how we do it. No, that's definitely interesting, right? So
00:12:25
Speaker
like giving that extra layer of manageability to the customer and monitoring the cluster on behalf of the customer as well. Do you see this more in line with, I don't know if you have been paying attention, at Microsoft Build last month Azure announced AKS Automatic, and GKE has this Autopilot version of Kubernetes cluster deployment as well. Does the CoreWeave offering look like one of those services where it's more managed? I want to say somewhat, right? I'm not too familiar with those, because I haven't had a chance to go play around with those offerings as much. But I will say that our managed Kubernetes, I like to say it's kind of a combination of the control plane and the data plane. You know, for example, we have a very specific setup for the CNI. We have a very standard CSI that we pre-install,
00:13:15
Speaker
performant for AI/ML-style workloads, right? You know, it's a specific NFS client, like there's all kinds of things that we do. And most of the time, customers have to do that themselves, right? Like, oh, I need to set up node feature discovery, I need to set up node problem detector. You don't have to. We pre-bake everything. We pre-install Prometheus. We pre-install your monitoring systems. Obviously, some customers have very specific requirements, like, hey, I want to send it to Datadog, I want to send it to this external platform or another cloud or something, right? And obviously we have the ability to work with them, and you can do that as well. But sometimes you just want to focus on what you care about, which is your application development, your model training, your inference. So what we do is we say, okay, here's a cluster. Everything's in it that you need. Here's how you look at the observability stack. Here's how you deploy your stuff. You're good to go, right? No one needs to touch anything else. And I think that's really the beauty of it, because we can take someone from onboarding into a cluster, getting the nodes online, and going to production in two weeks for a 500-node cluster.
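To make the "batteries included" idea a bit more concrete, here is a minimal sketch, assuming node feature discovery and the NVIDIA GPU operator are pre-installed as described above, of how a user could list the GPU-capable nodes a cluster exposes. The label key follows the GPU operator's usual convention and is an assumption, not something confirmed in the episode.

```python
# Hedged sketch: with node-feature-discovery and the NVIDIA GPU operator
# pre-installed, GPU nodes typically carry labels you can query directly.
# "nvidia.com/gpu.present" is the GPU operator's common label and is an
# assumption here, not a confirmed CoreWeave detail.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

gpu_nodes = v1.list_node(label_selector="nvidia.com/gpu.present=true")
for node in gpu_nodes.items:
    allocatable = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {allocatable} allocatable GPUs")
```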
00:14:23
Speaker
And that's very common; we can get a thousand-node supercomputer online in two to three weeks. Wow, okay, that's awesome. That's crazy to think about, that scale. Okay, so when we started this discussion, you said you deploy Kubernetes on bare metal. So when customers are requesting Kubernetes clusters, are they getting these bare metal Kubernetes clusters? On a previous episode we spoke to the CEO of Loft Labs, and he implied that CoreWeave also uses vCluster

CoreWeave's Node and GPU Management

00:14:54
Speaker
technology. So how do you leverage that, and is that obvious to the end user? So yes, we do have some work we've done with vCluster. The vCluster components are a separate offering, a separate thing that we use. A lot of the clusters that we're deploying today are
00:15:12
Speaker
strictly bare metal nodes, bare metal clusters, containers on bare metal, right, like straight performance, straight to the host. One of the key things that makes this possible for bare metal Kubernetes in our environment is that we leverage NVIDIA DPUs for network isolation and network security, which means that we're able to provide cloud-based VPC security, right, just like any other cloud, except we don't have to do hypervisor layers or anything like that. We just do straight host-level isolation, which is great, right? Because with supercomputers, you know, sometimes you have to get onto the host. You have to see, like, what is happening with my mounted storage, or what is going on, like why is containerd messing up, or why is the GPU, nvidia-smi for example, why am I getting corrupted output on my training job? Like, you know, and
00:15:58
Speaker
it gives the customer a chance to dig in, because obviously a lot of people have that expertise. You know, we have people in-house who are good at ML, good at understanding the customer's problems, but we're not AI labs. It's a lot of give and take: them identifying problems, us learning from them, and then us updating our stuff to help them. So it's a living relationship, and I think it's actually really helpful for how we do that. Interesting. So when you're talking about having access to those physical resources, there is no level of multi-tenancy that CoreWeave does. You give them access to these nodes with Kubernetes and everything installed on top of it, and then the customer can decide if they want to use different GPU sharing mechanisms like time slicing or vGPUs; all of those things are left to the customer.
00:16:48
Speaker
Yeah, absolutely. Historically, CoreWeave has never found a need to do GPU fractionalization or time slicing, because we've always had such a large array of GPU SKUs to offer, right? Historically, we started with like RTX 5000s; we had all these different sizes, right, that could fit eight-gig models, you know, 24-gig, 48-gig, and then we got the A100, so you're fitting 40-gig, 80-gig models. So with that, right, traditionally,
00:17:19
Speaker
people only had one GPU type, right? So in order to make it worth your money, you had to split it up, because you didn't need the whole thing. But we never had to do that, so we would always just tell people, hey, just choose the right size GPU for your workload. So we prefer not to do that, but obviously in this environment there's nothing stopping anybody from doing that, right? So if you want to run a system like Run:ai, or anything else that does that stuff, it's totally possible, it's well supported, and they can do that. But lucky for other people, they don't have to if they just want to request certain GPU types that meet their needs.
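As a hedged illustration of "choose the right size GPU for your workload", the sketch below requests one whole GPU of a specific type via a resource limit plus a nodeSelector. The "nvidia.com/gpu.product" label key and its example value follow the NVIDIA GPU operator / node-feature-discovery convention and are assumptions, as is the container image; none of this is a confirmed CoreWeave API.

```python
# Hedged sketch: request a whole GPU of a particular type instead of slicing it.
# The label key/value and the image are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference"),
    spec=client.V1PodSpec(
        node_selector={"nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3"},
        containers=[
            client.V1Container(
                name="worker",
                image="registry.example.com/inference:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one whole GPU, no time slicing
                ),
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```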
00:17:53
Speaker
Okay, okay. No, that makes a lot of sense. You were talking about installing the node discovery plugins and those things on top of your Kubernetes cluster. I've done that in the past, and for first-timers it can get tricky; it's not the most straightforward thing. So I'm glad that you guys are solving it for the customer. From an underlying provisioning perspective, how do you handle GPUs? Do you have different types of nodes, and based on what the customer is asking for, you use those to create a cluster? How does that work?
00:18:25
Speaker
Yeah, so obviously, like any cloud, we have instance types, right? We've got our core CPU types that people will get in their clusters that run what we call core critical services, like you would in any cluster: you've got CoreDNS, you've got, if you're running Calico, Typha and the kube-controllers, or if you've got Cilium, you've got Hubble, all that stuff, right? You've got some nodes that need to run those things. And obviously customers have CPU compute needs as well that they don't want to be running on GPU nodes, because a lot of times people might take up the whole host with just a few containers. So you've got CPU instance types, and then when it comes to GPUs, I can't speak to the specific instance types. I'm not going to go into the details; I'll leave that to the sales guys and those people. But obviously, as new GPUs come out, we just announced GB200 for the future, we've been doing H100, customers will come and say, hey, I want H100 nodes. We've got an instance of an H100, or lots of instances of H100s. We have a whole part of our company that's focused on
00:19:33
Speaker
the hardware, the life cycle, because GPU nodes are not, you know, set it and forget it. Supercomputers do not stay up for more than two weeks; I think Jensen said that, from what I remember someone telling me. So when you think about this, a Kubernetes cluster with hosts like this is not just done once you deliver it, right? It is a constant thing of making sure that the node is healthy when you get the node. You run it through checks, you make sure that from the vendor it's even okay. And then it goes online, and then you have to actually run it through your provisioning checks. But then it's constant after that; you have to keep checking it, right? Because these things will potentially break. And it happens with each new model, right? It's just like a car: the first model of a car is always going to have some things, it's going to get replaced, the second version is always going to get better. So being first is always fun, because we find all these things, but it makes us better. We're able to identify the problems, we get faster at fixing them and getting the customer back online. But in terms of getting the node, it's a very straightforward process. We're also working on building out all the standard autoscaling logic that you might want for GPU nodes. So it's a very standard feeling on our Kubernetes, as you would get on any other cloud as well.
00:20:54
Speaker
So let's say I'm an AI lab who requested a 1,000-node cluster. What happens when five of my nodes fail in two weeks? Do you have hot swaps available? Or when you run out of capacity, that's it, you'll have to wait for some time? Usually we do have hot spare capacity available, but it's not a hundred percent; that's something that CoreWeave works with you on individually to make sure that that's possible, right? So obviously, with a thousand-node cluster, if you need a thousand nodes, we wouldn't be very smart if we only had a thousand, right? But we try to make sure that we have the right capacity for the right customers at the right time. You know, obviously,
00:21:33
Speaker
sometimes there are shortages in capacity and it takes a while, but for the most part we always guarantee that our customers have the capacity that they need to run at their required amount. Okay, okay, so let's go back to Day 0, right? You said people are requesting these huge clusters.

Automation and Custom Developments

00:21:51
Speaker
What does your automation stack look like? I'm sure people are not manually logging into each cluster, each node, and then installing kubectl and configuring things using kubeadm. How do you set up these large environments? What does Day 0 look like? Yeah, so we've spent a lot of time building a lot of custom operators in Kubernetes. One of the cool things, and I've actually seen this
00:22:14
Speaker
be an interesting shift in the space in the last year or so, is Kubernetes on Kubernetes. We've been doing that for three, four years now. Because when you look at it, what is Kubernetes great at? It's great at orchestrating. It's great at running containers. It's great at taking all that complexity away. Okay, well, how can we use that to run Kubernetes? So you think, okay, we run control planes in Kubernetes. And what that means is that we're able to build custom operators, we're able to build
00:22:45
Speaker
all the standard tooling that leverages metrics exporters and all the other things to monitor API servers, to monitor etcd. We've been fortunate to build our own etcd operator; that's a whole other topic, that's a very fun space. I'm actually excited to hopefully talk about it this year at KubeCon, we'll see how that goes. But yeah, so it makes that space so much easier from a Day 0 perspective, to be able to manage via operators, and we're a big GitOps company, right, so internally we do a lot of that for our infrastructure. But on top of that, that's kind of the easiest way for me to say what Day 0 looks like. And then for our hosts, we've worked for the last four years to build out our own baked OS that has everything we need, like I was talking about with the GPU drivers, all the optimizations. Interesting.
00:23:39
Speaker
It's just there. It's just ready, right? So we're able to get nodes online into a cluster in about 30 minutes, give or take. And obviously, when a control plane comes online and a customer requests a node, a node's going to boot up. A node's going to come online in 60 seconds, just like any other cloud. But if you're talking about when a node gets unracked from the vendor, obviously it goes through checks and gets booted up. But it's the same thing. It still maybe takes just a few minutes, because we have our systems completely the same for everything in all of our data centers. And it's really great, because there are no snowflakes.
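CoreWeave's operators, including the etcd operator mentioned above, are not public, so here is only a minimal, hypothetical sketch of the "control planes as custom resources" pattern the conversation describes, written with the open source kopf framework. The resource group "example.com", the ControlPlane kind, and the Deployment layout are assumptions for illustration, not CoreWeave's design.

```python
# Hypothetical sketch of Kubernetes-on-Kubernetes: a tiny operator that turns
# a "ControlPlane" custom resource into API server pods. Group, kind, and
# layout are illustrative assumptions only.
import kopf
import kubernetes


@kopf.on.startup()
def configure(**_):
    # Load credentials for the official client; use load_incluster_config() in a pod.
    kubernetes.config.load_kube_config()


@kopf.on.create("example.com", "v1", "controlplanes")
def create_control_plane(spec, name, namespace, logger, **_):
    """When a ControlPlane resource is created, run its API server as containers."""
    replicas = spec.get("replicas", 3)
    apps = kubernetes.client.AppsV1Api()
    deployment = kubernetes.client.V1Deployment(
        metadata=kubernetes.client.V1ObjectMeta(name=f"{name}-apiserver"),
        spec=kubernetes.client.V1DeploymentSpec(
            replicas=replicas,
            selector=kubernetes.client.V1LabelSelector(match_labels={"controlplane": name}),
            template=kubernetes.client.V1PodTemplateSpec(
                metadata=kubernetes.client.V1ObjectMeta(labels={"controlplane": name}),
                spec=kubernetes.client.V1PodSpec(
                    containers=[
                        kubernetes.client.V1Container(
                            name="kube-apiserver",
                            image="registry.k8s.io/kube-apiserver:v1.30.0",
                        )
                    ]
                ),
            ),
        ),
    )
    apps.create_namespaced_deployment(namespace=namespace, body=deployment)
    logger.info("Control plane %s scheduled with %d API server replicas", name, replicas)
    return {"phase": "Provisioning"}
```

Such a handler would typically be started with `kopf run operator.py`; the point is only that the control loop, not a human, keeps the hosted control plane converged.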
00:24:14
Speaker
That's true. I think having that custom OS, which is standardized and has everything installed, definitely saves time during the initial configuration, right? Is this something that's open source, like the AWS Bottlerocket thing, which is available to other customers? No, it's not open source. And it's also not anything super secret, right? Just think of it as a very opinionated, packaged thing that does everything we think it needs to do for what we're meant to do.
00:24:46
Speaker
Yeah, and I think when you're dealing with scale, it's important that you have a golden image somewhere, and then everything gets imaged with that. And then if there is an issue, knowing what version different nodes are running becomes easier to fix. Absolutely. Yeah. You find that at scale, one minor change can really make an interesting difference, especially when people are looking at performance, like the GPU utilization levels, or the temperature something runs at, or a sysctl setting change, or a kernel version update, or NFS driver changes. It's amazing what some of these customers will be able to tell and notice that you didn't think would be a big issue. So it's really fun when you start looking at it from that angle.
00:25:30
Speaker
Gotcha. So you brought up GitOps, right? Let's dive into that a bit. Let's say I went to the CoreWeave portal, I don't have an account, but maybe I should if you have a free tier, and I request a cluster. That kicks off, let's say, an Argo CD pipeline; what does it do to get the cluster ready, if you can chat about how you do it? Yeah, so we have a number of things that we do to kick off and provision, right? Obviously, our clusters are pretty isolated to the individual sites or regions that they're in, because with supercomputers, you don't really do supercomputers across availability zones or across sites, right? So our model is kind of like, if I wanted one in, you know, California, you would be in California. So we have some global tooling that would take those requests, filter them out to the proper region, and go and provision the control plane within that data center. And then once that control plane is online, that would trigger some GitOps magic to provision all the applications and all the data plane components that we talked about, right? The CNI, the CoreDNS, the
00:26:39
Speaker
you know, all those things, and get those up and running, and then put in the initial nodes that you need to run that stuff, right? Because every cluster needs one or two worker nodes, because, like I was saying, since you're running Kubernetes in Kubernetes, your master nodes are not nodes; your masters are containers. Which means, okay, so theoretically, in Kubernetes, when you go to do something like kubectl get nodes, you wouldn't have any nodes, because your masters are not nodes in your cluster. They run outside, which is great for us from a management perspective. And it also takes some of the complexity away from the customer; they don't need to worry about it, it's just fully managed. But it does require us to boot in initial worker nodes, which is just part of what we do and how we offer a control plane for a customer.
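The episode doesn't detail the "GitOps magic" itself, so here is only a hedged sketch of what that hand-off step often looks like: once a control plane is up, the data plane components (CNI, CoreDNS, and so on) are registered as an Argo CD Application that syncs from Git. The repository URL, path, and destination server below are hypothetical.

```python
# Hedged illustration: register a newly provisioned cluster's data plane as an
# Argo CD Application so it is installed and kept in sync from Git.
# Repo URL, path, and destination are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "customer-cluster-dataplane", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            "repoURL": "https://git.example.com/platform/dataplane.git",  # hypothetical
            "path": "clusters/customer-a",
            "targetRevision": "main",
        },
        "destination": {
            "server": "https://customer-a-apiserver.example.com:6443",  # hypothetical
            "namespace": "kube-system",
        },
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

custom.create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="argocd",
    plural="applications",
    body=application,
)
```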
00:27:25
Speaker
No, that's super cool. Provisioning control plane nodes as containers definitely speeds things up, and it makes it more resilient, right? As you said, you use Kubernetes in Kubernetes, so those control loops kick in whenever one of the control plane nodes for the customer goes down. That's awesome. Yeah, the resiliency piece is really a fun story here. Because when you talk about containers and being able to do things like rack affinities or availability zone affinities, obviously a container spin-up time is so much quicker than saying, oh, I lost a master node, what do I do? We don't have to deal with that. It's just, okay, let me reschedule this, and state will just be fine. So it's really great.
00:28:07
Speaker
So what does Day 2 look like? When I've dealt with other hyperscaler-based Kubernetes clusters, upgrades are definitely the biggest pain, but it also includes things like AMI refreshes, and I'm specifically talking about Amazon. It also means that I have to back up or protect the applications that I'm running. What does that look like in a GPU world? What are the Day 2 operations and how do you handle them? Yeah, it's a good question, because there are a lot of differences between the GPU clusters and normal clusters, right? Obviously, you have the same requirements when it comes to Kubernetes upgrades and keeping the kubelet versions within a certain range. But one of the nice things is that, you know,
00:28:48
Speaker
we manage that for the customer, in the sense that, with clusters like this, you can't just take them offline. We're not going to be like, hey, we're going to upgrade your cluster; it is a coordinated effort. But say the cluster needs to do an upgrade, right? We need to upgrade firmware, we need to upgrade a GPU driver, or whatever. Obviously, as long as there's nothing running on those GPU hosts, the nodes get marked as needing upgrades, and whenever they become available, we just run them through the upgrade process, right? And it's just seamless to the customer: the node just gets updated, uncordoned, here you go, right? And it's just one of those things that we have processes for, from our automation side, that do it,
00:29:35
Speaker
and then our support and our solutions architects work with the customer to make sure that they're aware of when this is happening. And that's one of the key things, right? We have it as hands-off as it should be, but at the same time, these aren't clusters and these aren't environments where you can just say, hey, we're updating your cluster in five minutes, good luck. It just doesn't happen. These are extremely sensitive, large clusters that people are paying money for to do groundbreaking work. So we take a little different approach, but we also take the complexity away from the customer: as long as they approve it, we just roll through it. And it just updates, right? With the control plane, you just roll the control plane; the control plane's rolled, and then you just start rolling the nodes. And we just update it.
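The mechanics behind "the node gets marked, drained when idle, upgraded, and uncordoned" boil down to standard Kubernetes calls. A minimal sketch of that cordon-and-drain step with the official Python client is below; the node name and the idea of a marker label are hypothetical, while the cordon and eviction APIs are the stock ones any such automation would use.

```python
# Hedged sketch of the standard "cordon, drain, upgrade, uncordon" cycle.
# The node name is hypothetical; the API calls are the stock Kubernetes ones.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE = "gpu-node-017"  # hypothetical node name

# 1. Cordon: mark the node unschedulable so no new pods land on it.
v1.patch_node(NODE, {"spec": {"unschedulable": True}})

# 2. Drain: evict remaining pods (skipping DaemonSet-managed ones).
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    owners = pod.metadata.owner_references or []
    if any(o.kind == "DaemonSet" for o in owners):
        continue  # DaemonSet pods come back on the node anyway; a real drain skips them
    eviction = client.V1Eviction(
        metadata=client.V1ObjectMeta(name=pod.metadata.name, namespace=pod.metadata.namespace)
    )
    v1.create_namespaced_pod_eviction(
        name=pod.metadata.name, namespace=pod.metadata.namespace, body=eviction
    )

# 3. Firmware / GPU driver work happens out-of-band; afterwards, uncordon:
# v1.patch_node(NODE, {"spec": {"unschedulable": False}})
```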
00:30:25
Speaker
Okay, I think that kind of has to happen that way, right? Because all of these training jobs are weeks long, if not months long, so you can't just pull the rug from underneath them and say, I need to

AI Workload Optimization

00:30:38
Speaker
update the version. So, okay, that's interesting. What about other data operations? Do these training jobs need to be checkpointed, do you help with that, or what are the operations that customers end up asking for? Yeah, so a lot of our customers are really smart about this stuff, right? Everybody has their own methods for how they do checkpointing, for how they do training jobs.
00:31:00
Speaker
And we're not going to pretend we're the experts. We do have a small team that is focused a lot on stuff like that, and they're really smart. I'll give an example. We've actually given some talks about this before. We have an open source project called Tensorizer, which is related to the ML space; it basically allows you to, I think, correct me if I'm wrong, I'm not an ML engineer, tensorize the weights of a model and stream them directly into GPU memory. So it really speeds things up. And this is really used for inference, not training. But just to give an example, we have teams that are focused on optimizing that type of performance. So for the inference side, we'll work with customers and help them integrate Tensorizer, which means that for inference-style workloads, they can go from, let's say, a pod spin-up time that would take
00:31:46
Speaker
two minutes, because that's how long it takes to download the container, right, because most people will put the weights in the container, etc. But with Tensorizer, you basically just put the weights in object storage and it downloads them directly into memory, right? So if you start optimizing that way, we can get containers spun up in 15 seconds. Oh, wow, okay. So when you talk about autoscaling, we focus on making things fast, efficient, and scalable. And the same for training as well. But, you know, it kind of all depends on the use case. But yeah.
00:32:19
Speaker
And when you mentioned the 15-second boot-up time for a container, you actually meant like a 70B Llama 3 model, not just a random NGINX container, right? Yeah, exactly. Something with, you know, 20, 30, 40 gig model weights. So you're talking about really pushing downloads, and historically having large containers is not ideal, because it's not fun to wait for a container download and you can't really speed it up a whole lot. I mean, there are a few things you can do, right? There are some things we can do at the containerd level and all that other stuff. But, you know, I would prefer not to.
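Tensorizer is the open source CoreWeave project named above; a hedged sketch of its documented usage pattern, pulling pre-serialized weights from object storage and streaming them into an already-instantiated model rather than baking them into the image, is shown here. The function names follow the project's published examples but should be treated as assumptions, and the model id and bucket URI are hypothetical.

```python
# Hedged sketch of the Tensorizer pattern: stream serialized weights from
# object storage into a model skeleton instead of shipping them in the image.
# API names follow the project's documented examples (an assumption here);
# the model id and object path are hypothetical.
from tensorizer import TensorDeserializer
from tensorizer.utils import no_init_or_tensor
from transformers import AutoConfig, AutoModelForCausalLM

MODEL = "meta-llama/Meta-Llama-3-70B"               # hypothetical model id
WEIGHTS = "s3://example-bucket/llama3-70b.tensors"  # hypothetical object path

model_config = AutoConfig.from_pretrained(MODEL)

# Build the model skeleton without allocating or initializing weights.
model = no_init_or_tensor(lambda: AutoModelForCausalLM.from_config(model_config))

# Stream the serialized tensors directly into the module (onto the GPU).
deserializer = TensorDeserializer(WEIGHTS, device="cuda")
deserializer.load_into_module(model)
deserializer.close()
```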
00:32:59
Speaker
Gotcha. Okay. And you led us down a good path, right, talking about open source projects, not just the ones like Tensorizer that you've built. I want to ask you, what are the different open stack, sorry, open source projects or homegrown tools that you use for different layers, like storage, networking, authentication, monitoring, observability? How do you handle all of these things? Yeah, so obviously CoreWeave does have a number of things we built in-house, but a lot of it's based on, you know, Kubebuilder, the Operator SDK, the Operator Framework, right? So we have a lot of operators we've built, since we are Kubernetes native; we find that to be extremely useful and valuable for how we like to do operations.
00:33:37
Speaker
But in terms of open source, our Kubernetes and the things we run in Kubernetes are basically what you would find in any on-prem installation. Basically, if you walk into KubeCon, that whole row of things, you'd probably see that. So we're talking, we run vanilla K8s, right? Obviously with certain settings, but we run vanilla K8s, and we run traditional, you know, etcd, cert-manager, external-dns, MetalLB, Calico and Cilium, node feature discovery, the device plugins, KubeVirt, CoreDNS. I mean, really, we run a lot of the standard things that people would expect to run in Kubernetes. And what's great about that is that,
00:34:18
Speaker
by us running that, not only do we run it internally and in our managed offerings, but when a customer comes to us, they're familiar with that product, because the world's familiar with that product. It's not like we're doing anything special. And it also means that with the debugging efforts, when bugs come up or when there are patches, we can all speak intelligently about what the system is doing, because we all know it. And I think that's really what's key, right? Because what we're delivering is these large clusters that are focused on performance and scale for AI/ML workloads. Why should a customer spend the time debugging something that's proprietary in a cluster when that's not what they're trying to do? And also, don't reinvent the wheel. We're not here to reinvent the wheel on some of this stuff. We're here to deliver maximum performance for what these companies need to deliver.
00:35:10
Speaker
Okay, no, I agree with that, right? Don't reinvent the wheel. People have put in years' worth of work developing these standards, developing these things, so why try to do it on your own? So you brought up one of the projects as KubeVirt. Can you talk about how you use KubeVirt? And do you also have like a VM infrastructure-as-a-service offering where people can request a VM and that uses KubeVirt? Can you talk more about that? Yeah, so historically, like I said, CoreWeave runs bare metal Kubernetes. That is the inverse of how a lot of people run Kubernetes on clouds, right? So to get around that, when we started out, a lot of people weren't doing distributed training yet; you had a lot of researchers and people who needed VMs, because that's how they're used to, that's how a lot of people are historically used to running compute.
00:36:01
Speaker
So, in order to meet those requirements, we ran KubeVirt, right? In the past, we had our own set of custom operators and tooling in our cloud that allows you to create VMs and a number of other things, but it's all powered by KubeVirt on top of bare metal Kubernetes. Which is great, right, because you can come in and ask for a VM with a 500-gig ephemeral root disk. You can say, attach me a public IP, give me four extra volumes. You can bring your own OS, you can bring your own image, obviously all that other stuff. And you can spin up a GPU workstation in a matter of, you know, a minute, which is really great. And it's been extremely powerful in the past. What we're seeing now, though, is
00:36:44
Speaker
there isn't a huge ask for that now, because of what these companies are doing, right? They're doing multi-node GPU training jobs, or you're talking about scale inference, and VMs just don't give you that. That's what Kubernetes is better at; that's what Knative is for, that's what all these systems are for. So we can support that; KubeVirt has been a great tool and I think we're always going to continue to have that in our stack. But we're not seeing that be the ask right now. We're just seeing, okay, how do I get 32,000, how do I get 64,000 GPUs? It's just getting larger and larger, and VMs aren't going to cut it. I love that: VMs are not going to cut it. So Brandon, one question, since this is a storage-focused podcast, right, I always had this question, and since you do this on a day-to-day basis,
00:37:34
Speaker
for these AI labs, what kind of storage do they expect? Are they expecting block storage, file storage, or is everything just object in the AI world? It's interesting, because it really depends on the company. I think everyone has their own flavor of what they like. Some people are all about object, some people are all about a shared file system, NFS, others love having their block, right? When it comes to distributed training, you can't really get away from having local checkpointed files on the node itself, but you have to back them up somewhere, right? Whether that's a networked file system somewhere or object storage somewhere. And so we have a storage offering that
00:38:17
Speaker
you know, is network-attached storage, right? It's NFS; we have a partnership with VAST, and others, but we also have our storage team, which is amazing at optimizing for extremely high-throughput workloads, because you can't imagine what it's like when you have a thousand nodes trying to download a file; it's pretty crazy, right? And I'm not on the storage team, so I'm not going to speak too closely to what they do on a day-to-day basis, but it's pretty incredible, some of the requirements some of these people have, and just some of the things that we have to do to get to that point.
00:38:50
Speaker
Gotcha. And I know VAST has, again, let me know if this is not something that you own, but I know VAST has a CSI plugin, and that would make sense, right, for the shared file system. What about object? Like COSI, the standard; we did a podcast like one and a half years back with the community members, and it was still in the beta stage then, and I think it's still stuck somewhere there. It's not completely GA. Yeah, yeah. Have you built something of your own that allows you to consume object storage from inside Kubernetes, or how does that work? Yeah, you know, it's a good question, because two years ago we had the same thought. I had looked at COSI about two years ago and I was really interested in it, and I was hoping it got further along; I would be interested to see where it continues to go. But no, we didn't feel that COSI was quite ready, or we didn't really feel that object storage in Kubernetes was at a point where it made a lot of sense. We do have our own object storage offering, which is great, but we don't
00:39:49
Speaker
have it, I guess, natively integrated into Kubernetes, but it is something that we're actively interested in, obviously. But it's such an interesting space. I'm glad you brought it up, because I've been thinking about it for the past few years, seeing where this will go and how it can be done really well. And I think at the Paris KubeCon, I saw some talks from the CSI community, like the SIG Storage community, and now they are trying to bring COSI along and make it part of SIG Storage. So I'm expecting a bit more progress over the next year than what we saw over the past two years. Yeah, I'll be very interested in that.
00:40:24
Speaker
Yeah. Okay. Thank you for going on that tangent with me. So you have been doing this for two and a half years. I'm sure you have a lot of long weekends where you were working, or lessons learned, fire, flood, and blood stories. Do you mind sharing any specific lessons you learned that could be helpful for anybody who's not doing things at this scale, but is still trying to adopt Kubernetes inside their own organization?

Operational Challenges and Observability

00:40:50
Speaker
Yeah. Oh man. You know, it's one of those things: I think every engineer always hears that you don't really know what you know until you do it at scale. And people have asked, well, how do you get to do it at scale? I'm like, well, come join CoreWeave, or come to another company like it. But yeah, obviously, at scale there are a lot of challenges, right? I think one of the first things that you have to be careful about is your CNI configuration, right? Especially when you're dealing with multi-region setups, which is
00:41:23
Speaker
not the norm, but it does happen. Obviously, any ripple effect in the network could cause an overload or strain on the control plane components, right? So you need to make sure that you have enough capacity on the control plane side, that your API servers are not going to get overloaded, that it's not going to break down and, you know, fail to commit a message or something. These things just ripple, right? So when you're talking about scale, you have to really make sure you understand where the bottlenecks are going to be, because once it hits one point, it's going to ripple that effect down. The CNI has issues, or, let's say, you lose connectivity to a data center: all of your nodes go down. Okay, well, now they all come back up. Well, now they're thundering-herding back to the API server. Well, now my API server is down, so then everything else is unhealthy, right?
00:42:10
Speaker
So it's a matter of anticipating what's going to happen, making sure that you identify how all the components interact. And I think a lot of people that run Kubernetes get scared of all the pieces that are needed. But I think you should just take the time to invest in how they play together. And for us, right, because we are a cloud, we run everything on-prem, I've been fortunate enough to be in the weeds on a lot of the networking as well. And I think a lot of people take the networking aspect for granted, because there's a lot involved in Kubernetes networking, in understanding how to make it performant and how that plays with your setup, right? And everyone does it a little bit differently; all the clouds have their own
00:42:47
Speaker
networking setups, and there's not much you can do about that, right? But I think if people could take the time to understand how everything plays together, like, okay, what is CoreDNS actually doing? What is the CNI actually doing, right? You don't have to be an expert; I don't think anybody needs to go and make code changes to Calico, or Cilium, or any of these CNIs. But I think it helps, right, the ability to troubleshoot, so that when somebody comes and says, oh, my pod's networking is not working, or I can't hit Google, it would be nice if you could easily say, I know what the issue is, right? And I think most of our time as engineers is spent saying, is this really a problem, or where is the problem? Because there's nothing worse than just sitting there asking yourself, well, what is the issue, this looks fine, right? And I think
00:43:37
Speaker
that's what I've learned a lot in the past two years from running at scale: one thing can cause six other things to fail, but they're all just side effects, right, and you have to find the root cause. And I've seen that when people are doing things at scale, observability and monitoring is the key to making sure you can not just monitor your systems but also help with RCA. Is that like a best practice inside CoreWeave, that anything that gets built, anything that gets integrated into the stack, needs to have this level of observability from scratch, or it doesn't go to production? Is that like a gating factor?
00:44:14
Speaker
Absolutely. But I think it's really easy for us because we're Kubernetes, right? So we run the kube-prometheus stack. We have ServiceMonitors for everything, PodMonitors for everything, dashboards, alerts, all this stuff. So, thankfully, we get a lot of this for free with just the Kubernetes baked-in alerting that you would expect, right? And obviously, for any custom services, you do your job and create custom metrics for those and things like that. But because we run infrastructure, a lot of our alerts are the standard alerts, right? And it's just a matter of making sure that we route them in the appropriate ways, and that we have the right thresholds for things. Because, you know, with some of these environments being so
00:44:55
Speaker
finicky, right, sometimes you're going to get false positives, but in our case that's just a matter of us tuning the alerts. I think we're pretty fortunate that we have such a good ecosystem with Kubernetes that a lot of it's included for us, so we're not having to reinvent the wheel, like I said before, which is really great. And since you brought up the CNI, my mind went to, how do we handle CVEs? Because I think two or three weeks back, Aqua Security released a report where they said Fluent Bit had
00:45:28
Speaker
a remote execution vulnerability, and this is how you fix it, things like that. So when it comes to CoreWeave and CoreWeave customers, what does the shared responsibility model look like? Do you monitor for the CVEs, because it's a more managed experience, and fix them for the customer, or do you forward that information over? Oh, it's absolutely a shared thing. If it's part of the stack that we manage, we absolutely remedy that and take care of it, because our customers expect that from us from a security perspective and a management perspective. If it's not something we directly control, we are still aware. We still have direct monitoring of all CVEs, all security concerns; security is a huge concern at our company. So if it's something that we don't necessarily run, but we know that someone's running, right, you know, we do
00:46:16
Speaker
say, hey, you know, we've seen this CVE, will you guys go fix this, right? And usually it's a please-fix-it-or-we'll-fix-it-for-you type of thing. Because you don't want that blast radius to expand outside of one customer, right? And most customers are going to absolutely fix it; no one's going to say, oh, sorry, I want my CVE in my cluster. That's not a thing. Exactly. And I think that's what makes our offering a little bit different, right? It's not a black box. I can't speak to what the other hyperscalers are doing; I'm sure they have their own tooling and it's just fine. But because we don't treat it like a black box, we are always well aware of what's going on, in a sense, to help the customer. So we will never let something that looks off stick around. And I think that's the best way to put it; that's how we think of the shared responsibility model.
00:47:08
Speaker
Yeah. Okay. Any other lessons learned to share? If not, I can jump into my next question. I didn't want to cut you off. No, no, I think that's kind of it.

Future Services and Expansion Plans

00:47:17
Speaker
Okay. Okay. Perfect. So what's next for CoreWeave, right? You hinted at Knative; are you building anything on top of Kubernetes or adopting any new CNCF projects? What does the future look like? Yeah, there's a lot in progress. Right now, CoreWeave is focused on expansion and meeting the needs of the customers, right, and these new GPUs that keep coming out. They're bigger, they're better, and people are asking for more clusters or more nodes. So we're focused on getting that experience more streamlined and expanding the scale. We've just expanded into Europe.
00:47:58
Speaker
So we're looking to make ourselves more available to people that need us. And there are a number of other things we're planning to build to make these experiences better. Our managed Kubernetes is only going to continue to get better. A lot of people want to configure certain things, and the more people we bring on, the more asks they're going to have, right? I want to configure my audit policy, I want to run this, I want this feature gate. And we have our own opinions on some of that stuff, but at the end of the day, this is your cluster, so we're going to let you, but we're also going to tell you, hey, here's what we think is going to work best, because we're also
00:48:34
Speaker
pretty good users of Kubernetes. And that's what's a little bit different, right? We're not just going to hand you something and say, have fun. We're going to coach you and say, okay, I see what you want, but let me tell you why we think this is better, and kind of work together. And I think that's what's really cool. You're not just shipping boxes, like, here are the boxes with GPUs in them, go have fun. Exactly, it's not that. And that's what's really interesting about building something like this. What we're focused on is: how do you give someone this really cool, really flexible sports car that they can do anything they want with, but also tell them, hey, we built this for you, and I promise it's going to work the way you'd want. So it's a fun relationship to figure out and continue to build.
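To make the "configure my audit policy" and "feature gate" asks a bit more concrete, here is a minimal, purely illustrative sketch that renders a standard Kubernetes audit.k8s.io/v1 Policy from Python. The rule choices are hypothetical examples of what a tenant might request; nothing here reflects CoreWeave's actual configuration interface or defaults.

```python
# Minimal sketch: render the kind of audit policy a tenant might ask for.
# The rule choices below are hypothetical examples, not CoreWeave defaults.
import yaml  # pip install pyyaml

audit_policy = {
    "apiVersion": "audit.k8s.io/v1",
    "kind": "Policy",
    "rules": [
        # Skip noisy, low-value events such as kube-proxy watches.
        {"level": "None", "users": ["system:kube-proxy"], "verbs": ["watch"]},
        # Record full request/response bodies for changes to Secrets.
        {"level": "RequestResponse",
         "resources": [{"group": "", "resources": ["secrets"]}]},
        # Everything else: metadata only.
        {"level": "Metadata"},
    ],
}

print(yaml.safe_dump(audit_policy, sort_keys=False))
```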
00:49:20
Speaker
Gotcha. And before we wrap this up, Brandon, I just thought of another question. I've seen Azure and AWS offering these model garden type services, where they host a lot of models that you can use as part of your application. Does CoreWeave have a model garden or a registry somewhere that's available to all the customers? And how do people actually get these models onto their clusters? Yeah, CoreWeave doesn't host anything like that today. There might be plans in the future, but I can't speak to that; that's the other side of the house, and I'm sure somebody would be happy to speak to it. But like I said, for people getting models into their clusters today, it depends on the company. People either pull from Hugging Face or do whatever they want. And anything works, right? That's kind of the beauty of it. That's flexibility: okay, go do your thing. Yeah, exactly. And obviously we do have opinions; we have things that we know work better than others. And that's why people come to us, because we have
00:50:21
Speaker
the history of doing these things the right way and doing them well. But we're not going to tell somebody, hey, you can't do this; we're just going to say, hey, this works better, trust us. Gotcha. Okay, let's wrap this up. Where can users or people that listen to the podcast go to learn more about what CoreWeave is doing in the future? Are you hiring Kubernetes experts? And do you have a tech blog where you publish some of these findings and what your team is up to?
00:50:55
Speaker
Yeah, so if people are interested in CoreWeave, you can go to coreweave.com; obviously that's our website. If you're interested in using CoreWeave, or you have a need for GPUs or Kubernetes with GPUs, reach out to our team and they'll be happy to set up a call. If you're interested in jobs, you can go to CoreWeave on LinkedIn; we have job postings there. We're always looking for Kubernetes experts who are interested in building more things, and I'll be happy to have that conversation with anybody who's interested. That's awesome. Thank you so much, Brandon, for your time today. I was excited for this episode and you delivered, man. This was a fun discussion. Thank you so much for your time. Yeah, I appreciate it. This was fun. Thanks. Okay, that was a great episode. I really liked the level of detail Brandon went into
00:51:43
Speaker
about how they started using Kubernetes three and a half years back as the infrastructure platform of choice for their cloud rollouts. The discussion around scale was just next level: imagine running 6,000-node bare metal Kubernetes clusters, and then offering these clusters as a service to people running HPC jobs, or running AI model training and model inferencing jobs. I know in the past, in the news section of previous episodes, we have covered funding rounds and debt financing rounds that CoreWeave has raised. Obviously, building out infrastructure is not cheap, but I'm glad that they are doing it, and doing it using Kubernetes, because that makes it interesting for our ecosystem.
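For listeners who want a concrete picture of the "pull from Hugging Face" pattern Brandon described for getting model weights onto these clusters, here is a minimal sketch, for example as an init step inside a pod. The model ID and target path are placeholders, and this is just one common approach, not a CoreWeave-specific mechanism.

```python
# Minimal sketch: fetch model weights from the Hugging Face Hub onto a
# volume mounted into the pod (repo_id and local_dir are illustrative).
from huggingface_hub import snapshot_download  # pip install huggingface_hub

local_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",   # placeholder model
    local_dir="/models/mistral-7b",        # e.g. a PVC mount inside the pod
)
print(f"Model files available at {local_path}")
```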
00:52:30
Speaker
Specifically, I liked the lessons learned: making sure observability is part of everything that you do from the beginning, instead of trying to bolt on observability later. I know that advice usually applies to security, but make sure you have observability built in at all layers. Make sure you identify where your actual bottlenecks are, not just the symptoms caused by a specific issue. That was good advice. I'm sure he has some interesting stories, off the record, about which CNI they used and the exact issues he faced, but it was interesting to see how they do these things at scale. I also liked the more managed experience: it's not on their users to upgrade Kubernetes; they handle those things for their customers.
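As a small illustration of the "build observability in from the beginning" takeaway, here is a minimal sketch of a service that exposes Prometheus metrics from day one rather than bolting them on later. The metric names, port, and simulated work are all hypothetical; it simply shows one common instrumentation pattern.

```python
# Minimal sketch: bake metrics into a service from day one.
# Metric names and the scrape port are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server  # pip install prometheus-client

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

def handle_request() -> None:
    REQUESTS.inc()
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<pod>:8000/metrics
    while True:
        handle_request()
```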
00:53:13
Speaker
So overall, a great discussion. I like that they have their own custom OS for the worker nodes with all the GPU drivers and binaries that are needed, and all the NFS provisioner binaries that are needed. Overall, a great way to offer Kubernetes as a service. I know not all of us are going to build hyperscaler-scale environments, but if you are part of a platform team responsible for offering Kubernetes as a service to developers or people inside your organization, think about some of these principles. If you're doing it on bare metal, what are the different things you need to think about, like MetalLB for service provisioning? If you're doing it on a virtualized platform, or just building a front end that eventually deploys EKS or AKS clusters, how do you abstract all of those things away? Think about observability, think about all the different tools, think about consistency. All of those things will help you have a streamlined experience.
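To make the MetalLB point concrete: on bare metal there is no cloud load balancer, so something like MetalLB is what hands out external IPs to Services of type LoadBalancer. The sketch below simply creates such a Service with the official Kubernetes Python client; the names, selector, and ports are placeholders, and it assumes MetalLB (or an equivalent) is already installed with an address pool configured.

```python
# Minimal sketch: create a LoadBalancer Service; on bare metal, MetalLB
# (if installed with an address pool) assigns the external IP.
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="demo-web"),      # placeholder name
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "demo-web"},                    # placeholder selector
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)

created = v1.create_namespaced_service(namespace="default", body=service)
print(f"Service {created.metadata.name} created; waiting for an external IP")
```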
00:54:07
Speaker
Hopefully they'll also reduce your operational overhead and allow you to focus on things that actually matter to the business. With that, it brings us to the end of this episode. Before I sign off, I want to make sure that you give us a great review on Google Podcasts, YouTube, or Apple Podcasts, share the link with all of your friends, and join our Slack. You can go to kubernetesbytes.com for all of our previous episodes, ways to join our Slack channel, and to find us on all the different platforms. Thank you so much for being a listener. With that, it brings us to the end of today's episode. I'm Bhavin, and thank you for joining another episode of Kubernetes Bytes. Thank you for listening to the Kubernetes Bytes podcast.