Intro to distributed databases on Kubernetes

S2 E2 · Kubernetes Bytes

447 Plays · 4 years ago

In today's episode of Kubernetes Bytes, hosts Ryan Wallner and Bhavin Shah discuss the basics of running distributed databases like Apache Cassandra and Kafka, along with Mongo, CockroachDB, and others, on Kubernetes. Various capabilities of Kubernetes were designed for these types of data services, and this podcast should help you get a basic understanding of the landscape as well as why you may want to run them on Kubernetes.

Show Links:

Transcript

Introduction to Kubernetes Bytes

00:00:03

Speaker

You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Wallner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts. We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.

00:00:27

Speaker

Good morning, good afternoon, and good evening wherever you are. We're coming to you from Boston, Massachusetts. Today is January 19th, 2022. I hope everyone is doing well and staying safe.

Weekend Experiences

00:00:42

Speaker

Let's dive into it. Bhavin, how have you been? I'm doing good, Ryan. We had a long weekend this weekend, Martin Luther King Jr. Day. I made a lot of plans, but I didn't do anything.

00:00:56

Speaker

So I found out that on MLK Day, all the local museums, the art museums and science museums in Boston, have free entry, even the zoo. Oh, nice. I didn't know that. And I was like, yes, that's what I'm going to do on Monday. But then the weekend I basically spent watching the NFL Super Wild Card Weekend. And then Monday, it was too cold. I was like, I'm not doing anything outdoors. So I just stayed in and enjoyed a good break. Yeah. Did you happen to watch how

00:01:25

Speaker

the New England Patriots forgot to play football? Yeah, they were not the only team who forgot. The Dallas Cowboys don't understand how to manage time. So yeah, it was a fun weekend. Oh, did you see the Bengals game at all? I did. So did you see the drone that flew into the stadium?

00:01:43

Speaker

Oh, no, I missed that. So, a Mavic 2. I know we talked about this before because you had bought one a while back for Hawaii, right? But there are TFRs, temporary flight restrictions, from, you know, an hour before games to an hour after, and you're not allowed to fly in there. Okay. One just flew into the stadium and, like, dropped down right into the, like... Oh, wow. Look it up on YouTube. Just look up Cincinnati Bengals, Mavic 2, and this guy just flies around.

00:02:11

Speaker

Totally illegal in, like, several different ways. He probably doesn't care or doesn't know. I don't know, I can't speak for him. But I was just like, you know, holy moly, it's just right there on the field, like 30 feet above the players. Could be a $20,000 fine, basically, but we'll see what happens. I caught that and I was like, what is... Oh, look at that. Okay.

00:02:35

Speaker

How are you? What were you up to? Yeah, I visited family. We didn't get to go see family for the holidays because of Omicron. And so we delayed it till this past weekend and used the extended weekend to

00:02:54

Speaker

basically stay down there a little bit longer, although we had to come back because of the storm.

Cloud-Native News Discussion

00:02:58

Speaker

Um, although like, you know, visiting those places in Boston probably would have been not so great because it was terrible weather out. Um, but yeah, it was fun. We got to see a lot of family and catch up, which is always a blessing these days. I feel like.

00:03:13

Speaker

And we are back to work, and there's a lot going on in the industry. I know last episode we had a little bit of a light news week. We have more news to talk about. Why don't you kick it off? Yeah, sure. So, if you were still paying attention around the Christmas timeframe, everybody was talking about Log4j and the vulnerability, the zero-day vulnerability that was announced, and how everybody was trying to catch up. Now, I think we are

00:03:41

Speaker

past the point where it was the biggest thing. One of the blogs that I follow, Aqua Security, has a nice recap or overview blog of what the vulnerability was and what it affected. It was a critical zero-day vulnerability that was released. Everybody panicked. There were quick fixes, but with every fix and more testing, they found some new vulnerabilities with varying severity levels. Basically, I think now they have a patch

00:04:10

Speaker

that can be applied which doesn't have anything at a 7, 9, or 10 on the severity rating. So that's a good blog if you want to catch up. Like, if you were living under a rock and didn't pay attention to Log4j, this is a good recap to bring you up to speed and understand some of those memes that are going around out there.

00:04:29

Speaker

Next, I think as a follow-up to the Kubernetes 1.23 release, IPv6 support was introduced. Amazon EKS is just following the release cycle and now they have introduced IPv6 support for your EKS clusters. So it will use the same VPC-CNI Kubernetes plugin in your clusters and you can configure your EKS clusters to either use IPv4 or IPv6, but not both at the same time.

00:04:57

Speaker

The only caveat is if you already have an EKS cluster and you want to start using IPv6, you can't do that. You'll have to start by creating a new cluster and choosing IPv6 as your networking mode. So just keep that in mind. We will have a link to the announcement blog post, which has all those gotchas in place. But yeah, those were the two things that I wanted to discuss today.

00:05:18

Speaker

Great pieces of news, great pieces of news. I will start off with the GigaOM data protection report version two. So for those of you who have seen the GigaOM reports for Kubernetes storage and Kubernetes data protection, this is version two of the Kubernetes data protection report, which really focuses on

00:05:42

Speaker

the backup and restore, migration, and disaster recovery capabilities of various vendors in the market. It really summarizes these market categories, looks at the key criteria, and gives you a comparison. The GigaOM radar is sort of a radar chart depicting which players in the market are platform players, innovators, or fast movers. It's really a valuable report. Version 2 is out there, and we will link to it

00:06:11

Speaker

in the show notes. You do need to be a subscriber to get this information, although a lot of the vendors who are on it often allow you to view it through, you know, their email paywalls, whatever it may be. So there are ways to get it. Definitely do go take a look at it. I think it's a really valuable piece of news that

00:06:33

Speaker

you know, honestly dives into why it matters to have these capabilities specific to Kubernetes versus traditional approaches. So definitely go take a look. And then the other one, on a less serious note, I mostly included because it's just such a great name: Nubenetes.

00:06:54

Speaker

I really can't vouch for it at all, but it is just a curation of references for microservices, DevOps, CI/CD, Kubernetes, and the main page of it has just these cars in a container. It's just such a great picture and it's a great name.

00:07:14

Speaker

So, you know, if you're a contributor to Nubenetes and a listener, love the name. And for our listeners,

Running Databases on Kubernetes

00:07:22

Speaker

definitely go check it out. There's a lot of information on there. Also available at, what's the other name? awesome-kubernetes.readthedocs.io. Less fun to say, yeah.

00:07:33

Speaker

And as a lead-in to our topic today on distributed databases and an intro to how you run these on Kubernetes, we'll put an article from... Where is it? ContainIQ? Yes, ContainIQ, which is, should you run a database on Kubernetes? And I think this is a perfect segue because

00:07:54

Speaker

A lot of you may be asking yourselves the same question, or as a team, you may be asking yourselves the same question of, should I be running databases on Kubernetes? And if it's a distributed database,

00:08:07

Speaker

Even more so, you can ask yourself the same question, but what are the options for running databases? And this article goes into, I think, a really good high-level view of the difference between self-managed databases, mostly administered by DBAs versus managed databases.

00:08:24

Speaker

solutions like Amazon RDS or Azure Arc databases, Google Cloud, et cetera, and then those that are Kubernetes managed. I would call these cloud native. We had that discussion. That's another podcast. But these are the ones you run directly on Kubernetes, and there are some trade-offs and some benefits. We're going to talk about them a little bit today, especially when looking at why Kubernetes actually does a lot of great things for databases, especially distributed ones.

00:08:54

Speaker

So definitely go take a look at that article. I think it's a really great view into some of the concepts like replication, sharding, and failover that we'll touch on today, but doing your own research and reading is always valuable.

00:09:12

Speaker

So let's dive into it then, as that segue kicks us off into an introduction to distributed databases on Kubernetes. Let's start with why. And I'll ask you that question, Bhavin. Why would you want to do this? It's a Kubernetes podcast, right? We don't need a why. You have to run it on Kubernetes. You have to. You're listening to it. You must.

00:09:38

Speaker

But in all seriousness, there are benefits, and we'll cover these topics point by point. But as Ryan said about that article, there are self-managed database instances, which is how organizations and enterprises have been running databases for a long time.

00:09:56

Speaker

Most of these organizations have a dedicated database admin team, or at least a dedicated DBA. Then people started moving to the cloud and started consuming those managed services. Amazon RDS is one of those examples: you started pointing your application to those database instances and using those connection strings. But that has a disadvantage. Obviously, you get vendor lock-in; you are restricted to that one platform.

00:10:24

Speaker

One of the main advantages of moving it to Kubernetes, again, there are many that we'll discuss, but one of those is it makes it portable. We all know and we have learned through different episodes in this podcast series that Kubernetes, you can run it on any cloud, you can run it on-prem, you can run it in a hybrid cloud topology.

00:10:41

Speaker

And you can deploy the same database in an identical fashion across all of these different Kubernetes distributions. So portability is kind of like the first thing that comes to my mind when we are talking about distributed databases on Kubernetes.

00:10:57

Speaker

Yeah, that's a great point. And I would say, insofar as these traditional databases were run in huge data centers, you know, as these monolithic, huge databases, I think

00:11:14

Speaker

Kubernetes allows you to scale in a different way. And that's definitely a reason why you would come to Kubernetes. It's for various reasons, like you mentioned, but applications, whether they're databases or not, get the benefit of automation and scale.

00:11:32

Speaker

Being able to run a single SQL database is great, and there are a lot of benefits to that, and you can grow over time. But as your business scales, Kubernetes can help you scale with it, especially horizontally. And that sort of leads us to

00:11:50

Speaker

The question of what databases are we going to be talking about today? You know, specifically this one is about distributed databases, so we're going to sort of talk about what is a distributed database. I think at the most basic level, right, it's a database that communicates as a cluster and provides things like replication, fault tolerance, sharding of data across them. What would you add to that?

00:12:16

Speaker

Yeah, so for me, for a distributed database, I think I have like four things. One is resiliency, or another way to look at it is disposability. So your database solution, whatever you choose, should be able to handle disruptions, especially when you're running it on public clouds or on infrastructure that doesn't have a 100% SLA. And I think

00:12:45

Speaker

And whenever I think about SLAs and cloud infrastructure, Werner Vogels' quote comes to mind: everything fails all the time. So when you're developing an application and you're choosing a database, you should select a solution that can handle these disruptions. Even in Kubernetes, pods are mortal by design. They can go down. The beauty of Kubernetes is that because you have a desired state,

00:13:07

Speaker

Kubernetes will spin up new pods that will replace your older pods, but pods are meant to go away. So you should have a database solution which is distributed in nature and can handle these disruptions.

00:13:20

Speaker

I don't think, sorry, go ahead. Yeah, to the point, these disruptions tie directly into CAP theorem, right? If you're in this field, CAP theorem is consistency, availability, and partition tolerance. We'll talk about that when we dive into a couple of the examples, but distributed databases definitely aid in tuning your needs for consistency versus availability, and Kubernetes does a lot of great things there too.

00:13:47

Speaker

Yep. And then talking about distributed databases, right? How can you know if it's a distributed database? You have to eliminate all single points of failure. So using that shared nothing architecture and making sure that even if any node in your cluster goes down or in your ring goes down,

00:14:05

Speaker

your database is still up and running. You can still access it, get the same data back. So there has to be some sort of consistency. If you go for strong consistency, that might have some challenges associated with it. But with modern distributed databases, you have what's called a consensus consistency. So before returning any writes, a majority of nodes should agree that that's the right data that's being written.
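The majority rule described here can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular database: the function names are made up for this sketch, and real consensus protocols involve much more than counting acknowledgments.

```python
def majority_quorum(replicas: int) -> int:
    """Smallest number of nodes that forms a strict majority of `replicas`."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def write_is_committed(acks: int, replicas: int) -> bool:
    """A write counts as committed once a majority of replicas acknowledged it."""
    return acks >= majority_quorum(replicas)


def tolerated_failures(replicas: int) -> int:
    """How many nodes can be down while a majority can still be reached."""
    return replicas - majority_quorum(replicas)


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(f"{n} replicas: quorum={majority_quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this rule, an even replica count tolerates no more failures than the odd count just below it (four replicas tolerate one failure, the same as three), which is one reason odd cluster sizes are a common choice.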

00:14:30

Speaker

So that's another thing that you should keep in mind. It should be shared nothing, eliminating any single points of failure. Yep, those are all great points. And I think with that, let's give a few examples. So as a whole, we have a list here. Some examples include Apache Cassandra, Elasticsearch, Kafka, Mongo, which can be deployed as a single or multiple

00:14:57

Speaker

nodes. There may be an argument there that it may not be distributed. CockroachDB. We'll be kind of diving into a little bit of Cassandra on Kubernetes, Elasticsearch, and I think you're going to, you know, jump into Cockroach a little bit. But those are some really good examples, and I would say, you know, these run really well on Kubernetes, and we're going to talk about why that is. You know, a lot of these were designed

00:15:23

Speaker

prior to Kubernetes ever being a thing. And they still run distributed because they could be deployed on individual nodes across VMs, but containers make things more agile, easier to scale out, faster, those kinds of things. So let's get into it. Cassandra has a distributed architecture like the one that Bhavin was just talking about. It's tailored for multi-data center deployments. It has a lot of redundancy and failover built in.

00:15:53

Speaker

It has some disaster recovery capabilities to sync data across those clusters, and all of these make it a really good fit for Kubernetes. And that's because Kubernetes really provides you a lot of the advantages around the automation and scaling and monitoring of these databases.

00:16:18

Speaker

When we talk about what Kubernetes does for databases, I think we got to take maybe a step back and we'll use Cassandra as the example first. But let's define some of the challenges, right?

00:16:34

Speaker

Running a database as a single container on Kubernetes is fairly straightforward. We had an episode where we talked about which objects use persistent storage. So a single database is going to use something like a Deployment, it's going to have a persistent volume, and it's going to be sort of a one-off thing, right? That volume can obviously have a lot of data services applied to it: you can take snapshots of that volume, it could have its own replication.

00:16:59

Speaker

But when we are discussing distributed databases, there's more than one individual node in that database, because it has sort of that shared-nothing architecture and can manage its own fault tolerance and share data across portions of that.

00:17:21

Speaker

And when you are deploying multiple nodes of something, like multiple Cassandra nodes, you may have a three node Cassandra cluster, or a five node Cassandra cluster, or a 300 node Cassandra ring, to be honest. It is very important to have certain aspects of that deployment built into the operations, such as the order matters during bootstrap,

00:17:48

Speaker

So when you install and deploy Cassandra, you need to be able to tell Cassandra nodes which is the bootstrap node and how to connect to the other nodes, because they're clustering; they're talking to each other.

00:18:02

Speaker

So ordering and the identity of these things matter. And the identity brings in sort of, it has to be a unique identity. And doing this with the constructs that we discussed in one of the other episodes alone is actually quite hard. You wouldn't want to have a 15 node Cassandra cluster and have a single deployment for every node

00:18:27

Speaker

and then deploy one after the other by itself and configure each one and have to manage all those configurations. I've seen it done. But you wouldn't want to do it, right? I think. And that's where things like StatefulSets and operators come into play.
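To make the ordering and identity point concrete, here is a small Python sketch of the naming scheme a StatefulSet gives its pods: each pod is named `<statefulset>-<ordinal>`, ordinals start at 0 by default, and with a headless Service each pod gets a stable DNS name. The StatefulSet name `cassandra`, the Service name, the namespace `databases`, and the idea of using the first two pods as seeds are assumptions chosen only for illustration.

```python
def pod_names(statefulset: str, replicas: int) -> list[str]:
    """Pod names in the order a StatefulSet creates them by default (ordinal 0 first)."""
    return [f"{statefulset}-{ordinal}" for ordinal in range(replicas)]


def pod_dns(statefulset: str, ordinal: int, service: str, namespace: str,
            cluster_domain: str = "cluster.local") -> str:
    """Stable DNS name of one pod behind the StatefulSet's headless Service."""
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"


def seed_list(statefulset: str, service: str, namespace: str, seeds: int = 2) -> str:
    """Comma-separated addresses of the first `seeds` pods, usable as bootstrap contacts."""
    return ",".join(pod_dns(statefulset, i, service, namespace) for i in range(seeds))


if __name__ == "__main__":
    # Hypothetical names: a StatefulSet and headless Service both called "cassandra".
    print(pod_names("cassandra", 3))
    print(seed_list("cassandra", "cassandra", "databases"))
```

Because these names survive pod restarts and rescheduling, the other nodes can keep pointing at the same bootstrap addresses, which is the hard part of doing this with one Deployment per node.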

00:18:44

Speaker

I love operators. Let me just go down this rabbit hole. So again, StatefulSets and Deployments, all of those Kubernetes objects are great, but once operators were introduced in the Kubernetes ecosystem, that's when things changed in a drastic fashion.

00:19:02

Speaker

All of these databases can have their own individual operators. And if you just do a basic Google search, you will realize that all the different data services that Ryan listed a few minutes back have multiple operators that can be used to deploy databases on Kubernetes.

00:19:19

Speaker

So deploying an operator isn't difficult, it's just a single one-line command. But having that operator already installed on your Kubernetes cluster allows you to use custom resources and CRDs, custom resource definitions, to define and customize how your database should actually be deployed on Kubernetes.

Persistent Storage and Databases

00:19:37

Speaker

So it includes things like

00:19:39

Speaker

the number of nodes that you want, the amount of CPU and memory and storage that you need for that database instance. Based on the operator that you choose, you can specify additional things like do you want encryption at rest or in transit? Do you want

00:19:55

Speaker

backup and restore functionality? So all of these operators have their own CRDs, or custom resource definitions, that you should take a look at and then use to deploy databases on Kubernetes. So once you have an operator installed and you have a YAML file which has a spec for that database instance, you can apply it against any Kubernetes cluster

00:20:19

Speaker

and it will deploy a multi-node distributed database for you on Kubernetes with the specific configurations that help you run databases.
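As a rough picture of what such a spec carries, the following Python sketch builds a manifest for a custom resource and prints it as JSON. Everything specific here is hypothetical: the API group `example.com/v1alpha1`, the kind `DistributedDatabase`, and every field under `spec` are invented for illustration and do not match the schema of any real operator, so check the CRD of the operator you actually install.

```python
import json


def build_database_resource(name, namespace, nodes, cpu, memory, storage,
                            encryption_at_rest=False, backups_enabled=False):
    """Return a manifest dict for a HYPOTHETICAL 'DistributedDatabase' custom resource."""
    if nodes < 1:
        raise ValueError("a database cluster needs at least one node")
    return {
        "apiVersion": "example.com/v1alpha1",  # made-up API group and version
        "kind": "DistributedDatabase",         # made-up kind
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "nodes": nodes,                                # number of database nodes
            "resources": {"cpu": cpu, "memory": memory},   # per-node compute
            "storage": {"size": storage},                  # per-node volume size
            "encryption": {"atRest": encryption_at_rest},
            "backup": {"enabled": backups_enabled},
        },
    }


if __name__ == "__main__":
    manifest = build_database_resource(
        "orders-db", "databases", nodes=3, cpu="2", memory="8Gi",
        storage="100Gi", encryption_at_rest=True,
    )
    print(json.dumps(manifest, indent=2))
```

The point of the sketch is the shape: one small declarative document holds the node count, resources, and data service options, and the operator turns it into the underlying Kubernetes objects.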

00:20:32

Speaker

Yeah, operators have definitely been an evolution to where we currently are. I would say most operators will do the hard parts of deploying those StatefulSets or Deployments for you, which is super valuable. If you want more control and you want to design something yourself, you may go the route of, I want all the control, so therefore I'm going to use the APIs and

00:21:01

Speaker

create those objects myself. But if you're looking for, you know, that turnkey solution that does a lot of things for you but also gives you a bunch of flexibility, you know, you can't go wrong with operators. And there are a lot of them out there. It's probably a whole other podcast, we said that, right, to dive into what operators are out there

00:21:21

Speaker

and that can help you really deploy those databases. Yeah, and you bring up a good point, right? There are many, but then again, if you want to customize, you can do your own thing. It's important to identify those differences. For this podcast, I was doing some research around MongoDB, and they have two different operators. They have an open source operator for MongoDB.

00:21:41

Speaker

They call it the community edition. And then they have an enterprise operator, which again comes with its own bells and whistles. So there are feature differences even between operators from the same organization, from the same database vendor. So there might be some features that are only available in one of them. Again, I was listening to a talk that was from 2020, so this might have changed, but in 2020 they said

00:22:04

Speaker

data-at-rest encryption was only available if you used the enterprise operator and not the community one. So at that point you have to decide whether you want your application to have all of those features, the data services features, or you want a storage solution that can help you get those regardless of the operator that you're using and regardless of the data service that you're using. So that's also another thing to keep in mind.

00:22:27

Speaker

Yes, absolutely. Great point, and actually a really good lead-in to what I was going to talk about next, which is, you know, we talk about operators and StatefulSets and the objects a lot, but let's kind of rewind and say, why would you want a storage platform or solution underneath your databases? And one reason may be, well, I want that thing to do the encryption, just as Bhavin just said. But there's also the aspect of the individual database. So going back to Cassandra real quick,

00:22:55

Speaker

Cassandra has its own availability, meaning that it does its own replication. So you may be saying, well, storage solutions do replication of data as well. So how do these things get along? And why would you want to use them together? So when you're configuring Cassandra as an AP system, availability and fault tolerance or

00:23:22

Speaker

partition tolerance, thank you. You may want the storage subsystem to do more of the availability and configure Cassandra more as a consistent and partition-tolerant system.

00:23:38

Speaker

And those things could be tunable depending on each individual database. So with Cassandra, you can tune the consistency to say, and you can do this with Kafka as well, I want my writes to have this many acknowledgments before we actually consider that write operation done. Obviously, there are trade-offs to having more consistency.
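One widely used rule of thumb for this kind of tunable consistency is that a read is guaranteed to overlap with the latest acknowledged write when the number of write acknowledgments plus the number of replicas consulted on read exceeds the replication factor. The Python sketch below encodes that rule; the function and parameter names are made up for illustration, and it models only the counting, not how Cassandra or Kafka actually implement it.

```python
def quorum(replication_factor: int) -> int:
    """Majority of the replicas holding a given piece of data."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int,
                           read_acks: int) -> bool:
    """True when every read set must overlap every write set (W + R > RF)."""
    for value in (write_acks, read_acks):
        if not 1 <= value <= replication_factor:
            raise ValueError("acks must be between 1 and the replication factor")
    return write_acks + read_acks > replication_factor


if __name__ == "__main__":
    rf = 3
    for w, r in [(1, 1), (quorum(rf), quorum(rf)), (rf, 1)]:
        print(f"RF={rf} W={w} R={r} -> overlap guaranteed: "
              f"{reads_see_latest_write(rf, w, r)}")
```

Raising the number of required acknowledgments buys the overlap guarantee at the cost of waiting on more replicas for each operation.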

00:24:01

Speaker

It would generally mean slower operations, those kind of things. But the point being is that these application level components and tunings and settings can be married with the underlying storage system to give you actually the most flexibility. Not to mention that things like failover can actually be improved because when

00:24:25

Speaker

you are using no persistent storage, such as when you're just using the memtables and the node memory to host the data in something like Cassandra, a data node may fail and you have to recover from that failure. Data is rebuilt over the network. Well, if you're using persistent storage, that volume contains the data and it may have another

00:24:50

Speaker

You might have replication configured for that volume, and therefore it can just reattach and go on its merry way, and therefore your recovery of that particular node can be improved by a large percentage.

00:25:05

Speaker

It all depends on how much data you have in there, but you don't have to recover over the network. You have a cluster that goes from unhealthy to healthy much quicker, rather than unhealthy to rebuilding and staying that way for a little while. Obviously, depending on how large your clusters are and how many brokers in Kafka, or nodes in Cassandra, are happy to serve operations, that recovery may be really important.

00:25:32

Speaker

It's a long way of saying that these two worlds of having storage solutions and distributed databases that can run with or without them, you probably want to figure out what works best for your business. Meaning that do we want a lot of consistency and a lot of availability and we're okay with what performance that gives us? Or do we want something that's super fast and good enough in terms of availability and consistency?

00:26:00

Speaker

Yeah, and you said, right, we'll be able to recover faster. That's definitely important. But the second thing is, if you are running a Cassandra cluster in a public cloud, and you are leveraging multiple regions in that cloud to run individual Cassandra nodes, and one of your nodes goes down, all of that traffic that's being transferred between nodes is going

00:26:27

Speaker

through the inter-region pipe, and you are paying for data ingress and egress. So again, to recover, obviously time is important, but you don't want to be in a situation where you're paying a lot of money just so that you can rebuild your clusters. So having a solution where you can have multiple replicas hosted in the same availability zone, not even the same region, can help you come back online quickly and in a cheap fashion.
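The time and cost argument can be put into a back-of-the-envelope calculation. The Python sketch below is only an estimate under stated assumptions: it treats rebuild time as data size divided by sustained throughput, and transfer cost as data size times a flat per-gigabyte price. The example figures (500 GB, 100 MB/s, 0.02 per GB) are placeholders, not real benchmark results or any provider's pricing.

```python
def rebuild_seconds(data_gb: float, throughput_mb_per_s: float) -> float:
    """Time to stream `data_gb` to a replacement node at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return data_gb * 1000 / throughput_mb_per_s  # 1 GB = 1000 MB (decimal units)


def transfer_cost(data_gb: float, price_per_gb: float) -> float:
    """Cost of moving `data_gb` across a billed network boundary at a flat rate."""
    return data_gb * price_per_gb


if __name__ == "__main__":
    data_gb = 500          # placeholder: data held by the failed node
    throughput = 100       # placeholder: MB/s achievable between regions
    price = 0.02           # placeholder: currency units per GB transferred
    minutes = rebuild_seconds(data_gb, throughput) / 60
    print(f"network rebuild: about {minutes:.0f} minutes, "
          f"transfer cost about {transfer_cost(data_gb, price):.2f}")
    print("reattaching a replicated volume: no cross-region data transfer")
```

Both numbers grow linearly with the amount of data per node, which is why the volume-reattach path matters more as nodes get larger.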

00:26:52

Speaker

Yeah, absolutely. And, you know, I know I was part of a sort of a solution that looked at Kafka specifically. And we looked at, you know, what configuration sort of a matrix between, let's say Kafka has three replicas and a storage solution has three replicas. And let's see what that gives us. And then let's play with those numbers. And we found that, you know, Kafka having two

00:27:16

Speaker

and the storage solution having two replicas actually gave us a really great starting point for broker recovery and rebalancing, in terms of we got seconds of recovery in those scenarios, but it also gave us the Kafka availability, meaning brokers, because they had more than one replica as well, weren't

00:27:39

Speaker

unavailable to serve other operations when one was down, as in the case where you'd have one replica. The trade-off, I think, is you're taking up more storage space because of the 2x2. But the point is that these are reasons to consider

00:28:02

Speaker

these things using persistent storage and Kubernetes. And that's another reason why you definitely want to look at that, right? The thing we haven't even talked about yet is the additional benefits, not just, you know, intra-cluster and within that database, but you get things like snapshots. And snapshots are great. But at the same time, if you're running a distributed database,

00:28:25

Speaker

you have to have a consistency group, right? If you're taking a snapshot of your entire database, you need to snapshot that thing at the same time, so you have sort of a view of your entire distributed database

00:28:38

Speaker

or partition at a very specific moment in time. So group snapshots are something very hard to do without a data management subsystem to do that for you. And we didn't even get into backup and restore and disaster recovery yet, because I want you to cover some other examples. I know I've talked about Cassandra and Kafka a lot, but why don't you take us into either Mongo or Cockroach a little bit? Yeah, sure.

00:29:05

Speaker

So before we move on, to your point, right, about taking those group snapshots, going back to operators: different operators have different solutions. So if you are using a combination of all of these different data services, so you are deploying Cassandra, Mongo, and CockroachDB on your Kubernetes clusters, individual operators may or may not have backup and recovery or snapshot capabilities built in.

00:29:30

Speaker

So that's also another thing to keep in mind: when you're evaluating solutions, choose something that can be unified, so it's just one solution. If you're learning this for the first time, it's easier to learn because it's one solution compared to 10. So that's another thing to keep in mind. You need these application-consistent snapshots for your distributed database, but how do you do it in the easiest way possible?

Future of Databases on Kubernetes

00:29:54

Speaker

Okay, next, for CockroachDB, I think we already spoke about disposability and shared-nothing, since CockroachDB is a distributed SQL database. Another thing that I wanted to highlight, and we haven't covered it yet, is horizontal scaling. That's one of the main advantages of using distributed applications, because in this case, you're not scaling up, so you're not increasing the size of your nodes, you're not paying for more resources, you're just scaling out.

00:30:23

Speaker

And if you are co-locating or running these different database instances on the same Kubernetes cluster, you can have them namespaced. So each namespace basically has a different database instance running. And if you want to scale out, you can just increase the replica count in your custom resource, or you can manually use kubectl to

00:30:47

Speaker

scale out your distributed database and leverage that horizontal scaling. The benefit of doing it on Kubernetes is that, as soon as you make that change, Kubernetes will automatically start deploying additional pods for your Deployment or StatefulSet object.

00:31:04

Speaker

If you have dynamic provisioning from a storage perspective, that solution will start deploying the persistent volumes to back the new pods that have shown up. So it's easier to scale horizontally when you are running these databases on Kubernetes, and you can manage them using the same set of tools that you use for your other database instances. Sorry, go ahead.
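
As a rough illustration of the scale-out behavior described above, the following Python sketch models which pods and PersistentVolumeClaims a StatefulSet would create when its replica count grows. It relies on the StatefulSet naming scheme (pods are named `<statefulset>-<ordinal>` and claims `<claim-template>-<pod>`). The names `cockroachdb` and `datadir` and the helper functions are hypothetical, and the code does not talk to a real cluster.

```python
from typing import List, Tuple


def pod_names(statefulset: str, replicas: int) -> List[str]:
    """Stable pod identities: <statefulset>-<ordinal>, with ordinals starting at 0."""
    return [f"{statefulset}-{i}" for i in range(replicas)]


def scale_out(statefulset: str, claim_template: str,
              current: int, desired: int) -> List[Tuple[str, str]]:
    """Return the (pod, claim) pairs that would be created, in ordinal order."""
    if desired < current:
        raise ValueError("this sketch only models scaling out")
    new_pods = pod_names(statefulset, desired)[current:]
    # Each new pod gets its own claim, named <claim-template>-<pod>.
    return [(pod, f"{claim_template}-{pod}") for pod in new_pods]


if __name__ == "__main__":
    # Hypothetical example: grow a three-node database to five nodes.
    for pod, claim in scale_out("cockroachdb", "datadir", current=3, desired=5):
        print(f"create pod {pod} backed by claim {claim}")
```

On a real cluster the equivalent change is usually made by raising the replica count in the operator's custom resource, or with something like `kubectl scale statefulset <name> --replicas=<n>`. With dynamic provisioning, each new claim is then bound to a freshly provisioned volume.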

00:31:30

Speaker

I was just going to agree with you that the Kubernetes APIs and platform give so much automation benefit to running these types of databases. It's not like we're saying these databases can't be run in VMs, but there's a lot more touch points

00:31:49

Speaker

as a database administrator or a DevOps team. And with Kubernetes and some of the automation, especially some of the autopilot features in some of these platforms, you get a lot of this done in a very cloud native and automated way.

00:32:05

Speaker

That's true. And then the same goes for MongoDB. Again, it's a document database, but it still has that distributed architecture and it uses the same horizontal scaling capabilities. You can add more worker nodes, you can add more shards. It also allows you to define how sharding should be done and optimize those patterns to suit the application that's using MongoDB as the backend database.

00:32:34

Speaker

Yeah, so I mean, we've already spent a ton of time talking about these four databases. And I feel like we've just scratched the surface. So I think one of the main points that we definitely want to get across is that, you know,

00:32:50

Speaker

distributed databases are really going to be beneficial when run on Kubernetes because of things like automation, the integration with CSI and data service providers, which give you things like snapshots, backup and restore, and disaster recovery, but also

00:33:11

Speaker

the capabilities to use StatefulSets and operators, operators being able to automate a lot of these things for you, even do things like manage your persistent volumes, manage your backups if you so choose, and much more, but also the core aspects of Kubernetes, giving you, you know,

00:33:32

Speaker

the API objects to give you the ordering, unique identity, persistent storage, the consistency factors, and also to really stop and think about what the needs are of the data that's within these applications.

00:33:50

Speaker

Whether you're using Elasticsearch or Cassandra or Kafka or Mongo or CockroachDB for your applications, definitely stop to look at how that individual database handles availability and consistency. Then also, if you're using a storage provider, look at what it can provide you for replication and availability as well, because these things play really well together, and when you combine them, you can have a really nice solution.

00:34:17

Speaker

Yeah, and just to wrap up, as we were getting ready for this podcast episode, I was going and looking at these vendors' websites and all the recent webinars that they're doing. If you look at those trends, they are moving from a traditional VM-based deployment to

00:34:35

Speaker

building their own operators and providing these services on Kubernetes, because there is a shift happening: because of the way Kubernetes is built and the features that it provides, these databases do get an inherent benefit from running on top of your Kubernetes cluster. So you will see that migration eventually from both sides, like database vendors talking about it and then more of the

Episode Wrap-Up

00:34:59

Speaker

Kubernetes platform players also talking about how you can bring more databases or distributed databases or data services on top of Kubernetes. So I think that's going to be like the year of 2022. I think I'm pretty excited. I want to ramp up more on these individual data services and learn more about them.

00:35:18

Speaker

2022, the year of operators for databases, databases as a service, all the stateful things on Kubernetes. I didn't want to call it, because for so long we have been waiting for the year of VDI, and I think COVID was the year of VDI, so I didn't want to start with, okay, 2022, year of distributed databases on Kubernetes. I like it. I like it.

00:35:43

Speaker

Well, I feel like we've barely done the topic justice, but hopefully you've got something out of that on maybe why you would want to run a distributed database on Kubernetes, what Kubernetes can do for distributed databases and when combined with persistent storage and things like that, what you get out of it.

00:36:04

Speaker

We'll put a bunch of the links that we mentioned in the show notes for this episode. And remember, this is Season 2, Episode 2. We will be talking about VMware again in the next episode. This time, we'll be focusing on storage interest groups, how to contribute, and what the current landscape for contributing looks like as it pertains to Tanzu and VMware.

00:36:32

Speaker

Really excited about that one. We'll have a guest on the show as well. As always, we encourage you to send us a message on Anchor or wherever you can review podcasts. And this will bring us to the end of today's episode. I'm Ryan. I'm Bhavin. And thanks for joining another episode of Kubernetes Bytes. Thank you for listening to the Kubernetes Bytes podcast.