Kubernetes Observability 101

S3 E15 · Kubernetes Bytes
2.1k Plays · 1 year ago

In this episode of Kubernetes Bytes, Ryan and Bhavin go back to school after the summer break and ask: what is Kubernetes observability? They discuss how observability differs from monitoring, what the three pillars of observability are, and which CNCF projects listeners can check out to get started with Kubernetes observability.

Join the Kubernetes Bytes slack using: https://bit.ly/k8sbytes 

Ready to shop better hydration, use my special link https://zen.ai/apaSnaIFOuee5jScqZ28a03tKKvQiqkyz8mtm9wipoE to save 20% off anything you order.

Chapters:

  • 00:30 Introduction
  • 06:25 Cloud Native News
  • 17:57 What is Kubernetes Observability

Cloud Native News:

  1. https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/ - Planternetes
  2. https://techcrunch.com/2023/08/09/sweet-security-raises-12m-seed-round-for-its-cloud-security-suite/
  3. https://www.dynatrace.com/news/press-release/dynatrace-to-acquire-rookout/
  4. https://www.businesswire.com/news/home/20230725088248/en/
  5. https://www.forbes.com/sites/janakirammsv/2023/07/31/kubeflow-joins-cncf-to-accelerate-the-adoption-of-mlops/?sh=6495358e6e75
  6. https://finance.yahoo.com/news/edb-announces-three-ways-run-130000473.html
  7. https://www.prnewswire.com/news-releases/portworx-by-pure-storage-recognized-as-a-leader-in-kubernetes-storage-by-gigaom-for-fourth-consecutive-year-301889796.html
  8. https://blocksandfiles.com/2023/08/15/nutanix-puts-chatgpt-in-a-box/
  9. https://venturebeat.com/ai/middleware-raises-6-5m-to-simplify-cloud-monitoring-with-ai/
  10. https://thenewstack.io/aqua-security-uncovers-major-kubernetes-attacks/   

Show Links: 

  1. https://signoz.io/blog/kubernetes-observability
  2. https://landscape.cncf.io/card-mode?category=monitoring&grouping=category
  3. https://www.linuxfoundation.org/webinars/kubernetes-observability-with-opentelemetry-and-beyond
Transcript

Introduction to Kubernetes Bytes

00:00:03
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Wallner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts. We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:00:29
Speaker
Good morning, good afternoon, and good evening wherever you are. We're coming to you from Boston, Massachusetts.

Podcasting Routine: Post-Summer Break

00:00:35
Speaker
Today is August 16, 2023. I hope everyone is doing well and staying safe. Let's dive

Bhavin's Banff Adventures

00:00:42
Speaker
into it. So first episode back since our little summer break. That was enjoyable. I hope you enjoyed it too, Bhavin. Yeah, it was a good break. I'm still like, while I was prepping for this episode, I was like, how do we do this again?
00:00:58
Speaker
It was just like one episode that we skipped. It's not muscle memory by now. It kind of is, but like this ended up being like a 101 episode, right? More pressure on you. Not just me, like us. Just figure out like this needs more research and more work because we're not just coming up with questions and understanding the topic, but we're also
00:01:22
Speaker
trying to be the experts to talk about it. So fun times. But yeah, I had a good summer break. I know we spoke about like where I was headed to, went to Banff National Park over in Alberta, Canada.
00:01:36
Speaker
I had to put that in there. The most terrible Canadian accent. It was fun. I didn't expect it to be very touristy, but it was. It ended up being a long weekend in Canada, the weekend I was there. That probably didn't help.
00:01:52
Speaker
Yeah, didn't help. Overall, it was fun. National Park trips are always like wake up like 5am in the morning, go out there to find parking spots for hikes, do day hikes, way to avoid the crowds as well. It's so pretty dude. It's awesome. I really do want to make it there someday soon. I've heard a lot of good things now. I'm jealous that you like that. I know. It's amazing. What was your favorite thing that you did there?
00:02:20
Speaker
A couple of hikes that were good, but then I think going to Lake Moraine in the morning, just after sunrise, getting that golden light, it was just amazing. I can see the reason why they had it printed on their $10 or $20 note for the longest period. I don't know. It was amazing. It was serene.
00:02:42
Speaker
Did you bring a nice camera? Do you have one that you can capture those moments with? Yeah, I do have a Canon EOS R, so I can take some pictures. I don't call myself as an expert when it comes to taking photos. I will save the memory for you. Yeah, better than an iPhone photo. I love my iPhone, but I realize that when I go to these national parks and take pictures, and eventually, if I want to blow them up and put frames of it, clearly,
00:03:10
Speaker
I realized that the iPhone pictures when you blow it up like become really blurry. That's why I was like, okay, I need to buy a camera that at least can help me capture. Yeah, you can have a little more control over all the settings and everything and really get the picture you want.
00:03:24
Speaker
You have to be into it though. I feel like, you know, if you're not, if you don't enjoy that process of capturing an image or like messing with camera settings, it's just, you know, you probably just won't do it. Yeah. It's fine too. Cause then you just keep it in your head and that counts as well. I think. True. How was your break?

Ryan's Dirt Bike Mishap

00:03:42
Speaker
It was pretty good. Um, I have this, I won't hold up my finger for those listening, because it's my middle finger. Oh, nice. But I injured my middle finger on a dirt bike trip.
00:03:52
Speaker
Oh, well, changing a tire, wasn't even riding. So I tore an extensor tendon on top of my finger, which basically just makes your finger just like permanently move forward like that. So you have to wear this for six weeks. And a little suture
00:04:12
Speaker
action there. Wow. The nice part is it's it's it's a wet cast so I can take it on and off and clean it and those kind of things so like there's a lot worse things that can happen and it's like you know you want your fingers to work properly but it still feels like kind of a a wussy injury you know.
00:04:30
Speaker
But I found out I had the issue on the trail. So the guy I was with made a splint out of two butter knives that we had. Nice. So I was riding back to Urgent Care with butter knives on my fingers, just like Edward Scissorhands running down the road. I quickly replaced that once I reached a civilization that had a CVS. Yeah.
00:04:55
Speaker
I'm glad that was the only injury on your trip. Like it was a week long thing, right? Exactly. Yeah. It was still a lot of fun. We had really good weather, which always helps. It's been raining a ton. So we were a little nervous, but did rain one night and thunderstorm overnight. I don't know if you've ever been like camping in a tent when it's just like pouring rain and thunderstorm, but
00:05:12
Speaker
I find it very peaceful, which is, I don't know, maybe different than what people think, but I slept like a baby. Not that a baby would sleep very well on a tent, but the point is, I slept very well during the rainstorm, saw a bear jump out in front of the tent. Oh, wow. Okay. Yeah, it was a whole thing. Nice. Yeah, if you're into that kind of thing.
00:05:31
Speaker
I didn't see a bear in Banff National Park. You saw one on your dirt bike trip. Come on, man. In Western New York. It was a small black bear, probably different than the one that's in. It counts, dude. Like, anything counts. Yeah, so otherwise I enjoyed it. But yeah, now we're diving back into the podcast, which is, it's nice to be back. I don't know.
00:05:53
Speaker
We'll see.

Kubernetes Observability Overview

00:05:54
Speaker
So, um, today's topic is, is, uh, presented by Bhavin and I, um, we'll have some, we have some guests lined up coming forward, but we're kind of easing our way back into it. We're going to be talking about Kubernetes observability specifically from sort of that high level point of view, these episodes that we do to kind of give an overview, sort of a 101 view of, of the market and some use cases and those kinds of things. But we'll dive into some things, ask each other some questions. Uh, but before we do that,
00:06:20
Speaker
We'll have some news. What rounds do you have for us today? No, like news, right? It's been a while, so we do have a few articles that we can share.

What's New in Kubernetes 1.28?

00:06:30
Speaker
Starting with Kubernetes 1.28. I know we really liked 1.27 and Chill Vibes, but now we have a new contender, Planternetes. Planternetes? Planternetes. Yeah. I think the release lead was a fan of indoor plants. But yeah, fun name. It just came out yesterday on August...
00:06:49
Speaker
On time, I heard too. Yeah, which is awesome. I'm just glad that this is like they now have like three releases a year instead of four. So there is time for people to like plan for upgrades and actually upgrade things and more enhancements and things like that. So this one has 45 different enhancements, 19 are entering alpha, 14 graduated to beta and 12 have graduated to stable. There have been some removals and deprecations, things like
00:07:16
Speaker
I think the Ceph RBD and CephFS in-tree CSI plugins are deprecated. There are a lot of enhancements and, to be honest, I didn't read through all of those. There was only one that was kind of related to storage around like if you have a default storage class set after
00:07:34
Speaker
from a day-two perspective, it will take effect as well. That feature was already there in beta, now it has been graduated, and that's the only enhancement I had to talk about. Is that the recovery one from non-graceful shutdown, or is that a different one? That's a different one. That one sort of relates to storage too, because it kind of affects how things, PVCs get detached and things like that. But that one's GA as well. Yeah, it is. Other good news.
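
A minimal sketch of the storage class behavior described above, written as a toy Python model rather than anything taken from the Kubernetes code base. The `Claim` class and `assign_default_class` function are hypothetical names. The sketch assumes that only claims which left the class field unset pick up a default that is configured later, while claims that explicitly asked for no class (`""`) or named a class are left alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """Toy stand-in for a PersistentVolumeClaim (not the real API object)."""
    name: str
    # None means the field was left unset; "" means "explicitly no class".
    storage_class: Optional[str] = None


def assign_default_class(claims: list[Claim], default_class: Optional[str]) -> list[str]:
    """Give the default class to claims that left the field unset.

    Returns the names of the claims that were updated.
    """
    if default_class is None:
        return []  # no default class exists yet, so nothing changes
    updated = []
    for claim in claims:
        if claim.storage_class is None:
            claim.storage_class = default_class
            updated.append(claim.name)
    return updated


claims = [
    Claim("db-data"),
    Claim("cache", storage_class=""),
    Claim("logs", storage_class="fast"),
]

# Day one: no default class is configured, so the unset claim is not touched.
print(assign_default_class(claims, None))        # []
# Day two: an admin marks "standard" as default; only "db-data" picks it up.
print(assign_default_class(claims, "standard"))  # ['db-data']
```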
00:08:08
Speaker
And then I think going back to like funding rounds and acquisition section,

Sweet Security's Seed Funding

00:08:11
Speaker
right? Like Sweet Security. Again, that's the name of the company. Like today we have companies that have some awesome names. Sweet Security raises $12 million in seed funding. So they are building a cloud native security solution based out of Israel. The co-founders had worked for the Israel Defense Forces, which I think at some point in your career, everybody in Israel has to go through the military training.
00:08:23
Speaker
I was going to ask you if you had your, uh, like, you know, top picks from enhancements, but I won't, I won't. Thank you.
00:08:33
Speaker
exercise, but they are building solutions for cloud native, heavily relying on eBPF to monitor traffic and then building. I think they want to use this funding to grow out the team and then help with automatic remediation. So there are so many tools that give you what's wrong with your cluster, but these guys want to take it a step further and like, okay, this is how we'll fix it for you. Like just give us some permissions and then we'll fix it. It's still in early stages, right? It's still a seed round. So
00:08:58
Speaker
They're still looking to find that product-market fit, but we have another vendor in the already crowded cloud native security ecosystem. Yeah, that space in the CNCF landscape is huge. I mean, I think the only one bigger is like the actual Kubernetes providers themselves. But the security ones, there's just like you could definitely get lost in there for sure.
00:09:18
Speaker
Yeah. And everybody has like an open source version. I know we spoke to Mondoo. They have cnspec. We spoke to the ARMO guys. They have Kubescape. Aqua Security has Trivy. Like all these vendors have an open source project and then they have an enterprise or a paid version as well. So. I think it shows kind of where, you know, where the market is too and where people need, you know, these kinds of things. Yeah. Yeah. I mean, like a lot of, a lot of companies, a lot of startups kind of are tackling that space because there's a need, right? Obviously I think we've covered that numerous times on the show as well. So I think it's cool.
00:09:48
Speaker
Yeah. And like, let's talk about the acquisition that you found, which also has an interesting name, which is also sort of tied ish to somewhat, I guess you could kind of tie it to security a little bit. But yeah, Middleware raised $6.5 million in seed funding. But they're going at the observability and AI space. So like all the things we'll maybe talk about today in terms of observability that are important, they're kind of tying that to generative AI
00:10:19
Speaker
And kind of making it, you know, the realization that it's very complex. And there's a lot of things moving, the space is really scaling, and these clusters can be quite complex.
00:10:31
Speaker
AI part of it. I think we're seeing a lot of smaller companies come out with the AI tie and that kind of rhymes, but we'll see lots more of it, I'm sure. I'm just amazed that they call themselves Middleware. I think Sweet Security wins today. Middleware might be at the bottom.
00:10:54
Speaker
But the middleware folks got the domain middleware.io. Like, okay, that's awesome. Yeah, it's a very common word. I mean, does that make it tougher to have good SEO or bad? Oh, that's true. We'll find out when they reach series A how good they are. Yeah, exciting times though.
00:11:15
Speaker
Okay, moving on, I think an acquisition.

Dynatrace Acquires Rookout

00:11:20
Speaker
So Dynatrace acquired a startup called Rookout to add to their observability scenario. I don't know, for some reason, we decided on this topic and then everything I see around in the ecosystem has something to do with observability. So I think that will help Dynatrace build out their portfolio and go after Cloud Native's customers as well. So when you buy a car, you see it everywhere? Yeah, yeah, that's true.
00:11:41
Speaker
Or when you are thinking about buying a car, like I drove a BMW X3 in Canada. And then, yeah, there's a whole story behind it why that happened. But now in Boston, like, oh, BMW X3, X3, I didn't know there was so many X3s around.
00:12:00
Speaker
I don't know much about them. I was looking at them the other day on the observability space as well. Yeah, I think we need a deep dive on what each of these vendors do. Obviously, there will be some overlap, but yeah, it needs to be some form of... You want your brain to explode, then get a deep dive on any one of these companies. True.
00:12:22
Speaker
Nice. And then a couple of just small updates. Akuity, the Argo CD, enterprise Argo CD company, they have now launched an AI assistant to help troubleshoot common Kubernetes deployment issues. So they are obviously helping customers build the CD part of their pipelines. They saw a few common troubleshooting issues and then let's have an AI chatbot or an AI assistant that can help resolve those. So if you're already using Akuity, I think something that you can check out.
00:12:51
Speaker
I guess, AI tie. Yeah, another AI tie. What comes to mind is like the memes of Oprah just being like, you get one, you get one, you get one.
00:13:03
Speaker
Agreed. And then, OK, talking about AI, let's also talk about the predecessor to AI, like ML and MLOps, which was a buzz before GenAI, I guess. Kubeflow, the project to run machine learning workloads on Kubernetes, is now accepted as an incubating project into CNCF. So I think it was managed. I think the biggest contributor to that project was Google Cloud. And now they have officially submitted it to CNCF, similar to what they have done with Istio and other projects. So adding another.
00:13:32
Speaker
I think this is the 38th project inside CNCF. So the landscape just keeps increasing or the projects that CNCF manages keeps increasing. Absolutely.
00:13:42
Speaker
And then finally, our friends at EDB announced new services from a managed database service perspective with Google Cloud. So they have something called, let me look at my notes because it was a different name. So yeah, fully managed database as a service called EDB BigAnimal on GKE Standard.
00:14:03
Speaker
And then they have a community version like EDB Community 360 Postgres on GKE Autopilot and GKE Standard. All of these are available on Google Cloud as of yesterday. So if you are using EDB on Google Cloud, you have new services to check out. BigAnimal. Yeah.
00:14:19
Speaker
But that's it for the news, like cloud native news for me, Ryan. Nice. Now the naming today, Sweet Security and BigAnimal. I mean, I dig it, I think. So I already talked about the Middleware one for me. I just have a couple more. One is Nutanix, who we've talked about on this podcast and prior. They have an article here about putting ChatGPT in a box, or their GPT-in-a-Box platform. So another AI tie.
00:14:47
Speaker
I looked at the market picture or a higher level diagram. It was just like, okay, they can run Kubernetes on their HCI stack, and now they can help you deploy your models on that stack, and that's it. That's GPT in a box for you. GPT in a box tied to their GPU acceleration stuff with Karbon. It ties in well for them, I think. I think, again, we'll see a lot of this type of thing going forward as a lot of people are asking for it.
00:15:17
Speaker
Uh, the other one was around Aqua, uh, another security tie, not an AI tie. Um, but really just talking about sort of the default, um, configurations again, clusters being kind of put out there on the internet with things like, you know, accepting, um, all IPs and from anywhere to access those kinds of things. Um, and again, I think it just goes to show you, you know, Bhavin, you mentioned before,
00:15:45
Speaker
the security CNCF landscape is huge. But constantly, we get articles like this every week about many, many companies falling prey to these things that can be misconfigured. And these tools, I think, are going to be something that everyone has to take seriously very quickly.
00:16:10
Speaker
And regardless of what tool you are using, please don't allow 0.0.0.0 for incoming traffic. That's a bad idea. Regardless of whether you are using Kubernetes or not using Kubernetes, that's a big no-no. Yeah, for certain things, especially like the Kubernetes API itself. But anyway, those are all the news articles, I think, between us for this week. And we can jump
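
A minimal sketch of how a check for that "open to everyone" setting could look, using Python's standard `ipaddress` module. The function names and the sample allow-list are hypothetical, and the sketch assumes the allowed sources are expressed as CIDR strings, where `0.0.0.0/0` (or `::/0` for IPv6) means any address.

```python
import ipaddress


def is_open_to_world(cidr: str) -> bool:
    """True when the CIDR covers every address (0.0.0.0/0 or ::/0)."""
    network = ipaddress.ip_network(cidr, strict=False)
    return network.prefixlen == 0


def risky_sources(allowed_cidrs: list[str]) -> list[str]:
    """Return the entries of an allow-list that admit traffic from anywhere."""
    return [cidr for cidr in allowed_cidrs if is_open_to_world(cidr)]


# Illustrative allow-list for an API endpoint.
allowed = ["10.0.0.0/8", "0.0.0.0/0", "203.0.113.7/32", "::/0"]
print(risky_sources(allowed))  # ['0.0.0.0/0', '::/0']
```

A check like this only flags the widest possible range. Whether a narrower range is acceptable still depends on what the endpoint exposes.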

Deep Dive: Kubernetes Observability

00:16:35
Speaker
into our topic. Again, no guests to introduce other than introducing ourselves, which
00:16:39
Speaker
If you don't know us, hi. Hi. I love that. And thank you for listening. We'll be right back after this short break. As long time listeners of the Kubernetes Bytes podcast know, I like to visit different national parks and go on day hikes. As part of these hikes, it's always necessary to hydrate during and after it's done.
00:17:03
Speaker
This is where our next sponsor comes in, Liquid IV. I've been using Liquid IV since last year on all of my national park trips because it's really easy to carry and I don't have to worry about buying and carrying Gatorade bottles with me. A single stick of Liquid IV in 16 ounces of water hydrates two times faster than water and has more electrolytes than ever.
00:17:27
Speaker
The best part is I can choose my own flavor. Personally, I like passion fruit, but they have 12 different options available. If you want to change the way you hydrate when you're outside, you can get 20% off when you go to liquidiv.com and use code KubernetesBytes at checkout. That's 20% off anything you order when you shop better hydration today using promo code KubernetesBytes at liquidiv.com. And we are back.
00:17:56
Speaker
But today we're going to be talking about Kubernetes observability. Again, this is sort of a mile wide, foot deep topic. We do this every now and then with the Kubernetes topic, and then we kind of dig in and have people on the show that tie into this space. So today is observability, and it's a loaded term. So I'm going to ask you the question, Bhavin: what is Kubernetes observability?
00:18:20
Speaker
Okay, so I have an expert definition and then I have like, okay, how does it differentiate from monitoring to make it easier? Because when I was like this, not right now, but when observability started making a buzz around in like 2017, 2018 timeframe, I was like, how is this different? Because at that time you had monitoring, which infrastructure folks and infrastructure teams are already using.
00:18:41
Speaker
as a solution or as a thing to monitor your infrastructure stacks and applications. Observability was coming around, AI ops was coming around. So like all of these new buzzwords, like what was observability in the first place? So the textbook definition is observability refers to the ability to monitor and diagnose the performance and behavior of your Kubernetes cluster and applications. Doesn't seem that different, right? Like it's just like, okay, monitoring and diagnosing. But it's always something that's been important. Like the observability term, like you said, did
00:19:12
Speaker
just kind of take off, but it's not like it's new, right? It's lumped together in something that's very important to Kubernetes, but continue. I like the way, I know we're not talking about any specific vendors, but Honeycomb, that's one of the big vendors in the observability ecosystem.
00:19:29
Speaker
They define observability as having three distinct characteristics to just help you understand the way your infrastructure stacks are, the way your systems are functioning. The first one being high cardinality. Just understanding this, if you have a key value pair, high cardinality just means that your values can be in the thousands or millions of entries, the possible different combination. Let's say, for example, if you have a database,
00:19:57
Speaker
The key is social security number, and then the values can be any possible combination of the 10-digit or 9-digit SSNs that people in the US have. That means their solution has high cardinality. The second characteristic is around high dimensionality,
00:20:12
Speaker
which means that even your keys can have so many different values for a specific attribute. If we put this together, cardinality means you have so many different columns in that table, whereas the high dimensionality means you have so many different rows in that specific table that relates to a specific attribute.
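
A minimal sketch of the two terms, assuming the usage common in observability tooling: cardinality is the number of distinct values a key takes, and dimensionality is the number of distinct keys attached to events. The event fields below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical events; every field name and value is illustrative only.
events = [
    {"user_id": "u-1001", "platform": "android", "region": "us-east"},
    {"user_id": "u-1002", "platform": "ios", "region": "us-east"},
    {"user_id": "u-1003", "platform": "android", "region": "eu-west", "app_version": "4.2"},
    {"user_id": "u-1004", "platform": "ios", "region": "us-east"},
]


def dimensionality(events):
    """Number of distinct keys seen across all events."""
    return len({key for event in events for key in event})


def cardinality(events):
    """Number of distinct values observed for each key."""
    values = defaultdict(set)
    for event in events:
        for key, value in event.items():
            values[key].add(value)
    return {key: len(seen) for key, seen in values.items()}


print(dimensionality(events))  # 4
print(cardinality(events))     # {'user_id': 4, 'platform': 2, 'region': 2, 'app_version': 1}
```

Here `user_id` is the high-cardinality key, in the same way the social security number is in the example above: nearly every event carries a different value for it.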
00:20:32
Speaker
And then the third distinct characteristic is just being exploratory, which means the ability to explore your data in real time and ask arbitrary questions. I think this is where I want to take the segue into the difference between monitoring and observability. But let me ask you, do you have anything to add before I go on to the next thing? Yeah. And to your point, right.
00:20:55
Speaker
I think the way I've always talked about and thought about observability, I think coming from like working on, you know, platform engineering teams, is I always was trying to understand
00:21:08
Speaker
the application, the system, because whatever end user I was serving, the end customer, I was trying to solve problems that they had or we had. Things slowing down or issues that came up. Really, the definitions that I like are more around
00:21:31
Speaker
observability is having the means to be able to troubleshoot issues quickly. And it's very broad because so is observability. Because what does it mean to be able to troubleshoot those issues quickly? Well, you need to be able to put your eyes on the right spots when it comes to
00:21:49
Speaker
all the signals and logs and metrics and traces. And that's all part of observability, right? So at the end of the day, why do you care about observability? Well, you're trying to solve and improve many aspects, whether that's performance or customer issue, those kinds of things. So I took a really good
00:22:11
Speaker
definition from SigNoz that I found. They have a good article. I'll post it in the show links. But they have observability in Kubernetes is the means and ability to troubleshoot issues quickly and with the help of collected telemetry metrics, logs, traces, and having the visibility for your cluster. So I gravitate toward that side of the definition, maybe because of just the background I have. But that's where I am.
00:22:38
Speaker
Okay, okay. So now let's talk about like, the difference between monitoring and observability. And the reason I'm dying, like, to answer this question is because I have a really good, like, metaphor.
00:22:54
Speaker
So if you do a quick Google search, there are so many articles that will talk about the difference. At the end of the day, this is the difference. Monitoring is all about the known unknowns. You know that you want to monitor X, Y, and Z, and then you are just waiting for alerts or things that just get generated. But whereas observability is about unknown unknowns, you don't even know what you should know.
00:23:16
Speaker
I think that's where observability comes in. The metaphor is this, if you go to a doctor's office, monitoring is just like they have a board in front of them. They're like, okay, they check your weight, they check your blood pressure, they check your pulse and things like that. They're just following a list of things that they know they should check, maybe at your annual checkup.
00:23:37
Speaker
But observability is when the doctor actually starts by asking the hard questions like what hurts and then tries to do a root cause analysis and doesn't leave anything, like doesn't stick to a specific list, but just keeps digging into like, okay, what hurts? Oh, your arm hurts? Why does it hurt? When did it start? What happened? And just figuring out the root cause of why your arm hurts. I think that's observability and the ability to have all of that data readily available.
00:24:00
Speaker
So you as the administrator can ask those questions for your systems. So it can help you perform that root cause analysis and help you catch something or treat something before it becomes like too serious. Yeah. Monitoring could almost, you know, could almost be looked at as sort of the medical evidence that's supporting whatever observation the doctor is making in that point. Right.
00:24:22
Speaker
if you say that you've been doing a lot of working out and now your arm hurts. Well, you've told the doctor this, but now he's putting it all together or something. Yeah. But if you didn't start with asking these random unknown what questions, you wouldn't even figure out that your arm hurts by just looking at your weight and looking at your blood pressure. It's just like if you were only monitoring for certain metrics, you will miss out the bigger picture or you will miss out what's actually wrong with you or wrong with your infrastructure for that matter.
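
A minimal sketch of the checklist-versus-questions contrast in the doctor metaphor. The request events, thresholds, and function names are all hypothetical. The fixed checks play the role of the annual checkup, and `ask` stands in for the ability to pose an arbitrary question against the raw data.

```python
# Hypothetical request events; fields and values are illustrative only.
requests = [
    {"route": "/checkout", "status": 200, "latency_ms": 120, "client": "web"},
    {"route": "/checkout", "status": 200, "latency_ms": 110, "client": "web"},
    {"route": "/search", "status": 200, "latency_ms": 90, "client": "web"},
    {"route": "/search", "status": 200, "latency_ms": 95, "client": "mobile"},
    {"route": "/upload", "status": 200, "latency_ms": 150, "client": "web"},
    {"route": "/upload", "status": 200, "latency_ms": 1900, "client": "mobile"},
    {"route": "/search", "status": 200, "latency_ms": 100, "client": "mobile"},
    {"route": "/checkout", "status": 200, "latency_ms": 130, "client": "mobile"},
]


def monitoring_checks(events):
    """Fixed checklist: the things we already knew we should watch."""
    error_rate = sum(e["status"] >= 500 for e in events) / len(events)
    avg_latency = sum(e["latency_ms"] for e in events) / len(events)
    return {"error_rate_ok": error_rate < 0.05, "avg_latency_ok": avg_latency < 500}


def ask(events, predicate):
    """Arbitrary question: keep every event the predicate matches."""
    return [e for e in events if predicate(e)]


# The checklist passes, because the averages hide the one bad slice.
print(monitoring_checks(requests))  # {'error_rate_ok': True, 'avg_latency_ok': True}

# Asking "what hurts?" for mobile uploads surfaces the slow request.
slow = ask(requests, lambda e: e["client"] == "mobile" and e["route"] == "/upload")
print([e["latency_ms"] for e in slow])  # [1900]
```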
00:24:50
Speaker
Yeah, and I think that's kind of the notes I put in there too, was less on the metaphor side of it, but it tracks really well. So it's really monitoring the technique of collecting, right? And you have a lot of unknowns in this, but you're collecting them, you're putting them somewhere.
00:25:08
Speaker
And now you need to go beyond it. What you do with just monitoring and collecting all that is where you get the value of observability, which is more the overall sort of tooling and technique about what you do with a lot of the things that you're monitoring. I think observability for me, as sort of an approach, is like observability on top, monitoring as sort of a subset, and then you can go even deeper right into those specific things that you're actually monitoring and what you do with them.
00:25:37
Speaker
Yes. Agree. There is definitely overlap. Monitoring is a huge subset of what observability is. You can't just have observability and ignore monitoring. That's not even...
00:25:49
Speaker
possible. As we talk about like, what are the different components? I know, I know in your definition, you spoke about the metrics, traces and logs thing that's like metrics is monitoring, like you're collecting for specific metrics. They are like closely tied to each other. But I think there is a lot of confusion in the industry. And like, personally, that metaphor helped me understand like, okay, this is the difference.
00:26:10
Speaker
Yeah, the metrics and the logs and all these things that go into monitoring, these are just the things that you are, right? And ideally, you don't really box yourself in.
00:26:21
Speaker
within your platform of what you can start to monitor. I think the challenge is you'll have plenty to monitor, but making useful observations out of that is a challenge. And just visualizing what you're monitoring is also not necessarily helpful because a dashboard that may just visualize what you're monitoring is pretty
00:26:43
Speaker
It's not that useful unless it's actually helping you solve some sort of other problem. If your end user is experiencing some additional lag or performance degradation on their end device when they're trying to talk to your server, you need observability to figure out what's wrong and analyze the traces and things like that. Exactly. I think this leads to why observability in general is important. We're starting to talk about visualization.
00:27:12
Speaker
And obviously, you know, from the human eye, you know, having metrics stored in a huge, you know, database somewhere is not that useful. I mean, it can be but you'll have to do a lot of like queries and searches and all that too. So, so visualizing so leads to sort of why it's important to
00:27:29
Speaker
these teams that are starting to manage Kubernetes and manage the applications. Maybe you have an answer for that. Yeah, I have a few bullet points based on the research that I did. As you said, visualization, it definitely helps increase visibility for those
00:27:43
Speaker
DevOps, platform engineers, any XYZ persona. It helps you get a comprehensive view of your cluster to make it easier for these personas to understand how different components of their infrastructure and application stack are performing and identify potential issues. This in turn helps improve
00:28:02
Speaker
the performance of your application, so it helps you identify bottlenecks. And I think in The Phoenix Project there was a statement that Bill made, like if you're not fixing the bottleneck, anything else that you're doing is not even like helping you. Obviously I'm
00:28:18
Speaker
Like this is not the exact quote. I'm just paraphrasing it, but yeah, helping identify bottlenecks in your system and then helping you fix those bottlenecks is definitely a thing that observability can bring to your table. And this in turn leads to like better troubleshooting and better reliability. Like, okay, if you have to, let's say perform root cause analysis, like if you're seeing from Android devices, if you are seeing that you're not able to like,
00:28:45
Speaker
users that are using Android, your app on an Android device are not seeing better performance compared to iPhone users, you can actually break down that traffic and then troubleshoot better and identify things that you can fix in your system and maybe scale out the pods that are responsible, maybe fix the routes or the ingress endpoint that you have for those applications communicating to your backend system. So definitely helps with better troubleshooting
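
A minimal sketch of breaking traffic down by device platform, as in the Android versus iPhone example. The samples and the `median_latency_by` helper are hypothetical. A real setup would pull these values from a telemetry backend rather than from an in-memory list.

```python
from collections import defaultdict
from statistics import median

# Hypothetical latency samples tagged with the client platform.
samples = [
    {"platform": "android", "latency_ms": 800},
    {"platform": "ios", "latency_ms": 120},
    {"platform": "android", "latency_ms": 950},
    {"platform": "ios", "latency_ms": 140},
    {"platform": "android", "latency_ms": 700},
    {"platform": "ios", "latency_ms": 110},
]


def median_latency_by(samples, attribute):
    """Group samples by an attribute and report the median latency per group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[attribute]].append(sample["latency_ms"])
    return {value: median(latencies) for value, latencies in groups.items()}


print(median_latency_by(samples, "platform"))  # {'android': 800, 'ios': 120}
```

Because the grouping attribute is a parameter, the same helper could split by route, region, or app version once those fields are attached to the samples.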
00:29:12
Speaker
We can also extend this and talk about security. If you know what's going on in your system, you can fix it better rather than you being oblivious and ignorant. Ignorance is bliss, but not when it comes to your day job or managing the infrastructure.
00:29:29
Speaker
that can come back to bite you in the butt. Yeah. Yeah. There's definitely a security component, right? I mean, just having visibility, which is a big part of observability, of even just requests being, like, all of a sudden we have requests coming in from this endpoint that we have no idea about. It's not on our network, all these things. You can investigate that, right? There are tools that help you identify those things on the fly. But this is a core tenet of being able to do those kinds of things.
00:29:59
Speaker
And at the end of the day, for me, you're trying to solve a real problem. And that's why it's important. And what does that problem mean? Well, that's different depending on the perspective. You have the customer where they care about downtime. So they need things to work smoothly. And then the team running it itself, you want it to run smoothly. But beyond that, you want to be able to, I was reading an article, which I think I'll talk about later,
00:30:30
Speaker
from the folks at Optum, on how they put together observability, and they cared a lot about mean time to repair, right? So one of their metrics or KPIs was mean time to repair, and observability was a way for them to reduce the MTTR.
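To make the MTTR idea concrete, here is a minimal Python sketch of how a team might compute it. The incident timestamps are made up for illustration, and "repair time" is assumed to be simply resolved time minus detected time; real teams define those boundaries differently.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2023, 8, 1, 9, 0), datetime(2023, 8, 1, 9, 45)),     # 45 min
    (datetime(2023, 8, 3, 14, 0), datetime(2023, 8, 3, 16, 0)),    # 120 min
    (datetime(2023, 8, 7, 22, 30), datetime(2023, 8, 7, 22, 45)),  # 15 min
]


def mean_time_to_repair(records):
    """Average of (resolved - detected) across all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in records), timedelta())
    return total / len(records)


print(f"MTTR: {mean_time_to_repair(incidents)}")
```

Tracking this number over time is what lets a team say whether better observability is actually shortening repairs.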
00:30:46
Speaker
Gotcha. From the team's perspective, you know you want this well-oiled machine, you've worked hard at it, Kubernetes is complicated, but when something goes wrong and it will go wrong, observability is a way to be able to pinpoint and reduce those repairs. Then I think there's a third one which probably doesn't come up as much is stakeholders in your company. Observability also gives you a way to give snapshots of the health or
00:31:16
Speaker
kind of what's going on in a system that you could give to an executive internally. So there's that big part of it that doesn't get talked about as much, but it is an absolute use case, I think, for the actual internal perspective. I think for sure. And to tie into your point that it helps with reporting as well, the new thing that we have seen, or at least I have seen recently, is people also talking about cost management and reducing costs
00:31:46
Speaker
under the observability umbrella. It's like figuring out how much money the different components of your stack are costing and then trying to help you optimize it. Cost definitely is a major concern for everyone building anything today. That whole part of the CNCF landscape,
00:32:06
Speaker
which is a newer one, where we see a lot of the Kubecosts and things like that, is, what is it called? Continuous optimization, right? A lot can fall under there. I mean, even one we didn't talk about yet, chaos engineering, also falls into it, because at the end of the day, all of these sort of use cases in this realm all fall back to
00:32:28
Speaker
being able to observe and use metrics and use what you're monitoring to be able to do something more interesting. It's definitely one of those things that can easily blow up into all other topics, which we'll cover some of them a little bit. But I did want to go back to another metaphor. Okay, let's do it. I kind of think about sort of an engine, right?
00:32:51
Speaker
being the bike dude that I am. But I think of all these metrics and traces and everything as sort of the fluids and everything that's flowing through an engine, through a system.
00:33:01
Speaker
And you need to be able to understand that they're doing what they're supposed to be doing, right? All of a sudden, if your car doesn't drive over 30 miles an hour, but you've got the pedal to the metal, something's wrong. You need to know why, right? So at the end of the day, there are so many of these types of metaphors, right? Kubernetes is our engine in many cases. And you're kind of taking techniques of being able to look at parts of that from various perspectives. And all these different tools will help you get there, so.
00:33:28
Speaker
That's why I love that metaphor. Like, yeah, you're trying to go as fast as you can, but you really can't because something is wrong. Let's figure out what. Anyway, yeah, so DevOps teams can benefit a lot from having these, and I think we can move on to maybe
00:33:48
Speaker
You had a topic about three pillars of observability that is an obvious next step of like, hey, if I'm going to start down this journey, where should I focus first? Yeah, I think I know we have already referred to these three things like logs, metrics, and traces.
00:34:05
Speaker
I would say these are the three pillars of observability, and I'm sure the entire community would say that too, or support that too. So logs, basically, I don't think I need to introduce the concept of what a log is, or what logging is, but having a system that can help you collect these logs, which consist of constant messages generated by different application components and different infrastructure components over a precise period of time, can help you tell a story of what's actually happening. So this can include things like logs that
00:34:32
Speaker
your different application components are generating, like MongoDB or your NGINX or whatever you're using in your application. Logs that are being generated from that. It can also include things like Kubernetes cluster component logs. What are your nodes saying? What is kubelet saying? If you have a security tool or a storage layer, what are those logs telling you? Then Kubernetes audit logs. This ties into
00:34:55
Speaker
not just the performance, but also security. Who's trying to access that system? Which components are trying to access each other? Just figuring out all of those things, using logs will be definitely helpful. Yeah, and I think the one thing to keep in mind, or a couple of things to keep in mind with logs is the whole garbage in, garbage out rule. Logs have to be useful information, right? And having logs set to the appropriate level as well. If you have the most basic level, you're not getting a lot out of it. If you have too much,
00:35:25
Speaker
then it's hard to search, right? Also, logs are mostly text, right? So things like OpenSearch and those kinds of tools are absolutely fundamental to being able to find useful data out of logs, right? Seeing patterns and those kinds of things. So search and garbage in, garbage out, I think, are big things to think about when
00:35:48
Speaker
collecting logs. Running kubectl logs on your pod is useful, but again, it's only what's getting put out by the application. That's true. I think later in the episode, we'll also talk about observability-driven development, and I know you have that bullet point, so this definitely fits into that story. Garbage in, garbage out. If you're not...
00:36:08
Speaker
thinking about logs in the right way, they won't be as helpful. So true. The next pillar, I think, is metrics, right? Metrics consist of that time series component, or time series data, that describes your resource utilization or how a specific component is behaving. It helps people get insights into the health of a system especially. So instead of looking at just a log for a specific component or a metric for a specific component as a point-in-time snapshot, you're looking at
00:36:38
Speaker
metrics across different components, and you're looking at them across, let's say, a day or a month or a year and figuring out trends and things that you see that need to be fixed. I think that can be really helpful as well.
00:36:49
Speaker
Yeah, and a really common example, right, that pretty much everybody these days who has any kind of smartphone will deal with: load times are a great one, right? So you hit enter or submit on your phone or keyboard, that call goes off, makes a request, and in the Kubernetes world it could hit many different containers and microservices. And each one of those calls can be a metric. How long is it taking to respond? If you see a trend and all of a sudden it spiked,
00:37:18
Speaker
maybe something else is happening, there's a bottleneck. So those are sort of the metrics. I just bring that up as an example because it's a common one and very easy to understand: if your website is loading really slowly, people get pissed.
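The "see a trend and all of a sudden it spiked" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the latency samples are invented, and the rule (a sample more than twice the average of the previous five) is an arbitrary choice, not how any particular monitoring tool does it.

```python
from statistics import mean


def find_spikes(latencies_ms, window=5, factor=2.0):
    """Return indices of samples that exceed `factor` times the mean
    of the `window` samples immediately before them."""
    spikes = []
    for i in range(window, len(latencies_ms)):
        baseline = mean(latencies_ms[i - window:i])
        if latencies_ms[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Hypothetical page load times in milliseconds, one per scrape interval.
samples = [120, 118, 125, 122, 119, 121, 410, 123, 120]
print(find_spikes(samples))  # index 6 is the 410 ms outlier
```

Comparing against a rolling baseline rather than a fixed number is the design choice here: it flags a sudden change even if nobody has agreed on what "slow" means yet.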
00:37:33
Speaker
I think that's perfect. That's a perfect segue into the next thing, traces. These are unique IDs for individual requests. Let's say Ryan is trying to access an app and it's taking really long. You can actually look at tracing information for each specific request and map it throughout your system and identify bottlenecks as these requests go from one component to another.
00:37:55
Speaker
Maybe the CDN that you're using is not the best. Like, you don't have to worry about anything on your own infrastructure. Maybe the CDN is the issue. You need to buy more bandwidth. That's the answer. Or maybe you need to fix a component and increase the queue size for some specific component. And maybe that will help. But having tracing implemented would really help you isolate these individual request calls throughout your system and help you identify these bottlenecks. And this kind of bleeds into, like, you might be able to trace it to a specific microservice.
00:38:21
Speaker
But then there are traces within that application itself. So something in that application, part of an algorithm or a function, could be slowing things down. And traces are used in parallel from a system perspective. There's a trace, and then there's code traceability as well. So the term is used, I think, both ways, but it's useful in this scenario either way.
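A rough Python sketch of that idea: every hop a request makes is recorded as a span carrying the same trace ID, and the slowest hop is where you start looking. The `Span` class, the service names, and the timings are all hypothetical, and the spans are assumed to be sequential hops rather than nested parent and child spans, which is a simplification of what real tracing systems record.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str   # shared by every hop of one request
    name: str       # component that handled this hop
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_span(spans, trace_id):
    """Return the longest-running span belonging to one trace."""
    matching = [s for s in spans if s.trace_id == trace_id]
    if not matching:
        raise ValueError(f"no spans recorded for trace {trace_id!r}")
    return max(matching, key=lambda s: s.duration_ms)


# Hypothetical spans for one slow request, plus one span from another trace.
spans = [
    Span("req-1", "cdn", 0, 40),
    Span("req-1", "ingress", 40, 55),
    Span("req-1", "cart-service", 55, 80),
    Span("req-1", "payment-service", 80, 900),
    Span("req-2", "cdn", 0, 35),
]

worst = slowest_span(spans, "req-1")
print(worst.name, worst.duration_ms)
```

The same lookup works one level down: spans recorded around individual functions inside a service would point at the slow piece of code rather than the slow microservice.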
00:38:49
Speaker
And I think like next thing was like, okay, the three pillars is done. But then as you have been clearly pointing out, like if you can't visualize all of these things,
00:38:59
Speaker
it's not really helpful. Again, the log garbage in, garbage out, metrics and tracing information: if you don't have a good UI to help you monitor all of these, it's just information that you have, which is not being used. Do you want to talk about how visualization is important? Yeah, sure. I think the one that many people will be familiar with is Grafana. Especially in the Kubernetes ecosystem, Prometheus and Grafana
00:39:22
Speaker
are sort of a de facto standard in many ways of open source tooling and are super useful, right? Prometheus helps you kind of collect from different endpoints in your cluster, whether that be the cluster itself, applications, but then you have to do something to visualize it. You can point Grafana at Prometheus as a data source and then do all sorts of interesting things. That being said, having Grafana just visualize something like
00:39:50
Speaker
how many requests are coming in is only so useful, right? There's a specific example I was thinking back to where I think we were monitoring requests happening within Ceph from like an OpenStack dashboard back in the day or something like that. And the difference of having a useful dashboard for this piece that we were looking at
00:40:15
Speaker
was really just a matter of how we dissected the actual data we were ingesting from Prometheus. That being said, if it was like a request, well, we actually needed to view it over a certain amount of time. So we needed to kind of slice it by what's happening in the last 30 minutes and looking at a very specific thing. So you have to really think about
00:40:34
Speaker
what you want to monitor. And I see dashboard examples all the time that give like a few charts and things like that for a huge system. The reality is that a lot of people who are doing these kinds of things and finding them really useful have a full dashboard for one component.
00:40:52
Speaker
Right. So like every component, or every microservice, has its own dashboard. There's no magic bullet where you have one thing to look at and can see everything. So you really have to kind of double click and get fine-grained with those things. And there are a lot of tools out there, right? Grafana is one of them. OpenSearch Dashboards is one of them, which, if people are familiar with the ELK stack, is AWS's forked open source version of it. But the ELK stack is also one, right? If you're used to Kibana, those kinds of things.
00:41:21
Speaker
These are all ways you can visualize different types of metrics and traces and logs as well. Awesome. I think as the next question, I want to talk about just the different options that are available for users that are maybe new to observability, trying to interpret the CNCF landscape, which is a complicated one, and see what are those tools that they can get started for free on their own with the good community support.
00:41:52
Speaker
Start with Prometheus and Grafana. I think that's a no-brainer. And something I have a bullet point on here that I just didn't realize: alerts are a big part of this observability landscape too. I think it falls a little outside of, I don't know, does it fall outside of observability? I mean, so basically alerts are just taking the ability to have sort of, literally, an alert, like a text or an SMS or,
00:42:16
Speaker
you know, an email, that kind of thing, pop up if one of your observability metrics or kind of
00:42:23
Speaker
kind of rules that you create, say, if load times go over this, right? That's constantly monitoring. You don't have to look at it all the time. Right, a dashboard: you don't want to sit there and stare at it. And so that's another big part of it. But Prometheus has very good integrations with that. It works well with Grafana. They're both open source. They work really well with Kubernetes. They're, I would say, fairly novice-friendly in terms of being able to install and get going. I think it's a perfect spot to start.
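As a rough illustration of the "if load times go over this" rule, here is a small Python sketch. The `AlertRule` class and its fields are hypothetical names made up for this example, not the Prometheus rule format; requiring several consecutive breaches before firing is loosely similar in spirit to the `for` clause in Prometheus alerting rules, which keeps a single blip from paging someone.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold_ms: float   # fire when load time goes over this
    for_samples: int      # consecutive breaching samples required


def should_fire(rule, samples_ms):
    """True only if the most recent `for_samples` samples all breach the threshold."""
    if rule.for_samples < 1:
        raise ValueError("for_samples must be at least 1")
    recent = samples_ms[-rule.for_samples:]
    return len(recent) == rule.for_samples and all(
        value > rule.threshold_ms for value in recent
    )


rule = AlertRule(name="SlowPageLoad", threshold_ms=500, for_samples=3)

print(should_fire(rule, [200, 650, 700, 720]))  # sustained breach
print(should_fire(rule, [200, 650, 300, 720]))  # recovered in between
```

In a real setup, the notification step (text, SMS, email) would hang off the result of a check like this, so nobody has to stare at the dashboard.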
00:42:52
Speaker
There are a ton of tools, though, right? When you go beyond that. I think those are the de facto standards and a great spot to start with.
00:43:03
Speaker
But we'll also link to the observability part of the CNCF landscape. And you'll see for yourself, right? You'll start asking. But I know you have Thanos and Cortex and some of the other ones on here, if you want to talk about those. Yeah. And I think, so I don't manage infrastructure at scale, right? Like, usually I'm like, okay, I'll deploy my own Kubernetes cluster, do things, delete it. That's it. I don't even have long-running clusters, but I've always deployed Prometheus and Grafana and I've thought, okay, that's enough. Like, I don't need more tooling.
00:43:31
Speaker
But what I've realized after doing research for this episode is Prometheus is like single cluster. There is no way to have a Prometheus deployment or have Prometheus understand that it can talk to different Kubernetes clusters. You have a Prometheus endpoint on each. It's also hard to retain data for longer periods. Like right now, the default is 15 days, which I'm sure like most of the people might just be setting that and not- Yeah, retention is tough, right? I mean, you could easily blow some kind of quota.
00:44:01
Speaker
When you're pulling in logs and all this stuff, it can get quite expensive. I think that's where the open source community came together and have built projects like Thanos and Cortex. Both of these are open source, part of the CNCF community. Talking about Thanos specifically, they give you highly available Prometheus setup with longer term storage capabilities. It is based on Prometheus, still uses PromQL, it still uses all the core components of Prometheus, but it gives you the ability to
00:44:30
Speaker
store more than 15 days worth of data. It gives you the ability to talk to multiple Kubernetes clusters through their individual Prometheus endpoints and give you a global query overview. So if you're mapping out or creating new dashboards in Grafana, you can actually point it to a Thanos instance and actually have dashboards that
00:44:50
Speaker
show you information from multiple Kubernetes clusters instead of just talking to Prometheus and giving you that one single cluster. So it helps you with a global query view as well. And then Cortex is something similar: horizontally scalable, highly available, multi-tenant, with longer term storage. I think the only thing is it's still in the incubation phase, and I think vendors like AWS are already using it for their managed Prometheus services. So they're using all of these benefits and offering it as a service to the customer.
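The "global query view" can be pictured with a toy Python sketch: take the results each cluster's Prometheus would return for the same query, tag every series with the cluster it came from, and present one merged list. This is only a conceptual illustration with invented data and names; it is not how Thanos or Cortex are implemented, although Thanos does rely on per-Prometheus external labels to tell sources apart.

```python
def global_view(per_cluster_results):
    """Merge per-cluster query results into one list, labeling each
    series with its source cluster and sorting by value, highest first."""
    merged = []
    for cluster, series_list in per_cluster_results.items():
        for series in series_list:
            merged.append({**series, "cluster": cluster})
    return sorted(merged, key=lambda s: s["value"], reverse=True)


# Hypothetical CPU utilization results for the same query on two clusters.
results = {
    "us-east": [{"pod": "api-1", "value": 0.82}, {"pod": "api-2", "value": 0.40}],
    "eu-west": [{"pod": "api-1", "value": 0.91}],
}

for row in global_view(results):
    print(row["cluster"], row["pod"], row["value"])
```

Note that both clusters have a pod called `api-1`; without the added cluster label the two series would be indistinguishable in a single dashboard.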
00:45:21
Speaker
Yeah, and yeah, that's a good point on scale, right? You'll have to think a little bit differently on these tools, but many of them do support these kind of things. I think Prometheus itself, you can actually target other Prometheus's. So you can have like one Prometheus point at
00:45:36
Speaker
other Prometheus's for a data source. I mean, these other tools probably make it easier, but I believe that is also the case. But yeah, I mean, I think beyond just the sort of standard monitoring tools, we haven't even talked about tools that do things like logging or tracing, right? Some of the ones that people might be familiar with, Splunk is a very big name in the ecosystem that's very good at logging.
00:46:05
Speaker
Even just using something like the ELK stack with Fluentd or Logstash, that whole thing where you basically have a daemon, or a small application running on or near a node, near an application, that's collecting all the logs and sending them somewhere else. It's kind of its sole purpose to do that efficiently, right? But then you have an aggregation point and then you have sort of a visualization and search point.
00:46:30
Speaker
I think logging, in terms of architecture, is definitely something fairly straightforward. That being said, scale again comes into play. Applications and systems put out a ton of logs, so you can easily fill up some bandwidth. You might need to think about how much is coming out.
00:46:54
Speaker
You might need to limit it, you might need to change logging levels, those kinds of things. You might fill up your Kibana dashboard quite quickly and get overwhelmed. But definitely something to think about at scale. Yeah, and I think the thing that I like about Fluentd, right, I know you said DaemonSet and things like that, but it allows you to talk to any data source, it has a universal language, and then it allows you to
00:47:17
Speaker
take inputs from any type of data or any data source and then translate them for any destination as well. You can have Apache logs going to MongoDB or Elasticsearch or both at the same time. It can also help you with routing, how your logs are actually being translated from different sources to different destinations. It also helps you store things locally. I know this is running on Kubernetes as a DaemonSet. If the pod goes down for Fluentd, you don't have to worry about losing access to all your logs.
00:47:45
Speaker
It stores it in a PVC by default and then you can also configure it to push things to a remote repository and then can be deployed in HA configurations. Fluentd definitely is the cornerstone of logging when it comes to Kubernetes.
00:47:59
Speaker
Yeah, absolutely. There are a bunch out there, but go look for yourself. We're not going to cover them on this podcast. I'll put a link to them. I think tracing, we can cover a little bit. You had Jaeger in here. I think that's a preferred CNCF solution as well: completely open source, helps with end-to-end distributed tracing. It can help you visualize the chain of events between microservice components, and helps you
00:48:23
Speaker
interpret the interactions that are happening, and helps you perform root cause analysis and service dependency analysis as well. So Jaeger is a tool to look at if you want to start thinking about tracing. Sidebar, the Jaeger logo is pretty cool. It's obviously, so it's a gopher, which is how you know it's very CNCF-y.
00:48:41
Speaker
He's looking at footprints and, like, investigating. But I always thought, as tracing, like, I'm following these things, but I always thought he looks angry, like, who put these dirty feet on my floor? But I'm pretty sure he's supposed to be investigating, because he's got the whole hat on and stuff. But I never put that together. It's just a cute logo, but that makes more sense. You tell me. Maybe it's just me who thinks he just looks angry at the feet, like someone's putting them on his carpet or something.
00:49:09
Speaker
But yeah, go take a look at those. We won't spend a lot of time on the individual tools. We talked about chaos engineering and continuous optimization as well. These are things like Chaos Mesh, chaoskube, and Chaos Toolkit, which are all about introducing, sort of literally, chaos into the cluster so that you can see how your system reacts. And a big part of the reason that it's part of observability is because you introduce these chaos mechanisms where it will maybe spin up a ton of pods or those kind of things,
00:49:38
Speaker
introduce some errors. You also want to see how your observability stack handles viewing what's going on from your sort of
00:49:49
Speaker
tracing and monitoring abilities, right? So like, to be honest, when I heard about Thanos for the first time a few years back, I was like, that's a chaos engineering tool, right? Like with the whole like, yeah, with the naming of it. But nope, it is a metrics tool. Come on, man. Yeah, yeah, exactly. And then, um,
00:50:09
Speaker
Optimization also includes, right, Kubecost and OpenCost, which we spoke to (we had Kubecost on the show), but it also kind of goes into how to use metrics to manage things like the whole Black Friday scaling issue, where
00:50:30
Speaker
ScaleOps and Crane and those kinds of things can do more interesting things beyond just cost. I mean, cost is a big part of it, but as you use metrics, you can do interesting things. Like, you can kind of start to predict some things, like CAST AI.
00:50:44
Speaker
So really cool things, I think, are going to happen in that continuous optimization space. The core tenet is having all this observability stuff anyway. No, I was just saying, I like how it's called continuous optimization. They have kept it generic and not continuous cost optimization because, again, cost is
00:51:04
Speaker
definitely a major component of it, but if you don't have your applications right-sized, you will have these performance issues. It's all about figuring things out and updating things as you go, so at the end of the day, optimizing things. Yeah. Cool. I wanted to make sure we get to these other related topics we had, and our ChatGPT question, which is the idea of observability engineering, or, I think they're synonymous, observability-driven development.
00:51:33
Speaker
I think this is adding too many things on the developers personally. Like shift left, security, they have to think about security. It definitely is putting a lot of pressure on them, but this is just the idea that if you're building applications, you should think about observability from day zero or day minus one.
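One small way to picture "observability from day zero" is deciding up front that every component emits structured logs at a deliberate level. The sketch below uses Python's standard `logging` module; the `JsonFormatter` class, the `checkout` logger name, and the messages are hypothetical choices for illustration, and an in-memory buffer stands in for stdout so the output is easy to inspect.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so it is easy to search."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name, level, stream):
    logger = logging.getLogger(name)
    logger.setLevel(level)        # the level is chosen on purpose, not left at a default
    logger.handlers.clear()       # avoid duplicate handlers if called twice
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("checkout", logging.INFO, buffer)

log.debug("cart contents: %s", ["sku-1"])      # dropped: below INFO
log.info("order placed order_id=%s", 42)
log.error("payment failed order_id=%s", 42)

print(buffer.getvalue())
```

With the level set to INFO, the debug line never reaches the log collector, which is the "too much versus too little" trade-off in miniature.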
00:51:55
Speaker
make it part of requirements, think about the bigger picture when you're starting small. As we said, if your application components are not spitting out the right logs at the right level, they are useless. So when you're building your apps, when you're implementing logging in those components, it definitely helps to keep observability in mind. That's my... Yeah, I think a big part of it is just making sure that the development side of things
00:52:21
Speaker
is aware of what's being observed in production. All right. So using sort of the what, when, and how things are happening in production to influence development. Right. So like a feedback loop, right? I think there's that part of it, and the mindset of it as well, of like, hey, I'm going to develop code that
00:52:45
Speaker
lends itself well to tracing and logging, those kinds of things. But yeah, a feedback loop: being able to have your developers as a stakeholder in your observability stack, right? Having them be able to understand, or be presented to, or whatever it may be. Hopefully make it low effort for them to consume that information, but be able to say, oh, hey, we're seeing this, so maybe we take a look at
00:53:13
Speaker
improving this kind of thing, those kinds of things. So I think it's great. I think it can be overdone in terms of how much is put on the development side of things. But I think it's crucial, done, again, as just another feedback loop for those teams to consider. Cool. I think we're almost near the hour mark. So,
00:53:39
Speaker
I think we're gonna stop here in terms of diving into observability. That being said, we know we didn't touch everything. We know we didn't go into a lot of detail on certain things, but that's sort of on purpose. We're giving a brief sort of overview 101 to observability in the Kubernetes space and maybe some of the things you'd want to consider and look at. Hopefully that was something useful to you. Let us know if you like these kinds of episodes.
00:54:05
Speaker
And we'll dive into having some guests on in this space. But before we leave you, we do have a ChatGPT question. Should we do the short one or the long one? Let's do the short one. All right. The short one is we asked ChatGPT, if observability was a superhero,
00:54:23
Speaker
who would it be? You know, I expected ChatGPT to tell me a specific real superhero that I would understand, but it just was like, nope, I'm not going to pick one. I'm going to make one, unless this one is one. But it said, if observability were a superhero, it would be named Monitor Master. So it doesn't have the best ring to it. It's a little tactical, but
00:54:52
Speaker
So I was thinking it would just take an example from the world. But it said this superhero would possess the power to see through the complexities of distributed systems and cloud native environments, ensuring everything is running smoothly and efficiently with its superhuman ability to collect, analyze, and visualize the data from various sources. So I guess this superhero can do all these things itself.
00:55:19
Speaker
Who knows if ChatGPT is, I don't know, drawing outside the lines and coming up with a superhero. The way I would answer this is, I don't know, observability for me is not a superhero, it's the villain, I guess: the Riddler from the Batman series, always asking riddles or asking questions like, what's wrong? Yeah, that's my answer. It's a lame answer.
00:55:42
Speaker
Yeah, so it kind of goes on and says, just like a superhero swooping in to save the day, Monitor Master would provide insights and recommendations to help the teams maintain their apps. So it pretty much took the topic and then created this name, Monitor Master, which is definitely things we hear in Kubernetes. I don't know, I was a little let down by you, ChatGPT. Come on. I was hoping it would pick an actual superhero.
00:56:12
Speaker
But all good, all good. But yeah, we have some fun episodes coming up soon. Again, come join our Slack. Suggest episodes in there. If you haven't joined it, we'll put the link in the show notes again. Introduce yourself. Suggest episodes. Give us feedback. Talk to us. Whatever. Ask questions. We'd love to have you there. And we'll see you soon on some new episodes. But that brings us to the end of today's episode. I'm Ryan. I'm Bhavin.
00:56:41
Speaker
Thanks for joining another episode of Kubernetes Bytes. Thank you for listening to the Kubernetes Bytes podcast.