Podcast Introduction and Hosts' Banter
00:00:03
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Wallner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts. We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:00:27
Speaker
Good morning, good afternoon, and good evening, wherever you are. We're coming to you from Boston, Massachusetts. Today is February 16, 2022. I hope everyone is doing well and staying safe. Let's dive into it. Bhavin, how are you? What have you been up to?
00:00:45
Speaker
I'm doing good. I'm just surprised by this New England weather. Last Saturday we had 60, maybe 55, and I was so happy, I was out there on a walk, and then on Sunday it was snowing. So I'm still trying to figure this out. Yeah. I think I came into work and explained that I was in my t-shirt on Saturday and then I woke up Monday to 10 inches of snow, and Boston didn't even get that much. If you drive 40 minutes out towards me, we had 10 to 12 inches in some places. The forecast said
00:01:14
Speaker
less than an inch. Sunday night at midnight I was up and I said, all right, there's nothing going on here. And then poof, I woke up and there's 10. So that is a New England weekend if there ever was one. I think Andy on our team said that's the most New England thing you've ever said: being outside in 60 degrees and then 10 inches of snow, and school wasn't even canceled for it. So that is what it is.
00:01:41
Speaker
I had some family up this past weekend, which was really nice. We've been seeing a lot more family now that things are becoming a little more normal, although it seems like everyone and their brother or sister is getting COVID here and there, and most people are fine. I'm looking forward to some normalcy, and I know we've been talking about KubeCon a lot, and I am really, really hopeful for KubeCon EU in Spain to be as normal as it can be. How about you?
00:02:10
Speaker
Same here, I don't know. Every time I'm wearing a mask and going outside now, I think we should at least get rid of masks outdoors. I'm okay wearing one indoors and while I fly, but outdoors, maybe it's time to call it off. But that's not my decision to make.
00:02:26
Speaker
Yeah, it's probably a personal matter at this point, I guess, in crowded spaces. The world is forever changed. And I doubt people listening want to hear about COVID, though. So let's dive into our guest
Guest Introduction: Vineeth Pothulapati
00:02:41
Speaker
today. We do have a guest today. Actually, I'll introduce him and then we'll do a little bit of news. Vineeth Pothulapati.
00:02:48
Speaker
I think that's the best I can do. He's a software developer at Timescale, a developer by profession. He's a Kubernetes contributor, and he loves to work on cutting-edge cloud native technologies. He's really interested in distributed, resilient, high-availability systems, which you'll hear a lot about today. And he's a maintainer of one of the projects we'll talk about, the OpenTelemetry Operator.
00:03:14
Speaker
Before we get him on the show, let's talk about a little bit of the news. What do you have for
PX Backup and NetApp Astra Updates
00:03:20
Speaker
us, Bhavin? Yeah, two things I wanted to highlight. One of those comes from our employer, Portworx by Pure Storage. Last week we announced PX-Backup as a service, our data protection tool delivered as a hosted service for customers running on Amazon EKS who just want
00:03:38
Speaker
a control plane where they can connect their clusters and create those backup jobs and perform those backup and restore operations. Making things easier, right now it is in early access. We'll be able to share a link to the registration page if you want to register and get your hands on the early access bits.
00:03:55
Speaker
That's one. And then another vendor in the ecosystem, NetApp, had a bunch of announcements around Astra as well. The key thing that stood out to me was expanded support for backend storage. Instead of just supporting NetApp Cloud Volumes on Google Cloud or Azure NetApp Files, Astra can now also use Google Cloud Persistent Disks or Azure Disk Storage as the backend storage
00:04:20
Speaker
for your applications running on those respective Kubernetes services, and you can use the Astra Control Service to protect them. They also had additional announcements around on-prem support and their launch with Red Hat on OperatorHub. We've linked all of those in a summary blog in the show notes. But yeah, those were the two things I wanted to talk about.
00:04:44
Speaker
Great. Well, the one that I wrote down here is all about awards, which is always fun to see what the industry is up to. This is the TechTarget Storage magazine and SearchStorage.com announcement.
Storage Industry Awards Discussion
00:04:56
Speaker
The storage products of the year for 2021. There are a number of different winners in here because they categorize them. There's backup and disaster recovery hardware, software, and services, which Veeam, Kasten, and Druva are all in. Then there's disk and disk
00:05:13
Speaker
subsystems, and cloud storage. Cloud storage has MinIO and Satera; the disk subsystems, that's Viddedad and Dell. Then you have storage system and application software, where Portworx is actually in there along with Pavilion and Komprise Intelligent Data Management. Really cool stuff. I always like to see that. We'll link it in there. I know a lot of those folks probably submitted last year at some point and they're all waking up to
00:05:42
Speaker
see if they won anything fun.
What is Observability in Cloud-Native?
00:05:44
Speaker
So yeah, I think let's dive into today's topic, which is all about OpenTelemetry, TimescaleDB, Promscale, and metrics, traces, and logs. Lots to cover here. So let's get Vineeth on the show.
00:06:02
Speaker
Hey Vineeth, welcome to the show. Welcome to the Kubernetes Bytes podcast. As a first question, we want all of our guests to introduce themselves and talk about what they do on a daily basis. So go ahead and introduce yourself.
00:06:18
Speaker
Hi Bhavin and Ryan, so happy to be here, and it's a pleasure to be part of the Kubernetes Bytes podcast. Let me introduce myself. My name is Vineeth Pothulapati. I'm working as a product manager at Timescale on the observability applications team. I've been working as a software engineer for the last four to five years, and for the past couple of months I've been working on the product management side of the house, and
00:06:44
Speaker
I officially transitioned into the product manager role recently, and it's been an amazing journey for me at Timescale, working on observability as a product manager. I primarily work on the projects Promscale and tobs,
00:07:02
Speaker
which we'll be discussing today on the podcast. Other than that, I previously worked at Aqua Security, a cloud native security provider. I'm also an upstream OpenTelemetry Operator maintainer and actively involved in upstream projects: I worked on Kubernetes upstream releases from 1.16 to 1.18, and I also led the 1.18 docs team
00:07:32
Speaker
during the release, and contributed to Cortex and Open Policy Agent as a CNCF mentee. I did my Google Summer of Code during college
00:07:43
Speaker
with the Zalando Postgres Operator, building a CLI to make management of the operator and the database instances easier. That has been my work in the community overall, and I'm actively involved with OpenTelemetry at this point in time. That's me. Nice. That's a long list of projects. Yeah, I know Bhavin and I have both used the Zalando Postgres Operator in some of our demos. At least I think we have. I know I have.
00:08:13
Speaker
And one of our colleagues is working with it right now, so a small world there. I know we're going to ask about some other topics, such as what observability and metrics are, things like that. But for my own sake, where does TimescaleDB fit into the CNCF landscape and/or the Kubernetes landscape? I'm just curious.
00:08:32
Speaker
When you think about Timescale, in my experience I was fascinated with the product Timescale is building. When I was in the CNCF space, many people there didn't completely understand what Timescale was. But if you think deeper into what Timescale has to offer, it's
00:08:55
Speaker
the time series database. In recent times, all the data you see powering your dashboards, your business data, your metrics, your observability data, your crypto, your NFTs: everywhere you look there is time series data and you want to store it. When you say time series data, this data will scale; it might be
00:09:18
Speaker
coming in per second, per millisecond, or once every five seconds, which means over a day, over a week, over a month, over a year it scales up to millions or billions of data points. And if you try to store them in a database, the database basically has to crawl across these data points to visualize them and draw insights from them. So that's exactly where the superpowers of Timescale come into the picture, and into the cloud native ecosystem.
00:09:48
Speaker
TimescaleDB is open source, under the TSL license from Timescale, so you can use Timescale directly. Coming to the cloud native picture: it works just like any other database on your Kubernetes cluster, and we offer a Helm chart. In the cloud native ecosystem, with the increase in microservices architectures and the way you design your applications, you want a global single store for all your time series data.
00:10:17
Speaker
And this is where I see Timescale fitting into the picture: all of your multiple microservices will be writing and reading this time series data to and from a single store. It's petabyte scale, which is what we offer today in Timescale, and it's reliable with the rock-solid foundation of Postgres. That's how it fits into the cloud native ecosystem as TimescaleDB, or Timescale as a product. But coming to the other side, Promscale. So again, it's an
00:10:45
Speaker
extension of Timescale into the observability world, I would say. So Timescale fits into the CNCF space generally as a database, and also for observability use cases. Okay. My next question for you: I need to set a baseline. I want to understand what observability is, and how it's different from plain old monitoring,
00:11:10
Speaker
What are the different categories under observability?
Deep Dive into Observability Components
00:11:14
Speaker
So yeah, if you can talk about that, that would be great.
00:11:18
Speaker
Yeah, definitely. That's an interesting point to start from, and you'll need to hear my whole answer because it's a big question, but let's get started. Observability basically comes from control theory, how systems work. Observability by definition is the ability to infer the internal state of a system based on the system's external outputs.
00:11:45
Speaker
So by the definition itself, you want to infer the internal state of the system from the output it emits. Why has observability become such a required, in-demand capability for your applications and systems today? In the world of the cloud native ecosystem,
00:12:08
Speaker
you're moving from monoliths to microservices, which means you'll be scaling your applications from tens to hundreds to thousands of services if you operate at high scale. And if you have that many microservices, you want to understand how each and every service is performing.
00:12:26
Speaker
What's the throughput? What's the latency between the services? And how does the request lifecycle flow? For example, if you book an Uber, the request might jump across 10 to 20 different services with just one click. So you want to understand where exactly latency is being added to your request.
00:12:45
Speaker
With observability, if you think from the application's point of view, there are different ways of capturing it. At some point you just want to know how many processes or threads are running at a given point in time.
00:13:01
Speaker
What's the metadata of a transaction you want to look at? What is the service performing exactly at this point in time? In some cases, you want to follow the complete request lifecycle. How does it flow across the services? How is it different if X user is trying out this transaction versus a Y user is trying out this transaction?
00:13:23
Speaker
So there are multiple ways of looking at observability itself. They say observability has three main pillars, which are metrics, logs, and traces.
00:13:36
Speaker
So let me give an intro to metrics, logs, and traces. First come the metrics. Actually, this is the pattern I keep seeing: people usually start with logs, not metrics first. Usually you'll create an MVP and start logging which action is being performed in each and every function, saying, hey, this is a warning log, this is an error log, and this is an action
00:13:59
Speaker
performed at this point in the business logic flow: okay, I've done this action. That's how you get observability from logs: you basically understand the series of events that a specific service has performed.
00:14:16
Speaker
That's step zero for your observability; usually people start from logs and then pivot to metrics. Metrics are something where you want to understand, say, how many elements have been cached, this is just an example, how many cache hits were happening in your business logic, or how many
00:14:41
Speaker
transactions you have processed since application startup. With metrics, you want to capture a value of the internal state of the system. With logs you are emitting events, but with metrics you are capturing state: how many goroutines are running at a given point in time? How much time is my GC taking to process (the GC duration seconds metric)?
00:15:07
Speaker
If you think about a log, a log will usually have a structure: a timestamp and a log line with all the metadata. Then you'll also have something for filtering: is it info, a warning, or an error? So there are levels in logging, and then a log line with a label set and all the data you want to
00:15:31
Speaker
share while debugging the service. So that's a log. When you come to a metric, you can think of it as a timestamp plus tags, basically the metric name and metadata like the job name, the cluster it's running in, the pod name, the container name, and then a float value. This value represents the exact
00:15:51
Speaker
state or the value you want to capture from the system. Those are logs and metrics, and the last part is traces, which is the most interesting part: the observability data type we are actively working on, and even in the CNCF space you can see it's the next big thing in observability, with OpenTelemetry. Coming to a trace: a trace is an observability data point where you want to
00:16:20
Speaker
capture the complete request lifecycle. For example, say you're on an e-commerce site and you want to order a book. You log in to the site, and that login is itself a request lifecycle. Once you click login, there's a service which authenticates with the user credentials you provided, and a database
00:16:45
Speaker
where the details you provided get validated; it sends back an acknowledgement that they're confirmed, and you log in. As soon as you log in, the advertisements and all your suggested items get processed. So there are a bunch of microservices involved in performing this action, and a trace basically denotes the complete request lifecycle that happens for a specific action.
00:17:10
Speaker
You can think of it as a list of spans. There will be a parent span and child spans. A span is nothing but an atomic action or event being performed across these microservices.
00:17:25
Speaker
Let's say you click the login button. The first thing it does is validate: is this a valid email? That's a specific action. And all the spans are captured the way you instrument your application: in your application code, in the parts you're interested in, where you want to say this is a specific action, you basically write span.start and span.end.
00:17:55
Speaker
Whenever that action happens, it creates a span. However many requests flow through this business logic, that many spans will be created, and they'll be attached to the trace where the request was initiated. This might be a bit confusing to explain orally, without any visual representation, but that is logs, metrics, and traces at a high level.
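The span.start/span.end idea he describes can be sketched with a toy tracer. This is not a real tracing SDK, just an illustration of the parent/child span structure a trace records; all the names here are made up:

```python
import time
import uuid
from typing import Optional

# Toy span: an atomic action with a start/end time, tied to a trace
# and optionally to a parent span. Illustrative only, not a real SDK.
class Span:
    def __init__(self, name: str, trace_id: str, parent_id: Optional[str] = None):
        self.name = name
        self.trace_id = trace_id
        self.parent_id = parent_id
        self.span_id = uuid.uuid4().hex[:8]
        self.start = 0.0
        self.end = 0.0

    def __enter__(self):
        self.start = time.monotonic()  # span.start
        return self

    def __exit__(self, *exc):
        self.end = time.monotonic()    # span.end

    def child(self, name: str) -> "Span":
        # Child spans share the trace ID and record this span as parent.
        return Span(name, self.trace_id, parent_id=self.span_id)

# One request lifecycle = one trace made of spans.
with Span("login-request", trace_id=uuid.uuid4().hex) as root:
    with root.child("validate-email"):
        pass  # real code would validate here
    with root.child("query-user-db"):
        pass  # ...and hit the database here
```

Summing child durations against the root span is exactly the latency breakdown he describes later with the calculator example.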
00:18:26
Speaker
So for a developer this might make more sense, since they're making these changes in their code. But if I want to collect all of these, right, logs, metrics, and traces, how do I do that if I have an application already running? Can I retrofit an application, or does this need changes to the code? Can I use sidecar containers with my existing pods and collect all of this information, or how does that work?
00:18:52
Speaker
Yeah, this is an interesting way of looking at observability: the day zero problem. How do you adopt observability in your applications? The first thing is logging: developers decide what they want to log from their application. Basically, you use some open source logging package; there are multiple packages. So you just
00:19:16
Speaker
use them to grab the logs, or to make them even more structured for your use cases. You instrument your application with logging, saying log.warning in this case, log.error there. Pretty much every observability data type needs instrumentation if you want the ideal level of visibility into your applications.
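As a minimal sketch of that kind of log instrumentation, here using Python's standard logging module as a stand-in for whatever logging package you pick; the service name and messages are invented:

```python
import logging

# Configure a logger with timestamps, a level for filtering, and a name.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("checkout-service")

def process_order(order_id: str, items: int) -> bool:
    # Emit an event for each step of the business logic, with metadata.
    log.info("processing order order_id=%s items=%d", order_id, items)
    if items == 0:
        log.warning("rejecting empty order order_id=%s", order_id)
        return False
    return True

process_order("o-123", 2)   # logged at INFO
process_order("o-124", 0)   # logged at INFO, then WARNING
```

Each log line carries the timestamp, level, and metadata structure described above, which is what makes the series of events filterable later.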
00:19:43
Speaker
So with logs you need to instrument, and the same with metrics. Here I'll touch on the most popular cloud native project, Prometheus, which is a
00:19:57
Speaker
systems monitoring and alerting tool. Basically, it helps you capture metrics from your application. Even in this use case, you need to use the Prometheus client libraries to instrument your application. So I would say instrumenting for logging and metrics is
00:20:18
Speaker
less than 10 lines of code. It's basically a matter of saying, hey, I'm creating a logger, I'm creating a metric: you init the Prometheus client, and then you start capturing a counter, gauge, histogram, or summary. There are different metric types, and based on your use case you'll capture those metrics.
00:20:40
Speaker
For example, say my business logic involves caching some elements, whatever queries I've executed so far; I want to cache them for the last 30 minutes. I want to understand how many elements, or which items, are being cached. What I'll do is create a counter there, saying whenever you cache something, just increase the counter.
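That cache counter can be sketched by hand; in real code you'd use a Prometheus client library Counter, but this shows the idea, and the metric names here are invented:

```python
from collections import Counter

# Hand-rolled counters standing in for Prometheus Counter metrics.
metrics = Counter()
cache = {}

def run_query(query: str) -> str:
    """Serve a query result, counting cache hits and misses."""
    if query in cache:
        metrics["query_cache_hits_total"] += 1
        return cache[query]
    metrics["query_cache_misses_total"] += 1
    cache[query] = f"result-for-{query}"  # pretend this is the expensive work
    return cache[query]

run_query("SELECT 1")   # miss: first time seen
run_query("SELECT 1")   # hit: served from cache
run_query("SELECT 2")   # miss: new query
```

Reading the two counters back gives exactly the internal state he describes next: how many requests were served from the cache versus how many were misses.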
00:21:06
Speaker
So as an application developer, or as an SRE who intends to understand the cache being occupied by this application, I'll just view that metric: okay, what's the
00:21:17
Speaker
total number of cached queries executed. It just gives me the internal state of the system: how many requests have been served from the cache, and how many new requests going to the application are not in the cache list, which means a cache miss, because you don't have those elements cached.
00:21:37
Speaker
For this too, as I said, it's all about instrumenting with a few lines of code for metrics. Once you've instrumented, you configure your Prometheus: hey, this is the scrape config, this is the endpoint. Prometheus will start scraping your application, saying just give me the metrics you want to expose, and start storing them. Then with PromQL, the metrics query language offered by Prometheus, you query those metrics to make sense of your system: how is it performing at a given point in time?
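A scrape config like the one he describes looks roughly like this in prometheus.yml; the job name, target, and port are illustrative:

```yaml
scrape_configs:
  - job_name: "my-app"
    scrape_interval: 15s
    metrics_path: /metrics        # where the app exposes its metrics
    static_configs:
      - targets: ["my-app:8080"]  # hypothetical service and port
```

Once scraped, a PromQL query such as `rate(http_requests_total[5m])` turns a raw counter into a per-second rate for the dashboards mentioned next.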
00:22:06
Speaker
This also helps you build dashboards using PromQL: pie charts, bar diagrams, and time series graphs, whatever visualization makes sense and is appropriate for your data. With logs it's just a few lines, and even with metrics it's a few lines. With traces, it's comparatively a
00:22:28
Speaker
larger code change, because with tracing you need to instrument across your microservices, which means you need to pass headers to connect the spans: okay, this is the parent span, and this is the trace ID the trace is coming from. So
00:22:46
Speaker
what you'll do here: let's say you're performing a calculation, so whenever an input comes in, you perform addition, subtraction, multiplication, and division based on the operators you see in the equation passed to you. Whenever a request comes to the service, you perform this action, and let's say this calculator is architected in a way that
00:23:11
Speaker
a specific microservice performs addition, another service performs subtraction, another multiplication, and so forth. The frontend service takes the calculation request and passes it over to the different services based on the action that needs to be performed in
00:23:29
Speaker
the equation: okay, this is the multiplication I need done, this is the subtraction, and this is the addition. Whenever this happens, the trace is initiated on the frontend service, and each action offloaded to a different service is nothing but a span. Once the whole request is successfully performed and the answer is served back to the user, you've basically created spans from those different microservices, and in this calculation you can understand that, hey, the
00:23:59
Speaker
total time taken by this process is 20 seconds, and 18 seconds of that is spent in the multiplication service, which means it's the bottleneck for the job I'm trying to achieve here. The rest of the services finish in two seconds, but one specific service is adding 18 seconds of latency. So this is where it helps you understand, at a high level, how the request lifecycle
00:24:26
Speaker
flows through these different applications, and what exactly each service is doing and when. Let's say the multiplication service is only a bottleneck when you give it numbers higher than 1,000, or some bigger numbers. If you give it smaller numbers, it might perform really well, but the bigger the number, the slower the service might be. So you want to understand when exactly this happens. It's not just about
00:24:56
Speaker
the start time and end time, saying this is the time it consumed; it's also about capturing the metadata: this is the client ID, and this is the operation I was trying to perform.
00:25:09
Speaker
Whatever data you want to capture during that process, you capture during instrumentation; that's what you do in tracing. Coming back to your question: the way you start is to instrument your services this way, or there are some open source solutions that offer auto-instrumentation for traces. OpenTelemetry does that: if you have a Java application, it basically tries to extract traces from your Java runtime.
00:25:38
Speaker
But these are more high-level traces; they don't go deep into your business logic. They just give you, at a high level, what the overall execution at runtime was.
00:25:53
Speaker
I think there's no special sauce that just makes you automatically start getting tracing or metrics. So you mentioned OpenTelemetry. Are there other tools in the CNCF or Kubernetes landscape that someone can start using? Ultimately, TimescaleDB is the resting place for all this data, but there's a lot of framework that goes into it,
00:26:16
Speaker
and possibly you just talked about some changes to your application that get metrics there. So what are some of the tools out there that folks can use?
00:26:24
Speaker
Yeah, so in the CNCF today, just going back to the context of OpenTelemetry: OpenTelemetry was formed as a project in 2019, so you can think of the project as just two to three years old. At the KubeCon in San Diego that November, OpenTelemetry was formed as a merger between two projects: OpenTracing, and OpenCensus from Google. OpenCensus is the
00:26:51
Speaker
internal way of instrumenting at Google, for both metrics and traces. This was a bit confusing for users in the ecosystem: hey, which standard should I follow? OpenTracing comes with its own standard and OpenCensus comes with its own. Then the CNCF said, no, we're not doing this; let's take the best of both worlds and design a new standard, which is OpenTelemetry. So that was the history of how tracing worked, especially in the cloud native ecosystem.
00:27:20
Speaker
Those two projects are now declared end of life: you can still use them, but you won't get any support. OpenTelemetry, though, is in active development. And you might not know this: in the CNCF, after Kubernetes, OpenTelemetry is the second most active project in the cloud native landscape. So you can imagine how many. Oh, wow, that's interesting.
00:27:49
Speaker
integrations have been done in OpenTelemetry. So that's the project, and there is another project called Jaeger in the CNCF, which has also graduated, but most of the Jaeger maintainers and authors have moved to OpenTelemetry. Even so,
00:28:06
Speaker
Jaeger was a complete end-to-end solution for your tracing needs. It offers the Jaeger UI to visualize traces, it gives you Jaeger client libraries to instrument your applications, and then you have storage: Jaeger has backend storage support for Cassandra, Elasticsearch, in-memory storage, and Badger. So there are a bunch of backends Jaeger supports in-house, but
00:28:33
Speaker
as OpenTelemetry is the new standard and people are moving towards it, the recommended tracing solution today in the CNCF space is obviously OpenTelemetry. Even the Jaeger client libraries, if I remember correctly, are already in an end-of-life phase.
00:28:56
Speaker
So anyone looking at these metrics, logs, and traces needs to focus on OpenTelemetry, as that's the project for the future? Yep, yep, that's the project for the future. There are multiple ways of looking at it. If you're getting started with traces, then OpenTelemetry is the one. That's one side of OpenTelemetry, traces being its main focus and its
00:29:26
Speaker
core differentiator from other projects. But if you think about the future of OpenTelemetry, it's trying to unify all of observability: they have announced metrics, if I remember correctly, and they're also working on supporting logging. So OpenTelemetry will be something like one client, one instrumentation, for all observability data.
00:29:47
Speaker
So you don't need to wire up separate logger instrumentation to get the logs, plus log shippers like Grafana Agent or Fluent Bit to ship those logs from your applications and then store them in Elasticsearch or some other proprietary solution. Instead, what OpenTelemetry is trying to achieve is that you just use the OpenTelemetry client libraries to capture metrics,
00:30:11
Speaker
logs, and traces, and then ship them to the OpenTelemetry Collector, which you can think of as an observability processing pipeline. You can process the metrics, logs, and traces, and in the OpenTelemetry Collector you configure an exporter for where this data needs to be sent. If you're interested in sending it to a proprietary solution, you can just configure that backend, or you can send it to Promscale, the open source
00:30:35
Speaker
observability backend: you just forward the traces and metrics to Promscale and Promscale stores them. So this is the overall picture of how observability in the CNCF looks. If you look at the observability projects in the CNCF, Prometheus is for metrics and OpenTelemetry is for traces, but OpenTelemetry is evolving to unify metrics, logs, and traces in one project.
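A minimal OpenTelemetry Collector pipeline of the kind he describes, receive, process, export, might look like this; the exporter endpoint is a placeholder for whatever backend you use:

```yaml
receivers:
  otlp:                      # applications send OTLP over gRPC or HTTP
    protocols:
      grpc:
      http:

processors:
  batch: {}                  # batch data before exporting

exporters:
  otlp:
    endpoint: "my-backend:4317"   # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Swapping the exporter is all it takes to redirect the same instrumented data to a different backend, which is the decoupling he's describing.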
00:31:03
Speaker
And we have Fluent Bit and Fluentd, projects under the CNCF, which are basically log shippers for your logs. Those are the major observability projects. Got it. You mentioned Promscale, and I think I'm personally not familiar with Promscale. So what is it? Maybe you can help our listeners know that too.
00:31:24
Speaker
Yep. Promscale is an open source observability backend for your metrics and traces, powered by SQL. That's the definition of what Promscale is, but let me give a brief description of how it fits and how it serves your use cases. Currently, if you think about the sources of how you capture this data:
00:31:50
Speaker
you're using Prometheus for metrics, some log shipper to ship your logs, and OpenTelemetry or Jaeger client libraries to ship your traces. The sources are very much different projects, basically different standards. And again, if you want to store the data, the traditional way has been:
00:32:14
Speaker
if you want to store logs, you go to Elasticsearch; if you want to store metrics, you go to Prometheus or Cortex or Thanos; and if you want to store traces, you pick Jaeger or some proprietary solution. So if you want 360 degrees of observability from your applications, that means running three different platforms for your logs, metrics, and traces.
00:32:39
Speaker
Just imagine running all these kinds of services in your production or dev cluster: you're basically managing more observability applications than actual business applications. Your business applications are just five or six, but
00:32:54
Speaker
the observability applications, which give you observability into your business applications, are 5x or 10x the actual services you run in production, the ones actually generating money for your company. And they're consuming more compute on your cluster.
00:33:13
Speaker
It's not just about the resources and the deployment time. Just having a certain skill set to deploy these and manage these on an ongoing basis, that's also difficult. Again, I'm just hearing about this for the first time. So how do we solve that?
00:33:30
Speaker
Yeah, so that's where we have come up with Promscale. What we want to say is: hey, we are a one-stop shop for all your observability storage needs, and for getting all kinds of analytics, querying, and everything.
Challenges in Observability Solutions
00:33:43
Speaker
So if you look at the storage platforms, there are three different solutions, as I was saying. You need to go through different solutions, and you need a dedicated SRE team
00:33:53
Speaker
to manage them, and all these services more or less come as a bunch of microservices. It's not just one service you deploy and it keeps running. You need to scale them as your application scales: if you're scaling your application, it will be emitting more observability data, and to ingest that scale, you need to scale your observability platforms as well.
00:34:16
Speaker
So this is definitely a pain to manage and deploy accordingly.
Introduction to Promscale and tobs
00:34:22
Speaker
With Promscale, what we basically say is: hey, you just deploy Promscale. And when I say Promscale, it's nothing but the Promscale connector, a stateless service, and TimescaleDB, the database itself.
00:34:32
Speaker
So your OpenTelemetry Collector or your Prometheus just writes the data to Promscale, and Promscale processes this data into TimescaleDB. It writes all the data sent from the different sources into TimescaleDB using SQL, as TimescaleDB itself is a relational database.
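To make the write path concrete, here is a minimal sketch of the Prometheus configuration fragment this implies. The hostname `promscale` is an assumption (it stands in for wherever the connector runs); port 9201 and the `/write` and `/read` paths are Promscale's documented defaults:

```yaml
# prometheus.yml (fragment) -- points Prometheus at a Promscale connector.
# `promscale` is a placeholder hostname for this sketch.
remote_write:
  - url: "http://promscale:9201/write"
remote_read:
  - url: "http://promscale:9201/read"
    read_recent: true   # serve recent data from the remote store too
```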
00:34:51
Speaker
So just imagine if you could store all your observability data in one single database: you can correlate across all of it. Whereas with other solutions, the way the industry practices today, you're storing it in three different stores.
00:35:06
Speaker
And how do you correlate between these three? You're stitching it all together at the query layer, saying: hey, at this timestamp get a trace, at this timestamp get a metric, and at this timestamp get me a log. You're correlating at the query layer, but the data is not actually stored in the same database.
00:35:23
Speaker
If you could process, store, and query all in a single pane of glass, a single store, how easy it would be. You're basically optimizing your observability platform by 3x; otherwise, you need to run three different systems, and each platform is five-plus microservices. Here, you're running just a stateless service and the database itself.
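The single-database correlation idea can be sketched with Python's stdlib `sqlite3` standing in for TimescaleDB. The tables and columns below are made up for illustration; they are not Promscale's actual schema:

```python
import sqlite3

# Sketch only: metrics and traces in ONE SQL database, so a single JOIN
# replaces query-layer stitching across three separate systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metrics (ts INTEGER, name TEXT, value REAL);
    CREATE TABLE traces  (ts INTEGER, trace_id TEXT, duration_ms REAL);
""")
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)",
                 [(100, "http_requests_total", 42.0),
                  (100, "http_errors_total", 7.0)])
conn.executemany("INSERT INTO traces VALUES (?, ?, ?)",
                 [(100, "abc123", 350.5)])

# Correlate: for each slow trace, pull the metric values at the same timestamp.
rows = conn.execute("""
    SELECT t.trace_id, t.duration_ms, m.name, m.value
    FROM traces t JOIN metrics m ON m.ts = t.ts
    WHERE t.duration_ms > 300
""").fetchall()
for row in rows:
    print(row)
```

The point of the sketch is the JOIN: because both signals live in one relational store, "what were the metrics when this slow trace happened?" is one query, not three.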
00:35:47
Speaker
That's how easy it is. And the other interesting part: if you want to query metrics, you need to learn PromQL. If you want to query traces, there is basically no tracing query language yet in open source, so people use a UI and query by trace ID, by timestamps, or by some metadata.
00:36:13
Speaker
If you want to query logs, each open-source solution has its own query language for logs; each platform does its own thing. As an engineer or a developer, there's a hard learning curve: first to instrument your applications, then to store the data, and then, even to get insights or debug your services, you need to learn the query language.
00:36:36
Speaker
So it's definitely a hard learning curve, and the value proposition after deploying and setting all this up is low, because it involves so much learning and you need to be a real expert in each and every solution to understand what's happening deep inside the data you've captured. But with Promscale, we give you the superpowers of SQL. Just use SQL, and we abstract out the optimized schema for you. All you have to do is query this data from tables.
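As a sketch of what that SQL experience looks like: Promscale exposes metrics as queryable relations, so a PromQL-free query can be written in plain SQL. The metric name and exact columns below are illustrative assumptions, not a verified schema:

```python
# Illustrative only: querying a metric with plain SQL instead of PromQL.
# The metric name and columns are assumptions made for this sketch.
query = """
SELECT time, value, labels
FROM go_gc_duration_seconds
WHERE time > now() - INTERVAL '1 hour'
ORDER BY time DESC
LIMIT 10;
"""

# In practice you would run this through any PostgreSQL client or driver,
# since Promscale's storage is TimescaleDB, i.e. PostgreSQL.
print(query.strip())
```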
00:37:06
Speaker
Okay, so next question. In the intro, you said you are a contributor to two projects, right? One is Promscale and the other is tobs. Let's talk about tobs. What is it and how can it help me with observability? Yeah, so in the initial days, when we were thinking about the observability side of the house and building a product at Timescale for observability use cases, we were just trying to figure out
00:37:32
Speaker
how to get started with an observability setup in a Kubernetes cluster, basically. As I was coming from a cloud-native and Kubernetes background, I wanted to try things out. How do things work? How do I put them all together? As I said, these are three different platforms. What I had to do was find a Helm chart or a Kubernetes manifest to deploy Prometheus, and then go to the project of
00:37:58
Speaker
OpenTelemetry and deploy it, go to the Grafana project to visualize things and deploy that. So you need to jump across five or six different projects to set up observability in your cluster, just to evaluate: does it work? Does it fit your use case? Do you really want to do it? So there was no easy getting-started path for observability.
00:38:22
Speaker
What we decided was: how about having a one-stop shop for getting started with observability in your Kubernetes cluster? That's when we started this project called tobs, the observability stack for your Kubernetes cluster. What it basically does is stitch together, or package, all the observability Helm charts into one,
00:38:43
Speaker
and we give you this super Helm chart, which is the tobs Helm chart. You can just deploy the tobs Helm chart, and it deploys the kube-prometheus project for you. In the Kubernetes world, I think many are aware of kube-prometheus: it is basically the monitoring stack for your Kubernetes cluster.
00:39:03
Speaker
Basically, if you deploy kube-prometheus, it serves the monitoring needs, but you also want to serve the tracing and logging requirements. What we have done in tobs is add kube-prometheus as a dependency: if you just deploy tobs, kube-prometheus is deployed out of the box for you. Looking at what kube-prometheus includes, the name itself says it: Prometheus, the Alertmanager to fire alerts, and
00:39:31
Speaker
the Grafana dashboards for visualizing all the monitoring data. It also deploys kube-state-metrics to export metrics from your Kubernetes cluster, node-exporter to export metrics from your nodes, and the Prometheus Operator to manage the lifecycle of Alertmanager and Prometheus. So if you just deploy kube-prometheus, you get all these components out of the box. And there are also the Kubernetes mixins, which
00:39:57
Speaker
build pre-built dashboards for you. So you just deploy kube-prometheus and you can see how your API server is performing, how your etcd is performing, and the health of your deployments. Basically, if you have deployed a hundred deployments with five replicas each, how are the deployments
00:40:17
Speaker
performing overall, desired state versus current state? You get all these kinds of insights out of the box. It's the magic of the Kubernetes mixins: they pre-bake the dashboards for you.
00:40:30
Speaker
So that's one component, basically the monitoring in tobs via kube-prometheus. And we use the OpenTelemetry Operator. What the OpenTelemetry Operator does is manage the lifecycle of the OpenTelemetry Collector. So we deploy an OpenTelemetry Collector, and once you have your applications instrumented with tracing, you can just
00:40:50
Speaker
configure an environment variable to forward the traces your applications emit to this OpenTelemetry Collector, and the Collector will process the traces that come in. The third part where tobs is opinionated: we want to make getting started with observability easier, and at the same time achieve it with as few components as possible and make storage easier too. So in tobs, we also package Promscale.
00:41:19
Speaker
So once you deploy tobs, it deploys kube-prometheus for you, it deploys OpenTelemetry for you, and it deploys Promscale for you. Once Prometheus starts capturing metrics, it forwards them to Promscale. And once OpenTelemetry receives traces, it forwards them to Promscale.
00:41:39
Speaker
So you just think about tobs: you install it with one command, and all these observability components are deployed out of the box for you, without any networking or storage configuration. All the PVCs, everything, is created out of the box for you in the cluster. And then
00:41:58
Speaker
this data is stored in Promscale, and Promscale is up and running in the same cluster. This is all pre-built for you; you don't need to configure anything. You can do this with the Helm chart, and we also offer a tobs CLI. Today, some people feel that Helm doesn't do certain things well, or that it is a bit opinionated. So what we wanted to do is abstract out this Helm layer for users who don't like it, or who
00:42:26
Speaker
do not want to learn a new tool for deployment. So we publish a tobs CLI. You basically download this binary and run tobs install, and the whole stack gets installed for you. Then you run tobs grafana port-forward, and the Grafana pod is port-forwarded to your localhost. Say you want the password of the Grafana instance to log in. Ideally, what you would do is find the Grafana pod, check which secret it mounts, get into that secret,
00:42:55
Speaker
and decode that base64-encoded string to get the password, and then log in. So you need to jump between Kubernetes resources for this kind of operation. What the tobs CLI does: you basically run tobs grafana get-password, and it outputs the Grafana password for you from the resource. You use that password to log in to Grafana. And if you want to change the password for Grafana,
00:43:22
Speaker
Ideally, I think it's a bit tricky. I don't know it offhand; if you asked me right now to change the Grafana password, I would need to get into the Grafana docs to do it. But with tobs, you basically run tobs grafana change-password and give the new password.
00:43:36
Speaker
tobs changes the password for you. It's that easy. And if you want to do volume expansion for your Prometheus or TimescaleDB, you run tobs volume expand for Timescale, and it just does that for you; same for upgrades, installs, and all these kinds of actions. So it's really cool. And it's still in the initial phase of how observability should be conceived and evaluated, and how to get started if you are new to observability.
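The manual secret dance described above, which the CLI's get-password command abstracts away, ultimately comes down to a base64 decode. A minimal sketch with Python's stdlib; the encoded value is a made-up example, not a real credential:

```python
import base64

# Kubernetes stores secret values base64-encoded. This is the decode step
# you would otherwise do by hand after pulling the field with kubectl.
# The encoded value below is a made-up example for illustration.
encoded_password = "cHJvbS1vcGVyYXRvcg=="

password = base64.b64decode(encoded_password).decode("utf-8")
print(password)  # -> prom-operator
```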
00:44:06
Speaker
So it's in that phase, and we are also thinking about the future of tobs at this point. We are thinking of re-architecting the project with new standards, because once platforms evolve, they stabilize, but deployment mechanisms change over time. Before, we were using Ansible, then Terraform, and now we have so many mechanisms and approaches to deploy services onto the infrastructure, right? So it has been changing
00:44:35
Speaker
for years, and each organization, each company, has its own way of deploying; they follow their own tech stack to take these artifacts into the infrastructure and get them running. So things keep changing. On the deployment side, in my personal opinion, the mechanisms keep evolving, and we are just thinking about what we should be doing for tobs.
00:45:00
Speaker
We are open to feedback. I'd say to all the audience: definitely try out Promscale and tobs and share your feedback with us. Yep.
00:45:08
Speaker
No, I'm definitely convinced on tobs. If it's the one tool that helps me get started with observability and has all the different components that are needed, that's the place to go. And obviously, if you want to customize, and once you've become an expert, you can dive into each of the different components that are deployed as part of tobs, as you mentioned. But to get started, tobs might be the place to go.
00:45:33
Speaker
Yeah, it would definitely be the place I would look to go, given that a lot of this is generally new to me. I mean, I've worked with Prometheus in the past, and metrics, but traces and those kinds of things are definitely new. I'd probably go look at tobs, and I'm going to link it in the show notes after, so people who are also interested can go check it out.
00:45:52
Speaker
And now I have to ask, because you said the tobs CLI sets up everything for you, all the Helm charts done for you: can folks choose what storage is put behind Promscale or TimescaleDB, since this is sort of a data-focused podcast?
00:46:10
Speaker
Yep, yep. So I think, if I'm right, we use GP2 or something; I'm not exactly sure. Okay, I think that answers my question. Well, I think this is probably a good place to wrap up, because I know what I'm going to do after this show: go check out tobs, because I still need to learn a lot here. I've learned a lot through this show for sure, Vineeth, and I really appreciate you coming on the show and telling
00:46:35
Speaker
all of our listeners about this, and Bhavan and me as well. So like I said, we will link tobs, and we will link Promscale, and some of the other things we talked about today. But where can listeners find out more about you or the projects you discussed today? Any of those tidbits would be really helpful.
00:46:52
Speaker
Yep, yep. So I would suggest the easy way to get started with Promscale: just search for the Timescale docs, and you can see the Promscale section there. We also have a dedicated product page for Promscale. Whatever I'm talking about today in this podcast, anything around Promscale, is completely open source.
00:47:10
Speaker
We do not yet have the product itself; when I say product, I mean the hosted offering. It's all open source. If I were you, I would definitely jump in and try the new features we are building and see what's in store from Timescale in the observability space. The Promscale docs are definitely a great place to start. We have the Promscale repository on GitHub, as it's open source, so you can see all the docs and content there.
00:47:40
Speaker
Another good place to subscribe to or just
00:47:44
Speaker
keep an eye on is the Timescale blog. There is an observability section for all the blog posts the observability team publishes. We have some amazing blog posts: how to use Timescale Cloud as your observability storage, how you can empower Promscale with Timescale Cloud, and so on. For tobs, you can get started with the Promscale docs; there is a dedicated tobs section there. All of this is on the Timescale docs website, in the Promscale section.
00:48:13
Speaker
And tobs is open source too, so you can find the repo on GitHub. Feel free to reach out to us on the Timescale Slack to share your feedback on Promscale and tobs. And if you want to get involved, it's open source, so just create an issue or reach out to us in the TimescaleDB community Slack, in the promscale channel. Yeah, perfect, thanks, Vineeth. We'll definitely include all of these links in the show notes for people who want to get started.
00:48:39
Speaker
With that, Vineeth, thanks for coming on the show. We'll probably try to have you back on to dig into one of these other topics; each could probably be its own podcast, I'll put it that way. Until next time, thanks for joining the show. Well, okay, I don't know about you, Bhavan, but I learned a lot today.
00:48:57
Speaker
Vineeth is a wealth of knowledge when it comes to metrics, logs, and traces. Honestly, I've worked with Prometheus in the past, with our software, what we work on day to day, but there's so much more to learn. I just got introduced to a lot of things I need to start playing with, especially that tobs observability stack. I think the first takeaway I got from this conversation with Vineeth was that the learning curve can be high.
00:49:26
Speaker
It would be a bit much for me to learn all this stuff at once, so tackling something like tobs as a first go-around is probably the way I'll go about it. And speaking of which, the second thing I got out of this was having a common tool set or common framework, like having a single querying language for managing your traces and your metrics and all those things; it can really help with efficiency rather than trying to learn
00:49:54
Speaker
you know, the 14 different tools that you might install on your Kubernetes cluster. So what about you? I completely echo your takeaways; those are really important. But the one thing I wanted to highlight was observability itself, what it is, because I've been to AWS re:Invent and KubeCons, not just last year but
00:50:13
Speaker
in past years as well. There are always vendors using these buzzwords, and I have had a hard time figuring out what they actually mean. So just getting a 100-level idea of observability, how it breaks down into logs, metrics, and traces, and what the different projects are that can help you with each. And again, if you are an expert in this space, you already know what those are. But if you want to get started, you have something like tobs that you can use as an opinionated installation
00:50:42
Speaker
to set up the observability stack. But yeah, that's it for me. Great. Well, that brings us to the end of today's episode. If you want to catch our other episodes, we are on Apple Podcasts and Spotify; check us out there, leave a review, send us a message, and let us know what you do or don't like. I think next week we'll be talking about operators and how they interact with storage, which is always a good time. And with that, I'm Ryan.
00:51:12
Speaker
I'm Bhavan, and thanks for listening. Thank you for listening to the Kubernetes Bites podcast.