
Building Observable Systems with eBPF and Linux (with Mohammed Aboullaite)

Developer Voices

How do you monitor distributed systems that span dozens of microservices, multiple languages, and different databases? The old approach of gathering logs from different machines and recompiling apps with profiling flags doesn't scale when you're running thousands of servers. You need a unified strategy that works everywhere, on every component, in every language—and that means tackling the problem from the kernel level up.

Mohammed Aboullaite is a backend engineer at Spotify, and he joins us to explore the latest in continuous profiling and observability using eBPF. We dive into how eBPF lets you programmatically peek into the Linux kernel without recompiling it, why companies like Google and Meta run profiling across their entire infrastructure, and how to manage the massive data volumes that continuous profiling generates. Mohammed walks through specific tools like Pyroscope, Pixie, and Parca, explains the security model of loading code into the kernel, and shares practical advice on overhead thresholds, storage strategies, and getting organizational buy-in for continuous profiling.

Whether you're debugging performance issues, optimizing for scale, or just want to see what your code is really doing in production, this episode covers everything from packet filters to cultural changes in service of getting a clear view of your software when it hits production.

---

Support Developer Voices on Patreon: https://patreon.com/DeveloperVoices

Support Developer Voices on YouTube: https://www.youtube.com/@DeveloperVoices/join

eBPF: https://ebpf.io/

Google-Wide Profiling Paper (2010): https://research.google.com/pubs/archive/36575.pdf

Google pprof: https://github.com/google/pprof

Continuous Profiling Tools:

Pyroscope (Grafana): https://grafana.com/oss/pyroscope/

Pixie (CNCF): https://px.dev/

Parca: https://www.parca.dev/

Datadog Continuous Profiler: https://www.datadoghq.com/product/code-profiling/

Supporting Technologies:

OpenTelemetry: https://opentelemetry.io/

Grafana: https://grafana.com/

New Relic: https://newrelic.com/

Envoy Proxy: https://www.envoyproxy.io/

Spring Cloud Sleuth: https://spring.io/projects/spring-cloud-sleuth

Mohammed Aboullaite:

LinkedIn: https://www.linkedin.com/in/aboullaite/

GitHub: https://github.com/aboullaite

Website: http://aboullaite.me

Twitter/X: https://twitter.com/laytoun

Kris on Bluesky: https://bsky.app/profile/krisajenkins.bsky.social

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Transcript

Humorous Dot-Com Era Contracts

00:00:00
Speaker
The worst system monitoring setup I've ever witnessed was in the early 2000s during the dot-com boom. There was this company I was working with and they needed exactly three servers.
00:00:11
Speaker
And they signed a support contract worth the equivalent of $2 million in today's money. It was crazy back then. It's absolutely ridiculous today. And I said to them at the time, and I was only half joking: I'll make you a counteroffer.
00:00:26
Speaker
I will give you the best support you will ever witness. For half that price, for a mere million dollars, I will camp out next to the server rack for the whole year, and I will never leave their side. I will be constantly watching them.
00:00:43
Speaker
And they didn't accept my offer, which is a shame because they went bust a few years later, so I would have been able to leave the room early.

Evolution of System Observability

00:00:51
Speaker
Now, while we all contemplate whether that would make an excellent season of Squid Game, we must also contemplate whether the state of the art in system observability has improved since those days.
00:01:04
Speaker
And I hope it has, because I'm certain that the problems got harder. Our expectations for scale and uptime have gone up massively since then, meaning a lot of the systems we build these days are distributed by default.
00:01:20
Speaker
which in turn means we need techniques for building out different components. We start introducing things like microservices to manage the complexity, which in turn opens up building systems with many different languages and different databases.
00:01:37
Speaker
How do you stay on top of all this? How do you make sure it's performing well? And how do you debug things when they go wrong? I'll tell you how you don't do it.
00:01:48
Speaker
You don't do it in an ad hoc way. It's no good having a different monitoring technique for every piece in the system. System observability needs a unified strategy.
00:01:59
Speaker
You've got to shoot for something that's going to work everywhere, on every server, for every component written in every language. And I think that means you have to tackle the problem from the kernel level upwards.
00:02:13
Speaker
And that's where I need an expert.

Modern Monitoring with Mohammed Aboullaite

00:02:15
Speaker
Joining me to discuss the latest in monitoring, profiling, and observability strategies, from the kernel all the way to the dashboard, is Mohammed Aboullaite.
00:02:25
Speaker
He's a back-end engineer at Spotify, and he's going to take us through how you can peek into the Linux kernel programmatically with eBPF, how you don't have to, because several projects have already done it, and how you go from there to a complete monitoring picture of your system.
00:02:43
Speaker
We've got a lot to pack in in this one. We managed to cover everything from packet filters to cultural changes, all in service of getting a clear view of what happens to your software when it hits production.
00:02:56
Speaker
I'm your host, Kris Jenkins.

Recording Logistics and Anecdotes

00:02:58
Speaker
This is Developer Voices, and today's voice is Mohammed Aboullaite.
00:03:14
Speaker
And joining me today is Mohammed Aboullaite. How are you doing, Mohammed? Very good, very good. And good to see you again. It's been so long. It's been a whole week.
00:03:25
Speaker
Exactly. We were in Miami. We were supposed to record this under the glamorous Miami sun, and logistics got in the way. So now you're in a particularly glorious office room there, with the grey shining back at you, and we'll do the best we can.
00:03:41
Speaker
Yeah, and thanks for the flexibility. I got the calendar invite wrong, obviously, because I accepted it when I was in Stockholm and mixed up the time zone. I was just like, sorry for that. And thanks for the flexibility. Oh, no problem. I'm sure there's a link here between calendar problems and...
00:04:02
Speaker
overloading of disparate systems and having to reschedule long-running processes. I'm going to make that link, because we're going to talk about profiling and performance, and what to do when your machine gets overloaded. There we go.
00:04:20
Speaker
So, for context, you work at Spotify, and you've worked at some other interesting places. You have done profiling in what we might call the very wild, right? Yeah.
00:04:33
Speaker
Yeah, correct. And I thought my first question is: is the state of profiling today such that there is one universal good answer that works on every operating system and for every application, and we should start talking about that immediately, whatever it is?

Universal Profiling Solutions: Linux Focus

00:04:52
Speaker
Or is it just that there is no one-size-fits-all solution and we have to talk about the different approaches? As with anything in software engineering, it depends, right? And I think when we talk about production and such, we generally talk about Linux as the dominant operating system in that sense.
00:05:13
Speaker
So a lot of the solutions that I've worked with and worked on are primarily Linux solutions. So my experience would be primarily around Linux. I won't be covering Windows, obviously, because I have no experience whatsoever in deploying applications on Windows servers, or using Windows servers.
00:05:36
Speaker
I used it, I think, for a brief period of time, just getting access to it, and that's pretty much it. But my experience has been primarily around Linux. I just wanted to get that out of the way and clarify it for you and the audience as well. So, universally, I would say: I don't know.
00:05:53
Speaker
And that's obviously an acceptable answer, even if a lot of tech folks don't want to say "I don't know", especially in the age of LLMs. But yeah, I don't know about other operating systems. For Linux, though, we're getting close to it, and we'll probably dive into why: because of eBPF and how it's built into the kernel. So whenever you have a running Linux kernel,
00:06:20
Speaker
I think from a specific kernel version onwards, which should now be widely supported, eBPF can be there. And then there are a lot of tools built on top of eBPF for profiling.
00:06:33
Speaker
Okay. We are definitely going to dive into that. Yeah. I want to ask you one more contextual question before we start on that, though, which is: I was thinking about this, and I feel very out of date on what the state of the art of

Transition to Continuous Profiling

00:06:46
Speaker
profiling is. It's a good reason to have you on the show.
00:06:49
Speaker
I remember days of, you know, gathering logs from different machines, at least trying to put them in one place, and looking at those. You'd find things that seemed a bit weird, and you'd probably end up recompiling a suspected app with a --profile flag that was specific to that language or that compiler, and you'd slog away from there.
00:07:13
Speaker
And are we still in that state, or has the state of the art moved on to something better than remedial profiling? I think we are, in a sense, because we as human beings like the comfort zone. That approach has been used for a long period of time, we have a lot of tools that use it, and a lot of people are still using it and haven't got out of that bubble.
00:07:40
Speaker
But on the other side, we now have a lot of tools that enable us to do that in a much more modern, much more continuous way. And I believe the discussion we're going to have is more around what we now call continuous profiling.
00:07:57
Speaker
It's about how we can get a continuous feed of data, similar to what we have with metrics. How we can get continuous feedback about not only the health of our system, but the code that runs in our system: continuously verifying how memory is used in our applications, how the CPU is behaving, not only from a holistic application point of view, but going down to the code, to the lines of code. Which method is using that much CPU, and
00:08:33
Speaker
what code is basically stressing my memory that much, what's inside my heap, where my CPU spends a lot of time. All of that, profiling in general tries to answer. But the shift happening recently is that we are moving to continuous collection of that data. It comes with a lot of challenges, don't get me wrong, but it also comes with a lot of benefits, because we are continuously getting that feedback.
00:09:00
Speaker
We are continuously analyzing it. Before, we were getting a dump, analyzing it, and trying to figure things out hours, minutes, even days later. Now we see it when it happens, in real time, which is a big shift from the state that you mentioned.
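That per-method, per-line attribution typically arrives as sampled call stacks, which get aggregated into something like the folded-stack format that flame-graph tools consume. Here is a minimal sketch in Python; the call paths and function names are invented for illustration:

```python
from collections import Counter

# A stream of sampled call stacks (outermost frame first), as a
# continuous profiler might capture them many times per second.
samples = [
    ("main", "serve", "render"),
    ("main", "serve", "query_db"),
    ("main", "serve", "query_db"),
    ("main", "gc"),
]

# Aggregate into the "folded stacks" format commonly fed to flame-graph
# tools: one line per unique call path, plus how often it was sampled.
folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(f"{stack} {count}")
```

Each count is a direct proxy for where CPU time goes: the paths sampled most often are the ones burning the most cycles.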
00:09:22
Speaker
Okay. I was going to leave this till later in the podcast, but I have to bring it up now. I see a lot of problems with the idea of continually profiling all your applications.
00:09:37
Speaker
And the first is the sheer volume of data being gathered. Correct. So if that's solved, maybe we should talk about what eBPF is and then address how it solves it, because that seems like a showstopper to me.

Introduction to eBPF

00:09:55
Speaker
It is going to be a large amount of data that gets generated, and there is no right or wrong answer here. You have to experiment with it.
00:10:09
Speaker
You have to find the best use case for you, the best thresholds for you, and how you're going to benefit from the continuous profiling data.
00:10:20
Speaker
A simple rule of thumb that a lot of people like is: the more recent the data, the more frequently we keep it; and the more historical the data, the longer we make that window.
00:10:38
Speaker
An example: we want to keep the frequency for the last five minutes higher, and for the last minute higher still. Say we keep a sample every 100 milliseconds for the last minute.
00:10:51
Speaker
We can keep one per second for the last five minutes, one per minute for the last hour, and then expand on that. Having that snapshotting enables us to lower the amount of data that we keep.
00:11:08
Speaker
And then we process it on the server. So, sorry, are you saying then that, thinking of a web server, I might be able to go and see profiling data for every single function call that served a single web request within the last five minutes, but if I come a day later, I'm just going to get how long it took speaking to the database and how long it took to serve the whole request?
00:11:34
Speaker
I mean, you can get the whole data for every... We capture the data based on thresholds. There's a time span between snapshots of what's happening in the profiling data. So let's assume it's 100 milliseconds.
00:11:54
Speaker
And then you get the data for each 100 milliseconds. But check it over an hour: that's a lot of data. Over a day, it's even more. Over a week, it's a huge amount of data that you need to store, and the problem with storage is that it comes with a cost. So you can keep that 100-millisecond granularity for a week past.
00:12:21
Speaker
That you can do, but it comes with the cost that you need to save that data somewhere. So what I was talking about is: it is a problem, but there are ways around it. One of those ways is the fidelity of the data, the frequency at which you keep it. We can minimize it by keeping the fidelity higher closer to the time you are trying to look at,
00:12:54
Speaker
and then condensing and minimizing it as time passes. So you have less data, and obviously a less fine-grained view, but you gain in terms of storage and how much you store.
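The fidelity scheme described here (full resolution for the freshest data, progressively coarser buckets as samples age) can be sketched in a few lines of Python. This is a hypothetical illustration, not the scheme of any particular profiler; the tier boundaries and sample format are invented:

```python
from collections import defaultdict

# Hypothetical retention tiers: (max age in ms, bucket width in ms).
TIERS = [
    (60_000, 100),        # last minute: one bucket per 100 ms
    (300_000, 1_000),     # last 5 minutes: one bucket per second
    (3_600_000, 60_000),  # last hour: one bucket per minute
]

def bucket_width(age_ms):
    """Pick the aggregation granularity for a sample of a given age."""
    for max_age, width in TIERS:
        if age_ms <= max_age:
            return width
    return 3_600_000  # anything older: hourly buckets

def downsample(samples, now_ms):
    """Aggregate (timestamp_ms, cpu_ms) samples into age-dependent buckets.

    Older samples collapse into wider buckets, trading a finer view
    for much cheaper storage: exactly the trade-off described above.
    """
    buckets = defaultdict(float)
    for ts, cpu_ms in samples:
        width = bucket_width(now_ms - ts)
        buckets[(ts - ts % width, width)] += cpu_ms
    return dict(buckets)

# One sample every 100 ms for the past hour: 36,000 raw samples.
raw = [(i * 100, 1.0) for i in range(36_000)]
kept = downsample(raw, now_ms=3_600_000)
print(len(raw), len(kept))  # far fewer buckets, densest near "now"
```

For an hour of 100 ms samples, the 36,000 raw points collapse to under a thousand buckets, densest near the present, which is the storage trade being described.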

eBPF Security Model

00:13:12
Speaker
Right. So does that mean, if I go back an hour, I'll find I've just got the average time it took to call this particular function?
00:13:23
Speaker
Exactly. So instead of the 100-millisecond samples, for that hour you get one per second, or maybe per 10 seconds. We optimize for that. Okay. Then I think we need to dive into what this mechanism is, so we can start to see what kind of data we can gather. Yeah.
00:13:41
Speaker
So eBPF. I looked up the acronym for this: the extended Berkeley Packet Filter. I thought, that sounds like a firewall. A packet filter is a firewall, isn't it? It is. And, for the record, I've given a few talks about eBPF.
00:14:00
Speaker
And I always make the joke that eBPF and BPF have nothing to do with each other. The names are similar, but the functionalities are very different.
00:14:14
Speaker
BPF was meant to be a way to filter network packets. And the original idea behind eBPF was exactly that: we wanted to modernize BPF
00:14:30
Speaker
and then extend it. That's where the name comes from: an extended BPF. However, it evolved far beyond BPF. There is no version 2.0 of BPF, for example. It became way more modern, way more structured, and it went beyond the original intentions.
00:14:57
Speaker
So it started as a way to optimize networking, but then the same principles got used for monitoring, and the same principles for security.
00:15:09
Speaker
The idea that you can take programs written in userland and load them into the kernel basically unlocked a lot of potential, and, most importantly, in a secure way. So eBPF is basically a framework, a toolkit, that enables you to write a program in userland, which then gets compiled, verified, and loaded into the kernel as if it had been written for the kernel from the get-go.
00:15:43
Speaker
So the kernel can now have a set of modules, or micro-modules, and those modules can be written not only by the kernel developers; they can be written by everyone.
00:15:57
Speaker
That's where the extensibility comes from: we are extending the kernel, making it more pluggable and more modular, so that we can integrate bits on top of it to extend its functionality.
00:16:10
Speaker
And of course that's an oversimplification of what the framework does, but at its core it's basically that: we are writing programs that can be loaded into the kernel.
00:16:22
Speaker
Of course, it comes with a set of limitations. You write it in C or Rust, because that's what the kernel supports. You can write it with Go and Python, but that obviously ends up compiled like C
00:16:35
Speaker
to be loaded into the kernel. The program you write needs to follow a certain specification, because there is a step that verifies that the code is actually safe to run, since it's loaded into the kernel. But putting all of that aside, the idea that listeners and viewers need to keep in mind is that eBPF is simply a way to extend the kernel: we write programs and load them into the kernel so that, from the kernel's point of view, it's as if they were developed from the get-go to run in the kernel.
00:17:11
Speaker
Right. And this idea just unlocked a lot of potential. I mean, you can imagine running everything in the kernel. Okay, so my first simple question is: are they dynamically loaded?
00:17:23
Speaker
I don't have to recompile the kernel for this? You don't have to recompile the kernel. Good. Okay. Because as you were describing it, you were giving me flashbacks to compiling Linux kernels, and I don't need to go there ever again.
00:17:37
Speaker
Okay. So the next two questions are, how

Kernel-Level Profiling with eBPF

00:17:40
Speaker
flexible is that? I mean, we're talking about profiling, but could you write any arbitrary kernel code? And if so, what's the security model? Yeah.
00:17:50
Speaker
So you can't write arbitrary kernel code. I mean, you can, if it passes the verification step. When you try to load the program into the kernel, two important steps happen.
00:18:04
Speaker
You write the program, and it gets compiled into bytecode (which the kernel can later just-in-time compile to native code). That bytecode, the eBPF program, needs to be attached to an attachment point. But yeah, we'll get to that later.
00:18:21
Speaker
When you try to load it to that attachment point, to load it into the kernel, there is an important piece of software called the verifier. The verifier does what its name says: it verifies that the code you are trying to load is actually safe to run.
00:18:41
Speaker
It verifies that you do not try to access arbitrary memory, and that you do not try to expose bits of memory you don't have access to. It verifies, first of all, that you have permission to run the code.
00:18:56
Speaker
And then it verifies that the paths of your code end in a stable state. You can't have, for example, a while-true loop in code loaded into the kernel; that would basically stop the kernel from working, right? So it verifies that all the execution paths in your code end in a stable state.
00:19:17
Speaker
There is an end statement, and it is reachable from every path. So the verifier does a lot of heavy lifting to ensure that the code you write is actually safe to run and to be loaded into the kernel.
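As a mental model of that termination check (emphatically not the real eBPF verifier, which works on bytecode and checks far more, including memory safety and bounded loops), picture a walk over the program's control-flow graph: reject any cycle, and require every path to reach the exit. A toy sketch:

```python
def toy_verify(cfg, entry="entry", exit_node="exit"):
    """Toy model of one verifier property: every execution path from
    `entry` must reach `exit_node`, with no cycles (a cycle would mean
    a potentially unbounded loop). `cfg` maps a block to its successors.

    The real eBPF verifier does vastly more; this only illustrates
    the "all paths end in a stable state" idea from the conversation.
    """
    state = {}  # node -> "visiting" | "ok" | "rejected"

    def visit(node):
        if state.get(node) == "visiting":
            return False          # back-edge: potential infinite loop
        if state.get(node) == "ok":
            return True
        if node == exit_node:
            state[node] = "ok"
            return True
        succs = cfg.get(node, [])
        if not succs:
            return False          # dead end that never reaches exit
        state[node] = "visiting"
        ok = all(visit(s) for s in succs)
        state[node] = "ok" if ok else "rejected"
        return ok

    return visit(entry)

# A branch that rejoins and always reaches the exit: accepted.
good = {"entry": ["a"], "a": ["b", "c"], "b": ["exit"], "c": ["exit"]}
# A while-true style cycle: rejected.
bad = {"entry": ["loop"], "loop": ["loop"]}
print(toy_verify(good), toy_verify(bad))
```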
00:19:31
Speaker
Of course, here, just to make everyone aware: we are talking about the kernel. So, on top of the fact that you should not load third-party or arbitrary eBPF code into your kernel, the verifier helps you, but it's still your responsibility to make sure that the code you try to load is actually safe to run.
00:19:59
Speaker
Right, so just because it passes the verifier, it doesn't mean you can blindly trust the code you've been asked to insert. Yeah, especially if it's coming from a third party, because that's the kernel. The verifier is still continuously evolving, but as we know in software engineering,
00:20:18
Speaker
there are bugs, and someone might have discovered a bug before the Linux community has a patch. So someone could use it as malware, use it to break your kernel, use it to collect data. There are a lot of security concerns that go with it.
00:20:33
Speaker
And the best approach is to not load code that you don't know and haven't verified into your kernel environment. Yes. You've got to trust and verify, I guess. Exactly. Yeah. Okay.
00:20:46
Speaker
So, you mentioned bytecode. This is compiling to some kind of kernel virtual machine? Yes. Which presumably limits the footprint of the code, which is why the verifier stands a chance of working.
00:21:01
Speaker
Exactly. Yeah. That makes sense to me. Okay, so we are talking specifically about the kinds of eBPF programs that allow you to instrument the running kernel, and hence your programs.
00:21:18
Speaker
Yeah. How's that put together? What's it actually doing? I sort of imagine myself down in the cellars of kernel space, looking up towards where the application's running, wondering how I'm going to find it and instrument it.
00:21:33
Speaker
So, you know, our high-level applications need to call the kernel for everything: for accessing memory, for using the CPU, spinning up threads, accessing the disk, all of that.
00:21:48
Speaker
So whenever you call the kernel, the kernel has visibility over that. It knows what you need. It knows what bits of code are getting executed. It has visibility over everything.
00:22:00
Speaker
So the eBPF folks, especially the ones interested in profiling, said: we have the visibility to do that, so why not simply leverage the information that we have

Tools Utilizing eBPF

00:22:13
Speaker
and enhance it with some additional information? Because when a program gets executed, the kernel can see how long it spends on the CPU, how much memory it uses, all of that.
00:22:27
Speaker
We've known that since the age of containers, and even before. So we know we can instrument that bit. They basically added metadata, so this code is using that memory, and dumped it as profiling information, because the kernel has access to everything. So we just map it: oh, this function,
00:22:51
Speaker
this is the function that uses this amount of CPU. And then we collect that information and dump it into a store somewhere: either locally, as some continuous profilers do, or we send it to a backend that does the analysis afterwards. So there are two strategies there: either in-cluster, or off-cluster in a dedicated environment that does the post-processing.
00:23:18
Speaker
Right. So you might have a separate analysis team running a whole cluster of things, gathering from the main network. Yeah. And that's how, for example, Datadog does it, and probably some of the cloud environments do as well.
00:23:37
Speaker
They collect metrics from your environment and then send them to dedicated servers that do the analysis. So we have some sort of agent, in that case an eBPF profiler, that does that.
00:23:53
Speaker
And one of the magic things about eBPF is that you can share data between the kernel and userland programs. What this means is that you collect the data and save it in some sort of database, which is not really a database: it's just maps, eBPF maps.
00:24:13
Speaker
So you save it there, and because it's managed by the kernel, the kernel verifies that only that program has access to the data. So we're collecting the data and saving it in one place.
00:24:27
Speaker
And then your other program runs to collect that data and analyze it. You probably want to compress the data to reduce network traffic if you're sending it somewhere.
00:24:38
Speaker
So all that post-processing happens afterwards, because we don't want to block the kernel; we want the in-kernel operation to be as fast as possible. We dump the data, and then all the compression, optimization, and cleaning up happens in another userland program, before it's sent to another cluster or server that does the post-analysis.
00:25:06
Speaker
Right. So whilst the mechanism is completely different, mentally it's the same as someone dumping their web data into Mongo for someone else to process. Exactly. Yeah. Okay. That makes perfect sense.
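That split (a minimal in-kernel step that just bumps a counter in a map, with serialization and compression deferred to userland) can be mimicked in ordinary Python. Everything below is a simulation for illustration; in a real profiler the map lives in the kernel and is written by the eBPF program itself:

```python
import json
import zlib
from collections import Counter

# Simulated "eBPF map": call stack -> sample count. In a real profiler
# this structure is kernel-managed and populated by the eBPF program.
profile_map = Counter()

def on_sample(stack):
    """The hot path: do the minimum possible work, just bump a counter.
    In-kernel eBPF code must be this cheap to avoid slowing the system."""
    profile_map[";".join(stack)] += 1

def drain_and_ship(bpf_map):
    """The userland side: read the map out, then do the expensive bits,
    serialization and compression, before shipping to a backend."""
    payload = json.dumps(dict(bpf_map)).encode()
    bpf_map.clear()  # reset counters for the next collection window
    return zlib.compress(payload)

# Simulate a burst of samples from two (invented) call paths.
for _ in range(900):
    on_sample(["main", "handle_request", "query_db"])
for _ in range(100):
    on_sample(["main", "handle_request", "render"])

blob = drain_and_ship(profile_map)
# The backend would decompress and analyze; local counters are reset.
print(len(blob), len(profile_map))
```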
00:25:19
Speaker
The one thing I'm not getting here: I'm at the kernel level. Let's say I write a Python program that has a badly written for loop and allocates way too much memory.
00:25:31
Speaker
I'm at the kernel level, and I see this chunk of memory being malloc'd over and over. How do I stitch these things together, from the kernel-level call to the line of Python that's badly written?
00:25:49
Speaker
You mean how to know which... Yeah. When the kernel is madly allocating memory, it doesn't know that's because there's a for loop on line 27.
00:26:04
Speaker
But I want to know that as a programmer. So how do I connect the dots from user space to kernel space?
00:26:13
Speaker
I honestly don't know how that bit is actually managed. I've used some tools that do it; maybe if we dive into some of those tools, we would find the answer
00:26:25
Speaker
as to how that bit is actually done. But I would imagine it's basically: we know this bit is using that much, and we instrument it and enhance it with other bits of context.
00:26:38
Speaker
So I just don't want to throw anything I'm not sure of at the audience. Okay, but someone has crossed up and down that Tower of Babel to the point where I can see my Python program and the impact of it.
00:26:56
Speaker
Correct. All the continuous profiling tools based on eBPF do that. It's one of the building blocks of a profiler: knowing which bit of code is using that much CPU, that much memory, and all of that.
00:27:13
Speaker
So every eBPF-powered continuous profiling tool does that as well, and they've managed to crack it. Okay. Which bit of code is using that much data?
00:27:27
Speaker
I can imagine it's because, when the eBPF program gets called, you can enhance it with the context. So that context may be enriched as well to get this data.
00:27:40
Speaker
There are different bits that can be used. I can't say for sure how it's done, but an eBPF program has the context as well. When an eBPF function gets called, we have context about what we are calling and why, and all those bits can basically be
00:28:01
Speaker
glued together in order to get that information. Right. Yeah. I'm not a kernel developer, so that bit is a little nuanced to me. It's useful to know, since you've used this a lot, where the boundaries of your knowledge are, what you had to know, and what you've just learned, because it's interesting.
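For the curious, the piece left open here is usually called symbolization: the profiler captures raw instruction addresses from the stack, and a userland component maps each address back to a function (and, with debug info, a file and line) using the process's symbol tables; interpreted and JITed languages need extra runtime-specific help. A simplified sketch, with an invented symbol table and addresses:

```python
import bisect

# Hypothetical symbol table: (start_address, function_name), sorted.
# Real profilers build this from /proc/<pid>/maps plus ELF/DWARF data,
# or from runtime-specific helpers for JITed and interpreted languages.
SYMBOLS = [
    (0x1000, "main"),
    (0x1400, "handle_request"),
    (0x1900, "parse_body"),
    (0x2200, "alloc_buffer"),
]
STARTS = [addr for addr, _ in SYMBOLS]

def symbolize(address):
    """Map a raw instruction address to the enclosing function name."""
    i = bisect.bisect_right(STARTS, address) - 1
    return SYMBOLS[i][1] if i >= 0 else "<unknown>"

# A captured stack of raw return addresses, innermost frame first...
raw_stack = [0x2210, 0x1910, 0x1455, 0x100A]
# ...becomes a readable call chain the UI can show against your source.
print([symbolize(a) for a in raw_stack])
```

This is why a memory-hungry loop in your code shows up as a named function in the profiler's UI, even though the kernel only ever saw addresses.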
00:28:20
Speaker
So this makes me wonder: when I'm writing a program, knowing that it's going to be instrumented, do I change my program? Can I, should I, must I? No.
00:28:31
Speaker
In most cases, no, you don't have to. And that's one of the benefits of using eBPF, or a continuous profiling tool: it doesn't require you to rewrite your program.
00:28:45
Speaker
With some profiling tools, you need to add bits saying what you want to profile and why, in order to gather that information. With a lot of continuous profiling tools, you don't have to add those annotations.
00:29:02
Speaker
Especially with eBPF, you have all the information there for you. You have the lines of code, and a hierarchy of them as well: which one called which, and which one ended up using or blocking that much memory, that much IO, and that much CPU.
00:29:18
Speaker
You can still instrument your code if needed, don't get me wrong. In some use cases, you might want to instrument the code base because you are not getting the bigger picture.
00:29:32
Speaker
But as those profiling tools get richer and more widely used, they cover so many of the use cases that we now rarely come to a case where we have to instrument our code.
00:29:50
Speaker
While they offer the option, 90% of the use cases are covered. You might stumble into the 10% where you need specific data that is not covered,
00:30:01
Speaker
and then you might need to instrument your code in order to send that data to the profiler to be analyzed. But 90, 95% of the use cases are covered by those continuous profiling tools.
00:30:18
Speaker
And I have to mention, it's not only eBPF. eBPF is shiny, and it's moving toward what you mentioned at the beginning: standardizing the way we collect the data, making it a universal way not only to profile our applications but also to monitor and secure them.
00:30:39
Speaker
But there are tools that use agents to collect this data, such as, for example, Pyroscope. For each language, they have a dedicated way to gather the data. So it's not only eBPF, even if eBPF is now booming within the Linux and kernel community. There are other ways: you can install small agents, with a small footprint obviously, into your production environment,
00:31:08
Speaker
with low overhead, to collect this data. But back to your question: in most cases, no. You might run into it, but yeah. Okay. I do like the idea, if I'm understanding this correctly, of one tool that will work regardless of language or runtime.
00:31:24
Speaker
Yeah, and that's the power of eBPF. In the same way as containers managed to add this layer of abstraction, so we don't care about what language you're running.
00:31:36
Speaker
You just provide us with this container abstraction format, and then we will deploy it and build orchestration tools on top of it, and we took it even into the era of AI because of that abstract way of seeing things.
00:31:54
Speaker
eBPF added another layer of abstraction. As soon as you have a Linux kernel of a specific version or later, you can write an eBPF program that does the magic for you.
00:32:09
Speaker
That sounds like it could be full of lots of different ideas, but I'm going to try and stick to profiling and not drag us down a rabbit hole. Very tempting though it is. So I guess the next question I have to ask is, what's the overhead of this?

Efficiency in Continuous Profiling

00:32:25
Speaker
Because I have been in situations where profiling everything adds like 30% to your CPU.
00:32:33
Speaker
Yes, and it had been the issue for so long, until the 2000s, when Google decided to publish a paper on how they do large-scale profiling on their end.
00:32:57
Speaker
I forgot what they called it. They didn't call it continuous profiling; it was something like profiling for large data centers, whatever that name was. I'll find it and I'll put it in the show notes by the time this is published.
00:33:11
Speaker
Google was the first to publish a paper with a working initial version of a profiler, using pprof, I think, something like that.
00:33:29
Speaker
And that set the building blocks of building a profiler, in the sense that you collect the data, there is a profiler that analyzes the data, and then you need a way to store this data and a UI to see stuff.
00:33:48
Speaker
So they shared that in 2010-ish, I think. And then the industry just followed the path of Google, with variations on it. But it set the foundations of how profilers are built.
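The four-stage pipeline described here (collect samples, aggregate them, store them, visualize them) can be sketched in a few lines. This is a toy illustration with made-up function names and sample data, not any real profiler's implementation:

```python
import collections

# Toy stack samples a collector might emit: one call stack per sample,
# captured at a fixed frequency (stack names here are invented).
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "background_job"),
]

def analyze(samples):
    """The 'profiler' stage: aggregate raw samples into counts per stack."""
    return collections.Counter(samples)

def store(counts, db):
    """The 'storage' stage: persist aggregated counts, keyed by stack."""
    for stack, n in counts.items():
        key = ";".join(stack)
        db[key] = db.get(key, 0) + n

def render(db):
    """The 'UI' stage: list the hottest stacks first."""
    return sorted(db.items(), key=lambda kv: -kv[1])

db = {}
store(analyze(samples), db)
for stack, n in render(db):
    print(f"{n:3d}  {stack}")
```

Real systems differ in each stage (sampling in the kernel via eBPF, columnar storage, flame-graph UIs), but the shape is the same.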
00:34:05
Speaker
And if you check, even the audience can check now: for some of the tools, either open source or commercial, the architecture, if it's available, is very similar to that original paper.
00:34:18
Speaker
So we all benefited from that original paper that was published by Google. And at that stage, they must have been fairly confident that the overhead was small enough that they could run that on the entire Google infrastructure.
00:34:32
Speaker
Yes, correct. And you can imagine the paper was published in 2010 or so, but it had been running for a few years before that in Google's data centers in order for it to be published.
00:34:46
Speaker
So it was definitely run years before that. And one of the advantages is the small overhead. Basically, they unlocked that bit of how to collect continuous profiling information with minimal overhead across all those data centers.
00:35:03
Speaker
And if Google does it at that scale, then what's stopping others from doing it? A lot of companies followed afterward.
00:35:14
Speaker
Meta, back then known as Facebook, did it, Amazon and all of that. And then a lot of tools started to pop up, both open source and commercial. Yeah, when you're doing it at that kind of scale, you can't afford to wait for a problem to come along and then persuade a team to profile a specific application, right?
00:35:33
Speaker
Exactly. Because at that scale, you have a lot of feature teams or product teams, and then one team that is dedicated to the infrastructure. It's going to be a big team, but still, whenever you need to profile something, you go to that team, ask for the dump, and then go back again. It's going to take forever, and that team will be flooded with requests.
00:35:57
Speaker
So having that as self-service, providing this information continuously to the teams, has a lot of benefits, both in terms of getting the information fast, but also the teams can control what information they want to get.
00:36:19
Speaker
So back to your original question of the overhead. There is obviously an overhead. It depends on the language, it depends on the runtime, it depends on the tool that you're going to use.
00:36:31
Speaker
For eBPF programs, for the ones that are open source at least, they claim an overhead between 1% and 2%. Things like Pyroscope, Parca, Pixie, they claim that's the overhead they have, between 1% and 2%.
00:36:48
Speaker
Generally, it's between 1% and 5% if you want to run one in production. And that's because of a lot of factors.
00:36:59
Speaker
How frequently you want to collect the data, how much data you are collecting, where and when you do your processing of the data, the compression of the data, do you postpone it, do you send it to another server that handles it?
00:37:17
Speaker
That is obviously going to be off your CPU, but then there's a lot of transmission IO going on. Or you want to do some preprocessing in your cluster, on your server, before sending it to another server.
00:37:31
Speaker
And that would also take from your CPU. Pixie, for example, stores some data in your cluster, in memory I think, for a period of time. So that also takes from your server, but it's still within a respectable, acceptable threshold.
00:37:49
Speaker
So, yeah. General rule of thumb: between 1% and 5%. With eBPF programs, they claim it's less than 2%. So you can get it around 2%, which is a huge win compared to what we had previously with those heavy tools.
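A back-of-the-envelope model shows why sampling profilers can stay in the low single digits. The numbers below are illustrative assumptions, not measurements from any of the tools mentioned:

```python
# Rough model: overhead ≈ sampling_rate × cost_per_sample, per core.
# Both numbers are illustrative assumptions, not measured figures.
sampling_hz = 99           # a common choice (odd, to avoid lockstep with timer ticks)
cost_per_sample_s = 50e-6  # assume ~50 microseconds to capture and record one stack

overhead = sampling_hz * cost_per_sample_s  # fraction of one core spent profiling
print(f"estimated overhead: {overhead:.2%}")  # about 0.5% under these assumptions
```

Halving the sampling frequency halves the cost; a heavier per-sample path (deep stacks, symbolization on the hot path, synchronous network sends) pushes it toward the 5% end.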
00:38:04
Speaker
Yeah, yeah. I'm trying to do the kind of back-of-the-envelope calculations in my head. Anything less than 10% I'd be pleased with. Anything around 1%, I would be tempted to leave it continually running.
00:38:17
Speaker
Yeah. The problem is not leaving it continuously running. I mean, leaving it continuously running is a feature. The problem is what you're going to do with that amount of data.
00:38:34
Speaker
The problem is not getting the data, because I think that bit is somewhat solved; the overhead is not that much, so we can keep it running forever.
00:38:45
Speaker
You can get as much information as you want. But once you get that information, what are you going to do with it? Back to the discussion that we started this episode with: okay, I have this huge amount of data.
00:39:03
Speaker
Should I keep it all? Should I pay for it? Does it make sense to me? And then there's also the question that profiling needs to be actionable.
00:39:14
Speaker
There is no sense in having all that information if it's just overwhelming, if I don't know what to get out of the data or how to use it to gather any meaningful conclusions.
00:39:29
Speaker
So how much data is one question that you need to ask, because it comes with a cost. And then, how meaningful is the data for you to gather actionable insights? Yeah.
00:39:40
Speaker
These are two important questions that anyone using, or willing to use, continuous profiling has to answer, basically. Yes. That's a common problem with being able to see inside the black box, right? The first problem is seeing inside, and then that creates a nice new problem.
00:39:59
Speaker
How do I deal with this floodgate, right? Yeah. So tell me, do you want to tell me how we deal with managing the sheer volume, or should we go to the tooling that lets you make sense of it?
00:40:12
Speaker
I think we touched upon how we can manage the sheer volume. Some of it would be keeping just the latest bits.
00:40:26
Speaker
The fidelity goes down as the time span becomes larger, meaning that...
00:40:37
Speaker
We don't care about three-month-plus data. We don't want as much granularity and fidelity in two-month-old data.
00:40:49
Speaker
We can have a medium fidelity at one month. And then as time goes by, we try to shrink and limit the amount of data we want to keep, for example. So we don't keep all the information on our server, but try to shrink and condense it to lower the cost.
00:41:10
Speaker
And then, what do you do with the three-months-plus data? Do you throw it away? Do you keep it in an archive? Same question for the month-plus data.
00:41:21
Speaker
Probably you don't keep it in your primary database or primary data store. You put it into a backup or secondary data store that is way cheaper. So those are some techniques to lower the cost. Because in continuous profiling, because of this continuous bit, in 95% of the cases we are interested in the recent information.

Data Management Strategies for Profiling

00:41:46
Speaker
So either what's happening now, or what happened in the past week or the past month. Once it goes beyond that, it becomes kind of meaningless, or less useful compared to what I have now, because continuous profiling enables me to compare how my code is performing now versus yesterday, versus last week: what did I do in that time span that brought the performance down or up?
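The tiered-retention idea described here (full fidelity for recent data, coarser rollups as data ages, cheap archive beyond that) can be sketched as a simple age-to-tier mapping. The tier boundaries and names below are hypothetical examples, not a recommendation from any particular tool:

```python
from datetime import datetime, timedelta

NOW = datetime(2024, 6, 1)  # fixed "current time" for the example

def retention_policy(sample_time):
    """Map a profile sample's age to a storage tier and resolution.
    Boundaries are hypothetical, for illustration only."""
    age = NOW - sample_time
    if age <= timedelta(weeks=1):
        return ("primary", "10s resolution")     # full fidelity, recent data
    if age <= timedelta(days=30):
        return ("primary", "1min rollup")        # downsampled, still queried often
    if age <= timedelta(days=90):
        return ("secondary", "1h rollup")        # cheaper store, coarse trends only
    return ("archive", "discard or cold store")  # rarely, if ever, queried

print(retention_policy(NOW - timedelta(days=2)))
print(retention_policy(NOW - timedelta(days=45)))
print(retention_policy(NOW - timedelta(days=200)))
```

The point is that cost is controlled at write/compaction time: old samples are merged into coarser rollups or moved to cheaper storage rather than kept at full resolution forever.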
00:42:19
Speaker
Those are the sorts of questions that I'm trying to understand. Yeah, yeah. I would think I'm looking at recent data for, oh no, something suddenly got slower and we're panicking.
00:42:32
Speaker
And then the older stuff is, is it me or are we getting slower? Exactly. Are we slower than we were six months ago? Correct. Yeah. Because the third question I would ask is, we've got Rust and Python in our company and the Rust programs are faster, but are they significantly faster?
00:42:52
Speaker
That's a good question. And if you check the studies, or not the studies, what are they called, the benchmarks, they suggest that it's significantly faster.
00:43:11
Speaker
However, the benchmarks are basically simple code. You send a request or you execute a bit of code millions of times, and then you compare it.
00:43:25
Speaker
But then, for example, that's a question that... I mean, I'm a Java developer myself. So Java is slower, Rust is up here. But then you add features to them both, right?
00:43:37
Speaker
So you take Rust, and you add features to Rust, because you're not going to run in production code that just prints hello world, does a simple thing, or executes one request.
00:43:49
Speaker
You would add more and more features to it. So as you're adding features, both of them become slower and slower. Rust may still be a little bit faster as we're adding more features to it.
00:44:03
Speaker
But you can see that the distance between the two languages, or different languages, is going to close as we're adding more features, because that's the way software evolves. We're adding code that might make things slower. So it's not only that the runtime might be faster; we're also adding more features to it.
00:44:27
Speaker
That will make it a little bit slower. And then we might or might not see that big of a difference in our production environment. That's probably one of the things that continuous profiling might help with, or might not, depending on the complexity of the code. So if we follow the benchmarks, yes, definitely Rust is faster compared to, probably, Go or Java.
00:44:52
Speaker
But as we start to add more and more stuff to it, the complexity of the application takes away from that. And our programs become, bit by bit, slower and slower. And that's where some of the enterprise languages probably benefit in the long run, because they are optimized for that.
00:45:12
Speaker
So just to say that with the benchmarks, even if they say that one language is slower than another, it also depends on the use case, what you're running, and how you're writing your code

Programming Language Performance

00:45:23
Speaker
and all of that. So it's not as simple as that.
00:45:26
Speaker
If your program mostly waits for user input, then the thing you optimize is sitting behind the keyboard, right? Exactly. But it'd be interesting to see that kind of data and say, okay, in the real world, in our company, this is how it's actually playing out versus the benchmarks. I think that would be fascinating.
00:45:43
Speaker
Okay, so let's get into the kind of reports that you can get from one of these tools. So the tooling is a slightly separate thing from eBPF.
00:45:55
Speaker
There are analysis tools for this data?

Analyzing Profiling Data

00:45:58
Speaker
So they are packaged as one. When you use a tool, you use it as one. But the architecture is different; there are multiple components to it.
00:46:10
Speaker
Some of them are based on eBPF. So eBPF is one of the ways to gather the data. Another way is using an agent to get this data.
00:46:20
Speaker
And the third option is to instrument your code, as I mentioned, to send this data off. And then you would have a profiler. That's the heartbeat, the backbone of the profiling information, the core.
00:46:37
Speaker
And then we need a way to analyze the data, and a way to see and visualize the data. So that architecture is what the Google paper, back to it again, described, and what all the solutions have in common.
00:46:54
Speaker
They have bits of difference here and there, optimize things differently here and there, but they all share the same components, basically. Is there much to choose between them? Are we arguing over which one has prettier graphs?
00:47:10
Speaker
It's not only that. How much it costs is definitely important. How it's optimized for the data. Is it open source or commercial?
00:47:22
Speaker
If it's open source, how much is the cost of running it? Is there support for it, especially if you're going for the enterprise world? If it's commercial, then how much am I paying?
00:47:35
Speaker
And that cost tracks how much data you want out of it. So it's not how nice the graphs are, but most importantly, how much data, and how granular the data, I would get. I think that's the biggest factor. And they all share similar graphs.
00:47:56
Speaker
So we'd have flame graphs; whenever we talk about profiling, flame graphs come to mind. They all offer flame graphs to see which piece of code is using what amount of CPU.
00:48:14
Speaker
You can see the memory as well. You can compare between now and a period of time, or two different periods of time. They offer a way to filter, obviously, so we can pick CPU, memory, IO, all sorts of stuff.
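The "compare two periods" view can be sketched as a diff over folded stacks, the semicolon-joined text format popularized by Brendan Gregg's flame-graph tooling. The stack names and sample counts below are made up for illustration:

```python
# Folded-stack sample counts for two time windows (made-up data):
# "frame1;frame2;frame3" -> number of samples observed in that stack.
last_week = {"main;handler;parse": 120, "main;handler;db_query": 80}
today     = {"main;handler;parse": 125, "main;handler;db_query": 310}

def diff_profiles(before, after):
    """Per-stack sample delta; positive means the stack got hotter."""
    stacks = set(before) | set(after)
    return {s: after.get(s, 0) - before.get(s, 0) for s in stacks}

delta = diff_profiles(last_week, today)
hottest = max(delta, key=delta.get)
print(hottest, delta[hottest])  # the db_query stack regressed the most
```

A differential flame graph is essentially this diff rendered visually, so a regression like the database path above jumps out without reading numbers.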
00:48:36
Speaker
This is making me think of the Chrome browser console, where I can get flame graphs for my running JavaScript. It's all very nice, but it will just tell me what's happening in the browser, and I want that for the entire system, right?
00:48:50
Speaker
Yeah. So continuous profiling is basically that, for your application or your applications. Okay. So what kind of questions do you ask of a system like that?
00:49:03
Speaker
Are you just trying to... I mean, where do you begin?

Identifying System Performance Issues

00:49:06
Speaker
It's a bit needle-in-a-haystack, isn't it? You're looking for a report that says this thing is slow. Are you waiting for someone to say the web server's a bit slow today, and then you dig in?
00:49:18
Speaker
How do you start to know what to look at? So um I think if if a system is is slow, um you would have traces.
00:49:36
Speaker
I mean, everyone probably knows the three pillars of observability: we have traces, metrics, and logs. Things are slow. What's the difference between traces and logs?
00:49:50
Speaker
So logs are contextual data from your code. Basically, you are saying: code executed here, there is an error here. You instrument your application to log stuff.
00:50:04
Speaker
Traces are trying to correlate the events, to correlate the journey of a request between multiple services and components.
00:50:16
Speaker
They're widely used in distributed applications. Logs, meanwhile, are within a single application: you can see the events that you added in your code, and follow a single event or request in your application and how it went.
00:50:37
Speaker
It helps you to understand how the request behaved in your application. And distributed traces try to find the correlation within a request, and follow the journey of that request across multiple components. It's very useful in distributed systems and microservices architectures.
00:50:58
Speaker
Yes. So what you're saying is, if my user says, I tried to create an account and it was really slow, I need to somehow trace that request through the user microservice to the account registration microservice, and know that those two calls on two different machines are...
00:51:16
Speaker
...one semantic thing. Exactly. So for the user that is one transactional, atomic operation. While for us, it could be the gateway, it could be the account service, could be another service, could be the database.
00:51:32
Speaker
So the traces help to understand where that bottleneck comes from. Yes. And once you identify the bottleneck, you want to understand why this application performs slowly. So then you go and check the metrics and the logs of that application.
00:51:49
Speaker
The metrics might tell you that the CPU runs high, or there is a lot of wait time, or we're just sitting idle waiting for an IO operation, or there is memory that is...
00:52:02
Speaker
...heavily consumed in that application, or the logs might say similar things. But you still don't know in which part of the code this issue actually is.
00:52:15
Speaker
And that's where the profiler comes in. So if you're trying to understand why my application is getting slower or spinning a lot of CPU, you can go and check the bits of code that do that. It might be a loop that performs poorly, so it consumes a lot of CPU time and slows everything down.
00:52:39
Speaker
And for memory, maybe there is a memory leak happening somewhere that you weren't aware of. You can see it in the metrics, but you don't know what objects or what methods are causing that leak. So profiling can help you with that. Generally, that's the journey.
00:52:55
Speaker
So if an error is happening, you have an error, and then you try to boil it down to where exactly this error is coming from.
00:53:07
Speaker
Then, once the maturity goes up, you start to onboard it as part of your health metrics, in the sense that you can include it as part of your post-deployment routine.
00:53:26
Speaker
You deploy something, and then you can compare it side by side. This bit that I added, did it have a significant impact on memory, a significant impact on CPU?
00:53:39
Speaker
And then you can take it even a little bit further: you can add some alerts with thresholds. Like, if the CPU is more than 10%, send me an alert.
00:53:51
Speaker
If the CPU is more than 5% or the memory is more than 5%, send me an alert. It could be a low-urgency alert, just: I've noticed this, you need to be aware. And then you can throw a little bit of AI into it and make it more dynamic, so it analyzes things for you.
00:54:09
Speaker
Some solutions offer that as well. It notices: post-deployment, there is a shift in the patterns between the previous deployment and this deployment.
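The simplest version of such a post-deployment check is just a thresholded comparison of resource shares before and after a deploy. The metric names and the 5-point threshold below are hypothetical, for illustration:

```python
def check_deployment(before, after, threshold=0.05):
    """Flag metrics whose share of resources grew by more than `threshold`
    (absolute) after a deployment. Values are fractions of total CPU/memory."""
    alerts = []
    for metric in after:
        delta = after[metric] - before.get(metric, 0.0)
        if delta > threshold:
            alerts.append((metric, round(delta, 3)))
    return alerts

# Hypothetical per-service resource shares before and after a deploy.
before = {"cpu:checkout": 0.12, "mem:checkout": 0.20}
after  = {"cpu:checkout": 0.19, "mem:checkout": 0.21}

print(check_deployment(before, after))  # CPU grew by 7 points -> one alert
```

More sophisticated setups replace the fixed threshold with baselines learned from history, which is the "throw a little bit of AI into it" the guest mentions.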

Cultural Shifts for Continuous Profiling Adoption

00:54:20
Speaker
Yeah. And when you've spent a whole sprint or two doing performance optimization, you want to be able to say, look, I made it this much faster, right? Exactly. You want quantifiable credit for your end-of-year review. Who doesn't?
00:54:33
Speaker
Okay, but you've got to explain the technical details of this particular thing for me. Someone makes a request to the web server, it goes to a user registration thing.
00:54:45
Speaker
Yep. How do I stitch that together? That seems to me like I would have to make some kind of code change to be able to connect those two calls together. You mean for the tracing?
00:54:56
Speaker
Yeah. How does the tool that's constructing this trace for me know that this request over there resulted in that API call over there?
00:55:09
Speaker
So, generally, they...
00:55:15
Speaker
...modify the context. So there is a request that you make, and generally, for gRPC we extend the context, or for HTTP we add it in the headers.
00:55:29
Speaker
So there is a request ID, and there is, I forget, a session ID or something else. And then we propagate. The request ID is the one from point A to point B, and the session ID goes for the lifetime of the request, of the whole session, basically.
00:55:49
Speaker
Combining those two, we're able to correlate what the session, or the request in that case, went through; probably they're called span and trace.
00:56:03
Speaker
I forget the exact names, but that's basically how it works. Using these two fields, we're able to reconstruct the journey of the request, from point A to point B to point C to point D, and also measure the time it spends at each and every step.
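What's being described here is, in today's terms, the idea behind W3C Trace Context: a trace ID that stays constant for the whole request journey, and a span ID minted fresh at each hop, carried along in headers. A simplified sketch (real implementations use the standardized `traceparent` header format, which this code does not reproduce):

```python
import secrets

def handle(incoming_headers):
    """Each service: reuse the incoming trace ID (or mint one at the entry
    point), and create a fresh span ID for its own unit of work."""
    trace_id = incoming_headers.get("trace-id") or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    parent = incoming_headers.get("span-id")  # lets the backend rebuild the call tree
    # ... do the actual work, report (trace_id, span_id, parent, timings) ...
    return {"trace-id": trace_id, "span-id": span_id}

# Simulate gateway -> account service -> database layer:
hop1 = handle({})    # entry point sees no headers, so it mints the trace ID
hop2 = handle(hop1)  # downstream services propagate it
hop3 = handle(hop2)
print("one trace id across 3 hops:", hop1["trace-id"] == hop3["trace-id"])
```

The tracing backend then groups all spans sharing a trace ID and uses the parent links and timestamps to draw the request's journey.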
00:56:21
Speaker
Right. But is that something... if I've got a web server sitting in between a database and a load balancer, I can imagine, from what you said, that the web server would say...
00:56:34
Speaker
...the load balancer is giving me this session ID, I now need to pass it on to this SQL call or something. Yeah. So I would have to make a code change to pass the session IDs around?
00:56:46
Speaker
So generally, we use libraries to do that for us. The ecosystem has evolved so that now these things are embedded in most of the frameworks.
00:57:01
Speaker
For Spring, for example, there is Spring Cloud Sleuth; I think that was the name, and it offers you that bit. I'm a Java developer, so that's my example. But I imagine there are equivalent things in the cloud-native ecosystem as well.
00:57:17
Speaker
There is Envoy, which offers you a way to handle those bits. There are a lot of tools that offer to do that; you just need to embed them in your application. So the entry point is aware: oh, I'm the entry point.
00:57:37
Speaker
There is no span ID or session ID or request ID, so I need to be the generator of that session or that transaction or that request. And then it just gets propagated over and over again until it comes back, and then it's stamped in, to save the data and the whole history. That's it, basically.
00:57:58
Speaker
Right, yeah. You're giving me a lot of flashbacks during this episode, but I can sort of see how Java would be in the background attaching a thread-local variable, which would then get passed to my ORM and, yeah, okay.
00:58:12
Speaker
Correct, yeah. So mostly you want to arrange that it happens by magic, but presumably in some frameworks or languages you do have to step in. Exactly. I mean, it's not happening by magic, but if you use some of those frameworks, you won't even think about it; it just happens.
00:58:29
Speaker
But I think the idea is just that: they stitch context information in, enabling them to collect that data, then gather it back and stitch things together.
00:58:42
Speaker
Yes, okay. I think OpenTelemetry has support for it as well; it's become the de facto standard when it comes to monitoring and all of that. Right. This implies, I mean, it's called continuous profiling as a technique, but it implies that it also has to be ubiquitous profiling.
00:59:00
Speaker
You'll only get those benefits if all the machines are always profiling. Yes, correct. Yeah.
00:59:12
Speaker
That's a big ask for a team, maybe. I mean, all the machines always profiling obviously comes with a cost. And we're living in an era where you pay per node or instance.
00:59:26
Speaker
So the more instances you have, the more likely you'll have to pay a big load of money. And that, for some, might be beyond their capacity.
00:59:37
Speaker
So you can restrict it to your critical services: the ones you know are consuming a lot of resources, where you want to optimize those resources, or at least keep those services healthier and more performant, and which you know bring the most value.
00:59:58
Speaker
So you probably want to invest in those first, and then, when it proves its value, or when you're having a great year, you can afford the luxury of generalizing it to other services.
01:00:11
Speaker
So I believe there might be trade-offs, as you mentioned, because the ideal scenario would be to have it everywhere. Few companies have that luxury.
01:00:23
Speaker
So you might want to limit it to the critical services, start there and see if it brings value, and make the teams aware of it, because it also comes with a mind shift, a mental shift, from traditional profiling or the pillars of observability. It needs to be proven to bring value.
01:00:49
Speaker
So allowing that time for adaptation and culture change and all of that is also important. Oh yeah, because you probably need to get everyone in the organization to start buying into it.
01:01:01
Speaker
Exactly. I mean, we as human beings don't like changes. We don't like changes that someone else has told us we have to make.
01:01:13
Speaker
Exactly. And even if science proves that it brings value, we would find a way to say it doesn't. So allowing the time to shift that, to prove value, and allowing time for adaptation also needs to be taken care of. Because...
01:01:35
Speaker
The data is there, the graphs are nice, you can do whatever you want with it. But if there is no developer that can take those insights, take the data and turn it into an action to optimize things, to improve things, then for the company it's just a cost that they might want to get rid of, because it has not proven its value. So there's a little bit of adaptation in there, making sure that it brings value to the developers and that it's part of their process of continuous improvement and all of that.
01:02:15
Speaker
Yeah. So does this mean that, when you've done it, you're trying to show, look how easy it is to profile your application, rather than, I found a problem with your application and here's the data you need to go and fix it? Maybe both.
01:02:35
Speaker
Maybe: oh, it's easy to set up, it can be onboarded as part of your morning routine. You just come in, open it at an accessible URL, check the stuff. And then if you find a pattern that is weird in there, go and look into it and improve the performance of the application.
01:02:55
Speaker
Another way to get buy-in is when there is an issue: there is no greater way of proving value than fixing that issue using those tools. So it's going to be a mix of both strategies.
01:03:08
Speaker
You want to get buy-in that the tools are easy to use, but you also want to prove value. And if you manage to prove it when everyone is hands-on on something, when everyone is focused on something, then that would be a great one as well.
01:03:23
Speaker
Yeah, yeah. That makes me think of one more practical question. This is a kernel-level plugin, so can I roll it out to many servers and dynamically switch it on and off?
01:03:39
Speaker
You mean the eBPF stuff? Yeah, so can I just put it on all my clustered servers, maybe not gathering anything, and say, okay, we'll just flick the lights on for that one and take a look at it today?
01:03:58
Speaker
Is it trivial to switch them on and off?
01:04:03
Speaker
It depends. I mean, trivial is just how easy your program is to use. It can be a configuration patch, where you configure the services that you don't want to be included; it could just be a matter of configuration.
01:04:22
Speaker
And this configuration can be centralized. You go to one place, configure it, and the configuration is dispatched everywhere. The eBPF program then reads that configuration: if it's on, it gathers the data; if it's off, it just ignores it.
01:04:36
Speaker
It's also a matter of how well your program is written. That's talking about eBPF programs in general. But with continuous profiling, I think yes, you can do something like that if you want.
01:04:52
Speaker
Just keep it running there. And if not, just turn it off, not collecting any data and not adding any overhead for you. I'm going to make that more specific, because I get your general idea, but maybe if I couch it in Java-ish terms: I've got a machine that I set up earlier and I want to either switch profiling on or off.
01:05:14
Speaker
Is it as simple as calling an MBean on that running JVM, or am I redeploying a whole Kubernetes pod with the new settings?
01:05:25
Speaker
No, you can do it at runtime. Dynamically at runtime. Yeah, you can do it dynamically; that's the configuration I mentioned. It also depends on the configuration options that you have in your program, basically. Because it's a program, you can do anything with it.
01:05:40
Speaker
As long as you've planned ahead. Exactly. But if not, as you mentioned, you would need to redeploy it. But you don't need to recompile the kernel; like any program, you just deploy it, and then it takes care of itself.
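The centralized on/off switch described above can be sketched as an agent that re-reads a config source on every collection cycle. Everything here (the config dict standing in for a config service, the service names, the tick loop) is hypothetical, for illustration:

```python
# A profiling agent that re-reads central config every cycle, so services
# can be toggled at runtime without redeploying anything.
central_config = {"checkout": True, "search": False}  # stand-in for a config service

class ProfilingAgent:
    def __init__(self, service):
        self.service = service
        self.samples = 0

    def tick(self):
        """One collection cycle: only gather data if the flag is on."""
        if central_config.get(self.service, False):
            self.samples += 1  # stand-in for capturing a stack sample

checkout, search = ProfilingAgent("checkout"), ProfilingAgent("search")
for _ in range(5):
    checkout.tick()
    search.tick()

central_config["search"] = True  # flip the switch centrally, at runtime
search.tick()

print(checkout.samples, search.samples)  # 5 1
```

In a real eBPF-based profiler the same idea would typically be expressed by the user-space agent writing a flag into a BPF map that the kernel-side program checks, but the control flow is the same: config in one place, behavior changes everywhere without redeploying.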
01:05:55
Speaker
OK, so I can set myself up for this kind of profiling. Yeah. OK, so I've got a way of doing it that seems practical, has a low enough overhead that I might set up my whole cluster to run this.
01:06:07
Speaker
Yep. I can see the data management problem, but you've given me some ideas about mitigating that, reporting, getting back to the developer. I see the picture. I think you need to start giving me some specific recommendations.
01:06:20
Speaker
Which tools would you pick for this strategy? I like Pyroscope because it offers the best of both worlds.
01:06:32
Speaker
It has support for eBPF, but if your org, for one reason or another, doesn't want to start on an eBPF journey yet, it has an agent alternative that you can use for specific languages and runtimes.
01:06:49
Speaker
Oh, because presumably the eBPF approach needs root access, but the agent I can just run in user space. So if I've got a security team, I don't need to have that argument.
01:07:00
Speaker
Yeah, exactly. As I mentioned, it has a bit of both worlds. It has this enterprise flavor that makes it more enterprise-y.
01:07:12
Speaker
And it's pleasing to that enterprise world. It tries to find shortcuts and meet you where you are, rather than trying to move you to another runtime.
01:07:26
Speaker
So that's what I like about it. And Pyroscope is from Grafana, and a lot of companies have Grafana as well, so the tooling would be based on Grafana, which many are familiar with already.
01:07:42
Speaker
From the cloud side, Datadog has their own. And when we were at Miami, one of the speakers was using Datadog and they were very happy with it. So if you want something that comes with ease of use, where you don't want to bother with deploying it and managing it yourself, Datadog takes care of that.
01:08:14
Speaker
There are two other open source tools. Parca is one; it's really good. Then my favorite, though it's still early days, it's getting there, is Pixie. It's a CNCF project by, I forgot the name again, New Relic.
01:08:36
Speaker
Yeah, by New Relic. So it's more than a continuous profiling tool; it tries to be a monitoring tool. It tries to combine profiling together with the metrics part.
01:08:52
Speaker
But it's cool. It's open source, but it also offers a way to onboard it as part of New Relic.
01:09:02
Speaker
It has this New Relic flavor as well. So it's my favorite tool so far. But again, it's still a little bit early days for it. Not that widely adopted yet, but it's getting there. Since it's a CNCF project, it has a larger community, so it's improving bit by bit.
01:09:23
Speaker
Okay. I kind of want to ask you which one you use at work, but I suspect then the Spotify legal team will dive on this podcast, so I'll leave that question entirely. I think that gives me a complete picture. Maybe I need to go and play with one of these.
01:09:41
Speaker
Where would you start if it's just out of pure curiosity? Pixie? I would start with Pixie, yes. They have a cloud version too. You install the Pixie agent on your cluster, and then, last time I played with it, they had a cloud version that basically gets the data sent to it and lets you visualize it. So Pixie would probably be the first, and then Pyroscope maybe the second, because it's well integrated, easy to use, and it's part of the Grafana ecosystem. So it's nicer.
01:10:21
Speaker
Cool. I'm going to go and check those out. Mohammed, thank you very much for joining me. And I hope when you get to the end of this recording, you don't think the elapsed time was too long or too short.
01:10:34
Speaker
It was very enjoyable. Great. Thank you for having me. Thank you. Thank you, Mohammed. As always, the show notes are the place to head if you want links to anything we discussed.
01:10:45
Speaker
And before you head there, please do take a moment to like and rate this episode, and maybe share it around, because it all helps other people find this. The algorithm decides that if you liked it, other people will like it, and off we go.
01:10:57
Speaker
And that helps share the knowledge, which is the whole point of this podcast. Please do make sure you're subscribed so that you can find us in time for the next episode. And until then, I've been your host, Chris Jenkins.
01:11:09
Speaker
This has been Developer Voices with Mohammed Aboullaite. Thanks for listening.