
Making Apache Kafka Diskless (with Filip Yonov & Josep Prat)

Developer Voices
1.2k plays · 2 hours ago

How do you retrofit a clustered data-processing system to use cheap commodity storage? That’s the big question in this episode as we look at one of the many attempts to build a version of Kafka that uses object storage services like S3 as its main disk, sacrificing a little latency for cheap, infinitely-scalable disks.

There are several companies trying to walk down that road, and it's clearly big business - one of them recently got bought out for a rumoured $250m. But one of them is actively trying to get those changes back into the community, and is pushing to make Apache Kafka speak object storage natively.

Joining me to explain why and how are Josep Prat and Filip Yonov of Aiven. We break down what it takes to make Kafka’s storage layer optional on a per-topic basis, how they’re making sure it’s not a breaking change, and how they plan to get such a foundational feature merged.

Announcement Post: https://aiven.io/blog/guide-diskless-apache-kafka-kip-1150

Aiven’s (Temporary) Fork, Project Inkless: https://github.com/aiven/inkless/blob/main/docs/inkless/README.md

Kafka Improvement Proposal (KIP) Articles:

Support Developer Voices on Patreon: https://patreon.com/DeveloperVoices

Support Developer Voices on YouTube: https://www.youtube.com/@developervoices/join

Filip on LinkedIn: https://www.linkedin.com/in/filipyonov

Josep on LinkedIn: https://www.linkedin.com/in/jlprat/

Kris on Bluesky: https://bsky.app/profile/krisajenkins.bsky.social

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Transcript

Warpstream's Innovative Approach and Industry Impact

00:00:00
Speaker
Exactly a year ago, we had Ryan Worl on the show talking about Warpstream, which is a company that had executed a very cloud-age trick. You take a database or a data storage tool, in their case Apache Kafka, and you copy the same idea, but you replace the storage layer with S3.
00:00:20
Speaker
Why would you do that? Because you'll lose something in latency, but in return, you get cheap, automatically replicated, practically infinite disk storage.
00:00:32
Speaker
From a technical point of view, that's an interesting trade-off. What took me by surprise is quite how appealing it turned out to be as a business idea. Less than a year later, they sold that company for, reports vary, but around a quarter of a billion dollars.
00:00:50
Speaker
Billion with a B. Which kind of makes you wonder, are there any other Apache projects you could copy but use object storage? That's a question for another day. The question for today is, given those kinds of numbers, it's not really surprising there's a wave of other companies doing the same Kafka-but-on-S3 model.
00:01:09
Speaker
It's clearly big business. What is surprising is that one of them is trying to get that feature merged back into the original open source project as an open source feature.

Replication of Kafka on S3 Model

00:01:21
Speaker
That raises some questions. First up, why would you do that? Why would you give up what seems to be a big competitive business edge? It can't just be altruism. Then you've got the host of technical questions. How do you do it?
00:01:35
Speaker
How do you take a clustered, replicated, database-like thing and then retrofit a completely different way of storing the data without breaking backwards compatibility?
00:01:47
Speaker
And even if you can crack that, then you've got the socio-technical problems of getting a major feature change merged into a long-established Apache formalized project.
00:01:58
Speaker
There are many hurdles. Joining me to be grilled on those hurdles, on the why, how and with whom questions, are two people from the company attempting to do it: Filip Yonov and Josep Prat of Aiven.
00:02:11
Speaker
Now before we begin, you need to know this episode has been sponsored by Aiven, so let me tell you what that means and what it does not mean. They haven't had any control over the questions. They didn't suggest questions. They didn't get a list of questions beforehand.
00:02:25
Speaker
I think if you're a regular here, you know these episodes don't have scripted questions. It's the death of conversation. It's never going to happen. I did offer them a review copy after we recorded, but before this was published.
00:02:38
Speaker
They declined, but I did offer. The main change is I let them choose the publication date.

Sponsorship and Editorial Independence

00:02:43
Speaker
I've got about five months worth of episodes recorded. Sponsorship lets you jump that queue.
00:02:50
Speaker
I'm comfortable with that. I hope you are. But the only way you can really judge is if we get

Exploring Kafka's Replication Model

00:02:54
Speaker
started. So let's do that. I'm your host, Kris Jenkins. This is Developer Voices. And today's voices are Filip Yonov and Josep Prat.
00:03:13
Speaker
Joining me today are Filip and Josep. How are you doing, gentlemen? Pretty good. Yeah, quite good. I'm really going to pick your brains on this one, so I hope you're feeling sharp. There's a lot to get into.
00:03:26
Speaker
So I was thinking about this, right? I was trying to dig up in my memory Kafka, replication, persistent storage, all this stuff. I was trying to go through in my mind how it works. And I think it's pretty similar to what you'd build for any distributed replicated system, right?
00:03:46
Speaker
In that, here I am. I would like to write some data. And I want to write it in a way that's fault tolerant and durable. So I have three machines, let's say, and one of them is the leader.
00:03:59
Speaker
I talk to the leader. I give it some data. It writes that to its own disk and to its followers. And when they acknowledge, it acknowledges back to me.
00:04:10
Speaker
Right. And then that's durability, because I've got three copies. If one of them fails, there's a new leader election. So that's high availability. And the only thing I've missed out, I think, is how do I find the leader? So there's some brief conversation beforehand where I say to the coordinator, who's the leader?
00:04:30
Speaker
That's who I'll talk to, to write my data. Have I got that roughly right? Right. Pretty much right, yes. Good, right, OK. I mean, the only thing you can add as well is if you want to have the durability, you might want to put these three different nodes in different availability zones.
00:04:48
Speaker
So if there is a fire on the data center, you might want to have only one of them going down instead of the

Economic Factors Driving Diskless Kafka

00:04:54
Speaker
three of them. Right. Yes. Yeah. I dream of running a company large enough to have that problem, but there are plenty of them. Theoretically speaking, that's what you would like to do, right? You would like to have three nodes that each keep a copy of the data you're sending, in completely isolated areas of the world, or at least isolated enough that if there is any catastrophe, it doesn't affect all three of them at the same time.
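The write path described above can be sketched in a few lines of Python. The class and method names here are illustrative, not Kafka's actual internals: a producer writes to the leader, the leader forwards to its followers, and it acknowledges only once every copy exists.

```python
# Sketch of a leader-based replicated write (illustrative, not Kafka's code).

class Broker:
    def __init__(self, name, zone):
        self.name = name
        self.zone = zone          # availability zone, for durability spread
        self.log = []             # stands in for the on-disk log

    def append(self, record):
        self.log.append(record)   # "written to disk"
        return True

class Leader(Broker):
    def __init__(self, name, zone, followers):
        super().__init__(name, zone)
        self.followers = followers

    def produce(self, record):
        # Write locally, then replicate; ack only when every copy confirms.
        self.append(record)
        acks = [f.append(record) for f in self.followers]
        return all(acks)          # acks=all semantics

# Three brokers spread over three different availability zones.
f1 = Broker("b2", "eu-west-1b")
f2 = Broker("b3", "eu-west-1c")
leader = Leader("b1", "eu-west-1a", [f1, f2])
assert leader.produce({"key": "k", "value": "v"})   # durable: three copies
```

If the leader dies, a follower already holding the full log can be elected leader, which is the high-availability half of the story.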
00:05:19
Speaker
Yeah, yeah. Okay, but that setup I've described already exists today in Kafka. So I think the first question has to be, and I'll address this to Filip, what's wrong with that? Why meddle with it?
00:05:34
Speaker
Well... I am less technical than Josep, so I'll actually take a step back: how do I explain to myself how replication works today? How the hell are those megabytes being spread around to achieve this high durability? So basically,
00:05:52
Speaker
what we just described is fine if you're sitting in a single availability zone in the cloud. It doesn't really cost anything to move data there. But the moment you actually want to distribute your brokers in three distinct availability zones to achieve maximum durability, you effectively start paying for the replication four times.

Benefits of Diskless Kafka

00:06:13
Speaker
So the model is basically the following.
00:06:16
Speaker
In the default three-zone distribution setup, data crosses availability zones through those four paths. So let's take it. Producer to leader. You already mentioned that you need to connect with the leader.
00:06:28
Speaker
Then the leader to the first follower availability zone, which is kind of like the broker. Then you have leader to second follower and then leader to consumer. So effectively...
00:06:40
Speaker
If you turn that hose up, it's almost like a multiplication factor on your bill, because in AWS, which is one of the more taxing clouds, we are talking about data in and data out. So these four paths are getting taxed one cent each on ingress and egress.
00:06:58
Speaker
So you can imagine that this kind of like starts blowing up very quickly. And there is a very significant difference in price between data within an availability zone and across an availability zone.
00:07:10
Speaker
Absolutely, yes. One of the things which we discovered, and by the way, the reason we discovered it is that Aiven is one of the oldest managed Kafkas on the market, and this didn't really exist back in the day when they were setting up Kafka to begin with.
00:07:25
Speaker
We discovered that the inter-AZ tax, or fee, in AWS, even though it's a list price of one cent, is actually counted twice, on data in and data out. And obviously, to achieve the durability guarantees, we always cross. So we always egress and then ingress.
00:07:45
Speaker
Right. So are we then talking about eight parts? Do I pay...? No, I'm paying once per hop, and I get the response for free at least. Okay. Yeah. Just checking. Yeah. Yeah.
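As a back-of-envelope check on the numbers being discussed: the $0.01/GB each way is the AWS-style list price mentioned above; real prices vary by region and over time, and the sustained throughput below is an assumed figure for illustration.

```python
# Back-of-envelope for the inter-AZ "replication tax" in a three-zone,
# replication-factor-3 Kafka setup (assumed list prices: $0.01/GB each
# for data out and data in across AZs).

EGRESS_PER_GB = 0.01   # sending side of a cross-AZ hop
INGRESS_PER_GB = 0.01  # receiving side of the same hop

# The four cross-AZ paths described in the conversation:
paths = [
    "producer -> leader",
    "leader -> follower 1",
    "leader -> follower 2",
    "leader -> consumer",
]

# Each hop is billed on both sides, so one produced GB costs:
cost_per_gb = len(paths) * (EGRESS_PER_GB + INGRESS_PER_GB)
print(f"cross-AZ cost per GB produced: ${cost_per_gb:.2f}")

# At an assumed 100 MB/s sustained, that compounds quickly:
gb_per_month = 0.1 * 60 * 60 * 24 * 30   # ~259,200 GB
print(f"monthly inter-AZ transfer bill: ${gb_per_month * cost_per_gb:,.0f}")
```

That multiplication factor on every byte produced is the "problem behind the problem" the guests come back to later.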
00:07:56
Speaker
Okay. So this is mainly about the sheer operational cost once you split over different availability zones. Yeah. I see how that would justify a colossal engineering effort, which I want to get into.
00:08:10
Speaker
That's where the technical juice is. Yeah, so basically, if we actually start speaking and look at the motivation...
00:08:18
Speaker
So basically, when we talk about the motivation, it all started with a white paper I wrote internally at Aiven roughly two years ago, where we were studying the cost and the footprint of our fleet, understanding how our costs of Kafka and the FinOps were actually panning out, simply because we wanted to understand the real gross margin of each of our customers, right? So this is how it started.
00:08:44
Speaker
The real motivation behind it, it was economic. At this time, we already started seeing the newcomers. For example, Warpstream was one of the first companies out there which kind of proposed a new design, a fundamentally new design.
00:08:57
Speaker
And this kind of coincided with the numbers we were running, so it was a very much an economic motivation. Let's make our Kafka cheaper. Can we grow our gross margins?

Concept and Open Source Strategy

00:09:07
Speaker
But the more we dug into it with the engineering team over the last year, the more benefits we started uncovering from delegating the replication flow, or this replication mechanism which we just discussed, to the point where I found myself the other day talking about diskless Kafka.
00:09:26
Speaker
Cost was only priority number three or four in most of those conversations. It kind of came after the fact. People were far more interested in the elasticity and some of the other ergonomics which this delegation produces.
00:09:39
Speaker
Yeah. Okay, yeah, because we've all had that problem where disks fill up, right? Yeah. And that is another nice... Because I'm going to say, just for the record, I mean, the word serverless I used to get stuck on, and I feel myself getting stuck on the word diskless again. But the point is not that there's no disk. The point is the disk is someone else's problem. Correct. When it fills up, I don't have to worry about it. Yeah, exactly.
00:10:06
Speaker
The problem with serverless was not that I'm not running on any server, right? It was that the server is not my problem anymore. It's somebody else's. It's the same as the cloud and computers. They are not my computers anymore, but they are still computers, or data centers. So it's the same thing with disks.
00:10:23
Speaker
Diskless doesn't mean there are no disks whatsoever, but rather that you don't need to worry about disks. Yes, and disk-headache-less doesn't trip off the tongue as easily, right?
00:10:34
Speaker
It's a funny story how we actually came up with the name. Josep, maybe you can tell this one. Okay, so we had a code name for this project internally.
00:10:46
Speaker
And for the code name, knowing that Apache Kafka has its name from Franz Kafka, who was a writer, we decided to take a spin on that one. And what is the core thing that writers used back then? It was ink.
00:11:03
Speaker
So what we said is this project will be called Inkless. Inkless, okay. Because it's the same thing as Kafka writing without ink: it would use a new mechanism to persist its main writings.
00:11:19
Speaker
So we said, okay, what if Kafka itself wouldn't use disks? So the disks were the ink in this analogy. And when we needed to upstream the proposal, we thought, okay, what can we use that is not inkless?
00:11:34
Speaker
And we wanted to play with the -less part, and we were always saying that ink is like the disks for Apache Kafka, so we ended up saying, why not diskless? And that's how we got to that place.
00:11:48
Speaker
Okay, I get it. So here we are with a version of Kafka where the disks aren't your problem.
00:12:01
Speaker
That you're trying to open source, right? Now, this is this is a commercial angle I don't understand because um but Warpstream, you mentioned, they recently sold for a very large amount of money. Reports vary, but the very large is consistent.
00:12:18
Speaker
so that one Kafka company could have diskless storage. We were at the artist formerly known as Kafka Summit last week, and that had, I don't know, something like four or five companies who were saying they could do diskless Kafka.
00:12:35
Speaker
So there's obvious money and commercial interest in developing object storage for Kafka. Why are you trying to open source it? I mean, it's great that you are, but why? What's the angle?
00:12:47
Speaker
I will speak first on the commercial motivation, and I know Josep probably will have something to add there, but we were thinking very long about how we would approach this diskless mechanics, right?
00:13:02
Speaker
So the core breakthrough here is delegating the replication mechanism to the object storage itself, right? That was kind of the core idea. The how was actually quite unique to us, simply because when we were discussing, someone raised a hand in the room and said, hey, we at Aiven, we are not running a fork of Kafka.
00:13:26
Speaker
We are running Kafka. We are Kafka. So our version of Kafka is whatever Kafka's version actually is in the open source. So why would we try to fork and split the protocol even further, when the very thing which makes us successful as a business would be suffering, right?
00:13:46
Speaker
So we went several rounds, and we came back with the conclusion that if we donate this core primitive back to the community, and obviously if it's accepted and everything goes well, right?
00:13:59
Speaker
this will help us go forward, simply because our model is relatively simple. We believe that if the price is good and the quality is high, a lot of businesses will choose us to manage their Kafka, right? So it's not just about the features, but ease of use and ergonomics, reliability, partnership and all this.
00:14:21
Speaker
So this is like the core motivation for us. ah We don't see any value in withholding a specific feature. There will be no enterprise version of this. Let me put it this way, right? there is no point for us to actually have this.
00:14:36
Speaker
And you said Josep had an angle on this too. So, historically speaking, I've been the director of the open source program office at Aiven. And the fact that Aiven was putting so much emphasis on open source already

Future of Kafka with Cloud Storage

00:14:52
Speaker
tells you something about it.
00:14:54
Speaker
The mission that we have at Aiven has varied through the years, but it has always, except for a brief period of time, contained open source in it. And the idea has always been to be the open source data platform out there.
00:15:09
Speaker
Now, it's the AI-ready open source platform for you. So what we always do is we don't try to keep people in our platform, in our projects, just because they kind of run what we run elsewhere or that we are...
00:15:24
Speaker
sequestering them and not letting them choose the best provider for the technologies they need. We don't believe that's fair. And that's how the business has been made since the very beginning.
00:15:36
Speaker
And obviously what we did was, what if we don't need to maintain that fork, a hypothetical fork? Maintaining a fork is extremely complicated and hard.
00:15:47
Speaker
and no matter what you do and no matter how many people you dedicate to it, you will always miss parts. You will always have merge conflicts. You will always have struggle and reconcile these two walls that are- You will always be behind the leading edge.
00:16:01
Speaker
Correct. Yeah. So you need to make the decision. So all these companies we've seen at Current, previously known as Kafka Summit, you could see that they were running a version of Kafka fixed at a point in time.
00:16:18
Speaker
if you are lucky, because some of them are complete rewrites just using the external API of Kafka. That means they always need to play the catch-up game with Apache Kafka. And what we thought was, what if we do it exactly the opposite way? Why fork at some point in time, stay there, and have to play the catch-up game, or start from an empty shell that is the Kafka API and start implementing things,
00:16:46
Speaker
Why don't we start from Kafka itself and we elevate Kafka to speak the new cloud primitives?

Addressing Cloud Replication Costs

00:16:54
Speaker
And I think that's what that's the first step in the path probably.
00:16:57
Speaker
Kafka knows disks, and disks are an essential part of Kafka. The concepts are baked right into the broker. What if we elevate Kafka to be ready to learn and use those cloud primitives that are now available to us?
00:17:16
Speaker
This makes me wonder, I was gonna bring this up later, but this maybe this is the right point to do it, in that we are entering this world where disk is infinite, cheap and abstracted, right?
00:17:30
Speaker
Do you think these days, if Kafka were a new project starting today, that it would start with that kind of storage and maybe add in SSDs as a later feature?
00:17:44
Speaker
It's a philosophical question, and I'll give you my philosophical answer. When we were trying to explain to the wider group at Aiven what we are doing with diskless topics, right? And what's our approach here?
00:17:58
Speaker
I started with, as I call it, the problem behind the problem. Kafka's replication is not broken. It's actually been a very, very robust mechanism, which made Kafka so popular.
00:18:12
Speaker
It's great, to the point where I still do not see very much use in active-active clusters in the cloud. It's just so good at retaining data and providing the durability guarantees that you seldom see the need to go the extra mile, right? However, the problem behind the problem is that this very replication model is just taxed.
00:18:33
Speaker
The cloud has figured out a way to make it more expensive, with the ingress and egress and the inter-AZ fees. Now, what we are trying to do here with the flexible replication path, or delegating the replication path to the object store, is just making sure that Kafka can survive this pricing model, this cloud economics model.
00:18:59
Speaker
It's not like trying to introduce a better replication, if that makes sense. Right. Yeah. This is seeming like the implementation is a big technical challenge, but the motivation is not technical.
00:19:14
Speaker
In fact, from what you're saying about open sourcing it, the aim is to try and keep the technical edge as small as possible. Well, not only that. What if tomorrow, for example, EBS gets magical durability guarantees and cross-regional and cross-zonal replication at a fraction of the cost?
00:19:36
Speaker
We should be able to take advantage of this. Kafka should be able to take advantage of this. This lift-and-shift model from the data center... and, make no mistake, Kafka survived two data center generations.
00:19:50
Speaker
it's having a very hard time surviving the cloud

Seamless Integration of Diskless Kafka

00:19:53
Speaker
economics, right? So because of the overall model, we figured out that if we power Kafka with flexible replication, this will open up the doors for what's even streamable: not only the ergonomics we will gain, not only the elasticity, not only the cost, but maybe we can actually unlock this latent demand from customers and from high-scale streaming which was held back because of the economics, or held back because Kafka just becomes expensive for certain things.
00:20:23
Speaker
Yes, yeah. It's related to something I think of as Tustane's law: if you halve the price of something, it becomes 10 times more appealing. Hopefully this is the end result. Yes, exactly. Yeah. Okay, so let's dig into the technical part of how you do this.
00:20:40
Speaker
Because I was thinking through, on the back of an envelope, how I would change Kafka to optionally use either the existing block storage or object storage, right?
00:20:53
Speaker
That's hard. A large 10-year-old open source project where you want to not break anything but completely rewrite the way it stores data? I'm trying to come up with a naive headline summary that we can talk about, but I'm lost. How do you begin to implement that without breaking things?
00:21:16
Speaker
So we had to clearly picture where we wanted to start. We wanted to start from the open source project, and we said we are now creating a fork just to run our experiments there.
00:21:29
Speaker
The fork was cut from trunk. At that point in time, 4.0 was in the making. So basically our baseline was Kafka 4.0.
00:21:40
Speaker
And we said we want to have minimal merge conflicts. Those are basically the goals we set for ourselves. And I know that sounds awesome, and every single project should be like that.
00:21:51
Speaker
But then it was, OK, now let's try to make it happen. So then we went on a discovery journey, trying to find out, OK, what do we need to change, and where do we need to change it? Because Kafka has a thousand million layers where you could inject something.
00:22:04
Speaker
And by the way, after we threw our KIP out there, there have already been two more KIPs, Kafka Improvement Proposals, that are applying the same mechanism... not the same mechanism, but basically they are bringing the same capability to Apache Kafka in one way or another, which is talking to object storage. That means there was a need for that.
00:22:26
Speaker
We just needed one crazy group of people who took the first step, and then some others followed. So what we first looked at was tiered storage.
00:22:37
Speaker
Because that's the natural place to look at it. Tiered storage is already dealing with tiers. It's already talking, via plugins, to object storage, for example.
00:22:49
Speaker
So it was the right place to start looking. Because you've got this system, if I can explain this briefly. So you've got this system where the recent data stays on SSDs and the old data gets archived off to object storage. And that came in maybe two, three years ago?
00:23:06
Speaker
Exactly. It came in 3.6 as an early access feature. And I think it was roughly one and a half years, two years, something like that. Time is a blur. Sometimes when you speak about versions, I know exactly the version. It's Kafka 3.6.0. So I can tell you that clearly. That's a better timestamp than a timestamp, right? Exactly. It's exactly that one. I know it by heart.
00:23:31
Speaker
So what we thought is, can we do something in there? The problem is, even if you aggressively tier, you will still have local disk, and you will still need the cross-AZ replication, because the data that is not yet moved to object storage needs to be persistent and durable, and it's guaranteed to be durable only via disks.
00:23:57
Speaker
So then you need the replication. So no matter how aggressively you tier, for each single byte that comes in, you still need to pay several times for the replication.
00:24:08
Speaker
So that was not the solution. And then, starting to mess with the active segment was quite a heavy thing to do. And when you start adding bigger latencies to something that is expected to have almost no latency,
00:24:24
Speaker
you start thinking, I'm asking for trouble. So probably that's not the way we need to look at it. What we did was go several levels up. We decided to do something which is: can we do it opt-in or opt-out per topic? Can we create a new type of topic
00:24:42
Speaker
that will follow the classical rules of Kafka as we know

Balancing Costs and Latency with Object Storage

00:24:47
Speaker
them. But in the same cluster, can I create a new type of topic where, whenever I write there, I'm not following the normal laws of Kafka, but am using the object storage
00:24:58
Speaker
as a native tool? That means that quite high up in the broker, in the decisions the broker makes to write, persist, modify and operate on every single operation, we made a divergence there by extending existing interfaces and reusing as much code as we could, moving a couple of things around, and writing not as much code as one might expect. So we didn't need to rewrite the whole broker. We just needed to write some thousands of lines of code to have all these mechanisms, including
00:25:35
Speaker
some organizational components and the plugins to write to the object storage. So what we did was branch out. If you are in these specific types of topics, then you need to do certain things differently: replication is useless for you. You don't need to do replication.
00:25:54
Speaker
Forget about that one. Treat it as a replication factor of one, right? Now we can follow down, and then instead of talking to the synchronous APIs that write to disk, you talk to the asynchronous APIs that talk to S3, or to any object storage that you use. That was our way of thinking: how can we make that parallel path so that no matter how long it takes to merge this project upstream, it will always be an easy path?
00:26:30
Speaker
Because the code that adds this capability sits quite isolated on the side, and it only extends the right interfaces. And there is not much duplication of code, just a tiny bit that can be solved by refactoring Kafka code in the API approach.
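A minimal sketch of the per-topic branch being described. The names here are purely illustrative, not Kafka's actual internals: the real change extends existing broker interfaces, rather than adding one literal if-statement, but the decision it encodes is the same.

```python
# Illustrative sketch: classic topics keep the leader-based, replicated
# write path; diskless topics hand durability to the object store.

class ClassicLog:
    """Stands in for today's leader-based, follower-replicated log."""
    def __init__(self):
        self.entries = []
    def append_and_replicate(self, batch):
        self.entries.append(batch)       # disk write + follower replication
        return "acked-by-replicas"

class ObjectStore:
    """Stands in for S3 or any object storage; it replicates internally."""
    def __init__(self):
        self.objects = []
    def put(self, batch):
        self.objects.append(batch)       # one PUT; durability is delegated
        return "acked-by-object-store"

def handle_produce(topic_config, batch, classic_log, object_store):
    if topic_config.get("diskless", False):
        # Diskless topic: no leader, no follower replication. Treat it as
        # replication factor 1 and let the object store provide durability.
        return object_store.put(batch)
    # Classic topic: the unchanged synchronous, replicated write path.
    return classic_log.append_and_replicate(batch)

log, store = ClassicLog(), ObjectStore()
assert handle_produce({"diskless": False}, b"txn", log, store) == "acked-by-replicas"
assert handle_produce({"diskless": True}, b"click", log, store) == "acked-by-object-store"
```

The point of keeping the branch this isolated is that both topic types coexist in one cluster, and the classic path is untouched.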
00:26:48
Speaker
So you are keeping the existing mechanism entirely. Correct. You're not going into existing code and changing it. That makes sense. I'm trying to think now how this will play out, because I write some data to a transactions topic, and that's on my usual disk-backed storage. So that just goes down one branch of the magical if statement.
00:27:14
Speaker
Nothing's changed. Now I'm writing something less important, click streams, a clickstream topic. And that I want on object storage, because it's just cheap archive data.
00:27:26
Speaker
So now that's going into something which isn't going to be a leader. It's putting that data to object storage.
00:27:39
Speaker
The first problem I see with that, and there are going to be lots of problems, but the first one that comes to mind is: I don't want to write to object storage every time someone hands me some data, because I'm paying per PUT, right?
00:27:51
Speaker
Correct.
00:27:54
Speaker
So I think you're going to say something about batching, and then I'm going to say something about reliability if it crashes mid-batch. Explain that to me. So that's exactly what we're going to discuss right now. So one thing you don't want to do is: I'm getting one single message, and I put that single message up there. It doesn't make sense.
00:28:15
Speaker
It would be economically, or astronomically, expensive. Right, yeah. Let's forget about that one, right? So that didn't work. So what we wanted to do then is group them, a.k.a. batch them. If I want to sound fancier than "we group them", I'll say we batch them.
00:28:31
Speaker
Okay. It's the same thing in the end. So what we do is group it either by time or by size, whatever happens first. These two thresholds are configurable. However, we did some studies and analysis on how much time or how much size makes sense, and where the value hits diminishing returns, where it makes no sense to increase or decrease it.
00:28:58
Speaker
So either expenses go up too fast, or at some point they go down so low that it makes no sense to change the value. And we found that 250 milliseconds and 8 megabytes are...
00:29:09
Speaker
roughly the right numbers, where the more you add, the less benefit you get. And if you make them smaller, for general use cases, each use case on its own obviously, you might get cost increases that are not linear. So it's, say, logarithmic.
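The time-or-size batching just described can be sketched like this. The 250 ms and 8 MB defaults are the figures quoted in the conversation; the class itself is an illustrative stand-in, and a real implementation would also flush on a background timer even when no new records arrive.

```python
# Sketch of time-or-size batching for object storage: buffer records and
# flush when either threshold is hit, whichever comes first.

import time

class Batcher:
    def __init__(self, flush, max_age_s=0.25, max_bytes=8 * 1024 * 1024):
        self.flush = flush            # called with one batch per object PUT
        self.max_age_s = max_age_s    # quoted default: 250 milliseconds
        self.max_bytes = max_bytes    # quoted default: 8 megabytes
        self.buf, self.size, self.opened = [], 0, None

    def add(self, record: bytes):
        if not self.buf:
            self.opened = time.monotonic()   # batch starts with first record
        self.buf.append(record)
        self.size += len(record)
        # Flush on whichever threshold trips first. (A real implementation
        # would also need a timer so an idle batch still flushes on age.)
        if self.size >= self.max_bytes or \
           time.monotonic() - self.opened >= self.max_age_s:
            self._drain()

    def _drain(self):
        self.flush(self.buf)          # one object PUT for the whole batch
        self.buf, self.size, self.opened = [], 0, None

puts = []
b = Batcher(puts.append, max_bytes=10)   # tiny size threshold for the demo
b.add(b"hello")
b.add(b"world")                          # 10 bytes total -> size flush
assert puts == [[b"hello", b"world"]]
```

Amortizing many records into one PUT is what keeps the per-request object storage pricing from dominating the bill.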
00:29:29
Speaker
Right, yeah. So that's one of the things we do. But what we realized as well is that a consequence of delegating everything to the object storage, as in the storage and the replication factor, is that we can go without leaders.
00:29:48
Speaker
Because Amazon, or whoever, is doing your leader for you, essentially. Exactly. So I'm writing to S3. S3 will make sure that thing is replicated. That's the one that does the leader-replica work. And they have that one, and they have plenty of nines.
00:30:05
Speaker
Yes. It's very battle-tested by now. So why do I, in my project, need to do extra work that will increase the cost of my operation, when the place where I want to write it already does it for me?
00:30:20
Speaker
It makes no sense. I'm paying double for the same concept. And we shouldn't be doing that. Yeah, I'm with you so far, but I see two problems with that. I'm writing to something which is talking to S3 that...
00:30:35
Speaker
yes, that could crash and lose my data which hasn't gone to S3. Also, it could crash and then I have no leader, right?
00:30:46
Speaker
Correct. And if it crashes, so let's think about the first one. What happens if it crashes before the data gets persisted and stored?
00:30:58
Speaker
By default, how we configure it to work, and by the way, we are working on improving that piece because that's one of the weakest points right now, but we are working on making it slightly better.
00:31:12
Speaker
Right now, we have acks=all, basically. So what you do is you wait until it's replicated before you say that it's written, right? That's how replication works when you say, I want all acks received.
00:31:30
Speaker
I'm sending you a message, and only when every replica has it, so with a replication factor of three, only when all three brokers say it's written, will I tell you back that it's written.
00:31:43
Speaker
So we behave the same way as with acks=all: I'm writing, and only when it's written in S3 can I tell you that it's written. Obviously, that creates a lot of lag.
00:31:54
Speaker
So as the client, it's my job to make sure I've got hold of that data and can resend it until I've got that ack, which is exactly the same as the previous version. Exactly. So that behaves the same way. If there is any problem in between, I cannot ensure that it's durable. So as a producer, as a client that produces the data, I need to retry.
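That client-side contract, hold on to the record and resend until acknowledged, can be sketched like this. `send` is a stand-in callable, not a real client API:

```python
def produce_with_retries(send, record, max_retries=5):
    """Sketch of the producer's responsibility described above: keep hold of
    the record and resend until the broker acknowledges the write.
    `send` is a stand-in: returns True on ack, may raise on broker failure."""
    for _ in range(max_retries):
        try:
            if send(record):
                return True     # broker (and, behind it, S3) confirmed durability
        except ConnectionError:
            pass                # broker died or timed out before acking: retry
    return False                # give up; the caller must treat the write as lost
```

This is exactly the behavior real Kafka producers already implement; diskless just stretches the time until the ack arrives.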
00:32:17
Speaker
It's the same behavior with a different timeline. What we're working on as well is whether we can add mechanisms for accepting acks=1, so that it acknowledges once it's persisted on your own node, which gives you really not many guarantees, because if the node dies, then the data is lost.
00:32:39
Speaker
It's not been replicated yet. So you know that it's probably there, unless there is some problem. We could replicate that mechanism the same way, as in: once you push it there, we hold that data to batch it and push it to S3.
00:32:55
Speaker
But we can tell you that, as long as it's in our memory, it's in there.

Scalability and Flexibility in Diskless Kafka

00:33:00
Speaker
It's as risky, or slightly riskier, than what we do right now with acks=1.
00:33:07
Speaker
Actually, I just want to highlight something which we found out after the fact here, pun intended. acks=1 is probably the most used configuration out there. So most workloads today in the world actually revolve around acks=1,
00:33:28
Speaker
with some doing acks=all, right, and very few acks=0. When we were going into this discovery, we thought there were far more acks=0 workloads out there, but it actually turned out that there are not. Even metrics and logs and some other things nowadays require at least the leader to say, yeah, I got the data, you know, proceed.
00:33:50
Speaker
So this is one of the key discoveries and the focus of the next research step for us here. Right, yeah. So just to recap that, make sure I've understood it: you've got this mechanism where you can say, I want to wait until I've got confirmation from all three nodes that my data has been written before you tell me it's been written.
00:34:10
Speaker
That's one configuration thing. The other is: as long as someone's written it, that's fine. Yeah, correct. And that's acks=1 versus acks=all. That's basically the difference between those. And then acks=0 basically means, yeah, sure, I don't really care that much. I'll be generous and say that's the Erlang approach. Throw the message across, and if it crashes, it crashes. It will crash nicely and gracefully, with a lot of fireworks.
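The three acknowledgement levels being recapped map roughly to this decision. A toy model, not the broker code:

```python
def is_acknowledged(acks, replicas_persisted, replication_factor=3):
    """Toy model of Kafka producer acknowledgement levels, as recapped above.
    acks='0': fire and forget; acks='1': one copy persisted;
    acks='all': every replica persisted. In diskless topics, 'all replicas
    persisted' effectively collapses into 'the batch landed in object storage'."""
    if acks == "0":
        return True
    if acks == "1":
        return replicas_persisted >= 1
    if acks == "all":
        return replicas_persisted >= replication_factor
    raise ValueError(f"unknown acks setting: {acks}")
```
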
00:34:37
Speaker
But you said, about two minutes ago, the difference here is lag, and that seems to be an important difference. Correct. So if you wait on acks=all, for example, on any normal, classic Kafka setup, you would need to wait until that message gets into one broker,
00:34:57
Speaker
it's written, goes back; another one, goes back. So you need to wait until those messages go back and forth, simplifying a little bit. But that's what you need to wait for. And that's network traffic, so it should be reasonably fast.
00:35:13
Speaker
But now if we are considering that we need to group something, and then we need to talk to S3, we're talking about two sizable lagging operations.
00:35:23
Speaker
Yeah, you said 250 milliseconds-ish on the batch, and writing to S3 can be variable, but let's call it another 250? Exactly. Roughly that, right? Yeah. So what happens is: first, if we remove the batching, you still would need to pay the 250 to S3 no matter what. But then if you do the batching, you can start doing the maths and work out the median case, the best case, the worst case.
00:35:48
Speaker
So in the worst case, you need to wait the full 250 milliseconds, and then you pay, let's say, another 250 writing to S3, and then you get back. The counter-intuitive aspect of that one is that the higher the throughput, the less you need to wait.
00:36:05
Speaker
So the bigger the throughput you have, the bigger the system you have, the less delay you're going to incur waiting to accumulate. Is that because of the 8 megabytes you were saying? So if you fill up to 8 megabytes, you don't bother waiting the full 250 milliseconds. Yeah, yeah.
00:36:23
Speaker
So, counterintuitively: when you think about it, you'd expect the smaller it is, the better it works, and the bigger it is, the bigger the delays. That actually is not the case with this particular instance. It's kind of reversed.
00:36:39
Speaker
With really small and tiny workloads, you will need to wait lots of time because you will never hit the eight megabytes, right? So you will always need to wait the full time because you are not filling up the bucket in that sense.
00:36:53
Speaker
The bigger the load you have, the faster you get the bucket full, so it makes sense and it's worth already pushing it to S3 or to Google Cloud Storage or any other object storage.
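The "higher throughput means less waiting" point comes down to a little arithmetic: the batch ships when either 8 MB fills or 250 ms elapses, so the batching delay is the smaller of the two. A sketch, using the figures from the conversation:

```python
def batching_delay_s(throughput_bytes_per_s,
                     max_delay_s=0.250,
                     max_bytes=8 * 1024 * 1024):
    """Worst-case time a record waits in the batch before being shipped:
    whichever fires first, the size threshold or the time threshold."""
    if throughput_bytes_per_s <= 0:
        return max_delay_s              # nothing fills the bucket: pay the full wait
    time_to_fill = max_bytes / throughput_bytes_per_s
    return min(max_delay_s, time_to_fill)
```

At 1 MB/s the bucket never fills in time, so you always wait the full 250 ms; at 100 MB/s the 8 MB fills in about 80 ms, so the time threshold never fires.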
00:37:06
Speaker
And actually, the unit economics of diskless are also weird, and this is dictated by the cloud itself: at very low throughput, say a couple of megabytes a second, you're always better off with the classical replicated Kafka.
00:37:21
Speaker
And at a certain point is when you actually start getting more of the benefits. So it's almost an inverse curve in this sense, simply because it's not great at low throughput. It gets better the bigger the throughput is, if you tune it properly. So this is one of the discoveries we also had: it's really made for bigger workloads to begin with.
00:37:47
Speaker
It works very naturally at those higher levels. That sort of makes an intuitive sense, because you wouldn't generally use object storage for tiny bits of data, right? It's for storing chunks, not scraps.
00:38:04
Speaker
Yeah. Also, usually what you would like to keep in there is all that massive amount of data that comes in, as you said before: all the click data, all the tracking devices, for example. IoT data, that's also one example that I often share. When we go into massive IoT, it's not just machines in my home, but rather the whole country, scattered around on the windmills, for example. Those ones sometimes have no signal.
00:38:36
Speaker
They don't send any messages. They don't send anything. They only send when they can reach a signal, and then they send a huge burst of data. So what it comes back to is: for those use cases, which are massive in the amount of data that needs to be sent and processed, the question we asked people was, do you really need that low latency end-to-end, or do you actually not care?
00:39:03
Speaker
And for most of the use cases, it was: no, we actually don't care, because either we look at the data a couple of hours later, or we already group it in five, 10, 15 seconds.
00:39:14
Speaker
All the metrics, all the data from those 10, 15 seconds, we group them. We don't want to see every single data point; we've grouped them already in our visualization tools. So yeah, we don't care that much if there is some sort of delay.
00:39:30
Speaker
We just want to have this huge, massive amount of data at a cheap price, basically. We don't want to pay a premium for that expensive replication mechanism when we can rely on cheaper versions of it. We just want to have that work there.
00:39:45
Speaker
And that's one use case where it shines, and where we can see that, yes, it has a price, even though we're working on optimizing and trying to make sure that the price you need to pay in latency is as low as it can be.
00:39:58
Speaker
But there are use cases where, even unoptimized, it fits like a glove, because the whole pipeline is already dealing with latencies two orders of magnitude bigger than the ones we're talking about.
00:40:13
Speaker
Yeah, yeah, I can see that. I would imagine in real-world use cases, it's pretty clear-cut which of the two sides of the coin you fall on, right?
00:40:23
Speaker
Most people... Most people will either be putting through the kind of quantities of data where it's unambiguously better to put it on object storage.

Dual Storage Model for Diverse Data Needs

00:40:32
Speaker
Whereas some people have use cases that are unambiguously, this is going to have to be hot disk. Exactly. And and that's why we decided to say, whenever we build this thing, we need to keep both topics in the same broker.
00:40:44
Speaker
Because we thought, why do you need to have two different clusters, one for the low-latency ones, one for the high-latency ones? Why does the person producing data suddenly need to understand my infrastructure topology, to know specific details that probably shouldn't even be surfaced to the one producing the data? That should be a configuration value for the one setting up the topics.
00:41:11
Speaker
And having those two types of topics in the same broker, or in the same cluster, makes it easy to say: this one is diskless, this one is also diskless, this one stays classic. And eventually you could even migrate from one to the other.
00:41:31
Speaker
Why not? I mean, from classic to diskless, you could potentially do those things. Obviously, migrating from diskless to classic begs the question of how we transport the data that's in there. You could download it, you could MirrorMaker it yourself, but that's slightly trickier in that sense.
00:41:51
Speaker
I'm going to guess that we're going to have some system where, at least for the first release of this, you say: just copy the data over to a new topic and then migrate that way. Correct. And for that, we already have MirrorMaker, which does this wonderfully. So you have a topic and you can say, please replicate that topic to the same cluster.
00:42:13
Speaker
And because it's a logical replication, you can go between different types of topic storage. Correct, yeah. Yeah, yeah, okay. Okay, you've made me think of a very thorny question. I'm just going to hold that, because when you mentioned windmills and wind and reliability, I was thinking: does it make a big difference if you have bursty traffic?
00:42:35
Speaker
Imagine my windmills only send data when there's wind, or I'm a shop in the United States and things are naturally quieter at night. Do you run into different usage patterns or problems when one segment is going to be 10 clicks and the next segment is going to be 10,000?
00:42:56
Speaker
So one of the beauties of that one as well is: how would you do that in classic Kafka? You would need to bring up maybe two or three more brokers, and then they would need to align themselves, rebalance, and make sure that they are in sync, right? And then they could start doing work.
00:43:17
Speaker
We don't have the need for rebalancing; we don't have the need for being in sync as a replica. You could basically say, I need more throughput, I can spin up another broker, I can split the partitions and handle those.
00:43:34
Speaker
Or what I can do is actually just leave it as it is. Because, as I said before, the higher the throughput, the lower the latencies, as long as you can process all the messages that are coming.
00:43:51
Speaker
The only thing you will see is probably a better end-to-end latency when you have the burst, compared to when you have a really quiet moment. Oh, yeah. Yeah, okay. I can see that. That's one of the consequences. Obviously,
00:44:06
Speaker
we will need to see which numbers we're talking about, right? What's the baseline number, what's the burst, and how many orders of magnitude bigger it is, because if it's exploding we might need to grow the cluster, obviously.
00:44:21
Speaker
But the thing is that now you have the capability to add new brokers into the cluster almost immediately. The only thing you need to wait for is the Kafka broker to stand up.
00:44:37
Speaker
There is no data synchronization. You just have it there, ready to consume and produce. This thing where, when you start a new broker, it has to get itself replicated from the existing live ones. Yeah, that goes away.
00:44:50
Speaker
Exactly. So one thing is when you do an upgrade; you can do an upgrade in place. Forget about that; let's ignore that one. That's not the case here. When I want to add new brokers, three, four, five, whatever number of brokers I want to add to an existing cluster, those ones have nothing. They have an empty disk.
00:45:07
Speaker
So they cannot do anything. They need to wait until they are in sync. So they become one of the other replicas and then all the data needs to be sent through the network to reach them. And then once they have all the data, they are in sync. And once they are in sync, they can start serving, for example, customers and things like that.
00:45:25
Speaker
But obviously, all that time is now gone, because there is no such concept as "I need to be in sync". That's when you're using these new types of topics, obviously; for the old types of topics, the classical ones, everything stays the same way as before. Yeah.
00:45:40
Speaker
But you've basically pushed that problem off to S3 again, right?

Evolving Kafka's Architecture with Cloud Solutions

00:45:44
Speaker
Correct. Because they have it, they solved it, right? So why do you need to solve the problem twice? If you now have a piece of software that solves the problem, and they tell you you need to solve it again in front of it,
00:45:57
Speaker
you should basically ask yourself why you need to pay twice for solving the same problem if it's already solved for you. Yeah, yeah, yeah. I think probably the answer to that is just that at the time it was created, that wasn't an option, right?
00:46:16
Speaker
Absolutely. And I'm not saying that as bashing Kafka. I'm saying it as: now that we can use object storage directly, and Kafka understands that it doesn't just have disks but also object storage,
00:46:29
Speaker
we can remove some of the things that Kafka is doing, because they are already done by the object storage layer. So in a sense, what we do with diskless is delegate replication and storage to the storage layer. Yeah, that makes sense.
00:46:48
Speaker
Okay, then let me ask you the thorny question which came to my mind as you were describing that. What about transactions? There is an existing mechanism in Kafka where I can say I'll read data from this topic and I'll write data out to those two topics and I do all three of those things or none of them.
00:47:10
Speaker
So first we thought, can we deal with transactions later? When we proposed it, we basically wanted to do the KIP and push it there. And we thought, okay, let's not delay for months and months until we have a really huge book written about it and then throw it over the wall. We said, okay, let's write the minimum set of KIPs that we think will make the project viable.
00:47:38
Speaker
And we released those. And first we thought, okay, transactions: we might be able to ship a version one without transactions and then come back to it.
00:47:49
Speaker
And we got the feedback: no, we should rather do it right now. So we are discovering, and we have certain things that work already. So you can produce to both types of topics at the same time.
00:48:00
Speaker
And transactions: we need to check whether a couple of things work, and then we need to tune a couple more. But basically now we cannot say that's a problem for our future selves.

Gradual Adoption and Testing of Diskless Kafka

00:48:14
Speaker
It's a problem of our current selves.
00:48:16
Speaker
And we will need to deal with that one. So right now, the KIP doesn't solve that. It basically says most of the mechanisms should work with transactions, but some of them might not. So idempotency works, but transactions might have a couple of...
00:48:36
Speaker
flaky areas, or not so polished areas. And actually this is a very good question, because while transactions are not supported, that doesn't really prevent people from adopting diskless topics, because diskless is Kafka.
00:48:56
Speaker
So if you find that diskless topics do not support specific features, in this case transactions, you just go about your day, upgrading like any other version, right?
00:49:08
Speaker
And when these become available, you can actually go back to using diskless topics by simply creating yet another topic. So this is not asking the customer, or in this case the open source user, to completely lift and shift their estate to something else, or even worse, fork their estate to keep some of the Kafka topics on whatever Kafka they're using and then go on to some other system.
00:49:34
Speaker
They will just continue using the same cluster, with the optionality to have diskless. And this is extremely powerful. And we did it with a lot of intent, simply because we wanted to experience the same upgrade for our own fleet going into this innovation. We wanted to make sure that everything is trivially easy to upgrade to begin with. Yeah, it must help having a large existing customer base using Kafka, for checking statistically how this is going to work in practice and that kind of thing. Yeah.
00:50:07
Speaker
It does. Well, it keeps us up at night, right? It's a lot of work, and this is probably the time to say kudos to the whole team, keeping the lights on, working on this.
00:50:20
Speaker
It's been an insane additional workload for us to get it out of the door, because, as I like to say, the work actually just begins: we've started making the inroads to roll it out within our own fleet.
00:50:35
Speaker
And the beauty of this upgrade part is that we probably will decide to quietly upgrade everyone, even though they don't know, and just keep the diskless topics behind a flag so we can activate it opportunistically.
00:50:49
Speaker
Until now, when Filip reveals the plans to everyone in the world, because they will listen to your podcast and then they will know. You make it sound so nefarious, that you're going to upgrade them behind their backs, but that's the whole point of cloud services, right? Someone will upgrade and I never have to.
00:51:07
Speaker
Upgradeless. It's upgradeless. Upgradeless. Yeah, exactly. So we have another name for a product. Great. If you think about it, this one shouldn't be that different from Kafka 4.0 at the API level. So if you upgrade from 3.9 to 4.0, that's the same cost and complexity and problems. You will have exactly the same ones upgrading to Inkless, because Inkless and 4.0 are exactly the same thing. So it's exactly the same.
00:51:40
Speaker
Right. Almost zero cost. The only thing you need to know, when you have the new diskless implementation, is that you need to set a couple of settings.
00:51:52
Speaker
And when you create a topic, you have a new option, an optional one, to decide the type of the topic: whether you want it to be a diskless topic or not. If you do nothing, it's a classic one, obviously.
00:52:04
Speaker
And that's the only change in the API; it's a backwards-compatible change of the API. So everything works if you don't touch anything. Only if you add this new parameter do you create the new type of topic.
00:52:15
Speaker
And then a couple of settings and configs, and that's trivial as well. So yes, technically speaking, that's an additive change that needs some client modification, but it's an additive client modification, never a backwards-incompatible breaking change. OK.
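As a toy illustration of how additive that change is: topic creation gains one optional config key, and omitting it gives exactly today's behavior. The key name `diskless.enable` used here is an assumption for illustration; the final name is up to the KIP process:

```python
def topic_type(config=None):
    """Sketch: one optional, additive topic config decides the topic type.
    Omit it and you get exactly today's behavior, so no existing client,
    script, or tool has to change. The key name is hypothetical."""
    config = config or {}
    return "diskless" if config.get("diskless.enable") == "true" else "classic"
```
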
00:52:34
Speaker
Just to clarify on that: if I go and create a new object-storage-backed topic, so different flags when I'm saying create topic,
00:52:48
Speaker
but then when I go and connect my regular client software, and let's say it's not the Java one, let's pick the JavaScript one, because that's not exactly the reference implementation, but it's broker-protocol compatible.
00:52:58
Speaker
OK. Me writing code, will I notice any difference other than latency? Only latency. And that's the beauty of it. Producing and consuming.

Inkless Project and System Simplification

00:53:10
Speaker
You need nothing else to write to that type of topic. There is nothing else. It's a topic. And that's the beauty. So we decided to be really, really conservative on the number of changes toward the end user, because we didn't want to increase the risk of...
00:53:29
Speaker
the feature being denied by the upstream project. The more you modify things, the more you change current behavior, the more resistance you will have, obviously, to getting any change in, because it's a breaking change, it touches too many moving parts.
00:53:44
Speaker
So we wanted to change as few moving parts in the outer layer as possible, because if you need to rewrite all the libraries in the world, you will never finish.
00:53:56
Speaker
Yeah, that's a mountain that's too high to climb. Correct. So the only thing that you need to change is, when you create a topic, you need to pass a new flag. That's the only thing.
00:54:07
Speaker
That's the only change that needs to be done, which, by the way, can be done on the command line with the shell script that we all know in Apache Kafka. So we modified that one so it could do this.
00:54:19
Speaker
And as proof of that: Aiven is a Python company. So they use the Python client, and it works. OK.
00:54:30
Speaker
So the biggest problem we had was between 3.9 and 4.0, because the library we were using was still speaking old message protocol versions, and those were dropped in 4.0.
00:54:47
Speaker
And that was the biggest change we needed to make. And it was from 3.9 to 4, not from 4 to Inkless. OK, yeah, yeah. By the way, I'm saying Diskless and Inkless every now and then because Inkless is the project we open-sourced that is an implementation of Diskless.
00:55:05
Speaker
And that's why I keep saying both. Not randomly. I was trying to be precise. I think I got that. What you don't know is: every time you say Inkless, somehow I think of octopi.
00:55:19
Speaker
Oh, okay. Okay, that's just popping up in my head every time you say Inkless. It's the same thing as when we were looking for a logo. The first thing that came to my mind was an octopus.
00:55:31
Speaker
And then you thought about being sued by GitHub. No, the problem is we already have an octopus as a logo for one of the projects at Aiven. So we couldn't use an octopus for that. Okay. Yeah.
00:55:42
Speaker
Even though the GitHub isn't... Octocat? It's an Octocat. I'm not sure if the courts will recognize that distinction when they try and sue you.
00:55:53
Speaker
Well, just to prove what Josep said: when we set out to upgrade one of our products, it's called Bring Your Own Cloud, doesn't matter, the compatibility was so strong that it took a single engineer less than four weeks to get it up and running.
00:56:14
Speaker
This was just to update automation. And most of the stuff was what the automation shouldn't do, right? Like, don't look at the disk. Don't look at the replication progress. Don't look at this.
00:56:26
Speaker
So it was an extremely boring exercise the moment we actually started onboarding our own Kafka to diskless. It was very, very interesting how quickly things actually go when you have rethought it from first principles so that you don't really touch the API surface.
00:56:42
Speaker
Yeah. Yeah. That's the good kind of boring. We like that kind of boring. Yeah. As Filip was saying, one of the problems of the new type of topics is that there are plenty of tools out there that open the hood of Kafka and start looking at the internal state: how are my replicas, are they in sync, are they not in sync?
00:57:08
Speaker
Concepts that we destroyed for the new type of topics. So obviously, that's the thing we would need to tune or modify. And that was clear for us: obviously, when you change how things work internally, every single project that goes and looks inside will need to suffer a little bit.
00:57:27
Speaker
And this is going to change the monitoring world for Kafka. Correct. The monitoring world that is really built in, that has smart metrics and even smart actions or recommendations, would need to relearn them, or they would need to be excluded for these types of topics.
00:57:44
Speaker
Yeah. I can see that starting to percolate into the um Kafka-related monitoring tools anyway, because there are so many companies pushing in this direction that surely they're reacting to that.
00:57:58
Speaker
Yeah. Okay. So no client-facing changes, but there is a large chunk of this where we haven't discussed how it works under the hood, in the back end. We've talked almost exclusively about reads.
00:58:11
Speaker
Does this change the write picture at all? Sorry, I've said that the wrong way around. We've talked almost exclusively about writes. Does this change the read picture at all? So the read picture has the opposite problem of what we were talking about before.
00:58:30
Speaker
When you produce, what we said is that producing one single message would be prohibitively expensive, so we needed to batch them, group them, and send them. Similarly, if I'm asking for one message, and I go to S3, come back, serve it, then you come again and ask for the next one, and I go back again... you can see that will be crazy expensive.
00:58:55
Speaker
Yeah, you're going to end up looking something like a paging algorithm in an operating system, aren't you? And by the way, in a sense that's what the tiered storage plugins and the tiered storage mechanism did already in the past.
00:59:10
Speaker
You have the offsets. You need to go to S3. So you have the offset management that needs to know exactly where in S3 the data is located. And then when you fetch, you don't fetch just one message; you fetch a group of them that is big enough. You put it in your cache memory, and you serve from your cache memory.
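That fetch pattern can be sketched in a few lines: serve a single-offset request from an in-memory cache, and only go to object storage once per range of offsets. Names and range logic are illustrative:

```python
class SegmentCache:
    """Toy read path: a consumer asks for one offset, but we fetch a whole
    range of records from object storage once and serve subsequent offsets
    from memory, amortizing the expensive round trip."""

    def __init__(self, fetch_range, range_size=1000):
        self.fetch_range = fetch_range   # fetch_range(start, size) -> list of records
        self.range_size = range_size
        self.cache = {}                  # offset -> record

    def get(self, offset):
        if offset not in self.cache:
            # Round down to the range boundary and pull the whole range once.
            base = (offset // self.range_size) * self.range_size
            for i, rec in enumerate(self.fetch_range(base, self.range_size)):
                self.cache[base + i] = rec
        return self.cache[offset]
```

Sequential consumers hit the cache almost every time; only one request per range actually touches object storage.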
00:59:30
Speaker
You need to worry about that one. So there's another problem that comes up as well. And now I will showcase that I have back pain and I'm slightly old.
00:59:44
Speaker
Windows disk defragmentation, if you ever remember running that. Do you remember? No, far too young to remember defragging disks. Yeah, exactly. But let's pretend I remember exactly what that is. Exactly. Let's pretend you don't need to go to the doctor every year to check your back, and you do remember this.
01:00:03
Speaker
What happened? You were optimizing for writes. But at one point in time, optimizing for writes made reads really slow. And then you needed to go and run a defragmentation. So you needed to reorder every single block and sector on the disk, written in the proper way, so that files were contiguous and not broken into different parts.

Optimizing Read Operations and Performance

01:00:25
Speaker
Make sure that everything that sits together logically sits together physically. Yeah. And in a similar way, one of the mechanisms we propose is: when we write, we optimize for write efficiency.
01:00:38
Speaker
So what that means is that you can say: okay, I have data from this partition, also from this other partition, and they're written on the same broker. I can group those into the same object to be sent to the object storage, and push that thing there.
01:00:51
Speaker
Right. So I'm grouping things that logically might not make sense together, but it's really good for getting the data out. You remember the throughput megabytes that we were talking about before?
01:01:01
Speaker
You don't need to reach them on a single partition or a single topic. You can basically say: if I'm reaching them on that broker, across any diskless topics, then I will push them out. Right?
01:01:15
Speaker
so Right.
01:01:18
Speaker
That's the write-time optimization, but for reading, you want exactly the opposite. So at read time, what you need to do is go and read all the messages that are entangled, detangle them, and put them into the right ordering. So when I'm reading and I'm fetching data that is old,
01:01:35
Speaker
I'm getting old data that is useful for me, because when I'm reading that offset, I will probably follow with the next one and the next one. And what I want is data locality when I'm fetching.
01:01:46
Speaker
But that works nicely when you're reading old data, or stale data, quote unquote. Like archive data. Exactly. But if you're reading data that just happened, you will have a lot of consumers that read the data as it happens.
01:02:04
Speaker
So probably that write-optimized storage organization might already work for you, because the consumers more or less keep up: you're producing, and they consume as soon as the messages come.
01:02:19
Speaker
And yes, they're kind of grouped logically by time in that sense. So it probably is what you want to see first. If you're reading in the same pattern that you're writing, and always keeping up more or less with what's been written, probably the write-optimized layout is not that bad for you.
01:02:37
Speaker
But when you go back in time, and one consumer says, now I need to read all my historical data, then yes, I will fetch a lot of data that I don't even want to know about.
01:02:50
Speaker
So we would need to fetch a lot of data that gets discarded. That's why we have this optimization, this compaction: basically a re-sorting of the data so it's useful and more effective for reads.
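A toy version of that re-sorting step: write-time objects interleave batches from many partitions, and compaction regroups the records per partition, in offset order, so that sequential historical reads hit contiguous data. All names here are illustrative:

```python
def compact_for_reads(mixed_objects):
    """Sketch of the read-optimizing compaction described above.
    Input: write-time objects, each a list of (partition, offset, payload)
    tuples interleaved from many partitions. Output: payloads regrouped
    per partition, sorted by offset, as a read-optimized layout would be."""
    by_partition = {}
    for obj in mixed_objects:
        for partition, offset, payload in obj:
            by_partition.setdefault(partition, []).append((offset, payload))
    return {p: [payload for _, payload in sorted(records)]
            for p, records in by_partition.items()}
```
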
01:03:03
Speaker
Yeah. Now, that has made me think, because if I go and defragment my storage, I think that's going to be fairly straightforward, because I read a bunch of files.
01:03:16
Speaker
Once you've put it in the storage, it's kind of write-once, isn't it? Mm-hmm. Write once. So I can read stuff back, confident that it won't be updated while I'm reading it.
01:03:27
Speaker
All right, put it all together in one thing, write that back, and then tell my metadata layer: forget those four blocks, there's this one new block, right? Correct. And that makes me want... so that sounds great and straightforward, but the metadata layer: where are we writing this metadata?
01:03:45
Speaker
And what are we doing about the durability of that? We've skipped over it. You thought you could catch me out. No, no, no. I was thinking, we haven't talked about the batch coordinator yet, and it should come up.
01:04:00
Speaker
So obviously, the piece of the puzzle that is missing right now is that one. And we need to tell you that the KIP proposes one thing, while what we implemented, for now, is another thing, out of convenience.
01:04:16
Speaker
Okay. So in the KIP, knowing that this needs to be upstreamed to Kafka, we need to find a Kafka-native solution. And what's more native to Kafka than topics themselves?
01:04:28
Speaker
So you would have the offset management, or basically the batch coordinates, the thing that translates an offset into a coordinate on the object storage, be backed by a topic.
01:04:39
Speaker
That's being heavily discussed. We have several options and several possible architectures for how we save it, how we process it and structure it, and how it scales up and down.
01:04:53
Speaker
Each of them has its own advantages and disadvantages. And we were also thinking, do we follow a cell architecture or not? Do we have one single partition, or several? So it's full of discussions on what's the most optimal way of doing it.
01:05:11
Speaker
No matter what we do, if it's going on Kafka, on a classic topic, the default topic, it's as durable as Kafka itself. So technically speaking, diskless will use classical topics to maintain durability of the metadata.
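The translation role the batch coordinator plays can be sketched as a tiny model: every batch that lands in object storage is recorded as an event in an append-only log (which, in the upstream proposal, would itself be a classic Kafka topic), and the offset-to-coordinate index is just a replay of that log. Every class, method, and field name below is a hypothetical illustration, not the KIP's actual API.

```python
# Toy model of a topic-backed batch coordinator: durability comes from the
# append-only metadata log; the in-memory index can always be rebuilt from it.

class BatchCoordinator:
    def __init__(self):
        self.log = []            # stand-in for the metadata topic
        self.index = {}          # (topic, partition, offset) -> (object_key, byte_pos)
        self.next_offset = {}    # (topic, partition) -> next offset to assign

    def commit_batch(self, topic, partition, object_key, byte_pos, record_count):
        """Assign offsets to a batch that already landed in object storage."""
        base = self.next_offset.get((topic, partition), 0)
        event = (topic, partition, base, object_key, byte_pos, record_count)
        self.log.append(event)   # the durable part: appended like a topic record
        self._apply(event)
        return base

    def _apply(self, event):
        topic, partition, base, object_key, byte_pos, count = event
        for i in range(count):
            self.index[(topic, partition, base + i)] = (object_key, byte_pos)
        self.next_offset[(topic, partition)] = base + count

    def locate(self, topic, partition, offset):
        """Translate a Kafka offset into an object-storage coordinate."""
        return self.index[(topic, partition, offset)]

coord = BatchCoordinator()
coord.commit_batch("clicks", 0, "s3://bucket/block-0001", byte_pos=0, record_count=3)
coord.commit_batch("clicks", 0, "s3://bucket/block-0002", byte_pos=0, record_count=2)
coord.locate("clicks", 0, 4)   # -> ("s3://bucket/block-0002", 0)
```

The point of the sketch is only the shape of the problem: the coordinator owns offset assignment, and its state is exactly the metadata whose size and durability the conversation discusses next.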
01:05:30
Speaker
Okay. And that's the metadata as in, in this particular case, the meta-metadata. And what we're saying is that what we want to do is remove the disks for the data layer.
01:05:44
Speaker
The metadata layer is tiny. Right, yeah. It's usually not worth it to go into the business of removing the need for storage for the metadata.
01:05:59
Speaker
The returns on that would be so narrow and tiny that it probably makes no sense. And we actually have some numbers already, because we are using it internally. Under a sustained load of circa 100 megabytes a second, and it fluctuates because we actually hooked it up to a live service.
01:06:21
Speaker
The most we saw was actually 36 megabytes. So this is almost negligible when it comes to local disk used. This is for the metadata. That's 36 megabytes of metadata storage?
01:06:36
Speaker
Correct. Yeah, that's it. Total, or...? Total, yeah. This was actually total usage, spread across. It's almost negligible in the grand picture of a stateful system like Kafka.

Commitment to Open Source and Community Engagement

01:06:49
Speaker
Okay. And as I was mentioning before, that's not the same thing, though. That's the version we want to upstream to Kafka. What we implemented, to basically get something out quickly that we knew would work but would never be merged upstream, uses a Postgres database.
01:07:08
Speaker
And we happen to sell a lot of Postgres as well. We know a thing or two about Postgres. It was convenience. We had Postgres around and we knew how to deal with it. So we used it.
01:07:24
Speaker
So we used that to basically get something out quickly. We knew that going to a Kafka-native management of the batch coordination would be quite controversial, and no matter which path we took, we would probably need to take another one, or a modified one.
01:07:42
Speaker
And it's a complicated problem. So we decided to take one step back and say, okay, for us, for ourselves, we can run Kafka with another thing. It's fine. For the open source version,
01:07:56
Speaker
It can't be like that. Yeah. So you have a Postgres instance over there that stores this data. And how durable is it? As durable as you can make it, and as you want. You can have several replicas. You can have a read replica that is synced all the time.
01:08:15
Speaker
You can have constant backups, WAL backups with full backups every now and then. Yeah, yeah. You choose. The purist in me wants it to be all Kafka. The realist in me sees why you made that decision.
01:08:28
Speaker
So we will eventually go to whatever implementation we agree on collectively as the Apache Kafka community. But in the meantime, that was quick and handy, and we could get this thing out the door.
01:08:44
Speaker
Yeah, yeah. Makes sense. You have to be practical while aiming for beautiful. Exactly.
01:08:53
Speaker
Sounds like married life somehow. Something, something, something. Yeah, yeah. I'm criticising myself there, but, you know. So this is the big thing that we haven't dived into then.
01:09:08
Speaker
You have a working implementation internally at Aiven. You are working hard to upstream it into the open source project.
01:09:19
Speaker
How's that going? What's the status? What's the challenge of getting that done? So we have a project in our Aiven organization. If you go to github.com slash aiven slash inkless, like the octopuses we were mentioning before, you can see the code there. It's purely a Kafka 4.0 baseline with plenty of commits on top.
01:09:43
Speaker
And you can go into the history and see every single step of the way. We didn't hide anything. We didn't obfuscate any commit. We programmed everything in the shadows as if we were programming in the open, because we knew we were going to open it eventually. So we just went into that mode.
01:10:02
Speaker
All the history is there. That's the project we have, implemented there for everyone to see. We actually don't want people to contribute to that project, because we don't want to create a split community. We don't want people to start flocking around this new project and saying, oh, I'm using this one instead of the other one, or anything like that. If people want this feature, the right thing to do is push for that feature to be in Apache Kafka.
01:10:30
Speaker
Otherwise, we are not in the business of creating another Kafka competitor project. That's not what we want. We want Kafka to have this feature. So the first thing we did was create a KIP, which is the Kafka Improvement Process. And we started with the motivational question, which is KIP-1150.
01:10:51
Speaker
And I paraphrase, but it's basically saying, do we even want Kafka to talk to those new cloud primitives like object storage?
01:11:04
Speaker
Because we've seen that there are seven different forks out there, some of them proprietary, some of them open source, but they are forks of Kafka that have this capability. Do we want the mainline, Apache Kafka, to have it?
01:11:19
Speaker
And that's 1150. That's the KIP, 1150. And then 1150 says, if you want that, we have ideas and we have an opinion, and please go read our opinions. And we separated them.
01:11:31
Speaker
Instead of having one mega KIP, we learned from Tiered Storage, aka KIP-405. It took six years from writing to 3.6. Three point six. Was it six years? Yes.
01:11:43
Speaker
Crikey. Okay. So we learned. I checked, I looked at the history of what happened. I'm not saying I have the keys to why things happened the way they happened, and many different complex things happened at the same time.
01:12:01
Speaker
But one of the things that strikes you directly when you read KIP-405 is that it was one huge, big explanation of what this feature does.
01:12:13
Speaker
And that shapes the discussions, because, in case somebody doesn't know how proposals work in Kafka, you need to write a proposal upfront. You write upfront exactly every single change you propose, with migration path, with testing strategies and whatnot.
01:12:30
Speaker
And then you open it up for discussion with the community. Once you reach a consensus in the discussion, you open it up for a vote, and then you need at least three plus-one votes. The vote needs to be open for 72 hours, and in the end, you need to have more plus ones than minus ones.
01:12:47
Speaker
And the votes that count, let's put it this way, the binding votes, come from maintainers only. The community can vote, but it only expresses sentiment, not power.
01:13:01
Speaker
Correct. It basically expresses the community's willingness to have something or not. So I would really wonder, and it's never happened, what would happen if the community said plus one en masse and the maintainers said minus one en masse. That would be a really weird situation, where the maintainers go in the opposite direction of where the community wants. But as I said, that's never happened.
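The voting rule Josep just walked through can be written down as a tiny function, which makes the mechanics easy to check. This is a paraphrase of the rule as described in the conversation, not the official Apache text; the function name and vote encoding are made up for illustration.

```python
# Toy encoding of the KIP voting rule: at least three binding +1s, more
# binding +1s than binding -1s, and the vote open for at least 72 hours.
# Non-binding community votes are present but express sentiment only.

def kip_vote_passes(votes, hours_open):
    """votes: list of (value, binding) pairs, where value is +1 or -1."""
    binding_plus = sum(1 for value, binding in votes if binding and value == +1)
    binding_minus = sum(1 for value, binding in votes if binding and value == -1)
    return hours_open >= 72 and binding_plus >= 3 and binding_plus > binding_minus

# Three binding +1s beat one binding -1; the non-binding +1 is sentiment.
votes = [(+1, True), (+1, True), (+1, True), (-1, True), (+1, False)]
kip_vote_passes(votes, hours_open=80)
```

Note that in this encoding a flood of non-binding +1s changes nothing, which is exactly the hypothetical tension between community and maintainers that comes up next.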
01:13:25
Speaker
But imagine, it's happened in geopolitical history. Absolutely. Let's not touch that topic. Why did I bring it up? So open source is sometimes more geopolitical discussion than technical discussion.
01:13:39
Speaker
So what we decided to do was, instead of having one big, huge discussion, which is a totality discussion, where either I agree with the totality of the proposal or I disagree with the totality of the proposal. Oh, right, yeah.
01:13:52
Speaker
What we did was, can we break it down into areas that are replaceable, if that makes sense? So we have a core KIP, which is the one that explains everything.
01:14:05
Speaker
But the batch coordinator is another KIP. The core implementation needs a batch coordinator; whether it has the A, B, C, D, or E
01:14:19
Speaker
version, we don't care in that sense. I mean, the feature works with some version of it; it doesn't matter which one. Yeah, yeah. You don't want the whole discussion about the overall feature to be blocked by a legitimate concern about the way you're writing metadata.
01:14:36
Speaker
Perfect. Yeah. And then you isolate discussions where they matter and where they can be precise, concise, and as long as they need to be. Yeah. Because they might need to be long and we might need to go through a thousand million places. And that's perfectly fine.

Innovating within Open Source Dynamics

01:14:50
Speaker
But our approach was, can we isolate discussions so they are meaningful? So every single thing that is discussed moves the feature forward, rather than stalling or blocking other parts when half of it could proceed. And that's one of our attempts at making this happen faster, or making it more palatable for people to review, understand, and agree to those proposals.
01:15:20
Speaker
Yeah, we basically don't want perfect to be the enemy of good. That's really the intent behind all of it. And I think it has worked rather well so far, because we are still very much on track with the big questions: do we want it? Is this the right
01:15:39
Speaker
approach to actually having Kafka speak to those primitives, and should Kafka adopt this type of replication flow? Yeah. Instead of going into the mechanics of the batch coordinator, for example.
01:15:52
Speaker
Do you find, I mean, I'm just thinking, as you try and make the argument, you said that a large part of the argument is about cost. Does that argument land well in a purely open source project?
01:16:08
Speaker
That's a very good question. And I was talking to some maintainers as well, in petit comité, so basically behind closed doors, saying, hey, we have something, do you want a sneak peek?
01:16:22
Speaker
And that was one of the questions we got as well: shall we go there? And then the answer is, well, seven other forks of it went there.
01:16:36
Speaker
And a friend of mine, also a maintainer of Kafka, who actually works at Aiven as well, submitted a call for papers for Community over Code.
01:16:48
Speaker
And the title is, We Shall Win Kafka Back. Because effectively, if you start counting as-a-service offerings, so SaaS products, and count the number of Kafka versus Kafka-like
01:17:07
Speaker
clusters that are out there, I think the Kafka-like wins by a large margin, by sheer volume of use. Yes. Yeah. So if we think about every one of the big ones: if we talk about Amazon, they have their own version of Kafka, which is not Kafka, it's their own variant that is closed source, and nobody knows what they run. That's basically MSK, right?
01:17:33
Speaker
We know that Confluent has Kora, for example, and they also now have Warpstream. Neither of them is pure Apache Kafka. One is probably a modification of it.
01:17:44
Speaker
I'm just going to assume. I never worked at Confluent. I don't know what it is. I can only assume. I know that Warpstream is a reimplementation of the API, in their words.
01:17:57
Speaker
And we have other big players where, in the end, what they run is not Kafka itself, but versions of Kafka with new capabilities in them.
01:18:08
Speaker
So the question was, do we want to keep that dichotomy and let Kafka just become the gatekeeper of the API or the protocol?
01:18:19
Speaker
Yeah, it is starting to blur whether Kafka is a project or a protocol. And that's our take: let's make it a project again. But in order to do that, you've got to look at the reasons why it's forking.
01:18:35
Speaker
Correct. So, going back to why people create forks. I create a fork of an open source project because I have a feature.
01:18:46
Speaker
I want to implement this feature, and the maintainers do not want this feature at this moment. So I happily take a fork, make that my new mainline, and go over there.
01:18:58
Speaker
But the same beauty of open source that lets you do that lets you merge back. Oh, maybe now I've changed my mind, and now that I see those were successful features, let me add them myself.
01:19:12
Speaker
And that's the beauty of open source: you can say, I made my own copy and my own version. And in a way, they validated that this was the right feature to build on Kafka.
01:19:24
Speaker
Because, as you said, one of the questions you asked at the very beginning was, why did you open source it? When it's clear that there is business appetite, and companies are being bought for hefty amounts
01:19:41
Speaker
by holding this thing proprietary. So clearly they did something right. That feature clearly is wanted, and there is a business case behind it. And that's why the motivational question came: do we want to close that gap? It's our attempt to close the gap and reclaim Kafka as not just a protocol holder, but a project holder where the innovation lives.
01:20:08
Speaker
Yes.

Future-Proofing Kafka with Diskless Capabilities

01:20:09
Speaker
That's really epic. Yeah. There are many places in which the project has been at the forefront. In this one, there's certainly a chance to catch up.
01:20:20
Speaker
And there's a fundamental economic reason why there's always going to be that demand. Yeah. Yeah, I see it. Well, it's not only the economics. One of the things we also need to look at is ergonomics, and a few other things. And I always like to see what's the question behind the question. And even though the motivational question is present in 1150, the real question for us was...
01:20:48
Speaker
if it took six years for Tiered Storage to materialize, can we afford to have such a feature take the next six years? Can we actually have it today? Can we actually have it soon?
01:21:02
Speaker
Can the whole project actually move faster? And I always find it funny that Kafka doesn't have a marketing team, doesn't have a product team.
01:21:14
Speaker
It has a bunch of ultra-talented individuals contributing to it, and a lot of companies sit behind it. And at the same time, without those capabilities, it's still the most dominant deployment form of Kafka, right?
01:21:32
Speaker
The open source version is the most used one, the biggest one, and will probably continue to be such. So the question we actually want to ask is, can we...
01:21:46
Speaker
secure Kafka's future for the next 10 years, because we'll continue building on it. I intend to stay in Kafka probably for the next 20 years. I want to make sure that we have the necessary primitives to continue building.
01:21:59
Speaker
So going back to the motivational question, should we make sure that we take care of our own house so that we can kind of like continue building whatever we want to build on top of it? Yeah.
01:22:10
Speaker
That perhaps leads me to the last question in this, which is, I know as part of this, you're proposing that the batch-writing coordinator become an interface rather than an if-branch in the code.
01:22:28
Speaker
Do you see other kinds of batch writing coming up in the future, or is this just trying to lay the groundwork for abstraction? I'm going to come at it from the product management perspective for a second, because this is very important. We are not building the batch coordinator for the sake of having batching in streams or something fancy.
01:22:52
Speaker
We are building the batch coordinator with the intent to leverage a certain cloud primitive. In this case, it's the object store; I already mentioned it. But if tomorrow we find a very nice cross-cloud primitive, for example like EBS, right?
01:23:06
Speaker
And we don't need to cross zones there, right? We can just piggyback on it. Hmm. Why not figure out the coordination mechanism to link this back to Kafka?
01:23:18
Speaker
So the batch coordinator is the result of a problem we want to solve, meaning that we want to bypass all those tollbooths of the cloud, as I call them internally,
01:23:29
Speaker
with object storage. But tomorrow, for example, we might have a better, more interesting, lower-latency, or more economically viable cloud primitive in EBS. And by the way, very recently S3 Express dramatically lowered its prices.
01:23:47
Speaker
So we are already seeing this architecture winning simply because we can flip a switch and now use a new storage class in diskless, which is S3 Express. And this is much lower latency.
01:23:58
Speaker
It will improve end-to-end latency by almost 60%, yet it will cost just an increment on top of the S3 Standard storage class. So this ability to mix and match storage classes is extremely important for us going forward.
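The "flip a switch" idea can be sketched as a per-topic configuration knob: once data placement goes through the batch coordinator, the storage class is just a parameter. The class names, relative latencies, and relative costs below are invented for illustration; only the S3 Standard versus S3 Express distinction comes from the conversation.

```python
# Toy sketch of per-topic storage-class selection in a diskless setup.
# (relative end-to-end latency, relative cost) per class -- illustrative only.
STORAGE_CLASSES = {
    "s3-standard": (1.0, 1.0),
    "s3-express": (0.4, 1.3),   # much lower latency, a cost increment on top
}

def pick_storage_class(topic_config):
    """Latency-sensitive topics pay the increment; everything else stays cheap."""
    if topic_config.get("latency_sensitive"):
        return "s3-express"
    return "s3-standard"

pick_storage_class({"latency_sensitive": True})   # "s3-express"
pick_storage_class({"name": "audit-log"})         # "s3-standard"
```

The design point is that nothing about the topic's clients changes; only where the coordinator places bytes does.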
01:24:14
Speaker
Yeah, I can see that. So hopefully this doesn't take until 2031 to merge, and you can start taking advantage of all of that.
01:24:25
Speaker
Do you want to give a prediction, or are you not going to go down that road? I don't want to give any prediction. It will stay around forever, and then somebody will come back to me saying, look what you said on that podcast. I want to save future me from myself right now.
01:24:42
Speaker
But I hope... if I can hope for something, I hope that the discussions stay basically as they have been so far. They have been awesome discussions, really on point about how we can solve this problem, and what's the best way to solve that problem or the other one.
01:25:00
Speaker
And the only thing I'm hoping is that these are the kinds of discussions we keep having, and that there are no background discussions or conflict-of-interest discussions, which have sometimes happened in the past in many other projects.
01:25:16
Speaker
So I hope that we keep that discussion level as it is right now and move forward from that.

Quality and Security through Open Sourcing

01:25:23
Speaker
Yeah, that makes sense. My prediction is very similar to yours, Josep, but I will take it a step further. The real narrative here is that open source is quality. In many cases, we see companies open sourcing something, but not being around to maintain it two years, three years, five years, 10 years into the future, right? It's business at the end of the day.
01:25:46
Speaker
So with open sourcing, and being extremely intentional that this is upstream-aligned and open sourced, you secure the future of a feature regardless of what happens to the original creator.
01:26:00
Speaker
It's extremely important to reinforce this narrative, and this quality feel of the Apache Kafka project. So what I can actually predict is that we will not let go of this feature until it's delivered; when is something for the community to decide.
01:26:15
Speaker
Fair enough. On that community note, perhaps as a final question then: I'm someone with a vested interest in open source Kafka's future, but I'm not a committer.
01:26:26
Speaker
Is there anything useful I could do? Of course. Regardless of your maintainership level, whether you are a maintainer or not, it doesn't matter.
01:26:38
Speaker
If you are interested in that space and you want to give your opinion, go to the discuss threads. They are open. Express your opinion, or vote on any of these features.
01:26:50
Speaker
Your opinion matters. And one of the things I want to raise as a maintainer of Apache Kafka is that sometimes we are in our own bubble, because we talk among ourselves about the things we think ourselves, but there is little feedback from the community. I guess it might feel scary to go to those threads and start saying, actually, I like this feature.
01:27:16
Speaker
And we should have it. And sometimes you see two maintainers arguing about why something might not be a good feature because nobody wants it, and the community might be behind them saying, but actually, we did want that one.
01:27:31
Speaker
So to anyone who cares about Kafka, I would say: please go to the discuss threads of this KIP or other KIPs, it doesn't matter, the ones that speak to you. Go there and share your opinion, share your ideas on what's good, what's bad, what works, what doesn't.
01:27:49
Speaker
And if you'd like those features to be in there, share that too. Even if it's just, I really love this thing, I would love to have this one in. Okay, that's what I need to do then. I will do that, and while I've got the links, I'll stick them in the show notes for this episode.
01:28:04
Speaker
Awesome. And on that note, with work ahead of all of us, but far more work ahead for you two than me. Yeah, probably. Josep, Filip, thank you very much for joining me.
01:28:16
Speaker
Thank you so much. Appreciate it. Thank you very much. Cheers.

Collaborative Feature Integration and Conclusion

01:28:19
Speaker
Thank you, gents. As ever, links in the show notes if you want to get involved, if you want to review their proposals or critique their proposals, or if you want to check out the temporary fork they've pushed.
01:28:30
Speaker
I hope this feature gets merged in one form or another, but it won't happen overnight. It shouldn't happen overnight. It deserves some thought and care and community input. So we shall see what happens.
01:28:43
Speaker
If you're emotionally invested in the idea, go and check it out. As I said at the start, this episode was sponsored by Aiven, so thank you to them. I have been second-guessing myself all week about any unconscious bias.
01:28:56
Speaker
I don't think there was any, but I'm sure you'll let me know in the comments either way. And either way, if you've enjoyed this episode, please take a moment to like it, rate it, or share it with a friend.
01:29:07
Speaker
And make sure you're subscribed, because we'll be back soon with another episode. Until then, I've been your host, Kris Jenkins. This has been Developer Voices with Josep Prat and Filip Yonov. Thanks for listening.