From Datadog to Warpstream
00:00:00
Speaker
This week on developer voices i'm joined by ryan world who's taking an interesting career path from being an engineer data dog the monitoring at scale company to the co founder of his own company what stream he has jumped ship in order to reimplement apache kafka in go.
00:00:21
Speaker
with the fundamental design decision that they're going to use S3 for all the storage. The logic to this being that if you want to build a platform that stores and processes vast amounts of data, S3 is going to sacrifice some latency for a storage system that's vastly cheaper than local disks and endlessly scalable.
00:00:45
Speaker
In order to unpack Ryan's journey, we go through his experiences at Datadog, the things he's learned trying to replace live systems that are currently dealing with ludicrous quantities of data.
00:00:57
Speaker
how and why he chose Kafka as the target for his next project, and why he chose Go as the language to do it in. And of course, fundamentally, what happens when you try and build a database-like thing that doesn't really have its own disks? How do you do that? What technical problems come up? And how do you rewrite a decade-old Apache project without falling into the trap of memorizing the whole source code?
00:01:26
Speaker
Surely there are going to be some nasty surprises along the way. Let's find out. I'm your host, Chris Jenkins. This is Developer Voices, and today's voice is Ryan Well.
Challenges at Datadog and Legacy System Replacement
00:01:50
Speaker
Joining me today is Ryan Well. How are you doing, Ryan? I'm doing great. Thank you for having me.
00:01:56
Speaker
Ah, it's pleasure. You've been at one of those places that always seems to me to be a hotbed of really thorny, interesting problems. Because you used to work for Datadog for a while. And they being a repository of half the world's logging information, have some serious scale and data headaches.
00:02:20
Speaker
Yeah, that's to put it nicely. Datadog is a very complicated engineering problem that had to evolve as the
00:02:36
Speaker
cloud-native companies or cloud-first companies were starting. So data got to start out very simply, but now it's used by huge enterprises that generate a ton of data. So it's a very interesting challenge. I've worked in the past for companies that were brought down by their own log files, but a company gathering everybody's log files must have had some serious problems. Give me an idea of what you worked on while you were there, because I know it's going to lead into what we're really talking about.
00:03:06
Speaker
Yeah, so my co-founder and I both worked at Datadog for, um,
00:03:18
Speaker
roughly three and a half years or so. I think that's about right. And during our time there, we replaced the legacy search and query system for what Datadon calls the event platform internally, which is a
00:03:38
Speaker
If you look around at Datadog's UIs, you can kind of notice there's some similarities between the different products. And at a high level, there's metrics. And then there's the event platform. And most of the products are some amalgamation of those two different types of things. So metrics, which are the pre-aggregated time series, and then events, which look more like structured logs. So most of the products at Datadog are some combination of those things. Pretty graphs and tables.
00:04:07
Speaker
Yeah, basically. But there are two pretty distinct storage systems and two distinct teams within Datadog. And my co-founder and I joined Datadog to replace the legacy version of the search and query system for the event platform. But you can kind of think of just the data being stored there as like JSON with the timestamp.
00:04:34
Speaker
And then you want to search for it with full-text search and run analytics queries like aggregations and group by and stuff like that. Yeah. Again, that's one of those problems that seems simple in the small. I imagine it gets horrific in the large. Yeah.
00:05:01
Speaker
There are a lot of challenges across multiple dimensions when it's a multi-tenant product. I think the sheer volume of the data is one of them. But because it's a multi-tenant product, it's not like all the data is in one place. There's a bunch of different customers that have their own isolated sets of data. So that makes some problems easier.
00:05:23
Speaker
But it also makes some problems harder because of the variety and variation in the queries that you have to run. They'll hit every different edge case of your query engine and every different edge case of your storage system because of the size of a log might vary by 100x between customers. Some customers would admit huge logs that had tons of stuff in them. Other customers would admit a really high volume of really small logs.
00:05:52
Speaker
There's variation all over
Migration Strategy and Cloud-Native Architecture
00:05:56
Speaker
the place there. So there's a lot of different pieces and parts that you have to get right to make the system work as a whole for everyone. Yeah, I can imagine this. One of the things about running cloud services is for every corner case in your code base, there's a customer who absolutely exercises it, right? Yeah, exactly. Yeah.
00:06:20
Speaker
Do you have any strategies for dealing with that at those kinds of scales? So I think the strategy that was the most effective for us because we were migrating. We were migrating. All of the existing customers, they're same queries, the same data set. Ideally, they weren't going to notice anything different about this migration while it was happening.
00:06:50
Speaker
So the strategy that was the most effective for us was shadowing, where we would adjust the same data set in both systems. And we would shadow the queries from the live system to the new system called Husky that we built. And over time, we would crank up the shadowing percentage from like 1%. And then eventually, we'd make it all the way to 100.
00:07:15
Speaker
And we would log any of the mismatches between the old system and the new system. And that gave us the confidence that we weren't changing the behavior and that we weren't
00:07:29
Speaker
creating any new bugs that we didn't know about. Obviously, there were some old bugs that we found along the way, some old behaviors that we considered a little weird that we had to keep because we needed to be compatible with the old system. But I think that was the most effective one for us was shadowing.
00:07:48
Speaker
Yeah, I've had a few people say that as a strategy for replacement. Like, don't try and do the big bang rewrite. If you can, but if you have to, then at the very least keep the old system running around while you check the new ones the same. Yeah, it's obviously expensive to do that. But I think the way that we did it was basically we did it
00:08:16
Speaker
product at a time and then subsets of tenants within that product at a time. We didn't migrate. So we didn't have to run full capacity versions of both things all the time. Oh, yeah. And that also gives you a chance to ramp up the new system, right? Yeah, yeah. I mean, this was a migration that took from start to finish. It took multiple years. So there was never a date where we had to like, you know,
00:08:42
Speaker
shut off the old thing and be fully migrated to the new system. It was a pretty gradual process because as we were rewriting or creating Husky, we started with the simplest products first in terms of the features that they exercised in the query engine.
00:09:04
Speaker
And the first product that we migrated was network performance monitoring, which is like network flow data. The needs of that product at the time from the query engine were pretty simple. It's just aggregations and group by. There was no full-text search.
00:09:26
Speaker
There were a relatively small and static number of columns in the dataset, so it was a pretty easy one to do the migration for. Then I think the last one was logs, which is like a huge product. It touches every part of possible edge case. It has a bunch of special cases because it's a really old product as well. Along the way we got to
00:09:53
Speaker
The shadowing helped a lot, but also we just picked a good path through which to do the migration. We didn't start with the hard one first, we picked the easy one first. Yeah, cut stairs into the hill, that kind of thing. The one thing you haven't mentioned is what is it that made the old system legacy? What made you think we need to get rid of that? Yeah.
00:10:16
Speaker
There are a bunch of different issues with it, but I think that the big one from a cost perspective was that it stored all of the data on local SSDs. So it was not a cloud-native architecture by any means.
00:10:37
Speaker
The data was ingested onto the same nodes that queries were running against. The data was being written to local SSDs, so it was just not a very elastic system.
00:10:52
Speaker
As you might imagine at a company like Datadog, being able to elastically scale up in response to data ingestion is a super important task. That's the thing that the customers are paying for, essentially.
00:11:08
Speaker
We don't want to think about what it takes to run a system like Datadog. We're paying you for that. We would like to be able to just send you 10 times as many logs between one day and another, because maybe there's a Black Friday event, and we're an e-commerce company, or it's a big event.
00:11:33
Speaker
The Super Bowl, there's just so many things that could be going on in the world that are seasonality driven things for the end user. And you need to just accept their big volume of data, because that's the whole point. And then that worked before Husky. It didn't work.
00:11:59
Speaker
It was a lot of operational burden on the team that was running the system I dated on, basically. It was backed by people who were on call and would get haged when these things would happen and would go deal with it, basically, instead of it just being handled automatically in software by the system. Yeah, yeah. Throw people at the problem as a fine solution until you find you are one of the people, right?
00:12:22
Speaker
Yeah, exactly. And during the migration, we got to have a pretty strong taste of what it was that we were replacing in terms of being on call for it. So we felt the pain, for sure.
Cost Savings and Technical Benefits of Migration
00:12:40
Speaker
So it's interesting that you describe servers running in the cloud using SSDs as not cloud native. Interesting, almost to the point of being controversial.
00:12:53
Speaker
Yeah, I'm okay with that. I think that when most people talk about cloud-native things these days, they're basically just not, like they're just not
00:13:09
Speaker
Describing reality like they're describing software that could have equally have run in a data center just the same as it does today on like VMware VMS and it would be no different from an experience perspective the thing that makes To me what cloud native means is leveraging Higher level
00:13:35
Speaker
It's hard to even describe it in terms of levels, but there's things that can't exist outside of the context of a hyperscaler cloud provider, like object storage. That fundamentally can't exist within a data center environment because in order to get the, at least not at the same cost, I should say, the API, you can mimic. But to get it at the same cost, you need
00:13:59
Speaker
hundreds of thousands or millions of spinning disk hard drives, and you just are not going to get that in your data center. So what makes something cloud native to me is not just we took something that would have worked in the data center and installed it on cloud provider VMs. That's basically just what lift and shift means, which people don't use that word very much anymore because it's kind of like
00:14:26
Speaker
a 2010s cloud migration word, but that's what people have been continuing to do is they just take software that would have worked just fine in data center on VMs and they run it in the cloud. And the only thing you don't have to do at that point is figure out where your VMs are coming from, but it would have just been the same thing as it would be in the data center.
00:14:51
Speaker
Yeah. Yeah. Okay. I can see that. So your definition of cloud native then is something closer to akin to pulling the computer architecture apart and saying, okay, disk is a service. CPU is a service. The bus between the two is a service.
00:15:10
Speaker
I think that it doesn't need to be so highly articulated into that type of a definition. I think it's kind of a you-know-it-when-you-see-it type of thing. It depends on what the system is doing, whether or not it could be, whether or not it's cognitive. But I think that the shift towards
00:15:39
Speaker
The shift towards serverless architectures makes it a little clearer. I think that basically if you have to provision something in terms of individual, you're thinking about it in terms of I have VMs somewhere, that if you go to a database provider and the thing that you provision is I need this number of VMs, it's probably not cloud native.
00:16:08
Speaker
Now it may use cloud native things under the hood, like with AWS Aurora, like you do provision of VM, but they have this massive scalable storage system under the database that makes it much more magical than vanilla Postgres installed on a VM. So that's like kind of on the border of being cloud native, but it's
00:16:29
Speaker
I think a good way to think about it is if you have to go provision a VM with a certain amount of storage from RAM and CPU, it's probably not a cloud data thing. Okay, yeah, that's fair. So I want to know what the results were of this migration you did for Husky, you called it. Because it's going to lead into what you did next, but what was the after picture of that system?
00:17:00
Speaker
Obviously, saving money by moving data from local SSDs to object storage, that was a huge winner. That's not obvious. Why was that such a difference? The math behind it is pretty straightforward, I think, to explain. If you're replicating data three times on local SSDs,
00:17:27
Speaker
You can go to AWS and it's easier to talk about for EBS because EBS has a specific list price per gigabyte compared to instant storage where it's baked in with the hardware. EBS is roughly eight cents per gigabyte. With the legacy system, it was triply replicating the data onto SSDs.
00:17:53
Speaker
So you need three of those. That's 24 cents per gigabyte. And then you can't run the system at 100% capacity because that would not be very elastic at all. So let's say you wanted to run it with 50% free space so that you can handle some burst before the new VMs come online for when you want to scale up. And that gets you to 48 cents per gigabyte. And that's like,
00:18:21
Speaker
that's in the abstract, what does that number mean? Compared to S3, S3 charges 2.3 cents per gigabyte per month. So it's a pretty dramatic cost savings if you can move the data from triply replicated local SSDs with 50% free space to S3, where you pay for exactly the number of bytes that you have stored, and that's it.
00:18:47
Speaker
So that saves a bunch of money. Obviously, the system needs to change. You can't just swap one out for the other directly. That's why we had to rewrite the whole thing. And it was a huge project that took a long time. But that was a big winner.
00:19:07
Speaker
But I think those are actually not that interesting, basically. So the saving money ones are obvious, and that's a big motivator to do the project. But the product level stuff that you could do once you had this, it lets you kind of uncouple
00:19:29
Speaker
Like data retention becomes more affordable at that point. It's not just the data ingestion part that you're thinking about anymore. It's now you can make products that are more useful when you retain the data for longer.
00:19:44
Speaker
because you're no longer tied so directly to the cost of storage. So you can offer different products, basically. And Husky is obviously not the only part of this, but Datadog announced their product called Flexlogs at some point in their, I don't know exactly when it was, but during one of their recent conferences. And it's a,
00:20:13
Speaker
completely different model from what they were selling the standard tier of logs for. So retention is very affordable in FlexLogs, but you pay to provision compute that's sized differently based on how much compute you need to run your queries. So it's like kind of decoupling those two things now.
00:20:39
Speaker
And then along the way, we also removed the limitation of the old system that you would need to.
00:20:46
Speaker
You have to specify upfront what fields were in your logs so that you could actually query them. That was a limitation of the old system where you had to say, oh, this field name is a string type. And if I don't specify that, it's not searchable. You can see it if you could find that log through some other way. And you could click on it and look, the field would be there, but it wasn't searchable.
00:21:13
Speaker
So we removed that limitation, which was a huge product enhancement. The migration was super effective across a bunch of different axes.
The Birth of Warpstream
00:21:28
Speaker
Working on that product was really fun because it was a technical challenge, but also the business and the product stuff that we got to drive as a part of that was also really fun to see.
00:21:43
Speaker
I can imagine it was very satisfying and very, very, very rewarding from point of view of the results and also good brain food for what came next, I assume, because what you did next isn't that far away, but it is a bit of a jump. So take me to where you got next. Yeah. So along the way, during this migration, we
00:22:11
Speaker
Basically, when it was over, looking back, we were saving a lot of money because the high pole and the tent from a cost perspective was the legacy search system.
00:22:28
Speaker
Once we had changed that, we removed the huge amount of cost from the legacy search system. We were looking around at what was left, basically, in the costs. And Kafka was a huge part of it. And at the beginning, we would have never guessed that, I think, going in just because of the ratio, the way the ratios were. But by the time we were at the end, it was a huge fraction of the cost.
00:22:57
Speaker
My co-founder and I left Datadog to basically solve that problem and some other problems related to Kafka by taking what is philosophically the same approach that we did to the legacy search system at Datadog and applying a lot of the same techniques to Kafka. So we built a work stream
00:23:25
Speaker
which is a drop-in replacement for Apache Kafka, but it stores data only in object storage, not on any local disks, which is philosophically the same thing that Husky did with the legacy search system.
00:23:41
Speaker
We're going to get into that, but I just have to ask you before we leave the Datadog, what was your motivation there? Did you look at Datadog's bill summary and think this is going to be their next target? Let's jump ship, build a service and sell it to them when it's ready. Or did you just think this is a really interesting problem that we know how to solve?
00:24:07
Speaker
So I actually had pitched this same idea to my co-founder before, like when we were working on the system before we joined Data Dog that eventually was like the inspiration for Husky. I pitched this idea to him too. Basically, what if we could take Kafka and put it
00:24:33
Speaker
On top of storage and the other idea that I was pitching to him basically this was Like early 2019
00:24:42
Speaker
I pitched them these two things. The second one was, let's build a calmer database for observability. Those were the two ideas that I had. He did not think that the Kafka one was very interesting at the time. He thought it was
00:25:03
Speaker
A little boring, I think, which it is. It's not like a super exciting idea, but also just like he didn't use Kafka. He was working at Uber at the time on their metrics database called M3.
00:25:21
Speaker
And they didn't use Kafka for M3. They had some other system that was like an in-memory message bus that they used because it was cheaper than Kafka.
Rewriting Kafka for Scalability and Cost Efficiency
00:25:32
Speaker
They didn't need the durability, I guess. So he didn't see the value in it, I think, at the time.
00:25:43
Speaker
But I think once he was a data dog for a little while, he really got it. And the problem is very...
00:25:53
Speaker
industry agnostic, I think basically every vertical needs something that looks like Kafka for some reason or another. Even if it's just something as simple as you're using it to move data from point A to point B and integrate it in between disparate systems, that's like kind of a universal enterprise.
00:26:19
Speaker
challenge, which is why people use Kafka. It's kind of like one place where people can put all of their stuff as it's moving around in a common API to integrate things together. So it doesn't really have anything specifically for Datadog, but the problems that we set out to solve were most acutely affecting companies like Datadog, I think, with the
00:26:46
Speaker
We'll get into it a little more from the technical perspective, but the cost of storage and then the cost of interzone networking and the having to be scaled for peak load. I think within observability, security, IoT,
00:27:05
Speaker
those verticals, I think, feel the problem most acutely. But it's pretty universal. And most companies nowadays have some slice of all of those things within their enterprise. Like financial services company has, nowadays, they have IoT data, they have security data, they have observability data. And maybe it's in Kafka, maybe it's not, because maybe they outsource this function to a vendor.
00:27:36
Speaker
But a lot of them don't. Or they do hybrids where they use a vendor for the super important stuff that they want the best experience for. Or some department within the enterprise uses it, but another department doesn't. They self-manage the system. So it's kind of all over the place. Basically, these problems are pretty universal.
00:27:58
Speaker
Yeah, I think they almost inevitably crop up as soon as you find a company large enough to have several IT departments. Then they will have data integration problems. Yeah, exactly. There's so many enterprises out there that the way that they evolve historically is through
00:28:22
Speaker
murders that inevitably things are a complete mess on the inside and they need huge amounts of data integration. Kafka is very effective at being the single place where people put everything behind a common API.
00:28:47
Speaker
The data warehouse is a little bit too opinionated as the place to do all that integration work because developers for whatever reason don't. It's hard to get
00:29:00
Speaker
everything integrated into one place in SQL because not everybody wants the schemas or the schemas for a SQL table or a whole. They don't quite match what their application does, but with Kafka, you can just show whatever bytes you want in there. Optionally, you can use schema registry and stuff like that to get a little bit under control, but Kafka is flexible enough that you can make it work for just about anything. Yeah. Sooner or later, someone wishes they could just dump the raw data and someone else wishes they could just read the raw data.
00:29:31
Speaker
and then you build stuff on top of that. Okay, so let's get into the nitty gritty of WarpStream. And let me put it this way. So someone coming to Kafka for the first time would probably hear the phrase, we separate storage and compute. And they would think, well, okay, so that sounds like it's gonna be fairly easy to separate the compute from S3 storage, isn't that job done?
00:29:59
Speaker
So take me from that very beginner's perspective to what you actually had to do. Yeah. So first of all, a lot of people don't under like when you say that you separated the storage and compute for Kafka, they don't quite understand what you mean because they think that Kafka is just storage. So what's the compute? Um, which is somewhat true. Uh, I can understand why people would, would have that, uh, first reaction.
00:30:27
Speaker
And for Kafka, when I say that, the thing that I mean is basically there's the... When you're talking about Apache Kafka, I think the best way to think about it is there's the broker that presents the TCP server that is the...
00:30:46
Speaker
thing that clients speak to to read and write data and do metadata operations like stuff with consumer groups and things like that. And then there's the local disk on the broker that is the storage. And the combination of the replication protocol and all the local disks on the brokers is the storage system basically. And the compute part is the interacting with clients and
00:31:12
Speaker
getting them the data that they're asking for when they're reading something like translating the, I want to read topic A, partition one, offset X, translating that into reads and writes on the local disk on the machine. That's like the compute part, that translation.
00:31:34
Speaker
So there's a lot of hard problems along the way when solving that. And I think that the biggest one that we had to solve was Kafka makes files on disk that are per topic partition. So like there's...
00:31:56
Speaker
When you think about what a Kafka partition is, it's essentially a log of the records that you wrote to that partition over time. But there's a few other files that go along with it. There's a file that's an index by time. And then there's a file that's an index by offset so that you can seek efficiently within the partition. The problem when you translate that into object storage is,
00:32:25
Speaker
Let's say that you understood that it's object storage, the latency is going to be a little bit higher than it would be on a local disk. You want to write a new file so that you can acknowledge new records that some client wrote. You wanted to do that every 250 milliseconds because you don't want it to be too long. You want it to feel reasonably real-time. If you write a new file to object storage every 250 milliseconds,
00:32:54
Speaker
you're going to pay roughly $50 a month in put requests for your object storage. So if you have one partition, the minimum cost of that partition, if you write one record every 250 milliseconds, is $50 a month, which is obviously not going to work. So we had to figure out how to solve
00:33:19
Speaker
that problem first. And the way that we did that is instead of writing files that are per topic partition, we have our stateless broker thing that we call the agent. The agent writes a new file to object storage every 250 milliseconds, but that file contains records for all of the topic partitions that were written to that agent.
00:33:49
Speaker
in the last 250 milliseconds. So a file, we've now grouped all of the records together for each topic partition into these like 250 millisecond windows of time.
Efficient Data Retrieval and Storage Strategies
00:34:01
Speaker
We also do it based on size. So if you write more than eight megabytes of uncompressed data within that 250 millisecond window, we'll also make a new file.
00:34:11
Speaker
But that's the basic idea. So once you do that and you've decoupled those things, you can get the cost down to around $50 a month and put requests per agent.
00:34:26
Speaker
which you run way fewer of those, and they're not directly coupled to how many partitions you have anymore in the cluster. So you can have thousands of partitions, and it doesn't change the cost of the cluster from an object storage puts perspective.
00:34:43
Speaker
So what you're doing, just make sure I've got that clear. So you've got three agents running. You've got different clients connecting to them. Each one is taking a time is gathering up the data every quarter of a second, slicing it by time and shipping that. And you'll end up with three new files because you've got three agents times four in a second.
00:35:06
Speaker
You don't need to run three agents. You could run one or two. Obviously, how highly available it is changes when you run one or two, but it doesn't change the durability of the system because all the data is either in object storage or the producer request hasn't been acknowledged yet. So you can get away with running just one agent if you decided you wanted to have the absolute rock bottom.
00:35:32
Speaker
cost. But the math is right there. It would be 12 new files per second. OK. So hang on. So this seems like it changes quite significantly the way data is stored on disk, which makes me wonder how things change when you're trying to read it back. Because a consumer in Kafka reading off a partition has a fairly easy job of just chunking through the disk. You've just made it harder.
00:36:03
Speaker
Yeah, it's made a little harder. I think that if you... So it's funny that you bring up that comparison because Kafka relies extremely heavily on the page cache on the operating system to make reads efficient.
00:36:25
Speaker
When the data actually ends up on disk in Kafka, it actually looks not, it looks even more disorganized than how it looks in workstream in terms of, yeah, cause the file system is not, uh,
00:36:41
Speaker
It can't perfectly, it doesn't take a file and like perfectly lay it out in a linear order on disk. Most file systems look somewhat like a tree. So when you're writing new data to a file, that new append that you created, it's not going immediately next to the append that was before it. It ends up in some other place on the disk. So with Kafka, if you have lots of partitions,
00:37:10
Speaker
you end up in exactly the same position that we had not been in WarpStream and we have a solution that's very similar to it. The way that you read it back out in WarpStream is the agents form a cluster with each other in each availability zone and they
00:37:32
Speaker
In that cluster, we route read requests for files using consistent hashing to one of the agents in that availability zone. And that agent will take care of caching in fairly small chunks, blocks of data from those files in S3, so that when you're reading the live edge, like the most recently written data,
00:38:01
Speaker
The effect that we get is basically when you read out of S3, we'll only read, we'll do one S3 get for every four megabytes of data per availability zone.
00:38:15
Speaker
the agents take care of acting like the page cache in the operating system where they keep the recently written data in memory so that reads are very efficient because they are really, as you might imagine, if you have a lot of clients writing to a lot of different partitions, you'll end up with lots of little tiny chunks of data that you need to read. And we take care of that by
00:38:40
Speaker
lots of batching. If you're going to send a read request for this 10 kilobytes, and there's another client that's concurrently reading 10 kilobytes that's just a little bit further away, we do lots of batching to make the RPC overhead of this pretty efficient. But the problem is essentially the same as it is with
00:39:08
Speaker
We have to, we, we make sure that you have enough, um, as long as you have enough memory to read the recently written data, which is, you
00:39:21
Speaker
but it'll work just like the page cache does in the operating system. But the thing that we do that Kafka does not do that makes it possible to replay historical data with really high throughput is that we take the
00:39:38
Speaker
Lots of the 12 small files per second that the agents were creating in your example before, and we can pass those files together. So we merge them into much larger files so that later when you want to come back and read, let's say that 10 kilowatts of data that was in one of the files that you were reading as a live consumer, if you reset your offset back to the beginning and read the data that's been compacted, you might get
00:40:07
Speaker
tens of megabytes worth of data that you're reading at a time. You can read that out of S3 with really high throughput. Kafka doesn't do this. The storage system for Kafka does not recompact any of the data. Most file systems don't just move the data around out from under you because the performance of that would be pretty catastrophic. If it does it at the wrong time, it'll
00:40:30
Speaker
stop the live workload that the machine might be disruptive. Basically, no file systems do that type of background compaction. But we can't because the storage and compute is separated. We can schedule the compute in an intelligent way because we know what's going on in the whole cluster. We can say, this compaction job is really big and we don't have enough capacity to run it right now, so let's wait. We have that information that the operating system doesn't.
00:40:59
Speaker
Okay, so you're saying there's a separate process that goes away and grabs those 12 files and turns them into one big file sorted by topic partition?
00:41:11
Speaker
Yeah, so it's integrated. You don't have to run. It's not like a separate service or anything. You can split it out into different machines if you want to. We have the flexibility to do that. But by default, you just run one Docker container, and it does everything. And the magic is kind of in the control plane to decide what to do when you don't have to manage any of that. Right, so it's a separate logical process within the same binary. Yeah.
00:41:38
Speaker
And is that, I'm curious, is it building that as a back? I mean, I'm wondering if you're doing something like, um,
00:41:49
Speaker
the git trick where you store everything immutably and eventually you're just going to replace the pointer to those 12 files with a new pointer to the one jumbo file. That's exactly how it works. So we have a job that runs in the agent that will read in
00:42:09
Speaker
a lot of small files and then merges them into one larger file and then we replace that file. All the pointers to data within it, we atomically swap the new file and delete all of the old files. So that
00:42:29
Speaker
We don't delete the actual data from storage for a while because you may have some concurrent reads that haven't finished yet for the old files. We don't delete them immediately. They're still both there. It's a copyright.
00:42:43
Speaker
architecture where when we want to rewrite the data we make a second copy of it and then we replace it at a metadata layer and not actually physically delete it yet.
Warpstream's Custom Storage and Future Plans
00:42:56
Speaker
That makes perfect sense. I'm just curious because it's S3 so that would be storage I have access to. If I went to one of those files and tried to read the binary, what would I see?
00:43:11
Speaker
So WordStream has a custom file format. It's not particularly interesting and we have a tool for you to convert it to JSON if you want to that's built into the agent. But we're working on right now a feature that will let you basically take a
00:43:39
Speaker
logical snapshot of your cluster at some point in time and read all of the files from object storage. Like take a consistent snapshot at this point in time and then, you know,
00:43:51
Speaker
blast it with Trino or something, some query engine that you want to read all of your historical data with at once. That's a feature that we're actively working on right now. It doesn't work by directly reading the files out of storage, because again, it's a custom file format. It's not integrated with Trino or any other system. So you would still have to pull the data back out through the agent. But because the agent is stateless,
00:44:18
Speaker
you can just temporarily scale them up and then remove them. It can run just as much as you need to. And you can also run agents in a completely disconnected way for these kind of analytics workloads from your primary agents that are serving the live workload, but they're sharing the storage underneath it in S3. So they have the same logical view of like, these are all the topics I have, these are the consumer groups I have, but they're running on isolated sets.
00:44:47
Speaker
If I wanted to replicate a cluster, would that be as simple as just copying a bunch of S3 directories? So this is another feature that we're working on right now. The only thing that you can do for replication in WarpStream today is use MirrorMaker 2, which is essentially the same tool you'd use for Kafka. But one of the features we're working on right now is for either WarpStream as the source or
00:45:20
Speaker
source, you can hook up the agent to read all of the data out of an existing cluster. It basically looks like MirrorMaker again, but it's just built into the agent. And then a third feature that we're working on is this kind of replication at a
00:45:40
Speaker
kind of like the thing that you were getting at is what if I just made two copies of the data in two different S3 buckets in two different places. We're working on another feature for that right now, which it goes even a little bit further than that. The idea is that we would have a synchronously replicated multi-region active active cluster where the
00:46:07
Speaker
If we use, so inside our Cloud service, there's a database that controls, it's like where the pointers to files go, basically. That's what you can think of what this database is storing. If we moved that database from a regional database that is now, if we moved it to a highly available multi-region database like CockroachDB Cloud Spanner,
00:46:38
Speaker
and the agents synchronously wrote to multiple S3 buckets, we could give you a multi-region synchronously replicated active-active cluster. As long as all of the agents in all the different locations can read the data from these multiple S3 buckets,
00:46:54
Speaker
With obviously a preference to read it locally if it was available because it would be cheaper. You would be able to survive a, let's say that you have a configuration of three regions. Like you had one in the US East Coast, US West Coast, and Europe.
00:47:14
Speaker
If you had that set up, you could survive an outage of one of those regions and the other two regions would keep running. It's obviously a more expensive configuration because it requires copying data from one AWS region to the other. But it's definitely something that for a subset of data, it could be worth it if it was mission critical information. Because a lot of companies do this today anyway. They just use MirrorMaker to copy data from one.
00:47:44
Speaker
continent to the other. But we could have that kind of built in. Yeah, I could see that you replicate the financial data, but you don't worry too much if the product catalog gets stale, that kind of thing. Yeah, or like the observability data can stay region local. But yeah, the
00:48:07
Speaker
transaction history that's, you know, like directly touching payments would probably want, you probably want that replicated in multiple regions for disaster recovery purposes.
Reimplementing Kafka Protocol in Go
00:48:19
Speaker
That makes sense. So let's go into the how of this. Cause if you're adding features, that makes me wonder, how have you written this? What have you written it in and how did you learn which bits of Kafka you needed to replace to make this work?
00:48:37
Speaker
Yeah, so the first one is pretty simple. Everything is written in Go. The full system is written in Go. We are shooting for just full Kafka protocol compatibility. The only major thing that we don't have left is transactions. We have idempotent producer, which is a big part of transactions, but we don't have the full implementation of transactions that's actively being worked on right now.
00:49:06
Speaker
So we're shooting to just be a drop-in replacement for all the features. We have compacted topics, consumer groups.
00:49:14
Speaker
Most Kafka applications that are not using transactions will just work today as a drop-in replacement, which is most of them. Especially when you talk about the simpler use cases like, I need to move my application logs from point A to point B. That's the most trivial usage of Kafka there is, basically. It uses almost no features.
00:49:41
Speaker
So that one works extremely well today. It is very cost effective and we have customers and production. That's exactly what they're using work stream for. They're saving a lot of money. Sorry, what was the third one? I think I covered one and two, but what's the...
00:50:00
Speaker
How do you go from, okay, we know we want to re-architect or replace Kafka, so it uses cloud storage. I have go in my left hand ready to go, if you'll forgive the double use of the word go. How do you actually learn which bits do you need to replace? Do you start by just pouring through the Java code base?
00:50:28
Speaker
We basically don't look at the Kafka code unless we absolutely have to. And there are some cases where we have had to
00:50:41
Speaker
Unfortunately, because the documentation is either misleading about how features work internally. Especially with consumer groups, consumer groups are full of undocumented stuff, undocumented behaviors that if you were implementing the broker side of consumer groups, basically you would need to understand.
00:51:10
Speaker
There, in theory, maybe there was a message on the mailing list seven years ago that explains what this was. And there probably is that message. But just because Kafka was a single implementation specification, essentially, the implementation is the specification for all types of services. They didn't write this back and then start implementing it. Yeah. And you can tell that a lot of the KIPs are basically
00:51:38
Speaker
written backwards, like somebody wrote the implementation first, and then they wrote the KIP later. And because there is essentially one vendor who's dominant in control of the Kafka open source project, because they employ a significant number of the committers. Some of these things are just like
00:52:00
Speaker
It's obvious that the implementation was written first, or like an interface was written to work with something that is already happening internally at the vendor. So we do have to look at the Kafka code from time to time, unfortunately, but we try to take a fresh perspective on how these things should work so that we're not, our minds are not poisoned by something's good choice from 10 years ago.
00:52:30
Speaker
Yeah, OK, I can see that. What do you do then? Do you fall back on your own trick of shadowing? Do you have a Kafka cluster that you compare your exact behavior with? We do that as we have some integration tests that do that for certain things. I think what is more useful than reading the Kafka
00:52:54
Speaker
Java code is actually reading the Kafka client implementations. Because the clients that are not lib already kafka and the Java one. So all of the clients that are custom written for other languages
00:53:17
Speaker
Because they were not so closely evolved with Kafka, they have a lot of interesting comments in the code about what they had to do to get it to work with this latest release of Kafka. So the clients are actually very interesting and instructive for what to do. And then other than that, we just try to look at the protocol specification.
00:53:43
Speaker
And a lot of the protocol messages are self descriptive about what should happen based on the name and what the names of the fields are.
00:53:55
Speaker
and some basic description of what that API method does. A lot of them are very self-descriptive, so you don't need a lot of... It doesn't take a lot of work to get compatibility. Consumer groups is basically where that all falls off a cliff, like the fields. The consumer group protocol, they just kept shoving stuff into the existing messages and just having implicit behavior baked in.
00:54:23
Speaker
There's just a lot of settings that you can attach to consumer groups that are very complicated. That was the hardest one for us, I think, was getting to compatibility with consumer groups. Can you give me an example? I think the biggest example is the
00:54:46
Speaker
The behavior for static members in a consumer group, the whole concept of static members and dynamic members, I guess that's the opposite of a static member. That distinction was added later, along with the concept of incremental consumer group rebalance was also added later.
00:55:15
Speaker
What you have to do at each point in the consumer group protocol varies based on what kind of member it is. And this information is not encoded. There's not a static member joint group request, dynamic member joint group request, with the right information for each one, I guess, for backwards compatibility reasons. That would be my best description of why I think it worked that way.
00:55:45
Speaker
I'm not our internal expert on Kafka consumer groups. By any means, it's some other members of our team that were a little closer to that. But it has been a continual evolution for us to get closer and closer to ...
Benefits of Kafka Ecosystem and Go Language Choice
00:56:03
Speaker
This was six months ago, I think, when we were going through this process.
00:56:10
Speaker
good now, but just along the way, our first implementation of consumer groups went back in.
00:56:17
Speaker
I think it would be August of 2023 that we shipped that. It was just broken because we wrote it entirely based on just looking at the protocol messages and saying, okay, this is how it should work because the protocol messages, they're very obvious. There's no reason why there should be any problems here. But when we first encountered somebody's real application in the world,
00:56:41
Speaker
Because it mostly worked if you write a very, very trivial application. It mostly worked. But when you first encounter somebody's re-application in production, it would fall over spectacular. Oh, yeah. Yeah, I can imagine. Because it's like...
00:56:57
Speaker
If the API looks self-documenting, but it's evolved in ways that have attempted to move the behavior forward without changing the messages, that could be very misleading. This leads me to a philosophical question, if I may. Excuse me.
00:57:17
Speaker
If Kafka's protocol is quite complicated, and it is, and it's got these gotchas, why do you think it is that there are so many services cropping up that treat Kafka as a protocol? Because it's not that people are saying, this is the best protocol in the world, we must jump in and replicate it. I think that it is the strength of the ecosystem
00:57:43
Speaker
that makes it, Kafka was essentially the only game in town if what you wanted was open source, or more importantly, if what you wanted was free. So the
00:57:58
Speaker
The ecosystem that has built up around Kafka over the last 10 years or so is pretty strong. And there have been attempts by other vendors to introduce other protocols, like there's Kinesis, which is a proprietary protocol for
00:58:20
Speaker
you know, it's a service that's only available in AWS. There's Pulsar, which is technically an open source project, but it doesn't have nearly the strength of the ecosystem that Kafka does, so it's usually just this kind of niche. But Kafka works with essentially everything at this point. So I think that's why it's such a popular,
00:58:47
Speaker
protocol to re-implement because it lets you piggyback off of a lot of existing work, integrations with existing tools. Like if you use WorkFrame today, like if you use our serverless product that we
00:59:04
Speaker
We run everything for you. It's just you connect to a Bootstrap URL over TLS, over the internet. If you use that, you can go to other vendors that have integrations with our competitors and just use their integration. You can put in your sasl plane username and password and the Bootstrap URL that says warpstream.com in it into a competitor's
00:59:32
Speaker
branded integration with some other vendor. Tiny Bird, for example, is an analytics provider. They have a very nice UI, very easy to use product for running, basically doing SQL materialized views over your streaming data.
00:59:55
Speaker
And if you go use the branded integration for Confluent Cloud with WarpStream, it just works because it's the same protocol and it works basically the same way. The strength of the ecosystem there was just great. We didn't have to go talk to Tiny Bird to get the WarpStream logo on their website or any of that to get it to work. It just worked because it's confident. Right, yeah. So it's one of those cases where once
01:00:25
Speaker
Once a protocol has reached critical mass, you follow that one because it gets you in the club. Yeah. And I agree that Kafka is not the best protocol. Like there's a lot of things about Kafka that are hard to use from a user perspective that are driven by limitations on either the protocol level or the, you know, the implementation of open source Kafka at the, you know, the job.
01:00:51
Speaker
code that we would love to evolve over time, but it will be pretty slow because we can only basically move at the speed of Kafka for some things. There are plenty of things that we can evolve on their own because we can kind of hack our way around them within the existing protocol. But there are some things that we can't. So it's helpful that we get a little bit of a head start, but it also makes us more complicated because we're not as fully
01:01:22
Speaker
to provide at the product experience level. Yeah, yeah, I can, blessing and a curse, but that's the reality of programming sometimes, right? I just find it curious. Yeah, yeah, I can believe that. I find it curious that the same thing isn't happening with, it's not like there are a mountain of companies saying we are Postgres API compatible, not in the same way.
01:01:49
Speaker
I think that for Postgres, I think it's a little different. So there are some. They may not pitch themselves as Postgres compatible, but the way that they say it is postgres, but they've actually swapped out a bunch of this stuff out of the hood. They're probably using Postgres code. I think the biggest one that has come around recently is Neon.
01:02:18
Speaker
Neon is Postgres compatible in the sense that they actually wrapped the single node part of Postgres with a new storage system and a new API load balancer thing, but there's still Postgres running way down in the middle, whereas WarpStream is not like that. There's no Java stuff running anywhere in WarpStream at all.
01:02:42
Speaker
So I think for Postgres, it's just like it's a more complicated thing to replace. So the way that people do their replacement is they reuse as much of Postgres as possible.
01:02:53
Speaker
That's the way that they do it. And there are, you know, AWS Aurora is kind of like this as well. I don't think it's known publicly whether or not AWS uses the actual Postgres code in Aurora, but I imagine that they do just because it would be easier. And they did the same thing as Neon, where they swap out the storage engine and have a new API network.
01:03:12
Speaker
a load balancer thing in front of it. So that's, I think, where the difference is. And the other difference is there's other SQL implementations besides Postgres. So you can still get the benefits of switching to another system without necessarily having to rewrite everything. It's easier to rewrite a little bit of SQL.
01:03:34
Speaker
to switch to a new system compared to rewriting everything that processes any data in Kafka because the client library has completely different behavior. I can go, there's an abstraction layer for SQL databases where you can connect to, you just have to make sure the driver is installed. But once you do that, the interface for interacting with the SQL database is the same. It's just like the SQL text you might have to change between better.
01:04:00
Speaker
Yeah, yeah, that makes sense. That makes you wonder, you say that with quite a lot of enthusiasm. Are you happy with the choice of Go? Yeah, it's been, so that's what we wrote Husky in. It's what we chose to write workstream in because it's a very no nonsense programming language.
01:04:24
Speaker
And my co-founder and I and the engineers here at Workstream know it pretty well. So it's a combination of the culture of people
01:04:40
Speaker
at least that we interact with using go we don't spend time like talking about go we talk with there's no discussions about programming languages at work stream because we just want to like write code and get things done we don't spend time thinking about.
01:05:01
Speaker
There are other popular programming languages today that I'm not going to engage in a programming language debate, but a lot of their community focuses on things about the programming language and their program language versus another programming language. They focus on that a lot versus actually delivering real projects in the real world that provided a lot of value. I'll put it that way.
01:05:27
Speaker
Okay. I'm avoiding entering into that debate. I'm just listening to your perspective. And to be clear, Java is not that. Java, I think, very similar way to go. There's a lot of nonsense in the culture, I think, but that has nothing to do with debating about whether Java is a good programming language or not. Java is a very
01:05:50
Speaker
get things done, language as well. Nobody thinks it's cool anymore. It's got, it can do anything. You can contort it to do anything. So I think Java is also a, we could have written the work stream in Java. I think if we like happened to be people that were Java experts, I think that was more why we picked Go is because we are, we know everything about it. And it was easy to hit the ground running.
01:06:20
Speaker
It's a testament to it. And bring on new developers, I would say. I think it's a little easier for developers to learn Go than it is to learn Java if you've never interacted with it before. Yeah, I can believe that. And it's a testament to it that if you've done one major project in a language and you're not saying never again, that's usually a good sign for a language.
01:06:44
Speaker
Yeah, yeah, it's, it's just go you have to learn a few weird things. But the number of weird things you need to learn is pretty small to be productive in it. And then once you once you've kind of mastered those, and you have good
01:07:05
Speaker
which is another great thing about Go, is the tooling is fantastic. And it's very easy to write your own tooling, because the parser for the programming language is available for you to import into a Go program. And we have things that run in our CI that we wrote that check our source code to say that they don't.
Latency, Cost Efficiency, and Future Potential of Warpstream
01:07:23
Speaker
You can write your own linters, really, basically, is the way that I'll say it. You don't have to be a programming language expert to write on a linter for Go.
01:07:31
Speaker
That's interesting. The tooling for Go is fantastic. And I think that's another reason why we've kept going with it. Even if it takes a little bit longer to get the performance to be the same as what it would be in C or C++ or Rust, the benefits that we get from a productivity perspective because of the tooling and how straightforward the language is, I think, far away costs.
01:08:01
Speaker
It's a good effort for it to go, I think. Okay, so you've mentioned performance, so that gives me a way to take it back to Warpstream to wrap this up, which is, if here I am a Kafka user tempted by Warpstream, what am I trading off? What do I get and what do I lose? Yep. So there's one big thing and one small thing.
01:08:26
Speaker
One big thing is that the latency of WarpStream is going to be higher than open source Kafka or basically all the other vendors that use local SSDs for storage instead of object storage.
01:08:40
Speaker
To put that more concretely, if you're going to use S3 standard, and it works on all the major cloud providers, so anything that's S3 compatible or GCS or Azure Bob Storage, but I'll just talk about S3 because it's the simplest. The P99 producer latency with the default settings or with a very small amount of tuning, nothing crazy, is about 400 milliseconds at the P99.
01:09:09
Speaker
And by that time, during the time that that has elapsed, the thing that's happened is that file that contains the data that you wrote is durable in object storage and is durable in the control plane metadata store database. So it's very durable by the time that that's happened. And the end-to-end latency is P99 of around 1 to 1.5 seconds with S3 standard.
01:09:35
Speaker
If you need something a little bit lower latency than that, Workstream also works with S3 Express One Zone, which is a new high-performance tier of S3. And with that, you can get the producer latency to around 150 milliseconds at the P99 and the end-to-end latency to around 400 milliseconds.
01:09:54
Speaker
With S3 Express One Zone, the way that we use it, you can get regional high availability out of it with Warpstream because as the name implies, it's a zonal service. What we do is write to a quorum of two out of three S3 Express One Zone buckets. You can get regional high availability using that storage class as well, which you don't get out of the box. That's the performance trade-off. Then there's one
01:10:23
Speaker
kind of weird configuration that you could run a Kafka cluster in that you can't like work stream won't be able to match, which is if you run a really, really tiny Kafka cluster that is producing very few messages per second, it's cheaper to pay the inner zone bandwidth and the local SSDs than it is to write a file to object storage every 250 milliseconds.
01:10:49
Speaker
So if you don't use Kafka heavily at all, like you're writing one kilolite per second,
01:10:57
Speaker
through Kafka. Well, some people do because they're in development in their application, and eventually they're going to go to production, hopefully the traffic is higher. That would be cheaper to use to self-host an Apache Kafka cluster. Whether it would be easier to manage, that's another story, but just purely talking about the infrastructure cost, it would be higher with Nord Stream because of those S3 API feeds. This is where we're saying that the costs are
01:11:26
Speaker
It was a floor, basically. But your floor is that $50 a month. Yeah, it's a little more than that because the cost, the price that we charge for having a cluster running 24-7 connected to the work stream control plane is $100 a month. And then there's the minimum footprint of an agent costs around $50 a month in puts.
01:11:54
Speaker
But to solve that problem, at least on the storage side, we're working on some alternative storage backends besides S3 for the lower volume data. The first one of those would probably be DynamoDB. So if your data is just flowing in in a very small trickle, it is actually more cost effective to write it to DynamoDB instead of S3. Okay, yeah.
01:12:23
Speaker
And I suppose the S3 API is sufficiently simple that you could use very wide range of potential storage backends. Yep. So S3, the guarantee or the features of S3 that we use, along with the guarantees that S3 provides for those features is
01:12:43
Speaker
pretty widely applicable to any storage system. Warframe only writes to new keys. We never overwrite objects. So you can
01:12:58
Speaker
You can make it work on just about anything. In theory, we can make it work on NFS. We can make it work on Cassandra, DynamoDB, Bigtable, almost any key value store or something with a key value store like API, it would work. OK.
01:13:19
Speaker
Postgres, MySQL. We could squeeze it into anything. It's just a matter of what makes sense from a cost perspective for ease of use. Squeeze it into GitHub just for fun. Probably. I don't think they would be very happy with us, but yes, we could probably squeeze it into GitHub. Interestingly, given the opportunity to tell me the trade-offs, you've only told me what's worse.
01:13:48
Speaker
So I guess I thought that's what you meant by trade-offs is what's worse. I love it. I love it you being honest about what you lose, but I'll give you the chance to say what you get in return for that. Yeah.
01:14:03
Speaker
What you get in return is a work stream is like the thing that you deploy that you run in your cloud account is a stateless Docker container. If you're going to deploy it into Kubernetes, it's a deployment, not a stateless set. So you can hook it up to whatever auto scaler you feel like it doesn't need an operator or anything like that. And all of the data is in object storage. And there are two big benefits to that. It's basically there's no interzone networking cost and there's no
01:14:34
Speaker
the storage cost is dramatically cheaper than triply replicating on local SSDs. The thing that we target is for a high throughput Kafka cluster, we can reach roughly 80% savings over self-hosting Apache Kafka, inclusive of our licensing fee, to use the cloud control plane. So it's a significant savings over self-hosting Apache Kafka. If we're using another vendor, then it's going to be even cheaper than that.
01:15:04
Speaker
Basically, it's all about making it easier to operate and cheaper. That's what you get in return for that latency tradeoff.
01:15:17
Speaker
we can achieve as good or better throughput than Apache Kafka in most situations. The interesting one is that if you just had your workload has very few partitions. In Kafka, you would be bound basically by the number of brokers in your cluster. This is a very weird situation, but let's say that you just had one partition in your whole cluster. The throughput that you would get would be limited by one.
01:15:43
Speaker
but the throughput that you could get out of one set of three-boat burgers that were relocating the data. With WarpStream, because the metadata and the data is separate, we can drive all of the throughput for that one partition through all of your agents, if you wanted to. It's a weird setup, but basically the trade-offs are completely different around what WarpStream
01:16:06
Speaker
The partitions are all virtual, effectively. There's not a physical resource backing the partition. There are lots of weird situations you can get into with Apache Kafka that would limit your throughput that you won't find with WarpStream because
01:16:21
Speaker
The way that we load balance, the workload is we're not balancing partitions among brokers, we're balancing connections to agents. And our connection balancing is literally round robin. So you get a very smooth distribution of resources, assuming that your clients are all doing roughly the same activity. Round robin will get you a very good
01:16:44
Speaker
balance utilization across the cluster, which means the more balanced you are, you can add more agents and it will remain balanced so you can drive more throughput to the cluster. Interesting. Well, I'm very interested to see how this evolves and especially as part of the landscape of Kafka as a protocol coming up with people tweaking for certain use cases and certain usage patterns. Yeah.
01:17:11
Speaker
Yeah, we're our bread and butter customer today is essentially doing something in the high volume telemetry space. So application logs, security logs, IoT, things like that.
01:17:26
Speaker
Because the ratio of staff to throughput on those workloads is very different than what you might find for other people using Kafka. So that's where it makes the most sense today because you can show immediate cost savings and reducing the operations burden for those very few staff that was running the Kafka cluster before.
01:17:52
Speaker
So I think that you'll continue to see adoption from us along those lines for now. Once we have transactions, I'd say basically the sky's the limit because we'll just be fully compatible with Kafka and we can squeeze into any use case where Kafka is today.
01:18:13
Speaker
as long as latency isn't... I've always found this in the Kafka world, everyone has a different definition of fast. And for some people, one and a half seconds is terribly slow. And for some, it's lightning fast. Yep. From our conversations with customers and users of Kafka,
01:18:35
Speaker
You'd be surprised how often people don't actually know what the end-end latency is of their processing pipeline at all. Or if you ask somebody who's at the lake.
01:18:45
Speaker
More on the business side, the product that they offer involves processing data. So they understand, basically, that's what they're doing. You ask them what they think that their latency is. They're like, oh, we're a high performance organization, and our data processing pipelines are super fast. They're 10 milliseconds end to end. And then you get a little bit further down, closer to the cluster, your developers, the operations team.
01:19:13
Speaker
And they'll say, maybe it's two seconds or so. We don't really keep track. Because most of the time, it just doesn't matter. There are very few use cases, I would say, that we come across that latency does matter. And they're basically all in financial services.
01:19:34
Speaker
And a big fraction of them are fraud detection-related. So it's like I have a credit card that's being swiped right now somewhere in the world, and I need to give a yes-no decision for whether to allow this charge. That has somewhat strict latency bounds, and a lot of times the critical path is Kafka for that.
01:19:58
Speaker
But for that, we have alternatives like S3 express one zone, and eventually once we have DynamoDB, I think we're going to be able to squeeze the weight down even lower so that we can fit into those use cases. But yeah, today, if we're just using S3 standard, most people are perfectly satisfied with one and a half seconds end to end. Yeah, I can believe that.
Closing Remarks and Outlook
01:20:23
Speaker
I wish you luck serving that niche. Keep an eye on Warpstream to see how it grows.
01:20:28
Speaker
Ryan, thanks very much for taking us through it. Thanks. This has been fun.
01:20:32
Speaker
Thank you, Ryan. If you want to check out Warpstream, you'll find a link to them in the show notes, and I wish them the best of luck. If you've enjoyed this episode, please do leave a like, rate the podcast, or share it with a friend. And make sure you're subscribed, because we'll be back next week with another fellow geek chipping away at their corner of the internet, trying to make things work a little bit better. Until then, I've been your host, Chris Jenkins. This has been Developer Voices with Ryan Wuerl. Thanks for listening.