Kir Titievsky: Modern Streaming Architecture Transforming the Service Bus

S1 E24 · Straight Data Talk
33 Plays · 5 days ago

Kir Titievsky, Product Manager at Google Cloud with extensive experience in streaming and storage infrastructure, joined Yuliia and Dumky to talk about streaming. Drawing from his work with Apache Kafka, Cloud PubSub, Dataflow and Cloud Storage since 2015, Kir explains the fundamental differences between streaming and micro-batch processing. He challenges common misconceptions about streaming costs, explaining how streaming can be significantly less expensive than batch processing for many use cases. Kir shares insights on the "service bus architecture" revival, discussing how modern distributed messaging systems have solved historic bottlenecks while creating new opportunities for business and performance needs.

Kir's medium - https://medium.com/@kir-gcp
Kir's Linkedin page - https://www.linkedin.com/in/kir-titievsky-%F0%9F%87%BA%F0%9F%87%A6-7775052/

Transcript

Introduction of Co-host and Guest

00:00:02
Speaker
Hi all, it's Yuliia from Straight Data Talk, and I have a little announcement: I have a new co-host, Dumky.
00:00:13
Speaker
Dumky, jump in and introduce yourself. Yeah, thanks, Yuliia. Hey, it's Dumky, now also a host of Straight Data Talk, and happy to get started today.
00:00:25
Speaker
Wonderful. So we're lucky to have a guest, Kir Titievsky from Google Cloud, who has been there building all the streaming stuff since 2016. When did you start, Kir? I think 2015 or so.
00:00:41
Speaker
Yeah, exactly, 2015 or so.

Kir's Background and Expertise

00:00:45
Speaker
Hi, all. My name is Kir. I currently find myself working as a product manager in Google Cloud, and I work on an Apache Kafka service.
00:00:56
Speaker
I've worked in that general area of streaming and storage infrastructure for the better part of the last decade, starting with Cloud PubSub.
00:01:09
Speaker
And I did a little bit of time in our object storage service, Cloud Storage. And now, finally, I get to play with open source with this Kafka project.
00:01:21
Speaker
And I have to say, I feel a little bit outnumbered. You have a co-host; I wish I had a guest.
00:01:32
Speaker
That's okay. The opposite way, the benefit is that I get to be special. You're certainly special, I don't doubt it. But listen, I thought you also worked on Dataflow, or am I wrong?
00:01:50
Speaker
That is true. I have done some work on Dataflow, which is a service for data processing and data integration that we built on Apache Beam.
00:02:02
Speaker
So for those of you who are familiar with Spark and Flink, it's a very similar idea, and much simpler. When you first get to Google Cloud, it's better to start with Dataflow, which is Apache Beam.
00:02:20
Speaker
Kir, I'm excited to have you. And no, sorry, we are excited to have you. No, it's we. Sorry, Dumky.
00:02:31
Speaker
You can speak for yourself. I am also excited to have you, Kir. Maybe this whole co-guest thing was a red herring. But I have two people excited to have me. This is so great.
00:02:45
Speaker
Listen, I think you are my go-to person when I'm

Understanding Streaming vs. Batch Processing

00:02:52
Speaker
thinking about streaming. And we had this client, you know, who does only streaming.
00:02:59
Speaker
And I hosted a webinar with you with C2C. But I'd love you to kind of begin this podcast by explaining: what is streaming, and why do people often mistake it for micro-batch?
00:03:25
Speaker
What is streaming, and where is the line between streaming and micro-batch? Oh, this is wonderful. Yeah, I think that's a great question.
00:03:37
Speaker
So, um,
00:03:41
Speaker
Streaming is just about being able to get your data off the service that receives it from a user onto a disk, or into some other durable storage, within, let's say, milliseconds to hundreds of milliseconds, up to a second, and being able to read it right away.
00:04:06
Speaker
I don't know if that's useful, but what's more useful in my mind is a memory of once renting an industrial-sized video projector for a conference. I went to pay for it at the retailer, and they said, hang on, you can't pay for it yet.
00:04:23
Speaker
I just typed it in. It takes 15 minutes for the PO to come out. And I could see, reflected in this person's eyes, a batch job ETLing that PO into the payment system, right?
00:04:40
Speaker
It's very interesting, because I think with streaming it's easy to forget that's how the world worked, right? If you look at the origins of SAP, the company, it was billed originally as the real-time processing system.
00:04:57
Speaker
And this is a fascinating story about SAP, right? They came out as a real-time processing system, in that they could do their ETL and updates in their database in, you know, 15 minutes or an hour, something like that, rather than the full day that the previous systems in the 80s used to standardize on.
00:05:18
Speaker
So all of this is to say that streaming is a continuum. Now, the way I think about this today is...

Comparing Object Storage and Streaming Systems

00:05:28
Speaker
And really, we have two kinds of storage systems in the cloud.
00:05:32
Speaker
We have object storage; in Google Cloud, that's Cloud Storage, and many folks are familiar with S3 and the S3 protocol. And then we have messaging or streaming storage systems like Kafka and PubSub.
00:05:49
Speaker
The difference is that in one of them, you generally do atomic operations where you store a whole file.
00:06:01
Speaker
And until you're done storing that file, it's not readable by anyone. Now, you can store very small files; it just happens to be very expensive, because you pay per file.
00:06:14
Speaker
And if you look at systems on the other side, like Kafka and PubSub, you don't pay per file. There are no files. They're just these general buckets of data where you can send data that gets persisted on disk, and it's visible before you close the connection, or before you close a file handle, or anything like that, right?
00:06:37
Speaker
So the result is that in one case, you generally end up saying, well, I'm going to chunk my data up into one-megabyte or 10-megabyte or 100-megabyte chunks. In the other case, I'm going to stream records as they come.
00:06:50
Speaker
Sometimes you still have to do some processing, and you usually still do some buffering, but you're dealing in kilobytes. I just had a customer who was doing 12-byte messages, 12-byte commits, and saying, it's taking too long.
00:07:04
Speaker
I said, we need to batch them up. He said, okay, I will do 40.
00:07:09
Speaker
I don't know, it's working; there are all kinds of people, right? 40 bytes. And so the upshot, in practical terms, is that with object storage systems you end up with latencies in seconds and up, right?
00:07:28
Speaker
And in streaming storage, you end up with latencies potentially at 10 milliseconds, maybe even lower. So, a very long story, but I hope the picture I painted is that it's a continuum. I don't blame people who do micro-batch, right? As long as I don't have to wait with my credit card outstretched for 15 minutes, I can pay what I need to pay, get out, and move on with my day.
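As a rough sketch of why tiny writes are expensive on object storage but cheap to batch, here is some back-of-the-envelope arithmetic in Python. The per-request price is an invented placeholder, not real cloud pricing:

```python
# Illustrative cost of writing many tiny records as individual objects
# versus grouping them into larger chunks. The price below is a made-up
# placeholder, not any real cloud provider's pricing.

PUT_COST_PER_OBJECT = 0.000005   # hypothetical $ per PUT request

def object_storage_cost(num_records: int, records_per_object: int) -> float:
    """Cost if records are grouped into objects of a given size."""
    num_objects = -(-num_records // records_per_object)  # ceiling division
    return num_objects * PUT_COST_PER_OBJECT

million = 1_000_000
one_per_object = object_storage_cost(million, 1)        # one 12-byte record per object
micro_batched = object_storage_cost(million, 10_000)    # micro-batched uploads

print(f"1M records, one object each: ${one_per_object:.2f}")
print(f"1M records, 10k per object:  ${micro_batched:.4f}")
```

The four-orders-of-magnitude gap in request count is why object stores push you toward chunking, while per-byte-priced streaming systems let you send records as they come.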
00:07:59
Speaker
Sorry, but this is super interesting, and I think there's so much to unpack there. And I kind of got triggered by what you said about these older systems in the 80s and 90s.
00:08:16
Speaker
One thing I'm interested in is that you had a time where everything just ran on one database, right? If everyone reads from that same database and writes to that same database, maybe you need a bigger or better database to make it work, but then you have that same low latency, right? And maybe that's even a trend that is coming back a little bit, with way bigger nodes and that kind of stuff.
00:08:54
Speaker
What's the difference between having one big database versus having, conceptually, a streaming architecture? That's such a phenomenal question, because I'm going to pull it and stretch it a little bit, right?
00:09:22
Speaker
Because I think it's not a difference between having a single database and a streaming architecture. It's a question of having a single database versus multiple databases
00:09:35
Speaker
in your organization. First of all, if you can run everything over a single Postgres server, everything is real time, everything is transactional. That's great, right?
00:09:47
Speaker
There's no deep thought here. I was talking recently to a customer who is an architect at a major international bank, and he got really sad for a second. He said, you know, they sold us on microservices.
00:10:11
Speaker
He was like, now I have to disentangle all these different integrations, right? So I think the really interesting story is why companies and organizations decide to go from a monolith to multiple disaggregated services that each talk to their own databases.
00:10:34
Speaker
I think the classic thing that we talk about in the analytics space is just the separation of the transactional database from the analytical database and big data, right?
00:10:45
Speaker
There, I think the story is pretty simple, right? The sysadmins got fed up with the analysts doing a SELECT * on the production database and said, just make your own copy, right?
00:11:00
Speaker
And we quickly realized that sometimes this copy needs to be relatively fresh. So an end-of-day kind of ETL process is not so great.
00:11:14
Speaker
Then we realized that the way you do this is you read the binlog from the database; you do change data capture. And the binlog is generated with every transaction to the database, so it's a streaming data source.
00:11:26
Speaker
It's as real-time as it gets. And so you end up saying, well, the natural way this data emerges out of a database is real-time, you know? I click on a button, it gets recorded in the database, immediately appears in the binlog, and immediately can appear in the binlog reader.
00:11:45
Speaker
Of course, the way this data got processed, and probably still is in most systems, is that there's a reader that grabs this change data capture, buffers it up into a big file, and sticks it in an object, a file that's processed in batch.
00:12:01
Speaker
Right. And then, I think, the beauty of streaming storage systems like Kafka and PubSub, Kafka particularly so for databases, and we can talk about why that is,
00:12:16
Speaker
is that the storage system now has the same kind of properties as the source of the transactional data, right? So your binlog is updating on disk in the database; you can grab that update and store it in Kafka incrementally, with roughly the same latency. And on the other side, if you want a reader that inserts it into a storage system, you can take that event out with no delay.
00:12:44
Speaker
Now here's a kind of an interesting, tricky part. If you look at most analytics systems, there's still batch, right?

BigQuery and Real-Time Data Handling

00:12:50
Speaker
If you look at any Hadoop installation, it's dealing with files. You're still batching these things up. Which is why, I think,
00:12:59
Speaker
you know, to plug Google Cloud, right? I think it took us a little while to get through to folks that BigQuery, for example, began as an analytical system that basically looked at logs, right?
00:13:13
Speaker
But it had streaming update functionality; it was a data warehouse where you could update things, and where you could stream data in. So you didn't have this timescale mismatch between the source of the data, the intermediate storage system like Kafka, and the data warehouse.
00:13:30
Speaker
I think if we look at most data lakes today, they're still based on object storage, so they're still buffering. Again, not a big deal for really large-volume customers, or I would say large-scale sources like click data, but those are not coming through transactional databases; they come directly through frontends.
00:13:54
Speaker
So, again, I took us on another long loop around this topic, but to take it back to your question: what's the difference between a single-database architecture and an architecture where you have streaming pipelines connecting multiple databases?
00:14:19
Speaker
It's that in the second scenario, we now have the ability to not be a headache for the sysadmin of any particular database.
00:14:30
Speaker
Right? Because you can stream data, you can store it safely in an intermediate system, and you can read from that system really quickly, everybody can have their own copy.
00:14:43
Speaker
And so if that matters to a company, then this is a perfect fit. For some companies, it just doesn't matter, if they're small enough or their use case is small enough.
00:14:56
Speaker
But so you're basically saying you're decoupling this mechanism of systems talking to each other, of reading and writing, or consuming and publishing, basically, from...
00:15:12
Speaker
you use it as a separate system, right? Just like you can separate compute and storage, these kinds of things. And so depending on how you've organized your company, you can plug these different systems into each other and use this as your messaging layer.
00:15:30
Speaker
Absolutely. I mean, there's the service bus architecture. Somebody was just making fun of me for bringing back the service bus from the 90s, all kind of slimy and green, like, ah, service bus, right?
00:15:43
Speaker
I think it still makes sense, right? The promise of the service bus, the architectural notion of it, is still exactly the same. It's just that when we originally tried to do service buses, it was a single-box installation, and it had the same throughput limit as the database.
00:16:00
Speaker
So if you tried to plug many databases into this one box, well, guess what: many into one, like two co-hosts and one guest, right? It doesn't compute. But I think architecturally it still made perfect sense. And then Kafka came around, and distributed messaging came around,
00:16:22
Speaker
and that particular limitation went away. We're now calling these things event platforms, where all the databases dump their data in, but it's the same service bus architecture, or event bus architecture.
00:16:37
Speaker
So I think that's exciting. But to riff a little bit on your formulation, Dumky, I think there's the human, organizational aspect to this, which is that everybody wants to see the data in the database they want.
00:16:56
Speaker
I would say that on the technical side, what I find particularly exciting is that
00:17:05
Speaker
I like the idea that you can materialize any data in your company in the database that matches your performance expectations, or actually your API, right?

Heterogeneous Materialized Views

00:17:16
Speaker
Like some people really like MongoDB.
00:17:20
Speaker
Who doesn't like MongoDB? Well, people who build payment systems, right? So the business operates on something that's not Mongo, but the people who are building the web app maybe want to write it on top of Mongo. Well, that's okay. Great.
00:17:39
Speaker
You put your data in the transactional database of your choice; I'm going to put my data in Mongo. And guess what? We don't have to worry about them going out of sync, because everything is in the service bus.
00:17:50
Speaker
It's just a materialization: a materialization here, a materialization there. Same for the analytical use case, right? It's just a materialization. So, hold on, I'm going to throw out a crazy term or two: heterogeneous materialized views.
00:18:13
Speaker
Oh, God, that's a terrible term. But I find it to be very compelling, right? Just as we have materialized views for performance within a database, you can say, hey, what if you could do this across databases, across database technologies?
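The idea can be sketched in a few lines: one shared event stream, and each team materializes it into the store shape they want. The plain dicts below stand in for a document store and an analytical store; this is a concept demo with invented data, not a real connector:

```python
# "Heterogeneous materialized views" in miniature: two consumers read the
# same event stream and each builds the view shape its team needs.

events = [
    {"user": "u1", "item": "book", "price": 12},
    {"user": "u2", "item": "pen",  "price": 3},
    {"user": "u1", "item": "lamp", "price": 30},
]

# View 1: document-style store keyed by user (the app team's shape).
docs: dict[str, list[dict]] = {}
for ev in events:
    docs.setdefault(ev["user"], []).append(
        {"item": ev["item"], "price": ev["price"]}
    )

# View 2: an aggregate table (the analytics team's shape), same events.
revenue_by_user: dict[str, int] = {}
for ev in events:
    revenue_by_user[ev["user"]] = revenue_by_user.get(ev["user"], 0) + ev["price"]

print(docs["u1"])        # [{'item': 'book', 'price': 12}, {'item': 'lamp', 'price': 30}]
print(revenue_by_user)   # {'u1': 42, 'u2': 3}
```

Because both views derive from the same stream, neither can silently drift from the other; rebuilding a view is just replaying the events.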
00:18:27
Speaker
It's sort of a beautiful concept, right? Of course, it all depends on the idea that all the data from every database is neatly stored in some magical service bus
00:18:40
Speaker
that's clean and governable and performant. Absolutely. So, I think this is absolutely fascinating, but to play the devil's advocate a little bit here: what's the trade-off then? If we can have everything in our streaming or event platform, why do we even still need something like MongoDB? Why do we even need our analytical database?
00:19:09
Speaker
Wouldn't the requirements for those platforms carry over? The reason people use MongoDB is they want performance and flexibility in their schemas. The reason someone has an analytical database is they want to do fast aggregations.
00:19:27
Speaker
Are those limitations, those requirements, not then passed on to your event platform? What is the trade-off for the streaming part, if it's enabling everything else to work?
00:19:47
Speaker
Do you get what I mean? Yes, I think so.

Limitations of Streaming Systems

00:19:50
Speaker
I think, actually, if you look at the Kafka ecosystem, it sort of started with Kafka as just a log of all the changes, and then they added log compaction. So for every key, since you can associate a key with a record, I can get the latest value: key compaction.
00:20:07
Speaker
Hang on, it starts to look like a database. And then if you look at it, oh, it's a partitioned database. It's horizontally scalable. I can read at a vast rate. Isn't this the perfect database, right?
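Log compaction, as described above, can be shown in miniature; this is a simplified sketch of the idea, not Kafka's actual compaction algorithm:

```python
# A compacted view of a changelog keeps only the latest record per key,
# which is what makes a Kafka topic start to "look like a database".

changelog = [
    ("user:1", "v1"),
    ("user:2", "v1"),
    ("user:1", "v2"),   # supersedes the first record for user:1
]

def compact(records: list[tuple[str, str]]) -> dict[str, str]:
    """Keep only the last value seen for each key."""
    latest: dict[str, str] = {}
    for key, value in records:
        latest[key] = value
    return latest

print(compact(changelog))   # {'user:1': 'v2', 'user:2': 'v1'}
```

The full log retains history for replay; the compacted view answers "what is the current value for this key", the classic key-value lookup.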
00:20:18
Speaker
And then we had KSQL, which our friends at Confluent put out. I think that was a very compelling idea, right? You have a SQL database built on a distributed, horizontally scalable storage system.
00:20:31
Speaker
Yeah. That comes with all sorts of integrations. It's real time. It's kind of cool. I think that
00:20:41
Speaker
I am seeing some of our customers, well, they wouldn't be our customers because we don't offer KSQL, but some folks really like this idea, and they're trying to implement systems on top of it.
00:20:58
Speaker
But fundamentally, you're still dealing with a particular data structure, right? It's a log. And so any reads you do from that log are basically sequential.
00:21:09
Speaker
And it's partitioned in a particular way. Here's a very simple example: if I have customers buying widgets, I can partition that log either by customers or by widgets. And so aggregations by customers are going to be relatively fast, and aggregations by widgets are going to be relatively slow.
00:21:25
Speaker
Right? If I put the same data in Postgres, I can have an index on both. If I put it in Elastic, I can do all sorts of fancy searching. If I put it in Mongo, I can get it as JSON blobs, and that's even better for some applications. So although Kafka is the subject of my current professional interest,
00:21:54
Speaker
it is a very highly performant system, but it's optimized for one particular access pattern. Okay, yeah, that makes total sense. So you're saying that the people who know the kind of transformation and business logic they want to apply should do that in the system that fits their need, the database that fits their need, so it can be performant for their use case, for their team, for their application.
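The customers-versus-widgets partitioning trade-off can be illustrated with a toy partitioned log; the data and the partitioning scheme are invented for the example:

```python
# A log partitioned by customer: per-customer reads touch one partition,
# while per-widget aggregation must scan every partition.

from collections import defaultdict

purchases = [
    ("alice", "widget-a", 5),
    ("bob",   "widget-a", 7),
    ("alice", "widget-b", 2),
]

# Partition by customer at write time (what a Kafka message key would do).
partitions: dict[str, list[tuple[str, str, int]]] = defaultdict(list)
for customer, widget, qty in purchases:
    partitions[customer].append((customer, widget, qty))

# Fast path: one customer's total reads a single partition.
alice_total = sum(qty for _, _, qty in partitions["alice"])

# Slow path: totals by widget require scanning all partitions.
widget_totals: dict[str, int] = defaultdict(int)
for part in partitions.values():
    for _, widget, qty in part:
        widget_totals[widget] += qty

print(alice_total)            # 7
print(dict(widget_totals))    # {'widget-a': 12, 'widget-b': 2}
```

A relational database would simply keep an index on both columns; the log gives you exactly one cheap access path, chosen at write time.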
00:22:26
Speaker
Yeah, I mean, as somebody I used to know said, should is a very dangerous word. But I would say that if, A, you run out of room in your monolithic or transactional database, and B, you find a way to stick your change logs and your event data into a horizontally scalable messaging or storage system like Kafka,
00:22:55
Speaker
you now have the option of taking the database that has the performance characteristics you want, whether it supports indices, or easy big scans, or whatever it is, right?
00:23:06
Speaker
The right kind of partitioning, and just rehydrating or materializing your data there. I think that's the fundamental opportunity that this streaming storage creates for us.
00:23:21
Speaker
I also think that it's not only a matter of choosing where data is stored; it also ties back to the skills of the people who work with data.
00:23:34
Speaker
We are talking about advanced skills needed to store data in, you know, Kafka-type storage. And I work with clients today;
00:23:46
Speaker
I don't see that coming in the nearest future. What I'm really interested in here is: how would you characterize the customers you work with?
00:24:05
Speaker
Because of the customers we work with at Masthead, only about 10% are using streaming. The rest is batch, like super batch.
00:24:19
Speaker
You know, some micro-batch as well. But how would you characterize those clients? What are the use cases?
00:24:30
Speaker
You know, I see that as well. My sense of the world, based on what I see, is that most of the data still moves in batches, certainly between the systems of most organizations.
00:24:50
Speaker
I once went on a research project where I started asking customers: hey, if we took your batch systems and made them go from 15-minute latency to two seconds, who are you going to run and tell? Who is going to be super excited?
00:25:07
Speaker
And most of my customers were like, I don't even know, right? Which is to say that I'm struggling to come up with a characterization of companies that have come to the conclusion that they absolutely need to do their processing, analytical and non-analytical, in streaming mode, in real time.
00:25:36
Speaker
I will say, let's try to identify a few examples. One of the earliest examples of a very high-scale streaming workload that we encountered in PubSub in the last decade was Spotify. It's a public case study; they went and talked about this. Again, it was pretty early on, maybe 2015. One of the problems they had was
00:26:09
Speaker
they run a global distributed service serving music, and every time you interact with the app, at the very least, they have to go and pay a label for you playing a song.
00:26:20
Speaker
And so what happens is you have these frontends all over the world serving requests from your mobile app or your browser, and you just want to get that event off the frontend so it's not a stateful application; that simplifies scaling, et cetera, right? So it wasn't about end-to-end processing in real time. It was just about not having to use local disk and not keeping frontends stateful. That was a big use case, and I think log collection still remains a major use case.
00:26:53
Speaker
Right? I would say that if you looked at most companies 10 years ago, the standard way of collecting logs was that you wrote them to local disk, and then there was a log rotation daemon running in the background that uploaded them to an FTP server, or an SFTP server if you were really good.
00:27:16
Speaker
Then we moved on to uploading them in batches to object storage, and we could do some more creative things, right? But the idea of skipping local disk entirely and keeping your systems stateless is still largely enabled by streaming storage like Kafka, PubSub, and similar systems.
00:27:47
Speaker
So that's one use case. We have another set of use cases where you actually need real time. This is where you have coordination between different pages or views in an app flow.
00:27:58
Speaker
One example is when I upload a document to a document storage service.
00:28:11
Speaker
If you look at Google Drive, Dropbox, things like that. Or if I put an item in my cart and go to another page, that page needs to render with the new state.
00:28:25
Speaker
Right? And that state may or may not come from the same microservice. In particular, imagine this use case: I want to search for items that I put in my cart.
00:28:38
Speaker
Less interesting for the cart example, but much more interesting for: I uploaded this thing to my content management system, to my storage system, right?

Key Use Cases for Streaming

00:28:47
Speaker
Oh, I see it, right? So the index must have the same document as the primary storage system and the database.
00:28:54
Speaker
Right? If they don't, I search for it and it doesn't come up. Oh, did I lose it? It's a terrible user experience. So we've seen cases like this. We've also seen cases where personalization happens.
00:29:16
Speaker
And personalized ads, that's a use case. Again, it's very similar in that what happens on the next page really depends on what you just did. Or if you look at real-time bidding, that's a particular corner of the media business, right? Where, oh, I'm on your podcasting page, there's an opportunity to serve an ad, and you have 50 milliseconds end-to-end to pick an ad and pick a price for that ad.
00:29:42
Speaker
So there's a model that you have to run that decides which ads to pick and how much to pay for them, right? And how fresh the information available to that model is has to be very tightly controlled.
00:29:59
Speaker
Same for retargeting, for example, right? If I went to a travel site, that travel site works with a data distribution platform that says, hey, we aggregate data from everybody who is interested in travel to Las Vegas, which I am, by the way. I will be in Las Vegas for Google Cloud Next, April 11.
00:30:22
Speaker
Come to the Kafka talk. I already signed up for your session. That's great, and we'll have one of our customers talk about some very interesting things about this real-time bidding use case. I think...
00:30:38
Speaker
In any case, the fact that I'm interested in travel is a very valuable thing to people who are interested in advertising travel to me, and the value of that declines very rapidly, second by second.
00:31:02
Speaker
So that's another major kind of streaming use case. All of this is to say that I think the customers
00:31:16
Speaker
who adopt the streaming pattern really had a real-time use case, where the business had to have low latency. I will say that we are seeing more and more customers
00:31:30
Speaker
who have become believers in the service bus architecture again, who say, we may not have a really real-time use case, but this idea of let's just put all the data on the bus can work this time.
00:31:44
Speaker
So we're going to make an architectural, strategic decision to go that way. That's another category of folks who are coming around. And that's an interesting space now, because what Kafka brought is the solution to the bottleneck, right? That single-box bottleneck.
00:32:01
Speaker
What it hasn't solved is the governance, the structure, and the semantics part of it, right? And yes, if you look at the additions to Kafka, there's the schema registry and governance solutions of all sorts that are starting to address that. But I wouldn't say that
00:32:20
Speaker
it's a fully solved problem. I didn't spend a lot of time on this, but the architectural bit, these heterogeneous materialized views or the service bus architecture, is also another kind of customer.
00:32:36
Speaker
So: you have a true low-latency use case without which your business doesn't work, or you believe that the architecture will be simpler if you have a service bus. That's who does real time.
00:32:48
Speaker
Wow, did that take a lot longer than I wanted. Well, it means you have a lot to say. Go ahead, Dumky. No, I just want to say, at least speaking for myself, I do try to avoid these infamous two letters that everyone talks about these days. But with the rise of AI, since you talked about personalization...
00:33:14
Speaker
and that kind of thing in terms of real-time bidding: do you feel that customers are starting to see more use cases where they can use events and streaming to feed their AI models for real-time predictions as well? Do you see that on the rise?
00:33:43
Speaker
Well, does the hype affect your customers? Well said, yeah.

AI in Streaming Context

00:33:53
Speaker
I can't say that I see a direct line yet. And to be fair, I work in the infrastructure layer, right? This is streaming storage.
00:34:05
Speaker
There's probably someone in every company who gets to work at the API above the infrastructure, maybe a level higher, who gets to think about these use cases, and they just think about infrastructure as the stuff that sysadmins do. So it's possible that I'm not seeing the signal.
00:34:28
Speaker
To my mind, in the end,
00:34:32
Speaker
the way AI expresses itself to basic storage infrastructure and events is as just another use case, right? It's just another model, another analytical case, another service that consumes the events.
00:34:44
Speaker
I don't think it creates a need for a fundamentally new way of sharing basic data. Yeah, exactly. So the events are already there. It's just another consumer on top of the events. Of course, you know, if you look at some storage needs specific to machine learning, right, how do you store and share models and weights? How do you associate training data and snapshots with the models?
00:35:16
Speaker
Vector databases, all these things, right? That's a separate subject which we could talk about, but I don't see much there that's specific to streaming.
00:35:31
Speaker
Kir, I have another question. What I saw again, we have this customer who is heavily utilizing Kafka.

Cost Misconceptions of Streaming vs. Batch

00:35:43
Speaker
And what I saw from their costs is that they are much lower than for any customer who uses batch.
00:35:56
Speaker
Why is there a misconception on the market that streaming is more expensive than batch?
00:36:09
Speaker
Um...
00:36:12
Speaker
Do you have an answer? I'd love to meet this customer. I have this story, right? We've talked about this. You and I talked about this. Yeah, I know. This is also very important to highlight. You know, when you told me this, we didn't have that customer yet.
00:36:25
Speaker
Just for you guys to reference, I cannot name the customer, but they received over 2 million events during 10 hours or so.
00:36:36
Speaker
And it cost them around $2 a day in Google BigQuery compute for 2 million events, inserted and processed via Kafka and Pub/Sub.
00:36:49
Speaker
And if we were talking about batch jobs via Python or anything else, we would be talking about hundreds of dollars.
00:37:04
Speaker
So to recap this idea briefly, I think it's fair to say that there are cases where streaming might be structurally much cheaper than batch.
00:37:18
Speaker
And the reason is that when you're batching data, especially if your batches are very large, you have to have capacity to process giant spikes.
00:37:31
Speaker
And in between your batches, that capacity is sitting there doing nothing. So typically you provision your Spark cluster or your Dataflow cluster, whatever it is, right?
00:37:42
Speaker
For whenever the cron job kicks off, or your Airflow graph kicks off the thing, right? And if you look at streaming, on average you have highly utilized capacity that is not nearly as spiky.
00:38:00
Speaker
Then you can scale it with your demand. If you do it right, you potentially spend a lot less money on idle capacity, and you just need fewer machines.
00:38:13
Speaker
I think this does require a little bit of care, progressively less care as the technology develops, right? But certainly if you do things in 12-byte batches, like this customer I mentioned, streaming is going to be pretty expensive in terms of compute.
00:38:31
Speaker
So batching and aggregation, as a basic performance optimization tactic, still remain very relevant at every level of the operating system and application.
00:38:42
Speaker
right But you don't have to go to batches of hundreds of megabytes, or tens of megabytes, or even megabytes. With these streaming systems, you can achieve most of the benefits of batch processing once you get into tens or hundreds of kilobytes. So there's that caveat.
00:39:06
Speaker
The other one is that if you look at the cost of object storage, there's been this historical accident, maybe thanks to S3 emerging early on to store and serve cat pictures, and we have Cloud Storage that's similar: if you look at them, the throughput on these services is free, right? And because of that, the value picture gets distorted a little bit.
00:39:39
Speaker
Storage, just keeping the bytes on disk, is relatively inexpensive. But most importantly, you don't pay per byte coming in. And you can get vast ingress and egress capacity from these services without really paying extra.
00:39:57
Speaker
So I think that kind of counteracts this picture a bit. When you consider this idea, it's going to be a balance: are you keeping a cluster around just so it can run a batch job when it arrives, and are you spending a bunch of money on idle CPUs, RAM, disk, and so forth?
00:40:19
Speaker
Or do you do 15 seconds of processing every hour? Are you taking five minutes to spin up the cluster that's going to do 15 seconds of processing?
00:40:32
Speaker
Or are you pretty efficient? If you're very efficient, it's quite possible that you're doing just fine. If you're not, streaming can be very, very cheap.
00:40:45
Speaker
I just want to highlight why it's important to consider, because with the editions billing model in Google BigQuery, this is what we see. First step, customers move from on-demand to editions, which is slots per hour, which is basically the same thing: you have a machine allocated for you to run your pipelines, right?
00:41:13
Speaker
And second step, when they move to slots, to editions, they start to hit the ceiling of the capacity they committed to. And this is where they need to rethink their ETL, not just orchestration, but also the way they insert data and the way they process data, because they could have spikes of, say, 2,000 slots once a day, but for the rest of the day only 100 slots are needed.
00:41:45
Speaker
So this is a reason also for analytics use cases to start thinking strategically about whether they can apply some streaming or chunk down their data processing.
00:42:02
Speaker
Absolutely. On this note, I am reminded that I operate in a continuous processing mode.
00:42:11
Speaker
And I must move on to the next engagement. Beautiful. It was so kind of you to invite me, listen to me ramble, and ask very interesting questions.
00:42:27
Speaker
Many thanks to both of you, Yuliia and Dumky. It was very insightful, Kir. Thanks a lot for being with us. Yeah, appreciate you joining us. And see you in Vegas.
00:42:39
Speaker
Yeah, maybe one final question. Where can people find you if they want to know more about what you do? Thank you, Dumky. Great question. I'm on LinkedIn, you'll see my name, but I will also be appearing live.
00:42:53
Speaker
One engagement only: Las Vegas, Nevada, April 11th, Friday, around 10 a.m., for a talk at Google Cloud Next. Beautiful. Thank you.