
Batch Data & Streaming Data in one Atom (with Jove Zhong)

Developer Voices

Every database has to juggle the need to process new data and to query old data. That task falls to any system that “does stuff and remembers stuff”. But it’s quite hard to really optimise one system for both use cases. There are different constraints on new and old data, and as a system gets larger and larger, those differences multiply to breaking point. That’s something Twitter’s engineers were figuring out in the 2010s.

One solution that came up in those years was the Lambda Architecture. A two-pronged approach that recognises the divide between new and old data, and works hard to blend the two together seamlessly in userspace. But that seamless blending is easier said than done. It’s nearly all bespoke work.

What if you could get it off the shelf? Let someone else do the work of combining two different kinds of database into one neat package? That's the question of the week as we look at the recently open-sourced project Proton, and its attempt to be the Lambda Architecture in a box…

Proton Docs: https://docs.timeplus.com/proton

Proton Source: https://github.com/timeplus-io/proton

Timeplus: https://www.timeplus.com/

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Kris on Twitter: https://twitter.com/krisajenkins

#podcast #softwareengineering #databases #dataengineering 

Transcript

Introduction to Lambda Architecture

00:00:00
Speaker
There was a brief, glorious moment in tech history where Twitter was regularly in the news for technical reasons. It seems hard to believe now, but there was. And it was around those days that I first heard about something called the Lambda architecture.
00:00:16
Speaker
And the Lambda architecture was proposed as a solution to Twitter's big urgent problem. You have these two roles you need to fill. You've got a huge set of historical data that needs to be queried in all the usual ways, but then there's this much smaller set of live data that's really urgent and has to be processed right now if it's to be of any value.
00:00:38
Speaker
And there's a mismatch between those two kinds of data. They have different storage requirements. They have different access patterns. And of course, they have different performance constraints. Plus, there's the extra problem that the new data is going to become old data. So eventually, every piece of data you have will live in both of those roles. It can be very difficult to deal with because you can't optimize one without sacrificing the other.

Integrating Data Systems

00:01:04
Speaker
And the Lambda architecture proposed this solution: accept it. You're going to have two different data systems. And what you have to do is build them optimized for each use case, and then do all that difficult integration work that makes those two systems appear like they're one system. It's very hard. I remember it taking a long time, but they did it.
00:01:27
Speaker
But what if you could get all that integration, both of those systems neatly coupled together off the shelf?

Introducing Proton with Jove Zhong

00:01:35
Speaker
It seems to me that that's the question being asked by this week's guest, Jove Zhong. And his answer is the recently open-sourced database Proton.
00:01:45
Speaker
Well, if that's his answer, I have questions. What's a database like that for? What does it do well? How does it work? And how are you making that difficult integration piece simple enough that I can rely on it, but transparent enough that I can forget about it? That's the trick. I suggest we find out. I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is Jove Zhong.
00:02:24
Speaker
I'm joined today by Jove Zhong. Jove, how are you? I'm doing good. Hi, Kris. So glad to join this podcast. I'm glad you've joined us. So you're in a world with a lot of people competing to try and solve this problem, right? The problem of streaming data. And it seems to me that everybody has a slightly different way of solving it, and the architectural decisions they make play out in different and interesting ways.
00:02:54
Speaker
So I want to pick your brains on the architectural decisions you've made and how they play out in the system, Proton, that you've ended up co-designing, right? But before we get into that, I think we should try and ground ourselves in: what's this for? What's an end user actually going to use this for? Yeah,

Proton's Functionality and Use Cases

00:03:12
Speaker
sure. Yeah. So Proton, I would say that's a...
00:03:16
Speaker
For people who are not that familiar with the idea of stream processing, you can just consider this a next-generation ETL. I assume most people know what ETL is: you want to move data from A to B, and you want it as fast as possible, with no latency. You don't have to wait for the next batch, say the next day or the next hour. Whenever something happens, for example a new order comes in, a new person clicks your webpage,
00:03:44
Speaker
a person being attacked, a server about to OOM. That information can be sent to the system with almost zero latency, and you can apply some alerts, you can have some automated actions. So for common use cases, this is really good for building a real-time or streaming ETL.
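For a concrete feel for what that streaming ETL looks like, here is a hedged sketch in Proton-style SQL. The stream names, broker address, and topics are illustrative, not from the episode; the general shape follows the Proton docs.

```sql
-- Source: read JSON events from a Kafka topic as they arrive
-- (hypothetical names throughout).
CREATE EXTERNAL STREAM raw_orders (
    order_id string,
    amount   float64
)
SETTINGS type = 'kafka',
         brokers = 'localhost:9092',
         topic = 'orders_raw',
         data_format = 'JSONEachRow';

-- Sink: a second Kafka topic for the cleaned events.
CREATE EXTERNAL STREAM clean_orders (
    order_id string,
    amount   float64
)
SETTINGS type = 'kafka',
         brokers = 'localhost:9092',
         topic = 'orders_clean',
         data_format = 'JSONEachRow';

-- A long-running materialized view does the transform
-- continuously, with no batch schedule involved.
CREATE MATERIALIZED VIEW orders_etl INTO clean_orders AS
SELECT order_id, amount
FROM raw_orders
WHERE amount > 0;
```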
00:04:11
Speaker
And beyond that, it really is a true stream processor, meaning it's not just fast: it also has stateful processing. The difference is that we can do
00:04:29
Speaker
stateful processing, such as counting every second how many clicks a particular page got, so I accumulate a state within that one second or one minute. Or you can do some fancy things, such as a session window:
00:04:45
Speaker
whenever there's a signal meaning a session starts and a session ends, we keep that session window. For example, when a user logs in and does a bunch of actions until they check out or log out, that's a session. The session window can be as long as, say, two hours, or it can be just five seconds if the person is not patient enough. The digital equivalent of someone coming into your shop, and you want to understand them until they leave, right? Yeah, yeah.
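Roughly, the two window types described here look like this in Proton-style SQL (stream and column names are invented, and the session() signature is approximate; check the docs for the exact form):

```sql
-- Tumbling window: clicks per page for every one-second bucket.
SELECT window_start, page_id, count() AS clicks
FROM tumble(page_views, 1s)
GROUP BY window_start, page_id;

-- Session window: one row per visit, where a session closes
-- after five minutes of inactivity.
SELECT window_start, window_end, user_id, count() AS actions
FROM session(page_views, 5m)
GROUP BY window_start, window_end, user_id;
```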
00:05:14
Speaker
Simple transformations or filtering, up to those sophisticated stateful processes, all of those are in the category of stream processing. Some of the big players in this space are, for example, Apache Flink, and there are some other vendors, as you briefly mentioned.
00:05:34
Speaker
Yeah, it's not a new space, but it's still at an early stage, because people really want to get things more real-time. They may still struggle on the batch side, but more and more people want to get to the next chapter, which is streaming. So this idea of streaming analytics is still quite new, and people really want to try different solutions.
00:05:55
Speaker
Yeah, this is the thing that interests me, and I keep coming back to it: I want to hear from everyone, because everyone in this space is scrabbling to come up with the ideal solution, or the solution that's optimized for a particular subset of that market. Everyone has a large piece of the answer, everyone's saying they've got the whole of the answer, and I just kind of want to explore that, right?
00:06:18
Speaker
But first, I'm going to pick you up on a specific word; I'm going to nitpick something you said. It's a true stream processor, you said. Yes. What's a true stream processor?
00:06:31
Speaker
Yeah, again, the definition of a stream processor is really about how to handle streaming data in a meaningful way. Streaming data is data you get constantly. You don't just get it batch by batch; you always get new data, all the time.
00:06:51
Speaker
And you also want to process the data in the streaming way, not in the batch way. The more common, or naive, solution is that you always put the data into your OLAP system; there are a lot of OLAP systems. I guess the OLAP market is even busier than our stream processing market. But anyway, you can always put the data into your OLAP system and query it every second.
00:07:18
Speaker
Then, if your system is powerful enough, you get quite a nice dashboard or alerts, with the limitation that you only get this every second or every minute. I think most systems aren't really happy to support per-second queries, because that's too frequent. And what's worse: imagine you want to know, hey, what's my
00:07:41
Speaker
revenue this month, and you want that number updated every second or maybe every minute. If you're using an OLAP system, you query this every second or every minute, but you're wasting a lot of resources, because the number you got last time was, say, 10,000, and between these two queries you might get a few transactions, so you might get 10,200.
00:08:11
Speaker
But because you run this as a batch query, you end up querying all the data over and over again, even when there's only a very small increment. So this is the usual so-called real-time solution. With stream processing, we have state. We know what we computed last time, say 10,000. Then there's some new event, and we know the delta.
00:08:36
Speaker
So taking the delta and combining it with the previous result gives you the new number. And what's more interesting is that a streaming query is usually a long-running process. You only run your query once, and it never ends until you cancel it. Even if the server starts, restarts, or scales out, the system can usually keep the state and resume where it left off, until you specifically cancel it. So,
00:09:05
Speaker
back to that previous example: you want to know the live number for this month's revenue, and you just run one single query, something like SELECT sum(amount) FROM stream. And that streaming query will keep giving you new results whenever new data is available.
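Spelled out, that query might look like this (a hedged sketch; the stream name and the `_tp_time` filter are illustrative):

```sql
-- An unbounded streaming query: it emits an updated sum as each
-- new event arrives, combining the delta with the previous state
-- instead of rescanning all the data, and runs until cancelled.
SELECT sum(amount) AS revenue_this_month
FROM orders
WHERE _tp_time >= date_trunc('month', now());
```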
00:09:21
Speaker
So you are happy to render this continuously, or you can set up some alerts. That is stream processing. And what I mean is that with Proton we're doing a very interesting thing: we take this to the next level. We also have our own storage behind it, so you don't have to use your stream processor to send the data to other systems such as an OLAP store or Kafka;
00:09:47
Speaker
you actually just do everything in our one system alone. We give you the serving layer for both the real-time part and the historical part. You can even have a single query that queries both the historical part and the real-time part. So this is... I don't want to throw too many ideas at you, but you might recognize
00:10:10
Speaker
keywords like

Understanding True Stream Processing

00:10:12
Speaker
Lambda or Kappa architecture. We need to get into that. That's something we can discuss, for sure. We will dive into that. So is it fair to say that your definition of a true stream processor is something that's live; it's not very, very fast batch or micro-batching; it has to be a continual transformation of a state by a single event?
00:10:39
Speaker
Yeah, by event, by interval, by other conditions, you name it. You define your atom of change, but it has to be sufficiently small to count as true stream processing.
00:10:50
Speaker
Yeah. Again, we started from those ETL use cases, and we also uniquely designed our system to have the historical part built in, rather than just relying on another system for the historical part. And we also made some design choices to
00:11:12
Speaker
not depend on too many cloud technologies, so that we can easily be deployed on smaller servers, even edge devices, or in a distributed way, so that you can take data from local sensors, do some filtering, and aggregate it together centrally.
00:11:34
Speaker
Right, I'm going to get into that. But next, I mean, I'm going to unpack this slowly. The thing I find particularly curious about this is that you seem to have a dual storage model inside Proton. So take me through that, and especially the why. Yes. So take other common choices of stream processing, such as Flink.
00:12:00
Speaker
I think in the last two or three years, the Flink community has also evolved. They do very well on the computing side; now they are adding the storage side. There's a project called Apache Paimon. Originally it was called Flink Table Store, a sub-project of
00:12:20
Speaker
Flink, but they realized it's about more than Flink, and they're aiming for a much bigger space. They want to create what they call a streaming data warehouse, or streamhouse, with both the computing side and the storage side. And the storage side can be...
00:12:39
Speaker
I'm not saying it's the next Iceberg, but they are being very ambitious, and they want to support other systems such as Apache Spark. So anyway, the reason I mention this is that in the Flink community,
00:12:55
Speaker
they moved, or evolved, from a pure computing engine to adding the storage part. But in the case of Proton, or our company Timeplus, from the very beginning we thought this was the future. So on day one we weren't just focusing on the stream processor, the computing side; we really wanted to have the
00:13:17
Speaker
vision that, hey, in the future, this should be one single system. People really need to process data whether they're comfortable with the streaming way or the batch way. So we chose ClickHouse as our key foundation. It's already a very mature batch-based OLAP system, and we added extra streaming capability into that code base,
00:13:44
Speaker
so that you can process data in the streaming way, just like a stream processor or streaming ETL, very similar to Apache Flink or ksqlDB. But at the same time, the data is stored in the ClickHouse historical part. So you can query it and say: I don't just want to know what's happening right now, I also want to know the pattern or trend over the last two weeks, two months, or even two years.
00:14:13
Speaker
If possible, you just query Proton alone, without asking another system. And there are other interesting things, such as backtesting. Backtesting is a special term, I think very popular in the fintech area. They want to figure out: what is my best trading strategy, based on those ticks, those stock prices, those offers, things like that. Usually they want to build a good strategy for the real-time part, but they also need
00:14:43
Speaker
to validate that strategy using the data from the past, I don't know, maybe one or two years; they call this a backtest. And it's very common for people to have
00:14:53
Speaker
two teams: one team building the streaming or real-time trading part, the other team building the historical part, and they can even be using different languages. So there's extra cost and inconsistency. In our case, everything is SQL-based and we use the same engine, so you can easily run the same code for both the real-time trading and the backtest. So there are some benefits to having one system do both.
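As a hedged illustration of that point, the same SQL can be pointed at the live stream or at its history; in Proton the table() wrapper turns a stream into a bounded scan of its historical store (the `trades` stream here is hypothetical):

```sql
-- Streaming mode: long-running, updates as each tick arrives.
SELECT symbol, avg(price) AS avg_price
FROM trades
GROUP BY symbol;

-- Backtest mode: the same logic replayed over stored history.
SELECT symbol, avg(price) AS avg_price
FROM table(trades)
GROUP BY symbol;
```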
00:15:21
Speaker
So this is very much the Lambda architecture that I remember first hearing about at Twitter, where they were dealing with the historical tweet data (I'm not going to call them exes), the tweet data plus the live hot stream of people tweeting right now and wanting to see timelines right now.
00:15:39
Speaker
Lambda, again, is very popular and easy enough to understand, but people also see the challenge of having two serving layers: one serving layer for the real-time part, the other serving layer for the historical part, with the data even saved in two different places. It can be complex, and that's why people came up with the Kappa architecture. I would say in the case of Proton,
00:16:03
Speaker
it's more like Kappa, but Kappa in a single binary. Because we have everything written in C++ as our core engine, it can be compiled to a single binary, and that single binary can serve both the real-time or streaming queries and the historical queries, and all the storage
00:16:28
Speaker
is also unified. Internally, we do have historical storage based on ClickHouse. That's column-based storage, you know, and it's very fast for aggregation, especially if you don't need to scan all the columns, right? For example, there are 10 or 15 columns and you really want to do min/max on two of them. The beauty of a columnar database is that you only
00:16:57
Speaker
read the data for those two columns, without fetching extra data, and you can apply SIMD vectorized processing to get results in a single CPU instruction, or something like that, without much overhead. So that's what we inherit from ClickHouse. But in the meantime, we also have our own Proton streaming storage.
00:17:23
Speaker
We call it the native log. It's an internal component, and the point of it is that all data arrives in that streaming storage first. It's basically a write-ahead log, very similar to Kafka but much more lightweight. Once the data is there, we can do the streaming work, and the data is also replicated to the historical part, so we get a boost. So you're saying internally there's a write-ahead, Kafka-esque log,
00:17:53
Speaker
which you're querying for the hot set of data, and also writing across to a bundled ClickHouse for classic analytics. That's right. I can begin to see why that architecture has a balance to it, but that sounds like a lot of work on your side to try and... Is it a problem keeping those consistent?
00:18:20
Speaker
Is it a difficult thing to know? How do you know when to query which layer, for instance? Yeah, that's really an interesting thing we're happy to discuss. In the streaming world, everything looks very beautiful and easy at first glance: oh yeah, of course, why not? But when you really work on it, you might realise there are a lot of challenges. For example,
00:18:46
Speaker
out-of-order data, right? You might get data from different sensors, from different gateways. They might be using different clocks, or a certain channel may have a certain delay. So when all that data is consolidated into your system,
00:19:04
Speaker
you are not guaranteed that the data will be in the exact order it happened. Maybe something else arrived first. Somebody's using a mobile app, but they were on the underground, on the subway, and they sent the analytics data like half an hour late.
00:19:18
Speaker
Yeah, yeah. We have some use cases of people using our solution for fleet management, with sensors on their trucks. If the truck goes into a tunnel, maybe for five or ten minutes the signal will be very bad, but it will still send the data out and you get it late. So there are out-of-order events, there are late events, and there's also the question of how to keep the
00:19:43
Speaker
state manageable. As I mentioned, a streaming query is usually long-running; it's always running in the back end. But if you're doing some calculation where you always add new data to the state and never reduce it, then
00:20:02
Speaker
no matter how big your memory is, it will OOM eventually, right? So you need a way to control the size of the state. And there's also stuff like what happens when the schema changes, or how you react when the server scales out or scales down. Those are a lot of the challenges. In our case, we have some extra design. One of the key
00:20:29
Speaker
and important parts of that is that we introduced our own format. We call it the Timeplus Data Format, TDF. TDF is our key data format,
00:20:45
Speaker
designed to solve those challenges. For example, we have a column, or flag, called the sequence number. The sequence number is the important information that lets us know... say you ingest, for example, three events in a single batch into Proton;
00:21:07
Speaker
each event has an ID and is assigned a sequence number. Later on, when other events come in, they get different, newer sequence numbers. The data goes into our streaming storage first, and then we replicate it to our historical storage while keeping that sequence number, so that
00:21:34
Speaker
in some cases the data can be on both sides, but when we query it, we make sure, based on that sequence number, we only read the data once. I'll give you an example. In our streaming storage you can set a retention policy of one day, for example; again, this is configurable. Say in the streaming storage the retention policy is one day, and in the historical storage it's, say, one month.
00:21:58
Speaker
Then if you ask: hey, can you give me the number of events matching this condition for the past week? We start from our streaming storage, we get one day of data, and we know it's not enough. Then we can go back to our historical storage to get the data from one week ago up to that minus-one-day point. And because we know the sequence numbers, we can make sure there's no overlap.
00:22:28
Speaker
So, for example, if in the streaming storage the last sequence number is 500, then the other side should stop at 499 or 501, something like that. Very smoothly, we can grab data from both sides without any duplication. So that's how we solve those ordering issues.
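A hedged sketch of the retention example (names illustrative): suppose `events` keeps one day in the streaming store and one month in the historical store.

```sql
-- A bounded count over the past week. Per the discussion above,
-- Proton can serve this by stitching the streaming store and the
-- historical store together, using the per-event sequence number
-- to ensure any overlapping data is read only once.
SELECT count() AS events_last_week
FROM table(events)
WHERE _tp_time > now() - INTERVAL 7 DAY;
```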
00:22:51
Speaker
So that does sound like there's an extra storage layer in there, because where are you storing that state? The state, today, we still save in our native log, which is our streaming storage. Essentially, it's file-based. So are you kind of snapshotting back to a different write-ahead log? Yes. Because it's our own format, we can do
00:23:21
Speaker
all kinds of crazy optimizations without having to wait for a community. OK, so your system goes down. Let's not say it crashes; it goes down neatly. It comes back up. To resume that query, you're going to walk back along the event stream until you find a snapshot, and then add that onto the historical data. Is that how it works?
00:23:45
Speaker
Yeah. For example, say we do an upgrade: the version is A and we need to upgrade to B, and effectively you have to restart the server. In between, it's just two seconds or two minutes, whatever. We make sure that when we start the new server, we pick up where we left off: what was the previous internal state, as well as
00:24:09
Speaker
what was the last sequence number we scanned? Say the last sequence number is 1,000, but now in our streaming storage there are 10 more events.
00:24:22
Speaker
We know we never computed those new events, so we can use that sequence number to figure out the gap we should process. Right, yeah, I'm with you, I'm with you. Okay, let me give you another difficult architectural question. What's your approach when, let's say, my fleet of trucks is huge, right? And I'm dealing with
00:24:46
Speaker
I don't know, what counts as huge these days? Let's say a million, for the sake of argument. I'm dealing with a million trucks whose state I would like to calculate over time. How do you deal with that when it gets too large to hold in memory? What's your sharding strategy for very large stateful sets? Yeah, sure. It's not so uncommon, right? We cannot put everything in memory. I know some databases
00:25:16
Speaker
really want to put everything in memory, and even have cluster versions of a memory-based system. So again, as you mentioned earlier, different people try to solve this in different ways, with different solutions. It's very difficult to have one thing fit all. Everyone has their own
00:25:36
Speaker
assumptions and design choices, or even design preferences, or it depends on which customers they talk to: some customers hate the cloud, some people only use the cloud, for example. Yes, in the case of a lot of state or data, we do have a sharding strategy. I think this is very common: to keep it simple, everything starts as a single shard,
00:25:57
Speaker
but you're free to design a sharding strategy. For example, you might think three shards is enough, or ten shards is enough. Sharding, partitioning: those are near-synonyms, I think. You can design a sharding strategy based on a certain key, so that data in the same shard is related or similar in some way. Then you combine the sharding strategy
00:26:24
Speaker
with the cluster. As you may know, even on a single node, a single host, you can have multiple shards, and essentially each shard can do independent I/O. So, for example, if you observe that the IOPS capacity of your server is very high but you cannot get the highest throughput,
00:26:59
Speaker
you can create more shards so that you have more parallel data I/O. That's a good reason why you might want to have multiple shards on a single server.
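In Proton-style SQL, that choice is made when the stream is created; a hedged sketch (names illustrative, and the exact setting may differ):

```sql
CREATE STREAM truck_telemetry (
    truck_id uint64,
    speed    float32,
    lat      float64,
    lon      float64
)
SETTINGS shards = 3;  -- more shards, more parallel disk I/O paths
```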
00:27:11
Speaker
But also, if you want a certain amount of HA, high availability, or disaster recovery, then you might want multiple servers, with each server taking different shards. I guess this is quite similar to how Kafka works: you can have multiple partitions for one topic, and multiple servers handling different partitions, and you can even have a replication factor,
00:27:37
Speaker
where each partition is replicated at least two or three times. So, combined, you have a relatively
00:27:46
Speaker
conventional distributed architecture that can support almost any number of events; it depends on whether you have enough disk I/O and enough servers, and whether the network between them is good. Yeah, okay. So that raises another question for me, which is: you've chosen to embed ClickHouse as an OLAP platform, but you haven't chosen to embed Kafka; you've written your own write-ahead log.
00:28:17
Speaker
What's the thinking behind that decision?

Proton's Deployment and Versatility

00:28:20
Speaker
We want a system that can be deployed as a single binary, as simple as that. Or it can be a cluster, as I mentioned; we do support clusters. It's even like the way Splunk supports clusters: it's also a single binary, but you can run the same binary in different roles. Say, I want to run this one as a master, this one as an indexer, this one as a search head.
00:28:45
Speaker
Similarly, in our case, it's the same binary, but based on the configuration file, or based on some related setting, different nodes can take on different roles. And also, because of the single-binary story, it
00:29:02
Speaker
is possible... even today there's a community user in our Slack asking: can I use Kafka to replace the native log? And the answer is yes, for sure. We made this more like an interface: we need a streaming storage,
00:29:21
Speaker
and we have our own, which we call the native log; it's part of our process. But if you really want to configure an external message bus such as Kafka or even Redpanda, you can do so. Then we just leverage that as the streaming storage to hold our write-ahead log, our TDF format. So that is configurable. So that state snapshot you're talking about, that could be written to Redpanda instead.
00:29:51
Speaker
Yeah, that's right. That's the data part. But besides the data, there's also, for example, metadata: how you define things like a materialized view, how you define a stream.
00:30:06
Speaker
If you create the stream on this node, are the other nodes aware of that? I think today it's still stored in file format,
00:30:21
Speaker
and we replicate it to other nodes using the Raft protocol. So again, internally there can be some data in the streaming storage, some data in the historical storage, and some metadata in the form of files, and the files are replicated between nodes. OK. That makes me want to talk about performance, because you've talked about how this is embedded,
00:30:50
Speaker
and how there's a trade-off between the historical and the real-time stuff. It makes me want to think a bit about performance. What kinds of queries does it perform really well on? What kind of scale can you optimize for with this hybrid internal architecture? Yes, performance is really one of the key driving factors when we design things.
00:31:15
Speaker
We have some similarity with Redpanda. Redpanda is basically a simplified version of Kafka, right? And in an oversimplified way, Proton is a simplified version of Flink, or you could call it a native engine for Flink. You know what's interesting happening right now? There are native engines for Spark.
00:31:44
Speaker
Spark is mainly in Java, with some parts written in Scala. But in some use cases, people are not very impressed with the performance of Spark, the JVM version. So there are a few projects, coming from Facebook, even from Apple, I think, that have written this in either C++ or Rust
00:32:08
Speaker
as a native engine for it. Even the company behind Spark, Databricks, has been building one for four or five years and has used it for at least three years. They call it...
00:32:24
Speaker
I don't know how to pronounce it. It's Photon, P-H-O-T-O-N, very similar to Proton but with the letter H. So that is a native engine. I'm not sure whether it's Rust or C++, but it's an in-place replacement for Spark's engine: you run your Spark workload, but it's executed using the native engine. So you see, if people want to aim for the next level of performance,
00:32:51
Speaker
the JVM may give you something, but it may also give you limits. If you want to leverage SIMD, vector computing, modern file system features like io_uring, manual memory management, that kind of stuff, you might have to do this yourself. Java is vastly faster than it was when it first came out. It's incredibly fast. But if you want to have an argument with the CPU directly, you kind of need Rust or C++ or C.
00:33:21
Speaker
Yeah, I even see some folks tuning for ARM. There are a lot of details; if you do this carefully, you can get much faster, but it's
00:33:33
Speaker
underestimated. I mean, a lot of programs would run faster on ARM if tuned properly, but it requires a lot of effort. Anyway, long story short: we chose C++ as our implementation language. We leverage a lot of high-performance C++ libraries, and we always use the latest Clang, LLVM, all those cool things. So we can get very, very low latency without requiring a lot of
00:34:01
Speaker
computing resources. I'll give you an example. One of our community users needed to do what they call a high-cardinality GROUP BY. The term is fancy, but essentially it's just a GROUP BY, you know, the SQL GROUP BY key, right? But what if the number of keys is not 1,000 but, say, 10 million?
00:34:27
Speaker
So imagine you have 10 million unique keys, and for each key you need a certain aggregation, for example the count, the min/max, or even some crazy one like a p99. The extreme case is you have 10
00:34:44
Speaker
million unique keys and for each key you need a p99. p99 is tricky because you need to keep all the data and figure out which value sits at the 99th percentile, so it's more complicated than a sum or an average. Yeah.
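The workload being described is roughly this (a hedged sketch; stream and column names are invented, and quantile() is the ClickHouse-family percentile aggregate that Proton inherits):

```sql
SELECT src_ip,
       count() AS packets,
       min(latency_ms) AS min_latency,
       max(latency_ms) AS max_latency,
       quantile(0.99)(latency_ms) AS p99_latency
FROM firewall_events
GROUP BY src_ip;  -- src_ip may have ~10 million distinct values
```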
00:35:01
Speaker
In such cases, many systems cannot handle it. But because we implement in C++, we have our own control over the data structures. We can actually spin up multiple internal processes to leverage the modern CPU and all the Linux facilities, to get this result correctly without using a lot of memory. I think...
00:35:30
Speaker
I don't remember the details, but I think it's more like 4x or even 10x less memory compared to Flink for the same high-cardinality GROUP BY. And
00:35:44
Speaker
a high-cardinality GROUP BY is very common in some use cases, such as cybersecurity. For example, with firewall logs, you want to figure out which IPs keep sending you packets, and the number of IPs can easily be very large, so it's high cardinality. And in other cases, such as fintech, you might have a lot of stocks or Bitcoin, things like that. Especially if you go into something like IPv6 and group by that, yeah, that's big. Something like ten million keys is a really big challenge.
00:36:16
Speaker
You're saying you can get 4x or whatever out of that. What's the key part of that? What's the thing really making the difference? Yeah, I would say we can leverage the low-level CPU, I/O, and Linux stuff. I think Flink, in many ways, is designed in a very generic way.
00:36:40
Speaker
And also, with the limitations of the JVM, you might have to do certain things in a generic way. But when you have the chance to handle things yourself... I'll give you a very simple example, as simple as a bunch of integers.
00:37:02
Speaker
Again, I haven't tried this in Java for a long time, but if you design your data format in the proper way, for example every integer is two bytes or four bytes, you make it fixed-width and use the same representation as in memory, then essentially it's just more like a memory dump.
00:37:25
Speaker
Then you do minimal marshalling and unmarshalling. I guess this is a similar idea to Apache Arrow: they have the same representation in memory, on the network, and on disk. Same thing here: if you do this carefully enough, you're handling data packets in a very efficient way. I'm not quite sure whether you can do that easily using Java.
00:37:55
Speaker
I don't think you can do it easily. I'm sure someone will tell me that you can do it, but it doesn't sound like it's a natural fit for a virtual machine that tries to abstract away from these things. Someone will educate us both on that one. You've mentioned this a couple of times, and while we're talking about architecture, I have to ask this.
00:38:17
Speaker
You're talking about ARM, and you've mentioned embedded systems. If I want to stick a Raspberry Pi out in a field gathering interesting data, is this a good choice for that? Yeah, it can be. In the early days, I used...
00:38:37
Speaker
I forget the exact instance type name... it's t2.nano. It's a very small instance on AWS; I guess it's half a CPU and one gigabyte of memory. I used that for a long time for my demo.
00:38:53
Speaker
Yeah, because our binary is still small, a few hundred megabytes, and it can be even smaller. The thing is, if you just run some filtering or some bounded time window, so it just processes data continuously and sends it back to Kafka, for example, then it doesn't really require a lot of memory.
00:39:17
Speaker
Again, I don't want to pick on Flink, but just as a comparison: if you want to run Flink, you need to have the right version of the JVM. There are so many different JVM versions, from different vendors, with certain...
00:39:33
Speaker
knobs to control on the JVM. Anyway, you choose the JVM and you download Flink. Maybe you put it in a certain path. You start it. Then you say: yeah, cool, I want to get data out of Kafka. And then you copy the SQL from the documentation and run it, and it shows you an error. And then you figure out: oh, I need to
00:39:55
Speaker
download the Kafka connector JAR file from somewhere and put it into the library folder. Then you run it, and: oh, maybe I need to create a JAR file and deploy that JAR file as my custom code. So the server
00:40:15
Speaker
requires a certain amount of CPU and memory. I know there are ways you can customize Flink to consume fewer resources, but it's not the default. So my point is: again, Flink is already a great product, but you still need to figure out many details. In the case of Proton, because we are so focused on simplicity, you just run a single command to download a single binary, or pull a very small Docker image,
00:40:42
Speaker
and it can connect to your Kafka immediately. We're using librdkafka, the C/C++ library for Kafka, so it can connect to Kafka to read or write. And I think it's fairly easy to fit within, for example, the resources of a Raspberry Pi. We've also supported ARM for a long time,
00:41:09
Speaker
because we have multiple clusters, and some of those clusters are based on ARM chips. Amazon has a very nice ARM-based chip,
00:41:30
Speaker
with up to, what's the number, up to a 30% discount or something like that. So it's much cheaper than the AMD or x64 chips, and we use it for a lot of our internal tests. There are some differences, but not huge. So if you really want to run stream processing on your Raspberry Pi, on your embedded device, I think Proton is almost the best choice.
00:42:00
Speaker
I think Redpanda is doing something similar. It might be possible to run Kafka on your helmet, on your other devices, but it's much easier to run Redpanda on those small devices, and you can maybe put a Proton there to do some SQL-based stream processing. Okay, yeah. Because that seems fun to me, to be able to do the processing as well as the storage all in one simple place.
00:42:28
Speaker
Yeah, one more thing to add: we've even talked to some users, for example in the energy space. They need to find oil, and they have a lot of sites, because sometimes it's like a lucky draw: you put a lot of equipment out there and figure out, among those, say, 3,000 sites, which ones have a lot of oil behind them.
00:42:53
Speaker
So you need a lot of sensors, and those are remote places. You don't really have a good enough network, and even if you deploy 4G or 5G, it can be very expensive. So the solution is more like having a lot of local data centers, micro data centers, where you get the data and do some filtering and processing. You might only be able to consolidate it into the central servers later, but if you can work within those
00:43:18
Speaker
distributed environments with network air gaps, if you can apply stream processing or real-time processing there, it can filter out a lot of garbage data, figure out what the real signal is, and help you figure out which well you should spend more money on. Because those kinds of in-field processes tend to pick up a lot of noise along with the signal. If you can pre-process that in the field, that makes sense. Yeah.
00:43:43
Speaker
Do you think there'll be, or maybe there already is, a way in the architecture to then say: OK, I've got this Proton node which has filtered out some of the uninteresting data, but it's still got a lot of interesting data. Can I now sync that back to my mothership cluster when the connection comes up? Yeah, it's already there. This is our design goal. Each node is
00:44:11
Speaker
I'm not saying self-driving, but they can do everything by themselves: do some filtering, figure out the interesting stuff. Then they can send it to a central server when they detect that the network is good. Because we have storage as a buffer, you never lose data.
00:44:31
Speaker
When the network recovers, or when you turn on the network, that data can be synced to the central server. The central server then knows all the data across different sites and can do some correlation search or pattern detection. So it's pretty much like Kafka
00:44:54
Speaker
and the mirroring thing. It's similar to that, but... Like MirrorMaker. Yeah, yeah, yeah. We just implement this on the Proton side. Okay.

Proton's Applications in Drone Technology

00:45:03
Speaker
So come the day that Amazon have their drones flying around us in the sky, you'll be hoping they've embedded a Proton server in each one.
00:45:12
Speaker
I hope that there will be many choices, but I assume they have good 4G or 5G, so they can generate the data and send it back to a server. Some users really want to do everything locally, though.
00:45:33
Speaker
Yeah, if the network and latency are okay, I think having each sensor on the drone send to a central Kinesis or Kafka and doing the processing there will be easier. I mean, we don't have to go looking for trouble, but there are a lot of choices, for sure. That would be interesting trouble to get into. And then you'd have to optimize your binary for the weight. If someone wanted to kick the tires on Proton, where do we go first?
00:46:01
Speaker
Yeah, of course. Go to our repository on GitHub. You can also get there from timeplus.com; that might even be easier, just go to timeplus.com. I think we have a big enough icon to guide you to our GitHub. And in our repository there's a command line with many options. You can use curl
00:46:24
Speaker
to grab our binary from our server, and you can run it directly, or you can install it. Install just means putting it at a certain path with some configuration file, but that's optional; you can just run the binary in your current folder. So everything is very easy. We also have a Docker image, and besides that, a bunch of Docker Compose files. Docker Compose is really a good way to set up some sample stuff. For example, we have a sample of
00:46:54
Speaker
how to use Redpanda to generate some clickstream data and visualize it using, for example, Grafana. Because, as I mentioned, we have not just a stream processor but also a serving layer. If you know Flink, that's something I don't think you can easily do: using Grafana to visualize your Flink query,
00:47:19
Speaker
because Flink essentially wants to get the data, process it, and maybe send it to Kafka or to an OLAP store; then you can query it from there. Flink is not really designed as a serving layer. Maybe they have something like a SQL gateway recently, but it's not the same. I gather it's trying to just do transformation, and not do storage or serving. Yeah, yeah, yeah. So you might have to send the data out to ClickHouse and then use Grafana or another BI tool to query it. But in the case of Proton, you can just use
00:47:49
Speaker
Grafana to query Proton directly and show nice charts. Yeah; Grafana, Metabase, Redash, we support those too. How's that working? Are you just embedding a web server in it that Grafana talks to, or have you got some native connector? Yes, there again we largely leverage ClickHouse, right? Proton is, as we say, powered by ClickHouse.
00:48:15
Speaker
We're not really running a separate ClickHouse process in our stack; there's only one process, Proton, if you choose the single-node way. So, interestingly, Proton is one process, and we get a lot of nice things from ClickHouse, including the ports. The ClickHouse ports are interesting: the ClickHouse folks really focus a lot on simplicity and
00:48:41
Speaker
flexibility. They have an HTTP port, a TCP port, and they even have different ports for different database protocols. For example, you can turn on certain ports to use a Postgres or MySQL client to connect to ClickHouse. Oh, really? I didn't know that. So you get all that stuff more or less for free, right? So Proton leverages that. We really customized the ClickHouse Grafana plugin
00:49:10
Speaker
so that it has much better integration with Proton. But that kind of channel, the HTTP stuff, is largely inherited from ClickHouse. That's the reason we get a lot of benefit from ClickHouse: all those nice drivers for different languages, the integrations with BI systems. But we're also contributing back to ClickHouse. For example, we contributed some PRs regarding how we do stream processing
00:49:39
Speaker
back to the community, so that if the ClickHouse team wants to merge them, you can get some basic stream processing in the official ClickHouse version. If you want more features, for sure, Proton will be the better way. At the same time, with Proton, because we're a small team, we can iterate and try different things faster, and we're willing to share all this with the ClickHouse team, but they might be
00:50:05
Speaker
more focused on, I don't know, maybe the data warehouse market as their primary goal. Streaming is something they want to do, and they're willing to accept us as contributors, but they may not have enough of the team focused on streaming today.
00:50:19
Speaker
Yeah, yeah. Here's the code; do with it what you will. Well, I'm going to go and play with that then, because I have installed and tried Proton, but I didn't know I could connect Grafana to it and get nice, pretty graphs for free. So I'm going to go and do that. Jove, thanks very much for joining me. Thank you.
00:50:39
Speaker
Thank you, Jove. As usual, you'll find links to all the things we discussed in the show notes, including links to Proton's source and its docs, and to Timeplus. And if that's left you curious about ClickHouse specifically as an OLAP database, I've added a link to a previous episode we did with Al Brown, where we did a deep dive into ClickHouse. You might find that fun. Hat tip to Al while we're here.
00:51:03
Speaker
Before you check out that or another of our many episodes (we're closing in on 50 now), please take the time to like this one, maybe rate it, share it, or review the podcast. It's all good for my heart and my feedback, and good for the future of Developer Voices, a future I now leave you to. I've been your host, Kris Jenkins. This has been Developer Voices with Jove Zhong. Thanks for listening.