
Bringing Pure Python to Apache Kafka (with Tomáš Neubauer)

Developer Voices

The “big data infrastructure” world is dominated by Java, but the data-analysis world is dominated by Python. So if you need to analyse and process huge amounts of data, chances are you’re in for a less-than-ideal time. The impedance mismatch will probably make your life hard somehow. 

So there are a lot of projects and companies trying to solve that problem and bridge those two worlds seamlessly, and many of the popular solutions see SQL as the glue. But this week we're going to look at another solution - ignore Java, treat Kafka as a protocol, and build up all the infrastructure tools you need with a pure Python library. It's a lot of work, but in theory it would make Python the one language for data storage, analysis and processing, at scale. Tempting, but is it feasible? 

Joining me to discuss the pros, cons, and massive scope of that approach is Tomáš Neubauer. He started off doing real time data analysis for the McLaren F1 team, and is now deep in the Python mines effectively rewriting Kafka Streams in Python. But how? How much work is actually involved in porting those ideas to Python-land, and how do you even get started? And perhaps most fundamental of all - even if you succeed, will that be enough to make the job easy, or will you still have to scale the mountain of teaching people how to use the new tools you’ve built? Let's find out.

– 

Quix Streams on Github: https://github.com/quixio/quix-streams

Quix Streams getting started guide: https://quix.io/get-started-with-quix-streams

Quix: https://quix.io/ 

Tomáš on LinkedIn: https://www.linkedin.com/in/tom%C3%A1%C5%A1-neubauer-a10bb144

Tomáš on Twitter: https://twitter.com/TomasNeubauer0

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Kris on Twitter: https://twitter.com/krisajenkins  

--

#podcast #softwaredevelopment #datascience #apachekafka #streamprocessing

Transcript

Language Divide in Data Science and Infrastructure

00:00:00
Speaker
There's a contradiction in the world of large-scale data at the moment. The de facto language of data science is Python. They have all the core libraries. They have all the people pushing data crunching forwards. But the de facto language of data infrastructure is Java. They have all the big name Apache projects for handling large-scale data in real time.
00:00:23
Speaker
And neither side really wants to become the other. There's no sign that the Python world is about to try and rewrite Kafka. And there's no sign that the Java world's going to come up with something that replaces NumPy, even though NumJava would be a very satisfying word to say. But until someone bites the bullet and writes NumJava, I think the divide between data science and data infrastructure is here to stay. And that's a very fertile ground for bridge builders.
00:00:53
Speaker
My bridge building guest this week is Thomas Neubauer. He's the CTO of Quix, and I might summarize his bridge building technique as use Python for data science, use Kafka for data infrastructure, but if you stop thinking of Kafka as a Java tool and just think about it as a protocol, the solution is to teach Python a new protocol.
00:01:19
Speaker
But that's a lot of work. And how is it done? How do you implement a pure Python version of the Kafka world without rewriting Kafka? How much work is it? The devil is in the details. And even if you do all that work, is it enough to give people tools in a familiar language, or are you still left with the problem of teaching them new approaches too?
00:01:43
Speaker
Let's get stuck in and find out. I'm your host, Chris Jenkins. This is Developer Voices, and today's voice is Thomas Neubauer.

Challenges in Large-Scale Real-Time Data Processing

00:02:05
Speaker
Joining me today is Thomas Neubauer. Thomas, how are you? I'm great. Thank you. How are you doing? I'm very well, very well. Nice to talk to someone. You're not too far from me, right? You're in London. Well, at the moment I'm in Prague. I'm flying to London next week. So I'm traveling a lot between Prague and London. That's a nice life. I've been there, and I spent some very happy times in Prague. So maybe I should come back with you sometime.
00:02:35
Speaker
Yeah, sure. But we're not here to talk about international travel. That's a different podcast that I'd happily present. We're talking about the world of Python, right? Yeah. Python, obviously, a de facto key part of data processing in the technical world.
00:02:54
Speaker
And you've been trying to push the de facto state of things in a new direction, which we'll get to. But give me your estimation of the market as it stands. What's the normal way that people do large scale data processing in Python?
00:03:12
Speaker
Yeah, well, the world is quite difficult, to be honest, if you are in Python and you want to do large-scale data processing right now. It depends on what your requirements are, of course. If you go to batch, it's slightly better. But if you go to real time, and we're talking about building stream processing pipelines,
00:03:36
Speaker
your options are either using the big old server-side processing engines like Flink or Spark Streaming.
00:03:51
Speaker
You can build it by yourself. So you can use Kafka and its client libraries, or similar brokers and their client libraries, to do it by yourself. And the third option would be using some cloud provider's options, like AWS Lambda, for example, but they have lots of limitations as well. Now,
00:04:16
Speaker
The reason why I'm saying that it's not ideal is because Kafka and
00:04:22
Speaker
Flink and Spark, they are all coming from the JVM world. And as a result, for you as a Python developer who wants to use them, life gets suboptimal. First of all, you don't feel like a first-class citizen in the ecosystem. And secondly, when you actually try to use it, you will see that you have to do a lot of Java-related stuff.
00:04:50
Speaker
Yeah, the first time I used the Python Flink API, I was disappointed to get a Java stack trace back.
00:04:57
Speaker
Yes, so that's true. But I think I got disappointed even more when I had to put the JAR file of a Kafka connector, and then a JAR file that this JAR file depends on, in the right folder and then reference it in my code. I was even more disappointed.
00:05:22
Speaker
Yeah, I hadn't thought about that because I'm used to writing in Java as well, but that's not a friendly way to treat a pure Python developer.
00:05:32
Speaker
It's definitely not. And there are more problems with it. Like when you want to build your user-defined function in Python, which I guess is why you're using something like Python, because you want to use Python to build the business logic or potentially use something from the Python ecosystem. And it's not that straightforward, because you're actually not running your processing in Python. You're running it in Java.
00:06:02
Speaker
And so what Flink does, for example, is run two runtimes side by side, and there is a socket connection between them. And when you map your user-defined function in Python to some columns in Flink, what's going to happen is that there will be communication between them over that socket.
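For context, a minimal sketch of what a Python user-defined function for a JVM-based engine typically looks like, written in the style of the PyFlink Table API; the exact imports and decorator arguments here are my assumption and may vary between Flink versions:

```python
# Hypothetical PyFlink-style UDF: the Python function is registered with the
# JVM-based Flink runtime, which calls out to a separate Python worker process
# (over the socket connection described above) for every batch of rows.
from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(result_type=DataTypes.FLOAT())
def to_kmh(speed_mps: float) -> float:
    # Pure Python business logic, but when the job runs on a cluster it does
    # not execute inside the process your IDE or debugger is attached to.
    return speed_mps * 3.6
```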
00:06:24
Speaker
And that means you can't really debug it easily. There are performance implications if the payload is too big. And if anything goes wrong, you get an amazing Java stack trace back. OK, but there's
00:06:43
Speaker
I mean, I think a lot of people in the data processing world recognize that problem, but very few want to bite the bullet and rewrite something of the complexity of Flink in pure Python.
00:06:56
Speaker
Yeah, well, I think that there are a couple of new emerging server-side engines that are trying to be more friendly to Python. And there was a project called Faust, made by Robinhood, that was actually heading in that direction, but it got abandoned, which is a shame.
00:07:26
Speaker
Then, obviously, you have the cloud providers' options. But then you have quite a strong vendor lock-in, and they actually have a lot of limitations. Like, I don't know, if you use AWS Lambda, you will struggle if you have state in your services or in your transformations. And there are problems with dependencies and resource limits. So again, lots of drawbacks there. Yeah. Yeah, I can see that.
00:07:53
Speaker
It's when you have to deal with state that stream processing gets really interesting. Up to that point, you could just use simple transformers. Yes. Well, basically, I always say that as long as you're limited to one-message-at-a-time processing, so, you know, when you process your data,
00:08:15
Speaker
you don't care about the message before and the message after, it's a doable option to just use the client libraries of your broker or of your distributed system and do it yourself. But the moment you touch state,
00:08:33
Speaker
I think it gets exponentially more complicated to build in a scalable and resilient fashion. It's a really hard computer science problem to solve. Because you have to start thinking about persistence and distribution of that state. Yes, and partitioning of your data, checkpointing of your data and state, delivery guarantees, reassignments. So there's a lot of stuff that could happen.
00:09:03
Speaker
while your state is being changed. And so I would really not recommend building your own stateful processing on top of plain client libraries just for the purpose of one project. Yeah, it's, it's...
00:09:21
Speaker
Okay, then, because I know that seems like very good advice, but I know that you had a project where you felt this pain and suddenly decided to build a stream processing library in Python for one project effectively.

Building Real-Time Pipelines for Formula One

00:09:38
Speaker
Yes. So in a way. So what's the project and why was it so painful for you that you decided to bite the bullet? Well, so.
00:09:47
Speaker
In my previous job, we were building real-time decision insight in Formula One. It was basically a pipeline with telemetry data from the cars. The idea was to build
00:10:04
Speaker
pipelines that would tell you what to do in the next seconds, not the next hours. And the amount of data was quite huge. It was around 30 million different values from different sensors per minute from each car. And the people that were developing these pipelines were mechanical engineers, data scientists, ML engineers. They all wanted to use Python or Java.
00:10:34
Speaker
And this is the first time I thought, hold on, this is very powerful technology. The streaming stack was really, really great. It worked really well. But for these teams to actually leverage it was very difficult. There were no tools, no way to simply analyze the data.
00:10:55
Speaker
So, you know, things like Kafka Tool back then weren't really sufficient. And then when we actually managed to explain to everybody how to connect to the source of data and how to actually, you know, get it into Python,
00:11:14
Speaker
the biggest problem was the muscle memory of working in batch, just analyzing data in a database in a Jupyter notebook, versus building real-time streaming applications where you process row by row. It's such a different paradigm, such a different approach to the problem, that they struggled a lot with just the concept of streaming.

Shifting to Streaming vs. Batch Processing in Python

00:11:44
Speaker
Like, you know, sometimes I think even a mean, even a mean, that's the most simple operation that you can do in Excel. You know, imagine you have speed and RPM and you want to do an average speed
00:12:00
Speaker
over a table. Not everybody can do that. But to do a mean over the last 10 seconds for a car, in real time, is actually a very complicated thing and a very different approach.
00:12:16
Speaker
This is something that always interests me, because I think a lot of this sector of the tech industry acts like it's a purely technical problem. And that's a big part of it. But the mental problem is a surprisingly large leap, right? And you almost would have thought, I mean,
00:12:34
Speaker
Python has asynchronous coroutine primitives, right? And they've been in there for a long time. You would have thought the idea of streaming data would be a little more normalized in Python. Yeah, I think that
00:12:53
Speaker
Building asynchronous code is obviously a bit more complicated than single-threaded code. But I think streaming goes much further than that. The level of unintuitiveness for somebody who has spent their whole life working with static data is much bigger than for someone moving from a single-threaded to a multi-threaded application.
00:13:18
Speaker
And sometimes I kind of, you know, compare this to people trying something like Reactive Extensions for the first time. You know, it's such a huge, overwhelming new thing. Or when people in the past moved from the classical
00:13:42
Speaker
you know, WinForms-style desktop applications to these MVVM and MVC patterns on the web. These big mindset switches require you to suddenly do a common task very differently. And this is very similar.
00:14:00
Speaker
Everybody is used to getting the data from a database, analyzing it in a Jupyter notebook, plotting it on a waveform, looking at it on a map, maybe doing some filtering, some analysis. And they are in the comfort of knowing the data are not moving. When you run stream processing code, every time you run it, it will get a different input.
00:14:24
Speaker
And that's just so difficult to handle. Because at the end, you're building something which gets input, does something with it, and produces output. And it's difficult to tune that middle box when the inputs are changing every time you press Run. That's difficult.
00:14:43
Speaker
Yeah, I totally see that. I mean, maybe I've been thinking in this way for too long to fully see the problem. But to me, it almost seems a lot simpler. You just have to build something that deals with one thing at a time. And some other engine will scale that up to dealing with an infinite number of things.
00:15:02
Speaker
Yeah, so that's obviously one way of looking at it. The problem is when you get stateful: you need that context of your data.
00:15:15
Speaker
And then in streaming, where it gets even more difficult is that, apart from the actual data, what really matters is the behavior of the data. So the order of messages, the speed of messages, and the schema are also important, so they don't trip up your code.
00:15:38
Speaker
When we were back at McLaren, for example, there were lots of bugs and crashes caused just by the nature of the burst of data coming after the start of the engine. People just did not unit test it and did not anticipate that there would be a burst of data in the first five seconds worth what would be five minutes
00:16:04
Speaker
normally, and they just got a lot of memory overflows. So the nature of the data, how it comes in, is also important, not just the content of it. Okay, yeah. Okay, so
00:16:20
Speaker
So let's start talking about your solution to this problem and how you designed it. Because it's a two-fold problem, right? You've not only got to design tooling, but find some way of making it... Is it a question of training people to think in a different way, or is it a question of finding a way that fits in with their current thinking?
00:16:39
Speaker
Yeah, well, I think it's a bit of both. Having said that, I don't think that we can get away with just training people. I think I personally...
00:16:54
Speaker
felt myself that it's rather difficult. So when I was looking at streaming and at what the obstacles in the learning curve are, I was basically trying to look at it like: I haven't done streaming at all, and now I'm at the moment where streaming is intuitive for me. What were the obstacles on this journey?
00:17:24
Speaker
And can we somehow remove them?

Benefits of Python IDEs for Development

00:17:28
Speaker
And I think that the first big obstacle is what we discussed at the beginning, which is that you are not actually running your code. You're running it in an engine, which isn't even Python. So I think the big power of something like Faust or something like the Quix Streams library is that
00:17:51
Speaker
you actually run your functions in your IDE. So you can leverage the Python ecosystem. And then when you inject any sort of dependency from it, you will see how it works in your IDE. And you can debug it. You can go line by line and see, OK,
00:18:12
Speaker
I'm using this mathematical function from this science package, and it's crashing because I'm giving it badly formatted numbers. I see it because I get the breakpoint before I call that library, and I investigate all the inputs, and in the watch window in my IDE, I'm changing the input until I get it done.
00:18:40
Speaker
That is a vastly different experience for a developer than: I'm just deploying some black box, which I'm submitting to my engine, and then I'm seeing it crash or not, and I'm relying on logs to... Yeah, as much as I like cloud services for putting things in production, having something local and native is just much nicer for development time.
00:19:05
Speaker
Yeah, yeah, yeah, sure. And you know, especially when you get into the more advanced stuff like the machine learning and computer vision.
00:19:16
Speaker
you also get dependency issues. So you might just pip install something, like in computer vision, and it will not work until you get a particular system dependency onto the machine where you run your code. And so if you have a server-side engine, you have to make sure that every node where this
00:19:38
Speaker
Flink cluster or Spark cluster runs will have this dependency, this system dependency. Like for computer vision, you have the lib-something that is just going to have to be there, which normally you just install in Linux. Now, if you go with the microservice Docker route, you have Docker to do that for you. And that's brilliant. It's just a line in a Dockerfile. So,
00:20:06
Speaker
I think that library approach, where you're basically running the library in containers, gives you not just a better IDE experience, but also much easier integration with the ecosystem of the language, which in this case is Python. Okay. So in that scenario, are you connecting your IDE into the Docker container so that you're developing inside Docker?
00:20:31
Speaker
That's a good question. Yes, it's an option. So you don't have to do it. You can obviously develop Python on your laptop, and it's very easy to do that. But what we're working on right now is exactly to cover this with dev containers. So the idea is that you have
00:20:52
Speaker
You have the dev profile that you'll be using to deploy your code, and then you attach it to your dev container, so you have exactly the same environment in your Visual Studio Code or PyCharm. And it's brilliant, especially when your Python library doesn't support M1, for example.
00:21:10
Speaker
So I'm using this very often, because some Python libraries are still not built for the M1 architecture, and you can use dev containers with a different base image to overcome that. Yeah, that makes sense. I'm just making a quick note to look up dev container support for Neovim. And I'll move on from that, because I don't always use VS Code.
00:21:37
Speaker
Well, I'm not sure. But it's quite a popular technology now, so I would guess there would be some support. It's rather brilliant because you can even configure things like your add-ons in a dev container config file.
00:22:02
Speaker
So you can literally prepackage the environment for developers with everything they need, whether they're using it locally or in the cloud. Like all the editor plugins that you want. Yes, exactly. Yeah, that's nice.

Streaming Data Frames Simplified

00:22:16
Speaker
Yeah. So what we're working on right now is that we're going to have a Quix CLI. When you are developing Python code, you would have a command that would open Visual Studio Code
00:22:32
Speaker
with everything you need to develop a Python-based stream processing service, including all the requirements, all the Python stuff, all the Python plugins in your Visual Studio Code, so you don't have to go through the painful journey of learning it all by yourself. Yeah. That sounds nice.
00:22:59
Speaker
But that seems like step one, right? If that were just the problem you were solving, you could have gone into the developer tooling business generally. So what's your next step from that into making streaming specifically easier?
00:23:20
Speaker
I actually did. So the first years of Quix were more focused on tooling, which we just discussed. And yes, the second part is quite interesting. You kind of ordered it as it really happened. The second step is the actual stream processing, which is the second challenge. So even if you have the tooling,
00:23:48
Speaker
even if you get everything you can imagine to develop your code comfortably, and you have oversight of the data, and you can deploy the code easily as well. It was a huge problem for us to teach people how to use Kubernetes so they are independent.
00:24:08
Speaker
Like, you know, you're either gonna build a monstrosity internally, so people can use Kubernetes somehow abstracted, or you're gonna teach them how to do it. There's no ideal solution. And then even if you give them that, you will find that they just can't, or not that they can't, but it's very hard to get their head around this new concept.
00:24:32
Speaker
And so this is where the streaming data frames idea came into play, an attempt to bring batch muscle memory into streaming. Right. Data frames are a term I only really know from a cursory acquaintance with Pandas, right? Yeah. You're treating it as more of a concept than a library-specific thing.
00:24:59
Speaker
Yes, and to be honest, we are not the very first who had the idea to use the concept outside of the original technology. So, PySpark has a slightly similar
00:25:17
Speaker
approach for batch and big Spark data manipulation: it gives you a Pandas interface to analyze the data, but actually under the hood it's not Pandas, it's Spark, with the rows being distributed across a Spark cluster. Now, that's what we did with streaming. So
00:25:39
Speaker
you can think of a topic in Kafka as a lot of messages. You're going to have thousands of messages in different streams, a stream being a message key. So if I give an example of different Uber cars driving around a city, each driver is going to send
00:26:03
Speaker
a message with the message key, driver ABC, and the payload is going to be some data: GPS location, longitude, speed, et cetera. Now, you can think of each stream as, if you rotate it by 90 degrees, an infinite table, an infinite virtual table where each property of the message forms one column.
00:26:28
Speaker
Yeah, so imagine you have JSON, let's simplify this to JSON: there will be a property called speed, and that property will repeat in every message of that one driver's stream. Now what we're doing is basically flipping it 90 degrees into a table where each message is one row.
00:26:49
Speaker
And then you're working with that table as if it were static in your Jupyter notebook, as if it were materialized, but actually it isn't. So let's say that you have a speed column and you just want to, for practical reasons, convert it from meters per second to kilometers per hour.
00:27:07
Speaker
If you had the static data in your Jupyter notebook, you would just do one line in Pandas. You know, df, name of new column, equals df speed multiplied by 3.6. And it would just add a new column to your static data. Now, we do the same with this virtual table.
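To make that concrete, here is roughly what the batch one-liner looks like in Pandas (the column names are just examples):

```python
import pandas as pd

# Static batch data: one row per telemetry message, speed in metres per second.
df = pd.DataFrame({"speed": [12.0, 15.5, 20.1]})

# The "one line in Pandas": derive a new column over the whole table at once.
df["speed_kmh"] = df["speed"] * 3.6
print(df)
```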
00:27:28
Speaker
But the difference is that every time we get a new message, which is a new row, we execute this function to add that new column, which is actually, in this case, a new cell, and then we send it to the output topic. So we are doing the same thing. It's just we're doing it for every message rather than the whole table. Right. So I'm building up a mental image of an Excel spreadsheet where
00:27:59
Speaker
I can scroll down and down and down and it keeps going and new stuff's coming in even as I scroll. And I would like to pretend I can just transform it, I can add a new column in my Excel sheet once. What you're doing behind that pretence is forward-filling it: every time a new row gets added, you fill in that column that I defined earlier.
00:28:26
Speaker
Yes, exactly. So when you're writing your Python code with streaming data frames, you're actually not manipulating any data. You're building a lazily evaluated data pipeline, which will know what to do when the first message arrives.
00:28:44
Speaker
And then those instructions, so you're basically building instructions with your Python code, will be executed for every message. So you don't have to think about that. Your brain doesn't have to absorb the fact that data are flowing and causing these rather unnatural behaviors compared to batch.
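For flavour, a rough sketch of what that lazily built pipeline can look like with Quix Streams' streaming dataframes. The class and method names (Application, topic, dataframe, to_topic, run) are taken from my reading of the library's getting-started docs and may differ between versions, and the broker address and topic names are made up:

```python
from quixstreams import Application

# Connect to a Kafka broker (address and topic names here are only examples).
app = Application(broker_address="localhost:9092", consumer_group="telemetry-demo")
input_topic = app.topic("car-telemetry", value_deserializer="json")
output_topic = app.topic("car-telemetry-enriched", value_serializer="json")

# No data is touched here: these lines only build the "instructions"
# that will be applied to each incoming message (each row) as it arrives.
sdf = app.dataframe(input_topic)
sdf["speed_kmh"] = sdf["speed"] * 3.6
sdf = sdf.to_topic(output_topic)

# Start consuming; the pipeline above now runs once per message, indefinitely.
app.run(sdf)
```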
00:29:14
Speaker
But you can focus on the actual data and the manipulation of it. So for example, if I create a classical batch operation, I load the data of an Uber driver, I add the new column and I put it back into the database. And basically what I'm doing is I load
00:29:36
Speaker
all the data once, I add the column once, and I save it once. What I'm doing here is that I have an input topic, and I'm getting these rows as they arrive at my transformation. I'm adding the column to each row, you know, independently, and then I'm sending it to an output topic where somebody can consume it. Equally, somebody can consume the result of the batch job in the table.
00:30:05
Speaker
So at the end, you can sink that data pipeline to a table as well. And the only difference would be that the rows will be appearing in real time rather than once per day or once per hour when the batch happens. But the result is the same. And you are abstracted, as a developer, from the drudgery of thinking in this paradigm.
00:30:32
Speaker
Yeah, I can imagine writing some code in Python that looked like I'm doing a for loop over a list, and then some compiler magic transforming that into a bunch of yield statements, which actually are processing it one by one as it comes in.
00:30:48
Speaker
Yeah, that's very similar. Yeah, it's that way of thinking, yeah. OK, I'm happy with that, and I can see how that would shield, insulate the developer who's happy with batch from dealing with the streaming world. Surely this gets hard when we reintroduce things like state and windows. Yes. Having said that,
00:31:15
Speaker
it's still, you know, it's not 100%, obviously, because, for example, with windows, there are more, you know, more window types in streaming than make sense in batch. Like, for example, hopping windows are a bit, you know, counterintuitive in batch, but things like rolling windows or tumbling windows,
00:31:42
Speaker
they are still the same in a Jupyter notebook or in streaming. So you can think of it like: give me the last 10 seconds' average speed. Now, if you go to a Jupyter notebook, you load your driver's data and you do the operation, you would do df.rolling,
00:32:06
Speaker
you know, you put the parameters and it will add a column where you would use the values from the cells above to calculate a value. What we do instead is that as the rows are coming in, we save them to the state and then every
00:32:28
Speaker
message coming afterwards would be a combination of the incoming message and the state to produce the output. So then we're kind of moving from an I/O box to an I/O box plus state, where the logic goes with the state as well.
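To make the contrast concrete, a hedged sketch of both versions of "average speed over the last 10 seconds": the batch one with Pandas' time-based rolling window, and a hand-rolled streaming one that keeps a small piece of state per car. The streaming half is illustrative plain Python, not any particular library's API:

```python
from collections import deque

import pandas as pd

# --- Batch: all the data is already there, so a rolling window is one line.
df = pd.DataFrame(
    {"speed": [10.0, 12.0, 11.0]},
    index=pd.to_datetime(["2024-01-01 12:00:00", "2024-01-01 12:00:04", "2024-01-01 12:00:09"]),
)
df["speed_10s_avg"] = df["speed"].rolling("10s").mean()

# --- Streaming: keep per-key state and update it as each message arrives.
state: dict[str, deque] = {}  # car id -> recent (timestamp, speed) pairs

def on_message(car_id: str, ts: float, speed: float) -> float:
    window = state.setdefault(car_id, deque())
    window.append((ts, speed))
    # Drop anything older than 10 seconds before computing the mean.
    while window and ts - window[0][0] > 10.0:
        window.popleft()
    return sum(v for _, v in window) / len(window)
```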
00:32:46
Speaker
OK, let me see if I can push you further on this, because I can see how you could library away, compiler away some of the differences between streaming and batch. Does your shield not break when you start doing things like streaming joins of potentially infinite sources of data?
00:33:13
Speaker
Yes, obviously, the further we go into stream-processing-specific features, the
00:33:25
Speaker
further the two would drift from each other. But still, for joins, you can join data in a static environment and you can join them in streaming. In streaming, you have to provide more information about the behavior that you expect. So for example, how long you are willing to wait for your data. Now, of course, that question doesn't arise in the static world, because you have all the data at your disposal. But here, you might want to be
00:33:55
Speaker
waiting five seconds for the second source to arrive. And that slightly complicates things. But my thinking here is that it's a journey. You know, you're going into stream processing, and you're gonna start with simple stuff, then move on to probably some stateful stuff. The joins are hopefully really the thing where
00:34:23
Speaker
you can start with some defaults and then start tweaking them. Still, I think it's much simpler than using just the client API. OK, so, tell me if I'm right or wrong here, but I think I'm going to characterize your approach like this: lots of people in this world are trying to say, we need to make it seem familiar for people, so we'll start with SQL;
00:34:52
Speaker
that seems like a good choice but eventually you hit a ceiling where SQL isn't powerful enough for the job and then you get bumped into Java. And you're saying we can still do the same trick of making it seem familiar but if we start with Python, eventually we'll have to introduce new concepts but we won't have to introduce a whole new language.

Python as a Familiar Tool for Streaming

00:35:14
Speaker
Exactly. And we're carrying all the perks of a stream processing library with it, which is native in the language. So hopefully, you know, people
00:35:31
Speaker
can debug it to get a bit more sense of it. And when they're building the logic, it will be a bit less complicated to debug the problems. So that's the idea behind it, yeah. And hopefully, the muscle memory from batch will carry over into streaming.
00:35:58
Speaker
Yeah. Okay. Yeah. I quite like that. Because it's like saying, you will have to learn new things, but you don't have to learn a lot on day one to even get some results. Yeah. Yeah. Yeah. And this was always our, you know, our way of thinking: if there is a path to get you to something where you see the value fast,
00:36:23
Speaker
people will appreciate that. Because I always felt that when you're learning a new technology, or a new thing in general, having these points of: okay, I know that I can do something, and I feel enabled to do new things, because I just learned this new stuff,
00:36:46
Speaker
it kind of drives you and motivates you to go further. When you have to spend a long time to get to that first anchor point, you start getting doubtful, like, am I actually going to get there? And maybe you bail out. Yeah. Is it the library? Is it me? Is it time to move on either way? Yeah. Okay. So let's dive into some code, because
00:37:12
Speaker
I can easily think of a language or two where I'd like a decent stream processing library that isn't Java. Let me pick one. I'm going to say Gleam. I want a stream processing library in Gleam. I'm feeling spicy. Tell me how and how much work I can expect to build a proper stream processing library in a new language. Give me some pointers.
00:37:36
Speaker
Basically, first thing you have to look at is what are the client libraries for the broker you want to choose. So let's say you want to use Kafka. Now, if that language has a decent library for the broker that you want to use,
00:37:59
Speaker
then one thing is ticked. It's a good start. And then you're going to have to learn a lot of concepts if you want to go to stateful stream processing. For stateless, it's more about packaging the whole Kafka interface into a more digestible, more
00:38:25
Speaker
language-specific interface, or use-case interface, if you like. Take the Kafka library and make it idiomatic for the language. Yes. But then when you go to stateful stream processing, it's a journey, because you need to think about how you're going to store your state.
00:38:50
Speaker
Are you going to use changelog topics? Are you going to use state with checkpoints? Are you going to duplicate your state for some resiliency against database corruption? And then you need to decide: am I going to use my broker, and possibly the deployment engine like Kubernetes, to scale my processing,
00:39:18
Speaker
which is the client library approach, or am I going to build a server-side engine that's going to manipulate my data and my code in a fashion to do that? Both directions have pluses and minuses, but this is the decision you have to make. And if you go with a streaming library like we did,
00:39:42
Speaker
And then, yes, you're going to have to use the broker scaling and resiliency features to make this scalable and resilient. So like retention,
00:39:58
Speaker
the replication factor of the topics, the assignment of your partitions. You're probably going to have to use temporary topics for certain things like group by. So the way you can do a group by with the streaming library is that you re-stream your data to a temporary topic, and by that, you repartition it. If you do it with Flink, Flink will take the data and partition it inside the engine. But we don't have the engine.
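A hedged sketch of that re-streaming trick using the confluent-kafka client directly; the topic names and the group-by field are made up for illustration. Each record is re-produced to a temporary topic keyed by the grouping column, so Kafka's partitioner lands all rows for one key on one partition, where a downstream consumer can aggregate them:

```python
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "groupby-repartition-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["car-telemetry"])

producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    row = json.loads(msg.value())
    # Re-key each record by the group-by column. Kafka then routes all records
    # with the same key to the same partition of the temporary topic, which is
    # roughly what a server-side engine would do internally with a shuffle.
    producer.produce(
        "car-telemetry--by-driver",
        key=str(row["driver_id"]),
        value=msg.value(),
    )
    producer.poll(0)  # serve delivery callbacks without blocking
```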
00:40:28
Speaker
Yeah, so you have to build some kind of process job graph generator thing that's transparent to the end user. Yeah, and well, transparent, but also extensible. So I think that whenever
00:40:51
Speaker
however rich or not rich your built-in methods are, in terms of what you offer built in, in any real application, people are going to have to build their own user-defined functions. It's just...
00:41:05
Speaker
I don't see how any SQL-based system without UDFs is in any way useful. It needs at least SQL UDFs, if anything, because in the real world, you just need more flexibility.
00:41:27
Speaker
And so that has to be easy to do. Like, you know, maybe you're doing not a mean but a standard deviation, or maybe you're, you know, calling a machine learning model to give you some estimate, or, I don't know, recently I was calculating,
00:41:46
Speaker
which is quite an interesting use case: if you have the GPS coordinates of the car, what is the distance travelled between the points? So it's a function. It's just a mathematical function.
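As an illustration of that kind of user-defined function, a sketch of the distance calculation: a flat-plane Pythagoras version, and a haversine version for when you care about the curvature of the earth mentioned just below. The constants and function names are mine, not from any particular library:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean earth radius in metres

def flat_distance_m(lat1, lon1, lat2, lon2):
    # "Pythagoras" version: fine for points a few metres apart on a track.
    dy = (lat2 - lat1) * 111_320  # approx. metres per degree of latitude
    dx = (lon2 - lon1) * 111_320 * math.cos(math.radians(lat1))
    return math.hypot(dx, dy)

def haversine_distance_m(lat1, lon1, lat2, lon2):
    # Great-circle distance, accounting for the curvature of the earth.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```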
00:41:59
Speaker
And it's one that's easy if you just want to do Pythagoras, and a bit more tasty if you want to consider the curvature of the earth. Yes, exactly. I was thinking about that recently, talking with a friend. Now, the problem with some of the SQL engines, and maybe you can tell me how you solved this, so the problem with some of the SQL engines is this.
00:42:25
Speaker
You've got that ceiling where eventually you get kicked out into, say, Java. And if you've got user-defined functions, that ceiling goes a lot higher. There's a lot more you can do staying in SQL. But you always have a problem when you let users run their own functions, but you're trying to provide a cloud service. You can't let the public run arbitrary code, but you need to let the public run arbitrary code.
00:42:54
Speaker
How do you solve that? Yes, well, first of all, I would say that user-defined functions in SQL move the ceiling higher. They also open the door to monstrosities.
00:43:12
Speaker
I think that when people start to build business logic in SQL, very quickly it gets very ugly. I have quite a lot of experience with that from the beginning of my career, actually. I may have committed a few sins myself in my past. I remember the T-SQL times very well,
00:43:34
Speaker
when people were creating thousands-of-lines-long stored procedures. And this kind of leads to the same thing. Well, yes. So running customer code, it's a challenge. Now, the way we have solved it is using Docker and Kubernetes and setting the rules around them accordingly.
00:44:00
Speaker
But yes, you're kind of opening Pandora's box with that. But I don't see the answer in limiting the framework. The way I see it is that people will just deploy dedicated instances of the thing for systems that require the extra,
00:44:26
Speaker
the extra protection of, you know, separated infrastructure. So don't do language-level security, do container-level security? Yes, because, like, I just don't see how a SQL-based transformation without UDFs would be in any way useful, because
00:44:52
Speaker
you're literally working with five or 10 or 15 functions that you have at your disposal. That's just not going to cut it to me. I think we do sometimes in the streaming world underestimate the business value of just taking this schema and turning it into that schema.
00:45:14
Speaker
But as soon as that becomes the day-to-day work, the more interesting work, it's like being given a programming language where you can only use the standard library. There's a lot of work you can do, but if you can't define your own stuff, you're very limited.
00:45:30
Speaker
Yeah, and well, yes, and also, you know, the standard language libraries are usually much richer than Flink SQL's. Flink SQL is quite rich, because the libraries and the engine are very old, but still, it's, yeah.
00:45:55
Speaker
Still limited. The nice thing about SQL is it's very accessible and even non-programmers might well know it. I think that's becoming kind of true of Python too. Yes. Having said that, we're also planning
00:46:13
Speaker
to build a SQL layer on top of our Python library. Well, it's not that difficult, to be honest, and it has value. It opens the door even more to a bigger pool of people. But yes, I think that Python...
00:46:33
Speaker
Python is the most used language in the world, and there are reasons for that. I think it's probably easier to learn than the others, although Python wasn't my first language to learn, so I kind of never had the opportunity to learn Python first. And I think the second reason is the ecosystem. Like, if you look at the world right now, all the interesting stuff,
00:47:03
Speaker
all the things around LLMs, all the interesting innovations, are somehow happening in the Python ecosystem. And that's what I think is fueling the user base of Python.
00:47:24
Speaker
Yeah, it's becoming the de facto language for people getting into programming and for data manipulation experts, right? Yeah, yeah, it is. And I would argue that if you're starting from ground zero, is SQL going to be that much easier
00:47:51
Speaker
than Python? Because to me, the SQL problem is that lack of autocomplete, in most cases. Some databases have autocomplete, and it's great, but it's always kind of fiddly. And
00:48:11
Speaker
if you start with Python, right, and you get good content assist, I don't think it's much more complicated. But then the power you get with it is so much higher. Yeah. Yeah. Again, that ceiling is, I mean, it's a general-purpose programming language, right?
00:48:28
Speaker
OK, so are you at all tempted on that logic? You know how to do it. You're looking at making it accessible to more people. Are you at all tempted to say, our next target is the JavaScript world or something like that?

Quix's Strategy for Python in Streaming

00:48:46
Speaker
We're constantly debating this internally. What are we going to do here?
00:48:55
Speaker
We have focused on one thing, because as a company at our stage, we have to focus our resources.
00:49:04
Speaker
To be fair, because we're not building just a library, we're building the whole cloud platform to help you process streaming data. It's such a challenging engineering task that even with the focus on Python, it's still too big. We will...
00:49:28
Speaker
We will look into more languages later. We will do that. But for now we're trying to solve it for Python. And so if I start my Gleam library, you're not going to feel too threatened at this stage. No, no. Not with Gleam. But yeah, there are other languages underserved by the streaming tech stack. Like, you know, my
00:49:53
Speaker
old .NET, which I started my career on, is very underserved. It's exactly the same problem as in Python. There's no difference in the problem. It's just that maybe Python has more users, but .NET is super underserved. The client library for Kafka is very
00:50:20
Speaker
suboptimal, I would say. That's a diplomatic way of putting it. It's surprising because C# is so spiritually connected to Java, you would have thought that it would already have pretty good support. Yeah. And yeah, and that there would be something like Kafka Streams
00:50:44
Speaker
very well supported by some big company, but there isn't. There is one attempt, a community attempt, which I haven't tried. But yeah, people are left in a suboptimal place to use Kafka and other technologies in that stack.
00:51:09
Speaker
Okay, so let's just go on to a slightly different topic, which relates to you dividing your attention as a small company, right?

UX Design in Stream Processing Tech

00:51:18
Speaker
You've got a cloud platform, and I'm going to say the user interface for that, for building streaming pipelines, is one of the nicest I've seen.
00:51:28
Speaker
So give me some insights on how we make this whole technology presentable to people in the world. Not just accessible to programmers, but like the art of user interface design for a new area is a tricky one.
00:51:43
Speaker
Yeah, I think one of the most important UX elements that we have done is our pipeline view, which kind of helps you get your thoughts straight about what you're actually building and what has actually been built by others. So, like,
00:52:10
Speaker
even people that are not programmers, but are maybe going to be using this pipeline, maybe they will use the result of the pipeline or somehow integrate with it, or they just need to understand what the team is building, with this pipeline view they can visualize it in their head. Just being given a bunch of services that maybe have some environment variables in them, that's very difficult.
00:52:39
Speaker
I think the second thing we are now trying to do internally is to serve this technology to different personas, because they have very different ways of how they want to consume the data and how they want to use it for their goals. So you have a software engineer, who probably wants to use
00:53:07
Speaker
the full power of microservices, the full power of a local IDE, and the full power of the dependency management that it provides, versus maybe a Python analyst, an analyst using Python who wants to just do a simple analysis on data and just write a couple of lines of Python
00:53:34
Speaker
to manipulate data, to do the windows and the filtering and aggregations. And maybe they are not interested in using the full-fledged programming experience there. So this is where we're thinking on that front. OK. I always think that you get a data science job and you think you're going to be doing data science, and you actually spend 80% of your time cleaning up data.
00:54:03
Speaker
Are you looking into this idea of like, what's the word, data mesh, like data catalogs? Like we can build, we can easily build something that would do all the cleaning of data and then present that as a user interface, like here are your options for data sources within the company. Yes. So I think that
00:54:28
Speaker
With this approach, first of all, you don't end up having such unreadable data. That's the number one thing, because you can branch the data as it comes into the system and then save it in a nice way. So you kind of,
00:54:53
Speaker
you're kind of moving the responsibility of doing something with the data to before it's saved rather than after it's saved, which I think is very powerful, because the natural thinking is: I'm getting this data, let's dump it to S3 or a similar style of database
00:55:15
Speaker
storage, and then you have this 80% of the time where they're trying to do something with it. Well, you can actually still dump it to S3, because there's a point in that. But then you can build branches and say, okay, here is a bit of time series data, well, let's save it to Influx.
00:55:36
Speaker
And here we have some data for our vector database, and maybe here some metadata for a Mongo database, or a similar style of document store, could be useful.
00:55:51
Speaker
And then if you enable them to do that, rather than some integration engineer who is actually taking their time on it, then they can actually leverage it. So that's the number one thing. And in terms of the resources, I think that it's just about getting companies to a
00:56:13
Speaker
better place, being in a better game in terms of data, like how you treat your data and where you are in that game. This is just one step, not everything. You know, there are more things to improve, but it's just one step.

Debating Data Schemas in Python Streaming

00:56:29
Speaker
Okay, that makes me think of one more thing we haven't really touched on. And it's not a natural fit for the Python world. But data schemas, types,
00:56:41
Speaker
is that part of the puzzle that you thought about? No, it's a very good question, actually. We're going to integrate with schema registry. We haven't yet. And I'll tell you kind of the reason.
00:57:01
Speaker
Other libraries and other engines are strictly typed, and they strictly require you to have a certain schema in the topic because of the way they surface the data. They surface the data as objects, and for that, you need the types. Now, we surface data as data frames, tables with columns, and those don't require such a strong
00:57:32
Speaker
schema, although at a certain point you might want one. And there are definitely use cases where schema registry integration makes total sense, for validation and for kind of the whole company-wide orchestration of your schemas. But it's not mandatory. So you can still, equally as you would do with Pandas in a Jupyter notebook,
00:57:58
Speaker
you can check whether these columns are present in my data and maybe fill the columns that are missing with something. People in Jupyter notebooks are not used to working with schema registries. I can believe that. I believe that most Python programmers are happy with dynamic typing, sure. Exactly.
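A tiny sketch of that schema-light check, the streaming equivalent of inspecting a dataframe's columns before using them; the field names and defaults are purely illustrative:

```python
EXPECTED_COLUMNS = {"speed": 0.0, "lat": None, "lon": None}

def ensure_columns(message: dict) -> dict:
    # Validate and patch one incoming message (one "row") in place of a schema.
    for column, default in EXPECTED_COLUMNS.items():
        if column not in message:
            message[column] = default
    return message

print(ensure_columns({"speed": 12.3}))  # -> {'speed': 12.3, 'lat': None, 'lon': None}
```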
00:58:22
Speaker
The reason why we're planning to integrate it is that this is not 100% true. And also it helps with some things like data visualization, et cetera, when you kind of provide some metadata for the data that you're sending. But on the other hand,
00:58:44
Speaker
it also brings a lot of complications in other use cases, so that's why we're definitely never going to make it mandatory. So imagine you're getting data from an IoT device that has different sensors sending at different frequencies. Now, what you're going to get is, for example, G-force data 20 times per second and GPS data once per second.
00:59:12
Speaker
Now, are you going to send them into two topics? Then it will make your life very hard. So then what you have to do is create a schema that contains both, and then you're just sending unnecessary padding and encoding sugar around. Or you're going to send messages as you want, and then you have processing in the pipeline that just joins the data
00:59:42
Speaker
when you need it, and doesn't join the data when you don't need it. So you don't need to join the data to sink it to your database, but maybe you need to join the data if you need to reference both columns in your processing logic. Okay, I can see that as a solution.
00:59:59
Speaker
We are gradually going full circle here, but I'm now thinking about error handling.

Error Handling in Real-Time Streaming

01:00:05
Speaker
Now, I'm not a Python data frames expert, so correct me if I'm wrong, but I believe that when you're dealing with data in batch, you've got a million rows of data, you calculate something, all the data's there, so it works or it doesn't.
01:00:21
Speaker
In the streaming world, you can have code tested on a million rows of data, and then on the three-millionth row it turns out that number isn't a number, sometimes it's a string. How do you deal with error handling when you can't process your errors in a batch? There are two types of use cases, which require a different
01:00:49
Speaker
way of handling these situations. Either you are in a situation, and that's usually with stateless operations, to be honest, where you'd rather keep getting data continuously and you'd rather not stop your processing,
01:01:11
Speaker
and then you probably send the messages that failed to a dead-letter queue of some sort, basically a topic which will hold the messages that weren't processed properly.
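A hedged sketch of that dead-letter-queue pattern in plain Python; process and produce here stand in for whatever processing function and Kafka producer you are actually using, so treat the names and topic strings as placeholders:

```python
import json
import traceback

def handle(message: dict, process, produce) -> None:
    """Process one message; on failure, divert it to a dead-letter topic."""
    try:
        result = process(message)
        produce("output-topic", value=json.dumps(result))
    except Exception:
        # Keep the original payload and the reason it failed, then move on,
        # so one bad message does not stop the rest of the stream.
        produce("output-topic--dead-letter", value=json.dumps({
            "original": message,
            "error": traceback.format_exc(),
        }))
```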
01:01:25
Speaker
And the second use case is where it's actually a stateful operation, where missing one message could make everything completely incorrect. And that is, by the way, the default behavior of our library. The way it works is that when you reach the point where something like that raises an exception,
01:01:54
Speaker
the code will stop there, restart the service from the last checkpoint and try it again, because maybe, you know, it was some environmental factor. And if it's not, if it's really just your code, the processing will stop there until you fix it. Until you deploy
01:02:15
Speaker
a fix in your code, it will basically be stopped at that checkpoint. And it's fine, because you have the storage of the topic, and you have the checkpoint saved, and you have your state saved at that checkpoint. So the moment you update your code with the fix, maybe you had not expected the data in that form, or maybe you had not expected that this column might be missing,
01:02:44
Speaker
then you can continue your stream processing. And that's a different, this is the second scenario, really. OK. Yeah, that makes sense. And the dead-letter queue scenario, is that something I would just handle as ordinary Python code, try and catch, save it off? Yeah. Yeah, yeah. It is. Like, imagine you have very stateless
01:03:11
Speaker
processing, where the fact that one message in 100 has a corrupted schema or corrupted data in it doesn't necessarily mean that the other 99 messages have to stop being processed as well. But if you have a problem with,
01:03:29
Speaker
you know, a balance calculation on your bank account, now it's probably better to stop. Yes. And that's definitely a domain-specific thing, like how important is it to process every single record? Yeah, that totally makes sense. Okay, so I think it's almost time for me to go and build my Gleam-based competitor.
01:03:51
Speaker
I'd better go and have a look at Quix Streams first, on my local dev container. How do I get started? The first thing, you just do a pip install of Quix Streams, which is part of the beauty.

Getting Started with Quix Streams

01:04:07
Speaker
Then our website even has a code sample with a public source, so you can literally play with that.
01:04:18
Speaker
And then when you are ready, you can sign up for our cloud and actually start building the pipeline. But yeah, the easiest way is to literally do pip install quixstreams and get that code sample from our website. It's doing some windowing, and you can change it to whatever you want. Kick the tires on it and see if it's a good fit. Yeah, exactly. Yeah, cool. OK.
01:04:45
Speaker
Okay, great. I will go and do that. And then I'll go and speak to my Gleam expert team. And you'll see me competing with you two years from now. Thomas, great to talk to you. It was great to talk to you as well. And thank you very much for having me. And yeah, good luck. See you the next time you are in London. Yeah. See you. See you next time. Cheers.
01:05:09
Speaker
Thank you, Thomas. And I'll tell you that I genuinely have started scratching out a Gleam library for this, but it's going to be a very, very long time before it's anything more than my little toy playground. Unless for some reason you want to invest, in which case we're weeks away from launch, provided we can secure the right level of funding.
01:05:29
Speaker
While you're holding your breath and waiting for that library, don't hold your breath. While you're waiting for that library, if you want to kick the tires on Quix, your best bet is probably pip install quix-streams, and then check the links in the show notes for documentation.
01:05:46
Speaker
One quick announcement before I go. If you're a regular listener, thank you for joining me regularly. There won't be an episode next week. We'll be back in a fortnight. That gives you a little extra time to like, rate or share this episode. Please do. Please take a moment. And if you're not a regular listener and you're not already subscribed,
01:06:06
Speaker
Do click on subscribe and join me in a fortnight for another Developer Voice. Until then, I've been your host, Chris Jenkins. This has been Developer Voices with Thomas Neubauer. Thanks for listening.