
ByteWax: Rust's Research Meets Python's Practicalities (with Dan Herrera)

Developer Voices
2.1k Plays · 7 months ago

Bytewax is a curious stream processing tool that blends a Python surface with a Rust core to produce something that’s in a similar vein to Kafka Streams or Apache Flink, but with a fundamentally different implementation. This week we’re going to take a look at what it does, how it works in theory, and how the marriage of Python and Rust works in practice…

The original Naiad Paper: https://dl.acm.org/doi/10.1145/2517349.2522738

Timely Dataflow: https://github.com/TimelyDataflow/timely-dataflow

Bytewax the Library: https://github.com/bytewax/bytewax

Bytewax the Service: https://bytewax.io/

PyO3, for calling Rust from Python: https://pyo3.rs/v0.21.2/

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Kris on Twitter: https://twitter.com/krisajenkins

--

#softwaredevelopment #dataengineering #apachekafka #timelydataflow

Transcript

Introduction and Guest Background

00:00:00
Speaker
This week, we're back in the world of data engineering with a curious mix of Python and Rust. My guest for this is Dan Herrera. For a while, Dan was working in ad tech, which is not everyone's favorite sector, I admit, and we do end up talking about that. But it's undeniably one of those sectors that's going to teach you a lot about the problems of processing data at scale in parallel at high speed.
00:00:28
Speaker
And it's those kinds of problems that led us to the topic of this week's episode: Bytewax, an open-source distributed stream processing engine which makes a number of interesting architectural decisions in this space that we're about to get into. But it has an interesting backstory of its own, so let me briefly tell you how it got started.

Why Rust for Data Engineering?

00:00:50
Speaker
Back in 2013, Microsoft Research published a paper on Naiad, which was a novel approach to these kinds of data processing problems. That got implemented in Rust and had its first release in 2017 under the name Timely Dataflow, which possibly you've heard of.
00:01:10
Speaker
And the thing I find interesting there is Rust makes sense to me as a low level, low latency language to prove your paper and build out some serious infrastructure. But it's not the first language you think of when you talk about data engineering.
00:01:28
Speaker
That probably belongs to Python. And there's the origin story of Bytewax: how do you take a research-backed tool written in Rust and bridge it into where all the data engineers are, in Python? And what do you end up with when you get there? What kind of beast have they built? Let's find out. I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is Dan Herrera.
00:02:06
Speaker
Joining me today is Dan Herrera. Dan, how are you? I'm very well. Thanks for having me. Pleasure. I'm wondering a lot of things about, well, you work with Bytewax.

Ad Tech Challenges and Learnings

00:02:18
Speaker
I'm wondering a lot of things about Bytewax, but why don't we start with the origin story? How did you get into the data streaming space?
00:02:26
Speaker
Sure, yeah. I come from a pretty traditional data engineering background, and that's spanned several jobs. Most recently before Bytewax I was at GitHub, but before that I was in the ad tech space, and before that I was the first data engineer, founding the first data engineering teams at other companies.
00:02:46
Speaker
I've worked in the data space for quite some time. And I think the streaming data portion of it came out of a lot of the work that I did in the ad tech space, which ended up having real-time requirements for doing bidding, header bidding for ads, and things like that. So it's been a long and interesting road. It's funny how much of advertising was a driver for adoption in the early days of it.
00:03:13
Speaker
Yeah, yeah. Similar problems of needing real-time interactions for data. Large data systems were very germane in that space. And so yeah, it was interesting. Yeah, I think I wasn't crazy about working in ad tech like anybody would be. But the problems are really interesting, and the people are always great. That's always my favorite thing about working in a job.
00:03:40
Speaker
I had a brief gig at a blockchain company, and I feel exactly the same way. The problems are really interesting. Can we please focus on that part? Yeah. My boss used to start every talk that he gave by saying, if you use an ad blocker, like I do, and then he would go on from there. Right.

Integrating Rust and Python

00:03:58
Speaker
But let's get into the tech of it, because the first interesting thing I noticed about this is you're in a Python space. You're building out a Python library for data streaming. But under the hood, it's Rust. And that seems like a curious way to approach it. So how did that happen?
00:04:17
Speaker
It was a confluence of some things that I think are happening right now in the data space, and more generally in the Python space. I think Python is really experiencing a renaissance. There's so much renewed energy and interest in Python and so many applications, the obvious one being machine learning and artificial intelligence.
00:04:44
Speaker
I think Python has been such a base layer for data for such a long time. And yeah, it's been kind of interesting, because my genesis in the data space was very much rooted in Java, and a lot of the current tooling and older tooling is all based around Java. And I was very excited about a couple of things. Number one, somebody introduced me to the project called PyO3, which I think is...
00:05:11
Speaker
PyO3 is a library that allows you to make ergonomic Rust bindings for Python, or vice versa. So you can call Rust from Python, or you can call Python from Rust, or both. And it's really a fantastic project.

Python in Machine Learning and AI

00:05:27
Speaker
So that was probably one of the things. And the second thing that I became aware of, because our CEO, Zander, introduced me to it, funny enough, was Timely Dataflow.
00:05:40
Speaker
So that's a Rust library. And the confluence of the availability of those two things, I thought, was a really great opportunity to visit the space. Okay. I want to get into Timely, but just briefly: I know I've worked occasionally with calling C from Python, and so I can imagine that Rust from Python will work, but C from Python I wouldn't describe as fun.
00:06:07
Speaker
Python is really amazing for being an extensible language. I think they've worked very hard. It's amazing to me that you can install a few packages using the same package registry you would for pure Python, invoke a program, and get some Fortran code, some C++ code, potentially some Rust, all within the same stack.
00:06:32
Speaker
I don't think people even realize how many different languages end up being connected to Python. It's always been a very extensible language in that way. And the most famous examples of this sort of connectivity, NumPy and SciPy, are libraries that connect to old, very battle-tested, very reliable implementations of things in other languages.
00:06:58
Speaker
Okay, so because of PyO3, you're looking at Naiad and Timely Dataflow, and not scared that it's written in Rust? No.

Timely Dataflow: Design and Implementation

00:07:08
Speaker
Yeah, it was... I was very excited when I started learning about Rust. It was like a very interesting language to me personally, and
00:07:19
Speaker
I think between the amount of care that has gone into creating both Timely and PyO3, it's just kind of amazing. They're both fantastic libraries to use, the community around them is really great, and you can see that there's some very deep thinking about some of the problems. The problem space for PyO3 is
00:07:42
Speaker
very large. Python is doing such amazing things as removing the GIL, or providing sub-interpreters that can interact without holding the GIL. So there's a lot of really interesting work going into Python these days, and the people that work on PyO3 have a very daunting task to make a library that can take advantage of those things and is still ergonomic to use, and they do a great job of it.
00:08:09
Speaker
Yeah, yeah, that's a very difficult marriage of being good at the low-level stuff and still being good at the developer experience stuff. Yeah, I think the same is true of Timely. I think my experience learning Timely...
00:08:25
Speaker
And I would say that I wouldn't represent myself in any way as being an expert at Timely Dataflow. I'm a very enthusiastic user, and I would say a dedicated student. And it's taken me a long time to sort of understand the subtleties of the language. But I think when I step back and look at Timely,
00:08:44
Speaker
I think they've done an amazing job of writing what they call a low-level library. It's made all of the world of difference to us to be able to construct what we wanted to construct based on the primitives that they provided. And I think after you do library design and API design for a while, you learn that that's a pretty impressive thing to be able to do.
00:09:07
Speaker
Yeah, it's hard to get something that's both flexible and usable. You either make it so general that it's hard to know where to get started, or so usable that it's hard to remain flexible. Yeah, that's exactly right. So we're going to get into the guts of Timely. Great.
00:09:29
Speaker
Of all the different stream processing ideas, you've got data coming in, you want to do stuff to it, and then send it on its merry way. What is it about timely that sets it apart, in your opinion?
00:09:44
Speaker
I think there's lots of things. One of the things: I went back through and was reading the Timely Dataflow book, which is a great read for anybody who's interested in this. It's the long-form documentation linked from their site (not specifically API documentation), and it describes dataflow programming as
00:10:08
Speaker
I'm trying to remember it exactly, but it's organizing a program in components, each of which can respond to the presence of new data, which I think is like an interesting description. And what it really means is that there's some coordination of pieces that happens in timely data flow, but in sort of
00:10:38
Speaker
In contrast to imperative programming, it's interesting to think about how that differs from an imperative programming model, because you can imagine any process reacting to the presence of new data, like a web server receiving a new request. But I think what's really interesting about data flow programming is that
00:10:56
Speaker
Each of the components are described as being connected to each other, but each of those components can receive new data at different points in the Dataflow. The availability of new data is not something that starts at the beginning and ends at the end, and a great example of this in a lot of Dataflow programming models is like a window.
00:11:16
Speaker
If you have a window of data that closes after a period of time, it's accumulating a bunch of state and a bunch of data. And then after a period of time, that window can close. And that event itself can cause computation to occur somewhere else in the dataflow. Yeah. So I often think we treat stream processing like it's a new idea, but once you take out the persistence part, it's very much like reactive programming. It's very much like goroutines and coroutines.
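The window example just described can be sketched in a few lines. This is purely illustrative Python, not any particular library's API; the `Window` class and its `on_close` callback are invented here to show the idea that a window's close is itself an event other components react to:

```python
# Illustrative sketch (not Timely Dataflow or Bytewax): a window
# accumulates items, and closing it triggers downstream computation.

class Window:
    def __init__(self, on_close):
        self.items = []
        self.on_close = on_close     # the downstream reaction to the close event

    def add(self, item):
        self.items.append(item)      # accumulate state while the window is open

    def close(self):
        # The close itself is an event: it hands the accumulated
        # data to whatever is listening elsewhere in the dataflow.
        self.on_close(list(self.items))
        self.items.clear()

results = []
w = Window(on_close=lambda items: results.append(sum(items)))
for x in [1, 2, 3]:
    w.add(x)
w.close()                            # closing causes computation downstream
assert results == [6]
```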
00:11:46
Speaker
Yeah, I think it shares a lot of similarities for sure. Yeah. So what is it that, I mean, you could have taken a naive approach and said, okay, Python, coroutines, I'll just do yield all over the place and I'll try to remember to write it to disk.
00:12:04
Speaker
Yeah. Why jump through rusty hoops to get to Naiad? It's a really good question. The thing that Timely Dataflow gives us, for Bytewax specifically, and for users of dataflow programming libraries like Bytewax as well, is
00:12:26
Speaker
The idea of progress tracking throughout the entire data flow is something that is simple on the surface, but I think it has profound implications for the way that you can build and especially scale independent components. The idea of when you have inputs in a streaming system, being able to correlate those with outputs in a streaming system actually becomes an interesting problem.
00:12:56
Speaker
Go ahead. Do you mean for debugging? Just for being able to produce correct answers, or fit whatever particular idea you have for the type of dataflow that you're writing. Do you want an example? Sure, I think that's a good idea. So in a streaming system, you can imagine that you're not going to see all of your input. In effect,
00:13:20
Speaker
you don't really know if you're ever going to see the end of any particular input. And so the question becomes, when do you produce answers, and how do you judge the correctness of those answers?
00:13:31
Speaker
So a simple example would be: if you're receiving a stream of bank balances for a particular user, how do you correlate the outputs you see with the inputs that affected that particular thing? Tying those things together is actually a difficult problem, and I'm jumping subjects here, but progress tracking and logical timestamping are
00:13:59
Speaker
the things that you can use in order to be able to correlate those things together. And it's one of the key concepts of Timely Dataflow.
00:14:05
Speaker
OK, maybe if you explain to me how it does that, it will start to come together. Yeah, sorry, it's hard to talk about Timely. And I realize what a tremendous job they've done in presenting this information. Frank McSherry and the other folks that have done a lot of work on Timely do a really fantastic job of talking about it. So I'll do my best to do the same, to do it justice. I'm sure you'll do it justice.
00:14:37
Speaker
So if you have a data flow and you're receiving input, each one of those inputs can be assigned a logical timestamp. You can say, not necessarily when did this particular thing occur, but it gives a sense of ordering over
00:14:57
Speaker
the span of inputs. Additionally, one of the things that Timely allows you to do is say, for a given timestamp: all right, there's a point at which I'm never going to see any more of this particular timestamp.
00:15:13
Speaker
So imagine that all of the records that you receive for something get assigned the same timestamp, and for demonstration purposes it's easiest to think of timestamps as auto-incrementing integers. Each of the inputs gets assigned a logical timestamp and can proceed through the dataflow. So let's just say all of the first batch of records that I pulled off of Kafka get the first timestamp of one.
00:15:45
Speaker
In a dataflow program, you can say: I've reached a point where I'm never going to assign anything else that same timestamp. I'm going to advance the timestamp to two, and the rest of the components in the dataflow can react to that information. They can know that they can now produce a result, saying: I know that I'm never going to see anything else with that same logical timestamp.
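A minimal sketch of that epoch mechanism, with names invented here for illustration (this is not the Timely Dataflow API): records are stamped with a logical timestamp, and advancing the epoch is a promise that the old timestamp will never be emitted again.

```python
# Illustrative sketch: logical timestamps and epoch advancement.

class EpochedInput:
    def __init__(self):
        self.current_epoch = 1
        self.closed = set()

    def emit(self, record):
        """Stamp a record with the current logical timestamp."""
        return (self.current_epoch, record)

    def advance(self):
        """Promise downstream that the current epoch will never be seen again."""
        self.closed.add(self.current_epoch)
        self.current_epoch += 1

inp = EpochedInput()
# The first batch pulled off the input all gets timestamp 1:
batch = [inp.emit(r) for r in ["a", "b", "c"]]
inp.advance()   # epoch 1 is closed; downstream may now finalize results for it

assert all(ts == 1 for ts, _ in batch)
assert 1 in inp.closed and inp.current_epoch == 2
```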
00:16:11
Speaker
Okay, so I could have, to use your example, a bunch of transactions coming in onto a user's account. And eventually I want to say, okay, at some point I've got to say for legal purposes, closing balance on this day was X. So I'm going to draw a line under all transactions with a timestamp of one.
00:16:39
Speaker
And anything that comes in now must be two or higher and it will not go into today's closing balance. Yeah, that's a great example. It's really interesting to be able to use that mechanism
00:16:55
Speaker
It's a very lightweight mechanism in order to be able to coordinate multiple workers. You can imagine a data flow can be started with multiple processes or multiple workers. They can share that information with each other. But all of the parts of the data flow can know when they're not going to see any more information at the same timestamp anymore. And then they can decide if they produce results at that point in time. And those can be correlated with that particular input.
00:17:22
Speaker
It's really interesting. They describe it as the lightest way to introduce that sort of idea of ordering, of coordination between multiple workers in a data flow system. And I think it's a really interesting way to think about it because there are so many things that you can actually do once you have some of these really primitive, interesting, very small primitives.
00:17:48
Speaker
That's the thing that interests me, because it sounds very simple in the small. And I guess that's the beauty of it, right? But what happens as you start to build that picture up to a larger and larger graph? In what way?

Scaling and Recovery in Dataflow Systems

00:18:02
Speaker
Like, how does coordination happen? How does? Or what are the implications? I mean, it's a great question. Probably all these questions. But like, OK, so it seems pretty simple for the case where there's one processing node saying, have I seen anything? Is it time to stop accepting ones?
00:18:19
Speaker
Yeah. But then, okay, in your example we've got a couple of different nodes all looking for the end of epoch one, and maybe some downstream processes who are still expecting epoch one. How does this scale up into complex graphs? Yeah. Let's start with how that notification that it's the end of an era goes through the graph.
00:18:47
Speaker
Yeah, so that's a great question. So what Timely is doing for you is giving you the ability to notify any of the components in the data flow that you want.
00:19:04
Speaker
to take action when these events occur. When you advance the epoch of input, that's a major event and the rest of the data flow can react to that information. I think what's really interesting about that is if you consider, you can have more than one epoch in the data flow at the same time.
00:19:27
Speaker
Just because you are emitting records with the timestamp of one, you can start emitting records with the timestamp of two or later and have those be processed in the data flow at the same time. But it gives the entire data flow an interesting set of order in that you can understand that timestamp one comes before timestamp two,
00:19:50
Speaker
And when you process this data, you can process them together. And there's a sense of which comes before the other. But you can have multiple epochs running in a dataflow at the same time.
00:20:02
Speaker
Okay, so there's something downstream that, because graphs are complex and different machines run faster, could get all the information from epoch 2 first and say: well, I'm going to have to hold on to that and wait until I've got a closed message for epoch 1, and then I can just spit the whole lot out at once.
00:20:21
Speaker
Yeah, so you can do some types of processing in an eager fashion. You wouldn't potentially emit a result at that point in time, but there may be computation that you want to do every time you see a new item. But you know that you can't produce a result until you've seen the end of that particular epoch. So how does that look? I'm assuming there's some kind of protocol between nodes saying,
00:20:47
Speaker
Okay, this is a domain record, so I'm going to just pass on "customer made a transaction". Meanwhile, there are messages that say: hey guys, it's the end of era one. Is that what the protocol looks like? It's a mixture of domain messages and protocol messages.
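The eager-but-deferred pattern described a moment ago, where an operator does work per item but only emits a result once it learns an epoch is closed, might look like this illustrative sketch (invented names, not a real Timely or Bytewax operator):

```python
# Illustrative sketch: eager per-item work, emission deferred to epoch close.

from collections import defaultdict

class CountPerEpoch:
    def __init__(self):
        self.counts = defaultdict(int)

    def on_item(self, epoch, item):
        # Eager computation each time a new item arrives, for any epoch.
        self.counts[epoch] += 1

    def on_epoch_closed(self, epoch):
        # Safe to emit: no more items with this timestamp will ever arrive.
        return self.counts.pop(epoch)

op = CountPerEpoch()
# Items from epoch 2 may arrive before epoch 1 is closed:
for epoch, item in [(1, "x"), (2, "y"), (1, "z")]:
    op.on_item(epoch, item)

assert op.on_epoch_closed(1) == 2   # emitted only after the close notification
```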
00:21:03
Speaker
Yeah, I remember when we looked at the progress tracking messages. So if you have more than one worker in a Timely Dataflow cluster, they're exchanging this information with each other. And I think part of the brilliance of the library is that it's both a runtime and a library. And the runtime has had lots of optimizations. You can imagine that coordinating this among a very large number of nodes can get expensive:
00:21:30
Speaker
exchanging all of the data for progress tracking with each other. So part of the parts of Timely that I haven't had direct experience with, but I understand, are doing a lot of optimization in terms of how that exchange of data happens between nodes.
00:21:45
Speaker
I think one other thing that's interesting to talk about just briefly about the way that Timely is different from other Dataflow systems is that when you have more than one worker in a separate system, let's just say like Spark, you can have parts of the Dataflow graph that are happening on different machines. Like for example,
00:22:10
Speaker
We can go back even further and think about map and reduce happening on completely different machines altogether. What's interesting about Timely is that each worker has the complete data flow graph and all of the operators that are included. So it has input and it has output potentially, depending on what you want to do.
00:22:30
Speaker
And so each one of those workers does more than just progress tracking exchange. In order to produce correct results, you might also want to make sure that all of the values for a given key end up on the same worker. So they can not only exchange progress tracking information, but they can also exchange data with each other.
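The key-routing idea Dan mentions, making sure all values for a given key land on the same worker, can be sketched roughly like this. The `exchange` function and hash-mod routing here are an illustration of the concept, not the actual mechanism Timely uses:

```python
# Illustrative sketch of an "exchange": route each record to the worker
# that owns its key, so every value for a key is processed in one place.

def exchange(records, n_workers):
    routed = {w: [] for w in range(n_workers)}
    for key, value in records:
        worker = hash(key) % n_workers   # consistent key -> worker mapping
        routed[worker].append((key, value))
    return routed

routed = exchange([("abc", 1), ("abc", 2), ("xyz", 3)], n_workers=4)

# Both "abc" records end up on the same worker, whichever one that is:
abc_workers = {w for w, recs in routed.items() for k, _ in recs if k == "abc"}
assert len(abc_workers) == 1
```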
00:22:53
Speaker
Okay, but are you saying like they've all got a copy of the data processing graph so I as a node can say, oh, it's key ABC, I best send that to machine one, two, three. That's right. Yeah, right. You can tell operators to participate in different, I forget what they call them.
00:23:14
Speaker
They have a name for them where you can say this is a pipeline operator, meaning as soon as you're finished with this, I want you to give it to the next operation and you can exchange it. You can have an exchange operation where you say at this point in the data flow graph I want to exchange the keys to the workers where they belong and that's another like.
00:23:35
Speaker
a powerful tool in the toolkit that you'll find later when you need it. Like, you realize: I actually want to broadcast this information to all workers, or some of the other things that you might want to do. Right. Before we get further into the usage of this, are there any other main tools in that toolkit? Yeah, let me think if there are patterns.
00:24:03
Speaker
I think the last thing that I probably didn't talk about was being able to attach probes to different parts of the data flow. Probes can tell you when things have reached different parts of the data flow. So you can say, I want to be able to know if
00:24:24
Speaker
information has made it all the way to this part of the dataflow graph, and then potentially take other actions. So as an example, we use this in Bytewax in order to do garbage collection: after a certain period of time, we want to clean up some of the records that we're keeping internally, snapshots of older state.
00:24:46
Speaker
And so you can use a probe to say, I just want to make sure that the data flow and data in the data flow has reached this particular stage before I take actions in other places. So you can attach probes to various parts of your data flow and use that information to take an action elsewhere. Is that something you do before the graph starts running?
00:25:09
Speaker
Or can you do it at processing time? No, it's something that you have to do up front. Dataflow construction happens before runtime. And I think a lot of that has to do with the fact that you're describing a dataflow as the operations that you want to occur and the connections between them. So, for example, a classic would be joins or branches, where you say:
00:25:36
Speaker
If the predicate that I give for this particular operator is true, I want the results of that particular operation to go to a different operator. So you describe them all as connections in between each other. And then probes are an important part of that data flow construction as well. So you would describe them when you're describing your overall data flow.
00:25:57
Speaker
OK, so to make that concrete, I might be saying, I've got three different kinds of message coming through. I want to fan them out to three different kinds of processing node. And once they've all been processed, I want to probe to make sure that they've all been processed before I throw away some kind of intermediate state in my branching algorithm. Yeah, I think that's a good example. OK, we'll work with that.
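A rough sketch of the probe idea, using it to gate garbage collection the way Dan describes Bytewax doing. The `Probe` class and its method names are invented for illustration; they are not the Timely probe API:

```python
# Illustrative sketch: a probe reports how far data has progressed at one
# point in the dataflow, and other actions (like GC) wait on that signal.

class Probe:
    def __init__(self):
        self.frontier = 0            # highest epoch fully seen at this point

    def observe(self, epoch):
        self.frontier = max(self.frontier, epoch)

    def less_than(self, epoch):
        """True if data stamped before `epoch` might still be in flight."""
        return self.frontier < epoch

# Internal snapshots of older state, kept for recovery:
snapshots = {1: "state@1", 2: "state@2", 3: "state@3"}

probe = Probe()
probe.observe(2)                     # epoch 2 has reached the probed stage

# Only discard a snapshot once the probe proves it is no longer needed:
if not probe.less_than(2):
    snapshots.pop(1)

assert 1 not in snapshots and 2 in snapshots
```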
00:26:25
Speaker
When you mentioned MapReduce, I was reminded of something I read years ago about Google optimizing a MapReduce job at the kind of scale that Google do it. And they found that even though they thought all their machines in the cluster were just as performant and basically identical, they would always have like 3% of their nodes, which inexplicably took too long to process a particular sub-task. Yeah.
00:26:55
Speaker
I'm wondering if you've got the monitoring and kill-it-off-and-start-again-from-the-start-of-that-epoch capability. It's something that we added to Bytewax specifically. Like we were talking about, Timely is an amazing low-level library, and the interesting thing that we did with Bytewax is
00:27:21
Speaker
We had a similar goal. We wanted to provide a generalized framework for construction of data flow graphs in Python. And so it's interesting for us because we actually don't use progress tracking in the same way that you would.
00:27:37
Speaker
if you were writing a data flow using timely data flow. So when I gave you the example of using progress tracking in terms of assigning logical timestamps and correlating them to output, we take a little bit of a different approach and we use progress tracking for internal pieces of the data flow that we manage for you. And the biggest one that we spent a lot of time working on
00:28:01
Speaker
was recovery. So being able to crash the dataflow, start again, and resume from the same point. And there are some large pieces of that that we had to take some time to create. But
00:28:16
Speaker
the two major ones are being able to preserve state inside of a dataflow, and being able to resume at a consistent point. And so the broad strokes of all of those features together are that we can guarantee in Bytewax that we're going to process data at least once. As long as your input supports it, we can tell you that
00:28:38
Speaker
will restore the internal state of the data flow to where it was when we crashed or to a consistent point. And then we can start consuming input data from some point in the past. And then we can get back to a point at which we've processed all the messages at least once.
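The recovery scheme just described, snapshot state together with input position at an epoch boundary, then roll back and replay on crash, can be sketched like this. Everything here (class, method names, the single counter standing in for operator state) is invented for illustration; it is not Bytewax's actual recovery machinery:

```python
# Illustrative sketch of epoch-based recovery with at-least-once semantics.

import copy

class Recoverable:
    def __init__(self):
        self.state = {"count": 0}    # stand-in for stateful operator state
        self.offset = 0              # stand-in for a Kafka consumer offset
        self.snapshot = None

    def process(self, record):
        self.state["count"] += 1
        self.offset += 1

    def checkpoint(self):
        # State and input position are saved together, consistently.
        self.snapshot = (copy.deepcopy(self.state), self.offset)

    def recover(self):
        state, offset = self.snapshot
        self.state, self.offset = copy.deepcopy(state), offset

flow = Recoverable()
for r in ["a", "b"]:
    flow.process(r)
flow.checkpoint()                    # epoch boundary: snapshot everything
flow.process("c")                    # work done after the snapshot...
flow.recover()                       # ...crash: roll back to the boundary

assert flow.state == {"count": 2} and flow.offset == 2
# Resuming input from offset 2 re-processes "c": delivered at least once,
# and no stateful operation sees it applied twice against stale state.
```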
00:28:56
Speaker
Is this logically you're saying: OK, we closed out epoch three successfully, so let's record the state of the graph at epoch three, and then we can always just restart from there? Yeah, that was a very good summary, actually. It's surprising. Did you work on this yourself? I tend to summarize a lot of things in my role as a podcast host. It's really good.
00:29:24
Speaker
Yeah, so we start epochs in Bytewax just based on wall-clock time. And so we say we're starting an epoch now. And at the end of each of those epochs, we can use that coordination point across multiple workers to record the internal state of all the stateful operations that we have going, and the state of the inputs. The simplest example would be: which offsets have we consumed from Kafka up to that point in time?
00:29:51
Speaker
If you take the snapshot of that whole thing and serialize it, and if you crash, you may have processed some data after that point, and you may have seen some new records from Kafka, but what you really want is to restore the entire state of the data flow at that point in time to a consistent state, meaning,
00:30:10
Speaker
in some stateful operations, you wouldn't want to apply the same message twice, because then you'll get incorrect answers. So being able to serialize the state of things in a coordinated way across multiple workers, and then replay data that you had seen in Kafka but that hadn't reached the next point where you took a snapshot, is how we do coordination. But that's a little different than the initial...
00:30:36
Speaker
when you read about Timely Dataflow and start using it for the first time, coordinating inputs and outputs using epochs is a really useful thing. For a very specific purpose, it can be very beneficial to decide that, for your particular problem domain, that's how you want to coordinate things. But for us, building a generalized toolkit for people to do dataflow construction, it's hard to
00:31:05
Speaker
be able to make that level of assumptions about people's inputs and outputs. And we struggled with that a little bit at the very beginning. We tried very carefully to mimic the Timely Dataflow API for our purposes. And then at some point, my colleague David said: I think that we should actually not do that; I think we should model this in a different way. And he came up with the idea that epochs could just be wall-clock time, and we could use that as a different type of coordination mechanism.
00:31:35
Speaker
Okay, is that the point at which it becomes more of a low-level library? You're using Timely as a low-level library to build what you want, rather than Bytewax just being a wrapper around Timely. Yeah, Timely is really amazing in that way.
00:31:52
Speaker
Timely is the underlying substrate of another library that the folks at Materialize, Frank McSherry, and others have been working on, called Differential Dataflow. It's not something I'm super familiar with, but the way I understand it, it's the construction of another system built on top of the Timely primitives as well.
00:32:15
Speaker
I think that's what makes Timely so amazing is that it can provide facilities for lots of different problem domains and it's very interesting in that way. Okay, well we're going to carry on with the idea of how you've been.
00:32:29
Speaker
using it as your little toolkit. But before we move on, I have to check something. You mentioned wall clock time, and we're talking about distributed systems. Isn't that dangerous? In the particular case where you have multiple workers that are using wall clock time, you could imagine that you would want some kind of coordination between those two things. But I think what matters is that
00:32:56
Speaker
the event where you say "I'm not going to produce any more input at this timestamp" is the more key event. That doesn't necessarily have to be coordinated across all of the workers: when each worker reaches the end of an epoch, it takes certain actions and snapshots its part of the dataflow. That doesn't have to be tightly coordinated between workers. So it's a great question.
00:33:18
Speaker
I'm not a hundred percent sure I'm understanding. Are you saying that you choose an event which marks the end of the Bytewax epoch, and then you look at that event's timestamp? Is that roughly the model? Oh, sorry, no. We're just using the system clock time. But what happens if you've got two machines with differing system clocks?
00:33:49
Speaker
Well, you're only really using it as a marker of when to advance to the next epoch, not the specific time itself. You're using ten wall-clock system seconds in order to say, all right, this is the point where I'm not going to be reading from, or emitting records from, let's say, Kafka at that same timestamp anymore, and you move on to the next one.
00:34:17
Speaker
For that particular worker, it's a coordinated point in time across that worker's operations. You could say, this is happening on a worker level, not on a graph level. Correct. Yes. Right. Now I'm with you. OK. Yeah, it's a great question. I thought we were accidentally miscoordinating multiple workers. No, I think it's a good insight. Yeah, it's something to be very careful about.
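To make those per-worker mechanics concrete, here is a minimal sketch in plain Python, with invented names — this is not Bytewax or Timely internals — of a worker advancing its own epoch counter from its local monotonic clock, with no cross-worker clock coordination:

```python
import time

# Illustrative sketch (not Bytewax internals): each worker advances its
# own epoch counter whenever epoch_length wall-clock seconds have passed.
# The epoch boundary only marks "this worker will emit no more input for
# epoch N" — it needs no coordination with other workers' clocks.
class EpochClock:
    def __init__(self, epoch_length):
        self.epoch_length = epoch_length
        self.started = time.monotonic()

    def current_epoch(self):
        # Derive the epoch from locally elapsed wall-clock time.
        elapsed = time.monotonic() - self.started
        return int(elapsed // self.epoch_length)

clock = EpochClock(epoch_length=0.01)  # 10 ms epochs, to keep the demo fast
first = clock.current_epoch()          # 0: no time has passed yet
time.sleep(0.03)
later = clock.current_epoch()          # has advanced by a few epochs
```

Two workers running this code with skewed system clocks would still each produce a consistent epoch sequence of their own, which is the point being made above.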
00:34:43
Speaker
OK, so what else did you choose? Is there anything else where you said, OK, timely is a good low level library, but that we want to expose something different to user space? Let me think.
00:35:00
Speaker
Timely has some facilities that we are not using that I think would be interesting to people who want to use it directly as a library. So one of the things we haven't done: Timely supports the concept of iteration, which is, I think, pretty unique in dataflow programming.
00:35:19
Speaker
So if you have this kind of epochal progression model in there, what happens if you want to be able to express a computation that needs cycles? In your dataflow programming model, you have a sort of
00:35:38
Speaker
directed graph of operations. What if you wanted to introduce your kind of for loop inside of that computation? Timely gives you the ability to do that using compound timestamps, which is not something that we actually use in Bytewax, but it's really interesting for lots of different applications. And the example that I think they give most commonly is graph computation. Oh, like if I'm a social network trying to do
00:36:09
Speaker
Yeah. Okay. So find me all the ancestors of this, and then the sub-ancestors, and that's probably going to end up looping around and needing to read its own output. Yeah. Yeah. Okay. So you're not using that because you don't see any need for it, or because it's too complicated to expose to the user? So far, I think we just haven't. We've been pursuing
00:36:38
Speaker
a lot of the things that matter for an operational dataflow. When we first started, the primitives that Timely gives you meant that we were responsible for building some of the systems on top of that. So we talked earlier about being able to do stateful operations where you're maintaining some state as you're doing an operation, the ability to do recovery,
00:37:05
Speaker
the ability to do branching and joining was something that we had to come up with an API for. And then the concept of windowing was something that we had to spend quite a bit of time getting right. So if you're a user of other dataflow systems like Flink,
00:37:21
Speaker
There's a feature set that I think you're primarily interested in using, and we were tackling a lot of those first. So maybe at some point in time we'll have the ability and the time to go back and work on iteration, but it hasn't been something that we've had people ask us for specifically. It's just kind of cool that it's a capability that's there that we could potentially take advantage of in the future.
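The graph computation mentioned above can be sketched as a plain-Python fixpoint loop — this only shows the *shape* of the work that Timely's compound timestamps let you express inside a dataflow cycle; the edge data and function name are made up for illustration:

```python
# Reachability over a (made-up) graph: each pass feeds its own output
# back in until nothing new is found — the kind of loop that iteration
# in a dataflow graph makes expressible.
edges = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}

def reachable(start, edges):
    found = {start}
    frontier = {start}
    while frontier:  # iterate to a fixed point: no new nodes discovered
        frontier = {n for node in frontier for n in edges[node]} - found
        found |= frontier
    return found

reachable("a", edges)  # {"a", "b", "c", "d"}
```

In a dataflow setting, each pass of that `while` loop would be one trip around the cycle, with the compound timestamp tracking the iteration count.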
00:37:43
Speaker
So one day you may find the killer use case for it, but you've got plenty of work right now. Yeah, I'm sure somebody has a great use case for it. But yeah, we're working on a lot of really interesting things, just sort of like these are the sort of table stakes if you want to provide somebody with a very useful data flow programming environment.
00:38:08
Speaker
Since you've mentioned Flink, and it's always difficult when you're talking about competing ideas, you don't want to tread on people's toes, but since you bring it up, one of the things about Flink for a Python user is their Python library is wrapping Java.
00:38:25
Speaker
and your Python library is wrapping Rust. Do you think, and then there are Python libraries which don't wrap anything and are native, do you think, what's a user's experience of you wrapping Rust going to feel like?
00:38:41
Speaker
We have worked really hard to expose a lot of the primitives that we have as a Python API and have implemented a lot of the functionality that we have in Python directly. So we started the project originally leaning pretty heavily on all of the Rust code and writing a lot of the stuff that we were doing in Rust.
00:39:01
Speaker
and just essentially calling Python at certain points in the data flow when your operator calls the logic for this particular operator and that returns a result and then Rust carries on doing the major parts. But I think my colleague David and the discussions that we've had, we've learned that
00:39:20
Speaker
Moving a lot of that API into Python is really helpful for people that want to construct their own operators. And so to answer your question, we would like the experience of using this to be fine for anybody who never wants to learn any Rust.
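As a caricature of that developer experience — a pipeline declared as ordinary Python and run as a regular program — here is a toy, stdlib-only sketch; the `Flow` class and its operators are invented for illustration, not Bytewax's actual API:

```python
# Toy stand-in for a dataflow API: steps are declared up front, then the
# whole flow is run like any regular Python program.
class Flow:
    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, fn):
        self.steps.append(("filter", fn))
        return self

    def run(self, items):
        out = []
        for item in items:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    item = fn(item)
                elif kind == "filter" and not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

flow = Flow().map(lambda x: x * 2).filter(lambda x: x > 4)
result = flow.run([1, 2, 3, 4])  # [6, 8]
```

The real library swaps the naive `run` loop for Timely's Rust runtime underneath, but the declaration style stays plain Python.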
00:39:37
Speaker
In the Rust layer are some really important pieces: Timely also provides the communication fabric between multiple workers. So it's both a library and a runtime for being able to orchestrate dataflows. Some of those pieces will stay in Rust, but a lot of the pieces that we're working on
00:39:56
Speaker
will end up having a Python API. And so for most people's experience, it should be the same as using any Python package. It's something that you can use by doing pip install. You write your dataflow in Python. You run it as though it were just a regular Python program. And so most of the Rust stuff should be pretty invisible. OK, so again, I want to be careful about mentioning Flink, but the point at which you will find out that Flink
00:40:26
Speaker
Python isn't quite Python is when you get an exception back and it's a Java stack trace. What's the point at which I'm going to find out that Bytewax isn't just Python as a user? You can definitely see some Rust in some backtraces if you see the dataflow crash.
00:40:48
Speaker
We did take a pass at that. One of my colleagues has worked really hard on error handling and making it look as Pythonic as possible. Reconstructing the traceback so that you see something that looks very Pythonic and, hopefully, points you to where in your code the problem lies was a tricky problem. But yeah.
00:41:13
Speaker
I think Flink is a great system. I have a lot of respect for all of the work that's gone into Flink and I've met some of the developers and they're great.

Pythonic Dataflow Environment

00:41:23
Speaker
I asked them, do you have any advice for people that are writing their own data flow programming environment? They said, yeah, good luck. It's a hard problem. Yeah, it is. They do it very well.
00:41:37
Speaker
But I think that Python users deserve something that is very Pythonic. I think Python users prefer not to have to learn everything that's in the Java ecosystem in order to be able to do the same type of programming. I think it's great when people have access to, I need to express my problem using these tools, and I want something that feels very native for me to be able to do that. And I think that that's something that we do.
00:42:02
Speaker
I think I have lots of friends in the Python ecosystem who have always thought of themselves as the neglected child in the ecosystem because everyone was writing Java and they're like, this is great and it works in Java. They had to keep raising their hands and say, hey, it's not working for me. I'm using this from Python and it's been hard for them to get enough attention. I'm hoping they see it now.
00:42:28
Speaker
Yeah, I'm also hoping someone's going to do something similar for the TypeScript world. I could have a lot of fun with that. Yeah.
00:42:38
Speaker
Okay, so if I'm writing, let's say, give me more of a sense of the boundaries of the system. Because if I'm writing something that processes my user's transactions in a more interesting way, I'm writing that in Python code. What's happening under the hood? Is that being passed to a Rust process which has a pointer back into the Python code so it can call it?
00:43:08
Speaker
Yeah, I think the beauty of PyO3 is that marrying those interfaces together was pretty straightforward, being able to call Python code that way. But I think the interesting part is that you're describing your dataflow in Python, and what that needs to do is translate into a series of operators that are happening essentially on the Rust side. So those are Timely operators under the hood.
00:43:37
Speaker
But what we did was boil those down to a core set of operators from which we can construct the rest of the operators that we wanted. So there's a very minimal footprint in Rust for the shapes under the hood. My colleague David did a very good job of reducing those down to just a very small set of core primitives.
00:43:59
Speaker
So then the orchestration is essentially you're starting multiple workers inside of a Python process and that is sort of animating the machinery of Rust to do the communication between workers when you need to exchange data.
00:44:14
Speaker
you know, stopping and starting a dataflow. And then at each point in time where you're processing data, the operators under the hood are calling out to the user code that you provide as part of your dataflow. You have a map step that provides a function that needs to run every time it sees new data. A batch of new data ends up in that operator, and then we call that function and move the results along the dataflow.
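A rough sketch of that boundary, with invented names: the engine (Rust, in Bytewax's case) drives the loop over batches and calls back into the user's Python function for each element, then passes the results along to the next operator:

```python
# Illustrative only: the "engine" iterates over batches and hands each
# element to user code, then yields the transformed batch downstream.
def map_operator(user_fn, batches):
    for batch in batches:
        yield [user_fn(item) for item in batch]

batches = [[1, 2], [3]]
out = list(map_operator(lambda x: x + 10, batches))  # [[11, 12], [13]]
```

In the real system the outer loop lives in Rust and `user_fn` is a Python callable invoked through PyO3, but the data's path is the same shape.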
00:44:44
Speaker
When you say we're calling it, are we passing it to the Python function in process, or are we spawning a separate Python node which we give jobs to? It's all within the same process, so Rust is calling into the existing Python process with data that it receives. So there's a little bit of translation sometimes when you need to serialize data and ship it to other workers, and that means
00:45:09
Speaker
you know, we have to serialize your Python object, we have to exchange it to another worker, deserialize it, but then when it ends up back in Python, you wouldn't know that that stuff actually happens. Okay, but a lot of time, it's just another C pointer to here's my function.
00:45:25
Speaker
Yeah, I think that's one of the advantages of the Timely system: for a lot of use cases, you don't need to serialize and deserialize in between those operators. The data is just passed within the same process from one operator to the next. Yeah. OK, great. That's definitely good for performance.
00:45:47
Speaker
You've hinted at it then, so we now have to go into distributing across multiple nodes and parallelization. So let's start with parallelization, because that seems like the easier one to tackle. Python, unless you jump through certain hoops, Python is single-threaded, right? So how are you parallelizing within one machine? Are you leaning on Rust for that?
00:46:18
Speaker
No, not directly. Timely is also not. It uses what they describe as cooperative multitasking. So functions need to yield in order to be able for other functions to run. And effectively, a worker can be thought of as basically single-threaded.
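Cooperative multitasking of that kind can be sketched with plain Python generators — each operator does a small unit of work and yields, and a single-threaded scheduler round-robins between them. This is illustrative only, not Timely's actual scheduler:

```python
from collections import deque

def operator(name, n, log):
    for i in range(n):
        log.append((name, i))  # one small unit of work
        yield                  # hand control back to the scheduler

log = []
ready = deque([operator("a", 2, log), operator("b", 2, log)])
while ready:
    task = ready.popleft()
    try:
        next(task)          # run the task until it yields
        ready.append(task)  # re-queue it for another turn
    except StopIteration:
        pass                # task finished; drop it

# The two operators interleave on a single thread:
# [("a", 0), ("b", 0), ("a", 1), ("b", 1)]
```

The key property is that progress depends on every operator yielding promptly — which is why a long-running user function, or GIL contention, stalls the whole worker.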
00:46:41
Speaker
If you want to in Timely, you can start multiple threads, each of which, again, is a whole copy of the data flow and is running with inputs and outputs that are potentially independent of the others. For us, that can be difficult because, as you pointed out, in order for that to proceed, if we're running all in the same Python process and we spawn multiple threads,
00:47:05
Speaker
you do have contention where you need to take the GIL in order to be able to do things in Python. So I think that's what's so exciting about the potential of a no-GIL Python, sub-interpreters in Python, and all of the work that we were talking about earlier. They're coming, they're coming, right?
00:47:24
Speaker
Yeah, it's also really hard. The other typical solution to this has to do with async. You could think of coroutines or other things like that. Those can be really difficult, especially when you're thinking about, well,
00:47:39
Speaker
If I want to make sure that all of my output has happened, and I'm using an asynchronous function to do that, how do I tell my Dataflow system, OK, I've finished doing that? Because essentially, your asynchronous work is happening in the background, could potentially fail. There's not a direct point where you can say, have you completed doing that particular thing yet? So it's a difficult marriage between Dataflow systems who are very carefully orchestrating the progress of data through Dataflow and asynchronous systems.
00:48:09
Speaker
Generally, what we do for people that want to interoperate with asynchronous code is use some of Python's underlying async. Just run this to completion and tell me when you're finished kind of thing. It's interesting, also, Timely has not incorporated a lot of those async pieces into the core part of Timely for similar reasons. Do you think you will or is it just probably not a priority?
00:48:40
Speaker
I don't know, I think... Is it a design issue or is it a workload issue that means you're not doing that?
00:48:49
Speaker
I think that's a good point. It could be both. I mean, sometimes you have to ask yourself whether or not an asynchronous solution is more appropriate, and that depends on whether you're I/O-bound or CPU-bound. Sometimes it's hard as a generic library maker to make those decisions for people and to opt them into ecosystems where
00:49:13
Speaker
it may or may not be the greatest fit. And so, so far, we've just made it possible for people to marry async code into a very synchronous kind of workflow. But there was a new release of PyO3 recently where they're adding better support for marrying both Python's async runtime and Rust's async runtime. And so there are some very exciting pieces of work happening on that side to make those two ecosystems work better together.
00:49:42
Speaker
OK, so if it were a lot easier to do, you might find ways to get it into Bytewax's API that fit more naturally. Potentially. I think it was something we just hadn't focused on for a while. And yeah, I think we'll see. OK, so then the next thing is what happens when we're going across multiple machines. How does that play out in Timely and Bytewax?
00:50:10
Speaker
Yeah, so Timely and Bytewax both do the same thing. You can start a process on a second worker, and they will establish communication with each other. Essentially, you give each worker a list of all of the workers that it should communicate with. And you can start multiple worker threads in each one of those workers if you like. But you can start them on multiple machines, and they'll connect to each other. So that communication fabric comes directly from Timely itself.
00:50:37
Speaker
And do you have to deploy, I've written my code, now do I need to deploy that code to all the workers to get it up and running?
00:50:46
Speaker
Ah, that's a really good thing that we haven't talked about. I think one of the advantages is that there's no separate runtime for Bytewax, and similarly for Timely; the Python code that you write has everything that it needs in order to orchestrate those. So you don't have to take your dataflow code and submit it to
00:51:08
Speaker
a cluster that is going to run that for you, or submit it to a specialized process that's running somewhere else to run dataflows. The same way that you develop locally, invoking it with Python, is the same way that you deploy it remotely. Okay, so I just ship my Python code. It's not like I install some platform software on my cluster and then send it my recipe. I just ship my Python code to all the nodes and run it.
00:51:38
Speaker
Yeah, we do have a platform for Bytewax, which incorporates a lot of the patterns of managing recovery stores, adding features like a programmatic API to be able to deploy your code.

Bytewax on Kubernetes

00:51:53
Speaker
That's more of a layer over Kubernetes, so essentially a way for you to manage data flows and deploy them. It's based on a lot of work that
00:52:04
Speaker
anybody has done who has said, OK, I've written my dataflow code, I'm ready to go to production. I need a way to manage this. I need some monitoring. I need metrics. I need a way to sign into a UI and monitor what's going on. And so we have a whole separate product that we built in order to do the management of dataflows. OK, you've reminded me of another question I wanted to get to, which was, what if I send it out to my six-node cluster and one node goes down?
00:52:36
Speaker
Does it get rebalanced the other five nodes? Do we have to wait for the sixth to come back up? Do we restart the whole job? What goes on? Well, in that particular case, what we would do is crash and then restart from our last recovery checkpoint, which is essentially, depending on what you're doing,
00:52:56
Speaker
That's probably what you want. It's a tricky problem to be able to redistribute. OK, that sixth worker was handling a sixth of all of the key space of all of the users in our fictional bank processing platform. So it had state about the transaction data for those people. It had state about which new transactions it had seen. When it crashes, the other five workers would have to take on responsibility for that piece of state.
00:53:25
Speaker
So that was something else that we did in Bytewax. If you're not in the scenario where you're crashing and just restarting, like restarting pods in Kubernetes, we built primitives into Bytewax so that you can rescale. So you can stop the cluster,
00:53:42
Speaker
restart it with a different number of workers and proceed from that point in time. And the workers will take all of those pieces of state and say, OK, this worker is responsible for the state that that old worker used to have. And the rest of them might be primary for other parts of that. So we have a system built into recovery that allows you to do rescaling because obviously that's a pretty nice thing to be able to do. There's a lot more load. I want a lot more workers.
00:54:07
Speaker
Are you implying this is something the user asks of it, or is it dynamically rebalancing? No, this would be something that you would do in a stop-start. So let's imagine you had a flood of traffic for Thanksgiving, and you needed to spin up a bunch more workers to handle the increased load that you're having.
00:54:26
Speaker
you could stop the entire data flow and start it again with a different number of workers. But yeah, unfortunately, if one of those workers crashes, you'll have to restart from a known good point, because we don't have a way to dynamically redistribute all of that state into other workers.
00:54:44
Speaker
It's a pretty complicated problem, and it was the subject of some research that happened in Zurich. Part of the academic nature of where Timely came from means that there's some really interesting papers that were published and written about adding some of these systems to Timely itself. I think if you're a serious student about Timely,
00:55:06
Speaker
You end up reading everything that Frank has ever written. You read all of the papers that came out of ETH Zurich and all kinds of other things. You just sort of devour everything that you can find. Sounds like an interesting place to be reading. Perhaps we'll go more into the practical application as much as I like academia. Sure.
00:55:31
Speaker
So we've talked a lot about Bytewax, the library. Do you want to tell me a bit about the service you've built around it? Yeah, I think we were talking earlier about the confluence of things that existed when we created Bytewax originally. And one of the things we were thinking about when we created Bytewax was
00:55:52
Speaker
For operationalizing or deploying data flows, if you could do it again, what would you do differently? And I think the thing that we wanted to do differently was not require a separate orchestration layer, essentially, in order to deploy
00:56:10
Speaker
Spark or some of these other systems, you need another system in order to be able to deploy those on. I think removing that dependency was something that I wanted to do. We built a platform just using Kubernetes, essentially using Kubernetes as the ubiquitous backplane for doing deployments.
00:56:37
Speaker
We built a platform that integrates with Kubernetes that allows you to deploy and manage data flows. It's essentially what I would have had to have built when I was productionalizing anything if I was working at a company. OK, I've got my data flow. It runs great on my local workstation. I need a way to deploy this and manage it and monitor it when it's running in production. And so we built a platform to encapsulate all of those patterns together.
00:57:04
Speaker
a way to see which data flows are deployed, a way to monitor them, and a way to manage them that just leverages the backplane of Kubernetes under the hood.
00:57:15
Speaker
Okay. So I'm thinking about kicking the tires on this. I know I can install Bytewax locally with pip install bytewax. And then at some point, when I'm ready to productionalize it, if that's a real word, which I don't think it is. I probably used it as well. That's when I'd let you, um,
00:57:39
Speaker
Let you take the headache of Kubernetes for me. Yeah, I think there are several options depending on for me for places that I've worked in the past. It was always just sort of the default choice for places. They had a team that would manage Kubernetes for you and so deploying there made a lot of sense and.
00:57:58
Speaker
with the availability of those in major cloud providers. It's a great place to get started if you're running multiple dataflows. The other option is a tool that we created called waxctl that allows you to deploy directly onto an AWS instance or something running in Google Cloud.
00:58:20
Speaker
It's also just very possible because we don't really need anything else to just start a container that contains a single data flow. The fact that you don't need a second process to sort of monitor those things that each of the workers can connect to each other means that you can kind of pick and choose what's right for you.
00:58:38
Speaker
I think there's a lot of advantage in outsourcing some of the pieces that are necessary but aren't really adding value for your customers. So building a UI to manage dataflows is not necessarily the greatest use of your time.
00:58:54
Speaker
We also created a programmatic API for our platform offering, which would allow you to say, I wanted to deploy a Dataflow that does these things based on these particular conditions. And so having some of those pieces can be really nice for if you need the sort of programmatic access to being able to manage Dataflows. So those are the pieces that we put in the platform.
00:59:17
Speaker
Nice. For the record, I'm the kind of person that always wants programmatic access to things. Yeah, exactly. The button is great, that's awesome, but can I please have an API? Yeah. Because in the end, everything should be controlled directly from Emacs or Vim. That's my philosophy. Nice. Yeah. Nice. I love it.
00:59:41
Speaker
Well, I'll go and give that a try then, I think. I've always got some interesting data in Kafka on my machine, so I'll go and see how well it works in practice. Fantastic. Dan, thanks very much for joining me. Yeah, thanks so much for having me. This has been great.
00:59:56
Speaker
Thank you, Dan. And before we go, I think a brief celebration is in

Milestone Celebration

01:00:01
Speaker
order. This is Developer Voices 50th episode. And at some point between this one being published and next week's, it will be Developer Voices first birthday on the 10th of May. So happy birthday to us. And I just want to say thank you very much for listening, whether this is the first episode you've listened to or you're one of the something like
01:00:23
Speaker
19,000 subscribers over all the different platforms. If you're a regular or a first timer, thank you very much for joining me. And most of all, thank you to the guests who've joined me every week and let me pick their brains. I hope you enjoyed the experience. You know, from one point of view, being a guest on this podcast is the easiest thing in the world. You just have a conversation about your favorite technology. But I have to admit from another point of view,
01:00:52
Speaker
It's like a really stressful job interview where you absolutely have to know your system inside out because you have no idea what the questions are going to be. So to all 50 of you, I hope you mostly enjoyed the experience. I hugely enjoyed listening and learning from you. Thank you.
01:01:10
Speaker
So until next week and the beginning of the next batch of 50 episodes, you will find links to everything discussed in the show notes. If you've enjoyed this episode, please leave a like or share it with a friend, and make sure you're subscribed for episode 51 and beyond. Until then, I've been your host, Kris Jenkins. This has been Developer Voices with Dan Herrera. Thanks for listening.