
If Kafka has a UX problem, does UNIX have the answer? (with Luca Pette)

Developer Voices
1k Plays · 1 year ago

One of the recurring themes in the big data & data streaming worlds at the moment is developer experience. It seems like every major tool is trying to answer this question: how do we make large-scale data processing feel trivial?

In some places the answer is any library you like as long as it’s Python. In other realms, a mixture of Java and SQL shows promise. But as this week’s guest—Luca Pette—would say, the Unix design metaphor has plenty to give and keep on giving.

So in this episode of Developer Voices we look at TypeStream - his Kotlin project that provides a shell-like interface to data pipelines, and is gradually expanding to make integration pipelines as simple as `cat /dev/kafka | tee /dev/postgres`.

--

Luca on Twitter: https://twitter.com/lucapette

Luca on LinkedIn: https://www.linkedin.com/in/lucapette/

Kris on Twitter: https://twitter.com/krisajenkins
Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

TypeStream homepage: https://www.typestream.io/

TypeStream installation guide: https://docs.typestream.io/tutorial/installation

Crafting interpreters: https://craftinginterpreters.com/

…by Bob Nystrom: https://twitter.com/munificentbob

NuShell: https://github.com/nushell/nushell

#podcast #apachekafka #bigdata

Transcript

Kafka's Approach to Data Handling

00:00:00
Speaker
On Developer Voices this week, we're back in the world of data and data streaming, and specifically Kafka. But this time we're going to look at it right from the very top of the stack. Because the thing about Kafka is it's kind of a ground-up rethinking of the way you handle data at serious volume, when you might have a billion messages an hour coming through across a cluster of machines. That needs some new thinking compared to what we had a decade ago.
00:00:29
Speaker
And I think Kafka has been really successful because it solves that core scale problem with a very simple core idea, the log file. And it builds up quite nicely on that. The challenge that Kafka has really faced is that when you rethink things from the ground up, you can't just rebuild the foundations.

User Experience Challenges in Kafka

00:00:49
Speaker
You have to build the whole tower all the way up to the top into user space. You need to build infrastructure. You need to build tooling.
00:00:57
Speaker
And I think it's fair to say that in the 10 years that Kafka's been around, it's done a lot of good things for data at scale. But the user experience, the developer experience, is still not a solved problem. Maybe it is. Maybe it is if you and everyone on your team speaks and loves Java. But for everyone else, most of the answers you get are some new dialect of SQL. That's what's going to make this easy. Treat it like SQL.
00:01:25
Speaker
Good answer. But there are other answers. The design space is still being explored.

Introducing Typestream: A Unix-inspired Approach

00:01:30
Speaker
And my guest today, Luca Pette, has been building a new data processing tool called Typestream, which takes Unix as its inspiration instead of SQL. Because when you think about it, Unix has been building pipes of data since the 70s. So why not mine it for some ideas?
00:01:49
Speaker
So that's the topic for today's podcast. An old design idea, Unix, transplanted into a new domain, real-time data streaming. Let's see what we can learn. I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is Luca Pette.
00:02:22
Speaker
Nice, nice, good, good. And you? Yeah, I'm good. I'm glad to see

Conference Attendance and Networking

00:02:26
Speaker
you. We've yet to cross paths in the real world, even though we're, you know, working with similar circles in the data space, right? It's true. I think, you know, the main reason might be that I don't go to conferences, especially in the U.S., but in general, I'm not really like a conference person. So I guess that may be the reason why we never met
00:02:47
Speaker
Maybe next year, if we both end up at Kafka Summit, it would be nice. I think there's a good chance we'll overlap sooner or later, because you're only in Berlin, so you're not far away from me, right? You shouldn't be. But even if you're not a conference person, you're stuck now. You've got to give me a conference talk in conversational form for the next hour or so. The reason I wanted to talk to you is you've been
00:03:13
Speaker
I'm going to frame it like this. So there's a lot of people working in the big-scale data processing world, trying to make this more usable. That's a big problem. There are people, we had Neil Buesing on recently, saying how Kafka Streams is his favorite library for data processing. And it's a great library, but you couldn't call it user friendly. Yeah, I think it'd be kind of a stretch. I do agree with him it's the best, at least,
00:03:44
Speaker
also my favorite Java library out there. It's, I think, a pretty incredible piece of technology. But user-friendly would be kind of a stretch. Just for starters, you would have to be a Java developer. That's already a stretch to call it user-friendly, because you have to know a language. And I am a good example of that. I literally went back to the language just because of Kafka Streams.
00:04:11
Speaker
So I forced myself back into the space trying to learn the language I hadn't seen in 10 years. So no, it's not. And I think it's, I would say, I would argue generally true of the streaming tooling out

Usability Issues in Data Streaming Tools

00:04:28
Speaker
there. And there is nothing wrong with it. I don't want that to come across as a critique saying,
00:04:36
Speaker
all the tools out there are not user friendly. I think it's more of a story. If you look at it historically, if you put it in the right perspective, then it's actually just normal that the tooling is not user friendly yet, because, well, it's kind of early days. It may not feel like the early days of data streaming. I don't know how you first came across Kafka, but
00:05:00
Speaker
In my case, the first project must have been 0.8, 0.9. So we're talking a long time ago. Yeah. It's a very long time ago. I think it was so early that people in the team I was working with were a little worried that, you know, Kafka wouldn't be so stable. That's how early it was, because no one would ever say that now. And to be fair, it also made sense, because Kafka had just changed
00:05:29
Speaker
its protocols significantly. And we'd been talking to people saying, oh, they just changed all the protocols, we're stuck on 0.7. So people were worried. It sounds like a million years ago, but I think it's slightly more than 10 years ago. So it's just a decade since we have seen streaming come to...
00:05:51
Speaker
Would you call it mainstream? It's also somewhat of a strange word play. I think we're on the cusp of being a mainstream idea, but we're still not there yet. And I think one of the things.
00:06:05
Speaker
One of the things stopping us getting there, I think in order to get any kind of traction in this, we had to go right down to the nuts and bolts of how do we store data. And we're climbing up that ladder from disk all the way up to user space. And that's why usability is almost the last piece before we can go mainstream.
00:06:24
Speaker
Yeah, no, and it's kind of exactly my point with saying, you know, I don't mean it as a critique when I say the tooling out there is not user-friendly because, first of all, it is kind of just a fact that it's not user-friendly yet because
00:06:40
Speaker
I believe Kafka Streams might be the simplest tooling out there to do stream processing, and it's definitely not an easy technology to get started with, because of the language, because it doesn't click immediately in people's heads. I wrote a very long article trying to explain
00:07:03
Speaker
explain how Kafka Streams works to other people, but it was also a way for me to see, do I even get it myself? Because the beauty of this technology is so profound that it doesn't click in your head that fast. And yes, I agree with you that usability comes last. And in a way, if you think about it, I think the conversation is
00:07:26
Speaker
somewhat starting now, and I think most people are approaching it from a perspective that, if you know what I'm working on, it's obvious I don't agree with: we're trying to bend SQL to do things that SQL cannot really do. And again, one more time, I don't
00:07:45
Speaker
I'm not trying to be negative or trying to critique the solutions that are using SQL because I think they make sense. I think it's obvious that you say, OK, how do we make stream processing more user friendly? And then, I don't know, Confluent comes up with ksqlDB. I understand why they did it. And if I was where they were when they started ksqlDB, I would probably suggest SQL myself. I think it makes sense because it, you know,
00:08:14
Speaker
Everyone knows SQL. That's kind of the argument, right? Everyone knows SQL, and that's how you get people to do stream processing.

SQL vs Streaming Data: A Conflict of Models

00:08:23
Speaker
But then you hit limitations pretty soon, actually. And then you end up with the dialect of SQL that people don't know anything about. Like, I've done SQL for basically all my career. And then every time I have to do something with ksqlDB, I just have to learn the syntax from scratch again, because
00:08:40
Speaker
Because in the end, there is nothing wrong with it. It's just that it always makes me feel that...
00:08:47
Speaker
you know, the metaphor doesn't actually work because SQL has this declarative approach to asking questions that is like, you know, this is what I want. And you figure out how to give it to me. And there's some sort of like, maybe it's not even really there, but I get this, you know, there is this inner requirement that says like, it's more like a request response thing where, you know, I give you, I give you some SQL, you're giving me some data back.
00:09:14
Speaker
which by definition doesn't actually work with streaming because streaming is unbounded data. I give you something and you give me something back. There is already something breaking down in this metaphor.
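[Editor's note: a concrete instance of the dialect drift being described here: in ksqlDB, a continuous query over a stream needs non-standard syntax such as `SELECT * FROM pageviews EMIT CHANGES;`. The `EMIT CHANGES` clause exists precisely because a plain `SELECT` is request/response, while the stream underneath is unbounded.]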
00:09:30
Speaker
The usability thing, we are scrabbling around the design space looking for answers. And this is great. This is an exciting time. I like it when we're thinking about new ways to design software. And yeah, SQL is a very, it's a natural fit for how can we get data and what people are used to and usability. But that's a double-edged sword. What people are used to can also be a limitation of if you've got a very different underlying model, the abstraction can break down, as you say.
00:09:57
Speaker
Yeah, which is, I think, what we see, even though people have gone really far, right? If you look at how powerful Flink is, they have gone really far with it. But the problem is, and it's kind of funny, because while on one side it did make it more approachable, if you try to run this technology on your machine, then, you know, you're still
00:10:25
Speaker
miles away from the average experience you get if you interact with Postgres on your machine. I think it's such a simple and trivial way of looking at the problem, but that's what literally developer experience means. If I'm experiencing getting data out of Postgres,
00:10:47
Speaker
I just have to remember how select works. I don't even know where PostgreSQL stores its data. I never even looked at it. I mean, it's not true. I'm just making this up for the sake of the conversation. The point is more that you don't actually need to know where the data is. You don't need to know its encoding. You just have some interface that gives you the data back.
00:11:09
Speaker
And because the basic abstraction is so simple, then we kept building on top of it. And then you end up with, I don't even remember what it's called, I think it's called the DataGrip plugin inside IntelliJ, or these beautiful interfaces that allow you to talk to SQL. Even the interface is nice. It even looks nice when you interact with it, which is like a second stage of usability.
00:11:36
Speaker
This might actually be the one thing I've always struggled with in streaming since 2013. Even the simplest tasks are not actually that easy. I worked in a variety of spaces with streaming. The only thing they had in common was literally Kafka. Different industries, different reasons to use Kafka. Sometimes it's just moving data around. Sometimes it's actually processing data.
00:12:05
Speaker
But all these projects had in common that it was really hard to do even the simplest thing. Give me 100 records from this topic, because I actually need to look at the data. I have a bug that I don't know how to fix, and I need some real data because I don't understand the problem. Imagine how easy it is to do that with,
00:12:28
Speaker
I don't know, Redshift, or a remote Postgres database. You just connect to it with a read-only user, get 100 rows. You even have syntax to get it directly as CSV and just move on. And in every single place I worked, there were always two or three
00:12:48
Speaker
hops just to get to basic answers. Now there are some UIs out there that make it a little easier, but the nature of the problem, and the fact that we kept building on something that I don't believe works, that is, SQL, means we ended up with solutions that more or less solve one problem really well, but don't give you this
00:13:12
Speaker
full developer experience that is even remotely as good as what you have with relational databases. Which, to be clear, is somewhat obvious, because we started working on usability for relational databases in, I don't know, '84, '85? I don't know exactly when.
00:13:33
Speaker
Yeah, maybe even the 70s; we could be pushing on to the 50th anniversary, right? Yes. In the late 70s, I wasn't born yet, and our industry was already working on how to make this more usable. And it was always the same trivial
00:13:54
Speaker
example. But to be honest, it happens every day. I've mentored a lot of people into the streaming space, because I'm really into it; it fits my mental model of how data flows through systems, like glue. So I applied it in a lot of companies, and I found myself in a position where I had to mentor a lot of people. And I could see, for example, this completely obvious burden that when you're working with, say,
00:14:24
Speaker
a Kafka cluster and you have a bunch of topics and you have to get data out of it. Well, you have to know the encoding upfront, which I know is obvious, right? I know the technical answer: of course you have to, because you have to serialize it over the network. That's not what I'm arguing. I get that's obvious. But that would be the same as saying that when you extract data from a Postgres database or a MySQL one, you would have to know how the B-tree
00:14:51
Speaker
tables look on disk. I think that would be a really hard sell. Yeah, it would be a hard sell. Yeah, it would be very hard to sell. So I understand that some of the complexity is not going anywhere, at least long term. I can't imagine completely being able to hide the concept of partitions, even though I have my opinions about that as well. I think that's a little harder.
00:15:18
Speaker
But some of those things, like, for example, knowing upfront that you have an Avro topic or a JSON one or a Protobuf one, and having to have the schema upfront before you can even look into one record of this topic. Yeah, I think it makes it significantly less usable, because of how much work you have to do to even know what you're working with.
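[Editor's note: to make this concrete, here is a minimal sketch, in Kotlin, of what "just show me some records from a topic" costs with the plain consumer API. The broker address, topic name, and string encoding are illustrative assumptions; with Avro or Protobuf you would also need the schema and a different deserializer before seeing a single record.]

```kotlin
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import java.time.Duration
import java.util.Properties

fun main() {
    // Even just to peek at data you must configure connection, group,
    // offsets, and, crucially, the encoding, all upfront.
    val props = Properties().apply {
        put("bootstrap.servers", "localhost:9092")  // assumed broker
        put("group.id", "peek-at-some-records")
        put("auto.offset.reset", "earliest")
        put("key.deserializer", StringDeserializer::class.java.name)
        put("value.deserializer", StringDeserializer::class.java.name)
    }
    KafkaConsumer<String, String>(props).use { consumer ->
        consumer.subscribe(listOf("orders"))        // assumed topic
        var seen = 0
        // Poll until 100 records are printed. Note there is no LIMIT 100:
        // if the topic holds fewer records, this loops forever.
        while (seen < 100) {
            for (record in consumer.poll(Duration.ofSeconds(1))) {
                println(record.value())
                if (++seen == 100) break
            }
        }
    }
}
```

[Compare that with `SELECT * FROM orders LIMIT 100;` on the relational side, and the usability gap being described is hard to miss.]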
00:15:40
Speaker
And it's literally that part of it that people on every single project that uses Kafka struggle with at first. And then I've seen two different scenarios coming out of this.
00:16:00
Speaker
Either people get really good with Kafka Streams and become such experts that they don't remember anymore how hard it was to get there.

Mastering Kafka Streams or Alternative Uses

00:16:09
Speaker
So they say, you know, Kafka is easy. Like if we have to write a stream processor that gets the data, filters out some stuff, and then does aggregation. I mean, it's 15 lines of code, right?
00:16:18
Speaker
They've got all the recipes in their short-term memory. Exactly. It's cached in their head. They can picture the Kafka Streams DSL in their head. Although, a nitpick there: it always bothered me that we call it the DSL, because I think it's a fluent API and not technically a DSL. Different conversation. I have thoughts on what DSL actually means, and yeah, a separate conversation.
00:16:45
Speaker
But it would be a very long conversation. The point is that you either end up with people becoming experts, nothing wrong with that, actually pretty amazing. I have some friends that are now going around the world doing Kafka Streams in places where there was no Kafka Streams before; very happy about that. Or people just say, okay, Kafka is somewhat of a black box that
00:17:09
Speaker
can ingest data really fast, and I can get it out of it. But when I have to do something, I will first move the data somewhere else, then do my thing, then move it along. Which, to be clear, there are some really obvious use cases where that is the right solution. But I've seen a lot of people end up saying Kafka is just good at moving data around, and when I have to do something,
00:17:38
Speaker
I will do it somewhere else. Which I think is not really true. Even the most basic work, you could do in flight: filtering data out, basic aggregations. There's a lot of work that you could do in this realm, but the problem is that for it to be easy, you've got to be a Kafka Streams expert, and then you see the problem.
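[Editor's note: for readers who haven't seen it, the kind of "15 lines of code" processor mentioned above, a filter plus an aggregation, looks roughly like this in Kotlin with Kafka Streams. Topic names and string serdes are illustrative assumptions; real projects add error handling and custom serdes on top.]

```kotlin
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.StreamsConfig
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.kstream.Produced
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put(StreamsConfig.APPLICATION_ID_CONFIG, "berlin-page-view-counter")
        put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    }
    val builder = StreamsBuilder()
    builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
        .filter { _, value -> "Berlin" in value }  // the grep step
        .groupByKey()
        .count()                                   // the wc step
        .toStream()
        .to("berlin-counts", Produced.with(Serdes.String(), Serdes.Long()))
    KafkaStreams(builder.build(), props).start()
}
```

[Easy once the recipe is cached in your head; the bet discussed in this episode is that something like `cat page-views | grep Berlin | wc` should do the same job.]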
00:18:04
Speaker
There's a terrific amount of power that we're missing out on just because the user experience isn't quite there. And I do think you're right without criticizing the SQL approaches at all and I use them and I love them.
00:18:17
Speaker
And we've made great strides in that, but it's still not a solved problem and an answered question. There is still more to explore in the design space.

The Origin and Development of Typestream

00:18:26
Speaker
This is why I got you in, because I think you have an interesting and novel answer to how we could solve a new area we should be looking at for this design space. So I'll let you tell me about it.
00:18:37
Speaker
Yeah. Typestream. Yeah. So, I mean, I have to make a premise there, because sometimes you listen to people talk about their projects and they're a bit sensationalist about it. It's like, you know, this is the best new thing. So I appreciate you saying it's novel, and I do get why you say it's novel. But that's actually the thing I struggled the most
00:19:02
Speaker
with when I talk about Typestream, because I think the ideas are not novel at all, actually. When I look at what Typestream does and how it solves the problem, I didn't invent any of this. I just looked at it from the perspective of data streaming. So what does Typestream look like, first of all?
00:19:25
Speaker
The initial story actually has nothing to do with streaming. I really wanted to get into Kotlin, I needed a real-world project, and I had nothing at hand. That's maybe two years ago. And I started exploring programming languages because it sounded very hard. It's like, you know, can I write a programming language? It sounds impossible.
00:19:50
Speaker
I didn't actually understand how programming languages work. Well, it turns out, shout out to @munificentbob on Twitter, the author of Crafting Interpreters. It's an incredible book. You know nothing about programming languages, you read this book, and when you're done with it, you can actually write your own programming language. There are, you know,
00:20:17
Speaker
things that the book doesn't go through, because the space is too big, but you can do most of a small programming language on your own. This is how I got to Typestream. I didn't come up with the idea then; I'd had the idea for Typestream some
00:20:37
Speaker
10 years ago, almost. In 2015, I think it was. At the time, I was the CTO at this food startup called Marlesport. We were discussing how to use Kafka to move order data from a webshop into different parts of our infrastructure.
00:20:57
Speaker
We needed shipping labels on one side; we needed manifests to build boxes on the other. And when you think about it, it's always the same data, it just looks slightly different. I need all the errors going to customer care; I need all the shipping labels to go to the shipping team. And it's
00:21:14
Speaker
always the same data, and you do these little pipes that, in my head (I can picture the meeting in my head), just looked like the basic Unix pipes we use all the time. I was like, you know, I can cat my orders, and then I just grep all the errors and send them to customer care. And when I had this idea, I actually couldn't put it into words, because, well, first of all, it was so early that I think Kafka Streams was not even there yet, or it was really early days.
00:21:42
Speaker
I actually don't really know when Kafka Streams came out, but it must have been around that time. And I also had no idea how to put it all into practice. Then I went on with streaming for some eight more years. And then I found myself wanting to learn Kotlin, and everything just came together and said, here's a very difficult project: can I make a Bash-like programming language
00:22:10
Speaker
where I give it strings that look like Bash commands, pipes, or one-liners, or whatever you want to call them, and can I compile these down into a Kafka Streams application? That was the question, right? I didn't tackle the problem from a usability perspective. It's more the other way around. While working on this thing that sounded very difficult and very fun,
00:22:38
Speaker
well, the more I looked at it, the more interesting the metaphor got. Because, you know, by nature, I'm one of these people that is very skeptical about his own grandiose ideas. Right. When I started working on this, I'm like, am I really trying to write a programming language for data?
00:22:57
Speaker
That sounds crazy. That doesn't make any sense. You're never going to be able to do this. Sure, it's maybe half imposter syndrome, half healthy skepticism, but the metaphor just kept giving while I started
00:23:13
Speaker
looking at it, because I got this remote compiler working and then I could do these basic things, like catting a Kafka topic and doing some grep, which is literally filtering. And then I got a basic wc version (wc, I think, is the command) where I could aggregate data and just count page views and things like this. And then it hit me really hard
00:23:39
Speaker
that with this Unix metaphor we have heard for 50 years, that in Unix everything is a file, there was really no difference between these concepts. In Unix, what makes this abstraction really powerful is a combination of two things. The way you interact with the data is pretty uniform, because, well, literally everything is a file,
00:24:05
Speaker
and the programs are very composable because of pipes. They do one thing, you compose them together, and then you can do the same kind of transformation over very different files using the same exact ideas, and you can just keep reusing it as much as you want. And the more I thought about it, the more it felt like it would work for streaming. And then what happened is that I just kept going, and now it turned out into this vision
00:24:34
Speaker
Yes, I do believe that's another way of looking at usability for streaming, because Typestream's pitch is: what if you could interact with your Kafka topics, what if you could write streaming applications, with the same simplicity with which you do work on your files, on your file system?
00:24:58
Speaker
And the beauty of this metaphor, at least in my opinion, is that it actually works perfectly from the perspective of the file system too. Meaning, Typestream is this virtual file system: you connect it to a Kafka cluster, and it will create a view of this cluster where you get each topic as its own path. And that's how you address it, right? So you can do `cat <name of the topic>`, which is a path.
00:25:26
Speaker
You're literally enabling something like: you can run Typestream and go `cd /dev/kafka/topics`, then `cat users | grep Berlin > berlin_users`.
00:25:41
Speaker
That's the basic metaphor I applied, and it already works, even though Typestream is a very young project and a very complicated one, so it's a bit hard to get everything up and running.

Typestream's Complex Language and Data Handling

00:25:57
Speaker
It's definitely my first programming language, and Bash-like programming languages are very complicated because of bare words. Until you get to the final stage of compilation, you don't even know if the string you're looking at is actually a string, or a path on the file system, or a variable. It can be a lot of different things. It gets very complicated.
00:26:21
Speaker
So the thing that really clicked in my head came when I started looking at Typestream from more of a product perspective, right? When I started talking to people, they'd say, oh, this sounds like a good idea. I had people telling me, why isn't this the default way we interact with the data? Which is very cool feedback, right? It's like, it sounds so easy, why aren't we doing this by default? And, well, I have
00:26:46
Speaker
two answers. One of them is that I have no idea why that's not the default, in a way, which feels very suspicious, because we have been doing this for 50 years with Unix. Am I the only person that ever thought of applying the concepts of Unix to something we literally call data pipelines? It's literally in the name, right? The pipe operator we've been using for decades, every day, several times a day.
00:27:15
Speaker
I feel really strange about it. It's like, am I the first person thinking about this? And that's why I kept trying to sabotage my own project and say, see, if I do this, it doesn't work. And this is how I discovered how the metaphor works in reality: it works beautifully in contexts I had never thought it would work in. Let me give you one concrete example. Are you familiar with Tinybird? The startup? Yeah, yeah, yeah. I think you might have
00:27:42
Speaker
They do ClickHouse, APIs in front of ClickHouse, right? Yeah. And it's very clever. The first time I ran into it, it felt a bit like the first time I ran into React: why is everyone saying this is cool? I'm not getting it. It's my fault, right? Let me spend two minutes on it. And then I realized that they expose APIs for you, so instead of just consuming the data from a managed ClickHouse, you can build on top of it.
00:28:10
Speaker
It clicked in my head in two seconds: well, if I apply the metaphor to Typestream and I have a virtual file system, I can mount a web server and say `cat topic | grep Berlin`, and then you can redirect that into
00:28:30
Speaker
a web server path, say `/media/server1/endpoint`, which is the actual endpoint where you expose the file. So processes are like entries in the file system. Yeah. Yeah. And then eventually there'll be `/dev/postgres` too. Exactly. And it works. Yeah.
00:28:49
Speaker
What's interesting is, because it works both ways in Unix, it will work both ways in Typestream, if I get to building it. And by both ways, I mean you can end up with a one-liner where you actually get data out of Postgres, you process it on top of
00:29:06
Speaker
Kafka, you throw it at a topic, but also you tee the topic and expose it to a WebSocket, because why not? The whole idea behind it is that it looks exactly like Unix, and that composability is really obvious for us in the Unix world, right? We do that all the time. You might send it through rsync. Well, Kafka Connect, it's just rsync
00:29:35
Speaker
in Typestream, right? Conceptually. Because it's kind of the same idea. Yeah, I can see that. And that's kind of what fascinates me and keeps me working on this project, which, I will not deny, is technically the most difficult project I've worked on, because there is way too much going on. Even like, you know,
00:30:00
Speaker
because we've ignored one thing that Typestream does so far which I think is worth mentioning: there is a reason why it's called Typestream, apart from the fact that it sounds nice and resembles other things people are familiar with. It's because all these data pipelines, they're all typed.
00:30:18
Speaker
And I think that makes a very big difference compared to other solutions, where you would have to bend SQL significantly to actually achieve the same usability. So let me give you an example. When you do `cat topic | grep something`,
00:30:36
Speaker
if you use a bare word, Typestream will do exactly what grep does. It just looks at the whole line, which in the context of Typestream is the whole record, and looks for whatever you passed. That makes sense. But of course, with grep you can use, I think it's called the square
00:30:57
Speaker
brackets operator that creates these conditions, right? In Bash there are these square-bracket operators, `[ ... ]`, and you can use them in if statements. And I started thinking about it: to make grep more usable, I can apply the same conditional concept to grep. And now, for example, Typestream does things like, you can cat a topic and then grep for records where a specific field is bigger than 500.
00:31:25
Speaker
Now, this doesn't sound very smart, because of course you expect a system like this to give you the ability to, I don't know, give me all the books that are more than 50,000 words. What I think makes Typestream more usable, and it's kind of easy to do once everything is a programming language, like the technical problem you're solving is a programming language, is that, well, I can actually type check the pipeline you give me, because I know
00:31:54
Speaker
I know because of the schema of the data, I know the type that the whole pipe is, so I can infer each single step of this pipe and say, okay, the book schema, it looks like this. There is a title, and it's a string. There is a word count, and it's long, and so on.
00:32:15
Speaker
When you get to the grep operation, I know that the field `words` is not there, because it's called `word_count`. And then I can actually tell you: I cannot run this for you, I cannot compile it for you. Which at first sounds like just a nice feature. But then if you actually apply it to the whole idea of writing data pipelines every day with this
00:32:40
Speaker
technology, well, you end up in a place which we just take for granted in the whole industry outside streaming, where it's just obvious. It just works like this everywhere, as long as you use a typed language. But in data streaming, it sounds like a very advanced feature. And to be honest, it's... Well, I think it is. No, I think it is. Because I mean, I use things like Kafka Streams and
00:33:06
Speaker
sometimes it feels like most of my time goes on the fact that it won't infer the types and the serialization stuff, and I just have to teach it how to deal with types at every single step. Yeah. And you know, I know why you say it's an advanced feature. The point I'm trying to make is that it's only advanced because we're not tackling usability, because, well, it's really not an advanced feature conceptually. It's something we ought to be able to expect.
00:33:36
Speaker
Yeah, exactly my point. Even though it's hard to implement, that's what I'm getting at. It's not easy. I will agree with that. But the point is that if you look at a Kafka Streams application that gets data from two topics, joins them together, then filters out some of the data on some business criteria and sends the result via WebSocket with a foreach, that's 150 lines of
00:34:04
Speaker
Java code or Kotlin code or Clojure code, whatever you use, as long as it's JVM. Well, there are a lot of details going into this, because there is no abstraction. It's not Kafka Streams' fault that it doesn't solve this problem for you. It is really out of the scope of the library to solve this
00:34:22
Speaker
serialization problem. I know exactly what you mean about this: if you find yourself there, it's like, okay, I have to declare that the resulting type is this thing that I don't have yet. In every single project I worked on, we ended up with some sort of
00:34:38
Speaker
We would call it like a hybrid serializer that for all the internal steps of the data pipeline, we would use a JSON serializer so it would be easy to do all the operations really fast.
00:34:53
Speaker
Yeah, because the truth is that when you work on a project where you use this a lot, well, this happens every day, and you're not going to make a new Avro schema for every step of every single pipeline. Well, the beauty of Typestream, in a way, is that Typestream can do that, because it will just compile it, right? It will compile every step and figure out what's the right
00:35:14
Speaker
schema, and it can also output in different formats, because, again, you can just have a to-json or to-csv or to-whatever command at the end of the pipe to change the encoding, because that's how you would solve the problem in Unix, right?
00:35:29
Speaker
Yeah, again, I didn't invent it. There are actually literally commands in Unix, I never remember their names, that change the encoding between DOS and Unix: unix2dos and dos2unix. And the idea is the same, right? All I'm saying is, if you want a different format, you just pipe it into a tiny program whose only responsibility is taking whatever you give it and changing the encoding.
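[Editor's note: dos2unix really is used this way in Unix pipelines, e.g. `cat report.txt | dos2unix > clean.txt`. The to-json / to-csv spellings above are the Typestream syntax as recounted in the conversation; check the project docs for the exact form.]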
00:35:55
Speaker
That's all it is. So one day Typestream will support piping JSON to Avro? It already supports this automatically, at the inference level. The point is that if you do a pipeline where you start with a schema, and then you grep something out of it, and then you redirect somewhere else,
00:36:21
Speaker
conceptually it's very easy to imagine that the end topic should have the same schema as the starting one, because you did not change the schema, right? You just filtered some data out. But if you cut fields out, then you should change the schema. Typestream right now does this automatically.
00:36:40
Speaker
What it does is infer the encoding of the pipeline on the fly. It will figure out that none of the data operators you used changes the schema, so you get the same schema you had when you started. But as soon as you use something like a cut or a wc or a join, which, by the way, I discovered while working on Typestream, there is a... We have to talk about joins.
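[Editor's note: a toy sketch, in Kotlin, of the schema inference and checking described here; hypothetical types for illustration only, not Typestream's real internals. Filtering operators keep the schema, projecting operators shrink it, and unknown fields are rejected before anything runs.]

```kotlin
// Hypothetical pipeline model: a schema is just field name -> type name.
data class Schema(val fields: Map<String, String>)

sealed interface Op
data class Grep(val field: String) : Op      // filter: shape unchanged
data class Cut(val keep: List<String>) : Op  // projection: shape shrinks

fun check(schema: Schema, op: Op): Schema = when (op) {
    is Grep -> {
        require(op.field in schema.fields) {
            "cannot compile: no field '${op.field}' in ${schema.fields.keys}"
        }
        schema // grep drops records, not fields, so the schema flows through
    }
    is Cut -> Schema(schema.fields.filterKeys { it in op.keep })
}

fun main() {
    val books = Schema(mapOf("title" to "string", "word_count" to "long"))
    println(check(books, Cut(listOf("title")))) // Schema(fields={title=string})
    println(check(books, Grep("words")))        // fails: the field is word_count
}
```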

Validating Typestream's Unix-Like Commands

00:37:04
Speaker
Yeah, there is a join command in Unix. I had no idea about this; I discovered it while working on Typestream. I didn't know that. You know, when I first started working on Typestream, I was debating it with a friend of mine, shout out to Bruno, because he was very helpful in doing this, like remote rubber-ducking, where we'd call
00:37:27
Speaker
on Signal or something. And that's the exercise we did: can you find semantics that work? Because some of these are really obvious: grep is filtering, wc is aggregation, cat is literally reading a file, and the greater-than redirect is literally the `to` of Kafka Streams. And I discovered that Unix has a join command where you can join two files.
00:37:54
Speaker
It's just thinking, surely Unix has already solved this problem somewhere. That's kind of my point, right? I didn't even know that Unix already had the program. I thought, I'm going to have to make a join command, because otherwise I cannot join streams from different sources together. And then when I searched for "join Unix", I could not believe the command was already there. And I started using it with files to understand how it works. Of course, it has limitations compared to the
00:38:24
Speaker
join semantics of streaming, but still, the basic idea, that the metaphor just keeps giving, is what kept me going with Typestream. So before we move to joins, to close the encoding part: to be honest, this is one of those things that every single person I got started
00:38:47
Speaker
with data streaming has struggled with, because you have to handle encoding on your own. And from the perspective of a Typestream user, it's completely transparent. Unless you want to physically change the encoding, say from Avro to Protobuf,
00:39:03
Speaker
there are things that I still have to sort out from a syntax perspective, but it already works. It already does that, because, well, the point is that Typestream is a remote compiler. So when you give it a string that is actually a pipeline, it does all the things a programming language would do.
00:39:21
Speaker
It looks at the pipes, it figures out all the operators, it tells you, well, you cannot use grep with this field, this field doesn't exist. And the beauty of the Unix metaphor in this context is that it works with both kinds of commands. I have an HTTP command so that you can
00:39:41
Speaker
pipe things into an enrich block, where you get all the data coming in from a stream, do a remote HTTP call, and then have a resulting stream coming out of it, which is a non-trivial Kafka Streams application to write. And this HTTP command, because everything is a stream and because everything is
00:40:01
Speaker
because the metaphor works everywhere, it just worked on the first try. It just works with grep. I literally discovered that it would work with the other data operators by testing it after I finished the first implementation, because I wasn't even sure it would work, but it actually worked immediately, which I think, in a way, is a form of validation that
00:40:30
Speaker
this might not be the best solution ever, but at least it's a concrete and valid way of looking at the problem space, of improving usability in data streaming, because even Copilot agrees with me. I tell this to every person I meet: when I write docs for Typestream... Today, literally today, I added the -v
00:40:56
Speaker
option to grep. For some reason, I thought it was already there, but it wasn't. It's the option that inverts the match, right? So I added this option, I started writing some docs, and Copilot just finished the line in the Typestream
00:41:11
Speaker
code, right? Because you're reusing the metaphor so perfectly. Yeah, because Copilot doesn't know that the code is Typestream. It just thinks it's a shell script, right? It's a little shell script. And that's what gets me very enthusiastic about it. Even Copilot, which is a heartless machine, sees the metaphor.
00:41:36
Speaker
Yeah, yeah, yeah. I have to ask you, this is something on the side, but I feel like I have to ask you this. So you're building out an existing language with a lot of conventions and a lot of rules. It's your first programming language you've written. You're writing it in Kotlin, which is your first big Kotlin project.
00:41:58
Speaker
You're doing not just a programming language, but a programming language with type inference. I mean, how large a mountain are you trying to climb, Luca? No, it's fair. So, it is a very fair question. Yes, it's very large. I will not deny this.
00:42:17
Speaker
I'm working with a friend all the way. We co-founded a small company behind Typestream because Typestream is, to be clear, Typestream is fully open source. I believe that there is no other way of doing this. I'm not even going to argue for any other way of looking at Typestream from a project perspective because I don't think it makes sense.
00:42:41
Speaker
Actually, the timing is perfect, because up to six months ago I used to say Terraform would never have become so successful if it hadn't been open source in 2014. Now it feels a little strange to say, and I think you can tell from how badly the open source community reacted to HashiCorp changing the license.
00:43:06
Speaker
Typestream is an open source project, and the scope of it is immense. I agree with you, right? There's no doubt there are a bunch of firsts, things I never did. To be fair, the fact that it's my first large Kotlin project, I don't really care about, because that's not
00:43:27
Speaker
that's never been an issue for me, because for me, languages are literally tools. It was not even hard to learn; I just needed a real-world project. That's how I ended up using Kotlin. I will say that
00:43:45
Speaker
it is difficult from a programming language design perspective, because of course I've never done a programming language. But, you know, it's kind of funny, because if you look at the history of the people that built programming languages, when you talk to them, they will always say: I had no idea what I was doing. And then you end up with Python
00:44:05
Speaker
and with Ruby and with JavaScript. And yes, some of these languages have obvious quirks inside. But in a way, I feel safer from that perspective, because, well, I'm not really reinventing the syntax either. I had to borrow ideas from other languages for things Bash doesn't have, because it's really hard to do stream processing without the concept of a block.
00:44:33
Speaker
Because if you're mapping data in streaming, it's really hard to express that without block syntax. Bash doesn't have that, actually. The typical lambda that you would have in Java or in Kotlin is not really there in Bash.
00:44:54
Speaker
I had to add a bit of syntax coming from other languages. There is a Rust project out there called Nushell. It's kind of funny, because they solve a very similar problem to the one I'm trying to solve with Typestream, but for the shell only. It's a shell where everything is structured data, so you can write very, very clever scripts inside your shell, because the shell is aware of the data types of everything you use.
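[Editor's note: a canonical Nushell example of this, from the project's README: `ls | where size > 10mb`. The shell knows `ls` produces a table and that `size` is a typed column, so the filter is checked against structured data rather than raw text.]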
00:45:23
Speaker
Yeah, it's very interesting. And, you know, in a way, again, one more time, a form of validation that, you know, the metaphor can work in the context. But yeah, so the point is, in a way,
00:45:38
Speaker
The thing we really need for the project is, of course, to people to use it a little more, like I have friends using it, but ideally it gains a bit of traction so that some other people are interested and maybe
00:45:55
Speaker
The company behind Typestream can hire a couple of full-time open source developers.

Typestream as a Comprehensive Data Language

00:46:02
Speaker
Ideally, we would hire at least one compiler engineer that works with me that does nothing else than working on the open source project, which, by the way, I think it's a dream job, if you ask me.
00:46:14
Speaker
But everyone has opinions about that too. If anyone listening would like to apply, send your details now. Yeah, I can't do that yet because, to be fair, I'm paying for this out of my own pocket at the moment. I cannot really hire anyone right now, but ideally that's the path you would want to take, because I think it's so important
00:46:36
Speaker
for a project, and that has nothing to do with Typestream. It's more from my leadership experience of 20 years in the industry: for a project to be successful, you do want to know your limitations. You do want to surround the project with people that are hyper-specialized in solving one problem.
00:47:01
Speaker
I do feel very secure in the vision that I have for Typestream, because I've spent so much time trying to sabotage it myself that I can't anymore. Either I have a giant blind spot that I haven't seen yet, but the more I talk with smart people like you, the less I'm convinced I have a blind spot toward Typestream, and the more I'm convinced that
00:47:27
Speaker
we should try to make the full extent of the vision work, where, as you said some half an hour ago, at some point there is a `/dev/postgres` table somewhere, and use that to...
00:47:42
Speaker
Basically, Typestream becomes a data programming language, which is a very strange thing to say, I think, in a way. I can see one day it's going to try and occupy the same space as, say, Flink, right? Yeah. Connective processing tissue.
00:47:58
Speaker
When I started talking to people about the project from a product perspective, one of the pieces of feedback I got most often was: why are you even trying to solve this as a programming language? Because there are
00:48:14
Speaker
other ways you can approach the project. There is a company solving a very similar problem as a Python library. I think it's called Bytewax. And it's very clever, very smart, and it's a different approach, and it's as valid as Typestream.
00:48:33
Speaker
The difference is that, because it's a programming language, Typestream can do something that you cannot have with these other projects. Not because Typestream is better; it's literally because of how Typestream is designed. Because it's a compiler, Kafka Streams is just one runtime. I think this is not obvious when you talk about Typestream for the first time, because right now it looks like this.
00:48:58
Speaker
The naming in compiler space is really funny. You know, there is a front end, the back end, the middle end. I know, it's really hard to believe there are middle ends, but it's true, right? So the naming is a bit strange, but
00:49:14
Speaker
because it's a compiler and it's thought of as a compiler, nothing prevents you from changing either the frontend or the backend, meaning you can compile to different runtimes. You can compile to Pulsar, you can compile to Memphis, you can compile to Flink. Why not a Flink job?
00:49:29
Speaker
All you need, and I'm trivializing the simplicity of the task, not the scope, but the scope is very well defined: what you need is a one-to-one semantic mapping between the data operators that Typestream offers
00:49:45
Speaker
and the ability to express that operation in the native code of, say, Flink or Pulsar or whatever it is. Right now, if you look at the compiler, the last stage, the one that creates the Kafka Streams application, is literally a walking algorithm, a basic depth-first traversal of the graph
00:50:10
Speaker
that does nothing else than, for every node it visits, ask: what is this node? Okay, that's a grep; well, that's how you do grep in Kafka Streams. So you get to the point where Flink and Pulsar and Python will all become target architectures.
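[Editor's note: a cartoon, in Kotlin, of the "walking algorithm" just described: hypothetical node types, not Typestream's actual AST. The walk visits each node of the pipeline, asks what it is, and emits the corresponding Kafka Streams operation.]

```kotlin
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.kstream.KStream
import org.apache.kafka.streams.kstream.Produced

// Hypothetical pipeline nodes for illustration.
sealed interface Node
data class Cat(val topic: String) : Node
data class Grep(val needle: String) : Node
data class Redirect(val topic: String) : Node

// Walk the (linear) pipeline and emit one Kafka Streams call per node.
fun compile(pipeline: List<Node>, builder: StreamsBuilder) {
    var stream: KStream<String, String>? = null
    for (node in pipeline) {
        stream = when (node) {
            is Cat -> builder.stream(node.topic, Consumed.with(Serdes.String(), Serdes.String()))
            is Grep -> checkNotNull(stream).filter { _, value -> node.needle in value }
            is Redirect -> checkNotNull(stream).also {
                it.to(node.topic, Produced.with(Serdes.String(), Serdes.String()))
            }
        }
    }
}

// "cat users | grep Berlin > berlin_users" becomes:
fun main() {
    compile(listOf(Cat("users"), Grep("Berlin"), Redirect("berlin_users")), StreamsBuilder())
}
```

[Swapping the `when` branches for Flink or Pulsar equivalents is, conceptually, all a new backend needs, which is the point being made here.]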
00:50:27
Speaker
Yeah, from the perspective of Typestream, they are just runtimes, right? Yeah. And there is an even larger scope, a can of worms which we probably don't have the time to go through. That specific can of worms would be:
00:50:45
Speaker
At some point, Typestream can get to a place where it can run pipelines where there are multiple runtimes involved, meaning you have clusters of all kinds and then you can decide on the fly what's the best place to run this pipeline.
00:51:10
Speaker
It sounds like a science-fiction level of technology for data pipelines, but I don't think it is, once you accept that Typestream is nothing else than a programming language, and then you have available to you all the techniques that are literally basic stuff in programming languages, like dead code elimination or caching. Caching is very common in programming languages.
00:51:36
Speaker
Caching part of a computation of a data pipeline sounds like a very advanced feature in another system. But, you know, if you cat a topic, grep something, and then pipe it somewhere, and I write something that starts the exact same way but then does one more grep and pipes somewhere else,
00:51:56
Speaker
I mean, nothing prevents us from reusing the first part of the computation; it's the exact same pipeline. And those things are only unlocked by the fact that Typestream is trying to solve the problem, in a way, at an unreasonably low level, which gives you the ability to express pipelines with a language
00:52:17
Speaker
where the hardest sell would be that you have to learn a new language, except you don't, because the language looks exactly like your terminal. That's kind of the final pitch, right? Yeah, yeah. There may be a lot of work for you to implement each part, but the syntax is, yeah. So, I mean, I think
00:52:38
Speaker
I think it's really interesting. I think it's at least as valid an answer as the SQL is the right answer to this thing. But I feel like we should just, if Unix is the metaphor and you're convinced that it's a good metaphor that seems to be fitting well, we should test it in a couple of places. Test if the metaphor hangs together when we start doing more interesting things. So the first one is joins. Tell me about how joins will actually work.
00:53:03
Speaker
I mean, so the syntax of the join command in Unix is very close to the syntax of the join fluent API from Kafka Streams, because it does
00:53:20
Speaker
nothing else, and I'm simplifying a little, but it does nothing else than say: these are the two streams you want to join together, and this is how you glue them together. And this is what I brought up when I talked about blocks, when we were discussing Bash syntax. Bash doesn't have that feature. Bash doesn't have the ability to say: capture a little piece of code here and use this code every time you do this operation. And that's what the join syntax looks like.
00:53:50
Speaker
Right now Typestream has the simplest join syntax possible, which is literally: you can join two streams and we will join them by key. And I think there's a default window. It's hard to go further than that at the moment, because the scope is so big.
00:54:07
Speaker
For each single problem, I just try to glue them together and say, okay, that's the infrastructure. If you want to make joins really smart, there is one class that is the abstract syntax tree class representation of join,
00:54:23
Speaker
You can make it as smart as you want, because the beauty of the abstraction is that these operators just work with a class called DataStream inside the code. They don't know anything else about the outside world. So you can make the join as smart as you want. And the way you would do it is by imagining a little syntax that gives you the ability to add what I think is normally called the value mapper, because you have to map the result of the join. And that syntax is already there, because
00:54:53
Speaker
Typestream supports a command that is not Unix standard, called enrich, which is the equivalent of map. I stole the name, I think, from Nushell, because it was literally one-to-one with the idea that I wanted. Then the rest you can do with options, the way you do in Unix. If you look at the rsync man page, there are a million options, because it's a really smart program. If you think about it, I'm pretty sure rsync has more options than there are possible joins.
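[Editor's note: for contrast, the fluent join being compiled down to looks roughly like this in Kotlin with Kafka Streams. Topic names, the five-minute window, and the serdes are illustrative assumptions; the lambda in the middle is the "value mapper" block discussed above.]

```kotlin
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.kstream.JoinWindows
import org.apache.kafka.streams.kstream.StreamJoined
import java.time.Duration

fun main() {
    val builder = StreamsBuilder()
    val strings = Consumed.with(Serdes.String(), Serdes.String())
    val orders = builder.stream("orders", strings)
    val shipments = builder.stream("shipments", strings)

    orders.join(
        shipments,
        // The "block" Bash lacks: map each joined pair into one output value.
        { order, shipment -> "$order|$shipment" },
        // Unbounded streams force you to pick a join window; here, 5 minutes.
        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
        StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String())
    ).to("orders-with-shipments")
}
```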
00:55:23
Speaker
I'd probably agree with that. Maybe I should say this out loud, because I don't think I ever have so far: I don't think the
00:55:39
Speaker
ability to express one-to-one every single thing that the existing stream processing tools can do should be a design goal. I think it would be nice to get to a good 95% and not do the last 5%. I don't believe we have to be one-to-one feature compatible,
00:56:06
Speaker
because this is not really an alternative, right? I'm not trying to build an alternative to Kafka Streams or to Flink. I'm trying to solve a different problem. The problem is that interacting with stream processing with the existing tooling is very, very hard.
00:56:21
Speaker
So I'm going to create tooling that allows me to do 95% of the job with one line of code. And then if I need to do something really complicated and I need a Kafka Streams app, I can still write a Kafka Streams app that writes into a topic, and that topic just becomes a file in the file system for me, right? Yeah. How do we make the things that are easy, trivial?
00:56:45
Speaker
Exactly. I care significantly more about the fact that right now, if I was working on a Kafka project and I had to get the past four weeks of data from a specific place and filter data out, I would have to write a whole app instead of writing a line of code for it.
00:57:08
Speaker
Okay, so maybe moving on from joins then, let me test it with another important thing in stream processing. Is there a Unix-y way of handling state if I want to like roll up, have a running balance, for instance?

Typestream's Integration with Kafka and Kubernetes

00:57:22
Speaker
Yeah, I mean, so the funny thing is that this was one of the hardest things to implement for me. So the thing is that
00:57:33
Speaker
there are two answers to it. There is the syntax answer and the physical answer, the actual how-does-this-work-in-Typestream answer. The syntax answer is that, in a way, as long as you have a Unix command that makes sense... I test these things by sending people little pipelines and asking them, what do you think this pipeline does?
00:57:58
Speaker
I don't tell them what it does. That's how I figured out that, for example, the wc command works in a lot of places where you have to do these basic count aggregations: I want to aggregate all the data by key. As for handling state, I don't want Typestream to ever solve that problem, in a way, because if you think about it, Typestream is just a compiler. I say "just" in quotes, because it's a little reductive for how complicated the problem is. But at the end of the day,
00:58:28
Speaker
Typestream doesn't actually solve the problem, because it compiles down to Kafka Streams, and Kafka Streams solves that problem. And I've had this question from other people as well, saying, when you get to state, when you get to storing data, it gets very complicated. And while I agree with all of it,
00:58:47
Speaker
I don't think it's in the scope, which is kind of interesting in a way, and it confirms a part of this idea that I enjoy very much. That is, because it's just a compiler, I don't have to deal with this problem at all, actually, because it's
00:59:03
Speaker
delegated completely to the underlying implementation of the streaming library that I'm using. In theory, and that's kind of the funny part of the answer, you could compile a Unix pipe written in Typestream down to a streaming library that solves these problems on its own, like Bytewax does.
00:59:30
Speaker
Because why not? What prevents you from solving it that way? So maybe there is something that is not clear to you from a technical perspective, and I should spend a minute explaining it. Typestream is a remote compiler, which makes a difference. Why is that relevant? It's relevant because,
00:59:54
Speaker
in production, and there is a variety of reasons why it works like this now, we could probably talk for an hour just about this, honestly. But in production, Typestream
01:00:09
Speaker
requires you to use a Kubernetes cluster. And the reason it works like this is that, just as I don't want to solve the state problem for the stream processing, I also don't want to solve the orchestration part. I think it's unclear to people, and it's my fault, obviously, because if you go to the website right now, it's not clear that Typestream also
01:00:32
Speaker
manages these jobs. Once again, the Unix metaphor just keeps giving. If I give you a pipeline and I add an ampersand at the end of it, it will just run in the background.
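Just as in a Unix shell, the trailing ampersand would turn a pipeline into a long-running background job. A sketch, with hypothetical paths and assuming redirection into a topic works the way the metaphor suggests:

    cat /dev/kafka/local/topics/clicks | grep "checkout" > /dev/kafka/local/topics/checkout_clicks &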
01:00:48
Speaker
It's like, why not? And that's exactly what Typestream does, right? When you run Typestream, there is a little piece of code that says, where am I running? Oh, this is a Kubernetes cluster. And then it puts itself into Kubernetes mode, which means all the long-running jobs, instead of running in coroutines, which would die as soon as the server dies, get delegated to Kubernetes jobs.
01:01:14
Speaker
Why am I bringing this up in the context of state? Because I think those two things are actually connected. If you run Typestream in production, in your cluster where you have your apps, you have a namespace with Typestream, and it talks to Kafka. What you end up with is an
01:01:36
Speaker
extremely simple way of running long-running jobs on top of Kafka that do almost exactly what a Kafka Streams application would do, with the difference that the thing will take you, I don't know,
01:01:52
Speaker
15 seconds to write instead of two days, and you don't have to manage it yourself. Long term, the idea is that, again, the Unix metaphor just keeps giving: right now, the ps command is a bit dumb, because it just shows you the name of the app with its state, but nothing prevents you from making it
01:02:12
Speaker
much smarter, using once again a different aspect of the Unix metaphor, by exposing the metrics of consumer groups, Kafka Streams applications, brokers, all of these via the proc file system.
01:02:27
Speaker
So one day Typestream will let you say cat /proc/kafka/some-job and you'll get the metrics for it. Yeah, that would be very nice. So for that to work, everything is already in place, because that's kind of the exercise that I did: does this hold even for things I don't yet know how to do? I just kept answering questions to see if I could go full circle with the problem.
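A purely speculative session along those lines, with invented job names and invented output, since the proc part is still a roadmap item:

    ps
    JOB               STATE
    checkout_clicks   Running

    cat /proc/kafka/checkout_clicks
    records.processed: 1204417
    consumer.lag: 3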
01:02:53
Speaker
There is literally one thing missing there, which is that when you start jobs, they should be reflected inside the file system immediately. What I'm trying to say is that if I implement that feature, then what you just said would already work. Because everything is a stream, everything is a file, and it's always the same metaphor.
01:03:13
Speaker
Adding live jobs to the file system, that's basically the proc management part; I think that's what we would call the roadmap item in Typestream when we implement this. Then this would already work. And the beauty of it, the long-term vision of this approach, is that there is a lot of code already written that works with this kind of solution. If you have a script that monitors the proc,
01:03:43
Speaker
it would be very easy to translate, right? It would be very easy to translate all these scripts. And I haven't even looked into the progress that we've made in the past four or five years with eBPF, with all the system inspection that is a bit more low-level, exposing all these metrics again via the file system, because why not?
01:04:07
Speaker
Yeah, I can see that. And that's kind of the reason why I'm so excited: whatever question I get, even if I don't have an answer immediately, my brain thinks about it. And then maybe while I'm swimming, I'm like, well, Unix does this already. You consistently realize you didn't invent anything. We've been doing this for 50 years; we just forgot to apply it to data. That's all.
01:04:36
Speaker
Yeah, you've got a 50-year-old book of design recipes to draw from. Okay, so that leads to probably my last big question, which is about possibly one of the biggest parts of a Unix system: the ability to extend those pipelines with your own commands. Where is Typestream on the "I can extend your language" story?
01:04:58
Speaker
Yeah, it's a beautiful question, and I think there are two things that need to happen, right? One of them is really
01:05:09
Speaker
syntax slash semantics. Right now, the language doesn't have the ability to define functions. To be fair, this is also relatively easy to do; I just didn't get to it yet. And the reason why I bring it up is that one solution is obvious: you add two features to the language. One is the ability to define functions, and the other is the ability to source functions.
01:05:38
Speaker
And then you end up exactly where you ended up with bash and zsh, where you write your own scripts, you just source them, and they just appear and work. Now, this solves a variety of use cases. And I think it's very interesting that you could have a function that removes,
01:05:56
Speaker
I don't know, all the possibly privacy-sensitive data from whatever stream is coming in. If there is an email or an IP address or whatever it is, it will just remove it. And then you have this tiny function, you can call it pii, and just use it on all your streams. Like pipe into pii, yeah, yeah.
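A sketch of how that might read once function definitions land, with entirely hypothetical syntax borrowed from bash, and an imagined cut flag meaning "drop these fields"; the real operator names may well differ:

    pii() { cut -v email ip_address }
    cat /dev/kafka/local/topics/signups | pii > /dev/kafka/local/topics/signups_clean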
01:06:19
Speaker
And then it just works. And that's also very fascinating, because in case you do something wrong, Typestream will just tell you, well, you applied this to a stream that doesn't have these types, so what are you doing? And you could also relax that to a warning if you wanted to. The other answer is a bit more complicated; I don't have a fully formed answer, but
01:06:45
Speaker
there has to be a way for people to write their own native operators. Let's make it as concrete as possible, right?

Future Plans and Enhancements for Typestream

01:06:54
Speaker
So cut, grep, these filter commands, wc, ls, cd, those are what you would call the built-in shell programs, right? If you try to
01:07:07
Speaker
open a man page for one of these programs, you actually get one giant page with all the programs. It always confused me: I want to read about one program, but the man page tells me about all of them. So how do you add more commands? I think the answer lies in some clever application of
01:07:30
Speaker
paths, like the concept of the PATH in Unix, where I give you a bin path and you can put a file there. Now, the thing is that this program is natively written in Kotlin, which I think means that the commands you put in the bin path have got to be jars. I think there is no way out of it, and that's what Kafka Connect does. That's literally how Kafka Connect works. I don't know if you've ever worked with it,
01:08:00
Speaker
but I think they're called Single Message Transforms. To be honest, they are significantly more powerful than people realize. Because they don't know that Kafka Connect has this feature, it doesn't even cross their mind that with just a configuration file you can get pretty far with Kafka Connect, cleaning things up as they go through. But if you really need
01:08:22
Speaker
a special transformation, some business logic that only you know about and that Kafka Connect couldn't possibly solve for you, they give you the ability to load plugins on the fly. Now, again, I'm not trying to bash the project, because I absolutely love the way they solved the problem, but it's not the most user-friendly way of adding functionality.
01:08:48
Speaker
What I expect the long-term vision to be there is that you make all of Typestream a Java library, and then you add commands by running your own Typestream build with your commands inside. Then it becomes much easier. You have this main command that essentially runs Typestream,
01:09:09
Speaker
and you register your own commands, which just abide by some interface. And the interface would actually look really simple, because it's a function that takes a data stream and returns a data stream.
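A minimal Kotlin sketch of what such an interface could look like. The names here, including the DataStream stand-in, are invented for illustration and are not Typestream's actual API:

    // Stand-in for Typestream's internal typed stream; invented for this sketch.
    class DataStream(val fields: Map<String, String>) {
        fun removeFields(vararg names: String): DataStream =
            DataStream(fields.filterKeys { it !in names })
    }

    // Hypothetical plugin interface: a named operator that maps
    // one typed data stream to another.
    interface Operator {
        val name: String
        fun apply(input: DataStream): DataStream
    }

    // An illustrative user-defined operator that strips PII fields.
    class StripPii : Operator {
        override val name = "pii"
        override fun apply(input: DataStream): DataStream =
            input.removeFields("email", "ip_address")
    }

Typestream could then type-check any pipeline that uses pii against the schemas of the streams it is applied to.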
01:09:21
Speaker
Yeah, I can see myself one day scp-ing a jar to typestream:/usr/local/bin, right? Yeah, that's kind of it. And, you know, because Typestream is a server written in Kotlin, and it exposes a gRPC server, which is not the most user-friendly way of interacting with it, we ship an official
01:09:51
Speaker
command-line application written in Go, because, once again, I'm a big fan of using the right tool for the job, and I think Go is very apt for little command-line applications. So I imagine the command-line app might actually support that out of the box, a bit like Kubernetes does with the cp command.
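Something in this shape, to be clear an entirely made-up command modeled on kubectl cp; nothing like it exists in the CLI today:

    typestream cp my-operators.jar /usr/local/bin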
01:10:12
Speaker
In a way, we would probably do that as well. User-defined functions are definitely coming, and what I like most about this part of the conversation is that they would be typed. I think it's not immediately obvious how much more powerful than the other solutions in the space this approach would be, because it allows you to express
01:10:38
Speaker
a lot of problems in a very safe way, right? Like the example we just made: if the incoming stream has emails, just remove them. It's not so easy for me to imagine how you would solve that problem with a different technology that is already out there, in a way that is both reusable and type-safe. That's the thing I cannot see.
01:11:01
Speaker
I think at that point we should probably say what state the project is in today, and how people can get started playing with it. If I had to look at it with the most critical eyes, I would obviously say that I don't think it's production ready, for the most obvious reason: no one uses it in production.
01:11:22
Speaker
Chicken and egg: it can't be production ready by definition, because no one runs it in production yet. Which means, right now, the most valuable next step for Typestream would be putting it in production somewhere with someone. From, let's say, a features standpoint, and how you get started with it, there are two things to say. So the
01:11:49
Speaker
getting started experience, I think, is pretty solid already. If you go to the docs right now, because I really, really care about the developer experience, for obvious reasons at this point, I tried to make getting started as frictionless as possible. You know, I'm going to say you've succeeded, because I tried it out. Oh, thank you. The initial onboarding is very smooth if you've got a lunch break to give it. Very cool.
01:12:21
Speaker
I worked really hard on it, because the point is that there are a lot of things that are new to you when you get to Typestream. I use the React metaphor there. I don't know if you remember when React came out.
01:12:37
Speaker
But for, I think, literally the first six months, on the homepage, when it was still on GitHub, before they even had the official domain, they had things saying, can you please give this project five minutes before closing the tab? Because the idea is really novel and you should probably spend some time with it. And the reason I bring it up is that, of course, Typestream is not under the Facebook umbrella, so I don't get that boost in trust.
01:13:09
Speaker
I had to make it as easy as possible just for that. And I think the getting started part is pretty much solved. The only thing that is maybe not completely straightforward, and I don't really know how to make it easier than it is, is if you want to get the project up and running on your machine
01:13:30
Speaker
and develop Kubernetes features; then there are maybe two or three things you have to do. But the getting started path, where you want to play around with it, I think is pretty much there. There is a variety of commands already working, the data operators, I mean. And maybe there are more commands than docs. I literally have this as the next to-do item on my list, saying, you know, I should
01:13:57
Speaker
write more docs, because there is more code than docs. So when I have more docs than code, then I can go back to code. The eternal circle of programming. It's a circle. I do strongly believe that documentation makes the quality of a project, actually. Your project is as good as your docs. But from a features standpoint,
01:14:20
Speaker
some of the things we talked about today are very hard to implement, so I'm not going to lie about it. The idea that you can have these data pipelines that get data from extremely heterogeneous data sources, say Postgres on one side, Mongo in the middle, Kafka and then Redis,
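In that long-term vision, such a pipeline would read like any other. An aspirational sketch, where only the Kafka path corresponds to something that works today and the Postgres and Redis paths are imagined:

    cat /dev/postgres/local/mydb/users | grep "active" | tee /dev/kafka/local/topics/active_users /dev/redis/local/cache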
01:14:39
Speaker
yeah, I would say that's far off. But all the ideas we discussed about being able to write 95% of what you would normally do in a Kafka Streams application, that actually already works. It literally already does all of it. For what I would call the
01:14:58
Speaker
beta features, the only thing that's missing is the ability to force the encoding of your choice. Right now, you either get your original encoding, Avro or Protobuf or whatever it is, or you get JSON, because that's the lingua franca when Typestream joins things coming from different encodings. Well, if I have one topic that's Avro, another that's JSON, and another that's Protobuf, what's the output supposed to be?
01:15:27
Speaker
I think there is no right answer. So as a default, we use JSON, because that's kind of the structured data encoding lingua franca of the world right now. Like it or not, that's what it is. I'm not a big fan, because of the loss of types,
01:15:50
Speaker
obviously. That's the reason I don't really like that encoding format. And if you do carry the types, it gets pretty cumbersome really fast, a very inefficient way to move the data around. But to me, it was the obvious answer. And if you look a bit further, not just at the encoding itself, it also makes
01:16:13
Speaker
some of the features that Typestream can offer really obvious, because if you are building a pipeline that ends up in a WebSocket server, there's a very high chance you want that stream to be JSON anyway. That's why it also... Yeah, I can see that. Yeah, that's why it defaults to JSON, right? That's also that. Yeah, yeah. So, in a way, the state of the project is
01:16:35
Speaker
I think it's at a state where, if we get some production users, it will grow really fast from here, because it's very easy for me to imagine that, once I get past the initial effort of getting it installed in production where someone would use it every day,
01:16:58
Speaker
I would expect them to use it 95% of the time, because this is what I missed 95% of the time. Of course, I'm one person, and that's not how data works, I must say, as a sample of one. But I do believe, and I have the feeling you agree, that the basic metaphor holds up really well for a lot of people.
01:17:25
Speaker
I think you've really hit on a seam of a good design idea to mine. And you've got a lot of mining to do, but you don't have a lot of design work to do, I think. And that's a huge accelerator, having those design problems, those really difficult search-space problems, already solved. On which note, I should probably go and leave you to pick up your pickaxe and go mining down that seam of ideas. You've got a lot of code to write.
01:17:54
Speaker
A lot of docs. A lot of docs to write. Thank you very much for joining us, Luca. It's a fascinating idea. Thank you. Thank you very much. Thank you. Thank you, Luca. I have to say, I think that's the most time I've spent thinking about pipes with an Italian since Super Mario Brothers.
01:18:11
Speaker
Sorry, Luca. If you're interested in giving Typestream a whirl, check it out at typestream.io. There's a link in the show notes. It's going to be a while before they have full support for sed and awk. But what they've got right now is a useful tool, and also an interesting design study, I think.
01:18:28
Speaker
It's also worth taking a look at Nushell, which we mentioned briefly. I'll put a link for that in the show notes too. But if you're looking for it, or for new shell ideas, search for Nushell. And with that, I think it just remains for me to remind you that if you've enjoyed this episode, please like it, rate it, share it, hit subscribe, because we'll be back next week with some more thoughts on how we can build the future of computing, sometimes with inspiration from the past.
01:18:57
Speaker
Until then, I've been your host, Chris Jenkins. This has been Developer Voices with Luca Pette. Thanks for listening.