
Shouldn't Data Connections Be Easier? (with Ashley Jeffs)

Developer Voices

Benthos wants to be part of your Data Engineering toolkit - it’s there as a quick and easy way to set up data pipelines and start streaming data out of A and into B. In contrast to a lot of the tools we’ve talked about on Developer Voices, Benthos seems focussed on cutting development time down to a minimum, so you can quickly configure a new pipeline and test it out, without making a whole sprint out of the task. As quick as a quick-and-dirty shell script, without the dirt. 😉

So this week we’re talking to the creator of Benthos, Ashley Jeffs, to hear why he created Benthos, what it can do for you, and what its strengths and weaknesses are. And Ash is refreshingly candid about when you should and shouldn’t use it. If you ever need to get data from an HTTP connection into S3, or S3 into Kafka, or Kafka into a flat file, Benthos might just save you a few hours of development.

Benthos: https://www.benthos.dev/

A list of supported inputs, processors & outputs: https://www.benthos.dev/docs/about#components

All their cute blobfish logos: https://www.benthos.dev/blobfish/

IDML: https://idml.io/

Kris on Twitter: https://twitter.com/krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

#software #podcast #dataengineering #datascience

Transcript

Evolution of System Connections

00:00:00
Speaker
It's not the most glamorous work, but a large part of software development is about setting up connections between different systems. That's all the more true these days. We've got the rise of the internet, giving us vastly more data to deal with, and then you've got the popularity of microservices and cloud services, giving us even more systems that want to be connected together.

Integration Tools: Past and Present

00:00:24
Speaker
You have to be able to do that connection work to build anything of any real size.
00:00:29
Speaker
I suppose you could hanker after the old days, when there was just one big database at the center of our world. But even then we had data integration problems; we just solved them in a more ad hoc way, with custom connection software each time. That's not the way forward. The way forward, and we are gradually getting better at this as an industry, is to build reusable tools for connecting arbitrary system A with arbitrary system B.
00:00:56
Speaker
Now, some of those solutions I really like, but they are admittedly quite big, like Kafka, Redpanda, that kind of thing. There's quite an upfront investment. Some do an excellent job with a very specific approach. I'm thinking of things like Debezium: if it fits your use case, fantastic. If it doesn't, then you've got to keep looking.

Benthos: A Lightweight Solution

00:01:16
Speaker
But my ears pricked up recently when someone recommended I add Benthos to my list, as a kind of lightweight way of getting some kind of connection up and running really quickly. Something I could add to my toolbox as a bread-and-butter tool that was more formal and more reliable than a shell script, but was a similar investment of my time to get something working.
00:01:40
Speaker
So joining me today is Ashley Jeffs. Ash is the creator of Benthos, and it's a project that started at his day job, where he was creating a lot of data pipelines. The project went open source and got more and more popular until he hit that dream that some of us have: his open source project became his day job.
00:01:59
Speaker
We talk about how that happened and how that journey unfolded, but mostly we talk about Benthos the tool and what it can do for you, the design sweet spot it's aiming for, what it wants to be and what it doesn't want to be.
00:02:13
Speaker
We're going to get into that. But before we do, a quick aside: I have to say the Benthos project has as its mascot a blobfish, and when we recorded this, Ash was sitting next to an adorable stuffed toy blobfish, and I couldn't resist mentioning it.
00:02:30
Speaker
But I did break a rule of radio in doing so. I'm talking about something you can't see. So if you're listening to this on the audio only version, please imagine a man sitting next to a melancholy pink stuffed fish. There's a sentence you don't hear every day. If you have that vision in mind, we can get started.

Introducing Ashley Jeffs and Benthos Community

00:02:50
Speaker
I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is Ashley Jeffs.
00:03:08
Speaker
Joining me today is Ashley Jeffs. Ash, how are you doing? I'm good, thanks. How are you? I'm very well, very well. Nice to see the company logo in the bottom corner. Yeah, this is a custom crochet from a fan. And to think the marketing spend is still zero. As befits an open source project. But it's nice that you've got crocheting fans out there. That's a very specific crossover of your user base. Or family thereof, I think, actually, in this particular case. I do a bit of crochet, or I'm trying. Yeah.
00:03:37
Speaker
Okay. Well, that's definitely a topic for a different podcast, Craft Voices. But for now, I thought, so we're going to talk about Benthos.

Can Benthos Simplify Your Workflow?

00:03:46
Speaker
And I thought the way we get into that was something I actually use day to day. I have a script that goes to YouTube's API and grabs some YouTube data, as you might expect, given what I do. And it does a bit of parsing on it and shoves it either into Kafka or a SQL database.
00:04:06
Speaker
I want to ask you, have I done it the hard way? The first thing I would ask is, do you really care if it fails? Is it the sort of thing that if it fails, you'll just run it again because you're running it as a CLI or is it on some server and you'd rather not think about it?

Automation and Failure Handling with Benthos

00:04:24
Speaker
I do run it manually. I wish it were run in an automated fashion, but then I'd have to worry about failure.
00:04:30
Speaker
more than I do. Okay, so yeah, that's exactly where I come in usually. Because I think if people see that they've got a particular script, or a use case that just does some simple plumbing, and they come to me and they say, hey, can we do this with Benthos? If they're happy, and they don't really care if it fails, and they're quite happy to run it manually, I will usually say no, unless you want to try and learn what Benthos is.
00:04:54
Speaker
because it's just another tool. If you had a script and you're happy with Python, then why rock the boat? But it's when you've got some plumbing system that you could... It's not necessarily a streaming application as it currently is, but in your case, you would probably want that to be almost like a stream where it's polling on some interval.
00:05:13
Speaker
and then spin the data through and you don't have to think about it. It's just running automatically. Some people would say that's like a batch job. I would just consider it a stream because it's hands off. You're not hitting anything manually. You're not maintaining anything manually. And the problem with doing that is
00:05:31
Speaker
The question of what happens if it can't send data to your database? What happens if it can't send data to Kafka? What happens if the transformations fail? What happens if it can't hit the API? All of those things. If you want a nice answer for looking after all of those aspects, then yeah, that's when you want to use something more
00:05:51
Speaker
like stream processing, that has already got a nice answer for those questions and kind of forces you to deal with them. So yeah, those are the times when I would say, yes, it's worth learning a new tool, and one that's kind of in the streaming space, I guess, or, you know, a workflow processor or some general data engineering tool that's now mostly config, not necessarily script or code. Yeah.
00:06:15
Speaker
I certainly would like the data to just magically show up more often in my case. But also in cases where, say, the output is just unavailable for a period of time: if that happens at three in the morning, you don't want to get woken up by some alert. So you want it to already have, in its own
00:06:34
Speaker
mode of operation, some answer for dealing with that, which obviously in the stream processing space is usually just back pressure, and alerting if you've enabled it. But it's like an opt-in thing. Ideally, you want to be able to wake up and see that, oh, there were some issues last night, but it just carried on. It fixed itself. It's sorted. I can see that in the metrics or logs or whatever.
00:06:52
Speaker
You didn't have to do anything. It just resolved itself. But in cases where it can't resolve itself, like your database is just broken, you wake up and maybe there's some logs that tell you, hey, you've got to fix this thing. And while I wait for you to do that, I've just stopped myself from operating for a bit. That sort of stuff.
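To make the hands-off, poll-on-an-interval idea concrete, a pipeline like the one being discussed might be sketched in Benthos config roughly like this (the endpoint, interval, and topic names are invented for illustration):

```yaml
# Poll an HTTP API on a timer and stream the results into Kafka.
input:
  generate:
    interval: 60s        # poll once a minute
    mapping: root = ""   # empty trigger message

pipeline:
  processors:
    - http:
        url: https://api.example.com/youtube-stats   # placeholder endpoint
        verb: GET

output:
  kafka:
    addresses: [ localhost:9092 ]
    topic: youtube_stats
```

If the output is unavailable, the pipeline applies back pressure and retries rather than losing the poll results, which is the behavior Ash describes above.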
00:07:11
Speaker
Yeah, yeah. Okay, so I want to get into how you do that and what your design choices are, but I thought before we did that, one big interesting design choice you have up front is to limit the scope of what this does. It's not trying to do all data system to data system processing under the sun.
00:07:29
Speaker
Why and what's the scope?

Focused Data Processing with Benthos

00:07:32
Speaker
Well, from the initial conception of the project, it's because I had a lot of engineering problems where I worked that were basically in the single message transfer space where you're doing like enrichments and you're hitting external services and then you're aggregating the results into like a single payload that then gets
00:07:51
Speaker
pushed along. Whereas typical data engineering tools are usually all catered around windowing systems, so they fit the mental model of: it's basically a database, but it's a streaming database. And I didn't want any of that stuff. I didn't care about any of that stuff. I just wanted systems that you can compose that will do enrichments and brokering, you know, reading from multiple sources, writing to multiple sinks, and doing sort of
00:08:19
Speaker
what would be chores in the streaming world. Usually people would just make bespoke tools for this sort of stuff, and then they would use the bigger tools, like Flink and SQL-over-streams, whatever product you're going to pick for that, for the big tasks.
00:08:36
Speaker
But I wanted a solution that I could just keep redeploying with config that's going to solve what I consider to be the boring stuff, traditionally, the things that you just throw away to an engineering team and say, hey, build a tool that's going to read data from Kafka, and then hit our sentiment analysis tool that's owned by the data scientists, and then remap the data to fit some of the schema, and then dump it in Elasticsearch or some database or something like that.
00:09:02
Speaker
I don't want to have to write that same program over and over and over again, because I was at a point in my career where I knew that that's dangerous. Writing the same streaming application over and over again, but with different code every time, you're going to hit edge cases, because delivery guarantees are actually super complicated, people. And there are these edge cases where, if I'm writing code and throwing it over the wall, I don't necessarily care if the operations team have a bad night's sleep.
00:09:31
Speaker
That's terrible, but that does happen. But sometimes they get so angry that they'll make it my problem. And then, you know, I do have to think about those things. So, you know, I was in a realm at a company where we just had loads and loads of streaming tools that were doing all kinds of different things. And I wanted something that was just going to solve the operational
00:09:50
Speaker
side of things. So: delivery guarantees, good behavior around recovery and hitting issues. And then the idea is that the bit that I will build is the ability to compose these simpler, broken-down problems. And there's nothing stopping you from adding complex stuff on top. There's no reason why you can't have a windowing algorithm implemented within Benthos, and in fact there is one; there's a very basic windowing component.
00:10:18
Speaker
But the point is that's not what you're necessarily deploying every time you use it. So you can have the really simple use cases and start from there, very, very simple tool, and then it only reaches the complexity that your use case has, essentially, when you're adding stuff to the config. Right. So this is making me think, I mean, there used to be this old thing in Perl, right? It was the job of Perl to make hard things possible and easy things trivial.
00:10:46
Speaker
And you're on the make-the-easy-stuff-trivial end of the design spectrum. Yeah, definitely. 100%, yeah. I mean, the initial use cases I had for Benthos were the most trivial, almost obnoxiously simple use cases for stream processing. Like, imagine we're just migrating from Kafka to NATS or something like that, where you're just making a bridge, maybe some buffering.
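A bridge like that is about as small as a Benthos config gets; a sketch (broker addresses and topic/subject names invented):

```yaml
# Bridge: consume from Kafka, publish to NATS.
input:
  kafka:
    addresses: [ localhost:9092 ]
    topics: [ events ]
    consumer_group: bridge

output:
  nats:
    urls: [ nats://localhost:4222 ]
    subject: events
```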
00:11:11
Speaker
And then the idea was you make that really, really simple, especially to express in config. Everybody hates YAML, so you want to minimize the amount of YAML they have to write. And then the idea is that every piece of functionality that you add on top of that,
00:11:27
Speaker
So obviously you can get more complex with different brokering patterns, you get more complex with processing patterns, and error handling, and swimlaning, all these things. But all of those features are introduced as just blocks of config that you can
00:11:42
Speaker
add, but you don't have to learn about them. You could use Benthos as a user for years and not have any idea that any of this stuff exists. And you definitely don't have to do any operational steps in order to enable things that you're not going to use. So you don't have to worry about disks and persistence and stuff like that. It's stateless and just memory-based, essentially. And it's only if you were going to opt into something more advanced that you'd then have to deal with the implications of that. And yeah, that's definitely been a day-one goal.
00:12:13
Speaker
Because it reminded me, I mean, it seems pretty straightforward to set up. It reminded me a bit of, um, Docker containers, right? You set up this thing, you set up that thing, and you describe another bit that connects the two together, and hopefully you're done.
00:12:27
Speaker
Yeah, I mean, all of these sorts of tools were kind of up and coming when I was conceiving of the general Benthos project. I mean, one of the things that did really drive the way that it operates is containerization. The idea that you're just going to have this one thing, it's portable, you can deploy it, and also the idea that you can just deploy one of them.
00:12:51
Speaker
Back in those days, if you wanted to test Kafka, but you wanted to use containerization, that actually wasn't a thing. There was no image for running Kafka. You had to use all these hacky, weird custom builds, and then it's multiple containers, and you have to work out the networking for all these things. It was a nightmare. If you just want to run this stuff just to play with it,
00:13:09
Speaker
as a developer researching these tools, it was an absolute nightmare. So yeah, in the forefront of my mind the whole time when I'm building these new tools is the idea of what does it look like for somebody to explore this tool? What is it like for them to do a Hello World test?
00:13:28
Speaker
Basically, if you can just do a one-liner, you've got a thing running, and then you move on from there, that's the high-level goal. You can get it started with the very basics, it works, and then you kind of dig deeper into it.
00:13:43
Speaker
Right. That explains it, because I tried it out quickly. Without sounding like I'm pitching your stuff, but I tried it out. You do, like, benthos create, then input slash processor slash output, and it'll create you a config file. And I just guessed; I thought I'd put you to the test. And then I was like: input, file, slash, I don't know, jq, slash... SQL, I think. And it did just work, created me a config file.
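For reference, the command being described was presumably something along these lines (the exact component names here are a guess at what was said):

```sh
benthos create file/jq/sql_insert > config.yaml
```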
00:14:12
Speaker
So well done on the developer experience, at least the initial stuff. But how's it done? How have you implemented this? Well, the whole thing's in Go. And that was basically just because I was having fun with Go at the time. I was a C++ developer primarily. And then I kind of had this. So a lot of the tools I was building, these bespoke streaming tools I mentioned, they were all in C++ basically.
00:14:38
Speaker
I was part of the team that was managing all these different things. There was definitely this belief of it has to be C++. For this to run as well as it does, it has to be written in C++. I didn't necessarily question that myself. I just figured it
00:14:56
Speaker
probably ought to be proven. So I took one of the services that was essentially just a bridge with a buffer. So it was actually reading from ZeroMQ to Kafka and vice versa, and then doing a bit of disk buffering on top of that with memory-mapped files. And I thought, okay, well,
00:15:13
Speaker
how much is the damage if I wrote this in Go? I did it in a really cheeky way. I'm using Go channels, because I'm lazy and I don't want to have all this custom stuff; I'm just going to use the basic primitives. What does that look like for performance? It didn't run as memory-efficiently, it didn't run as fast, but it was well within bounds. There was something like an extra 10%
00:15:36
Speaker
of time and CPU resource on top. And that's with no optimizations, no thought process of trying to make this fast. It was just, what's the easiest for me to maintain? And then from that point onwards, it was like, okay, well, I guess I'm never writing C++ again then. I tried to basically double down on this tool.
00:16:00
Speaker
How do you implement it? Because you've got... I mean, it's one of those classic integration things. Your two biggest problems, I'm guessing, are reliability and pluggability, because you want to support all things to all things, totally reliably. Let's start with all things to all things. How do you do that?
00:16:20
Speaker
Okay, so it's evolved over time. But essentially, the internal representation of an input, the inputs are the more complicated ones. So the internal model of that is, it's a thing that creates
00:16:38
Speaker
messages by some means. It could be pulling stuff over a network, or it could be making stuff up. But essentially, it doesn't really matter; that's part of the plugin implementation that you have. So the Kafka one will obviously be reading Kafka partitions, and that's where it will be reading its messages from; the file one will be reading a file with some scanner, so over lines or whatever. And then the idea is that as it's creating these messages, it
00:17:06
Speaker
essentially introduces them into a Benthos pipeline as what I call a transaction. What that is, is a mechanism that associates a given payload of data, or more, it could be more than one, it could already be a batch directly from the source; it associates it with a mechanism
00:17:26
Speaker
to acknowledge that payload of data. So from Kafka it would be a mechanism that ensures the partition is marked with a given offset for a message. Obviously that gets more complicated if you want to process messages out of order and you want to make sure you're not marking offsets that technically you haven't finished yet.
00:17:49
Speaker
But essentially, that's all encapsulated in the acknowledgement mechanism, and that's abstracted as just basically a function that's associated with the payload, and then it gets pushed through a Benthos pipeline using Go channels. The channel mechanisms, those are things that you don't
00:18:08
Speaker
really touch if you're developing a plugin. If you're developing the Kafka plugin, you don't have to worry about the channels. You're just defining how a message is formed, how an acknowledgement is established, and what to do if the message is rejected as well. Because that will be different depending on the input. Some inputs have a sense of a nack that you can push upstream, and some of them don't, so it wouldn't make sense to just drop the data because it got rejected. What you'd have to do instead is make sure it gets reintroduced into the pipeline.
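As a rough illustration of that plugin surface, here's a sketch of an input against the public Go API of the v4 `public/service` package; the details are simplified, but the shape is: form messages, and pair each one with an acknowledgement function:

```go
package example

import (
	"context"

	"github.com/benthosdev/benthos/v4/public/service"
)

// A minimal input: define how a message is formed and what acknowledging
// it does; the channel plumbing is handled by the framework.
type exampleInput struct{}

func (e *exampleInput) Connect(ctx context.Context) error { return nil }

func (e *exampleInput) Read(ctx context.Context) (*service.Message, service.AckFunc, error) {
	msg := service.NewMessage([]byte(`{"hello":"world"}`))
	ack := func(ctx context.Context, err error) error {
		// Called once delivery succeeds (or fails): commit an offset,
		// delete from a queue, or reintroduce the payload on a nack.
		return nil
	}
	return msg, ack, nil
}

func (e *exampleInput) Close(ctx context.Context) error { return nil }

func init() {
	// Registration makes the input available as `example` in config.
	_ = service.RegisterInput("example", service.NewConfigSpec(),
		func(conf *service.ParsedConfig, mgr *service.Resources) (service.Input, error) {
			return &exampleInput{}, nil
		})
}
```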
00:18:40
Speaker
And that channel mechanism that ends up becoming the lower level, I guess you'd call it, representation of an input, that can then get hooked up to any number of layers, let's call them. So the obvious layer is the output layer.
00:18:57
Speaker
It receives these transactions over a channel, and then for every transaction it will try to deliver the data, potentially multiple at the same time. There could be a maximum in-flight that an output has configured for itself, or that a user has configured, so you could have, say, 200 messages in flight at any given time. And then at the plug-in level, if you were writing a Kafka plug-in, you're just writing a definition of: you receive a message, and
00:19:27
Speaker
how do you serialize that into the data that gets sent to Kafka? In this case, you're just getting the raw bytes and some record headers, and then you either return an error or you don't. It either succeeds or it doesn't. You can add a bit of complexity with batching. For example, rather than deliver individual messages, you might want to represent how to deliver a batch of messages, and you benefit from performance there.
00:19:53
Speaker
especially in the world of Kafka, you might want to send a block of messages. And then what you can also do is translate an error that comes back: you can break it down by the messages of the batch, and then you can return a Benthos representation of, here's a failure that happened for this batch, for these given messages. So if you're able to, some inputs won't be able to, but if whoever formed this batch is able to break it down into just these indexes and retry the ones that failed and not the others,
00:20:20
Speaker
go ahead and do it; otherwise, retry the whole thing. And then once you've got those basic abstractions, you can form the high-level ones. So, brokering patterns in Benthos: you can have fan-out, sequential, round-robin, you can have switches for swim-laning, all that kind of stuff.
00:20:39
Speaker
Those are abstractions around the channels. They're able to do pretty much real-time flow control. It's nice for me as a developer to benefit from Go channels for that sort of stuff because it helps you when dealing with all the nasties, all the edge cases such as back pressure, retries,
00:21:01
Speaker
having multiple things in flight at the same time, all those nasty stream processing problems you can basically solve with Go channels. Which doesn't make it trivial; you're not going to solve it overnight, but it makes it a lot easier to both write and reason about once you've done that. Yeah. What's the word, tractable? It doesn't make it trivial, but it makes it tractable, right? You've got a chance of solving it in the same way. I'll thumbs-up that. OK.
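For reference, the brokering patterns mentioned a moment ago are expressed as ordinary config blocks; a fan-out sketch (addresses and names invented):

```yaml
output:
  broker:
    pattern: fan_out   # other patterns include round_robin and greedy
    outputs:
      - kafka:
          addresses: [ localhost:9092 ]
          topic: events
      - nats:
          urls: [ nats://localhost:4222 ]
          subject: events
```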
00:21:31
Speaker
I'm trying to stack the things I want to dive into here. So the first is: in order to get into how we handle errors, there must be statefulness, right? If you've got a Kafka input, your progress through the topic will be stored on the broker, so you don't worry about that. But if you're reading through a file, the file isn't going to track your progress through it for you. So Benthos must be stateful.
00:21:53
Speaker
So with the specific file input, I don't track any of that stuff. Basically, the way that the file input works is it's not a streamed input, which means if you read it and then you crash the service and you run it again, it will just read the whole file again. People have been asking for watch mechanisms and things like that, so you can read it gradually. But again, this is one of those problems where I've just figured: I don't want to solve that. I'm not writing a log aggregator.
00:22:22
Speaker
For example, you can obviously process the logs once they've been written into Kafka or something. But the problem set around something like watching a file, from the delivery guarantees perspective, if you want to do it properly, there's a huge amount that you've got to implement for that. So I figured I'll leave that to the other tools that specialize in that sort of stuff. So the file input in Benthos is basically just: how can you do almost like a batch job? In which case, if you restart it after a crash,
00:22:47
Speaker
you'll just run the batch again, pretty much as you would with a normal batch tool. Okay, so then that leads into delivery guarantees. Yeah, so delivery guarantees: obviously, within Benthos, the goal is at-least-once as a core foundation. You don't have to do anything special and it will do it. But the obvious caveat to that is, if your inputs and outputs don't support at-least-once delivery guarantees, then obviously Benthos can't give you them.
00:23:15
Speaker
So, if you're writing data over a UDP stream, then you can't guarantee that the data's gone anywhere. And similarly, if you're reading data from standard in, there's no guarantee that data that's been consumed by Benthos has also been delivered by Benthos. It just doesn't exist.
00:23:34
Speaker
That's basically one thing that I'm quite happy to just leave out there. You document it in the input that you can't really have an expectation of strong delivery guarantees with these things, but the things that you do expect delivery guarantees on, you don't have to think about it. It'll just work. That's really where my focus is. Right. This is, again, delineating where this tool begins and ends.
00:23:58
Speaker
Let's talk a bit more about what happens when it does go wrong. Is it just drop the world and start again? Or can you do something more sophisticated than that? So it'll depend on your config. But you've got different options in the stream processing world, right? You've got reject: so you can nack a message. So let's say you're reading from SQS, just to get a range of queue systems into this conversation. But imagine it's got a system upstream where you could have
00:24:30
Speaker
a queue, like a reject queue, a dead letter queue is what I'm looking for. And you don't want to just keep retrying a message internally in Benthos forever if it's a bad payload, right? So essentially, the default behavior of Benthos is: something reads a message,
00:24:46
Speaker
It goes through whatever processing. It gets to the output layer. Processing cannot drop data. So error handling in the processing space, say mapping, filtering, all that stuff: if errors occur there, there's a different mechanism for handling them. The data itself always travels through Benthos. So if you don't handle your processing errors, messages that failed processing will still be delivered, and you'll have to deal with it yourself.
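One explicit way of dealing with it yourself is to route failed messages somewhere else, for example with an output switch on the errored() Bloblang function (a sketch; topic names invented):

```yaml
output:
  switch:
    cases:
      - check: errored()   # true if any processor failed on this message
        output:
          kafka:
            addresses: [ localhost:9092 ]
            topic: events_failed
      - output:
          kafka:
            addresses: [ localhost:9092 ]
            topic: events
```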
00:25:13
Speaker
It doesn't drop data under any circumstances, because in my opinion it's easier to deal with, oh, we've got some weird messages appearing in our Kafka topic, we must have done something wrong, than, ten months later, oh, we've actually been dropping 10% of our messages because there's some issue. So processing is kind of a separate topic, which we'll have to dig into. But assuming the data makes its way to the output layer,
00:25:38
Speaker
we attempt to deliver it. If that fails, the default behavior of Benthos is to reject the transaction that came from the input layer. So if the input layer is NATS or RabbitMQ, whatever, it will nack the message. If that's not possible, so say in Kafka land, that doesn't make any sense: you can't nack a message, because that would mean that that offset is just done. You're going to lose that data if you don't reread it.
00:26:02
Speaker
So what happens instead is the Kafka input will enforce an internal retry, and it will never acknowledge, it will never store, an offset that is still in the process of being handled. So imagine you've read a message that's too big
00:26:20
Speaker
to be delivered. It will reach the output layer, and the output layer will reject it. What will happen is it will get nacked, and then the Kafka input will say, yeah, but a nack doesn't exist for me, so I'm going to pass it back through the processing pipeline. And what you'll get is back pressure, because the output layer will eventually stop delivering data
00:26:38
Speaker
if it's not going anywhere. So it'll try its best. If there's some data that's in a retry loop and you've got a maximum in-flight of greater than one, it'll attempt to continue to deliver traffic, and you'll see error logs and metrics telling you that there's data that's not being delivered.
00:26:55
Speaker
But eventually, it's going to grind to a halt. And the back pressure is obviously important because you don't want to be retrying an indefinite number of messages. So you have to have some number at which messages being retried and blocking the pipeline is going to stop the whole thing from consuming.
00:27:11
Speaker
the input, which will then stop being asked to deliver data. So it will then stop. And then what you'll get is, Benthos will effectively slowly grind down to a halt as more and more data doesn't get delivered. And then you, as the operator of this pipeline, at your leisure, can come along and figure out: okay, well, why is data entering this retry loop? What processors do we need to add? Maybe a filter; you can just drop the data if you want to, but it has to be explicit.
00:27:37
Speaker
So: if messages are this size, just delete them, I don't care. And then you rerun Benthos with that new config, and it will reach those same offsets, because it didn't commit them. So it will reach the same offsets, reread that data, drop it, and then it flows like a happy fish in freshwater.
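That kind of explicit drop can be a one-line Bloblang mapping; a sketch, with an invented size threshold:

```yaml
pipeline:
  processors:
    - mapping: |
        # Delete messages over ~1MB instead of letting them block the pipeline.
        root = if content().length() > 1000000 { deleted() }
```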
00:27:54
Speaker
The idea is that every possible edge case fits some model similar to that, where the worst-case scenario is you have a task, as an operations person, to
00:28:11
Speaker
adjust this config or expand this config in order to deal with edge cases that you haven't anticipated. But the idea is that the data isn't just gone. It doesn't just disappear; we don't just move on and forget about it. We make you deal with it, but you don't have to deal with it straight away. You don't get an alert that says, oh my God, this whole thing is dying, because the data is still in Kafka, right? It still exists somewhere. You're using queue systems that have delivery guarantees, which means it's persisted on disks somewhere, hopefully more disks than one,
00:28:36
Speaker
and you don't need to panic. The stream processing doesn't need to panic. There's no need for urgency in any form. It's inherently asynchronous, right? Yeah, exactly. There's no reason to wake somebody up at 3am if, as long as you can process the backlog, so as long as you don't end up in a situation where you can't catch back up again,
00:28:56
Speaker
But obviously, if you're in a situation that's that tight, then yes, I would strongly consider it an advantage to get ahead of some of these issues before they might arise, rather than just relying on Benthos to hold your hand through it. Right. But presumably I could set up a fallback that just sent it to a pager system that texted me as soon as there was any failure.
00:29:20
Speaker
The default output, obviously, if you've only given it one output, it's only got one place it could possibly write that data to, and if it can't, then it will just apply back pressure. But if you've given it a fallback, so there's a bunch of different brokering patterns, but one of them is: if the first output fails, try this one, and try this one, and try this one. And because of the whole composability of Benthos, you can add processors specifically to outputs themselves. So you can have
00:29:48
Speaker
an output of: deliver the data to Kafka. And then you can have a fallback output that's also Kafka, but there's a processor on it that says: if the payload looks a bit dodgy, just send this metadata instead. So you still move on. You don't retry the data and the system doesn't apply any back pressure, but what you've got is a record somewhere. And it doesn't have to be the same topic; it could be a different topic.
00:30:15
Speaker
So I could literally just send it to a dead letter queue, annotated with extra metadata. Yeah, exactly. So the pattern is there for doing that. And then obviously, you can have more fallback outputs after that as well. So you could just write to standard out. You could write to /dev/null if you wanted to. Or you could just delete the data, obviously, because you can put processors in there as well. So you can also have processors that just delete the data, or send an HTTP request somewhere. Or, as you said, you could hook it up to alerting, if you get to that point.
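A fallback chain along those lines might look like this sketch (topic names invented; the mapping that replaces the dodgy payload with a small annotated record is illustrative only):

```yaml
output:
  fallback:
    - kafka:
        addresses: [ localhost:9092 ]
        topic: events
    - kafka:
        addresses: [ localhost:9092 ]
        topic: events_dead_letter
      processors:
        - mapping: |
            # Keep a small record instead of the full payload.
            root.reason      = "delivery to primary output failed"
            root.size_bytes  = content().length()
            root.received_at = now()
```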
00:30:44
Speaker
And then there's other brokering patterns. So unfortunately, we're going to have to have this whole conversation again, because as soon as you add a brokering pattern like fan-out, for example, the error handling has to look completely different, right? Because there's now new edge cases. So in the world that I described, if you've got two outputs, Kafka and, say, NATS,
00:31:06
Speaker
and the Kafka one is failing consistently, either because it's offline or there's some networking issue, but the other one isn't: you don't want a situation where a message travels through the pipeline, gets to both outputs that it's routed for, manages to be delivered to one but not the other, and then is retried again and again and again in a fast loop, because then you're going to flood NATS with duplicate data and you're still not delivering anything to Kafka.
00:31:35
Speaker
When you add an output broker that's a fan-out, by default it will isolate the retries to the output itself. For example, if you did route a message to NATS and Kafka, and Kafka fails, what the broker will do is keep attempting the message at Kafka, and it won't nack the message. It'll just keep retrying, to make sure that we don't do that busy loop. Instead, we have a soft retry loop happening at the output level.
00:32:03
Speaker
And obviously, you still get the logs and the metrics and stuff. But by default, it's not going to enter that busy loop. And if you do want the busy loop, you can have it, by adding a bit more config on top that essentially would force the nack. Because, I mean, you might have it hooked up to a RabbitMQ input and still want the message delivered to a dead letter queue at the input level upstream, even if it did get delivered to NATS but not Kafka.
00:32:31
Speaker
But yeah, that's brokering. There's like a bunch more patterns as well. Okay. I mean, this is all stuff you could do with more sophisticated tools. I guess the edge here is that it's fairly easy to set up these patterns.
00:32:46
Speaker
Yeah, so I mean, with a config in Benthos, we're talking like 20 lines of YAML to have that brokering pattern I just described, with some processing on top. You could have that in like 20 lines of YAML config. And then you've got the metrics, logs, all the observability that you need. Plus it's portable, plus it doesn't need any access to the disk.
00:33:05
Speaker
Plus, the operational simplicity is stacked massively in its favor. But yeah, you're not getting the bigger fish like Flink, for example, it's obviously going to do way more advanced, supermassive use cases if your needs are that, like if you need these
00:33:24
Speaker
super advanced windowing algorithms and super efficiency on transferring and storing all that data, then sure, you're going to need to reach for something else. But if your use case is: I just want to read some Kafka, enrich it, there's like 20 HTTP services in this interconnected network of things I need to hit, and I just want to store the result in a file that my data scientists can edit. They can make a change: oh, it's not a POST,
00:33:50
Speaker
it's a GET. They might want to do some trivial change. Actually, the payload is slightly different now; it's capitals instead of all lowercase for this field. They can just go into a YAML file and modify that, submit it as a pull request to you, and you can just click approve, rather than them having to modify your code. Yeah. I'd much rather tell a data scientist to edit a YAML file than update some Kafka Streams Java.
00:34:18
Speaker
We also don't leave you completely out in the woods there. It's YAML, and obviously people do have issues with YAML, but you've got a linter, which is very nice. There's also a bunch of dev tools for building Benthos YAML configs and holding your hand through the whole process. There's an explicit schema as well. For example, you can use CUE,
00:34:38
Speaker
if you've heard of CUE, the C-U-E Go project; it's basically a better configuration system, I'm going to say. But essentially, it's more advanced and it's much more explicit. So you could use that; we generate a CUE schema. And we also generate a JSON schema, if you want to use that to help you build your configs out and that sort of stuff.
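The linter is a subcommand of the same binary:

```sh
benthos lint config.yaml
```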
00:35:01
Speaker
OK, then let's talk about the stuff around it. So operationally, what have you got for monitoring and that kind of thing? So there's obviously logs. I don't like logs personally; I've never really liked logs as a way of monitoring a service. But they obviously highlight specific issues if they occur: you can look at the log and it'll describe what exactly happened. But there's metrics for throughput and latency and all the stuff that you would
00:35:29
Speaker
consider important in a stream processing system. Every component that you add to a Benthos config will also have its individual metrics. So for example, if you've got three processors, a mapping, some jq, or, you know, some HTTP hit where you're going to hit some service, they will all have individual metrics by label that you can dig into with dashboards. So if you want to specifically monitor the errors hit by your HTTP service request, then you can have a specific thing for that.
00:35:59
Speaker
There's also, what do they call it now, the distributed tracing stuff. We've got OpenTelemetry support for distributed traces, so you can use Jaeger and all that stuff. That is actually really cool, because you can literally look at the entire journey of a message through a Benthos config. If you've got a massive, complicated Benthos config,
00:36:19
Speaker
you could literally see a picture of its journey. But personally, I don't really use it; I just use the metrics. Because obviously, I know what the metrics should look like, and if they don't look like that, then I want an alert straight away. But then also, there's no formal
00:36:40
Speaker
alerting system in Benthos. What you would do is you would just hook it up as part of your config; it's like an output for your config. So, like you described, you can have a fallback that's just: hit me with an alert directly. So you could have a pager or an email or whatever, or, you know, a Slack message. And things like that can just be glued into your config as if they were just any other destination. Because at the end of the day, it needs to have the same delivery guarantees, right? If you've got a message that's failed,
00:37:06
Speaker
getting an alert because it's failed is probably just as important as delivering the data itself, in terms of delivery guarantees. So the idea that you might just not get an alert because it was hooked up as a second-class citizen of the pipeline, if you think about it, that's not really that good a thing. That's obviously something that you would probably want to address if you could. So yeah, I'd just treat that as any other component. So yeah, the big three are the logging, metrics and distributed tracing.
00:37:36
Speaker
Where are the metrics? How do you access them? Does it come with a web GUI? There's lots of options. You can have a Prometheus scraper, you can send it to StatsD, there's InfluxDB, there's CloudWatch metrics.
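Each of those is a drop-in config block; for instance, swapping the default Prometheus endpoint for CloudWatch might look like this sketch (field names per the v4 docs, hedged):

```yaml
metrics:
  aws_cloudwatch:
    namespace: Benthos
```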
00:37:54
Speaker
Basically, for those options, it's just a config block, right? So you're just saying: instead of the default of Prometheus, send it to CloudWatch, at this address, and it's like a few lines of your config. Because by default it's Prometheus, and you scrape an endpoint that Benthos hosts. But you can also, if you want to do things locally and you don't like reading Prometheus metrics, and you don't want to hook up an actual metrics endpoint, you can just have it spit out JSON.
00:38:21
Speaker
And you can do that two ways. You can have it so that you can scrape an endpoint and get JSON-formed metrics, so you can see the counter go up every time we refresh the page. But also you can add the ability to log the metrics if you wanted to. So you can run Benthos as a one-off job, like a batch job or something, and it runs through a file and then writes it to Kafka or whatever. And then what it does at the end, just as it exits, it will spit out a JSON block
00:38:48
Speaker
of metrics for the run. It's usually the latencies and things like that, which is pretty cool. I didn't add that; that was somebody else who contributed it, which I think is pretty cool. Okay. That leads me into contributing, right? So from the list I've got, you support a fairly large number of inputs and outputs and a surprising number of different processing layers. Are you writing all those yourself? Are people contributing them? Is there a plugin mechanism? What's the deal there?
00:39:16
Speaker
So I would say about a half would probably be me, and then probably the other half is people just coming in and adding stuff that they want. And then there's a smaller number of people who are dedicated, and they add things because they think other people want them. It's kind of like a mini version of me. And the way it works is that there have been different generations of the plugin API
00:39:38
Speaker
to reach the right level of abstraction. Obviously, I'm trying to make the config for Benthos simple for people using Benthos, but then there's also the exact opposite end of the dev spectrum, where I want the developers building plugins to also have a fairly easy experience. It's the same philosophy: if you've got a really simple component that works basically like all the others,
00:40:01
Speaker
you should just be able to write essentially just a function for how to deliver a given piece of data or consume a piece of data or process a piece of data. And the plugin APIs are essentially designed so that you can define the configuration spec. So what does the configuration look like? That includes things like default values for fields, whether things are optional, whether fields are advanced, because you want to be able to generate nice documentation. So you essentially define what the configuration for a component looks like.
00:40:30
Speaker
And then you define the thing that you want to do with that configuration as, usually, just a nice function or a struct that implements a certain interface. And the idea is that if you've got a more advanced use case that needs a little bit more control, for performance reasons or just because of its functionality, so if you were going to implement a broker, for example, there needs to be a little bit more to it than just a function.
00:40:57
Speaker
For those, you would use a more advanced API that builds upon the other one, and then a more advanced one if you want to opt into other functionality. Most of those are internal. I've got a public API.
00:41:15
Speaker
And that's the one that most plugin authors will use, and that allows you to have your own custom build of Benthos with your own custom plugins. And I would actually say that a huge chunk of Benthos users have their own custom build with their own plugins in. And the idea is that they're first-class citizens, so you can generate documentation, and it will be the exact same Benthos website as the official docs, with your plugins there.
00:41:39
Speaker
And it's all the same thing. And, you know, the create tool that we talked about, and linting and all that stuff, works the exact same with those plugins as it does for everybody else. And you can obviously contribute official Benthos plugins that way as well. It's the same API. So somebody can write their own private plugin and then decide later, actually, the world needs this, and they can come and basically just copy-paste it as a PR, and then it sits alongside the internal ones.
00:42:05
Speaker
Let me think how that plays out. So I'm working at a bank, trying to connect a super ancient mainframe to some modern SQL database, let's say. So I write, presumably, some Go code? Does it have to be Go? And it implements your function?
00:42:24
Speaker
It doesn't have to be, it's just a lot easier. There's a bunch of options. Go is the best, I would say, for really hooking into the APIs, for the config specs and all that stuff. But if you want to, you can just execute a subprocess and essentially read from that. Not particularly good for delivery guarantees, because you're reading off just a stream of bytes rather than a back-and-forth protocol.
00:42:51
Speaker
You can also just hit an API. I mean, if you want to, you can run a sidecar service that exposes an HTTP stream endpoint and then use Benthos to consume that. There's also some WebAssembly stuff in there. So right now, I think, I'm pretty sure, I should probably know this, I've only added WebAssembly processors. So you can define a WebAssembly thing
00:43:14
Speaker
in whatever language you want and then execute that as like a processor. Eventually, I would like to have it so that you could define an input or an output with that as well. But the WebAssembly experience is a bit confused right now, I'll say. It doesn't really match the ethos.
00:43:34
Speaker
Eventually, I could write these things in Rust, which I guess is my go-to WebAssembly language at the moment. That's the goal. I think that's kind of the dream with all these hip WebAssembly-native languages, the Rusts and things: you can stick them in there. You won't get better performance, I don't think, but you can still do a hip language.
00:43:55
Speaker
Go still counts as hip, doesn't it? No, I think Go is business suits now. Oh, has Go gone that way? Okay, fair enough. And so these private plugins I'm doing, it kind of sounded like you were saying I need to maintain my own fork of Benthos. It's not dynamically loaded.
00:44:13
Speaker
It's not technically a fork. So it would be a custom build. So it's a Go build, and what you do is you basically import Benthos as a library. And you define your plugin, and it registers itself; you can have an isolated environment of plugins, or you can just have a global one. And then you basically call a function that's,
00:44:34
Speaker
I think it literally is just RunCLI, and it will basically execute Benthos as a CLI. So you're essentially running the official Benthos. Or, if you want to get a little bit cleverer, there's an API for building Benthos streams in code,
00:44:51
Speaker
and you can have as many as you want. So you could define a bunch of plugins, and then you can execute multiple streams in the same binary and stuff like that. And you're essentially running Benthos programmatically with that, which I do know quite a few organizations do, because they don't just want Benthos; they want to kind of nest Benthos in their own ecosystem. So they want custom inputs and outputs and things as well, that they might want tighter control over, rather than them just being plugins, if that makes sense.
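The custom build being described is typically just a thin module that imports Benthos, registers plugins via imports, and hands over to the stock CLI; a sketch (module paths as of Benthos v4):

```go
package main

import (
	"context"

	// Side-effect imports register components; your own plugin
	// packages are pulled in the same way.
	_ "github.com/benthosdev/benthos/v4/public/components/all"

	"github.com/benthosdev/benthos/v4/public/service"
)

func main() {
	service.RunCLI(context.Background())
}
```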
00:45:20
Speaker
Yeah, I can totally see that. So I end up writing probably a bit of Go with a very thin main function in most cases. Yeah, the main function is three lines. OK, gotcha. OK, you've led into this: if we're running multiple Benthos processes, within that custom main function potentially, we get into the issue of clustering. I mean, if I want to do some processing that's too large for a single machine,
00:45:49
Speaker
do I still use Benthos or do I need to break out into the meatier tools? There's no reason why you can't. Basically, it depends on the input. So if you've got Kafka, for example, the default behavior is, if you run multiple instances, they'll form a consumer group, so the partitions get distributed. So even then, you're handing that off to Kafka.
00:46:13
Speaker
Yeah, exactly. It's essentially determined by the sources. And obviously, with the majority of the ones in the streaming ecosystem, you just get it for free. So I mean, it's definitely been asked for. People will come in and say, hey, are you going to add any coordination between Benthos instances? And I would just kind of think, well, why? What are you doing with that? As long as you can fan the workloads out, what else is there? There's no state. There's nothing happening. Yeah, yeah.
00:46:41
Speaker
Because obviously, if you care, if you're doing some sort of windowing, because as I said, Benthos has a windowing capability in there, right? So if you needed to make sure that your windowed messages are all of a given type, or a given group, say, you're already dealing with that in Kafka via the partitioning scheme. You're keying messages that belong to a group, to a window, by the key in the first place. So, you know, you've already solved that problem.
00:47:11
Speaker
And it's the same with storage, persistence. You've solved that with Kafka or NATS or RabbitMQ. That's an issue that you've already had to deal with. So why make somebody deal with that again by having Benthos instances coordinate? Because then you've got to work out, well, how am I going to enable that? Just use the source. Use your input. That's what it's for. That's what it's built for. Yeah. And why make Ash write it again? Yeah, exactly. Why give me extra work?
00:47:41
Speaker
That's fair enough. Okay, so another thing that leads into: you talked a bit about windowing. So you've talked about enrichment; I'm trying to get the boundary lines here. You've got enrichment, and sometimes that implies joining different inputs, and sometimes it doesn't. But when it does, it implies statefulness. So what kinds of enrichment can I do in a stateless world?
00:48:07
Speaker
So, well, the easy one, the easy pattern, and if I talk about this for, what, 30 minutes, you won't bother me about the more complicated stuff. The easy one is you've just got a single payload that is just the world, right? So, I mean, a tweet, and you want to enrich it with: what's the language, what's the sentiment, who's it mentioning? Sorry, an X. And, you know, all that stuff. Maybe you've got a data science team who've deployed a bunch of services that will do that sort of stuff. So this is kind of like Benthos' bread and butter.
00:48:37
Speaker
Because what I did is I replaced a system that was very stateful and handled all these relationships between these things. You have to do the language detection first. You have to do the sentiment analysis afterwards. You have to do this afterwards. And there's always a massive dependency graph of these enrichments. And in the Benthos world, you also have to negotiate each API differently, because imagine each team who's made some enrichment will do it differently. It might be a different company. It might be completely out of your control.
00:49:07
Speaker
So the way all of this stuff works in Benthos is it's composed. So we've got an HTTP processor which just does a request: whatever the contents of the message is, that's what gets sent, and then whatever comes back replaces the message. But then you can compose that within what's called a branch. And the branch will describe a way of transforming the current message into something new.
00:49:31
Speaker
And then you do any number of processors, so one of them will be your request. And then there is a mapping afterwards that describes how to merge that back into the original payload. So you don't lose the original contents of the data, and you don't have to send everything out,
00:49:46
Speaker
which is a big, big problem for efficiency if you've got this massive payload and you have to send it out to all these different services. You obviously don't want that; you want to just create a subset of the message, and then whatever comes back, you form it back into the new payload. So it's kind of like this abstraction of: map, do the enrichment, and then map it back. And then what you can do is you can take
00:50:08
Speaker
that block, and you can compose it. So if your use case is complicated, you've got this big network of things you have to hit, we've got what's called a workflow processor, a very loaded term. But in this case, what it means is you've got a bunch of these branches, these enrichments that you want to execute,
00:50:29
Speaker
and they essentially have a dependency graph. You want to do maximum parallelism. If you've got a bunch of services you can just hit straight away, you want to hit those all in parallel and then aggregate the results. Whatever depended on those, you'll do those next, and so on and so forth. You can either do that automatically,
00:50:48
Speaker
by allowing Benthos to analyze these mappings, so it knows what you're using as references for the enrichments and it knows what you're mapping back into the payload, so it can build a best attempt at working out what the dependency graph is. Or you can just make it explicit; you can just add in
00:51:07
Speaker
a list of: these ones in parallel, then these ones in parallel, then these ones in parallel. And then it will execute those things in a streamed fashion. So if it's reading Kafka data, for example, it might be processing 24 messages at the same time, because that's something you can tune: how many messages get processed in parallel.
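A sketch of what such a workflow of branches can look like in config (service URLs and field names invented; here the dependency graph is inferred automatically from the mappings):

```yaml
pipeline:
  processors:
    - workflow:
        branches:
          language:
            request_map: root.text = this.text
            processors:
              - http:
                  url: http://langdetect.internal/api   # invented service
                  verb: POST
            result_map: root.language = this.language
          sentiment:
            request_map: root.text = this.text
            processors:
              - http:
                  url: http://sentiment.internal/api    # invented service
                  verb: POST
            result_map: root.sentiment = this.score
```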
00:51:24
Speaker
And then each one of those will enter this little workflow execution engine you've configured, and all these parallel requests are happening. And then you can add a little bit on top of that to be efficient around batching. So for example, you can create micro-batches at the input level,
00:51:45
Speaker
and then send those batches to your workflow and then the branches themselves can all have custom behavior as to whether they send out individual requests or if they create a batch. So say like a JSON array or line delimited messages or whatever, whatever the individual protocol is of those services, your little encompassed enrichment config
00:52:08
Speaker
can basically choose how it handles batches, just for a little bit more efficiency than leaving it all as single-message interactions. So that's the simplest one, and obviously it gets more complex. So basically what I've done there is I've taken systems that I've seen in the past that have essentially a flipped mentality, where messages come through and we're going to hit all these
00:52:34
Speaker
services, and then each service is essentially a stage in a workflow stream. So the output of one will be stored back into Kafka, potentially, and then reprocessed and stored back into Kafka, and then processed again and stored back into Kafka, and it's a slow-moving pipeline.
00:52:54
Speaker
And you've got, you know, data potentially persisted multiple, multiple times. And what I've done is I've kind of flipped that on its head and just said, well, okay, if realistically we're only talking minutes to do all of these requests, you could just do that in memory. There's no reason why you can't, as long as it fits in memory. And, you know, that flow isn't going to change, because the speed at which you can hit all those endpoints doesn't magically change because you've
00:53:19
Speaker
persisted to disk over and over again. So if you can realistically just do this in memory, why not? And the idea is that you've then just made this composed config that's not necessarily, I wouldn't call it stateful. Technically, it is, because it's got state in memory. It's got a message in memory. But if everything crashes and it starts back up again, it just does the same thing. The last message that didn't make it all the way through gets reprocessed. A little bit of duplicate effort; you're not going to notice it unless there's something really wrong.
00:53:48
Speaker
That's not an issue for most people operating a pipeline. And if it is an issue, you can do the same pattern as before, the stream-one-after-the-other approach, because it's all composed. So then there's the more complicated thing, which is joins.
00:54:08
Speaker
That's something that at the beginning I didn't really care too much about, because in my mind the windowing systems that these data engineering products were putting together were kind of a separate thing from what our team was doing. Something I was doing as part of my career was essentially data joins through caching. So you might have a Memcached instance or Redis or something, and what you do, if you've got multiple streams that you need to join,
00:54:35
Speaker
is you read all of these different queue systems and populate the bits that you need from them into these caches. Then you choose one of them as the canonical stream, the one that you flow through, and you basically hit all these caches almost exactly the same way I just described the enrichments, where each cache represents a bit of the stream that you need to join with,
00:54:59
Speaker
whether that's some topic or some queue system; it doesn't really matter what it is. Essentially, you need to acquire a piece of information based on some signature from the data, and whatever comes back, you merge it into the new payload that's being formed. Then eventually you either make it all the way through or you don't, in which case the interesting behavior comes along, where maybe you put that into a dead letter queue that's got a delay on it.
00:55:25
Speaker
And maybe that's tiered, I don't know; as far as you want to go. You essentially make sure that payload is going to be retried if we couldn't find all the information that we needed. And what that represents is your window, because you're obviously going to put a cap on how much time you're willing to wait for a payload to reach all of its
00:55:45
Speaker
other pieces of information in the pool. Maybe a day, let's say; that means your window is now a day. And rather than having a conceptual window that actually exists, one weird thing that's been designed to work on a disk or S3 or whatever, what you have instead is just a bunch of Memcached instances with some info in them.
00:56:11
Speaker
Now, that's not going to be anywhere near as efficient for some payloads, so there are efficiencies you can definitely benefit from by not using that approach. But if that's not the case, and it's literally just that there's a blob of data with a key
00:56:27
Speaker
from all these different topics, and I just want one overall blob of data based on that key, you can get away with it. You can have massive-throughput, huge-volume data pipelines where you have complex joins, and you're not doing anything interesting; there's nothing clever going on there.
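As a rough illustration of that cache-join pattern, assuming roughly the current config schema (the topics, labels, and TTLs here are all made up): one pipeline populates the cache from the secondary stream, and the canonical stream looks each message up and parks misses on a retry topic. In practice these would be two separate config files, each carrying the cache_resources block.

# Pipeline 1: populate the cache from the secondary stream.
input:
  kafka:
    addresses: [ "localhost:9092" ]
    topics: [ "user_profiles" ]
    consumer_group: "profile_cacher"
output:
  cache:
    target: profiles
    key: '${! json("user_id") }'
cache_resources:
  - label: profiles
    memcached:
      addresses: [ "localhost:11211" ]
      default_ttl: 24h  # this TTL is effectively your join window

# Pipeline 2: the canonical stream, enriched from the cache.
input:
  kafka:
    addresses: [ "localhost:9092" ]
    topics: [ "orders" ]
    consumer_group: "order_joiner"
pipeline:
  processors:
    - branch:
        request_map: 'root = this'  # hand the branch a copy of the payload
        processors:
          - cache:
              resource: profiles
              operator: get
              key: '${! json("user_id") }'
        result_map: 'root.profile = this'
output:
  switch:
    cases:
      # Lookup failed: the other half hasn't arrived yet, park it for retry.
      - check: errored()
        output:
          kafka:
            addresses: [ "localhost:9092" ]
            topic: "orders_retry"
      - output:
          kafka:
            addresses: [ "localhost:9092" ]
            topic: "orders_enriched"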
00:56:46
Speaker
So your operational job as the person deploying all these services is to make sure that these caches are good. Obviously you need to make sure they've got enough retention, the disks are there, and either there's a backup or you've got some sort of system in place for recovering that data if you need to. A mechanism for repopulating the cache if it crashes?
00:57:08
Speaker
Exactly. You might have some mechanism for doing a backfill if you find that some of those caches have died, in which case you also need to have a policy for how long that's going to take, realistically, because we might never catch up. Stuff like that, you need to have an answer for.
00:57:24
Speaker
As long as you're comfortable with that sort of stuff, then you now have a very, very simple pipeline. It's an extremely simple config for the actual busy work that Benthos is doing. And you know that Benthos, if it crashes midway through a job, just restarts and picks back up where it was. The data in Kafka, you need to make sure that's replicated and persisted, but you had to do that anyway. That's essentially where your data lives; that's its real location until it's been delivered somewhere safely. Your at-least-once delivery guarantees exist in Kafka.
00:57:54
Speaker
that's just the reality. If your Kafka clusters all die and all the disks are gone, your data is gone, and that was always going to be the case. So that's the bit that you worry about in terms of data retention. The caches need to be alive. And obviously, Kafka is what refills them if something goes wrong. That's probably a procedure that's going to be manual, right? You're going to have to make sure that you're able to manually kick off a backfill, in which case you'll have some
00:58:18
Speaker
procedure that the operations team needs to know. But in my mind, that works out a lot simpler operationally than also having this other tool that's doing complex aggregation on disk, which has its own state that you might need to recover, because now you need to understand what that looks like. What does recovery look like for that? How do you make sure it's backed up? How do you do all these other things? Memcached is obviously lossy AF.
00:58:43
Speaker
So the idea is that you don't have any pretense that that's a safe place to keep your data forever. The knowledge is always there that it's Kafka that is the real source of persistence. Yeah. And I can also see an argument here that, as long as configuring these pipelines is easy,
00:59:06
Speaker
there's real value in, you know, I can get you this join put together as a proof of concept within an hour, and it's going to take a sprint to do it in Kafka Streams or something to make it more fault tolerant. There's tremendous value in being able to ship it within the hour, right? There's a lot of people who spin up those quick proofs of concept, and then what happens is they have a bit of fun with it and, I get it, they stay with it.
00:59:34
Speaker
So maybe we should talk about that next, because this is, I mean, you're unemployed, employed, self-employed, this is your existence. So technically I'm self-employed, but I'm not a very attentive boss, let's say. What's your life as a programmer working on Benthos? Tell me about that.
00:59:55
Speaker
So I feel like I'm in this kind of situation because basically I live off ad hoc support contracts and stuff, which was basically just me doing stuff to live on until I figured out what I actually want to do. And I've just been doing that now for like four years. But also the majority of my income now is sponsors.
01:00:18
Speaker
But that's an extremely lucky situation that I definitely would not be able to just reproduce. Essentially, it means I can figure out what I want to work on at any given time. And with that freedom,
01:00:34
Speaker
the thing is, when you've got an open source project and you're trying to keep it going, that's the number one concern. The concern I don't have is taking over the planet. It doesn't need to be the most popular thing out there; it doesn't need to be anywhere near close for me to reach my goal, which is to live off keeping this thing going.
01:00:57
Speaker
Essentially, my job is to tend to the code base, obviously. That's the one that everybody knows: I have to write some code every now and then. But you're also evangelizing a little bit. In my mind, that doesn't mean marketing and finding new people; it's developers that do that for me. I just make sure they're happy enough to spread the word. My evangelism is making sure that the documentation is good
01:01:21
Speaker
and fun. The various support channels that we have are active. So we have lots of people in our various chats; we've got Discord and Slack, and as soon as a question pops up, one of us is usually on it within a couple of minutes. And then you make dumb videos and you go on people's podcasts and stuff like that. But essentially, the way that my day-to-day looks
01:01:47
Speaker
is: whatever I feel like doing is what I'll do, because I like a bit of variation; I like a bit of context switching, personally. So if I wake up and I just don't feel like coding one day, there's obviously a lot of non-coding stuff I need to get done,
01:02:03
Speaker
like, you know, documentation, testing, videos, all those things. And what you're doing is basically switching roles. So I might be a product guy one day, and I'll just have a three-hour shower where I'm thinking, what do the next three years look like? How are we going to get there? All that stuff. And then maybe the next day I just want to draw stupid blobfish and crochet them.
01:02:25
Speaker
Basically, I'm just doing dumb artwork where one day I might go, oh, that's perfect for this thing, and then I'll put it in my blog post and I don't have to worry about, oh, I need some graphics for this, it looks a bit boring. For a lot of us, this is the dream, right? This is the programming dream. Any tips? I feel like open source is one of those things where everybody can dabble in it a little bit, and then we all have this pipe dream of, oh, one day I'll make money from this. And it's definitely
01:02:55
Speaker
great for a lot of reasons. But also, if you don't have the self-control to do certain things, then yeah, it ends up becoming a nightmare, because it will end up happening: you basically become a CEO, but for a company that doesn't exist.
01:03:12
Speaker
You can easily get yourself stuck in a situation where you're basically working for free now on stuff that you don't enjoy. Keeping the joy is the fundamental bit because you are going to be working for free and then just hoping that you can live off it by some means.
01:03:29
Speaker
If you feel like you're stuck and you're doing stuff with no compensation and, oh my god, I need to get a real job at some point, you're going to freak out and you're not going to enjoy it anymore. Basically, you have to keep this joy going. You're not really your own boss. You're more like your own
01:03:49
Speaker
I don't know, guru or something, but you're trying to get into your own head of, how can I keep myself enjoying this thing? So that when it feels a bit tough, and there's obviously days where you don't want to deal with people, or you don't want to fix this bug, or you don't want to handle this pull request or whatever, you need to kind of get into that headspace. But then, yeah, the main trick is don't burn out, and that means you've got to be able to jump around. And I think most people would not enjoy getting to that stage. Once you've got a lot of users,
01:04:18
Speaker
and you need to be doing a lot, you have a lot of spinning plates. A lot of those plates are things that people don't want to bother with, and it ends up not being as fun. But yeah, obviously, if you can make it work, then it's great. It's like a dog chasing its own tail: every day I'm distracted and doing whatever I want, but also kind of being led, because the world is essentially telling me what I need to do at any given point.
01:04:46
Speaker
That's nice, because when you've got a large user base, you're not living in a cave disconnected from the world. You've got these people anchoring you and keeping you sane, I guess. Yeah, yeah, definitely. So we have regular community calls as well, so we can actually see each other's faces. On the very rare occasion, we actually physically meet in a real world space, but that's obviously very difficult to coordinate. But yeah, a lot less lonely once you've got a big enough community that people can be around a lot.
01:05:14
Speaker
and not just like five minutes every month when they've got a question. Yeah. Okay. That's cool. That's cool. I hope it continues for a long time, full of joy.
01:05:24
Speaker
I have to ask you, there's one more topic I have to cover, and it's a little bit out of order because we normally end on the lighter stuff, but let's go to something harder: the processing layer, right? I know you've got quite a few options for the different ways you can write processing, including awk, an excellent choice for the old-school kids. You've got jq in there. I know you've got a few others, but you've also got your own language for that, right?
01:05:50
Speaker
Yeah, you're developing, what is it, Blob? Tell me. Bloblang. Bloblang. Yeah, Bloblang. Why? Yeah, so I didn't have it for a long time. I didn't want to build a custom mapping language, and you can obviously fit them in; there are brilliant Go libraries for jq, JMESPath, all these things.
01:06:15
Speaker
Brilliant people behind them as well. I wanted to get those in because it unblocks people. You can do arithmetic in awk, which was a blocker at the time I brought that in. Obviously, a lot of people are used to jq and things. There's a JavaScript processor as well now, for people who want to do that.
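For instance, if jq is the tool you already know, a processor like this drops straight into a pipeline (a sketch; the query itself is just an example):

pipeline:
  processors:
    - jq:
        query: '{id: .user.id, name: .user.name}'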
01:06:34
Speaker
Eventually, I would like to have a Python one, if we can. I don't want to limit people's choices as to what mapping languages they use, and I kind of felt like it wasn't really my job to solve mapping either. The problem we had, though, is that there's a lot of mapping that is very, very
01:06:54
Speaker
complicated and messy for like big nested structures. There's definitely justification for having a language that specializes in that stuff. Luckily, I didn't really have to design that because there was a language called IDML.
01:07:10
Speaker
It's a very, very small project, but it's a very, very cool language that was essentially designed specifically for mapping, JSON to JSON, or really anything to anything; it's just structured data. The idea is that you can really dig into the nested, horrible,
01:07:29
Speaker
massive structures that people are used to with a lot of enrichments, where you've got arrays of tokens or something like that, these really gnarly, deeply nested objects that you might need to zip up with something else. I had it in my head that I was going to eventually support IDML itself, but it was really difficult to get it working in Go,
01:07:50
Speaker
because it's basically a JVM thing. It was a pretty messy experience. But then the other side of things is, I didn't just want to have this extra mapping language; I also wanted it to be a native component within Benthos, so that we can do clever stuff with it. The fact that Bloblang is essentially native to Benthos is how the branches I was describing can infer the dependency graph of enrichments.
01:08:12
Speaker
The reason that's possible is that the map to translate to and from an enrichment is Bloblang, and I can analyze Bloblang to see what we're depending on
01:08:24
Speaker
as part of the data transformation, and what we create at the end of it when we're merging it back into the new object. The fact that I can analyze that means I can infer the dependency graph, because I can see where certain fields come from and who uses those fields, that sort of stuff.
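Sketching what that inference looks like: no order field is given below, but the second branch's request_map reads a field the first branch's result_map produces, so Benthos can work out that it has to run second. Branch names and endpoints are hypothetical.

- workflow:
    branches:
      geo:
        request_map: 'root.ip = this.ip'
        processors:
          - http:
              url: http://geo-service.local/lookup  # hypothetical service
              verb: POST
        result_map: 'root.geo = this'
      weather:
        # Reads this.geo.lat and this.geo.lon, which only exist after the
        # geo branch's result_map has run, so weather is inferred to
        # depend on geo.
        request_map: 'root = {"lat": this.geo.lat, "lon": this.geo.lon}'
        processors:
          - http:
              url: http://weather-service.local/lookup  # hypothetical service
              verb: POST
        result_map: 'root.weather = this'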
01:08:43
Speaker
But it was also apparent that I can support jq to an extent, and I can support awk to an extent, but having a mapping language that I actually wrote makes it a lot easier for me to support people with really complex use cases, because I can immediately see, oh, you can fix this, you can do this, blah, blah, blah. From the support aspect, it's been a massive benefit, because somebody
01:09:05
Speaker
could be coming in with a completely different use case, they might have awk in their config, but the fact that I can just give them some Bloblang that does something really gnarly and complicated, but specifically fixes their thing, kind of frees me up. I'm unblocked on this particular question.
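For a taste of the kind of snippet he means, here's a small hypothetical Bloblang mapping, wrapped in a mapping processor (called bloblang in older versions); every field name in it is invented:

pipeline:
  processors:
    - mapping: |
        # Pull a deeply nested value out of the payload.
        root.user_id = this.payload.user.id
        # Keep only the interesting tokens and normalize them.
        root.tokens = this.payload.tokens.filter(t -> t.type == "entity").map_each(t -> t.value.uppercase())
        # Catch a per-field error instead of failing the whole message.
        root.score = this.payload.metrics.score.number().catch(0)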
01:09:21
Speaker
That's interesting, because there are so many places in the design of this where you've solved the problem by saying that's someone else's problem, and pushed it out to a different system that takes care of it. This is a place where that's reversed, and you said, I'm going to bring that particular problem in-house.
01:09:35
Speaker
Yeah, exactly. And it's definitely a scenario where you've made a compromise: you're going to have this thing that you have to support indefinitely, and it's not trivial making a mapping language. I took a lot of inspiration from IDML, so I didn't have to design it completely from scratch, and because I'd had a lot of time to mess around with use cases and stuff, I knew roughly what it had to look like in terms of error handling and flow control and all that.
01:10:04
Speaker
So I wasn't coming at it completely blind, but it was literally just writing a language from scratch. I initially thought I was going to put it out there and it would be a niche feature within Benthos, and then I would probably just let it sit as that, almost like it's just a way of querying fields, and that would be it. I was just going to see how people received it, because I don't want to force people to use a new language; that's obviously the benefit of having all these different processors and stuff.
01:10:32
Speaker
But people just used it. People went with it and used it, so I thought, OK, well, I'll support it then; I'll build it up. And it's got its own plug-in API, so you can write your own Bloblang functions, just the same as you write other plug-in types. And you can also use it as its own library; there are a few organizations that are using it as their translation language. But again, it's not really...
01:10:55
Speaker
Yeah, exactly. They've basically just got this thing running as a library in their own applications. It's not really something that I'm putting out there as its own project, because I don't have the resources.
01:11:12
Speaker
I don't have the energy to convince people that this is an awesome language that does all these things. I'd rather just leave it out there, and if people want to use it, they can use it. But it solves the Benthos problem, which is that it's a lot easier for me to support, people use it, and it's native to Benthos. So it ticks a few boxes that I didn't have checked until I made it, and it's not a massive thing to maintain.
01:11:36
Speaker
It's not Turing-complete or anything like that, so it's a fairly simple language as far as languages go; it's easy for me to keep on top of. Okay, so if I want, and if other people want, to get started with this, where should I start?
01:11:54
Speaker
Go to benthos.dev, and there are a few options on the website. You can either do a getting-started guide by just going to the docs, or there's a video where I talk about it, if you want to see more of me.
01:12:11
Speaker
Then you can go down the video route. It's about identifying people; there are different categories of people. Some people need a video to feel engaged in something, and then some people hate videos, so we've just got all of them. If you go to benthos.dev, you can find the exact thing that you need. Some people just want to jump straight in; there are like five config examples on the front page for various things, so you can literally just copy-paste a config, run it, and then start from there and play with it.
01:12:40
Speaker
That's probably what I would have done, to be honest. Then, you know, you can just go ahead and do that. It's easy to install because it's Go, so you can either just download the binary, or you can use Homebrew and things like that. But yeah, benthos.dev. It's Go, so I'm assuming it's Windows, Mac, and Linux? Yeah. Yeah, cool. MIT license? Yep. I think that's all the headlines then.
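For reference, a config about as minimal as they come, as a sketch: save it as config.yaml, run benthos -c config.yaml, and it reads lines from stdin, uppercases them, and writes them to stdout.

input:
  stdin: {}
pipeline:
  processors:
    - mapping: 'root = content().string().uppercase()'
output:
  stdout: {}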
01:13:06
Speaker
I think I've got the rest of the afternoon free. I think I'm going to spend it trying to convert my Python script and end where we began. Oh, good. Yeah, you can send me the script and I'll give it a burial. Thanks very much, Ash. Great pleasure talking to you. Thank you for having me. Thank you, Ash. And I hope you'll be sending me a toy blobfish when those go into mass production one day.
01:13:33
Speaker
As we said, if you want to check out Benthos, it's benthos.dev. But actually, since you now know what it does, the place I would start is a page that lists all the input sources it supports, all the processor types, and all the outputs. That will very quickly tell you what it could do for you today. So I'll put a link to that; it'll be the second link in the show notes, and it takes you straight to that list.
01:13:58
Speaker
Before you click that link, if you've made it this far, please click like and subscribe and rate and all those buttons, and the algorithms will then make sure we see each other again soon. Oh, and if you happen to be listening to this on Spotify on the mobile app,
01:14:14
Speaker
please consider rating Developer Voices. I don't know why, but you can only rate podcasts on Spotify in the mobile version. So it's an oddly specific feature that I have to make an oddly specific request about from time to time.
01:14:29
Speaker
I've been thinking about it. I bet there is some weird internal reason why it's set up that way. So if you work for Spotify and you can get permission to talk about it, please get in touch and come on the show. I would love to do an episode about the difficult realities of writing software at a company the size of Spotify with the user base and the number of platforms that Spotify has. I think that would be fun.
01:14:56
Speaker
All that said, I think it's time to go. I've been your host, Chris Jenkins. This has been Developer Voices with Ashley Jeffs. Thanks for listening.