Introduction to Ben Gamble and His Expertise
00:00:00
Speaker
They say one of the keys to intelligence is the ability to make connections between disparate pieces of information and to see patterns and links before everyone else sees them. If that's so, then today's guest might be the most intelligent person I know. I'm joined by Ben Gamble, and he's just exceptional at seeing the connections between different technologies and different technical ideas.
00:00:27
Speaker
And he has such a varied background as a programmer that he's got like a wealth of different experiences to draw on. So who better to gaze at the programming landscape with us and ask, with all these pieces, how are we supposed to fit things together into a coherent system?
Database Architecture: One or Many?
00:00:44
Speaker
Which bits of technology belong in a modern architecture? Do we really want half a dozen different databases with an event bus between them? Or would we be happier with just a general purpose relational database ruling the world?
00:01:00
Speaker
Because you can build a thriving business with just MySQL and Python. You really can. But by the time you're a thriving business, are you going to start to wish you had a dedicated analytics database like Apache Pinot for your queries? Are you going to wish you had Redis as a high-performance caching layer? Which bits should you use? And when do you make the trade-off to switch? And when the landscape's this vast, why isn't there a map?
Chris Jenkins and Unpredictable Weather
00:01:29
Speaker
That's the topic up for discussion today, and it's a huge one, so we'd best get started. I'm your host, Chris Jenkins. This is Developer Voices, and today's voice is Ben Gamble.
00:01:53
Speaker
I'm joined today by the inimitable Ben Gamble. Ben, how's things? Things are great, if slightly warm today, I suppose. Yeah, it's cooking up in England. It really is. We had a wet and remarkably cold summer until literally this last week or two, it seems.
Ben's Journey from Gaming to Engineering
00:02:11
Speaker
That's the joy of England. I used to know this Italian person who said she couldn't understand why the English talk about the weather so much until she came to England. And then the weather's so interesting because you never know what you're going to get.
00:02:22
Speaker
You really don't. What's funny is one person on my team, it turns out, literally lives within walking distance of my parents, and we've been going to the same library for the last 30 years without knowing it. He's 10 miles away from me, and we normally have different weather, even though we're in a highly flat part of Cambridge.
00:02:40
Speaker
I kid you not. It's quite funny. Okay, we'll save that for the meteorology podcast, trying to explain why that would be. You've got one of those avalanche minds. Whenever I poke at it, I get just an avalanche of information coming out. That's why you're here.
00:02:58
Speaker
I want to get into that, but before we do, one of the reasons you have so many perspectives and angles on programming is because you've got such a varied backstory. You've worked in so many corners of the industry. So I'm going to start there: give us your biography.
00:03:13
Speaker
Sure thing. I got into the industry the usual way a lot of programmers do, at least the usual way most people do, which is you see video games and think, I want to make those. Then you go to the local library and find a book that says Learn C++.
00:03:28
Speaker
That painful book has a lot to answer for. Why do we do that to children? I know, I know. And then my mother, being wonderfully supportive, went and said, well, there's a course down at a local college you can go to for that. So me, age 11, growing up in Cambridge, was surrounded by postdocs, learning C++ to some NVQ level way back when, back in 1999.
The Technicalities of Game Development
00:03:54
Speaker
And so that's where it all started. And then at school, I was lucky enough to be able to play with everything from microchips, coding in both BASIC, yes, actual BASIC, on microchips, to some assembly. I had a fun maths teacher who used to run C# courses during lunchtimes every now and again. And so I kind of got into it through all this. And then the fateful thing happened: I found modding tools for video games.
00:04:20
Speaker
and kind of decided to build and attempt to break every single video game ever, from hacking data files in things like GTA 3 to using the Unreal Engine editing kit to basically build my own levels and then occasionally mod games. Oh, wow. Yeah. That's a cracking way to grab a teenager's imagination. It really is, because you've just given them a bunch of tools and said, are you smart enough? And then your first instinct is, yes, yes, I am. Let me go prove it. And then the answer is, no, no, I wasn't.
00:04:51
Speaker
All I did was learn a lot about loops and what doesn't work. How to crash a game. Very easily. You learn some of the fun things, like what scale actually costs, and that there are limits to what your machine can do, and that there's a real reason game designers make certain levels certain shapes, so you can't see things around corners and they can lazy-load them in.
00:05:13
Speaker
Oh yeah, that was a serious thing back in Doom. It's still a serious thing today. So Doom did this a bit, but it really became big in the Build engine, and The Dark Project was really famously a portal renderer, where the idea is you render things through little portal doors to the next zone and never beyond that.
00:05:35
Speaker
I went to university originally to do some sort of avionics engineering, which I hated, and then transferred to astrophysics. I have a degree in astrophysics, for which I basically spent 90% of my time trying to avoid physics.
00:05:52
Speaker
Basically, every computing module you can imagine, every stats module you can imagine, electronic engineering modules, and then enough physics to get by and still get the qualification.
From Consulting to Augmented Reality
00:06:05
Speaker
Basically, I was hooked into various little bits and pieces. The first version of Unity had come out on PC, after originally being a Mac product, and me and a friend tried to build games on it.
00:06:18
Speaker
Still a good friend. We were chatting about building games in San Francisco last week, because we didn't know we were both flying out.
00:06:25
Speaker
world being a small place. And then after uni, I got my first job at a consultancy doing technical consulting and management consultancy. I got that by talking about Minecraft mods I'd built. Really? I know. Yeah. So I used to build world generators to kind of plug on top of Minecraft. And I was trying to build worlds based on the kind of equations I'd been learning in planetary science, things like, how do you make a Mercury-looking world in Minecraft?
00:06:51
Speaker
Oh, cool. Yeah. So this is what I was doing at university rather than actually my degree, which I really struggled through the end of. But also on top of that, trying to prove that a 3DS didn't need two cameras. So I tried to build an augmented reality system on my own. From scratch. I don't recommend it.
00:07:09
Speaker
I got halfway there, and that actually got me hired at the time as an image processing engineer for the consultancy. I can say that. And there I joined in on a bunch of very large system developments, everything from inspection machines for drugs. So the drug capsules you buy have all gone through a variety of inspection machines. I helped design bits of the insides of one.
00:07:30
Speaker
It was amazing. I got to basically play with high-speed cameras, gigabit Ethernet back when it was still a rarity, and then custom GPU imaging techniques to basically say, how do we look for defects and such in true real time? This is a million items per day, one-to-one rejection. As it's sliding along the conveyor belt. As it gets printed and then goes over the edge of the conveyor belt, you have to do air rejection, blowing off the capsules.
00:07:55
Speaker
It looks very cool to watch, but then it gets faster and faster and faster and you realize what can really go wrong at speed. I kid you not, the first bit of debugging I did, after we couldn't work out what was going wrong, was an oscilloscope on the trigger wires of the camera. I found something! I found an unbalanced lack of a common earth between various UPSs in the system.
00:08:17
Speaker
Oh my God, that's serious. Oh yes, but also very annoying. I lucked out by finding it. My colleagues at the time said no one else would have done that. That was kind of a weird move, but it worked. There's literal debugging, where you pull the moth off the circuit board. Pretty close. And there's one level above that, which is an oscilloscope. An oscilloscope between two things, watching pulses go by to see if the camera's trigger is actually a proper rising edge in the right place.
00:08:45
Speaker
Yeah. And from there, I built embedded circuits, programmed tiny 8-bit micros and some other bits as well. And I ended up doing mobile dev because I kind of talked about the augmented reality thing. And then one of the partners overheard that I could write code for devices and asked, how long do you think it would take to write an iPad app to do a questionnaire? And I said, a few weeks. Shouldn't really be that much more complicated. I had a vague understanding of how; I didn't know it was Objective-C, I thought it was still C++. Right.
00:09:14
Speaker
And then a Mac turned up on my desk the next day with an iPad on top.
00:09:19
Speaker
It's one way to get the gear. Three weeks later, I shipped the app for this internal thing, and it did very well, except I had to speed-learn Objective-C, which I do not recommend. From there, I increasingly kind of ended up in these bigger and bigger systems, things like camera systems, inventory management systems, and then a lot of apps which had to communicate either as IoT devices or otherwise.
00:09:47
Speaker
And then what happened next was, over time, I specialized deeper and deeper into trying to make things a bit more interactive. So I did a lot of augmented reality, and then a lot of just high-speed data, things like, how do we deal with low-transmission-rate UHF signal lines? How do you deal with that? The answer is, it's fine as long as you're accepting of a data rate of almost nothing.
00:10:10
Speaker
I mean, think the low end of a null modem. Not all bad. The thing is, two Ethernet ports went in either end. It was transparent, if slow.
00:10:22
Speaker
That's where I started out. A lot of things I can't talk about due to official secrets. It's a bit weird, but generally, the more secret, the more boring. I can believe that, actually. Then I left and founded an augmented reality company on Google Glass. Google Glass. Oh, Google Glass. You were actually one of those companies. Yes. There's a YouTube video of Race Yourself you can look up, for augmented reality exercise games.
00:10:52
Speaker
Yeah. So it was the whole idea of running through the park with someone chasing you, or a personal trainer telling you to slow down and speed up. Oh, yeah. Yeah. I could see that. Those kinds of things were just what we did lots of.
00:11:03
Speaker
And then, because I'd been involved in the launch of WebRTC when I was at the consultancy, so when Google launched WebRTC, I built an air drum kit for it with a Kinect. So you could do this Jam with Chrome thing, if you remember that, with an air drum kit. That was at I/O. It was good fun. And what happened was I realized, hey, why don't I just take this tech and reuse it myself? So I used WebRTC to do multiplayer games over Google Glass. So you run around playing Pac-Man with other people in the park.
00:11:32
Speaker
I'll have to share the video with you. It's good fun. It was over 10 years ago now. I'm always sad that Google Glass didn't evolve into something. Yeah, it felt like it should have, but then it was always that weird moment when we were telling people about turning on the GPS chip and they said it didn't have one. I showed them the bill of materials and I showed them the GPS strings coming out of the device and they were like, how did you do that? I said, did you not try?
00:11:56
Speaker
So after that petered out and a bit of pivoting happened, I stepped away and was working for a while at a hiking app company called ViewRanger. I was there doing more augmented reality on iOS and Android. So cross-platform building of basically AR apps to label mountains. Oh, nice. And then low-level GPS code to safely correct positioning in sharp valleys and things like this. So how do you de-amplify error? How do you make it safe so that all the mountain rescue teams who used this app could actually do it safely?
00:12:26
Speaker
and actually navigate safely around the mountain, where it told you where things really were rather than amplifying error, which you can do quite easily. Have you pitched any of this to Apple now that they're doing their Apple Vision thing? They must be clamoring. I have a picture of Tim Cook using the Vision thing with the labeled mountains. So that app company was in the Series 2 watch launch; there's a literal WWDC video where one of my old colleagues is on stage talking about the watch app.
00:12:53
Speaker
Oh, cool. Yes. I know for a fact they've seen some of this stuff already. Right. That's quite a cool feather to have in your cap. Oh, it is. There are loads of these little bits and pieces. I just wish I was physically there, which I wasn't. I was back at the head office going, please don't explode the server. Please don't explode the server. The firestorm of clicks only nearly exploded the server, but this was pre-cloud, and it was a physical machine somewhere in a data center.
00:13:22
Speaker
Originally they were a company that launched on Symbian. Symbian? Yeah, Nokia's OS. There's a name I haven't heard in a while. So that was originally where the app had started, and they ported it over to Android and iOS, but now they're part of Outdooractive.
Open Source and Developer Education
00:13:39
Speaker
So it's kind of still a big app. It was big; it had, you know, 5 million actives. So it was a pretty cool time.
00:13:46
Speaker
From there, I kind of stepped away due to a logistics company I'd been building in the background for a while with my old investors. Along the way I did a ton of contracting, for bootstrapped-startup reasons, where I worked at Rare, Microsoft, on things like Sea of Thieves and Everwild,
00:14:01
Speaker
which was good fun. I worked on AI development there for their games. Originally, I was brought in to bring literal generative AI thinking into the game. That was the original remit of why I was there. It never quite happened due to remits shifting around, but that was what I was hired for. What year was that, that you were trying to do generative AI in games? 2016. 2016, yeah.
00:14:23
Speaker
And what's funny is we actually built something very similar to an LLM for addresses at the logistics company. So there was a lot of NLP processing behind the scenes, where we were basically taking in addresses, pulling them apart by parts of speech in additional steps, and doing that kind of functional correction across the stuff.
00:14:42
Speaker
Then from there, I went through a bunch of things where I consulted at places, building literal operating systems from the kernel up, which is fun and silly, to doing crazy scale stuff. There was a cool demo with CCP Games where we got 15,000 players into an actual real-time game. I helped a lot with the architecture for the game itself. Then the logistics company was going through its high, but a failed acquisition burned me out of it. I was the exec producer at a small studio for a while.
00:15:11
Speaker
And then I was at Improbable for a bit as well, after the kind of merger-y weird stuff happened, doing big-scale things again; I built a big initial part of the renderer that became their metaverse renderer.
00:15:24
Speaker
And then the pandemic hit just after I'd left, my daughter was born, I took a bit of time out, and then joined Ably Realtime, which I think is where we met, while I was still there. Real-time data. Real-time data. Yeah. So I was the one banging the Kafka drum there.
00:15:42
Speaker
And so that was big-scale WebSockets with hardcore reliability, and also MQTT. And I was head of DevRel, and then in kind of a position analogous to a field CTO. And then after about two and a half years of absolutely hilarious, really good fun there, because we were getting things like the Kafka stories sorted out, getting us to market in these bigger and bigger areas, showing what you really do with high-speed data, I joined Aiven.
00:16:09
Speaker
And now I lead developer education here, though I tend to go by open source code sommelier these days. Open source code sommelier. There's one thing I'm jealous of you at Aiven for, which is, they're a kind of stuff-as-a-service platform, right? So you've got all the databases and all the infrastructure playground stuff that you can go and build crazy demos with and call it work. That is, shall we say, a conservative 90% of why I joined.
00:16:38
Speaker
It's also kind of the biggest story, which is that open source is eating the world. One of my colleagues had this great description of open source, which is that you're basically leveraging the free cycles of every single one of the world's developers out there, because that's what they're doing. They're just putting things into open source, arguing beautifully, productively, to make something really good, because they care. And then at someone like Aiven, we are still big contributors to open source; almost 20, 25% of our dev team are literally upstream-only committers.
00:17:09
Speaker
Seriously, we have a lot of committers on staff. That is a decent percentage, yeah. It is, because all four founders are Postgres committers. Oh, cool. People don't realize. Everyone thinks, oh, they just resell this stuff. No, founded by people who contribute, run by people who still contribute.
00:17:26
Speaker
I know, it's really cool. So the DNA of the company is about this idea of saying, find the best of open source, deliver it as this awesome piece of infrastructure. And I do mean there's so much infrastructure work behind the scenes. So abstract away all those cloud platform layers and just say, here's the database, get going.
00:17:44
Speaker
Which leads us on to the main topic I brought you in for, right? Because you've got access to all these different data platforms, and a lot of experience in different ways of building software. And not everyone can see this, but you're wearing a t-shirt that says, have a nice data. So this is from the Kafka Summit this year, possibly one of my actual favorite t-shirts. I've had so many requests from people saying, where did you get that t-shirt? It seems awesome. Like the other good one from last year.
00:18:16
Speaker
putting all that together, right? Let me put it this way. I would think a lot of companies,
Choosing the Right Technology
00:18:22
Speaker
a lot of projects, say to themselves, they have two arguments for generic project X. They say, OK, let's argue about whether we're going to use MySQL or Postgres. And once you've settled that argument, we'll argue over whether we should be processing data using Python or SQL. And once you've answered those two questions,
00:18:45
Speaker
you're on your way. Yes. This is the classic dichotomy: there are some answers in the world, but most of them are Postgres.
00:18:56
Speaker
And yes, this is the thing. Everyone starts out thinking, what tools do we reach for? And we all remember the LAMP stack of old. It was beautiful because it answered every question you had. You needed a machine to run on, running Linux. You needed something to handle the actual requests themselves and reverse proxy them, so you had Apache. You needed a database; MySQL was everywhere. At the time, Postgres wasn't really an option. It was just a little early in the Postgres story.
00:19:24
Speaker
And then, of course, you needed something to actually write your code in. And PHP got stuff done. And we'll always give it credit for that. It got stuff done. It got stuff done. And it opened the door for a lot of programmers; let's not knock that. Absolutely. It had productivity on its side. And eventually, getting stuff done soonest will win. Whether it runs or not. It really is The Simpsons: there's the right way, the wrong way, and the Max Power way.
00:19:50
Speaker
I've not heard that quote. Homer changes his name to Max Power, and the question in response to it was, isn't that just the wrong way? And Homer says, yes, but faster. And that in some respects sums up PHP, but it's not actually always the wrong way. But it is faster to get there. You'll get an answer faster.
00:20:11
Speaker
That's another great dichotomy in programming. Do you want your problems today or tomorrow? There are a lot of solutions which are just storing up problems for tomorrow. Absolutely. As much as I'm often the one saying technical debt is a coin you spend like any other, you do end up at the point where sometimes you do have to pay it back. This comes back to this idea of saying, what tools do you use along the way? How do you navigate that decision space?
00:20:38
Speaker
Generally, where would you come off that path of Postgres with Python? If you're actually at Aiven yourself, the answer is never. Aiven is built by Postgres committers in Python. That is the religion in this building. However, that aside, it comes down to two or three major questions.
00:21:03
Speaker
The first is always going to be access patterns. How are you trying to access this data? Am I changing the location of a single person a thousand times and then needing to query it quickly? Am I reading the stock market data as a massive stream and then acting on the events of change? Am I looking at flight data for the last 100 years? Maybe not 100 years; let's say the last 10 years of flight data, to work out where I can optimize my flight routes.
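To make those three access patterns concrete, here's a rough sketch in SQL; the table and column names are invented for illustration, not anything from the conversation:

-- 1. Transactional (OLTP): update one courier's location often, read it back fast.
UPDATE courier_locations
SET lat = 51.5074, lon = -0.1278, updated_at = now()
WHERE courier_id = 42;

SELECT lat, lon FROM courier_locations WHERE courier_id = 42;

-- 2. Streaming: a continuous query that acts on each change event as it arrives
-- (Flink-style; conceptually it never finishes, it emits a row per matching event).
SELECT symbol, price
FROM stock_ticks
WHERE price > 500;

-- 3. Analytical (OLAP): scan years of history for an aggregate answer.
SELECT origin, destination, avg(delay_minutes) AS avg_delay
FROM flights
WHERE departed_at >= now() - INTERVAL '10 years'
GROUP BY origin, destination
ORDER BY avg_delay DESC
LIMIT 20;

The same data could sit behind all three queries; what differs is how the storage engine needs to be organized for each one to be cheap.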
00:21:30
Speaker
All these data questions come down to a different access pattern on the data itself. You can always say Postgres, dot dot dot, and it will give you an answer. Unless you hit some very far outliers, you'll just get a worse answer depending on what happens next. Yeah, you'll be stretching it further and further out from its comfort zone. As soon as you go beyond a certain point, what really gets you more than anything else is the cost.
00:21:58
Speaker
Fundamentally, you only really have a defined budget for an answer, right? And you then look at the cost. That's both a financial cost and a cost of failure, right? So I'm considering time as a cost in there too. Yes. And that's the kind of thing where you have a time window bound for success and then a failure window on top of that. So if you imagine, let's say, an example I keep giving these days is,
00:22:17
Speaker
you're trying to recommend a movie when you leave a video on Netflix, or Amazon Prime, or a service yet to be determined. You leave that video and they need to recommend the next video. They have about 20 seconds total to recommend something right, or they can recommend something wrong. A good recommendation keeps you on the platform or lets you lock in your next thing. Your actual churn percentage will go through the roof if they don't give you the next thing, because you'll just not see what you need to watch next, and exploration takes time.
Data Access Patterns and Cost Management
00:22:46
Speaker
Your engagement is massively determined by how fast the answer is. So your cost of failure is actually quite large by not having the data fast enough. And if the cost of the data being fast enough is, I now need 100,000 cores to run my Postgres, that's not a good answer anymore. And this is where we come back to costs. So I'm kind of abstracting cost as a time-versus-machines cost here. If you have enough machines, the time will go low, but getting the number of machines, and their cost, low is another cost all of itself.
00:23:16
Speaker
Yeah. Yeah. Yeah. And then the flip side is trading. Let's say you've got a fill-or-kill order for a hundred shares of Tesla. If you don't get that right, you have potentially unlimited liability. Right? Failing that, you're left holding Tesla stock. Which could be worse. Not touching any Elon Musk questions. No, no, no, no, no, no. We're leaving that. It's just weirdly volatile at the minute. Okay.
00:23:45
Speaker
This is where it comes down to it: the cost of failure can be very high. Therefore having a correct answer in the time allowed is actually something you can spend money on. But then, come back to why we don't use Postgres for everything. These days, a lot of the time I'm saying use an analytics-specialist database, like ClickHouse, like Snowflake, like Redshift. Because to be honest, you can't put more than a few terabytes in Postgres before it gets very upset with you.
00:24:13
Speaker
And then it just starts consuming cores, and you watch it slow down, and you can't do the fast transactions you want to do on it. And some of this is just because of what Postgres is, with its ACID transaction model. Some of it's because of SQL, but a lot of it's simply because the way it's engineered is not for saying, I can scrub through a very large quantity of data very quickly, but rather, I can find one answer inside a very defined six jumps. I think it's a B+ tree under the hood. Yeah. Yeah.
00:24:41
Speaker
That kind of implies that this is just a question that affects people with more than a few terabytes. The fact is that one of the most ubiquitously freeing things, particularly over the last 30 years of software development, has been Moore's Law. Fundamentally, the machines we run on are supercomputers.
00:25:06
Speaker
I was at the computer museum in San Francisco recently, and though the Cray-1 was in front of me and I wanted to sit on it like it was a piece of lounge furniture, I didn't, because I said no. But fundamentally, that is probably less powerful than the server this podcast is being recorded to,
00:25:22
Speaker
by a decent margin. I think my laptop's got a dozen graphics cores. When did that happen? I know. These are scalable vector units, and you think, those used to only be in supercomputers. That is serious business when you think about it. And because Moore's Law is basically holding up, whatever we use, we get away with a lot of solutions which are
00:25:45
Speaker
not optimal, but so far in the noise. This is where we come back to some of those ideas of where speed comes from. If I'm going to call a database and it's on the other end of a network wire, I'm not going to get faster than maybe 50 micros in the same data center, 50 microseconds. If I get better than 10 milliseconds to get there and back, I'm already doing pretty well. I don't really need my database to be any faster than about half that network latency.
00:26:12
Speaker
Yeah. You could find that the bottleneck is actually serialization and deserialization. It really often is; we have these novel formats specifically for this, and a lot of things like MessagePack and protobufs. Hopefully not protobufs; I have a bone to pick there. I ranted on stage at QCon about them. It was good fun. It's now published.
00:26:34
Speaker
But the key thing always comes back to this idea that everything costs something. Therefore, where do we need to optimize which bit? And the particular thing with Postgres is, Postgres is pretty quick, and most data is pretty small. And most queries are not very complicated. If you're not doing a very complicated query, you can get away with almost anything. I know people who have Redis as their primary database because they're not doing anything clever.
00:26:58
Speaker
I knew someone who had Redis as their primary database, and the company imploded one day.
00:27:06
Speaker
So I once lucked out with that for a little while. We had it as our primary database for about eight months at one startup, because we were in a hurry, and it worked. And we never got caught out until we found out how bad it really was later down the line. We realized we had a very aggressive caching policy, and it never actually ended up using this clever write module we'd put in to write it to disk properly. Oh, right. Yeah. And it worked.
00:27:32
Speaker
For disclaimer purposes, I'm not sure Redis are advising that it be your primary database of record. Their website would say otherwise. Oh, really? Oh, OK. It's a very multimodal story.
00:27:43
Speaker
If you work for Redis, you're invited on the podcast to argue your side. Genuinely, they've done some magic. I will say that they have write-ahead logs going to disk these days. They've taken a fun idea and stretched it so far that it's weirdly awesome now. I have nothing but fun things to say about what Redis can do. The general question is, should it?
00:28:09
Speaker
Fair enough. Push this more into the language space for me, though, because we talk a lot about databases on the podcast, but where does this affect language choice? What it really comes down to is, if you think about it, most of what we do in programming is trying to express a series of concepts in our code, which a machine can then understand.
00:28:30
Speaker
By the way, this is my general statement about LLMs. LLMs do two things very well. They make us sound like computers and computers sound like us.
00:28:39
Speaker
I like that. That's all they do. They're really good at translating between the two. They're basically just a way to do, basically, a very weird programming language, for want of a better word. Or a very bad interpreter for a programming language. Which makes me think of something someone else said about LLMs, which is they don't give you an answer, they give you something that looks the shape of an answer. Exactly. And we're missing the compiler at the other end, basically.
00:29:02
Speaker
The thing that checks that you actually said what you think you said. Yes, I can imagine a borrow checker looking at someone's grammar and going, no. It's like incomplete clause stuff. Can I get back to this idea of manipulating data at any kind of scale, whether it's the one or the many? Currently, the de facto is SQL, 49 years of development, and I stand behind it.
00:29:28
Speaker
It's designed to be a declarative language, to basically state what you want to happen, and then it happens. However, over time, that sort of morphs into: you can do all things in SQL. I think I may have shared the Schemaverse with you at one point. The Schemaverse. If I haven't, I need to. I don't think you have. Someone built a multiplayer space trading game in PostgreSQL, and the whole game runs inside Postgres stored procedures.
00:29:56
Speaker
You can find it. It is open source. Okay, we're going to link to that in the show notes. That's how I got into Postgres. I was searching multiplayer space games, and found that, and was like, I'm in.
00:30:08
Speaker
If it's powerful enough to do that, you're interested. No, nerd-sniping me is not really a challenge. I get that sense. I really do. Back to this kind of thing. So SQL can do most things, but should it? And is it very optimal for these things? In general, it was designed against this relational model, which has a lot of thoughts and patterns from when it originally came about, which is the idea that relations are important in data, that we'll normalize our data to reduce its storage footprint. Because if you remember, storage
00:30:37
Speaker
was almost an order of magnitude more expensive than compute for a while. It really was. So normalizing your data mattered. And then we had the big shift, probably about 10 years ago, maybe a bit more now, maybe 15 years ago, when compute collapsed in price. Compute collapsed in price a bit. But here's what really happened: storage went to near zero. Commodity storage became a thing. So suddenly, it's like, why are we paying costs to normalize when we don't need to? Why don't we just make access faster?
00:31:07
Speaker
So then we denormalized our data, and suddenly the relational model didn't hold up, so SQL didn't hold up. So we went NoSQL, or not just SQL, or not only SQL, or an acronym of your choice, basically. And you kind of come round again and think, wait, what? We now have a different access pattern. Like, Cassandra's first query language was Thrift. So, really? Yes, it had a Thrift API to begin with.
00:31:32
Speaker
A while ago, I was chatting with Kent Beck about that, and he said, it was good at the time. I regret many things about it.
00:31:41
Speaker
But this is the kind of thing: RPCs are just a kind of language model of choice. And that's all SQL is. It's an RPC language that changes something, whether it's data description, whether it's data modeling, whether it's a query. And you might think, why aren't we using something more general-purpose? Because I can transform data with SQL if the data thinks it's SQL-shaped. But what happens if I actually want to run something a bit more clever? Like I want to do, I don't know, a Fourier transform.
00:32:06
Speaker
Find the frequency domain or something. This is very prominent in images. A JPEG is a Fourier-domain compression. You take an image, you transform it into the Fourier domain, and you basically do some normalizations there and flip it back. This means JPEGs are actually very good at looking good, even though highly compressed. There are smarter things, but they all work on similar principles.
00:32:28
Speaker
If you think about it, you can't really express a Fourier transform in SQL, unless it's a competition, at which point competitive coding, like some of your previous guests talked about, is the realm of wonderful crazy I don't approach. Instead, you'd want to approach it in a more complete language. But the problem now is, imagine you've given people Python and access to your actual database, and they're running Python in your database.
00:32:56
Speaker
Are you actively worried about that? Personally, I like the idea of it, until everyone says, but what happens if someone puts in a bit of malicious query and it starts doing random things across your network? How do you sandbox it? I've seen in the wild a Java stored procedure that decided it'd be a good idea to start a web server.
00:33:16
Speaker
Yes. And it wasn't malicious. That was just a really bad idea. I've had that bad idea. Literally had that bad idea myself. So that kind of stuff is why, on average, you don't let people run code directly in the database. There's a lot of Lua inside many things, it seems. But it's really powerful, because suddenly you have arbitrary compute. And we like arbitrary compute, because if I can manipulate the data where it is, I'm not paying a network cost.
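For a concrete flavor of the trade-off, here's a sketch against PostgreSQL, where the Python extension is deliberately named plpython3u, u for untrusted, and only a superuser may create functions in it; the example functions are hypothetical:

-- Untrusted: PL/Python can open sockets, read files, start web servers.
CREATE EXTENSION plpython3u;

CREATE FUNCTION fetch_anything(url text) RETURNS text AS $$
    import urllib.request   -- nothing stops this from reaching out of the database
    return urllib.request.urlopen(url).read()[:100].decode()
$$ LANGUAGE plpython3u;

-- Contrast a trusted language, confined to database operations and safe
-- to hand to ordinary users.
CREATE FUNCTION double_it(x int) RETURNS int AS $$
    SELECT x * 2;
$$ LANGUAGE sql;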
00:33:45
Speaker
I can use the fact that I have a big CPU there doing my heavy lifting, and my web server can be nice and stateless. It's like one of those benedictions: may your services be stateless. They never are, but we'd like to think they are. The more you can push into the database, the better. Some people would argue that that's a mistake in how you're looking at the database, and that you should actually split it out into the storage layer and the compute layer.
00:34:13
Speaker
Ah, yes, the shared-everything model, bizarrely, which has the two as separate parts. It's one of those naming oddities where shared-nothing has the compute and storage together, yet shared-everything doesn't.
00:34:27
Speaker
One of those interesting moments of, I see where you're going, but the naming came out funny. I can't understand where these things go. Loads of these services now do this. Famously, Snowflake just dumps everything in S3. Loads of databases secretly dump everything in S3 these days. Hell, WarpStream have rebuilt Kafka with only S3 below it.
00:34:50
Speaker
Yeah, theirs is a very interesting-looking thing. And it's modeled now like a lot of these kinds of Prometheus backends like Thanos, where you just have agents writing directly into S3, or kind of like the data lake model, like the kind of Iceberg or Hudi tables.
00:35:08
Speaker
The problem ends up being that once you're in S3, you're at the whim of S3. S3 is way faster than it should be for what it is. Way faster. It's also way more reliable and scalable than it should be. It's actually almost black magic under there.
00:35:22
Speaker
But fundamentally, it's still 100 milliseconds to do a change or a read. And that's partially network, partially the actual S3 itself, but also the actual API itself is not that fast, fundamentally. It's OK once you start streaming, but if you've got to establish and re-establish a connection, you start having a problem. So when you're on a local disk, you actually end up in this place where it's literally two orders of magnitude faster. Yeah, that makes perfect sense.
00:35:49
Speaker
The question is, where can you afford to pay? If you can afford to use S3, it's probably a good idea. But if you can't afford it, you don't realize until you've already paid the cost, and you didn't know you were paying it. That's where you end up with these hybrids. A lot of what I'm working on right now is around this idea of, how do I do hot and cold storage at the same time?
00:36:13
Speaker
Treating local disk as just another layer of cache. Fundamentally, yeah. So I have some ClickHouse examples where I have dictionaries, which are literal key-value lookups, covering half my memory, half my literal RAM. Then I have a hot layer, which is a strict materialized view on local SSD. And I have my cold extension off in S3, which is either scrubbing across Parquet files from a data lake, or just reading files I've dumped there myself.
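A rough sketch of what that layering can look like in ClickHouse terms, with invented table names; the dictionary serves in-RAM key-value lookups, the materialized view is the hot layer on local disk, and the s3 table function scrubs the cold Parquet:

-- RAM layer: a dictionary for literal key-value lookups.
CREATE DICTIONARY driver_status_dict (
    driver_id UInt64,
    status String
)
PRIMARY KEY driver_id
SOURCE(CLICKHOUSE(TABLE 'driver_status'))
LAYOUT(HASHED())
LIFETIME(MIN 10 MAX 60);

-- Hot layer: a materialized view maintained on local SSD.
CREATE MATERIALIZED VIEW orders_per_restaurant
ENGINE = SummingMergeTree ORDER BY (restaurant_id, minute) AS
SELECT restaurant_id, toStartOfMinute(ts) AS minute, count() AS orders
FROM orders
GROUP BY restaurant_id, minute;

-- Cold layer: scrub Parquet files sitting in a data lake.
SELECT restaurant_id, count()
FROM s3('https://bucket.s3.amazonaws.com/orders/*.parquet', 'Parquet')
GROUP BY restaurant_id;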
00:36:41
Speaker
And then I have tiered storage, but with more than just the two tiers everyone talks about. Yeah, yeah. Is that because you just like the idea? Is it practically useful? Do you think we should be doing it?
00:36:56
Speaker
It's the fact that you're grinning like that. That's what I'm getting at. I started out thinking, this is fun, like I often do. And, what can I do next? Because ClickHouse is a massive box of tools. But then what happened was you come across the simple fact that you need the right tool for the job. And this is where we come back to those language choices. Why should I use R or Python today versus SQL? If my problem domain is too complicated to easily express as SQL, I probably shouldn't.
00:37:23
Speaker
But if I have a regular query I'm running on repeat, I have my streaming analytics coming in, which is, let's say, every single position of every single vehicle in my fleet of delivery drivers, and also the state of every kitchen of all the restaurants we work with; we're a food delivery company today.
00:37:44
Speaker
I often want a view of every single restaurant against where their delivery driver currently is. That's just a key-value pair, a series of key-value pairs. But if you think about how much scrubbing you've got to do to get that view out of something in S3, that is disproportionately expensive. I want to do it once and have it ready to go. And this is where you get to those optimizations of: it's not just faster by two orders of magnitude or three, it's cheaper by four,
00:38:13
Speaker
because I'm not paying for every single node's network hops along the way. And when you actually want to start getting these big answers, you start needing to think, how do I have small data? If I want big answers fast, I can't have lots of data changing when I want to read it.
Real-Time Data Processing Strategies
00:38:28
Speaker
Otherwise, my calculation is just going to start stretching into near infinity. And we can only put so many cores in before it stops making sense.
00:38:36
Speaker
Because otherwise you end up with the simple fact of them hopping between cores, hopping between threads, hopping between servers. And I'm back to that, wait, how fast is my network again? Yeah. Now, to play devil's advocate, someone is going to hop in and say, hang on, you're trying to join two large data sets. You're back to Postgres. Of course. Yes. And the answer is, I wish. I think Ben Stopford said this best: if we could get away with just Postgres, we would.
00:39:07
Speaker
It was quite fun. The answer is you always are joining something, but my argument then comes down to: if I pay for the join once and only once, and then have the secondary table ready to go, am I good? You just have these cascaded views of your data, and this cascades all the way up. And this is the big conceit of where I work right now: I don't have to just say this tool is magic. I can say this tool starts here, stops here, and then I go up the tree.
00:39:34
Speaker
Because now I get to say, and now let's go really fast and have a Redis cache on top. Yeah. Yeah. Okay. So give me that map then, and I'll allow you a bit of a plug here for, um, for Aiven.
00:39:46
Speaker
Yes. Okay. Given the services on offer, where do they start and end? To give you a final example, let's say you are a rideshare company or a delivery company, or e-commerce; e-commerce is everywhere. Everything is e-commerce, strictly. Give me the restaurant one, because I like that, because I've not heard that one before. Give me the restaurant one. You start with many drivers streaming their locations, so you've got lots of quick data coming in.
00:40:11
Speaker
So that's MQTT, and then we're going to absorb that into Kafka, because Kafka beautifully matches with MQTT. I wanted to give that talk, but they said no.
00:40:22
Speaker
MQTT, because it is just the IoT language of choice. Yeah, but you could put WebSockets there, with Ably or something else. It doesn't really matter. You just need to get that data in quick, and it's streamed. So then we've got the stream of data coming in through Kafka. And now we're going to build what I like to joke about as the KFC stack. And it's an actually good joke, because it's Kafka, Flink, and ClickHouse, but also because we have 11 products, which are the herbs and spices. Oh, God. Okay. I basically had everyone groaning in the room and I was like, this is insane.
00:40:50
Speaker
That's the proof of a good pun, if you can make the whole room groan.
00:40:53
Speaker
Yes. And literally, one of my team just goes, don't walk away from it. So what happens then is you go into something like Flink. So Flink is kind of the stream processing engine du jour. And that's not to say it's going to go away; it's actually slightly older than Spark, which is funny. It has been around that long. But it's very good at this kind of distributed stream processing thing. Take the concept of Kafka Streams, but wrap it up in something that will handle all of the offset management for you, the checkpointing for you.
00:41:23
Speaker
So you do a big kind of, let's say, join and denormalization, and then you split out the data, you convert it to Avro, you put it in some properly easy-to-process formats. Then you put it into, so you're in Kafka right now, and then you stream it back to ClickHouse, so you have your long-term store being built live.
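In Flink SQL, that step might look something like the sketch below; the topic names, fields, and the pre-registered drivers table are all assumptions for illustration:

-- Raw positions arriving from Kafka (fed by MQTT upstream).
CREATE TABLE driver_positions (
    driver_id BIGINT,
    lat DOUBLE,
    lon DOUBLE,
    ts TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'driver-positions',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

-- Denormalized output, written back to Kafka as Avro for easy downstream use.
CREATE TABLE enriched_positions (
    driver_id BIGINT,
    driver_name STRING,
    lat DOUBLE,
    lon DOUBLE,
    ts TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'enriched-positions',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'avro'
);

-- The join/denormalization itself; 'drivers' is assumed to be registered
-- elsewhere (for example, from a compacted topic).
INSERT INTO enriched_positions
SELECT p.driver_id, d.name, p.lat, p.lon, p.ts
FROM driver_positions AS p
JOIN drivers AS d ON p.driver_id = d.driver_id;

From there, a ClickHouse sink (Kafka Connect, or ClickHouse's own Kafka ingestion) drains the enriched topic into the long-term store.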
00:41:40
Speaker
So this is where you have the data going back in time. But then it's still in Kafka, so we can do more, because Kafka being pub/sub means you can then have a hot cache in Redis. So this is where you do geo-add. The Kafka connectors for Redis have the ability to actually add the data to geo sets in Redis. So now what I can do is build a hot cache of where every single driver is, for geo-point queries, while I'm doing all the rest of this in flight, at no real cost to anything else.
00:42:06
Speaker
Now I know where all my drivers are in real time. I do the same thing with my restaurants, so I have two caches. Give me the restaurant and its current status. Give me the driver and their current status. But now I want to do one step more, and queue up another kind of nice, quick, easy-access table to say, give me the best thing for my current situation. So what I'm going to do now is join the contents of my drivers in flight
00:42:28
Speaker
with where they are. So now what I do is have an averaging over a window. Every 20 seconds I might say, dump out the drivers and their locations and their current contents and their current direction. So when I do a select-from to calculate nearest driver, please, I have a nice, you know, approximately hot cache to know these are the people in the snapshot window. At which point I can ask the really hot cache to say, where are they really?
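That 20-second snapshot could look like the following in Flink SQL, continuing the same invented names; each tumbling window emits one row per driver into the quick-access table:

-- 'driver_snapshot' is assumed to be declared elsewhere with a suitable sink;
-- contents/direction columns would ride along in the same way as lat/lon.
INSERT INTO driver_snapshot
SELECT driver_id,
       LAST_VALUE(lat) AS lat,
       LAST_VALUE(lon) AS lon,
       window_start
FROM TABLE(
    TUMBLE(TABLE enriched_positions, DESCRIPTOR(ts), INTERVAL '20' SECOND))
GROUP BY driver_id, window_start, window_end;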
00:42:51
Speaker
then I can safely make that join. But it gets better, because this is pub/sub, and I have more than one subscriber. So, that restaurant data is coming through, and I now know if my restaurants are starting to meet their capacity or not, because I have previous historic data in ClickHouse to say, these restaurants can only really handle 30 or 40 orders per second. So, I can do a join against your max orders per second and say, anyone starting to hit that,
00:43:15
Speaker
I recommend them down, lower in the list, so I don't ever end up over-indexing on the most popular restaurants. I want to be egalitarian; I want everyone on my restaurant platform to experience an even load. But more importantly, I can't give a bad customer experience. This whole thing we do is because of customer experience.
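That down-ranking is essentially one more join, of the live rate against the historical ceiling; here's a ClickHouse-flavored sketch with hypothetical tables:

-- Rank restaurants for recommendation, demoting any near their historically
-- observed maximum order rate. Table and column names are invented.
SELECT r.restaurant_id,
       r.current_orders_per_min / h.max_orders_per_min AS load_ratio
FROM restaurant_live_load AS r
JOIN restaurant_history AS h USING (restaurant_id)
WHERE r.is_open
ORDER BY load_ratio ASC  -- least-loaded restaurants float to the top
LIMIT 20;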
00:43:36
Speaker
Yes, always. Now, what I've got is the ability to recommend the right restaurants at the right time, independent of how loaded they are. If the load goes too high on the local place that serves the best pineapple pizza, it's going to go down the tree. Yes,
Scaling Beyond Traditional Databases
00:43:51
Speaker
that's an in-joke, and you'll probably have another guest on here at a later point who will make it more apparent.
00:43:57
Speaker
But when that goes down the order list because they're too busy, no one is going to suffer: neither the restaurant having to say no, nor a user getting frustrated. And this is what's most important here: we're doing this in real time. You don't have to re-query and see restaurant full, restaurant full. We're just not going to return it to your app. Because we know.
00:44:17
Speaker
We also know not to give a driver too many things. We know when the driver is not actually going to make a turnaround point. We can give a really accurate assessment of cycle time. Actually, the driver we suggested might be the closest, but he's on break, and we have his current state before we make the decision.
00:44:38
Speaker
In that stack, you're advocating for real-time data, spending the cost of materialization once per notification. And having the ability to turn the historic views into this. Basically, I often term this extract, transform, load, and optimize, tacking the O on the end. Because the key problem always ends up being: I've loaded this data, but if I don't optimize it, I can't really use it outside of a dashboard.
00:45:05
Speaker
Yeah. So how do I make long-term historic data usable at real-time scale? And your answer to that is picking specific data tools for the job. Exactly. And you never have a stack of one tool. If you have one magical tool for me and it's not Postgres, I'm going to be surprised.
00:45:25
Speaker
At what point should a solution start going beyond Postgres into this admittedly more complex and expensive stack? Oh, so the answer is as late as you can get away with.
00:45:38
Speaker
And no later. And no later, ideally. But the inevitable fact is that slightly too late is nearly always the case. But you can be quite forward-looking here, because none of these things are hard dependencies on each other. That's the magic of doing a proper distributed architecture with something like an event bus between the pieces.
00:45:57
Speaker
Because we can start with Postgres, add a ClickHouse next to it, and just start draining the long-term data out straight away. No additional tools required. And then we say, actually, we need the state to go more places. Let's start with pub/sub, start with RabbitMQ if we need to go lightweight. Go to Kafka when you realize RabbitMQ is, you know, still pre-version-one, so maybe it has a problem. Yes, I will bash RabbitMQ first.
00:46:22
Speaker
But it is still version 0-point-something. It's not version 1 yet, due to reasons. At this point, I believe it's naming convention more than anything else. We could go into a whole separate rant about what version 1 actually means, but let's not. Yes, I believe marketing is the technical answer at this point.
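As a sketch of that Postgres-plus-ClickHouse starting point mentioned a moment ago: ClickHouse can pull rows straight out of a live Postgres with its built-in postgresql() table function, no extra tooling required. Connection details and schema here are placeholders:

-- A ClickHouse table to hold the long-term copy.
CREATE TABLE orders_history (
    order_id UInt64,
    restaurant_id UInt64,
    total Decimal(10, 2),
    created_at DateTime
) ENGINE = MergeTree ORDER BY (created_at, order_id);

-- Drain older rows out of the live Postgres.
INSERT INTO orders_history
SELECT order_id, restaurant_id, total, created_at
FROM postgresql('pg-host:5432', 'shop', 'orders', 'reader', 'password')
WHERE created_at < now() - INTERVAL 30 DAY;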
00:46:43
Speaker
But yeah, the idea is you incrementally build one of these systems. The reason you normalize onto something like doing this over an event bus is because once you decouple the systems a little bit and say, I'll bring in the best thing for the right job rather than overstressing any individual system, at no point does your main transaction system fail if any of the things downstream fail.
00:47:03
Speaker
You still take orders, right? You can still have a rough guess whether your driver is going to get there or not get there. And it allows you to have this ability to say, well, I wanted to do this transactional system in Postgres, but now I've gone too big. Let's roll in Cassandra. Let's actually go massively, massively huge. Let's make sure I can't fail at any given point in time. No single points of failure. So, rolling in Cassandra, let's say I'm doing shopping baskets now. This is one of my favorite little demos I built, mostly to try and prove a point to a local supermarket.
00:47:34
Speaker
Which was: let's have my baskets at a big scale. I'm using change data capture, pulling that data out of Cassandra tables, because I can go to any scale. And now what I'm doing is matching my baskets with my actual inventory. So now I know when I've exceeded the percentage where I might have actually tried to sell too many oranges today, right? Therefore, what I can do is message my top few people, my subscribers say, and tell them, lock in now and don't get a substitution.
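The basket-versus-inventory check boils down to a join like this, sketched with hypothetical tables fed from the CDC stream:

-- Flag products where the quantity sitting in open baskets approaches stock
-- on hand, so at-risk customers can be messaged early. Names are invented;
-- the 80% threshold is an arbitrary example.
SELECT i.product_id,
       sum(b.quantity) AS reserved,
       any(i.stock_on_hand) AS on_hand,
       sum(b.quantity) / any(i.stock_on_hand) AS pressure
FROM open_baskets AS b
JOIN inventory AS i USING (product_id)
GROUP BY i.product_id
HAVING pressure > 0.8;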
00:48:00
Speaker
Because I now know roughly who is going to be disappointed ahead of time. Because I've seen it happen as it happens. But as my stock levels change in real time as well, I need to have both. Yeah. So two questions, and they may be the same answer. In that stack, what's the place for transaction-heavy processing? Ah.
00:48:24
Speaker
Actually, I'll save the second question. Everything you've described feels like analytics-based processing. I fall into that camp of being more into the event-sourcing world and arguing that transactions are kind of a flawed concept. Okay, give me that argument. The idea of a transaction is it's atomic, consistent,
00:48:47
Speaker
isolated and... is it durable? Durable, yes. Between the two of us, we can pass the exam. Yes. We are sort of computer scientists. We're computer enthusiasts. The point about transactions, particularly, is, firstly, everyone says distributed transactions are hard, and any system with two systems is distributed, so we've already got distributed transactions before we even started. The second part is that
00:49:15
Speaker
the first thing you must do for a transaction to be real is stop time. Because your transaction is only ever going to be consistent within the certain time slice it was in. It can only be atomic assuming no other writes happened at the same time, which means you're already time-slicing in ticks, which means your consistency is time-sliced to that point in time, at which point, if I want to refer to it, that time point has passed. So either it's event-sourced, at which point I materialize my events and it is consistent, or it's not consistent.
00:49:43
Speaker
It's the ever-classic bit of the quantum world of incremental time. But there's also the idea that we have this kind of touchstone of ACID compliance, assuming it's the only way to do things, whereas actually, it's never really held true. It's like the CAP theorem. We normally get one, not two. On a good day, we get one and a bit.
00:50:10
Speaker
But it doesn't actually matter, and that's the other point, because the trick here is, if we can make everything go quickly enough, the likelihood of a change is low enough that statistically we're good enough. So you think push into event sourcing and avoid the transactional system entirely? So dial back its footprint as far as you can. You do need guarantees, and transactional systems offer really good guarantees, though I argue that what they consider to be guarantees are softer than they admit,
00:50:41
Speaker
just because... you've used computers for a long time, I've used computers for a long time, and the one thing about using computers for a long time is that you're more surprised they work at all. Every day, I'm more surprised anything works.
00:50:54
Speaker
Yeah, yeah, absolutely. And this is because the more you see these systems, the more you know what's going on, the more you realize that none of these things have a consistent view of the world. So we basically assume that Postgres, MySQL, or one of these others has a pretty damn good view of the world, so we'll trust it to a certain point as that kind of starting point. And then as we cascade down the stack, we basically accept that we are eventually consistent: in a defined timeframe, assuming no more events pop up, we'll be consistent.
00:51:25
Speaker
That is literally the model we have to go for, because any system beyond a certain level of complexity is going to be somewhat consistent at best. There's a great talk from Kafka Summit by, I couldn't remember her name for a moment, talking about the idea of using completion patterns. That's exactly it. That was Anna McDonald. Great talk. Absolutely. Best talk I heard there, genuinely. We'll link to that. It is the best talk of Kafka Summit, in our humble opinion.
00:51:51
Speaker
I have written a literal follow-on to it. I put it forward; it just didn't get accepted. But it's now a cornerstone of what I talk about a lot, because I've used exactly these patterns before, and for exactly the same reasons. It's just that this was that crystallizing moment of, I need to talk about this more.
00:52:11
Speaker
And it's that exact thing. If we can achieve consensus, it's not a problem, but we've got to pay for it somewhere; we just don't need to pay for it where everyone assumes you do. So outbox patterns, they're fine, but do we need them? Take me through that in a bit more detail. So outbox basically says that we have our transactional table, we join a bunch of things and we output to another table, and we just follow the log of that table,
00:52:35
Speaker
where I'm going to argue that fundamentally we're doing some processing. Let's just throw that into our stream processing engine, which is a bit further downstream, and have the events as-is and get no delay. So why not just use Flink for that downstream, and rebuild it at will, and have all the information where we need it, rather than assuming we only need a limited subset?
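For reference, the outbox shape being debated here looks roughly like this in Postgres terms, with made-up tables; the event row commits atomically with the business row, and a connector then tails the outbox table's log:

BEGIN;

-- The business write...
INSERT INTO orders (order_id, customer_id, total)
VALUES (1234, 42, 19.99);

-- ...and the event describing it, in the same transaction, so anything
-- tailing this table never sees a half-done state.
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES (1234, 'OrderPlaced',
        '{"order_id": 1234, "customer_id": 42, "total": 19.99}'::jsonb);

COMMIT;

The counter-argument above is that the join producing the outbox row can instead live in the stream processor, working from the raw change events.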
00:52:55
Speaker
Now, this doesn't work in banking or some of the highly regulated things where you need to show certain things are true.
Data Modeling and Programming Paradigms
00:53:00
Speaker
Therefore, showing three things work in one database is much cleaner than showing they work across 10. Yes. Yeah. And it's more likely to be true, frankly. One would hope. These days, I'm never surprised.
00:53:14
Speaker
Also, the early days of certain databases I won't mention led me to believe that publishing to /dev/null was at least deterministic compared to some of their so-called transactions. I could probably guess which databases you're thinking of, but let's duck that. No, and if you know the one I'm talking of, it got a hell of a lot better. They bought somebody who fixed it. Okay, it's MySQL, isn't it? Oh, it's not.
00:53:37
Speaker
Oh, is it not? It's not. Oh, OK. Because MySQL suffered from that, and they bought something and got a hell of a lot better. Well, that one definitely did, yeah. But that was all... Oh, you're thinking of Mongo, aren't you? Might be, might not be. Right. It's one of those two, or possibly both. I'll simply say that WiredTiger is really good. OK, yes, it is. I get to say it.
00:54:03
Speaker
Yes, but generally Mongo is an absolutely awesome tool these days. In fact, it is so good that it actually causes problems, because people don't model their data as much as they might need to ahead of time. Oh, yeah. Mongo is the wonderful safety blanket until it isn't. Yeah. That's the other lesson of relational databases: modeling your data, and understanding your data model as a primary concern. I think that's an art we've lost in programming.
00:54:28
Speaker
It really is. I came into databases originally through Cassandra, where all you do is model your data, so the idea of not modeling data seems 100% alien to me. And if you started in strongly typed languages like C++, or Haskell for a while in my case, and yours too, I think. Yeah, absolutely. You get this idea of understanding everything as a type. Yeah. I always liked the Rich Hickey quote: everything has a schema, the only question is, did you write it down?
00:54:56
Speaker
Yes, exactly. And then he wrote an unstructured language. You know, those two things do not agree, and I don't know where you're going with this. I love what Clojure can do, I just can't wrap my head around the thought patterns it requires. Yeah, I see. I love the thought patterns.
00:55:19
Speaker
What killed Clojure for me, not killed, but retired it, let's say, is that I found I could do everything I liked about Clojure in Haskell, plus static typing and all the benefits that come with that. I went the other way, that was the problem. Right. Because I started out in Haskell, and when I got to Clojure I was like, now I have JSON objects flying around and I have no idea what they are. And I was like, I need a monadic expression here, because this is not a pure function. I'm doing something, and there's no first-class monad.
00:55:47
Speaker
No, it's sort of just implied as a closure. And I'm like, oh, that's not quite what I needed. Yeah, OK. So without delving too far down into monads, let's get back to something you said earlier. Do you think the problem with putting something like Python in a processing stack is malicious code? If you have wild Python running inside it, say I built a tool and then I let you put code inside it. Wild code is always a problem, and sandboxing is not easy. That's the short version.
00:56:18
Speaker
But this is, again, without using the M word, something in Haskell that lets you sandbox what a piece of code is allowed to do, and it works really well. Any hope of that pushing out into our data processing languages? So this is, in my opinion, the number one reason SQL is still too damn useful to go away: it limits what you can do to a subset of very useful queries that someone has already done the heavy lifting to make fast.
00:56:42
Speaker
Yeah. And it's also the promise of Wasm, if you see where I'm going, right? With WebAssembly, the idea is making these kernels that bake down to a known API set and then embedding them inside your software. This is how ScyllaDB does its UDFs, because it's written in C++.
00:57:02
Speaker
UDFs being user-defined functions that you embed into the database language. Yeah, custom code, basically. And the only way to do that safely is to have some way of sandboxing it. Lua can be sandboxed because it was designed to be. JavaScript, bizarrely enough, is pretty good at this as well, because it was designed to be sandboxed inside the browser.
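As a toy illustration of that embedding idea (and explicitly not ScyllaDB's actual UDF machinery), here's a hypothetical "UDF" written in WebAssembly text format and run from Python via the wasmtime package. The module gets a known API, integers in and integers out, and nothing else: no files, no network, no host memory.

```python
from wasmtime import Engine, Instance, Module, Store

# A made-up UDF in WebAssembly text format: add 20% tax to an integer amount.
WAT = """
(module
  (func (export "add_tax") (param i64) (result i64)
    local.get 0
    i64.const 120
    i64.mul
    i64.const 100
    i64.div_s))
"""

engine = Engine()
store = Store(engine)
module = Module(engine, WAT)            # wasmtime compiles WAT text directly
instance = Instance(store, module, [])  # no imports: nothing to escape to

add_tax = instance.exports(store)["add_tax"]
print(add_tax(store, 250))  # 300, the sandboxed function called like a UDF
```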
00:57:22
Speaker
Okay, yeah. It was actually originally a sandbox language. Why they didn't just use Lua is a thing I will continually ask; Lua would have been a better choice. It's also smaller, which makes it a better fit for embedding. Yeah, and it's a more rational implementation. I mean, you can actually write to a spec. I know! It's a short one as well, which makes me very happy.
00:57:43
Speaker
But yeah, the idea of being able to push code which is sandboxed into your database is super powerful. But you always end up with this dichotomy once again of saying, why don't you just query it? Which is what Spark is, all of it: build data frames on a distributed system, do stuff, and then put it somewhere else. So you pull it all out, put it in memory across n systems, and put it back. And it's great until you realize how much that can cost at any speed.
00:58:10
Speaker
or how many times it runs. Isn't Flink doing that as well? Flink is, if you run it the same way as you run Spark, and this is where amortized versus batch costs come in. If you have to do something in memory for speed, there's no option but to put the data in memory, right? So we should pay that cost once, upfront. This is the transform bit of that extract-transform-load step. So my thesis is: do as few calculations as you can, as far upstream as you can get away with,
00:58:39
Speaker
ahead of the end query pattern. In general, querying with SQL as an end user is much easier than writing code to talk to some custom data store. I've queried bits on a chip before; I do not recommend it. If you look at some of the old game save formats, they are literally bit reads for bit flags. And they're horrible. They are very horrible.
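In the spirit of those formats, here's an entirely invented miniature of one: a magic header, a word of packed boolean flags, and a level counter, read back by bit-masking. Real save formats are far hairier, which is exactly the complaint.

```python
import struct

# Invented layout: 4-byte magic, u32 of packed flags, u16 level (little-endian).
QUEST_DONE = 1 << 0
HARD_MODE = 1 << 2
NEW_GAME_PLUS = 1 << 5


def read_save(blob: bytes) -> dict:
    magic, flags, level = struct.unpack_from("<4sIH", blob, 0)
    if magic != b"SAVE":
        raise ValueError("not a save file")
    return {
        "quest_done": bool(flags & QUEST_DONE),
        "hard_mode": bool(flags & HARD_MODE),
        "new_game_plus": bool(flags & NEW_GAME_PLUS),
        "level": level,
    }


# One mis-set mask when writing and the file is "corrupted": every reader
# from then on interprets the wrong bits as the wrong facts.
blob = struct.pack("<4sIH", b"SAVE", QUEST_DONE | NEW_GAME_PLUS, 12)
print(read_save(blob))
```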
00:59:02
Speaker
And it's why you get things like, in Starfield, the new one that came out recently. Oh yeah, that's the new Bethesda one. There are corrupted save files where you have to do some ridiculous stuff to uncorrupt them. Hit one, and your game will freeze until you swap your character's gender backwards and forwards.
00:59:21
Speaker
because it will set a bunch of bits and reset things. Right. Oh God. Yeah. No, those sorts of bugs shouldn't exist, but you can see why they do. And, I mean, if you've ever been in these situations where you've just got to get these weird formats working, you just go: I tried. I could test edge cases, but then players happened. And hence we end up with this idea of saying, what happens if we just constrain the data language down
00:59:47
Speaker
and say, don't allow someone to write custom code or put custom bits on a wire? Because, yeah.
00:59:55
Speaker
Do you think we'll see a future with more custom constrained languages? Is that what you're saying? I've seen a lot of declarative languages come out recently. There's a project I do some things with every now and again called Tremor, or tremor.rs. It's a stream processing engine written in Rust, and it exposes a Rust API if you want to go deep, but its default interface is in fact its own domain-specific language.
01:00:19
Speaker
which is declarative and has a very defined spec, for exactly these reasons. They know what it can do. They know the engine only has certain types of things imposed inside it. And for better or worse, as long as you constrain what the user does, you can highly optimize it when it actually hits the engine itself. When you submit it, it can be vectorized, as in SIMD-optimized. All the loops can be unrolled, because it's a known thing. It's a scoped problem. But as soon as you unscope the problem,
01:00:49
Speaker
you just don't know how long it's going to take. Which either makes your SaaS vendor very, very happy or very, very sad, depending on whether you're the one who clocks up a million dollars in 10 minutes to see what happens. Horror stories. But the other half of that is simply that it also means you can rationalise it better, because
01:01:10
Speaker
I've written a lot of stuff from C upwards, and you build your own constructs as you carry them around. The best thing about functional programming is it teaches you how to think in terms of composability. You compose all these bits together in your head and go: right, I'll bring this composed model with me, and go. And that's super powerful, but most people don't want to do that. They just want aggregate, distinct, go. They don't want to have to work out what Bloom filter you need to run a distinct query.
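For the curious, here's a minimal sketch of the sort of thing hiding behind a streaming distinct: a Bloom filter answering "have I probably seen this key?" in fixed memory, at the price of occasional false positives. The sizes and hash scheme here are arbitrary, not tuned.

```python
import hashlib


class BloomFilter:
    """m bits and k hash positions per item; purely illustrative sizing."""

    def __init__(self, m: int = 8192, k: int = 4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item)
        )


# Approximate DISTINCT over a stream, without keeping every key in memory.
seen = BloomFilter()
for user in ["alice", "bob", "alice", "carol", "bob"]:
    if not seen.might_contain(user):  # may very rarely skip a genuinely new key
        seen.add(user)
        print(user)  # alice, bob, carol
```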
01:01:39
Speaker
That's the great thing about SQL, the declarative nature. It's possibly the only declarative language that has really stood the test of time. Absolutely. Before this, I was looking through the Wikipedia page of fourth-generation languages to try and find any I recognized beyond it. I thought I'd used most things, until I looked at that table and went,
Expanding Skills in Technology
01:02:05
Speaker
Okay. You know, there's something to be said for XML being considered one of these things; there were whole query languages written for XML early on. Oh, some of the XML query languages. Sure. Yeah. And then XPath, and the equivalents for JSON as well. Yeah. Those domains have different query tools, but they're basically just remodeling SQL onto the new domain.
01:02:27
Speaker
And that's kind of where... CSS, to a degree. CSS is a whole thing. On the list of things you have to choose to ignore, CSS is very high on mine; there be dragons. We won't trigger any traumas for you. It's not even trauma, just complete lack of understanding. Okay, fair enough.
01:02:51
Speaker
No, for my sins, I have written far more front ends in raw C than anything approaching JavaScript. Oh my God, okay. It's just the nature of what I do. And the running joke is, when you meet anyone with a significant amount of domain expertise, you always end up with someone with a lopsided skill set. Mine says: I don't write JavaScript very often. Yeah. So perhaps to wrap this up, we can touch on that problem, because
01:03:20
Speaker
there's always the danger. Most places don't get the ideal architecture for what they're doing. Some of that is structural: management, cross-department constraints, time, budget. But some of it is not knowing what you don't know. Not knowing that MQTT would be the ideal solution here, or that ClickHouse would actually massively improve your solution over there.
01:03:45
Speaker
Beyond just stay curious and keep learning, have you got any suggestions for how people can fill their toolbox? I will be cheeky and say: listen to this podcast. It's the kind of cheekiness I love, thank you. Beyond that, the way I've genuinely learned a lot of what I've learned
01:04:06
Speaker
is you pull up a tool, and you've always got at least one, and you see what it can connect to. You just list out the connectivity. I found Kafka because I needed a queue system and I was looking for... You can Google for what queues you want to talk to, but back in 2016, you probably wouldn't have found Kafka that way. I found it back then because what I was looking for was a way to make things durable on disk in logs, literally.
01:04:31
Speaker
Which, when I was digging around, I found that exact phrase. But these days, when I'm learning things, I'm mostly finding it in the documentation of the tool I'm already in.
01:04:39
Speaker
There are some really powerful tools which have federated integrations. Let's say you've got the ClickHouse documentation: you can list the federated tables it offers. Say you don't know about AMQP: what is this thing? It seems ClickHouse can materialize tables out of it. So you go, what is AMQP? And then you can go on this little learning mission to see what's next. So you're basically treating what you currently know as nodes on a graph.
Solving Task Scheduling Challenges
01:05:03
Speaker
Exactly. The best thing about graph theory, to quote a friend, is that everything can be either a node or an edge. Fundamentally, if you think about things as a series of nodes, you're never going to need more than one. I would love to give the talk where I make everything just Postgres and use a Postgres instance for every single thing, from the queues to the processing engine to everything else. It will work.
01:05:28
Speaker
for a certain definition of well enough to prove a point. But, you know, Revolut do exactly this, by the way, if you look at their published architecture. Oh, really? Yes. But then again, they went down in a big way recently. I was just reading a Revolut horror story this weekend. So maybe all-Postgres is a bad idea.
01:05:50
Speaker
You come back to this idea that inevitably you'll have a problem, and the classic one is: I want to do a delayed task. This is one of the most genuinely hard problems I keep running into: how do I start a task at a defined time in the future? And the answer is, there are lots and lots and lots of average answers to this question. I've got to find a good one.
01:06:13
Speaker
Well, to reference that talk you mentioned, I think Anna McDonald's answer would be: that's probably not actually what you want to do. Exactly. And that is the thing. Being time-independent is possibly the most powerful thing you can do. Yeah. And it comes down to this idea of, say you want to do this, and then you keep exploring. You go, well, I have a crontab. No, please stay away from the crontab. And then you go, well, I can do cron in the database. Better, but...
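Here's a sketch of that "cron in the database" middle ground, leaning on Postgres's FOR UPDATE SKIP LOCKED so several workers can poll the same table without double-running a task. The table shape and polling interval are illustrative, and it's average in exactly the way described: it still polls.

```python
import time

import psycopg2  # the connection string below is a placeholder

# Assumed schema:
#   CREATE TABLE tasks (
#       id      bigserial PRIMARY KEY,
#       run_at  timestamptz NOT NULL,
#       payload jsonb NOT NULL,
#       done    boolean NOT NULL DEFAULT false
#   );
CLAIM_SQL = """
UPDATE tasks SET done = true
WHERE id = (
    SELECT id FROM tasks
    WHERE NOT done AND run_at <= now()
    ORDER BY run_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, payload;
"""


def worker(conn) -> None:
    """Claim and run due tasks; SKIP LOCKED lets workers race safely."""
    while True:
        with conn, conn.cursor() as cur:  # one transaction per claim
            cur.execute(CLAIM_SQL)
            row = cur.fetchone()
        if row:
            task_id, payload = row
            print(f"running task {task_id}: {payload}")
        else:
            time.sleep(1.0)  # nothing due yet; polling is the crude part


if __name__ == "__main__":
    worker(psycopg2.connect("dbname=app user=app"))
```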
01:06:37
Speaker
And then you look at something like temporal.io, which is not quite one of those DAG processors, those acyclic-graph processing engines for writing workflows. But it does have all these primitives like delay X and then do Y. And you think: that sounds interesting, is that what I should be using? And you look into it, and you look at how it's doing it, and you realize: yes, for some things; ideally never for others.
01:07:00
Speaker
But it's like having a TTL check, effectively. It does a really clever TTL check in whatever base database it's using, and then extracts that into your programming language for you. But then you come to the point where you say, well, what's next? And you say, well, I need to get the data out of that, so: change data capture. There are tools for that on my database. And this is a lot of why I have fun at Aiven: I get to say yes.
01:07:24
Speaker
Generally, we all know that Kafka solves a lot of problems, and definitely not all problems. I would challenge the statement that it's the new data lake any day of the week. That's for a different episode. I'll have that argument another day. That's a different argument. Yeah. The key thing here is that it could be used as one at a pinch. Yeah.
01:07:44
Speaker
And I disagree wholeheartedly, for all kinds of reasons. But the key question here is: what would be better? And the answer is, well, Hudi and Iceberg have first-class integrations with it. So why don't we find the thing which is optimal for the task we have? And with Aiven, actually, I'd start with ClickHouse, because you can write to either of those, but you also get that wonderful compression across columns.
01:08:06
Speaker
Yeah. So it comes back to this dream I have, because of all these nodes that connect to each other: the dream of actually building a map of computing. Exactly. And there's a Lord of the Rings thing there. Yeah. And then the signal goes through like the beacon fires. The data lake would be drawn as a real lake with things in it. The black marshes.
01:08:34
Speaker
Yes, as you can guess, the true nerd shines through. On that point, perhaps we'd best leave it with the dream of a map of programming.
01:08:42
Speaker
Exactly. And the idea of being able to traverse it with the right tool at the right place. Yeah. Yeah. And not get lost on the way. Oh, we wish. Probably would. Even then, because it's fun, right? It's always fun. And that's kind of the joy of it all. The answer is, it
Conclusion: Diverse Approaches in Tech Exploration
01:08:58
Speaker
depends. It's always thrown out as a bad thing. But the answer is, it depends mostly because we have more than one answer. And many of them are pretty okay.
01:09:07
Speaker
Yeah, yeah. And some of them are, most of them are worth knowing for later, for one day in the future. Yes. And it's always great to be able to say, yes, but. On that note, Ben Gamble, thank you very much for joining us and filling us with some new things to put on our map.
01:09:25
Speaker
Thank you very much for having me. It's a pleasure as always. Pleasure. See you. Thank you very much, Ben. We will be continuing to draw that map of the landscape over here at Developer Voices, so do consider subscribing if you haven't already. We will be back next week with more. In the meantime, you'll find links to the things we mentioned in the show notes, along with Ben's and my contact details if you want to get in touch.
01:09:51
Speaker
And if you have a particular expertise in some corner of that programming map, let me know. I'm always scouting for interesting new guests, new tour guides to show us around places. And with that, I will leave you until next time. I've been your host, Chris Jenkins. This has been Developer Voices with Ben Gamble. Thanks for listening.