The Internet's Role in Driving Change
00:00:00
Speaker
One of the strongest forces on the internet is the belief that someone else is wrong, or something else is wrong. That it's wrong, they're doing it wrong, and it needs to be put right immediately.
00:00:11
Speaker
And as we know, that's not always the most positive motivation on the internet. But it can be. It can be the thing that pushes us out of the status quo. The thing that forces us to make fundamental improvements, rather than just incremental ones. That burning feeling that the current way we're doing things is wrong. It can drive genuine revolutions in programming. It can upset the apple cart.
00:00:38
Speaker
And it can also be the drive that pushes us programmers out from behind the keyboard into starting some genuinely new businesses.
Who is Joran Dirk Greer?
00:00:46
Speaker
Some of the most successful businesses I know started with someone saying, the current way we're doing the software isn't good enough. And the only way I can change it is by stepping outside the current system.
00:00:59
Speaker
And that was kind of the starting point for today's guest, Joran Dirk Greer. He had an urge to build an open source database called TigerBeetle, and then build a company around it, born from looking at the current status quo and wanting to change things radically.
00:01:16
Speaker
And it's still early days for TigerBeetle. They are just about to go into production. So I'd say, you know, they're reaching the end of act one of their story. But I think that story, that journey from discontent to inspiration to launch, is one that speaks to all of us programmers who have a little slice of entrepreneurship in us.
Insights into Systems Design and Reliability
00:01:37
Speaker
And along that journey, Joran has developed some really interesting ideas about how we should be doing systems design, clustering, network communication, reliability, integration testing, all these things, burning to build something genuinely new. Which is putting Joran on a heck of a road, a long road, and
00:01:59
Speaker
yeah, we're going to meet him at a waypoint on that journey and hear the story so far. So I'm your host, Chris Jenkins. This is Developer Voices, and today's voice is Joran Dirk Greer. My guest today is Joran Dirk Greer. Joran, how are you doing?
00:02:27
Speaker
Hi Chris. I'm doing really well. Great to be here. Good. Good. Um, where are you coming from? You're coming from somewhere in South Africa. Is that right?
Joran's Entrepreneurial Journey
00:02:35
Speaker
Yes. Interesting thing is, we are actually in the same time zone, London time. Best time zone in the world, the most overlap with all the continents. But I'm calling in from Cape Town. Nice. It's on my list of places to visit. Okay. You haven't been yet?
00:02:52
Speaker
Not yet. Not yet. I'll find an excuse soon. Okay. Well, you've got a developer friend here, you know, so look me up when you come over. Now, the reason I've got you on: your story starts very much in Cape Town and the project around there, right?
00:03:12
Speaker
And one thing I'm always interested in is developers who have such an itch to scratch that they end up walking along that road saying, this should be a real product and a real business. So let's start with where you got the inkling of that. You're working on some project for payments. Fill me in. Fill me in. Yes. So I love that question.
00:03:37
Speaker
I can relate to that. I think this was, I don't know, a 16-year itch in the making. And then COVID happened, and COVID was a catalyst for a lot of people, I think, to do a lot of interesting projects. It was not an easy time. But yeah, I was always into coding as a boy.
00:04:03
Speaker
And I loved coding, and I always wanted to do a business. And so I actually studied accounting at university, even though, as people who knew me understood, it was actually on my list of things not to study. I had a list of ten things never to study. And number one,
00:04:22
Speaker
I think so. I had it in my head, at least: accounting was top of the list, and yet I always wanted to do a business startup. That was in my blood, in my nature. My grandfather emigrated from the Netherlands and built his own business, and my father built his own business.
00:04:41
Speaker
So it was just a given that I was always going to be an entrepreneur, or at least someone who starts a little company. And so I had this conflict: I didn't want to study accounting, and I wanted to start a business. And people spoke to me and said, well, look, accounting is a great way to see the world of business, because accounting is how you describe business.
00:05:08
Speaker
And so I thought, oh, okay. So I went and studied accounting instead of computer science. And then I actually got back into coding, thankfully, and I got into startups. But it was a long time, like 16 years or so, of just honing my skill and practicing and coding and coding and coding.
00:05:29
Speaker
That was a period where I could learn a lot and discover a lot of techniques and get into storage systems, how to write fast software, how to write safe software, and learn. Daniel Lanois, the music producer,
00:05:46
Speaker
used to say that he spent sixteen years in the cave, learning how to work with, you know, some engineering equipment. I kind of can relate to that, because I was in this cave in Cape Town, somewhere inside Table Mountain, just for a decade or more, just practicing.
00:06:04
Speaker
And then, yeah, and then COVID happened.
Challenges in Improving Transaction Speed
00:06:08
Speaker
And the next thing I knew, I bumped into Adrian Hope-Bailie on a soccer field, and he said, well, you know, he's working on this payment switch. And I had no clue what a payment switch was. But he brought me in to consult on it and to see, you know, how can we make this payment switch faster?
00:06:30
Speaker
And that's sort of, you know, fast forward: that was now 2020. And yeah, what was interesting there was that I had this background of doing things differently, you know, living in a cave in Table Mountain, honing my skill. And
00:06:54
Speaker
so I had specialized in the Galapagos Islands and become this interesting database developer creature. You evolved along a different genetic path than everyone else. Yes, yes, exactly. And then what happened was I came to this payment switch, and we had to see how we could increase the number of transactions per second. If you gave it a lot of high-end
00:07:24
Speaker
cloud hardware, you could get 2,000 transactions a second, which is a lot. But that wasn't good enough for me. I thought, well, can't we do 200,000 a second? Can't we do a million? Why not? Because I understood from some of the experiments I was doing that even a spinning disk is really fast: it can write sequentially at 500 megabytes a second.
00:07:51
Speaker
And that's not even solid state. With flash, you already get devices that are beyond three gigabytes a second. And if you think of the raw materials you have: you have disk, you have network, you have memory and you have CPU. You've got these four physical raw materials, and you can reorder them in an interesting combination.
00:08:17
Speaker
And why can't you process transactions really quickly? We've got these incredible raw materials. I looked at the switch and from the very first second of the engagement, I was already drawing on the boardroom table.
00:08:32
Speaker
Here's another sketch for something totally different that could power the switch, kind of like Iron Man: take out the old heart and put a new one in, and suddenly you've
The Search for Efficient Transaction Solutions
00:08:43
Speaker
got all this energy. But at the same time, roughly, what was the existing architecture?
00:08:51
Speaker
Yeah, sure, thanks. Great question. I think this would be helpful for people listening in. The existing architecture was a MySQL database. SQL rules the world. You had a SQL database, you had 10,000 lines of application code,
00:09:10
Speaker
around the database, like we all do. And obviously there was a lot of Kafka, streaming everywhere, and that's still there; that's kind of orthogonal. So the heart of this payment switch was a SQL database that was tracking transactions: basically bank balances and transfers between bank accounts for the participant banks using this payment switch. So
00:09:39
Speaker
You can think of it as like eight rows in a SQL table, eight banks, eight rows.
00:09:46
Speaker
and then you're doing a lot of transactions across these rows. So it's actually so simple, you know, really simple. You've got a table of accounts, and there's eight accounts, and then you've got a table of transactions, and that table is big. That's the whole financial history. But the interesting thing is that you're really just moving numbers: you're doing addition and subtraction across eight rows.
00:10:16
Speaker
And to do that, there were 10,000 lines of application code. Okay, this is always one of those things where, until you actually see the code, you don't realize the complexity you're not seeing. But on the surface of it, that seems like disappointing performance, given that you're only swapping money between eight different accounts.
00:10:42
Speaker
Yes, yes, yes. And I think this is the interesting thing I learned: in general, as we work with a SQL database, we denormalize our data or whatever; we've got indexes and we've got tables and we've got schemas and design. But how often do we stop to think: how many rows are going to be in this table?
00:11:10
Speaker
Are they going to be eight rows, or eight hundred thousand, or eight million, or eight billion? And yet what's very interesting with this actual domain is that very often you only have, say, eight banks, so eight rows. In practice you actually have a little bit more, so I am simplifying: maybe you've got sixteen, maybe thirty-two, some multiple of the number of banks.
00:11:36
Speaker
But very often, you only have four banks, or eight banks, or 16. It's pretty much the same order of magnitude. So the table is actually very simple in that sense. You can see it. This is like an Excel sheet that could fit on a screen without horizontal or vertical scrolling. Yeah, yeah. And probably in the prototype.
00:12:01
Speaker
Yes, exactly. But the performance, then, of 2,000 transactions a second is actually not bad for a SQL database. That was the other thing I learned: if you do one network round trip to do a transaction,
00:12:21
Speaker
you can only physically do so many, according to your network bandwidth and according to the row locks being taken in the database. And remember, the database also has to write to disk for that transaction, and it has to call fsync.
00:12:38
Speaker
And there are only so many fsyncs that a disk can do a second. Obviously with flash that's gone way up, but when you put all this together, you know, row locks, individual row locks, and network requests and round trips and latencies and
00:12:56
Speaker
disk latency, there's always a fixed cost that you pay to write even the smallest unit of information to disk durably. You're going to pay a minimum fixed cost. And then as you write more, you're going to pay more, but you've always got this minimum threshold whenever you want to touch the disk.
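(To make that concrete, here's a rough sketch of the napkin math in Zig. The fsync latency and batch size are illustrative assumptions, not measurements from the episode.)

```zig
const std = @import("std");

// Napkin math, not a benchmark: if every logical transaction pays the
// fixed cost of one fsync, the fsync latency caps your throughput.
// Batching many transactions per fsync amortizes that fixed cost.
pub fn main() void {
    const fsync_latency_us: u64 = 100; // assumed ~100us per fsync on NVMe
    const txns_per_sec = 1_000_000 / fsync_latency_us; // one txn per fsync
    std.debug.print("one fsync per txn:   ~{} txn/s\n", .{txns_per_sec});

    const batch_size: u64 = 8_000; // amortize one fsync over a whole batch
    std.debug.print("one fsync per batch: ~{} txn/s\n", .{txns_per_sec * batch_size});
}
```

Mechanisms like group commit, which comes up below, push real databases from the first line toward the second.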
00:13:16
Speaker
But then if you think about it, we typically use SQL databases and we do one SQL transaction for one real-world event. So one real-world event is one physical SQL transaction. In other words, one logical transaction, one financial transaction, is one physical SQL transaction. And that has a minimum cost, right? And that has a minimum cost, exactly. And so the 2,000 is not bad, because,
00:13:45
Speaker
you know, Simon Eskildsen has got a fascinating blog called Napkin Math, and he dives into this in one of his posts: how many fsyncs can MySQL do, or Postgres? And that's a really interesting question, because it brings in things like group commit. Obviously the database is going to
00:14:05
Speaker
internally batch some of these transactions to try and amortize that minimum cost. But when you look at it, hand-wavy, rough napkin order of magnitude, maybe you can do ten thousand transactions a second. That's like your theoretical limit for these designs. Okay, so now, having convinced me that that's pushing on the limit, what makes you think you can do better? What makes you think you can get up to, like, half a million, or whatever numbers you're aiming for?
00:14:33
Speaker
Yes, so this comes back to like, you know, scratching the itch for a long time.
Innovations in Transaction Processing
00:14:40
Speaker
And what was funny was, you know, on the very first day of the engagement, we were sketching on the tables in the office. I was thinking about this this morning: there are these white tables and you can draw on them
00:14:53
Speaker
with a marker, and then you've got a little eraser, you can rub it out. All the tables in the office, you're actually supposed to draw on them; they encourage graffiti. And I mean, if it wasn't for that, maybe TigerBeetle wouldn't exist, because we were in the boardroom where I am now. We were
00:15:12
Speaker
at this conference for this payment switch, we were looking at the architecture, and already I was just drawing on the table. It was my first day in the office, and I didn't feel bad drawing on the table. But I was already sketching out something that I thought could do much better.
00:15:29
Speaker
And to answer your question: it all comes back to just taking a step back and saying, well, look, we've got incredible raw materials at our disposal. You've got NVMe that can now do three gigabytes a second of sequential writes. We have raw information; there's a minimum size in bytes to capture the information content
00:15:52
Speaker
that's being processed by the system, by these raw materials. But what if we could reorder the composition of these raw materials in a more optimal way? Basically, you have information that flows into the switch. Network bandwidth isn't really a problem;
00:16:13
Speaker
it can handle quite a lot of information. Then you have to write this information to disk, and that has a minimum cost. So there's an advantage if you write, say, one megabyte. Not one megabyte a second, but one megabyte for each syscall, you know, because you don't want to write one byte at a time,
00:16:40
Speaker
or 100 bytes, or 1,000 bytes; you want to do something a little bit bigger, because that just amortizes the syscall cost. And if you start to do that, well, you can see: after that, you've written it to disk durably. Then you have to run it through your in-memory state. And you can picture your in-memory state as just a few
00:17:07
Speaker
hash tables. So again, you've got one hash table with eight keys in it, and each value is the bank account balance for one of these eight banks. Again, I'm simplifying, but this is really
00:17:25
Speaker
the skeleton, the bone structure of this human that we're trying to create; you can start to see the form. If you can understand this, that you've got one hash table in memory, you've got eight keys, eight balances, well, how fast can you write to a hash table? And actually, this was some of the work I had done experiments on: some of the new high-performance hash tables can do at least, say, 10 million inserts a second,
00:17:55
Speaker
which makes sense, because the memory bandwidth is pretty good. And you can look at how many nanoseconds it takes to do a cache miss, to go to a location in memory; it's maybe seventy nanoseconds or better. And you work that all back, and you know that
00:18:14
Speaker
there's also memory parallelism: the chip can access 10 cache lines and pull them from main memory in parallel. You start to see what a hash table can do. And now you've got network, then you write to disk durably, then you do a lot of hash table operations, and then you're done. And you send an ACK back across the network.
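(A minimal sketch of that in-memory picture, assuming a plain hash map of eight balances. The names and types here are illustrative, not TigerBeetle's actual data structures.)

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    // One hash table, eight keys: the balance of each participant bank.
    var balances = std.AutoHashMap(u64, i64).init(allocator);
    defer balances.deinit();

    var bank: u64 = 1;
    while (bank <= 8) : (bank += 1) try balances.put(bank, 1_000_000);

    // One transfer: debit one account, credit another. This is the hot
    // path, and it's just two hash table lookups and two additions.
    const debit = balances.getPtr(3).?;
    const credit = balances.getPtr(7).?;
    debit.* -= 250;
    credit.* += 250;

    std.debug.print("bank 3: {}, bank 7: {}\n", .{ balances.get(3).?, balances.get(7).? });
}
```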
00:18:40
Speaker
And that's kind of doing this event sourcing trick, where you capture the raw data, store that durably, and then do the onward processing. And then you don't have to worry about the onward processing being quite as durable, because you can always reconstruct it. That's right. And that's kind of using a log. So event sourcing, or thinking of a database as a log. There's that classic saying that
00:19:04
Speaker
the database is the log, and everything else is just a cache. And I had a chat with Alexander Gallego of Redpanda, and I asked him, because I see Redpanda as a database, which is a funny thing. People will say, well, isn't that crazy?
00:19:20
Speaker
Yeah, yes, yes. So I asked Alex, I said, you know, some of our friends will say, well, is it really a database? Obviously, I think it is. And I'm gradually coming to the conclusion that if you tell someone who's used to databases that it's a database, they'll be disappointed. But if you tell someone who's used to logs that it's a database, they'll think, yeah, this has way more power than I thought originally.
00:19:46
Speaker
Sorry, you had a quote, I interrupted you. No, well, that's the quote: the database is the log, and everything else is a cache. But that is actually how Postgres works, MySQL, all these systems. They have write-ahead logs. They have a log, and that's really the master record that you don't get to lose.
00:20:06
Speaker
Yes, they put it in the log, then there's some processing. There's going to be more data written to disk as indexes are updated, but really, all of that you can think of as cache.
00:20:19
Speaker
This is also how databases guarantee atomicity, so crash consistency through a crash, or what they call crash consistency through power loss. The database is busy doing things, updating indexes, doing a lot of work, and then you pull the cord.
00:20:38
Speaker
Really, it comes back to the log. At startup, it's going to go back to the last known location in the log, and it's going to redo that work. There are many ways, obviously, of specializing on this, but this is the big idea: you've got network, and you've got a log, and then you've got in-memory work, and maybe you'll have more storage work later.
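(A sketch of that "the database is the log" idea, under assumed types and file handling. The real systems discussed here do considerably more, such as checksums and torn-write protection.)

```zig
const std = @import("std");

// Append the event to the log durably first; only then update in-memory
// state. After a crash, the in-memory "cache" is rebuilt by replaying
// the log from the last known good position.
const Event = extern struct {
    debit_account: u64,
    credit_account: u64,
    amount: u64,
};

fn appendDurably(log: std.fs.File, event: Event) !void {
    try log.writeAll(std.mem.asBytes(&event)); // 1. write to the log
    try log.sync(); // 2. fsync: the event now survives power loss
    // 3. only now apply it to hash tables, indexes, and other caches
}
```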
00:21:02
Speaker
So I kind of did see this through the same lens. And this is even how file systems work. For example, ZFS also has a log, and then there's copy-on-write.
00:21:16
Speaker
But taking a step back, I thought, well, the information content of a financial transaction, I mean, this is another question.
OLTP and Double-Entry Accounting: A Business Framework
00:21:28
Speaker
What is a transaction in the OLTP world? You've got the who, the what, the when, the where, and the why. So that is OLTP: who, what, when, where, why, how much. And you need to do a lot of that, and you want to be highly available, online.
00:21:45
Speaker
And I think that is OLTP: who, what, when, where, why, how much. And basically, if you squint and look at this, which I did, as an accountant,
00:21:57
Speaker
I said, hey, OLTP is really double-entry accounting. It's the same thing, because double-entry accounting has been the schema for hundreds of years to model businesses, any kind of business that wants to track business events. They want to track the who, what, when, where, why, how much. So it's the original event sourcing, right?
00:22:22
Speaker
Exactly, the original event sourcing. And it has all these properties that are very friendly for our raw materials of compute, storage, and network. Alice, Bob: that's the who. What? Well, Alice reimbursed Bob, and there's the why.
00:22:42
Speaker
Okay, it's the reimbursement. The how much is the amount. The when: when did the transaction happen? Maybe there's another when as well: when was it recorded, and when did it really happen? There's a where as well, like you want to track jurisdiction. And I was thinking about this yesterday.
00:23:01
Speaker
As a developer, it took me a long time to realize: yes, UTC timestamps, just use UTC timestamps. And then it took me even longer to realize: no, UTC timestamps are not enough. You want to track the where as well, the identifier of the locale, so that you can handle DST, or the classic calendar app: you want to show someone a time that makes sense in their time zone.
00:23:27
Speaker
Yeah, UTC is data and everything else is formatting. That's my opinion. Yeah, exactly. Yes. And if you know the when and you know the where, you can do the formatting. Yeah. Okay. But double entry, actually. Sorry.
00:23:42
Speaker
Yeah, double entry actually captures all of these quantities for you. It's a transaction between two people or places, something moving from somewhere to somewhere else, like Google Maps. Get me directions in Google Maps: that's double-entry accounting. I want to go from Cape Town to London, how long is it going to take? It's place to place, or person to person. It's OLTP.
00:24:10
Speaker
I'm obviously being very hand-wavy here, but I really think, in principle, the heart of OLTP is tracking these things, these business events. And double entry is a great schema for that. Yeah, I got my career start with the software systems of double-entry accounting. I still think it's one of the great data models, one of the great understandings of how you model what's actually happening in the world.
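(Here's one way to picture double entry as an OLTP schema: the who, what, when, where, why, and how much in a single fixed-size record. The field names are illustrative assumptions; TigerBeetle's real transfer type differs in its details.)

```zig
const Transfer = struct {
    id: u128,
    debit_account_id: u128, // who: the account value moves from
    credit_account_id: u128, // who: the account value moves to
    amount: u64, // how much
    code: u16, // why: e.g. a "reimbursement" reason code
    jurisdiction: u16, // where: a locale identifier, kept alongside...
    timestamp_utc_ns: u64, // ...the when: UTC is data, locale is formatting
};
```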
00:24:36
Speaker
We should get back to your journey, because you've got all this double-entry accounting knowledge from your accounting training, and you've got an architecture you think is going to work, written on a desk, which is the first time I've heard of that as a design tool. And it got rubbed out, Chris. It got rubbed out. That's why the back of a napkin is more durable. Yes, yes. But how did you get to the point of actually cutting this into software and seeing if it worked?
00:25:07
Speaker
So yeah, I think that's the great tension, because you always have to build trust. If you're a consultant on a project, you can't jump in and say, well, we're going to redesign Postgres or MySQL into the optimal configuration of raw materials from first principles.
00:25:29
Speaker
So this was the best part, I think: we actually looked at the switch and we said, well, the heart of the switch is OLTP, it's SQL transactions. There are tens of thousands of lines of code. We could read all of this, or we could see where we talk to the database and instrument that. Let's trace every single SQL query that flows through the system, and
00:25:55
Speaker
let's send a lot of payments through, a lot of financial transactions, a lot of OLTP. Let's see the physical transactions that are being issued across the network. So we traced the SQL, and what we saw was that for one logical transaction, there would be, say, 10 to 20 physical SQL transactions. And you could optimize that.
00:26:21
Speaker
Some systems do one SQL query for one logical transaction; you can do that. In this particular switch, there were good reasons for it being between ten and twenty. But this was interesting, because you start to see there's an impedance mismatch,
00:26:45
Speaker
because we're really trying to do OLTP. We're trying to do transaction processing, real-world business transactions, business events.
00:26:55
Speaker
Yes, using physical database transactions. But as far as I understand, the original meaning of OLTP was always in the sense of: what does the user need first? And then, yes, you have physical transactions to implement that, because they are real-world transactions.
00:27:16
Speaker
But you see, there's an impedance mismatch: for every payment there were like 10 to 20 network requests, fsyncs, etc. Yeah, there's a clear disconnect between the logical world and the reality of how the software is modeling it.
00:27:30
Speaker
Yes, exactly. And in the past, this didn't matter because disks were so slow. It really didn't matter. And transaction volumes were never nearly that high, right? Exactly.
00:27:48
Speaker
Exactly. So this is my thinking lately: I think for a long time OLTP was very welcoming. It said, yes, and we're also general purpose. You can put all your data, everything, in the same database; we're OLTP, but also a general purpose database. And that was fine for a long time.
00:28:16
Speaker
And then I think there's a divergence of OLTP and general purpose, because of this impedance mismatch. So coming back to your question, when did we decide to do something different? I'm really grateful that we didn't do it immediately. We first analyzed the switch, and we could get these insights, really understand the problem first, rather than come up with a solution in search of a problem.
00:28:40
Speaker
We could really understand the problem deeply and see: yes, actually, there are only on the order of 8, 16, or 32 rows in this table, and there's a highly contended workload. So straight away, you start to see it actually doesn't make sense to horizontally shard, because there's contention. Double entry: you're going to update one bank account and update another one.
00:29:06
Speaker
And sometimes you want to update all the rows, because you're doing very interesting financial contracts that literally touch all the balances. So, counterintuitively, sharding isn't going to solve the problem.
00:29:20
Speaker
Now it's going to make the CPU wait even more as you start doing network requests. I think we did this work for like six months or so, or three months in actually. Then there was just this period where I think 16 years had built up.
00:29:42
Speaker
And it was actually a Sunday afternoon. It was raining. My one co-founder does his best work when it rains, and I think I do mine when the sun shines. But it happened to be raining. I had the fire going, and I was listening to the Black Keys, the album El Camino, which is great.
00:30:02
Speaker
It was a Sunday, and COVID, and I just banged the keys for five hours and sketched out these raw materials. We got this prototype that could do 200,000 two-phase transfers, which is, you're doing everything twice, so basically it could do 400,000 financial transactions a second.
00:30:27
Speaker
And this was just a rough performance sketch, like back of the envelope, and it was JavaScript. But it had all the ingredients: it had fsync, it had cryptographic checksumming, even.
00:30:44
Speaker
Yeah, just to really see. So it's more than just a very simple prototype. Yes, yes. It was trying to see: if we sketch out the network protocol, the disk protocol, the in-memory hash table operations, the real thing, could it actually work?
00:31:03
Speaker
It could work. So it was very basic, but after five hours, and maybe a few days' more work, we actually did plug it back in. We did a heart transplant. We took the ledger database out, this MySQL and application code. We took that out and we put in what we called ProtoBeetle. We put ProtoBeetle into the switch, and it took a day, because the design had come out of the switch, so we could put it back in. It was extracted from a real problem.
00:31:33
Speaker
And straight away. When we evaluated the switch, we had this trick where we would evaluate it on really small hardware. It's very easy to do benchmarks where you have the best hardware. What we would try to do was have a minimum viable deployment: what's the smallest GCP instance we can deploy this on, and the minimum number of instances?
00:31:59
Speaker
So we used to benchmark like that, and in those configurations, we would only get 76 transactions a second. And when we put ProtoBeetle in, we could get to 2,000 without fixing any of the other performance issues.
00:32:17
Speaker
Yeah, so ProtoBeetle itself, if you used it properly, could do 400,000 a second. There was a lot of other overhead in the switch, but it already made an order-of-magnitude difference. Was that with the same level of guarantees about transactional safety and durability, or were you cutting corners on that?
00:32:40
Speaker
Yeah, so ProtoBeetle was just a sketch, so it definitely wasn't beta-ready. But it did do cryptographic checksumming, it did do logging to disk. We were basically trying to evaluate the performance, whether this could work. So it wasn't a plane that you would sell tickets for, but it was a plane that you could launch off a cliff and see that it flies.
00:33:08
Speaker
And whether it lands is another story. It didn't have landing gear, didn't have parachutes. Okay, we best go on to what happened next, after that prototype, because I'm hoping that we can actually fly in it at some point. I think that's the whole question. That's
Building a Better Database: TigerBeetle
00:33:30
Speaker
the heart of it. You can say to people, hey, we've built this plane and it's so much faster.
00:33:37
Speaker
And they're going to ask you: well, is it as safe as our regulated planes? Is it going to land me safely? And you'll be like, no, but it's really, really fast. And then you realize how much work goes into safety. How do you make a plane really safe? Yeah. And how much speed do you sacrifice on that road?
00:34:02
Speaker
Yes, exactly. But being involved with the payment switch again was a fantastic opportunity to learn because you realize that actually it's not good enough to say we're going to be as safe as all the regulations and the standards and the existing planes.
00:34:19
Speaker
Because people are still going to say: well, you say you're safe, but you're still a startup. It doesn't connect with people on a human level to say that you've built a new database and it's as safe as something that's 30 years tried and tested. People are going to say, well, that's table stakes, and I'm still going to err on the side of the 30 years. I know you're as safe, but
00:34:48
Speaker
you're not 30 years. You've got an uphill battle on trust, if you want to turn that into a business, definitely. Exactly. So it comes back to trust. And then we realized, well, there were a few factors in this decision, that actually these systems were 30 years old. You spin the coin around one more time, you carry on with that, and you go: well, actually, they're 30 years old, and there's
00:35:17
Speaker
been tons of storage research. These were those experiments, again, in the cave: places like UW-Madison and the FAST conference, incredible research every year testing these systems. And that's the other side of the coin. They are tried and tested, which means the research knows how they fail: all the latent correctness bugs, how you can lose data through fsyncgate, how in a lot of consensus systems
00:35:47
Speaker
the design of the write-ahead log is actually not fundamentally safe, and you can have global cluster data loss with Raft, with a lot of Raft clusters. Some of them are patched for this, but the stock-standard Raft consensus
00:36:04
Speaker
doesn't get you there safely. So there was a lot of research around safety, storage fault research, distributed systems research, which was also at UW-Madison. And then we started to see, OK, yes, we are going to build a database.
00:36:23
Speaker
And it's the perfect time to build it 10 times safer. So again, orders of magnitude, it is so much fun. You realize, hang on, that moment of tension, yes, we're just going to do this prototype. You get past that. And then you realize, well,
00:36:43
Speaker
now we can do this 1,000 times faster and 10 times safer and 10 times cheaper. Because actually we made it 10,000 times faster, but let's give 10x of that back for cheaper, smaller hardware; rusty hardware instead of state of the art.
00:37:01
Speaker
So how can we make something that's really fast and small and much safer because that moves the needle and then that's really what you want is far more safety.
00:37:16
Speaker
Happy to dig into more of this. Let's get concrete about the implementation, because you've got your design idea there, the prototype has proved itself, and you've decided to go ahead. You've made some interesting choices, I know, about how you're going to implement this. The first, most obvious one that jumps out at me is your choice of programming language.
Why Choose Zig for TigerBeetle?
00:37:42
Speaker
Yes. You didn't stick with JavaScript, as in the prototype. Yes, I didn't stick with the memory-safe language. Yeah. So I think probably the most surprising thing for someone, if they ask: why Zig? Zig is the language you've chosen. Yes, Zig is the language. Why not C? Why not Rust? Why not JavaScript? Why not Go?
00:38:12
Speaker
And I think the most surprising part of the answer is just to say, well, I truly believe, believed then and now that Zig was actually the safest choice. If you look not only at memory safety, but safety of the system. So safety is a much bigger thing than only memory safety. Safety has to do with correctness. What makes for correct control flow?
00:38:40
Speaker
And I think what makes for correct control flow is simple, explicit control flow and a minimum of abstractions, versus zero-cost abstractions. So I would rather,
00:38:56
Speaker
I think, more important than zero-cost abstractions is to have a minimum of excellent abstractions, because that reduces the probability of leaky abstractions. It's subtle things like this. Just quickly, step back: for those that don't know the programming language Zig, give us the headline features.
00:39:20
Speaker
Yes. Okay, great. So I was actually coming at this as a C developer. So around this time, I'd been doing most of my coding in C and I was looking for a C replacement.
00:39:35
Speaker
I loved Rust. Like Ryan Dahl, I was just one of the people following him. He had written Deno with the backend in Go, and I said, well, Rust would be great, because then you don't have two GCs.
00:39:52
Speaker
But then, I was a C developer at heart, and I love the simplicity of C, and the control and precision of C. But Zig came along, and it just fixed everything that was wrong with C,
00:40:09
Speaker
and it also gave you far more safety. It gave you checked arithmetic, which I thought was very important. And actually, if you look at a lot of newer languages, if you look really closely, most of them don't actually enable checked arithmetic by default in safe builds. What checked arithmetic is: if you're going to do integer operations on a type, like a u64 or, let's say, a u32, and that integer arithmetic is going to overflow,
00:40:39
Speaker
in a lot of languages that's just undefined behavior, or it's a wraparound. Because you'd chosen too small a type. Exactly. But in many languages, you would be surprised. I think if everyone went home and double-checked this: in a safe build, is the default setting checked arithmetic enabled? Usually the answer is no, which is very surprising.
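(A small demonstration of what that means in Zig specifically. In the safe build modes, Debug and ReleaseSafe, plain arithmetic is checked and overflow panics; wraparound has to be requested explicitly.)

```zig
const std = @import("std");

pub fn main() void {
    var wrapped: u32 = std.math.maxInt(u32);
    wrapped +%= 1; // explicit wrapping operator: fine, wrapped is now 0
    std.debug.print("wrapped: {}\n", .{wrapped});

    var checked: u32 = std.math.maxInt(u32);
    checked += 1; // panics in Debug/ReleaseSafe: "integer overflow"
    std.debug.print("never reached: {}\n", .{checked});
}
```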
00:41:03
Speaker
And actually you want checked arithmetic on where safety is mission-critical. I had done a lot of security work in the cave, you know, some R&D on how you do static analysis to detect zero-day exploits.
00:41:19
Speaker
And it worked; you could catch them. It was really fascinating, but a lot of it came down to: let's look at a zip file, and let's see, in this zip file format, is there
00:41:35
Speaker
arithmetic overflow happening in these values? And if there is, chances are it's almost certainly a zero-day. It's going to attack the antivirus software that opens it up, because you're going to get arithmetic overflow, which will allow the malware to exploit something else. You chain these things together, and you get a CVE. So checked arithmetic is really important if you look at things from a security perspective, I think.
00:42:05
Speaker
So flip that around from security, and that's safety; it's a big ingredient. Zig was one of the few languages that actually had it enabled in safe builds. I thought, wow.
00:42:19
Speaker
That's a great design decision; I wish more languages had that. And then it had bounds checking, because if you're in C, it's like walking blindfolded on the cliffs of Dover: you're going to fall off so easily. You really want bounds checking, and Zig gave you that. And actually, for distributed systems, the challenge isn't really memory safety bugs. There was some paper on this; I think
00:42:47
Speaker
on the order of 90% of the major incidents in distributed systems are actually from the lack of error handling around syscalls. Zig was really great for that. Zig gave you first-class error handling, and the compiler would check if you don't do proper error handling. You can't just ignore stuff.
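(A sketch of what that first-class error handling looks like. The error unions force the caller to deal with a failing write or fsync; whether to propagate or handle locally is the caller's explicit choice.)

```zig
const std = @import("std");

// The `!void` return type is an error union: writeAll and sync (fsync)
// can fail, and the compiler will not let those errors be silently dropped.
fn logTransfer(file: std.fs.File, bytes: []const u8) !void {
    try file.writeAll(bytes); // may fail: NoSpaceLeft, InputOutput, ...
    try file.sync(); // fsync can fail too, and ignoring that can lose data
}

// Callers must either propagate with `try` or handle explicitly:
fn caller(file: std.fs.File, bytes: []const u8) void {
    logTransfer(file, bytes) catch |err| {
        std.debug.print("durable write failed: {s}\n", .{@errorName(err)});
    };
}
```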
00:43:08
Speaker
Even now, Zig has a design decision that people have pushed back on: no unused variables. People say, well, I want to have my unused variables, but in terms of safety, do you really? In TigerBeetle, before Zig made this change, we saw we could have had some latent correctness bugs. We found them already.
00:43:33
Speaker
But this principle of not allowing unused variables is a good principle from a safety perspective. And Zig is a very simple language. But fundamentally also, if you're a database, you need to handle memory allocation failure. You also want very fine-grained control of the allocators that you use. You don't want to have
00:43:56
Speaker
just a global allocator hidden somewhere, and a panic. So again, it was either going to be C or Zig, and the way I motivated this internally was I said: well, we're going to do C.
00:44:12
Speaker
And then everybody panicked, and they said: no, please, I want to be able to build this on my Windows computer. Please give me a proper toolchain. And Zig had such a phenomenal compiler: you can cross-compile from Windows to Linux or to M1 chips. It was the first compiler to do that.
00:44:33
Speaker
So it's just got a fantastic developer experience story. It's a great, lovely compiler. And it's also one of the few languages investing in the toolchain; the compiler is so important. So it was always going to be Zig, I guess, but threatening C helped. That is a Captain Kirk-level bluff there.
00:45:01
Speaker
Well, I think it was unintentional, but retrospectively, let's claim it for Kirk. Claim it as a win. Claim it as a win, yeah. Okay. So we've got the architecture, you've gone for Zig. Another thing I know you've wanted to do differently is the way you approach testing, which I thought was very interesting in the architecture. Take me through that.
00:45:26
Speaker
Yes, thanks Chris. I guess just to add, both Zig and Rust, they're phenomenal new systems languages. Again, are we going to invest in the last 30 years of C? If we're going to write a lot of code, where do we want to be in 20 years' time? What language do we want to be using? For sure, like Rust or Zig.
00:45:49
Speaker
So that kind of made it very easy. So, testing. Yeah, that was the other challenge, because building a database means building a storage engine. If you're a single-node database, you have to build a storage engine, and that's something like an LSM tree; usually that's the big engine that powers these things these days. And typically, if you want to build this LSM tree and get it production-ready,
00:46:18
Speaker
they say it's about five years, not only to build it but to test it; most of the time is dominated by testing. So you can build it in a year,
00:46:32
Speaker
but then you tune it for another year, and you test it for another three years, and then it gets widespread adoption. But that's five years. And the problem was, we needed a database that was not only single-node like MySQL or Postgres, because we wanted a great operator experience:
00:46:51
Speaker
you just get automated failover and high availability. So you run TigerBeetle as a cluster of, like, three or five nodes. We actually support some more interesting configurations, but that's the general idea: a cluster of five nodes, and your primary goes down, and the cluster will automatically elect a new one. It's basically backup and
00:47:15
Speaker
recovery, automated for you, instead of you doing manual failover in a multi-master setup for Postgres. You don't want to do that manually at 2 AM. Consensus can do that for you in a way that's automated and tested and highly available; the switchover happens within milliseconds. Slightly aside, but did you roll your own consensus mechanism, or have you used something like ZooKeeper?
00:47:44
Speaker
Great, great question.
00:47:47
Speaker
I think, again, the answer is always safety, and I think it's surprising. So yes, we picked a new systems language, we wrote a new storage engine, we wrote a new consensus protocol, and the answer was, surprisingly, always safety. Because again, the research was showing problems with the way existing consensus protocols were proven. Yes, they had formal proofs for Paxos and Raft,
00:48:17
Speaker
But the formal proofs missed something. They only considered how the network fails. They didn't look at how the storage fails. So if you want to build something that is correct for Paxos or Raft, you have to guarantee that the disk never fails. So each disk of the cluster may never fail because otherwise the promise that was given during the voting phase of consensus
00:48:43
Speaker
that promise could get revoked, which could undermine the quorums, which could cause the cluster to regress into split-brain and global data loss. These two protocols rely on perfect disks. In the consensus world, they call it stable storage. And in the formal proof, there is no storage fault model; it just says the disk is perfect. And that's fine:
00:49:09
Speaker
you can then solve that locally with logical RAID, like ZFS. You can't solve it with most RAID, because with most RAID, if there's a corruption on one of the disks, it's a guess whether you recover in favor of parity or in favor of the corrupted version. It's just XOR happening across the stripe. There's research on this, too. RAID will make sure that your disks are in sync,
00:49:36
Speaker
but if there's corruption, there's not enough information, no checksums, at the block device level to know which is the corrupt version in the stripe and which is the real version. ZFS can get this right. Sure, you can run your consensus over ZFS.
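(A toy illustration of that point, with a single-byte "stripe": XOR parity can tell you the stripe is inconsistent, but without per-block checksums it cannot tell you which member is the corrupt one.)

```zig
const std = @import("std");

pub fn main() void {
    const a: u8 = 0b1010_1100; // data block A
    const b: u8 = 0b0110_0001; // data block B
    const parity = a ^ b; // parity block

    const a_corrupt = a ^ 0b0000_1000; // one flipped bit in A
    // The check fails, but the very same mismatch would appear if B or
    // the parity block were the corrupt one instead:
    std.debug.print("stripe consistent? {}\n", .{(a_corrupt ^ b) == parity});
}
```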
00:49:52
Speaker
But now you're paying the cost of 3x or 5x global replication plus 3x local replication, so it's extremely expensive. So what did you do? So these were the things that motivated us. We thought, well,
00:50:13
Speaker
the big problem, too, is that if the storage engine takes five years to test, the consensus protocols take 10 years, and they still haven't found all the bugs. So even if we take an off-the-shelf protocol, is it still safe? And do we understand it? Because we can take it, but then we have the responsibility to audit the code.
00:50:38
Speaker
We can't just bring in a dependency if we don't understand it. And understanding an existing implementation of consensus, that is, again, going to take a year or two to build up that in-house knowledge.
Ensuring Reliability with Custom Protocols
00:50:54
Speaker
What we wanted was a storage fault model: we wanted to handle disk corruption, and to do that efficiently, without local replication, because we already have the global replication. And there's been a ton of research on LSM trees in the last 10 years, since RocksDB and LevelDB, so we didn't want to take those two, because
00:51:14
Speaker
there's a lot you can do these days. For example, those have one-second write stalls, which will bump the p100 latencies of your database. So, long story short, we basically realized that
00:51:32
Speaker
there's this paper from UW-Madison: Protocol-Aware Recovery for Consensus-Based Storage. In other words, how do you build a distributed database in 2020? That's the paper. And the answer is that you really need to
00:51:48
Speaker
design your local storage engine to work with your consensus protocol, and vice versa. So what that means is: your storage engine, your LSM, yes, it must have checksums, but if it finds a checksum error,
00:52:04
Speaker
you don't want to take any local action immediately. You don't just want to truncate your write-ahead log, because that could actually lead to data loss in the consensus protocol context. What you really want is for the LSM and the write-ahead log of the database to be designed first-class for your global consensus protocol, and vice versa. Then, if you see a local corruption, you can actually ask the cluster and get a quorum
00:52:33
Speaker
correctly, and then you can ask: what is the correct thing to do? Am I allowed to truncate this piece of data?
00:52:42
Speaker
And the cluster can say to you: yes, for sure, we know this was never committed, it's safe to truncate. And that way you actually keep the cluster going much longer, because now you know you can unstick things. Yes, it's safe, you can quickly recover, truncate the log, you know for sure it's safe. But there are cases where you can't truncate the log, because that log contains committed transactions that you need in order to preserve your votes to the cluster.
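(In pseudocode-level Zig, the decision being described looks something like this. It's a sketch of the idea from the protocol-aware recovery paper, not TigerBeetle's actual protocol, and the helper function is hypothetical.)

```zig
const RepairDecision = enum { safe_to_truncate, must_recover_from_peers };

// On a local checksum failure, don't repair unilaterally: ask the cluster
// first, because truncating a committed entry could undermine quorums and
// cause global data loss.
fn onLocalCorruption(entry_index: u64) RepairDecision {
    if (quorumSaysUncommitted(entry_index)) {
        return .safe_to_truncate; // never committed: truncating is safe
    }
    return .must_recover_from_peers; // committed: fetch a good copy instead
}

// Hypothetical stand-in for a round of consensus messages.
fn quorumSaysUncommitted(entry_index: u64) bool {
    _ = entry_index;
    return false;
}
```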
00:53:11
Speaker
So then the cluster will say: well, no, you can't truncate. And you really do have to ask the cluster. Most of the existing designs make this decision without reference to the consensus protocol, and that's just not correct. So, long answer: we realized this was going to be one of the first implementations of
00:53:37
Speaker
how you do this. We were going to make stable storage that's integrated with the consensus protocol, so the thing can be efficient: use the global replication to recover from local faults, and do that correctly, with high availability. That makes sense. We had an episode recently with Benjamin Bengfort about implementing your own consensus algorithm,
00:54:01
Speaker
and how you could then have custom primitives inside the consensus algorithm. And it's exactly that; this is a specific use case of that idea, which is really interesting. Yes, yeah. And I think that's often why this isn't adopted, because people will say: well, we're building a new storage engine for Raft, but we're not at liberty
00:54:24
Speaker
to improve Raft to be storage-fault aware. And so you're stuck. You've got the abstractions, but they stop you from getting to the system you really want. And the desire to make consensus a black box comes from the fact that it seems too scary to open that box. But you kind of do need to open that box if you want to get the right behavior.
00:54:51
Speaker
Yes, and you do want to understand it. You really do want to understand it. And it comes back to zero-cost abstractions, because abstractions always carry a probability of being leaky if the abstraction boundaries are not exact. And this is a classic example, where the abstraction of the global consensus protocol in isolation from local storage faults is leaky:
00:55:19
Speaker
the formal proof needed to really consider the whole system, to be a systems proof. I'll have to get you and Benjamin on an episode for a smackdown. That'd be interesting. Yeah, that would be great. I think I still haven't answered your question: how do you test? Testing. We were talking about testing, yes.
00:55:40
Speaker
Yeah. I've just made everybody nervous. Like, you know, oh, they wrote their own storage engine. They wrote their own consensus protocol. Storage engine takes five years. Consensus protocol takes 10 years and we still haven't got all the bugs. Well, I'm going to dip into that before we get to testing in that case, as you mentioned it, because some people are going to look at this and think all these safety properties are very important and great, but it sounds like you're disappearing down a rabbit hole.
The Future of TigerBeetle and Open-Source Vision
00:56:07
Speaker
how are you actually going to get to production from that large work stack? Yes, yes. So I think the surprising thing is, we found a silver bullet. I love Fred Brooks, you know, No Silver Bullet, but we did find one in the cave. And this is really credit to FoundationDB. They wrote FoundationDB very differently also.
00:56:31
Speaker
They wrote this whole database so that you could run it in a simulator. So, you know, what's the best way to become a pilot? In the past, you used to jump in a plane and fly and crash, and then, if you survived, you got better, and you put hours on.
00:56:46
Speaker
And then one day people realized: well, we don't want to keep crashing our planes. This is very expensive, and we lose the pilots. So let's build a flight simulator, and we'll simulate everything, and you can crash inside the simulator, where it's safe. That's sort of what FoundationDB did. I think they were one of the first.
00:57:05
Speaker
Most databases are not designed like this. You have to fly them for real and crash them in production and get the issues. FoundationDB, you fly it in a simulator and then the simulator speeds up time also. If you have to run your database for two years before you hit the bug, the simulator can speed up the passing of time. You can find the bug in like two days of simulator time versus
00:57:32
Speaker
If you were to test a system with Jepsen, you'd have to run it for two years; Jepsen can't speed up time. FoundationDB found this silver bullet. It's so valuable. It's called deterministic simulation testing. It's the idea that you design the database to be run in a simulator, because then you can speed up time.
00:57:57
Speaker
So TigerBeetle, obviously, we did the same. We wrote our consensus and storage in a very specific way so that it's deterministic: given the same inputs and network latencies, the code will always execute exactly the same. You always get the same answer, which means that now
00:58:18
Speaker
you can debug TigerBeetle, and from a single seed you generate this whole SimCity world of events. It's like a big bang, and a whole lot of chaos happens, but you can always recreate it. So if you find an interesting consensus bug, you drop a little seed in Slack, and suddenly everybody on the team, around the world, can reproduce
00:58:42
Speaker
this whole series of millions and millions of distributed systems events, and debug it. This is like generative testing, but you're not just generating a fixed data set from that seed; you're also generating how reliable my network is in the simulation, how reliable my disk is in the simulation, that kind of thing.
00:59:04
Speaker
Exactly. And then what you do is you simulate latencies. And latencies alone are very interesting: if you simulate wild disk latencies, it can uncover a lot of interesting race conditions in your database code.
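(Determinism in miniature: every random decision in the simulated world flows from one generator seeded with a single integer, so the same seed replays the same chaos. A sketch; the PRNG namespace varies across Zig versions.)

```zig
const std = @import("std");

pub fn main() void {
    const seed: u64 = 92; // the seed a teammate drops in Slack
    var prng = std.rand.DefaultPrng.init(seed);
    const random = prng.random();

    // Re-running this program always prints the same five latencies,
    // just as the simulator always replays the same world of events.
    var i: u32 = 0;
    while (i < 5) : (i += 1) {
        const latency_ms = random.uintLessThan(u64, 500); // simulated delay
        std.debug.print("message {}: {}ms\n", .{ i, latency_ms });
    }
}
```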
00:59:20
Speaker
I'm sure existing systems could do more of this, but you can do all of it in the simulator: wild latencies, network faults, and then storage faults. At TigerBeetle, in some of the tests we inject, like, 8% corruption. Every time you read, the simulator will corrupt that read 8% of the time.
00:59:44
Speaker
And that's something that I think almost no database can handle. There's a lot of checksumming out there, but usually the checksumming is there for crash consistency through power loss. It isn't there to handle plain corruption of everything in the data file, where anything in the data file is corrupted 8% of the time on the read path. And then on the write path, 9% of the time we'll write to the wrong place.
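(A sketch of that kind of storage fault injection. The 8% and 9% rates mirror the conversation; the functions and types are illustrative assumptions, not TigerBeetle's simulator code.)

```zig
const std = @import("std");

// Corrupt a fraction of reads by flipping one random bit in the sector.
fn maybeCorruptRead(random: std.rand.Random, sector: []u8) void {
    if (random.uintLessThan(u8, 100) < 8) { // 8% of reads come back corrupt
        const bit = random.uintLessThan(usize, sector.len * 8);
        sector[bit / 8] ^= @as(u8, 1) << @as(u3, @intCast(bit % 8));
    }
}

// Misdirect a fraction of writes to a random wrong sector.
fn maybeMisdirectWrite(random: std.rand.Random, sector: u64, sector_count: u64) u64 {
    if (random.uintLessThan(u8, 100) < 9) { // 9% of writes land elsewhere
        return random.uintLessThan(u64, sector_count);
    }
    return sector;
}
```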
01:00:14
Speaker
We do that every day; it's very normal. We operate at extremely tight tolerances, looking for extremely rare bugs.
01:00:26
Speaker
And then the easy bugs become very easy. I almost hesitate to say this on record, but it seems like the military would be interested in running your database in the field if you've got that level of fault tolerance.
01:00:44
Speaker
Because I'm just thinking: where would you actually get 8% disk corruption? When there's a battlefield happening around your disk, right? Yes, yes. Or if you boost your TigerBeetle into space, and you've got all these cosmic rays. When I was a kid, Tiger Beetle in Space was my favorite cartoon, so that's perfect.
01:01:06
Speaker
Is that really a cartoon? No, it's not. I'm kidding. I wish. Okay. Well, maybe for the next generation of kids, we can make it happen. Yeah. A little sideline, now we've mentioned it. Okay. So we're sort of running out of time. I do want to ask: where are you? What's your current status, in terms of alpha, beta, production, and how are you going to get to production?
01:01:30
Speaker
Yeah, so we're very close. We've actually only been a little over three years from zero, and already the consensus is tested to a tolerance that most systems couldn't survive, and the storage engine too.
01:01:46
Speaker
We're busy polishing and wrapping up, and coming into some first releases this year, starting a release process in September with tagged releases. And then we'll increase the guarantees of storage stability as we go. The current focus now is: let's just lock in our data file format.
01:02:11
Speaker
I guess the second leap we haven't really dived into is: at some point, can open source become a company? That happened last year, in November, this idea of TigerBeetle, Incorporated. What's your angle for that? How are you going to make the open source software pay for itself?
01:02:33
Speaker
Yeah, so that was also kind of a tension. We were fully invested, and we still are, always, in Apache 2.0 open source, because we came out of an open source payment switch. So we saw how open source is crucial to a business. You couldn't have a business supplying an open source payment switch with software that isn't open source.
01:02:59
Speaker
Even the BSL wouldn't work because it's not open source. It's the anti-business license. We couldn't have a business if it's not open source. The question is, how do you do a business then if it is open source? That took me a long time to figure out.
01:03:15
Speaker
The key is just that it's mission-critical. People want open source, but they also want people to run it for them, because it's mission-critical. Or, if they run it themselves, who are you going to call if something does go wrong?
01:03:33
Speaker
So we may see TigerBeetle support contracts and TigerBeetle-as-a-service before long. Yes, yeah, I would love that, you know. And the more people running TigerBeetle, the better. I actually joined Coil full time to work on TigerBeetle as open source. I just see these things as so valuable, you know, a new distributed database.
01:04:02
Speaker
I love ZFS so much and I was always sad that Oracle didn't get the license right. I would never have worked on a new
01:04:15
Speaker
distributed database technology if it wasn't open source, because I think these things are too valuable. They're bigger than a company. It's an ecosystem. So you really want this to be open source. And it actually makes business sense to create an ecosystem around this.
01:04:36
Speaker
Yeah. Well, I hope building the business around it works out, so that you can keep the open source going for a long, long time. Yes. In the meantime, Joran, thanks very much for giving us the breakdown, and I'll talk to you again soon. Yeah, thanks so much for having me. It's been a real pleasure. So thanks again. Cheers.
01:04:56
Speaker
Thank you, Joran. It's got to be said, he is biting off a lot there, but it's an interesting bite to take. And his ideas about testing, I think he might be onto something. He might actually accelerate the development of making that database reliable
01:05:11
Speaker
by a heck of a lot. So I'm going to put a link in the show notes to one of TigerBeetle's test cases. It's an easy read, it's well written, and it's interesting the way they've modelled cluster failure, disk failure, and network failure straight into the test suite. It's probably the way we should be doing really, really mission-critical integration testing. There's something to learn in that code.
01:05:36
Speaker
Something else to learn: I think we're also going to need a dedicated episode on Zig, the language he's using. I'm keen to learn more about that, so I'll go and look for a guest. Stay tuned! And of course the best way to stay tuned is to click subscribe and notify and all the buttons, whatever buttons your app has. If you've liked this episode, please do leave a like or a share or a rating, or all three,
01:06:02
Speaker
because ultimately, that translates into there being more episodes to come for a long, long time. And it makes my day too. And with that, I think I'll leave you to your day. I've been your host, Chris Jenkins. This has been Developer Voices, with Joran Dirk Greer. Thanks for listening.