
Bitemporal Databases: What They Are and Why They Matter, with James Henderson of XTDB

Developer Voices

As a developer, it's crucial to understand the various types of databases available so you can choose the right tool for the job. In this episode, we're shining a spotlight on bitemporal databases with James Henderson, lead developer of a new bitemporal database called XTDB.

You may have already created an ad-hoc bitemporal database without realizing it, but James and his team have been hard at work building a custom database that's tailor-made for situations where having two notions of time is essential. Join us to learn about the what and why of bitemporality, and explore the process of designing and building a database in Clojure.

Ready to get started with XTDB? Visit https://www.xtdb.com/v2 to learn more.
Want to get involved with the XTDB community? Head over to https://discuss.xtdb.com.
Follow XTDB on Twitter at https://twitter.com/xtdb_com
 and Kris Jenkins at https://twitter.com/krisajenkins.
Connect with Kris on LinkedIn at https://www.linkedin.com/in/krisjenkins/.

Transcript

The Rise of Specialized Databases

00:00:00
Speaker
We seem to be living in an era of tailored databases. There's been a real explosion of them in recent years. You get databases that are really strong at certain kinds of workloads, databases designed with a deliberate sweet spot. And I don't think that means the end of the general-purpose database like Postgres or MySQL or anything like that. I don't think they're going anywhere. I think what it means is that as developers, we have to expand our toolbox.
00:00:29
Speaker
We need a larger mental list of the kinds of databases that are out there, so that we know when it's time to use the specific tool for the specific job.

Introduction to XTDB: A Bi-Temporal Database

00:00:40
Speaker
So, in that spirit, today we're looking at XTDB, which is a bi-temporal database. What's a bi-temporal database, you ask? Exactly!
00:00:50
Speaker
I've brought in an old colleague of mine, James Henderson, to discuss it. We last worked together years ago, and since then he's become the lead developer for XT. And I wanted to ask him about the two sides of that job. What's a bitemporal database?

What is Bi-Temporality?

00:01:06
Speaker
Why should we care?
00:01:07
Speaker
And what's it like being the lead developer on a new database? I don't think that's quite like other programming jobs. No, we're all users of databases, but very few of us will be on a project building a database. So this week we get to live vicariously through James, as well as expanding our list of options for future projects, too. A bitemporal database is kind of something you might have already created an ad hoc version of in the past,
00:01:37
Speaker
but today we're going to see something that's formalized it properly. So let's look at bi-temporality. I'm your host, Kris Jenkins. This is Developer Voices, and today's voice is James Henderson. Joining me today is James Henderson. James, how's things going?
00:02:05
Speaker
Hey, Kris, doing real well, thanks. And yourself? Very well, very well. Glad to see you. We used to work together many years ago. Oh, it was a long time ago, wasn't it? Yeah. Must have been 2013, I want to say. That's forever in internet years. Yeah. Yeah, it really is. It really is. These days, you are a lead developer for a database called XTDB.
00:02:30
Speaker
And I wanted to talk to you about two things which we have to get deep into. One is, why does the world need a new database? And two, what's it like being a developer on a database? So it's probably best for context we start with what you're building rather than what it's like. So why does the world need a new database, James?
00:02:53
Speaker
So I'll talk a little bit about what XTDB is.

Why Build XTDB?

00:02:57
Speaker
So it's a database. We've been developing it for about four or five years or so now. It's written in Clojure. And its main sell is that it's a bitemporal database, so it's got time, two dimensions of time, at its very heart. And part of the reason for us building this was that
00:03:21
Speaker
we see, in so many of our client use cases, people trying to roll their own bitemporality, whether it's tables with soft deletes or tables with loads and loads of different time columns. You're going to have to step back and define bitemporality. I probably am, aren't I? Yes. Bitemporality quite literally means two times: two time dimensions.
00:03:48
Speaker
And they're called valid time and transaction time. The names for those vary hugely across the industry. For valid time, you might hear application time, business time, domain time. As usual in the industry, we can't agree on the names for these things.

Applications of Bi-Temporal Databases

00:04:04
Speaker
Transaction time is also called system time. So transaction time or system time is when the system first sees the fact.
00:04:14
Speaker
Right. Valid time is when that fact is actually applicable to your business, to your application. So it enables what we call retroactive and proactive updates. Retroactive being: I understand that someone's changed their name; I've been informed, after the fact, that they in fact changed their name when they got married last weekend.
00:04:39
Speaker
Right, yep. It's like I'm overdue doing my business expenses. At some point, I will say I bought a laptop last month. So we'll have the timestamp of now when I actually assert it and the timestamp of last month when it matters. Exactly, yes, exactly that. And then it allows you to ask questions either with or without corrections. So if I'm looking, I mean, one common use case we see, for example, is things like risk and risk calculations and the like.
00:05:09
Speaker
And, as a lot of regulations require, I have to justify why I made a certain business decision or a certain trade or whatever it may be. One of the things I might say is that, in hindsight, that was a terrible decision. Absolutely terrible trade. Why on earth did you do it? It took you completely out of your risk profile.

Challenges of Maintaining Bi-Temporality

00:05:27
Speaker
But one of the things you can say is actually based on the information I had at the time, it was a reasonable decision because you go back without the corrections that you've seen later.
00:05:37
Speaker
Right. So: what did we know on Tuesday at 4pm, without subsequent corrections? Yes. So you're only considering, you're filtering for, everything that had happened as of a month ago, whether it was in the future or the past of that point in logical business time. Right. Yes. Okay. But then in other situations, I might want the exact price as of Tuesday at 4pm as we know it now,
00:06:06
Speaker
because we might have gone back and corrected it and said: okay, so your Microsoft share price was whatever it is. Actually, we made a mistake there, or the system was delayed in getting that information to us; we now know that the price at that time was this. And so now I want the absolutely up-to-date information, with the corrections, as we best know it. Right. Now, in my dim and distant past, I've worked on accounting systems.
00:06:34
Speaker
And we did that, you know; we had, I forget what the names were, but we had an insertion time and a business transaction time, right? So I'd already built a bitemporal system, hadn't I? We're not moving the field forward. Right, but we did it in an Oracle database and it was just two columns.
00:07:00
Speaker
So I think you have to justify why we need more than two columns in our familiar database. Right, yeah. So the update procedures, especially when you go to full bitemporality, get quite hairy, especially once you add on those retroactive and proactive updates. So when you've got four columns, you as the developer then have to consider: right, okay, if I am going to make a retroactive update as of 4pm last Tuesday,
00:07:27
Speaker
I need to consider how many rows I need to add, because I'm going to have to add rows to that table: rows for, essentially, the new versions of the document. You're then going to have to consider what changes you make to the existing ones. So I'm going to need to cap off the current row. But also, if I'm doing it bitemporally, I'm going to need to cap off the system time.
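To make that concrete, here is a minimal sketch of the hand-rolled, four-column approach being described. The table, the column names, and the update procedure are purely illustrative, not any particular production schema:

```sql
-- A hand-rolled bitemporal table (all names illustrative).
CREATE TABLE product_price (
  product_id  INT       NOT NULL,
  price       NUMERIC   NOT NULL,
  valid_from  TIMESTAMP NOT NULL,  -- business/domain validity starts
  valid_to    TIMESTAMP NOT NULL,  -- validity ends ('9999-12-31' = open-ended)
  system_from TIMESTAMP NOT NULL,  -- when this row was recorded
  system_to   TIMESTAMP NOT NULL   -- when this row was superseded ('9999-12-31' = current)
);

-- A retroactive correction ("the price was really 95 from last Tuesday 4pm")
-- is not a single UPDATE. First, cap off the current belief in system time:
UPDATE product_price
   SET system_to = CURRENT_TIMESTAMP
 WHERE product_id = 1
   AND system_to = TIMESTAMP '9999-12-31 00:00:00';

-- Then insert fresh rows: the corrected history, split around the correction
-- point. The capped rows above preserve what we believed before, still
-- queryable by rewinding system time.
INSERT INTO product_price VALUES
  (1, 100, TIMESTAMP '2023-01-01 00:00:00', TIMESTAMP '2023-06-06 16:00:00',
      CURRENT_TIMESTAMP, TIMESTAMP '9999-12-31 00:00:00'),
  (1,  95, TIMESTAMP '2023-06-06 16:00:00', TIMESTAMP '9999-12-31 00:00:00',
      CURRENT_TIMESTAMP, TIMESTAMP '9999-12-31 00:00:00');
```

Multiply that dance by every update path in the application and the headache James describes becomes clear.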
00:07:54
Speaker
I'm then going to need to keep the record of what it was back then without corrections, and the record of what it was back then with corrections. And to maintain full bitemporality, that ends up being quite a headache. The other thing that we bring to this is performance, of course, because if you are just doing columns in your own ad hoc way, the database won't necessarily be able to optimize the queries. So for example,
00:08:22
Speaker
we put quite a lot of emphasis on making sure that as-of-now queries, which typically in these systems represent the vast majority of everyday queries, are correspondingly faster than historical queries might be. So we understand a bit more about the distribution of the data, if you like. Okay, yeah, that makes sense. Does it factor into querying as well? Do you have anything like special operators for querying at particular points in time?
00:08:53
Speaker
We do, yes. And so we're basing that, at least in XT version 2, the early access that we released a few weeks ago, very much on the SQL:2011 spec, which covers an awful lot of ground here for bitemporality. Really? It introduces a number of new syntactic structures, which aren't very well implemented across the industry as yet.
00:09:18
Speaker
It's taken us a while, hasn't it? What was it, SQL 2011, and what are we now, 2023? Yeah, but the SQL spec isn't supposed to move fast. It's a bit faster than that, but... Yes, quite, quite. So it does allow you, for example, when you're doing a select from a table, to say FROM that table FOR SYSTEM_TIME AS OF, and then you can give it a system time. We've then extended that for valid time as well.
00:09:47
Speaker
But you can also do SYSTEM_TIME BETWEEN or VALID_TIME BETWEEN. Give me the history of this entity throughout 2022, between the 1st of January and the 31st of December. Including corrections? Corrections depend on what you specify for system time. Right. If I want corrections, I'll leave the system time as of now, because I want the most up to date; I want system time as of now.
00:10:15
Speaker
If I don't want corrections,

XTDB's Architecture: Inside-Out Approach

00:10:16
Speaker
I'll go back in system time. I can see accountants and auditors loving this. Absolutely. But the thing is, they've known how to do this for centuries, perhaps. Yeah. Yeah. And they've certainly known about immutability. We need to talk about that too.
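For reference, the syntax just described looks roughly like this. FOR SYSTEM_TIME is standard SQL:2011; FOR VALID_TIME is the XTDB-style extension James mentions; the exact dialect details may differ from what XTDB currently ships, and the table is the illustrative one from the sketch above:

```sql
-- "What did we believe on Tuesday at 4pm, without subsequent corrections?"
-- Rewind system time:
SELECT price FROM product_price
  FOR SYSTEM_TIME AS OF TIMESTAMP '2023-06-06 16:00:00'
 WHERE product_id = 1;

-- "What price was in effect on Tuesday at 4pm, as we best know it now?"
-- Rewind valid time; system time stays at its default of now:
SELECT price FROM product_price
  FOR VALID_TIME AS OF TIMESTAMP '2023-06-06 16:00:00'
 WHERE product_id = 1;

-- The history of the entity throughout 2022, including later corrections:
SELECT * FROM product_price
  FOR VALID_TIME FROM TIMESTAMP '2022-01-01 00:00:00'
                   TO TIMESTAMP '2023-01-01 00:00:00'
 WHERE product_id = 1;
```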
00:10:39
Speaker
Yeah, this whole thing around, we don't ever go back and correct the past, we write a new version, we write a correcting transaction. They've known about that for way longer than I've been around. And me too, I'm not that old. Yeah, so what we're doing is sort of drawing on that kind of knowledge, which it's about time we did. And especially now that
00:11:08
Speaker
The constraints have changed as well. Certainly in the early days of relational SQL engines, we had, we as an industry had, very different constraints, and I wasn't around for that either, by the way. We certainly had very different constraints in terms of making the best use of our storage, because storage was very expensive. We had to make updates in place in order for it to not become crazy expensive,
00:11:37
Speaker
whereas now, with storage, especially remote storage, being a lot cheaper and a lot faster to access, all of a sudden it does become an option for these kinds of databases to store literally everything, to store your entire history. Yeah, there was a period in our history where you would update and throw away old financial data because you had 64 meg of RAM.
00:12:02
Speaker
Yeah. All right. Or sometimes like 64 meg of disk, if we really go back. So you had to. Yeah, we can make different design decisions now. Absolutely. And obviously, we still prioritize keeping as much data as local as we can. So in XT, we're looking to bring as much data as we can into CPU cache, into memory, into disk, into
00:12:30
Speaker
something like Redis, and then only at the last stage do we go all the way out to the remote storage. I think that's a very typical pattern anyway. We're not inventing anything new there. Okay, so you're leading into some gory details, which I want to get into. What's happening under the hood? Give me some technical facts about how you make this work.
00:12:56
Speaker
So XT, both in the version one that's out in production at the moment and the V2 that's in early access, is based on what's called the inside-out architecture, which I first heard of from a talk by Martin Kleppmann. Apologies, as he may or may not be the first person to have thought of it.
00:13:16
Speaker
I don't know either, but he's widely recognized as giving the founding conference talk that made everybody start talking about it, right? Yes, yeah. And the inside-out architecture says that rather than having the log ingrained within the database as an internal implementation detail, we make it front and center. We very much have the log as the history of everything that's happened, and then base our data store on that log
00:13:45
Speaker
as a projection over it. So your data becomes essentially a projection over that log. Yeah. And the event-sourcing folks will find that very familiar, no doubt. Yeah, I haven't really gotten into the event-sourcing space myself yet. They'll certainly be very, very familiar with that. This is related to Kafka, right? Yes, exactly. Definitely related to Kafka.
00:14:11
Speaker
And in fact, Kafka is our model for what we call the transaction log within XT. When you say the model, as in you've copied it, or you use it? It's the model implementation, the reference implementation, of the transaction log. Right, okay. So XT V2, more so than V1, is what we call unbundled. So you can bring certain components to the table:
00:14:38
Speaker
we ask you to provide a transaction log and an object store. The transaction log has to be totally ordered, and all of the consumers have to agree on the order of the transactions. That's what gives us the consistency over the cluster. But then we also have a component called the object store, which is what it sounds like: think S3 or
00:15:05
Speaker
blob storage, or whatever Google Cloud call it. Cloud Storage, isn't it? Whichever cloud you subscribe to; other services may be available. So it's that kind of principle. For object storage, we're looking at being able to store big blocks of data and grab hold of them when we need them.
00:15:31
Speaker
OK, so you're also storing images, movies, that kind of thing? Not quite: the blobs are essentially our database pages, if you think of a database page in the traditional sense. Right. And then it's up to us to manage which pages we bring locally and which pages stay remote. So the idea behind that is that we, as XT, don't have to manage that storage ourselves.
00:16:01
Speaker
So we delegate a lot of the hard work, particularly to Kafka, for the total ordering and consistency between clients.
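As a rough illustration of that unbundling, this is the shape of wiring up an XTDB 1.x node with Kafka as the transaction log and RocksDB for the local stores. Module names follow the 1.x documentation; v2's configuration differs, so treat this as a sketch of the idea rather than a recipe:

```clojure
(require '[clojure.java.io :as io]
         '[xtdb.api :as xt])

;; Kafka supplies the totally-ordered transaction log; RocksDB holds the
;; local document and index stores.
(def node
  (xt/start-node
   {:xtdb/tx-log         {:xtdb/module 'xtdb.kafka/->tx-log
                          :kafka-config {:bootstrap-servers "localhost:9092"}}
    :xtdb/document-store {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                                     :db-dir (io/file "data/docs")}}
    :xtdb/index-store    {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                                     :db-dir (io/file "data/index")}}}))
```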

Managing Changes and Transactions in XTDB

00:16:14
Speaker
Out of curiosity, what are you totally ordering by? Which of your fields? So we're totally ordering essentially by transaction time. It's a single writer, similar to the Datomic idea.
00:16:28
Speaker
Okay, so: single writer, single partition on Kafka, and you're just building this big append-only log. That's exactly right. Yeah. So there's no consideration of partitions, at least as yet.
00:16:47
Speaker
You say you're offloading a lot of the hard work, but it sounds like every database contains part of an operating system, right? And you're managing a page cache to disk; it's getting close to what operating systems do to manage levels of caching. That must be a laugh. It's quite naive as it stands. We've got a lot of work to do in that area before we can really push V2 out to the big time.
00:17:17
Speaker
Okay. So is this largely considered to be an analytics database rather than a transactional database? We're calling it HTAP: hybrid transactional/analytical processing. Okay. The idea being that way too much time in our industry is spent in ETL land, and if we can get something that works reasonably well for both, we'll be in good shape. We can get people building applications on top of this
00:17:48
Speaker
without having to go through a lot of that rigmarole. So certainly there are a few of the technology choices we've made that are slightly oriented towards the OLAP side, and we're having to make the OLTP side catch up a little bit at the moment. That's indeed what I was working on before I came on this podcast: how do we make that happen?
00:18:11
Speaker
That gets into it. What's it like? Because lots of us build web apps, right? A lot of programmers are out there building some kind of application that faces, eventually, some kind of non-technical user. Whereas you're working on really core infrastructure that will eventually be used by developers to eventually do something for people that aren't technical at all. So what's it like?
00:18:41
Speaker
It's quite different. And certainly for me, at least, it took a while to get used to. I think for me, the vast majority of my experience has been in exactly that. It has been sort of web apps. It's been talking to end users. And you're familiar, and chances are most people will be familiar with the kinds of processes that go on there, the kinds of cadences you get into when you're developing.
00:19:10
Speaker
in terms of milestones and releasing functionality and that kind of thing. There's a couple of things that are quite different about databases. Obviously, for us, it's very much more about R&D. I think one of the big differences we see there is that we have very few estimates in our planning processes, largely because of the R&D nature of the work. Especially in this last 18 months or so,
00:19:40
Speaker
when we've been very much in a research phase. We're also focused a lot more on backwards compatibility. In my early days with Clojure, I wrote a few libraries for the Clojure ecosystem, and there, in library world, you have to worry a lot more about backwards compatibility than you do in web-app world. And in database world, it's worse. Yeah, I can believe it.
00:20:08
Speaker
You've got to worry about not changing APIs, and they've already got this log of immutable data. Yes, exactly. And so there have been quite a few issues in XT, certainly in the early days, where we really had to consider: what have people got serialized on disk? Yeah. And particularly when you're in the immutable world,
00:20:33
Speaker
and when we're saying that this transaction log is completely immutable, we're never going to go back and change that log. Whatever structures are on there are on there, short of a migrate-your-entire-log operation, which we've been tempted by once or twice but have never done, mainly because we understand what impact that would have on our users.
00:21:03
Speaker
You've got a live running system; what you really don't want is your database folks saying: right, what we're going to need to do now is stop everything, pipe your transaction log into something else, and then go again.
00:21:19
Speaker
Which, ideally, if things are working the way they're supposed to, is a really large, important log of possibly financial data. Yes, exactly that. So how do you deal with that? I mean, it sounds like your hands are very much tied. How do you make progress? Generally, by being quite conservative about what we put on there in the first place. Right. So there have been a few features, certainly, that we've
00:21:48
Speaker
considered and have had to step away from because of not wanting to change that transaction log. But also, partly, because when we read that transaction log, we then index it into a form that's more useful to us. Obviously, you don't want to be replaying a transaction log every time you need to answer a "find me the order with this ID" query. So we pull all the transaction data off the log
00:22:18
Speaker
and index it. In version one, it's all local; in version two, it's a bit more shared between the different nodes. But that gives us an opportunity, if we really do need to make bigger changes, to say, in a reasonably backwards-compatible way: okay, what we're going to do is change our indexing structure. And so while your system is running live on version N minus one,
00:22:46
Speaker
your system can carry on going, and we'll index into that format. But you can bring up another XT node with the new version, and once that's caught up, with the index structure in the new format, you can then swap them over in a blue-green-deploy type way. And that's a great thing we get out of this inside-out architecture. We couldn't do that if the log wasn't the absolute golden store.
00:23:14
Speaker
Yeah, yeah. And it's also nice that with an immutable log, most of the data is just sitting there, guaranteed unchanging; there's just a new bit at the front to worry about. Yes. I imagine if it's anything like Kafka world, you can do those re-indexing upgrades and do nearly all the work as slowly as you like. Yeah. Yeah.
00:23:37
Speaker
Yeah, exactly that. This makes me want to circle around a bit. You say indexes, which makes me think: what's it like as a developer using this database? I mean, is it like a relational database, where I'm creating my own custom indexes and stuff like that? What's it like to work with? So XT is very much a schemaless database. We don't require any upfront schema from the user.
00:24:07
Speaker
Even though it now supports both Datalog and SQL, we don't have to do any of the usual SQL DDL of CREATE TABLE or anything like that. You're not restricted on what columns you put into the table. Essentially, you insert a document: you insert a map of data. The only thing we do ask for is an ID. Give us an entity ID, and we'll work with that.
00:24:33
Speaker
That just gives us the ability to keep track of an entity's changes over time because the ID remains constant.
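A quick sketch of that, using the XTDB 1.x Clojure API (v2's API differs; `node` is a started node, and the document keys are made up for the example):

```clojure
(require '[xtdb.api :as xt])

;; No DDL, no schema: a document is just a map, and the only required key
;; is :xt/id.
(xt/submit-tx node
  [[::xt/put {:xt/id        :customer/jane
              :name         "Jane Doe"
              :address      "1 High Street"
              :loyalty-tier :gold}]])

;; A later put with the same :xt/id becomes a new version of the same entity,
;; so history accumulates under that constant ID.
(xt/submit-tx node
  [[::xt/put {:xt/id   :customer/jane
              :name    "Jane Doe"
              :address "2 New Road"}]])
```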

XTDB's Schemaless Flexibility

00:24:43
Speaker
So it's quite free and flexible in that regard. And then we do the best we can with our indexes to get you the data you need. So we keep a number of different trees.
00:25:02
Speaker
At the moment, it's in Rocks. RocksDB, yeah. That's great at storing all the trees, and that gets us pretty decent performance. We're looking at different structures for V2. V2 is a columnar database based on Apache Arrow, which has a completely different way of thinking about how we're storing the data. Also, again,
00:25:32
Speaker
coming back to what we were talking about earlier, the constraints are a little bit different. All of a sudden, the cost of scanning a lot of data on disk comes down relative to random accesses, so our query engine now prioritizes scans over lots of random accesses in some cases. For version two, you've had to rewrite the query engine, on top of everything else?
00:25:59
Speaker
That in itself is a lot of work. I'm just thinking of Oracle; it's almost like a whole historical empire of optimizer code. It's certainly a lot of optimizer code. We were looking around once at the various job boards when we were looking to hire for XT, and if you look at the job boards for database engineers, nearly all of them are "you'll be working on query optimization",
00:26:32
Speaker
by far and away the biggest thing people are looking for. Really? Yeah. I'm surprised. I mean, I know it's big, but I wouldn't have thought it's like significantly the number one thing. It certainly seemed to be in our very limited research. Okay. Yeah, definitely. Okay, so I'm trying to think this through now. So
00:27:02
Speaker
Inserting documents, right? Let me put it this way: I create a user by inserting a new user document, and it's got their name and their age and their social security number and their address. Now I come back and I want to change their address. Can I just insert the delta, or do I have to reinsert the whole record? You can now. We do have SQL UPDATE, DML,
00:27:32
Speaker
and what we're essentially doing on your behalf is grabbing the document and creating a new version of that document. Okay, so under the hood you're writing a whole new document. You're like Git, right, where it stores the complete version of the file every time? Yes, yeah. Okay, that must be a mixed blessing, because you've got a single writer, so you've got the kind of operating-system problem of managing access to that single writer.
00:28:01
Speaker
We get that for free with the single partition from Kafka. So we don't have to do an awful lot of coordination; we don't have to do any coordination between threads or anything like that, really, because there are no concurrent writes. And for a lot of use cases, that's absolutely fine. A single partition on Kafka can be very quick these days.
00:28:24
Speaker
Yeah, and very large and still be useful. Let's look at this from the other side then. There you are optimizing different queries, worrying about not changing data. How do you get meaningful data sets to test against for a new database? So there are a few industry benchmarks out there that we use.
00:28:50
Speaker
The main two that we tend to look at when we're looking at the impact of new changes: the first is TPC-H, which is quite common. TPC is the Transaction Processing Performance Council. They've got benchmarks like A through whatever; I've got no idea how far they've got these days. But H is an OLAP benchmark.
00:29:17
Speaker
And it's based around a fairly typical use case of customers, orders, products, line items, suppliers. If you're running a business of that kind of nature, it's pretty much all of the analytical and BI-like queries that you can imagine. And you can run it at different scale factors. So for example, we run it at quite a small scale factor to get a rough idea
00:29:44
Speaker
of the impact of a change, but then you can scale it up and run it on a larger scale factor. For example, when you're doing a new release or you're really looking at a change that's going to heavily affect performance at larger scales.

Scaling and Consistency in XTDB

00:29:59
Speaker
Okay, that's interesting. There are also similar OLTP benchmarks as well. Right, yeah. As someone who builds demos, having access to a data set like that... I shall store that away mentally for future projects.
00:30:13
Speaker
You mentioned the word of the industry: scale. What's the scaling story? You're a single writer, so you're limited on writes; what's the scaling for reads? Yeah. So you can scale reads horizontally. If you do need more read scaling, then you add a new XT node.
00:30:39
Speaker
It brings itself up, and then you start querying against the shared object store in v2, so it will pull down what it needs from the shared object store and then get going. Now, because our transactions are entirely deterministic, we know that the state on each of the nodes is going to be consistent.
00:31:02
Speaker
And that's a big simplification for us. We don't need any coordination or anything like that between the nodes, because we trust that they're going to end up at the same state in the end. Right. They may not necessarily be in sync, though, perfectly. They may not be in sync. So what you have to do in that case, and again, I'm talking V2 here, is that each client keeps track of the last transaction that it wrote,
00:31:31
Speaker
and at the very least it will give you the world as of that transaction. So that enables you to read your writes. If it's got further than that, that's great. You can at least put a document into the database and know the next query you make will read it out.
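A sketch of that read-your-writes pattern using the 1.x API, where await-tx blocks until the node has indexed the transaction you just submitted (James is describing v2's client-side tracking here, but the guarantee has the same shape):

```clojure
;; submit-tx returns a receipt; awaiting it before querying guarantees this
;; node's view includes at least our own write.
(let [tx (xt/submit-tx node [[::xt/put {:xt/id :order/42, :status :placed}]])]
  (xt/await-tx node tx)
  (xt/q (xt/db node)
        '{:find  [status]
          :where [[:order/42 :status status]]}))
```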
00:31:51
Speaker
So you have been thinking about, and my university professor is going to be very unhappy with me, what's the name for those... isolation levels, in database design. Yes, yeah. Okay, that must be fun. So in that case, I mean, the single writer again gets us a long way here. Because reads and writes are separated, on the write side of things you submit the transaction to Kafka,
00:32:20
Speaker
and it comes back round through the Kafka pipeline. But then, because you've got this single writer at the head of the queue, we actually work at a strict serializable level, which is the top level that you can get on the write side. Every transaction is guaranteed to see every transaction before it, and so it's naturally serializable, because it's a single writer. Things get a lot easier when you don't have concurrent writes.
00:32:52
Speaker
There's a lot that we also don't have to consider, either: row locking and the like just isn't a thing. We don't, at least as yet, have interactive transactions, so it's not like you can read and then write based on that read. So you don't get, sort of,
00:33:14
Speaker
you don't get people accidentally locking a table or anything like that. This is the thing I thought, off the back of Martin Kleppmann again: if you start with this transaction log, then you're basically building state machines on top of that transaction log. And a SQL database like Postgres or Oracle is a really, really clever, fancy state machine
00:33:43
Speaker
to allow these things, but do we need as sophisticated a state machine as that for all our transaction log processing tasks? A very good question. I think we're obviously a little bit biased. We're obviously quite biased on that one. So do you ever get different people using the same transaction log that you're using for your bitemporal view of it?
00:34:12
Speaker
Do people ever reuse that same underlying transaction log for different purposes? I'm not aware of people reading off our transaction log directly. I mean, the format on there, while it's an open format, is probably a little bit gnarly for people to be reading off. I'll tell you what we do provide, though, as part of the whole:
00:34:41
Speaker
we provide a Lucene module for XT that reads off the same transaction log. And there are internal APIs, for people who do want to go diving down, that give them access to the events coming through the system. OK. What does the Lucene API get used for? I mean, what do people search? It's just like free-text search on your transactions. Yes. Yeah, exactly that.
00:35:09
Speaker
So yeah, I mean, I think all the usual use cases for Lucene, really. We do have the opposite, in fact, where we have the ability to hook XT into someone else's transaction log. And we did this for a client who were using Corda, the Corda blockchain. So they had a,
00:35:38
Speaker
the B word came up. I can see what they want. I mean, they've got an immutable log and they probably need analytics on it. Yeah, I can see this. So they already have the data in that format. What they want is a way to query that using XT's bitemporality. And so we wrote them a little module that hooked the two together. And so they kept Corda as their source of truth,
00:36:08
Speaker
but they were then able to run XT queries over it. I can see that being really popular in the blockchain world. Without declaring for or against blockchain technologies, I can see decent analysis over those blockchains, as a service, being quite popular. Right. Yes. Yeah. Well, I guess it's the same principle, isn't it? It's a log at the heart of the system.
00:36:38
Speaker
Yeah, yeah. So yeah, I can see quite a lot of overlap there. That takes us to the question, the $64,000 question for a small company building a database. Who's using it and for what? So we've seen quite a lot of users across various different industries. Obviously, we see quite a lot of it in financial industries, as we were talking about earlier.
00:37:02
Speaker
I think there's a natural tendency in that industry to need to know what happened at a certain time, and especially, with the kinds of regulations that are coming through, to justify what they knew and when, that kind of thing. We see an awful lot of usage in those kinds of areas. But to be honest, it varies quite widely.
00:37:31
Speaker
Like anything where time is of the essence, if you like. So we see it, for example, in pricing systems, in product pricing systems. In one case in particular, a company that Juxt works with quite closely likes to schedule price changes.
00:37:55
Speaker
And they can do that with a bitemporal system, because you can say: I want this product to have this price as of next Tuesday. Yes. Yeah. So any kind of scheduling or CMS or those kinds of systems use the other side of bitemporality, if you like. Yeah, writing into the future, which is a largely unexplored field, right?
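Sketched with the XTDB 1.x transaction API, where a put can carry an explicit valid-time start (the document itself is made up for the example):

```clojure
;; The new price sits in future valid time; as-of-now queries simply don't
;; see it until next Tuesday arrives.
(xt/submit-tx node
  [[::xt/put {:xt/id :product/widget, :price 120.0}
    #inst "2023-06-13T00:00:00Z"]])
```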
00:38:17
Speaker
Yes, yeah. Is that reliable? Is the developer experience such that you're not going to accidentally write software that pulls from the future? In what way, sorry? I'm just thinking like you've got all this flexibility to make that query. How easy is it to ensure that you do the right thing in most cases?
00:38:45
Speaker
So our defaults make sure that the default is always as of now; you only ever see as-of-now. If you can imagine a nice big 2D graph of history going through in both dimensions, we limit you to y equals x, the diagonal, by default. You know this is an audio podcast, right?
00:39:07
Speaker
Yeah, okay. So you can't see me waving my hands and hand-waving the explanation. The people listening to this on Spotify and Apple Podcasts can't even see your resplendent beard, James. They're missing out. They're missing out. Yes. In which case, without the visual for them: the same as in a lot of cases, because it's what traditional relational SQL databases do,
00:39:36
Speaker
we make sure that the default is as of now, both in what we currently know and in what we currently know about the current time. So: transaction time and valid time. You have to explicitly ask for across-time queries. That's what I was getting at. So the default behavior matches your expectations of how things generally work. The principle of least surprise, they call it. Absolutely. Absolutely.
00:40:03
Speaker
OK, so I think I have two more big questions to ask you, and I think they're slightly controversial. All right. But I have to ask them. Buckling up. Why build a database in Clojure? So I think for us, I mean, Juxt is a Clojure company, and we're very much sold on the ideas behind Clojure.
00:40:31
Speaker
So building an immutable, schemaless, dynamic database seemed to go quite hand in hand with an immutable, schemaless, dynamic language. So for us, it was a good fit in that sense. And we've also obviously got quite a lot of experience within the Clojure industry as well. So there was quite a good pairing there. We do, of course, find that the more performant areas of the code,
00:41:01
Speaker
and especially the more stateful, mutable, ugly code that you tend to want to hide away, are being written more and more in Java now. Just because there were certain areas where we found we were writing Clojure as if it were Java, for the mutable performance. But this is, as our Lord and Savior Rich says... Rich Hickey. Rich Hickey, creator of Clojure. Yes.
00:41:32
Speaker
I think he's done a couple of talks now where he's said that as long as the exterior of the system looks immutable and behaves immutably, what you do below the surface (if the tree is being mutably hacked down in the forest) is okay, as long as the exterior behaves immutably and gives you those guarantees.
00:41:56
Speaker
Okay, so you're happily in Clojure in the logical world, but under the covers, you're... I don't want to say slumming it with Java; I can't think of the right way to say it that isn't going to offend someone, so I'll just drop it. Hey, it's getting better these days. It really is, yeah. I think they've learned a lot from other JVM languages.
00:42:20
Speaker
I think they got a bit, I don't want to say complacent, but they lost momentum in the Java world. And then maybe Clojure, definitely Scala, and maybe even Groovy and Kotlin since then, made them get their act together. And it's very hard to go from a language that's lost momentum to getting it back again, but hat tip to them, they really have. Yeah, definitely.
00:42:50
Speaker
I think the other thing I'd say about Clojure is that, especially for a research project, we find we've been able to move a lot quicker. If you think about "make it work, make it pretty, make it fast": certainly for the "make it work", we've found we've been able to do that phase very much quicker in Clojure, with its interactive development, the REPL experience, and the fast turnaround times. So the experimentation side of things: what happens if I design the system in this way?
00:43:19
Speaker
Yeah, yeah. Clojure can be a great language when you have no idea what you're doing, which happens. Absolutely. And it happens just as much when you're writing a database as it does when you're writing a web app. Maybe more so. I can believe it. Yeah, because you're exploring; especially if you're writing a new database, you're having to explore entirely new design decisions. Yes, yeah, definitely. Yeah, and
00:43:49
Speaker
I mean, for us, at least in the early days when we were figuring out how to best use Apache Arrow, having something that we could really poke at and see what it did was valuable, because Arrow is quite strict with you about your memory management; it's a very manual-memory-management kind of thing. This is an aside that maybe we'll save for the DVD bonus features, but tell me a bit more about Apache Arrow. So Arrow is a columnar data format.
00:44:18
Speaker
It was pioneered by Apache and is being used by a number of different companies in industry. But one of its main benefits is that the on-disk format is exactly the same as the in-memory format, so there's no serialization or deserialization. If you want to, for example, read an Arrow file, a memory map is a great way to do that.
00:44:43
Speaker
OK, because there's no translation. Say, for example, you were writing your files in JSON or whatever: you'd need to translate that into whatever objects you're working with, or maps if you're in the Clojure world. Whereas with Arrow, there's none of that translation happening. So you're literally reading bytes. And what are you getting back? Do you spend a lot of your development time
00:45:15
Speaker
deserializing byte strings, or does it just pop off disk looking like a Clojure map, because that's what you wrote? Certainly for the primitives. So it'll pop off disk looking like doubles and longs and that kind of thing. Maps and other composite structures (maps and lists, predominantly, in Arrow) are stored in a columnar way. So for example, when you've got a map with A and B keys,
00:45:42
Speaker
the Arrow format will store all the A's in a row... sorry, all the A's in a column, and then all the B's in a column. And so, particularly if those are fixed-width (longs, doubles, that kind of thing), you end up being able to navigate straight to the value you're looking for.
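Not Arrow's actual API, just the shape of the idea, sketched as Clojure data:

```clojure
;; Row-oriented: each document stored whole, one after another.
(def row-oriented
  [{:a 1, :b 2.0}
   {:a 3, :b 4.0}
   {:a 5, :b 6.0}])

;; Column-oriented (Arrow-style): all the :a values contiguous, then all the
;; :b values. Fixed-width values (longs, doubles) can then be reached by
;; offset arithmetic, with no per-document parsing.
(def column-oriented
  {:a [1 3 5]
   :b [2.0 4.0 6.0]})
```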
00:46:03
Speaker
Okay. And I'm guessing that works out super fast if you want specific fields, or you want to aggregate specific fields, but the trade-off is you lose getting the whole object with all of its fields quite so quickly. Quite so. And we've had to put a little bit of thought into how we do things like select-star,
00:46:23
Speaker
for that exact reason. OK. So, last controversial choice: I know XTDB supports two different query languages. It does, yes. Tell me about that. So from the early days of XT, we very much went with Datalog. Again, heavily inspired by Datomic; I don't think that's any secret.
00:46:52
Speaker
I think for us in the Clojure industry, there's always been a little bit of that love for Datalog. It's very much a data-oriented query language. It feels a lot more declarative, in the way that you're asking for what you want rather than necessarily how you want it. For those that don't know, because I've worked a little bit with Datalog, but for those that don't know, describe it. Datalog is essentially a logical language,
00:47:22
Speaker
a subset of Prolog, if people have worked with Prolog, where essentially you bind different variables to values coming out of your documents. For example, if I'm joining customers onto orders, I'll make two declarations: I'll say that a customer has a customer ID of my customer-ID variable, and then I'll say that my order also has a customer ID of that same variable.
00:47:50
Speaker
And what the Datalog engine is then doing is saying: right, okay, I can see that customer ID twice here; I'm going to unify these two. And that, essentially, is an implicit join. And so within your query, you can very much see: okay, here are the things that I'm joining on. And I get a bit more of a declaration of how I traverse my graph, if you like,
00:48:18
Speaker
because you can kind of think of customers, orders, line items as a bit of a graph. And at times, you have called XT a graph database. Oh, have you? For that very reason. You can think about it in a graph-like way, a graph of documents. Where was I going with that? So when you're writing the query, you are writing it almost like a graph traversal. You're saying: find me this customer node,
00:48:45
Speaker
and then from this customer node, navigate out to the order nodes, and navigate from the order to its individual line items. And the query actually looks like that to work with. Fundamentally, it's data as well; it's data rather than a string. I think anyone who has ever worked with SQL has probably had to generate SQL at some point in the past, and has
00:49:08
Speaker
gotten into a bit of a tangle of how many ANDs they need in their WHERE clause. Yeah, yeah. My favorite feature of Datalog is that, because it's a data structure rather than a string, you can start to compose things, right? You can say: here's two extra clauses that I want to mix in with this query, without having to string-edit where the AND goes in my WHERE clause.
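A sketch of that composability, in the style of XTDB 1.x Datalog (attribute names are illustrative):

```clojure
;; The query is just a Clojure map, so composing clauses is ordinary data
;; manipulation rather than string surgery.
(def base-query
  '{:find  [order]
    :where [[customer :customer/id cid]
            [order    :order/customer-id cid]]})

;; Mixing in an extra clause is a conj, with no counting of ANDs:
(defn with-status [query status]
  (update query :where conj ['order :order/status status]))

(xt/q (xt/db node) (with-status base-query :shipped))
```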
00:49:39
Speaker
And all of the SQL injection that follows, no doubt. But what we've done recently: so, we obviously love Clojure, and I think the Clojure industry obviously loves Clojure as well, but we recognize that that love's not universal. How self-aware of you. We recognize that it's a hell of a niche.
00:50:08
Speaker
even if you, the development team, are fully bought into this, you're not gonna convince people outside of that that they should cast aside all of their SQL experience and tooling and everything else that they've gotten used to about the SQL world. So that's one of the reasons why we really wanted to make SQL a first-class citizen as well. And so what we've ended up building is a database where you can query the same data
00:50:39
Speaker
with both Datalog and SQL. And roughly equivalent Datalog and SQL as well: you can look at a Datalog query and a SQL query side by side, see that they're equivalent queries, and they'll return you the same results. So we think that's fantastically powerful. Are they at feature parity, or is one more powerful than the other? They're not far off, you know. So we've had to put a fair bit more work into SQL.
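For a flavour of that side-by-side equivalence, here is an illustrative pair; the attribute, table, and column names are hypothetical, not a shipped schema:

```clojure
;; Datalog: names of customers who have a shipped order.
'{:find  [name]
  :where [[c :name name]
          [o :customer-id c]
          [o :status "shipped"]]}
```

```sql
-- Roughly the equivalent SQL over the same documents:
SELECT c.name
  FROM customers c
  JOIN orders o ON o.customer_id = c.id
 WHERE o.status = 'shipped';
```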
00:51:08
Speaker
Datalog, particularly in terms of its scoping rules, is quite a lot simpler as a language. But what we do is compile both of them down to an intermediate representation; we compile both of them down to what we call a logical plan, which is essentially relational algebra, that old thing.
00:51:33
Speaker
Your university professor must be very proud of you. Oh yes, yeah. Might not have been proud of how much I remembered when I first started on that, but at least, yeah, remembering that it existed was certainly a very good start. Yeah, yeah. Isn't that all we're ever doing? So remembering that something exists enough to go Google it. Yeah, sometimes the hardest problem is discovering the bits that you don't know. That's why this podcast exists. There's my plug.
00:52:04
Speaker
But no, relational algebra is fantastic; I love it. And it's obviously the underpinnings of SQL, so SQL is very much based on that. But the composability of relational algebra: you really can compose those queries together. And we're looking to get back to that with Datalog. So you're hoping, secretly, that those people that start off with SQL will gradually be tempted over to your way of doing things.
00:52:34
Speaker
you can cut that out of the podcast, right? Don't tell anyone. Yeah. Yeah, we'll leave that on the cutting room floor, honest. If you're hearing this at home, I have betrayed James and he's never speaking to me. No, to be honest, I mean, it very much was the tooling of the SQL ecosystem, because there's just so much of it. With the best will in the world, we're not going to compete
00:53:05
Speaker
in the Clojure industry. Yeah, I'm inclined to think that if you're trying to do a new way of doing databases, that is one front on which to change the

Querying with SQL and Datalog in XTDB

00:53:15
Speaker
world. And that's plenty. Yes. Yeah, you have to be a little bit sort of conservative about that; you only get so much of that particular budget. Yeah, yeah, absolutely. So you say, what's the phrase? "But SQL reaches"?
00:53:30
Speaker
Datalog rocks, but SQL reaches. Love it. To steal a phrase, there. I can predict that at some point in the future, your team will have that on a T-shirt at a conference, giving it away. Absolutely, yeah. Yeah. So, you said you've just put version two into, was it preview release or early access? It's very much early access at the moment. So we're in listening mode.
00:53:58
Speaker
So Jon, our CEO, announced the V2 early access at Clojure/conj, which was a few weeks ago over in North Carolina. And the idea behind this is to get people to have a bit of a play with it, try it out, let us know what you think, and let us know of any big stumbling blocks you come across. Over time, we'll be moving through the traditional alpha, beta, RC, and stable releases.
00:54:27
Speaker
Okay. I think it's fair to call it pre-alpha right now. If you're not comfortable with the bleeding edge, it's probably not for you just yet. But where do I go? Is it fun to play with, if I want to just kick the tires as an experiment? Absolutely. So you can go to xtdb.com slash v2. All the instructions about how to get started and write your bitemporal queries will be on there.
00:54:55
Speaker
OK, I'm going to give it a go. It sounds like fun. And maybe I'll finally get the accounting system that will make my taxes easier at the end of the year. Ah, you've got to write the tax rules yourself, though; we don't bundle those in. I hear the government's going to simplify them any day now, so I won't worry. Yeah, I'm sure. James, as ever, a pleasure to talk to you.

Testing and Feedback for XTDB's Features

00:55:20
Speaker
And you, and you, good sir. Good luck with the path to the official final release of version two. Cheers, Kris. Cheers. Thank you, James. If you'd like to take XTDB for a test drive, if you want to explore bitemporality or Datalog or anything like that, there's a link to the project in the show notes. And if you do, please drop them a line if you've got any feedback. I know James would appreciate it.
00:55:48
Speaker
I appreciate feedback too, so if you've enjoyed this episode, please take a moment to like it or rate it or share it or review it or subscribe to it, all those different feedbacky things. It always helps and it's always interesting to hear. Or just drop me a line. My contact details are in the show notes too for Twitter and LinkedIn and the usual.
00:56:08
Speaker
And while I'm thinking about it, we're going to continue to explore different kinds of databases on this podcast. So if you've got any suggestions or requests, let me know. I know I'm planning to do vector databases soon; that one interests me, so that will be coming up. But I think that's all for this week. So I've been your host, Kris Jenkins. This has been Developer Voices with James Henderson. Thanks for listening.