Become a Creator today!Start creating today - Share your story with the world!
Start for free
00:00:00
00:00:01
#49 Crux with Jon and Jeremy image

#49 Crux with Jon and Jeremy

defn
Avatar
60 Plays6 years ago
In which, we talk to Jon & Jeremy about Crux the new Bi-Temporal Database by Juxt (https://juxt.pro/crux/docs/get_started.html)
Transcript

Introduction to Guests and Their Roles

00:00:15
Speaker
Welcome to Deaf and episode number 49. And I think we are almost going to be two years old next month. This is a pretty long time. So this episode is what Ray said. It's a crucial episode. Cruxiel episode. I don't know. You can't pronounce it. You can't say it. You're going to read it. Exactly.
00:00:41
Speaker
So we have two gentlemen from Jext, so John and Jeremy. You want to introduce yourself and then tell us a bit about what you do, and then slowly we'll move on to Crocs.
00:00:57
Speaker
Yeah, so I'm John. I'm the managing director at Juxt. So my life involves less coding, sadly, as the days go on. But my job is essentially to just, it's a bit like attending a garden where there's lots of interesting flowers that come out, but then some weeds and you have to go in and you can't kill all the weeds, but you do what you do for an hour that day, push them back.
00:01:25
Speaker
And just manage things, really. So that's what I do. I really do still love the tech enclosure. And it's a cliche. It's kind of an embarrassing cliche, I think, when people say, oh, it's like I'm post-technical. I really love to code, as though it's like they're condescending, as though they're trying to sort of stay at one with the techies. And I felt that a bit uncomfortable, and I resisted that cliche.
00:01:53
Speaker
But it is, there's some truth to it. Like when you, when you sort of just do your day to day, lots of phone calls and speaking to people, it's lots of activity in the brain. But when you can code, it's like that sense of self disappears a bit. And you go into like a meditative zone. And it really is quite peaceful. So I do enjoy coding. And I find it a necessary part of the job. Just, you know, it's the cliche, it's what everyone says, but it does, to a certain extent, keep me sane.
00:02:23
Speaker
Cool. So you still fiddle with a little bit of closure then, John.
00:02:28
Speaker
Yeah, like right now, I'm sort of doing some jobs on Crux. It's good fun, really. It's just I want to try and help out the team, so I'll pick some of the niggly issues, just trying to trim the issue board. But yeah, still like to get into EMAC, still like to solve problems. I quite like Crux, because Horkhan Rayberg, who is the tech lead of Crux,
00:02:57
Speaker
he's quite a disciplinarian, he's quite sort of strict. And I quite like, it's a bit like CIDR, you know, with Bozidar. I quite like that sort of authoritarian figure on a open source project. And I quite like being that person that submits an issue and is it good enough yet? Like what are you thinking? They say, no, it's like there's not enough tests and not enough comments. And I quite like just working towards someone else's spec and standard. So I quite like that sort of dynamic and correct. So it's a good place for me to

Exploring Crux: Features and Benefits

00:03:24
Speaker
be coding. Excellent, cool, yeah.
00:03:26
Speaker
So Jeremy, what about you? Yeah, great question. So I'm relatively new to the juxt world. I joined in January and really my role is all that crux. So I guess the term is offering manager, but you can think of me as a product manager. So because crux is a very new thing, you can think of my job as sort of really in John's garden,
00:03:51
Speaker
Watering one solitary plant, making sure none of the predators get it, and then figuring out how we're going to take it to market and who is going to want to use it. So, you know, I came into the Crux project reasonably late in its life, I guess. You know, John and Orkin have been, and Malcolm have been thinking about this for a long time. So I've had to get up to speed with, I guess, what Crux does, what value it brings, and
00:04:19
Speaker
figure out how it fits into the wider picture and how, I guess, all the use cases and systems people are going to be building in the future, what part can Crux play in that. So my role is very much
00:04:35
Speaker
about making sure that as we keep investing in corrupts and building it out that we're aligning it to some real world use cases and as we get people using it that we're listening to the feedback and incorporating it into the roadmap.
00:04:53
Speaker
Yeah, I'm not necessarily like the engineering lead. That's definitely Hawkins role. But I'd like to like to think I'm bringing a level of business focus and visibility over, you know, what are the what else is happening outside of crux and making sure that we we have a sort of solid direction and strategy. So that's, that's my role. You've got a bit of closure chops as well.
00:05:23
Speaker
So, you know, it's a project, product manager, which is like, you know, very business-y, et cetera, but, you know, you're not, you're hiding a little bit there, aren't you? You know, come on. There is a little bit of closure there too. There is a little bit. Yeah, no, so I've been, I mean, I've been programming on and off, I guess, for a long time, but it wasn't really until maybe eight years ago, I got sort of quite serious about programming. You spent a while at,
00:05:50
Speaker
playing out with Node.js when I realized actually I wasn't cutting it. I switched to Closure and this is just as a bit of a hobby really, but I'm very interested in knowledge management, data management, and Closure ticks a lot of boxes for me. And so I've spent a lot of time looking at
00:06:09
Speaker
how pleasure works and yeah I've been building various demos and prototypes and things but yeah it wasn't until I discovered a data script that I
00:06:21
Speaker
I found sort of a very natural place to start working and sort of finding my niche. So I'm sort of very keen on data log and the data script implementation and that sort of philosophy. So yeah, I'd say I'm nowhere near as adept at closure as most people at Juxt. Like I'm very much in at the deep end working around Malcolm and John and everyone else.
00:06:46
Speaker
I hope to get my Emacs configuration. Well, Vijay can tell you how to make Emacs look like Vim, so that could be a bit of homework for you.
00:07:02
Speaker
But this seems like a fantastic episode. I mean, we started with Emacs already, so this is like the best episode ever. After 48 episodes, we have like a majority of Emacs users on there, or to be Emacs. I'm really thinking about converting to Emacs now. Oh, yeah. Well, do you not use Emacs, Ray? I don't use Emacs now. Absolutely, Vim is your... I don't use Vim. I do use VI sometimes, you know,
00:07:30
Speaker
When he said converting, he meant upgrading to Emacs. He hasn't said what he's using yet. I'm waiting for the big reveal. I use notepad, of course. Because who wants text highlighting? That's for losers. Everybody else in our company, by the way, including the big guys, we all use
00:08:00
Speaker
Right. How do you think he curses a lot? That's the reason. He uses cursive and then he keeps cursing a lot. That's the thing of beauty. It's a real joy. It's good.
00:08:13
Speaker
That's what I heard. Anyway, so let's get to the meat of this episode. So John, you live open sourced at your enclosure north, Crux. So what is Crux? So there is something on the website. We'll get into deeper different things. What is the elevator page?
00:08:39
Speaker
Elevator pitch on well crux is a by temple document store a graph query. It's like the definition of a monoid. Yeah, exactly. We end the podcast now. I need a t shirt like that. That's what you should do. Actually, Jeremy, you know, part of your product management is half like one of these like, you know, what is a monad t shirts, you know,
00:09:05
Speaker
Yeah, yeah, that'd be good. We did have some pretty nice t-shirts at Closure North. I'm sorry, John's not repping it right now. So I was actually, as in the Closure world, we look for the definitions of something for everything. So I know the colloquial usage of crux. I'm not a native English speaker, obviously.

Crux's Architecture and Design

00:09:25
Speaker
But then when I looked into the Merriam-Webster dictionary, it says a puzzling or difficult problem. That is the definition of crux. But when I saw the
00:09:34
Speaker
Oxford dictionary, it says the decisive or most important point at issue. I think Oxford is right there. Yeah, because it came from, I don't know, Queen's English. So I thought it is probably that one, but there is 1.1 on Oxford dictionary that says again, a particular point of difficulty. So that's a very interesting name. Yeah, but it comes from Jesus being crucified and, you know, things like this and so, you know,
00:10:02
Speaker
It's quite deep. I mean, if you want the honest answer of why we chose it is because it had to be four characters. And we were walking around Bristol and looking at things like, oh, there's wood. Okay. Okay. There's door. There's leaf. There's ding.
00:10:17
Speaker
And we're thinking, what's a meaningful, hardened word? And after his fifth cocktail, Malcolm pronounced, it shall be called Crooks. We thought that was very good. We've got cockroaches, DB. Yeah. Almost there.
00:10:36
Speaker
For sure. Anyway, let's not get too rude. Crocs, crocs, crocs, crocs. The crocs of the matter. It's a bi-temporal document store that provides graph query. So let's break that down. Exactly. Please do. So you're a developer, right? You took some Eden documents, right? It stores them for you. Job done. That's the doc store part.
00:11:04
Speaker
Then also, there's a bit of crux that sees these documents coming in and indexes the attributes within those documents so you can do data log, graph query against those documents. And then you can join documents, you can do a data log. That's the graph document part. The bitemporal bit is,
00:11:29
Speaker
it's really quite simple. It's a problem that you didn't know that you had in most cases. So let's say you got a temporal database and really let's just assume that what that means is that you can query for any point in time against the database. So I want to query what
00:11:49
Speaker
what did the charts look like in the UK in 1997, for instance, on a particular day. That's what a temple database can give you. The issue with some temple databases is that if there's only one axis of time, which is transaction time, then the database is sort of recording facts
00:12:11
Speaker
at the time the database sees those facts. But the world that we often work in is quite messy, so you often want to enrich the past. Maybe we're sitting now in 2019, but we're learning some stuff about the charts in 1997. We might have got it wrong. Perhaps there's another chart that we want to add. So we want to add facts into the past.
00:12:38
Speaker
But you can't really do that if the only time axis that you've got is transaction time because you can't insert into the past because the only time you've got is transaction time when the database sees a fat coming in. So when you've just got the single time axis, the time is always increasing. You can't insert into the past.
00:12:56
Speaker
You maybe can if you take on some complexity, you take on a hit, you do some extra work, you do some modeling, perhaps you do some sharding. There's always ways around, you should build an index on top of the database. There's things that you can do, you can split up your data accordingly. But it's non-trivial.
00:13:12
Speaker
What bitemporality gives you is that it maintains that transaction time, that immutable transaction time that's always increasing. So you have that, your database doesn't change, it creates facts, that's what we want. But then it adds this valid time notion on. So you can add a fact, and the transaction time of that fact is increasing, it's always going in the future. But the valid time of that fact can be in the past.
00:13:38
Speaker
So we have transaction time and we've got valid time. All valid time is saying is when this fact was true. That's it, right? So you- It's according to the user rather than according to the database and that's- Exactly, exactly. It's according to the user, it's according to you as the programmer. So in a way, by temporality, there's two fundamental time axes that work. Like one is the transaction time that the database works
00:14:04
Speaker
with that inherent sort of level of the database. But then you've got valid time, which is more about the users, the programmers, and it's giving you more control of time. So you can say, look, I want to insert a fact into my database. And this fact was true in 1997. And then when people want to do as of queries or historical queries, they can use valid time. So they can do show you what the world looked like in 1997. And that can then take account of these facts that have come in later, because the valid time is when
00:14:34
Speaker
And it was when the user wanted it to be. Is valid time like a kind of, does it default to the normal time? Does it default to the database time and you can kind of override it or is it, how do you do that?
00:14:49
Speaker
Yeah. Like, um, I'm sure that Jeremy can correct me on the precise syntax and I'll hand over to him in a sec, but, um, yeah, you can submit a transaction into crux and not give it a time. Right. And then it's just like the valid time and transaction time is like now, um, or you can be specific and say, okay, this is my, this is the valid time that I want. The transaction time will still be derived by crux. Uh, but the valid time is something essentially you can override. Yeah. Okay. Makes sense.
00:15:17
Speaker
So by temporality, it does have a lot of ambiguity in the word. If you look at what other things do that claim to be by temporal, there's a broad range of definitions. So some databases, John wrote about this in his blog post, you have a valid from and valid to field, whereas Crux essentially has just a
00:15:43
Speaker
Basically, it's the valid from within an implicit valid too. So the valid time you insert document means that that document is valid until the next document is inserted in the temporal order. So there's some nuance in that model. But essentially, they're dualism. So you can use that to model the high-level concepts.
00:16:09
Speaker
The other thing that I always think about, like, because I used to be a database guy back in the day, was that, well, the problem with the database back in the day was that you had to, you had to kind of like, add timestamps into your data. And then if you wanted to reconstruct the database as of a particular time, it was kind of on your own, on your own dime, basically, you know, you had to either you had to design it upfront, so that all of the
00:16:38
Speaker
all of the tables are the same type of date fields in them. And that's what everyone did. And then you basically have to make a query that said, okay, give me the information across all these different tables as of that time. So how does that differ? I do know the answer, but fuck it. I'll just ask you anyway. How does that differ for your database? So do you have to put all these timestamps all over the place? Or how do you get to a particular snapshot in time?
00:17:09
Speaker
Yeah, so the answer is that all the timestamps are implicitly recorded unless you specify a valid time along with your transaction. So the transaction time and the valid times that are recorded will be according to the current time of the
00:17:28
Speaker
The crux node, that's making that insertion. Now, there are some implications around what happens to the crux node. Clocks are slightly different and all that kind of thing. But at the lowest level, you're not having to manually manage these timestamps. But you're not precluded from then inserting additional timestamps inside of the documents themselves. So you can model domain

Testing, Scalability, and Comparisons

00:17:49
Speaker
times in addition to valid times.
00:17:54
Speaker
So is the valid time kind of a mandatory thing in Crux document? So the valid time isn't inside the document, although I think there are advantages to putting it inside the document, because then you're able to query against it. But by default, valid time is just simply a way of making corrections to this temporal key.
00:18:21
Speaker
So how does the schema look like? Because I'm trying to understand this from the trying to contrast it with something like MongoDB or the NoSQL database like Cassandra or something. Or I'm trying to understand it from the if I'm trying to understand it from the relational model. So how do you contrast them with these two types of databases?
00:18:46
Speaker
So I'm not an expert on the MongoDB query language, but my understanding is that MongoDB does give you some reasonably sophisticated ways of navigating around, but it's still not generic, efficient, recursive graph queries. So MongoDB won't automatically index all the fields, whereas Crocs takes that view that actually, if you just index everything, then you can do these graph queries very efficiently.
00:19:15
Speaker
But there is no schema, so the indexes aren't being built to cross bond with the schema that you design up front. It just happens that when you insert a document, all the fields are automatically processed. And this only happens at the first level of fields. Like if you had nested fields within the document, they wouldn't get indexed. So again, there's a nuance there. But essentially,
00:19:39
Speaker
that the fact that it's schema-free, it's schema-less means that the indexing is very simple and that you can design these graph queries to do schema on read.
00:19:49
Speaker
Oh, OK. So I don't need to specify that. I'm just thinking out loud. If I'm getting started, then I put up Crocs, and then I can put any hidden document into the database. That's it. So the experience is similar to MongoDB, for example. There I just throw in some JSON slash BSON document, and then I'm done. And then it automatically thinks, OK, these are the fields I have. And the next document can have a different schema, so to speak.
00:20:16
Speaker
different fields in the same collection. So from the storage-wise, what is the concept you have? Every document is independent, or can I have a collection of documents like other document stores? So currently, everything is in a single monolithic collection, but there is no separate abstraction which we've defined as collection.
00:20:44
Speaker
But because of the way it's indexed again, you can pretty efficiently model your own sort of collection semantics. And of course, you know, you could have multiple crux databases as well if you did want to have different collections. So it's like the ultimate spreadsheet where you just basically just put data on this huge grid.
00:21:05
Speaker
and you just pick out the information, like you said, schema on read, you just decide which bits, which view you want to have when you read the thing rather than when you're writing. Yeah, I think that's a nice way of thinking about it. Certainly, spreadsheets are quite novel in terms of making programming simpler for some value of programming. And I'd like to think that Crux makes databases simpler.
00:21:32
Speaker
Again, for some value of database, but Crux certainly makes it easy to have that on-ramp from, okay, I'm just going to store these documents. I don't really know what's in them. That's the MongoDB use case. But then as you want to build tighter and tighter constraints around how you're querying and doing further indexing on those documents, you have that flexibility without having to bolt on a whole different query engine.
00:21:59
Speaker
Just a quick question on that, because I know you guys are very seated in the practical world of finance and business and stuff like that. Again, back in the day, people were always very concerned about the correctness of data, especially for
00:22:22
Speaker
that they conform to regulations and stuff like this. So what's the kind of motivation for having this totally freeform model? Or is that just, is it a completely separate sort of, what's driven that? What's driven that kind of like, is there a business use case there that you're kind of trying to answer is, I suppose is the question. I think so. All right.
00:22:44
Speaker
I have an answer, which is that Horkan, who built this primarily, he built the simplest possible thing he could in the shortest amount of time, and that's why it is the way it is. But he's done it with the view that everything else is
00:23:01
Speaker
something you can decide later or design later. So it's not necessarily the case that Crux does everything you could possibly want to do in terms of access controls and validation on rights and that sort of thing.
00:23:20
Speaker
in things like transaction functions, but the view is that all of these things can be achieved on top of what the core already provides. The core is simple for maintainability reasons.
00:23:32
Speaker
Yeah, I think it's more, it's not necessarily the simplest thing that we could get done in the shortest time. It's more the designer crux is layered. So we want the core to be as fundamental as possible. So we don't want to have stuff in the core that you can add as like a decorating layer.
00:23:51
Speaker
So that's the direction of travel and Crux has this layered sort of design. So things like aggregating data, extending the temporal capability. Those are things that will come in as middleware or decorators or maybe even because Crux is open. It exposes all these various protocols and you can swap the various bits and pieces. It's unbundled in that design.
00:24:15
Speaker
So it's entirely possible that someone will create a library out there that configures crux in a certain way that adds their own code. Then we'll take that over time into our repository. And we want crux. It's a bit like, do I think about the design is a bit like the sun, like in the center of the sun. It's like, you know, the photons take the longest to get out from there or in there, right? And that's a bit like the core of crux. It's going to take the longest to evolve.
00:24:42
Speaker
Because what we absolutely don't want to do at this stage is, I mean, the scheme is a classic example. If we add that into the core now, then there'll be some people out there that would be like, well, why do that? I don't necessarily need the schema on write. I prefer it on read. So if we keep it out of the core and then we add it to the higher level where it's optional, then everybody sort of has their piece of the cake and can eat it because then we satisfy
00:25:09
Speaker
both concerns. So it's kind of, yeah, it's wanting to do the simplest thing, but it's very much been the philosophy of Crux just to really keep that core simple and lean. And you should see Crux as a less of a sort of black box, but more of a customisable open box. And you can look at the core, you can extend it, you can augment it, you can play with it and you can add to it. Maybe on that note then, actually, John, is like,
00:25:39
Speaker
What design principles are you kind of operating on? What are your kind of core goals with crooks?
00:25:47
Speaker
So one design goal is the layered architecture that we mentioned and the whole layered design, the layered thinking is a thing in its own right. So that's one thing. The other thing is openness. So we really want crux to be open. The architecture is open, it's unbundled. So we use Kafka as the event log.
00:26:10
Speaker
but we also swap out Kafka. We have a local event log that isn't quite as powerful, but if you just want to get going for your pet shop and play about, then you can use that. I'm sure someone in time will come up with, you know, yeah, like pool SAR or, you know, just swap out Kafka or rats or whatever you just said. Raft. Raft. Cool. I don't know what that is. It sounds cool. So consensus algorithm, basically. Yeah.
00:26:36
Speaker
Okay, sweet. Yeah, why not? Like, I mean, there are some semantics that Kafka has like it's a
00:26:42
Speaker
It's just log of events and it's replayable. So as long as, because the way that Crux works is that you stand up a Crux node, it has a local database, a local sort of the indexes that we build that sits on top of a KV store. And the first thing that a Crux node will say is, hey, I haven't got any data. So I'm just going to replay it from this upstream event log that is also notionally part of Crux. And then it will use that sort of to keep itself informed, to keep up to date.
00:27:09
Speaker
So that open nature, you can swap out the Kafka bit if you want to for something else. You can swap out the, we build these bi-temporal indexes on top of a KV store. And the two that we're shipping with is LMDB and RocksDB. And from the get-go, we actually provide a choice.
00:27:26
Speaker
that they are slightly different characteristics. Rocks is faster to ingest, but LMDB is faster to query. Rocks is built by Facebook. It's tremendously sort of powerful. You can imagine all the development firepower it's had. But we like that idea of like crux is a kind of
00:27:44
Speaker
framework of pluggable parts and that openness precedes it. It's also open source, it's entirely open. And we want to try and have design conversations in the open and just get the community to be engaged. So openness in terms of communication, the source nature of it, the design nature of it. And yeah, that's a guiding principle.
00:28:08
Speaker
Yeah, I mean, I think those are the two that I would have. Jamie's got some more design sort of guiding principles then. Yeah. Well, yeah, I just hope the foundations of Crocs go on to enable us to build bigger and better things as a community enclosure and elsewhere. You know, Crocs, of course, is built enclosure, but it does have a lot of Java interop and we've been looking at, yeah.
00:28:38
Speaker
doing a rewrite of the core and Rust to maybe get some, shave a little bit of performance. But fundamentally it's a Closure-y thing, but actually we have a fully fledged Java API and we think the capabilities that Crocs offers are useful in so many domains and well outside of the Closure community, which we're really excited to sort of
00:29:06
Speaker
bring to bear. And we think this is a quite an untapped area of thinking this whole by temporality. So I almost feel like we're bringing by temporality to the masses in a way that I don't think we've really seen delivered by this.
00:29:22
Speaker
I wanted to riff on the before we leave by temporality because it is quite kind of dry on one sense and a little bit esoteric and you know if you're honest you might have a question which is like why do we really care and I think it's Jeremy that really got me thinking about

In-Depth Technical Dive into Crux

00:29:38
Speaker
these particular lines, which is that time is just fundamental. If you think about what we do in software, we try and create a model that represents part of the real world. I mean, that's what we're doing. And time is half a space time. Time is just so important. It's fundamental to everything. It really is intrinsic.
00:30:03
Speaker
And then as Ray said, like at the start, you know, he was sort of messing around, adding like database columns, like, you know, I'm going to add like a time column on this table, a time column, and that's a bit of slap dash, maybe I need two or three more. Jeremy said, Oh, let's get a valid two valid from let's all just have a big time frenzy party in our database model.
00:30:22
Speaker
But isn't that weird? It's like, do you not ever stop to think like, why am I doing this? Like time is so fundamental, right? It's like a database should support, it should help me with time. I shouldn't have to like, do battle with time and have that friction.
00:30:38
Speaker
And we found that when you don't really stop to sort of model time and think about time, about what you need from time in your model and your tools, then you're sort of on the back foot. And then you're doing exactly that, like adding database columns in here, doing updates to mark which version was live on a particular point in time and all this sort of stuff. And we've all, you know, most of us have done this, we've all added like various columns.
00:31:01
Speaker
But it's always that feeling that you're on the back foot and you're always catching up with time. Whereas if you think about time a bit more, like it's by temporal need, we've got valid time, we've got transaction time, what other time things do you need? What different pieces of functionality do you need? If you consider these upfront, then you're on the front foot. And a lot of that complexity and time that costs you sort of having to put the complexity into these models to cope with time, you don't have to deal with and you spend more time thinking about what the business needs, the functionality that you need to build for them.
00:31:34
Speaker
I mean, funny enough, because we were doing like blockchain stuff, so time becomes things for us as well. And the biggest thing that we, you know, that I like to talk about with time now is just the fact that in the 70s and in the 80s, when like the big database has kicked off, like, you know, DB2 and Oracle and
00:31:56
Speaker
know, infamix and ingress, all those big guys, you know, all the relational models, the reason they were how they were was mostly because of constraints in terms of storage and CPU and things like that. You know, and obviously, hardware and cloud and all this kind of stuff has moved
00:32:22
Speaker
generally the overall arc of computing is being towards giving us the capabilities of freeing us from some of those fundamental constraints. And I think it feels a bit like people have been like this, like dog whipped into thinking that this is a fun, that there's
00:32:40
Speaker
fundamental. But as you said, they're not. Time is actually fundamental. And we've been fighting against it for 30 or 40 years because we haven't had the firepower to deal with it. Yeah, that's right. I mean, it's the cost, right? I mean, it was extremely costly. I think in one of the talks, Rich was talking about this place oriented programming that we are used to because of the cost. And
00:33:01
Speaker
I think people were, at least my limited experience was that people were thinking about the time. I remember in one of the projects, I was using Hibernate and Hibernate has this audit, automatic audit thing annotation for every entity. It will create a different table.
00:33:20
Speaker
For every entity, it will capture the information automatically. But the thing was that database was Oracle, so you don't have much control on how it evolves and whatever or not. And it is also expensive to store all this data. But the domain that I was working in was court-related stuff, like the legal things.
00:33:39
Speaker
So everything is practically immutable and everything had to be kept because it's a big international criminal code sort of work. So that every transaction needs to be audited or kept the data. And the biggest problem there was that none of the database was supporting because
00:33:59
Speaker
it was too expensive to store all this shit. It's because you keep a copy of everything. So I think that's the thing. Yeah. So it's kind of fascinating when you start to look into the history of temporal databases, because in the 90s, there were these people, like Richard Snograss, and they're proposing extensions to the SQL standard, which is like, OK, we need to have system time where you can query as of system time. We need to have these temporal predicates.
00:34:26
Speaker
But you can imagine, just like you said, the answer there is to store everything. And in the 90s, they're like, what, you crazy? I'm going to store every fact, like a coral reef, everything, and you can go back and forth. That's going to cost tons. I'm going to have to get somebody from Dell out to put a new bit into my rack. It's going to cost me a fortune. So they had these dreams in the 90s, and we've been looking at the papers, and we've been geeking out on it a little bit.
00:34:54
Speaker
There's lots of interesting speculation back then. And they thought this was going to be the next big thing to free us, like you say, of these time constraints, to help with this problem. But it turns out that it just didn't really take off. And it is, like you are both saying, well, we think, because of those cost implications. Yeah. And so, because we are talking about the architecture: the underlying technology has changed a lot, so we can do very different things.
00:35:20
Speaker
whenever we're talking about databases, obviously the CAP theorem comes up. So which out of C, A, and P are you not providing in Crux?
00:35:33
Speaker
So consistency, or obviously it has to be consistent, you know, like otherwise I'm pushing the facts in and I don't see them in the table, that's fucked up. Well, no, I think consistency is always optional, actually. But OK, there's another point, isn't it, availability and partition, network partition? Yeah, partition tolerance. Yeah. So I think the official answer is we sacrifice partition tolerance.
00:36:01
Speaker
I don't think that's right because you're using Kafka. That doesn't sound right. You got multiple nodes. So my guess is that you're not concerned about consistency. We have consistency as of the transaction time because the transaction time is set when it goes into

Future Aspirations for Crux and Juxt

00:36:22
Speaker
Kafka. Consistency is like atomic transactions and serializability and all this kind of stuff.
00:36:29
Speaker
That's when you get into complicated conversations about time, actually. And I mean, we can really go into that. But why not? Fuck it, you know? Because I think it's a bit of a nightmare, serializability, and, you know, this whole definition of serializability and repeatable reads and all these kinds of things, they're a lot of horseshit. You know, they're not really well standardized, actually.
00:36:58
Speaker
So the way Crux works is everything is written to a single Kafka partition.
00:37:04
Speaker
in the, sorry, all the transactions are written to a single Kafka transaction partition. They have to be serialized. Yeah, because it's a single one. Yeah. And so the coordination is really about who gets in first. So by using CAS, compare-and-swap, and retries, that's how the nodes coordinate. There is no sort of communication between the nodes to coordinate writes to that transaction
00:37:34
Speaker
partition. So, well, the point is, say we have the numbers zero to 10. Yeah. And I'm writing zero, and then Vijay is writing one, John is writing two, and you're writing three, Jeremy. So how do we order these things? Do I go first, you know? How do you make sure that it's all ordered in the right way?
00:38:04
Speaker
This is a big problem, I think. Well, it's the point of when it goes on to Kafka. So Kafka gives us that guaranteed ordering. And then you can guarantee that all of the crux nodes are going to pick up the transactions in the same order. Okay, then you then you deal with the valid time, I guess, as a sort of mechanism for dealing with this straggler problem.
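The single-partition ordering described here can be sketched in miniature. This is an illustrative toy model, not Crux's implementation: a plain Python list stands in for the single Kafka transaction partition, and whichever write lands first gets the earlier offset, so every node replaying the log sees the same total order.

```python
import itertools

class SingleLogPartition:
    """Toy stand-in for a single Kafka partition: an append-only list.

    Whichever producer's append lands first gets the earlier offset, so
    every consumer that replays the log sees transactions in one agreed
    order -- no coordination between the consuming nodes is needed.
    """
    def __init__(self):
        self._log = []
        self._offsets = itertools.count()

    def append(self, tx):
        offset = next(self._offsets)
        self._log.append((offset, tx))
        return offset

    def replay(self, from_offset=0):
        return self._log[from_offset:]

log = SingleLogPartition()
# Three writers race to write; the log serializes them.
log.append({"op": "put", "doc": "a"})
log.append({"op": "put", "doc": "b"})
log.append({"op": "put", "doc": "c"})

# Two independent consumers replay and observe the identical order.
node1 = [tx["doc"] for _, tx in log.replay()]
node2 = [tx["doc"] for _, tx in log.replay()]
assert node1 == node2 == ["a", "b", "c"]
```

The point of the sketch is only that ordering falls out of the log itself rather than out of any node-to-node coordination.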
00:38:23
Speaker
Yes. Kafka doesn't do that, right? Because if you have multiple producers, I'm assuming every... No, it does order things, actually. There is no order in Kafka. No, per partition, yes, there is an order. I'm not talking about the order in which you put in and then you read, you get the same order. That's okay if you have one partition. But if you have multiple producers and they're writing, it's first one wins, that's it. The data that enters into the partition
00:38:51
Speaker
is based on which producer is trying to connect first. That's it. And that's actually, funnily enough, where bitemporality comes into play, because we've had this situation at a large investment bank, and they have this cutoff at five o'clock and that's the end of day, right? At five o'clock they want to draw a line in the sand
00:39:13
Speaker
and save all that data that we've had throughout the day thus far. We want to use this to calculate our P&L reports, who gets a bonus, you know, whatever, what's our risk exposure, all that stuff done on this five o'clock cutoff.
00:39:24
Speaker
But the problem is, as you say, in the real world it's like, well, crap, there's a data center in Hong Kong, there's a front office system in the US, and they're spread all across the world. So the data does arrive out of order and it's a little bit jumbled up; it will get on some message bus in the middle and get massaged, whatever. So by the time it gets into your system, it's a little bit all over the place. And at least if you've got valid time, then you can use that to your advantage, because you sort of reorder it after the fact.
00:39:53
Speaker
So you say, well, let's give some window: five o'clock as the cutoff, plus five minutes of transaction time beyond that. And then within that, we'll query against valid time. And that should be our end of day.
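That end-of-day trick can be sketched as data. This is a toy model of the bitemporal idea, not Crux's query engine: each fact carries both a valid time (when it was true in the business domain) and a transaction time (when it reached the log), and the query fixes a point for each. The trade IDs and times are invented for illustration.

```python
from datetime import datetime

# Each fact carries two timestamps: valid time (when it was true in the
# domain) and transaction time (when it reached the log).
facts = [
    {"id": "trade-1", "valid": datetime(2019, 6, 1, 16, 50), "tx": datetime(2019, 6, 1, 16, 51)},
    # Straggler from Hong Kong: happened before the cutoff, arrived after it.
    {"id": "trade-2", "valid": datetime(2019, 6, 1, 16, 58), "tx": datetime(2019, 6, 1, 17, 3)},
    # Genuinely after the cutoff in valid time: excluded from end-of-day.
    {"id": "trade-3", "valid": datetime(2019, 6, 1, 17, 2), "tx": datetime(2019, 6, 1, 17, 2)},
]

def as_of(facts, valid_cutoff, tx_cutoff):
    """Query one point in the bitemporal plane: facts valid by
    valid_cutoff, as known to the system by tx_cutoff."""
    return [f["id"] for f in facts
            if f["valid"] <= valid_cutoff and f["tx"] <= tx_cutoff]

# End of day = 17:00 in valid time, with five minutes of grace in tx time.
eod = as_of(facts, datetime(2019, 6, 1, 17, 0), datetime(2019, 6, 1, 17, 5))
assert eod == ["trade-1", "trade-2"]   # the straggler is reordered back in
```

The grace window in transaction time is what lets the late-arriving fact count, while valid time keeps the genuinely post-cutoff trade out.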
00:40:06
Speaker
So where is this transaction time coming from? Because if I have multiple nodes, multiple producers, which are actually writing to the same partition, is the transaction time, because Kafka doesn't add anything, right? So it just, you know, bytes in, bytes out. Kafka doesn't care what kind of data that you're putting in. So it cannot add anything there. So is the transaction time coming from the nodes themselves? No, Kafka does have, like, the messages get stamped when they go on to Kafka with a timestamp. And we use that for the valid time.
00:40:37
Speaker
You only get the message ID, right? Because you only get the offset. No, there's a timestamp associated with the event of going onto Kafka as well. There's metadata on the Kafka messages. Yeah, there is metadata, of course. So the transaction time is basically Kafka, the time when it's received into the queue, or sorry, received into the topic.
00:40:56
Speaker
Yeah. So you can use the valid time to add more information to say, okay, this has happened earlier than the other one, because your domain knows what came first. Yeah, you're right that the messages definitely do have an offset. I need to check exactly how that offset gels with the transaction time. But the transaction time definitely is from the point it hits Kafka. And then you can use that to get some measure of consistency when you do reads across those nodes.
00:41:24
Speaker
This is a very interesting design because I spent a lot of time with Kafka, so I had some fun with it. Before that, this authentication and SSL-based stuff, especially we had to use it at a bank and that was much more funky, securing it. That's why I'm curious about using Kafka as a data store because then that means I need to keep my, sorry, the persistence
00:41:52
Speaker
for Kafka is essentially forever in that case, right? Because you're using it as a store. That's right.
00:42:00
Speaker
Okay. So that, yeah, that is some use case that I haven't thought about before, because usually when you use Kafka, you're using it as a quote-unquote message bus. So you have like seven days or eight days of retention, and you can decide that per topic as well. So why Kafka? Why not something like FoundationDB, or some different store?
00:42:30
Speaker
Do you want to go? Yeah, I'll give it a go, although you're probably more qualified. But I think Kafka, because it is log oriented, there's, I guess, less overhead in storage.
00:42:45
Speaker
Yes, you're right that the common use case for Kafka doesn't have this unlimited retention. And if you look at most of the managed Kafka hosting options, they don't offer unlimited retention. And I know it's an evolving story as well, exactly how efficiently Kafka can support that. So John mentioned Pulsar earlier, there are other
00:43:06
Speaker
which provide very similar APIs that we could use from Crux to do that log storage. So Kafka was just something we were familiar with and could get working quickly. But John, did you have another angle? Well, Kafka does offer the infinite retention. So that's what's advertised. So it is something that we can take advantage of. Yeah, I mean, it does go against the common use case. But the way that we,
00:43:36
Speaker
So one thing we haven't really covered, you know, seeing as we're getting into detail on Kafka, I feel like I want to bring up the other big design feature of Crux, which is that Crux operates with two different topics. So let's say that you're throwing a document into Crux, right? Job done, you're sort of happy.
00:43:55
Speaker
That document, what you will do is you'll fashion up a transaction; you'll say, I want to put this document into Crux. So that transaction will go onto the transaction topic. And that topic is going to be quite small by default, because the content, the documents, go onto a different topic.
00:44:16
Speaker
That's why we can hopefully get away for quite a long time with having a single partition for the transaction topic. But the content's gone to a different topic and we've got more options there potentially with sharding and also compaction because the key on that document topic is a hash of the document.
00:44:34
Speaker
So we'll get some ease of compaction, ease of deduplication, which is a mini sort of feature for free in a way. But the main reason that we designed this is really to be able to evict data in a clean way. So the transaction log is immutable. You can't change the transactions, but all a transaction has is a pointer to the hash of a document, and that lives on the content topic, the document topic.
00:44:58
Speaker
So then it's kind of built from the ground up for this use case, like GDPR, et cetera, where you can evict. So you can go to a Crux node and say, I want to evict this document. And then it will cleanse it from that Kafka document log.
00:45:12
Speaker
Yeah. And how does that work when you're replaying? Because if I understand correctly, you send an evict command or something, and then that gets into the transaction log, and the real document gets evicted from Kafka. Yeah. How does it work when you're replaying? Because then it will
00:45:29
Speaker
when you're catching up, assuming that there is another node that just starts up, obviously Kafka will start sending, okay, you know, this is the first item, so start from here. So how does that work? Yeah, I mean, there's definitely a bit of nuance there. So really the document has to get killed from three different places. The document topic, you know, that's the main place we have to worry about. You can send a nil, I think, into Kafka, you know, for this message, for this action, and it will kill it.
00:45:58
Speaker
Then we have to evict it from the underlying KV store as well, which is the local node's storage. Then we have to evict it from a cache, the caches that we have inside Crux between the document storage and the query engine. So we have to do all those things.
00:46:13
Speaker
I think when you replay the transaction log, it just points at a document. There is some synchronization inside of the crux indexing process. So the transaction is pointing at documents. Crux will hang on until it's got the documents that have been pointed at by the transactions. But then when a transaction comes in and we know that that document isn't there because there's a nil, essentially, for that document, then it will just not index it. It will just say, look, this document doesn't exist. I can't index it.
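The put/evict/replay mechanics described here can be sketched as a toy model. Plain Python dicts stand in for the two Kafka topics and a node's local index; this is an illustration of the design, not Crux's actual code.

```python
import hashlib
import json

tx_log = []      # immutable transaction topic: only hashes, never documents
doc_store = {}   # document topic: content-hash -> document (None = evicted)

def content_hash(doc):
    # The document topic is keyed by a hash of the content, which gives
    # deduplication and compaction essentially for free.
    return hashlib.sha1(json.dumps(doc, sort_keys=True).encode()).hexdigest()

def submit_put(doc):
    h = content_hash(doc)
    doc_store[h] = doc            # the content goes to the doc topic...
    tx_log.append(("put", h))     # ...the tx topic only points at it
    return h

def evict(h):
    doc_store[h] = None           # tombstone the content; the tx log is untouched
    tx_log.append(("evict", h))

def replay_index():
    """Rebuild a node's local index from the logs, skipping evicted docs."""
    index = {}
    for op, h in tx_log:
        doc = doc_store.get(h)
        if op == "put" and doc is not None:
            index[h] = doc
        elif op == "evict":
            index.pop(h, None)
    return index

h1 = submit_put({"name": "alice"})
h2 = submit_put({"name": "bob"})
evict(h1)                          # e.g. a GDPR erasure request
index = replay_index()
assert h1 not in index and index[h2] == {"name": "bob"}
```

The key property is that replay stays valid after an eviction: the transaction log still points at the hash, but the content behind it is gone, so the replaying node simply declines to index it.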
00:46:42
Speaker
One of the interesting things about Kafka, though, I mean, that I think that, again, I remember back in the day, you know, talking about, like, transactions per
00:46:52
Speaker
second, everyone was like, we got Oracle to 100 transactions a second or 200 transactions a second. And now we're looking at thousands of transactions a second on a node, in my mind overall. So I mean, have you done any load testing on these things? It must be pretty good at ingesting stuff.
00:47:17
Speaker
Yeah, I mean, so let me go in a slightly different direction to answer that question. You asked me about design principles at the start, and I sort of neglected to say that scaling is important for this project. Like, you know, we want this to scale with lots of data, and the unbundled, I will get to answering your question, right? But the unbundled, you can ignore him as well. He's going to drone on now forever. I want my question answered, John. Come on.
00:47:48
Speaker
This is the Today programme, you know. The unbundled nature of Crux basically helps you scale, right? Because you can say, look, I don't much care for Kafka, I want to use the local sort of version that we ship with. Or maybe you'll take on the burden of writing, to fulfill the protocol, some other implementation that doesn't use Kafka, but something a bit more lightweight.
00:48:15
Speaker
But the point is that you can scale up; the unbundled nature and the pluggability of it means you can swap out the components. And as you need more power, you then make use of RocksDB, which the people at Facebook have built to scale extremely well. So that's just one nice thing about scalability: when it's open, you've got options to scale.
00:48:33
Speaker
The other scalability point is around the indexing. So using RocksDB, the algorithms at the heart, they might not be the nippiest out of the door for small data sets, but it really comes into its own when you've got lots and lots of data, because the query sort of compiles to scans over indexes in the underlying KV store. Which, by the way, has a nice side effect: the results are lazily streamed from the query engine.
00:49:02
Speaker
So you don't have to materialize all of the data in memory and then pass it through each clause. It just comes from the query engine in a funky way. Yeah, so that's scaling. And then there's a sort of full-blown Crux topology that we might envisage, where you've got Kafka and then you've got something like Rocks or LMDB backing those KV stores, with the query engine on top.
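The lazy-streaming point can be illustrated with a generator: results are produced one at a time as the consumer pulls them, instead of being materialized up front. This is a sketch of the idea in miniature, not how Crux's query engine is written.

```python
import itertools

def scan_index(index, predicate):
    """Yield matching rows one at a time, in the spirit of a lazily
    streamed query over a sorted KV store -- nothing is materialized
    until the caller actually pulls a row."""
    for key in sorted(index):
        row = index[key]
        if predicate(row):
            yield row

# A largish toy index; building the full result list would be wasteful.
index = {i: {"id": i, "even": i % 2 == 0} for i in range(10_000)}
results = scan_index(index, lambda r: r["even"])

# Only as many rows as we consume are ever produced.
first_three = [r["id"] for r in itertools.islice(results, 3)]
assert first_three == [0, 2, 4]
```

Because `results` is a generator, taking three rows only scans as far as the third match; the remaining thousands of candidate rows are never touched.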
00:49:25
Speaker
That is built to scale. We have been load testing it, so we've load tested it. We've load tested different products just to get a feel for it. We've used test suites, not just our own. So we started off with my first commit adding some shitty tests, you know.
00:49:44
Speaker
And then we went around, and Håkan said, no, we need to be a bit more professional with this rather than sort of foo-bar stuff, let's see what's actually in the industry. So we went and we took some tests. Jeremy's going to be much better able to name them. But we took some from the University of Waterloo, which really tests the graph queries
00:50:06
Speaker
kind of at scale, so it sort of sweats them as well. It gives really difficult join-condition graph queries and then puts enough volume through those to stress them. And the idea is that if you can create a graph query engine that can fulfill all of these different what-if tests, then you're in pretty good shape. And there's also these LUBM tests as well. Jeremy, can you pitch in and describe the tests? I think it's quite interesting. It's worth...
00:50:31
Speaker
Yeah, so both of these test suites are actually sort of RDF-based tests. So they're designed for testing traditional RDF triple stores. But essentially, this sort of property-graph-index-RDF dualism means that we can very easily adapt these RDF tests. So that's what's been done. So I think the official academic definition for the kind of graph query there is subgraph isomorphism,
00:51:01
Speaker
or graph pattern matching. But yeah, all these test suites, they're testing the ability of Crux to handle just absurdly complex queries, and also ones of very different shapes, well beyond an actual use case, because they're all generative tests, so they're
00:51:25
Speaker
Some of these tests take minutes to complete a single query. So they really max out the amount of memory; some of them time out, and that's kind of on purpose, they're real stress tests. So there's a whole range of stuff. But by doing comparative testing with Sail, which is an open-source Java RDF thing, we've been able to validate that, actually, Crux is doing quite well against the state of the art in this space.
00:51:55
Speaker
So I have, I think, two questions about, one about scalability and one about testing. So I think I'll start with the scalability thing. So you said by design, I can understand that transaction topic needs to be one partition, obviously, because you want to keep the...
00:52:13
Speaker
keep the order. What about the documents topic? Because one of the things with Kafka is that if I'm trying to start a cluster, then I need to decide how many replicas I need per partition. So, usually, how much of this is leaking through the Crux design, or is it
00:52:35
Speaker
completely under my control as an administrator, so I can decide, okay, the document topic can have 50 partitions? And then I can decide... because one of the issues, or one of the design problems, with Kafka is that if a partition becomes super big, then, you know, it's not really easy to bring new brokers into the system, because everything needs to be caught up.
00:52:54
Speaker
And there is always one leader per partition, so everything connects to the same machine. So how much does Crux enforce a single topic or a single partition for the document store?
00:53:12
Speaker
I think right now you can set Kafka up yourself, and then Crux can just consume off it. So it's up to you how you configure it. You are right in that, from the get-go, we expect those single topics.
00:53:28
Speaker
But then I think as we grow as the Crux project matures, there's definitely things that we can do on a content topic. And sharding is on the roadmap as well. So it's something that we want to really investigate and just listen as well. We want to understand, speak to people like yourself and just get a better feel for what options we've got.
00:53:47
Speaker
But I really like the pitch that you made or the design decision that you guys made that you can optionally replace Kafka and then introduce your own layer there. I think that is a really nice idea because, as you said, I can replace it with FoundationDB or something else.
00:54:06
Speaker
Not me. I'm not that smart. I'm just talking bullshit, probably. Some other store. No, it's entirely... There are Clojure protocols, as long as you sort of... Kafka has the semantics of the replayability, and helps with eviction, and has those qualities. But if you can fulfill that contract, then you can swap it out. That'll be...
00:54:35
Speaker
That'd be an amazing day. That'd be awesome if that happens. Awesome. So the other question that I had was, because you were talking about testing, are you considering doing, what is it called, a Jepsen? Jepsen, from Aphyr. Sorry, I only know... Call me maybe. Yeah, Call Me Maybe. Aphyr. What the fuck? What is his name? Shit.
00:55:02
Speaker
Kyle. Yeah, Kyle Kingsbury. Sorry. Yeah. Holy fuck. So are you planning to do any tests using Jepsen? Because I think he tests for the split-brain issues and network stuff and other stuff as well. Yeah. I'll just jump in there. We did look at that. But then Crux isn't actually a distributed database in the rawest form, because the nodes are single writer.
00:55:26
Speaker
So the queries aren't distributed in a way that I think the Jepsen test would test for. That's my understanding. Yeah, yeah, I think that's right. Yeah, there are, again, like,
00:55:40
Speaker
the nuances: when you have lots of nodes, you would want to use valid time as well as transaction time to run your queries, to make sure that, let's say you're using the HTTP interface, the node you're talking to is using a version of the database which exists across other nodes, because your query might go to different nodes at different points in time.
00:56:03
Speaker
But essentially, yeah, there's no distributed query going on. It's not Dgraph. It's a much simpler model. Actually, that's worth mentioning just briefly before we skim over it. You can ship it in different ways. So you can put it as a jar file in your project, and that is it. Everything is local: the KV store, the replacement for Kafka that's local.
00:56:27
Speaker
But then if you wanted to, it ships with an HTTP server. So you can do that where, you know, you fire up a farm of Crux nodes, each with an HTTP server, and then Crux ships an HTTP client that you can call to do those calls. Or you can go full-blown and have, like, a, you know,
00:56:43
Speaker
suite of Crux nodes behind a load balancer feeding off the Kafka event log, and then your app can call them. Or you can ship your applications as Crux nodes, with Crux as an embedded jar file, and then it just subscribes and starts getting those updates and builds up the local data that it needs. That's pretty cool. That means you can just have, like,
00:57:07
Speaker
if I have one application that is using Crux, then I can ship another application embedding Crux, and then they can connect to the same data store and sync up. It's like the original vision for databases: multiple applications using the same database.
00:57:24
Speaker
So what are you thinking in terms of the operational models? You mentioned a couple there, but, I mean, are you kind of prescribing or preferring a couple in the early days?
00:57:47
Speaker
I wouldn't go that far. I think we'd say that we're a little less keen on our homegrown event log as opposed to Kafka. But we recognise that people want to get on and play and then have something that they can just work with as a get-go. I mean, the local one does have the backup and restore capability as well, so it should be battle-tested.
00:58:11
Speaker
But yeah, for serious use, right now, the main option is Kafka as the event log. But then I dare say that I don't think we have any strong opinion as to whether you would always say Crux has got to be its own HTTP server thing, and you always communicate via REST to use it.
00:58:34
Speaker
I think we're ambivalent on whether you want to do that, which is fine, you can do that, or whether you just want to ship your application as single instances that wrap Crux as a library, and then it spins up the local data on disk. I think both are okay. There is some backup and restore stuff in there. So a node, if it fires up fresh, doesn't have to replay everything from Kafka. The idea is it can feed off a backup that's been made at some point using a snapshot.
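The snapshot-then-replay idea can be sketched as a toy model: a fresh node restores state from a checkpoint taken at some log offset and then only replays the tail of the log from that offset onward. This is an illustration of the concept, not Crux's backup implementation.

```python
# Toy log of 100 put operations.
log = [("put", f"doc-{i}") for i in range(100)]

def take_snapshot(log, offset):
    """Materialize the state implied by the log up to `offset`."""
    state = {}
    for _op, doc in log[:offset]:
        state[doc] = True
    return {"state": state, "offset": offset}

def boot_node(snapshot, log):
    """A fresh node restores the snapshot, then replays only the tail."""
    state = dict(snapshot["state"])
    replayed = 0
    for _op, doc in log[snapshot["offset"]:]:   # tail only, not the whole log
        state[doc] = True
        replayed += 1
    return state, replayed

snap = take_snapshot(log, 90)
state, replayed = boot_node(snap, log)
assert len(state) == 100 and replayed == 10   # 10 tail entries, not 100
```

The trade-off is the usual one: the older the snapshot, the longer the tail replay, but the node never has to start from offset zero.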
00:59:01
Speaker
OK, so the reading part. So you have a different query language, if I understand correctly. What kind of query language do you use to query all this data that we are storing? So it's EDN.
00:59:24
Speaker
So it's, yeah, EDN-flavoured Datalog. So it will look familiar to many people in the Clojure community, but there are some differences. The primary difference is actually the fact you can't query using sort of wildcard attributes. So you have to define your attributes in the query. You can't say, show me all the attributes relating to this entity. You'd have to go to the document to look for that.
00:59:51
Speaker
Yeah, because you don't have a schema to look up all the stuff. So obviously that's the document. Yeah. Yeah. And of course, you know, you can query the statistics. So Crux, of course, keeps statistics of the attributes that are indexed and ingested. So you can look at the attributes that are in the database, but there just isn't a specific
01:00:18
Speaker
index for: given an entity, show me all the attributes. So yeah, it's got two primary indexes for looking up things.
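The no-wildcard rule can be illustrated with a toy pattern matcher over entity-attribute-value triples. This is a sketch of the Datalog flavour being described, written in Python rather than EDN, and the entities and attributes are invented for illustration: every attribute you want back has to be named explicitly in the query clauses.

```python
# Toy EAV triples, standing in for an indexed document store.
triples = [
    ("ivan", "name", "Ivan"), ("ivan", "last-name", "Ivanov"),
    ("petr", "name", "Petr"), ("petr", "last-name", "Petrov"),
]

def q(triples, where):
    """where: list of (entity-var, attribute, value-or-?var) clauses.
    There is deliberately no wildcard-attribute clause: you cannot ask
    for "all attributes of this entity", only for named ones."""
    results = []
    entities = {e for e, _, _ in triples}
    for e in sorted(entities):
        binding, ok = {}, True
        for _, attr, val in where:
            matches = [v for (ent, a, v) in triples if ent == e and a == attr]
            if not matches:
                ok = False
                break
            if isinstance(val, str) and val.startswith("?"):
                binding[val] = matches[0]       # bind the logic variable
            elif matches[0] != val:
                ok = False                      # literal value mismatch
                break
        if ok:
            results.append(binding)
    return results

# "Find ?last for the entity whose name is Ivan."
res = q(triples, [("?e", "name", "Ivan"), ("?e", "last-name", "?last")])
assert res == [{"?last": "Ivanov"}]
```

In the real EDN Datalog the same query would be written as data, but the shape is the same: named attribute clauses unify variables across the triples.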
01:00:30
Speaker
It's really fun; the Datalog that we support is inspired by Datomic and DataScript, but it's not compatible, so you couldn't take an app and just use it straight away. One of the main reasons is, this is coming back to this layered thing, we just want to get the basics, the fundamentals, right. So there's lots of stuff that we could imagine being written as decorators. We might supply some decorators ourselves as part of the Crux repo.
01:00:56
Speaker
But we're intentionally not doing a one-for-one to support everything, because we see it as layered. It's like, you've got Crux, and then you can add some helpers. You can riff on those, and people can contribute different ones. And if I understand the design correctly, the query engine, or the reading side, will never hit Kafka, because it is always going to either RocksDB or LMDB. Is that right? That's right.
01:01:27
Speaker
It's also worth noting that it's always querying a single point in this valid time, transaction time continuum. So it's not querying a range within that continuum. So it's not like, oh man, what's it called? That database where you could essentially give a query and it would do live updates.
01:01:53
Speaker
Like a stream cursor in MongoDB, for example. No, it was another one, wasn't it? The one that kind of... RethinkDB, that's it. It actually pushes the data to you, right? You can just keep the connection open. You make essentially a query on a view, and if that view changes, then the data gets updated in that view. But you're not like that. You're basically saying that you
01:02:19
Speaker
And then if things change, do you get alerted of the change? Because that's one of the nice things about event systems, I think. I mean, it's one of the things I've been a bit passionate about over the years: you know, I don't want to necessarily have to constantly query back to the database all the time. One of the things we thought was nice was that you could make a query and then, you know, if one small bit of data changes, it would come to you,
01:02:46
Speaker
it would tell you, or you're a listener on that, you transact on that view. And I think Datomic has the ability... you can register a transaction listener, essentially, to say, okay, if that schema changes or that bit of data changes, then tell me about it. Do you have something similar to that, where you can listen to transaction events?
01:03:08
Speaker
Yeah, so there's two levels to this. So the first is, yeah, at the granularity of the document, you can absolutely listen to the transaction log. You know, we don't have a trivial way, or a pre-canned way, of doing this, but certainly the hooks are there, the API is there, to listen to the transaction log, filter it for just the
01:03:30
Speaker
document hashes that you're interested in and then react to those, of course. And then I guess the other level you're asking about is, for want of a better term, streaming data log, which is essentially this differential data flow research, which is going on. And Nico and others are building 3DF and
01:03:56
Speaker
Yeah, so solutions in this area, and those aren't incompatible with Crux. Crux is more of a traditional database; it doesn't offer streaming queries. But, you know, I'd say that that differential dataflow stuff is so
01:04:14
Speaker
academic still that it wouldn't cover the same range of Datalog queries that Crux is able to. So maybe it could do a subset, and the Datalog could be quite similar. So if you had the two running side by side, you could get the best of both worlds. I think it's one of these things where, because you've left it nice and open, you've probably got the ability to eventually hook that on top, by the sound of things. It's not like a closed operation.
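The lower-level approach described here — tail the transaction log, filter for the document hashes you care about, and react — can be sketched as a toy generator. This is an illustration of the layering idea, not a pre-canned Crux API.

```python
def watch(tx_log, interesting_hashes):
    """Tail a transaction log and yield (offset, op, hash) events, but
    only for the document hashes the caller has registered interest in.
    Any push-style notification is layered on top of the log like this,
    rather than being a built-in live-query feature."""
    for offset, (op, h) in enumerate(tx_log):
        if h in interesting_hashes:
            yield offset, op, h

# Toy transaction log: (operation, document-hash) pairs.
tx_log = [("put", "h1"), ("put", "h2"), ("evict", "h1"), ("put", "h3")]

events = list(watch(tx_log, {"h1"}))
assert events == [(0, "put", "h1"), (2, "evict", "h1")]
```

A real listener would consume the log continuously and trigger side effects per event, but the filtering shape is the same.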
01:04:42
Speaker
Yeah, that's really the answer that I was going to suggest as well. With the sort of layered openness, we would typically just say, well, develop a Crux access pattern, you know. You can put a Crux node out there that listens at a lower level, sees the changes, and then reacts to them in a way that can present them to some other system. I mean, the nice thing about this... because one of the things that is kind of annoying about traditional databases,
01:05:13
Speaker
is that, or in the past anyway, now we're breaking all these walls down. Like, the fourth wall was time, you know? And the other one is this transaction stuff. So all that kind of API that the database vendors had access to, but you didn't have access to. Because you've unbundled it, you know, it's the Martin Kleppmann thing, turning the database inside out. You've now got access to this transaction stuff. Then, like I say, you can layer stuff on top of that as you wish, really. So that's nice.
01:05:43
Speaker
Yeah, because you've given us access to all the actual raw data, it's very composable. If you want to read the Kafka and, you know, get in between yourself and Crux, there's a myriad of possibilities.
01:05:59
Speaker
I'm sure there's more work. Like, right now the bytes are serialized using Nippy onto Kafka. So there's a degree of openness that isn't quite open. And we've got a GitHub issue about maybe changing the serialization format or whatever. We need to consider that. But yeah, fundamentally this kind of opens it up. So you can pick and choose and get in there and weave and change things.
01:06:22
Speaker
Yeah, probably one last question from me about the Kafka related thing is that
01:06:30
Speaker
So how does the read scalability work? Because if you have one partition, then essentially you can only have one consumer because every consumer connects to one. I mean, in a group you can only have one consumer, rest of the consumers don't do jack shit if I don't have multiple partitions. So how does the read scalability, did you do any tests on how much it is going to scale in terms of reading? And related to that one, for example, if I have
01:07:01
Speaker
Yeah, because Kafka is a pull model. So the consumer needs to pull the data; Kafka is not going to notify you. So that means there is some level of delay, quote-unquote delay, because you need to trigger this polling mechanism and then the processing mechanism. So how do these things work? What is your design choice here? I'll just answer in inverse order, because the first question was hard. Well, you're not reading directly from Kafka.
01:07:30
Speaker
No, I mean, to sync your RocksDB, you need to read from the Kafka. There's a sync story there. So you can sync; you can say, wait until you've caught up to this particular transaction time. And then if you have a gatekeeper kind of Crux node where, for some reason, you want to block until you know a write has been made,
01:07:49
Speaker
then you can do that. You can say like, I only want to carry on writing once I know that it's made it all the way through a Kafka and my local RocksDB has been updated, it's ingested and now it's up to date, so I can carry on. So you can weave in that back pressure.
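That blocking behaviour — submit a write, then wait until your own node has ingested it past a given point — can be sketched with a condition variable. This is a toy model of the back-pressure idea, not the Crux API; the offsets and sleep are invented to simulate asynchronous indexing.

```python
import threading
import time

class Node:
    """Toy local index that ingests from a log asynchronously.
    await_tx blocks until this node has indexed at least the given
    offset, which gives read-your-writes on this node."""
    def __init__(self):
        self.indexed_offset = -1
        self._cv = threading.Condition()

    def ingest(self, log):
        for offset, _tx in enumerate(log):
            time.sleep(0.01)           # simulate indexing work
            with self._cv:
                self.indexed_offset = offset
                self._cv.notify_all()

    def await_tx(self, offset, timeout=5.0):
        with self._cv:
            self._cv.wait_for(lambda: self.indexed_offset >= offset,
                              timeout=timeout)
            return self.indexed_offset >= offset

log = [{"op": "put"}] * 5
node = Node()
threading.Thread(target=node.ingest, args=(log,), daemon=True).start()

submitted_offset = len(log) - 1        # the offset our write landed at
assert node.await_tx(submitted_offset)  # block until the local index has it
```

Note what it does and does not guarantee, matching the discussion: you know *your* node has caught up, but nothing here ensures every other node has.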
01:08:06
Speaker
Yeah, because that's an interesting point, right? Because you have Kafka, and I know Kafka design choice, they're like, okay, this is not our priority, so we're not gonna do this, because that's why Kafka is so simple to understand for even a person like me, that they made a really simple design, so to speak, to understand.
01:08:24
Speaker
If I'm writing to Kafka, then I can say, okay, make sure that all the brokers are in sync before acknowledging, so I can guarantee that it has been replicated. But do you cover the transaction, so to speak, if I have seven nodes of Crux
01:08:44
Speaker
and I push some data from one node? Can I say acknowledge only when this is actually at the RocksDB level everywhere, not just in Kafka? You can acknowledge when it's got to you. So it's gone into Kafka and then it's come back, and you can block until it's got to you. What you can't do right now is ensure that every single Crux node has got that data. We don't have a mechanism
01:09:10
Speaker
to cope with the fact that for some reason there's a Crux node that's on bad hardware or some micro instance and it's just behind. But normally it's like a replication factor, isn't it? Anyway, you're never going to get full duplication, and most systems don't offer that. In fact, I don't know any system that operates like that.
01:09:31
Speaker
In Kafka, you can say that. You can say, if I have a replication factor of five, then only acknowledge when all the replicas have received it. Of the five, but normally we're running 10 or 20 now. No, that's based on your replication factor. If I have 10 nodes, I can say a replication factor of five, and then I can either fire and forget, or I can wait for one node, or I can wait for all five. That's cool. But that is like the Kafka-level thing,
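The acknowledgement levels being discussed correspond to Kafka's producer `acks` setting. Here is a toy model of the decision logic only; `write_confirmed` is a made-up helper for illustration, not the Kafka client, which enforces this on the broker side:

```python
def write_confirmed(acks, replicas_acked, min_insync_replicas=1):
    """Toy model of Kafka producer acknowledgement levels.

    acks=0     -> fire and forget: confirmed without waiting for anyone
    acks=1     -> confirmed once the partition leader has the write
    acks='all' -> confirmed once all in-sync replicas (at least
                  min.insync.replicas of them) have the write
    """
    if acks == 0:
        return True                      # never waits
    if acks == 1:
        return replicas_acked >= 1       # leader counts as one replica
    if acks == "all":
        return replicas_acked >= min_insync_replicas
    raise ValueError(f"unknown acks setting: {acks!r}")

# Replication factor 5: fire-and-forget "succeeds" immediately,
# but acks='all' only confirms once all five replicas have it.
print(write_confirmed(0, replicas_acked=0))                             # True
print(write_confirmed("all", replicas_acked=3, min_insync_replicas=5))  # False
print(write_confirmed("all", replicas_acked=5, min_insync_replicas=5))  # True
```

In the real clients this is just the `acks` producer configuration; the broker, not your application code, decides when the write counts as committed.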
01:10:01
Speaker
But one thing, John, by just looking at the image, it was very easy to understand how you designed this one. So I really appreciate it. The first part that you put on the website, openness, that was very evident from the design. So I understand you're still in beta, so maybe I'm asking some stupid questions that I don't understand. It's actually outstanding. It's amazing. So we've got a way to go. Yeah, sorry. Use it at your own risk.
01:10:33
Speaker
So, the biggest question, or probably the first thing that comes to mind for everybody who is familiar with Clojure, is: how does it compare with Datomic?
01:10:45
Speaker
So, for a start, it's completely alpha and it's not mature and, you know, you use it at your own risk. I mean, that's the obvious point, but then let's go through the differences. It's schemaless. Datomic enforces a schema, which has pros and cons. Both tools, I believe, have some sharp edges. So, you know, you can cut yourself on any tool, but one of the sharp edges of Crux is that the data can be inconsistent, because we don't enforce a schema.
01:11:15
Speaker
And the other one is, as you've highlighted, right now we're shipping with Kafka for serious usage, and that's going to put some people off. There are some trade-offs with Kafka; it's kind of heavy. So there's that sort of thing to add in. Oh yeah, the Datomic story is kind of
01:11:40
Speaker
I mean, we are inspired by Datomic. Datomic really led the way on bringing your data close to your application. We've used Datomic a lot. And that awesome rapid development story that it comes with is fantastic. It's really embracing the cloud with Ions. So I dare say that if you want a mature cloud story, then the money would probably be with Datomic.
01:12:07
Speaker
And also the API is different. Datomic is going to be far more polished. There's going to be more in the Datalog that you can get access to. With Crux, you might have some false starts, because we come with a fair bit out of the box, but then there'll be helpers and APIs that people add, and there'll be some stuff in the community.
01:12:24
Speaker
So it really is, on the schemaless side, we pay a bitemporal tax. Just in terms of, you know, we had some conversations early on, which is: do we really advertise the fact that it's bitemporal? Because a lot of people would be like, what? I mean, what is that? It's a hard sell. But we've chosen to double down on that, double down on it in our architecture. And there is a bitemporal tax. It's complexity, in that our indexer has to work hard to maintain those bitemporal indexes.
01:12:53
Speaker
So the tool is taking on more complexity. Arguably, you don't have to, but you've got to work out if you want your tool to be doing that. So I think, you know, we can present what Crux is. It's inspired by Datomic and DataScript and other things. But there are enough differences there. And I think those differences will evolve, and time will ultimately tell in which cases one particular tool is better than the other.
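The bitemporal model being discussed, where every fact carries both a valid time (when it was true in the domain) and a transaction time (when the database learned it), can be sketched as a small as-of query over a version history. This is a toy illustration of the idea, not Crux's actual index layout or API:

```python
from dataclasses import dataclass

@dataclass
class Version:
    valid_time: int  # when the fact became true in the domain
    tx_time: int     # when the database recorded it
    doc: dict

def as_of(history, valid_time, tx_time):
    """Return the doc visible at (valid_time, tx_time): ignore versions
    recorded after tx_time, then pick the latest one whose valid_time
    is <= valid_time."""
    known = [v for v in history if v.tx_time <= tx_time]
    candidates = [v for v in known if v.valid_time <= valid_time]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v.valid_time).doc

history = [
    Version(valid_time=1, tx_time=1, doc={"status": "new"}),
    Version(valid_time=5, tx_time=9, doc={"status": "corrected"}),  # late correction
]
print(as_of(history, valid_time=6, tx_time=2))  # {'status': 'new'}: correction not yet known
print(as_of(history, valid_time=6, tx_time=9))  # {'status': 'corrected'}
```

The second axis is exactly what the "tax" pays for: you can ask what the database believed at an earlier transaction time, even after a retroactive correction has been recorded.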
01:13:24
Speaker
I'm sure Datomic now, for the fact it's not Alpha, is probably wiser.
01:13:29
Speaker
I mean, obviously, because we're talking about time: at this instant it is alpha, but it is going to move to production level pretty soon, obviously. So that's just an instantaneous state. Yeah, hopefully. I mean, if you want some good news about Crux, it's that we launched Crux last week. And there is that thought, coming on this podcast as well, which is: what if we've just made a bad choice, or there's something that someone will point out, and it's like, oh yeah, actually, that was
01:13:58
Speaker
pretty shockingly bad. You know, it's no one's fault, but we don't know everything; we're only human. Maybe we've got something wrong, and, okay, let's just move on, or we're going to take a lot longer to change something that's fundamentally incorrect. I'm sure there are big things we need to work on. We've got some things in the roadmap, and it's a start. It's right at the start of a journey. But so far, at least,
01:14:23
Speaker
my big fear is that something gets unearthed, an unknown or a known unknown: an invalid design. And thankfully, we haven't had something as fundamental as that occur yet. But I welcome, you know,
01:14:40
Speaker
Let's get the feedback. It needs to be battle-tested in the real world. I mean, that's the big elephant in the room: there are not actually that many people using Crux. We're starting to get people on the mailing lists, but we haven't used it in serious anger yet on production systems. We have used it internally, played around with it, stress-tested it, but it's alpha. So right now it is a bit of a leap of faith. It's a bit of a gamble.
01:15:05
Speaker
Yeah, I just look forward to the day when we can reflect and say, well, nothing's ever truly stable, but at least it's got a bit of experience under its belt. Well, there's definitely plenty of room for more databases in this space, you know. I mean, it would be crazy if there was only ever one database that the Clojure world looked to, I think. And I think you've taken a different tack. Like you say, you've taken a much more kind of
01:15:34
Speaker
open approach and much more modular approach in many ways or openly modular approach anyway. But I think, you know, everything should be possible. I don't think that the people at the atomic are saying, no one else should ever make a database. That would be weird.
01:15:56
Speaker
I don't think they're allowed to do that. I think it's got to be good for Clojure that people try stuff as well, and the different ideas. And we want Clojure, don't we? We want Clojure to be known as having these awesome applications, unless Jeremy and Håkan rewrite bits of it in Rust. But that's a different conversation.
01:16:15
Speaker
I think the other thing that's interesting, kind of like a weird thing, and I don't know if it's an actual statistic or if it's just something I've misremembered, but I think Stu once said that there's only like one or two
01:16:30
Speaker
atoms in the whole of Datomic, or something. Did you find anything, because you mentioned Clojure, did you find anything about designing and making this in Clojure that was specifically delightful or surprising? Or was it just that everything you knew about Clojure was brought to bear on this project?
01:16:52
Speaker
I think Clojure is good for the ideas, and playing around, and implementing the algorithms that we found in the various white papers that we consulted. So Clojure is seen as that: let's build it in Clojure to start with. But then we've had to rewrite bits in Java as they ossify. And then, I think ultimately, do we take the very core bits and maybe try Rust? Because it's quite good fun to see how fast we can make this as well.
01:17:22
Speaker
Yeah, yeah. So Clojure is seen as that sort of, yeah, let's just let the thinking flow and take it from there. And just that low friction as well. You could argue the schemaless, dynamic nature of it is something we sort of embraced as well. I'm guessing you're also getting a lot of performance leverage from the infrastructure,
01:17:46
Speaker
like RocksDB and Kafka; they've all got awesome performance. And often, the way I find it, maybe there are some tight loops in your code that you could optimize, but probably a lot of it is just, you know.
01:18:01
Speaker
It's quite interesting when you dig into it, and it's actually quite fascinating, because it's really about seeking around. So imagine you just have an index, which is a bitemporal index; it's just a KV store. So you have a key path, so you've got all these different keys sort of lined up against each other.
01:18:20
Speaker
And what you really want to avoid is too many seeks, because each time you don't just do a next, you actually do a seek, a jump, and that's expensive. And the way it worked was that we were doing a lot of seeks. So then, I mean, Håkan, you know, he's...
01:18:39
Speaker
He's just amazing, really. I've worked with him for about 14 years, and he's a bit of a mentor to me. If I could be as good a coder as him, you know, I'd be happy. I mean, it's not going to happen, to be honest. But he's been experimenting with, you know, the RocksDB driver that came out of the box. The JNI wasn't fast enough, so he tried to write his own in C, just to really speed it up.
01:19:06
Speaker
Because RocksDB, by default, uses raw byte arrays, but they're copied when it actually gets into Clojure-land; at that bridge from C to the JVM, the byte arrays are copied. So Crux uses direct byte buffers instead, where that underlying byte array from C is preserved. But then that wasn't fast enough, so let's build our own sort of C variant that does a better job.
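The seek-versus-next cost described a moment ago can be illustrated on a sorted in-memory key space standing in for the KV index. In this sketch `bisect` plays the role of an expensive seek (a binary-search jump) and list stepping plays the role of the cheap `next()`; one seek to the start of an entity's key range, then sequential nexts:

```python
import bisect

# Sorted (entity, time) keys standing in for the on-disk index.
keys = [("person/1", t) for t in range(10)] + [("person/2", t) for t in range(10)]
keys.sort()

def scan_versions(keys, entity):
    """One seek to the first key of the entity, then cheap next() steps
    until the key prefix changes."""
    seeks, nexts = 1, 0
    i = bisect.bisect_left(keys, (entity, 0))  # the single seek
    out = []
    while i < len(keys) and keys[i][0] == entity:
        out.append(keys[i])
        nexts += 1
        i += 1  # a "next": step to the adjacent key
    return out, seeks, nexts

versions, seeks, nexts = scan_versions(keys, "person/2")
print(len(versions), seeks, nexts)  # 10 1 10
```

A naive implementation that re-seeks for every version would do ten jumps instead of one; on a real LSM store like RocksDB that difference dominates query time, which is why reducing seek counts was worth the low-level work.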
01:19:28
Speaker
It's just trying to really maximize the potential there. With RocksDB, the Kafka guys, the Confluent guys, I happen to know, are occasionally looking to move away from RocksDB to LMDB. And one of the reasons is that the people at Facebook have geeked out on the configuration possibilities of Rocks. So, you know, if you start reading it, it's like, oh my word, how many config options and column families; we could spend months just optimizing Rocks.
01:19:58
Speaker
What we're actually looking to do is look at different algorithms as well, that could potentially replace what we're using, or augment it, just to get different speed benefits in different ways. But we're always conscious that we are paying a bit of a bitemporal tax, because we have to maintain these more complex indices. So we have to be quite careful and think for a long time before we want to pull a different algorithm in. Did Michael Drogalis go and work for the Kafka people?
01:20:27
Speaker
Yeah, I know he's working with Confluent. Yeah. Yeah. Because they were doing this, not that it was anything like what you were doing, but they had some kind of, you know, streaming. It is similar, right? Because, well, a data store, obviously, is anything that stores data.
01:20:43
Speaker
Didn't he do something with S3 and the retention? Yeah, exactly. So from Kafka, automatically storing to S3: Pyrostore. So it's like a storage mechanism, so you don't care whether it is coming from Kafka or S3. Pyrostore is something written on top of that. Yeah, sequencing for us. Definitely want to chat to him. Fairly similar, but I think a bit different.
01:21:11
Speaker
Probably same difference. So what is the role of Juxt in this project now? Where do you fit into this thing? I know it's an open source project, and that's amazing; that lets people experiment with the code, or maybe even build features, and then take it into, as you said, the layered approach. So it makes much more sense. So what is Juxt doing in the project, or not doing? What is
01:21:38
Speaker
the direction that Juxt wants to take this in?
01:21:41
Speaker
I think Juxt is, I mean, it's like a department within Juxt, which is the R&D department. And Jeremy sort of heads it up, and, you know, we're building out Crux, and we have a team, we have standups, we have a roadmap, we have a tech lead, we have a process. So from that perspective, it's just like a project that, you know, like you would have. Yeah. It's like a client project in a way. You know, it's kind of managed the same way and resourced the same way.
01:22:09
Speaker
But I think, for Juxt long-term, it'd just be nice to diversify a bit. Consulting is what feeds us. We get experience from our projects and real-world application of software.
01:22:26
Speaker
That's really what teaches us and gives us insights. Although I was thinking today, it's like every time I speak to a client these days, it's like I should be paying them. It's like reverse consulting. It's just amazing. I mean, that's potentially a danger of Crux: that we attract clients who are just so far out of our league.

Consulting Challenges vs. Product Development

01:22:50
Speaker
It's like, well, let's learn from you for a bit. But yeah, I think
01:22:55
Speaker
Juxt does want to diversify, and I think it's fun to have a product. Consulting is difficult. It can burn some people out, because you're always at that early stage of the software delivery lifecycle. You know, at the start, where you're the ones hitting the ground running, putting in the energy, building stuff, and then over time you sort of fade yourself out.
01:23:18
Speaker
But those people that were involved driving it in the high energy, they go on to the next projects and it's expected for them to go again to drive the high energy. So I think having a product and something just different, I think it just really helped people to rotate around and just get a different perspective, a different pace.
01:23:36
Speaker
it kind of keeps it all a bit fresh. And they also feed in, like Crux will help us to attract clients with different problems. And then when we get those clients, because we're learning about Crux, we're having these conversations, like we've talked about lots of different technologies and products tonight, that's good for us, because then we can talk to our clients about those. So it's kind of like a symbiotic relationship. And it's just fun as well. It's

Community Involvement and Use Cases Discussion

01:24:02
Speaker
different as well. And how do you see the community participation in the project?
01:24:06
Speaker
How do we see it? Yeah. Well, we can hope. We hope that people give it a go. You know, a little bit sort of, it's completely early. So as I mentioned, there's a bit of a fear that people will come on and say, God, what have you built is ridiculous. But that hasn't happened, thankfully, yet. And although we welcome that, of course.
01:24:28
Speaker
But no, we just want people to play with it. With that open, extensible nature of Crux, we'd be really pleased to see a repo spring up on GitHub where someone's rewritten the Datalog or extended it. That could be really cool. And it might be that we have conversations, that we find ways to learn from each other. We take some of that into Crux, or we respectfully just say, no, use this person's library if you want this particular feature.
01:24:54
Speaker
And it'd be really nice to think, who knows where Crux could go? It could hopefully go beyond Juxt, beyond our design. Let's get it out there. The community will adopt it. It will morph into something more exciting. As Jeremy mentioned, the dataflow stuff. What's it called? The 3DF guys? There are different futures. There are different possibilities.
01:25:20
Speaker
It's more exciting just to launch something out there, not necessarily knowing where it will go, but knowing that you're at least going on an interesting journey. So in theory, I mean, like you said, people could pick it up and embed it in their ideas and take it forward that way, as well as it being a product that someone can use on its own. Yeah, you can wrap it, you can extend it. There are different possibilities. Yeah, it's fantastic.
01:25:48
Speaker
I mean, it is also a product in the sense that we're going to offer support for it. And that should give people a bit of confidence that it's a bit of a safe bet because, well, hopefully, if it stands up to the test of time.
01:26:06
Speaker
but it should give people some confidence that we're going to invest in it. And I love to get to a point where it's sustainable and there's like five or six people on it and just having a wonderful time and taking ideas and just extending it. Yeah, that'd be lovely.
01:26:22
Speaker
That's amazing. So did we miss anything? I mean, any more questions? We're going to miss loads of stuff. Yeah, probably. I just noticed that we've been going for an hour and a half. It's so easy to talk to you guys about this tech, really, yeah. So I'd say the only thing that maybe we haven't touched on too much is all the different things you can do with bitemporality, like what are the actual applied use cases?
01:26:52
Speaker
We've tried to document as many as we can think of and have found in the wild, in the docs. So there are use cases across loads of different industries, whether it's financial services, health care, law. But arguably, those are places where you need bitemporality, where you'd be sort of foolish not to use it. Beyond that, arguably, it'll just make development simpler and prototyping easier.
01:27:21
Speaker
I know there's active research going on right now into bitemporal event sourcing and these kinds of things. And I think there's a whole world of possibilities yet to be uncovered. Awesome.

Crux's Future Impact and Events

01:27:34
Speaker
I think we are...
01:27:38
Speaker
I'm just checking the time. Oh, an hour and a half. OK. That's a significant amount of time. Obviously, my brain can't focus for more than 30 minutes or something, so probably after 30 minutes, whatever I said is probably bullshit. So that's fine. So thanks a lot for open sourcing this one. That's a pretty good thing. As you said, John, it must be amazing code, so there is a lot to learn from.
01:28:05
Speaker
I think probably from the community, and also, as you said, reverse consulting from the community as well. That would be amazing. And I'm really curious to try it out, because I looked into the documentation over the last couple of days. And I would at least be trying it out, because this is one of the problems that I've been following for long. And I know that Mozilla had something
01:28:31
Speaker
at some point, like a Datomic implementation in Rust or something. And then that was an abandoned project again. I forgot the name. It was Mentat. Yeah, something like that. Yeah, Mentat. And DataScript, obviously, is still flourishing. It's going on, and Nikita is doing a great job on that one. But still, DataScript is more of an in-browser sort of thing; it's not really a scalable backend thing.
01:28:58
Speaker
I think with Dataflow and Mentat and DataScript, I mean, there's a lot of work going into DataScript. It's now funded by Clojurists Together. I think you could arguably say it's a mini wave of things happening in this space. But you're filling a very, very specific area that we don't have a tool for, or at least not enough choice available. So this fits right in, especially with the things that you both explained, with the
01:29:26
Speaker
bitemporality thing, which I think I finally understood a bit. I'd love to see this project grow and get more users and more use cases and all that stuff, and more community participation. That would be amazing. And hopefully, as Jeremy said, you'll be
01:29:55
Speaker
offering it as a service or as a supported product. Exactly. That would be amazing. Having smart guys like you providing the support for it makes it much more attractive.
01:30:08
Speaker
So I think that's pretty much what I have on my list to ask. Ray, do you have any other comments or questions? Well, there's lots to talk about. Like you said, it's so easy to talk, we could probably go on for a few hours. I mean, my thinking is, we should have these guys back on in, let's say, a year's time to review what's happened.
01:30:32
Speaker
There's going to be a great wave at the beginning and a lot of enthusiasm, and I'm definitely going to use it, for sure. Let's see what happens in a year, where the design has taken you guys. I'm really excited for you. I think it's great.
01:30:50
Speaker
I love you guys, so I think you're doing great stuff. I'm very excited about this product. I know you've been cooking it for a while, so I'm really excited for you guys as well. I think it's just a great innovation, and I think you're doing great work for the community. So thank you very much. Yes, thanks very much. Yeah, thanks, Vijay. Yeah, keep up the good work with your 49th podcast. Thanks for having us on. That was an absolute pleasure.
01:31:20
Speaker
Just a couple of announcements before we say goodbye. So John, do you know when your talk will be online from Clojure/north? No, they said a month, so there's a while to go. So in two or three weeks, the talk should be out.
01:31:36
Speaker
So look out for that one, people who are listening. And Dutch Clojure Days, it was good. And other people said it was good as well, so it's not just me saying it was good. And we have all the videos up and running on YouTube. And Heart of Clojure, tickets are open right now.
01:32:00
Speaker
So book your spot to go to Heart of Clojure. I know Juxt is sponsoring Heart of Clojure, so probably, John, you'll be there. Or maybe Jeremy, both of you, or Malcolm, or the whole Juxt gang.
01:32:16
Speaker
That'd be amazing. So I think I'll show up there as well. So for the people who are listening, go and check out heartofclojure.eu, I think, and get your tickets. It's in August. So that's it from us. Thanks again, Jeremy and John, for taking your time to explain the meaning of Crux, and for taking us from thinking that it's a point of pain to actually saying, you know, the crux of something, that's the main idea.
01:32:45
Speaker
So, thank you. Goodbye.