
DataFusion - The Database Building Toolkit (with Andrew Lamb)

Developer Voices

Building a database is a serious undertaking. There are just so many parts that you have to implement before you even get to a decent prototype, and so many hours of work before you could begin working on the ideas that would make your database unique. Apache DataFusion is a project that hopes to change all that, by building an extensible, composable toolkit of database pieces that could let you build a viable database extremely quickly, and then innovate from that starting point. And even if you're not building a database, it's a fascinating project for explaining how databases are built.

Joining me to explain it all is Andrew Lamb, one of DataFusion's core contributors, and he's going to take us through the whole stack: how it's built and how you could use it. Along the way we cover everything from who's building interesting new databases to how you manage a large, open-source Rust project.

DataFusion Homepage: https://datafusion.apache.org/

DataFusion on Github: https://github.com/apache/datafusion

DataFusion Architecture (with diagrams!): https://youtu.be/NVKujPxwSBA?si=tw9ACxlbdpBuVsnv&t=1045

Datalog: https://docs.racket-lang.org/datalog/

Tokio: https://tokio.rs/

Andrew’s Homepage: http://andrew.nerdnetworks.org/

Andrew’s Blog Post about Tokio: https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/

Velox: https://velox-lib.io/

Arroyo: https://www.arroyo.dev/

Synnada: https://www.synnada.ai/

LanceDB: https://lancedb.com/

SDF+DBT: https://docs.sdf.com/integrations/dbt/integrating

Support Developer Voices on Patreon: https://patreon.com/DeveloperVoices

Support Developer Voices on YouTube: https://www.youtube.com/@developervoices/join

Kris on Bluesky: https://bsky.app/profile/krisajenkins.bsky.social

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Transcript

Introduction to Ambitious Projects

00:00:00
Speaker
One of our industry's best phrases, a phrase that's launched a million projects, is "how hard can it be?" I'll write my own framework. How hard can it be? I'll write my own web server.
00:00:12
Speaker
That's just a network socket and a bunch of headers, right? How hard can it be? So many ambitious projects have started from a point of benevolent ignorance.
00:00:23
Speaker
And a fair few have actually finished because by the time you've gone down the rabbit hole and learnt exactly how hard it can be, you've kind of gone too far to turn back.

Simplifying Database Construction

00:00:33
Speaker
There's one kind of project that might just be travelling in the opposite direction, that might be getting easier to build despite being obviously very hard.
00:00:44
Speaker
Databases. Databases are a huge, very difficult project with a lot of engineering and a huge footprint. At the least, for any database, you're going to need a SQL parser, an expression evaluator that borders on being a full programming language, a query planner, a query optimizer, a query executor, and persistence: durable, optimized, fault-tolerant persistence. That's just for starters. It's big.
00:01:14
Speaker
But... The very fact that I can list out a whole bunch of things that every database project needs suggests an opportunity. And in that opportunity, we find this week's topic, Apache DataFusion.
00:01:29
Speaker
It's a project that says, look, every database is going to need this bunch of default pieces. So let's build them once and for all in an extendable way so that when someone wants to write a new database, they can reuse all the standard stuff and just focus on the pieces that make their particular database special.
00:01:48
Speaker
It's a very good idea if someone is prepared to take on all that work to do it.

Interview with Andrew Lamb

00:01:54
Speaker
And joining me this week is one of the people who is. My guest is Andrew Lamb: a very happy Rust programmer, a very knowledgeable database guy, and a core DataFusion contributor.
00:02:06
Speaker
And he's going to unpack the whole project for us. And in doing so, he takes in just about everything that's interesting in the database world right now, from Arrow and Parquet and Iceberg, to DuckDB and DBT, to time series databases and streaming databases.
00:02:23
Speaker
Fascinating guy. And by the end of this conversation, I was actually thinking, you know, building a database, how hard can it be? I'm your host, Kris Jenkins. This is Developer Voices.
00:02:35
Speaker
And today's voice is Andrew Lamb.
00:02:50
Speaker
Joining me today is Andrew Lamb. Andrew, how are you? Very well, thank you. I'm very excited to be here. I'm glad you could make it up from the database mines. Oh, yes. Crawling out with armfuls of Rust code in your fingers.
00:03:04
Speaker
But there's so much good stuff down there. There's so much more to mine. I'll keep you out of the mine for about an hour and then you can dive back down. All right. So you've been working on, among other things, the project you're going to talk to me about right now. You've been working on

Database Evolution and Economics

00:03:20
Speaker
Apache DataFusion.
00:03:22
Speaker
And I think we need to start with understanding what it is and what it isn't. Because sometimes I feel a bit lost in all the different database-y pieces we've got, right? Like Arrow, there's also Iceberg, there are loads of different things.
00:03:37
Speaker
What's DataFusion? Yeah, well, to be able to start with DataFusion, I think I need to start at the really high level of what's been happening in the database landscape over the last, I don't know, 10, 15, 20 years.
00:03:48
Speaker
And I think without going too much into depth, what we've seen is a shift away from on-prem databases. Ten years ago, databases were basically self-managed. They were what you would call shared-nothing architectures, which is a fancy way of saying you ran them on nodes that had CPU, memory, and disk. If you've run anything in production in the last 15 years, that's probably what your database looked like.
00:04:10
Speaker
Then there was the rise of distributed databases, which had a bunch of nodes like that, but working together. And I think what we've actually seen over the last 5 to 10 years is the rise of so-called disaggregated databases, which is basically driven by the rise of cheap object storage from the cloud providers. So instead of tightly coupling the storage system to the compute and memory, they've disaggregated the storage. Those are fancy terms for saying: now you put all your data on S3 or other object stores because it's so much cheaper, and that handles the durability for you.
00:04:45
Speaker
And then you're able to scale the compute resources up and down. Snowflake is the classic example, the first major vendor that did that, but there are a bunch of other services with that architecture: BigQuery has it, Snowflake, like I said, is the canonical example, Databricks too. They're all moving to these disaggregated architectures, largely because of the economics of how storage works on the cloud. So that's what's been happening. I'm getting to DataFusion, I swear. No, I like the context, it'll help me understand.
00:05:25
Speaker
So that's sort of what's been happening in databases. And it's not dissimilar to the transition from mainframes to minicomputers to, you know, servers: the underlying hardware changes, and then you have to change your database architecture along with it.

Complexity and Cost of Building Databases

00:05:38
Speaker
However, if you've ever implemented a database, and I have in fact done that at least once before, at a company called Vertica, which was one of these distributed shared-nothing databases, they're just expensive pieces of software to build. They take a lot of effort. And as the industry shifts again, as the underlying hardware infrastructure shifts, you've got to re-architect your databases a little bit. But a lot of the pieces of those databases are still very similar, specifically how you represent data on disk.
00:06:10
Speaker
Whether it's on object store or on local disk, for analytic workloads, crunching large amounts of data, the organization is pretty similar. Likewise when you bring it into memory to process it: best practice these days, to take best advantage of SIMD on modern architectures, is to put it in columns. A lot of this stuff is columns versus rows, which we can talk about much more, but I'm trying to make a higher-level point before going into the lower-level details. Yeah. And the higher-level point is that the basic patterns for building one of these high-performance analytic systems are well understood now, and they're well understood because
00:06:48
Speaker
we've done it for at least three successive generations of products, right? You had the shared-nothing databases, like what Vertica was maybe 12, 15 years ago. ParAccel, which became Redshift, is another example of that.
00:07:03
Speaker
Greenplum was another one. And then there was a whole set of technologies in Hadoop that are very similar in theory, like Impala and Pig and those systems. And then we've basically done it again: you see the same basic architectural patterns in systems like DuckDB and Snowflake and Spark, Databricks Photon, that kind of stuff. So in my opinion, and I sort of have an angle here, the fundamental techniques to build those engines are approximately the same. Obviously the details are all different, but they're basically all columnar execution engines with what's called vectorized processing. And so
00:07:43
Speaker
the amount of money it takes to build one... So it's well understood, but that doesn't mean it's easy, right? It takes a lot of time and effort, and that translates to money. And so that observation currently keeps that type of technology, the high-performance analytic engines, locked up with the people who can afford it. That's people who will raise a lot of VC money, tens or hundreds of millions of dollars. Or, you know, DuckDB in particular is phenomenal, but imagine if you wanted to build another DuckDB. What would you need?
00:08:14
Speaker
What you need is access to basically one of the world's premier research institutions, CWI, and their grad students, right? I'm not saying that's the only thing you need for DuckDB, but it certainly helps. You said it: there's no substitute for a team full of researchers that really know everything inside out. Yeah, and basically from the lab that invented a bunch of the underlying technology, you know, from Peter Boncz's lab. However, that doesn't mean other people don't want or need that type of

Composable Databases and Components

00:08:40
Speaker
technology.
00:08:40
Speaker
So finally, we get to these pieces: Apache Arrow, Apache DataFusion, Apache Parquet. I would say those are pieces of technology that are part of a longer-term trend; I think the higher-level description is what's called composable databases, or deconstructed databases.
00:09:01
Speaker
And the idea is, rather than re-implementing basically the same techniques in a tightly integrated engine, you can reuse these components that follow best practice, which we now understand because we've done it three or four times, we being the collective industry and academia, and you can then assemble systems much more quickly without really sacrificing performance. Make sense so far?
00:09:25
Speaker
Yeah, yeah, yeah. it It seems like you're about to tell me that, you know, um no one needs to write another SQL parser. No one needs to write another how do we store columns on disk.
00:09:39
Speaker
Rather than no one ever needs to write a new one, I think the better way I would say it is: unless that's really key to whatever product you're building, unless it's really going to be a competitive advantage, you can use the existing, really good open source implementations of Parquet, Arrow, DataFusion, Iceberg (I'll talk about that), and then focus your innovation dollar on the things that actually really matter. For certain people and certain applications it really is important how they lay the data out on disk. However, for a lot of applications you're going to basically reinvent Parquet slowly over time, which again might be fine, but you should at least understand that that's what you're doing. So yeah, that's my real pitch. It's not that
00:10:28
Speaker
any particular technology in the stack couldn't be done better in isolation if you had infinite resources. However, practically speaking, you don't. And it's very, very hard to do better than what we've been able to put together in the open source world. There's been a lot of obsession about performance, which maybe I can talk about.
00:10:48
Speaker
So give me a list of the features you get out of DataFusion. Yes, yes. That will give me a sense of quite how much money I'd have to spend to do it myself.
00:10:58
Speaker
Absolutely. So let me start with, instead of just DataFusion, what we call the FDAP stack, right? This is something that we made up, but it stands for Flight, Arrow, DataFusion, and Parquet.
00:11:12
Speaker
So those are the high-level technologies. If you're going to build a database system, they're the kinds of technologies you would need to find somewhere or build yourself. So the simplest thing is you've got to store the data in some persistent format on disk that's fast to query.
00:11:28
Speaker
So that's Parquet. It's a fast columnar file format; we could talk about that a bit. You also have to have some format that you're actually going to process the data in, right? When you actually do the calculations, string manipulation or arithmetic operations or grouping or whatever, you need a representation in memory that's efficient for processing.
00:11:50
Speaker
So that's what Arrow is. It's the memory format, along with the... Oh, this is where you not only store the data in memory, but you store it in a way that's aligned with CPU architectures. That's right, that's right. Which, roughly speaking, is typically columnar these days. So that sounds really silly: oh, instead of storing it in rows, we're going to store it in columns.
00:12:10
Speaker
But the reason it's so different is because modern CPUs are really good at doing the same thing over and over again, and they have lots of data-parallel instructions. So if what you're doing is taking an array of 8,000 elements and adding it to another array of 8,000 elements,
00:12:27
Speaker
computers have very high-performance instructions that can do that. They do six or eight or sixteen integers in each cycle with these big SIMD instructions.
00:12:38
Speaker
And the compiler is very good at auto-vectorizing, aka generating code that uses those instructions, if you've got the inner loops right. So rather than organizing your code as: we're going to do this calculation for this row, and then we're going to do that, and then just move on one row after another,
00:12:57
Speaker
if instead you say: we're going to do 8,000 of the same thing all at once, picking 8,000 because that's a common batch size, but, you know, several thousand is basically the pattern, the CPU just gets going and rips through it, and it goes four, eight times faster.
00:13:11
Speaker
Yeah, this is one of those things where, in theory, there's no difference between processing it by row and by column. But in practice, there's a huge one, right? Huge, huge difference. That's right. OK, so that takes us to Arrow.
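To make the row-versus-column point concrete, here is a minimal, standalone Rust sketch (not Arrow or DataFusion code, just an illustration): the addition runs over two contiguous columns in a tight loop, which is exactly the shape a compiler can auto-vectorize with SIMD instructions.

```rust
// A tight loop over two contiguous "columns"; the compiler is free to
// auto-vectorize this into SIMD instructions that add several values per cycle.
fn add_columns(a: &[i64], b: &[i64], out: &mut Vec<i64>) {
    out.clear();
    out.extend(a.iter().zip(b).map(|(x, y)| x + y));
}

fn main() {
    // Two columns of 8,192 values each, roughly the batch size discussed above.
    let a: Vec<i64> = (0..8_192).collect();
    let b: Vec<i64> = (0..8_192).rev().collect();
    let mut out = Vec::with_capacity(a.len());
    add_columns(&a, &b, &mut out);
    assert!(out.iter().all(|v| *v == 8_191));
}
```

A row-at-a-time version of the same computation would interleave unrelated work between each addition, which is what stops the CPU from "just ripping through it".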
00:13:23
Speaker
Right, so that's what Arrow is. And then, right, so now you've got some way to process. But it's actually somewhat of a pain to work with these columnar formats, because it's not like everything has to be columnar-based, and it's not like you can just run SQL. So now you need an actual engine that can run SQL,
00:13:41
Speaker
as an example. But there are other pieces too, even if it wasn't SQL. And so that's what DataFusion is. DataFusion has a SQL parser, and it has a planner and an optimizer and a fast execution engine.
00:13:57
Speaker
So basically with, I think, three or four lines of Rust code, some of which are just adding something to a Cargo file, you can have a fast embedded SQL engine for your own project. And you might argue, well, you could do that with DuckDB too, which you can, by the way, and DataFusion has approximately the same performance. The difference with DataFusion is that it has extension APIs basically everywhere for you to plug in. All the different functions: you can override them or add your own functions. Those are the classic things.
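As a rough illustration of that "few lines of Rust" claim, here is a minimal embedded-SQL sketch along the lines of DataFusion's own getting-started example. It assumes the `datafusion` and `tokio` crates in Cargo.toml and a local `example.csv` file with columns `a` and `b`; exact method names can shift between DataFusion versions.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Create a session, register a CSV file as a table, and query it with SQL.
    let ctx = SessionContext::new();
    ctx.register_csv("example", "example.csv", CsvReadOptions::new()).await?;
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a").await?;
    df.show().await?;
    Ok(())
}
```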
00:14:31
Speaker
But you can also add your own plan nodes. And you might ask why you'd want to do that. For example, if you want to build a new query language, because you don't like SQL and you're part of the 50 years of people who are trying to do better. Yeah. But think about that, right? There might very well be a successor to SQL, but right now, or at least before something like DataFusion, if you wanted to build your own query engine for a new query language, you basically had two choices. You could either write the whole engine yourself, which takes a lot of effort.
00:15:03
Speaker
Or you could, I don't know, try to transpile it to SQL or something like that. Those are the two common approaches. And transpiling to SQL means you're basically restricted in your language to whatever you can express in SQL. And if you build your own engine, then it's this huge effort, right? You've got to build intermediate representations for the plans and all the optimizers that deal with that stuff.
00:15:26
Speaker
I can imagine. Have you heard of Datalog? I've not. Datalog is this interesting non-SQL query language that has some interesting features, like it's really composable.
00:15:40
Speaker
Yes. And in the back of my mind, I've toyed with the idea of creating something that uses Datalog. There are a couple of databases that do, but I've toyed with the idea of creating one. And like you say, it's way too much work because you don't just create the language.
00:15:54
Speaker
Yeah, so I love to think about it as... LLVM is my personal favorite analogy. So here it is, right? LLVM, I know, is a pretty low-level thing. Not everyone gets it, but LLVM is the
00:16:11
Speaker
suite of compiler tools that underlies basically all modern programming languages, right? Clang, Swift, Java, probably Julia, I don't actually know, Zig, all of them. They have their front end, they have the language, but when it gets time to generate optimized code for particular architectures, and for the debugger support, they all use LLVM. Right.
00:16:36
Speaker
And so I think the parallel there is: if you want to build a whole new language today... well, no, rewind the clock to the late nineties or early 2000s. If you wanted to build a new programming language to compete with GCC, which was the top dog then, basically the best compiler, you either had to build a proprietary one (which, if you were Intel, you could do, but not many other people did), or you were basically stuck. And the reason you were stuck is that it wasn't just that you had to make a better C compiler; you had to implement all the basic compiler optimizations, and then the code generator to make good code for a particular architecture, and a debugger, and the other tool support, all that stuff. And so if people wanted to innovate in language design, like, hey, I don't want to deal with C anymore,
00:17:26
Speaker
like Rust, I doubt Rust would ever have been able to get off the ground if they had to both innovate on the language and also reimplement all the low-level optimizations, the code generator, the debugger, all that stuff.
00:17:38
Speaker
Yeah, yeah. But instead, they could focus all their innovation on the stuff that they cared about, and they could use this high-quality shared foundation under the covers. Is this how most of the projects using DataFusion break down, then? They have one big insight about performance, and the rest is user space.
00:17:56
Speaker
That's probably an oversimplification. Well, I think it's more like...
00:18:02
Speaker
We should probably talk about a bunch of the interesting stuff people do with DataFusion. But if you're going to build a database, which is a very common thing to do with DataFusion, people will often... They want SQL, right?
00:18:14
Speaker
But then they have different ideas about maybe how the data gets into the database. In the case of my employer, InfluxDB, the data actually doesn't come in as tabular data; it comes in in a semi-CSV-like format called line protocol, which is fine, but then we turn it internally into other structures. And so from the end user's perspective, you're inserting line protocol and you can query with either SQL or, actually, we've also implemented a language called InfluxQL, which is kind of like SQL but much more focused on time series, and that also runs through DataFusion.
00:18:52
Speaker
So the point is we could build this whole system without worrying about the low-level implementation. Well, I mean, I worried, but most of the people working on Influx didn't worry a whole lot about how the low level of DataFusion was working. Instead, they were focused on the features that were important for our users: ingest, how the data is reorganized and managed in its lifecycle over time, that sort of stuff.
00:19:15
Speaker
Okay, so maybe we should take that as an example then: a time series database. Yeah. I would have thought, just naively: I'm building a time series database, I take most of it from DataFusion, so I'm not worrying about parsers or storing it on disk.
00:19:34
Speaker
I've got to write custom indexes, I would guess, is the first problem that's mine to worry about. Custom indexes and/or the ingest pipeline is probably the thing you need to worry about the most, right? When your data comes in, a time series database typically has to get it organized as quickly as possible into a format for querying.
00:19:54
Speaker
So that's one major challenge of a time series database. Another major challenge is managing the data during its lifecycle, because oftentimes these databases ingest huge volumes of data. I'm talking in general, but Influx specifically too. A lot of that data is really valuable when it's new: what happened in your monitoring is very, very important to be able to query quickly for the last half an hour, a couple of days, whatever, but by the time it's a year old, it's much, much less valuable. And so the amount of
00:20:28
Speaker
resources you want to devote to managing and querying the data that's really new versus the data that's really old is very, very different. And so a lot of the technique of building a time series database is how you manage the data as it flows from being really hot and really important to query super fast, to less important, to eventually having to age it out, though maybe you want to do historical analysis on it for machine learning, or AI, as maybe I should call it now, that kind of future workload. So that lifecycle management is typically what you'd spend your time on now, rather than reinventing the query engine.
00:21:07
Speaker
Right. Yeah. I'm trying to imagine how that would fit into DataFusion. How much of that am I just doing on disk myself and then feeding into a query engine, which is DataFusion? Give me the border of what DataFusion does and doesn't do in a project like that.
00:21:28
Speaker
Yeah. Well, let's say you already had data on disk in a format that DataFusion comes with native support for; Parquet is the classic example. But it also supports JSON and CSV and Arrow and Avro, I think; those are the data sources that come built in or provided with the crate. But you can use the same APIs that those use to implement your own.
00:21:53
Speaker
But let's say it was in one of those formats. What you would do is tell DataFusion which files form which table, and that's actually enough for people to immediately start querying the data. You say my measurements table lives in this set of files, you register that with DataFusion as a table provider, and then DataFusion will go to the trouble, when someone writes a query that's like, you know, compute the average CPU usage over the last five days, binned by minute.
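A hedged sketch of that registration step, assuming a hypothetical `measurements.parquet` file with `host` and `cpu` columns and a recent `datafusion` crate; a real deployment would typically point at a whole directory of files or a custom table provider instead.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Tell DataFusion which file(s) form the "measurements" table.
    ctx.register_parquet("measurements", "measurements.parquet", ParquetReadOptions::default())
        .await?;
    // From here on, it is just SQL against that table.
    ctx.sql("SELECT host, avg(cpu) AS avg_cpu FROM measurements GROUP BY host")
        .await?
        .show()
        .await?;
    Ok(())
}
```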
00:22:28
Speaker
All the machinery that's required to run that query, that'll be run through DataFusion. So it'll parse the SQL, it'll produce an AST, and that'll then be translated into what's called a logical plan in DataFusion. That's the traditional way: in a database engine you'll have a plan that represents the relational operators that compute the results you want. Then a whole set of optimizations happens on that logical plan. The optimizations are effectively rewrites that produce the same results but, hopefully, do so faster. It does things like partially evaluate expressions, and remove joins when it can, and a bunch of stuff like that.
00:23:14
Speaker
And then it'll apply yet another whole set of optimizations when it translates that to a lower-level form called a physical plan, which is called an execution plan in DataFusion. That's a much lower level of detail: it knows exactly which files are being scanned, how the data is sorted, and which algorithms it's going to use to do joins or grouping or whatever. And then that plan gets run by the DataFusion execution engine, which has all sorts of low-level optimizations to push data through really fast. It knows how to read Parquet super fast, it'll push predicates down into the scans, it'll only read the columns you care about, it'll apply limits where necessary. There's all sorts of
00:24:00
Speaker
stuff we could talk about endlessly, which is basically all well-known techniques. And what I just described, by the way, is how pretty much every database engine is implemented, or at least the classic ones.
00:24:12
Speaker
So DataFusion is not trying to break new ground with some innovative design. It's trying to be a really solid industrial foundation that implements best practice, but it's not implementing the latest, greatest, crazy research ideas in the hope that those will pay off.
00:24:26
Speaker
We're trying to do tried and true. Right, right. Okay, so I'm forming a mental picture where the thing that's actually coming to mind is Postgres. When I go to Postgres and I type EXPLAIN on a query, it spits out a whole bunch of stuff that says: I'm going to do a hash join against this table with a cost estimate of that.
00:24:44
Speaker
Yep, yep. That story from the query to "this is how I will go to disk and fetch individual bytes and join them together", that's DataFusion. Yeah. Not just the story of how it's going to do it, but also all the actual code to do it.
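DataFusion supports the same kind of introspection through SQL EXPLAIN, which prints the optimized logical plan and the physical (execution) plan described above. A small sketch, reusing the hypothetical `measurements` table from earlier and assuming a recent `datafusion` crate:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("measurements", "measurements.parquet", ParquetReadOptions::default())
        .await?;
    // EXPLAIN shows the logical plan after optimization and the physical plan that
    // will actually run (scans, pushed-down filters, join and grouping strategies).
    ctx.sql("EXPLAIN SELECT host, avg(cpu) FROM measurements GROUP BY host")
        .await?
        .show()
        .await?;
    Ok(())
}
```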
00:24:59
Speaker
Right. Yeah. Okay. So maybe we should talk about how you hook into that. Because, yeah, let's come back to indexes and optimization. I want to hook in because I started with a naive implementation and I realized I can easily write a specialized index.
00:25:16
Speaker
So how do I teach DataFusion to consider my index when it's optimizing the plan? So the first thing... well, let's take a specific example. Let's say, I don't know, you're trying to write something that knows how to look up something very particular. Do you have an example of an index you want to fire up, or I can make one up too? Okay, let me think about what kind of query I care about... I'm going to let you give me an example. I bet you've got a good one.
00:25:51
Speaker
All right. Yeah. So I have some good ones in time series. For example, say you might imagine it's really, really important to get the most recent value really quickly, right? You've got a table that's getting data in, but it's very common to ask: what's the current sensor reading? So you want that to be super, super fast. Let's go with that. Let's say I've got everything on disk, but the latest hour of data I keep deliberately in memory.
00:26:13
Speaker
Right. Deliberately keep in memory. So what you could do, and by the way this is exactly what Influx does, is we have a special table provider, which is the interface in DataFusion. So DataFusion comes with a table provider that knows how to read Parquet files out of directories, but you can build your own using much of the same infrastructure.
00:26:33
Speaker
So what you could do is build a table provider. From DataFusion's perspective, it just looks like a table. You have to tell DataFusion what the schema is, and whether the provider supports predicate pushdown, can apply filters during the scan, that kind of stuff.
00:26:51
Speaker
And then when it gets to planning time, eventually DataFusion will say: hey, all right, I want to scan this table with this set of predicates and this pushdown and these columns, and you, table provider, please give me the specific plan to execute.
00:27:09
Speaker
And so what you do in the case where you have the last hour in memory is your logic looks at the predicates that came down. Maybe you'd have a predicate on time; a predicate is just a fancy way of saying a boolean expression on a field that filters out rows, and they're really common in databases. So you would use those predicates to figure out: hey, this query is only looking for the last 30 minutes of data. And DataFusion did all the work to figure out that those predicates applied to your table, so you don't have to do anything; you just look at the expressions that came down.
00:27:40
Speaker
You say: hey, I know it's the last 30 minutes, and I know I have all that data in memory, already in Arrow. So I'm just going to take those memory chunks and pass them directly back to the scan for DataFusion.
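InfluxDB does this with a custom TableProvider implementation, which is where the schema, pushdown, and scan hooks described above live. As a hedged shortcut sketch of the same "hot data already in Arrow" idea, DataFusion's built-in MemTable can expose in-memory record batches as a queryable table; type paths and signatures may differ across `datafusion` versions, and the data here is made up.

```rust
use std::sync::Arc;

use datafusion::arrow::array::{Float64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // The "hot" data that arrived in the last hour, already in Arrow form.
    let schema = Arc::new(Schema::new(vec![
        Field::new("sensor", DataType::Utf8, false),
        Field::new("reading", DataType::Float64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["a", "b"])),
            Arc::new(Float64Array::from(vec![21.5, 19.8])),
        ],
    )?;

    // Expose the in-memory batches as a table; DataFusion plans against it
    // exactly as it would any other table provider.
    let ctx = SessionContext::new();
    let hot = MemTable::try_new(schema, vec![vec![batch]])?;
    let _ = ctx.register_table("recent_readings", Arc::new(hot))?;

    ctx.sql("SELECT sensor, max(reading) FROM recent_readings GROUP BY sensor")
        .await?
        .show()
        .await?;
    Ok(())
}
```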
00:27:54
Speaker
That's basically all you've got to do, right? You don't have to do any of the grouping; DataFusion will do all the pushdown stuff. Okay, what if my query is looking at the whole day, but I can supply the most recent hour quickly?
00:28:13
Speaker
How does it handle getting some data from this table and some data from that table? Well, it depends on how you model it, but if you modeled it as though it were a single table, from DataFusion's perspective it would just say: give me a scan.
00:28:26
Speaker
But what you as the implementer would do is provide a plan. This is actually exactly what InfluxDB IOx does, which is InfluxDB 3. It'll give back not just a single thing that knows how to scan Parquet files, but, in our case, a thing that knows how to scan the Parquet files with the historical data, and then a thing that knows how to provide the most recent data, and you basically union them together.
00:28:53
Speaker
So from the DataFusion planning perspective, it has no idea that they're coming from different sources until it gets to physical planning time, where it effectively compiles the access to a table into the actual operators that know how to fetch the data.
00:29:08
Speaker
And then some other optimizations might go on. So I sort of create a hybrid aggregate table, and when I'm asked for logical scans, I give back some physical details. Yes.
00:29:20
Speaker
I'm with you. That's one way to do it, for sure. Okay. I can just about see how that works. I want to talk about extending the language, but I've got one quick question. Since we're talking about live data: I query through the last 24 hours of data and aggregate it, and I get a result, right, that has five different continents and the sales totals for them.
00:29:48
Speaker
Yeah, yeah. Now, five minutes later, I've got some extra data. Is there a way to mix that in? Is there a way to say, well, I had the state of the query here, can I add in a couple of extra rows and see what would have happened?
00:30:06
Speaker
Not in the core of DataFusion. What you've just described is often implemented as, or at least I think it would be implemented as, materialized views, or streaming queries. Actually, streaming queries are another way of doing it, but a materialized view is the classic way: you have a query that you've defined up front, the view definition, and then there are at least two ways to execute it. One is you just re-execute the query every time someone wants the results. Another is that you incrementally update: you store the result of running that query previously and incrementally fold new data into that result. That's called incremental materialized view maintenance, if you want to get excited and read a bunch of research papers about it.
00:30:54
Speaker
So there are people who have implemented that using DataFusion. They basically have their own query rewriter that knows how to look at a query and figure out when it can combine new data with a historical pre-computed result. I think there's a company called Polygon that gave a talk last year in New York about some of the cool ways they've implemented materialized views.
00:31:18
Speaker
But yeah, so I've seen people do it with DataFusion; it's just not directly built into the engine itself. Okay. I'm gradually starting to get a sense of where the boundaries lie, which is interesting.

Streaming SQL and Data Processing

00:31:29
Speaker
Another thing that people did that I didn't expect when I first started working on DataFusion was they've implemented streaming SQL systems. And I wasn't even super familiar with what those were to begin with.
00:31:41
Speaker
Right. So Flink is the canonical streaming SQL system, Apache Flink. At a high level, with traditional databases, or, you know, classic analytic databases, you go and run a query and you get a result, and the engine is all designed to really crunch the data fast: whatever query you give it, it'll come up with a plan and give you the answer. Streaming SQL is a little different: you create a standing query up front. It's kind of like a materialized view, except that as data flows into the system, the standing query continually computes, like an ongoing operation. So, for example, maybe you're computing the average stock price for every minute interval, right?
00:32:23
Speaker
So basically as the stock price data comes in, you're keeping an average, and then once the time boundary turns over to the next minute, it emits the output. So those are called streaming SQL queries. I'm sure it's far more complicated than I just made it sound, and I didn't do it justice, but that's a really simple example.
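A plain Rust sketch of that standing-query idea (not Flink, Arroyo, or DataFusion code, and with made-up data): keep a running average for the current one-minute window and emit it when the boundary turns over.

```rust
fn main() {
    // (timestamp_seconds, stock_price) events arriving in time order.
    let events = [(0, 10.0), (20, 11.0), (59, 12.0), (61, 20.0), (90, 22.0), (125, 5.0)];

    let mut window_start = 0_i64;
    let (mut sum, mut count) = (0.0_f64, 0_u64);

    for (ts, price) in events {
        // When the minute boundary turns over, emit the finished window and reset.
        while ts >= window_start + 60 {
            if count > 0 {
                println!("window at {window_start}s: avg = {:.2}", sum / count as f64);
            }
            window_start += 60;
            sum = 0.0;
            count = 0;
        }
        sum += price;
        count += 1;
    }
    // Emit the final, still-open window.
    if count > 0 {
        println!("window at {window_start}s: avg = {:.2}", sum / count as f64);
    }
}
```

A real streaming engine adds the hard parts this sketch ignores: out-of-order data, watermarks, state that survives restarts, and running the whole thing continuously.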
00:32:41
Speaker
Yeah, it's good stuff. And it's kind of interesting. DataFusion, I thought, was sort of a classic batch processing system where you just give it some arbitrary query and it'll crank on it. But it turns out a bunch of people cared about implementing these streaming systems, and they've put a bunch of features into DataFusion to help support that, specifically the analysis to know which queries can be streaming and which can't.
00:33:01
Speaker
And so now there are several companies, Synnada and Arroyo are two, and I'm sure there are others coming up, that have these sorts of streaming SQL engines built on top of DataFusion.
00:33:13
Speaker
And yeah, it's been pretty neat. That's cool. And it's cool that you didn't see it coming, but it was still enabled. Yeah, yeah. I think it's partly because DataFusion also has what's called a streaming execution engine, which is...
00:33:33
Speaker
If you think about what a database execution plan is really doing, it's often described using fancy words like relational operators or whatever, but logically what it is is a data flow graph. That means there are nodes and edges that are connected. Data typically flows from sources, like reading data off disk or other access paths, and it flows eventually to the sink of the graph, which is the client or wherever the data is going.
00:34:01
Speaker
And so what an execution engine really is, is just a data flow engine that's driving this graph. In a relational engine, "relational operators" is a very fancy way of saying that what's flowing logically along those edges are tables. They're not just single values; they're rows and columns.
00:34:19
Speaker
And with that as a model, the simple thing you might do initially is just compute the whole intermediate result. So you literally compute all the data that comes out of your file, and then pass that to the next thing, and the next thing, and the next thing.
00:34:34
Speaker
But as you might imagine, for large data sets that's very, very RAM intensive, because you have to store it all in RAM, right? Yeah, and needlessly slow, because you have to finish processing a million rows before you can take them to the next step.
00:34:46
Speaker
That's right. That's right. And so best practice in these analytic engines is to implement them as streaming engines. So you're still logically passing these relations around, but what really happens is you pass batches of rows. So if you've got a million rows going from one operator to another, what you actually do is pass the first 8,000 or... I knew you were going to say 8,000.
00:35:08
Speaker
8,000 is my favorite number. There are research papers, if you want; we can put them in the show notes. But the research papers basically show that somewhere between 4,000 and 16,000 rows in a batch works approximately the same. If you get too much bigger than that, the rows stop fitting in L1 cache, so that's not effective.
00:35:27
Speaker
And if you go smaller than that, the per-batch overhead starts becoming a significant factor, right? You can't amortize the per-batch overhead as much. So 8,000 is right in the sweet spot. Well, that's just the current sweet spot.
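DataFusion exposes this as a session-level setting. A minimal sketch, assuming a recent `datafusion` crate (the constructor name has shifted across releases, e.g. `SessionContext::with_config` in older versions) and the same hypothetical `example.csv` as before:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Ask the engine to stream roughly 8192-row batches between operators.
    let config = SessionConfig::new().with_batch_size(8192);
    let ctx = SessionContext::new_with_config(config);

    ctx.register_csv("example", "example.csv", CsvReadOptions::new()).await?;
    ctx.sql("SELECT count(*) FROM example").await?.show().await?;
    Ok(())
}
```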
00:35:43
Speaker
Okay, so you batch up your million rows... Oh, yeah, sorry, the million-row thing. Yeah, you batch up your million rows into 8,000-row chunks and just pass those along. And that's what I would call a streaming engine, or a streaming execution engine.
00:35:57
Speaker
I think that's pretty typical these days for analytic systems, because otherwise, if you try to process data that's larger than RAM, it just runs out of space. Plus it must also open up the possibility of doing it multi-threaded, right?
00:36:11
Speaker
That must get easier. Oh, well, there's an awful lot of effort that goes into making these engines multi-threaded, for sure. Although, technically speaking, you can have an entirely in-memory engine that's still multi-threaded; in fact, I think the research system Umbra is like that. But anyway, yes, there's actually a whole lot of effort that's gone into applying multiple cores to running plans and keeping them all busy all the time. And that's another thing, right? You take my Mac laptop and I go run some query, and it'll keep all the cores busy, which is very satisfying.
00:36:48
Speaker
But it takes a lot of time and effort to get that to work exactly right, and not to have bottlenecks, and not to have deadlocks, and all that sort of stuff. Is that kind of thing planned out ahead of time, or is it dynamic?
00:37:00
Speaker
Does it start rerouting around different CPUs? Well, DataFusion uses something I would call a classic Volcano-style optimizer. So if you...
00:37:12
Speaker
...read the database literature, you'd know what that is, which I assume not many people do. I love the term; I want to unpack it. So the name Volcano comes from a paper about a research system called Volcano, by Goetz Graefe, I believe, back in the day. At a high level, what it does is divide the plan into some number of sub-parts, and each sub-part runs on a single core, and then there are only certain points in the plan where data is exchanged, transferred back and forth between cores. In the original Volcano paper, I believe those operators were called exchange operators. Basically that paper describes how
00:38:04
Speaker
you would take an engine that was designed for one core and run it on multiple cores. So anyway, this exchange model is common. There was a big debate a couple of years ago about whether that's a good model, whether we need a different model, but I'm going to postpone that for just a moment. So let's take an example: you're trying to aggregate a bunch of data, and maybe this will give you and your listeners some sense of what's actually required to implement fast aggregation. When I say aggregation, I mean: I have a billion customers, or a billion transactions or whatever, in a database, and I'm trying to find the average value of the transactions for every customer, right?
00:38:55
Speaker
So logically, what that means, if you wrote it in a Python program or whatever, is you'd probably read the data in and build a hash dictionary keyed on customer, right? And as every row came in, you would look up the customer record and update the running average or whatever, right?
00:39:11
Speaker
Yeah, yeah. So that's basically how databases do it too, except they do it very fast and they do it with all the cores. And the typical way to do it is what's called a multi-phase aggregation.
00:39:23
Speaker
So instead of having just one core sitting there looking at every customer, you have two phases. You actually have two sets of aggregation operations.
00:39:35
Speaker
The first one... at the core they all do the same thing: they have a little hash table that they update, although the details of how to do that fast are also fascinating in themselves. But at a high level, what you do is arrange your plan so that every core reads a block of data off disk, filters it or whatever, and then immediately aggregates the block it just read.
00:40:02
Speaker
So all the cores are basically going at once, reading data, aggregating, reading data, aggregating, and they're not talking to each other. So that part's really fast. Yeah? And there's some part of the plan that says: okay, core one, you've got this byte range in the file; core two, you've got this byte range.
00:40:19
Speaker
Yes, yeah, okay. So that's how DataFusion's planner handles that particular thing. There are other techniques for doing it, but at the end of the day, something assigns different parts of the file to different threads.
00:40:33
Speaker
And those threads, their purpose is to read the data off disk as fast as they can, and then aggregate it immediately after reading it. That's important so the data stays hot in the cache: you don't want to read a bunch of data into buffers somewhere and then eventually get around to aggregating it.
00:40:52
Speaker
To make it really fast, you need to read it, aggregate it immediately, and then go again. Just keep doing that. But if you do that, great, you keep all your cores busy, but now, let's say you have 16 cores, you have 16 hash tables, and every customer could be in every one of those hash tables, or some subset. You have no idea.
00:41:14
Speaker
So then the way these fast analytic engines work is they'll effectively shuffle the contents of the hash tables across all the cores.
00:41:25
Speaker
So typically what you do... Yeah, yeah. So I send all the ones beginning with A to core 1 and all the ones beginning with B to... yeah, okay. That's exactly right. But of course there might be duplicates, because they might have come from different original cores, right? So then you have to do a second phase of aggregation to compute the final result.
00:41:45
Speaker
Right. Yeah. And then presumably there's a third phase where you stick it all back in one place to be served up. Depending on what you want: if you're just streaming out the results, each core can just start sending them out and you read them back. If you want the customers ordered by something, like by last name, for example, you would then have to sort them somehow.
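A generic Rust sketch (not DataFusion internals) of that two-phase idea: each thread partially aggregates its own chunk, then the partial states are merged by key, mirroring the partial and final aggregate phases described above. A real engine would also shuffle the keys across cores so the final merge is parallel too.

```rust
use std::collections::HashMap;
use std::thread;

fn main() {
    // (customer_id, transaction_value) pairs standing in for a big table.
    let rows: Vec<(u32, f64)> = (0..1_000_000).map(|i| (i % 1_000, (i % 97) as f64)).collect();

    // Phase 1: split the input into chunks, one per "core", and aggregate each independently.
    let partials: Vec<HashMap<u32, (f64, u64)>> = thread::scope(|s| {
        rows.chunks(rows.len() / 8 + 1)
            .map(|chunk| {
                s.spawn(move || {
                    let mut acc: HashMap<u32, (f64, u64)> = HashMap::new();
                    for &(customer, value) in chunk {
                        let e = acc.entry(customer).or_insert((0.0, 0));
                        e.0 += value; // running sum
                        e.1 += 1;     // running count
                    }
                    acc
                })
            })
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect()
    });

    // Phase 2: merge the per-thread partial states into the final result.
    let mut finals: HashMap<u32, (f64, u64)> = HashMap::new();
    for partial in partials {
        for (customer, (sum, count)) in partial {
            let e = finals.entry(customer).or_insert((0.0, 0));
            e.0 += sum;
            e.1 += count;
        }
    }

    let avg_for_0 = finals[&0].0 / finals[&0].1 as f64;
    println!("average transaction for customer 0: {avg_for_0:.2}");
}
```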
00:42:07
Speaker
Okay, yeah, that makes sense. I'm wondering how this is implemented, because I know DataFusion's written in Rust. Oh, yeah. Under the hood, is this all asynchronous Rust or something?
00:42:18
Speaker
We do use asynchronous Rust, though I'd like to come back to that as a separate topic in a moment.
00:42:28
Speaker
In terms of Rust and performance: I have a particular bias here, but traditionally you would write databases basically in C++, right? That was the way people wrote databases, and the story was you basically had to write them in C or C++ to get good performance, because the JVM could only get you so far. At some point you need control over memory, and you just can't do it with the JVM. That's the traditional story, if you go ask the database purists or the researchers.

Rust's Role in Database Development

00:43:01
Speaker
Now, however, you might observe that 10 or 15 years ago there was a huge push into Java anyway, right? There's a whole generation of database systems that were implemented in Java. I'm looking at you, Hadoop, right?
00:43:14
Speaker
And a bunch of those others. Spark is another good example, right? It was basically written in Java, at least at first. And my theory is that the reason for that is that, despite the fact that you...
00:43:27
Speaker
...leave some performance on the table, you're just so much more productive as a programmer. And I know productivity is kind of a weird, generic thing to talk about, but as a personal anecdote about productivity: as a C++ programmer, the things I don't have to do in Java (and I'll talk about Rust in just a second) are like: why, after two days of really hard beating, does the program segfault?
00:43:54
Speaker
Right? Why does it crash? I don't know. And you spend forever trying to debug whatever the particular memory race or stomp was. So productivity is not just how fast you can pound it out in the editor; it's how much effort it takes to make a real, high-quality, production piece of software.
00:44:12
Speaker
Right, yeah, and how much of next month will I spend tracking down segfaults. I mean, I don't want to admit how much of my time earlier in my career I spent chasing those down. I still remember at Vertica we had this stress test that we'd run for two weeks against the database, and what would happen is, after two weeks of basically doing random things to the database, it would segfault. And then the goal would be: all right, you've got to go figure out where the bug is. You can do it, but it takes a long, long time.
00:44:43
Speaker
Yeah. And in the worst case, it takes you two weeks to check that it actually worked, right? Oh, yeah. And then you've got to restart it, right? That's exactly right. Yeah. Well, and one of the other great cases there was that we had one of these things segfault and we spent weeks trying to figure it out.
00:44:59
Speaker
And eventually, what happened was it turned out the hardware was bad. There was RAM in one of the servers that was actually crappy and would, every now and again, flip bits.
00:45:10
Speaker
But proving that that was the problem was a multi-week endeavor. So anyway, that's a lot of words to talk about the productivity of C++. Oh, by the way, I don't know if everyone listening has ever messed around with C++, but do you ever get weird linker errors? Like, why can't it find the stupid symbol, and which -Wl option do you have to give it to make it find it? I wasted, not as much as on chasing down distributed crashes, but far more time on build system crap than I would like to admit, either.
00:45:40
Speaker
So anyway, my thesis here is that Rust is basically as fast as C++. We can argue about this case or that case, but it's basically as fast, and it's very, very close to Java-level productivity. In my now five years of professional Rust development experience, there's been maybe one memory problem, depending on how you count, one weird memory crash. Now, that
00:46:07
Speaker
came with a lot of frustration with the Rust compiler at first, for sure. Because the compiler was just like: I'm not letting you do that. And I still remember, at first I was like, I've done this for 10 years in C++, just do what I'm telling you. No, no, I'm not going to.
00:46:20
Speaker
And then eventually, OK, you really come to appreciate what it's doing and why it's not letting you do things.
00:46:32
Speaker
But at first it was super frustrating. Anyway, where I'm headed with this is: yes, DataFusion is all written in Rust. There's very little unsafe. There are some places that use unsafe, like the really low-level kernels, often to skip validating that something is valid UTF-8 when you know it is, because you wrote it as valid UTF-8.
00:46:50
Speaker
But the amount of effort that's gone into obsessing about not copying strings around, avoiding copies, being clever about things... a lot of people have obsessed about that level of stuff. And so DataFusion's aggregation is really, really fast, and filtering, and reading Parquet is basically state of the art, from what I can tell; I don't know of anything that goes faster. So I'm quite pleased with what we've been able to do as a bunch of random people on the internet hacking together. I've been in enterprise software for 20 years, so I know what it's like to build this with big teams. If you'd ever told me we could build the level of technology we have as an open source project, with just random people on the internet who...
00:47:40
Speaker
...I have no idea where they come from or what their backgrounds are or why they're doing it, I would have said you were crazy when I started. But, you know, here we are. I want to get into the project organization, but before we go there, I still want to know what Rust-y techniques you're using to manage multiple cores and dispatch data between them in a pipeline.
00:48:03
Speaker
All right. The short answer is we use async Rust and we use the Tokio runtime. Okay. And the reason I wanted to postpone that is because I personally, and maybe I take it a little too personally, but we've taken a lot of heat in the project for that decision. I think it's the best one we ever made, in my opinion.
00:48:20
Speaker
But not everyone's agreed, so I just want to acknowledge that not everyone thinks it's a great idea. However, I think it's a great idea, and the reason is... I have a blog post about this; maybe we can point to it. I'll put it in the show notes. I have written my own thread scheduler.
00:48:40
Speaker
It's really hard to get right. It's easy to get it to work 99.9% of the time, and it's really, really hard to get it to work the rest of the time, the final one percent or half a percent. Yeah. I can believe it's full of corner cases.
00:48:52
Speaker
It's full of crazy corners, especially shutdown and cancellation and all that nonsense. And Tokio is really, really good. I know people talk about it as an async IO runtime, and I understand it was designed for IO and so on, and DataFusion absolutely fills it up with CPU-bound work; there's a whole debate about whether that was a good idea or not, but it is a good idea. And the reason is, one, you get access to a super high-quality thread scheduler, which is Tokio's runtime. Another is that you get basically built-in language support for implementing what I would call continuations. When you're building one of these data flow engines, what you have to do is: you do some work, then eventually you need
00:49:42
Speaker
input or you need to produce output, and at those particular junctures sometimes you have to wait, either for more input to become available, or because the output isn't ready to consume what you're ready to produce. So what that means is that you as a programmer have to be prepared to return control back to the thread scheduler at those junctures. Yes.
00:50:08
Speaker
You want to model the whole thing as a bunch of coroutines that can wait on data and ship data out. Well, that's one way to implement it, as explicit coroutines, for example.
00:50:19
Speaker
However, in Rust, the compiler builds that stuff for you with async. Every time you write a .await in Rust, under the covers, with some machinery I don't even fully know how the Rust compiler implements, what that effectively does is produce something which returns control, literally returns, to the Tokio thread scheduler, which can then call you back directly, or maybe it'll call something else.
00:50:47
Speaker
But the point is, you as a programmer just write straight-line code: oh, I need some more data, all right, get it, .await, and then I continue. So the code is very clear about what it's doing. It doesn't have to be: oh, I need more data, oh crap, I've got to save all my state in some variable somewhere, return control, and then next time I get called, come back and pick up where I left off.
00:51:06
Speaker
Yeah, the compiler does that for you. Anyone who's spent time in early JavaScript in callback hell and then got syntactic sugar for async/await knows the value of that.
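To make that concrete, here is a minimal sketch of the shape being described: one pipeline stage written as straight-line async code, where every .await is a point at which control goes back to the Tokio scheduler. The batch type, the channels, and the filter logic are illustrative assumptions, not DataFusion's actual operator API.

```rust
// A sketch of one pipeline stage as straight-line async code. Each `.await` is a
// point where control returns to the Tokio scheduler, which resumes the task
// later; the compiler generates the state machine that saves your place.
// The Batch type and the even-number filter are made up for illustration.
use tokio::sync::mpsc::{Receiver, Sender};

type Batch = Vec<u64>;

async fn filter_stage(mut input: Receiver<Batch>, output: Sender<Batch>) {
    // "I need some more data. All right, get it. .await. And then I continue."
    while let Some(batch) = input.recv().await {
        let kept: Batch = batch.into_iter().filter(|v| v % 2 == 0).collect();
        // Sending can also wait (backpressure) if the consumer isn't ready yet.
        if output.send(kept).await.is_err() {
            break; // downstream hung up; stop cleanly
        }
    }
}
```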
00:51:20
Speaker
Yeah, well, in my experience with something like that, it was an embedded network device in 2004 or so, at a company called DataPower.
00:51:31
Speaker
We had the same thing, where we had a multi-threaded execution engine. This was all C, by the way. Since it was a network device, it was doing lots of IO, and so you couldn't do blocking IO, right? You couldn't issue blocking IO calls.
00:51:49
Speaker
So when you were in the middle of processing something, if you wanted to do IO, you basically had to save everything you were doing in some structure, then register a callback to get called when the IO was ready, and then return back to
00:52:02
Speaker
the kernel, basically. The number of bugs I spent chasing down... getting it to work on the happy path was annoying but fine, but the error paths were just a nightmare, because, oh, I was halfway through something and something errored, well, you have an error callback, but... my god, it was miserable. So clearly triggering some trauma for you here. Yeah, right. And so then you see the compiler doing this and you're like, this is the best thing ever, right? The compiler does it.
00:52:30
Speaker
Yes, sometimes it doesn't get the lifetimes right. Yes, some people will say it has some overhead at runtime. But compared to doing it all yourself, it's so, so much better to have the compiler do it.
00:52:43
Speaker
And like I said, there are definitely people with different opinions about it than me. But I would use Tokio to run CPU-bound work again in a heartbeat. In fact, I did. I'm working with some other guys on my personal passion project at the moment, which is building a TPC-H data generator that just goes fast.
00:53:00
Speaker
TPC-H is a classic analytics benchmark. It models orders and things like that, it's easy to generate the data, and it's super well studied, so basically every analytics system will post numbers for the TPC-H benchmark. But the data generator for it is terrible: it's vintage 15-year-old C code that takes forever to run, and it emits something that's kind of like CSV but not really. So anyway, one of the things we've been doing is trying to make it fast, and we've got the data generator going super fast. And I absolutely used Tokio to do the CPU-bound work. It's like 20 lines of code, and the thing just cranks through all the CPU. It's amazing.
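For a rough sense of what that pattern looks like, here is a sketch of fanning CPU-bound work out across a multi-threaded Tokio runtime with a JoinSet. It shows the shape of the approach, not the actual generator; the per-chunk work is a made-up stand-in.

```rust
// Sketch: fan CPU-bound chunk generation out across Tokio's worker threads and
// collect the results. The per-chunk work here is a stand-in for real row
// generation (e.g. TPC-H tables); names and sizes are illustrative.
use tokio::task::JoinSet;

fn generate_chunk(chunk: usize, rows_per_chunk: usize) -> Vec<String> {
    (0..rows_per_chunk)
        .map(|i| format!("row {i} of chunk {chunk}"))
        .collect()
}

fn worker_count() -> usize {
    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(4)
}

#[tokio::main]
async fn main() {
    let mut tasks = JoinSet::new();
    for chunk in 0..worker_count() {
        // Each task occupies a runtime worker thread while it crunches.
        tasks.spawn(async move { generate_chunk(chunk, 1_000) });
    }
    let mut total = 0;
    while let Some(result) = tasks.join_next().await {
        total += result.expect("generation task panicked").len();
    }
    println!("generated {total} rows");
}
```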
00:53:47
Speaker
Anyway, that's a preview of coming attractions. That's going to be great. This is good to hear. I'm working on a Rust side project myself, and I picked Tokio because it seemed like the default. I'm glad to hear a good review of it; I've made the right choice.
00:54:01
Speaker
Now, the one thing that legitimately is tricky is if you do CPU-bound work on the same Tokio runtime as you do IO, and it's very easy to do that with DataFusion if you're not careful.
00:54:17
Speaker
And frankly, for systems where you're doing a few queries, it doesn't really matter all that much. But if you care a lot about latency, especially with large numbers of queries per second, you really don't want to be doing CPU-bound work on the same Tokio runtime as
00:54:35
Speaker
your network requests, because the network calls end up not being serviced quickly enough, and so you get throttled at the network layer. So you have to run two different instances of the runtime at once?
00:54:50
Speaker
Yes, that's exactly right. And it's not as easy as I would like it to be. You can do it, and we do it at Influx, and a bunch of other people have done it with DataFusion, the Pydantic folks for example, but it's not as easy as it should be. The upside of using just one Tokio thread pool is that it's very easy to get going: you basically don't have to think about it, you can immediately do IO in your table providers, and it all just works. The downside is that once you get to the point where you care about running large numbers of transactions a second, these effects of running different types of work on the same thread pool become much more apparent, and at that point it's a little tricky to fix. Although I still posit it's a good trade-off, because
00:55:42
Speaker
many people never even get there, and so they don't have to worry about multiple thread pools at first. So in my opinion it's a good trade-off. But, right, if you spend two weeks trying to figure out why the hell it doesn't scale under serious load, maybe you don't feel the same way, and that's okay. So traditionally, what you do in a network system that's also doing lots of CPU-bound work is you have separate thread pools.
00:56:06
Speaker
So you have one thread pool that's handling the network requests, so that it can respond with low latency to whatever is coming in over the network or whatever work the kernel has for it to do. And then you have the other threads cranking away, keeping the CPUs busy, doing the heavy data processing work.
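Here is a minimal sketch of that separation using plain Tokio (this is not a DataFusion API): a small runtime dedicated to network and IO work, and a second multi-threaded runtime that the heavy CPU work gets handed off to. Thread counts and names are assumptions for illustration.

```rust
// Sketch: keep IO and CPU-bound work on separate Tokio runtimes so slow CPU
// tasks can't starve the network event loop. Names and sizes are illustrative.
use tokio::runtime::Builder;

fn main() {
    // Runtime dedicated to CPU-heavy query execution.
    let cpu_runtime = Builder::new_multi_thread()
        .worker_threads(8)
        .thread_name("cpu-worker")
        .build()
        .expect("build cpu runtime");

    // Small runtime that only services network/IO so it stays responsive.
    let io_runtime = Builder::new_multi_thread()
        .worker_threads(2)
        .thread_name("io-worker")
        .enable_all() // timers + IO drivers
        .build()
        .expect("build io runtime");

    let cpu_handle = cpu_runtime.handle().clone();
    io_runtime.block_on(async move {
        // Pretend this came in over the network...
        let query = "SELECT 1";
        // ...and hand the expensive part to the CPU runtime.
        let result = cpu_handle
            .spawn(async move { format!("executed: {query}") })
            .await
            .expect("cpu task panicked");
        println!("{result}");
    });
}
```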
00:56:22
Speaker
Yeah, that makes sense. It's almost like... you sort of wish that was built in and you could have different quality-of-service levels for different workloads. But what you're basically doing is adding that in manually.
00:56:37
Speaker
Yeah. And I think part of it is that it's not always clear; there are a bunch of trade-offs involved. If you want to always ensure you have the lowest-latency network response, you'll basically reserve a core or more for only doing network stuff.
00:56:54
Speaker
Right, but under heavy CPU load that'll look like you're not using all your cores, because you're keeping some of the capacity in reserve for additional network requests. That might be appropriate for your system, or it might be seen as something you don't want to do. So the exact mix of how you want to assign your cores, and whether you want to reserve capacity up front or totally maximize what you're currently doing, I'm not sure there's one generic way of deciding that.
00:57:27
Speaker
And that makes sense, going back to this idea that DataFusion is there for you to choose the parts you want and build something unique, reusing the common stuff. Okay, I get that.
00:57:42
Speaker
So... you've talked about how Tokio is slightly controversial. Let's talk about working with other people. How is Rust...
00:57:52
Speaker
Sorry, how is DataFusion organized as a team? It's an open source Apache project, right? It's an open source Apache project, so it's governed the same way that every other Apache project is run. I don't know how familiar your listeners are with that.
00:58:09
Speaker
In my mind, the Apache Software Foundation doesn't directly develop any software. The way I like to think about Apache is that it's a set of people, processes, and open source best practices that they've figured out work well, and then they allow projects that want to participate and follow those guidelines to come be part of the umbrella. The Apache Software Foundation handles a bunch of stuff: some infrastructure, and the governance. But at a day-to-day level, certainly the code and what goes into it, all of that is done entirely by the project itself, not by any sort of high-level Apache body. So there's a
00:58:59
Speaker
PMC, which stands for Project Management Committee. That's maybe 15 or 20 people on DataFusion. And then there are committers, who basically have write permissions to the repositories.
00:59:15
Speaker
And that's another group. So there are probably 40 people in total across those. That's very healthy, yeah. Well, like a lot of open source projects, some of them are not active anymore, but some of them are very active. The project has multiple commits a day, probably between five and ten a day, for the last four or five years now.

Apache Governance and Contribution

00:59:35
Speaker
So in terms of numbers, the last time I counted, a couple of releases ago, we do roughly monthly releases, and there were on the order of, I think it was 47, maybe more, distinct committers in each of those months. It might have been
00:59:51
Speaker
It might have been 90; I should go check. But the point is there are a lot of individual contributors. Now, like most open source projects, if you look at the distribution curve it's a power law curve, right? There are a couple of people who do a lot, and a very long tail of people who did one thing. But I think...
01:00:08
Speaker
From my perspective, the health of the project is very, very good. Now again, I'm super biased, because I spend a lot of time on it, but if you go and look, there are multiple new PRs a day, and tickets, and people are clearly using it for real issues in real systems, because they're filing
01:00:28
Speaker
crazy corner-case bugs, which I don't think you find and want to fix in your spare time. Yeah. So is there an official DataFusion improvement process? Are there numbered DIPs? No, there's nothing formal like that.
01:00:44
Speaker
So, there are some... One of the weirdest things for me, starting to work on open source... So I came from an
01:00:56
Speaker
enterprise software background, right? Literally, I've been doing enterprise infrastructure software for 20 years. And people were asking, well, what's the roadmap? Who sets the priorities for these things?
01:01:10
Speaker
And I was like, I don't know, because there's no one that I know of who is paid full time to work on DataFusion. It might look like I'm paid full time, but I'm not; I spend maybe half my time on DataFusion.
01:01:22
Speaker
So no one gets paid directly. At a company, there are the company's resources, the people and the time,
01:01:32
Speaker
and then there's a way for the company to set priorities, and you align the people with what the company's priorities are. In the open source world, there's no equivalent
01:01:45
Speaker
corporate structure. The Apache Software Foundation isn't setting it, and the PMC isn't setting it. The people who really set the agenda are the people who are providing the time to do the work. So that was a really weird thing: who's setting the priorities? Well, it's really the people who are going to do the work. You could say, we really should do whatever, but if no one's going to do it, it doesn't really matter, because I have no direct control over what people do. They're doing it because they find value in it.
01:02:15
Speaker
There's a question I reckon I could ask a thousand developers and get a thousand different answers, but how do you feel about that way of organizing a project? I think it's worked out phenomenally well. That's what I think.
01:02:28
Speaker
There are definitely times when I desperately wish I could just say: I want this thing done, you go do it. When you're a manager at a company you can perhaps do something like that, although good managers, of course, realize that's probably one of the least effective ways to get people to do stuff. So what I've taken to doing, which has worked super well, is I basically think about it like this:
01:02:55
Speaker
I can't tell people what to do; I have no direct influence over them. But what I can do is make the easy path be doing what I think is important,
01:03:10
Speaker
the stuff I think is a priority for the project. That's the technique I've been using, and I find it very, very effective. Now, what does that actually mean? It means spending a lot of time writing down tickets and explanations, and basically watching. If you see someone show up to the project who is clearly high caliber and motivated, they're going to do a lot of good work.
01:03:39
Speaker
If you see them show up with an interest in some project that I also think is important, spilling to disk is a good example, or someone got excited about doing explain plans, which I can talk about, those were recent examples. This TPC-H data generator is another one where someone showed up who was interested in doing it.
01:03:58
Speaker
And I thought those were important too. So what I can do is explain what I think is needed, or what I think would make a good project. It turns out developers love coding stuff more than defining projects, right? Yeah. And I can help review their PRs quickly and help publicize what they're doing.
01:04:18
Speaker
And so I think that's worked out really well. If I want stuff done but have no direct influence over people, making it like, hey, here's all this cool stuff you could do if you want to come work with us,
01:04:30
Speaker
it's been shockingly effective. Yeah. I remember we had Alice Cecile on talking about Bevy, and that really rings a bell. She put it very simply: talk about what needs working on and people organize around it.
01:04:45
Speaker
I'll give you some examples. So DuckDB is another one of these embedded databases, right? It's very popular. I sometimes have envy of DuckDB's feature set, for a bunch of reasons.
01:04:58
Speaker
Okay. And one of the features they have that's really, really cool is their explain plans. Classic explain plans in a database look like the Postgres style: indented text that's kind of a pain to read, you almost need a special skill to read it. And it's the thing you call up when a query isn't performing the way you thought it would. Yeah. For developers who understand the internals, those explain plans are fine, they have all the details you need, but if you're not used to seeing them, mapping that indented text into a tree in your mind is a bit much. So anyway, DuckDB has this really cool, basically ASCII-art tree explain plan mode, and people love it. And
01:05:40
Speaker
I've wanted it for DataFusion for a while. And one of the CMU classes said, you know, one of the reasons we used DuckDB instead of DataFusion for the first project is because it had these cool explain plans.
01:05:53
Speaker
Because it's a lower barrier to entry. So I wrote that up and we talked about it. And then someone, I don't even know their real name to be honest, I know their GitHub username is Irene J, thank you, Irene J, showed up and said, hey, this is great. That person went and ported a bunch of the code from the DuckDB rendering engine to Rust and put it in DataFusion.
01:06:16
Speaker
And it was good enough that we merged it, and then we organized a project where we listed out: hey, there are something like 20 different operators that have to be changed to use this new format.
01:06:29
Speaker
And we basically did one or two of them together closely, figured out the pattern, and then filed the other 18 tickets that were: apply this pattern to this operator.
01:06:40
Speaker
And in a week and a half, it was done. It's amazing. And was your involvement in that, at the start, basically just drawing out the ASCII art of what you hoped it would end up looking like?
01:06:55
Speaker
Well, someone had filed a ticket saying, hey, it would be great if we had DuckDB-style plans. I might have done that. But then, one of the things I think the DuckDB guys have been really good at is building mindshare in the
01:07:10
Speaker
research community, probably because they came from the database research community and they're running around talking about it. So again, I got envious there: how can I make DataFusion more inviting as a research platform? Because I think that's another way to draw contributors in. If you want the product to be the best it can possibly be, you've got to get good people to come in and help you build it together. So how do you do that? Well, graduate students doing database research sound like a pretty good set of people to encourage to help you.
01:07:36
Speaker
Yeah, especially if they're sick of segfaults, right? Right, and you get a boost from Rust being hip; people do seem to be very excited about doing Rust, so that helps. But there's another example too, where someone's recently been working a lot on spilling, you know, when you're sorting data that doesn't fit in memory. Oh, in that sense, yeah. Right, so there's a whole body of research about how you actually do that efficiently. You have to spill it to disk and merge it and manage memory and so on. It's non-trivial to get working right, or to get working well and fast. DataFusion had a basic version of that, but as people clearly started using it in production, because they started filing tickets, it was not as high quality as some of the other pieces of DataFusion.
01:08:25
Speaker
But several people showed up who were clearly very good engineers and who were actively trying to push that forward. And so even though that particular project is not a priority for my employer, Influx, we don't have disks attached to our processing nodes, so we don't spill and it's basically irrelevant for us, it's important enough for other users that I think it's worth helping. Plus there are people who are clearly very good and want to push it forward. It's one thing if some random person from the internet shows up and you've got to somehow
01:08:59
Speaker
get them up to speed to even understand why spilling matters, versus having people who clearly understand that and are actively willing and able to help push the thing forward. If you think about leverage, helping them achieve that goal is a very high-leverage opportunity.
01:09:15
Speaker
Yeah. Especially if they've read through all the papers that explain how to do this properly and they're keen to implement it. Yeah, I can totally see that. So, as we're talking about developers, pull me back out into user-developer space.
01:09:32
Speaker
I would like to understand... I'm probably not going to start reading a database paper and implementing new DataFusion features, but I am curious about how you might do simple things like adding a new SQL function, or more complex things like how I might add Datalog as a new language. How does that work?
01:09:52
Speaker
Yeah, so for new functions there's a whole bunch of examples of how to do that. DataFusion comes with, I don't know, 300 or 400 implementations of functions, and the APIs that are used internally are exactly the same ones that you as a user can use. So there's a whole wealth of example implementations of functions, plus simple write-ups in the documentation. In terms of adding a new function, it's very straightforward; there are a whole bunch of examples.
01:10:27
Speaker
Implementing a new language is slightly more interesting. To do a new language, you need a parser for your Datalog: something that takes the string that came in and parses it. Typically, the way you would do it is the string comes in as just a string of bytes.
01:10:44
Speaker
You implement some sort of parser that parses that into a tree, an abstract syntax tree is the fancy way of saying it, but it's a structured representation of what you wrote in text.
01:10:56
Speaker
And then there's a translation step that translates whatever the structured AST means into the first data flow graph, which in DataFusion's case means a logical plan.
01:11:08
Speaker
So you need something which... How much does that logical plan look like SQL? It doesn't look like SQL. It looks more like a data frame. Oh, okay.
01:11:24
Speaker
It doesn't say what you want; it says the how. So it's: scan this table, then sort it by this, then apply these functions to it, then maybe aggregate like this, in that order. It's a specific order. That gets built up, and that initial plan describes how to calculate whatever result you want, and then there's a whole bunch of machinery that goes on to try to make that calculation really fast. But the initial logical plan describes how to calculate it.
01:12:03
Speaker
Okay, so I've got a string, I parse it to my AST, and I've got to turn that into something that, mentally, could be an inefficient physical plan. Yeah, you don't have to worry. You could just scan all the columns, apply this filter here, aggregate it like this, and done. You don't have to worry about how to push the filters down, or where you don't have to carry the columns through, or how to avoid sorting if it's already sorted, and all that kind of stuff. It just describes what to do, and then there's a bunch of optimizers that will go and make it really fast. Yeah, okay, I think I see more clearly why you think of this as being a bit like LLVM.
01:12:42
Speaker
Yeah, if you just generate that logical plan, then there's a whole bunch of stuff that happens after that. You can get an explain plan immediately out of it, for example, or run it through that crazy multi-phase aggregation we were talking about. As a language designer, you don't have to worry about that at all. You just say, I want an aggregation like this, and DataFusion goes and does its thing.
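As a rough illustration of what "just generate that logical plan" can look like from a frontend's point of view, here is a sketch using DataFusion's DataFrame API, which builds the same kind of logical plan a SQL or Datalog frontend would hand over. The table name, columns, and exact method signatures are assumptions, and the API shifts a little between DataFusion versions.

```rust
// Hypothetical frontend lowering: once your parser has produced an AST, you
// translate it into a logical plan. The DataFrame API below is one way to build
// that plan ("scan, then filter, then project, then limit"); the optimizers and
// the execution engine take it from there. Table and column names are made up.
use datafusion::prelude::*;

async fn lower_query(ctx: &SessionContext) -> datafusion::error::Result<DataFrame> {
    let df = ctx
        .table("events")                             // scan this table...
        .await?
        .filter(col("status").eq(lit("error")))?     // ...then filter it...
        .select(vec![col("ts"), col("message")])?    // ...then keep two columns...
        .limit(0, Some(100))?;                       // ...then take the first 100 rows
    Ok(df)
}
```

From there you can ask for an explain plan or call collect() to execute it; either way, the frontend never has to think about the multi-phase aggregation machinery.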
01:13:00
Speaker
Nice. And for the simpler case where I'm just adding functions to SQL: I've done this before in a Java-based tool that did SQL for the Kafka world.
01:13:14
Speaker
Yeah. And a lot of it boiled down to: there are simple scalar functions, which are, I'll give you a value, give me the changed value back; and then there are aggregate functions, which are, I'll give you the state so far and the new value, and you aggregate them.
01:13:28
Speaker
Is it just as simple as that? Pretty much. DataFusion actually has four different types of user-defined functions. There are the scalar functions you described, where logically you get one row in and generate one row out for each input row. There are aggregate functions, where you get a bunch of rows for a group and produce one row out, again logically. Then there are window functions, SQL window functions, where you get a bunch of rows in and still produce a bunch of rows out, but you can look at the other input rows. And then there are also table functions, where in theory you get a table in and produce a table out.
01:14:09
Speaker
There are basically Rust APIs to implement all those different kinds, and they're all implemented in terms of Arrow, Apache Arrow. So when you implement a scalar function, there is a wrapper you can use to literally implement it one row to one row,
01:14:24
Speaker
but the way most people will do it is in terms of the batch API: you get given a batch of, say, 8,000 rows, and you have to produce the batch of output rows. Then, depending on what your function is doing, there's a whole bunch of kernels in the Arrow library that do calculations on them really fast; there are a whole bunch of ways to make those functions quick. I'm a little bit surprised I get it at the row, or batch-of-rows, level rather than at the individual column value level.
01:15:00
Speaker
Well, I mean, you get the column values for a bunch of rows
01:15:10
Speaker
So do I have to include some logic to deal with the fact that I'm doing more than one function call? If I was re-implementing uppercase, would I just get a string in and a string out, or would I get a batch of rows? You'd get an Arrow array in, a string array actually, and you'd have to produce an Arrow array out.
01:15:31
Speaker
Okay, so there is a little bit of machinery I have to worry about around the transformation I'm interested in. Yes. However, there are enough examples of how to do that that you can write it in three lines, basically. Okay.
01:15:48
Speaker
Or you can write it in a hundred lines. The hundred-line one is going to be a lot faster because you'll take care of a bunch of special cases, but the three-line one will work fine for you. And that, in my mind, is also very nice, because you can start with the simple version; it's actually how we implement a lot of these functions initially. Then, as people care about them, they'll go do things like optimizing away string allocations, which makes them more complicated but faster. So over time they get better.
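Here is roughly what the simple "three line" version of that uppercase example looks like at the Arrow level: a string array comes in, a string array goes out, and nulls pass through. The registration step (create_udf or the ScalarUDF trait) is deliberately left out because its exact signature varies between DataFusion versions, so treat this as a sketch of the function body only.

```rust
// Sketch of a scalar UDF body: Arrow array in, Arrow array out.
// Nulls stay null; everything else is uppercased. Wiring this into DataFusion
// (create_udf / ScalarUDFImpl) is omitted and version-dependent.
use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray};

fn my_upper(input: &StringArray) -> ArrayRef {
    let out: StringArray = input
        .iter()                               // yields Option<&str> per row
        .map(|v| v.map(|s| s.to_uppercase()))  // keep nulls as None
        .collect();
    Arc::new(out)
}
```

The hundred-line version would do things like reusing one output buffer instead of allocating a fresh String per row.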
01:16:17
Speaker
Right, yeah, I can see that. So I want to pull out one more layer then: what are the most interesting projects using DataFusion, do you think? I'll give one more plug for InfluxDB for time series. Yeah, clearly InfluxDB is using it for time series across all the different 3.0 product lines.
01:16:38
Speaker
There are quite a few other database companies, a whole bunch of people doing various observability things, other time series databases, or logs, or whatever. It's still shocking to me how there are new companies coming up with products for that all the time, so it's clearly not a solved problem, but a bunch of people are building observability products on top of DataFusion.
01:16:59
Speaker
Then there's a bunch of various other specialized databases, and a sequence of people who are basically trying to take over some or all of the workload of Spark. Spark is huge. Yeah. And so there's a subproject of DataFusion called Apache DataFusion Comet, which is trying... I don't know if you're familiar with something called Velox? No. So I have to explain.

Spark and Performance Innovations

01:17:30
Speaker
I'll give you the background. So Spark is an entirely Java-based engine, at least when you start with it.
01:17:37
Speaker
It's very, very popular in the data processing world. However, as soon as you start running it on large data volumes, people immediately realize, hey, it's spending a huge amount of time in the JVM that we could make better.
01:17:48
Speaker
And so one of the first things Databricks did, which is a big Spark vendor, is implement their own custom execution engine in C++ that has the same semantics as the Java one. You still interact with it through the Java frontend, but what actually goes and runs the code runs in C++.
01:18:05
Speaker
But that's proprietary, it's not open source. So Facebook made an open source one called Velox to do basically the same thing, to try to accelerate those workloads. And then a bunch of people at Apple and elsewhere are trying to build something similar in Rust, using DataFusion's lower-level primitives. If you want the fast Parquet reader, you just use DataFusion's fast Parquet reader and stick it into the Spark plan.
01:18:33
Speaker
That's obviously skipping a whole bunch of details, but it also seems that a bunch of people have been using DataFusion to move some of their Spark work, whatever they've been doing in Spark, to DataFusion's native Rust implementation and do it more efficiently. So there's this Comet project to try to replace the whole execution runtime.
01:18:54
Speaker
But there have also been people trying to offer Spark-compatible engines that do whatever people want to do, but faster and more efficiently. Sail is an example of someone doing that. And

Data Processing with Data Fusion

01:19:08
Speaker
I think there are other interesting ones too... So there's a whole bunch of database people doing lots of database stuff. There are the streaming systems I talked about, like Arroyo or Synnada, and there are some other ones too.
01:19:17
Speaker
And again, these are the ones I know of. No one has to tell us that they're using DataFusion, so I don't know all of them, just the ones who bother to tell us. Yeah. Also, interestingly, there are the
01:19:28
Speaker
table formats, like Apache Iceberg, Delta Lake (delta-rs), and Apache Hudi. Those are table formats, so they add a little layer of metadata on top of a bunch of files that live on object stores so you can treat them as tables. Interestingly, the first versions of those were typically written in Java, because they came from the Hadoop-ish world, but if you look at the Rust implementations, they all use DataFusion under the covers, because those
01:19:59
Speaker
table formats require you to do some amount of processing on the data. Compaction, for example, basically means you've got to re-sort the data, and DataFusion has a lot of low-level operations to re-sort data really fast. Write Parquet? We can do that. We read Parquet really fast too.
01:20:14
Speaker
So that's a kind of cool, interesting use case. And then you also see other people trying to provide different data formats. LanceDB is one, sort of an AI and machine-learning-focused file format and database system; of course there are going to be a lot of people doing things for AI right now. Yeah. And then there's a company called Spiral, I think, where Vortex is the name of the product, which again I think has a specialized format for certain workloads; they use DataFusion, and I think they also connect to DuckDB too. So there are some other interesting uses. Another fascinating one I found was a company called SDF, which was recently acquired by dbt. I know that's a lot of three-letter acronyms, but
01:21:08
Speaker
dbt is this data preparation tool that is quite popular. SDF had a similar type of tool, but what they used DataFusion for is helping companies that had big collections of SQL queries that basically did ETL. You describe data processing as a bunch of ETL. That sounds very fancy, but it basically means you run a bunch of queries to generate reports. And typically it's not just a single query to generate a report; it goes through five or ten different levels, who knows.
01:21:40
Speaker
And so by the time the data comes out the other end, it's very hard to understand: is there any personally identifiable information in there? Was this derived from a certain source or not? Because it all got mushed together.
01:21:54
Speaker
Yeah. So they were doing some sort of data provenance analysis, where they'd basically take a bunch of SQL queries and figure out how the data flowed between them. But the way they did it was, I think they had their own parsers, but then they built DataFusion logical plans.
01:22:09
Speaker
And then they did this data flow analysis over the DataFusion logical plans. That makes a lot of sense. Yeah. Which was kind of interesting, because that's a use case where they just used the front end of DataFusion. It didn't use the execution engine, at least initially.
01:22:22
Speaker
Or the SQL planner, for that matter; they were just using the logical plan stuff. And I think over time, what happened is that now that they had the whole engine in there and they had a logical plan, not only could they understand what the SQL did, they could actually run some of it, or all of it. So

Enhancing Database Performance

01:22:37
Speaker
not only could you understand it, you could maybe test it out locally or whatever. Yeah, that's a really nice way to sneak in with an analysis tool and end up with a database.
01:22:48
Speaker
Yeah. That's cool. There are some other people who tried to do something similar.
01:23:00
Speaker
There was the company behind Seafowl, I think. I'm not exactly sure what's going on with them at the moment, but they got acquired by EnterpriseDB, the folks who do an enterprise version of Postgres, for trying to build basically a special analytics engine in Postgres. Postgres is a traditional row execution engine, but...
01:23:16
Speaker
They were trying to add various features to make it faster for analytics, like doing large aggregations over large numbers of rows. Yeah, that makes sense. And the Seafowl guys had a Postgres extension that would basically use the DataFusion engine to calculate some of that stuff.
01:23:32
Speaker
Oh, okay, because Postgres is C, right? So they're mixing Rust into that. That's perfectly feasible. Yeah. I honestly don't know exactly how the machinery worked, but that was the high level...
01:23:44
Speaker
Yeah, that makes perfect sense. So what about you? What's your next thing to work on in DataFusion?
01:23:53
Speaker
Well, I've got this TPC-H data generator that I'm currently obsessing about, and the explain plans are cool. I also want to make... I have a whole bunch of things. I personally want to make the...
01:24:07
Speaker
It's super in the weeds: the actual evaluation of predicates when you're scanning Parquet files. This is the lowest level, when you're running a query against a large amount of data and you're filtering and only picking certain rows. I'm very proud of how much technology is in the DataFusion and Parquet reader stack right now. It's basically as good as anything I know of in any open source implementation, and quite a bit better than ones that are probably not open source.
01:24:34
Speaker
But it has one thing that's not yet turned on, which is called filter pushdown. I don't know how much you care, but if you're reading a bunch of columns and you apply a filter to just a few of them, right now DataFusion will decode all of them and then throw most of the data away. And there's a much fancier way: you evaluate the predicates on just the columns they require, and then only decode the other columns for the rows that pass those predicates, so you don't
01:25:11
Speaker
So, I'm just trying to form a mental model: if I were reading lots of large JSON objects, I could just dive into the specific column and see if it was worth decoding the rest of the JSON string. Yes, that's an excellent example. Oh, that's very clever. Yeah.
01:25:30
Speaker
So that feature has been in for a long time, but it's not on by default, because in certain cases it slows things down, for really low-level technical reasons that my intern wrote a very awesome blog post about. He basically has a way to fix it now. We understand how to do it; we just have to push it over the line. So I would very much like to do that, because it'll get us that much more performance, especially on some important benchmarks.
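If you want to experiment with that today, filter pushdown sits behind a configuration flag. To the best of my knowledge the option is datafusion.execution.parquet.pushdown_filters and it defaults to off, but check the configuration docs for your DataFusion version; the snippet below is a sketch under that assumption.

```rust
// Sketch: opt in to Parquet filter pushdown ("late materialization") for a
// session. The config key and setter are believed correct at the time of
// writing, but verify against your DataFusion version's configuration docs.
use datafusion::prelude::*;

fn context_with_pushdown() -> SessionContext {
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    SessionContext::new_with_config(config)
}
```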
01:26:01
Speaker
Yeah, I can see that. So in my mind, that's a major performance boost, then there's improving explain plans, and I also want to make it easier and easier for people to use DataFusion to build engines that operate on data that lives on remote object stores. You can already do that, DataFusion is great and it works, but if you want to build a system that does really well at reading stuff off a remote object store, you still have to apply quite a lot of engineering yourself to make it work really fast.
01:26:35
Speaker
And so I want to lower that barrier as well. Part of that is making it easier to use multiple runtimes, like we were talking about earlier; part of it is making it easier to understand and take advantage of how you get the maximum read throughput from a remote object store. Like, if you're reading off S3 and you have a hundred-megabyte file, say, if you make a single request,
01:27:01
Speaker
that's actually often much slower than if you made ten individual requests, each for ten megabytes, in parallel. And there are a bunch of other ways to optimize that. You've got to be careful though, because it's a genius business model, by the way. With Amazon, you can read those 100 megabytes with one request and you pay for that one request, but it's kind of slow.
01:27:21
Speaker
So if you want more bandwidth, what you can do, and in fact what they suggest, is this chunking idea where you make multiple requests. However, you also now pay for each one of those requests. So
01:27:34
Speaker
there's a trade-off there, but the bandwidth is now much higher, because each individual connection is throttled in what it can do, and you're running them in parallel. And then there's a similar thing you can do if you really care about latency, called racing reads. This may again be too much in the weeds, but if you make a request to an object store like S3,
01:27:57
Speaker
sometimes you get the data back in 100 milliseconds, and sometimes it'll inexplicably take 300 milliseconds for reasons you have no idea about. Sometimes it'll be 50 until you get the first byte. There's some distribution, probably due to which particular S3 server the request got routed to and whether the data was cached, who knows; you don't have any visibility into why. But the point is there's a distribution of latencies.
01:28:22
Speaker
So if you want to guarantee that your requests have low latency, and for some workloads latency is really, really important, one technique is you basically make the request three different times for the same part of the same file,
01:28:34
Speaker
and you pick the one that comes back first. If you really care about latency, you're paying three times as much, but it's a lot faster. Well, it typically has much lower P99 latency, so you don't have these long tails where sometimes your request is randomly really slow, because when that does happen, you would have made a second request that got to some other server that was less overloaded and got back to you faster.
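Here is a toy sketch of the racing-reads idea, nothing DataFusion-specific: issue the same range request more than once and keep whichever response arrives first. The fetch function only simulates variable latency; a real version would be an object store range GET.

```rust
// Toy sketch of "racing reads": fire the same range request twice and take the
// first response. `fetch_range` only simulates variable latency; a real version
// would issue an HTTP range GET against the object store.
use std::time::Duration;
use tokio::{select, time::sleep};

async fn fetch_range(simulated_latency_ms: u64, offset: u64, len: u64) -> Vec<u8> {
    sleep(Duration::from_millis(simulated_latency_ms)).await;
    let _ = offset; // a real implementation would read bytes [offset, offset + len)
    vec![0u8; len as usize]
}

async fn racing_read(offset: u64, len: u64) -> Vec<u8> {
    // Whichever branch completes first wins; the other request is dropped.
    select! {
        bytes = fetch_range(300, offset, len) => bytes, // the unlucky replica
        bytes = fetch_range(80, offset, len) => bytes,  // the lucky one
    }
}

#[tokio::main]
async fn main() {
    let data = racing_read(0, 1024).await;
    println!("got {} bytes back from the fastest request", data.len());
}
```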
01:29:00
Speaker
But yeah, now you're paying Amazon that much more money. So

Unexpected Uses of Data Fusion

01:29:03
Speaker
there's a bunch of stuff. Does this mean in the future we'll look at an EXPLAIN plan and it will also include dollar costs?
01:29:10
Speaker
Yeah, I think it should. DataFusion itself probably won't do that, but that's exactly the kind of thing a system built on top of DataFusion could do. It'll let you tweak the knobs.
01:29:22
Speaker
Yeah. In fact, there's a whole bunch of interesting ones, I think Dask... not Dask, Daft. A couple of these new tools are basically trading on the fact that they're super easy and good at integrating with these remote object stores, but I feel like they're going to end up having to reinvent their whole execution engine, which is a lot of work. So I think there's a lot of opportunity: if you make it easier, people can go build really cool tools without having to reinvent stuff.
01:30:00
Speaker
So I'm going to get on with that. i'm going to send you back down the Rust Mines to work on Data Fusion. yeah Andrew, thanks very much. That was fascinating. Well, thank you so much for for letting me blather for so long. All right.
01:30:11
Speaker
Cheers. Thank you, Andrew. So here's a fun thing I found out after we recorded that interview. I've been working on a database-like thing. It's not a database, but it's in that space.
01:30:24
Speaker
And I don't know if it's a research project or a hobby project or if it's got legs, we'll see. But I've been writing it for a couple of months, and early on I realized I was going to need an SQL parser. Started writing one.
01:30:37
Speaker
realized it was going to be a real-time sync. So I just went to crates.io and searched SQL parser, picked the first decent looking option. And I thought, I'll wire that in for now. And if I need to, I'll rewrite it later.
01:30:50
Speaker
Anyway, it turns out that I don't need to rewrite it. It's actually very high quality. And it's the SQL parser from the Data Fusion project. I've managed to record that whole interview not knowing I've been a user for several months.
01:31:05
Speaker
I guess that's my endorsement of Data Fusion. It's so easy to use, you won't even know you're doing it. I'm going to keep using

Conclusion and Engagement

01:31:12
Speaker
that, and I may even steal some other pieces from that project. And if you want to be in a similar boat, or if you're just curious to see some more details, you'll find links to DataFusion and all the other projects we discussed in this episode in the show notes.
01:31:26
Speaker
Before you head to the show notes, if you've enjoyed this episode, please take a moment to like it, rate it, share it with a friend, share it with a social network. If your company has said, you know what, we should write our own database, how hard can it be?
01:31:40
Speaker
Share this episode with them too. Make sure you're subscribed, because we'll be back soon. And if you want to support us being back soon for a very long time, check out our Patreon.
01:31:52
Speaker
But for now, I've been your host, Kris Jenkins. This has been Developer Voices with Andrew Lamb. Thanks for listening.