Become a Creator today!Start creating today - Share your story with the world!
Start for free
00:00:00
00:00:01
Implementing Hardware-Friendly Databases (with DuckDB co-creator, Hannes Mühleisen) image

Implementing Hardware-Friendly Databases (with DuckDB co-creator, Hannes Mühleisen)

Developer Voices
Avatar
2.7k Plays11 months ago

SQLite could do with a little competition, so when I invited the co-creator of DuckDB in to talk, I thought we'd be discussing the perils of trying to build a new in-process database engine. I quickly realised things went much deeper than just a tech refresh.

Hannes Mühleisen joins me this week to blend his academic credentials as a database researcher with his vehement need to make that research practical. And so we dive into what modern database literature has to say on making queries faster, more parallelizable, and closer to the metal, and how it all comes together in a user-friendly package that’s found its way into my day-to-day workload, and might well help out yours.

If you’re curious about the gory details of database queries, how they can take advantage of modern hardware, or how all that research actually turns into a useful tool, Hannes has some great answers.

--

DuckDB: https://duckdb.org/

Database Systems Book: http://infolab.stanford.edu/~ullman/dscb.html

Kris’ first computer: https://en.wikipedia.org/wiki/File:ZX_Spectrum_Plus2_(retouched).jpg

Volcano Query Evaluation System [pdf]: https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf

Morsel Query Engine [pdf]: https://cs.brown.edu/~kayhan/papers/morsel_cp.pdf

Unnesting Arbitrary Queries [pdf]: https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf

Papers Hannes' team have published: https://duckdb.org/why_duckdb#peer-reviewed-papers-and-thesis-works

DuckDB on Mastodon: https://mastodon.social/@duckdb

Kris on Twitter: https://twitter.com/krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Kris on Mastodon: https://mastodon.social/@krisajenkins

--

#softwaredevelopment #podcast #programming #database #duckdb #sql #sqlite

Recommended
Transcript

Navigating Early Career Challenges in SQL

00:00:00
Speaker
In my very first professional programming job, I had to write a financial report, the heart of which was an SQL query. So I worked and I got it working and I was very proud of myself until we ran it against the staging dataset and it ran like an absolute slug.
00:00:19
Speaker
And then a small senior colleague of mine said, put an index on it. And I did. And it was fast. Brilliant. I now knew how to make databases run fast. You throw indexes at them. So I started throwing indexes at everything. And wouldn't you know it, the staging database starts slowly grinding to a halt. Perhaps you've already diagnosed why.
00:00:40
Speaker
But that was my first real-world experience of if you actually want things to work properly, you have to understand a layer beneath. You have to understand something of how things work under the hood. So this week's kind of fun because we're meeting someone sort of coming in the other direction.

Introducing Hannes Mollison and DuckDB

00:00:58
Speaker
I'm joined by Hannes Mollison. He is a professor and a database researcher.
00:01:03
Speaker
But as you're about to hear, he's absolutely not content to remain studying theory. He's come out of that ivory tower and he's built one of my favorite new discoveries of late, DuckDB. Came recommended by a few colleagues of mine and I've been really liking it. It's a local, convenient analytics database. It's fast. I like it enough that I should tell you this episode isn't sponsored by them. I just thought it was good. So I wanted to peel back the covers and understand something of how it works.
00:01:33
Speaker
and it turns out to be quite a treat. And Hannes is very good at explaining it. If you're curious about how a database can chew through a billion rows in a few seconds, or how you parallelize queries across multiple cores,

Hannes' Journey from Early Computing to a PhD

00:01:48
Speaker
when multiple cores seem like the only way that computers are going to get faster anymore, Hannes is your man. He's an excellent teacher. So let's go and hear from him. I'm your host, Chris Jenkins. This is Developer Voices. And today's voice is Hannes Mullison.
00:02:16
Speaker
My guest today is Hannes Mullison. Hannes, how are you doing? Very good, how are you? I'm very well, I'm very well. You're going to take me all the way down into bits and bytes today. It's my favorite thing to talk about, like these bits. I don't know what's so great about them, but somehow they have kept me occupied for a while. They've been drawing us down since at least the 50s, right?
00:02:40
Speaker
But I want to get into your background first, because I have to understand why it is you wanted to create what you created. My background is very interesting. Well, I guess not. It's a super vanilla timeline. At some point in my life, I found computers that were like
00:03:01
Speaker
14 or something like that. And it's a fun story that I want to tell. So I had a computer that my dad sold me. He's a good businessman.
00:03:12
Speaker
Um, it was discarded little computer and, uh, it had a buyer's password. If you remember those sets on it. Yeah. Yeah. And he didn't know anything about computers and I didn't know anything about computers, but I just had spent all my money on this thing that had this, like you turn it on and it showed you this password prompt, right? Your dad sold you a brick. He did. He did. And he said, I'm telling you, he's a good businessman. And then, and then it's like, yeah. Okay. What's this password prompt? And there was no internet that people today cannot, you know,
00:03:44
Speaker
conceptualize this anymore, but it's like there was no internet. So you can't just Google how to reset the BIOS password. You couldn't just go to the library. They didn't know either, right? Yeah. So at some point, you know, you've found somebody who knew somebody who knew somebody who knew somebody to tell you, like, take the damn battery out. And, um, and I think that's, that's kind of, I don't know. I did this kind of way of thinking about problems.
00:04:08
Speaker
And that they're just like a challenge and they're just there to sort of, you know, to be overcome and not something that you could, that you have to give up on. Like maybe it was lucky that the first one was actually solvable, but. Yeah. Yeah. So your dad actually sold you a puzzle box and the career at once. Kind of. Yeah. And so then when, you know, I started programming like many kids with PHP and MySQL.
00:04:31
Speaker
I was one of the founding members of my hometowns, my sequel user group. I had a license plate with SQL on it because I lived in Stuttgart and German license plate started with S. Hit my kind of gig. It's very bad. And then I needed to study some things. I was like, all right.
00:04:54
Speaker
I was looking to forestry for a while, looking for a naval career, but then was like, fine, I'll study computer science. And yeah, so I did that and then liked it. Went to Berlin, got my PhD in computer science. What's your PhD? My PhD is about distributed career processing, which is kind of the anti-timer of what we're going to talk about later, which is
00:05:21
Speaker
It's sometimes interesting to see how careers kind of go. So then I came to Amsterdam after my PhD, bit of an accident kind of, because I didn't come here for like career reasons. I came here because I had a girlfriend here. And then I was like, okay, so I better find a job. So I looked around and it turned out that there was one lab, the Dutch National Research Lab for computer science and mathematics.
00:05:47
Speaker
that was hiring in a database group and I thought, I wasn't in my SQL user group. I could do this. I could query. Of course, in a gigantic over interpretation of my own knowledge. Anyways, that turned out this group was actually very good at data systems that
00:06:09
Speaker
coming up with new ways to construct data systems. And so join us at Postdoc, stay like a tenure track, the whole thing. Yeah, and ended up to be a place where very clever people have been thinking about data systems construction for quite a long time, for over like 20, 30 years. And I learned a lot.
00:06:32
Speaker
about the fundamentals of how we do querying? Yes. Although in CS, you always have to be careful when somebody says fundamentals, because they often they mean that they want to hurt you with Greek symbols. Okay, I am not I am not about the Greek symbols. I want to make this absolutely clear. I'm a painfully practical person. And
00:06:56
Speaker
It's

Real-World Relevance in System Development

00:06:57
Speaker
so practical that I always wonder what I'm actually doing in academia. I was wondering a lot what I was actually doing in academia. People later told me, we're not here to solve problems. We're here to talk about problems. That's rather cynical. So anyway, you said fundamentals. That's absolutely right. We learn a lot about the fundamentals.
00:07:24
Speaker
but not in the way that some people might think fundamentals as in formulas and proofs and these kinds of things. Now, here's a modern computer works, and here's how you have to sort of hold this computer funny so it actually solves data problems efficiently. So that was really all that was about, is fundamentals. Yes, it was about query processing.
00:07:43
Speaker
you know, memory pattern, access patterns, that sort of thing. We can, we will probably go over to that more later. Yeah, I hope so. Yeah. And then, okay. Now that was the academic, then I got, I didn't become a professor. I am still a professor.
00:07:55
Speaker
Yeah, officially on data engineering at the University of Nijmegen. Most are still affiliated with the lab, the Database Architectures Research Group at the Research Institute, but these days I spend a lot of my time at DuckDB Labs, the company that we spun off at some point from the Institute around DuckDB Project that we started there. Take me through that. How do you go from being a professor to being an OLAP database writer?
00:08:24
Speaker
Practical. Oh, that data, because that's practical between professor and... I mean, professor-at-programs is almost unheard of, okay? Again, you can call me cynical again if you want. And it's only been like five minutes of podcast, but professor-at-programs is pretty uncommon. Yeah, so how does this happen? Well, I think in data systems, it's
00:08:48
Speaker
You can write papers, but it's kind of understood that a lot of your impact will derive, or have you actually created any systems that are sort of relevant in the world out there? Have you created software? I think it's unusual. It's very unusual. The writing of software is just, you know... Yeah, it's not respected a lot. It's the turning of the handle that gets you to the paper.
00:09:13
Speaker
Yeah, it's true. It's not respected a lot. And it's also the quality of software producing academia is also typically very bad as a result, because it's just as you said, it's a means to an end. You want to get your nature paper is what you're going for.
00:09:29
Speaker
the R script that analyzes the result from your telescope is like whatever. But now, actually, in CS, especially in practical computer science, especially in data systems, it is actually the case that the people that have gotten the highest honors in our field, we have two Turing Award winners in data management systems, it's Jim Gray and Michael Stonebreaker. They both have created actual systems.
00:10:00
Speaker
Do you know them? Postgres. Stonebreaker was one of the people behind the creation of Postgres at Berkeley way back then. They have gotten these awards, not for their Greek symbol fighting abilities, but more for they have actually built something. I think in our field, and there may be some adjacent fields like operating systems that behave similarly.
00:10:28
Speaker
If you're an academic in operating systems, it does help if you have made some Linux kernel patches that turn out to be good ideas. I think it helps. If you're in security, making tools is always going to get you some credibility. There's some adjacent fields, but it's not very common, I agree.
00:10:48
Speaker
Yeah, it was definitely acceptable in this subfield to write, to spend time writing software. We may have pushed this to new heights. I'm not entirely sure. But so when you come up with a new idea, you can write a paper about this new idea. Yes, you can build a prototype to show that your idea is a good idea. Sure. But people are only really going to take this seriously if this has proven itself somehow in the market.
00:11:13
Speaker
Okay. So unusual to get in academia, you need to push yourself out into the marketplace. It is it is a it is a I think there has also historically been a lot of sort of exchange between the academic teams and the teams in the industry. Like, you know, like you have in AI as well, like there's a ton of people doing cutting edge AI research at Google. They're not technically academic academic people, but
00:11:40
Speaker
are the same sort of level. I think data management systems, we also have had this for a while, that the teams at Microsoft that built SQL Server, the teams at Oracle that built Oracle, teams at IBM that built DB2, they had a very deep sort of knowledge in these systems. And the academics would frequently come from these teams, go to these teams, do the particles there, go back and forth. So I think there has always been a pretty good
00:12:06
Speaker
exchange. But if you want industry people to take you seriously, you have to show that this can work in the marketplace. And it is a gigantic market. I think people underestimate this. So databases are sort of a trillion dollar market or something like that, right? I didn't know that, but I could believe it. So if there is such a huge amount of money flowing or flying around in this world,
00:12:31
Speaker
You can see how if you want to convince somebody that something's good idea, you can show them, OK, this is actually better. It's going to make you more money. Anyways, there's always been a bit of connection. So the reason I think we started programming so much in the institute was that we were sick and tired of the quality of research prototypes. And we were like, OK, we actually won't have impact with what we're doing. And so when we started this DactyB project, we were like, you know what? We'll do it.
00:13:00
Speaker
for real this time. We will have CI on the third commit or something like that. We'll have testing in place. We will have coverage in place. We have all these software engineering tools that are pretty standard in the industry, but unheard of in academic systems, we will have all that. We did.
00:13:24
Speaker
It's also very common academic systems to write the prototype until it can run the benchmarks, pretty common. Why would you add functionality beyond the benchmark when all you want is the paper? To prove your idea, yeah. Right. But we said, no, no, no, no. This needs to be able to actually run general purpose SQL queries. But in case I haven't said this yet, DuctDB is a SQL system. So it needs to understand the SQL language, the one that was on my license plate. The problem with SQL is that
00:13:54
Speaker
It is such a, I mean, honestly, I was totally blown away by this because I thought, I know a sequel. I mean, I've seen some sequel. How hard could it be to you? I'm like famous last words. It's like, thank you Jeremy Clarkson for this meme of how hard can it be? And it's like,
00:14:17
Speaker
I was totally amazed. I only really learned SQL when we had to start implementing an engine that interprets SQL because people kept throwing queries. So that's when we were like, who? This is allowed. And they were like, yeah, it's in the standard. We were like, oh, OK. Yeah, I can believe that. It's like you kind of think maybe it's got 20 or 40 keywords, but I bet it's a lot more and a lot more.
00:14:36
Speaker
Like, recently, I learned something about, from another person that builds database systems, I learned something about that there is something called accept all. Like, in SQL, we have union. And people know that there's a difference between union, you and all. Yes, you have heard of this difference. It turns out the same. There's also an accept statement in SQL, the set semantics. But there's also an accept all. And I didn't know about the existence of accept all until this other person told me about it. But sure enough, it's in the standard, has been there since, you know,
00:15:05
Speaker
Way back when, so we then, you know, we were like, fine, we'll have to implement that. So we spent a lot of time implementing this. We wouldn't have had this ambition of implementing the whole of SQL if we had just been out for a paper, right? Yeah, yeah, yeah. It only makes sense because we had been out to actually make a system that people use. And if you want to do that, well... You kind of got to support as

User-Friendly Data Management Systems

00:15:24
Speaker
well. There's no shortcuts, really. You just have to get through it. But did you...
00:15:30
Speaker
Okay, so you're there in academia wanting to prove something. Did you start from, this is our topic of research, let's turn this into a product. Or did you say, this is what's missing in the world? So let's research that. Yeah, I think the latter, we talked to people and, you know, data practitioners. And we found out that this was absolutely missing in the world. And what is this? We found that
00:15:56
Speaker
Nobody had really considered sort of the ergonomics of using data management system is always very complicated to set up. Like it was, I don't know if you've ever tried to install Postgres or something, but you find yourself editing like some arcane config files and rebooting services and installing services.
00:16:15
Speaker
creating database files. It's not a very pleasant process, creating users, whatever. So one of the core ideas that led to Jacktibi was to say, look, there's a lot of stuff that happens that actually makes it very hard to use these systems. As I said, you have to do all this installation, maintenance, blah, blah. Nobody has really thought about data management systems with this
00:16:40
Speaker
user sort of angle like user in this case is like a programmer often or as an analyst, but we had really thought about them in like a with like a what's the maximum simplicity that you can get here perspective. Also, nobody had really thought about data management systems where getting data in and out of them is like a first principle like that has to be fast.
00:17:03
Speaker
Yeah, right. Yeah, I definitely dealt with plenty of systems where you get the database up and running. And then the big job is to figure out how to turn it into a schema and some insert statements. I had to see is getting the CSV reader to work like I have tried to see as viewers of I don't know, 50 database systems in my life. They were all terrible, right? Yeah. So so we realized that a lot of the initial sort of interaction people do with data systems is a setup.
00:17:29
Speaker
And B, data import and export. So let's maybe make that pleasant. That sounds like a good idea, right? Because this is like the one time you have to make a good impression. It's like in these initial steps when the people don't know a lot about your system. They've just started. They just want to play around. They just want to get a job done. So we really designed this thing to be useful. So that was our academic angle. He's like, look, we need to reimagine these systems for usability.
00:17:59
Speaker
How do you mean that's an academic angle? Because that seems more like a product industrial angle, UX. Yeah, this is true. It can be seen like that. But if you think about it, we actually wrote a paper about this. The question is, if you start from these user interaction principles, like what I said needs to be easy, needs to have good efficiency for import and export.
00:18:28
Speaker
You can have an academic discussion on what does this mean for the construction of the rest of the system.
00:18:35
Speaker
So how does this impact our canonical textbook architecture? And for example, there's one gigantic way in which it impacts the textbook architecture. The textbook architecture, since 1984 or so, is this client server model, the two-tier system where you have database client, database server, funky line between them. That's a textbook thing. But we realized that in order to meet these user
00:19:02
Speaker
interaction and data import efficiency goals, we need to actually go to a different model, which is the in-process model, which is similar to what SQLite does. If you're aware of

DuckDB's In-Process Architecture

00:19:14
Speaker
what SQLite works, generally,
00:19:16
Speaker
Yeah, I see database as a library, you link it to your process, the thing runs in the same process, because now you're in the same process as your application. Well, that has a bunch of implications in terms of system design again, right? So for example, you need to somehow cooperate with this application process. You can't just assume that this computer is yours, like typical data management systems do. You have to kind of cooperate, you have to be very sort of careful in using resources, you have to be careful in
00:19:44
Speaker
how you crash. Traditional database systems handled fatal problems by just exiting, right? If you're an in-process system, you can't do that because you bring down the host with you. So now you have to reimagine some fundamental properties of how we do, for example, transaction control and persistence under the premise that we cannot just exit when we don't like something. Yeah, it's not your environment to play with. The environment. Another super interesting aspect there is like,
00:20:13
Speaker
If databases are used to being coddled, like database systems, like Oracle is used to be run on very, very expensive hardware with some sort of staff of nurses sitting next to it here. They're called DBAs, but it's the same thing. They have memory correction in memory, they have redundant power supplies, they have
00:20:36
Speaker
die in RAID arrays to deal with hard disks failing and all that stuff. So they can make certain assumptions about their hardware. DuckDB is made to run on everyone's computer, their phone. The simplicity and the way it's designed lends itself to run everywhere on everyone's laptop, stuff like that. That's a very different environment. One of my favorite environments is the Brazilian Windows laptops. Nothing against the Brazilians, but they happen to occupy a warm place.
00:21:10
Speaker
And now you have a laptop that's maybe a bit older that's sitting in tropical heat.
00:21:18
Speaker
And then now you're trying to run like a data, like serious sort of data crunching, that thing that's going to raise the temperature even further. The chances that your RAM starts like doing something funky goes up quite a lot. Right? Yeah. So there we had some issues where we're like, dude, your computer is broken. I can see from the bug report that your computer is just plain broken. But then, of course, the next question is, again, like, OK, maybe how can we cleanly deal with this? So DuckDB now has a bunch of sort of self checks
00:21:47
Speaker
that, for example, make sure that the hardware is doing something meaningful.
00:21:52
Speaker
That's one. So you can have an academic, you definitely can have an academic discussion because like these, indeed these UX kind of aspects, like you said, they are more traditionally more in the product space. But if you really think what that means for the rest of like this architecture that is so well known or so, and this follows these trodden paths where it's like, you open like the thick book about databases, I can actually see it from my chair, the thick book.
00:22:23
Speaker
And it tells you how to implement the database system. Like page 743 will tell you, this is how your transaction manager works. Which book is that? What's it called? It's called Database Systems. I think it's the more leaner, yeah. Leaner is the author. It's a textbook. University students have to read it. Poor university students. Yeah, yeah.
00:22:44
Speaker
Okay. So usability being both an important and academic concern, but let's talk about what DuckDB does technically and how it does it. I want to get down to this. Yeah. Okay. So we're running SQL queries. Yes. If you strip away all the user interface stuff, our interface is like our, what we get is a SQL query. And now we have to come up with something called a query execution plan.
00:23:11
Speaker
which is sort of a more sort of mechanistic description of the steps that we should do to compute the result. Because SQL is a declarative language, it doesn't actually tell you how to compute a result, it just once tells you what you want. It's the task of the SQL engine, in our case, DuckDB, to come up with an execution plan to do that as efficiently as possible. But there's some aspects to this.
00:23:34
Speaker
So the first kind of steps you do is you try to bind the query to the schemas that exist and the types. And so then, you know, the types, you know, the tables actually exist. That's a good thing. Yeah.
00:23:47
Speaker
And then you run optimizers. There's some static optimization that always are good ideas that you then run. You run a symbolic optimizer that these things usually are shaped like trees. Like you have operators stacked on operators. And they can have two inputs or one input or multiple. Like a join will have two inputs. An aggregation will have one input, these kind of things. These are all plugged together to form your query. Optimizers will operate, and then we'll say, aha.
00:24:13
Speaker
You're joining these two gigantic tables, and then you're filtering a lot of the data, the resulting data out again on top here. Can I maybe move these filters into these join, into the inputs of these joins so that I don't have to create this gigantic thing only to throw it away later? Yeah, so how are they rewriting? Yes, these are rewrites that are like in this case of projection push-downs, what we call this, or selection push-down.
00:24:40
Speaker
These are always good ideas. So you have static rewrite rules. You also have dynamic rewrite rules that depend, for example, on the cardinality, the amount of rows in tables. Now you have, for example, you have a join. Say we want to join two tables together. Typical algorithm to do is to do the hash join, where you build a hash table on one side of the join input, and then you probe the other side of the join against that hash table. Now, cardinality,
00:25:09
Speaker
of your tables tells you, OK, which table is the bigger one? And it makes sense to build your hash table on this smaller one, because the hash table has to sit in memory. If you build a hash table on a 10 billion row table and then probe with 10 rows, that's very, very inefficient. The other way around is very efficient. So these are dynamic rules that then depend on properties of the actual data.
00:25:31
Speaker
Maybe they depend also on statistics, like Dr. B or many other systems will know, for example, what the minimum maximum value in every column is, these kinds of things, right? So then you have your optimized plan. That's pretty standard. The Postgres does something very similar. I think any database system worth anything has an optimizer that does
00:25:50
Speaker
these things, and then you get into the realm of the execution. Now you optimize your query plan. At this point, we tend to call it a logical plan. That thing gets transformed into a so-called physical plan, which then says, okay, now, instead of having a join, I'm going to pick a hash join. Instead of doing an aggregate, I will do a hash aggregate, these kind of things, you pick an implementation.
00:26:16
Speaker
You have a physical plan, you go to execution. That's where it really starts getting interesting. Now, I should also say it's very important, for example, that these executions are parallel.
00:26:30
Speaker
Like that we use a duct to be like other database systems also in the space of doing data analysis on large amount of rows. It's an analytical system. It needs to paralyze queries. We can't avoid it. It used to be optional, like 20 years ago, you could get maybe away with the database engine that was single threaded because most computers did have one core. Now my stupid MacBook has like 10 cores.
00:26:58
Speaker
And you need to use all of them to get in. I mean, they're amazing cores, yes, compared to the one core from 20 years ago. They've gone... They are crazy. Yeah, I was blown away when this thing happened. Doesn't matter. We have to paralyze over these 10 cores. Okay, so then... Out of curiosity, where does that decision happen? Is it as you're finished optimizing and you're building the logical plan? Yeah, no, it actually inducted me. It happens during execution. I will get to that.
00:27:26
Speaker
Okay. So now we need to actually execute this plan in parallel, okay?

Optimizing SQL Engines and Execution Plans

00:27:32
Speaker
Interesting question, like how do you automatically parallelize it to a incomplete language? It's not obviously trivial, right? Okay, so these are all things that have to happen and then execution will, you know, compute things and come to the result. Now, how does this work?
00:27:48
Speaker
Well, the general approach is that you first separate your query into so-called pipelines. So if your query plan is a tree, now you look into pipelines. What are pipelines? Pipelines are things that can happen without so-called pipeline breakers. What are pipeline breakers? Those are operators where the entire intermediate result has to be assembled before things can continue. For example,
00:28:18
Speaker
Say I'm aggregating. I'm doing like a count star group by x or something like that in SQL. So now this operator can only start producing output once the entire input has been read. Because, OK, say I pick one group to output first. There might be relevant data for that group in the very last input row that I read.
00:28:42
Speaker
Yeah, yeah. Or sorting, I assume. You can't output anything until you know which the smallest row is. Absolutely. Sorting is a great example as well. There's a couple of those join hash table build. The hash table build of the join is also one of these things you have to read. So these are called pipeline breakers. So every time we encounter a pipeline breaker, we split the query plan up into the so-called pipelines. And pipelines means that's an
00:29:06
Speaker
It's a bunch of operators that can read, that can run from a so-called source to a so-called sync, in a streaming way. It's streaming in the heart of the idea. They're streaming. There has to be, because you can't just... Okay, naively, you could say, why would you just run one operator at a time? You take the input, you produce the output, and if the next operator runs and so on and so forth. That's like what Pandas does, for example.
00:29:37
Speaker
if you know pandas like this Python tool to wrangle data run. And that's a terrible idea because the materialization between operators that could actually stream data through is a terrible idea because it will have to materialize this stuff in memory somewhere, which creates memory traffic between your CPU and memory. And that will take a lot of time and you might run out of memory and it's generally not pleasant. Whereas you could have stayed all in the CPU, is that what you're saying?
00:30:05
Speaker
Whereas you could have just streamed a chunk of data to multiple operators at the same time, and that indeed would be staying in the CPU cache much more likely. If you look at something like Postgres, they go to the other extreme. They do a row-by-row thing. So Postgres reads a row of input, applies the next operator, applies the next operator, applies the next operator, and then produces output or not. SQLite does the same thing. And that's really great for systems that
00:30:33
Speaker
deal with small sort of result sets or small input, like the transactional systems tend to do, like update an order. You don't really look at 10 billion rows, right? But for analysis, it's like switching between operators, switching across the types that could potentially exist in the query actually creates a lot of overhead. So therefore, we have to find some middle ground. So what Ductibee actually does is implement a so-called vectorized query execution engine. So what does this mean? We have our pipelines that are streaming.
00:31:03
Speaker
It can't be multiple operators in a pipeline like this, like a filter, some window functions could run in a streaming way. And projections can run in a streaming way, these kind of things, right? And now, we don't do the full materialization, which would be like running one operator at a time and writing stuff with memory every single time. We don't do the post-dressing and reading one single row every single time. There's something in the middle. We take, in DuckDB's case, we take like 2,048 rows.
00:31:32
Speaker
Just as like experimentally found to be a good idea to have 2048 rows. It used to be 1,024. Eventually, as I will explain why this number exists. And then you basically take like these vectors, we call them vectors like subsets of these intermediate results, and then we stream them through the operators. And it's important to know that this is a column first representation.
00:32:00
Speaker
We don't have single rows that we've shoved through the operators. We have little chunks of columns that we shoved through the operators. That has big advantages because we can then have a columnar sort of processing, which allows us to be more efficient in terms of branch prediction, which is the CPU basically trying to figure out where your code is jumping next. And it uses a statistical model for that. And if you do the same thing all over again, the model gets better.
00:32:28
Speaker
And so doing, so this is actually, this is the reason why columnar representations are better in this, because the branch predictor will be better at predicting where you're going next as your amount of sort of stupid repetition increases. Right. Yeah. Yeah. That makes sense. And if you had, if you did this row wise and you had to do something else for every, every sort of field, because the fields are different types and different things happens to them, then the branch predictor goes like, I don't know.
00:32:57
Speaker
You seem to be doing 10 different things at once. Yeah, exactly. A lot of things happen to you. I have no idea where you're going. So this is why doing this in a columnar way is better because the brushboard tip will tell you.
00:33:07
Speaker
going there. And it will be right. And if it's right, it will also do speculative execution. So the branch will actually say, based on my amazing statistics, I'm fairly sure that you're going to go this branch. So I'm actually going to already schedule this branch for execution in my CPU. And only if I realized that I was wrong in my prediction, I will abandon all that effort and actually do the other thing.
00:33:29
Speaker
Right. So that's actually quite a powerful thing. So this is why we have vectorized execution. We want to be one reason why we have vectorized execution. We want to be good on this brand prediction. But the other thing, and that you mentioned this already, is that by controlling the amount of like the size of the intermediates, like the size of whatever intermediate results goes between these different operators that are in a pipeline, we can make the query execution stay in the CPU cache for longer.

Vectorized Execution vs. Just-in-Time Compilation

00:33:59
Speaker
And ideally forever. And that's where the 2048 comes from, right? Because the CPU caches in modern CPUs are actually quite big. They don't sound big. They're like, oh, it's 32 megabytes. But 32 megabytes is quite a lot of storage. Yeah, that used to be a whole computer. That used to be, yes. I mean, my first computer had eight megabytes of RAM.
00:34:20
Speaker
I think my first had 512 kilobytes. There you go. And now that's cache, right? That's like on-die CPU cache. And you have 32 megabytes of them. OK. But point being, by controlling the size of the intermediate with like this constant like 2048, we can keep this stuff in the CPU cache. And why is that important? It's because of the memory hierarchy.
00:34:49
Speaker
The storage has this whole hierarchy. We have the cache sitting on top. Actually, on very top, it's a CPU register. That's the fastest thing. It's one cycle to access. Very nice. And then you have the caches, and then you have main memory, and then you have storage, and you have tape, and you have internet or whatever. That tape's still a thing.
00:35:06
Speaker
Really? Yeah, it seems medium to do backups on. It's really like tape costs nothing, right? Okay. It's just tape. I haven't seen it in a while, but I can believe. No, I think if you can actually, the cloud still has it. If you use Amazon Glacier, yes, it will be on tapes. Oh, I didn't know that was tape. Yeah. That's why it takes them four hours to get your data back because it's like, hang on.
00:35:33
Speaker
Anyways, we want to be up in the storage area, ideally in a register, but we do data processing, we can't be in a register, but we can be in cache. Right. And by staying in cache, we can get like a 100X kind of speed improvement on that compared to if we had to go back to forth to memory all the time. It's actually something that was invented here in Amsterdam at the CWI.
00:35:54
Speaker
The institute that I used to work at, I told you they knew something about things. They invented this whole vectorized query execution paradigm, specifically colleagues of mine, Peter Bons, Marcin Sokowski, and Niels Ness. It's very interesting because Marcin is now one of the guys that founded Snowflake. It's like there's some interesting connections in that world. Peter is also a professor of data systems here, and he's also
00:36:20
Speaker
It's a very small world, these data systems. And the Amsterdam tech hub investing in people. Yes, well, investing, yeah, it's a government, right? It's a government agency that does this. But yeah, anyways, the vectorization, that's really a key ingredient. And I should note that there's a competing approach.
00:36:39
Speaker
Oh, OK. Tell me about that. It's a competing approach. So you can do vectorization, which is what I just explained. Or you can actually do something called JIT, just in kind compilation. So there's also a bunch of SQL engines out there that basically ship a gigantic C compiler with them.
00:36:57
Speaker
I'm not kidding. They just embed a little VM done. And then you could also try to convert your SQL query to a C program, assembly program, throw it into a compiler, and then run the result. That's a competing approach. Then you also don't get
00:37:16
Speaker
a lot of interpretation overhead, like you do from looking at the types and operations every time for every row that you would otherwise get. But with the jitting, you can also basically fix that by just saying, okay, we'll just create a binary for this particular query. We'll compile a binary for one specific query. Is that then a favorable approach when you tend to run the same query over and over?
00:37:43
Speaker
Well, yes, that definitely helps. It does help for transactional use cases where you have a lot of prepared statements. It also helps if you have very, very long expression chains, things like that. The big downside, however, for jitted engines is that they have to ship the gigantic compiler with them. And actually, that was one of the things that followed from our first principles that we can't do that. Because, hey, look, the sting thing is to be easy to use and easy to install.
00:38:14
Speaker
Part of that is the binary can't be like two gigabytes, but... And can't depend on you having a certain toolchain available. It can't depend on you having LLVM with exactly the right version, by the way, installed, because... Yeah. So therefore, we couldn't use gitting. That's one of the things that followed from our sort of high level of sort of assessment of...
00:38:41
Speaker
what should this thing do? How should it behave? It's like, I can't use JIT because it's not going to be economical. It's not going to be portable. It's not going to be nice. And also, there is a research paper written by the proponents of JITing and vectorization that have essentially come together, wrote a paper proving that the approaches are essentially equivalent. So, hey. Allowing for time, presumably. I'm sorry? Allowing for time.
00:39:09
Speaker
No, at a time. Also equivalent as in like execution performance. The vectorization gets to the same sort of speeds if done properly. So we have this vectorized thing, which is really cool. It does make it a bit more complicated to write these operators because now instead of like
00:39:38
Speaker
writing like a projection where you do an addition of two columns like A plus B. If you had a simple old-school system with rows, it would be pretty obvious how to write that. You have a loop, and then it goes left plus right, left plus right, left plus right, and so on and so forth. If you want to do this in a vectorized way, it's like, oh, now all my math operators actually operate on these arrays of data.
00:40:06
Speaker
has to be like code being generated with templating that expands all these, you know, things into like loops and there's the, you have to do like null handling and you have to do like all these things. So you get a bit more complexity in the operators themselves. Because you're constantly dealing with a batch of 200 and 2000. Yeah, you're constantly dealing with these batches, right? Also, like if you do like an aggregation,
00:40:30
Speaker
you are aggregating 2,048 values at the same time. Sometimes it takes a bit of mental gymnastics to do this well or to wrap your head around how the operator actually has to be implemented in order to be efficient when dealing with a lot of values at the same time. Or like an index lookup.
00:40:48
Speaker
Traditionally, you did an index loop up where you have one value, you walk it down your B-tree, there's your thing. If you have 2,048 search values, you're like, how do I do traverse now? It has some... And there must be... I mean, you could just do it 2,048 times, but there must be more efficient ways to walk down a tree.
00:41:09
Speaker
There is, yes, you could have 2,048 pointers in the same tree because then you kind of try to reuse the same IO. But of course, also problematic, you can't have 2,048 outstanding IO requests, blah, blah, blah. Complications. But hey, people have PhDs for a reason, right? We have done nothing else in the last 10 years than to think about that. So that's fun. So that's the vectorization stuff. Yeah, it does slightly complicate your operators.
00:41:40
Speaker
From a user's perspective, it's great because stuff stays in cache, so it's really, really fast. You, as a user, you don't really see that it's a vectorized system anyway. You write your SQL query and poof out comes a result. It's a very sort of detail on how the implementation works. But it is the thing that DuctDB does. And it's kind of cool. Because I think in DuctDB, one of the unique things that we combine, like this crazy academic sort of knowledge on how data systems should be built,
00:42:09
Speaker
all wrapped into a package that's, like, try to be friendly. And that's, I think, something that I don't think has been done before. Like, as in, like, the state-of-the-art stuff tends to be, like, a bit clunky to use. And the stuff that's easy to use, I don't want to name names, but there are some data management systems out there that try to be very easy to use, but then have serious issues in, sort of, the internals and how they deal with things like persistence, right?
00:42:35
Speaker
because they compromised all of that away. So we kind of have both, right? We have this crazy core that uses really the state of the art in query processing together with this friendly interface. But I want to talk about something else. I want to talk about parallelization. Because that's also really, really, really interesting.
00:43:01
Speaker
There is an interesting paper from one of the greats of data management systems research, Goetz Greff. I don't know if you have heard of him. He's German like many database people. I don't know exactly what it is about tables, but somehow it appeals to the Germans. No comment.
00:43:27
Speaker
It's really uncanny how many Germans are. And it's like, you go to the US, you know, it's like, oh, I was a German person. Haha. Not only, of course. Don't get me wrong. There's many, many, many different sort of people working together. It's just an uncanny amount of Germans and there are more than there should be.
00:43:46
Speaker
Anyways, so Gudskidev, one of the greats, he wrote a paper in like the 90s or something like that about, it's called Volcanoes, the Volcano Query Processing or something. Volcanoes in the title, if you will find it if you're good for that. And in this paper, he also described how to do parallelization.
00:44:07
Speaker
And so that's actually, as I said, it's a difficult problem. You come up, you have your query plan, and now you want to run this on many CPUs. What do you do? Well, Dr. Greff, he came up with this idea of the exchange operator-based parallelism. And at the time, it was very clever because it's a way of doing parallelism,
00:44:35
Speaker
that doesn't impact the other operator's implementation at all, which is great, right? If you want to get something done, it's great if you don't have to throw your rest of your system away.
00:44:49
Speaker
It's always good if you want to get something through, like, here's this cool idea, and by the way, it works with what we have. So how does this work? Okay, yeah. Exchange, by the way, Ductubee is not using that. It's just what everybody else is using. And I'm going to say why it's bad in a second. So exchange-based parallelism works like this. Say we have a filter, and we want to parallelize the filter.
00:45:15
Speaker
OK, so now we insert this special operator under the filter that splits up the incoming data stream into, let's say, eight data streams.
00:45:28
Speaker
Yeah. Okay. Into eight data streams. You can kind of hang on. I have too many fingers. This is eight. Yes. Yeah. And now you run this filter on every one of these eight data streams in parallel. Haha. Yeah. And then you set another, put another operator on top that collects all the output of those filters back into a single data stream.
00:45:49
Speaker
And now you have successfully paralyzed this filter, okay? This sounds like MapReduce. Aha, yes. I mean, MapReduce came much later, of course. So, who got this from whom? It's really clear in this case. Point being, okay, this works really great for filter, yes? What about an aggregation now?
00:46:11
Speaker
How does this work? Because obviously, I can't just use my little distribution operator to have eight input streams, run my aggregation on all of these eight, and then just glue together the result again, right? You can if it's some, it's a little trickier if it's average, that kind of thing. Yes. You can do that, but then you have to actually reinterpret the results. You have to then remap the groups together after all these operators.
00:46:37
Speaker
Right. And there can be millions. So then you've spent a lot of time on the remapping. So that's really not the point. Right. So what this exchange-based operator parallelism does is it introduces a hash partitioning on the group key in the distribution operator saying, aha, if I hash the group key in the distribution phase already, and I do a radix partitioning based on bits in the hash,
00:47:01
Speaker
between all these data streams that I'm creating in my distribution operator, then do the aggregation, and then I can be sure that I can just glue all these results together again, and the result will be semantically correct, because, you know, yeah, we can prove that only data that matches like from this hash partition went into this one group by operator,
00:47:26
Speaker
So you're saying if I'm doing select average of something group by user, you're going to hash the user ID and chuck the same user ID. Yes, and split it up like that. Exactly. OK, yeah, that makes sense. So that's how it works, right? It's also how the joins work. They will say, aha, we'll split both the build and the probe side of the join using the same hash partitioning, and then we can run this in parallel, and then we can do the recombination, and everyone's happy.
00:47:53
Speaker
Well, there's some issues with this. Can you spot any issues already? I would have thought, if you're trying to split it up into eight chunks, going to disk.
00:48:06
Speaker
then you've got a big problem. You're basically still, you're reading the disk. It's going to be hard to parallelize pulling out the different bits of disks and then re-partitioning them into different groups, right? Yes, but it doesn't have to go to disk. This is like a stream thing, right? Like so, as the data on the fly, it can do this partitioning. Okay, so it's, okay, streaming batches. It doesn't have, yeah, it's streaming. Okay, so there's too many issues with this. One is the fairness of the
00:48:36
Speaker
partitioning. Okay, so data doesn't give that doesn't isn't nice enough to be sort of uniformly distributed across all dimensions. Yeah, right. So what happens now is like what happens if my group key is like highly skewed, like I have 10 billion rows with the same group key, and then I have 10 billion rows where they all have different group keys.
00:48:59
Speaker
Yeah, yeah. Plenty of businesses have a few users that have a problem. I like to call it the Justin Bieber problem. Like everything is, you know, like if you could look at like social networks, they're all like following on these like massive nodes that have a lot of connections. Same problem. So if you now do a hash partitioning on those, what happens is that one of your cores will be very busy.
00:49:26
Speaker
with the heavy hitter because one of them is unlucky to get the hash partition where the heavy hitter sits in. And all the other ones are sitting there. Or doing something, but obviously the runtime of the biggest partition will determine the end-to-end runtime of this query. So skew is really terrible here because skew kills the fairness in distribution.
00:49:50
Speaker
The other thing that's really terrible is it's not reactive. Once I start partitioning this into eight partitions, I am doomed to keep partitioning this till the query is completed because I cannot just change my mind and do fewer partitions because my downstream operators don't know about this, that this is going on in the first place because I've tried to avoid rewriting them. And therefore, I cannot really change my mind. Usually the exchange-based parallelism
00:50:16
Speaker
is baked into the query plan. So they actually have a step that takes the physical plan, inserts these operators that split up and recombine, split up, recombine. And yeah, it's really problematic for skew and reactivity reasons. Is that like if I've got 10 million rows and the first million are skewed quite differently to the last million? That, but also something like, oh, say your second query is coming. I can't just hog all the cores.
00:50:45
Speaker
while I'm processing the first one. Second one will have no resources to use. Maybe I actually want to go down with the number of cores that I'm using for the first query because now I want to distribute my resources. It's actually one of the issues that's really plaguing Spark. I don't know if you have experience with Spark. Regrettably, I have tons of experience with Spark.
00:51:05
Speaker
Apache Spark. And they use Exchange, but they just implemented Guts's paper, right? They use Exchange Reparism. But as a result...
00:51:16
Speaker
Spark classes are sometimes used by multiple people. But if somebody gets lucky and grabs all the cores for his query or her query, the second guy or girl coming with another query has to wait till the thing is finished, because they have baked in this parallelism in the query plan. And they're waiting for the whole thing to end. And it might be that seven of the cores are idle anyway. Yeah, well, you don't know, right? So there's some ways of fixing this with overcommitting and blah, blah, blah. But you could also abort the query and redo your partitioning
00:51:46
Speaker
They'll ignore all of this now. So this is why a state of the art isn't this system anymore. Instead, we use something else. We use morsel-driven parallelism. How does this work? Morsel. Morsel, yes. So there's another paper at VLDB, the very large database conference that explains how that works, where we basically say, hey, this idea from GUTS from the 90s tried to avoid redoing the operators.
00:52:16
Speaker
for good reasons. Maybe that we can now start rehacking these operators and we can avoid all these problems that come from this operator unaware parallelism scheme. The operator doesn't know about parallelism in the exchange-based system.
00:52:35
Speaker
in morsel-driven operators do

Morsel-Driven Parallelism in Query Execution

00:52:38
Speaker
know. So I already told you how the operators become more complicated because they have to be vectorized. We're adding a whole level of complexity now to make them parallelism aware now, which makes it even harder to write these operators. But for good reasons, because of what morsel-driven parallelism does, it says, look, we will not bake these extra operators into the plan to make parallelism. We will actually make the plan itself parallelism aware.
00:53:04
Speaker
And this comes back to the pipelines I mentioned earlier. The pipelines are streaming from a source to a sink. Sources can now basically split up their source into multiple chunks, but not with any lot of logic. They can use whatever partitioning scheme they want. They can just say, you know what, the first 10 million rows go here, the second 10 million rows go there. It really doesn't matter anymore.
00:53:33
Speaker
If you do a table scan from disk, you will look at your metadata and say, aha, I have 10,000 blocks. So I guess the first 1,000 go to core one, the second 1,000 blocks go to core two. And now we've lost this partitioning scheme that the previous method relied on.
00:53:49
Speaker
Okay, so which means that we have some semantics issues in our operators. However, only blocking operators have these semantics issues because they have this whole dataset view, like a group by, like a sorting and so on. Yeah. So now those are the ones that have to be adapted to basically be able to work with multiple streams of data from multiple pipelines that run in parallel, coming into an aggregate, for example, at the same time,
00:54:15
Speaker
And then this operator, this aggregate operator, this parallelism aware operator needs to know how to reconcile this. And in the case of the hash,
00:54:25
Speaker
You can imagine that you're building more hash tables. You're trying to figure out which data is in which sort of group. You're trying to add a secondary recombine phase, like you just said earlier. It's all sorts of tricks you can pull. It doesn't make the thing more complicated, but there's a great upside because there is no longer a fairness issue. We can have n threads working on this, and they're just grabbing subsequent tasks from these sources. And if you want to have
00:54:55
Speaker
One thread is working on this. We'll just wait until one of them is done and we'll have the thread do something else. And nothing bad happens. It's no longer the stream that has to be orchestrated to run in parallel for it to work. It's more like a list of tasks that you sort of go through. And the more threads go through this task, the better.
00:55:15
Speaker
is no longer like, no, it has to be eight and have to be running at the same time. Similarly, the source can dynamically react as it's splitting up that disk. No, it will just generate a bunch of tasks and the worker threads will grab those tasks. Oh, I see. Okay. Right. Second theory, we had this fairness issue with the beaver problem, like with the high hitter group,
00:55:40
Speaker
That's no longer an issue because it means, yes, one of your threads will be like, first of all, we don't have these all in the same thread, all these groups in the same thread. This is the problem of the upstream operator. And if one thread takes a little bit longer than another, not everybody else is blocked because they can go and fetch the next task already. Like if this guy takes a bit longer, whatever.
00:56:02
Speaker
So, you can just grab another task while this guy is still doing this thing. So, it's very elegant. It's very elegant to use this morsel-based parallelism. It's a very sort of effective way of doing parallelism, and the result is that,
00:56:19
Speaker
basically parallelize arbitrary queries over all the cores that you have. You can also tell it to use fewer cores. It can dynamically react to say, oh, now we have two queries running. We probably shouldn't give all the resources to the first query, but maybe we should give some resources to the second query as well. All these things become possible. And the only cost, quote, quote, is, of course, to the poor people that have to implement this to some of
00:56:47
Speaker
Some of the people in the company here that, all myself, I've done this to now have to implement operators that are aware of parallelism, aware of this mortal driven scheme. Most of the whole point of getting a database, you can enforce other people to do the really hard computer science for you, right?
00:57:06
Speaker
This is the same with lots of CES. If you look at how the CPU works or how the operating system works or how your phone works, the baseband in your phone works, a lot of really, really hard complexity is hidden from people behind somewhat efficient user interfaces.
00:57:27
Speaker
And databases, this is the same, right? Like we do all this crazy stuff. And these are just two examples. There's more crazy stuff happening, of course. So in the end, the user just goes and types a query.
00:57:42
Speaker
And the thing will process this in as fast, as short a time as possible, using these crazy tricks to get your CPU to be as efficient as possible on this. It's quite interesting. You're trying to wrap the ugly parts to hide them from users. But I can mean these two things. They're the vectorized pre-processing and the model repairism.
00:58:09
Speaker
they're really fundamental, let's say, to the whole thing. I would say those are like those characterised, like how the system works. It's interesting how, I mean, the strategies involved are familiar even if the domain isn't. So the idea of splitting things up into medium sized batches and stream processing them
00:58:29
Speaker
where possible. And the idea of, of rather than having a really dynamic scheduler or something, you just have chunks of tasks to be dealt with as they come in. I mean, those are familiar outside of the database world. Yeah, absolutely. Absolutely. There's
00:58:48
Speaker
Absolutely. I think one of the complexities in database world compared to, say, other systems that support exposing this, you mentioned MapReduce. In the olden days, there was...
00:59:01
Speaker
There was Hadoop to everybody's great entertainment. It was pioneering, let's say that for it. I think the difference of SQL databases is that there is this crazy complex language that people throw at us.
00:59:21
Speaker
This is, it's pretty obvious to do this like in a straightforward query, like select blah from the group by bar. You can draw this on a, I can draw this on a whiteboard, explain this to students or like to be users. Once this becomes like, you know, 15 joins deep while various aggregation levels, you know, window functions,
00:59:45
Speaker
Nested subqueries, a good one. Yeah, yeah. We actually have implemented another paper from a group from Munich, by the way. We have colleagues from Munich, also academics that also, by the way, also work on actual systems. The team that is behind Hyper and Umbra, Professor Neumann and his team. They've written a bunch of great papers on how to solve some of these really ugly issues, specifically one around nested subqueries.
01:00:13
Speaker
That's something that one of my co-creators, Dr. B. Mark Asfeld, who's sitting next to me here, he at some point disappeared for a couple of months with this paper and reappeared only when sub-queries were kind of solved. And to this day, I still have my highest respect for Mark for pulling this one off because sub-query unnesting is one of these
01:00:41
Speaker
things that, you know, like three people in the world know anything about, you know what I mean? Like there's this guy, there's Professor Neumann in Munich, there's Mark, and maybe one other person that understands this. And yeah, it is fun, right? Because usually people don't do this. And Postgres, by the way, cannot do it. They cannot unless subqueries.
01:01:04
Speaker
Why not? Because nobody has ever bought at Postgres has not bought it to implement Professor Nerman's seminal papers, Unnesting Arbitrary Subqueries. It's there. It's just very difficult. But once you've done it, you have no more like the nested subqueries actually disappear in your execution, which is great. That's like a factor 10 million query speed up right there, because instead of having to do this dumb re-execution of your nested queries for every row,
01:01:33
Speaker
Yeah. You actually turn it into a join or an aggregate and that's it. I can recommend, if somebody has a lot of time on their hands, I can recommend reading this this Unlisting paper. Yeah, I bet that's quite easy in the simple case, but getting it to be semantically sound in the complex cases. Yes. And they actually can prove they can do it in all cases and it's exactly what you want because then you can throw away your other implementation. Right. It's like you want to be able to optimize all of them away.
01:02:03
Speaker
Yeah, so there's a lot of these things. There's also fun stuff around window functions that nobody... Have you heard of window function in SQL, right? This idea that you can look around a bit or you can also say, let's add up the previous two and the following two rows and things like that. Normally in SQL, that would be like four self-joins or something like that. It wouldn't be pretty.
01:02:26
Speaker
So you can take the difference between this one and that one. And finally, what we all want to do, differentiation in the database. Yes, exactly. Well, what people do with this thing is really beyond me. I mean, I'm remaking database systems. What people do with it is entirely different. We talk to people. We want to know. We want to know what they do with it because we want to see something. But sometimes what ends up on our issue tracker is like, what?
01:02:53
Speaker
But to come back to window functions, there's also one of these things, right? There is maybe, I don't know, 10 people out there that know how to efficiently implement window functions.
01:03:05
Speaker
And because, yeah, usually your database does it for you, but if you're the person that has to write this thing, it's not going to be pretty because you're suddenly very, very alone. And you're very grateful if somebody, and in this example, also the Munich Group has written a paper eventually.
01:03:25
Speaker
You're very grateful if some crazy academic out there actually wrote down what it takes to do this efficiently. That's something that we're really also really grateful for standing on the shoulders of giants. People here in Amsterdam that we learned everything about vectorized query execution from. People from Munich that we learned everything about, and I think sub-queries from. It is a big advantage of being in the research community, of just knowing where to look, I guess.
01:03:53
Speaker
Yeah, yeah. And having a bunch of people that have not only done the work for you, but proved it's going to work. Well, they haven't done the work for you. Okay. They don't publish code that you can, like code, transferring code between database systems is really not happening a lot, right? Like nobody has, I mean, nobody's going to copy paste the optimizer from Postgres into MySQL or something like that. It's not going to work, right? You have to reinvent it. I was just saying you've done the thinking work for you.
01:04:21
Speaker
They have definitely done the work, and they have bothered to write it down. And it's not going to be a pattern like what Oracle does, something like that. It's complicated enough. It usually ends up in a pattern, and you can't touch it.
01:04:34
Speaker
When somebody writes a research paper, it actually helps other people. But as I said, building these kind of systems, you end up being alone a lot. You're the first person to come in across this problem ever, or in recorded history or something. There's nobody else you can ask. It's not like with this computer that had a password. There's nobody you can ask.
01:05:01
Speaker
Yeah, but that's the joy of academia, right? The worst situation is I can't find a single problem that hasn't already been solved. Yeah, I'm always, I have a bit of a rant about that. So especially the Greek department, like the people with the symbols, they are masters at like inventing problems.
01:05:23
Speaker
And then solving them, and then there's a paper, and then everybody goes like clap, clap, clap, clap, clap. But in our field, it always bothers me when people kind of design into thin air or design based on sort of imaginary problems.
01:05:38
Speaker
because there's so many actual problems in data out there. So many people are struggling so hard to process data. If you look at what people are using out there, people are on pandas. They're running pandas, which is this, I mentioned it earlier, this is full materialization query engine from state of the art of maybe
01:06:02
Speaker
mid-90s or something like that. It's not great, but it's the only thing people have. And then when the academics who actually know something and know better how to build these things, then decide to throw symbols around or decide to design something for an imaginary use case, then it bothers me because it's like, look, we have all these people out there on pandas. Can I just do something for them? It's actually a
01:06:28
Speaker
Again, our Turing Award winner, Professor Stonebreaker, he's at MIT. He has coined the term of solving the whale's problems. A whale's being like Google and Facebook.
01:06:41
Speaker
Yeah, right. And I think as an academic, it's actually quite criminal to solve the whales, to try to solve the various problems, because the whales don't care. They have clever people. We've worked with some of the people at Facebook, Google, as well. They are clever people. They don't need us to solve the problems. They will ignore solutions that we have come up with anyway.
01:07:01
Speaker
And now here we are spending tax money on solving Google's problems. What is this saying? What is the moral story here? I have another example. When we started DuctDB, we said, OK, distributed curve processing, the topic I got my PhD in is not going to be something that we're going to do because it's absolutely idiotic to start throwing distributed systems
01:07:28
Speaker
at 99% of data problems. But that's not an acceptable view in academia because they say, yeah, but Google has all these big data sets. How can your solution be relevant and meaningful and valuable if it doesn't solve the 1% problem? But our point was to say, look,
01:07:47
Speaker
if the 99% are still on pandas, clearly something needs to be done. So we were always saying, no, we're not going to touch distributed systems. And in my opinion, it was the right call because the computers get so fast. You can easily go terabyte data processing on your MacBook these days. And there's also this interesting effect that I've noticed is that hardware scales faster than data sets.
01:08:16
Speaker
Because the humans scale slower than hardware. So we have a natural limit on monkeys hitting keyboards, right? An amount of meaningful, valuable data we can produce is just limited by the amount of people that we have. We have 8 billion, seems like a lot, yes, or 9. I don't know. Actually, I don't know the number. Do you know? Yeah, it's 8 to 9 billion. It's not going to be 20 billion next year. But the macro is going to be twice as fast next year.
01:08:44
Speaker
Yeah, yeah. So actually, many data problems are actually just eaten up by hardware, right?

Challenges and Innovations in Data Management

01:08:52
Speaker
And then so spending a lot, and you have no idea how much brain time and thinking time went into distributed curve processing. It was like, aha, clearly, the data problem is going to be so big that we have to have multiple computers. And this was true. If you're Google, sure, I believe you. You have your search index.
01:09:12
Speaker
is gigantic, you need this. But then you have people that are getting paid very well to deal with this. And I don't have to spend my time on this. At the same time, if I can show that any meaningful data set that people see in the world fits on a MacBook,
01:09:29
Speaker
I want to, that's where I want to be. That's just my, that's just my ethos as a, as a researcher paid with public money. You know what I mean? It's like, I often think like industrial programmers often complain that there's nothing new in programming since the seventies. And I think, well, it's because we're not looking there's that bridge from in this industry to academia isn't there. There are new ideas and we're not implementing them.
01:09:54
Speaker
Yeah, it's true. It's a fun problem. Yeah, it's one of these things that I'm sometimes wondering why funding agencies, for example, don't mandate something like this. There are enough problems out there. Just look at them. Just look at the world. No, we just picked one. And then, hey, it's really annoying to process data.
01:10:19
Speaker
In that case, perhaps I'll feel better about pulling out to my specific use case of DuckDB, and we can talk about that quickly to end. So I started using it. I found DuckDB on the recommendation of a friend of mine because I have a bunch of modestly large JSON files that I wanted to do queries on.
01:10:38
Speaker
and had faced that I could use Postgres or SQLite. And it's not that hard to create a schema that matches the JSON. But then I just chucked it into, I didn't even chuck it into DuckDB. I did select star from JSON file name, and it just works.
01:10:58
Speaker
Firstly, that raises applause, making it that usable. But then I'd like to know what's going on under the hood for that. Is it creating a schema? Is it creating a temporary schema table? Is it doing optimizer stats on that to make the query efficient? What's it doing?
01:11:16
Speaker
Yes, that's great. This works with JSON files, it works with CSV files, it works with parquet files, it works with a bunch of other stuff. What's with the Excel, doesn't it? Yeah, I think somebody made a plug. I don't see everything happens in the activity anymore. So with the JSON files, what happens is, okay, you say select star from... By the way, select star is optional, you can just say from JSON file in the activity now.
01:11:42
Speaker
We have blog posts on friendlier SQL where we innovate the language just to make people happier. So if you say select start from JSON file, what will happen is that our query analyzer will say, okay,
01:11:58
Speaker
And from file name, I would expect the table name there. I do not know such a table. Therefore, I'm going to do what we call a fallback scan, where we look for other things that could potentially be treated as a table. In this case, it finds, it will go through various sort of extensions, we'll say, does the file name end in parquet? No, it ends in JSON.
01:12:20
Speaker
Let's ask our JSON reader to look if it potentially can read this file. What does the JSON reader do? It will open the file.
01:12:29
Speaker
I see if this file exists. If it exists and the JSON file says, I can do this. Then in the binding phase, I mentioned this earlier, it's the phase where we collect the types and the data column names and the table names and all set for our query. We will actually run code that will try to create a temporary schema for your JSON file.
01:12:52
Speaker
And so I think by default it will sample your JSON file because it can be gigantic and you don't want to read the whole thing, blah, blah, blah. I'm not entirely sure. I didn't write the JSON reader. But it will generate this temporary schema for you.
01:13:09
Speaker
And then basically, from looking at the file, it will know the names. It will know the types. OK, this is an int. This is a string. This is a timestamp or whatever JSON has. And then so then with that information, it can resolve the rest of the query. So if you say, select star from JSON file where A is 42, it will say, is A a field in this file? Yes, it is, because it just generated a temporary schema. The binder will be able to use this.
01:13:36
Speaker
in resolving the rest of the query. And then during query execution,
01:13:41
Speaker
Actually, no dedicated importing step happens. What happens is that we do a streaming read on the file directly, and we treat it like as if it were a table. DactyB has this notion of treating a lot of things like there were tables. Also stemming from, by the way, this whole integration in the same process kind of idea. Because we realized that once we're in the same process, we can treat a lot of things that exist in this process, like if there were tables, for example, if you're in Python.
01:14:09
Speaker
you can treat a pandas data frame as if it were a table and just query it directly. So in this case, JSON, it treats as a table. There's code in DuckDB that essentially can read a bunch of bytes from the JSON file.
01:14:22
Speaker
emit the vector chunk intermediate thing that the rest of the stack can understand. And it all happens sort of dynamically. And that temporary schema only lives as long as the query lives. Okay. And if it's sampling, then you don't need to read the whole file multiple times before you can start the work.
01:14:40
Speaker
No, I think you can make it to read the whole file because sometimes JSON files are very irregular, but very, very often they're also very regular. The sadness about JSON files sometimes is that they are indeed exported from relational databases only to be read back into relational databases. There's only one thing that can go worse there, which is like if somebody does this with XML files, but that's a
01:15:03
Speaker
Hopefully. I won't criticize Jason too much because it's a miracle we managed to mostly agree on a universal file. No, I'm not against Jason. Jason is fine. I'm happier with Jason than with XML. But this works with lots of file formats. We can do, as I said, parquet files.
01:15:23
Speaker
Does it do something similar for the optimization phase? Do you sample some statistics from the top of the file? No, actually we can't because the statistics we have have to be exact, or at least out of bounds. For example, if we want to
01:15:38
Speaker
if we want to do some static proving that filter can never be true. So if you say set star from JSON file where A is bigger than 42, our optimizer normally would say, let me look at the minmax statistics. If A is never bigger than 42, I can just remove this filter entirely and return empty set and be like,
01:15:59
Speaker
Done. This is the fastest query ever. I can statically prove there cannot be a result based on the available statistics. We do that. In the JSON case, we don't have the statistics. The optimizer can work without those statistics. And then it won't be able to do these kind of optimization. If you want these kind of optimizations to be done, you have to copy the JSON file into an internal table, and then it will have all this information.
01:16:29
Speaker
Yeah, DuckDB has a really great support for nested types like repeating fields, structured fields, these kind of things. So these go natively into columns in our background storage. They're compressed. It works pretty well. There are file formats like Parquet that do have statistics. And if Parquet files have in their metadata things like minmax statistics and a null count and things like that,
01:16:53
Speaker
And we do use those if they are present. So in the parquet file, if you did this thing where A is bigger than 42, and we can, based on the metadata, derive that there cannot be any value that matches this criteria, and we will also just be like, done. So that depends on the input format. You can actually write your own plug-in. So if you have a data format out there that we don't support yet, that to be has this concept of plug-ins, or we call them extensions.
01:17:22
Speaker
that can provide their own scan functions, we call them. And so those scan functions can be just whatever you want them to be. And they have to basically do two things. They have to be able to generate this temporary schema based on some input, like the binding, what I said earlier. And they have to be able to read this thing and produce our intermediates from this input, like our column format or column chunk format for the 2048 values in it.
01:17:49
Speaker
I could create my own plugin that read Apache HTTP log files. Yes, absolutely. Although our CSV reader would probably be okay at reading those. I think we have the world's most advanced. We actually have a PhD in computer science. Does nothing else than work on our CSV reader? Again, not because we love CSVs. We actually don't and he doesn't. He's quite miserable sometimes.
01:18:16
Speaker
But because we realize it's the first thing people do with the data system, is they throw a bunch of CSV files at it, right? Yeah. So because of that, you need this to be good, and you need it to be fast. I mean, you want it to be fast. You care about this sort of thing. They've got to meet where they are. Yeah. Yeah. The world runs on a bunch of CSV files. Who am I to? Who am I to? I can whine about this on Twitter. But who's helped by that? I can also be like, right.
01:18:45
Speaker
get to work. Yeah, it really is. So thank you very much for taking me through how it works. A pleasure, Chris. Oh, cheers. Hannes, I'll see you again. Thank you.
01:19:00
Speaker
Thank you, Hannes. And I'd like to dedicate this whole episode to Hannes' dad for selling his son a bricked computer.

Life Lessons and Parenting Humour

01:19:07
Speaker
I'm all for teaching your kids to be independent problem solvers, and I think we've just seen the fruits of doing that. But bricked computer, that is a next-level move. Good parenting, Hannes' dad.
01:19:19
Speaker
If you would like some

Resources, Recommendations, and Wrap-Up

01:19:20
Speaker
more next level moves and next level ideas of a different kind, have a look in the show notes. You'll find links to all the papers that Hannes mentioned. And if you'd like something a bit more accessible to go and play with, have a look at duckdb at duckdb.org. They're not sponsoring this. It's just very quickly found a useful place in my toolbox. So thumbs up from me. Go kick the tires on it.
01:19:44
Speaker
Before you go and kick, take a moment to click the like, rate, share buttons. I live for the brain food of these podcasts and the interesting conversations I have, but I also live by a bit of feedback. So if you leave some, thank you very much. And make sure you're subscribed, because we'll be back next week, of course, with another great mind from the software world. I've been your host, Chris Jenkins. This has been Developer Voices with Hannes Malizan. Thanks for listening.