
Practical Applications for DuckDB (with Simon Aubury & Ned Letcher)

Developer Voices

DuckDB’s become a favourite data-handling tool of mine, simply because it does so many small things well. It can read and write a huge number of data formats; it can infer schemas automatically when you just want to move quickly; and it can interface with most languages, run like lightning on the desktop or be embedded into a webpage. I’m a huge fan.

But I’m not nearly as knowledgeable as this week’s two fans, Simon Aubury and Ned Letcher, who’ve just written a book on all the many ways you can use DuckDB and all the hidden tricks and tips that help you make the most of this. So in this episode we’re taking a practical look at DuckDB, what problems it can solve at work, and how to start getting the most out of it.

Getting Started with DuckDB (book): https://packt.link/byKYt

DuckDB episode with Hannes Mühleisen: https://youtu.be/pZV9FvdKmLc

DuckDB: https://duckdb.org/

dplyr, the data-manipulation language: https://dplyr.tidyverse.org/

duckplyr, DuckDB’s ‘native’ version: https://github.com/duckdblabs/duckplyr

Substrait: https://substrait.io/

Observable (Markdown+DuckDB=Reports): https://observablehq.com/framework/

DuckDB’s “friendly” SQL: https://duckdb.org/docs/sql/dialect/friendly_sql.html

Community Extensions: https://community-extensions.duckdb.org/

DuckCon #5: https://duckdb.org/2024/08/15/duckcon5.html

Support Developer Voices on Patreon: https://patreon.com/DeveloperVoices

Support Developer Voices on YouTube: https://www.youtube.com/@developervoices/join

Simon on Twitter: https://x.com/SimonAubury

Ned on Twitter: https://x.com/nletcher

Kris on Mastodon: http://mastodon.social/@krisajenkins

Kris on LinkedIn: https://www.linkedin.com/in/krisjenkins/

Kris on Twitter: https://twitter.com/krisajenkins

Transcript

Introduction to DuckDB and Its Advantages

00:00:00
Speaker
DuckDB has become one of my favorite tools of late, because it's designed as an analytics database. It's supposed to do million, maybe even billion-row queries quickly. And it's kind of in the same space as SQLite, so you could use it that way. But where it's found a place in my heart is that it's great for dealing with kind of awkward data jobs. Like, you've found a hodgepodge of CSV files or JSON data and you need to pull it in somewhere and start understanding what you're looking at, making sense of it. And maybe, when you've found an interesting query or two, spitting that out to a spreadsheet to send someone else. It's superb at that. And for that reason alone, it's found its way into my everyday toolbox.
00:00:46
Speaker
But I know I'm only scratching the surface of what it can do. It's filled with little useful tools for that kind of job. So when a friend of mine said he was going to co-write a book full of all the different ways DuckDB can be used, I knew I'd have to read it for tips.

Expert Insights from Simon Aubury and Ned Letcher

00:01:01
Speaker
And when I read it, I knew I had to bring him in to share some of his practical knowledge here. So joining me this week to discuss a host of useful ways that DuckDB can be put to use are Simon Aubury and Ned Letcher, two gentlemen who've gone from using it to deal with social network graphs or awkward Hadoop clusters to putting together a whole book on how to make the most of it.
00:01:26
Speaker
In this discussion, we're going to go through everything from SQL queries, and Python bindings, and R bindings, to hosting DuckDB in a web page through Wasm, because does anything really exist these days unless it compiles to Wasm? Before we begin on the podcast, I do have to do a quick full disclosure thing and tell you a quick story. A couple of months back, Simon contacted me on a Friday afternoon and he said, would you be willing to read our book and write a foreword? And I said, well, I'll read it first and then I'll decide if I'm writing the foreword. How long have I got? He said, the weekend. I said, you're kidding. He said, no, it's got to go to the printers on Monday. You've got the whole weekend.
00:02:12
Speaker
I said, well, I'll have to cancel my Sudoku plans, but sure, sure, I'll do it. So I did nothing but read the book all weekend and I learned a lot. I was very pleased, and I happily wrote the foreword and sent it off. Several weeks later, Simon and Ned sent me a bottle of red wine and a jumbo book of Sudoku puzzles. So we learned two things. Obviously, what I should have said was, I can do it, but I need to cancel my weekend plans; I was going to go hunting for vintage synthesizers. Might have got something more than a book of Sudoku puzzles. But the other thing we learned is, if I can be bought, I can be bought fairly cheaply for red wine, interesting tech books and puzzles.
00:02:54
Speaker
It's possible you knew that already. My real bias here is that Simon's a friend, and I think he's a great guy, and having spoken to him now, I'm forming the same opinion about Ned. So I hope you enjoy listening to them. Let's get into it.

Podcast Introduction by Kris Jenkins

00:03:07
Speaker
I'm your host, Kris Jenkins. This is Developer Voices, and today's voices are Simon Aubury and Ned Letcher.
00:03:26
Speaker
Joining me today, we have two fine Australians, Simon Aubury and Ned Letcher. How are you doing? Doing well. Great to talk to you today, Kris. Yeah, pretty good. Thanks. Always good to have you back, Simon. I've spoken to you on other podcasts in the past. Ned, you're my first-time guest, so maybe I'll start picking on you. So, to give this some context: we had Hannes Mühleisen on the podcast a while back talking about DuckDB, and we mostly ended up talking about the internals, going right down inside, which is fascinating. But it's also fascinating to ask, like, what can this actually do for me today?

Ned's Enthusiasm for DuckDB's Analytics

00:04:05
Speaker
Which is why I've got you guys in, and you've just written a whole book about it. But before the book was written, Ned, what did DuckDB actually do for you? Why do you care?
00:04:16
Speaker
Oh, yeah. Yeah, good question. I mean, I think first of all, I heard a whole bunch of buzz about it. I sort of tapped into, you know, the online data scene, Twitter. Well, it would have been Twitter back then. And, yeah, just sort of suddenly, you know, people that I pay attention to and trust were saying things like, hey, this DuckDB thing is looking pretty interesting, like a really strong focus on analytical needs. And so I guess I was seeing data science folk and people doing data engineering and data analytics just saying positive things about it. So I thought, well, I should probably take it for a spin. And I did actually try it on some data I'd extracted using the Twitter API, while it was still affordable to me, and sort of, you know, pulled down a bunch of my social

Using DuckDB for Social and JSON Data Processing

00:05:06
Speaker
graph. I was interested in exploring what it would be like to sort of see clusters of people, you know, visualize the network of little micro-communities that form that you can't necessarily see out of the box. And so I pulled down all this data and sort of, you know, had this little scraper thing going, and I ended up with, I think, about 70 million lines of JSON Lines data, sort of records, and then having to parse that and hydrate it, and ended up with a single DuckDB query.
00:05:33
Speaker
It came together, sort of, you know, loaded it, and had it in a table in DuckDB in, I think, under a minute. This was on, I think, version 0.7 or so, before it hit 1.0 recently, so I have a hunch it's probably got even a bunch more performance now. And this was just on my desktop. So I said, okay, this is pretty cool, I'm going to have to check this out. And then, yeah, Simon sort of said, hey, do you want to write a book about it? And I thought, well, I've got to go deep, I've got to go down this rabbit hole. So yeah, we're definitely going to get into that collaboration. I've got to pick you up on that, though: you said social graph. That surprised me, because I don't think of DuckDB as a graph database.
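As a rough sketch of what that kind of one-query load looks like in DuckDB's Python client (the file and database names here are hypothetical stand-ins for the scraped data):

```python
import duckdb

con = duckdb.connect("social.duckdb")  # persistent database file (hypothetical name)

# read_json_auto samples the newline-delimited records and infers a schema,
# so one statement is enough to land the raw JSON Lines data in a table.
con.execute("""
    CREATE TABLE followers AS
    SELECT * FROM read_json_auto('followers.jsonl')
""")

print(con.execute("SELECT count(*) FROM followers").fetchone())
```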
00:06:18
Speaker
Yeah, out of the box, not so much. I think it was just more to do with the way that I got the data. You know, you hit the API that sort of says, well, I follow these people, and these people follow that, and so it's a couple of chains deep. So I guess I had essentially graph data, just not in the form of graph data; I wanted it to become graph data. So my task was to hydrate it. I think I'd actually need some other tool to do the graph-style analysis proper, but just as a first step of wrangling it into the right shape, I think that's a really strong use case for DuckDB. You've got a pile of data, possibly quite big, and it's really good with semi-structured data like JSON.
00:07:00
Speaker
And um yeah, just sort of, you know, getting it into that shape that you need for, for analysis.

Simon on Fitbit Data Management with DuckDB

00:07:06
Speaker
Yeah. Wrangling is the word that comes to mind in my experience with DuckDB. So was it similar for you, Simon? Yeah, I think I came across DuckDB in a similar, slightly inquisitive way. It came across my radar as a tool that some people were talking about, some people were raving about. But I had my own sort of vanity project, which kind of got me interested in it.
00:07:33
Speaker
I wear a Fitbit, which collects a lot of data, and one day I just decided I'd download the big archive of data, just to have a bit of an idea of what had been collected through my fitness tracking device. It was really great to download this enormous zip file, which from memory was gigabytes in size. Then I was somewhat terrified when I opened it up and realized that it was a mixture of weird JSON files and some
00:08:05
Speaker
CSV files, and it was just a collection of all sorts of things representing heart rates and step counts and location information. It was quite extensive, but it was like 70,000 files all up. Yeah, that was an awful lot of data. And I thought to myself, this is kind of cool, but they're not particularly standard. Some have weird, embedded American date formats. Some have actual epoch timestamps. Some have just very strangely formatted JSON. So it was all over the place.
00:08:45
Speaker
And I thought, well, I'm going to give this new tool, DuckDB, a go, because one of its callouts is that it's great for these sorts of data wrangling things when you've got everything local. I thought, well, I'm going to give it a go and start playing. Long story short, I brought in what was essentially 20 styles of data really, really quickly, and it sort of convinced me that, oh, for this kind of job of I-have-a-bunch-of-data-and-I-want-to-do-something-with-it, and it's
00:09:18
Speaker
fairly undocumented, or it's got some fairly rogue or unique encodings within it, the path from that problem space through to having loaded the data into a form I can do some analysis on, which I thought was going to be quite a long path, turned out to be a fairly straightforward path. And I think that was one of my callouts of the experience with DuckDB: I had a job to be done, I kind of anticipated the job was going to be difficult, with lots of mucking around with Python formatters and stuff like that, and lo and behold, a lot of those problems were just sort of smoothed away and I got to my job of making pretty graphs pretty easily.
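A minimal sketch of that kind of mixed-archive wrangling, assuming hypothetical paths and column names standing in for the Fitbit export:

```python
import duckdb

con = duckdb.connect()

# Globs let DuckDB sweep up thousands of files at once; union_by_name
# tolerates files whose columns don't all line up.
steps = con.sql("""
    SELECT * FROM read_csv('export/steps/*.csv', union_by_name = true)
""")

# JSON files from the same archive, with epoch-millisecond timestamps
# converted on the way in (epoch_ms and bpm are assumed field names).
heart_rate = con.sql("""
    SELECT to_timestamp(epoch_ms / 1000) AS ts, bpm
    FROM read_json_auto('export/heart_rate/*.json')
""")

print(steps.limit(5))
print(heart_rate.limit(5))
```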

Efficiency and Formats: Parquet and Predicate Pushdowns

00:10:04
Speaker
Yeah, I think I realized it was a useful tool when I had some structured but weird JSON data. I pulled it in and spat it out to an Excel file. And I thought, well, that's a terribly useful thing I'll remember for work one day, because that's so common, right? But you guys have actually gone away and kind of cataloged all the different data formats out there. Ned, do you have a favorite? I mean, well, it's always tempting to say it depends. It depends on your use case and what you're doing. I think that's the unofficial motto of this podcast. Yeah, right. With technology in general, yeah, absolutely. But I think the one that jumps out, a bit of a workhorse of analytical data applications, is Parquet. And I think it's also one that, if you haven't come across it before and you're learning about it for the first time, it can be a bit mind-blowing what it can open up for you.
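For the pull-it-in-and-spit-it-out workflow just described, a hedged sketch (the input file and the 'actor' field are hypothetical):

```python
import duckdb

con = duckdb.connect()

# Flatten awkward JSON and hand someone a file they can open in a spreadsheet.
con.execute("""
    COPY (
        SELECT actor, count(*) AS events
        FROM read_json_auto('events.json')
        GROUP BY actor
        ORDER BY events DESC
    ) TO 'summary.csv' (HEADER, DELIMITER ',')
""")
# Native .xlsx output is also possible via DuckDB's Excel/spatial extensions,
# depending on your DuckDB version.
```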
00:11:02
Speaker
So I guess some of the properties: unlike, say, a text format like CSV, where the schema isn't encoded inside it, and you've had the pleasure, shall we say, of having to wrangle some random CSV that could be in all manner of different formats, having to potentially do some of those non-standard date conversions that Simon was talking about before. If you have your schema built into the binary file, as you do in Parquet, you don't have to keep solving that serialization logic problem every time, so that's one of the really great benefits, for example. Right, I'm going to need more, because I think of Parquet as it's kind of got the table definition, and I think it partitions, but that's where my knowledge caps out.
00:11:49
Speaker
Yeah, I mean, it's interesting. Some of these things I've learned a little bit more about. It's probably worth mentioning that in the book I've sort of focused a little bit more on some of the data science and analytics applications, and Simon is definitely closer to our resident data engineer, so I might pass over to you in a sec, Simon. But I think one of the interesting things that I learned was that the partitioning is a little bit orthogonal to what's going on in the file system. So with DuckDB, you can actually have partitioned CSVs, or JSON files, and get some of the benefits of, you know, accessing only files within that partition in your cloud storage. You can actually get that from DuckDB, but perhaps
00:12:32
Speaker
if you're thinking about those sorts of optimizations, you're probably going to be looking at putting it into Parquet and getting some of those other benefits. But I don't know; Simon, do you want to jump in around, you know, what are some of the things you like about it? Oh, I think you covered it nicely, Ned, but, you know, I think we all have our favorite file format. If I was to pick just one favorite, it would probably be Parquet, which I think on the tin would say something along the lines of it's self-describing. It contains both the schema definition and the data itself, which is
00:13:09
Speaker
which sounds pretty straightforward, but has a nice side effect: you've got a single file, and within that file it describes the data types of the columns, plus maybe some usage and some statistics of the data within. That makes it very straightforward if you've got data sitting in a Parquet file, because you can just query it and find out, you know, which columns are integers and which ones are floats and which ones are strings. But because it's self-describing, and it's got the schema and a bunch of metadata within it, it has some nice side effects in that, if you've got a lot of Parquet files sitting on a disk somewhere and you just want to query, I don't know, which Parquet files happen to have
00:14:01
Speaker
a user called Kris or Ned, the metadata at the top of the file will essentially have the maximum and minimum values, and it lends itself to efficient processing if you've got a smart enough Parquet reader, because it can essentially ignore a bunch of files that, you know, do not have the word Kris anywhere within that file. It means that if you've got a smart enough Parquet reader, you can ignore those files. And DuckDB is one of those smarter readers. I think one of the things that I picked up from doing this deep dive is I'm finally able to talk about the difference between fancy words like predicate pushdown, which is what Simon was just talking about, where you send the filter, like the where clause: hey, I want, you know, some date between this and that. And if that's in a column,
00:14:54
Speaker
the metadata inside the Parquet files means it can actually only go through certain chunks. It will do some sort of search through maximum and minimum values, so you don't actually have to read the full contents of that column. And then the other enhancement, the other optimization, is the projection pushdown, where it's like, hey, if we're just doing select username from table, we can just pull down that username column and not the other columns, which is really handy if it's a really wide table, lots of columns, and this is over in cloud storage. Using fancy HTTP requests, you can just get those columns. So, you know, when you're sending a bunch of data over the wire, anything you can leave behind is a nice win. How does that work? Presumably, does that mean you have to have a special server at the back end that understands Parquet?
00:15:46
Speaker
Well, that one is... I forget the name of the HTTP type of request. Do you remember, Simon? But it's actually built into the protocol. And it is able to sort of look into it, because the metadata is at the top of the Parquet file, so you can just read enough of that metadata to go, ah, I only need these columns. And this is where we get to the benefit of it being all column-oriented. You might often hear terms like columnar data stores or column databases, and DuckDB has that orientation. So when you're talking about analytical applications, where we're doing things like summing and aggregating into averages and means, we do need to collapse down entire columns. And so that sort of access pattern, I mean, these benefits are really nice and elegant, and they've been designed in that way,
00:16:40
Speaker
to do things like efficiently getting those single columns. Yeah, but how much of that, maybe you know this, Simon, how much of that picking out single columns needs to be built into the server answering the request? So that's a good question. Maybe we should just orientate some of this in the experience, and then we can jump into the execution. I think one of the great things about all of this is it's fast. One of the things that we sometimes get caught up in is the technology and the implementation, but it's just worth orientating that the biggest efficiency you can have is not doing the processing at all. So if you can come up with a processing scheme that says, well, we've got 80 terabytes sitting on an object store, but we can be smart and work out where we don't need to look for data,
00:17:34
Speaker
we have made a massive optimization by just avoiding work. So this idea around predicate pushdown within a file, or skipping a file if we know that the data is not going to be there, is a terrific time saver, because you're essentially avoiding unnecessary work. And again, I'll be straight up, I don't know the deep intricacies, but I'm going to make up words and say HTTP range request. That's it. You got it. That's the one. Excellent, I was trying to Google it in the background.
00:18:09
Speaker
So Ned's revealing far too much here. But the long story short here is that what you want to do is either completely ignore a file if you know the data is not there, or, to what Ned was talking about before, if you essentially had a wide table projected into a Parquet file, and maybe you want the first column and the last column, you know that you can essentially miss all of the columns in between. And an HTTP range request is like a sort of disk access request: you can basically just barrel through a bunch of bytes and then skip ahead and barrel through a whole bunch of secondary bytes. And just projecting that, either through local disk or over a network, again,
00:18:56
Speaker
if you're minimizing the work up front and then being smart around the bytes that you're retrieving, you've made the whole experience much faster by doing only the necessary work. Yeah, I mean, I guess to me it feels like maybe the answer to your question, Kris, is that it's a nice property of both some design features of the web and also these analytical standards and formats. As to which of them sort of came first, maybe it was co-evolution together. I'm sure there are some folks who have more opinionated takes on that, but I think it's, you know, good design choices on both those sides working together.
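A hedged sketch of both pushdowns in action against remote Parquet (the bucket, columns, and filter are hypothetical, and S3 credentials/region config is omitted):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")  # remote HTTP/object-store reads via range requests

# Only 'username' and 'ts' are fetched (projection pushdown), and row groups
# whose min/max statistics rule out the predicate are skipped entirely
# (predicate pushdown), so most bytes are never transferred.
result = con.sql("""
    SELECT username, count(*) AS hits
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    WHERE username = 'Kris'
      AND ts BETWEEN TIMESTAMP '2024-01-01' AND TIMESTAMP '2024-06-30'
    GROUP BY username
""")
print(result)
```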
00:19:37
Speaker
Yeah, it's funny, all these years working with HTTP and I didn't know there was a range request. Makes me wonder what other standards are in there, because I've never actually read the docs. I'm sure there's one person that has that didn't write them, but I never have... Okay, so we've talked about going to the outside world. Maybe we should talk about where you've put DuckDB, because I think I'm right in saying, Simon, you've used it at work for some curious

Handling Sensitive Data with DuckDB

00:20:05
Speaker
projects. Is that true? ah Yeah, that's that's correct. I've used DuckDB in ah obviously a hobby capacity, um but I ended up finding that this was a great sort of tool just in my day job.
00:20:20
Speaker
One small experience I had was I was once on a site dealing with some very sensitive medical data. It happened to be on a big Hadoop cluster, locked in safely in a data center. And my default position would be, oh, if I wanted to do a lot of heavy processing on this sensitive data, wouldn't it be great to lift this into a cloud data warehouse and do it, you know, how the cool kids do it. But because of the nature of this data, we weren't permitted to do that. So it was very much, well, how do I bring that sort of smart computation directly against the data? And that's where, in one of my first sort of professional capacities with it, I found that, oh, well, this idea of bringing computation directly to the data,
00:21:15
Speaker
and having the benefits of predicate pushdowns and smart scanning, jumping over a whole bunch of nested Hive partitioning... A lot of that is fairly routine data engineering stuff, but to be able to do it locally and at pace with a free and open source processing engine was just magic. The fact that you can just point DuckDB at a Hive metastore and say, hey, go forth and read this stuff and query it, just felt like magic. So in that scenario you're not moving the data around, but you're just replacing Hadoop as your query engine.
00:21:54
Speaker
Essentially, yes. One of the great things about DuckDB is it's quite extensible. I think what is now a default extension is its ability to just interact with object stores, as long as it presents like an object store. DuckDB can navigate over quite nested, Hive-organized file systems, and that's pretty standard for Hadoop file systems. In my case, it was like I acted like a genius. Again, revealing a lot behind the curtain here, but in my case, where I just wanted to
00:22:37
Speaker
chunk through a whole bunch of files, I didn't need to do batch jobs or write Spark code. I could literally, from my perspective, write a bunch of SQL statements, point it at the right object store, and I got the answers I needed. And does that, I'm guessing, involve just saying create a local table, which is actually virtually backed by Hive? Correct, correct. So one of the great things about DuckDB is it utilizes memory quite efficiently. So if I wanted to, you know, traipse through
00:23:16
Speaker
petabytes of information to retrieve gigabytes or terabytes of information, it's a fairly straightforward trade-off to pull the data I want, put it into memory, and then do something, in my case, in a temporary table in memory, you know, the classics of aggregates and joins and distributions. That's straightforward SQL, but you're getting a lot of magic under the covers by simply using a SQL dialect, pointing it at an object store, and getting a lot of heavy lifting done by DuckDB under the covers. And, to your question, yeah, a lot of that just sort of materializes in memory as a table.
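A minimal sketch of that pattern, querying a Hive-partitioned layout on an object store and materializing a local table; the bucket layout and columns are hypothetical:

```python
import duckdb

con = duckdb.connect("local_work.duckdb")
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Hypothetical Hive-style layout: .../year=2024/month=06/part-*.parquet.
# hive_partitioning=true exposes the directory names as columns, so the WHERE
# clause prunes whole partitions before any file is read; the result lands in
# a local table ready for further joins and aggregates.
con.execute("""
    CREATE OR REPLACE TABLE june_summary AS
    SELECT cohort, count(*) AS n, avg(duration_min) AS avg_duration
    FROM read_parquet('s3://warehouse/visits/*/*/*.parquet', hive_partitioning = true)
    WHERE year = 2024 AND month = 6
    GROUP BY cohort
""")
```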
00:24:01
Speaker
And you don't necessarily need to bring it in as a table; whether it's a sort of virtual or an actual table, you can just do the querying you need to do off the

Auto-detection and Parsing Capabilities of DuckDB

00:24:14
Speaker
source data system. It does depend on how it's been organized, but, you know, as we were talking about before, if you had all of your data in nice clean Parquet, and potentially Hive partitioned, you may find that your queries work quite well straight off of that. And depending on what you're trying to do, maybe you're trying to create some extracts, like wrangle it into
00:24:34
Speaker
CSV, JSON, whatever it might be. You might be able to do that wrangling directly without landing it anywhere. This potentially brings up some of the nice design features of DuckDB all the way through it. There's lazy execution going on, so you stream the operations through, which means that you can get some quite efficient queries happening. But of course it does depend on what you're getting up to, and maybe for what you need to do, you don't need to create some tables and wrangle things, or maybe you've got different stages of tables. I think that's something I'm keen to get into a little bit more myself in my more data science-style workflows.
00:25:20
Speaker
Actually, I think that's one of the neat things about DuckDB: it's giving you a bunch of these data management features, like a catalog of tables where I can start landing things, as opposed to, say, having export1.csv, export2, or export3.parquet lying around, and you're like, hang on, what does this file mean again? So I think that's one of the other nice benefits. And I think if you read back some of the positioning papers, the sort of why-come-up-with-DuckDB stuff that Mark and Hannes talk about, it was enabling data scientists with those kinds of features: hey, why can't data scientists have these nice data management features as well?
00:26:05
Speaker
Yeah, yeah. Because you mentioned speed as in performance, Simon, but the speed of just, like, work not done, things not figured out, interface definition files not defined, is sometimes the most exciting thing. Oh yeah, and to an extent, we also want to talk about the feedback loop of people or engineers just focusing on the right problems. I know we've spoken a bit about Parquet files, but I think we've all experienced, here's a bunch of random CSV files, let's pretend to understand them.
00:26:43
Speaker
And one of the things which, if you start thinking about it, is actually pretty close to magic, is DuckDB's ability to just read a CSV file. It shouldn't be that big a deal in this day and age, but it still is. There's a lot that's just done for you with the defaults. You'll get it to read a CSV file, it will automatically sample the columns, it will infer the data types, floats, ints, Booleans, and it will project its best estimate of how to organize that table. It will attempt to find the
00:27:24
Speaker
delimiter, the line termination, whether there's a header, the data types. So what looks like an innocuous read CSV of *.csv, it'll jump in, parse a whole bunch of CSV files, sample the top, I don't know, 200 rows, 2,000 rows, it doesn't actually matter, and then make a best-guess inference around the data types and the most sensible way to name some of those columns. And again, as the user of all of this, you're benefiting from a lot of these smarts and heuristics under the covers, behind what looks like an innocuous command to just read a bunch of CSV files from an object store or a local disk. Suddenly, a lot of that annoyance and drudgery of trying to work out, well,
00:28:15
Speaker
is this one tab-delimited, and this one's a Boolean but this one's called a flag? A lot of that stuff is just abstracted away from you, and you can actually worry about what you were trying to do in the first place, which was get the data and do something with the data, instead of having to arm-wrestle a lot of data types under the covers. That's the... oh, sorry Kris, go on. That's the CSV sniffer feature, and you can look up the documentation around it. While it is quite automated and gives you all those powerful auto-detection features, it has a bunch of neat features as well. Like, you can actually see what settings it automatically inferred for the read CSV function, and it will spit that back out at you,
00:29:00
Speaker
which you can then actually reuse later, so you're no longer relying on it having sniffed the right set of formats. Now you've got that robustness built into your pipeline. Yeah, you can take the guess and then freeze it, so you're not guessing anymore. Exactly. Yeah, I like that. It's funny, I wonder if this comes from the fundamental design decision that if you're a transactional database, you kind of assume that your user is defining the schema, whereas when you're an analytics database, you kind of have to assume crazy, dirty, real-world data, right? And it shapes the whole way the product comes out.
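A small sketch of that guess-then-freeze workflow (the file paths are hypothetical):

```python
import duckdb

con = duckdb.connect()

# First pass: let the CSV sniffer guess delimiter, header, and column types.
rel = con.sql("SELECT * FROM read_csv('exports/*.csv')")
print(rel.limit(5))

# Then freeze the guess: sniff_csv reports the detected dialect (and a
# ready-made read_csv prompt), so a pipeline no longer depends on re-running
# the inference every time.
print(con.sql("SELECT * FROM sniff_csv('exports/2024-01.csv')").df())
```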
00:29:36
Speaker
Yeah, I think there's a lot to that. Firstly, I think you're right. I think that sort of persona mapping, or the use case, has really driven a lot of the development towards solving a lot of the paper cuts. Here's a typical path, let's try and solve for the 90% path, but also give enough optionality if you do need to override it or whatever. And just the ergonomics of making that simple, so all of the defaults are sensible. The idea that you might want to do this in an experimental way and then have a robust way. Again, those ergonomics are expressed
00:30:21
Speaker
nicely. So I think it's just a nod to some thoughtful design with a focus on who's actually going to use it, and giving them an easier time just by sanding away some of those paper cuts. It's a good point you make, though, I think, Kris, that I hadn't really thought about, in the sense of that distinction between transactional data and analytical data and this sort of challenge of messy data. I think
00:30:52
Speaker
I think you're alluding to this. If you look at a lot of transactional database use cases, they're inside the context of some application. It's an application database, which means it has been quite well designed with a formalized schema, and that's published, and everyone knows that if I want to use the user table or whatever it is, this is the schema and this is how we're using it. In the analytical world, it'd be nice if that was the case, that we always had these well-defined published schemas, but we have this challenge of data coming from different sources around the organization. Sometimes it's just random weird extracts from Excel or CSVs, but also other operational systems. I think one of the interesting features is that you're often working with data that you don't own. It's coming from some, you know,
00:31:44
Speaker
employee records, ERP data, or Salesforce, or whatever it might be. And that schema is defined by another team, and one day they might roll out some change to the schema and then, oh, sorry, we forgot to mention that, yeah, we were going to do that. And so, for anyone who's been paying attention to all of the interest around data contracts, thinking about, hey, let's treat your operational data as a product, and let's have contracts and SLAs and things like that: if everyone was doing that, then this would be a little bit less of a challenge, but it turns out coordination at scale in organizations is challenging.
00:32:23
Speaker
I sometimes think that's the whole of the challenge of organizations at scale. It's a recurring theme. Yeah. So that definitely raises the question of, sooner or later, your luck will run out and there'll be a format that DuckDB doesn't support. Have you looked into, like, writing extensions to support an entirely new format of data, or is that something you'd go to a programming language for? Personally, I haven't written an extension, but maybe just to orientate this conversation:

Extending DuckDB with JSON and Geospatial Data

00:32:59
Speaker
One of the fabulous things about DuckDB is that it is extendable and extensible.
00:33:06
Speaker
So this idea that there's an extension out there to solve your particular problem is just a nice piece of optionality with DuckDB. When I started mucking around with DuckDB, the JSON reader was an extension, and you'd choose to include that before you did any JSON processing. Over time, I think that functionality was so useful it's now a default extension. But there's a bunch of other extensions that I've been playing with recently to handle a particular type of
00:33:47
Speaker
geospatial data called H3. Again, this is people much smarter than me, who know this space, who've written an extension and have abstracted away the problems so other people can benefit from this research. From my perspective, I simply install an extension and suddenly I'm mucking around with geospatial data like I understand what I'm doing. But it's, again, just a great way of: here is a problem space, there is an extension, I can load that extension, and suddenly I'm dealing with essentially a new data type that wasn't part of the default DuckDB installation. Yeah, I hadn't come across that data type either, the H3. And the reason I found out about it, I think we've both only just learned about it, is that
00:34:37
Speaker
that's actually a community extension for DuckDB, and DuckDB has recently released its community extension repository. I guess if you've used Python, it's sort of similar to PyPI, where people can create extensions, upload them, and then install them from within DuckDB, so you actually have them published. It's probably worth distinguishing that, in contrast to the core extensions that the DuckDB team and project maintain themselves. So there is also the spatial extension, which is core; you have to opt in to installing it and running it, but it does give you a bunch of really awesome spatial features out of the box, well, out of the box once you install the extension, and we have a section on some interesting use cases around that in the book.
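A hedged sketch of the two install paths (the H3 call is a hypothetical usage, with the function name as documented by the h3 community extension; check its docs for your version):

```python
import duckdb

con = duckdb.connect()

# Core extension, maintained by the DuckDB project; opt-in install.
con.execute("INSTALL spatial")
con.execute("LOAD spatial")

# Community extension, installed from the community repository.
con.execute("INSTALL h3 FROM community")
con.execute("LOAD h3")

# Bucket a lat/long point into an H3 cell at resolution 8.
print(con.sql("SELECT h3_latlng_to_cell(-37.81, 144.96, 8) AS cell").fetchall())
```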
00:35:28
Speaker
So there are other people's extensions, so there must be some mechanism for creating them, but then there's always the fallback that you can write in a programming language. I mean, you can use a programming language to write data into DuckDB, right? That's correct, yeah. And I know, Ned, you've been diving into, of all things, R for that. Or rather, using R with DuckDB.

Integrating DuckDB with R and Python

00:35:52
Speaker
Yeah, so I guess, you know, an alternative query interface for interacting with DuckDB. Yeah, totally. So, full disclosure, I've primarily used Python for all my analytical data adventures for quite a while. But lots of people I know
00:36:12
Speaker
use R and have had very good things to say about it, and I wanted to put something together for the book. So I wanted to really have, like, hey, how can I get started using DuckDB and R, and put it through its paces? And so I rolled up my sleeves and got down to business and learned a bunch of R. And I was really quite delighted by it. So there's this library called dplyr, and it's an interface for doing analytical queries. It's really very, very well designed. You can see a lot of thought has gone into, let's decompose how we assemble queries into a grammar of data manipulation. So we know we have things like filter operations, we know we have things like selecting columns, we know we have aggregation where we're grouping by certain things. So let's have these primitives that we can compose in a pipeline.
00:37:09
Speaker
So you actually have this pipe operator, and if you're at all leaning towards functional programming, it feels very familiar and nice and elegant, where you're actually composing these pipelines. The nice thing, and let me take one step back: I guess the default experience of doing that is using that interface to write queries over the data frame library in R. In Python, there's pandas and Polars data frames. That's kind of what you're manipulating, these data structures.
00:37:52
Speaker
But it turns out that you can use this nice dplyr library, with this really cool grammar-of-data-manipulation language, against other targets, other backends. So there's yet another library, you know, often these sorts of things play together, called dbplyr, and it allows dplyr to target different SQL backends, and it has good support for DuckDB. So what that means is, once you get to the end of this set of composed queries of different primitives, dplyr can collapse them all together into a logical query and then work out, well, what's the right DuckDB SQL that I need to generate for that, which you can inspect if you want, or run. So you sort of get all of the benefits of that nice interface, but with the performance benefits of DuckDB.
00:38:47
Speaker
Do you find... so you've used this in R, which is very different to a SQL interface, which again is very different to the Python library. Yes, yes. And before we jump into Python, I should also mention that DuckDB, I think in collaboration with Posit, have produced duckplyr, which is like dplyr, but instead of generating SQL, it's actually targeting the internal DuckDB native API. So it's the same benefits. The user shouldn't really notice much of a difference, but you don't have to do this intermediate step to SQL, and so it'll be a bit more performant. And then you don't have that intermediate abstraction layer where you might get weird errors from a SQL engine showing up in your dplyr error; you get nicer error messages without that intermediate level of abstraction.
00:39:44
Speaker
I wonder if this is a thing of the cloud age, where everything is becoming just one core piece with lots of pluggable bits. So you can imagine DuckDB having its own storage but not necessarily using it, having its own query layer but not necessarily using it, and someone, I don't know, a Clojure person using Datalog against DuckDB's query engine, for the sake of argument. Oh no, it's absolutely a thing. It's happening. A relevant tool there is Substrait, and it's all about serializing query plans. I think there is support for DuckDB and Substrait. It might be sort of in the experimental bucket, but it's definitely there.
00:40:23
Speaker
Which means that, if you generate a Substrait query plan from DuckDB, you can then go and run that in other engines that accept that query plan. It's like an intermediate representation of a query.
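A rough sketch of what producing such a plan can look like from Python; the extension is experimental, so the install step and exact interface may differ by DuckDB version:

```python
import duckdb

con = duckdb.connect()
# Depending on your DuckDB version, the Substrait extension may instead need
# to be installed from the community repository.
con.execute("INSTALL substrait")
con.execute("LOAD substrait")

con.execute("CREATE TABLE t AS SELECT range AS x FROM range(10)")

# Serialize the plan for a query into a Substrait protobuf blob; in principle,
# another Substrait-aware engine could execute the same plan.
plan = con.execute("CALL get_substrait('SELECT sum(x) FROM t')").fetchone()[0]
print(len(plan), "bytes of Substrait plan")
```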
00:40:40
Speaker
I shouldn't be surprised that that's happening. Sooner or later, every process gets turned into data and becomes more useful. Totally. And if folks want to go follow that trend, there's a bunch more of this breaking into modular components. Sometimes they get so modular that they're not actually for end users. I forget which library it is, I think there might be one that's backed by Meta. I can't remember which bit of the stack they're pulling out, but it's basically, as an end consumer, you're probably not going to use it. But if you're building your own data engine, a distributed sort of analytical engine, yeah, maybe you could build on that. Yeah. Lots of functional programming language jokes about things getting so abstract and pure that no one's actually expected to use them. Perhaps we should pull it back to something more concrete and everyday: the Python library. If we're talking about analytics and data scientists, we have to dive into the use of Python with this.
00:41:38
Speaker
So, which of you two is the Python API expert? I'm going to give Ned the talking stick for this one. Yeah, I did the Python chapter. I really like the DuckDB Python client. There's a little bit of surface area to it, and it can take a little bit to get your head around it, but some of the basics, to hit the ground running and just start using it, yeah, it doesn't take very long. I think one of the really nice features that you get from the Python client is the relational API. When you write a query using the relational API, and I would recommend most folks, unless you have a reason not to, just default to using this, you can write a query, whether it's with SQL or a non-SQL way of doing it, which we can get back to.
00:42:35
Speaker
The result that you get isn't actually a chunk of data that's been fully processed, like a complete data frame-like object. It's a relation, which is an abstract representation of the query. If you're looking at it in a notebook, you will see the first lot of data, I think it's up to about 10,000 records, and that's because, hey, I want to actually see what I'm doing as I'm prototyping. But if you're working with a million, potentially billion-row data set, you probably don't want to be materializing each query fully. So you get this lazy representation of your query that, until you ask it to actually fully materialize, or just peek at the first bit, won't evaluate anything. But then you can start doing interesting things with that. We can
00:43:26
Speaker
run a SQL query, and then we get the results of that back stored in a relation, like there's just a variable assigned to this relation object. And then you can actually query that relation object. You can do, you know, select star from the variable name of that relation, as long as it's in scope in your Python process, notebook, what have you, and it'll actually start pulling data through essentially a composed pipeline. To me, it sort of feels like, as a Python programmer, well, this is kind of how I might decompose code into functions. Well, I can decompose queries into these logical blocks.
00:44:07
Speaker
Yeah, because the thing about SQL is, I really like SQL, but one of its drawbacks is it doesn't really compose. Right. And do you know if, under the hood, that's still managing to optimize the final query as a whole? Yes, it does. So it will take the final yield of that chain that you've composed, and then the query planner... and this is actually, you've hit the nail on the head, this is a great reason to be using it, because for that final composite relation, the DuckDB query planner can now do its magic and optimize it for you.
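A small sketch of that composition style in the Python client (the file and column names are hypothetical):

```python
import duckdb

con = duckdb.connect()

# A relation is a lazy description of a query, not materialized data.
events = con.sql("SELECT * FROM read_json_auto('events.jsonl')")

# Compose on top of it with relational methods...
logins = events.filter("kind = 'login'").project("date_trunc('day', ts) AS day")

# ...or by naming the Python variable from SQL; the client resolves it from
# the enclosing scope. Nothing runs until a result is requested, so the
# optimizer sees the whole composed pipeline at once.
daily = con.sql("""
    SELECT day, count(*) AS logins
    FROM logins
    GROUP BY day
    ORDER BY logins DESC
""")
print(daily.limit(10))
```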
00:44:48
Speaker
So there's more opportunities to do those optimizations. This is reminding me of my own use case. But maybe I'm going to pick on Simon and we can talk about this for a bit: the idea of embedding, beyond embedding into a Jupyter notebook. So I've had a lot of fun with DuckDB, embedding it into a web page.

Running DuckDB in Web Browsers with WebAssembly

00:45:13
Speaker
Oh, have you been using the Wasm executable? I have. It's the thing everything needs to compile to these days in order to be hip. Absolutely, absolutely. So Wasm, which I think is the shortcut for WebAssembly, is essentially,
00:45:34
Speaker
as I understand it, a bytecode runtime that can sit in a browser. And DuckDB, you can pip install it, you can put it in your R library, but there's also a runtime compiled into Wasm. So yes, your browser can natively run a full DuckDB installation. And this, to me, just blows my mind. You have a very powerful runtime, which is a standards-compliant web browser, and you can bring in WebAssembly code. So DuckDB compiled as WebAssembly can sit there in a sandboxed area of your browser.
00:46:19
Speaker
And you really can hit up public endpoints, such as public APIs or object stores, and bring in data. And everything that we've been talking about, about writing SQL and extending DuckDB, you can do while DuckDB is running as WebAssembly within a browser. And this I just find astounding, incredibly cool, but astounding. And WebAssembly is super neat in itself. It's got, depending on the browser, a virtual file layer, so you can read and write files off a virtual store. So if you've got a fairly modern web browser, when you
00:47:09
Speaker
export or import data, or write and read Parquet files or CSV files, it's running in your browser against a local file system presented by your browser. And this is a fairly cool and novel thing and quite a useful thing to play with, but it's got some real-world implications, which some vendors are now using in their high-performance visualization and notebook experiences. I might be getting this wrong, but I think Mode has been public about some of their analytic environments, which are essentially WebAssembly, where the outward experience is very fast
00:48:00
Speaker
processing and analytics and carving out of data, but a lot of this is implemented using Wasm, just to speed up that interaction. I might be getting the vendor wrong, but some of those use... I dug into that one, actually. Yeah, because there was a blog post on that, and they did absolutely switch to using DuckDB. I think it might be more of a server-side thing using the JDBC connector. So it's a bit more like a BI sort of thing: think of DuckDB being your data cube that you plug into an interface they might already have, unless that's changed and they could also be Wasm now. But there's a few others, like there's an open source library called Observable,
00:48:45
Speaker
which is doing some really interesting stuff around BI, business intelligence, as code. So helping spin up dashboards and reports with Markdown and inline code. Actually, I think one of the things they might have come out with recently is, what used to be static reports, now the thing that you spin up is actually an interactive app, and I believe that is using Wasm, DuckDB Wasm. Yeah. So that is my use case. So maybe I should get Observable on the podcast, because it's just this really nice thing of, like, I know how to spit all my data to a web page, and now I'd really like to just write a SQL query and see a pretty chart. And the fact that you can power all of that in the web page is really cool. And also the fact that you can have SQL running in the client and not have security issues.
00:49:37
Speaker
That's right, because as you can imagine, web browsers work in a very hostile environment. A lot of safeguards and virtualization you get for free by using a web browser. And the fact that you have a very powerful runtime is just a really nice opportunity to get the best of, you know, high-performance data processing with the convenience and essentially the ubiquity of browser runtimes. And by browser runtimes, we're talking, you know, desktop browsers right down to Android phones. So you can literally run DuckDB in an Android web browser.

Versatility and Low Dependency Requirements

00:50:21
Speaker
So maybe we should talk about that, the kind of performance overhead of actually running it. Do you know anything about, like, the size load, the performance load, that shipping DuckDB to a user would cost? It does take a bit of a hit at load time, because you've got to pull down the DuckDB modules, libraries, and I think their recommended advice is, if you need a snappy page response, maybe that might not be the right tool, because I think they've erred a little bit more on the side of, let's just include
00:51:01
Speaker
you know, the kitchen sink. We want to give you an analytic workbench, so we're not going to kind of prune down DuckDB. Obviously, as Simon mentioned, there are a few things, because you're working in a browser sandbox, there are some constraints that you have to work within, but it does give you most of a full DuckDB. So I think one consideration is probably the load time of your page, but if you've got a bit of tolerance there, it's a pretty good use case. Yeah, maybe for a company intranet it's a non-issue; for user-facing stuff, maybe you could just download it when they download an app. That'd be interesting. I'm sure there's some clever workarounds, but it's definitely something to be mindful of as a potential drawback.
00:51:47
Speaker
One thing I'd be curious about, and I throw this out in case either of you have thought about it, is sticking it on something like an embedded chip, like an ESP32 or something. Obviously, it's going to work on a Raspberry Pi. Have you ever heard of anyone using it on something smaller than that? I'm certainly aware of running it on a Raspberry Pi. Whether it can compile down to a microcontroller, down to an ESP32, I'm not entirely sure. Let's leave that as a research question. It can run in a lot of places. It might be pushing it. It is worth mentioning, though, the reason why it's almost like, oh, maybe? It's because it runs in a surprisingly large number of places.
00:52:31
Speaker
If you're talking about DuckDB core, it has zero dependencies. It's all written in C++, and the dependencies that it does need for that core are vendored in, which means that it's easy to cross-compile without, you know, challenging compiler chains to wrangle. So if you jump onto the DuckDB documentation and look at the number of official, but then also community-supported, clients and runtimes, there are so many, which is one of the really awesome things. It's like, if you've got a terminal somewhere, there's a good chance you're going to be able to get DuckDB, pull it down, and start using it.
00:53:15
Speaker
That's cool. One place I wonder, and maybe this touches on the limits of it being an analytics database and not a transactional database, but would you ever use it as a kind of smart, embedded cache? I'm thinking something like RocksDB. Would you use it in that kind of space or would the transactional writing capabilities just not be suited?
00:53:39
Speaker
Simon? Yeah, I'm not sure about that specific case of a cache. I am aware that a number of companies have done sort of edge processing, essentially to try and take what are very noisy endpoints, lift out the signals, and then move those signals down to combined warehouse analysis. So one of the public use cases out there is Okta. Folks might be familiar with that cloud authentication software. Oh, the single sign-on thing. Single sign-on. If you work in an enterprise, you've probably used Okta as an end user. I think under the covers it's kind of a really interesting use case. I believe there's a public blog from the folks over at Okta where they're literally talking about trillions of records
00:54:45
Speaker
being processed across many millions of objects in object stores. And their use case is roughly along the lines of: they have some extremely high-throughput collection agents which sit at edge nodes. It's probably cost-prohibitive to literally bring trillions of rows into their cloud data warehouse, but they do actually want to pull out some signals and some aggregates. So they use DuckDB for some of that pre-processing and normalization and operational metadata. And I believe that gives them a good trade-off between trying to process everything and trying to find out where there's some interesting noise or heuristics or data to dive into. But yeah, it's a really interesting use case to think about: maybe there are times where you've got some processing, maybe at the edge, and you maybe don't want to ship all that data back into a centralized place, but you're happy to bring in the high-level aggregates.
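A hedged sketch of that edge-aggregation pattern, with hypothetical paths and column names; the idea is simply to keep compact aggregates and ship only those upstream:

```python
import duckdb

con = duckdb.connect()

# Chew through noisy local event logs on the collection node and keep only
# compact hourly aggregates in a small Parquet file to send onward.
con.execute("""
    COPY (
        SELECT date_trunc('hour', ts)  AS hour,
               event_type,
               count(*)                AS events,
               count(DISTINCT actor)   AS actors
        FROM read_json_auto('/var/log/agent/*.jsonl')
        GROUP BY hour, event_type
    ) TO 'aggregates.parquet' (FORMAT parquet)
""")
```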
00:55:55
Speaker
So yeah, maybe not so much cache invalidation, but aggregating data at scale is definitely a use case that's got some traction. Okay, that's definitely something I can use on my, probably still a Raspberry Pi, sensor out in the field, right, before it starts uploading stuff. But I don't think you should stop there, Kris. I think you can set it as a hobby project to cross-compile down to an ESP32 and see how you go with that. I think you're making a mistake about how much spare time I have in my life, but I'd definitely be interested.
00:56:32
Speaker
Speaking of spare time, thank you, you've given me a segue into what you've been doing with your spare time. Because I definitely want to know, as well as the technical side, about the whole business of writing a technical book and collaborating on it.

Co-authoring a Book on DuckDB

00:56:47
Speaker
It seems like a double-edged sword, right? If you want to write a book, getting someone to do half of the work seems like it's going to make it easier, but then the coordinating seems like more trouble than it's worth. I mean, Ned, give me your experience. What's it like to write a book with Simon?
00:57:07
Speaker
Well, it was definitely an experience. Look, I think it ran the whole gamut, but on the whole, positive. We definitely had challenges. For me personally, the last thing I wrote was my PhD thesis, and I guess it's been enough years that I'd amnesia'd away the pain, so I was like, oh yeah, I can write a thing. Simon didn't have any excuse, sorry. Blissful ignorance, in his case. Exactly, exactly. So yeah, look, as I mentioned earlier, we tried to approach it from the angle that,
00:57:48
Speaker
you know, broadly with DuckDB there are two families of use cases. One is supercharging your analytical workflows, for data scientists and data analysts, and I was wearing that persona, that hat, in terms of what I brought to the book. The other is using it as a building block for data infrastructure and data products, and Simon brought his data engineering hat to that. So we divided up the topic areas and we each drove certain chapters; we collaborated on all of it, but we were each researching and driving specific chapters. That was really good, because there was stuff there that was just outside of my wheelhouse, and I was learning from Simon as we went along, which was really great.
00:58:32
Speaker
I think there were some interesting challenges; we have different working styles. I will confess that the book ended up being as heavy as it is, and I think that's definitely on me. I'll put my hand up: there were times where I was like, maybe I actually needed a blog. Some of this maybe could have gone into a blog post, but I had a place to write words and I kept writing. And look, to be fair, I think there's interesting stuff in there, and the book is meant to be, now I'm like, do I need to defend myself? There's something in there for everyone. What about you, Simon?
00:59:13
Speaker
Well, firstly, I'd say the writing experience has been wonderful working alongside Ned. I did just want to underscore a couple of things. One of the things we were aware of when we started writing this book is that we didn't want to position DuckDB poorly. We thought if we wrote it from a data engineering perspective, we might misrepresent DuckDB as just a database; if we wrote it from a data science or analytics perspective, we might be pigeonholing it. So when we were trying to work out the content, the coverage, and the practical nature of using DuckDB, we wanted to cover an awful lot of territory to do what we thought was justice to a very flexible tool.
01:00:01
Speaker
And I think we realized that, to Ned's point, to do justice here we really did need to approach this from multiple angles, thinking about it from different personas. We did actually want to produce essentially practical material which is going to be useful for analysts and data scientists and engineers and folks who just want to get a job done. So, to Ned's point, I wrote some of the more data engineering and SQL-style chapters, things like explain plans and partition pruning and Hive Metastores. That's very valuable to explore, but we didn't want that to be the beginning and end of the story, which is why it was so important to have all the things that Ned could speak to, which is a lot of the
01:00:57
Speaker
understanding and ergonomics and mental frameworks needed to really express what it means to work with some of those APIs and consumption frameworks. And also, I didn't want to write the chapter on R, so I was more than happy to pass that to Ned. Yeah, that's one fair way to divvy it up. Did the tone of the collaboration change when you reached the editing phase? I mean, as someone who's written plenty of pull requests and received plenty of code reviews, you can imagine that sometimes you find out people are really different than you thought when they're reviewing what you've created.
01:01:40
Speaker
I have to say it was an absolute, I'm going to say a joy, a joy in the rear-vision mirror perhaps. I appreciated the writing phases, but I also appreciated getting feedback on the things I had written, because it showed some blind spots; it showed that maybe what I thought was crystal clear could require a bit of elaboration, and sometimes there was stuff I was really passionate about that, to another audience, was maybe a bit verbose and could be condensed. So it was somewhat challenging, but I did appreciate that back and forth, which I think really refined the book.
01:02:21
Speaker
And the opportunity to review things, I mean, writing a book gives you an opportunity to learn so much. I really appreciated having the opportunity to provide feedback, because there was depth and experience and expertise that I essentially got for free while writing the book, just by reviewing some of those things. That's an interesting definition of free, Simon.
01:02:47
Speaker
Time's free, right? Yeah, yeah. No, but yes, I 100% agree with that. For me, I really wanted to go deeper into what DuckDB can do, how to use it, and also some of the surrounding concepts and approaches to dealing with analytical data that I hadn't been as exposed to previously. So I very much learned a whole bunch from Simon in the process. Well, I mean, there's a faux pas where you ask someone if they're thinking of having a second child too soon, while they're still getting over the trauma of having the first one. But I hope at some point the two of you collaborate again.
01:03:36
Speaker
Yeah, well, the book does say first edition on it, and we're like, oh, hang on, that's a bit presumptuous. Well, you know, the timing worked out quite well for us, in that we targeted version 0.10-point-something, and that was positioned by the DuckDB project such that the delta from there to 1.0 would specifically be performance and stability enhancements. So, the API, you know, there were no additional features added.
01:04:14
Speaker
No backwards-incompatible changes, which means that folks jumping into the book absolutely should target 1.0. I think there's a DuckCon coming up in Seattle soon; that will be sometime in August 2024. So I guess that's where we get to hear a bit of the roadmap, what's coming up in the next release, and I'm really looking forward to seeing what those features will be. Fingers crossed the content of the book remains accurate for as long as possible, but that's how these things go: they will change, and we want to see changes where they should go. So if we get to the point where another edition is needed,
01:05:04
Speaker
We'll see, yeah. We'll see how we're going. You may find they announce they're doubling down on the amount of R, and then you've got a job for revision two. Well, I imagine you're planning, are you doing a book signing there? Whether they like it or not, you can do a guerrilla book signing out the front door. I don't think either of us is going to be at the DuckCon, this one. I think there are at least a couple? Yeah, they're regular events, so I'm very keen to make it over; it's not always in the same place. If folks want to see what goes on there, the presentations are on DuckDB's YouTube channel.
01:05:44
Speaker
So you can see there's the State of the Duck, which covers what's going on in the DuckDB project now and looking forwards, and then there's a bunch of invited talks from people in the community building on DuckDB or making DuckDB extensions and libraries; really good talks. But hopefully we'll make it to one at some point in the near future. Well, hopefully they invite you. But I will go and check that out for now.

Podcast Conclusion and Book Promotion

01:06:10
Speaker
Simon, Ned, thank you very much for taking me through it, and best of luck with the book launch. Well, thank you very much, Kris. It was so much fun talking to you today. Obviously we're passionate about DuckDB and the community and the great open-source technology, so we're really happy to have had the opportunity to chat with you today about our passion.
01:06:32
Speaker
Thanks so much, Kris. It's been great. Cheers. Bye for now. Bye. Thank you, Simon. Thank you, Ned. I'll put links to everything we've just discussed in the show notes, as usual, and of course the first link has to be a link to their book, Getting Started with DuckDB. It's available in e-ink and tree, as you prefer. I'm also going to put a link to Observable in there, which is my personal favourite wine pairing for DuckDB. I think eventually we'll probably do an episode on that if I can find the right person, but I'll give you the headline: Markdown plus SQL equals a nice report with pretty charts in it. It's a really useful tool for me. It's almost time to go, so I'm just going to give you a gentle nudge. I shall nudge your attention in the direction of the like and subscribe buttons, if you're so inclined. And if you're on a podcast app, the rate and review buttons are there.
01:07:24
Speaker
And for all of you, there is a Patreon link in the show notes if you want to support this and future episodes. If you click on any of those links, any of those buttons, thank you very much. But for now, I've been your host, Kris Jenkins. This has been Developer Voices with Simon Aubury and Ned Letcher. Thanks for listening.