Introduction: The Role of Analytics
00:00:00
Speaker
Analytics. That's something that's been a part of our industry ever since we had databases. SQL gives us structured storage, but there's a reason we call it the structured query language, right?
Emergence of Dedicated Analytics Databases
00:00:15
Speaker
Questions you can ask of your data are at the center of the story in any data processing system. Always more questions to ask of the existing dataset.
00:00:26
Speaker
But even though querying has been around forever, I think dedicated analytics databases are relatively newer.
Exploring OLAP with Clickhouse
00:00:34
Speaker
I think they've come to the fore as an inevitable consequence of having data sets that are bigger than a single machine. When that became the norm, we had to start thinking about specialized strategies for querying large data sets in aggregate. Give me a count star of everything, right? But how's that done?
00:00:55
Speaker
It's time to turn our attention this week to the world of OLAP, online analytics processing. And for that, we're going to take a look at Clickhouse, which grew out of a need to do kind of Google Analytics processing at Google Analytics scale, but has grown up into a more general high performance analytics database.
00:01:15
Speaker
But what does that entail?
Clickhouse: Origins and Development
00:01:16
Speaker
What is an analytics database? What does it need to provide? And how do you make it provide it fast? That's a huge topic. And joining me to dive into it is Alistair Brown, who's been in the data munging world since the start of his career. I think if you look closely during this podcast, you'll see a little Hadoop-shaped scar in the back of his soul. But these days he works for Tinybird, which is a ClickHouse-as-a-service company.
00:01:42
Speaker
We do talk a little bit about their take on analytics at the end of the podcast, but mostly we're talking about what an OLAP database is trying to do and how ClickHouse does it specifically. So if your queries are too slow, stick around for some answers. I'm your host, Chris Jenkins. This is Developer Voices, and today's voice is Alistair Brown.
00:02:15
Speaker
Alistair Brown, coming to us live from the east of England. How are you doing? Hello, I'm good, thanks. Good to have you here. We're going to grill your brain. That sounds like a delicious dish. We're going to pick your brains and grill you. About ClickHouse, right? And more generally, about having a dedicated analytics database.
00:02:39
Speaker
But I'm going to start by letting you give me the lift pitch, as we say in England, for Clickhouse.
Columnar OLAP vs. Transactional Databases
00:02:46
Speaker
What is it? Well, I don't know how much of a lift pitch it is, but it is an open source columnar OLAP database. So we can get a little bit into what
00:03:01
Speaker
columnar and OLAP mean, but the TL;DR is it's an analytics database. So it is purpose-built from the ground up for analytics.
00:03:12
Speaker
It's not looking at replacing Postgres for what Postgres is good at, the transactional-style stuff. It's purely for analytics. It came out of, well, its roots, and the name always rings this association with me: it came out of basically being like Google Analytics, a house where you store your clicks.
00:03:34
Speaker
Absolutely. It's kind of an on-the-nose name, really. Yeah, it almost does what it says on the tin, right? But yeah, that's right. So it came out of a competitor of Google Analytics, built entirely to power their web analytics product, where you've got potentially millions of websites. They're running a little snippet. And every time a visitor lands on the web page, it captures some details.
00:04:01
Speaker
a page view, clicking on a button, all that kind of stuff, with some details: the user's region, where they're from, what page they landed on, what page they came from, all that kind of stuff. And
Data Ingestion and IoT Comparisons
00:04:17
Speaker
that's sending it back from all those websites all at the same time continuously, going into a big database. And then on the other side of it,
00:04:25
Speaker
you've got some kind of application where the people who own those sites or maybe the companies and the teams who manage those sites can look at a big application, a dashboard, and see a bar chart that says your top 10 countries or top 10 devices or all that kind of stuff. That's what it was built for. That's where it came out of.
00:04:51
Speaker
And I gather it's generalized from there, but just that model implies a bunch of things I think we should go through. So the first one is really high throughput ingest of data, which isn't going to change. Yeah, absolutely. I mean, that's one of the core tenets of it. And to solve that kind of use case, you absolutely have to, right? It's not like you've got one stream of data coming from one place that's very predictable.
00:05:20
Speaker
potentially millions of different sources, sort of like IoT, right? Loads of different devices and users, all generating data, potentially at the same time, potentially bursty. You know, in the middle of the night it might be pretty low, and then suddenly in the morning it picks up and explodes. Yeah, it's got to be Black Friday proof. Yeah, exactly. Right. You know, so yeah, I mean, it wouldn't work without being able to handle very high volume ingest. So how's that done?
00:05:51
Speaker
Magic. Give me some technical details. Ah, it's a serious answer. You need a bit of tea for this, I can see. Yeah. Well, it breaks the illusion of what we call streaming, right? So those of us who have
00:06:13
Speaker
worked in streaming kind of know what's behind the curtain. People think of streaming as, oh, well, it's individual data points, just constant, one by one by one by one. Realistically, what it actually is is micro-batching, right? A lot of the time, streaming is really just a nice marketing name
00:06:32
Speaker
over micro batching. So instead of doing a batch every 15 minutes, which would be super slow, it brings it down
Real-Time Analytics with Materialized Views
00:06:39
Speaker
and tries to do it four times a second, for example. So it's still technically taking a batch of data. So it'll grab a piece of data from all of these users that come in on that API and all of these users that come on that API, bung it all together into a chunk.
00:06:56
Speaker
And then those chunks are written, again, I think by default, it's like four times a second, so every 250 milliseconds; that is configurable and may differ. It chunks the rows up together and inserts them as very small batches, basically.
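For readers who want to see what that micro-batching can look like in practice, here is a minimal, hedged sketch using ClickHouse's asynchronous-insert settings. The setting names are from the ClickHouse documentation; the table and columns are purely illustrative, and the speaker may equally be describing batching done in an ingestion layer in front of the database.

```sql
-- Hedged sketch: let the server buffer small inserts and flush them as
-- micro-batches, instead of writing one data part per tiny insert.
SET async_insert = 1;                   -- buffer inserts server-side
SET wait_for_async_insert = 1;          -- acknowledge once the buffered batch is flushed
SET async_insert_busy_timeout_ms = 250; -- flush roughly four times a second

-- Illustrative table and columns, not from the conversation.
INSERT INTO page_views (ts, user_id, url, country)
VALUES (now(), 'user_42', '/pricing', 'GB');
```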
00:07:23
Speaker
Okay. We might revisit that and go a bit deeper, but that takes us to the next implied stage of a big analytics pipeline, which I think is: are you going to do a lot of pre-calculation, or are you going to try and store it flat and process it at query time, which is very hard with potentially billions of records you need to analyze? What's ClickHouse's opinion on pre-processing and denormalization?
00:07:40
Speaker
It's optional. You obviously can go and do a whole bunch of denormalization if you want to. And if you are going truly into massive scale, then you probably want to do some. But one of the most interesting features of ClickHouse, in my opinion anyway, is its incremental materialized views,
00:08:08
Speaker
which don't entirely get rid of the need for that, but for most people, pretty much get rid of the need for it.
00:08:20
Speaker
For people that are familiar with materialized views, generally a materialized view is you write a select query and you get a result and then it executes the query and it saves the result of the query into a table so that you never have to go and compute it again. It's pre-computed and the results are stored in a table. That's the traditional materialized view and normally that runs on a schedule. You say every
00:08:45
Speaker
day or something: go and rerun the materialization query, take the result and store it in the table, and then you've got that day's result. Normally you have a scheduler and it kicks it off and just says, go and do it, and it recomputes the whole thing.
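As a point of reference for the traditional, schedule-driven kind being described, here is a minimal sketch in Postgres-style SQL; the table, column, and view names are illustrative.

```sql
-- A classic materialized view: the whole result is recomputed on demand.
CREATE MATERIALIZED VIEW daily_totals AS
SELECT date_trunc('day', ts) AS day,
       sum(value)            AS total
FROM events
GROUP BY 1;

-- Typically triggered by an external scheduler (cron or similar),
-- and it recomputes the entire result each time.
REFRESH MATERIALIZED VIEW daily_totals;
```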
00:09:07
Speaker
More recently, certain systems have attempted an incremental materialized view, where you can say: actually, here's a date, and I need to go and recompute the materialization for data beyond that date. But it's still doing quite a big query, and you have to schedule it and pass a date. What ClickHouse does, and it's always interesting to try and come up with the right term for explaining it, is like an event-driven
00:09:36
Speaker
incremental materialization, right? So it happens upon ingest. So in the same way that you would have a materialization query, in ClickHouse you have your materialization query, which is, you know, doing a select and casting column A to a different type and doing some sort of aggregation, et cetera, et cetera, whatever it is, normal SQL. But instead of going over the entire table and running it for every row, every row, every row, and then rerunning it every day,
00:10:06
Speaker
What it does is every time a new row comes in, that row goes through the materialization query and it computes the new materialization value of the new row and combines it with the previous value of it. So it's triggered not by a schedule, not externally saying rerun this.
00:10:27
Speaker
It's triggered every time you do an insert to bring a new row in, and it just appends and then merges the new materialization onto it, which is super efficient. You don't have to go into these massive re-computations of the materialization, but it also means that
00:10:47
Speaker
a lot of the denormalization or the pre-processing you can actually do as part of that materialization. The database itself is doing that sort of denormalization or the transformation and then saving the result. Then at query time, instead of having to go and do any of that and all the complex transformations and joins and aggregations, you can just talk to
00:11:12
Speaker
the pre-computed materialization, without needing any sort of external tool to process it. Right, and is this generally user-defined? Because there's a certain... Okay, so the more general question here is: how much is ClickHouse still a click analytics database, and how much has it generalised into a general analytics database for any use case?
00:11:38
Speaker
I mean, it's very generalized, right? You can still see hangovers of when it was specifically for doing web analytics. Of all the databases in existence, it's probably got the
00:11:54
Speaker
most comprehensive set of out-of-the-box functions for breaking down URLs and stripping HTTP paths out and finding query parameters and stuff. The amount of time people spend writing regex patterns to do that in other databases, and ClickHouse has just got millions of them,
00:12:14
Speaker
for anything you might want to do with URLs and all that kind of stuff. You can really see, when you get into its functions, that it was obviously doing a lot of stuff with URLs. It's got a lot of stuff like UUIDs and ULIDs, which is a slightly more modern UUID,
00:12:32
Speaker
and IP addresses and geolocation, stuff that you would always expect to see in web traffic. But outside of those functions, everything about it is pretty general, right? It's a general query engine. It's a general columnar storage engine. None of that is really tailored specifically for web analytics.
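A quick taste of the URL helpers just mentioned, for orientation; these are documented ClickHouse functions, though the example URL and aliases here are purely illustrative.

```sql
-- Break a URL apart without hand-rolled regex.
SELECT
    domain('https://example.com/docs/intro?ref=newsletter')                     AS host, -- 'example.com'
    path('https://example.com/docs/intro?ref=newsletter')                       AS page, -- '/docs/intro'
    extractURLParameter('https://example.com/docs/intro?ref=newsletter', 'ref') AS ref;  -- 'newsletter'
```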
00:12:55
Speaker
Okay. So there must be a mechanism to define and use these incremental materialized views. Yeah, so they are just SQL queries, right? So you write a SQL query and you might say, select
00:13:15
Speaker
toStartOfDay, right? So if you've got a timestamp, but you actually want to aggregate by the day, for example, you don't want it to be just per row, you want to bucket it into windows of a day. It's got a lot of really awesome date functions, which are just amazing time savers. So it's super, super easy to transform a timestamp into a bucket of, say, the start of the day rather than individual timestamps. And then do
00:13:42
Speaker
a sum, let's say, just a basic sum, and then group it by that day. And what you end up with is one row that says: I am the 1st of January, 2023, and here is the sum. If you've got a billion rows and you run that over all of that data, you get a sum. But then if a new row comes in,
00:14:06
Speaker
you obviously need to go and update that sum, but you don't want to go and recompute it for the billion rows you've already done. You just want to say: I've already got the sum for that day, and now I've got a new row with a value of 20, so I need to add 20 to the previous value. Super simple, logically, right? It makes total sense. Go and find the bucket that this row affects and add it to it.
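The query being described is essentially a daily bucket plus a sum. A minimal sketch, with illustrative table and column names (events, ts, value):

```sql
-- Bucket each timestamp into its day and sum the values per day.
SELECT
    toStartOfDay(ts) AS day,
    sum(value)       AS total
FROM events
GROUP BY day
ORDER BY day;
```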
00:14:29
Speaker
The way that ClickHouse actually does that, and this will be an interesting one to try and explain without tying myself into a knot because I'm not looking at a diagram, is it uses incremental states. A chunk of rows comes in and it will compute... let's say you've got 100 rows that come in and
00:14:59
Speaker
three of them are for today. So that's a chunk where you need to compute the sum, and that's today's one. And then all of the other rows are various other buckets. So in that intermediate incremental table, you've got potentially multiple states for the same day, where the first of January, we saw three rows, and the total was 10. And then we saw another 10 rows, and the value was 100. And then we saw another 100 rows, and the value was 50 or something. And you end up with four incremental states.
00:15:28
Speaker
rather than the 200 actual raw, unique rows that came in. And then your incremental
Columnar vs. Row-Based Storage for Analytics
00:15:36
Speaker
states might be, say, four rows, which is four slightly pre-computed values. But then at query time, you don't want four rows, you want one row. So then you have to consolidate.
00:15:47
Speaker
So ClickHouse has a background process that takes all of those incremental states and merges them. So, like every 10 minutes, just on a schedule behind the scenes, it's going and compacting all of those incremental states: go and find all the incremental states for January the 1st and compact them into one. And then the next time I get
00:16:07
Speaker
an incremental state for January the 1st, I just add a new row. And now I've got two rows, which is the combined state of all the previous states plus the new state. And again, in the background, it's going to go and compact those eventually so that you again only have one. But at query time,
00:16:23
Speaker
So when a user comes along and says, select what the sum is for that day, you don't actually know how many incremental states there might be, because it runs every 10 minutes in the background. You don't know if that's run. You don't know if there's 10 states or one state. So at query time,
00:16:39
Speaker
when you use these incremental materialized views, you have to go and say, do the merge. And it goes and looks for any of the parts, the incremental states, and merges them at query time and goes and does it. Now, you might get lucky, and they've already been merged. And so it doesn't have to do anything, right? So there's no overhead at all. Or there might be five states and has to go and combine them all, which is still a lot less overhead to combine five states than it is to compute the sum over
00:17:07
Speaker
a million rows, potentially, which is just a little intricate detail of when you do those materialized views, you have to not only think, when you're writing the query, what do I want to materialize? But when you query them, you have to think, I need to remember that these are incremental states that I need to merge for the query. OK, so this does actually leak into user space. It does, yeah. So they have these what are called state and merge combinators.
00:17:37
Speaker
Instead of just doing a sum when you want to do a materialized view, you append the state combinator onto it. So you say sumState, whatever the field is that you want to sum, and that tells ClickHouse behind the scenes: okay, I'm doing a stateful sum, where I want to sum whatever I get and maintain a state,
00:17:58
Speaker
sum whatever I get next, maintain a state. And then it goes and does all of that merge process behind the scenes. You never have to think about it. You don't have to do anything about it. It's all automatic. But then at query time, you have to use the merge combinator to say, select sumMerge of the field from the materialized view. And then it knows to go and make sure that the final merge has happened at query time.
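Putting the pieces together, here is a minimal sketch of the pattern described above, using the state and merge combinators; all table, view, and column names are illustrative.

```sql
-- Source table receiving the raw rows.
CREATE TABLE events
(
    ts    DateTime,
    value UInt64
)
ENGINE = MergeTree
ORDER BY ts;

-- Target table holds partial aggregate states (possibly several per day),
-- which ClickHouse compacts in the background.
CREATE TABLE daily_totals
(
    day   Date,
    total AggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY day;

-- The materialization query runs on every insert into events.
CREATE MATERIALIZED VIEW daily_totals_mv TO daily_totals AS
SELECT
    toDate(ts)      AS day,
    sumState(value) AS total  -- store a partial state, not a final number
FROM events
GROUP BY day;

-- At query time, merge whatever states currently exist for each day.
SELECT
    day,
    sumMerge(total) AS total
FROM daily_totals
GROUP BY day
ORDER BY day;
```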
00:18:24
Speaker
Okay, so that raises sort of a side question. To what degree is the SQL interface standard? Yeah, that's always the fun one. So I think if you ignore some of that materialized view stuff, then most of the SQL is pretty standard, right? If you're using something that is
00:18:51
Speaker
you know, if you write some SQL that is pure ANSI SQL, 99% of the time you can copy and paste that and it'll work. Most of what ClickHouse adds is stuff on top that extends it beyond that, rather than changing what came before. Right.
00:19:13
Speaker
Right. So I guess in that case, we should step back for a second, because you're implying that you need a certain amount of analytics-specific mindset to come to this database. So maybe we should talk about why Postgres, to pick a relational database that does it all, isn't enough.
00:19:36
Speaker
Yeah. I mean, let's go with Postgres, because when I say Postgres, the same largely applies to MySQL and Oracle and whatever. I'm sure everybody in the Postgres world has mixed feelings about their trademark being used that broadly, but also they are the de facto. And by the way, I love Postgres. And the thing that I always try and make clear is that no
00:20:01
Speaker
analytical database, especially ClickHouse and all the other ones that compete with ClickHouse, is coming and trying to compete with Postgres and take away from Postgres and say Postgres is old and deprecated, right? They're different tools for different jobs. Yeah. I think in certain companies, once you get past a certain size, the idea that you'll have one database to rule them all is falling away. Yeah, yeah, absolutely. I mean, the thing is, there are
00:20:30
Speaker
At small scale, analytics works on Postgres. People do analytics on Postgres, they do sums, they do averages, it works. The challenge becomes when your data grows, your amount of users grows, that's when it starts to become really challenging to scale.
00:20:52
Speaker
There's a bunch of reasons for that, probably, and some of them are probably not worth going into because I'm probably not the right person to talk about them. But one in particular is the way the data is stored. So when I introduced ClickHouse, I said it's a columnar database.
00:21:14
Speaker
And this is one of the main differences with an OLTP database or a transactional database, like Postgres and Friends, where you do an insert and a row comes in and that row's got columns, A, B, C. You take that whole row and you store that whole row as one thing on disk.
00:21:40
Speaker
And then the next row comes in and you take that whole row and you store it on disk next to the other row. And then every row that comes in just gets appended to the end and you've just got a big, you know, list of full rows. So that is very good when you want to say, get me row three, right? I want all of the columns of row three, just scan through them and find me the third one.
00:22:08
Speaker
where it becomes challenging is when you want to say, I don't actually want one row. I don't actually want all of their columns. All I want is a sum across all of the rows of column three. Because what you end up having to do is go, right, read the entire of row one, get me column three, read the entire of row two, get me column three, and so on and so on and so on through the whole thing, right? Yeah. With a column store,
00:22:37
Speaker
what you do is every time a row comes in and you've got columns A, B, C, you take column A and you store column A over here on disk and you take column B and you store column B over here and it might be on a different disk, right? Completely different spindle and take column C and you stick it somewhere else.
00:22:58
Speaker
right? And then when the next row comes in, you take column A, and you put it right next to the value of column A for the previous one. So one disk you end up instead, instead of being row, row, row, row, you end up with column A, column A, column A, column A, column A, and then somewhere else, column B, column B, column B, column B.
00:23:16
Speaker
And so when you come along and say, give me a sum of column B, I never have to go and read the big file that's on disk with all of column A. I never have to go and read the big file on disk with all of column C. I can just go to the disk that has column B and say: scan through every single value, read everything, and sum it. And I never have to do anything else, because it's already there, all together. And that is one of the biggest differences, right? That is columnar versus row-based storage.
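A rough sketch of the layout difference being described, with purely illustrative file names:

```sql
-- Row store: each row is stored contiguously.
--   rows.dat: (a1,b1,c1)(a2,b2,c2)(a3,b3,c3)...
-- Column store: each column is stored contiguously, possibly on different disks.
--   a.bin: a1,a2,a3,...   b.bin: b1,b2,b3,...   c.bin: c1,c2,c3,...

-- In the columnar layout this aggregate only scans the file holding column b;
-- a row store has to read every full row just to pick out b.
SELECT sum(b) FROM t;
```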
00:23:47
Speaker
Yeah. That's a good way of thinking about it. It's like, um, how much of a difference does that actually make though? Cause I mean, I'm going to stick my finger in the air and say, um, the average database table is 20 columns wide. So is it dividing by 20? Um, I bet there are outliers in there. So, I mean,
00:24:12
Speaker
In my experience, in the world of analytics... so certainly when you're in the transactional world, you do end up with tables that have 20 columns, or maybe 50 columns if it's quite exotic. In the world of analytics, you quite regularly end up with 200 columns or 700 columns or 1,000 columns. That's a lot of columns. If you have to scan all of those for every row,
00:24:37
Speaker
That's a lot of overhead. Generally, with analytics, the other thing is you're having generally a lot more data, like a lot more rows. In a transactional database, you might not need to keep
00:24:51
Speaker
10 years' worth of data, or you're doing a lot of upserts and deletes, right? You might delete old data. When somebody changes something, changes an order, you just go and update the previous row. So there isn't another row to scan; the previous row was just updated. In an analytical database you're not really doing updates and deletes most of the time. What you're doing is constantly appending. So a change comes, you append it, and it's another row. And then you append and you append. And then in your logic,
00:25:21
Speaker
if you only want the latest one, you say, well, get me the latest one and ignore the older ones. Or you're actually interested in being able to analyze how many changes there were and what happened between changes, so you want the log. But all of that means that you end up with
00:25:37
Speaker
lots of rows. Your transactional database might have a row per user and you've got a million users. But your analytical database, if you've got a million users and you might have 20,000 rows per user, because it's a log of everything that they do and what changed and all that,
00:25:58
Speaker
So you can end up with easily going into petabytes of analytics data and billions and billions of rows that you're trying to go through. And so it's a compounding effect, right? Even if you didn't only have 20 columns, but you've got
00:26:14
Speaker
30 times the amount of data, it's a compounding effect if you have to go and scan all of that data. But then, actually, a lot of analytical systems are much wider, because, and this goes back to the denormalization stuff, quite often analytical databases don't
00:26:34
Speaker
normally do the referential integrity and primary keys and foreign keys of transactional databases. You just want one row that's got the entire picture in it, which makes it significantly easier to do your analytics.
Clickhouse in Multi-Database Environments
00:26:52
Speaker
Going back to the pre-processing question, you might do that before it reaches the analytical database, you might pre-process it and then store it, or you might do that denormalization inside the analytical database.
00:27:05
Speaker
Okay. So does that mean that generally you'll be using something like ClickHouse in concert with other databases? Is it part of a balanced breakfast? Yeah. Unless you are doing something hyper-specific where you are literally only doing analytics, then you are going to have more than one database. You know, if you're building
00:27:34
Speaker
any kind of front end. I mean, let's say like Uber, right? You're doing Uber Eats or something. Uber Eats, they use analytical databases, right? But they also use transactional databases. So when you log in and you go and get your username and password and you get your profile information, it's got your email and your name and your phone number and all that, transactional database, right? You wouldn't want to use anything else for it. But when you want to go and look at
00:28:02
Speaker
What is the expected delivery time of all of those restaurants because you're looking at how long did it take for all of the other users who ordered and then got their food delivered? What was the average delivery time for each of those so that you can display to the user, this restaurant is taking about 30 minutes to get food to you and that helps users reason about what restaurants they want to go to.
00:28:25
Speaker
doing that at the scale of Uber Eats, where you've got millions of users, loads of data, loads of restaurants that you need to go and crunch. And not only have you got a million users, but you've got potentially tens of thousands to hundreds of thousands, maybe even millions of users at the same time. It's not just like you've got millions of users, but it's only like 10 doing a query at a time. You've potentially got 100,000 users clicking
00:28:53
Speaker
what restaurants are open right now, and they all expect to get a response. And because it's an app, and we're human, and our attention spans are tiny, you click restaurants, and how long would you sit there and wait for that page to load to tell you what restaurants were available and how long they were taking to deliver food, right? You're not going to sit there, click it, and it goes,
00:29:19
Speaker
OK, come back in five minutes and we'll tell you. You'd be like, all right, I'm going to want to install this app and go and get a different one. Right. People expect it to. You click the button and at most like two seconds later, the list populates and you get all of it. So it's super quick with very, very high concurrency.
00:29:39
Speaker
So, are you actually using it directly to feed? Is it user-facing? When a user clicks on Uber Eats, will they be running a query on Clickhouse, or will they be reading a cache that Clickhouse is feeding?
00:29:55
Speaker
I don't know specifically for Uber Eats, but generally, with them as the example, you run it directly over the analytical database. The idea is that the data
00:30:13
Speaker
needs to be pretty fresh, because people aren't that interested if the data is like an hour old; then it's not really relevant anymore. The idea is that it comes into the database live, and it actually runs the query, computes the query. Obviously there are interactions of data being cached on SSDs and then going into OS page cache, and you can get results out of there. But it's not like you're
00:30:36
Speaker
pre-computing a result and then putting it in Redis, and then actually you're just asking Redis: can I have the pre-computed result that was actually computed ten minutes ago? So, as a programmer writing a web server, I'm expecting to write queries against, say, Postgres and against ClickHouse to get the whole user experience I want.
00:30:59
Speaker
Exactly. Depending on what it is you're trying to do. You're trying to get profile information, you go to Postgres; you're trying to get analytics, you go to ClickHouse or whatever other flavor of analytics database. This is making me think of a very specific architecture. CQRS, of course. Are we skirting around the term CQRS here? Maybe.
00:31:27
Speaker
In which you have one place where you store commands, store data, but then most of the time you're reading from an analytics database, from a pre-computed view database. I mean, I try not to think too much about those kinds of patterns, to be honest. OK, fair enough. I look at, actually, what are people trying to do, and just go from there. Because I don't know.
00:31:56
Speaker
Lots of patterns have come up, and lots of stacks, and they end up being inflexible. People just go, I feel like this is the pattern that I am supposed to go with, and it doesn't always work 100% for every... I don't know. I'm not a big pattern person. Fair enough. Okay. In that case, let me put it this way. What real-world common recipes, common combinations do you see?
00:32:23
Speaker
Well, I mean, what we've spoken about already is a pretty common pattern of having a transactional database and an analytical database. And it depends on the
00:32:36
Speaker
like the size and the maturity of the organization. And it's almost always a journey, right? It's very rare that you go straight out of the gate saying: it makes total sense to go and get a super scalable Postgres flavor, you know, go and get
00:32:55
Speaker
CockroachDB, right, which came out of like a Google research project, and now you can federate it to millions and millions of servers, and we need to think about that kind of crazy scale. And then we need to go and get a crazy scalable analytics database to do this kind of stuff. And then we probably want a data warehouse on the end of it to power all our reporting and stuff, which, by the way, is probably something worth talking about as well, bringing data warehouses into the mix.
00:33:23
Speaker
what ends up happening is people pick what they're familiar with, right? That they can build with and get something out the door, which quite often is Postgres, right? Because who hasn't used Postgres or MySQL or wherever it is.
00:33:38
Speaker
And they build with that until it stops working for what they're trying to do. So if you've got very few users, not many users going at the same time, and not much data, then doing a sum over a million rows once every 30 seconds or something in Postgres is fine. Why not?
00:34:00
Speaker
The problem is, once you start scaling, you start to go: oh, actually, maybe this is not scaling. And now it's affecting the user experience, because more users are hitting the app, the queries are taking longer, and users are getting a bit frustrated that it's being quite slow. And then you start looking for alternatives. That query we used to run that took one second now takes 10 seconds, even with caching. That's inevitably what happens. I mean, the interesting thing with caching is that you
00:34:30
Speaker
Inevitably, people do it with Postgres. And then they think: hey, the way I'm going to speed up Postgres is I'm going to put Redis in front of it. And that is great, and it works. But what you end up with is you solve the last-mile problem, which is the latency of the query. So the user hits the button, and they expect a response. And Redis is amazing. At Tinybird, we use Redis. It's great.
00:34:56
Speaker
It solves that latency, right? The user gets the response in 30 milliseconds or whatever it is. What it doesn't solve is the first mile problem, which is the freshness, which becomes quite challenging if you're just relying on a caching layer because
00:35:15
Speaker
Yeah, the user can access the data really quickly, but the data is quite out of date, based on whatever your cache policy is. And then you have to think about: how do I evict out of my cache? And how do I repopulate my cache? And it gets quite complex to do that stuff as well. So quite often that is the first port of call. People go, hey, I'll just stick Redis on it, I'll cache it. And that works for a little bit. And then users start saying, hey, my experience has degraded again, because all the data is out of date.
00:35:40
Speaker
And then they start going into analytical databases and, hey, how can I actually get a system that can handle these big aggregations over big amounts of data with lots of users at the same time? And you might end up with ClickHouse. I mean, there's plenty in the space, right? There's a lot of analytical databases coming up in this space. ClickHouse is, we haven't really, I mean, we haven't spoken about this yet, but one of the nice things about ClickHouse is that it fits quite well
00:36:09
Speaker
with users who are quite familiar with traditional databases like Postgres. So Postgres, it's really easy to come along and download one binary and run it on your machine. And you've got a Postgres, and you can start developing it. You can do it locally. You can stick it on an EC2. You can go ahead and find all of the serverless stuff for it. But it's super easy to get started because you get one binary, throw it up, and hey, you've got a Postgres.
00:36:39
Speaker
a lot of analytical databases. I came up in the world of Hadoop. And if anyone's ever worked with the world of Hadoop, what you will know is that single binaries do not exist. And what you end up with is going and downloading 40 different binaries and then having to try and work out. So if I want to run this one, this tool, then I need to go and have this tool because that's its metadata coordinator. But then even this tool itself, not only does it have an external dependency, it's got
00:37:07
Speaker
six different services within the one component, where it's got the master and it's got the workers and it's got the coordinators within it. They become super complex topologies that are a nightmare to manage. And a lot of analytical databases came out of that era. So things like Apache Druid, which, so I used to work at Cloudera, and that was a project that we worked with at the time.
00:37:35
Speaker
It's a pretty complex topology, right? It's got lots of different roles. You end up having to deploy one type of node, and another type of node, and another type of node, which makes it very difficult. And that presents problems in production as well as when developing locally.
00:37:49
Speaker
Yeah, so it's just a super complex model when you say, hey, I'm just a developer and I want to run one on my machine to run some code against, to go and develop some stuff. Whereas ClickHouse does have literally a single binary, where you can go and download it and run it on your machine locally. Nothing to configure, nothing to maintain. You just go, and you get a ClickHouse, and you can start developing against it, which
00:38:18
Speaker
It doesn't sound that impressive of people coming along from Postgres going, well, duh, why wouldn't you? But it's actually pretty uncommon in the world of analytical databases to get one.
00:38:28
Speaker
that can actually do that, which is why it's become super popular, interestingly, in embedded applications. A lot of places have started to actually embed the Clickhouse binary as a temporal database. Just spin up a super quick temporal in-memory database, load a file in, do some analytics, output it, and then kill it, and it goes away, and just use it as you need it.
00:38:55
Speaker
super, super similar like DuckDB, right? It's come along and done this, done a very similar thing. So you quite often see them get compared of like Clickhouse Local and DuckDB of just being like an in-memory, very quick data, like temporal database.
00:39:10
Speaker
Oh, curious. Define temporal database for me quickly. So basically just one that you can bring up, do a little something, and then get rid of it, right? It is ephemeral. It's not long lived. It doesn't stick around. It doesn't sit on a server. And it's always there coming around for queries. It's one where you have an application and it says, oh, I need a database quickly, spins it up, does whatever it needs to do, and then spins it down. And it doesn't exist and doesn't take up any more resources anymore.
00:39:39
Speaker
That raises quick questions about startup time, and is there a memory-only flag? Startup times are very quick. I don't know exactly what they are.
Clickhouse vs. Data Warehouses
00:39:52
Speaker
Most of my interaction these days with ClickHouse is through Tinybird, so I'm not working with ClickHouse Local that often to know exactly what its startup times are. But they are
00:40:04
Speaker
pretty much unnoticeable. It's not like you start it up and then you have to wait 20 seconds for things to get in sync and start. It's half a second, a second at most. I can verify that; I ran it before we started recording, and it was like, okay, this is fast. My first experience was very good. Then I realized I didn't have 4 billion rows to hand to do anything tasty with it.
00:40:31
Speaker
Yeah. Um, um, is there an in memory only mode for this kind of use case? Um, that's a good question actually. Uh, I don't, I don't know if the Clickhouse local one is running by default in memory. I mean, like I would, Hmm. Yeah. I'm not sure. I would assume so.
00:40:59
Speaker
Okay, we'll save that one for later research. I would have to Google it to be perfectly honest.
00:41:08
Speaker
No, this is a closed book test, this podcast. You can't do that. Okay, so going back to that whole pipeline of things you must need for an analytics database, does it ship with like something user interface beyond SQL? Is there an analytics GUI? No.
00:41:32
Speaker
Yeah. Clickhouse itself is like Postgres. It's a database server. It counts as a DBMS. It's a database management system. You drop the R because it's not relational, but it is a database management system. It is just a headless database that comes with nothing.
00:41:54
Speaker
But obviously there's now a pretty rich ecosystem around Clickhouse in, you know, there are connectors for pretty much every like BI tool you can think of, you know, Tableau's and Power BI's and Superset and all that kind of stuff that you would want to connect to.
00:42:11
Speaker
Loads of vendors out there, obviously Tinybird is one, but there's loads out there who have got their own styles of GUIs, whether that's managing clusters through the GUI or an interactive way of building queries and building applications on top of it or doing visualizations. But the open source project itself doesn't come out of the box with a web GUI. Do you have a particular recommendation, or do you want to stay out of that?
00:42:40
Speaker
Well, I am particularly biased, working for a vendor that sells ClickHouse. No, I don't have a particular recommendation. They all have their strengths and weaknesses. If you're doing BI, which we can get onto a bit later... ClickHouse is very versatile. People are using it for data warehousing and BI, but people are also using it to build front-end applications.
00:43:10
Speaker
different vendors and different tools are appropriate for different things, right? So, you know, you wouldn't go and use like a BI tool like Tableau or Apache superset to go and build your front end for Uber Eats. And at the same time, different vendors have put different stakes on what side of the equation they want to work at. So Tinybird has focused entirely on, we think Clickhouse is great for building applications, right? And we want Clickhouse to be the backend that people build user facing applications that do have very high concurrency and all that kind of stuff.
00:43:40
Speaker
Others have taken it the other way: we want this to be a faster Snowflake. So instead of using Snowflake, you come and you just load it into ClickHouse and then you stick Apache Superset on it, and you do your BI and your reporting and your ad hoc analytics. It's pretty good at both, but people have optimized for one particular thing,
00:44:04
Speaker
which maybe gets onto, we were talking about what does a typical stack look like. And I mentioned data warehouses and how do they come into it. Because if you've been working with databases and transactional databases,
00:44:21
Speaker
But you haven't come across analytical databases for applications like Clickhouse. You may well have already come across analytical databases, but for warehousing, like BigQuery, Snowflake, Redshift, and all that kind of stuff, or Hadoop back in the day. Which you might think, well, what's the difference? Why wouldn't I just go and do that? Because they generally are also OLAP, and they are also columnar.
00:44:51
Speaker
But then they differ quite a lot. They tend to have the much more complex distributed architecture where you've got your storage over here, and generally it's cloud storage up in S3 or GCS, blob storage in the cloud.
00:45:10
Speaker
And then somewhere else, you've got your compute. And then whenever your compute actually wants to run a query, it has to go over to blob storage. And you've got the latency of going over the network and getting files from blob storage, reading in the file, and then going through the file, which adds a lot of latency. But generally, warehouses focus on arbitrary complexity. So this is kind of like a, like, I've sung the praises of Clickhouse, and it's probably quite good to talk about its limitation as well, right? OK.
00:45:39
Speaker
The warehouses are super good if you are trying to do insane levels of complexity over insane amounts of data. So if you've got 40 petabytes of data, you've got tens of hundreds of billions of rows, and some crazy analyst comes along, and they're trying to select 100 rows with 20 aggregations around some of those columns.
00:46:09
Speaker
And they are doing 70 joins, right? They're going out to so many different tables and doing these crazy joins. Warehouses are brilliant at that, right? Because they will just go, OK. But they might say, OK, see you next week. Because what they're very good at is just going, OK, I will take whatever you send at me. I will do it, right? I will find a way to do it. But I will get you the response sometime.
00:46:39
Speaker
I don't know how long because what they will try and do is MPP, massively parallel processing, of chunking queries up into very small fragments that will fit in the resources they have.
00:46:55
Speaker
and they will always get you a response, but it could be potentially slow. If you've got limited resources, but you're trying to do this insane, massive query, it won't go, sorry, I don't have the resources to do that. What it will try and do is say, okay, I will do
00:47:13
Speaker
a very small fragment of it, I'll take 10,000 rows and I'll compute the result for 10,000 rows. And then I'll store that. And then I'll go and compute the next 10,000 rows and I'll store that. And then I'll compute the next 10,000 rows and store that. And just keep doing that over and over and over and over again until it's got the whole thing. And then it will take all of those
00:47:31
Speaker
intermediate ones and do the same thing: okay, right, I've got ten thousand intermediate states, I'm going to take those and merge them in the next pass. It just keeps doing it in these stages of breaking up, getting a result, breaking up, getting a result. Which means it's very, very good at handling any complexity of query on a small amount of resources, but it could take
00:47:51
Speaker
however long. And it will handle failure. So if any of those fragments of the query fail, because the network went down or because one of the servers failed, it'll go: okay, I will wait for that to come back up and I'll retry it and I'll get you the result. It will delay me, but you'll get a result, which is exactly what you want
00:48:11
Speaker
if you're doing a massive report over billions, tens of billions of rows and petabytes of data that takes a weekend to compute and you hit the button on Friday and then you come back on Monday and you want a PDF in your email inbox with the report. You don't want to come back and over the weekend it failed and it said, sorry, I ran out of memory halfway through. It's not useful. That is what warehouses are optimized for.
00:48:36
Speaker
Whereas Clickhouse, one of its things that it's not so optimized for is that kind of arbitrary complexity and just throw anything at me and I will make it happen. It will hit into boundaries of that's too many joins, that's too much of a complex query. I don't have enough memory on one server to handle that query, so I'll fail.
00:48:59
Speaker
It's a slightly different priority of interactivity. You saw this originally come up in the Hadoop world. You had Apache Hive come up, which was like the snowflake of yesteryear, which was the throw anything at it and it'll go away and take a week, but it'll get you your answer. Then you had Apache Impala come up, which was like
00:49:22
Speaker
No, you don't want to do a report. You actually want somebody sat at a terminal running a query and they get a response immediately. And the point is that it biases towards interactivity of, I want a result very quickly. And if it fails, fine, but I want the result. So it will fail and just tell you it failed. And then the user can go and retry it. It won't sit there for an hour trying to recover and say, oh, no, I'm going to retry that fragment or retry that fragment. It will just fail and say, no, I fail.
00:49:52
Speaker
Which you might think, why would you ever want that? But it's depending on entirely what you're trying to do and what you want at the time. Do you want a super quick result, whatever? Or do you want a slow result? But it always works. I sometimes think the fast result model really shines where you don't know yet what question you actually want to ask.
00:50:16
Speaker
So you're asking a lot of experimental questions, wanting a quick response. You can say, oh, no, I didn't mean that, I meant something slightly different. Yeah. Yeah. I mean, it also works, even if you do know the query upfront; it works very well in user-facing stuff, right? Where
00:50:37
Speaker
You might think, actually, I don't really want things to fail, right? Because potentially the failure is a bad user experience. But you can build your application logic to say, we'll go and run the query and just tell me if you fail. And I will work out, do I want to retry it and take a little bit extra time? Or do I just want to tell the user, hey, it failed. You need to go and do this because I can't recover from it. So it just gives you the choice of how do I want to handle that user experience.
00:51:06
Speaker
Okay. So one thing the whole data warehouse conversation raises, and joins is the key word here: what if I want to bring in analytics from two different transactional systems? How am I going to do that with ClickHouse? Two or more? Yeah.
00:51:29
Speaker
I'm assuming what you mean is I've got two existing transactional databases. How am I going to bring the data into Clickhouse and do some analytics? If Department A is using Postgres and Department B is using Oracle, and I'm trying to service reports for management who care about both departments, what's my answer? This is one of the interesting
00:51:58
Speaker
topics of analytical database, which is how do you integrate with other systems that are sources of data? Because it's very easy when the source of data is an API and it's new data that's coming in. You can just say, insert that data. It's very easy when it's a Kafka topic and you can just say, well, whatever's on the Kafka topic, just take that in and put it in a table. When you've got
00:52:24
Speaker
And what happens in a lot of big businesses now is you've actually got databases that have been federated all over the organization and that everybody's gone, ah, well, we hired somebody who likes Mongo and we've got a team who likes Postgres and we use Microsoft SQL, which has kind of
00:52:40
Speaker
you know, it's always been a challenge of: how do I integrate all of those things? And everybody's ended up writing loads of glue code, you know, I'm going to write a little custom bash script that's executed on a schedule, and it just goes and takes data out of there and pushes it over there. Or maybe I use an open source tool to do it, like Apache NiFi, or Debezium. Or you do CDC with Kafka and Kafka Connect, which makes it ridiculously easy.
00:53:11
Speaker
or the explosion of ETL tools that are out there in the market at the moment. The data ecosystem has gone a bit crazy with ETL tools, and there's a million different options of crazy ETL tools you can pick. But it's always been a challenge of, how do you do it? Because there's so many different ways to do it. You can go and do it in a batch way with a batch ETL tool that
00:53:33
Speaker
executes every hour. And all it does is it goes to the source database, Postgres, and it says: select star where the time is greater than an hour ago, when I last ran. And then it just takes all of that and does a big insert into ClickHouse. Obviously, the caveat with that is the data is an hour old, so, you know, freshness. Or you can go and do CDC, change data capture, using something like Debezium and say,
00:54:01
Speaker
actually go and tail the binlog. And every time a change happens in Postgres, take that change event and fire it into ClickHouse. Side note on CDC for analytical databases: it's challenging in and of itself, because at the start I said you don't really do upserts, updates and deletes, in analytical databases, which means CDC becomes quite challenging, because if a row is deleted or a row is changed,
00:54:31
Speaker
most of the time... I mean, some analytical databases don't even support updates and deletes at all. You just cannot run them; there's no command for it. Right. Yeah. So how do you map that mental model into your new world?
00:54:44
Speaker
Yeah, so that becomes super challenging. And what you end up doing, right, is you just start appending all of the changes and you just say, like, get whatever the original row is, you append, and then whatever the change was, you append the whole row again, and then you append the whole row again. And if it was just changes, then you can just filter and say, if I've got 10 rows that were all the same row, go and get me the latest one, because that's got the latest change.
00:55:08
Speaker
and stuff like that. And if you've got deletes, you can handle them in different ways, because you can say: well, don't select rows that have got a flag column that says I was deleted, and all that kind of stuff. But anyway, side note on CDC.
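A minimal sketch of that append-only pattern, picking the latest version of each row and dropping the ones whose latest change was a delete; the table and column names are illustrative.

```sql
-- Every change is appended as a new row; resolve the latest state at query time.
SELECT
    order_id,
    argMax(status, updated_at) AS latest_status  -- value from the most recent version
FROM order_changes
GROUP BY order_id
HAVING argMax(is_deleted, updated_at) = 0;       -- drop rows whose latest change is a delete
```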
00:55:37
Speaker
But what you end up with, quite often, is fairly complex patterns for integrating all of these disparate systems; it's a bit more than just pressing a button, and ClickHouse suffers from this too. And it was one of the reasons why, a year ago now, when I was looking at moving on from Cloudera and looking for what's next, you know, I like data, I want to work in data, what is the next company that I join, I realized I'd fought with this my whole career: how do I go and integrate all of these sources and manage all of this glue code and all that kind of stuff. And I was really interested when I came across Tinybird, because they
00:55:58
Speaker
had an opinion about it and they took a slightly different approach, and I liked that approach. That's not to say it is the absolute perfect, correct approach for everyone. Some people like the control of going and doing it themselves and natively writing their own integrations and all that stuff, and that works. For me, I really liked that Tinybird said: we're going to try and
00:56:22
Speaker
handle all of that for you as a feature and just be like, we can connect to Snowflake and sync that data for you. And it's like two clicks. And that was their approach to take to that, which fits in some cases. Right. So a bit like Kafka's Connect ecosystem. Yeah. Yeah, pretty much. You know, like
00:56:45
Speaker
in the early days of Kafka, it didn't have Kafka Connect and you kind of had to do it yourself. And then people said, hey, that was a bit of a pain, so let's do Kafka Connect and we can solve a lot of that pain for you. And we kind of did the same thing and thought: hey, maybe we could be more than just a database and do some of this value-add, like make it super easy to do certain things that everybody's doing.
00:57:13
Speaker
OK, I do want to get into that, but I'm going to push you a little more on the next level, because you've just told me how I connect, say, Postgres from Department A into Clickhouse, or Oracle from Department B into Clickhouse. But then how do I merge those two data sets together?
00:57:30
Speaker
to do analytics queries. So there's a couple of different techniques to do that. I mean, the generic way is effectively: part of your process, and we talked about pre-processing before, part of your pre-processing could be to normalize the data. So if, like,
00:57:52
Speaker
Department A has got customer data and Department B has got customer data, but they're in different schemas, but it's largely the same data. You might have your process like normalize the schema and then just insert them into the same table and have it that way. Or you might just take the sort of ELT approach, which is just take the raw data, dump it into the database and then sort it out with a query, which may be as simple as doing like a
00:58:17
Speaker
create table from select, right? Creating a table that is the result of a select, where the select is doing the transformation. So it's saying: take all of the data from A, transform these fields to look like this, and then union it with a select of all of the data from the other one, transformed the same way, and stick it in a table.
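A minimal sketch of that create-table-from-select approach; the department tables, columns, and engine choice here are illustrative assumptions, not details from the conversation.

```sql
-- Normalize two differently-shaped customer tables into one combined table.
CREATE TABLE customers_combined
ENGINE = MergeTree
ORDER BY customer_id AS
SELECT toString(id) AS customer_id,
       full_name    AS name,
       'dept_a'     AS source
FROM dept_a_customers
UNION ALL
SELECT customer_ref AS customer_id,
       name         AS name,
       'dept_b'     AS source
FROM dept_b_customers;
```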
00:58:41
Speaker
Now, all of the different analytical databases might have different nice techniques that would make that slightly easier. In ClickHouse you've got the materialized views, right? Which make it kind of nice, because you can write that normalization union query
00:59:02
Speaker
that is selecting from the Department A table and the Department B table. And then you could set up a real-time CDC stream from each of those. And then as new rows come in, it's not running on a schedule, it's not doing it in batch; every time a new row comes in, it's doing it in real time and always keeping the result up to date, putting it all into one big table. And then you can run your analytics over that combined table. The other interesting thing that I find, I mean,
00:59:33
Speaker
maybe it's not the most useful for this, is that ClickHouse has the concept of table engines. Not every table has to be exactly the same. You can actually configure exactly how
00:59:50
Speaker
a table works under the hood by using a different table engine. So it's got a whole bunch of different ones, like merge trees, where it can look at IDs and then automatically, if it sees
01:00:07
Speaker
the same ID, it can go and find the latest row for that ID and get rid of the old ones, and automatically dedupe stuff. And it's got replicated tables, where you can say: actually, this table does not just belong in one place; every time something comes into this table, I want you to automatically replicate it over multiple servers.
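The dedupe-by-ID behaviour described here is what ClickHouse's ReplacingMergeTree engine does; a minimal sketch with illustrative names:

```sql
-- Keep only the newest version of each order_id, resolved by background merges.
CREATE TABLE orders_latest
(
    order_id   String,
    status     String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)  -- the row with the highest updated_at wins
ORDER BY order_id;

-- FINAL forces the deduplication to be resolved at query time,
-- since background merges may not have run yet.
SELECT * FROM orders_latest FINAL;
```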
01:00:29
Speaker
So it's got these table engines that you can configure on a per table basis to behave slightly differently depending on what the use case is. But one of them is the null engine. And the null engine is basically...
01:00:47
Speaker
like /dev/null on Linux, like catting output out to /dev/null. It sounds weird, right? But it works as an ephemeral pipe, so you can pipe the raw data into a Null table
01:01:08
Speaker
and then have a materialization query at the end, reading out of the Null table and materializing the result, but you never actually store the incoming raw data that landed in the Null table. With the previous approach, if you have a standard table, you're actually taking up storage, right? Because you're basically duplicating all of Department A's data into a Clickhouse table and duplicating all of Department B's data into a Clickhouse table,
01:01:34
Speaker
but purely for the reason of transforming it and then storing it again. So you end up with three copies that are just slightly different. The Null engine means you can take it all out of A, not store the raw data, immediately transform it, and just store the transformed version. Then you're only storing the actual end result and never the intermediate raw result that you don't care about. That's quite an interesting way you could optimize it.
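Here is a minimal sketch of that Null-engine pattern: raw rows land in a table that stores nothing, a materialized view transforms them in flight, and only the transformed rows are persisted. Names are invented for the example.

```python
# Sketch only: Null engine as an ephemeral pipe in front of a materialized view.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Landing table: accepts inserts but stores nothing.
client.command("""
    CREATE TABLE raw_events (
        payload String,
        ts      DateTime
    ) ENGINE = Null
""")

# Persistent table that holds only the transformed result.
client.command("""
    CREATE TABLE clean_events (
        event_type String,
        ts         DateTime
    ) ENGINE = MergeTree ORDER BY ts
""")

# Fires on every insert into raw_events and writes transformed rows onward.
client.command("""
    CREATE MATERIALIZED VIEW raw_to_clean TO clean_events AS
    SELECT JSONExtractString(payload, 'type') AS event_type, ts
    FROM raw_events
""")
```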
01:01:59
Speaker
Which, again, may well be specific to Clickhouse; I have no idea if the other analytical databases have that Null engine or a concept like it. I can think of plenty of systems that have that kind of transformer notion, but that's another way to model it. Yeah, it's interesting.
Integrating Clickhouse with Tiny Bird
01:02:25
Speaker
So, for time, I want to push on. I do want to talk about Tiny Bird's approach with Clickhouse, and you've mentioned it a little bit already. One of the things that caught my eye about the way Tiny Bird positions itself is
01:02:43
Speaker
kind of the API angle, making API building easy. That doesn't seem to fit naturally with the idea of an analytics database. So take me through that thinking. So, yeah, it's an interesting one. I mean, if you consider
01:03:07
Speaker
what I said before, that there are different directions people have taken Clickhouse in. Some have taken Clickhouse down the route of "hey, we're a faster Snowflake, go and do BI". We've taken the approach of: we think Clickhouse is brilliant for user-facing stuff, and we're not that interested in whether you want to stick Tableau on it and have people drag and drop
01:03:27
Speaker
charts onto it and build ad hoc queries. We think it's better to predefine your queries, put them behind an application, and have your users hit those queries and get results. Generally, the way you end up integrating that style of thing is that applications always talk to REST APIs. If you build an Android app or a web app, whatever it is,
01:03:50
Speaker
generally speaking it's reaching out to an HTTP API and saying: hello, I'm hitting the get-restaurants API, can I have all the restaurants, please? And people who have used Postgres for the transactional side of things will have written
01:04:07
Speaker
an inordinate number of APIs. They'll have worked with all of the different ORM libraries under the sun and all of the different API frameworks under the sun, and they always end up writing the same thing, right? Writing the API and all of the GET methods, then hand-sanitizing user input, then translating that into a model that goes and runs a query on the database, and so on. And once you've spent all that time writing that layer, you then have to go and
01:04:36
Speaker
host it somewhere, deploy it and secure it and scale it and all that stuff. We basically just said, well, maybe instead of just being a database, we could also do that API bit, right? Because we're putting a stake in the ground to say we're interested in applications, we want to help you build applications, and applications almost always end up building their own APIs with ORMs to do that, we'll just save you the time and do that bit as well.
01:05:03
Speaker
Write a piece of SQL, do all of your analytics. Yes, it's Clickhouse, it's a database, it's everything you would expect from Clickhouse. Plus: hit a button, turn that query into an API, and then you get the results from the REST API and you never have to go and write the API yourself.
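For a sense of what consuming one of those published endpoints looks like from application code, here's a hedged sketch: plain HTTP, no ORM or hand-written API layer. The pipe name, query parameter, and exact URL shape are illustrative assumptions, not details taken from the episode.

```python
# Sketch only: call a published analytics endpoint and iterate over the rows.
import requests

TOKEN = "p.XXXX"  # placeholder read token for the published pipe

resp = requests.get(
    "https://api.tinybird.co/v0/pipes/top_products.json",  # assumed endpoint shape
    params={"token": TOKEN, "date_from": "2024-01-01"},    # illustrative parameters
    timeout=10,
)
resp.raise_for_status()
for row in resp.json().get("data", []):
    print(row)
```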
01:05:22
Speaker
That was really just what we thought would make this more useful. Because, I mean, that's the thing with databases, right? A database is great, and a database can be super quick. What you find a lot in analytical systems is people doing benchmarks. Vendor A says, ah, we compared our database to databases B and C, and on this benchmark we were 10 times faster. Then one of the other vendors does the same test and says, no, we were 10 times faster. And the other one says, we were 10 times faster. Who do you believe?
01:05:52
Speaker
But also, does it matter? At the end of the day, are you going to pick a database because one benchmark said it was two milliseconds quicker than the other one? In a different use case, the other one is going to be two milliseconds faster. They're all fast. They're all pretty much the same in terms of performance, generally. Some are better at one use case, some are better at another, and they trade blows.
01:06:21
Speaker
A lot of it comes down to the fact that you're going to have to work with this thing every day, right? You're actually trying to use the database to accomplish something. You're not just buying a database because it's shiny and fast; you're saying, I'm trying to build something and I need a database. And we thought, well, instead of just selling another database, let's actually try to help the people who are trying to do something, and solve another problem on top of just having a database.
01:06:51
Speaker
Yeah, so you're mainly going for the developer experience angle.
01:06:55
Speaker
Yeah. You know, it's not a million miles away from Neon, who you spoke to a couple of weeks ago, doing something very similar for Postgres, right? Being a serverless Postgres: you click a button and you instantly get one, you never think about servers and scale, it takes three seconds, there's no spin-up time or anything like that. And it takes away a lot of the operational complexity.
01:07:23
Speaker
It just helps you actually be productive with the thing, rather than just giving you a database that you then have to go and do everything yourself with. In that system then, will I end up building two front-end APIs?
01:07:41
Speaker
I mean, will I have all of my GETs on Tiny Bird and all my POSTs and PUTs on some other system I've built? How does that play out? That really depends.
01:07:59
Speaker
Obviously, we're generally coexisting with transactional databases within an application architecture, right? So you'll have Tiny Bird, and you'll be making GET requests to Tiny Bird when you say: hey, I want all of this analytics data because I want to display a chart, or I want to make a decision based on some analytics data. But when you want to go and get a user profile, because a user has logged in, you're going to make a GET request to some other API and your transactional database. Now, we are not
01:08:29
Speaker
dipping our toes at all into the transactional world.
01:08:33
Speaker
We take you up to the APIs for the analytical stuff, and then the transactional stuff is on you. I would love to see somebody come and do the same thing Tiny Bird is doing but for the transactional world, which I think people are doing, right? That space is pretty hot right now: you've got Neon and PlanetScale and Supabase and all this stuff trying to put an experience around Postgres to make that side easier. So people are doing it. And yeah, you just end up with
01:09:00
Speaker
your application making a GET request to your API for its transactional stuff, and then a GET request to native Tiny Bird APIs for its analytics as well. For the PUTs and sending data in, that again kind of depends. It might be that
01:09:23
Speaker
if a transactional thing happens, you want to send it to both. So maybe you just have the application post to both: if your transactional database has got a POST API, you can just post it to that.
01:09:38
Speaker
But Tiny Bird also has a POST API. So if you just want to do an insert and append new data to Tiny Bird, we've got a POST API that comes out of the box. You don't have to set anything up, and your application can directly HTTP POST some JSON to Tiny Bird and ingest it. But it may be that you want an API in the middle that handles the data, and you put it on a Kafka topic because you're actually putting it into multiple places. It really depends what you're trying to do; there are a lot of different ways to do it.
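A sketch of that append path might look like the following; the endpoint shape, data source name, and auth header are assumptions for illustration rather than verbatim API details.

```python
# Sketch only: the application POSTs a JSON event straight to an ingestion API.
import json

import requests

TOKEN = "p.XXXX"  # placeholder token with append rights
event = {"timestamp": "2024-06-01T12:00:00Z", "user_id": 42, "action": "checkout"}

resp = requests.post(
    "https://api.tinybird.co/v0/events",   # assumed ingestion endpoint
    params={"name": "app_events"},         # illustrative target data source name
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=json.dumps(event),
    timeout=10,
)
print(resp.status_code, resp.text)
```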
01:10:07
Speaker
Okay. Yeah. So we are back to it being part of a balanced breakfast. Yeah, it's part of a sane architecture, right? It's a tool for a particular job. Okay, in that case, last question then: if I want to get started with Clickhouse and actually kick the tires on it, I've already downloaded Clickhouse and typed clickhouse local. That was easy, I got a prompt. What should I do next?
01:10:38
Speaker
Well, I could put my vendor hat on and say, well, now you should... It's the end of the podcast, I'll let you have one sentence of vendor hatism. No, I mean, Clickhouse Local is a super easy way to go and try out Clickhouse and play around with it, right?
01:10:57
Speaker
Tiny Bird is another way that you can go and play around with it. We have a free tier, right? So you don't need to go and put a credit card in to try it out. There's no time limit on it. You can go out, sign up for an account, and there's a free tier. And you can play with it and build something for as long as you want. It's serverless, so it scales to $0. So you can go in and store some data in there, do some queries, create some APIs, and you won't get charged anything. So it's just another way that if you want to play around with it and see if it works for you,
01:11:25
Speaker
There's nothing wrong with going in and just trying out the free tier. It's not going to cost you anything to see if it works for you. How could I get a big chunk of data in there to play with? Into Tiny Bird.
01:11:38
Speaker
There are a whole bunch of ways. As I was saying earlier, we have a whole bunch of managed connectors to bring data in so that you don't have to write your own. We have what we call the Event API, which is an HTTP API that you can just post streaming data to. If you've got an application that is
01:12:01
Speaker
a web app that can make an HTTP POST request, you can just sit there sending a whole bunch of streaming data to it if you've got a streaming source. And one of the projects I built recently was a mock data generator called Mockingbird: you create a little fake data schema in JSON, and it will generate realistic-looking fake data and post it to streaming endpoints. And that works with
01:12:28
Speaker
Tiny Bird, but also Kafka and Ably Pub/Sub and all these different destinations; it's a pretty generic tool. It's open source, it's free, and it's not really a Tiny Bird thing. It's just that I like bird names, so it ended up getting called Mockingbird. But that's another way that you can generate some fake data.
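This is not the real Mockingbird, just a tiny sketch of the same idea: generate plausible-looking events from a small schema and stream them at an HTTP endpoint as newline-delimited JSON. The endpoint URL is a placeholder.

```python
# Sketch only: a toy fake-data generator streaming NDJSON batches over HTTP.
import json
import random
import time

import requests

ENDPOINT = "https://example.com/ingest"  # placeholder streaming endpoint

def fake_event() -> dict:
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user_id": random.randint(1, 10_000),
        "action": random.choice(["view", "click", "purchase"]),
        "value": round(random.uniform(0, 100), 2),
    }

for _ in range(60):  # send one batch per second for a minute
    batch = "\n".join(json.dumps(fake_event()) for _ in range(50))
    requests.post(ENDPOINT, data=batch, timeout=10)
    time.sleep(1)
```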
01:12:44
Speaker
But you can also upload files. If you've got a big file, you can just upload it from your local machine. If you've got it on S3, you can generate a signed URL and we'll download it from that signed URL. We can connect to Snowflake and sync data from Snowflake or BigQuery, that kind of stuff as well. OK, sounds like it's time to get busy. Al, thank you very much for taking us through the world of analytics databases. Yeah, it was great chatting with you, and thanks for having me.
01:13:15
Speaker
Cheers. Thank you, Al. Now, this is off the point, but as we're at the end, I can stretch out and tell you this. Something that's always bothered me is the asymmetry between online analytical processing, OLAP, nice and pronounceable, and online transaction processing, OLTP.
01:13:35
Speaker
You can't pronounce that at all. It doesn't work. So the solution is we go and invent online event processing, and then we've got OLAP and OLEP, which is nice and easy to pronounce. And as a bonus feature it kind of sounds like two siblings from a Hans Christian Andersen story.
01:13:52
Speaker
So, back to the point. Thank you, Al. If you want to learn more, you can find links to Clickhouse and Tiny Bird in the show notes. And if Bird and House are making you think of the They Might Be Giants hit Birdhouse in Your Soul, I'm going to put a link to that song in there too, because it's my podcast and I can do what I like.
01:14:12
Speaker
As always, if you've enjoyed this episode, a like or a share would be very much appreciated. And consider clicking the subscribe and notification buttons to make sure you catch the next episode. But until that next episode, that's all we have for you. I've been your host, Chris Jenkins. This has been Developer Voices with Alistair Brown. Thanks for listening.