#94 Clojure, Go, Cloud Storage Tech and more with Albin, Aurelien, and Wouter

defn
165 plays, 9 months ago
We got the beautiful minds behind the cloud tech: Aurelien, Albin, and Wouter share their experience building a block storage system with Clojure and Go. https://www.exoscale.com/block-storage/
Transcript

Introduction and Panelists

00:00:15
Speaker
This is episode number 94. We're coming from The Hague. My name is Vijay. With me in today's panel is Mr Ray from Belgium. Mr Ray from Belgium. Mr Aurelien from somewhere in Lyon, I think. Sorry.
00:00:36
Speaker
And Mr. Albin, who just moved to Switzerland. Let's start the episode, shall we? Put some music. All right. There you go. That's perfect.

Episode Humor and Past Experiences

00:00:48
Speaker
Exactly. It's like Walter Cronkite level shit. Sorry. Welcome to defn, guys. So let's dive deep into our guest today, I think. This is the first time we're having three guests at the same time.
00:01:06
Speaker
Yeah, I'm not quite sure about that, actually. I think we've had a five way before. Threesome, that's what you were going to say. I've been banging and screwing all this. Having a threesome this evening would have been fine.
00:01:22
Speaker
This became a technical orgy. That's the new subtext for the episode. Exactly. Yeah. I think this is the episode where we will finally be canceled. We've been waiting for this since episode one. Cannot stop the progress. Should we say what we're here for, Vijay? Totally. What's motivated this shit show?
00:01:52
Speaker
I don't want to call any of the episodes that guests are on a shit show. There were a couple of shit shows in the beginning when it was only Ray and I, but it's way more professional now. Well, I was calling it a shit show so far because it's mostly just been you and me. That's true. It's probably going to get a lot better soon. Yes, so keep listening. Maybe we should do a quick welcome to our guests.
00:02:22
Speaker
First of all, the oldest guest, Wouter. Well, not by age. I mean, oldest as in chronologically older than those guys. I mean, those guys look like they're still at school. What the fuck is going on here? You know, these developers, they're like coppers, you know, like police officers. They're getting younger and younger, these developers. Nothing to do with me. Nothing to do with me.
00:02:48
Speaker
No, for some context.

Exoscale's Work and Achievements

00:02:54
Speaker
Both me and Aurelien and Albin, we work at Exoscale, which is a European cloud provider. We use a lot of Clojure and we recently, or we're about to,
00:03:09
Speaker
Depending on how you look at it, we're about to ship a major achievement in my book because we've built block storage from scratch. Both of these guys and the rest of the team have been instrumental in building it and I thought it would be nice to
00:03:29
Speaker
talk about it because I think it's been an amazing two years to build it, but also the result has far exceeded my expectations.
00:03:42
Speaker
I thought it might be nice to talk a bit about what we've built, how we've built it, what we wanted to do. And I absolutely wanted to put both of these guys in the spotlights because they've delivered far and beyond, in my opinion.
00:04:01
Speaker
So what's your role in a project, Wouter, and what are these? Maybe we can have an introduction about what you do and what they do.

Roles and Technical Interests

00:04:08
Speaker
Yeah, so I'm the team lead for the team that's doing all the storage-related things. So historically, that was the storage layer for our S3 offering, and now that's also the block storage offering.
00:04:23
Speaker
Um, and, uh, I work with my team. So basically I do paperwork and they actually make the system run. So just asking, are we there yet? Kind of. Why is it taking so long? Why isn't this done yet? We're two years in, man. No, I'll let them introduce themselves as well. Maybe. Yeah. Yeah. So why don't you start, Albin?
00:04:52
Speaker
OK, so we're going for the youngest one. Alphabetical order, actually. Yeah. Yeah, the value was clearly the first one. Yes. Reverse alphabetical order. Oh, hang on. It doesn't matter. It's like a circular one. So I think it's like overflow, right? We have reached router. Now we are going to go back to the beginning. It's integer overflow. Yep.
00:05:21
Speaker
So on my side, I've been mostly interested in distributed systems and correctness of systems. I started at Exoscale working on verification of properties inside of the S3 offering. So checking that if some parts of the system crash, if we lose some parts of the system, we keep the durability and all of the guarantees. And at the same time, I did check the
00:05:48
Speaker
API correctness against some models, to detect the bugs where we had issues and confirm that the rest of the system was working correctly. And after that, so this was my end-of-studies internship. And afterwards, I moved on as a full-time engineer, just working on the distributed storage systems, mostly like the backend of the object storage and all of the
00:06:19
Speaker
block storage related stuff. So they made you put your money where your mouth was. Absolutely. Nice. And what about you, Aurelien? What is your backstory? We can tell you about it. Sorry, you tried to say something. Yeah, I was going to say we made him write the bugs that he used to discover. Exactly. And it's weird when you have to write bugs.
00:06:49
Speaker
For me, I've been working at Exoscale for the past two years. Previously, I used to work for a small company that was doing embedded systems, and I came to Exoscale to work on low-level stuff, which is our storage backend and the fleet of various daemons that store data and ensure that we

Challenges in Block Storage Development

00:07:11
Speaker
don't lose it. That's it. So far, it has been mostly successful.
00:07:17
Speaker
Do you think it's called a fleet or are they called a hellscape? Yeah, that's it. We've been working in block storage for almost the past two years now. And yeah, we are, I think, all happy that it's coming to an end and it will be valuable for our customers. This would be nice. I think maybe it's a good idea to explain what
00:07:48
Speaker
block storage is, and what are the challenges in building this? It's not just, OK, there is some disk space available somewhere that you can use. It's not just like that, I suppose. I don't have much knowledge, but it would be nice to have a good idea about what was the problem that you're trying to solve, and then what is the offering, and then how you're building this one. What were the biggest challenges here? Well, go ahead, Aurelien.
00:08:18
Speaker
So at Exoscale, the first storage offering was an S3-compatible product, object storage. So you can put files via an HTTP API, which we will store, and we will ensure that you can retrieve them down the line. That plus a handful of properties around access control and stuff.
00:08:46
Speaker
So fundamentally, block storage isn't that different. The API is different because it just acts as a disk on your machine, but you will write data on it and we will need to ensure that you can retrieve it down the line. So we've used this
00:09:06
Speaker
closeness of the problems to reuse some of our software. And the major difference was that while you can accept that your object storage has a few dozen milliseconds of latency to get or write a file, you can't accept that for block storage. Like you need it to be fast.
00:09:30
Speaker
fast like a disk or close enough to a disk. And that was the major difficulty that we needed to face. We needed to write a system that was good enough so that it can be approximated as a disk and not as a web API. Yeah.
00:09:50
Speaker
So there's issues like the Unix or the POSIX, like System V compatible, create, open, read, all these kind of things.
00:09:59
Speaker
So we go at a lower level. What block storage does is just present an array of blocks. And the only operation you can do is read or write a range of bytes. So it's the same interface that a disk offers, not the interface that a file system offers.
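A minimal Go sketch of the interface described here, for illustration only. The names `BlockDevice` and `memDisk` are hypothetical and not taken from Exoscale's code; the point is simply that the device is a flat byte array addressed by offset, with no files or directories.

```go
package main

import (
	"errors"
	"fmt"
)

// BlockDevice is a hypothetical interface: read or write a range of bytes
// at an offset. Anything file-shaped lives in a file system layered on top.
type BlockDevice interface {
	ReadAt(p []byte, off int64) (int, error)
	WriteAt(p []byte, off int64) (int, error)
}

// memDisk is an in-memory stand-in for a real disk.
type memDisk struct {
	data []byte
}

var errOutOfRange = errors.New("range outside device")

func newMemDisk(size int) *memDisk {
	return &memDisk{data: make([]byte, size)}
}

// checkRange rejects any request that does not fit inside the device.
func (d *memDisk) checkRange(n int, off int64) error {
	if off < 0 || off+int64(n) > int64(len(d.data)) {
		return errOutOfRange
	}
	return nil
}

func (d *memDisk) ReadAt(p []byte, off int64) (int, error) {
	if err := d.checkRange(len(p), off); err != nil {
		return 0, err
	}
	return copy(p, d.data[off:]), nil
}

func (d *memDisk) WriteAt(p []byte, off int64) (int, error) {
	if err := d.checkRange(len(p), off); err != nil {
		return 0, err
	}
	return copy(d.data[off:], p), nil
}

func main() {
	var dev BlockDevice = newMemDisk(4096)
	if _, err := dev.WriteAt([]byte("hello"), 512); err != nil {
		panic(err)
	}
	buf := make([]byte, 5)
	if _, err := dev.ReadAt(buf, 512); err != nil {
		panic(err)
	}
	fmt.Println(string(buf))
}
```

The method shapes mirror Go's standard `io.ReaderAt` and `io.WriterAt`, which is a convenient way to model offset-based access.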
00:10:25
Speaker
Okay, so you put file systems on top of your API. Yes, exactly. So it's a network attachable block device, right? It's a pure disk. It has a disk-based API. So you write bytes at a certain offset on the disk and you flush them out and then
00:10:47
Speaker
Like the operating system typically puts, I mean, you typically put the file system over the top and then your operating system uses that, but you don't have to. So it's a very low level. Uh, we simulate a raw disk essentially, but it's network attachable. So is it actually linked to the operating system?

Block Storage Use Cases and Reliability

00:11:06
Speaker
And I was, is it a pure, like, uh, is it a pure synthesis in your like user space? So the kernel doesn't know anything about it. No, you connect directly.
00:11:16
Speaker
On our side, we do connect the block device directly to a virtual machine, and on the customer side, it's just a disk like any other, and you can use it. For the customer, they can't even tell the difference between a local disk and a network-attached one. So is it conceptually similar to EBS on AWS?
00:11:43
Speaker
That's exactly the equivalent product. Okay. Yeah. So I was going to say the object storage is S3, the block storage is EBS. Yeah. Yeah. Okay. So the idea is that you can basically, uh, as the need arises, you can extend the, uh, storage that's available. Yes. We have two constraints. The first is that on physical hypervisors,
00:12:13
Speaker
We have a limited disk space that we can allow for virtual machines. And so if a customer is creating a virtual machine, which is a given size, we are going to limit the size of the disk to keep our usage ratio between the CPU, RAM, and disk. So this is why we had constraint on that level. And so the purpose of the block storage is to allow customers to go past that limit and to be like, OK, I want a small VM, but a very large disk.
00:12:42
Speaker
And on our side, we don't have the constraints of the physical disk space anymore. And there is a second big use that we're going to have. It's all of the Kubernetes stuff where you need to have, I forgot the name. It's a PVC, but it means... Persistent volume claim. Yeah. For Kubernetes, if you want to have some storage with your pod, you want to
00:13:12
Speaker
attach a physical disk and then use it inside of your pod. But you don't want the pod to be linked to a single node, and you want that flexibility. And so that's the thing we are trying to solve, and to offer something easier to use directly. So essentially, when you attach this to a VM, it just looks like a hard drive. Exactly.
00:13:41
Speaker
But how do you get the storage classes then? Because as you probably know, EBS has different storage classes, right? So like, oh, do you want to have like a spinning disk or high IOPS or SSD type of performance from this one? I'm assuming underlying technology might not be SSD or might be SSD, but you are creating some performance characteristics into different storage classes, right?
00:14:08
Speaker
So at Exoscale, we don't have a notion of storage class. We have only one offering. But if we were to create a different one, it might just be a different limit that would apply to your virtual machine. You would get the performance that looks like an HDD or the performance that looks like an SSD. And it's just an artificial limit that would be applied to
00:14:34
Speaker
But to that point, because I think they have guaranteed IOPS rate and all kinds of stuff. Is that something which you say, okay, our offering gives us a guaranteed IOPS of whatever it is? Exactly. We limit the number of IOPS to a number that is lower than what our system can support, to ensure that you have some guarantee of hitting it and that it's fair for all of our customers.
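One common way to cap IOPS per volume is a token bucket. The sketch below is an assumption for illustration: the conversation does not say which limiting algorithm Exoscale uses, and `iopsLimiter`, the rate, and the burst size are all made-up names and numbers. Time is passed in explicitly so the behaviour is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// iopsLimiter is a token bucket: each I/O consumes one token, and tokens
// refill at a fixed rate up to a burst ceiling. Hypothetical, not Exoscale's.
type iopsLimiter struct {
	rate   float64 // tokens added per second (the IOPS cap)
	burst  float64 // maximum tokens that can accumulate
	tokens float64
	last   time.Time
}

func newIOPSLimiter(rate, burst float64, now time.Time) *iopsLimiter {
	return &iopsLimiter{rate: rate, burst: burst, tokens: burst, last: now}
}

// Allow reports whether one I/O may proceed at time now.
func (l *iopsLimiter) Allow(now time.Time) bool {
	if elapsed := now.Sub(l.last).Seconds(); elapsed > 0 {
		l.tokens += elapsed * l.rate
		if l.tokens > l.burst {
			l.tokens = l.burst
		}
		l.last = now
	}
	if l.tokens < 1 {
		return false
	}
	l.tokens--
	return true
}

// at is a demo helper: a fixed instant plus ms milliseconds.
func at(ms int64) time.Time {
	return time.Unix(0, ms*int64(time.Millisecond))
}

func main() {
	lim := newIOPSLimiter(1000, 10, at(0))
	allowed := 0
	for i := 0; i < 50; i++ { // 50 requests arriving at the same instant
		if lim.Allow(at(0)) {
			allowed++
		}
	}
	fmt.Println("allowed in one instant:", allowed)
}
```

Setting the cap below what the backend can sustain, as described above, is what lets every volume actually reach its advertised number.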
00:15:02
Speaker
So what is the backend then? Well, backend as in, what is the behind the scenes then? I understand the interface, like how it is being used at the quote unquote front end, which is the VM and CPUs, but where is actually the data going then? Somebody's computer. Yeah, basically. Cloud is. Yeah, exactly. Obviously. I mean, I was actually going to say like the trick with all storage systems is,
00:15:33
Speaker
not necessarily, you know, the happy path is easy, you know, the client side API, like we're using the NBD protocol to expose the disk to the hypervisor. And like the protocol has five methods, I think is read, write, flush,
00:15:53
Speaker
Yeah, I mean, and then there's like two which we haven't actually implemented. That's all of the protocol. So that in and of itself is like super simple.
00:16:08
Speaker
The tricky part is all the error handling, right? Like what if, what if this goes wrong? What if that goes wrong? What if, you know, we can't retrieve this? What if, um, so that, that's where like all the, the funny stuff comes from because you can't, I mean, you can't lose the customer data, right? So do you have like, uh, guaranteed replica sets and that kind of stuff? Yes.
00:16:35
Speaker
So maybe I'll just quickly do the.
00:16:39
Speaker
high level overview of like what the tiers are and how that was designed.

Data Flow and Deduplication Challenges

00:16:44
Speaker
And then like, feel free to ask some questions along the way. And then I'll field the specifics to Albin and Aurelien, basically. But so the way the data flows is: your VM makes a, let's say, a write request. The hypervisor passes this on to our daemon, which lives on the
00:17:09
Speaker
on the same machine. So that's locally. So there's a local NBD server to the hypervisor itself. And this one also has a local cache. So we write the bytes into a local cache first.
00:17:28
Speaker
And then either when the cache is full or on a timer, we will empty the cache and ship it off machine, which is the whole point, right? Like we want to persist somewhere else. And so this talks to, in the first place, a proxy, which is the entry point into our storage machines. And then the proxy.
00:17:55
Speaker
slices, I mean, theoretically slices the input range up into what we call blobs, although practically speaking, the NBD part never actually sends anything that's larger than the maximum blob. And so this one replicates it three times and writes it out to three machines that then store this data and reply back. And then basically the read path is the inverse, where we first have to check
00:18:25
Speaker
is this particular data in the write cache? If it is, we serve it. If it's not, we talk to the proxy. The proxy figures out a replica that the data lives on and then ships it back. So that's... And what do you do, like, what's your sort of parity checking, all those kind of things? Because, you know, Albin was talking about some correctness measurements.
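The write and read paths described above can be sketched roughly as follows. This is a toy model under stated assumptions: `agent`, `proxy`, and `store` are hypothetical names, data is keyed by whole offsets rather than byte ranges, the "machines" are in-memory maps, and the timer-driven flush is only mentioned in a comment to keep the example deterministic.

```go
package main

import "fmt"

const replicas = 3

// store stands in for one storage machine (hypothetical).
type store map[int64][]byte

// proxy fans each write out to every replica and serves reads from the
// first replica that has the data.
type proxy struct {
	stores [replicas]store
}

func newProxy() *proxy {
	p := &proxy{}
	for i := range p.stores {
		p.stores[i] = store{}
	}
	return p
}

func (p *proxy) write(off int64, b []byte) {
	for _, s := range p.stores {
		s[off] = append([]byte(nil), b...)
	}
}

func (p *proxy) read(off int64) ([]byte, bool) {
	for _, s := range p.stores {
		if b, ok := s[off]; ok {
			return b, true
		}
	}
	return nil, false
}

// agent models the local daemon next to the hypervisor: writes land in a
// local cache first and are shipped off-machine later.
type agent struct {
	cache    map[int64][]byte
	maxItems int
	up       *proxy
}

func (a *agent) Write(off int64, b []byte) {
	a.cache[off] = append([]byte(nil), b...)
	if len(a.cache) >= a.maxItems { // cache full: ship it
		a.Flush()
	}
}

// Flush empties the cache through the proxy. A real agent would also call
// this periodically from a timer.
func (a *agent) Flush() {
	for off, b := range a.cache {
		a.up.write(off, b)
		delete(a.cache, off)
	}
}

// Read checks the write cache first, then falls back to the proxy.
func (a *agent) Read(off int64) ([]byte, bool) {
	if b, ok := a.cache[off]; ok {
		return b, true
	}
	return a.up.read(off)
}

func main() {
	a := &agent{cache: map[int64][]byte{}, maxItems: 2, up: newProxy()}
	a.Write(0, []byte("A"))
	_, remote := a.up.read(0)
	fmt.Println("shipped after first write:", remote)
	a.Write(4096, []byte("B"))
	_, remote = a.up.read(0)
	fmt.Println("shipped after cache filled:", remote)
}
```

The sketch leaves out everything that makes the real system hard, which is the error handling mentioned earlier: what happens when a replica is unreachable or the local cache is lost before it is flushed.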
00:18:51
Speaker
How do you guarantee that the writes on all of the replicas all kind of line up? So yeah, go ahead. All of the data is transmitted through TCP, so we have the protection at that level. And for the data at rest, we have the protection of the drives themselves, because most of the modern SSDs are going to check the sectors before replying.
00:19:21
Speaker
But we also store a checksum of each blob, which is larger than the disk sector. And we have a process which is going to periodically check and ensure that all of the data on the disk is correct. And so we are sure that.
00:19:40
Speaker
If a disk is starting to fail with bad sectors or if it's had some bit flips, we're going to detect it and re-replicate the data to ensure that we stay correct on that part. I know this is maybe a bit off-piste, but one of the things that I've heard people talk about doing with storage is
00:20:05
Speaker
It's probably not so, I mean, it depends on the type of things that are getting stored, but if it's like a million things that are all the same and you've got checksums of stuff and they're writing the same thing every time, do you basically say, ah, we're not going to bother writing that again. We're going to sort of do some fingerprinting. To answer simply, no, we don't do deduplication of the data on the disks. We didn't have
00:20:31
Speaker
the proof that it would be worth it to implement as of today. And, like, we bill for the data that we store. So yes, we could optimize the pricing of the system by doing that, but we don't have a big incentive to do that. So I'm not sure if it is actually, I don't know if it is a real thing or not. Maybe as Ray was mentioning,
00:20:55
Speaker
This might be heresy or me just understanding shit completely wrong. I remember Dropbox doing this kind of stuff to make their storage very efficient and faster as well. Because if everybody is storing defn episodes and downloading, and then 200 different customers are saving the same data... I mean, obviously it's not at the file level that you can reconstruct everything. But if there is a block pattern that is exactly the same, then there is no point in
00:21:23
Speaker
So in theory it could be done. I think there are a few blockers that prevent us from doing that successfully. So unlike Dropbox, our customers can use the product to store a lot of different data. Dropbox is kind of suited for a lot of files, maybe business files. So they have a likelihood of having copies of the same file which is somewhat high.
00:21:51
Speaker
In our case, as we just expose a disk, our customer can use that to store anything, a database, raw videos, compressed videos. So the likelihood of finding the same data at different places is getting lower.
00:22:09
Speaker
And as we also receive unstructured data, we just receive a stream of bytes. It would be difficult to try to analyze those, to try to find patterns and repetition. So it is going to cost us a significant amount of processing power to do that. I mean, there was an interesting, I need to look this one up, a long time ago there was an interesting website or an online experiment that
00:22:36
Speaker
some artists created that you could type any sequence of words, any sequence of letters in English, you could find that.
00:22:44
Speaker
on that page. It doesn't matter in what order or whatever. So it seems like eventually if you have enough entropy, if you have enough space, you can just put all the bytes available in combination. That means it doesn't matter what people are going to store. I don't know. It might be petascale or whatever. And one last thing that prevents us from doing that is that eventually
00:23:09
Speaker
If we want to encrypt customer data, it means that we can't share the same data from different customers. But it is an interesting space. It is an interesting idea. Obviously Dropbox is, as you said, I don't think they're encrypting. I mean, they're probably encrypting only with their own key, whatever, some location when they're doing the offsite stuff.
00:23:32
Speaker
Cool. Um, so, yeah, sorry, right. Do they have any, no, I was just going to say, so let's go back to the kind of hardware then, because, you know, assuming that you're, you're running SSDs, like you say, you've got all this kind of stuff and you've got it on different machines, like how, how.
00:23:50
Speaker
You've got essentially how much storage are you actually kind of having to provision yourself? I mean, how are you doing that? Are you kind of like saying, okay, like we've got like, I don't know, a petabyte of data that we bought and we're going to like, and then let's say you've got a certain amount of capacity.
00:24:08
Speaker
And how do you kind of, like, account for that and budget for future needs? That's kind of an interesting challenge, I think, on the storage side, because obviously storage is not free, it turns out. You know, what is photographic bullshit as tell you, you know, they always end up charging you for it. So yes, that's the question: how do you kind of manage the
00:24:36
Speaker
Yeah, the kind of predictability of the storage array that you've got to have available. Because I'm guessing that if you've got some customers that will suddenly write a few terabytes of data, that it's like, oh shit, we've got to bring a load more disks online. You have to really do it. You've got to screw them into a machine somewhere.
00:24:58
Speaker
How do you cope with that kind of scalability that the cloud is meant to solve for? How do you solve for it?
00:25:14
Speaker
A business answer from Wouter. Oh, yeah. Because you haven't the manager of the other one. We should go and sell this one. All right. Actually, the technical response isn't very interesting. We monitor. When it gets too high, we order more hardware and we worry. Basically, we order capacity when our cluster is 50% full.
00:25:43
Speaker
So we vary and it depends on how large a zone already is. Um, talking about object storage now, cause this is the way more mature system. Um, where we have some experience with increasing capacity for the block storage offering. This is currently in preview. It's available in one zone. There is some capacity, but like, obviously it's not being used too much. And we, we don't really.
00:26:11
Speaker
We've got, we don't really have good heuristics yet for like, okay, this is how full we can run the system. But for the object storage one, the.
00:26:22
Speaker
The larger zones are several petabytes. And so these, you know, even for these ones, I think like when we hit the 60% mark, that's like the absolutely just trigger to add additional capacity to those. Um, you know, as these, as these clusters, so it's, it's also like the size thing, if you know what I mean, like the larger the cluster already is, um, you know, the percentage at which we can order additional capacity goes up.
00:26:50
Speaker
because the chances of an individual customer basically drowning out the system go down. But the smaller zones, like we've got a few really small zones where I think the cluster size is what, a few hundred terabytes? And so those are like within a realm where a single customer could theoretically walk up and like fill it up really fast. Right.
00:27:15
Speaker
So as always, you know, the answer is it depends a bit, but basically we, we, or, I mean. Another question for you on, on the sort of commercial aspects of it is I noticed when I'm buying SSDs, you know, that there's a sort of like, there's a different set of prices as you go through the sort of capacity range.
00:27:36
Speaker
There's a sweet spot, at least at the consumer level. What's it like at the warehouse, business to business level? Do you find buying the absolute maximum amount of SSD is what you have to do and you just have to pay the top dollar?
00:27:55
Speaker
Do you, like, buy medium size? I don't call them disks, I guess, anymore. But I just go to eBay and then buy me the second... Well, that's what I'm thinking, you know. Do you end up, like, buying mid-level? What's the optimal pricing, I guess? That's the question: how do you do the pricing so that you don't feel like you're
00:28:19
Speaker
overcommitting in terms of, like, that extra premium that you pay for, like, I don't know, sixteen terabytes or whatever? I don't know if you even know, honestly, what SSDs are these days. What kind of options do you have in terms of buying SSDs?
00:28:38
Speaker
We buy machines more than SSDs, really? But don't SSDs, aren't they like, don't you like screw them in with NVMs or whatever, or NVMe? So you don't just like screw in SSDs these days? I don't. I mean, I do on my machine. You've got to like slot SSDs in or you've got to screw them in and buy NVMe or something, you know.
00:29:05
Speaker
No, they're probably these days, right? Like, for the block storage, they are NVMe drives. For the object storage, they are in fact hard disks, spinning metal. Yeah. But we tend to...
00:29:23
Speaker
We tend to buy the machines as such and so the vendor will make a proposition and there will be some choice that we have for the SSDs or the drives that fit, but we typically don't buy them individually. And then similarly, we do look at price per gigabyte.
00:29:47
Speaker
and things like that, like we don't buy the absolute max. And there's also, especially on the object storage side, where it's hard drives. So these have gotten way denser, but the bandwidth of the bus hasn't gone up, right? So this means that you can still only drive X amount of data
00:30:08
Speaker
you know, bits or bytes per second to the same disk. So denser drive and like also big differences in density between drives doesn't necessarily, yeah, it doesn't work too well to like load balance the activity on the cluster. Because like the way that it sort of works is we try to spread
00:30:32
Speaker
And maybe we can explain what that is, but we've got an intermediate concept when we store the data, like physically on the machine. So there's a concept of a partition, which think of it as a hundred gig segment of the disk, basically. And this is the entity that gets replicated. So we don't replicate like individual bytes, like we replicate partitions across. And we try to load balance the data over the individual partitions.
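As a rough illustration of spreading writes over partitions, here is a greedy placement sketch in Go. Everything in it is an assumption made for the example: the `partition` struct, the `pickReplicas` function, the rule of preferring the emptiest partitions, and the requirement that replicas land on distinct hosts. The conversation only states that partitions are roughly 100 GB, that they are the unit of replication, and that load is balanced across them.

```go
package main

import (
	"fmt"
	"sort"
)

// partitionSize is the approximate partition size mentioned above.
const partitionSize = 100 << 30

// partition is a hypothetical record for one segment of a disk.
type partition struct {
	id   string
	host string
	used int64 // bytes written so far
}

// pickReplicas returns up to n partitions on distinct hosts that still have
// room for blobSize bytes, preferring the emptiest ones.
func pickReplicas(parts []partition, n int, blobSize int64) []partition {
	sorted := append([]partition(nil), parts...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].used < sorted[j].used
	})
	seen := map[string]bool{}
	var out []partition
	for _, p := range sorted {
		if len(out) == n {
			break
		}
		if seen[p.host] || p.used+blobSize > partitionSize {
			continue
		}
		seen[p.host] = true
		out = append(out, p)
	}
	return out
}

func main() {
	parts := []partition{
		{id: "p1", host: "host-a", used: 60 << 30},
		{id: "p2", host: "host-a", used: 10 << 30},
		{id: "p3", host: "host-b", used: 30 << 30},
		{id: "p4", host: "host-c", used: 20 << 30},
	}
	for _, p := range pickReplicas(parts, 3, 4<<20) {
		fmt.Println(p.id, p.host)
	}
}
```

A policy like this also shows why much denser disks can become hot spots: more open partitions on one host means more of the placements, and therefore more of the write bandwidth, end up there.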
00:31:00
Speaker
so that we spread that evenly. And so denser disks have more active partitions on them, but that also means that the more partitions you have that are open for writing, the more bandwidth that this host is going to attract. We've had some issues with that. We solved them now. It's not an issue anymore. So basically,
00:31:27
Speaker
Like denser disks are useful if most of your storage is cold where it's written once and then you can store it like forever. And it's never like, you don't need to like write it, overwrite it again and again and again. So there are like a few things, but like mostly it's like, we try to look at like most bang for the buck really. Yeah. And do you, I mean, you're just talking there about like, if you have, um,
00:31:54
Speaker
increased partitions, do you have two NICs, or several NICs, so that you have like one for the sort of front-facing traffic and one for the replication?
00:32:05
Speaker
No, it's also load balanced. It's just aggregated. All right. So we use this networking trick called ECMP, equal-cost multipathing. And so basically both NICs expose an IP that has the same routing weight. And so that kind of means that if you're sending traffic to that particular IP,
00:32:32
Speaker
then there will be some load balancing going on. And so 50% will take the one NIC and 50% will take the other one. But the balancing is at the TCP layer. So one TCP connection will come in over one NIC and then the next over the other. So if you've got a sufficient amount of clients connecting, then you end up spreading the load between both of the NICs.
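Per-connection balancing of this kind is typically done by hashing each flow's addresses and ports, so that all packets of one TCP connection take the same path. The Go sketch below illustrates the idea only; the `flow` struct, `pickPath`, the IP addresses, and the choice of FNV-1a are assumptions, and real routers use their own hash functions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flow identifies one TCP connection (hypothetical struct for the example).
type flow struct {
	srcIP   string
	srcPort uint16
	dstIP   string
	dstPort uint16
}

// pickPath hashes the flow so every packet of the same connection maps to
// the same path, while different connections spread across paths.
func pickPath(f flow, paths int) int {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s:%d>%s:%d", f.srcIP, f.srcPort, f.dstIP, f.dstPort)
	return int(h.Sum32() % uint32(paths))
}

func main() {
	counts := make([]int, 2) // two NICs behind one IP
	for port := 40000; port < 41000; port++ {
		f := flow{"10.0.0.7", uint16(port), "192.0.2.10", 443}
		counts[pickPath(f, len(counts))]++
	}
	fmt.Println("connections per NIC:", counts)
}
```

With only a handful of connections the split can be lopsided, which matches the remark above that the load evens out once a sufficient number of clients connect.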
00:32:55
Speaker
So you're still a believer in TCP then rather than UDP for your storage. Well, so, good... Isn't QUIC going to change all that, you know?
00:33:09
Speaker
Like, benefits of TCP. Like, TCP comes with integrity guarantees. Turns out, if you're typing data over the wire, you know. Just fire and forget and hope it lands. Yeah, exactly. What could possibly go wrong, you know? I think this is the difference between, you know, people used to having this control S in their muscle memory. Like, you never know. Control S, control S, control S. You know, command S.
00:33:36
Speaker
But we, in fact, we actually rely on this, right? On the TCP integrity guarantees, because we used to in the past... So basically, like you said, for checksumming, we checksum each blob, right? So each blob gets hashed and we send the hash with the blob for integrity stuff.
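The per-blob checksum, together with the periodic check of data on disk mentioned earlier, can be sketched as below. This is an illustration under assumptions: the hash function is not named in the conversation, so SHA-256 is used as a stand-in, and `blob`, `newBlob`, and `scrub` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// blob pairs the stored bytes with the checksum computed when it was written.
type blob struct {
	data []byte
	sum  [sha256.Size]byte
}

func newBlob(data []byte) blob {
	return blob{data: append([]byte(nil), data...), sum: sha256.Sum256(data)}
}

// scrub re-hashes every blob and returns the indexes of those whose content
// no longer matches the stored checksum (for example after a bit flip).
func scrub(blobs []blob) []int {
	var bad []int
	for i, b := range blobs {
		if sha256.Sum256(b.data) != b.sum {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	blobs := []blob{
		newBlob([]byte("first")),
		newBlob([]byte("second")),
		newBlob([]byte("third")),
	}
	blobs[1].data[0] ^= 0x01 // simulate a single bit flip on disk
	fmt.Println("corrupt blobs:", scrub(blobs))
}
```

In a replicated system, each index reported by the scrub would be repaired by copying the blob again from a healthy replica.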
00:33:54
Speaker
So in the past, when we wrote down a blob, we would immediately, um, like we would also pass. So when we wrote out the blob to disk, right? Like, so this is in our storage component, which is called blobd. Um, and this also, for the record, like this is a shared component between object and block storage. We didn't really get to tell that, but we made the decision very early on to not reinvent the storage layer and take the storage layer that we knew and have operated for
00:34:22
Speaker
five, six years now, I don't know. And so we decided to take that expertise and the battle testing that this system had gone through and reuse it as the storage layer of block storage. So when we talk about that system, we're actually talking about both systems.
00:34:40
Speaker
Um, but in the past we used to stream the bytes through a hashing function and compare the hash with, you know, what was given. Uh, but turns out that this was a compute that we didn't have to do because TCP is already doing that for you. And so that check would never fail and we removed it and we gained some free performance. Um,
00:35:04
Speaker
Like, would it? Would it? So essentially QUIC, if you look at like these, you know, HTTP/3 or whatever, they're based on UDP.
00:35:14
Speaker
Uh, but they end up re-implementing a lot of stuff that TCP gives you for free. Right. And so that's it. And I guess there's some use cases at like super high throughput and volume and Google scale where some of the downsides of TCP are, you know, too expensive for them. And it makes sense for them to re-implement just the subset they need on top of UDP. Um, but.
00:35:42
Speaker
at least at the scale we are at, we've not seen that like TCP doesn't factor in at all into the overhead that we're seeing. And we eyeball the system very closely. So like we're nowhere, I mean, I wouldn't know why we would change it out. I don't think it would make any meaningful difference. So I will have all the goodness of TCP, you know. So I mean, from your, go on, sorry, go on Vijay. Yeah, no, I was trying to,
00:36:13
Speaker
Maybe a quick digression or maybe diversion into the title of the podcast. I'm curious about Albin and Aurelien, your programming language journey and your experience, and where did you encounter Clojure, and what is the tech stack that you're using right now to implement all this, right? I mean, if it is all C, then
00:36:38
Speaker
What are we doing here? We're just going to, yeah, definitely now in C and then moving on. So yeah, I'd love to know the tech stack and how it is implemented as well. And I think we touched on different concepts, different layers. If you can give us some insight into that one, just read the code and then we'll be happy.
00:37:04
Speaker
This is just: ChatGPT, build a block storage system for me. Then it just tells you some shit that looks like Python, but is actually SQL. Do you want to go? If you want to start talking about the NBD agent, go for it, and I will talk about the proxy after. Sure.
00:37:29
Speaker
So at Exoscale, we use mostly two programming languages: Clojure, and Go for the most low-level stuff. We used to write some low-level components in C in the past, but I think over the last three or four years, we've moved to Go to have easier maintainability than C while keeping some of the nicer low-level properties.
00:37:58
Speaker
So the storage engine is written in Go, as well as the agent that talks to the virtual machine manager, QEMU.
00:38:13
Speaker
And apart from these two things that are in Go, we write Clojure stuff. So the main component here in the equation would be the proxy that receives writes from the NBD agent and from the virtual machine and that dispatches them to the storage engines. I was mostly working on the Go part.
00:38:42
Speaker
I didn't touch a lot of Clojure up until recently. But Albin is more of an expert and can probably discuss it a bit more. But why the difference? Because I'm assuming the proxy is also a very critical component, and then your protocol stuff as well, why pick Clojure for the proxy? And then I think Go is a bit clearer for me. Yeah, sorry, go ahead.
00:39:13
Speaker
The objective was to reuse some of the components we already had inside of the object storage, which is able to stream data to the storage layer. That part was already battle tested and we knew how to operate it. And we already had all of the visibility, the tracing, the alerting required on that part.
00:39:38
Speaker
And we also decided to use the same metadata layer in a different way. And we already knew how to interact with it, even if we did drift a lot from the implementation inside of the block storage. But we were more confident in using Clojure at that level because we knew how to interact with other components.
00:40:04
Speaker
And so generally speaking, the heuristic is low-level side effects. Side effect stuff tends to be done in Go. So this is most of the agents that we have, because essentially, if we have to run a JVM, that's close to a gig of memory that's just not coming back.

Languages and Technologies at Exoscale

00:40:29
Speaker
And so if we have to do that for
00:40:31
Speaker
Pretty much every agent on a hypervisor, we're going to consume quite a bit of memory that we can't sell to customers. And then similarly, historically, blobd runs, for the object storage, on extremely low-powered boxes. And they just did not have the hardware profile to run a JVM at the speeds that we needed. And so Go makes sense there.
00:40:53
Speaker
And then we try to use Clojure for anything that has to implement some form of business logic. So once it's more than just straight up side effect, because essentially the job of blobd is take some bytes, write them to disk. And then error handling, that's 90% of the stuff.
00:41:15
Speaker
And so for the NBD agent, it's fairly simple, similar, sorry. The proxy tends to, you know, have some more logic involved, right? Because this is the index which maps offsets on a volume to blobs, to bytes on an actual disk somewhere.
00:41:37
Speaker
So that tends to have some more business logic. Now for block storage, that's again fairly limited because of the space, but in object storage, if you've ever looked at all the options that S3 exposes, there's quite a lot of bells and whistles.
00:41:58
Speaker
So that's kind of the heuristic that we try to use. And so given the history with object storage and, again, the problem space of each of the individual components, that's where we landed up. Yeah, Ray asked a question about the garbage collection.
00:42:23
Speaker
So again, when we rewrote blobd from C to Go, we heavily benchmarked this. And so clearly GC has a bit of overhead, but not enough to actually matter. So it was fine. GC in Go is different from the JVM.
00:42:53
Speaker
Yeah, it doesn't stop the world for two minutes straight. Yeah. Yeah, exactly. There's no, there's no lock on sweep or whatever. That said, so, cause you mentioned it, on our block storage proxy, which is clearly latency sensitive, right? It can't go on a vacation for two minutes because that's a write, that's a read that's not returning data for two minutes. Um, so we, we're using.
00:43:20
Speaker
this new, well, relatively new algorithm on the JVM called, what's it, ZGC, I think, which guarantees, it makes a different trade-off to a traditional GC, so a traditional GC
00:43:35
Speaker
Basically, when you overload the memory allocation, the GC pauses will become longer. What ZGC does is it guarantees that your GC is sub-millisecond, but it will slow down allocations. So whenever you request a new piece of memory and it notices that it's getting up to its maximum allowed time for garbage collection, it will basically slow you down and just return from those calls slower and slower. So your application still slows down.
00:44:05
Speaker
if you overload the memory that you request, but it tends to be more graceful. So. Yeah. Quick question about that. It's like one of the things that the Java, like the low latency Java folks did was they use these ring buffers, you know? So, so the fundamental idea of a ring buffer is essentially that you allocate some memory
00:44:34
Speaker
And then you basically reuse that memory that you that you allocate, which is a kind of like an age old trick, you know, from like C and all these other kind of like low level languages as well.
00:44:47
Speaker
and they brought that up to Java. And I'm guessing that you're writing these blobs. So isn't it possible for you just to say, OK, well, we're just going to allocate a whole bunch of memory for these blobs? Those are the things that get GC'd. So fuck that. We'll just keep a memory and have a ring buffer that we just write into.
00:45:07
Speaker
Maybe you thought about that, I don't know. It would be a very valid strategy that would work, but it turns out that the Go GC is actually performant enough that you can write stuff without caring and it's good enough for a long while.
00:45:24
Speaker
But we do still play a few tricks like this one to not allocate for some objects or for some memory. That way, we reduce the overall pressure on the GC and the overall work that it needs to do. But we only do that for critical stuff. And for the general code, we just use the GC, we allocate, and it's fine.
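Editor's sketch of the allocate-once-and-reuse trick discussed above. This is a minimal illustration only, assuming a fixed number of equally sized buffers; the `bufferRing` type and its methods are hypothetical names, not Exoscale's code.

```go
package main

import "fmt"

// bufferRing hands out buffers from a pool that is allocated once, in
// round-robin order, so steady-state work creates no new garbage.
// Hypothetical type, for illustration only.
type bufferRing struct {
	bufs [][]byte
	next int
}

// newBufferRing allocates count buffers of size bytes up front.
func newBufferRing(count, size int) *bufferRing {
	r := &bufferRing{bufs: make([][]byte, count)}
	for i := range r.bufs {
		r.bufs[i] = make([]byte, size)
	}
	return r
}

// take returns the next buffer. The caller must be done with it before the
// ring wraps around, because the same memory is handed out again.
func (r *bufferRing) take() []byte {
	b := r.bufs[r.next]
	r.next = (r.next + 1) % len(r.bufs)
	return b
}

func main() {
	ring := newBufferRing(2, 4)
	a := ring.take()
	copy(a, "blob")
	ring.take()
	c := ring.take() // wraps around: same backing memory as a
	fmt.Println(&a[0] == &c[0], string(c))
}
```

The trade-off is the one mentioned in the conversation: the GC has less work to do, but the code has to manage buffer lifetimes by hand, so it is probably only worth it on critical paths.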
00:45:52
Speaker
Both of you, did you start with Go or C? What was your first language or JavaScript? I come from the embedded world. My studies were on embedded systems. So I was doing C and then Go. Never been corrupted by JavaScript.
00:46:15
Speaker
No, not JavaScript. Not yet. I mean, yeah. In my previous job, I had to write a JavaScript engine in Go, or at least wire up something that executes JavaScript in Go, but I don't know if it counts. And for you, Albin? If I'm not mistaken, the first language that I really learned was OCaml.
00:46:42
Speaker
But then I quickly switched to C. Wow. Why? This was the path we had in my school. We were starting with OCaml for the algorithms. And then we went to C to get how the whole system works.
00:47:05
Speaker
You only have this kind of a general affinity around some countries using some language. I heard a few people from France learning OCaml. I think there is also another language called OPA or something that also came from some French folks. I think at some point it literally means grandfather in Dutch.
00:47:26
Speaker
Prolog as well, you know. Is it like, because if you write OCaml everything is the other way around? Like French people speak like "color do whatever fuck", and like everything there in French, I need to say "color of something", or what is it? It doesn't strike me as a specific feature of OCaml.
00:47:49
Speaker
I don't know any OCaml, so I can just put in there. It's like an old man ranting to the wind. Don't worry about it. Exactly. OCaml was invented and is still being developed by a French university. It's like if you're studying in Lausanne, in the EPFL, you're going to learn Scala because it was invented there. Not anymore, but yeah. There is some affinity coming from there.
00:48:20
Speaker
It is, it is an interesting view, right? Like, I know GHC, for example, you know, the Haskell compiler, was from Glasgow, and then there's a bit of Haskell stuff there. And Utrecht University has another Haskell compiler for themselves and then.

Cultural Influences and Language Preferences

00:48:38
Speaker
I heard more Haskell programmers in Netherlands than any other country. And I think in India, I think we just learned whatever the thing that is going to get us. So we're arriving at the conclusion. So we're arriving at the conclusion that the French are chauvinist. Oh yeah, for sure. There's no debate on that. But, but Albin, coming back to your journey, I mean, let's say you, you, you did OCaml and you had to see at school, but then like when you, was this your, you said this was your first job after you got an internship. So.
00:49:07
Speaker
I guess you were doing like, were you doing Go or Clojure? Was that your next after the C? Well, I did learn other languages in school, but when I came to Exoscale, I had never done any Golang or any Clojure. And so the first thing I did was testing the Golang program using Jepsen.
00:49:30
Speaker
which is in Clojure. So I started learning those two languages at the same time. And when I was able to find a bug, I was going inside of the Golang to understand it, and then going back to Clojure. And this is how I did learn those. That was going to be my next question, because you're working on distributed systems, do you do Jepsen? And then what are the results? Yeah, it's mostly Jepsen, which we didn't use a
00:50:01
Speaker
quite complex to plug in and the logic inside of Jepsen to create a new system and to test it. It's very easy to test a single component, which is the same thing on all of the systems and interact with them. But when you start to have a system where you have multiple different components that interact together, it's slightly more complex to orchestrate correctly.
00:50:29
Speaker
I think mostly I see all the distributed database systems, generally those are much easier to test with Jepsen, I think. Maybe not super easy, but at least like, yeah. There's probably a reason they all pay King Kyle, Kyle Kingsbury, King Kyle. He's a king, but there's probably a reason they pay him quite a bit to do the tests. I mean, it's.
00:50:58
Speaker
It's, it's amazing software. We found bugs with it. So it's absolutely great, but it's not a plug and play test framework. Let's put it like that. I mean, it's, it's, it's testing distributed systems, right? I mean, the system themselves are fairly complicated already. That means if you want to test them, that means the scenarios that you need to look up and then design and then, and every system is built differently. So you need to figure out how to, it's not just, oh, there is an iOS application, go and test it. Like the buttons and it's not that easy.
00:51:27
Speaker
Maybe another question, since you're recently coming out of academia, is whether you were given like training in like formal techniques like Coq or TLA+. I wasn't taught anything on that topic. Oh, okay. Are you aware of those things or is that not something? Yes. When I started searching on that field, I started with the formal approach.
00:51:54
Speaker
But it was not very conclusive because the system is very simple in the design. Like it's an immutable key value store. So for a given key, we will only write a single value and we will never write anything else. And so all of the bugs were in the implementation, not in the design. So the formal view was not the best approach for that.
00:52:20
Speaker
Probably specification-wise, it's pretty simple, I guess. And the system is simple, as in there are not many code paths or not many things that you need to test. If we take just the storage backend, it's super simple. And you will not learn anything by doing some formal tests. If you take the object storage frontend, it's way too complex to model everything. So you don't have to look on this one.
00:52:48
Speaker
Maybe on the block storage, you might have been able to model that, but you still end up on the page where you are going to model something and the implementation is different and you're going to miss something. So might be a very good way to do models when you have very large things and you want to have a state of the world between components when they exchange and to check the
00:53:19
Speaker
state of the entire system itself, but we are at a scale a bit too small to be able to leverage those techniques. Maybe the other question I've got for you is then, it's like, typically with Go you're looking at, you know, like Aurelien said, very high performance systems, and Clojure is usually
00:53:41
Speaker
good enough performance, but it's not like super high, super high performance, the predictability of it. You know, obviously Java and Clojure can go fast. You know, if you use certain tricks or make sure you do profiling or whatever, um, you know, I'm not going to say you can't make it like that, but I'm guessing the way you're doing it, you're talking about like having a Clojure proxy going over a network.

Proxy Architecture and Performance

00:54:05
Speaker
So do you find like that it's like,
00:54:07
Speaker
How do you guarantee the performance of this system? Cause if you're saying you've got a write to disk, you know, and you're going to give the same performance as writing to the disk. That does strike me as slightly odd if you've got Clojure in the middle there, you know, cause that seems like a bit of a risk for performance. Currently the proxy is able to process a read or a write within five milliseconds, which is
00:54:37
Speaker
good enough to operate the system and to deliver the volume of IOPS we promise. And one thing to keep in mind is that, given that the proxy is in Clojure, it means that we can
00:54:55
Speaker
scale, just give it more CPU, just create one more instance, which is unlike our Go agent, where we just want to drive the whole disk at maximum performance. As long as the latencies of our Clojure application are good enough, even if it consumes too much CPU or memory, that's an acceptable trade-off for us.
00:55:20
Speaker
But it sounds like it's good enough anyway, regardless actually. Yeah. And we've, yeah, and we've measured and it's fine. We had doubts that the Clojure proxy could deliver this performance, but it turns out it can. We, we, we at some point did write a
00:55:39
Speaker
reasonably complete Go version of the proxy as well, and it, you know, was performing within, you know, the same order of magnitude. So it's not worth the maintenance cost basically. No, in fact, like it was slower. Like basically it was slower than the Clojure one. Clearly the Clojure one had been optimized. I mean, had seen some optimizations at that point and the Go version was
00:56:02
Speaker
Um, uh, you know, I mean, we spent some time on it as well, but actually it had not received the same amount of, uh, detailed analysis as we had done on the Clojure version. But I'm guessing the Clojure one uses some sort of underlying Java library or IO library, or is that not correct? Like Netty or something. Yes.
00:56:24
Speaker
Actually, we rely on the, what's it from, from Zach, Aleph. Oh, Aleph. Okay. Aleph uses Netty, I think, underneath, doesn't it? Yeah, actually, no, for block storage, there's no Aleph, because we use gRPC. But yeah, it's Netty underneath, still.
00:56:43
Speaker
Um, but, uh, but yeah, so obviously there's some interop and there's some, but, uh, like we did, we did compare. And so basically when we tune it, the Clojure ended up, or like basically the JVM ends up being as fast as Go. And then the benefits of something like Clojure in the proxy, uh, weigh up. Right. Um, cause it is a nice language to.
00:57:11
Speaker
I mean, it's a nice language to experiment with and it's a nice language to do data reshaping with as well. If you need to turn things around. So clearly if all you're doing is just piping, I mean, the core function of the system is still piping bytes through, but the function of the proxy is to provide an index. Because we have this mapping of offset on your volume to blob basically that needs to be maintained and needs to go fast. And there's some ref counting going on to make all of that working.
00:57:39
Speaker
So there is, there is some business logic and it is nice to be able to express that in Clojure. Is that what you were talking about earlier with Albin, with the metadata stuff?
00:57:49
Speaker
Yes. So we actually haven't touched on what that is. So maybe we should explain it a bit more. So the proxy exposes the gRPC frontend, which, again, has a fairly straightforward API. It's write extent, read extent, basically, and then get volume metadata because a volume has a name and a size and whatever.
00:58:12
Speaker
And so an extent is essentially a range of bytes. So it's an offset within your volume, right? Like I'm 100 bytes off the start, and it's the length. And that's an extent, essentially. So that's the primitive that we read and write. And like I said, the proxy takes that and slices this up into blobs, and then blobs are written into partitions or replicated. And so this is the thing that eventually ends up on a disk.
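Editor's sketch of the extent idea described above: an extent is an offset plus a length, and the proxy slices it into blobs. The conversation does not say how the slicing is done, so this sketch assumes fixed-size blobs aligned to multiples of a hypothetical `blobSize`; the `extent` type and `splitExtent` function are illustrative names only.

```go
package main

import "fmt"

// extent is a range of bytes within a volume: where it starts and how long
// it is. Hypothetical type mirroring the description in the conversation.
type extent struct {
	offset, length int64
}

// splitExtent cuts e into pieces that never cross a blobSize boundary.
// Fixed-size, aligned blobs are an assumption of this sketch.
// blobSize must be greater than zero.
func splitExtent(e extent, blobSize int64) []extent {
	var pieces []extent
	off, remaining := e.offset, e.length
	for remaining > 0 {
		n := blobSize - off%blobSize // bytes left before the next boundary
		if n > remaining {
			n = remaining
		}
		pieces = append(pieces, extent{offset: off, length: n})
		off += n
		remaining -= n
	}
	return pieces
}

func main() {
	// A 5000-byte write starting 100 bytes into the volume, with 4096-byte blobs.
	for _, p := range splitExtent(extent{offset: 100, length: 5000}, 4096) {
		fmt.Printf("offset=%d length=%d\n", p.offset, p.length)
	}
}
```

Each resulting piece would then be handed to the storage layer as one blob, which is the step the speaker describes next.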
00:58:38
Speaker
And so the proxy's function is to, when you get a read for a specific range of bytes, we need to figure out which blob we need to go retrieve in our cluster of files or storage servers, basically. And that needs to go fast.
00:58:57
Speaker
Like we can't spend a whole lot of time on that. So you want that indexing operation to go really fast. And then secondly, as with most network attachable storage, we offer snapshots. So you can take a snapshot of your disk. And so this is also managed by the proxy, right? So it essentially has an index of
00:59:24
Speaker
a layered index of offset to a reference like blob ID essentially, which is the reference to the data itself then. That needs to go fast and we do some ref counting on the blobs because when you have snapshots, multiple snapshots, we do do data reuse at that level.
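Editor's sketch of a layered offset-to-blob index with snapshots, as described above. The real index lives in a database and its layout is not described in the conversation; here each layer is simply a map, newer layers shadow older ones, and the `layer` type and `lookup` function are hypothetical names.

```go
package main

import "fmt"

// layer maps a volume offset to the ID of the blob holding that data.
// One layer per snapshot plus one for the live volume (assumption).
type layer map[int64]string

// lookup walks the layers from newest (index 0) to oldest and returns the
// first blob ID found for offset. Data that was not overwritten since a
// snapshot is therefore served from the snapshot's blob, which is how
// data reuse falls out of the layering.
func lookup(layers []layer, offset int64) (string, bool) {
	for _, l := range layers {
		if id, ok := l[offset]; ok {
			return id, true
		}
	}
	return "", false
}

func main() {
	snapshot := layer{0: "blob-a", 4096: "blob-b"}
	live := layer{4096: "blob-c"} // offset 4096 was overwritten after the snapshot
	layers := []layer{live, snapshot}

	fmt.Println(lookup(layers, 0))    // falls through to the snapshot
	fmt.Println(lookup(layers, 4096)) // served from the live layer
	fmt.Println(lookup(layers, 8192)) // never written
}
```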
00:59:45
Speaker
That's kind of weird. You can have multiple live extents pointing at a particular blob. We need to ref count this. When the ref count hits zero, we can go clean up the data. There's a few other things there as well. Functionality for promoting a snapshot into a new volume, all of that is implemented inside of Clojure.
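Editor's sketch of the reference counting described above: several extents may point at one blob, and the blob can be cleaned up once the count reaches zero. The real implementation is in Clojure on top of a database; this in-memory Go version, with its hypothetical `refCounter` type, only illustrates the bookkeeping.

```go
package main

import "fmt"

// refCounter tracks how many extents reference each blob ID.
// Hypothetical type, for illustration only.
type refCounter struct {
	counts map[string]int
}

func newRefCounter() *refCounter {
	return &refCounter{counts: make(map[string]int)}
}

// acquire records one more extent pointing at the blob.
func (r *refCounter) acquire(blobID string) {
	r.counts[blobID]++
}

// release drops one reference and reports whether the blob is now
// unreferenced, meaning its data can be cleaned up.
func (r *refCounter) release(blobID string) bool {
	n, ok := r.counts[blobID]
	if !ok {
		return false // unknown blob: nothing to clean up
	}
	if n <= 1 {
		delete(r.counts, blobID)
		return true
	}
	r.counts[blobID] = n - 1
	return false
}

func main() {
	rc := newRefCounter()
	rc.acquire("blob-a") // referenced by the live volume
	rc.acquire("blob-a") // and by a snapshot
	fmt.Println(rc.release("blob-a")) // snapshot deleted: still referenced
	fmt.Println(rc.release("blob-a")) // last reference gone: safe to clean up
}
```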
01:00:09
Speaker
And so the essence of all these things ends up sort of looking like a reduce over a vector of bytes, basically. And that's a nice way to model that problem. And then where I was trying to get to is, we've got a database that we eventually store that into. So we're using FoundationDB for this one.
01:00:36
Speaker
So, FoundationDB basically contains our actual index of offset to blob. So, it's like your file system table where you keep... Yes, yes, yes. I mean, it's interesting that the concepts that apply to the, I think as Martin Kleppmann's book, you know, like, exploring the database and then moving it into different parts and pretty much similar, if I hear your analogy, I mean, it's similar to
01:01:05
Speaker
If you understand how a hard disk works, and then identify the components separately, and move them into highly scalable systems separately, essentially move the head to somewhere, move the disk to somewhere, move the partition table to somewhere, and then you have a scalable system, essentially. I make it sound easy, but I'm sure it's way more complicated than this. But it seems similar to that kind of ideas
01:01:34
Speaker
Yeah, it's like all the issues with reliably storing data with added benefit of having all distributed systems problems. Exactly. For me, it feels a bit recursive or fractal actually, where it's like we expose a disk which talks to a system that kind of
01:02:00
Speaker
Like you said, is a model after how a disc works, which eventually like still goes onto a disc, like the same things sort of keep coming up. It's like a, it's like a huge, you know, hard drive, but then.
01:02:14
Speaker
distributed into smaller hard drives. And then they're all hard drives. I mean, you could call it virtual even, couldn't you? Yeah, fair enough. So I got just a couple more questions because we were almost hitting the, well, we did hit the one hour mark, but for Albin and Aurelien. So you come from non-Clojure stuff and then I think mostly probably Albin and then Aurelien.
01:02:39
Speaker
by virtue of proximity, touched most of the Clojure crap as well. What has your experience been with Clojure? And how do you feel about the language and tooling? And obviously, my standard question is, which editors do you use? And maybe the final question is, what is the next step? So it's a three-parter, right? So what do you feel about Clojure? But more importantly, what editors do you use?
01:03:07
Speaker
And then less importantly, what is the next step for you? What is the next step in this tech stack or for you? What are you interested in? From my experience with Clojure, I think it's really good at doing some equivalent of map-reduce mutations: to take something and apply mutations and at the end, store it in a database, return it, do some logic at that level. And for that kind of interactions, it's very elegant.
01:03:37
Speaker
But when you're trying to model the side effect parts, sometimes when you're trying to have some performance at the same time, you're going to end up with some dirty Clojure at some point when you need to play a bit with the parallelism and the way we interact with database transactions and all of that stuff.
01:04:05
Speaker
When you end up with functions that are 200 lines of super compact code, inside of a single transaction, it can be a bit tough to read. But I'm not a very big advocate for a given language or another. I just like, yeah, we managed to do it in Clojure. You're on the wrong podcast for it. Yeah. But like, it could be a new language. I'm not a huge advocate for just the language.
01:04:35
Speaker
It's just a tool and it did work, but I'm not strongly in favor of or against it. Indeed, when we start talking about editors, I'm a Vim user. Well, goodbye. It was nice talking to you.
01:04:57
Speaker
Initially, I'm a Vim user, so I tried to use the Vim side there, if I remember correctly. But it was absolutely horrific, and I didn't want to use Emacs, so I went with Visual Studio Code. No, VS Code. It's not the same thing. And it's working fine. Fair enough.
01:05:23
Speaker
There is a reason why you moved to Switzerland. Yeah, I'm not a huge fan of very complex macros and being able to do anything in just using two key bindings and
01:05:45
Speaker
starting at a free thing, like the application is so complex to start in itself that, yeah, I will have to use the REPL connect and that's it. I'm able to work. Yeah, that's a very good point for Clojure. It's like when you want to run tests or stuff like that, the REPL mechanism is so useful. Like you can just go on the production node and you shoot your code instantly and you have directly, you can, um,
01:06:11
Speaker
You have the tracing, you have the logs, you have the metrics, you have the flame graph, which is going to work instantly. And you don't have to recompile everything locally, deploy it and all that stuff. So that's the part I find in the development workflow, which is really, really great on the ecosystem. And like your SSH port forward, and you can just connect them, it just works.
01:06:45
Speaker
Okay, so let's start with the editor for us. So we can shut down the discussion if I get the wrong answer. So I'm an Emacs guy. Yes. Okay. So now I'm going to bribe Wouter to remove, you know, raise and retire tracks completely. So as a disclaimer, I am not a Clojure expert by any means. I've dabbled a bit with it. But, um,
01:07:06
Speaker
How about you, Albin? Aurelien, how about you?
01:07:15
Speaker
There are a few things that I prefer in Go. One thing about the tooling, the quality of the tooling in the Go ecosystem is very high. The ability and how to test and how it's integrated, to do profiling, to do imports and modules, it's all coming as a single bundle where no choice is given to the user. You use the Go tool chain.
01:07:45
Speaker
And that I find, uh, makes for a very pleasant experience. Whereas in Clojure, I mean, I am not an expert, but at Exoscale we had to have discussions on how do we import stuff? What is our package manager? What is our way to run tests? Which differs in every project. Um, so bikesheds abound.
01:08:08
Speaker
Yeah. For that, I think, you know, I like Go's approach. I also like, what I also like in Go is the, and that is probably personal, but I find Go to be more readable than Clojure. It's probably
01:08:27
Speaker
more annoying to write because, you know, everyone knows that you need to write "if err is not nil, return err" everywhere in Go. But it has the nice property that when you read it, regardless of the person that wrote this code, you can read it fine, which I find to be not that true in Clojure. Like if some wizard wrote some Clojure code, there is a high probability that it's impossible to read.
01:08:57
Speaker
possible. The Go blocks make Go a little bit more tricky, don't they? I mean, I've found that myself and I've written Go that Go blocks aren't always super obvious. Go channels. Yeah. Do you mean the ability to spawn a goroutine? That definitely needs some design, like you need to upfront sort of say like this is my concurrency strategy and like you need some
01:09:28
Speaker
I mean, upfront bikeshedding to figure out how you're going to handle that, because if you just spawn random goroutines everywhere, you're not going to be happy. I think it comes a bit with the domain as well, where explicit error handling is really nice if 90% of your code base revolves around handling errors, as it does for the storage layer. What if we couldn't read? What if we couldn't write? What if the database update failed? And do we need to roll that back?
01:09:57
Speaker
You want to explicitly consider all of those failure modes one by one and figure out what you need to do. Do I need to write it again? Do I need to pull in something from replication? There's tons of scenarios. Go having explicit error handling there makes...
01:10:17
Speaker
It's not necessarily a pretty file, but that's the logic of the program. So it's nice that it's all in there and exceptions are not...
01:10:29
Speaker
Exceptions lead to uglier code if you need to handle them all at the level that you have to do when you're handling low-level storage stuff. For me, that's why I go for these low-level agents. For example, in the proxy,
01:10:51
Speaker
There's, you know, we still care about the other errors, don't worry, but like it plays less of a role. And so, because like Clojure is really nice if you just write down the happy path, right? Like if what you have is a pure function, nothing can blow up. Like you get a beautiful piece of code that very elegantly expresses like your idea.
01:11:09
Speaker
Um, the moment you throw IO in the mix and you get it, you know, then it becomes a bit, uh, uh, you know, your code won't look as pretty anymore. Like Clojure can definitely do all that stuff. And it absolutely does all that stuff, but like some of the elegance gets lost. And so.
01:11:29
Speaker
You know, like we carefully tried to balance it. I must echo Aurelien's point of like, when you try to collaborate on large code bases with teams, the fact that Go has like one way to do stuff is nice and it eliminates tons of bike shedding.
01:11:47
Speaker
Because the most beautiful piece of code you've written is your Clojure. The ugliest piece of code you've ever seen is somebody else's Clojure. I'm making fun of it, but there's some truth to that. And so it's nice if
01:12:11
Speaker
If you're doing some, you know, if you are doing like some creative coding, like all of the interactivity is super nice and like the audience is you. So like none of those problems play and Clojure is perfect. If you've got like four, if you've got four teams that, you know, need to
01:12:30
Speaker
If not work on the same code bases, at least be able to share idioms and common language, then you now introduce a bunch of bikeshedding that needs to happen. It's the type of bikeshedding that people tend to have strong opinions about. That makes things way more complicated. I think the social problems of coding, this is one thing that I just started dabbling with Go a little bit.
01:12:59
Speaker
I wrote more "if err != nil" than real code so far in terms of the line count. But I can see when I was looking at the latest talks, and I think there was a talk reviewing Go's history recently, what happened with Go, and whatever. And there were some social problems that were resolved. Otherwise, it has always been like a pain in every project. Every project has a different style. Every project has a different way of doing things.
01:13:28
Speaker
Some projects are macro heavy. Some projects use whatever the style that they want. On the other hand, I think it gives you way more flexibility in terms of thinking and building the code up, but not necessarily covering all the parts, as you're mentioning. I'm wondering if you have written at least one of the Go program that if you write with the same level of robustness in terms of the error parts and everything,
01:13:57
Speaker
It will probably look the same in Clojure, right? In an uglier, quote unquote, uglier Clojure, because you're now using all these "if err != nil" equivalents in Clojure. It's like trying to catch this exception, catch this exception, catch the other exception. I think the thing though is that like Go was written to replace C, which was like essentially to try and make a better systems programming language. It wasn't written to,
01:14:26
Speaker
I mean, people use it these days for high level things, but to me, that's like, you know, I don't know why they do it because, um, to me, you end up writing a lot more lines of code and a lot more junky code than you would in something like Clojure, which is much more suitable for experimentation and, you know, thinking about new things and innovating. Whereas Go to me is like.
01:14:51
Speaker
Yeah, it's good for daemons. It's good for operating system things, it has its place, but you know, it's too verbose in my opinion for these like, uh, more high level programming tasks. So I, so what I wanted to bring up was a bit.
01:15:12
Speaker
I think I, and so I, I, I'm also, I mean, I'm in Albin's camp, like languages are tools for me and like, I will use the tool that helps me get the job done. I have a soft spot for Clojure. Like it's a great tool.
01:15:25
Speaker
And it has its places, but what I wanted to point out was that the fact that Go wasn't really intended to be the next Python. But tons of people are using it in that space and it still survives it, which I think speaks to the power of the language and the tooling. They hit on something that makes people use it outside of the Goldilocks zone that it was designed for.
01:15:49
Speaker
And I think the other way around goes as well, like the fact that we managed to use Clojure for the proxy, which admittedly it was not primarily designed for, although I do think Rich sort of first designed the language to build the database that he wanted.
01:16:06
Speaker
It's not entirely, but basically it's sufficiently far outside of the comfort zone of the language, but it still works well. And so I think that speaks to the power of actually both tools. They're really good tools and they managed to do more than they were potentially initially designed for. But I definitely don't want this to be...
01:16:28
Speaker
they both have their place. And what I hope is that we managed to use the right tool in the right place, basically. Yeah. I think it's understanding the problem and understanding the team, understanding the dynamics, and then picking the right tool, essentially. And you can build high performance systems with Clojure. And we do at Exoscale, like we do. But you have to
01:16:51
Speaker
You know, there's trade-offs to be made everywhere, right? But I think it's, as Albin said, it ended up working out really well. And part of that is also the context, like we had experience using it for proxies and web servers. We had experience interacting with FoundationDB from the language. We had experience interacting with blobd from the language.
01:17:19
Speaker
It was the natural next move to write this in Clojure as well. And it worked out super fine. And like, and as Aurelien said, like.
01:17:28
Speaker
It's also the proxy process in the sense that if we need to give it a bit more runtime power to get those development time efficiencies, then that's also fine, right? Provided we get the latency low enough. That's the one thing we care about. Like the latency for an individual request needs to be fast enough. That's it. So even if like the runtime.
01:17:49
Speaker
operating profile is like a bit more memory and a bit more CPU. That's fine. Cause we don't have a million of these. We've got six or eight. It's okay. While on the daemon side, uh, we've got 1000 BlobD servers at the moment, more or less. So the numbers are different. Yeah, it makes sense. Yeah. But like you say, I mean, you know, if you're writing low level code, there's no way you can use Clojure for that.
01:18:19
Speaker
you know, that just doesn't have the interaction with the operating system.
01:18:24
Speaker
You know, it's a hosted language, so it's either going to use JavaScript or Java. And, you know, I don't think anyone in their right mind would really say that Java is a system programming language. Sure, people have used it for it, but, you know. Yeah, but I know. But I think. No, it's not. You don't have unsigned types, which is painful when you have to work with bytes.
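[Editor's note: the point about unsigned types can be illustrated with a minimal Java sketch. The class name `UnsignedBytes` and the helper `readUint32` are hypothetical names chosen for this illustration and are not from the Exoscale codebase; `Byte.toUnsignedInt` is a standard JDK method available since Java 8.]

```java
public class UnsignedBytes {
    // Java's byte is signed (-128..127). Masking with 0xFF widens it to an
    // int holding the intended 0..255 value.
    public static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Read a big-endian unsigned 32-bit integer. The result is returned as a
    // long because a Java int cannot represent values above 2^31 - 1.
    // Every byte must be masked, otherwise sign extension corrupts the result.
    public static long readUint32(byte[] buf, int offset) {
        return ((long) (buf[offset] & 0xFF) << 24)
             | ((long) (buf[offset + 1] & 0xFF) << 16)
             | ((long) (buf[offset + 2] & 0xFF) << 8)
             | ((long) (buf[offset + 3] & 0xFF));
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xF0;
        System.out.println(raw);                      // prints -16
        System.out.println(toUnsigned(raw));          // prints 240
        System.out.println(Byte.toUnsignedInt(raw));  // prints 240

        byte[] header = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFE};
        System.out.println(readUint32(header, 0));    // prints 4294967294
    }
}
```

[Forgetting a single mask silently yields a negative or sign-extended value, which is the kind of friction being described when parsing binary protocols on the JVM.]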
01:18:50
Speaker
No, I tend to agree, which is why we chose not to. I think the other thing that's lacking to some extent with Java, and it has been for a long time, and that's one of the reasons why Clojure suffers from it as well, is there isn't a super FFI. You guys are using TCP/IP to do it, but another way of doing it would be over FFI.
01:19:18
Speaker
It seems like that's a big problem with the JVM and things like C and Go and other sort of low-level languages is that the FFI is just, it's coming with Project Panama and stuff like this. They're trying to improve their FFI, mostly for things like AI and LLMs and stuff like this and these other kind of like high-level processing things. But, you know, it doesn't seem good, yeah.
01:19:47
Speaker
Any minute now, yes. Yeah. We took a copy here. The FFI in Go isn't great either. I mean, it has significant downsides. You can use it, but...
01:20:00
Speaker
It's going to add complexity to your project and it's going to cost something. Be it in terms of usability, in that the good Go tooling doesn't just work anymore. You need to bring in C tooling. I think it's very difficult to create a good language that is both good in itself and has good FFI.
01:20:23
Speaker
Yeah, I'm looking at Zig at the moment as a sort of potential contender in that space. It does seem to be like hitting a lot of sweet spots there. But anyway, that's, it's not the Zig, it's not the Zig podcast or the Go one for that matter. So the other question that we've probably forgotten Vijay was like the future. Yeah, exactly. So I think that's a nice, nice way to round up the round with the podcast. So what, what is next for you guys? Anyway, both.
01:20:52
Speaker
you know, on your project and then, you know, on your own, in your life. And then, you know, what are you going to do next after, after you close this browser window, you know, near, very near future and very far.

Future of Exoscale's Storage Systems

01:21:07
Speaker
Feel free to answer only the relevant ones, you know.
01:21:12
Speaker
Well, so our system is currently in preview. So any customer can go try it if you have an Exoscale account. Put the small commercial in there. But so we're mainly working towards general availability, which means we're
01:21:32
Speaker
We've got a few background jobs that we still need to go implement to keep the system efficient. We are looking at designing a custom hardware profile to run the system on.
01:21:45
Speaker
Currently, we use the same type of machine that our storage optimized class of instances run on. The idea here was that since we don't know what the system does, because when we needed to order these boxes, we didn't actually have a system.
01:22:06
Speaker
So we chose to order boxes that we could reuse later, basically. And they're a bit overpowered. So we're now kind of taking a look at monitoring the system, figuring out how much RAM does it actually need? What kind of performance profile do we want from all of the disks? How much CPU are we using? How many CPUs do we want? These type of things were sort of like,
01:22:32
Speaker
And then we'll design a custom piece of hardware. I mean, custom, like we'll talk with a vendor and try to order a box that fits what we need. It's not like we designed the hardware. And so there's some polishing and then we want to roll it out to more zones. So like for the product, those are the next few steps. So currently there's one zone. It's scheduled for a second zone before the summer. And then it becomes, I don't know,
01:22:58
Speaker
We need to have some talk with the investors for how many boxes we can buy. OK. So I think that's a nice way to round up our adventures of storage land, I think. Thanks a lot, Albin and Aurelien, for joining us and explaining. And in a way, I'm very jealous of you guys, because all my life, I've been just building CRUD applications and shit.
01:23:25
Speaker
not what they call, quote unquote, real engineering crap. So you're doing an amazing job, I'm sure, I think, as Wouter mentioned. And it's nice how you're using Clojure. I think we had a few folks from, we spoke to some Exoscale people before, and then we were talking about how a little bit different Clojure usage is at Exoscale compared to the rest of the world. And because you're on the,
01:23:53
Speaker
what do you call it, the lower part of the iceberg of the cloud stuff, where the heavy lifting happens. And people just go there, click the button, and then give me RAM, give me, you know, buy more RAM. And you're making that happen. And there are different kind of challenges. So super excited for you guys working on this stuff and having multiple languages to play with and building some solid systems. So thanks a lot for joining us and sharing all the nitty gritty details.
01:24:22
Speaker
I'm sure I learned a lot and then I'm going to build my own block storage system now in OCaml, as you know, I'm an expert now. So it's going to be in French though. So if the system do storage, storage or whatever the fuck that is. Mirage, Mirage. It's going to be the other way around. It will store before you store. The uni-kernel. Yeah.
01:24:48
Speaker
Exactly. But I kind of like the fact that Aurelien and Albin are obviously fresh faced as well, because a lot of Clojure programmers end up, you know, they're sort of old and grumpy. Yeah, this is kind of like, you know, this is the sort of profile of the Clojure programmers in miserable thought like me, you know, exactly. You know, who's like fed up with the world and like retreats into Lisp, essentially, comfort blanket.
01:25:14
Speaker
So it's nice to see that you guys are coming into the industry and you're using Go obviously and things like that, but you're also picking up functional programming and it's good for the future, I think. Yeah. And thanks, Wouter. Thanks for bringing folks here and giving us an opportunity to talk about all these things. I really appreciate it. I know we had some fun with our
01:25:44
Speaker
podcasting system. Obviously it's not, I'm sure it's not built in Go. It's not running on Exoscale. That was a problem. Exactly. No, I really wanted to shine a spotlight on both of these guys and actually like the extended team, you know, like Miguel, Christian, like, um, and then everybody else at Exoscale, but like, cause.
01:26:09
Speaker
Funnily enough, I was listening yesterday and today, I haven't finished yet, to Oxide and Friends, the Bryan Cantrill podcast. They just had an episode out on their block storage system where they talk about their design. It took them four years to do it. They talk about how it joined. They didn't even dare to start because it was too hard.
01:26:32
Speaker
They deemed it was too hard. From my perspective, what was interesting to see was some of the choices they made the same, some of the choices they made different for the same reasons as us.
01:26:48
Speaker
And, you know, we're like, we chose to inherit our BlobD layer because we know how to operate it. Like they chose to use ZFS because, you know, they wrote ZFS when they were at Sun. Um, so like they made the same decision, but like it ended up turning into a.
01:27:05
Speaker
a different design and it was fun to sort of see how some of the issues they were having are absolute non-issues for us and the other way around. Their design ended up engineering away some problems that we are facing.
01:27:25
Speaker
So the funny thing about it is that there are not a lot of block storage solutions out there. Like in the open source world, there's essentially Ceph. Also, the cloud vendors have not published papers on how their system works. There is no paper on EBS. There is no paper on how disks work in Azure or at Google. Well, there's a few papers on EBS, but not on the actual thing that matters. They've got something around how the metadata gets replicated.
01:27:54
Speaker
There are a few presentations by the guys at Alibaba Cloud explaining how it works on that side.
01:28:03
Speaker
But so in any case, like this is something not a lot of companies have. They're very tight lipped about it in general. And it's a very performance critical thing. Like if your disk is slow, like your machine is slow, like it's not. And we managed to, obviously there's been like drawings before that, but by and large, we did a two year effort and we're shipping something to production now, which I think is record speed.
01:28:31
Speaker
And we're hitting the performance that we wanted, so we're sort of on par with other similar cloud providers, like Scaleway or DigitalOcean. They offer the same amount of IOPS and the same amount of limitations as we do. So I think all of this is super nice for what is essentially a V0 or a V1 system.
01:28:53
Speaker
And I don't know. For me, it's a really major achievement. And again, I wanted to spotlight these guys because I didn't write that much code on this one, actually. It's really all the team. So that is the reason why it was finished in two years?
01:29:09
Speaker
Oh, for sure. For sure. It would have taken a bit longer if it was just... So that's kind of why I pitched it, because I don't know. We don't talk that often about how we do things at Exoscale or what we do even.
01:29:32
Speaker
I mean, I'm massively biased, but I'm always very proud of what we do. And I think it deserves, like, a bit more attention. So I thought, like, let's abuse the small little platform I have to shine a spotlight on these guys and all what they do. It's always nice to understand the other side of it, right? Because I think.
01:29:49
Speaker
That's what generates curiosity. That's what generates this. As you mentioned, now people know how this system is built. And they can think about what other problems can we solve with this one. And this is how open thinking and open source and open ideas go around.
01:30:06
Speaker
And obviously, I mean, for the people who are listening, this is in no way sponsored by Exoscale or anything. We don't take sponsorships. But it's more about the engineering knowledge and the kind of problems that we are solving. It will be nice to understand the solutions. It will be nice to understand the trade-offs. It will be nice to understand how things are built. So it's super nice to have you guys and give us the background details and everything.
01:30:35
Speaker
Thank you, guys. Thanks a lot. And yeah, I think we would love to have more and more time. But I think it's almost, we're right on time for our brand. Many minutes. Thanks a lot. And just before we go, I think I'd like to just give a couple of shout outs for our patrons, I think. And there have been people who have been supporting us for a long time now. And there are a few new folks. Alessandro, thank you. And John Sheridan, thanks a lot. And also,
01:31:05
Speaker
Shmuda, I don't know how to pronounce this in German. It sounds like a German name, but thanks a lot for your support. And if you want to sponsor us, go ahead on our Patreon. Not an obligation, obviously. We love to do this with our friends. We love to do it in our free time.
01:31:27
Speaker
and having wonderful guests and it's also learning stuff for us and people who are listening I hope it's also nice experience for you apart from the bullshit that me and Ray produce but most of the time it's way more quality stuff from the guests so thanks a lot guys thanks a lot for joining and thanks a lot for listening and yeah I think this is this is the first episode for this year right
01:31:50
Speaker
I don't know. I don't know. In any case, happy New Year, if this is the first one. Exactly.

Closing Remarks and Humor

01:32:02
Speaker
Yeah. Yeah. Who knows? It's, it's, it might be 2025 and then we have the next episode. All right. Thanks guys. Thank you. Thank you. Thank you. Thanks for the invitation.
01:32:17
Speaker
Thank you for listening to this episode of defn. The awesome vegetarian music on the track is Melon Hamburger by Pizzeri and the show's audio is mixed by Wouter Dullaert. I'm pretty sure I butchered his name. Maybe you should insert your own name here, Dullaert.
01:32:34
Speaker
OK.
01:33:04
Speaker
and see you in the next episode.
01:33:43
Speaker
Fuck what what the hell is this introduction? I thought we were actually gonna like be getting better after 94 fucking episodes, you know