
Migrating Memcache in a time of DEMAND | Ep. 07

Tern Stories

Alright, picture this. It's March 2020. Everybody just went to work from home. You work at Slack, which is suddenly one of three applications keeping the entire business world connected.  And nobody owns Memcache.  

Today's episode of Tern Stories is with Glenn Sanford, who decided he was going to own Memcache and he was going to migrate it.

I am stretching that word today.  

Whether you're an aspiring engineer or a seasoned professional, Glenn's insights will inspire you to take ownership of your projects and strive for continuous improvement!

Transcript

The Shift to Remote Work and Introduction of Glenn Sanford

00:00:00
Speaker
All right, picture this. It's March 2020. Everybody just went to work from home. You work at Slack, which is suddenly one of three applications keeping the entire business world connected. Nobody owns Memcache.
00:00:11
Speaker
Today's episode of Tern Stories is with Glenn Sanford, who decided he was going to own Memcache and he was going to migrate it. I am stretching that word today. Glenn migrated Memcache from bad to okay, and then from okay to good.
00:00:23
Speaker
This is a story of good old-fashioned engineering in one of the highest volume components in an application that you've probably used in the last 20 minutes.

The Role of the CAP Theorem and Observability in Engineering

00:00:31
Speaker
So we're going to talk about the CAP theorem. We're going to talk about what a modern observability stack really does for an engineer. And Dunkin' Donuts.
00:00:38
Speaker
Let's go. On today's episode, I've got Glenn Sanford with me. Glenn was originally a history major, but he's worked at companies like TiVo and Twitter.
00:00:50
Speaker
He founded a company, Turbine Labs, which is where I got to know you. And then we both worked at Slack together, where he was on the infrastructure team. So welcome, Glenn. Hello.
00:01:01
Speaker
It's nice to be here. So you've worked at a bunch of impressive companies. Tell me a little bit about the kind of stuff you worked on.

Glenn Sanford's Career Journey

00:01:11
Speaker
I imagine, just based on that resume, that it's obviously all front end and UX, right?
00:01:18
Speaker
Right. Yeah, no, it's been a fun ride. I actually did some front end throughout my career here and there, and I started my career as a front end coder before I went back to school for CS.
00:01:32
Speaker
Because back then, front end was just not something you needed any kind of bona fides for. You just showed up and said, I can do HTML, and they would hire you. And that's kind of true now, but you still have to, I don't know, have a resume or something. Dot-com times were wild.
00:01:50
Speaker
We could do a whole episode on how front end is now real engineering. Right. Yeah. But it's been a two-decade transformation for it to be taken seriously as a practice, and I'm all for it. In the meantime, I did a bunch of backend coding.
00:02:06
Speaker
So yeah, TiVo was right out of school, and I spent five years there and learned how to be an engineer of any kind: work on deadline, work with people, write documents, do all the things engineers do.
00:02:17
Speaker
Yeah. And I learned a lot there. My mentor was really into the idea that every system should just be one box and you just have N of them; make simple little boxes that do things. Which wasn't microservices yet, because it was 2005,
00:02:38
Speaker
but it looked a lot like microservices later on. A word yet to be invented. A word yet to be invented; it was invented during the time I was at Twitter. And it was service-oriented architecture before that.
00:02:51
Speaker
And somebody decided we should make them smaller and have lots more of them. But Twitter was where I think I really learned big

Scaling Challenges at Twitter and Slack

00:03:02
Speaker
system chops. I went from
00:03:04
Speaker
"when something's broken, I go and look at the logs" to "when something's broken, I go and look at the metrics," because you can't look at logs when you're processing 15 million requests a second. There are just too many logs, right? You can't tail a log file to learn anything.
00:03:21
Speaker
Later on, a couple of years later, you could dump it all into Honeycomb and do structured analysis and all that stuff. But that wasn't possible then. It was, we can count things at our pace. In 2010, we could count things and say this many things went wrong.
00:03:38
Speaker
So Twitter was fun. What was the service you worked on at Twitter? So at Twitter, I worked on a couple of services, but I was fortunate to be on this team of folks who sort of accidentally found themselves rewriting all of the core business logic as we migrated stuff out of
00:03:56
Speaker
the Rails monolith and into Scala-based services. And so I wrote the authorization service there first, which was actually one of the first services carved out of the monolith, and it kind of just withered on the vine for a year or two because we couldn't use it yet. We had no backend services to auth with it.
00:04:19
Speaker
And no front end. We were also separately writing a front end, kind of think Envoy, a sort of ur-Envoy, where the auth would live, but we didn't have that yet either. Anyhow, so I wrote that and it sat there. And then I contributed a lot to the user service there, which we called Gizmoduck for no good reason.
00:04:42
Speaker
And think of it like this: every time you need to turn a screen name into a user ID, or find out how many followers they have, or do anything about a user, we were the service you would call.
00:04:54
Speaker
And I think when I left, we were doing 15 or so million user lookups per second. Because you really need a user for everything. And often you'd need like 10 users.
00:05:05
Speaker
Our average width was 10 users at a time. That was a fun service to build. We wrote it, and the entire time I worked on it, I also managed the team that worked on it for a little while there.
00:05:20
Speaker
The entire time I worked on it, we were just migrating the service from one way it worked to another way it worked to another way it worked. It was sort of there that I developed an understanding of infrastructure engineering as a kind of constant migration from one thing to another as business requirements change, as technology becomes available and old stuff looks older.
00:05:48
Speaker
Which seems relevant to the stuff you tackled at Slack.

Memcache: Infrastructure and Challenges at Slack

00:05:52
Speaker
So the migration that I want to talk about today... I feel like I'm interested in this, but we were both there at Slack at the time, and there was this enormous amount of work that went into memcache, which is very much at the core of Slack.
00:06:10
Speaker
It's true. So yeah, let me talk about how I found myself thinking about that, because I didn't get there immediately. I'll try to be as brief as I can. We both joined Slack in 2018, and the first project I worked on... Slack had just gone through this shift that a lot of companies go through, from "I press a button and I deploy everything all at once."
00:06:35
Speaker
And YOLO, things happen, but it's a small company, who cares. To: when I press that button, it causes a massive site outage for everybody for a minute, so I need to be more deliberate and careful about my software rollout. And also, oh, sometimes when I press that button everything breaks, and rolling back is harder and more dangerous than it used to be, et cetera. So they had moved to this
00:06:56
Speaker
staged deployment process that took half an hour. And all of a sudden, everything they had stored in checked-in config files, like what host to talk to where, things they used to be able to change instantly,
00:07:07
Speaker
now takes an hour for them to change. So when your database goes down and you need to patch the file to talk to some other database, because there wasn't service discovery yet...
00:07:18
Speaker
That used to be something you could just go do. You get paged, you wake up in the middle of the night, you press a button and you fix it. And now you have to do a deployment that takes 30 or 40 minutes. And so that was a problem.
00:07:29
Speaker
And so there were a bunch of interesting problems that fell out of that in the kind of configuration plane of the Slack application for things that needed to get carved out and made faster.
00:07:42
Speaker
And so the first project I worked on when I got there was this system for shipping database configuration around that could be changed instantly. It was very simple: write a file to S3 and flip a little key in Consul that everybody's watching.
00:07:59
Speaker
And now everybody downloads the new file. And so it felt like a really powerful lever that I had sort of incidentally been shown. And then, very briefly, I was assigned to Memcache in my first couple of months, and it was like, hey, nobody owns Memcache, do you want to own Memcache? And I was like, absolutely not. That is not the right way to solve the instability of a service. The answer to "nobody owns this, do you personally want to own it" is always no.
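As a rough illustration of that lever, here is a minimal Python sketch of the pattern being described: clients block on a Consul key and, whenever it changes, pull the new config file down from S3. The key name, bucket, and paths are hypothetical stand-ins, not Slack's actual setup.

```python
import time
import boto3
import requests

# Hypothetical names; stand-ins for whatever the real system used.
CONSUL_KEY = "config/db-topology-version"
BUCKET, OBJECT = "example-config-bucket", "db-topology.json"

def watch_and_download():
    """Block on a Consul key; whenever it changes, fetch the new config from S3."""
    s3 = boto3.client("s3")
    index = 0
    while True:
        # Consul blocking query: returns when the key changes or the wait expires.
        resp = requests.get(
            f"http://127.0.0.1:8500/v1/kv/{CONSUL_KEY}",
            params={"index": index, "wait": "30s"},
            timeout=40,
        )
        new_index = int(resp.headers.get("X-Consul-Index", index))
        if new_index != index:
            index = new_index
            s3.download_file(BUCKET, OBJECT, "/etc/app/db-topology.json")
            # The application re-reads the file on its next request.
        time.sleep(1)
```

The appeal of the pattern is that a change propagates in seconds, without going through the 30-to-40-minute deploy pipeline discussed above.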
00:08:35
Speaker
Never. The answer is always no. But I agreed to look at it and take a crack at identifying some things that might be improved. And I pitched some work that was interesting work to do, but that was so much lower priority than almost anything else we could do that
00:08:55
Speaker
immediately something else shiny happened and I was moved on to some other project. What do you think was the motivation? If there are known problems with Memcache, I don't want to say instability, but known problems, what's the rubric the business was using at that point to figure out, yeah, that's not worth fixing, we're just going to let it burn? I don't think it was even that careful in its analysis.
00:09:25
Speaker
I think Memcache is a thing, and we should do an aside to talk about what it is and what it does for a minute, so pin that. But Memcache, as a system, is fairly simple conceptually, and mostly the implementation didn't break.
00:09:42
Speaker
And when it did, it was spectacular. But there was never
00:09:53
Speaker
a set of people who could reasonably be assigned to fix it, because in a growing company there were always other things that were more important to scale out.
00:10:05
Speaker
There's this enormous database migration that Slack underwent, from this kind of dual-master replication thing to Vitess. And
00:10:18
Speaker
that was always more important than anything else; getting that migration going and then running Vitess was always more important than anything else. Let's take a minute, though. Let's take like two minutes. What's Memcache? Because honestly, it's in the depths of things for a lot of people now. Yeah, I guess. What is Memcache, and then what...
00:10:39
Speaker
What is Memcache at Slack? Because I think the answer to that is slightly different. Both good questions. And I'm going to talk about different pieces of that stack throughout. So Memcache is a very simple little binary that stores key-value mappings in memory. You say, set this key to this value, and you can say, get this key.
00:11:00
Speaker
The protocol is more complicated than that, but not a lot more complicated. And it's a text protocol, very simple, just over TCP. It's a dead simple service. And that's kind of why I think it has endured: that part of it is rock solid and really hard to get wrong.
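To make "dead simple text protocol over TCP" concrete, here is a minimal Python sketch of one set and one get against a memcached assumed to be listening on localhost; the key and value are made up.

```python
import socket

def demo():
    s = socket.create_connection(("127.0.0.1", 11211))
    value = b"hello"
    # set <key> <flags> <exptime> <bytes>\r\n<data block>\r\n
    s.sendall(b"set greeting 0 60 %d\r\n%s\r\n" % (len(value), value))
    print(s.recv(1024))  # b'STORED\r\n'
    # get <key>\r\n  ->  VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n
    s.sendall(b"get greeting\r\n")
    print(s.recv(1024))
    s.close()

if __name__ == "__main__":
    demo()
```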
00:11:20
Speaker
Yeah.
00:11:23
Speaker
So I'll talk a little. So there's LAMP stacks. Have we talked about LAMP? Does anybody remember LAMP? What is a LAMP stack?
00:11:34
Speaker
For those who never experienced it, a LAMP stack was how you built websites in 2008, or maybe '07. And it was Linux, Apache, MySQL, PHP.
00:11:44
Speaker
And there was a little lowercase m after the big M that was memcache, because memcache was the first thing you would drop in when you started to need the database to be faster than it could be, because reading from memory is faster than reading from disks. The conventional wisdom in the LAMP-stack development universe back then was just slap memcache on it and it'll be fine.
00:12:11
Speaker
Right. And PHP made this fairly easy to do. It had native Memcache bindings that you could just use, so it was really easy.
00:12:22
Speaker
So memcache is great as long as what you need to store in memory fits in memory on the one box, or you don't need high availability. And life gets kind of weird when neither of those is true anymore.
00:12:37
Speaker
And memcache itself provides, or at least until recently provided, no affordance for "I want to have 10 memcaches and I want to stripe my data across them," or anything like that.
00:12:49
Speaker
It naively supports horizontal scaling, because you can always add more memcaches, but you need more complexity to support that scaling. You need something, basically, to spread the data across those memcaches.
00:13:05
Speaker
The standard thing to do there is called consistent hashing, where you imagine your memcaches as a circle, a ring,
00:13:19
Speaker
and you hash the keys with some one-way function that is purposefully built to distribute keys evenly across that ring. And in particular to distribute them so that when you add new hosts to that ring, there are some guarantees about where the data goes, so that you don't end up with weird inconsistency problems.
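A toy Python sketch of that ring, to make the idea concrete: each host gets many virtual points on a circle, a key is served by the first host point clockwise from the key's hash, and adding a host only takes over the keys that land in its new arcs. Host names and point counts are arbitrary.

```python
import bisect
import hashlib

class Ring:
    def __init__(self, hosts, points_per_host=100):
        self.ring = []  # sorted list of (hash, host)
        for host in hosts:
            self.add(host, points_per_host)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, host, points_per_host=100):
        # Many virtual points per host smooths out the key distribution.
        for i in range(points_per_host):
            bisect.insort(self.ring, (self._hash(f"{host}#{i}"), host))

    def lookup(self, key):
        # First ring point clockwise from the key's hash owns the key.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["mc1:11211", "mc2:11211", "mc3:11211"])
print(ring.lookup("user:12345"))
```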
00:13:44
Speaker
You need some software for doing that hashing and spreading it out and knowing what the hosts are and all that stuff. And there's this thing that Facebook wrote that Slack used called mcrouter, with the "mc" as in memcache. And it does a lot of the routing for you.
00:14:01
Speaker
Its configuration file is JSON, and it just goes off and does its thing. So Slack used McRouter because Slack was also halfway through a migration from PHP to Hack, which is the Facebook-invented, type-checked PHP language. So they were kind of just on the Facebook technology train.
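For a sense of what that JSON config looks like, here is a sketch of the general shape, written from memory of the open-source mcrouter docs (check those before relying on it): a named pool of memcached servers and a route that hashes keys across the pool.

```python
import json

# Assumed shape of an mcrouter-style config: pools of servers plus a route.
config = {
    "pools": {
        "main": {
            "servers": [
                "10.0.0.11:11211",
                "10.0.0.12:11211",
                "10.0.0.13:11211",
            ]
        }
    },
    "route": "PoolRoute|main",
}
print(json.dumps(config, indent=2))
```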
00:14:28
Speaker
So you've got your memcache, you've got your mcrouter, and you had your config file that just existed. We were talking about deployments a minute ago.
00:14:40
Speaker
How your web apps knew which memcaches to talk to was a thing fixed in space, checked in. And whenever anything went wrong, you would just add a new host or whatever.
00:14:54
Speaker
And that broke as soon as the staged deployment went in. So there was this really wonky old Python 2 system that good engineers wrote in a hurry because they had to.
00:15:07
Speaker
I don't want to offend anybody. It was exactly what the situation needed, and then it lasted for years and years and years. Longer than it maybe should have, but that reflects more positively on the engineers that wrote it, right?
00:15:20
Speaker
Maybe. So it was a system of leader election that would basically allow the memcaches to wake up and say, I'm going to occupy this slot in the consistent hashing ring.
00:15:32
Speaker
And that was all written into Consul. And then everywhere you needed that routing information, you had consul-template rendering
00:15:42
Speaker
a config file. It's kind of a mess. Anyway, I have blurred a discussion of Memcache into a discussion of Slack's usage of Memcache, but I think you get the idea. Yeah, the fundamental point is that Memcache, the technology, is built for a single host, and there aren't many affordances within the core technology.
00:16:01
Speaker
And Slack's use of Memcache was fundamentally about being able to use more than one machine to cache its data. Right, hence McRouter and the usage there. And...
00:16:12
Speaker
We were talking about how you'd seen some problems with the deploy here right after you joined Slack, and they were promptly deprioritized and you were told to do something else, if not own the whole thing. But we did decide to do this show because you did do a bunch of work on it. Right.
00:16:37
Speaker
So what changed? So Memcache was kind of, I refer to it as, a volunteer firefighter organization that ran Memcache for a little while. And eventually it got reorged over to the data stores team.
00:16:51
Speaker
And I think the reasoning was: every time Memcache breaks, you guys get paged, so you should own it. Sure. And the organizational problem with that is they got paged a lot, and they never had time to do anything other than poke at it with a stick and say, why you break?
00:17:06
Speaker
Yeah. So I went off and worked on the asynchronous job queue for a little while. And I'm going to take one minute to say that the lever I was talking about earlier, delivering dynamic configuration, I was able to take that lever and apply it to configuring this job queue in a way that convinced me it was a really useful,
00:17:33
Speaker
repeatable pattern for owning and distributing configuration for a service. Then, through an accident of hypergrowth, my manager took on the data stores team.
00:17:46
Speaker
And then there was this spectacular confluence of different incidents that resulted in someone saying we should add load shedding to McRouter, in a way that... I'll talk about why that was not a possible thing to do, but somehow they said it loud enough to manifest funding to get some people together.
00:18:08
Speaker
And so my boss was like, can you go and figure out if there's any there there, and can you collect a team of people together to work on this sort of stuff for a little while? They called it a squad, because infrastructure organizations have a tough time reorging quickly.
00:18:30
Speaker
So I had myself and a few other people who were all looking at load-shedding-related things, and I was nominally running that roadmap for a little while.
00:18:43
Speaker
And what super quickly became apparent to me was that there was so much wrong with the memcache stack that fixing the one piece being proposed was (a) insufficient and (b) actually totally impossible.
00:19:03
Speaker
I kind of used that project as a shield to say, I'm going to spend 18 months or two years on this and I'm going to fix it, where fix means:
00:19:19
Speaker
make it operable and understandable and, chiefly, owned by a team that is well positioned to do something useful with it, keep it running, and not be terrified every time it breaks.
00:19:32
Speaker
I didn't have any guarantees that I was going to be able to do any of that, but that was my exit criteria. I'm going to work on this.

Improving Memcache Operations

00:19:41
Speaker
I'm not going to take the pager for it,
00:19:46
Speaker
because it's just me. So data stores kept the pager. And my solemn oath to them was: I'm not going to make it any worse for you. I'm going to make it better until it's something manageable, and then I'm going to take it away from you.
00:19:57
Speaker
And they were so pleased to just not have to think about it for a while that that worked fine. So cool. So we sort of arrive at the beginning. Yes, sorry it took so long. No, I mean, this is...
00:20:11
Speaker
I mean, a big question of every migration is: why are you doing this? Because if you're going to have someone spend years of their life on it, it is worth understanding that. Yeah. And what I saw when I finally was able to look at it is there was so much wrong that there were multiple big changes that needed to happen. And so the biggest challenge, I think, initially was: how do we
00:20:34
Speaker
figure out what to do first, how do we stage it in a way that feels sort of safe, and how do we not sign ourselves up to do things that we can't actually do yet because there's other stuff in the way? And a useful organizational approach to this that I adopted from Twitter is this thing called good-bad-okay. Tell me about that. What's that? Good, bad, okay. So,
00:21:01
Speaker
a colleague of mine at Twitter named Ryan Greenberg invented this, or at least co-opted it in a way that felt like invention within the company, I don't know which. And it was this way of describing an effort as a transition from one state to another. And the three possible states are bad, okay, and good.
00:21:21
Speaker
And the idea was that if you could honestly talk about which of those state transitions your job fell into, you could prioritize it, right? So if you think about it,
00:21:35
Speaker
you never want to go from okay to okay. My thing is fine; this other thing is going to be also fine, differently fine. That's a waste of time, probably.
00:21:48
Speaker
That doesn't seem worth doing. Right. But the biggest one for me is: you never want to start working on good until you've fixed all the things that are bad. Find all the bad things and make them okay before you try to make anything good.
00:22:02
Speaker
Because otherwise you're going to be working against yourself, because you're going to be supporting all those bad things as you try to make the good. That was the trap I think data stores fell into: there was so much bad, there was no chance to make anything good. They were just cyclically fighting themselves.
00:22:24
Speaker
You need to buy your own bandwidth to go work on the thing that you want to work on. No matter how good the end result will be, if you can't work on it, you can't get it done. Yeah. You obviously don't want to go from good to okay or okay to bad, I think those are self-explanatory, but the one that I think is subtle is you don't usually want to go from bad to good.
00:22:46
Speaker
You want to have an okay step in the middle, or several okay steps in the middle, because when you go from bad to good, you introduce various axes of risk, right? The biggest is the classic second-system failure of, we're going to make everything better this time.
00:23:02
Speaker
We're going to fix it all. There's this enormous schedule risk that you bring in when you say we're going to fix it all with a new system. That's a bad kind of bad-to-good.
00:23:18
Speaker
So the trick with memcache here was: let's articulate all the bads, and then let's figure out which of them is both truly bad and tractable to fix. And let's fix that stuff first. Let's fix all of that bad stuff before we start to try and make any part of it good.
00:23:37
Speaker
And going back to it, the doc I wrote when I first got there was actually an articulation of a good future state. And I really wanted to get there. But I was literal years away from getting there.
00:23:50
Speaker
And so it was just, how do we chart that course? How do we navigate our way there? So as you're thinking about this, you've got your good desired state, and you've got your bad current state, which, I don't think you said out loud, but this was also around 2020.
00:24:08
Speaker
So there was a wild uptick in traffic as well, right? Yeah. This was about six months after the big step up. And we hadn't really recovered. Everything was kind of on fire all the time.
00:24:24
Speaker
Mm-hmm. So there was an obvious bad state in just the sudden step change in scale that COVID drove. Sort of externally bad. Yeah. How did you think about the good state?
00:24:40
Speaker
What did the good state look like with that as the problem statement? Well, to me, I think it's worth just articulating some of the things that were the bad state, and then we can talk about what would have looked good in comparison.
00:24:55
Speaker
So the biggest bad state was inconsistent staffing. We were under this really tight deadline from an external customer to encrypt all of our traffic, which is a good idea anyway, because it's got messages and users and all kinds of stuff that you really don't want in plain text, even over a controlled private network. So we needed to encrypt our traffic.
00:25:19
Speaker
McRouter, I mentioned, was running on this four-year-old build that somebody had cobbled together with a couple of patches on top of the open source project, and we had no way to deploy it or update it or anything.
00:25:34
Speaker
There was no observability, really. We had no idea. There was kind of this hodgepodge of different ways of looking at it, but we couldn't see it. And that was a big problem. It broke all the time.
00:25:47
Speaker
Except when it didn't, but nobody wanted to touch it, because when you touch it, it breaks all the time.
00:25:54
Speaker
And then there were just many different layers of attempts to fix it by trying things, many different experiments layered on top of each other.
00:26:06
Speaker
We had multiple different versions of McRouter sidecars running on the web apps. We had two of them, one on TCP and one on a socket. And then also the web app had a built-in McRouter that would talk to them. And then also there was a lot of, what if we tried this, but then never finished? Yeah.
00:26:26
Speaker
And that's because of the staffing problems. So there was a lot of stuff to machete out, but how do you do it? It's repairing the plane in flight, right? The plane that has multiple smaller planes inside it, because we thought maybe adding more wings would help. Yeah.
00:26:44
Speaker
So you can see a good state: there's a team there. The traffic is encrypted. We have a single version of McRouter running that's up to date. We have enough observability to understand what's going on.
00:26:55
Speaker
We no longer run configuration management on a two-versions-old Python leader-election emergency script from four years ago. We'd like to be able to do things like actually change the configuration of McRouter dynamically, so that we can do interesting things like availability zone isolation and stuff like that.
00:27:19
Speaker
Really interesting stuff that you can't do if you can't ever change the config file, because the thing that uses the config file is four years old and the system that delivers the config file is cobbled together. So lots there. But like I said, you start with all the stuff that's truly bad and work your way there.
00:27:37
Speaker
And of the two things that were truly bad for us, one of them was like a smell, and one of them was business.
00:27:49
Speaker
And the business one was: we had to get the traffic encrypted, because we had this hard deadline for a customer who was not going to sign up with us if we didn't do it. It had to happen.

Encryption Challenges and Solutions at Slack

00:28:02
Speaker
Business requirement, had to happen. And then the other one was, we really wanted to upgrade McRouter; we wanted one single version of McRouter running.
00:28:13
Speaker
We hypothesized that it would solve problems, but we knew that the original proposition for the team, if you'll recall, was, can't you add load shedding to McRouter? And the answer was absolutely not, not until we could actually build it, right?
00:28:29
Speaker
You can't go and modify code until you can build it and deploy it, right? Well, you can, but it's a thought exercise. So that was our other lowest-tier bad: get McRouter to a point where we can build and deploy it and even think about making changes to it.
00:28:48
Speaker
So I took on the encryption thing. My colleague Mark McBride, who I imagine will be a future guest, was my co-founder at Turbine and an old friend, and was also at Slack with us. He took on
00:29:03
Speaker
the McRouter work. And I'm going to talk mostly about Nebula to start with, and then I'm going to hand-wave over his McRouter work, because he's much closer to it than I was. We'll have to get him on the show. I'll ask him about it. There's a lot of good stuff in there.
00:29:18
Speaker
So Nebula. What is Nebula? Yeah, let's talk about Nebula. So there are lots of ways to encrypt traffic, and the most common one is probably SSL and TLS. You have a cert and I have a cert, and we talk to each other to negotiate a key, and then we send traffic with that key.
00:29:39
Speaker
That's kind of heavyweight for provisioning new hardware, because you have to have a cert for every host that you provision. It's hard to adopt, it's hard to make all systems support TLS broadly; some of them support it better than others, but it's a headache.
00:30:00
Speaker
And Nebula was sort of invented to make that headache go away. It's a system conceptually similar to a VPN, but it's basically an encryption system that looks like another network device on your box. So it looks like your other Ethernet interface.
00:30:18
Speaker
But when you talk to that Ethernet interface, the data is guaranteed to be encrypted. Let me take a brief aside to talk about the security folks at Slack, the network security team.
00:30:34
Speaker
Really good, smart people, who understood that
00:30:41
Speaker
the company were customers of their effort, as opposed to the other way around. Yeah, it's such an important distinction. Right. And they understood that their job is to make everything secure, but the company's job is to ship products and do their jobs. And if you're constantly saying, you've got to fix this, do this, do that, this is wrong, you're bad,
00:31:03
Speaker
that's not a good feeling for anybody. So they were really exceptional at trying to find ways to make it less onerous for their customers to solve the problems that they legally needed to solve for security.
00:31:17
Speaker
And Nebula, I think, is sort of that made real in tech, in that it mostly just worked. And it wasn't very heavyweight. All you had to do was turn it on, and the system would negotiate the keys and figure everything out; there was a lighthouse that would distribute them or something. I don't know, I never really spent the time to figure out how it worked, because I didn't have time.
00:31:43
Speaker
You didn't have to, and that's half the point. Right. It's another interface: use the interface and your traffic will be encrypted. Right. So Memcache was fairly late on the adoption curve.
00:31:56
Speaker
Mostly because it didn't work very well for Memcache. They had tried to turn it on; I think we were in the third year of attempts to turn on Nebula for Memcache. And what's really unique about Memcache, more than almost any other service in the system, is that it's incredibly high bandwidth.
00:32:18
Speaker
Right. So the usage of memcache went from, there's this one thing that we look up in the database a lot, let's put it in cache and then it'll be faster, to, cache is the thing that keeps our entire data set from knocking over the database all the time.
00:32:34
Speaker
So the usage is incredibly high. Every Slack API request has 10 or 20 or 100 memcache requests under the hood.
00:32:45
Speaker
And they're all super fast. And some of them have a lot more data than you'd think; people store all kinds of weird stuff in cache. And Slack was not a place where it was possible to enforce usage
00:33:02
Speaker
restrictions in cache, really. The team grew up too fast, and the people who owned the cache didn't even really know they owned it. So people just put all kinds of weird stuff in there.
00:33:14
Speaker
And so it turns out that the hard part of making that system work is that you have to encrypt all of that network bandwidth, and that costs CPU.
00:33:27
Speaker
And I just don't think anybody had had the time to sit down and figure out how much CPU it was going to cost for Memcache, because for every other system they had used Nebula for, it was kind of just fine; their CPU went up a little bit. But the amount of extra effort required to encrypt all the traffic back and forth made Memcache sad, and also made the web apps that were doing the same encryption very sad. On both sides of that network boundary, it was harder,
00:34:00
Speaker
but especially on the memcache side, because it was handling requests from all the different web apps. Okay. So they turn it on, it breaks; they turn it on, it breaks; they turn it on, it breaks. And my boss just said, I need you to go figure out why it breaks. The traffic needs to be encrypted. So when you say it breaks, what... Yeah. Well, the first thing I did was build a load test for it.
00:34:24
Speaker
We were using this really fun thing called memaslap. There was a memslap, which sort of makes sense, it was memcache slap, and then somebody wrote a better version of it called memaslap. It's now very old tooling, but I rolled it up into a little container, put it in a Kubernetes network, brought up some memcaches and slapped them around, and then turned on Nebula and slapped them around.
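memaslap is the real tool here; purely as an illustration of the load pattern it generates, a crude Python stand-in (not memaslap itself, and with made-up addresses, key shapes, and ratios) might look like this:

```python
import random
import socket
import threading
import time

ADDR = ("127.0.0.1", 11211)
DURATION = 10  # seconds
counts = []

def worker():
    s = socket.create_connection(ADDR)
    payload = b"x" * 100
    n = 0
    deadline = time.time() + DURATION
    while time.time() < deadline:
        key = b"key%d" % random.randrange(100000)
        if random.random() < 0.1:  # roughly 10% writes
            s.sendall(b"set %s 0 60 %d\r\n%s\r\n" % (key, len(payload), payload))
        else:
            s.sendall(b"get %s\r\n" % key)
        s.recv(4096)  # crude: assumes each response fits in one recv
        n += 1
    counts.append(n)
    s.close()

threads = [threading.Thread(target=worker) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"~{sum(counts) / DURATION:.0f} ops/sec")
```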
00:34:56
Speaker
Crucially, I checked the tool in and I wrote some documentation for it, and that was super handy down the road; I'll foreshadow that. But what I found was that it didn't matter what hardware you were on.
00:35:09
Speaker
When you saturated Nebula, you were only able to use two cores, 200% CPU, two full cores. And in talking to the security people, I learned that they had a thread for reading traffic and a thread for writing traffic, because that was enough for almost anything. It was written in Go. But that meant it kind of doesn't matter how good your hardware is or how many cores you have.
00:35:37
Speaker
You're only going to use two of them for Nebula, and you're going to use them all the way. Mm-hmm. And so what I found was, I found some hardware that had slightly faster single-core CPU speed.
00:35:50
Speaker
And I found that if I got 16-core boxes with four cores over here for memcache and two over here for Nebula, they would mostly not step on each other. That would minimize crosstalk.
00:36:01
Speaker
The extra cores didn't do any good CPU-wise, but they sort of encouraged the OS not to schedule those two things in the same place. And that was the best I could do. And what was going to happen was, we were going to need so much more hardware.
00:36:18
Speaker
Just so much. And so I said... What were you starting from? Give me a sense of order of magnitude. Did you need 20% more? Let me see what I can find. Yeah, here, I have the numbers. We started with
00:36:38
Speaker
180 active hosts and 50 spares, and these were r5.2xlarges. So these are memory-intensive machines, but not super high CPU-wise. A 2xlarge is not that big.
00:36:55
Speaker
And we went to 540 with spares, and they're c5n.4xlarges. So that's 16 cores; it's compute-oriented hardware.
00:37:07
Speaker
And you have enough memory, because of course you do, because you have so much hardware now. Right. And the CPUs, if I was hearing you correctly, the CPUs are half idle, because all you're doing is encouraging the scheduler. Mostly idle, yes.
00:37:22
Speaker
Yes. My boss and I used to joke that my job was to figure out ways to spend money to solve problems. Congratulations. And here's the thing, though: it seems like a lot.
00:37:35
Speaker
It was still only about 8% of what we were spending on storage. And my internal heuristic is, if you spend 10% on cache to save the rest of your storage from completely falling over, that probably makes business sense, right? So in my head I was like, I've got to stay under that 10%. As long as I can say to someone with a straight face, this is 10% of what you're spending on disk, that seems okay. That seems fine, especially if it's getting us to this customer requirement.
00:38:05
Speaker
Not bad, not good, but okay. Yeah. So we did this. And this was the first time. So, growing the ring: we talked about the cache ring. Every time you replace a host in the cache ring, that host comes up cold with nothing in its cache, right?
00:38:22
Speaker
So it's got to fill up. And there's this period of time where things are trivially slower because some percentage of your traffic is falling through to disk because the data isn't in memcache yet.
00:38:33
Speaker
Yeah. Okay. So when you grow the ring, it's kind of the same thing, right? You're widening the ring by one, which means whatever that percentage of traffic is doesn't go to all those other hosts now; it goes to your new one.
00:38:49
Speaker
And so that new one comes up cold. So there's a pace at which you can do this. If you do it too fast, the database melts down. If you do it too slow, it just takes a million years. So I kind of hand-waved, figured out what that pace was, and went off and did something 500 times. I wrote a script for it, because you write a script for things, right? But you had to be careful, because the script required: do you want to do this now? Yes. And then I go look at charts, and then yes, and then I look at charts and wait two more minutes and click yes, and it gradually widens the ring to 540.
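A minimal sketch of what that human-in-the-loop loop might look like; the host names and the provisioning command are hypothetical stand-ins for whatever actually registered a host in the ring.

```python
import subprocess
import time

# "Make the donuts": widen the ring one host at a time, wait for the new host
# to warm up, and make a human confirm against the dashboards before each step.
NEW_HOSTS = [f"memcache-{i:03d}.internal" for i in range(181, 541)]

for host in NEW_HOSTS:
    answer = input(f"Add {host} to the ring? Check the charts first [y/N] ")
    if answer.lower() != "y":
        print("Stopping; resume later from this host.")
        break
    # Stand-in for whatever actually adds the host to the hash ring.
    subprocess.run(["./add-host-to-ring.sh", host], check=True)
    print("Waiting for the cache to warm and the database to settle...")
    time.sleep(120)  # then go look at the charts again
```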
00:39:31
Speaker
And I did the same thing to swap the existing old hardware to the new hardware. It's a whole thing. I did this the first time, and immediately upon having done it, there was a set of security tickets that required us to rebuild the whole cluster. We couldn't apply the security patches; we needed new hardware.
00:39:54
Speaker
And so I had to go and restart that whole ring of 540 hosts at that same pace; I could do one every 90 seconds, I could press the button. And my wife is like, can't you get computers to do this for you?
00:40:07
Speaker
And I was like, I know, but you need a human to go and look at charts and say it's okay to do the next one. And so we referred to this process as making the donuts, from the old eighties Dunkin' Donuts commercial where the guy with the mustache wakes up and goes, time to make the donuts.
00:40:26
Speaker
And I got real good at making donuts. I rebuilt that cluster probably 20 times over the course of the next year, for various reasons. I'll get to... I fixed that eventually. But okay, so we turn Nebula on.
00:40:41
Speaker
Or, slowly. 10%, 20%, 30%, 50%, 75%. We get there. Everything's looking good.
00:40:52
Speaker
We go to a hundred, and everything falls over, explodes. Total ruin. Why? So we had to turn it back off. And I was talking about observability before. We turn it all back off and we're literally dead in

Enhancing Observability with Honeycomb

00:41:09
Speaker
the water. We have no path forward.
00:41:11
Speaker
on how to get this done. So we start figuring out how we can see things better. We have some hypotheses about what might have gone wrong. But the biggest issue is that the metrics we have all measure, at best, per second.
00:41:26
Speaker
Right. A lot of them are per minute, but we can see spikes in traffic at the second level. We can tell that at the top of every hour, for example, there's this enormous spike in traffic over the first few seconds.
00:41:39
Speaker
But we don't have a lot of observability into what kinds of keys are being accessed, and when, and how often. And our hypothesis is that hidden in here are some very high throughput request patterns that we don't understand.
00:41:56
Speaker
And in fact, we already knew about some of these. For one of our biggest customers, the team's workspace ID key gets looked up all the time, like literally hundreds of millions of times per second.
00:42:08
Speaker
So we actually store that in memory on the web apps most of the time. Oh, wild. Interesting. Host-level caching. Because we know it would just destroy the memcaches instantly, even in plain text. Because of the way the keys are distributed, when you look up one key, you're talking to one memcache host for that whole pile of traffic.
00:42:31
Speaker
So it doesn't matter if you have 500 instances; that one host is going to get every single request. That one host is screwed, right? And that one host is in extra trouble now, because it's got to hand back that key encrypted.
00:42:46
Speaker
So it's just immediately going to saturate those two cores, right? You're just done. So we theorized that there were some access patterns we didn't understand that we needed to. And so we started in on...
00:43:04
Speaker
Here's where I want to talk a little bit about build versus buy, just really quickly, and also plug Honeycomb as a product. So Honeycomb is a product for collecting traces, or structured log events, with very high cardinality in terms of labeling, and being able to query them and show charts at a sub-second granularity.
00:43:30
Speaker
Yeah. Before, it was per minute. Our old metrics were per minute, and that was totally useless. So we wanted to get down into seconds and milliseconds, and Honeycomb was the tool for this.
00:43:44
Speaker
And it's the sort of thing where Twitter built their own Honeycomb because they had to, because Honeycomb didn't exist yet. No one should ever build that inside a company again.
00:43:55
Speaker
People should use Honeycomb, or hopefully there will be one or more competitors. I think Datadog kind of does something similar now. We actually had Ian on the show last week talking about how Datadog specifically built a system that is, you know, equal or better, depending on who you ask, to the Honeycomb backend. So, yeah, absolutely. Great.
00:44:16
Speaker
But they're a company whose lifeblood is a system like that, right? So they'd better have a good one, right? Yeah. And they do. But Twitter should not build that. Slack should not build that. No one should build that sort of high-cardinality, labeled time series database ever again. And I believe there are whole classes of software for which this is just true.
00:44:39
Speaker
And everybody wants to build that because it's fun. But you cannot, as a business, support building that anymore. In particular, even if it seems like it's going to be easy to build, you're paying a long-term fee to support and maintain and own a system. Just don't do that.
00:45:00
Speaker
Pay Honeycomb, pay Datadog. Don't ever do that again. Anyway, Honeycomb was transformative for us in terms of what we could see. In particular, because we could label every point, we instrumented every single operation and we labeled it with key prefix and key suffix and all sorts of other labels, because Honeycomb charges by the point, not by the number of labels on the point.
00:45:23
Speaker
So we just threw the book at it. And what was amazing was that we could then go and slice and dice all this stuff and figure out where the bad traffic was coming from. And we found a bunch of keys, keys like the one I talked about, that were accessed literally millions of times per second.
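A sketch of what "label every operation" could look like using Honeycomb's Python SDK; the field names and key format here are illustrative, not Slack's actual schema, and the write key and dataset are placeholders.

```python
import time
import libhoney  # Honeycomb's Python SDK (pip install libhoney)

libhoney.init(writekey="YOUR_WRITE_KEY", dataset="memcache-ops")

def instrumented_get(client, key):
    """Wrap a cache get so every call becomes one richly labeled event."""
    start = time.time()
    value = client.get(key)
    ev = libhoney.new_event()
    ev.add({
        "operation": "get",
        "key_prefix": key.split(":")[0],          # e.g. "workspace"
        "key_suffix": key.rsplit(":", 1)[-1],     # e.g. the ID
        "hit": value is not None,
        "duration_ms": (time.time() - start) * 1000,
    })
    ev.send()
    return value
```

Because the pricing is per event rather than per field, piling on labels like this is what makes the later slice-and-dice queries possible.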
00:45:47
Speaker
Why did Slack have keys like that? Yeah. So, I mean, it's not that hard to think of. So you have a large Slack workspace.
00:46:01
Speaker
I don't want to name names, because I'm not sure which customers are public and which aren't. But imagine that large organizations you've heard of use Slack.
00:46:12
Speaker
And one of the things you do when you use Slack is you look up the workspace you're in. You do that often, because workspaces change, things change.
00:46:22
Speaker
And so it turns out that every user at those companies is looking up that data often. And there are just a lot of people on there. There are hundreds of thousands of humans
00:46:35
Speaker
whose Slack clients are doing that. But really what gets you is the bots. The Google Calendar bot, at the top of every hour, for every user who has it configured, is going to go look up that workspace and is going to go look up a bunch of other stuff, all at the same time,
00:46:55
Speaker
on top of the clients. It's sort of a recipe for a DoS. And it's always going to end up DoSing these poor single memcaches that are hosting this key.
00:47:08
Speaker
And there is a whole side discussion of: why is that? That seems like a dumb way to do it. And yes, it is. It's a dumb way to do it. That key should live in a few places. It's just that memcache, by virtue of its very simplicity, makes it a little harder to do things correctly that way.
00:47:25
Speaker
And as soon as you have it living in multiple places, now you have to solve the problem of consistency across those different hosts. And there's a whole bunch of gnarly stuff there, which is why people now have all these other storage solutions for cache.
00:47:42
Speaker
You work with the system you've got. Anyhow, so we found some of these keys. We fixed them. We put them in the host cache I was talking about, so that the web apps just had a lot of these very common keys in memory all the time.
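A minimal sketch of that host-level cache for hot keys: a small in-process dict with a short TTL in front of memcache, so the hottest keys never leave the web app. The key names, TTL, and client interface are illustrative assumptions.

```python
import time

HOT_KEYS = {"workspace:T12345"}   # keys known to be extremely hot (hypothetical)
_local = {}                       # key -> (value, expires_at)
LOCAL_TTL = 5.0                   # seconds of staleness we can tolerate

def cached_get(mc_client, key):
    """Serve known-hot keys from process memory; everything else goes to memcache."""
    if key in HOT_KEYS:
        hit = _local.get(key)
        if hit and hit[1] > time.time():
            return hit[0]
    value = mc_client.get(key)    # normal path through mcrouter/memcache
    if key in HOT_KEYS:
        _local[key] = (value, time.time() + LOCAL_TTL)
    return value
```

The trade-off is exactly the consistency problem mentioned above: each web app can serve a slightly stale copy for up to the local TTL.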
00:47:54
Speaker
We also, our friends in SecOps had developed a new version of Nebula that could use multiple cores to do its encryption. It could have multiple dedicated threads for both encryption and decryption. So all of a sudden we had the observability to tell us all these access patterns that were terrible, and we were like, it doesn't matter how fast your host is if you're going to be asked to do something a million times.
00:48:20
Speaker
We had to fix those. We fixed the hot keys. And then we got this new version of Nebula running, and it's back to that same load test. I go slap them caches again.
00:48:30
Speaker
Yeah. So fun. I'm indebted to it; I wish I knew who came up with that name. The hardest problems in computer science, and also the most gratifying. So remember, we started out with these dinky little r5.2xlarges, a lot of memory, not a lot of CPU.
00:48:53
Speaker
And using these big 4xlarge hosts, these 16-core hosts, much beefier than the little ones, we were able to find a way to get about 93% of the throughput, almost full throughput, for this giant beefy host compared to that little tiny host, but encrypted. Yeah.
00:49:26
Speaker
It's just kind of wild how much CPU it takes to encrypt things. But it just does; that's life. How we ended up laying this out, I think, is kind of fascinating, but it sort of makes sense if you think about it.
00:49:37
Speaker
We used five cores dedicated for memcache, so memcache is running five threads. Its default is four; we found it was slightly better with five. It's not a lot better with more, for, I don't know, reasons.
00:49:50
Speaker
Interesting. Five more cores for encryption and five more cores for decryption. And then we left the 16th core mostly idle to do admin stuff, like Chef runs and Nessus security agents, just stuff that needed to happen that we didn't want to get in the way of the running state of the machine. But it kind of makes sense if you think about it: Memcache can do stuff,
00:50:20
Speaker
right, and then it can get a request and it can send a response and it can do stuff. And so there being a symmetry of worker threads and encryption threads and decryption threads sort of makes conceptual sense. Yeah.
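One way to realize that layout, sketched with illustrative core numbers; the exact mechanism Slack used (taskset, cgroups, systemd CPUAffinity, or something else) is an assumption here, this just shows the split on a 16-core box.

```python
import subprocess

MEMCACHE_CORES = "0-4"    # 5 memcached worker threads
ENCRYPT_CORES  = "5-9"    # 5 cores reserved for the encryption side
DECRYPT_CORES  = "10-14"  # 5 cores reserved for the decryption side
                          # core 15 left free for Chef runs, security agents, etc.

# memcached's -t flag sets the number of worker threads (default is 4),
# and -m sets the memory budget in megabytes.
subprocess.Popen(
    ["taskset", "-c", MEMCACHE_CORES, "memcached", "-t", "5", "-m", "16384"]
)
# The encryption/decryption process would be pinned to its own core sets the
# same way, so the scheduler keeps the two workloads off each other's cores.
```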
00:50:35
Speaker
It is pleasant. There's a pleasant symmetry there. Because the doing-things part doesn't take any time at all, right? So you could have more memcache threads, but they wouldn't have anything to do, because they'd be busy sitting waiting to receive some network traffic.
00:50:48
Speaker
Stuffing things into memory is not that complicated, I guess. Yeah. But we did find that there was a benefit to having that one extra thread. Fewer memcache threads was marginally worse, but mostly it was the CPU.
00:50:59
Speaker
Okay. So now we have this same 540 hosts or whatever running. We have to re-provision them because, reasons; in order to configure them properly, you end up just having to re-provision them. So I have to go make the donuts again. But then we turn it on, and it's sort of underwhelming, right? It just worked fine. It was fine.
00:51:27
Speaker
Yeah. But actually, that's fantastic. That's great. Yeah. So it was not that exciting, but that's what you want in infrastructure work, right? You want it to be boring.
00:51:40
Speaker
And it was finally, it ended up being about four years start to finish, but everything's working. Cool. Dozens and dozens of rounds of donuts.
00:51:52
Speaker
So many donuts. And I kept having to make the donuts because security tickets keep on coming. Right. And mostly the prescribed fix was: re-provision your hardware with the newest AMI image.
00:52:10
Speaker
So about once a month I had to make the donuts, and that was very cumbersome. And again, my wife keeps saying, can't you have computers do this for you?
00:52:20
Speaker
And I'm like, yes. And we can go back and look at that architecture diagram I wrote in 2018, and it's still true that that would do this for us. But until you have confidence that you can re-provision these hosts at some regular cadence without people watching to say, yes, it's okay to do it again...
00:52:43
Speaker
We didn't have that. It was still kind of bad when a host restarted. It was fine, but you could watch the charts do this, and...
00:53:00
Speaker
You were inducing errors, and stuff would fail. And every security update is a mini outage. Yeah.
00:53:10
Speaker
And so this dovetails now with the work that Mark was doing to get McRouter back up to date. Because along that journey, he did eventually get to a single McRouter sidecar running
00:53:27
Speaker
that we could build and modify. And in doing that, we were like, this is fine, we don't need to fix it, because it's fine. We just needed a newer one. And we needed to solve these three or four ways in which we were misusing it.
00:53:40
Speaker
And those are kind of fun to talk about. I know you saved one of the weirdest bugs you ever found for the end, but I'm going to tell you some of them right now. Let's do that. What is the weirdest bug you found? Well, the bugs that we really couldn't explain were like this: we would do these tests where we would take down some subset of memcaches to prove that we were resilient to an availability zone failure.
00:54:00
Speaker
So we would block them off. What we could never understand was... what you would expect, naively, is that all the traffic going to those hosts would fail, and that percentage of the key space would not be in memory and would fall back to the database. It would be bad, but it would be like 30% bad. But instead, everything would
00:54:26
Speaker
always break. Moreover, we actually had several different cache pools at the time. And a really startling finding was that if you broke one of the cache pools, requests to all of the others would also fail. Literally all memcache usage would break as soon as some small subset of it started to.
00:54:51
Speaker
This was a mystery. For literal years, this was a mystery. And we finally figured it out.
00:55:02
Speaker
So I think I mentioned HHVM, which is the runtime that houses the Hack language. Anyway, HHVM has a little built-in McRouter that is how you talk to memcache through HHVM. Okay.
00:55:24
Speaker
You don't have to use it. It's a good time to mention that we had three different Hack libraries for talking to memcache. There was this very, very old one that came along from Flickr; Cal Henderson wrote it in ages past, and it actually was part of the initial code dump from Flickr into Slack. It was that old; it was a decade old.
00:55:50
Speaker
It was called libcache2.
00:55:55
Speaker
That's what you started with. You started on libcache2. Some smart aleck wrote libcache, which was to be its replacement, but actually just proxied to it and afforded a few different interesting access patterns.
00:56:10
Speaker
But no effort was made to port anyone to it. It was just like, here's this other confusingly named library, because it's actually newer than libcache2. But anyway... And then there was this other one called source cache, which was part of the broader effort to stop having our code base look like PHP that had been ported to Hack and start making it look like Hack, which can be a very nice code base if you use the kind of affordances they give you.
00:56:37
Speaker
So all three of those lived at the same time. And Slack was a very marketplace-driven, the-best-solution-will-win kind of place.

Optimizing Caching Strategies at Slack

00:56:50
Speaker
And none of them was a particularly good solution, so none of them won. There were just three, and people used whichever one they stumbled upon, or whichever one was in the code they copied. Anyway.
00:57:02
Speaker
Source cache, which was the eventual winner through effort, and we'll get to that effort later,
00:57:10
Speaker
used this little local sidecar McRouter to do its stuff. And
00:57:20
Speaker
what we eventually learned was that the way you configure McRouter matters, and default configuration matters. And the default configuration that we had accidentally used, by assuming the defaults were sane, was insane.
00:57:36
Speaker
Go on. So the sidecar McRouters we had configured consciously: every different bell and whistle, we had looked and said, what does this do? We'll turn this on. What are these doing?
00:57:49
Speaker
These are little processes that run next to the web app on the same hardware, and they take care of deciding which memcache to talk to. The local McRouter that runs inside HHVM was just stock.
00:58:02
Speaker
And the most interesting, the first thing we even found, was this thing called TKO, Technical Knockout. The idea is you're talking to a bunch of memcache hosts, and one of them has started to act poorly. It's timing out. It's slow.
00:58:21
Speaker
After a certain number of errors, you're just going to say, that one's bad. You, go in the corner, get better, I'm going to stop talking to you for a while. And I'm just going to quickly fail all of these requests, because failing quickly is better than waiting to fail when you know that failure is inevitable. Sounds like a good idea, right? So that's great.
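[Editor's note: a minimal Go sketch of the TKO idea Glenn describes here. This is a generic circuit-breaker illustration, not McRouter's actual code; the type and field names are hypothetical.]

```go
// tko.go: fail fast against a destination that has recently been misbehaving.
package cachedemo

import (
	"errors"
	"sync"
	"time"
)

var errTKO = errors.New("destination is knocked out; failing fast")

type tkoTracker struct {
	mu        sync.Mutex
	failures  int           // consecutive errors seen so far
	threshold int           // errors before the host is knocked out
	until     time.Time     // when we are willing to try the host again
	cooldown  time.Duration // how long a knocked-out host sits in the corner
}

// Do runs req unless the destination is currently knocked out.
func (t *tkoTracker) Do(req func() error) error {
	t.mu.Lock()
	if time.Now().Before(t.until) {
		t.mu.Unlock()
		return errTKO // fail immediately rather than wait on a host we believe is down
	}
	t.mu.Unlock()

	err := req()

	t.mu.Lock()
	defer t.mu.Unlock()
	if err != nil {
		t.failures++
		if t.failures >= t.threshold {
			t.until = time.Now().Add(t.cooldown) // put it in the corner
		}
		return err
	}
	t.failures = 0
	return nil
}
```

The catch, as the story goes on to explain: if the only destination a client knows about is a single sidecar, knocking that one destination out fails everything behind it.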
00:58:43
Speaker
um The in-process McRouter the HHVM has talks to its side, the sidecar, which is a single host.
00:58:54
Speaker
Oh, no. Yeah. So just the one, right? And so when it's behaving badly and you put it in a corner, what you have done is put all the cache traffic that flows through source cache in a corner. The app decides memcache doesn't work, all 540 hosts. Not...
00:59:11
Speaker
Not all of it, because remember, there's three cache libraries and the other two are immune to this particular problem. But enough, right? Enough to make it really bad. The whole universe topples when that much of your cache isn't working for you.
00:59:27
Speaker
There's another side discussion there: don't make your system fall over when cache goes away. Make it slow and make that be okay. But that's a big other topic. That's a separate one. Yeah, absolutely. That's a whole other ball.
00:59:40
Speaker
So we fixed that. We made TKO go away. Great. We turned the disable-TKO flag on. And defaults matter.
00:59:51
Speaker
Right. And we found that we had kind of fixed it, but we still saw these big hourly spikes where we'd have high-throughput traffic, and the older cache libraries worked fine while source cache just fell over at the top of the hour on all the hosts. So we still didn't understand why.
01:00:10
Speaker
And okay, so defaults matter. The sidecar McRouter has four threads. Yeah, and that's enough, right? It has these four threads, but actually what it is is that it has four threads for each host it talks to.
01:00:26
Speaker
So that's a lot of threads, right? That's a lot of concurrency available. And it makes sense, right? You might want to talk to the same host a couple of times concurrently for something. So, fine.
01:00:40
Speaker
The sidecar has four threads, or rather four threads per host. But the in-process one, the HHVM-local one, also has four threads per host.
01:00:53
Speaker
And its only host is the sidecar. This is the same problem, right? So whenever things get really busy, it very quickly saturates its four channels of communication to the sidecar.
01:01:04
Speaker
Everything stacks up, everything blows up, right? It's almost better to have TKO turned on if you're going to have that problem too, right? Yeah, it's like a soft TKO. It's not as bad as TKO, but it does cause this backup. And HHVM, because of its threading model, really doesn't handle blocking stuff very well. It just runs out of threads and gives up.
01:01:29
Speaker
And so we fixed it. What we ended up doing is we did a bunch of experimentation and we found that about eight threads per core of the box was the sweet spot.
01:01:44
Speaker
So if the box has four cores, then you've got 32 threads to do your memcache traffic. More than that started to have a lot of contention; less than that had more of these kinds of stacking failures at the top of the hour. So we went with that.
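[Editor's note: a minimal Go sketch of that sizing rule, assuming a simple counting semaphore guards calls to the sidecar. The 8x-per-core multiplier is the value Glenn says they landed on empirically, not a universal constant.]

```go
// sizing.go: derive the concurrency budget to the local sidecar from core count.
package cachedemo

import "runtime"

const workersPerCore = 8 // empirically tuned sweet spot from the episode

// sidecarSlots is a counting semaphore limiting in-flight requests to the
// local McRouter sidecar.
var sidecarSlots = make(chan struct{}, workersPerCore*runtime.NumCPU())

// withSidecarSlot runs call once a slot is available, so bursts queue here
// instead of exhausting the runtime's request threads.
func withSidecarSlot(call func() error) error {
	sidecarSlots <- struct{}{}        // acquire; blocks only when all slots are busy
	defer func() { <-sidecarSlots }() // release
	return call()
}
```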
01:02:08
Speaker
And then we made a very thorough audit of the defaults for McRouter: twice burned, seven times shy. Yeah. I'll just note that the way we figured all this out, and actually the way that Mark was able to very carefully migrate to this new McRouter, was that, as I mentioned, we labeled all of this Honeycomb data.
01:02:26
Speaker
We had in the web app this thing called feature flags, which I think are a fairly common idea now, just things you turn on and off to affect different behavior.
01:02:40
Speaker
It's delicate to do because the feature flag system itself was initialized using memcache, because it was kind of smashed together with the experiment framework, which used memcache to figure out what experiments to run. So we had to kind of end-run it to be able to have lower-level feature flags that didn't require any other smarts.
01:03:01
Speaker
I didn't want to label all the traffic with every single feature flag, because that's a lot of flags. So we wrote this little goofy thing that would take all the flags we cared about and assign them to this little bit field. And I would just always write this bit field that had the things turned on and off. And Honeycomb is amazing because you can derive fields from other fields.
01:03:24
Speaker
And so we were able to come up with sort of pseudo-labels that said this feature is turned on, which we just derived from that bit field. How we even knew that source cache was the one being more badly affected was that we sliced on everything. We sliced on library. We sliced on hardware type. The way I was able to derive eight threads per core was that we had all the data labeled with all our different threading experiments, so we could compare success rates, we could compare latency.
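[Editor's note: a minimal Go sketch of that bit-field trick, packing the flags you care about into one integer you attach to every traced request, then recovering per-flag columns downstream. The flag names and positions here are hypothetical.]

```go
// flags.go: cheap per-request encoding of a handful of feature flags.
package cachedemo

const (
	flagUseSourceCache uint64 = 1 << iota // hypothetical flag positions
	flagDisableTKO
	flagAZLocalRing
)

// encodeFlags packs the enabled flags into a single value that is cheap to
// attach to every event or trace span.
func encodeFlags(enabled map[uint64]bool) uint64 {
	var bits uint64
	for flag, on := range enabled {
		if on {
			bits |= flag
		}
	}
	return bits
}

// hasFlag is the kind of predicate a derived column can express later
// ("this feature is turned on"), recovered from the stored bit field.
func hasFlag(bits, flag uint64) bool { return bits&flag != 0 }
```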
01:03:59
Speaker
We could learn that during this McRouter upgrade, we found ourselves a nine in availability. Oh, cool. We didn't even know what our availability was before we did all this, right? So it's really amazing, the power of being able to see what you're doing.
01:04:18
Speaker
The nine was at the front of the number, I assume, right? It was the good nine, yeah. We went from four to like five and a half as part of this effort. That's real.
01:04:29
Speaker
But I'm still making donuts every month, right? But these are the sorts of fixes we needed in place to be able to say, a system of automation feels safe to run now.
01:04:41
Speaker
And so this is when I start to go off and decide I'm going to build... or sort of in parallel with this, as we're arriving at "we trust our system now," how can we get rid of our unending toil?
01:04:54
Speaker
I started to think about the system that would replace the janky Python 2 leader election thing with a system that would use that lever I talked about at the very beginning of the episode: fast distributed configuration management.
01:05:15
Speaker
And we called it McRib, and credit to my coworker Chris Sullivan for coming up with the name. I think we sort of workshopped the acronym, but he figured out that we could actually call it that: the Memcache Routing Intelligence Boss.
01:05:32
Speaker
And so McRib was to be the system that would... Because as an aside, we had to upgrade this thing, the Python. We had a Python language version end-of-life
01:05:46
Speaker
deadline. And we really, really, really didn't want to... yeah. Right. You know, you managed this, right? We really didn't want to upgrade this terrible thing.
01:05:57
Speaker
We didn't want to make a new terrible thing if we didn't have to. But last week's episode was Lyft's Python 2-to-3 migration, and it was not trivial. It was gross.
01:06:09
Speaker
And the best way to migrate something is not to migrate it if you can. Oh, absolutely. We would all prefer not to touch the software ever again. But also, now we're a little bit in bad-to-good territory here if we're not careful.
01:06:23
Speaker
There were lots of bad things about this beyond just the version. Chiefly, any time it would restart, the memcache would forget that it had been a member of whatever slot it was in, and its slot would be given to somewhere else.
01:06:36
Speaker
And it would cause a little micro-outage. Sometimes there would be Chef runs that would take down the whole tier of this little leader election script, which would basically cause a full election of the whole ring over again, which would cause a huge site outage. So it was very delicate. We had no way to version it or restart it or deploy it that didn't cause an outage.
01:06:58
Speaker
So McRib was chiefly there to fix that,
01:07:06
Speaker
to have a system that we understood and could operate, that would be in charge of delivering configuration to McRouter so it knows which memcaches to talk to. Just that. And that part was fun: go off and write some Go. That was a lot of fun.
01:07:22
Speaker
The interesting part is getting onto it. Right, because the McRouters currently eat this JSON configuration that comes from a rendered Consul template, where the template itself is an ERB file from Chef. So there's Ruby templating going on too.
01:07:39
Speaker
And multiple layers of variable injection and... So the first task is, can we make a file that looks just like this file?
01:07:51
Speaker
And can we deliver it right next to it on disk, and with some confidence? And so the first thing that McRib had to do was itself become a consumer of that config file.
01:08:04
Speaker
And all it did was eat it into its understanding of its domain objects, spit it back out, and ship that everywhere. Yeah. So it was literally a pass-through, but it was a pass-through that went through McRib logic, so that we could prove that we understood the primitives that we were dealing with.
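[Editor's note: a minimal Go sketch of that pass-through step: read the existing rendered config, lift it into domain objects, write it back out next to the original so the two can be diffed. The Pool/RouterConfig structs are illustrative, not McRouter's real schema, and stripJSONC stands in for a proper JSON-with-comments parser.]

```go
// passthrough.go: prove we understand the config by round-tripping it.
package cachedemo

import (
	"encoding/json"
	"fmt"
	"os"
)

type Pool struct {
	Name    string   `json:"name"`
	Servers []string `json:"servers"`
}

type RouterConfig struct {
	Pools []Pool `json:"pools"`
}

// passThrough reads the file the old pipeline produced, parses it into our
// domain objects, and emits a regenerated copy for external diffing.
func passThrough(inPath, outPath string, stripJSONC func([]byte) []byte) error {
	raw, err := os.ReadFile(inPath)
	if err != nil {
		return err
	}
	var cfg RouterConfig
	if err := json.Unmarshal(stripJSONC(raw), &cfg); err != nil {
		return fmt.Errorf("config did not fit our domain objects: %w", err)
	}
	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return err
	}
	// In the story, the regenerated file sat next to the original for a long
	// time, with discrepancies flagged in a Slack channel before cutover.
	return os.WriteFile(outPath, out, 0o644)
}
```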
01:08:22
Speaker
That makes sense. Did you find any load-bearing bugs? Oh, no, not really. I was very happy that nobody cared about whitespace.
01:08:35
Speaker
Oh, yeah. Or trailing commas, because all we did was write the config file out with Go templates, and it's notoriously hard to get the whitespace right. But I think the only big bug was that...
01:08:49
Speaker
It was a load-bearing decision, right, which is that McRouter uses JSON-C, which is JSON with comments. It's a superset of JSON that allows comments and trailing commas.
01:09:02
Speaker
Or I think it actually requires trailing commas. And so there's an extra layer of weird library jank to support JSON-C in Go, because it's not part of the built-in JSON marshalling.
01:09:16
Speaker
But there's a nice library for it. Anyway, so we get that in place. We're delivering the thing. We're eating it and spitting it back out, and that's all working fine. And then, the way that McRib works is it sniffs service discovery, so it knows all of the memcache hosts that exist.
01:09:34
Speaker
It knows the last version of the config file, which says which hosts were where. And then there's always 20% or 30% of the hosts that we just keep spare, ready to take over when bad things happen.
01:09:48
Speaker
And the trick is you always want to compute the minimum-difference config, right? You want your config to look just like the one before, but with just the one host that went bad replaced. You don't want to recompute the whole ring, because then all the keys are in the wrong places and you've caused a big outage.
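[Editor's note: a minimal Go sketch of that "minimum-difference" rule: every healthy host keeps its slot, and only slots whose host has vanished from service discovery get a spare. Names are hypothetical.]

```go
// mindiff.go: replace only the dead slots, never reshuffle the whole ring.
package cachedemo

// nextRing takes the previous slot->host assignment, the set of hosts that
// service discovery currently reports healthy, and a list of idle spares.
func nextRing(prev []string, healthy map[string]bool, spares []string) []string {
	next := make([]string, len(prev))
	copy(next, prev)
	for slot, host := range prev {
		if healthy[host] {
			continue // untouched slots keep their keys warm
		}
		if len(spares) == 0 {
			break // leave the slot alone rather than reshuffle the ring
		}
		next[slot], spares = spares[0], spares[1:]
	}
	return next
}
```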
01:10:07
Speaker
So the delicate step was going from "I'm reading from the old leader election script" to "I'm reading from service discovery and I'm definitely producing the same result either way."
01:10:24
Speaker
And that was harrowing, so we did it incrementally. Right. The first thing we did was produce both and diff them for a long time: diff them in McRib and complain in a Slack channel when they were wrong. And yeah, so you're pretty convinced that it was doing the right thing. We knew it was going to be, yeah, but there's still the next deploy of McRib, which is: turn it all on.
01:10:47
Speaker
And now you're drifting. Right, so I guess actually the first step is ship them both so they're next to each other. Second step is get all of the McRouters to start eating the new config file instead of the old one.
01:11:01
Speaker
Third step is cross your fingers a lot, flip the switch, and start reading from the new system, making the new system authoritative. Yeah. And then those two files start to drift. So if you roll back... you really can't roll back.
01:11:15
Speaker
Yeah. You just can't. So... There's a point of no return. Yeah. And most of what we had done to date, we could turn off if we had to, but this was really, once we had made the switch, there we were. That was very scary, and thankfully it worked okay. How long before... so you switch and then the system starts to drift. It's not drift in code, it's drift in production reality, right? Right. Over time, a memcache will go down, and which spare you picked to fill that slot is different.
01:11:46
Speaker
Yeah. So what did you have? Did you have like five minutes, or did you have days? How long was that drift before you were like, we can't go back? I mean, as soon as a few of the hosts are different. And they don't fail that often.
01:12:02
Speaker
But they fail, and you're not in charge of when they fail. So you didn't know going into it. We didn't know, no. But basically we kind of declared victory. And at that point, there's no point keeping the old thing around.
01:12:17
Speaker
The system we had was in every respect easier to reason about. And so if it was broken, we were going to fix forward from that point. We weren't going back. And so the rest of the process is: clean up all that stuff, get rid of the leader election, stop using Consul in this bonkers way, get rid of all these Consul templates, get rid of all these ERB files, and clean up your mess.
01:12:40
Speaker
But the win here, the good part is... I'm not yet out of donut making. Right.
01:12:50
Speaker
But now the system is responsive enough. McRouter works well enough. We understand memcache enough. We have all of these ways of seeing how it's behaving. We have the right alerts in place. We have all the things we need to know when something's gone wrong.
01:13:06
Speaker
And we have a very reliable system for recovering from that. And so the thing we do that's fun is: we had this thing called ASG tester, and its job is to turn off a host in an auto-scaling group in AWS and replace it with a new one.
01:13:27
Speaker
That's its whole job: just prove that provisioning a new host works. And so we said, what if we just did that, but all the time, and we'll just cycle through all our hosts every few days or whatever, at whatever pace seems good.
01:13:42
Speaker
Yeah. We ended up doing things like, don't do it in the couple of hours in the morning where things are very busy. But you're controlling when you induce this stress, and it's not incident-level stress. It's basically host reprovisioning in cron.
01:14:01
Speaker
Think of it like that.
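[Editor's note: a minimal Go sketch of "host reprovisioning in cron" as described: walk the fleet at a gentle pace, skip the busy hours, recycle one host at a time. recycleHost stands in for whatever actually terminates an instance so the auto-scaling group replaces it; the busy-hours window is illustrative, not Slack's actual schedule.]

```go
// cycle.go: slow, continuous fleet recycling instead of monthly donut-making.
package cachedemo

import "time"

func cycleFleet(hosts []string, pause time.Duration, recycleHost func(string) error) {
	for _, h := range hosts {
		for busyHours(time.Now()) {
			time.Sleep(15 * time.Minute) // wait out the morning rush
		}
		if err := recycleHost(h); err != nil {
			return // stop the loop on any failure; a human looks before it resumes
		}
		time.Sleep(pause) // let the replacement warm its slice of the cache
	}
}

func busyHours(t time.Time) bool {
	h := t.Hour()
	return h >= 8 && h < 10 // illustrative quiet-period rule
}
```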

Automating Processes for Efficiency

01:14:02
Speaker
Right. And all of a sudden the donuts are making themselves. Right. Because when somebody files a security ticket, I say, that's fine, because in a week all those hosts will be gone.
01:14:15
Speaker
All the security tickets are gone. And then the grunt work became closing the tickets. But thankfully, a lot of folks got wind of this: hey, if you reprovision your stuff all the time automatically and that works, security problems go away.
01:14:29
Speaker
We championed that as a best practice. And so it became SecOps's problem to make those tickets go away when they were no longer true, which is great.
01:14:41
Speaker
Amazing. So that's the toil. We've done the mandated migration of technology and we've done the toil migration.
01:14:52
Speaker
And the coda here, if we have time, is to talk about those three cache libraries and how we got to the one cache library. Yeah, let's do that quickly. We should wrap up.
01:15:05
Speaker
Yeah, that one's pretty quick. And I'll caveat that I wasn't actually all that involved in it. But broadly, there was no incentive to switch from one library to another because they were all kind of fine.
01:15:19
Speaker
In theory, source cache prevented certain kinds of thundering-herd use cases, because it had a way that you could fetch something, and if it was missing, you would...
01:15:32
Speaker
You could basically coordinate with everybody else to decide who was going to do the filling. If it wasn't you, you would just wait. And then eventually it would be ready or it would time out. You could see some problems with that if you think too hard about it, but don't.
01:15:44
Speaker
Source cache also had an envelope around the cached values where you could do things like version numbers and some metadata. So you could do things like, we want to change how things are cached,
01:15:55
Speaker
we'll just flip the number, and now we know whether it's a new or an old value, and stuff like that. So it was kind of useful, but there was no incentive to do anything with it, and it wasn't really a priority for anyone. So it was just kind of bad, but I guess it was an okay, right? It was an okay-to-good.
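[Editor's note: a minimal Go sketch of that envelope idea: wrap every cached value with a schema version and a little metadata, so "change how things are cached" is just bumping a number. Field names are hypothetical.]

```go
// envelope.go: versioned wrapper around cached payloads.
package cachedemo

import (
	"encoding/json"
	"time"
)

const currentVersion = 3 // bump to logically invalidate older envelopes

type envelope struct {
	Version  int             `json:"v"`
	StoredAt time.Time       `json:"at"`
	Payload  json.RawMessage `json:"p"`
}

// open returns the payload only when the envelope matches the version we
// currently expect; anything older is treated as a cache miss.
func open(raw []byte) (json.RawMessage, bool) {
	var e envelope
	if err := json.Unmarshal(raw, &e); err != nil || e.Version != currentVersion {
		return nil, false
	}
	return e.Payload, true
}
```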
01:16:13
Speaker
It was okay to good. There are always bads, right? So what happened? Well, if you'll recall, you yourself shepherded a project wherein we wanted to make
01:16:25
Speaker
full stacks of requests be local to availability zones, so that we could route away from an availability zone with a bad network situation, and all the traffic would just silo in the other zones and it would be fine. We were in a bad state where Amazon would sneeze for 30 seconds and induce 2% packet loss, which is real, don't get me wrong,
01:16:46
Speaker
and Slack would be hard down for four hours because we were just... Well, I mean, it was kind of like my memcache thing, where if one goes down, they all go down. It just created a problem that was just enough bad that it became bad everywhere.
01:17:00
Speaker
Yeah. And so one of the things I loved about McRib was that it let us ponder ways we might configure a McRouter differently at all, which was not possible before, because it was so cumbersome to change the config templates.
01:17:14
Speaker
With McRib, changing config templates was a deploy, and it was easy. And so we cooked up this config that would let you access this big 600-machine
01:17:29
Speaker
cluster that we talked about before either as one big ring or as three smaller AZ-specific rings. So same hardware. You can just decide whether you want to talk to the keys across the whole thing or talk to the keys locally.
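[Editor's note: a minimal Go sketch of that one-big-ring versus three-AZ-local-rings choice: same hardware, grouped two ways, with the caller opting in per use case. Types and names are hypothetical.]

```go
// rings.go: pick which ring a caller should hash its keys across.
package cachedemo

type Host struct {
	Addr string
	AZ   string
}

// ringFor returns the set of hosts a client should hash keys across.
// azLocal=false gives the single global ring; azLocal=true restricts the ring
// to the caller's own availability zone.
func ringFor(all []Host, callerAZ string, azLocal bool) []Host {
	if !azLocal {
		return all
	}
	local := make([]Host, 0, len(all)/3)
	for _, h := range all {
		if h.AZ == callerAZ {
			local = append(local, h)
		}
	}
	return local
}
```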
01:17:45
Speaker
And this was kind of neat because, if you'll recall, we were just grossly CPU-bound because of the encryption traffic, so we had all this extra memory lying around.
01:18:01
Speaker
So it was fine to store everything three times. We still had way more memory than we needed. For once, you managed to fail at your job of spending more money. Right. Yeah. And here's the payoff: now I can spend that same money in a way that makes us more resilient to AZ loss. And also, incidentally, we shaved off... we went from like a millisecond to like two-thirds of a millisecond for most requests, because the local hop was much shorter.
01:18:30
Speaker
Yeah. Memcache requests are just in a totally different ballpark from every other RPC in a system. They just have to be basically instant. They have to act like memory.
01:18:41
Speaker
Yeah. Everything else is measured in tens of milliseconds or single milliseconds, and these in microseconds. We didn't really talk numbers, but that memcache cluster did, you know, tens of millions of lookups per second. Wild.
01:18:53
Speaker
Yeah. It was real, real volume. So it turned out that in order to use this AZ stuff, you had to opt into it. I didn't want to just turn it on for everybody because it restricts certain use cases, and...
01:19:12
Speaker
our cache was kind of Wild West. An example of a use case that would not work at all: people used memcache for mutual exclusion, for locking. Like, I'm going to grab this key in memcache, and that means I can do this. Which, by the way, is a lie, because the cache can go away anytime. So it's kind of hopeful locking, right? It's not even optimistic. It's just hopeful.
01:19:33
Speaker
But if the lock you're grabbing turns out to only be an AZ-local lock, then you're not really excluding anything. There are two other friends of yours who are going to be doing the same work. And so there were a bunch of those kinds of edge cases where the access pattern just didn't make any sense if you were going to be AZ-local, because you could no longer guarantee that the data you were looking at was the sole place that data lived.
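[Editor's note: a minimal Go sketch of that "hopeful locking" pattern, assuming the github.com/bradfitz/gomemcache client, where Add only succeeds if the key does not already exist. As the episode notes, this is not a real lock: the cache can evict the key at any time, and with AZ-local rings two other zones can "hold" the same key simultaneously.]

```go
// hopefullock.go: memcache Add as a best-effort mutual exclusion claim.
package cachedemo

import (
	"errors"

	"github.com/bradfitz/gomemcache/memcache"
)

// tryLock attempts to claim a key for ttlSeconds. It returns true only if we
// were the one to create the key in this ring.
func tryLock(mc *memcache.Client, key string, ttlSeconds int32) (bool, error) {
	err := mc.Add(&memcache.Item{Key: key, Value: []byte("1"), Expiration: ttlSeconds})
	if errors.Is(err, memcache.ErrNotStored) {
		return false, nil // someone else, in this ring at least, got there first
	}
	return err == nil, err
}
```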
01:19:59
Speaker
And so it required people to do some thinking, but it was also a company priority suddenly. So we had a cudgel with which to prod people along.
01:20:11
Speaker
And we realized that... because people can only do a certain number of migrations at a time, and Slack was sort of notorious for always having four or five big migrations going at a time.
01:20:24
Speaker
So it wasn't strictly necessary to move to source cache to use the AZ-local stuff. As a pragmatist, I made it work with both, because I wanted it to be a system that we could actually ship. And I went through porting some of the really big, cardinal, high-throughput keys using both libraries to prove they worked and everything was fine. But it was sort of a convenient cudgel to say, well, if you're going to do this AZ-local thing, which we are telling you you have to now, you might as well go to source cache, because it's not really very much more work and you get these couple of little things out of it. And yeah,
01:21:03
Speaker
that mostly worked. I wrote a big runbook for how you do this port. I wrote a runbook for how you tell if you're one of those use cases that is not a good candidate. And there were a few, and they were like, too bad, and the answer is, you're just not going to work right when we do these AZ failures or failovers. Okay.
01:21:23
Speaker
But then it became a gnarly TPM spreadsheet of all of the different call sites. It became the thing you want to fix, right? It was that. Yeah. And this is exactly one of the stories that I anchor on. You mentioned Ryan Greenberg earlier; he's my co-founder at Tern. And he did a lot of those migrations, or rather he did a lot of the, let me see if I can solve all of these in one pass with code modification tools. Yeah, running that runbook manually in 2025, when LLMs and AI exist, seems dated, right?
01:22:02
Speaker
Certainly. As the final coda here: I hopped off the bus just as this process was getting underway. Number one, because I had found a team, or kind of built out a team, that were the right people to own all of this stuff.
01:22:19
Speaker
They own some other stuff too, but it was of a piece with what they own. So I had a team that I trusted to run McRib and to understand memcache. They had all the right tools in place. And I was tired of being on the hook for a tier-zero service. Yeah.
01:22:33
Speaker
after two years. And so I kind of stepped off and went and did some other stuff; my last deliverable was those runbooks. And then it was like, this seems like kind of a boring problem now, to actually get people to do this. If not boring, at least not something I'm good at.
01:22:50
Speaker
So I'm going to go do interesting engineering work and hope that the system of grinding people into doing labor works. But certainly if somebody could have just pushed a button...
01:23:03
Speaker
then my job of describing the migration becomes interesting, right? But also the last thing I have to do, which is what I would have wanted, right? Like I said, here's how you do this.
01:23:14
Speaker
I've tested it enough times to be comfortable that it's going to work. You should feel comfortable just pressing this button and doing it. Yeah, it's one of the, I think, undersold things about automation, and kind of a theme of what you've been talking about, is that you wanted to automate making the donuts.
01:23:33
Speaker
But you also had to start with implementing it on a different system where it was more tractable. And then you had to do it manually a gazillion times while dialing in whether that made sense. Same thing for source cache.
01:23:45
Speaker
You're not going to automate the first one; you're going to automate the tenth and beyond. Right. Yeah. You've got to try it, prove it works, find the things you got wrong, fix those, do it again, and get it to where...
01:23:56
Speaker
Once it's a crank you can turn, it doesn't make sense for people to turn the crank any more than it makes sense for people to, I don't know, power turbines. Yeah. Hook it up to the generator and turn the crank.
01:24:08
Speaker
Have the

The Future of AI in Engineering

01:24:09
Speaker
crank be turned. But it has to be donuts you're making and not, you know, guns pointed at your face. Yeah.
01:24:18
Speaker
Bad donuts are fine, to stretch the analogy a bit too far. That's right. Yeah. But you have to get them to the point where they're harmless at least.
01:24:27
Speaker
All right. I have two questions for you before we go. We're starting to get into it now, but what tooling did you wish existed during this whole process?
01:24:39
Speaker
Um,
01:24:42
Speaker
I mean, I sort of wished for Honeycomb, and it was there, and so I was very glad to have that. I could not have done this without tracing.
01:24:56
Speaker
The magic that is always promised, and so far yet to be delivered, even with AI, is a robot to watch the charts for you.
01:25:07
Speaker
AI that knows when something truly has gone wrong, AI that can correlate two charts that are, in their domain, very far away but have exactly the same shape. I mean, this is reasoning stuff, right?
01:25:26
Speaker
What made the job hard was understanding why those two charts were the same shape, mostly. And what made donut making hard was knowing when it was okay to keep going.
01:25:40
Speaker
And so it would have been really neat to have... And we tried every...
01:25:49
Speaker
I don't know, over 15 years of working, I watched people try to use machine learning and sort of pattern analysis to send a page that's like, hey, this looks wrong.
01:26:00
Speaker
And what you get is a channel filled with pages you don't know what to do about, that are always either too noisy or not noisy enough. So that would have been really nice, I think. I don't know.
01:26:15
Speaker
ah
01:26:18
Speaker
Mostly, the tools I was using were all fairly simple. I think what was interesting for me about this project was that none of it was hard. Exactly. Except the understanding. Except all of it, right. No single technology was difficult to comprehend. It was just the totality of them.
01:26:41
Speaker
How long did it take you to figure out you should turn off TKO? So, 2021. We started in like spring of 2020; it was probably a year before we found it. It's not hard to do, it's hard to figure out. Well, there was so much else wrong, right? You have to cut away all the obvious wrong before you can see the really interesting wrong. Right, yeah. All right, cool. And last question: if people want to follow up with you, is there a place they can find you on the internet? I mean, I'm a little bit of a hermit these days. My website is glenn.nu, N as in Nancy, U as in umbrella.
01:27:23
Speaker
Yeah. And then there's a contact form there; you can reach me that way. I'm 9len on X, or Twitter, but I don't really use it anymore. I can't even remember what I am on everything else, because I'm just not that active.
01:27:39
Speaker
In all the different ways one could be social right now. But I'm easy to get in touch with, and I love to talk. Awesome. Well, thank you so much. That was a fascinating story. Thank you. I'm so glad we could make this time.
01:27:54
Speaker
Thank you for your patience with what was really quite a long endeavor.