Slack's AWS Availability Zone Challenges
00:00:00
Speaker
1% packet loss. It's not enough to take you down, but it's enough to ruin your day. Slack had this problem. AWS had one flaky AZ, and it was clear that users were having a bad time.
00:00:11
Speaker
But draining an AZ meant removing a third of the site's capacity. It didn't feel right. So they waited. Sure enough, Amazon fixed the problem. Then it happened again, four hours later. That's when Cooper Bethea decided: we can't do this.
00:00:24
Speaker
In this episode, Cooper shares how he led Slack's move to a cellular architecture, turning draining an AZ from high-risk theory into a weekly activity. Cool.
Introduction to Cooper Bethea
00:00:34
Speaker
So on today's episode of Turn Stories, um I've got Cooper Bethea, and he's going to tell us about his experience migrating Slack to a cellular architecture.
00:00:44
Speaker
He was a senior staff engineer on the traffic and service discovery team there. Super happy to have you on the show. Thanks, T.R. Actually, my last name is Bethea.
00:00:55
Speaker
Bethea. I'm only mentioning this because I know my parents will watch the podcast. And they'll be like, he didn't even get your name right, son. You know, one of the things I'm learning is that I almost never say out loud the last names of people
Early Challenges at Slack
00:01:14
Speaker
that I've met. so Why would you?
00:01:18
Speaker
Why would I? Cooper Bethea. Apologies. but It took me a long time to learn your name was Thomas Rasputin. It's something like that.
00:01:29
Speaker
So the topic we're going to cover today is something that you've actually talked about a bunch. You've given a talk on this at QCon.
00:01:43
Speaker
I want to start a little bit before the actual story of, you know, transitioning Slack away from the architecture of your dreams. What did Slack look like when you showed up?
00:01:57
Speaker
What were the major fires? Yeah. My recollection is I landed at Slack in about April 2019. And I think some of the major fires that we would get wrapped up in at first, a lot of them were particularly around the Consul installations.
00:02:21
Speaker
We were using Consul as our sort of back end for service registration and discovery. And we would often have load- or failure-induced issues with these Consul clusters that would just render them useless, both for publishing and reading service discovery information.
00:02:41
Speaker
That was a really big focus of ours for a while. We went through a lot of rounds of doing things like just making the Consul clusters bigger and trying to defend them in various ways.
00:02:54
Speaker
That was a big focus for probably about the first year. What sort of lessons did you take away from that? Was there anything leading into the cellular Slack project? Were there pieces of the system, you know, dark skeletons in the corner, that you'd figured out, or was it truly separate work?
00:03:18
Speaker
There was actually a lot that went on to inform the work we did in cellular Slack. I remember one of the early wins that we got was around...
00:03:33
Speaker
So we came to realize that a lot of the instability in Consul was driven by the scaling up and down of our app servers,
Transition to xDS Control Plane
00:03:42
Speaker
right? What we would call web apps at Slack, which are just kind of stateless HTTP servers that do most of the business logic.
00:03:50
Speaker
And so as they needed to discover services from Consul, they would add load to Consul directly in the form of these watches. This was difficult because you have to maintain availability of the Consul cluster the whole time. And with a Consul cluster, the size of the cluster is five. It's a consensus-driven system.
00:04:12
Speaker
It offers strong consistency. And so we kept just vertically scaling the nodes in the Consul cluster. But it was messy, we were running out of nodes, and we still weren't really happy with the stability picture. So we ended up bridging it over to this xDS control plane, which is part of the Envoy ecosystem.
00:04:34
Speaker
So we published a read-only copy of the Consul data into that system, which was eventually consistent as opposed to strongly consistent. And that gave us a much freer hand to scale that read service out by separating it from the write surface of Consul.
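To make that split concrete, here is a minimal Go sketch of the idea: a single poller reads from Consul and publishes an eventually consistent, read-only snapshot that any number of clients can query without touching Consul's consensus-backed write path. The names and the fetch function are illustrative, not Slack's actual implementation.

```go
// Sketch only: decoupling the read path from Consul's strongly
// consistent write path with an eventually consistent, periodically
// refreshed snapshot. fetchFromConsul is a stand-in, not a real API.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Endpoint is one registered instance of a service.
type Endpoint struct {
	Service string
	Addr    string
	AZ      string
}

// Snapshot is an immutable view of the service catalog.
type Snapshot struct {
	FetchedAt time.Time
	Endpoints map[string][]Endpoint // keyed by service name
}

// fetchFromConsul stands in for one bounded read against the Consul
// cluster. Only the poller calls it, so read fan-out from thousands
// of app servers never reaches Consul itself.
func fetchFromConsul() Snapshot {
	return Snapshot{
		FetchedAt: time.Now(),
		Endpoints: map[string][]Endpoint{
			"webapp": {
				{Service: "webapp", Addr: "10.0.1.10:443", AZ: "us-east-1a"},
				{Service: "webapp", Addr: "10.0.2.10:443", AZ: "us-east-1b"},
			},
		},
	}
}

// Registry serves reads from an atomically swapped snapshot.
type Registry struct {
	current atomic.Pointer[Snapshot]
}

// Poll refreshes the snapshot on an interval; readers tolerate
// staleness of up to one interval (eventual consistency).
func (r *Registry) Poll(interval time.Duration) {
	for {
		snap := fetchFromConsul()
		r.current.Store(&snap)
		time.Sleep(interval)
	}
}

// Lookup is lock-free and never touches Consul.
func (r *Registry) Lookup(service string) []Endpoint {
	snap := r.current.Load()
	if snap == nil {
		return nil
	}
	return snap.Endpoints[service]
}

func main() {
	var reg Registry
	go reg.Poll(5 * time.Second)
	time.Sleep(100 * time.Millisecond) // let the first poll land
	fmt.Println(reg.Lookup("webapp"))
}
```

The trade-off is the one described above: readers may see data that is up to one refresh interval stale, in exchange for a read path that scales independently of the Consul cluster.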
00:04:56
Speaker
Yeah, that makes sense. That's...
00:04:59
Speaker
It's definitely a challenging thing to scale. And I know Slack was not the only ones to have struggled with scaling Consul vertically and adding that capacity in a system where you've got it attached to your monolith or whatever.
00:05:21
Speaker
Yeah, it's tough because Consul is really set up to have two jobs that don't necessarily go together, right? You can use it as a service registration and discovery interface, where you're just pushing data into it about what servers are available and what they are, but you can also use it as a lock server, which is where a lot of the needs around consensus and strong consistency come in. The problem of service registration and discovery is actually much simpler if you relax those consistency guarantees.
00:05:59
Speaker
Yeah, absolutely. And that was the pickup you saw with the read-only layer, right? Yeah, with the read-only layer, we could just scale that layer separately. Cool. So with that at the core of things, tell me a little bit about what you saw that led into Cellular Slack and that effort.
Strategies for Draining AWS AZs
00:06:16
Speaker
So we had always had this idea...
00:06:23
Speaker
At Slack, we had talked for a long time about, well, we should be able to drain an availability zone. It was something that we thought we could do. AWS advertises that these availability zones have, I think, about three nines of availability.
00:06:39
Speaker
Something like that. So there was always this idea that maybe one day an availability zone will vanish entirely and we'll have to drain the site away from it.
00:06:51
Speaker
There were even some nods in the setup to that. There's sort of a virtuous incentive in AWS, where cross-AZ data transfer is actually
00:07:02
Speaker
a big creeper on people's cost sheets. So we had already set up a lot of the infrastructure to funnel traffic through the individual AZs to some extent, mostly to save on costs.
00:07:16
Speaker
But then we actually went through a few outages where we saw a failure and were fairly certain that the failure was confined to a single availability zone.
00:07:30
Speaker
But we were really unwilling to actually start trying to administratively remove traffic from that availability zone. Mostly we'd kind of like sit around and be like, ooh, is it really bad enough that we need to do this?
00:07:45
Speaker
Because we weren't in the practice of removing traffic from an availability zone, and even if the availability zone is impacted, we're talking about removing 33% of the overall capacity of the site, which is intimidating if you haven't done it before.
00:08:04
Speaker
Absolutely. Was that the sole motivation? It sounds like there was some nervousness around it. You know, I think Slack had a pretty good handle on its capacity in a bunch of ways.
00:08:16
Speaker
But still, where do you think that nervousness was coming from? I think it was from a few places. Well, there was a sense of nervousness combined with some futility.
00:08:29
Speaker
So, right, as I mentioned, we had things kind of funneled into the app servers. But then there is this large fan-out of connections the app servers actually need to make to do their work,
00:08:44
Speaker
including into systems like Vitess, which are strongly consistent and aren't trivial to just replicate across availability zones. So at least part of it was, well, we don't even really know what this would do, right? We can pull stuff out of the app servers, but then the app servers in the other two availability zones will just be spreading traffic everywhere anyway.
00:09:05
Speaker
And then, you know, we know that for the most part the scaling of the site is driven by the CPU consumption of the app servers. And as mentioned before, we're still talking about removing 33%. So part of it was this kind of theoretical worrying, and also we weren't in a place where you could get a bite at the problem.
00:09:30
Speaker
We didn't have a good idea of how we might start figuring out how safe it really was. Yeah. So it's almost like removing an AZ is not something that even had a really clear button. You mentioned moving just the traffic away from the app servers, and yeah, that's a lot of Slack, but that's not the whole site.
00:09:57
Speaker
Yeah, exactly. I think the best we could have done, if we wanted to dry run it, is effectively get people from many teams into a channel somewhere and run a long experiment where we tried to drain things. And, yeah, I think we never...
00:10:23
Speaker
Yeah, I think we never felt that it was worth all the effort and coordination without a plan of what to do about it. Yeah, absolutely. And certainly, if your default plan is to get a bunch of people in a room and have them do something they haven't done before, I can imagine that the last time people want to do that is when the site's at a 5% error rate and you're not sure if it's getting better or worse. Yeah.
00:10:51
Speaker
Yeah, if you haven't done something before, you don't want to do it for the first time ever in an incident. Sometimes you have to, but no one likes that. Yeah, it's undesirable.
00:11:02
Speaker
That makes sense. That totally makes sense. So, okay.
00:11:07
Speaker
So draining the site, or draining an AZ, is not something that feels useful or even plausible in most cases, but we were having these incidents.
Need for Robust AZ Draining Strategy
00:11:21
Speaker
Was there a particular incident that was the trigger to start doing work on this, or a particular insight that pushed you over the edge to actually start the project?
00:11:33
Speaker
Yeah, the main one that I recollect was an incident where we had a graph of TCP retransmits by availability zone.
00:11:46
Speaker
And it was pretty obvious from the shape of the drop in the graph that we had trouble connecting from the other availability zones into just one.
00:11:59
Speaker
It was actually quite clear from the monitoring that there was something going on at a network level, in the sort of interfaces. I think it was one of the Amazon
00:12:10
Speaker
components. We believed, and eventually found out, it was the Transit Gateway fronting the whole availability zone. So we were in a situation that was pretty clear-cut: if we could ever have wanted to drain, we would have wanted to do it at that point. And we were kind of like, oh, should we start trying to do this? Also, we were a little bit worried about being able to get into that availability zone at all. Our procedure at the time would have involved SSHing around to a bunch of different servers and causing the load balancers to fail their health checks.
00:12:50
Speaker
I think we were kind of talking about it, worrying about it, slash plotting to do it. And then the availability zone came back. The component got fixed. And we were like, yay, okay.
00:13:03
Speaker
Well, we didn't have to do that. Yeah, but then four hours later, the broken component accidentally got put back in again. And we were like, oh...
00:13:15
Speaker
Now, once again, we were faced with a difficult choice. You know, it just felt bad. There was a certain level of frustration at having been put in this situation, feeling kind of powerless twice in, I think, the same day. That was a big psychological motivation.
00:13:36
Speaker
What did that show up to Slack's users as? Did this actually cause downtime or other issues? Yeah. So I can't remember what the SLA calculation ended up being.
00:13:53
Speaker
It was a thing where, you know, we have a lot of retries and stuff embedded in the site. And we were certainly not fully down. We were serving, let's say, an integer percentage of errors, maybe 1%-ish.
00:14:05
Speaker
Something like that. So people were having a pretty annoying time using Slack, but the service wasn't hard down, which, you know, if we were in that bad a state, I think we would have been much more aggressive about trying to do stuff to bring things back.
00:14:24
Speaker
Yeah, that's a tough regime, because support is certainly feeling it at that point. Yeah, exactly. Where you're like, well, we could do this crazy thing, but only if something even worse has happened.
00:14:35
Speaker
It is kind of a desperation-driven feeling. Yeah, it's freeing if the site is hard down. You can't make it worse. But at 99% success, you certainly can.
00:14:47
Speaker
Very much so. And I want to come back to the question of Transit Gateway and how that works. But okay, so after that, what was the fallout? All right, you saw this error, felt kind of powerless.
00:15:00
Speaker
It was clearly a recurring set of issues. What do you do about it? Well, you write an angry email. I mean, I'm kidding, but actually, you know, I talked about it with a couple of people that I was in the incident with, and we were all just like, oh, we don't really feel very good about this. So I just ended up
00:15:24
Speaker
writing a document about it called, I think, "We should be able to drain an AZ." And I just kind of circulated it around. It was basically like, look, we just had this incident. We're doing some analysis.
00:15:40
Speaker
We believe that if we had this sort of technology, if we could drain an availability zone, which we've always told ourselves that we could do if we really had to...
00:15:51
Speaker
If we really had that, then we could have just done it. And so it was kind of a skeletal outline, I think, of how we might reasonably do this for most services and what some of the benefits would be.
00:16:08
Speaker
There was this perception, I think, originally, that draining a service or draining an availability zone would be something we'd only do in very, very dire circumstances, you know, like doing something very heroic to save the site. And I was like, well, actually we can make this a more commonplace event, right? The safe way to do this is to make it a behavior that can happen, that does happen, in the system all the time, and does not require us all to exist in this state of panic in an incident channel,
00:16:42
Speaker
um because we should be very confident in it.
Integrating AZ Draining into Workflow
00:16:44
Speaker
And once we do that, it becomes a more natural part of workflows, you know, like deploying and troubleshooting. You can get into this place where, well, something is bad in one availability zone, so let's drain all the traffic out of it. Then fix whatever it is and put, like, 1% of traffic back, something like that.
00:17:05
Speaker
You know, if that doesn't work, drain it all out again, put 1% of traffic back. I felt there was an opportunity for us to stop operating in this kind of
00:17:18
Speaker
binary event mode, where either things were good and everything was operating, but we couldn't really do anything, or everything is bad, but we can do anything because everything is already down. There's an opportunity to explore some space in the middle, I think. Yeah. That makes sense.
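The drain-fix-undrain loop Cooper describes could look something like the sketch below in tooling form; the function names, step sizes, and thresholds here are invented for illustration, and the real workflow was driven by people watching dashboards.

```go
// Sketch of the "drain it, fix it, put 1% back" loop described above.
// setAZWeight and errorRate stand in for real control-plane and
// monitoring calls; the doubling schedule and thresholds are made up.
package main

import (
	"fmt"
	"time"
)

// setAZWeight tells the load balancing layer what share of its normal
// traffic an AZ should receive (0 = fully drained, 100 = normal).
func setAZWeight(az string, percent int) {
	fmt.Printf("%s set to %d%% of its normal share\n", az, percent)
}

// errorRate pretends to query monitoring for the AZ's error rate.
func errorRate(az string) float64 { return 0.001 }

// undrainGradually ramps traffic back in, bailing out to a full drain
// if errors reappear at any step.
func undrainGradually(az string) {
	for pct := 1; ; pct *= 2 {
		if pct > 100 {
			pct = 100
		}
		setAZWeight(az, pct)
		time.Sleep(time.Second) // let metrics settle; much longer in reality
		if errorRate(az) > 0.01 {
			setAZWeight(az, 0) // still broken: drain it all out again
			fmt.Println("re-drained", az)
			return
		}
		if pct == 100 {
			break
		}
	}
	fmt.Println(az, "fully undrained")
}

func main() {
	setAZWeight("us-east-1a", 0) // drain while the fix goes out
	undrainGradually("us-east-1a")
}
```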
00:17:39
Speaker
Yeah. To what extent was this informed by... you were at Google before this, and Google famously has a gazillion data centers all over the world. And what I've heard talking with engineers there is that they think about draining their services as part of their job.
00:17:56
Speaker
To what extent were you informed by that experience, and were there certain parts of that experience that you were trying to carry over, or certain parts that you were trying not to carry over?
00:18:08
Speaker
Yeah, very much so. You know, Google is definitely the first place where I encountered this idea that one could sort of drain a data center.
00:18:19
Speaker
And the team I was on at Google, the traffic team, among other things ran the load balancing infrastructure that enabled this kind of draining. So I think one notable difference between the way this was done at Google and the way we chose to do it at Slack was informed by the size of the company and the complexity of the services.
00:18:47
Speaker
So at Google, you're a service owner. You probably have pretty much complete autonomy over where you deploy your service, as long as you're meeting your SLAs. Google has many data centers, and you don't necessarily know where all your backend services are.
00:19:04
Speaker
There's a high level of decoupling, right? Under the covers, there are hundreds and thousands of services running at Google. So as a service operator, people underneath or above you are draining from around you all the time. And that kind of service-by-service drain is more common than actually shutting down a whole data center; there's a large cost dimension to draining an actual whole data center.
00:19:42
Speaker
So they tend to be a little more judicious about draining a whole data center. In contrast, at Slack, we were a large service, but we were not, you know, many-data-centers large.
00:19:55
Speaker
And we also had one to two orders of magnitude fewer services. So there was a lot more opportunity. And whereas at Google they have, you know, a million user-facing services, Gmail, search, ads, whatever,
00:20:12
Speaker
we really just had Slack. We had auxiliary parts of Slack, but mostly we cared about the service as a whole. And most components were shared across the whole service. So when we did the program at Slack, we made it...
00:20:26
Speaker
Maybe collective is the right word, but it was a little bit more of a global regime. I think the first thing we focused on was not how we would drain individual services, but how we would just clear everything out of an availability zone.
00:20:43
Speaker
And then later, I think, we went back and started adding in more per-service draining. Yeah, but then, you know, the core piece of Slack that everyone sees is: can I send messages and can I hop around my channels? And that's all kind of one big blob of code that's extremely tightly intertwined, because it has to be. It can't be separate things.
00:21:05
Speaker
Yeah, and that's almost everything. There's a reverse proxy, there's an app server, there's a memcache, and there's some Vitess components. Almost every request that goes through the site is going to touch those things.
00:21:16
Speaker
Yeah. And that makes sense: you make your global decision and you say, we should be able to drain Slack out of an AZ, not, you know, memcache out of an AZ, that probably wouldn't help.
00:21:27
Speaker
I mean, sometimes I did, but yes. Touché. Yeah.
00:21:40
Speaker
No, sorry. God. Cool. So yeah, Angry email.
AZ Draining Proposal and Experiments
00:21:46
Speaker
A bunch of people read it. I love the simplicity. I remember actually seeing that posted and thinking, this makes a ton of sense to me.
00:21:53
Speaker
How did folks respond to it? And where'd you go from there? Well, I mean, I think...
00:22:05
Speaker
I'm trying to remember how it picked up. And I think maybe you were around for part of this? I'm not sure. I circulated it, and everyone I showed it to was like, oh yeah, this is kind of a good idea. This seems good.
00:22:18
Speaker
I can't exactly remember how we leveraged that into starting a project, except I think we got... yeah, I think we got permission to try a little bit and see how it went.
00:22:35
Speaker
You know, I think we got to a place where we could at least start draining traffic from the front end and not worry about the fan-out behind it, and just see what happened. See if the load balancing infrastructure does the right thing, see if it works even if we only remove maybe five or ten percent of traffic, something like that. We all kind of agreed that that would be enough.
00:23:05
Speaker
You know, that would be an experiment that was valuable but not too risky, I think. Yeah. Got it. What were you trying to learn in that experiment? Because this is one of the things that I really like to ask folks in this conversation, because every big migration has this just-so story around it of, and then we did this hugely ambitious thing.
00:23:28
Speaker
And it's like, okay, but there's almost never a moment where, looking forward, the obvious move is some hugely ambitious thing for any given company to do. It's very much retconned into, well, and then we figured it out. So at the beginning, there's always a lot more ambiguity.
00:23:47
Speaker
And you mentioned this as an experiment. What were you hoping to learn? And where did you think it would go at that point, just by investing a little bit of work?
00:23:59
Speaker
Yeah, yeah. I think what we were trying to learn is, you know, we'd been through this experience with the outage that we talked about earlier. Excuse me.
00:24:12
Speaker
Where we were like, well, we would have drained that AZ if we could have. And so it was kind of like, well, can we practice this in peacetime, as it were?
00:24:25
Speaker
Can we just start doing some of the stuff that we would have done? And we'll see a couple of things, right? We'll see the main site statistics, latency and error rates, things like that.
00:24:37
Speaker
But then we'll also get an opportunity to see, does our machinery for that work? And we'll get to see what is left behind. One of the patterns in this project, I think, is that we iterated on these drains every week or so as time went on.
00:24:55
Speaker
And we had a couple of lines on the graph that indicated how much bandwidth we were dealing with in the AZ that we were draining, and moreover they were split out by service. So that was a way for us to see bandwidth per service as a kind of proxy for how much action is still happening despite the drain. Yeah, that makes sense. That was a way for us to kind of nibble at it.
00:25:23
Speaker
Got it. And that wasn't a known thing beforehand: you start to drain away from the front, and you weren't sure how much of the traffic that was going to get. Well, we knew roughly how much we could take at a time just based on some math, but we didn't know anything about what would happen once it hit the app servers, because of the high degree of fan-out, the many services behind them. For example, for memcache and the memcache configuration we had, we were quite sure we wouldn't be able to remove any traffic.
00:25:54
Speaker
We wanted to see it anyway. Yeah, that makes sense. So how did you actually do this draining?
00:26:07
Speaker
Tell me a little bit about... you know, we've talked generally, it's like shifting traffic or whatever, but what does it mean to actually do this? Yeah. Well, I will say that the mix of techniques we used evolved over time.
00:26:23
Speaker
Good to hear that. So the first thing we did... you remember I told you we had this load balancing configuration where we had separate reverse proxies fronting each availability zone, and they fed traffic only to upstream web apps in their availability zone.
00:26:49
Speaker
So the first thing that we did is we actually just started going around and um
Challenges and Tooling Improvements
00:26:55
Speaker
causing the reverse proxies to NACK health checks.
00:27:00
Speaker
So they just would not get traffic assigned anymore. Each one that we did that to in that availability zone would not get assigned traffic. And the outermost load balancing layer saw each reverse proxy as valid and equal; it wasn't really AZ-aware.
00:27:18
Speaker
You just go from having, like, 99 total reverse proxies to 98, 97, or whatever. And so that was the idea: by causing those at the top of the AZ to fail their health checks, the AZ in its entirety would get proportionally less traffic.
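As a rough illustration of that "fail the health check to shed traffic" mechanism, here is a tiny sketch with invented paths and ports; it is not Slack's actual proxy code.

```go
// Minimal sketch of "drain by failing health checks": an admin toggle
// flips the health endpoint to 503 so the outer load balancer stops
// assigning traffic to this proxy. Paths and port are made up.
package main

import (
	"net/http"
	"sync/atomic"
)

var draining atomic.Bool

func main() {
	// The outer load balancer polls this; NACKing it removes the node.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			http.Error(w, "draining", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})

	// An operator (or tooling SSH'd onto the box) toggles drain state.
	http.HandleFunc("/admin/drain", func(w http.ResponseWriter, r *http.Request) {
		draining.Store(r.URL.Query().Get("on") == "true")
		w.Write([]byte("drain set\n"))
	})

	http.ListenAndServe(":8080", nil)
}
```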
00:27:38
Speaker
Got it. How well did that work? It sounds fairly simple. Yeah, it worked decently well. There was a sort of fineness of control that was lacking, though.
00:27:53
Speaker
It wasn't extremely fast, right? We had to go around to a bunch of different servers and run a command. And then there's this thing where at some point you'll start running afoul of the sanity checks in the load balancing layer itself.
00:28:12
Speaker
For example, Envoy has this idea of panic mode, where you can configure Envoy so that if it sees, say, 33% of its backends are bad, it'll be like, whoa, my health information must be bad.
00:28:26
Speaker
The safest thing for me to do is just start spraying traffic across all the backends regardless of health status, because I've gotten some bad information. And we did run afoul of that once or twice. I think that was the problem.
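The panic behavior being described boils down to roughly the logic below; Envoy's real healthy panic threshold is configurable (it defaults to 50%) and is more nuanced than this sketch, so treat this as an illustration only.

```go
// Sketch of the panic-threshold behavior described above: if too few
// backends look healthy, the balancer distrusts its health data and
// sprays traffic across everything. Administrative drains that NACK
// health checks count against the same threshold, which is the trap.
package main

import "fmt"

type Backend struct {
	Addr    string
	Healthy bool
}

// eligible returns the set a load balancer would route to.
func eligible(backends []Backend, panicThreshold float64) []Backend {
	var healthy []Backend
	for _, b := range backends {
		if b.Healthy {
			healthy = append(healthy, b)
		}
	}
	frac := float64(len(healthy)) / float64(len(backends))
	if frac < panicThreshold {
		// Panic mode: ignore health status entirely.
		return backends
	}
	return healthy
}

func main() {
	backends := []Backend{
		{"az-a-1", false}, {"az-a-2", false}, // drained AZ
		{"az-b-1", true}, {"az-b-2", true},
		{"az-c-1", true}, {"az-c-2", true},
	}
	// With a 50% threshold this stays out of panic (4/6 healthy); a
	// whole-AZ drain plus a few real failures could tip it over.
	fmt.Println(eligible(backends, 0.5))
}
```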
00:28:38
Speaker
Yeah, I think that was probably the biggest one. Which actually, you know, in the environment of these experiments was fine, because the underlying infrastructure was actually healthy.
00:28:51
Speaker
But yeah, it brought us to a point where pagers were going off and the system was not acting like we wanted it to. Yeah, it had reverted to treating everything as healthy because somebody was lying to it.
00:29:04
Speaker
That somebody was you, NACKing down services or individual load balancers, but maybe that's not the fight with the computer you want to have. Yeah. We're kind of conflating the signal of "this backend is healthy" with "this backend should be assigned traffic."
00:29:23
Speaker
Yeah, that makes sense. You want to separate those. Okay, so a couple of imperfections in that, and you mentioned this evolved. What was the evolution of that tooling?
00:29:36
Speaker
So in the end, what we actually ended up doing was... we had introduced the xDS layer, whose implementation at Slack was called Router. We had introduced that as kind of a read-only store for service discovery information from Consul.
00:29:55
Speaker
But the other job of that protocol is to serve configuration information dynamically to Envoy processes. And there is actually a relatively simple way in the Envoy system where you can just say, well, assign proportionally less traffic to this zone.
00:30:15
Speaker
So we actually embodied everything in the Envoy configuration and then just pushed signals. I think the drain signal actually got written into Consul at some point and then replicated back into the xDS system.
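Conceptually, the knob turns a per-AZ drain fraction into the weights the control plane pushes to Envoy. A hedged sketch of that arithmetic, not Envoy's actual locality-weighting code:

```go
// Sketch of the "knob between 0 and 100%" idea: the control plane
// converts a per-AZ drain fraction into locality weights that it
// serves to Envoy over xDS. The arithmetic here is illustrative.
package main

import "fmt"

// localityWeights converts drain fractions (0 = take full share,
// 1 = fully drained) into integer weights for each AZ.
func localityWeights(drain map[string]float64) map[string]uint32 {
	const base = 100
	weights := make(map[string]uint32)
	for az, d := range drain {
		if d < 0 {
			d = 0
		}
		if d > 1 {
			d = 1
		}
		weights[az] = uint32(base * (1 - d))
	}
	return weights
}

func main() {
	// Drain 100% of traffic out of us-east-1a; the other AZs absorb it
	// in proportion to their remaining weights.
	fmt.Println(localityWeights(map[string]float64{
		"us-east-1a": 1.0,
		"us-east-1b": 0.0,
		"us-east-1c": 0.0,
	}))
}
```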
00:30:29
Speaker
So it was more just like a knob between zero and 100% of traffic. Cool. What did that granularity get you? I imagine Slack had enough load balancers in each AZ that it wasn't that bad, but you did mention that increment was meaningful.
00:30:48
Speaker
How did you see that show up? Yeah, honestly, mostly it eliminated a lot of difficult math around maint'ing and un-maint'ing servers.
00:31:02
Speaker
We could get most of the effect that we wanted. I can't remember exactly how many reverse proxies we had in each availability zone, but I feel pretty certain it was on the order of 100.
00:31:16
Speaker
So once we implemented the proportional traffic shifting through Envoy, we had like 1% granularity. But before that, I think at worst we probably had like 3%, something like that. It was just a bit of a messy control interface.
00:31:32
Speaker
And it also relied on stuff like being able to SSH around. In particular, if we were doing this for real, we wanted to SSH to every server in the impacted availability zone, which is a lot to ask if maybe your network is totally hosed.
00:31:47
Speaker
Probably a lot to ask in an availability zone that is presumably having network connectivity issues. That's a much better way of saying what I said. Interesting. Yeah. And I guess 3% doesn't sound too awful, but the real number was probably something like 2.7%, and knowing whether five maint'ed-down servers times 2.7% means the graph looks right: I certainly wouldn't want to do that math in the middle of an incident.
00:32:15
Speaker
Yeah. And then at some point you're just going to fall into panic routing anyway, and then you'll be in a mess. Yeah, no good there. Cool. So how did that tooling... was that tooling part of the original experiment, or was this sort of beyond that? Had you finished the initial experiment and learned what you needed to learn?
00:32:38
Speaker
Yeah, I think we did the first few of these... my recollection is we did a few of these experiments to the point where we felt like we had some things to go and work on with some of the different services underneath.
00:32:55
Speaker
And we were like, okay, now that we've seen there's some good work coming out of this, we can step back and do a bit of work in the load balancing layer to make this a nicer, more continuous process.
00:33:10
Speaker
Kind of justify the investment in tooling by saying, well, we're going to go do some work on these underlying services and then come back and try this again and see if we can get further. Got it, that makes sense. So what did you see that proved it was worthwhile?
00:33:33
Speaker
Let's see. We certainly saw that we were able to decrease traffic to the app servers. And we really decreased traffic in the AZ as a whole.
00:33:50
Speaker
We learned some stuff. In particular, my recollection is that I mentioned two of the important backends are memcache and Vitess.
00:34:05
Speaker
And we were like, oh, actually, for memcache and Vitess we don't see very much failover in traffic at all. This is not surprising for memcache, because we knew we had a single memcache node for each key; there's no replication in the system. But we weren't entirely sure.
00:34:22
Speaker
There was a theory that we could remove a significant amount of traffic from Vitess by doing this as well, because Vitess has some affinitization per AZ kind of baked into it.
00:34:34
Speaker
So there was an idea that we would get some kind of passive traffic draining. And we were like, okay, maybe this is enough, because we knew that in order to manage these drains through the Vitess system, we would actually have to do work in the Vitess system, because of the consistency model.
00:34:52
Speaker
And so, conversely, we learned that we wouldn't get enough out of this kind of passive draining to make it so that we didn't have to go back and do the work on the Vitess failover.
00:35:07
Speaker
It was kind of finding the justification for investing back there. Yeah,
Service-Specific Draining Strategies
00:35:14
Speaker
that makes sense. You mentioned earlier that Google has this model of 100,000 services or whatever, or thousands of services, and each of them has its own fine-grained control behind it, but that's not what Slack did.
00:35:28
Speaker
But it sounds like you actually learned over the course of this process that there were a couple of services you needed to tackle directly, because they didn't just passively drain out.
00:35:39
Speaker
How did you think about prioritizing those? Because that's no longer just your team does some work and, ta-da, the system is better. Yeah. So more often than not... well, there were a few services that we could just silo by AZ, right? Just make it so that downstreams in one AZ can only reach upstreams in that same AZ, and it would just work.
00:36:10
Speaker
But there weren't actually that many of them, and not that many of the very important services. So we did end up mostly going service by service.
00:36:22
Speaker
And I think you actually came up with a lot of this. We ended up doing this kind of Eisenhower-matrixy thing where it was sort of difficulty versus value, right? So you had low difficulty, low value.
00:36:40
Speaker
You had high difficulty, low value; well, we don't want to do those. You have low difficulty, high value; we definitely want to do those. And then high difficulty, high value, and those are the ones that are going to be kind of spendy.
00:36:53
Speaker
I do remember doing that. I don't remember filling anything out myself, just drawing out the graph and saying, where do these dots go? Yeah. It's not obvious. Yeah.
00:37:04
Speaker
Yeah. And in practice, I basically went around to somebody from each of these services and was like, well, let's just do a napkin sketch. Let's do a one-pager about what we think the right way to drainify your service is and roughly how much that has to cost.
00:37:29
Speaker
And that was something we could feed into our engineering planning process, both for the project, which at this point was shaping up into a bigger cross-functional project, and also so we could make sure that, since each engineering team keeps their own roadmap for their service and the work they want to do over the next year or two, we could dovetail all that in there as well. I feel like there's a danger in these cross-organizational projects where people feel like all the work they need to do to keep their service stable is getting hijacked for something that's kind of trendy or a flight of fancy.
00:38:10
Speaker
I found that in the planning process, there were often times where teams would be like, well, you're asking us to handle another axis of complexity for our service, right? Now, instead of worrying about our global capacity, we have to worry about each AZ's capacity.
00:38:26
Speaker
Can we clean this thing up to make it easier to run? Can we reduce some complexity elsewhere? And we were able to make space for them to do this work that had oftentimes been deprioritized for a while.
00:38:41
Speaker
Interesting. I think this is an interesting tangent to take, in infrastructure in particular: if you own a service and it costs a lot to own that service, the team who owns it has the best local knowledge about it. But when you come in with some global system property and say we should behave this way, is there a way to navigate that?
00:39:09
Speaker
Mm-hmm. That's a good question, because often there's a lot of, I don't know if I would say political or cultural friction that comes up in these cross-functional projects, where, as I hinted before, teams feel a little put upon. And they do
00:39:31
Speaker
understand their service the best. And it's true, they do, actually. I sort of believe that every service team is already operating at kind of a local maximum.
00:39:42
Speaker
Given the organizational and technical constraints around them, they're almost certainly doing the best they can. We were able to, I think, reflect...
00:39:56
Speaker
There's a way in which I think we reflected a shift in organizational priorities, which sounds harsh, but in kind of a good way. We were able to say, well, some of this other stuff that you've been told to do is not as important anymore, right? Since we've decided this is
00:40:23
Speaker
an important thing, let's take some of these other priorities and make those less important. And let's also wind the things you want to do around fixing technical debt and making your service easier to run into the cross-functional project that we're doing. Yeah. There's an interesting alignment there:
00:40:50
Speaker
you have your project, but you also, because you've gone through the organizational wringer, understand the organization's priorities perhaps better than anyone else, or at least fresher than anyone else, and can help teams stack their roadmaps with work that hopefully they think is valuable and also aligned with the latest and greatest thinking about what the system should do as a whole.
00:41:14
Speaker
Which can feel a little political, but it's hopefully a valuable exercise. Yeah. And it was always important for me to be like, well, I'm coming here with some information. I have some knowledge about what has worked for other services, but you need to decide what's actually best for you.
00:41:36
Speaker
You know, we can sort of grade it together, right? We can talk about the quality versus the difficulty of the solution; we can go back and forth on that. But at the end of the day, I think everyone really wants to do their own design work.
00:41:53
Speaker
Yeah, understood, absolutely. That's the hard and thinky part of engineering in, I guess, 2025. So AI is going to do all the stuff that's not hard and thinky, right? What were some of the systems that lived in that easy, high-value area versus the systems that lived in hard and low-value?
00:42:17
Speaker
And were there common traits that kind of shoved them in one direction or the other? Yeah, for sure. In particular, we started talking a lot about stateful versus stateless services. Yeah.
00:42:31
Speaker
So, for stateless services...
00:42:36
Speaker
the web apps would be your canonical example of a stateless service, right? They're kind of born, they die, they get all the data they need to serve more or less from services upstream of them. And so those were
00:42:52
Speaker
almost trivial for us to do, because we could just hack them up by AZ. We introduced a filter in the service discovery layer where you could just get a magical list of servers that were all in your own AZ.
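A minimal sketch of that same-AZ filter idea, with hypothetical types; the fallback to the full list when the local AZ is empty is my assumption, not necessarily what Slack did.

```go
// Sketch of the service-discovery filter described above: a stateless
// caller asks for upstreams and gets back only the ones in its own
// AZ, which silos fan-out by availability zone.
package main

import "fmt"

type Endpoint struct {
	Addr string
	AZ   string
}

// sameAZ filters a discovered endpoint list down to the caller's AZ,
// falling back to everything if the local AZ has nothing to offer.
func sameAZ(all []Endpoint, callerAZ string) []Endpoint {
	var local []Endpoint
	for _, e := range all {
		if e.AZ == callerAZ {
			local = append(local, e)
		}
	}
	if len(local) == 0 {
		return all
	}
	return local
}

func main() {
	upstreams := []Endpoint{
		{"10.0.1.5:443", "us-east-1a"},
		{"10.0.2.5:443", "us-east-1b"},
		{"10.0.3.5:443", "us-east-1c"},
	}
	fmt.Println(sameAZ(upstreams, "us-east-1b"))
}
```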
00:43:06
Speaker
So we just kind of groomed those into silos, one per AZ. And that was completely trivial. On the other end would be the Vitess infrastructure.
00:43:21
Speaker
Vitess runs in these replica sets, usually about five wide. Each member of a replica set has all the same data, but only one of them is primary at a time.
00:43:36
Speaker
And so to accomplish a failover in the Vitess system, you actually have to send a signal to each replica set that's like, you are not primary anymore.
00:43:47
Speaker
One of the others needs to be primary. And that actually has to be managed through the Vitess orchestration framework. So the database team actually had to do a fair amount of implementation work to make this happen reliably and quickly across thousands of replica sets.
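At the orchestration level, the shape of that work is roughly the loop below. reparent() is a stand-in for whatever the database team's tooling actually invoked (conceptually a planned reparent through Vitess), and the shard names are made up.

```go
// Orchestration-level sketch of the Vitess piece: for every shard
// whose primary lives in the AZ being drained, ask the cluster to
// elect a new primary elsewhere. Not a real Vitess API call.
package main

import "fmt"

type Shard struct {
	Keyspace  string
	Name      string
	PrimaryAZ string
}

func reparent(s Shard, avoidAZ string) error {
	// Placeholder: the real implementation has to coordinate with the
	// Vitess orchestration layer and respect replication health.
	fmt.Printf("reparenting %s/%s away from %s\n", s.Keyspace, s.Name, avoidAZ)
	return nil
}

// drainPrimaries moves every primary out of drainAZ, shard by shard.
// Doing this reliably and quickly across thousands of shards is the
// hard part the database team owned.
func drainPrimaries(shards []Shard, drainAZ string) {
	for _, s := range shards {
		if s.PrimaryAZ != drainAZ {
			continue
		}
		if err := reparent(s, drainAZ); err != nil {
			fmt.Printf("failed to reparent %s/%s: %v\n", s.Keyspace, s.Name, err)
		}
	}
}

func main() {
	drainPrimaries([]Shard{
		{"main", "-80", "us-east-1a"},
		{"main", "80-", "us-east-1b"},
	}, "us-east-1a")
}
```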
00:44:07
Speaker
I think that was probably a lot of that work: higher complexity, but also higher need, right? Because that's a very important data source, and we really do not want to be writing data into an impacted AZ first in that system.
00:44:23
Speaker
Yeah. And the middle, I think, was probably memcache, where we started in this place where we were treating memcache as if it were a strongly consistent data store, because we had one big global memcache ring for the site.
00:44:41
Speaker
I'm eliding some details, but each piece of data was basically on one memcache server. And so I believe, well, you've talked to Glenn already.
00:44:54
Speaker
There's another episode. Yeah, we talked with Glenn a couple of weeks ago about how some data that everyone needs at the top of the hour is all on one memcache shard. Yeah. That's horrible.
00:45:05
Speaker
Yeah. So we had to figure out what to do with that. As I recall, it was something where we basically introduced a per-AZ replica of each memcache and then had to go through the code and figure out where the consistency was important and where it wasn't.
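One way to picture the resulting cache topology; this split-by-call-site sketch is my illustration of the idea, not Slack's actual cache client.

```go
// Hedged sketch of the memcache split described above: reads that can
// tolerate a stale answer go to a per-AZ replica ring, while the
// call sites audited as consistency-sensitive keep hitting the single
// authoritative ring.
package main

import "fmt"

type Ring string // stands in for a real memcache ring client

type CacheClient struct {
	localAZReplica Ring // replica ring in the caller's AZ
	authoritative  Ring // the original single global ring
}

// Get routes by whether the call site can tolerate replica staleness.
func (c CacheClient) Get(key string, needsConsistency bool) (Ring, string) {
	if needsConsistency {
		return c.authoritative, key
	}
	return c.localAZReplica, key
}

func main() {
	c := CacheClient{localAZReplica: "memcache-us-east-1b", authoritative: "memcache-global"}
	fmt.Println(c.Get("user:123:profile", false))      // fine if slightly stale
	fmt.Println(c.Get("channel:456:membership", true)) // audited as consistency-sensitive
}
```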
Future Scalability and Flexibility
00:45:27
Speaker
Absolutely. That makes sense. For folks who are thinking about this: by default, a lot of systems are probably not designed around "how can I split my site up into three proto-independent sites ten years in the future when we're at tens of millions of users."
00:45:43
Speaker
Is there anything you would encourage people to think about early on in system design to make this more plausible later? Or is this just a big-company problem?
00:45:56
Speaker
I was certainly tempted to say it's not. I mean, I do think there's a life cycle. Well, yeah. I think there are some companies that have to deal with a lot of different sites early on, and they have to think a lot about how their data is flowing back and forth.
00:46:14
Speaker
And so they deal with that kind of naturally. It's hard for me to say, off the top of my head, just when exactly this is a good idea or something to do.
00:46:32
Speaker
I will say that there's something organic in it, which is that at some point, assuming you're hosted at AWS, you'll become aware of the cross-AZ transfers.
00:46:43
Speaker
And I think in general, it's really useful, even from the beginning, to be able to segment your monitoring by availability zone. That's a good place to start getting your feet wet. Then you can at least look at bandwidth or RPCs and see what is flowing back and forth.
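Getting that visibility can be as simple as tagging request metrics with the AZ on both ends; here is a toy sketch where an in-memory map stands in for whatever metrics library you actually use.

```go
// Sketch of the observability starting point: tag every transfer
// metric with the availability zone of both sides so you can graph
// bandwidth and errors per AZ and spot cross-AZ chatter.
package main

import "fmt"

type key struct {
	Service       string
	SourceAZ      string
	DestinationAZ string
}

type Metrics struct{ bytes map[key]int64 }

func (m *Metrics) RecordTransfer(service, srcAZ, dstAZ string, n int64) {
	if m.bytes == nil {
		m.bytes = make(map[key]int64)
	}
	m.bytes[key{service, srcAZ, dstAZ}] += n
}

// CrossAZBytes is the number that shows up on the cost sheet and,
// later, on the "what is left behind after a drain" dashboard.
func (m *Metrics) CrossAZBytes() int64 {
	var total int64
	for k, v := range m.bytes {
		if k.SourceAZ != k.DestinationAZ {
			total += v
		}
	}
	return total
}

func main() {
	var m Metrics
	m.RecordTransfer("webapp->memcache", "us-east-1a", "us-east-1a", 4096)
	m.RecordTransfer("webapp->vitess", "us-east-1a", "us-east-1c", 8192)
	fmt.Println("cross-AZ bytes:", m.CrossAZBytes())
}
```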
00:47:01
Speaker
Usually, yeah, I think the first step is getting some observability and some measurement into the system. Yeah, that makes a lot of sense. You need to be able to see it; you need to know if this is a problem for you. And that pairs really nicely with, as you said earlier, the cost thing tending to be a leading indicator of how much you should care about independent AZs.
00:47:25
Speaker
Everyone's going to hit some amount of cost if you're not aggressively thinking about it. Yeah. And I think another thing to think about, that people often don't, is that a lot of the time, if there is one logical instance of something, then it is somehow or another a single point of failure.
00:47:45
Speaker
I'm thinking particularly, you know, if you have one transit gateway and it sits in between all of your availability zones, it is unlikely that the physical component will fail in all these availability zones, but it's very likely that somebody will eventually misconfigure it through the console.
00:48:00
Speaker
Yeah. And so this will have the same effect. When people fat-finger things, as we say, there's often a lot of emphasis on not making the mistake again, on linting or doing the right code review or whatever. But I think it's also a signal that there's a good opportunity to maybe increase some redundancy there.
00:48:28
Speaker
Yeah. I remember one of the things that blew my mind was talking to some large customer of Slack's and they pointed out that they consider Slack a single point of failure because it's
Addressing Human Error
00:48:38
Speaker
just one thing. It's like, of course, looking at it from the inside, like I promise it's many things.
00:48:43
Speaker
But to a first approximation, you know, a good product presents as a single unified thing. And their thought was, if you want redundancy, do we need to go buy a second chat solution that everyone sits in? My purpose in that meeting was to talk them out of that, and I think we mostly did, because it's a horrible question. But this is true at every layer in the stack, right? The products, the load balancers, the internal services, everything:
00:49:14
Speaker
a single point of failure is a single point of failure, and those don't feel good. Yeah. And also I will say, based on my experience, it is much more likely that outages will be caused by a human misconfiguring something than by some underlying equipment failure. Absolutely. And so I feel like that was one of the places in this project where we were able to accomplish the biggest shift in thinking: oh, this is not just a tool for "a data center caught on fire," this is also a tool for "somebody misconfigured some software" or "a bad deploy went out."
00:49:53
Speaker
Yeah. If you can drain away from it, that's substantially different. So where did things end up? What was the final state of the system when you left?
00:50:04
Speaker
The final state of the system when I left was that we could, and I couldn't tell you exactly what the numbers were, drain all the critical services out of an availability zone. I think our goal was five minutes or less. Most services would drain a lot faster.
00:50:22
Speaker
Five minutes or less was mostly driven, I think, by some of the underlying database work. We got to a place where we were doing it every week. We didn't talk about this that much, but the cycle we used as we iterated was to get everybody who was involved in a service together on Friday mornings and be like, okay, we're going to drain the site together.
00:50:45
Speaker
And everybody rides along in the incident channel until you get enough confidence that your service won't go down under drain, and then you don't have to pay so much attention anymore. So we rode along with that until we felt that we could really address most of the significant traffic, but we were also kind of okay leaving a long tail of stuff to be cleaned up. I'm thinking particularly about the memcache migration work, where we knew there were a bunch of keys that would need to be migrated
00:51:16
Speaker
over time, but we had a lot of confidence in the basic approach.
Institutionalizing AZ Draining Practices
00:51:22
Speaker
And there's something I talked about in the QCon talk, but not so much here.
00:51:28
Speaker
It's that I think one of the keys to success here was that we were able to find something that was incremental and did not require every service to walk along with the program in lockstep.
00:51:40
Speaker
So we got to this place where we felt things were mostly done, but there were some important work streams that would keep going for maybe as long as a year.
00:51:51
Speaker
Yeah, that makes sense, and that's, I think, hugely powerful. A migration where 90% done is 0% of the value sucks. It feels bad. And some of them just have to be that way: if you are slightly out of compliance, you are out of compliance, and that triggers a whole bunch of things. But if the goal is performance and reliability, you can be 80% done and that is just 80% done.
00:52:17
Speaker
Yeah. And it's also fragile, you know, to be delivering value so late in the project. I feel like you're kind of at risk until you're delivering real value and you have things locked in. Yeah, absolutely. It strikes me the other thing that was important
00:52:37
Speaker
by the formal end of the project was that we had gotten the system defaults for things to be in this new AZ-ified world.
00:52:47
Speaker
It was in a place where, if you're setting up some new infrastructure, there are guardrails in the system that would discourage you from doing things in a global way by default.
00:52:58
Speaker
You'd have to kind of sign an "I solemnly swear I'm up to no good" before you could do anything globally and incur cross-AZ traffic bills. That's cool. That makes sense. And it really locks in the progress going forward, so you don't end up having to shepherd this process literally forever in order to make it valuable.
00:53:23
Speaker
Yeah. At some point, you know, the organization needs to pick up the load. Absolutely. Yeah.
00:53:34
Speaker
And we're about out of time, so I'll ask one last, maybe spicy,
Closing Insights from Cooper Bethea
00:53:39
Speaker
question. Would you recommend people use Consul? And if so, how?
00:53:47
Speaker
If they're just starting a project.
00:53:50
Speaker
A deeply loaded question. I mean, I think it's... What can I say? and
00:53:57
Speaker
How can I say this... in short, probably not. I think that, as I mentioned, there's this sort of attractive-nuisance quality to Consul, where on the one hand it's a pretty easy-to-use service registration and discovery system.
00:54:19
Speaker
And on the other hand, it's a lock server. And these things are just... you need really different guarantees from each one, and those should probably just be separate systems.
00:54:32
Speaker
It's okay: you can use the AWS-native service registration or whatever else, but for service registration and discovery, you need an eventually consistent system that is going to guarantee you high uptime.
00:54:47
Speaker
For a lock service, you need strong consistency. It's just necessarily going to be a more delicate system from a reliability perspective. So I think the needs of these things are sufficiently divergent that they should just be two systems.
00:55:04
Speaker
That makes a ton of sense. Cool. Well, we're just about out of time. Thank you so much for coming on the show. I guess one standard wrap-up question:
00:55:17
Speaker
Where can people find you on the internet if they want to learn more about this or get in touch with you? Yeah, I'm mostly present professionally on LinkedIn. I'm just there under my name, Cooper Bethea.
00:55:29
Speaker
Yeah, I'm pretty inactive on other social media right now. If you look at my LinkedIn profile, there's a link to the talk I gave at QCon last November, where I expand on this in a lot more depth.
00:55:42
Speaker
We'll make sure to drop a link to that in the description. Thanks so much for having me, T.R. I've enjoyed this. Thanks so much for coming on.