Slack's 6am Database Club | Ep. 9

24 Plays7 months ago

In this episode of Tern Stories, I spoke with Maude Lemaire about a migration that started with SQL and ended with people taking shifts at dawn.

She joined Slack less than two years out of college, and within months, she was leading a core part of the company’s efforts to keep up with their biggest customer, IBM. The project: unifying Slack’s channel membership model—one of the most central, performance-sensitive parts of the product.

At one point, the system was so fragile that engineers had to be in the office at 6 A.M. every day, just to watch IBM’s login spike and make sure the system didn’t collapse.

-----

Go check out what Maude does! ➡️ https://maudethecodetoad.com/

Her book, Refactoring at Scale, is available now 📖: https://www.oreilly.com/library/view/refactoring-at-scale/9781492075523/

Get Tern Stories in your inbox: https://tern.sh/youtube

Transcript

IBM's Migration Challenges

00:00:00

Speaker

Sometimes you don't start a migration because it's on the roadmap. You start it because everything is broken right now. IBM's Slack workspace was growing so fast that just the daily login pushed the system past its limits.

00:00:10

Speaker

So today I talked with Maude Lemaire about what they did next. Maude's now a principal engineer at GitHub. But the story comes from earlier in her career, when she was just two years out of college and was handed the job of fixing IBM's performance problems as part of the enterprise performance team.

00:00:25

Speaker

And she made incredible decisions under pressure.

Maude Lemaire's Career Path

00:00:28

Speaker

She did targeted refactors, sanity-checked the data model in production, and simplified core systems, finding durable fixes for real pain.

00:00:36

Speaker

It's an amazing story and an even more amazing start to Maude's career. Today on Tern Stories, I have Maude Lemaire here with me. Maude and I first met at Slack, where she was an engineer for eight years.

00:00:50

Speaker

Almost. I didn't quite make it to eight. Almost eight years. And she is now a principal engineer at GitHub. Before that, she was an engineer at Rent the Runway. And she is the author of the book Refactoring at Scale. So I am extremely excited to have you here.

00:01:06

Speaker

Thanks for coming on. Thanks for having me.

Joining Slack: Challenges and Persistence

00:01:09

Speaker

All right. So we're going to talk about the Vitess migration and all of the wonderful hair that comes with that. I wanted to talk a little bit about, because you've been at Slack for so long and seen sort of this, like...

00:01:21

Speaker

wild growth that happened. How did you find Slack in 2016? It was not a household name at that point. Who did you meet, and then how did you land there? So Slack, I started using very briefly, like, a year after public launch, I think.

00:01:52

Speaker

It was early. Headline-wise, it was early. I was still in university at that point, and everyone sort of obsessively read Hacker News except me. I still don't obsessively read it.

00:02:05

Speaker

But I guess... Yeah, you know, I can still clearly do my job and exist in this world without reading Hacker News. So I continue to not really.

00:02:16

Speaker

Yeah, arguably better. But at the time, yeah. That's a whole different can of worms. But basically everyone I went to school with was, like, reading Hacker News and trying out all the cool new stuff that was announced on Hacker News. And that meant, like, ooh, Slack public launch or whatever, let's try it for the computer science undergraduate society that I was on. And so we dabbled in it. It didn't really stick, but it seemed cool. And when I went to Rent the Runway after graduation, we were using HipChat, and I despised it with a deep, deep

00:02:55

Speaker

level of... what is the noun version of despise? Just despicable? I don't know. I didn't like it. It made me upset. And who was I talking to at the time? Someone else I was talking to at the time was also working at a smallish company, and they were using Slack, and they thought it was a much superior product.

00:03:23

Speaker

I was on a very highly paged rotation for front-end stuff at Rent the Runway, which meant that I had to do a lot of things in HipChat, particularly after hours.

00:03:39

Speaker

Yes. And so that just made things so much worse. Especially when things didn't load for some unknown reason. They do.

00:03:51

Speaker

And about...

Role at Slack: Overcoming Technical Hurdles

00:03:53

Speaker

About a year-ish into that job, I decided for many reasons to leave. But one of those reasons was that I was in an absurdly long-distance relationship with my then-boyfriend, now-husband, where I was living in New York and working at Rent the Runway. He was living in Seattle, working at Microsoft.

00:04:13

Speaker

So that's, like, a not-great situation to be in. And we were like, let's both quit our jobs and move to San Francisco and do the programmer thing. And so I started looking for jobs in San Francisco and I was like, ooh, Slack is shiny.

00:04:29

Speaker

And it's this cool tool that seems like maybe it would be a fun fit. And it's not too big. I was also looking for a company that was not... I didn't want to go from Rent the Runway size, which was for sure sub-100 engineers, maybe 50-ish, to, like, a Google. I didn't want that. And so that narrowed the pool a little bit. And I initially got form-letter rejected. Did I tell you this story? No. You got form-letter rejected from Slack? Got form-letter rejected.

00:05:05

Speaker

It clearly didn't stick. Clearly it did not stick. So at Rent the Runway, I was doing mostly front-end stuff and I wanted to move into more of a back-end role. And I'd started doing that at my job, but I wanted my next job to be more back-endy.

00:05:19

Speaker

And I'd done the first round of applying to various jobs in a back-endy role, but I had a very front-end, one-year-out-of-college resume, right? Like, yeah.

00:05:34

Speaker

And even when the market is good, it's hit or miss, right? And so I decided, okay, I'm going to start applying to front-end roles instead.

00:05:48

Speaker

And so I applied to a front-end role that I saw at Slack, and I was like, oh, maybe I'll be a good fit. And yeah, I got form-letter rejected like three days later, along with like three other form-letter rejections. I was looking through, I think, AngelList for jobs at the time. It's now named something else.

00:06:07

Speaker

But you could have good filtering on size of company and stuff. And at that point, I was like, I don't understand. Is it my resume? What's going on? Why am I getting all these form-letter rejections? I'm not even getting to a recruiter call.

00:06:21

Speaker

Like, I know I'm not that bad. Yeah. At least worth a call at that point in your career. At least worth a call. Thank you. And Rent the Runway is not some garbage little startup. Even at that point it was, you know, a name that some people know.

00:06:35

Speaker

Some people, certainly. I wonder if, had I applied more in New York and stayed in New York, that might've held more weight. And I was trying to do the move-across-the-coast thing. I don't know.

00:06:47

Speaker

I don't know. But I ended up sending an email to Camille Fournier, who had been CTO during the first part of my tenure at Rent the Runway and who had been excited to hire me. I reached out to her and I was like, look, if you have a minute,

00:07:03

Speaker

please. I don't know what it is. Is it my resume? What's going on? I'm just trying to find a job that I'll be a good fit for and move to San Francisco. And I think my email was too long, to be quite honest. It was very rambly.

00:07:18

Speaker

And she responded with two words, which were "intros incoming." Bless her. Yeah.

00:07:28

Speaker

And one of those intros was to Julia Grace, who was leading the infrastructure team at Slack at the time. And she saw that I had been form-letter rejected. She sent me this email and she was like, I am so sorry we form-letter rejected you. Would you still like to give it another go? I was like, yeah, I really would. And then I went through the whole interview process and...

00:07:53

Speaker

It was between a job at Slack and a job at Twitter. And I took the job at Slack. That's amazing, first of all.

00:08:05

Speaker

And Julia is amazing. So I have no doubt that you both made the right choice there. Okay, so you show up at Slack. It's now a year post-launch, maybe a year plus a month of interview kerfuffle.

00:08:22

Speaker

What did Slack look like? What was the tech stack there that you had walked into? A lot of just PHP everywhere.

00:08:32

Speaker

There's, like, a lot of PHP and some files that were thousands of lines long.

00:08:41

Speaker

Yeah, libchat.php. Shout out to libchat.php. That file was thousands of lines of PHP. Five to six thousand lines long, and really defensively written PHP, right? Because there were no type annotations for anything, and so it was like, if this stuff I got in is total garbage, let's see what kind of garbage it is, right? Is it a null? Is it a zero? Is it a false? Is it, like,

00:09:13

Speaker

an empty array? Because also everything's arrays. And so every single function essentially had these checks and these log errors if you reach this sort of point and it's the wrong combination of things.

00:09:27

Speaker

And then you do your best to do the thing that you were supposed to do in that function and you return it. And it was all very procedural, right? There was no object-oriented anything. It was truly just a five-to-six-thousand-line PHP file with just function, name of function, and some stuff in it.
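
As an illustration of the defensive, procedural style described here, the following is a minimal sketch in Python rather than PHP. The function name `chat_format_text` and the shape of the message are hypothetical, chosen only to echo the naming pattern mentioned in the episode; none of this is Slack's actual code.

```python
import logging

logger = logging.getLogger("lib_chat_format")


def chat_format_text(message):
    """Hypothetical helper: return the trimmed text of a message.

    With no type annotations to lean on, the function first works out
    what kind of "garbage" it was handed, logs it, and returns a safe
    default instead of raising.
    """
    if message is None:
        logger.error("chat_format_text: got null")
        return ""
    if not isinstance(message, dict):
        # Covers 0, False, strings, lists, and anything else unexpected.
        logger.error("chat_format_text: unexpected type %s", type(message).__name__)
        return ""
    if not message:
        logger.error("chat_format_text: got empty array")
        return ""
    text = message.get("text")
    if not isinstance(text, str):
        logger.error("chat_format_text: missing or non-string text")
        return ""
    # Only now do the thing the function was supposed to do.
    return text.strip()
```

Every bad input ends in a log line and an empty string, which is the trade-off described in the conversation: the code is easy to trace and hard to crash, at the cost of a lot of repeated checks.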

00:09:49

Speaker

It is wild to hear that that is the developer experience of showing up at Slack, especially given, as you said, that the product experience of Slack is just 180 degrees different from that.

00:10:02

Speaker

Oh, yes. Oh, yeah. Like, 100%. But what was really interesting is that my personal experience going into it, especially coming from a very Java-centric backend world, was that if you just did a Command-F, you could find things.

00:10:29

Speaker

That makes a ton of sense. Like, that's easy to explore. It's easy to explore. The naming was, like, actually not that bad. The naming patterns of, like, oh, this function's in this file because it starts with that name, right?

00:10:44

Speaker

Like, lib chat format.

00:10:47

Speaker

It has all the functions named chat format, blah, blah, blah. And from that name, you can probably figure out what they do. And so it was pretty easy to trace your way through things.

00:11:00

Speaker

It wasn't necessarily easy to make decisions, like step back and make decisions about the whole ball.

00:11:13

Speaker

But if you were tasked with, like, oh, there's this weird bug, go fix the bug, it was like, okay, well, let me just trace through and figure out which null check got

Handling Slack's Growth: Performance and Teamwork

00:11:24

Speaker

omitted. Because inevitably it was like, at some point it becomes null for some reason.

00:11:30

Speaker

And then we just try to spew it back out. It's like, okay, where does that happen? Okay, it's here. We forgot the null check here. Ideally, add a test, depending on how you're feeling about adding a test, and then push the fix and go. And so I think it made for very quick iteration early on, which I associate with the quality of the product, the fact that people really cared deeply about how good the product experience was.

00:12:01

Speaker

And you had this really fast development cycle. And so you could make a PR, merge it, and click the deploy button, which deployed it to a hundred percent of production within like 30 seconds or something absurd,

00:12:18

Speaker

and then see it for yourself too. Like, oh, I fixed the bug. My Slack doesn't do the thing anymore. I mean, the whole deploy to a hundred percent of production within 30 seconds is a little scary.

00:12:29

Speaker

And even the most well-intentioned changes that are reviewed by the most senior people sometimes don't go so well. Yeah. Indeed. I joined Slack about two weeks after we decided to not do that. And in no small part because Slack decided to not do this anymore.

00:12:58

Speaker

I was part of a group of people that showed up, and we were asked to think about how to fix that. And of course, everyone else who worked there. But that wasn't for several years after the point we're talking about.

00:13:11

Speaker

Yeah, this is all two years later. Yeah. You can live in that state for a really long time, it turns out. It depends on what you want to focus on. Well, also, you can undo fast.

00:13:23

Speaker

That's the thing. You can ship out the change that breaks absolutely the entire planet, but you can also undo that change equally fast. Yeah. And if you're comfortable with that, it makes sense. We talked a couple weeks ago on this show about the...

00:13:37

Speaker

about how Lyft worked on one of their migrations and how they could move really fast because they had the infrastructure that would hide user errors from people.

00:13:47

Speaker

So it sort of gave you the confidence. If you're culturally comfortable with, I broke it and I fixed it, and that took less than five minutes, that equally provides that kind of confidence. And actually, you know, we're laughing about it because it feels a little ridiculous to treat Slack like that, which is now this infrastructure of the world.

00:14:05

Speaker

Oh my gosh, yes. But maybe that's fine, actually. Maybe that's why Slack got there, in no small part. Yeah, no, I definitely think that was a huge part of it. And I think, too, a huge part of it was, like,

00:14:18

Speaker

being really thoughtful about who they hired, even though, you know, they sent out some form-letter rejections that they clearly should not have. Mostly thoughtful. Once a human gets involved. Mostly thoughtful.

00:14:30

Speaker

But I do feel like I was in the largest hiring class that they'd had, like, ever at that point, which was maybe 30-ish people. That's a lot. Yeah, in October 2016. Yeah.

00:14:45

Speaker

And there were new people starting almost every week at that point rather than every other week. And it was just going like mad. And I remember turning to the person who was sitting next to me at one point and I was like, oh hey, Audrey, how do I do this thing? And they were like, I started last week, so I have no idea. I was like, cool, well, hey, nice to meet you. Great. It is hard to wrap your head around hypergrowth, where if you double the company every year, it means that half the people have not been there a year,

00:15:17

Speaker

which is a lot. But that was slower growth than Slack's user growth at that point, I think, as well. Yes. Yeah. Which in no small part was the reason for the Vitess migration.

00:15:32

Speaker

Yes. Yeah, let's talk about that. Yes. So many fires everywhere. So many fires. So my understanding of the Vitess migration was basically that we were running MySQL master-master pairs as the default store for basically everything. And if nothing else, you could squint and look at the growth rate of our largest customer and realize that Amazon will not sell Slack a machine big enough by some date. So we had to fix that, if nothing else. Yeah. Yeah.

00:16:07

Speaker

So that was compelling, if nothing else. But there's a bunch of other reasons to do it. At what point did you get involved in this effort? Pretty early on. So I was in a product development role at Slack for like three months, I want to say.

00:16:26

Speaker

Something like that.

00:16:29

Speaker

So I started on the core enterprise experience. And Enterprise, the product, had not shipped yet. It was going to GA like three-ish months after I started. But IBM, at the time the biggest customer, was, like, beta-using the Enterprise Grid, the Grid product.

00:16:48

Speaker

And one of the things that I had been hankering to work on at Rent the Runway, but had never really gotten the leeway to do, was work on performance problems. I always thought those were really interesting and intriguing.

00:17:00

Speaker

And my onboarding buddy, when I started at Slack, was specifically working on performance problems at the time. And yeah. And so he was sort of like getting me set up and like figuring out how to navigate the code base, et cetera.

00:17:15

Speaker

And I was like, okay, so give me some tickets. Like, I'm ready to start doing some stuff, right? And he was like, well, I'm working on these performance tickets. I don't know if you find that interesting. I was like, give me the performance tickets.

00:17:27

Speaker

Like, I would like to understand how this works. And I think I really liked figuring out how a system works through that lens. And so I think it was just, like, a really good...

00:17:40

Speaker

fit for me. And so he gave me some performance tickets and I was like, ooh, this is really fun. I'm enjoying fixing these. And it was like the team actually cared that I was fixing performance tickets.

00:17:53

Speaker

And that's honestly rare. Yeah, no, it really is. And it was like, oh, no, IBM really needs us to fix this because we're GA-ing and it needs to feel and look legit, to a level where, like, we're not very comfortable right now. Let's give ourselves as much wiggle room as we can, because once we GA, they've agreed they're going to start onboarding a lot more people way faster.

00:18:18

Speaker

So we need to fix these now, before it literally means Slack will not load for them when they start logging on at 9 a.m. Eastern. So it was actually valuable to do this stuff, which is great.

00:18:30

Speaker

So I started working on these things along with whatever product features I was working on at the time. And then GA happened. And maybe like a week later, it was like, yeah, we need like an enterprise performance-focused group of people just 100% of the time working on enterprise performance. Yeah.

00:18:48

Speaker

Okay. Yeah. I mean, how did that show up? Did IBM's grid not load on that first day of GA? Or was this just forward-looking panic?

00:19:04

Speaker

I mean, forward-looking as in, like, two weeks, really.

00:19:13

Speaker

There were hints. Well, one, they told us, we're going to start ramping up, and we want to be able to do that, and that's our expectation, given that we signed this contract.

00:19:25

Speaker

And so, yep, we were expecting a certain amount of, they're going to keep growing, and we need to be able to sustain that. And then there were also a lot of fires around. I don't know how many hours total in my life I spent looking at percent CPU idle on the database shard that housed IBM's data.

00:19:48

Speaker

Like, probably a significant number of hours total.

00:19:54

Speaker

It's, like, a weird thing to say. I spent a lot of hours... Hundreds or thousands of database shards at Slack, and this is the one that gets the attention. This is the one. Yeah, this is the one.

00:20:06

Speaker

And they kept having, like, stuff happen, like, at one point... And this is after the team was formed, but, like, it really just, like, further exemplifies why we needed this team. And we were talking, like...

00:20:18

Speaker

four or five engineers. It was, like, mid-April. So we GA'd end of January. By mid-April... I'm going to get this wrong, but I think they'd grown by at least 20% in terms of daily actives from end of January. They had a channel with, I want to say, like 17,000 to 18,000 users in that channel.

00:20:45

Speaker

and

00:20:48

Speaker

And someone accidentally archived the channel.

00:20:54

Speaker

Oh, no. And at the time, if you looked at the... So the one smart thing that the archive code did was, instead of trying to do it synchronously within the span of a single HTTP request, which would not have scaled, obviously... Actually, maybe that would have been better, because it would have just timed out.

00:21:19

Speaker

Or out of memory. And then we wouldn't have had this problem later on. Anyway, I'm just realizing this. That's a great point. But it didn't do that. But it didn't do that. No, what it did instead is it enqueued a job.

00:21:31

Speaker

And then the job could have run for however long it wanted to run. I mean, realistically, I think we killed jobs after like 24 hours max, which should have been sooner to be quite honest.

00:21:43

Speaker

Well, whatever. It enqueued a job. And in that job, in a tight loop, it issued a single delete on the channel membership row for every single one of those users in that channel.
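
To make the cost of that pattern concrete, here is a small sketch that puts the row-by-row loop next to one common alternative, deleting in bounded batches. It uses Python's built-in sqlite3 as a stand-in for MySQL. The table name `channel_members`, its columns, and the batch size are assumptions for illustration; the episode does not say how the archive job was eventually changed.

```python
import sqlite3


def make_db(n_members, channel_id=1):
    """Build an in-memory table with one channel and n_members rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE channel_members ("
        "channel_id INTEGER, user_id INTEGER, "
        "PRIMARY KEY (channel_id, user_id))"
    )
    conn.executemany(
        "INSERT INTO channel_members VALUES (?, ?)",
        [(channel_id, user_id) for user_id in range(n_members)],
    )
    conn.commit()
    return conn


def archive_row_by_row(conn, channel_id):
    """The pattern from the episode: one DELETE statement per member."""
    user_ids = [
        row[0]
        for row in conn.execute(
            "SELECT user_id FROM channel_members WHERE channel_id = ?",
            (channel_id,),
        )
    ]
    statements = 0
    for user_id in user_ids:
        conn.execute(
            "DELETE FROM channel_members WHERE channel_id = ? AND user_id = ?",
            (channel_id, user_id),
        )
        statements += 1
    conn.commit()
    return statements


def archive_in_batches(conn, channel_id, batch_size=1000):
    """An assumed alternative: delete a bounded batch per statement."""
    statements = 0
    while True:
        cur = conn.execute(
            "DELETE FROM channel_members WHERE rowid IN ("
            "SELECT rowid FROM channel_members WHERE channel_id = ? LIMIT ?)",
            (channel_id, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        statements += 1
        # A real job could pause here so other traffic on the shard can run.
    return statements
```

For an 18,000-member channel, the first function issues 18,000 statements and the second issues 18 that each delete rows. Batching does not shrink the total work, but it cuts round trips and gives the job natural points at which to back off.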

00:21:59

Speaker

Which was grand. All 18,000 of them. All 18,000 of them in a tight loop. And this also happened around peak login time for users at IBM.

00:22:12

Speaker

And so we were hammering their database shard with 18,000 individual delete requests at the same time that they were trying to do anything in the client. And if you're familiar with, like,

00:22:30

Speaker

how Slack works, you know that everything hinges on a channel, right? Messages are in channels. Canvases are shared to channels. Huddles start in channels. Everything is about channel membership in the end. Do you have access to see this file? Do you have access to see this message? Are you in this channel? Should you get a notification because you're in this channel?

00:22:49

Speaker

Everything is channel-membership dependent. It's how access permissions and everything work. And so, of all the tables, that is the one to not hammer. Right.

00:23:02

Speaker

Yep. Because otherwise you're going to slow everything down. And so we sat there, mostly helplessly, just looking at the CPU idle graph and seeing the number just go down, down, down, down.

00:23:19

Speaker

And you know the like swallowing the goat analogy of like the snake like swallows the goat, slowly digests it. Like that's just what we had to do. We had to swallow the goat. We had to wait for the 18,000 individual delete requests to finish.

00:23:33

Speaker

And then everything recovered and it was fine. How long did that take? like Maybe an hour or two. Oh, yeah. Oh, it was bad. It was very bad.

00:23:47

Speaker

It was not good. That's, yep, okay. Yeah. And like most of these things, we never thought before that that was a problem because, one, no one accidentally... The worst part is that we then had to recover that data, all that channel membership data that was hard deleted, because it was an accidental archival of a channel.

00:24:07

Speaker

No one wanted this to happen. Literally no one wanted this to happen.

00:24:13

Speaker

Yeah, it was not good. But before, no one archived big channels like this. And so, like, why would we have spent... Even if we knew that this was a problem area,

00:24:28

Speaker

that it was not good to issue individual deletes in a for-loop, even if we knew that, which we obviously did, it wasn't worth the time and effort to go fix that until it actually became a problem, because there were like 15,000 other things to do. So it was only when we realized how many of these potential foot shooters... what do we call them? Foot guns. Landmines. Foot guns, thank you. Shooters, who am I?

00:25:05

Speaker

It was only once we'd realized how many foot guns we had that it was like, oh, yeah, look, it's a good thing to have this team. And on that team, one of the wildest things was that we were all in person at that time and we were all West Coast, all in the San Francisco office.

00:25:20

Speaker

And so 6 a.m. was when the 9 a.m. IBMers would all sort of log in. And because we had so many of these random fires happening, and because logging in to Slack was in itself a very database-intensive request, you have to fetch, again, the channel membership.

00:25:40

Speaker

It all comes back to the channel membership. You have to fetch all the channel membership for all these users at basically the same time, because everyone was starting their workday at around the same time.

Technical Solutions and Efficiencies at Slack

00:25:51

Speaker

And so we all traded off being butt-in-chair at the office at 6 a.m., watching the CPU idle graph.

00:26:02

Speaker

I didn't realize that, like, several weeks. Yeah, I guess at some point you get tired very quickly of realizing that you will be woken up at 6 a.m. You might as well just be there at 6 a.m. and have a couple of other people to stare at the problems with you.

00:26:20

Speaker

Yeah, and it was like, what kind of random shim are we going to write this morning? All right. Spin the wheel. Wild. I mean, I love that approach to just, you know, go run towards the fire, right? This is going to happen.

00:26:36

Speaker

We're going to have to fix it. Let's at least make our lives nice when we go fix the fire. I love that. But this problem no longer exists.

00:26:47

Speaker

What did you do about it? Oh gosh, many things in many steps over many years. But so it's incremental. There's, like, the super long scope of the project and there was the shorter scope.

00:27:01

Speaker

Basically three weeks after I started is when they hired the engineer who was tasked with figuring out how to fix the MySQL deployment.

00:27:13

Speaker

And he came in and scoped out Vitess and said, this is the thing that we should do. And so he was working on that at the same time that we on the Enterprise Performance Team were fighting fires and essentially trying to give ourselves enough breathing room to be able to tackle the more fundamental performance problems across the board.

00:27:35

Speaker

But we couldn't do that until we didn't have to be butt-in-seat at 6 a.m. anymore. Right. Yeah. And so one of the things that we needed to fix in that first section to give us that breathing room was to change how we were storing channel membership.

00:27:57

Speaker

At first, we started just doing a couple of different queries to sort of optimize how we were doing, like, mentions counts, badging, figuring out what we need to badge and whatnot. Because again, that is dependent on channel membership.

00:28:11

Speaker

Of course. Yeah. Everything is. So we figured out a couple of tweaks that we could do there to calculate mentions in a way that was a little bit better. But what we quickly realized is that there was only so much we could do to make those queries work

00:28:27

Speaker

better, because they were on distinct tables. So fun fact: back in the day, channel membership was on three different tables. Go on. So private channel membership and group DM membership were on the same table. Public channel membership was on one table, and DM membership was on a completely different table.

00:28:59

Speaker

Right. Okay. Which sort of made sense when you think about the product implications. Yeah. Yeah. Yeah. And so, not a wild idea that we were in that state.

00:29:15

Speaker

What was wild though is that on boot, when all these clients need to load up this information, what information do you need? All of it. Exactly. You need all of your channel membership. You need your DMs. You need your private channels.

00:29:29

Speaker

You need your group DMs. And you need your public channels. And so what does that turn into? That turns into a really lovely UNION ALL query that ends up hitting three different tables.

00:29:40

Speaker

And it ends up doing that for every single user who's logging in at the same time, which is super fun because you're spending... We were doing that in MySQL. So had we been doing, like, three individual

00:29:52

Speaker

requests out to MySQL, say, like, get the public channel membership, get the DM membership, get the private channel membership, and been smarter about ordering that, maybe. But we determined that the fastest way to get that information was to offload all of it to MySQL and do the union there and then get that information back.
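To make the boot-time query concrete, here is a minimal sketch of a UNION ALL across three membership tables. It uses Python's built-in SQLite rather than MySQL and PHP, and the table names, columns, and sample rows are all hypothetical stand-ins, not Slack's actual schema.

```python
import sqlite3


def make_legacy_db():
    """Build an in-memory database with three hypothetical membership tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE public_members  (user_id INTEGER, channel_id TEXT);
        CREATE TABLE private_members (user_id INTEGER, channel_id TEXT);
        CREATE TABLE dm_members      (user_id INTEGER, channel_id TEXT);
        INSERT INTO public_members  VALUES (1, 'C-general'), (2, 'C-general');
        INSERT INTO private_members VALUES (1, 'G-perf-team');
        INSERT INTO dm_members      VALUES (1, 'D-1-2'), (2, 'D-1-2');
    """)
    return db


# One round trip to the database, but three tables are read for every user who boots.
BOOT_QUERY = """
    SELECT channel_id, 'public'  AS kind FROM public_members  WHERE user_id = :u
    UNION ALL
    SELECT channel_id, 'private' AS kind FROM private_members WHERE user_id = :u
    UNION ALL
    SELECT channel_id, 'dm'      AS kind FROM dm_members      WHERE user_id = :u
"""


def memberships_on_boot(db, user_id):
    """Return every (channel_id, kind) pair for one user, sorted for stable output."""
    return sorted(db.execute(BOOT_QUERY, {"u": user_id}).fetchall())
```

Calling `memberships_on_boot(make_legacy_db(), 1)` returns the user's public, private, and DM memberships in one result set, which is the shape of work that multiplies when every user logs in at once.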

00:30:12

Speaker

Yeah. That makes sense. Yeah. We instrumented it or whatever. We figured out that, like, that was actually the most efficient thing. But we needed that to be more efficient. And so essentially what it turned into was this initial project, which, like, they, like management, had no business giving me. Like, I was

00:30:32

Speaker

two years out of college. Like, I had never written a line of SQL before I showed up to Slack, like, six months prior.

00:30:43

Speaker

Never in my life. I'd never written a line of PHP before I showed up six months prior. Like, I had no business doing this project. And yet you were the one. And yet I think we had this many problems and this many people to solve it.

00:31:00

Speaker

And that's where I'm going back to like, I think they just figured out really who they needed to hire. Slack was really good at hiring really good people and knowing how to put them in situations where they would find and grow and figure it out.

00:31:16

Speaker

And, like, there's a lot of trust. And maybe it didn't pan out every single time, but most of the time I feel like it really catapulted a lot of people, particularly junior people who started there, because we were given these projects that

00:31:30

Speaker

were insane to give to junior people. But in a lot of cases, we had other folks around us to, like, review pull requests, help jump into incidents when, like, inevitably something bad happened, right? Like, it was a very "we're all in this together" mentality, which I think also helped. Like, it doesn't help to put a junior person into a fire and be like, good luck, have fun, if you're not going to also be there to, like, pour some water on it. Yeah, absolutely. It's important to give junior people real problems to work on. Like, that's the only way you learn and grow. It's the way you keep people engaged.

00:32:09

Speaker

Yeah. We're all in this to do things that matter. Just because you're junior doesn't mean you shouldn't work on things that matter. But yeah, you need that. You need that support, or you're just going straight into the deep end with bricks around your feet.

00:32:21

Speaker

Yeah, no good. Don't want that. And that's how you burn everyone out. And then you have no more junior people and then good luck to you.

00:32:31

Speaker

So the first thing we did was we created a new table. We called it channels_members, and we combined the data from groups_members, which was the private stuff, and teams_channels_members.

00:32:44

Speaker

Don't ask me about the naming scheme. Like, mistakes were made. So we joined that data onto a new table. And that eased a lot of things.

00:32:56

Speaker

Turns out it made a lot of queries a lot better. Interesting. That's not actually obvious to me, like, a priori. But yeah, it makes sense that it's simpler, but maybe I just have bad SQL intuition.

00:33:12

Speaker

But there were so many places that we were doing checks across both things. And not necessarily places where we were offloading all of it to MySQL, but places where we were doing it in the code. Like, in file permissions checks, we were doing an initial check to see if you had access to the file because it was in a public channel.

00:33:33

Speaker

And then we were doing a check, I don't know in what order, but I think in that order, to see if you had access to the file because it was in a private channel that you were in. And then we were doing the check finally, like, is it in a DM you're in? I don't know. Like, let's just try to find access for this file for you.

00:33:48

Speaker

And we were doing that individually a bunch of times. And we could simplify that to just two checks until we eventually migrated the DMs membership to also be in that table.

00:34:00

Speaker

And then it was just one check. Like, what channels are you in? Give me that result. Is the file shared in any of the channels that you are in?

00:34:14

Speaker

Done. Like, you don't have to do that check a bunch of times. And so, because it offloaded a bunch of work from the database and we weren't doing as many of these checks, it ended up giving us more wiggle room, essentially. Yeah.
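The "one check" described here can be sketched as a single join against a unified membership table. This is only an illustration in Python with SQLite: the `file_shares` table, the column names, and the sample rows are assumptions made for the example, and `channels_members` is just the name mentioned in the conversation.

```python
import sqlite3


def make_db():
    """Hypothetical unified membership table plus a table of file shares."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE channels_members (user_id INTEGER, channel_id TEXT);
        CREATE TABLE file_shares      (file_id TEXT, channel_id TEXT);
        INSERT INTO channels_members VALUES
            (1, 'C-general'), (1, 'G-perf'), (2, 'C-general');
        INSERT INTO file_shares VALUES ('F1', 'G-perf'), ('F2', 'D-3-4');
    """)
    return db


def can_access_file(db, user_id, file_id):
    """One query: is the file shared in any channel this user belongs to?"""
    row = db.execute(
        """
        SELECT 1
        FROM file_shares AS s
        JOIN channels_members AS m ON m.channel_id = s.channel_id
        WHERE s.file_id = ? AND m.user_id = ?
        LIMIT 1
        """,
        (file_id, user_id),
    ).fetchone()
    return row is not None
```

With separate public, private, and DM tables, the same question needs up to three lookups in application code; with one table, the channel type no longer matters to the permission check.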

00:34:33

Speaker

That makes a lot of sense. And it's sort of going back to what you said about LibChat, that, you know, even at five or 6,000 lines, the simplicity of, like, we're going to do this, we're going to do that, we're going to do this

00:34:45

Speaker

is fine as long as you can find it. You can explore the code. You can find what you need to find. But at some point, there will be a scale, and maybe whatever company you're working at doesn't hit it, Slack clearly did, where there's the conceptual simplicity of, no, we put it all in one place, and then the access patterns and the number of things that we could possibly ever think about just become much, much smaller.

00:35:10

Speaker

And it's suddenly worthwhile to swallow that goat and say, like, we're going to put everything in one table and just make it happen. Yeah. So another hurdle that I completely glossed over that we had to deal with to do this is that at the time there was just, like, raw SQL everywhere.

00:35:29

Speaker

It was, like, everywhere. You know, I've talked to a lot of people about this. How do you feel about that? Was that the right choice? Hmm.

00:35:42

Speaker

I mean, hindsight's 20-20. Yeah. yeah I mean, like... Valid. Look, Slack is what it is today. Yeah.

00:35:50

Speaker

odd I buy that in a safe and simple, too. It did work, and so it was fine. It worked. So yes, it was the right decision. Could other decisions have also been the right decision?

00:36:01

Speaker

Possibly. I think that's valid. But I think it did not hinder product development enough to slow it down.

00:36:16

Speaker

So I guess... Then it didn't matter. Then it was a fine decision. One of possibly many fine decisions. Yeah. Yeah. Because that's also all we have to do, right? Is just, like, make enough right decisions

Testing and Communication Strategies

00:36:30

Speaker

over and over again. But you don't know if they're right.

00:36:32

Speaker

Yeah. At whatever point. It makes a lot of sense. We talk on the show a lot about the idea of, like... This is a show about doing migrations, but many times the right decision is to simply not do a migration. Like, don't do work you don't need to do.

00:36:47

Speaker

Oh my gosh, yes. Yeah. And I think up until that point, like, burning fire, IBM, biggest customer, giving us lots of money, trusting us to be their hub of... collaboration hub. Oh, that was the word we were using at the time, right?

00:37:03

Speaker

I don't know if it was. Right? Yeah. Was that like pre-IPO? Maybe something like that. Anyway, the collaboration hub, right? We were their collaboration hub. They were trusting us. So we like had to deliver and like the way to give them the ability to grow.

00:37:18

Speaker

This was one of many projects, obviously, but it was to just simplify the data model, make sure that we had fewer times that we had to run these checks over and over again. Yeah.

00:37:29

Speaker

um

00:37:32

Speaker

And the only way to do that was to find all of the raw SQL in the code that hit either of those two tables, the table with the public membership information and the table with the private membership information.

00:37:44

Speaker

Because how else do you want to consolidate that data other than, like, go through every single individual call site, grep for, like, SELECT blah FROM groups_members across the entire codebase?

00:38:02

Speaker

Find all of those. Okay, then, like, rewrite, like, throw in a little feature flag if statement in there. And then, like, that is how you end up burned over and over and over again. So one of the things that I ended up doing was extracting all of those SQL calls, essentially just wrapping them in individual functions in another file.

00:38:25

Speaker

And so I just, like, abstracted away the SQL by one level, because ORMs were, like, not what I was tasked with doing. That was not the problem that I was meant to solve. Like, that had nothing to do with, like, just join the tables. Like, that was just, like, do that and do it yesterday. Right. Like, I was not going to solve Slack's ORM problem that summer. Definitely not as, like, a one-year-out-of-college grad.

00:38:51

Speaker

I mean, some might argue that's the ideal time in your career to overbuild an ORM. Yeah, fair enough. I'm pretty sure I did that yeah like two years into my career. Like I built a lot of ORM looking stuff.

00:39:05

Speaker

Oh yeah. Because, like, encapsulation and, like, all the things you learned about proper ways to do things. Yeah. I was not going to do that. So it was just... This call, oh, the exact same SQL query is being made over here.

00:39:22

Speaker

Okay, cool, just replace with function call. Function call contains copy paste the SQL query. Like that was all I was doing. And then all of my feature flags were in that file.

00:39:34

Speaker

And I was able to just, like, turn one off, turn one on, see what happened, make sure that it was all clear. Obviously, there was a whole, like, backfill all the data, make sure that we double write.

00:39:46

Speaker

Yeah. because you need the data in the new membership table that you're going to point all the stuff to to be correct and match the data that is encapsulated both in the public channel data and in the private channel data.
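A rough sketch of the approach described above: each raw query moves into one function in one file, the feature flag lives next to it, and writes go to both the old and new tables so the data stays in step. This is Python with SQLite for illustration only; the flag name, table names, and functions are hypothetical, and the backfill step is left out.

```python
import sqlite3

# Hypothetical flag store. The conversation describes the flags living in
# the same file as the extracted query functions.
FLAGS = {"read_from_unified_table": False}


def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE public_members   (user_id INTEGER, channel_id TEXT);
        CREATE TABLE channels_members (user_id INTEGER, channel_id TEXT);
    """)
    return db


def add_public_member(db, user_id, channel_id):
    """Double write: every insert lands in both the old and the new table."""
    for table in ("public_members", "channels_members"):
        db.execute(f"INSERT INTO {table} VALUES (?, ?)", (user_id, channel_id))


def get_public_channels(db, user_id):
    """The only place this query lives; call sites never see raw SQL."""
    table = "channels_members" if FLAGS["read_from_unified_table"] else "public_members"
    rows = db.execute(
        f"SELECT channel_id FROM {table} WHERE user_id = ? ORDER BY channel_id",
        (user_id,),
    )
    return [r[0] for r in rows]
```

Because call sites only ever call `get_public_channels`, flipping `FLAGS["read_from_unified_table"]` switches every read at once, and flipping it back is the rollback.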

00:40:00

Speaker

At one point in this full migration, I accidentally... no one noticed this but this one guy I was working with. On Thursdays at the time, we had, like, gather hours where there was, like, appetizer-y food at the office and most people hung out.

00:40:17

Speaker

Yeah. And I decided that was a good time to do a test run of having Slack read from the new membership table. And...

00:40:29

Speaker

Yeah. And for that time, I accidentally ran the wrong script, and I kicked everyone at Slack out of all of their channels.

00:40:45

Speaker

So instead of doing karaoke, I guess in addition to doing karaoke at 4 p.m., they also did not have access to Slack. Uh, yeah, effectively. And thankfully I just flipped the flag back to read from the old tables, and so no one noticed. This one guy was like, "Weird, my sidebar just kind of flashed." I was like, "Nothing happened, don't worry." I'm going to repopulate that table later. I love that. And...

00:41:20

Speaker

But... so I want to take a brief detour into tools, because you mentioned, like, four or five things there that you were looking at. You've talked about looking at idle CPU on the IBM shard, looking at actual table membership, looking at feature flags.

00:41:39

Speaker

And there's just this pile of information you need in order to do this work. What's critical for you? Like, how do you think about assembling your little tool stack to do this work?

00:41:54

Speaker

But for this type of thing where it's like... channel membership is the crux of everything. Like, it needs to show the source of truth as humanly live as possible. Computerly, I guess, because we're at computer speed, not human speed. Yeah. And so, like, I need to be able to ascertain that if this person is not in this channel and they should be, I need to see that immediately, because I don't want to wait for that person to find out that suddenly something is weird with their channel membership and then have them write in to CE and then get a Zendesk ticket. And then I do the whole flow of, like, oh, well, let me make sure that this person's in this channel that they're supposed to be in. Like, what happened?

00:42:35

Speaker

um And so like getting that information live is so crucial to making sure that you're doing it well. um Dry runs, whatever the output of...

00:42:47

Speaker

Like, I'm about to run the script. It will do this potentially destructive thing. Show me, like, in big, bold letters, what it is about to do. It's sort of like when you go to, like, delete a repo in GitHub, right? And it's like, please type the repo name. And you're like, ugh, this again, right?

00:43:05

Speaker

But it's like, no, you'll want to make sure that you're deleting the repo that you want to delete. Because it is pretty darn destructive. And having been on the other side of, like, customers accidentally deleting a bunch of Slack data and having to recover that, like, no one wants to recover that.

00:43:25

Speaker

And it's, like, a bad time for everyone else. So knowing exactly the potential outcome of the potentially destructive thing that you're going to do is pretty crucial. And knowing, like, what state... like, feature flags, like, what is flipped?
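The dry-run and type-to-confirm ideas can be sketched in a few lines. This is a hypothetical Python helper working on an in-memory set of `(user_id, channel_id)` pairs, not any real Slack script; the function name and parameters are invented for the example.

```python
def remove_members(memberships, channel_id, dry_run=True, confirm=None):
    """Remove every membership in channel_id from a set of (user_id, channel_id) pairs.

    Defaults to a dry run. A real run requires `confirm` to equal the channel id,
    in the spirit of typing a repo name before deleting it.
    Returns the rows that were (or would be) removed.
    """
    doomed = sorted(m for m in memberships if m[1] == channel_id)
    print(f"*** ABOUT TO DELETE {len(doomed)} ROW(S) FROM {channel_id} ***")
    for row in doomed:
        print("   ", row)
    if dry_run:
        print("Dry run: nothing was changed.")
        return doomed
    if confirm != channel_id:
        raise ValueError("Confirmation did not match; refusing to delete.")
    memberships.difference_update(doomed)
    return doomed
```

Making the dry run the default means the destructive path has to be asked for twice: once by passing `dry_run=False` and once by supplying the matching confirmation.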

00:43:39

Speaker

What is not? Like, of the traffic that's flowing, what's in what treatment, what's in what flag, all that stuff. Love them. And at all the different stages, right? So I ran it for just Slack. And so I needed to make sure that... and at that point, this was just, like, a PHP array blob in a really long file.

00:44:06

Speaker

Again, just long files. It's like a blob. And it was like, this is the flag that I'm working on. And like of the teams enabled, it is Slack, right? And like, but that was kind of annoying to have to read and parse in a text file and in a big long PHP array blob.

00:44:25

Speaker

So I have that on my... like, a shortcut that I wrote. Love that. Yeah, absolutely. That makes a ton of sense.
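The shortcut itself is not described in the conversation, so the following is only a guess at its general shape: a tiny helper that answers "which teams is this flag on for?" instead of reading a long config blob by eye. The flag names, team names, and dictionary layout are all hypothetical, and it is written in Python rather than PHP.

```python
# Hypothetical flag configuration, standing in for the long array blob.
FLAG_CONFIG = {
    "unified_membership_reads": {"teams_enabled": ["slack"]},
    "unified_membership_writes": {"teams_enabled": ["slack", "acme"]},
}


def describe_flag(config, flag):
    """Return a one-line, unambiguous summary of where a flag is enabled."""
    teams = sorted(config.get(flag, {}).get("teams_enabled", []))
    if not teams:
        return f"{flag}: OFF for every team"
    return f"{flag}: ON for {', '.join(teams)}"
```

The value is less in the code than in removing one chance to misread production state before flipping something.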

00:44:38

Speaker

I just needed to make sure, because sometimes you just read things wrong too, and you're like, no, I need to make sure that I actually read this right and this is actually enabled for these people. Yeah. Grafana was great for looking at all the stuff that's... It's really interesting that most of the stuff you're talking about is production data.

00:44:58

Speaker

It makes sense, right? That's the scary place to delete data. That's the scary place when performance goes sideways. The performance of my local machine does not matter to anybody. No one cares. No one cares. It's probably fine, too. It's got, like, three users.

00:45:10

Speaker

So, yeah, it's all this production data. I wonder... When I was thinking about, you know, starting this company last year and talking with people about what they cared about, one of the strands of thought that came out of a ton of people, not everyone by a long shot, was: product engineers do not want to care about production.

00:45:34

Speaker

And they perceive that in many cases, caring about production slows them down because production contains the code that they're not writing right now.

00:45:45

Speaker

I'm curious what your reaction to that is. Oh, man.

00:45:53

Speaker

Well, okay, so this is where I'm going to say the spicy thing, but one that I have come to believe more and more strongly over the last several years. And especially in light of what I spent my last four years at Slack doing, which was unapologetically load testing in production.

00:46:18

Speaker

Because I think production is actually the only environment that matters. um Well, of the environments that matter, it is the one that matters the most.

00:46:31

Speaker

I can get behind that. And, like, all of your testing pre-production is never going to be sufficient.

00:46:42

Speaker

And, like, I will die on that hill. Like, I'm ready to, like, people can at me. Like, I...

00:46:49

Speaker

And I actually recently got sort of... not into it, but, like, a discussion at work about testing in production. And obviously a lot of people are, like, freaked out by it because you're potentially impacting real customer data.

00:47:07

Speaker

But because the stakes are so much higher, people tend to take it a lot more seriously and tend to think much more critically about actual bottlenecks and when they're going to run into them. So, for example, and I did this a couple of times, I was like, hey, I have this thing that I wrote, this tool that creates a bunch of fake Slack users. And I'm going to boot a bunch of them up

00:47:39

Speaker

at this time on this day. And I anticipate about this much traffic, which is, like, maybe 10 to 15% over what we've ever experienced. Just letting you know, infra teams, that this is a thing that I'm going to do.

00:47:52

Speaker

I'm going to be monitoring graphs. Here are the graphs that I'm going to be monitoring. And I will turn it off if I see anything go above, like, these red lines. And, like, please verify that the red lines make sense for your systems, because I'm not an expert in Memcached, whatever, like, all these things, right? Like, I am not an expert. I'm more of a breadth-y person. So make sure that these red lines are correct. I pulled them from your own graphs, so they better darn be good.
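The "turn it off if anything crosses a red line" rule can be sketched as a ramp loop that checks agreed thresholds after each step. This is a hypothetical Python outline, not the load test tool described in the conversation; `read_metrics`, the metric names, and the step sizes are all assumptions for illustration.

```python
def run_load_test(steps, read_metrics, red_lines):
    """Ramp through `steps` (simulated client counts), aborting on any breach.

    read_metrics(clients) returns a dict of metric name -> current value.
    red_lines maps metric name -> maximum tolerated value, agreed with the
    owning teams ahead of time.
    Returns (completed_steps, breach), where breach is None or (metric, value).
    """
    completed = []
    for clients in steps:
        metrics = read_metrics(clients)
        for name, limit in red_lines.items():
            if metrics.get(name, 0) > limit:
                # Stop immediately; do not proceed to the next step.
                return completed, (name, metrics[name])
        completed.append(clients)
    return completed, None
```

For example, with a fake metric source `lambda c: {"db_cpu_pct": c / 100}` and a red line of 80, ramping through 1000, 5000, and 9000 clients stops at the third step and reports the breaching metric.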

00:48:18

Speaker

And we will see what happens. And that lit fires under people's butts. But I think a lot of people were also very curious to see what would happen.

00:48:29

Speaker

Like, where are the bottlenecks, right? Like, we don't know. Yeah. Let's find out. Definitionally. Yes, literally. You can't spin up a staging environment that, like, looks like production and attempt to do something that looks like production in staging and expect to hit the same bottlenecks. Like, how are we going to know that, like, Consul will stop generating a file that has all of the services I'm supposed to tell you about after a certain number of services?

00:49:02

Speaker

Like, once you've scaled up and you have so many VTGates or whatever that suddenly it stops being able to generate the file.

00:49:13

Speaker

And then you have no VTGates because, for some reason, instead of just cutting it off, it just generates nothing. And then you have nothing that you're pointing to anymore.

00:49:24

Speaker

um You know, these types of things. Like you can't, you can't really know that this is a true bottleneck in anything but production. Yeah.

00:49:35

Speaker

That makes sense. That makes a ton of sense. Especially if you're focused on performance problems, you're never going to see it. Never. Never. Never ever. You're going to see these things all over the product org, right? At any organization of any reasonable scale, you don't get to ignore performance problems, because your feature may or may not scale depending on what your code looks like.

00:49:57

Speaker

Yeah, and when you get to a certain size too, like, and and complexity of architecture, it's like prohibitively expensive to create an entirely like copy-paste version of your architecture in a like test-only environment.

00:50:14

Speaker

Yeah, absolutely. And then you start getting into these conversations about, like, well, was this system truly isolated during this test? And, like, what state was it in? Where were we with this flag?

00:50:25

Speaker

Was this flag enabled on this environment at this time? Like this subsystem, like, oh, well, it was undergoing a migration at that point, which was not happening elsewhere for some reason, because it's being like, everyone's going to end up testing their stuff in the test environment.

00:50:40

Speaker

So, like, the conflation of all of the potential... like, the factorial space that is, like, this feature flag enabled, this feature flag enabled, this version of Python, this version of blah, blah, blah. Like, it just ends up completely out of control.

00:50:55

Speaker

And so, like, anything you ascertain in the testing environment, because they're so expensive to spin up, won't map to production. And then you're just going to be frustrated that you spent all this money and time on building this test environment that was supposed to tell you when stuff was going to break in production, but it doesn't, because everyone's actually using it to test stuff.

00:51:17

Speaker

And then you'll be like, why did I spend so much money on a test environment when it didn't actually stop these problems from getting to production? That makes sense. I love that, and I love your point that the test environment, to succeed, needs to tell you what's going to happen in production.

00:51:35

Speaker

And if it doesn't, it has failed. And if it doesn't look like production, it has failed. And it doesn't look like production. It just doesn't. It never will. It never will. It never will.

00:51:46

Speaker

Unless your system has an architecture such that, like, you can copy-paste the entire thing and you're deploying it to, like, entire subsystems. Like, you have exactly the architecture of all the pieces that you potentially need in, like, IAD. And then you have exactly the same thing in...

00:52:13

Speaker

whatever the code for one of the US West things is, right? Like, I don't know. IAD is the one everyone knows, right? If Virginia goes down, we're all screwed. Like, nothing will work for anyone.

00:52:30

Speaker

But that's rare. And even those will have some unifying communication point, right? Like, how are you deploying to these things? I'm willing to bet you're deploying to those things from the same unified source.

00:52:44

Speaker

Yeah. Presumably you don't need all your systems in both production and dev or staging. Mm-hmm.

00:52:54

Speaker

All right. Well, I'm convinced. I might have come in convinced already, but... Yeah. So that's my take on testing in production. I think we should do more of it. And I think it also leads to, like, building solutions that are immediately useful in production. So think about, like, circuit breaking, right? Like, you want to have a system whereby you are detecting that there's, like, too much load going to a certain, going back to some of the Vitess stuff we talked about, like, a certain Vitess shard.

00:53:25

Speaker

And so, like, the ability to create back pressure that then feeds back into whatever testing system is exerting that pressure is really important, telling it to just shut off. Some of the load test tooling that we built at Slack was the first thing that we turned off, obviously, when we were detecting any sort of pressure on any of the databases.

00:53:46

Speaker

And one of the ways that we tested that the circuit breaking worked was creating that pressure using the load test tooling. And we made sure that that worked in dev, obviously. But then in production, we were able to actually simulate that.

00:54:04

Speaker

And now like the rest of the entire like Slack architecture benefits from having this like really good circuit breaking infrastructure. And so like by protecting yourself from the testing, you are building a solution that helps the entire system.
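As a rough illustration of the circuit-breaking idea, here is a minimal breaker that opens after a run of overload signals and then sheds calls without touching the backend. It is a generic textbook-style sketch in Python, not Slack's infrastructure; `OverloadError`, the threshold, and the class shape are assumptions.

```python
class OverloadError(Exception):
    """Hypothetical signal that a backend (e.g. a database shard) is under pressure."""


class CircuitBreaker:
    """Open after `max_failures` consecutive overload signals; reject while open."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.open = False

    def call(self, fn, *args):
        if self.open:
            # Shed load without touching the struggling backend at all.
            raise RuntimeError("circuit open: shedding load")
        try:
            result = fn(*args)
        except OverloadError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open = True
            raise
        self.failures = 0  # any success resets the consecutive-failure count
        return result

    def reset(self):
        """Close the circuit again, e.g. once the backend has recovered."""
        self.failures = 0
        self.open = False
```

A load test client that routes its requests through a breaker like this is the first thing to be cut off when pressure appears, which is the back-pressure behavior described above.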

00:54:21

Speaker

And I would argue that's most of the outcomes that you'll get from testing in production: building solutions that will actually help the health and reliability and availability of your entire system.

00:54:34

Speaker

That makes a lot of sense. There's a certain amount of... The system is the system. The system is not the code. The system is not the AWS bill. It is the system. It is the combination of all of those things in a unique way. And things like circuit breaking only make sense if you can run them.

00:54:52

Speaker

You can't run circuit breaking in dev in a meaningful way.

00:54:57

Speaker

Yeah. And that's the synthesis you're about. Yeah, absolutely. I like the broadening of the software engineering role: it is not about writing code. It is about maintaining the system that lives in production, which I think most people would agree with.

00:55:14

Speaker

I mean, that's where customers are. Like, yeah. The goal is not to write code and walk away and get lunch. The goal is to go do things to customers. That's what the job is.

00:55:26

Speaker

And part of that job lives off your laptop.

00:55:32

Speaker

I want to come back and just make sure that we hit the coda of the story. So... there were a bunch of fires to put out, a bunch of simplification. You got to the point where you'd refactored everything into a relatively coherent set of functions and a smaller set of accesses, reads and writes, to the new channel membership table itself.

00:56:00

Speaker

Yeah. How did the actual, if we're going to talk about production, how did the actual production rollout go? What surprises did you see as you were actually turning this on for customers, if any?

00:56:15

Speaker

I mean, there was a time that I accidentally deleted everyone in Slack's membership. That was a fun one. They're not customers, though. Yeah, yeah. Okay. Fair enough.

00:56:29

Speaker

Well, good surprise. Love good surprises. Good surprise. So, like, a couple months after... Sure. Yeah, a couple months after, like, we were fully reading from... we were doing comparison checks, like, sampling comparison checks at some point, of like, seeing, like,

00:56:50

Speaker

oh, I forgot what percentage we did, probably, like, under 1%. We were checking that the data that we're fetching looks like the data in the old tables. I think we did that for about two weeks, and then we turned that off, and we saw that, by and large, the data on the new table was good.
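A sampled comparison check of this kind can be sketched as a wrapper that always serves the new table's answer and, for a small fraction of calls, also reads the old tables and records any difference. This is a hypothetical Python outline; the function names, the 1% default, and the mismatch list are assumptions, not the original implementation.

```python
import random


def compare_reads(read_old, read_new, key, sample_rate=0.01,
                  rng=random.random, mismatches=None):
    """Serve from the new table; on a sampled fraction of calls, also read the
    old source and record any difference. Callers always get the new value.
    """
    new_value = read_new(key)
    if rng() < sample_rate:
        old_value = read_old(key)
        if old_value != new_value and mismatches is not None:
            mismatches.append((key, old_value, new_value))
    return new_value
```

Sampling keeps the extra read load small, which matters when the whole point of the migration is to take pressure off the database; passing `rng` in makes the sampling deterministic in tests.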

00:57:05

Speaker

We think there were a couple instances where we weren't updating data. Channel membership also still stores, as far as I'm concerned, your last-read timestamp for the channel that you are a member of.

00:57:25

Speaker

Because that's unique to everyone who's in the channel, obviously. And so it makes sense to store that on that row. And I think there were, like, a couple places in the code that my grep hadn't caught before, the path that ends up updating that, which became really clear really fast when we tested it internally, mostly.

00:57:43

Speaker

But then there was, like, one case where it only happened in, like, an Enterprise Grid setup, where it's a little bit more complicated, and Slack wasn't an Enterprise Grid. There were very few of us, even if we were working on Grid, using Grid in that way. And it ended up showing itself in production, but it ended up being just, like, a two-line change.

00:58:19

Speaker

And it was like very quick to spot and just like, okay, cool. Ship this out. Deploy button. 100% of production in 30 seconds. ah See, it's a good decision if you're all aligned around it.

00:58:30

Speaker

Yeah. Yeah. What ended up being the biggest win out of this migration, though, was, I think a couple months after we'd stopped reading from any of the old tables, we'd cleared out all that data.

00:58:44

Speaker

There was a customer that was going from one IDP to another and, in that process, ended up accidentally issuing delete requests for 30,000 users.

00:59:02

Speaker

Yeah. And at the time, Slack really just liked straight-up deleting stuff out of databases rather than, like, adding a date-deleted field with, like, a timestamp. It was just like, delete, bye.

00:59:18

Speaker

And it's like it's on the backup for seven days and like otherwise, like good luck to you.

00:59:24

Speaker

I know how painful it is to get data off of those backups. I worked with that team. Yeah, it is rough. Actually, but just made that harder in some ways too. But in this case, yeah, so it ended up issuing a delete request for every single channel membership row associated with every single one of those 30,000 users, because that's what you do when you clean up a user. You also delete all their channel membership.

00:59:53

Speaker

And of course, this happened on the day of the company picnic. So, like, very few of us were at our machines. It also happened to be my husband's birthday.

01:00:07

Speaker

And I was the channel membership expert. And so it was on me to recover, like, create a script that ran from the backups and recovered all those channel memberships.

01:00:21

Speaker

But a week later, I turned around and, because I had unified all of the queries that hit the channel membership table in one file, I was able to add a delete column

01:00:36

Speaker

and find all the requisite queries that were deletes and instead turn them into updates of the date-deleted column. We never had to recover channel membership from a backup in that way ever again.
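The switch from hard deletes to a timestamped delete column can be sketched like this. It is a Python and SQLite illustration under assumptions: the `date_deleted` column name is inferred from the conversation, a value of 0 is taken to mean "not deleted", and the restore helper is invented to show why recovery no longer needs a backup.

```python
import sqlite3
import time


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE channels_members ("
        " user_id INTEGER, channel_id TEXT,"
        " date_deleted INTEGER NOT NULL DEFAULT 0)"  # 0 means "not deleted"
    )
    db.executemany(
        "INSERT INTO channels_members (user_id, channel_id) VALUES (?, ?)",
        [(1, "C1"), (1, "C2"), (2, "C1")],
    )
    return db


def remove_user_memberships(db, user_id, now=None):
    """Soft delete: stamp the rows instead of issuing DELETE."""
    ts = int(now if now is not None else time.time())
    db.execute(
        "UPDATE channels_members SET date_deleted = ?"
        " WHERE user_id = ? AND date_deleted = 0",
        (ts, user_id),
    )


def restore_user_memberships(db, user_id, deleted_at):
    """Undo one deletion event without touching backups."""
    db.execute(
        "UPDATE channels_members SET date_deleted = 0"
        " WHERE user_id = ? AND date_deleted = ?",
        (user_id, deleted_at),
    )


def active_channels(db, user_id):
    rows = db.execute(
        "SELECT channel_id FROM channels_members"
        " WHERE user_id = ? AND date_deleted = 0 ORDER BY channel_id",
        (user_id,),
    )
    return [r[0] for r in rows]
```

The catch is that every read must filter on the column, which is much easier to guarantee when all the queries already live in one file.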

01:00:50

Speaker

That's amazing. I love it. It was great. It was so satisfying. Yeah, that is deeply satisfying. I love that. All right. That was fabulous.

01:01:01

Speaker

That's how it is. All right. We're getting close to the end, so two final questions. What advice would you have for people who are embarking on some of these bigger refactors, bigger migrations?

01:01:18

Speaker

Hmm. Read my book? No. Absolutely.

01:01:25

Speaker

Um... recent habit years because I was reminding myself about some of the Vitess stuff, which we actually didn't touch on at all. But the more complete story from today is in, I think, chapter 11 or something. There's, like, two case studies in this, and one of them is this piece. Fantastic. For those people that are listening, that's Refactoring at Scale, and we will drop a link in the description. Oh, thank you.

01:01:52

Speaker

That's very nice. Of course.

01:01:57

Speaker

I think two biggest things. um Over communication, like more communication. Number one is more communication than you think you actually need.

01:02:11

Speaker

That sounds really annoying, especially for, like, the types of engineers who like doing this type of work, which is sort of, like, background-y work. We're not talking about, like, a flashy new front-end button that everyone's so excited about, because... do people tweet about product releases anymore?

01:02:30

Speaker

Sometimes. What's the verb for that anyway? Anyway, whatever. I don't know. We're in the weirdest timeline, I have decided.

01:02:42

Speaker

Yeah, so I think overcommunication is really, really key because you're going to end up touching something that

01:02:54

Speaker

someone else didn't expect you to touch. Inevitably. Like, even if it's what you think is the smallest migration possible, cast a way bigger net of the people that you are telling about it.

01:03:08

Speaker

Go evangelize the thing that you are doing as much as seemingly possible. Because inevitably, something's going to happen.

01:03:19

Speaker

Like, a bug will arise. For sure. That's guaranteed. Maybe that's number two: know today, before you even start, that there will be at least one bug.

01:03:31

Speaker

Yeah. Like, you won't get them all. This is not Pokémon. You won't get them all. You will miss something. So get the anxiety out of your body now: you're going to screw up.

01:03:46

Speaker

You will. You will screw up. There will be a screw-up at some point, maybe even two or three or four. Ideally, they're not big, but you have fewer chances of making a major screw-up if you do number one, which is overcommunicate as much as humanly possible. And obviously that differs,

01:04:08

Speaker

depending on what kind of company you work at, how you do project updates, et cetera, et cetera, whether you do it in Slack, whether you do it somewhere else, I don't know, whatever. But figure it out and work with it. Maybe overcommunicating projects like this is not your manager's expertise. This is a very different kind of project to overcommunicate than a big feature launch.

01:04:34

Speaker

Those traditionally have PMs behind them. And, I mean, I know you PMed an infra team, but they have a lot more

01:04:44

Speaker

processes usually already in place. More eyes are on those feature releases to begin with. The migration stuff is the stuff that no one wants to hear about and is just happening in the background.

01:05:02

Speaker

That's kind of the worst. Yeah. You want everyone to know about this migration that you are embarking on, and you want to be annoying about it. If you talk to someone and they don't know that you are migrating from Python whatever to Python whatever else, you want them to know, so that when you're entering their DMs, they're like, "Oh, this is the one, the person who's migrating Python to Python whatever." You want them to have that reaction because you have overcommunicated that much.

01:05:33

Speaker

Then you get less burned on the other side. And then you can make mistakes, and people will know. You won't be in an incident for three hours being like, "What is going on?" Within maybe half an hour, you'll be like, "Oh, this is probably related to that migration that Alice has told me about 15,000 times." It's like, yes.

01:05:55

Speaker

Yes. Yes, it is. Right. You'll know much sooner, because those are the kinds of bugs that usually take way longer to solve than the feature bugs.

01:06:06

Speaker

And yeah. And wrapping up on the thread that I was starting earlier, your manager might not be the best person to understand the communication plan behind a migration

01:06:20

Speaker

like this, unless they have that experience. And so I would find a person who is really good at doing that kind of communication. And if that person does not exist, it's your job: learn to be that person.

01:06:35

Speaker

There's a whole section in here about how to be annoying about your... I hope that's the chapter name. It's not. I really should have named it that. It's like, in version... Yeah.

01:06:49

Speaker

Version, yeah, in the second edition.

01:06:54

Speaker

But there's a whole chapter, chapter seven, on communication, so you can learn to be that person who's annoying about the migration. I love it. I mean, that tracks. I think that makes a ton of sense. It is so often undercommunicated when you're doing something that feels technical. There is a machine that produces the noise for feature launches, and there is not that same machine. And if you're an engineer, you've got to figure out how to do it.

01:07:26

Speaker

Yeah. And a lot of the time it's about being creative in how you can land. I hate saying this, but there's no better word for it right now in my head: shift left. Just shift all the way.

01:07:40

Speaker

You can't get further left. So if you have the ability and the time and the energy or whatever it may be... Tern does this. I know we talked about this, I believe at some point last summer,

01:07:51

Speaker

I promised myself we're not going to do a product pitch in this. But yeah, absolutely. This is hyper-aligned with the way we're thinking about what Tern will do. Do it in VS Code. Like, if the line you are editing as an engineer is undergoing some kind of related migration, is, ah, migratory... We got this. I'm going to blame ESL.

01:08:17

Speaker

Migration. The file, whatever it is, just be where the people are. And then they can blame VS Code for being annoying about warning folks about your migration, rather than you.

01:08:35

Speaker

Absolutely. If we can all hate the tool instead of hating the problem or each other, then that's great. I mean, it's a problem for the vendor, but that's a vendor problem, and who cares?

01:08:49

Speaker

Then we've solved all problems. Hate the tool, not the people. Exactly. All right. And last question. Where can people find you on the internet if they have more questions? Oh, this is going to be dangerous, what I'm going to say, but we'll see.

01:09:05

Speaker

I mostly respond to LinkedIn messages every so often. So I have a LinkedIn profile. It's my name. But also maudethecodetoad.com is my website.

01:09:18

Speaker

Awesome. Because it rhymes. We'll put links to both of those in the description. All right. Well, thank you so much. Thank you, TR. This was super fun. This was.