Slack’s Code Migration Uncovered a Terrifying Truth

Tern Stories
43 plays · 8 months ago

Code migration isn’t just a technical task, it’s a minefield.

In this episode, we dive into a real-world code migration story from Slack that uncovered terrifying truths lurking deep in the system.

From multi-user duplication chaos to security incidents hiding in plain sight, this is code migration at its most high-stakes.

You’ll hear how Slack’s “Storm Chasers” team tackled six massive code migration prongs, rewrote critical data models, and avoided taking down the entire platform, all while operating at enterprise scale.

If you’ve ever touched a legacy system, dealt with user provisioning, or tried to clean up a years-old data mess, this is your masterclass in what code migration really looks like.

Whether you’re a staff engineer, EM, or just trying to survive your company’s next code migration, don’t miss this.

Get Tern Stories in your inbox: https://tern.sh/youtube

Connect with Sarah: linkedin.com/sarah-mann-14758844

Transcript

Security Incidents and User Management Challenges at Slack

00:00:00
Speaker
If a guest is getting treated like a full member, it's a security incident. If I was in the org and the product and engineering workspace and also customer success, I would have three different users under the hood, all representing me, one on each of the different workspaces.
00:00:15
Speaker
Even though we have a single user object for you, your association with the workspace is where the guest status was stored. But instead of making one big change after a year of work, we're making a lot of little changes every day of every week. And each of those might cause an incident, but they're going to be sev threes.
00:00:37
Speaker
Today on Tern Stories, we have a very special guest: Sarah Mann. Sarah is a software engineer at Slack, but she came from a background where she earned a PhD in applied mathematics and had a whole career as a data scientist before getting into software engineering full-time at Remedio, which was eventually

Introduction to Sarah Mann and Her Team's Role

00:01:01
Speaker
acquired by Slack.
00:01:01
Speaker
So we're going to talk today about the intersection of those things, all of the amazing migrations she drove at Slack, and what that means for the future of how we actually fix these problems. So Sarah, welcome. I'm so glad to have you here.
00:01:17
Speaker
Hi, happy to be here. All right. So let's get right into it, because I want to understand your unique view on what reliability and quality mean within software engineering. So tell me a little bit about your role at Slack, the team, and how you approached the problems you were solving.
00:01:40
Speaker
So for my last three years at Slack, I was on a team we called Storm Chasers, which had this sort of interesting mission of looking at trends in incidents that Slack was having, trying to identify patterns, and then seeing if we could solve the root problems under those.
00:02:05
Speaker
So some of the incidents Slack was having were very clearly owned by a particular team. There were people working on them, or they were one-offs. But then there'd be these pockets of incidents where it's like, we keep having problems with this system, but no one team owns it.

Challenges with User Data Across Workspaces

00:02:20
Speaker
These were often very cross-functional issues. And Storm Chasers would step in and spend anywhere from a week-long sprint to, on one of these, almost two years
00:02:37
Speaker
trying to fix it, usually with multiple projects going on at the same time. So two of the bigger projects I worked on were around users. You have users in most companies.
00:02:52
Speaker
Slack has a couple. You need a representation of users in data and in code that every other product piece then relies on. And usually no one team owns users.
00:03:04
Speaker
And so when there's a problem with the fundamental data model, or provisioning users, or whatever, nobody quite has the agency to fix it. And so Storm Chasers worked on users quite a bit.
00:03:16
Speaker
And authentication was the other one; that's actually what we spent the most time on. Yeah.
00:03:23
Speaker
Right on. So, okay, let's talk about... actually, which one of those do you want to talk about first? Because I'm curious to hear both of those stories.
00:03:34
Speaker
We might start with users, just because it was our first Storm Chasers project, and it was actually the impetus for founding the team. Cool. Oh, interesting. So it's actually the foundational story of the team.
00:03:46
Speaker
Okay. So you mentioned that this was all driven by incidents and problems. What happened? Yeah.
00:03:54
Speaker
So with users, one of the big stories was issues with provisioning users and with keeping track of user state correctly. And to get into the technical why on that, we have to go back and talk a little bit about users at Slack and the enterprise product.

Development of Slack's Enterprise Product

00:04:12
Speaker
Okay. So you and I were both at Slack, so we know all about this; this is for listeners. Okay. So way back at the beginning of Slack, you had a team at Slack, and channels in that team, and users on that team who were in channels.
00:04:29
Speaker
And then Slack had explosive growth, got lots of users, and some of those teams got really big. And Slack had been engineered around...
00:04:41
Speaker
The team is a fundamental unit, and we put all of the data for a team on a single shard in the data stores. And if a team got too big, we didn't have the space in our data stores or the CPU to run a single team.
00:05:00
Speaker
So to be able to expand and support larger and larger teams, they came up with the Enterprise Grid product. And what they did is sort of say: larger companies usually have some subcomponents to them.
00:05:17
Speaker
What if we made each of those sub-companies, pillars, whatever, have their own team, and stitched them together into a grid, where they're in one organization with a lot of teams together?
00:05:30
Speaker
And so the way that looked at Slack was: the product and engineering org had their own team, sales had their own team, customer support had their own team, and they were all stitched together under the Slack org. Or maybe a company that had acquired a bunch of different companies, a conglomerate: each of the individual companies would have their own team.
00:05:51
Speaker
And so it was a way of making these bigger organizations work within the Slack architecture, where each team had a finite amount of compute and storage resource attached to it.
00:06:04
Speaker
And this worked, and it was super successful. But the way that we did it is we very literally stitched teams together under

Temporary Solutions and Technical Debt

00:06:12
Speaker
the hood. And we started calling them workspaces at this point.
00:06:15
Speaker
But it meant that you actually had a different user on each of the workspaces inside of an organization. So within Slack, if I was in the org and the product and engineering workspace and also customer success, I would have three different users under the hood, all representing me, one on each of the different workspaces.
00:06:41
Speaker
And so now you have a problem of trying to keep all of those users relatively in sync.
00:06:47
Speaker
So, a little bit of complexity, which almost always precedes reliability woes.
00:06:54
Speaker
So this work allowed Slack to scale for a while, but then we spent a lot of time basically paying down this tech debt. Yeah, absolutely. One of my favorite founding tidbits I heard about Slack is that it was designed, both conceptually and in the deployed software, for teams of no more than, what, like 50 people?
00:07:15
Speaker
Yeah. And then it's fascinating to hear how the enterprise product basically said: well, we can't unwind that assumption in order to get this product to market. Let's just take the building blocks we have and create a product out of them, which ends up being successful. Which is really cool.
00:07:34
Speaker
Yeah. Let's just duct-tape them together. And it worked. Yeah. But behind the scenes, it was very hairy. And so Slack's just been unwinding that for years now.
00:07:49
Speaker
And so one of the things that happened before I worked on users is we did a project called Single Enterprise Users, where we said, okay, instead of having n plus one user objects representing a single person (n is the number of workspaces you're in, plus one more for the organization),
00:08:09
Speaker
So if you're in 100 workspaces, there are 101 users representing you. Let's take that down to one. Okay. And so they did a lot of work to make that happen.
00:08:21
Speaker
And what we ended up with was a users table that has a single row per user in an organization, and then users teams rows, which map: this user has some relationship to this workspace inside the organization.
00:08:39
Speaker
And so anything specific to that workspace or team is in the users teams table. Anything that's global is in the users table.
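The split described here, global attributes on a users table and workspace-scoped attributes on a users teams table, can be sketched as a minimal schema. Table and column names below are illustrative assumptions, not Slack's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per person per organization: globally-scoped attributes.
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    org_id    INTEGER NOT NULL,
    real_name TEXT
);

-- One row per (user, workspace) relationship: workspace-scoped attributes.
CREATE TABLE users_teams (
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    team_id  INTEGER NOT NULL,
    is_guest INTEGER NOT NULL DEFAULT 0,  -- stored here early on, per the episode
    PRIMARY KEY (user_id, team_id)
);
""")

# One person in three workspaces: one users row plus three users_teams rows.
conn.execute("INSERT INTO users VALUES (1, 100, 'Sarah')")
for team in (101, 102, 103):
    conn.execute("INSERT INTO users_teams (user_id, team_id) VALUES (1, ?)", (team,))

rows = conn.execute("SELECT COUNT(*) FROM users_teams WHERE user_id = 1").fetchone()[0]
print(rows)  # 3
```

The key point of the design is that a workspace-scoped attribute exists once per membership, so anything that should really be global ends up duplicated across every membership row.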
00:08:50
Speaker
Wonderful. Yeah, it sounds straightforward enough, except that when they were setting this up, there were some unknowns. And one of those was your guest status. So there are three different types of users in Slack. There are full members, which is what most people are.
00:09:11
Speaker
Multi-channel guests. So this would often be, if you're an organization that hires contractors, you might pull the contractor in as a multi-channel guest: they have a couple of channels they have access to, but they do not generally have access to join any public channel.
00:09:25
Speaker
And then there are single-channel guests. So maybe you're working with an outside vendor, and there's a single channel they have access to, to work on a particular project, but that's it.
00:09:38
Speaker
And that guest status is super important, because if you get it wrong, and somebody is supposed to be a single-channel guest but you treat them like a full member, they've suddenly got access to a bunch of public channels they were not supposed to have access to.
00:09:51
Speaker
Right. And it wasn't super clear early on whether somebody could be a full member in one workspace within a grid and a guest in another workspace, or whether that should be a global property.
00:10:07
Speaker
And so that ended up getting stored on the users teams table, so that we had the flexibility to say your guest status can be different on every workspace. Yeah. The product evolved, it lived in the real world, and it became clear that that was not a tenable idea.
00:10:23
Speaker
We actually wanted the guest status to be locked in and consistent across all the workspaces. But we'd stored it in users teams, so it was written in n plus one places.
00:10:37
Speaker
So even though we have a single user object for you, your association with the workspace is where the guest status

Strategies for Data Consistency and Security

00:10:44
Speaker
was stored. Interesting. Yes. So one of the trends of incidents we were seeing was things like that, where we were storing the information on users teams, but it was supposed to be consistent for the user across all their teams.
00:10:58
Speaker
And when that information got changed, were we updating it in all of those locations correctly? And the answer was like, usually, but not always. Oh, no. Yeah.
00:11:12
Speaker
In the extreme case of our power users at our biggest customers, a user could be in hundreds of workspaces. So when you make a change to any of these bits of information, it wasn't happening synchronously with the API request. We were offloading it to the job queue so we could go update hundreds of rows.
00:11:35
Speaker
But when you offload something to the job queue and do it asynchronously, you can have failures. And if you're holding the same information in a bunch of different places, and you have spaghetti code that can update it from lots of different places, you're going to make mistakes.
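The failure mode described here, a denormalized write fanned out through an async job queue, can be shown with a toy sketch. Nothing below is Slack's actual code; the dictionary stands in for the n plus one database rows, and the function simulates a queue whose jobs can be dropped:

```python
# is_guest per (user, workspace) row: the same fact, stored n+1 places.
guest_status = {
    ("u1", "org"): True, ("u1", "team_a"): True, ("u1", "team_b"): True,
}

def enqueue_updates(user, new_value, fail_on=()):
    """Pretend job queue: each row update is a separate job that can fail."""
    failures = []
    for key in [k for k in guest_status if k[0] == user]:
        if key[1] in fail_on:
            failures.append(key)        # job lost: this row is now stale
        else:
            guest_status[key] = new_value
    return failures

# Promote u1 from guest to full member, but the team_b job is dropped.
enqueue_updates("u1", False, fail_on={"team_b"})

# The rows now disagree: a silent consistency violation.
distinct = {v for (u, _), v in guest_status.items() if u == "u1"}
print(len(distinct))  # 2
```

With one lost job, the user is a full member on some rows and a guest on others, which is exactly the kind of state that surfaces later as an incident.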
00:11:51
Speaker
And so we were making mistakes, and they were costly, because, again, if a guest is getting treated like a full member, it's a security incident. Yeah, those are bad incidents. I imagine you can't talk super specifically about it, but were there any memorable, specific incidents?
00:12:10
Speaker
Any particular presentations of this during incidents? You know, I'm not remembering anything super strongly, just that we had multiple that were this flavor. Yeah.
00:12:22
Speaker
The other complicating piece on this: not only were we storing this in the users teams table, so in lots of different places, but when we were provisioning users, this might get stored on the invite for the user.
00:12:36
Speaker
So I invite you to join my workspace. We have, you know, a data row for that invite. And that invite says whether you're supposed to be a guest or a full member.
00:12:48
Speaker
You can revoke that invite. You can issue a new invite without revoking it. The new invite can have a different status on it. Sure. Yeah. Okay. I'm following.
00:13:02
Speaker
And at the time we started this project, when we sent the invite, we may or may not have also created a user at that time.
00:13:14
Speaker
So we might have just created an invite and said, there is this invited person. They don't exist in the users table yet, but this is how they should be created when they are created. Or we might have actually created the user, and have a users row and an invite row for the same person.
00:13:32
Speaker
Those in theory have the same information, but if that invite is revoked and reissued, all that information is supposed to change and stay in sync. And which information do you trust? Do you trust the users row or do you trust the invite row?
00:13:47
Speaker
I'm not going to answer that question.
00:13:51
Speaker
Surely it will be wrong. Interesting. Okay.
00:13:56
Speaker
You know, when you first started describing this, I was thinking, okay, fine, that makes sense: the way you're accessing it, you can just keep the data in sync. It's a little twitchy, but that's fine. But even three or four examples deep, there's obviously a ton of ways that you can just miss.
00:14:12
Speaker
And even if you don't miss, the job queue fails to process a couple of jobs, and all of a sudden you have stale data, and now you have database problems.
00:14:23
Speaker
Okay, so not a great state. I am convinced. So it's very complex, because even just thinking about it, you're like, what is the source of truth? And there's lots of ways for it to go wrong, and lots of complexity in the code, which is not a place you want to live in.
00:14:41
Speaker
And so this is a project that we ended up having six prongs of attack on, and I cannot list them all anymore, but I'll tell you about at least two of them. I was going to say, give me your top two.
00:14:54
Speaker
I wrote a great deck on this at Slack, and it lives within Slack's information system. Totally valid. It was long ago, but I don't remember all six of them anymore. So how do you go after this?
00:15:07
Speaker
We had a lot of meetings where we went: it works that way? Why? And we had a document for the project; our tech lead was Izzy.
00:15:23
Speaker
And it had a section called Izzy's List of Horrors, where we would just...
00:15:30
Speaker
As we found new things where we were like, ooh, that's going to burn us, we would add them to it. So it was a little bit ad hoc; it was a lot of exploration stage. But when we got more organized about this, the first big thing that we did was we renormalized the tables.
00:15:46
Speaker
So we said: we now know more about how this product works. We know that the first decision we made has turned out not to be serving us. We will go change that decision.
00:15:57
Speaker
And so we moved the flag for the account status and a couple of other flags from users teams to users. So now it is stored in one place for the user, not n plus one.
00:16:11
Speaker
And that means that when you change the status, it is synchronous. That makes a ton of sense. But you're doing this on hundreds of millions of users. So it's a whole process to rewrite the data, read from the new data, and do it correctly.
00:16:30
Speaker
How did you... okay, getting to the state where everything makes sense, yes, you should do that. But you were in a state where the data didn't make sense. How did you make the decision when you ran into something that would have made Izzy's list of horrors?
00:16:45
Speaker
Yeah. How did we decide what was on the list, or how to address it? How did you address it? How did you rationalize state that is fundamentally incompatible with itself in the database?
00:16:58
Speaker
Yeah, good question. I think in that one, we probably decided that the org-level user had to be the correct one, and we wrote that one over.
00:17:14
Speaker
Although we might have in some cases taken the least privileged. Yeah, that makes sense. That makes a lot of sense, if you've got something limited. Yeah. And fortunately, it wasn't like the database was rife with these, because every one of these that was inconsistent ends up being an incident.
00:17:32
Speaker
So this is not a case where we had hundreds or thousands or millions of problems in the database. This is a case where even a handful could be an incident. Yeah, it's the needle-in-a-haystack kind of problem. Yeah, the data is mostly right.
00:17:48
Speaker
Yeah, so it's a number where you can inspect them and decide what they should be. Yeah, that's really interesting, because it means the project is done when you are confident it is done, not when it appears to be done, because it probably appears to be done now.
00:18:02
Speaker
Like it's two nines, three nines, four nines correct at this moment. Yeah, but this is one where any error is not acceptable. Yeah, exactly. You're not trying to drive an error rate close to zero. It really needs to be zero.
00:18:16
Speaker
Yeah. And so this is actually a theme of a lot of our work: writing down what the data contracts are supposed to be. So this is the case we've talked about: do we have the same guest status on every row?
00:18:31
Speaker
And then we make a dashboard about that. A lot of companies do this, and Slack does: we snapshot the databases every day into our data warehouse. And then we have all of our analytics tools available to us.
00:18:43
Speaker
And so we can run dashboards in the analytics environment that look at our snapshot of our production data. And we can say, tell me which ones are out of sync.
00:18:55
Speaker
And yes, you can go to the MySQL databases in production directly to do this, but then you're querying production databases at a large scale and potentially affecting resource utilization in production. So you don't want to be doing that very regularly. Yeah, a little scary.
00:19:12
Speaker
But having that data makes a lot of sense. I sort of want to pick at that, actually. So you created these dashboards. You know, I want to come back to the other prongs of this project, but you're creating these dashboards. I have been handed a lot of dashboards in my career, and mostly my reaction to them is like, cool, thanks, and I never look at them again. How did you actually make progress based on these dashboards?
00:19:39
Speaker
Yeah, I've written a lot of dashboards in my life, and they have a shelf life. And so while we were in the middle of the project, we were looking at these daily, and eventually we finished the project and didn't look at them for a year.
00:19:53
Speaker
But in that time, we were looking at them daily. So you start out... the most important part is this:
00:20:04
Speaker
The engineers who write the system, they have some idea in their heads of how the system works. And they like kind of bake those assumptions into the code, into the data model, but they often don't write it down very explicitly.
00:20:15
Speaker
So now you're a couple years later and you're looking at the code and nobody remembers what the assumptions are, but everyone's sort of making some assumptions about it.

Tools and Processes for Data Consistency

00:20:21
Speaker
So one of the first things you do is actually try and write those down very concretely: guest status should always be the same.
00:20:28
Speaker
And then you can put that into a dashboard. And we would usually do these over time, because we have daily snapshots of the database. So we draw a graph over time of how many users in the database have an inconsistent guest status.
00:20:42
Speaker
And so maybe we're starting off with, you know, a handful, or 100. We do our backfill scripts to get everyone synced up. Then, we at Slack had a system where you could run a query in the analytics environment on a schedule, and if it returned any rows, it would fire off an alert into a prescribed Slack channel.
00:21:08
Speaker
So we basically set up alerting against our analytics queries. And we'd say, we've in theory fixed all of these issues. If there's a new one, tell me. That's cool.
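This pattern, a data contract written down as a query that should return zero rows, with an alert when it doesn't, can be sketched like this. The schema and the alerting function are illustrative stand-ins, not Slack's tooling:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_teams (user_id INTEGER, team_id INTEGER, is_guest INTEGER);
INSERT INTO users_teams VALUES
    (1, 101, 1), (1, 102, 1),   -- consistent: fine
    (2, 101, 0), (2, 102, 1);   -- inconsistent: violates the contract
""")

# The data contract as a query: a user's guest status must be identical on
# every one of their rows. Any row this returns is a violation.
CONTRACT = """
SELECT user_id, COUNT(DISTINCT is_guest) AS variants
FROM users_teams
GROUP BY user_id
HAVING variants > 1
"""

def check_contract(conn):
    violations = conn.execute(CONTRACT).fetchall()
    if violations:  # in production this would post into the alert channel
        print(f"ALERT: {len(violations)} user(s) with inconsistent guest status")
    return violations

print(check_contract(conn))  # [(2, 2)]
```

Run on a schedule against the daily warehouse snapshot, the query is both the documentation of the assumption and the tripwire that tells you, in near real time, when the assumption breaks again.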
00:21:19
Speaker
Oh, I love that approach. So you're writing down the understanding explicitly: you enshrine it in a query, you go fix the problem, and you can see it get fixed as you actually make it through the work.
00:21:34
Speaker
And then you enshrine it, quit using that dashboard, and move on to the next thing, but set up an alert. So you've locked in the progress, and you know that you're not going to have to go back and fix that in the future.
00:21:49
Speaker
Yeah, because there's usually two phases. There's: we've got historical data problems and we need to fix those. And then there's: well, how did we get those historical data problems? Maybe there was an incident a year and a half ago where the databases went offline for an hour, and we didn't recover from it correctly and wrote some bad data some places.
00:22:10
Speaker
OK, we need to recover that data, but maybe we have ongoing leaks, but you can't tell what they are because there's so much trash in the databases. So first you clean up the databases and then you look for new occurrences of the problem.
00:22:21
Speaker
And by having that alerting, you know when it happens, and you know in real time. So you can go look at your logs while they're fresh. And then you can go do the debugging and say, what caused this? And is this going to be an ongoing thing? So you do a lot of follow-up work on closing the holes, plugging them. That's cool.
00:22:40
Speaker
Yeah. What caused this? What caused this today? What caused this at 2 p.m. when I got this alert? That's so much more tractable than: what caused this sometime between the dawn of time and today? Yeah.
00:22:54
Speaker
And now, in this example, that's not exactly how it went down, because we actually just rewrote the data so that it was only in one place. But you can imagine this in other scenarios where you do have relationships between data in various tables that are supposed to be consistent in some way.
00:23:11
Speaker
Yeah.
00:23:14
Speaker
You know, if a user is deactivated, so they've been removed from an organization, maybe they need to be removed from all of the channels they were in, or marked inactive in all of those channels. Yeah.
00:23:27
Speaker
I can't remember off the top of my head if that's exactly how we do it, but that sounds like a reasonable assumption. That's probably not written down somewhere, but you could write it down, and you could write an analytics query against it and make sure that you have the data consistency you expect.
00:23:41
Speaker
Interesting. Yeah, that makes a lot of sense. Yeah, that is a strategy that we used over and over again on the various projects that we did. So one of the first big prongs of the user project was: normalize the data.
00:23:57
Speaker
Another one was this situation with the invites. We were like, okay, you have the invites, you have the users rows, who knows what's correct? We are going to make a declaration of how we think it should work that does make sense to people.
00:24:13
Speaker
And that is: when an invite is sent, we always create a user, and the users row is the source of truth. So as things change, maybe that invite is accepted and the user joins, and then they get removed, and then they get a new invite, it's still just the users row.
00:24:34
Speaker
It's not the new invite that tells you; the users row should be correct. So when we get that new invite, we update the users row. And the information on the invite itself is there for auditing purposes, but it is not the source of truth of the current state.

Balancing Projects and Customer Dependencies

00:24:47
Speaker
The big change here is that we were sometimes creating users when the invite was sent, and sometimes we weren't. And we were like, well, that is just too complicated for anyone to understand.
00:24:59
Speaker
And we said, we are always pre-provisioning users. There needs to be a users row once the invite is sent. Whether they show up in the UI is a different question; we have flags for that. But under the hood, they exist in our users table.
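The declared invariant, an invite always pre-provisions a hidden users row and the users row stays the source of truth while the invite is kept only as an audit record, could be sketched like this. All names here are hypothetical, a toy model rather than Slack's code:

```python
# users row is the source of truth; invites is an append-only audit log.
users = {}      # user_id -> {"is_guest": ..., "visible": ...}
invites = []    # every invite ever sent, for auditing only

def send_invite(user_id, is_guest):
    invites.append({"user_id": user_id, "is_guest": is_guest})
    # Always pre-provision: the users row exists as soon as the invite does,
    # hidden from the UI until the invite is accepted.
    users[user_id] = {"is_guest": is_guest, "visible": False}

def reissue_invite(user_id, is_guest):
    invites.append({"user_id": user_id, "is_guest": is_guest})
    # The users row is updated in step, so it never falls out of truth.
    users[user_id]["is_guest"] = is_guest

send_invite("u1", is_guest=True)
reissue_invite("u1", is_guest=False)  # status changed on reissue
print(users["u1"]["is_guest"], len(invites))  # False 2
```

The point of the rule is that "which row do you trust?" stops being a question: reads always go to the users row, and the two invite records remain only as history.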
00:25:14
Speaker
Yeah. Okay. Getting that done took a year. A year? Yeah. Okay, what was that year? So a lot of that was relatively easy, but it actually turned out that we had some team-level settings that were controlling whether the user was created at the time of invite. And we had to reach out to individual customers and say, we're changing the functionality of this setting.
00:25:52
Speaker
We need your go-ahead to do it. And for the last customer to do that, it just took us months to get them on the phone, make them understand what we were doing, and explain to them that it actually wasn't going to change their product experience, but we were changing, esoterically, this setting that they had opted into.
00:26:15
Speaker
And it meant that we couldn't clean up a bunch of dead code for a really long time, because we had to support this for way longer than we wanted to. Interesting. But with the benefit of hindsight, would you have done anything differently?
00:26:27
Speaker
I mean, we still made the right choice. I'm really happy that, at the end of the day, we were able to get to the state we're in now, because new engineers at Slack don't ever have to know about some of these settings or any of this complexity anymore.
00:26:41
Speaker
It just isn't an issue, but it was painful to get there. Yeah. And that wasn't an engineering-time issue. That was just communication time.
00:26:53
Speaker
We did other projects in the meantime and came back. You know, six months after we'd finished the main bulk of that project, we came back and did a sweep of code cleanup, because we were finally able to delete some really complicated logic.
00:27:06
Speaker
Yeah. I mean, I guess you have like a biweekly reminder that's like: go bug them. Go see if the customer responded to our emails yet.
00:27:17
Speaker
Interesting. So, I mean, this is all motivated by trying to clean up the data so that it didn't cause incidents, and yet customers were blocking the forward progress of this problem.
00:27:35
Speaker
How did you manage the urgency between "this is causing real incidents and hiding real complexity that creates landmines for us" and "we're gated on having a customer understand how this feature is implemented"? Yeah.
00:27:51
Speaker
You know, in this project, we had six prongs. And any time we were blocked on one, we would just go work on another one. And so we had some stack ranking of priority, but we had enough different things that needed to happen that when we got blocked in one place, we were just like, we'll just go fix this instead for a while.
00:28:11
Speaker
We were also, in this case, able to work around it. Getting that one setting changed blocked us from being able to delete dead code, but it didn't block us from doing all of the refactoring of the existing code, right? We could make deprecated functions that held the old logic, but still create the new functions with the simplified logic, have all the new code paths in place, and migrate 99.9% of our customers onto using them.
00:28:41
Speaker
And in our dashboards, we just wrote in: if the team ID is this one team who won't get back to us, ignore their data.
00:28:54
Speaker
Don't tell us about it. "Don't end up with your team ID written into the Slack codebase" is a goal of mine. Yeah.
00:29:04
Speaker
In this case, it was an analytics query, so not actually committed code. But yeah, it's still: don't be a specific customer on a dashboard. So it was annoying, because there was still this legacy code that people occasionally had to touch, because it was still a live code path.
00:29:24
Speaker
But we were able to do most of the work up front and then really just delete things later. Yeah. Interesting. And if you're trying to chip away at some giant goal like "make all of this consistent," you've got six different prongs going after it.
00:29:44
Speaker
You can work on any of it and you'll still make forward progress. That makes a ton of sense. Yeah. One of the tricky things about all of our projects is actually figuring out when we were done, because there's infinite work you can do around making users easy to work with in the Slack codebase.

Codebase Improvements and Project Management

00:30:03
Speaker
They're just such a core concept. They appear everywhere. When have you made the code good enough? So, when had we made the code good enough? We picked out our six prongs and we finished them.
00:30:16
Speaker
And in hindsight, we did that. And, you know, the thing with incidents is it's not like we're having 15 incidents a day. We're having a couple, and only a small portion of them were users-related. It was enough that it really mattered to get it done.
00:30:30
Speaker
But you don't have a large enough volume of data that you can, like, look at an incident trend and be like, here's the cliff where we fixed everything. You have to wait a year. And then you can look back and be like, oh, no, really, we aren't dealing with those types of incidents anymore.
00:30:43
Speaker
So it became clear once we got, you know, six months to a year out of that project. We could actually pat ourselves on the back and be like, you know, we haven't gotten called into an incident around, like, is the user data correct?
00:30:54
Speaker
Because we did fix it. um But we had these very intense conversations of, like, what is above our cut line on making users dependable?
00:31:08
Speaker
I would love to know more about that, because there's this moment, I think, especially in, like, you know, 2025, we're all fully on board with everything AI at this point.
00:31:19
Speaker
um And AI is going to write all of our code and fix all of our bugs, and it will never make these kinds of mistakes. um And that's obviously sarcasm.
00:31:30
Speaker
There is this thing that is happening, that you're doing, that is very much engineering, but it's happening in a room, arguing with people about: no, if we do these six things, not these eight things, not these four things, these six things, we will get incidents down to the number we want.
00:31:50
Speaker
and Tell me a little bit about the decision process of like how do you how do you get the confidence to make a decision when you know that the data is going to take six months to roll up?
00:32:03
Speaker
um You know, in in this case, we had a fabulous tech lead running um this project who had been at Slack for a long time, had a really good like finger on the pulse and like intuition about what matters and like historical context about why various decisions were made in the past and what was feasible.
00:32:23
Speaker
um In hindsight, I can say that we had a project that had six prongs, but...
00:32:32
Speaker
We never had a roadmap that said that. um That was like at the end of the project. It was like, actually, we had 16 things that we did. And like all of them were um declarative statements about how things should work.
00:32:47
Speaker
So they were things like: when we send an invite, we create a user. It's a very simple statement. It's like, oh, yeah, that sounds like it should be true. And it's like, well, it wasn't true. It was not true six months ago.
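A declarative statement like "when we send an invite, we create a user" turns into a checkable query. A toy sketch of that kind of contract check, with an invented schema (sqlite standing in for the analytics store; these are not Slack's actual tables):

```python
import sqlite3

def invites_without_users(conn):
    """Return invite ids that violate the contract 'every invite has a user row'."""
    return [row[0] for row in conn.execute(
        """
        SELECT i.id FROM invites i
        LEFT JOIN users u ON u.invite_id = i.id
        WHERE u.id IS NULL
        """
    )]

# Tiny in-memory fixture so the check is runnable end to end.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invites (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, invite_id INTEGER);
    INSERT INTO invites VALUES (1, 'a@x.com'), (2, 'b@x.com');
    INSERT INTO users VALUES (10, 1);  -- invite 2 never got its user
""")
violations = invites_without_users(conn)
```

An empty result means the contract holds; anything else is a row to backfill and a code path to find.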
00:33:00
Speaker
It was extremely untrue. And now it is true, to the best of our knowledge. Yeah. um But on the day-to-day, it was always a balancing act of, like, who do you have available to work on this project?
00:33:15
Speaker
um What feels urgent at the moment? And what is blocking other work? Right. So there's like, there's things that you do along the way. We haven't talked about, um, ah typing code.
00:33:27
Speaker
ah So Slack is in Hacklang. It was originally written in PHP. It moved over to Hacklang, which is a strongly typed version of PHP, but it meant that any legacy code was essentially untyped.
00:33:42
Speaker
We quote-unquote typed the code base a few years ago when the latest version of HHVM required it. But you can have typed code in Hacklang where the type of a variable is mixed.
00:33:55
Speaker
I.e. not typed at all.
00:33:58
Speaker
It's a type, I guess. um And so, you know, invites code was a good example. We've had invites for as long as we've had Slack, so that is old legacy code.
00:34:10
Speaker
It had been hastily typed, and it was very unclear, like, which of these dictionaries has the flag for: is this user a guest?
00:34:24
Speaker
Yeah, you can have strongly typed code where you store the same or competing data in multiple places, for sure. Yeah. Or you can have, and what we had was, loosely typed code where it was very unclear how anything was getting passed around. And so, like, one of the prongs of our project was never "make all the user code strongly typed."
00:34:49
Speaker
And we did not achieve that. There were moments in it where it was like, well, I want to do X, Y, and Z, but I have no idea how this flag is getting passed around, so what I'm actually doing today is strongly typing this dictionary as a shape. um So there's some ordering of: we're trying to do six things.
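The Hacklang "shape" move described here has a rough Python analog, with TypedDict standing in for a shape (field names invented for illustration):

```python
from typing import Any, TypedDict

# Before: the flag's location and type are invisible to the type checker.
def is_guest_untyped(invite: dict[str, Any]) -> bool:
    return bool(invite.get("is_guest"))  # hope the key exists and holds a bool

# After: the "shape" makes the contract explicit, roughly like a Hacklang shape.
class InviteShape(TypedDict):
    user_id: int
    is_guest: bool

def is_guest_typed(invite: InviteShape) -> bool:
    return invite["is_guest"]  # the checker knows this key exists and is a bool
```

Behavior is identical at runtime; the payoff is that the next refactor can see how the flag gets passed around.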
00:35:07
Speaker
Some of them are interdependent. So today I'm working on the thing that is not blocked. Yeah. Yeah, absolutely. And just at the end of the project, you look up and say, like, these are the things that we did fully and we believe are important to do. And the seventh thing that's a quarter of the way done?
00:35:24
Speaker
Maybe not. Izzy's list of horrors was never complete.
00:35:30
Speaker
We never got everything off of it. That's... I'd like to be surprised at that, but I'm not. um What... As you were going through it, I'm curious, if you sort of look back on your experience doing that.
00:35:47
Speaker
What tools were the most effective for you as you went through that process? And what tools would you have liked to exist, that you couldn't find at the time?
00:35:59
Speaker
Oh yeah. Um, so as I've mentioned, we used a lot of our analytics tools just to understand what was going on, as well as Logstash and Kibana. We spent a lot of time sifting through our logging, and adding logging, to understand what was actually happening in the code.
00:36:17
Speaker
yeah Um, um,
00:36:22
Speaker
In terms of refactoring the code, we did a lot of really manual work ah because we were often making very subtle changes to functionality that weren't super amenable to automation. Yeah.
00:36:44
Speaker
I think the other project we may talk about is around authentication. And there were some things in there that were more rote, that tooling might have helped us with more. um I'm not sure. Like, better tools for moving one column from table A to table B might have helped, but it wasn't actually the bulk of our effort.
00:37:05
Speaker
The bulk of the effort was making sure that the um code was updated to read from the new table correctly. um Yeah. In all the right places. It's interesting.
00:37:17
Speaker
I think there's actually something really real there, around: the problem's not the code. Maybe the refactoring is hard and subtle, um or maybe the refactoring is straightforward, but the effect you're trying to achieve is hard and subtle and requires some local work before you get there.
00:37:36
Speaker
And the hard part is where you go into the logs, like, you add logs and you check if they're there, and you check if the behavior is there. But that really resonates, the idea that the problem is not: I'm going to do a thing, and then I'm going to reload the page, and the button will be blue, and that makes sense. The hard part is, I am trying to figure out, in production, what is happening here?
00:37:57
Speaker
And do I have a correct mental model of the world? And like that's where you spend all your time. yeah That's really interesting. And it's, I think, indicative of like a certain shape of problem that's not trivial.
00:38:10
Speaker
Yeah, you're trying to unwind 10 years worth of logic and product decisions. Like, what is, what is our, how's our product actually work today? In these, like, weird corner cases.
00:38:21
Speaker
Like, it's obvious how it works in the majority case. But, like, in this odd combination of team settings, like, how does the product actually work? um And how how should it work? Does that actually make sense?

Addressing Authentication and Security Issues

00:38:33
Speaker
um That's tricky, and I don't know what tool helps with that. Yeah. Cool. I love this. There's this piece here of... I never thought of this idea of, ah, using...
00:38:49
Speaker
um like, combining the idea of graph-driven development and really a focus on the data in production. I think that's a really interesting insight, just, like, a core motion of how you approach this kind of problem, problems with this complexity.
00:39:07
Speaker
So that's super cool. I do want to shift gears. I want to talk a little bit about the auth migration. Which is great, because it's another great example of the exact same trick: we applied it, it works, do it again in a new situation. ah Okay, so back to the beginning.
00:39:27
Speaker
Everything's incidents. What was wrong? um Any problem you have with authentication is a security incident. And so a bug in authentication that, if it were not in authentication, would just be, like, an inconvenience to the user, call it, like, a SEV 3, is immediately a SEV 1, because it's a security incident.
00:39:50
Speaker
ah Unfortunate. um And authentication is, again, one of those, like, primitives of any product. You have users, they have to have a way to authenticate themselves and log into your product. And so it's one of the first things that you write.
00:40:05
Speaker
And that also means all of the code is the oldest code in the code base. And nothing else works if authentication doesn't work. Yeah. um Scary area.
00:40:17
Speaker
Yeah. Of course, we have, like, security teams who work on some of this, but their focus is often more on discovering vulnerabilities than on owning it from a product sense.
00:40:29
Speaker
um So there was another like lack of clear ownership moment, um cross-functional moment, which is why storm chasers got involved. um This is...
00:40:42
Speaker
Another case where we made some dashboards and, like, burned down problems. um So there were actually two prongs to this project. I'll start with the second prong, because it's most related to what we've been talking about, and then go back to the first prong.
00:40:58
Speaker
um In time order. Yeah. So the the second phase of this project was actually sort of a general mandate of make authentication work correctly all the time.
00:41:13
Speaker
um Fundamentally, for authentication, um to have access to Slack, you have a session written in a sessions table, and you have, like, a token associated with that session.
00:41:27
Speaker
And again, when you have state changes with users, you want to sometimes invalidate those sessions. So maybe the user gets logged out by an admin, all of their sessions should get um invalidated. The user's kicked out of the team, all of their sessions should get invalidated.
00:41:42
Speaker
um We have push notification tokens that are actually attached to the session for the mobile session. If you are logged out on mobile, you also shouldn't get push notifications. So this is another place where there should be some data contracts. If you look at a user and they're deactivated, they should not have an active session anymore.
00:42:06
Speaker
Makes sense. That makes sense. It should be true. Run an analytics query. Is it true? At the time, it was not. It is now.
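The "run an analytics query against the contract" move can be sketched with a toy schema (names invented, sqlite standing in for the warehouse; not Slack's actual tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, deactivated INTEGER);
    CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER, invalidated INTEGER);
    INSERT INTO users VALUES (1, 0), (2, 1);              -- user 2 is deactivated
    INSERT INTO sessions VALUES (100, 1, 0), (200, 2, 0); -- ...but still has a live session
""")

# Contract: a deactivated user must have no active (non-invalidated) sessions.
bad_sessions = [row[0] for row in conn.execute("""
    SELECT s.id FROM sessions s
    JOIN users u ON u.id = s.user_id
    WHERE u.deactivated = 1 AND s.invalidated = 0
""")]
```

Every row that comes back is a session to invalidate via backfill, plus a hole in the code to plug so no new rows appear.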
00:42:17
Speaker
Which is the reason I'm talking about it. Which is why I can tell you. If it weren't true now, I wouldn't be telling you this. Fair.
00:42:27
Speaker
yeah Was it usually true? Probably. um So we started this project, and this team, Storm Chasers, didn't have any particular domain knowledge about authentication. Yeah.
00:42:41
Speaker
authentication I think I put in an extra syllable. And so one of the ways we organized our thinking about the project, and we just started this dashboard, which turned into like 15 or 20 queries where we wrote down what the contract was on various tables about authentication.
00:42:59
Speaker
And we were like, we think these things should be true. Are they actually true? And the answer would be, oh, we have hundreds of millions of rows. Right.
00:43:13
Speaker
That are incorrect. um And some of these, a lot of these are like not that big a deal. So like the example of this user has been removed from the team.
00:43:24
Speaker
Do they still have active sessions? Okay, yes. Often that was true. Does that mean that they could access Slack? Generally, no. It's like our login processes were checking that the user is actually active to allow you to log in.
00:43:39
Speaker
But the issue when you rely on your you know application level logic is it can be wrong and somebody can change it.
00:43:50
Speaker
Versus in the sessions table, we had this column that's, like, mark this session as inactive or invalid. And all of our queries to log you in, at the MySQL level, filter out those rows that are inactive.
00:44:05
Speaker
So, like, once you've marked a session as invalid, you don't ever look at it again for most product stuff, right? It shows up on auditing pages.
00:44:17
Speaker
So it's, like, much safer if you've actually correctly labeled the session, like, this session is invalid, than just relying on: well, the session should be valid, and the user should also be valid, and, like, the team should not be deleted. Yeah.
00:44:35
Speaker
If it's simply marked in the data, in a place where you can audit it, where you can know, and where everyone's actually reading from, then... And you can filter at the MySQL level. So it just never shows up in the back end if it's been marked as inactive or invalid.
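Filtering invalidated rows in the query itself, as described, looks roughly like this (toy schema and column names, invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (token TEXT PRIMARY KEY, user_id INTEGER, invalidated INTEGER);
    INSERT INTO sessions VALUES ('t1', 1, 0), ('t2', 1, 1);  -- t2 was invalidated
""")

def lookup_session(conn, token):
    """Invalid sessions are excluded in the query itself, so no later
    application-layer check can accidentally resurrect them."""
    row = conn.execute(
        "SELECT user_id FROM sessions WHERE token = ? AND invalidated = 0",
        (token,),
    ).fetchone()
    return row[0] if row else None
```

The dead session simply never reaches the application code, instead of relying on every caller to remember the check.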
00:44:52
Speaker
um So we had hundreds of millions of rows like this. Across, like, multiple different dimensions of: there's something about this that we feel is incorrect. And we feel like authentication is a place where we should have a really strong data model, and we should make our data adhere to it.
00:45:10
Speaker
And so then we started fixing the data, running backfills, and then setting up alerts, looking at our dashboards every day and plugging all the holes.
00:45:22
Speaker
So that same iterative loop. Interesting. And so, same process, just applying this lesson of, like, go write down the data contract, figure out where you have the issues, and let's plow forward with it.
00:45:39
Speaker
Yeah. Because, like I said, with most of these, we do enforce the logic at the application layer. So just because it's a little bit wrong in the database doesn't mean that we're in product doing anything untoward.
00:45:55
Speaker
But you're, like, one bug away from it. Yeah. So, yeah. You're one bug away from disaster. If your data is correct, now you're, like, a handful of bugs away from having an actual problem.
00:46:09
Speaker
We're always only a handful away. You increase the number of bugs you have to have simultaneously to have an actual problem.

Session Management Improvements

00:46:17
Speaker
um So as you're going through this, you were telling me um before the show that you made a promise to yourself.
00:46:25
Speaker
that you have not made with other projects. Tell me about that. Okay. So this is the other prong of this project. ah So, back to the enterprise data model at Slack that we were talking about earlier: when you're in an enterprise at Slack, as opposed to just a standalone team, you're probably a member of multiple workspaces.
00:46:50
Speaker
And since we just stitched these all together with duct tape and goodwill, the way it was built is that you have a session for every workspace. And that's annoying.
00:47:01
Speaker
That is hard to keep track of for all of the same reasons that having a user for every workspace was. Which, as you mentioned before, could be hundreds of sessions. Which could be hundreds of sessions. And we also had... um
00:47:15
Speaker
So we had this product logic that if you had a session for any one workspace within the organization, well, really, you should have a session for every workspace you're a member of, because we don't have a different login flow for every workspace. It's just kind of under-the-hood magic to make everything work that you happen to have a session for each workspace.
00:47:34
Speaker
And so we had some really fun logic sitting there that like if we got a token and a cookie and we looked at it and realized that there are workspaces that you're a member of, but you didn't actually have the session, we would create it on the fly and you'd never know.
00:47:53
Speaker
But that means that we kept having incidents where we're like, OK, this user was logged in when they weren't supposed to be, for a reason. An admin had removed them from the team, they thought they logged out, whatever, but they still were able to log in or were not correctly logged out. And we were like, well, where did this session come from?
00:48:10
Speaker
And it'd be this like wild logic about trying to keep maintaining your sessions for the org. Make sure you have sessions for the org. That's the kind of report you don't want to get. Yeah.
00:48:22
Speaker
We're like, this person's last day was yesterday, and yet they are not only still logged in, they are continuing to mint new sessions. From where? yeah And not from a login.
00:48:33
Speaker
And not from a logout. And again, I can tell you this because this is not how it works anymore. oh So, just as with users, where there had been a project to go down to a single enterprise user, ah we then had a project to do a single enterprise session.
00:48:46
Speaker
So you get one session that is at the org level that grants you access to every workspace you're a member of in that organization. So we don't use sessions to track that. We use sessions as the top-level access.
00:49:02
Speaker
And then we use team membership and channel membership to track what within the organization you should have access to.
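A minimal sketch of that new model, with invented names and plain dicts standing in for tables: one org-level session grants top-level access, and team membership is consulted separately:

```python
# Hypothetical stand-ins for the sessions table and team-membership table.
ORG_SESSIONS = {"token-abc": {"user_id": 7, "org_id": "E1", "invalidated": False}}
TEAM_MEMBERS = {("E1", "T-eng"): {7, 8}, ("E1", "T-sales"): {8}}

def can_access_workspace(token, team_id):
    """Session answers 'are you in the org at all?'; membership answers
    'which workspaces inside it?'. No per-workspace sessions to keep in sync."""
    sess = ORG_SESSIONS.get(token)
    if sess is None or sess["invalidated"]:
        return False
    return sess["user_id"] in TEAM_MEMBERS.get((sess["org_id"], team_id), set())
```

The point of the split is that adding or removing a workspace changes membership rows only; no sessions get minted on the fly.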
00:49:13
Speaker
So we um started thinking about this project in December and we'd been wanting to do this. It'd been like a known thing that everyone wanted to do for a while, but we were just getting our heads in the game.
00:49:25
Speaker
And in January, I realized a major complication on this, and just how dicey this was going to be. um And so um I had this New Year's resolution. It was like, we're going to have to delete hundreds of millions of sessions and regenerate new ones.
00:49:43
Speaker
And swap out a bunch of tokens. Yeah, sure. So New Year's resolution, i don't ever make use of New Year's resolutions. We're not going to take down Slack.
00:49:54
Speaker
There will be no sub zeros, no sub ones in this. I love it. We will have SEVs twos, threes, and fours. So we will have lower severity incidents.
00:50:07
Speaker
They're almost certainly unavoidable. um But we are going to structure this project so that, instead of making one big change after we do a year of work, we're going to make a lot of little changes every day of every week.
00:50:21
Speaker
And each of those might cause an incident, but they're going to be SEV threes.
00:50:29
Speaker
And we pulled it off. Amazing. OK, so you've got to tell me, how do you do that? Yeah. It turned out that the most complicated aspect of this was another weird side effect of how we'd written everything.
00:50:44
Speaker
um So, because you were supposed to have a session for every workspace you were a member of, we made this hack that a lot of the time, when the product wanted to know what workspaces you were in, it didn't ask workspace membership.
00:51:01
Speaker
It asked the sessions table. So, for example, if you go into a part of the product where it asks you for the workspace, where it shows you a list of your workspaces, that was backed by sessions.
00:51:17
Speaker
That makes a lot of sense. Yeah. Sort of. Given everything else you've just said... I mean, a minute ago. The sidebar, for example: you're in an organization, it has all your workspaces, and that was actually coming out of your sessions.
00:51:33
Speaker
um And when you went into admin and you're like, oh, you want to add a member, what workspace? Well, that's not the example. You want to set up an app.
00:51:44
Speaker
What workspaces should the app have access to? And you can only add them to workspaces you're a member of. That dropdown was backed by your sessions.
00:51:57
Speaker
Okay, yeah. Now you're changing the fundamental data model, right? You no longer have a session for every workspace. So every piece of product that is depending on that, it needs to be updated.
00:52:10
Speaker
um And so this is one that we... um We just started writing new code, we wrote a whole new framework for how do you interact with the authenticated user in the backend and what workspaces they should have access to.
00:52:26
Speaker
And there's basically this one function that was load-bearing for the old way. And it had about 300 references. And we just had to burn through them and move all of them, one by one, sometimes a handful at a time, but all of them manually, over into the new framework that was now backed by: okay, you have access to the organization.
00:52:50
Speaker
What workspaces are you a member of in the organization? Yeah. Okay, so you've got 300 call sites to go through, and they are all extremely capable of causing an incident.
00:53:08
Speaker
What does actually moving over one of those call sites safely look like? How do you know? It's not that you can do 299 of them and the last one's going to break. Like, how do I find the last dangerous call site?
00:53:21
Speaker
How do you actually go through doing this safely? ah Yeah. um You know, I picked out like a hundred of them in the first week, because a hundred of them were actually just asking: is this particular user ID currently authenticated?
00:53:38
Speaker
And so they were doing this weird thing where they were asking for all of the user-team pairs they were authenticated on, when all it needed to do was, like: is this user authenticated? And so it was very easy to build the helper function in the new world, because you don't care about the team anymore.
00:53:53
Speaker
It's just, like: is this user ID authenticated? And that's a super easy Boolean function to write. And so there was just a batch of them that was like: actually, the question you're asking is simpler than what you're doing, so let's give you a good helper function to do it. Yeah. And then there were these absolute landmines.
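A sketch of that narrowing, with hypothetical shapes: the broad pair-returning call next to the simple Boolean helper that about a third of the call sites actually needed:

```python
# Hypothetical session rows, not Slack's real code: each row says which user
# it belongs to, which workspace it was minted for, and whether it's dead.
SESSIONS = [
    {"user_id": 1, "team_id": "T1", "invalidated": False},
    {"user_id": 2, "team_id": "T1", "invalidated": True},
]

# The old load-bearing call: every (user, team) pair with a live session.
def authed_user_teams(sessions):
    return {(s["user_id"], s["team_id"]) for s in sessions if not s["invalidated"]}

# What many call sites actually needed: a plain Boolean, no team involved.
def is_user_authenticated(sessions, user_id):
    return any(s["user_id"] == user_id and not s["invalidated"] for s in sessions)
```

Splitting the helpers also documents intent: a reader can see at the call site whether the code cares about teams at all.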
00:54:12
Speaker
There was something around crumbs that I think I had to... um I tried to rewrite it and reverted it three different times. um And I do not remember any of the details of what it was.
00:54:24
Speaker
But it was the bane of my existence for six weeks. I kept trying to touch it and got it wrong. um yeah Fortunately, the new logic was...
00:54:36
Speaker
I want to say easy to write the equivalent of, but it actually wasn't. um Seemed easy. In some ways, easy to write the equivalent of and move over. um There were some really interesting nuances in there. Like, um when you're making a list of workspaces, maybe you're in a hundred of them. So how do we order them?
00:54:58
Speaker
Mostly alphabetical, but whichever is the last one you were actually looking at in product should be on top. And so how did we track that? That was a thing that was like, you can't actually just look at the team membership table for that.
00:55:13
Speaker
So we had to move some of that logic over. So there's some like some interesting nuances of that. Yeah, interesting. That's cool. I love this insight that each of those call sites nominally uses the same function, nominally needs the same functionality, but like actually it doesn't.
00:55:29
Speaker
It's just a really rich source of information. Exactly. A third of them are asking, like, the world's easiest question: are you authenticated, full stop. And then some of them are asking these incredibly complex questions where you need most of the data. And you should treat them differently.
00:55:44
Speaker
Like, discovering which ones deal with what types of data is the question. Right. And one of the benefits, ah, in the end, was that we now have these different helper functions, and so it becomes very obvious in the code what question is actually being asked.
00:56:00
Speaker
Does this piece of code actually need to know everywhere you're logged in, or is just knowing "is this user logged in" enough? Or: should this user have access to this workspace? That's the only question you're asking. There's an obvious helper function, and it becomes obvious in the code that that's the question you're answering, versus getting back this list and doing a for loop through it to find if you're in the one that you're looking for.
00:56:24
Speaker
Yeah, yeah, that makes so much sense. And because we did it this way, it meant that every time we did one of these migrations, that's a point where you can have a SEV. But if, like, the team picker on an app installation page is a little wonky, that's a SEV three.
00:56:45
Speaker
Like, no, it's not good, but you have not taken down Slack for everyone. Yeah. Whereas if we did all of this at once and you had 50 things that were broken? Like, that's bad.
00:56:59
Speaker
Yeah, absolutely. And it sort of ladders up, you know, going from: what is this powering, to: what can we change about it, to: where does it appear to the user? Like, that sets your sensitivity.
00:57:15
Speaker
And once you know that... you have to know those things, or you're just throwing darts and hoping one doesn't land on something. Yes, and you can make judgment calls of, like: can I do 15 of these in one PR and just push it out and walk away? Or should I do this one and put it behind an experiment where I have an instant on-off toggle, where I can revert back to the old functionality immediately?
00:57:41
Speaker
um So when you're in something, your spidey senses are going off of like, this feels dicey.
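The instant on-off toggle pattern might look like this minimal sketch (flag name and store are invented; real systems back this with a dynamic config service, not a module-level dict):

```python
# Hypothetical flag store with a kill switch per risky migration.
EXPERIMENTS = {"single_enterprise_session": True}

def resolve(user_id, new_impl, old_impl):
    """Route one migrated call site through a flag: flipping it reverts to
    the old code path instantly, with no deploy."""
    if EXPERIMENTS.get("single_enterprise_session"):
        return new_impl(user_id)
    return old_impl(user_id)
```

The dicey call sites get this wrapper; the boring ones ship fifteen to a PR.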
00:57:48
Speaker
ah You have an appropriate tool for it. um Very cool. So how'd you do? Did you have any SEVs? We did not take down Slack.
00:58:00
Speaker
One of the interesting things about that project, though, is that we did have SEV twos, threes, and fours, which we expected. About half of them we discovered but did not cause.
00:58:13
Speaker
So you start looking at the code and you get in there and you start going,
00:58:21
Speaker
Not how that should work. Oh, we've been having an incident for six months. Three years. We opened incidents where, like, this has been an issue in the code for as long as we can tell.
00:58:38
Speaker
And this is also the thing with the dashboard, right? It's like, um, we had a sort of joke rule on the team of: don't look at logs on a Friday afternoon.
00:58:51
Speaker
Because we had a habit of, like, we'd put in logging to try and understand how something worked. And then it'd be Friday afternoon, and whatever you've been working on that week you've kind of wrapped up, but you don't want to start the next really hairy PR. So you're like, let me go check on that logging I added yesterday.
00:59:09
Speaker
And then, if that logging is interesting and you dig a few more layers deep, you're like, oh, we've been in an incident for three years. Oh, wow.
00:59:20
Speaker
And you have to call the incident, and nobody wants to be in an incident. You're obligated at that point, because you know what's happening. You can fix it. ah No one even deployed anything.
00:59:30
Speaker
You were trying to avoid this. um So yeah, we discovered at least as many incidents as we caused. but You know, I think that's such a theme of any of these bigger changes, like when you're adding typing, um,
00:59:47
Speaker
I don't think this is from a show, I think this is just someone I talked to: they discovered that their code base had two fully different installations of Django. It turns out you can do that. There's, like, forks available, I think.
00:59:59
Speaker
um They both use the request object. The request objects duck-type to each other, and you can mostly use them interchangeably. But if you start adding types, you realize that you cannot, in fact. And there are runtime errors, and your users have been hitting them, and you have just never thought about it or caught them.
01:00:17
Speaker
Yeah. Isn't that fun? So now it's four o'clock on a Friday afternoon, and that's what you're going to do for the next four hours. yeah Cool. Well, um we're coming up on time.
01:00:30
Speaker
um How did that... so authentication now works correctly. How did you decide that you were done with that? Like, were there prongs that you hit? Yeah.
01:00:44
Speaker
What was the end of that project? So the project around doing a single enterprise session had a really clear end because at some point there were no longer workspace level sessions. There were only top level sessions.
01:00:57
Speaker
Full stop. Nice and clear.
01:01:00
Speaker
And then you delete some more code. And then you walk away from that part. The other part of this project, so we actually did single enterprise sessions first, and then we did the more generic: we generally need to have good data integrity around this. And so we had all these data contracts, and we got the most important ones um cleaned up.
01:01:26
Speaker
Cleaned up the existing code, plugged a lot of holes, got to a place where we felt very confident in it. We did a lot of refactors on this. You know, all this code was originally written as PHP. We've moved a lot towards object-oriented programming in Hacklang, so we did a lot of re-encapsulation of the various objects. Yeah.
01:01:47
Speaker
One of my favorite pieces of that project was, um, you know, we talked about how a token and a cookie can give you access to, like, a bunch of different workspaces in the org.
01:02:01
Speaker
So, how we unpack that, how we go from that token and cookie to which workspaces you should be in: we had at least three different functions that were doing that when we started the project, and they were all doing it slightly differently.
01:02:13
Speaker
And so by the end of the project, we had one way that we were unpacking that information. And that was all everyone used the same one way. um And we debated a lot on the cut line of when we were going to stop working

Efficiency Improvements in Session Management

01:02:29
Speaker
on that.
01:02:29
Speaker
ah And there's always more to do. There's always, always more. Absolutely. But then another trend of incidents probably pulls you away.
01:02:41
Speaker
There's one... my favorite trick on data integrity stuff. Do we have time for one more story? Absolutely. Okay, so this is a trick that we've used a couple times, but the first time I used it was at Remedio.
01:02:53
Speaker
um And at Remedio, we built workplace profiles for companies. um And so we'd import data from the companies and do this, like, cool, polished profile about users and what they worked on, what teams they were on, and their background, all that stuff.
01:03:12
Speaker
um are Our data model was a graph. The founders were Facebook people. Everything was a graph. um And so we functionally had two tables. We had objects and associations.
01:03:25
Speaker
And the data was somewhat dynamic. So it would change daily. um And we would make sure that we had the correct information, but we wouldn't necessarily clean up and delete old information.
01:03:38
Speaker
So, for example, like if an employee left the company, We might process that by removing their association from the company's object to the employee's object.
01:03:49
Speaker
We wouldn't necessarily synchronously delete the employee. So we had these jobs to then do that cleanup. We called them like the orphan checker: go through all of the objects and see if any of them are orphaned, i.e. disconnected from the graph.
01:04:05
Speaker
And so this was mostly to improve the usability of the tables; it didn't ah affect the product directly, didn't affect what you could see in product, but it kept our tables clean.
01:04:19
Speaker
um The job to do this was taking like eight to 12 hours to run because it was picking up every object in the database and saying, are there any ah associations between the object and the other object?
01:04:34
Speaker
If not, delete it. But it was doing it like one at a time. It was streaming through them. um And so one of the things I did at Rimeto is I set up their data warehouse. And so we set up scrapes of the database into analytics tables.
01:04:49
Speaker
Notice the theme. So now once you have that (at Rimeto, we were using Snowflake), Snowflake is great at running these big queries where you do joins between multiple tables.
01:05:02
Speaker
So we rewrote the orphan checker as SQL queries. Those can then identify for you where your orphans are, but it's a different system from your production data. So you just take that and you dump it in a CSV.
01:05:18
Speaker
And then you go have your production jobs read from that CSV that we then call hints. So the analytics infrastructure has told you, as far as we can tell, here are your orphans.
01:05:30
Speaker
Mm-hmm. The production jobs that have, you know, read-write access to MySQL pick it up and say, OK, let me double-check: is this an orphan? OK, delete it. Do whatever the job is on it.
01:05:42
Speaker
And it took the runtime of our orphan checking jobs down by like 99.5%. Oh my God. That's amazing. It went from running in like 12 hours to like 10 minutes.
01:05:53
Speaker
Like I love that. um And so that's one that we reused when we were doing sessions, because there are hundreds of millions of sessions at Slack. And so we were starting to write these scripts to invalidate them.
01:06:07
Speaker
The first time you do that run, you do need to kind of go and inspect all of them. But as like we're discovering holes along the way and wanting to plug them, we don't need to go read a whole hundred million of them. We know from our analytics dashboards which 20 of them are the new problem today.
01:06:25
Speaker
And so we did the same trick. We wrote back to a CSV file and then we had a job that we would manually kick off that would say, here's your hints file,

Lessons Learned and Future Directions

01:06:33
Speaker
go run it. And it runs in seconds. Instead of loading your database, reading every row and checking it against four different tables.
01:06:43
Speaker
That's really cool. That's ah that's an awesome trick. Yeah. Like you saw across multiple companies, it's a generalizable pattern. You can just do this if you have any kind of, well, we have this wobbliness in our production infrastructure.
01:06:56
Speaker
Let's go find the hints, write it to a CSV and then have the production infrastructure do a very small cleanup. Yeah. Now you're doing a targeted cleanup that runs quickly and it is not imposing a burden on your production infrastructure.
01:07:10
Speaker
Yeah, ah that's a do it this afternoon kind of project. I love that.
01:07:16
Speaker
That's very cool. Well, this has been fantastic. I think there are some really super practical lessons in here, which was really fun. um Two questions as we wrap up. um Where can folks find you on the internet?
01:07:33
Speaker
um And is there anything our audience can do to help you? um I exist on LinkedIn ah and don't exist very publicly anywhere else.
01:07:44
Speaker
Very cool. We'll find your LinkedIn, drop it in the channel description or the description for the video. Yeah. And um I did ah leave Slack in April and I've been taking the summer off, but I'm going to be starting to look at jobs this fall.
01:08:01
Speaker
Yeah. Right on. I'll bet there is, I'm sure, a ton of people who could use the kind of approach you bring to these kinds of migrations. So I'm sure I'm sure you'll hear something from someone in the audience.
01:08:15
Speaker
ah
01:08:18
Speaker
All right, Sarah, thank you so much. This has been amazing. Thank you. It's been very fun to talk to you.