
How They Cut Code Migration Time Without Sacrificing Quality

Tern Stories
24 Plays · 5 months ago

Code migration doesn’t have to be slow and painful. In this episode, Madeline and I break down a proven method to speed up your code migration process without sacrificing code quality or stability.

You’ll learn actionable insights from a real-world team who successfully tackled code migration at scale, and did it faster than most thought possible!

If you’re planning a code migration, currently in one, or trying to avoid the common pitfalls, this episode is packed with the perspective you need.

Let's break down a real-world example with Madeline and find a smarter way to approach your code migration strategy and accelerate delivery without cutting corners!

Get Tern Stories in your inbox: https://tern.sh/youtube

Connect with Madeline!

➡️ https://www.linkedin.com/in/madelineshortt/

Transcript

Introduction and Context

00:00:00
Speaker
If we hadn't been in the position in the migration to be able to do this, I don't really know what we would have done. How on earth did you manage to land a project of this size in April of 2020 while COVID was starting? Slack's traffic spikes
00:00:16
Speaker
because everyone's working from home. What if we put IBM into sunset mode first? And it brought down the QPS from like 650 to like 250 instantaneously.
00:00:29
Speaker
Basically saved the shard; replication lag disappeared. Today on the show, we have Madeline Wu Short. She has been a software engineer at Ripple and Slack before diving into the wild world of startups as founding engineer and then general manager at Balsa.
00:00:46
Speaker
And before Balsa ended up, or the team ended up, at Airtable. She has done a ton of amazing work across all those companies. And today we're going to talk about how she led one of the most important migrations at Slack, where she helped move the messages data, IBM's data and everyone else's, from MySQL to Vitess.
00:01:18
Speaker
So yeah thank you so much for coming on the show, Madeline. Yeah, thanks for having me. It is super fun to talk about this project.

Significance of Slack's Data Migration

00:01:25
Speaker
I feel like it's still my crowning achievement of my engineering career. And I still can't believe we pulled it off.
00:01:33
Speaker
No small amount of work. It's just messages at Slack, which is like a small part of the overall product. I know. Yes. Everyone was terrified about this migration, for good reason. I mean, if you lose messages at Slack, that is very catastrophic.
00:01:51
Speaker
Can't have that happen. So, okay, just to set the stage a little bit, this is what, 2018, 2019?
00:02:00
Speaker
This is like 2019. Yeah. Cool. Slack is in hypergrowth mode. Yep. Tens of thousands, 100,000 customers. Yeah. And some of the biggest organizations have finally made the leap to onboarding onto Slack.
00:02:17
Speaker
Yes. What actually was the migration here? You know, moving to Vitess is a technology change, but what prompted this, and why did Slack decide to even start on this project?
00:02:30
Speaker
Yeah, so Slack had already undertaken some of the migrations to Vitess. I think initially, Stewart's idea of Slack was that the largest team we were ever going to see was going to be 100 people.
00:02:45
Speaker
Which in hindsight, you know, of course, when you're starting a startup, that's your wildest dreams. And so, yeah, the data was all team sharded, which of course makes sense if you have 100 people.
00:02:58
Speaker
ah And then obviously Slack grew tremendously um to where you have, you know, like IBM is using it. And um team sharding just really doesn't scale for that when you have kind of like 300,000 people in an organization.

Challenges in Scaling and Migration Strategy

00:03:16
Speaker
And so the migration to Vitess was really started for lots of different tables in order to shard data differently. So I think when Vitess was first adopted, the question was: should messages go first, because it's kind of the scariest table to change, or should other tables go first and then we come back to messages? The decision, before my time getting involved in the project, was: let's do other tables, and then we'll come back to messages once we've validated things and maybe when it feels less scary. And so core infrastructure had led many other migrations to get other tables sharded by user or whatever,
00:04:11
Speaker
like channel or other things. But at this point, when I got involved, I was a product engineer on the messages team in core product.
00:04:25
Speaker
um We owned everything you kind of think of when you first think of Slack, like messages, files. ah It was five backend engineers kind of owning all of that, which is a whole different story.
00:04:38
Speaker
Yeah. It seems like Slack could spare more people for messages, but... You would think, yes. I mean, we did good work, but yes, I think staffing was always an issue. And historically, the migrations had been owned by infrastructure.
00:04:58
Speaker
And so product teams were kind of brought on to advise, but were not really a core part of the team. And I think we got to the point where...
00:05:10
Speaker
Everyone knew of the IBM shard, which is 542, which is always bad, when you have a pet shard. It got a number. Everyone knew that number.
00:05:21
Speaker
Everyone knew this number. And it got to the point where infrastructure was like, we cannot make any alterations to the messages table, because it is just too hard to perform that safely.
00:05:35
Speaker
That's wild. Yes. Yeah. And it was even to the point where it's like, we need to figure something out, because they basically calculated a drop-dead date for the IBM shard based on the typical growth of message sending at their current size.
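As a rough illustration of how a drop-dead date like this could be estimated, the sketch below assumes compound monthly growth toward a fixed shard capacity. The growth model, the function name `months_until_full`, and every number are hypothetical; the conversation does not describe the actual calculation Slack used.

```python
import math


def months_until_full(current_rows: float, capacity_rows: float,
                      monthly_growth: float) -> int:
    """Whole months until a shard reaches capacity under compound growth.

    Assumption: volume grows by a constant fraction each month, e.g.
    monthly_growth=0.06 means 6% more rows every month.
    """
    if current_rows >= capacity_rows:
        return 0
    # Solve current * (1 + g) ** n >= capacity for the smallest integer n.
    ratio = capacity_rows / current_rows
    return math.ceil(math.log(ratio) / math.log(1.0 + monthly_growth))


if __name__ == "__main__":
    # Hypothetical inputs: a shard that is half full.
    baseline = months_until_full(100, 200, 0.06)
    # A customer expanding faster pulls the date in.
    faster = months_until_full(100, 200, 0.10)
    print(baseline, faster)
```

The useful property of a model like this is that it can be rerun whenever the growth assumption changes, which is what happens later in the conversation when IBM decides to expand.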
00:05:56
Speaker
And people were like... I think the Vitess migrations were a little notorious at Slack for taking a really long time and being really tough.

Technical Hurdles and Team Dynamics

00:06:12
Speaker
I mean, obviously, it's a really complex project to undertake.
00:06:17
Speaker
I also think um they'd always been done just purely from like the core infrastructure team, so not partnering with the product side. um And I think there are a couple of migrations that were done with like very few people.
00:06:31
Speaker
So either just one engineer, maybe two, but not as their primary focus. And so I think the organization got a little skittish about undertaking any more migrations. They also took like 18 months.
00:06:45
Speaker
That's a long time to set out to invest in. And so one of the ideas floated for the IBM shard was, why don't we just time-partition it?
00:06:57
Speaker
We'll put a timestamp in the codebase, and any messages after this time period will just go to a different shard.
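The time-partitioning idea described here can be sketched as a hard-coded cutoff table. This is a hypothetical illustration, not Slack's code: the team IDs, the cutoff date, the overflow shard number 9001, and the function name are all made up. Only the shard number 542 comes from the conversation.

```python
from datetime import datetime, timezone

# Hypothetical home shards; 542 is the IBM shard named in the episode.
HOME_SHARD = {"T_IBM": 542, "T_SMALL": 17}

# Hypothetical hard-coded cutoffs: team -> [(cutoff, shard used from then on)].
TIME_PARTITIONS = {
    "T_IBM": [(datetime(2020, 1, 1, tzinfo=timezone.utc), 9001)],
}


def shard_for_message(team_id: str, sent_at: datetime) -> int:
    """Pick a shard by team, overridden by any cutoff the message falls after."""
    shard = HOME_SHARD[team_id]
    for cutoff, later_shard in sorted(TIME_PARTITIONS.get(team_id, [])):
        if sent_at >= cutoff:
            shard = later_shard
    return shard


if __name__ == "__main__":
    old = datetime(2019, 6, 1, tzinfo=timezone.utc)
    new = datetime(2020, 6, 1, tzinfo=timezone.utc)
    print(shard_for_message("T_IBM", old), shard_for_message("T_IBM", new))
```

Every additional oversized team, or every time an overflow shard itself fills up, means another entry in the cutoff table, and every read that spans a cutoff has to consult more than one shard.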
00:07:07
Speaker
Which, you know, probably would have worked. That could have worked. I could see that. It could have worked. I think it made me nervous because I'm like, you know, Slack is in hypergrowth. Ideally, IBM is not our largest customer for a long period of time.
00:07:23
Speaker
Hopefully we have other customers that, you know... how many if statements are we going to have? Yeah. So walk me through a little bit what the right way to use Vitess is. How was Slack thinking about it, and what was the ideal state?
00:07:43
Speaker
And then I'm going to come back and ask about what was not ideal. Yeah, so I think with Vitess, what's really nice is that it's still MySQL under the hood. Slack used MySQL directly, but Vitess gives you all the power to reshard MySQL very nicely. And from the application side, you don't really need to know how it's sharded. You just pass in the sharding key, and then Vitess handles
00:08:15
Speaker
all of the complicated sharding logic that's not automatically built into MySQL. And so for messages, when we want to reshard by channel, from the application point of view it's still going to hit MySQL at the end of the day, from the database perspective.
00:08:37
Speaker
And we can reshard the data without the application really knowing it's now sharded by channel versus team. Got it. Makes a ton of sense.
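To make the team-versus-channel distinction concrete, here is a minimal sketch of hash-based routing on a sharding key. It is a toy model and does not use Vitess's actual API or hashing scheme; `NUM_SHARDS`, `shard_for`, and the sample data are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # illustrative only


def shard_for(sharding_key: str) -> int:
    """Map a sharding key to a shard with a stable hash (toy stand-in)."""
    digest = hashlib.sha256(sharding_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# One large team whose 1,000 messages are spread over 50 channels.
messages = [{"team": "T1", "channel": f"C{i % 50}"} for i in range(1000)]

# Sharding by team: every message of the team lands on the same shard.
by_team = Counter(shard_for(m["team"]) for m in messages)

# Sharding by channel: the same messages spread across several shards.
by_channel = Counter(shard_for(m["channel"]) for m in messages)

if __name__ == "__main__":
    print("shards used when keyed by team:   ", len(by_team))
    print("shards used when keyed by channel:", len(by_channel))
```

The application code only chooses which column to pass as the key. Where a given key lives is decided by the routing layer, which is why the data can be resharded without the callers changing.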
00:08:50
Speaker
And compare that to how it was set up, such that IBM's data is on one specific shard. Yes. How did Slack organize data prior to Vitess?
00:09:02
Speaker
Yeah, so previously a whole team's data would all be on one shard. Everything. Everything. There's kind of an asterisk: Slack had a very interesting and complicated enterprise structure. And so if you had an enterprise Slack instance with other workspaces, your workspaces could be on different shards, but your enterprise would be on one.
00:09:32
Speaker
But if you were a plain vanilla paid team, all of your data would be on one shard, regardless of how large that instance was. Okay, that makes it a lot more clear why to go to Vitess.
00:09:49
Speaker
So with all of IBM's data on a single shard, you mentioned there was this ticking clock. How long did you have to actually pull off this migration?

Migration Process and Urgency

00:10:01
Speaker
So the date, I want to say, was like 12 or 14 months. Was it a high-confidence estimate?
00:10:15
Speaker
So this was, I mean, yeah, this is the scary part, right? Things could change. And the thing is that, spoiler alert, after we started the migration, things did change: IBM wanted to go wall to wall.
00:10:29
Speaker
So they wanted to get even bigger. Oh, no. Yes. And so I think this made me even more glad that like we decided to invest and like do the migration properly.
00:10:40
Speaker
um And then there's even... I'll save the grand finale for later. um
00:10:49
Speaker
But yeah, I mean, this is what, you know, we kind of like modeled different kind of like ideas for like how fast their messages could grow. But yeah, it's rare that a migration has kind of like such a clear end date.
00:11:05
Speaker
um But it turned out to be really helpful in order to um really advocate for enough staffing um in order to kind of like ah get the project done. So I ended up partnering with Core Infrastructure.
00:11:20
Speaker
We got four people on the project from the beginning. It was a really nice collaboration between product and infrastructure. We had people who knew a lot about messages and the write paths and the read paths. We had the expertise from core infrastructure on the Vitess side.
00:11:37
Speaker
And so I think having this really scary date allowed us to get buy-in from leadership. Yeah. That makes a lot of sense. Tell me more about the, you know, "I'm assembling a team" moment there.
00:11:57
Speaker
What did you need to know? What people did you need to get in the room in order to be able to do this? Because it's a very different experience to say we're going to move all the messages inside of 12 months than it is to say we're going to move a little channel in however long it takes us.
00:12:12
Speaker
Totally. Yeah. So I had a good friend on the infrastructure side who had kind of been like chatting to me about this. And, you know, like Maggie was like, I think this makes sense to do as like a partnership.
00:12:25
Speaker
And she was really the one who planted the seeds. And I was like, yeah, we really need to do this with a full team. I think what I loved about this is we had all the decision makers on the team, which I think really sped up the process. We had representatives from everyone.
00:12:46
Speaker
But it was two folks from core infrastructure who handled the nitty-gritty Vitess stuff. We needed to set up a new keyspace that I think was the largest one they had made in Vitess so far, just because messages is a lot of data, a lot of storage.
00:13:10
Speaker
And then on the messages side, there was a lot of really old code that touched the messages table. When we were migrating over call sites, it was surprising.
00:13:24
Speaker
I mean, surprising and, again, scary: there were raw SQL queries going out to the messages table in different places that we didn't know about. And so it was really helpful to have all the expertise in a room. We had an offsite, and we all sat around a table and made decisions, you know?
00:13:44
Speaker
um And so it was, was, yeah, um was very much like a, you know, like a table project, which felt really good. Cool. um How did,
00:13:57
Speaker
What did you learn during that during that phase, that sort of discovery that nobody knew coming into the the project? And like how did that change your plans?
00:14:08
Speaker
Yeah, I think the thing that surprised me most was, you know, messages is kind of the most important table at Slack in a lot of ways. You send messages, you receive messages.
00:14:21
Speaker
And there were columns that literally no one at the company had the historical knowledge about.
00:14:32
Speaker
Like, yeah. So there were certain columns where I felt like an archaeologist, trying to figure out: when was this column used? What was it used for? Are we using it anymore? Is it important?
00:14:45
Speaker
And it just felt crazy to me. I'd only been at Slack for a couple of years at that point. It just felt crazy to me that I was the one who was understanding the table and figuring out what all this means. And so that was a big surprise. I think also this is around the time when external shared channels was launching. So this is Slack's feature where you can share a channel with another Slack instance.
00:15:19
Speaker
And so you can kind of talk through this. This was an extremely complicated technical project that a different pillar was owning. And it was, I think, a very new initiative. And so we were migrating messages and resharding by channel around the same time that they were doing a lot of
00:15:42
Speaker
changes to the business logic. And so that just took more coordination and learning. It was something that we really had to think about, whereas previous migrations hadn't had to consider how things change with external shared channels. Got it.
00:16:03
Speaker
Yeah. And you mentioned that if the data is all sharded by team, external shared channels breaks that model. Yes. Yes. Which was very complicated for them to think about.
00:16:15
Speaker
It's almost a little bit too bad. Maybe this is another reason we could have done the migration even earlier, in order to help enable that product feature. But yeah, they had to do a lot of complicated thinking about, like,
00:16:30
Speaker
Where does this channel live? Who hosts it? How do we know which shard to go to in order to read these messages? All these complicated things. Yeah. Yeah. All while all the tables are being ripped out from under them and moved into Vitess. Exactly. Exactly.
00:16:47
Speaker
We played very nicely. We had no incidents with them. So...
00:16:53
Speaker
So how did this migration actually work? Because you mentioned that the sort of previous approaches weren't particularly quick. Yes. Did you approach this differently in order to hit the deadline? Yeah.
00:17:06
Speaker
Yeah. So I was maybe a little bit of a bully to people, because I was like, we are doing the bare minimum to migrate this table.
00:17:20
Speaker
And I was really clear about scope. I was like, we are not taking on any additional cleanup or any additional scope, in order to get this migration done on time. I think previous migrations had done more cleanup work, or changed access patterns, and undertaken more in their scope to do the migration. And it makes sense. You know, you're touching all the data, it's maybe your only opportunity to do a large-scale refactor, might as well take it.
00:17:51
Speaker
But I was like so clear. I was like we have this really scary deadline. We need to do this safely. We need to do this quickly. So we really only like only migrated the data.
00:18:03
Speaker
I think something I look back on and kind of wish we had done is, you know, we had these old columns lying around that weren't used anymore. I kind of wish we had dropped them, but I didn't want to add any more scope.
00:18:20
Speaker
I think the one place where we added a tiny bit more scope was we changed some of the reads to read from replica in the Vitess world, because we were in there and had so much product knowledge that we could be like, this one's...
00:18:36
Speaker
Yeah, we know, based on the write/read access pattern, this one's safe to do. And so, also to try and alleviate some shard pressure, we undertook that.
00:18:48
Speaker
But I had a clear North Star for everyone: we're doing the least amount of things in order to finish this migration in time.
00:19:04
Speaker
That makes a ton of sense. You have a hard deadline and you have to finish. Yeah. Was any of that cut scope contentious? It was a little contentious at the beginning, I think.
00:19:15
Speaker
Yeah. I think that, at the very beginning planning stages, there were a lot of big round-table meetings about this. I think folks wanted to make more improvements. The messages migration had been a thing that infrastructure had thought about for a long time, and there were a lot of hopes and dreams attached to it. And so I think on one hand it was a little contentious, because people wanted the opportunity to do more. I think on the other hand, it was just so clear to everyone that there was this date, and we were in a really bad situation. And so, while it was a little bit contentious, it never felt like a real option to undertake
00:20:03
Speaker
these larger-scale refactors, or changing the primary key. This was like team, channel, timestamp, and there had been hopes to be able to drop team out of that primary key. And, you know, if you're sharding by channel, you can actually do that.
00:20:20
Speaker
But I think other migrations have had more complications by, obviously, changing the primary key, and you're like, we just need to get this done.
00:20:32
Speaker
Yeah. Accurately, yeah. It needs to finish, it needs to be right. Yeah, let's get it done. I love that clarity. I think that makes a ton of sense for any large project: saying, this is the successful outcome, and then you've got a baseball bat to go. Yeah. Do you know the seagull meme, of the seagulls screaming? So, literally at the top of my tech spec it was just, like, "take all the messages and put them into Vitess." That was the little tagline. That's literally just what we did. So yeah, it was really helpful to have that
00:21:15
Speaker
alignment. I don't know if you always have that for projects, especially migrations. And so it's helpful to have that kind of back pressure. That makes sense. Okay, so you've got a deadline, you've got clarity, you've got the right team. Yes. You're making it work. Yeah. How did it go? I mean, I think this is why I'm still happy to talk about it this many years later.

Success Amidst COVID and Future Scalability

00:21:42
Speaker
It went really, really well. um We had a great team. I have this photo. so i really like, you know, I really like planning. And so I have this photo of like a timeline and sticky notes of like when we go into like dark mode, when we go into light mode and kind of sunset mode in these different places. And I had, oh,
00:22:05
Speaker
Oops, sorry. And I had mapped it all out as we started, to have us finish in the beginning of April 2020.
00:22:21
Speaker
And we hit it exactly. We hit every mode transition time point: we hit dark mode when we were supposed to, we hit light mode, and so on, and we were able to finish it even a little bit early. It went really smoothly. I know. Okay, I was there and I didn't quite realize this. How on earth did you manage to land a project of this size in April of 2020 while COVID was starting?
00:23:02
Speaker
Okay. Okay. So it's a little bit... yeah, I know. So this is the other point where, yes, we were incredibly lucky to have started the project when we did. This is probably why I'm most proud. This is the most exciting part.
00:23:21
Speaker
So we were running a little bit early. And so we were in sunset mode for... well, we were in light mode for everyone. So this is where we're still writing to both legacy and Vitess.
00:23:35
Speaker
We're only reading from Vitess and we're returning that to clients. But we still theoretically have the option to go back to legacy, because we're still writing and reading from there. And that had gone really well.
00:23:49
Speaker
I want to say this is like February of 2020. Yes. No one knows yet. Yeah, no one knows yet. We have put... Internally, we have moved into sunset mode.
00:24:00
Speaker
um And so internally, we are no longer reading from Legacy. um And then... So we feel pretty good about sunset mode. We feel pretty good about light mode. And then, you know, the lockdown happens.
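The migration modes mentioned so far can be sketched as a small dual-write wrapper. From the conversation: light mode writes to both stores and returns the Vitess read to clients while legacy is still read, and sunset mode stops reading from legacy. Everything else here is an assumption made for illustration: the behavior of dark mode (return legacy, compare against Vitess), the choice to keep writing to legacy in sunset mode, and all class and method names.

```python
from enum import Enum


class Mode(Enum):
    LEGACY = "legacy"  # before the migration starts
    DARK = "dark"      # assumed: dual write, compare reads, return legacy
    LIGHT = "light"    # dual write, compare reads, return Vitess
    SUNSET = "sunset"  # reads come from Vitess only


class MessageStore:
    """Toy dual-write store; plain dicts stand in for the two databases."""

    def __init__(self):
        self.legacy = {}
        self.vitess = {}
        self.diffs = []  # (key, legacy_value, vitess_value) mismatches

    def write(self, mode, key, value):
        # Assumption: legacy keeps receiving writes in every mode of this sketch.
        self.legacy[key] = value
        if mode is not Mode.LEGACY:
            self.vitess[key] = value

    def read(self, mode, key):
        if mode is Mode.LEGACY:
            return self.legacy.get(key)
        if mode is Mode.SUNSET:
            return self.vitess.get(key)  # legacy takes no read traffic
        old, new = self.legacy.get(key), self.vitess.get(key)
        if old != new:
            self.diffs.append((key, old, new))
        return old if mode is Mode.DARK else new


if __name__ == "__main__":
    store = MessageStore()
    store.write(Mode.LIGHT, "m1", "hello")
    print(store.read(Mode.LIGHT, "m1"), store.read(Mode.SUNSET, "m1"))
```

In a setup like this, the mode would typically be chosen per team, which is what makes it possible to move one customer into sunset mode ahead of the others.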
00:24:14
Speaker
And what happens is Slack's traffic spikes because everyone's working from home and everyone's now coordinating on Slack. And so what we started to see, yeah.
00:24:26
Speaker
Yeah. So what we started to see on the IBM shard is they were getting replication lag that started in the morning Eastern time and lasted until about the afternoon Pacific.
00:24:39
Speaker
And then it would finally catch up, and then it would restart the next day. Oh my God. Yeah, and this is terrifying. The replication lag got bad, and we're just watching the shard.
00:24:54
Speaker
And um this is the place where like we're ahead of schedule. We're like, what if we put IBM into sunset mode first? And so this will take all the message reads off of legacy.
00:25:09
Speaker
um and so it was a little scary because typically we do kind of our enterprise customers last. Um, but we did them first after us internally.
00:25:20
Speaker
And it brought down the QPS from like
00:25:28
Speaker
650 to like 250 instantaneously. It basically saved the shard. Replication lag disappeared. It was totally fine, it was totally healthy. And so this is another thing where I'm like, if we hadn't been in the position in the migration to be able to do this, I don't really know what we would have done. Yeah. Yeah.
00:25:54
Speaker
That's wild. You were suddenly in a position where IBM... Because this is replication lag on the old setup, right? Yes, exactly. So this is everything in messages, which we could keep joking about being unimportant, but is actually incredibly important. Totally.
00:26:09
Speaker
Yeah. So this is like... Lots of different things. And it got really scary, and we were able to do this, and the shard turned green and it went smoothly. And so I think this is the place where I'm like, so, so pleased we were in the position to be able to do that.
00:26:31
Speaker
I love that. That's, it's so interesting because it, the, One of the themes that I've noticed is that migrations almost always seem to be about like moving almost as slow as possible, given the constraint that we have to get this done at some point, because they're so disruptive and they're yeah so challenging.
00:26:50
Speaker
But then you get these moments where you're like, no, we did the work, right? We are ready to handle some exogenous spike. Like, the world fell apart and the world works from home now.
00:27:01
Speaker
And now suddenly you can handle it and say, yeah, okay, we'll just fix the problem here and cut IBM's traffic by 60% on their legacy shard. Yeah, I know. It's like you do all the preventative medicine work for the previous eight months, you know, and then you get this payoff.
00:27:16
Speaker
I think another moment of that is that during this process, um you know, like during this whole migration, we heard that like IBM wanted to go wall to wall. We also heard that um Amazon was interested in coming in and immediately becoming Slack's biggest customer, of which then there is, you know, like lots of work to be able to support that.
00:27:41
Speaker
But it felt so incredible that, from a team-shard perspective, we're like, oh, messages will just be sharded by channel for them.
00:27:54
Speaker
Like it will be fine. Like we don't have to worry about kind of the scalability of the messages table um for a customer that was like, I don't know, three or four times larger than our last biggest customer. Like something crazy. Yeah.
00:28:09
Speaker
It's a lot of people. There's a lot of people that work at Amazon, turns out. I know, my goodness. Yeah, we had mode on the show a couple weeks ago. She talked about toy pond and load testing for Amazon. And it was notable, especially in light of what we've just talked about, that she was looking for problems, right? It was not that messages weren't going to work, because messages was already on Vitess; it was, what else is going to fall apart? Yeah. Yeah. No, it was very cool getting to work with her on making sure reactions and all the rest of the things still would work at such a large scale.
00:28:49
Speaker
So what did what did IBM and so what what did their change or their decision to go wall to wall mean for the project? and And did that happen before COVID or was that like a COVID decision?
00:29:02
Speaker
That was before COVID. Yeah. So this was, yeah, this was before the project or before the project was finished. ah And so I think that this, yeah, this happened like during the migration.
00:29:14
Speaker
This made me even more glad that we had like kind of like really advocated for the full staffing of the team from the beginning.
00:29:25
Speaker
um Because like, yeah, with migrations, it's, it's, they're very long-term thinking. Like you can't change things, you know, and get immediate results. um And, yeah,
00:29:37
Speaker
I think that, you know, we redid the predictions about, like, when is the shard going to fall over if we end up adding this many people? I think we also got a little lucky that sales cycles take a long time. And so they weren't all going to come on the next day, and we had some time. But again, it was in the capacity planning for them going wall to wall.

Long-term Migration Challenges

00:30:01
Speaker
We could say that, okay, if we're totally done by April, that's going to be enough time for them to go wall to wall without having an issue.
00:30:13
Speaker
Kind of a similar question to the estimate at the beginning: how much did you trust the speed-up that was coming? So, yeah, I think other people trusted it more. I really didn't trust it. I was just like, I mean, you never know what's going to happen, you know? Like, literally COVID happens. And it just felt like,
00:30:39
Speaker
worst case, we finish early, right? I mean, that's great. And so I really tried to push for it. Having four people in a migration for 10 months was a hard sell to leadership, and was definitely the largest team that had worked on a Vitess migration at Slack.
00:31:05
Speaker
And so I think I kept on being, like, Yeah, maybe the best case scenario is this estimate, but like, you know, this is like the happy path where things kind of act normally. And um and so it it was definitely something that was like, you know, breathing down our neck the whole time of like, we we have to like keep going and...
00:31:27
Speaker
It's also hard to work on a project for 10 months and keep that sense of urgency the whole time. You know, 10 months is really a marathon. It's not a sprint where you're trying to pull long hours. And so I think being extremely systematic helped, and having each mode transition, from, you know, dark mode to light mode... We treated each of those as a launch.
00:31:55
Speaker
And so I think that also helped break things up and have like kind of like clear goals. And then we kind of like reset and work towards the next mode. but That makes a lot of sense. And and actually, um that reminds me of a thing I wanted to ask. like As you're thinking about you know this longer project and breaking it up into those milestones, did you think about, was this just like a pile of work you had to do? Or were there opportunities that you could step back and like create tools or create systems that then accelerated the work down the line?
00:32:26
Speaker
And how did you think about making that investment early versus just doing the work? Yeah, so I think that we were very fortunate to be able to piggyback on a lot of the shared libraries that other Vitess migrations had built up.
00:32:45
Speaker
I think this is one of the reasons I'm glad messages didn't go first, you know: you're not undertaking everything as part of this migration. I think something we did invest more in was the diff logging, especially for dark mode.
00:33:04
Speaker
Messages is really complicated. You know, I think there was a column that was just, like, MSG. That was a JSON blob that, you know.
00:33:15
Speaker
Yep, yep. Yeah, that had all the message content in it. And so initially it would be like, you know, this column is different between Vitess and legacy. And it's like, well, we need more granularity for this specific migration in order to be able to tell which call sites potentially went wrong.
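The finer-grained diffing described here can be sketched roughly as follows. This is a minimal illustration in Python, not Slack's actual tooling: the function name `diff_blob_keys`, the sample keys, and the assumption that the blob is a flat JSON object are all hypothetical.

```python
import json

# Sentinel for "key absent on this side"; rendered as a string in the output.
_MISSING = object()


def _render(value):
    """Make an absent key visible in the log output."""
    return "<missing>" if value is _MISSING else value


def diff_blob_keys(legacy_blob, vitess_blob):
    """Return {key: (legacy_value, vitess_value)} for top-level keys that differ.

    Assumption: both arguments are JSON strings encoding objects, as a column
    holding all the message content might.
    """
    legacy = json.loads(legacy_blob)
    vitess = json.loads(vitess_blob)
    diffs = {}
    for key in sorted(legacy.keys() | vitess.keys()):
        old = legacy.get(key, _MISSING)
        new = vitess.get(key, _MISSING)
        if old != new:
            diffs[key] = (_render(old), _render(new))
    return diffs


if __name__ == "__main__":
    legacy_row = '{"text": "hi", "edited": null, "ts": "1"}'
    vitess_row = '{"text": "hi", "edited": {"user": "U1"}, "ts": "1"}'
    # Reports only the "edited" key instead of "the whole column differs".
    print(diff_blob_keys(legacy_row, vitess_row))
```

Logging the differing keys rather than the whole column is what makes it possible to tie a class of diffs back to a specific call site.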
00:33:39
Speaker
And so we did more investment there. I think something, looking back, is that in order to investigate the diffs we still spent a lot of time in spreadsheets, you know, importing logs and then, like,
00:33:59
Speaker
this diff, from this call stack, there's, I don't know, 45,000 of them. And we'd work through this spreadsheet, and people would take different call sites to investigate, but it felt really scrappy, in a way where we were able to get it done, but in hindsight, this felt more like,
00:34:29
Speaker
I don't know, piecing things together than it should have for a migration of this size. And I also think that in some ways it felt like more of an art than a science to know when it was safe to transition modes.
00:34:46
Speaker
You know, we think we understand all the diffs, we think everything's okay, but I wish there had been a clear tool to be able to say, you know, you've resolved, I don't know, like 99% of the diffs, there's these ones that are diff only or don't have a consistent pattern, green light, I think you're good to keep moving forward.
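A tool like the one wished for here could start out very small. The sketch below is hypothetical: the record shape (`{"call_site": ...}`), the function names, and the 99% threshold are assumptions for illustration, with the threshold borrowed loosely from the figure mentioned above.

```python
from collections import Counter


def summarize_diffs(diff_records, resolved_call_sites):
    """Group logged diffs by call site and report how much is explained.

    diff_records: iterable of dicts with a "call_site" key (assumed shape).
    resolved_call_sites: set of call sites the team has investigated.
    """
    counts = Counter(record["call_site"] for record in diff_records)
    total = sum(counts.values())
    resolved = sum(n for site, n in counts.items() if site in resolved_call_sites)
    unresolved = [
        (site, n) for site, n in counts.most_common() if site not in resolved_call_sites
    ]
    return {
        "total": total,
        # With no diffs at all, treat everything as resolved.
        "resolved_fraction": 1.0 if total == 0 else resolved / total,
        "unresolved": unresolved,  # largest buckets first
    }


def ready_to_advance(summary, threshold=0.99):
    """Green light only if enough of the logged diffs are understood."""
    return summary["resolved_fraction"] >= threshold


if __name__ == "__main__":
    records = [{"call_site": "history.fetch"}] * 99 + [{"call_site": "pins.list"}]
    summary = summarize_diffs(records, {"history.fetch"})
    print(summary["unresolved"], ready_to_advance(summary))
```

This replaces the spreadsheet bookkeeping with a count per call site. Deciding whether a bucket is truly resolved still takes human judgment.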
00:35:20
Speaker
Got it. That makes a ton of sense. You're chugging through those diffs and those logs one by one, you've got to do the work, and you're keeping track of how far you've come and what the overall picture is.
00:35:36
Speaker
That feels like something that should be externalizable. You should be able to put that in a tool. Right. I know. I'm like, this process of going from dark mode to light mode to sunset mode, this is what every migration goes through. You know, we're not reinventing the wheel on this side.
00:35:55
Speaker
It would have been nice also, I think, because I really took the responsibility, as the tech lead, to be the decision maker. You know, if I say we go into light mode and something bad happens, this is on me. I really wanted the team to feel like it wasn't on all of them to make the decision of when to flip these feature flags and move forward.
00:36:19
Speaker
It would have been nice to have a migration companion in the tooling to be like, you know, here's what... I mean, I think I was very fortunate that there was so much expertise already in Slack, but I can imagine, if you're doing this at a place that doesn't have that depth of knowledge, migrations are lonely and they take a really long time. And I think it would be nice to feel like you're not going through it just on your own.
00:36:48
Speaker
Yeah, absolutely. There's a huge difference between "I looked at this data and I made this decision" and "I looked at this scattered set of feelings I have and I made this decision." Exactly. Even if you're wrong, you have the comfort of, look, I made a good decision. Sometimes good decisions are wrong.
00:37:06
Speaker
Totally. Yeah. But it was thoughtful, and here's why, and it was very organized. Yeah, absolutely. That makes a ton of sense. Cool. So tell...
00:37:20
Speaker
I think we've hit all of the bits of the actual migration. What happened after flipping IBM? Was it smooth to get all of the other customers flipped, or were there any surprises in the wind down of it?
00:37:37
Speaker
There weren't any surprises in the wind down, thank goodness. Yeah, because we were seeing IBM's shards have it the worst, but we were seeing this unprecedented spike in traffic across the board, you know. Right. Or...
00:37:51
Speaker
Yeah, with COVID happening. And so, yeah, that was really smooth. It was a little bittersweet that after working together for so long as a team, we had all of our celebrations on Zoom, in pandemic world.
00:38:08
Speaker
But we officially dropped the legacy tables for messages on Zoom together. Yeah. And that went smoothly. And then it's fascinating how quickly you just move on to the next thing. Because then, you know, IBM and all of Mode's work with the next was immediately like, okay, next challenge to go figure out.
00:38:33
Speaker
Yeah. Absolutely. I think that's one of the things that's underplayed about working at these hyper-growth companies: human brains are just not meant to deal with "the line doubles every eight months, 12 months, or whatever it is." That was a huge effort, it took everything you had. And then someone else probably has to do it again for something slightly different.
00:39:01
Speaker
Yeah, it just feels like the realm of possibility is increased every year. It's really... Yeah, absolutely.
00:39:15
Speaker
Let's maybe flip over into quick-fire questions, or maybe not even, because maybe you have longer answers to these and it'll be interesting. So you went through this whole journey.
00:39:26
Speaker
Yeah. If you could go back and give yourself one piece of advice at the beginning, what would that be?
00:39:38
Speaker
Good question. I think...
00:39:43
Speaker
I think it would be just, like, keep going. Yeah. I think we were able to do that. But I think, yeah, just keep making the steady pieces of progress, you know, the week-over-week progress. And it all comes together in the end.
00:40:02
Speaker
Yeah. I also think just knowing that it's going to be okay, you're going to make it to the other side, would have been helpful to know at different points.
00:40:15
Speaker
But yeah, I think I'm very fortunate, you know, looking back. I think we made a lot of the right decisions. That doesn't always happen for different projects. And so, yeah, I'm really proud of what the team did together. But I think, yeah, just keep going. Just keep putting one foot in front of the other and we're going to get there.
00:40:39
Speaker
You'll be ready when the world shuts down a month before the project. Exactly. I know.
00:40:47
Speaker
Who would have known.

AI Tools and Migration Simplification

00:40:49
Speaker
It is 2025, so I feel like we have to talk about AI a little bit. How do you think this would have gone differently if you had had access to... Yeah.
00:41:07
Speaker
I think it would have. I think there's two big things it would have helped us with. There was a part of this project at the very beginning where we were migrating all of the calls to the messages table to use our Vitess library, which handled dark mode and light mode and the nitty-gritty under the hood: does this call go to both legacy and Vitess? Does it just go to legacy? How does the diffing work?
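For readers unfamiliar with this kind of library, here is a rough sketch of the routing decision it has to make for a read. The mode semantics below are an assumption for illustration (dark mode serves legacy results while comparing against Vitess, light mode serves Vitess results while still comparing, sunset mode reads only from Vitess); the conversation names the modes but does not spell out their exact behavior, and every identifier here is hypothetical.

```python
from enum import Enum


class Mode(Enum):
    LEGACY = "legacy"  # read legacy only
    DARK = "dark"      # read both, serve legacy, log diffs (assumed)
    LIGHT = "light"    # read both, serve Vitess, log diffs (assumed)
    SUNSET = "sunset"  # read Vitess only (assumed)


def routed_read(mode, query, legacy_db, vitess_db, log_diff):
    """Run a read according to the migration mode.

    legacy_db / vitess_db: callables taking the query and returning rows.
    log_diff: callable invoked as log_diff(query, legacy_rows, vitess_rows).
    """
    if mode is Mode.LEGACY:
        return legacy_db(query)
    if mode is Mode.SUNSET:
        return vitess_db(query)

    # Dark and light mode both double-read so mismatches can be logged.
    legacy_rows = legacy_db(query)
    vitess_rows = vitess_db(query)
    if legacy_rows != vitess_rows:
        log_diff(query, legacy_rows, vitess_rows)
    return legacy_rows if mode is Mode.DARK else vitess_rows


if __name__ == "__main__":
    diffs = []
    rows = routed_read(
        Mode.DARK,
        "SELECT * FROM messages WHERE channel_id = 1",
        legacy_db=lambda q: [{"id": 1, "text": "hi"}],
        vitess_db=lambda q: [{"id": 1, "text": "hi!"}],
        log_diff=lambda q, old, new: diffs.append((q, old, new)),
    )
    print(rows, len(diffs))
```

Under these assumptions, sunset mode stops the double read entirely, which would be consistent with a large drop in query load on the legacy side when a workspace moves into it.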
00:41:40
Speaker
And I think it was surprisingly difficult to find all the call sites to the messages table in the code base, which was something that we hadn't foreseen. We were kind of just like, we'll migrate the call sites and then we do the hard part.
00:41:57
Speaker
And I think having AI be able to do that, or do the first pass, would have been tremendously helpful, especially because we had pretty good unit test coverage at that point. We would run the unit tests in what we called pass-through mode, where basically, if anything hit the messages test database that didn't go through the Vitess library, it would throw an error. So we could have just had AI iterate on that, versus engineers spending a couple of weeks trying to figure that out.
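The pass-through idea, failing any test whose query reaches the messages table without going through the migration library, might look something like this. It is a simplified sketch: `GuardedConnection`, `via_migration_library`, and the regex-based table check are hypothetical stand-ins, not the mechanism Slack used.

```python
import re
from contextlib import contextmanager


class UnmigratedCallSiteError(RuntimeError):
    """Raised when a query touches the guarded table outside the library."""


class GuardedConnection:
    def __init__(self, execute, guarded_table="messages"):
        self._execute = execute  # underlying callable: sql -> result
        self._pattern = re.compile(rf"\b{re.escape(guarded_table)}\b", re.IGNORECASE)
        self._inside_library = False

    @contextmanager
    def via_migration_library(self):
        """The migration library wraps its own queries in this context."""
        previous = self._inside_library
        self._inside_library = True
        try:
            yield self
        finally:
            self._inside_library = previous

    def execute(self, sql):
        # Any direct hit on the guarded table is an unmigrated call site.
        if self._pattern.search(sql) and not self._inside_library:
            raise UnmigratedCallSiteError(f"direct query on guarded table: {sql}")
        return self._execute(sql)


if __name__ == "__main__":
    conn = GuardedConnection(lambda sql: "ok")
    with conn.via_migration_library():
        print(conn.execute("SELECT * FROM messages"))  # allowed
    try:
        conn.execute("SELECT * FROM messages")         # unmigrated call site
    except UnmigratedCallSiteError as err:
        print("caught:", err)
```

Because each failure points at a concrete call site, a guard like this gives a person or an automated tool a tight loop: run the tests, migrate whatever failed, repeat.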
00:42:35
Speaker
I also think, from the diffing side, you know, looking at all those logs and trying to come up with some narrative. We spent a lot of time in dark mode doing that.
00:42:50
Speaker
Like, what do these logs actually mean? Why are they happening? Are they worrisome? Are they just race conditions? That was maybe the hardest part of the project, all of that.
00:43:04
Speaker
And I think that feels like something that a tool could exist to help with. I love that. The most magical moments I've had with AI are when I realized that there is some structure I'm trying to find and I can feed it the right information to, like,
00:43:23
Speaker
Either like, obviously it's great when you can just dump a bunch of logs and be like, what is the most common error and why? And sometimes it doesn't. That's really magic. But it feels almost more magical if you can dump in a bunch of information and say, hey, I need you to group these and then back out, start to figure out and give me a narrative, anything. And even if it's not 100% right, it's so helpful. It just gives you something to grab onto.
00:43:50
Speaker
Totally. It's somewhere to start. And I think sometimes coming up with that initial hypothesis is the hardest part. Yeah. I am not built to read 45,000 lines of JSON. Oh my God. I know, exactly. I don't like it.
00:44:04
Speaker
AI would have no problem. Yeah, we had to do so much formatting of the diffs because, yeah, they were just incomprehensible. And so in our logging we had to be like, tell me the keys of the MSG blob that are different, you know? And we had all of this different stuff.
00:44:26
Speaker
And yeah, I mean, AI would have no problem with it, just figuring that out. Yeah, and I think there's a ton of opportunity in that kind of loop. If you can get a feedback loop, right, and you can say... you need it for people, right? Like improving the diff logging tools and dumping that information out and saying, this is what we would need to go to the next stage.
00:44:50
Speaker
That's the kind of thing that, if you can feed that back in, you can build a real AI loop around: what the heck are we going to do? And then here's a bunch of dumb tasks, right? Everyone I've talked to, at least, and I'm curious to get your take on this, says the most successful things people have done with AI are all offloading stupid stuff to it.
00:45:08
Speaker
Oh, totally. Yeah. No, I totally think that. I mean, this is a bit of a tangent, but my mom works at the CDC and runs grant reviews, and a big part of that is doing summary notes for these different scientists.
00:45:33
Speaker
And that takes up a lot of time, to be thoughtful and put everything together. And I'm just like, you know, this is someone who has an extremely specialized skill set, you know, a PhD in psychology,
00:45:43
Speaker
and doing these things where I'm like... you know, they can't use AI in their jobs right now. But, yeah, this could easily be done immediately with AI and then just have someone review it. And then it frees everyone up to do the more meaningful part of their work.
00:46:05
Speaker
And so I think that's what I'm most excited about: cut out the busy work that everyone's doing and let folks focus on what they're specially equipped to do.
00:46:20
Speaker
Yeah, absolutely. Computers have never been great at replacing the work that is complex and meaningful. The point is they go do stuff that we don't want to do and that seems kind of grindy.
00:46:31
Speaker
Yeah. Yeah, exactly. There's just more of those things to do.
00:46:39
Speaker
Since we're just chatting now, you went to Balsa after I did.

Startups vs Large Companies

00:46:44
Speaker
Back to Airtable, you know, another big hyper-growth startup.
00:46:51
Speaker
Startup. I know. What are your lessons in operating in, you know, a thousand-person company, a five-person company, a thousand-person company?
00:47:04
Speaker
Yeah, I think in the five-person company, it feels like the purest form of the work we do.
00:47:16
Speaker
Because the product either works or doesn't. And you're really fighting for survival, you know, and so everything's really pure.
00:47:27
Speaker
I feel like when you're at a larger company, you know, you contribute features, but the company is default surviving. It's kind of going to move on with or without you and have that,
00:47:42
Speaker
you know, lumbering continuance. And so I feel like I love the pureness of startups. You just really get the feedback. Nothing's sugarcoated because it can't be sugarcoated.
00:47:56
Speaker
You're just, yeah, you're truly building. And so I really loved that experience. I think it's also extremely stressful, you know, like
00:48:10
Speaker
I think someone told me this before I started at Balsa, and they had done startups in the past. They were like, it's really an emotional journey. You know, the hours are hard, they're long, but it's really the emotional piece.
00:48:25
Speaker
And I was like, I don't know what you're talking about. I'm sure. Yeah, you get it. And like, look, I meditate. I drink water every day. I can manage my emotions. And then you get into it.
00:48:39
Speaker
Yeah, I'm like, I'm pretty emotionally intelligent. But yeah, it's crazy. I think it was also, like, the best of times.
00:48:50
Speaker
Yeah, I mean, we put together such a great team at Balsa. And you really get to build product, and then you really get to build a company. And I think that's a piece that I hadn't really considered. You get to shape the culture and you get to make really intentional decisions. And I think that was such a gift.
00:49:12
Speaker
And I think just seeing how much impact you can have made me realize that, you know, even at these larger companies, I can have more of an impact than I initially thought I could.
00:49:28
Speaker
Yeah. Yeah. You mentioned the clarity of saying, we have to do the messages migration or, like, IBM's going to work. And that really stuck out to me, because you don't get that clarity so often at bigger companies.
00:49:41
Speaker
You really don't. Yeah. And it's so motivating to have that clarity. I mean, this is something that drives me and gives me a lot of energy at work: being able to have a really clear impact and say, you know, I was able to do this and here's what it changed. And
00:50:01
Speaker
here's why that matters. And, yeah, at these larger companies it's pretty rare to get that opportunity. I think I was very fortunate at Slack to get that a couple of different times. I also worked on WYSIWYG, which felt like a huge project.
00:50:16
Speaker
But yeah, at other companies, you know, an initiative works or it doesn't. I think at Airtable, you know, Balsa was acquired to try out something new: create a point solution on top of the Airtable platform that can try and compete with other true point solution companies.
00:50:46
Speaker
And I think it was a really different motion for Airtable. And I think that had some internal friction. And, you know, it's gone well from the sales side. I don't know if it makes total sense on the internal engineering side, but it's something where an initiative that is this big can fail, or not see its full potential, at a company that's larger.
00:51:11
Speaker
The company still continues, you know? And so, yeah, I think I really love having that really clear impact. And this project for Slack and messages was definitely that.
00:51:24
Speaker
Yeah. Yeah, that makes a ton of sense. I'm three acquisitions deep myself. And that alignment is just challenging. No matter what you do. I've been part of companies where you do it immediately and companies where, like, four years later, people are still fighting about it.
00:51:41
Speaker
In both cases, the grass is always greener. It's never perfect, but the company continues. Yeah, they do. Exactly.
00:51:52
Speaker
Cool.

Contact Information

00:51:53
Speaker
Well, we could talk all day, but we should wrap up. So final question. Where can people find you on the Internet if they want to continue the conversation? They can find me at M, letter M, at Madeline Short dot com. That's my email address. You can also find me on LinkedIn. Feel free to message me. Yeah, it was really fun.
00:52:18
Speaker
I'm digging this back up from my memory and getting to relive it. I still feel very proud of this team and this project. And so it's always fun talking about it. It's an amazing project. Thank you so much for coming on.
00:52:31
Speaker
Yeah, of course.