
Inside Snapchat’s BOLD Code Migration: Faster & Leaner Rebuild

Tern Stories
27 plays, 3 months ago

In this episode, we dive into Snapchat’s bold code migration, a project that reshaped the way their app runs at scale.

This code migration wasn’t just about rewriting code, it was about survival, speed, and creating a leaner system that could handle hundreds of millions of users.

You’ll hear how the team faced challenges like supporting long-dead features, rebuilding APIs bug-for-bug, and even migrating conversations instead of entire users.

These behind-the-scenes stories reveal the hidden complexity of a massive code migration and the lessons that can help any engineering team facing similar challenges.

Whether you’re an engineer, a founder, or just fascinated by how the biggest apps in the world evolve, this deep dive into code migration will show you what it really takes to pull off a technical transformation at scale.

Connect with Ben Hollis ➡️ https://stately.cloud/

Get Tern Stories in your inbox: ➡️ https://tern.sh/youtube

Transcript

Snapchat's Challenges and Frustrations

00:00:00
Speaker
I don't know what I'm doing here. We're not shipping the way we want to. We're bleeding users. It's time for a Hail Mary. If I send you a photo and you view it, your device has to say, I viewed this photo, so the system can know to delete it.
00:00:12
Speaker
If you're an engineer, you go, wait, how does that work? Right? And the answer is, well, it can't provably do it. He says, I've got a great idea for how the app should be. Please redesign it in a month.
00:00:23
Speaker
Why don't we just try it? Like, why don't we just do it? Because what do we have to lose? Today, I have Ben Hollis here with me. Ben is an engineer who's worked at Amazon and Snap.
00:00:36
Speaker
He spent 10 years at Amazon working on internal tools, databases, and Amazon Go, the store where you pick things up, walk out, and it charges you for it. And then he spent eight years at Snap, where he led the infrastructure team.

Introducing Ben Hollis and His Impact

00:00:52
Speaker
And that's what we're going to talk about today, where he was one of the most senior engineers, the most senior engineer, maybe. One of, yeah, yeah, for sure. I'll give it one up. One of the most senior engineers, who actually led a full-stack rewrite of how Snap worked, all the way from infrastructure up through client libraries, iOS and Android.
00:01:14
Speaker
And today we're going to walk through that journey, what Ben learned about it, and how we move forward from there. So Ben, welcome to the show. Yeah, thank you. I'm glad to be here.
00:01:27
Speaker
So, I was looking up some numbers from Snapchat. Snapchat's a fascinating company because, I remember there was a whole social media war in 2015 or '16 where the Snap numbers were growing. It was insane. It was like a hundred million, 200 million DAUs.
00:01:48
Speaker
And then Instagram launched Stories. And six months later, for the first time ever, the numbers lurched. Like, you all actually lost users.
00:01:58
Speaker
Yeah. Yeah. It's a tough time. Right. It does not seem like a pleasant time. It's shortly after you joined, right?

Snapchat's Competitive Pressures and Redesign Efforts

00:02:06
Speaker
Yeah, right. Yeah. I joined in 2015. So, you know, we had been there a year or two, right?
00:02:12
Speaker
So yeah, we're early, early on. But the journey after that, I feel like the whole world wrote off Snapchat after that. The journey, though: Snap absolutely re-accelerated. Today they're 500 million or so daily. Almost, yeah.
00:02:28
Speaker
Almost there. Yeah. Yeah. Which is an enormous number. It's like twice what Twitter, or X, whatever, has, and they won't even tell us at this point. That was the last I heard.
00:02:43
Speaker
Yeah, Twitter was always monthly users, and Snap always reported DAU, which is like totally incomparable. And yet still much higher. Still much higher, yeah. A huge effort.
00:02:56
Speaker
But this full-stack rewrite, from infrastructure through client, happened right after that lurch. So tell me a little bit about what was going on at Snap at the time. What prompted this rewrite, and how did that feel after that decline?
00:03:16
Speaker
I mean, it was pretty insane, right? Like you said, Zuck had noticed the acceleration of Snap as well, and had famously tried to purchase Snapchat a few years earlier.
00:03:30
Speaker
Right. And then, you know, at some point decided, why compete when you can clone? Right. And so the Stories clone had come out. We were, well, I don't know how we were reacting to it exactly, but
00:03:47
Speaker
we didn't really react to it. And then, I'm trying to think, it was October or November of that year, Evan, the CEO, came back with an idea for a redesign of the app, a really drastic redesign that moved Stories around and put it on the left side of the app, redid how we were going to think about content, all this stuff.
00:04:11
Speaker
And that started a death march, like a real-deal death march, because he was like, well, and this is relevant, he was like: back in the day, when it was me and my buddies sitting around the table in my dad's house, we would ship things really, really fast. And people have not been shipping things fast at this company in the last couple of years. And I'm frustrated about that.
00:04:32
Speaker
I have this cool new idea. I've just seen some cool apps in China that do everything and blah, blah, blah. He comes back, he says, I've got a great idea for how the app should be. Please redesign it in a month and rebuild it completely.
00:04:48
Speaker
A month. Yeah, yeah, yeah. And our team had just actually come off of a death march shipping a separate feature. And so we were immediately retasked to rebuild some portion of this grand redesign.
00:05:00
Speaker
And so everybody did that. And actually, we were up in Seattle, but most of the team was down in LA. They all got stuffed into an aircraft hangar and spent a month at long tables, typing away at code, you know, Android and iOS code. It was a mostly client-focused redesign, right? The backend wasn't really touched. It was just moving everything around.
00:05:28
Speaker
And as you can imagine, that wasn't a super fun process for anybody involved, right? It just wasn't. Maybe not pleasant. Yeah, yeah. And so the month came and went and we hit the deadline. We rebuilt it as we were supposed to.
00:05:48
Speaker
And we didn't ship it then. They said, oh, we're coming into December. Apple traditionally takes the month off. Oh my God. You know, we've got a few things we still want to do. We're not going to actually launch it at the deadline.
00:06:02
Speaker
So, you know, we had just run off the end of the cliff, very tired.

Ben's Vision for Snapchat's Infrastructure

00:06:07
Speaker
I flew down to LA to visit everybody at the hangar and see how they were feeling. The mood was not ideal.
00:06:16
Speaker
But what's important in this part of the story is, touring the hangar, I got pneumonia. I got really sick. Because it was all these humans crammed into one place. And again, this was 2017. We didn't know what it was like when you put a lot of humans in one place.
00:06:33
Speaker
Right. We didn't know about illnesses and things. So I got really sick, like really, really, really sick. So I was out of commission for the rest of December. And I came back from that healed up and I was like, I don't know what I'm doing here. Right. Like, I don't know.
00:06:53
Speaker
I don't have a thing to do. You know, the momentum had stopped and I just didn't have an idea. And so I was like, I don't know if I want to continue at Snap.
00:07:03
Speaker
That's the question. Right, yeah. And I'm really doing the soul searching. My momentum had been cut by being sick. And I went, if I'm going to stay here, I need to do something really meaningful and really big.
00:07:20
Speaker
And having just been in the guts of all the systems, I said, you know, I've been working on the side on all these, you know, other systems. I want to attack the core of the app.
00:07:33
Speaker
And that meant messaging, right? Snapping, chatting, video calling, all the core parts of why people adopted the app in the first place. And I thought, yeah, there are so many things about the architecture of that system. I saw that we couldn't ship the features we wanted to quickly, the way Evan wanted to, right? You know, he had gotten used to that: sit at the table, come up with an idea, ship the thing.
00:08:00
Speaker
We weren't doing that. Right. And I thought, it's the architecture of this that is causing that. And if you look back, it totally makes sense. And it was reasonable, right? Like, yeah,
00:08:14
Speaker
it had evolved from a very simple app. Early, early Snapchat was: you could only send pictures. Then they added chat, then they added group chat, and, you know, all these features, but they accrued organically. And so that architecture was just really, really messy, and you couldn't really change it in any reasonable way. So anyway, it's a digression, I guess, but I was like, I want to really redo that.
00:08:41
Speaker
I want to fix that. And then separately, I was like, well, another thing I could do is I could change the way we do our backend infrastructure, because we were in a huge monolith all on Google's App Engine, if you remember that.
00:08:57
Speaker
We were the biggest consumer of Google App Engine. And, you know, that had honestly worked really, really well for us, but we'd outgrown it. Right. So again, it was slowing us down. Everything was in one big blob.
00:09:09
Speaker
We needed to move to a service-oriented architecture or something like that. So I was like, if I don't do the chat thing, I'd like to work on this infrastructure thing and change the way we build and deploy and manage services and how they talk to each other and all that.
00:09:25
Speaker
So I went to our head of engineering and I said, here are these ideas I have. I'd like to do this. And he said, great, do them. Yeah, definitely, definitely do that. And I'm like, well, which one? He's like, just do all of them.
00:09:36
Speaker
Just do both of them. Yeah. It's like, all right, great. And I didn't really know what that meant at the time. But rather quickly, me and my co-founder here at Stately, who was at Snap at the time, we got given the messaging team and told, you know, all right, great.
00:09:57
Speaker
You had an idea, make it happen. Which is kind of wild. Terrifying. So tell me a little bit more about what it looks like to bite that off. Because conventional wisdom for even medium-sized migrations is like, what can you break up? What can you do separately?
00:10:15
Speaker
What can you define? What can you let burn? But it sounds like you both had the idea and then got the mandate to go blow it all up. Have fun. Yeah. Yeah. Yeah. There was a lot of buy-in, honestly, which was...
00:10:27
Speaker
Shocking. Like, you know, where was that coming from? I mean, I think everybody had seen it, right? They'd seen the slowdown in innovation and in the ability to ship, and how complicated things had gotten, and how tough it was for anybody to get anything out without breaking things.
00:10:45
Speaker
Yeah. And honestly, there was a feeling of why not at that point, right? Like, you know, go big or go home, sort of.
00:10:56
Speaker
We're in this place, we're kind of in a slump. You know, to rejoin the story of the redesign, it didn't go well, right? The redesign we had all cranked on. If you look at the curve of users for Snap, that's where it actually starts going down.
00:11:12
Speaker
Right. And some of that was Instagram Stories, but some of it was this redesign in the middle that, you know, for better or worse, just changed the way people interacted with the app, and they didn't like it. So we're in this phase where we're not shipping the way we want to. We're bleeding users.
00:11:30
Speaker
Things aren't great.
00:11:33
Speaker
It's time for a Hail Mary, right? And to be clear, we weren't the only Hail Mary. There was a lot of stuff going on, right? Everybody was scrambling to do stuff. But in our part of the world, the idea was...
00:11:46
Speaker
get us into a great place for messaging where we could continue to innovate on features. The other thing is, the infrastructure cost was insane. And I can't possibly quote real numbers, but you wouldn't like them.
00:12:02
Speaker
It was far too high. And, you know, again, as a consequence of it being a system that had evolved, right?
00:12:13
Speaker
Cost wasn't a huge concern, but now when you're paying historic checks to Google and Amazon, you start to care about it, right? Yeah, absolutely. I do remember

Complexities of Snapchat's Architecture

00:12:27
Speaker
the Snap S-1, which was around this time, had numbers to Google that were promised and specific, and they got a reaction on the internet. Like, easily hundreds of millions of dollars. It was a lot.
00:12:39
Speaker
Yeah. So we thought, well, we can fix that. We can fix the innovation slowdown. We can fix everything all at once. And you're right. It's totally insane.
00:12:50
Speaker
Like, we knew at the time it was insane, but there was this mood of like, well, why don't we just try it? Like, why don't we just do it? Because what do we have to lose, right?
00:13:04
Speaker
And so the idea of breaking it up,
00:13:10
Speaker
Logically, we should. Right. And we all knew that. Right. Almost all of our team had come from Amazon. We'd seen huge migrations there.
00:13:20
Speaker
We knew the ones that went well and the ones that went badly. We knew all about how to do this stuff incrementally. But we sat down at the whiteboard and just started sketching out how these systems would work. And we're like, okay, we're going to need a thing over here. And this is going to talk to this. And then we're going to need a system that looks like this and looks like this.
00:13:39
Speaker
And the more we wrote down, the more excited we got. And we were just kind of like, we can actually just make this stuff happen and the outcome will be worth it. Well, you know, that never works. To be clear, it never, never works.
00:13:51
Speaker
When you get excited about that and you draw the huge system, you never actually get to make it. Now, we did, just to spoil it. Yeah, it shouldn't have worked. It shouldn't, but spoiler, it did.
00:14:04
Speaker
Tell me, I think one of the things you mentioned as we were chatting before this, which I thought was super fascinating, is that when you do these incremental migrations, you sort of define the value really well. If we're going to upgrade, you know, the tiniest thing is this service gets a bump and it will have these three things, but there are full system properties that
00:14:30
Speaker
are not available if you're doing things incrementally. And similarly, we've talked about DAUs a couple of times. That is the number that you all were measured on internally. Tell me a little bit about that, because you were operating at a totally different altitude here.
00:14:46
Speaker
What full system properties were available to you? And how do you start to think about like, here are the pieces that'll line up, that'll deliver those that we could not possibly do in smaller chunks?
00:14:58
Speaker
Yeah, I think the one that was kind of most critical there was the interaction between the client and the backend, right? So Snapchat has a lot more logic on the client than you would expect.
00:15:13
Speaker
The client's super fat, super smart. And a lot of that comes back to the way Snapchat works, right? So Snapchat famously deletes your photos.
00:15:24
Speaker
If you're an engineer, you go, wait, how does that work, right? And the answer is, well, I mean, it can't provably do it. Right. And I'll say, we do delete the photos. Everything that's in the TOS happens.
00:15:40
Speaker
But it's a process that has to have cooperation from a client. Right. So if I send you a photo and you view it, your device has to say, I've viewed this photo, so the system can know to delete it.
00:15:53
Speaker
Right. So the client is much more interactive than, say, I don't know, a WhatsApp or something that just has to send or receive messages and call it a day and nobody cares. So we had this really, really fat client.
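The client-cooperation idea described here can be sketched in a few lines. This is only an illustration under stated assumptions, not Snap's actual protocol: `SnapStore`, `send`, and `ack_viewed` are hypothetical names, and the store is an in-memory dictionary. The point it shows is that the server can only delete a snap once each recipient's device has reported the view.

```python
from dataclasses import dataclass, field


@dataclass
class SnapStore:
    """Hypothetical server-side store keyed by snap id (illustration only)."""

    # snap_id -> recipients whose devices have not yet reported a view
    pending: dict[str, set[str]] = field(default_factory=dict)
    # snap_id -> stored media bytes
    media: dict[str, bytes] = field(default_factory=dict)

    def send(self, snap_id: str, payload: bytes, recipients: set[str]) -> None:
        """Store the snap and remember who still has to view it."""
        self.media[snap_id] = payload
        self.pending[snap_id] = set(recipients)

    def ack_viewed(self, snap_id: str, recipient: str) -> bool:
        """Record a device's view report; delete once every recipient has reported.

        Returns True only if this call caused the snap to be deleted.
        """
        waiting = self.pending.get(snap_id)
        if waiting is None:
            return False  # unknown snap, or already deleted
        waiting.discard(recipient)
        if waiting:
            return False  # someone still has not viewed it
        del self.pending[snap_id]
        del self.media[snap_id]
        return True
```

If a device never sends its report, the snap simply stays in `pending`, which is why the speaker says deletion cannot be proven from the server side alone.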
00:16:04
Speaker
And what that meant is that we couldn't really get all the benefits that we wanted by just fixing the backend. And we couldn't really fix just the client without updating the backend, because the backend was entirely built to support the client as it worked
00:16:23
Speaker
at that moment, right? So you have this interlocked system. And we had tried. Now, let's be clear. We had tried, like, we're going to focus on just the client. We're going to make the client great.
00:16:34
Speaker
But the interaction models, the APIs, the exact way they worked, how data could be loaded, when it could be loaded. You were really, really constrained. So there was only so good we could make the client in isolation.
00:16:47
Speaker
And then we had tried to just make the backend better, but the client is interacting with it in a certain way. You can't really improve it there. So that's why we really had to, and we did much more than this, but at the very least, we had to keep the client and the server being rebuilt at the same time.
00:17:06
Speaker
Right? So that's kind of how we backed into that. Interesting. That makes a lot of sense. What were some of the hairier features you had to support there? I know that Snap has tried a lot of stuff, and there's this famous model, as you're alluding to, where Evan drives that company a lot more directly than at a lot of other companies. Yeah, yeah.
00:17:33
Speaker
Yeah, I mean, there were two kind of categories there. One was the more mainstream features that were just really, really complicated.
00:17:45
Speaker
And you go again, like, well, what's so complicated about Snapchat? The most complicated thing we hit was the feed. And I don't know if you're familiar with Snapchat, but the feed has, you know, every conversation that you have with your friends or groups or whatever. And then it has a little icon, and the icon is red or is purple.
00:18:06
Speaker
Nobody, aside from a couple of people, knows what red and purple mean, but it's very important it gets it right. And then there's a little text that says, you know, something about the last action.
00:18:18
Speaker
And then when you tap it, different things happen. Maybe it plays the snaps, but if there are chats, it might go into the chat. But if there are chats and snaps, then it has to play the snaps in a certain order and then get you to the chats.
00:18:29
Speaker
That logic of even just showing what to show and then what the interaction is, is mind-blowingly complicated.
00:18:37
Speaker
Really, really, really, like, very, very difficult. And it was one of those things where, you know, you had the load-bearing engineers on both iOS and Android, where this person knows how the feed logic works.
00:18:50
Speaker
And they're the only person who really knows it, and they have it in their head. If you need to change something, you have to go talk to that person. Right. There was a lot of that sort of stuff. And then on top of that, again, because the client was super smart,
00:19:02
Speaker
there was an Android implementation and an iOS implementation. And they were written by different people at different times. And the logic was slightly different. So that was the other thing: you could ask, oh, well, how's it supposed to work? And you go, well, on Android or iOS?
00:19:17
Speaker
Which one, right? Yeah, and so that sort of stuff was really, really, really complicated. The other category that we ran into was features that honestly didn't exist anymore.
00:19:32
Speaker
Right. So a good example was Snapcash. For a hot minute, you could send money through Snapchat. Oh, right. Yeah. Everybody was doing it at the time. And there was a really cool announcement video, and the feature was out there, and people did it.
00:19:48
Speaker
And then that got pulled out of the app. But people could have saved those messages. So you still need to show them in your archive of messages if somebody had saved, you know, a receipt for something, but you can't send them anymore. And you're still supporting that sort of stuff. Yeah. Some of that stuff is really crazy to support because, you know, how do you even get one of those, right? You have to go find one and see what the format is. And of course, none of this stuff was documented or anything like that. Oh, of course not.
00:20:21
Speaker
The feature doesn't exist anymore. Why does it need to exist? Right, right. I mean, but yeah. And again, it's all logical, right? These were features that were built as an experiment, really, really quickly thrown out there. Okay, they didn't work. Pull it

Innovative Infrastructure Solutions at Snapchat

00:20:35
Speaker
out.
00:20:35
Speaker
That's fine. But then when you put in the constraint of, well, I'm going to rewrite the entire system, it needs to support every message that's ever happened. Well, now you have to care about this thing again that was just a one-off, right?
00:20:46
Speaker
Yeah. Interesting. Oh, I love that. How did you think about the, I'm really fascinated by the full-stack nature of this migration.
00:20:58
Speaker
You know, there's a whole write-up on the Snap Engineering blog about your work around Envoy and what the control plane looked like. And at the same time, you're off on the client rewriting logic, figuring that out. How did you get people... I don't use the word alignment, because that feels very, like, management-y. But if you're trying to create whole system properties, you've got to get people talking to each other. Like, what language did you use?
00:21:28
Speaker
It was tough, right? There are two things here, right? I mean, Snap was not a very command-driven
00:21:41
Speaker
hierarchy within engineering. So obviously for product, what Evan says goes, and he has his group of designers that work with that. And so how the product works, how it looks, what it does: very command driven.
00:21:55
Speaker
For engineering, it was a lot more, you know, independent silos and different groups doing different things and not a lot of commonality, honestly. So this was another place where we drew from our experience at Amazon, although there are some things that would be a little easier at Amazon.
00:22:16
Speaker
But the idea there was, for infrastructure tooling that we wanted people to use, we just were going to build it and hope it was so good that people had to adopt it, right? Like, that they saw it and they went, wow, I couldn't live without this, right?
00:22:34
Speaker
And so that actually influenced the design of a lot of this stuff. So, you know, it was all meant to be very loosely coupled and easy to compose. And a lot of the stuff, like you said, we were doing with Envoy,
00:22:48
Speaker
and we had a really cool little control plane for that, was to make it so that teams could own services, be loosely coupled to each other, and then declare, you know, whatever, a retry policy or a timeout or a route to a different region or something like that in some really declarative way and not have to coordinate. So in a lot of ways, the software itself was enabling that loose coordination between teams.
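As a rough illustration of what "declare a retry policy or a timeout in a declarative way" can mean, here is a minimal sketch. It is not Envoy's configuration format or Snap's control plane; `RoutePolicy`, `POLICIES`, `call_with_policy`, and the values in the table are all hypothetical. The owning team writes down the policy as data, and shared code applies it, so callers never hand-roll their own retry loops.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RoutePolicy:
    """Hypothetical per-service policy, declared as data by the owning team."""

    timeout_seconds: float
    max_attempts: int
    region: str


# Hypothetical policy table; in a real system a control plane would distribute this.
POLICIES = {
    "messaging": RoutePolicy(timeout_seconds=0.5, max_attempts=3, region="us-east"),
}


def call_with_policy(service: str, request: Callable[[RoutePolicy], T]) -> T:
    """Run `request` under the declared policy, retrying on connection errors."""
    policy = POLICIES[service]
    last_error = None
    for _ in range(policy.max_attempts):
        try:
            # The request receives the policy so it can honor timeout and region.
            return request(policy)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError(f"{service}: all {policy.max_attempts} attempts failed") from last_error
```

Changing the retry count or region for a service then means editing one table entry rather than coordinating a code change with every calling team.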
00:23:13
Speaker
Like, if we had a really command-driven hierarchy and everybody could be told, you're doing this now, you're all writing everything in Go and you're doing the blah, blah, blah, blah, I would have architected something else.
00:23:25
Speaker
I would have built something slightly different because we could, you know, that's like the Google model, right? Just tell everybody what to do and they all do it. Yeah. It doesn't work there either, though. Yeah. Sure. Sure. Sure. But they don't have to ship products. Oh, yeah.
00:23:42
Speaker
But yeah, so in a lot of ways, I felt like the infrastructure part was architecture, like engineering for the organizational properties that we had, and building software that meshed with how people were thinking.
00:24:02
Speaker
And part of that was, we didn't have to run around and convince people quite as much, right? It was designed specifically to help avoid having to have those conversations.
00:24:13
Speaker
Yeah, absolutely. Going back to the idea that engineering is loosely organized: if you put something out there and you've got this Hail Mary kind of moment that's happening, people are standing on the sidelines going, oh, I can pick up this thing and run with it. That makes a ton of sense.
00:24:29
Speaker
Exactly. So if you were a team that didn't want to be on App Engine anymore, and a lot of them didn't, and you wanted to use Go or you wanted to use Python or something like that, what were your options? Well, you could forge ahead on your own or you could use this new thing that we built that's really cool that plugs into everything else. And well, you're going to use that, right?
00:24:46
Speaker
I've met like three engineers who would forge ahead on their own, and nobody else wants to. Yeah, yeah. And that's the same thing. We learned that at Amazon, too, where really early on, we built these service frameworks. And there was no mandate to use them, but everybody used them, right? Yeah, absolutely.
00:25:01
Speaker
Did that approach, which is super common in infrastructure, did that scale? Like, did you see the same kind of build-it-and-they-will-come approach work for, say, the front end and the mobile clients? Yeah.
00:25:18
Speaker
Front end was especially hard, actually. That's a really interesting one to bring up.
00:25:27
Speaker
Front end for Snap meant Android and iOS mobile apps, right? And folks have a real identity around that stuff, right? I don't see too often people saying like, oh, I'm a, whatever, Go developer. Sometimes you'll say, oh, I'm a Rails developer, I'm a Django developer, or something like that.
00:25:48
Speaker
But people really are like, I'm an iOS developer. Like, I specialize in the iOS platform. And a lot of times that means specializing in Apple or Google's APIs, you know, the way that that things are supposed to be built there.
00:26:04
Speaker
And then when you scale out to something like what we did with the messaging thing, where we said, well, you're not going to write in the platform language, you're going to use this C++ layer that's shared between Android and iOS, there's a lot of pushback on that, right? It's very, very unfamiliar.
00:26:22
Speaker
And so those took a lot more convincing, honestly, to say, okay, you know, yes, you're going to have to let go of a lot of this code, but I promise you it'll be better.
00:26:36
Speaker
It's a hard argument to make, right? Until you've at least built the shell of it and put it in there and gone, look, now you're not having to worry about coordinating every request and, you know, computing the new feed update model or whatever it is. Right. And then they say, oh, OK, yeah, actually, that is pretty easy. Right. But yeah, that one was a much harder pill to swallow, I think.
00:27:00
Speaker
Yeah, that's interesting. So you actually went in and shipped code APIs for them. That's like: here's a library, use this, and it solves some of your problems. So it was really philosophically the same. Like, I'm going to provide you with internal tooling that solves the problem. But it's really interesting you ran into that kind of resistance. Yeah.
00:27:20
Speaker
Yeah, I mean, and again, it was just a really, really big change. As a counterexample, Snap has an internal cross-platform UI framework that I think they're open sourcing pretty soon.
00:27:35
Speaker
That one was more of a build-it-and-they-will-come sort of thing, where it was built off on the side, used for a few features. Some people saw it and went, wow, this is really, really great. So then they used it for their feature and it sort of spread organically.
00:27:48
Speaker
So it can happen, right, is what I'm saying. But I think it's harder on the front end and especially mobile, right? Yeah, it makes a ton of sense.
00:28:00
Speaker
So tell me, if we zip ahead in time a little bit, tell me a little bit about landing this plane. Like, what were the big chunks of work that landed, and how did you see those roll up into, yeah, we did it?
00:28:14
Speaker
Yeah, it was. And I'll apologize if my memory is a little fuzzy on some of this stuff, so I might get the order a little bit wrong.
00:28:26
Speaker
But basically, we built an entirely new backend, right, as a separate service, separate database. I mean, the old stuff was in GCP, the new stuff's all in Amazon DynamoDB, completely new API, completely new service, blah, blah, blah.
00:28:42
Speaker
And then we built a backwards compatibility layer in Java in the App Engine layer that could call either our new service or the existing stuff.
00:28:56
Speaker
And so we supported all the existing APIs in terms of the new service, right? That was like the craziest part of the whole thing, right?

The Migration Strategy and Execution

00:29:05
Speaker
Because then we had to be bug-for-bug compatible with the old system on a completely new API, which we had just built, which we didn't know worked yet, right?
00:29:14
Speaker
Right, because we hadn't built the client for it yet. Really, or we were in the middle of building the client for it, right? So we were like, I think this is the model for this. And then you'd build your compatibility layer and you'd say, well, we totally forgot about this type of message that we'd never heard about, right? That has these types of properties. We need to build that in.
00:29:33
Speaker
So it was very, very iterative. So we built that, we launched that, and we had metrics, right? So this was like the most important part: we did a dual read-write mode where, on a per-conversation basis, we could say, well, I'm going to clone it in the background and then
00:29:52
Speaker
split the writes, and you're still using the old system. Your data is mastered in the old system, but there's a copy. And then we diff everything at every API: every response, everything coming in, everything coming out.
00:30:06
Speaker
And then we publish metrics. So we'd be like, all right, bog standard text messages, hundred percent match, no problem. We've got it. For snaps, ooh, every once in a while, there's a difference.
00:30:20
Speaker
What is that? Okay, well, I'm going to go in and investigate. I'm going to look at the logs. Oh, yeah, okay. In 2010, there was a type of message that was written for a couple of years that, you know,
00:30:32
Speaker
had these properties and then that got, you know, et cetera, et cetera. And then also, you know, keep in mind, we're reinventing every other system at the same time. So that's causing other changes. That diffing system was like the key to making this seamless, right?
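The diff-and-publish-metrics idea can be sketched roughly as follows. This is a minimal illustration under assumptions, not the system Snap built: the class name, the per-message-type counters, and the response shapes are all made up for the example.

```python
from collections import Counter


class ShadowDiffer:
    """Compares old and new responses and keeps per-type match counters."""

    def __init__(self):
        self.metrics = Counter()
        self.mismatches = []  # kept for later investigation in "the logs"

    def compare(self, message_type, old_response, new_response):
        if old_response == new_response:
            self.metrics[(message_type, "match")] += 1
        else:
            self.metrics[(message_type, "mismatch")] += 1
            self.mismatches.append((message_type, old_response, new_response))
        # The old system stays authoritative while in dual mode.
        return old_response

    def match_rate(self, message_type):
        matched = self.metrics[(message_type, "match")]
        total = matched + self.metrics[(message_type, "mismatch")]
        return matched / total if total else None
```

A dashboard built on `match_rate` would show something like the situation described above: text messages at a 100% match, snaps slightly below, with the stored mismatches pointing at the forgotten message types.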
00:30:49
Speaker
If we didn't have that, it would have been, I don't know, YOLO, works on my machine, right? Like, you know, because really quickly we had, oh, I can bring up my Snapchat. Oh, I'm using the new backend. Like, it works. It's great.
00:31:00
Speaker
Well, I can build Snapchat in a weekend, especially today. But does it really work for the trillions of saved messages? You're going to want a little better assurance for that, right? Yeah.
00:31:12
Speaker
So that was like step one. And even that we did separately for one-on-one chats versus group chats, because in the backend they were entirely different systems.
00:31:24
Speaker
And then we were rolling out this new client, right? This cross-platform C++ client that had been built from the ground up. And that only spoke the new API, right? Because we weren't going to put all the effort in to make the client speak both the old and the new API.
00:31:42
Speaker
So now you have, okay, with the new API. So now you have to coordinate a migration in the backend where conversation by conversation, you might switch it over. And then the client can say, well, of the 300 conversations I have going on, some of them are on the new system and some of them are on the old system.
00:31:58
Speaker
So I get to switch which API I use and, you know, which library you use dynamically, et cetera, et cetera. And it's actually way, way more complicated than what I'm saying because you also did this on a conversation level and on the feed level. It was a whole thing.
00:32:12
Speaker
Very, very, very, very complicated. But the point is, the key of it and what made it even possible was we were breaking down the problem into the smallest chunk we could, right? So if I could say, well, you don't migrate a user, you migrate a conversation, just one conversation at a time, right? That is something, and you build all the infrastructure to be able to flip, okay, this conversation is in dual write mode, okay, now it's fully migrated.
00:32:40
Speaker
I understand those states. I can communicate that to the client. The client can then choose which library to use to actually interact with that. That made it very, very controllable. And it also meant that we could start rolling out and, oops, something went wrong, roll it back.
00:32:53
Speaker
Clients never know, right? um Nobody ever noticed any of this stuff was happening. There weren't outages. There weren't bugs. Like nobody saw it. Yeah, nobody nobody noticed it during this entire thing.
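The per-conversation states mentioned here (old system, dual write, fully migrated, plus rollback) could be modeled along these lines. This is a sketch under assumptions: the state names, the allowed transitions (including stepping back from migrated to dual write), and the library labels are all hypothetical.

```python
from enum import Enum


class State(Enum):
    LEGACY = "legacy"
    DUAL_WRITE = "dual_write"
    MIGRATED = "migrated"


# Assumed transitions: move one step forward, or roll one step back.
ALLOWED = {
    State.LEGACY: {State.DUAL_WRITE},
    State.DUAL_WRITE: {State.MIGRATED, State.LEGACY},
    State.MIGRATED: {State.DUAL_WRITE},
}


class MigrationRegistry:
    """Tracks migration state one conversation at a time."""

    def __init__(self):
        self._states = {}

    def state(self, conversation_id):
        # Conversations nobody has touched are still on the old system.
        return self._states.get(conversation_id, State.LEGACY)

    def transition(self, conversation_id, target):
        current = self.state(conversation_id)
        if target not in ALLOWED[current]:
            raise ValueError(f"cannot go from {current.value} to {target.value}")
        self._states[conversation_id] = target

    def client_library(self, conversation_id):
        # Only fully migrated conversations are served by the new client path.
        if self.state(conversation_id) is State.MIGRATED:
            return "new"
        return "legacy"
```

Because state lives on the conversation rather than the user, a client with hundreds of conversations can hold a mix of both paths, and a bad rollout only has to be undone for the conversations that were flipped.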
00:33:06
Speaker
But it was because we had that infrastructure in place to make that sort of stuff happen. Yeah, that is okay. That makes it all make much more sense. You weren't rewriting the whole system. I mean, you were, but you weren't rewriting the whole system all at once that by picking not user, not screen, but conversation through this, you were, you were full stack rewriting each of the types of data or the types of like interaction that snap provided.
00:33:37
Speaker
That is still a big deal. It's a lot of work, but it makes a lot more sense because not only are you moving them over one at a time, you're moving them over. You've got levers at every layer by vertical slice. So you're just, everything is just a little piece and you can do it incrementally.
00:33:54
Speaker
Yep. And then all backstopped by those metrics that make sure you know, that that we didn't mess something up in the middle. Right. Yeah. Yeah. Those metrics are are absolutely key. That's a theme.
00:34:05
Speaker
That's like a theme of this show at this point. It's, you know, open up a Slack works. It's a way that, like, Datadog cut over the APM product, the new data store, that you need to know, is this thing working? And it's not just like, did I test it out, right?
00:34:20
Speaker
Just last week, we were talking with Renard at Sourcegraph, and he rewrote the query parser. It's the same thing. It's just like, does the new system parse the queries in the same way as the old? I think he used the phrase bug-for-bug compatible as well. Yeah, yeah, exactly.
00:34:35
Speaker
I feel like there's really a discipline and a science to this and like a methodology. Yeah. They should teach it in school, right? Like there should be a class on migrating from one thing to another because it's important and it always has to happen eventually.
00:34:52
Speaker
This idea of grouping by user-level visibility, like, that's the grad-level version of this. The default version of this is like, I look at a request and I do the compare, but I have to either do something very coarse about who gets flipped over, like organizations, or it's just like totally random.
00:35:12
Speaker
And then you just, you know, kind of shrug if the user doesn't get the right answer. um But this is, this is really smart. I love that. Yep. And we ended up, you know, repeating that pattern.
00:35:23
Speaker
Once we had that machinery for doing migrations, we were like, wow, we're drunk with power, basically, right? Like we can migrate anything. So that was like one of the follow-ons we did right away was, and you know, this part of our original plan, but we regionalized the service. So it previously had been only one region.
00:35:40
Speaker
But we said, OK, well, I would love it if your data was close to you. If you used Snapchat from India, it was a very bad experience because you had to go all the way to Iowa. ah And you know we talked to a lot of people about how fast the speed of light was.
00:35:54
Speaker
And we suddenly realized, oh, well, we built this system for moving data from one place to another and making sure it actually matches, and we can just do that for regionalization. So we reused that whole system to actually, once we had the new system, move data from one place to another.
00:36:12
Speaker
And so I would never say it was easy, but it helped. You mentioned when we talked before that there are some experiences about messaging in Snap that no other platform has been able to replicate.
00:36:27
Speaker
What were those? I'm trying to remember exactly what I was saying there. I mean, a lot of it, this is just me bragging about how much we did, but like we had an obsession with cost, right, which translated to an obsession with size.
00:36:49
Speaker
So the size of messages, um because it directly turns into your bandwidth bill for Dynamo, it turns into your storage bill, right? um But we realized that it also turns into, well, how many packets does the cell tower have to exchange with the device to get a message in?
00:37:07
Speaker
And as a result of that obsession with size, we pretty much kept everything in one packet, you know. Yeah, I mean, any snap that you get sent, like not the image itself, but the data about the image, is one packet.
00:37:22
Speaker
Any chat, assuming you didn't write a lot of text, right, fits in one packet. There's a variable component, right? um And so because we we had cranked the metadata down and like, boy, we really puzzled out every last byte from these things.
00:37:36
Speaker
We could keep everything in one packet. Having one packet meant fewer retransmits in a bad network condition. So you actually will find if you like go out into the boonies where the cell signal is pretty bad, Snapchat will work.
00:37:51
Speaker
You can still chat with people. It might be a little slower, but it'll still work. Facebook Messenger, not going to happen, right? iMessage, not as good, right? So it's kind of wild to be like, oh, I can't iMessage people, but I can still use Snapchat. Huh, I guess we did a pretty good job, right?
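To make the size argument concrete, here is a small sketch comparing a JSON encoding of message metadata with a fixed-width binary one, and checking each against a single-packet budget. Everything here is an assumption for illustration: the field layout, the 1200-byte budget, and the use of Python's `struct` in place of the gRPC/protobuf encoding the source mentions. Transport overhead such as TLS and HTTP/2 framing is ignored.

```python
import json
import struct

# Assumed payload budget in bytes; the source gives no actual number.
PACKET_BUDGET = 1200


def encode_json(sender_id, conversation_id, sent_at_ms, text):
    """Verbose encoding: field names travel with every message."""
    record = {
        "sender_id": sender_id,
        "conversation_id": conversation_id,
        "sent_at_ms": sent_at_ms,
        "text": text,
    }
    return json.dumps(record).encode("utf-8")


def encode_compact(sender_id, conversation_id, sent_at_ms, text):
    """Compact encoding: three 64-bit integers, a 16-bit length, then the body."""
    body = text.encode("utf-8")
    header = struct.pack("!QQQH", sender_id, conversation_id, sent_at_ms, len(body))
    return header + body


def fits_in_one_packet(payload):
    return len(payload) <= PACKET_BUDGET
```

With this layout the fixed metadata costs 26 bytes, so the variable part (the text) is what decides whether a chat still fits in one packet.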
00:38:07
Speaker
That's cool. Yeah, that's the kind of thing that you just can't incrementally improve towards. I mean, you can get there, but like you need to touch every part of the system. And for us, like that was...
00:38:19
Speaker
the design of the API, the design of every single data structure, moving to gRPC from a JSON-based thing. When we did our C++ client, we ripped out the networking layer and replaced it with Cronet, which is Chrome's underlying network thing.
00:38:36
Speaker
We were the first to use gRPC C++ on mobile. Like, it was just all this stuff that we had to do. And you know, some of it was principled, and some of it was just like, I really want to save every last byte.

Speculation on AI and Future Innovations

00:38:49
Speaker
Awesome. So this was eons ago. We didn't know about either respiratory diseases or, uh... What would you have done differently if you had the tools that were available today?
00:39:04
Speaker
Yeah, it's super, super interesting. um
00:39:11
Speaker
I mean, I think I can imagine a ton of stuff that we could have done with the help of AI, right? But obviously, we didn't have any of that at that point.
00:39:23
Speaker
I think when you pair up the AI tools we have now with a backstop that's real, like that's provable, I guess, is the thing. So for us, that would be those metrics and those logs that make sure that you haven't screwed stuff up.
00:39:41
Speaker
The pain of creating that backwards compatibility layer could have been vastly accelerated. Right? Yeah. Because what it really came down to was reading thousands and thousands and thousands of lines of obscure undocumented code, trying to figure out what the behavior is, more or less guessing, okay, I think this is how this works.
00:40:03
Speaker
Deploying it, waiting for the metrics, waiting for the logs, looking at the mismatched logs, going, okay, puzzle this out. I can easily imagine building an LLM agent loop for that, where you say, okay, great.
00:40:15
Speaker
The LLM every day looks at the logs, figures out what the bugs are, goes, changes the code, tries it again. Hey, we could have all been on the beach sipping Mai Tais while this thing, like, self-organized itself.
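The loop imagined here might have a shape like the sketch below. It is purely illustrative: no real LLM API is called, the three callables are hypothetical hooks a team would have to supply, and the round limit is an added safety assumption.

```python
def repair_loop(get_mismatches, propose_fix, apply_fix, max_rounds=5):
    """Repeat: read mismatches, propose a change, apply it, and re-check.

    get_mismatches: returns the current list of old/new diffs (the backstop).
    propose_fix:    hypothetical hook where an LLM agent would draft a patch.
    apply_fix:      hypothetical hook that deploys the patch in shadow mode.
    Returns (clean, history), where history is the mismatch count per round.
    """
    history = []
    for _ in range(max_rounds):
        mismatches = get_mismatches()
        history.append(len(mismatches))
        if not mismatches:
            return True, history
        apply_fix(propose_fix(mismatches))
    # Out of rounds: report whether the last fix happened to finish the job.
    return not get_mismatches(), history
```

The important design point is that the loop never decides for itself whether it succeeded; the mismatch metrics do, which is the "provable backstop" role described above.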
00:40:26
Speaker
Yeah. Just incinerating tokens. Yeah.
00:40:32
Speaker
Yeah, I mean, it's a great point, right? Like, you had manually built this structure, like, what is the smallest unit of change? Like it's a conversation, you know, all the way through, or a conversation, maybe even at a couple of different levels.
00:40:45
Speaker
And, like, what does good look like? You've got those logs and metrics. And it's only at that point that you really... It's almost, you know, a little bit of a knock on current AI usage. It's like, you've got to do all the work that you just talked about before you can even bring AI to bear.
00:41:03
Speaker
So like, yeah, the last 20% gets faster, but it's just like how if you're using the coding tools today, the difference between doing it with a test suite and not is huge, right? And for us, the metrics and these diff frameworks were the test suite, effectively.
00:41:25
Speaker
The best part about it is they weren't a synthetic test suite. They were real live production data, right? So we knew for real how it was going to work out.
00:41:36
Speaker
Yeah. Yeah. You know, you're never questioning the test suite. Like, is this test testing the right thing? Like, oh yeah. And I think, you know, I see that as being the future of a lot of this stuff: if you can pair up an agent that can effect change with a system that can accept change safely, right? Validate the change.
00:41:58
Speaker
So either it's impervious to breakage, you know, because it operates with some rules or a structure, or, you know, whatever gives you the way to do it in shadow or some of these other techniques. If you incorporate a lot of these techniques, then you can really let loose these agents. And, you know, whether you say, OK, it's Claude or it's an army of interns.
00:42:22
Speaker
Right. It's effectively the same output. Right. They're going to go and do a bunch of stuff. And as long as they have the right fences around them, you can get a lot of great value out of that.
00:42:33
Speaker
Yeah. Yeah, that really resonates. There's so much that, open loop, it doesn't do well. I think the other thing that I've seen a bunch of is that I've seen people using AI to try and spin up these frameworks faster because it's now cheaper to get there and you realize the benefit on the back end. But it's like a fundamentally different type of work, right?
00:42:53
Speaker
Because you know you can go it away. It totally is. Yeah. I'm trying to think of, like, how much of the actual framework could we have built? And I mean, obviously you could scaffold a lot of it, but I think you would need a really careful eye in validating that the framework works.
00:43:11
Speaker
Once you've got the framework working, you've effectively got a playground, a sandbox that you can put anybody in, right? Whether it's an LLM or a human. And, you know, if you've made it safe so you can throw anything at the wall and you won't break something, then yeah, go nuts, do whatever, right?
00:43:30
Speaker
You can be as bold as you want. You can take as many risks as you want because there aren't really risks, right? You've already fenced it in. Yeah, absolutely. Even if it's only 20 or 30% cheaper, actually you'll create the framework.
00:43:41
Speaker
The gains you pick up by having it are just so much more valuable. I love that. That makes sense. um So what did you learn from this experience and what did you take to your next gig?
00:43:53
Speaker
Yeah, well, I'd like to say I learned not to do giant ah ambitious infrastructure projects, but that's clearly not true.

StatelyDB and the Future of Infrastructure

00:44:03
Speaker
The opposite. I'm still addicted to it.
00:44:06
Speaker
I mean, a lot of what we took away from it was that these sorts of frameworks don't exist, right? They're not off the shelf.
00:44:17
Speaker
And like I was just saying, having a system that kind of programmatically reduces risk of change is like a key unlock to being able to go nuts, whether that's product innovation or, you know, doing a crazy backend rewrite or whatever other thing you want to do, right? Save costs, whatever.
00:44:39
Speaker
So that's why, when we left Snap, we were like, okay, what could we build that embodies these principles? And that's why we're building this database, StatelyDB, that lets you change anything about your schema, anything about your data model, kind of any way you want without ever breaking prod, right?
00:45:02
Speaker
Which again, talking about ambitious infrastructure projects. But yeah, that's the idea, right? And again, it comes back to what we were trying to do there with Snap. We had an idea in our head of here's what we want our ideal data model to be.
00:45:16
Speaker
Clearly, our data model is holding us back. But how do you effect that change? Well, it's a huge project. Well, does it need to be a huge project, right? Why can't the database help me with that? Why can't it, you know, programmatically understand the sorts of transformations I'm trying to do?
00:45:33
Speaker
And build that backwards compatibility layer for me automatically? And then, you know, I'm off to the races. So that's what we're building, right? You know, change anything you want, anytime, automatic backwards compatibility. So you never break your existing clients, but you can forge ahead with your new projects, you know, unconstrained.
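As a generic illustration of "automatic backwards compatibility" at the data layer, the sketch below stores records in the newest schema and converts on the way in and out for older clients. This is not StatelyDB's API or design; the schema, the version numbers, and every name are invented for the example.

```python
def v1_to_v2(record):
    """Upgrade: schema v1 had a single 'name'; v2 splits it in two."""
    first, _, last = record["name"].partition(" ")
    return {"id": record["id"], "first_name": first, "last_name": last}


def v2_to_v1(record):
    """Downgrade: rebuild the v1 'name' field for old clients."""
    name = f'{record["first_name"]} {record["last_name"]}'.strip()
    return {"id": record["id"], "name": name}


class VersionedStore:
    """Keeps data in the latest schema and serves any supported version."""

    def __init__(self):
        self._rows = {}

    def put(self, record, version):
        row = v1_to_v2(record) if version == 1 else dict(record)
        self._rows[row["id"]] = row

    def get(self, record_id, version):
        row = self._rows[record_id]
        return v2_to_v1(row) if version == 1 else dict(row)
```

In this toy version the two conversion functions are written by hand; the pitch in the conversation is that the database would understand the transformation and supply that layer itself.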
00:45:56
Speaker
That makes a ton of sense. And it really, it feels like another brick in the house of where I'm hoping the world gets to, which is like, I basically want to see the world, you know, move away from as much tech debt and legacy code as possible.
00:46:13
Speaker
And the two pieces you need for that are, you need fundamental building blocks that are changeable. Like if you have a database that you cannot change, then that becomes something that you have to put wrapping paper around. Like, that's where the other half of this is what we're doing at Tern, right? Like, let's help you put more wrapping paper around things and move things incrementally. But like maybe you wouldn't have to do that so hard if everything was as easy to change as code.
00:46:40
Speaker
Code is, well, it's not trivial to change, but it's a sight easier to change than a database in most cases. Well, it's easy to change if you don't care about anything that's currently running, right?
00:46:52
Speaker
Right? It's super easy to change, right? Like, just go for it. Oh, everybody's broken. Whatever. They should upgrade, right? I can make it real fast. Yeah, exactly. And so, you know, what we're thinking is there's got to be infrastructure primitives that let you change at the speed you want without having to know all the history of, you know, whatever, 10 years of bedroom development, right?
00:47:17
Speaker
Without breaking things. Right. And that can be built into the infrastructure. And then once you've got that primitive, like what can you do? Right. If you have a database that you can't break and you won't break your existing applications by changing it, what would you build?
00:47:33
Speaker
Right. How much faster would you build? And then, you know, can you extend that to other types of infrastructure? I bet. Probably. Yeah. But again, I don't want to go too crazy on huge infrastructure projects, right? No more of that for me.
00:47:49
Speaker
Just the database. Just the corner of the infrastructure world. Yeah, exactly. Exactly. Very cool. Well, this has been a fantastic conversation. And I could talk about this all day, but unfortunately, we are running out of time.
00:48:04
Speaker
One last question for you, I guess, too. Where can folks find you on the internet? And how can they help? Yeah, well, so a bunch of us who built all that stuff at Snapchat are now at Stately Cloud, building StatelyDB.
00:48:23
Speaker
Our website is stately.cloud. So really easy to remember. So come check us out. We'd love for people to give us feedback, try things out, talk to us. If you've ever had the problem of trying to change anything about your infrastructure and especially your database and have found it frustrating or difficult or you gave up because you didn't want to,
00:48:44
Speaker
please talk to us. We'd love to talk through your problem and see whether we can help or, if not, whether we can learn from your experience. Awesome. Sounds great.
00:48:54
Speaker
Well, Ben, thank you so much. This is an amazing conversation.