
Ratcheting Progress: How Lyft Migrated 150+ Services from Python 2 to 3 | Ep. 6

Tern Stories
8 plays · 1 month ago

In this episode of Tern Stories, I sit down with Anthony Sottile, who led Lyft's migration from Python 2 to 3. They did 150 services in less than six months. But what stood out wasn't just the scale or the timeline, it was the approach.

They really leaned on tooling so they could ratchet progress forward, and then they'd lock it in with CI rules so nothing backslid.  

They would ship automated PRs, and if nobody complained but nobody looked at them, they would just merge them. And they trusted their infrastructure.

They would do deployments where even if there were errors, they knew the system would retry, and there wouldn't be any user visible impact.  

And that made for a really powerful migration story. So we talk about what worked, we talk about what didn't, and some of the most hilarious failures across a couple of key services.

Get Tern Stories in your inbox: https://tern.sh/youtube  

Go subscribe to Anthony! ➡️ https://www.youtube.com/anthonywritescode

Transcript

Leading Lyft's Python Migration

00:00:00
Speaker
In this episode of Tern Stories, I sit down with Anthony Sottile, who led Lyft's migration from Python 2 to 3. They did 150 services in less than six months. But what stood out wasn't just the scale or the timeline, it was the approach.
00:00:13
Speaker
They really leaned on tooling so they could ratchet progress forward, and then they'd lock it in with CI rules so nothing backslid. They would ship automated PRs, and if nobody complained but nobody looked at them, they would just merge them.
00:00:26
Speaker
And they trusted their infrastructure. They would do deployments where even if there were errors, they knew the system would retry and there wouldn't be any user-visible impact. And that made for a really powerful migration story.
00:00:39
Speaker
So we talk about what worked, we talk about what didn't, and some of the most hilarious failures across a couple of key services. We talk about what tooling and patterns you could take from this to make your migrations work like they did.

Anthony's Career Journey

00:00:51
Speaker
Enjoy. Today we've got Anthony Sottile on the show, and we are going to talk about Lyft's Python 2 to 3 migration. Anthony has worked at a ton of awesome companies: Yelp, Stripe, Lyft, and now Sentry. And I'm excited to hear about what you have done over the course of that career. So welcome. It's good to be here. Thank you for having me on. All right. So before we dive in, you were telling me before the show, as we were chatting, that you have managed to pick up an extremely similar job at each of these companies, maybe best described as migration engineer.
00:01:30
Speaker
So, what? Tell me about that. That's pretty close. I think that's a pretty accurate description. But let me walk you through my career progression and how I sort of got to where I am now.
00:01:41
Speaker
I started out actually as a front-end developer writing CSS and JavaScript and web views at Yelp. And I found that it was difficult for me to get my job done because I was constantly waiting on back-end engineers to implement APIs or other stuff that I needed.
00:01:58
Speaker
And so I was like, well, I could probably do full stack. I taught myself enough Python to be dangerous and started being a full-stack developer. As a full-stack developer, I realized, hey, I'm spending all this time waiting on build tooling or CI or linting or infrastructure, those sorts of things. I was like, well, I could probably do that.
00:02:20
Speaker
And so I started a web frameworks team, or at least that's what we called it. It was really a developer productivity team, but we called it Web Frameworks because that was how we got it greenlit. And I started working on developer tools and realized that's the stuff that I really, really enjoyed doing, and I haven't really looked back since then.
00:02:38
Speaker
I did the same stuff at Lyft, where I was hired to build out developer infrastructure. Stripe was a little bit different. I was building specifically Python infrastructure, but again, it was developer tools and making that stuff happen.
00:02:53
Speaker
And I'm currently building that out at Sentry as well. Right on. I'm curious, do you feel like you missed the "I am building the feature" work as you got deeper and deeper into the "I will unblock myself" world?

Developer Productivity Focus

00:03:09
Speaker
Sometimes it was nice to see the fruits of my labor on a production web app and be like, yeah, I made this page. But after a while, I was like, no, I don't really want to deal with a product manager and having to twiddle pixels for a living. The big thing for me was that I was able to be a multiplying factor for other developers, and that was way cooler to me than any sort of user-facing product. It was really awesome to be able to build a tool that made a whole bunch of people way more productive.
00:03:44
Speaker
I'll ignore the comments about product managers; some of us are okay. But I do agree about the feedback loop: seeing your coworkers work on things means you get to implicitly take credit for basically everything the company does, which I totally understand.
00:04:04
Speaker
Yeah, and I mean, you work directly with your users, the other developers at your company, and so you get direct feedback and you see people excited about the stuff you build.
00:04:15
Speaker
Whereas, okay, maybe the A/B test is successful and that's our users being excited about this, but it's not as direct feedback as I like to get.
00:04:27
Speaker
Yeah, absolutely. That makes a ton of sense.

Migration Strategy and Planning

00:04:29
Speaker
All right, so this Lyft migration: when was this, and what stage was Lyft at? Sure. So I think the best way to frame this is in terms of the 2020 end of life of Python 2, which in theory was when you should have been migrated off by. Lyft actually took a little bit more of a progressive approach to this, partially due to my convincing that we should try and accomplish this before the end of life.
00:05:00
Speaker
That way we were never in a situation where we were on an unsupported version of Python. And so, setting the stage, the start of the process was probably late 2018 or early 2019, where we had decided, hey, I think we want to migrate off of Python 2, and I think we want to try and get it done by 2020.
00:05:21
Speaker
At the time, Lyft was a couple hundred engineers. I don't remember the exact number, but about 50 or 60 teams and a couple hundred services. Actually, no, I have the number: it was 600 services. So we had more services than developers at the time.
00:05:45
Speaker
Yay, microservices. Microservices architecture in the late 2010s. Yeah, I mean, it got Lyft to where they were. I actually think that the approach they took was pretty successful. And we built the necessary infrastructure to make microservices work.
00:06:01
Speaker
And so it wasn't all terrible. Oh, I should also talk about the number of Python 2 services. So of those 600 services, some were written in Go.
00:06:12
Speaker
um The rest were written in Python. And there were about 150 Python 2 services spread across about 40 different teams.
00:06:22
Speaker
That is a non-trivial number to migrate. Yeah, it was especially large given that the dev infra team was about four people at the time. And I think only two of us were focused on Python infrastructure.
00:06:39
Speaker
And so realistically, it was kind of like two of us plus whatever volunteers we could scrounge together to try and pull off this massive migration. So was that team given the mandate to go migrate Python, or was that your own idea?
00:06:57
Speaker
So I always like to talk about how our team prioritized and made projects happen. And they really come from three sources. One is leadership tells us we need to do a thing. Okay, fine. We need to do a thing.
00:07:09
Speaker
The other is we collect feedback from developers and figure out what other people at the company want us to do. And then we sometimes formulate those into actual projects or sometimes we ignore them, to be honest.
00:07:21
Speaker
And then the third is ideas that come from within our team ourselves, where it's like, we know that this is going to be impactful, or we hypothesize it's going to be impactful; how do we spin this to everyone else and make it happen? The Python 2 to 3 migration was actually a bit of a mix of those second two categories. Leadership obviously didn't care.
00:07:42
Speaker
We had to convince them as part of making it a real project. But we had a lot of developers at Lyft that were like, hey, my service is stuck on Python 2.
00:07:53
Speaker
I would love some help getting it to Python 3 because I want this feature or this syntax or this library or this tool. And they were stuck in the old Python 2 realm where they couldn't utilize those things.
00:08:05
Speaker
Yeah. But I think most of the push was from within our team. We were like, hey, we have a lot of overhead providing developer tools for both language versions. It would be so much better if we could just standardize on Python 3, move everyone forward, and not have to deal with a spooky security scenario where, oh no, we're running unpatched Python in production and suddenly there's a zero-day and we need to somehow patch Python now.
00:08:31
Speaker
We wanted to avoid that as much as possible. That seems like the kind of situation where suddenly leadership would be involved and care about this migration. Maybe this is too on the nose, but I feel like that is a common thing about security as a priority: nobody cares until you get owned, and then it's a P0.
00:08:53
Speaker
Whereas I think a much healthier approach to security is thinking about it up front, actually investing in prevention, and tackling these things before they're on fire.
00:09:04
Speaker
Absolutely. If you patched only the things that would definitely get you owned in the future and didn't patch anything else, that would be the ideal scenario. But that requires knowing things that I don't think you can know. That's also true, yeah.
00:09:18
Speaker
But until then, it's about keeping the dumpster fire from burning as much as it could. Absolutely. So: developer friction, knowledge that this eventually might turn into a security issue, and your team decided to push for it.
00:09:35
Speaker
So, all right, a lot of services to migrate. How did you think about the major milestones of this migration? Yeah, so this was actually pretty tricky, because I have a prescribed way to migrate to Python 3. I guess I'll talk about how I would do that for a single service.
00:09:52
Speaker
But I think our problem at Lyft was multiplied by the fact that there were many, many services, and I'll eventually talk about the tool that helped us with that. When I think about migrations in general, but specifically this Python 2 migration, it was split up into a couple of steps. The first is: we know we have Python 2 code, we know it's passing its current set of tests; let's try and get it just parsing in Python 3. Run it through very rudimentary syntax checks and get those passing.
00:10:23
Speaker
There were some amount of things that we had to fix there. For instance, async was a valid variable name in Python 2, but in the Python version we were migrating to, that's a keyword. And so there were modules named async, and variables named async, and classes named async that suddenly don't work. And so we had to manually go in and rename some of those things.
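The async collision is easy to reproduce. A minimal sketch of this kind of check, using only the built-in compile() (the identifier names here are made up):

```python
# "async" was a valid identifier in Python 2, but it became a reserved
# keyword in Python 3.7, so previously legal code no longer even parses.
LEGACY_SOURCE = "async = True\n"      # legal in Python 2
RENAMED_SOURCE = "is_async = True\n"  # the manual fix: rename the identifier

def parses_as_python3(source):
    """Return True if `source` is valid syntax for the running Python 3."""
    try:
        compile(source, "<py3-syntax-check>", "exec")
        return True
    except SyntaxError:
        return False
```

Getting a check like this to pass for every file is essentially what the "just parsing in Python 3" milestone means.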
00:10:45
Speaker
There were also things like the print statement going to the print function. We needed to fix those. Fortunately, there's a tool that we used called... oh, what is that tool called?
00:10:57
Speaker
Why do I not remember? I'll think of it later. But there's a very popular tool, based on lib2to3, that helped us do some of the very early syntax-based migrations, the things that we could automate.
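The print rewrite he mentions looks like this; the __future__ import is the standard trick, independent of any particular tool, that lets the rewritten form run on Python 2.7 and 3 alike:

```python
# Python 2 only:   print "hello", "world"
# After the mechanical rewrite, the call form below runs on both versions,
# given the __future__ import (a no-op on Python 3):
from __future__ import print_function

import io

print("hello", "world")

# The function form also gains keyword arguments the old statement lacked:
buf = io.StringIO()
print("a", "b", sep="-", end="", file=buf)  # writes "a-b" with no newline
```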
00:11:15
Speaker
Once we had syntax passing, the next milestone was to get linting passing. So, run the code through the same validation checks that we were running in Python 2: things like mypy, well, mypy if we had the chance, but Flake8 and other linter tools, basically.
00:11:36
Speaker
Once we had linting passing, we moved on to tests, so trying to get the test suite to pass in both Python 2 and Python 3 at the same time. This was important because we wanted every milestone to be something that we could commit and deploy, commit and deploy, commit and deploy, rather than having one big-bang pull request that just cut the ties all at once.
00:11:59
Speaker
This meant that we had to do a little bit more work, because it is arguably easier to have a code base that is only Python 3 or only Python 2. But we wanted to maintain compatibility and have a smooth rollout transition, without any potentially scary "oh god, we need to roll back now, but someone's already made a patch to the new version" moments. We wanted to maintain compatibility to make it easier to land each milestone independently.
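Straddling both interpreters usually means a small compatibility shim. A hedged sketch (these helper names are illustrative, not Lyft's actual code; libraries like six packaged the same idea):

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 -- this name only exists on Python 2
else:
    text_type = str

def ensure_text(value, encoding="utf-8"):
    """Coerce bytes to text the same way on both interpreters."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)
```

Code written against helpers like this runs unchanged on both versions, which is what makes each milestone independently deployable.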
00:12:28
Speaker
And then once we got the tests passing, it was really just: try it in production until it succeeds.

Deployment and Error Management

00:12:33
Speaker
Roll it out, monitor errors, monitor system health and statistics and other stuff like that. And when it's good, it's good.
00:12:43
Speaker
Most services were successful on the first deploy. I think that was just luck. But the worst service, actually, we had to roll back and forth like 12 times to get it out. And even that wasn't too bad. Deployments were pretty fast at Lyft, and rollbacks weren't a huge deal. And we had designed circuit breakers and failovers for any service that was important enough to need them.
00:13:17
Speaker
I mean, when you put something into production and are monitoring errors, that's real user traffic. Was there a concern about the impact of, like, we're testing on real users?
00:13:29
Speaker
There definitely was. We had a few things in place that mitigated those problems to some extent. The biggest mitigator of those problems is that every service went through a canary deployment. So as part of rolling out, you would roll out to a very small percentage of traffic, and that would
00:13:51
Speaker
reduce the blast radius of a bad change. Of course, those are still real requests going to real users. The other thing that helped, that saved those from being catastrophic even if it were completely broken, was that it's a web of microservices, so necessarily we had to bake in retries. And we had established service tiering to say, like, a tier-one service calling a tier-two service must ignore failure, or retry on failure, or have some timeout budget, whatever, to handle a failure scenario there.
00:14:30
Speaker
And so we had sort of avoided catastrophic failure being a problem just by way of how the microservices were set up.
00:14:40
Speaker
A Python 2 rollout was no different than any other feature rollout. If you rolled out broken code to production, it's sort of mitigated by canary and retry mechanisms.
00:14:54
Speaker
Of course, for a tier-zero service you don't have those same safeguards, because you are front line. If it's broken, it's broken. But even there, your canary saved things from being completely trashed.
00:15:09
Speaker
Interesting. I love that approach: you're essentially driving the error rate low enough that the rest of the system can then actually succeed. So you're never going to get to a hundred percent success, but you're sort of below the noise floor of what matters to the user.
00:15:25
Speaker
Yeah. And in any distributed system, something has always broken. So it's not different in that regard. We're just minimizing that brokenness. Right. Yeah.
00:15:37
Speaker
That's really interesting. So tell me a little bit more about this idea of incrementally committing the migration as you go along.
00:15:50
Speaker
Presumably, you didn't stop the world for each of these services and say, we're not going to commit to them, we're not going to do feature work on them. There was some amount of other work happening.
00:16:01
Speaker
How did you weave together "we're going to change half your service, renaming the async module" with the team's other work? Yeah. So that was actually one of the very important parts of this migration, and I think one of the important parts of migrations in general, which is: you don't want to stop the world. You want to be, not really a fly on the wall, but you don't want to derail any feature development that's happening at the same time.
00:16:28
Speaker
And so the way that we did this, or at least the way that we tried to do this, was: for each incremental milestone that we made in a particular service, we would add automation to keep that milestone locked in.
00:16:46
Speaker
If you made the syntax pass in Python 3 and just committed the syntax changes, they would immediately regress unless you had a system in place to prevent those from regressing.

Automation and Tooling

00:16:55
Speaker
And so for every step that we did, we baked it into the CI setup for that service. So, like,
00:17:03
Speaker
okay, your syntax is passing? Great, we have a syntax checker now in your CI. Your linting is passing? Great, we have a new linter in your CI. Oh, your tests are passing on Python 3?
00:17:14
Speaker
Okay, now your test suite runs both Python versions for this intermediary period. This did mean that some feature requests were maybe a little bit more difficult, because now they actually had to do the right things. But I think the impact on developers was minimized as much as possible.
00:17:31
Speaker
We also put in tools that would auto-format stuff. So even if you made the mistake, often the tools would fix it for you and you wouldn't have to think about it. Got it. Where did those tools run? Were they in CI or on people's laptops? A bit of both. So I've got a shameless self-plug here: I wrote a git hook framework called pre-commit, which supports linters and code formatters in a bunch of different programming languages.
00:18:00
Speaker
And one of the core ideas of pre-commit is that it manages the tools and installs them for you, so you can run them directly on your local machine. But you can also run pre-commit in CI so that it'll run the same set of tools there as well.
00:18:13
Speaker
And so basically, git hooks are a first line of defense, getting early feedback on things and automatically fixing things. But we also ran them in CI and used the same set of processes in both places to ensure that they don't regress.
00:18:28
Speaker
And so, yeah, we basically wrote a lot of pre-commit hooks to do this, or used off-the-shelf tools that fit into the framework. Got it. So you probably had some bumpiness if you were in the middle of a feature branch at the moment that you decided you were going to start the Python 3 migration. But, I don't know, maybe you shouldn't have long-lived feature branches either, right? I agree with that. Yeah, feature branches should not be long-lived.
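A pre-commit hook is ultimately just an executable that receives filenames and exits non-zero on failure. A minimal local hook in the spirit of the syntax ratchet (the file name is hypothetical and the config wiring is omitted; this is a sketch, not one of Lyft's actual hooks):

```python
# check_py3_syntax.py: fail if any given Python file does not parse as Python 3.
import ast
import sys

def check_files(filenames):
    """Return the subset of `filenames` that fail to parse under Python 3."""
    failures = []
    for filename in filenames:
        with open(filename, "rb") as f:
            source = f.read()
        try:
            ast.parse(source, filename=filename)
        except SyntaxError:
            failures.append(filename)
    return failures

if __name__ == "__main__":
    bad = check_files(sys.argv[1:])
    for filename in bad:
        print("not valid Python 3 syntax: %s" % filename)
    sys.exit(1 if bad else 0)
```

Because pre-commit runs the same executable locally and in CI, the check behaves identically in both places.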
00:18:53
Speaker
But even in that case: imagine you branched from mainline before we introduced a new syntax check, and you opened your pull request from an old branch. What would actually happen there, if you're lucky and you didn't get a merge conflict immediately?
00:19:07
Speaker
If you didn't get a merge conflict, CI would merge with mainline and then run the new formatters that were introduced on mainline, and in theory automatically fix your pull request for you. We didn't actually implement the automatic fixing, but you would just have to pull and run pre-commit run --all-files, which is close enough to automatic.
00:19:26
Speaker
In a lot of cases, yeah, there was a little bit of a bump, but we tried to minimize it as much as possible. Got it.
00:19:39
Speaker
You had this sort of regimented way of thinking about it: we'll do syntax, then we'll do lint, then we'll get the tests passing. On one hand, when you explain it like that, it makes a lot of sense to me. I can't come up with another way to do it off the top of my head that's not obviously just reversing those, and that seems backwards.
00:19:58
Speaker
Why did you pick that particular ordering? And are there other ways of doing a migration that you just decided don't work as well? Yeah. So in this particular case,
00:20:14
Speaker
I like to think about it as ratcheting. We know we're starting at this, air quotes, bad state, and we want to get to this, air quotes, good state. How do we incrementally move towards there, but put something in place that doesn't allow it to slide backwards? Yeah.
00:20:32
Speaker
And the particular steps that I chose for this 2-to-3 migration were things that were either easy, automatable, or something we could definitively say, okay, this will not regress because we put this thing in place.
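The ratchet idea generalizes beyond pass/fail syntax checks: measure some count of "badness" and fail CI if it ever goes up. A toy sketch (the pattern and the baseline handling are illustrative, not Lyft's tooling):

```python
import re

# A crude detector for Python 2 print statements ("print x" but not "print(x)").
LEGACY_PATTERN = re.compile(r"\bprint\s+[^(\s]")

def count_legacy(sources):
    """Count occurrences of the legacy pattern across source strings."""
    return sum(len(LEGACY_PATTERN.findall(src)) for src in sources)

def ratchet_ok(sources, baseline):
    """Pass if the count did not regress; the team lowers `baseline` over time."""
    return count_legacy(sources) <= baseline
```

CI enforces "never worse than the baseline", and each cleanup commit lowers the baseline, so progress can only move one way.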
00:20:45
Speaker
I don't think that there's much else. I mean, in this particular one, I think this is the way to do it. You could probably skip some steps or combine some steps, but ultimately, you have to at some point make the syntax pass, and at some point you have to make the linting pass.
00:21:00
Speaker
And at some point you have to make the tests pass. I guess you could do them in the other order, but this seems like easiest things first, hardest things last, and then finally roll it over.
00:21:13
Speaker
You could also take more risks. You could do it in a way where it is a big-bang pull request, and you don't have this compatibility layer that we decided to do.
00:21:24
Speaker
Yeah. But, leveraging my expert opinion here, you're better off staying compatible, and you're better off having smaller feature branches and smaller incremental changes rather than trying to do it all at once.

Microservices vs. Monorepos

00:21:38
Speaker
There were smaller services at Lyft where we did combine some of those steps. There were definitely some where it was like, well, the syntax, linting, and tests all pass, and I only changed two lines of code.
00:21:52
Speaker
Fine, we'll just combine those steps. But yeah, for anything non-trivial, it was useful to split them up. Yeah. And in saying smaller services, you actually have a natural factoring already. You already are splitting the work kind of vertically across several hundred different services.
00:22:14
Speaker
So do you think that that logic still holds if you're thinking about something like a monorepo? Ooh. So, to be honest, I'm not much of a fan of monorepos, but I actually don't think it really changes much here.
00:22:29
Speaker
Well, depending on how you handle your monorepo. The most naive understanding of a monorepo is just: it's a repo where you put repos inside of it.
00:22:41
Speaker
It's not really any different from handling microservices in a bunch of separate repos, except that you happen to have everything checked out right in front of you. Yeah. I do think there are potentially some benefits, like say you wanted to apply the syntax change to every single service at the same time.
00:22:58
Speaker
You could do that in a monorepo in a single patch. I think it would have been harder, because you'd have to do a lot more work in a single patch, or you'd split it up.
00:23:12
Speaker
But if you have separate repos, you just naturally split it up. I don't know. I think what I want to say is: it's not easier or harder whether you're using microservices or a monorepo here. It's really based on what tooling you have to support these types of things in either of those scenarios.
00:23:27
Speaker
If you don't have good many-repos tooling, you're not going to have a good time with many repos. If you don't have good tooling for a monorepo, you're not going to have a good time with a monorepo. And then it's down to personal opinion about which of those types of tools you like, or are familiar with, or gel nicely with your brain structure. And that's where the personal opinion about monorepos comes in.
00:23:49
Speaker
Yeah, that makes sense. I think if you're not only monorepo but also monolith, then you have a real challenge here. I was talking with some folks who have a 10-million-line Python monolith at the core of their application. And one of the things they would love to do is divide it up into separate packages that they could migrate or deal with more incrementally. Yeah, absolutely. They just
00:24:15
Speaker
might have to eat the pain on that one and do one big patch. Yeah, monoliths are a perpetual struggle. I think I've encountered that problem at every place that I've worked so far.
00:24:27
Speaker
Lyft actually spent a whole large migration to decouple, what did they call it, split up their monolith. They had a fancy word for it: Project Decomp. Monolith decomposition was all the rage. There's the PTSD, kind of. Anyway.
00:24:50
Speaker
I'm going to get someone from Twitter on the show next and go really deep.
00:24:59
Speaker
So, all right, you have the set of tooling, you've got your factory. Did that tooling create any further problems? Did that automagic commit-hook framework get in anyone's way more than it helped?
00:25:11
Speaker
Well, I can talk about one of the things that went wrong. I don't think it was really the fault of the linter/formatter framework, but more the fault of us not knowing what we were doing in production.
00:25:25
Speaker
So actually, let's take a step back, because we didn't talk about how we even tried to federate these out to all of the repos. I think that's important to frame first.
00:25:37
Speaker
Lyft wrote this internal tool called Refactorator. The basis of Refactorator was a database, and I use the term loosely because it was actually just Google Sheets.
00:25:51
Speaker
An executor, and I use that term loosely as well, because it was just a Jenkins job. And a series of automatic fixes. What it would do is: you would write this automatic fix, you could write tests for it, et cetera.
00:26:05
Speaker
And Refactorator would check out every repo at, I almost said Sentry, every repo at Lyft, and it would try and apply the change. And it would also see whether it was already completed, or whether it failed, or whatever status it was in.
00:26:22
Speaker
And it would generate pull requests for them. And then it would store all of this information in the refactorator database, and you could view like the status of what all of the pull requests were at.
00:26:33
Speaker
And so we had this giant spreadsheet of every single service and each of those stages that we had. You talked earlier about splitting by vertical; we actually had vertical and horizontal splitting, literally in a spreadsheet, to figure out the status of this migration.
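Refactorator was internal to Lyft, so the sketch below is purely hypothetical: it only shows the shape of the federation loop, applying one fix across many repos and recording a status row per repo (the real "database" being a Google Sheet):

```python
def federate(repos, fix):
    """Apply `fix` to each repo's sources; return a status per repo.

    `repos` maps repo name -> list of source strings. Entirely illustrative.
    """
    statuses = {}
    for name, sources in repos.items():
        try:
            new_sources = [fix(src) for src in sources]
        except Exception as exc:
            statuses[name] = "failed: %s" % exc
            continue
        if new_sources == sources:
            statuses[name] = "already migrated"
        else:
            repos[name] = new_sources  # in reality: open a pull request
            statuses[name] = "pull request opened"
    return statuses

# Toy fix: the print-statement rewrite discussed earlier in the conversation.
def fix_print(src):
    return src.replace("print 'hi'", "print('hi')")

repos = {"svc-a": ["print 'hi'\n"], "svc-b": ["print('hi')\n"]}
statuses = federate(repos, fix_print)
```

Writing the fix once and letting a loop like this rip across every repo is what made a two-person team viable against 150 services.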
00:26:52
Speaker
And a lot of this was us just writing these automations, doing that part once, and just letting it rip, sending it out to all the repos. We had, well, I would say we had a policy, but it was sort of an unwritten policy, of:
00:27:10
Speaker
if a Refactorator pull request did not get merged for a while and no one said no, then it must have been good enough, and we would self-approve and merge them. And surprise, surprise, most people are not on top of reviewing external pull requests. They're usually pretty good about reviewing people from their team, but an outside pull request slips through the cracks most of the time.
00:27:35
Speaker
Especially an automated pull request that comes from a bot. Not the top of the heap. Yeah, I think we had the hardest time getting these automated things merged. But anyway, we'll talk about one particular thing that went wrong, which was when we did our first phase of automatic formatting.
00:27:55
Speaker
Basically taking code and saying, okay, we're going to run... why can I not remember the name of that tool? Hold on, give me one sec, because I don't remember, and I want to give credit to it. Based autoformatter.com.
00:28:15
Speaker
What was it called? Oh, come on. Really, upgrading tools: my tool is mentioned there, but not the one I'm thinking of. Well, what use is that, right? Maybe it got removed because it's broken, because lib2to3 got removed in modern Pythons by then.
00:28:40
Speaker
Modernize! That's what it's called. It's called modernize. That makes sense. That's a good name. It's a great name. Yeah, so our first pass was to run modernize and pyupgrade. pyupgrade is a tool that I wrote.
00:28:51
Speaker
We ran them on the code bases so that they would be syntax-compatible with Python 3, and so they would remove any legacy, pre-Python-2.7 syntax, because pre-2.7 syntax was a lot harder to make compatible with Python 3.
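For a flavor of the rewrites these tools automate (exact output varies by tool and version; the pairs below are illustrative of this class of fixers, not a verbatim record of modernize's or pyupgrade's output):

```python
# Typical mechanical rewrites on the road from Python 2 to 3:
#
#   print "x"               ->  print("x")
#   set([1, 2, 3])          ->  {1, 2, 3}
#   d.has_key(k)            ->  k in d
#   except ValueError, e:   ->  except ValueError as e:
#
# Each rewritten form behaves identically on Python 2.7 and Python 3:
d = {"a": 1}
modern_checks = [
    {1, 2, 3} == set([1, 2, 3]),     # set literal equals the old call form
    ("a" in d) and not ("b" in d),   # the has_key replacement
]
```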
00:29:10
Speaker
And so that was kind of our first phase, and this was to get syntax compatibility. We wrote it through Refactorator and sent it out to all the repos. We waited a week, and then we auto-merged any of the ones that didn't have test failures or people complaining about them.
00:29:28
Speaker
They had a week; they should have responded. Some of them were two weeks, some of them were three weeks. We gave people ample time to notice the changes and respond to them. And one of our services was a wild data pipeline thing that I don't think anyone really understood how it worked.
00:29:48
Speaker
And it was the specialest of snowflakes in our service graph and infrastructure. It ran on some third party that we were paying a bunch of money to run our service there.
00:30:01
Speaker
Don't really know why. But this automated refactor, which passed tests and passed linting, got merged and rolled out and broke this production app.
00:30:13
Speaker
Not only did it break this production app, nobody noticed for like three or four days. So a whole slew of things had to go wrong here. The reason that it broke this production app is that our assumption — the thing that was written in the code base, that it was running Python 2.7 — was factually incorrect.

Unexpected Migration Challenges

00:30:32
Speaker
It was actually running Python 2.6, like four years after — actually probably even more than that — after 2.6 end of life. Did anyone tell security at any point? Once security noticed, they were very unhappy. But yeah, so then they had to roll it back and do a whole bunch of work. Okay, it was seven years past end of life at that point, because Python 2.6 went end of life in 2013.
00:31:04
Speaker
And what a throwback. But yeah, a whole bunch of things had to fail there. Like, the repo thought it was running 2.7 — it definitely was not. The tests thought they were running 2.7 — definitely was not.
00:31:17
Speaker
The linters thought they were running 2.7 — definitely was not. There was no monitoring in production, so if it was broken, the only way you would notice is by actually looking and noticing. And yeah, all those things lined up to be a
00:31:35
Speaker
pretty unfortunate situation.
00:31:40
Speaker
But I think the good and the bad of it is that was the worst thing that happened in this rollout. It was just that one service. Wait, so three days in, the service isn't running. Nobody has noticed.
00:31:53
Speaker
Yep.
00:31:55
Speaker
At that point, I would assume it would be down for weeks. What?
00:32:00
Speaker
Somebody did notice. Someone did notice after three days, and "oh no, you've imperiled the data pipeline" were the words that they used, which became a little bit of a meme amongst my friends. Anyway.
00:32:12
Speaker
Someone did notice — not due to any monitors or anything, but they had just tried to check the output of some model that they were running, and there was no output. And they were like, wait, this should have been here. Why is it not here?
00:32:30
Speaker
Dug into it, found that the app wasn't running, and then went from there. There was screaming and shouting involved, but that's a company culture problem and not a migration problem.
00:32:44
Speaker
That's wild. I'm trying to think... it feels like... where to start. How did the tests and linting pass in that scenario?
00:33:00
Speaker
It feels like something should have broken, or I don't understand computers as well as I think I do. Yeah, so the reason that it passed is the repo itself assumed it was running Python 2.7.
00:33:11
Speaker
It had no idea that production was a completely different version of Python. And because of that, it is actually a miracle that the build succeeded at all, because the build product that it produced targeted 2.7, and it happened to just run and be successful on 2.6. Yeah.
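As a sketch of how narrow that luck was: several core language features were only added in Python 2.7, so code "targeting 2.7" that happened to use any of them would die immediately on a 2.6 interpreter. These examples are illustrative, not from the service in question.

```python
# Each of these was introduced in Python 2.7 and would fail on 2.6:

counts = {c: ord(c) for c in "ab"}          # dict comprehension: SyntaxError on 2.6
uniq = {1, 2, 2, 3}                          # set literal: SyntaxError on 2.6
label = "service {} of {}".format(1, 150)    # auto-numbered {}: ValueError on 2.6
```

The pipeline service presumably just never exercised any of these until the automated refactor introduced one.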
00:33:30
Speaker
A lot of times you can have syntax compatibility without runtime compatibility. It was just getting lucky — and then unlucky. But... so what was the fix? Do you just need to run it on 2.7 in production?
00:33:45
Speaker
Oh, we went through a whole bunch of stuff with that, because, well, as soon as security noticed, they were like, what the heck? Also, the version of 2.6 that was running wasn't even the latest patch version, so immediately we were scrambling to either apply security patches to get it to the latest 2.6, or get to 2.7.
00:34:04
Speaker
The other part is we didn't control the machines it was running on. They were running on this third-party vendor that provisioned them, and they had limited network access and all sorts of weird stuff.
00:34:15
Speaker
So the first thing that we actually did was build 2.7 in an ETL job — like, run a C compiler in this bespoke third-party environment.
00:34:28
Speaker
And so then we had a blob that was the Python 2.7 interpreter and its standard library and stuff. And so we could ship a little Python 2.6 shim that would bootstrap 2.7, and then that service was actually running 2.7. That was at least to get it to a place where we have some amount of, you know,
00:34:47
Speaker
we're grounded in truth now. And then we used the same process to make that service run Python 3 once we had done the rest of the migration process. But we had to do some wild stuff to get it to even a normal state of running a supported version of Python.
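The bootstrap idea can be sketched like this: a tiny launcher, written against 2.6-era syntax, resolves the bundled interpreter inside the shipped blob and execs it. The paths, file names, and helper below are all hypothetical, not Lyft's actual shim.

```python
import os


def build_bootstrap_argv(blob_dir, script, script_args):
    # Resolve the bundled 2.7 interpreter inside the shipped blob and
    # build the argv the shim would hand to os.execv().
    interpreter = os.path.join(blob_dir, "bin", "python2.7")
    return [interpreter, script] + list(script_args)


# The shim itself, launched by the system's 2.6, would then do:
#   argv = build_bootstrap_argv("/opt/py27-blob", "etl_job.py", sys.argv[1:])
#   os.execv(argv[0], argv)  # replace the 2.6 process with the bundled 2.7
```

The same trick generalizes: once the blob is a Python 3 build, the identical shim bootstraps 3 instead.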
00:35:05
Speaker
If your production environment is not what you expect, it's amazing that anything works at

Creative Problem Solving at Yelp

00:35:12
Speaker
all. Surprise! I have to say, running a C compiler in an ETL job is probably not the wildest abuse of a data pipeline I've ever done.
00:35:21
Speaker
One of my first projects at Yelp was actually using MapReduce to resize every business photo at Yelp. Because MapReduce was the easiest way for us to get really cheap compute. The Map was trivial, and then the Reduce was side effects in S3. That is not the worst. Honestly, it feels like a very MapReduce-able problem.
00:35:49
Speaker
Yeah, but the Reduce didn't actually reduce anything, and the Map didn't actually map anything. It was really just a parallel for loop with conveniences. Alright, so let's talk a little bit about the scale-out
00:36:07
Speaker
of this process.

Manual and Automated Migration Balance

00:36:11
Speaker
So you have — you end up with a set of tooling, mostly running through refactorator and modernize, cleaning up people's code bases.
00:36:20
Speaker
How much, if you were to put a percentage on it — how much of that automation actually worked successfully? And then what did you do for the stuff that didn't?
00:36:32
Speaker
Yeah, so actually I think this is a good target number to hit for any sort of migration. We were right around 80 to 85% successful purely from automation, with no human input, which is extremely high. That's great. We got pretty lucky that the code wasn't completely terrible.
00:36:51
Speaker
That last 15 to 20%, whatever percentage was remaining — we had to do some manual work. And the way that we organized this was we had the two people from developer infrastructure that were working on this project.
00:37:09
Speaker
And then we recruited people from teams that were what I like to call infra-minded, or sort of wanting to learn more about other stuff — those types of people.
00:37:23
Speaker
And we organized, basically — I forget what we called them; they weren't "typing parties," but it was essentially the same idea — where we would check out a meeting room and hang out for a couple hours and crank through all the manual parts. And because we were all right there, in person, we were able to quickly say, hey, have you seen this particular case before? Yeah. Oh, here's what you do, here's how you fix that. This code got copy-pasted from over here, so apply the same fix that we did there.
00:37:56
Speaker
And we were able to basically have a team of about five people accomplish all the manual things. The other nice thing about these infra-minded individuals was they were often able to turn this around and be like, hey, look at this thing that I did to accomplish this company initiative — that looks good in a promotion packet.
00:38:20
Speaker
It is not to be understated how much that drives behavior and influence and the way projects get done at all. Especially at Lyft. Lyft was very heavy on the promotion grind set.
00:38:33
Speaker
You get what you incentivize. I don't think that's controversial. Yes. Impact, impact, impact. I do love the idea that there's almost this cascading...
00:38:45
Speaker
a set of — you know, you had a team working on things, but you didn't recruit the entire organization, you recruited these five people to come inside. Were there other layers that you thought about beyond that — spheres of influence, or groups that you needed to convince? Yeah, the early part of the project, before we even started working on it at all, I spent so much time convincing people.
00:39:15
Speaker
And a lot of it started with me going to my friends that were on feature teams and being like, hey, this is the thing that we want to do. Help me out — it sounds like it's something that you would want. Let's build some evidence that this is something we actually want to accomplish, and help us get some quotes that we can use in a proposal document for leadership and such.
00:39:40
Speaker
And so we started as grassroots, going around to people and being like, hey, what do you think of this? How is this going to work? But convincing a bunch of engineers is easy. Convincing engineering managers and directors and leadership is way more complicated, and a lot more work there.
00:40:01
Speaker
I think the biggest thing that we were able to convince people with is we had done one service as just kind of a trial, and we were mostly successful entirely with automation. And so that helped us go to leadership and say, hey, yes, this is a huge project, but we don't think it's going to impact people's day-to-day work that much. We think we can accomplish this with a small tiger team of people.
00:40:29
Speaker
We think that automation is going to be key here, and we're going to be mostly successful just from automating stuff. Yeah. And I think that when you go to someone and say, hey, the cost is low and we get these benefits, it's a lot easier to convince them than saying, oh yeah, we're going to need one engineer from every team to work on this full-time for a quarter. That's your "do not pass Go, do not collect $200." You're never going to get that project off the ground.
00:40:56
Speaker
Yeah, absolutely. I've been on the other side of that table, and you sort of get one or two tokens per company per quarter for

Migration Project Success

00:41:06
Speaker
that kind of project. And man, I hope your project is the one that makes it if you're going to ask for like a tenth of an FTE from every team — even though FTE math is not accurate, but we all do it. And two FTEs spread broadly is very different than, we're going to spend two people on this and they're just going to finish it.
00:41:26
Speaker
Yeah. And how do you manage your spoons? We're going to have these people, they're going to write automation. And proving the automation works is huge. You don't work there anymore, so you can answer this honestly: was that estimate correct? We completed by 2020.
00:41:45
Speaker
The estimate was very close to being correct. I think there were probably like five or ten services that slipped beyond the deadline. But a lot of those slipped not due to our team's process, but because they had a very important project that required them to block pull requests in the repo for that period of time.
00:42:11
Speaker
So even though our PRs were passing tests, they explicitly said, no, we're going to wait until we ship this large feature that we want to roll out. And we're like, okay, fine. Your feature is going to make money. Ours doesn't.
00:42:23
Speaker
You know, and we had agreements with them that they were like, we will follow up immediately after this to merge these and roll these out and make sure that's successful. But yeah, our estimate was actually very close to being correct. We started — I want to say... all this fiscal quarter stuff is so garbage, but like...
00:42:43
Speaker
February of 2019, with a goal of hitting September of 2019, to give us a couple of months of leeway in case we slipped. And I think we had almost everything done by mid-October.
00:42:59
Speaker
That's pretty close. A little slow, but most of it was just like five or ten services at the end. Given there were 150 services, we were pretty close.
00:43:10
Speaker
There is reasonable evidence that in large companies, if you bring in an outside consultant to modernize your services, something like 80% of those projects simply never complete — and they're giving themselves five years at a time. So a six-week slip seems fine. Honestly, yeah. Easy, easy. Sounds great.
00:43:31
Speaker
Yeah, you know, change the unit, double the number. Like, we didn't even do that. It was pretty much just, this is what we think the work is. We multiplied it out from our example service, and it turned out to be pretty close.
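The "multiply it out from the example service" estimate is simple enough to write down; all of the numbers below are made up for illustration, not Lyft's actual figures.

```python
def estimate_weeks(services, hours_per_service, engineers, hours_per_week=30):
    # Linear scale-up from the trial service:
    # total effort divided by the team's weekly throughput.
    return (services * hours_per_service) / (engineers * hours_per_week)


# e.g. 150 services at roughly 4 hours of manual work each,
# split across two infra engineers:
print(estimate_weeks(150, 4, 2))
```

The whole point of doing one trial service first is that `hours_per_service` becomes a measured number rather than a guess.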
00:43:43
Speaker
Having that knowledge from doing one first is just so key. That makes a lot of sense. All right, couple minutes left.

Technology Opinions and Thoughts

00:43:53
Speaker
So we'll do quick-fire to wrap things up. Cool.
00:43:57
Speaker
Alright, if one technology never existed, what would you choose?
00:44:06
Speaker
Like, I can delete any technology from existence? Gone. Off the map.
00:44:12
Speaker
Wait, I can only pick one, though? That's really tough. There's so many I would like to delete. Man.
00:44:24
Speaker
That's a tough one.
00:44:27
Speaker
Oh... I don't want to say something too spicy, so maybe we'll go with something a little less spicy. Some pain from a previous generation. It's dead now, so I can just retroactively delete it.
00:44:45
Speaker
I'm going to say ActiveX.
00:44:49
Speaker
There is, I think, a lot of people who already believe ActiveX never existed. Which is great. Which is great. Because it was miserable to develop for.
00:45:04
Speaker
Oh, that's great. I love it. What was the one tool, or feature of a tool, that you decided not to write during this migration, because it just wasn't worth it?
00:45:21
Speaker
We had started trying to do a parallel execution of both language versions in production, where we were going to run 2 and 3 side by side for a long period of time and compare their CPU and memory usage and stuff.
00:45:45
Speaker
And we had sketched out the deployment system to make this possible. We were basically going to duplicate every capital-D Kubernetes Deployment and run them side by side.
00:45:56
Speaker
And we just said, screw it — it'll just be a normal pull request, it'll be simpler anyway. And so we didn't end up pursuing that, but it was in our back pocket if somebody wanted something a little more cautious of a rollout, you know, the true blue-green comparison. But we didn't need it, so we didn't build it. Yeah, that stuff is much more typical for database migrations. That's a must-have if you're going to do something stateful. Yeah, that makes sense.
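The comparison step of that never-built tool reduces to flagging metrics where the Python 3 deployment looks worse than the Python 2 baseline. This is a hypothetical sketch — the function, metric names, and 10% threshold are all invented, since the tool was never written.

```python
def find_regressions(py2_metrics, py3_metrics, tolerance=0.10):
    # Flag any metric where the Python 3 deployment is more than
    # `tolerance` worse than the Python 2 baseline. Assumes lower
    # values are better for every metric compared here.
    regressions = {}
    for name, baseline in py2_metrics.items():
        candidate = py3_metrics.get(name)
        if candidate is not None and candidate > baseline * (1 + tolerance):
            regressions[name] = (baseline, candidate)
    return regressions
```

With stateless services, skipping this and trusting CI plus retries turned out to be the cheaper call.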
00:46:26
Speaker
I understand that. We're lucky that everything was stateless, so we didn't have to do anything too concrete like that. And last one: if folks have more questions, where can they find you on the internet?
00:46:39
Speaker
Ah. See, I've actually deleted most of my social media. Well, I'm going to shout myself out as an alternative. You can find me on YouTube at anthonywritescode, or on Twitch at anthonywritescode as well.
00:46:57
Speaker
And that's probably the easiest place to find me. And folks should definitely check out your stream. It's awesome. Thanks. Thanks.
00:47:05
Speaker
Cool. Well, thank you so much for coming on. This was really fun. Yeah, I'm always happy to chat about these things, and it was a really enjoyable chat with you.