The iOS Developer Who Picked Nomad Over Kubernetes | Ep. 14

Tern Stories
44 plays, 9 months ago

Tom Elliott joined Yext in 2015 as an iOS developer. Within a year, both his mobile projects were canceled.

But Tom's career crisis was nothing compared to what was happening with Yext's infrastructure. 

The company was hitting the limits of homegrown tools built by ex-Google engineers who had recreated their own version of Borg called Khan. It worked when Yext was small. Now they were running thousands of microservices—Tom calculated they had a higher microservice-to-employee ratio than Uber.

The real breaking point? Server upgrades. Every time the infrastructure team needed to upgrade a machine, they had to manually edit code to move services off, do the upgrade, then edit code again to move everything back.

"Everything was manually configured as to where things lay," Tom recalls.

Khan had been built by "one or two people" as a coalition of the willing.

Now Yext had grown past 20 teams, and adding the features they desperately needed—like automatic workload migration—would require massive investment in a homegrown system.

---

Get Tern Stories in your inbox: https://tern.sh/youtube

Connect with Tom: https://www.ocuroot.com/

Transcript

Yext's Early Infrastructure Challenges

00:00:00
Speaker
Everything was like manually configured as to where things lay. We went from like less than 10 servers to 80. If we fell behind, there was no way to recover. That's terrifying. It was very much the pace and the nature of the work that made it drain people in the way that it did. You could run the script in a dry run mode for every single job that a team had. And then it would say, all right, there is like a 99% chance that these 90 jobs are just going to work. I would much prefer people did a ton of prototyping, like with those, you know, dry runs and all that kind of thing, or like just gather information about stuff and have that at their disposal.
00:00:43
Speaker
and then have like a more rough timeline and schedule for things and then use that information to to figure out the best way forward in web. Absolutely, I

Infrastructure Transformation with Tom Elliott

00:00:52
Speaker
agree. Today on Tern Stories, we've got Tom Elliott.
00:00:56
Speaker
Tom is currently the founder at Ocuroot. But before that, he worked on the infrastructure at Yext, where he joined as a senior software engineer, ended up as director of engineering, and oversaw a ton of their transformation.
00:01:11
Speaker
We're going to talk about why they didn't use Kubernetes today and all the journey that happened along there and how they ended up where they did. But welcome, Tom. It's great to be here.
00:01:24
Speaker
So, all right, let's get right into it. Because I was looking at kind of the trajectory of Yext and trying to put a couple of dates together. You talked about the decisions you made and the company. What I didn't realize is that Yext had raised their Series F shortly before you joined. $50 million on like half a billion dollar valuation.
00:01:45
Speaker
Yeah. And you joined, if I'm not mistaken, like the week that Kubernetes went to general availability. So you were at this like perfect storm of scale and like infrastructure churn in the world.
00:02:02
Speaker
Tell me a little bit about what it was like to walk into Yext at that point. And like, what did you see? Yeah, ah so it was it was interesting, especially because at the time i was coming from a really small startup.
00:02:15
Speaker
When I left the company before, there were like four people there. At most, they'd been eight. And then I joined this 300-person company. It's now like a 1,500-person company.
00:02:27
Speaker
um and it it was It was interesting because there was a lot of stuff that was already there. you know and a lot of history already, but there were also a lot of practices that were they were still trying to figure out and like sorting out the best um the best approach for

Legacy Systems and Team Structure

00:02:42
Speaker
things. They they weren't doing CI at that point. It was all like manual deploys from someone's laptop, which... Well, desktop at that time. Oh, everyone had desktops.
00:02:51
Speaker
ah You were considered really fancy if you got one of the trashcan Macs back in the day. That's how long ago this was. And... yeah Yeah, it's funny you mentioned the GA of Kubernetes.
00:03:03
Speaker
ah that That timing sounds about right. i hadn't put that one together myself. um But at the time, Yext was using um this orchestrator that they built in-house um because that sort of stuff didn't exist.
00:03:15
Speaker
And they'd done the classic thing in that period of time of just hiring a bunch of people from Google, which was relatively easy for them because they were right across the street from Google at the time. So it was like, oh, hey, your commute doesn't have to change. Welcome aboard. So those engineers all went about basically rebuilding the tools that they'd already had and being like, oh, I wish I still had this. So they ended up with a version of Borg that they called Khan.
00:03:43
Speaker
um And I thought it was beautifully ironic that we ended up using Nomad because that is yet another Star Trek villain. But... It's kind of a deep cut Star Trek villain, but still. That's funny.
00:03:55
Speaker
um Interesting. That's yeah. The desire to use ah your old tools is definitely real and especially real in people who worked at Google. No offense to anyone who's listening who worked at Google.
00:04:06
Speaker
Yeah. Cool. So when you got there, how were things going? Were you handling the scale in a way that felt sustainable, or were there sort of cracks in the woodwork?
00:04:23
Speaker
um I think it was it was generally working well at that point. There was definitely a sense that... you know People were starting to wrestle with legacy at that point. And there were like you know some things that had been around for a while that people didn't want to touch.
00:04:39
Speaker
um and But there was a lot of exciting new stuff going on. like and i I joined right when someone had finished up their internationalization project. So there was this whole fun internal framework for that.
00:04:52
Speaker
um Many years going around trying different ways of doing front end. and That was always a struggle until they ah they adopted React. um and um But it yeah, it was like there was like, what, six or so teams, relatively small, like four to eight people.
00:05:11
Speaker
That was a structure that worked really well for Yext and actually kind of kept going. It was a really big deal when they had added another layer of management on top of those teams.
00:05:22
Speaker
um But like each team had a really well-defined thing. They had a concept of owning your entire stack. So the you know each team would be responsible for a product vertical and would own database right to front end.
00:05:37
Speaker
So you had to do a little bit of everything. um I joined a team that compounded that because we were doing mobile as well. So we were like you um back-end database,
00:05:50
Speaker
all the stuff in the middle, front end, and mobile apps on the side. Interesting. how so How many engineers were you at that point? It was like somewhere between 30 and 50.
00:06:03
Speaker
Okay. Even at that scale, it seems like there is... there's all There would be a lot of like either duplication or confusion. was there how do you How did that model work for you all where you basically had teams that had to own the entire stack? like i see I could see the upsides, but it seems like there's maybe downsides to that as well.

Engineering Practices and Tooling

00:06:23
Speaker
Yeah, I mean, I think it worked well because they were really good early on at making sure that there were consistent patterns and that there were common ways of doing things. You know, they enforced a very small set of languages. It was a really big deal when they added Go as well as Java.
00:06:42
Speaker
um They were very specific about you know where you could deploy, um what sort of tools you would be expected to use. um ah Over time, and this sort of like was kind of there when I joined, but got better as ah as it went, um the onboarding process. it was It was generally understood that most people would be using the same IDE.
00:07:03
Speaker
and They ended up gravitating towards JetBrains IDEs. um And there was a big push to like you know deploy code on your first day, which required quite a lot of consistency.
00:07:14
Speaker
um And so the the challenge of teams sort of doing things their own way or like diverging from that, that only really became a problem around about 15 teams.
00:07:25
Speaker
And when they started to introduce like groups on top of the team that then had their own way of doing things, um and the products started to get a little bit further apart. Interesting. So was that that growth from like, you know, four to four, six, eight teams up to 15 all sharing basically technology based on cultural norms?
00:07:47
Speaker
Was that the trigger that caused you to go investigate something that wasn't Khan? Or was there another trigger?

Evaluating Orchestration Solutions

00:07:58
Speaker
There were a couple of contributing factors. I mean, one of them was that we didn't really want to be investing all the time into Khan at that point. And, you know, it was like one or two people were working on it. And that was kind of sustainable for a while. But as things got bigger, you know, you'd have to dedicate more and more resources to it, which...
00:08:18
Speaker
didn't make a whole lot of sense since there were options available. um And it was probably more around the 20 team mark where we started to see like the challenges there because then you were getting into places where there were features that we were clearly missing that would have taken a lot of work to put in place.
00:08:37
Speaker
So we had a separate infrastructure team who were managing all the servers. Everything was in a data center in New Jersey. And anytime they wanted to upgrade one of those machines, they would have to go through this whole dance. So it used to be that they would actually have to change code to move stuff off of those machines.
00:08:56
Speaker
and then change the code back to put it all back. And it's like everything was manually configured as to where things lay. And, you know, there was a lot of work involved in adding a new server or that kind of thing, which you didn't have to do very often when it was small. But, you know, in my time there, we went from like, you know, less than 10 servers to 80 in one environment.
00:09:19
Speaker
And So that that was sort of a pain point that we saw. And there were like tools that were put in place to like help facilitate those moves and and all that kind of thing. But like we were itching to have something that was more powerful, but also a little bit more standard so it could rely on either another company or um And we had a couple of tries at Kubernetes.
00:09:44
Speaker
ah The first try suffered because we were hosting our own stuff in a data center. This was like before a lot of the tooling that made that more straightforward for Kubernetes.
00:09:56
Speaker
So we really struggled to get the management aspect of it together and feel comfortable with being able to trust that these clusters were going to be reliable enough for us in that data center.

Kubernetes Implementation Challenges

00:10:10
Speaker
Yeah. and I mean, this was, this was quite a while ago. Sure. Sure. But what you what did you hit? Do you remember? um it was Yeah, there were there were a few a few incidents where um basically just like you know a whole bunch of nodes went down, ah things didn't get moved around correctly, and then like we were doing some funky stateful stuff for some reason, so like we lost things, and um and the the upgrade story was really the biggest challenge for us.
00:10:38
Speaker
um Interesting. yeah I probably shouldn't have said that we'd lost things, because I don't think we lost any data, but... Yeah, but you can have, i mean, this is similar at at Slack where like, if you lose a cache, you haven't lost data, but you have lost operational stability. Absolutely.
00:10:55
Speaker
Yeah. Oh yeah. That's a, that's a, that's a really great clarification because that was like, we'd have things like RabbitMQ node goes down, you lose that queue, you've got to recreate it from the database and that takes time.
00:11:06
Speaker
Yeah, so absolutely. there's Semi-stateful. We need a word for stuff that isn't like data loss, but really, really hurts when it gets rescheduled. yeah Yeah, absolutely.
00:11:17
Speaker
Yeah, so there was that, and yeah, the upgrade path was a real pain. And, you know, it's still not the easiest thing in the world to upgrade Kubernetes. Yeah. But if you have it foisted on you by your provider and all that kind of thing.
00:11:31
Speaker
But, you know, we really struggled getting a good cadence down for that, which wasn't previously an issue with Khan because we just upgraded each and every node. It was pretty straightforward.
00:11:44
Speaker
So, you know, we sort of shelved that. And then there was another project exploring using Kubernetes in cloud regions, because we had a bunch of cloud regions
00:12:01
Speaker
for serving consumer data that was like read-only, API search, all those kinds of things. And we explored that for a while. I think we tried to bite off more than we could chew with that. And, you know, we tried to basically reinvent everything all in one go, which was, unsurprisingly, in retrospect, very difficult.
00:12:25
Speaker
um So we ended up... ah This might be skipping ahead, but let me ask you a question that might be a little disrespectful. was Was this a problem with Kubernetes or a problem with the way you had deployed Kubernetes?
00:12:40
Speaker
ah In the first case, it was absolutely about how we deployed Kubernetes. um and In the second case, I think it was more of a project planning challenge.
00:12:51
Speaker
It was a very big vision and there were a lot of great things about that vision that we kept hold of afterwards. But the plan was like, hey, we're going to reinvent everything and it's going to be the world and this is going to be the best thing since sliced bread. And yeah, I'm sure that's a pretty familiar story to a lot of people.
00:13:11
Speaker
um And so that that, again, it wasn't necessarily Kubernetes. It was like all the work that we thought that we had to do to get to Kubernetes at that point was sort of the the big the big challenge. And we sort of boiled that down in retrospect.
00:13:27
Speaker
Had a look back over that project and we thought, all right, well, what what do we need to get something that's just what what's going to drive us forward over the next 10 years?
00:13:38
Speaker
yeah um And we kept hitting up against the ah lift of doing a complete Kubernetes migration um involving containerization.
00:13:51
Speaker
because we were running bare processes on servers at that point. And we were relying on having things installed on those servers. But I'm sure a lot of people would relate to problems with ImageMagick.
00:14:06
Speaker
I've heard many people complain about ImageMagick at many companies because it's just one of those dependencies you end up with and you've got to remember to have it. So there was a lot of stuff that was already set up on these servers. The responsibility was spread around.
00:14:21
Speaker
And by this time, we had like, you know, maybe a couple of thousand jobs running. At one point, I did the math and worked out that we had a higher microservice-to-employee density than Uber did, which is probably not the flex that I thought it was at the time.
00:14:43
Speaker
um In retrospect, certainly not.
00:14:48
Speaker
So, let me ask you this, because, you know, I love this model where you have like tightly aligned product teams who own soup to nuts, the entire stack.
00:14:59
Speaker
I love a little bit less that maybe those people had to think about the bare metal in a data center in New Jersey, or at least when I think about working there. um
00:15:10
Speaker
Who cared enough to go work on Kubernetes or keeping Khan up, like driving these decisions forward? Because my naive assumption would be that any given product team doesn't care that much because they've got a roadmap and OKRs and a CTO breathing down their neck.

Leadership and Strategic Decisions

00:15:28
Speaker
Who is who is willing to go to bat for like we need a major platform shift here? ah Funnily enough, the CTO, who was really happy with me. We were really fortunate in that we had a few very, very technically strong senior leaders um with within within the org. And um yeah a guy here and I can't remember if he became the CTO before or after proposing this project. It was probably before.
00:15:56
Speaker
He had been wrestling with these kind of problems for a long time. He had a big hand in both the previous attempts. Like the first one, the migration part was like his baby. And he cooked up a really good plan. Actually got us to a point when most things were running in Kubernetes, which helped us later on.
00:16:13
Speaker
um And so he he proposed looking at Nomad explicitly because it didn't require you to run containers on. Interesting.
00:16:24
Speaker
um And he he made the leap that this was like this was the the big slowdown factor that was going to cause us problems. And just being able to effectively lift and shift most of our jobs um seemed a lot more tenable.
00:16:40
Speaker
Interesting. Very cool. That makes a ton of sense. You could pick Nomad. Like, you could make that decision because you had done all of this work with Kubernetes and sketched out, like, well, here's all the work and 90% of it's going to be putting stuff in containers.
00:16:55
Speaker
What if we don't do that? Yeah. And there was like you know networks and all all those other things. and um the The irony being that we'd effectively built something that was based on Google's internal tools.
00:17:10
Speaker
But the public open source project that they created diverged so far from that that we had to go and take a third party. That's funny.
00:17:21
Speaker
I have heard that before.
00:17:25
Speaker
Yeah, it's it's kind of interesting because it's like, you know, there's also the the time gap as well. Like, you know, and Kubernetes is way more complicated than it it was when it was originally.
00:17:38
Speaker
But as I recall, it didn't even have deployments initially, right? It was like all pods that you were managing yourself in the first release. Yeah, deployments are a later abstraction, because who could have known that you need to deploy software? Yeah, who could have thought that you might want more than one instance of it? I'm glad they figured it out, honestly. Yeah, absolutely. But yeah, so a lot of the advantages of Nomad was that it wasn't trying to be everything.
00:18:14
Speaker
Yeah. So, you know, you could add on Consul, you could add on Vault and all these other things to do the fancy additional bits. But if you just want to run a process, yeah, Nomad's got you.

Migration to Nomad

00:18:26
Speaker
Um,
00:18:27
Speaker
So your CTO makes this decision. It's like, I've seen too much with Kubernetes. We're not doing that work. We're going to go to Nomad. We're just picking up Nomad, not the entire HashiCorp stack. What was the goal? Was it to get everything off of Khan? Were there certain services or processes that were more important? Like, how did you think about actually getting from here to the promised land?
00:18:52
Speaker
So the biggest motivating factor for us was enabling the infrastructure team to do upgrades of individual servers easily. So getting everything in the data center.
00:19:06
Speaker
onto Nomad was our primary goal. And we you know we did have these additional points of presence, but we weren't too worried about the upgrade story there because it was, um you know you could do immutable infrastructure just withdrawing the grid.
00:19:23
Speaker
sorry um So we start off with the plan. Let's look at the data center first. That's the big thing. And there were um we wanted to look at ways that we could order the migration to be based on business needs.
00:19:39
Speaker
And we'd considered like, okay, so we'll do the, you know, the least important things first, as in like those few services that like, oh, maybe they run a cron job for something that only really impacts, you know, 30% of our customers.
00:19:55
Speaker
If that cron job doesn't happen, we can do it manually. You know, we can, we can recover this pretty easily. And then there's other things like, oh, there's this you know this service that provides a front end for a product that a very small number of customers are using. So you know the the risk there is much lower. And then like doing the most significant things last. like Everyone relies on this. This is absolutely critical. If this goes down for more than 20 minutes, we're in trouble kind of services.
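The risk-ordered plan Tom describes here (least critical services first, most critical last) can be sketched as a simple sort. This is an illustration only: the `Service` fields, the service names, and the scoring rule that halves the risk when a manual fallback exists are all assumptions, not anything Yext is known to have used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    customer_share: float   # assumed: fraction of customers hurt by an outage
    manual_fallback: bool   # assumed: can the work be done by hand meanwhile?


def risk_score(service):
    """Hypothetical score: blast radius, halved if a manual fallback exists."""
    discount = 0.5 if service.manual_fallback else 1.0
    return service.customer_share * discount


def migration_order(services):
    """Return services sorted from least risky to most risky."""
    return sorted(services, key=risk_score)


services = [
    Service("core-api", customer_share=1.0, manual_fallback=False),
    Service("report-cron", customer_share=0.3, manual_fallback=True),
    Service("niche-frontend", customer_share=0.02, manual_fallback=False),
]

for service in migration_order(services):
    print(service.name)
```

The ordering only works if there is spare capacity to move any service at any time, which is the constraint Tom says they ran into next.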
00:20:24
Speaker
um So we thought about doing that, and then we realized we couldn't. because this was in a data center and we did not have that much headroom. So budgets being what they are, I think we were able to maintain a headroom of maybe five servers at a time.
00:20:42
Speaker
um And but around then we had 50 or 60 servers to worry about. um So we ended up doing this like two to three machines at a time.
00:20:55
Speaker
and we would be doing direct migration of the jobs that were pinned in Khan over to Nomad. And so that had to be coordinated much more closely.
00:21:10
Speaker
um So in the end, we created a task force, I think we called it a task force. We might have had a fancier name. but um And that was formed of like a bunch of people from my group ah focused on DevOps.
00:21:27
Speaker
um It had a a permanent member in one of our architects who was like really good at this kind of thing. um And then we would rotate people in from the other groups.
00:21:40
Speaker
um And we kind of try and schedule it so that we do the migration group by group. um And because of the way Khan had grown up, we actually ended up with like dedicated machines for certain aspects of the product. So it actually made group by group migration kind of tenable because they effectively owned the whole of two or three servers each.
00:22:00
Speaker
um the The problem with that was that if we fell behind, there was no way to recover.
00:22:11
Speaker
It's terrifying. So it's like, if you've we've like scheduled this out, and we've said to all of these teams, this is the schedule for when your people are on deck, and when we need to steal someone for a sprint from your team, which, like you know, you you say that to anyone, like, you're not going to have this this person on your team for a few weeks, they' they're not going to feel good about that.
00:22:33
Speaker
But every time, like the first few weeks, we were only getting like 70% of the way there. And then we'd have an issue where maybe we have to do a bunch of rollbacks or we discover something we hadn't expected. So we've got to spend time working on that. You can't meet your goal.
00:22:53
Speaker
So there ended up being like this kind of semi mopping up kind of process. Like we had to extend some weeks, which pushed the whole schedule out and kind of juggle things around a little bit.
00:23:05
Speaker
So, um you know, we were sort of forced into doing things in that kind of sequential way, but like um it, was was very, very difficult to maintain pace and to deal with those sort of like unmovable goals.
00:23:23
Speaker
Yeah, that but that kind of that kind of pressure coupled with what is effectively exploratory work. Like you have a process, but every service might be a little bit different. is i That is difficult.
00:23:38
Speaker
um What sort of things did you run into? Were there were there like particularly gnarly cases that, with the benefit of hindsight, you you would have liked to have known about so you could keep yourself better on track?
00:23:51
Speaker
The biggest problems tended to be where we discovered there wasn't a particular package installed on the host or there was a quirk of configuration that wasn't quite as we expected it.
00:24:06
Speaker
Like, I think we had issues with the number of available ports a couple of times. And anytime we discovered there was something missing, one of those things, we had to go to the infra team and be like, can you get someone to drop everything to do this for us as quickly as possible, please? We didn't have low-level access to one of these machines.
00:24:29
Speaker
Gotcha. And this was stuff where you would discover like you'd move a service over and it was sitting there listening on a port that does not talk to the internet? there was Yeah, but that was definitely a thing.
00:24:40
Speaker
Yeah. But I think some of the migration did involve like standardizing ports. But then there were patterns where it would work fine in Khan, but the convention that we'd set up was like, oh, you have to set the port number by passing in this particular flag. If one service didn't do that, it probably didn't matter that much in Khan. But as soon as you move it over to Nomad and it needs to advertise, that all goes out the window.
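The convention Tom mentions, where a service takes its port from a flag rather than hard-coding it, might look like the sketch below. The flag name `--port` is an assumption; the conversation only says "this particular flag".

```python
import argparse


def parse_port(argv):
    """Return the port to listen on, taken from a required --port flag.

    Making the flag required means a service that ignores the convention
    fails at startup instead of quietly binding a hard-coded port that the
    scheduler does not know about.
    """
    parser = argparse.ArgumentParser(description="example service")
    parser.add_argument("--port", type=int, required=True,
                        help="port assigned to this instance")
    args = parser.parse_args(argv)
    if not 1 <= args.port <= 65535:
        parser.error(f"port out of range: {args.port}")
    return args.port
```

A service would call `parse_port(sys.argv[1:])` at startup and bind to the returned value, so whatever launches it can choose the port and advertise it.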
00:25:11
Speaker
Yeah. Yeah. yeah Yeah. Did you build tools as you were going through this, to like catch common bugs, try and make it, make it go smoother the further you got in the process?

Automating the Migration Process

00:25:20
Speaker
um A lot of it was scriptable.
00:25:22
Speaker
And that was one of the really nice advantages of having all of these conventions and standard ways of doing things. Because we could, for a large part, sort of automate the generation of the Nomad config from the
00:25:33
Speaker
Khan config. We actually ended up having a wrapper written around the Nomad config to make it even closer. And so we had this sort of additional YAML DSL for managing all these things, which I wasn't the biggest fan of, I will admit.
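A rough sketch of what generating one config from another could look like. Khan was internal to Yext and its format is not described in the conversation, so every field name on the input side is hypothetical, and the output is a simplified dictionary loosely modeled on a Nomad job rather than the real job schema. The `raw_exec` driver name is used here to stand for "run a bare process without a container".

```python
def khan_to_nomad(khan_job):
    """Translate a hypothetical Khan job description into a Nomad-like dict."""
    required = ("name", "command", "port")
    missing = [key for key in required if key not in khan_job]
    if missing:
        # Jobs that break the conventions cannot be converted automatically.
        raise ValueError(f"{khan_job.get('name', '<unnamed>')}: missing {missing}")
    return {
        "job": khan_job["name"],
        "group": {
            "count": khan_job.get("instances", 1),
            "task": {
                "driver": "raw_exec",  # assumed: bare process, no container
                "command": khan_job["command"],
                "args": ["--port", str(khan_job["port"])]
                        + list(khan_job.get("args", [])),
            },
        },
    }


example = khan_to_nomad({
    "name": "search-api",                 # hypothetical service
    "command": "/opt/search/bin/server",  # hypothetical path
    "port": 9001,
    "instances": 2,
})
print(example["group"]["task"]["args"])
```

The conversion is only this mechanical because the inputs follow shared conventions; a job that is missing a field is rejected for manual handling rather than guessed at.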
00:25:50
Speaker
um it did It did help with a couple of things. um So that that that was largely automatable. ah There was also you know scripts to move something over, wait till it came up, and then like report on a couple of common, yeah do health checks, make sure all the ports are open, all that sort of thing.
00:26:10
Speaker
um And we made sure that our monitoring was very, very good when we were doing this. So like if something suddenly went wrong, like it hit an edge case where it needed this you know one package or config on the server, and then if it wasn't there, the whole thing would fall over, we'd be able to catch that quickly.
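The "wait till it came up, make sure all the ports are open" step can be approximated with a plain TCP probe. This is a minimal sketch using only the standard library; the function name, timeouts, and retry interval are illustrative assumptions, and a real check would likely also hit an application-level health endpoint.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Retries until roughly timeout_s seconds have passed, then returns False.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A migration script could call `wait_for_port(host, port)` after moving a job and roll back, or flag the job for attention, when it returns `False`.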
00:26:28
Speaker
Got it. That makes sense. So some of this stuff is largely automatable. How did you think about the line between what was worth automating and what was just better to do manually, given that you were going to have to do this 5.7 times per engineer?
00:26:45
Speaker
Yeah, and I mean, I think the big thing was figuring out, once things were automatable in a repeatable way, how often you could batch things together to minimize human interaction. Because if, you know, one team has 100 services that they need to worry about, you don't want to have to run the same script 100 times.
00:27:04
Speaker
ah You want to be able to batch things together and like yeah but we we ended up like usually doing maybe five services to a single commit, which seemed to be a ah good enough size that you could effectively review it.
00:27:18
Speaker
And then, like, rolling back wasn't too big of a deal, but you weren't flooded with all these different commits that needed to be reviewed. Yeah, that's an interesting way of thinking about it: you will move at a certain speed. Of course, there's some upside to having it all done tomorrow.
00:27:38
Speaker
But the risk is probably too high. So finding like five per gets you the risk profile you want. I like it a lot. Yeah, makes sense. Yeah, and I think that was a big part of selling it to the individual teams as well.
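The batching idea (about five services per commit, so each change stays reviewable and easy to roll back) is simple to express. A minimal sketch, with hypothetical service names; the default of five comes from the conversation, everything else is illustrative.

```python
def batch(items, size=5):
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


services = [f"svc-{n}" for n in range(1, 13)]  # 12 hypothetical services

for number, group in enumerate(batch(services), start=1):
    print(f"commit {number}: {', '.join(group)}")
```

The batch size is the tuning knob: larger batches mean fewer reviews but a bigger rollback if one service misbehaves.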
00:27:53
Speaker
um yeah Because if something goes wrong, they were probably going to be pulled in. um You know, things went wrong enough. So it's like, oh, we're doing a thing that might get you paged.
00:28:04
Speaker
ah This is what we're doing to try and avoid that and make it as easy as possible for you to help if you are in. Gotcha. how How did you work with those teams? you're You're about to go rip their service out and maybe break it.
00:28:17
Speaker
Like what kind of prep did you do with them? Um, so having, having a member of their team, like on the task force was a really big, a big help because that was someone who kind of knew what was going on and they, they could sort of be the face of the migration for that team.
00:28:34
Speaker
Um, but we, we talked about this plan for months ahead of kicking it off. Um, and the, like with this and sort of a follow on migration I might hint about in a minute, um it was really important to keep communicating and like keep all the teams informed about what was going on.
00:28:56
Speaker
And at this point, we did like a bi-weekly demo at the end of every sprint. So it was pretty easy for us to slot in a video for the migration program process to say, like hey, this is what has happened. These are some of the interesting challenges. This is what we're going to be doing next time.
00:29:10
Speaker
um And usually, I think, when we had someone scheduled to come onto the task force to do some migration projects,
00:29:21
Speaker
they'd probably spend the sprint before planning it and just double-checking things and and having an idea of what order they wanted to do things in. Cool. That makes sense. um was there as you As you went through that, you know you prepped all the teams, and you built the schedule, and you're going through it.
00:29:37
Speaker
How was your team's morale over the course of the project? Like, did it get better? Or was there like a low point or a high point? It ebbed and flowed.
00:29:49
Speaker
um you know the you know If we have have a bad sprint, people are going to feel down about that. um Getting it done felt good, but I think a few people did feel a little bit burnt out after that.
00:30:00
Speaker
and um and i like I think that it was very much the pace and the nature of the work that made it drain people in the way that it did.
00:30:14
Speaker
um we were We were fortunate because after we'd done this this migration to Nomad, like a year, two years afterwards, we ended up having to do something similar to containerize everything.
00:30:26
Speaker
You did end up containerizing everything. We did end up containerizing everything. And that that went significantly more smoothly because we could like we didn't have to do it machine by machine because Nomad helped us move things around a little bit.
00:30:38
Speaker
And from the automation standpoint, we were able to go through a process where you could run the script in a dry run mode for every single job that a team had.
00:30:53
Speaker
And then it would say, all right, there is like a 99% chance that these 90 jobs are just going to work if you run the script. These 10 have a problem. Yeah.
00:31:08
Speaker
And we may or may not know exactly what that problem is at this point, but these are going to need more attention. So we were able to sort of pre-roll parts of the migration so we could see where the attention needed to be before it became an urgent issue.
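A minimal sketch of the dry-run triage described here, assuming a hypothetical `dry_run` check and hypothetical job fields (`image`, `port`); the actual script and its checks are not described in the conversation. It only shows the shape of the idea: check every job without deploying anything, then split the results into jobs that should just work and jobs that need attention.

```python
from dataclasses import dataclass


@dataclass
class DryRunResult:
    job: str
    ok: bool
    detail: str = ""


def dry_run(job: dict) -> DryRunResult:
    """Hypothetical check: validate a job definition without deploying it."""
    # The required fields are illustrative assumptions, not the real rules.
    missing = [key for key in ("image", "port") if key not in job]
    if missing:
        return DryRunResult(job["name"], False, "missing: " + ", ".join(missing))
    return DryRunResult(job["name"], True)


def triage(jobs):
    """Dry-run every job and split the results into two lists."""
    results = [dry_run(job) for job in jobs]
    ready = [r for r in results if r.ok]
    needs_attention = [r for r in results if not r.ok]
    return ready, needs_attention


jobs = [
    {"name": "pages-api", "image": "pages-api:1", "port": 8080},
    {"name": "legacy-cron", "image": "legacy-cron:1"},  # no port declared
]
ready, needs_attention = triage(jobs)
print(f"{len(ready)} ready, {len(needs_attention)} need attention")
```

The value is in the second list: it tells a team where to spend time before anything becomes urgent.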
00:31:24
Speaker
That sounds huge. That feels like it's the kind of thing that would just totally change the shape of the effort. Yeah, absolutely. It made it significantly more comfortable and it also meant that we didn't have to centralize all of the work within one team. So we got a really nice kind of side benefit of, like, okay, we had to work closely with each of the groups and each of the teams to schedule the work for them.
00:31:51
Speaker
And usually it was like between two projects, or, you know, they dedicated a couple of people to it, but they could do it how they saw fit. But it became a race. We had a leaderboard. We had a very public list of every single team, how many jobs they'd migrated and how many they had left to go.
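A leaderboard like this can be very small. The sketch below is an assumption about its shape, with invented team names and counts; it ranks teams by how many jobs they have left to migrate.

```python
def leaderboard(progress):
    """progress maps team -> (migrated, total); fewest remaining jobs first."""
    rows = [
        (team, migrated, total - migrated)
        for team, (migrated, total) in progress.items()
    ]
    # Sort by remaining jobs, then by team name to keep ties stable.
    return sorted(rows, key=lambda row: (row[2], row[0]))


# Illustrative data only.
progress = {"pages": (40, 50), "search": (12, 12), "listings": (5, 60)}
for team, migrated, remaining in leaderboard(progress):
    print(f"{team:<10} migrated={migrated:>3} remaining={remaining:>3}")
```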
00:32:13
Speaker
And so no one wanted to be last. I love it. So they were incentivized to schedule this work and just knock it out. And it,
00:32:27
Speaker
It was a little more, you know, sort of upfront planning and prep work, but maybe we did a month of building the tooling, building the leaderboard and all that kind of thing. And then we could just sit back and nudge people occasionally.
00:32:40
Speaker
If they asked for help, we could help. We could jump in, but it sort of became this kind of snowball community effect. That's cool. I mean, that sounds qualitatively different than grinding through with the teams under high pressure.
00:32:58
Speaker
I mean, knowing what you know now, is there a different approach that you would have taken to the Nomad migration in order to get that same kind of vibe, for lack of a better word? It's a good question. I mean, the...
00:33:12
Speaker
having that restriction on the servers was always going to be a big problem. It would have been really nice if we could sort of, like, dual run Khan and Nomad at the same time. And I think we talked about that with the infrastructure team and they weren't entirely comfortable with doing it that way.
00:33:32
Speaker
What I probably would have done would have been to explore tools that would have allowed you to do that kind of dry run process. And sort of maybe, like, you know, you set up one server that has limited access. It's not going to be hooked up to the proxies or anything like that.
00:33:54
Speaker
It wouldn't be load balanced. um You just have one server sitting there and then just cycle through all the jobs and try to bring them up in Nomad just to see if they can operate on that platform.
00:34:08
Speaker
um And then you've got your list of like, okay, what needs tweaking? What needs fixing? Maybe we can move a few jobs around ahead of time before we start doing the machine version.
00:34:19
Speaker
Interesting. Yeah. That makes a ton of sense. Everything's a feedback loop problem. Yeah. And if you can push that feedback loop to the... Yeah. And if teams can understand, like, this is broken and it doesn't seem like it should be broken, then you don't have to call them when the service is threatening to be down.
00:34:39
Speaker
Yeah. I mean, it's another, you know, shifting left. It's like, you know, the more problems you can identify and anticipate early on when the risk is low, the easier a time you're going to have when it is critical.
00:34:53
Speaker
Yeah, absolutely. I don't think I caught it. What was the impetus for the Docker migration? You decided to put it off and then you came back to it. Well, yeah. So, yeah, I mean, originally we'd thought, okay, well, you know, we can run bare processes in Nomad. That's going to take a lot of pain away.
00:35:13
Speaker
Great. We've at least bought ourselves some time. And we kind of wanted to do containers, but the original reasoning behind wanting to do containers was it's slightly less work for the infra team, which would have made them very happy, but it's hard to sell to the product engineers.
00:35:28
Speaker
Where's the benefit to me, they say. But also it would have given us more flexibility around, you know, how you configure each individual job and all that sort of thing.
00:35:42
Speaker
There's always the lingering, oh, it's best practice. This is how everyone else does it. Try to ignore that. But the motivation in the end was actually a security audit.
00:35:57
Speaker
Oh. We were using CentOS for a while and that went end of life. So we were like, all right, clock is ticking for us to change the OS on every single server so that we can pass this audit.
00:36:12
Speaker
So we've got a very strict deadline. And there's like a push to do it. You can sell that to the teams. You'd be like, hey, you know, if we lose this audit, we definitely lose customers, especially in sensitive industries.
00:36:30
Speaker
You could ballpark a dollar amount that that would come to. So the reason for containerization was, well, hey, we've got to strip out the OS and start again.
00:36:43
Speaker
So we don't want to have to do this, like, whack-a-mole with packages again. So let's just containerize everything and just drop it onto this new OS. Oh, that's fascinating. We saw something extremely similar at Slack, actually, where every two years we had to do an Ubuntu upgrade. And the discussion with every team every time was: it's going to be twice as hard for you to do a Docker upgrade to containerize your stuff now, but it's the last time you have to do it.
00:37:12
Speaker
You think you're going to work here for another two years? yeah
00:37:16
Speaker
Most of Slack is containers now. Yeah, absolutely. You know, knowing that that is coming and it is something you have to deal with is a huge motivator. For us, it meant that we actually had to get to this deadline. You know, the Nomad thing, I think we ended up going over by maybe a month.
00:37:42
Speaker
The pain of missing that deadline was just, oh, we're stuck, we've got to continue to think about this thing. Right. Whereas, yeah, the audit was a real push.
00:37:55
Speaker
Coming down from the top, everyone could get behind that. And we actually ended up finishing a month early on the containerization, which was very nice. Hey, there you go. That dry run script and having everyone, that's real.
00:38:07
Speaker
That's awesome.

Career Evolution of Tom Elliott

00:38:09
Speaker
Very cool. I wanted to ask about your journey in this a little bit, because you joined as an IC, as a senior engineer, on this. By the end of it, you were director of this group.
00:38:19
Speaker
What was your journey like over the course of this time? It was actually more complicated than that, because I was not just hired as an IC. I was hired as a mobile developer, specifically an iOS developer.
00:38:38
Speaker
That doesn't sound like this area of the stack. Yeah. I was there for nine years. A lot of things happened. So I joined explicitly to work on this one mobile project that they were working on. And then I got moved on to another mobile project.
00:38:57
Speaker
And both of those got canceled within my first year. Or at least, you know, we dropped them into maintenance mode. Not a significant thing for us to work on.
00:39:08
Speaker
So I get moved to working on the back end on the Pages product. That was when I started working in Go, fell in love with that language, still using it to this day.
00:39:20
Speaker
And um as I'm starting to work on the back end, I'm seeing that there's like these frustrations of dealing with you know all of these different services.
00:39:31
Speaker
And it was something that I never really had to deal with before because I was either like building, you know, singular mobile applications that stand alone or monolithic PHP apps with Laravel.
00:39:45
Speaker
So I have to start like 15 different services on my local machine to be able to test one feature. I'm like, all right, there's got to be an easier way of doing this. And a couple of people had their scripts, but they were like one-offs and all that sort of thing.
00:40:02
Speaker
So I ended up going through a couple of iterations of bash scripts to do it that would do, like, terminal screens. So you had 15 segments of your terminal all spewing logs at the same time.
00:40:15
Speaker
And that culminated in a tool that you would configure with the startup scripts for all of your jobs.
00:40:30
Speaker
You would group them together. Then you just said, I just want to start this job group, keep it running, restart things when you change stuff. And within about six months or so, everyone in the engineering team was using it. Awesome.
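As a rough illustration of the job-group idea, here is a minimal sketch. The job names, script paths, and group name are all hypothetical, and the original tool is not shown anywhere in this conversation. It covers only the "start this group" part; keeping jobs running and restarting on change are left out.

```python
import subprocess

# Hypothetical configuration: one startup command per job.
JOBS = {
    "auth": ["./start-auth.sh"],
    "pages-api": ["./start-pages-api.sh"],
    "pages-web": ["./start-pages-web.sh"],
}

# Hypothetical groups: named sets of jobs needed to test one feature.
GROUPS = {
    "pages-dev": ["auth", "pages-api", "pages-web"],
}


def commands_for(group):
    """Resolve a group name to the startup commands of its jobs, in order."""
    return [JOBS[name] for name in GROUPS[group]]


def start_group(group):
    """Start every job in the group as a child process and return the handles."""
    return [subprocess.Popen(command) for command in commands_for(group)]
```

Calling `start_group("pages-dev")` would launch all three scripts at once, which replaces starting each service by hand in its own terminal.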
00:40:47
Speaker
And that felt so good. Yeah, I bet. Yeah, it was like, there was always a bit of a separation between what I was building and the people that were using it, whether that was because they were anonymous people on the internet.
00:41:04
Speaker
I used to work for a VPN company, so they were very anonymous.
00:41:09
Speaker
Or like there were layers of organization between you and the customer. So, you know, you had sales engineers, you had support reps, you had account managers, you had the product team.
00:41:21
Speaker
um And so the only time you would really be guaranteed to talk to a customer was when they were very, very angry. Um, and it felt so good to be able to just look to the person sitting on my right and see that they were using something.
00:41:37
Speaker
That's cool. Yeah. It's a great feeling. That's really cool. And I, I can absolutely see how that would be more motivating than we ship the next version of the mobile app. And I hope these 10,000 people click this button more.
00:41:49
Speaker
Yeah, yeah, exactly. So I got a real kick out of that, built a few other bits and pieces, helped with a couple of other people's projects. And the thing that really sealed it for me was that I actually ended up replacing that original tool that I built with Tilt, which, not sure if you're familiar with that? Yeah. Okay, recognition. Dan Bentley, right? Yeah, yeah. So, yeah, I think I know Dan. So Dan showed me a demo of Tilt very, very early on. I think they were like months into building it. He showed it to me at a bar at GothamGo. I mean, it was like 2018. And I look at this and I'm like, this is a much better version of the thing that I built.
00:42:40
Speaker
So we spent some time figuring out what we needed. We entered a contract so we could ask for a few additional features, including running things not in Kubernetes.
00:42:57
Speaker
That's one you need. Yeah, we certainly did. And so, you know, we replaced that tool, and I had to do a little bit of work to smooth that transition, but people were also loving this new tool because of a few of the additional capabilities that they got.
00:43:16
Speaker
And that sort of sealed it for me that, like, oh, so it's not just building things where I feel good about the fact that I built it. There's also...
00:43:28
Speaker
a world where I can just find things for people and make their lives better by facilitating their use of this. Um, and I think that was noticed, uh, cause we did sort of a bit of a reorg and, um,
00:43:45
Speaker
one of the VPs at the time comes to me and just says, all right, so it seems like you like building tools and you like you know helping people set up tools and all that kind of thing. Do you just want to run that group?
00:43:57
Speaker
We'll give you three teams. Um, I'm like, heck yeah, I'm in. Um, and that, that grew over time, added some responsibility, shifted some things around a bit, but, um, was doing that for about five years and, um, yeah, dev tooling. It's, uh, sort of my niche and it's what I love.
00:44:16
Speaker
Yeah. That's awesome. That's such a cool story. I had no idea that you sort of found your way to it at Yext. That's fantastic. All right, we're getting towards the end of our time, so I want to ask just a couple of quick questions, quick fire, do a lightning round. So let's start with fun and silly things. What's the most embarrassing workaround that is probably still running in production?
00:44:47
Speaker
Okay, so we had an issue with basically making sure that everything was running the latest version off of the latest commit.

Deployment Practices and Flexibility in Planning

00:45:01
Speaker
Because, you know, with CI tooling, it was sometimes easy to miss doing a promote or something. So we basically deployed the world every night. And they're still doing that.
00:45:17
Speaker
I'm genuinely shocked that that works. I see how it solves the problem, but that's awesome. I think it's the kind of thing that you could only get away with in the sort of organization where deploying to production rapidly is kind of the whole point.
00:45:37
Speaker
So, you know, for years, we were deploying 200, 300 times a day to production. Wild. That's fine. Yeah, it was like small, quick iterative stuff, you know, feature flags to control bigger releases.
00:45:55
Speaker
But because we kept up that cadence, deploying the world overnight just to double check that everything was in sync was fine. You know, the only kind of problems that it caused us were resource contention.
00:46:10
Speaker
And there was occasionally a situation where someone, you know, was rolling something out in a not quite backwards compatible way, and they quickly learned not to do that.
00:46:23
Speaker
Yeah. Interesting. That's very cool. All right. What migration advice does everyone give that is actually terrible? Oh man, you're asking me to have a controversial opinion.
00:46:36
Speaker
hmm. yeah um so
00:46:47
Speaker
I think that a lot of people would emphasize the planning stage of things. And they would probably do that for any project. People say, you know, oh, plan so much that your plan is perfect and nothing can possibly go wrong.
00:47:06
Speaker
Well, it's going to go wrong. Like, your plan will change. Be okay with that. Like, if you're doing a design for something, when was the last time you looked back on your design and had implemented exactly what you wrote down?
00:47:20
Speaker
Like, it never happens. Absolutely. Yeah. Have checkpoints, be ready to change things, be adaptable and be okay with that. Absolutely. I was getting mad about this because I'm like, no, so much of what we care about at Tern is planning and shifting information left in migrations. And I absolutely 100% agree with you. The plan changes. That's the whole point. You need to constantly learn what has changed and what you're doing in the moment based on what you know.
00:47:48
Speaker
Yeah. Yeah, absolutely. And, you know, I don't want to say that planning is bad, and I don't think people should aim to do as little planning as possible. I think the more you know, the better. But, you know, I would much prefer people did a ton of prototyping, like with those, you know, dry runs and all that kind of thing, or just gather information about stuff and have that at their disposal, and then have a more rough timeline and schedule for things, and then use that information to figure out the best way forward in the moment. Absolutely.
00:48:22
Speaker
I agree. That makes sense. It's 2025, so let's talk about AI. What would you do different if you had the AI tools of today?
00:48:34
Speaker
ah
00:48:37
Speaker
I mean, I'm still trying to figure out how I want to use them now.
00:48:42
Speaker
Yeah. Fair. I think that using AI to help you generate communication is useful.
00:48:55
Speaker
So, you know, just coming off the "don't plan too much": you've also got to tell people when you're roughly planning to do stuff. So, like, communicating schedules, communicating processes.
00:49:06
Speaker
I'd want to lean more heavily on AI to come up with not just, you know, summaries of things, help you draft stuff out, maybe come up with structure for things, but also taking the same message and rephrasing it a few ways, because you've probably got to tell people something seven times before it sticks, and if you can make it a little bit different each time...
00:49:31
Speaker
Absolutely, I love that. Actually, I gave a little talk to the product org at Slack about how to run big projects. And one of my pieces of advice was, like, you've got to have a couple of different stories because everyone cares about different things.
00:49:44
Speaker
But maybe more fundamentally than that, people get tired of you saying literally the same thing, but you have to keep saying the same thing to try and get them to listen. That's easy for AI. I had not thought about that. Cool.
00:49:57
Speaker
And last one, where can folks find you online if they want to reach out? So I post most frequently on LinkedIn and Bluesky.
00:50:10
Speaker
So, you know, LinkedIn, I'm just Tom Elliott. Two L's and two T's. A lot of people make the mistake of leaving out one of those. I'm also at telliot.me on Bluesky.
00:50:22
Speaker
Very cool. Oh, and tell me a little bit about what you're working on now because you're not at Yext anymore.

Future Projects and Closing Remarks

00:50:26
Speaker
That's true. Yeah, I left Yext almost exactly a year ago. So I'm currently working on a tool called Ocuroot, which I like to describe as a layer on top of CI/CD to help you manage large numbers of environments, to chain together all your different resources so that you can have much more complex behaviors but automate them much more successfully.
00:50:52
Speaker
So in the case of doing things like the Nomad migration, in theory, we could have put together a plan that basically described all the relationships between each of the different servers, which ones were Nomad hosts, which ones were servers, which ones clients, and the packages that you installed, and then create a map of all of those dependencies so that, first of all, you can visualize the structure of your environment, and then you can completely automate its creation.
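To make the dependency-map idea concrete, here is a generic sketch using a topological sort from Python's standard library. It does not use or represent Ocuroot's actual interface, and the node names are invented for illustration. Each entry maps a resource to the things it depends on, and the sort yields an order in which everything could be created.

```python
from graphlib import TopologicalSorter

# Hypothetical environment map: resource -> set of resources it depends on.
dependencies = {
    "base-os": set(),
    "nomad-package": {"base-os"},
    "nomad-server": {"nomad-package"},
    "nomad-client": {"nomad-package", "nomad-server"},
    "jobs": {"nomad-client"},
}

# static_order() lists every node after all of its dependencies.
creation_order = list(TopologicalSorter(dependencies).static_order())
for step, resource in enumerate(creation_order, start=1):
    print(step, resource)
```

The same map that drives the ordering can also be rendered as a graph, which is the visualization half of the idea.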
00:51:23
Speaker
Very cool. Definitely something near and dear to my heart and I'm sure a bunch of listeners as well. So Tom, thank you so much for coming on the show. This has been fantastic.
00:51:35
Speaker
And hopefully you can have a couple of good conversations with folks about Nomad, Kubernetes, and what they should do with their CI/CD. Sounds great. I had a great time. Thank you very much.