
Outscaling ElasticSearch at Datadog | Ep. 4

20 plays, 1 month ago

What does it take to replace the core data store at a company like Datadog?

That's the question I asked Ian Noland, who led Datadog's migration from Elasticsearch to a new system called Husky.

In this insightful conversation, Ian shares the challenges and triumphs of this monumental task, including why APM was the first product to move, how they used Kafka to run both systems in parallel, and the importance of deliberately preserving a system's bugs.

Join us as we delve into the world of observability and data management, and discover how Ian and his team successfully navigated this challenging journey.

Get Tern Stories in your inbox: https://tern.sh/youtube

Transcript

Migration from Elasticsearch to Husky

00:00:00

Speaker

What does it take to replace the core data store at a company like Datadog? That's the question I asked Ian Noland, who led Datadog's migration from Elasticsearch to a new system called Husky. We get into why APM was the first product to move, how they used Kafka to run both systems in parallel, and what it means to preserve a system's bugs deliberately.

Introducing Ian Noland

00:00:18

Speaker

Here's how they pulled it off. Welcome to the next episode of Tern Stories. I'm here with Ian Noland. I'm extremely excited to have this conversation. Ian, you've been at a ton of interesting places. You were early at AWS, you were SVP at Two Sigma, and you ran the infrastructure platform at Datadog for a long time.

Need for New Storage Platform

00:00:40

Speaker

We're going to talk today about Datadog's migration to their latest-generation storage platform, and I'm excited to have you on the show. So welcome.

00:00:53

Speaker

Thank you. Excited to be here. Cool. Well, let's dive in. The latest-generation storage platform, obviously, was not a one-time effort. There had been a couple of generations.

00:01:08

Speaker

From your perspective, what was it like on the previous generation? And what were people seeing that kickstarted this need to think about a new platform?

00:01:21

Speaker

Yeah, so there's a couple of aspects to this. So Datadog's original product, you'd call it infrastructure monitoring, and everyone thinks of it as metrics. You know, everyone I meet, even today, will complain about the cost of Datadog's custom metrics.

00:01:34

Speaker

And what that means is that that data store can do a lot of compression over time and over what they call space, over cardinality, but it wasn't great with massive-cardinality data. And so Datadog was very successful with that. Around 2018, they started getting into what they now call the three pillars of observability. And so logs and APM are the big two.

00:01:56

Speaker

And they are the classic case where you actually need high cardinality. And so the way that they built it, you know, Datadog was a very move-fast, use-as-much-open-source-as-you-can startup. And you can imagine that for these two products it was essentially Kafka, as you were ingesting all the data from customers, and then a lot of sharded Elasticsearch. And, you know, Elasticsearch at the time, and Elasticsearch probably even today,

00:02:20

Speaker

it's pretty good if you have a static data set in terms of size. It's pretty awful if it's consistently horizontally scaling. And Datadog was very successful. We were probably growing 100% year over year in terms of data volumes at this point in time.

Strategic Expansion Challenges

00:02:33

Speaker

And so the teams that owned this, the APM and Logs teams, were in massive pain caused by, essentially, Elasticsearch operations. So that's what it looked like. The other thing that was becoming clear at Datadog at this point is that, as the three pillars were succeeding,

00:02:47

Speaker

and, you know, we're now five years later, there was this clear desire to build like 20 more products. And most of those products were going to be a lot more like logs and APM than like metrics. And so it was very clear that we needed to do something about the core data store.

00:03:03

Speaker

And so that was sort of what led to this idea. Okay, what do we end up migrating? How do we migrate, and what do we do? Makes sense. So you had engineers in pain being woken up at all hours. You had this strategic need to go to 20 different products. So where did the idea come from?

00:03:27

Speaker

How did you come up with, let's build a whole new system instead of evolving the existing one? And what were the core tenets of that system? Yeah, it was interesting in that, you know, this was a three-year migration end to end. And for the first 12 months, actually, this was not under me. I was a VP of engineering.

Internal Dynamics and Migration Effort

00:03:44

Speaker

And actually, APM was under another VP of engineering. And Logs was under another VP of engineering. And like all engineering teams, our tendency was to want to incrementally improve what we had beneath us and not necessarily cooperate across the three different VPs.

00:04:01

Speaker

And you get into, why is that a problem? For us, in particular, it was hard to hire great backend storage engineers. We were sort of competing for them. So I'll give full credit: this wasn't actually my idea. The VP of Logs had come in by way of an acquisition, and he'd been thinking about the general case of this problem for a long time.

00:04:20

Speaker

And so it's this general observation, and I think at this point everyone in industry knows it: especially for observability data, where like 90% of written data is never read, Elasticsearch is actually a really bad way of implementing that, because it's indexing everything as it's being ingested.

00:04:36

Speaker

And so particularly with this push, especially with logs, to commoditize, you know, why are you paying anything more than what it costs us to write that 90% on top of S3? It became very, very clear, circa 2018, that the right architecture was a column store on S3, which is sort of what everyone in this space has today.
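To make the write-path argument concrete, here is a toy Python sketch. It is not how Elasticsearch or Husky is implemented; the class names and the index_writes counter are illustrative assumptions. It contrasts a store that indexes every field of every event at ingest with a column-oriented store that only appends values at ingest and scans at query time, which is the cheaper trade when most written data is never read.

```python
from collections import defaultdict


class IndexOnIngestStore:
    """Toy model: every field of every event is indexed as it arrives."""

    def __init__(self):
        self.events = []
        self.index = defaultdict(set)  # (field, value) -> set of event ids
        self.index_writes = 0

    def ingest(self, event):
        event_id = len(self.events)
        self.events.append(event)
        for field, value in event.items():
            self.index[(field, value)].add(event_id)
            self.index_writes += 1  # work paid up front, read or not

    def query(self, field, value):
        return [self.events[i] for i in sorted(self.index[(field, value)])]


class ColumnScanStore:
    """Toy model: ingest only appends to per-field columns; queries scan."""

    def __init__(self):
        self.columns = defaultdict(list)  # field -> one value slot per row
        self.rows = 0
        self.index_writes = 0  # stays 0: nothing is indexed at ingest

    def ingest(self, event):
        for field in set(self.columns) | set(event):
            col = self.columns[field]
            # Pad a newly seen column so it has one slot per earlier row.
            col.extend([None] * (self.rows - len(col)))
            col.append(event.get(field))
        self.rows += 1

    def query(self, field, value):
        col = self.columns.get(field, [])
        hits = [i for i, v in enumerate(col) if v == value]
        # Rebuild each matching row from the columns, skipping empty slots.
        return [
            {f: c[i] for f, c in self.columns.items() if c[i] is not None}
            for i in hits
        ]
```

Both stores return the same rows for the same query; the difference is when the work happens. In this toy, a real object store and file format are out of scope, and the scan cost at read time is deliberately left unoptimized.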

00:04:57

Speaker

And so this sort of came from him. And I think the further thing that he realized is, like, "I have to find a way of marketing this platform." So he had to change the logs product storage into an event platform.

Execution of APM Migration

00:05:10

Speaker

But "I have to change it in a way where I get the momentum internally, so that new products will bet on me rather than just stand up a new thing one after another after another." And so it's sort of this realization: I think you'd say a much better architecture for what needs to be stored, but also better strategic thinking, that we just can't have each product continually reinventing the wheel. We need to start consolidating on a platform.

00:05:33

Speaker

And so that was all him. Full credit to him for seeing it and convincing me at the time. Like a lot of people, I was just focusing on my own problems, the ones the metrics team was having under growth.

00:05:45

Speaker

Gotcha. I think that's so typical of so many organizations: you've got your thing that you're trying to do. Yeah. Engineers, we silo very well. And it takes good people to see bigger than the silos. So what convinced you? Was there something in it for you? Or did he appeal to your better nature, or something else?

00:06:07

Speaker

No, I think it was better nature. You know, I have a strong engineering background. And so the idea of a big architecture and thinking big, that was something my time at Amazon had really put in my mind.

00:06:21

Speaker

And so I think it was seeing that. I think he sold the vision well. And I was the type of leader who understood that vision very well, actually. So in many ways, this was a cost for me. Eventually this team would move under me, right? And so things would change, but actually this was a cost for me. I had to give up good candidates to this team.

00:06:37

Speaker

I actually had to convince my teams on legacy that they would be served by this eventually, but that I couldn't actually staff them up now. So it was a net cost. So no, it was appealing to my better nature. I think Datadog in general, honestly, has a good culture about that type of decision. One of the things about having co-founders who are technical is that they get that type of more strategic thinking.

00:06:57

Speaker

Yeah, that makes sense. I mean, you can only do so many things at once. And especially for these big migrations, you end up swallowing, you know, one strategic effort per N engineers, and that's all you can support. Exactly. What was some of the stuff that you had to give up in order to make this happen?

00:07:16

Speaker

So I think the biggest, as I said, strictly it was candidates. I remember a guy called George Talbot. He's still here in New York, a great engineer, and I went out of my way personally to recruit him for the metrics team. And then, I think it was maybe two weeks before he was due to start, Emmanuel, who owned this effort, said, "Ian, this is the more strategic offering. You need to give him up."

00:07:42

Speaker

And, you know, it was tough. Personally, it's easier as a VP to think bigger for the company, but I then had to convince my director: hey, I've got this great engineer who could help you with your problems today.

00:07:53

Speaker

It is more important for the company that he goes to work on this strategic effort. And it's that type of thing that is tough. That's a hard conversation. Yeah. It is.

00:08:03

Speaker

Again, what you hope as a leader is you've built up enough goodwill that you can occasionally do that and argue that it's strategic for the company, and you'll keep the trust of the leader, you'll keep the trust of the team.

00:08:15

Speaker

And that's what I sort of saw as my role. I think at that time it actually worked okay. I've had it not work at other times in my life. But no, that was sort of the hardest thing. It's like the teams who are more on fire with incremental problems don't get the engineers that they need, because you have to sort of staff up.

Phased Approach and Completion

00:08:33

Speaker

This is a key thing with this. It was sort of done as a tiger team. So it was staffed up completely independently of current problems. Yeah. And that can be hard to hear on teams that are underwater, on fire.

00:08:47

Speaker

Exactly. Okay, so you had a visionary VP that convinced you and a bunch of other people. You knew what you needed to do. You'd staffed up. So walk me through the major milestones. How did you actually think about getting this done?

00:09:00

Speaker

Yes. So Datadog, when they started it, had just those two initial products. But they were already at the point where, I think, they had the synthetics product and the network monitoring product. So it had these two whale products to migrate, and then they were starting to see these tiny products build up. And that list would be about 20 products by the time it was done.

00:09:20

Speaker

So I think it was a classic: when they started it, it was like, oh, we'll be done in 12 months. It actually took three years, as these things do. The very wise thing they did, though, was just focus on one product to start with. And it was interesting; this is the part about the other two VPs. So you had the VP of APM.

00:09:36

Speaker

He'd actually hired two great engineers who were very, very deep and had built this type of system in past lives. He had to give them up to the logs team, which had the better overall architecture for the long term. So he gave them up, but the quid pro quo was that APM was prioritized as the number one product. And so really for the first 18 months, this isn't quite true, but really for the first 18 months of this three-year migration,

00:10:02

Speaker

they just purely focused on APM. And APM was the one that was probably the... as a product, you know, there was Dynatrace out there. There were good competitive reasons to make APM more stable and have better-performing queries as fast as possible.

00:10:16

Speaker

So it was a good win-win. And so that was sort of the first 18 months. The next six to 12 was logs. And then those small ones were the tail, essentially. That's really interesting that you ended up with APM.

00:10:30

Speaker

I love that you picked a specific product and said, that's the use case that we're going to drive this technical migration around. And of course, I have a soft spot for APM, having worked in it at the same time that Datadog was absolutely crushing it in the early days. It's...

00:10:51

Speaker

How do you think about that balance, with APM as the strategic bet? What was the thinking behind saying that's going to become the strategic focus of the company, even though maybe logs and metrics are the bigger businesses, the bigger technical use cases, the bigger data sets that maybe need this?

00:11:15

Speaker

Yeah. And so this is, I think, good VP-level decision making, right? You're often thinking product-strategic and engineering-strategic, and when you can bring the right answer for those two things together, you've made a good decision. So the product-strategic side is that while APM was smaller, it was growing aggressively.

00:11:34

Speaker

We'd hired some really good ex-Java people, I think, to really make the APM for Java really, really good. And so there was a sense that, yeah, APM was a smaller business today, but if we invested in it the right way, it was going to be just as big as Logs and Metrics. So there was a really good sense that APM had just as much, you know, TAM, I guess, total market there.

00:11:55

Speaker

On the engineering side, the nice thing about APM is this. Logs, because, you know, Logs gets used for security use cases, they care a lot about deduplication and data loss. Whereas APM in general, they've got used to sampling, they've got used to the idea that you can be a little bit more lossy. So in terms of engineering features, it actually required the least amount of features to ship. And so it was one of those things that made sense from an engineering perspective and made sense from a business perspective.
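The difference in write-path requirements can be sketched in a few lines of Python. This is a hedged illustration under assumed names (DedupingWriter, SamplingWriter, keep_ratio), not Datadog's implementation: a logs-style path that must never drop an event or store it twice, next to an APM-style path where keeping only a fraction of traces is acceptable.

```python
import random


class DedupingWriter:
    """Logs-style path: never drop an event, never store it twice."""

    def __init__(self):
        self.seen_ids = set()  # a real system would bound or expire this state
        self.stored = []

    def write(self, event_id, payload):
        if event_id in self.seen_ids:
            return False  # redelivered message, already stored
        self.seen_ids.add(event_id)
        self.stored.append(payload)
        return True


class SamplingWriter:
    """APM-style path: keeping a fraction of traces is acceptable."""

    def __init__(self, keep_ratio, seed=0):
        self.keep_ratio = keep_ratio
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.stored = []

    def write(self, payload):
        if self.rng.random() < self.keep_ratio:
            self.stored.append(payload)
            return True
        return False
```

The sampling path needs no per-event state at all, which is one way to read the point that APM required the fewest features to ship.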

Technical Challenges in Migration

00:12:22

Speaker

So it wasn't a hard argument to make. Once APM was shipped, though, then you get the, well, how come you can't ship logs yesterday? So after you see the success, you sort of see the consequences of the decision.

00:12:35

Speaker

But I think it was a pretty easy decision. You've got one product, we have 20 we want on this. Why are you not done? Exactly, that's exactly it. You've been working on this for two years, why does it take so long? It's a classic: everyone outside gets very impatient, especially once they've seen the initial success.

00:12:53

Speaker

But it was good that it wasn't a knife fight over which was the right product. Cool. Okay, so you've got APM, you've picked it.

00:13:04

Speaker

This wasn't greenfield, right? APM had already existed, and it needed its own little migration to move over too. So how did you tackle that?

00:13:15

Speaker

Yes. So again, this is purely my peer. He had a really good idea in terms of how to make this migration happen. The way all of this stuff worked was that each team would cobble together their own self-managed Kafka and their own self-managed Elasticsearch. And then there was always some amount of live data that Elasticsearch sucked on, so you were sort of caching, slash, materializing stuff.

00:13:37

Speaker

And so APM had all that. What my peer realized with his product manager was: if we can make the Kafka management really, really easy for product teams, then we can sort of take over the ingestion, and they will actually stop caring about the data store at that point. So by thinking about the ingest problem and making it very, very easy for the product engineers to manage Kafka or whatever, that also allows us to take over the storage problem. So if you look at what they actually built in the first six months, before they wrote a line of code for the new data store, it was all around Kafka management. And it was all around basically getting that problem... I guess, in terms of, you know, the book we're sort of writing on platform engineering.

00:14:22

Speaker

They were raising the abstraction around ingestion, so that at that point they could completely swap out the back end without the product teams caring. And so once they'd done that, with essentially the APM team being the pilot customer, that became their pattern for how they could get teams on their platform and completely strip out what was happening underneath, with almost full control; not totally full control, but almost full control on the software side.
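A minimal Python sketch of "raising the abstraction around ingestion" might look like the following. All names here (Backend, InMemoryBackend, IngestPlatform, publish, swap_backend) are hypothetical and chosen for illustration; the real platform sat on Kafka and is not described at this level in the conversation. The point is only that product teams call one narrow interface, so the store behind it can change without their code changing.

```python
from typing import Protocol


class Backend(Protocol):
    """Anything the platform can write events to."""

    def write(self, event: dict) -> None: ...


class InMemoryBackend:
    """Stand-in for a real store, such as a legacy or replacement system."""

    def __init__(self, name: str):
        self.name = name
        self.events = []

    def write(self, event: dict) -> None:
        self.events.append(event)


class IngestPlatform:
    """What product teams call; they never touch the backend directly."""

    def __init__(self, backend: Backend):
        self._backend = backend

    def publish(self, product: str, event: dict) -> None:
        # Tag each event with its product so one pipeline can serve many teams.
        self._backend.write({"product": product, **event})

    def swap_backend(self, backend: Backend) -> None:
        # The platform team's decision; no product-team code changes.
        self._backend = backend
```

A product team that only ever calls publish has no way to depend on the old store's details, which is what makes the later swap possible.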

00:14:47

Speaker

That makes a ton of sense. There's this interesting parallel where I talked to someone a few weeks ago on the show about how they did a front-end migration in the same way. And there was the similar problem of, if we could wave a magic wand and make the change, that's not the end state we want.

00:15:04

Speaker

Because if you don't raise the abstraction barrier, you still have a bunch of people who have to know the new system. And did you ever talk with them? Did you ever tell them how they were going to operate this? Did you tell them how to debug it and how to build on it? Because they've got efforts.

00:15:20

Speaker

So yeah, you have to make this choice. You either get people spun up on the new system and excited about it and read in, or probably much more appealing to folks who are being woken up at three in the morning. You just take the problem away from them. Exactly.

00:15:33

Speaker

Exactly. What was hard about that? Because if I'm a product team, that sounds amazing. I think it was, so if you imagine how the ingestion works for Datadog, you end up with a lot of sharding decisions and queue management problems.

00:15:52

Speaker

And so initially, especially when you're in that fast-moving phase, every team comes up with their own semantics. So you end up with every team having slightly different semantics that are exactly right for their product,

00:16:04

Speaker

but actually very, very hard to abstract across. And so I actually think the hardest aspect of it was the product management aspect, which was essentially like, okay, take a dead letter queue, right? Okay, this thing, if we try and consume it, breaks the storage engine.

00:16:18

Speaker

We can't throw it away; it's just got to sit in this dead letter queue until we can page an engineer to look at it. So with an idea like that, it's just coming up with the right design of it, such that you end up making these four different product teams happy, such that they're happy to hand you the pager at the end of the day. So I think it's just these little decisions. It's just the usual...

Testing and Ensuring Performance

00:16:40

Speaker

abstracting across something that's in production, where you are asking the product team to sometimes walk back a little bit of what they're comfortable with today so that it becomes something you can support over way more use cases. So very much, I think, much less a technical problem than a product definition problem.
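The dead letter queue behavior described above (a message that breaks the storage engine is parked, not discarded, and an engineer is paged) can be sketched in Python as follows. PoisonMessage, consume, and the store and page callbacks are hypothetical names for this illustration; the actual semantics each product team agreed to are not spelled out in the conversation.

```python
class PoisonMessage(Exception):
    """Raised by the (hypothetical) storage engine for events it cannot accept."""


def consume(messages, store, dead_letters, page):
    """Write each message via `store`; park failures instead of dropping them."""
    for msg in messages:
        try:
            store(msg)
        except PoisonMessage as exc:
            dead_letters.append(msg)  # keep the message; never throw it away
            page(f"dead-lettered message: {exc}")  # a human has to look at it
```

The design questions that make this a product-definition problem sit outside the sketch: who gets paged, how long messages may wait, and whether consumption continues past a poison message (here it does).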

00:16:54

Speaker

Yeah. Every migration is just like 100 migrations in a trench coat. How much work do you put on the product team, and how much can you actually internalize? And, you know, you should try to internalize as much as possible.

00:17:09

Speaker

You know, sometimes they do have to make changes. You have to manage that relationship well to get them to make those changes. Yeah, absolutely. And there's this iterative process that's happening where you're almost... I was going to ask you about testing, but you almost answered that question: you're working with the product team to understand what is their sense of correctness.

00:17:31

Speaker

Yeah, exactly. So how...

00:17:35

Speaker

That's all well and good. And it's lovely to work at a place where you can work with people hand in glove. But Datadog's a real business and people rely on that. How do you think about deploying and merging those changes in a way that doesn't break the existing...

00:17:50

Speaker

workflows of, you know, important customers? So I think there's two sides. I think there's the one that engineers maybe find more attractive, in the measures. On the Kafka side, I think it was just very, very careful ops, honestly. Once they migrated Kafka over, they were operating Elasticsearch the same way. So there were no worries.

00:18:10

Speaker

But what you worry about with storage systems is, you know, are you consuming everything? And then are your queries performing correctly? Sorry: are they giving the right answers, and are they performing well enough?

00:18:22

Speaker

And so the Kafka stuff, I think, was mostly just fine-grained management, but then they had the actual Husky system, the thing that was replacing Elasticsearch. And then the team spent a lot of time on what they call shadow ingestion, right? So basically you are literally forking off the back of Kafka and writing everything to your legacy store, Elasticsearch, and your new store, Husky.
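As a hedged illustration of shadow ingestion, here is a small Python sketch. It simplifies heavily: a single loop fans each message out to two write callbacks, whereas forking off a Kafka topic would more typically mean independent consumers reading the same stream. The function and parameter names (shadow_ingest, legacy_write, shadow_write, on_shadow_error) are assumptions for this example.

```python
def shadow_ingest(messages, legacy_write, shadow_write, on_shadow_error):
    """Fan each message out to both stores.

    The legacy store remains the source of truth, so its errors propagate.
    Failures in the shadow store are recorded but never block ingestion.
    """
    for msg in messages:
        legacy_write(msg)
        try:
            shadow_write(msg)
        except Exception as exc:  # the shadow path must not affect production
            on_shadow_error(msg, exc)
```

The asymmetry is the design choice worth noticing: the new store sees all production traffic, but nothing a customer sees depends on it yet.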

00:18:45

Speaker

And then they had to write a query comparison system: this thing that would send queries to both systems and look at the results in terms of both performance and correctness, and make sure that you're getting the same things. And so that was a lot of investment in bespoke tooling to essentially ensure that the same results are there. Like I said, the classic with data is that false positives are just as bad as false negatives. So if the newer system did something more correct, but that actually led to a real customer getting paged at two o'clock in the morning, that's actually just as bad as it being incorrect. And so they had to spend a lot of time really analyzing those results. And, you know, this was actually true of my time at Amazon as well: progressive deployments and shadow deployments

00:19:33

Speaker

were the main two tools that allowed the team to get this right. Absolutely. And Kafka is such a tricky one as well, because it's not like the complexity is in the code that you can run locally. This is an operations problem, and all of the problems are going to happen in production.
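The query comparison system mentioned a moment ago is described only at a high level, so the following Python sketch is an assumption-laden illustration rather than the real tool. The names Comparison, compare_query, run_legacy, and run_candidate are invented here. It runs one query against both systems, treats the legacy answer as the reference (matching the point that a "more correct" answer still counts as a difference), and records latency for each side.

```python
import time
from dataclasses import dataclass


@dataclass
class Comparison:
    query: str
    results_match: bool
    legacy_seconds: float
    candidate_seconds: float


def _timed(run, query):
    """Run a query callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run(query)
    return result, time.perf_counter() - start


def compare_query(query, run_legacy, run_candidate):
    """Run one query against both systems; the legacy answer is the reference."""
    legacy_result, legacy_s = _timed(run_legacy, query)
    candidate_result, candidate_s = _timed(run_candidate, query)
    return Comparison(
        query=query,
        results_match=(legacy_result == candidate_result),
        legacy_seconds=legacy_s,
        candidate_seconds=candidate_s,
    )
```

A real comparison would likely need tolerance rules (ordering, floating-point rounding, late-arriving data) before calling two results different; strict equality is used here only to keep the sketch short.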

00:19:51

Speaker

One of the investors in Tern was in your seat at a similar company. And we would trade text messages that were just "Kafka" and a series of swears. Yeah. Because every time you push something into production, there's this open question of, how careful were we? Do we have the right tooling in place? Do we know that this thing is going to work? And hopefully you've been looking at shadow traffic for a month, because if not, you've got no prayer that this thing is going to work.

00:20:21

Speaker

Yeah, exactly. I've got a bunch of good answers to this, but I want to know your answers. Weirdest edge case as you were doing this? I think the weirdest edge case I probably just mentioned, but I think the weirdest edge cases were just Elasticsearch, where you find out, especially when you're doing these aggregate queries over a time series... Elasticsearch was never really built for time series. It's just that people learned that they could shove time series in there.

00:20:50

Speaker

So it was just Elasticsearch giving the wrong answers, and you as an engineering team having to say, well, actually, the Datadog answer is essentially the Elasticsearch answer from 2018, forever, because we cannot, as a business, deploy something where a bunch of our existing customers get a bunch of false alarms. So the Elasticsearch semantics became the correct semantics.

00:21:16

Speaker

So I think those were the interesting edge cases. It was just those, where you have to say to an engineering team, it actually doesn't matter what is fully correct mathematically. What matters is what Elasticsearch does. And this decision was made in 2015. Lucene is older than that, right? So it was probably made in Lucene in 2008.

00:21:40

Speaker

Amazing. Those decisions will live forever.

00:21:45

Speaker

Cool. So you had an approach, especially to getting APM moved over. I mean, that's not a trivial approach. How did you lather, rinse, repeat that across the other...

00:21:59

Speaker

maybe two major services and 17 subsequent services? So I think logs was a big, heavy one, like anytime you have a big product where you're talking about hundreds of millions of revenue. And also logs was significantly different, particularly in never wanting to lose a log message and never wanting to double-write a log message. That was a big thing with logs.

00:22:21

Speaker

So I'd say, honestly, number two looked a lot like number one. It was fairly hand-managed, with a lot of individual attention to detail. I'd say the nice thing that happened, and this is, I think, the true success, is that during this time period a lot of Husky-native new products got started. So there's probably like five products that never ran on Elasticsearch. They didn't need any new features, and they could just use the new system, Husky, from day one.

00:22:46

Speaker

And so that was just very, very nice. That was almost like a pain-free migration. That's the dream. Yeah, that's the dream. And the nice thing about them is they were small and they were young products where, look,

00:22:59

Speaker

if you end up having a bug, like some query times out, it's completely new. So this is new behavior. It's not a change in expectations.

00:23:09

Speaker

And so they got a lot of their growth, in terms of breadth, just out of these new products that were being launched. I'd say for what was probably the five or so others, like network monitoring, I think it was process monitoring, the actual old events thing, they had built enough tooling by the lift-and-shift stage. That was fairly automated at that point in time.

00:23:32

Speaker

I still feel like it was only one at a time, and each took a month. I'd say at this scale of migration, where it's ~20, there were lots of human eyes; it was never that fully automated "we just roll it out and it just works." But after logs it did get a lot faster. I think logs strictly was 12 months to be fully done, but it was about six months to get to handling 80% of traffic for logs, and the rest were mostly much faster than that.

00:24:03

Speaker

It's interesting that you say that. So you were tracking internally the amount of traffic you'd moved over, and you could do it incrementally? Yeah. Again, this is part of the joy. For all that we hate Kafka, the nice thing about having those queues in ingestion is that the thing behind it can be very selective in how much it handles. So it was very easy to just feature-flag customers into it and see the results, right?
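One common way to feature-flag customers into a new backend is a stable hash bucket plus an explicit allowlist. The Python sketch below shows that general technique under assumed names (bucket, use_new_store, rollout_percent, allowlist); the conversation does not say how Datadog's flags were actually implemented.

```python
import hashlib


def bucket(customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_store(customer_id: str, rollout_percent: int, allowlist=frozenset()) -> bool:
    """Decide which backend handles this customer's traffic."""
    if customer_id in allowlist:
        return True  # hand-picked customers go first, regardless of percentage
    return bucket(customer_id) < rollout_percent
```

Because the bucket depends only on the customer ID, raising rollout_percent only ever adds customers; nobody who has already moved flips back unless the percentage is lowered.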

00:24:27

Speaker

So yeah, the nice thing about this async... You know, you're paying for this extra copy of the data. The nice thing about it is it does give the engineering team and the product team a lot of freedom to make the migration nice.

00:24:39

Speaker

Yeah.

Final Stages and Post-Migration Cleanup

00:24:40

Speaker

Yeah. And moving over customer by customer is just so conceptually clean. That makes a ton of sense. Yeah. So that's great: you moved logs, and then everything else, the dominoes just fell because you'd tackled all the hard problems, basically. Mostly, although there's a big caveat to this. So if you are a Datadog customer and you've been a Datadog customer for a while, Datadog's always had this event view, which sort of dates back to syslog, you know, dates back to 2013 when Datadog was pure infrastructure monitoring.

00:25:10

Speaker

So the last migration was actually that. And the big thing is they needed a feature where they could actually annotate log lines. And that was a feature that no other internal customer, no other Datadog product, needed.

00:25:23

Speaker

And so it was convincing the product managers, but even just the event platform team: hey, we need to finish the migration. We are not a shitty engineering shop that gets...

00:25:36

Speaker

95% done and leaves one poor team in hell, essentially, operating their own Elasticsearch cluster just because of this one feature. And so that last one, again, was actually non-trivial technically to do.

00:25:50

Speaker

But it was more, and this is probably my role as VP, just convincing everyone: we would all like ourselves better if we actually complete the migration. And I as VP will recognize this effort, even if the business is like, hey, you should have put this engineering on some new product that is sexy, which is what you always hear as a leader.

00:26:12

Speaker

Like, we need to clean up our architecture. We need to get this migration done. And so, yeah, it turned out to just be one feature, but a non-trivial feature. And then everyone's like, oh, you know, this thing doesn't really matter, does it? And it's like, no, no, it completely matters, right?

00:26:26

Speaker

It is still a part of our product. It's not that well used, but we can't kill it in the product. So it's that classic last little hump in the migration, just willing it to get done. That was more difficult than you would think.

00:26:40

Speaker

But it's just because there's always other stuff the engineers could be doing, right? And so it's convincing everyone this will be worth it. Yeah. And the motivation shifts at that point, it sounds like. Exactly. Especially once you get that new thing out and you can start putting new capabilities on it.

00:26:55

Speaker

Everyone just wants to unlock the new capabilities. And it's like, no, no, we've got to finish the migration. And yeah, let's be disciplined again. The nice thing about Datadog having engineering co-founders is they get it, but even with them, you need defense being played at almost every level of your hierarchy to actually make this happen.

00:27:14

Speaker

Yeah. And I'm actually a big proponent of not finishing migrations if everyone's aligned on that, because it's fine. Like, no big deal. But it's a great point that if you get close enough... If you've done 60% and everyone kind of loses interest and there's no value, like, yeah, stop doing work.

00:27:30

Speaker

But if you've done 99% of the migration, that win is worth it, even if you're not going to see it show up in revenue metrics. It's just great for morale. Yeah, and I think it depends a little bit on the migration. Like, I definitely would agree with you for something that isn't leaving significant infrastructure that is customer-facing.

00:27:49

Speaker

Essentially, what it was destining one team to be in was ops hell forever, or ops hell for five years until maybe it got built. And that's bad for the team, bad for the customers. And so it was easier to justify in terms of,

00:28:02

Speaker

this is right for the business to kill. Whereas, yeah, there's other migrations where it's just cleanliness. And I think, you know, you and I have both said, right, there will be another migration four years from now. Why not just wait for it? And so I agree with you that in those circumstances you can sometimes be too pure.

00:28:19

Speaker

Yeah. And I mean, that's part of the job: figuring out when do you go to a hundred percent and where can you pause it? Where can you wait for the next one? Yeah, for sure.

00:28:32

Speaker

Speaking of that, not trying to leave teams in ops hell, and you had a bunch of double infrastructure running: what kind of cleanup did you see? How long did this migration go past the point of declaring victory?

00:28:47

Speaker

I think because of the fact that they built, again, it's maybe the joy of Kafka, because they built that incremental tooling to fail across very early, it wasn't so much like the usual, you know, shutting down... I think we've both seen, right, sometimes shutting down infrastructure just takes years. Actually, our

00:29:08

Speaker

migration from VMs to Kubernetes was exactly like that. Like, it literally took a lot of time. In this case, actually, it wasn't too bad. Once it could support 100% of queries for 100% of customers and we were confident that we weren't losing data,

00:29:23

Speaker

it didn't have that long tail of cleanup that some migrations

Human Aspects and Leadership Dynamics

00:29:27

Speaker

have. So yeah, it was a nice thing. I think the nature of the system itself meant the tooling was built up front, as opposed to that last 20% cleanup-automation type of stuff. So yeah, it took a long time because a lot had to be built, but it didn't take a long time because there was a lot of cleanup. That was the nice thing about this migration.

00:29:46

Speaker

Cool. You were actually prioritizing going through and doing that cleanup as you went, so you didn't leave yourself with the debt of the debt. Exactly. Exactly. Cool.

00:29:57

Speaker

So let's talk a little bit about the human side of this migration. How did the originating team, which was, I guess, originally under this other VP but eventually came under you, how did they feel about the migration? And how did that change over the three years that you went through this?

00:30:21

Speaker

Yes, so there's an interesting, really human dynamic in this. So the VP I mentioned, he'd come in as part of an acquisition. He'd always had this vision. He'd sold this vision a lot. And so what was interesting is that you had this team, this logs team, that was completely pained by just the growth in logs, but very, very ambitious, actually. They're an ambitious team that wanted to take on more.

00:30:41

Speaker

And so, to an extent, taking on the migration cost was part of this vision of getting out of being a logs platform to be an event platform.

00:30:53

Speaker

I think the VP and the director sold that very, very well to the team. So it was a team, you know, working incredibly hard. Honestly, this is sort of well known now, but it was a burnout factory. Most engineers would only last about 18 months on the team, with the 60-hour weeks. The team was also in France.

00:31:09

Speaker

So dealing with issues on US time was super painful. It was into their evenings, into their nights. So it was interesting. It was a team, though, that, because they believed in the mission so much, the fact that this was a difficult migration, they were totally on board with. It wasn't one of those ones we've probably both seen, right, where the platform team itself doesn't really want to do the migration. This was the opposite. They really wanted to do the migration because it matched their mission. And they had leaders who got good alignment of the engineers on that mission. So yeah, it wasn't,

00:31:43

Speaker

Definitely, I've seen much harder migrations in terms of selling into the team. That's interesting. And you mentioned this earlier, that this transition from logging to an event-driven platform was not a Datadog-specific idea, that there was this industry motion. I'm sure there are some listeners who haven't, you know, worked in observability for 10 years.

00:32:03

Speaker

Tell me a little bit more about what was happening more broadly that influenced the team to be so excited about that kind of change.

Industry Trends and Internal Management

00:32:11

Speaker

I think it was just that Datadog was at this point in observability, sort of circa 2018, I think, is when this acquisition happened. So like a year before I actually even started.

00:32:21

Speaker

I think it was really, you know, when monitoring became observability, in a way. So I think it was this incredible excitement at the time around the breadth of things you can do with observability. And the nice thing about Datadog in general, right, you know,

00:32:36

Speaker

VPs complain about it because of how expensive it is. Engineers tend to love it. And so engineers working on Datadog love working at Datadog because their colleagues who don't work at Datadog love Datadog, right? And so honestly, if you pull it apart, I think it was just this excitement around what you can do with observability.

00:32:52

Speaker

And so, you know, an event platform is essentially just a time-series structured log. It's just that recognition. And so I think that team just fully believed that, and maybe a little bit too much, actually. This goes back to where it doesn't actually fit metrics that well.

00:33:06

Speaker

But they just believed that all of observability's problems can be solved with this one platform. I mean, if you look at Honeycomb's marketing today, they sort of have the same marketing. Everything can be solved with Honeycomb.

00:33:17

Speaker

A lot of problems can actually be solved with this approach. And so I think it was just that excitement at the time around that approach. With this platform and enough money. Exactly. You can make it happen. Yeah, and essentially, you know,

00:33:30

Speaker

because in the end, they all end up building a fully functional distributed database that can handle SQL. And that's 100 developer years. So it does take a lot of investment for these things to actually hit their initial vision.

00:33:45

Speaker

Yeah, absolutely. Cool. So you had that team that was bought in on it. Did that excitement spread to other teams on the other side of the fence?

00:33:56

Speaker

I don't think so. So an interesting thing that this started with, and that I definitely continued, is that there actually wasn't a TPM on this project for the first two of the three years. We did have a TPM for the last year, but before that we didn't have a TPM.

00:34:11

Speaker

What we did have, though, was a full-time product manager. And that product manager, who was, you know, part of this acquisition, very much believed in the vision. He was very, very good at selling the vision, and particularly at coming up with compromises that would convince other products that, hey, it's much better for you to get on us and to be part of the migration with us than doing it yourself independently.

00:34:36

Speaker

A lot of what that looked like was actually just compliance stuff. So like HIPAA and PCI DSS, right? No product team wants to be bothered by those things. And so it's like, hey, we're going to be the ones, we're going to burden ourselves with the compliance stuff.

00:34:51

Speaker

And so I think it was just a classic. You'd say it was finding the incremental value for the business that got them excited about the platform and a platform migration. But yeah, I'd say, honestly, like in any engineering org, there was a decent amount of skepticism of the approach. It wasn't total skepticism, but it wasn't total belief either. It was like, okay, let's wait and see, sort of, from the engineers.

00:35:16

Speaker

But it was finding those business wins that actually got people excited for the migration. That's cool. So I love the idea of having someone internally who's really focused on almost like landing that story, right? Like, we're going to build an industry-changing platform. We're going to align ourselves with the future. We're going to do great things for Datadog. But for you, what that's going to look like is not dealing with PCI.

00:35:38

Speaker

Yeah. I remember, it was the network monitoring product. There's always this, like, beware hiring staff engineers into product teams, because they will convince themselves that the platform will never be good enough and they just need to, you know, do something internally. And this wasn't a bad engineer, but I've just seen that mindset repeat so many times when product teams do it.

00:35:57

Speaker

And so what this product manager and the lead staff engineer did was this negotiation over about six months where they did a bigger and bigger proof of concept and sort of won him over that we are actually better, even the compliance stuff aside. So it's sort of this dance of convincing teams that, hey, when we get off Elasticsearch, you want to be on us, as opposed to taking all the problems of managing open source onto yourself. So that, to me, is a big part of a product manager's role for platform teams: those types of delicate negotiations.

Project Management without a TPM

00:36:30

Speaker

That makes sense. That feels right in my experience. I think it's easy to see from inside the platform team what the selling motion looks like, because you've got someone pushing the vision and getting people bought in.

00:36:44

Speaker

How did that feed back? It's a two-way street, I imagine, unless you've got an exceptionally talented internal PM. How did it feed back to the product teams?

00:36:56

Speaker

Yeah, or how did it feed back to the platform team? Did it change the goals? Did it change the plan? I'd say this was very much... I've seen this work out differently in different migrations.

00:37:11

Speaker

This was very much the product manager, the lead staff engineer, and the two engineering managers who were there. They very much acted as the umbrella. So to some extent, there was so much work on the ground keeping the old system running and migrating to the new system.

00:37:27

Speaker

I think the engineers were fairly protected from the actual product side of it. It definitely wasn't the... Camille Fournier, my co-author of the platform engineering book, she's like, you know, every engineer should be directly dealing with support cases because that's how you understand the customer.

00:37:43

Speaker

This wasn't that pattern at all, actually. Those four really isolated the team from having to care too much. And so it really just became a roadmap timing problem that the lead product manager sort of operated for the team. So, yeah, it wasn't one...

00:37:58

Speaker

where the team, like the actual engineers, had to concern themselves too much about winning people over. I've definitely seen other migrations where that wasn't the case. But yeah, this one had that nice property, in that they could just focus on building.

00:38:11

Speaker

That's awesome. And especially if you can fix it just by reordering what you were already going to do, or reordering how the customer integrates with it. Yeah. The one thing I think the lead staff engineer did really, really well was that every two weeks he wrote a summary of progress, and he was a very good writer. And so he simultaneously did this well for engineers,

00:38:31

Speaker

but also engineering managers. So it wasn't overly technical, and he simultaneously did it for both the internal team and the external teams. And so it was just two pages, but this nice two pages every two weeks that kept everyone sensing that there is momentum here. It kept the outside people seeing that stuff was being worked on, and for the internal people who, maybe from what they saw that week, didn't see any wins, it

00:38:56

Speaker

kept the internal engineers seeing the wins that were happening. And so it was just a really nice way of maintaining momentum across what was a pretty big project. You know, three years is a long time when all is said and done. So I did like that. And the fact that it was an engineer doing it, but an engineer who wasn't stuck at an engineer's level of detail, was much better than having a TPM or a product manager do it. You know, they tend to just do an execution story, but this was a far better story than that. And so I think that was a big part of keeping everyone aligned across the timeline.

00:39:26

Speaker

Do you actually think it was a benefit to not have a TPM? I think it was a benefit until that tail. So I think, yeah, and this is actually, you know, my co-author Camille, in the book, she hates TPMs early in migrations because they tend to take what is essentially a product problem and turn it into an execution problem.

00:39:43

Speaker

And thinking that if we just execute off this Gantt chart correctly, everyone will be happy. So I did think, you know, she would say it stretches your staff engineer to be a bit more of a TPM. Tanya Reilly's book, The Staff Engineer's Path, is great. And good staff engineers should be a little bit like a TPM.

00:39:58

Speaker

It stretches the product manager to think in terms of schedule, which some product managers don't. But the long tail, even for us, with just like a 20% tail, we absolutely needed a TPM to just crunch the freaking, you know, the trade-offs and whatever.

00:40:11

Speaker

But yeah, I do think the longer you can go without a TPM, the better. And sometimes that means telling your staff engineer and your product manager, hey, you're part-time TPMs. And you know, everyone can do it. Everyone will insist they can't do it. Like, managing an Excel spreadsheet is not that hard. So they can do it. But you don't want it to become their full-time job. So you have to be sensitive to how much work it is. The sense of colors and the width of columns in a spreadsheet is not the primary skill of a TPM. That's not what they're bringing to the table. Yeah. You can do that as an engineer.

Reflections on the Migration Process

00:40:44

Speaker

Yeah.

00:40:46

Speaker

Yeah, absolutely. I entirely agree. You can't really finish that swing with that many pieces unless you've got a TPM. Yeah. And particularly, as the project succeeded, it suddenly started eating up a lot of the budget. So you had to play defense against the finance team. So these other non-technical stakeholders come in. And a good TPM is great at that. But they have to be on top of the details to do that. And that has both pros and cons.

00:41:17

Speaker

Yeah, I guess after three years of running double infrastructure, finance wakes up. They do. There's this, because is it COGS? I don't know. This gets into technical terms. Like, if you're doing it more for R&D, it's not COGS.

00:41:28

Speaker

And so the finance person who cares about margin doesn't care. But eventually it gets big enough that, regardless, finance does care. And so, yeah, they do wake up. I would occasionally stare at the breakdown between COGS and R&D and the teams that were allocated to that at Slack. And it was like staring into the void. It's like, that doesn't make any sense. Yeah. I mean, accounting definitely is a black art. That's right. I'm not an accountant. I'm going to stop looking at this spreadsheet. Yeah.

00:42:02

Speaker

If you had to do it again, what would you do differently? I think, so it's interesting, especially with the first year it wasn't under me, then the two years it was. And I do think, for a team that was heavily loaded, they did a really good job that first year.

00:42:19

Speaker

It's interesting. I don't really know. So, you know, the answer to that is almost like, what fires came up? But then it's always easy in hindsight, right? Like, what fires came up that we should have known about maybe earlier?

00:42:35

Speaker

Like, what took a lot of time in logs was that correctness around only-ingest-once, because you don't want that security log alert that says "if I see five of these a day" to go off because you only saw four and you double-wrote one.
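The only-ingest-once concern described here can be sketched roughly as follows. This is a minimal illustration and not how Husky actually works: the `event_id` field, the `DedupingIngester` class, the in-memory set, and the five-per-day alert rule are all assumptions made up for the example.

```python
from collections import Counter


class DedupingIngester:
    """Hypothetical idempotent ingester: each event_id is counted at most once."""

    def __init__(self):
        self._seen_ids = set()   # IDs already ingested (in-memory, sketch only)
        self.counts = Counter()  # per-message counts an alert rule might read

    def ingest(self, event):
        """Ingest one event; return False if it was a duplicate delivery."""
        if event["event_id"] in self._seen_ids:
            return False  # redelivered or double-written: drop it
        self._seen_ids.add(event["event_id"])
        self.counts[event["message"]] += 1
        return True


def alert_fires(ingester, message, threshold=5):
    """Hypothetical rule: alert once a message has been seen `threshold` times."""
    return ingester.counts[message] >= threshold


# Four real security events, one of which is delivered twice.
events = [{"event_id": f"e{i}", "message": "failed login"} for i in range(4)]
events.append(dict(events[0]))  # duplicate delivery of e0

ingester = DedupingIngester()
for e in events:
    ingester.ingest(e)
```

With the duplicate dropped, the count stays at four and the five-per-day rule does not fire. A real system would need a bounded, durable dedup store rather than an unbounded in-memory set.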

00:42:48

Speaker

So maybe it was internal project timing, those projects that should have kicked off earlier just because they became the long pole in hindsight. But that's every... I feel like I'm explaining every project.

00:43:05

Speaker

Otherwise, it's always, in management, right, you get into, oh, I wish we'd been able to put more engineers on it sooner. But we were ripping engineers away from directors with burning problems. It wasn't that easy to do that either.

00:43:18

Speaker

So I don't know. I think all in all, it was a fairly successful project. You know, the director of it left, completely burnt out, two years into it.

00:43:30

Speaker

It would have been nice to not have the burnout happen. But yeah, that was tough because the hours of Paris versus US were tough. So I don't know. I don't look back with too many regrets on this one, actually. Yeah, I mean, with perfect hindsight, sure, you could know that you should solve, you know, exactly-once ingest, or you could throw more engineers at it. Maybe, honestly, and it comes back to it.

00:43:55

Speaker

You're probably going to burn your director out faster too. It might be coming a little bit back to, you know, the fact that I had to be sold by this other VP on what in hindsight was completely the right decision. It might be me, you know, being a little... I'd only been at Datadog like four or five months, and my directors had burning problems.

00:44:12

Speaker

But yeah, it might just be being a little bit more open to my peer VPs, you know, when they're seeing maybe further out than I'm seeing. Because again, Datadog metrics are still not running on this today. Maybe in 2027 the grand unification will actually happen.

00:44:28

Speaker

But yeah, it's tough when you have this other VP say, look, I need all the headcount, and you go, well, I need a cut too. But I think he was totally right. I think that big vision was the right vision. So I'd say maybe, in hindsight, the lesson I took was, you know, think a little bit bigger in your decisions.

Current Work and Future Prospects

00:44:47

Speaker

But that's less on the team and more on me personally. And you did end up having to run it anyway. So maybe it was good that you were willing to staff it. Yeah, exactly. I love it. That makes sense.

00:45:01

Speaker

But you're not at Datadog anymore. What's the next big thing for you? Yes. So I left Datadog, actually more or less declared success, although I was managing 700 people at this point. So this wasn't my success, it was my team's success. So I left Datadog two years ago, about March 2023.

00:45:15

Speaker

I took time to write a book on platform engineering. It's a lot of, sort of, you know, the theme of this talk, the management level of these types of decisions. That's sort of the book. It's very much a management book. Then about a year ago, I decided that I was, you know, still very, very hungry, and didn't want to be an author altogether.

00:45:31

Speaker

So I sort of co-founded Junction Labs. Across my time on these migrations, it was, you know, the Nitro stuff at AWS, the Two Sigma compute platform (Two Sigma was the finance company I worked at), and then Datadog.

00:45:45

Speaker

You know, this sort of shadow deployments and progressive deployments, almost like continuous-deployment-type stuff, I always thought this was what unlocked the big re-architectures, the big migrations, to get to the actual next level. And so Junction Labs is there. So everything we just talked about actually used Kafka and was asynchronous, and Kafka is great if you're willing to take the cost of asynchronous messaging. But if you're in a synchronous space and you're not behind the API gateway, you don't really have a lot of answers for these types of features. And so Junction Labs is trying to build into the networking layer a lot of these capabilities around, you'd call it, like,

00:46:27

Speaker

essentially traffic shaping, traffic mirroring, dynamic traffic load balancing. And so, the technical term that people who are in the space would know, we're a proxyless service mesh. So we're trying to do the same ideas service meshes have, but not force you into a big migration onto sidecars.

00:46:45

Speaker

Instead, we're trying to do this as a smaller migration onto a drop-in library that just unlocks these capabilities, so that each individual engineering team can sort of control their own fate in this way. And so that's Junction Labs. We're still seed stage.

00:46:59

Speaker

You know, it's a couple of guys in a room at the moment, still very early. But over time, I think, again, every successful migration I've seen has used some notion of how do we put stuff in production and learn from production? Because there's no way we can simulate this in testing, in integration testing, at scale.

00:47:18

Speaker

So it's really the test-in-production idea. Absolutely. That makes a ton of sense. And it mirrors a lot of what we're thinking about at Tern as well, you know, that you can make the code changes, but nobody wants a diff. No one declares victory when GitHub has a bunch of code on it; you need to prove to yourself that it works. And that starts with unit tests, and it moves on to integration tests, and it moves out into getting it into production. And that is such a scary moment.

00:47:45

Speaker

Yeah. You put something into production, and anything you can do to smooth that over... That's exactly right. It's basically, you know, not quite as much of a meme at the moment, right, but that "I only test in production" idea from 10 years ago.

00:48:00

Speaker

But actually, across my, you know, 15 years doing this, testing in production is the right way, but you have to make it very, very safe, like brain-dead safe, where an engineer just checking something cannot shoot themselves in the foot.

00:48:14

Speaker

And so, yeah, being at the networking level, just being able to, you know, select a test tenant ID or select 1% of traffic... There's a system out of Twitter called Finagle, in 2010. You know, none of these techniques are new, but really they've never made it outside the big shops. And so that's a lot of what we're trying to do with Junction Labs.
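The idea of selecting a test tenant ID or 1% of traffic can be sketched like this. It is only an illustration of the general technique, not Junction Labs' or Finagle's API: `TEST_TENANTS`, `ROLLOUT_PERCENT`, `bucket`, and `choose_backend` are hypothetical names, and hashing the request ID is one assumed way to get a stable split.

```python
import hashlib

TEST_TENANTS = {"tenant-test"}  # hypothetical tenants always sent to the new backend
ROLLOUT_PERCENT = 1             # hypothetical share of all other traffic to send


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_backend(tenant_id: str, request_id: str) -> str:
    """Return "new" or "old" for a request."""
    if tenant_id in TEST_TENANTS:
        return "new"  # explicit opt-in for test tenants
    if bucket(request_id) < ROLLOUT_PERCENT:
        return "new"  # small, deterministic slice of everyone else
    return "old"
```

Because the bucket comes from a hash rather than a random draw, the same request ID always lands on the same side, which makes a surprising result reproducible. Setting `ROLLOUT_PERCENT` to 0 turns the percentage slice off without touching the test tenants.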

00:48:35

Speaker

Right on. That's awesome. We'll drop a link to Junction and the platform engineering book in the comments. Cool. Well, we're just out of time. So a couple of quick-fire questions just to wrap things up.

00:48:49

Speaker

If you could wipe one technology off the face of the planet, so it never existed, what would it be? Never existed. That's a... Honestly, so when I interviewed at Datadog, they asked me to design a system, actually. And it's actually this sort of system we talked about. And I said, I will not use Elasticsearch because Elasticsearch does not scale.

00:49:12

Speaker

And I was sort of skeptical because at that point, they weren't having all the burning problems. But Elasticsearch, it just was... I don't know, Elasticsearch themselves have sort of moved to a column store now, but that 2015-era Elasticsearch of static clusters that are so difficult to manage and have noisy neighbor problems and hotspots, I would say that era of Elasticsearch was a terrible idea.

00:49:34

Speaker

And it was just clearly a terrible idea, but it solved some quick problems for fast-moving startups. And so they used it anyway, and then we all ended up bearing the cost. I think it's different to Kafka. Like you can... Yes, I'd say Elastic's 2015-era static-cluster Elasticsearch. I think that's the one I've just seen cause so much pain.

00:49:54

Speaker

I don't want to say for so little gain, but you could take Lucene and build your own thing and it wasn't that hard. It was never clear to me why anyone loved Elasticsearch. Just blow it off the map, and it'll pull five years of progress forward because we'll fix the problem. Exactly. You're built to horizontally scale from day one rather than treating it as an

Final Thoughts and Future Endeavors

00:50:14

Speaker

afterthought. Like, that was, you know, it was sort of crazy in hindsight. A plus. When we get our time machine, I'll let you know.

00:50:21

Speaker

What's the most ridiculous bug you've seen, in the Husky migration or outside of it, in your whole career? Ridiculous or painful? Like the... The worst one, so this is back to my Nitro days. So Nitro was the re-implementation of EC2's data pipeline. That was the first version; it became the whole hypervisor.

00:50:41

Speaker

And so it was literally a bits-to-bytes conversion. A divide by eight had been left off by our strongest engineer, who had left, though, like a year before. And it turned out it was on an ICMP reply. So you get a packet into the data pipeline, and you basically have to send back an ICMP "packet too big", and you have to...

00:51:02

Speaker

offset into the source packet on, like, the port or whatever. And this was like two characters, a divide by eight. So a tiny bug by the strongest engineer I've ever seen, but it was sitting there and no customer had hit it until it was hitting one customer at scale.

00:51:17

Speaker

And this two-character bug ended up taking us about three developer months to ship, because we had to be so careful, and understanding whether this impacted any customer other than the one who'd reported it to us was a massive endeavor. And so I'd say it was just ridiculous in terms of the number of characters to the amount of man-months it took.

00:51:39

Speaker

And, you know, it was classic. The code did not have unit tests. So it's a classic "this is why you unit test and not just integration test."
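The class of bug described here, a missing divide-by-eight in a bits-to-bytes conversion, is the kind of thing a small unit test pins down. The sketch below is purely illustrative and is not the Nitro code: `header_offset_bytes`, `quoted_field`, and the packet layout are hypothetical.

```python
def header_offset_bytes(offset_bits: int) -> int:
    """Convert a field offset expressed in bits into a byte offset."""
    if offset_bits % 8 != 0:
        raise ValueError("offset is not byte-aligned")
    return offset_bits // 8  # the short divide-by-eight that is easy to leave off


def quoted_field(packet: bytes, offset_bits: int, length_bytes: int) -> bytes:
    """Slice a field out of the original packet, e.g. to quote it in a reply."""
    start = header_offset_bytes(offset_bits)
    return packet[start:start + length_bytes]


def test_quoted_field_uses_byte_offset():
    packet = bytes(range(32))  # byte i has value i, so offsets are easy to check
    # A field starting at bit 160 starts at byte 20.
    assert quoted_field(packet, 160, 2) == bytes([20, 21])
```

If the division were dropped, the slice would start at byte 160 of a 32-byte packet and come back empty, so the test fails immediately instead of waiting for one customer to hit the path at scale.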

00:51:50

Speaker

But yeah, it's just the one where the pain was so deep and so hard to resolve. And essentially, it took my strongest engineer three months to fix a two-character bug. That's horrible. That's the smile of someone who experienced incredible pain a decade ago. Exactly. Yeah, I mean, in hindsight, it's all lessons. At the time, it was absolutely awful. And, you know, a great diversion from what we wanted to be doing.

00:52:19

Speaker

Cool. And last one: if folks want to follow up, where can they find you on the internet? I think the easiest: Junction Labs has a blog. So I think junctionlabs.io is probably the easiest place. Of course, I'm on LinkedIn. I'm one of only two Ian Nolands on LinkedIn, and I'm the only one with long hair at this point in my life. So easy to find on LinkedIn.

00:52:40

Speaker

Right on. All right, we'll drop links to that. Thanks so much for being on. Excellent. Thank you for having me.