
Platform Engineering at NYTimes

S4 E1 · Kubernetes Bytes
1.8k Plays · 10 months ago

In the first episode of Season 4 of the Kubernetes Bytes podcast, Bhavin sits down with Ahmed Bebars, Staff Software Engineer at NYTimes, to talk about how The Times uses Kubernetes and Platform Engineering to accelerate developer productivity and improve developer experience. They talk about what the technology stack at NYTimes looks like, how the platform team has built a resilient platform on Amazon EKS, and share some best practices for anyone starting their journey with Platform Engineering.

Check out our website at https://kubernetesbytes.com/

Timestamps:

  • 01:20 Cloud Native News
  • 09:55 Interview with Ahmed
  • 50:04 Key takeaways

Cloud Native News:

  • https://www.aquasec.com/news/60m-additional-funding/
  • https://devclass.com/2023/12/12/docker-buys-atomicjar-to-integrate-container-based-test-automation/
  • https://www.businessinsider.com/armory-acquired-startup-harness-7-million-2023-12
  • https://techcrunch.com/2023/12/19/scaleops-looks-to-cut-cloud-bills-by-automating-kubernetes-configurations
  • https://wraltechwire.com/2023/12/21/ciscos-latest-cloud-play-exec-explains-the-deal-for-tech-startup-isovalent/
  • https://www.kubernetes.dev/resources/release/

Show links:

  • https://open.nytimes.com/
  • https://tickets.kcdnewyork.com/
  • https://github.com/abebars
  • https://www.linkedin.com/in/ahmedbebars/
  • https://github.com/nytimes
Transcript

Introduction to Kubernetes Bytes

00:00:03
Speaker
You are listening to Kubernetes Bytes, a podcast bringing you the latest from the world of cloud native data management. My name is Ryan Walner and I'm joined by Bhavin Shah, coming to you from Boston, Massachusetts. We'll be sharing our thoughts on recent cloud native news and talking to industry experts about their experiences and challenges managing the wealth of data in today's cloud native ecosystem.
00:00:31
Speaker
Good morning, good afternoon, and good evening wherever you are. And we are coming to you from Boston, Massachusetts. Today is January 8, 2024. Hope everyone is doing well and staying safe. Let's dive into it.

Season 4 Launch

00:00:44
Speaker
Happy New Year, everyone. Kubernetes Bytes is back. We just finished season three with an AI/ML intro, or 101, episode. Hope you guys liked that. And we are excited to jump into the season four, or 2024 season, with an exciting episode with
00:01:01
Speaker
a Kubernetes user. So before we actually go into the topic for today and introduce our guest, let's talk about what's been going on in the cloud native ecosystem recently.
00:01:14
Speaker
I know we were off for like a few weeks, so we do have a lot of exciting news to cover. Let's start by talking about some funding rounds and some acquisitions.

Industry Moves and Acquisitions

00:01:22
Speaker
Talking about funding rounds, Aqua Security, the cloud native security vendor behind the open source project Trivy, raised an additional $60 million in funding, for a total of $325 million raised over their lifetime.
00:01:35
Speaker
This looks like an extension of their Series E round, based on their website and Crunchbase, which basically helps them maintain their valuation over a billion dollars. So congrats on still being a unicorn, Aqua Security. Next up, we have Docker.
00:01:56
Speaker
Docker buys a small startup called AtomicJar to integrate their container-based test automation tools, specifically the Testcontainers project. Testcontainers is an open source, multi-language library for providing throwaway, lightweight instances of databases, message queues, web browsers, anything you can package in a container, for your test and dev workflows.
00:02:24
Speaker
So go and check that out; we'll have a link for it in the show notes. They didn't say how much the acquisition was for or how many people will join the team, but it looks like the entire AtomicJar team will now be part of Docker, including all of their open source projects and efforts.
00:02:39
Speaker
Next up, in terms of acquisitions, Harness.io, a key player in the continuous delivery ecosystem, buys another CD startup called Armory. It looks like they bought it for $7 million, based on Business Insider's reporting. Armory had raised more than $80 million, so it's sad that they had to sell themselves for $7 million.
00:03:02
Speaker
But hey, the technology is still around, and I'm hoping the people at Armory still have a job. So that's one way to get an exit, I guess. But yeah, Harness adds to its tool chest, or war chest, and adds more tools to its ecosystem.
00:03:19
Speaker
Next up, talking about a funding round, ScaleOps, a very small startup, I think based out of Israel, raises $15 million in a Series A round, bringing the total money raised to $21.5 million. ScaleOps, as the name suggests, helps organizations scale their Kubernetes deployments and clusters,
00:03:40
Speaker
and do that cost effectively, right? Based on their website, it looks like they do dynamic pod rightsizing for compute and memory, so your CPU and memory requests and limits, but they do that dynamically instead of asking the user to update those limits every time.
00:03:58
Speaker
They also optimize the nodes that are part of your Kubernetes cluster. I'm thinking about this as an offering that works for public cloud or managed Kubernetes platforms to begin with. In terms of node optimizations, they help by removing under-utilized nodes. If your nodes are not utilized fully, it will remove those nodes by moving the pods that were running on those under-utilized nodes
00:04:22
Speaker
to other nodes in the cluster. They also help you replace expensive nodes that your developers might have asked for with cheaper ones, because they know how much compute, memory, and storage capacity is actually needed by your workloads. And it also consolidates pods for more efficient compute utilization. So it looks like they're mostly focused on compute. Again, I won't say they are the first vendor in this space, but they're definitely
00:04:50
Speaker
bringing more light to this space, right? Given the market we're in right now, with the ZIRP, or zero interest rate, era done for almost a year, this will help organizations maintain their Kubernetes environments cost effectively. Next up, we have another acquisition. I know, it's been a busy three weeks. Who would have thought
00:05:15
Speaker
over the Christmas and New Year holidays we'd see so many funding rounds and acquisitions? But this one is a big one: Cisco acquires Isovalent. Isovalent, if you don't know them, which would be weird, is the company behind the Cilium project and the Tetragon project.
00:05:35
Speaker
They do a lot of good things in the CNI and eBPF ecosystems, and are now part of Cisco's security group. Again, only the intent to acquire was announced; I think the acquisition is supposed to complete in the third quarter of this year. No price was shared, but looking at the history, Isovalent had raised $69 million through its Series A and Series B.
00:06:00
Speaker
Importantly, Cisco was already part of both of these funding rounds, and by Cisco here I specifically mean their venture arm. So Cisco had already invested in Isovalent over a couple of rounds. Personally, I think this is a great move. I've worked on Cisco NX-OS and IOS in the past, and I've seen how widely used they are. This just means that, with Cilium being the open source project it is and
00:06:28
Speaker
with how it has become the default CNI plugin for EKS, AKS, and GKE, this helps Cisco maintain their presence even in the cloud native ecosystem. Great move by Cisco. Congrats to everyone at Isovalent. You have a new home now.

Kubernetes 1.29 Release Highlights

00:06:46
Speaker
One last piece of news, not related to acquisitions or funding rounds: Kubernetes 1.29 went out, I think somewhere in mid-December. It was after we published our AI/ML episode, so we didn't cover it then, but it's GA for everyone to use. That was the last release of 2023, just in time, because if they pushed anything
00:07:08
Speaker
out during Christmas, I don't think it would have been picked up as well. This is a good, stable release: 49 enhancements, with 11 of those features moved to stable, 19 graduated to beta, and 19 to alpha. A couple of things caught my eye from the stable enhancements. ReadWriteOncePod, which was in beta till now, is now a stable feature. So if you are using
00:07:37
Speaker
persistent volumes on Kubernetes, you know that the access modes are ReadWriteOnce and ReadWriteMany. Now we have ReadWriteOncePod. The way to differentiate this is: if you're using ReadWriteOnce, you would think that only one pod will have access to that specific persistent volume, but ReadWriteOnce actually gave,
00:07:57
Speaker
or could give, all the pods running on one node in your Kubernetes cluster access to that persistent volume. ReadWriteOncePod restricts it even further: only one pod can have access to a volume at a time. So go look at it if that's something you're interested in. The second thing that caught my eye is that KMS v2 encryption at rest is available for all Kubernetes resources in your cluster. So
00:08:22
Speaker
it's a reliable solution to encrypt all resources. We'll have a link in the show notes that talks about how to do that. Again, this feature has been around: KMS v1 was around for a while, and KMS v2 has been in beta for a while. This is now just generally available.
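For context, the ReadWriteOncePod access mode discussed above is requested on a PersistentVolumeClaim like any other access mode. A minimal sketch (the claim and storage class names are illustrative, and the CSI driver in use must support the feature):

```yaml
# PVC using the ReadWriteOncePod access mode (stable in Kubernetes
# 1.29): the volume can be mounted by exactly one pod at a time,
# not merely one node as with ReadWriteOnce.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer-data        # illustrative name
spec:
  accessModes:
    - ReadWriteOncePod
  resources:
    requests:
      storage: 10Gi
  storageClassName: csi-example   # illustrative; driver must support RWOP
```

If a second pod tries to mount this claim, its scheduling fails with an access-mode conflict rather than silently sharing the volume.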
00:08:38
Speaker
That's it for news for

Kubernetes at The New York Times

00:08:40
Speaker
this episode. I know we had a few articles and announcements, but let me move to the next step and talk about our guest for today, right? So we have Ahmed Bebars, hopefully, sorry Ahmed, if I butchered your last name, but he's a staff software engineer working for The New York Times. As part of season four, we want to strive towards getting more and more of these user stories.
00:09:02
Speaker
I know each season we move from doing more 101 episodes to bringing on more vendors and startup founders to talk about what's new and exciting. This year we also want to focus on getting more user perspectives. The feedback that we got after the Chick-fil-A and Major League Baseball episodes that we did last year was,
00:09:20
Speaker
to summarize it in a single word, exciting. So we will strive to get more and more users who use Kubernetes on a day-to-day basis to join us for an episode and talk about their experience. So if you are one of those users,
00:09:36
Speaker
please reach out to us. We would love to have you on an episode. It doesn't matter if you are The New York Times or Major League Baseball; even if you are a ten-person team using Kubernetes, we would love to hear from you and have you share your experiences with the rest of our listeners. OK, so without further ado, let's bring Ahmed on. Ahmed, welcome to the Kubernetes Bytes podcast. Happy New Year and thank you for joining us. Why don't we start with a quick introduction of who you are and what you do at The New York Times, and then we can kick off the official episode.
00:10:05
Speaker
Sure. First of all, happy new year to you, and thanks for having me. I'm really excited to talk about Kubernetes and what we are doing at The New York Times. My name is Ahmed Bebars. I'm a staff software engineer at The New York Times, specifically in delivery engineering.
00:10:21
Speaker
Our focus is to maintain and build core infrastructure and develop a centralized platform for our product engineering teams, to help them build their applications and services and all of that kind of stuff. And I work with multiple teams inside the organization, from CI/CD to Kubernetes, CDNs, and other things.
00:10:42
Speaker
Okay, awesome. So is it just one team that supports all product engineering teams inside The New York Times, or do you have multiple teams supporting different organizations? It really depends. We are organized into missions, and it depends on what your goal looks like. In delivery engineering, we have multiple teams dedicated to different areas like CI/CD. My team is working on Kubernetes, but I also work with the CDN team,
00:11:09
Speaker
doing all of the service delivery, or the traffic delivery. Then there are teams responsible for resource provisioning and all of the cloud stuff. And then we have teams for developer experience. So it's really an organization that tries to centralize the entire process and build a central platform, to make things more standard and also to avoid the productivity cost of every team trying to maintain their own infrastructure, things like that.
00:11:39
Speaker
Okay, okay. That's good, right? So let's start with basics. I know you are on the Kubernetes Bytes podcast, so let's start by talking about when and why you and The New York Times chose to go down the containers and Kubernetes route. How was that decision made? And what are some of the benefits that you see? Yeah. So I've been with The Times for about five years now; I can't speak to before that. My understanding is
00:12:05
Speaker
that a lot of teams were using Kubernetes even before we started a central platform. But as you can imagine, I'm a team building my own clusters and running my own things, another team is doing the same, and every team is doing whatever they feel empowers them to deliver their services. So when we started our journey to make things centralized and
00:12:30
Speaker
build a platform that makes life easier for every engineer and makes the process smoother, we thought about what the right tool was to build containers on. Or even before containers, what is the right tool to ship a runtime? With all of the ecosystem, and seeing Kubernetes happening and all of the work there, we thought Kubernetes could be a good tool, since it's open source and more cloud agnostic.
00:12:59
Speaker
It sits across all of the cloud providers, and you can run it on premises if you need to. So that's what we decided, maybe two years ago when we started thinking about the central platform: that's the tool we're going to build the platform on top of.
00:13:15
Speaker
Okay, no, I think that makes sense, right? And that's what we see in the community as well: instead of having developers or individual application teams reinvent the wheel and try to do everything on their own, larger organizations that can afford it have a centralized platform team going down that route. We also have a fancy term, platform engineering, as the umbrella term that covers all of these things. So I guess my next question to you is:
00:13:41
Speaker
When you started this journey at The Times, did you start with some open source project to build your IDP or internal developer platform? Or did you end up building something that's more custom to you, that's more proprietary to The Times?
00:13:54
Speaker
That's a very good question. With every platform, you're trying not to reinvent the wheel. You try to see what's available in the market and what's happening, and then you customize on top of it. So when we started, Kubernetes was going to be a good tool for that. So what other components were we looking for?
00:14:12
Speaker
Looking at the open source ecosystem, we were already doing some stuff, specifically on EKS. So by then we were looking into what other things we could use. We needed, for example, network isolation, so we needed network policy and all of that kind of stuff. Kubernetes provides some of this, but we needed something more, because we are running a multi-tenant architecture.
00:14:33
Speaker
That's where our decision came in. There are a couple of products, but we decided Cilium was a good approach. Another thing we needed to ensure is that all of the policies are intact; we have governance and all of that kind of stuff, so Gatekeeper came into the mix. Later in the process, we were thinking about service mesh, so Istio came into the picture and we started using it. Then we started using Cluster Autoscaler, but things changed a bit, and we started to look at Karpenter.
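The network isolation piece mentioned here often starts from a default-deny baseline. A generic sketch of such a policy, not The Times's actual configuration (the namespace name is hypothetical, and Cilium also supports its own richer CiliumNetworkPolicy CRD):

```yaml
# Hypothetical example: deny all ingress traffic into a tenant
# namespace by default; teams then allow specific flows explicitly
# with additional, narrower policies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: tenant-a          # hypothetical tenant namespace
spec:
  podSelector: {}              # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress                  # no ingress rules listed, so all ingress is denied
```

A CNI that enforces NetworkPolicy, such as Cilium, is required for this to take effect.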
00:15:02
Speaker
But these are all open source tools that we're using. However, there are some other things that we are building in-house. For example, we wanted to automate the process of onboarding teams and accounts to our Kubernetes clusters. So we built a controller that, based on some triggers, when an event happens, starts to build the CRDs and all of that kind of stuff. So we have some controllers that we built in the platform
00:15:31
Speaker
to help us with the process. But we try to rely more on open source tools, so we can ship things faster, have the ability to contribute upstream, and not have to recode everything. That makes sense. Yeah. Okay. So you already described a lot of your tech stack today. I know you said you use EKS as your distribution. Is The Times all in on AWS EKS, or are you also multi-cloud?
00:15:59
Speaker
It really depends; that's the way I would think of it. When we decided where we were going to build the platform, we had some thoughts about where it was going to be. I think we decided on building the platform on top of AWS for now.
00:16:21
Speaker
AWS provides us with the tools, but at any point in time you can go multi-cloud. I think it's an investment you have to make, and a decision where you have to ask: do we really need it or not, and how would we do it? People always say things are cloud agnostic, but when it comes to the point where you have to change providers, you have to change a lot of things. So it's not an easy decision to make.
00:16:48
Speaker
I think for now, we are using EKS as the Kubernetes distribution for the platform. I agree. There is going to be lock-in at some layer of the stack. You can't be completely vendor-agnostic; that would basically be saying you're building everything

Platform Engineering Practices at NYT

00:17:04
Speaker
in-house.
00:17:04
Speaker
Exactly. The juice is not worth the squeeze, I guess. Talking more about your platform: you said you support a lot of application teams. The reason I'm asking is that I've used Backstage a bit and seen tools built on top of it. It has a nice user interface that developers can use to select these apps, or these golden paths, to get started.
00:17:30
Speaker
Does something similar exist inside the platform you have built? I've also seen scenarios where developers are used to just using Kubernetes CRDs; once you have clusters for them, they can create the right CRDs and get access to what they need. Which way are you leaning?
00:17:52
Speaker
I can tell you my opinion, and I can tell you what we are doing. So first of all, what are we doing? I think the main important piece here is that we think about productivity and the developer experience. I can give you all of the tools to do whatever you need, and it's going to be
00:18:08
Speaker
as open as you need, and all of that kind of stuff. But that still doesn't give you that easy, quick golden pathway; it's just a bunch of tools compiled together with someone saying, hey, these are the tools you can use. So to take a step back: we thought about the platform in a workflow way. What are you trying to do? I'm trying to onboard. So if you're trying to onboard, you go to a form, you go to our interface, and that's going to onboard you.
00:18:35
Speaker
What is this onboarding going to do? It's going to create a few things for you. Now I'm trying to choose a container runtime; this is going to be Kubernetes at the moment. So what else? If I describe the entire flow for you, the first step is onboarding, and then develop.
00:18:50
Speaker
The onboarding step is exactly what you meant: an interface where the user enters some information. I need to use this CI, I need to use Argo; what does my service look like; am I exposing it publicly or not. And then a templating engine creates all of the resources, from a Git repo to all of the templates that you need, whether Kustomize or Helm,
00:19:15
Speaker
whatever all of this looks like. And then there are a few CRDs involved here. Maybe we have a CRD for our traffic management; that's something we built that will get templated for you based on that information. So from start to end,
00:19:32
Speaker
from when you enter your information for the service you are trying to deploy until you get to production, we're talking about 10 minutes to get there with a template. We say go, and then you get an HTTP server, for example. That helps you not have to think about how to do a lot of things; all of these are shipped out, and then you can customize and build on top of it. Okay. So as part of the onboarding workflow, when they request a certain amount of resources,
00:20:02
Speaker
is it such that you have one huge Kubernetes or EKS cluster and developers get their own namespace in that multi-tenant environment? Or do you have a cluster-as-a-service, or a cluster vending machine, where every developer gets their own small three-node or four-node cluster based on what they want to do?
00:20:18
Speaker
Yeah, very good question. I've spoken about this in multiple talks that I have done, but we decided to go with a multi-tenant approach. So we have specific clusters, across environments and across regions, that we deploy applications to.
00:20:38
Speaker
You can get more than one namespace; it's something we already provide out of the box. When we onboard you as a tenant, or as an entire account, we give you a default namespace with everything from that controller I spoke about earlier: you get your network policies, you get your Istio gateways, you get all of the stuff you want out of the box. Even if you are not onboarding through the platform, you can still use Kubernetes out of the box within the setup that we have. But if you take it to the
00:21:07
Speaker
other end and use the platform as a whole, you get the whole experience. But we also know that one size does not fit all. So we have a few cases like: I need multiple namespaces. With the same controller, you just write a small CRD and you get the new namespace with all of the default values that we have. So basically,
00:21:30
Speaker
it's multi-region, multi-tenant, with different nodes and all of that kind of stuff combined together. And, I have to be honest, we are still early in the journey, so things may change. We've been thinking about: what if we need to spin up one cluster for a specific need? How are we going to do that? How would things like that be done? Okay.
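The "small CRD" workflow described here might look something like the following. This is a purely hypothetical custom resource for illustration: the API group, kind, and every field are invented, since The Times's actual CRD is not public.

```yaml
# Hypothetical custom resource: a platform controller watching this
# kind could create the namespace plus the default network policies,
# gateways, and quotas described in the conversation.
apiVersion: platform.example.com/v1alpha1   # invented API group
kind: TenantNamespace                       # invented kind
metadata:
  name: search-team-staging
spec:
  team: search                # owning team, used for labels and RBAC
  environment: staging
  defaults:
    networkPolicies: true     # apply the default-deny baseline
    serviceMesh: true         # provision Istio gateways / sidecar injection
```

The appeal of this pattern is that tenants get a one-file, declarative request while the controller keeps every generated resource consistent with platform defaults.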
00:21:55
Speaker
No, I think that was a good description of how you handle multi-tenancy. That leads me to the next question. If you are maintaining your own platform and a huge cluster, how do you do day-zero and day-two operations? Day zero might be okay: you deploy that huge cluster once and everything is set up. What about day two? How do you handle Kubernetes version upgrades or AMI upgrades? Are these disruptive, non-disruptive, transparent to the developer? Can you talk more about that, please?
00:22:24
Speaker
Yeah, this has been a problem in general; the community has been thinking about it. Should I do blue-green? Should I do in-place? What are the tools? It really depends on what we're changing. But to be fair, we try to be honest and transparent about how we do upgrades. So when we try to do anything that we think might hit any
00:22:52
Speaker
issue, we always keep everyone posted: hey, we are upgrading this from this version to this version, just in case you see anything. But we have more than one environment to test things in. We don't ship things to production directly; it's just the normal
00:23:13
Speaker
SDLC, the software development lifecycle. And then we have a beta environment that no one has access to, where we test stuff. Then we ship it iteratively to the right place. So we try to handle it that way. But there's one thing I'm really happy with, which is how the community thinks about this.
00:23:35
Speaker
For example, the approach that Karpenter uses for upgrading AMIs is really nice, because we don't say we need that specific AMI; we basically say, get the latest, for example. And then we have our setup to expire nodes after a certain amount of time, so that gets us a new node. Oh, that's awesome. Yeah. So we don't have to maintain
00:24:01
Speaker
all the patching and all of that kind of stuff that comes with long-lived nodes. On the other hand, when we talk about the control plane for EKS specifically, that's done by AWS, so we just patch it. I think they keep adding more features around compatibility and other things, but we have to do all of our due diligence, from the perspective of someone using, say, HPA v2beta1, for example:
00:24:30
Speaker
v2 came up, and then things start to break, and all of that kind of stuff. But yeah, we prefer to do in-place upgrades. Having the multi-region setup also helps, because we don't necessarily ship to every region at once; we start with one region, see how it reacts, and then go to the other regions, which helps mitigate any problems that might happen along the way.
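The "get the latest AMI and expire nodes" pattern described here can be sketched with Karpenter's v1beta1 API, current around the time of this episode. This is a generic illustration, not The Times's actual configuration; the role name and discovery tags are hypothetical.

```yaml
# Hypothetical Karpenter NodePool: nodes are recycled automatically
# every 30 days, picking up whatever AMI resolves as latest then.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    expireAfter: 720h          # rotate nodes after 30 days
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2               # no pinned AMI ID, so Karpenter resolves
                               # the latest Amazon Linux 2 EKS AMI
  role: karpenter-node-role    # hypothetical IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster   # hypothetical tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster
```

Because expiry drains nodes gradually, the fleet converges onto new AMIs without a big-bang upgrade event.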
00:24:56
Speaker
Okay, that makes sense. So talking more about the developer experience: let's say there is a new request from the product team, and a developer has to work on adding a new feature. If it's a really minor feature, how do they go about that? They'll write their code, maybe in their local IDE.
00:25:17
Speaker
Where do they test it? Where do they run unit tests? What does the CI/CD pipeline look like? And are the changes to production automated? I know you said it's really fast for a developer to get something from laptop to production, but I wanted to see what that process actually looks like. Yeah,
00:25:34
Speaker
definitely, I can speak to that. When I mentioned 10 minutes earlier, I was talking about the onboarding experience for a new service, not one serving any traffic when it's deployed to production; the domain is there, the service is ready, but there's no actual code. Now let's talk about the lifecycle in general. So we start with
00:25:56
Speaker
engineers testing their code locally, testing how the environment ships and how all of that kind of stuff works. Then we've developed a couple of things around pull request environments: once you open a pull request, you get an environment on top of the platform that specifically shows the changes you have made, builds an image, and builds all of the necessary tooling behind the scenes.
00:26:21
Speaker
And then you go through the normal review cycle; everything looks good, you ship it. Some features go out fully once you merge; it's really a preference for how teams work. For some, when you merge to main, it goes through a lifecycle: deploy to dev, auto-promote to staging, run tests, continuous integration. Some other teams have to
00:26:46
Speaker
promote manually; it really depends on the use cases. But yeah, the lifecycle is: you go from local to an environment you test on, similar to the environment you'll run in. Then you go to your development environment and do more tests. And staging is more like that last environment before production, where you actually test everything.
00:27:09
Speaker
And there are other things that we also have, so it really depends on the feature: you could add smoke tests, stress tests, or integration tests with other services, depending on whether your feature or service talks to many other internal services in the company, things like that. So it really depends on the situation.
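A promotion flow like the one described, merge to main, deploy to dev, promote through staging, is often wired up with Argo CD, which the conversation earlier mentions as one of the onboarding choices. A generic sketch, with a hypothetical repository URL and overlay paths:

```yaml
# Hypothetical Argo CD Application: syncs the staging overlay of a
# service's Kustomize config from Git; promotion happens by updating
# the overlay (for example, the image tag) in the repository.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/my-service-deploy  # hypothetical repo
    targetRevision: main
    path: overlays/staging      # hypothetical Kustomize overlay path
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true               # remove resources deleted from Git
      selfHeal: true            # revert manual drift back to Git state
```

One Application per environment keeps each stage independently auditable, while the Git history doubles as the promotion record.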
00:27:29
Speaker
And do you do any form of production-like tests, right?

NYT's Data Handling Strategies

00:27:34
Speaker
The amount of traffic that The New York Times receives across all their different apps is at a different scale than what a developer usually thinks about. So is that involved somewhere? Last year, we were talking to a company called Speedscale that basically takes a dump of all the production data,
00:27:49
Speaker
the petabytes and petabytes of data, and whenever you have your app in that staging environment, they test it against production-like data to see if everything works as expected, and only then push it to production. Is that something that The Times does?
00:28:05
Speaker
That's not an area I'm super familiar with, but I can tell you some teams might be using canary releases and a couple of other things. But there's one thing I've been seeing recently that we started: because we are in a centralized place, we've started adding features to our ingress. So we built an ingress for the platform that does mirroring.
00:28:28
Speaker
So in that situation, teams can decide: I want to mirror some of the traffic from production in a specific way and send it somewhere else, to test what that's going to look like. It's always experimentation, depending on how you do it, because a lot of services play a role in the deployment. But yeah, I have seen some of this; it's not an area where I do day-to-day operations, it's more the
00:28:55
Speaker
app teams who are more familiar with things like that. Gotcha, that makes sense. So going back to what you are responsible for, maintaining the platform: how do you design to avoid single points of failure?
00:29:10
Speaker
A node can go down, there might be some outage between AZs, or us-east-1 once in a while might give you a mini heart attack. How do you plan for those events? Can you talk about the resiliency that you have built in? And is the developer responsible for doing that at the application layer, or is the platform supposed to own it?
00:29:29
Speaker
Yeah, so that's a really good question, and there are different points where we talk about resiliency. I'm going to give you a couple of examples. One of them is the ingress. The ingress is built automatically to manage failover: we've deployed it in a multi-region capacity, and your traffic is already going through multiple regions. If one region fails, even if you are deployed to a single region, your traffic goes to the other region and gets traversed over the internal network.
00:29:59
Speaker
So that avoids having a single place where traffic will fall down. For Kubernetes itself, we have a service mesh, so clusters in the same environment in two different regions are already meshed together using Cilium and Istio. If you are deploying your service multi-region and one of your regions fails, the other region should automatically take over. And most of these features we are trying to
00:30:29
Speaker
give to the engineers out of the box. But we also have a lot of customizations that you can play with; maybe you are not in an active-active situation, you're active-passive. It depends on each team's requirements, but for most of it, we are trying to think
00:30:50
Speaker
a lot more into how to take the best practices, implement them, and give all of the teams the freedom and the ability to decide what features they need to enable or disable, or what their specific requirement is. But we always think about this, because if you centralize the platform, you just want to make sure that it's resilient enough to avoid any single points of failure.
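The failover behavior described here, where traffic entering one region transparently spills over to another when the first is unhealthy, can be sketched roughly as follows. The endpoints, health-check path, and client-side ordering are hypothetical illustrations, not the Times' actual setup, which does this at the ingress layer:

```python
import urllib.request

# Hypothetical regional endpoints; a real setup would use health-checked
# DNS or load-balancer targets at the edge, not a client-side list.
REGIONS = [
    "https://us-east-1.example.internal",
    "https://us-west-2.example.internal",
]

def healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Probe a region's health endpoint; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def pick_region(regions, probe=healthy):
    """Return the first healthy region, falling through in declared order."""
    for region in regions:
        if probe(region):
            return region
    raise RuntimeError("no healthy region available")
```

The sketch only shows the fall-through ordering; in practice the decision is made by the ingress and load balancers rather than in application code.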
00:31:17
Speaker
Yeah, and that makes sense, right? I think that centralized team really helps with that, because everybody doesn't have to build that logic inside the application layer and spend more time delivering a single feature; they can rely on the platform being that resilient layer. But you spoke about multi-region and the ingress or a load balancer basically routing traffic based on availability, right? How do you handle data? I remember that you don't run data on Kubernetes; you're leveraging other AWS services.
00:31:45
Speaker
But is it active-active in the first place, and how does it work cross-region? Yeah, that's a great question. That's also one of the things where we are really early in the journey. If you're rebuilding a cluster and you have volumes and all of that kind of stuff, it makes things more complicated: how will you back all of this up, how do you restore it? And we are running,
00:32:12
Speaker
as I think I mentioned earlier, in spot mode as well for some instances, so they can go away. So we are trying to use
00:32:21
Speaker
any managed storage that we can leverage. So let's say, for example, we are using AWS with EKS: what managed storage is there? If you want to save a file, it's probably S3; that would be the first place. If you want a database, you would probably go to RDS. Just have it there. How these are deployed might be different from application to application.
00:32:51
Speaker
In some use cases, you are just writing a file to S3, so you can afford going from the us-west region, for example, to the same S3 bucket. But there are other situations that are very latency-sensitive. Say we're talking about cache, about Redis: you may have a distributed cache in each region, where one can be the leader and the other takes over when something happens.
00:33:16
Speaker
But this is really dependent on the situation. I have seen databases that are multi-AZ, multi-region, or a global database, where your access endpoint depends on where you're coming from and it's handled automatically. I have seen applications decide, "I always want to write to one region," and when there's a failure in that region, there's automation to promote the other database somehow. So it's also dependent on
00:33:44
Speaker
what we offer: you can do a single deployment, you can do multi-region, and all of our clusters are multi-AZ by default within a region. That solves a couple of pieces of the problem. But then you come to permanent or persistent storage, and how that is handled has a lot of different cases. But we recommend going for
00:34:10
Speaker
managed tools to help with the whole process in general for your application. Gotcha, and that makes sense for your infrastructure requirements.
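The active-passive pattern mentioned above, one writer region with automation to promote the other database on failure, reduces to something like this toy model. Real promotion would go through the managed service's own APIs (RDS failover, Redis replication) and is far more involved:

```python
class ActivePassiveStore:
    """Toy sketch of active-passive failover: one leader takes writes,
    and the follower is promoted when the leader is marked unhealthy."""

    def __init__(self, leader: str, follower: str):
        self.leader = leader
        self.follower = follower
        self.healthy = {leader: True, follower: True}

    def mark_down(self, region: str) -> None:
        self.healthy[region] = False

    def write_target(self) -> str:
        """Return the region that should take writes right now."""
        if self.healthy[self.leader]:
            return self.leader
        if self.healthy[self.follower]:
            # Promote the follower. In real systems this is an orchestrated,
            # carefully ordered operation, not a simple field swap.
            self.leader, self.follower = self.follower, self.leader
            return self.leader
        raise RuntimeError("no healthy replica to promote")
```

The region names below are illustrative; the point is only the promote-on-failure decision, not any specific topology.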
00:34:23
Speaker
One question: are these managed services part of your platform? So when developers are selecting, "OK, I need X amount of CPU and memory resources, and I need an RDS or a DynamoDB instance," do they request it as part of the platform? And then you have maybe Terraform or CloudFormation on the back end that's actually spinning these up and giving those endpoints to the developers?
00:34:44
Speaker
Yeah, this is the plan. Currently there are a couple of these that might have modules that spin them up. But the plan in general is also to be able to provide fewer options: if I already know where your traffic is coming from, I should know what security group to apply and what IAM role, all of that kind of data. For now, for example, a couple of things that we do: when you deploy an application to EKS, we give you the ECR repository.
00:35:13
Speaker
The Terraform for it comes built. We also give you IRSA, the IAM role for your service account, with the Terraform for it already built and deployed. There are other conversations and discussions about what else it could be: an S3 bucket, a Dynamo table, something else, where you just have to choose an option. So I think more stuff like that will come in and make the process even easier from a developer experience perspective.
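The self-service scaffolding described here, a baseline ECR repository and IRSA role for every EKS app plus opt-in extras, could be modeled conceptually like this. The resource names and the extras list are made up for illustration; the real platform emits Terraform rather than Python:

```python
def scaffold_app_infra(app: str, extras: set = frozenset()) -> list:
    """Return the baseline infra every app gets by default, plus opted-in extras.
    All resource types and naming conventions here are hypothetical."""
    allowed_extras = {"s3_bucket", "dynamo_table"}
    unknown = set(extras) - allowed_extras
    if unknown:
        raise ValueError(f"unsupported extras: {sorted(unknown)}")

    resources = [
        {"type": "ecr_repository", "name": app},                   # image registry
        {"type": "iam_role_irsa", "name": f"{app}-service-role"},  # pod identity
    ]
    for extra in sorted(extras):
        resources.append({"type": extra, "name": f"{app}-{extra.replace('_', '-')}"})
    return resources
```

The design point is the one from the conversation: defaults are generated for everyone, and anything beyond the baseline is an explicit, constrained choice rather than a free-form request.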
00:35:42
Speaker
I think that's awesome. That's the key thing with a centralized platform team: you have to treat it as an ever-evolving thing. It's not that, "oh, I have some automation and then I can call it a day and work on something else." No, it's about continuously improving the platform and making sure developers are using it.
00:36:02
Speaker
If you don't evolve it, they'll say, "OK, I can use the platform only for one specific niche use case, but everything else I have to do on my own." So I like that approach: there are always going to be new features. Yeah, exactly. Funny enough, in one of my last talks I was talking about a lesson learned:

Platform Evolution and AI Integration

00:36:24
Speaker
Platform engineering is not a project; it's a product. It has to be iterative, and you have to listen to customers. Exactly like what you're saying: it's not one thing that you're going to do and then, hey, it's done. There are many iterations you have to go through to get to the point where it's working, and then what's next, and all of that.
00:36:47
Speaker
OK, so you already brought up best practices from your own learnings. We have listeners to the podcast that are at all stages: there are organizations that are really mature, right along with you and innovating on this front, but there are also organizations or development teams that are just getting started. Can you share some of your best practices or lessons learned with our listeners, so that people don't repeat the same mistakes?
00:37:12
Speaker
Yeah, of course. As I'm describing our journey, whatever I'm describing here is what worked for us; there is no one-size-fits-all. I have seen other people using cluster-as-a-service, for example, or namespace-as-a-service. First of all, you have to look into what the need is inside the organization: what are you building for? One thing that I have seen is that,
00:37:38
Speaker
as engineers, we tend to say, "hey, I want to build a shiny thing, something I can brag about." But at the end of the day, if you have a platform without adoption, it's not a success. The mission is to build something that helps others; if it doesn't, that's not really the goal you're aiming for. The other thing is
00:38:02
Speaker
super critical: if you're building a product, a platform, anything, you should have great documentation. Documentation is key for how to do things. Even if you don't have the right automation, writing the simplest documentation that a user can understand makes the problem easier: just follow steps one, two, three. You would also avoid being in a support mode where
00:38:31
Speaker
you build something and it's working, but there are hundreds of use cases that people may go through, and instead of reading the documentation to see "oh, this is how I fix it," they have to come ask, and then you get dragged into a support mentality for your platform. The last thing, which is super important, is
00:38:54
Speaker
feedback. Customer feedback has to be continuous; if you can do it every day, go for it. In general, you build a feature, you need to get feedback, you need champions for it, you need to see what people really want. It might conflict with what you are trying to do, but this is where the value is in building it.
00:39:22
Speaker
If you think we should go this way, but everyone else thinks that building something else would be helpful, you should start to listen to customers about what's really necessary. There are smaller things, smaller automations, that could make it easier. One thing that I've been seeing, for example, and this is a personal experience for me:
00:39:44
Speaker
I know how to deploy an app, for example, inside and out. People who work on the platform know how to do things. Then you take it for granted and think it's easy and self-explanatory to other people, and you get into this trap of "it's super easy." It seems easy because it's your day-to-day, but when someone else comes to use your tool, they don't necessarily know it. So if you don't give them the right
00:40:14
Speaker
approaches and the right documentation, they're just going to feel like something is missing. Smaller automation pieces and smaller bits can make the entire process easier. So instead of telling someone, "hey, here's a list of 10 things," if you don't have full automation, you could give them a small bash script
00:40:34
Speaker
that does those 10 things and makes the process easier. That would eliminate user error, anything like that. OK, I think those are useful tips; I'll make sure that I summarize those in the key takeaways section at the end of the episode. So, with Kubernetes, right: if you listen to any of the 2023 episodes, you'll see that we started a new question at the end. We asked a question to ChatGPT and got a response from it, and we asked the same question to our guest, and
00:41:03
Speaker
it was a funny question at the end, just a way to involve more AI. For 2024 (this is season four for us) what we are thinking is, instead of doing that gimmicky bit, to actually ask: what are you doing with AI? How is AI impacting your day-to-day? How are you planning on incorporating AI into your existing product or platform?
00:41:23
Speaker
Any of these things: how are you thinking about AI today, Ahmed? Yeah, so let me answer the question in two parts, as an organization and then personally. As an organization, I think it's important to look into AI and machine learning, and I know there are folks who are looking at this area. There's stuff that we've shared already. For example, if you go to open.nytimes.com,
00:41:49
Speaker
we have our Open blog, and you can find articles about how our Cooking teams are using machine learning to do personalization. You can read about other things too. I don't think it's something that we are unfamiliar with; we do have machine learning teams in the organization.
00:42:11
Speaker
They specifically do things with models, with data, and a lot of other things; it's not my area of expertise to dive deeper. But we do have the capabilities and folks who are looking at things like that. So before you go into the personal bit: long term, as part of this ever-evolving platform, are you thinking about supporting something like Kubeflow or MLflow or some other
00:42:37
Speaker
model orchestration layer on top of Kubernetes, and maybe extending your customer or user base to data scientists, not just developers? Is that going to happen? I believe that is a vision that I personally have in mind. Right now we are working with services and APIs and all of that kind of stuff, but there's another aspect of how you do a lot of other things on Kubernetes: how you do data batching, how you do data streaming, how you do
00:43:06
Speaker
machine learning. As we provide a simple way of doing services and deployments, can we provide a simpler way for those too? I have seen examples of Jupyter as a service: "hey, I need a quick environment to do something." Or batch jobs that do stuff. So it's completely off the top of my head how we can evolve into this area. We also have
00:43:35
Speaker
data teams and machine learning teams that we have to work with to orchestrate how something like this works. One of the benefits that I've seen in some re:Invent talks is
00:43:48
Speaker
running Jupyter, Kubeflow, or any other JupyterLab notebook on EKS and combining it with the power of Karpenter. Whenever you actually want to execute your notebook, that's when Karpenter will spin up those additional GPU nodes, and once the job is done, it automatically spins them down. I was like, wow, that's awesome. And I brought this up in our last episode as well, but I love the move from AWS about open sourcing
00:44:14
Speaker
Karpenter, so it's not just restricted to AWS now; you can use it with other cloud providers. Obviously that will need some more work, but that option is out there. So if you're building with Karpenter, that's a great tool which is now available to the open source audience. Exactly, and I was really impressed by how they thought about it. Recently I think I saw a provider for Karpenter on AKS,
00:44:39
Speaker
and it does the same logic, because the logic itself really helps with scaling. I think a couple of my colleagues did a talk at the last KubeCon about how we are using Karpenter. For things that spike fast and all of that kind of stuff, Karpenter is fully capable.
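The scale-from-zero behavior being described, where pending GPU pods trigger new nodes that disappear again when idle, boils down conceptually to a reconcile decision like this. This is a toy model for intuition only, not Karpenter's actual provisioning algorithm, and it assumes one GPU per pod:

```python
def reconcile(pending_gpu_pods: int, gpus_per_node: int, idle_nodes: int) -> dict:
    """Decide how many GPU nodes to provision for pending pods
    and how many idle ones to reclaim. Assumes one GPU per pod."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    # Ceiling division: enough nodes to fit all pending GPU pods.
    needed = -(-pending_gpu_pods // gpus_per_node)
    return {
        "provision": max(needed - idle_nodes, 0),  # add only what reuse can't cover
        "terminate": max(idle_nodes - needed, 0),  # reclaim nodes nothing needs
    }
```

With no pending pods, everything idle is reclaimed (the scale-to-zero case); with a burst of pending pods, just enough capacity is added on top of what is already running.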
00:44:58
Speaker
So in the future, if I imagine how I would see the platform, from my personal experience or my personal vision, I would ask why we don't have data, why we don't have other things. But it's more a matter of priority: what really helps our organization, and how we can help the teams build what they need. Gotcha, makes sense. Let's go back to the second half of the AI question: how are you using it on your own, in your personal time?
00:45:24
Speaker
Yeah, I'll tell you, it's super helpful. When it started, I was a little bit skeptical about all of that. But I think there's a lot of benefit, not really in offloading things to it, but in being more productive. I'll give you a simple example: I want to share something, and I want a nice image to go with what I'm doing.
00:45:45
Speaker
Instead of me sitting and designing (that's not my area; I know how to write the content, but I may not know how to create a beautiful image), I would prompt an AI to build an image for me, and then it's done. When I think about it, it's a tool to help me do my job better. One of the things I would ask is:
00:46:15
Speaker
I'm an engineer, but do I remember every single code bit in my head? No. I understand what I'm doing, and I'm going to do it in a way that I think is architecturally and logically correct, all of that kind of stuff. But what if there's a better approach, or something that
00:46:33
Speaker
I didn't see before, something that can go and do all of the searching behind the scenes and get me answers that I still have to apply or look into? So personally, I find a lot of value. There is k8sgpt, the CLI, which is super helpful for debugging things. I didn't get a chance to use it at work, but I'm using it personally to play with, and... yeah, quite a lot.
00:47:02
Speaker
Other things like that. But here's where I would see value.
00:47:06
Speaker
Going for a more structured way: if you ask me what I would like, it's a bot where, as an engineer, I come and say, "hey, I want to build a service." It asks me a couple of questions, and based on my answers it starts to tailor my experience: "now you need it public, now you need this, now you need that," and then gives me a template that I can start working on.
00:47:32
Speaker
That's one thing. Or when one of my servers is not working correctly, I can just pass it an error message, or have some diagnostics collected and aggregated in a way that's more human-readable. I believe this is a good way to go, so I think, in my opinion, we are on the right track as an industry in general.
00:47:55
Speaker
Yeah, we sure are, right? One of our previous guests on the podcast runs a company called RunWhen. They have a local tool called RunWhen Local, and it does what you just said about troubleshooting: they have these AI assistant personas that are specifically trained on how to troubleshoot Postgres or how to troubleshoot Redis.
00:48:14
Speaker
You just point them at your cluster and they'll run a bunch of commands and identify the issue for you. So instead of spending four hours troubleshooting something, you spend 15 minutes figuring out, "OK, this is the issue, how do I solve it?" So check that out; add it to your list of tools, in addition to k8sgpt.
00:48:31
Speaker
Awesome. Okay. So that brings us to the last

Community Engagement and Conclusion

00:48:33
Speaker
question. I know you already said open.nytimes.com, but where can people connect with you and follow your team's journey and the new, innovative things you're working on? How do people get in touch with you if they have more questions? Share any and all links that you might have.
00:48:47
Speaker
Yeah, so open.nytimes.com is where we share all of our blogs and the amazing work that we are doing in the organization. We're also on GitHub at github.com/nytimes, where you can see all of the open source tools that we have. Personally, you can reach out to me on LinkedIn; I can give you a link to my profile so people can reach out,
00:49:09
Speaker
or my email, and I'm happy to help and talk about anything that we have done and presented. And not just to help: I would love for people to reach out and say, "we did our journey differently," so I can learn from them about what worked for them and how we can collaborate with each other. I think that's the power of community at the end of the day.
00:49:34
Speaker
Oh yeah, that's the perfect way to end the episode. Thank you so much, Ahmed, for joining us for this Kubernetes Bites podcast recording, the first of the new year and the first of season four. We'd love to have you back once you include AI/ML as part of your platform, and if you do any talks in the future, I'll make sure that I watch those and reach back out to you. But yeah, thank you so much for joining us today. Of course, thanks for having me, and we'll be back one day. Yeah, for sure.
00:50:03
Speaker
Okay, that was a great episode. Again, thank you very much for joining us. I would like to quickly summarize what we discussed in the episode, or at least a few key takeaways, as Ryan and I usually do. To start with, the first one: treat platform engineering as a product, not a process or a one-time thing. As a product, it has to be iterative, right? So you keep making the platform better with every release,
00:50:32
Speaker
make sure you're serving your users better, or maybe even adding more users to your platform. This is something I was thinking about: Facebook or Meta, Instagram, all of these SaaS companies, even Salesforce for that matter, when they do their earnings reports, they list out their daily active users (DAU) or monthly active users (MAU) to show the adoption of their platform or the usage of their application.
00:50:58
Speaker
I think with platform engineering, that's something we should be doing. If you are responsible for building a platform inside your organization, make sure that you are measuring adoption somehow and making your product better, so that you're getting more and more users onboarded.
00:51:13
Speaker
Spotify, as everybody knows, were the ones behind the Backstage project, and internally they use Backstage as well. Last year I was surprised to learn that they have 96% utilization of Backstage across their engineering team: 96% of Spotify developers actually use Backstage as their IDP to perform tasks, build their pipelines, and release code to production.
00:51:37
Speaker
So if you are going on this platform engineering journey, make sure that you are treating your platform, or IDP, as a product instead of just a feature or a one-time deliverable. The next key takeaway I had from the conversation with Ahmed was documentation. Documentation is key, as he said.
00:51:53
Speaker
Even if you sometimes don't have automation for the entire pipeline or process, having documentation to fill those gaps, maybe having those shell commands written down in an internal wiki or Slack or on a Confluence page somewhere, will help users. The thought process behind it: if I'm doing something over and over again, it might make sense to me,
00:52:15
Speaker
but my teammate, or a user in the same organization who's three teams away from me, might not understand or know the entire process like I do. So documenting it is definitely a better solution to make sure that your users have an easier time adopting the platform. And then finally, feedback. As he said, gathering periodic feedback is
00:52:39
Speaker
absolutely essential. It helps you make sure that your features are built for the right audience, and if not, you can tweak them. You should also use this feedback to define what your roadmap looks like. Again, a roadmap, because you're thinking about your internal developer platform as a product.
00:52:55
Speaker
Think about gathering feedback: maybe have a process in place with a SurveyMonkey page, or quarterly meetings. I know actual product teams do customer technical advisory boards every quarter or every six months to share the roadmap, gather feedback, and talk about issues their users might be having. Maybe think about implementing some of that inside your own organization to improve your IDPs.
00:53:23
Speaker
But yeah, I think those were all the things that caught my eye. If I missed something that you feel was a key takeaway, hit me up on Slack, or on LinkedIn or Twitter, on my account or the Kubernetes Bites account. With that, I think that brings us to the end of today's episode. Please join our Slack channel; to find the link to join our Slack group, just go to our website, kubernetesbytes.com.
00:53:48
Speaker
Subscribe to our YouTube channel, and please, please, please leave us good reviews, five-star reviews, or at least some reviews, wherever you listen to your podcasts, like Apple Podcasts, Spotify, or Google Podcasts. It helps with the algorithms and makes sure we reach a bigger audience. We are in 2024, and one of my personal
00:54:11
Speaker
growth resolutions or growth goals is to make sure we increase the audience from what it is today and take it to the next level, so anything you can do to help us would be great. Ryan will be back in future episodes and we'll continue doing this podcast. So without further ado, that brings us to the end of today's episode. I'm Bhavin, and thanks for joining another episode of the Kubernetes Bites podcast.
00:54:39
Speaker
Thank you for listening to the Kubernetes Bites Podcast.