Episode 3: Building Resilient Systems: SRE Best Practices and Insights

Observability Talk
104 Plays · 5 months ago

In this episode of Observability Talk, we look at observability from the perspective of building resilient systems through site reliability engineering. We explore the origin of the role and move on to the best practices, the divergent nature of reliability across organizations, and the experience of being in the trenches of problem-solving.

We are joined by Safeer CM, a seasoned technology leader specializing in site reliability, DevOps, and platform engineering. Safeer has worked for organizations like LinkedIn and Flipkart and has a solid background in cloud and infrastructure management working with large-scale internet companies and startups. He has also authored a book titled “Architecting Cloud-Native Serverless Solutions”.

Transcript

Introduction to Safeer and SRE Experience

00:00:12
Speaker
Welcome to a new episode of Observability Talk. One of our objectives for this podcast is to look at observability through the lenses of various practitioners in the space. Today, we are glad to introduce you to Safeer, who is a seasoned technology leader specializing in site reliability, DevOps, and platform engineering. With over 18 years of experience, he has a solid background in cloud and infrastructure management, working with both large-scale enterprises and startups. Safeer is also the author of a book titled Architecting Cloud-Native Serverless Solutions, and has spoken at various conferences on the topics of observability, DevOps, and site reliability engineering. A warm welcome to you, Safeer.

Defining SRE and Its Evolution

00:01:01
Speaker
My first question to you: today we see a lot of people working in DevOps, in the L1/L2 monitoring space, doing software engineering, or doing chaos engineering sort of work, and each one of them calls themselves an SRE of sorts. So I want you to define SRE for us, as well as the responsibilities an SRE takes care of.
00:01:29
Speaker
See, the foundational rule here is that titles will vary depending on the company's culture, organization, and historic reasons. Even in my career, I have gone through a bunch of different titles. Only recently did it pick up as SRE, site reliability, as the title. It started as systems and network engineer or network analyst, then systems engineer, and things like that. So it depends on the company, to a point, what the title will be. But what we need to look at foundationally is that a role should be defined by what is primarily expected of that job. And for site reliability engineering, if we look at the history of how it evolved, it came as a response to the challenges of scaling up large infrastructure with the conventional DevOps model, or the sysadmin model.
00:02:26
Speaker
See, I was a sysadmin by role and by function; I was practically a sysadmin at the beginning of my career, so I can speak to this from experience. The fundamental issue with the sysadmin role was that as the infrastructure grew, your sysadmin team also had to be scaled linearly. The reason is that as your infra grows, you need more people to fix your problems, even if they are the same problems, because at that scale you will not be able to

Core SRE Principles and Toil Reduction

00:02:56
Speaker
handle it beyond that point. So that's one fundamental problem. The other problem is that the focus was on solving the current issue. Of course, you always have to focus on solving the current issue. But the way it goes is that you get an alert, you pull up some runbook or rely on your history and experience,
00:03:16
Speaker
and fix that problem. If it is something undocumented, and you follow best practices, you will document it and use it as a runbook for the next time. That is where it was. But if you look at the whole scheme, it was pretty reactionary, and it didn't work at large scale. You cannot scale these teams linearly. It has to be nonlinear; from a practical standpoint, it has to be like that. So that was the first thing. The second thing is the reactive approach itself, which doesn't work. The way Google and most of the others define the SRE approach is to change that reactionary model by applying a software engineering mindset to large-scale and scalability problems.
00:04:09
Speaker
And that is a shift in terms of mindset and approach. Now, this would mean that any operational task you have to perform repeatedly should be automated

Observability and Incident Management in SRE

00:04:20
Speaker
out, to the extent possible. Sometimes it's not fully possible, but for whatever cannot be automated, you should still have a standard SOP which is up to date. That is very basic stuff. In SRE language, this is called toil reduction. The work itself is called toil because you have to keep toiling on fixing the same things over and over again. So toil reduction is essentially automating things away, in a software engineering way. But it's not just about automating routine tasks. It's also about building tools and platforms that improve reliability and make the lives of developers and other engineers easier. So that was one of the first key principles of SRE, and
00:05:04
Speaker
wherever you set up SRE, that is one of the foundational things. Now, the other thing: the reliability of these systems is essentially dependent on observability. That's a non-negotiable thing. And that is one of the key responsibilities of SREs. It not only involves setting up your observability infrastructure; of course, that is one of the baseline responsibilities. It is also about looking at what the ecosystem is and what kind of best practices you follow, working with software engineers as a liaison for observability, and ensuring that all critical services have alerting and a mechanism to react to it. Now, another one of the responsibilities is incident management, which is very critical. And,
00:05:55
Speaker
some very big companies have dedicated incident teams, but in most cases, SRE is,

Holistic SRE Approach in Digital Transformation

00:06:01
Speaker
even in those cases, including those cases, SRE is the one responsible for defining what the process should look like, what automation goes into it, what roles should be there, and what the end result should be, going up to the post-mortem level. So that is the idea. And of course, change and release management also rolls up to this responsibility, up to a point. The other thing, of course, is the reliability of large-scale systems, along with the responsibilities that come with it. Here again, you have to work with devs as partners, because at the end of the day, the folks who build those systems are your first champions of reliability. So you start to work with them to eliminate single points of failure and improve your
00:06:52
Speaker
DR strategies, all the disaster recovery and business continuity processes, things like that. And last but not least, when we talk about SREs, we talk about SLIs and SLOs. Those are kind of like the KPIs for SREs. So that, of course, is one of the goals that has to be derived from what SREs do. But as long as you fit into most of these categories, you are an SRE, so to speak.
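
The SLIs and SLOs mentioned at the end of this answer can be made concrete with a small sketch. It assumes a request-based availability SLI (good requests divided by total requests) and a 99.9% SLO target; the function names, the target, and the request counts are illustrative assumptions and were not discussed in the episode.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Service level indicator: the fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent for the measured window.

    The error budget is the share of requests the SLO allows to fail,
    i.e. (1 - slo_target) * total_requests. A negative result means the
    SLO has been violated for this window.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: one million requests, 500 of them failed.
    good, total = 999_500, 1_000_000
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"Budget left: {error_budget_remaining(good, total, 0.999):.1%}")
```

A team would typically slow down risky changes as the remaining budget approaches zero and speed up feature work while plenty of budget is left, which is one way to balance change against reliability.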
00:07:23
Speaker
Very, very true. So if I say that an SRE is actually more of a software engineer than a sysadmin, would that be a fair statement?

It is, because you have to apply that engineering mindset. You are more than just an operator. You are thinking about the problem, what a systemic solution to it could be, and then taking it forward. I could very well find a solution once and keep repeating it, like a monkey hitting on the

SRE's Role in Customer Trust and Service Reliability

00:07:56
Speaker
keyboard. That's possible, but that's not what we want. Somewhere you also need to get it engineered within your application such that you continuously improve your reliability. So that sort of mindset comes more from the software engineering side.

Yeah, understood. See, that's something I tell people over and over again: reliability is a feature of your software product.
00:08:21
Speaker
We may not look at it that way, because it's not apparent until something breaks, of course. But it is an inherent feature, and we need to treat it that way; only then will we get the real benefit out of SRE.

Right. So this is very, very different. I think a lot of people generally assume: I am an SRE if I am doing only observability, I am an SRE if I am doing only DevOps, or I am an SRE if I am doing toil reduction or chaos engineering. But it looks like it is much wider than that, because each one of them is only a subsection of what an SRE is doing.

Yeah. So having that perspective is what is important.
00:09:07
Speaker
See, depending on the company, which team you are in, and how your teams are structured, you might end up doing only some part of it. But what you need to understand is that you shouldn't say, "Look, this is my part; that is not my part." That is not the way it is. It's a holistic approach. You have to be ready for everything.

Cool. Now that we have defined SRE: today's enterprises are digitally transforming themselves and becoming more digital-first. Almost every vertical today is, in some way or the other, in some format and at some scale, digitally transforming itself and trying to serve its customers digitally.
00:09:46
Speaker
In such an environment, do you think the SRE practice becomes very critical for such enterprises? And if not, what would be the alternative?

See, as you rightly pointed out, right now

Leadership Buy-In and SRE Implementation Approaches

00:10:03
Speaker
every company is transforming and becoming kind of a digital company. There is also this saying that every company is a data company now, because customer data and business data are what they operate on, along with the digital infrastructure. See, the success of any enterprise is primarily dependent on the trust of its customers. Customers essentially look at what value they can derive from it, and companies look at what value they are providing. If you deliver the value, they will honor you, appreciate you, and stay with you. And that is how you grow.
00:10:42
Speaker
Now, the problem is when you do the digital transformation but your services are not delivered in a reliable or available way. Then the customers are going to lose trust in you. And that is why SRE is very critical, because reliability is something you have to deliver. It's a feature, an inherent feature of your product. You treat it that way, and that is why it is so important. See, now or later, enterprises will have to embrace reliability as a practice. That is not to say that enterprises don't do this now. They all have their processes and procedures. But the problem is that once you embrace digital transformation, the pace at which you do things will change, because in a digital economy you have to move faster: faster time to market, delivering things at the earliest. When that happens, there are more changes, of course, to your software and infrastructure.
00:11:42
Speaker
Changes are the primary root cause of any sort of reliability or scalability issue. And you cannot avoid changes, because that is how you deliver the value. So you have to balance it out, and you need reliability engineering to do this. And to be honest, there is no alternative to

Setting Up SRE Practices

00:12:01
Speaker
it. But when we say there's no alternative to it, site reliability, or reliability, doesn't mean that you have to have an SRE team to do that. That's not the point. Let's say you have an existing engineering team for your digital transformation. They themselves can do it. Again, it's about the principles and the practices that you follow rather than your role or your title. It can be done. So an enterprise has the choice to decide who will implement reliability, but it doesn't have the choice to say, "Okay, I can opt out of reliability." That is not going to happen.
00:12:39
Speaker
So just to extend this, do you also see that the way application architecture has evolved over time has created a higher need for these reliability engineers? Because earlier, as you said, it was simple. It could still be monitored in whichever way, maybe a little reactively, but it could still be monitored. But with the infrastructure complexity, the evolving application architecture, and the scale we are dealing with today, do you see that the combination of these has made this need more pressing?

Yeah, absolutely. See, you're spot on there, because
00:13:27
Speaker
it's no longer like that. See, when we started learning computer science, you had a simple application: front end, database, and so on and so forth. That is not the case anymore. We have distributed systems. Earlier, when you referred to distributed systems, it was data systems like Hadoop and those kinds of things, which had some level of complexity. Now, even your microservice architecture is quite complex and distributed, probably more than your data systems. So that complexity definitely brings about problems of reliability. And as long as you have too many test files, too many protocols to talk to, you don't know where exactly things are breaking. So it's ever more important.
00:14:12
Speaker
Now that we have talked about SRE and sort of created a business case that digitally transforming enterprises should be using SRE in some form or fashion: you have been setting up SRE practices in your past organizations. Can you provide a brief view into what really goes into setting up an SRE practice in an enterprise?

Yeah. Before we talk about the practice itself, a prerequisite

Baseline Observability and Automation

00:14:45
Speaker
to implementing SRE anywhere, the primary thing, is buy-in from the leadership, from the top leadership. And that is not enough. They also have to percolate that message to the lower levels of management. It has to go down the chain, essentially, so that everyone really understands this is a priority for us.
00:15:10
Speaker
The reason for this: site reliability engineering and other initiatives like DevOps and platform engineering all overlap, and I put them all into a category of engineering excellence, so to speak. These are not always looked at as a foundation for developing software. For a lot of companies, it comes as an afterthought, and that happens. That's the way things work. The issue is that there is always going to be a war for prioritization: my feature rollout versus my reliability implementation. What should I prioritize? That can create conflicts, reprioritization, and a slowdown in adopting these sorts of practices. So it has to be communicated that it's a priority for the leadership and the company's direction. That's the primary thing, wherever we go. And if you look at it, there are two directions from which the leadership comes to this.
00:16:12
Speaker
One is reactive, one is proactive. The proactive ones are the leaders who are starting a company or are in the early stages of their company. From their past experience, and of course their whole thought process, they think that reliability has to be baked in early on. They set out to build a team for that, and obviously this environment is easier to implement SRE in. On the other hand, most of the time what happens is that it's a reaction to things that already happened. You already have a sizable infrastructure. You are hit with outages and things like that. So you have reliability issues and scaling issues already, and then, as a reaction, you do it. Nothing wrong with that; again, it depends on the business at the end of the day. But the thing is, it's a little bit more complicated implementing SRE in that kind of environment, and that is where more buy-in is definitely required.
00:17:08
Speaker
Coming to the approach itself: see, you are going to do something new, or comparatively new, to the enterprise. The first thing is to understand the lay of the land. Understand what the top reliability problems they are facing are. And to know that, you have to talk, talk, and talk. Talk to the leaders. Talk to the people on the ground.

Collaboration for Reliable Systems Development

00:17:33
Speaker
If there are customer reports or whatnot, find them and read them. If there is some sort of incident management or even incident reporting, read the reports. Look at what metrics are available from whatever tools exist. That gives you a decent picture. In effect, the way I put it is: treat reliability as a feature or a product that you are going to implement, and these are your user interviews. When you implement a product, you always have to do those initial user interviews to find out what is required, what is missing, and how you should
00:18:08
Speaker
drive this whole effort. So think of the feedback that comes from those users as features to be implemented within a service. Of course, we have base principles, and we know what to do, but they should also be tailored to what is required. That is where you start. Once you understand the priorities and what is missing, you will have to implement practices and set up systems for that. Depending on what you learned from the previous step, you might either have to build something greenfield or augment the existing systems. Whichever it is, you will have to do it. And going back to the SRE responsibilities, I would say observability and incident management would be your top priorities here. Again, priorities would come somewhat from whatever you learned in those interviews. But if you look at it from an absolute perspective, and this will usually align with the feedback as well, you need to have observability
00:19:06
Speaker
to understand your systems. Second, ultimately, what is important is customers. Customers are impacted by outages, which means you need to have an incident management process to sort that out. And incident management is complete only if you have the backing of proper observability. So that becomes kind of the baseline. At some point, change management also comes under this portfolio. The next priority would be to look at what the sources of the highest levels of toil are in your organization, across all teams. A lot of the time, the root cause of the toil itself would be systemic problems. But nobody is going to listen to you about changing systemic problems, because it's a longer-term investment. So they won't be interested in that until you give them some breathing time.
00:20:02
Speaker
For that, you have to automate away some of the top-priority problems. That's where you have to start. Once you do that, and you strike the right balance, where what you did on the automation side saved some time for the developers, you then partner with them to solve the systemic problems as well. Then your toil reduction and your systemic solutions can go hand in hand. But at the outset, you have to start with solving some of their burning problems. And once you get these two things sorted, of course, you will certainly develop tools and platforms and everything for this.

Human Element in SRE Practices

00:20:45
Speaker
But beyond this point, you will be reasonably stable, I would say. Again, stability is relative, but still much better than where you started.
00:20:55
Speaker
Then, of course, you go to the more, what do you say, mature part: working with the developers to build reliable systems, improve the reliability of services and systems, your DR strategies, and things like that. Chaos testing and whatnot, everything that is required for a much more reliable and resilient infrastructure. That's what you go about. I would say that covers mostly the technical aspects. This is not a completely strict order, but this is where you start.

One thing that's almost as important as all these points, and should run in parallel with these efforts from day one, is your people. This is very, very critical; I can't stress this enough. See, when you go into an enterprise and start to build reliability, you already have an infra, you already have software systems, which means there are also people and teams.
00:21:53
Speaker
One or more teams, in different shapes and forms, are doing parts of reliability or infra management and things like that.

Cost of Reliability and Business Alignment

00:22:00
Speaker
With these teams, depending on how your management structure and your buy-in work, you could absorb them and mold them into your reliability team. If that is not possible, you should find a good synergy with them and work with them as partners. See, again, this is a bit of people management and political management, to be honest. If you don't get it right, the other initiatives may not go as smoothly as they could. So invest in the people. And of course, hire people, train people, and build the team; it's a continuous endeavor that will have to go on and on. But it's very important to sort out the people side.

Right. So whenever you are implementing these SRE practices,
00:22:48
Speaker
have you come across situations where you had to balance out the cost part of it? Because obviously reliability comes at a cost. If it was not thought through as a feature when the application was being built, as you mentioned a couple of times, and it is an afterthought here, then obviously it becomes like: I am already at, let's say, 95% reliability, and increasing it to 99% is much more costly than the revenue you will get from that extra 4%. So what sort of cost considerations come into setting up this practice?

Right. You see, the cost will come in two ways. One is investing in people; the other is investing in systems. On the systems side, of course, we need to look at
00:23:43
Speaker
what expectation the customers have of us, or what goal we have set for ourselves. Let's say, do you want 99.99% reliability, or 99.9, or double nine, or what? Based on that, and it is sometimes complex depending on the nature of the business, you will have to set up certain things. There is no way out of it. That is part of the business cost, and it will have to be there. But if you look at some of the reliability things, you could make choices about the tools and vendors that you use, or you can also space out your initiatives into different quarters to manage the budget. But there are certain basic things that you cannot compromise on, like incident management and some part of observability. Again, observability goes in layers. It's not done in one shot.
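
To see why each extra "nine" matters for cost, it helps to convert an availability target into the downtime it allows. The sketch below does that arithmetic for a 365-day year; the function name and the choice of targets are illustrative assumptions, not figures quoted in the episode.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget, in minutes, implied by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * period_minutes


if __name__ == "__main__":
    # Compare the targets mentioned in the conversation.
    for target in (95.0, 99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:,.1f} minutes/year ({minutes / 60:,.1f} hours)")
```

Over a year, 99% allows roughly 3.65 days of downtime, 99.9% about 8.8 hours, and 99.99% under an hour. Each additional nine shrinks the budget tenfold, which is why the systems and people needed to meet it get disproportionately more expensive.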
00:24:35
Speaker
So in these kinds of solutions, you have to consider multiple things. One is build versus buy. Another is what kind of solutions you use and when you start using them. Again, it all ties up to what your priorities are at the moment. For example, you could wait for tracing to happen until next year, but metrics are probably non-negotiable. The same goes for how long you persist your logs: if you have a complex system, you probably want to persist them for six months, whereas even six days would work otherwise. So whatever the system is, you have to make those kinds of choices depending on the cost. Cost is very, very important.

Gaps in Observability and Actionable Alerts

00:25:15
Speaker
And that is why this entire stream of FinOps has come up as an initiative. Everybody looks at cloud costing and how to manage cost. It's very important.
00:25:27
Speaker
Right, right. Safeer, whenever you have set up these SRE practices in different places, you talked about trying to create a baseline sort of thing. I'm very interested to know about the gaps you find when you're setting up this baseline. Where do you find these gaps most often? I remember talking to customers, and they generally say they already have some sort of monitoring in place. Like you said, it is mostly reactive.
00:26:00
Speaker
But when you start doing some baseline checks, in which area or component do you find most of the gaps? Or what are the low-hanging fruits which you believe should be tackled very quickly, which will improve reliability in much less time?

Got it. See, you find gaps in a lot of things, not just observability: reliability-focused things, the processes themselves, some of which could be updated, runbooks that could be updated. So you find gaps in almost everything. But if we focus on the observability side, the way you have to look at it is that observability, just like reliability, is a journey. It evolves over time. And if you look at it, different organizations have come up with different observability maturity models.
00:26:59
Speaker
AWS has one, Grafana Labs has one, and so do a bunch of companies. Just Google for "observability maturity model" and you will see a whole bunch of them. But the thing is, in most of these models, the first thing is always to establish an observability baseline. To do that, you have to first take stock of what exists: what kind of solutions you have and what they are providing you in terms of data points. And the first thing I would say to establish a baseline is to implement component-level observability for your software systems. Because everything else is going to be an aggregate of your components at some level and in some form. So that is going to be your primary baseline. And if you look at it further, within component-level observability itself,
00:27:52
Speaker
as we all know, we all talk about the three pillars, right? Metrics, logs, and traces. And this is my personal view,

Evolving Observability Models

00:28:02
Speaker
but I think logically it makes sense that metrics should be the first priority, logs come next, and traces are going to be the last one. The reason for that is that you get the nature of your systems by looking at the metrics. Logs are not something you can make sense of immediately. The other part is that, in terms of application development, logging is something that comes naturally to developers. They will log things, and it's guaranteed that in some form or shape you will have application logs, or at least some logging. It's not something alien or external. So that's why logs can wait. But metrics are not a natural thing to implement, because there are different standards, and different companies have different ways of implementing them at different times. So it's not that easy. That's why your first priority should be metrics.
00:28:52
Speaker
And it makes sense from a usefulness point of view as well as cost: logs are costlier, and traces are even costlier, in terms of engineering as well as systems. So once you say, okay, I will start with metrics, what SREs usually recommend is what we call the golden signals. These typically consist of latency; traffic, meaning your overall traffic; errors, meaning what kind of error ratios you are getting; and saturation, meaning your capacity for a given system. You will get a lot of metrics from any system. Out of those, you find what aligns with these four categories and pick a few metrics. If you overdo the metrics, you are just going to drown in alerts and things like that. So you have to have the right signals in the right category, but mainly whatever gives you the first
00:29:51
Speaker
The principle is that if you're getting an alert, it should be an actionable alert. If that is not the case, there is no point in setting an alert and looking at a threshold or a metric at that point in time. So it should be the best. And that will give you a baseline. That's my theory of doing things. But just to add to that, it's also equally important to have some sort of high-level metrics at a business level.
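As a rough illustration of the two ideas above (pick a few metrics that map to the four golden signals, and keep only alerts that are actionable), here is a minimal Python sketch. The rule names, thresholds, and runbook paths are hypothetical and do not come from the conversation; treating "has a runbook" as the test for "actionable" is also an assumption made for the example.

```python
from dataclasses import dataclass

# The four golden signals: latency, traffic, errors, saturation.
GOLDEN_SIGNALS = ("latency", "traffic", "errors", "saturation")


@dataclass
class AlertRule:
    name: str
    signal: str        # which category the underlying metric belongs to
    threshold: float   # the alert fires when the sample exceeds this value
    runbook: str = ""  # what on-call should do; empty means no action is defined


def actionable_rules(rules):
    """Keep only rules that map to a golden signal and have a defined action."""
    return [r for r in rules if r.signal in GOLDEN_SIGNALS and r.runbook]


def firing(rules, samples):
    """Return the names of rules whose current sample exceeds the threshold."""
    return [r.name for r in rules if samples.get(r.name, 0.0) > r.threshold]


# Hypothetical rules for illustration only.
rules = [
    AlertRule("checkout_p99_latency_ms", "latency", 800, "runbooks/checkout-latency"),
    AlertRule("checkout_error_ratio", "errors", 0.02, "runbooks/checkout-errors"),
    AlertRule("db_cpu_utilisation", "saturation", 0.9, "runbooks/db-saturation"),
    AlertRule("gc_pause_count", "runtime", 100, "runbooks/gc"),  # not a golden signal
    AlertRule("requests_per_second", "traffic", 5000),           # no runbook defined
]

kept = actionable_rules(rules)
samples = {
    "checkout_p99_latency_ms": 950,
    "checkout_error_ratio": 0.01,
    "db_cpu_utilisation": 0.95,
}
print([r.name for r in kept])
print(firing(kept, samples))
```

Dropping a rule here does not mean dropping the metric; it can still be collected and shown on a dashboard or report without paging anyone.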

Preparing for Major Sales Events

00:30:12
Speaker
This involves looking at your traffic numbers. That's a very basic thing, right? You can do this anywhere. It comes from your load balancers, proxies, and frontend services. If you go a little deeper, there are basic business metrics like DAU, daily active users, and MAU, monthly active users. And you can also look at economic patterns. So this in itself will give you some sense of a baseline. Because while I said, and I stress, that component-level observability is the baseline, these are also part of your baseline. It starts to be,
00:30:43
Speaker
That is when you can easily detect a problem. See, you might have a 7-0 machine in your database, but does that imply that your user is not able to log in or do their primary activities? To know that, you need micro as well as macro level observability, to a point. You can get that, basically, and once you get the baseline, you then climb the observability maturity ladder. Then you get real user monitoring, synthetic monitoring, tracing, and a whole lot of whatever you want to implement at the micro and macro levels. So that is how you evolve, but basically it's always like.
00:31:26
Speaker
So, I mean, this is something we were also looking at: creating a sort of observability maturity model for our customers. And I think what you rightly said is that you start with the component part, then go to the logs, and then go to the application. And that will give much deeper visibility once you start building one on top of the other. In fact, at VuNet, as we were discussing, we have innovated on top of this with something called business journey observability. So we bring in the business part on top of it. Yeah, it was interesting. Yeah. So I'll just add to that. Normally you have that kind of model, right? Another way a lot of SREs look at this domain is what we call critical workflows, like, you know,
00:32:20
Speaker
every business, every service has a set of critical journeys. You will have a bunch of user journeys through your system. Out of those, there are critical ones. And monitoring them, seeing if they are succeeding, and working out the observability angle to that is very important.
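One way to picture the critical-workflow idea is a success rate per user journey, with only the critical journeys compared against a target. The sketch below is an assumption-laden illustration: the journey names, the sample results (for example, from synthetic checks), and the 99% target are all hypothetical.

```python
# Hypothetical journeys; each result is True if one attempt at the journey succeeded.
journeys = {
    "login":    {"critical": True,  "results": [True, True, True, False]},
    "checkout": {"critical": True,  "results": [True, True, True, True]},
    "wishlist": {"critical": False, "results": [False, False, True, False]},
}


def success_rate(results):
    """Fraction of attempts that succeeded."""
    return sum(results) / len(results)


def failing_critical(journeys, target=0.99):
    """Names of critical journeys whose success rate is below the target."""
    return [
        name
        for name, journey in journeys.items()
        if journey["critical"] and success_rate(journey["results"]) < target
    ]


print(failing_critical(journeys))
```

The non-critical journey is failing most of the time in this sample, but it is deliberately left out of the result so that attention goes to the journeys the business depends on.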
00:32:39
Speaker
Safir, you have been working at various e-commerce companies and, earlier, other digital companies,

Infrastructure Prep for E-Commerce Events

00:32:45
Speaker
right? And some of the e-commerce companies are known for their annual sales events. I have been told by multiple people that the whole process leading up to that event, from preparing the infrastructure to preparing the systems, is almost a nine-month affair. Can you talk a little bit about how these large enterprises go about preparing, or what process they follow to prepare for such events? The most important thing about this kind of event is planning. You cannot do it day to day. It has to be planned out thoroughly. Typically, it takes at least a quarter. Again, it depends on what all things you start with.
00:33:34
Speaker
But the minimum is one quarter. The rest depends. When you start something like this, you need some sort of input on what is going to happen at that sale event. And that typically comes from the business. They would say, look, this year we are going to do this, we are going to roll out this feature, or, rather than a feature, we want this particular kind of experience. And then, based on that, we expect to have so much traffic from so many customers. They talk in customer numbers and things like that: how much they can sell, what kind of offers they want to roll out, what kind of promotions they want. So once that comes out,
00:34:22
Speaker
of course, you have to build it all. You might already have some of those features, but there will also be a whole bunch of features that have to be rolled out, because changes have to happen so that we can engage the customers. Once that happens, all these requirements translate to an increasing number of requests. Let's say you add an additional widget to your page; that is going to make its own request to the backend. So it adds up. And it's not just about the one service that receives the request; it will fan out. That service will call, say, four other services. That's basic in a microservice architecture. So it will fan out, and it could fan out at a linear rate or at an exponential rate. It depends on the nature of the services out there. So that has to happen, which means,
00:35:21
Speaker
The respective engineers are responsible for implementing these changes. They have to come up with the number of requests they need to support from the outside, and also the number of requests they will send out to the other services. Of course, this will require capacity at their end. It will also require capacity for the dependent, or downstream, services. So everybody has to do these calculations, work on these numbers, and get their capacities worked out with the capacity allocation teams or the core infrastructure systems teams. Now, once they get it, the capacity is allocated and the features will be rolled out. But this is all based on somewhat theoretical numbers, right? Of course it has some method to it. It has past experience and a foundation behind it.
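The fan-out calculation described here can be sketched in a few lines of Python. Everything concrete below is an assumption for illustration: the service names, the calls-per-request figures, the 1,000 requests per second at the entry point, the 500 requests per second one instance can serve, and the 30% headroom. The sketch also assumes the call graph has no cycles.

```python
import math

# Hypothetical call graph: downstream calls made per incoming request.
CALLS_PER_REQUEST = {
    "home_page": {"search": 1, "recommendations": 2, "cart": 1},
    "search": {"catalog": 3},
    "recommendations": {"catalog": 1},
    "cart": {},
    "catalog": {},
}


def expected_rps(entry_service, entry_rps):
    """Propagate the expected request rate from the entry point down the call graph."""
    load = {name: 0 for name in CALLS_PER_REQUEST}

    def visit(service, rps):
        load[service] += rps
        for downstream, calls in CALLS_PER_REQUEST[service].items():
            visit(downstream, rps * calls)

    visit(entry_service, entry_rps)
    return load


def instances_needed(load, rps_per_instance, headroom_pct=30):
    """Instances per service, rounding up and adding a percentage of headroom."""
    return {
        service: math.ceil(rps * (100 + headroom_pct) / (100 * rps_per_instance))
        for service, rps in load.items()
    }


load = expected_rps("home_page", 1000)
instances = instances_needed(load, rps_per_instance=500)
print(load)
print(instances)
```

In this made-up graph the catalog service ends up with five times the entry traffic even though no user calls it directly, which is the kind of downstream effect each team has to surface to the capacity allocation team.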
00:36:16
Speaker
But still, it is on paper, right? Now it needs to be tested out in reality, in the real world. So the best thing you can do before the event to figure this out is a stress test, or load test. That load test has to happen at the overall level; let's say it's an e-commerce site, so at the level of search or whatever features you want to test out. You might also want to test individual services, the core services, to see how they are holding up and things like that. And typically that happens well before the event itself. And one thing that, I wouldn't say gets forgotten, but doesn't get that much visibility in all this is the impact this has on your infra core services.
00:37:09
Speaker
Because they also have to scale at the end of the day. Of course, they allocate the capacity. That is one thing, but there are also their control systems, which might be taking additional traffic. And most importantly, look at observability. An application team would say, look, I need a thousand more cores or a thousand more pods. But every one of those thousand pods is going to send a ton of metrics to the observability backend. So you have to increase the capacity of your backend control systems as well. You have to increase your write path. You also have to improve your read path. It's very important. This is something that a lot of observability teams miss, because we accept the request and then assume that the pattern of usage for us will be the same as before. That doesn't happen. Let's say, on the event day, every senior engineer is going to open five dashboards and keep hitting refresh. And every single one of those goes and fetches a whole bunch of data.
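A back-of-the-envelope version of the write-path and read-path point might look like the sketch below. The thousand extra pods and the five dashboards per engineer come from the example in the conversation; the series per pod, scrape interval, number of engineers, panels per dashboard, and refresh interval are assumed values chosen only to make the arithmetic concrete.

```python
def write_samples_per_second(pods, series_per_pod, scrape_interval_s):
    """Extra samples per second arriving on the observability write path."""
    return pods * series_per_pod / scrape_interval_s


def read_queries_per_second(engineers, dashboards_each, panels_per_dashboard, refresh_interval_s):
    """Extra queries per second on the read path from people refreshing dashboards."""
    return engineers * dashboards_each * panels_per_dashboard / refresh_interval_s


# 1,000 new pods; 500 series per pod and a 15 s scrape interval are assumptions.
extra_write = write_samples_per_second(pods=1000, series_per_pod=500, scrape_interval_s=15)

# Five dashboards each; 200 engineers, 12 panels, and a 10 s refresh are assumptions.
extra_read = read_queries_per_second(
    engineers=200, dashboards_each=5, panels_per_dashboard=12, refresh_interval_s=10
)

print(round(extra_write), "additional samples/s on the write path")
print(round(extra_read), "additional queries/s on the read path")
```

Even with modest assumptions, both numbers are additional load on top of normal usage, which is why the observability backend needs its own capacity request alongside the application teams.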
00:38:06
Speaker
So you have to put controls in place. You have to make sure that your area is scaled up. So all those side impacts will come in there. And in addition to all this, people will also make structural changes to their infrastructure by implementing new strategies, new software, learning, chaos engineering. And there are things like graceful degradation: if you anticipate that some problem will eventually happen, what is the way to give reduced availability to users? You are still available, but you reduce the user experience a little while still serving the primary purpose. So those kinds of things have to be worked out. So I think that pretty much gives a.
00:38:49
Speaker
I mean, in just five minutes you have explained the kind of work you would have done before this annual sale event, and it is just amazing. I'm thinking about how many things you need to think through, and not only from the development side. Like you said, the control systems, the observability systems: everywhere you need to take care of the TPS and EPS and scale them. And the last part you talked about was very interesting. Everybody has opened a dashboard and is continuously checking whether their service is running or not. That would put so much load on the observability system itself. Observability is taken for granted. That is the assumption. And nobody thinks about the scaling of the system itself.

Generative AI in SRE and Observability

00:39:41
Speaker
So it's for the reliability of observability teams to take care of it.
00:39:45
Speaker
Got it. Yeah, people generally assume that somebody is there, somebody is monitoring, and they will take care of figuring out when things are going wrong and then let us know. It's very interesting. So, Safir, coming to something that has been happening for all of us for the last couple of years: generative AI. Where do you think, in this whole mix of so much complexity and so many different things an SRE goes through, generative AI as a technology can actually help the SRE team, as well as observability and reliability in general?
00:40:31
Speaker
See, we cannot anticipate what it can do in the future. But at least from what we have seen so far and what we can assume or imagine, I see it happening in a few places. The first one, obviously, is programming. It has been doing a lot of things there. My only point there, from an SRE or any infrastructure perspective, is that,
00:41:03
Speaker
more than helping in writing code, generative AI is valuable in writing test cases, testing, and those other things, because those are the things that people leave out. If you could get a framework or boilerplate for those kinds of things, it goes a long way. So programming is one thing. The other thing, obviously, is incident management. IM is always a game of context. You use past experience to solve certain problems. I see it helping by providing the necessary context, stitching information from the various systems together with that context. And of course, I think, even going to the length of writing the postmortem reports. With all that context, you can do that. That's one thing. The other thing that typically happens with infra teams, or really everybody, right?
00:41:57
Speaker
It's supporting fellow developers and teams: their queries, their requests, things like that. A lot of effort usually goes into that. Now, if you have some sort of internal AI that can work like an internal support team, which can answer common queries, suggest basic solutions, and talk about what services are available in the infra and things like that, that's one thing it can do.
00:42:29
Speaker
There is a lot happening. You see, some of it will be alert correlation: how do you correlate a lot of alerts that happen? Again, context and risk management. I would say it is context building in general, analytics for observability, things like that. And again, you can go on and on with a lot of problems where this can be applied, because everywhere there is some sort of applicability. But the way I see it, for SRE capabilities or any internal capabilities, [unclear]
00:43:01
Speaker
that's what is more important, because all of this information is specific to your enterprise. But yeah, again, that's what we are seeing and going through now. Yeah, very, very true. Safir, thank you so much for talking about this. In fact, we also felt the way you described: when an incident happens, the next step for most people is to figure out how to solve it. And this is where generative AI may actually be very helpful, because it has so much information. I mean, if it is trained on internet data,
00:43:47
Speaker
somebody would definitely have seen that problem earlier, and some resolution steps might be available somewhere. And if you ask it that question and it is able to list, okay, these are the five steps you need to do to resolve this, that will definitely do wonders for people. Resolving the problem quickly will become very easy, very simple. Yeah, that is true. See, look at incidents: they are time critical. And that is not a point where you should have to employ your full cognitive ability and go into cognitive overload.
00:44:32
Speaker
That's why we have runbooks, people, and process, so that you can operate those kinds of things almost mindlessly. Now you add generative AI into the mix, and it's going to get much better. As long as we build it reliably, I would say that's going to add a lot of value. Right. I think what you said is very right. And this brings me to one of my favorite questions, which I generally ask people who have always been very hands-on in managing infrastructure. The most frequent way of resolving a problem is still a restart. I mean, restart the service, restart the process, restart the system, and so

Humor in Problem-Solving and Data Collection

00:45:13
Speaker
on. Yeah, the fundamental question is, have you tried turning it off and on again? That's the legendary question that goes around. See, I have done this, at least at the start of my career, more than I would like to admit. And it's a remedy, so to speak.
00:45:29
Speaker
So the thing is, you run into a point where you have typical problems, you run into a wall, and the only other thing you can do is reset the whole thing, which is a restart of a server or an application. It works to a point. The issue is, what do you do after that? Let's say the trick worked. How do you know what actually went wrong? How do you prevent it from happening again? For that, you need the context, some of which you would get from the observability systems. But if you are going to knock something down, before that you can at least collect as much information as possible. You can pull a profile of, say, a Java application or whatever application it is, or get a trace, or pull logs, whatever.
00:46:24
Speaker
So in the Kubernetes world, if you don't have centralized logging, a lot of your logs will be lost the moment you restart the whole thing. So those things can happen. So try to pull out as much as possible. Make sure that you have run out of every other option before doing this. And even if you do it, you have to do it as a rolling restart, so that everything is not broken for your end user.
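The advice above (collect what you can first, then restart in a rolling fashion) can be sketched as follows. This is a simulation, not a real orchestration script: `collect_diagnostics`, `restart`, and `is_healthy` are hypothetical placeholders standing in for whatever pulls logs or profiles, restarts an instance, and checks health in a given environment.

```python
def collect_diagnostics(instance):
    """Placeholder: in practice this might pull logs, a profile, or a trace."""
    return {"instance": instance, "logs": f"logs-from-{instance}"}


def restart(instance):
    """Placeholder for the real restart action."""
    return f"restarted {instance}"


def rolling_restart(instances, batch_size=1, is_healthy=lambda instance: True):
    """Restart in small batches, saving diagnostics before each batch goes down."""
    saved, events = [], []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        # Collect first: once the process restarts, local logs may be gone.
        saved.extend(collect_diagnostics(instance) for instance in batch)
        for instance in batch:
            events.append(restart(instance))
        # Stop before touching the next batch if this one did not come back.
        if not all(is_healthy(instance) for instance in batch):
            events.append("halted: batch unhealthy")
            break
    return saved, events


saved, events = rolling_restart(["web-1", "web-2", "web-3", "web-4"], batch_size=2)
print(len(saved), "diagnostic bundles saved")
print(events)
```

Keeping the batch size small relative to the fleet means the remaining instances keep serving users while each batch restarts.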

Industry Observability Practices Comparison

00:46:45
Speaker
Or for your downstream. Just to stress that again, that's your last resort, essentially.
00:46:54
Speaker
Right, right. I mean, I have used this as the first and foremost resort most of the time, when I don't know what is really happening. And most of the time the system has come back and worked. I have not been an SRE, so I am free to do that. But yeah, I understand that SREs have to be very careful about restarting a system without collecting whatever information they can. It is very, very right.
00:47:25
Speaker
Just one other thing before I close. I wanted to understand this: you have worked at different social media platforms like LinkedIn, Flipkart, and so on. How do you see observability as a platform? Because we are very closely associated with observability, I always wanted to ask how you saw observability differ across each of these past tenures of yours. Interesting question. See, I don't want to really compare and contrast here. But let me cover a few points I noted.
00:48:05
Speaker
A lot of them have sizable infrastructure, and they usually tend to build their own observability systems. At their scale, it makes a lot of sense. As the volume of your observability data goes up, if you have a vendor, the vendor cost also goes up exponentially. That happens. And sometimes it's also possible that your vendor may not be a fit for it, because every SaaS service is basically built for an average mid-size customer. And you are probably at an extreme where it may not hold up, unless it is designed in that fashion. So it can happen. Then the question of build versus buy comes up in front of the leadership. And typically, such enterprises would
00:48:57
Speaker
choose to build. That's what happens. And this I have seen as a pattern. What happens, however, is that a lot of leaders don't realize that, of what you save on the vendor cost, at least part you will have to spend on engineering and infrastructure. Because it's not a free lunch. Observability is a costly business, whether you do it on your own or somebody else does it for you. It's a continuous investment. If you stop doing that, it will stop providing value. You cannot just say that my observability team has set up observability and we are good, so let us now redeploy them on something else, or that we don't have funding for it. It doesn't work. That's something to be very careful about. And in all these organizations, tracing is the last thing to get implemented. Not because it isn't important, but because of the cost associated with it, both in terms of engineering as well as infra. Every application has to get involved in
00:49:55
Speaker
implementing tracing, maybe in some auto-instrumented way. Obviously, there is auto-instrumentation and things like that, but it still doesn't work as planned all the time, especially in high-performance systems. That is always there. Another thing is alert correlation. I think I mentioned it already in another answer. Alert correlation is quite important. I can't stress this enough, because for a long time this has been necessary. You get an alert, get the context, and see what other alerts are firing and what is going down. It's quite important for us. Some of this we build internally; we find some ways of associating things in the alerting systems. But it's not always going to be that easy. Then there is the critical thing about signal-to-noise ratio, SNR as we call it: essentially, how many of the alerts you are getting are actionable. The rule of thumb should be that if an alert is not actionable, don't even have that alert. Alerting is not a place where you log stuff.
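To make the two alerting ideas above concrete, here is a small Python sketch: it computes the share of alerts that are actionable (the signal-to-noise measure as described in the conversation) and groups alerts for the same service that fire close together in time. The alert records, field names, and the 300-second window are all assumptions for illustration; real correlation would likely use richer context than service and time.

```python
# Hypothetical alert records; "t" is seconds since some reference point.
alerts = [
    {"t": 100, "service": "checkout", "name": "HighLatency", "actionable": True},
    {"t": 130, "service": "checkout", "name": "ErrorRatio", "actionable": True},
    {"t": 140, "service": "checkout", "name": "DiskInfo", "actionable": False},
    {"t": 900, "service": "search", "name": "HighLatency", "actionable": True},
    {"t": 905, "service": "search", "name": "CacheInfo", "actionable": False},
]


def signal_to_noise(alerts):
    """Fraction of alerts that are actionable."""
    if not alerts:
        return 0.0
    return sum(1 for alert in alerts if alert["actionable"]) / len(alerts)


def correlate(alerts, window_s=300):
    """Group alerts on the same service that fire within window_s of the group's first alert."""
    groups = []
    latest_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a["t"]):
        group = latest_group.get(alert["service"])
        if group and alert["t"] - group[0]["t"] <= window_s:
            group.append(alert)
        else:
            group = [alert]
            latest_group[alert["service"]] = group
            groups.append(group)
    return groups


print(signal_to_noise(alerts))
print([[alert["name"] for alert in group] for group in correlate(alerts)])
```

The non-actionable entries in this sample are the kind of thing that, per the rule of thumb above, belongs in a report rather than in the alert stream.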
00:50:54
Speaker
That is not the way to do it. If you want to log something, create a daily report, a weekly report, or whatnot. So in all these things, my observation is that you will have to go a little bit of a hybrid way, with multiple solutions. Some would be vendor, some would be open source, some could be something you build. But the thing is, you need to strike a balance between innovation and cost. When you have vendors who specialize in observability, they will bring some innovation. So you have to use that. At the same time, you also need to make sure that there is no cost overrun and the cost is in control. But to do this the right way, you have to be in control of your data, especially in the data collection, where you get the observability data.
00:51:43
Speaker
So let's say you have agents. Look at what kind of standards you use. Try to use open standards for emitting any sort of data. OpenTelemetry is typically the gold standard now. So try to stick to those open standards and open collectors. And most importantly, build a data pipeline, or an observability pipeline, which can carry this data, split it, duplicate it, and redirect it to a number of places. That way you can have vendors on one side and your own solutions on the other side, and pick just what you want to do.
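The pipeline idea (carry the data, split it, duplicate it, redirect it) can be shown with a toy router. This is plain Python with lists standing in for destinations; it is not the OpenTelemetry Collector's configuration or API, and the record shapes and sink names are assumptions made for the example.

```python
class Pipeline:
    """Toy observability pipeline: each record goes to every route whose predicate matches."""

    def __init__(self):
        self.routes = []  # list of (predicate, sink) pairs

    def add_route(self, predicate, sink):
        self.routes.append((predicate, sink))

    def emit(self, record):
        """Deliver the record to all matching sinks and return how many received it."""
        delivered = 0
        for predicate, sink in self.routes:
            if predicate(record):
                sink.append(record)
                delivered += 1
        return delivered


# Lists stand in for a vendor backend, an in-house backend, and cheap archive storage.
vendor_sink, in_house_sink, archive_sink = [], [], []

pipeline = Pipeline()
pipeline.add_route(lambda r: r["kind"] == "metric", in_house_sink)  # split by type
pipeline.add_route(lambda r: r["kind"] == "trace", vendor_sink)     # redirect traces to a vendor
pipeline.add_route(lambda r: True, archive_sink)                    # duplicate everything

records = [
    {"kind": "metric", "name": "http_requests_total", "value": 42},
    {"kind": "metric", "name": "cpu_utilisation", "value": 0.7},
    {"kind": "trace", "name": "GET /checkout", "duration_ms": 120},
    {"kind": "log", "message": "payment accepted"},
]
for record in records:
    pipeline.emit(record)

print(len(in_house_sink), len(vendor_sink), len(archive_sink))
```

Because routing lives in one place, moving a data type from a vendor to an in-house system (or the reverse) is a change to one route rather than to every application.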

Build vs. Buy Debate in Observability Solutions

00:52:14
Speaker
So if you do that, I think it will deliver the most value. So, one of the things we talked about, Safir, is something we come across very often with our customers: this build versus buy debate. We have certain customers who felt that it is something they should do themselves, but later realized exactly what you said, that there are just so many different things you need to take care of, such as bringing in that sort of industry knowledge, then the SOPs, then making sure that you take care of those
00:52:56
Speaker
whenever a problem happens, and bring it back into the system, and so on. That is what we see does not really happen. So I think that's very well put: people have to find a good balance in terms of what they really want to achieve, and whether they should build this internally or go with a vendor, an OEM vendor like us. That makes real sense.

Closing Remarks and Gratitude

00:53:23
Speaker
Right. So I think we are towards the end of our time, but I just want to ask you one last question. See, you have written a book.
00:53:34
Speaker
Is there anything else you are currently working on that you would like to share with us, like a new book or some tools or something else? So yeah, I have another book in the early stages of discussion. Last time, when I wrote the book, I went solo. This time I have a good co-author. So we are working on it as a little team, but we have not narrowed down exactly what we will work on. But it's going to be around the cloud native landscape. And of course there are other things like writing blogs, working with some of the internet communities and tech communities, and things like that. And of course, I do a bit of consulting and other things. So all those things are going on. Cool. Cool. Thank you so much, Safir, for your time.
00:54:31
Speaker
It was really good talking to you about observability and SRE, and demystifying some of what people generally feel about SRE. Thank you so much for your time. Thank you for having me. It was a great pleasure. Good talk. I hope you enjoyed my conversation with Safir today. If you liked this, please share it with others who would be interested in the topic of site reliability engineering or DevOps. For more information about us, please visit us at www.vunetsystems.com